Language Testing

The effects of group members' personalities on a test taker's L2 group
oral discussion test scores
Gary J. Ockey
Language Testing 2009; 26; 161
DOI: 10.1177/0265532208101005

The online version of this article can be found at:

Published by:

Additional services and information for Language Testing can be found at:

Email Alerts:





Downloaded from by Green Smith on April 9, 2009

Language Testing 2009 26 (2) 161 186

The effects of group members’
personalities on a test taker’s L2 group
oral discussion test scores
Gary J. Ockey Utah State University, USA

The second language group oral is a test of second language speaking
proficiency, in which a group of three or more English language learners
discuss an assigned topic without interaction with interlocutors. Concerns
expressed about the extent to which test takers’ personal characteristics
affect the scores of others in the group have limited its attractiveness. This
study investigates the degree to which assertive and non-assertive test takers’
scores are affected by the levels of assertiveness of their group members.
The sample of test takers was Japanese first year university students who
were studying English in Japan. The students took the revised NEO-PI-R
(Costa & McCrae, 1992; Shimanoka et al., 2002), a group oral test, and
PhonePass SET-10 (Ordinate, 2004). Two separate MANCOVA analyses
were conducted, one designed to determine the extent to which assertive test
takers’ scores are affected by the levels of assertiveness of group members
(N = 112), and one designed to determine the extent to which non-assertive
test takers’ scores are affected by the levels of assertiveness of group
members (N = 113). The analyses indicated that assertive test takers were
assigned higher scores than expected when grouped with only non-assertive
test takers and lower scores than expected when grouped with only assertive
test takers, while the study failed to find an effect for grouping based on
assertiveness for non-assertive test takers’ scores. The findings of the study
suggest that when the group oral is used, rater-training sessions should
include guidance on how to evaluate a test taker in the context of the group
in which the test taker is assessed and assign scores that are not based on a
comparison of proficiencies of group members.

Keywords: group oral, oral assessment, personality, speaking test,
speaking proficiency

An increasing amount of literature has emerged on the second language
group oral (Folland & Robertson, 1976; Liski & Puntanen, 1983;

Address for correspondence: Gary J. Ockey, Utah State University, Logan, UT 84322, USA;

© The Author(s), 2009. Reprints and Permissions:

Downloaded from by Green Smith on April 9, 2009

162 Effects of group members’ personalities on test taker’s group

Shohamy et al., 1986; Hilsdon, 1995; Fulcher, 1996; Ockey, 2001;
Van Moere & Kobayashi, 2003; Bonk & Ockey, 2003; Berry, 2004;
Bonk & Van Moere, 2004; He & Dai, 2006; Van Moere, 2006) in
which three or more second language learners discuss a topic with-
out any prompting from or interaction with interlocutors; raters sit
outside the group and assign individual oral ability scores to each
member of the group without participating in the discourse. The
group oral test is an appealing format for a number of reasons. It is
relatively practical (Folland & Robertson, 1976; Ockey, 2001) since
more than one test taker can be assessed at the same time, and raters
do not need specialized training for how to conduct effective inter-
views. The group oral test has the potential of positive washback for
communicative classrooms (Nevo & Shohamy, 1984; Hilsdon, 1995).
That is, since the group oral is designed to simulate group discus-
sions that students might have in the classroom and in the real world,
its use may encourage teachers to employ communicative teaching
practices, such as small group discussions to prepare students for the
test. Test administrations are potentially uniform across raters, since
only the test takers are involved in the discussion – not interviewers,
who have been shown to affect the validity of scores yielded from
oral interviews (Ross & Berwick, 1992; Johnson & Tyler, 1998;
Young & He, 1998; Brown, 2003). Unlike the oral interview and
its variants, which have been criticized for yielding discourse that is
not authentic (van Lier, 1989; Lazaraton, 1996; Kormos, 1999), the
group oral is designed to yield authentic discourse; test takers are
expected to have discussions similar to those they might have both
in the classroom and in the real world. The authenticity of a test task
is important since it provides a critical link between the test results
and the target language use situation, the desired domain of score
interpretation (Bachman & Palmer, 1996). Thus, more authentic test
tasks lead to more valid score interpretations. However, a viable
threat to test takers’ score-based inferences yielded from the group
oral is the personal characteristics of test takers’ group members;
a number of researchers have expressed concern about this threat
(Folland & Robertson, 1976; Bonk & Ockey, 2003; Van Moere &
Kobayashi, 2003; Berry, 2004). To increase understanding in this
area, the present study investigates the extent to which assertive and
non-assertive test takers’ group oral scores are affected by the asser-
tiveness of their group members.

Downloaded from by Green Smith on April 9, 2009

Bachman’s model is presented in Figure 1. Gary J. leads to a candidate’s test score. and Skehan’s (1998) models of oral assessment.sagepub. Ockey 163 I Background 1 Bachman’s model of oral assessment Bachman (2001) proposed a model of oral test performance that was based on Hymes’ (1972) theory of communicative competence and built on Kenyon’s (1992). apart Rater Score Scale Criteria Speaking Speech Performance Sample Task Interactants Examiners Task Qualities Others Task Characteristics Candidate: Ability for Use Underlying Competencies Context Figure 1 Bachman’s (2001) model of oral test performance Downloaded from by Green Smith on April 9. and the speaking performance. embedded in a particular context. task qualities and characteristics. 2009 . speech samples. From a trait theorist point of view (Monte. interactants (examiners and other participants). 1991). McNamara’s (1996). Bachman’s (2001) model specifies that a candidate’s score on a speaking test is dependent on the candidate’s underlying com- petence and ability for use. raters. scale criteria. The interaction of these facets.

these facets are sources of construct irrelevant variance. The importance of task characteristics is indicated by two studies that came to quite opposite conclusions about the validity of the score-based inferences yielded from the group oral. Downloaded from http://ltj. and that which does exist has been by Green Smith on April 9. This is because the personal characteristics of interactants have the potential to affect the speaking performance of the candi- date – indicated by the line connecting interactants and speaking performance in Bachman’s model of oral test performance (Figure 1).. have provided some understanding on some of the oral test performance facets. He and Dai’s results indicate that at least in certain contexts. A few studies. checking for comprehen- sion.g. the validity of the score interpretations yielded from the group oral are suspect. asking each other for opinions) were found. very few instances of communication strategies (e. 2009 . He and Dai (2006) reported research that questions the validity of the score interpretations yielded from the group oral. and 61% of the test takers viewed the examiners as their target audience – not the other members of their group.sagepub. given that many of them reportedly spoke directly to the examiners. a rater could assign a score to the candidate that is affected by factors other than a candidate’s underlying competence or ability for use.164 Effects of group members’ personalities on test taker’s group from the candidate’s underlying competence and ability for use. In either case. compromising the validity of the scores (along with the inferences made from them) resulting from the test. In the study. rather than their group members. 2 Research on the group oral Little research has been conducted on the group oral that indicates the extent to which the various facets of the test lead to valid score interpretations. groups of three or four test takers were given four-and-a-half minutes to speak – probably not enough time for all of them to demonstrate their proficiency in a group discussion. A threat of particular concern to the validity of the scores (and consequently the score-based inferences) yielded from the group oral are the personal characteristics of the interactants with whom a test taker is grouped. and/or a rater’s perception of that speaking performance – would be indicated by a direct line from the interactants box to the rater box in Bachman’s model. test takers were not familiar with the group oral format. however. negotiation of meaning. and apparently. In a discourse analysis of sixty groups of English as a foreign language Chinese university test takers.

suggesting that it was measuring something different than the other tasks. which ranged from 0. pro- vides evidence that the group oral may be relatively difficult to rate reliably and may measure a construct somewhat different than other common tests of oral ability. researchers have investigated the extent to which the dominance or talkativeness of test takers impacts scores. indi- cating that more dominant students were also more proficient. and a reporting task). grammar.60. Prior to taking the group oral. which compared the reliability and validity of the group oral to that of other oral test tasks. were assigned to groups with four members. The researchers reported that test takers could be reliably separated into 2–3 proficiency levels. 0.91. the group oral was 0. Groups of five or six Finish students (N = 698) were given 25 or 30 minutes (5 minutes per member) to discuss an assigned topic. Ockey 165 Based on the results of two separate administrations of an English group oral (N = 1324 and N = 1103) to Japanese university test takers. For instance. lexis. and use were compared to length of discourse. The extent to which this variance (not shared with the other measures of oral ability) was construct relevant or irrelevant. Groups of three or four test takers were given approximately 10 minutes to discuss an assigned topic. was unclear from the study. Bonk and Ockey (2003) concluded that the group oral does have potential for yielding valid score-based inferences. The group oral was also found to have the lowest correlation with the other test scores. Hebrew 12th grade speakers (N = 103).sagepub. Liski and Puntanen (1983) investigated the degree to which more talkative students are more proficient than less talkative students. picture task.76 to 0. who were studying English. Shohamy et al.71. Sixty groups of four Japanese university students were given approximately 10 minutes to discuss a topic assigned to them while two raters assigned scores for Downloaded from http://ltj. Frequency of test taker errors for by Green Smith on April 9. Inter-rater reliability was lower than that of other task types (oral interview. test takers had taken a similar test and were prepared by seeing a video which explained the procedures in the students’ first language. however. Gary J.’s (1986) study. 2009 . Van Moere and Kobayashi (2003) investigated the degree to which test takers’ quantity of speech affects their scores. The study found that more talkative students made fewer mistakes per utterance than less talkative students. A few studies have investigated the effects of test takers’ own personal characteristics on their own group oral scores. as compared to the other tasks.

166 Effects of group members’ personalities on test taker’s group

pronunciation, fluency, grammar, vocabulary, and communicative
effectiveness. Video transcriptions were used to determine the
number of words spoken by each participant. Regression analysis
indicated a positive relationship between number of words spoken
by a test taker and test scores, after controlling for ability level
(based on teacher judgments of the test takers’ proficiencies); most
of the difference in scores was attributable to the communicative
effectiveness category. The use of subjective ratings assigned by
a student’s classroom teacher may, however, be an inappropriate
way to control for ability level, since classroom teachers may have
inflated or deflated conceptions of a student’s ability, depending on
the student’s talkativeness.
In another study based on the same institutional test as the one
used by Van Moere and Kobayashi (2003), Bonk and Van Moere
(2004) investigated the effects of test takers’ shyness on their own
group oral scores. Their study included 1055 test takers who were
grouped into 322 groups of three or four members. Ten Likert scale
items on a four-point scale, which were completed by each student
immediately after the examination, were used to measure students’
levels of shyness. The researchers found that shyness was a small but
significant predictor of scores on the test; the standardized regression
coefficient was −0.14, indicating that the shyer the test taker the
lower the individual’s score on the group oral.
At least one study has addressed the personal characteristics of a
candidate’s group and the effects of these personal characteristics on
a test taker’s score. Berry (2004) investigated the effects of a group’s
level of extroversion on a test taker’s score. Groups of four, five, or
six university students, who agreed to be in the study, read a short
research-based text and then discussed the research and its impli-
cations with the other members. Two trained raters assigned each
member of the group a score for relevance, participation, and articu-
lation. For the discussion, groups were given 5 minutes per test taker
and up to 5 additional minutes based on the discretion of the raters
(V. Berry, personal communication, May 20, 2006). Based on scores
on the Eysenck Personality Questionnaire (Eysenck & Eysenck,
1991), an analysis of covariance (ANCOVA), with a writing score
used to control for possible oral ability differences in groups, indi-
cated that extroverts (n = 78) and introverts (n = 85) were assigned
higher scores on the group oral when placed in groups with a high
mean level of extroversion, and introverts did worse than expected
when placed in groups with a low mean level of extroversion.

Downloaded from by Green Smith on April 9, 2009

Gary J. Ockey 167

Berry’s study raises concerns about the personal characteristics of
test takers’ group members affecting their scores.
The research findings on the group oral are mixed, some indicating
the format can lead to valid score interpretations and others suggest-
ing it does not. Two principles are apparent from the research. First,
the context in which the group oral is used affects its success. For
instance, test takers need to be familiar with the format of the test and
test task characteristics must be appropriate for the testing context.
Second, researchers are concerned that one’s own personal charac-
teristics, as well as the personal characteristics of a test taker’s group
members are potential threats to the validity of group oral scores, and
some research indicates that these concerns are legitimate. However,
there is a very limited number of studies which investigate the extent
to which the personal characteristics of test takers’ group members
affect a test taker’s score. The present research aims to help fill this
void by investigating the effects of group members’ assertiveness
on a test taker’s score. Specifically, the research aims to answer the
following research questions.

3 Research questions

• To what extent is an assertive test taker’s score on the group oral
affected by being tested in groups with: (1) only assertive test
takers, (2) assertive and non-assertive test takers, and (3) only
non-assertive test takers?
• To what extent is a non-assertive test taker’s score on the group
oral affected by being tested in groups with: (1) only non-
assertive test takers, (2) assertive and non-assertive test takers,
and (3) only assertive test takers?
II Methods
1 Participants
a Test takers: The test takers in the study were all Japanese first
year university students (18–19 years old), majoring in English as a
foreign language (N = 225). The test takers were uniform in first lan-
guage background and roughly uniform in amount of exposure to the
target language and education, and had relatively homogeneous cul-
tural backgrounds. Consequently, to a large extent, these variables
were controlled in this study. All test takers completed and signed
consent forms indicating their willingness to participate in the study;

Downloaded from by Green Smith on April 9, 2009

168 Effects of group members’ personalities on test taker’s group

the study was approved by review boards of both a US university
and the participating Japanese university.

b Raters: All 34 raters had native-English-speaking compe-
tence, graduate degrees in English language teaching or a related
field, experience teaching a similar population of students, and
experience rating previous administrations of the group oral. In
addition, two days before the administration of the group oral, they
received training for administering and scoring the test. To begin
the training, a group of four project leaders independently rated
two videotaped groups of test takers (four in each group). Next,
they discussed the ratings and came to an agreement on the scoring.
After the project leaders had agreed on the ratings, they provided
training to the raters. The videotapes of the same two groups of test
takers were shown on a large-screen TV, and raters were asked to
assign scores to each of the eight test takers (four in each group).
The project leaders then handed out score sheets along with expla-
nations for why each test taker had been assigned specific scores.
Finally, the project leaders answered questions about each score
that had been assigned and provided a rationale. Raters were asked
to conform to the already agreed upon ratings. To avoid bias, the
researcher did not participate in the training sessions and was not
one of the raters in the study.

2 Instruments
a The Revised NEO Personality Inventory: The NEO-PI-R is
designed to assess the big five domains of personality: neuroticism,
extraversion, openness, conscientiousness, and agreeableness. Each
of these domains is further sub-divided into six traits referred to as
facets, and there are eight items designed to assess each of these 30
facets. One of the facets of the extroversion domain is assertiveness,
which measures one’s willingness to take the lead. Assertiveness is
defined as having the characteristic of speaking without hesitation
and enjoying being a group leader. For each of the eight assertive-
ness items (and all other items on the test), test takers respond to a
five-point Likert scale with 0 representing strongly disagree; 1, disa-
gree; 2, neutral; 3, agree; and 4, strongly agree. An example item is:
‘In conversations, I tend to do most of the talking.’ Scores on the
eight items are aggregated, leading to a possible score range of 0–32
(Costa & McCrae, 1992). The NEO-PI-R has been translated into

Downloaded from by Green Smith on April 9, 2009

latency of the response. An overall score of between 20 and 80 is assigned. r = 0.84. b PhonePass SET-10: PhonePass SET-101 (Ordinate.sagepub. appropriateness of pauses. such as: the Common European Framework. 2002). Ockey 169 Japanese. The instructions include English sentences. PhonePass SET-10 has been shown to correlate with other validated tests of oral proficiency. the ability to understand spoken English on everyday topics and to respond appropriately at a native-like conversational pace in intelligible English.. Manner of speaking is scored according to speech rate. PhonePass SET-10 scores should therefore function as acceptable measures of oral ability to control for differential oral ability of test takers uncontrolled by random assignment in this study. the New TOEFL Speaking. The scores are weighted equally for content accuracy and manner of speaking. 1 PhonePass is currently referred to as Versant. n = 303 (Ordinate. r = 0. Downloaded from http://ltj. which are subdivided into five areas: read-aloud. 2009 . stress of words. PhonePass SET-10 consists of 54 scored items.88. sentence-build and open-question.. n = 321 (Enright et al. Prior to taking the test. 2004). After reading through the instructions. c The group oral: Groups of four test takers are seated in a small circle and given about one minute to read an assigned prompt and think about how they will respond before one of two raters says. along with examples for how to complete each of the sections. and segmental forms of words. 1997. The test yields computer-generated scores. 2002). 2004) is designed to measure ‘facility in spoken English. and is administered over the telephone. Content accuracy is based on a comparison of the number of correct and incorrect words and word sequences that the test taker uses to the expected response pattern of these words and sequences. that by Green Smith on April 9. and the Test of Spoken English. r = 0. which the test taker is asked to read. a test taker is given an instruction sheet which is written in English and the native language of the test taker. and the Japanese version has been shown to yield valid scores for Japanese university students (Shimanoka et al.88. n = 58 (Ordinate.’ The test takes about 13 minutes. short-answer question. is computer scored. 2004). Gary J. the test taker calls a telephone number and is guided through the test by a recorded examiner’s voice. repeat-sentence.

the researcher administered the Japanese version of the NEO-PI-R to the test takers. To ensure that test takers would complete the NEO-PI-R questionnaire honestly. In addition. familiarity with each other. 2004). 2009 . The two by Green Smith on April 9. grammar. The majority of test takers took the test in quiet offices at the university. each of which had about 24 students. 3 Procedures a Administration of the NEO-PI-R: Beginning approximately three months prior to the administration of the group oral. they are expected to sustain a discussion on the assigned topic for eight min- utes. and English Downloaded from http://ltj. were included in the study.’ Once test takers begin. due to scheduling conflicts and other challenges. The procedures followed those described in the PhonePass SET-10 manual (Ordinate.sagepub. b Administration of PhonePass SET-10: Approximately two weeks prior to the administration of the group oral. sit outside of the group and provide ratings on a nine-point scale for each of five oral communication subscales: pronunciation. vocabulary. however. at the end of the test. All 15 first-year English major class groups at the university.75). Scores were satisfactorily reliable (Cronbach’s alpha = 0. fluency. and communication strategies (Appendix A).170 Effects of group members’ personalities on test taker’s group ‘Would someone like to begin. PhonePass SET-10 was administered to test takers. test takers were told that they would be given feedback on how they might be better language learn- ers based on their personality profiles resulting from the NEO-PI-R. the assessment was built into the classroom curriculum with a lesson on personality. and this response set was not included in the study. Only one test taker indicated that he had not answered the questions honestly. there is a question that asks test takers if they marked the items honestly. As a simple validity check. c Grouping students for the group oral: Test takers were placed in groups of four (a few were placed in groups of three for practical reasons) to take the group oral. Group assignment was based on test takers’ assertiveness. 43 of the 225 test takers included in the study did not take the test. one of the raters ends the test by asking the test takers to stop the discussion. who are not involved in the group’s discussion. The proce- dures followed those described in Costa and McCrae (1992) and the majority of participants took between 30 and 40 minutes to respond to the items. After approximately eight minutes.

55). Gary J. These four grouping types are shown in the top half of Figure 2. following Webb et al. The first grouping factor was scores on the assertiveness facet of the NEO-PI-R.74). and the bottom third as non-assertive (had z-scores of −0. Then. resulting in a possible score range of 0–32. Four grouping types were constructed based on the assertiveness of the test takers: all members were assertive. two or three were assertive and one non- assertive. and all members were non-assertive. The bottom half of Figure 2 indicates groupings for the analyses and is discussed in the Analysis of Data section.65 or lower). test takers that received scores between 12 and 16 (had z-scores between −0.55 or greater). black dots indicate assertive test takers and grey dots indicate non-assertive test takers. The middle third was excluded. non. in order to investigate the extent to which group membership based on assertiveness affects scores on the group oral. Ockey 171 proficiency. assertive n = 26 n = 63 n = 23 assertive assertive n = 25 n = 22 n = 66 Black indicates assertive and grey indicates non-assertive Figure 2 Groupings for group oral test (top) and data analyses (bottom) Downloaded from by Green Smith on April 9. (1998). one was assertive and two or three non-assertive. creating an extreme groups study. test takers that received scores between 17 and 28 (had z-scores of 0. The scores were shown to be satisfactorily reliable (Cronbach’s alpha = 0. All 3 assertive 1 assertive All non- assertive 1 non-assertive 3 non-assertive assertive 7 groups 22 groups 23 groups 7 groups All Majority Minority Minority Majority All non- assertive assertive assertive non.sagepub. the resulting score distribu- tion was equally divided into thirds. the middle third as middle. No test takers received a score of 0 or more than 28. The top third was identified as assertive. test takers that received scores between 1 and 11. 2009 .65 and 0. Scores on the eight Likert scale assertiveness items were aggregated.

0. vocabulary. e Dealing with missing data: Due to scheduling conflicts and other challenges. Stratified random sampling based on the three factors was employed. and this was accom- plished by not assigning test takers to groups which included any of their classmates.67. 0.59. Instructors felt that it was not appropriate for students with considerably different oral abilities and levels of familiarity to be tested together. more and more young people have part-time jobs. 2009 .’ For the group oral. Do you have a part-time job? Why or why not? What factors are important when you select a part-time job? What is the best kind of part-time job to have? Should students have part- time jobs or focus more on their studies?’ Test takers’ scores were based on an average of the two raters’ scores. 0. fluency. ‘high proficient’ test tak- ers were grouped with other ‘high proficient’ test takers and ‘low proficient’ test takers with other ‘low proficient’ test takers. An example prompt was as follows: ‘ by Green Smith on April 9. five prompts were used.60. and scores on each of the group oral subscales were shown to be acceptably reliable.56. 23 of the 112 assertive test takers and 20 of the 113 non-assertive test takers included in the analysis did not take PhonePass SET-10. d Administration of the group oral: Prior to taking the group oral. and test takers were instructed to use any of the questions as a beginning point for discussion. The third grouping factor was designed to control for possible effects due to differential proficiency of group members. test takers viewed a video in their first language that explained how they were expected to complete the test task. test takers were categorized as ‘high proficient’ and ‘low proficient. grammar. Cronbach’s alpha estimates were as follows: pronunciation. and communica- tion strategies. and groups were randomly assigned one of them. 0. The test prompts were written in both English and Japanese and the topics were designed to be relevant to the lives of Japanese university students. For test secu- rity reasons. The latter two grouping factors were based on institutionalized testing procedures. Each prompt had five related questions.65.172 Effects of group members’ personalities on test taker’s group The second grouping factor was designed to diminish possible effects due to familiarity with group members.sagepub. The video showed a group of test takers discussing a topic and explained the criteria on which individual performances were assessed. Given that these scores were used as a covariate Downloaded from http://ltj. 0. Based on results of a four skills English test given prior to the group oral.

f Analysis of data: In the analysis. As can be seen in Figure 2. Some assertive test takers were tested in all assertive groups. The boxes at the bottom of the figure indicate the groupings which were compared in the analyses. the scores of all test takers in these groups were classified into an ‘all assertive’ treatment condition. and .sagepub.2 Best subset regression techniques (Stata. and grammar correlated with listening at . These moderate correlations among the variables would indicate that the imputed values would be reasonable predic- tions of test takers’ expected scores. Ockey 173 in the analysis. The other assertive test takers tested in groups in which they were the only assertive test taker. which were available from a battery of tests given on the same day that the group oral was administered. Other assertive test takers were tested in groups in which there were one or two other assertive test takers and one non-assertive test taker. reading correlated with grammar at . suggesting that the missing values were missing at random.52. it was deemed reasonable to impute values for these 43 missing scores. Non-assertive test takers’ group oral scores were analogously grouped for a separate analysis designed to determine the extent 2 Prior to imputing scores. Gary J. 1985–2005) were used to impute missing values based on measures of listening. .com by Green Smith on April 9. t-tests indicated that there were no statistical differences between test takers’ scores who took PhonePass Set-10 and those who did not. Scores of assertive test takers who tested in this environment were classified into a ‘minority assertive’ treatment condition. test takers’ group oral scores were grouped based on the assertiveness of the test takers in the groups in which they were assessed.42 (reading). and less than 20% of the scores were missing. Downloaded from http://ltj. Tabachnick and Fidell (2001) discuss data imputation principles and procedures. and reading. These scores correlated with PhonePass SET-10 scores of test takers in the study at .54 (listening). 2009 . three environments in which an assertive test taker was assessed and three environments in which a non-assertive test taker was assessed were used in the analysis.44. The correla- tions among the variables used to impute the values also moderately correlated with each other. grammar.55. reading correlated with listening at . Black dots indicate assertive test takers in the testing condition and grey dots indicate non-assertive test takers in the testing condition. Scores of assertive test takers who tested in this environment were classified into a ‘majority assertive’ treatment condition. all other test takers were non-assertive.45 (grammar). In the analysis.

14 (NEO-PI-R) PhonePass SET-10 37.36 Grammar 2.44 1.64 21 59 0. To achieve this purpose. Groupings are shown in Figure 2.33 Vocabulary 2. assertive and non-assertive test takers. vocabulary. The analysis employed Type III sums of squares. 2009 .47 0.44 2.36 –0.50 4.51 0. Max.98 17 28 0.00 0.53 1.36 1.26 Fluency 2.81 by Green Smith on April 9. and communication strategies.59 0.00 4.14 –0.50 0.23 –0.00 4. a one-way multivariate analysis of cov- ariance (MANCOVA) with the general linear command in SPSS 11.00 0.00 0. ‘all assertive’ (n = 26). To ensure that differences in scores were not associated with differences in ability uncon- trolled by random assignment. one or two other members were assertive and one non-assertive.19 Communicative strategies 2. The independent variable was assertiveness of the asser- tive test taker’s group members: other members were all assertive.50 4.00 0. and only non-assertive test takers. ‘majority assertive’ (n = 63). fluency.69 0.18 Group oral Pronunciation 2. PhonePass SET-10 scores were used Table 1 Descriptive statistics for assertive test takers Instruments Mean SD Min.86 0.24 6.36 –0.40 N = 112 Downloaded from http://ltj. grammar. The first purpose of the study was to determine the extent to which an assertive test taker’s score on the group oral is affected by being tested in groups with only assertive test takers.50 2.174 Effects of group members’ personalities on test taker’s group to which the scores of non-assertive test takers are affected by the assertiveness of the group members with whom they are grouped. ‘minority assertive’ (n = 23). III Results 1 Effects of group members’ assertiveness on assertive test takers’ group oral scores The descriptive statistics for the assertive test takers’ scores are presented in Table 1.58 1.50 3. Skew Kurt Assertiveness 19.90 2.33 0. The statistics suggest fairly well-centred and normal distributions.sagepub.5 (1989–2002) was performed on the five dependent variables: pronunciation. and other members were all non-assertive.

and homogeneity of regression slopes were shown to be tenable. F (2. 108) = 9. The results indicated that there was a significant difference for grouping type for each of the group oral subscales. and communicative strate- gies. were as follows: pronunciation. homogeneity of the variance– covariance matrices. pronunciation. vocabulary. when controlling for oral ability (with PhonePass SET-10 scores). F (2. p < . Based on statistical tests and inspections of plots. p < .sagepub. p < .000). as estimated by Eta-squared.000.71.00. The MANCOVA was followed by ANCOVAs on each of the dependent variables in order to determine the extent to which group- ing type affected scores on pronunciation.000.90.0 1. the assumptions of normality. vocab- ulary.02. F ( Pronunciation Fluency Grammar Vocabulary Strategies Oral ability subscales Figure 3 Scores of assertive test takers across grouping types Downloaded from http://ltj. fluency.5 0. fluency. 108) = 10. 0. grammar.19.0 3.000. 108) = 12. The MANCOVA results were significant (Wilks’ Lambda = 0.5 1. All Assertive Majority Assertive Minority Assertive 4. Gary J. The effect sizes. p < . This pat- tern held for all subscales.16). Assertive test takers assigned to an all assertive group were assigned lower scores than assertive test takers assigned to a major- ity assertive group. 208) = 3. F (2.14.0 2. F (2. linearity. p < . lack of singularity and multicollinear- ity. and communicative strategies. by Green Smith on April 9. 2009 .19.91. The mean scores of the groups are compared in Figure 3. 108) = 14. F (2.0 0. p < . who in turn were assigned lower scores than assertive test takers assigned to a minority assertive group. grammar. and the effect size was large (Eta-squared = 0.000.5 3.5 Scores 2. indicating that about 16% of the variance in the scores of assertive test takers on the group oral could be accounted for by the way in which test takers were grouped. 0. Ockey 175 as a covariate of oral ability. 108) = 9.

00 1.45 0.43 Group oral Pronunciation 2. vocabulary. assertive test takers who were grouped with other test takers who were all assertive. 2004). Because there is no such theory. The results showed that all comparisons. This means that the effect of grouping based on assertiveness is separately estimated for each dependent variable. Fisher’s Least Significant Difference (Fisher’s LSD) post hoc tests were conducted to deter- mine which groups were significantly different from each other.10 (NEO-PI-R) PhonePass SET-10 36. ‘Minority assertive’ received scores that were significantly higher than assertive test takers assigned to either of the other grouping types.61 0.3 In the final step of the analysis.45 1. Fisher’s LSD is known to powerfully detect differences in groups (Keppel & Wickens.06 1. except for the vocabulary subscale comparison between the majority assertive and minority assertive groups.62 1.’ and assertive test takers assigned to groups in which they were the only assertive test taker. 0.42 0. Skew Kurt Assertiveness 8.62 –0.50 0.sagepub.20 Fluency 2. ‘All assertive’ and ‘Majority assertive.’ received significantly lower scores than assertive test takers assigned to the other two grouping types.42 –0.75 0.75 3.07 Grammar 2.92 1. ‘All assertive.59 6. which would take into account the correlations of the dependent variables.67 0. 2009 .176 Effects of group members’ personalities on test taker’s group grammar.21 Vocabulary 2.21.52 24 63 0.00 4. That is. Max. Downloaded from http://ltj. ‘Majority assertive’ and ‘Minority assertive.’ 2 Effects of group members’ assertiveness on non-assertive test takers’ group oral scores Descriptive statistics for non-assertive test takers are presented in Table 2.51 1.00 0.95 2.00 0.80 by Green Smith on April 9.75 4.75 4.20 1 11 −0. were significant at the . An alternative analysis would be a hierarchical procedure. 0.44 1. and communicative strategies. Table 2 Descriptive statistics for non-assertive test takers Instruments Mean SD Min.05 level.50 1. 0. However. such a procedure relies heavily on theory that would indicate which of the dependent variables should be entered into the analysis first.50 3.64 0.30 N = 113 3 Separate ANCOVAs do not take into account that the five dependent variables were correlated in the analysis.16.57 0. this procedure was not used in the analysis.48 Communicative strategies 2.14.14 –0.

and 1. The means of the groups can be seen in Figure 4. F (10.sagepub.43. an analogous one-way multivariate analysis of covariance (MANCOVA) was performed on the five dependent variables: pronunciation. respectively. 210) = 0. who are tested in groups. if scores of test takers.85). The MANCOVA revealed no significant difference among the groups (Wilks’ Lambda = 0. The analysis employed Type III sums of squares. the majority of variables had fairly normal distributions based on skewness and kurtosis values. not all of the scores in the analyses are completely independent of each other. for the second analysis. other members were assertive or non-assertive. vocabulary. and communication strategies. Ockey 177 As can be seen.10. PhonePass Set-10.55. may not completely satisfy the assumption of independence. 2004). 2009 . Standard deviations were comparable for each of the groups. ‘minority non-assertive’ (n = 22). ranging from 0. these small departures from normal distributions were not deemed to be of concern in the analysis. ‘all non-assertive’ (n = 25). and other members were all assertive. fluency. which may have been uncontrolled by random assignment. and only non-assertive test takers. To achieve this purpose. It can be seen that means for all groups of non-assertive test takers were similar for each of the subscales.95.4 Thus. p = 0. Gary J.33 to 0. The second purpose of the study was to determine the extent to which a non-assertive test taker’s score on the group oral is affected by being tested in groups with only assertive test takers. Oral ability scores. Dividing the analyses into separate MANCOVAs minimizes the threat of possible effects due to minor violations of the assumption of independence. However. who take the test together are not independent. 1.65. as measured by PhonePass SET-10 were used as a covariate to control for the effects of dif- fering oral ability. assertive and non-assertive test takers.06. and vocabulary variables had slightly peaked distributions – kurtosis values were 1. Scores of individuals. ‘major- ity non-assertive’ (n = 66). and the vocabulary variable was slightly right skewed with a value of by Green Smith on April 9. the assertiveness. thus. Downloaded from http://ltj.48. there 4 A factorial MANCOVA which combined the analyses of assertive and non-assertive test takers’ scores was not conducted to limit the effects of possible violations of independence in the data set. The independent variable was assertiveness of the non- assertive test takers’ group members and contained three treatment levels: other members were all non-assertive. grammar. a design analogous to the one used to answer the first research question was used. Given that analysis of variance is known to be robust to violations of normality when cell sizes are greater than 12 (Keppel & Wickens. As can be seen in Figure 2.

0 0. Given the lack of a significant multivariate F test.0 1.5 0. the raters might have compared a test taker’s performance with the performances of other group members and assigned different scores for similar perform- ances. This would be indicated by an arrow between the Interactants box and the Rater box in Bachman’s model of oral test performance (Figure 1).sagepub.0 3. Assertiveness may have been viewed positively when Downloaded from http://ltj.0 2. 2009 . individual ANCOVAs were not performed.0 Pronunciation Fluency Grammar Vocabulary Strategies Oral ability subscales Figure 4 Scores of non-assertive test takers across grouping types was no evidence that non-assertive test takers’ oral proficiency scores were affected by the way in which they were assigned to groups.178 Effects of group members’ personalities on test taker’s group All Non-assertive Majority Non-assertive Minority Non-assertive 4.5 3. That is. IV Discussion The results of the study indicate that the personal characteristics of a test taker’s group members can affect a test taker’s score on the group oral.5 1. Raters may have rewarded assertive test takers when they were leaders of non-assertive groups and penalized assertive test tak- ers when they tried to lead groups in which all members competed to be leaders. Assertive test takers received higher scores than expected when assessed with non-assertive test takers and lower scores than expected when assessed with assertive test takers. and would suggest the need for this relationship to be added to the by Green Smith on April 9. One possible explanation for these results is that raters may have perceived test takers’ abilities to be different based on the group in which they were assessed.5 Scores 2.

This would indicate an effect between the Interactants box and the Speaking perform- ance box of Bachman’s oral test performance model (Figure 1).com by Green Smith on April 9. For example. non-assertive or introverted test takers may not be willing to speak enough to be assigned a valid score in a low stakes situation. It is also possible that the eight minutes given to groups Downloaded from http://ltj. while in the present study the stakes of the group oral were quite high. Test takers may have intentionally or unin- tentionally changed their performances based on the assertiveness of the members in their groups. in low stakes situations. resulting in lower score assignment. An alternative explanation is that the score difference resulted from differential performance of the assertive test takers. In contrast. the stakes of the group oral were quite low. The findings of this study are quite different from those of Berry (2004). assertiveness is a sub- construct of extroversion and thus the two characteristics would be expected to affect scores similarly. in testing contexts in which the stakes are high. In testing contexts in which the stakes are high. based on the grouping type to which they were assigned. the level of the aggressiveness of the par- ticipation might be lessened to the extent that it is viewed positively by raters. 2009 . Similarly. In Berry’s study. non-assertive or introverted test takers may be motivated by the need to perform well to speak enough to be assigned a valid score. Berry found that both extroverts and introverts received higher scores when placed in groups with a high mean level of extroversion. A possible explanation for the dif- ferent findings in the two studies may be that the stakes of the tests were not the same. This aggressive participation might be viewed negatively by raters. On the other hand. as is predicted by the model.sagepub. Gary J. assertive or extroverted test takers might compete with each other to get a high score. who investigated the effects of a group’s mean level of extroversion on a test taker’s group oral score. Assertive test takers grouped with only non-assertive test takers may have had more opportunities to demonstrate their oral ability. No difference in scores was found in the part of the study which investigated the extent to which non-assertive test takers’ scores are affected by the assertiveness of other members of their group. or used more effective com- munication strategies depending on the grouping type to which they were assigned. Ockey 179 other members of the group were non-assertive. assertive test takers might have spoken more fluently or accurately. while assertive test takers grouped with assertive test takers may have had fewer opportunities to demonstrate their oral ability. while it may have been viewed negatively when all members of the group were asser- tive.

in a one-on-one oral interview. many contextual factors. Given more time. and another should include three assertive test takers and one non- assertive test taker. one grouping type should include all assertive test takers. another should include one assertive test taker and three non-assertive test takers. they were able to force themselves to speak enough during this amount of time to be assigned a valid score.180 Effects of group members’ personalities on test taker’s group to perform the task in the present study was an optimal amount of time for non-assertive test takers. especially groupings that have been shown to affect scores on the group oral. making them aware of possible effects of personal characteristics of group mem- bers may help them assign more valid scores. Downloaded from http://ltj. based on the results of this study. such as their assertiveness. when assigning scores. Rater-training sessions could include showing examples of test takers in various group- ings. It is important that such factors are understood and then taken into account when designing. such as the stakes of the tests and the time allocated for the task. introverts or non-assertive test takers may not be willing to sustain a conver- sation which is representative of their actual abilities. training sessions should include teach- ing raters to avoid comparing the performances of group members because it is possible that the unfair score assignment in this study resulted from such comparisons. will interact in various ways to affect score assignment when the group oral or any other oral assessment is by Green Smith on April 9. 2009 . raters are expected to take into account personal characteristics of the test tak- ers. No doubt. For instance. as was the case in Berry’s study. implementing and assigning scores in the group oral and other oral performance-based assessments. While the extent to which raters would be able to achieve this is not clear. if the group oral is to be used to assess the oral abilities of second language learners. When the group oral is employed.sagepub. Second. rater-training might include guidance on how to take into account important characteristics of the group in which a test taker is assessed when assigning scores. it may be necessary for raters to be able to also take into account some of the personal characteristics of one’s group members. First. The research has at least two implications for rater-training. For instance. Raters should be trained to assign scores that are independent of the scores that they assign to other members of the group by focusing on comparing each test taker’s performance to the rating scale descriptors. In many individual oral performance-based assessments. raters are generally expected to consider a test taker’s topical knowledge and cultural background when asking questions and assigning scores.

First. and straightforwardness as defined by Costa and McCrae (1992). this may have affected the results of the study. such as self-consciousness. activity. 2009 . compliance. Other personal characteristics of group members that warrant investigation are target language proficiency and gender as were investigated by Bonk and Van Moere (2004). social status. Research suggests that cultural groups differ on their mean personality scores in the Big Five domains (Costa & McCrae. Second. given that this study did determine that group membership can affect scores on the group oral.sagepub. How each of these personal characteristics of one’s group members affects one’s group oral scores. The findings indicate that rater- training sessions might include guidance on how to assign scores to test takers given the context of the group in which they are assessed and that are independent of the performances of other members of the group. The study suggests a number of areas for future research. The testing context may have a somewhat different effect on how personal characteristics of group members impact the scores of oth- ers in the group. and age. similar research needs to be conducted using other cultural groups. Fourth. Since an individual’s personality may exhibit itself somewhat differently when speaking a different language. multiple first language backgrounds. Future research should investigate the extent to which possible differential manifestations of test takers’ personalities across languages could impact such findings. 1992).com by Green Smith on April 9. to ensure comprehension of the items. cultural differences. a number of other personal characteristics of test takers’ group members may impact a test taker’s score on the group oral. as well as. altru- ism. consequently. Assertive test takers’ scores were affected by the assertiveness of a test taker’s group members. methods that are less reliant on the assumption of independence of individual Downloaded from http://ltj. while non-assertive test takers’ scores were not found to be affected by the personal characteristics of the test taker’s group members. the extent to which they interact with each other in a particular context are fruitful areas for further research. These characteristics include other personality facets. Gary J. it was necessary for test takers to complete the personality inventory in their native language. given the limited English proficiency of the test takers in the study. Ockey 181 V Conclusions The study found that test takers’ scores on the group oral can be affected by the personal characteristics of a test taker’s group mem- bers. Third. different cultural groups might be more or less assertive in such a group oral testing situation. and.

Fifth. F. Missouri. Given the need for valid practical ways of assessing the oral lan- guage abilities of second language speakers. Hong Kong: Oxford University Press. Kanda University of International Studies. the study raises questions about the degree to which raters can be trained to effectively take into account the varying group contexts in which an individual is assessed when assigning individual scores. Paper presented at the meeting of the American Association of Applied Linguistics – International Language Testing Association (AAAL-ILTA) Symposium. The author is also grateful to the students. Language testing in practice. by Green Smith on April 9. John Schumann. A. F. L. especially Professors Frank Johnson and Michael Torpey. such as hierarchical linear modeling techniques (Raudenbush & Bryk. Bachman. Steven Reise. February: Speaking as a realization of communicative competence. (1996). or the rater’s judgment that changes when comparing the performances of individuals within a group. An understanding of these effects can help guide the development and implementation of appropriately designed and administered group oral assessments. consideration should be given to the use of the group oral. at the host university. Louis. L. S. Downloaded from http://ltj. if this test format is to be used. Acknowledgements The author wishes to thank Professors Lyle Bachman. teachers. this study raises the question of the extent to which it is the test taker’s performance that changes when being tested with different individuals. 2009 .sagepub. However. The study was partially funded by the Educational Testing Service and the PhonePass Set-10 tests were provided by Ordinate. James Ockey. & Palmer. more needs to be understood about how group membership can affect scores in the contexts in which the test is administered. 2002). VI References Bachman. St. should be explored for analyzing group oral data. and Noreen Webb for their feedback on the paper. (2001). and administrators. Research on these latter two issues would most likely provide further guidance on how to train raters effectively for the group oral.182 Effects of group members’ personalities on test taker’s group scores. William Bonk.

edi- tors. Downloaded from http://ltj. A. and gender on individual scores. 225–246. P. Gary J. Kormos. J. W. On communicative competence. 30. California. March). English Language Teaching Journal. 16. (2004. 163–188. Testing tasks: Issues in task design and the group oral. A. 89–110. Ockey 183 Berry. Odessa. Paper presented at the annual meeting of the Language Testing Research Colloquium. editors. (1976). G. 23. NJ: Pearson-Prentice Hall. Middlesex: Penguin. Language Testing. Eysenck. J. Temecula. Towards objectivity in group oral testing. Paper presented at the annual meeting of the Language Testing Research Colloquium. A many-facet Rasch analysis of the second language group oral discussion task. Brown. (1999). V. Liski. Lazaraton. Upper Saddle River. Costa. (1996). 23–51. Hertfordshire: Prentice Hall International. S. (1991). & Puntanen. 27–51). J.. Hilsdon. Keppel. (1992. Simulating conversations in oral-proficiency assessments: A conversation analysis of role plays and non-scripted interviews in lan- guage exams. Vancouver. J. & Robertson. In Alderson. 156–167. & Tyler. & Holmes. G. E. H. L2 group oral testing: The influ- ence of shyness/outgoingness. (1972). (1998). Johnson. J. (1996). Language Testing. L. 269–293). London: Hodder. Measuring second language performance. McNamara. He. 13. & Van Moere. T. A study of the interaction between individual personality differences and oral performance test facets. Harmondsworth. Kenyon. Design and analysis: A researcher’s hand- book. D. (2004). Hymes. match of interlocutors’ proficiency level. 33. W. Language Testing. Interlocutor support in oral proficiency interviews: The case of CASE. editors. PA: John Benjamins.sagepub. Bonk. (1983). London: Longman. 370–402. (2004). 1–25. & North. Language Testing. fourth edition. & by Green Smith on April 9. & Wickens. Folland. The group oral exam: Advantages and limitations. Language Testing. J. In Pride. (1995). 189–197). 20. Interviewer variation and the co-construction of speaking proficiency. D. F. King’s College. Y. D. Sociolinguistics: Selected readings (pp. J. D. W. R. (1996). Canada. B. Philadelphia.. J. Fulcher. Development and use of rating scales in lan- guage testing. Bonk. Re-analyzing the OPI: How much does it look like natural conversation? In Young. Language Testing. Language testing in the 1990s: The communicative legacy (pp. A. 151–172. (2003). Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. Eysenck personality questionnaire – Revised. 2009 . M. & Eysenck. Revised NEO personality inventory (NEO- PI-R) and NEO five–factor inventory (NEO-FFI) professional manual. February). A study of the statistical foundations of group conversation tests in spoken English. & Ockey.. A. S. G. Unpublished doctoral dis- sertation. T. (1992). Language Learning. (2006). A corpus-based investigation into the validity of the CET-SET group discussion. R & He. FL: Psychological Assessment Resources. University of London. (2003). 20. 13. & Dai.

Ockey. 4. M.. Raudenbush. fourth edition. A. 212–220. Louisiana. A. G. S. (1998). Language proficiency interviews: A discourse approach. London: Sage Publications. Shimanoka. TX: StataCorp. Shimanoka. Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. Applying the joint committee’s evaluation standards for the assessment of alternative testing methods. editors. W. E. N. performance. Skehan. (2001). (1984). Young. 6. Introducing a new comprehen- sive test of oral proficiency. W. New Orleans. (1985–2005). PA: John Benjamins. (1997). (1992). Computer software. Is the oral interview superior to the group oral? Working Papers on Language Acquisition and Education. Using multivariate statistics. 23. & Fidell. Validity evidence in a group oral test. & He. Reves. (2004). A. second edition. Retrieved February 25.. S. SET-10 test description validation summary. (1991). K. NEO-PI-R NEO-FFI manual. (1986). Van Moere. instruction. and fainting in coils: Oral proficiency interviews as conversation.184 Effects of group members’ personalities on test taker’s group Monte. 159–176.. TESOL Quarterly.. drawling. Boston. Processing perspectives to second language development. M. E. Gondo. Tabachnick. The Japanese Journal of Personality. (1998). M. Nevo. & Berwick. Reeling. Thames Valley Working Papers in Applied Linguistics. (2002). Stata 8. J. Shohamy. Tokyo: Tokyo Shinri. M. SPSS 11. 22–41. ELT Journal. (1989–2002). International University of Japan. Fort Worth.. & Sugrue. from www. Van Moere. K. Paper pre- sented at the Annual Meeting of the American Educational Research Association. Equity issues in collaborative group assessment: Group composition and performance. Nakazato. & Bryk. (ED. 70–88. 2009 . (2001). In Young. 138–147. TX: Harcourt Brace College.. Language Testing. Y. & Kobayashi. 23. 35. American Educational Research Journal. Ordinate. by Green Smith on April 9. Reading. Hierarchical linear models: Applications and data analysis methods. 489–508. writhing. & Takayama. (1989). Philadelphia. van Lier. Y. K.2 for Windows. 411–440. Webb.5 for Windows. W. (2002). L. MA: Allyn & Bacon. Construction and factorial validity of the Japanese NEO-PI-R. SPSS. (2006). 607–651. S. & Shohamy. UK. and assessment.. 11. R. & Takayama. (1998).ordinate. Gondo. Nemer. 14. 2005. R. C. 1–24). & Bejarano. 40. D. B. (2003.sagepub. The discourse of accommodation in oral proficiency interviews. Ross. W. Y. Studies in Second Language Acquisition. Beneath the mask: An introduction to theories of personality. R & He. July). 243 934). Downloaded from http://ltj. B. L. M. College Station. Chizhik. Computer software. Who speaks most in this group? Does that matter? Paper presented at the annual meeting of the Language Testing and Research Colloquium.. Nakazato. stretching. P. A. Y.

NSs taught in junior high conversation-like way 0. “I agree with you. produces does not blend for words and long an opinion clearly.0 May not have mastered Speech is hesitant. Relies mostly on simple Generally has Responds to others some difficult sounds of some groping for (but appropriate) enough lexis for without long pauses to Downloaded from http://ltj. Fragments of speech Does not use any Shows knowledge Shows no awareness uses Japanese that are so halting that discernible grammatical of only the simplest of other speakers.5 rhythm. cannot interaction. words are not would not think person school or beginning blended together had virtually no English high school 1. may are pronounced in communication with a no attempt at complex limited words used say. Doesn’t have enough Lexis not adequate Does not initiate like pronunciation. Slow strained speech. too nervous to interact effectively 2. they unnatural pauses. 2009 English. shows words together. constant groping grammar to express for task. makes frequent errors.5 explanation. may katakana-like conversation is not morphology words and phrases speak.5 some attempts to blend communication grammar is attempted but particular knowledge others’ opinions words completely may be inaccurate of vocabulary (continued) Gary J. parallel awareness structures. Appendix A: Rating scales for group oral Pronunciation Fluency Grammar Vocabulary Communication strategies • pronunciation • automatization • use of morphology • range of • interaction • intonation • fillers • complexity of syntax vocabulary • confidence • word blending • speaking speed (relativization. complex not demonstrate any disagreement between 2. express opinion monologue only.sagepub.0 Somewhat katakana. but not in a phonology and really possible. embedded • conversational clauses. mostly understandable spaces are present but morphosyntax to express opinion but does shows agreement or to a naïve NS. properly with the some turn-taking. connectors) 0 Very heavy accent. Ockey 185 . but would be words and unfilled grammar. makes generally don’t impede meaning. has enough expressing some maintain by Green Smith on April 9.” isolation NS would be difficult grammar but not relate ideas in 1.

sound system of still not be quick are only in late-acquired vocabulary shows ability to negotiate English. Downloaded from by Green Smith on April 9. may evidence of responds appropriately not mastered the words but speech may make errors but they some advanced to others’ opinions. may make of vocabulary views. 2009 186 Effects of group members’ personalities on test taker’s group . Uses both simple and Shows evidence Confident and natural. good but has still rarely gropes for complex grammar. interacts smoothly English grammar Note: A test taker who consistently meets the criteria at a level is assigned the score at the bottom of the box while a test taker who sometimes fails to meet the criteria is assigned a score at the bottom of the box.sagepub. Shows ability to use some Shows some Generally confident. complex grammar of a wide range asks others to expand on intonation.0 Pronunciation is May use some fillers. has shows ability to speak effectively. shows how own practically mastered quickly in short bursts occasional errors but they knowledge and others’ ideas are the sound system of are only in late-acquired related.5 comprehension. accent does grammar meaning quickly and not interfere with relatively naturally 3. pronunciation and uses fillers effectively. can blend words 4 Speaks with excellent Excellent fluency. Appendix A: Rating scales for group oral (continued) 3. Downloaded from by Green Smith on April Permissions: http://www.sagepub.sagepub.1177/0265532208101010 The online version of this article can be found at: 26. 2009 .com/subscriptions Reprints: Additional services and information for Language Testing can be found at: Email Alerts: An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach Youn-Hee Kim Language Testing Subscriptions: http://ltj. 187 DOI: Published by: http://www.nav Citations http://ltj. Language Testing http://ltj.sagepublications.

A qualitative analysis demonstrated that the judgments of the NS teachers were more detailed and elaborate than those of the NNS teachers in the areas of Keywords: mixed methods. However. University of Toronto.. and evaluation criteria.utoronto. email: younkim@oise.sagepub. © The Author(s). Ont. and the accuracy of transferred information. The evaluation behaviors of two groups of teachers (12 Canadian NS teachers and 12 Korean NNS teachers) were compared with regard to internal consist- ency. Canada This study used a mixed methods research approach to examine how native English-speaking (NS) and non-native English-speaking (NNS) teachers assess students’ oral English performance. with only one or two inconsistent raters in each group. Modern Language Center. oral English performance assessment. Reprints and Permissions: http://www. M5S The two groups of teachers also exhibited similar severity patterns across different tasks. specific gram- mar use. 2009 . many-faceted Rasch Measurement In the complex world of language assessment. Canada.sagepub. Results of a Many-faceted Rasch Measurement analysis showed that most of the NS and NNS teachers maintained acceptable levels of internal consistency..1177/0265532208101010 Downloaded from by Green Smith on April 9. the presence of raters is one of the features that distinguish performance assessment from traditional assessment.nav DOI:10. While scores in traditional fixed response assessments (e. 252 Bloor Street West. NS and NNS. severity. Ontario Institute for Studies in Education. multiple-choice tests) are elicited solely from the interaction between test-takers and tasks. it is possible that the final scores awarded by a rater could be affected by variables Address for correspondence: Youn-Hee Kim. sub- stantial dissimilarities emerged in the evaluation criteria teachers used to assess students’ performance. Toronto. These findings are used as the basis for a discussion of NS versus NNS teachers as language assessors on the one hand and the usefulness of mixed methods inquiries on the other. – Language Testing 2009 26 (2) 187 217 An investigation into native and non-native teachers’ judgments of oral English performance: A mixed methods approach Youn-Hee Kim University of Toronto.

Lowenberg. in particular. in line with the global spread of English as a lingua franca. 2003. 2000.sagepub. 2006). and ‘localization’ of the language has occurred in ‘outer circle countries’ such as China. Lowenberg. investigates how native English-speaking (NS) and non-native English-speaking (NNS) teachers evaluate students’ oral English performance in a classroom setting. However. 2002). 2000). 1992. 2006). Korea.188 An investigation into native and non-native teachers’ judgments inherent to that rater (McNamara. 1985. and it is therefore unsurprising that large-scale. Indeed. These developments suggest that new avenues of oppor- tunity may be opening for non-native English speakers as lan- guage assessors. Use of a rater for perform- ance assessment therefore adds a new dimension of interaction to the process of assessment. A mixed methods approach will be utilized to address the following research questions: 1) Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students’ oral English perform- ance? 2) Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students’ oral English performance? 3) How do NS and NNS teachers differ in drawing on evalu- ation criteria when they comment on students’ oral English performance? Downloaded from http://ltj. The increasing interest in rater variability has also given rise to issues of eligibility. non-native English speakers outnumber native English speakers internationally (Crystal. Japan. 2009 . the current status of English as a language of international com- munication has caused language professionals to reconsider whether native speakers should be the only acceptable standard (Taylor. Lowenberg. 1997. the question of whether native speakers should be the only ‘norm maker[s]’ (Kachru. The normative system of native speakers has long been assumed in English proficiency tests (Taylor. 1985) in language assessment has inspired heated debate among language professionals. Jenkins. 1996). and Russia (Kachru. 2000). Graddol. This study. and makes monitoring of reliability and validity even more crucial. high-stakes tests such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS) rendered their assessments using native English-speaking ability as a benchmark ( by Green Smith on April 9. 1997. 2003.

1995). ‘there is little evidence that native speakers are more suitable than non-native speakers … However. She found that teachers tended to be more severe than non-teachers as far as linguistic ability was concerned. but that there were no significant differences in such areas as compre- hensibility. 1989. This may be explained by their use of different native languages. 1995. comparing three differ- ent rater groups (i. 1987. Youn-Hee Kim 189 I Review of the literature A great deal of research exploring rater variability in second language oral performance assessment has been conducted. Chalhoub-Deville sug- gested that the discrepant findings of the two studies could be due to the fact that her study focused on modern standard Arabic (MSA). For example. and body language. the way in which they perceive Downloaded from http://ltj. this was somewhat at odds with Brown’s (1995) study which found that while native speakers tended to be more severe than non-native speakers. 1995. small rater samples. Chalhoub-Deville. and that pronunciation and hesitation were the most distracting factors for both sets of raters. Another line of research has focused on raters’ different lin- guistic backgrounds. Chalhoub-Deville.sagepub.. In general. personality. in a study of raters’ professional backgrounds. found that teachers attended more to the creativity and adequacy of information in a narration task than to linguistic features. Chalhoub- Deville & by Green Smith on April 9. whereas Hadden’s study focused on English. Fayer & Krasinski. The results showed that non-native raters tended to be more severe in general and to express more annoyance when rating linguistic forms. Fayer and Krasinski (1987) examined how the English-speaking performance of Puerto Rican students was perceived by native English-speaking raters and native Spanish- speaking raters. Brown concluded that. and non-teaching native Arabic speakers living in Lebanon). Brown. Hadden (1991) investigated how native English-speaking teachers and non-teachers perceive the competence of Chinese students in spoken English. Hadden. Galloway. 1995. 2009 . and different methodologies (Brown. 1980. 1991). but the outcomes of some studies contradicted one another. non-teaching native Arabic speakers living in the USA. with a number of early studies focusing on the impact of raters’ different backgrounds (Barnwell. However. native Arabic-speaking teachers living in the USA. 2005. Chalhoub-Deville (1995). on the other hand.e. the difference was not significant. social acceptability. teachers and non-native speakers were shown to be more severe in their assessments than non-teachers and native speakers.

Galloway. In some early studies (e. however. one of which was presented in English and the other in Spanish. Barnwell suggested that both studies were small in scope. MANOVA results indicated significant variability among the different groups of native English-speaking teachers across all three tasks. Comparing native and non- native Spanish speakers with or without teaching experience. Chalhoub-Deville and Wigglesworth (2005) inves- tigated whether native English-speaking teachers who live in dif- ferent English-speaking countries (i. Downloaded from http://ltj. 13).com by Green Smith on April 9. 2) picture- based narration. Studies of raters with diverse linguistic and professional back- grounds have also been conducted. with teachers resid- ing in the UK the most severe and those in the USA the most lenient across the board. 2009 . Galloway (1980) found that non-native teachers tended to focus on grammatical forms and reacted more negatively to non- verbal behavior and slow speech.e. the UK. and the USA) exhibited significantly different rating behaviors in their assessments of students’ performance on three Test of Spoken English (TSE) tasks – 1) give and support an opinion.. This result conflicts with that of Galloway (1980). and that it was therefore premature to draw conclusions about native speak- ers’ responses to non-native speaking performance.01) suggested that little difference exists among different groups of native English-speaking teachers. Fayer & Krasinski. who found that untrained native speakers were more lenient than teachers. and cognitive strategies.. while non-teaching native speakers seemed to place more emphasis on content and on sup- porting students’ attempts at self-expression. Conversely. Although the above studies provide some evidence that raters’ linguistic and professional backgrounds influence their evaluation behavior. Barnwell (1989) reported that untrained native Spanish speakers provided more severe assessments than an ACTFL-trained Spanish rater. and 3) presentation – which require different linguistic. Canada. the very small effect size (η2 = 0. First.190 An investigation into native and non-native teachers’ judgments the items (assessment criteria) and the way in which they apply the scale do differ’ (p. Hill (1997) further pointed out that the use of two different versions of rating scales in Barnwell’s study.sagepub. 1980. most extant studies are not grounded in finely tuned methodologies. 1987. Australia. further research is needed for two reasons. remains questionable. functional. One recent study of rater behavior focused on the effect of country of origin and task on evaluations of students’ oral English performance.g.

However. which in turn might have elicited unknown or unexpected rating by Green Smith on April 9. ix). evaluative com- ments) in their assessment of students’ oral English performance (Greene et al. examining both the product that the teachers generate (i. The expansion design was considered particularly well suited to this study because it would offer a comprehensive and diverse illustration of rating behavior. The complementarity design was included to Downloaded from http://ltj. Also. 1989). Caracelli.e. to my knowledge. having raters assess only one type of speech sample did not take the potential systematic effect of task type on task performance into consideration. known as the ‘third methodological movement’ (Tashakkori & Teddlie.. 1989 for a review of mixed methods evaluation designs). which are most commonly used in empirical mixed methods evaluation studies (see Greene et al.. The mixed methods approach of the present study seeks to enhance understanding of raters’ behavior by investigating not only the scores assigned by NS and NNS teachers but also how they assess students’ oral English performance. Youn-Hee Kim 191 Hadden. & Graham. 2003.. preventing researchers from probing research phenom- ena from diverse data sources and perspectives. incorporates quantitative and qualitative research methods and techniques into a single study and has the potential to reduce the biases inherent in one method while enhancing the validity of inquiry (Greene. Had the task types var- ied.sagepub. the numeric scores awarded to stu- dents) and the process that they go through (i. all previous studies that have examined native and non-native English-speaking raters’ behavior in oral language per- formance assessment have been conducted using only a quantitative framework. 1989). II Methodology 1 Research design overview The underlying research framework of this study is based on both expansion and complementarity mixed methods designs.. 1991). A mixed methods approach. p. 2009 . raters were simply asked to assess speech samples of less than four minutes’ length without reference to a carefully designed rating scale. Second. raters could have assessed diverse oral language output.e. no previous studies have attempted to use both quantitative and qualitative rating protocols to investigate differences between native and non-native English-speaking teach- ers’ judgments of their students’ oral English performance.

In order to ensure that the teachers were sufficiently qualified. 2009 . 2 Participants Ten Korean students were selected from a college-level language institute in Montreal. so that the student sample would include differing levels of English profi- ciency. speaking. 2007). certain participation criteria were outlined: 1) at least one year of prior experience teaching an English conversation course to non- native English speakers in a college-level language institution. 1989). Intramethod mixing. up to Level V for students with the highest English proficiency. Twelve native English-speaking Canadian teachers of English and 12 non-native English-speaking Korean teachers of English constituted the NS and NNS teacher groups. For the teacher samples. The students were drawn from class levels ranging from beginner to advanced.. reading. The language institute sorted students into one of five class levels according to their aggregate scores on a placement test meas- uring four English language skills (listening. and were informed about the research project and the test to participate in the study. was the selected guiding procedure. and 3) high proficiency in spoken English for Korean teachers of English. a concurrent mixed methods sampling procedure was used in which a single sample produced data for both the quantitative and qualitative elements of the study (Teddlie & Yu. 2003).192 An investigation into native and non-native teachers’ judgments provide greater understanding of the NS and NNS teachers’ rating behaviors by investigating the overlapping but different aspects of rater behavior that different methods might elicit (Greene et by Green Smith on April 9. respectively. and writing): Level I for students with the lowest English proficiency. The same weight was given to both quantitative and qualitative methods. with neither method dominating the other. in which a single method concurrently or sequentially incorporates quantitative and qualitative components (Johnson & Turner. 2) at least one graduate degree in a field related to linguistics or language education. Canada. Table 1 shows the distribution of the student sample across the five class levels. Teachers’ background information Table 1 Distribution of students across class levels Level I II III IV V Number of students 1 1 3 3 2 Downloaded from http://ltj.sagepub.

The purpose of the test was to assess the overall oral communicative language ability of non-native English speakers within an aca- demic context. [T3]). and describing a graph of human life expectancy (Task 7. [T7]). 3 Instruments A semi-direct oral English test was developed for the study. nine NS and eight NNS teachers reported having taken graduate-level courses specifically in Second Language Testing and Evaluation. and four NS and one NNS teacher had been trained as raters of spoken English. Finally. target language tasks. [T1]). Throughout the test.sagepub. 2009 . An effort was also made to select test tasks related to hypothetical situations that could occur within an academic context. as well as task difficulty and interest. In addition. and task characteristics (Bachman & Palmer. the guiding principles of the Simulated Oral Proficiency Interview (SOPI) were referenced. In developing the test. and topic-based. [T2]). Youn-Hee Kim 193 was obtained via a questionnaire after their student evaluations were completed: all of the NNS teachers had lived in English- speaking countries for one to seven years for academic purposes. and their self-assessed English proficiency levels ranged from advanced (six teachers) to near-native (six teachers). situation-based. [T4]). explaining the library services based on a provided informational note (Task 2. the topic-based task required test-takers to offer Downloaded from http://ltj. 1996). narrating a story from six sequential pictures (Task 4. 1996).com by Green Smith on April 9. The test tasks were selected and revised to reflect potential test-takers’ language proficiency and topical knowledge. The picture-based task required test-takers to describe or narrate visual information. The test consisted of three different task types in order to assess the diverse oral language output of test-takers: picture- based. communicative language abil- ity would be evidenced by the effective use of language knowledge and strategic competence (Bachman & Palmer. Initial test development began with the identification of target language use domain. such as describing the layout of a library (Task 1. none of the NNS teachers indicated their self-assessed English proficiency levels at or below an upper-intermediate level. such as congratulating a friend on being admitted to school (Task 3. The situation-based task required test-takers to perform the appropriate pragmatic function in a hypothetical situation.

Because this study aimed to investigate how the teachers com- mented on the students’ oral communicative ability and defined the evaluation criteria to be measured. the rating scale did not provide teachers with any information about which evaluation features to draw on. it has consistently been reported that scores from indirect speaking tests have a high correlation with those from direct speaking tests (Clark & Swinton. [T5]).sagepub. 1995). 2009 . 1992a. the six levels describing the degree of successfulness of communication proved to be indis- tinguishable without dependence on the adjacent levels. however. & Antonia. 1992b). 1980. Although the lexical density produced in direct speaking tests and indirect speaking tests have been found to be different (O’Loughlin. teachers who participated in the trials did not use all six levels of the rating scale in their evaluations. 1979. The test was administered in a computer-mediated indirect interview format. The rating scale only clarified the degree of commu- nicative success without addressing specific evaluation criteria. Ulsh. To deal with cases in which teachers ‘sit on the fence’. In order to effectively and economically facilitate an understanding of the task without providing test-takers with a lot of vocabulary (Underhill. Downloaded from http://ltj. Moreover. each task was accompanied by visual stimuli. A response of ‘I don’t know’ or no response was automatically rated NR (Not Ratable). six levels were set as the upper limit during the initial stage of the rating scale development. 3. 2. O’Loughlin. the rating scale was trimmed to four levels. an even number of levels was sought in the rating scale. It had four levels labeled 1.194 An investigation into native and non-native teachers’ judgments their opinions on a given topic. such as explaining their personal preferences for either individual or group work (Task 5. enabling the teachers to consistently distinguish each level from the others. 8 of which were allotted for responses. [T6]). Kenyon. The test lasted approximately 25 minutes. discussing the harmful effects of Internet use (Task 6. in order not to cause a cognitive and psycho- logical overload on the teachers. A four-point rating scale was developed for rating (see Appendix A). and 4. Doyle. Throughout the by Green Smith on April 9. The indirect method was selected because the intervention of interlocutors in a direct speaking test might affect the reliability of test performance (Stansfield & Kenyon. Stansfield. Paiva. and suggesting reasons for an increase in human life expectancy (Task 8. For these reasons. 1995. More importantly. [T8]). 1987). 1990).

and the comparison of different teacher groups was not explicitly mentioned. stopping.. While the NS teachers were asked to write comments in English. which had two phases: 1) rating the students’ test responses according to the four-point rating by Green Smith on April 9. Meetings with the NS teachers were held in Montreal. current visa status. Each meeting lasted approximately 30 minutes. and meetings with the NNS teach- ers followed in Daegu. the NNS teachers were asked to write comments in Korean (which were later translated into English). awarded by 24 teach- ers to 80 sample responses by 10 students on eight tasks. 5 Data analyses Both quantitative and qualitative data were collected. Korea. In addition. Each Downloaded from http://ltj. The two groups of teachers were therefore unaware of each other. but that it would help to iden- tify the construct being measured. and their speech responses were simultaneously recorded as digital sound files. Youn-Hee Kim 195 4 Procedure The test was administered individually to each of 10 Korean students. To decrease the subject expectancy effect.sagepub.727 valid ratings. education level. The teachers rated and commented on 80 test responses (10 students × 8 tasks). and then 12 of the possible test response sets were distrib- uted to both groups of teachers.e. and 2) justifying those ratings by providing written comments either in English or in Korean. a minimum amount of infor- mation about the students (i. The rationale for requiring teachers’ comments was that they would supply not only the evaluation criteria that they drew on to infer students’ oral proficiency. they justified their ratings by writing down their reasons or comments. A meeting was held with each teacher in order to explain the research project and to go over the scoring procedure. and replaying of test responses and to listen to them as many times as they wanted. etc. After rat- ing a single task response by one student according to the rating scale. the teachers were told that the purpose of the study was to investigate teachers’ rating behavior. Canada. The order of the students’ test response sets was randomized to minimize a potential ordering effect. The quanti- tative data consisted of 1.) was provided to the teachers. They then moved on to the next task response of that student. The teachers were allowed to control the playing. 2009 .

teacher group. 2005). Version 3.0 (Linacre. 1989) was used to analyze quantitative ratings. and the Partial Credit Model to teacher groups. 2009 . Three different types of statistical analysis were carried out to investigate teachers’ internal consistency. the different methods tended to remain distinct throughout the study. A hybrid Many-faceted Rasch Measurement Model (Myford & Wolfe. one teacher failed to make one rating. 1993) guided the analysis of qualitative written comments. a Quantitative data analysis: The data were analyzed using the FACETS computer program. Four facets were specified: student. In addition.sagepub. 2004a) was used to differentially apply the Rating Scale Model to teachers and tasks. Since the nature of the component designs to which this study belongs does not permit enough room to combine the two approaches (Caracelli & Greene. so that the data matrix was fully crossed. and to minimize any bias that is inherent to a particular analysis.196 An investigation into native and non-native teachers’ judgments teacher rated every student’s performance on every task. and typology development and data transformation (Caracelli & Greene. The multiple analyses were intended to strengthen the validity of inferences drawn about raters’ internal consistency through converging evidence. Downloaded from http://ltj. A rating of NR (Not Ratable) was treated as missing data. 2) proportions of large standard residuals between observed and expected scores (Myford & Wolfe. 1997). and 3) a single rater–rest of the raters (SR/ROR) correlation (Myford & Wolfe. and task. and 3) a bias analysis between individual teachers and tasks.57. The quantitative and qualitative research approaches were integrated at a later stage (rather than at the outset of the research process) when the findings from both methods were interpreted and the study was concluded. Teachers’ severity measures were also examined in three different ways based on: 1) task difficulty measures. 2) a bias analysis between teacher groups and tasks. 2004a).295 written com- ments. Figure 1 summarizes the overall data analysis procedures. The teacher group facet was entered as a dummy facet and anchored at zero. Both types of data were analyzed in a concurrent manner: a Many-faceted Rasch Measurement (Linacre. based on: 1) fit statis- tics. 2000). there were eight such cases among the 80 speech samples. The qualitative data included by Green Smith on April 9. teacher.

Youn-Hee Kim 197

Fit statistics

Teachers’ internal
Proportions of large
standard residuals

Single rater–rest of the
raters (SR/ROR) correlation

Task difficulty measures

Teachers’ severity Bias analysis (group

Bias analysis (individual

Typology development
3,295 Teachers’
written evaluation criteria
comments Data transformation
(quantification of
evaluation features) for

Figure 1 Flowchart of data analysis procedure

b Qualitative data analysis: The written comments were analyzed
based on evaluation criteria, with each written comment constituting
one criterion. Comments that provided only evaluative adjectives
without offering evaluative substance (e.g., accurate, clear, and so
on) were excluded in the analysis so as not to misjudge the evalua-
tive intent. The 3,295 written comments were open-coded so that the
evaluation criteria that the teachers drew upon emerged. Nineteen
recurring evaluation criteria were identified (see Appendix B for
definitions and specific examples). Once I had coded and analyzed
the teachers’ comments, a second coder conducted an independ-
ent examination of the original uncoded comments of 10 teachers
(five NS and five NNS teachers); our results reached approximately
95% agreement (for a detailed description about coding procedures,

Downloaded from by Green Smith on April 9, 2009

198 An investigation into native and non-native teachers’ judgments

see Kim, 2005). The 19 evaluative criteria were compared across the
two teacher groups through a frequency analysis.

III Results and discussion
1 Do NS and NNS teachers exhibit similar levels of internal
consistency when they assess students’ oral English performance?
To examine fit statistics, the infit indices of each teacher were
assessed. Teachers’ fit statistics indicate the degree to which each
teacher is internally consistent in his or her ratings. Determining
an acceptable range of infit mean squares for teachers is not a
clear-cut process (Myford & Wolfe, 2004a); indeed, there are no
straightforward rules for interpreting fit statistics, or for setting
upper and lower limits. As Myford and Wolfe (2004a) noted,
such decisions are related to the assessment context and depend
on the targeted use of the test results. If the stakes are high, tight
quality control limits such as mean squares of 0.8 to 1.2 would
be set on multiple-choice tests (Linacre & Williams, 1998);
however, in the case of low-stakes tests, looser limits would be
allowed. Wright and Linacre (1994) proposed the mean square
values of 0.6 to 1.4 as reasonable values for data in which a rating
scale is involved, with the caveat that the ranges are likely to vary
depending on the particulars of the test situation.
In the present study, the lower and upper quality control limits
were set at 0.5 and 1.5, respectively (Lunz & Stahl, 1990), given
the test’s rating scale and the fact that it investigates teachers’ rating
behaviors in a classroom setting rather than those of trained raters
in a high-stakes test setting. Infit mean square values greater than
1.5 indicate significant misfit, or a high degree of inconsistency in
the ratings. On the other hand, infit mean square values less than
0.5 indicate overfit, or a lack of variability in their scoring. The
fit statistics in Table 2 show that three teachers, NS10, NNS6, and
NNS7, have misfit values. None of the teachers show overfit rating
Another analysis was carried out based on proportions of large
standard residuals between observed and expected scores in order
to more precisely identify the teachers whose rating patterns dif-
fered greatly from the model expectations. According to Myford
and Wolfe (2000), investigating the proportion to which each rater is
involved with the large standard residuals between observed scores

Downloaded from by Green Smith on April 9, 2009

Youn-Hee Kim 199
Table 2 Teacher measurement report

Teacher Obsvd Fair–M Measure Model Infit Outfit PtBis
average average (logits) S.E. MnSq MnSq

NS10 2.9 2.78 −0.60 0.20 1.51 1.37 0.56
NNS10 2.9 2.74 −0.52 0.20 1.26 1.21 0.58
NNS11 2.8 2.63 −0.29 0.19 1.09 0.94 0.55
NNS1 2.7 2.52 −0.07 0.19 0.85 0.74 0.57
NS9 2.7 2.43 0.11 0.19 1.34 1.43 0.51
NS5 2.6 2.37 0.23 0.19 1.07 1.28 0.53
NNS9 2.6 2.35 0.26 0.19 1.29 1.46 0.50
NS12 2.6 2.32 0.33 0.19 0.96 1.12 0.54
NNS7 2.6 2.32 0.33 0.19 1.54 1.29 0.49
NNS5 2.5 2.29 0.40 0.19 0.81 0.82 0.57
NS7 2.5 2.27 0.44 0.19 1.11 1.12 0.53
NS11 2.5 2.25 0.47 0.19 1.00 0.94 0.53
NS4 2.5 2.22 0.54 0.19 0.52 0.48 0.60
NNS4 2.5 2.22 0.54 0.19 0.52 0.48 0.60
NNS12 2.4 2.17 0.65 0.19 0.83 0.97 0.56
NNS2 2.4 2.13 0.72 0.19 0.69 0.68 0.57
NS3 2.4 2.08 0.83 0.19 0.77 1.03 0.57
NNS3 2.4 2.08 0.83 0.19 0.85 0.73 0.59
NS2 2.3 2.02 0.97 0.19 0.67 0.69 0.57
NS8 2.3 1.99 1.05 0.19 0.78 0.77 0.59
NS6 2.2 1.91 1.23 0.19 1.30 1.41 0.53
NNS6 2.2 1.84 1.38 0.19 1.61 1.74 0.49
NS1 2.1 1.75 1.60 0.20 0.68 0.60 0.58
NNS8 2.1 1.73 1.64 0.20 0.85 0.72 0.56
Mean 2.5 2.22 0.54 0.19 1.00 1.00 0.55
S.D. 0.2 0.27 0.58 0.00 0.31 0.33 0.03
RMSE (model) = 0.19 Adj. S.D. = 0.55
Separation = 2.87 Separation (not inter-rater) Reliability = 0.89
Fixed (all same) χ2 = 214.7 d.f. = 23
Significance (probability) = .00

Note: SR/ROR correlation is presented as the point-biserial correlation (PtBis) in the
FACET output.

and expected scores can provide useful information about rater
behavior. If raters are interchangeable, it should be expected that
all raters would be assigned the same proportion of large standard
residuals, according to the proportion of total ratings that they make
(Myford & Wolfe, 2000). Based on the number of large standard
residuals and ratings that all raters make and each rater makes, they
suggest that the null proportion of large standard residuals for each
rater ( π ) and the observed proportion of large standard residuals for
each rater (Pr) can be computed using equations (1) and (2):

Downloaded from by Green Smith on April 9, 2009

not NS10. May.sagepub. An inconsistent rating will occur when the observed propor- tion exceeds the null proportion beyond the acceptable deviation (Myford & Wolfe. however. personal communication. 2004b) introduced the more advanced Many-faceted Rasch Measurement application to detect raters’ consistency based on the single rater–rest of the raters (SR/ROR) correlation. an unexpected observation was reported if the stand- ardized residual was greater than +2. the rater is considered to be exercising an inconsistent rating pattern. 2005).com by Green Smith on April 9. Nu = the total number of large standard residuals and Nt = the total number of ratings. a result similar to what was found in the fit analysis. Interestingly. According to them. who had been flagged as misfitting teachers by their infit indices.200 An investigation into native and non-native teachers’ judgments π = Nu (1) Nt where. 2000). they are flagged with Downloaded from http://ltj. Pr − π Zp = (3) π − π2 Ntr In this study. who had Zp values greater than +2. When raters exhibit randomness. Nur Pr = (2) N tr where. if a Zp value for a rater is below +2. This may be because NS10 produced only a small number of unexpected ratings which did not produce large residuals. if the value is above +2. Myford. Thus. The two NNS teachers whose observed Zp values were greater than +2 were NNS6 and NNS7. When rating consistency was exam- ined. the analy- sis of NS teachers showed that it was NS9. That small Zp value indicates that while the teacher gave a few ratings that were somewhat unexpectedly higher (or lower) than the model would expect. 31. Myford and Wolfe (2004a. those ratings were not highly unexpected (C. which was the case in 89 out of a total of 1. it indicates that the unexpected ratings that he or she made are random error. one NS teacher and two NNS teachers were found to exhibit inconsistent rating patterns. Myford and Wolfe propose that the frequency of unexpected ratings (Zp) can be calculated using equa- tion (3). Nur = the number of large standard residuals made by rater r and Ntr = the number of ratings made by rater r. 2009 .727 responses.

the ratings of the NS group were slightly more diverse across tasks.41 logit spread.97 logits. and NNS9 seemed to be on the borderline in their consistency. comparison of task difficulty measures is considered a legitimate approach. the range of task difficulty measures was similar to that of the NS group. 2009 . and Tasks 3 and 2 were given the lowest difficulty meas- ure by the NS and the NNS teacher groups. and NNS9 showed not only large fit indices but also low SR/ROR correlations. Youn-Hee Kim 201 significantly large infit and outfit mean square indices. and that the NNS teachers were as dependable as the NS teachers in assessing students’ oral English performance. with task difficulty measures ranging from −0. though slightly narrower: from −0.53 logits to 0. Task 6 was given the highest difficulty measure by both groups of teachers.sagepub. sig- nificantly large infit and outfit mean square indices may also indicate other rater effects (Myford & by Green Smith on April 9. NNS7. NS9. with a 1. 2004b). 2004a. respectively. Figure 2 shows the task difficulty derived from the NS and the NNS groups of teachers. As can be seen. Downloaded from http://ltj. Four teachers appeared to be inconsistent: NS9. with a 1. Figure 2 also shows that both groups exhibited generally similar patterns in task difficulty meas- ures. most of the NS and NNS teachers were consistent in their ratings. Given that task difficulty is determined to some extent by raters’ severity in a performance assessment setting. When compared relatively. the three different types of statistical approaches showed converging evidence. More specifically. This result implies that the two groups rarely differed in terms of internal consistency. NNS6. Myford and Wolfe suggested that it is important to examine significantly low SR/ROR correlations as well. NNS7. whereas NNS6 was obviously signaled as an inconsistent teacher. In summary.59 logits to 0. with one or two teachers from each group showing inconsistent rating patterns. however.82 logits.50 logit spread. Thus. they suggested that randomness will be detected when infit and outfit mean square indices are significantly larger than 1 and SR/ROR correlations are significantly lower than those of other raters. 2 Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students’ oral English performance? The analysis was carried out in order to identify whether the two groups of teachers showed similar severity measures across differ- ent tasks. in the NNS group’s ratings.

and significantly severe patterns on Task 7. that group of teachers is scoring a task leniently compared with the way they have assessed other tasks. As the bias slopes of Figure 3 illustrate. While an interaction was found between individual teachers and tasks.202 An investigation into native and non-native teachers’ judgments 1. By the same token. one teacher from each group exhibited significantly lenient rating patterns on Tasks 1 and 4. an estimate of the extent to which a teacher group was biased toward a particular task is standardized to a Z-score.6 -0. the NS and NNS teacher groups do not appear to have any significant interactions with particular tasks.8 Task Difficulty (logits) 0.2 T1 T2 T3 T4 T5 T6 T7 T8 -0. no bias emerged toward a particular task from a particular group of teachers. As shown in Table 3. thus. In the bias analysis.8 Tasks NS Teacher Group NNS Teacher Group Figure 2 Task difficulty measures by NS and NNS teacher groups A bias analysis was carried out to further explore the potential interaction between teacher groups and tasks. where the values are above +2.4 0.4 -0. certain teachers from each group showed exactly the same bias patterns on particular by Green Smith on April 9.sagepub. neither of the two groups of teachers was positively or negatively biased toward any particular tasks.6 0. A bias analysis between individual teachers and tasks confirmed the result of the previous analysis. suggesting a significant interaction between the group and the task. Strikingly. that group of teachers is thought to be scoring a task without significant bias. 2009 .2 1 0. When the Z-score values in a bias analysis fall between −2 and +2.2 0 -0. Two NS teachers exhibited conflicting rating patterns on Task 6: NS11 showed a significantly Downloaded from http://ltj. that group of teachers is thought to be rating that task more severely than other tasks. Where the val- ues fall below −2.

21 0.55 −2. 2009 .19 1.11 0.44 1.69 1.8 NS5 T6 0. NS9 rated Task 6 significantly more severely.0 NS9 T6 −0.92 1.43 −1.64 1. The overall results of the multiple quantitative analyses also show that the NS and NNS Table 3 Bias analysis: Interactions between teachers and tasks Teacher Task Obs-Exp Bias Model Z-score Infit average measure S.4 -0.21 0.1 NNS6 T7 −0.9 NS9 T4 0.06 0.65 2. In summary.90 0.23 0. NNS GROUP 1 0. It is very interesting that one teacher from each group showed the same bias patterns on Tasks 1.55 −1.43 −1.6 -0.53 −2. NS GROUP 2.69 2.7 NS3 T1 0.1 Downloaded from http://ltj.02 0.4 z-value 0.55 −2.84 1.sagepub.09 2.54 −1.8 T1 T2 T3 T4 T5 T6 T7 T8 Figure 3 Bias analysis between teacher groups and tasks more lenient pattern of ratings. that is.60 2.29 0.90 0.26 0.24 0.49 1.8 0.6 0.5 NNS12 T1 0.47 −1.22 0.34 1.58 by Green Smith on April 9.18 0.01 0.7 NS6 T7 −0.3 NNS6 T6 −0.54 3.38 −1. since it implies that the ratings of these two teachers may be interchangeable in that they display the same bias patterns.13 1.44 −1. MnSq (logits) NS11 T6 0.50 −2. and this is confirmed by both the task difficulty measures and the two bias analyses.93 1. while NS9 showed the exact reverse pattern.1 NS3 T6 −0. the NS and NNS teachers seem to have behaved similarly in terms of severity. 4.58 −2.E.5 NNS9 T4 0. and 7. Youn-Hee Kim 203 Teacher Group 1.2 -0.2 0 -0.06 0.60 1.

the NS and NNS teacher groups could not be compared for individual tasks in that they made very few comments (fewer than 10) on some evaluation criteria related to a particular task. the NNS group made only 1. The analysis was conducted using comments drawn from all eight tasks. the overall number of comments for these two criteria was similar in the two teacher groups (46 vs. Given that the quantitative outcomes that the two groups of teachers generated were rarely different. the research focus now turns to qualitative analyses of the processes the teachers went through in their assessments. When comments from both groups were reviewed. The com- parison was therefore based on comments for all tasks rather than for each individual task. Interestingly. This may be because providing students with detailed evaluative comments on their per- formance is not as widely used in an EFL context as traditional fixed response assessment. and they were quantified and compared between the NS and NNS teacher groups.sagepub. Figure 4 illustrates the frequency of comments made by the NS and NNS teacher groups for the 19 evaluation criteria. 50 comments Downloaded from http://ltj. While the quantitative approach to their ratings provided initial insights into the teach- ers’ evaluation behavior. Still.123 com- ments. 2009 . a variety of themes emerged.172.204 An investigation into native and non-native teachers’ judgments teachers appeared to reach an agreement as to the score a test-taker should be awarded. the total number of comments made by the two groups differed distinctly: while the NS group made 2. Figure 4 also shows that the NS group provided more comments than the NNS group for all but two of the evalua- tion criteria: accuracy of transferred information and completeness of discourse. 3 How do NS and NNS teachers differ in drawing on evaluation criteria when they comment on students’ oral English performance? In order to illustrate teachers’ evaluation patterns. Nineteen evaluation criteria were identified. exhibiting little difference in internal consist- ency and severity. elucidating dimensions that might have been obscured by the use of a solely quantitative method. The mixed methods design was intended to enhance the study’s depth and breadth. a more comprehensive and enriched understanding was anticipated from the complementary inclusion of a qualitative by Green Smith on April 9. their written comments were analyzed qualitatively.

overall language use (7.42%). and specific grammar use (6.sagepub.46% of all comments). For example.00%). ‘arrive’ for ‘alive’)’. ‘reverse’ for ‘reserve’.69%).47%). ‘pronunciation dif- ficulty. Although the NS and NNS groups differed in that the NS group made more comments across most of the evaluation criteria. vowels.g.47%).23% of all comments). the NS group was found to draw most frequently on overall language use (13. 2009 . ‘really’)’. the NS teach- ers commented that ‘some small pronunciation issue (‘can’/‘can’t’ & ‘show’/‘saw’) causes confusion’. and overall lan- guage use to be the primary evaluation criteria. etc. pronunciation. The NS teachers provided more detailed and elaborate comments. d/t. f/p. Youn-Hee Kim 205 NS NNS 300 Frequency of Comments 250 200 150 100 50 0 l g ruc y al rele fo nt se nte ncy fe ent t n un ry ra f di ails n g p l i sk t g u ce p r e ss vo use cio pec ram ture t a pri se en co ess ra ram use ili rse of e tio ov pic d in cy h o f m e nc la str com e ta an van nt ppr ar u ge tenc igib tra gum m n e t n n cou bu cia de e pl eme ere flu pp ate sh ar e gu rre ag th iat ca m m ll ar ar s h t es g on in ns nt of o ro i l t and o pr ul ic g ll ela ess to ac tio t a ra n er erst of f l er l n k i co upp ne e ex as ete bo d tu un s ra s co al cu m -c ac ov so Figure 4 Frequency distribution of the comments by NS and NNS teacher groups for accuracy of transferred information. when evaluating pronunciation. i/e. l/r.68%). ‘sometimes pronunciation is not clear. When the evaluation criteria emphasized by the two teacher groups were examined.. especially at word onsets’. pronunciation (11.33%).com by Green Smith on April 9. These trends indicate that the two teacher groups shared common ideas about the ways in which the students’ performance should be assessed..70%). both groups considered vocabulary. The explicit pinpointing of pronunciation errors Downloaded from http://ltj. ‘pronunciation occasionally unclear (e. The NNS group emphasized pronun- ciation (15. often singling out a word or phrase from students’ speech responses and using it as a springboard for justifying their evaluative com- ments. intelligibil- ity (7. fluency (9. vocabulary (11. vocabulary (14. 53 vs. and coherence (5. 66 comments for completeness of discourse). ‘some words mispronounced (e.g.’.

The NS teach- ers provided more detailed feedback on specific aspects of gram- mar by Green Smith on April 9. Brown. Similar patterns appeared in the evaluation criteria of specific grammar use and accuracy of transferred information. While this study examined non-native speakers’ pho- nological features through a qualitative lens. Although pronunciation was one of the most frequently mentioned evaluation criteria and constituted 15. as one of the reviewers of this article suggested. ‘hard to follow due to pronunciation’. confirming that their attention is more focused on overall phonological performance or intelligibility than specific phonologi- cal accuracy.sagepub. 1995. It can also be interpreted to suggest that the NS teachers were less tolerant of or more easily distracted by phonological errors made by non-native English-speakers. as long as students’ oral performance was intelligible or comprehensible. Instead of identifying problems with specific phonological features. 2009 . the previous studies focused on the quantitative scores awarded on pronunciation as one analytic evaluation criterion. 1987) that indicated native speakers are less concerned about or annoyed by non-native speech features as long as they are intelligible. making more comments compared to the NNS teachers Downloaded from http://ltj. the NNS teachers were more familiar with the students’ English pronunciation than the NS teach- ers because the NNS teachers shared the same first language back- ground with the students.. When the comments provided by the NNS teachers on pronun- ciation were examined. ‘good description of library but problems with pro- nunciation (often only with effort can words be understood)’. the NNS teachers did not seem to be interested in the micro-level of phonological performance.206 An investigation into native and non-native teachers’ judgments might imply that the NS teachers tended to be sensitive or strict in terms of phonological accuracy. Another possible explanation might be that.23% of the total comments. Fayer & Krasinski. ‘problems with word stress’. their comments included ‘problems with pronunciation’. they were somewhat different. Intelligibility was the third most frequently mentioned evaluation criterion by the NNS teachers. This inconsistency might ulti- mately be due to the different methodological approaches employed in the studies. These findings are somewhat contradictory to those of previous studies (e. etc. It appears that the NNS teachers were less influenced by phonologi- cal accuracy than by global comprehensibility or intelligibility. they tended to focus on the overall quality of students’ pronunciation performance.g. the NNS teachers were more general in their evaluation comments. In other words. For example.

the NNS teachers neither responsively nor meticu- lously cited the use of prepositions or verb tenses. past progressive. For example. The same observations were also made on Task 4 (narrating a story from six sequential pictures) and Task 7 (describing a graph of human life expectancy). Tasks 4 and 7 were similar in that students had to describe events that had taken place in the past in order to complete them successfully. Although the NNS Downloaded from http://ltj. and ‘problems with tense’). as their comments make manifest: ‘successfully recounted in the past with complex structure (i. By contrast. Their 29 total comments on specific grammar use were often too short to enable interpretation of their judgments (e. by stating ‘prepositions of place could be more precise (e. The responses of the two teacher groups to the accuracy of trans- ferred information followed the same pattern. ‘tense accuracy is important for listener comprehension in this task’.. ‘in front of’ computers)’ and ‘incorrect or vague use of prepositions of place hinders visualization’. ‘problems with prepositions’. present. native Spanish speakers as opposed to native Korean speakers). the NS teachers were more aware than the NNS teachers of the precise use of verb tenses. ‘wrong tense’. As was the case with preposition use. It was therefore essential for students to be comfortable with a variety of verb tenses (past. her conjectures are noteworthy. 2009 . past progressive)’.. for example.g. Galloway noted that ‘con- fusion of tense may not have caused problems for the non-native speaker. but it did seem to impede communication seriously for the native speaker’ (p. when evaluating students’ performance on Task 1 (describing the layout of a library). ‘no prepositions’.sagepub. Youn-Hee Kim 207 (152 vs. and ‘minor error in verb tense (didn’t use future in refer- ence to 2010 at first)’. past perfect. suggesting that the NNS teachers were less distracted by the mis- use of prepositions and verb tenses than NS teachers. ‘chang- ing verb tense caused some confusion’. 29 comments). Speculating as to why native and non-native speakers had different perceptions of the extent to which linguistic errors disrupt communication. past perfect.. consistent with Galloway’s (1980) findings. and future) so as not to confuse their listeners. 432). ‘all recounted in present tense’.com by Green Smith on April 9.e..e. Although the native language group in the Galloway study was quite different from that of the present study (i.g. They further pointed out that accurate use of prepositions might facilitate listeners’ visualization of given information. the NS teachers paid more attention to accurate use of prepositions than to other grammatical features.

NNS teachers who teach daily in an EFL context may be poorly informed about how to evaluate students’ language performance without depending on numeric scores and traditional fixed response assessment. It may simply be that the NNS teachers considered content errors to be simple mistakes that should not be used to misrepresent students’ overall oral English proficiency. This different evaluation culture might have contributed to the dissimilar evaluation patterns for the NS and NNS teachers. 46. For example. confused renew- als for grads & undergrads. less elaborate comments than the NS teachers on certain evaluation cri- teria requires careful interpretation. fines of $50/day → 50¢/day)’. then less so when talking about fines (e. their characteristics were dissimilar.g. where students were asked to verbalize literal and numeric information. comment- ing ‘accurate info’.208 An investigation into native and non-native teachers’ judgments teachers provided more comments than did the NS teachers (50 vs. for example. those who participated in the study were not told that they should make their comments as specific as Downloaded from http://ltj. NOT 2000)’.g. ‘not very inaccurate info’.g. Because this study was intended only to capture teachers’ evaluation behavior.sagepub. as long as the speech was comprehensible.. closing time of 9:00 pm instead of 6:00 pm)’. 2007). or ‘provided wrong information’. The different evaluation behaviors might also be attributable to a meth- odological matter. it has been pointed out that NNS teachers had not been effectively trained to assess students’ performance (Lee. The tendency of the NNS teachers to provide less detailed. This was especially evident in Task 2 (explaining the library services based on a provided informational note) and Task 7 (describing a graph of human life expectancy).com by Green Smith on April 9. etc. respectively). the NNS teachers were primarily concerned with whether the delivered information was generally correct. On these two tasks. ‘some incorrect info (e. By con- trast. ‘gradually accurate at first. com- menting ‘some key information inaccurate (e. Although there have been recent advances in performance assessment in an ELF context.. ‘$50’ → ‘50¢’)’. The NNS teachers’ global judgments on the accuracy of transmitted information raises the question of whether the NNS teachers were not as attentive as the NS teachers to specific aspects of content accuracy.. they pointed out every inconsistency between the provided visual information and the transferred verbalized information. and jotted down content errors whenever they occurred. ‘some incorrect information (the gap between men and women was smallest in 1930. 2009 . the NS teach- ers appeared very attentive to the accuracy of transmitted informa- tion.

For example. but they also drew on different evaluation criteria across different tasks. The teachers not only exhibited different severity measures. 2009 . A striking disparity. Of the eight individual tasks. enabling the teachers to exhibit varied rating behaviors while assessing diverse oral language output. More by Green Smith on April 9. Three different statistical approaches were used to compare teach- ers’ internal consistency. This observation arose Downloaded from http://ltj. and they revealed almost identical patterns. both teacher groups were most severe on Task 6. Youn-Hee Kim 209 possible. A variety of test tasks were employed. appeared in the NS and NNS teachers’ evaluation criteria for students’ performance. Most of the NS and NNS teachers maintained acceptable levels of internal consistency. which might have influenced the NNS teachers’ lack of evaluative comments. IV Conclusion and implications This study has examined how a sample of NS and NNS teachers assessed students’ oral English performance from comprehensive perspectives. focusing on overall quality without considering the granularity of their own comments. with only one or two teachers from each group identified as inconsistent raters. it is also possible that the NNS teachers did not orient their comments toward providing feedback for the students.sagepub. These findings suggest that employing multi- ple tasks might be useful in capturing diverse rater behaviors. a bias analysis carried out for individual teachers and individual tasks showed that one teacher from each group exhibited exactly the same bias patterns on certain tasks. and more evidence needs to be gathered to address the specific ways in which the NS and NNS teachers provided students with feedback related to those linguistic errors. The NS teachers provided far more com- ments than the NNS teachers with regard to students’ performance across almost all of the evaluation criteria. and neither was positively or negatively biased toward a particular task. however. Similar results were obtained when the severity of the two groups was compared. As one of the reviewers suggested. the NNS teachers may sim- ply have noted the major characteristics of students’ oral output. To suggest that the NNS teachers did not identify linguistic errors as accurately as did the NS teachers would therefore be premature. A qualitative analysis further showed the NS teachers to be more detailed and elaborate in their comments than were the NNS teachers.

Collecting diverse data also helped to over- come the limitations of the aforementioned previous by Green Smith on April 9. By the same token. the study’s results offer no indication that NNS teachers should be denied posi- tions as assessors simply because they do not own the language by ‘primogeniture and due of birth’ (Widdowson. Further research is therefore warranted to investigate the effectiveness of NNS teachers within their local educational systems. NNS teachers could be more compelling or sensitive assessors than NS teachers in ‘expanding circles countries’ (Kachru. This study has shown that by combining quantitative and qualita- tive research methods. 379). above and beyond findings from the quan- titative analysis alone. The comparable internal consistency and severity patterns that the NS and NNS teachers exhibited appear to support the assertion that NNS teachers can function as assessors as reliably as NS teachers can. 2009 . Only Canadian and Korean English teachers were included in the sample.’ In a sense. Several methodological limitations and suggestions should be noted.sagepub. an inquiry into validity is a complicated quest. First. Therefore. 1985). the study has not evidenced how different qualitative evaluation approaches interact with students and which evaluation method would be more beneficial to them. with at least one graduate degree related to linguis- tics or language education. and the accuracy of transferred information. and no validity claims are ‘one-size-fits-all. and most of these were well-qualified and experienced. Although the NS teachers provided more detailed and elaborate comments. Diverse paradigms and multiple research methods ena- bled diverse social phenomena to be explored from different angles. a comprehensive understanding of research phenomena can be achieved via paradigmatic and methodological pluralism. Limiting the research outcomes to the Downloaded from http://ltj. the involvement of native speakers in an assessment setting should not be interpreted as a panacea. Considering assessment practices can be truly valid only when all contextual factors are considered. 1994. since the former might be more familiar with the instructional objectives and cur- riculum goals of indigenous educational systems. p. the inclusion of a qualitative analysis provided insight into the dif- ferent ways in which NS and NNS teachers assessed students’ oral language performance. specific grammar use.210 An investigation into native and non-native teachers’ judgments from their judgments on pronunciation. which depended solely on numeric data to investigate raters’ behavior in oral language performance assessment. this study’s results cannot be generalized to other populations.

Youn-Hee Kim 211

specific context in which this study was carried out will make the
interpretations of the study more valid. The use of other qualitative
approaches is also recommended. The only qualitative data col-
lected were written comments, which failed to offer a full account
of the teachers’ in-depth rating behavior. Those behaviors could be
further investigated using verbal protocols or in-depth interviews
for a fuller picture of what the teachers consider effective language
performance. As one of the reviewers pointed out, it might be also
interesting to investigate whether the comments made by the NS
and NNS teachers tap different constructs of underlying oral profi-
ciency and thereby result in different rating scales. Lastly, further
research is suggested to examine the extent to which the semi-
direct oral test and the rating scale employed in this study represent
the construct of underlying oral proficiency.

I would like to acknowledge that this research project was funded
by the Social Sciences and Humanities Research Council of Canada
through McGill University’s Institutional Grant. My sincere appre-
ciation goes to Carolyn Turner for her patience, insight, and guid-
ance, which inspired me to complete this research project. I am also
very grateful to Eunice Jang, Alister Cumming, and Merrill Swain
for their valuable comments and suggestions on an earlier version
of this article. Thanks are also due to three anonymous reviewers of
Language Testing for their helpful comments.

V References
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford:
Oxford University Press.
Barnwell, D. (1989). ‘Naïve’ native speakers and judgments of oral proficiency
in Spanish. Language Testing, 6, 152–163.
Brown, A. (1995). The effect of rater variables in the development of an occu-
pation-specific language performance test. Language Testing, 12, 1–15.
Caracelli, V. J. & Greene, J. C. (1993). Data analysis strategies for mixed-
method evaluation designs. Educational Evaluation and Policy Analysis,
15, 195–207.
Caracelli, V. J. & Greene, J. C. (1997). Crafting mixed method evaluation
designs. In Greene, J. C. & Caracelli, V. J., editors, Advances in mixed-
method evaluation: The challenges and benefits of integrating diverse

Downloaded from by Green Smith on April 9, 2009

212 An investigation into native and non-native teachers’ judgments

paradigms. New Directions for Evaluation no. 74 (pp. 19–32). San
Francisco: Jossey-Bass.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different
tests and rater groups. Language Testing, 12, 16–33.
Chalhoub-Deville, M. & Wigglesworth, G. (2005). Rater judgment and English
language speaking proficiency. World Englishes, 24, 383–391.
Clark, J. L. D. & Swinton, S. S. (1979). An exploration of speaking proficiency
measures in the TOEFL context (TOEFL Research Report No. RR-04).
Princeton, NJ: Educational Testing Service.
Clark, J. L. D. & Swinton, S. S. (1980). The test of spoken English as a
measure of communicative ability in English-medium instructional set-
tings (TOEFL Research Report No. RR-07). Princeton, NJ: Educational
Testing Service.
Crystal, D. (2003). English as a global language. Cambridge: Cambridge
University Press.
Fayer, J. M. & Krasinski, E. (1987). Native and nonnative judgments of intel-
ligibility and irritation. Language Learning, 37, 313–326.
Galloway, V. B. (1980). Perceptions of the communicative efforts of American
students of Spanish. Modern Language Journal, 64, 428–433.
Graddol, D. (1997). The future of English?: A guide to forecasting the
popularity of English in the 21st century. London, UK: The British
Greene, J. C., Caracelli, V. J. & Graham, W. F. (1989). Toward a conceptual
framework for mixed-method evaluation design. Educational Evaluation
and Policy Analysis, 11, 255–274.
Hadden, B. L. (1991). Teacher and nonteacher perceptions of second-language
communication. Language Learning, 41, 1–24.
Hill, K. (1997). Who should be the judge?: The use of non-native speakers
as raters on a test of English as an international language. In Huhta, A.,
Kohonen, V., Kurki-Suonio, L., & Luoma, S., editors, Current develop-
ments and alternatives in language assessment: Proceedings of LTRC
96 (pp. 275–290). Jyväskylä: University of Jyväskylä and University of
Jenkins, J. (2003). World Englishes: A resource book for students. New York:
Johnson, B. & Turner, L. A. (2003). Data collection strategies in mixed
methods research. In Tashakkori, A. & Teddlie, C., editors, Handbook
of mixed methods in social and behavioral research (pp. 297–319).
Thousand Oaks, CA: Sage.
Kachru, B. B. (1985). Standards, codification and sociolinguistic realism: The
English language in the outer circle. In Quirk, R. & Widdowson, H.,
editors, English in the world: Teaching and learning the language and
literatures (pp. 11–30). Cambridge: Cambridge University Press.
Kachru, B. B. (1992). The other side of English. In Kachru, B. B., editors, The
other tongue: English across cultures (pp. 1–15). Urbana, IL: University
of Illinois Press.

Downloaded from by Green Smith on April 9, 2009

Youn-Hee Kim 213

Kim, Y-H. (2005). An investigation into variability of tasks and teacher-judges
in second language oral performance assessment. Unpublished master’s
thesis, McGill University, Montreal, Quebec, Canada.
Lee, H-K. (2007). A study on the English teacher quality as an English instruc-
tor and as an assessor in the Korean secondary school. English Teaching,
62, 309–330.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA
Linacre, J. M. (2005). A user’s guide to facets: Rasch-model computer pro-
grams. [Computer software and manual]. Retrieved April 10, 2005, from
Linacre, J. M. & Williams, J. (1998). How much is enough? Rasch Measurement:
Transactions of the Rasch Measurement SIG, 12, 653.
Lowenberg, P. H. (2000). Assessing English proficiency in the global con-
text: The significance of non-native norms. In Kam, H. W., editor,
Language in the global context: Implications for the language classroom
(pp. 207–228). Singapore: SEAMEO Regional Language Center.
Lowenberg, P. H. (2002). Assessing English proficiency in the Expanding
Circle. World Englishes, 21, 431–435.
Lunz, M. E. & Stahl, J. A. (1990). Judge severity and consistency across grad-
ing periods. Evaluation and the health professions, 13, 425–444.
McNamara, T. F. (1996). Measuring second language performance. London:
Myford, C. M. & Wolfe, E. W. (2000). Monitoring sources of variability within
the test of spoken English Assessment System (TOEFL Research Report
No. RR-65). Princeton, NJ: Educational Testing Service.
Myford, C. M. & Wolfe, E. W. (2004a). Detecting and measuring rater effects
using many-facet Rasch measurement: Part I. In Smith, Jr., E. V. &
Smith, R. M., editors, Introduction toRasch measurement (pp. 460–517).
Maple Grove, MN: JAM Press.
Myford, C. M. & Wolfe, E. W. (2004b). Detecting and measuring rater effects
using many-facet Rasch measurement: Part II. In Smith, Jr., E. V. &
Smith, R. M., editors, Introduction to Rasch measurement. Maple Grove,
MN: JAM Press, 518–574.
O’Loughlin, K. (1995). Lexical density in candidate output on direct and
semi-direct versions of an oral proficiency test. Language Testing, 12,
Stansfield, C. W. & Kenyon, D. M. (1992a). The development and validation of
a simulated oral proficiency interview. The Modern Language Journal,
72, 129–141.
Stansfield, C. W. & Kenyon, D. M. (1992b). Research on the comparability of
the oral proficiency interview and the simulated oral proficiency inter-
view. System, 20, 347–364.
Stansfield, C. W., Kenyon, D. M., Paiva, R., Doyle, F., Ulsh, I., & Antonia, M.
(1990). The development and validation of the Portuguese Speaking Test,
Hispania, 73, 641–651.

Downloaded from by Green Smith on April 9, 2009

D. (2007). Testing spoken language: A handbook of oral testing techniques. Taylor. Cambridge: Cambridge University Press.214 An investigation into native and non-native teachers’ judgments Tashakkori.sagepub. 370. 1. A. Rasch Measurement: Transactions of the Rasch Measurement SIG. The ownership of English. Wright. Mixed methods sampling: A typology with exam- ples. Downloaded from http://ltj. 2009 . (1994). 51–60. Journal of Mixed Methods Research. The changing landscape of English: Implications for lan- guage assessment. Reasonable mean-square fit values. & Teddlie. (1994). L. CA: Sage. (2006). editors (2003).com by Green Smith on April 9. Underhill. & Linacre.. Handbook of mixed methods in social and behavioral research. 377–388. Teddlie. M. C. 77–100. C. TESOL Quarterly. H. ELT Journal. 28. F. N. 8. B. Widdowson. J. G. 60. B. Thousand Oaks. & Yu. (1987).

1 Overall communication is generally unsuccessful. or a response of ‘I don’t know’ is automatically rated NR (Not Ratable). 3 Overall communication is generally successful. No response. 3. 2009 . Downloaded from http://ltj. 2. 2 Overall communication is less successful. Notes: by Green Smith on April 9. a great deal of listener effort is required. little or no lis- tener effort is required. ‘Communication’ is defined as an examinee’s ability to both address a given task and get a message across. more listener effort is required. Youn-Hee Kim 215 Appendix A: Rating scale for the oral English test 4 Overall communication is almost always successful. A score of 4 does not necessarily mean speech is comparable to that of native English speakers. some listener effort is required.

g. Appendix B: Definitions and examples of the evaluation criteria Evaluation criteria & definitions Examples 1.’ ‘Successfully accomplished task. he crossed a girl. $50 a day for book overdue?)’ ‘Incorrect information (e.g..g.)’ 8. Fluency: the degree to which the response is fluent without ‘Choppy... Overall task accomplishment: the degree to which a ‘Generally accomplished the task.’ ‘Arguments quite strong’ 4.’ understands the given task ‘Didn’t understand everything about the task. Vocabulary: the degree to which vocabulary used in the ‘Good choice of vocabulary’ response is of good and appropriate quality ‘Some unusual vocabulary choices (e.g..g.’ 2.’ the response is robust ‘Good statement of main reason presented. Overall language use: the degree to which the language ‘Generally good use of language’ component of the response is of good and appropriate ‘Native-like language’ quality ‘Very limited language’ Downloaded from http://ltj. Understanding the task: the degree to which a speaker ‘Didn’t seem to understand the task.. l/r. focusing on physically harmful effects of laptops rather than on harmful effects of the internet)’ 6. Strength of argument: the degree to which the argument of ‘Good range of points raised. d/t. Accuracy of transferred information: the degree to which a ‘Misinterpretation of information (e.sagepub. Topic relevance: the degree to which the content of the ‘Not all points relevant’ response is relevant to the topic ‘Suddenly addressing irrelevant topic (i. halted’ too much hesitation ‘Pausing. ‘9pm’ instead of ‘6pm’)’ 5..e. graduate renewals for speaker transfers the given information accurately undergrads.’ 3. vowels. ‘circulation ’) 9. by Green Smith on April 9. stalling – periods of silence’ 216 An investigation into native and non-native teachers’ judgments ‘Smooth flow of speech’ . Pronunciation: the degree to which pronunciation of the ‘Native-like pronunciation’ response is of good quality and clarity ‘Pronunciation difficulty (e. i/e)’ ‘Mispronunciation of some words (e.’ speaker accomplishes the general demands of the task ‘Task not really well accomplished. 2009 7.

’ structure of the response is of good quality and complexity ‘Telegraphic speech’ ‘Took risk with more complex sentence structure’ 12.’ 19. Contextual appropriateness: the degree to which the ‘Appropriate language for a given situation’ response is appropriate to the intended communicative ‘Student response would have been appropriate if Monica had goals of a given situation expressed worry about going to graduate school. more advice (culturally not appropriate)’ 15.’ 217 .’ information or details are provided for effective ‘Student only made one general comment about the graph without communication referring to specifics.’ Youn-Hee Kim of the response is elaborated ‘Good elaboration of reasons’ ‘Connect ideas smoothly by elaborating his arguments. Coherence: the degree to which the response is developed ‘Good use of linking words’ in a coherent manner ‘Great time markers’ ‘Organized answer’ 17. Elaboration of argument: the degree to which the argument ‘Mentioned his arguments but did not explain them.’ 18. 2009 discourse of the response is organized in a complete ‘No reference to conclusion’ manner ‘End not finished. Sentence structure: the degree to which the sentential ‘Cannot make complex sentences. General grammar use: the degree to which the general ‘Generally good grammar’ grammatical use is of good quality ‘Some problems with grammar’ ‘Few grammatical errors’ by Green Smith on April 9. Supplement of details: the degree to which sufficient ‘Provides enough details for effective explanation about the graph.’ 16. Intelligibility: the degree to which the response is ‘Hard to understand language (a great deal of listener work required)’ intelligible or comprehensible ‘Almost always understandable language’ 11. Specific grammar use: the degree to which the micro-level ‘Omission of articles’ of grammatical use is of good quality ‘Incorrect or vague use of prepositions of place’ ‘Good use of past progressive’ 14. Socio-cultural appropriateness: the degree to which the ‘Cultural/pragmatic issue (a little formal to congratulate a friend)’ response is appropriate in a social and cultural sense ‘Little congratulations.’ ‘Lacks enough information with logical explanation.sagepub. Completeness of discourse: the degree to which the ‘Incomplete speech’ Downloaded from http://ltj. 10.

nav Citations Subscriptions: Reprints: Published by: http://www.sagepub. 219 DOI: 10.sagepublications.nav Permissions: Additional services and information for Language Testing can be found at: Email Alerts: http://ltj.sagepub.sagepub. Language Testing A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats Yo In'nami and Rie Koizumi Language Testing Downloaded from http://ltj. 2009 by Green Smith on April 9.1177/0265532208101006 The online version of this article can be found at: http://ltj.sagepub.sagepub.

tut. Japan. test format. 1–1 Hibarigaoka. it appears that previous studies have been limited by the narrative approach to accumulate their findings. This approach has been criticized because (a) it has less objectivity and replicability Address for correspondence: Yo In’nami. © The Author(s). Alderson. Aichi 441–8580. Japan Rie Koizumi Tokiwa University. Toyohashi University of Technology. – Language Testing 2009 26 (2) 219 244 A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats Yo In’nami Toyohashi University of Technology. Format effects favoring multiple-choice formats across the three domains are consistently observed when studies employ between-subjects designs. Although this topic has been researched in the fields of language learning and educational meas- urement. random and Engineering. and L2 listening test performance. email: innami@hse. Keywords: meta-analysis. L2 reading.sagepub. or stem-equivalent items. or learners with a high L2 proficiency level. stem-equivalent items. Fifty-six data sources located in an extensive search of the literature were the basis for the estimates of the mean effect sizes of test format effects.sagepub. Department of Humanities. Management Science. 2009. Toyohashi. The results using the mixed effects model of meta-analysis indicate that multi- ple-choice formats are easier than open-ended formats in L1 reading and L2 listening. Brantmeier. with the degree of format effect ranging from small to large in L1 reading and medium to large in L2 listening. Overall. although multiple-choice formats are found to be easier than open-ended formats when any one of the following four condi- tions is met: the studies involve between-subjects designs. listening test Among the many existing variables that are considered to affect language test performance.nav open-ended. Tempaku. Japan A meta-analysis was conducted on the effects of multiple-choice and open- ended formats on L1 reading. 2001). multiple-choice. 2000.1177/0265532208101006 Downloaded from http://ltj. reading test. 2005. test method. Reprints and Permissions: http://www. random assign- by Green Smith on April 9. Bachman & Palmer. format effects in L2 reading are not 2009 .. 1996. one central issue is the effect of test for- mats on test performance (e.

Cook et al. In these examples. 1992. 2009 . Kobayashi. c-test. 2000).com by Green Smith on April 9. the same correct responses are required across the formats that share the same or a similar stem (i. I Literature review 1 Format effects A variety of test formats or methods have been employed in language testing. the current study uses meta-analysis to quantitatively synthesize format effects on reading and listening test performance. 1998). To avoid these limitations.. An open-ended format refers to a question that requires learners to formulate their own answers with several words or phrases. Buck.. it provides a solid basis for quantitative synthesis. summary. Campbell. 1999. and (c) it has difficulty handling the rich volume of information extracted from studies (e. While the literature on test format effects is enormous.e.g.g. 2002). 2001. 1993. the relative difficulty of test formats). thus. Cohen. Bennett & Ward.220 Focus on multiple-choice and open-ended formats due to individual differences among reviewers. highly important and interesting in terms of its comprehensive coverage of previous Downloaded from http://ltj. these two formats are defined as follows. researchers must understand the characteristics of each format and make the best selection accord- ing to which one(s) most appropriately serve(s) the purpose of a test in each context. Norris & Ortega. Buck. multiple-choice. The previous literature from a quantitative perspective has mainly focused on two issues: (a) differences in construct or trait measured using multiple-choice and open-ended formats and (b) differences between test scores in multiple-choice and open-ended formats (i. This has been one of the most investigated comparisons. Based on Davies et al... (1999).. (b) it mainly builds on conclusions drawn by the authors of original studies without reanalysis or reinterpretation. 2001. ordering. matching. Table 1 provides examples from a reading test. recall.sagepub. and summary gap-filling (e. the literature review below focuses on comparing multiple-choice and open-ended formats in reading and listening. Multiple- choice is a format with a stem and three or more options from which learners are required to select one.e. Among a wide range of studies investigating the former issue (e. Alderson. including cloze. open-ended (or short-answer). Since there is no perfect test format that functions well in every situation.. 2000. stem equivalent).g. gap-filling.

. however. narrative review vs. Note: Adopted from Campbell (1999. 0. according to each domain. which synthesized 56 sets of correlation coefficients between stem-equivalent multiple-choice and open- ended formats based on 29 studies from a variety of disciplines. most studies have used the t test or analysis of variance to compare the mean scores in L1 reading. 1990). which was somewhat prob- lematic (see section I. C. the previ- ous findings are summarized below. They thought it would be They believed in what Dorothea a good way for her to be elected to was doing.sagepub. They were also concerned about They understood the plight of people people with mental illness. It should also be noted that the summary was based on a narrative or traditional review of the literature. whereas some other studies have found no statistical difference between the two formats ( by Green Smith on April 9. public office. pp.91. D.* with mental illness. * = answer keys in the multiple-choice version.g. (Example answers) B. Yo In’nami and Rie Koizumi 221 Table 1 Examples of parallel formats Multiple-choice Open-ended Based on the passage.95 [95% confidence interval: 0. and L2 listen- ing. They did not fully understand how They were (also) concerned about much danger she was in. studies is Rodriguez (2003).e. 226). 2009 . what is the most probable reason Howe and most probable reason Howe and Mann encouraged Dorothea Dix to Mann encouraged Dorothea Dix to push for reform? push for reform? A. 1997. Elinor. Davey. Pressley et al. 1987). L2 reading. The literature in L1 reading has provided mixed results. what is the Based on the passage.g. 223. Since format effects vary by domain (Traub. quantitative synthesis). This appears to suggest that multiple-choice and open-ended formats measure a very similar con- struct when they use the same stem. Underlined letters in the open-ended version indicate examples of correct answers. In the case of the differences between the test scores in both test formats. In L2 reading. 2002.2). They truly thought Dorothea could make a difference.. most studies have shown that multiple-choice formats are easier than Downloaded from http://ltj. 1993). Some studies have shown that multiple-choice formats are easier than open-ended formats (e. people with mental illness.. this approach was employed to illustrate how different conclusions were drawn according to the way the literature was summarized (i.97]). Arthur et al... The results revealed a correlation between them approaching unity (0. They wanted to share in her fame.

in some L1 and L2 reading studies. the former was quantitatively summarized in Rodriguez (2003). but the latter does not appear to have been quantitatively synthesized. Second. First. L2 reading. Teng. such a method precludes integrative and systematic reviews for three reasons (e.g.sagepub. 1999). a narrative approach appears less objective and less replicable in collecting and summarizing previous studies. 1984). which contradicts Elinor (1997) and Trujillo (2006). 1991). Wolf. Third. all of the studies that examined the test score differences between multiple-choice and open-ended formats have used a narrative approach to reviewing the literature. However. even if each study was reanalyzed and reinterpreted.. 1992. 2000). All studies have shown that in L2 listening. since all studies are limited in one way or another (e. even if all of the relevant studies are collected and reviewed. 1984. Norris & Ortega..e. it would become very difficult to synthesize previous studies because reviewers easily lose track of the information collected as the number Downloaded from http://ltj. therefore. no significant difference was found between the two formats. Glass et by Green Smith on April 9. and L2 listening was that in most studies. 2000). In’nami. since the studies included in the integration of the findings often depend on the reviewer (e. multiple-choice formats were easier than open-ended formats. 1981. 1994. differences in construct measured and test scores). reviewers mostly take the findings of previous studies at face value and directly infer their own conclusions without rean- alysis or reinterpretation (Norris & Ortega. Berne.222 Focus on multiple-choice and open-ended formats open-ended formats (e. 2006. however.. 2 Research synthesis approach Although previous studies have individually provided valuable information on format effects. Cooper & Hedges. different reviewers can draw different conclusions about the findings in question. multiple-choice formats are easier than open-ended formats (e. What became clear through the review of L1 reading. sample size or research design). in which the two formats were considered to be of similar difficulty.g. the interpretation of each study must always be carefully examined.g.g.. However.. 2009 .. Regarding the two issues of multiple-choice and open-ended format effects (i. it should not be taken for granted that the conclu- sion drawn in each study is trustworthy.. Not surprisingly. Shohamy. Light & Pillemer.

L2 reading. and L2 listening. in order for the reader to evaluate the appropriateness of each step. and data summarization. 1994. 2006). To contend with the first prob- lem. Lipsey & Wilson. the current study conducts meta-analysis on the effects of two formats (multiple-choice and open-ended) on reading and listening test per- formance to combine and interpret previous studies in a meaningful way. and if neces- sary. 2009 . A great deal of information from the study findings and characteristics is systematically coded and used to conduct a detailed analysis of how the study findings are explained by the study features. inclusion criteria for analysis. this study expands previous studies by targeting a wide range of areas. Yo In’nami and Rie Koizumi 223 of studies or amount of information regarding the relationships between study findings and study characteristics increases (Lipsey & Wilson. One way to address these three problems with the narrative review is to use meta-analysis. replicate the entire step (e. multiple-choice or open-ended formats? To what degree are they easier? Are there any variables related to format effects? II Method 1 Data collection a Data identification: In order to identify as many relevant studies as possible to obtain the most comprehensive synthesis of format effects. but it has rarely been used in language testing (except for Ross. The research question.sagepub.. 1999. In response to the second and third problems. Besides the use of meta-analysis. 2000. Norris & Ortega. which is a research method that summarizes a set of empirical data across studies. meta-analysis reports the detailed processes of data by Green Smith on April 9. 2001). Thus. meta-analysis reanalyzes and reinterprets previous studies by taking the study characteristics into account. including L1 reading.g. Reflecting the advantages discussed above. Cooper & Hedges. a comparison of meta-analytic findings across domains would help clarify the specificity and gen- eralizability of test format effects. is as follows: Which are easier. investigated separately for L1 reading. three approaches were used to perform the Downloaded from http://ltj. 2001). and L2 listening studies. 1993).g. 1998). Blok. Although test format effects are claimed to vary by domain (Traub. meta-analysis has recently been used in second language acquisition studies (e.. L2 reading.

task format. evaluation format. authors’ by Green Smith on April 9. Cronbach (1990). Linguistics and Language Behavior Abstracts. (1999). and Haladyna (2004). Cohen (1998). Bachman (1990). response mode. Brown (2005). This list of keywords was constructed by the authors based on the keywords and synonyms retrieved from the thesauruses sup- plied in databases. Bachman and Palmer (1996). FirstSearch. Language Learning. The journals included Annual Review of Applied Linguistics. evaluation type. and educational measurement were reviewed. Cohen (1994). ELT Journal. response type. (1993). Second Language Research. short answer. Studies in Language Testing series published by Cambridge University Press. comparison. constructed response. Richards and Schmidt (2002). ScienceDirect. Second. and Urquhart and Weir (1998). and difficulty. Brennan (2006). response format. Language Assessment Quarterly. Language Teaching. assess- ment format.sagepub. Davies and Elder (2004). Abstract. item format. open-ended. Applied Psychological Measurement. and TESOL Quarterly. Language Testing. item type. first and second language acquisition. selected response. The books in first and second lan- guage acquisition mainly included Brown (2006). Applied Linguistics. evaluation method. question format. RELC Journal. Kintsch (1998). multiple-choice. books and articles reviewed. question type. Fulcher and Davidson (2007). title. Grabe and Stoller (2002). difference. PsycINFO. Downing and Haladyna (2006). The books in educational measure- ment mainly included Anastasi and Urbina (1995). assessment type. Reading Research Quarterly. and article keyword searches were used. First. combinations were used: test method. Clapham and Corson (1997). and feedback from colleagues. Davies et al. Rost (2002). Clapham. Different editions of the same book were also checked. A date range restriction was not imposed. System. Ellis (1994). and Weir (2005). Kaplan (2002). Flowerdew and Miller (2005). Doughty and Long (2003). ILTA Language Testing Bibliography (1999). Since a single key- word retrieves a large number of irrelevant studies. and Wall (1995). books and journals in language testing. 2009 . Applied Measurement in Education. Studies in Second Language Acquisition. Educational and Psychological Downloaded from http://ltj. The books in language testing mainly included Alderson. Hinkel (2005). we conducted literature retrieval through computer searches of the Educational Resources Information Center. task type. Bennett and Ward. and Web of Science. Modern Language Journal.224 Focus on multiple-choice and open-ended formats literature search. free response. test format. assessment method.

The remaining studies were examined by the first author. relevant studies were searched through communication with other researchers. Downloaded from http://ltj. In each of the three approaches. The agreement percentage was 80.e. b Criteria for the inclusion of a study: The literature search retrieved approximately 10. Disagreement was resolved through discussion. theoretical. Journal of Educational and Behavioral Statistics. These processes reduced the number of retrieved studies to 237. c Moderator variables coded for each study: In order to examine the relation between the format effects in a study and the variables affecting those format effects (moderator variables). and the reported moderator variables in at least two studies (the minimum number of studies required for meta-analysis) were coded. and Journal of Educational Measurement. As a result.60.. Their titles.4% (n = 20) of the 237 studies was separately examined by both authors. and the kappa coefficient was 0. Disagreement was discussed and solved. This resulted in 15 coded variables. 2009 . A sample of 0.000 studies. Yo In’nami and Rie Koizumi 225 Measurement. SDs. A sample of 8. t.000 studies was independ- ently examined by both authors. or F values). and review paper and chapter. Educational Measurement: Issues and Practice. means. When necessary.2% (n = 20) of the 10. and study descriptors were inspected if (a) the studies used multiple- choice and open-ended formats (see section I. The agreement percentage was 90. both published and unpublished.sagepub. every effort was made to contact the authors to request further details. the reference list of every empirical. n. 37 studies were retained and included in the meta-analysis.1 for the definitions) and (b) the subject matter of the test was language comprehension in a certain length of text in L1/L2 reading or L1/L2 listening. abstracts.80. was further scrutinized for addi- tional relevant materials. Third. They were further inspected if they met all of the following criteria: (a) the information required to calculate effect sizes was reported ( by Green Smith on April 9. and (b) the full score was the same across formats or the percentage of correct responses could be calculated. the 37 stud- ies retained for the current meta-analysis were inspected. r. The first author investi- gated the remaining studies. and the kappa coefficient was 0.

whereas with within- subjects studies. Downloaded from http://ltj. there was no significant dif- ference between the test scores in multiple-choice and open-ended formats. studies with random assignment were predicted to more clearly show format effects than were studies with non-random assignment. studies with stem equivalency were predicted to more clearly show format effects than were studies with non- stem equivalency. d) Access or no access to the text when answering: Davey and LaSasso (1984) reported that when learners were allowed to con- sult the text (i. 1984. Thus. multiple-choice formats were easier than open-ended formats.. In contrast. 1991). Wolf. learners assigned to one condition (a multiple-choice followed by open-ended format) were more likely to perform better on both formats than those assigned to the opposite condition (an open-ended fol- lowed by multiple-choice format). Shohamy. and more so in studies that administer open-ended formats first than studies that administer multiple-choice formats first. and this may obscure differ- ences in test format effects. 2002) by Green Smith on April 9. Thus. it appears that format effects are more clearly observable in studies with between-subjects designs than those with within-subjects designs. 2003). no access to the text).sagepub.e. 2009 . c) Stem or non-stem equivalency: Sharing the same or similar stem as well as the same correct response across formats seems to be crucial in investigating format effects because it avoids confounding format effects with other variables (Rodriguez.. This suggests that an opportunity to consult the text in open-ended formats may increase scores and reduce format differences.226 Focus on multiple-choice and open-ended formats a) Between-subjects or within-subjects designs (counterbalanced or non-counterbalanced): Studies with between-subjects designs tended to find multiple-choice formats easier than open-ended formats (e.e. whether learners were permitted to take notes while listening to the text and refer to them while answering questions was coded for ‘access to the text’. counterbalanced designs found the two formats to be of similar difficulty (Trujillo. Thus. access to the text). 2006). In listening.g. b) Random or non-random assignment: A random assignment of learners to different treatments can exclude factors that are irrelevant to treatments and enable the interpretation of the dif- ferences observed between treatment groups as effects of the treatments (Shadish et al. Maxwell and Delaney (1990) suggested that due to learning effects. when learners were not allowed to refer to the text (i..

Thus. h) Learners’ age (primary. n) Percentage correct in a multiple-choice format and (o) percent- age correct in an open-ended format: Although it appears that Downloaded from http://ltj. they performed better on multiple-choice formats when the questions were implicit. how- ever. Concerning the reliability of tests with a multiple-choice format. k) Reliability of tests with a multiple-choice format. high proficiency learners were hypothesized to be less affected by format differences than were low-proficiency learners. another possibility was that if the distractors did not function well. learners’ L2 proficiency level was defined as follows: A middle level of L2 proficiency referred to learners with three to four semesters of L2 study. secondary. reductions in the number of multiple-choice options were hypothesized to widen mean score differences between multiple-choice and open-ended formats. 1996). f) The number of options in a multiple-choice format: Since Rodriguez (2005) showed that reductions in the number of multiple-choice options tended to make multiple-choice formats slightly easier. Based on Shohamy (1984). thereby increasing the degree of format effects. However. 2009 . coded for the current meta-analysis. the reliability may be depressed. thus. Yo In’nami and Rie Koizumi 227 e) Text explicit or implicit questions: Kobayashi (2004) found that learners performed equally in both multiple-choice and open- ended formats when the questions were explicit in the text. and (m) reliability of scor- ing in open-ended formats: The high reliabilities of tests and scoring suggest that the tests consistently measured something relative to the errors. they were coded. g) Learners’ L2 proficiency level: The definition of this variable is often inconsistently reported and not easily comparable across studies. and thus. (l) reliability of tests with an open-ended format. (i) learners’ by Green Smith on April 9. which leads to the format effect. whereas a high level of L2 pro- ficiency referred to learners with five or more semesters of L2 study. was the L2 instruction time that learners had received. fairly relevant information often reported. Based on Norris and Ortega (2000).sagepub. and (j) learners’ L2: These are among the test taker characteris- tics whose plausible effects on test performance must be consid- ered (Bachman & Palmer. and the scores may also be increased. studies with high reliabilities of tests and scoring were predicted to more clearly show format effects than were studies with low reliabilities of tests and scoring. or adult).

it was estimated using Kuder-Richardson formula 21 (KR21).80. and can be divided into two types (Kline. This would result in a decrease in the degree of format effects. there are two types of effect size Downloaded from http://ltj. The effect size data associated with each variable of interest was also coded. and the kappa coefficient was 0. When necessary. Disagreement was discussed and resolved. a positive relationship might exist between the percentages of correct responses in multiple-choice formats and format effects. which uses the SD of only one group ( by Green Smith on April 9. they might produce higher scores. Among d type. if open-ended formats are designed to be easier. According to Morris and DeShon (2002).sagepub. The coding for all the variables was independently conducted by both authors based on a sample of 27% (n = 10) of the 37 studies. When the reliability of a test was not reported. The result would be a decrease in the format effects. The agreement percentage between the two authors was 90. compared with Cohen’s d. which indicates the proportion of variance explained by an effect of interest. they might produce lower scores. The former type of effect size was used because this study focused on the mean score difference between multiple-choice and open-ended tests. Coding for the remaining studies was conducted by the first author. 2 Meta-analysis a Effect size for individual studies: Effect size is an estimate of the magnitude of the observed effect or relationship.228 Focus on multiple-choice and open-ended formats no studies have investigated this issue. 2004): (a) d type. and (b) r type. further details were solicited from the authors of each study. This is because since open-ended formats usually appear to yield lower scores than do multiple- choice formats. The reason for this is that since multiple-choice formats usually appear to yield higher scores than do open-ended formats. On the other hand. Cooper & Hedges. 1994). 2009 .. if multiple-choice formats are designed to be more difficult. which would be similar to the scores in open- ended formats.g. which represents the standardized mean difference between groups. Hedges’ g was selected because it uses a pooled SD computed from two groups and tends to estimate the population SD accurately. a negative relationship could be predicted between the percentage correct in open-ended formats and format effects. which would be similar to the scores in multiple-choice formats.

The effect size gbetween-subjects was hereafter denoted simply as g. Then. where ρ is the population correlation between the repeated measures. the mean imputation method was considered to be more appropriate in the current study. However.74.21]). Since an aggregate of correlations across within-subjects studies is the best estimate of ρ (Morris & DeShon).93 [mean = 0. calculated above. therefore. along with a weighted mean of the SDs. The derived value was 0.34 to 0. When multiple effect sizes were available in a single study. the average cor- relation of 0. SDD SDD (the SD of the difference scores) = SD 2 group1 + SD 2 group 2 −(2)(r )(SDgroup1)(SDgroup 2 ).74 (Pearson’s correlations before Fisher’s Z transformation ranged from 0. using gbetween-subjects = gwithin-subjects 2(1 − ρ ).68. 2002). other alternatives to mean imputation such as Buck’s method and maximum likelihood methods (Pigott. 2009 . 1994).sagepub. The mean of correlations was calculated by Fisher’s Z transformation from six of the within-subjects studies included in the current meta-analysis. demanding that we impute values for the missing correlations. which was computed using a formula for a pooled Downloaded from http://ltj. sp (n1–1)s 2 + (n 2–1) s22 sp (pooled or combined. those from the within-subjects design were converted into the between-subjects metric (Morris & DeShon. SD = 0. n1 + n 2–2 Mean1 − Mean 2 gwithin-subjects = . One difficulty in calculating SDD was that the correlation between groups 1 and 2 was not always reported in all studies. was also used as ρ.com by Green Smith on April 9. Once the effect sizes from each study were computed. Yo In’nami and Rie Koizumi 229 g: g for a between-subjects design and g for a within-subjects design. and was used to replace a missing correlation. SD) = . mean imputation is considered to leave a centralizing artifact and result in the artificial homogeneity of effect sizes (Cooper & Hedges. a weighted mean of the means was calculated. They are defined as Mean1 − Mean 2 gbetween-subjects = . Nevertheless. the sampling variance of the effect size was computed according to the study design and effect size metric based on Morris and DeShon (2002). 1994) assume a large set of data sources.

The mixed effects model assumes that effect sizes vary across studies partly due to study character- istics (i. for the equations in the mixed effects model). moderator variables) and partly due to other randomly distributed unaccountable sources (Lipsey & Wilson. the current meta-analysis included 56 data sources from 37 studies. 15 were coded (see section II. Thus. 1994). In the end. see Lipsey & Wilson. Although such coding allowed multiple effect sizes from one study to be entered into the meta-analysis. b Effect size aggregation: The effect sizes from individual studies were combined to attain an aggregate effect size for each format effect using the mixed effects model. by Green Smith on April 9. The full reference of studies can be obtained from the authors. Analysis by another meta- analysis model.g. 2001). Among such variables. that is. described above (Lipsey & Wilson. Lipsey & Wilson. Moderator variable analysis was conducted by grouping studies according to the moderator variables or by performing a regression analysis.. 1994. This weighted-average procedure for means and SDs is widely used to reduce the bias caused by depend- ency between the effect sizes in one study (e.e. 2001). all of which reported format effects based on mean scores. These continuous variables were analyzed using a weighted single variable linear regression based on a method of moments (Raudenbush. it was conducted in order to obtain the most comprehensive picture of the variables to explain the format effects across studies. with the standardized mean score dif- ference (g) between the two formats as a dependent variable and each Downloaded from http://ltj. Shadish & Haddock. 1994. Moderator variables (a) to (j) were categorical. Hunter & Schmidt. information on moderator vari- ables could be lost. 1994..c) and each effect size was calculated along with an overall single effect size from each study. However. which suggested the necessity of using the mixed effects model. The model also assumes that each study in a meta-analysis is a randomly sampled observation drawn from a population of studies.1. this procedure might mix up some potentially important variables that the primary researcher independently investigated. therefore. whereas moderator variables (k) to (o) were continuous. demonstrated a wide variability among the effect sizes. and an artificial cat- egorization of the latter variables was not appropriate (e.g.230 Focus on multiple-choice and open-ended formats SD. 2001.. Cooper & Hedges. 2009 . 2004). the fixed effects model. the mixed effects model not only permits the drawing of inferences about studies that have already been conducted but also generalizes to the population of studies from which these studies were sam- pled (Raudenbush. 2001). each study contributed a single effect size.

SD = 19. open-ended). In addition.84. If one confidence interval does not overlap with another. means that there was a signifi- cant difference between the mean scores of the two formats. although Cohen warned against its use because of its arbitrariness and recommended a context-specific interpretation of magnitude differences.50. the results from the regression analysis were considered tentative. The calculated effect size g was interpreted as follows.SPS) written by Wilson (2001). Since moderator variables were often highly correlated (VIFs [variance inflation factor] ranged from by Green Smith on April 9. and a weighted single variable linear regression was conducted using a SPSS macro (METAREG. effect size aggregation was conducted using Comprehensive Meta-Analysis (Borenstein et al. A result with a small confidence interval is more trust- worthy than a result with a large confidence interval.20| as small. the mean of one format is half a pooled SD higher than the mean of another format. g and its confidence interval were interpreted based on Cohen’s (1988) guideline of |0. in which the regression assumptions were difficult to test. the mean of one format (e.50| as medium.50.75]). Second.g. First. assume a test with an SD of 10 and two groups of learners. this suggests a statistical difference between the observed effect and null hypothesis of no effect.00 [mean = 21. which. |0. a positive effect size with a 95% confidence interval not encompassing zero indicated that a multiple-choice format was easier than an open- ended format. especially in L2 lis- tening (k = 4 to 5). ranging from Downloaded from http://ltj. To simplify.80| as large effects. in the current study. if g = 0.. If g = 0.g. III Results and discussion 1 Format effects in L1 reading The results are shown in Table 2. If a confi- dence interval does not include zero. multiple-choice) is five points higher than that of another format (e.. A 95% confidence interval around the mean effect size was calculated. The overall effect size was positive with a confidence interval greater than zero. this suggests a statistical difference between the two observed effects. In the current study. 2005). since the number of studies was small (k = 4 to 21). and |0. Yo In’nami and Rie Koizumi 231 of the five moderator variables ([k] to [o]) as a separate independent variable. The effect size for individual studies was calcu- lated using Excel.04 to 50. multiple regression could not be conducted. 2009 .

83) Not counterbalanced 926 926 5 1.23. 1.29) Notes: n1 and n2 = sample sizes.44 (0. Furthermore.62 (–0.sagepub. 0. Downloaded from http://ltj.58. k = the number of data sources.08.65 (0.66 (0. 1.90.22) Counterbalanced 670 670 6 0.87.33. 1. 2.25.79 (0.11) Random assignment 2513 2509 11 0.20.232 Focus on multiple-choice and open-ended formats Table 2 Meta-analysis of multiple-choice and open-ended format effects in L1 reading Variable n1 n2 k g (95% CI) Overalla 18581 3661 22 0. 0.06]).11) Learners’ L1 (Hebrew) 37 25 1 0. respectively). 2. 1.45) Number of multiple-choice options (5) 269 269 3 0.12) Multiple-choice→Open-ended 527 527 3 1. 1.17 (–0.10 (0.07) Within-subjects 1786 1786 14 0.04) No access to the text when answering 15195 323 4 0. and 0.06 (g = 0.76 (0.10 (– (0. 0.10. 1.85.10 (0. 1. 1. 0.57) Number of multiple-choice options (4) 2212 2208 8 0. aThe italicized results for the moderator variables are based on one data source and are not interpreted.00) Text explicit question 121 121 2 –0.06 (–0. 0. by Green Smith on April 9. 1.53 (0.55 (0.85 (0.30 (–0. This suggested that multiple- choice formats were easier than open-ended formats and that the degree of difference ranged from small to large.24.73. 1.46. 1.03 (–0.13. 0.24.58) Learners’ L1 (Swedish) 14962 90 1 1.32. within-subjects designs (g = 0.14 (–0.76) Not random assignment 16068 1152 11 0. 2. 1.20.60) Number of multiple-choice options (3) 176 144 2 0. 0.76]).13) Text implicit question 121 121 2 0. 1.79) Learners’ L1 (English) 3308 3230 19 0.86) Open-ended→Multiple-choice 399 399 2 1.12.84) Non-stem equivalency 556 556 4 1.72 (0. 0.10. 1.49 (0.24 to] for both counterbalanced and non-coun- terbalanced).40.22.56. 0.91 (0. 0.45) Stem equivalency 18025 3105 18 0.64 [0. 0.43) Learners’ L1 (Dutch) 274 316 1 0.15 (–0.65 [0. 2009 .44 [0. and random assignment (g = 0. 0.65 (0. 0.07]).66 [0.64 (0.09.60 ( Between-subjects 16795 1875 8 0. the results of the single variable linear regression in Table 3 indicated that the moderator variables did not sig- nificantly affect multiple-choice and open-ended format effects (p = 0.26.26) Learners’ age (adult) 1293 1249 11 0.20.06. 2.73) Learners’ age (primary) 352 352 5 0.51) Access to the text when answering 3619 3571 21 0.77) Learners’ age (secondary) 16676 1800 4 0. CI = confidence interval. Format effects were particularly observed across conditions such as between- subjects designs (g = 0. 0.10.

23 . B = an unstandardized regression coefficient.10 1.13 (Multiple-choice) 10 Constant 1.52 .sagepub.68 (Open-ended) 4 Constant –17.53 –0.83 .20 .14 0.44 7.27 1.64 .48 .41 .08 (Open-ended) 21 Constant 0.10 0.36 .15 .00 (Open-ended) 12 Constant –4.04 .46 –1.93 .56 0.73 1.55 0.00 Scoring reliability 5.03 4.19 (Open-ended) 5 Constant 10.00 (Multiple-choice) 21 Constant 1.19 .10 –.43 .54 .14 0.05 . 2009 .04 .49 .50 2.74 .81 .00 Percentage correct 0.17 –1.00 Test reliability –5.53 –1.16 1.18 .00 Percentage correct 3.00 Scoring reliability –11.10 .02 .00 Percentage correct – by Green Smith on April 9.62 .52 .90 .03 .68 0.27 –1.01 .61 8.59 1.00 Scoring reliability 19.67 .98 –1.04 (Open-ended) Note: k = the number of data sources.98 .00 Test reliability –0.90 .21 2.81 0.01 –0.85 .27 –2.43 0.27 (Open-ended) L2 listening 5 Constant 3.00 Test reliability –1.33 3.00 .16 1.00 Test reliability –1.26 .83 –1.56 (Open-ended) 5 Constant –1.70 .06 12.91 .21 –.06 –.41 .46 .16 . β = a standardized regression coefficient.00 Percentage correct –2.44 1.36 1.32 .00 .00 Test reliability 1.27 .75 .86 0.05 1.26 0.00 .86 1.00 Percentage correct 1.56 1.28 .66 1.12 –4.81 .45 6.02 0.82 9.01 –.79 –0.14 .29 –1.71 0.71 0.01 0.07 2.06 .42 .27 (Open-ended) 10 Constant –0.53 .41 .06 .50 . Yo In’nami and Rie Koizumi 233 Table 3 Weighted single variable linear regression meta-analysis of multiple- choice and open-ended format effects Domain k Variable B Standard Z p β R2 error of B L1 reading 19 Constant –0.32 .87 –.10 9.25 (Open-ended) L2 reading 8 Constant 1.00 Percentage correct 4.56 0.05 .87 (Multiple-choice) 5 Constant 4.79 1.07 .86 .70 2.17 (Multiple-choice) 19 Constant 0.04 0.84 0.17 .39 6.24 –.44 .68 .42 0.39 1.82 (Multiple-choice) 5 Constant 0.45 1.89 .00 –.00 Test reliability –6. Downloaded from http://ltj.17 (Multiple-choice) 8 Constant 1.12 .49 .08 .18 –.38 –0.81 .99 .19 .70 .47 11.

23 (–1.208) Counterbalanced 126 126 1 0. 0.23.30 (0.sagepub.65.78.15 (–0.78) Number of multiple-choice options (3) – – – – Number of multiple-choice options (4) 1148 1036 9 0. 2. 0.72 (0.19) Not counterbalanced 329 329 2 0. Downloaded from http://ltj. 1. 1. This indicated that the mean scores in multiple-choice and Table 4 Meta-analysis of multiple-choice and open-ended format effects in L2 reading Variable n1 n2 k g (95% CI) Overalla 1217 1093 11 0. ranging from –0.88) Non-stem equivalency 153 153 2 –1.49 (0.88) Number of multiple-choice options (5) – – – – Learners’ L2 proficiency level (mid) 56 56 2 0. 1.27 (–0.59.05 (–1. 1.94) Learners’ age (primary) – – – – Learners’ age (secondary) 449 337 2 1.09.05 (–1.33 (–0. 1. –1.03 (–0. 1.02 (–1.30 (0. the overall effect of multiple-choice and open-ended formats was not observed because its confidence interval contained zero.64) Learners’ L2 proficiency level (high) 94 85 2 1.03 (–1.42.234 Focus on multiple-choice and open-ended formats 2 Format effects in L2 reading As shown in Table 4.81) No access to the text when answering 341 341 2 0.70 (0. 0.21 (–0.15 (– (0.92) Learners’ L1 (Spanish) 126 126 1 0.20) Not random assignment 519 507 5 –0. 0.59 (1.04. –0.38 (–0.22 [–0.66 (g = 0.39) Text implicit question 32 32 1 0. 2009 .81) Learners’ L1 (English) 80 80 2 0. 0. 0.46. 0.19 (– Learners’ L2 (Spanish) 80 80 2 0. 0. 1. 0.22 (–0.47) Learners’ L1 (Chinese) 32 32 1]).com by Green Smith on April 9.69) Note: a See Table 2 note.28 (–0.60) Learners’ L1 (Japanese) 70 61 1 1.05 (–0.23 to (–0.10) Learners’ age (adult) 768 756 9 0.04) Random assignment 698 586 6 0. 0.04) Multiple-choice→Open-ended – – – – Open-ended→Multiple-choice 329 329 2 0.45) Text explicit question 32 32 1 1. 2. 1. 1.78.91) Learners’ L2 (Japanese) 32 32 1 1. 0.57.87 (1.09) Learners’ L2 (English) 776 652 6 0.21) Stem equivalency 1064 940 9 0.66) Between-subjects 609 485 6 0. 1.05 (–0.03 (–0.89. 1.19) Learners’ L1 (Taiwanese) 121 121 1 –1.58) Access to the text when answering 876 752 9 0.87. 0. 1.69) Learners’ L1 (Hebrew) 459 344 3 0. (– Within-subjects 608 608 5 –

02 [–1. which indicated that they Downloaded from http://ltj.1.22]).58]). random assignment (g = 0.01. 0. 0. This is interpreted in section III. for most moderator variables.10. 1. and percentage cor- rect in multiple-choice formats) in the regression each significantly affected the effect sizes (p = 0.c. a negative value of effect size with the confidence interval not including zero was observed for non-stem equivalency (g = –1. test reliabili- ties of multiple-choice and open-ended formats.00.66 (g = 1. 4 Analysis of moderator variables In section II. there were two moderator variables with confidence intervals that did not overlap.66]). However. and 0.72 [0.18.57. which suggests that multiple-choice formats were easier than open-ended formats in the following con- ditions: between-subjects designs (g = 0. the confidence inter- vals for effect sizes overlapped in L1 reading for between-subjects designs (g = 0.57 to 1.11 [0. Table 3 illustrates that three moderator variables (i. This suggested that multiple-choice formats were eas- ier than open-ended formats and that the degree of difference ranged from medium to large.88]).27. our predictions were not supported. overall. 1. format effects were found across conditions such as between-subjects designs (g = 0. In contrast. respectively).99 [0.57. the overall effect size was positive with a confi- dence interval greater than zero. thus. positive values of effect sizes with confidence intervals not including zero were observed. 0.24. However. For example.11 [0. 15 moderator variables were listed with possible predictions of format effects.62]) and random assignment (g = 1. 1.4. and learners’ L2 proficiency level being high (g = 1.94]).207. as shown in Tables 2 to 5.64 [0. as shown in Table 3 (p = 0. none of them was found to significantly affect the effect sizes.36. stem equivalency (g = 0.66]). In addition. 2009 . 0. In particular.21. 1.10.20]). However.10. 0.65. –0.66 [0.29 [0.07]) and within-subjects designs (g = 0. 1.46. and 0.. the confidence intervals for effect sizes within each moderator variable overlapped.00. 1.e. which suggests that multiple-choice for- mats were more difficult than open-ended formats.70 [0. Among the five moderator variables investigated in the regression.sagepub. ranging from 0.49 [0. 1. 1.20. Yo In’nami and Rie Koizumi 235 open-ended formats were not statistically different across by Green Smith on April 9. respectively). 3 Format effects in L2 listening As seen in Table 5.20]).23.

which was in line with the prediction.66) Learners’ L1 (English) 57 53 1 0.236 Focus on multiple-choice and open-ended formats Table 5 Meta-analysis of multiple-choice and open-ended format effects in L2 listening Variable n1 n2 k g (95% CI) Overalla 321 324 5 1.21]). this indicated that studies with random assignment more clearly showed format effects than did those with non-random assignment. 2009 .50.62) Within-subjects 45 45 1 1. 2.10.89 (0.50.66) Non-stem equivalency – – – – Access to the text when answering 264 271 4 by Green Smith on April Between-subjects 276 279 4 0.72 [0.89 (0. 1.92) Learners’ L1 (Taiwanese) 45 45 1 1. 1. 1.17 (0.17 (0.88]) and non-stem equivalency Downloaded from http://ltj.36. 1.99 (0. 1. 1.50.70) Learners’ age (primary) – – – – Learners’ age (secondary) – – – – Learners’ age (adult) 321 324 5 1.11 (0. there was no overlap between studies with random assign- ment (g = 0.02 (– Learners’ L2 proficiency level (high) 275 281 5 1.58 (1.sagepub.20]) and non-random assignment (g = –0. 1. the confidence intervals for effect sizes did not overlap between studies with stem equivalency (g = 0.15.11 (0.89 (0. 1.30 1.30 1.84) Number of multiple-choice options (5) – – – – Learners’ L2 proficiency level (mid) 18 17 1 0.85) Learners’ L2 (Spanish) 57 53 1 0. again in L2 reading. 1. Second.50.50. were crucial moderator variables for test format effects. 0. 1.80.89 (0.10 (0.28) Learners’ L1 (Japanese) 219 226 3 1.11 (0. 1. 0.89 (0.49 [0. Since the confidence intervals for effect sizes for studies with non-random assignment included zero.66) Not random assignment – – – – Stem equivalency 321 324 5 1. in L2 reading.11 (0. First. 1.28) Number of multiple-choice options (4) 63 70 2 1.86) Learners’ L2 (English) 264 271 4 No access to the text when answering 57 53 1 0.04 (0.86) Counterbalanced 45 45 1 1. 1.28) Note: a See Table 2 note.58 (1. 1.86) Not counterbalanced – – – – Multiple-choice→Open-ended – – – – Open-ended→Multiple-choice – – – – Random assignment 321 324 5 1.28) Text explicit question – – – – Text implicit question – – – – Number of multiple-choice options (3) 57 53 1 0.58 (1. 1.38 [–0.30 1.98.

90. and were also predicted to have negative regression coefficients. all in L2 listening. when the variables were compared across the three domains of L1 reading. this requires further testing with larger data sets. our hypothesis that a positive relationship would exist between the percentage cor- rect in multiple-choice formats and format effects was supported (p = 0. β = 0.83. by Green Smith on April 9. β = –0. The results of test reliabilities of multiple-choice formats supported the latter prediction. However. β = –0.87) and open-ended for- mats (p = 0.68). Regarding the relation- ship between percentage correct and format effects. there were two contradictory predictions.46. whereas there was only one prediction for the test reliabilities of open-ended formats.01. Although most of the rela- tionships were not significant. The results showed negative standardized regression coefficients for multiple- choice formats (p = 0.82). This did not support the prediction that studies with stem equivalency were likely to more clearly show format effects than were studies with non-stem equivalency.00. which was contrary to the results. and L2 listening.02 [–1. L2 reading. 2009 . Studies with high test reliabilities were predicted to more clearly show format effects than were studies with low test reliabilities. Yo In’nami and Rie Koizumi 237 (g = –1.93.58]). In addition to the inspection of confidence intervals for effect sizes within each moderator variable. R2 = 0. Downloaded from http://ltj. 5 Comparison of moderator variables across domains Thus far. which suggests that studies with stem equivalency indicated format effects of multiple-choice formats being easier than open-ended formats. On the other hand.sagepub. These results suggested that the mod- erator variables influenced the test format effects in both predictable and unpredictable ways. some moderator variables were found to have relation- ships with multiple-choice and open-ended format effects in a rather complicated way. single variable linear regres- sion analysis was used to examine the relationships between test format effects and moderator variables. –0. and were also predicted to have positive regression coefficients. R2 = 0. three were found to be significant: the reliabilities of multiple-choice and open-ended test formats and percentage correct in multiple-choice formats. whereas studies with non-stem equivalency showed format effects in the opposite direction. both of which did not include zero. studies with low test reliabilities of multiple-choice for- mats were predicted to more clearly show format effects. Concerning reliability. R2 = 0.

on the whole. although multiple-choice formats are found to be easier than open-ended formats when either one of the four moderator variables (i.66]).88]. L2 reading g = 0. This suggested the importance of these variables in the investigation of multiple-choice and open-ended format effects and the generalizability of the functions of these variables across L1 reading. In contrast.07].57.49 [0.55 [0.36. between-subjects designs.20.23. L2 reading g = 0.13. and stem equivalency (L1 reading g = 0. the current study examined the relative difficulty of multiple-choice and open-ended formats and the variables related to this difficulty. Their confidence intervals for effect sizes across the three domains did not include zero.66]). and L2 listening. stem equivalency.238 Focus on multiple-choice and open-ended formats three variables were consistently and significantly related to those format effects and could be considered especially important.20]). 6 Summary and comparison of narratives with meta-analysis synthesis This study used meta-analysis to quantitatively synthesize the effects of two test formats (multiple-choice and open-ended) on test performance in L1 reading. 1. L2 reading.207.20].. and L2 listening g = 1. L2 reading.26. 1. More specifically. and L2 listening g = 0. The degree of format effect difference ranges from small to large in L1 reading and medium to large in L2 listening.99 [0. the effects of multiple-choice and open-ended formats in L2 reading are not observed in the current meta-analysis. 1. multiple-choice formats were likely to be easier than open- ended formats. and L2 listening g = 1.64 [0.70 [ by Green Smith on April 9.62]).57. They were between-subjects designs (L1 reading g = 0. or learners’ L2 proficiency level being high) are included in the Downloaded from http://ltj.44 [0. The results indicate that multiple-choice formats are easier than open-ended formats in L1 reading and L2 listening.11 [0. 0.e. L2 reading g = 0.84]. One reason for which these three variables were related to multiple-choice and open-ended format effects would be that between-subjects designs exclude the learners’ carry-over effects by requiring them to take either multiple- choice or open-ended formats. 0. random assignment (L1 reading g = 0.11 [0. 0. 2009 . random assignment. and L2 listening. and that the random assignment of learners to multiple-choice or open-ended conditions and stem equivalent tests help eliminate irrelevant factors and enable a direct comparison of test formats. 1.76]. 1.10.72 [0. This suggested that in studies with these three vari- ables. 1.sagepub.

6 points. based on the finding that in L1 reading and L2 listening. the probability that the multiple-choice format is easier than the open-ended format is approximately 68% in L1 reading and 78% in L2 listening. multiple-choice for- mats were easier than open-ended formats. In L2 listening.6 points. Using Dunlap’s (1999) conversion program. Overall. and an 11. and L2 listening. However. in percentile terms. effect sizes can be converted into probabilities based on normal curve z probability values.4 and 10. These differences may be of little importance in a low-stakes test but of Downloaded from http://ltj. For example.7 and 16.5 point difference in L1 reading with a 95% confidence interval of 2. the degree of format effects from the narrative review was unknown. multiple-choice formats are easier than open-ended formats. or stem-equivalent by Green Smith on April 9. Yo In’nami and Rie Koizumi 239 analysis. a test developer might prefer using the former if the intention is to make the test easier.sagepub. Furthermore. the inconsistent results in L1 and L2 reading from the narrative approach to research synthesis are clarified in the meta-analysis. multiple-choice formats are shown to be easier than open-ended formats to a medium to large degree. in a test with an SD of 10. The option of multiple-choice or open-ended formats would produce a small to large degree of score difference in L1 reading and medium to large degree of score difference in L2 listening.1 point difference with a 95% confidence interval of 5. but it is found to be from small to large in L1 reading and medium to large in L2 listening. First. Format effects favoring multiple-choice formats across domains are consistently observed when the studies use between- subjects designs. the results of the meta-analysis indicate that in L1 reading. IV Implications Two implications are discussed. when the findings from the narrative and meta-analytic reviews are compared.1) summarized that in most studies across L1 reading. It should be remembered that the narrative review (see section I. multiple-choice formats yield higher scores than do open- ended formats – with the degree of difference ranging from small to large – whereas no significant difference is found in L2 reading. L2 reading. 2009 . Thus. random assignment. these would be a 6. some studies in L1 and L2 reading showed no significant difference between the two formats. Of interest is a comparison between the findings in the meta-analysis and narrative review.

com by Green Smith on April 9. These two tests have been widely examined in language testing (e. such as cloze and c-tests. these results provide evidence that quantitative data synthesis plays an important role in explaining inconsistencies in test format effects. V Suggestions for future research There are two future areas of research that may provide more insight into the effects of format on test performance.240 Focus on multiple-choice and open-ended formats great significance in a high-stakes test. First. although meta- analysis can be performed with at least two studies. 1993. another equally important and promising research agenda would be conducting meta-analysis on the effects of other test formats. Thus..g.. Acknowledgements We would like to thank Akihiko Mochizuki. the tentative findings from the regression analysis must be replicated. random assignment. L2 reading. which also showed the prevalent effects of three moderator variables (between- subjects designs. the inconsistent relative difficulty of multiple-choice and open-ended formats in L1 and L2 reading found in the narrative review was clarified in the current meta-analysis. and stem-equivalent items) in investigating multiple-choice and open-ended format effects across L1 reading. in which only a few studies were currently available for inclu- sion in the current meta-analysis. No. Brown. and two anonymous reviewers for their valuable comments on earlier versions of this paper. Takayuki Nakanishi. especially if an examinee’s score is located around the cut-off point. 2006). larger aggrega- tions of studies are desirable for obtaining more precise information on format effects.sagepub. and other moderator variables that were not included in the current meta-analysis (e. Grotjahn. This would be especially significant in L2 listen- ing. and L2 listening. Miyoko Kobayashi. This research was supported by Educational Testing Service (TOEFL Small Grants for Doctoral Research in Second or Foreign Language Assessment) and Japan Society for the Promotion of Science (Grants-in-Aid for Scientific Research. 2009 . and a rich volume of literature would be ideal for quantitative synthesis. Downloaded from http://ltj. Second. 06J03782). passage length or learners’ background knowledge) must be inves- tigated. Second. Furthermore.g.

. R. 93–116. Fundamental considerations in language testing. B. (1992). by Green Smith on April 9. D. R.). Assessing listening. A. D. & Palmer. (2005). NJ: Erlbaum. 37–53. W. W. E.). J. 985–1008. & Wall. G. Construction versus choice in cognitive measurement: Issues in constructed response. Effects of reader’s knowledge. Brown. (1997). Brennan. M. Personnel Psychology. and portfolio assessment. J. Hedges. (2005). J. C. Reading to young children in educational settings: A meta- analysis of recent research. 2009 . and target lan- guage experience on foreign language learners’ performance on listening comprehension tests. C. The effects of text type. D. (1990). Oxford. & Barrett. J. F. Comprehensive meta-analysis (Version 2.. UK: Oxford University Press. Brantmeier.. Cambridge. 9236396) Blok. Bachman.. Higgins. Principles of language learning and teaching (5th ed. Cambridge. (2001). F.2. Netherlands: Kluwer. Educational measurement (4th ed. NJ: Biostat. CT: Praeger. (1999). & Urbina. Yo In’nami and Rie Koizumi 241 VI References Alderson. (Eds. A. White Plains. Berne. 89. Campbell. (1993).. R. C. (1999). Brown.. NY: Pearson. E. Volume 7: Language testing and assessment. assessment task. Language testing in practice. Dordrecht. & Ward. UK: Oxford University Press. 343–371. 9938651) Clapham C. Hillsdale. J. 55. (2006). Upper Saddle River. D. (UMI No. Cambridge. C. J. Buck. C. H..). Arthur.. H. Downloaded from http://ltj. What are the characteristics of natural cloze tests? Language Testing. G. Englewood Cliffs. Psychological testing (7th ed. Brown. UK: Cambridge University Press. Clapham. L. Testing in language programs: A comprehensive guide to English language assessment. (1996). L. Bachman. D. text type. J. Anastasi. Assessing reading. Oxford. S.). & Rothstein. Encyclopedia of language and education. Bennett. (1995). (Eds. (1993). Language test construction and evaluation.sagepub. (UMI No. NJ: Prentice Hall.). (1995).. V. & Corson. Cognitive processes elicited by multiple-choice and constructed-response questions on an assessment of reading comprehen- sion. S. Jr. (2006). Language Learning.023) [Computer software]. UK: Cambridge University Press. Modern Language Journal. Edwards. 49. H.. Borenstein. UK: Cambridge University Press. Alderson. and test type on L1 and L2 reading comprehension in Spanish. Multiple-choice and constructed response tests of ability: Race-based subgroup perform- ance differences on alternative paper-and-pencil test formats.. L. performance testing. New York: McGraw Hill. Westport. 10. (2005). (2000). (2002). L.

D. T. M. Cohen. Teaching and researching reading. G. (Eds. NJ: Erlbaum. S. B. 52. (2003). (1988). Cooper. New York: Russell Sage Foundation. D. ED 412746) Ellis.. McGaw. (1997. New York: Cambridge University Press. Elder. S. A. & Elder. May). L. Elinor. H. P. (1987). (ERIC Document Reproduction Service No. Davies... M.. Brown. Essentials of psychological testing (5th ed. R. Fulcher. UK: Oxford University Press. (1981). Malden. Mahwah.). Grabe.. & Smith. S. Instruments. The study of second language acquisition. & Haladyna. Downloaded from http://ltj. MA: Blackwell.. Hill. UK: Cambridge University Press. W. H. G. (2006). (2002). V.. R. A.. 261–283. J.. Second language listening: Theory and practice. The handbook of research synthe- sis. (1998).. Glass. Davies. (1984). 19.242 Focus on multiple-choice and open-ended formats Cohen. et al. Dunlap.). (2007). Ramat- Gan.. 199–206. Cordray. (Eds. M. J. C. Frankfurt am Main. Boston. Davey. Handbook of test develop- ment. Doughty. T. W. Oxford. B. D. L. L. Journal of Reading Behavior. L. NJ: Erlbaum. Meta-analysis in social research. The C-test: Theory. Cohen.). MA: Blackwell. (2006). (1999). (2004). New York: Russell Sage Foundation. H.sagepub. & LaSasso. J.. A. F. 706–709. Behavior Research Methods. Reading native and foreign language texts and tests: The case of Arabic and Hebrew native speakers reading L1 and English FL texts and tests. (1999).. A. Cambridge. A. Flowerdew.). & Davidson. Grotjahn.. Paper presented at the Language Testing Symposium. M.. Language testing and assessment: An advanced resource book. Assessing language ability in the classroom (2nd ed. 2009 by Green Smith on April 9. Strategies in learning and using a second language. (Eds. UK: Longman. T. applica- tions. Harlow.. L. Germany: Peter Lang.. & Computers. Harlow. The handbook of applied linguistics. Israel. T. (1990). A program to compute McGraw and Wong’s common language effect size indicator. (1994). Meta-analysis for explanation: A casebook. (Eds. CA: Sage.). R. Essex. H. B. J. UK: Pearson. Cooper.. Beverly Hills. K. & McNamara. (2005). Malden. & Hedges. Cook.. Davey.. 31. Hillsdale. & Stoller. C. Postpassage questions: Task and reader effects on com- prehension and metacomprehension processes. Lumley. (1994).). L. Light. (1992). & Miller. V. Hedges. empirical research. (1994). The interaction of reader and task factors in the assessment of reading comprehension. & Long. Journal of Experimental Education. V. D. Statistical power analysis for the behavioral sciences (2nd ed). C. New York: HarperCollins. The handbook of second lan- guage acquisition. Hartmann. Dictionary of language testing.). J. MA: Heinle & Heinle. Cronbach. New York: Routledge. F. C. Downing. (Ed.

Kobayashi. T. Ghatala. In H. J. University of Tsukuba. (1999). from http://www. Thousand Oaks. Kaplan. Downloaded from http://ltj. Beyond significance testing: Reforming data analysis meth- ods in behavioral research. (1990). J. NJ: Erlbaum. (1984).. The Oxford handbook of applied linguistics.htm In’nami.). Thousand Oaks. W. M. & Wilson.. The handbook of research synthesis (pp.). (2006). Norris. 25. Pigott.). DC: American Psychological Association. M. Effectiveness of L2 instruction: A research syn- thesis and quantitative meta-analysis. Methods for handling missing data in research synthesis. by Green Smith on April 9. & Delaney. Dokkai test kaito hoho ga jukensha no test tokuten ni ataeru eikyo: Chugokugo bogo washa no baai [The effects of test methods on reading test scores of Chinese students learning Japanese as a foreign language]. (1990). D.. & Ortega. S. Ochanomizu University. CA: Wadsworth. 417–528. E. (2000). L. K. (Ed. Reading Research Quarterly. CA: Sage. E. Language Learning. Language Testing. & Pirie. Designing experiments and analyz- ing data: A model comparison perspective. Kintsch. Maxwell. Synthesizing research on language learning and teaching.. Sometimes adults miss the main ideas and do not realize it: Confidence in responses to short-answer and multiple-choice comprehension questions. Woloshyn... Cooper & L. Japan. & Pillemer. 19. Combining effect size estimates in meta-analysis with repeated measures and independent-group T. 2006. (2001). Unpublished master’s thesis. Washington. Methods of meta-analysis: Correcting error and bias in research findings (2nd ed. (2004). (Ed. B.). D. & DeShon. D. Yo In’nami and Rie Koizumi 243 Haladyna. Method effects on reading comprehension test per- formance: Text organization and response format. 193–220. Lipsey. R. E. CA: Sage. M.. Norris. Pressley. 105–125.). Handbook of research in second language teaching and learning. The effects of task types on listening test performance: A quantitative and qualitative study. J. B.. 232–249. E. UK: Cambridge University Press. Unpublished doctoral dissertation. Practical meta-analysis. Summing up: The science of reviewing research. Comprehension: A paradigm for cognition. M. New York: Oxford University Press.. F. V. (2002). D. Hunter. H.iltaonline. NJ: Erlbaum. Light. MA: Harvard University Press. R. Morris. Cambridge. R. Japan.). (2004). L.. 163–175). ILTA Language Testing Bibliography. Developing and validating multiple-choice test items (3rd ed. Psychological Methods. B. (2002). S. & Schmidt. L. (2006). 50. (1994). Retrieved January 2. (2002). Amsterdam: John Benjamins. (2005). Hinkel. P. 2009 . M. M. W.sagepub. & Ortega. (2004). Hedges (Eds. Belmont. B. Mahwah. Cambridge. (1998). (Eds. Kobayashi. Kline. V. S. Mahwah. New York: Russell Sage Foundation. R. J. 7. (2004).

. Ross. C. (1994). language of assessment. (2002). (1998). On the equivalence of the traits assessed by multiple- choice and constructed-response tests. In H. Hillsdale.. Does the testing method make a difference? The case of reading comprehension. Paper presented at the American Association for Applied Linguistics. & by Green Smith on April 9. H. S. R. 9124507) Downloaded from http://ltj. W.-C. 29–44). (1994). Reading in a second language: Process. C. Language Testing. M. UK: Palgrave Macmillan. (2006).. D. New York: Russell Sage Foundation. Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. E. Harlow. (1999. 24(2). Wilson. Trujillo. 2005. Hampshire. Rodriguez. Teng. (1998).). W. 3–13.). New York: Russell Sage Foundation. Essex. J. C. (2005). (2005). Harlow. from http://mason. & Campbell. (2002). & Haddock. and portfolio assessment (pp. V.). 261–281).. T. (2002). and target lan- guage experience on foreign language learners performance on reading comprehension tests. F. Cooper & L. Hedges (Eds. UK: Pearson Education. UK: Longman. Construction versus choice in cognitive measurement: Issues in constructed response. London: Longman. Random effects models. E. Wolf. 163–184. J. Language testing and validation: An evidence-based approach. Educational Measurement: Issues and Practice. K.244 Focus on multiple-choice and open-ended formats Raudenbush. Shadish. The effects of task. product and practice. (2001). The handbook of research synthesis (pp. S. (UMI No. (UMI No. Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. In R. C. C. M. Shadish.gmu. B. Cook. R. Hedges (Eds. R. C. Construct equivalence of multiple-choice and con- structed-response items: A random effects synthesis of correlations.. M. METAREG. Shohamy. T. H. Ward (Eds. Basingstoke. D. Language Testing. Weir. Journal of Educational Measurement. D. 40. 301–321). 2009 . Longman dictionary of language teach- ing & applied linguistics (3rd ed. In H. W. 147– C. (1991). 15. R. Rost. ED 432920) Traub. Teaching and researching listening. The effect of format and language on the observed scores of secondary-English speakers. Combining estimates of effect size. A. V. (2003). (1993). Bennett & W. J. MA: Houghton Mifflin. Retrieved February 13.). D. E. (1984). & Weir. Richards. Boston. 1–20. 3198256) Urquhart.sagepub. March). Experimental and quasi-experimental designs for generalized causal inference. NJ: Erlbaum. (ERIC Document Reproduction Service No. 1. performance testing. The effects of question type and preview on EFL listening assessment. Cooper & L.SPS [SPSS macro]. The handbook of research synthesis (pp. Rodriguez.

26.nav Permissions: http://www.1177/0265532208101007 The online version of this article can be found at: The development and validation of a Korean C-Test using Rasch Analysis Sunyoung Lee-Ellis Language Testing 2009.sagepub. Language Testing http://ltj.sagepub.sagepub.nav Citations Published by: Subscriptions: Reprints: 2009 .com by Green Smith on April Downloaded from Additional services and information for Language Testing can be found at: Email Alerts: http://ltj. 245 DOI: 10.

or unavailable to the public. OPI) are often costly. and Rasch measurement statistics such as separation reliability. it has been necessary to measure learner proficiency independently and regularly because it is often considered a moderator variable.sagepub. Reprints and Permissions: http://www. L2 researchers of less commonly taught languages rarely have such a tool. Rasch analysis (Bond & Fox. University of Maryland. this study attempted to develop and validate a 30-minute C-Test (Klein-Braley. 2009. For example. 3215. 2005) were administered to 37 learners of Korean. and Montrul (2005) used an independent proficiency test. and Interagency Language Roundtable (ILR) skill-level descriptions were utilized in passage selection in order to test a wide range of participant proficiency levels.. email: sunyoung@umd. and model fit statistics were used to suggest further improvement to the C-Test. Korean C-Test. and the results reveal the potential of this C-Test as a quick proficiency indicator. Rasch. Hirakawa (1989) used insti- tutional by Green Smith on April 9. 1981). The developed test demonstrated excellent reliability and validity indices. This Korean C-Test was developed with the specifics of Korean language structure in mind.sagepub. In cross-sectional studies examining developmental patterns. Existing proficiency measures (e. 2007) was used to examine the reliability and concurrent validity of the measure. Jimenez Hall. (1988) used TOEFL scores to examine Address for correspondence: Sunyoung Lee-Ellis. College Park. second language testing In second language (L2) acquisition research. Bardovi-Harlig (1992) used a cloze test. DLPT. Keywords: Korean proficiency test. MD. time-consuming. With the intent to provide a practical and reliable measure of Korean L2 proficiency. USA.1177/0265532208101007 Downloaded from http://ltj. which changes over time.nav DOI:10. difficulty measures. C-Test. 2009 . USA Despite the importance of having a reliable and valid measure of Second Language (L2) proficiency. less commonly taught languages. 20740.g. while Bley-Vroman et al. – Language Testing 2009 26 (2) 245 274 The development and validation of a Korean C-Test using Rasch Analysis Sunyoung Lee-Ellis University of proficiency information is often used to group participants by level. The resulting test and a self-assessment questionnaire ( © The Author(s). Department of Second Language Acquisition.

.g. Second Language Research. Many studies on instructional SLA fall into this category (see Norris & Ortega. child L2 vs.. 2002). the measurement of proficiency is often necessary and plays an important role in the interpretation of results.1%) (e.. 2007).g. Alternatively. O’Grady et al. Kim et al.. This approach has often been used in studies examining different conditions of L2 acquisition such as L1 influ- ence (e.0%). 2005) by Green Smith on April 9. followed by standardized test scores (22. KSL researchers use such coarse measures of proficiency because unlike more commonly researched languages like English.g. use institutional status as their basis (e. and researchers’ impressionistic judgments are even more unreliable. Language Learning. Studies in Second Language Acquisition). the number of semester in college foreign language program). The problem is that neither institutional status nor impressionistic judgment is necessarily reli- able or valid. Whong-Barr & Schwartz. researchers may also attempt to eliminate the problem of having heterogeneous proficiency levels by statisti- cally controlling it. 2004). Therefore....0%).g.. the proficiency measures used in the L2 lit- erature thus far have not always been the most reliable. Whether grouping participants by level or controlling profi- ciency for statistical purposes. Unsworth. the most commonly used L2 proficiency measure was institutional status (40. they make the research findings impossible to generalize.3%). Jeon & Kim. when such measures of proficiency are used.e. Language institutions are not consistent with respect to the standard they maintain for completion and promotion to the next level. and in-house or study-internal instruments (14.sagepub. many pre- vious and current Korean as a second language (KSL) studies either do not provide proficiency information at all (e. adult L2 (e. post-treatment gain scores without reference to any independ- ent proficiency. For example. researchers’ impressionistic judgment (21. 2003). 2009 .246 The development and validation of a Korean C-Test whether there is a correlation between grammaticality judgment test scores and learner proficiency. Nevertheless. 2000. Applied Linguistics.. no single practical measure Downloaded from http://ltj.g. or use in-house measures that have not been tested for reliability or validated independently (e. This lack of an adequate proficiency measure is magnified when it comes to less commonly taught languages. for a summary).g.. and heritage vs. According to Thomas’s (2001) survey of research published in four prominent SLA journals (i. Others avoid the question entirely and simply calculate pre. nonheritage (e. 2006).

ease and efficiency of test administration as well as objectivity of scoring (e..sagepub..62). TOEIC (r = . like cloze tests. this paper describes an attempt to develop and validate a 30-minute Korean proficiency measure: a Korean C-Test. examinee abilities are measured when the linguistic message is introduced with some noise or interference. 1992. 1984). (See Babaii & Ansary (2001) for further discussion. Cleary. On the other hand. Sunyoung Lee-Ellis 247 of Korean proficiency exists. 1986). the evidence from previous studies on C-Tests seems sufficient to support the idea that it is measuring the same latent variable that most other types of language assessment measure. Therefore. 2001).. the proficiency tests that have been developed in Korea (e. Dörnyei & Katona.. are a type of operationalization of the reduced redundancy principle (Spolsky..g. Conversely. Eckes & Grotjahn.g. With this type of test. reliable or otherwise. 1973). including the TOEFL (r = . This lack of a prac- tical and valid proficiency test poses serious limitations to both past and future KSL research.. the Michigan Test (r = . but they are costly and impractical to supplement L2 studies..83) (see Eckes & Grotjahn (2006) for a sum- mary of these and others). C-Tests.55 to .e. and the Oxford Placement Test (r = . Advocates applaud its high reliability and concurrent validity indices (e. There has been some controversy regarding the C-Test as a measure of proficiency. Considering the purpose of this project (i. poor item discrimination (e.91). In the test. For example. the Defense Language Proficiency Test and Oral Proficiency Interview) are too labor-intensive and costly for most research.g. the high practicality and empirically demonstrated convergent Downloaded from http://ltj.g. and unclear construct validity (e. many studies have demonstrated a high cor- relation between C-Test scores and other types of institutionalized proficiency test scores.54 to . The Korean proficiency measures that have been developed in the USA (e. and often are not available to the general public.) Despite this controversy. 2006). so speakers of the language can supply missing linguistic items under such conditions (Babaii and Ansary.g.. Grotjahn. it also has been subject to criticism for its lack of face validity (e.61). Klein-Braley. with the rationale being that languages are naturally redundant. Klein-Braley & by Green Smith on April 9. creating a practical measure of proficiency for KSL learners). 1997). Jafarpur.g. Korean Language Proficiency Test and Test of Proficiency in Korean) are available.g. 1988).. parts of some words are deleted. 1995). 2009 . and its alleged measure of integrative use of language (e.g.

The 10 passages were distributed from level 1 to level 3. and one 3+. However. Downloaded from http://ltj. while three others were either too narrow in focus or too technical in by Green Smith on April 9. two 3. and a newspaper. I Method 1 C-Test development a Text selection: The most important aspect of C-Test devel- opment is the selection of texts. On the other hand. Following Klein-Braley (1997).248 The development and validation of a Korean C-Test validity of the C-Test provide reason enough to examine it as a proficiency measure. The five remaining passages were considered acceptable and their descrip- tions made by the passage rater and investigator are provided in Table 1. In particular. three 2. including Korean grade school textbooks. higher level passages (1+ and above) were collected from authentic materials. while one was from a grade school textbook. The rating was then compared to the investigator’s initial rating of the passages. resulting in one level 1 passage. 2009 . two 1+. texts assumed to represent authentic samples of the language that L2 learners will confront were collected. which seemed neces- sary to be sensitive enough to differentiate low-level learners.e. one 2+. Two of the five passages deemed inappropriate were from newspapers and two were from youth magazines.. it was decided that two of the passages selected were too challenging when reduced to excerpts because of the lexicon and reduced context. youth magazines. Such passages contained a coherent flow of content. The initially selected passages were rated by a DLI-certified Korean ILR passage level rating expert. because ILR descriptions of lower level passages (i. Through a discussion with the rater.sagepub. but featured simpler grammar and higher frequency vocabulary. Disagreements were resolved via discussion. A total of 10 passages were either created or collected. lower level passages were created by the investigator based on the vocabulary and grammatical features typi- cally found in introductory and intermediate KSL textbooks. level 0+ and 1) turned out to be inappropriate due to their lack of discourse features. an attempt was made to collect texts of widely varying levels according to the Interagency Language Roundtable (ILR) reading skill level descrip- tion.

1997). However. Unlike languages using Roman alphabets. if a word has two syllables. beginning with the second word of the second sentence. where consonants and vowel symbols are written linearly. the application of such a deletion rule is not as straightforward in Korean because of its unique orthographic principles. Korean letters are combined into a syllable unit.. See example (1). Therefore. Downloaded from http://ltj. (1) ㅇㅕㅇ ㅎㅗㅏ 영화 Ø ye ng h wa yenghwa (movie) Because the unit of orthography is on the syllable level in Korean.g. the second half of every second word is deleted. 영 화 영 __). unclear word boundaries. and the productive use of postpositions and suffixes. Sunyoung Lee-Ellis 249 Table 1 The levels and content of the selected passages Level Topic Characteristics Source 1 Daily life Concrete and frequent Created based on vocabulary KSL textbooks Minimum structural complexity Familiar everyday topic 1+ Travel plan Concrete and frequent Same as above vocabulary Mixed structural complexity Familiar topic 1+ Department store Concrete but less frequent Same as above advertisement vocabulary Mixed structural complexity Familiar topic 2 Traffic congestion Concrete but less frequent Korean vocabulary grade school Intermediate structural textbook complexity Less familiar topic: reporting facts 2+ Human rights Abstract and concrete Same as above vocabulary Mixed intermediate structural complexity Abstract concept b C-Test development: Korean language specifics: According to standard C-Test development recommendations (Klein-Braley. 2009 .com by Green Smith on April 9.sagepub. the most natural unit of deletion seems to be the syllable. the second syllable may logically be deleted (e. with the syllable unit marked with underlines.

and perhaps the most important in terms of the C-Test development. the latter method was chosen over the first method for the following reasons. First. See example (2). two different deletion methods were initially considered: (1) to use both right-hand and left-hand deletion. is that functional ele- ments such as postpositions and verbal inflections always come at the end of the word (with their boundaries as defined above). One of the two participants scored unusually low (25%) relative to the participant’s assumed proficiency based on his experience (e. Even though the questionable results in Fouser’s C-Test may be due to many reasons other than Downloaded from http://ltj.g. object marker lul) are ortho- graphically bound to their host content words (e. or (2) to delete everything after the second half of the content word including postpositions and inflections. 2009 .g. In his study of an English C-Test.250 The development and validation of a Korean C-Test A second distinctive attribute of Korean orthography is that Korean word boundaries do not coincide with the location of spaces in writing because postpositions are not orthographically separated from their host words. receiving an MA in Korea) and the investigator’s qualitative by Green Smith on April 9.. Furthermore. In this paper. postpositions are often monosyllabic.. postpositions (e. Cheolsu-Sub Yenghwa-Obj See-Past- Particle Particle Ending Noun-Postposition Noun-Postposition See-Inflections As shown. (2) 철수가 영화를 보았다. it seems more appropriate to include postpositions within the word boundaries for the current purpose of creating C-Test items. To ensure that both content and functional words were deleted. so they cannot be further reduced when a syllable is the unit of dele- tion. One final unique characteristic of Korean. but it yielded unreliable results. Fouser (2001) also reported on a Korean C-Test that he developed using a hybrid deletion method to examine the proficiency of his two participants in his qualitative study.. the results from previous research raise questions about the left-hand deletion. Cleary (1988) found that left-hand deletion discouraged test takers from utilizing the context or processing the discourse.g. Cheolsu-ka Yenghwa-lul Bo-ass-ta. Therefore. deletion of the second half of a word often results in the deletion of functional elements exclusively. As a result.sagepub. the noun Yenghwa).

one-syllable words and proper nouns that could not be recovered from the context were left intact. should be psycholinguistically more valid than left-hand deletion even in languages like Korean. See example (3) for a sample sentence containing one item with two syllables deleted. the second half of each second word was deleted beginning with the second word of the second sentence. for a summary). Cheolsu-Sub Yenghwa-Obj See-Past-Ending. This clue was given with an intention of reducing the number of possible answers deviant from the target answer and to prevent case marker omission. Cheolsu-ka Yeng __ __ Bo-ass-ta. When a content word had three syllables. this study chose to delete anything after half of the content word. the left-hand deletion method seems to counter recent psycholinguistic findings about how people access words in their mental lexicon: words are accessed on-line starting from acoustic onset to offset while it is being heard. including any postpositions or inflections. Particle Particle Noun-Postposition Noun-Postposition See-Inflections Downloaded from http://ltj. 1987). (3) 철수가 영화를 보았다. However. or from left to right while it is being read (see by Green Smith on April 9.sagepub. To this end. it does encourage one to exercise caution in using such a deletion method in the development of a Korean C-Test. people can recognize a word while it is being heard or read: before the complete word is presented (Marslen- Wilson. The number of syllables deleted was specified by assigning one blank per syllable. which conforms to the natural lan- guage processing mechanism. the final syllable and all of the associated postpo- sitions and inflections were deleted. If languages are processed on-line from left to right. which is marginally acceptable in certain contexts. 철수가 영 __ __ 보았다. Sunyoung Lee-Ellis 251 left-hand deletion. To prepare the C-Test. while the standard practice of right-hand deletion could be preserved. Aside from the problems mentioned above. both content words and functional elements could be deleted for testing. 1997. The complete C-Test is provided in Appendix A. the right-hand deletion method. Therefore. 2009 . In this way.

sagepub. and the rest of the participants were heritage language learners who were enrolled in undergradu- ate Korean language classes between the first and fourth semesters. some sentences at the end were cropped and modified in a way that the story’s cohesion was maintained. and sentences after the twenty-fifth deletion were left intact. Five participants were foreign language learners who were enrolled in a third-semester Korean language class. 2005). four participants were graduate level foreign language learners who lived in Korea for more than three years. Nonnative speakers were recruited from the University of Maryland. as cited in Kondo-Brown. and from two to eight participants were Downloaded from http://ltj. 1981. from the most basic survival type of language use up through communicative and receptive competen- cies expected of a well-educated near-native speaker of the language (Clark. test time was established at 40 minutes. The questionnaire contains 15 ‘can do’ questions measuring self-perceived speaking ability. 3 Test administration a Participants: Five native speakers and 37 English-speaking non- native speakers of Korean participated in this study. a self-assessment questionnaire adapted from Kondo-Brown (2005) was administered. most of whom were enrolled in the Korean language program. 2 Self-assessment questionnaire In addition to the C-Test. The self-assessment instrument is provided in Appendix B. In the case of passages that were significantly longer. b Procedures: Based on pilot test results with an intermediate level non-native speaker of Korean. Participants’ majors varied greatly from arts and humanities to busi- ness to the hard sciences. The test was administered in a quiet room on the University of Maryland campus at the participants’ convenience.252 The development and validation of a Korean C-Test A total of 125 items were deleted with 25 items missing from each passage. so only nonnative speaker data are reported in this paper. Participants were asked to rate their own ability to perform the 15 oral tasks on a scale of 1 (not at all) through 5 (no problem at all).com by Green Smith on April 9. There were seven test administrations. 2009 . Native speaker data were used during the test development phase to ensure that items were functioning properly. Items in this questionnaire are intended to represent a wide range of proficiency levels.

participants were provided with instructions at the beginning of the test along with examples. students were provided with the self-assessment questionnaire. Therefore. while the third method yielded a total of 219 items. three different scoring methods were used at first to determine whether significantly different results would be obtained: 1) partial credit scoring. For instance. the Partial Credit Model (PCM) of Rasch Analysis seemed most appropriate. Rasch PCM was run. 2) dichotomous scoring for each item. However. given the item in example (4). it seemed reasonable to give partial credit for partial completion of the items. 2009 . the three scoring meth- ods yielded different scores for the responses. as shown in Table by Green Smith on April 9. the first two scoring methods yielded 125 items (25 items in each of the five passages).sagepub. The resulting item and person reliability indices are provided in the next section. Sunyoung Lee-Ellis 253 tested at the same time. Since most of the participants were assumed to have no familiarity with the C-Test format. and postposition. 3 C-Test scoring Given the deletion method used in the formation of this C-Test. they were also advised to do their best and work on all five texts for the full 40 minutes no matter how difficult it became. Therefore. They were informed that the test was designed to assess all ranges of proficiency up to near-native level so it would be chal- lenging for many of them. while the two other scoring methods were analyzed using the Rasch model for dichoto- mous items. which took about 10 minutes for most participants to complete. grammatical inflec- tion. one could argue that the partial credit gained by completing one feature of an item should not necessarily amount to the same partial credit allotted to the comple- tion on another feature. Downloaded from http://ltj. indi- vidual features were considered separate items. or 3) dichotomous scoring for each subset of features. Since each item may contain more than one aspect of the language including knowledge of lexicon. (4) Example Item: 영 __ __ Answer Key: 영화를 (Movie Obj-Particle) Features involved: Lexicon and Particle Notice that in the third scoring method shown in Table 2. After the test. For the first scoring method.

II Results and discussion 1 C-Test item analysis As mentioned. A possible answer was examined individually by the investigator for its grammatical and contextual appropriateness and credit was given only when the answer was both grammatically and contextually appropriate.영*를 0 point Lexicon or Particle 영 ** 0 point None Dichotomous: Feature 영 화 1 point Lexicon 를 1 point Particle *incorrect response Even though one of the claimed merits of the C-Test is an objec- tive scoring method by virtue of allowing only one correct answer. As shown in Table 3. Therefore. provided that the answer given was unambiguous. answers that were possible but deviant from the target answer emerged.254 The development and validation of a Korean C-Test Table 2 Scoring methods Scoring method Example Credit Feature Partial credit 영 화를 2 points Lexicon and Particle 영 화*.영*를 1 point Lexicon or Particle 영 ** 0 point None Dichotomous: Item 영 화를 1 point Lexicon and Particle 영 화*. C-Test results were analyzed using the one-parameter IRT model (Rasch) because it provides stable estimations of exami- nee ability and item difficulty using a true interval scale presumed to underlie those traits. spell- ing is not considered a characteristic of the construct under consider- ation. Because of these similar results. Downloaded from http://ltj.sagepub.. since the test aims to assess the global language proficiency of participants with varying backgrounds (either instructional or naturalistic). delete everything after the second half of content words including attached functional words). Rasch fit statistics provide useful data to examine the quality of the test items (Bond & Fox.e. 2007). PCM was chosen for further data analysis because it seems most appropriate in that it allows partial credit for the partial success of an item. 2009 . all three scoring systems yielded very similar fit statistics and item and person separation indices. due to the specific deletion method of the current C-Test ( by Green Smith on April 9. In addition. Furthermore. incorrect spelling was not penalized for the scoring.

93 n/a The person and item separation reliability indices shown in the summary statistics are very high.99 Dich: Item . Furthermore.55 .00 1. 2009 .10 .. This in fact is very prom- ising in terms of a C-Test as a measure of proficiency since if the interpretation is legitimate and generalizable to other samples of the population. The items are displayed based on their difficulty level in the right-most column.20 .00 .10 1.e. The person–item variable map in Figure 1 displays both item dif- ficulty and examinee ability estimates on an interval scale.99 . indicating that the response patterns observed for a participant on each item (person fit) and an item on each participant (item fit) match the modeled expectations..96 .05 .00 .19 1.50 . suggesting that the test success- fully differentiated levels of proficiency.99 .98 .99 Dich: Feature .99 Item PCM .sagepub.e.98 .00 .com by Green Smith on April 9.00 . with the ‘X’s to the left of the horizontal line depict- ing the ability score of each individual participant. As mentioned.39 1.99 −.05 .93 n/a Dich: Feature . meaning the person ordering resulting from Rasch analysis exactly matches the person ordering based on the raw score. item and person discriminations are high) and that the range of items sufficiently bounded person abilities for a proficiency measure (i.00 1.00 1.98 .94 n/a Dich: Item 1. which demonstrates that the items provided stable Downloaded from http://ltj. a researcher untrained in Rasch measurement could sim- ply use the raw score data to determine examinee abilities.00 .07 .Cronbach’s tics method tion alpha reliabi- lity Mean SE MNSQ ZSTD MNSQ ZSTD Person PCM −. Scoring Measure Infit Outfit Separa.11 −.00 .30 1. the distribution of items covers the range of all person abilities and more.00 .00. and that items were well- spread along the measures of difficulty.50 . test difficulty was appropriate). The left-most column shows the common scale (in logits).22 . the reported person raw-score-to-measure correlation was 1. In addition. both person and item infit/outfit statistics were well within the range of expectation (Z-standard ±2).69 .29 1. The figure demonstrates that both item difficulty and person ability show great dispersion (i.03 . thereby demonstrating the distribution of the examinee abilities relative to the difficulty of the items. Sunyoung Lee-Ellis 255 Table 3 Person and item measure statistics Statis.00 .00 .69 .02 .02 .

2009 .com by Green Smith on April 9.256 The development and validation of a Korean C-Test persons MAP OF items <more> | <rare> 5 + I0106 | | | | | 4 + | | | I0124 | T I0090 | I0098 I0108 3 X + I0071 | X T| X | I0096 | I0115 | I0041 I0069 I0085 I0094 I0102 I0111 I0121 2 X + I0104 X | I0089 I0099 I0122 I0125 XX | S I0060 I0064 I0076 I0081 I0097 I0107 | I0077 I0084 I0120 | I0053 I0055 I0057 X | I0048 I0101 1 S+ I0026 I0082 I0105 I0113 | I0044 I0065 | I0029 I0062 I0068 I0072 I0079 I0080 I0092 I0114 I0118 X | I0047 I0052 I0061 I0070 I0073 I0100 I0119 XX | I0028 I0038 I0088 I0109 | I0033 I0054 I0078 I0103 I0116 0 X + M I0039 I0095 | I0030 I0031 I0056 I0059 I0066 I0067 I0117 XX | I0037 I0046 I0058 I0075 I0091 I0123 XX M | I0011 I0050 I0063 I0110 | I0035 I0042 I0083 I0086 I0112 X | I0015 I0027 I0034 -1 XXX + I0013 I0087 I0093 XXX | I0036 I0040 I0043 X | I0003 I0045 X | I0004 I0006 I0051 I0074 XXX | S I0024 I0049 X | I0002 I0010 I0021 I0023 -2 XX S + I0032 XX | XXXX | I0017 I0020 I0022 | I0007 I0025 | | I0019 -3 + | I0001 I0014 I0018 | T I0016 | I0008 T| I0009 I0012 | -4 + | | | | | I0005 -5 + <less> | <frequ> Figure 1 Person map of items Downloaded from http://ltj.sagepub.

9 1.98 .46 −1.1 3 −1.11 . one could consider discarding such items if a more streamlined test is desirable. 3.3 82. 19. 2. Table 4 Misfit individuals according to the degree of misfit Person Measure SE Infit Outfit Exact Match observed % expected % MNSQ ZSTD MNSQ ZSTD 7 −1.9 73.34 2. mean- ing they did not behave according to the Rasch model’s expecta- tions.58 −.1 13 −2.27 1. and 3) examinees may be demoralized by the perceived difficulty of the test if easy items are eliminated.3 84.21 . there is one item that is well above person abilities (i. resulting in significantly higher model expectations than their ability estimates.e. Fit statistics show the degree of match between the pattern of observed responses and the modeled expectation. it is recommended to keep the low-difficulty items as well as the one very difficult item for three reasons: 1) the test is designed to measure the full range of proficiency levels.2 .0 .64 4. On the other hand. Table 4 shows the fit statistics of only the individuals who did not fit the model well (misfit).20 82.2 Downloaded from http://ltj.1 .com by Green Smith on April 9. Item 106).0 58.34 .49 ..5 83. individual fit statistics were examined.6 19 −.16 1.26 . and 13) who misfit: their standardized infit statistics were more extreme than ±2.18 .70 86. P2 and P19 were found to have responded correctly to a couple of difficult items as well incorrectly to some easy items.64 −2.8 2 −2.64 2. This means that these items do not contribute to the discrimination of person abilities because virtually everyone answered those items cor- rectly.52 2.5 67. 32. P3 and P13 showed higher observed scores than expected because they responded incorrectly to some of the items that were below their ability.1 68.0 1.30 . There were six individuals (P7. which can exhibit either a pattern of responses observed for a candidate on each item (person fit) or the pattern for each item on each person (item fit).9 77.7 66.22 1. there are several items too easy for the participants (below the lowest person ability logit score).0 3. That said.02 . 2009 . Closer examination of the data reveals that two participants (P7 and P32) responded correctly to some of the items that were above their ability logits. Finally. In addition. Therefore. However. In order to examine the quality of the test.17 1. 2) very beginning learners did not participate in this test administration.59 3.52 2.62 −2.0 63.16 1.sagepub.3 32 0. Sunyoung Lee-Ellis 257 estimation of person abilities.

Rasch modeling should be applied when local independence is satisfied. i. Wilson & Iventosch. Such fit to the model is clearly illustrated in the item characteristic curves (ICC) (Figure 2). In an attempt to better fit the test to Rasch model expectations. However. In the case of the current by Green Smith on April 9. Re-analysis of the data was conducted using the five super-items. The fact that the omission of misfitting items resulted in an improvement of person fit suggests that this may be one way to improve the test. 2 C-Test super-item analysis As Draney (1996) pointed out. these super-items seem to function very well in discriminating people according to their abilities. By design. a reanalysis was attempted after eliminating the misfitting items from the data. 2009 . This modification can be easily accomplished by simply filling in the missing part. One way to deal with this violation of local independence is to create super-items (Wilson. such that the revised test may better conform to the Rasch expectation of unidimensionality. and the summary statistics from PCM analysis are provided in Table 5. 2004). Put another way. Norris. 2004).sagepub. There were 17 misfitting items out of 125 (misfitting items are marked with an asterisk in the test in Appendix A). the omission of those misfitting items eliminated many of the misfitting persons as well. As indicated by the person separation reliability of ..97 and item separation reliability of 1. item fit statistics were also calculated. resulting in only two misfit participants (P7 and P36). both person and item infit Z-standard scores are 0). Furthermore. there is a single construct underlying the items in the C-Test. 1988. where individual summed scores on those five items can be used as scores on the super-items. Not surprisingly. In other words. Downloaded from http://ltj.e. this assumption is not satisfied by the current C-Test. this conclusion cannot be confirmed until the revised version is tested with a new sample population.e. these items are not discriminating person abilities in a manner consistent with other items.. the five passages can be treated as independent super-items (see Norris. That said.00. items are contextually interrelated in the passages. 1989.258 The development and validation of a Korean C-Test In addition to person misfit analyses. where groups subsume the related items. the item fit statistics show excellent fit to the model (i. the success or failure of any item should not depend on the success or failure of any other item.

Sunyoung Lee-Ellis 259 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 Expected score 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 –6 –4 –2 0 2 4 6 8 10 Measure relative to item difficulty 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 Expected score 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 –6 –4 –2 0 2 4 6 8 10 Measure relative to item difficulty (Continued) Downloaded from by Green Smith on April 9.sagepub. 2009 .

2009 .260 The development and validation of a Korean C-Test Figure 2 (Continued) 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 Expected score 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 –4 –3 –2 –1 0 1 2 3 4 5 6 7 Measure relative to item difficulty 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 Expected score 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 –4 –3 –2 –1 0 1 2 3 4 5 6 7 Measure relative to item difficulty (Continued) Downloaded from by Green Smith on April 9.sagepub.

Sunyoung Lee-Ellis 261

Figure 2 (Continued)
Expected score

–4 –3 –2 –1 0 1 2 3 4 5 6 7

Measure relative to item difficulty

Figure 2 Item characteristics curves of the five super-items, from top to bottom
item 1 through item 5

Table 5 C-Test super-item statistics

Statistics Measure Infit Outfit Separation Cronbach’s
reliability alpha


Person 0.91 0.16 1.07 0.00 1.06 0.20 0.97 0.96
Item 1.00 0.05 1.00 0.00 1.07 0.10 1.00 n/a

Figure 2 above shows the five ICCs from each super-item. The
center curves are the expected score ICCs, which show the Rasch-
model predictions for each measure relative to item difficulty, and
the ‘x’s represent observations in an interval. In short, the model
ICC curves of all five items exhibit good discrimination, and the
observed scores of each item fit quite nicely with the model’s expec-
tation, with virtually no point plotted outside the 95% confidence
interval band.

Downloaded from by Green Smith on April 9, 2009

262 The development and validation of a Korean C-Test

3 Self-assessment data analysis
The self-assessment questionnaire data were analyzed using the
Rasch Rating Scale Model (RSM). RSM provides an item estimate
for each Likert item as well as a set of estimates for the four
thresholds that mark the boundaries between the five Likert cat-
egories (Bond & Fox, 2007). It can therefore account for the dif-
ferent relative values that items and thresholds carry, as well as the
amount of attitude in the individual based on empirical evidence.
The summary statistics for the self-assessment data are provided
in Table 6.
As shown, the self-assessment measure showed very high person-
separation reliability (.95) and Cronbach Alpha reliability (.94), as
well as high item-separation reliability (.99). Furthermore, both
person and item fit were sound (ZSTD −.2 and .1, respectively).
The individual item fit statistics were also sound except for item
2 (infit ZSTD: 2.3) and item 3 (infit ZSTD: 2.0). Notice that item
2 was asking whether examinees can say the days of the week and
item 3 was about being able to give the current date, both of which
tend to ask very specific knowledge (see Appendix B). Thus, there
is likely to be some variation among lower ability persons. Finally,
in terms of person fit, five examinees misfit (P32, P10, P20, P24,
and P16). The first four participants show a higher expected match
than observed, meaning those examinee scores are more random
than the model predicted. On the other hand, P16 showed a higher
observed match than expected, so the data were more predictable
than the model predicted. Nevertheless, the mean infit ZSTD was
only −.2, which is very sound.
In conclusion, despite the often-discussed concerns about the
reliability of self-reported data, the self-assessment question-
naire used in this study based on a ‘can do’ list seems to func-
tion reasonably well, demonstrating both high reliability and few

Table 6 Self-assessment statistics (Rating Scale Model)

Statistics Measure Infit Outfit Separation Cronbach’s
reliability alpha


Person 1.29 .48 .97 −.20 .88 4.39 .95 .94
Item .00 .32 1.05 .10 .88 14.73 .99 n/a

Downloaded from by Green Smith on April 9, 2009

Sunyoung Lee-Ellis 263

4 Concurrent validity
The reliability of test instruments is only one condition necessary for
the validation of a test. Therefore, in addition to reliability, the con-
vergent validity of the C-Test was examined by analyzing how well
the C-Test correlated with the self-assessment (SA) questionnaire.
The Pearson correlation between the participants’ C-Test logit
scores and SA logit scores was found to be relatively strong (0.825),
and the correlation was statistically significant (p < 0.01). Figure 3
is a scatter plot illustrating the correlation between SA scores and
C-Test scores.
When the 37 individual SA estimates were plotted against their
C-Test estimates, the resulting distribution demonstrates a positive
linear relationship, indicating a high correlation between the indi-
viduals’ SA performance and C-Test performance. Furthermore, the
points in the scatterplot generally fall within the 95% confidence
band, which indicates that the amount of variation between the two
tests generally falls within the modeled expectation.
The strong correlation coefficient between the two test scores also
suggests that the two tests were assessing the same trait, presumably,
general Korean language proficiency. Notice also that the SA was
designed to examine participants’ verbal abilities while the C-Test was
a written test. Considering that there could be a language mode effect
interacting with the results, a correlation of 0.825 seems quite strong.
The high correlation is also particularly interesting given the
heterogeneous participants of the study, which included both

C-Test Estimates

-4 -2 0 2 4 6 8
Self-assessment Estimates

Figure 3 Scatter plot of the self-assessment and C-Test logit scores along with a
95% confidence interval band

Downloaded from by Green Smith on April 9, 2009

Recall that the test was originally intended to be a 30-minute test. 2006). It may be the case that both instruments are testing a common general proficiency.g. 2009 . The resulting person and item separation reliabilities as well as Pearson correlation index are presented in Table 7.. In addition. including a cross-group comparison. Considering that the C-Test is composed of five independent passages. However. but after piloting it with an intermediate level nonna- tive speaker of Korean. passages 3. it did not prove to be the most challenging. 4. it was determined that the test takers should be given 40 minutes. However.00 to by Green Smith on April 9. showed lower item difficulty estimation than passage 4. structural complexity. As a result. In light of this practical motivation. and perhaps more interestingly. 5 Streamlining the C-Test The final analysis conducted relates to maximizing the efficiency of the C-Test. the shortened test resulted in some loss of item separation reliability (from 1. the correlation between the two instruments was still high. In the literature. and level of vocabulary. the dispersion of the super-item difficulties was examined for redundancy. It seems that even though passage 5 was rated the most difficult (ILR Level 2+) because of its abstract topic.sagepub. would be required to draw such a conclusion. each of which functions also as a super-item. more research. In order to determine whether the deletion of passage 5 would be detrimental to the test’s reliability and validity. In order to decide which passage to delete. the person Downloaded from http://ltj. an attempt was made to streamline the test using the information obtained from the Rasch analysis. a reanalysis was conducted after passage 5 was omitted. and 5 were quite close to each other in terms of their mean logit item difficulty.e. where the learners in fact are not that different. If some passages (i.93). Kondo-Brown. which was supposed to be the most difficult passage. observations have been made about their differ- ent language profiles..264 The development and validation of a Korean C-Test foreign language learners (n = 9) and heritage learners (n = 18). with heritage learners being stronger in oral tasks and foreign language learners being better at written tasks (e. As shown. which may be longer than desirable. the most logical way to shorten the test would be to reduce the number of passages. passage 5. a redundant passage could be removed. Nevertheless. passage 5 seems to be a reasonable candi- date for deletion. super items) exhibited similar levels of difficulty. As shown in Figure 4.

Developed in consideration of Korean language characteristics.sagepub. Item separation rel. Correlation with SA Full test 0.93 0. it demonstrated excellent reliability and validity indices: its person.and item-separation and correlation Downloaded from http://ltj.98 0. The reduction will help streamline the test with seemingly little cost to its reliability and by Green Smith on April 9. III Conclusion The current Korean C-Test development and validation project seems to have been worthwhile. and the correlation with SA is also intact (from 0.98).97 to 0.825 Reduced test 0.825 to 0.826). Sunyoung Lee-Ellis 265 persons MAP OF items <more> | <rare> 3 + | | | | X | 2 + XX T | | T X | XX | X | 1 + X S| S | Passage 4 | Passage 5 | Passage 3 XXX | 0 X + M Passage 2 XXXX M | XX | XXXXX | XX | XXX | S -1 X + XXXX S | | XXX | Passage 1 X | T | -2 + <less> | <frequ> Figure 4 Person map of super-items Table 7 Comparison of reliability and correlation indices Person separation rel. 2009 . it is suggested that the C-Test be shortened by omitting passage 5.00 0.97 1. Therefore.826 separation reliability is slightly higher (from 0.

First. The number of items on the test is significantly more than the number of subjects. if the same test were administered to a group with a restricted range of proficiency. Second.sagepub. these could be omitted from the test. given that Korean is a post-positional language. the validity of a test should always be determined in consideration Downloaded from http://ltj. Although the left-hand deletion method was not included in the current study because of its questionable effectiveness and its non-conformity to current psycholinguistic lexical access models. In addition. the current study leaves room for future research. Although it appears that both types of learners were reliably assessed via the by Green Smith on April 9. and as a result. A major advantage of selecting the C-Test as a proficiency meas- ure is that it is practical. the C-Test. different deletion methods including left-hand deletion should be attempted and compared to the modi- fied right-hand deletion method used in the current study. notice that the participant profile in the current study varied between foreign language learners and heritage language learners. there may be several fac- tors contributing to these indices. In addition. the proficiency range of the examinees was very wide. Still. Furthermore. 2009 . Rasch analysis also provided information regarding redundancy. it should be pointed out that caution should be used in interpreting the exceptionally high reliability and concurrent valid- ity indices of the current results. In other words. potentially providing very consistent estimates of profi- ciency for different pools of participants. it is an empirical question whether the left-hand deletion in post-positional languages is a more viable alternative. a cross-group comparison with greater sample sizes is needed to understand whether the C-Test can be used equally val- idly between and across the two groups. Once misfitting items were identified. if used properly. It is relatively easy to develop as long as the passages are carefully graded and selected. can become a very useful tool for L2 research. and its administration and scoring is simple and quick. Finally. unidimensionality improved as evi- denced by the accompanying improvement in person fit statistics. In fact. That said. which may have contributed to a high person separation reliability. Rasch analysis provided useful information to improve the test. which is likely to have caused a high item separation reliability.266 The development and validation of a Korean C-Test values with a separate proficiency measure were very high. which in turn helped streamline the test without much cost to reliability and validity. Given this high practicality. the reliability probably would have been lower.

Draney. Language Testing. 29(2). Applying the Rasch model: Fundamental measurement in the human sciences. The ascent of Babel: An exploration of language.. C. T. RELC Journal. and practical instrument if used only as a coarse estimation of general proficiency in future experimental studies. 19.. T. & Ioup. 13. Oxford: Oxford University Press. 244–263. and understanding. Unpublished doctoral dis- sertation. 26–35. 4(1). valid. (2007). 15(4). Felix. (2006). NJ: Lawrence Erlbaum Associates. The relationship of form and meaning: A cross- sectional study of tense and aspect in the interlanguage of learners of English as a second language. Developing KFL students’ pragmatic awareness of Korean speech acts: The use of discourse completion tasks. M. Finally..sagepub.. John Norris. R. G. A closer look at the construct validity of C-Tests. 253–278. 209–219. the current C-Test seems to function as a reliable.. A. H. The C-Test: A valid operationalization of reduced redundancy principle? System. David Ellis. T. Any errors that remain are my own. (2001). Sincere thanks to Steve Ross for teaching me Rasch Analysis. S. Validation of C-Test among Hungarian EFL learners. 23. Babaii. (1997). L. Bardovi-Harlig. Language Awareness. M. The polytomous Saltus model: A mixture model approach to the diagnosis of developmental differences. S. & Grotjahn. Mahwah. G. Bond. E.. IV References Altmann. & Fox. C. (2006). and Robert DeKeyser for their fruitful discussions on this project. Z. Language Testing. mind. In that sense. (1996). University of California at Berkeley. L. Downloaded from http://ltj. Bley-Vroman. K. (1988). The C-Test in English: Left-hand deletions. 9. G. Many thanks also to Michael Long and Cathy Doughty for their endless support of the students in our PhD program. 290–325. I would like to thank the anonymous reviewers for their insightful comments. Acknowledgements This paper was supported by the Department of Second Language Acquisition at the University of Maryland. (1992).com by Green Smith on April 9. Eckes. 2009 . Sunyoung Lee-Ellis 267 of its context and purpose. Second Language Research. Applied Psycholinguistics. Cleary. T. & Ansary. K. 1–32. Dörnyei. and Jeansue Mueller for her contributions during the test development phase. & Katona. Byon. The accessibility of Universal Grammar in adult language learning. (1988). 187–206. (1992).

14.86. & Choo.. Second Language Research. (1997). Reading skill level descrip- tion. In R.268 The development and validation of a Korean C-Test Fouser. S. (2005). M. On knowledge and development of unaccusativity in Spanish L2 acquisition. 60–85. & Yoon. 71–102. Studies in Second Language ILRscale4. Lee. (1986). A study of the L2 acquisition of English reflexives. (2007). 563–581. (2003).. K. 29(2). J. R. applications (pp. Retrieved January 10. M.. PhD thesis. 50(3). Language Testing. Differences in language skills: Heritage language learner subgroups and foreign language learners.. (1984). 12. 134–146. S. R. Development of relativization in Korean as a foreign language: The noun phrase accessibility hierarchy in head- internal and head-external relative clauses. Is C-Test superior to cloze? Language Testing.-Y. Montrul. 433–448. Language Learning.-H. & Kim. Linguistics. K. 1153–1190.197. M. 2009 . Jeon. & Ortega.). (2004).htm Jafarpur. Grotjahn.. Heritage language development: Focus on East Asian immigrants. Frankfurt: Peter Lang. Binding Interpretations by Korean Heritage Speakers and Adult L2 Learners of Korean. Norris. C. (1985). Retrieved February 20. O’Grady. 417–528. Language Testing. J. W. Buffalo. C-Tests in the context of reduced redundancy testing: An appraisal. 6.186/posters/29/KimBUCLD2004. L. Cross-linguistic influence in third language acquisition (pp. (2005). 43(6). The Modern Language Journal. S. (1987). 253–276. (1995). 89. empirical research. Kim. C. from http://www. Klein-Braley. 25(3). Kondo-Brown. 138–148). 1. (2006). M. A survey of research on the by Green Smith on April 9. Interagency Language Roundtable. Functional parallelism in spoken word recogni- tion. Kondo-Brown. J. Netherlands: John Benjamins. Studies in Second Language Acquisition. Grotjahn (Ed. & Raatz. K. U. 47–84. Empirical investigations of cloze tests. Cognition. 45–83). (Ed. Baker & N. Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Hirakawa.. C. Language Testing. Hornberger (Eds. A subject-object asymmetry in the acquisition of relative clauses in Korean as a second language. J. 194–216. Online Supplement to the Proceedings of 29th Annual Boston University Conference on Language Development. Klein-Braley.pdf Klein-Braley.govtilr. In C. Development and evaluation of a curriculum-based German C-test for placement purposes. The C-test: Theory. (2000). 3. (1981). (1989). H. Montrul. 159–185. M. H. Too close for comfort? Sociolinguistic transfer from Japanese into Korean as an L3.sagepub. Marslen-Wilson. NY: Multilingual Matters. W. 2008. University of Duisberg. from http://128.).). Norris. J. Downloaded from http://ltj. Test validation and cognitive psychology: Some methodo- logical considerations. (2005). 25. Amsterdam. (2001). A. 2008.

278–289. child L2 and adult L2 acquisition: Differences and similarities.. & Iventosch. 24. (2004). B. 44(2). MA: Newbury House. (1988). Whong-Barr. 2009 . Wilson. 633–644. 105. Saltus: A psychometric model of discontinuity in cognitive development. 579–616. Richards (Eds. Applied Measurement in Education. Sunyoung Lee-Ellis 269 Spolsky. (2002). What does it mean to know a by Green Smith on April 9. D. Studies in Second Language Acquisition. 319–334. Rowley. M. Language Learning. Oller & J. 28(2). & Schwartz. Thomas. (1973). Morphological and syntactic transfer in child L2 acquisition of the English dative alternation. Psychological Bulletin. Child L1.. S. M. B. Assessment of L2 Proficiency in second language acquisi- tion research. M. or how do you get somebody to perform his competence? In J. Wilson. Unsworth. (1989). 307–336.sagepub. L. Using the partial credit model to investi- gate responses to structured subtests. M. 164–176). (2001). Focus on the learner (pp. Proceedings of the Annual Boston University Conference on Language Development. Downloaded from http://ltj. 1(4).).

You will read five texts.. fill in the part that you know instead of skipping the entire words/phrases.. 제 이 름 은 김철수입니다. … Notice that partial points are available. Example: 안녕_ _ _. This test is designed for all ranges of proficiency (i. and make sure you work on all five texts if you have time. Passage 1 안녕하세요.g.sagepub. 한국어 수 _ _ 매일 오 _ 10시에 시작 _ _ _ . 2009 . 오늘은 여행 _ _ 전화를 걸 _ 서울에서 제주도 _ _ 왕복 비 Downloaded from http://ltj. Each line represents one syllable. 김치찌개는 맵습니다. 운동을 한 다 _ _ 아침을 먹습니다. 그렇지만 듣 _ _ 읽기는 쉽 _ _ _ . 아침은 기숙 _ 식당에서 먹 습니다. 제 이름은 김철수입니다. so it will seem challenging to many of you. 제 주도는 한반 _ _ 남쪽에 있 _ 섬이예요.) You will be given 40 minutes to complete the test. 저는 대학 _ _ 다닙니다. 한국어는 쓰 _ _ 말하기가 어 _ _ _ _ .270 The development and validation of a Korean C-Test Appendix A: Korean C-Test This is a test of how well you comprehend and produce written Korean. 한국 식 _ _ 극장 바 _ 옆에 있습니다.” will receive partial credit. 제 이 _ _ 김철수입니다. parts of some words are missing. Spelling will not be assessed as long as the words are identifiable. In each. Passage 2 올 여름에는 가족들과 함께 제주도에 여행을 가려고 해요.e. Study each text and write in the missing letters. 체육 _ _ _ 운동을 합 _ _ . (e. No negative point will be deducted for a wrong answer. 불고 _ _ 맛있습니다. If you know only part of the missing parts. … Your job is to complete the test as: 안 녕 하 세 요. 영화를 _ 후에 한국 식당에서 저 _ _ 먹습니다. from beginning to near- native).com by Green Smith on April 9. 주말에는 친 _ _ _ 같이 극 _ _ _ 영화를 봅니다. 저는 대학 _ _ _ 한국어를 배 _ _ _ . 아침에 일어 _ _ 학교 체육 _ _ 갑니다. 한국어 배 _ _ 것이 참 재미 _ _ _ _ . 한국의 하와이라 불 _ _ 제주도는 자 _ _ 아름다워서 신혼 _ _ 장소로 인 _ _ 굉장히 많 _ _ . “제 이? 은 김철수입니다. However. please do your best until the end.

Passage 3 안녕하세요. 자동 _ _ 점점 많아 _ _ 반면 주 _ 공간은 제 _ _ _ 있 기 때 _ _ 주차난이 생 _ _ . 칠 _ 에서는 집안을 따 _ _ _ 해 줄 전 _ 히터와 가스 난 _ 등 다양 _ 난방용 가 _ 제품을 특가판 _ _ _ 있습니다. 주차장이 부족하면 사람 _ _ 주택 가 골 _ 이나 도로에까지 주차를 하 _ 경우가 많다. 감사합니다 by Green Smith on April 9. 2009 . 장 _ 등의 겨 _ 상품이 각 30프로씩 할 _ _ 가격에 판 _ _ _ 있습니다. 여행 _ _ _ 호텔도 소개 _ 주었지만 호 텔은 아직 안 정 _ _ _ . 고객 여러 분의 많은 성원 부탁드립니다. 목도 리. 직장 여 _ _ 위한 여성복 코너 _ _ 여성 정 _ 과 겨울 속 _ _ 50 프로 세일하고 있 _ . 인터넷으로 정 _ _ 더 찾아 보 _ 어느 호 텔이 좋 _ 지 알아 보 _ _ 해요. 도로에서는 교 _ 체증으로 인 _ _ 에너지와 시 _ _ 낭비된다. 이러한 권 리를 인권이라고 한다. Passage 4 도시의 가장 큰 문제점이라면 뭐니뭐니해도 교통 문제가 제일 크 다. 따라서 교통 문제를 해결하기 위해서 는 자가용보다는 버스나 지하철을 많이 이용해야 할 것이다. Sunyoung Lee-Ellis 271 행 _ _ _ 네장 예 _ _ _ _ . 특히 출 _ _ 시간에는 한꺼 _ _ 차량이 일제 _ 몰려서 도 _ _ 아 주 복 _ _ _ . 요 _ _ 인터넷이 있 _ _ 호텔 뿐 아 _ _ 유명한 관 _ 명소와 맛 _ _ 식당도 찾아 볼 _ 있어서 참 편리해요. 인간답게 살아 간 _ _ 것은 인간으로서 존엄 _ _ 지키고 자신 _ 행복을 추 _ _ 면서 사는 것을 의미 _ _ . 저 _ 서울 백화점과 함 _ 겨울나기 준 _ _ 시작하세요. 서울역 앞에 위치한 서울 백화점입니다. 사람들은 저 _ _ 행복을 추 _ _ _ 있으며. 그러한 행복을 누 _ _ Downloaded from http://ltj. 게다가 뉴욕 같은 대도 _ _ 주차난은 매 _ 심각한 수준 _ _ . Passage 5 (Optional: See Results and Discussion) 인간은 누구나 인간답게 살아갈 권리를 가지고 있다. 이렇게 불 _ 으로 주 _ _ 차량은 또 다시 교통 혼 _ _ 원인이 되 _ 더 심 _ _ 교통 체증을 일으킨다. 삼층 아동 _ 코너에서도 코 _ . 저희 백 화점 _ _ _ 겨울철을 맞 _ 겨울옷과 난 _ 제품을 세일 _ _ 있습니 다.

2009 . 자 _ _ 꿈을 실 _ _ _ 위해 노 _ _ _ 것 등 각자 행 _ _ 기준이 다 다 _ _ . 그리고 다 _ 사람 에게 피 _ _ 주지 않 _ 한. C-Test answer key: *: misfitting items: See Results and Discussion Passage 1 대학교에 일어나서 *체육관에 *체육관에서 합니다 다음에 기 숙사 대학교에서 배웁니다 수업은 *오전 시작합니다 쓰기와 어렵습니다 *듣기와 쉽습니다 배우는 재미있습니다 친구와 극 장에서 본 저녁을 *식당이 바로 불고기가 Passage 2 *한반도의 *있는 불리는 자연이 *신혼여행 인기가 많아요 여행 사에 *걸어 제주도까지 비행기표를 예약했어요 여행사에서 소 개해 정했어요 정보를 보고 좋은 보려고 요즘은 있어서 아니 라 관광 맛있는 수 Passage3 백화점에서는 맞아 난방 세일하고 여성을 코너에서 정장과 속 옷을 있고 아동복 코트 장갑 겨울 할인된 판매되고 칠층 따뜻 하게 *전기 난로 다양한 가전 특가판매하고 저희 함께 준비를 Passage 4 교통 인하여 시간이 출퇴근 한꺼번에 일제히 도로가 *복잡하 다 대도시의 매우 수준이다 자동차는 많아지는 주차 제한되어 때문에 생긴다 사람들이 골목 *하는 불법 주차된 *혼잡의 되어 심각한 Downloaded from http://ltj. 다른 사 _ _ _ 위해 봉 _ _ _ 것. 행 _ _ 정의는 개인 _ _ 다 다를 수 있다.com by Green Smith on April 9. 이런 권리는 인간이라면 누 _ _ 다 보 _ 받아야 할 당 _ _ 권리이다.sagepub. 이런 권 리를 우리는 기본권이라고 한다. 행복 추구권은 존중 _ _ .272 The development and validation of a Korean C-Test 위해 열 _ _ 노력하고 있다.

sagepub. 2009 . Sunyoung Lee-Ellis 273 Passage 5 간다는 존엄성을 자신의 추구하면서 의미한다 저마다 추구하 고 누리기 열심히 *행복의 개인마다 사람을 봉사하는 자신의 실현하기 노력하 *행복의 다르다 다른 피해를 않는 *존중된다 누구나 보장 *당연한 Downloaded from by Green Smith on April 9.

.. etc.g.g. 5 4 3 2 1 Order a simple meal in a restaurant.. 5 4 3 2 1 Say the days of the week. day. 5 4 3 2 1 Give directions on the street. 2009 . and use appropriate greetings and leave-taking expressions.S. environmental pollution). birth control. Downloaded from http://ltj. Difficult 5 4 3 2 1 Count to 10 in Korean. 5 4 3 2 1 Describe and discuss the U. 5 4 3 2 1 Describe my present job. 5 4 3 2 1 Give the current date (month. 5 4 3 2 1 Describe my latest travel experience accurately and in detail. early schooling. year). a crime report or sports event held recently in your university) in detail. 5 4 3 2 1 Construct a structured hypothesis on an abstract issue (e. 5 4 3 2 1 Report an event or news that happened around me recently ( by Green Smith on April 9. 5 4 3 2 1 State and support with examples and reasons of my position on a controversial topic (e. studies. cigarette smoking. or other major life activities accurately and in detail. globalization and ethnic identity) and discuss the topic knowledgably. composition of family. 5 4 3 2 1 Introduce myself in social situations.274 The development and validation of a Korean C-Test Appendix B: Self-assessment questionnaire Please indicate how well you can carry out the following tasks in Korean using the scale of: 5 (no problem at all) 4 (somewhat easily) 3 (with some difficulty) 2 (with great difficulty) 1 (cannot at all) Really Really Easy ------------------.sagepub. 5 4 3 2 1 Sustain everyday conversation in casual style Korean with my Korean friend. 5 4 3 2 1 Give simple biographical information about myself (place of birth. 5 4 3 2 1 Sustain everyday conversation in very polite style Korean with a person much older than I am.). educational system in detail.

com/cgi/content/abstract/26/2/275 Published by: http://www.sagepub.sagepub. Language Testing http://ltj.1177/0265532208101008 The online version of this article can be found at: http://ltj.nav Citations Additional services and information for Language Testing can be found at: Email Alerts: http://ltj. 2009 .sagepub.sagepub. 26.sagepub.nav Permissions: Downloaded from http://ltj.sagepublications. 275 DOI: Reprints: Subscriptions: Diagnostic assessment of writing: A comparison of two rating scales Ute Knoch Language Testing by Green Smith on April 9.sagepub.

nav DOI:10. Questionnaires and interviews were administered to elicit the raters’ perceptions of the efficacy of the two types of scales. However. Victoria. Address for correspondence: Ute Knoch. 2002). email: uknoch@unimelb. He lists several specific features which distinguish diagnostic tests from other types of tests.1177/0265532208101008 Downloaded from http://ltj. Keywords: diagnostic writing by Green Smith on April 9. – Language Testing 2009 26 (2) 275 304 Diagnostic assessment of writing: A comparison of two rating scales Ute Knoch University of Melbourne. one ‘a priori’ developed scale with less specific descriptors of the kind commonly used in proficiency tests and one empirically developed scale with detailed level descriptors. Rater feedback also showed a preference for the more detailed Room 521.sagepub. Language Testing Research Centre. second language writing assessment. The results indicate that rater reliability was substantially higher and that raters were able to better distinguish between different aspects of writing when the more detailed descriptors were used. rating scales used in performance assessment have been repeatedly criticized for being imprecise and therefore often resulting in holistic marking by raters ( Among these. Arts Centre. 3052. A quantitative comparison of rater behaviour was undertaken using FACETS.sagepub. The validation process involved 10 trained raters applying both sets of descriptors to the rating of 100 writing scripts yielded from a large-scale diagnostic assessment administered to both native and non-native speakers of English at a large university. Carlton. Level 5. 2009 . 2009. © The Author(s). rating scales Alderson (2005) argues that diagnostic tests are often confused with placement or proficiency tests. The aim of this study is to compare two rating scales for writing in an EAP context. rating scale development. he writes that diagnostic tests should be designed to identify strengths and weaknesses in the learner’s knowledge and use of lan- guage and that diagnostic tests usually focus on specific rather than global abilities. Australia Alderson (2005) suggests that diagnostic tests should identify strengths and weaknesses in learners’ use of language and focus on specific elements rather than global abilities. University of The findings are discussed in terms of their implications for rater training and rating scale development. second language writing. Reprints and Permissions: http://www. rating scale validation.

Intuitively developed scales are developed based on existing scales or what scale developers think might be com- mon features at various levels of proficiency.sagepub. Another possible classification of rating scales represents the way the scales are constructed. 2002). In recent years. reliable and detailed feedback on the features of learner performance that require further work. McNamara (2002) and Turner (2000). for example. better diagnostic informa- tion can be expected. Accordingly. 2009 . However. I Rating scales Several classifications of rating scales have been proposed in the literature. 2002) and therefore an argument can be made that diagnostic tests of writing should be direct rather than indirect. have higher con- struct validity for second language by Green Smith on April 9. have argued that the rating scale (and the way raters interpret the rating scale) represents the de-facto test construct. a number of researchers have proposed that scales should be developed based Downloaded from http://ltj. then. This is all the more important in diagnostic contexts.276 Diagnostic assessment of writing When discussing the diagnostic assessment of writing. showing that analytic scales are generally accepted to result in higher reliability. 121). but are time-consuming to construct and therefore expensive. where it is incumbent upon raters to provide valid. 1991. The question. Weigle summarizes the differences between these two scales in terms of six qualities of test usefulness (p. careful attention needs to be paid not only to the formulation of a rating scale but also to the man- ner in which it is used. The most commonly cited categorization is that of holi- stic and analytic scales (Hamp-Lyons. Alderson (2005) describes the use of indirect tests (in this case the DIALANG test) rather than the use of performance tests. indirect tests are used less and less to assess writing ability in the current era of performance testing because they are not considered to be adequate and valid measures of the multi-faceted nature of writing (Weigle. Weigle. Because analytic scales meas- ure writing on several different aspects. Fulcher (2003) distinguishes between two main approaches to scale development: intuitive methods or empirical methods. Typical examples of these scales are the FSI family of scales. One central aspect in the performance assessment of writing is the rating scale. is how direct diagnostic tests of writing should differ from proficiency or placement tests.

Examples of such scales are those produced by North and Schneider (1998) who proposed the method of scaling descriptors. 600 DELNA writing scripts at five pro- ficiency levels were analysed using a range of discourse analytic by Green Smith on April 9. the resulting scoring profile would be less useful to candidates.. Downloaded from http://ltj. binary-choice. 2004). if raters resort to letting an overall. Ute Knoch 277 on empirical methods. Fulcher’s data-based scale (1996) as well as Upshur and Turner (1999) and Turner and Upshur’s (2002) empirically derived. The first criticism is that they are usually intuitively designed and therefore often do not closely enough represent the features of candidate discourse. a new rating scale was developed. Rating scales commonly used in the assessment of writing have been criticized for a number of reasons. global impression guide their ratings. 1995. 2003). Similarly. The problems with intuitively developed rating scales described above might affect the raters’ ability to make fine-grained distinc- tions between different traits on a rating scale. rather than offering precise and detailed descriptions of the nature of performance at each level. Watson Todd et al. II The current study The purpose of this study was to establish whether an empirically developed rating scale for writing assessment with band descriptors based on discourse analytic measures would result in more valid and reliable ratings for a diagnostic context than a rating scale typical of proficiency testing. 2009 . During the first phase. This might result in important diagnostic information being lost. the analysis phase. The band levels have moreover been criticized for often using relativistic wording to differentiate between levels (Mickan.sagepub. Based on the findings in Phase 1. Brindley (1998) and others have pointed out that the criteria often use impressionistic terminology which is open to sub- jective interpretations (Upshur & Turner. boundary definition (EBB) scales. The study was conducted in two main phases. These discourse analytic measures were selected because they were able to distinguish between writing scripts at different proficiency levels and because they represented a range of aspects of writing. Furthermore. even when using an analytic rating scale. It is therefore doubtful whether intuitively developed rating scales are suitable in a diagnostic context.

the validation phase. 10 raters rated 100 writing scripts using first the existing descriptors and then the new rating scale. so that the most appropriate language support can be offered (Elder.278 Diagnostic assessment of writing During the second phase of this study. (c) variability in the ratings and (e) what the different traits measure? Research question 2: What are raters’ perceptions of the two differ- ent rating scales for writing? III Method 1 Context of the research DELNA (Diagnostic English Language Needs Assessment) is a university-funded procedure designed to identify the English lan- guage needs of undergraduate students following their admission to the University of Auckland. This paper reports on the findings from the second phase. 2007). (b) rater spread and agreement. 2009 . Because both qualitative and quantitative data were collected to support the findings. 2003. 2008).sagepub. The assessment includes a screening component which is made up of a speed-reading and a vocabulary task. More spe- cifically. The overarching research question for the whole study is as fol- lows: To what extent is an empirically developed rating scale of academic writing with level descriptors based on discourse analytic measures more valid and useful for diagnostic writing assessment than an existing rating scale? To guide the data collection and analysis of Phase 2. detailed interviews were conducted with seven of the ten raters to elicit their opinions of the efficacy of the two scales. this study is situated in the paradigm of mixed methods research (Creswell & Plano Clark. an embedded mixed methods research model was by Green Smith on April 9. Elder & Von Randow. where qualitative data are used to supplement quantitative data. This is used to quickly eliminate highly proficient users of English and exempt these from the time consuming and resource-intensive diagnostic procedure. two more specific research questions were formulated: Research question 1: Do the ratings produced using the two rating scales differ in terms of (a) the discrimination between candidates. Downloaded from http://ltj. Afterwards.

the vocabulary and spelling scale). The scale reflects common practice in language testing in that the descriptors are graded using adjectives like ‘ade- quate’. Vocabulary & Spelling) each consisting of six band levels ranging from four to nine. but also to their academic departments as well as tutors working in the English Language Self-Access Centre. • fluency (number of self-corrections as measured by cross-outs). Candidates are given a time limit of 30 minutes.g. The results of the DELNA assessment are not only made available to students. Sentence structure. Coherence. the Student Learning Centre and on English as a second language credit courses. Development of ideas.sagepub. Style. Interpretation. take ESOL credit courses. Based on their results on DELNA. 2 Instruments a The rating scales: The DELNA rating scale: The existing DELNA rating scale is an analytic rating scale with nine traits (Organization. 2007a): • accuracy (percentage error-free t-units). Data description. An abridged version of the DELNA scale can be found in Appendix by Green Smith on April 9. • style (number of hedging devices – see Knoch (2008)). different features of writing are conflated into one category (e. The writing task is routinely double-marked using an analytic rating scale. The writing section of the DELNA assessment is an expository writing task in which students are given a table or graph of infor- mation which they are asked to describe and interpret. Downloaded from http://ltj. students will be asked to attend language tutorials set up within their specific disciplines. The scale is described in more detail below. The new scale: The new scale was developed based on an analysis of 600 DELNA writing samples. In some trait scales. see tutors in the English Language Self-Access Centre. the Student Learning Centre or take a specific writing course designed for English-speaking background students. 2009 . Ute Knoch 279 The diagnostic component comprises objectively scored reading and listening tasks and a subjectively scored writing task. ‘sufficient’ or ‘severe’. • complexity (number of words from Academic Wordlist). Grammatical accuracy. The scripts were ana- lyzed using discourse analytic measures in the following categories (for a more detailed description of each measure refer to Knoch.

2007a). Where possible. • coherence (based on topical structure analysis – see Knoch (2007c)).. 1990.280 Diagnostic assessment of writing • paragraphing (number of logical paragraphs from five paragraph model). The DELNA scale has the same number of levels for each trait. and (c) occur in most writing samples. Measures to represent these constructs were then chosen based on a requirement to fulfil the following criteria: measures had to (a) be able to discriminate successfully between different levels of by Green Smith on April 9. Cumming et al. 2007c). please refer to Knoch (2007a. (b) be practical in the rating process. 2001) and models of writing (e. mod- els of rater decision-making (e. 1996). Bachman & Palmer. several possible models were reviewed as part of the larger study (Knoch. An abridged version of the scale can be found in Appendix 2. whilst the new scale has vary- ing number of levels for different traits. The new scale does not make use of any adverbials in the level descriptors. A comparison of the number of band levels of the two trait scales can be seen in Table 1. nor does it ask raters to focus on more than one aspect of writing in one trait scale. For example.sagepub. The new scale differs from the existing DELNA scale in that it provides more explicit descriptors. To ground the selection of the discourse-analytic measures in theory. In addition to the qualitative differences of the level descriptors of the two rating scales described above. raters are required to estimate the percentage of error- free sentences. Grabe & Kaplan. • cohesion (types of linking devices. The reason for this is that the scale was developed on an empirical basis. number of anaphoric pro- nominals ‘this/these’). As none of these were found to be satisfactory by themselves. b The writing samples: The one hundred writing scripts used in the second phase of the study were randomly selected from the Downloaded from http://ltj. • content (number of ideas and supporting ideas). These included: models of communicative competence (e. a taxonomy of all were used to select the constructs to include in the scale. The number of levels reflects the findings of the empirical investigation.g. Bachman. 2009 . for accuracy. the two scales differ in the number of band levels of the trait scales. raters are given features of writing which they can count. 2007b. For a detailed account of the development of the new scale. 1996).

what they would change about the two scales in terms of the wording. 2009 .sagepub. The idea behind the development of this manual was that because the raters were all very familiar with the existing rating scale. categories and number of levels and if they found any categories difficult to apply. clear instructions are provided on how each trait is to be rated. 3 Participants Ten DELNA raters were drawn from a larger pool of raters based on their availability at the time of the study. Knoch et al. Ute Knoch 281 Table 1 Comparison of traits and band levels in existing and new scale DELNA scale Band levels New scale Band levels Grammatical accuracy 6 Accuracy 6 Sentence structure 6 Vocabulary and spelling 6 Lexical complexity 4 Data description 6 Data description 6 Data interpretation 6 Data interpretation 5 Data – Part 3 6 Data – Part 3 5 Style 6 Hedging 6 Organization 6 Paragraphing 5 Coherence 5 Cohesion 6 Cohesion 4 scripts produced during the 2004 administration of the DELNA assessment. Each trait scale is further illustrated with examples and practice exercises... d The interview questions: Because the interviews were semi- structured. a very lengthy training session would have had to be held to introduce them to the new scale. a training manual was produced which the raters were asked to study at home before the rater training ses- sion. All raters have several years of experience as DELNA raters and take part in regular train- ing moderation sessions either face-to-face or online (Elder et al. the exact interview questions varied from participant to participant. This was. 2007. not pos- sible because of time constraints on the part of the raters. however. 2007). c The training manual: To help the raters become familiar with the new by Green Smith on April 9. The raters were asked what they thought about the two rating scales. Downloaded from http://ltj. In the manual.

In the many-faceted Rasch model. In both cases they rated 12 scripts as a group. A coun- terbalanced design was not possible because for practicality reasons the ratings using the existing DELNA scale had to be completed before the new scale was designed. a factor analysis of the rating data and the analysis of the interviews with the raters. items and examinees). However. FACETS is a generalization of Wright and Masters’ (1982) partial credit model that makes possible the analysis of data from assessments that have more than the traditional two facets associ- ated with multiple-choice tests (i. The interviews were conducted in a quiet room and lasted for 30–45 minutes. The raters discussed their own ratings and then compared these to benchmark ratings.sagepub. a completely new group of raters could have been recruited for this study. candidates. the results of the two rating rounds were analyzed using the multi-faceted Rasch measurement program Facets (Linacre. because of the large number of scripts in the study. To avoid the effect of all raters being previously familiar with the DELNA rating scale. all raters were invited to par- ticipate in semi-structured interviews. it can be contended that the raters were not able to remember any of the scripts in the sample from one rating round to the next. b Data collection: The ratings based on the existing DELNA scale were collected over a period of eight weeks.e. the two months between rating rounds and feedback received from the raters. However. as being familiar with the context of the assess- ment was important for the interviews. 2006). Seven of the ten raters agreed to participate. Each of these is discussed below. each facet of the assessment situation (e. trait) is represented by one parameter. After the ratings were by Green Smith on April 9. c Data analysis: Three types of data analysis were undertaken: the analysis of the multi-faceted Rasch data. Two months after the raters had completed their ratings based on the DELNA scale. raters. and for practicality reasons. 2009 . this was not done. First. The model states that the likelihood of a particular rating on a given Downloaded from http://ltj. In each case. the raters met in plenary for a face-to-face session. they rated the same 100 scripts using the new scale.g.282 Diagnostic assessment of writing 4 Procedures a Rater training: The rater training sessions for both the existing rating scale and the new scale were conducted very similarly.

Ute Knoch 283 rating scale from a particular rater for a particular student can be predicted mathematically from the proficiency of the student and the severity of the rater. the more discriminating the rating scale is. The higher the separation ratio. 2) Rater separation: The next hypothesis made was that a well functioning rating scale would result in small differences between raters in terms of their leniency and harshness as a group. 2009 . which indicates.sagepub. Therefore. Higher values on both of these indices point to a better-functioning rating scale. 2005). 3) Rater reliability: The third hypothesis was that a necessary condition for valid- ity of a rating scale. in percentage terms. which is a measure of how similarly the raters are ranking the candidates and (b) the per- centage of exact rater agreement. When a rating scale is analyzed. it was further contended that a better functioning rating scale would result in fewer raters rating either inconsist- ently or overly consistently (by overusing the central categories of the rating scale). 4) Variation in ratings: Because rating behaviour is directly influenced by the rating scale used. Rater infit means square values have an expected value of 1 and can range from Downloaded from http://ltj. a number of hypotheses were developed for comparing the two rating scales: 1) Discrimination of the rating scale: The first hypothesis was that a well functioning rating scale would result in a high candidate by Green Smith on April 9. is rater reliability (Davies & Elder. a rating scale resulting in a smaller rater separation ratio is seen to be functioning better. To interpret the results of the multi-faceted Rasch analysis. FACETS provides two measures of rater reliability: (a) the rater point biserial correlation index. the candidate separation ratio is an excellent indicator of the discrimination of the rating scale. The measure indicating variability in raters’ scores is the rater infit mean square value. how many times raters awarded exactly the same score as another rater in the sample.

low values (. principal axis factoring was used. Apart from the multi-faceted Rasch analysis. High infit mean square values (in this case 1. by Green Smith on April 9. Before this analysis.7 was chosen for this study as the lower limit) indi- cate that the observed ratings are closer to the expected ratings than the Rasch model predicts.3 was chosen as the cut-off level following McNamara (1996) and Myford and Wolfe (2000)) denote ratings that are further away from the expected ratings than the model predicts. This could indicate that a rater is rating very consistently. To determine the number of factors to be retained in the analysis. The interview data were transcribed and then subjected to a quali- tative analysis via a hermeneutic process of reading. scree plots and Jolliffe’s (1986) criterion of retaining eigenvalues over . The coding themes that emerged during this process were then grouped into categories. IV Results Research question 1: Do the individual trait scales on the two rating scales differ in terms of (a) the discrimination between candidates. normally the inside values. Varimax rotation was chosen to make the output of the factor analysis more comprehensible. (b) rater spread and agreement. This is a sign that the rater in question is rating inconsistently. the closer the rater’s ratings are to the expected ratings. including positive and negative com- ments about each of the two scales. For this. showing too much variation. This analysis is designed to uncover the latent structure of interrelationships of a set of observed variables. analysing and re-reading. it was further of interest to ascertain how many different aspects of writing ability the raters were able to discern when using the two rating scales. however it is more likely that the rater concerned is overusing certain categories of the rating scale.7 were used. (c) variability in the ratings and (d) what the different traits measure? Downloaded from http://ltj.284 Diagnostic assessment of writing 0 to infinity. The closer the calculated value is to 1. 2009 .sagepub. both the determinant of the R-matrix and the Kaiser-Meyer-Olkin measure of sample adequacy were calculated to ensure suitability of the data to this type of analysis. Each of the four criteria discussed above will be used to compare the two rating scales.

The left-hand column of each Wright map displays the logit values ranging from positive values to negative values. 2009 . the candidates were only spread over three logits. ranging over five logits. In the interest of space. part three of prompt (S6). repair fluency (S2). The second column shows the candidates in the sample. Ute Knoch 285 1 Comparison of individual trait scales The first step was to compare individual trait scales wherever possible (following Table 1). Higher ability candidates are plotted higher on the logit scale. the Wright map shows the traits in each rating scale. For the exist- ing scale. coherence (S9) and cohesion (S10). the narrow columns on the right of each Wright map represent the trait scales (with band levels) in the order they were entered into FACETS. data interpretation (S7). it was further of interest how the two scales as a whole performed. hedging (S5). these are from left to right: organization (S1). More difficult traits are plotted higher on the map than easier traits. Raters plotted higher on this map are more severe than those plotted lower on the map. the candidates were more spread out. For the full results please refer to Knoch (2007a). whilst lower ability candidates can be found lower on the logit scale. although most individual trait Downloaded from http://ltj.sagepub. Therefore. Next. only a sum- mary of the findings will be presented with regard to the compari- son of individual scales. Finally. data interpretation (S5). part three of prompt (S8). lexical complexity (S3). When the raters employed the new scale. the following obser- vations could be made. these are from left to right: accuracy (S1). and fewer raters rating with too much or too little variation. 2 Comparison of whole scales After the individual trait scales were analysed and compared. The next column in the Wright map represents the raters. data description (S4). sentence structure (S7). style (S3).com by Green Smith on April 9. For the new scale. First of all. data description (S6). The findings for the comparison of the individual trait scales generally showed that the trait scales on the new scales resulted in a higher candidate discrimination. greater rater reliability. when the raters used the existing scale. paragraphing (S4). Figures 1 and 2 present the Wright maps of the two rating scales. cohesion (S2). grammatical accuracy (S8) and vocabulary/spelling (S9). smaller differences between rat- ers in terms of leniency and harshness. When the two Wright maps were compared.

.1 | S...3 | S... . ............ ... | | | --...6 | S....... | --...... ..... + + + + + | | | | | 8 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --....* * | | *** | 1 5 | Sentence Structure Vocabulary & spelling | | | | | | --..... . ........ . .- Figure 1 Wright map of DELNA scale ..7 | S. ... .4 | S........| | | | | | | | | | 8 | 8 | | | | | | | | | | | | | | | | | | 8 | | | | | * | | | | | | 8 | | | | | | + 2 + *** + + + --....... ..| | | | | *** | | | | | | --......6 | S....... . ... ..... .. . .......................... | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --..... | | | | | * 0 * ******* * 4 7 * Grammatical accuracy Interpretation * * --.. ... | | | | | | | | | | | | | | | | | | | | | | | * | | | | | | | | | | | | + -2 + * + + + (4) + (4) + (4) + (4) + (4) + (4) + (4) + (4) + (4) + ..... ...... * * --.. .+ + + + 8 + + + 8 + 8 + | | *** | | | | | | | | | | | | | | * | | | | | | | | | | | | | | **** | | | | --..2 | S.. ...... ...........1 | S..... .. ........ .5 | S....... ... ... | | | *** | | | 7 | | | | | | | | | + 1 + ** + + + + + + + + + + + + | | ***** | | | | 7 | 7 | | | --.............. .... .............. . .. | | --.... . .........4 | S......| | | | | | *** | | Part three | | | | | 7 | | 7 | | | | | ***** | 2 | | | | | 7 | | | | 7 | | Diagnostic assessment of writing | | **** | | | | | | | | | | | 7 | | | **** | | Data | | | | | | 7 | | | | | | **** | 10 3 | | | | | | | | | | | | | ***** | 6 | | --.- |Measr | *=1 | -Rater | -Item | S...... ....... .com by Green Smith on April 9. ..* * * --... .- |Measr | +Candidate | -Rater | -Item | S......5 | S. ...... ...| | | --.* --. ........ | | | ** | | | | --....| --... .. ..| | | --..8 | S.....9 | .. ...| | | | | | | | *** | | | | | | | --. | | | ***** | 8 | Conesion | | | | | | | | | | | | **** | | Organisation Style | | | | | | | | | | | | *** | | | | | | | 6 | | | | | Downloaded from http://ltj..... .... ...... | | | | | | | | | ** | | | | | | | | 8 | --.....| | --. .. .. 2009 | | ***** | | | | | | 6 | | 6 | | 6 | | | | ** | 9 | | | 6 | 6 | | | | 6 | | 6 | | | ** | | | 6 | | | | | | | | | + -1 + * + + + + + + + + --. | | | --.| | | --..sagepub.. . ..........+ + + + | | ** | | | | | | | | | | | | | | *** | | | | | | | --.......7 | S.8 | S.... . | | | | 5 | | | | | | * | | | | | --...| | | | * | | | | | | --. .................. ... . ....... ................3 | S. ...- 286 + 4 + + + + (9) + (9) + (9) + (9) + (9) + (9) + (9) + (9) + (9) + | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --.9 | ... | --. .... .....| | | | | * | | | | | | | | | | | | + 3 + + + + + + + + --....| | --. ..... .. .2 | S...

--.* * * * * 6 * 6 * * | | ****** | 1 8 | | | | | | | | | | | --.| | | | --. ---------. ----- + 2 + + + + (9) + (9) + (8) + (9) + (9) + (9) + (8) + (8) + (8) + (8) + | | | | | | | | | --.10 | -----.1 | S.6 | S.6 | S.| 8 | | | | --.8 | S. ------------------------. ---.| | --. ----- | Measr | *=2 | -Rater | -Item | S. -. ----- |Measr | +Candidate | -Rater | -Item | S. --. -----. | | | | 5 | | | | | | | | | | | | | | | | | | + -1 + + + + + --. | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 8 | | | | | | | | | | | 8 | --. ---. by Green Smith on April 9. -.| | | | | | | | | | | | | | 7 | | | | | | * | | | | | | | 8 | | | | | | | | | | | 8 | | | | | | | | 7 | | + 1 + + + + + + + + + --. ---. ---.| | | | | *** | | Coherence Interpretation | 6 | | | | | | --. | | | --.| --.1 | S. ---. ---------.4 | S. ------------------------. --. -. -. -.4 | S. --.| 6 | | | | --. -. -. --.7 | S. ---. -. -. ---. --. ---. ------------------------. --. -.| | --. ------.| | | Downloaded from http://ltj.5 | S.9 | S. -. --.| | | | | | --.sagepub. | | | | | | | * | | | --.9 | S.3 | S. -. ----- 287 Figure 2 Wright map of new rating . --. --------------------.2 | S.3 | S. ---.+ + + + + + + + 5 + | | * | | | 5 | | | | | | | | 5 | | | | | | | | | | | 5 | | | | | | | | | | | | | | | | | 5 | | | | | | | | | | 5 | --. | 6 | | | | | | **** | 2 3 7 | | 7 | | | | --.5 | S. ---------. --. | 7 | | | | | * | | | | | | | | | | | | | | | *** | | | | | 7 | | | 7 | | | | | | | | | | --.| | | | | ** | 6 3 7 | | | | | | | --. -.10 | -----. | | | | ** | | | | | 6 | | | | | | | | | | | | Cohesion | | 6 | | | --.2 | S. ---------. -. | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | + -2 + + + + (4) + (4) + (5) + (5) + (4) + (4) + (4) + (4) + (4) + (4) + Ute Knoch -----. --. ------. -. ---. ---. --------------------. | | | | *** | | Accuracy Hedges | | | | | 7 | | | | | | | | *** | | Lexical Complexity Part three | | | | | | | | --. --------------------. 2009 | | * | | | | | | | | 5 | | | | | | | | | | --. ------. | --. ---.| | | * | 4 9 | | --. --.7 | S. ---. ---. --------------------. ------. | | --. | | | 7 | | 6 | | | | | | | ***** | | | | | | | 6 | | | | | | | | ** | | Data | | --.8 | S. | | | | | --. + + + + + | | * | | | | | | | | | | | | 7 | | | | | Repair fluency | | | | | --. ------------------------. ---. ---.| | | | | | * 0 * ** * 10 5 * Paragraphing * * 7 * --.

47 Rater point biserial: 0.9%).21 (for the most lenient rater). a difference of one and a half logits. was lower when the new scale was used. When employing the existing scale.2% Variation in ratings: Variation in ratings: % Raters infit high: 40% % Raters infit high: 0% % Rater infit low: 10% % Rater infit low: 0% Trait statistics: Trait statistics: Spread of trait measures: 0. displayed in Table 2 below.34 Rater separation and reliability: Rater separation and reliability: Rater separation ratio: by Green Smith on April 9. a difference of just over one logit. ranging from . When the different traits were examined on the two Wright maps. it seems that as a whole.3. That the raters rated more similarly in terms of severity could also be seen by the inter-rater reliability statistics in Table 2. the candidate separation ratio.9% Exact agreement: 51. however.78 (for Data – part three) to −.19 Rater point biserial: 0. On the DELNA scale the traits spread from .38 Exact agreement: 37.78 on the logit scale for repair fluency to −. a range of less than half a logit. the raters were spread from .47 Trait fit values: data and part Trait fit values: repair fluency and three much over 1. The rater point biserial correlation coefficient. For the existing scale. 2009 . Their severity measures (in logits) ranged from .sagepub. which showed that the exact agreement was higher when the new scale was used (51. it became clear that the traits on the new scale were slightly more spread out in terms of difficulty.37 Spread of trait measures: 0. This is also con- firmed by the first of the rating scale statistics for the whole scale.76 Trait separation: 9.25 (for the harshest rater) to −.74 for cohesion.64 to −. no low data slightly high.15 Candidate separation ratio: 5.74 logits. It also became apparent that the raters were a lot less spread out when using the new scale.67 Rater separation ratio: 4. the number of raters displaying too much or too little vari- ability in their ratings was scrutinized. half the raters fell into one of these categories whilst no raters did for the new scale.37 (for style).2%) than when the existing scale was applied (37. Next.14 Trait separation: 12. Table 2 Rating scale statistics for entire existing and new rating scales DELNA scale New scale Candidate discrimination: Candidate discrimination: Candidate separation ratio: 8. a range of nearly one and a half logits.53 to −0. lexis and coherence low Downloaded from http://ltj.78 to −0. the existing scale was more discriminating.288 Diagnostic assessment of writing scales on the new scale were more discriminating.

All other eigenvalues were clearly below 1 (following Kaiser.146 4 . per- sonal communication).341 86.000 Downloaded from http://ltj.694 7.472 2 .645 8 . Table 3 Principal factor analysis: Existing DELNA scale Component Eigenvalue % of variance Cumulative % 1 5.sagepub. Both the scree plots and the tables displaying the results from the PFA show that when the existing rating scale was analyzed. 2009 . 1986) and there was no further level- ing off point on the scree plot. This component had an eigenvalue of 5. a principal axis factor analy- sis (or principal factor analysis – PFA) was performed on the rating data. However. only one major component was found. as is found in Figure 1 with the traits on the existing rating scale.983 72.275 3.657 7.209 2.7 (following Jolliffe.243 6 . but not how many different groups of traits the data was measuring. If the traits were not measuring the same underlying construct.319 7 . 1960) and below . it indicates that raters had difficulty distinguishing between the different traits or that the traits were related or dependent on each other (Carol Myford.455 3 .487 5 . Table 3 (DELNA scale) and Table 4 (new scale) below show the results from the principal factor analysis.8 and accounted for about 64% of the entire variance. Components with low eigenvalues are discarded from the analysis.326 96. When the new scale was analyzed. it is not necessarily a problem to have a bunching up of traits around the zero logit point.472 64. PFA reduces the data in hand into a number of components.691 80.549 by Green Smith on April 9. each with an eigenvalue representing the amount of variance of the components. as they are not seen to be contributing enough to the overall variance.827 98.138 1.528 100.076 94. The fact that the traits in Figure 2 (new scale) were more spread out shows that the different traits were measuring different aspects.472 9 .168 1. then this explains why both the candidate separation of the new scale and the rater point biserial of the new scale were lower than that of the existing scale. Because the results above only indicate that the traits were meas- uring different underlying abilities.415 4.803 64. Ute Knoch 289 In a criterion-referenced situation as was the case when these rating scales were used.756 91.

36 100.5 0 0.5 2 1.64 10 . because only one component was identified for the exist- ing scale.85 8 . a rotation of the data was necessary. no factor loadings can be displayed. A trait was considered to be loading on a factor if the loading was higher than .276 12.4 (as indicated in bold font). accounting for 34% of the variance.17 75. coherence and cohesion.434 34. 2009 .577 5.389 3.10 3 1. lexical complexity.5 5 3. This is.75 9 . was made up of accuracy.817 8.34 34. For this. A varimax rotation was chosen to facilitate the interpretation of the factors of the new scale.0 3 1.sagepub. The next step in the PFA was to identify which variables load onto which component.63 67.154 11.64 4 .0 1 0. The second factor.491 4. However. an Downloaded from http://ltj. This factor can be described as a general writing ability factor.0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 Factor Number Factor Number Figure 3 Scree plots of principal factor analysis Table 4 Principal factor analysis: New scale Component Eigenvalue % of variance Cumulative % 1 3.000 however.77 88.07 7 .763 7. The six factor loadings for the new scale can be seen in Table 5.290 Diagnostic assessment of writing Scree Plot Scree Plot 6 3.34 2 1.7. which accounted for a further 13% of the variance.91 93.44 6 .236 2.28 5 . the results were different. at first glance. The largest by Green Smith on April 9.89 97.54 58. was made up of hedging and interpretation of data. The PFA resulted in six compo- nents with eigenvalues over 0.0 2.76 47.5 Eigenvalue 4 Eigenvalue 2.63 83.863 8.

Research question 2: What are raters’ perceptions of the two different rating scales for writing? The most commonly emerging themes in the interviews were grouped into the following sections: Downloaded from http://ltj. The six factors together accounted for 83% of the entire variance of the score.116 0.061 0.253 0.112 0. which accounted for another 8% of the variance.340 0. which accounted for 9% of the variance.005 0.850 0.072 0.041 Data description 0.021 unusual factor.sagepub.062 0.009 0.090 0.009 Content – Part 3 0. which accounted for 12% of the variance.269 0. the sec- tion in which writers are required to extend their ideas. However.064 00.064 Lexical complexity 0.030 Cohesion 0. That all three parts of the content load on separate factors shows that they were all measuring different aspects of content.097 0.119 −0.045 0.067 −0.062 0. consisted of Part Three of the content.168 −0. It can therefore be argued that the ratings based on the new scale not only accounted for more aspects of writing ability. 2009 .016 0. there was less unaccounted variance when the new scale was used.092 Coherence 0.106 0.056 −0.945 0.037 0.017 0.867 0. but it also accounted for a larger amount of variation of the scores.875 0. a writer who scored high on hedging might also have put forward more ideas in this section of the essay.105 0. the description of data. For this reason.029 −0.035 Repair fluency 0.025 0. The last factor.139 0.009 −0.083 0.030 −0.338 0.030 Interpretation of data 0. The third factor.971 −0.448 −0. it can be argued that writers need to make use of hedging in the section where the data is interpreted since the writer is speculating rather than stating facts.064 0.141 0. In other words. The fourth factor. whilst the single factor found in the analysis of the existing rating scale only accounted for 64% of the data.288 0.037 by Green Smith on April 9.009 −0.796 0.731 0. was another content factor. Ute Knoch 291 Table 5 Loadings for principal factor analysis Component 1 2 3 4 5 6 Accuracy 0.155 −0.992 Hedging 0.174 0.039 0.959 −0. only had paragraphing loading on it. Repair fluency was the only measure that loaded on the fifth factor.089 0. which also accounted for 8% of the variance.004 Paragraphing 0.241 0.

The first strategy that a number of raters referred to in their inter- views was assigning a global score to a script..] Sometimes I look at it [the descriptors] I’m going ‘what do you mean by that?’ [. 2009 . it’s really just a sort of adverbial thingy anyway isn’t it so I think I just go with gut instinct on that one.292 Diagnostic assessment of writing • themes emerging about DELNA scale • themes emerging about new scale 1 Themes emerging about DELNA scale The most regularly emerging theme in the interviews was that raters often experienced problems when using the DELNA scale. So there’s sort of adverbial [sic]. A variety of strategies (both conscious and subconscious) emerged from the interviews. appropriate. which is more holistic than analytic: Downloaded from http://ltj. Yeah. Although most raters reported having problems deciding on band levels with the DELNA scale. In the extract below. Because sometimes it is actually.. These were as follows: • assigning a global score • rating with a halo effect • disregarding the DELNA descriptors. sometimes you have a number of ideas but there is not much support for them and what is sufficient. Rater 5 below describes his rating process. what exactly is support. One of the most commonly mentioned problems was that the raters thought the descriptors were often too vague to arrive easily at a score. I find that tricky by Green Smith on April 9. Problems with the vagueness of the DELNA descriptors were also reflected in the comments by Rater 5 below: Rater 5: [. […] You just can’t.sagepub. In the example below... limited and inadequate. A number of raters pointed directly to the adjectives used as being the problem. Rater 10 talked about the descriptors for vocabulary and spelling: Rater 10: Well there’s always a bit of a problem with vocabulary and spelling anyway in deciding you know the difference between extensive... there is nothing specific there to hang things on. for example.] And here relevant and supported. the methods of coping were quite dif- ferent for different raters. yeah. Rater 4 talked about the problems she encountered when deciding on a score for Content: Rater 4: [. adequate.] You just kind of have to find a way around it cause it’s not really descriptive enough. usually after the first reading.

The DELNA descriptors recommend awarding either a five or a six. if they have no error-free sentences they get a four and I don’t care if it is something that Downloaded from http://ltj. yeah. sorry. this is a native speaker. it is almost like giving a global mark for the three things in consideration. 2009 . And I suspect I often tend to give the same grade or similar grades for cohesion and style. I always automatically think. 2 Themes emerging about new scale The most commonly emerging theme about the new scale was that the raters liked the fact that the descriptors in the new scale were more explicit. Some raters seemed to clearly disregard the DELNA descriptors and override them with their own general impression of a script. this is a non-native speaker. looking at this it ought to be a six.. I did. so I have to make it come out as a seven and I’d say. if I had other reservations about the language and stuff. [. I let go of the sense of this is a seven. I might even go up to a seven. it had a clearer guideline for what I was actually doing. well. yes. Below. will it be suf- ficient for academic writing and then that is sort of borderline between six and seven quite often and then is it a better seven or is it an eight or is it less than a six. then I would give it a six or even a five if it is really bad. again. so I was more confident for giving nines and fours. by Green Smith on April 9. Ute Knoch 293 Rater 5: Mmh. I might well push the score up in a way not to pull them down. How well will this come across.sagepub. maybe a bit of variation.. Because I thought I had something to actually back it up with. Just for the paragraphing. cohesion and style: Rater 10: For style.. Yeah. or is it five. Probably for the whole of fluency. and that it was someone that wasn’t so strong in academic writing but actually had very good English.] So in a way. This overall. but it is possible particularly if I suspected that it was a native speaker. Rater 10: Mmh [. But if I was sort of convinced by the writer in every other way. Rater 10 (below) was talking about the score she would award for organiza- tion to a script that had no clear paragraph breaks but was otherwise well organized. where a rater awards the same score for a number of categories on the scale.. This is evident in the following extracts from the interviews: Researcher: Do you feel you used the whole range there [accuracy in the new scale]? Rater 10: Yes. Rater 10 talked about awarding scores for the three categories grouped under fluency in the DELNA scale.. if someone had no paragraphing. but I [. With. I think I would.] yeah.] well. And I think also because I didn’t. holistic type rating often results in a halo effect. I think I would be more likely to. (sighs). I just tend to go with the gut instinct. but everything else was good..

we need to turn to the limited literature on diagnostic assessment. I found that it [the new scale] is quite different to the DELNA one and it is quite amazing to be able to count things and say.294 Diagnostic assessment of writing might otherwise get a six or a seven and yes. style wasn’t really considered. I know exactly which score to use now. These are shown in Table 6. To establish construct validity for a rating scale used for diagnostic assessment. Downloaded from http://ltj. I have no problems with that. Four of Alderson’s 18 statements are cen- tral to rating scales and rating scale development. argued that a simple count of hedging devices could not capture variety and appropriateness: Researcher: You said that. The following section aims to discuss these results in light of the overarching research question: To what extent is an empirically developed rating scale of academic writing with level descriptors based on discourse analytic measures more valid for diagnostic writing assessment than an existing rating scale? V Discussion DELNA is a diagnostic assessment system. other than hedging. Whilst the comments about the new scale reported above shed a positive light on the scale.sagepub. The idea of being able to arrive precisely at a score was also echoed in the following comment by Rater 7: Rater 7: It is interesting. for example. 2009 . Alderson (2005) compiled a list of features which distinguish diagnostic tests from other types of tests. it does seem a bit limited. a less positive comment was also made by the raters. But maybe that is the inter-rater reliability issue coming up. Above. if they can write completely error-free then I can give them a nine. Rater 5: Yeah. the results for research questions 1 and 2 were by Green Smith on April 9. Three raters criticized the fact that some information was lost because the descriptors in the new scale were too specific. Rater 5. […] I suppose that is similar [to the DELNA scale] it sort of relies on the marker’s knowledge of English in a more kind of global way sort of. This section will discuss each of Alderson’s four statement in turn and then focus on the raters’ perceptions of the two scales.

Diagnostic tests are designed to identify strengths and weaknesses in a learner’s knowledge and use of language. the existing scale seemed to lend itself to a more holistic approach to rating. Therefore. Diagnostic tests should enable a detailed analysis and report of responses to items or tasks. Lumley. the PFA showed that the new scale distinguished six different writing factors.. impressionistic type rating. the rating scale descriptors do not offer raters sufficient information on which to base their decisions and so raters resort to a global impres- sion when awarding scores. whilst the current DELNA scale resulted in one large factor. Diagnostic tests are more likely to be . they would resort to more global. It is possible that. It was possible to show that simply providing raters with more explicit scoring criteria can significantly reduce this effect. This then would explain why. focussed on specific elements than on global abilities. Some studies have in fact found that raters display halo effects only when encountering problems in the rating process (e. Diagnostic tests thus give detailed feedback which can be acted upon. It could therefore be argued that Downloaded from http://ltj. as hypothesized in this study. found that when rat- ers could not identify certain features in the descriptors.. 4. However. Diagnostic tests are designed to identify strengths and weaknesses in a learner’s knowledge and use of language. Vaughan.sagepub. Although developed as an analytic scale. it could be argued that the new scale was more successful in identifying differ- ent strengths and weaknesses. 2. The main reason that the ratings based on the DELNA scale resulted in only one factor was the halo effect displayed by most by Green Smith on April 9. Lumley. for example. 3. Alderson’s first statement calls for diagnostic assessments to identify strengths and weaknesses in a learner’s knowledge and use of language. This study sug- gests that the halo effect and impressionistic type marking might be more widespread than has so far been reported. 2002. Both rating scales compared in this study were analytic scales and were designed to identify strengths and weaknesses in the learners’ writing ability. Statement 1. the raters were able to discern distinct aspects of a candidate’s writing ability. 2009 . 1991). when using the empirically developed new scale with its more detailed descriptors.g. Ute Knoch 295 Table 6 Extract from Alderson’s (2005) features of diagnostic tests 1.

More detailed suggestions on what score report cards could look like are beyond the scope of this paper. as evident in the quantita- tive analysis. centred around the middle of the scale range. the section on academic style could suggest the use of more hedging devices or students could be told how they could improve the coherence of their essays rather than just being told that their writing ‘lacks academic style’ or is ‘incoherent’. Statement 4. Alderson’s (2005) second and third statements assert that diag- nostic assessments should enable a detailed analysis and report of responses to tasks and that this feedback should be in a form that can be acted upon. Further longitudinal research is needed to determine whether this is indeed the case. but also a rating scale by Green Smith on April 9. A score report card based on the new scale could be designed to make clearer suggestions to students.sagepub. Both rating scales lend themselves to a detailed report of a candidate’s performance. However. if the raters at times resort to a holistic impression to guide their marking when using the DELNA scale. then this information is likely to be less useful to students than if they are presented with a more jagged profile of some higher and some lower scores.] focussed on specific elements than on global abilities. For example. If a diagnostic test of writing should focus on specific elements. It is possible that extended use of the new scale might also result in more holistic rating behavior.. 2009 . However. this will reduce the amount of detail that can be provided to students. If most scores are. but can be found in Knoch (2007a). Alderson’s fourth statement states that diagnostic tests are more likely to be focussed on specific elements rather than on global abilities. Statement 2. then this needs to be reflected in the rating scale. for example. Diagnostic tests thus give detailed feedback which can be acted upon.. Diagnostic tests are more likely to be [. the descriptors need to lend themselves to isolating more detailed aspects of a writing performance. The descriptors Downloaded from http://ltj. Therefore. what was not established in this study was whether the raters rated analytically because they were unfamiliar with the new scale. Diagnostic tests should enable a detailed analysis and report of responses to items or tasks and Statement 3.296 Diagnostic assessment of writing the halo effect is not necessarily only a rater effect.

sagepub. Raters perceptions of the scale usefulness are important as they provide one perspective on the construct validity of the scale. the following conclusions can be drawn. otherwise the diagnostic power of the assessment is by Green Smith on April 9. As language experts they are well qualified to judge whether the writing construct is adequately represented by the scale. And how well do they do that. Returning to the overall purpose of this study. To arrive at a truly diagnostic assessment of writing. was influencing their behaviour. This might result in raters displaying less of the halo or central tendency effects. 1 Raters’ perceptions of the two scales Finally. We need. therefore. such as IELTS. so that they recognize the importance of rating each aspect of writing separately. Rater 10 commented: I think I prefer the existing DELNA scale because I like to mark on ‘gut instinct’ and to feel that a script is ‘probably a six’ etc. In the course of the interviews it became apparent that most raters treated DELNA as a proficiency or placement test rather than a diagnostic assessment. 2009 . but I try not to focus too much on them. For example. Similarly. Bachman and Palmer (1996) suggest that ‘the most important consideration in designing and developing a language test is the use for which it was intended’ (p. It was a little disconcerting with the ‘new’ scale to feel that scores were varying widely for different categories for the same script. Rater 5 mentioned in his interview: I notice these things [features of form] as I am reading through. all categories in an analytic scale need to be reported back to stakeholders individually. to remember that the purpose of this test is to provide Downloaded from http://ltj. in an earlier phase of this research. 17).’ It seems therefore that the diagnostic purpose of the assessment was not clear to them and/or that their experience of rating different kinds of tests. been found to discriminate between texts at different levels of proficiency. Ute Knoch 297 of the new scale were more focussed on specific elements of writ- ing because they were based on discourse analytic measures which had. I try to go for broad ideas and are they answer- ing the question. it was also important to establish the raters’ perceptions of the efficacy of the two scales for diagnostic assessment. The findings of this study suggest that raters need to be made aware of the purpose of the assessment in their training sessions. Are they communicating to me what they need to communicate first of all.

VI Conclusion The findings of this study have a number of implications. Overall. Test developers should therefore strive to maximize overall usefulness given the constraints of a particular context. as Bachman and Palmer (1996) and Weigle (2002) by Green Smith on April 9. Practicality is always a crucial consideration. it could be argued that construct validity is central (as is the case in most assessment situations). 2009 . One aspect not reported on previously has to do with practicality. Weigle’s summary table can be expanded in the following manner (Table 7). the new scale proved only minimally more time-consuming. construct validity should not be sacrificed simply to ensure practicality. a priori developed scales and more detailed. empirically developed scales. The first refers to the classification of rating scale types commonly found in the literature. Therefore. Overall. Most evidence speaks in favour of the new scale. Some raters even reported being able to use the new scale faster. Two aspects of practicality have to be taken into consideration: (1) practicality of scale development and (2) practicality of scale use. it is also necessary to distinguish two types of analytic scales: less detailed. it is impossible to maximise all aspects of test usefulness. Researchers and practitioners need to be made aware of the differ- ences between analytic scales and need to be careful when making decisions about the type of scale to use or the development method to adopt. this study seems to suggest that although these two types of scales are distinct. the new scale has been shown to generally function more validly and reliably in the diagnostic context that it was trialled in than the pre-existing scale. Weigle (2002). In terms of practicality of use. however. it can be argued. In the context of DELNA. as well as many other authors distinguishes between holistic and analytic rating scales However. Downloaded from http://ltj. It is clear that the scale development process for an empirically developed scale is more laborious. a diag- nostic assessment. Since each context is different. rather than try to maximize all qualities. but wherever possible.sagepub. But. the importance of each quality of test usefulness varies from situation to situation. The task of the test developer is to determine an appropriate balance among the qualities in a specific situation.298 Diagnostic assessment of writing detailed diagnostic information to the stakeholders on test takers’ writing ability. that the new scale is less practical.

but might for diagnostic be used holistically purposes. developed analytic scales. The table provides a short description of the purpose of each test. scale. Construct Holistic scale Analytic scales Higher construct Validity assumes that all more appropriate validity as based relevant aspects as different aspects on real student of writing develop of writing ability performance. For example. Time-consuming. most easy. useful for rater training. scores correlate with superficial aspects such as length and by Green Smith on April 9. may useful diagnostic than intuitively be misleading for information developed analytic placement and may for placement. 2009 . assumes at the same rate develop at different that different aspects of and can thus be rates. impression. Table 8 presents different purposes for which writing tests are administered. expensive. holistic effect. score. Ute Knoch 299 Table 7 Extension of Weigle’s (2002) table to include empirically developed analytic scales Quality Holistic scale Analytic scale Analytic scale – intuitively developed – empirically developed Reliability Lower than analytic Higher than holistic. The need for these different types of scales is a consequence of the way the scores are reported. Another implication relates to score reporting. Results of tests Downloaded from http://ltj. especially useful not provide enough instruction and for rater training. But raters writing ability develop captured in a single might rate with halo at different speeds. whilst for a proficiency test it might be less important if the rating scale in use is holistic or analytic (as long as it results in reliable ratings). Higher than intuitively but still acceptable. Impact Single score may More scales Provides even more mask an uneven can provide diagnostic information writing profile. what type of rating scale might be used and how the score should be reported. Practicality Relatively fast and Time-consuming. Authenticity White (1985) Raters may read Raters assess each argues that reading holistically and adjust aspect individually. relevant information diagnosis. expensive. holistically is a more analytical scores natural process than to match holistic reading analytically. by raters. the rating scale used in diagnostic assessment would need to be analytic and at the same time should provide a dif- ferentiated score profile.sagepub.

L. Fundamental considerations in language testing. Oxford: Oxford University Press. designed to focus on specific rather than global abilities. The author has attempted to show that not all analytic rating scales can be assumed to be func- tioning diagnostically and that scale developers and users need to be careful when selecting or developing a scale for diagnostic assessment. of writing proficiency are usually only reported as one averaged score but. The uniqueness of diagnostic assessment has recently again been high- lighted by Alderson’s (2005) book Diagnosing foreign language proficiency and therefore this study is an important contribution to the writing assessment literature. S. Bachman. as Alderson (2005) suggests. (1990). Downloaded from http://ltj. (1996). Diagnosing foreign language proficiency. 2009 . the score profiles of diagnos- tic tests should be as detailed as possible and therefore any averaging of scores is not desirable. Bachman. London: Continuum. in differentiated designed to provide scores across detailed feedback traits.sagepub. A. this study has been able to show that a rating scale with descriptors based on discourse-analytic measures is more valid and useful for diagnostic writing assessment purposes. Oxford: Oxford University Press. The interface between learning and by Green Smith on April 9. general writing analytic. Overall. F. Language testing in practice. ability of students.300 Diagnostic assessment of writing Table 8 Rating scales and score reporting for different types of writing assessment Purpose of Definition Rating scale Score reporting writing test Proficiency test Designed to test Holistic or One averaged score. VII References Alderson. writing ability. C. (2005). & Palmer.. L. F. separate strengths and analytic and for each trait weaknesses in needs to result rating scale. which students can act upon. Diagnostic test Designed to identify Needs to be In detail.

G. 12(2).. ‘Little coherence. (2007c). Thousand Oaks. Scoring procedures for ESL contexts. U. (1960). T. 15– by Green Smith on April 9. D.. & Powers. B. R. TOEFL Monograph Series 22.sagepub. A. (2007). In L. C. (1986). CA: Sage. 173–194. Downloaded from http://ltj. (1991). Elder. Validity and validation in language testing. Handbook of research in second language teaching and learning. Designing and conducting mixed methods research. Davies. Testing second language speaking. Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters’ decision making and development of a preliminary analytic framework. D. Hamp-Lyons. (2003).. Princeton.). Exploring the utility of a web-based English language screening tool. L. Norwood. (2008). C. Do empirically developed rating scales function differ- ently to conventional rating scales for academic writing? Spaan Fellow Working Papers in Second or Foreign Language Assessment. Creswell. W. Theory and practice of writing. J. Cohen (Eds. Fulcher. NJ: Educational Testing Service. Melbourne Papers in Language Testing. E. Elder. F. Fulcher. Grabe. University of Auckland.). & von Randow. The assessment of academic style in EAP writing: The case of the rating scale. (2007b). (2007a). Elder. (2007). G. NJ: Lawrence Erlbaum. (2003). & Plano Clark. Evaluating rater responses to an online rater training program. Bachman & A. H. 13(2). 13(1). Language Assessment Quarterly. Interfaces between sec- ond language acquisition and language testing research (pp. Knoch. Cumming. 5(3). (1996). (1998). G. (2005). Kantor. U. 37–64.... In E. 112–140). Describing language development? Rating scales and SLA. Hinkel (Ed. Principal component analysis.. Hamp- Lyons (Ed. U. considerable strain for reader’: A comparison between two rating scales for the assessment of coher- ence. L. G. C. & Von Randow. (1996). 20. Kaiser. F. London: Longman/ Pearson Education. V. Knoch. Barkhuizen. Language Testing. R. J. Does thick description lead to smart tests? A data- based approach to rating scale construction.). Assessing Writing. 1–36. Jolliffe.. Mahwah. J. 108–128. Unpublished PhD. In L. 34–67. 2009 . Assessing second language writing in academic contexts. The application of electronic computers to factor analysis. Language Testing. C. New York: Longman. (2001). NJ: Ablex. Cambridge: Cambridge University Press. 5. U. New York: Springer- Verlag. & Kaplan. Knoch. Ute Knoch 301 Brindley. Knoch. 12(1). 24(1). Knoch. Diagnostic writing assessment: The development and vali- dation of a rating scale. A. Educational and Psychological Measurement. The DELNA initiative at the University of Auckland. TESOLANZ Newsletter. I. (2008). & Elder. U. 208–238.. 141–151.

Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing.. U. Scaling descriptors for language profi- ciency scales. Linacre. Assessing second language writing in academic contexts. C. C. E. W. (2000). E. (2007). Myford. Norwood. (2002)... J. T. 12. Mickan. E. 26–43. TESOL Quarterly. T. (1991). J. Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Constructing rating scales for second language tests. (1995). White.. R. Listening to the voices of rating scale developers: Identifying salient features for second language performance assessment. P. Turner. Essex: Pearson Education. (1982). Canberra: IELTS Australia. 36(1). Monitoring sources of variability within the Test of Spoken English assessment system. Annual Review of Applied Linguistics. NJ: Educational Testing Service. Chicago: MESA Press. & von Randow. 555–584. Language Testing.. J.sagepub. The Canadian Modern Language Review. Lumley. & Turner. 22. Assessing writing. Downloaded from http://ltj. Upshur. 49(1). J. North. Language Testing. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing. Teaching and assessing writing. Weigle. B. Rating scales derived from student samples: Effects of the scale maker and the student sample on scale con- tent and student scores. 16(1). A. J. 246–276. Turner. E. by Green Smith on April 9. (2004). Facets Rasch measurement computer program. McNamara. C. (2002). (2003).. (2002). 3–12. 19(3). G. Read. NJ: Ablex. (2000). & Wolfe. N. E. 56(4). T. Chicago. Measuring the coherence of writing using topic-based analysis. C.. C. Harlow. M. Discourse and assessment. G.. (1999). Thienpermpool. Upshur. M. 15(2).302 Diagnostic assessment of writing Knoch. S. & Masters. & Upshur. ‘What’s your score?’ An investigation into language descriptors for rating written performance. Holistic assessment: What goes on in the rater’s mind? In L. (1996). 82–111. 221–242. IL: Winsteps. Hamp-Lyons (Ed. Rating scale analysis. San Francisco: Jossey- Bass. 85–104. Assessing Writing. (1998). & Turner.). P.. J. Watson Todd. McNamara.. C. Vaughan. D. C. 2009 . Cambridge: Cambridge University Press. (2006). M. 49–70. Measuring second language performance. E. B. A. 9. 217–263. & Schneider. S. A. Wright. & Keyuravong. (1985). ELT Journal.

Appendix 1: Abridged DELNA scale 9 8 7 6 5 4 FLUENCY Organization Essay fluent – well organised Evidence of organization Little organization – possibly no – logical paragraphing – paragraphing may not be paragraphing entirely logical Cohesion Appropriate use of cohesive Lack / inappropriate use of Cohesive devices absent / devices – message able to be cohesive devices causes some inadequate / inappropriate followed throughout strain for reader – considerable strain for reader Style Generally academic – may be Some understanding of Style not appropriate to task slight awkwardness academic style CONTENT Description of Data described accurately Data described adequately / Data (partially) described / may data may be overemphasis on figures be inaccuracies / very brief / inappropriate Interpretation Interpretation sufficient / Interpretation may be brief / Interpretation often inaccurate / of data appropriate inappropriate very brief / inappropriate Development Ideas sufficient and Ideas may not be expressed Few appropriate ideas expressed / extension of by Green Smith on April 9. appropriate / may be Limited. possibly inaccurate / Range and use of vocabulary Ute Knoch spelling few minor spelling errors inappropriate vocab / spelling errors inadequate. Some may lack clearly or supported appropriately – inadequate supporting evidence ideas obvious relevance – essay may be short – essay may be short FORM Sentence Controlled and varied Adequate range – errors in complex Limited control of sentence Downloaded from http://ltj. Errors in word formation & spelling cause strain 303 . 2009 Structure sentence structure sentences may be frequent structure Grammatical No significant errors in syntax Errors intrusive / may cause Frequent errors in syntax cause accuracy problems with expression of ideas significant strain Vocabulary & Vocab.

some figures) incomprehensible figures) (most trends. most Diagnostic assessment of writing figures) Interpretation Five or more relevant reasons and/or supporting No reasons provided of data ideas Part 3 of task Four or more relevant ideas No ideas provided Coherence Coherence Writer makes regular use of super structures. there mechanically) compared to text length. Writer might use this/these to refer to ideas more than four times . Few incidences of unrelated progression Infrequent: sequential progression No coherence breaks and superstructure Cohesion Cohesion Connectives used sparingly but skilfully (not Writer uses few connectives. Appendix 2: Abridged new scale 304 9 8 7 6 5 4 Accuracy Accuracy Error-free Nearly error-free Nearly no or no error-free sentences Fluency Repair No self-corrections No more than 5 self.sagepub. and often is little cohesion. Frequent: Unrelated by Green Smith on April 9. 2009 sequential progression and possibly indirect coherence breaks and some progression. extended progression. describe a relationship between ideas This/these not or very rarely used. Downloaded from http://ltj. More than 20 self-corrections Fluency corrections Complexity Lexical Large number of words from academic wordlist Less than 5 words from AWL / uses only Complexity (more than 20) / vocabulary extensive – makes very basic vocabulary use of large number of sophisticated words Mechanics Paragraphing 5 paragraphs 4 paragraphs 1 paragraph Reader-Writer Hedges More than 9 hedging 7–8 hedging devices No hedging devices Interaction devices Content Data All data described (all Most data described (all Data description not attempted or description trends and relevant trends.

sagepub. Language Testing http://ltj. 305 DOI: Published by: Citations by Green Smith on April 9.1177/0265532208101011 The online version of this article can be found at: Subscriptions: Test review: BEST Plus spoken language test Alistair Van Moere Language Testing Reprints: http://www.sagepub.sagepub. Downloaded from http://ltj. 2009 .com Additional services and information for Language Testing can be found at: Email Alerts: http://ltj.nav Permissions: http://www.sagepublications.

sagepub. scripted oral interviews featuring one test administrator and one candidate. everyday language within topic areas generally cov- ered in adult education courses. which is the accountability system for federally funded ESL and adult education in the USA. www. 2009. depending on the proficiency level of the candidate. Tests are face-to-face. – Language Testing 2009 26 (2) 305 322 Test reviews BEST Plus spoken language test Alistair Van Moere Pearson Knowledge Technologies.00 depending on the quantity purchased. Publisher: Center for Applied Linguistics (CAL). Reprints and Permissions: http://www.000 for a CAL trainer to give a one-day training workshop to 20 test administrators. It can cost $3. Price: Tests cost $1. The test is used to mea- sure language gain in individuals as well as the overall effectiveness of a language program through a pretest–posttest process. Washington. and to have test administrators trained. Language Complexity. 2009 . the scores are benchmarked to Student Performance Levels (SPLs) and to the National Reporting System (NRS) ESL Functioning Levels.sagepub. An integrated test of listening and speaking. Length and administration: Tests last for between 5 and 20 min- DOI:10. it evaluates learners’ performance on a rating scale consisting of three criteria: Listening Comprehension. USA Test purpose: To assess the ability to understand and use unprepared. © The Author(s). 2008).50 to $1.1177/0265532208101011 Downloaded from http://ltj. Scores: Reported on a scale of 88– by Green Smith on April 9. but institutions need to pay for the test administrators’ time. The BEST Plus is one of several standardized assessments approved by the National Reporting Sys- tem (NRS. and Com- munication.

such as personal identification. The test admin- istrator asks the question presented on the computer screen. Each time a new topic is initiated. General expansion. As the candidate progresses through the test. the first item presented is an Entry Question or Photo Description. Yes/no question. Entry item. The print-based format is also semi-adaptive. Elabora- tion (see Table 1). The items in the print-based forms of the test are all drawn from the item pool of the computer-adaptive version. The level locator is used to determine which level to adminis- ter. The test comprises seven item types which vary in their cognitive and linguistic demands: Photo description. Personal expansion. followed by increasingly challenging questions if the candidate is able. after which a new topic is presented and questions again spiral in difficulty. There are three forms (A. and then three tests of varying levels – low. for example when shown a picture to describe. In the computer-adaptive format. Choice question. but the candidate does not operate the computer. There are 13 general topic domains.sagepub. listens to the candidate’s response and immediately enters a rating of the per- formance. 2009 . the first six warm-up items presented are always on the topic of per- sonal identification (in the print-based test these are used to locate the candidate’s level). The computer-adaptive system determines the difficulty level of the next question.306 Test reviews I Description The BEST Plus is available in two face-to-face formats: computer- adaptive and print-based. housing and transportation. intermediate and high diffi- culty. higher ability candidates quickly move to open-ended expansion questions while lower ability candidates move to questions which provide Downloaded from http://ltj. 2005a).com by Green Smith on April 9. The development of the computer-adaptive algorithm is described in the BEST Plus Technical Report (CAL. test items are delivered via a computer with a BEST Plus CD. To obtain a scaled score report. B and C) so that different forms may be used for pre.and post- testing. The candidate is presented with a relatively easy item on a selected topic. the test administrator must enter the raw print-based scores into a computer which is loaded with the test CD. Each form consists of an initial six-item ‘level locator’. health. The test tasks are defined by their content as well as their cognitive and linguistic demands. It seems that the linguistic and psychometric theory underpinning the design of the instrument is similar to that of the ACTFL OPI. At times the candidate is invited to look at the computer screen.

The adaptive scoring algorithm is sophisticated yet easy to understand. A candidate may encounter up to seven different topically-organized groups of ques- tions in a test administration. if for six consecutive questions the candidate’s estimated ability remains below a scale score of 330 (corresponding to the lowest band on the scale descriptors). Downloaded from http://ltj. if the standard error falls below . while still guaranteeing scoring precision. the average performance of the 2400 students. What do you think about news reports in the United States? more support such as choice or yes/no questions. The starting abil- ity estimate expressed in logits for each candidate is 0. There are three rules which stop the presentation of items and end the by Green Smith on April 9. The candidate’s ability estimate is updated after the scoring of each item presented. Rasch modeling was used to calibrate the diffi- culty of the 258 items in the BEST Plus pool. Test reviews 307 Table 1 The seven item types of the BEST Plus and sample questions (CAL. First. if the maximum number of 25 questions is reached. while still maintaining a reasonable level of precision. II Reliability. Entry item I usually get the news from a newspaper. 2005b) Item-type Sample question Photo description Tell me about this picture.sagepub. This rule ensures that the test is no longer than it has to be. validity and test score interpretation The field test involved 25 language programs. 40 test administrators and 2400 students. How about you? Yes/no question Do you like to watch the news on TV? Choice question Do you like getting the news in English or in your language? Personal expansion What are some other ways you get the news? General expansion Do you think it’s important to keep up with the news? Why/ Why not? Elaboration Some people think that news reports in the United States are unreliable and show only one side of an issue. Third. 2009 . This rule ensures that tests do not go on for too long. Others think that the news is accurate and unbiased. Second.2 on a 7-logit scale (about 3% of the scale). This rule ensures that candidates of very low level are not over-tested.

9’s. Even when can- didates are stopped out of the test by the third rule (i. a problem inherent in all face-to-face interview tests (even more so for internationally delivered tests such as the IELTS). for example.3 (or 30 points) (CAL. 2005a. the standard error for every adaptive test score is recorded in the BEST Plus Scores Database. should it be logistically possible to transport students from one location to another between tests. would exhibit lower levels of inter-rater reliability. however. 32 candidates took the test on two successive occasions. It is likely that test administrators in programs across the United States. would also likely be lower than reported here.sagepub. To establish reliability.308 Test reviews To briefly elaborate the concept of precision. This is.2 of a logit on a 7-logit scale represents a standard error of 20 points on the BEST Plus scaled score range of 88 to 999. that one’s actual test score lies in a range +/−5 around their reported test score. than it is to be told that test reliability is . Downloaded from http://ltj.95 when applying KR20.89. and a novice scorer). of course. accessible on the test administrator’s computer. Correlation on raw test scores among the rater pairings were consistently in the high . Validity evidence as gathered in several studies is documented in the BEST Plus Technical Report (CAL. p. Much to the credit of the BEST Plus. with varying expectations about their own population of students’ ability. Test–retest reli- ability. an expert scorer. and further that the two forms of the adaptive test ‘may be considered parallel’ (CAL. The correlations are high. 14). 2005a) – a very well-written and comprehensive booklet. a standard error of . Unfortunately it is not reported on the candidate’s score report (our profession has not yet the courage to go that far). As Luoma points out (2004. with three examiners on each occasion (a test by Green Smith on April 9. especially given that linguistically homogenous groups cluster in certain geographical locations and test administra- tors become accustomed to the speech styles they hear locally. it is unfortunate that the technical report only refers to reliability as established under a controlled experiment in one location and with such a small number of participants. 2005a. they are pre- sented with 25 questions) standard error will rarely exceed . p. 13). Test–retest reliability for the two sets of scores was .e. It is far easier to under- stand. 2009 . 183) standard error is a much under-used statistic since stakeholders can readily understand the concept of a band of confidence around their test score within which their ‘true score’ lies. The Technical Report concludes that the degree of inter-rater reliability that can be achieved is quite high ‘even for novice raters’. p.

which provide criterion-referenced descriptors of proficiency. correlation of scores between test forms ranged from . demonstrating the usefulness of the test as a placement instrument in ESL programs. The 24 programs that partici- pated in the field test provided information about which classes 1866 of their students were eventually placed into. a study reported in the CAL Digest (2007) established the relationship between educational level gain on the BEST Plus and the number of instructional hours received: 53% of candidates made a level gain if they received less than 60 instructional hours. since although the language of each question is carefully controlled.72. Test reviews 309 Standardization of listening assessment would also seem to be prob- by Green Smith on April 9. In an otherwise thorough val- idation report. The relationship between performances on the BEST Plus and the SPLs was determined through a standard-setting Downloaded from http://ltj. on the basis of existing measures or sorting procedures. A correlation coefficient was calcu- lated for each program showing the relationship between placement levels and scores on the BEST Plus. however. task selection. Separately. the BEST Plus approach is to be commended for advances in the elimination of sources of variance associated with test task. Nevertheless. SPLs range from 0 (no ability) to 10 (equal to a native speaker). The average coefficient across all programs was an encouraging .96 (Form A and C). the manner in which it can be delivered might vary con- siderably from administrator to administrator (in the training materi- als the speed of administrators’ speech varies greatly). though not necessarily in terms of test scores earned or precision of measurement. as compared to 70% of candidates who made a level gain if they received more than 140 instructional hours. The fact that test items on the print-based tests are drawn from the computer-adaptive item pool demonstrates equivalence in terms of language content and the processes candidates engage in. there is no mention of equivalence between the print-based tests and computer-adaptive test. and the delivery of these tasks (although scoring and scor- ers’ perceptions of proficiency across locations may yet add to test score variance). Criterion-related validity evidence was gathered in a rather cre- ative fashion during the field testing.sagepub. 2009 .85 (Form B and C) to . Evidence for the equivalence of the three print-based test forms is presented in a study of 48 adult ESL students who each took two test forms. BEST Plus scores are benchmarked to the Student Performance Levels (SPLs).

Listening Comprehension is assessed through the candidate’s response. including a training video containing over 60 benchmark samples and three scoring activities. Training materials have been developed for a test administrator and scoring calibration workshop. A Scoring Refresher Toolkit ($150) is sold for previously trained test administrators to receive practice. Fur- thermore. The approach as explained in the Technical Report follows good standard-setting practice. and test administration Ancillary materials for the test are excellent. despite federal funding being allocated based on progress in NRS Levels. a request for rep- Downloaded from http://ltj. Concerning the rating scales. then a student could fall into either of the adjacent SPLs other than the one to which the benchmark table assigns them. If the candidate’s response indicates understanding the highest score is assigned. no information about benchmarking to the NRS Levels is avail- able. Information is stored in a secure database so that in sub- sequent post-tests the administrator can select classes and students from drop down by Green Smith on April 9. 2009 . The test CD features user-friendly tools such as importing students’ names and registration numbers from a tab-delimited text file – convenient for inputting a class roster. III Test materials. and Communication (4-point scale). the rating criteria. Disconcertingly. but with a score range from 88 to 999 and 10 levels in the SPL. Language Complex- ity (5-point scale). It appears that the approach to scoring is holistic. it is hard to see that 30 performances were sufficient to determine precise boundaries on the SPL. sometimes to the point of con- founding the constructs. SPL 3 = 418–438). although the boundaries are further apart at the upper Levels (SPL 8 = BEST Plus 599–706.310 Test reviews study in which 11 panelists from across the United States viewed 30 video-taped administrations of the BEST Plus. score reports are available immedi- ately after the test and can be printed from the computer.sagepub. The panelists’ ratings were used to establish the boundary between SPL bands. the lower Levels are narrower (SPL 2 = 401–417. For example. SPL 9 = BEST Plus 707– 795). Further. Since the error in the test is up to 30 points. performances are scored on three cri- teria: Listening Comprehension (3-point scale). All test administrators must attend this work- shop to be accredited.

grammatical structures. which are by Green Smith on April 9. if the Communication criteria were re-labeled as Comprehensibility. and if no response is given. p. then. intonation. Pre- sumably this has been accounted for somehow in the scoring algo- rithm. and the number and serious- ness of errors. Test reviews 311 etition results in the middle score being assigned. But a zero in Listening Comprehension dictates a default score of zero in Language Complexity and Communication. scoring would be more transparent if the content of speech was more clearly separated from the manner of its deliv- ery. A comparison of the descriptors of the BEST Plus rating criteria and the descriptors of the SPLs led me to wonder how precisely the instrument could separate candidates at the high SPLs. The Listening Comprehension descriptors of the SPLs general- ize to the real world in a particularly wide-ranging fashion (Level 9: ‘Understands almost all speech in any context’) considering listening Downloaded from http://ltj. particularly from Level 8 (‘speaks fluently in both familiar and unfamiliar situations’) to Level 10 (‘equal to that of a native speaker’). elaboration and detail. There also appears to be overlap in the Language Complexity and Communication descriptors (to be fair. 2009 . the lowest score (zero) is assigned. ‘comprehensible but sparodically difficult to understand’ (band 2). 114).sagepub. The descriptors are simple yet quite general. the construct is invested in the rating scales. then I find the construct to lack specificity. I found that high proficiency candidates tended to score within the same bands repeatedly regard- less of item difficulty. The keyword in the Communication descriptors is comprehen- sible: ‘Response is comprehensible yet easy to understand’ (band 3). Language Complexity embraces length of the response. In my own administrator training and testing. Communication combines word choice. If. many spoken tests face simi- lar issues). It might improve appearances and scoring. as are organization and clarity of response. Overlap in the constructs concerns vocabulary and word choice. cohesion and organization of the response. level of vocabulary. but it is not reported on. especially as there was no opportunity to take control of the interaction beyond the script and really probe the candidate’s abil- ity. Although separation of the constructs is addressed in the training materials. as Fulcher asserts (2003. pronunciation. which would seem to confound the constructs and skew scoring. or the response indicates misunderstanding. Following from this. ‘comprehensible but generally difficult to understand’ (band 1). it is noticeable that the rating criteria contain quite wide proficiency bands.

2009 . Washington. V References Center for Applied Linguistics. raters. although this likely becomes more natural with practice. the advantage of the BEST Plus approach is that it enables non-expert testers to conduct oral assessments which are sufficiently reliable for low-stakes decision-making in adult education programs. IV Summary The BEST Plus is the result of sound psychometric and linguistic theory and practice. Admittedly. Even in response to a request for clarification. as would happen in normal conversation. (2005a). developed under rigorous procedures. it was certainly an interview and not a conversation (van Lier. negotiating. BEST Plus technical report. with an accompanying body of documented validity evidence.312 Test reviews comprehension is only assessed by means of the ~25 questions deliv- ered by the administrator. As such. Although pro- fessional examiners and raters administering this test would probably enjoy more detailed criteria and the opportunity to give diagnostic feedback. each response is evaluated on a seemingly blunt scale. I found the interaction was one-sided. so that there is no allowance for re-phrasing. however. The test is composed of carefully selected and standardized items. or engaging in com- munication strategies. the admin- istrator is only allowed to repeat the question once as written.sagepub. DC: Center for Applied Linguistics. The test administration itself was quite enjoyable: the test questions were topical and conversational. Downloaded from http://ltj. precision of scoring is appar- ently achieved by repeated questioning and evaluating. I would want to see empirical data show- ing that the demands of the test questions and precision of scoring were able to discriminate at the higher levels. However. the BEST Plus successfully facilitates standardized one- on-one oral testing in diverse locations using qualified. which elicit relevant functions and cover topics appropriate to the target by Green Smith on April 9. I felt constrained by the fact that one must read all questions from the screen (or print) and continually enter scores. As a novice BEST Plus administrator. where questions are presented adaptively to control for difficulty and target each candidate’s ability. though not necessarily professional. 1989).

versanttest. December). 489–508. TESOL Quarterly. National Reporting Service. S. Retrieved 14 August 2008.sagepub. BEST Plus trainer manual. perhaps. as both a developmental and exit measure. G. and takes approximately 15 minutes to complete. is its main weakness. They claim that the test may be used not only as a measure of proficiency. the test is equally useful in academic as well as professional contexts (Technical Handbook. Washington. 2008. DC: Center for Applied L.aspx van Lier. the Downloaded from http://ltj. (2008). (2004). drawling. Fulcher. but also for educational and professional placement. 2009 . Test reviews 313 Center for Applied Linguistics. (2003). from http://www. In general. Reeling. Luoma. The Versant SpanishTM Test Janna Fox and Wendy Fraser Carleton University. (2007. Washington. DC: CAL Digest. Cambridge: Cambridge University Press. but first we provide an overview of the test. as well as achievement. Assessing speaking. Canada The Versant SpanishTM Test (www. p. 3). Effects of instructional hours and intensity of instruction on NRS level gain in listening and speaking. Characteristics of English literacy par- ticipants in adult education: 2000–200. and herein. writhing. I Test administration The test may be administered either over the telephone or computer. Testing second language speaking. 2008). These claims are further discussed below. Center for Applied applies state- of-the-art technology to language assessment in a quick and efficient tool designed for the purpose of ‘assessing an individual’s ability to speak and understand spoken Spanish’ (The Versant Span- ishTM Test: Test Description & Validation Summary [referred to hereafter as the Technical Handbook]. stretching and fainting in coils: Oral proficiency interviews as conversation. London: Pearson by Green Smith on April 9.nrsweb. According to the developers. (1989). the Pearson Education Group. (2005b). 23.

sagepub.nav Citations http://ltj.nav Permissions: Reprints: Published by: http://www. Language Testing Subscriptions: Test review: The Versant SpanishTM Test Janna Fox and Wendy Fraser Language Testing 2009. 2009 Downloaded from The online version of this article can be found at: http://ltj. 313 DOI: Additional services and information for Language Testing can be found at: Email Alerts: by Green Smith on April 9. 26.

They claim that the test may be used not only as a measure of proficiency. TESOL Quarterly.aspx van Lier.sagepub. Test reviews 313 Center for Applied Linguistics. Effects of instructional hours and intensity of instruction on NRS level gain in listening and speaking. as both a developmental and exit measure. DC: CAL Digest. writhing. but also for educational and professional placement. The Versant SpanishTM Test Janna Fox and Wendy Fraser Carleton by Green Smith on April 9. Assessing speaking. December). as well as achievement. Retrieved 14 August 2008. Center for Applied Linguistics. Luoma. Reeling. stretching and fainting in coils: Oral proficiency interviews as conversation. Cambridge: Cambridge University Press. 2009 . 489–508. S. the Downloaded from http://ltj. Characteristics of English literacy par- ticipants in adult education: 2000–200. is its main weakness. the test is equally useful in academic as well as professional contexts (Technical Handbook. 2008.versanttest. and herein. Testing second language speaking. (2004). the Pearson Education applies state- of-the-art technology to language assessment in a quick and efficient tool designed for the purpose of ‘assessing an individual’s ability to speak and understand spoken Spanish’ (The Versant Span- ishTM Test: Test Description & Validation Summary [referred to hereafter as the Technical Handbook]. G. 3). from http://www. Canada The Versant SpanishTM Test (www. drawling. perhaps. and takes approximately 15 minutes to complete.nrsweb. p. Washington. but first we provide an overview of the test. Washington. (2007. (1989). (2008). BEST Plus trainer manual. National Reporting Service. 23. L. (2005b). I Test administration The test may be administered either over the telephone or 2008). Fulcher. DC: Center for Applied Linguistics. (2003). According to the developers. These claims are further discussed below. London: Pearson Education. In general.

Downloaded from by Green Smith on April 9. III Test scoring The responses to the first six sections are scored automatically using a computerized scoring system. II Test description The test consists of 60 items divided into seven sections: reading aloud. hangs up at the end of the test. thus ensuring that each administration event is unique. it is also available on the computer screen. The first item in each section is con- sidered a practice item and only the remaining items in the section are scored. The printed text is available on the test paper given to the test taker a few minutes before the beginning of the test. answering short answer questions. and for those using the computer. The Technical Handbook recommends that the test taker receive the instructions for the test five minutes before taking it. the Versant Testing System selects items for each test taker from a pool of items graded for difficulty. saying the opposites of words. The test taker interacts with the testing system. repeating.sagepub. The responses to the last section. are not scored. and story retelling.314 Test reviews Technical Handbook indicates that the end user of the test will be responsible for administration: verifying the identity of the test taker. All test instructions are provided in print and also read aloud. the open-ended questions. In addition. 51 responses are automatically scored. Developers have designed the test so that each section consists of increasingly challenging items delivered by native speakers from various Span- ish-speaking regions. 2009 . Opportunities to prepare for and practice taking the test are also available on-line. answering open-ended questions. These responses are stored electronically and available to the end score user for additional information or as supplementary evidence to consider in borderline cases. giving the test taker the required material. In total. and then retrieves the results from a secure website within minutes of test completion. and monitoring the admin- istration of the test. building sentences from words and phrases. through the use of algorithms.

which has been developed for optimal recognition of non-native speech pro- duction. More specifically. Test reviews 315 The computerized system consists of a speech recognizer. During the development process. 2008. pp. In analyzing content. length and position of pauses. con- structs which were defined by Carroll (1986. 9). vocabulary. 2008. to build phrases and clause structures. The developers define auto- maticity as ‘the ability to access and retrieve lexical items. stress and segmen- tation are measured. according to the test developers. Scores are reported both numerically. It provides information on four diagnostic subscores (sentence mastery. and to articulate responses without conscious attention to the linguistic code’ ( by Green Smith on April 9. and manner (fluency and pronunciation). 17–19). The Versant SpanishTM Test measures the ‘automaticity’ with which test takers can respond to the test tasks. In assessing fluency and pronunciation. in a range of 20 to 80. inferences drawn on the basis of scores on The Versant SpanishTM Test represent the degree of automaticity pos- sessed by the test taker. the computer program has been cali- brated to recognize responses with only slightly less accuracy than a native listener. 2008. and in criterion-referenced descriptors. The test developers go on to explain that the four diagnostic subscores represent two basic aspects of language. each of which contributes 50% to the overall score: content (sentence mastery and vocabulary). cited in the Techni- cal Handbook. The score report is available within minutes of test completion on a secure website. 2009 . IV Test construct The construct that The Versant SpanishTM Test measures is defined as ‘the ability to understand spoken Spanish on everyday topics and to respond appropriately at a native-like conversational pace in intelli- gible Spanish’ (Technical Handbook. Thus. fluency and pronunciation) as well as an overall score. scores were automatically generated and then recalibrated to bring the results in line with human raters (Technical Handbook. p. 9) as knowledge and control. p. 8). They state that Downloaded from http://ltj. such elements as rate of speech. based on the computerized measurement of the encoding and decoding skills that are engaged in the completion of the test tasks.sagepub. based on a weighted average of the subscores with sentence mastery and fluency contributing 60%.

Additional claims state that. he argued.. Recently. Chun attacked the merits of the test for its lack of authen- ticity which. 3). citing Bachman and Palmer’s (1996) definition. similar claims made by developers of another Pearson product. 2008). led to an underrepresentation of the construct. ‘performance on these tasks does not necessarily predict or reflect the speaking ability of the test taker to function in another environment – that of the real-life domain of school and work’ (2006. etc. V Claims. 2008) and Versant for English test developers (Downey et al. 3). led to a heated public discussion between a test reviewer. 2008). Chun (2006. These proficiency tests typically operation- alize spoken proficiency in relation to key component traits – e. sentence comprehension. 10). concluding. By developing ‘context-independent material’ the developers claim the test is more efficient in that less time is devoted to creating a context and more to accumulating samples of speech. staff or applicants’ [italics added] (p. 2009 .. Downloaded from http://ltj. rhetorical and cog- nitive elements of communication’ in an effort to evaluate language ability that is uncontaminated by the social context. They argue that their test ‘predicts general spoken language facility. 304). the Versant for English Test – a very close cousin of The Versant SpanishTM Test. which is essential in successful oral communication’ ( by Green Smith on April 9. Chun (2006. pronun- ciation.g.sagepub. phonological fluency. usefulness and evidence The Versant SpanishTM Test claims to provide ‘an accurate assess- ment of how well a person speaks Spanish and understands spoken Spanish’ (p. argued against the tests’ ‘usefulness’. ‘Academic institutions.316 Test reviews The Versant SpanishTM Test focuses on ‘the psycholinguistic ele- ments of spoken language rather than the social. corpora- tions and government agencies throughout the world will find the test useful in evaluating the Spanish listening and speaking ability of students. p. global constructs of spoken proficiency in a language. vocabulary. drawing on a socio-cultural theoretical frame- work to inform his review of the Versant for English Test. Such global claims are often made by proficiency tests on the basis of context-independent.

sagepub. this is one of its strong points.93. with a standard error of measurement of 2.95. which also includes reli- ability. impact and practicality. Thus. In the section that follows. construct validity. p. it was r = 0.97’ (Technical Manual.86. 9). 2008. ‘the ability to access and retrieve lexical items. The key in this regard is ‘automaticity’. reliability of the CEF Estimate ratings was r = 0. and for the ILR-Estimate/SPT.6) (Technical Manual. 2008. there is a detailed explanation pro- vided by the test developers of the theoretical framework that defines the construct of oral language proficiency operationalized by the test. For the estimated ILR-Estimate/ DLI. When The Versant SpanishTM Test is compared with other widely used tests of oral language pro- ficiency. the reliability was 0. namely.96 (N = 267. interactiveness. we feel it is important to evaluate the tests’ usefulness by considering all of the features that the Bachman and Palmer (1996) standard identifies. When ACTFL OPI rat- ings are analyzed as a function of The Versant SpanishTM Test scores. ‘authenticity’ is only one of the components of Bachman and Palmer’s (1996) usefulness standard. and to articulate responses without conscious attention to the linguistic code’ (p. 17). Automaticity has a long history in the research literature Downloaded from http://ltj. 1 Reliability The test developers have included a number of studies in the Techni- cal Manual to support claims of by Green Smith on April 9. p. ‘all of which contribute in unique but interrelated ways to the over- all usefulness of a given test’ (p. to build phrases and clause structures. Although The Versant Span- ishTM Test may suffer somewhat from a lack of authenticity (as it is administered by phone or computer and operationalizes a psycho- linguistic model of language proficiency). each of these features is considered.92. Split half reliability estimates of the overall score of the test are reported at 0. 18). 19). suggesting a strong relationship between the machine-generated scores and the human-rated interviews. there is evidence to support the requirement for reliability. high inter-rater reliability is reported: ‘For the SPT-Inter- view-ILR. the inter-rater reliability was r = 0. 2 Construct validity With regard to construct validity. the correlation is 0. 2009 . Test reviews 317 However. given the nature of the test. Indeed.

sagepub. from a range of Spanish speaking backgrounds. 2003. clearly the test – like all computer/tape-mediated tests of spoken pro- ficiency – has limited interactivity. Levelt. Thus. The sections are largely context free. 2008. extract meaning as speech continues. and no visual or extra-textual support or realia is provided to the test taker – although a computer screen is available for the computer administered test and paper documents accompany the phone admin- istered test. the construct of the test is well-defined and there is evidence that the test measures what it intends to measure. as the test developers point out. 2009 . regardless of gender or age. 2001). the interactivity of the test might be improved if the computer interface potential were fully utilized. the test defines its construct narrowly and the Technical Manual provides evidence that it is measuring the construct well. Downloaded from http://ltj. the test measures a test taker’s ability to ‘track what is being said. pp. p. 2008). and then formulate and produce a relevant and intelligible response’ (Technical Manual. cluster in the top range of scores from 75 to 80. 2008. drawing on the work of Levelt (1989) and others. Whereas L1 speakers of Spanish. This. However. 3 Interactiveness Although the test developers report strong correlations between The Versant SpanishTM Test and oral interviews scored by live-raters. The test developers report that this pat- tern is constant. Indeed. both European and Latin American. across the sections of the test. Evidence of construct validation is provided in studies that report on the test’s discriminatory power amongst learners of Spanish as a second or foreign language (Technical Manual. examination of the construct of The Versant SpanishTM Test reveals its ‘diagnostic’ origins which can be traced back to the original PhonePass test. 2003. true to its roots as a diagnostic tool. In by Green Smith on April 9. is a psycholinguistic model of language pro- ficiency – as opposed to the socio-cultural model applied by Chun (2006. obtaining near-maximum scores on the test. Essentially.318 Test reviews (Cutler. Jescheniak et al. 8). theoretical support for the construct is well-articulated (although clearly those who subscribe to other the- oretical frameworks may not be in agreement with this perspective). the scores of less proficient Spanish language learners distribute across a range from 20 to 75.. In future. 17–19).

13). you know. the test is highly reliable. I would need to know much more than this for the placement.’ Thus. ‘Versant Spanish Test scores can be used to evaluate the level of spoken Spanish skills of individuals entering into [i.e. avoiding inevitable measure- ment error due to interlocutor variability (McNamara. however. placement]. limited the test as a useful placement tool. which may be used to set appropriate criterion scores for a local context. Test reviews 319 The argument for inferences drawn from the test performances that are elicited depends on the adequacy of the construct to represent oral language proficiency as ‘automaticity’. 4 Impact Although the test developers unabashedly claim that. all of the teachers thought the test would be useful for their students to take: ‘to learn more especially at the advanced levels Downloaded from http://ltj. Because it is computer- mediated. in the view of the teachers interviewed for this test by Green Smith on April 9. ‘Versant Spanish Test scores may be used for making valid decisions about oral Spanish interaction skills of individuals’ (p. 2004). there is rarely a positive relationship between tests and programs (Green & Weir. Conversely. ultimately..sagepub. When proficiency tests are used for placement and/or achievement purposes. 2007). who listened to sample tests that had been administered to three test takers and reviewed the resulting score reports. What is surprising here is that the test developers have not emphasized the obvious diagnostic properties of the The Versant SpanishTM Test. As we were writing this review. they put the onus on the score user to determine ‘what can be regarded as a minimum requirement in their context’ (p. and exiting Spanish language courses’ (p. 13).. 2009 . we interviewed four university teachers of Spanish.e. One teacher summed up the responses of the group. I could learn much more from the face-to-face interview I have with students than from this test – in fact. achievement]. 13). progressing through [i. They also offer Versant Benchmarking Kits to score users. is also a limitation. I prefer this. None of the instructors thought that the test would be useful as a placement test for their programs. however. pointing out: ‘we have an interview and a written test for placement in our program. They go on to state that. This strength. the precise and narrow construct definition of the test.

you know. the score reports that are currently issued regarding test performances relate numerical scores to rather difficult to inter- pret criterion-referenced statements. It would be such a good thing for them to take. leaving it open to criticism like that found in Chun (2006. and yet this purpose is in line with origins of the test and a valuable feature of the test. I’m thinking about my students who are in the international business program. Score: 66–74 Criterion: Test taker speaks fluently almost effortlessly. though a conceptually difficult subject may on occasion obstruct a natural.sagepub. from time to time. phrasing.320 Test reviews about their pronunciation.’ Although it may be advantageous from a marketing perspective to suggest that the test can be used for proficiency. For example. and achievement purposes. 2008). first language speaker of a language might also on occasion ‘obstruct a natural. the following criteria attempt to explain differences in proficiency at the top of the Fluency subtest scale: Score: 75–80 (the highest score range on the test) Criterion: Test taker speaks with good rhythm. I would like it for this. The test is most useful for proficiency and diagnos- tic purposes. by Green Smith on April 9. One may ask what the real difference is in these two criterion descriptions (and whether a fully fluent.’ Another of the instructors said. 2009 . smooth flow of language’ when managing a ‘conceptually difficult subject’). it undermines the credibility of the test to do this. placement. The diagnostic potential of the test is not developed or discussed to any extent in the Technical Manual. Speech is generally smooth with no evidence of hesitations or pausing other than for gathering thoughts. ‘You know this is what’s missing now. to see how they are improving in relation to native speakers. according to scale descriptors in the Technical Manual (2008). smooth flow of language. Downloaded from http://ltj. We don’t really have a test like this and we could use this for specially the students who want to perfect their speaking. their fluency development … these are not things that we specifically teach in the program … but it would be helpful for the students to have a sense of how they are develop- ing and where their strengths and weaknesses are. and overall timing.

For example. As elaborated in Bachman’s (2005) discussion of building and supporting a case for test use.). 6 Overall usefulness In conclusion. It might also be used to compare Spanish language learners with fully fluent first language Spanish speakers – using data from the studies reported in the Tech- nical Manual. as the test examines specific features of language. with regard to Sentence Mastery. on the other hand. these specific features might be reflected in the score report (e. they may lead to contradictory information.g. the test taker and test score user are told that on the one hand. The graphic display of performance at the top of the score report is very informative. we would argue that The Versant SpanishTM Test meets the usefulness standard. the relationship between claims and warrants is estab- lished through evidence. and the results are available almost immediately upon completion of the test.sagepub. The test user and/or test taker determine when the test is taken. the potential for positive impact would be increased if score reports reported more specific information to test takers and test users. degree of vocabulary mastery etc. highly reliable. with regard to Vocabulary.’ The test developers may want to examine the way in which these criteria are being used. there is little evidence of its claims to also be effective as a placement and achievement test. In addition. the ‘Test taker can … produce simple meaningful sentences and uses a range of structures correctly’. degree of automaticity. 5 Practicality On the practicality component. and the test developers narrowed the scope to purposes that are suitable for proficiency and/or diagnostic tests.. The Versant SpanishTM Test is excep- tional. on the same score report but in different sections of the test. exceptionally well-researched. it is an innovative. applying Bachman and Palmer’s (1996) criteria. In sum. the ‘Test taker … may be able to produce a few isolated words. Indeed. because criteria are attached to scores that are computer gen- by Green Smith on April 9. Test reviews 321 Also. However. well-designed test of proficiency in spoken Span- ish – with enormous (but currently under-emphasized) potential as a diagnostic tool. 2009 . It appears more evidence of its usefulness Downloaded from http://ltj.

L. J. Hahne. Downey. 21(4). R. MA: MIT Press. T. (2008). 160–167. & Schriefers. Language Assessment Quarterly. Green. Encyclopedia of cog- nitive science. 2(1). C.). We look forward to future developments from the Pearson group. VI References Bachman. C. An analysis of a language test for employment: The authenticity of the PhonePass Test. M.. Fox. 131–137). Language Assessment Quarterly. CA: Pearson Education. J. & Palmer. 15(3). H. UK: Oxford University Press. (1989). A. (1996). Palo Alto..). L. Language Assessment Quarterly. Volume 2 (pp. (2003). computer-administered and computer-scored test. which maximizes convenience for test takers and test score users alike and provides unique and useful information about their ability to speak and understand Spanish. Farhady. Language testing in practice. 1–34. Oxford. D. 295–306. (2008). 168–172. Building and supporting a case for test use. C. R. & Van Moere. In L. Turner & C. Cambridge. H. (2005). C. Comments on ‘Evaluation of the usefulness of the Versant for English Test: A response’: The author responds. Suzuki. L. ON: University of Ottawa Press. (2003). computer- mediated. Language testing reconsidered (pp. McNamara. (2004). Doe (Eds. 467–494.sagepub. Lexical access. Chun. Language Assessment Quarterly. Cognitive Brain Research. Cheng. A. A.. W. Ottawa. 261–276. A. 858–864). In its current form it represents a skillfully designed. Cutler. 5(2). by Green Smith on April 9. J. Speaking: From intention to articulation. A. Language testing: A question of context. Levelt. Present-Thomas. Bayliss. 3(3). (2006).322 Test reviews in placement and achievement contexts needs to be provided in order to support these claims. Nadel (Ed. (2008). Chun. & Weir. The Versant SpanishTM Test: Test description & validation summary [Technical Handbook]. M. 5(2). Can placement testing inform instructional deci- sions? Language Testing. Evaluation of the usefulness of the Versant for English Test: A response. 2009 . Bachman. (2007). Jescheniak. Information flow in the mental lexicon during speech planning: Evidence from event-related brain potentials. Wesche.. M.. London: Nature Publishing Group. Downloaded from http://ltj. In J.