System 29 (2001) 371–383


Writing evaluation: what can analytic versus holistic essay scoring tell us?
Nahla Bacha*
Lebanese American University, PO Box 36, Byblos, Lebanon

Received 14 November 2000; received in revised form 26 February 2001; accepted 29 March 2001

Abstract

Two important issues in essay evaluation are the choice of an appropriate rating scale and the setting up of criteria based on the purpose of the evaluation. Research has shown that reliable and valid information gained from both analytic and holistic scoring instruments can tell teachers much about their students' proficiency levels. However, it is claimed that the purpose of the essay task, whether for diagnosis, development or promotion, is significant in deciding which scale is chosen. Revisiting the value of these scales is necessary for teachers to continue to be aware of their relevance. This article reports a study carried out on a sample of final exam essays written by L1 Arabic non-native students of English attending the Freshman English I course in the English as a Foreign Language (EFL) program at the Lebanese American University. Specifically, it aims to find out what analytic and holistic scoring using one evaluation instrument, the English as a Second Language (ESL) Composition Profile (Jacobs et al., 1981. Testing ESL Composition: A Practical Approach. Newbury House, Rowley, MA), can tell teachers about their students' essay proficiency on which to base promotional decisions. Findings indicate that the EFL program would benefit from more analytic measures. © 2001 Published by Elsevier Science Ltd.
Keywords: English as a Foreign/Second Language; Writing evaluation; Essay evaluation; Analytic evaluation; Holistic evaluation

* Tel./fax: +961-9-547256. E-mail address: (N. Bacha).

1. Introduction

In adopting scoring instruments with clearly identifiable criteria for evaluating¹ English as a Foreign Language/English as a Second Language (EFL/ESL) essays, teachers have had to address a number of concerns that have been found in the research to affect the assigning of a final score to a writing product. Some of these concerns have included the need to attain valid and reliable scores, set relevant tasks, set clear essay prompts, give sufficient writing time, and choose appropriate rhetorical modes (Braddock et al., 1963; Cooper and Odell, 1977; Crowhurst and Piche, 1979; Hoetker, 1982; Quellmalz et al., 1982; Brossell, 1983; Smith et al., 1985; Carlman, 1986; Hillocks, 1986; Kegley, 1986; Caudery, 1990; Reid, 1990; Tedick, 1990; Pere-Woodley, 1991; Tedick and Mathison, 1995). These issues are also of concern, and often require different consideration, when evaluating non-native student essays in English (Oller and Perkins, 1980; Jacobs et al., 1981; Horowitz, 1986a, b; Kroll, 1990a; Hamp-Lyons, 1990, 1991; Douglas, 1995). Kroll (1990a) emphasizes the complexity of setting criteria and standards by which to evaluate student writing. Although Gamaroff (2000) states that language testing is not in an ''abyss of ignorance'' (Alderson, 1983, cited in Gamaroff, 2000), the choice of the 'right' essay writing evaluation criteria in many EFL/ESL programs remains problematic, as often those chosen are inappropriate for the purpose.

Evaluating essays in EFL/ESL programs has been mainly for diagnostic, developmental or promotional purposes (Weir, 1990; Shohamy, 1992; Read, 1995; Kunnan, 1998). Thus, in order for these programs to obtain valid results upon which to base decisions, the choice of the evaluation instrument to be adopted becomes significant. This is most crucial when decisions concerning student promotion at the end of the semester to the next English course have to be made mainly on the basis of essay writing scores. It is then important that teachers are aware of the potential of the evaluation criteria being adopted. This article focuses on examining what two types of scoring instruments, analytic and holistic, can tell us in essay writing evaluation for promotional purposes.

2. Literature review

In evaluating essay writing either analytically or holistically, the paramount guiding principle is obviously the purpose of the essay. However, two main related concerns in both L1 and L2 essay evaluation literature are the appropriateness of the scoring criteria and the standard required (Gamaroff, 2000).

¹ Evaluation sometimes refers to assigning a score to a direct writing product based on predefined criteria. It is distinguished from assessment in that the scoring for the latter focuses more on feedback and alternative evaluative techniques in the process of learning. In this study, the term evaluation will refer to assigning a score to a direct writing product.

Two important related issues to be considered are the content and construct validity of the essay task. There is no single written standard that can be said to represent the 'ideal' written product in English:

. . . we cannot easily establish procedures for evaluating ESL writing in terms of adherence to some model of native-speaker writing. . . . Even narrowing the discussion to a focus on academic writing is fraught with complexity. (p. 141)

This difficulty, however, is best addressed by identifying the purpose of the essay writing in EFL programs prior to adopting any evaluation instrument.

A writing evaluation instrument is said to have content validity when it ''. . . evaluates writers' performance on the kind of writing tasks they are normally required to do in the classroom'' (Jacobs et al., 1981, p. 74). The Jacobs et al. (1981) writing criteria have content validity since the criteria outlined evaluate the performance of EFL/ESL students' different types of expository writing, which are tasks they perform in the foreign language classroom. Construct validity, the second but equally important issue, is the degree to which the scoring instrument is able to distinguish among abilities in what it sets out to measure, and is usually referred to in theoretical terms; in this case, the theoretical construct is ''essay writing ability'', which the instrument aims to measure. The Jacobs et al. (1981) criteria have been researched and found to have construct validity in that significant differences were found when student essay scores were compared. An additional concern, but of lesser importance compared with the other two issues in this article, is concurrent validity. An evaluation instrument is said to have concurrent validity when the scores obtained on a test using the instrument significantly and positively correlate with scores obtained on another test that also aims to test similar skills. Examples of tests that could correlate with essays might be ''. . . measures of overall English proficiency'', such as the final exam grade in the English courses, even though an essay requires a [similar] writing performance specifically (Jacobs et al., 1981, p. 74), or another sample of writing performance.

The Jacobs et al. (1981) ESL Composition Profile was selected among other similar instruments in the present study as it has been used successfully in evaluating the essay writing proficiency levels of students in ESL/EFL programs comparable with those at the Lebanese American University. Hamp-Lyons (1990) comments that it is ''the best-known scoring procedure for ESL writing at the present time'' (p. 78). The Profile is divided into five major writing components: content, organization, vocabulary, language, and mechanics, each having four rating levels: very poor, poor to fair, average to good, and very good to excellent. Each component and level has clear descriptors of the writing proficiency for that particular level as well as a numerical scale. The score ranges for the writing components are: content 13–30, organization 7–20, vocabulary 7–20, language 5–25 and mechanics 2–5. For example, very good to excellent content has a minimum rating of 27 and a maximum of 30, indicating essay writing which is ''knowledgeable — substantive — thorough development of thesis — relevant to assigned topic'', while very poor content has a minimum of 13 and a maximum of 16, indicating essay writing that ''does not show knowledge of subject — non-substantive — not pertinent — or not enough to evaluate'' (Jacobs et al., 1981). Benchmark essays are included in the Jacobs et al. (1981) text as guides for teachers.

The Jacobs et al. (1981) criteria have also been established to have concurrent validity, the scores being highly correlated with those of the TOEFL and the Michigan Test Battery (Jacobs et al., 1981, pp. 74–75). ''The correlation coefficients which result from the relationships between the tests can be considered to be 'validity coefficients' '' for the instrument in question (Jacobs et al., 1981, p. 74); 0.60 or above provides strong empirical support for concurrent validity. Obviously, in any classroom situation, teachers do not usually compute all these validity tests; however, it is important that they are aware that these issues need to be considered.

Holistic and analytic scoring instruments, or rating scales, have been used in EFL/ESL programs to identify students' writing proficiency levels for different purposes, to be determined by those concerned. Some instruments may be a combination of both, as the Jacobs et al. (1981) ESL Composition Profile is. Basically, holistic scales are mainly used for impressionistic evaluation, which could be in the form of a letter grade, a percentage, or a number on a preconceived ordinal scale corresponding to a set of descriptive criteria (e.g. the Test of Written English, part of the Test of English as a Foreign Language, TOEFL, has a 1–6 scale; Pierce, 1991). Benchmark papers (sample papers drawn from students' work which represent the different levels) are chosen as guides for the raters, each of whom scores the essays without knowing the score(s) assigned by the other raters (blind evaluation). The final grade is usually the average of the two raters' scores. If there is a score discrepancy, a third and sometimes a fourth reader is required, and the two or three closest scores or all four scores are averaged. Through rater training and experience, high inter- and intra-rater reliability correlations can be attained (Myers, 1980; White, 1994). Holistic scoring is popular as it saves time and money and at the same time is efficient. Although specific features of compositions are often involved in any holistic scoring (Cohen and Manion, 1994), in more explicit holistic scoring, grading criteria are detailed for each of the levels, which serve to initially identify the students' writing proficiency level.

Some researchers argue, however, that holistic scoring focuses on what the writer does well rather than on the writer's specific areas of weakness, which is of more importance for decisions concerning promotion (Charney, 1984; Elbow, 1993; Upshur and Turner, 1995). They do see its value for evaluating classroom essays and large scale ratings such as those done for the Test of Written English international exam. For student diagnostic purposes, holistic scoring can ''. . . establish the standards for criterion-referenced evaluation'' (Reid, 1993, p. 239) rather than norm-referenced evaluation (evaluating students in comparison with each other); but for the more specific feedback needed in following up on students' progress, and for evaluating students' proficiency levels for promotional purposes, criterion-referenced evaluation criteria are needed, which analytic scales provide. A further argument for adopting analytic evaluation scales is that some research indicates learners may not perform the same in each of the components of the writing skill (Kroll, 1990a).

Analytic scoring scales, in being more criterion-referenced, have also been found to be better suited to evaluating the different aspects of the writing skill, and analytic evaluation instruments are considered more appropriate for basing decisions on the extent to which students are ready to begin more advanced writing courses. One type of analytic scale, primary trait, first developed for the National Assessment of Educational Progress (NAEP) in the USA by Lloyd-Jones (1977, in Cohen and Manion, 1994), deals with the setting up of criteria prior to the production of a particular task in order to evaluate clearly specified tasks; in other words, how students have, for example, summarized an article or argued one side of an issue (Cohen and Manion, 1994). Multiple-trait scoring, another analytic scale, focuses on more than one trait in the essay and was used in efforts to obtain more information than holistic scores could provide (Hamp-Lyons, 1991). Such scales may also include specific features such as cohesion subsumed under organization [i.e. the structural and lexical signals that connect a text, as in Weir's (1983, 1990) Writing Scale]. If applied well, these analytic scales can be very informative about the students' proficiency levels in specific writing areas.

Teachers need to be aware, however, that in any type of analytic evaluation, ratings within the specific writing areas may vary from one instructor to the next depending upon the relative weight placed on each. Also, in using the different types of analytic scales, raters may unknowingly fall back on holistic methods in actual ratings if attention to the specified writing areas is not given. Cohen and Manion (1994) give score results of a few sample essays that were evaluated by the various scales mentioned earlier, and note that results vary depending upon how raters perform and which scales are used; this is quite normal even among the most experienced raters. They conclude that in any writing evaluation training program, raters should focus on the task objectives set, use the same criteria with a common understanding, attempt to have novice raters approximate expert raters in rating, and have all raters sensitive to the writing strategies of students from other languages and cultures (p. 336). Nevertheless, studies have indicated that inter-rater reliability of 0.8 and higher can be obtained (Kaczmarek, 1980; Mullen, 1980; Homburg, 1984). Other studies have shown significant correlations between some linguistic features (e.g. vocabulary and syntactic features), as well as between linguistic features and holistic scores (Mullen, 1980; Perkins, 1983), making more qualitative evaluation procedures covering lexical, syntactic, discourse and rhetorical features necessary (Tedick, 1990; Connor-Linton, 1995). Also, there could be a backwash effect in that instruction is influenced by the evaluation criteria (Cohen and Manion, 1994).

Gamaroff (2000) argues that, among many factors, perhaps why ''. . . many fail and pass [depends] on who marks one's tests and exams (who is usually one's teacher/lecturer), depending on the luck of the draw'' (p. 46). Although this view may be true in some situations, it does seem to undermine what teachers can and should do as part of a team in EFL programs. It is claimed in this article that the overriding factor in whether the student fails or passes the course is primarily the choice of the scoring scale and criteria suitable for the purpose of the task, and the interpretation (or misinterpretation) of the results.

It is also the view of the author that essay evaluation should be team work in any EFL program and not left to individuals alone. Often we as teachers are so busy evaluating piles of essays that a reminder of the significance of knowing what these scales offer us is helpful; in a sense, it is revisiting the essay evaluation scenario and re-raising teachers' awareness of the significance of their decisions. This article is an attempt to investigate what two types of writing rating scales, analytic and holistic, can tell teachers in evaluating their students' final essays.

3. Method

3.1. The sample

A stratified random sample of 30 essays was selected from a corpus of N=156 final exam essays written by L1 Arabic non-native students at the end of a 4-month semester in the Freshman English I course, the first of four in the EFL program at the Lebanese American University. The main objective of these EFL courses is to give students an opportunity to become more proficient in basic reading comprehension and academic essay writing skills. The final essay (given as part of the final exam, which also includes a reading comprehension component) is used to test for promotional purposes; it is the final essay that often determines whether the student is ready to begin the more advanced Freshman English II course. The final exam counts for 30% of the final course grade, with the final essay constituting 20% and the reading comprehension 10%. Students must gain a minimum percentage score of 60% in the course to receive a pass; final grades are reported on the basis of pass (P) or no pass (NP).

The sample of essays was written by a representative selection of the student population (N=156) in gender (2/3 males and 1/3 females), first language (Arabic 94.55%, French 1.98%, English 1.98%), main study language in high school (English 35%, French 60%, bilingual English and French 5%), and major in three schools (Arts and Sciences 40.67%, Business 26.59%, Engineering and Architecture 32.73%).

3.2. The procedure

Students were instructed to choose one of two topics², each in a different rhetorical mode, and write a well organized and developed essay within a 90 min time period, considered ample for the task of writing a final essay product (Caudery, 1990; Kroll, 1990b; White, 1994). Although research findings suggest that the rhetorical or organizational mode and topic affect performance (Hoetker, 1982; Quellmalz et al., 1982; Carlman, 1986; Kegley, 1986; Ruth and Murphy, 1988; Kroll, 1990b; White, 1994), the choice gave students the opportunity to perform their best, and as the essays were comparable with those given during the regular course instruction, the topics were considered valid for promotional purposes.

² Compare and contrast the overall performance of computers and human beings at work, or give the causes and effects of loneliness among young people.
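The grade weighting just described can be illustrated with a small sketch. The article states only that the final exam counts for 30% of the course grade (essay 20%, reading 10%) and that 60% is the pass threshold; the 70% coursework share and the scores below are assumptions for illustration.

```python
# Hypothetical course-grade computation. Known from the article: final essay
# 20%, reading comprehension 10%, pass threshold 60%. The 70% coursework
# share is an assumed placeholder for the rest of the course grade.

def course_grade(coursework, final_essay, reading):
    """All inputs and the result are percentage scores (0-100)."""
    return 0.70 * coursework + 0.20 * final_essay + 0.10 * reading

grade = course_grade(coursework=62, final_essay=58, reading=65)
print(round(grade, 1), "P" if grade >= 60 else "NP")   # prints: 61.5 P
```

Note how a failing essay score (58) can still be offset by the other components under this weighting, which is exactly why the article treats the essay score's role in promotion decisions as worth scrutinizing.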

Each essay was first given two holistic percentage scores according to the final course objectives, using the Jacobs et al. (1981) ESL Composition Profile, by two readers (the class teacher and another teacher who was teaching a different section of the same course). The Profile had been used in the LAU EFL program by some of the instructors, but this was the first time it was rigorously applied. A third reader was required when discrepancies exceeded one letter-score range (A=90–100%, B=80–89%, C=70–79%, D=60–69%, Failing=below 60%), in which case the average of the two closest scores was computed. The final holistic percentage score for each essay was then calculated by recording the mean of the two readers' scores. Using the same Profile, the same readers then analytically scored each essay according to the five parts: content, organization, vocabulary, language, and mechanics (Jacobs et al., 1981). The final analytic scores for each of the writing skill components were averaged and a mean for each of the five parts computed. The essays were not typed before evaluation, so as to keep the scoring procedure in a realistic context, although some research has shown that L1 Arabic non-native writers in English show weaknesses when writing in the English script, which has been noted to negatively influence raters' scores (Sweedler-Brown, 1993).

The Spearman Correlation Coefficient was used to test the strength of the relationship between the two raters' scores (inter-rater reliability) and between the same rater's scores on two occasions (intra-rater reliability). This statistical test was chosen over the Pearson one since the holistic and analytic scores were not normally distributed. The Friedman Statistical Test was used to check for any significant differences on more than two variables, in this case the five specific writing areas in the ESL Profile (SPSS, 1997).

4. Results

Significant positive relations of 0.8 were obtained between the two readers' holistic essay scores (P=0.001) using the Spearman Correlation Coefficient Statistical Test (inter-rater reliability). When a random sample of essays (N=10) was re-evaluated by the same readers, there was also a similar significant positive relation (P=0.001) between the scores of the first and second readings; that is, raters were in agreement with their own readings of the same essays on two occasions (intra-rater reliability). The average mean score obtained for all the 156 essays was 65%, which does reflect the low writing proficiency level expected at this level. Table 1 indicates the mean analytic raw scores and Table 2 the scores when converted to percentages. It is apparent that the language component is the weakest when one compares the percentages.
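The two-reader scoring routine described in the procedure (average the two holistic readings; when they differ by more than one letter-score range, call in a third reader and average the two closest scores) can be sketched as follows. The scores and helper names are illustrative, not from the study.

```python
# Sketch of the essay-score resolution procedure; the letter bands follow
# the article (A=90-100, B=80-89, C=70-79, D=60-69, Failing below 60).

def letter_band(score):
    """Map a percentage score to a band index (A=0, B=1, C=2, D=3, F=4)."""
    if score >= 90: return 0
    if score >= 80: return 1
    if score >= 70: return 2
    if score >= 60: return 3
    return 4

def final_holistic_score(r1, r2, r3=None):
    """Mean of two readings, unless they differ by more than one letter
    band, in which case a third reading is required and the two closest
    scores are averaged."""
    if abs(letter_band(r1) - letter_band(r2)) <= 1:
        return (r1 + r2) / 2
    if r3 is None:
        raise ValueError("discrepancy exceeds one letter band: third reader needed")
    a, b, c = sorted([r1, r2, r3])
    pair = (a, b) if (b - a) <= (c - b) else (b, c)   # the two closest scores
    return sum(pair) / 2

print(final_holistic_score(72, 78))      # adjacent bands: simple mean -> 75.0
print(final_holistic_score(55, 80, 76))  # third reading: mean of 76 and 80 -> 78.0
```

The blind-evaluation aspect (readers not seeing each other's scores) is procedural rather than computational, so it is not modelled here.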

Table 1
Mean analytic raw scores of final essay

Writing areas   Content   Organization   Vocabulary   Language   Mechanics
Mean            21.100    13.650         13.467       15.834     3.400
S.D.             3.339     1.687          1.814        3.094     0.607

Table 2
Mean percent analytic scores of final essay

Writing areas   Content   Organization   Vocabulary   Language   Mechanics
Mean            70.334    68.250         67.333       63.333     68.000
S.D.            13.072     8.736          9.437       12.736     12.018

The percentage allocated to each writing area in the Profile is: content 30%, organization 20%, vocabulary 20%, language 25%, mechanics 5% (Jacobs et al., 1981).

There were very high positive significant relations of 0.8 and above in the raters' analytic evaluation (P<0.05). When the mean percent analytic component scores were compared, there were high significant differences (P=0.001) among the different components of the writing skill using the Friedman two-way ANOVA; students performed significantly differently, from best to least, as follows: content, organization, mechanics, vocabulary, language. It seems that student performance is lowest in the vocabulary and language components when compared with that in content, organization and mechanics.

When the scores of the different components of the writing skill were correlated, very high significant correlation coefficients (P=0.001) were also obtained between the holistic and the analytic scores when the holistic scores were computed separately with each of the five writing components. Table 3 indicates high significant coefficients among the analytic component scores, the highest being between the vocabulary and organization components; this might be an indicator that the raters evaluate essays seeing a link between these two components. All in all, the results indicate that the components are inter-related, confirming the internal reliability or consistency of the scores (Jacobs et al., 1981).

5. Discussion

Although there were high inter- and intra-reliability coefficients, the holistic scoring revealed little about the performance of the students in the different components of the writing skill.
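Table 2 follows from Table 1 by expressing each component mean as a percentage of that component's maximum points on the Profile (content 30, organization 20, vocabulary 20, language 25, mechanics 5). A minimal sketch of the conversion, with illustrative raw means:

```python
# Maximum points per component on the ESL Composition Profile (the score
# ranges are 13-30, 7-20, 7-20, 5-25 and 2-5 respectively).
MAX_POINTS = {"content": 30, "organization": 20, "vocabulary": 20,
              "language": 25, "mechanics": 5}

def to_percent(raw_means):
    """Express each component mean as a percentage of its maximum points."""
    return {k: round(100 * v / MAX_POINTS[k], 3) for k, v in raw_means.items()}

raw = {"content": 21.1, "organization": 13.65, "vocabulary": 13.467,
       "language": 15.833, "mechanics": 3.4}
print(to_percent(raw))
```

With these illustrative values, organization comes out at 68.25% and mechanics at 68.0%, which shows why a component with a small point range (mechanics, 2–5) can still rank high in percentage terms.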

In the study, students performed significantly differently in the various aspects of the writing skill. This finding confirms research in the field (Kroll, 1990a) indicating that students may have different proficiency levels in the various writing components. Jacobs et al. (1981) point out that an important aspect in testing is the internal consistency or reliability of scores; that is, there should be no significant correlation differences among the parts or between the parts and the whole score (p. 71) if the test is to be considered a reliable one. In the study, there was high significant correlation between the two raters' analytic scores on each of the five components, as well as among the scores of the different components of the writing skill (Table 3), the highest being between vocabulary and organization. Although based on one writing sample, this finding suggests that the performance of the students is a reliable indicator of the students' writing proficiency level.

Table 3
Correlation among analytic mean percent scores (P=0.001 in all cases; N=30)

              Content   Organization   Vocabulary   Language
Organization  0.7607
Vocabulary    0.8149    0.9587
Language      0.8774    0.8962         0.8679
Mechanics     0.6871    0.7712         0.7945       0.7751

That the final essay significantly correlated with the final exam results seems to suggest that writing proficiency is in some way related to academic course work at a given point in time, in this case at the end of the semester (concurrent validity). The fact that the language component indicated the lowest scores, as identified by the criteria, confirms both teacher and student perceptions during informal interviews throughout the course. There have been many experiences where students have been passed due to their having 'relevant' ideas and 'appropriate' organization, but their relatively poor language and vocabulary repertoire has placed them at a disadvantage in the upper level English courses; of course, the reverse has also been the case. If this be the case, EFL programs and teachers should re-visit the value of the scoring instruments adopted and consider not relying entirely on holistic scores to determine the improvement of their students' writing proficiency. White (1994) warns against holistic scoring alone and suggests that ''. . . we need to define writing more inclusively than our holistic scoring guides . . . which normally yield no gain scores across an academic year'' (p. 266). Decisions concerning students passing or failing the course would be better informed if not totally based on the holistically rated essay scores.

The consistent high correlation between vocabulary and organization reinforces some recent research (Hoey, 1991) claiming that lexis and organization are related in various ways over stretches of discourse. This is quite a significant finding (not revealed in the holistic rating), as it implies that incoming students need to work on widening their vocabulary repertoire in an academic setting, a finding which confirms much of the literature in the field, specifically that related to L2 Arabic speakers of English (Doushaq, 1986; Silva, 1993).
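The coefficients reported here are Spearman rank correlations. For scores without ties, the statistic is rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between paired ranks; a minimal version with two hypothetical raters' scores:

```python
# Minimal Spearman rank correlation (assumes no tied scores), applied to
# two hypothetical raters' percentage scores for five essays.

def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) for untied scores."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rater1 = [65, 72, 58, 80, 70]
rater2 = [60, 68, 55, 82, 75]
print(spearman_rho(rater1, rater2))   # prints 0.9
```

Because only ranks matter, two raters can use quite different absolute standards and still show high agreement, which is one reason rank correlation suits holistic percentage scores that are not normally distributed.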

Raters may also have been influenced by the score values given to each component of the compositions; that is, since the components are weighted differently in the Profile, raters may have implicitly given higher scores to the aspects of the writing that the rating scale gives higher weight to (e.g. content) and correspondingly lower scores to the aspects of writing that the rating scale gives lower weight to (e.g. mechanics). It might therefore be argued that, since each component in the ESL Composition Profile (Jacobs et al., 1981) is weighted differently in the teaching and learning process as expressed in the Profile, the component scores may have needed to be converted to (or, better, scored on) a common scale (e.g. using z scores or some other type of conversion) in order to compare them with one another more validly. However, it seems that the readers were focusing on each of the components equally when rating the essays and did not emphasize any component over the others, as some of the research suggests might happen: the fact that there were high significant positive relations (P<0.001) between the holistic and the analytic scores using the Spearman Correlation Coefficient Test indicates internal consistency or reliability of the scores. The scale was thus found to be appropriate for the present study.

The fact that the content component mean percentage score is significantly higher in comparison with the other components indicates that the students had relevant ideas on the topic given, but found it significantly more difficult to organize and express their ideas. This finding also confirms the teachers' experiences and students' comments throughout the course during informal interviews. The consistent high significance of the relation between the vocabulary and the holistic essay ratings may suggest that teachers consider vocabulary a very important sub-component of the writing skill, whereas mechanics is not as important in holistic ratings. Shakir (1991) reported findings where there was a much higher correlation between the grammar score, in comparison with other components, and the holistic coherence scores. Sweedler-Brown (1992) also reported high significant correlation between sentence features (rather than more rhetorical ones) and holistic scores, concluding that teachers may have been rating the essays according to surface features at the sentence level rather than discourse features over larger stretches of text, and recommends that experienced writing instructors work with teachers who are not trained in ESL/EFL evaluation techniques. There is a need to look deeper into the students' language component to describe what actually is going on and to devise ways to help students widen their vocabulary repertoire.

6. Conclusion

The aim of the present study was to find out what holistic and analytic evaluation can tell us and what general lessons can be drawn for the evaluation of writing. Although holistic scoring may blend together many of the traits assessed separately in analytic scoring, making it relatively easy and reliable, it is not as informative for the learning situation as analytic scoring.
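The z-score conversion raised in the discussion, putting differently weighted component scores on a common scale before comparing them, can be sketched as follows; the scores are illustrative.

```python
# Standardizing two components whose raw ranges differ widely (content is
# scored out of 30, mechanics out of 5) so they become directly comparable.

from statistics import mean, stdev

def z_scores(values):
    """Standardize scores to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

content = [27, 22, 18, 25, 20]   # scored out of 30
mechanics = [4, 3, 2, 5, 3]      # scored out of 5
# After standardization both sets are unit-free: a z of +1 means one
# standard deviation above that component's mean, whatever its point range.
print([round(z, 2) for z in z_scores(content)])
print([round(z, 2) for z in z_scores(mechanics)])
```

A design note: converting to percentages (as in Table 2) equates the scales' endpoints, whereas z scores equate their distributions; which is more appropriate depends on whether the comparison is against the Profile's maxima or against the group's performance.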

Although the study was done on a limited sample, the results indicate that more attention should be given to the language and vocabulary aspects of students' essay writing, and that a combination of holistic and analytic evaluation is needed to better evaluate students' essay writing proficiency at the end of a course of study. These initial results also confirm the complexity involved in choosing rating scales and delineating criteria for valid and reliable essay evaluation on which to base promotion decisions. Most of all, perhaps, relevant evaluation criteria go hand in hand with the purpose upon which the criteria, benchmark essays and training sessions are based (Pierce, 1994). In the final analysis, the choice of instrument, and how we use the information obtained, depends very much upon what we are looking for in evaluating our students' work. What holistic and analytic evaluation can tell teachers about our students' essay writing proficiency is quite a lot for any EFL/ESL program in any context.

References

Alderson, J.C., Beretta, A. (Eds.), 1992. Evaluating Second Language Education. Cambridge University Press, Cambridge.
Bachman, L.F., 1990. Fundamental Considerations in Language Testing. Oxford University Press, Oxford.
Bachman, L.F., 1991. What does language testing have to offer? TESOL Quarterly 25 (4), 671–704.
Bamberg, B., 1982. Multiple-choice and holistic essay scores: what are they measuring? College Composition and Communication 33 (4), 404–406.
Braddock, R., Lloyd-Jones, R., Schoer, L., 1963. Research in Written Composition. National Council of Teachers of English, Urbana, Illinois.
Brossell, G., 1983. Rhetorical specification in essay examination topics. College English 45 (2), 165–173.
Carlman, N., 1986. Topic differences on writing tests: how much do they matter? English Quarterly 19 (1), 39–49.
Carlson, S., et al., 1985. Relationship of Admission Test Scores to Writing Performance of Native and Non-native Speakers of English (TOEFL Research Report No. 19). Educational Testing Service, Princeton, New Jersey.
Caudery, T., 1990. The validity of timed essay tests in the assessment of writing skills. ELT Journal 44 (2), 122–131.
Charney, D., 1984. The validity of using holistic scoring to evaluate writing. Research in the Teaching of English 18, 65–81.
Cohen, L., Manion, L., 1994. Research Methods in Education. Routledge, New York.
Connor-Linton, J., 1995. Looking behind the curtain: what do L2 composition ratings really mean? TESOL Quarterly 29 (4), 762–765.
Cooper, C.R., Odell, L. (Eds.), 1977. Evaluating Writing: Describing, Measuring, Judging. National Council of Teachers of English, Urbana, Illinois.
Crowhurst, M., Piche, G.L., 1979. Audience and mode of discourse effects on syntactic complexity in writing at two grade levels. Research in the Teaching of English 13 (2), 101–109.
Cumming, A., 1990. Expertise in evaluating second language compositions. Language Testing 7 (1), 31–51.
Cushing-Weigle, S., 1994. Effects of training on raters of ESL compositions. Language Testing 11 (2), 197–223.
Douglas, D., 1995. Developments in language testing. Annual Review of Applied Linguistics 15, 167–187.
Doushaq, M.H., 1986. An investigation into stylistic errors of Arab students learning English for academic purposes. English for Specific Purposes 5 (1), 27–39.
Elbow, P., 1999. Ranking, evaluating, and liking: sorting out three forms of judgment. In: Straub, R. (Ed.), A Sourcebook for Responding to Student Writing. Hampton Press, Cresskill, New Jersey, pp. 175–196.

Gamaroff, R., 2000. Rater reliability in language assessment: the bug of all bears. System 28 (1), 31–53.
Hamp-Lyons, L., 1986a. Testing second language writing in academic settings. Unpublished doctoral dissertation, University of Edinburgh.
Hamp-Lyons, L., 1986b. Writing in a foreign language and rhetorical transfer: influences on raters' evaluations. In: Meara, P. (Ed.), Spoken Language. Center for Information on Language Teaching, London, pp. 72–84.
Hamp-Lyons, L., 1990. Second language writing: assessment issues. In: Kroll, B. (Ed.), Second Language Writing: Research Insights for the Classroom. Cambridge University Press, Cambridge, pp. 69–87.
Hamp-Lyons, L. (Ed.), 1991. Assessing Second Language Writing in Academic Contexts. Ablex, Norwood, New Jersey. (Winner of the Duke of Edinburgh English Language Prize, 1991.)
Hamp-Lyons, L., 1995. Rating nonnative writing: the trouble with holistic scoring. TESOL Quarterly 29 (4), 759–762.
Hillocks Jr., G., 1986. Research on Written Composition: New Directions for Teaching. National Conference on Research in English and ERIC Clearinghouse on Reading and Communication Skills, Urbana, Illinois.
Hoetker, J., 1982. Essay examination topics and students' writing. College Composition and Communication 33 (4), 377–392.
Hoey, M., 1991. Patterns of Lexis in Text. Oxford University Press, Oxford.
Homburg, T.J., 1984. Holistic evaluation of ESL compositions: can it be validated objectively? TESOL Quarterly 18 (1), 87–107.
Horowitz, D., 1986a. Essay examination prompts and the teaching of academic writing. English for Specific Purposes 5 (2), 107–120.
Horowitz, D., 1986b. What professors actually require: academic tasks for the ESL classroom. TESOL Quarterly 20 (3), 445–462.
Jacobs, H.L., et al., 1981. Testing ESL Composition: A Practical Approach. Newbury House, Rowley, Massachusetts.
Kaczmarek, C.M., 1980. Scoring and rating essay tasks. In: Oller Jr., J.W., Perkins, K. (Eds.), Research in Language Testing. Newbury House, Rowley, Massachusetts, pp. 151–159.
Kegley, P.H., 1986. The effect of mode discourse on student writing performance: implications for policy. Educational Evaluation and Policy Analysis 8 (2), 147–154.
Kroll, B. (Ed.), 1990a. Second Language Writing: Research Insights for the Classroom. Cambridge University Press, Cambridge.
Kroll, B., 1990b. What does time buy? ESL student performance on home versus class compositions. In: Kroll, B. (Ed.), Second Language Writing: Research Insights for the Classroom. Cambridge University Press, Cambridge, pp. 140–154.
Kunnan, A.J. (Ed.), 1998. Validation in Language Assessment: Selected Papers from the 17th Language Testing Research Colloquium, Long Beach. Lawrence Erlbaum Associates, Mahwah, New Jersey.
Lloyd-Jones, R., 1987. Tests of writing ability. In: Tate, G. (Ed.), Teaching Composition: 12 Bibliographical Essays. Christian University Press, Ft. Worth, Texas, pp. 155–176.
Mullen, K.A., 1980. Evaluating writing proficiency in ESL. In: Oller Jr., J.W., Perkins, K. (Eds.), Research in Language Testing. Newbury House, Rowley, Massachusetts, pp. 160–170.
Myers, M., 1980. A Procedure for Writing Assessment and Holistic Scoring. National Council of Teachers of English, Urbana, Illinois.
Najimy, N.C. (Ed.), 1981. Measure for Measure: A Guidebook for Evaluating Students' Expository Writing. National Council of Teachers of English, Urbana, Illinois.
Oller Jr., J.W., Perkins, K. (Eds.), 1980. Research in Language Testing. Newbury House, Rowley, Massachusetts.
Perkins, K., 1983. On the use of composition scoring techniques, objective measures and objective tests to evaluate ESL writing ability. TESOL Quarterly 17 (4), 651–671.
Péry-Woodley, M.-P., 1991. Writing in L1 and L2: analysing and evaluating learners' texts. Language Teaching 24 (2), 69–83.
Pierce, B.N., 1991. TOEFL Test of Written English (TWE) Scoring Guide. TESOL Quarterly 25 (1), 159–163.
Purves, A.C., 1992. Reflections on research and assessment in written composition. Research in the Teaching of English 26 (1), 108–122.
Purves, A.C., Degenhart, R.E. (Eds.), 1988. The IEA Study of Written Composition I: The International Writing Tasks and Scoring Scales. Pergamon Press, Oxford.
Quellmalz, E.S., et al., 1982. Effects of discourse and response mode on the measurement of writing competence. Journal of Educational Measurement 19 (4), 241–258.
Raymond, J.C., 1982. What we don't know about the evaluation of writing. College Composition and Communication 33 (4), 399–403.
Read, J., 1990. Providing relevant content in an EAP writing test. English for Specific Purposes 9, 109–121.
Reid, J., 1993. Teaching ESL Writing. Prentice-Hall, Englewood Cliffs, New Jersey.
Ruth, L., Murphy, S., 1988. Designing Writing Tasks for the Assessment of Writing. Ablex Publishing Corporation, Norwood, New Jersey.
Shakir, A., 1991. Coherence in EFL student-written texts: two perspectives. Foreign Language Annals 24 (5), 399–411.
Shohamy, E., 1995. Performance assessment in language testing. Annual Review of Applied Linguistics 15, 188–211.
Silva, T., 1993. Toward an understanding of the distinct nature of L2 writing: the ESL research and its implications. TESOL Quarterly 27 (4), 657–677.
Smith, W.L., et al., 1985. Some effects of varying the structure of a topic on college students' writing. Written Communication 2 (1), 73–88.
SPSS Inc., 1997. SPSS Base 7.5 for Windows User's Guide. SPSS Inc., Chicago.
Sweedler-Brown, C.O., 1992. The effect of training on the appearance bias of holistic essay graders. Journal of Research and Development in Education 26 (1), 24–29.
Sweedler-Brown, C.O., 1993. ESL essay evaluation: the influence of sentence-level and rhetorical features. Journal of Second Language Writing 2 (1), 3–17.
Tedick, D.J., 1990. ESL writing assessment: subject-matter knowledge and its impact on performance. English for Specific Purposes 9, 123–143.
Tedick, D.J., Mathison, M.A., 1995. Holistic scoring in ESL writing assessment: what does an analysis of rhetorical features reveal? In: Belcher, D., Braine, G. (Eds.), Academic Writing in a Second Language: Essays on Research and Pedagogy. Ablex, Norwood, New Jersey, pp. 205–230.
Upshur, J.A., Turner, C.E., 1995. Constructing rating scales for second language tests. ELT Journal 49 (1), 3–12.
Weir, C.J., 1983. Identifying the language problems of overseas students in tertiary education in the UK. Unpublished doctoral dissertation, University of London.
Weir, C.J., 1990. Communicative Language Testing. Prentice-Hall, New Jersey.
White, E.M., 1994. Teaching and Assessing Writing: Recent Advances in Understanding, Evaluating, and Improving Student Performance. Jossey-Bass Publishers, San Francisco.