
Assessing Writing 22 (2014) 1–17



Automated Essay Scoring feedback for second language writers: How does it compare to instructor feedback?
Semire Dikli 1, Susan Bleyle ∗
Georgia Gwinnett College, 1000 University Center Lane, Lawrenceville, GA 30043, United States

Article history: Received 22 July 2013; Received in revised form 18 March 2014; Accepted 27 March 2014; Available online 22 April 2014

Keywords: Automated Essay Scoring; Online writing evaluation; Second language writing; Feedback

Abstract

Writing is an essential component of students' academic English development, yet it requires a considerable amount of time and effort on the part of both students and teachers. In an effort to reduce their workload, many instructors are looking into the use of Automated Essay Scoring (AES) systems to complement more traditional ways of providing feedback. This paper investigates the use of an AES system in a college ESL writing classroom. Participants included 14 advanced students from various linguistic backgrounds who wrote on three prompts and received feedback from the instructor and the AES system (Criterion). Instructor feedback on the drafts (n = 37) was compared to AES feedback and analyzed both quantitatively and qualitatively across the feedback categories of grammar (e.g., subject-verb agreement, ill-formed verbs), usage (e.g., incorrect articles, prepositions), mechanics (e.g., spelling, capitalization), and perceived quality by an additional ESL instructor. Data were triangulated with opinion surveys regarding student perceptions of the feedback received. The results show large discrepancies between the two feedback types (the instructor provided more and better quality feedback) and suggest important pedagogical implications by providing ESL writing instructors with insights regarding the use of AES systems in their classrooms.
© 2014 Elsevier Ltd. All rights reserved.

∗ Corresponding author. Tel.: +1 678 708 3952. E-mail addresses: sdikli@ggc.edu (S. Dikli), sbleyle@ggc.edu (S. Bleyle).
1 Tel.: +1 404 734 1413.

http://dx.doi.org/10.1016/j.asw.2014.03.006
1075-2935/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Despite differing points of view and conflicting findings in research, feedback on form remains
an important aspect of second language writing. Gatekeeping mechanisms such as high-stakes exams
(e.g., Compass and TOEFL – Test of English as a Foreign Language) place great emphasis on grammatical
accuracy in writing. Also, many English as a Second Language (ESL) students are likely to request
explicit grammar feedback on their writing so that they can see their errors and learn from them. Similarly,
many ESL teachers believe that they are not doing their job effectively if they do not put emphasis on
grammatical errors in student essays. Feedback and revision, however, require a considerable amount
of time and effort on the part of both students and teachers. In an effort to reduce the workload for
both teachers and students, many instructors are looking into the use of Automated Essay Scoring
(AES) systems to complement more traditional ways of providing writing feedback.
Due to their capability of evaluating an essay within seconds, AES systems have been an integral
part of large-scale writing assessments since 1999. Over the years, the companies developing these systems have expanded
their research in AES and produced instructional applications that generate not only immediate scoring but also instant feedback on various traits of writing. Hence, instructional applications
of AES systems have gained popularity in school systems including middle and high schools as well as
colleges and universities. Four AES systems that are widely used include Project Essay Grader (PEG)
by Page and Measurement Inc., Intelligent Essay Assessor (IEA) and WriteToLearn by Pearson Assessments, IntelliMetric and MY Access by Vantage Learning, and e-rater® and Criterion by the Educational
Testing Service (ETS) (see Author A, 2006 for a detailed overview of AES systems).
This study employed Criterion, a classroom-based AES system supported by ETS' e-rater
scoring engine, which scores essays within seconds and simultaneously provides feedback on five traits of writing
(grammar, usage, mechanics, style, and organization and development). E-rater® uses
Natural Language Processing (NLP) techniques and identifies specific lexical and syntactic cues in a text
to analyze essays (Burstein, 2003; Kukich, 2000). Criterion, like other classroom-based AES systems,
has been developed with native speakers of English in mind, yet it is marketed for and used by English
as a Second/Foreign Language (ESL/EFL) speakers as well (Warschauer & Ware, 2006). In our own
English for Academic Purposes (EAP) writing classrooms in a program for matriculated, developmental
immigrant college students, we confront on a daily basis the dilemma of finding enough hours in the
day to provide adequate feedback on student writing. We were intrigued by the marketing of Criterion
as providing effective feedback for ESL writers and decided to not only try it with our students but
to conduct an investigation of its efficacy as compared to teacher feedback on form. Therefore, the
present study explored the use of the AES system Criterion in an English for Academic Purposes (EAP)
classroom and compared automated feedback to instructor feedback in order to better understand the
affordances and limitations of AES feedback for second language writers.

2. Literature review

2.1. The adequacy and effectiveness of instructor feedback on grammar and usage

Scholars have debated for approximately 30 years whether focusing on the formal aspects of
language or on the meaning of the text should be the main concern for teachers when providing feedback.
Research results, however, are inconclusive. Zamel started this ongoing discussion when she published
her 1985 study advocating the importance of feedback on the content of an essay (content should precede form);
otherwise, she argued, students may neglect to address content-related issues in their writing because
they are more concerned with accuracy. Fathman and Whalley (1990) disagreed with Zamel
(1985), stating that emphasizing grammar might not have a negative influence on the content after
all. Likewise, Ferris (1995) found that excessive attention to grammar was not as ineffective as it was
believed to be. Based on similar findings in their studies, both Ferris (1995) and Fathman and Whalley (1990)
recommended that teachers keep a balance between feedback on form and content. Ashwell’s (2000)
findings also contradicted Zamel's (1985) and supported those of Ferris (1995) and Fathman
and Whalley (1990). In his 2000 study, Ashwell found that the feedback pattern Zamel (1985)
recommended was not superior to the other feedback patterns he examined: the first involved
teacher feedback on content prior to feedback on form, the second reversed this order, the third
mixed form and content feedback, and the final one included no feedback at all. In a later study,
Chandler (2003) also disagreed with Zamel's position on the content vs. form dichotomy and
suggested that a particular feedback order may not be required for some genres, such as the
autobiographical writing in his 2003 study and the journal writing in Fathman and Whalley's (1990) study.
Truscott (1996, 2004, 2007), on the other hand, disagreed with his colleagues’ views on grammar
feedback, particularly those of Ferris (1995) and Ashwell (2000). He strongly argued that since grammar
correction dealt only with surface level correction, it was an ineffective strategy to use in writing classes
(1996). In his 2007 article, Truscott reviewed various studies on grammar correction and claimed that
grammar correction was actually “harmful” to second language students. Findings of a more recent
study (Bitchener & Knoch, 2010) do not support Truscott’s view of correction. Bitchener and Knoch
(2010) found that corrective written feedback improved the accuracy of English as a Second Language
(ESL) students’ writings. In an attempt to assist with the implementation of error treatment in second
language writing, Ferris wrote a book in 2011 and summarized the viewpoints in error treatment: (1)
error feedback helps students revise and edit their texts; (2) error feedback leads to accuracy gains
over time; (3) students and teachers value error feedback; and (4) written accuracy is important in
the real world.

2.2. Research on Automated Essay Scoring

The major focus of AES research has been to prove the validity and reliability of AES systems, and
such research has reported high agreement rates between machine and human raters (see Burstein
& Chodorow, 1999; Burstein et al., 1998; Landauer, Laham, & Foltz, 2003; Nichols, 2004; Page, 2003;
Rudner, Garcia, & Welch, 2006; Vantage Learning, 2000a, 2000b, 2001, 2002, 2003a, 2003b; Wang
& Brown, 2007). Like other AES developers, ETS conducted a number of studies that mainly
addressed reliability and validity issues to prove that e-rater, the scoring engine of Criterion, has
high correlations with human raters (Attali, 2007; Attali & Burstein, 2006; Burstein, 2003; Burstein
& Chodorow, 1999; Burstein et al., 1998; Chodorow & Burstein, 2004; Lee, Gentile, & Kantor, 2008;
Powers, Burstein, Chodorow, Fowles, & Kukich, 2001). The results of these studies did show that e-rater
was a reliable tool to score essays.
There are a handful of studies that focus on the instructional applications of AES systems. Unlike
the majority of scoring engine studies, most of these classroom-based studies have been conducted
by independent researchers (Author A, 2010; Chen & Cheng, 2008; Choi & Lee, 2010; Grimes &
Warschauer, 2010; Warschauer & Grimes, 2008). A study that focused on Criterion provided essential
information regarding the effectiveness of its automated feedback and revision features
(Attali, 2004). This study employed more than 9000 essays that were submitted more than
once by students in grades 6–12. The results revealed that the rates of the targeted error types were significantly
reduced, and the students were able to use the feedback provided by Criterion effectively to improve
their writing quality.
Grimes and Warschauer (2010) investigated the attitudes of teachers and students toward the use
of the MY Access program in middle school settings and found that despite the negative views of many
teachers and students regarding the reliability of the program, MY Access provided benefits in terms
of classroom management and student motivation. In a similar study, Warschauer and Grimes (2008)
explored the use of MY Access in four secondary schools. The observation, interview, and survey results
revealed that both teachers and students had positive attitudes toward MY Access (e.g., it increased
student motivation, promoted autonomous student activity, and saved teachers time), yet the
program was not used extensively in the classrooms. The results also showed the program promoted
surface level revisions by the students.
Chen and Cheng (2008), on the other hand, conducted a study in an EFL setting to examine the
use of MY Access in three EFL classrooms in three different ways and report student perceptions.
They found that students in all three classes perceived the use of the AES system largely unfavorably.
However, students viewed the program more positively when MY Access was used to
help them revise their writing, particularly when teacher feedback was provided following
the AES feedback. An example of a study on MY Access in an ESL setting is Author A's (2010) study,
in which she compared the feedback generated by MY Access to that provided by the teacher. After
qualitatively analyzing the data collected in an Intensive English Center, Author A concluded that the AES
feedback was generic, redundant, and lengthy in nature. Choi and Lee (2010) also focused on the
feedback capability of AES. In their experimental study, they investigated the use of Criterion in a
college ESL writing program and the influence of different types of feedback (by Criterion and/or the
teacher) on students’ writing quality. Choi and Lee found that the impact was positive and that it was
most effective when the two feedback types were provided together.
As discussed previously, the form versus meaning dichotomy has been a major concern in ESL writing
research. The studies reviewed above indicate that AES systems promote revisions to the
surface features of an essay rather than to its meaning. Furthermore, AES systems have been found to be most
effective when used alongside teacher feedback as a support tool in the classroom.

2.3. Limitations of AES studies

Research on AES with ESL writers is limited for several reasons. First of all, the majority
of AES studies include writing data from native English-speaking writers, mostly in large-scale writing
assessments (e.g., Attali, 2004; Attali & Burstein, 2006; Burstein et al., 1998; Landauer, Laham, Rehder, &
Schreiner, 1997; Landauer et al., 2003; Nichols, 2004; Page, 2003; Powers, Burstein, Chodorow, Fowles,
& Kukich, 2000; Powers et al., 2001; Rock, 2007; Rudner et al., 2006; Vantage Learning, 2000a, 2000b,
2001, 2002; Wang & Brown, 2007). A small number of studies analyzed the essays from non-native
English speakers (e.g., Attali, 2007; Attali & Burstein, 2006; Author A, 2010; Burstein & Chodorow,
1999; Chodorow & Burstein, 2004; Edelblut & Vantage Learning, 2003; Chen & Cheng, 2008; Choi &
Lee, 2010; Elliot & Mikulas, 2004; Lee et al., 2008), yet only a few of them investigated the use of an
AES system in ESL or EFL classrooms (e.g., Author A, 2010; Chen & Cheng, 2008; Choi & Lee, 2010).
The present study aims to extend the scope of research in this area as our knowledge regarding the
utilization of AES in ESL/EFL classroom contexts is insufficient. Additionally, only two studies (Author
A, 2010; Choi & Lee, 2010) compared AES feedback to instructor feedback. While Author A (2010) made a
direct, qualitative comparison between MY Access feedback and teacher feedback (form and content),
Choi and Lee (2010) quantitatively compared Criterion scoring and feedback to those of the instructor. Unlike
these two studies, the present study compared and contrasted the feedback generated by both
Criterion and the instructor on the surface features of writing and analyzed the feedback data both
qualitatively and quantitatively.
Another limitation is that the vast majority of research in AES is conducted or sponsored by
the developers of these systems (e.g., Attali, 2004, 2007; Attali & Burstein, 2006; Burstein & Chodorow,
1999; Burstein et al., 1998; Chodorow & Burstein, 2004; Elliot & Mikulas, 2004; Edelblut & Vantage
Learning, 2003; Landauer et al., 1997, 2003; Lee et al., 2008; Nichols, 2004; Page, 2003; Powers et al.,
2000, 2001; Rock, 2007; Rudner et al., 2006; Vantage Learning, 2000a, 2000b, 2001, 2002, 2003a,
2003b; Wang & Brown, 2007). We believe that, as independent researchers, we can provide an outsider
perspective on the research in this field. Thus, the present study aims to take a closer look at the
feedback provided by an AES system (Criterion) in terms of grammar, mechanics, usage, and style and
to provide insights about its feedback capacity on the formal aspects of the language by comparing it to
teacher feedback. This study is a critical one since it fills a gap in AES research by focusing on
non-native English speakers and their use of an AES system in a classroom environment.

3. The present study

3.1. Participants and setting

The context for this study is a four-year, open access institution in the Southeastern United States
where both researchers are faculty in the English for Academic Purposes (EAP) program. The college is
the newest member of its state’s university system and was founded in order to meet the needs of the
county’s rapidly expanding population, of which 32.2% speak a language other than English at home

Table 1
Participant information.

Name Gender L1 Age of arrival Years in U.S.

Anika F Bengali 21 13
Elsie F Haitian Creole 13 6
Nora F (Liberia) 22 21
Hao M Vietnamese 14 5
Huynh M Vietnamese 22 10
Hong F Vietnamese Born in U.S. 18
Felicita F Spanish 17 1
Fareed M Urdu 18 1
Sofia F Spanish Born in U.S. 19
Thanh M Vietnamese 8 10
Linh F Vietnamese 22 1
Xin M Mandarin 11 10
Ying F Cantonese 19 2
Zeb F Hmong Born in U.S. 28

Note: All names are pseudonyms.

(U.S. Census). Participants included 14 advanced EAP students, all matriculated college students from
a wide variety of linguistic backgrounds (see Table 1).
These students represented the complex range of linguistic minority students who are entering U.S.
colleges in increasing numbers (Roberge, 2009). Seven of the participants, for example, fit the category
often referred to as Generation 1.5, a term first used by Rumbaut and Ima (1988), to describe people
who immigrated during childhood or adolescence and have life and educational experiences—often
interrupted—spanning two or more countries. Four other participants immigrated to the U.S. as adults
in their twenties and can be described as first generation immigrants who received their secondary
education abroad. In addition, three participants were actually born in the U.S. to immigrant parents,
and are therefore considered second generation immigrants. Nevertheless, what all study participants
have in common is that they self-identified at the time of their college matriculation as English as a
Second Language (ESL) speakers for whom English is not their dominant language. As such, all partic-
ipants voluntarily chose to take the ESL version of the college placement test (COMPASS ESL test of
grammar and eWRITE). Thus, though we recognize the complexity of these participants’ immigration
and linguistic backgrounds, we do consider all of these students to be, by their own definition, second
language writers of English.
The students were from one section of EAP 0091 (Structure and Composition II) which was taught
by the second author of this paper. EAP 0091 is the final course in a two-level pre-college writing
sequence for students who are not writing at the college level upon entry, as determined by the
placement testing. EAP 0091 is a prerequisite course for the college-level course sequence of English
1101 and 1102, as well as all other writing-intensive courses at the college, including most courses
in science, education, business, and liberal arts. In order to satisfy EAP 0091 course requirements and
move on to college-level work, students are required to receive a passing score by at least two out of
three raters from the EAP and English departments on a timed writing essay exam given at the end
of the semester. While the holistic scoring rubric includes organization and development as well as
grammar, usage, and mechanics, there is a tendency for raters to place a great deal of emphasis on
facility of language use. Therefore, the ability to write—with minimal errors in grammar, usage, and
mechanics—a multi-paragraph essay in response to a prompt is of critical importance to students in
the program.

3.2. Research questions

This study was conducted to answer the following research questions:

(1) What is the nature of AES feedback compared to instructor feedback in a college ESL classroom in
terms of the feedback categories of Grammar, Usage, and Mechanics?
(2) What are the student perceptions of using an AES system in a college ESL classroom?

3.3. Criterion program

The AES system used in this study is Criterion by ETS. The program provides instant feedback and
scoring on grammar, usage, mechanics, style, and organization and development in student essays. The
Criterion library offers over 180 higher education topics (i.e., 61 College Level I topics, 64 College Level
II topics, 10 College Preparatory topics, 14 GRE topics, and 35 TOEFL test topics) from various genres
including persuasive, informative, narrative, expository, issue, and argumentative. Additionally, the
program generates topics at high, middle, and elementary school levels. Instructors are not limited
to the topics in the Criterion library since the program allows them to use their own topics as well.
Criterion offers two student support tools: the Writer's Handbook, which provides explanations and
examples of various errors, and Pre-writing Templates, including Outline, List, Idea Tree, Free Write,
Idea Web, Compare & Contrast, Cause & Effect, and Persuasive. Finally, the program provides lesson
plans and exercises for instructors to adopt or adapt for classroom use (ETS, n.d.).

3.4. Data collection and procedures

The main purpose of this study was to provide insights regarding the use of an AES system in an
ESL college classroom. Participants wrote on three separate prompts in one semester as part of the
Structure and Composition II class. The following prompts were selected from the Criterion library:

Prompt (1) The 21st century has begun. What changes do you think this new century will bring? Use
examples and details in your answer.
Prompt (2) Do you agree or disagree with the following statement? People always learn from their
mistakes. Use specific reasons and details to support your answer.
Prompt (3) Successful students do well in school for many different reasons. Identify one or two
important personal characteristics that help a student succeed in school. Use specific examples to
show why you think these characteristics are important for student success.

Students received feedback on their essays from both Criterion and their instructor. Students were
given unlimited opportunity to revise their drafts in Criterion prior to each assignment due date, and
they were asked to send their essays for teacher feedback and scoring only when they felt they were
ready. Students then received a grade based on a calculation of 40% weight from the Criterion score
and 60% from the instructor’s score. Students were also given an optional revision opportunity after
they received a grade. If students chose to revise their drafts, instructor scores were averaged.
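To make the weighting concrete, the following is a minimal sketch of the grading scheme described above (a 40%/60% split between the Criterion and instructor scores, with instructor scores averaged when a student submits the optional revision); the function name and the example values are illustrative and are not taken from the study:

# Minimal sketch of the grading scheme described in Section 3.4 (hypothetical
# function name and values): 40% Criterion score, 60% instructor score, with
# instructor scores averaged when a student submits the optional revision.
def assignment_grade(criterion_score: float, instructor_scores: list[float]) -> float:
    instructor_score = sum(instructor_scores) / len(instructor_scores)  # averaged on revision
    return 0.4 * criterion_score + 0.6 * instructor_score

# Example: Criterion score of 85, instructor scores of 78 (graded draft)
# and 88 (optional revision); all numbers are illustrative.
print(assignment_grade(85, [78, 88]))  # 0.4*85 + 0.6*83 = 83.8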
Students completed a demographic and a computer literacy survey at the beginning of the semester.
At the end of the semester, they also completed a survey to ascertain their perceptions of using AES
and of AES and instructor feedback (see Appendix A). The survey included a checklist section with general
questions regarding the challenges students faced and the resources they used at the initial and later
stages of drafting, a second section with Likert items regarding students'
opinions on each feedback trait category provided by Criterion and the instructor, and a third section
with open-ended questions about students' overall perceptions of Criterion (e.g., its strengths
and weaknesses, suggestions for its future use in writing classes). While the document
(essay) analysis was the main data collection method, opinion surveys provided crucial information
regarding student perceptions of the feedback they received.

3.5. Data analysis

In the present study, instructor feedback on student drafts (n = 37; the total number of drafts does not
equal 42 because not all participants submitted responses to all three prompts) was compared to Criterion
feedback across the Criterion-designated feedback categories of Grammar (i.e., Subject-Verb Agreement,
Ill-Formed Verbs, Fragment or Missing Comma, Run-On Sentences, Garbled Sentences, Pronoun Errors,
Possessive Errors, Wrong or Missing Word, Proofread This!), Usage (i.e., Wrong Article, Missing or Extra
Article, Confused Words, Faulty Comparisons, Preposition Error, Nonstandard Word Form, Negation Error),
and Mechanics (i.e., Spelling, Capitalize Proper Nouns, Missing Initial Capital Letter in a Sentence, Miss-
ing Question Mark, Missing Final Punctuation, Missing Apostrophe, Missing Comma, Hyphen Error, Fused
Words, Compound Words, Duplicates). In order to keep the feedback process as natural as possible
for the instructor, she was not asked to adapt her feedback to conform with the above Criterion
coding categories or to limit her feedback to those categories in Criterion’s repertoire. Instead, the
instructor continued to use her own set of form-focused error codes which she had developed prior
to incorporating Criterion into her instruction (see Appendix B).
Thus, in order to make a comparison between Criterion and instructor feedback, the researchers
mapped the instructor’s codes to the equivalent error categories used by Criterion (see Appendix
C). This process was not entirely straightforward as there were times when one Criterion category
encapsulated multiple instructor codes (e.g., the Criterion code Pronoun Errors mapped to three dis-
tinct instructor codes: Pronoun (pro), Pronoun Reference (pro ref), and Pronoun Agreement (pro agr)).
Likewise, the opposite was also occasionally the case. Criterion, for example, subdivided the single
instructor category of Word Form (wf) into Wrong Form of Word, Faulty Comparisons, and Compound
Words. In order to develop the final mapping relationships, both researchers initially mapped Criterion
and instructor codes independently and then discussed any discrepancies in order to come to agree-
ment. The Criterion Writer’s Handbook was also consulted as verbal descriptions and examples of each
error type helped the researchers clarify the intended meanings of Criterion’s codes (e.g., the difference
between Proofread This!, which was closer to the instructor's Unclear category, and Garbled Sentence,
which was closer to a Sentence Structure error) and map the relationships between the two coding sys-
tems. Importantly, there were a number of error types identified by the instructor (e.g., Singular/Plural
(s/pl), Parallel Structure (ps), and Verb Tense (vt)) which were not addressed in the Criterion classification system.
These are also listed in Appendix B and will be discussed in more detail in the findings.
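As an illustration of the structure of this mapping (not the coding instrument itself), the relationships listed in Appendix C can be represented as a simple lookup table; the sketch below uses a few of those mappings.

# Illustrative sketch of the Criterion-to-instructor code mapping described above,
# using a few of the relationships listed in Appendix C.
CRITERION_TO_INSTRUCTOR = {
    # one Criterion category covering several instructor codes
    "Pronoun Errors": ["pro", "pro ref", "pro agr"],
    # the reverse direction: several Criterion categories point back to one code (wf)
    "Wrong Form of Word": ["wf"],
    "Faulty Comparisons": ["wf"],
    "Compound Words": ["sp", "wf"],
    "Proofread This!": ["uncl"],
    "Garbled Sentences": ["ss"],
}

# Instructor codes with no Criterion equivalent (the N/A rows in Appendix C).
UNMAPPED_INSTRUCTOR_CODES = ["s/pl", "conn", "cs", "mod", "vt", "wo"]

def instructor_equivalents(criterion_category):
    """Return the instructor codes treated as equivalent to a Criterion category."""
    return CRITERION_TO_INSTRUCTOR.get(criterion_category, [])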
Feedback data were analyzed both quantitatively and qualitatively. Essay feedback on all three
prompts was coded using NVivo qualitative software. In order to evaluate the quality of the feedback
given by both the course instructor (Author B) and Criterion, the researcher who was not the class
instructor (Author A) coded all instructor and Criterion feedback as either accurate or inaccurate.
During this process, there were times when Author A noted occasions of ambiguity in the instructor’s
coding choices (cf. Example 11 in the Findings, which the instructor labeled as a Word Form (wf) error,
but which could equally likely have been considered an error of Passive Voice (pass)). As long as the
instructor’s coding choice was deemed by Author A to be an acceptable possibility, however, it was
coded as accurate. It is also important to note that while the instructor also provided students with
ample feedback on content and organization, this feedback was not included as part of the study since
Criterion provides only very limited feedback in these two areas. Finally, survey results were coded
using NVivo. Likert-scale items were analyzed quantitatively, whereas open-ended student comments
were analyzed qualitatively.
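Because every piece of Criterion and instructor feedback was coded as either accurate or inaccurate, the inaccuracy rates reported in Tables 2–4 follow directly from these counts; the short sketch below (our own illustration, not part of the NVivo workflow) shows the calculation.

# Illustrative calculation of the inaccuracy rates reported in Tables 2-4:
# each piece of feedback is coded accurate or inaccurate, and the inaccurate
# count is reported as a percentage of the total for that error type.
def inaccuracy_rate(codes):
    total = len(codes)
    inaccurate = codes.count("inaccurate")
    percent = round(100 * inaccurate / total, 1) if total else 0.0
    return total, inaccurate, percent

# Example using the Run-on Sentences row of Table 2: 17 Criterion flags, 12 coded inaccurate.
codes = ["inaccurate"] * 12 + ["accurate"] * 5
print(inaccuracy_rate(codes))  # (17, 12, 70.6)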

4. Findings

4.1. Trait feedback category grammar

Table 2 highlights the numbers of errors across all 37 essay drafts identified in the trait feedback
category Grammar by both Criterion and the instructor as well as the percentage of feedback in each
error type coded as inaccurate by the non-instructor researcher (Author A). It is clear that for the large
majority of error types (Fragment or Missing Comma and Run-on Sentences were the only exceptions),
the instructor identified more errors (570 as compared to 94) than did Criterion. Interestingly, many
of these under-identified error types are those that have been found by other researchers to be common
challenges for second language learners, such as problems with pronouns and verb forms (Doolan &
Miller, 2012). Examples 1 and 2 below highlight two of the 74 verb form errors coded by the instructor
that were missed by Criterion.

Table 2
Grammar errors identified by Criterion and Instructor.

Error type Instructor total Coded inaccurate Criterion total Coded inaccurate

Wrong or missing word 213 5 (2.3%) 3 1 (33.3%)


Ill-formed verbs 90 1 (1.1%) 16 0
Proofread this! 59 0 8 6 (75%)
Subject-verb agreement 56 1 (1.8%) 11 2 (18.2%)
Pronoun errors 52 0 0 0
Garbled sentences 36 0 3 3 (100%)
Fragment or missing comma 29 0 33 10 (30.3%)
Possessive errors 26 0 3 1 (33.3%)
Run-on sentences 9 0 17 12 (70.6%)
Total 570 7 (1.2%) 94 35 (37%)

Example 1 (Instructor-coded Verb Form error missed by Criterion): I love playing games and
hang out with friends therefore my grade are bad. (Xin, Prompt 3)

Example 2 (Instructor-coded Verb Form error missed by Criterion): In conclusion, we can say
we have already step in 21st century. (Anika, Prompt 1)

In fact, in example sentence 1, Criterion found no errors at all, despite the fact that the instructor
also coded for a run-on sentence between “friends” and “therefore.” In Example 2, Criterion missed
the verb form error of the missing past participle in the present perfect tense “have stepped” and only
identified a missing article before the noun “21st century.”
Criterion also proved unreliable at identifying the numerous word choice errors made by students
in the category Wrong or Missing Word, likely because it is very difficult for a computer program to
recognize the semantic relationships between words as well as a human reader is able to do. Example
3 illustrates an example sentence with two Wrong or Missing Word errors which were not recognized
by Criterion.

Example 3 (Instructor-coded Wrong or Missing Word error missed by Criterion): In conclusion,


there is a little bit of people who do not learn from their mistakes and do so over and over again
without looking [] the consequences. (Felicita, Prompt 2)

In this case, Criterion did not recognize that the quantifier “a little bit of,” which is typically used
with non-count nouns, cannot modify the count noun “people.” Also, the program failed to identify
the missing preposition “at,” which should collocate with the verb “looking” in Example 3. In fact,
Criterion did not identify a single error type in this sentence.
Interestingly, the two error categories which Criterion identified with the highest frequency relative
to the instructor's error identification were Fragment or Missing Comma and Run-on Sentences, which
are error types that are much less exclusive to second language writers and are frequently found in
the writing of developmental students whose first language is English (Doolan & Miller, 2012). Even
in these categories, however, the similarity in the frequency with which Criterion identified these
errors is deceiving, as evidenced by the large percentage (30.3% and 70.6%, respectively) of errors coded
as inaccurate. In many cases, Criterion tended to incorrectly label non-fragment errors as fragments
(see Example 4) and miss actual fragments entirely (see Example 5). Clearly, Example 4 contains a
variety of error types (i.e., the instructor labeled both missing word and verb form error types here),
yet it is not an example of a fragment, which is how Criterion coded the underlined sentence.

Example 4 (Criterion-coded Fragment Error judged inaccurate): The next and most
characteristic in becoming a successful student is achieving good grades in school.
I will Illustrates two or three examples of achieving good grades. (Nora, Prompt 3)

Example 5, on the other hand, illustrates an instructor-coded fragment in a section of text where
Criterion did not identify any errors at all, including the singular/plural and verb form errors which the
instructor also found to be evident.

Example 5 (Instructor-coded Fragment error missed by Criterion): The government has to create
ways for small business and large corporation to help built up the economy. By creating jobs and
opportunities for the Americans to help strengthen the economy and reduce unemployment. (Nora,
Prompt 1)

In addition to the discrepancies between the quantity and quality of Grammar errors within the
types included in Criterion’s trait feedback category, there were also a number of grammatical error
types identified by the instructor which Criterion did not address, including Singular/Plural (233);
Connecting Words (21); Comma Splice (23); Modal Verbs (51); Verb Tense (83); and Word Order (18). Of
these, Singular/Plural and Verb Tense are the most salient, both in terms of the number of errors missed
by Criterion and also because of the degree to which these errors impact the meaning-making
potential of second language writers. Examples 6 and 7 below highlight two sample errors missed by
Criterion.

Example 6 (Instructor-coded Singular/Plural error missed by Criterion): Some people learn from
their mistakes and other do not. (Hong, Prompt 2)

Example 7 (Instructor-coded Verb Tense error missed by Criterion): We just go through the
economic crisis in 2009 and 2010. (Ying, Prompt 1)

In Example 6, Hong should have used the plural others, and in Example 7, Ying needed the past
form went in order to accurately situate the economic crises of 2009 and 2010 in the past tense.
Unfortunately, Criterion did not help them or their classmates with a large number of similar errors
in their writing.

4.2. Trait feedback category usage

Table 3 illustrates the number of errors identified by both Criterion (total = 125) and the instructor
(total = 364) across all of the feedback types comprising the category of Usage. Once again, the instructor
identified more errors of all types within the category with the exception of Confused Words and
Negation Errors. Upon further inspection, however, the discrepancy in the Confused Words category was
determined to be an issue of labeling rather than of error identification. In particular, the instructor
tended to label errors with commonly confused words such as too/to/two or their/there/they’re as Wrong
Word errors under the Grammar trait category or as Spelling errors in the category of Mechanics rather
than as Confused Words.
Within the Usage category, challenges with articles, prepositions, and word forms predomi-
nate, which aligns with what is typically found in second language writing (Doolan & Miller, 2012).
Unfortunately for students, Criterion seems to do a very poor job of identifying these errors, failing
to give students the feedback they need to make corrections. While Criterion did identify a relatively
large number of Wrong Article and Missing or Extra Article errors (17 and 76) in the sample essays, a
large percentage of these errors (88.2% and 40.8%, respectively) were coded as inaccurate, meaning that
they were misidentified. Examples 8 and 9 illustrate this problem.

Table 3
Usage errors identified by Criterion and Instructor.

Error type Instructor total Coded inaccurate Criterion total Coded inaccurate

Missing or extra article 118 0 76 31 (40.8%)


Preposition error 106 0 19 5 (26.3%)
Wrong form of word 96 0 0 0
Wrong article 37 0 17 15 (88.2%)
Nonstandard word form 3 0 0 0
Confused words 2 0 11 3 (27.3%)
Faulty comparisons 2 0 1 0
Negation error 0 0 1 0
Total 364 0 125 54 (43%)

Example 8 (Criterion-coded Wrong Article error judged inaccurate): Flooding and beach erosion
can cause damages to oceanfront properties a long coastal areas which cost billions of dollars
to repair. (Huynh, Prompt 1)

Example 9 (Criterion-coded Missing or Extra Article error judged inaccurate): Third, scientist
and doctors might find cures for diseases or discover new ideas. (Sofia, Prompt 1)

In Example 8, Criterion has clearly mistaken a spelling error (Huynh should have written along
rather than a long) for an article error, but the only feedback the program gave him was this: “You
may have used the wrong article or pronoun. Proofread the sentence to make sure that the article or
pronoun agrees with the word it describes.” In Example 9, it is also true that Criterion has correctly
identified an error, but rather than an article error, this is actually a singular/plural issue. Had Sofia
used the intended plural form scientists, no article would have been necessary.
Other frequently made errors by the second language writers in this sample, namely those related
to prepositions and word forms, were also highly under-identified by Criterion as compared to the
instructor’s feedback. Criterion found only 19 to the instructor’s 106 Preposition errors, and 0 to the
instructor’s 96 Wrong Form of Word errors. Examples 10 and 11 illustrate just two of the many missed
errors of these types.

Example 10 (Instructor-coded Preposition error missed by Criterion): I also learned a lesson of


not admitting the error at the first place. (Zeb, Prompt 2)

Example 11 (Instructor-coded Wrong Form of Word error missed by Criterion): In order to


complete that mission, students have to be succeeded in school. (Hao, Prompt 3)

Clearly, in Example 10, Zeb made a small mistake in substituting the preposition at for in, while in
Example 11, Hao should have used the adjective successful rather than the past participle verb
form succeeded. Of course, another possible option would have been to use the base verb form succeed
rather than the incorrect passive form be succeeded. Unfortunately, Criterion gave neither student the
feedback needed to raise awareness about any of these issues or possibilities in their writing.

4.3. Trait feedback category mechanics

In the trait feedback category of Mechanics (see Table 4), some of the most salient differences
between Criterion and instructor feedback are in the error types Spelling and Missing Comma. While
Criterion identified 33 Spelling errors compared to the instructor’s 44, the difference is actually much
greater when Criterion’s 72.7% inaccuracy rate is taken into consideration. One of the key differences
here is the inability of Criterion to evaluate the many proper nouns used correctly by students when
the names are not found in Criterion’s spellcheck dictionary. Example 12 gives one illustration of this:

Table 4
Mechanics errors identified by Criterion and Instructor.

Error type Instructor total Coded inaccurate Criterion total Coded inaccurate

Missing comma 112 0 7 0


Spelling 44 0 33 24 (72.7%)
Duplicates 6 1 (16.7%) 0 0
Capitalize proper nouns 4 0 1 0
Missing initial capital letter in a sentence 3 0 5 0
Compound words 2 0 6 1 (16.7%)
Missing question mark 1 0 1 1 (100%)
Missing final punctuation 0 0 1 1 (100%)
Missing apostrophe 0 0 0 0
Hyphen error 0 0 0 0
Fused words 0 0 0 0
Total 172 1 (0.6%) 54 27 (50%)

Example 12 (Criterion-coded Spelling error judged inaccurate): The introduction of iPod has
changed our lives. . .Secondly, the touch screen of iTouch makes people’ lives much easier by
touching the screen to change the song instead of pressing the skip button like other mp3
players. (Hao, Prompt 1)

Evidently, at the time of this study, Criterion’s spellcheck had not been programmed to recognize
some common words related to twenty-first century technology, such as iPod, iTouch, and mp3, and
inaccurately labeled them as spelling errors. In addition, Criterion labeled as spelling errors many
proper nouns that were accurate spellings of the names of historical figures and geographical locations.
Even more frequent than missed or mistaken spelling errors was Criterion's inability to recog-
nize sentences which lacked either an obligatory or recommended comma. In fact, Criterion seemed
programmed to recognize instances of omitted commas in only very limited constructions, with six
out of the seven identified errors occurring after transitional adverbs, such as however or for instance,
as seen in Example 13:

Example 13 (Criterion-coded Missing Comma error after a transitional adverb): However


when people learn from these mistakes they can prevent and fix the consequences. (Felicita,
Prompt 2)

The instructor, on the other hand, gave students feedback on a much wider array of omitted commas,
including in compound sentences, before examples, and prior to non-restrictive relative clauses
as illustrated in Examples 14–15.

Example 14 (Instructor-coded Missing Comma error missed by Criterion): Many changes have
already started to take place [] specifically in the economy. (Elsie, Prompt 1)

Example 15 (Instructor-coded Missing Comma error missed by Criterion): Flooding and beach
erosion can cause damages to oceanfront properties a long coastal areas [] which cost billions
of dollars to repair. (Huynh, Prompt 1)

4.4. Student perceptions of AES vs. teacher feedback

Survey results indicate that despite the problems identified in this study regarding missed and
misidentified errors on the part of the AES system, students themselves perceived Criterion feedback
to be helpful. When asked to identify their level of agreement with the following statement, “I find
the Criterion feedback on Grammar helpful,” 3 out of 12 strongly agreed, 6 agreed, and 3 were neu-
tral. No students disagreed or strongly disagreed. The results for the trait feedback category of Usage
were identical, and for Mechanics, the answers were even more positive (3 strongly agree, 8 agree, 1
neutral). In addition, Criterion feedback was rated the second most frequent resource used to revise
the essays prior to submission for instructor grading, with 7 out of 12 students indicating that they
used Criterion feedback in this way. The only more frequently used resource (n = 9) was a grammar
or writing textbook. Other selections included the Criterion Writer’s Handbook (3), Criterion model
essays (2), the college tutoring center (3), and appointments with the instructor (2).
Despite the positive impressions of Criterion feedback held by most students, instructor feedback
did rate as more valuable in their estimation. In response to the statement, “I find the feedback on
accuracy that the instructor provided helpful” 8 strongly agreed, 3 agreed, and only 1 student was
neutral. In terms of open-ended responses, students were asked whether or not Criterion met their
expectations and what they would identify as the system’s strengths and weaknesses. A majority of
students (8 out of the 11 who responded to the question) identified Criterion’s ability to identify errors
of grammar, usage, and mechanics as a strength of the program. Student responses included, “how it
shows the errors;” “to find all grammatical mistakes;” “checking spelling;” “it gave me some advice
to correct my errors;” “showing me mistakes I made when I write an essay;” “Criterion’s strength are
identifying our errors and organization;” “identifying error on grammar and usage;” “it can show the
writer’s mistakes before submission.” Only one student, on the other hand, identified feedback as a
weakness of Criterion, saying, “Sometime the site misses some important errors.” Students were much
more likely to attribute weaknesses to Criterion that were unrelated to grammar, usage, and mechanics
feedback, such as Criterion’s failure to adequately address issues of content and organization, a lack
of confidence in Criterion’s scoring ability, and difficulties with the interface of the program.

5. Discussion and conclusion

In this study, AES feedback on student essays in response to three distinct prompts was compared
to instructor feedback in terms of Criterion's categories of Grammar, Usage, and Mechanics. The results
revealed that there were differences between the AES feedback and the instructor feedback. Criterion
missed or misidentified a large number
of errors made by the students, yet most of those errors were accurately identified by the instruc-
tor. In fact, the instructor provided both more (i.e., a higher total number of coded errors) and better
quality (i.e., a higher percentage of coded errors judged as accurate) feedback on form compared
to the AES system used in this study (Criterion). While the authors acknowledge that more feed-
back does not necessarily equate with higher quality feedback, as effective instructors often target
their feedback on form to the most pressing issues in student writing, it is nevertheless important
that the feedback that is provided is accurate. For the purposes of this study, we therefore defined
quality feedback as that which was deemed to be accurate. Feedback was judged inaccurate
when it either labeled an error incorrectly or labeled a non-error as an error, and such feedback was,
correspondingly, considered to be of lower quality. Our findings are consistent with the qualitative findings of
Author A (2010), which investigated the use of MY Access (the instructional application of IntelliMetric
by Vantage Learning) in an ESL classroom and compared teacher feedback to MY Access feedback.
Author A found that the instructor feedback was more accurate and focused compared to MY Access
feedback.
Additionally, Criterion did not provide any feedback on some of the error categories that the instruc-
tor did, which was particularly troubling considering that many of these errors (e.g., singular/plural
and verb tense) stand out as particularly salient in the writing of developmental, immigrant writers
(Reid, 1998). Reid noted that because these writers learn English in a U.S. context and often rely
more on their oral/aural skills than on literacy skills in making grammatical decisions in their
writing, they have a tendency to miss inflectional morphemes such as plurals and verb endings. This
was certainly found to be the case with the participants in this study. The fact that Criterion does not
adequately address these types of errors makes its efficacy for use in the ESL classroom particularly
questionable. Of course, it is important to note that AES systems are continuously being updated,
and there is some evidence to indicate that future versions may be better equipped to address such
common L2 errors as articles and prepositions (see Chodorow, Gamon, & Tetreault, 2010; Gamon &
Chodorow, 2013; Leacock, Chodorow, Gamon, & Tetreault, 2010).
An additional, though perhaps lesser, concern that arose during the study pertained to the
somewhat confusing distinctions Criterion made between error categories. While the separation of
Mechanics into its own category seemed quite logical, both students and the instructor expressed con-
fusion regarding the differences between Grammar and Usage. In fact, the division of sub-categories
here (e.g., Verb Form errors belonging to Grammar, but Word Form errors belonging to Usage) was
often perceived as arbitrary, especially because the instructor did not make such distinctions among
her own error codes (see Appendix B).
At the end of the semester, students completed a survey to ascertain their perceptions of using
Criterion, particularly in terms of the feedback they received. The survey results showed that some stu-
dents recognized the weaknesses in Criterion’s ability to provide feedback, yet most students seemed
to trust Criterion feedback, at least in part. The underlying reason could be that the instructor used the
AES program as a support tool in her classroom, which may have conveyed to the students a sense of
trust in the program. In addition, research shows that students have more positive attitudes toward
AES feedback when it is combined with teacher feedback (Chen & Cheng, 2008; Choi & Lee, 2010;
Warschauer & Grimes, 2008), which was the case in this study. It is also possible that students’ trust
in Criterion related to the fact that, as an advanced form of technology, it may have been seen by them
as more infallible than it really turned out to be. Regardless of the limitations, students reported
using Criterion feedback more than all but one other resource available to them to revise their writ-
ing. Similarly, the results of Warschauer and Grimes’ 2008 study indicated that students and teachers
had positive attitudes toward MY Access even though they did not use the program extensively in
classrooms.
Despite the many limitations of Criterion as revealed in this study, it is important to point out that
an AES program certainly has an edge over instructor feedback in terms of speed. The instructor in
this study, for example, reported spending an average of 15 minutes per essay coding grammatical,
mechanical, and usage-based errors alone, which does not include additional time spent addressing
issues of content and organization. Because giving detailed feedback is so time-consuming, it was not
feasible for the instructor to routinely provide students with feedback on multiple drafts prior to the
submission of a final draft for grading. Therefore, students frequently relied on Criterion to provide
feedback on a large number of drafts—sometimes as many as eight to ten—until they were satisfied
with their work. Despite the fact that the feedback they received was not always as accurate as they—or
their instructor—may have hoped for, students still felt motivated to use the AES program to revise
their writing.
The results of this study suggest that when selecting AES as a possible support tool for supplemen-
tal writing and to increase production, teachers should be aware of the limitations of such systems,
particularly with regard to second language writers, and make sure to inform students of these limitations.
Finally, it is imperative that AES developers include ESL/EFL (English as a Second/Foreign Language)
specialists on their teams to improve the capabilities of these programs, as they tend to miss many
types of typical second language writing errors. We heartily hope that the results of this study direct
the attention of AES developers to the needs of second language writers.

Appendix A. Criterion survey questions

1. Which challenges did you face in learning how to use Criterion? (Select all that apply.)
• Registering
• Logging in
• Finding the writing topic
• Using the “Make a Plan” feature
• Submitting your essay
• Finding Criterion feedback
• Revising the essay based on Criterion feedback
• Finding instructor pop-up-note feedback
• Revising the essay based on instructor pop-up-note feedback
• Other:
2. Which resources did you use to revise your essay before the first due date for instructor review?
(Select all that apply.)
• Criterion trait feedback
• Criterion Writer’s Handbook
• Criterion Model Essays
• Grammar or writing textbook (e.g., Grammar Sense 4)
• AEC
• Appointment with instructor
• Other:
3. Which resources did you use for your optional revision(s) after receiving instructor feedback and
grade? (Select all that apply.)
• Criterion trait feedback
• Criterion Writer’s Handbook
• Criterion Model Essays
• Instructor Pop-Up Notes (“I” Symbols)
• Instructor General Comments
• Grammar or writing textbook (e.g., Grammar Sense 4)
• Academic Enhancement Center (AEC)
• Appointment with instructor
• Other:

For questions 4–12, please rate how much you personally agree or disagree with these statements.

Strongly disagree / Disagree / Neutral / Agree / Strongly agree

4. I find the Criterion feedback on Grammar helpful (e.g., run-on, agreement, pronoun errors)
5. I find the Criterion feedback on Usage helpful (e.g., article, word form, preposition error)
6. I find the Criterion feedback on Mechanics helpful (e.g., spelling, capitalization, punctuation)
7. I find the Criterion feedback on Style helpful (e.g., repetition, passive, sentence length)
8. I find the Criterion feedback on Organization & Development helpful (e.g., thesis, ideas, conclusion)
9. I find the feedback on accuracy that the instructor provided adequate
10. I find the feedback on content that the instructor provided adequate
11. I find the score that Criterion provided adequate
12. I find the score that the instructor provided adequate

Questions 13–16 are open-ended. Please answer in as much detail as possible.

13. Did Criterion meet your expectations? Explain.


14. What are Criterion’s strengths?
15. What are Criterion’s weaknesses?
16. How should Criterion be used in EAP in future semesters?

Appendix B. Instructor editing symbols and example sentences

Editing symbol | Explanation | Example of instructor-coded error from the data set
art | Incorrect or missing article | We can to some extent predict what [] 21st century will bring. (Anika, Prompt 1)
cap | Capitalization – capital letter needed | If we talk about phones, nokia was the first company who had introduced [a] 5 mega pixel camera in its phones. (Fareed, Prompt 1)
conn | Incorrect or missing connecting word | They absolutely understand the differences between pass[ing] and fail[ing], [] bad grades and good grades. (Linh, Prompt 3)
cs | Comma splice – two independent clauses joined by a comma | When I came to the United States I was not perfect, I should have gone to complete my education [;] however, I am back in school and I have to learn from all of my mistakes. (Nora, Prompt 2)
det | Incorrect or missing determiner | Such experiences help us realize and understand things which we don't understand until we commit any mistake related to it. (Fareed, Prompt 2)
frag | Fragment – incomplete sentence | For example, Ralf Lauren, the creator of Polo who has all of America wearing the Polo designs. (Nora, Prompt 1)
lc | Lower case – word(s) incorrectly capitalized | When it comes to the economy, changes have greatly occurred in the aspect of groceries, Fuel for motor vehicles, [and] household utilities. (Elsie, Prompt 1)
mod | Incorrect use or formation of a modal | If a person was to make mistakes and continue making them, [there their] life will just be more difficult and stressful. (Sofia, Prompt 2)
p | Punctuation incorrect or missing | In conclusion, being an adult is a long journey to me and it means that I must reach the point where I can demonstrate that I have responsibilities, know how to control my emotions [] and spend my time and money wisely. (Huynh, Prompt 2)
pass | Incorrect use or formation of the passive voice | You can also ask if you have the chance of making up those assignments that were not turn in. (Nora, Prompt 3)
prep | Incorrect or missing preposition | In fact, cars will completely be run by electricity, a resource in which is recyclable. (Thanh, Prompt 1)
pro; pro ref; pro agr | Pronoun error; reference/agreement unclear or incorrect | We don't usually like mistakes and don't want to be aware of our errors, but it is truly the best for us. (Zeb, Prompt 2)
ro | Run-on – two independent clauses joined with no punctuation | When I came to the United States I was not perfect [,.] I should have gone to complete my education however, I am back in school and I have to learn from all of my mistakes. (Nora, Prompt 2)
sp | Spelling error – word incorrectly spelled | In addition, emotion[s] in humans are very strong and some time uncontrollable. (Xin, Prompt 2)
s/pl | Problem with a singular or plural of a noun | I believe that in the 21st century the technology will keep growing, [the] economy might keep decreasing, and scientist and doctors might find cures for diseases or discover new ideas. (Sofia, Prompt 1)
ss | Incorrect sentence structure (missing subject, missing verb, repetition of subject, clause problems) | When people have a job, which mean is they will have [thean] income. (Ying, Prompt 1)
sva | Incorrect subject-verb agreement | Most diseases that were unknown to people now has a cure. (Hao, Prompt 1)
uncl | Message not clear | After [committing making] the same mistake one or two times the person becomes perfect in that job. (Fareed, Prompt 2)
vf | Verb incorrectly formed | Now the computer is used for all sorts of things like surfing the web, playing games and even watch movies. (Hong, Prompt 1)
vt | Incorrect verb tense | For example, train[s] are going to be faster and more comfortable. Then, people spent less time on the way going home. (Xin, Prompt 1)
wc | Incorrect word choice | The 21century has brought too many advantages and disadvantages for human[s] in different areas such as computer science, weather [,] and health care. (Linh, Prompt 1)
wf | Incorrect word form | Most people always lose their confident and pride whenever they make a mistake in life. (Hao, Prompt 2)
wo | Incorrect or awkward word order | These changes and more are those that differentiate one from another century. (Felicita, Prompt 1)

Error codes adapted from Lane and Lange (1999).

Appendix C. Comparison of Criterion and instructor feedback categories

Criterion category    Criterion subcategory    Equivalent instructor feedback category

Grammar
    Fragment or missing comma    Fragment (frag)
    Run-on sentences    Run-on (ro)
    Garbled sentences    Sentence structure (ss)
    Subject-verb agreement    Subject-verb agreement (sva)
    Ill-formed verbs    Verb form (vf); Conditional (cond)
    Pronoun errors    Pronoun (pro); Pronoun Reference (pro ref); Pronoun Agreement (pro agr)
    Possessive errors    Possessive (poss); Punctuation (p)
    Wrong or missing word    Word choice (wc); Missing word (mw)
    Proofread this!    Unclear (uncl)
    N/A    Singular/Plural (s/pl)
    N/A    Connecting words (conn)
    N/A    Comma splice (cs)
    N/A    Modal verbs (mod)
    N/A    Verb tense (vt)
    N/A    Word order (wo)

Usage
    Wrong article    Article (art); Determiner (det)
    Missing or extra article    Article (art); Determiner (det)
    Confused words    Word choice (wc)
    Wrong form of word    Word form (wf)
    Faulty comparisons    Word form (wf)
    Preposition error    Preposition (prep)
    Nonstandard word form    Informal (inf)
    Negation error    Word choice (wc); Verb form (vf)

Mechanics
    Spelling    Spelling (sp)
    Capitalize proper nouns    Capitalization (cap)
    Missing initial capital letter in a sentence    Capitalization (cap)
    Missing question mark    Punctuation (p)
    Missing final punctuation    Punctuation (p)
    Missing apostrophe    Punctuation (p)
    Missing comma    Punctuation (p)
    Hyphen error    Punctuation (p); Spelling (sp)
    Fused words    Spelling (sp)
    Compound words    Spelling (sp); Word form (wf)
    N/A    Lower case (lc)

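Researchers wishing to replicate the tallying of matched feedback points may find it convenient to encode this alignment programmatically. The short Python sketch below simply restates the table above as a lookup; the dictionary and function names are hypothetical and are not part of the original study or of Criterion itself.

# Illustrative only (not from the original study): the Appendix C alignment as a
# Python mapping, so that Criterion error flags can be tallied against the
# instructor's editing symbols. Instructor-only codes with no Criterion
# equivalent (s/pl, conn, cs, mod, vt, wo, lc) are omitted because Criterion
# does not flag them.
CRITERION_TO_INSTRUCTOR = {
    "Grammar": {
        "Fragment or missing comma": ["frag"],
        "Run-on sentences": ["ro"],
        "Garbled sentences": ["ss"],
        "Subject-verb agreement": ["sva"],
        "Ill-formed verbs": ["vf", "cond"],
        "Pronoun errors": ["pro", "pro ref", "pro agr"],
        "Possessive errors": ["poss", "p"],
        "Wrong or missing word": ["wc", "mw"],
        "Proofread this!": ["uncl"],
    },
    "Usage": {
        "Wrong article": ["art", "det"],
        "Missing or extra article": ["art", "det"],
        "Confused words": ["wc"],
        "Wrong form of word": ["wf"],
        "Faulty comparisons": ["wf"],
        "Preposition error": ["prep"],
        "Nonstandard word form": ["inf"],
        "Negation error": ["wc", "vf"],
    },
    "Mechanics": {
        "Spelling": ["sp"],
        "Capitalize proper nouns": ["cap"],
        "Missing initial capital letter in a sentence": ["cap"],
        "Missing question mark": ["p"],
        "Missing final punctuation": ["p"],
        "Missing apostrophe": ["p"],
        "Missing comma": ["p"],
        "Hyphen error": ["p", "sp"],
        "Fused words": ["sp"],
        "Compound words": ["sp", "wf"],
    },
}


def instructor_equivalents(category, subcategory):
    """Return the instructor editing symbols aligned with one Criterion error type."""
    return CRITERION_TO_INSTRUCTOR.get(category, {}).get(subcategory, [])


# Example: Criterion's "Ill-formed verbs" flags align with the instructor's vf/cond codes.
print(instructor_equivalents("Grammar", "Ill-formed verbs"))  # ['vf', 'cond']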
References

Ashwell, T. (2000). Patterns of teacher response to student writing in a multiple-draft composition classroom: Is content
feedback followed by form feedback the best method? Journal of Second Language Writing, 9(3), 227–257.
Attali, Y. (2004, April). Exploring the feedback and revision features of Criterion. In Paper presented at the National Council on
Measurement in Education (NCME) San Diego, CA.
Attali, Y. (2007). Construct validity of e-rater in scoring TOEFL essays (RR-07-21). Princeton, NJ: Educational Testing Service (ETS).
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3),
1–31.
Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1).
Dikli, S. (2010). The nature of automated essay scoring feedback. CALICO Journal, 28(1).
Bitchener, J., & Knoch, U. (2010). Raising the linguistic accuracy level of advanced L2 writers with written corrective feedback.
Journal of Second Language Writing, 19, 207–217.
Burstein, J. (2003). The e-rater scoring engine: Automated scoring with natural language processing. In M. D. Shermis, & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113–121). Mahwah, NJ: Lawrence Erlbaum Associates.
Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., et al. (1998). Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT analytical writing assessment essays (RR-98-15). Princeton, NJ: Educational Testing Service (ETS).
Burstein, J., & Chodorow, M. (1999, June). Automated Essay Scoring for nonnative English speakers. In Proceedings of the ACL99
workshop on computer-mediated language assessment and evaluation of natural language processing College Park, MD.
Chandler, J. (2003). The efficacy of various kinds of error feedback for improvement in the accuracy and fluency of L2 student writing. Journal of Second Language Writing, 12(3), 267–296.
Chen, C. E., & Cheng, W. E. (2008). Beyond the design of Automated Writing Evaluation: Pedagogical practices and perceived
learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94–112.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater’s performance on TOEFL essays (Research report no.
73). Princeton, NJ: Educational Testing Service (ETS).
Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English
language learners: Feedback and assessment. Language Testing, 27(3), 419–436.
Choi, J., & Lee, Y. (2010). The use of feedback in the ESL writing class integrating Automated Essay Scoring (AES). In D. Gibson, & B.
Dodge (Eds.), Proceedings of society for information technology & teacher education international conference (pp. 3008–3012).
Chesapeake, VA: AACE.
Doolan, S. M., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative study. Journal of Second Language Writing,
21(1), 1–22.
Edelblut, P., & Vantage Learning. (2003, November). An analysis of the reliability of computer Automated Essay Scoring by
IntelliMetric of essays written in Malay language. In Paper presented at TechEX 03, Ruamrudee International School.
Educational Testing Service (ETS). (n.d.). Retrieved from http://www.ets.org
Elliot, S., & Mikulas, C. (2004, April). A summary of studies demonstrating the educational impact of the MY Access online writing instructional application. In Paper presented at the National Council on Measurement in Education (NCME) San Diego, CA.
Fathman, A. K., & Whalley, E. (1990). Teacher response to student writing: Focus on form versus content. In B. Kroll (Ed.), Second
language writing: Research insights for the classroom (pp. 178–190). Cambridge, UK: Cambridge University Press.
Ferris, D. (1995). Student reactions to teacher response in multiple-draft composition classrooms. TESOL Quarterly, 29(1), 33–50.
Ferris, D. (2011). Treatment of error in second language writing (2nd ed.). Ann Arbor, MI: University of Michigan Press.

Gamon, M., & Chodorow, M. (2013). Grammatical error detection in Automatic Essay Scoring and feedback. In M. Shermis, & J.
Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 251–266). New York:
Routledge.
Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6).
Kukich, K. (2000, September/October). Beyond Automated Essay Scoring. In M. A. Hearst (Ed.), The debate on automated essay grading, IEEE Intelligent Systems (pp. 27–31).
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis, & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–112). Mahwah, NJ: Lawrence Erlbaum Associates.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word
order? A comparison of Latent Semantic Analysis and humans. In Proceedings of the 19th annual conference of the cognitive
science society (pp. 412–417). Mahwah, NJ: Erlbaum.
Lane, J., & Lange, E. (1999). Writing clearly: An editing guide (2nd ed.). Boston, MA: Heinle & Heinle.
Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners.
Synthesis Lectures on Human Language Technologies, 3(1), 1–134.
Lee, W. Y., Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores from humans and e-rater (RR-08-01). Princeton, NJ: Educational Testing Service (ETS).
Nichols, P. D. (2004, April). Evidence for the interpretation and use of scores from an Automated Essay Scorer. In Paper presented
at the Annual Meeting of the American Educational Research Association (AERA) San Diego, CA.
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis, & J. Burstein (Eds.), Automated Essay Scoring: A cross-disciplinary
perspective (pp. 43–54). Mahwah, NJ: Lawrence Erlbaum Associates.
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2000). Comparing the validity of automated and human
essay scoring (RR-00-10). Princeton, NJ: Educational Testing Service (ETS).
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2001). Stumping E-rater: Challenging the validity of
Automated Essay Scoring (RR-01-03). Princeton, NJ: Educational Testing Service (ETS).
Reid, J. (1998). “Eye” learners and “ear” learners: Identifying the language needs of international students and US resident
writers. In J. M. Reid, & P. Byrd (Eds.), Grammar in the composition classroom: Essays on teaching ESL for college-bound students
(pp. 3–17). Boston: Heinle & Heinle.
Roberge, M. (2009). A teacher’s perspective on generation 1.5. In M. Roberge, M. Siegal, & L. Harklau (Eds.), Generation 1.5 in
college composition (pp. 3–24). New York: Routledge.
Rock, J. L. (2007). The impact of short-term use of Criterion on writing skills in ninth grade (RR-07-07). Princeton, NJ: Educational Testing Service (ETS).
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetricSM essay scoring system. Journal of Technology,
Learning, and Assessment, 4(4).
Rumbaut, R. G., & Ima, K. (1988). The adaptation of Southeast Asian refugee youth: A comparative study. Final Report to the U.S.
Department of Health and Human Services. San Diego, CA: San Diego State University (ERIC Document Reproduction Service
ED 299 372).
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327–369.
Truscott, J. (2004). Evidence and conjecture on the effects of correction: A response to Chandler. Journal of Second Language
Writing, 13(4), 337–343.
Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately. Journal of Second Language Writing, 16(4), 255–272.
Vantage Learning. (2000a). A study of expert scoring and IntelliMetric scoring accuracy for dimensional scoring of Grade 11 student
writing responses (RB-397). Newtown, PA: Vantage Learning.
Vantage Learning. (2000b). A true score study of IntelliMetric accuracy for holistic and dimensional scoring of college entry-level
writing program (RB-407). Newtown, PA: Vantage Learning.
Vantage Learning. (2001). Applying IntelliMetric Technology to the scoring of 3rd and 8th grade standardized writing assessments
(RB-524). Newtown, PA: Vantage Learning.
Vantage Learning. (2002). A study of expert scoring, standard human scoring and IntelliMetric scoring accuracy for statewide eighth
grade writing responses (RB-726). Newtown, PA: Vantage Learning.
Vantage Learning. (2003a). Assessing the accuracy of IntelliMetric for scoring a district-wide writing assessment (RB-806). Newtown,
PA: Vantage Learning.
Vantage Learning. (2003b). How does IntelliMetric score essay responses? (RB-929). Newtown, PA: Vantage Learning.
Wang, J., & Brown, M. S. (2007). Automated Essay Scoring versus Human Scoring: A comparative study. Journal of Technology,
Learning, and Assessment, 6(2).
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies, 3(1), 22–36.
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching
Research, 10(2), 1–24. Retrieved from http://www.gse.uci.edu/faculty/markw/awe.pdf
Zamel, V. (1985). Responding to student writing. TESOL Quarterly, 19(1), 79–97.

Semire Dikli received her Ph.D. in Multilingual-Multicultural Education at Florida State University. She has taught English for
Academic Purposes (EAP) and other English as a Second/Foreign Language (ESL/EFL) related courses both in the U.S. and in
Turkey. Her research interests include writing assessment and technology.

Susan Bleyle is an assistant professor of English for Academic Purposes at Georgia Gwinnett College and a doctoral student
in Language and Literacy Education at the University of Georgia. Her research interests include third language acquisition, the
education of developmental immigrant students, and second language writing.
