
Article

Automated Writing Evaluation and Feedback: Multiple Metrics of Efficacy

Journal of Educational Computing Research
2020, Vol. 58(1) 87–125
© The Author(s) 2019
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/0735633119830764
journals.sagepub.com/home/jec

Joshua Wilson1 and Rod D. Roscoe2

Abstract
The present study extended research on the effectiveness of automated writing
evaluation (AWE) systems. Sixth graders were randomly assigned by classroom to
an AWE condition that used Project Essay Grade Writing (n = 56) or a word-processing
condition that used Google Docs (n = 58). Effectiveness was evaluated using multiple
metrics: writing self-efficacy, holistic writing quality, performance on a state English
language arts test, and teachers’ perceptions of AWE’s social validity. Path analyses
showed that after controlling for pretest measures, composing condition had no
effect on holistic writing quality, but students in the AWE condition had more posi-
tive writing self-efficacy and better performance on the state English language arts
test. Posttest writing self-efficacy partially mediated the effect of composing condi-
tion on state test performance. Teachers reported positive perceptions of AWE’s
social validity. Results emphasize the importance of using multiple metrics and con-
sidering both contextual factors and AWE implementation methods when evaluating
AWE effectiveness.

Keywords
automated writing evaluation, interactive learning environments, automated
feedback, writing, writing self-efficacy

1School of Education, University of Delaware, Newark, DE, USA
2Human Systems Engineering, Arizona State University-Polytechnic, Mesa, AZ, USA

Corresponding Author:
Joshua Wilson, School of Education, University of Delaware, 213E Willard Hall Education Building, Newark, DE 19716, USA.
Email: joshwils@udel.edu

Introduction
In response to an ongoing need for innovative and scalable writing instruction
and assessment (National Center for Education Statistics, 2012; Persky, Daane,
& Jin, 2002), educators and researchers are increasingly looking toward
computer-based approaches such as automated writing evaluation (AWE;
Palermo & Thompson, 2018; Shermis & Burstein, 2013). At their core, most
AWE technologies use (a) natural language processing (NLP) tools to extract
linguistic, syntactic, semantic, or rhetorical features of text related to writing
quality and (b) statistical or machine-learning algorithms to generate scores and
feedback based on patterns observed among those features. Importantly,
modern AWEs typically do more than provide scores and feedback. When
coupled with tutorials or educational games (e.g., Roscoe & McNamara,
2013), learning management features (e.g., Grimes & Warschauer, 2010), or
peer assessment (e.g., Balfour, 2013), AWEs potentially offer flexible, robust,
and time-saving additions to the writing curriculum.
Despite widespread adoption of AWE in English language arts (ELA) courses
across the United States (Grimes & Warschauer, 2010; Stevenson, 2016), there
remains meaningful reticence about the appropriateness of AWE in writing
instruction, perhaps because adoption has largely outpaced the body of support-
ing evidence. There are multiple, interrelated dimensions by which AWEs might
be judged (e.g., accuracy, validity, and efficacy), yet prior scholarship has pre-
dominantly privileged evidence of AWE’s scoring accuracy and reliability (e.g.,
Keith, 2003; Shermis, 2014; Shermis & Burstein, 2003, 2013). Evaluations of
AWE’s effectiveness (e.g., Stevenson & Phakiti, 2014)—that is, AWE’s ability
to improve the teaching and learning of writing—are less common despite such
evaluations having the most immediate and practical benefit for educators.
This article contributes to the evidence base of the effectiveness of AWE and
does so by conceptualizing and assessing efficacy in multiple ways. We consider
the effects of classroom AWE use on students’ writing quality after instruction,
students’ writing self-efficacy, students’ performance on standardized state tests,
and teachers’ perceptions of AWE effectiveness and usability. Although prior
research has focused on one or two indicators, the combination of metrics in
the current study offers triangulation on whether and how AWEs may be effect-
ive for improving the teaching and learning of writing in ELA classrooms.

The Effectiveness of AWE in Writing Instruction


Depending on their configuration, AWEs support the teaching and learning of
writing by enabling a variety of interactions between the technology, learners,
teachers, and peers. For instance, learner–system interactions include direct use
of the system by students to plan, write, receive automated scores and feedback,
revise their work, and improve their writing. Such interactions translate into
opportunities for increased writing practice with formative feedback, which is
crucial for development of writing skill (see Kellogg, 2008; Kellogg & Raulerson,
2007). Similarly, teacher–system interactions describe how teachers might access
system data to monitor students’ writing progress to plan or modify instruction
(e.g., Grimes & Warschauer, 2010). In some AWEs, learner–teacher interactions
are enabled by features whereby students can send messages to their teacher
asking for help, and teachers can provide supplemental feedback via in-text
and summary comments (e.g., Wilson & Czik, 2016). Finally, some AWEs
enable learner–learner interactions by facilitating peer assessment (e.g.,
Balfour, 2013). Collectively, these interactions enable teachers to enact various
evidence-based practices for writing instruction, including adult-, peer-, and
technology-based feedback (Graham, McKeown, Kiuhara, & Harris, 2012;
Graham & Perin, 2007; Morphy & Graham, 2012), as well as formative assess-
ment practices pertaining to diagnostic assessment and progress monitoring
(Graham, Hebert, & Harris, 2015).
For instance, one study observed that teachers who incorporated an AWE
system called Project Essay Grade (PEG) Writing gave proportionately more
feedback on higher level writing skills (idea development, organization, and
style) than teachers who did not use the AWE (Wilson & Czik, 2016). Due to
time, lack of training, and other constraints, teachers typically provide insufficient
feedback on higher level writing skills (Ballock, McQuitty, & McNary, 2017;
Matsumara, Patthey-Chavez, Valdes, & Garnier, 2002), despite the importance
of such feedback. Thus, the finding that an AWE can support improved teacher
feedback practices demonstrates the potential value of AWE in ELA classrooms.
Despite these plausible benefits and affordances, the adoption of AWE
remains controversial. Decades of research have resulted in highly reliable and
accurate scoring algorithms (Page, 2003; Shermis, 2014; Shermis & Burstein,
2003)—computer-assigned ratings demonstrate near-perfect agreement with
expert human-assigned ratings. Nonetheless, critics have argued that AWE
undermines the inherently social nature of writing (National Council of
Teachers of English, 2013). Critics also fear that exposing students to automated
feedback will cause their writing to become stilted and predictable (Cheville,
2004; Conference on College Composition and Communication, 2004;
National Council of Teachers of English, 2013). Finally, because AWE relies
on NLP-based detection of textual features, it is inherently limited to what can
be detected via those methods. Consequently, some opponents argue that auto-
mated approaches lack construct validity, wherein the construct of writing qual-
ity is narrowed to a mere measurement of text length, syntax, and vocabulary1
(Condon, 2013; Perelman, 2014; see also Bejar, Flor, Futagi, & Ramineni, 2014).
Overall, given that writing involves complex, sociocultural interactions between
authors, texts, and readers, there are valid tensions with respect to the construct
validity of AWE (Cotos, 2015; Deane, 2013).
The aforementioned debates center on questions of whether AWEs can or
should perform the task of assessing writing. However, in practical terms, a more
fundamental question might be whether incorporating AWEs into ELA class-
rooms helps students improve their writing or helps teachers improve their
instruction. Are AWE technologies effective?
Currently, studies on the effectiveness of AWE are sparse. A few studies have
considered efficacy in terms of writing quality, but less attention has been paid to
performance on standardized tests or other important outcomes such as self-
efficacy, attitudes, technology perceptions, or technology adoption. No prior
study has simultaneously considered all these outcomes. We briefly review this
mixed evidence base, which then drives the present study that evaluated an AWE
system on these myriad metrics.

Effects on Writing Quality


A recent review highlights that clear conclusions about AWE’s effects on writing
quality are difficult to draw because so few studies have sampled students in first-
language classrooms in the elementary or middle grades (Stevenson & Phakiti,
2014). Most evaluations of AWE have been conducted with samples of post-
secondary students and English learners (e.g., Dikli & Bleyle, 2014; Liu, Li, Xu,
& Liu, 2016). Moreover, the most common control group in prior efficacy
studies has been a no feedback control group, despite the intended use of
AWE to supplement and not replace the teacher as a feedback agent (Deane,
2013; Grimes & Warschauer, 2010). Nevertheless, Stevenson and Phakiti (2014)
concluded that AWE has small to moderate benefits for writing quality.
For instance, a study of the Criterion AWE system conducted by Kellogg,
Whiteford, and Quinlan (2010) found that undergraduate students who received
continuous automated feedback were able to decrease the number of errors in
grammar, usage, mechanics, and style from first to final drafts, but the amount
of automated feedback had no bearing on overall writing quality. One potential
explanation is that feedback provision is different from feedback usage: Simply
providing students with feedback does not ensure that students use it. Indeed,
Chapelle, Cotos, and Lee (2015) found that undergraduates using Criterion
ignored the feedback more than 50% of the time, but students who did use
the feedback were successful at improving their writing in 70% of those
instances. These studies suggest that, when used, automated feedback appears
to support gains in components of writing quality.
However, these gains in writing quality appear limited to composing in the
presence of automated feedback. For instance, Wilson et al. examined whether
students in Grades 4 to 8 (approximately 1,000 students in each study) were able
to use automated feedback from PEG Writing to improve their writing quality
when revising and whether this improvement would transfer to new compos-
itions written without PEG support. Findings were consistent across demo-
graphics and achievement levels: Students were able to effectively use feedback
from PEG Writing to improve their essay quality across successive drafts,
though this improvement did not transfer to performance on first drafts of
unsupported essays (Wilson, 2017; Wilson & Andrada, 2016; Wilson,
Olinghouse, & Andrada, 2014).
Yet, when AWE is combined with effective evidence-based strategy instruc-
tion, students have made gains in independent writing performance. Palermo
and Thompson (2018) evaluated the effects of three writing instructional condi-
tions on the writing quality of middle school students assigned to a business-as-
usual teacher instruction condition, an AWE+teacher instruction condition, or
an AWE+Self-Regulated Strategy Development (see Graham & Perin, 2007)
instruction condition. Students in the AWE+Self-Regulated Strategy
Development condition composed longer, better quality essays that included
more key elements of argumentation than did students in the other conditions.
Students in the AWE+teacher condition outperformed students in the teacher-
only condition. Results suggest that the combination of AWE with highly effect-
ive writing strategies leads to transfer of learning. Indeed, as shown in Wilson
and Czik (2016), combining AWE with business-as-usual teacher instruction had no
effects on students’ writing performance from pretest to posttest, nor was that
combination more effective than the combination of Google Docs with business-
as-usual writing instruction. The instructional context of AWE is important.
In sum, existing evidence indicates that AWE has small to moderate effects on
components of writing quality particularly when students compose in the pres-
ence of AWE support. However, additional research is needed to study whether
AWE is associated with improvements in more global measures of writing qual-
ity when students compose independently.

Effects on State Test Performance


Stakeholders considering adoption of AWE must weigh not only the direct costs
associated with purchasing a given software but also the costs of not purchasing
an alternative educational technology that is potentially more effective at pro-
moting outcomes for which schools are accountable. In the United States, end-
of-year tests are the primary accountability metrics (Behizadeh & Pang, 2016).
Therefore, the efficacy of AWE for supporting improvements in students’ state
test performance is an important consideration.
Only one methodologically rigorous study has tested whether the use of AWE
leads to improved performance on state tests relative to a non-AWE control
condition. Shermis, Burstein, and Bliss (2004) explored this question using
within-class random assignment to assign 10th-grade students to compose up
to 7 prompts with or without the use of Criterion. Performance on a state
writing test was then contrasted for the two groups. No statistically significant
differences were found. As Warschauer and Ware (2006) explain, this may have
been due to the use of within-class random assignment. Introducing AWE at the
student level, rather than classroom level, likely did not allow for a broader
instructional integration of the software. Alternatively, state test performance
may be too distal an outcome for AWE to influence.

Effects on Writing Self-Efficacy


In addition to improving writing quality and proficiency on state tests, other cru-
cial outcomes pertain to students’ writing motivation and writing self-efficacy.
Writing motivation refers to a student’s level of interest, values, effort,
goal orientation, and outcome attributions as they refer to writing (Troia,
Shankland, & Wolbers, 2012). Writing self-efficacy refers to a student’s
confidence in their ability as a writer (Bruning & Kauffman, 2016). Such factors
are key components of theories of writing proficiency (Bruning & Kauffman,
2016; Hayes, 2012; Sanders-Reio, Alexander, Reio, & Newman, 2014; Troia
et al., 2012), and empirical research has shown that motivation and self-efficacy
influence writing achievement (Graham, Berninger, & Fan, 2007). Attitudes do
not directly cause students to write well or poorly, but students with more posi-
tive motivational beliefs tend to write more, both in and outside of school
(Troia, Harbaugh, Shankland, Wolbers, & Lawrence, 2013), and students
with greater self-efficacy expend more effort on their writing and persist in the
face of challenge (Pajares, Johnson, & Usher, 2007). Importantly, one principal
source of students’ writing motivation and self-efficacy is the feedback
they receive from teachers, peers, or other sources (Dujinhower, Prins, &
Stokking, 2010; Pajares, 2003); feedback can increase or decrease students’
self-efficacy.
Prior research conducted in U.S. ELA classrooms suggests that AWE may
have a positive influence on students’ self-efficacy. Warschauer and Grimes
(2008) surveyed teachers and students in four secondary schools, and respondents
reported that the use of AWE led to improved writing motivation and revising
behavior and that these improvements were stronger than those associated with
using basic word processing (Grimes & Warschauer, 2010). This latter finding is
significant because word processing itself is associated with improvements in
writing motivation when compared with handwriting (Morphy & Graham,
2012). Wilson and Czik (2016) reported similar results in a quasi-experimental
study of eighth-grade students assigned by classroom to either an AWE condition
(PEG Writing) or a word-processing condition using Google Docs. Students who
used PEG Writing expressed stronger willingness to invest time in solving prob-
lems in their writing. Other studies have used behavioral measures of writing
motivation, showing that students who use AWE increase time on task and
revise more (Foltz, 2014; Moore & MacArthur, 2016).
Overall, AWE shows promise for benefiting writing motivation and self-
efficacy, but additional research is needed to replicate or extend these findings.
For instance, prior research suggests that self-efficacy beliefs may influence other
outcomes of interest, such as writing quality or performance on state tests. Thus,
evaluations of AWE should model relationships between AWE usage, self-
efficacy, writing quality, and state test performance.

Perceived Effectiveness and Utility


Although objective measures of performance and attitudes represent meaningful
metrics of efficacy, teachers’ perceptions of educational technology tools are also
essential (Tondeur, van Braak, Ertmer, & Ottenbreit-Leftwich, 2017). Teachers’
perceptions of the acceptability, importance, and efficacy of AWE—that is, its
social validity (see Lyst, Gabriel, O’Shaughnessy, Meyers, & Meyers, 2005;
Wolf, 1978)—represent another measure of AWE effectiveness. Teachers’ per-
ceptions are likely to influence their degree of adoption and integration of AWE
into instruction (see Davidson, Richardson, & Jones, 2014; Hutchison &
Woodward, 2014) and whether they positively predispose students to AWE
(Roscoe, Wilson, Johnson, & Mayra, 2017). Teachers’ comfort level with tech-
nology tools such as AWE and their perceptions of its affordances and efficacy
are likely to influence teacher use (see Ertmer, 1999; Hew & Brush, 2007;
Kopcha, 2012; Okumus, Lewis, Wiebe, & Hollerbrands, 2016), which in turn
may influence the magnitude of the effects of AWE on students’ writing
outcomes.
Of the studies that examined the efficacy of AWE in ELA classrooms in the
United States, few have probed teachers’ perceptions of AWE. Such studies
reported that teachers perceive AWE to save them time and make
writing instruction easier and to help teachers focus their feedback on higher
level writing concerns (Grimes & Warschauer, 2010; Palermo & Thompson,
2018; Wilson & Czik, 2016). In addition, teachers have reported that AWE
fosters students’ writing motivation, self-efficacy, and independence while writ-
ing (Warschauer & Grimes, 2008; Wilson & Czik, 2016). Teachers have
expressed less positive beliefs about the accuracy of the AWE scoring engines
(Grimes & Warschauer, 2010; Palermo & Thompson, 2018), perceiving that
automated feedback can be broad or difficult for students to interpret
and implement when revising (Palermo & Thompson, 2018). Nevertheless,
these reservations do not appear to be the limiting factors in teachers’ use
of AWE. Instead, teachers cite external pressures such as lack of time for
writing instruction, lack of computer access, or competing priorities (e.g.,
raising reading scores) as reasons for lack of AWE usage (Warschauer &
Grimes, 2008).
In sum, insufficient research has probed teachers’ perceptions of AWE,
though findings tend to be positive. Thus, the present study examined teachers’
perceptions of an AWE system with respect to its usability, effectiveness, and
desirability.

Research Objectives and Questions


Substantially more scholarship has attended to the reliability of automated
scoring than to establishing whether AWE systems are effective at improving
the teaching and learning of writing (Bennett, 2004; Shermis, Burstein, Elliot,
Miel, & Foltz, 2016). Consequently, scholars have called for expanding the body
of knowledge on AWE by directly addressing this issue of AWE’s efficacy
(Shermis et al., 2016; Xi, 2010). Thus, the overarching purpose of the current
study was to investigate multiple metrics of the effectiveness of AWE instruction
and feedback.
This work was conducted in the context of middle school ELA with students
and teachers who used the PEG Writing AWE system (see Bunch, Vaughn, &
Miel, 2016; Page, 2003, and see later for a description) over most of a school
year. Specifically, sixth-grade students used either PEG Writing or word-proces-
sing software (Google Docs) to compose three essays with either automated or
teacher feedback, respectively. Because word-processing software has demon-
strated benefits for writing outcomes, particularly for struggling writers
(Morphy & Graham, 2012), it serves as a fair control.
The work was guided by five research questions (RQs). Specifically, what are
the effects of using PEG Writing versus Google Docs on students’ writing qual-
ity (RQ1), writing self-efficacy (RQ2), and state ELA test performance (RQ3)?
In addition, how are these metrics interrelated (RQ4)? For example, are the
effects of AWE on writing outcomes mediated by self-efficacy beliefs, as sug-
gested by prior work on motivation? Finally, what are teachers’ perceptions of
the usability, effectiveness, and desirability of PEG Writing in comparison with
Google Docs (RQ5)?

Method
PEG Writing
The current study implemented PEG Writing (henceforth referred to as PEG)
based on its well-established history, widespread use, and evidence base (see
Palermo & Thompson, 2018; Wilson, 2018; Wilson & Czik, 2016). PEG is an
AWE system based on the PEG scoring system developed by Ellis Page (2003)
and acquired by Measurement Incorporated in 2003. PEG allows teachers to
assign pregenerated writing prompts or to create their own prompts that use
words, documents, images, videos, music, or a combination of these media as
stimuli for students’ writing. The design of this system enables a full range of
interactions between the system, learners, and teachers.
Prompts are assigned in one of three genres: narrative, argumentative, or
informative. PEG includes multiple digital, interactive graphic organizers that
students may use as aids during the prewriting process. After prewriting, PEG
allows up to 1 hour for students to submit their first draft, after which students
immediately receive ratings for six traits of writing quality scored on a 1 to 5
scale: idea development, organization, style, sentence structure, word choice,
and conventions. These traits are summed to create an Overall Score, ranging
from 6 to 30. PEG provides students with formative feedback on each of the six
traits, indicating ways for students to improve their text with respect to that
specific trait. PEG also provides students with feedback on grammar and spel-
ling, as well as customized links to one or more of its 70+ mini lessons that
students can complete to gain skills needed to improve as writers. After students
receive this feedback, they may revise their essays and receive updated ratings
and automated feedback. Teachers can provide students with supplemental feed-
back through PEG by embedding margin comments or adding summary com-
ments. Students may leave comments for their teacher using PEG’s messaging
functionality.
PEG also offers a peer review function that enables teachers to establish
random or intentional groupings of students who can review each other’s
texts and provide specific suggestions for improvement. Finally, when students
are ready to publish and print their essays, PEG includes formatting functions
that allow students to alter the text in many of the same ways as Microsoft
Word® (e.g., font type, style, alignment, or size).
Although PEG’s scoring model is proprietary, PEG uses a combination of
NLP techniques such as syntactic analysis and semantic analysis to measure
hundreds of variables, some defined a priori and some induced from the data.
These variables are combined using machine-learning algorithms to predict hol-
istic and analytic essay ratings assigned by trained raters (Bunch et al., 2016;
Shermis et al., 2016). PEG's scoring models are trained to produce accurate
scores for five different grade bands (Grades 3–4, 5–6, 7–8, 9–10,
11–12) for argumentative writing, informative/explanatory writing, and narra-
tive writing. These models can be applied reliably to prompts of any topic within
a specific genre.
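
PEG's feature set and algorithms are proprietary, but the general recipe described above (extract quantitative text features, then fit a supervised model to human-assigned ratings) can be sketched as follows. The feature names, the Ridge regressor, and the data are illustrative assumptions, not PEG's implementation.

```python
# Illustrative sketch only: PEG's actual features and model are proprietary.
# It shows the generic AWE recipe: derive features from an essay, then fit a
# supervised model to predict holistic ratings assigned by trained human raters.
import numpy as np
from sklearn.linear_model import Ridge

def extract_features(essay):
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return [
        len(words),                                                 # essay length
        len(set(w.lower() for w in words)) / max(len(words), 1),    # vocabulary diversity
        float(np.mean([len(w) for w in words])) if words else 0.0,  # mean word length
        len(words) / max(len(sentences), 1),                        # mean sentence length
    ]

def train_scoring_model(essays, human_scores):
    """essays: list of str; human_scores: holistic ratings from trained raters."""
    X = np.array([extract_features(e) for e in essays])
    y = np.array(human_scores, dtype=float)
    return Ridge(alpha=1.0).fit(X, y)   # new essays are scored via model.predict(...)
```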
Empirical studies document that PEG’s essay ratings are consistent with
those assigned by trained human raters (Keith, 2003; Shermis, 2014; Shermis,
Koch, Page, Keith, & Harrington, 2002). For instance, when compared with
trained raters’ scores assigned to essays from eight different state writing assess-
ments, PEG yielded quadratic weighted kappas ranging from .70 to .84, indicat-
ing a very high level of reliability (Shermis, 2014).
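
For reference, the quadratic weighted kappa statistic cited above can be computed as in the brief sketch below; the rating vectors shown are invented for illustration only.

```python
# Quadratic weighted kappa between human and automated essay ratings.
# In practice these would be scores assigned to the same essays by trained
# raters and by the AWE engine; the values here are invented.
from sklearn.metrics import cohen_kappa_score

human_ratings = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
machine_ratings = [3, 4, 3, 5, 4, 2, 1, 4, 4, 2]

qwk = cohen_kappa_score(human_ratings, machine_ratings, weights="quadratic")
print(f"Quadratic weighted kappa = {qwk:.2f}")
```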

Setting and Participants


This study was approved by the institutional review board and conducted during the
2015–2016 school year in a middle school in an urban/suburban school district
in the mid-Atlantic region of the United States. During this year, the school
enrolled a total of 662 students in Grades 6 to 8, of whom 24% were Hispanic,
36% were African American, 36% were White, and < 5% were Asian or other
minority. Many students (40%) came from low-income families and 100% of the
school population received free lunch.
Three sixth-grade ELA teachers were recruited and consented to participate
in the study. None of the teachers or their students had previous exposure to
AWE software or PEG Writing. As compensation, teachers and students
received free subscriptions to PEG, and teachers received a stipend for assisting
with the informed consent process, study procedures, and obtaining students’
demographic and achievement data. All three teachers had earned a master’s
degree.
After receiving parental consent and student assent, a total of 114 students
participated in the study. These students were instructed in a total of 10 ELA
classroom sections by the three ELA teachers. Following a quasi-experimental
design, students were randomly assigned by classroom section to use either PEG
(n = 56) or Google Docs (n = 58). Specifically, half of each teacher's sections
were randomly assigned to either condition (five sections per condition).
Instruction was thus relatively constant across conditions because each teacher
taught sections in both conditions. Although this design was potentially
weaker (i.e., reduced internal validity) than random assignment within class-
rooms, the design mitigated issues of resentment or rivalry among students if
one composing tool was seen as more desirable than the other. Also, as noted
earlier (see Shermis et al., 2004), between-class assignment of software condi-
tions may be more successful for promoting consistent AWE integration than
within-class assignment.
Table 1 summarizes sample demographics. Chi-square tests indicated that
there were no statistically significant differences with respect to gender, race,
or special education status. A one-way analysis of variance (ANOVA) indicated
no difference in the average chronological age (measured in months) of students
in each group, F(1, 112) = 0.09, p = .771.

Measures
Writing quality. Writing quality was assessed via timed, expository essay prompts
administered prior to and following the experimental intervention. Two prompts
were obtained from public items from the State of Texas Assessment of
Academic Readiness (STAAR) test. STAAR tests were selected to avoid the
possibility that students would have been exposed to the prompts. STAAR
tests are rigorously evaluated for reliability and validity (Human Resources
Research Organization, 2016; Texas Education Agency, 2014, 2015a, 2015b).
For each prompt, students are asked to read a quotation and answer a topical
question. Students are instructed to write an essay and clearly state a controlling
idea, organize and develop their explanation, choose their words carefully, and
use correct spelling, capitalization, punctuation, grammar, and sentences. Due
to local constraints, it was not possible to counterbalance the order of the

Table 1. Demographics for Composing Conditions.

                                        PEG (n = 56)    Google Docs (n = 58)
Male                                    20              22
Race
  African American                      21              19
  Hispanic                              11              18
  Asian                                  2               0
  White                                 22              21
Special education                        4               4
Chronological age in months: M (SD)     140.37 (4.57)   140.14 (4.11)

Note. PEG = Project Essay Grade.

prompts. However, because these prompts were implemented as part of a state-
wide testing system, they had been designed to be closely equivalent in difficulty
(Texas Education Agency, 2014, 2015a, 2015b).
The pretest prompt asked students to read a quotation from Hugh Prather
stating, "True humor is fun—it does not put down, kid, or mock. It makes
people feel wonderful, not separate, different, and cut off." Students were then
instructed to "Write an essay explaining whether it is important to laugh." The
posttest writing prompt began with a quotation from Michael Jordan stating,
"If you run into a wall, don't turn around and give up. Figure out how to climb
it, go through it, or work around it." Students were then instructed to "Write an
essay explaining the importance of never giving up."
Students’ texts were transcribed by trained research assistants. A random
sample of 20% of essays (n = 46) was checked to ensure accuracy of transcrip-
tion, which was near perfect (99%). Accuracy was defined as the percent of
correctly transcribed elements (i.e., words, spelling, punctuation marks, and
formatting).
Transcribed essays were scored in two ways. First, essays were assessed via
PEG. PEG’s scoring reliability and validity are well established (Shermis, 2014),
and PEG’s consistency ensures that scores at pretest and posttest held the same
meaning (i.e., no rater drift), a critical measurement property for valid research
design. As detailed in the PEG Writing section, PEG uses a six-trait scoring
model and also outputs an Overall Score, calculated as the sum of the six traits
(range = 6–30). Due to high collinearity among traits (rs > .85), the PEG Overall
Score (henceforth the PEG score) was the first measure of writing quality in the
current study.
Second, essays were scored by two trained graduate research assistants using
three analytic rubric traits: Organization and Purpose, Evidence and
Elaboration, and Conventions. Organization and Purpose (scores from 1 to 4)
referred to the degree to which the response had a clear and effective organiza-
tion structure, including the use of a thesis statement, transition words, intro-
duction and conclusion, and a logical progression of ideas. Evidence and
Elaboration (scores from 1 to 4) referred to the degree to which the response
included sufficient elaboration and support for the thesis, as well as appropriate
vocabulary and style. Finally, Conventions (scores from 1 to 3) referred to the
degree to which the response demonstrated accurate use of sentence formation,
punctuation, capitalization, grammar, and spelling. A final score—henceforth
referred to as the rater score—was calculated as the sum of the three trait scores
(range = 0–11). Zeros were assigned to essays that were unable to be scored due
to lack of topic match or inadequate development (i.e., the student wrote only
one or two sentences).
Raters were trained by the first author until they reached a criterion of 80%
exact agreement on each trait. Raters then independently double scored all
essays. Interrater agreement, assessed as the percentage of exact agreement,
was strong for all three traits: Organization and Purpose (91.64%),
Elaboration and Evidence (94.43%), and Conventions (93.81%). Percent agree-
ment for the final rater score was 81.73%, with interrater reliability of r = .98.
Disagreements between raters were resolved by consensus. The rater score was
positively and significantly (p < .01) correlated with the PEG score at pretest
(r = .57) and posttest (r = .62).
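
For readers who wish to reproduce these agreement indices with their own scoring data, a minimal sketch follows; the trait and total scores shown are invented.

```python
# Minimal sketch of the agreement indices reported above, on invented scores:
# percent exact agreement for one trait, and Pearson r for the summed rater score.
import numpy as np
from scipy.stats import pearsonr

def percent_exact_agreement(rater1, rater2):
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    return 100.0 * np.mean(rater1 == rater2)

# Hypothetical Organization and Purpose scores (1-4) from two raters
r1_trait = [3, 2, 4, 1, 3, 2, 4, 3]
r2_trait = [3, 2, 4, 2, 3, 2, 4, 3]
print(percent_exact_agreement(r1_trait, r2_trait))   # percent exact agreement

# Hypothetical summed rater scores (0-11) from the same two raters
r1_total = [8, 6, 10, 4, 9, 6, 11, 8]
r2_total = [8, 7, 10, 4, 9, 6, 11, 8]
r, _ = pearsonr(r1_total, r2_total)
print(round(r, 2))                                    # interrater reliability (Pearson r)
```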

State ELA test. The Smarter Balanced Assessment Consortium summative ELA
test served as the state’s accountability assessment. This test is aligned with the
Common Core State Standards (Common Core State Standards Initiative, 2010)
and is administered annually in the spring to students in Grades 3 to 8 and
Grade 11 in 15 U.S. states, the U.S. Virgin Islands, and the Bureau of Indian
Education.
The Smarter Balanced ELA test is a computer-based assessment that evalu-
ates the Common Core reading standards for literacy and informational text;
writing standards related to organization/purpose, evidence/elaboration, and
conventions; and listening and research standards. In total, the Smarter
Balanced ELA test is 3.5 hours in duration, though times vary by grade level
and local context. Sixth graders respond to 13 to 17 selected-response items
assessing reading standards, 10 selected-response items and 1 performance
task assessing writing standards, 8 to 9 selected-response items assessing listen-
ing standards, and 6 selected-response items and 2 to 3 constructed-response
items assessing research standards. Thus, writing skills are essential for success
on the Smarter Balanced ELA test, but the test assesses diverse literacy skills.
To assess writing proficiency, Smarter Balanced first requires students to read
source materials and answer two constructed-response questions and one
selected-response question about those materials. Students then use those
source materials when planning, drafting, and revising an essay in response to
a prompt in a subsequent performance task. Overall, these writing tasks require
about 2 hours to complete.
Performance on the Smarter Balanced ELA test is summarized via a scale
score, representing students’ performance across all test components (i.e., read-
ing, writing, listening, and research). The ELA scale score is vertically scaled
across grade levels, although grade-specific cutpoints are used to demarcate four
proficiency levels. For Grade 6, these cutpoints include Level 1: Novice
(<2,457), Level 2: Developing (range = 2,457–2,530), Level 3: Proficient
(range = 2,531–2,617), and Level 4: Advanced (>2,617).
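
A minimal helper expressing the Grade 6 cutpoints listed above (the function itself is ours, added for illustration):

```python
# Maps a Grade 6 Smarter Balanced ELA scale score to its proficiency level,
# using the cutpoints listed in the text. Illustrative helper, not an official API.
def grade6_ela_proficiency(scale_score):
    if scale_score < 2457:
        return "Level 1: Novice"
    if scale_score <= 2530:
        return "Level 2: Developing"
    if scale_score <= 2617:
        return "Level 3: Proficient"
    return "Level 4: Advanced"

print(grade6_ela_proficiency(2576))  # Level 3: Proficient
```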
Because this test is used for accountability purposes by a consortium of U.S.
states, the Smarter Balanced ELA test has undergone a rigorous validation
process, ensuring that it yields reliable scores and valid score inferences.
Technical documentation is available at http://www.smarterbalanced.org/assessments/development/.

Writing self-efficacy. The 22-item Self-Efficacy for Writing Scale (SEWS; Bruning,
Dempsey, Kaufmann, McKim, & Zumbrunn, 2013) was used to measure stu-
dents' writing self-efficacy. Students read a series of statements (e.g., "I can
quickly think of the perfect word") and rated their confidence in performing
that action on a continuous scale from 0 (no chance) to 100 (complete certainty).
Items assessed students' confidence related to conventions (five items; e.g., "I can
spell my words correctly"), idea generation (six items; e.g., "I can think of many
ideas for my writing"), and self-regulation (nine items; e.g., "I can avoid distrac-
tions while I write"). Two additional items asked about students' confidence
composing in different genres ("I can write a good story," and "I can write a
good report"). A final item asked about students' overall writing self-efficacy
("I can do what it takes to be a good writer").
The SEWS was administered at pretest and posttest by trained research assist-
ants. The complete scale (all 22 items) demonstrated excellent reliability at pre-
test (α = .94) and posttest (α = .92).
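
The internal-consistency estimate reported above is Cronbach's alpha; the sketch below shows how it is computed from an item-response matrix. The simulated responses are for demonstration only, not the study's data.

```python
# Cronbach's alpha from an (n_students x n_items) matrix of item responses.
import numpy as np

def cronbach_alpha(responses):
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items (22 for the SEWS)
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated 0-100 confidence ratings with a shared person factor (illustration only)
rng = np.random.default_rng(0)
fake = np.clip(rng.normal(70, 15, size=(100, 22)) + rng.normal(0, 10, size=(100, 1)), 0, 100)
print(round(cronbach_alpha(fake), 2))
```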

Social validity survey. To evaluate teachers’ perceptions of AWE’s social validity,
all three participating teachers completed a survey with several sections. The first
section contained seven items targeting different writing skills: spelling, capital-
ization and punctuation, grammar and sentence structure, vocabulary and word
choice, organization, ideas and elaboration, and style. Teachers were asked to
indicate whether they gave more feedback to students on a particular writing
skill in PEG or Google Docs, or whether there was no substantial difference.
A complete list of these items is presented in Table 5.
The second section asked teachers to compare PEG and Google Docs with
respect to the following traits: (a) usability (four items; e.g., "Which system was
easier for you to use?"), (b) effectiveness (three items; e.g., "Which system pro-
moted greater student writing quality?"), and (c) desirability (two items; e.g.,
"Which system would you like to use in the future?"). Specifically, for each trait,
teachers were asked to indicate which system (i.e., PEG vs. Google Docs) was
superior or if the two systems were equivalent. A complete list of these items is
presented in Table 6.
The final section asked teachers to evaluate 19 statements about PEG using a
5-point Likert scale, ranging from 1 (Strongly Agree) to 5 (Strongly Disagree).
These items specifically probed teachers perceptions of PEG’s usability (eight
items), effectiveness (nine items), and desirability (two items). A complete list of
these items is presented in Table 7.

Procedures
The study occurred in three phases over the course of about 7 months: training
and pretesting, classroom instruction and writing, and posttesting.

Training and pretesting phase. All participating teachers used both PEG and
Google Docs (in different classroom sections). Teachers had no prior exposure
to PEG, so the first author provided training on how to use PEG via a single 45-
minute session during a shared planning period. This training demonstrated how
to (a) manage course rosters, (b) assign pregenerated writing prompts, (c) create
and assign teacher-made prompts, (d) attach stimulus material to teacher-cre-
ated prompts, (e) review student writing and add supplemental feedback via
embedded or summary comments, and (f) review student data to monitor
their progress.2 Teachers were already highly familiar with Google Docs (i.e.,
2+ years of experience); thus, no training was provided for that software.
After teacher training, pretesting took place on 2 consecutive days in October
2015. Students completed the SEWS on Day 1 and the writing prompt (on
‘‘humor and laughter’’) on Day 2; both measures were administered by trained
graduate assistants. To ensure fidelity, testing procedures were audio recorded
and independently reviewed to assess adherence to the protocols. Proctors fol-
lowed established procedures with 100% fidelity across all administrations and
class sections.
After completing the pretest, the first author taught students how to use PEG
during a single ELA class session. Students were taught how to locate an
assigned writing prompt, select and complete an electronic graphic organizer,
draft and submit an essay for evaluation by PEG, review feedback and complete
a revision, access recommended mini lessons, and send and review messages with
the teacher.

Classroom instruction and writing phase. Teachers were instructed to implement their
normal ELA curriculum, with the requirement of at least three writing prompt
assignments between the end of October and March. Aside from the required
writing prompts, teachers were encouraged to assign additional writing
assignments using the respective software if they wished or if required by
their curriculum. Students completed these writing assignments using either
PEG or Google Docs based on the classroom-assigned condition. As a team,
the teachers selected prompts from their district curriculum or the PEG prompt
library.
The first writing assignment was on the topic of Bullying and was completed
in October 2015. This prompt was the Fall performance task in their curriculum.
Students were asked to read two short texts about the topic, review a fact sheet
with data about the frequency of bullying, and watch a short video about
cyberbullying. All of these materials were embedded in PEG and Google
Docs. Students then authored an essay in response to the following prompt:

You have been chosen to write an essay that will be published on the National
Bullying Prevention Awareness website. Write an informative essay about the nega-
tive effects bullying has on people and what can be done to help. Your essay will be
read by other students, teachers, and parents. Make sure to have a main idea,
clearly organize your essay, and support your main idea with evidence from the
sources using your own words. Be sure to develop your ideas clearly.

The second writing assignment was on the topic of The Best Music and was
completed in January 2016. This assignment was not required by the curriculum
and was assigned to meet the study's requirement of three prompts. This assign-
ment contained no stimulus or source materials and was selected from PEG’s
prompt library; students were asked to express a well-supported opinion on the
following prompt topic:

Write a brief persuasive essay detailing what you believe to be the best music. The
short essay should make your reader feel inclined to listen to the particular kind of
music you present in your argument. The response should include details and
support that makes the argument clear and provokes the audience to believe simi-
larly or reconsider their beliefs.

The final writing assignment was on the topic of Class Pets and took place at the
beginning of March 2016. This prompt was the Spring performance task in their
curriculum. Students read two texts expressing contrary positions on the
topic of class pets and viewed a YouTube video on Pets in the Classroom. These
materials were embedded within PEG and Google Docs. Students then wrote an
essay responding to the following prompt:

Your principal is deciding whether or not pets will be allowed in the classroom.
Your task is to convince your school principal either to allow pets in classrooms, or
not allow pets in classrooms. Based on the video and articles that you used for
research, write an argument essay stating and explaining your position on this
issue. Make sure to have a claim and support your claim with clear reasons and
relevant evidence. Be sure to develop your ideas clearly.

Posttesting. Posttests occurred on a single day in mid-April. Trained research
assistants again administered the SEWS and a new writing prompt (on ‘‘never
giving up’’). Fidelity of posttesting procedures was again perfect (i.e., 100%
adherence to protocols).
Later in April, students completed the state-mandated, computer-based
Smarter Balanced ELA test, and teachers completed a web-based social validity
survey.

Data Analysis
Path analysis was conducted to answer RQs 1 to 4. Path analysis is an extension
of multiple regression and a special form of structural equation modeling that
examines relationships among observed rather than latent variables (Tabachnick
& Fidell, 2013). ANOVA tests only direct effects, and analysis of covariance
tests direct effects after adjusting for one or more covariates; only path analysis
is able to test both direct and indirect (i.e., mediation) effects (Tabachnick &
Fidell, 2013). In path analysis, direct
effects are the direct influence of one variable on another, which are calculated as
regression coefficients; indirect effects are the influence of one variable on
another when mediated by a third variable, calculated as the product of two
direct effects. Path analysis also allows for calculating total effects, the sum of
the direct and indirect effects (Bollen, 1989).
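
To make the distinction concrete, the sketch below estimates a simple mediation path (condition to self-efficacy to test score, plus a direct path) with ordinary least squares on simulated data. The study itself fit these paths in Mplus, so this is only an illustration of how direct, indirect, and total effects relate.

```python
# Simplified illustration of direct, indirect, and total effects in a mediation path.
# Simulated data and OLS are used here; the study used Mplus with a robust
# weighted least squares estimator.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 114
condition = rng.integers(0, 2, n)                         # 1 = AWE, 0 = word processing
self_efficacy = 0.3 * condition + rng.normal(size=n)      # mediator
test_score = 0.25 * self_efficacy + 0.1 * condition + rng.normal(size=n)

# Path a: condition -> mediator
a = sm.OLS(self_efficacy, sm.add_constant(condition)).fit().params[1]
# Paths b (mediator -> outcome) and c' (direct effect), estimated jointly
X = sm.add_constant(np.column_stack([self_efficacy, condition]))
b, c_prime = sm.OLS(test_score, X).fit().params[1:3]

indirect = a * b                # indirect effect = product of the two direct paths
total = c_prime + indirect      # total effect = direct + indirect
print(f"direct={c_prime:.3f}, indirect={indirect:.3f}, total={total:.3f}")
```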
In the present study, the main direct effects tested were the effect of condi-
tion (a dummy-coded variable: 1 = PEG; 0 = Google Docs) on writing quality,
writing self-efficacy, and state ELA test performance. To test indirect,
mediation effects, we used an exploratory path analysis in which all paths
between variables were estimated to identify statistically significant relations.
Models were reestimated for parsimony after deleting nonstatistically signifi-
cant paths.
To assess the fit of the path models, we examined the chi-square test,
the standardized root mean residual (SRMR), comparative fit index (CFI;
Bentler, 1990), the Tucker–Lewis index (TLI; Tucker & Lewis, 1973), and
the root mean square error of approximation (RMSEA; Steiger & Lind,
1980). A good-fitting model was one that met one or more of the following
criteria: nonsignificant chi-square values, SRMR values lower than .08 (Brown,
2006), CFI and TLI values exceeding .95, and RMSEA values below .05 (Brown,
2006).
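
The cited thresholds can be collected into a small screening helper; this is merely a convenience expressing the criteria above, including the authors' "one or more criteria" rule, not part of the analysis itself.

```python
# Convenience check of the model-fit criteria cited in the text (Brown, 2006).
def acceptable_fit(chi2_p, srmr, cfi, tli, rmsea):
    checks = [
        chi2_p > .05,   # nonsignificant chi-square
        srmr < .08,     # SRMR below .08
        cfi > .95,      # CFI exceeding .95
        tli > .95,      # TLI exceeding .95
        rmsea < .05,    # RMSEA below .05
    ]
    return any(checks)  # one or more criteria met

print(acceptable_fit(chi2_p=.520, srmr=.09, cfi=1.00, tli=1.00, rmsea=0.00))  # True
```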
Path analysis was conducted using Mplus v.8.0 (Muthén & Muthén,
1998–2017). Mplus uses a robust weighted least squares estimator for path
analysis models that include a combination of continuous and categorical depend-
ent variables, as was the case in the current study.

Handling of Missing Data


Due to student absence or relocation, missing data ranged from 3% to 6% at
pretest and from 2% to 11% at posttest, depending on the measure. Across the
entire dataset, 7.12% of values were missing; these were treated as missing at
random based on a nonstatistically significant result of Little's missing
completely at random test: χ² = 905.98, df = 1,257,
p = .959. Researchers recommend the use of multiple imputation techniques
(Peugh & Enders, 2004; Wilkinson & Task Force on Statistical Inference,
1999) to address data missing at random. Thus, multiple imputation was con-
ducted in SPSS v.24 using the Markov chain Monte Carlo method. A total
of five multiply-imputed datasets were created. Once generated, the five multi-
ply-imputed datasets were used with Mplus to estimate the path analysis models.
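
The study's imputation step was run in SPSS; for readers working in Python, a roughly comparable sketch using scikit-learn's IterativeImputer (with posterior sampling to generate distinct imputations) is shown below. The function and settings are assumptions, not the study's procedure.

```python
# Hedged sketch: generate five multiply-imputed datasets by drawing from
# IterativeImputer's posterior. This approximates, not reproduces, the SPSS
# Markov chain Monte Carlo imputation used in the study.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables import below)
from sklearn.impute import IterativeImputer

def multiply_impute(df, n_imputations=5):
    datasets = []
    for i in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = imputer.fit_transform(df)            # numeric columns only
        datasets.append(pd.DataFrame(completed, columns=df.columns))
    return datasets

# Each imputed dataset is then analyzed separately and the estimates pooled
# (in the study, the five datasets were passed to Mplus, which pools them).
```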

Results
Pretest Equivalence
Descriptive statistics for the pretest writing prompt and SEWS are presented in
the first two columns in Table 2, and posttest measures appear in columns 3 and
4. All measures were distributed normally, as indicated by z scores for skewness
and kurtosis ≤ |3.29|, corresponding to an alpha of .001 (Tabachnick & Fidell,
2013). Measures were weakly to moderately correlated within and across pretest
and posttest administrations (see Table 3).
To verify the equivalence of the randomly assigned groups, we conducted a series
of ANOVAs on the pretest measures. There were no statistically significant differences
for any measure: PEG score, F(1, 112) = 0.06, p = .807; rater score, F(1, 112) = 0.20,
p = .656; or SEWS score, F(1, 112) = 0.24, p = .625.
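
For completeness, the sketch below shows how such a two-group equivalence check can be run with SciPy on simulated pretest scores; the values are not the study's data.

```python
# Two-group one-way ANOVA analogous to the pretest equivalence checks above,
# on simulated pretest scores (f_oneway returns the same F and p as ANOVA).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
peg_pretest = rng.normal(10.2, 3.1, 56)    # hypothetical PEG-condition pretest scores
docs_pretest = rng.normal(10.1, 2.6, 58)   # hypothetical Google Docs pretest scores

F, p = f_oneway(peg_pretest, docs_pretest)
print(f"F(1, 112) = {F:.2f}, p = {p:.3f}")  # nonsignificant -> groups comparable at pretest
```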

Path Analysis
Path analyses were conducted to understand the direct and indirect relationships
among the covariates (pretest measures), the independent variable (composing
condition), and the dependent variables (student outcomes). Four path models
were estimated based on the correlation matrix presented in Table 3. The models
were structured temporally, organized from left to right to signify
the passage of time (i.e., pretest, treatment, posttest, and state
test). The first two models include the PEG score as the measure of writing
quality, and the latter two models include rater score as the writing quality
measure. In each model, the ‘composing condition’ variable represents the

Table 2. Descriptive Statistics.

                             Pretest                            Posttest
                             PEG            Google Docs         PEG               Google Docs
                             M (SD)         M (SD)              M (SD)            M (SD)

Writing quality
  PEG score                  10.20 (3.09)   10.08 (2.63)        10.29 (2.90)      10.27 (2.82)
  Rater score                 5.96 (2.27)    5.77 (2.42)         5.88 (2.34)       6.01 (2.15)
Writing self-efficacy        69.58 (19.18)  68.55 (16.93)       73.30 (15.42)     68.21 (14.11)
Smarter Balanced ELA score   –              –                   2,575.66 (69.12)  2,550.89 (67.97)

Note. PEG = Project Essay Grade; ELA = English language arts. PEG: n = 56. Google Docs: n = 58.
PEG score: range = 6–30. Rater score: range = 0–11. Average self-efficacy (range = 0–100) is the
average of all 22 items on the Self-Efficacy for Writing Scale (Bruning et al., 2013). Smarter Balanced
ELA score: range = <2,457 to >2,617.

Table 3. Correlations Among Pretest and Posttest Study Measures.

                           1.     2.     3.     4.     5.     6.     7.
1. PEG score pretest       –
2. Rater score pretest     .57**  –
3. Self-efficacy pretest   .31**  .44**  –
4. PEG score posttest      .34**  .38**  .29**  –
5. Rater score posttest    .27**  .29**  .25*   .62**  –
6. Self-efficacy posttest  .19*   .29**  .58**  .30**  .18    –
7. State ELA test score    .25**  .41**  .33**  .39**  .35**  .37**  –

Note. PEG = Project Essay Grade; ELA = English language arts. N = 114. PEG score: range = 6–30.
Rater score: range = 0–11. Average self-efficacy (range = 0–100) is the average of all 22 items on the
Self-Efficacy for Writing Scale (Bruning et al., 2013). The State ELA test score is the scale score on the
Smarter Balanced ELA test (range = <2,457 to >2,617).
*p < .05. **p < .01.

effect of being assigned to the PEG condition versus the Google Docs condition
on the outcomes of interest.

Path analysis featuring the PEG score as a measure of writing quality. The first model
(see Figure 1), called a saturated model (see Kenny, 2011), included all possible
relationships between pretest writing quality (i.e., pretest PEG score), pretest
writing self-efficacy, composing condition (dummy coded as 1 = PEG and
0 = Google Docs), posttest writing quality (i.e., posttest PEG score), posttest

Figure 1. Saturated path model predicting study outcomes from pretest writing self-efficacy,
pretest PEG writing quality, and composing condition. The saturated model represents
direct and indirect paths between all predictors and all outcomes. The curved arrow repre-
sents a correlation. Straight arrows represent regression paths. Residuals for study out-
comes are indicated by single arrows. Composing condition = dummy variable (1 = PEG
Writing, 0 = Google Docs). Standardized coefficients are presented. Bold lines represent
statistically significant paths at p < .05. Dashed lines represent nonstatistically significant
paths at p ≤ .10.
ELA = English language arts; PEG = Project Essay Grade.

writing self-efficacy, and Smarter Balanced ELA score. This model enabled iden-
tifying the paths that were or were not statistically significant. Model 1 was not
used to test hypotheses; thus, estimates of model fit are not reported.
Model 1 indicates that pretest measures were not related to (i.e., did not differ
across) composing condition, and composing condition had no effect on posttest
writing quality: Students in both PEG and Google Docs conditions wrote
equally well at posttest. Indeed, examination of the descriptive statistics (see
Table 2) shows that not only did the groups not differ at posttest, but neither
group’s writing quality improved from pretest to posttest.
However, composing condition did have direct and small effects on posttest
writing self-efficacy (β = .16) and state test performance (β = .15). Thus, the use
of PEG was associated with small positive effects on these outcomes. In add-
ition, Model 1 indicates that state test performance was also predicted by
posttest writing self-efficacy (β = .17) and posttest writing quality (β = .28).
Pretest writing self-efficacy and pretest writing quality were correlated (r = .31)
but did not directly predict performance on the state test.

Figure 2. Parsimonious path model predicting study outcomes from pretest writing self-
efficacy, pretest PEG writing quality, and composing condition. The curved arrow represents
a correlation. Straight arrows represent regression paths. Residuals for study outcomes
are indicated by single arrows. Composing condition = dummy variable (1 = PEG Writing,
0 = Google Docs). Standardized coefficients are presented. Bold lines represent statistically
significant paths at p < .05. Dashed lines represent nonstatistically significant paths at
p ≤ .10.
ELA = English language arts; PEG = Project Essay Grade.

Based on the results of the saturated model, a parsimonious path analysis
model (Model 2) was estimated including only those paths that were statistically
significant in the saturated model. Standardized parameter estimates are pre-
sented in Figure 2. Fit of this model was excellent: χ² = 6.17, df = 7, p = .520,
RMSEA = 0.00, 90% CI [0.00, 0.11], CFI = 1.00, TLI = 1.00, SRMR = 0.09.
Model 2 explained 15% of the variance in posttest writing quality as measured
by the PEG score, 36% of the variance in posttest writing self-efficacy, and 22%
of the variance in state test performance. A decomposition of the direct, indirect,
and total effects of the predictors on study outcomes in this model is presented in
Table 4.
Unlike Model 1, the direct effect of composing condition on Smarter
Balanced ELA performance in Model 2 was not statistically significant
(β = .14, SE = .08, t = 1.65, p = .100). Thus, the effect of composing with PEG on
state ELA test performance was not direct. Instead, PEG had a statistically sig-
nificant and positive indirect effect on state ELA test performance via its direct
effect on writing self-efficacy at posttest. Students in the PEG condition may
Table 4. Direct, Indirect, and Total Effects of the Path Analysis Models.

                                     Posttest writing             Posttest writing             Smarter Balanced
                                     self-efficacy                quality                      ELA score
Predictor                            Direct  Indirect Total       Direct  Indirect Total       Direct  Indirect Total

Path analysis including the PEG score
  Pretest writing self-efficacy      0.57*** 0.00     0.57***     0.21*   0.00     0.21*       0.00    0.21***  0.21***
  Pretest PEG score                  0.00    0.00     0.00        0.28**  0.00     0.28**      0.00    0.09*    0.09*
  Composing condition                0.16*   0.00     0.16*       0.00    0.00     0.00        0.04†   0.14†    0.18*
  Posttest writing self-efficacy     –       –        –           –       –        –           0.25**  0.00     0.25**
  Posttest PEG score                 –       –        –           –       –        –           0.32*** 0.00     0.32***
Path analysis including the rater score
  Pretest writing self-efficacy      0.57*** 0.00     0.57**      0.00    0.00     0.00        0.00    0.13*    0.13*
  Pretest rater score                0.00    0.00     0.00        0.29**  0.00     0.29**      0.26**  0.07*    0.33***
  Composing condition                0.16*   0.00     0.16**      0.00    0.00     0.00        0.04†   0.14†    0.17*
  Posttest writing self-efficacy     –       –        –           –       –        –           0.22**  0.00     0.22**
  Posttest rater score               –       –        –           –       –        –           0.24*   0.00     0.24*

Note. PEG = Project Essay Grade; ELA = English language arts. N = 114. Exogenous variables are predictor
variables. Standardized coefficients are presented.
†p ≤ .10. *p ≤ .05. **p ≤ .01. ***p ≤ .001.

Figure 3. Saturated path model predicting study outcomes from pretest writing self-
efficacy, pretest human-scored writing quality, and composing condition. The saturated
model represents direct and indirect paths between all predictors and all outcomes.
The curved arrow represents a correlation. Straight arrows represent regression paths.
Residuals for study outcomes are indicated by single arrows. Composing condition = dummy
variable (1 = PEG Writing, 0 = Google Docs). Standardized coefficients are presented. Bold
lines represent statistically significant paths at p < .05. Dashed lines represent nonstatistically
significant paths at p ≤ .10.
ELA = English language arts.

have performed better on the Smarter Balanced ELA test as a result of increases
in writing self-efficacy.
In sum, as summarized in Figure 2, the parsimonious path analysis that
included the PEG score indicated that students’ prior writing quality and self-
efficacy (which were correlated) were persistent and influential throughout the
school year. Pretest self-efficacy predicted posttest self-efficacy and posttest writ-
ing quality, and pretest writing quality predicted posttest writing quality. Both
self-efficacy and writing quality had direct effects on state test performance, but
students who used PEG experienced greater gains in posttest self-efficacy and in
doing so may have experienced a performance advantage on the state ELA test.
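Written out, one plausible set of regression equations consistent with the parsimonious structure just described (a sketch using standardized variables, with residuals e1–e3; the exact specification estimated by the authors may differ) is:

```latex
\begin{aligned}
\text{SE}_{\text{post}} &= \beta_{1}\,\text{SE}_{\text{pre}} + \beta_{2}\,\text{Condition} + e_{1},\\
\text{Quality}_{\text{post}} &= \beta_{3}\,\text{SE}_{\text{pre}} + \beta_{4}\,\text{Quality}_{\text{pre}} + e_{2},\\
\text{ELA} &= \beta_{5}\,\text{SE}_{\text{post}} + \beta_{6}\,\text{Quality}_{\text{post}} + \beta_{7}\,\text{Condition} + e_{3}.
\end{aligned}
```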

Path analysis featuring the rater score as a measure of writing quality. The same ana-
lytical approach used to generate Models 1 and 2 using the PEG score was
replicated using the rater score as a measure of writing quality. Model 3 (see
Figure 3) thus presents a saturated model of all possible paths, and Model 4 (see
Figure 4) represents a parsimonious model of those data.

Figure 4. Parsimonious path model predicting study outcomes from pretest writing self-efficacy, pretest human-scored writing quality, and composing condition. The curved arrow represents a correlation. Straight arrows represent regression paths. Residuals for study outcomes are indicated by single arrows. Composing condition = dummy variable (1 = PEG Writing, 0 = Google Docs). Standardized coefficients are presented. Bold lines represent statistically significant paths at p < .05. Dashed lines represent nonstatistically significant paths at p ≤ .10. ELA = English language arts.

Model 3 corroborates prior results that pretest measures were unrelated to (i.e.,
did not differ across) composing condition. Similarly, composing condition had
no effect on posttest writing quality (rater score). Indeed, there were no measur-
able gains in the rater score from pretest to posttest for either group. However,
composing condition did have a direct and small positive effect on posttest writing
self-efficacy (β = .16): Students who used PEG exhibited greater writing self-
efficacy at posttest compared with students in the Google Docs condition.
Finally, state test performance was predicted by composing condition (β = .14)
along with posttest writing self-efficacy (β = .20) and posttest writing quality
(β = .23), all in the positive direction, indicating that greater writing
self-efficacy and writing quality at posttest were associated with greater perform-
ance on the state ELA test. Pretest writing self-efficacy did not directly predict
performance on the state test, but pretest writing quality did (β = .25).
A parsimonious Model 4 was then generated including only those paths that
were statistically significant in Model 3. Standardized parameter estimates are
presented in Figure 4. Fit of the parsimonious model was excellent: χ² = 3.35,
df = 7, p = .851, RMSEA = 0.00, 90% CI [0.00, 0.06], CFI = 1.00, TLI = 1.00,
SRMR = 0.08. Model 4 explained 9% of the variance in posttest writing quality
as measured by the rater score, 36% of the variance in posttest writing self-
efficacy, and 29% of the variance in state test performance. A decomposition of
the direct, indirect, and total effects in this model is presented in Table 4.
As summarized in Figure 4, the direct effect of composing condition on state
test performance was no longer statistically significant in Model 4 (β = .14,
p = .086). Statistically significant paths included relationships between pretest
and posttest self-efficacy, composing condition and posttest self-efficacy, posttest
self-efficacy and state test performance, and posttest writing quality and state
test performance.
In sum, as with the parsimonious model that included the PEG score, the
path analysis that included the rater score indicated that students’ prior writing
quality and self-efficacy (which were correlated) were persistent and influential
throughout the school year. Pretest self-efficacy predicted posttest self-efficacy,
and pretest writing quality predicted posttest writing quality and state test per-
formance. In turn, and as expected based on previous research, both of these
factors influenced performance on a standardized state test of ELA proficiency.
Skilled and confident writers do better on ELA assessments.
However, results show that using different forms of writing technology in the
classroom—PEG AWE software or Google Docs—also influenced these out-
comes. While neither technology demonstrated an advantage for writing quality
alone, PEG appeared to directly facilitate more positive self-efficacy beliefs and
state test performance. Moreover, PEG also had a positive indirect effect on
state test scores via its direct influence on posttest self-efficacy.
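Although the path models were fit with dedicated structural equation modeling software (the reference list cites the Mplus user's guide), the mediation logic can be illustrated with a minimal regression-based sketch. The snippet below is not the authors' analysis: the data file and column names (condition, pre_se, post_se, ela_score) are hypothetical placeholders, and the product-of-coefficients value is only a rough stand-in for the model-based indirect effect reported in Table 4.

```python
# Minimal sketch of the condition -> posttest self-efficacy -> state ELA score
# mediation pathway using ordinary least squares (hypothetical column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")  # hypothetical file, one row per student

# Path a: composing condition (1 = PEG, 0 = Google Docs) predicting posttest
# self-efficacy, controlling for pretest self-efficacy.
a_model = smf.ols("post_se ~ condition + pre_se", data=df).fit()

# Paths b and c': posttest self-efficacy and condition predicting the state
# ELA score, controlling for the pretest measure.
b_model = smf.ols("ela_score ~ post_se + condition + pre_se", data=df).fit()

a = a_model.params["condition"]        # condition -> posttest self-efficacy
b = b_model.params["post_se"]          # posttest self-efficacy -> ELA score
c_prime = b_model.params["condition"]  # remaining direct effect of condition

indirect = a * b
print(f"direct = {c_prime:.3f}, indirect = {indirect:.3f}, "
      f"total = {c_prime + indirect:.3f}")
```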

Social Validity
Direct measures of student performance and self-efficacy are important out-
comes for evaluating the effectiveness of AWE. However, another important
metric is how classroom teachers perceive the efficacy, utility, and value of
such technologies. These perceptions influence how teachers adopt, use, and
persist with a technology—tools that are perceived as "useless" or "wastes of
time" will likely be abandoned.
A social validity survey probed teachers’ perceptions of the usability, effect-
iveness, and desirability of the AWE system after students completed the postt-
est. Importantly, teachers were unaware of the results of the posttesting when
they completed the survey. The first two sections of the survey asked teachers to
directly compare PEG and Google Docs; the final section asked teachers to
solely evaluate PEG.

Comparisons of PEG and Google Docs: Feedback. In this study, all three teachers used
both PEG and Google Docs and were thus positioned to directly compare how
the systems influenced the teachers’ own feedback practices. A potential benefit

Table 5. Teachers' Comparisons of PEG and Google Docs: Teacher Feedback Demands.

Item | Teacher 1 | Teacher 2 | Teacher 3
I gave more feedback on spelling to students who wrote their essays with . . . | ND | Google Docs | ND
I gave more feedback on capitalization and punctuation to students who wrote their essays with . . . | ND | Google Docs | ND
I gave more feedback on grammar and sentence structure to students who wrote their essays with . . . | ND | Google Docs | ND
I gave more feedback on vocabulary and word choice to students who wrote their essays with . . . | ND | ND | ND
I gave more feedback on organization to students who wrote their essays with . . . | ND | ND | ND
I gave more feedback on ideas and elaboration to students who wrote their essays with . . . | ND | PEG | ND
I gave more feedback on style to students who wrote their essays with . . . | ND | PEG | ND

Note. PEG = Project Essay Grade; ND = no difference.

and dimension of AWE's efficacy is that it allows teachers to provide more
feedback on higher level writing skills because the AWE system is more adept
at addressing lower level writing skills (Deane, 2013; Grimes & Warschauer,
2010; Wilson & Czik, 2016).
However, as seen in Table 5, such a division of labor appeared to occur for
only one of the three teachers. Teacher 2 reported providing more feedback on
lower level writing skills (spelling, capitalization and punctuation, grammar and
sentence structure) to students who used Google Docs and slightly more feed-
back on higher level writing skills (ideas and elaboration, style) to students who
used PEG; she indicated no difference between the systems with respect to
providing feedback on vocabulary and word choice and organization. In con-
trast, Teachers 1 and 3 reported that there was no difference in the amounts of
feedback they gave students who used PEG or Google Docs for any dimension
of writing ability. Thus, the software students used did not appear to influence
the feedback practices of Teachers 1 and 3 but did influence those of Teacher 2
in ways that are consistent with prior research.

Comparisons of PEG and Google Docs: Usability, effectiveness, and desirability. In the
second section of the survey, teachers compared the two systems with respect
to their usability, effectiveness, and desirability. As summarized in Table 6,

Table 6. Teachers' Comparisons of PEG and Google Docs: Usability, Effectiveness, and Desirability.

Item | Teacher 1 | Teacher 2 | Teacher 3
Usability
Which system was easier for you to use? | PEG | PEG | PEG
Which system was easier for your students to use? | PEG | PEG | PEG
Which system was more efficient for you to use? | PEG | PEG | PEG
Which system enabled you to give more feedback on content and ideas? | PEG | ND | PEG
Effectiveness
Which system was more motivating for students to use? | PEG | PEG | PEG
Which system promoted greater student independence? | PEG | PEG | PEG
Which system promoted greater student writing quality? | PEG | PEG | PEG
Desirability
Which system would you like to use in the future? | PEG | PEG | PEG
Which system do you think students would like to use in the future? | PEG | PEG | PEG

Note. PEG = Project Essay Grade; ND = no difference.

Teachers 1 and 3, who previously did not report a distinction between the sys-
tems with respect to feedback practices, consistently reported PEG to be easier
to use, more effective, and more desirable than Google Docs. Interestingly, in
this section of the survey, both of these teachers reported that PEG enabled
them to provide more feedback on content and ideas despite reporting "no
difference" in the first section of the survey that asked about feedback practices.
Teacher 2 also submitted a very positive evaluation of PEG, reporting that
PEG was superior to Google Docs for all surveyed aspects with the exception of
one item: "Which system enabled you to give more feedback on content and
ideas?" As with Teachers 1 and 3, Teacher 2's response to this question
contradicted her responses in the first section of the survey. In the first section,
Teacher 2 reported that she gave more feedback on idea development and elab-
oration to students who used PEG, but here, she reported that there was no
difference between the systems. Given that all three teachers' responses to this question
contradicted their prior responses, it is possible that teachers were considering
different things in each section of the survey. In the first section, they may have
been considering characteristics of the specific essays that they read, the prompts
they assigned, and the specific difficulties their students had with the assignment.
In the second section, they may have been considering the capabilities of the
systems more abstractly. Although this is speculation, it offers a plausible
explanation as to why their answers differed across survey sections.

Teachers' evaluations of PEG's usability, effectiveness, and desirability. Table 7 reports
results of each teacher's evaluation of PEG's usability, effectiveness, and desir-
ability. In general, the three teachers held very positive perceptions of PEG’s
usability. Teachers agreed or strongly agreed that PEG was easy for them and
their students to use, the trait feedback provided by PEG was appropriate, and
their students took advantage of the feedback received from PEG. The teachers
agreed less strongly about whether PEG helped them teach more, grade less,
or differentiate their writing instruction and whether their students took advan-
tage of the spelling and grammar feedback they received from PEG. There
was only one instance where a teacher reported a rating less than neutral:
Teacher 2 disagreed that her students had sufficient keyboarding skills to benefit
from PEG.
The teachers agreed that PEG was effective, but their level of agreement was
more tepid. Teachers 1 and 3 recorded neutral responses to many of the items,
such as whether PEG improved their students’ writing skills, their students’
keyboarding skills, or the amount of writing and revising students did. They
both felt neutral about whether PEG helped them address their students’ writing
needs. However, all three teachers agreed or strongly agreed that English lan-
guage learners and students with disabilities would benefit from PEG and that
students receive more feedback when using PEG, which speaks to the benefits of
AWE for struggling learners.
PEG’s desirability was evident in teachers’ responses. All three teachers
would recommend PEG to other teachers, and Teachers 1 and 2 agreed that
they would like to continue to use PEG next year. Teacher 3 was neutral about
that statement, which may reflect her neutral perceptions of PEG’s effectiveness.
Indeed, despite presenting a very positive evaluation of PEG’s usability, Teacher
3’s evaluation of PEG’s effectiveness was neutral, which may have tempered her
desire to use PEG again. Curiously, her neutral evaluation of PEG’s effective-
ness did not appear to be related to her desire to recommend PEG to others; she
reported "strongly agree" for that item.
In sum, teachers' responses to all three sections of the social validity survey
indicate that teachers were aware of PEG's capabilities and limitations and still
appraised PEG very positively in terms of its usability, effectiveness, and
desirability. This finding is important because the teachers used PEG for the
majority of the school year, suggesting that their perceptions were not based
simply on novelty—unlikely to persist until April of the school year—but rather
on informed experience.

Discussion
In the United States, increasing attention has been paid to identifying whether
AWE systems (Stevenson, 2016) are effective tools for improving the teaching
and learning of writing. The efficiency and reliability of AWE systems, and their

Table 7. Teacher Evaluations of PEG: Usability, Effectiveness, and Desirability.

Item | Teacher 1 | Teacher 2 | Teacher 3
Usability
PEG was easy for me to use. | SA | A | SA
PEG was easy for my students to use. | SA | A | SA
Using PEG allowed me to teach more and grade less. | N | A | A
I receive sufficient information from PEG to differentiate my writing instruction. | A | A | N
The trait feedback provided by PEG is appropriate. | SA | A | SA
Students take advantage of the spelling and grammar feedback they receive from PEG. | SA | N | SA
Students take advantage of the trait feedback they receive from PEG. | SA | A | A
Students had sufficient keyboarding skills to benefit from PEG. | A | D | SA
Effectiveness
Using PEG improved my students' writing motivation. | A | A | N
Using PEG improved my students' writing skills. | N | A | N
English language learners benefit from PEG. | A | A | SA
Students with disabilities benefit from PEG. | A | A | SA
Using PEG improved my students' keyboarding. | N | A | N
Using PEG increased the amount of writing my students did. | N | A | N
Using PEG increased the amount of revising my students did. | N | A | N
Students receive more feedback when using PEG. | SA | A | SA
PEG helped me address my students' writing needs. | N | A | N
Desirability
I would like to continue using PEG next year. | A | A | N
I would recommend PEG to other teachers. | A | A | SA

Note. PEG = Project Essay Grade; SA = strongly agree; A = agree; N = neutral; D = disagree; SD = strongly disagree.

potential to reduce teachers' grading load and expedite the practice-feedback
loop (Kellogg et al., 2010), have led many districts, schools, and teachers to
implement such systems. However, implementation has largely outpaced
research, particularly with respect to evaluating AWE efficacy (Shermis et al.,
2016; Xi, 2010).
The current study was intended to contribute to the growing body of research
on the efficacy of AWE for helping improve students’ writing outcomes. We
examined efficacy in a multifaceted manner using measures of writing quality,
writing self-efficacy, state test performance, and teachers’ perceptions of soft-
ware usability, effectiveness, and desirability.
After accounting for students’ pretest writing quality and writing self-efficacy,
composing with PEG was associated with more positive writing self-efficacy and
higher scores on the state ELA test. Furthermore, the direct gains in self-efficacy
also indirectly influenced performance on the state test: Posttest self-efficacy
partially mediated the effect of composing condition on state test performance.
Composing with PEG was not, however, associated with better performance on
the experimenter-administered posttest writing prompt. Finally, teachers shared
positive evaluations of PEG’s usability, effectiveness, and desirability.
Prior studies have documented associations between AWE and students’
writing motivation (Grimes & Warschauer, 2010; Warschauer & Grimes,
2008; Wilson & Czik, 2016), but the present study advanced existing knowledge
by documenting statistically significant pretest-to-posttest gains using a valid and psycho-
metrically sound self-efficacy measure. The present study was also the first to
document a positive effect, albeit an indirect one, on state ELA test
performance.
In sum, across three of the four metrics—self-efficacy, state test performance,
and social validity—AWE was shown to be effective at supporting writing out-
comes and writing instruction. Findings extend prior research by providing a
broader analysis of AWE efficacy, comprehensively examining efficacy across
multiple student-level outcomes as well as teachers’ perceptions of AWE’s
usability, effectiveness, and desirability. Findings also have implications for
AWE implementation and future research on the efficacy of AWE systems.

Considering Writing Quality Outcomes in Light of Other Metrics


While there was no statistically significant difference between composing condi-
tions for posttest writing quality, it is important to note that neither condition
made gains from pretest to posttest in their writing quality. Reasons for this may
relate to the degree of alignment between the pretest/posttest writing assessment
and the manner in which students practiced writing and interacted with the
AWE system throughout the year.
Given that most experimental or quasi-experimental studies of AWE efficacy
have examined changes in lower level writing skills, such as grammar, usage, and
mechanics (e.g., Kellogg et al., 2010) or examined changes in overall writing
quality within but not across essays (e.g., Wilson, 2017; Wilson & Andrada,
2016; Wilson et al., 2014), we were interested in examining whether an extended
AWE implementation (i.e., across the majority of a school year) would translate
into gains in overall writing quality. Thus, we assessed writing quality holistic-
ally to document whether there were changes in students’ overall writing skill as
measured by a single-draft-only timed on-demand writing assessment. Holistic
quality is intended to capture global, rather than specific, gains in students’
writing skills. Because writing is a complex, multicomponent ability, improve-
ments in holistic writing quality are dependent on students’ developing multiple
low-level skills (e.g., spelling, capitalization, sentence structure and grammar)
and high-level skills (e.g., techniques for organization and elaboration), which is
best accomplished when students engage in substantial amounts of writing prac-
tice (Graham et al., 2012, 2016; Kellogg & Whiteford, 2009; Monk, 2016).
Yet, despite having access to the AWE software (and Google Docs) for the
entire 7 months of the study, students did not complete any other writing assign-
ments that required multiple cycles of drafting, revising, and editing aside from
the three assignments required by the study. Further, students’ opportunities for
writing and engaging with PEG were sporadic rather than sustained and
distributed over time. Students experienced three episodes of massed practice:
three intense periods of writing activity followed by long periods of writing
inactivity. Massed practice is useful for supporting initial skill acquisition but
less effective at supporting retention and generalization (Archer & Hughes,
2011). In contrast, long-term retention and transfer benefit from sustaining
practice over time and gradually increasing the level of challenge (i.e., distrib-
uted practice; Archer & Hughes, 2011).
Thus, it is possible that the three practice opportunities were insufficient for
supporting generalized improvements in sixth graders’ writing proficiency as
assessed using a single-draft-only timed on-demand writing prompt. Indeed, a
post hoc repeated measures ANOVA showed that neither group improved pretest
to posttest as measured by two different holistic scores applied to first-draft-only
on-demand writing samples: The main effect of time was not statistically signifi-
cant, F(1, 113) = 0.21, p = .648. The lack of transfer effects is consistent with prior
research on PEG by Wilson et al. (2014), Wilson and Andrada (2016), and Wilson
(2017) and with results of a study of Criterion by Shermis, Wilson Garvan, and
Diao (2008) that showed negligible gains in sixth-grade students’ overall essay
quality scores until after the completion of six prompts within a school year.
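As a rough illustration of the post hoc pre/post comparison described above (a simplified, hypothetical sketch rather than the authors' exact analysis, which also involved two rating systems and the between-groups factor), a within-subjects ANOVA on holistic scores in long format could be run as follows; the file and column names are placeholders.

```python
# Simplified sketch of a repeated measures ANOVA on holistic writing quality,
# with one within-subjects factor (time: pre vs. post). Column names are
# hypothetical placeholders.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

long_df = pd.read_csv("holistic_scores_long.csv")  # hypothetical long-format file

# A nonsignificant main effect of time would indicate no detectable
# pretest-to-posttest gain in holistic quality.
result = AnovaRM(data=long_df, depvar="quality",
                 subject="student_id", within=["time"]).fit()
print(result.anova_table)
```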
Therefore, because (a) holistic writing quality assesses global improvements in
writing ability, (b) students require substantial practice to demonstrate such global
improvements, and (c) students experienced minimal writing practice marked by
sporadic rather than sustained activity, it follows that (d) there was likely a mis-
alignment between the manner in which we assessed writing quality and the
manner in which writing was practiced. Consequently, there were no detectable
effects of composing condition nor a detectable maturation effect when assessing
students using a single-draft-only timed on-demand writing prompt scored holis-
tically. Had we used a more fine-grained outcome measure, or had the teachers’
curriculum involved greater amounts of writing practice, there would have been
greater alignment and perhaps results would have differed.
Even the trait scores that comprised the PEG score and the rater score may
have been too broad to capture the kinds of nuanced changes that may have
occurred. Post hoc analyses of the individual trait scores comprising the holistic
quality measures—the six traits for the PEG score and three traits for the rater
score—confirmed results of the prior analyses. When considering each rating
system separately (PEG and rater), there was no effect of composing condition
on any trait, nor was there a main effect of time nor a time-by-trait interaction.
More fine-grained measures may be required.
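To make the notion of more fine-grained measures concrete, the sketch below computes a few surface-level features of an essay that could be tracked across drafts; these simple indices are illustrative only and are not the NLP features composited by PEG.

```python
# Illustrative surface features of an essay (not PEG's actual feature set).
import re

def surface_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        "mean_sentence_length": n_words / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(n_words, 1),
        "mean_word_length": sum(len(w) for w in words) / max(n_words, 1),
    }

print(surface_features("The dog ran. The dog ran fast and far!"))
```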
This concept of (mis)alignment between methods of assessment and practice
might explain why there were no statistically significant effects of composing
condition on researcher-administered writing prompts but an effect on the state
ELA assessment. The writing required for the former solely involved writing a
rough draft in response to a prompt within a 30-minute window. The writing
required for the state test involved drafting, reviewing, revising, and editing an
essay in response to a prompt within a much longer time frame (>60 minutes).
Considering the nature of the AWE intervention, in which students used feed-
back to facilitate revising, the exclusion of a period of revising from the
researcher-administered prompts may have resulted in misalignment. Likewise,
the inclusion of revising in the state test may have resulted in better alignment
with the nature of the AWE intervention. This improved alignment, together
with the direct effects of AWE on self-efficacy and the relationship between
self-efficacy and writing performance (Pajares, 2003; Pajares & Johnson, 1996;
Perin, Lauterbach, Raufman, & Kalamkarian, 2017), including performance on
state literacy assessments (Bruning et al., 2013), may explain why there were no
effects of composing condition on the researcher-administered writing prompt
but there were effects on the state test.
In sum, the alignment between the assessment procedure and the nature of
students’ writing practice/interaction with AWE might account for why PEG
was associated with positive effects on self-efficacy and superior performance on
the state test and was unanimously viewed by teachers as promoting greater
student writing quality (see Table 6), yet not associated with superior perform-
ance on the posttest writing quality measure.

Implications of Study Findings for AWE Implementation


Results of the present study suggest that when implementation of AWE is infre-
quent or embedded within a curricular context that de-emphasizes writing instruc-
tion and writing practice, AWE is not associated with gains in generalized writing
performance. This is not to say that AWE is ineffective. In fact, repeated exposure
to automated scores and feedback, even via infrequent massed practice, appears to
improve students’ self-efficacy with respect to key writing skills and thereby
improve students' state ELA test performance beyond that of students who
compose with tools, such as Google Docs, that lack automated feedback.
AWE is effective, but its effectiveness is dependent on the instructional con-
text. In the present study, the overall number of practice opportunities (three
total essays) was too low, but students who used the AWE system did complete,
on average, a total of six drafts per essay (SD = 1.7) and effectively improved
their writing quality across drafts: The average gain between first and final drafts
across the three essays was 1.13 points (SD = 0.54). Furthermore, students who
completed a greater number of revisions per essay had higher scores on average
on the state ELA test (r = .353, p < .001). However, AWE did not create add-
itional writing practice opportunities outside of those that were part of the cur-
riculum (Prompts 1 and 3) or required by the research study (Prompt 2). AWE
affords space for teachers and students to practice writing, particularly revising,
but it does not change the curriculum. The pressure to keep pace with a cur-
riculum is not something that AWE will change, nor will AWE cause an ELA
curriculum to prioritize writing.
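The revision-count association reported above is a simple bivariate correlation; a minimal sketch of that computation, assuming hypothetical column names for the number of revisions and the state test score, might look like this.

```python
# Minimal sketch of the correlation between revisions per essay and state ELA
# scores (hypothetical column names).
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("peg_condition_students.csv")  # hypothetical file
r, p = pearsonr(df["n_revisions"], df["ela_score"])
print(f"r = {r:.3f}, p = {p:.3f}")
```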
Unfortunately, research shows that most U.S. ELA curricula prioritize reading
over writing, primarily using writing activities to support reading outcomes such
as comprehension. As with the teachers in the present study, research suggests that
teachers in the United States rarely assign writing that requires multiple stages of
drafting and revising, and consequently students lack the practice needed to
develop writing skills, knowledge, and strategies to a level of proficiency
(Gilbert & Graham, 2010; Graham, Capizzi, Harris, Hebert, & Morphy, 2013).
If AWE is used simply as a method of delivering the existing curriculum, it will
be only partially effective. However, to maximize the potential efficacy of AWE, it may
be necessary to identify ways for teachers to implement AWE in a manner that
extends or enriches the curriculum.

Limitations
Our interest in examining improvements in overall writing quality on students’
first drafts led us to select a writing prompt for the pretest and posttest that did
not involve any revising. In contrast, the Smarter Balanced ELA test required
students to engage in the full writing process: planning, drafting, revising, and
editing. This discrepancy might explain why there was no effect of compos-
ing condition on posttest writing quality but an effect on state test performance,
which underscores the importance of evaluating efficacy of AWE with multiple
metrics. Future research that uses measures of writing quality should employ
writing prompts that require students to engage in drafting as well as revising
and editing, and should use multiple broad and fine-grained measures of writing quality.

We also elected to survey teachers' perceptions of PEG's social validity
because we considered the teacher, rather than the student, as the agent
who would make decisions about AWE adoption and implementation.
Nevertheless, we acknowledge the importance of students’ perceptions of
the usability, effectiveness, and desirability of AWE (see Roscoe et al., 2017).
Future research should collect this information not only from teachers but also
from students. In addition, future research should expand on the social validity
survey used in the current study to include open-ended items, focus groups,
or individual interviews to more deeply probe perceptions of the social validity
of AWE.

Directions for Future Research


The present study suggests that researchers interested in assessing the efficacy of
AWE systems should pay careful attention to the alignment between the assess-
ment method and the way that students experience AWE. Fine-grained measures
are needed if AWE implementation is sporadic and less intense. Research might
explore changes in the NLP features that are composited into the trait
scores. Alternatively, more intensive AWE implementations may be needed
before changes in overall writing ability can be documented on a first-draft,
holistically scored writing assessment. Finally, future research should pay par-
ticular attention to issues of AWE implementation. One potential avenue of
future research would be to create professional development modules that
train teachers how to use the capabilities of AWE to implement evidence-
based writing instruction practices. Such training would move beyond a technical
focus—that is, "How do I use the features of this program?"—to a focus on
technological–pedagogical–content knowledge (Mishra & Koehler, 2006) and
help teachers answer the question, "How do I use the features of this program
to implement evidence-based writing instruction?" This form of professional
development may assist teachers in using AWE as a tool to transform and
extend, rather than simply deliver, the ELA curriculum.

Declaration of Conflicting Interests


The authors declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.

Funding
The authors disclosed receipt of the following financial support for the research, author-
ship, and/or publication of this article: This research was supported in part by a
Delegated Authority contract from Measurement Incorporated to the University of
Delaware (EDUC432914160001). The opinions expressed in this article are those of the
authors and do not necessarily reflect the positions or policies of this agency, and no
official endorsement by it should be inferred.

Notes
1. It is worth noting that innovations in NLP tools and techniques are ongoing and thus
pushing the boundaries of automated assessment of writing. For example, automated
methods have shown promise for evaluating more complicated, stylistic elements of
writing quality, such as humor (Skalicky, Berger, Crossley, & McNamara, 2016).
2. Although PEG supports peer review, teachers indicated that they did not use peer
review in their instruction. Therefore, teachers were not trained to use this function.

ORCID iD
Joshua Wilson http://orcid.org/0000-0002-7192-3510

References
Archer, A. L., & Hughes, C. A. (2011). Explicit instruction: Effective and efficient teach-
ing. New York, NY: Guilford Press.
Balfour, S. P. (2013). Assessing writing in MOOCs: Automated essay scoring and
Calibrated Peer Review™. Research & Practice in Assessment, 8, 40–48.
Ballock, E., McQuitty, V., & McNary, S. (2017). An exploration of professional know-
ledge needed for reading and responding to student writing. Journal of Teacher
Education, 69, 1–13. doi:10.1177/0022487117702576
Behizadeh, N., & Pang, M. E. (2016). Awaiting a new wave: The status of state writing
assessment in the United States. Assessing Writing, 29, 25–41.
Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of auto-
mated scoring to construct-irrelevant response strategies (CIRS): An illustration.
Assessing Writing, 22, 48–59.
Bennett, R. E. (2004). Moving the field forward some thoughts on validity and automated
scoring. Research Memorandum (RM-04-01). Princeton, NJ: Educational Testing
Service.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological
Bulletin, 107, 238–246.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: John
Wiley & Sons, Inc.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY:
Guilford Press.
Bruning, R., Dempsey, M., Kauffman, D. F., McKim, C., & Zumbrunn, S. (2013).
Examining dimensions of self-efficacy for writing. Journal of Educational
Psychology, 105, 25–38.
Bruning, R. H., & Kauffman, D. F. (2016). Self-efficacy beliefs and motivation in writing
development. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of
writing research (2nd ed., pp. 160–173). New York, NY: Guilford Press.
Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems.
In Y. Rosen, S. Ferrara & M. Mosharraf (Eds.), Handbook of research on technology
tools for real-world skill development (pp. 611–626). Hershey, PA: IGI Global.
Chapelle, C. A., Cotos, E., & Lee, J. (2015). Diagnostic assessment with automated
writing evaluation: A look at validity arguments for new classroom assessments.
Language Testing, 32(3), 385–405.

Cheville, J. (2004). Automated scoring technologies and the rising influence of error.
English Journal, 93(4), 47–52.
Common Core State Standards Initiative. (2010). Common core state standards for
English language arts & literacy in history/social studies, science, and technical subjects.
Retrieved from http://www.corestandards.org/assets/CCSSI_ELA%20Standards.pdf
Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated
scoring of essays: Fishing for red herrings? Assessing Writing, 18, 100–108.
Conference on College Composition and Communication. (2004). CCCC position state-
ment on teaching, learning and assessing writing in digital environments. Retrieved from
http://www.ncte.org/cccc/resources/positions/digitalenvironments
Cotos, E. (2015). AWE for writing pedagogy: From healthy tension to tangible prospects.
Special issue on Assessment for Writing and Pedagogy. Writing and Pedagogy, 7(2–3),
197–231.
Davidson, L. Y. J., Richardson, M., & Jones, D. (2014). Teachers’ perspective on using
technology as an instructional tool. Research in Higher Education, 24, 1–25.
Deane, P. (2013). On the relation between automated essay scoring and modern views of
the writing construct. Assessing Writing, 18, 7–24.
Dikli, S., & Bleyle, S. (2014). Automated essay scoring feedback for second language
writers: How does it compare to instructor feedback? Assessing Writing, 22, 1–17.
Dujinhower, H., Prins, F. J., & Stokking, K. M. (2010). Progress feedback effects on
students’ writing mastery goal, self-efficacy beliefs, and performance. Educational
Research and Evaluation, 16, 53–74.
Ertmer, P. A. (1999). Addressing first- and second-order barriers to change: Strategies
for technology integration. Educational Technology Research and Development, 47, 47–61.
Foltz, P. W. (2014, May). Improving student writing through automated formative assess-
ment: Practices and results. Paper presented at the International Association for
Educational Assessment (IAEA), Singapore.
Gilbert, J., & Graham, S. (2010). Teaching writing to elementary students in Grades 4-6:
A national survey. Elementary School Journal, 110, 494–518.
Graham, S., Berninger, V., & Fan, W. (2007). The structural relationship between writing
attitude and writing achievement in first and third grade students. Contemporary
Educational Psychology, 32(3), 516–536.
Graham, S., Bruch, J., Fitzgerald, J., Friedrich, L., Furgeson, J., Greene, K., . . . Smither
Wulsin, C. (2016). Teaching secondary students to write effectively. Washington, DC:
National Center for Education Evaluation and Regional Assistance, Institute of
Education Sciences, U.S. Department of Education (NCEE 2017-4002).
Graham, S., Capizzi, A., Harris, K. R., Hebert, M., & Morphy, P. (2013). Teaching writing
to middle school students: A national survey. Reading and Writing, 27, 1015–1042.
Graham, S., Hebert, M., & Harris, K. R. (2015). Formative assessment and writing: A
meta-analysis. Elementary School Journal, 115, 523–547.
Graham, S., McKeown, D., Kiuhara, S., & Harris, K. R. (2012). A meta-analysis of
writing instruction for students in the elementary grades. Journal of Educational
Psychology, 104, 879–896.
Graham, S., & Perin, D. (2007). Writing next: Effective strategies to improve writing of
adolescents in middle and high schools – A report to Carnegie Corporation of New York.
Washington, DC: Alliance for Excellent Education.

Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of
automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6),
1–44. Retrieved from http://www.jtla.org
Hayes, J. R. (2012). Modeling and remodeling writing. Written Communication, 29(3),
369–388.
Hew, K., & Brush, T. (2007). Integrating technology into K-12 teaching and learning:
Current knowledge gaps and recommendations for future research. Educational
Technology Research and Development, 55(3), 223–252.
Human Resources Research Organization. (2016). Independent evaluation of the validity
and reliability of STAAR Grades 3-8 Assessment Scores: Part 1. Retrieved from http://
tea.texas.gov/student.assessment/reports/
Hutchison, A. C., & Woodward, L. (2014). An examination of how a teacher’s use of
digital tools empowers and constrains language arts instruction. Computers in the
Schools, 31, 316–338.
Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis &
J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective
(pp. 147–167). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Kellogg, R. T. (2008). Training writing skills: A cognitive developmental perspective.
Journal of Writing Research, 1(1), 1–26.
Kellogg, R. T., & Raulerson, B. A. (2007). Improving the writing skills of college stu-
dents. Psychonomic Bulletin & Review, 14(2), 237–242.
Kellogg, R. T., & Whiteford, A. P. (2009). Training advanced writing skills: The case for
deliberate practice. Educational Psychologist, 44, 250–266.
Kellogg, R. T., Whiteford, A. P., & Quinlan, T. (2010). Does automated feedback
help students learn to write? Journal of Educational Computing Research, 42(2),
173–196.
Kenny, D. A. (2011, August 15). Path analysis. Retrieved from http://davidakenny.net/
cm/pathanal.htm
Kopcha, T. J. (2012). Teachers’ perceptions of the barriers to technology integration and
practices with technology under situated professional development. Computers &
Education, 59, 1109–1121.
Liu, M. L., Li, Y., Xu, W., & Liu, L. (2016). Automated essay feedback generation and its
impact in the revision. IEEE Transactions on Learning Technologies, 99, 1–13.
doi:10.1109/TLT.2016.2612659
Lyst, A. M., Gabriel, G., O’Shaughnessy, T. E., Meyers, J., & Meyers, B. (2005). Social
validity: Perceptions of check and connect with early literacy support. Journal of
School Psychology, 43, 197–218.
Matsumara, L. C., Patthey-Chavez, G. G., Valdés, R., & Garnier, G. (2002). Teacher
feedback, writing assignment quality, and third-grade students’ revision in lower- and
higher-achieving urban schools. Elementary School Journal, 103, 3–25.
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A
framework for teacher knowledge. Teachers College Record, 108, 1017–1054.
Monk, J. (2016). Revealing the iceberg: Creative writing, process, & deliberate practice.
English in Education, 50, 95–115.
Moore, N. S., & MacArthur, C. A. (2016). Student use of automated essay evaluation
technology during revision. Journal of Writing Research, 8, 149–175.

Morphy, P., & Graham, S. (2012). Word processing programs and weaker writers/read-
ers: A meta-analysis of research findings. Reading and Writing, 25, 641–678.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user's guide (8th ed.). Los Angeles,
CA: Muthén & Muthén.
National Center for Education Statistics. (2012). The Nation’s Report Card: Writing 2011.
Washington, DC: Institute of Education Sciences, U.S. Department of Education
(NCES 2012–470).
National Council of Teachers of English. (2013). NCTE position statement on machine
scoring. Retrieved from http://www.ncte.org/positions/statements/machine_scoring
Okumus, S., Lewis, L., Wiebe, E., & Hollerbrands, K. (2016). Utility and usability as
factors influencing teacher decisions about software integration. Education Technology
Research and Development, 64, 1227–1249.
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis & J. C. Burstein (Eds.),
Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Mahwah, NJ:
Lawrence Erlbaum Associates, Inc.
Pajares, F. (2003). Self-efficacy beliefs, motivation, and achievement in writing: A review
of the literature. Reading & Writing Quarterly, 19, 139–158.
Pajares, F., Johnson, M., & Usher, E. (2007). Sources of writing self-efficacy beliefs of
elementary, middle, and high school students. Research in the Teaching of English, 42,
104–120.
Pajares, F., & Johnson, M. J. (1996). Self-efficacy beliefs and the writing performance of
entering high school students. Psychology in the Schools, 33, 163–175.
Palermo, C., & Thomson, M. M. (2018). Teacher implementation of self-regulated strat-
egy development with an automated writing evaluation system: Effects on the argu-
mentative writing performance of middle school students. Contemporary Educational
Psychology, 54, 255–270.
Perelman, L. (2014). When ‘‘the state of the art’’ is counting words. Assessing Writing, 21,
104–111.
Perin, D., Lauterbach, M., Raufman, J., & Kalamkarian, H. S. (2017). Text-based writing
of low-skilled postsecondary students: Relation to comprehension, self-efficacy and
teacher judgements. Reading and Writing, 30, 887–915.
Persky, H. R., Daane, M. C., & Jin, Y. (2002). The Nation’s Report Card: Writing 2002.
Washington, DC: National Center for Education Statistics, Institute of Education
Sciences. U. S. Department for Education (NCES 2003-529).
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of
reporting practices and suggestions for improvement. Review of Educational Research,
74, 525–556.
Roscoe, R. D., & McNamara, D. S. (2013). Writing Pal: Feasibility of an intelligent
writing strategy tutor in the high school classroom. Journal of Educational
Psychology, 105, 1010–1025.
Roscoe, R. D., Wilson, J., Johnson, A. C., & Mayra, C. R. (2017). Presentation, expect-
ations, and experience: Sources of student perceptions of automated writing evalu-
ation. Computers in Human Behavior, 70, 207–221.
Sanders-Reio, J., Alexander, P. A., Reio, T. G. Jr., & Newman, I. (2014). Do students’
beliefs about writing relate to their writing self-efficacy, apprehension, and perform-
ance? Learning and Instruction, 33, 1–11.

Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and
future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Shermis, M. D., & Burstein, J. C. (Eds.). (2003). Automated essay scoring: A cross-
disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates.
Shermis, M. D., & Burstein, J. C. (Eds.). (2013). Handbook of automated essay evaluation.
New York, NY: Routledge.
Shermis, M. D., Burstein, J. C., & Bliss, L. (2004, April). The impact of automated essay
scoring on high stakes writing assessments. Paper presented at the annual meeting of the
National Council on Measurement in Education, San Diego, CA.
Shermis, M. D., Burstein, J. C., Elliot, N., Miel, S., & Foltz, P. W. (2016). Automated
writing evaluation: An expanding body of knowledge. In C. A. MacArthur, S. Graham,
& J. Fitzgerald (Eds.), Handbook of writing research (2nd ed., pp. 395–409). New York,
NY: Guilford Press.
Shermis, M. D., Koch, C. M., Page, E. B., Keith, T. Z., & Harrington, S. (2002). Trait
ratings for automated essay grading. Educational and Psychological Measurement, 62,
5–18.
Shermis, M. D., Wilson Garvan, C., & Diao, Y. (2008, March). The impact of automated
essay scoring on writing outcomes. Paper presented at the annual meeting of the
National Council of Measurement in Education, New York, NY.
Skalicky, S., Berger, C. M., Crossley, S. A., & McNamara, D. S. (2016). Linguistic features
of humor in academic writing. Advances in Language and Literacy Studies, 7, 248–259.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of
common factors. Paper presented at the annual Spring Meeting of the Psychometric
Society, Iowa City, IA.
Stevenson, M. (2016). A critical interpretative synthesis: The integration of automated writ-
ing evaluation into classroom writing instruction. Computers and Composition, 42, 1–16.
Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the
quality of writing. Assessing Writing, 19, 51–65.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston,
MA: Pearson.
Texas Education Agency. (2014). Technical digest for the academic year 2012–13.
Retrieved from http://tea.texas.gov/student.assessment/reports/
Texas Education Agency. (2015a). Technical digest for the academic year 2013–14.
Retrieved from http://tea.texas.gov/student.assessment/reports/
Texas Education Agency. (2015b). Technical digest for the academic year 2014–15.
Retrieved from http://tea.texas.gov/student.assessment/reports/
Tondeur, J., van Braak, J., Ertmer, P. A., & Ottenbreit-Leftwich, A. (2017).
Understanding the relationship between teachers’ pedagogical beliefs and technology
use in education: A systematic review of qualitative evidence. Education Technology
Research and Development, 65, 555–575.
Troia, G. A., Harbaugh, A. G., Shankland, R. K., Wolbers, K. A., & Lawrence, A. M.
(2013). Relationships between writing motivation, writing activity, and writing per-
formance: Effects of grade, sex, and ability. Reading and Writing, 26, 17–44.
Troia, G. A., Shankland, R. K., & Wolbers, K. A. (2012). Motivation research in writing:
Theoretical and empirical considerations. Reading and Writing Quarterly: Overcoming
Learning Difficulties, 28, 5–28.

Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1–10.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom.
Pedagogies: An International Journal, 3, 22–36.
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the class-
room research agenda. Language Teaching Research, 10(2), 157–180.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psych-
ology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wilson, J. (2017). Associated effects of automated essay evaluation software on growth in
writing quality for students with and without disabilities. Reading and Writing, 30,
691–718.
Wilson, J. (2018). Universal screening with automated essay scoring: Evaluating classifi-
cation accuracy in Grades 3 and 4. Journal of School Psychology, 68, 19–37.
Wilson, J., & Andrada, G. N. (2016). Using automated feedback to improve writing
quality: Opportunities and challenges. In Y. Rosen, S. Ferrara & M. Mosharraf
(Eds.), Handbook of research on technology tools for real-world skill development
(pp. 678–703). Hershey, PA: IGI Global.
Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English language
arts classrooms: Effects on teacher feedback, student motivation, and writing quality.
Computers and Education, 100, 94–109.
Wilson, J., Olinghouse, N. G., & Andrada, G. N. (2014). Does automated feed-
back improve writing quality? Learning Disabilities: A Contemporary Journal, 12,
93–118.
Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied
behavioral analysis is finding its heart. Journal of Applied Behavior Analysis, 11,
203–214.
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we
heading? Language Testing, 27(3), 291–300.

Author Biographies
Joshua Wilson, PhD, is an assistant professor of education at University of
Delaware. His research focuses on how automated scoring and automated feed-
back systems support the teaching and learning of writing. He is particularly
interested in exploring the potential of these systems to improve outcomes for
struggling writers.

Rod D. Roscoe, PhD, is an assistant professor of human systems engineering at
Arizona State University. His research investigates the intersection of learning
science, computer science, and user science in educational technology design,
and how this integration can inform effective and innovative uses of such tech-
nologies. Current work focuses on automated writing evaluation and the impact
on human-centered instruction in engineering education.
