Wilson, Roscoe - 2020 - Automated Writing Evaluation and Feedback: Multiple Metrics of Efficacy
Abstract
The present study extended research on the effectiveness of automated writing
evaluation (AWE) systems. Sixth graders were randomly assigned by classroom to
an AWE condition that used Project Essay Grade Writing (n = 56) or a word-processing
condition that used Google Docs (n = 58). Effectiveness was evaluated using multiple
metrics: writing self-efficacy, holistic writing quality, performance on a state English
language arts test, and teachers’ perceptions of AWE’s social validity. Path analyses
showed that, after controlling for pretest measures, composing condition had no
effect on holistic writing quality, but students in the AWE condition had more positive
writing self-efficacy and better performance on the state English language arts test.
Posttest writing self-efficacy partially mediated the effect of composing condition on
state test performance. Teachers reported positive perceptions of AWE’s social
validity. Results emphasize the importance of using multiple metrics and considering
both contextual factors and AWE implementation methods when evaluating
AWE effectiveness.
Keywords
automated writing evaluation, interactive learning environments, automated
feedback, writing, writing self-efficacy
1 School of Education, University of Delaware, Newark, DE, USA
2 Human Systems Engineering, Arizona State University-Polytechnic, Mesa, AZ, USA
Corresponding Author:
Joshua Wilson, School of Education, University of Delaware, 213E Willard Hall Education Building,
Newark, DE 19716, USA.
Email: joshwils@udel.edu
88 Journal of Educational Computing Research 58(1)
Introduction
In response to an ongoing need for innovative and scalable writing instruction
and assessment (National Center for Education Statistics, 2012; Persky, Daane,
& Jin, 2002), educators and researchers are increasingly looking toward
computer-based approaches such as automated writing evaluation (AWE;
Palermo & Thompson, 2018; Shermis & Burstein, 2013). At their core, most
AWE technologies use (a) natural language processing (NLP) tools to extract
linguistic, syntactic, semantic, or rhetorical features of text related to writing
quality and (b) statistical or machine-learning algorithms to generate scores and
feedback based on patterns observed among those features. Importantly,
modern AWEs typically do more than provide scores and feedback. When
coupled with tutorials or educational games (e.g., Roscoe & McNamara,
2013), learning management features (e.g., Grimes & Warschauer, 2010), or
peer assessment (e.g., Balfour, 2013), AWEs potentially offer flexible, robust,
and time-saving additions to the writing curriculum.
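The two-step pipeline described above (NLP feature extraction followed by statistical scoring) can be sketched in a few lines of Python. This is a deliberately minimal illustration with made-up features and weights, not the actual PEG algorithm, whose feature set and models are far richer:

```python
import re

def extract_features(text):
    """Extract a few shallow text features of the kind AWE systems use.
    (Illustrative only -- real systems use far richer NLP features.)"""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }

def score_essay(features, weights, intercept=1.0):
    """Toy linear scoring model: weighted sum of features plus an intercept.
    In practice, weights are fit to large corpora of human-scored essays."""
    return intercept + sum(weights[k] * v for k, v in features.items())

# Hypothetical weights -- in real systems these are estimated by regression/ML.
weights = {"n_words": 0.01, "n_sentences": 0.05,
           "avg_word_len": 0.4, "type_token_ratio": 1.5}

essay = "Writing is hard. Practice and feedback make writing better."
feats = extract_features(essay)
print(round(score_essay(feats, weights), 2))
```

The same fitted weights are what allow such systems to return a score, and feedback keyed to individual features, within seconds of submission.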
Despite widespread adoption of AWE in English language arts (ELA) courses
across the United States (Grimes & Warschauer, 2010; Stevenson, 2016), there
remains meaningful skepticism regarding the appropriateness of AWE in writing
instruction, perhaps because adoption has largely outpaced the body of supporting
evidence. There are multiple, interrelated dimensions by which AWEs might
be judged (e.g., accuracy, validity, and efficacy), yet prior scholarship has
predominantly privileged evidence of AWE’s scoring accuracy and reliability (e.g.,
Keith, 2003; Shermis, 2014; Shermis & Burstein, 2003, 2013). Evaluations of
AWE’s effectiveness (e.g., Stevenson & Phakiti, 2014)—that is, AWE’s ability
to improve the teaching and learning of writing—are less common despite such
evaluations having the most immediate and practical benefit for educators.
This article contributes to the evidence base of the effectiveness of AWE and
does so by conceptualizing and assessing efficacy in multiple ways. We consider
the effects of classroom AWE use on students’ writing quality after instruction,
students’ writing self-efficacy, students’ performance on standardized state tests,
and teachers’ perceptions of AWE effectiveness and usability. Although prior
research has focused on one or two indicators, the combination of metrics in
the current study offers triangulation on whether and how AWEs may be
effective for improving the teaching and learning of writing in ELA classrooms.
crucial for development of writing skill (see Kellogg, 2008; Kellogg & Raulerson,
2007). Similarly, teacher–system interactions describe how teachers might access
system data to monitor students’ writing progress to plan or modify instruction
(e.g., Grimes & Warschauer, 2010). In some AWEs, learner–teacher interactions
are enabled by features whereby students can send messages to their teacher
asking for help, and teachers can provide supplemental feedback via in-text
and summary comments (e.g., Wilson & Czik, 2016). Finally, some AWEs
enable learner–learner interactions by facilitating peer assessment (e.g.,
Balfour, 2013). Collectively, these interactions enable teachers to enact various
evidence-based practices for writing instruction, including adult-, peer-, and
technology-based feedback (Graham, McKeown, Kiuhara, & Harris, 2012;
Graham & Perin, 2007; Morphy & Graham, 2012), as well as formative
assessment practices pertaining to diagnostic assessment and progress monitoring
(Graham, Hebert, & Harris, 2015).
For instance, one study observed that teachers who incorporated an AWE
system called Project Essay Grade (PEG) Writing gave proportionately more
feedback on higher level writing skills (idea development, organization, and
style) than teachers who did not use the AWE (Wilson & Czik, 2016). Due to
time, lack of training, and other constraints, teachers typically provide insufficient
feedback on higher level writing skills (Ballock, McQuitty, & McNary, 2017;
Matsumara, Patthey-Chavez, Valdes, & Garnier, 2002), despite the importance
of such feedback. Thus, the finding that an AWE can support improved teacher
feedback practices demonstrates the potential value of AWE in ELA classrooms.
Despite these plausible benefits and affordances, the adoption of AWE
remains controversial. Decades of research have resulted in highly reliable and
accurate scoring algorithms (Page, 2003; Shermis, 2014; Shermis & Burstein,
2003)—computer-assigned ratings demonstrate near-perfect agreement with
expert human-assigned ratings. Nonetheless, critics have argued that AWE
undermines the inherently social nature of writing (National Council of
Teachers of English, 2013). Critics also fear that exposing students to automated
feedback will cause their writing to become stilted and predictable (Cheville,
2004; Conference on College Composition and Communication, 2004;
National Council of Teachers of English, 2013). Finally, because AWE relies
on NLP-based detection of textual features, it is inherently limited to what can
be detected via those methods. Consequently, some opponents argue that
automated approaches lack construct validity, wherein the construct of writing
quality is narrowed to a mere measurement of text length, syntax, and vocabulary
(Condon, 2013; Perelman, 2014; see also Bejar, Flor, Futagi, & Ramineni, 2014).
Overall, given that writing involves complex, sociocultural interactions between
authors, texts, and readers, there are valid tensions with respect to the construct
validity of AWE (Cotos, 2015; Deane, 2013).
The aforementioned debates center on questions of whether AWEs can or
should perform the task of assessing writing. However, in practical terms, a more
Method
PEG Writing
The current study implemented PEG Writing (henceforth referred to as PEG)
based on its well-established history, widespread use, and evidence base (see
Palermo & Thompson, 2018; Wilson, 2018; Wilson & Czik, 2016). PEG is an
AWE system based on the PEG scoring system developed by Ellis Page (2003)
and acquired by Measurement Incorporated in 2003. PEG allows teachers to
assign pregenerated writing prompts or to create their own prompts that use
words, documents, images, videos, music, or a combination of these media as
stimuli for students’ writing. The design of this system enables a full range of
interactions between the system, learners, and teachers.
Prompts are assigned in one of three genres: narrative, argumentative, or
informative. PEG includes multiple digital, interactive graphic organizers that
students may use as aids during the prewriting process. After prewriting, PEG
allows up to 1 hour for students to submit their first draft, after which students
minority. Many students (40%) came from low-income families and 100% of the
school population received free lunch.
Three sixth-grade ELA teachers were recruited and consented to participate
in the study. None of the teachers or their students had previous exposure to
AWE software or PEG Writing. As compensation, teachers and students
received free subscriptions to PEG, and teachers received a stipend for assisting
with the informed consent process, study procedures, and obtaining students’
demographic and achievement data. All three teachers had earned a master’s
degree.
After receiving parental consent and student assent, a total of 114 students
participated in the study. These students were instructed in a total of 10 ELA
classroom sections by the three ELA teachers. Following a quasi-experimental
design, students were randomly assigned by classroom section to use either PEG
(n = 56) or Google Docs (n = 58). Specifically, half of each teacher’s sections
were randomly assigned to either condition (five sections per condition).
Instruction was thus relatively constant across condition because each teacher
covered sections within each condition. Although this design was potentially
weaker (i.e., reduced internal validity) than random assignment within
classrooms, the design mitigated issues of resentment or rivalry among students if
one composing tool was seen as more desirable than the other. Also, as noted
earlier (see Shermis et al., 2004), between-class assignment of software
conditions may be more successful for promoting consistent AWE integration than
within-class assignment.
Table 1 summarizes sample demographics. Chi-square tests indicated that
there were no statistically significant differences with respect to gender, race,
or special education status. A one-way analysis of variance (ANOVA) indicated
no difference in the average chronological age (measured in months) of students
in each group, F(1, 112) = 0.09, p = .771.
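As an illustration of the equivalence checks described above, a chi-square test of independence can be run on the race-by-condition counts from Table 1 using SciPy. The statistic below is re-computed from the published counts; the original analysis may have used different software or groupings, so treat it as a sketch rather than a reproduction:

```python
from scipy.stats import chi2_contingency

# Race counts by condition from Table 1 (rows: African American, Hispanic,
# Asian, White; columns: PEG, Google Docs).
race = [[21, 19],
        [11, 18],
        [ 2,  0],
        [22, 21]]

chi2, p, dof, expected = chi2_contingency(race)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```

The resulting p value is well above .05, consistent with the paper's report of no significant demographic differences between conditions.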
Measures
Writing quality. Writing quality was assessed via timed, expository essay prompts
administered prior to and following the experimental intervention. Two prompts
were obtained from public items from the State of Texas Assessment of
Academic Readiness (STAAR) test. STAAR tests were selected to avoid the
possibility that students would have been exposed to the prompts. STAAR
tests are rigorously evaluated for reliability and validity (Human Resources
Research Organization, 2016; Texas Education Agency, 2014, 2015a, 2015b).
For each prompt, students are asked to read a quotation and answer a topical
question. Students are instructed to write an essay and clearly state a controlling
idea, organize and develop their explanation, choose their words carefully, and
use correct spelling, capitalization, punctuation, grammar, and sentences. Due
to local constraints, it was not possible to counterbalance the order of the
Table 1. Sample Demographics by Condition.

                                         PEG             Google Docs
Male                                     20              22
Race
  African American                       21              19
  Hispanic                               11              18
  Asian                                   2               0
  White                                  22              21
Special education                         4               4
Chronological age in months: M (SD)      140.37 (4.57)   140.14 (4.11)

Note. PEG = Project Essay Grade.
Organization and Purpose (scores from 1 to 4) referred to the degree to which
the response had a clear and effective organizational structure, including the use
of a thesis statement, transition words, an introduction and conclusion, and a
logical progression of ideas. Evidence and
Elaboration (scores from 1 to 4) referred to the degree to which the response
included sufficient elaboration and support for the thesis, as well as appropriate
vocabulary and style. Finally, Conventions (scores from 1 to 3) referred to the
degree to which the response demonstrated accurate use of sentence formation,
punctuation, capitalization, grammar, and spelling. A final score—henceforth
referred to as the rater score—was calculated as the sum of the three trait scores
(range = 0–11). Zeros were assigned to essays that were unable to be scored due
to lack of topic match or inadequate development (i.e., the student wrote only
one or two sentences).
Raters were trained by the first author until they reached a criterion of 80%
exact agreement on each trait. Raters then independently double scored all
essays. Interrater agreement, assessed as the percentage of exact agreement,
was strong for all three traits: Organization and Purpose (91.64%), Evidence
and Elaboration (94.43%), and Conventions (93.81%). Percent agreement
for the final rater score was 81.73%, with interrater reliability of r = .98.
Disagreements between raters were resolved by consensus. The rater score was
positively and significantly (p < .01) correlated with the PEG score at pretest
(r = .57) and posttest (r = .62).
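The two agreement statistics used above, percentage of exact agreement and the Pearson correlation between raters, are straightforward to compute. A minimal sketch with hypothetical rater scores (six essays on the study's 0–11 scale):

```python
def exact_agreement(r1, r2):
    """Percentage of essays on which two raters assigned identical scores."""
    return 100 * sum(a == b for a, b in zip(r1, r2)) / len(r1)

def pearson_r(x, y):
    """Pearson correlation between two raters' scores (interrater reliability)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical rater scores for six essays (0-11 scale, as in the study).
rater1 = [5, 7, 3, 9, 6, 4]
rater2 = [5, 7, 4, 9, 6, 4]
print(exact_agreement(rater1, rater2))  # 5 of 6 identical -> 83.33...
print(round(pearson_r(rater1, rater2), 3))
```

Note that exact agreement and correlation capture different things: raters who consistently differ by one point can correlate near 1.0 while agreeing exactly on few essays, which is why the study reports both.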
State ELA test. The Smarter Balanced Assessment Consortium summative ELA
test served as the state’s accountability assessment. This test is aligned with the
Common Core State Standards (Common Core State Standards Initiative, 2010)
and is administered annually in the spring to students in Grades 3 to 8 and
Grade 11 in 15 U.S. states, the U.S. Virgin Islands, and the Bureau of Indian
Education.
The Smarter Balanced ELA test is a computer-based assessment that evalu-
ates the Common Core reading standards for literacy and informational text;
writing standards related to organization/purpose, evidence/elaboration, and
conventions; and listening and research standards. In total, the Smarter
Balanced ELA test is 3.5 hours in duration, though times vary by grade level
and local context. Sixth graders respond to 13 to 17 selected-response items
assessing reading standards, 10 selected-response items and 1 performance
task assessing writing standards, 8 to 9 selected-response items assessing
listening standards, and 6 selected-response items and 2 to 3 constructed-response
items assessing research standards. Thus, writing skills are essential for success
on the Smarter Balanced ELA test, but the test assesses diverse literacy skills.
To assess writing proficiency, Smarter Balanced first requires students to read
source materials and answer two constructed-response questions and one
selected-response question about those materials. Students then use those
source materials when planning, drafting, and revising an essay in response to
Writing self-efficacy. The 22-item Self-Efficacy for Writing Scale (SEWS; Bruning,
Dempsey, Kaufmann, McKim, & Zumbrunn, 2013) was used to measure
students’ writing self-efficacy. Students read a series of statements (e.g., “I can
quickly think of the perfect word”) and rated their confidence in performing
that action on a continuous scale from 0 (no chance) to 100 (complete certainty).
Items assessed students’ confidence related to conventions (five items; e.g., “I can
spell my words correctly”), idea generation (six items; e.g., “I can think of many
ideas for my writing”), and self-regulation (nine items; e.g., “I can avoid
distractions while I write”). Two additional items asked about students’ confidence
composing in different genres (“I can write a good story,” and “I can write a
good report”). A final item asked about students’ overall writing self-efficacy
(“I can do what it takes to be a good writer”).
The SEWS was administered at pretest and posttest by trained research
assistants. The complete scale (all 22 items) demonstrated excellent reliability at
pretest (α = .94) and posttest (α = .92).
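The reliability coefficient reported here is Cronbach's alpha. A minimal sketch of the computation, using hypothetical 0–100 confidence ratings from five students on three items (the real scale has 22 items):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item):
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)  # population variance
    k = len(items)
    totals = [sum(col[i] for col in items) for i in range(len(items[0]))]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

# Hypothetical 0-100 confidence ratings from five students on three items.
item1 = [80, 60, 90, 70, 50]
item2 = [75, 55, 95, 65, 45]
item3 = [85, 50, 90, 60, 55]
print(round(cronbach_alpha([item1, item2, item3]), 2))
```

Alpha rises when items covary strongly relative to their individual variances, which is why values in the .90s, as reported for the SEWS here, indicate that the items hang together as a single scale.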
“Which system would you like to use in the future?”). Specifically, for each trait,
teachers were asked to indicate which system (i.e., PEG vs. Google Docs) was
superior or if the two systems were equivalent. A complete list of these items is
presented in Table 6.
The final section asked teachers to evaluate 19 statements about PEG using a
5-point Likert scale, ranging from 1 (Strongly Agree) to 5 (Strongly Disagree).
These items specifically probed teachers’ perceptions of PEG’s usability (eight
items), effectiveness (nine items), and desirability (two items). A complete list of
these items is presented in Table 7.
Procedures
The study occurred in three phases over the course of about 7 months: training
and pretesting, classroom instruction and writing, and posttesting.
Training and pretesting phase. All participating teachers used both PEG and
Google Docs (in different classroom sections). Teachers had no prior exposure
to PEG, so the first author provided training on how to use PEG via a single
45-minute session during a shared planning period. This training demonstrated how
to (a) manage course rosters, (b) assign pregenerated writing prompts, (c) create
and assign teacher-made prompts, (d) attach stimulus material to teacher-created
prompts, (e) review student writing and add supplemental feedback via
embedded or summary comments, and (f) review student data to monitor
their progress.2 Teachers were already highly familiar with Google Docs (i.e.,
2+ years of experience); thus, no training was provided for that software.
After teacher training, pretesting took place on 2 consecutive days in October
2015. Students completed the SEWS on Day 1 and the writing prompt (on
“humor and laughter”) on Day 2; both measures were administered by trained
graduate assistants. To ensure fidelity, testing procedures were audio recorded
and independently reviewed to assess adherence to the protocols. Proctors
followed established procedures with 100% fidelity across all administrations and
class sections.
After completing the pretest, the first author taught students how to use PEG
during a single ELA class session. Students were taught how to locate an
assigned writing prompt, select and complete an electronic graphic organizer,
draft and submit an essay for evaluation by PEG, review feedback and complete
a revision, access recommended mini lessons, and send and review messages with
the teacher.
Classroom instruction and writing phase. Teachers were instructed to implement their
normal ELA curriculum, with the requirement of at least three writing prompt
assignments between the end of October and March. Aside from the required
writing prompts, teachers were encouraged to assign additional writing
You have been chosen to write an essay that will be published on the National
Bullying Prevention Awareness website. Write an informative essay about the
negative effects bullying has on people and what can be done to help. Your essay will be
read by other students, teachers, and parents. Make sure to have a main idea,
clearly organize your essay, and support your main idea with evidence from the
sources using your own words. Be sure to develop your ideas clearly.
The second writing assignment was on the topic of The Best Music and was
completed in January 2016. This assignment was not required by the curriculum
and was assigned to fulfill the obligation of the study procedures. This
assignment contained no stimulus or source materials and was selected from PEG’s
prompt library; students were asked to express a well-supported opinion on the
following prompt topic:
Write a brief persuasive essay detailing what you believe to be the best music. The
short essay should make your reader feel inclined to listen to the particular kind of
music you present in your argument. The response should include details and
support that makes the argument clear and provokes the audience to believe
similarly or reconsider their beliefs.
The final writing assignment was on the topic of Class Pets and took place at the
beginning of March 2016. This prompt was the spring performance task in their
curriculum. Students read two texts expressing contrary positions on the
topic of class pets and viewed a YouTube video on Pets in the Classroom. These
materials were embedded within PEG and Google Docs. Students then wrote an
essay responding to the following prompt:
Your principal is deciding whether or not pets will be allowed in the classroom.
Your task is to convince your school principal either to allow pets in classrooms, or
not allow pets in classrooms. Based on the video and articles that you used for
research, write an argument essay stating and explaining your position on this
issue. Make sure to have a claim and support your claim with clear reasons and
relevant evidence. Be sure to develop your ideas clearly.
Data Analysis
Path analysis was conducted to answer RQs 1 to 4. Path analysis is an extension
of multiple regression and a special form of structural equation modeling that
examines relationships among observed rather than latent variables (Tabachnick
& Fidell, 2013). Whereas ANOVA tests only direct effects, and analysis of
covariance tests direct effects after adjusting for one or more covariates, only
path analysis can test both direct and indirect (i.e., mediation) effects
(Tabachnick & Fidell, 2013). In path analysis, direct
effects are the direct influence of one variable on another, which are calculated as
regression coefficients; indirect effects are the influence of one variable on
another when mediated by a third variable, calculated as the product of two
direct effects. Path analysis also allows for calculating total effects, the sum of
the direct and indirect effects (Bollen, 1989).
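The arithmetic of this decomposition is simple: an indirect effect is the product of the two direct paths it comprises, and the total effect is the direct effect plus the indirect effect. A sketch with illustrative standardized coefficients (chosen here for demonstration, in the size range this study later reports):

```python
# Direct, indirect, and total effects in a simple mediation path model:
#   condition -> self_efficacy (a), self_efficacy -> state_test (b),
#   condition -> state_test (c', the direct effect).
a = 0.16        # condition -> posttest self-efficacy
b = 0.17        # posttest self-efficacy -> state test
c_prime = 0.15  # condition -> state test (direct)

indirect = a * b            # product of the two direct paths
total = c_prime + indirect  # total effect = direct + indirect

print(f"indirect = {indirect:.3f}, total = {total:.3f}")
```

Because an indirect effect is a product of coefficients that are each below 1 in standardized form, mediated effects are typically smaller than their component paths, which is why partial rather than full mediation is the common finding.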
In the present study, the main direct effects tested were the effect of
condition (a dummy-coded variable: 1 = PEG; 0 = Google Docs) on writing quality,
writing self-efficacy, and state ELA test performance. To test indirect,
mediation effects, we used an exploratory path analysis in which all paths
between variables were estimated to identify statistically significant relations.
Models were reestimated for parsimony after deleting nonstatistically
significant paths.
To assess the fit of the path models, we examined the chi-square test,
the standardized root mean residual (SRMR), comparative fit index (CFI;
Bentler, 1990), the Tucker–Lewis index (TLI; Tucker & Lewis, 1973), and
the root mean square error of approximation (RMSEA; Steiger & Lind,
1980). A good-fitting model was one that displayed one or more of the following
criteria: nonsignificant chi-square values, SRMR values lower than .08 (Brown,
2006), CFI and TLI values exceeding .95, and RMSEA values below .05 (Brown,
2006).
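These cutoffs can be expressed as a small checklist function. The statistics passed in below are the fit values this study later reports for its parsimonious model, except the SRMR value, which is an assumed placeholder for illustration:

```python
def good_fit(chi2_p, srmr, cfi, tli, rmsea):
    """Check each fit criterion used in the study (Brown, 2006):
    nonsignificant chi-square, SRMR < .08, CFI/TLI > .95, RMSEA < .05."""
    return {
        "chi2_nonsig": chi2_p > .05,
        "srmr": srmr < .08,   # Brown (2006) cutoff
        "cfi": cfi > .95,
        "tli": tli > .95,
        "rmsea": rmsea < .05,
    }

# chi-square p, CFI, TLI, and RMSEA as reported for the parsimonious model;
# srmr = .03 is a hypothetical placeholder.
print(good_fit(chi2_p=.851, srmr=.03, cfi=1.00, tli=1.00, rmsea=0.00))
```

Evaluating the criteria jointly, rather than relying on a single index, guards against any one statistic's known sensitivities (e.g., the chi-square test's sensitivity to sample size).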
Path analysis was conducted using Mplus v.8.0 (Muthén & Muthén,
1998–2017). Mplus uses a robust weighted least squares estimator for path
Results
Pretest Equivalence
Descriptive statistics for the pretest writing prompt and SEWS are presented in
the first two columns in Table 2, and posttest measures appear in columns 3 and
4. All measures were distributed normally, as indicated by z scores for skewness
and kurtosis with absolute values below 3.29, corresponding to an alpha of .001
(Tabachnick & Fidell, 2013). Measures were weakly to moderately correlated within and across pretest
and posttest administrations (see Table 3).
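The normality screen used here divides each distribution's skewness and kurtosis by its standard error to obtain a z score, which is then compared with 3.29 (the critical value at alpha = .001). A sketch of that computation on a small hypothetical sample (formulas for the standard errors follow the common large-sample approximations; statistical packages may use slightly different small-sample adjustments):

```python
import math

def skew_kurtosis_z(xs):
    """z scores for skewness and kurtosis (statistic / its standard error),
    the normality screen described by Tabachnick and Fidell (2013)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3  # excess kurtosis
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

# A roughly symmetric hypothetical sample: both |z| values should fall
# below 3.29 (alpha = .001), passing the normality screen.
scores = [4, 5, 5, 6, 6, 6, 7, 7, 8, 6, 5, 7]
z_skew, z_kurt = skew_kurtosis_z(scores)
print(abs(z_skew) < 3.29 and abs(z_kurt) < 3.29)
```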
To verify that random assignment produced equivalent groups, we conducted a
series of ANOVAs on the pretest measures. There were no statistically significant
differences for any measure: PEG score, F(1, 112) = 0.06, p = .807; rater score,
F(1, 112) = 0.20, p = .656; or SEWS score, F(1, 112) = 0.24, p = .625.
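A one-way ANOVA of this kind can be run with SciPy's `f_oneway`. The sketch below uses small hypothetical samples of pretest SEWS scores (the study's actual test used the full n = 114 sample); with two groups, the ANOVA F test is equivalent to a two-sample t test:

```python
from scipy.stats import f_oneway

# Hypothetical pretest SEWS scores for a handful of students per condition;
# the study ran the same test on the full n = 114 sample.
peg   = [72, 65, 80, 61, 70, 68]
gdocs = [70, 66, 75, 64, 69, 71]

f_stat, p = f_oneway(peg, gdocs)
print(f"F(1, {len(peg) + len(gdocs) - 2}) = {f_stat:.2f}, p = {p:.3f}")
```

A nonsignificant F here, as in the study, indicates that the group means do not differ more than chance would predict, supporting pretest equivalence.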
Path Analysis
Path analyses were conducted to understand the direct and indirect relationships
among the covariates (pretest measures), the independent variable (composing
condition), and the dependent variables (student outcomes). Four path models
were estimated based on the correlation matrix presented in Table 3. The models
were structured temporally, with the model organized from left to right to
signify the linear passage through time (i.e., pretest, treatment, posttest, and state
test). The first two models include the PEG score as the measure of writing
quality, and the latter two models include rater score as the writing quality
measure. In each model, the ‘composing condition’ variable represents the
Table 2. Descriptive Statistics by Condition: M (SD).

                         Pretest                          Posttest
                         PEG             Google Docs      PEG               Google Docs
Writing quality
  PEG score              10.20 (3.09)    10.08 (2.63)     10.29 (2.90)      10.27 (2.82)
  Rater score             5.96 (2.27)     5.77 (2.42)      5.88 (2.34)       6.01 (2.15)
Writing self-efficacy    69.58 (19.18)   68.55 (16.93)    73.30 (15.42)     68.21 (14.11)
Smarter Balanced         –               –                2,575.66 (69.12)  2,550.89 (67.97)
  ELA score

Note. PEG = Project Essay Grade; ELA = English language arts. PEG: n = 56. Google Docs: n = 58. PEG
score: range = 6–30. Rater score: range = 0–11. Average self-efficacy (range = 0–100) is the average of all
22 items on the Self-Efficacy for Writing Scale (Bruning et al., 2013). Smarter Balanced ELA score:
range = <2,457 to >2,617.
effect of being assigned to the PEG condition versus the Google Docs condition
on the outcomes of interest.
Path analysis featuring the PEG score as a measure of writing quality. The first model
(see Figure 1), called a saturated model (see Kenny, 2011), included all possible
relationships between pretest writing quality (i.e., pretest PEG score), pretest
writing self-efficacy, composing condition (dummy coded as 1 = PEG and
0 = Google Docs), posttest writing quality (i.e., posttest PEG score), posttest
Figure 1. Saturated path model predicting study outcomes from pretest writing self-efficacy,
pretest PEG writing quality, and composing condition. The saturated model represents
direct and indirect paths between all predictors and all outcomes. The curved arrow represents
a correlation. Straight arrows represent regression paths. Residuals for study outcomes
are indicated by single arrows. Composing condition = dummy variable (1 = PEG
Writing, 0 = Google Docs). Standardized coefficients are presented. Bold lines represent
statistically significant paths at p < .05. Dashed lines represent nonstatistically significant
paths at p > .10.
ELA = English language arts; PEG = Project Essay Grade.
writing self-efficacy, and Smarter Balanced ELA score. This model enabled
identifying the paths that were or were not statistically significant. Model 1 was not
used to test hypotheses; thus, estimates of model fit are not reported.
Model 1 indicates that pretest measures were not related to (i.e., did not differ
across) composing condition, and composing condition had no effect on posttest
writing quality: Students in both PEG and Google Docs conditions wrote
equally well at posttest. Indeed, examination of the descriptive statistics (see
Table 2) shows that not only did the groups not differ at posttest, but neither
group’s writing quality improved from pretest to posttest.
However, composing condition did have direct and small effects on posttest
writing self-efficacy (β = .16) and state test performance (β = .15). Thus, the use
of PEG was associated with small positive effects on these outcomes. In
addition, Model 1 indicates that state test performance was also predicted by
posttest writing self-efficacy (β = .17) and posttest writing quality (β = .28).
Pretest writing self-efficacy and pretest writing quality were correlated (r = .31)
but did not directly predict performance on the state test.
Figure 2. Parsimonious path model predicting study outcomes from pretest writing self-efficacy,
pretest PEG writing quality, and composing condition. The curved arrow represents
a correlation. Straight arrows represent regression paths. Residuals for study outcomes
are indicated by single arrows. Composing condition = dummy variable (1 = PEG Writing,
0 = Google Docs). Standardized coefficients are presented. Bold lines represent statistically
significant paths at p < .05. Dashed lines represent nonstatistically significant paths at
p > .10.
ELA = English language arts; PEG = Project Essay Grade.
Figure 3. Saturated path model predicting study outcomes from pretest writing self-efficacy,
pretest human-scored writing quality, and composing condition. The saturated
model represents direct and indirect paths between all predictors and all outcomes.
The curved arrow represents a correlation. Straight arrows represent regression paths.
Residuals for study outcomes are indicated by single arrows. Composing condition = dummy
variable (1 = PEG Writing, 0 = Google Docs). Standardized coefficients are presented. Bold
lines represent statistically significant paths at p < .05. Dashed lines represent nonstatistically
significant paths at p > .10.
ELA = English language arts.
have performed better on the Smarter Balanced ELA test as a result of increases
in writing self-efficacy.
In sum, as summarized in Figure 2, the parsimonious path analysis that
included the PEG score indicated that students’ prior writing quality and self-
efficacy (which were correlated) were persistent and influential throughout the
school year. Pretest self-efficacy predicted posttest self-efficacy and posttest
writing quality, and pretest writing quality predicted posttest writing quality. Both
self-efficacy and writing quality had direct effects on state test performance, but
students who used PEG experienced greater gains in posttest self-efficacy and in
doing so may have experienced a performance advantage on the state ELA test.
Path analysis featuring the rater score as a measure of writing quality. The same
analytical approach used to generate Models 1 and 2 using the PEG score was
replicated using the rater score as a measure of writing quality. Model 3 (see
Figure 3) thus presents a saturated model of all possible paths, and Model 4 (see
Figure 4) represents a parsimonious model of those data.
Figure 4. Parsimonious path model predicting study outcomes from pretest writing self-efficacy,
pretest human-scored writing quality, and composing condition. The curved arrow
represents a correlation. Straight arrows represent regression paths. Residuals for study
outcomes are indicated by single arrows. Composing condition = dummy variable (1 = PEG
Writing, 0 = Google Docs). Standardized coefficients are presented. Bold lines represent
statistically significant paths at p < .05. Dashed lines represent nonstatistically significant
paths at p > .10.
ELA = English language arts.
Model 3 corroborates prior results that pretest measures were unrelated to (i.e.,
did not differ across) composing condition. Similarly, composing condition had
no effect on posttest writing quality (rater score). Indeed, there were no
measurable gains in the rater score from pretest to posttest for either group. However,
composing condition did have a direct and small positive effect on posttest writing
self-efficacy ( ¼ .16): Students who used PEG exhibited greater writing self-
efficacy at posttest compared with students in the Google Docs condition.
Third, state test performance was predicted by composing condition ( ¼ .14)
along with posttest writing self-efficacy ( ¼ .20) and posttest writing quality
( ¼ .23), all of which were in the positive direction indicating that greater writing
self-efficacy and writing quality at posttest was associated with greater perform-
ance on the state ELA test. Pretest writing self-efficacy did not directly predict
performance on the state test, but pretest writing quality did ( ¼ .25).
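Partial mediation of the kind described here (composing condition → posttest self-efficacy → state test score) can be estimated as a pair of least-squares regressions: the a-path (condition predicting the mediator), the b-path and direct effect c' (outcome regressed on condition and mediator jointly), with the indirect effect a·b. The sketch below is a minimal pure-Python illustration on synthetic data; the sample, the 0.3/0.4 effect sizes, and the variable names are invented for the example, and the study itself fit Mplus path models, not this procedure.

```python
import random

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y.
    Each row carries a leading 1.0 for the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve the k x k system by Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, k))) / xtx[r][r]
    return beta

random.seed(42)
n = 1000
# Dummy-coded composing condition: 1 = AWE, 0 = word processing (as in Figure 4).
condition = [random.randint(0, 1) for _ in range(n)]
# Synthetic mediator and outcome; the 0.3/0.4 effects are invented for illustration.
self_efficacy = [0.3 * x + random.gauss(0, 1) for x in condition]
state_test = [0.3 * x + 0.4 * m + random.gauss(0, 1)
              for x, m in zip(condition, self_efficacy)]

# a-path: condition -> posttest self-efficacy
a = ols([[1.0, x] for x in condition], self_efficacy)[1]
# b-path and direct effect c': outcome on condition and mediator jointly
_, c_prime, b_path = ols([[1.0, x, m] for x, m in zip(condition, self_efficacy)],
                         state_test)
indirect = a * b_path  # the mediated (indirect) effect of condition
print(f"a = {a:.2f}, b = {b_path:.2f}, c' = {c_prime:.2f}, indirect = {indirect:.2f}")
```

A nonzero indirect effect alongside a nonzero c' is the "partial mediation" pattern the study reports; a full analysis would also bootstrap a confidence interval for a·b.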
A parsimonious Model 4 was then generated including only those paths that
were statistically significant in Model 3. Standardized parameter estimates are
presented in Figure 4. Fit of the parsimonious model was excellent: χ²(7) = 3.35,
p = .851; RMSEA = 0.00, 90% CI [0.00, 0.06]; CFI = 1.00; TLI = 1.00.
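These fit statistics can be reproduced from a model's χ² and a baseline (null-model) χ² using the standard formulas, though conventions differ slightly across packages (e.g., N versus N − 1 in the RMSEA denominator). In the sketch below, the model values are the reported ones (χ² = 3.35 on df = 7) and the sample size is the study's combined n = 114, but the baseline χ² and its df are hypothetical, chosen only to show that a model χ² below its df drives RMSEA to its floor and CFI/TLI to their ceilings.

```python
from math import sqrt

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """Standard close-fit indices from model and baseline (null) chi-squares.
    RMSEA here uses the common (n - 1) denominator; software conventions vary."""
    rmsea = sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
    cfi = 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)
    return rmsea, cfi, min(tli, 1.0)  # TLI is typically truncated at 1.0 when reported

# Reported model fit: chi2 = 3.35 on df = 7. The baseline chi2/df below are
# hypothetical (the paper does not report them), chosen only for illustration.
rmsea, cfi, tli = fit_indices(3.35, 7, 150.0, 13, 114)
print(f"RMSEA = {rmsea:.2f}, CFI = {cfi:.2f}, TLI = {tli:.2f}")
```

Because the model χ² (3.35) is below its degrees of freedom (7), RMSEA is exactly 0 and CFI/TLI are at their maxima, matching the "excellent fit" reported above regardless of the baseline values assumed.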
Social Validity
Direct measures of student performance and self-efficacy are important out-
comes for evaluating the effectiveness of AWE. However, another important
metric is how classroom teachers perceive the efficacy, utility, and value of
such technologies. These perceptions influence how teachers adopt, use, and
persist with a technology—tools that are perceived as "useless" or "wastes of
time" will likely be abandoned.
A social validity survey probed teachers’ perceptions of the usability, effect-
iveness, and desirability of the AWE system after students completed the postt-
est. Importantly, teachers were unaware of the results of the posttesting when
they completed the survey. The first two sections of the survey asked teachers to
directly compare PEG and Google Docs; the final section asked teachers to
solely evaluate PEG.
Comparisons of PEG and Google Docs: Feedback. In this study, all three teachers used
both PEG and Google Docs and were thus positioned to directly compare how
the systems influenced the teachers' own feedback practices (see Table 5).
Table 5. Teachers' Comparisons of PEG and Google Docs: Teacher Feedback Demands.
[Table body not preserved in this extraction.]
Comparisons of PEG and Google Docs: Usability, effectiveness, and desirability. In the
second section of the survey, teachers compared the two systems with respect
to their usability, effectiveness, and desirability. As summarized in Table 6,
Table 6. Teachers’ Comparisons of PEG and Google Docs: Usability, Effectiveness, and
Desirability.
Usability
Which system was easier for you to use? PEG PEG PEG
Which system was easier for your students to use? PEG PEG PEG
Which system was more efficient for you to use? PEG PEG PEG
Which system enabled you to give more feedback PEG ND PEG
on content and ideas?
Effectiveness
Which system was more motivating for students to use? PEG PEG PEG
Which system promoted greater student independence? PEG PEG PEG
Which system promoted greater student writing quality? PEG PEG PEG
Desirability
Which system would you like to use in the future? PEG PEG PEG
Which system do you think students would like to PEG PEG PEG
use in the future?
Note. PEG ¼ Project Essay Grade; ND ¼ no difference.
Teachers 1 and 3, who previously did not report a distinction between the systems
with respect to feedback practices, consistently reported PEG to be easier
to use, more effective, and more desirable than Google Docs. Interestingly, in
this section of the survey, both of these teachers reported that PEG enabled
them to provide more feedback on content and ideas despite reporting "no
difference" in the first section of the survey that asked about feedback practices.
Teacher 2 also submitted a very positive evaluation of PEG, reporting that
PEG was superior to Google Docs for all surveyed aspects with the exception of
one item: "Which system enabled you to give more feedback on content and
ideas?" As with Teachers 1 and 3, Teacher 2's response to this question
contradicted her responses to the first section of the survey. In the first section,
Teacher 2 reported that she gave more feedback on idea development and
elaboration to students who used PEG, but here, she reported that there was no
difference between the systems. Given that all three teachers' responses to this question
contradicted their prior responses, it is possible that teachers were considering
different things in each section of the survey. In the first section, they may have
been considering characteristics of the specific essays that they read, the prompts
they assigned, and the specific difficulties their students had with the assignment.
In the second section, they may have been considering the capabilities of the
systems more abstractly. Although this is speculation, it offers a plausible
explanation as to why their answers differed across survey sections.
Discussion
In the United States, increasing attention has been paid to identifying whether
AWE systems (Stevenson, 2016) are effective tools for improving the teaching
and learning of writing. The efficiency and reliability of AWE systems, and their
Table. Teachers' Evaluations of PEG. [Caption and column headers not preserved in this
extraction; columns are inferred to correspond to Teachers 1–3.]

                                                           Teacher 1  Teacher 2  Teacher 3
Usability
  PEG was easy for me to use.                              SA         A          SA
  PEG was easy for my students to use.                     SA         A          SA
  Using PEG allowed me to teach more and grade less.       N          A          A
  I receive sufficient information from PEG to
    differentiate my writing instruction.                  A          A          N
  The trait feedback provided by PEG is appropriate.       SA         A          SA
  Students take advantage of the spelling and grammar
    feedback they receive from PEG.                        SA         N          SA
  Students take advantage of the trait feedback they
    receive from PEG.                                      SA         A          A
  Students had sufficient keyboarding skills to
    benefit from PEG.                                      A          D          SA
Effectiveness
  Using PEG improved my students' writing motivation.      A          A          N
  Using PEG improved my students' writing skills.          N          A          N
  English language learners benefit from PEG.              A          A          SA
  Students with disabilities benefit from PEG.             A          A          SA
  Using PEG improved my students' keyboarding.             N          A          N
  Using PEG increased the amount of writing my
    students did.                                          N          A          N
  Using PEG increased the amount of revising my
    students did.                                          N          A          N
  Students receive more feedback when using PEG.           SA         A          SA
  PEG helped me address my students' writing needs.        N          A          N
Desirability
  I would like to continue using PEG next year.            A          A          N
  I would recommend PEG to other teachers.                 A          A          SA

Note. PEG = Project Essay Grade; SA = strongly agree; A = agree; N = neutral; D = disagree;
SD = strongly disagree.
performance. This is not to say that AWE is ineffective. In fact, repeated exposure
to automated scores and feedback, even via infrequent massed practice, appears to
improve students' self-efficacy with respect to key writing skills and thereby
improve students' state ELA test performance to a greater extent than composing
with tools that lack automated feedback, such as Google Docs.
AWE is effective, but its effectiveness is dependent on the instructional context.
In the present study, the overall number of practice opportunities (three
total essays) was too low, but students who used the AWE system did complete,
on average, a total of six drafts per essay (SD = 1.7) and effectively improved
their writing quality across drafts: The average gain between first and final drafts
across the three essays was 1.13 points (SD = 0.54). Furthermore, students who
completed a greater number of revisions per essay had higher scores on average
on the state ELA test (r = .353, p < .001). However, AWE did not create
additional writing practice opportunities outside of those that were part of the
curriculum (Prompts 1 and 3) or required by the research study (Prompt 2). AWE
affords space for teachers and students to practice writing, particularly revising,
but it does not change the curriculum. The pressure to keep pace with a
curriculum is not something that AWE will change, nor will AWE cause an ELA
curriculum to prioritize writing.
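The revision–achievement association reported above (r = .353) is a Pearson correlation between each student's revision count and state test score. A stdlib-only sketch of that computation follows; the toy records are invented for illustration and are not the study's data.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical (revisions per essay, state ELA scale score) records.
records = [(2, 2460), (3, 2500), (4, 2470), (5, 2520), (6, 2505), (7, 2550)]
revisions = [rev for rev, _ in records]
scores = [score for _, score in records]
r = pearson_r(revisions, scores)
print(f"r = {r:.3f}")  # positive association in this toy data
```

Because both the covariance and the standard deviations use the same population denominator, the n's cancel and the result matches the usual textbook formula.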
Unfortunately, research shows that most U.S. ELA curricula prioritize reading
over writing, primarily using writing activities to support reading outcomes such
as comprehension. As with the teachers in the present study, research suggests that
teachers in the United States rarely assign writing that requires multiple stages of
drafting and revising, and consequently students lack the practice needed to
develop writing skills, knowledge, and strategies to a level of proficiency
(Gilbert & Graham, 2010; Graham, Capizzi, Harris, Hebert, & Morphy, 2013).
If AWE is to be used as a method of delivering the curriculum, it will be
partially effective. However, to maximize the potential efficacy of AWE, it may
be necessary to identify ways for teachers to implement AWE in a manner that
extends or enriches the curriculum.
Limitations
Our interest in examining improvements in overall writing quality on students'
first drafts led us to select a writing prompt for the pretest and posttest that did
not involve any revising. In contrast, the Smarter Balanced ELA test required
students to engage in the full writing process: planning, drafting, revising, and
editing. This discrepancy might explain why there was no effect of composing
condition on posttest writing quality but an effect on state test performance,
which underscores the importance of evaluating the efficacy of AWE with multiple
metrics. Future research that uses measures of writing quality should use writing
prompts that require students to engage in drafting, revising, and editing, and
should use multiple broad and fine-grained measures of writing quality.
Funding
The authors disclosed receipt of the following financial support for the research,
authorship, and/or publication of this article: This research was supported in part by a
Delegated Authority contract from Measurement Incorporated® to the University of
Delaware (EDUC432914160001). The opinions expressed in this article are those of the
authors and do not necessarily reflect the positions or policies of this agency, and no
official endorsement by it should be inferred.
Notes
1. It is worth noting that innovations in NLP tools and techniques are ongoing and thus
pushing the boundaries of automated assessment of writing. For example, automated
methods have shown promise for evaluating more complicated, stylistic elements of
writing quality, such as humor (Skalicky, Berger, Crossley, & McNamara, 2016).
2. Although PEG supports peer review, teachers indicated that they did not use peer
review in their instruction. Therefore, teachers were not trained to use this function.
ORCID iD
Joshua Wilson http://orcid.org/0000-0002-7192-3510
References
Archer, A. L., & Hughes, C. A. (2011). Explicit instruction: Effective and efficient teach-
ing. New York, NY: Guilford Press.
Balfour, S. P. (2013). Assessing writing in MOOCs: Automated essay scoring and
Calibrated Peer Review™. Research & Practice in Assessment, 8, 40–48.
Ballock, E., McQuitty, V., & McNary, S. (2017). An exploration of professional know-
ledge needed for reading and responding to student writing. Journal of Teacher
Education, 69, 1–13. doi:10.1177/0022487117702576
Behizadeh, N., & Pang, M. E. (2016). Awaiting a new wave: The status of state writing
assessment in the United States. Assessing Writing, 29, 25–41.
Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of auto-
mated scoring to construct-irrelevant response strategies (CIRS): An illustration.
Assessing Writing, 22, 48–59.
Bennett, R. E. (2004). Moving the field forward some thoughts on validity and automated
scoring. Research Memorandum (RM-04-01). Princeton, NJ: Educational Testing
Service.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological
Bulletin, 107, 238–246.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: John
Wiley & Sons, Inc.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY:
Guilford Press.
Bruning, R., Dempsey, M., Kauffman, D. F., McKim, C., & Zumbrunn, S. (2013).
Examining dimensions of self-efficacy for writing. Journal of Educational
Psychology, 105, 25–38.
Bruning, R. H., & Kauffman, D. F. (2016). Self-efficacy beliefs and motivation in writing
development. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of
writing research (2nd ed., pp. 160–173). New York, NY: Guilford Press.
Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems.
In Y. Rosen, S. Ferrara & M. Mosharraf (Eds.), Handbook of research on technology
tools for real-world skill development (pp. 611–626). Hershey, PA: IGI Global.
Chapelle, C. A., Cotos, E., & Lee, J. (2015). Diagnostic assessment with automated
writing evaluation: A look at validity arguments for new classroom assessments.
Language Testing, 32(3), 385–405.
Cheville, J. (2004). Automated scoring technologies and the rising influence of error.
English Journal, 93(4), 47–52.
Common Core State Standards Initiative. (2010). Common core state standards for
English language arts & literacy in history/social studies, science, and technical subjects.
Retrieved from http://www.corestandards.org/assets/CCSSI_ELA%20Standards.pdf
Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated
scoring of essays: Fishing for red herrings? Assessing Writing, 18, 100–108.
Conference on College Composition and Communication. (2004). CCCC position state-
ment on teaching, learning and assessing writing in digital environments. Retrieved from
http://www.ncte.org/cccc/resources/positions/digitalenvironments
Cotos, E. (2015). AWE for writing pedagogy: From healthy tension to tangible prospects.
Special issue on Assessment for Writing and Pedagogy. Writing and Pedagogy, 7(2–3),
197–231.
Davidson, L. Y. J., Richardson, M., & Jones, D. (2014). Teachers’ perspective on using
technology as an instructional tool. Research in Higher Education, 24, 1–25.
Deane, P. (2013). On the relation between automated essay scoring and modern views of
the writing construct. Assessing Writing, 18, 7–24.
Dikli, S., & Bleyle, S. (2014). Automated essay scoring feedback for second language
writers: How does it compare to instructor feedback? Assessing Writing, 22, 1–17.
Duijnhouwer, H., Prins, F. J., & Stokking, K. M. (2010). Progress feedback effects on
students' writing mastery goal, self-efficacy beliefs, and performance. Educational
Research and Evaluation, 16, 53–74.
Ertmer, P. A. (1999). Addressing first- and second-order barriers to change: Strategies
for technology integration. Educational Technology Research and Development, 47, 47–61.
Foltz, P. W. (2014, May). Improving student writing through automated formative assess-
ment: Practices and results. Paper presented at the International Association for
Educational Assessment (IAEA), Singapore.
Gilbert, J., & Graham, S. (2010). Teaching writing to elementary students in Grades 4-6:
A national survey. Elementary School Journal, 110, 494–518.
Graham, S., Berninger, V., & Fan, W. (2007). The structural relationship between writing
attitude and writing achievement in first and third grade students. Contemporary
Educational Psychology, 32(3), 516–536.
Graham, S., Bruch, J., Fitzgerald, J., Friedrich, L., Furgeson, J., Greene, K., . . . Smither
Wulsin, C. (2016). Teaching secondary students to write effectively. Washington, DC:
National Center for Education Evaluation and Regional Assistance, Institute of
Education Sciences, U.S. Department of Education (NCEE 2017-4002).
Graham, S., Capizzi, A., Harris, K. R., Hebert, M., & Morphy, P. (2013). Teaching writing
to middle school students: A national survey. Reading and Writing, 27, 1015–1042.
Graham, S., Hebert, M., & Harris, K. R. (2015). Formative assessment and writing: A
meta-analysis. Elementary School Journal, 115, 523–547.
Graham, S., McKeown, D., Kiuhara, S., & Harris, K. R. (2012). A meta-analysis of
writing instruction for students in the elementary grades. Journal of Educational
Psychology, 104, 879–896.
Graham, S., & Perin, D. (2007). Writing next: Effective strategies to improve writing of
adolescents in middle and high schools – A report to Carnegie Corporation of New York.
Washington, DC: Alliance for Excellent Education.
Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of
automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6),
1–44. Retrieved from http://www.jtla.org
Hayes, J. R. (2012). Modeling and remodeling writing. Written Communication, 29(3),
369–388.
Hew, K., & Brush, T. (2007). Integrating technology into K-12 teaching and learning:
Current knowledge gaps and recommendations for future research. Educational
Technology Research and Development, 55(3), 223–252.
Human Resources Research Organization. (2016). Independent evaluation of the validity
and reliability of STAAR Grades 3-8 Assessment Scores: Part 1. Retrieved from http://
tea.texas.gov/student.assessment/reports/
Hutchison, A. C., & Woodward, L. (2014). An examination of how a teacher’s use of
digital tools empowers and constrains language arts instruction. Computers in the
Schools, 31, 316–338.
Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis &
J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective
(pp. 147–167). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Kellogg, R. T. (2008). Training writing skills: A cognitive developmental perspective.
Journal of Writing Research, 1(1), 1–26.
Kellogg, R. T., & Raulerson, B. A. (2007). Improving the writing skills of college stu-
dents. Psychonomic Bulletin & Review, 14(2), 237–242.
Kellogg, R. T., & Whiteford, A. P. (2009). Training advanced writing skills: The case for
deliberate practice. Educational Psychologist, 44, 250–266.
Kellogg, R. T., Whiteford, A. P., & Quinlan, T. (2010). Does automated feedback
help students learn to write? Journal of Educational Computing Research, 42(2),
173–196.
Kenny, D. A. (2011, August 15). Path analysis. Retrieved from http://davidakenny.net/
cm/pathanal.htm
Kopcha, T. J. (2012). Teachers’ perceptions of the barriers to technology integration and
practices with technology under situated professional development. Computers &
Education, 59, 1109–1121.
Liu, M. L., Li, Y., Xu, W., & Liu, L. (2016). Automated essay feedback generation and its
impact in the revision. IEEE Transactions on Learning Technologies, 99, 1–13.
doi:10.1109/TLT.2016.2612659
Lyst, A. M., Gabriel, G., O’Shaughnessy, T. E., Meyers, J., & Meyers, B. (2005). Social
validity: Perceptions of check and connect with early literacy support. Journal of
School Psychology, 43, 197–218.
Matsumura, L. C., Patthey-Chavez, G. G., Valdés, R., & Garnier, G. (2002). Teacher
feedback, writing assignment quality, and third-grade students' revision in lower- and
higher-achieving urban schools. Elementary School Journal, 103, 3–25.
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A
framework for teacher knowledge. Teachers College Record, 108, 1017–1054.
Monk, J. (2016). Revealing the iceberg: Creative writing, process, & deliberate practice.
English in Education, 50, 95–115.
Moore, N. S., & MacArthur, C. A. (2016). Student use of automated essay evaluation
technology during revision. Journal of Writing Research, 8, 149–175.
Morphy, P., & Graham, S. (2012). Word processing programs and weaker writers/read-
ers: A meta-analysis of research findings. Reading and Writing, 25, 641–678.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user's guide (8th ed.). Los Angeles,
CA: Muthén & Muthén.
National Center for Education Statistics. (2012). The Nation’s Report Card: Writing 2011.
Washington, DC: Institute of Education Sciences, U.S. Department of Education
(NCES 2012–470).
National Council of Teachers of English. (2013). NCTE position statement on machine
scoring. Retrieved from http://www.ncte.org/positions/statements/machine_scoring
Okumus, S., Lewis, L., Wiebe, E., & Hollebrands, K. (2016). Utility and usability as
factors influencing teacher decisions about software integration. Educational Technology
Research and Development, 64, 1227–1249.
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis & J. C. Burstein (Eds.),
Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Mahwah, NJ:
Lawrence Erlbaum Associates, Inc.
Pajares, F. (2003). Self-efficacy beliefs, motivation, and achievement in writing: A review
of the literature. Reading & Writing Quarterly, 19, 139–158.
Pajares, F., Johnson, M., & Usher, E. (2007). Sources of writing self-efficacy beliefs of
elementary, middle, and high school students. Research in the Teaching of English, 42,
104–120.
Pajares, F., & Johnson, M. J. (1996). Self-efficacy beliefs and the writing performance of
entering high school students. Psychology in the Schools, 33, 163–175.
Palermo, C., & Thomson, M. M. (2018). Teacher implementation of self-regulated strat-
egy development with an automated writing evaluation system: Effects on the argu-
mentative writing performance of middle school students. Contemporary Educational
Psychology, 54, 255–270.
Perelman, L. (2014). When ‘‘the state of the art’’ is counting words. Assessing Writing, 21,
104–111.
Perin, D., Lauterbach, M., Raufman, J., & Kalamkarian, H. S. (2017). Text-based writing
of low-skilled postsecondary students: Relation to comprehension, self-efficacy and
teacher judgements. Reading and Writing, 30, 887–915.
Persky, H. R., Daane, M. C., & Jin, Y. (2002). The Nation’s Report Card: Writing 2002.
Washington, DC: National Center for Education Statistics, Institute of Education
Sciences. U. S. Department for Education (NCES 2003-529).
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of
reporting practices and suggestions for improvement. Review of Educational Research,
74, 525–556.
Roscoe, R. D., & McNamara, D. S. (2013). Writing Pal: Feasibility of an intelligent
writing strategy tutor in the high school classroom. Journal of Educational
Psychology, 105, 1010–1025.
Roscoe, R. D., Wilson, J., Johnson, A. C., & Mayra, C. R. (2017). Presentation, expect-
ations, and experience: Sources of student perceptions of automated writing evalu-
ation. Computers in Human Behavior, 70, 207–221.
Sanders-Reio, J., Alexander, P. A., Reio, T. G. Jr., & Newman, I. (2014). Do students’
beliefs about writing relate to their writing self-efficacy, apprehension, and perform-
ance? Learning and Instruction, 33, 1–11.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1–10.
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom.
Pedagogies: An International Journal, 3, 22–36.
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the class-
room research agenda. Language Teaching Research, 10(2), 157–180.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wilson, J. (2017). Associated effects of automated essay evaluation software on growth in
writing quality for students with and without disabilities. Reading and Writing, 30,
691–718.
Wilson, J. (2018). Universal screening with automated essay scoring: Evaluating classifi-
cation accuracy in Grades 3 and 4. Journal of School Psychology, 68, 19–37.
Wilson, J., & Andrada, G. N. (2016). Using automated feedback to improve writing
quality: Opportunities and challenges. In Y. Rosen, S. Ferrara & M. Mosharraf
(Eds.), Handbook of research on technology tools for real-world skill development
(pp. 678–703). Hershey, PA: IGI Global.
Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English language
arts classrooms: Effects on teacher feedback, student motivation, and writing quality.
Computers and Education, 100, 94–109.
Wilson, J., Olinghouse, N. G., & Andrada, G. N. (2014). Does automated feed-
back improve writing quality? Learning Disabilities: A Contemporary Journal, 12,
93–118.
Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied
behavioral analysis is finding its heart. Journal of Applied Behavior Analysis, 11,
203–214.
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we
heading? Language Testing, 27(3), 291–300.
Author Biographies
Joshua Wilson, PhD, is an assistant professor of education at University of
Delaware. His research focuses on how automated scoring and automated feed-
back systems support the teaching and learning of writing. He is particularly
interested in exploring the potential of these systems to improve outcomes for
struggling writers.