
Journal of Educational Measurement

Summer 2018, Vol. 55, No. 2, pp. 194–216

Modeling Basic Writing Processes From Keystroke Logs


Hongwen Guo, Paul D. Deane, Peter W. van Rijn, Mo Zhang,
and Randy E. Bennett
Educational Testing Service

The goal of this study is to model pauses extracted from writing keystroke logs as
a way of characterizing the processes students use in essay composition. Low-level
timing data were modeled: the interkey interval and its subtype, the intraword duration, both thought to reflect processes associated with keyboarding skills and composition
fluency. Heavy-tailed probability distributions (lognormal and stable distributions)
were fit to individual students’ data. Both density functions fit reasonably well, and
estimated parameters were found to be robust across prompts designed to assess
student proficiency for the same writing purpose. In addition, estimated parameters
for both density functions were statistically significantly associated with human es-
say scores after accounting for total time spent writing the essay, a result consistent
with cognitive theory on the role of low-level processes in writing.

Since the beginning of writing assessment, the focus has been upon grading the
end result, the essay produced by the examinee. That focus is reasonable given that
the end result is a socially valued communication device, one used to inform, con-
vince, or tell a story. With the advent of computer-based testing, it is possible to go
beyond evaluating the quality of that end result to analyzing the process used by the
student to get there. In fact, a sizable literature suggests that the processes leading to
the final product can provide valuable information about underlying cognitive activ-
ities during composition (e.g., Breetvelt, van den Bergh, & Rijlaarsdam, 1994; Kel-
logg, 2001; McCutchen, 2000). Further, by combining information extracted from
both product and process features, we can obtain a much richer profile of learn-
ers’ writing behavior (e.g., Almond, Deane, Quinlan, Wagner, & Sydorenko, 2012;
Deane, 2014), a profile that might eventually be of use in improving that behavior
and hopefully the end result. In particular, keystroke logs can be used to record the
process of composition. A well-designed keystroke logging system can accurately
capture individual keystrokes and changes made to the text, along with associated
time stamps (e.g., Leijten & van Waes, 2013). The resulting logs can then be ana-
lyzed to identify larger patterns that reflect underlying features of the writing process.
Pauses and distributions of various kinds of pauses (e.g., interkey, intraword,
between-word) are the lowest level features in the keystroke logs and are of particular
interest to researchers. The literature indicates that long pauses are more likely to be
associated with such processes as deliberation and text planning, reading of source
materials, and evaluating the text produced so far (e.g., Baaijen, Galbraith, & de
Glopper, 2012; Chukharev-Hudilainen, 2014; Deane, 2014; Xu & Ding, 2014). How-
ever, long pauses can also indicate difficulties in lexical access such as spelling and
word finding, as well as keyboarding difficulty. Short pauses, in contrast, are more
likely to reflect basic keyboarding fluency (e.g., Alves, Castro, & de Sousa, 2007).


It has long been observed in the cognitive sciences that human response-time (RT)
data display highly skewed and heavy-tailed distributions. That is, the RT distribu-
tion decreases to zero much more slowly than the commonly used normal distribu-
tion. Heavy-tailed distributions such as the log-normal distribution (Ulrich & Miller,
1993), Weibull distribution (Logan, 1992), and power law distributions (Moscoso del
Prado Martin, 2009) have therefore been suggested as better models of RT distribu-
tions than is the normal distribution. In the context of educational assessment, the
lognormal distribution has been recommended to model RT of test items (van der
Linden, 2006). However, Ma, Holden, and Serota (2016) demonstrate that RT dis-
tributions have power-law tails and argue that, among the closed-form distributions,
they are best fit by the generalized inverse gamma distribution. Ihlen (2013) investi-
gated RT distributions for a range of simple cognitive tasks and fitted the so-called
stable distribution to RT sequences involving autocorrelation structures.
When the underlying data follow a heavy-tailed distribution, commonly used sum-
mary features such as averages and standard deviations tend to be unsuitable for
description as well as for applied purposes such as prediction. That is, the average
of n independent variables may not converge to a normal distribution with vanish-
ing spread as n tends to infinity, but it may instead display the same variability as
the original variables. Many features of heavy-tailed phenomena render traditional
statistical tools useless at best and dangerous at worst (Cooke & Nieboer, 2011).
Heavy-tailed distributions have also been observed in the temporal data obtained
from keystroke logs that record the time course of writing (e.g., Almond et al., 2012;
Chukharev-Hudilainen, 2014). Typical measures include interkey intervals (IKIs)
(the duration between successive keystrokes) and intraword intervals (the duration
between successive keystrokes within a word). Chukharev-Hudilainen (2014) fit-
ted the distribution of IKIs with the exponentially modified Gaussian distribution.
Almond et al. (2012) modeled the distribution of inter- and intraword intervals as
a mixture of lognormal distributions, where five parameters were used to describe
pause sequences. These latter investigators analyzed two test forms where students
were asked to write for very different purposes and where the analysis was based
on a rather small sample (20–80 students per test form). The results were, unfortu-
nately, ambiguous with respect to the interpretation of the parameters extracted, both
in terms of underlying cognitive processes and in terms of their relations to external
measures of writing proficiency.
In the present study, we draw upon a substantially larger sample of essays that in-
cluded responses to parallel prompts for the same writing purpose. We are concerned
with the following research question: Are models that characterize pauses in writing
process informative, predictive, and consistent across prompts that target the same
writing purpose? That is, can we model pauses with reasonable precision so that the
obtained estimates persist beyond a response to a single prompt written on one occa-
sion? Are the estimates theoretically meaningful in the sense that they are associated
with essay quality as perceived by human judges?
Whereas it might, in principle, be possible to use pause data to improve sum-
mative essay scoring—that is, to use these data as a proxy for essay quality—that
goal is not of interest here. Rather, the identification of precise and meaningful esti-
mates is a first step toward characterizing a student’s writing processes in a manner

that might have action implications for instruction. Such modeling of pauses, for
example, might be able to identify individual students who need to practice key-
boarding because their typing is consistently dysfluent. In addition, the modeling of
pauses might be thought of as a building block for analysis of more complex log
features from which processes related to planning, organization, and editing can be
inferred.
In the current investigation, we focus on pauses that are likely to reflect a mixture
of processes related to basic keyboarding skills and composition fluency, namely,
IKI, and more particularly its subtype, the intraword duration (IWD) (or within-word
duration). We estimate probability distributions for both features in order to provide
meaningful summary statistics that might be used to represent the writing temporal
processes employed by individual students. Because we expect such distributions to
be heavy-tailed, we compare two alternative distribution types (the stable and log-
normal distributions), and evaluate how well they function as models for keystroke
data. We use as evaluation criteria estimation errors, associations with human scores,
and consistency of parameters across writing prompts.
The remainder of the article is organized as follows. We first describe the data
set. In the next section (Estimation Methods), we describe the probability distribu-
tions considered in the study, outline the estimation methods, specify the software
packages used, and lay out the analysis steps. We then present the estimation results,
compare how well the stable distribution and lognormal distribution fit our data, eval-
uate how well the estimated parameters relate to human raters’ judgments of essay
quality, and assess how robust parameters are when students write two essays for
the same purpose. Finally, we discuss the results, implications, and limitations of the
study, as well as directions for future research.

Data
Participants for the study were a convenience sample of more than 2,500 students
from Grade 6 to Grade 9 in seven U.S. states. The characteristics of the sample are
given in Table 1. Because each student wrote two essays and because of missing
background information, the total sample size in Table 1 was much smaller than the
number of essays used in the later keystroke modeling (refer to the essay numbers in
the later Analysis section).
We used six scenario-based English-language arts (ELAs) summative writing as-
sessments developed as part of the cognitively based assessment of learning (CBAL)
research initiative at the Educational Testing Service (ETS) (Bennett, Deane, & van
Rijn, 2016). The six test forms were designed to serve two different writing pur-
poses (Deane & Song, 2014). One writing purpose focused on building an argument
using evidence extracted from given source reading materials (argumentation), and
the other focused on evaluating competing proposals using explicit criteria to rec-
ommend one proposal over the other (policy recommendation). The three forms de-
veloped for each purpose were designed to be parallel and thus contained the same
number of items and item types, presented in the same order. The major difference
between forms was the topic about which students were asked to read, think, and
write. Evidence suggests that the psychometric functioning of the forms with respect

Table 1
Summaries of Students’ Background Information

                               Argumentation    Policy Recommendation
Grade
  6                                    -                117
  7                                  303                766
  8                                  741                186
  9                                  473                  -
  Missing                            322                453
Gender
  Male                            46.62%             49.20%
  Female                          53.38%             50.80%
Race/ethnicity
  White                           73.88%             68.94%
  Hispanic                        17.91%             21.80%
  Black                            2.95%              5.05%
  Asian                            3.88%              3.65%
English proficient                97.80%             88.16%
Free or reduced price lunch       27.15%             34.23%

Note. Percentages are based on sample sizes (for argumentation and policy recommendation, respectively)
of 1,390 and 1,069 for gender, 1,390 and 1,069 for race/ethnicity, 1,318 and 1,005 for English proficient,
and 965 and 1,005 for free or reduced price lunch. Missing data were not counted in these sample sizes.
Copyright by Educational Testing Service, 2018 All rights reserved.

to reliability, generalizability, relations to external variables, and internal structure is


more than adequate for research purposes (see Bennett, Zwick, & van Rijn, 2017).
For example, the Cronbach’s alpha of the six forms ranged from .76 to .84, with an
average of .81 (van Rijn, Chen, & Yan-Koo, 2016, table 20; van Rijn & Yan-Koo,
2016, table 18).
The six writing test forms were administered online during the spring of 2013. The
three argumentation forms will be referenced as BA, SN, and CG (which stand as ab-
breviations for the topic of each form—Banning Advertisements to children under
12, restricting access to Social Networking, and paying children Cash for Grades).
The three policy recommendation forms will be referenced as SL, CF, and GG (de-
ciding between Service Learning projects, alternate themes for a Culture Fair, and
alternate uses of a Generous Gift). The test administration manual specified that the tests be given in regular classes and that students not leave the test until completion.
Each student took two of the three parallel test forms focused on the same writ-
ing purpose (either argumentation or policy recommendation), with the second form
being administered within four weeks after the completion of the first one. Students
were randomly assigned within classes to one of the three pairings of parallel test
forms within a writing purpose. Although each test was timed, the timing was set so
as to allow most students to comfortably finish.1
Keystroke logs for the essay item were collected as part of data collection. We used
the keystroke logging system developed at ETS (Deane, 2014). The ETS keystroke

system can extract a large number of features ranging from relatively simple ones
like counts and time durations for certain actions (e.g., total number of backspacing
events or total time spent on insertion events), to more complex characteristics that
link pause data to linguistic features (e.g., the statistical likelihood of particular word
sequences). At the lowest level of detail, the keystroke log captures the sequence and
timing of changes in the text that lead to the student’s final submitted essay.
The essays were scored by human raters against two rubrics, each on an integer
scale ranging from 0 to 5. For the purpose of our analyses, we excluded responses
that received a human score of 0 (a very small subset of the total sample), since that
score point was used to denote essays with unusual response characteristics including
empty and off-topic responses, plagiarized responses, and responses consisting of
random keystrokes. One of the scoring rubrics (denoted as RS1) evaluated writing
fundamentals (e.g., word usage, mechanics, organization, and development), and the
other rubric (denoted as RS2) evaluated student performance on such higher level
skills as the quality of argumentation. For each rubric, scoring was done holistically;
that is, the rubric described for each score level the characteristics that a response
at that level should possess. Raters used those descriptions, explanatory notes, and
exemplar responses for each level to arrive at a score for an essay.
For this study, we used human scores as criterion variables in several of the analy-
ses. For argumentation forms CG and SN, a mean score taken across two raters was
used because all responses were double scored. When raters disagreed by more than
one point, a third rater was employed. For BA, mean scores were used for the 20%
of responses that were double scored and a single human rater’s score was used for
the remaining responses. A similar situation obtained for the policy recommendation
forms CF and GG, where 100% of responses were double scored, and for SL for
which 20% of the responses were so judged (note that because of the existing large
database for BA and SL, only 20% of the 2013 essays were double scored).
With respect to the characteristics of the human scores, rater agreement for the
policy recommendation forms ran from quadratic weighted kappas of .70 to .80 for
RS1 and from .70 to .77 for RS2 (van Rijn et al., 2016). For the argumentation forms,
kappa values went from .62 to .74 for RS1 and .59 to .70 for RS2 (van Rijn & Yan-
Koo, 2016). Polyserial correlations between a single essay rating and the total test
score (after removing the essay score) ran from .60 to .62 for RS1 and from .61 to
.64 for RS2 on the policy recommendation forms (van Rijn et al., 2016). For the
argumentation forms, the comparable values were .59 to .66 for RS1 and .52 to .61
for RS2 (van Rijn & Yan-Koo, 2016).

Estimation Methods
As in Almond et al. (2012), we fitted lognormal distributions to the data in an
attempt to obtain more informative results. Additionally, we used a more flexible
model, the stable distribution, to fit the data. Both lognormal and stable distribu-
tions are heavy-tailed. We briefly introduce their definitions in this section (inter-
ested readers can explore some of their tail properties from the references below).
All estimations and computations in the study were produced using the R statistical
software packages (R Core Team, 2014).


Lognormal Distribution
A real-valued random variable X is said to have a lognormal distribution if the
logarithm of the variable follows a normal distribution. That is,

$$\log(X) \sim N(\mu, \sigma), \qquad (1)$$

where μ and σ are the mean and standard deviation of the transformed variable. The
mean, median, mode, and variance of the original variable $X$ are $e^{\mu+\sigma^2/2}$, $e^{\mu}$, $e^{\mu-\sigma^2}$, and $(e^{\sigma^2}-1)\,e^{2\mu+\sigma^2}$, respectively.

Estimation of $\mu$ and $\sigma$ can be obtained by the maximum likelihood estimation (MLE) method in the R package fitdistrplus (Delignette-Muller & Dutang, 2015).
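As an illustration of this step, the R sketch below fits a lognormal distribution to a single pause sequence with fitdistrplus and recovers the summary statistics above from the fitted parameters. The iwd vector is a hypothetical stand-in for one student's IWD sequence (in seconds), not data from the study.

```r
library(fitdistrplus)

# Hypothetical IWD sequence (in seconds) for one student's essay.
iwd <- c(0.21, 0.18, 0.35, 0.16, 1.90, 0.24, 0.19, 0.45, 3.10, 0.22,
         0.17, 0.28, 0.61, 0.15, 0.33, 0.26, 0.48, 0.19, 0.97, 0.23)

# Maximum likelihood fit of the lognormal distribution.
fit_ln <- fitdist(iwd, "lnorm")
mu     <- fit_ln$estimate["meanlog"]
sigma  <- fit_ln$estimate["sdlog"]

# Summary statistics on the raw-time scale implied by the fitted parameters.
ln_mean   <- exp(mu + sigma^2 / 2)
ln_median <- exp(mu)
ln_mode   <- exp(mu - sigma^2)
ln_var    <- (exp(sigma^2) - 1) * exp(2 * mu + sigma^2)
```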

Stable Distributions
A real-valued random variable X is said to have a stable distribution if, for any
n ≥ 2, there are a positive number Cn and a real number Dn such that
$$X_1 + X_2 + \cdots + X_n \overset{d}{=} C_n X + D_n, \qquad (2)$$

where $X_1, \ldots, X_n$ are independent and identically distributed as $X$. That is, the sum of the variables has the same distribution as the original variable but with a different scale (Nolan, 2016). This class of distributions does not have a closed-form
density. Instead, the characteristic function φ X (u) of a stable distribution S(α, β, γ, δ)
is given by
$$\phi_X(u) = \begin{cases} \exp\{-|\gamma u|^{\alpha}\,[1 + i\beta\,\mathrm{sgn}(u)\tan(\tfrac{\pi\alpha}{2})(|\gamma u|^{1-\alpha} - 1)] + i\delta u\}, & \alpha \in (0, 1) \cup (1, 2) \\ \exp\{-|\gamma u|\,[1 + i\beta\,\mathrm{sgn}(u)\tfrac{2}{\pi}\ln|u|] + i\delta u\}, & \alpha = 1 \\ \exp\{i\gamma u - \tfrac{1}{2}\sigma^2 u^2\}, & \alpha = 2. \end{cases} \qquad (3)$$

The parameter α is called the index of stability or characteristic exponent and must
be in the range of 0 < α ≤ 2. The smaller α is, the heavier the tail. The parameter
β is called the skewness of the law, and must be in [−1, 1]. If β = 0, the distribution
is symmetric; if β > 0, it is skewed to the right; and if β < 0, to the left. The α and
β determine the shape of the distribution. The parameter γ is a scale parameter in
(0, +∞). The parameter δ is a location parameter that shifts the distribution right if
δ > 0, and left if δ < 0 (Nolan, 2016).
Any member of this class has the same distribution as the sum of its independent
copies. The normal distribution is a special case when α = 2 (McCulloch, 1986). All
stable distributions except the normal distribution have heavy tails with an asymp-
totic power law (Pareto) decay. One consequence is that the pth moment exists for
only p < α, and the generalized central limit theorem (McCulloch, 1996; Nolan,
2016) has to be used to derive the limit distribution of the sum of stable distributions.
We used the R package StableEstim (Kharrat & Boshnakov, 2015) to estimate the
stable parameters. Among the different methods, the MLE method produced the most
accurate estimation and the generalized method of moments (GMM) the second
most accurate estimation. However, the MLE method was too time-consuming and

complex for a single RT sequence. Therefore, the GMM was used in the following
analysis.
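As a sketch of this step, the R fragment below fits the four stable parameters to a single pause sequence. The Estim() call assumes the interface documented for the StableEstim package (the argument names and the @par slot holding the estimates may differ across package versions), and the iwd vector is again a hypothetical placeholder rather than study data.

```r
library(StableEstim)

# Hypothetical IWD sequence (in seconds) for one student's essay.
iwd <- c(0.21, 0.18, 0.35, 0.16, 1.90, 0.24, 0.19, 0.45, 3.10, 0.22,
         0.17, 0.28, 0.61, 0.15, 0.33, 0.26, 0.48, 0.19, 0.97, 0.23)

# Fit S(alpha, beta, gamma, delta) by the generalized method of moments (GMM),
# as in the study; ML is more accurate but too slow to run per pause sequence.
# NOTE: the argument names and the @par slot are assumptions about the package API.
fit_stable <- Estim(EstimMethod = "GMM", data = iwd)
theta_hat  <- fit_stable@par   # estimated (alpha, beta, gamma, delta)
```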

Analysis
We estimate the density functions of the keystroke log pauses for each of the 5,181
essays written by students from the six writing prompts. The sample sizes of students
for each prompt are 684 for SL, 776 for BA, 1,127 for CG, 1,025 for CF, 1,051 for
GG, and 1,118 for SN. The pause sequence in each essay keystroke log has two esti-
mated density functions, one being the stable distribution and the other the lognormal
distribution.
We first model or summarize the keystroke log sequences by the distribution pa-
rameters. The difference between estimated parametric density functions and the em-
pirical distributions (histograms) is computed to evaluate the fit of the densities. That
is, we define the estimation error as

$$\sum_{t=0}^{T} \frac{F_t}{T}\,(Y_t - f(t))^2, \qquad (4)$$
where Ft is the relative frequency at the tth bin in the histogram, Yt is the corre-
sponding histogram density at time t that is Ft divided by the bin width, f (t) is the
estimated density (either stable or lognormal density) at time t, and T is the total
number of equally spaced bins in the histogram.
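A minimal R sketch of this error computation is given below; the histogram supplies $F_t$ and $Y_t$, and the fitted density $f$ is passed in as a function. The bin count is an arbitrary choice for illustration.

```r
# Estimation error of a fitted density against the empirical histogram, as in
# equation (4): the sum over bins of (F_t / T) * (Y_t - f(t))^2.
estimation_error <- function(x, f, n_bins = 50) {
  h  <- hist(x, breaks = n_bins, plot = FALSE)  # equally spaced bins
  Tt <- length(h$counts)                        # total number of bins, T
  Ft <- h$counts / length(x)                    # relative frequency per bin
  Yt <- h$density                               # histogram density (F_t / bin width)
  ft <- f(h$mids)                               # fitted density at bin midpoints
  sum((Ft / Tt) * (Yt - ft)^2)
}

# Example with a fitted lognormal density (mu and sigma estimated beforehand):
# estimation_error(iwd, function(t) dlnorm(t, meanlog = mu, sdlog = sigma))
```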
Second, in order to evaluate which parameters are better associated with hu-
man scores, correlation coefficients between these estimated parameters and human
scores are calculated for each prompt, respectively. Simple linear regression analysis
is also employed, with human scores regressed on the estimated parameters, sepa-
rately for the stable distributions and the lognormal distributions, to provide prelim-
inary information on the extent to which the keystroke summaries are incrementally
related to human judgments of essay quality, after accounting for the total time spent
on writing.
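This second step amounts to comparing two nested linear models per prompt, sketched below; the data frame is simulated here only to make the sketch self-contained and does not reproduce the study's data.

```r
set.seed(1)

# Simulated per-essay data: human score RS1, total writing time TT (seconds),
# and the four estimated stable-distribution parameters of the IWD sequence.
n  <- 200
df <- data.frame(TT    = rnorm(n, 900, 300),
                 alpha = rnorm(n, 1.00, 0.20),
                 beta  = rnorm(n, 0.92, 0.10),
                 gamma = rnorm(n, 0.10, 0.05),
                 delta = rnorm(n, 0.25, 0.07))
df$RS1 <- 1 + 0.002 * df$TT - 3 * df$gamma - 2 * df$delta + rnorm(n, 0, 0.5)

fit_tt   <- lm(RS1 ~ TT, data = df)                                 # total time alone
fit_full <- lm(RS1 ~ TT + alpha + beta + gamma + delta, data = df)  # plus IWD parameters

summary(fit_tt)$adj.r.squared    # adjusted R-squared, TT alone
summary(fit_full)$adj.r.squared  # adjusted R-squared, TT plus parameter estimates
anova(fit_tt, fit_full)          # test of the incremental association
```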
Third, consistency of the estimated parameters is evaluated by computing the cor-
relations between the two sets of estimates obtained for each student. Each set is
based on one of the two essays written by that student, where both prompts targeted
the same writing purpose (i.e., either argumentation or policy recommendation).
Thus, a relatively high correlation between the parameter estimates would suggest
that the estimates reflect generalizable characteristics of the student’s writing for that
purpose as opposed to idiosyncrasies associated with a prompt.

Results
We first present results based on visual inspection of the keystroke pauses. It is
common that heavy-tailed distributions show spiky patterns throughout the observed
values. That is, on the plot of the observation sequence, a few observations are much
larger than the others in value.
To illustrate this phenomenon, we plotted the IWD sequence for one student’s
essay in Figure 1.

Figure 1. The IWD sequence plot (upper panel; IWD is measured in seconds) and its
estimated density function (lower panel) of one student’s essay. Copyright by Educational
Testing Service, 2018 All rights reserved.

In the upper panel of this plot, the X-axis is the index of the intraword pauses (718 in total), and the Y-axis is the intraword pause/duration (IWD) in
seconds. The plot shows that the student had a few very long IWDs, while most of
the other IWDs were short.
The lower panel in Figure 1 shows a typical heavy-tailed density function, which
was estimated using the nonparametric kernel density method for the IWD observa-
tions, where the X -axis is the time in seconds, and the Y -axis is the density of the
IWDs (note that the maximum value of the density does not need to be less than 1;
instead, its integral is 1). It can be seen clearly that the density function is very
skewed.
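The two panels of Figure 1 can be reproduced for any pause sequence with a few lines of base R; the sequence below is simulated rather than taken from the study.

```r
set.seed(2)

# Simulated IWD sequence (in seconds); in the study this comes from one essay's log.
iwd <- rlnorm(718, meanlog = -1.1, sdlog = 0.7)

par(mfrow = c(2, 1))
plot(iwd, type = "h", xlab = "pauses", ylab = "IWD (seconds)")  # sequence plot (upper panel)
plot(density(iwd), main = "", xlab = "time in seconds",         # kernel density estimate
     ylab = "Density")                                          # (lower panel)
```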
We then fitted two parametric distributions, the lognormal distribution and the sta-
ble distribution, to the data sequence for this student. In the upper panel of Figure 2,
the X -axis stands for time in seconds, and the Y-axis stands for density. The back-
ground bars are the histogram of the IWD sequence, which serve as the empirical
frequencies; the dots (upper panel) or the line (lower panel) are the kernel density
estimation; the dotted line is the lognormal distribution estimation; and the dashed
line is the stable distribution estimation.
The plot in the upper panel of Figure 2 shows that the stable distribution with
four estimated parameters fits the distribution better than the lognormal distribu-
tion for this IWD sequence in the time interval of .5–1.5 (for time larger than 1.5,
the difference between the two distributions is not visually distinguishable). For comparison
and exploratory purposes, we also examined the density function of the logarithm-
transformed IWD sequence. The lower panel in Figure 2 compares the normal dis-
tribution estimation to the stable distribution estimation. We observed that the sta-
ble distribution again showed better fit than the normal distribution for this IWD

sequence in the time interval of −2.5 to 0 (for time larger than 0, the difference between the two distributions is not visually distinguishable).

Figure 2. Density estimation of the IWD sequence (upper) and its logarithm (lower) that contains 718 observations in one student's essay: density estimation (dots or line labeled as den), fitted stable distribution (dashed curve labeled as STAB), and fitted lognormal (upper panel) or normal (lower panel) distribution (dotted curve). Copyright by Educational Testing Service, 2018 All rights reserved. (Color figure can be viewed at wileyonlinelibrary.com)

IWD and IKI


We analyzed IKI (which entails all character-level pause data in the essay includ-
ing IWD, between-word pauses, back-space pauses, and between-sentence pauses)
using one prompt (Service Learning, N of essays = 684) to see how different (or
comparable) the estimations were between the two types of pauses, IKI and IWD.
The cognitive literature suggests that the vast majority of keystroke pauses are likely
to reflect processes related to the fluency of word-finding, spelling, and keyboard-
ing, and that these processes are likely to be particularly strongly reflected within
words—i.e., in the IWD data. However, cognitive theory also suggests that pauses
between words, sentences, and paragraphs (those in IKI but not in IWD) are some-
what more likely to reflect additional higher order processes, such as discourse- and
sentence-level planning.
We found that the estimated parameters for both stable distributions and lognormal
distributions were highly correlated between IKI and IWD. This result can be seen in
Figure 3, which shows the effect of adding between-word pauses, between-sentence pauses, and backspace pauses to intraword pauses.

Figure 3. Regression plots of estimated parameters for stable and lognormal distributions based on intraword durations (IWD) against interkey intervals (IKI). The correlations between the estimates based on IWD and those based on IKI are shown at the top left corner of each graph. The Y-axis stands for IWD, and the X-axis stands for IKI, with a numerical value for the average difference between estimated parameters from IWD and from IKI. Copyright by Educational Testing Service, 2018 All rights reserved. (Color figure can be viewed at wileyonlinelibrary.com)

In this figure, the X-axis stands
for estimates from the IKIs and the Y -axis stands for estimates from the IWDs. For
the stable distribution, the correlations, from left to right and then top-down, are
.912 for alpha estimates, .884 for beta estimates, .972 for gamma estimates, and .990
for delta estimates. For the lognormal distribution, the correlations are .997 for mu
estimates and .975 for sigma estimates. The results show that the estimates of the
two types of pauses are nearly identical for the location and scale parameters as
indicated by correlational strength.
One potential explanation for such high correlations on all parameters is the fact
that the majority of the IKIs are, in fact, IWDs in our sample. Another potential
explanation may be related to randomness of pause locations in the writing process.
That is, students may pause randomly at a position, either within a word, between two
words, or between sentences. That said, as cognitive research would suggest, longer
pauses were observed more often between words and between sentences; therefore,
the tails of the IKI densities are somewhat heavier than those of IWD on average (that is, the alpha of IKI is smaller than that of IWD by .003 on average, as shown under the X-axis in Figure 3).
Because of the close similarity between IKI and IWD,2 later sections of this article
focus primarily on analyzing the IWD data. As indicated, we estimated separate pa-
rameters for stable distributions and lognormal distributions for each student’s IWD

Table 2
Summaries of Estimated Parameters of Each Distribution, Human Essay Scores, and Time
Spent Writing the Essay

Argumentation Policy Recommendation


BA SN CG SL CF GG
Prompts Mean(std) Mean(std) Mean(std) Mean(std) Mean(std) Mean(std)

Stable
alpha 1.03(.20) 1.04(.21) 1.02(.21) 1.00(.19) .99(.17) 1.02(.18)
beta .92(.12) .91(.12) .92(.11) .93(.13) .94(.10) .94(.10)
gamma .10(.05) .09(.05) .09(.05) .11(.06) .12(.07) .12(.06)
delta .24(.07) .23(.07) .23(.06) .25(.08) .26(.08) .26(.08)
Lognormal
meanL .47(.21) .44(.18) .44(.19) .54(.23) .56(.25) .55(.23)
medianL .32(.11) .31(.10) .31(.09) .35(.13) .36(.13) .36(.12)
modeL .16(.05) .16(.05) .16(.04) .16(.06) .16(.06) .16(.06)
ssL .50(.42) .44(.30) .47(.47) .65(.45) .70(.53) .66(.46)
Essay
RS1 2.80(.99) 2.66(.82) 2.82(.86) 2.47(.94) 2.54(.86) 2.51(.82)
RS2 2.85(1.11) 3.11(.96) 2.79(.97) 3.00(1.00) 2.85(.97) 2.95(.89)
TT 737(471) 739(475) 743(457) 997(496) 1,021(524) 984(534)

Copyright by Educational Testing Service, 2018 All rights reserved.

by prompt. This estimation gave us the opportunity to contrast parameters for the
same students on essays written to the same writing purpose (for the three prompts
written to the policy recommendation purpose and the three prompts written to the
argumentation purpose) but not across writing purposes.
In the following tables and figures, alpha, beta, gamma, and delta are estimated
parameters for the stable distributions, and MeanL, medianL, modeL, and ssL are
the estimated mean, median, mode, and standard deviation of the raw data obtained
from the lognormal distributions. TT is the total time spent on the essay task. RS1 and
RS2 are the human scores assigned to each essay based on the two scoring rubrics,
writing fundamentals and higher level skills, respectively.

Comparison Between Stable and Lognormal Distributions


Table 2 gives summary statistics (mean and standard deviation) for the estimated
parameters of each distribution, human essay scores, and time spent writing the es-
say, from which we observe that estimates are quite similar across prompts. From
fitting the stable distribution, on average, estimated alphas ranged from 1.00 to 1.04,
indicating that IWDs were heavy-tailed but likely to have a first moment; betas3 ranged from .91 to .94, indicating data were highly skewed to the right; gammas ranged from
.09 to .12, indicating highly peaked distributions; and deltas ranged from .23 to .26,
indicating the peaks were close to zero. Estimated gammas and deltas for argumenta-
tion prompts were slightly smaller on average than those for policy recommendation
prompts, which might indicate slightly better keyboarding skills for students taking

Table 3
Mean/Standard Deviation of Estimation Error and the Percentages of IWD Sequences With a
Smaller Estimation Error Using the Stable Distribution Than Using the Lognormal Distribu-
tion

Prompts BA SN CG SL CF GG

ErrorS .11(.07) .12(.11) .13(.19) .11(.15) .10(.08) .10(.10)


ErrorL .11(.04) .11(.05) .12(.05) .11(.06) .11(.08) .11(.05)
Percentage 59% 61% 61% 67% 59% 62%

Copyright by Educational Testing Service, 2018 All rights reserved.

the argumentation forms. Note that from Table 1, students writing the argumentation
prompts were seventh to ninth graders, while those writing the policy recommenda-
tion prompts were sixth to eighth graders.
We observed similar results from the lognormal fitting. On average, means of
IWDs ranged from .44 to .56, medians were smaller, ranging from .31 to .36, and
modes were even smaller at about .16, indicating very skewed distributions of IWDs
with modes close to zero. The estimated standard deviation of the lognormal distri-
bution was around .5 on average, a size similar to the mean, indicating a relatively
large spread of data variation. The mean, median, and standard deviation for the ar-
gumentation prompts were slightly smaller than those for the policy recommendation
prompts, which again might indicate slightly better keyboarding skills for argumen-
tation students.
From Table 2, we also observed that, on average, human scores on RS1 (funda-
mental skills) ranged from 2.47 to 2.82, with scores on RS2 (higher order skills) rang-
ing from 2.79 to 3.11. RS1 scores for argumentation prompts were slightly higher
than the RS1 scores for the policy recommendation prompts, but the RS2 argumen-
tation prompt scores were similar to the RS2 scores on the policy recommendation
prompts on average. Total writing time ranged from 737 to 1,021 seconds across prompts, with wide variation (standard deviations of around 500 seconds). Finally, students who wrote policy recommenda-
tion prompts used more time than those who responded to the argumentation forms.
Table 3 shows estimation errors4 across the six studied writing prompts, SL, BA,
CG, CF, GG, and SN. Overall, estimation error of lognormal distribution (ErrorL)
was similar to that of stable distribution (ErrorS), but with slightly smaller variation.
Percentage-wise, stable distributions had smaller estimation errors than the lognor-
mal distributions for more than 59% of the IWD sequences; this result is expected
since the stable distribution is more flexible (having more parameters) than the log-
normal distribution.

Association Between Estimated Parameters and Human Scores


To investigate whether the estimated parameters are theoretically meaningful in
the sense of being associated with essay quality, we calculated the correlation be-
tween parameter estimates and human scores for each essay. We also included con-
sideration of total time spent on the essay to see how that variable might affect the

Table 4
Correlation Coefficients Between Estimated Parameters and Human RS1 Scores for Each
Prompt

Stable Lognormal
alpha beta gam delta meanL medL modeL ssL TT

SL −.03* −.13 −.32 −.22 −.31 −.26 −.08* −.31 .41


BA −.12 −.02* −.24 −.25 −.19 −.24 −.22 −.10 .48
CG −.25 −.00* −.36 −.32 −.20 −.27 −.25 −.12 .50
CF −.15 −.11 −.34 −.24 −.28 −.25 −.11 −.27 .44
GG −.11 −.09 −.35 −.32 −.33 −.32 −.18 −.29 .52
SN −.21 .05* −.29 −.32 −.20 −.25 −.26 −.11 .47

Note. The correlation coefficients were not significantly different from zero for values with * (i.e., p-value
≥ 0.05). Copyright by Educational Testing Service, 2018 All rights reserved.

Table 5
Correlation Coefficients Between Estimated Parameters and Human RS2 Scores for Each
Prompt

Stable Lognormal
alpha beta gam delta meanL medL modeL ssL TT

SL −.09* −.07* −.22 −.17 −.18 −.17 −.10 −.15 .39


BA −.15 .00* −.24 −.21 −.16 −.17 −.13 −.12 .50
CG −.23 .04* −.26 −.23 −.13 −.17 −.17 −.08 .42
CF −.15 −.13 −.33 −.23 −.27 −.25 −.12 −.25 .32
GG −.06 −.05* −.29 −.26 −.28 −.27 −.14 −.24 .37
SN −.17 .00* −.23 −.25 −.15 −.20 −.22 −.08 .49

Note. The correlation coefficients were not significantly different from zero for values with * (i.e., p-value
≥ 0.05). Copyright by Educational Testing Service, 2018 All rights reserved.

relationship between IWD and human scores. Total time may reflect motivation as
well as how much a student wrote.
Tables 4 and 5 display the Pearson correlation coefficients between each of the
estimated parameters and human scores on RS1 and RS2, in turn.
Notice that the correlations in the tables are almost always negative, most notice-
ably for the location and scale parameters, a finding generally consistent with cog-
nitive theory in writing. Lower location parameters, such as smaller delta, meanL,
medL, or modeL are indicative of faster typing speed. Lower scale parameters, such
as smaller gamma or ssL, denote less variation in pauses between keystrokes within
words. Students who can execute these basic processes automatically—getting words
onto the page quickly and fluently—can devote more cognitive resources to the
higher level processes concerned with what to say and how to say it, thereby more
readily earning higher essay scores.
In addition, the total writing time (TT) has the strongest correlation with hu-
man scores, and these correlations are positive, in good part because it takes time

Table 6
Partial Correlation Coefficients Between Estimated Parameters and Human RS1 Scores on
Each Prompt Accounting for Total Time Spent on the Essay

Stable Lognormal
alpha beta gam delta meanL medL modeL ssL

SL .08 −.18 −.39 −.29 −.46 −.38 −.07 −.46


BA .05 −.04 −.25 −.30 −.32 −.36 −.21 −.21
CG −.07 −.06 −.38 −.36 −.37 −.41 −.20 −.27
CF −.02 −.14 −.42 −.31 −.47 −.37 −.07 −.49
GG .03 −.15 −.42 −.37 −.51 −.43 −.12 −.50
SN −.05 .02 −.32 −.36 −.40 −.39 −.23 −.33

Copyright by Educational Testing Service, 2018 All rights reserved.

to generate satisfactory text. To account for TT, the partial correlation coefficients
are presented for RS1 only in Table 6, where the partial correlation is computed following Prokhorov (2001) as

$$\rho_{XY.Z} = \frac{\rho_{XY} - \rho_{XZ}\,\rho_{ZY}}{\sqrt{1-\rho_{XZ}^2}\,\sqrt{1-\rho_{ZY}^2}}, \qquad (5)$$

where X and Y are the estimated parameters of the distributions and Z is the total
time variable. Table 6 shows that the correlations between RS1 and gamma, delta,
meanL, medL, and ssL appear to be stronger than were the zero-order relationships
that did not include time, suggesting that some IWD parameters from both distribu-
tions moderately relate to essay quality even after accounting for total time.
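Equation (5) translates directly into a small helper function, sketched below.

```r
# Partial correlation between x and y controlling for z, as in equation (5).
partial_cor <- function(x, y, z) {
  r_xy <- cor(x, y)
  r_xz <- cor(x, z)
  r_zy <- cor(z, y)
  (r_xy - r_xz * r_zy) / (sqrt(1 - r_xz^2) * sqrt(1 - r_zy^2))
}

# Example: association between the estimated delta and RS1 after accounting for
# total writing time TT (df is the simulated per-essay data frame sketched earlier).
# partial_cor(df$delta, df$RS1, df$TT)
```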
Given the above result, it makes sense to ask what the four parameters of each
distribution might jointly add to the association with human score, after accounting
for total time. In addition to indicating the general extent of the association over total
time, a clear difference between the two distributions in association might suggest
a preference for using one or the other distribution to describe student processes.
We compared the regression analysis with TT alone to that with TT plus the param-
eter estimates for each distribution separately, using the RS1 human scores as the
dependent variable.
Because the regressions contain different numbers of independent variables, adjusted R-squared values are reported instead of R-squared values. Adjusted R-squared values
with TT alone are shown in the first data column of Table 7, and those for TT plus
estimates are reported in the second and third data columns. Comparisons between
the model in the first column and the one in the second (or the third) column are all
statistically significant, with the p-value being infinitesimal. The results suggest that
adding characteristics of keystroke logs significantly improved the association with
essay quality. The results also suggest similar levels of prediction for the lognormal
distribution parameters and the stable distribution ones.

Table 7
Adjusted R-Square Values From the Simple Linear Regression of RS1 on TT Alone and on TT
Plus Parameter Estimates From the Stable or Lognormal Distributions Separately

TT with Estimates
TT Alone Stable Lognormal

SL 16.54% 34.65% 35.90%


BA 23.34% 33.13% 35.37%
CG 24.90% 39.19% 41.41%
CF 19.57% 36.99% 39.37%
GG 26.53% 43.74% 46.76%
SN 22.20% 33.47% 36.10%

Copyright by Educational Testing Service, 2018 All rights reserved.

Association Between Estimates and Human Scores by Grade


Because there may be between-grade differences in both keyboarding fluency and essay quality from a developmental perspective, a correlation between the two variables could appear in the total sample even if no correlation were present within grade groups.
To address this concern, we computed the grade-level correlations by aggregating
IWD parameter estimates of the three prompts for the same writing purpose (recall
the similarities from Table 2). Table 8 shows the correlations by grade between RS1
and each of the IWD parameter estimates for each of the two writing purposes. Cor-
relations between RS1 and all location parameters (delta, meanL, and medianL) in
Table 8 are significantly different from zero within grades, suggesting that the re-
lationships observed in the combined sample hold. The scale parameters (gamma
and ssL) are also strongly correlated with RS1, except for the seventh grade for the
argumentation prompts with a p-value of .0496.
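A minimal sketch of this within-grade check is given below; the data frame is simulated for illustration, and the variable names (grade, delta, RS1) are placeholders rather than the study's data set.

```r
set.seed(3)

# Simulated per-essay data: grade, RS1 score, and the estimated delta (location)
# parameter of the IWD distribution for each essay.
n        <- 300
df_grade <- data.frame(grade = sample(6:8, n, replace = TRUE),
                       delta = rnorm(n, 0.25, 0.07))
df_grade$RS1 <- 3 - 2 * df_grade$delta + 0.1 * df_grade$grade + rnorm(n, 0, 0.6)

# Within each grade, test the correlation between delta and RS1 (cf. Table 8).
by(df_grade, df_grade$grade, function(g) cor.test(g$delta, g$RS1))
```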

Consistency of Estimates Across Prompts


In Table 9, correlation coefficients between parameters for the three policy recom-
mendation prompts (Culture Fair, Generous Gift, and Service Learning) are reported.
The upper half of the table reports the correlation between parameters estimated for
the stable distributions. For example, the correlation coefficient is .83 between gam-
mas estimated from the CF prompt and from the GG prompt; it is .16 between betas
estimated from the CF prompt and from the GG prompt. The lower half of the table
reports the correlation between parameters estimated for the lognormal distributions.
As can be seen, the correlations for gamma and delta parameters from the stable dis-
tribution are in the high .70s to low .80s across policy recommendation prompts, sug-
gesting that these parameters reflect persistent aspects of student writing processes
that may not be associated with a particular prompt. For the lognormal distribution,
all four parameters appear to be consistent or highly consistent across prompts, with
correlations running from the high .60s to middle .80s.

Table 8
Correlation With RS1 for Parameter Estimates by Grade

Argumentation Forms
Seventh Grade Eighth Grade Ninth Grade

Stable
alpha −.16 −.21 −.26
beta .03* −.02* .05*
gamma −.22 −.34 −.28
delta −.31 −.36 −.19
Lognormal
meanL −.20 −.21 −.12
medianL −.27 −.30 −.12
modeL −.29 −.32 −.12
ssL −.10* −.10 −.10

Policy Recommendation Forms


Sixth Grade Seventh Grade Eighth Grade

Stable
alpha .06* −.12 −.17
beta −.19 −.12 .03*
gamma −.32 −.37 −.35
delta −.25 −.29 −.27
Lognormal
meanL −.32 −.33 −.29
medianL −.28 −.31 −.26
modeL −.07* −.16 −.05*
ssL −.31 −.31 −.28

Note. The correlation coefficients were not significantly different from zero for values with * (i.e., p-value
≥ .05). Copyright by Educational Testing Service, 2018 All rights reserved.

In Table 10, correlation coefficients between parameters for the three argumenta-
tion prompts (Cash for Grades, Ban Ads, and Social Networking) are reported, with
a pattern of results that is generally similar to that found for the policy recommenda-
tion prompts.

Discussion
This article examined approaches to modeling basic timing features extracted from
keystroke logs in student essay writing. That modeling is potentially important be-
cause such logs can provide a window into the cognitive processes used in compo-
sition, processes that may have implications for improving writing skill. Of partic-
ular interest to this study is that the pause events captured by such logs can have
heavy-tailed distributions. Two probability models (lognormal and stable distribu-
tions) were fit to each of two log features, the IKI and its subtype, the IWD, for

Table 9
Correlation Coefficients Between Parameters Across Three Policy Recommendation Prompts
(CF, GG, and SL)

Stable Distribution
alpha beta gamma delta
CF GG CF GG CF GG CF GG

GG .38 .16 .83 .84


SL .38 .34 .29 .16 .80 .80 .78 .79

Lognormal Distribution
mean median mode ss
CF GG CF GG CF GG CF GG

GG .86 .87 .71 .72


SL .85 .82 .84 .84 .67 .69 .76 .67

Copyright by Educational Testing Service, 2018 All rights reserved.

Table 10
Correlation Coefficients Between Parameters Across Three Argumentation Prompts (CG, BA, and SN)

Stable Distribution
alpha beta gamma delta
CG BA CG BA CG BA CG BA

BA .49 .08 .75 .88


SN .39 .36 .10 .05 .73 .75 .86 .86

Lognormal Distribution
mean median mode ss
CG BA CG BA CG BA CG BA

BA .82 .90 .85 .65


SN .83 .72 .90 .87 .79 .81 .64 .50

Copyright by Educational Testing Service, 2018 All rights reserved.

each student. Essay responses to each of six writing prompts were analyzed, where
each student took two prompts targeted at one of two writing purposes.
Because the IKI and IWD proved to be very highly correlated, we focused on the
IWD. From the analysis results, we found that overall, lognormal and stable distri-
butions produced similar average estimation errors, while the variation of estimation
errors was smaller for the lognormal fitting. Stable distributions produced smaller

estimation errors in around 62% of cases compared to the lognormal fitting.
With respect to theoretical meaning, the cognitive literature on writing suggests
that processes related to keyboarding skills and composition fluency are among the
lower level factors that contribute to essay quality, particularly writing fundamen-
tals (Chenu, Pellegrino, Jisa, & Fayol, 2014; Fayol, Foulin, Maggio, & Lete, 2012;
Medimorec & Risko, 2017). Because IWD reflects the speed with which letters
within words are written, it should have a negative relationship with essay score.
In our results, of all estimated parameters, the gamma (or scale) parameter from the
stable distribution generally showed the strongest association to the human scores on
both the first and second rubrics (RS1 and RS2), which measured writing fundamen-
tals and higher order skills, respectively. The delta (or location) parameter produced
somewhat weaker relations to the human scores; the alpha parameters showed even
weaker, but still noticeable relations; and the beta parameters had negligible asso-
ciation. Overall, parameters estimated from the lognormal distributions had slightly
lower correlations to the human scores than did the gamma and delta estimates from
the stable distribution, but they were more homogeneous across the four parame-
ters. The correlations of estimated parameters with the RS2 higher order skills rubric
were generally lower than those with the RS1 writing fundamentals rubric, a result
consistent with prior research on IWDs.
Because the total time spent writing an essay is also related to essay score, we
evaluated the relationship of the IWD parameters to essay score after accounting for
total time. When total time was accounted for, the partial correlation between the sta-
ble distribution’s estimated gamma and delta parameters and the human RS1 scores
increased in strength over the corresponding zero-order correlations. In contrast, the
correlations for the estimated alpha parameters, which had the highest relationships
with total time, decreased to the point of negligible association with human scores.
The correlation of the estimated parameters for the lognormal distributions with hu-
man RS1 scores became stronger when the total time was accounted for. In addition,
the correlations were often slightly greater and generally more homogeneous for the
lognormal parameters than for the stable parameters.
We also examined the extent to which the IWD parameter estimates jointly added
to the association with human score, after accounting for total time. Separate linear
regressions confirmed that the four estimated parameters from each of the two dis-
tributions added value to the total writing time. However, the differences between
the two sets of distributional parameters were relatively small in this incremental
association with human essay scores.
Because our analysis combined students across three grade levels for each writing
purpose, we checked whether the within-grade correlations between the IWD param-
eter estimates and human score were noticeably different from the combined-grade
estimates. This analysis showed that the association between estimated parameters
and human scores was still significant within grade levels.
Certainly, key to the meaning of IWD parameter estimates is the extent to which
they are consistent across prompts within students. For the argumentation prompts,
we observed good levels of consistency for gamma and delta in the stable distri-
bution, and for the mean and median in the lognormal distribution. For the stable

distribution, the between-prompt correlations ranged from .73 to .83 for gamma
and from .79 to .88 for delta. For the lognormal distribution, the comparable values
ranged from .72 to .86 for mean and from .84 to .90 for median. In contrast, the total
time spent on the essay task, despite having the strongest correlation with human
scores for individual prompts, was not as stable across prompts, producing a value
of only around .5 or below (detailed tables are available upon request). Compared
to previously reported results on the cross-prompt correlations of human scores on
RS1 writing fundamentals (which ranged from .3 to .6; Deane & Zhang, 2015), the
extracted parameters from the stable or lognormal distributions were much more ro-
bust. In particular, delta appeared to be slightly more robust than gamma in the stable
estimation, and the median was slightly more robust than the mean in the lognormal
estimation. Even so, multiple indices are recommended to be extracted from data
when either distribution is used in characterizing basic student writing processes.
Finally, with respect to the relationship among parameters, for the stable distribu-
tion, we observed an association between gamma and delta (about .7) and between
gamma and alpha (about .4), which may indicate that for the studied students, the
estimated scale and location parameters are positively correlated. (Tables are not re-
ported here but available upon request.) The positive correlation may indicate that,
generally, those students who are more proficient with keyboarding have less varia-
tion in typing speed and tend to have fewer long pauses during the writing process.
In view of the negative correlation between the human scores and the parameters of
the pause distributions, the general trend is that students with lower scale parameters (less variation, such as smaller gamma or ssL) and lower location parameters (faster typing speed, such as smaller delta, meanL, or medL) are likely to receive higher human scores.
This result aligns with findings reported in such other studies as Zhang, Hao, Li, and
Deane (2016), where higher performing students showed more steady writing tempo
and were more fluent in their composition processes than lower performing students.
In summary, timing data in writing are heavy-tailed, and distributions with heavy
tails should be used in modeling the data. Both heavy-tailed density functions we
studied fitted the data reasonably well. Both sets of estimates significantly improved
the association with essay quality, after accounting for writing time. The lognormal
distribution was slightly better in this regard. Finally, the estimated parameters for
the two density functions were quite consistent across prompts for the same writing
purpose, more so than was total time. These results imply that IWD represents a
process, or mixture of processes, that is both reasonably persistent and implicated in
human judgments of writing proficiency.
One limitation of this study was its assumption of independence among keystrokes
for individual students. Further analysis is necessary to investigate and model the de-
pendence structure of the data. Ignoring dependence among observations may lead to
inaccurate estimation. In particular, for the stable distributions, autocorrelation function (ACF) analysis and its interpretation in many statistical software packages
may be problematic since ACF was originally built for capturing dependencies in
Gaussian time series (Cont, 2001).
It is also important to note that the structure of the study most strongly supports
generalization within closely similar writing tasks since each student wrote two es-
says in response to parallel prompts that focused on the same writing purpose. We

do not yet have sufficient information about how similar stable or lognormal distri-
bution estimates might be for writing samples collected from the same students for
very different writing purposes.
Finally, all of our results are drawn from a single type of writing situation: on
demand composition accomplished during a single writing session. It is not clear
whether our results will generalize to writing distributed across multiple sessions or
done under very different social and cognitive constraints than those observed here.
Much of the existing literature attempts to subclassify the production of text bursts
based on the kinds of writing behavior observed in them; this behavior is not homogeneous
but includes distinctive subpatterns, such as backspacing over segments of text to
delete them, jumping around to make small edits at multiple points, and various
other behaviors. In the current study, pauses were modeled by a single probability
distribution, either the lognormal or the stable distribution. A logical follow-on to
the approach examined here would be to induce a theory-driven subclassification of
keystroke events to examine whether the parameters associated with different event
types (such as insertions vs. deletions, in-place writing vs. jumps for editing) dif-
fer significantly, which may be modeled by mixtures of long-tailed distributions. It
would also be interesting to investigate if different subclassifications of keystroke
events have different associations with external variables such as human scores.
Theory also suggests that the longest pauses are most likely to be associated with
specific types of events, particularly if those pauses manifest themselves in loca-
tions where such events are known to occur from prior psycholinguistic studies.
As noted, the stable distribution is theoretically grounded in a model that assumes
that pauses reflect a mixture of discrete processes (which explains why the longest
pauses asymptotically follow the same distribution as the sum of pauses, cf. Cooke
& Nieboer, 2011; Hill, 1975). It might therefore be worthwhile to seek out mod-
els that explicitly infer which processes are likely to be responsible for individual
pauses. Note that such an approach requires some other source of evidence about
the processes being engaged at specific points in order to train a predictive model,
along with appropriate linguistic features to characterize the contexts in which partic-
ular processes are more likely to be engaged. Some such evidence may be available
through think-aloud protocols, as well as through natural language processing anal-
ysis that helps identify contexts (such as clause boundaries) where certain processes,
such as sentence-level planning processes, are more likely to occur (Leijten & van
Waes, 2013).
Note that the identification of informative, predictive, and consistent process fea-
tures is a first step toward being able to characterize a student’s writing in a manner
that might have action implications for instruction. In that light, one potential impli-
cation of this study is that students who display similar parameters when we estimate
the density functions for IWD sequences might also display similar overall writing
process patterns. For instance, students who obtain low human scores on their essays
might cluster into several groups, each group reflecting different underlying issues,
such as inefficient keyboarding, lack of editing and revision, or lack of motivation. In
future studies, we will seek to examine whether students who are clustered together
with respect to their IWD distribution parameters are more homogeneous with re-

spect to other measures of their writing in ways that might have direct value for
teaching and learning.

Acknowledgment
We would like to thank Shelby Haberman and Yimin Xiao for their input. We are
also thankful to Chong Min Lee and Chen Li for preparing the data sets. Comments
from Rebecca Zwick, J. R. Lockwood, and Kathleen Sheehan are greatly appreciated
as well. Any opinions expressed in this publication are those of the authors and not
necessarily of Educational Testing Service.

Notes
1
The allowed essay writing time was 35 minutes. However, the average writing
time was about 10 minutes, and most of the students finished the task within 20
minutes.
2
We also examined interparagraph interval and found that the occurrence of such
pauses was extremely sparse, making further analysis infeasible.
3. In fitting the stable distribution, because the IWDs were all positive, we tried fixing beta to 1 and restricting alpha to (0, 1). However, the fit was worse. Relaxing all four parameters of the stable distribution produced the best model fit in the studied data and flexibly reflected different writing processes in terms of the frequencies and durations of intraword pauses (see the illustrative sketch following these notes).
4. Error is the weighted difference between estimated distributions and the empirical frequencies, with weights equal to the empirical frequencies, as defined in (4).
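As a rough sketch of the fitting described in Note 3, written against the StableEstim package cited in the references (Kharrat & Boshnakov, 2015), the example below obtains quantile-based starting values and then a full four-parameter maximum-likelihood fit; the exact function interface should be checked against the package documentation, and the vector iwd is a hypothetical input.

```r
library(StableEstim)  # R package cited in the references (Kharrat & Boshnakov, 2015)

# iwd: hypothetical numeric vector of positive intraword durations for one student

# Quantile-based starting values for (alpha, beta, gamma, delta), following McCulloch (1986)
theta0 <- McCullochParametersEstim(iwd)

# Full four-parameter maximum-likelihood fit, the specification retained in the study.
# The constrained alternative described in Note 3 (beta fixed at 1, alpha in (0, 1)) is not
# a built-in option and would require a custom likelihood; it fit the data worse.
fit_free <- Estim(EstimMethod = "ML", data = iwd, theta0 = theta0)
fit_free
```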

References
Almond, R., Deane, P., Quinlan, T., Wagner, M., & Sydorenko, T. (2012). A preliminary analy-
sis of keystroke log data from a timed writing task (Research Report No. RR-12-23). Prince-
ton, NJ: Educational Testing Service.
Alves, R. A., Castro, S. L., & de Sousa, L. (2007). Influence of typing skill on pause execution
cycles in written composition. In M. Torrance, L. van Waes, & D. Galbraith (Eds.), Writ-
ing and cognition: Research and applications (pp. 55–65). Amsterdam, The Netherlands:
Elsevier.
Baaijen, V. M., Galbraith, D., & de Glopper, K. (2012). Keystroke analysis: Reflections on
procedures and measures. Written Communication, 29, 246–277.
Bennett, R. E., Deane, P., & van Rijn, P. W. (2016). From cognitive-domain theory to assess-
ment practice. Educational Psychologist, 51, 82–107.
Bennett, R. E., Zwick, R., & van Rijn, P. W. (2017). Innovation in K-12 assessment: A re-
view of CBAL research. In H. Jiao & R. W. Lissitz (Eds.), Technology enhanced innova-
tive assessment: Development, modeling, and scoring from an interdisciplinary perspective
(pp. 197–247). Charlotte, NC: Information Age.
Breetvelt, I., van den Bergh, H., & Rijlaarsdam, G. (1994). Relations between writing pro-
cesses and text quality: When and how? Cognition and Instruction, 12, 103–123.
Chenu, F., Pellegrino, F., Jisa, H., & Fayol, M. (2014). Interword and intraword pause thresh-
old in writing. Frontiers in Psychology, 5, 182. https://doi.org/10.3389/fpsyg.2014.00182
Chukharev-Hudilainen, E. (2014). Pauses in spontaneous written communication: A keystroke
logging study. Journal of Writing Research, 6, 61–84.

Cont, R. (2001). Empirical properties of asset returns: Stylized facts and statistical issues.
Quantitative Finance, 1, 223–236.
Cooke, R., & Nieboer, D. (2011). Heavy-tailed distributions: Data, diagnostics, and new de-
velopments (RFF DP 11-19). Washington, DC: Resources for the Future. Retrieved from
http://www.rff.org/documents/RFF-DP-11–19.pdf
Deane, P. (2014). Using writing process and product features to assess writing quality and
explore how those features relate to other literacy tasks (Research Report No. RR-14-03).
Princeton, NJ: Educational Testing Service.
Deane, P., & Song, Y. (2014). A case study in principled assessment design: Designing assessments to measure and support the development of argumentative reading and writing skills. Psicologia Educativa, 20, 99–108.
Deane, P., & Zhang, M. (2015). Exploring the feasibility of using writing process features to
assess text production skills (Research Report No. RR-15-26). Princeton, NJ: Educational
Testing Service.
Delignette-Muller, M., & Dutang, C. (2015). fitdistrplus: An R package for fitting distribu-
tions. Journal of Statistical Software, 64, 1–34.
Fayol, M., Foulin, J.-N., Maggio, S., & Lete, B. (2012). Towards a dynamic approach of how
children and adults manage text production. In E. Grigorenko, E. Mambrino, & D. Preiss
(Eds.), Writing: A mosaic of new perspectives (pp. 141–158). New York, NY: Psychology
Press.
Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution.
Annals of Statistics, 3, 1163–1174.
Ihlen, E. (2013). The influence of power law distributions on long-range trial dependency of
response times. Journal of Mathematical Psychology, 57, 215–224.
Kellogg, R. T. (2001). Competition for working memory among writing processes. American
Journal of Psychology, 114, 175–191.
Kharrat, T., & Boshnakov, G. (2015). StableEstim: Estimate the 4 parameters of stable law
using different methods. R package version 2.0. Retrieved from http://CRAN.R-project.org/
package=StableEstim
Leijten, M., & van Waes, L. (2013). Keystroke logging in writing research: Using Inputlog to analyze and visualize writing processes. Written Communication, 30, 358–392.
Logan, G. D. (1992). Shapes of reaction-time distributions and shapes of learning curves: A
test of the instance theory of automaticity. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 18, 883–914.
Ma, T., Holden, J., & Serota, R. (2016). Distribution of human response times. Complexity,
21, 61–69.
McCulloch, J. (1986). Simple consistent estimators of stable distribution parameters. Commu-
nications in Statistics–Simulation and Computation, 15, 1109–1136.
McCulloch, J. H. (1996). Financial applications of stable distributions. In G. S. Maddala & C.
R. Rao (Eds.), Statistical methods in finance (pp. 393–425). Amsterdam, The Netherlands:
Elsevier.
McCutchen, D. (2000). Knowledge, processing, and working memory: Implications for a the-
ory of writing. Educational Psychologist, 35, 13–23.
Medimorec, S., & Risko, E. F. (2017). Pauses in written composition: On the importance
of where writers pause. Reading and Writing: An Interdisciplinary Journal, 30, 1267–
1285.
Moscoso del Prado Martin, F. (2009). Scale-invariance of human latencies. In N. A. Taatgen
& H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science
Society (pp. 1270–1275). Austin, TX: Cognitive Science Society.
Nolan, J. (2016). Stable distributions: Models for heavy tailed data. Retrieved from
http://fs2.american.edu/jpnolan/www/stable/chap1.pdf
Prokhorov, A. V. (2001). Partial correlation coefficient. In M. Hazewinkel (Ed.), Encyclopedia of mathematics. Dordrecht, The Netherlands: Springer.
R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Aus-
tria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Ulrich, R., & Miller, J. (1993). Information processing models generating lognormally dis-
tributed reaction times. Journal of Mathematical Psychology, 37, 513–525.
van der Linden, W. (2006). A lognormal model for response times on test items. Journal of
Educational and Behavioral Statistics, 31, 181–204.
van Rijn, P., Chen, J., & Yan-Koo, Y. (2016). Statistical results from the 2013 CBAL English
language arts multistate study: Parallel forms for policy recommendation writing (RM-16-
01). Princeton, NJ: Educational Testing Service.
van Rijn, P., & Yan-Koo, Y. (2016). Statistical results from the 2013 CBAL English language
arts multistate study: Parallel forms for argumentative writing (RM-16-15). Princeton, NJ:
Educational Testing Service.
Xu, X., & Ding, Y. (2014). An exploratory study of pauses in computer-assisted EFL writing.
Language Learning and Technology, 18, 80–96.
Zhang, M., Hao, J., Li, C., & Deane, P. (2016). Classification of writing patterns using
keystroke logs. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & M. Wiberg
(Eds.), Quantitative psychology research (pp. 299–314). New York, NY: Springer.
Authors
HONGWEN GUO is Senior Psychometrician at Educational Testing Service, 660 Rosedale
Road, Princeton, NJ 08541; hguo@ets.org. Her primary research interests include psycho-
metric and statistical methods.
PAUL D. DEANE is Principal Research Scientist at Educational Testing Service, 660
Rosedale Road, Princeton, NJ 08541; pdeane@ets.org. His research interests include writ-
ing assessment, vocabulary assessment, automated essay scoring, and evidence-centered
design.
PETER W. VAN RIJN is Senior Research Scientist at ETS Global, Strawinskylaan 929,
1077XX Amsterdam, The Netherlands; pvanrijn@etsglobal.org. His research focuses on
educational measurement and psychometrics.
MO ZHANG is Research Scientist at Educational Testing Service, 660 Rosedale Road, Prince-
ton, NJ; mzhang@ets.org. Her research interests lie in the methodologies of measurement
and validation for automated and human scoring of constructed responses.
RANDY E. BENNETT is Norman O. Frederiksen Chair in Assessment Innovation, Educa-
tional Testing Service, 660 Rosedale Road, Princeton, NJ 08541; rbennett@ets.org. His
primary research interests include integrating advances in cognitive and learning science,
technology, and measurement to create new approaches to assessment that have positive
impact on teaching and learning.
