Professional Documents
Culture Documents
David James Woo , Hengky Susanto , Chi Ho Yeung , Kai Guo , and April Ka Yeng Fung
a, * b b c d
a
Precious Blood Secondary School, Hong Kong
b
Department of Science and Environmental Studies, The Education University of Hong Kong,
Hong Kong
c
Faculty of Education, The University of Hong Kong, Hong Kong
d
Hoi Ping Chamber of Commerce Secondary School, Hong Kong
*
Corresponding author
• Email address: net_david@pbss.hk
• Postal address: Precious Blood Secondary School, 338 San Ha Street, Chai Wan, Hong
Kong
Declarations of interest
None.
Acknowledgement
The work by C.H.Y. is supported by the Research Grants Council of the Hong Kong Special
Administrative Region, China (Projects No. EdUHK GRF 18301119).
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 2
Abstract
English as a foreign language (EFL) students’ use of text generated from artificial intelligence
(AI) natural language generation (NLG) tools may improve their writing quality. However, it
remains unclear to what extent AI-generated text in these students’ writing might lead to higher-
quality writing. We explored 23 Hong Kong secondary school students’ attempts to write stories
comprising their own words and AI-generated text. Human experts scored the stories for
dimensions of content, language, and organization. We analyzed the basic organization and
structure and syntactic complexity of the stories’ AI-generated text and performed multiple linear
regression and cluster analyses. The results show the number of human words and the number of
AI-generated words contribute significantly to scores. Besides, students can be grouped into
competent and less competent writers who use more AI-generated text or less AI-generated text
compared to their peers. Comparisons of the clusters reveal some benefit of AI-generated text in
improving the quality of both high-scoring students’ and low-scoring students’ writing. The
findings can inform pedagogical strategies to use AI-generated text for EFL students’ writing
and to address digital divides. This study contributes designs of NLG tools and writing activities
In the past decade, designing artificial intelligence (AI) education has become a strategic
priority for educators around the world (Sing et al., 2022). Generally, the term ‘AI education’
can be understood in two ways: learning with AI and learning about AI. The former involves
applying AI-enabled technologies in teaching and learning practice (Wang & Cheng, 2021) such
that students develop interaction strategies with AI to solve authentic problems and tasks (Kim et
al., 2022). The latter refers to the teaching and learning of AI as a new subject, for instance, to
develop an understanding of AI, AI futures, and procedural use of AI (Kim et al., 2022), and to
learn AI development skills such as coding (Wang & Cheng, 2021). In language education, AI-
powered technologies have been integrated into classrooms to enable students’ learning with AI
and foster their development of different language skills, such as speaking (Lin & Mubarok,
2021; Yang et al., 2022) and listening (Dizon, 2020). Particularly, writing researchers and
educators have designed and developed a variety of AI-enabled tools such as intelligent writing
assistants (Godwin-Jones, 2022), educational chatbots (Guo et al., 2022), and automated writing
and launched in November 2022, has brought mainstream attention to AI’s capability to predict
and to generate text given an input (Bender et al., 2021). AI-generated text can be coherent,
lengthy, indistinguishable from human writing (Brown et al., 2020), and highly opinionated
(Jakesch et al., 2023). Not least for this reason, universities (Yau & Chan, 2023) and other
institutions (Stack Exchange Inc., 2022) have been banning compositions comprising AI-
generated text. In more mundane forms, AI’s natural language generation (NLG) capability (Gatt
& Krahmer, 2018) has been integrated into the everyday functionality of email clients, Internet
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 4
search, and word processors (Godwin-Jones, 2022). Importantly, AI-NLG tools have become an
emergent resource for the writing classroom. Recent studies have provided some evidence that
students’ collaboration with AI-NLG tools can improve their writing skills. For example, Dizon
and Gayed (2021) found that Japanese university students who wrote with Grammarly, a
predictive text and intelligent writing assistant, produced more lexical variation and fewer
grammatical errors. Liu et al. (2021) indicated that a reflective thinking, AI-supported English
writing approach improved university students’ writing quality. Kangasharju et al. (2022)
designed a Poem Machine that drafted poems as models for students, and suggested their tool
inspired and supported lower secondary students when they engaged in writing poems. To
improve students’ argumentative essay writing, Guo et al. (2022) proposed a chatbot-assisted
approach that might better facilitate the scaffolding of students’ idea production and argument
construction. Yet, it should be noted that although AI’s NLG capability has been integrated into
these instructional tools, these technologies often have very narrow or predetermined
conversational paths (Kuhail et al., 2023), and appear not to be developed from the latest
computer network architecture (Vaswani et al., 2017). Far from ChatGPT’s capability to
dynamically predict text, dated technical designs of educational AI-NLG tools constrain AI
capabilities and ultimately what can be learned about and with AI in education.
The integration of AI-NLG tools into writers’ writing processes has enabled collaborative
writing between humans and AI. During such collaboration, some issues may require attention.
For example, Jakesch et al. (2023) provided evidence that co-writing with generative AI can shift
people’s attitudes and opinions on a single topic, even if unintended. In the educational setting,
writing teachers should pay careful attention to such issues when they try to apply AI-NLG tools
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 5
into their teaching practice. Wang et al. (2023) identified different approaches by which students
interacted with AI agents for language learning, and suggested different gains of student learning
based on interaction approach, indicating the need for students to learn effective interactions with
AI. Hence, writing teachers need to develop pedagogical strategies to foster students’ successful
collaboration with AI agents in generating written texts. Moreover, using AI-NLG tools requires
teachers to identify suitable tasks for student-AI collaborative writing and appropriate evaluation
methods for assessing compositions using human words and AI-generated text (Godwin-Jones,
2022). To reach these goals, an important first step would be to understand students’ interactions
Based on a cognitive process theory of writing (Flower & Hayes, 1981), we have viewed
a student’s writing as iterative decision-making, specifically, for the generation and structuring
difference between skilled writers and novice writers is that the former can make a wider variety
and more specific decisions than the latter. The practical implications are that writing must not
only be taught but also cultivated per students’ needs, not imposed from a teacher’s hands
students’ writing (Prior, 2006). Within this paradigm, writing is a collaborative effort, situated in
classrooms, teachers mediate students’ writing, although there is evidence showing students do
not receive sufficient writing instruction and feedback from teachers in a classroom context
(Butterfuss et al., 2022). Furthermore, students increasingly consult online sources and repurpose
information from those sources for their writing (Tan, 2023). Ultimately, students need strategies
to navigate the elements of their socio-technical environment so as to improve the quality of their
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 6
writing (Crossley et al., 2016). With AI-generated text becoming one of the elements in the
environment, students need to develop strategies to effectively utilize the text. Moreover, since
different students may perceive AI’s affordances in different ways, either as an opportunity or a
constraint (Jeon, 2022), it will be necessary to understand learner differences in using AI-
generated text for language learning. However, so far few studies have been conducted to
examine how students cope with AI-generated text to produce compositions, how their
interactions with the text reflect learner differences, and what types of student-AI interactions
This Study
To better understand how students write with AI that dynamically generates text, this
study explores the integration of text generated from state-of-the-art (SOTA) AI-NLG tools to
the educational context of English as a foreign language (EFL) students’ written compositions
(i.e., narrative stories). This article reports on a study exploring language features of AI-
generated text in students’ compositions and their association with human-rated scores of the
compositions. A multiple linear regression analysis and a cluster analysis were employed to
provide insights into how distinct groups of students have interacted with AI-generated text to
complete compositions. Specifically, the study was guided by the following three research
questions (RQs):
RQ1: What are the language features of AI-generated text in students’ compositions
RQ2: What patterns of interaction with AI-generated text can be identified in students’
RQ3: How do students’ interaction patterns with AI-generated text impact and benefit
Language features within AI-generated text were adopted in the study as an indicator of
students’ strategies for using AI-NLG tools for writing, since these features can reflect a student
writer’s knowledge and facilitate their growth (Crossley, 2020). Although language features have
been measured in compositions written with NLG tools that generate content (Gayed et al.,
2022), to the best of our knowledge no attempt has been made to measure the language features
of AI-generated text in students’ compositions and to investigate how the use of AI-generated,
English language text may improve the quality of EFL students’ writing, if at all. By measuring
the language features of AI-generated text in students’ compositions, and experts’ ratings of
text to complete written compositions. By identifying any benefit from using AI-generated text
on expert ratings of writing quality, we hope to inform teaching and learning practice in the EFL
classroom, enabling teachers to deliver effective instructional strategies and helping students to
Methodology
This study was conducted in two Hong Kong secondary schools from December 2022 to
January 2023. One school, Ho Man Tin (HMT) School (pseudonym) receives primary school
students at the 88-99th percentile of academic achievement in its geographic district. The other
school, Chai Wan (CW) School (pseudonym) receives primary school students at the 44-55th
percentile of academic achievement in its geographic district. In both schools, a creative writing
contest was prosecuted and required student participants to create a SOTA AI-NLG tool and then
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 8
to use that tool to co-author a short story. To support the contest, students attended two
In the first workshop, students learned to code their SOTA AI-NLG tool. They used
Python programming language, Gradio software development kit, and HuggingFace, a repository
for machine-learning (ML) models and applications. A language model forms the basis of any
AI-NLG tool and those available to students on Hugging Face vary widely in size, from GPT2’s
117 million parameters (Radford et al., 2019) to GPT-J’s six billion parameters (Wang &
Komatsuzaki, 2021). However, they are generally smaller than the largest, proprietary language
models such as ChatGPT, which can exceed 100 billion parameters (Brown et al., 2020).
Importantly, the Hugging Face language models are open-source, free-to-use and do not incur
the larger, proprietary language models’ huge environmental and financial costs to develop and
maintain (Bender et al., 2021). Thus, smaller language models can be rapidly implemented in
schools.
Students designed AI-NLG tools according to their preferences and coding skills.
Initially, students were taught to design an AI-NLG tool by using one language model, with one
textbox for human text input and one text box of AI-generated text output. The AI-generated text
generally includes proper English language capitalization, punctuation and spacing, and its
length can vary from words to a paragraph (see Figure 1). Subsequently, students were taught to
design an AI-NLG tool by using more than one language model so that the tool could comprise
Figure 1
An AI-NLG tool comprising one language model and one textbox of output
Figure 2
An AI-NLG tool comprising three language models and three textboxes of output
In the second workshop, students learned to interact with their AI-NLG tool to write a
story. Specifically, they were taught different prompting methods and digital writing skills for
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 10
selecting AI-generated text, and editing that text into a story. Students used their AI-NLG tools
for 45 minutes. They could use the tools freely and repeatedly at that time. Students did not need
to complete their short stories for contest submission during that time.
Data Collection
Students wrote stories on Google Docs and shared their docs with the first author. Any
story that exceeded the contest’s 500-word limit was not included in this study. Any story that
appeared incomplete but fell within the 500-word limit was included in this study. Thus, this
study collected 23 stories, 16 from HMT School students and seven from CW School students.
Since the stories comprise a student’s own words and AI-generated text, to facilitate our
analysis of students’ interaction patterns with AI-generated text, a student highlighted the
student’s own text in red and AI-generated text in black (see Figure 3).
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 11
Figure 3
A story written on Google Docs with a student’s words and AI-generated text
Scoring of Stories
To prepare each story for human scoring, we removed a student’s name and indicators of
a student’s text and AI-generated text. For human scoring, we assembled EFL subject matter
experts: Aberdeen School’s Native English Teacher (NET); Bowen School’s English panel head;
and a postgraduate student researching EFL writing. To establish reliability in human scoring,
the NET prepared a scoring rubric (see Appendix A) adapted from a standardized rubric that is
used to assess the quality of English language writing in Hong Kong secondary schools and is
familiar to the human scorers. The scorers were instructed to score stories for content (C) (e.g.,
creativity, task completion, etc.), language (L) (e.g., grammar, punctuation, spelling, etc.) and
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 12
organization (O) (e.g., idea development and cohesiveness). The full mark for each criterion was
five and a score was awarded in increments of one. Besides providing annotated examples of
stories, the NET briefed the other two scorers and wrote notes for the awarding of specific marks
(see Appendix A). These include awarding a score of one for C if a story was incomplete, and a
story’s L and O scores not exceeding the C score plus/minus one. Two experts independently
marked the stories and we calculated the proportion of agreement (DeCuir-Gunby et al., 2010).
The proportion of agreement for C was 73%, L 87% and O 70%. For all but two of the 23
stories, the scorers produced either the exact same total CLO mark, that is, the sum of C, L and O
scores, or showed a total mark difference of only one mark (see Appendix B). To reconcile any
possible differences in scoring, we averaged the scores awarded by the two experts.
The scoring rubric includes an AI Words criterion with descriptors. These descriptors
were developed from our literature review and a pilot study of four students’ stories written
using their own words and AI-generated text. They were developed from measures of basic
structure and organization and syntactic complexity that appeared associated with higher CLO
scores. The descriptors are organized according to scaffolded performance levels with the fine-
grain use of AI-generated text set at the lowest performance level and the broadest use of AI-
generated text set at the highest performance level. In other words, a story must evidence the
descriptor at the lowest performance level before being assessed at a higher performance level.
The NET scored de-anonymized versions of students’ stories for the AI Words dimension.
Data Analysis
operationalized the measures for the basic structure and organization of a composition and the
syntactic complexity of AI-generated text as set out in the scoring rubric. Therefore, we sought
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 13
to count the number of words in a story, and the number of AI-generated words and the number
of student words. Furthermore, we sought to count the number of AI-generated text instances or
chunks, in a story. An AI chunk is defined as AI-generated text embedded within students’ own
text instances or chunks. Next, we counted three syntactic forms of AI-generated text.
Specifically, we defined short, medium and long AI chunks to be AI-generated text shorter than
five words in length, longer than or equal to five words in length or sentence length, and longer
mark would be categorized as a short AI chunk. In our analysis, we did not find any instances of
We analyzed language features in stories using the functions available on Google Docs.
We prepared on Excel the descriptive statistics for the basic organization and structure and
For insights into what patterns of interactions with AI-generated text might be more
effective for human ratings of writing, we used statistical methods to compare scores from
human raters with basic organization and structure and syntactic complexity statistics.
In phase 2, we analyze the data using Multiple Linear Regression (MLR; Aiken et al.,
2005), a statistical technique that describes the relationship between multiple variables, while
taking the effect of each variable into account. For example, the algorithm explores the statistical
relationship between students’ score, amount of AI generated words used in their writing, and the
actual length of their writing, while exposing how these variables may affect the quality of the
Lastly, in phase 3 we focus on analyzing patterns and effects of how AI’s assistance may
improve students’ writing quality. Here, we perform cluster analysis using unsupervised machine
Means Clustering (Macqueen, 1967), and Mean-Shift Clustering (Fukunaga et al., 1975).
In this section, we first present and discuss the descriptive statistics of language features
and scores. Second, we present and discuss the results of the multiple linear regression analysis
As shown in Table 1, 21 out of the 23 stories contain AI chunks, and 19 of them contain
long AI chunks but only 13 of them contain short AI chunks. If one considers that embedding
shorter AI chunks into the stories is a better synthesis between the AI-generated and human-
written text, this result may imply that it is challenging to achieve this synthesis as less students
achieved this goal. In addition, each story with short, medium and long AI chunks contains an
average of 3.77, 3.16 and 2.21 of them respectively, and each such story contains an average of
6.67 AI chunks. All these measures show a large standard deviation, implying that there is a
Moreover, as shown in Table 1, the number of AI-generated words, human written words
and the total number of words in the stories are 81.57, 248.57 and 323.04 respectively, and the
standard deviation in all these measures is large. While some variation in the number of AI
words may come from the variation in the length of the stories, this may still imply that students
vary a lot in their intention to incorporate AI-generated text. To better understand this variation,
we show the distribution of the percentage of AI-generated words in Figure 4. As we can see in
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 15
Table 1, although the average percentage of the usage of AI words is roughly 28%, and a
majority of the students used less than 20% of AI words in their stories, there are some students
who embedded a large percentage of AI words up to almost 90%. This again shows that students
Table 1
Figure 4
Other than students’ usage of AI-generated text, we also examine the scores of their
stories. Table 2 shows the average content (C), language (L), organization (O) scores, as well as
the average CLO scores, AI words scores, and the grand total scores which is the sum of the
CLO and AI words scores. As shown in Table 2, students scored roughly equally in the C, L, O
as well as the AI words categories. Figure 5 shows the distribution of the grand total score. The
scores spread widely across the whole range from 0 to the full mark of 20. These results implied
that the markers consider that there is a large variation in students’ performance.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 17
Table 2
Figure 5
To identify students’ patterns of interaction with AI-generated text and AI’s contributions
to the human-rated scores they obtained, we examined the following multiple linear regression
(MLR) model (Aiken et al., 2005) given by Equation (1) to identify the potential correlation of
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 18
various AI-generated chunks and words with different scoring items including the C, L, O, CLO,
AI words and the grand total scores. We remark that instead of such multiple linear regression
model, computing merely the Pearson correlation between the scores and the variables does not
accurately represent the casualty relation between them, as they may be simultaneously affected
by other variables and are not the cause and effect to each other as suggested by a high Pearson
correlation coefficient. The multiple linear regression model can eliminate such confounding
number of short AI chunks, 𝑥" = number of medium AI chunks, 𝑥# = number of long AI chunks,
𝑥$ = number of human chunks, 𝑥% = number of AI words, and 𝑥& = number of human words.
Although a linear relationship may not fully capture the casualty relationship between
scores and variables, but as shown in the last column of Table 3, the model is highly statistically
significant in explaining various scores, and later on give us interesting interpretations in how the
scores of stories depends on the incorporation of AI-generated text. To avoid the phenomenon
called co-linearity which may mask the correlation of related factors with the scores, we have not
included the percentage of AI words in the model as it is related to the number of AI words and
By the method of least squares (Dekking et al., 2005), one can obtain the best fitted
parameters 𝑚' ’s in the model, but the values of 𝑚' may not fully represent the correlation of the
variable 𝑥' with scores since 𝑥' ’s are not normalized. We thus compute the so-called partial
correlations (Brown & Hendrix, 2005), which are correlations between the score and the
variables after removing the co-dependence on all other variables in Equation 1. For instance, the
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 19
partial correlation 𝐶+ (𝑠𝑐𝑜𝑟𝑒, 𝑥! ) between the score and 𝑥! is related to the fitted parameter 𝑚!
Where 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙(𝑥! )(! )*((" ,…,(#) and 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙(𝑠𝑐𝑜𝑟𝑒)/0123)*((" ,…,(#) correspond to the
residuals of the linear regression models in fitting x_1 and the score respectively and separately
using only the factors 𝑥" , … , 𝑥& (Dekking et al., 2005), and hence the co-dependence on factors
Table 3 shows the results of partial correlation between various scoring items and the
variables 𝑥! , … , 𝑥& . Except AI word scores, all scores are strongly positively correlated with the
number of human words, with a value of 𝐶+ > 0.9, and positively correlated with the number of
AI words, with a value of 𝐶+ ≈ 0.5, and these partial correlations are all statistically significant.
As the contest limited all stories to 500 words or less, these results imply that more competent
students will use more words, regardless of their own words or AI-generated text, to complete
their stories and score higher. We also note that although the scores are more strongly correlated
and statistically significant with the number of human words, the number of AI words also plays
its role in contributing to all C, L, O scores significantly. Per existing research that a competent
writer has capacity to edit and to improve so that a completed text might be very different from a
first draft (Flower & Hayes, 1981), this may imply that competent writers not only write more
words, but they can synthesize AI words or inspiration into their own writing which contributes
to C, L and O criteria.
One interesting observation is that AI word scores were significantly positively and
negatively correlated with the number of human chunks and the number of medium AI chunks,
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 20
respectively. There are three performance levels to achieve for the AI Words criterion: (1) AI
words in at least one chunk and at least one short chunk; (2) at least eight chunks of AI words;
(3) no more than 33% AI words. For competent students who are keen to succeed in the contest,
they may decide to embed human chunks in longer AI chunks and avoid using medium AI
chunks to increase the number of AI chunks. Hence their positive and the negative partial
correlations with AI words score. These results may also imply that students were well aware of
the competition rules, and developed strategies for the efficient use of AI words in their stories.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 21
Table 3
a p-value less than 0.05 and 0.01 respectively. The last column shows the p-value of the
corresponding linear regression model in explaining the various scores by the 6 factors.
We used cluster analysis to group students according to the language features of AI-
generated text in their stories and the stories’ human-rated scores. Our cluster analysis algorithms
considered the following language features and scores: the number of AI chunks (short, medium,
and long), the percentage of AI-generated words used in a story, the total number of words in a
story, the C score, the L score, the O score, the total CLO score, and finally the AI Words score.
Algorithm 1: EM-Algorithm
estimate the likelihood of data points belonging to a particular cluster. The algorithm is
configured to cluster the dataset into eight groups of students and the output is described in
Figure 6.
Figure 6
First, we observe students in clusters Zero, Two, and Seven received a full or near full
CLO score. Additionally, AI-generated words made up only 9% to 19% of their total word count,
which is low compared to other clusters. The stories written by students in these three clusters
also tend to be longer, nearer the 500-word limit, compared to other clusters. This indicates that
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 23
the most competent writers demonstrate less inclination to seek assistance from AI, which
Although students in these three clusters appeared to be competent writers that relied less
on AI-generated text, we observed a difference in these clusters’ AI Words scores. On the one
hand, students in cluster Zero performed quite well in the AI Words criterion. On the other hand,
all students in clusters Two and Seven received no score for the AI Words criterion. This may
indicate these students’ ignorance or misunderstanding about the scoring rubric descriptors; or
these students intentionally did not use AI-generated text according to the scoring rubric
descriptors.
Among clusters of high users of AI-generated text, we observed two different results.
Students in clusters Four and Five wrote stories with AI-generated words comprising 50% to
89% of the total number of words. Compared to other students, they scored relatively high on
their CLO (80th percentile), specifically, 4.5 for C, 4.5 for O and 4.25 for L. On the other hand,
cluster Three produced stories with AI-generated words comprising 55% to 78% of the total
number of words but the students’ CLO score was low; specifically, they scored 1 point for C
and O criteria, and 2 points for the L criteria. The CLO scores of cluster Three indicate that those
students did not submit complete stories, for which reason they could not exceed 1 point for the
C criteria but could achieve an additional point for L. We interpret cluster Three as students who
did not appear to be competent writers, had decided to seek more assistance from AI-generated
text, but had not developed effective strategies to leverage AI-generated text in their writing. In
this way, our study provides evidence to question claims that compositions comprising
exclusively AI-generated text would be scored positively for cohesiveness and grammatical
accuracy (Godwin-Jones, 2022), for which reason automatic writing evaluation systems would
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 24
have limited value. On the other hand, the results suggest that students in clusters Four and Five
were more competent writers, had decided to seek more assistance from AI-generated text, and
had developed effective strategies to incorporate more AI-generated text in their stories.
measuring the distance between each data to the center of each cluster (see Figure 7). Like with
revealed additional insights into students’ interaction patterns with AI-generated text – the
Figure 7
Students in cluster One wrote the longest stories of at least 449 words and AI-generated
words comprised only 19% of their stories’ words. However, less than half of the students in this
cluster achieved a full CLO score of 15. This indicates that the use of AI-generated text does not
Another observation is that students in cluster Zero scored low on CLO and on AI Words,
and wrote the least number of words in their stories, 157 words on average, compared to all
students. The low CLO scores indicate that students did not submit complete stories.
Additionally, they had not used AI-generated text to extend their writing. The implication is that
less competent writers may lack the decision-making ability and strategies by which they might
Different from EM-Algorithm and K-Means Clustering, Mean-Shift Clustering does not
dataset and assigns each data point to the clusters by shifting points towards the highest density
(or mean value) in each cluster (see Figure 8). This algorithm grouped the students into six
clusters.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 26
Figure 8
One observation was that the use rate of AI-words varied greatly, from 0 to 77.4%, in
cluster One, which featured students with the lowest CLO scores and the shortest stories. This
indicates that the length of a story is a more accurate indicator of a student’s writing competence
than the number of AI-words used in a story. Additionally, a students’ higher use of AI-
generated text might improve the student’s CLO score if the student were to write nothing at all
without AI-generated text. Therefore, this cluster provides an initial clue that the use of AI-
generated text does not universally compensate for students’ writing competence.
To summarize, we plot three-dimensional graphs of the results for each cluster analysis
algorithm, showing a horizontal view in Figure 9 and a vertical view in Figure 10. From these
three-dimensional graphs, visually the students can be categorized into four super clusters, 1)
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 27
students with high CLO scores, but fewer AI words; 2) students with high scores, but more AI
words; 3) students with low scores, but fewer AI words; and 4) students with low scores, but
more AI words.
Figure 9
Note. The x-axis is the normalized average CLO score between 0 and 100; the y-axis is the
percentage of AI words used in the writing; and the z-axis is the cluster ID.
Figure 10
Conclusion
Major Findings
The descriptive statistics show a large variation in students’ use of AI-generated text in
their stories. At the same time, human scoring indicates a large variation in the C, L and O scores
of students’ stories.
A multiple linear regression of AI-generated text language features and CLO scores show
the number of human words and the number of AI-generated words contribute significantly to
CLO scores. Significant correlations between different syntactic forms of AI-generated text and
AI Words scores highlight a group of competent and strategic students who use AI-generated
Clustering analyses of students show groups of competent writers that strategically use
specific syntactic forms of AI-generated text but less AI-generated text than their peers; and
groups that effectively use more AI-generated text. The analyses also reveal groups of less
competent writers that use little AI-generated text and may not have strategies for its effective
use; and groups that use more AI-generated text and that benefit from its use if they were to
In sum, our study has evidenced that students’ use of AI-generated text does not
universally benefit their writing. Moreover, students’ access to AI-generated text from SOTA-
NLG tools does not preclude students using AI-generated text much in their writing and students
Pedagogical Implications
Our study can inform the flexible design of AI-related curricula to address the various
needs of schools and students, especially for secondary education for which there has been little
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 29
research (Chiu, 2021). First, our study contributes a curricular effort to achieve both learning
about AI and learning with AI (Ali et al., 2021; Williams et al., 2022). Specifically, as writing
activities with NLG tools have been scarce (Lin & Chang, 2020), our study contributes design
specifications for an authentic writing task and SOTA NLG-tools that can be freely replicated
and adapted. Second, we contribute specific measures of basic structure and organization and
syntactic complexity of AI-generated text that can inform teaching and learning practice with AI-
generated text.
Furthermore, our findings can inform teachers’ instructional methods for different groups
of EFL students to complete writing tasks with AI-generated text from SOTA NLG-tools. More
competent writers are more able to discern and optimize the use of AI to complete a high-quality
composition. These writers can benefit from fine-grain instruction on different language features
of AI-generated text, such as syntactic forms, and ways to integrate such text so as to improve
writing quality. For less competent student writers, they will need additional instruction from
teachers, peers, and other elements in their writing classroom to improve their writing
competence, to decide how to use AI-generated text and then how to develop strategic and
effective use of AI-generated text for writing of higher quality. Appropriate instructional
methods for different groups of students to use AI-generated text are necessary to broach digital
divides between writers of different competence and between schools. Without differentiated
Finally, some universities and institutions have been banning their students’ use of AI-
generated text for classroom, coursework, and assessment tasks, which might be in part due to an
unclear picture of how AI-generated text can be used in an effective and ethically correct way.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 30
Our empirical evidence supports such a rationale for those policies and we find it essential to
Given the dearth of research on human-AI collaborative writing, for experts’ scoring of
AI-generated text, we adapted an existing scoring rubric, which is essential for assessing writing
measures of basic structure and organization and syntactic complexity to study AI-generated text
in compositions. We did not study AI-generated text for other constructs of language features
that predict human judgements of both first language and EFL writing proficiency: lexical
sophistication (i.e., the use of advanced words that indicate lexical knowledge, often measured
by number-of-word calculations); and text cohesion (i.e., the interconnectivity of text segments)
(Crossley, 2020); nor did we study AI-generated text in terms of discourse features. Future
studies may explore quality writing with AI-generated text by operationalizing measures for
these other constructs or for discourse features. With further study with additional measures, we
can better understand how competent writers have effectively integrated larger quantities of AI-
In our study, students composed short stories and in subsequent research can expand
students’ writing to argumentative and factual text types. Furthermore, our study’s SOTA AI-
NLG tools utilized open-source, free-to-use language models and could output at most a
paragraph of text. Subsequent research could fine-tune such models to improve their capacity to
generate appropriate language for the target text type and explore such models’ contributions to
student writing. Besides, study should expand to the largest, proprietary language language
models like ChatGPT and to AI-NLG tools that can generate several paragraphs of text.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 31
Although AI-NLG tools hold promise to improve students’ writing, SOTA AI-NLG tools
that predict text suffer limitations and their use raises risk in the writing classroom. One
limitation is such tools’ propensity to hallucinate, that is, to generate text that deviates from its
source input and fails to meet user expectations (Ji et al., 2022). Such text can comprise
offensive or biased material (Bender et al., 2021), or factually incorrect and nonsensical answers,
although plausible sounding (OpenAI, 2022). Another limitation is such tools’ propensity to
degenerate, that is, to generate bland and repetitive text (Holtzman et al., 2020). Therefore,
research will be necessary into prompt programming (Reynolds & McDonell, 2021) of SOTA
AI-NLG tools in the writing classroom, that is, the careful selection of input by which a student
interacts with and controls an AI-NLG tool to perform a desired task. Since rewriting a prompt
can significantly change a tool’s performance, future studies should investigate the conditions
under which a student would prompt a SOTA AI-NLG tool, what prompts students have used to
generate text, and whether students perceive any AI-generated text as satisfactory for integration
into a composition. Future research may explore students’ interactions with AI-NLG tools in the
writing process, using methods such as screen recordings, think-aloud protocols, interviews, and
stimulated recalls, to provide a complete understanding of how the tools contribute to students’
References
Aiken, L. S., West, S. G., & Pitts, S. C. (2003). Multiple linear regression. Handbook of
Ali, S., DiPaola, D., Lee, I., Hong, J., & Breazeal, C. (2021). Exploring Generative Models with
Middle School Students. Proceedings of the 2021 CHI Conference on Human Factors in
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of
Stochastic Parrots: Can Language Models Be Too Big?. Proceedings of the 2021 ACM
https://doi.org/10.1145/3442188.3445922
Brown, B. L., & Hendrix, S. B. (2005). Partial correlation coefficients. Encyclopedia of statistics
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan,
T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020).
https://doi.org/10.48550/arXiv.2005.14165
Butterfuss, R., Roscoe, R. D., Allen, L. K., McCarthy, K. S., & McNamara, D. S. (2022).
https://doi.org/10.1177/07356331211045304
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 33
Chen, Z., Chen, W., Jia, J., & Le, H. (2022). Exploring AWE-supported writing process: An
https://doi.org/10125/73482
https://doi.org/10.1007/s11528-021-00637-1
Chiu, T. K. F., Meng, H., Chai, C.-S., King, I., Wong, S., & Yam, Y. (2022). Creation and
Crossley, S. A., Muldner, K., & McNamara, D. S. (2016). Idea Generation in Student Writing:
2020.11.03.01
DeCuir-Gunby, J. T., Marshall, P. L., & McCulloch, A. W. (2010). Developing and Using a
https://doi.org/10.1177/1525822X10388468
Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., & Meester, L. E. (2005). A Modern Introduction
to Probability and Statistics: Understanding why and how (Vol. 488). London: Springer.
https://link.springer.com/book/10.1007/1-84628-168-7
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 34
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete
Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B
Dizon, G. (2020). Evaluating intelligent personal assistants for L2 listening and speaking
https://doi.org/10125/44705
Dizon, G., & Gayed, J. (2021). Examining the impact of Grammarly on the quality of mobile L2
https://doi.org/10.29140/jaltcall.v17n2.336
Flower, L., & Hayes, J. R. (1981). A Cognitive Process Theory of Writing. College Composition
Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with
40. https://doi.org/10.1109/TIT.1975.1055330
Gatt, A., & Krahmer, E. (2018). Survey of the State of the Art in Natural Language Generation:
Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61,
65–170. https://doi.org/10.1613/jair.5477
Godwin-Jones, R. (2022). Partnering with AI: Intelligent writing assistance and instructed
http://doi.org/10125/73474
Gong, X., Tang, Y., Liu, X., Jing, S., Cui, W., Liang, J., & Wang, F. Y. (2020, October). K-9
IEEE International Conference on Networking, Sensing and Control (ICNSC) (pp. 1-6).
IEEE. https://doi.org/10.1109/ICNSC48988.2020.9238087
Guo, K., Wang, J., & Chu, S. K. W. (2022). Using chatbots to scaffold EFL students’
https://doi.org/10.1016/j.asw.2022.100666
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text
Jakesch, M., Bhat, A., Buschek, D., Zalmanson, L., & Naaman, M. (2023). Co-Writing with
https://doi.org/10.1145/3544548.3581196
Jeon, J. (2022). Exploring AI chatbot affordances in the EFL classroom: Young learners’
https://doi.org/10.1080/09588221.2021.2021241
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Dai, W., Madotto, A., &
Kangasharju, A., Ilomäki, L., Lakkala, M., & Toom, A. (2022). Lower secondary students’
poetry writing with the AI-based Poetry Machine. Computers and Education: Artificial
Kim, H., Yang, H., Shin, D., & Lee, J. H. (2022). Design principles and architecture of a second
http://hdl.handle.net/10125/73463
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 36
Kim, J., Lee, H., & Cho, Y. H. (2022). Learning design to support student-AI collaboration:
Technologies. https://doi.org/10.1007/s10639-021-10831-6
Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational
1018. https://doi.org/10.1007/s10639-022-11177-3
Lin, M. P.-C., & Chang, D. (2020). Enhancing Post-secondary Writers’ Writing Skills with a
Lin, C. J., & Mubarok, H. (2021). Learning analytics for investigating the mind map-guided AI
Liu, C., Hou, J., Tu, Y. F., Wang, Y., & Hwang, G. J. (2021). Incorporating a reflective thinking
https://doi.org/10.1080/10494820.2021.2012812
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.
on-mathematical-statistics-and-probability/Proceedings-of-the-Fifth-Berkeley-
Symposium-on-Mathematical-Statistics-and/chapter/Some-methods-for-classification-
and-analysis-of-multivariate-observations/bsmsp/1200512992
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 37
OpenAI. (2022, November 30). ChatGPT: Optimizing Language Models for Dialogue. OpenAI.
https://openai.com/blog/chatgpt/
Fitzgerald (Eds.), The Handbook of Writing Research (pp. 54–66). Guilford Press.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models
https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-
Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models:
https://doi.org/10.48550/arXiv.2102.07350
Roscoe, R., Allen, L., & McNamara, D. (2013). Developing pedagogically-guided algorithms for
https://doi.org/10.1504/IJLT.2013.059131
Sing, C. C., Teo, T., Huang, F., Chiu, T. K. F., & Xing wei, W. (2022). Secondary school
students’ intentions to learn AI: Testing moderation effects of readiness, social good and
https://doi.org/10.1007/s11423-022-10111-1
Stack Exchange Inc. (2022, December 8). Temporary policy: ChatGPT is banned [Forum post].
Tan, X. (2023). Stories behind the scenes: L2 students’ cognitive processes of multimodal
composing and traditional writing. Journal of Second Language Writing, 59, 100958.
https://doi.org/10.1016/j.jslw.2022.100958
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 38
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-
Abstract.html
Wang, B., & Komatsuzaki, A. (2021, June 4). GPT-J-6B: 6B JAX-Based Transformer. Aran
Komatsuzaki. https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
Wang, T., & Cheng, E. C. K. (2021). An investigation of barriers to Hong Kong K-12 schools
Wang, X., Liu, Q., Pang, H., Tan, S. C., Lei, J., Wallace, M. P., & Li, L. (2023). What matters in
cluster analysis and epistemic network analysis. Computers & Education, 194, 104703.
https://doi.org/10.1016/j.compedu.2022.104703
Williams, R., Ali, S., Devasia, N., DiPaola, D., Hong, J., Kaputsos, S. P., Jordan, B., & Breazeal,
C. (2022). AI + Ethics Curricula for Middle School Youth: Lessons Learned from Three
https://doi.org/10.1007/s40593-022-00298-y
Xu, W., & Ouyang, F. (2022). A systematic review of AI role in the educational system based on
https://doi.org/10.1007/s10639-021-10774-y
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 39
Yang, H., Kim, H., Lee, J. H., & Shin, D. (2022). Implementation of an AI chatbot as an English
https://doi.org/10.1017/S0958344022000039
Yau, C., & Chan, K. (2023, February 17). University of Hong Kong temporarily bans students
kong/education/article/3210650/university-hong-kong-temporarily-bans-students-using-
chatgpt-other-ai-based-tools-coursework
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 40
Appendix A
The Assessment Rubric for the 1st Human-AI Creative Writing Contest for Hong Kong Secondary Schools
Score Content Language Organization AI Words
5 · Content fulfills the requirements of · Wide range of accurate sentence structures with a · Text is organized effectively, with · AI words
the question good grasp of simple and complex sentences logical development of ideas compose less
· Almost totally relevant · Grammar mainly accurate with occasional common · Cohesion in most parts of the text is than ⅓ of the
· Most ideas are well errors that do not affect overall clarity clear total number of
developed/supported · Vocabulary is wide, with many examples of more · Strong cohesive ties throughout the words in the
· Creativity and imagination are shown sophisticated lexis text text
when appropriate · Spelling and punctuation are mostly correct · Overall structure is coherent,
· Shows general awareness of audience · Register, tone and style are appropriate to the genre sophisticated and appropriate to the
and text-type genre and text-type
3 · Content just satisfies the requirements · Simple sentences are generally accurately constructed. · Parts of the text have clearly · At least 8 AI
of the question · Occasional attempts are made to use more complex defined topics chunks of any
· Relevant ideas but may show some sentences. Structures used tend to be repetitive in nature · Cohesion in some parts of the text is length
gaps or redundant information · Grammatical errors sometimes affect meaning clear
· Some ideas but not well developed · Common vocabulary is generally appropriate · Some cohesive ties in some parts of
· Some evidence of creativity and · Most common words are spelt correctly, with basic the text
imagination punctuation being accurate · Overall structure is mostly coherent
· Shows occasional awareness of · There is some evidence of register, tone and style and appropriate to the genre and text-
audience appropriate to the genre and text-type type
1 · Content shows very limited attempts · Some short simple sentences accurately structured · Parts of the text reflect some · AI words
to fulfill the requirements of the · Grammatical errors frequently obscure meaning attempts to organize topics used in long
question · Very simple vocabulary of limited range often based · Some use of cohesive devices to chunks (more
· Intermittently relevant; ideas may be on the prompt(s) link ideas than 1 sentence
repetitive · A few words are spelt correctly with basic punctuation in length) and
· Some ideas but few are developed being occasionally accurate in short chunks
· Ideas may include misconception of (less than 5
the task or some inaccurate information words in
· Very limited awareness of audience length).
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 41
Note.
1. Content mark cannot exceed 1 if the story is not complete, that is, missing exposition; conflict; climax; and / or resolution.
2. Content mark cannot exceed 1 if the story is not a story, for example, an article or an essay
3. Creativity in content refers to the details, transformation and originality of ideas
4. Language and organization marks cannot exceed +/- 1 of the content mark.
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 42
Appendix B
Human-rated scores for CW students’ stories
Marker 1 Marker 2 Marker Average
Text Content Language Organization Sub-total C L O Sub-total C L O CLO AI Grand
Name (C) (L) (O) words total
1 1 2 1 4 1 2 1 4 1 2 1 4 0 4
2 1 1 1 3 1 1 2 4 1 1 1.5 3.5 1 5
3 1 2 1 4 1 2 1 4 1 2 1 4 1 5
4 1 2 1 4 1 2 2 5 1 2 1.5 4.5 1 6
5 1 1 1 3 1 1 1 3 1 1 1 3 0 3
6 1 2 1 4 1 2 1 4 1 2 1 4 3 7
7 5 4 4 13 4 4 4 12 4.5 4 4 12.5 3 16
EXPLORING AI-GENERATED TEXT IN STUDENT WRITING 43