
Exploring AI-Generated Text in Student Writing: How Does AI Help?

David James Woo a,*, Hengky Susanto b, Chi Ho Yeung b, Kai Guo c, and April Ka Yeng Fung d

a Precious Blood Secondary School, Hong Kong
b Department of Science and Environmental Studies, The Education University of Hong Kong, Hong Kong
c Faculty of Education, The University of Hong Kong, Hong Kong
d Hoi Ping Chamber of Commerce Secondary School, Hong Kong

* Corresponding author
• Email address: net_david@pbss.hk
• Postal address: Precious Blood Secondary School, 338 San Ha Street, Chai Wan, Hong Kong

Declarations of interest
None.

Acknowledgement
The work by C.H.Y. is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. EdUHK GRF 18301119).

Exploring AI-Generated Text in Student Writing: How Does AI Help?

Abstract

English as a foreign language (EFL) students’ use of text generated from artificial intelligence (AI) natural language generation (NLG) tools may improve their writing quality. However, it remains unclear to what extent AI-generated text in these students’ writing might lead to higher-quality writing. We explored 23 Hong Kong secondary school students’ attempts to write stories comprising their own words and AI-generated text. Human experts scored the stories for dimensions of content, language, and organization. We analyzed the basic organization and structure and the syntactic complexity of the stories’ AI-generated text and performed multiple linear regression and cluster analyses. The results show that the number of human words and the number of AI-generated words contribute significantly to scores. Moreover, students can be grouped into competent and less competent writers who use more or less AI-generated text than their peers. Comparisons of the clusters reveal some benefit of AI-generated text in improving the quality of both high-scoring students’ and low-scoring students’ writing. The findings can inform pedagogical strategies for using AI-generated text in EFL students’ writing and for addressing digital divides. This study contributes designs of NLG tools and writing activities for implementing AI-generated text in schools.

Keywords: artificial intelligence; natural language generation; creative writing; short stories; EFL learners

Language(s) Learned in This Study: English



Exploring AI-Generated Text in Student Writing: How Does AI Help?

In the past decade, designing artificial intelligence (AI) education has become a strategic

priority for educators around the world (Sing et al., 2022). Generally, the term ‘AI education’

can be understood in two ways: learning with AI and learning about AI. The former involves

applying AI-enabled technologies in teaching and learning practice (Wang & Cheng, 2021) such

that students develop interaction strategies with AI to solve authentic problems and tasks (Kim et

al., 2022). The latter refers to the teaching and learning of AI as a new subject, for instance, to

develop an understanding of AI, AI futures, and procedural use of AI (Kim et al., 2022), and to

learn AI development skills such as coding (Wang & Cheng, 2021). In language education, AI-

powered technologies have been integrated into classrooms to enable students’ learning with AI

and foster their development of different language skills, such as speaking (Lin & Mubarok,

2021; Yang et al., 2022) and listening (Dizon, 2020). Particularly, writing researchers and

educators have designed and developed a variety of AI-enabled tools such as intelligent writing

assistants (Godwin-Jones, 2022), educational chatbots (Guo et al., 2022), and automated writing

evaluation systems (Chen et al., 2022), to enhance students’ learning to write.

Recently, ChatGPT (https://openai.com/blog/chatgpt/), a chatbot developed by OpenAI

and launched in November 2022, has brought mainstream attention to AI’s capability to predict

and to generate text given an input (Bender et al., 2021). AI-generated text can be coherent,

lengthy, indistinguishable from human writing (Brown et al., 2020), and highly opinionated

(Jakesch et al., 2023). Not least for this reason, universities (Yau & Chan, 2023) and other

institutions (Stack Exchange Inc., 2022) have been banning compositions comprising AI-

generated text. In more mundane forms, AI’s natural language generation (NLG) capability (Gatt

& Krahmer, 2018) has been integrated into the everyday functionality of email clients, Internet

search, and word processors (Godwin-Jones, 2022). Importantly, AI-NLG tools have become an

emergent resource for the writing classroom. Recent studies have provided some evidence that

students’ collaboration with AI-NLG tools can improve their writing skills. For example, Dizon

and Gayed (2021) found that Japanese university students who wrote with Grammarly, a

predictive text and intelligent writing assistant, produced more lexical variation and fewer

grammatical errors. Liu et al. (2021) indicated that a reflective thinking, AI-supported English

writing approach improved university students’ writing quality. Kangasharju et al. (2022) designed a Poetry Machine that drafted poems as models for students, and suggested their tool inspired and supported lower secondary students when they engaged in writing poems. To

improve students’ argumentative essay writing, Guo et al. (2022) proposed a chatbot-assisted

approach that might better facilitate the scaffolding of students’ idea production and argument

construction. Yet, it should be noted that although AI’s NLG capability has been integrated into these instructional tools, these technologies often have very narrow or predetermined conversational paths (Kuhail et al., 2023), and appear not to be built on the latest neural network architectures (Vaswani et al., 2017). Far from matching ChatGPT’s capability to dynamically predict text, such dated technical designs of educational AI-NLG tools constrain AI capabilities and ultimately what can be learned about and with AI in education.

Human-AI Collaborative Writing Process

The integration of AI-NLG tools into writers’ writing processes has enabled collaborative

writing between humans and AI. During such collaboration, some issues may require attention.

For example, Jakesch et al. (2023) provided evidence that co-writing with generative AI can shift

people’s attitudes and opinions on a given topic, even if unintended. In educational settings, writing teachers should pay careful attention to such issues when they apply AI-NLG tools to their teaching practice. Wang et al. (2023) identified different approaches by which students

interacted with AI agents for language learning, and suggested different gains of student learning

based on interaction approach, indicating the need for students to learn effective interactions with

AI. Hence, writing teachers need to develop pedagogical strategies to foster students’ successful

collaboration with AI agents in generating written texts. Moreover, using AI-NLG tools requires

teachers to identify suitable tasks for student-AI collaborative writing and appropriate evaluation

methods for assessing compositions using human words and AI-generated text (Godwin-Jones,

2022). To reach these goals, an important first step would be to understand students’ interactions

with text generated from AI-NLG tools to complete written compositions.

Based on a cognitive process theory of writing (Flower & Hayes, 1981), we have viewed

a student’s writing as iterative decision-making, specifically, for the generation and structuring

of ideas, in response to a rhetorical problem such as a school assignment. In this model, a

difference between skilled writers and novice writers is that the former can make a wider variety

and more specific decisions than the latter. The practical implications are that writing must not

only be taught but also cultivated per students’ needs, not imposed from a teacher’s hands

(Vygotsky, 1978). Importantly, we adopt a sociocultural theory to understand the development of

students’ writing (Prior, 2006). Within this paradigm, writing is a collaborative effort, situated in

a socio-technical environment and mediated by elements in the environment. In traditional

classrooms, teachers mediate students’ writing, although there is evidence showing students do

not receive sufficient writing instruction and feedback from teachers in a classroom context

(Butterfuss et al., 2022). Furthermore, students increasingly consult online sources and repurpose

information from those sources for their writing (Tan, 2023). Ultimately, students need strategies

to navigate the elements of their socio-technical environment so as to improve the quality of their

writing (Crossley et al., 2016). With AI-generated text becoming one of the elements in the

environment, students need to develop strategies to effectively utilize the text. Moreover, since

different students may perceive AI’s affordances in different ways, either as an opportunity or a

constraint (Jeon, 2022), it will be necessary to understand learner differences in using AI-

generated text for language learning. However, so far few studies have been conducted to

examine how students cope with AI-generated text to produce compositions, how their

interactions with the text reflect learner differences, and what types of student-AI interactions

may lead to successful writing outcomes.

This Study

To better understand how students write with AI that dynamically generates text, this

study explores the integration of text generated from state-of-the-art (SOTA) AI-NLG tools to

the educational context of English as a foreign language (EFL) students’ written compositions

(i.e., narrative stories). This article reports on a study exploring language features of AI-

generated text in students’ compositions and their association with human-rated scores of the

compositions. A multiple linear regression analysis and a cluster analysis were employed to

provide insights into how distinct groups of students have interacted with AI-generated text to

complete compositions. Specifically, the study was guided by the following three research

questions (RQs):

RQ1: What are the language features of AI-generated text in students’ compositions

written with AI?

RQ2: What patterns of interaction with AI-generated text can be identified in students’

compositions and what are their differences?



RQ3: How do students’ interaction patterns with AI-generated text impact and benefit

their writing, if at all?

Language features within AI-generated text were adopted in the study as an indicator of

students’ strategies for using AI-NLG tools for writing, since these features can reflect a student

writer’s knowledge and facilitate their growth (Crossley, 2020). Although language features have

been measured in compositions written with NLG tools that generate content (Gayed et al.,

2022), to the best of our knowledge no attempt has been made to measure the language features

of AI-generated text in students’ compositions and to investigate how the use of AI-generated,

English language text may improve the quality of EFL students’ writing, if at all. By measuring

the language features of AI-generated text in students’ compositions, and experts’ ratings of

students’ compositions, we aim to identify students’ patterns of interaction with AI-generated

text to complete written compositions. By identifying any benefit from using AI-generated text

on expert ratings of writing quality, we hope to inform teaching and learning practice in the EFL

classroom, enabling teachers to deliver effective instructional strategies and helping students to

make better decisions about using AI-generated text.

Methodology

Research Participants and Context

This study was conducted in two Hong Kong secondary schools from December 2022 to

January 2023. One school, Ho Man Tin (HMT) School (pseudonym) receives primary school

students at the 88-99th percentile of academic achievement in its geographic district. The other

school, Chai Wan (CW) School (pseudonym) receives primary school students at the 44-55th

percentile of academic achievement in its geographic district. In both schools, a creative writing contest was held that required student participants to build a SOTA AI-NLG tool and then to use that tool to co-author a short story. To support the contest, students attended two workshops presented by the first author.

In the first workshop, students learned to code their SOTA AI-NLG tool. They used the Python programming language, the Gradio software development kit, and Hugging Face, a repository for machine-learning (ML) models and applications. A language model forms the basis of any AI-NLG tool, and those available to students on Hugging Face vary widely in size, from GPT-2’s 117 million parameters (Radford et al., 2019) to GPT-J’s six billion parameters (Wang & Komatsuzaki, 2021). However, they are generally smaller than the largest proprietary language models, such as those behind ChatGPT, which can exceed 100 billion parameters (Brown et al., 2020).

Importantly, the Hugging Face language models are open-source, free-to-use and do not incur

the larger, proprietary language models’ huge environmental and financial costs to develop and

maintain (Bender et al., 2021). Thus, smaller language models can be rapidly implemented in

schools.

Students designed AI-NLG tools according to their preferences and coding skills.

Initially, students were taught to design an AI-NLG tool by using one language model, with one

textbox for human text input and one text box of AI-generated text output. The AI-generated text

generally includes proper English language capitalization, punctuation and spacing, and its

length can vary from words to a paragraph (see Figure 1). Subsequently, students were taught to

design an AI-NLG tool by using more than one language model so that the tool could comprise

several text boxes of AI-generated text output (see Figure 2).



Figure 1

An AI-NLG tool comprising one language model and one textbox of output

Figure 2

An AI-NLG tool comprising three language models and three textboxes of output
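To make the tool design concrete, the following is a minimal sketch of a single-model tool of the kind shown in Figure 1, assuming the publicly available GPT-2 checkpoint on Hugging Face; the specific models, generation parameters, and interface layouts that students chose are assumptions here, not reported details.

```python
# A minimal sketch of a single-model AI-NLG tool (cf. Figure 1), assuming the
# public GPT-2 checkpoint on Hugging Face; students' actual model choices and
# interface layouts varied.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str) -> str:
    # Return one AI-generated continuation of the student's input text.
    result = generator(prompt, max_new_tokens=60, num_return_sequences=1)
    return result[0]["generated_text"]

# One textbox for human text input, one textbox of AI-generated text output.
demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch()
```

A multi-model variant like the one in Figure 2 would load several such pipelines and return one output textbox per model.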

In the second workshop, students learned to interact with their AI-NLG tool to write a

story. Specifically, they were taught different prompting methods and digital writing skills for

selecting AI-generated text, and editing that text into a story. Students then used their AI-NLG tools for 45 minutes. They could use the tools freely and repeatedly during that time, and they did not need to complete their short stories for contest submission within that session.

Data Collection

Students wrote stories on Google Docs and shared their docs with the first author. Any

story that exceeded the contest’s 500-word limit was not included in this study. Any story that

appeared incomplete but fell within the 500-word limit was included in this study. Thus, this

study collected 23 stories, 16 from HMT School students and seven from CW School students.

Since each story comprises a student’s own words and AI-generated text, to facilitate our analysis of students’ interaction patterns with AI-generated text, each student highlighted their own text in red and left AI-generated text in black (see Figure 3).

Figure 3

A story written on Google Docs with a student’s words and AI-generated text

Scoring of Stories

To prepare each story for human scoring, we removed a student’s name and indicators of

a student’s text and AI-generated text. For human scoring, we assembled EFL subject matter

experts: Aberdeen School’s Native English Teacher (NET); Bowen School’s English panel head;

and a postgraduate student researching EFL writing. To establish reliability in human scoring,

the NET prepared a scoring rubric (see Appendix A) adapted from a standardized rubric that is

used to assess the quality of English language writing in Hong Kong secondary schools and is

familiar to the human scorers. The scorers were instructed to score stories for content (C) (e.g.,

creativity, task completion, etc.), language (L) (e.g., grammar, punctuation, spelling, etc.) and

organization (O) (e.g., idea development and cohesiveness). The full mark for each criterion was

five and a score was awarded in increments of one. Besides providing annotated examples of

stories, the NET briefed the other two scorers and wrote notes for the awarding of specific marks

(see Appendix A). These include awarding a score of one for C if a story was incomplete, and requiring that a story’s L and O scores not deviate from its C score by more than one mark. Two experts independently marked the stories, and we calculated the proportion of agreement (DeCuir-Gunby et al., 2010): 73% for C, 87% for L, and 70% for O. For all but two of the 23

stories, the scorers produced either the exact same total CLO mark, that is, the sum of C, L and O

scores, or showed a total mark difference of only one mark (see Appendix B). To reconcile any

possible differences in scoring, we averaged the scores awarded by the two experts.
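For illustration, the agreement calculation and score averaging can be sketched as follows, using the two markers’ Content scores for the CW stories from Appendix B:

```python
# A sketch of the inter-rater statistics described above: the proportion of
# exact agreement per criterion, and averaging to reconcile differences.
scores_m1 = [1, 1, 1, 1, 1, 1, 5]  # Marker 1 Content scores (CW stories, Appendix B)
scores_m2 = [1, 1, 1, 1, 1, 1, 4]  # Marker 2 Content scores

agree = sum(a == b for a, b in zip(scores_m1, scores_m2)) / len(scores_m1)
print(f"Proportion of agreement: {agree:.0%}")

averaged = [(a + b) / 2 for a, b in zip(scores_m1, scores_m2)]
print(averaged)  # reconciled scores used in the analyses
```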

The scoring rubric includes an AI Words criterion with descriptors. These descriptors

were developed from our literature review and a pilot study of four students’ stories written using their own words and AI-generated text, drawing on measures of basic structure and organization and syntactic complexity that appeared associated with higher CLO

scores. The descriptors are organized into scaffolded performance levels, with fine-grained use of AI-generated text set at the lowest performance level and the broadest use of AI-generated text set at the highest performance level. In other words, a story must evidence the descriptor at the lowest performance level before being assessed at a higher performance level.

The NET scored de-anonymized versions of students’ stories for the AI Words dimension.

Data Analysis

To analyze language features of student’s stories written with NLG tools, we

operationalized the measures for the basic structure and organization of a composition and the

syntactic complexity of AI-generated text as set out in the scoring rubric. Therefore, we sought

to count the number of words in a story, and the number of AI-generated words and the number

of student words. Furthermore, we sought to count the number of AI-generated text instances, or chunks, in a story. An AI chunk is defined as AI-generated text embedded within instances, or chunks, of students’ own text. Next, we counted three syntactic forms of AI-generated text. Specifically, we defined short, medium, and long AI chunks as AI-generated text that is shorter than five words, at least five words long but no longer than one sentence, and longer than one sentence, respectively. For instance, an instance of an AI-generated punctuation mark would be categorized as a short AI chunk. In our analysis, we did not find any instances of an AI-generated sentence that was shorter than five words in length.
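As an illustration only, the categorization could be automated along these lines (a hypothetical helper; in the study, chunks were counted manually with Google Docs functions):

```python
# A sketch of the chunk categorization defined above (hypothetical helper;
# the study counted chunks with Google Docs functions, not code).
import re

def classify_ai_chunk(chunk: str) -> str:
    words = chunk.split()
    # Rough sentence count via terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
    if len(words) < 5:
        return "short"   # fewer than five words, e.g., a punctuation mark
    if len(sentences) <= 1:
        return "medium"  # at least five words, at most one sentence
    return "long"        # more than one sentence

print(classify_ai_chunk(","))                                   # short
print(classify_ai_chunk("the wind howled through the night."))  # medium
print(classify_ai_chunk("It rained. The river rose quickly."))  # long
```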

We analyzed language features in the stories using the functions available in Google Docs, and we prepared descriptive statistics for the basic organization and structure and syntactic complexity measures in Excel. In phase 1 of our findings, we examine these descriptive statistics alongside human-rated scores.

For insights into what patterns of interactions with AI-generated text might be more

effective for human ratings of writing, we used statistical methods to compare scores from

human raters with basic organization and structure and syntactic complexity statistics.

In phase 2, we analyze the data using multiple linear regression (MLR; Aiken et al., 2003), a statistical technique that describes the relationship between a response and multiple explanatory variables while taking the effect of each variable into account. For example, the model relates students’ scores to the amount of AI-generated words used in their writing and the actual length of their writing, while exposing how these variables may affect the quality of the students’ writing outcome.
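A minimal sketch of this regression setup follows, with a synthetic stand-in dataset, since the study’s data are not public, and hypothetical column names for the six predictors used in Equation 1:

```python
# A minimal sketch of the Phase 2 regression, using a synthetic stand-in
# dataset (the study's data are not public; column names are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 23  # one row per story
df = pd.DataFrame({
    "short_ai_chunks": rng.integers(0, 8, n),
    "medium_ai_chunks": rng.integers(0, 8, n),
    "long_ai_chunks": rng.integers(0, 6, n),
    "human_chunks": rng.integers(1, 16, n),
    "ai_words": rng.integers(0, 300, n),
    "human_words": rng.integers(50, 450, n),
})
# Toy response for illustration only; the real scores came from human raters.
df["clo_score"] = 0.02 * df["human_words"] + 0.01 * df["ai_words"] + rng.normal(0, 1, n)

X = sm.add_constant(df.drop(columns="clo_score"))  # adds the intercept term C
model = sm.OLS(df["clo_score"], X).fit()
print(model.summary())  # coefficients, p-values, and overall model p-value
```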



Lastly, in phase 3 we focus on analyzing patterns in, and effects of, how AI’s assistance may improve students’ writing quality. Here, we perform cluster analysis using three unsupervised machine learning techniques: the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), K-Means clustering (MacQueen, 1967), and Mean-Shift clustering (Fukunaga & Hostetler, 1975).

Results and Discussion

In this section, we first present and discuss the descriptive statistics of language features

and scores. Second, we present and discuss the results of the multiple linear regression analysis

and finally, the cluster analysis.

Phase 1: Descriptive Statistics of Language Features and Scores

As shown in Table 1, 21 out of the 23 stories contain AI chunks; 19 of them contain long AI chunks but only 13 of them contain short AI chunks. If one considers that embedding shorter AI chunks into the stories represents a closer synthesis between the AI-generated and human-written text, this result may imply that such synthesis is challenging, as fewer students achieved it. In addition, the stories with short, medium, and long AI chunks contain an average of 3.77, 3.16, and 2.21 such chunks respectively, and each story containing AI chunks has an average of 6.67 of them overall. All these measures show a large standard deviation, implying a large variation in how students embedded AI-generated words in their stories.

Moreover, as shown in Table 1, the average numbers of AI-generated words, human-written words, and total words in the stories are 81.57, 248.57, and 323.04 respectively, and the standard deviation of all these measures is large. While some variation in the number of AI words may come from variation in the length of the stories, this may still imply that students vary greatly in their inclination to incorporate AI-generated text. To better understand this variation, we show the distribution of the percentage of AI-generated words in Figure 4. As shown in Table 1, although the average percentage of AI words is roughly 28%, and a majority of the students used less than 20% AI words in their stories, some students embedded a large percentage of AI words, up to almost 90%. This again shows a large variation in how students embedded AI-generated text in their stories.

Table 1

The Utilization of AI-generated Words by Students in their Stories

| Category | No. of stories (out of 23) | Average count (std. dev.)* | Average length (std. dev.)* |
|---|---|---|---|
| Short AI chunks | 13 | 3.77 (2.62) | 2.15 (0.72) |
| Medium AI chunks | 14 | 3.16 (2.48) | 12.38 (6.58) |
| Long AI chunks | 19 | 2.21 (1.67) | 23.27 (16.52) |
| AI chunks | 21 | 6.67 (4.99) | 13.83 (12.71) |
| Human chunks | 23 | 6.7 (4.9) | |
| AI words | 21 | 81.57 (90.56) | |
| Human words | 23 | 248.57 (164.69) | |
| Words | 23 | 323.04 (159) | |
| Percentage of AI words | | 27.95% (27.91%) | |

Note. *Only stories containing the given category of chunks/words are included in the averages.



Figure 4

The Distribution of the Percentage of AI Words among the 23 Stories

Other than students’ usage of AI-generated text, we also examine the scores of their stories. Table 2 shows the average content (C), language (L), and organization (O) scores, as well as the average CLO score, the AI Words score, and the grand total score, which is the sum of the CLO and AI Words scores. As shown in Table 2, students scored roughly equally in the C, L, and O categories as well as in the AI Words category. Figure 5 shows the distribution of the grand total score. The scores spread widely across the whole range from 0 to the full mark of 20. These results imply that the markers perceived a large variation in students’ performance.

Table 2

The Scores in Different Scoring Items Averaged over the 23 Stories

| Scoring item | Full mark | Average mark | Standard deviation |
|---|---|---|---|
| Content (C) | 5 | 3.22 | 1.8 |
| Language (L) | 5 | 3.43 | 1.42 |
| Organization (O) | 5 | 3.07 | 1.6 |
| CLO subtotal | 15 | 9.72 | 4.74 |
| AI words score | 5 | 3.4 | 1.84 |
| Grand total | 20 | 11.2 | 5.94 |

Figure 5

The Distribution of the Grand Total Scores among the 23 Stories

Phase 2: Multiple Linear Regression Analysis

To identify students’ patterns of interaction with AI-generated text and AI’s contributions to the human-rated scores they obtained, we examined the multiple linear regression (MLR) model (Aiken et al., 2003) given by Equation 1, to identify the potential correlations of various AI-generated chunks and words with different scoring items, including the C, L, O, CLO, AI Words, and grand total scores. We remark that, instead of such a multiple linear regression model, merely computing the Pearson correlation between the scores and the variables would not accurately represent the causal relation between them, as they may be simultaneously affected by other variables and need not be cause and effect of each other even when a high Pearson correlation coefficient suggests so. The multiple linear regression model can eliminate such confounding impact and reveal the dependence of scores on individual factors:

$$\mathrm{Score} = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + m_5 x_5 + m_6 x_6 + C, \qquad \text{(Equation 1)}$$

where $x_1$ = number of short AI chunks, $x_2$ = number of medium AI chunks, $x_3$ = number of long AI chunks, $x_4$ = number of human chunks, $x_5$ = number of AI words, and $x_6$ = number of human words.

Although a linear relationship may not fully capture the causal relationship between scores and variables, as shown in the last column of Table 3, the model is highly statistically significant in explaining the various scores, and it later gives us interesting interpretations of how the scores of stories depend on the incorporation of AI-generated text. To avoid collinearity, which may mask the correlation of related factors with the scores, we did not include the percentage of AI words in the model, as it is derived from the number of AI words and human words in the stories.

By the method of least squares (Dekking et al., 2005), one can obtain the best-fitted parameters $m_i$ in the model, but the values of $m_i$ may not fully represent the correlation of the variable $x_i$ with scores, since the $x_i$ are not normalized. We thus compute the so-called partial correlations (Brown & Hendrix, 2005), which are correlations between the score and the variables after removing the co-dependence on all other variables in Equation 1. For instance, the partial correlation $C_p(\mathrm{score}, x_1)$ between the score and $x_1$ is related to the fitted parameter $m_1$ through the following relation:

$$C_p(\mathrm{score}, x_1) = m_1 \times \frac{\mathrm{residual}(x_1)_{x_1 \sim (x_2, \ldots, x_6)}}{\mathrm{residual}(\mathrm{score})_{\mathrm{score} \sim (x_2, \ldots, x_6)}},$$

where $\mathrm{residual}(x_1)_{x_1 \sim (x_2, \ldots, x_6)}$ and $\mathrm{residual}(\mathrm{score})_{\mathrm{score} \sim (x_2, \ldots, x_6)}$ correspond to the residuals of the linear regression models fitting $x_1$ and the score, respectively and separately, using only the factors $x_2, \ldots, x_6$ (Dekking et al., 2005); hence the co-dependence on the factors $x_2, \ldots, x_6$ is eliminated in the partial correlation $C_p(\mathrm{score}, x_1)$.
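In code, the residual construction above can be sketched as follows, reusing the synthetic DataFrame df from the Phase 2 sketch (the variable and column names remain hypothetical):

```python
# A sketch of the residual-based partial correlation defined above, reusing
# the synthetic DataFrame `df` from the Phase 2 regression sketch.
import numpy as np
import statsmodels.api as sm

def partial_correlation(df, target, var, controls):
    Z = sm.add_constant(df[controls])
    res_var = sm.OLS(df[var], Z).fit().resid     # residual of x1 given x2..x6
    res_tgt = sm.OLS(df[target], Z).fit().resid  # residual of score given x2..x6
    return np.corrcoef(res_var, res_tgt)[0, 1]

controls = ["medium_ai_chunks", "long_ai_chunks", "human_chunks",
            "ai_words", "human_words"]
print(partial_correlation(df, "clo_score", "short_ai_chunks", controls))
```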

Table 3 shows the partial correlations between the various scoring items and the variables $x_1, \ldots, x_6$. Except for the AI Words score, all scores are strongly positively correlated with the number of human words, with $C_p > 0.9$, and positively correlated with the number of AI words, with $C_p \approx 0.5$; these partial correlations are all statistically significant. As the contest limited all stories to 500 words or less, these results imply that more competent students use more words, whether their own or AI-generated, to complete their stories and score higher. We also note that although the scores are more strongly and significantly correlated with the number of human words, the number of AI words also plays a significant role in contributing to the C, L, and O scores. Given existing research that a competent writer has the capacity to edit and improve, so that a completed text might be very different from a first draft (Flower & Hayes, 1981), this may imply that competent writers not only write more words but can also synthesize AI words or inspiration into their own writing, which contributes to the C, L, and O criteria.

One interesting observation is that AI Words scores were significantly positively correlated with the number of human chunks and significantly negatively correlated with the number of medium AI chunks. There are three performance levels for the AI Words criterion: (1) AI words used in at least one long chunk and at least one short chunk; (2) at least eight chunks of AI words; (3) no more than 33% AI words. Competent students who were keen to succeed in the contest may have decided to embed human chunks within longer AI chunks and to avoid using medium AI chunks so as to increase the number of AI chunks; hence the positive and negative partial correlations with the AI Words score. These results may also imply that students were well aware of the competition rules and developed strategies for the efficient use of AI words in their stories.

Table 3

The Partial Correlations between Various Scores and Factors

| Score | No. short AI chunks | No. med. AI chunks | No. long AI chunks | No. of human chunks | No. of AI words | No. of human words | Model p-value |
|---|---|---|---|---|---|---|---|
| Content (C) score | 0.125 | 0.124 | 0.18 | -0.124 | 0.49* | 0.937** | 7.47 × 10^-8 |
| Language (L) score | 0.218 | 0.228 | 0.215 | -0.238 | 0.537* | 0.914** | 1.03 × 10^-6 |
| Organization (O) score | 0.075 | -0.002 | 0.172 | -0.055 | 0.532* | 0.943** | 3.99 × 10^-8 |
| CLO score | 0.168 | 0.144 | 0.224 | -0.168 | 0.586* | 0.951** | 1.16 × 10^-8 |
| AI words score | -0.13 | -0.477* | -0.357 | 0.48* | 0.089 | -0.204 | 2.15 × 10^-5 |
| Grand total | 0.056 | -0.174 | -0.032 | 0.159 | 0.5* | 0.907** | 4.29 × 10^-8 |

Note. A single star (*) and a double star (**) correspond to statistical significance with a p-value less than 0.05 and 0.01, respectively. The last column shows the p-value of the corresponding linear regression model in explaining the various scores by the six factors.

Phase 3: Cluster Analysis

We used cluster analysis to group students according to the language features of AI-

generated text in their stories and the stories’ human-rated scores. Our cluster analysis algorithms

considered the following language features and scores: the number of AI chunks (short, medium,

and long), the percentage of AI-generated words used in a story, the total number of words in a

story, the C score, the L score, the O score, the total CLO score, and finally the AI Words score.

Moreover, we used different algorithms to generate different types of clusters.



Algorithm 1: EM-Algorithm

The EM-Algorithm uses Gaussian distributions to probabilistically estimate the likelihood of each data point belonging to a particular cluster. We configured the algorithm to cluster the dataset into eight groups of students; the output is described in Figure 6.

Figure 6

The Outputs from EM-Algorithm
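A minimal sketch of this clustering step follows, assuming scikit-learn’s GaussianMixture (which is fit by the EM algorithm) and a random placeholder for the nine-feature matrix, since the study’s data are not public:

```python
# A sketch of the EM clustering step using scikit-learn's GaussianMixture,
# which is fit by the EM algorithm. X is a random placeholder for the
# 23 stories x 9 features listed above; the study's data are not public.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((23, 9)))  # one scale per feature

# Diagonal covariances keep the fit stable with so few samples per component.
gm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(X)
print(gm.predict(X))        # hard cluster ID (0-7) per story
print(gm.predict_proba(X))  # probabilistic cluster memberships
```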

First, we observe that students in clusters Zero, Two, and Seven received a full or near-full

CLO score. Additionally, AI-generated words made up only 9% to 19% of their total word count,

which is low compared to other clusters. The stories written by students in these three clusters

also tend to be longer, nearer the 500-word limit, compared to other clusters. This indicates that

the most competent writers demonstrate less inclination to seek assistance from AI, which

confirms results from descriptive statistics and MLR.

Although students in these three clusters appeared to be competent writers that relied less

on AI-generated text, we observed a difference in these clusters’ AI Words scores. On the one

hand, students in cluster Zero performed quite well in the AI Words criterion. On the other hand,

all students in clusters Two and Seven received no score for the AI Words criterion. This may

indicate these students’ ignorance or misunderstanding about the scoring rubric descriptors; or

these students intentionally did not use AI-generated text according to the scoring rubric

descriptors.

Among clusters of high users of AI-generated text, we observed two different results.

Students in clusters Four and Five wrote stories with AI-generated words comprising 50% to

89% of the total number of words. Compared to other students, they scored relatively high on

their CLO (80th percentile), specifically, 4.5 for C, 4.5 for O and 4.25 for L. On the other hand,

cluster Three produced stories with AI-generated words comprising 55% to 78% of the total

number of words, but the students’ CLO score was low; specifically, they scored 1 point for the C and O criteria, and 2 points for the L criterion. The CLO scores of cluster Three indicate that those

students did not submit complete stories, for which reason they could not exceed 1 point for the

C criteria but could achieve an additional point for L. We interpret cluster Three as students who

did not appear to be competent writers, had decided to seek more assistance from AI-generated

text, but had not developed effective strategies to leverage AI-generated text in their writing. In

this way, our study provides evidence to question claims that compositions comprising

exclusively AI-generated text would be scored positively for cohesiveness and grammatical

accuracy (Godwin-Jones, 2022), for which reason automatic writing evaluation systems would

have limited value. On the other hand, the results suggest that students in clusters Four and Five

were more competent writers, had decided to seek more assistance from AI-generated text, and

had developed effective strategies to incorporate more AI-generated text in their stories.

Algorithm 2: K-Means Clustering

K-Means clustering partitions the dataset into K clusters by iteratively assigning each data point to the cluster with the nearest center (see Figure 7). As with the EM-Algorithm clustering analysis, we configured K to eight clusters. K-Means clustering revealed additional insights into students’ interaction patterns with AI-generated text, namely the attribution of specific language features of AI-generated text in stories to human-rated scores.

Figure 7

The Outputs from K-Means Algorithm
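A matching sketch for this step with scikit-learn’s KMeans, again on a random placeholder feature matrix:

```python
# A matching sketch with K-Means on the same kind of placeholder matrix as in
# the EM sketch; n_clusters corresponds to the study's choice of K = 8.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.random.default_rng(0).random((23, 9)))
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per story
print(km.cluster_centers_)  # centroids in the standardized feature space
```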



Students in cluster One wrote the longest stories, of at least 449 words, and AI-generated words comprised only 19% of their stories’ words. However, fewer than half of the students in this cluster achieved a full CLO score of 15. This indicates that the use of AI-generated text does not necessarily contribute to the highest CLO performance levels.

Another observation is that students in cluster Zero scored low on CLO and on AI Words and wrote the fewest words in their stories, 157 words on average, compared to all students. The low CLO scores indicate that these students did not submit complete stories. Additionally, they had not used AI-generated text to extend their writing. The implication is that less competent writers may lack the decision-making ability and strategies by which they might use AI-generated text to complete writing tasks.

Algorithm 3: Mean-Shift Clustering

Unlike the EM-Algorithm and K-Means clustering, Mean-Shift clustering does not require a predetermined number of clusters. Instead, it determines the number of clusters in a dataset and assigns each data point to a cluster by iteratively shifting candidate cluster centers towards the regions of highest data density, that is, the local means (see Figure 8). This algorithm grouped the students into six clusters.

Figure 8

The Outputs from the Mean-Shift Algorithm
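A matching sketch with scikit-learn’s MeanShift, again on a random placeholder feature matrix; note that no cluster count is specified in advance:

```python
# A matching sketch with Mean-Shift on a placeholder matrix; unlike the EM
# and K-Means sketches, no number of clusters is specified in advance.
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.random.default_rng(0).random((23, 9)))
ms = MeanShift().fit(X)
print(ms.labels_)                                 # cluster assignment per story
print(len(set(ms.labels_)), "clusters found")     # count chosen by the algorithm
```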

One observation was that the rate of AI-word use varied greatly, from 0 to 77.4%, in cluster One, which featured the students with the lowest CLO scores and the shortest stories. This indicates that the length of a story is a more accurate indicator of a student’s writing competence than the number of AI words used in a story. Additionally, a student’s heavier use of AI-generated text might improve the student’s CLO score if that student would otherwise have written nothing at all. Therefore, this cluster provides an initial clue that the use of AI-generated text does not universally compensate for students’ writing competence.

To summarize, we plot three-dimensional graphs of the results of each cluster analysis algorithm, showing a horizontal view in Figure 9 and a vertical view in Figure 10. From these three-dimensional graphs, the students can be visually categorized into four super-clusters: 1) students with high CLO scores but fewer AI words; 2) students with high scores but more AI words; 3) students with low scores but fewer AI words; and 4) students with low scores but more AI words.

Figure 9

Horizontal View of Three Clustering Algorithm Three-dimensional Graphs

Note. The x-axis is the normalized average CLO score between 0 and 100; the y-axis is the

percentage of AI words used in the writing; and the z-axis is the cluster ID.

Figure 10

Vertical View of Three Clustering Algorithm Three-dimensional Graphs



Conclusion

Major Findings

The descriptive statistics show a large variation in students’ use of AI-generated text in

their stories. At the same time, human scoring indicates a large variation in the C, L and O scores

of students’ stories.

A multiple linear regression of AI-generated text language features and CLO scores shows that the number of human words and the number of AI-generated words contribute significantly to CLO scores. Significant correlations between different syntactic forms of AI-generated text and AI Words scores highlight a group of competent and strategic students who use AI-generated text as a means to boost their AI Words scores.

Clustering analyses of students show groups of competent writers that strategically use

specific syntactic forms of AI-generated text but less AI-generated text than their peers; and

groups that effectively use more AI-generated text. The analyses also reveal groups of less

competent writers that use little AI-generated text and may not have strategies for its effective

use; and groups that use more AI-generated text and that benefit from its use if they were to

otherwise write nothing at all.

In sum, our study has evidenced that students’ use of AI-generated text does not universally benefit their writing. Moreover, students’ access to AI-generated text from SOTA AI-NLG tools does not by itself ensure that they will use AI-generated text extensively in their writing, or that they will use it to improve human-rated scores of their writing performance.

Pedagogical Implications

Our study can inform the flexible design of AI-related curricula to address the various

needs of schools and students, especially for secondary education for which there has been little

research (Chiu, 2021). First, our study contributes a curricular effort to achieve both learning

about AI and learning with AI (Ali et al., 2021; Williams et al., 2022). Specifically, as writing

activities with NLG tools have been scarce (Lin & Chang, 2020), our study contributes design

specifications for an authentic writing task and SOTA NLG-tools that can be freely replicated

and adapted. Second, we contribute specific measures of basic structure and organization and

syntactic complexity of AI-generated text that can inform teaching and learning practice with AI-

generated text.

Furthermore, our findings can inform teachers’ instructional methods for different groups

of EFL students to complete writing tasks with AI-generated text from SOTA NLG-tools. More

competent writers are more able to discern and optimize the use of AI to complete a high-quality

composition. These writers can benefit from fine-grain instruction on different language features

of AI-generated text, such as syntactic forms, and ways to integrate such text so as to improve

writing quality. Less competent student writers will need additional instruction from teachers, peers, and other elements in their writing classroom to improve their writing competence, to decide how to use AI-generated text, and then to develop strategic and effective use of AI-generated text for writing of higher quality. Appropriate instructional methods for different groups of students to use AI-generated text are necessary to bridge digital divides between writers of different competence and between schools. Without differentiated methods, AI-generated text may widen these digital divides.

Finally, some universities and institutions have been banning their students’ use of AI-generated text for classroom, coursework, and assessment tasks, which might be in part due to an unclear picture of how AI-generated text can be used in an effective and ethically sound way. Our empirical evidence supports such a rationale for those policies, and we find it essential to continue scrutinizing the pedagogical value of AI-generated text in education.

Limitations and Future Research

Given the dearth of research on human-AI collaborative writing, for experts’ scoring of

AI-generated text, we adapted an existing scoring rubric, which is essential for assessing writing

in classroom contexts (Roscoe et al., 2013). Moreover, we operationalized relatively simple

measures of basic structure and organization and syntactic complexity to study AI-generated text

in compositions. We did not study AI-generated text for other constructs of language features

that predict human judgements of both first language and EFL writing proficiency: lexical

sophistication (i.e., the use of advanced words that indicate lexical knowledge, often measured

by number-of-word calculations); and text cohesion (i.e., the interconnectivity of text segments)

(Crossley, 2020); nor did we study AI-generated text in terms of discourse features. Future

studies may explore quality writing with AI-generated text by operationalizing measures for

these other constructs or for discourse features. Further study with additional measures would allow us to better understand how competent writers have effectively integrated larger quantities of AI-generated text into their writing.

In our study, students composed short stories; subsequent research can expand students’ writing to argumentative and factual text types. Furthermore, our study’s SOTA AI-NLG tools utilized open-source, free-to-use language models and could output at most a paragraph of text. Subsequent research could fine-tune such models to improve their capacity to generate appropriate language for the target text type and explore such models’ contributions to student writing. Moreover, research should expand to the largest proprietary language models, like those behind ChatGPT, and to AI-NLG tools that can generate several paragraphs of text.

Although AI-NLG tools hold promise to improve students’ writing, SOTA AI-NLG tools

that predict text suffer limitations and their use raises risk in the writing classroom. One

limitation is such tools’ propensity to hallucinate, that is, to generate text that deviates from its

source input and fails to meet user expectations (Ji et al., 2022). Such text can comprise

offensive or biased material (Bender et al., 2021), or factually incorrect and nonsensical answers,

although plausible sounding (OpenAI, 2022). Another limitation is such tools’ propensity to

degenerate, that is, to generate bland and repetitive text (Holtzman et al., 2020). Therefore,

research will be necessary into prompt programming (Reynolds & McDonell, 2021) of SOTA

AI-NLG tools in the writing classroom, that is, the careful selection of input by which a student

interacts with and controls an AI-NLG tool to perform a desired task. Since rewriting a prompt

can significantly change a tool’s performance, future studies should investigate the conditions

under which a student would prompt a SOTA AI-NLG tool, what prompts students have used to

generate text, and whether students perceive any AI-generated text as satisfactory for integration

into a composition. Future research may explore students’ interactions with AI-NLG tools in the

writing process, using methods such as screen recordings, think-aloud protocols, interviews, and

stimulated recalls, to provide a complete understanding of how the tools contribute to students’

writing and how different students use AI for writing.



References

Aiken, L. S., West, S. G., & Pitts, S. C. (2003). Multiple linear regression. In Handbook of psychology (pp. 481–507). Wiley. https://doi.org/10.1002/0471264385.wei0219

Ali, S., DiPaola, D., Lee, I., Hong, J., & Breazeal, C. (2021). Exploring Generative Models with

Middle School Students. Proceedings of the 2021 CHI Conference on Human Factors in

Computing Systems, 1–13. https://doi.org/10.1145/3411764.3445226

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of

Stochastic Parrots: Can Language Models Be Too Big?. Proceedings of the 2021 ACM

Conference on Fairness, Accountability, and Transparency, 610–623.

https://doi.org/10.1145/3442188.3445922

Brown, B. L., & Hendrix, S. B. (2005). Partial correlation coefficients. Encyclopedia of statistics

in behavioral science. https://doi.org/10.1002/0470013192.bsa469

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A.,

Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan,

T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020).

Language Models are Few-Shot Learners (arXiv:2005.14165). arXiv.

https://doi.org/10.48550/arXiv.2005.14165

Butterfuss, R., Roscoe, R. D., Allen, L. K., McCarthy, K. S., & McNamara, D. S. (2022).

Strategy Uptake in Writing Pal: Adaptive Feedback and Instruction. Journal of

Educational Computing Research, 60(3), 696–721.

https://doi.org/10.1177/07356331211045304

Chen, Z., Chen, W., Jia, J., & Le, H. (2022). Exploring AWE-supported writing process: An

activity theory perspective. Language Learning & Technology, 26(2), 129–148.

https://doi.org/10125/73482

Chiu, T. K. F. (2021). A Holistic Approach to the Design of Artificial Intelligence (AI)

Education for K-12 Schools. TechTrends, 65(5), 796–807.

https://doi.org/10.1007/s11528-021-00637-1

Chiu, T. K. F., Meng, H., Chai, C.-S., King, I., Wong, S., & Yam, Y. (2022). Creation and

Evaluation of a Pretertiary Artificial Intelligence (AI) Curriculum. IEEE Transactions on

Education, 65(1), 30–39. https://doi.org/10.1109/TE.2021.3085878

Crossley, S. A., Muldner, K., & McNamara, D. S. (2016). Idea Generation in Student Writing:

Computational Assessments and Links to Successful Writing. Written Communication,

33(3), 328–354. https://doi.org/10.1177/0741088316650178

Crossley, S. A. (2020). Linguistic features in writing quality and development: An overview.

Journal of Writing Research, 11(3), Article 3. https://doi.org/10.17239/jowr-

2020.11.03.01

DeCuir-Gunby, J. T., Marshall, P. L., & McCulloch, A. W. (2010). Developing and Using a

Codebook for the Analysis of Interview Data: An Example from a Professional

Development Research Project. Field Methods, 23(2), 136–155.

https://doi.org/10.1177/1525822X10388468

Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., & Meester, L. E. (2005). A Modern Introduction

to Probability and Statistics: Understanding why and how (Vol. 488). London: Springer.

https://link.springer.com/book/10.1007/1-84628-168-7

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete

Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B

(Methodological), 39(1), 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Dizon, G. (2020). Evaluating intelligent personal assistants for L2 listening and speaking

development. Language Learning & Technology, 24(1), 16-26.

https://doi.org/10125/44705

Dizon, G., & Gayed, J. (2021). Examining the impact of Grammarly on the quality of mobile L2

writing. The JALT CALL Journal, 17(2), 74–92.

https://doi.org/10.29140/jaltcall.v17n2.336

Flower, L., & Hayes, J. R. (1981). A Cognitive Process Theory of Writing. College Composition

and Communication, 32(4), 365–387. https://doi.org/10.2307/356600

Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with

applications in pattern recognition. IEEE Transactions on Information Theory, 21(1), 32–

40. https://doi.org/10.1109/TIT.1975.1055330

Gatt, A., & Krahmer, E. (2018). Survey of the State of the Art in Natural Language Generation:

Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61,

65–170. https://doi.org/10.1613/jair.5477

Godwin-Jones, R. (2022). Partnering with AI: Intelligent writing assistance and instructed

language learning. Language Learning & Technology, 26(2), 5–24.

http://doi.org/10125/73474

Gong, X., Tang, Y., Liu, X., Jing, S., Cui, W., Liang, J., & Wang, F. Y. (2020, October). K-9

Artificial Intelligence Education in Qingdao: Issues, Challenges and Suggestions. In 2020



IEEE International Conference on Networking, Sensing and Control (ICNSC) (pp. 1-6).

IEEE. https://doi.org/10.1109/ICNSC48988.2020.9238087

Guo, K., Wang, J., & Chu, S. K. W. (2022). Using chatbots to scaffold EFL students’

argumentative writing. Assessing Writing, 54, 100666.

https://doi.org/10.1016/j.asw.2022.100666

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text

Degeneration (arXiv:1904.09751). arXiv. http://arxiv.org/abs/1904.09751

Jakesch, M., Bhat, A., Buschek, D., Zalmanson, L., & Naaman, M. (2023). Co-Writing with

Opinionated Language Models Affects Users’ Views.

https://doi.org/10.1145/3544548.3581196

Jeon, J. (2022). Exploring AI chatbot affordances in the EFL classroom: Young learners’

experiences and perspectives. Computer Assisted Language Learning, 1–26.

https://doi.org/10.1080/09588221.2021.2021241

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Dai, W., Madotto, A., &

Fung, P. (2022). Survey of Hallucination in Natural Language Generation. ACM

Computing Surveys, 3571730. https://doi.org/10.1145/3571730

Kangasharju, A., Ilomäki, L., Lakkala, M., & Toom, A. (2022). Lower secondary students’

poetry writing with the AI-based Poetry Machine. Computers and Education: Artificial

Intelligence, 3, 100048. https://doi.org/10.1016/j.caeai.2022.100048

Kim, H., Yang, H., Shin, D., & Lee, J. H. (2022). Design principles and architecture of a second

language learning chatbot. Language Learning & Technology, 26(1), 1–18.

http://hdl.handle.net/10125/73463

Kim, J., Lee, H., & Cho, Y. H. (2022). Learning design to support student-AI collaboration:

Perspectives of leading teachers for AI in education. Education and Information

Technologies. https://doi.org/10.1007/s10639-021-10831-6

Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational

chatbots: A systematic review. Education and Information Technologies, 28(1), 973–

1018. https://doi.org/10.1007/s10639-022-11177-3

Lin, M. P.-C., & Chang, D. (2020). Enhancing Post-secondary Writers’ Writing Skills with a

Chatbot: A Mixed-Method Classroom Study. Journal of Educational Technology &

Society, 23(1), 78–92. https://www.jstor.org/stable/26915408

Lin, C. J., & Mubarok, H. (2021). Learning analytics for investigating the mind map-guided AI

chatbot approach in an EFL flipped speaking classroom. Educational Technology &

Society, 24(4), 16-35. https://www.jstor.org/stable/48629242

Liu, C., Hou, J., Tu, Y. F., Wang, Y., & Hwang, G. J. (2021). Incorporating a reflective thinking

promoting mechanism into artificial intelligence-supported English writing environments.

Interactive Learning Environments, 1-19.

https://doi.org/10.1080/10494820.2021.2012812

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.

Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,

Volume 1: Statistics, 5.1, 281–298. https://projecteuclid.org/ebooks/berkeley-symposium-

on-mathematical-statistics-and-probability/Proceedings-of-the-Fifth-Berkeley-

Symposium-on-Mathematical-Statistics-and/chapter/Some-methods-for-classification-

and-analysis-of-multivariate-observations/bsmsp/1200512992

OpenAI. (2022, November 30). ChatGPT: Optimizing Language Models for Dialogue. OpenAI.

https://openai.com/blog/chatgpt/

Prior, P. A. (2006). A Sociocultural Theory of Writing. In C. A. MacArthur, S. Graham, & J.

Fitzgerald (Eds.), The Handbook of Writing Research (pp. 54–66). Guilford Press.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe

Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models:

Beyond the Few-Shot Paradigm (arXiv:2102.07350). arXiv.

https://doi.org/10.48550/arXiv.2102.07350

Roscoe, R., Allen, L., & McNamara, D. (2013). Developing pedagogically-guided algorithms for

intelligent writing feedback. International Journal of Learning Technology, 8, 362–381.

https://doi.org/10.1504/IJLT.2013.059131

Sing, C. C., Teo, T., Huang, F., Chiu, T. K. F., & Xingwei, W. (2022). Secondary school

students’ intentions to learn AI: Testing moderation effects of readiness, social good and

optimism. Educational Technology Research and Development, 70(3), 765–782.

https://doi.org/10.1007/s11423-022-10111-1

Stack Exchange Inc. (2022, December 8). Temporary policy: ChatGPT is banned [Forum post].

Meta Stack Overflow. https://meta.stackoverflow.com/q/421831

Tan, X. (2023). Stories behind the scenes: L2 students’ cognitive processes of multimodal

composing and traditional writing. Journal of Second Language Writing, 59, 100958.

https://doi.org/10.1016/j.jslw.2022.100958

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &

Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information

Processing Systems, 30.

https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-

Abstract.html

Wang, B., & Komatsuzaki, A. (2021, June 4). GPT-J-6B: 6B JAX-Based Transformer. Aran

Komatsuzaki. https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/

Wang, T., & Cheng, E. C. K. (2021). An investigation of barriers to Hong Kong K-12 schools

incorporating Artificial Intelligence in education. Computers and Education: Artificial

Intelligence, 2, 100031. https://doi.org/10.1016/j.caeai.2021.100031

Wang, X., Liu, Q., Pang, H., Tan, S. C., Lei, J., Wallace, M. P., & Li, L. (2023). What matters in

AI-supported learning: A study of human-AI interactions in language learning using

cluster analysis and epistemic network analysis. Computers & Education, 194, 104703.

https://doi.org/10.1016/j.compedu.2022.104703

Williams, R., Ali, S., Devasia, N., DiPaola, D., Hong, J., Kaputsos, S. P., Jordan, B., & Breazeal,

C. (2022). AI + Ethics Curricula for Middle School Youth: Lessons Learned from Three

Project-Based Curricula. International Journal of Artificial Intelligence in Education.

https://doi.org/10.1007/s40593-022-00298-y

Xu, W., & Ouyang, F. (2022). A systematic review of AI role in the educational system based on

a proposed conceptual framework. Education and Information Technologies, 1-29.

https://doi.org/10.1007/s10639-021-10774-y

Yang, H., Kim, H., Lee, J. H., & Shin, D. (2022). Implementation of an AI chatbot as an English

conversation partner in EFL speaking classes. ReCALL, 34(3), 327-343.

https://doi.org/10.1017/S0958344022000039

Yau, C., & Chan, K. (2023, February 17). University of Hong Kong temporarily bans students

from using ChatGPT. South China Morning Post. https://www.scmp.com/news/hong-

kong/education/article/3210650/university-hong-kong-temporarily-bans-students-using-

chatgpt-other-ai-based-tools-coursework

Appendix A

The Assessment Rubric for the 1st Human-AI Creative Writing Contest for Hong Kong Secondary Schools

Score 5
· Content: Content fulfills the requirements of the question; almost totally relevant; most ideas are well developed/supported; creativity and imagination are shown when appropriate; shows general awareness of audience and text-type.
· Language: Wide range of accurate sentence structures with a good grasp of simple and complex sentences; grammar mainly accurate with occasional common errors that do not affect overall clarity; vocabulary is wide, with many examples of more sophisticated lexis; spelling and punctuation are mostly correct; register, tone and style are appropriate to the genre and text-type.
· Organization: Text is organized effectively, with logical development of ideas; cohesion in most parts of the text is clear; strong cohesive ties throughout the text; overall structure is coherent, sophisticated and appropriate to the genre and text-type.
· AI Words: AI words compose less than ⅓ of the total number of words in the text.

Score 3
· Content: Content just satisfies the requirements of the question; relevant ideas but may show some gaps or redundant information; some ideas but not well developed; some evidence of creativity and imagination; shows occasional awareness of audience.
· Language: Simple sentences are generally accurately constructed; occasional attempts are made to use more complex sentences, though structures used tend to be repetitive in nature; grammatical errors sometimes affect meaning; common vocabulary is generally appropriate; most common words are spelt correctly, with basic punctuation being accurate; there is some evidence of register, tone and style appropriate to the genre and text-type.
· Organization: Parts of the text have clearly defined topics; cohesion in some parts of the text is clear; some cohesive ties in some parts of the text; overall structure is mostly coherent and appropriate to the genre and text-type.
· AI Words: At least 8 AI chunks of any length.

Score 1
· Content: Content shows very limited attempts to fulfill the requirements of the question; intermittently relevant; ideas may be repetitive; some ideas but few are developed; ideas may include misconception of the task or some inaccurate information; very limited awareness of audience.
· Language: Some short simple sentences accurately structured; grammatical errors frequently obscure meaning; very simple vocabulary of limited range often based on the prompt(s); a few words are spelt correctly with basic punctuation being occasionally accurate.
· Organization: Parts of the text reflect some attempts to organize topics; some use of cohesive devices to link ideas.
· AI Words: AI words used in long chunks (more than 1 sentence in length) and in short chunks (less than 5 words in length).

Note.
1. Content mark cannot exceed 1 if the story is not complete, that is, missing exposition, conflict, climax, and/or resolution.
2. Content mark cannot exceed 1 if the story is not a story, for example, an article or an essay.
3. Creativity in content refers to the details, transformation and originality of ideas.
4. Language and organization marks cannot exceed +/- 1 of the content mark.

Appendix B

Human-rated scores for CW students' stories

| Text Name | M1 C | M1 L | M1 O | M1 Sub-total | M2 C | M2 L | M2 O | M2 Sub-total | Avg. C | Avg. L | Avg. O | Avg. CLO | AI words | Grand total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 0 | 4 |
| 2 | 1 | 1 | 1 | 3 | 1 | 1 | 2 | 4 | 1 | 1 | 1.5 | 3.5 | 1 | 5 |
| 3 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 5 |
| 4 | 1 | 2 | 1 | 4 | 1 | 2 | 2 | 5 | 1 | 2 | 1.5 | 4.5 | 1 | 6 |
| 5 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 3 | 0 | 3 |
| 6 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 3 | 7 |
| 7 | 5 | 4 | 4 | 13 | 4 | 4 | 4 | 12 | 4.5 | 4 | 4 | 12.5 | 3 | 16 |

Note. M1 = Marker 1; M2 = Marker 2; Avg. = marker average; C = Content; L = Language; O = Organization.

Human-rated scores for HMT students' stories

| Text Name | M1 C | M1 L | M1 O | M1 Sub-total | M2 C | M2 L | M2 O | M2 Sub-total | Avg. C | Avg. L | Avg. O | Avg. CLO | AI words | Grand total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 4 | 4 | 13 | 5 | 4 | 4 | 13 | 5 | 4 | 4 | 13 | 0 | 13 |
| 2 | 5 | 5 | 4 | 14 | 5 | 5 | 4 | 14 | 5 | 5 | 4 | 14 | 5 | 19 |
| 3 | 5 | 5 | 5 | 15 | 5 | 5 | 4 | 14 | 5 | 5 | 4.5 | 14.5 | 0 | 15 |
| 4 | 5 | 5 | 4 | 14 | 4 | 5 | 5 | 14 | 4.5 | 5 | 4.5 | 14 | 0 | 14 |
| 5 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 20 |
| 6 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 0 | 4 |
| 7 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 20 |
| 8 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 5 | 5 | 15 | 5 | 20 |
| 9 | 5 | 4 | 4 | 13 | 5 | 4 | 5 | 14 | 5 | 4 | 4.5 | 13.5 | 5 | 19 |
| 10 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 | 0 | 4 |
| 11 | 3 | 4 | 2 | 9 | 1 | 2 | 1 | 4 | 2 | 3 | 1.5 | 6.5 | 0 | 7 |
| 12 | 4 | 3 | 4 | 11 | 4 | 4 | 4 | 12 | 4 | 3.5 | 4 | 11.5 | 0 | 12 |
| 13 | 5 | 5 | 4 | 14 | 4 | 5 | 4 | 13 | 4.5 | 5 | 4 | 13.5 | 0 | 14 |
| 14 | 5 | 4 | 4 | 13 | 3 | 4 | 4 | 11 | 4 | 4 | 4 | 12 | 0 | 12 |
| 15 | 5 | 5 | 4 | 14 | 4 | 4 | 4 | 12 | 4.5 | 4.5 | 4 | 13 | 0 | 13 |
| 16 | 3 | 3 | 4 | 10 | 3 | 3 | 3 | 9 | 3 | 3 | 3.5 | 9.5 | 0 | 10 |

Note. Column abbreviations as in the table above.
