Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders


Magnus Lundgren*
Department of Political Science, University of Gothenburg

Abstract
This study investigates the efficacy of large language models (LLMs) as tools for grading master-level
student essays. Utilizing a sample of 60 essays in political science, the study compares the accuracy of
grades suggested by the GPT-4 model with those awarded by university teachers. Results indicate that
while GPT-4 aligns with human grading standards on mean scores, it exhibits a risk-averse grading
pattern and its interrater reliability with human raters is low. Furthermore, modifications in the grading
instructions (prompt engineering) do not significantly alter AI performance, suggesting that GPT-4
primarily assesses generic essay characteristics such as language quality rather than adapting to nuanced
grading criteria. These findings contribute to the understanding of AI’s potential and limitations in higher
education, highlighting the need for further development to enhance its adaptability and sensitivity to
specific educational assessment requirements.

Introduction

In the realm of higher education, grading remains a pivotal yet challenging task, directly impacting the quality of education and the workload of educators [1, 2]. As student populations grow and the range of assessments diversify [3], the traditional methods of grading, often labor-intensive and subject to human biases, call for innovative solutions. This study investigates the performance of Generative Pretrained Transformers (GPTs), a type of AI language model popularized by OpenAI, in assessing and grading student essays.

The central inquiry focuses on the efficacy and reliability of GPT-4 as a grading tool. Specifically, the study aims to answer the question of whether GPT-4 can provide accurate numerical grades for written essays in higher social science education. The study utilizes a sample of 60 anonymized master-level essays previously graded by teachers as a benchmark to assess the performance of GPT-4, using a variety of instructions ("prompts") to explore variation in quantitative measures of predictive performance and interrater reliability.

The study makes three critical findings regarding the use of GPTs for grading essays in higher social science education. First, GPT-4's grading closely aligns with human graders in terms of mean scores, though it exhibits a conservative grading pattern, primarily assigning grades within a narrower middle range. Second, GPT-4 demonstrates relatively low interrater reliability with human graders, as evidenced by a Cohen's kappa of 0.18 and a percent agreement of 35%, indicating significant room for improvement in AI grading alignment with human judgment. Third, the investigation reveals that adjustments to the grading instructions via prompt engineering do not significantly influence GPT-4's performance. This suggests that the AI predominantly evaluates essays based on generic characteristics such as language quality and structural coherence, rather than adapting to the detailed and nuanced assessment criteria embedded within different prompts.

The absence of a human-to-human comparison for the same set of essays limits our understanding of how GPT-4's interrater reliability stacks up against typical human variance in grading. However, this limitation notwithstanding, the study's empirical findings contribute to a growing literature on using AI for grading and evaluation in higher education [4, 5, 6, 7], suggesting three principal implications. First, although AI has the potential to develop into a resource-efficient assessment tool, reducing the grading workload for university teachers, further technological development is required to improve alignment with human raters. Second, the research underscores the challenge AI presently faces in grading complex, lengthy essay materials compared to simpler, more deterministic tasks like exam questions. Third, the consistent performance of GPT-4 across different prompts reveals a limitation in its ability to differentiate based on nuanced grading instructions, suggesting a risk of misclassification in cases where linguistic flair exceeds analytical quality.

* [Link]@[Link]
AI and Language Models in Student Evaluation

The development of automated essay scoring (AES) systems has significantly evolved over the past decades. Initially conceptualized in the late 1960s with Project Essay Grader (PEG), AES technologies have grown in sophistication, incorporating advanced machine learning techniques to evaluate the quality of written texts based on features like essay length, word length, and syntactic variety (see [8] for a review). Following the arrival of LLMs, and specifically GPTs, an emerging body of literature has started to probe their performance and other characteristics in classifying and evaluating written text in the context of higher education, thereby moving beyond more conventional machine learning approaches (e.g., [9]).

One line of research has focused on the potential to employ GPTs in the service of research assistance tasks. A study by [4] showed that GPT-3 could replicate human-sourced survey responses, while [5] found that LLMs could be used as surrogates for human subjects in certain types of social science research. Similarly, exploring whether this type of AI model can be utilized for research assistance tasks, such as automated content classification in social science research, Lupo et al. (2023: 1) found that "an LLM can be as good as or even better than a human annotator while being much faster".

Another literature focuses on the usage of AI in student assessment. [10] review a variety of applications of AI in higher education student assessment. They highlight a growing trend in the use of AI for formative assessment, providing immediate feedback and grading, and point to how AI has been used to assist teachers with large groups of students. At the same time, [10] observe a lack of widespread use of AI in education due to limited knowledge among users and the need for specific training for effective implementation.

Focusing on how AI tools can assist in feedback, [6] examine the efficacy and student preferences concerning AI-generated feedback in writing for English as a New Language (ENL) learners. They report two longitudinal studies: the first assesses whether AI-generated feedback (using GPT-4) impacts learning outcomes compared to human tutor feedback, finding no significant difference between the two. The second study explores ENL students' preferences for AI-generated versus human tutor feedback, revealing no significant differences in preferences, while suggesting that both types of feedback may have distinct advantages. The authors conclude that a blended approach, integrating both AI and human feedback, could be most effective.

On the topic of using AI in grading, [11] provides a conceptual discussion, reflecting on potential benefits of AI in grading (such as efficiency and consistency) and drawbacks (such as privacy concerns and the quality of feedback). Another study [7] used ChatGPT (GPT-3) to grade open-ended questions answered by 42 industry professionals in technical training. The effectiveness of ChatGPT was evaluated by comparing its corrections and feedback against subject matter experts. The results suggested that the AI had some capability to identify nuanced semantic details and that subject matter experts "tended to agree with the corrections and feedback given by ChatGPT".

These studies represent significant advancements in our understanding of AI tools in higher education assessment but also exhibit limitations. Most importantly, they have remained focused on using AI in the grading of shorter exam questions. The challenges in grading exam questions are likely different from those that emerge from grading student essays, which are less focused on providing a factual answer and typically require—and assess—a greater range of analytical skills. Exploring whether and when LLMs have the ability to grade essay assignments is therefore an important line of inquiry. Another limitation in existing research is the reliance on LLMs of inferior capability. Most published studies are carried out on early versions of GPTs, predominantly GPT-2 and -3, motivating further studies of more recent and capable versions of these models.

Methodology

The study was performed using OpenAI's GPT-4 model, available via API1 and the ChatGPT web interface.2 At the time of the research, this model ranked as the most capable GPT publicly available.

1 [Link]
2 [Link]

The validation set consists of a sample of 60 student essays from a master-level course in political science offered at the University of Gothenburg in 2022 and 2023. The essay assignment requires students to write an individual research paper analyzing the impact of a specific policy, international agreement, or intervention on a particular aspect of an assigned country's social, political, institutional, or economic conditions, focusing on detailed empirical analysis and guided by academic literature. The essays were around 3,500 words, structured into 4-6 subsections, and typically included at least one or two figures or graphs.

Human raters graded the essays on a numerical scale, 1-7, taking into consideration clarity and structure, integration of academic literature, empirical rigor, and quality of writing and presentation, with higher scores indicating clearer organization, deeper analysis, better integration of academic sources, more rigorous empirical work, and superior writing quality. All human grades were motivated with attendant written comments.3

3 This study focuses solely on the numerical grades. Further research should investigate whether LLMs can generate written comments that align with those provided by human raters.
This validation set is free from "data contamination", which would occur if the coded material was part of the LLM's training material [12]. The temporal cutoff of the used GPT-4 model preceded the grading, and the essays do not exist in the public domain.

Ethical considerations were paramount in this study, with a strong focus on data privacy. Ensuring the confidentiality of student data was critical; hence, the study implemented rigorous anonymization protocols to prevent any possibility of identifying individuals from the essay data. These protocols were designed to comply with the General Data Protection Regulation (GDPR) and institutional guidelines. All papers were anonymized and any information that could tie the text to a person or institution was removed prior to handling by ChatGPT. Furthermore, prior to the execution of the study, all GPT settings were modified to minimize privacy concerns, including opting out from sharing conversations with OpenAI to train future GPT models.4 Additionally, all tests were run in a zero-shot setup, meaning that no two grading processes influenced each other.

4 [Link]

To assess the overall viability of the approach and to calibrate the research protocol, a pilot study was carried out (reported in Appendix A).

The main study was carried out according to the following protocol:

1. Input of individual essay in pdf format;
2. Input of grading instructions ("prompt");
3. Collection of output data (suggested grade).

Steps 1-3 were repeated for each of the 60 essays using the same prompt, yielding 60 observations (suggested grades) per prompt.

To probe GPT-4's sensitivity to different instructions and settings, the protocol was repeated with four different prompts (full information is available in Appendix B). The (1) basic prompt requests grading a master's level political science paper on a 1-7 scale based purely on the submitted paper. The (2) grading criteria prompt added a specific grading matrix for grading the paper, stipulating more specific criteria for a set of dimensions, each graded 1-7. The (3) reference prompt introduced two reference papers to calibrate the grading, providing examples of mid-range and high-grade standards. The (4) everything prompt combined elements from the previous ones—reference papers and a grading matrix—and additionally specified an expected grade distribution, making it the most complex and structured grading instruction. The expectation was that, if GPT-4 can adapt to more detailed and granular instructions, these modified prompts would improve its ability to predict grades set by human raters.

All prompts were submitted in English, corresponding to both the course language and the language of the student assignments. Recent research suggests that GPT-4 attains a higher performance in coding material in the English language compared with other languages [13].
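To make the protocol concrete, a minimal sketch of how steps 1-3 could be automated against the OpenAI API is shown below. This is an illustration rather than the study's actual script: the model identifier, the PDF text extraction via pypdf, the temperature setting, and the abbreviated prompt text are assumptions, and the study also used the ChatGPT web interface with uploaded PDF files.

```python
# Minimal sketch of the grading protocol (steps 1-3), assuming the OpenAI
# Python SDK (v1) and pypdf for text extraction; not the study's exact script.
from pathlib import Path

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abbreviated version of the "basic" prompt from Appendix B (assumption).
BASIC_PROMPT = (
    "I want you to act as a teaching assistant grading a student paper from "
    "a master's level course in political science. Grade the paper below on "
    "a scale from 1 to 7, where 1 is a failing grade and 7 the highest "
    "possible grade. Provide the numerical grade and nothing else.\n\n{essay}"
)

def extract_text(pdf_path: Path) -> str:
    """Step 1: read the anonymized essay from PDF into plain text."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def grade_essay(essay_text: str) -> int:
    """Steps 2-3: submit the grading instructions and collect the grade."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic-as-possible output (assumption)
        messages=[{"role": "user", "content": BASIC_PROMPT.format(essay=essay_text)}],
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    grades = {}
    for pdf in sorted(Path("essays").glob("*.pdf")):  # 60 anonymized essays
        grades[pdf.stem] = grade_essay(extract_text(pdf))
    print(grades)
```

Running one such loop per prompt variant yields the 60 suggested grades per prompt described above.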
The collected data were evaluated for interrater reliability and overall predictive accuracy. Following [14], interrater reliability was assessed based on both consensus and consistency estimates. Consensus is assessed based on percent agreement, calculated by dividing the number of GPT-human agreements by the total number of observations, and Cohen's kappa, which adjusts percent agreement for the agreement that could happen by chance. Consistency is evaluated based on the Pearson correlation coefficient, which is the most conventional correlation measure, and Spearman's rank correlation coefficient, which measures how well two sets of ranks correlate. Predictive accuracy was also evaluated based on ordinary least squares (OLS) regression.
Results

Descriptive and distributional statistics (Table 1, Figure 1) suggest that the mean GPT-4 grades align with those of human raters. The GPT-4 rater with the basic prompt grades somewhat more leniently, but the other GPT-4 raters arrive at grade distributions with means that are statistically indistinguishable from that of human raters. Inspection of the distributions (Figure 1) suggests that the GPT-4 raters produce grade distributions with a narrower range, which is also confirmed by the observed standard deviations, which are considerably smaller for GPT-4 raters than for human raters. Overall, these measures indicate that GPT-4 raters exhibit a "bias towards the middle," awarding fewer essay grades at the lower and higher ends of the spectrum compared with human raters. Taken together, this suggests that GPT-4 is risk averse, seemingly avoiding assigning low or high grades. Potentially, this is suggestive of the "polite" or "optimistic" traits that researchers have attributed to GPTs in other studies [15].
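For illustration, the comparison of means reported in Table 1 can be reproduced in outline with a t-test, as in the sketch below. The grade vectors are invented placeholders, and the choice of Welch's two-sided t-test is an assumption about the exact test used.

```python
# Toy comparison of mean grades: human raters vs. one GPT-4 prompt condition.
# Invented data; illustrates the kind of t-test reported under Table 1.
import numpy as np
from scipy.stats import ttest_ind

human = np.array([5, 4, 6, 7, 3, 5, 6, 4, 5, 2])
gpt4_basic = np.array([6, 5, 6, 6, 5, 6, 6, 5, 6, 5])

print(f"Means: human {human.mean():.2f}, GPT-4 {gpt4_basic.mean():.2f}")
print(f"SDs:   human {human.std(ddof=1):.2f}, GPT-4 {gpt4_basic.std(ddof=1):.2f}")

# Welch's t-test (does not assume equal variances, which matters here
# because the GPT-4 grade distributions are visibly narrower).
t_stat, p_value = ttest_ind(human, gpt4_basic, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```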

Figure 1: Distribution of essay grades: human raters and GPT-4. [Histogram panels: Human grader; GPT-4 (Basic prompt); GPT-4 (Criteria prompt); GPT-4 (References prompt); GPT-4 (Everything prompt). X-axis: grade (1-7); y-axis: frequency.]

Table 1: Essay grades, descriptive statistics

                     Mean    S.D.
Human raters         4.95    1.47
GPT-4: Basic         5.60*   0.62
GPT-4: Criteria      5.03    0.45
GPT-4: References    5.30    0.77
GPT-4: Everything    5.17    0.81

* t-test indicates difference in means at p<0.01 compared with human raters.

In terms of interrater reliability (Table 2), GPT-4 raters exhibit relatively low levels of agreement with human raters. The best-performing GPT-4 rater overall is the basic prompt, which attains a 35 percent agreement and a Cohen's kappa of 0.18, whereas other prompts tend to have lower scores. Since we lack a human-to-human comparison for this essay set, we cannot compare it to human-level interrater reliability for this exact material. However, these are interrater reliability levels that would be interpreted as low under conventional circumstances.

Given the nature of the essay assignment and the grading system, it is warranted to place more emphasis on the two consistency measures, which are less sensitive to exact matches. These measures suggest that there is an overall correlation between human and GPT-4 grades, suggesting that human and GPT raters attribute similar importance to the core elements of the essay assignment. This also implies that while GPT-4 and human raters may prioritize similar criteria when assessing essays, the interpretation of these criteria into numerical grades exhibits variability. These results are largely replicated by OLS regressions, which also provide additional information on the uncertainty of estimates (see also Figure 2).

While we may want to place more emphasis on the two consistency measures, the results nevertheless indicate that prompt engineering does not significantly alter the performance beyond the basic prompt. This suggests that GPT-4 predominantly assesses generic features of essay content, such as the quality of the language or style of writing, rather than adapting to the nuanced requirements embedded within various prompt configurations.
Figure 2: Scatter plots of essay grades: human raters vs. GPT-4. [Panels: Basic prompt (β = .70); Criteria prompt (β = .93); References prompt (β = .20); Everything prompt (β = .27). X-axis: human grader (1-7); y-axis: GPT-4 (1-7).]
Percent Cohen’s Pearson’s Spearman’s
agree kappa correla- rho
tion
GPT-4: Basic 35 0.18 0.30 0.32
GPT-4: Criteria 25 0.16 0.31 0.30
GPT-4: Reference papers 27 0.08 0.43 0.10
GPT-4: Comprehensive 27 0.13 0.26 0.13
Table 2: Interrater reliability scores by prompt. Note: Comparison is with human raters.

Conclusion

This study has evaluated the ability of LLMs, specifically OpenAI's GPT-4 model, to grade master-level student essays in social science. A sample of 60 essays was processed through four different grading prompts to assess how well the AI's suggested grades aligned with those given by human graders. Empirical findings revealed that GPT-4 closely matches human grading standards in terms of mean scores but exhibits a bias towards middle-range grades, resulting in a risk-averse assessment profile. Importantly, the AI demonstrated relatively low interrater reliability with human raters, and adjustments in grading instructions did not significantly influence its performance, suggesting that GPT-4 evaluates based on generic features of the essays rather than specific grading criteria.

In the absence of comparable measures of human interrater reliability for the same material, it is difficult to assess the full meaning of these results. It is possible that reliability is comparable to what we would observe in a human-to-human comparison. However, the results appear to suggest three principal implications.

First, the interrater reliability scores, as indicated by Cohen's kappa and percent agreement scores, raise concerns about the consistency between human and AI grading of this type of essay material. Moreover, the narrower grade distributions and the overall lower interrater reliability with human raters suggest areas where further training of the AI models might be necessary to better mimic human grading behaviors and improve their utility in educational settings (and compliance with relevant regulations). Overall, the discrepancy suggests a need for better alignment of LLM grading mechanisms with human judgment patterns.

Second, the performance observed here is worse than LLM performance on related tasks such as grading of exam questions [7] and coding of political text [4]. This deviation is likely explained by differences in the coded material. For instance, whereas [7] concerned the grading of shorter and more deterministic exam questions and [4] asked LLMs to code shorter texts into topical categories, this study concerned the more difficult task of categorizing longer texts into numerically ordered categories. If so, it suggests that the complexity of the source material is an important source of variation in the expected performance of LLMs in grading in higher education.

Third, the stability across different prompt conditions highlights a limitation in LLM capacity to discriminate between the differences dictated by grading instructions. This is perhaps the most troublesome finding, as it suggests possible limitations to the calibration of LLMs to rate material based on grading matrices and similar instructions. If LLMs mainly pick up general features of the written text, such as language usage and style, it increases the risk of misclassification. Even if such general language features correlate with essay quality—as is likely often the case—this risk is non-negligible. For example, if a given essay is exceptionally creative with regard to its handling of empirical material but below average in terms of writing quality and style, a human rater may presently be more likely to account for the creative qualities, whereas LLMs appear more likely to overlook them.

In light of these findings—and the limitations they suggest—the overall conclusion is that LLMs are not yet a mature tool for assessment of longer essays in higher education. At the same time, we must assume that the AI technology will continue to evolve, providing teachers with gradually more sophisticated and tailored software solutions which may address several or all of the observed limitations. This would involve, e.g., sensitivity to specialized grading rubrics and prompt-specific criteria, greater ability to handle varied text and open-ended assignments, and ability to analyze the quality of arguments. As AI attains such capabilities, it is likely to become a tool that teachers employ for a variety of tasks, including grading.

The introduction of AI in grading in higher education would raise several broader questions. One question pertains to the role assessment plays for teachers in evaluating their students' learning. If grading is outsourced to an automated agent, will teachers be as able to observe whether their students "are getting it"—and to adjust their course content accordingly—as they are when they do the grading themselves? Likely, the AI tools of tomorrow can be adjusted to provide the teacher with information about where a particular group of students fell short and where they did well, but it is not certain that such information can replace the experience of having "close contact" with the student essays themselves.
Likewise, if teachers are seen as free-riding on AIs in grading or unable to explain AI-determined grades with sufficient clarity, it may undermine teachers' legitimacy and students' motivations to learn [16]. Another question pertains to student assessment as an exercise of public authority (in many national systems). When AI systems are employed for grading in universities, defining legal accountability and ensuring transparency becomes crucial. These systems blur traditional lines of responsibility, raising concerns about who is accountable for AI decisions—whether it be the developers, the institution, or the individual educators using the technology. Moreover, the opaque nature of AI algorithms can conflict with the required transparency of public decisions, making it difficult for students to understand and challenge their grades. Practical integration of AI in student assessment would therefore depend on establishing regulatory frameworks that define accountability for AI-generated outcomes and creating mechanisms to ensure that AI decision-making processes in educational settings are as transparent and explainable as possible.

A final, more philosophical question arises when considering the role of AI in grading: Is anything significant lost when the assessment interaction between teacher and student is mediated by AI? In the pedagogy of higher education, grading is not merely a measure of student performance but also a critical feedback mechanism that influences learning strategies and outcomes. Some may argue that the personal insights and nuanced understanding that educators bring to the grading process could be diminished when replaced by AI. However, arguably, the fundamental aspects of grading—evaluating knowledge, comprehension, and critical thinking—can be efficiently mimicked by sophisticated algorithms. If AI continues to evolve to address its current limitations and social and regulatory norms evolve to permit its application in higher education, the technology may be able to enhance grading by providing more consistent and unbiased assessments. By reliably handling routine grading tasks, AI could then allow educators to dedicate more time to personalized teaching and mentoring [6, 11]. Thus, rather than detracting from human interaction, a more advanced AI could support and enhance educational engagement by freeing up human capacities for deeper, more meaningful interactions between teacher and student than what is provided by conventional grading.

References

[1] D. Allen, ed. Assessing Student Learning: From Grading to Understanding. Teachers College Press, 1998.

[2] C. Golding and L. Adam. "Evaluate to improve: useful approaches to student evaluation". In: Assessment & Evaluation in Higher Education 41.1 (2016), pp. 1-14.

[3] A. Bryntesson and M. Börjesson. Breddad rekrytering till högre utbildning: En beskrivning och jämförelse av policy och utfall i de skandinaviska länderna. Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Fakulteten för utbildningsvetenskaper, Institutionen för pedagogik, didaktik och utbildningsstudier, 2021.

[4] L. P. Argyle et al. "Out of one, many: Using language models to simulate human samples". In: Political Analysis 31.3 (2023), pp. 337-351.

[5] G. V. Aher, R. I. Arriaga, and A. T. Kalai. "Using large language models to simulate multiple humans and replicate human subject studies". In: International Conference on Machine Learning. PMLR, 2023, pp. 337-371.

[6] J. Escalante, A. Pack, and A. Barrett. "AI-generated feedback on writing: insights into efficacy and ENL student preference". In: International Journal of Educational Technology in Higher Education 20.1 (2023), p. 57.

[7] G. Pinto et al. "Large language models for education: Grading open-ended questions using ChatGPT". In: Proceedings of the XXXVII Brazilian Symposium on Software Engineering. 2023, pp. 293-302.

[8] K. Yang et al. "Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 20. 2024, pp. 22466-22474.

[9] K. Benoit et al. "Text as data: an overview". In: The SAGE Handbook of Research Methods in Political Science and International Relations. 2020, pp. 461-497.

[10] V. González-Calatayud, P. Prendes-Espinosa, and R. Roig-Vila. "Artificial intelligence for student assessment: A systematic review". In: Applied Sciences 11.12 (2021), p. 5467.

[11] R. Kumar. "Faculty members' use of artificial intelligence to grade student papers: a case of implications". In: International Journal for Educational Integrity 19.1 (2023), pp. 1-10.

[12] S. Golchin and M. Surdeanu. "Time travel in LLMs: tracing data contamination in large language models". In: arXiv preprint arXiv:2308.08493 (2023).

[13] K. Ahuja et al. "Mega: multilingual evaluation of generative ai". In: arXiv preprint arXiv:2303.12528 (2023).

[14] S. E. Stemler. "A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability". In: Practical Assessment, Research, and Evaluation 9.1 (2004), p. 4.

[15] X. Wang et al. "Does role-playing chatbots capture the character personalities? assessing personality traits for role-playing chatbots". In: arXiv preprint arXiv:2310.17976 (2023).

[16] K. Struyven, F. Dochy, and S. Janssens. "Students' perceptions about evaluation and assessment in higher education: A review". In: Assessment & Evaluation in Higher Education 30.4 (2005), pp. 325-341.

Appendix A: Pilot Study

Prior to the full study, a pilot study was carried out. This involved the grading of a set of three papers (called papers A, B, and C below). The purpose of the pilot study was to assess the general viability of the approach and how variation in prompting affected results. No statistical analysis was carried out.

The papers were written as the final research assignment within the course "International Administration and Policy" and examine various policy interventions by international institutions in specific developing countries. Each paper was around 3,500 words, excluding references.

Procedure: Each paper was submitted to ChatGPT (GPT-4) together with prompts of gradually increasing specificity of instruction (Table 3). Each iteration involved uploading the paper, possible reference material, and inputting the prompt. The suggested numerical grade was collected from GPT-4 output (Table 3). Comments volunteered by GPT-4 were not collected.

Table 3: Summary of results for papers A-C

Grader                                              A     B     C
GPT-4: Basic                                        6     6     6
GPT-4: Basic + grading criteria                     5.5   6     6
GPT-4: Basic + reference paper                      5.5   4     6
GPT-4: Basic + grading criteria + reference paper   6     5     6
Human grader                                        5     4     7

Grades for papers A, B, and C.

At that stage, three preliminary impressions were noted:

First, while N is very low and no statistical analysis has been performed, the results suggest the possibility of alignment between GPT-4 and human graders. The average GPT grades suggest an overall ranking of papers A, B, and C equivalent to that implied by the human grader.

Second, GPT-4 is biased upwards for papers A and B and downwards for paper C. Taken together, this suggests that GPT-4 is risk averse, seemingly avoiding assigning low or high grades. Potentially, this is suggestive of the "polite" or "optimistic" traits that researchers have attributed to GPT-4.

Third, increasing the granularity and specificity of prompts appears to have the potential to increase alignment between GPT-4 and human graders. In this small sample, the provision of a reference paper with a pre-assigned grade generated greater alignment.

Appendix B: Prompts

Basic: I want you to act as a teaching assistant grading a student paper from a master's level course in political science. I will input a paper and I want you to grade the paper on a scale from 1 to 7 where 1 is a failing grade and 7 the highest possible grade. Grade the attached paper and provide the numerical grade (nothing else).

Grading criteria: I want you to act as a teaching assistant grading a student paper from a master's level course in political science. I will input a paper and I want you to grade the paper on a scale from 1 to 7 where 1 is a failing grade and 7 the highest possible grade. Grade the attached paper using the also attached grading matrix ("grading matrix"). Provide the numerical grade and nothing else.

Reference paper: I want you to act as a teaching assistant grading a student paper from a master's level course in political science. I will input a paper and I want you to grade the paper on a scale from 1 to 7 where 1 is a failing grade and 7 the highest possible grade. I will attach two reference papers, the first of which received a grade of 4 ("reference 4") and the second of which received a grade of 7 ("reference 7"). Grade the attached paper ("paper") calibrating against the quality of the reference papers. Provide the numerical grade and nothing else.

Everything: I want you to act as a teaching assistant grading a student paper from a master's level course in political science. I will input a paper and I want you to grade the paper on a scale from 1 to 7 where 1 is a failing grade and 7 the highest possible grade. I will attach two reference papers, the first of which received a grade of 4 ("reference 4") and the second of which received a grade of 7 ("reference 7"). I will also attach a grading matrix ("grading matrix") that show the requirements for each grade. Grades should average around 5 with a standard deviation of 1.5. Grade the attached paper ("paper") calibrating against the quality of the reference papers and taking the grading matrix and overall grade distribution into account. Don't be afraid to award low or high grades for papers that are clearly better or worse than the average. Provide the numerical grade and nothing else.

Appendix C: Grading Matrix

Clarity and structure

1 Paper lacks logical flow; sections are disjointed and the introduction, body, and conclusion are unclear or missing.
2 Paper lacks logical flow; sections are disjointed and the introduction, body, and conclusion are unclear or missing.
3 Basic structure is present, but transitions between sections are weak; organization is somewhat confusing.
4 Adequate structure; clear sections, but some transitions might be abrupt.
5 Well-organized; good flow between sections and clear overall structure.
6 Very well-organized; sections are logically structured and transitions are smooth, enhancing readability.
7 Exceptionally clear and structured; the organization enhances the argument's effectiveness and engages the reader throughout.

Analysis

1 Superficial analysis with major misunderstandings of the topic; lacks critical engagement with data.
2 Superficial analysis with major misunderstandings of the topic; lacks critical engagement with data.
3 Some analysis present but lacks depth; minimal critical engagement with sources.
4 Satisfactory analysis with reasonable interpretation of data; demonstrates an understanding of the topic.
5 Good, detailed analysis; uses data effectively to support arguments.
6 Very strong analysis; demonstrates depth and insight in data interpretation and critical engagement.
7 Outstanding depth in analysis; insightful and demonstrates sophisticated understanding and application of data.

Integration of literature

1 Inadequate or incorrect use of literature; major citation issues.
2 Inadequate or incorrect use of literature; major citation issues.
3 Uses some relevant literature but often relies on non-academic sources or superficial references.
4 Good use of academic sources but integration into the argument could be improved.
5 Strong use and integration of academic literature to support the paper's arguments.
6 Very strong integration, using a diverse range of relevant academic resources effectively.
7 Exceptional integration of literature; uses academic sources creatively and effectively to advance a compelling argument.

Empirical rigor

1 Empirical content is largely inaccurate or irrelevant; sources are unreliable or misinterpreted.
2 Empirical content is largely inaccurate or irrelevant; sources are unreliable or misinterpreted.
3 Some relevant empirical content but with inaccuracies or generalizations.
4 Reasonable accuracy and relevance of empirical content with minor errors.
5 Accurate and relevant empirical analysis, well-supported by reliable sources.
6 Very accurate empirical work, detailed and well-supported by high-quality sources.
7 Exceptionally rigorous and accurate empirical analysis, setting a high standard for academic research.

Quality of writing

1 Numerous grammatical, stylistic, or formatting errors; does not follow academic writing standards.
2 Numerous grammatical, stylistic, or formatting errors; does not follow academic writing standards.
3 Some errors in grammar and style; inconsistent adherence to academic writing standards.
4 Generally follows academic writing standards with minor errors.
5 Good quality of writing; well-edited and formatted correctly according to academic standards.
6 Very high-quality writing; very few errors and excellent adherence to academic style.
7 Exceptional writing quality; flawless in terms of grammar, style, and adherence to the highest academic standards.
