
Computers and Education: Artificial Intelligence 3 (2022) 100059


Towards automated content analysis of educational feedback: A multi-language study

Ikenna Osakwe a,*, Guanliang Chen a, Alex Whitelock-Wainwright a, Dragan Gašević a, Anderson Pinheiro Cavalcanti b, Rafael Ferreira Mello c

a Monash University, Clayton, VIC, Australia
b Centro de Informática, Universidade Federal Rural de Pernambuco, Recife, PE, Brazil
c Departamento de Computação, Universidade Federal Rural de Pernambuco, Recife, PE, Brazil

Keywords: Natural language processing; Classification; Feedback; Content analysis

Abstract

Feedback is a crucial element of a student's learning process. It enables students to identify weaknesses and improve self-regulation. However, studies show this to be an area of great dissatisfaction in higher education. With ever-growing course participation numbers, delivering effective feedback is becoming an increasingly challenging task. The efficacy of feedback will depend on four levels of feedback; namely, feedback about the self, task, process or self-regulation. Hence, this paper explores the use of automated content analysis to examine feedback provided by instructors for feedback practices measured on the self, task, process, and self-regulation levels. For this purpose, four binary XGBoost classifiers were trained and evaluated, one for each level of feedback. The results indicate effective classification performance on the self, task, and process levels, with accuracy values of 0.87, 0.82, and 0.69, respectively. Additionally, the inter-language transferability of feedback features is measured using cross-language classification performance and feature importance analysis. Findings indicate a low generalizability of features between the English and Portuguese feedback spaces.

1. Introduction

Quality education requires personal attention and support. A crucial element of this is feedback, whose definition changes depending on the educational research literature. Within this paper, feedback is conceptualized as the flow of information from one agent to another regarding a learner's decision (Hattie & Timperley, 2007). Numerous studies have highlighted the significant role feedback plays in the learning process. Price et al. (2010) state that feedback is the most critical component of assessments. Likewise, a meta-analytic study of 450,000 effect sizes across 180,000 studies concluded that feedback was the biggest contributor to achievement (Hattie, 1999). Feedback directs learners to the appropriate type of study or practice and helps individuals recognize areas of deficiency, which can be used to enhance learning tactics and strategies (Parikh et al., 2001; Weaver, 2006; Glover & Brown, 2006). Dweck (1999) reports that feedback can affect a student's motivation, as well as what is learned and in what manner. Butler and Winne (1995) theorize that feedback promotes learning by scaffolding consistent beliefs, developing prior knowledge, or correcting inconsistent beliefs. Laurillard emphasized that "action without feedback is completely unproductive for the learner" (Laurillard, 2013, p. 61).

Despite widespread recognition of feedback's importance to learning, much of the current literature indicates a pervasiveness of low quality feedback in higher education (Hattie & Gan, 2011). Feedback quality is consistently rated one of the greatest causes of dissatisfaction for higher education students (Ferguson, 2011). Weaver (2006) reports that while collegiate scholars acknowledge the value of feedback in facilitating learning, they find instructors' feedback comments to be incomprehensible and ineffectual. Ferguson (2011) identifies a lack of timely feedback delivery, unclear expectations and low utility as key concerns amongst learners. Mulliner and Tucker (2017) conducted a survey of higher education academic staff and students that found 93% of staff were satisfied with the quality of feedback provided, compared to just 67% of students being satisfied with the quality of feedback received.

* Corresponding author.
E-mail addresses: richard.osakwe@monash.edu (I. Osakwe), guanliang.chen@monash.edu (G. Chen), alex.wainwright@monash.edu (A. Whitelock-Wainwright),
dragan.gasevic@monash.edu (D. Gašević), apc@cin.ufpe.br (A. Pinheiro Cavalcanti), rafael.mello@ufrpe.br (R. Ferreira Mello).

https://doi.org/10.1016/j.caeai.2022.100059
Received 30 May 2021; Received in revised form 1 March 2022; Accepted 1 March 2022
Available online 13 March 2022
2666-920X/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

As higher educational institutions embrace technology, there is a growing portfolio of approaches that utilize ubiquitous data collection to improve learning processes and design decisions. The field of Learning Analytics (LA) is one such approach and could provide a solution to the state of feedback in education. LA researchers are actively exploring automated feedback solutions that can enable instructors to efficiently identify and deliver effective feedback, and improve the speed of feedback delivery to students (Keuning et al., 2018; Liu et al., 2017; Ma et al., 2017; Villalón et al., 2008; Wijewickrema et al., 2018). In that vein, several studies (Liu et al., 2017; Ma et al., 2017; Villalón et al., 2008; Wijewickrema et al., 2018) have examined the use of data mining methods to generate automated textual feedback. For instance, Villalón et al. (2008) and Wade-Stein and Kintsch (2004) employ Latent Semantic Analysis (LSA) to provide textual feedback on writing coherence to students. However, these analyses are often limited to domain-specific areas such as computer programming or writing, or lack grounding in educational theory. Much less work has gone into the exploration of automated domain-agnostic analysis to identify informative feedback practices (Cavalcanti et al., 2019, 2020; see Section 2.3 for details). Progress in such areas can enhance the instructor's ability to provide effective feedback comments and to analyze the features associated with effective feedback for generalizable feedback generators. Furthermore, we conduct our analyses across different languages to measure the extent to which constructed tools can be adapted or utilized by wider audiences. Therefore, this study aims to answer the following Research Questions (RQs):

1. To what extent can the automated analysis of feedback messages be used to identify informative feedback components?
   (a) How accurate are the predictions that are made about these feedback practices?
   (b) What are specific features of text that can be used to predict feedback components?
   (c) How transferable are the identified feedback features to text written in different languages?
   (d) What is the most minimal feature set needed to make accurate predictions?

2. Background

2.1. Feedback in the learning process

It is often theorized that the primary objective of feedback is to reduce the discrepancy between one's knowledge/exercise performance and a goal (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006; Butler & Winne, 1995; Leibold & Schwarz, 2015). Hence, feedback is often examined through the lens of self-regulated learning (Butler & Winne, 1995; Clark, 2012; Nicol & Macfarlane-Dick, 2006). Most theorists recognize that self-regulated learners are the most effective learners (Butler & Winne, 1995). Academically speaking, self-regulation is a cycle of setting knowledge construction goals, selecting strategies that maximize progress towards achieving said goals, and monitoring progress with the possibility of altering strategies depending on the level of progression (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). Feedback is a critical catalyst throughout this operation. While learners monitor task progression, they form internal feedback (Nicol, 2021), which shapes adopted tactics (cognitive routines to execute a goal) and strategies (monitoring and use of cognitive routines). More effective self-regulators generate better feedback or better employ the internal feedback to achieve goals (Nicol & Macfarlane-Dick, 2006).

As self-regulating learners engage in academic exercises, they utilize cognitive, meta-cognitive, and motivational processes to form an ideological construct of the exercise's details and requirements, alongside the setting of goals (Butler & Winne, 1995; Narciss, 2013; Nicol & Macfarlane-Dick, 2006). The extent to which the learner's goals align with those of the instructor will vary with the student's understanding of the exercise and motivations. The learner's goals determine strategies and tactics, which generate outcomes that can be mental (i.e., cognitive or affective/motivational states) or externally observable (e.g., a tangible product or the learner's behavior). During the monitoring phase, the learner generates internal feedback and assesses goal progression. This internal feedback shapes additional tactics and strategies, including knowledge, motivation, and beliefs (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). External feedback can be received from a variety of sources including an instructor, peers, or a machine. This is then used to augment, confirm, or adjust the student's interpretation of the task's properties and learning path (Nicol & Macfarlane-Dick, 2006). However, for external feedback to play a meaningful role in the learning process it must be effective enough to be interpreted and acted upon by the student (Nicol & Macfarlane-Dick, 2006).

2.2. Drivers of effective feedback

The literature identifies good practices of feedback in education. Nicol and Macfarlane-Dick (2006) are highly cited in the literature on feedback for learning. They largely identify good feedback as that which improves the learner's self-regulation abilities. The ability of feedback to improve self-regulation with greater efficacy will largely be driven by its informativeness, polarity, and timing.

2.2.1. Informativeness

Researchers theorize that effective feedback should integrate feed-up, feed-back, and feed-forward (Hattie & Timperley, 2007; Fisher & Frey, 2009). Feed-up clarifies goals and expectations to the learner. In line with feedback's primary purpose of reducing the discrepancy between current knowledge/task performance and a goal, one can surmise that without clearly defined goals feedback becomes substantially less potent (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Additionally, students are more likely to engage with learning tasks when they understand expectations (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Feed-back informs the learner on the level of progress towards an expected outcome (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). Feedback is more efficacious when it informs the learner on progress (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Students employ progress details to shape tactics and strategies (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). Feed-forward directs the learner on how to enhance progress (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Feed-forward can have a great impact on learning by enabling the student to determine the next steps in learning and how to tackle them (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). Hattie and Timperley (2007) report that feed-up, feed-back and feed-forward can operate at four levels (Table 1). Feedback's ability to lessen the discrepancy between performance and goals will be determined by the four levels at which it operates (Butler & Winne, 1995; Hattie & Timperley, 2007).

2.2.2. Polarity

Stern and Solomon (2006) and Weaver (2006) propose that effective feedback should constitute positive (reinforcing) comments, including evaluations on how and to what degree the goals were completed. Positive comments can also improve the learner's motivation and learning experience (Stern & Solomon, 2006; Weaver, 2006). Conversely, negative (corrective) feedback may be useful in adjusting inadequate behavior by alerting the learner to substandard strategies (Ilgen & Davis, 2000). Hattie and Timperley (2007) and Butler and Winne (1995) theorize that the level at which feedback operates should determine the polarity of comments. These studies (Butler & Winne, 1995; Hattie & Timperley, 2007) advise against either positive (e.g., "You're a very smart student!") or negative comments (e.g., "You're a careless student") at the self level. At the self-regulation level, the impact of positive and negative comments will be influenced by the learner's
motivation (Hattie & Timperley, 2007; Kluger & Dijk, 2010). The effectiveness of positive feedback is likely improved when the learner is promotion focused (i.e., undergoing a task one wants to do), while a prevention focused (i.e., undergoing a task one needs to do) learner might benefit from negative feedback (Hattie & Timperley, 2007; Kluger & Dijk, 2010).

Table 1
Four levels of feedback identified by Hattie and Timperley (2007). Each level specifies different elements that the feedback is targeting and can be regarded as hierarchical, ranging from general comments made about the students themselves up to directives on how to improve self-regulation. For each of the levels, a description and example are given to clarify the feedback type.

Level | Description | Example
Feedback about the self (FS) | Personal evaluations about the learner | "You are a bright student"
Feedback about the task (FT) | How well tasks are understood or performed | "You need to include more about the Treaty of Versailles."
Feedback about the process (FP) | Processes needed to understand or perform tasks | "You need to edit this piece of writing by attending to the descriptors you have used so the reader is able to understand the nuances of your meaning."
Feedback about self-regulation (FR) | How to improve self-regulation | "You already know the key features of the opening of an argument. Check to see whether you have incorporated them in your first paragraph."

2.2.3. Timeliness

Timeliness has a significant role in the effectiveness of feedback. Feedback that is too slow to arrive is unlikely to be used by the learner; however, feedback that is received too early may impede the learner's self-regulation ability (Yang & Carless, 2013). The level of feedback operation should also affect the timing of feedback and its efficiency. Several studies (Bangert-Drowns et al., 1991; Clariana et al., 2000; Hattie & Timperley, 2007) have found that at the task level some delay may be more efficacious, whereas immediacy may be more efficacious for feedback at the process level.

2.2.4. Summary

Taking into account the frameworks posed by these papers, it is important to understand that these principles are not mutually exclusive. In fact, good feedback will often comprise a combination of these elements. However, a thorough examination of the presented feedback drivers reveals the importance of recognizing the various levels of feedback operation (i.e., feedback about the self, task, process and self-regulation) and how these feed into selected feedback strategies. The dimensions of feedback highlight the required nuance and associated difficulty involved with administering effective feedback, particularly at scale, thus signalling the growing need for automated feedback tools and technologies.

2.3. Automated content analysis of feedback

Automated feedback content analysis involves the use of natural language processing to extract features from textual feedback. Automated content analysis can provide significant contributions to feedback quality in education. These tools can aid in lessening the level of dissonance between instructors and learners concerning feedback quality (Mulliner & Tucker, 2017), by effectively examining feedback comments for the presence of informative feedback components (Cavalcanti et al., 2019, 2020). At a product level, a tool can be constructed to train instructors to employ effective feedback. Additionally, automated content analysis can be used as a module for automated textual generators; in pursuance of flexible feedback generators that are grounded in educational literature, identified tools should be able to learn the features associated with informative feedback practices. Despite numerous benefits, limited work has gone into this area; this study was able to identify just two previous works on this topic.

The authors of these works propose the use of supervised machine learning algorithms to analyze properties of written feedback. Cavalcanti et al. (2019, 2020) utilize binary random forest classifiers to examine feedback comments delivered to undergraduate students. Cavalcanti et al. (2019) analyze instructor feedback comments through the lens of Nicol and Macfarlane-Dick's (2006) seven principles of good feedback; however, Cavalcanti et al. (2020) suggest that these indicators may be too general for textual analysis and propose Hattie and Timperley's (2007) four levels of feedback operation due to its focus on learning tasks, learning processes, and self-regulation. Cavalcanti et al. (2019) were able to detect the presence of one or more of Nicol and Macfarlane-Dick's principles of good feedback with an accuracy of 0.75, while Cavalcanti et al.'s (2020) best performing models obtained accuracies of 0.75, 0.64, and 0.87 for classification of the task, process, and self-regulation levels of feedback, respectively.

The examined papers provide initial advancements into automated content analysis, yet suffer from significant limitations. Firstly, the datasets used in these studies were from just two courses in Portuguese; hence, conclusions drawn from their findings may be limited by the lack of data diversity. This study builds on the works of Cavalcanti et al. (2019, 2020) by undertaking a more expansive analysis that includes an additional dataset of feedback comments written in the English language, with the goal of building a classifier that better captures the feedback vector space. This will be undertaken in answering research questions RQ1a and RQ1b.

The works of Cavalcanti et al. (2019, 2020) are also limited by the size of their models' feature space, having employed 120 features for their developed classifiers. Larger dimensionality reduces the interpretability of machine learning models and may hinder the ability to detect the most significant features (Holzinger, 2018). In research question RQ1d, we explore the potential of minimizing the feature set, while maintaining or improving accuracy, to improve the explainability of the produced models.

Additionally, for research question RQ1c we explore the transferability of feedback features across languages. These contributions will expand the flexibility of automated feedback analysis tools.

3. Method

3.1. Data

The dataset used in the current study consisted of feedback comments provided by instructors; a total of 1272 instructor feedback comments were collected. These comments included 1000 examples of feedback written in Portuguese from fully online undergraduate Biology and Literature courses. Feedback was delivered to students within two weeks of the assignment. The 272 English feedback examples were from Learning Analytics (LA), Software Engineering (SE), and Software Development (SD) courses. The LA and SE courses were online Master's courses. Feedback was provided for various written assessments throughout the courses' duration. The SD course was undertaken in person by undergraduate Computer Science students. Feedback was delivered online within two weeks of submission, for group work assignments throughout the semester.

The Portuguese dataset was studied using the entire feedback message due to its brevity — average comment length was 30.00 words per comment with a standard deviation of 30.00. The English dataset consisted of feedback much longer in nature, with an average of approximately 440.97 words per comment (standard deviation = 298.41). In order to promote a more comparable level of analysis while maintaining semantically meaningful units of analysis, a more granular approach was taken by splitting the English comments at the paragraph level. The
resultant 1092 English feedback paragraphs had an average length of 94.28 words and a standard deviation of 52.54 words. Subsequently, the total number of observations increased from 1272 to 2092 records, with an average of 63.60 words and a standard deviation of 53.69.
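For illustration, this segmentation step is straightforward to implement. The sketch below is a simplification under stated assumptions: it assumes comments are stored as plain text with blank lines separating paragraphs, and the function name and data layout are hypothetical.

```python
# Sketch: keep short Portuguese comments whole but split the longer English
# comments into paragraph-level records. Assumes paragraphs are separated
# by blank lines; the function name and data layout are hypothetical.
def to_units(comment: str, language: str) -> list[str]:
    if language == "pt":
        return [comment.strip()]
    return [p.strip() for p in comment.split("\n\n") if p.strip()]
```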

3.2. Coding scheme

Two possible coding schemes were examined for labeling the data used in this study: (i) Nicol and Macfarlane-Dick's (2006) seven guiding principles of good feedback, and (ii) Hattie and Timperley's (2007) four levels of feedback operation. Cavalcanti et al. (2020) note that Nicol and Macfarlane-Dick's (2006) indicators may be too general for feedback analysis and propose the use of Hattie and Timperley's (2007) four levels of feedback for its better suitability for textual analysis due to its focus on learning tasks, learning processes, and self-regulation. Hence, feedback examples were coded using Hattie and Timperley's (2007) proposed four levels of feedback.

Feedback examples were coded by experts using the instructions of Hattie and Timperley's (2007) study. Each feedback record was examined by two expert coders separately. After this step, the differences between each pair of experts were compared. For the Portuguese feedback examples, the inter-rater agreement reached 72.2% with a Cohen's kappa (inter-rater agreement accounting for chance; Cohen, 1960) of 0.44. The English feedback comments had an inter-rater agreement of 63.8% and a Cohen's kappa of 0.38. Another two experts who did not participate in the first coding stage resolved the divergent cases. These measures met expectations for content analysis experimentation (Neuendorf, 2017).

The annotation process led to a dataset with four sets of binary classes: class 0 if a feedback message did not belong to a particular level; class 1 if the feedback message belonged to the feedback level.
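As a minimal sketch of how agreement statistics of this kind can be computed, assuming two aligned vectors of binary labels from the two coders (the toy data below are hypothetical), scikit-learn's cohen_kappa_score can be used alongside raw percent agreement:

```python
# Sketch: percent agreement and Cohen's kappa between two coders, using
# two aligned vectors of binary labels (the toy data here are hypothetical;
# the study computed these per feedback level and per language).
import numpy as np
from sklearn.metrics import cohen_kappa_score

coder_a = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
coder_b = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

agreement = np.mean(coder_a == coder_b)        # raw percent agreement
kappa = cohen_kappa_score(coder_a, coder_b)    # chance-corrected agreement
print(f"agreement = {agreement:.1%}, kappa = {kappa:.2f}")
```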
3.3. Feature engineering

Feature extraction was informed by relevant studies (Cavalcanti et al., 2019, 2020; Kovanović et al., 2016). These studies promote the use of linguistic features such as those developed in LIWC (Linguistic Inquiry and Word Count) (Tausczik & Pennebaker, 2010) and Coh-Metrix (Graesser et al., 2004) over traditional textual features such as lexical N-grams or Part-Of-Speech tags. According to Kovanović et al. (2016), such traditional features encourage overfitting by inflating the feature space. Additionally, they are data dependent and thus make it difficult to define the feature space beforehand (Kovanović et al., 2016). Hence, we used feature sets that incorporated 86 LIWC (Tausczik & Pennebaker, 2010) features, 78 Coh-Metrix (Graesser et al., 2004) features, and two additional features relevant to this content area — the number of named entities and the language of delivered feedback.

3.3.1. LIWC

A lexical tool that analyzes text across various psychological dimensions (Tausczik & Pennebaker, 2010). The main categories provided by LIWC include cognitive processes, social processes, informal language, personal concerns, affect, relativity, time orientation, drives, and perceptual processes. The relative contribution of these categories in the text offers a descriptive profile of the various psychological constructs involved in the writing (Tausczik & Pennebaker, 2010). The LIWC dictionary has over 120,000 words, where each word can be assigned to one or more categories (Tausczik & Pennebaker, 2010).

3.3.2. Coh-Metrix

A computational linguistics system that measures a set of features about a text that are widely adopted in educational research to evaluate the quality of written activities (Graesser et al., 2004). Coh-Metrix can provide insight into the cohesion, language, complexity, and readability of texts (Graesser et al., 2004).

3.3.3. Language of delivered feedback

This study records the language of the delivered feedback as an additional feature to account for potential inter-language differences in data distribution.

3.3.4. Named entities

The number of named entities is measured and may provide information on the level of detail in delivered feedback comments. According to Cavalcanti et al. (2019, 2020), the number of named entities may be useful in predicting FS level feedback.

3.4. Analysis

3.4.1. Data analysis and pre-processing

For the general classifier, feedback examples from both the English and Portuguese datasets were combined and split into 70% training and 30% test sets (Table 2). The validation set is implicitly extracted from the training data during the training process. The training data suffered from class imbalances, particularly at the FS and FR levels (see Table 2).

Table 2
Number of instances for each class in the training and test datasets for each level of feedback.

Level | Split | Class 0 | Class 1 | Total
FS | Train | 1149 (82.19%) | 249 (17.81%) | 1398 (70%)
FS | Test | 567 (82.17%) | 123 (17.83%) | 690 (30%)
FT | Train | 602 (43.06%) | 796 (56.94%) | 1398 (70%)
FT | Test | 297 (43.04%) | 393 (56.96%) | 690 (30%)
FR | Train | 1290 (92.27%) | 108 (7.73%) | 1398 (70%)
FR | Test | 637 (92.32%) | 53 (7.68%) | 690 (30%)
FP | Train | 808 (57.80%) | 590 (42.20%) | 1398 (70%)
FP | Test | 399 (57.83%) | 291 (42.17%) | 690 (30%)

A genetic algorithm (GA) was used to search for an effective sequence of class balancing (sampling) techniques. A GA is an evolutionary technique for executing a heuristic search. The algorithm is initialized with a population of potential solutions (individuals). Each individual is represented using a chromosome that consists of a number of genes. For our problem space, an individual is a sequence of sampling techniques. Each gene is a number from 0 to 11, with 1–11 each representing one of the candidate sampling algorithms and 0 representing the absence of an algorithm. After each iteration, the best performing individuals are randomly combined and mutated. These altered individuals comprise the new population for the next iteration.

Following the advice of Baeck (2000), a chromosome size of 7 was used. The GA was initialized with a population size of 50 and ran for 2000 iterations for each level of feedback. The performance of each individual was assessed using a combination of accuracy, kappa and frequently used imbalanced-classification performance measures (Luque et al., 2019): AUC, F1 score, geometric mean, and average precision. The crossover rate (probability of combining two individuals) and the mutation rate (probability of mutating a gene) were set to 50% and 15%, respectively. At the conclusion of the experiment, the best performing sampling sequence for each level of feedback was:

FS: TomekLinks → NCL → ADASYN → ENN
FT: TomekLinks → BSMOTE → SVMSMOTE → ADASYN → OSS → BSMOTES → NM
FR: TomekLinks → IHT → SMOTE → NM → OSS → OSS → OSS
FP: SMOTE → NCL → SVMSMOTE → NCL → TomekLinks
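As an illustration of the search procedure described above, the following condensed sketch uses imbalanced-learn samplers and XGBoost. It is a simplification, not the study's implementation: the article does not enumerate all eleven candidate samplers, so the gene mapping below is an assumed representative set matching the acronyms in the reported sequences; the fitness function is plain F1 on a held-out split rather than the reported composite measure; and the population size and iteration counts are reduced for readability.

```python
# Sketch of the GA search over sampling pipelines (simplified illustration).
# The gene-to-sampler mapping is an assumption covering the acronyms in the
# reported sequences; fitness is plain F1 on a held-out split, a stand-in
# for the paper's composite of accuracy, kappa, AUC, F1, G-mean and AP.
import random
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE
from imblearn.under_sampling import (
    TomekLinks, NeighbourhoodCleaningRule, EditedNearestNeighbours,
    OneSidedSelection, NearMiss, InstanceHardnessThreshold)
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

SAMPLERS = [None, SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE, TomekLinks,
            NeighbourhoodCleaningRule, EditedNearestNeighbours,
            OneSidedSelection, NearMiss, InstanceHardnessThreshold]
CHROM_LEN, POP_SIZE, GENERATIONS = 7, 10, 20      # paper: 7, 50, 2000
CROSSOVER_RATE, MUTATION_RATE = 0.5, 0.15         # as reported

def apply_pipeline(genes, X, y):
    """Apply the encoded sampler sequence; gene 0 is a no-op."""
    for g in genes:
        if SAMPLERS[g] is not None:
            X, y = SAMPLERS[g]().fit_resample(X, y)
    return X, y

def fitness(genes, X_tr, y_tr, X_val, y_val):
    try:
        X_res, y_res = apply_pipeline(genes, X_tr, y_tr)
        model = XGBClassifier(n_estimators=50, verbosity=0).fit(X_res, y_res)
        return f1_score(y_val, model.predict(X_val))
    except Exception:                 # degenerate resampling scores zero
        return 0.0

def evolve(X_tr, y_tr, X_val, y_val):
    pop = [[random.randrange(len(SAMPLERS)) for _ in range(CHROM_LEN)]
           for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=lambda g: fitness(g, X_tr, y_tr, X_val, y_val),
                 reverse=True)
        elite, children = pop[:POP_SIZE // 2], []
        while len(elite) + len(children) < POP_SIZE:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, CHROM_LEN)  # one-point crossover
            child = a[:cut] + b[cut:] if random.random() < CROSSOVER_RATE else a[:]
            children.append([random.randrange(len(SAMPLERS))
                             if random.random() < MUTATION_RATE else g
                             for g in child])
        pop = elite + children
    return max(pop, key=lambda g: fitness(g, X_tr, y_tr, X_val, y_val))
```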

3.5. Model selection and evaluation — RQ1a

Prior research on feedback analysis found success with decision tree ensembles; namely, the Random Forest (RF) algorithm (Cavalcanti et al., 2019, 2020). Decision tree ensembles are widely regarded classification algorithms that are well suited to feedback analysis (Cavalcanti et al., 2019, 2020). This is due to their white-box properties, easy interpretability, high accuracy and ability to identify important features in a dataset (Cavalcanti et al., 2019, 2020; Chen & Guestrin, 2016; Denisko & Hoffman, 2018).

This study employed a decision tree ensemble implementation called XGBoost, or eXtreme Gradient Boosting (Chen & Guestrin, 2016). XGBoost has been shown to outperform Random Forest on numerous classification tasks (Pan, 2018; Xiao et al., 2017). The algorithm utilizes gradient boosting, which involves sequentially combining models (in this case, decision trees) that predict the residuals or errors of previous models at each iteration to improve overall accuracy (Chen & Guestrin, 2016). It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. XGBoost is ideal due to its superior accuracy and its implicit analysis of feature importance (Chen & Guestrin, 2016). Four binary XGBoost classifiers were trained, one for each level of feedback.

Previous studies (Cavalcanti et al., 2019, 2020) on textual feedback analysis have evaluated models using accuracy and Cohen's kappa (κ). Accuracy can be calculated as the ratio of correctly identified samples to the total number of samples. Furthermore, Cohen's κ measures inter-rater agreement while taking into consideration the possibility of agreement by chance; it can thus be used to measure the level of agreement between automatically and manually coded messages in the dataset (Cavalcanti et al., 2019, 2020; Cohen, 1960). The F1 score (He & Garcia, 2009) can be used for assessing model performance with imbalanced data; it captures information on the completeness and exactness of positive predictions (He & Garcia, 2009). These evaluation measures are used to assess the performance of the developed feedback classifiers and to contrast it with that of similar works on automated feedback analysis.
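A minimal sketch of this training and evaluation setup follows, assuming a prepared feature matrix and one binary label vector per feedback level; the hyperparameters shown are illustrative defaults, not the study's tuned values.

```python
# Sketch: one binary XGBoost classifier per feedback level, scored with
# accuracy, Cohen's kappa and F1. `X` and the per-level 0/1 label vectors
# are assumed to be prepared elsewhere (the 166 LIWC/Coh-Metrix/named-
# entity/language features); hyperparameters are illustrative only.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from xgboost import XGBClassifier

def train_level_classifier(X, y, seed=42):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)   # 70/30 split
    model = XGBClassifier(n_estimators=200, max_depth=4,
                          learning_rate=0.1, verbosity=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return model, {"ACC": accuracy_score(y_te, pred),
                   "K": cohen_kappa_score(y_te, pred),
                   "F1": f1_score(y_te, pred)}

# results = {lvl: train_level_classifier(X, labels[lvl])[1]
#            for lvl in ("FS", "FT", "FR", "FP")}
```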
3.5.1. Feature analysis — RQ1b

The outputs of decision tree models can be analyzed with tools such as SHAP (SHapley Additive exPlanations) (Lundberg et al., 2019). SHAP utilizes a game theory informed algorithm to explain the output of machine learning models. Given a machine learning model and data records as input, SHAP leverages the concept of Shapley values by measuring the average marginal contribution of a feature over all possible permutations. SHAP can diagnose the most impactful features using their SHAP value, which is the mean absolute contribution of each feature (Lundberg et al., 2019). A higher SHAP value for a feature implies a greater importance compared to another feature.
3.5.2. Feature transferability — RQ1c

To measure the transferability of features across languages, the dataset was split by language, creating Portuguese and English feedback datasets. Each of these datasets was split into training and test sets (70% training and 30% test), and binary classifiers were trained and tuned, resulting in English trained classifiers and Portuguese trained classifiers for each level of feedback, with the exception of the FR level. For the Portuguese feedback examples, the FR level had just eight positive instances out of 1000 records, which was not enough to train a machine learning algorithm (Hastie et al., 2009); hence, this level of feedback was excluded from all transferability analysis.

Once the English and Portuguese trained classifiers were developed, feature transferability was measured by:

i) inter-language prediction performance: the prediction performance (measured by accuracy, F1 score and Cohen's κ) of the English trained classifier on the English test set was compared to its performance on the Portuguese test set for the FS, FT, and FP levels of feedback. The same process was repeated for the Portuguese trained classifier;
ii) a comparison of significant features: the most important features for the English and Portuguese trained classifiers are compared at the FS, FT and FP levels of feedback.
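Measurement (i) reduces to scoring one model on two test sets; a sketch under the same assumptions as the previous snippets (the model and split names are hypothetical):

```python
# Sketch: measurement (i), inter-language prediction performance.
# `model_en` is assumed to be trained on English examples only, and the
# test splits come from the per-language 70/30 splits (hypothetical names).
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def evaluate(model, X, y):
    pred = model.predict(X)
    return {"ACC": accuracy_score(y, pred),
            "K": cohen_kappa_score(y, pred),
            "F1": f1_score(y, pred)}

# en_on_en = evaluate(model_en, X_te_en, y_te_en)   # same-language baseline
# en_on_pt = evaluate(model_en, X_te_pt, y_te_pt)   # cross-language transfer
# drop = {m: en_on_pt[m] - en_on_en[m] for m in en_on_en}
```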
3.5.3. Feature ablation — RQ1d

In conducting a feature ablation study, features were ranked in terms of their importance and sequentially removed from the feature set in ascending order; i.e., the lowest importance features were removed first. As the feature space was trimmed, classifiers were trained using the reduced feature sets and their accuracies were recorded on the test set. Finally, analysis was carried out by examining the relative change in prediction accuracy against the size of the feature space.
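The ablation loop itself is short; the following sketch reuses the hypothetical helpers from the previous snippets, dropping features from least to most important:

```python
# Sketch: iterative feature ablation, dropping features from least to most
# important (per the mean |SHAP| ranking) and retraining at each step.
# Reuses the hypothetical train_level_classifier helper sketched earlier.
import numpy as np

def ablation_curve(X, y, importance):
    order = np.argsort(importance)            # least important first
    curve = []
    for n_removed in range(X.shape[1]):
        keep = order[n_removed:]              # drop the n least important
        _, metrics = train_level_classifier(X[:, keep], y)
        curve.append((len(keep), metrics["ACC"]))
    return curve                              # (n_features, accuracy) pairs
```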

4. Results

To assess model performance, we report the F1 scores, Cohen's κ and accuracy values obtained by the binary classifiers on the test set, for each level of feedback. For further analysis, confusion matrices are also presented.

4.1. Model training and evaluation — RQ1a

In answering RQ1a, we compare the results of classifiers at the four levels of feedback using a selection of class balancing strategies (see Table 3). At the FS level, the classifier obtained better results without the use of class balancing strategies. However, the use of class balancing strategies was effective in improving classifier performance at the remaining three levels of feedback. A comparison of class balancing strategies showed SMOTE to have achieved roughly equal or higher accuracy, kappa and F1 values at the FS, FT and FP levels, although at the FR level the GA pipeline resulted in significantly higher kappa and F1 values.

Table 3
Performance of the classifiers trained to address research question RQ1 on the combined dataset involving both the English and Portuguese datasets. Legend: ACC – Accuracy; K – Cohen's kappa; F1 – F1 Score.

Class Balancing | FS (ACC / K / F1) | FT (ACC / K / F1) | FR (ACC / K / F1) | FP (ACC / K / F1)
None | 0.88 / 0.52 / 0.58 | 0.82 / 0.64 / 0.83 | 0.92 / 0.00 / 0.00 | 0.68 / 0.33 / 0.57
SMOTE | 0.87 / 0.51 / 0.58 | 0.82 / 0.65 / 0.83 | 0.91 / 0.04 / 0.07 | 0.69 / 0.35 / 0.59
GA | 0.83 / 0.45 / 0.56 | 0.82 / 0.64 / 0.82 | 0.59 / 0.12 / 0.24 | 0.70 / 0.35 / 0.57

4.2. Feature importance analysis — RQ1b

SMOTE balanced classifiers were chosen for feature analysis due to their superior results in three of the four levels of feedback. The top ten most impactful features were compiled using their SHAP values and can be viewed in Table 4.

At the FS level, larger values of features associated with affect (affective words, positive emotions, exclamation marks) and second person pronouns (particularly, "you") had the most impact on classification. The SHAP analysis also found that greater use of tentative language and informal speech was important in discerning FS. The features most important in predicting FT labels were related to the length and amount of information. These included word counts, the frequency of content words and the minimum frequency of content words, all of which had a positive association with FT observation. Another finding was that FT had a negative association with social processes and causation words. At the FR level, the most important features were associated with higher coherence and referential cohesion, i.e., noun, word and content overlap, and the mean edit distance of words. Additional significant features were time focused words (present and future), risk related words and differentiation words. Some of the most impactful features for the FP level can be linked to delivering corrective and new information, including the frequency of semicolons, negative connectives and discrepancy words. Other important features were topic coherence, the frequency of adverbs, future focused words, and words associated with relativity.

Table 4
Top 10 important features measured using SHAP, displayed from most to least important, for the FS, FT, FR and FP classifiers.

FS — Variable | Description | SHAP
liwc.Exclam | Freq. of exclamation marks | 1.02
liwc.posemo | Freq. of words with positive emotion | 0.73
liwc.you | Freq. of the word "you" | 0.24
liwc.affect | Freq. of affective words | 0.20
cm.SYNMEDlem | Minimal edit distance of lemmas | 0.20
cm.WRDFRQc | Freq. of content words | 0.15
liwc.tentat | Freq. of tentative words | 0.15
liwc.reward | Freq. of words associated with reward | 0.14
liwc.informal | Freq. of informal words | 0.14
cm.WRDPRP2 | Freq. of second person pronouns | 0.14

FT — Variable | Description | SHAP
cm.WRDFRQa | Freq. of all words | 0.46
cm.WRDFRQc | Freq. of content words | 0.39
cm.WRDFRQmc | Minimum freq. of content words | 0.34
cm.DRNP | Noun phrase density | 0.10
cm.DRAP | Adverbial phrase density | 0.10
liwc.SemiC | Freq. of semicolons | 0.08
cm.DESWLsy | Mean word length | 0.07
liwc.adverb | Freq. of adverbs | 0.07
liwc.social | Freq. of words related to social processes | 0.07
liwc.article | Freq. of articles | 0.07

FR — Variable | Description | SHAP
cm.CRFNO1 | Noun overlap between adjacent sentences | 0.56
cm.WRDPRP3s | Freq. of third person pronouns | 0.50
cm.CRFSO1 | Word stem overlap between adjacent sentences | 0.43
cm.DRAP | Adverbial phrase density | 0.35
cm.CRFCWOa | Content word overlap of all sentences | 0.25
liwc.risk | Freq. of risk related words | 0.23
liwc.differ | Freq. of words related to differentiation | 0.21
liwc.focusfuture | Freq. of future focus words | 0.21
liwc.focuspresent | Freq. of present focus words | 0.20
liwc.affiliation | Freq. of affiliation words | 0.16

FP — Variable | Description | SHAP
liwc.SemiC | Freq. of semicolons | 0.39
cm.LSASS1 | LSA measure of semantic coherence | 0.19
cm.CNCNeg | Freq. of negative connectives | 0.12
liwc.adverb | Freq. of adverbs | 0.11
cm.DESWLltd | Standard deviation of average no. of letters per word | 0.09
liwc.space | Freq. of words related to space | 0.09
liwc.verb | Freq. of verbs | 0.08
liwc.shehe | Freq. of third person singular pronouns | 0.07
cm.SYNLE | Mean no. of words before the main verb | 0.06
liwc.discrep | Freq. of words associated with discrepancy | 0.06

4.3. Feature transferability — RQ1c

As seen in Table 5, cross-language prediction resulted in sizable drops in accuracy and kappa values. One can further analyze feature transferability by examining the most important features for the English and Portuguese trained classifiers. At the FS level, features associated with affect and informal language were important for both languages. However, within the affective processes, Portuguese FS comments were more greatly impacted by the level of arousal (as measured using the frequency of exclamation marks). In comparison, English FS comments were greatly impacted by the presence of friendship and social processes. Furthermore, English FS comments were more tied to references to the person (measured using the frequency of second person pronouns). The most important features for the Portuguese trained classifier and the English trained classifier had no intersection at the FT level. For the Portuguese classifier, four of the ten most important features were related to the complexity of feedback comments; namely, lexical diversity, verb overlaps, edit distance of lemmas and age of acquisition of content words. Contrastingly, the English feedback comments at the FT level were highly associated with rewards. Other important features were the frequency of numbers, conjunctions and words related to social processes. At the FP level, the English and Portuguese classifiers had no overlap in their ten most important features. Important features for the Portuguese classifier included coherence measures, word counts (total and per sentence), and verb phrase density. The most important features for the English classifier consisted of the mean number of words before main verbs, lexical diversity, and the frequency of discrepancy words, personal pronouns, quotes and verbs.

Table 5
For RQ1c, classifiers are exclusively trained on English (EN) and Portuguese (PT) feedback examples. The performance of each classifier is measured against EN and PT feedback examples. Legend: ACC – Accuracy; K – Cohen's kappa; F1 – F1 Score.

Classifier | Test set | FS (ACC / K / F1) | FT (ACC / K / F1) | FP (ACC / K / F1)
EN Classifier | EN | 0.83 / 0.42 / 0.52 | 0.69 / 0.13 / 0.31 | 0.66 / 0.23 / 0.49
EN Classifier | PT | 0.85 / 0.03 / 0.04 | 0.11 / 0.00 / 0.00 | 0.49 / −0.02 / 0.30
PT Classifier | EN | 0.79 / 0.06 / 0.12 | 0.28 / 0.00 / 0.43 | 0.35 / 0.00 / 0.52
PT Classifier | PT | 0.94 / 0.74 / 0.77 | 0.91 / 0.49 / 0.95 | 0.78 / 0.56 / 0.79

4.4. Feature ablation — RQ1d

In order to reduce the size of the model, a feature ablation study was carried out. The shapes of the curves were largely concave downwards; hence, to minimize the number of features while maintaining a similar level of accuracy, one can look at where each curve starts to dip. This occurs at roughly 30, 15, 40 and 20 features for FS (Fig. 1a), FT (Fig. 1b), FR (Fig. 1c) and FP (Fig. 1d), respectively.

In order to analyze the remaining features of the four classifiers after ablation, these features were broadly categorized using categories
informed by LIWC (Tausczik & Pennebaker, 2010) and Coh-Metrix (Graesser et al., 2004), and similar features were grouped together. The distribution of these categories for each level of feedback can be seen in Table 6.

Fig. 1. For RQ1d, a feature ablation study is carried out by iteratively removing features, starting from the least important. The resultant feature set is used to train a classifier and the new model's accuracy is recorded. Four graphs are displayed showing accuracy on the y-axis and the number of features remaining on the x-axis (from 166 to 1). The x-axis has been inverted, meaning the values decrease from left to right. Ablation results are displayed for the four levels of feedback.

Table 6
In addressing RQ1d, the features that remain after ablation are analyzed and categorized. The names and distribution of these categories can be seen for the four levels of feedback.

Category | FS | FT | FR | FP
Personal Concerns | 1 | 1 | 0 | 0
Informal Language | 1 | 0 | 0 | 0
Pronoun Use | 1 | 0 | 1 | 1
Punctuation Use | 1 | 1 | 1 | 1
Complexity | 1 | 1 | 1 | 1
Information Volume | 1 | 1 | 1 | 1
Action | 0 | 0 | 1 | 1
Time Orientation | 1 | 0 | 1 | 1
Relativity | 0 | 0 | 1 | 1
Drives | 1 | 0 | 1 | 0
Biological Processes | 0 | 0 | 1 | 0
Affect | 1 | 0 | 0 | 0
Social | 1 | 1 | 1 | 0
Cognitive Processes | 1 | 1 | 1 | 1

5. Discussion

The goal of this study was to examine how accurately one can model the feature space of informative feedback components, and how this feature space varies across languages. In that vein, four research questions were answered using novel statistical learning methodologies, with a view to promoting effective feedback at scale.

5.1. Model performance — RQ1a

Research question RQ1 focused on investigating the extent to which the automated analysis of feedback messages can be used to identify informative feedback components. Four binary classifiers were developed using a variety of features (linguistic and psychological) obtained from LIWC and Coh-Metrix; an independent feature termed number of named entities was also created. The best performing models were effective in identifying FS, FT and FP. While not a direct comparison due to the addition of the English feedback examples, the models achieved better results than those reported by Cavalcanti et al. (2020). The classifiers were able to improve accuracy by 0.07 and 0.05 for FT and FP, respectively, and to increase kappa values by 0.11, 0.35, and 0.06 for FS, FT, and FP, respectively (see Table 7). Furthermore, when trained solely on the Portuguese feedback examples used by Cavalcanti et al. (2020), this study was able to improve accuracy by 0.07, 0.16, and 0.14 and increase kappa values by 0.35, 0.20 and 0.28 for FS, FT, and FP, respectively (see Table 7). All of these results signify an improved method for feedback classification, as the selected classification algorithm was able to better model the complex feedback feature space.

Table 7
A comparison of RQ1 results and similar works by Cavalcanti et al. (2020). The table presents classifier performance from RQ1 at the four levels of feedback and the results obtained by Cavalcanti et al. (2020). Legend: ACC – Accuracy; K – Cohen's kappa.

Model | FS (ACC / K) | FT (ACC / K) | FR (ACC / K) | FP (ACC / K)
RQ1 Results | 0.87 / 0.51 | 0.82 / 0.65 | 0.91 / 0.04 | 0.69 / 0.35
Cavalcanti et al. | 0.87 / 0.39 | 0.75 / 0.29 | – / – | 0.64 / 0.28

Similar to previous works (Cavalcanti et al., 2020), the FR classifier was not as effective in identifying instances. The model obtained a poor kappa of 0.06, which was likely caused by its poor ability to detect positive cases of FR. Poor performance at this level was due to the significantly lower number of positive instances as compared to the other levels of feedback.
5.2. Feature analysis — RQ1b

The focus of research question RQ1b was analyzing the most important textual features associated with the four levels of feedback. Hattie and Timperley (2007) state that FS involves evaluations of the person, which are often a form of praise. The current findings add weight to this claim, as the features found to be most important in predicting the FS level were affective processes (particularly, positive emotions) and social processes, which align with the concept of praise. The use of second person pronouns (i.e., you, your) was also identified as important, which can be analogized as the instructor's comments being directed at the person. FS is often thought to be the least effective level of feedback (Butler & Winne, 1995; Hattie & Timperley, 2007) and, relatedly, the FS classifier had a negative association with discrepancy words; this might indicate that FS comments carry little actionable information or insight.

FT is sometimes referred to as corrective feedback and provides information on details related to task accomplishment such as correctness or behavior. Accordingly, this study found the predictors most associated with FT were those related to the amount of information provided. Specifically, higher values of word counts, the frequency of content words and the minimum frequency of content words — all of which can be linked to greater information — were positively correlated with the observance of FT. Hattie and Timperley (2007) suggest instructors not rely solely on FT, but rather view it as a process that moves the student to FP and FR. This theory is backed by the finding of a strong negative association between causation words and FT; hence, FT comments were less likely to illustrate the causes of the student's failings, which is essential for the learner's self-regulation (Butler & Winne, 1995; Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006).

This study also found a negative association between social processes and FT. Per LIWC (Tausczik & Pennebaker, 2010), words associated with social processes (e.g., mate, talk) have psychological correlates with social concerns and social support. As FT is primarily concerned with directing students towards task/performance failings (Hattie & Timperley, 2007), perhaps greater social support when delivering this type of information could result in greater efficacy; however, more research is suggested in this area.

Compared to FT, FP is more concerned with the processes involved in undertaking tasks. Specifically, FP is believed to promote a deeper understanding of learning as it enables the identification of relationships between resources and output, and the development of stronger cognitive processes. To achieve this, Balzer et al. (1989) state that FP should concern information about actual relations in the learning environment, relations which have been recognized by the learner, and relations between the learning environment and the learner's perceptions. Therefore, the value of FP comes from providing useful information on relationships. The findings of this study corroborate the theoretical views of FP. Amongst the most important features for FP were the frequency of content words, adverbs, negative connectives and discrepancy words. These imply that FP comments were tied to providing new and corrective information. Other significant features can be tied back to relationships, including the frequency of semicolons (semicolons are often used to link together ideas) and features associated with space and relativity.

According to Butler and Winne (1995), one of the goals of FR should be to improve the student's ability to monitor current progress and use that information to form effective learning strategies. Accordingly, some of the most important predictors of FR were greater present and future focused processes. However, in the dataset, only 7% of feedback examples included FR. This trend might be indicative of the poor standing of today's feedback in higher education, as FR is often theorized as the most critical component of effective feedback (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006).

We can also analyze potential similarities among the most important features of the four levels of feedback. Firstly, there was no single feature that appeared in the top ten significant features across all four levels. Hence, each level of feedback might be playing a different role in informing the learner (Hattie & Timperley, 2007). FR and FP both had strong positive associations with future focused processes. This finding is plausible; per Hattie and Timperley (2007), both levels of feedback can be related to assessment strategy. Specifically, FP can involve information on error-detection strategies, while FR is more broadly related to strategies towards knowledge acquisition goals (Hattie & Timperley, 2007). Additionally, both FT and FP counted adverbs amongst their most important features, indicating the descriptive nature of these levels of feedback.

5.3. Feature transferability — RQ1c

Research on feedback tools should consider good cross-language performance to promote generalizability. In that vein, we studied inter-language classifier performance and compared the most significant features for classifiers trained on different language feedback. Barbosa et al. (2020) used linguistic features similar to those used in this project, such as LIWC and Coh-Metrix, to study the cross-language classification of cognitive presence in online discussions, and found the features to be independent of language; hence, we expected to find a moderate level of generalizability of feedback features across languages. However, our findings indicate a low transferability of feedback features. The average accuracy differential on inter-language performance amounted to −0.06, −0.59, and −0.26, while the average kappa differential was approximately −0.50, −0.27, and −0.33 for FS, FT, and FP, respectively. Likewise, the Portuguese and English trained classifiers showed minimal overlap in their most important features across all levels of feedback.

One possible explanation for this finding might be the difference in courses represented in the English and Portuguese datasets. English feedback examples were primarily from STEM related courses, including Environmental Studies and Software Engineering, while Portuguese feedback examples had more of a mix, hailing from Biology and Literature courses. Hence, the different nature of the represented courses might have influenced the transferability analysis.

Another explanation for the low transferability of features might be cultural differences in communication. For instance, at the FS level of feedback, we observed a greater association of friendship and social processes for the English feedback; i.e., English instructors might have displayed a greater level of familiarity with students. As an instructor can be viewed as an authority figure, this difference might be related to whether a culture is "horizontal", and therefore emphasizes equality, or "vertical", and emphasizes hierarchy (Shavitt et al., 2006). The implications of this finding would indicate that instructors will need to consider the cultural background of the learner while delivering feedback for improved efficacy.

5.4. Feature ablation — RQ1d

In conclusion of the feature ablation analysis per research question RQ1d, this study was able to cut down the size of the feature sets from 166 features to roughly 30, 15, 40 and 20 features for FS, FT, FR, and FP, respectively, while maintaining equivalent accuracy. Similar works (Cavalcanti et al., 2019, 2020) have used feature sets as large as 120 features; hence, this study shows a significant improvement in model size. After ablation, the remaining features for the four levels of feedback were put into broad categories. The distribution of these categories can be further analyzed. Firstly, only the FS level had features associated with the affect and informal language categories. Hence, the level of formality and the use of affective processes in delivered feedback only seems to have played an important role in FS. This is consistent with the theoretical view of FS being the more personal and affective level of feedback (Hattie & Timperley, 2007). Another finding was that the relativity category was only associated with FR and FP. This correlates with Hattie and Timperley's (2007) theory of FP and FR being the two
levels of feedback that concern the learner’s environment. Furthermore, feedback features across languages. Feedback tools should be general­
the FP and FR categories were the only levels to be associated with ac­ izable enough to cater to a variety of languages. By analyzing the
tions. This finding is aligned with literature which considers the two transferability of feedback features across languages, this study aimed to
levels to be the most efficacious in their ability to provide actionable enhance the global adaptability of current and future feedback tools.
insight due to their tendencies to be prescriptive as opposed to The findings indicate feedback features have low transferability between
descriptive (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). feedback examples delivered in English and Portuguese. However, a
To fully understand the differences (or similarities) of the four levels more expansive study is suggested, with a greater size and variety of
of feedback, we suggest further analysis of the role of these categories on feedback from different languages.
the four levels of feedback; including measuring the mean impact of the
features per category. This can be used to study the polarity of the cat­ Declaration of competing interest
egorical relationships and identify which categories are most significant
in determining the levels of feedback. The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
5.5. Limitations the work reported in this paper.

The nature of delivered feedback may vary depending on the gran­ References
ularity, and feedback features may change depending on the course,
school, language and culture, to name a few. As the data used in this Baeck, T. (2000). Evolutionary computation 1: Basic algorithms and operators (1st ed.). CRC
project consisted of feedback from two languages and five courses, the Press.
Balzer, W. K., Doherty, M. E., & O’Connor, R. (1989). Effects of cognitive feedback on
feedback space in this study may not be fully representative and greater performance. Psychological Bulletin, 106, 410–433.
variety will be needed for more accurate representations. Another issue Bangert-Drowns, R. L., Kulik, C.-L. C., Kulik, J. A., & Morgan, M. (1991). The
encountered with the data was the presence of class imbalances. As instructional effect of feedback in test-like events. Review of Educational Research, 61,
213–238.
evidenced by the improved classification performance of SMOTE Barbosa, G., Camelo, R., Cavalcanti, A. P., Miranda, P., Mello, R. F., Kovanovic, V., &
incorporated classifiers, class imbalances can have a negative impact on Gasevic, D. (2020). Towards automatic cross-language classification of cognitive
model efficiency. Furthermore, this study can be improved with greater presence in online discussions. In Proceedings of the tenth international conference on
Furthermore, this study can be improved with greater occurrences of positive instances of the different levels of feedback. For instance, the FR level had just 108 instances in the training set. The few cases of FR contributed to low prediction performance, and this study was not able to measure feature transferability at this level, as the Portuguese feedback examples had just 8 cases in the entire dataset. Finally, while we noted timeliness as one of the issues that plague feedback in higher education, this study focused on addressing the challenges that relate to the content of feedback. Future research will look towards feedback generators, which can address the issue of timeliness in addition.
6. Conclusion and future research

This study proposed four main contributions. First, this study explored how accurately a trained model can identify the presence of different feedback practices. Hence, four binary classifiers were trained on 2092 feedback examples using the criteria of Hattie and Timperley's proposed four levels of feedback operation. The constructed classifiers, using primarily linguistic and psychological features, were effective in identifying the presence of the FT, FP, and FS levels of feedback and showed better performance than similar works in this content area; however, the FR classifier was marred by a lack of adequate data. These results provide a proof of concept for a tool that can automatically analyze and potentially diagnose the contents of an instructor's feedback. We aim to transform assessment feedback into educational feedback by promoting the understanding and utilization of good feedback practices to improve their efficacy on learner adoption.
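A minimal sketch of this one-classifier-per-level design is given below. The placeholder data, label names, and hyperparameters are assumptions for illustration; they do not reproduce the study's feature extraction or tuning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder data standing in for the 2092 annotated feedback examples.
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(2092, 50)),
                        columns=[f"f{i}" for i in range(50)])
LEVELS = ["FS", "FT", "FP", "FR"]  # self, task, process, self-regulation
labels = pd.DataFrame({lvl: rng.integers(0, 2, size=2092) for lvl in LEVELS})

# One independent binary XGBoost classifier per feedback level.
models = {}
for level in LEVELS:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels[level], test_size=0.2,
        stratify=labels[level], random_state=42)
    model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    kappa = cohen_kappa_score(y_te, model.predict(X_te))
    print(f"{level}: accuracy={model.score(X_te, y_te):.2f}, kappa={kappa:.2f}")
    models[level] = model
```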
Another goal of this paper was to identify the prominent textual features of feedback components. The identified features were able to corroborate the findings of educational research on feedback theory. The presented findings can further be used to inspire the design of future automated feedback generators, e.g., by intentionally including the prominent terms specific to different feedback practices when generating feedback.
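Reusing the hypothetical models dictionary from the sketch above, prominent features could, for example, be read off each booster's gain-based importance scores:

```python
def top_features(model, k=10):
    """Rank features by total gain accumulated in the trained booster."""
    gains = model.get_booster().get_score(importance_type="gain")
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Print the ten most influential features for each feedback level.
for level, model in models.items():
    print(level, [name for name, _ in top_features(model)])
```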
Furthermore, this study sought to identify the minimal feature set needed for accurate feedback analysis. The ablation study showed that the size of the feature sets can be cut significantly to enable more lightweight feedback systems. To encourage the widespread use of an automatic feedback analysis tool, it is important to minimize the number of variables included in the constructed models so as to increase their interpretability.
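The ablation logic itself is straightforward to reproduce. The sketch below assumes hypothetical feature-group names (for example, LIWC and Coh-Metrix families); the groups argument maps each assumed group name to its column labels.

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def ablate_groups(features, y, groups, cv=5):
    """Drop one feature group at a time; report the change in mean CV accuracy.

    groups: dict mapping a group name (e.g., "liwc") to a list of column labels.
    """
    def score(X):
        return cross_val_score(XGBClassifier(eval_metric="logloss"), X, y, cv=cv).mean()

    base = score(features)
    print(f"all features: {base:.3f}")
    for name, cols in groups.items():
        reduced_score = score(features.drop(columns=cols))
        print(f"without {name}: {reduced_score:.3f} (delta {reduced_score - base:+.3f})")
```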
Finally, this study conducted an analysis of the transferability of feedback features between languages, with the aim to enhance the global adaptability of current and future feedback tools. The findings indicate feedback features have low transferability between feedback examples delivered in English and Portuguese. However, a more expansive study is suggested, with a greater size and variety of feedback from different languages.
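Measuring transferability can be as direct as training on one language and testing on the other, provided the feature space is aligned across languages (e.g., the same LIWC-style dimensions extracted for each language). The sketch below assumes such aligned matrices X_en and X_pt; it is a schematic of the evaluation, not the study's exact protocol.

```python
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def transfer_f1(X_en, y_en, X_pt, y_pt):
    """Train on English feedback and evaluate on Portuguese feedback.

    Assumes both matrices share the same language-independent feature columns.
    """
    model = XGBClassifier(eval_metric="logloss")
    model.fit(X_en, y_en)
    return f1_score(y_pt, model.predict(X_pt))
```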

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

Baeck, T. (2000). Evolutionary computation 1: Basic algorithms and operators (1st ed.). CRC Press.
Balzer, W. K., Doherty, M. E., & O’Connor, R. (1989). Effects of cognitive feedback on performance. Psychological Bulletin, 106, 410–433.
Bangert-Drowns, R. L., Kulik, C.-L. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213–238.
Barbosa, G., Camelo, R., Cavalcanti, A. P., Miranda, P., Mello, R. F., Kovanovic, V., & Gasevic, D. (2020). Towards automatic cross-language classification of cognitive presence in online discussions. In Proceedings of the tenth international conference on learning analytics & knowledge LAK ’20 (pp. 605–614). Frankfurt, Germany: ACM.
Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65, 245–281.
Cavalcanti, A. P., Diego, A., Mello, R. F., Mangaroska, K., Nascimento, A., Freitas, F., & Gasevic, D. (2020). How good is my feedback?: A content analysis of written feedback. In Proceedings of the tenth international conference on learning analytics & knowledge (pp. 428–437). Frankfurt, Germany: ACM.
Cavalcanti, A. P., Ferreira Leite de Mello, R., Rolim, V., Andre, M., Freitas, F., & Gasevic, D. (2019). An analysis of the use of good feedback practices in online learning courses. In 2019 IEEE 19th international conference on advanced learning technologies (ICALT) (pp. 153–157). Maceio, Brazil: IEEE.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining KDD ’16 (pp. 785–794). San Francisco, California, USA: ACM.
Clariana, R. B., Wagner, D., & Roher Murphy, L. C. (2000). Applying a connectionist description of feedback timing. Educational Technology Research & Development, 48, 5–22.
Clark, I. (2012). Formative assessment: Assessment is for self-regulated learning. Educational Psychology Review, 24, 205–249.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Denisko, D., & Hoffman, M. M. (2018). Classification and interaction in random forests. Proceedings of the National Academy of Sciences, 115, 1690–1692.
Dweck, C. S. (1999). Self-theories: Their role in motivation, personality, and development. New York, NY, US: Psychology Press.
Ferguson, P. (2011). Student perceptions of quality feedback in teacher education. Assessment & Evaluation in Higher Education, 36, 51–62.
Fisher, D., & Frey, N. (2009). Feed up, back, forward. Educational Leadership, 67, 20–25.
Glover, C., & Brown, E. (2006). Written feedback for students: Too much, too detailed or too incomprehensible to be effective? Bioscience Education, 7, 1–16.
Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36, 193–202.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York, NY: Springer.
Hattie, J. (1999). Influences on student learning. Inaugural lecture, University of Auckland, August 2.
Hattie, J., & Gan, M. (2011). Instruction based on feedback. In Handbook of research on learning and instruction (pp. 249–271).
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81–112.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21, 1263–1284.
Holzinger, A. (2018). From machine learning to explainable AI. In 2018 world symposium on digital intelligence for systems and machines (DISA) (pp. 55–66). Kosice: IEEE.
Ilgen, D., & Davis, C. (2000). Bearing bad news: Reactions to negative performance feedback. Applied Psychology, 49, 550–565.
Keuning, H., Jeuring, J., & Heeren, B. (2018). A systematic literature review of automated feedback generation for programming exercises. ACM Transactions on Computing Education, 19, 1–43.
Kluger, A. N., & Dijk, D. V. (2010). Feedback, the various tasks of the doctor, and the feedforward alternative. Medical Education, 44, 1166–1174.
Kovanović, V., Joksimović, S., Waters, Z., Gasević, D., Kitto, K., Hatala, M., & Siemens, G. (2016). Towards automated content analysis of discussion transcripts: A cognitive presence case. In Proceedings of the sixth international conference on learning analytics & knowledge LAK ’16 (pp. 15–24). Edinburgh, United Kingdom: ACM.
Laurillard, D. (2013). Rethinking university teaching. Routledge.
Leibold, N., & Schwarz, L. M. (2015). The art of giving online feedback. Journal of Effective Teaching, 15, 34–46.
Liu, M., Li, Y., Xu, W., & Liu, L. (2017). Automated essay feedback generation and its impact on revision. IEEE Transactions on Learning Technologies, 10, 502–513.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2019). Explainable AI for trees: From local explanations to global understanding. arXiv:1905.04610.
Luque, A., Carrasco, A., Martín, A., & de las Heras, A. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91, 216–231.
Ma, X., Wijewickrema, S., Zhou, S., Zhou, Y., Mhammedi, Z., O’Leary, S., & Bailey, J. (2017). Adversarial generation of real-time feedback with neural networks for simulation-based training. In Proceedings of the twenty-sixth international joint conference on artificial intelligence (pp. 3763–3769). Melbourne, Australia.
Mulliner, E., & Tucker, M. (2017). Feedback on feedback practice: Perceptions of students and academics. Assessment & Evaluation in Higher Education, 42, 266–288.
Narciss, S. (2013). Designing and evaluating tutoring feedback strategies for digital learning environments on the basis of the interactive tutoring feedback model. Digital Education Review, 23, 7–26.
Neuendorf, K. A. (2017). The content analysis guidebook (2nd ed.). Los Angeles: SAGE.
Nicol, D. (2021). The power of internal feedback: Exploiting natural comparison processes. Assessment & Evaluation in Higher Education, 46, 756–778.
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31, 199–218.
Pan, B. (2018). Application of XGBoost algorithm in hourly PM2.5 concentration prediction. In IOP conference series: Earth and environmental science (Vol. 113, Article 012127). IOP Publishing.
Parikh, A., McReelis, K., & Hodges, B. (2001). Student feedback in problem based learning: A survey of 103 final year students across five Ontario medical schools. Medical Education, 35, 632–636.
Price, M., Handley, K., Millar, J., & O’Donovan, B. (2010). Feedback: All that effort, but what is the effect? Assessment & Evaluation in Higher Education, 35, 277–289.
Shavitt, S., Lalwani, A. K., Zhang, J., & Torelli, C. J. (2006). The horizontal/vertical distinction in cross-cultural consumer research. Journal of Consumer Psychology, 16, 325–342.
Stern, L. A., & Solomon, A. (2006). Effective faculty feedback: The road less traveled. Assessing Writing, 11, 22–41.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54.
Villalón, J., Kearney, P., Calvo, R. A., & Reimann, P. (2008). Glosser: Enhanced feedback for student writing tasks. In 2008 eighth IEEE international conference on advanced learning technologies (pp. 454–458). Santander, Cantabria, Spain: IEEE.
Wade-Stein, D., & Kintsch, E. (2004). Summary street: Interactive computer support for writing. Cognition and Instruction, 22, 333–362.
Weaver, M. R. (2006). Do students value feedback? Student perceptions of tutors’ written responses. Assessment & Evaluation in Higher Education, 31, 379–394.
Wijewickrema, S., Ma, X., Piromchai, P., Briggs, R., Bailey, J., Kennedy, G., & O’Leary, S. (2018). Providing automated real-time technical feedback for virtual reality based surgical training: Is the simpler the better? In Artificial intelligence in education, Lecture notes in computer science (pp. 584–598). Cham: Springer.
Xiao, Z., Wang, Y., Fu, K., & Wu, F. (2017). Identifying different transportation modes from trajectory data using tree-based ensemble classifiers. ISPRS International Journal of Geo-Information, 6(2), 57.
Yang, M., & Carless, D. (2013). The feedback triangle and the enhancement of dialogic feedback processes. Teaching in Higher Education, 18, 285–297.
