A R T I C L E  I N F O

Keywords:
Natural language processing
Classification
Feedback
Content analysis

A B S T R A C T

Feedback is a crucial element of a student's learning process. It enables students to identify weaknesses and improve self-regulation. However, studies show this to be an area of great dissatisfaction in higher education. With ever-growing course participation numbers, delivering effective feedback is becoming an increasingly challenging task. The efficacy of feedback will depend on four levels of feedback; namely, feedback about the self, task, process or self-regulation. Hence, this paper explores the use of automated content analysis to examine feedback provided by instructors for feedback practices measured on self, task, process, and self-regulation levels. For this purpose, four binary XGBoost classifiers were trained and evaluated, one for each level of feedback. The results indicate effective classification performance on self, task, and process levels with accuracy values of 0.87, 0.82, and 0.69, respectively. Additionally, inter-language transferability of feedback features is measured using cross-language classification performance and feature importance analysis. Findings indicate a low generalizability of features between English and Portuguese feedback spaces.
* Corresponding author.
E-mail addresses: richard.osakwe@monash.edu (I. Osakwe), guanliang.chen@monash.edu (G. Chen), alex.wainwright@monash.edu (A. Whitelock-Wainwright),
dragan.gasevic@monash.edu (D. Gašević), apc@cin.ufpe.br (A. Pinheiro Cavalcanti), rafael.mello@ufrpe.br (R. Ferreira Mello).
https://doi.org/10.1016/j.caeai.2022.100059
Received 30 May 2021; Received in revised form 1 March 2022; Accepted 1 March 2022
Available online 13 March 2022
2666-920X/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
I. Osakwe et al. Computers and Education: Artificial Intelligence 3 (2022) 100059
growing portfolio of approaches that utilize ubiquitous data collection to improve learning processes and design decisions. The field of Learning Analytics (LA) is one such approach and could provide a solution to the state of feedback in education. LA researchers are actively exploring automated feedback solutions that can enable instructors to efficiently identify and deliver effective feedback, and improve the speed of feedback delivery to students (Keuning et al., 2018; Liu et al., 2017; Ma et al., 2017; Villalón et al., 2008; Wijewickrema et al., 2018). In that vein, several studies (Liu et al., 2017; Ma et al., 2017; Villalón et al., 2008; Wijewickrema et al., 2018) have examined the use of data mining methods to generate automated textual feedback. For instance, Villalón et al. (2008) and Wade-Stein and Kintsch (2004) employ Latent Semantic Analysis (LSA) to provide textual feedback on writing coherence to students. However, these analyses are often limited to domain-specific areas such as computer programming or writing, or lack grounding in educational theory. Much less work has gone into the exploration of automated domain-agnostic analysis to identify informative feedback practices (see Section 2.2 for details; Cavalcanti et al., 2019, 2020). Progress in such areas can enhance the instructor's ability to provide effective feedback comments and analyze features associated with effective feedback for generalizable feedback generators. Furthermore, we conduct our analyses across different languages to measure the extent to which constructed tools can be adapted or utilized by wider audiences. Therefore, this study aims to answer the following Research Questions (RQs):

1. To what extent can the automated analysis of feedback messages be used to identify informative feedback components?
   (a) How accurate are the predictions that are made about these feedback practices?
   (b) What are specific features of text that can be used to predict feedback components?
   (c) How transferable are the identified feedback features to text written in different languages?
   (d) What is the most minimal feature set needed to make accurate predictions?

2. Background

2.1. Feedback in the learning process

It is often theorized that the primary objective of feedback is to reduce the discrepancy between one's knowledge/exercise performance and a goal (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006; Butler & Winne, 1995; Leibold & Schwarz, 2015). Hence, feedback is often examined through the lens of self-regulated learning (Butler & Winne, 1995; Clark, 2012; Nicol & Macfarlane-Dick, 2006). Most theorists recognize that self-regulated learners are the most effective learners (Butler & Winne, 1995). Academically speaking, self-regulation is a cycle of setting knowledge construction goals, selecting strategies that maximize progress towards achieving said goals, and monitoring progress with the possibility of altering strategies depending on the level of progression (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). Feedback is a critical catalyst throughout this operation. While learners monitor task progression, they form internal feedback (Nicol, 2021), which shapes adopted tactics (cognitive routines to execute a goal) and strategies (monitoring and use of cognitive routines). More effective self-regulators generate better feedback or better employ the internal feedback to achieve goals (Nicol & Macfarlane-Dick, 2006).

As self-regulating learners engage in academic exercises, they utilize cognitive, meta-cognitive, and motivational processes to form an ideological construct of the exercise's details and requirements, alongside the setting of goals (Butler & Winne, 1995; Narciss, 2013; Nicol & Macfarlane-Dick, 2006). The extent to which the learner's goals align with those of the instructor will vary with the student's understanding of the exercise and motivations. The learner's goals determine strategies and tactics, which generate outcomes that can be mental (i.e., cognitive or affective/motivational states) or externally observable (e.g., a tangible product or the learner's behavior). During the monitoring phase, the learner generates internal feedback and assesses goal progression. This internal feedback shapes additional tactics and strategies, including knowledge, motivation, and beliefs (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). External feedback can be received from a variety of sources, including an instructor, peers, or a machine. This is then used to augment, confirm, or adjust the student's interpretation of the task's properties and learning path (Nicol & Macfarlane-Dick, 2006). However, for external feedback to play a meaningful role in the learning process, it must be effective enough to be interpreted and acted upon by the student (Nicol & Macfarlane-Dick, 2006).

2.2. Drivers of effective feedback

The literature identifies good practices of feedback in education. Nicol and Macfarlane-Dick (2006) are highly cited in the literature on feedback for learning. They largely identify good feedback as that which improves the learner's self-regulation abilities. The ability of feedback to improve self-regulation with greater efficacy will largely be driven by its informativeness, polarity, and timing.

2.2.1. Informativeness
Researchers theorize that effective feedback should integrate feed-up, feed-back, and feed-forward (Hattie & Timperley, 2007; Fisher & Frey, 2009). Feed-up clarifies goals and expectations to the learner. In line with feedback's primary purpose of reducing the discrepancy between current knowledge/task performance and a goal, one can surmise that without clearly defined goals feedback becomes substantially less potent (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Additionally, students are more likely to engage with learning tasks when they understand expectations (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Feed-back informs the learner on the level of progress towards an expected outcome (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). Feedback is more efficacious when it informs the learner on progress (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Students employ progress details to shape tactics and strategies (Butler & Winne, 1995; Nicol & Macfarlane-Dick, 2006). Feed-forward directs the learner on how to enhance progress (Hattie & Timperley, 2007; Fisher & Frey, 2009; Nicol & Macfarlane-Dick, 2006). Feed-forward can have a great impact on learning by enabling the student to determine the next steps in learning and how to tackle them (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). Hattie and Timperley (2007) report that feed-up, feed-back and feed-forward can operate at four levels (Table 1). Feedback's ability to lessen the discrepancy between performance and goals will be determined by the four levels at which it operates (Butler & Winne, 1995; Hattie & Timperley, 2007).

2.2.2. Polarity
Stern and Solomon (2006) and Weaver (2006) propose that effective feedback should constitute positive (reinforcing) comments, including evaluations on how and to what degree the goals were completed. Positive comments can also improve the learner's motivation and learning experience (Stern & Solomon, 2006; Weaver, 2006). Conversely, negative (corrective) feedback may be useful in adjusting inadequate behavior by alerting the learner to substandard strategies (Ilgen & Davis, 2000). Hattie and Timperley (2007) and Butler and Winne (1995) theorize that the level at which feedback operates should determine the polarity of comments. These studies (Butler & Winne, 1995; Hattie & Timperley, 2007) advise against either positive (e.g., "You're a very smart student!") or negative comments (e.g., "You're a careless student") at the self-level. At the self-regulation level, the impact of positive and negative comments will be influenced by the learner's
resultant 1092 English feedback paragraphs had an average length of 94.28 words and a standard deviation of 52.54 words. Subsequently, the total number of observations increased from 1272 to 2092 records, with an average of 63.60 words and a standard deviation of 53.69.

3.3.1. LIWC
LIWC is a lexical tool that analyzes text across various psychological dimensions (Tausczik & Pennebaker, 2010). The main categories provided by LIWC include cognitive processes, social processes, informal language, personal concerns, affect, relativity, time orientation, drives, and perceptual processes. The relative contribution of these categories in the text offers a descriptive profile of the various psychological constructs involved in the writing (Tausczik & Pennebaker, 2010). The LIWC dictionary has over 120,000 words, where each word can be assigned to one or more categories (Tausczik & Pennebaker, 2010).

3.3.3. Language of delivered feedback
This study records the language of the delivered feedback as an additional feature to account for potential inter-language differences in data distribution.

Table 2
Number of instances for each class in the training and test datasets for each level of feedback.

            Class 0          Class 1         Total
FS  Train   1149 (82.19%)    249 (17.81%)    1398 (70%)
    Test     567 (82.17%)    123 (17.83%)     690 (30%)
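The dictionary-based profiling that LIWC performs can be illustrated with a short sketch. The lexicon below is a toy stand-in, not the real LIWC dictionary, and the category names are only illustrative; as in LIWC, a word may contribute to several categories at once, and each category's score is its share of the total token count:

```python
from collections import Counter

# Toy lexicon mapping words to one or more categories
# (illustrative only; the real LIWC dictionary is far larger).
LEXICON = {
    "great": ["affect", "posemo"],
    "worry": ["affect", "negemo"],
    "think": ["cogproc"],
    "you": ["pronoun"],
}

def category_profile(text):
    """Return each category's share of total tokens, LIWC-style."""
    tokens = text.lower().split()
    counts = Counter()
    for tok in tokens:
        for cat in LEXICON.get(tok, []):
            counts[cat] += 1
    return {cat: n / len(tokens) for cat, n in counts.items()}

profile = category_profile("You did great but I worry you rushed")
```

Here "You did great but I worry you rushed" yields, for example, a `pronoun` share of 2/8, since two of the eight tokens are lexicon pronouns.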
3.5. Model selection and evaluation — RQ1a

Prior research on feedback analysis found success with decision tree ensembles; namely, the Random Forest (RF) algorithm (Cavalcanti et al., 2019, 2020). Decision tree ensembles are widely regarded classification algorithms that are well suited to feedback analysis (Cavalcanti et al., 2019, 2020). This is due to their white-box properties, easy interpretability, high accuracy, and ability to identify important features in a dataset (Cavalcanti et al., 2019, 2020; Chen & Guestrin, 2016; Denisko & Hoffman, 2018).

This study employed a decision tree implementation called XGBoost, or eXtreme Gradient Boosting (Chen & Guestrin, 2016). XGBoost has been shown to outperform Random Forest on numerous classification tasks (Pan, 2018; Xiao et al., 2017). The algorithm utilizes gradient boosting, which involves sequentially combining models (in this case, decision trees) that predict the residuals or errors of previous models at each iteration to improve overall accuracy (Chen & Guestrin, 2016). It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. XGBoost is ideal due to its superior accuracy and its implicit analysis of feature importance (Chen & Guestrin, 2016). Four binary XGBoost classifiers were trained, one for each level of feedback.

Previous studies (Cavalcanti et al., 2019, 2020) on textual feedback analysis have evaluated models using accuracy and Cohen's kappa (κ). Accuracy can be calculated as the ratio of correctly identified samples to the total number of samples. Furthermore, Cohen's κ measures inter-rater agreement while taking into consideration the possibility of agreement by chance; thus, it can be used to measure the level of agreement between automatically and manually coded messages in the dataset (Cavalcanti et al., 2019, 2020; Cohen, 1960). The F1 score (He & Garcia, 2009) can be used for assessing model performance with imbalanced data. F1 captures information on the completeness and exactness of positive predictions (He & Garcia, 2009). These evaluation measures are used to assess the performance of the developed feedback classifiers and contrast it with that of similar works on automated feedback analysis.

3.5.1. Feature analysis — RQ1b
The outputs of decision tree models can be analyzed with tools such as SHAP (SHapley Additive exPlanations) (Lundberg et al., 2019). SHAP utilizes a game theory informed algorithm to explain the output of machine learning models. Given a machine learning model and data records as input, SHAP leverages the concept of Shapley values by measuring the average marginal contribution of a feature over all possible permutations. SHAP can diagnose the most impactful features using their SHAP value, which is the mean absolute contribution of each feature (Lundberg et al., 2019). A higher SHAP value for a feature implies a greater importance compared to another feature.

3.5.2. Feature transferability — RQ1c
To measure the transferability of features across languages, the dataset was split by language, creating Portuguese and English feedback datasets. Each of these datasets was split into training and test sets (70% training and 30% test), and binary classifiers were trained and tuned, resulting in English-trained classifiers and Portuguese-trained classifiers for each level of feedback, with the exception of the FR level. For the Portuguese feedback examples, the FR level had just eight positive instances out of 1000 records, which was not enough to train a machine learning algorithm (Hastie et al., 2009); hence, this level of feedback was excluded from all transferability analysis.

3.5.3. Feature ablation — RQ1d
In conducting a feature ablation study, features were ranked in terms of their importance and sequentially removed from the feature set in ascending order; i.e., the lowest-importance features were removed first. As the feature space was trimmed, classifiers were trained using the reduced feature sets and their accuracies were recorded using the test set. Finally, analysis was carried out by examining the relative change in prediction accuracy against the size of the feature space.

Once the English- and Portuguese-trained classifiers were developed, feature transferability was measured by:

i) The inter-language prediction performance: the prediction performance (measured by accuracy, F1 score and Cohen's κ) of the English-trained classifier on the English test set was compared to its performance on the Portuguese test set for the FS, FT, and FP levels of feedback. The same process was repeated for the Portuguese-trained classifier;
ii) A comparison of significant features: the most important features for the English- and Portuguese-trained classifiers are compared at the FS, FT and FP levels of feedback.

4. Results

To assess model performance, we report F1 scores, Cohen's κ and accuracy values obtained by the binary classifiers on the test set, for each level of feedback. For further analysis, confusion matrices are also presented.

4.1. Model training and evaluation — RQ1a

In answering RQ1a, we compare the results of classifiers at the four levels of feedback using a selection of class balancing strategies (see Table 3). At the FS level, the classifier obtained better results without the use of class balancing strategies. However, the use of class balancing strategies was effective in improving classifier performance at the remaining three levels of feedback. A comparison of class balancing strategies showed SMOTE to have achieved roughly equal or higher accuracy, kappa and F1 values at the FS, FT and FP levels, although at the FR level, the GA pipeline resulted in significantly higher kappa and F1 values.

4.2. Feature importance analysis — RQ1b

SMOTE-balanced classifiers were chosen for feature analysis due to their superior results in three of the four levels of feedback. The top ten most impactful features were compiled using their SHAP values and can be viewed in Table 4.

At the FS level, larger values of features associated with affect (affective words, positive emotions, exclamation marks) and second person pronouns (particularly, "you") had the most impact on classification. The SHAP algorithm also found that greater use of tentative language and informal speech was important in discerning FS. The features most important in predicting FT labels were related to the length and amount of information. These included word counts, the frequency of content words and the minimum frequency of content words, all of which had a positive association with FT observation. Another finding was that FT had a negative association with social processes and causation words. At the FR level, the most important features were associated with higher coherence and referential cohesion, i.e., noun, word and content overlap, and the mean edit distance of words. Additional significant features were time-focused words (present and future), risk-related words and differentiation words. Some of the most impactful features for the FP level can be linked to delivering corrective and new information, including the frequency of semicolons, negative connectives and discrepancy words. Other important features were topic coherence, frequency of adverbs, future-focused words, and words associated with relativity.

4.3. Feature transferability — RQ1c

As seen in Table 5, cross-language prediction resulted in sizable
Table 3
Performance of the classifiers trained to address research question RQ1 on the combined dataset involving both the English and Portuguese datasets. Legend: ACC – Accuracy; K – Cohen's kappa; F1 – F1 Score.

                  FS                   FT                   FR                   FP
Class Balancing   ACC   K     F1       ACC   K     F1       ACC   K     F1       ACC   K     F1
None              0.88  0.52  0.58     0.82  0.64  0.83     0.92  0.00  0.00     0.68  0.33  0.57
SMOTE             0.87  0.51  0.58     0.82  0.65  0.83     0.91  0.04  0.07     0.69  0.35  0.59
GA                0.83  0.45  0.56     0.82  0.64  0.82     0.59  0.12  0.24     0.70  0.35  0.57
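SMOTE, the balancing strategy compared in Table 3, oversamples the minority class by interpolating between a minority example and one of its nearest minority-class neighbours. The following is a minimal stdlib sketch of that idea on toy 2-D points (a real pipeline would more likely use an off-the-shelf implementation, such as imbalanced-learn's `SMOTE`, in front of the XGBoost classifier):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points. Each synthetic point lies
    on the segment between a sampled minority point and one of its k
    nearest minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point is an interpolation between two existing minority points, the oversampled class stays inside the convex hull of the original minority examples.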
Table 4
Top 10 important features as measured using SHAP, displayed from most to least important for the FS, FT, FR and FP classifiers.

FS                                                                  FT
Variable          Description                                SHAP   Variable       Description                                        SHAP
liwc.Exclam       Freq. of exclamation marks                 1.02   cm.WRDFRQa     Freq. of all words                                 0.46
liwc.posemo       Freq. of words with positive emotion       0.73   cm.WRDFRQc     Freq. of content words                             0.39
liwc.you          Freq. of the word "you"                    0.24   cm.WRDFRQmc    Minimum freq. of content words                     0.34
liwc.affect       Freq. of affective words                   0.20   cm.DRNP        Noun phrase density                                0.10
cm.SYNMEDlem      Minimal edit distance of lemmas            0.20   cm.DRAP        Adverbial phrase density                           0.10
cm.WRDFRQc        Freq. of content words                     0.15   liwc.SemiC     Freq. of semicolons                                0.08
liwc.tentat       Freq. of tentative words                   0.15   cm.DESWLsy     Mean word length                                   0.07
liwc.reward       Freq. of words associated with reward      0.14   liwc.adverb    Freq. of adverbs                                   0.07
liwc.informal     Freq. of informal words                    0.14   liwc.social    Freq. of words related to social processes         0.07
cm.WRDPRP2        Freq. of second person pronouns            0.14   liwc.article   Freq. of articles                                  0.07

FR                                                                  FP
Variable          Description                                SHAP   Variable       Description                                        SHAP
cm.CRFNO1         Noun overlap between adjunct sentences     0.56   liwc.SemiC     Freq. of semicolons                                0.39
cm.WRDPRP3s       Freq. of third person pronouns             0.50   cm.LSASS1      LSA measure of semantic coherence                  0.19
cm.CRFSO1         Word stem overlap between adjunct sentences 0.43  cm.CNCNeg      Freq. of negative connectives                      0.12
cm.DRAP           Adverbial phrase density                   0.35   liwc.adverb    Freq. of adverbs                                   0.11
cm.CRFCWOa        Content word overlap of all sentences      0.25   cm.DESWLltd    Standard deviation of average no. of letters/word  0.09
liwc.risk         Freq. of risk related words                0.23   liwc.space     Freq. of words related to space                    0.09
liwc.differ       Freq. of words related to differentiation  0.21   liwc.verb      Freq. of verbs                                     0.08
liwc.focusfuture  Freq. of future focus words                0.21   liwc.shehe     Freq. of third person singular pronouns            0.07
liwc.focuspresent Freq. of present focus words               0.20   cm.SYNLE       Mean no. of words before the main verb             0.06
liwc.affiliation  Freq. of affiliation words                 0.16   liwc.discrep   Freq. of words associated with discrepancy         0.06
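The Shapley value underlying the SHAP rankings in Table 4 is the average marginal contribution of a feature over all orderings in which features can be added. For a handful of features this can be computed exactly by brute force, as in the sketch below, where a toy value function stands in for the classifier's output (real SHAP implementations such as TreeSHAP avoid this factorial enumeration):

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average each feature's marginal contribution
    to value() over all possible orderings of the features."""
    contrib = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        present = set()
        for f in order:
            before = value(present)
            present.add(f)
            contrib[f] += value(present) - before
    return {f: c / len(perms) for f, c in contrib.items()}

# Toy value function: an additive model plus one interaction term.
def model_output(present):
    v = 0.0
    if "f1" in present: v += 2.0
    if "f2" in present: v += 1.0
    if {"f1", "f2"} <= present: v += 0.5  # interaction between f1 and f2
    return v

phi = shapley_values(["f1", "f2", "f3"], model_output)
```

For this toy model the interaction term is split evenly between the two interacting features (phi["f1"] = 2.25, phi["f2"] = 1.25, phi["f3"] = 0), and the values sum to the model's output with all features present, which is the efficiency property SHAP relies on.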
Table 5
For RQ1c, classifiers are exclusively trained on English (EN) and Portuguese (PT) feedback examples. The performance of each classifier is measured against EN and PT feedback examples. Legend: ACC – Accuracy; K – Cohen's kappa; F1 – F1 Score.

                         FS                   FT                   FP
               Test set  ACC   K      F1      ACC   K      F1      ACC   K      F1
EN Classifier  EN        0.83  0.42   0.52    0.69  0.13   0.31    0.66  0.23   0.49
               PT        0.85  0.03   0.04    0.11  0.00   0.00    0.49  −0.02  0.30
PT Classifier  EN        0.79  0.06   0.12    0.28  0.00   0.43    0.35  0.00   0.52
               PT        0.94  0.74   0.77    0.91  0.49   0.95    0.78  0.56   0.79
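The accuracy, Cohen's κ and F1 values reported in Tables 3 and 5 follow directly from the binary confusion matrix. A stdlib sketch of all three (equivalent functions exist in `sklearn.metrics` as `accuracy_score`, `cohen_kappa_score` and `f1_score`):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, Cohen's kappa and F1 for binary labels (positive class = 1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / n
    # Chance agreement: probability that the two label sources agree at random.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, kappa, f1

acc, kappa, f1 = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

Kappa discounts the chance-agreement term, which is why a classifier that almost never predicts the rare positive class can post a high accuracy but a near-zero kappa, exactly the pattern the FR column of Table 3 shows.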
drops in accuracy and kappa values. One can further analyze feature transferability by examining the most important features for the English- and Portuguese-trained classifiers. At the FS level, features associated with affect and informal language were important for both languages. However, within the affective processes, Portuguese FS comments were more greatly impacted by the level of arousal (as measured using the frequency of exclamation marks). In comparison, English FS comments were greatly impacted by the presence of friendship and social processes. Furthermore, English FS comments were more tied to references to the person (measured using the frequency of second person pronouns). The most important features for the Portuguese-trained classifier and the English-trained classifier had no intersections at the FT level. For the Portuguese classifier, four of the ten most important features were related to the complexity of feedback comments; namely, lexical diversity, verb overlaps, edit distance of lemmas and age of acquisition of content words. Contrastingly, the English feedback comments at the FT level were highly associated with rewards. Other important features were the frequency of numbers, conjunctions and words related to social processes. At the FP level, the English and Portuguese classifiers had no overlap in their ten most important features. Important features for the Portuguese classifier included coherence measures, word counts (total and per sentence), and verb phrase density. The most important features for the English classifier consisted of the mean number of words before main verbs, lexical diversity, frequency of discrepancy words, personal pronouns, quotes and verbs.

4.4. Feature ablation — RQ1d

In order to reduce the size of the model, a feature ablation study was carried out. The shape of the curves was largely concave downwards; hence, to minimize the number of features while maintaining a similar level of accuracy, one can look at where the curve starts to dip. This occurs at roughly 30, 15, 40 and 20 features for FS (Fig. 1a), FT (Fig. 1b), FR (Fig. 1c) and FP (Fig. 1d), respectively.

In order to analyze the remaining features of the four classifiers after ablation, these features were broadly categorized using categories
Fig. 1. For RQ1d, a feature ablation study is carried out by iteratively removing features, starting from the least important. The resultant feature set is used to train a classifier and the new model's accuracy is recorded. Four graphs are displayed showing accuracy on the y-axis and the number of features remaining on the x-axis (from 166 to 1). The x-axis is inverted, so values decrease from left to right. Ablation results are displayed for the four levels of feedback.
informed by LIWC (Tausczik & Pennebaker, 2010) and Coh-Metrix (Graesser et al., 2004), and similar features were grouped together. The distribution of these categories for each level of feedback can be seen in Table 6.

5. Discussion

The goal of this study was to examine how accurately one can model the feature space of informative feedback components, and how this feature space varies across languages. In that vein, four research questions were answered using novel statistical learning methodologies, with a view to promoting effective feedback at scale.

5.1. Model performance — RQ1a

Research question RQ1 focused on investigating the extent to which the automated analysis of feedback messages can be used to identify informative feedback components. Four binary classifiers were developed using a variety of features (linguistic and psychological) obtained from LIWC and Coh-Metrix; an independent feature termed number of named entities was also created. The best performing models were effective in identifying FS, FT and FP. While not a direct comparison due to the addition of the English feedback examples, the models achieved better results than those reported by Cavalcanti et al. (2020). The classifiers were able to improve accuracy by 0.07 and 0.05 for FT and FP, respectively, and increase kappa values by 0.11, 0.35, and 0.06 for FS, FT, and FP, respectively (see Table 7). Furthermore, when trained solely on the Portuguese feedback examples used by Cavalcanti et al. (2020), this study was able to improve accuracy by 0.07, 0.16, and 0.14 and increase kappa values by 0.35, 0.20 and 0.28 for FS, FT, and FP, respectively (see Table 7). All of these results signify an improved method for feedback classification, as the selected classification algorithm was able to better model the complex feedback feature space.

Similar to previous works (Cavalcanti et al., 2020), the FR classifier was not as effective in identifying instances. The model obtained a poor kappa of 0.06, which was likely caused by the model's poor ability to detect positive cases of FR. Poor performance on this level was due to the significantly lower number of positive instances as compared to the other levels of feedback.

Table 6
In addressing RQ1d, the features that remain after ablation are analyzed and categorized. The names and distribution of these categories can be seen for the four levels of feedback.

Category              FS  FT  FR  FP
Personal Concerns      1   1   0   0
Informal Language      1   0   0   0
Pronoun Use            1   0   1   1
Punctuation Use        1   1   1   1
Complexity             1   1   1   1
Information Volume     1   1   1   1
Action                 0   0   1   1
Time Orientation       1   0   1   1
Relativity             0   0   1   1
Drives                 1   0   1   0
Biological Processes   0   0   1   0
Affect                 1   0   0   0
Social                 1   1   1   0
Cognitive Processes    1   1   1   1

Table 7
A comparison of RQ1 results and similar works by Cavalcanti et al. (2020). The table presents classifier performance from RQ1 at the four levels of feedback and the results obtained by Cavalcanti et al. (2020). Legend: ACC – Accuracy; K – Cohen's kappa.

                           FS           FT           FR           FP
                           ACC   K      ACC   K      ACC   K      ACC   K
RQ1 Results                0.87  0.51   0.82  0.65   0.91  0.04   0.69  0.35
Cavalcanti et al. (2020)   0.87  0.39   0.75  0.29   –     –      0.64  0.28
5.2. Feature analysis — RQ1b

The focus of research question RQ1b was analyzing the most important textual features associated with the four levels of feedback. Hattie and Timperley (2007) state that FS involves evaluations of the person, which are often a form of praise. The current findings add weight to this claim, as the features found to be most important in predicting the FS level were affective processes (particularly, positive emotions) and social processes, which align with the concept of praise. The use of second person pronouns (i.e., you, your) was also identified as important, which can be analogized as the instructor's comments being directed at the person. FS is often thought to be the least effective level of feedback (Butler & Winne, 1995; Hattie & Timperley, 2007) and, relatedly, the FS classifier had a negative association with discrepancy words; this might indicate that FS comments have little actionable information or insight.

FT is sometimes referred to as corrective feedback and provides information on details related to task accomplishment, such as correctness or behavior. Accordingly, this study found that the predictors most associated with FT were those related to the amount of information provided. Specifically, higher values of word counts, frequency of content words and minimum frequency of content words, all of which can be linked to greater information, were positively correlated with observance of FT. Hattie and Timperley (2007) suggest instructors not rely solely on FT, but rather view it as a process that moves the student to FP and FR. This theory is backed by the finding of a strong negative association between causation words and FT; hence, FT comments were less likely to illustrate the causes of the student's failings, which is essential for the learner's self-regulation (Butler & Winne, 1995; Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006).

This study also found a negative association between social processes and FT. Per LIWC (Tausczik & Pennebaker, 2010), words associated with social processes (e.g., mate, talk) have psychological correlates with social concerns and social support. As FT is primarily concerned with directing students towards task/performance failings (Hattie & Timperley, 2007), perhaps greater social support when delivering this type of information could result in greater efficacy; however, more research is suggested in this area.

Compared to FT, FP is more concerned with the processes involved in undertaking tasks. Specifically, FP is believed to promote a deeper understanding of learning as it enables the identification of relationships between resources and output, and the development of stronger cognitive processes. To achieve this, Balzer et al. (1989) state that FP should concern information about actual relations in the learning environment, relations which have been recognized by the learner, and relations be-

that appeared in the top ten significant features across all four levels. Hence, each level of feedback might be playing a different role in informing the learner (Hattie & Timperley, 2007). FR and FP both had strong positive associations with future-focused processes. This finding is plausible; per Hattie and Timperley (2007), both levels of feedback can be related to assessment strategy. Specifically, FP can involve information on error-detection strategies, while FR is more broadly related to strategies towards knowledge acquisition goals (Hattie & Timperley, 2007). Additionally, both FT and FP counted adverbs amongst their most important features, indicating the descriptive nature of these levels of feedback.

5.3. Feature transferability — RQ1c

Research on feedback tools should consider good cross-language performance to promote generalizability. In that vein, we studied inter-language classifier performance and compared the most significant features for classifiers trained on different-language feedback. Barbosa et al. (2020) used linguistic features similar to those used in this project, such as LIWC and Coh-Metrix, to study cross-language classification of cognitive presence in online discussions, and found features to be independent of language; hence, we expected to find a moderate level of generalizability of feedback features across languages. However, our findings indicate a low transferability of feedback features. The average accuracy differential on inter-language performance amounted to −0.06, −0.59, and −0.26, while the average kappa differential was approximately −0.50, −0.27, and −0.33 for FS, FT, and FP, respectively. Likewise, the Portuguese- and English-trained classifiers showed minimal overlap in their most important features across all levels of feedback.

One possible explanation for this finding might be the difference in courses represented in the English and Portuguese datasets. English feedback examples were primarily from STEM-related courses, including Environmental Studies and Software Engineering, while Portuguese feedback examples had more of a mix, hailing from Biology and Literature courses. Hence, the different nature of the represented courses might have influenced the transferability analysis.

Another explanation for the low transferability of features might be cultural differences in communication. For instance, at the FS level of feedback, we observed a greater association of friendship and social processes for the English feedback; i.e., English instructors might have displayed a greater level of familiarity with students. As an instructor can be viewed as an authority figure, this difference might be related to whether a culture is "horizontal", and therefore emphasizes equality, or "vertical", and emphasizes hierarchy (Shavitt et al., 2006). The implications of this finding would indicate instructors will need to consider
tween the learning environment and the learner’s perceptions. There the cultural backgrounds of the learner while delivering feedback for
fore, the value of FP comes from providing useful information on improved efficacy.
relationships. The findings of this study corroborate the theoretical
views of FP. Amongst the most important features for FP were frequency 5.4. Feature ablation — RQ1d
of content words, adverbs, negative connectives and discrepancy words.
These imply that FP comments were tied to providing new and correc In conclusion of the feature ablation analysis per research question
tive information. Other significant features can be tied back to re RQ1d, this study was able to cut down the size of feature sets from 166
lationships; including frequency of semicolons (semicolons are often features to roughly 30, 15, 40 and 20 features for FS, FT, FR, and FP,
used to link together ideas) and features associated with space and respectively, while maintaining equivalent accuracy. Similar works
relativity. (Cavalcanti et al., 2019, 2020) have used feature sets as large as 120
According to Butler and Winne (1995), one of the goals of FR should features; hence, this study shows significant improvement in model size.
be to improve the student’s ability to monitor current progress and use After ablation, the remaining features for the four levels of feedback
that information to form effective learning strategies. Accordingly, some were put into broad categories. The distribution of these categories can
of the most important predictors of FR were greater present and future be further analyzed. Firstly, only the FS level had features associated
focused processes. However, in the dataset, only 7% of feedback ex with affect and informal language categories. Hence, the level of for
amples included FR. This trend might be indicative of the poor standing mality and the use of affective processes in delivered feedback only
of today’s feedback in higher education, as FR is often theorized as the seems to have played an important role in FS. This is consistent with the
most critical component of effective feedback (Butler & Winne, 1995; theoretical view of FS being the more personal and affective level of
Nicol & Macfarlane-Dick, 2006). feedback (Hattie & Timperley, 2007). Another finding was that the
We can also analyze potential similarities of the most important relativity category was only associated with FR and FP. This correlates
features of the four levels of feedback. Firstly, there was no single feature with Hattie and Timperley (2007) theory of FP, and FR being the two
levels of feedback that concern the learner's environment. Furthermore, the FP and FR categories were the only levels to be associated with actions. This finding is aligned with the literature, which considers the two levels to be the most efficacious in their ability to provide actionable insight, due to their tendency to be prescriptive as opposed to descriptive (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006).

To fully understand the differences (or similarities) of the four levels of feedback, we suggest further analysis of the role of these categories across the four levels, including measuring the mean impact of the features per category. This can be used to study the polarity of the categorical relationships and to identify which categories are most significant in determining the levels of feedback.

5.5. Limitations

The nature of delivered feedback may vary depending on its granularity, and feedback features may change depending on the course, school, language and culture, to name a few. As the data used in this project consisted of feedback from two languages and five courses, the feedback space in this study may not be fully representative, and greater variety will be needed for more accurate representations. Another issue encountered with the data was the presence of class imbalances. As evidenced by the improved classification performance of the SMOTE-incorporated classifiers, class imbalances can have a negative impact on model efficiency. Furthermore, this study could be improved with greater occurrences of positive instances of the different levels of feedback. For instance, the FR level had just 108 instances in the training set. The few cases of FR contributed to low prediction performance, and this study was not able to measure feature transferability at this level, as the Portuguese feedback examples had just 8 cases in the entire dataset. Finally, while we noted timeliness as one of the issues that plague feedback in higher education, this study focused on addressing the challenges that relate to the content of feedback. Future research will look towards feedback generators that can also address the issue of timeliness.

6. Conclusion and future research

This study proposed four main contributions. First, it explored how accurately a trained model can identify the presence of different feedback practices. To this end, four binary classifiers were trained on 2092 feedback examples using the criteria of Hattie and Timperley's proposed four levels of feedback. The constructed classifiers, using primarily linguistic and psychological features, were effective in identifying the presence of the FT, FP and FS levels of feedback and showed better performance than similar works in this content area; however, the FR classifier was marred by a lack of adequate data. These results provide a proof of concept for a tool that can automatically analyze and potentially diagnose the contents of an instructor's feedback. We aim to transform assessment feedback into educational feedback by promoting the understanding and utilization of good feedback practices to improve their efficacy on learner adoption.

Another goal of this paper was to identify the prominent textual features of feedback components. The identified features were able to corroborate the findings of educational research on feedback theory. The presented findings can be further used to inspire the design of future automated feedback generators, e.g., by intentionally including the prominent terms specific to different feedback practices when generating feedback.

Furthermore, this study sought to identify the most minimal feature set needed for accurate feedback analysis. The ablation study showed that the size of the feature sets can be cut significantly to enable more lightweight feedback systems. To encourage the widespread use of an automatic feedback analysis tool, it is important to minimize the number of variables included in the constructed models to increase their interpretability.

Finally, this study conducted an analysis of the transferability of feedback features across languages. Feedback tools should be generalizable enough to cater to a variety of languages. By analyzing the transferability of feedback features across languages, this study aimed to enhance the global adaptability of current and future feedback tools. The findings indicate that feedback features have low transferability between feedback examples delivered in English and Portuguese. However, a more expansive study is suggested, with a greater size and variety of feedback from different languages.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Baeck, T. (2000). Evolutionary computation 1: Basic algorithms and operators (1st ed.). CRC Press.
Balzer, W. K., Doherty, M. E., & O'Connor, R. (1989). Effects of cognitive feedback on performance. Psychological Bulletin, 106, 410–433.
Bangert-Drowns, R. L., Kulik, C.-L. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213–238.
Barbosa, G., Camelo, R., Cavalcanti, A. P., Miranda, P., Mello, R. F., Kovanovic, V., & Gasevic, D. (2020). Towards automatic cross-language classification of cognitive presence in online discussions. In Proceedings of the tenth international conference on learning analytics & knowledge (LAK '20) (pp. 605–614). Frankfurt, Germany: ACM.
Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65, 245–281.
Cavalcanti, A. P., Diego, A., Mello, R. F., Mangaroska, K., Nascimento, A., Freitas, F., & Gasevic, D. (2020). How good is my feedback?: A content analysis of written feedback. In Proceedings of the tenth international conference on learning analytics & knowledge (pp. 428–437). Frankfurt, Germany: ACM.
Cavalcanti, A. P., Ferreira Leite de Mello, R., Rolim, V., Andre, M., Freitas, F., & Gasevic, D. (2019). An analysis of the use of good feedback practices in online learning courses. In 2019 IEEE 19th international conference on advanced learning technologies (ICALT) (pp. 153–157). Maceio, Brazil: IEEE.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD '16) (pp. 785–794). San Francisco, California, USA: ACM.
Clariana, R. B., Wagner, D., & Roher Murphy, L. C. (2000). Applying a connectionist description of feedback timing. Educational Technology Research & Development, 48, 5–22.
Clark, I. (2012). Formative assessment: Assessment is for self-regulated learning. Educational Psychology Review, 24, 205–249.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Denisko, D., & Hoffman, M. M. (2018). Classification and interaction in random forests. Proceedings of the National Academy of Sciences, 115, 1690–1692.
Dweck, C. S. (1999). Self-theories: Their role in motivation, personality, and development. New York, NY, US: Psychology Press.
Ferguson, P. (2011). Student perceptions of quality feedback in teacher education. Assessment & Evaluation in Higher Education, 36, 51–62.
Fisher, D., & Frey, N. (2009). Feed up, back, forward. Educational Leadership, 67, 20–25.
Glover, C., & Brown, E. (2006). Written feedback for students: Too much, too detailed or too incomprehensible to be effective? Bioscience Education, 7, 1–16.
Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36, 193–202.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., Springer series in statistics). New York, NY: Springer.
Hattie, J. (1999). Influences on student learning. Inaugural lecture given on August, 2, 21.
Hattie, J., & Gan, M. (2011). Instruction based on feedback. In Handbook of research on learning and instruction (pp. 249–271).
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77, 81–112.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21, 1263–1284.
Holzinger, A. (2018). From machine learning to explainable AI. In 2018 world symposium on digital intelligence for systems and machines (DISA) (pp. 55–66). Kosice: IEEE.
Ilgen, D., & Davis, C. (2000). Bearing bad news: Reactions to negative performance feedback. Applied Psychology, 49, 550–565.
Keuning, H., Jeuring, J., & Heeren, B. (2018). A systematic literature review of automated feedback generation for programming exercises. ACM Transactions on Computing Education, 19, 1–43.
Kluger, A. N., & Dijk, D. V. (2010). Feedback, the various tasks of the doctor, and the feedforward alternative. Medical Education, 44, 1166–1174.
Kovanović, V., Joksimović, S., Waters, Z., Gasević, D., Kitto, K., Hatala, M., & Siemens, G. (2016). Towards automated content analysis of discussion transcripts: A cognitive presence case. In Proceedings of the sixth international conference on learning analytics & knowledge (LAK '16) (pp. 15–24). Edinburgh, United Kingdom: ACM.
Laurillard, D. (2013). Rethinking university teaching. Routledge.
Leibold, N., & Schwarz, L. M. (2015). The art of giving online feedback. Journal of Effective Teaching, 15, 34–46.
Liu, M., Li, Y., Xu, W., & Liu, L. (2017). Automated essay feedback generation and its impact on revision. IEEE Transactions on Learning Technologies, 10, 502–513.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2019). Explainable AI for trees: From local explanations to global understanding. arXiv:1905.04610.
Luque, A., Carrasco, A., Martín, A., & Heras, A. d. l. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91, 216–231.
Ma, X., Wijewickrema, S., Zhou, S., Zhou, Y., Mhammedi, Z., O'Leary, S., & Bailey, J. (2017). Adversarial generation of real-time feedback with neural networks for simulation-based training. In Proceedings of the twenty-sixth international joint conference on artificial intelligence (pp. 3763–3769). Melbourne, Australia.
Mulliner, E., & Tucker, M. (2017). Feedback on feedback practice: Perceptions of students and academics. Assessment & Evaluation in Higher Education, 42, 266–288.
Narciss, S. (2013). Designing and evaluating tutoring feedback strategies for digital learning environments on the basis of the interactive tutoring feedback model. Digital Education Review, 23, 7–26.
Neuendorf, K. A. (2017). The content analysis guidebook (2nd ed.). Los Angeles: SAGE.
Nicol, D. (2021). The power of internal feedback: Exploiting natural comparison processes. Assessment & Evaluation in Higher Education, 46, 756–778.
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31, 199–218.
Pan, B. (2018). Application of XGBoost algorithm in hourly PM2.5 concentration prediction. In IOP conference series: Earth and environmental science (Vol. 113, Article 012127). IOP Publishing.
Parikh, A., McReelis, K., & Hodges, B. (2001). Student feedback in problem based learning: A survey of 103 final year students across five Ontario medical schools. Medical Education, 35, 632–636.
Price, M., Handley, K., Millar, J., & O'Donovan, B. (2010). Feedback: All that effort, but what is the effect? Assessment & Evaluation in Higher Education, 35, 277–289.
Shavitt, S., Lalwani, A. K., Zhang, J., & Torelli, C. J. (2006). The horizontal/vertical distinction in cross-cultural consumer research. Journal of Consumer Psychology, 16, 325–342.
Stern, L. A., & Solomon, A. (2006). Effective faculty feedback: The road less traveled. Assessing Writing, 11, 22–41.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54.
Villalón, J., Kearney, P., Calvo, R. A., & Reimann, P. (2008). Glosser: Enhanced feedback for student writing tasks. In 2008 eighth IEEE international conference on advanced learning technologies (pp. 454–458). Santander, Cantabria, Spain: IEEE.
Wade-Stein, D., & Kintsch, E. (2004). Summary Street: Interactive computer support for writing. Cognition and Instruction, 22, 333–362.
Weaver, M. R. (2006). Do students value feedback? Student perceptions of tutors' written responses. Assessment & Evaluation in Higher Education, 31, 379–394.
Wijewickrema, S., Ma, X., Piromchai, P., Briggs, R., Bailey, J., Kennedy, G., & O'Leary, S. (2018). Providing automated real-time technical feedback for virtual reality based surgical training: Is the simpler the better? In Artificial intelligence in education, lecture notes in computer science (pp. 584–598). Cham: Springer.
Xiao, Z., Wang, Y., Fu, K., & Wu, F. (2017). Identifying different transportation modes from trajectory data using tree-based ensemble classifiers. ISPRS International Journal of Geo-Information, 6(2), 57.
Yang, M., & Carless, D. (2013). The feedback triangle and the enhancement of dialogic feedback processes. Teaching in Higher Education, 18, 285–297.