A R T I C L E   I N F O

Keywords: Cognitive presence; Deep learning; Explainable artificial intelligence; Online discussion

A B S T R A C T

This paper proposes the adoption of a deep learning method to automate the categorisation of online discussion messages according to the phases of cognitive presence, a fundamental construct from the widely used Community of Inquiry (CoI) framework of online learning. We investigated not only the performance of a deep learning classifier but also its generalisability and interpretability, using explainable artificial intelligence algorithms. In the study, we compared a Convolutional Neural Network (CNN) model with previous approaches reported in the literature based on random forest classifiers and linguistic features of psychological processes and cohesion. The CNN classifier trained and tested on the individual data sets reached results up to Cohen's κ of 0.528, demonstrating a performance similar to that of the random forest classifiers. The generalisability outcomes of the CNN classifiers across two disciplinary courses were also similar to the results of the random forest approach. Finally, the visualisations of explainable artificial intelligence provide novel insights into identifying the phases of cognitive presence through word-level relevance indicators, complementing the feature importance analysis from the random forest. Thus, we envisage combining the deep learning method and conventional machine learning algorithms (e.g. random forest) as complementary approaches to classify the phases of cognitive presence.
* Corresponding author.
E-mail addresses: y.hu@auckland.ac.nz (Y. Hu), rafael.mello@ufrpe.br (R. Ferreira Mello), dragan.gasevic@monash.edu (D. Gašević).
https://doi.org/10.1016/j.caeai.2021.100037
Received 26 April 2021; Received in revised form 23 August 2021; Accepted 10 October 2021
Available online 29 October 2021
2666-920X/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Y. Hu et al. Computers and Education: Artificial Intelligence 2 (2021) 100037
ongoing discussions to assist instructors in their teaching (Kovanović, Gašević, & Hatala, 2014).

Several automated and semi-automated methods have been proposed in the recent literature (Barbosa et al., 2020; Ferreira et al., 2018; Kovanović et al., 2014b, 2016; Neto et al., 2018; Rolim et al., 2019; Waters et al., 2015). While early approaches applied a combination of word and phrase count features with black-box machine learning algorithms (i.e., Neural Networks and Support Vector Machines) (Kovanović, Joksimović, et al., 2014; Mcklin, 2004), recent studies proposed the adoption of features based on psychological processes, writing cohesiveness, and discussion structure combined with decision tree algorithms (Barbosa et al., 2020; Kovanović, Joksimović, et al., 2014; Neto et al., 2018). The main advantages of the second approach are the description of important indicators of the different cognitive presence phases and the improvement in the generalisability of the created classifiers (Kovanović et al., 2016).

Advances in natural language processing have led to the development of more accurate text classification techniques based on deep neural networks (Lai et al., 2015). The main idea of these methods is that the classifier tends to reach better results, even for different contexts, as the database increases. Moreover, explainable artificial intelligence is a recent method for unpacking the results of deep neural network classifiers (Miller, 2019; Samek & Müller, 2019). The combination of these approaches could provide alternative analyses not explored before in terms of the analysis of CoI constructs.

Therefore, this paper proposes a new method for the automatic categorisation of online discussion messages using deep machine learning and explainable artificial intelligence. Our study analyses the efficiency and the generalisability of this approach in the context of cognitive presence. The results showed that this approach reached results comparable to the previous approaches reported in the literature (Barbosa et al., 2021; Kovanović et al., 2016; Neto et al., 2018). Moreover, the explainable artificial intelligence visualisations shed new light on the feature importance for this problem. The results and their implications are further discussed in this paper.

2. Background

2.1. The Community of Inquiry and cognitive presence

The Community of Inquiry (CoI) framework proposed by Garrison et al. (1999) is a social-constructivist model used to analyse the interactive educational experience in online learning environments. It describes educational engagement that occurs in a learning community, in which "a group of individuals who collaboratively engage in purposeful critical discourse and reflection to construct personal meaning and confirm mutual understanding" (Garrison & Anderson, 2011, p. 2). The CoI framework has been widely used and validated for multidisciplinary courses to understand the participants' engagement in online discussion forums (Garrison & Anderson, 2011; Garrison, Anderson, & Archer, 2010; Hu et al., 2020; Liu & Yang, 2014). This model defines three dimensions, called presences, to mould the learning experience: (i) Cognitive presence explains the critical reflection of the knowledge (re)construction and problem-solving processes in the learning community (Garrison et al., 2001); (ii) Social presence analyses the capacity of the participants to humanise their relationships during an online discussion. It considers how social interactions could be used to model the social climate within a group of learners (i.e., cohesion, affective, and interactive dimensions) (Rourke et al., 1999); (iii) Teaching presence concerns the teaching role before (i.e., course design) and during (i.e., facilitation and direct instruction) the course (Anderson et al., 2001).

Although all three presences contribute to the understanding of the learning experience, cognitive presence is the primary element of the CoI model in terms of the operationalisation of the practical inquiry cycle of social-constructivist learning within online environments (Garrison et al., 2001). Within cognitive presence, this cycle is represented in four phases:

1. Triggering event initiates the cycle of critical inquiry. In this phase, a student or the instructor proposes a problem or dilemma to trigger the discussion. It could be a problem statement or a question asked by a student;
2. Exploration involves an exchange of findings and reflections on initial ideas. Based on the interactions, the participants propose possible solutions for the given problem or explanations for the questions;
3. Integration is the phase in which the participants synthesise coherent solutions and knowledge constructed on the ideas and the information shared in the previous phases. In other words, the participants construct structured solutions to the given problem/question by integrating various ideas about the solution;
4. Resolution is the phase in which the participants assess the newly-constructed knowledge through hypothesis testing or real-world application in relation to the problem/dilemma that initially triggered the discussion.

In this study, we focus on the analysis of cognitive presence since it is the 'primary issue' of students' learning evidence to be explored before the other dimensions (Rourke & Kanuka, 2009). The other two presences of the CoI will be investigated in our future research. Thus, the discussion messages that do not fall into any of the above phases are classified as Other. Also, the discussion messages that reveal indicators of two phases are classified into the higher one, following the rule of coding up. Moreover, examples of the student discussion messages explored in this study are provided in Table 1 for a better understanding of each cognitive phase.

2.2. Content analysis of online discussions with CoI

Quantitative Content Analysis (QCA) (Rourke et al., 2001) is broadly applied to assess social and cognitive presences by assigning the units of the online discussion transcripts to a class and counting the number of units in each class (Bauer, 2007). The CoI model defines three QCA coding schemes, one for each presence. Although the CoI model has been extensively adopted in education research, content analysis has been essentially employed for retrospective analysis, after the courses are finished, without much impact on the actual student learning and outcomes (Strijbos, 2011).

In this regard, the recent literature reports several studies on the adoption of automated methods (i.e., machine learning techniques) to assess CoI presences, intending to drive instructional interventions and enhance student learning outcomes (Kovanović, Gašević, & Hatala, 2014; Rolim et al., 2019). Initially, the methods focused on the use of combinations of traditional text mining features (i.e., bag-of-words vectors) and classification algorithms. For instance, Mcklin (2004) and Kovanović, Joksimović, et al. (2014) proposed the application of bag-of-words (n-gram) features and supervised algorithms based on Neural Networks and Support Vector Machines (SVMs), respectively, to automatically classify online discussion messages into the phases of cognitive presence. The methods reached Cohen's κ of 0.31 (Mcklin, 2004) and 0.41 (Kovanović, Joksimović, et al., 2014). Moreover, previous works also combined word and document embeddings with LIWC and contextual features to categorise online discussion messages according to the cognitive presence phases (Hayati et al., 2019). Although this study used embeddings to enhance the feature vectors, the paper adopted black-box algorithms in the analysis, such as SVM. The authors did not provide details about the number of messages in the evaluation data, but the best classifier proposed reached Cohen's κ of 0.68.

The shortcomings of the initial approaches (bag-of-words and black-box classification algorithms) are the lack of: (i) generalisability, implying that the creation of classification models is dependent on the domain, and (ii) interpretability, which does not provide information on
why a specific class was chosen for each message (Kovanović et al., 2016). To overcome these problems, the previous methods reported in the literature used domain-independent features and decision trees, especially the random forest algorithm. In this direction, Kovanović et al. (2016), Neto et al. (2018), and Barbosa et al. (2020) adopted random forests and features based on linguistic resources (Coh-Metrix (McNamara et al., 2014) and LIWC (Tausczik & Pennebaker, 2010)), latent semantic analysis, named entity recognition, and discussion structures (Waters et al., 2015), to identify the phases of cognitive presence. These studies reached Cohen's κ of 0.63 (Kovanović et al., 2016), 0.72 (Neto et al., 2018), and 0.53 (Barbosa et al., 2020) when applied to discussion messages written in English, Portuguese, and across these two languages, respectively.

Table 1
Examples of student discussion messages for each phase of cognitive presence from two courses.

Phase | Example of a student discussion message | Course
Other | Hi XXX I really appreciate your suggestions. I will keep that in mind for future presentations. | Software course
Other | An interesting start and I think I'm going to enjoy the next 8 weeks. | Philosophy MOOC
Trigger | I noticed you defined organizational model related more on process actors do you think any other important factors can influence the organizational model study as well? | Software course
Trigger | Have you managed to argue the above (that there is no argument) in standard form? | Philosophy MOOC
Exploration | From my understanding TagViewer transcribes audio and links it to video. Is there future work for the TagViewer to interpret changes in facial responses for example? Even though body language doesn't have associated audio it does say quite a bit and it would be nice to transcribe that as well. | Software course
Exploration | It's not a disagreement but discussion! Technically I think the words have different meanings but in a religious context they also have a connection so maybe we're both right! | Philosophy MOOC
Integration | … I've liked using iterative processes as you describe however I don't believe we should ignore process improvement tools that can be integrated into the process to build better software. I believe the patterns and questionnaires described by the author's are some of these tools. If we consider usability what your are describing was almost 70% less effective than using the authors patterns … | Software course
Integration | … I think it was legitimate as he was acting in the course of his duties. Your professional role applies constraints which would not apply if you were discussing matters as an individual; therefore, in the example, the appeal to authority is appropriate … | Philosophy MOOC
Resolution | … Agile is by its definition supposed to be flexible and dynamic in nature. As a result I find it ironic that there would be a sticking point on literature Vs real world. I would think the approach should be flexible based on the client and developer skill sets. Software development is a complex human interaction regardless of the project size … | Software course
Resolution | … Another way to test it would be to see if similar positions eg heads of industry are also held by more left-handed people than statistics would suggest. It would be incredibly difficult to iron out other possible factors - all the other things that help to make people successful, like good schooling, access to health care and a stable family environment, but if it were possible, could we scientifically show that being left-handed is a factor? I'm assuming that if we could we would scientifically show that it's pure coincidence … | Philosophy MOOC

2.3. Deep learning and explainable artificial intelligence

Deep learning algorithms have gained notoriety due to the outstanding results achieved in computer vision tasks (Russakovsky et al., 2015), but their application rapidly spread to other domains such as Natural Language Processing (NLP) (Young et al., 2018). These algorithms rely on the idea that the large number of connections generated in deep neural networks could simulate the same learning process as the human brain (Goodfellow et al., 2016).

In terms of NLP, deep learning has played an increasing role in tasks related to topic modelling/classification (ALRashdi & O'Keefe, 2019) and machine translation (Yang et al., 2020). The Convolutional Neural Network (CNN), as a class of deep neural networks, has been commonly used for text classification tasks, especially for small text documents such as Twitter and online discussion posts (Lai et al., 2015; Wei et al., 2017). Some promising studies demonstrated that CNNs outperform traditional machine learning algorithms in this context. For instance, Kamath et al. (2018) compared the results of a deep learning architecture against random forests, SVM, Logistic Regression and Multilayer Perceptron (MLP) in two text classification tasks. The final results showed that the deep learning algorithm achieved results up to 50% better than the other ones. Similarly, Guo et al. (2019) compared different architectures of CNN with Adaboost, a state-of-the-art decision tree algorithm, with the goal of identifying urgency in online discussion messages. In this context, CNN outperformed Adaboost by up to 14%.

Despite the promising results, the inability to explain the functionality or reasons behind the prediction output of deep learning algorithms is one of the core barriers that Artificial Intelligence is encountering (Barredo Arrieta et al., 2020). It is also one of the main reasons why white-box algorithms (e.g. random forests) can be preferable in educational research, notably in the previous studies of automatic classifiers for cognitive presence (Barbosa et al., 2020; Kovanović et al., 2016; Neto et al., 2018), compared to deep neural networks. However, the field of eXplainable Artificial Intelligence (XAI) aims to create machine learning techniques that enable human audiences to easily understand, reasonably clarify, and appropriately trust AI methods and implementations (Gunning, 2017). XAI can potentially complement the adoption of deep learning algorithms since it unpacks the model decisions. In learning analytics, explainability of the automatic analysis approach and its product is also a significant goal (Gašević et al., 2017). It should provide scholars and educators insights into the research questions, for example, the indicators of different cognitive phases revealed in the online discussion messages in our study.

3. Research questions

As discussed in Section 2.1, cognitive presence has a crucial role in analysing knowledge construction and critical thinking in online learning environments. It assesses "the extent to which the participants in any particular configuration of a community of inquiry are able to construct meaning through sustained communication" (Garrison et al., 1999, p. 89). Although several studies proposed automatic methods for classifying discussion transcripts according to the phases of cognitive presence, to our knowledge, no publication applied deep learning techniques for the automatic content analysis of cognitive presence. Hence, our first research question is:

Research Question 1 (RQ1): To what extent can deep learning methods accurately classify online discussion messages according to the phases of cognitive presence? Can deep learning outperform traditional machine learning techniques for this task?

One of the main limitations reported in the literature is the lack of generalisability of the existing classifiers of cognitive presence (Barbosa et al., 2020; Kovanović et al., 2016; Neto et al., 2018). As deep learning methods are based on the content of the text, they could lead to overfitting of the generated model. Therefore, our second research question is:
the between-course generalisability of the automatic classifier for cognitive presence.

4.2. Deep learning architectures

In this study, we used an Inception CNN of naïve form (Szegedy et al., 2015), with the application of pre-trained word-embedding models as word representations, to build a classifier for analysing phases of cognitive presence. The architecture of this approach is explained in Section 4.2.2.

4.2.1. Word embeddings
Word embedding is an NLP technique that uses continuous real-number vectors to represent texts (i.e. words, phrases, sentences and documents). These vector representations are learned from very large corpora, and are composed of meaningful features associated with positions in a high-dimensional space. The main idea of word embeddings is that semantically similar words or phrases are located very closely in the space, which also means they have similar representation vectors. We used the GloVe pre-trained model (Pennington et al., 2014) to convert the text messages into embedding vectors as the inputs for our deep learning neural networks. For the out-of-vocabulary words (77 words) that GloVe did not contain, we randomly initialised their vectors, which is a common approach in the literature. During the data pre-processing, we removed the punctuation and numbers in each sentence, and also performed case-folding and lemmatisation. The max length of tokens for each message was 1939, which contained the complete information for all the messages from both data sets. A matrix of shape D × L × N was generated as the input variables of our neural networks, in which D denotes the word embedding dimension,² L the max length of tokens, and N the number of messages for training.

4.2.2. Convolutional neural networks
The architecture of the naïve-form Inception CNN (Szegedy et al., 2015) used in this study was based on an architecture³ applied with high performance in the classification of discussions on the Quora platform. Fig. 1 depicts the details of the architecture. It consists of a spatial dropout layer, five convolution layers with max poolings, a batch normalisation process, and a fully-connected layer. Initially, the word embedding matrix, as the vector representation of the input messages, is fed into the spatial dropout neuron with five different filters. Instead of feeding the matrix into the convolution layers directly, the use of spatial dropout can be effective for preventing neural networks from over-fitting (Lee & Lee, 2020). After the dropout layer, the five one-dimensional convolution layers operate separately on the five matrices produced from the dropout layer. Then, five max pooling layers retain the important information from the convolution layers while reducing the dimensionality, respectively. After concatenation of the five layers, a new matrix is produced. Next, the new matrix passes through batch normalisation, improving training speed and validation accuracy, before the fully-connected layer computes the prediction probability for each class. Finally, the network outputs one of the five phases of cognitive presence using the softmax activation function. Table 3 presents the values of the optimal hyper-parameters implemented for all the experiments.

4.3. Explainable artificial intelligence

We are interested in the relevance scores between the input variables and the identified categories at the word level. In other words, we wonder which words in the messages are important for each cognitive phase, to understand the deep learning algorithm's decisions. We applied Gradient-based Sensitivity Analysis (SA), as an example of eXplainable AI, to visualise the importance of words in the discussion messages for identifying the phases of cognitive presence. Similar methods were used to explain deep learning neural networks in NLP tasks (Arras, Horn, et al., 2017a, 2017b). Our trained neural network has a prediction function f_c for each category c and an input vector x, which consists of dimension d using word embeddings. We use x_{i,t} to denote the ith element of the word embedding vector, which represents the tth word in a message. Gradient-based SA (Gevrey et al., 2003) assigns the relevance scores R_{i,t} by calculating squared partial derivatives:

    R_{i,t} = ( ∂f_c/∂x_{i,t} (x) )²    (1)

These derivatives can be computed by using standard gradient backpropagation (Rumelhart et al., 1986) in most neural networks. SA can also be seen as a decomposition of the squared gradient norm if we sum up the relevance scores over the entire input dimension d:

    ‖∇_x f_c(x)‖₂² = Σ_d R_{i,t}    (2)

In this study, we employed the SmoothGrad method from the iNNvestigate library (Alber et al., 2019) as it can effectively improve the gradient-based sensitivity maps by averaging the gradient over a number of inputs with added noise (Smilkov et al., 2017). The relevance scores of SA are positive values and were between 0 and 1 in our study. We plotted heatmaps to visualise the word-level relevance scores for each cognitive phase. The words highlighted in red denote strong relevance (0.5 < R ≤ 1) to a target phase, while words in blue indicate low relevance (0 < R ≤ 0.5).

4.4. Evaluation metrics

We used the metrics of accuracy, Cohen's κ, F1 score, weighted-averaged F1 score, and the error rate of each class to measure the performance of our CNN and RF classifiers. Accuracy and Cohen's κ are commonly used to evaluate the performance of a supervised machine-learning classifier (i.e., input variables have been pre-classified) in educational research (Barbosa et al., 2020; Kovanović et al., 2014b, 2016; Neto et al., 2018; Wang et al., 2015; Waters et al., 2015). Accuracy is defined as the number of true positives (TP_c) plus the number of true negatives (TN_c) over the sum of all the true and false predictions for each class c.

    accuracy = (Σ_c TP_c + Σ_c TN_c) / (Σ_c TP_c + Σ_c TN_c + Σ_c FP_c + Σ_c FN_c)    (3)

Cohen's κ (Cohen, 1960) was first introduced to measure the reliability between two human coders on classification tasks, lessening the agreement caused by chance. It can also be used to evaluate the degree of agreement between the actual and predicted labels in machine learning tasks. Cohen's κ is defined as

    κ = (P_0 − P_e) / (1 − P_e)    (4)

where P_0 is the observed agreement probability over all classes and P_e is the agreement probability by chance, which is computed from the column and row marginal probabilities in the confusion matrix (Ben-David, 2008).

Macro- and micro-averaged scores are often adopted to compute the overall performance of multi-class classification problems. The macro-averaged method considers all classes as equally important, whereas the micro-averaged method regards the different contributions of each class, valuing the minority class the least (Van Asch, 2013). Macro-Averaged Precision (macro AP) is defined as the mean value of the precision score (P_c) of each class c.

² The 100-dimension pre-trained vectors were used in our experiments. https://nlp.stanford.edu/projects/glove/.
³ https://www.kaggle.com/christofhenkel/inceptioncnn-with-flip.
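The input construction described in Section 4.2.1 — GloVe lookup with random initialisation for out-of-vocabulary tokens and zero-padding to a fixed length — can be sketched as follows. This is a minimal illustration: the toy 4-dimensional `glove` dictionary stands in for the real 100-dimension pre-trained table, and the messages and `max_len` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-in for the GloVe lookup table (the paper used the
# 100-dimension pre-trained vectors; 4 dimensions here for brevity).
EMBED_DIM = 4
glove = {
    "software": rng.normal(size=EMBED_DIM),
    "process": rng.normal(size=EMBED_DIM),
    "model": rng.normal(size=EMBED_DIM),
}

def embed_messages(messages, max_len):
    """Map tokenised messages to an (N, L, D) matrix of embedding vectors.

    Out-of-vocabulary tokens receive a randomly initialised vector (as in
    the paper); messages shorter than max_len are zero-padded.
    """
    matrix = np.zeros((len(messages), max_len, EMBED_DIM))
    for n, tokens in enumerate(messages):
        for t, token in enumerate(tokens[:max_len]):
            if token not in glove:
                glove[token] = rng.normal(size=EMBED_DIM)  # random init for OOV
            matrix[n, t] = glove[token]
    return matrix

msgs = [["software", "process"], ["model", "unknownword", "process"]]
X = embed_messages(msgs, max_len=5)
print(X.shape)  # (2, 5, 4)
```

With the real model, `glove` would be read from the downloaded GloVe vector file and `max_len` would be 1939, matching the pre-processing described above.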
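The core idea of the naïve-form Inception architecture in Section 4.2.2 — parallel convolution branches with different filter widths, max pooling, and concatenation feeding a softmax output — can be sketched with untrained random weights. The kernel sizes and filter counts below are illustrative, not the paper's tuned hyper-parameters, and dropout and batch normalisation are omitted for brevity.

```python
import numpy as np

def conv1d_valid(x, kernel_size, n_filters, rng):
    """1-D 'valid' convolution over an (L, D) token sequence, ReLU activation."""
    L, D = x.shape
    W = rng.normal(size=(n_filters, kernel_size, D))
    out = np.zeros((L - kernel_size + 1, n_filters))
    for i in range(L - kernel_size + 1):
        window = x[i:i + kernel_size]                              # (k, D)
        out[i] = np.maximum(0, np.einsum("fkd,kd->f", W, window))  # ReLU
    return out

def inception_block(x, kernel_sizes=(1, 2, 3, 4, 5), n_filters=8, seed=0):
    """Five parallel conv branches, each followed by global max pooling,
    concatenated into one feature vector."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        feat = conv1d_valid(x, k, n_filters, rng)
        branches.append(feat.max(axis=0))          # global max pooling
    return np.concatenate(branches)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
message = rng.normal(size=(20, 4))      # L=20 tokens, D=4 embedding dims
features = inception_block(message)     # 5 branches x 8 filters -> (40,)
W_out = rng.normal(size=(5, features.size))
probs = softmax(W_out @ features)       # 5 cognitive-presence classes
print(features.shape, probs)
```

In the paper's pipeline the weights are of course learned and a spatial dropout layer plus batch normalisation sit around the convolution branches, as described above.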
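Equations (1) and (2) of Section 4.3 can be illustrated on a toy softmax model standing in for the trained CNN; an analytic gradient replaces backpropagation here, and the SmoothGrad-style averaging at the end follows the same idea as the method the paper uses (the dimensions and weights are made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 6, 3                      # input dimension, number of classes
W = rng.normal(size=(C, D))

def f(x):
    """Class probabilities f_c(x) for a linear-softmax model."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_fc(x, c):
    """Analytic gradient of f_c w.r.t. x: p_c * (W_c - sum_j p_j W_j)."""
    p = f(x)
    return p[c] * (W[c] - p @ W)

def sensitivity(x, c):
    """Eq. (1): relevance R_i = (d f_c / d x_i)^2."""
    return grad_fc(x, c) ** 2

x = rng.normal(size=D)
c = int(np.argmax(f(x)))
R = sensitivity(x, c)

# Eq. (2): the relevances decompose the squared gradient norm.
assert np.isclose(R.sum(), np.linalg.norm(grad_fc(x, c)) ** 2)

def smoothgrad(x, c, n=50, sigma=0.1):
    """SmoothGrad-style variant: average the gradient over noisy copies of x."""
    g = np.mean([grad_fc(x + rng.normal(scale=sigma, size=D), c)
                 for _ in range(n)], axis=0)
    return g ** 2

R_smooth = smoothgrad(x, c)
```

For a real network the per-word relevance heatmaps are obtained by applying the same computation to each embedding element x_{i,t}, as the paper does via the iNNvestigate library.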
    weightedAR = Σ_c R_c × W_c    (9)

    weightedF1 = 2 × (weightedAP × weightedAR) / (weightedAP + weightedAR)    (10)

To evaluate the class-level performance, the error rate was used with the confusion matrices as the number of false positives (FP_c) over the number of true positives (TP_c) plus false positives (FP_c) for each class.

    error rate of each class c = FP_c / (FP_c + TP_c)    (11)

Table 4
Distribution of training and test data in the experiments.

RQ | Experiment | Classifier | Training size | Test size
1 | Train & validate on SoftwareSet | 1 | 1747 (90%) | 1747 (10%)
1 | Train & validate on MOOCSet | 2 | 1479 (90%) | 1479 (10%)
2 | Validate Classifier 1 on MOOCSet | 1 | 1747 (90%) | 1479 (100%)
2 | Validate Classifier 2 on SoftwareSet | 2 | 1479 (90%) | 1747 (100%)
  | Train on both but validate on SoftwareSet | 3 | 3051 | 1747 (10%)
  | Train on both but validate on MOOCSet | 4 | 3078 | 1479 (10%)
4 | Train & validate on both | 5 | 2903 | 323 (10% of each set)

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score.
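As a concrete illustration of Eq. (4) and Eq. (11), Cohen's κ and the per-class error rate can be computed directly from a confusion matrix. The counts below are hypothetical, not the paper's data.

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a confusion matrix (rows: actual, cols: predicted)."""
    n = cm.sum()
    p0 = np.trace(cm) / n                                 # observed agreement P_0
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement P_e
    return (p0 - pe) / (1 - pe)

def error_rate_per_class(cm):
    """Eq. (11): FP_c / (FP_c + TP_c) for each class c."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    return fp / (fp + tp)

# Hypothetical 3-class confusion matrix.
cm = np.array([[50,  5,  5],
               [10, 30, 10],
               [ 5,  5, 40]])

kappa = cohens_kappa(cm)
err = error_rate_per_class(cm)
print(round(kappa, 3))   # 0.622
print(err.round(3))
```

The same values can be obtained with scikit-learn's `cohen_kappa_score` on the raw label vectors; the explicit form here just makes the P_0 and P_e terms of Eq. (4) visible.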
Table 7
Outcomes of the CNN classifiers for combined-course data set validation.

Experiment | Accuracy | Kappa | Weighted F1 | Macro F1
Classifier 3: Train on both but test on SoftwareSet (CNN) | 0.463 | 0.194 | 0.501 | 0.327
Classifier 4: Train on both but test on MOOCSet (CNN) | 0.628 | 0.359 | 0.649 | 0.404
Classifier 5: Train & test on both (CNN) | 0.582 | 0.366 | 0.603 | 0.398
RF classifier 3: Train on both but test on SoftwareSet (RF) | 0.528 | 0.281 | 0.550 | 0.419
RF classifier 4: Train on both but test on MOOCSet (RF) | 0.679 | 0.440 | 0.719 | 0.504
RF classifier 5: Train & test on both (RF) | 0.624 | 0.408 | 0.662 | 0.444
relevance. In comparison with the positive associations, the majority of the lowest-relevance words in both data sets were 'stop-words', which refers to the most commonly used words in a language, such as the, to, a, and is in Fig. 6. Apart from the stop-words, there were some nouns highlighted as having low relevance to the Exploration phase in the SoftwareSet message (Fig. 6a), namely programmer and adaptation.

We also provide two sample messages classified into the Integration phase from the two data sets in Fig. 7. In the message from the SoftwareSet (Fig. 7a), the highly relevant words for the classification of the Integration phase included implementation, i, fold, and long, and the moderately relevant words included corresponding, exception, aspects, and methodology, while the low-relevance words were presents, actually, reports, and however, as well as the stop-words. Comparatively, in the MOOCSet message (Fig. 7b), the words that indicated strong relevance were correlation, between, associated, relation, lateralization, and functional, whereas the low-relevance words were divergent, necessarily, hemisphere(s), specifically, cite, and so on.

6. Discussion

6.1. The performance comparison between our CNN classifiers and RF baselines – RQ1

The CNN classifiers presented in this study reached Cohen's κ of 0.475 and 0.528 for the SoftwareSet and LCT MOOCSet, respectively, which demonstrates a moderate inter-rater agreement (Landis & Koch, 1977). These results were similar to the performance of the RF classifiers, which used random forests with LIWC and Coh-Metrix features, and to the outcomes of the experiments without the over-sampling methods reported in the previous studies in the literature (Kovanović et al., 2016; Farrow et al., 2019; Barbosa et al., 2020). In addition, the CNN classifiers obtained a smaller error rate in the Integration phase (Figs. 4 and 5) compared to the previous works (Barbosa et al., 2020; Farrow et al., 2019). Conceptually, the Integration phase can be characterised by a knowledge construction process where the students formulate new concepts and solutions (Garrison et al., 2001). The error reduction in this phase could be related to the adoption of pre-trained word embeddings, which include an extensive vocabulary compared to the traditional machine learning methods (Pennington et al., 2014).

6.2. Generalisability of our CNN classifiers across different data sets – RQ2

Another important finding of this study concerns the analysis of the generalisability of a method based on textual features. While the literature suggests that the adoption of linguistic resources (i.e., LIWC and Coh-Metrix) and structural information (i.e., discussion context features) (Kovanović et al., 2016; Farrow et al., 2020; Barbosa et al., 2020) can increase the generalisability of a classifier, our results demonstrate that the use of the CNN classifiers in cross-course validations can reach similar results (Table 6). This finding implies that deep learning approaches with word embeddings have the potential to increase the generalisability if we can build neural networks that include the sequential structures of the discussion messages, such as Recurrent Neural Networks (Graves et al., 2013) and Transformers (Wolf et al., 2020), in future work. Also, the generalisability could be improved by using advanced pre-trained models (e.g. BERT (Devlin et al., 2018)) and fine-tuning them with corpora that contain the domain-specific vocabularies in the sample data sets.

Random forest classifiers achieved better results when the training step was performed on the combined data set (the SoftwareSet and LCT MOOCSet). A possible reason for this could be found in the fact that the optimal parameters (i.e. ntree and mtry) were fine-tuned separately in the RF classifiers for the individual-course and combined-course experiments. However, we used the same hyper-parameters, such as the number of filters, batch sizes and kernel sizes, in all of our CNN experiments to maintain consistency. This result suggests that the hyper-parameters should be adjusted according to changes in the data set.

6.3. Important features revealed from the explainable AI – RQ3

This paper also introduced some examples of the eXplainable Artificial Intelligence (XAI) visualisations to investigate the indicators for
Fig. 4. Confusion Matrix with SoftwareSet as the training set and MOOCSet as the test set.
Fig. 5. Confusion Matrix with MOOCSet as the training set and SoftwareSet as the test set.
Fig. 6. Two XAI visualisations for the classification of Exploration phase in the two data sets. High relevance is mapped to red, low relevance to blue. The colour
opacity is normalised to the maximum relevance scores per message. The redder the word is, the more important it is to identify the phase. The bluer the word is, the
lower the relevance. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
the phases of cognitive presence through the training of deep neural networks. Interestingly, the first four highly relevant words we mentioned in the MOOCSet message (Fig. 7b), and the corresponding word in the SoftwareSet message (Fig. 7a), can be read as vocabulary for ‘connecting ideas’ and ‘convergence’, which are important indicators for identifying the Integration phase in the coding scheme of cognitive presence (Garrison et al., 2001; Hu et al., 2020). Besides, most of the high-relevance words in the examples of the Exploration phase (Figs. 6 and 7) were course-relevant concepts previously used in the discussions, which aligns with the definition of this phase (Garrison et al., 2001). These findings suggest that expressions indicating relationships between different subjects, ideas or arguments in a discussion message can be significant features for identifying the Integration phase. Also, in shorter messages, the use of specific nouns or noun phrases that convey the core concepts delivered in the courses could be an important indicator for identifying the Exploration phase.
We also found that the important features provided by our deep learning with XAI method were word-level based, which differs from what the previous works found through the Mean Decrease Gini (MDG) interpretation (Barbosa et al., 2020; Kovanović et al., 2016). The previous works proposed more abstract features, such as the degree of lexical diversity and the degree of meaningfulness of content words. Educational researchers and course instructors have to seek the expressions that match the definitions of those abstract features in the Coh-Metrix and LIWC tools for each message, which is time-consuming and nontransparent. In comparison, the word-level explanations of XAI are perceivable and understandable. Nevertheless, these two explainable methods have some similarities. For instance, Kovanović et al. (2016) suggest that messages with higher semantic similarity to the preceding messages tend to be more relevant to the higher phases (i.e., Integration and Resolution). This observation is consistent with our XAI visualisations, which show that expressions associating ideas in a message can be strong indicators of the Integration phase. Similarly, XAI could shed light on the topic of a question asked during the discussion, since the number of question marks can be a strong indicator of Triggering event messages according to the previous studies (Barbosa et al., 2020; Kovanović et al., 2016). A divergence from the previous work is also reflected: several important indicators proposed in the previous work were related to stop-words (i.e., the number of personal pronouns, prepositions, indefinite pronouns, and pronouns in the third person singular); however, the XAI highlighted many of them as low-relevance indicators.
Therefore, we suggest that the deep learning method (CNN and XAI) and the previous approach (random forests and MDG; Barbosa et al., 2020; Farrow et al., 2020; Kovanović et al., 2016) could be applied as complementary approaches. On the one hand, it is possible to propose a compound classifier that benefits from the best predictive power of each approach, since the random forests better predict the Triggering event and Exploration phases, whereas the CNN obtained more favourable results for the Integration phase. On the other hand, XAI could provide additional insights (i.e., word-level positive and negative indicators) into the most relevant features extracted by MDG.
Fig. 7. Two XAI visualisations for the classification of the Integration phase in the two data sets. High relevance is mapped to red, low relevance to blue. The colour opacity is normalised to the maximum relevance score per message. The redder the word, the more important it is for identifying the phase; the bluer the word, the lower the relevance. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
7. Final remarks
This paper has three contributions. First, the introduction of a deep learning model to automatically classify online discussion messages into cognitive presence phases. The results reached a Cohen’s κ value of 0.498, a moderate inter-rater agreement. Second, the proposed
approach was applied in several scenarios to evaluate the generalisability of this solution. This investigation concluded that deep learning models produce results similar to those of the random forest classifiers. Finally, this study adopted an explainable artificial intelligence visualisation method to provide insights into the indicators of the different cognitive presence phases.
The study’s practical implications include providing more information to support instructors’ decision-making in relation to students’ participation in online discussions. While previous works focused on providing structural information based on the text (lexical diversity and readability of the text), the proposed XAI visualisation produces insights into the content relevant to the discussion. The combination of both approaches produces a richer analysis for instructors.
Despite the promising results, some limitations can be identified, such as the small number of message examples used in the current study, which could have limited the potential of the deep learning models used. Next, the data came from only two courses (a fully online masters course about software engineering and a MOOC on philosophy); therefore, the results might have been affected by the intricacies of their course designs. Moreover, discussion structural features, such as conversational sequences, were not involved in the construction of the deep learning classifiers. Finally, we did not explore the potential of combining the results of the deep learning and traditional machine learning algorithms. Future work should aim to conduct experimentation with the approach and possible evaluation with larger sample sizes composed of data from different domains and languages. We also seek to optimise the approach proposed in this study by analysing the performance of different deep learning models and combining them with traditional machine learning algorithms.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K. T., Montavon, G., Samek, W., Müller, K. R., Dähne, S., & Kindermans, P. J. (2019). iNNvestigate neural networks. Journal of Machine Learning Research, 20. arXiv:1808.04260.
ALRashdi, R., & O’Keefe, S. (2019). Deep learning and word embeddings for tweet classification for crisis response. arXiv preprint arXiv:1903.11024.
Anderson, T., & Dron, J. (2010). Three generations of distance education pedagogy. The International Review of Research in Open and Distance Learning, 12, 80–97. URL: http://www.irrodl.org/index.php/irrodl/article/view/890/.
Anderson, T., Rourke, L., Garrison, D. R., & Archer, W. (2001). Assessing teaching presence in a computer conferencing context. Journal of Asynchronous Learning Networks, 5, 1–17.
Arras, L., Horn, F., Montavon, G., Müller, K. R., & Samek, W. (2017a). “What is relevant in a text document?”: An interpretable machine learning approach. PLoS One, 12, Article e0181142.
Arras, L., Montavon, G., Müller, K. R., & Samek, W. (2017b). Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206.
Barbosa, G., Camelo, R., Cavalcanti, A. P., Miranda, P., Mello, R. F., Kovanović, V., & Gašević, D. (2020). Towards automatic cross-language classification of cognitive presence in online discussions. In Proceedings of the tenth international conference on learning analytics & knowledge (pp. 605–614).
Barbosa, A., Ferreira, M., Ferreira Mello, R., Dueire Lins, R., & Gasevic, D. (2021). The impact of automatic text translation on classification of online discussions for social and cognitive presences. In LAK21: 11th international learning analytics and knowledge conference (pp. 77–87).
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Bauer, M. W. (2007). Content analysis: An introduction to its methodology, by Klaus Krippendorff; From words to numbers: Narrative, data and social science, by Roberto Franzosi. British Journal of Sociology, 58, 329–331.
Ben-David, A. (2008). About the relationship between ROC curves and Cohen’s kappa. Engineering Applications of Artificial Intelligence, 21, 874–882.
Chakravarthi, B. R., Priyadharshini, R., Muralidaran, V., Suryawanshi, S., Jose, N., Sherly, E., & McCrae, J. P. (2020). Overview of the track on sentiment analysis for Dravidian languages in code-mixed text. In Forum for information retrieval evaluation (pp. 21–24).
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Falotico, R., & Quatto, P. (2015). Fleiss’ kappa statistic without paradoxes. Quality and Quantity, 49, 463–470.
Farrow, E., Moore, J., & Gasevic, D. (2019). Analysing discussion forum data: A replication study avoiding data contamination. https://doi.org/10.1145/3303772.3303779
Farrow, E., Moore, J., & Gašević, D. (2020). Dialogue attributes that inform depth and quality of participation in course discussion forums. In The 10th international conference on learning analytics and knowledge (pp. 129–134). Frankfurt, Germany: LAK ’20.
Ferreira, R., Kovanović, V., Gašević, D., & Rolim, V. (2018). Towards combined network and text analytics of student discourse in online discussions. In Artificial intelligence in education (pp. 111–126). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-93843-1_9
Garrison, D. R., & Anderson, T. (2011). E-learning in the 21st century: A framework for research and practice (2nd ed.). New York, NY, USA: Routledge.
Garrison, D. R., Anderson, T., & Archer, W. (1999). Critical inquiry in a text-based environment: Computer conferencing in higher education. The Internet and Higher Education, 2, 87–105. https://doi.org/10.1016/S1096-7516(00)00016-6
Garrison, D. R., Anderson, T., & Archer, W. (2001). Critical thinking, cognitive presence, and computer conferencing in distance education. American Journal of Distance Education, 15, 7–23. https://doi.org/10.1080/08923640109527071
Garrison, D. R., Anderson, T., & Archer, W. (2010). The first decade of the community of inquiry framework: A retrospective. The Internet and Higher Education, 13, 5–9. https://doi.org/10.1016/j.iheduc.2009.10.003
Garrison, D. R., Cleveland-Innes, M., & Fung, T. S. (2010). Exploring causal relationships among teaching, cognitive and social presence: Student perceptions of the community of inquiry framework. The Internet and Higher Education, 13, 31–36.
Gašević, D., Kovanović, V., & Joksimović, S. (2017). Piecing the learning analytics puzzle: A consolidated model of a field of research and practice. Learning: Research and Practice, 3, 63–78.
Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264. https://doi.org/10.1016/S0304-3800(02)00257-0
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). Cambridge: MIT Press.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). IEEE.
Gunning, D. (2017). Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA).
Guo, S. X., Sun, X., Wang, S. X., Gao, Y., & Feng, J. (2019). Attention-based character-word hybrid neural networks with semantic and structural information for identifying of urgent posts in MOOC discussion forums. IEEE Access, 7, 120522–120532.
Hayati, H., Chanaa, A., Idrissi, M. K., & Bennani, S. (2019). Doc2Vec & Naïve Bayes: Learners’ cognitive presence assessment through asynchronous online discussion TQ transcripts. International Journal of Emerging Technologies in Learning, 14.
Hu, Y., Donald, C., Giacaman, N., & Zhu, Z. (2020). Towards automated analysis of cognitive presence in MOOC discussions: A manual classification study. In The 10th international conference on learning analytics and knowledge (pp. 135–140). Frankfurt, Germany: LAK ’20.
Kamath, C. N., Bukhari, S. S., & Dengel, A. (2018). Comparative study between traditional machine learning and deep learning approaches for text classification. In Proceedings of the ACM symposium on document engineering 2018 (pp. 1–11).
Kovanović, V., Gašević, D., & Hatala, M. (2014a). Learning analytics for communities of inquiry. Journal of Learning Analytics, 1, 195–198.
Kovanović, V., Joksimović, S., Gašević, D., & Hatala, M. (2014b). Automated cognitive presence detection in online discussion transcripts. In Workshop proceedings of LAK 2014, Indianapolis, IN, USA. URL: http://ceur-ws.org/Vol-1137/.
Kovanović, V., Joksimović, S., Waters, Z., Gašević, D., Kitto, K., Hatala, M., & Siemens, G. (2016). Towards automated content analysis of discussion transcripts: A cognitive presence case. In Proceedings of the sixth international conference on learning analytics and knowledge (pp. 15–24). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2883851.2883950
Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence (pp. 2267–2273). Austin, Texas, USA.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, S., & Lee, C. (2020). Revisiting spatial dropout for regularizing convolutional neural networks. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-09054-7
Liu, C. J., & Yang, S. C. (2014). Using the community of inquiry model to investigate students’ knowledge construction in asynchronous online discussions. Journal of Educational Computing Research, 51, 327–354. https://doi.org/10.2190/EC.51.3.d
Mcklin, T. E. (2004). Analyzing cognitive presence in online courses using an artificial neural network. Ph.D. thesis. Atlanta, GA, USA. URL: https://scholarworks.gsu.edu/msit_diss/1/. AAI3190967.
McKnight, P. E., & Najab, J. (2010). Mann-Whitney U test. The Corsini Encyclopedia of Psychology, 1–1.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38.
Neto, V., Rolim, V., Ferreira, R., Kovanović, V., Gašević, D., Lins, R. D., & Lins, R. (2018). Automated analysis of cognitive presence in online discussions written in Portuguese. In European conference on technology enhanced learning (pp. 245–261). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-98572-5_19
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
Rivard, R. (2013). Measuring the MOOC dropout rate. Higher Education, 8, 2013.
Rolim, V., Ferreira, R., Kovanović, V., & Gašević, D. (2019). Analysing social presence in online discussions through network and text analytics. In Proceedings of the 19th IEEE international conference on advanced learning technologies (ICALT’19) (pp. 163–167). New Jersey, USA: IEEE. https://doi.org/10.1109/ICALT.2019.00058
Rourke, L., Anderson, T., Garrison, D. R., & Archer, W. (1999). Assessing social presence in asynchronous text-based computer conferencing. The Journal of Distance Education, 14, 50–71.
Rourke, L., Anderson, T., Garrison, D. R., & Archer, W. (2001). Methodological issues in the content analysis of computer conference transcripts. International Journal of Artificial Intelligence in Education (IJAIED), 12, 8–22.
Rourke, L., & Kanuka, H. (2009). Learning in communities of inquiry: A review of the literature (winner 2009 best research article award). International Journal of E-Learning & Distance Education/Revue internationale du e-learning et la formation à distance, 23, 19–48.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Samek, W., & Müller, K. R. (2019). Towards explainable artificial intelligence. In Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 5–22). Springer.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825.
Strijbos, J. W. (2011). Assessment of (computer-supported) collaborative learning. IEEE Transactions on Learning Technologies, 4, 59–73.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 1–9). https://doi.org/10.1109/CVPR.2015.7298594. arXiv:1409.4842.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54.
Van Asch, V. (2013). Macro- and micro-averaged evaluation measures (Vol. 49). Belgium: CLiPS.
Wang, X., Yang, D., Wen, M., Koedinger, K., & Rosé, C. P. (2015). Investigating how student’s cognitive behavior in MOOC discussion forums affect learning gains. International Educational Data Mining Society.
Waters, Z., Kovanović, V., Kitto, K., & Gašević, D. (2015). Structure matters: Adoption of structured classification approach in the context of cognitive presence classification. In Information retrieval technology (pp. 227–238). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-28940-3_18
Wei, X., Lin, H., Yang, L., & Yu, Y. (2017). A convolution-LSTM-based deep neural network for cross-domain MOOC forum post classification. Information, 8, 1–16. https://doi.org/10.3390/info8030092
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 38–45). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
Yang, S., Wang, Y., & Chu, X. (2020). A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526.
Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13, 55–75.