
Computers and Education: Artificial Intelligence 2 (2021) 100037
Automatic analysis of cognitive presence in online discussions: An approach using deep learning and explainable artificial intelligence

Yuanyuan Hu a,*, Rafael Ferreira Mello b,c, Dragan Gašević d,e,f

a Faculty of Engineering, University of Auckland, Auckland, New Zealand
b Departamento de Computação, Universidade Federal Rural de Pernambuco, Recife, Brazil
c Cesar School, Recife, Brazil
d Faculty of Information Technology, Monash University, Melbourne, Australia
e School of Informatics, University of Edinburgh, UK
f Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia

ARTICLE INFO

Keywords: Cognitive presence; Deep learning; Explainable artificial intelligence; Online discussion

ABSTRACT

This paper proposes the adoption of a deep learning method to automate the categorisation of online discussion messages according to the phases of cognitive presence, a fundamental construct from the widely used Community of Inquiry (CoI) framework of online learning. We investigated not only the performance of a deep learning classifier but also its generalisability and interpretability, using explainable artificial intelligence algorithms. In the study, we compared a Convolutional Neural Network (CNN) model with previous approaches reported in the literature based on random forest classifiers and linguistic features of psychological processes and cohesion. The CNN classifier trained and tested on an individual data set reached results up to Cohen's κ of 0.528, demonstrating performance similar to that of the random forest classifiers. Also, the generalisability outcomes of the CNN classifiers across two disciplinary courses were similar to the results of the random forest approach. Finally, the visualisations of explainable artificial intelligence provide novel insights into identifying the phases of cognitive presence through word-level relevance indicators, as a complement to the feature importance analysis from the random forest. Thus, we envisage combining the deep learning method and conventional machine learning algorithms (e.g. random forest) as complementary approaches to classify the phases of cognitive presence.

* Corresponding author.
E-mail addresses: y.hu@auckland.ac.nz (Y. Hu), rafael.mello@ufrpe.br (R. Ferreira Mello), dragan.gasevic@monash.edu (D. Gašević).

https://doi.org/10.1016/j.caeai.2021.100037
Received 26 April 2021; Received in revised form 23 August 2021; Accepted 10 October 2021
Available online 29 October 2021
2666-920X/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Asynchronous online discussions play a key role in aiding social interaction within educational environments, especially in higher education (Anderson & Dron, 2010). They promote course participation and collaboration, and even reduce drop-out rates (Rivard, 2013). Although online discussions have several benefits, it is crucial to provide tools and methods that assist instructors to monitor contributions and promote understanding of the process of knowledge (co-)construction in a group of students (Garrison, Cleveland-Innes, & Fung, 2010).

In this context, the Community of Inquiry (CoI) framework (Garrison et al., 1999) is a widely validated model that provides different dimensions to analyse the dynamic and social nature of modern online learning. CoI defines three presences that explain students' and instructors' social interactions in online learning environments. Within them, cognitive presence is a central construct to capture the development of the critical, in-depth thinking and argumentation skills of the students (Garrison et al., 2001). The CoI framework also defines coding schemes for the categorisation of students' discussion messages according to each presence. For instance, cognitive presence is operationalised using a four-phase inquiry process (Garrison et al., 2001): 1) Triggering event, where a problem or dilemma is identified and conceptualised; 2) Exploration, in which students explore and brainstorm potential solutions to the issue at hand; 3) Integration, during which students (co-)construct new knowledge by synthesising the existing information; and 4) Resolution, in which students evaluate the newly-created knowledge through hypothesis testing or vicarious application to the problem/dilemma that triggered the learning cycle. Although commonly used in research for content analysis, the manual coding process is labour-intensive, time-consuming, and not practical to perform on ongoing discussions to assist instructors in their teaching (Kovanović, Gašević, & Hatala, 2014).
Several automated and semi-automated methods have been proposed in the recent literature (Barbosa et al., 2020; Ferreira et al., 2018; Kovanović et al., 2014b, 2016; Neto et al., 2018; Rolim et al., 2019; Waters et al., 2015). While early approaches applied a combination of word- and phrase-count features with black-box machine learning algorithms (i.e., Neural Networks and Support Vector Machines) (Kovanović, Joksimović, et al., 2014; Mcklin, 2004), recent studies proposed the adoption of features based on psychological processes, writing cohesiveness, and discussion structure, combined with decision tree algorithms (Barbosa et al., 2020; Kovanović, Joksimović, et al., 2014; Neto et al., 2018). The main advantages of the second approach are the description of important indicators of the different cognitive presence phases and the improvement in the generalisability of the created classifiers (Kovanović et al., 2016).

Advances in natural language processing have led to the development of more accurate text classification techniques based on deep neural networks (Lai et al., 2015). The main idea of these methods is that the classifier tends to reach better results, even for different contexts, as the database grows. Moreover, explainable artificial intelligence is a recent method for unpacking the results of deep neural network classifiers (Miller, 2019; Samek & Müller, 2019). The combination of these approaches could provide alternative analyses not explored before in terms of the analysis of CoI constructs.

Therefore, this paper proposes a new method for the automatic categorisation of online discussion messages using deep learning and explainable artificial intelligence. Our study analyses the efficiency and the generalisability of this approach in the context of cognitive presence. The results showed that this approach reached results comparable to the previous approaches reported in the literature (Barbosa et al., 2021; Kovanović et al., 2016; Neto et al., 2018). Moreover, the explainable artificial intelligence visualisations shine new light on the feature importance for this problem. The results and their implications are further discussed in this paper.

2. Background

2.1. The Community of Inquiry and cognitive presence

The Community of Inquiry (CoI) framework proposed by Garrison et al. (1999) is a social-constructivist model used to analyse the interactive educational experience in online learning environments. It describes educational engagement that occurs in a learning community, in which "a group of individuals who collaboratively engage in purposeful critical discourse and reflection to construct personal meaning and confirm mutual understanding" (Garrison & Anderson, 2011, p. 2). The CoI framework has been widely used and validated for multidisciplinary courses to understand the participants' engagement in online discussion forums (Garrison & Anderson, 2011; Garrison, Anderson, & Archer, 2010; Hu et al., 2020; Liu & Yang, 2014). This model defines three dimensions, called presences, to mould the learning experience: (i) Cognitive presence explains the critical reflection of the knowledge (re)construction and problem-solving processes in the learning community (Garrison et al., 2001); (ii) Social presence analyses the capacity of the participants to humanise their relationships during an online discussion. It considers how social interactions could be used to model the social climate within a group of learners (i.e., cohesion, affective, and interactive dimensions) (Rourke et al., 1999); (iii) Teaching presence concerns the teaching role before (i.e., course design) and during (i.e., facilitation and direct instruction) the course (Anderson et al., 2001).

Although all three presences contribute to the understanding of the learning experience, cognitive presence is the primary element of the CoI model in terms of the operationalisation of the practical inquiry cycle of social-constructivist learning within online environments (Garrison et al., 2001). Within cognitive presence, this cycle is represented in four phases:

1. Triggering event initiates the cycle of critical inquiry. In this phase, a student or the instructor proposes a problem or dilemma to trigger the discussion. It could be a problem statement or a question asked by a student;
2. Exploration involves an exchange of findings and reflections on initial ideas. Based on the interactions, the participants propose possible solutions for the given problem or explanations for the questions;
3. Integration is the phase in which the participants synthesise coherent solutions and knowledge constructed on the ideas and the information shared in the previous phases. In other words, the participants construct structured solutions to the given problem/question by integrating various ideas about the solution;
4. Resolution is the phase in which the participants assess the newly-constructed knowledge through hypothesis testing or real-world application in relation to the problem/dilemma that initially triggered the discussion.

In this study, we focus on the analysis of cognitive presence since it is the 'primary issue' of students' learning evidence to be explored before the other dimensions (Rourke & Kanuka, 2009). The other two presences of the CoI will be investigated in our future research. Thus, the discussion messages that do not fall into any of the above phases are classified as Other. Also, the discussion messages that reveal indicators of two phases are classified into the higher one, following the rule of coding up. Moreover, examples of the student discussion messages explored in this study are provided in Table 1 for a better understanding of each cognitive phase.

2.2. Content analysis of online discussions with CoI

Quantitative Content Analysis (QCA) (Rourke et al., 2001) is broadly applied to assess social and cognitive presences by assigning the units of the online discussion transcripts to a class and counting the number of units in each class (Bauer, 2007). The CoI model defines three QCA coding schemes, one for each presence. Although the CoI model has been extensively adopted in education research, content analysis has essentially been employed for retrospective analysis, after the courses are finished, without much impact on the actual student learning and outcomes (Strijbos, 2011).

In this regard, the recent literature reports several studies on the adoption of automated methods (i.e., machine learning techniques) to assess CoI presences, intending to drive instructional interventions and enhance student learning outcomes (Kovanović, Gašević, & Hatala, 2014; Rolim et al., 2019). Initially, the methods focused on the use of combinations of traditional text mining features (i.e., bag-of-words vectors) and classification algorithms. For instance, Mcklin (2004) and Kovanović, Joksimović, et al. (2014) proposed the application of bag-of-words (n-gram) features and supervised algorithms based on Neural Networks and Support Vector Machines (SVMs), respectively, to automatically classify online discussion messages into the phases of cognitive presence. The methods reached Cohen's κ of 0.31 (Mcklin, 2004) and 0.41 (Kovanović, Joksimović, et al., 2014). Moreover, previous works also combined word and document embeddings with LIWC and contextual features to categorise online discussion messages according to the cognitive presence phases (Hayati et al., 2019). Although this study used embeddings to enhance the feature vectors, the paper adopted black-box algorithms in the analysis, such as SVM. The authors did not provide details about the number of messages in the evaluation data, but the best classifier proposed reached Cohen's κ of 0.68.

The shortcomings of the initial approaches (bag-of-words and black-box classification algorithms) are the lack of: (i) generalisability, implying that the creation of classification models is dependent on the domain, and (ii) interpretability, which does not provide information on

why a specific class was chosen for each message (Kovanović et al., 2016). To overcome these problems, the subsequent methods reported in the literature used domain-independent features and decision trees, especially the random forests algorithm. In this direction, Kovanović et al. (2016), Neto et al. (2018), and Barbosa et al. (2020) adopted random forests and features based on linguistic resources (Coh-Metrix (McNamara et al., 2014) and LIWC (Tausczik & Pennebaker, 2010)), latent semantic analysis, named entity recognition, and discussion structures (Waters et al., 2015) to identify the phases of cognitive presence. These studies reached Cohen's κ of 0.63 (Kovanović et al., 2016), 0.72 (Neto et al., 2018), and 0.53 (Barbosa et al., 2020) when applied to discussion messages written in English, in Portuguese, and across these two languages, respectively.

Table 1
Examples of student discussion messages for each phase of cognitive presence from two courses.

Other (Software course): "Hi XXX I really appreciate your suggestions. I will keep that in mind for future presentations."
Other (Philosophy MOOC): "An interesting start and I think I'm going to enjoy the next 8 weeks."
Triggering event (Software course): "I noticed you defined organizational model related more on process actors do you think any other important factors can influence the organizational model study as well?"
Triggering event (Philosophy MOOC): "Have you managed to argue the above (that there is no argument) in standard form?"
Exploration (Software course): "From my understanding TagViewer transcribes audio and links it to video. Is there future work for the TagViewer to interpret changes in facial responses for example? Even though body language doesn't have associated audio it does say quite a bit and it would be nice to transcribe that as well."
Exploration (Philosophy MOOC): "It's not a disagreement but discussion! Technically I think the words have different meanings but in a religious context they also have a connection so maybe we're both right!"
Integration (Software course): "… I've liked using iterative processes as you describe however I don't believe we should ignore process improvement tools that can be integrated into the process to build better software. I believe the patterns and questionnaires described by the author's are some of these tools. If we consider usability what your are describing was almost 70% less effective than using the authors patterns …"
Integration (Philosophy MOOC): "… I think it was legitimate as he was acting in the course of his duties. Your professional role applies constraints which would not apply if you were discussing matters as an individual; therefore, in the example, the appeal to authority is appropriate …"
Resolution (Software course): "… Agile is by its definition supposed to be flexible and dynamic in nature. As a result I find it ironic that there would be a sticking point on literature Vs real world. I would think the approach should be flexible based on the client and developer skill sets. Software development is a complex human interaction regardless of the project size …"
Resolution (Philosophy MOOC): "… Another way to test it would be to see if similar positions eg heads of industry are also held by more left-handed people than statistics would suggest. It would be incredibly difficult to iron out other possible factors - all the other things that help to make people successful, like good schooling, access to health care and a stable family environment, but if it were possible, could we scientifically show that being left-handed is a factor? I'm assuming that if we could we would scientifically show that it's pure coincidence …"

2.3. Deep learning and explainable artificial intelligence

Deep learning algorithms have gained notoriety due to the outstanding results achieved in computer vision tasks (Russakovsky et al., 2015), but their application rapidly spread to other domains, such as Natural Language Processing (NLP) (Young et al., 2018). These algorithms rely on the idea that the large number of connections generated in deep neural networks could simulate the same learning process as the human brain (Goodfellow et al., 2016).

In terms of NLP, deep learning has played an increasing role in tasks related to topic modelling/classification (ALRashdi & O'Keefe, 2019) and machine translation (Yang et al., 2020). The Convolutional Neural Network (CNN), as a class of deep neural networks, has been commonly used for text classification tasks, especially for small text documents such as Twitter and online discussion posts (Lai et al., 2015; Wei et al., 2017). Some promising studies demonstrated that CNN outperforms traditional machine learning algorithms in this context. For instance, Kamath et al. (2018) compared the results of a deep learning architecture against random forests, SVM, Logistic Regression and a Multilayer Perceptron (MLP) in two text classification tasks. The final results showed that the deep learning algorithm achieved results up to 50% better than the other ones. Similarly, Guo et al. (2019) compared different architectures of CNN with Adaboost, a state-of-the-art decision tree algorithm, with the goal of identifying urgency in online discussion messages (Guo et al., 2019). In this context, CNN outperformed Adaboost by up to 14%.

Despite the promising results, the inability to explain the functionality or the reasons behind the prediction output of deep learning algorithms is one of the core barriers that Artificial Intelligence is encountering (Barredo Arrieta et al., 2020). It is also one of the main reasons why white-box algorithms (e.g. random forests) can be preferable in educational research, notably in the previous studies of automatic classifiers for cognitive presence (Barbosa et al., 2020; Kovanović et al., 2016; Neto et al., 2018), compared to deep learning neural networks. However, the field of eXplainable Artificial Intelligence (XAI) aims to create machine learning techniques that enable human audiences to easily understand, reasonably clarify, and appropriately trust AI methods and implementations (Gunning, 2017). XAI can potentially complement the adoption of deep learning algorithms since it unpacks the model decisions. In learning analytics, the explainability of an automatic analysis approach and its products is also a significant goal (Gašević et al., 2017). It should provide scholars and educators with insights into the research questions, for example, the indicators of the different cognitive phases revealed in the online discussion messages in our study.

3. Research questions

As discussed in Section 2.1, cognitive presence has a crucial role in analysing knowledge construction and critical thinking in online learning environments. It assesses "the extent to which the participants in any particular configuration of a community of inquiry are able to construct meaning through sustained communication" (Garrison et al., 1999, p. 89). Although several studies proposed automatic methods for classifying discussion transcripts according to the phases of cognitive presence, to our knowledge, no publication has applied deep learning techniques to the automatic content analysis of cognitive presence. Hence, our first research question is:

Research Question 1 (RQ1):

To what extent can deep learning methods accurately classify online discussion messages according to the phases of cognitive presence? Can deep learning outperform traditional machine learning techniques for this task?

One of the main limitations reported in the literature is the lack of generalisability of the existing classifiers of cognitive presence (Barbosa et al., 2020; Kovanović et al., 2016; Neto et al., 2018). As deep learning methods are based on the content of the text, this could lead to overfitting of the generated model. Therefore, our second research question is:


Research Question 2 (RQ2):

To what extent does the use of deep learning methods affect the generalisability of a classifier for cognitive presence?

Finally, Section 2.2 highlights the need not only to provide an accurate classification of discussion messages but also to explain why messages were classified into one of the cognitive presence phases. Traditionally, deep learning methods are categorised as black-box algorithms. However, eXplainable Artificial Intelligence (XAI) (Miller, 2019; Samek & Müller, 2019) is a method to unpack the deep learning model with visual representations of the output. This is a novel approach that had not been adopted in this context before. We hypothesise that XAI could provide additional information on the development of cognitive presence phases. Thus, our last research question is:

Research Question 3 (RQ3):

What are the important features revealed by deep learning with XAI for identifying cognitive presence phases in online discussions? What are the differences between the information provided by deep learning with XAI and previous machine learning methods used in the classification of cognitive presence?

4. Methods

4.1. Data description

In this study, we utilised two data sets coming from the online discussion forums of two different courses. The software engineering data set is from a fully online, masters-level, for-credit course, which was used in the previous automatic classifier studies (Barbosa et al., 2020; Kovanović et al., 2016). The Logical and Critical Thinking data set is from an introductory Philosophy MOOC, which was used to validate the classification rubric of cognitive presence in the MOOC setting (Hu et al., 2020).

4.1.1. The software engineering course data set
The software engineering course data set (SoftwareSet) consisted of 1747 discussion messages (in both of our data sets, an individual message refers to a starting post or one of its replies in a discussion thread) generated by 81 students from six offerings of the course between 2008 and 2011, on the Moodle LMS, at a Canadian public university. It was a research-intensive course that contained 14 different topics in software engineering, such as software requirements, design, and maintenance. The students were asked to share a video presentation about a course-related research paper, post a new thread in the discussion forum, and invite other students to engage in the debate about the presentation. The students' participation in the threads accounted for 15% of the final course mark.

All 1747 messages were manually classified by two expert coders (98.1% agreement, Cohen's κ of 0.974, a widely used measure of inter-rater agreement when two coders are involved in the process (Ben-David, 2008; Cohen, 1960)) into the five phases of cognitive presence according to Garrison et al.'s coding scheme (Garrison et al., 1999). Both coders are experts in educational technology research with extensive experience with the CoI framework. Moreover, one of them has expertise in the course subject (software engineering), and the other has a background in psychology and education. The disagreements (32 messages) were resolved through discussions between the coders. The third column in Table 2 illustrates the distribution of the cognitive presence phases in the SoftwareSet.

Table 2
Distribution of cognitive presence phases for both data sets.

ID | Phase | SoftwareSet messages (%) | MOOCSet messages (%) | Total (%)
1 | Other | 140 (8.01%) | 85 (5.75%) | 225 (6.97%)
2 | Triggering event | 308 (17.63%) | 279 (18.86%) | 587 (18.20%)
3 | Exploration | 684 (39.15%) | 835 (56.46%) | 1519 (47.09%)
4 | Integration | 508 (29.08%) | 244 (16.50%) | 752 (23.31%)
5 | Resolution | 107 (6.12%) | 36 (2.43%) | 143 (4.43%)
| Total | 1747 (100%) | 1479 (100%) | 3226 (100%)

4.1.2. The logical and critical thinking MOOC data set
The logical and critical thinking MOOC data set (MOOCSet) was from an archived offering of the Philosophy MOOC, which was designed and taught by a course-design team at a New Zealand university on the FutureLearn platform. This course focused on the basic concepts of logical and critical thinking, how to build good arguments, and how to link the concepts with daily life. The MOOCSet consists of 1917 discussion messages that were randomly selected from the forums under the eight weekly topics of the archived offering. The offering had 12,311 messages in total, generated by 1000 registered learners.

The 1917 messages from the LCT MOOC were manually classified by three trained coders (77.15% agreement, Fleiss's κ of 0.763, a generalisation of Cohen's κ applied when more than two coders participate in the annotation process (Falotico & Quatto, 2015)) based on a revised classification rubric of cognitive presence for the MOOC settings (Hu et al., 2020). This study uses only the messages classified into the same category by all three coders (1479 messages). The fourth column in Table 2 displays the distribution of the cognitive presence phases in the MOOCSet. The three coders were post-graduate students from the Philosophy Department, who were also the teaching assistants of the LCT MOOC.

4.1.3. Comparison between the two data sets
The discussion messages in the two data sets share some similarities. One pattern is that the distribution of messages across the four phases of cognitive presence and the messages coded as Other has a similar trend. For instance, the messages classified into Exploration accounted for the largest proportion, and Resolution for the least, in both data sets. Also, the messages coded as Other, which are called Non-cognitive presence in the study of the LCT MOOC set, had the second lowest proportion. Another similarity between the two data sets is that all the messages were contributed by the learners without the instructors' participation. The learners in both courses raised threads in terms of questions and comments around different topics in the videos and articles.

Some differences are also revealed between the two data sets. Firstly, they come from two disciplines (i.e., software engineering and philosophy); therefore, the vocabularies utilised are distinct. Secondly, the types of the two courses from which the messages originate were different. The software engineering course was a small-scale, for-credit university course, whereas the LCT MOOC was an open and free course with the participation of a large number of learners. The learners' demographics and motivations in the MOOC are diverse, which is distinct from the for-credit, university courses.

Moreover, the details of the distribution across the four phases of cognitive presence and Other messages were not exactly the same, although the trend was similar. We used the Mann-Whitney U test to investigate whether the differences in the distributions were significant between the two data sets (McKnight & Najab, 2010). The result (p < .001, U = 1,146,776, n1 = 1747, n2 = 1479, two-sided hypothesis) indicated that the distributions of the four phases of cognitive presence were statistically different between the data sets. We are aware that these differences could affect the outcomes of our experiments, most notably the between-course generalisability of the automatic classifier for cognitive presence.
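The distribution comparison above can be reproduced from the message counts in Table 2. The following sketch is our illustration, not the authors' published code: it expands the Table 2 frequencies into ordinal phase-ID vectors (whether the Other category was included in the original test is not stated; the sketch includes all five IDs) and runs the two-sided Mann-Whitney U test with scipy.

```python
# Sketch of the Mann-Whitney U test in Section 4.1.3, reconstructed from the
# message counts in Table 2; variable names are ours, not the authors'.
import numpy as np
from scipy.stats import mannwhitneyu

# Phase IDs 1-5, each repeated by its frequency in the data set (Table 2).
software_labels = np.repeat([1, 2, 3, 4, 5], [140, 308, 684, 508, 107])
mooc_labels = np.repeat([1, 2, 3, 4, 5], [85, 279, 835, 244, 36])

# Two-sided test on the two ordinal label distributions.
u_stat, p_value = mannwhitneyu(software_labels, mooc_labels,
                               alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")  # the paper reports U = 1,146,776
```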


4.2. Deep learning architectures

In this study, we used an Inception CNN of naïve form (Szegedy et al., 2015), with pre-trained word-embedding models as word representations, to build a classifier for analysing the phases of cognitive presence. The architecture of this approach is explained in Section 4.2.2.

4.2.1. Word embeddings
Word embedding is an NLP technique that uses continuous real-number vectors to represent texts (i.e. words, phrases, sentences and documents). These vector representations are learned from very large corpora and are composed of meaningful features associated with positions in a high-dimensional space. The main idea of word embeddings is that semantically similar words or phrases are located very close to each other in the space, which also means they have similar representation vectors. We used the GloVe pre-trained model (Pennington et al., 2014) to convert the text messages into embedding vectors as the inputs for our deep learning neural networks (the 100-dimension pre-trained vectors were used in our experiments; https://nlp.stanford.edu/projects/glove/). For the out-of-vocabulary words (77 words) that GloVe did not contain, we randomly initialised their vectors, which is a common approach in the literature. During the data pre-processing, we removed the punctuation and numbers in each sentence, and also performed case-folding and lemmatisation. The maximum length of tokens across messages was 1939, which covered the complete information of all the messages from both data sets. A matrix of shape D × L × N was generated as the input to our neural networks, in which D denotes the word embedding dimension, L the maximum length of tokens, and N the number of messages for training.
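A minimal sketch of this input construction is shown below, assuming the GloVe 6B 100-d vectors are on disk and a Keras tokenizer is used for word indexing; the file path, the example messages and all variable names are illustrative, not the authors' exact pipeline.

```python
# Sketch of the embedding-matrix construction in Section 4.2.1 (assumptions:
# GloVe 6B 100-d file on disk; `messages` holds pre-processed posts).
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

EMBED_DIM, MAX_LEN = 100, 1939               # D and L from the paper

messages = ["we explore possible solutions", "the argument is valid"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(messages)
sequences = pad_sequences(tokenizer.texts_to_sequences(messages),
                          maxlen=MAX_LEN)    # shape: (N, L)

# Load the GloVe vectors into a {word: vector} lookup.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as fh:
    for line in fh:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Build the embedding matrix; out-of-vocabulary words keep random vectors,
# as described in the text.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.random.normal(
    size=(vocab_size, EMBED_DIM)).astype("float32")
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]
```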
4.4. Evaluation metrics
4.2.2. Convolutional neural networks
The architecture of the naïve-form Inception CNN (Szegedy et al., We used the metrics of accuracy, Cohen’s κ, F1 score, weighted-aver­
2015) used in this study was based on an architecture3 applied in the aged F1 score, and error rate of each class to measure the performance of
classification of discussions on the Quora platform with high perfor­ our CNN and RF classifiers. Accuracy score and Cohen’s κ are commonly
mance. Fig. 1 depicts the details of the architecture. It consists of a used to evaluate the performance of a supervised machine-learning
spatial dropout layer, five convolution layers with max poolings, a batch classifier (i.e., input variables have been pre-classified), in educational
normalisation process, and a fully-connected layer. Initially, the word research (Barbosa et al., 2020; Kovanović et al., 2014b, 2016; Neto et al.,
embedding matrix, as the vector representation of the input messages, is 2018; Wang et al., 2015; Waters et al., 2015). Accuracy score is defined as
fed into the spatial dropout neuron with five different filters. Instead of the number of true positives (TPc) plus the number of false positives
feeding the matrix into the convolution layers directly, the use of the (FPc) over the sum of all the true and false predictions for each class c.
spatial dropout can be effective for preventing neural works from ∑ ∑
TP + FP
over-fitting (Lee & Lee, 2020). After the dropout layer, the five convo­ accuracy = ∑ ∑ c c ∑c c ∑ (3)
lution layers in one dimension operate the five matrices produced from c TPc + c FPc + c NPc + c FNc

the dropout layer separately. Then, five max pooling layers retain the Cohen’s κ (Cohen, 1960) was first introduced to measure the reli­
important information from the convolution layers but reduce the ability between two human coders on the classification tasks, lessening
dimensionality, respectively. After concatenation of the five layers, a the agreement caused by chance. It can also be used to evaluate the
new matrix is produced. Next, the new matrix passes through the batch degree of agreement between the actual and predicted labels in machine
normalisation for improving training speed and validation accuracy learning tasks. Cohen’s κ is defined as
before the fully-connected layer computing the prediction probability
P0 − Pe
for each class. Finally, the network outputs one of the five phases of κ= (4)
1 − Pe
cognitive presence using the activation function of softmax. Table 3
demonstrates the values of the optimal hyper-parameters implemented where P0 is the agreement probability of all classes. Pc is the agreement
for all the experiments. probability by chance, which is computed as the columns and rows
marginal probabilities in the confusion matrix (Ben-David, 2008).
4.3. Explainable artificial intelligence Macro- and micro-averaged scores are often adopted to compute the
overall performance of multi-class classification problems.
We are interested in the relevance scores between the input variables Macro-averaged method considers all classes as equally important,
and the identified categories on word level. In other words, we wonder whereas micro-averaged regards the different contributions of each
which words in the messages are important for each cognitive phase, to class, valuing the minority class the least (Van Asch, 2013).
Macro-Averaged Precision (macro AP) is defined as the mean value of
precision score (Pc) of each class c.
2
The 100-dimension pre-trained vectors were used in our experiments. http
s://nlp.stanford.edu/projects/glove/.
3
https://www.kaggle.com/christofhenkel/inceptioncnn-with-flip.

5
Y. Hu et al. Computers and Education: Artificial Intelligence 2 (2021) 100037

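A small worked example of Eqs. (3) and (4), on an illustrative two-class confusion matrix (not the study's results):

```python
# Worked example of observed agreement P0, chance agreement Pe and kappa.
import numpy as np

cm = np.array([[40, 5], [10, 45]])        # rows: actual, columns: predicted
n = cm.sum()

p0 = np.trace(cm) / n                                  # P0 = 0.85
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # Pe = 0.50
kappa = (p0 - pe) / (1 - pe)                           # Eq. (4): 0.70
print(f"P0 = {p0:.3f}, Pe = {pe:.3f}, kappa = {kappa:.3f}")
```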
Fig. 1. Architecture of the naïve-form Inception CNN used in this study.

Table 3
The optimal hyper-parameters for all experiments in this study.

Hyper-parameter | Value
Kernel sizes | (1, 2, 3, 5, 7)
Number of filters | 36
Hidden size | 128
Batch size | 36
Epochs | 300
Optimizer | Adam
Learning rate | 0.005

Macro-averaged precision (macro AP) is defined as the mean value of the precision score (Pc) of each class c:

\[ \text{macroAP} = \frac{\sum_c P_c}{\text{number of classes}} \]  (5)

Also, macro-averaged recall (macro AR) is defined as the mean value of the recall score (Rc) of each class c:

\[ \text{macroAR} = \frac{\sum_c R_c}{\text{number of classes}} \]  (6)

The macro-averaged F1 score is defined as the harmonic mean of macroAP and macroAR:

\[ \text{macroF1} = 2 \times \frac{\text{macroAP} \times \text{macroAR}}{\text{macroAP} + \text{macroAR}} \]  (7)

We also used the weighted-averaged F1 score, which has been applied in text classification tasks in recent years (Chakravarthi et al., 2020), to address the class imbalance in our data set, especially the impact of the minority instances (e.g., the Resolution phase). The weighted-averaged F1 score (see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) is defined as the harmonic mean of the precision and recall scores (Pc, Rc) weighted by the proportion of true instances of each class (Wc):

\[ \text{weightedAP} = \sum_c P_c \times W_c \]  (8)

\[ \text{weightedAR} = \sum_c R_c \times W_c \]  (9)

\[ \text{weightedF1} = 2 \times \frac{\text{weightedAP} \times \text{weightedAR}}{\text{weightedAP} + \text{weightedAR}} \]  (10)

To evaluate the class-level performance, the error rate was used with the confusion matrices, defined as the number of false positives (FPc) over the number of true positives (TPc) plus false positives (FPc) for each class:

\[ \text{error rate of class } c = \frac{FP_c}{FP_c + TP_c} \]  (11)
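The metrics in Eqs. (5)-(11) are available in (or easily derived from) scikit-learn, as the footnoted URL suggests. A minimal sketch with illustrative labels (not study data):

```python
# Sketch of the evaluation metrics in Eqs. (5)-(11) with scikit-learn.
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

y_true = [1, 2, 3, 3, 4, 5, 3, 2, 4, 3]   # illustrative phase labels
y_pred = [1, 3, 3, 3, 4, 5, 3, 2, 2, 3]

print("kappa      :", cohen_kappa_score(y_true, y_pred))
print("macro F1   :", f1_score(y_true, y_pred, average="macro"))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))

# Per-class error rate, Eq. (11): FP_c / (FP_c + TP_c) per predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
tp = cm.diagonal()
fp = cm.sum(axis=0) - tp                  # column sums minus the diagonal
error_rate = fp / (fp + tp + 1e-12)       # small epsilon avoids 0/0
print("error rates:", error_rate.round(3))
```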
4.5. Data processing

To address research question 1, we trained two classifiers based on the SoftwareSet (Classifier 1) and the MOOCSet (Classifier 2), respectively. The 10-fold cross-validation (CV) method was adopted to randomly divide the entire sample data into 10 non-overlapping folds of approximately equal size. The k-fold CV approach is commonly used in machine learning tasks to reduce the risk of overfitting, aiming for more valid results. In every CV loop, the combination of nine folds formed the training set, and the remaining fold was the validation set. Moreover, we repeated the training process with 10-fold CV six times with different seeds. Thus, sixty weight files with performance outcomes were generated in order to select the best classifier for identifying the cognitive presence phases on each course data set (SoftwareSet or MOOCSet). The distribution of training and test data in the experiments for research question 1 can be seen in Table 4.

To address research question 2, we used the best weights of the classifier trained on one data set to validate on the other one. For example, Classifier 1 was applied to test on the entire MOOCSet (1479 messages), and Classifier 2 on the entire SoftwareSet (1747 messages). Moreover, we trained three new CNN classifiers on the concatenation of the two data sets, and evaluated the best models on a held-out part of the SoftwareSet (Classifier 3), the MOOCSet (Classifier 4) and both (Classifier 5). For Classifier 3, the SoftwareSet was split into training and test parts in the ratio of 9:1 using a stratified sampling method. Splitting before the data concatenation ensured that the same distribution of instances across the five categories of cognitive presence was contained in the training and test sets. After combining the training part (90% of 1747 messages) of the SoftwareSet with the entire MOOCSet, we performed the same training process as in the experiments for research question 1. The model weights with the best performing outcomes were selected to validate on the test part of the SoftwareSet (10% of 1747 messages). Similarly, the MOOCSet was split before the concatenation and training process in the experiment of Classifier 4. In the final experiment, for Classifier 5, we split both the SoftwareSet and the MOOCSet in the ratio 9:1, and then combined the two 90% parts as a new training set, and the two 10% parts as a test set. The details of the training-test splits of the experiments for research question 2 are also listed in Table 4.

Table 4
Distribution of training and test data in the experiments.

RQ | Experiment | Classifier | Training size | Test size
1 | Train & validate on SoftwareSet | 1 | 1747 (90%) | 1747 (10%)
1 | Train & validate on MOOCSet | 2 | 1479 (90%) | 1479 (10%)
2 | Validate Classifier 1 on MOOCSet | 1 | 1747 (90%) | 1479 (100%)
2 | Validate Classifier 2 on SoftwareSet | 2 | 1479 (90%) | 1747 (100%)
2 | Train on both but validate on SoftwareSet | 3 | 3051 | 1747 (10%)
2 | Train on both but validate on MOOCSet | 4 | 3078 | 1479 (10%)
2 | Train & validate on both | 5 | 2903 | 323 (10% of each set)
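A minimal sketch of the stratified 9:1 split and cross-set concatenation for Classifier 3's setting, assuming `software_X/software_y` and `mooc_X/mooc_y` hold the embedded messages and phase labels; the feature arrays are placeholders and all names are illustrative.

```python
# Sketch of the stratified split-then-concatenate procedure in Section 4.5.
import numpy as np
from sklearn.model_selection import train_test_split

software_X = np.random.rand(1747, 10)     # placeholder for real features
software_y = np.repeat([1, 2, 3, 4, 5], [140, 308, 684, 508, 107])
mooc_X = np.random.rand(1479, 10)
mooc_y = np.repeat([1, 2, 3, 4, 5], [85, 279, 835, 244, 36])

# Stratified 9:1 split of the SoftwareSet BEFORE concatenation, so the
# training and test parts keep the same phase distribution.
X_tr, X_te, y_tr, y_te = train_test_split(
    software_X, software_y, test_size=0.1, stratify=software_y,
    random_state=42)

# Train on 90% of SoftwareSet plus the entire MOOCSet; test on the held-out 10%.
train_X = np.concatenate([X_tr, mooc_X])
train_y = np.concatenate([y_tr, mooc_y])
print(train_X.shape, X_te.shape)          # (3051, 10) (175, 10), cf. Table 4
```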


Finally, to address research question 3, we generated several XAI heatmaps to visualise the word-level relevance to every cognitive presence phase using the SmoothGrad sensitivity analysis method. The relevance scores were obtained from the best performing weights and the training data of the experiments.

The source code of the experiments in this study is publicly available in the github.com/yuanhu19/CognitivePresence_Cnn_XAI repository.
4.6. Random forests classifiers

In order to compare with the proposed method, we evaluated the Random Forests (RF) approach derived from the state-of-the-art model for the classification of cognitive presence phases by Kovanović et al. (2016), Neto et al. (2018) and Farrow et al. (2019). We used 199 classification features extracted from the textual information in the two course data sets: 106 features from Coh-Metrix (McNamara et al., 2014) and 93 features from LIWC (Linguistic Inquiry and Word Count) (Tausczik & Pennebaker, 2010) for the English language (Kovanović et al., 2016). In order to maintain the consistency of the classification features with our CNN classifiers, the contextual features, such as the message depth and the number of replies, were excluded from the classification features in the RF classifiers. Similarly, the class rebalancing method was not included. The sample data was also pre-processed to remove numbers and to perform case-folding and lemmatisation in all of the RF classifier approaches.

After the pre-processing and feature extraction, the training-test splits in the RF classifiers were exactly the same as for the CNN classifiers of each experiment (as shown in Table 4). The 10-fold CV was used to select the best performing RF models by fine-tuning two important parameters: ntree (i.e. the number of decision trees built) and mtry (i.e. the number of classification features randomly selected by each tree). To select the optimal ntree value, we examined values from 100 to 1300, sampled at intervals of 100. For each ntree value, a 10-fold CV was used to evaluate 30 different mtry values, randomly selected from 1 to 199. The optimal RF classifiers were created with the best mtry and ntree. In the RF classifiers for research question 1, a final 10-fold CV was conducted with the entire sample data to report the best performance and SD values. In the RF classifiers for research question 2, we repeated the training-test loop 10 times with the optimal RF classifiers, as a consistent approach to compare with our CNN classifiers. The optimal parameters in the experiments were: (i) ntree of 900 and mtry of 179 in RF classifier 1, (ii) ntree of 700 and mtry of 118 in RF classifier 2, and (iii) ntree of 500 and mtry of 160 in RF classifiers 3, 4 and 5.
wareSet had lower error rates on the prediction of Other and Exploration
5. Results

5.1. Evaluation for individual data set – RQ1

This subsection reports the performance of the CNN classifiers and RF classifiers trained and tested on the individual course data sets (SoftwareSet or LCT MOOCSet), answering the first research question. Table 5 displays the outcome metrics of both the CNN classifiers and the RF classifiers. In the experiment on the SoftwareSet, the best performing CNN Classifier 1 achieved accuracy of 0.632 (SD = 0.03), Cohen's κ of 0.475 (SD = 0.05), weighted F1 score of 0.649 (SD = 0.03), and macro F1 score of 0.541 (SD = 0.04). The CNN Classifier 2, which was trained and tested on the LCT MOOCSet, showed an overall better performance than Classifier 1 except for the macro F1 score. Classifier 2 reached an accuracy of 0.707 (SD = 0.03), Cohen's κ of 0.528 (SD = 0.06), weighted F1 score of 0.723 (SD = 0.03) and macro F1 score of 0.467 (SD = 0.06) in the best case. We can see that the Cohen's κ coefficient values obtained by both classifiers fell in the 'moderate' agreement level (Landis & Koch, 1977). Also, the confusion matrices for both CNN experiments are shown in Figs. 2 and 3. They demonstrate that Classifier 1 (SoftwareSet) had the lowest error rate in the Integration phase, whereas Classifier 2 (MOOCSet) performed best in the Triggering event phase. Besides, the Exploration phase revealed the second lowest error rate, and the Resolution phase obtained the highest (i.e. greater than 0.9) in the CNN classifiers for both course data sets.

Table 5
Outcomes of the CNN classifiers for the individual course sets compared to the RF classifiers (RFs).

Experiment | Accuracy (SD) | Kappa (SD) | Weighted F1 (SD) | Macro F1 (SD)
Classifier 1: Train & test on the SoftwareSet (CNN) | 0.632 (0.03) | 0.475 (0.05) | 0.649 (0.03) | 0.541 (0.04)
Classifier 2: Train & test on the LCT MOOCSet (CNN) | 0.707 (0.03) | 0.528 (0.06) | 0.723 (0.03) | 0.467 (0.06)
RF classifier 1 for RQ1: Train & test on the SoftwareSet (RF) | 0.642 (0.04) | 0.479 (0.05) | 0.667 (0.05) | 0.530 (0.03)
RF classifier 2 for RQ1: Train & test on the LCT MOOCSet (RF) | 0.720 (0.04) | 0.488 (0.08) | 0.751 (0.05) | 0.487 (0.06)

Compared to the performance of the RF classifiers, CNN Classifier 1 (SoftwareSet) reached similar accuracy, Cohen's κ and macro F1 scores, with minor differences (lower than 0.011). CNN Classifier 2 (MOOCSet) displayed a better Cohen's κ and slightly lower accuracy and macro F1 scores than RF classifier 2, again with minor differences (lower than 0.025). We can see that the weighted F1 scores decreased in both of the CNN classifiers.

5.2. Evaluation across different data sets – RQ2

This section reports the performance of the CNN classifiers that were trained on one course data set but validated on the other, answering the second research question. Table 6 displays the evaluation outcomes of the cross-course experiments. The weights of the best performing classifiers (1 and 2), as reported in Section 5.1, were applied in the first two experiments.

CNN Classifier 1 achieved accuracy of 0.416, Cohen's κ of 0.116, weighted F1 of 0.447 and macro F1 score of 0.190 when the training set was the SoftwareSet and the test set was the MOOCSet. The confusion matrix of this case (Fig. 4) illustrates that the classifier trained on the SoftwareSet had lower error rates on the prediction of Other and Exploration in the MOOCSet, and obtained higher error rates on the other three phases. The confusions indicate that many LCT MOOC messages of the Triggering-event and Exploration phases were misclassified into the Other category. Also, many Integration-phase messages from the MOOCSet were misclassified into the Exploration phase by Classifier 1 learned from the SoftwareSet.

CNN Classifier 2, trained on the MOOCSet and tested on the SoftwareSet, reached an accuracy of 0.421, Cohen's κ of 0.175, weighted F1 of 0.462 and macro F1 of 0.276. It obtained an overall higher performance than the previous experiment. The confusion matrix in Fig. 5 shows that the lowest error rate appeared in the prediction of the Integration phase in the SoftwareSet using the classifier trained on the MOOCSet. The error rates for identifying the other four categories were greater than 0.65, demonstrating a poor performance. Moreover, the majority of the misclassification occurred between the Exploration and Integration phases.


Fig. 2. Confusion Matrix by training and testing on SoftwareSet.

Fig. 3. Confusion Matrix by training and testing on LCT MOOCSet.

In both of the cross-course experiments, the Cohen's κ values were at a 'slight-agreement' level (Landis & Koch, 1977). Also, both of them reached a zero accuracy for the classification of the Resolution phase. Outcomes of the RF classifiers for both experiments are also listed in Table 6. The CNN classifier trained on the MOOCSet and tested on the SoftwareSet had a better performance than the RF classifier under the same condition, whereas the CNN classifier reached slightly lower outcomes than the RF approach the other way around (i.e., trained on the SoftwareSet and tested on the MOOCSet).

Table 6
Outcomes of the CNN classifiers for cross-course validation compared to the RF classifiers (RFs).

Experiment | Accuracy | Kappa | Weighted F1 | Macro F1
Classifier 1: Train on SoftwareSet but test on MOOCSet (CNN) | 0.416 | 0.116 | 0.447 | 0.190
Classifier 2: Train on MOOCSet but test on SoftwareSet (CNN) | 0.421 | 0.175 | 0.462 | 0.276
RF classifier 1 for RQ2: Train on SoftwareSet but test on MOOCSet (RF) | 0.454 | 0.188 | 0.453 | 0.262
RF classifier 2 for RQ2: Train on MOOCSet but test on SoftwareSet (RF) | 0.419 | 0.079 | 0.512 | 0.206

We also examined the performance of the CNN classifiers trained on the combination of the two course data sets in three experiments. Table 7 displays the outcomes of the CNN classifiers trained on the combined set and tested on a partial SoftwareSet, a partial MOOCSet and both, respectively. We can see that Classifier 4 performed better than Classifiers 3 and 5. Our CNN classifiers in the combined-course experiments had lower performance than the RF classifiers. Moreover, the overall results of the combined-course experiments surpassed the outcomes of the cross-course validations (Table 6), but were lower than the results of the individual-course experiments (Table 5).

5.3. Visualisation by the explainable AI – RQ3

The results of the XAI heatmaps were generated from the experiments of the best performing classifiers trained and tested on the individual course data sets, SoftwareSet or LCT MOOCSet. These two classifiers can demonstrate purer comparative outcomes of the word-level predictive indicators of the cognitive presence phases between the two data sets than the classifiers trained in the combined-course experiments. Due to the page limitation, we provide four messages classified into the Exploration (Fig. 6) or Integration (Fig. 7) phase from both data sets as examples of the XAI visualisations.


Table 7
Outcomes of the CNN classifiers for combined-course data set validation.

Experiment | Accuracy | Kappa | Weighted F1 | Macro F1
Classifier 3: Train on both but test on SoftwareSet (CNN) | 0.463 | 0.194 | 0.501 | 0.327
Classifier 4: Train on both but test on MOOCSet (CNN) | 0.628 | 0.359 | 0.649 | 0.404
Classifier 5: Train & test on both (CNN) | 0.582 | 0.366 | 0.603 | 0.398
RF classifier 3: Train on both but test on SoftwareSet (RF) | 0.528 | 0.281 | 0.550 | 0.419
RF classifier 4: Train on both but test on MOOCSet (RF) | 0.679 | 0.440 | 0.719 | 0.504
RF classifier 5: Train & test on both (RF) | 0.624 | 0.408 | 0.662 | 0.444

We can observe that the words constructs and accommodate were highly relevant to the classification of the Exploration phase in the SoftwareSet, followed by programming methodology and function as moderately positive (Fig. 6a). In the MOOCSet (Fig. 6b), the word conclusion revealed a high relevance for identifying the Exploration phase, and the words agree, example, even, premises and true showed moderate relevance. In comparison with the positive associations, the majority of the lowest-relevance words in both data sets were 'stop-words', which refers to the most commonly used words in a language, such as the, to, a, is in Fig. 6. Apart from the stop words, some nouns were highlighted as having low relevance to the Exploration phase in the SoftwareSet message (Fig. 6a), namely programmer and adaptation.

We also provide two sample messages classified into the Integration phase from the two data sets in Fig. 7. In the message from the SoftwareSet (Fig. 7a), the highly relevant words for the classification of the Integration phase included implementation, i, fold, and long, and the moderately relevant words included corresponding, exception, aspects, and methodology, while the low-relevance words were presents, actually, reports, and however, as well as the stop-words. Comparatively, in the MOOCSet message (Fig. 7b), the words that indicated strong relevance were correlation, between, associated, relation, lateralization, and functional, whereas the low-relevance words were divergent, necessarily, hemisphere(s), specifically, cite and so on.

6. Discussion

6.1. The performance comparison between our CNN classifiers and RF baselines – RQ1

The CNN classifiers presented in this study reached Cohen's κ of 0.475 and 0.528 for the SoftwareSet and the LCT MOOCSet, respectively, which demonstrates a moderate inter-rater agreement (Landis & Koch, 1977). These results were similar to the performance of the RF classifiers, which used random forests with LIWC and Coh-Metrix features, and to the outcomes of the experiments without over-sampling methods reported in the previous studies (Kovanović et al., 2016; Farrow et al., 2019; Barbosa et al., 2020). In addition, the CNN classifiers obtained a smaller error rate in the Integration phase (Figs. 4 and 5) compared to the previous works (Barbosa et al., 2020; Farrow et al., 2019). Conceptually, the Integration phase can be characterised by a knowledge construction process where the students formulate new concepts and solutions (Garrison et al., 2001). The error reduction in this phase could be related to the adoption of pre-trained word embeddings, which include an extensive vocabulary compared to the traditional machine learning methods (Pennington et al., 2014).

6.2. Generalisability of our CNN classifiers across different data sets – RQ2

Another important finding of this study concerns the analysis of the generalisability of a method based on textual features. While the literature suggests that the adoption of linguistic resources (i.e., LIWC and Coh-Metrix) and structural information (i.e., discussion context features) (Kovanović et al., 2016; Farrow et al., 2020; Barbosa et al., 2020) can increase the generalisability of a classifier, our results demonstrate that the use of the CNN classifiers in cross-course validations can reach similar results (Table 6). This finding implies that deep learning approaches with word embeddings have the potential to increase the generalisability if, in future work, we build neural networks that include the sequential structures of the discussion messages, such as Recurrent Neural Networks (Graves et al., 2013) and Transformers (Wolf et al., 2020). Also, the generalisability could be improved by using advanced pre-trained models (e.g. BERT (Devlin et al., 2018)) and fine-tuning them with corpora that contain the domain-specific vocabularies of the sample data sets.

Random forests classifiers achieved better results when the training step was performed on the combined data set (the SoftwareSet and LCT MOOCSet). A possible reason for this could be found in the fact that the optimal parameters (i.e. ntree and mtry) were fine-tuned separately in the RF classifiers for the individual-course and combined-course experiments. However, we used the same hyper-parameters, such as the number of filters, batch sizes and kernel sizes, in all of our CNN experiments to maintain consistency. This result gives us a hint that the hyper-parameters should be adjusted according to changes in the data set.

Fig. 4. Confusion Matrix with SoftwareSet as the training set and MOOCSet as the test set.

9
Y. Hu et al. Computers and Education: Artificial Intelligence 2 (2021) 100037

Fig. 5. Confusion Matrix with MOOCSet as the training set and SoftwareSet as the test set.

Fig. 6. Two XAI visualisations for the classification of Exploration phase in the two data sets. High relevance is mapped to red, low relevance to blue. The colour
opacity is normalised to the maximum relevance scores per message. The redder the word is, the more important it is to identify the phase. The bluer the word is, the
lower the relevance. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Interestingly, the first four highly relevant words we mentioned in the MOOCSet message (Fig. 7b), and the corresponding word in the SoftwareSet message (Fig. 7a), can be read as vocabulary for 'connecting ideas' and 'convergence', which are important indicators for identifying the Integration phase in the coding scheme of cognitive presence (Garrison et al., 2001; Hu et al., 2020). Besides, most of the high-relevance words in the examples of the Exploration phase (Figs. 6 and 7) were course-relevant concepts previously used in the discussions, which aligns with the definition of this phase (Garrison et al., 2001). These findings suggest that expressions which indicate relationships between different subjects, ideas or arguments in a discussion message can be significant features for identifying the Integration phase. Also, in shorter messages, the use of specific nouns or noun phrases that are core concepts delivered in the courses could be important indicators for identifying the Exploration phase.
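The relevance algorithm itself is not reproduced in this section, so the following sketch uses gradient × input saliency over the embedding layer as a simple, widely available stand-in for the relevance scores visualised in Figs. 6 and 7. The function word_relevance is hypothetical, and it assumes a model shaped like the CNN sketched earlier.

```python
import tensorflow as tf

def word_relevance(model, token_ids, target_phase):
    """Gradient x input saliency over the embedding layer - a simple,
    hypothetical stand-in for the relevance scores shown in Figs. 6 and 7."""
    emb = next(l for l in model.layers
               if isinstance(l, tf.keras.layers.Embedding))
    start = model.layers.index(emb) + 1
    embedded = emb(tf.constant([token_ids]))            # shape (1, len, dim)
    with tf.GradientTape() as tape:
        tape.watch(embedded)
        x = embedded
        for layer in model.layers[start:]:              # re-run the tail layers
            x = layer(x)
        score = x[0, target_phase]                      # probability of one phase
    grads = tape.gradient(score, embedded)
    relevance = tf.reduce_sum(grads * embedded, axis=-1)[0]  # one score per word
    # Normalise by the maximum absolute relevance per message, matching the
    # per-message colour-opacity normalisation used in the figures.
    return (relevance / (tf.reduce_max(tf.abs(relevance)) + 1e-9)).numpy()

# Usage with the CNN sketched earlier and a tokenised message:
# scores = word_relevance(model, token_ids, target_phase=2)  # 2 = Integration
```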
We also found that the important features provided by our deep learning with XAI method were word-level, which differs from what the previous works found via the Mean Decrease Gini (MDG) interpretation (Barbosa et al., 2020; Kovanović et al., 2016). The previous works proposed more abstract features, such as the degree of lexical diversity and the degree of meaningfulness of content words. To use them, educational researchers and course instructors have to seek, for each message, the expressions that match the definitions of those abstract features in the Coh-Metrix and LIWC tools, which is time-consuming and nontransparent. In comparison, the word-level explanations of XAI are directly perceivable and understandable. Nevertheless, the two explanation methods have some similarities. For instance, Kovanović et al. (2016) suggest that messages with higher semantic similarity to their preceding messages tend to be more relevant to the higher phases (i.e., Integration and Resolution). This observation connects with the results of our XAI visualisations, which show that expressions associating ideas in a message can be strong indicators of the Integration phase. Similarly, XAI could shine a light on the topic of a question asked during a discussion, since the number of question marks is a strong indicator of Triggering event messages according to previous studies (Barbosa et al., 2020; Kovanović et al., 2016). A divergence from the previous work is also evident: several important indicators proposed in the previous work were related to stop-words (i.e., the numbers of personal pronouns, prepositions, indefinite pronouns and third-person singular pronouns); however, the XAI highlighted many of them as low-relevance indicators.
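For contrast with the word-level scores above, the MDG ranking used in the random-forest studies can be recovered from scikit-learn, whose feature_importances_ attribute implements the mean decrease in Gini impurity. The feature matrix and names below are invented placeholders for LIWC/Coh-Metrix style features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented stand-ins for a LIWC/Coh-Metrix style design matrix: rows are
# messages, columns are the kinds of abstract features discussed above.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 4, size=200)        # four cognitive presence phases
feature_names = ["lexical_diversity", "meaningfulness_content_words",
                 "n_question_marks", "n_personal_pronouns"]

# n_estimators and max_features play the roles of ntree and mtry;
# feature_importances_ is the mean decrease in Gini impurity (MDG).
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>28s}: {score:.3f}")
```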
Therefore, we suggest that the deep learning method (CNN and XAI) and the previous approach (random forests and MDG (Barbosa et al., 2020; Farrow et al., 2020; Kovanović et al., 2016)) could be applied as complementary approaches. On the one hand, it is possible to propose a compound classifier that benefits from the best predictive power of each approach, since the random forests better predict the Triggering event and Exploration phases, while the CNN obtained more favourable results for the Integration phase. On the other hand, XAI could provide additional insights (i.e., word-level positive and negative indicators) into the most relevant features extracted by MDG.
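As one hedged illustration of what such a compound classifier could look like, the sketch below blends the two models' class probabilities with per-phase weights that favour each model on the phases it predicted best. The weights and probabilities are invented, and this gating rule was not evaluated in the paper.

```python
import numpy as np

PHASES = ["Triggering event", "Exploration", "Integration", "Resolution"]

# Per-phase weights following the observation above: the random forest is
# favoured on Triggering event and Exploration, the CNN on Integration.
W_RF  = np.array([1.0, 1.0, 0.0, 0.5])
W_CNN = np.array([0.0, 0.0, 1.0, 0.5])

def compound_predict(p_rf, p_cnn):
    """Weighted soft vote over the two classifiers' phase probabilities -
    an untested gating rule, not a method evaluated in this paper."""
    blended = W_RF * np.asarray(p_rf) + W_CNN * np.asarray(p_cnn)
    return PHASES[int(np.argmax(blended))]

# Invented probabilities for one message:
print(compound_predict([0.15, 0.50, 0.25, 0.10],
                       [0.10, 0.20, 0.60, 0.10]))   # -> "Integration"
```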
7. Final remarks

This paper has three contributions. First, the introduction of a deep learning model to automatically classify online discussion messages into cognitive presence phases. The results reached a Cohen's κ value of 0.498, a moderate inter-rater agreement.


Fig. 7. Two XAI visualisations for the classification of Integration phase in the two data sets. High relevance is mapped to red, low relevance to blue. The colour
opacity is normalised to the maximum relevance scores per message. The redder the word is, the more important it is to identify the phase. The bluer the word is, the
lower the relevance is. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Second, the proposed approach was applied to several scenarios to evaluate the generalisability of this solution. This investigation concluded that deep learning models produce results similar to those of the random forest classifiers. Finally, this study adopted an explainable artificial intelligence visualisation method to provide insights into the indicators of the different cognitive presence phases.
The study's practical implications include providing more information to support instructors' decision-making in relation to students' participation in online discussions. While previous works focused on providing structural information based on the text (e.g. lexical diversity and readability), the proposed XAI visualisation produces insights into the content that is relevant to the discussion. The combination of both approaches offers a richer analysis for instructors.

Despite the promising results, some limitations can be identified, such as the small number of message examples used in the current study, which could have limited the potential of the deep learning models. Next, the data came from only two courses (a fully online masters course on software engineering and a MOOC on philosophy); therefore, the results might have been affected by the intricacies of their course designs. Moreover, discussion structural features, such as conversational sequences, were not involved in the construction of the deep learning classifiers. Finally, we did not explore the potential of combining the results of the deep learning and traditional machine learning algorithms. Future work should aim to conduct experimentation with the approach, with possible evaluation on larger sample sizes composed of data from different domains and languages. We also seek to optimise the approach proposed in this study by analysing the performance of different deep learning models and combining them with traditional machine learning algorithms.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K. T., Montavon, G., Samek, W., Müller, K. R., Dähne, S., & Kindermans, P. J. (2019). iNNvestigate neural networks! Journal of Machine Learning Research, 20. arXiv:1808.04260.
ALRashdi, R., & O'Keefe, S. (2019). Deep learning and word embeddings for tweet classification for crisis response. arXiv preprint arXiv:1903.11024.
Anderson, T., & Dron, J. (2010). Three generations of distance education pedagogy. The International Review of Research in Open and Distance Learning, 12, 80–97. URL: http://www.irrodl.org/index.php/irrodl/article/view/890/.
Anderson, T., Rourke, L., Garrison, D. R., & Archer, W. (2001). Assessing teaching presence in a computer conferencing context. Journal of Asynchronous Learning Networks, 5, 1–17.
Arras, L., Horn, F., Montavon, G., Müller, K. R., & Samek, W. (2017a). "What is relevant in a text document?": An interpretable machine learning approach. PLoS One, 12, Article e0181142.
Arras, L., Montavon, G., Müller, K. R., & Samek, W. (2017b). Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206.
Barbosa, G., Camelo, R., Cavalcanti, A. P., Miranda, P., Mello, R. F., Kovanović, V., & Gašević, D. (2020). Towards automatic cross-language classification of cognitive presence in online discussions. In Proceedings of the tenth international conference on learning analytics & knowledge (pp. 605–614).
Barbosa, A., Ferreira, M., Ferreira Mello, R., Dueire Lins, R., & Gasevic, D. (2021). The impact of automatic text translation on classification of online discussions for social and cognitive presences. In LAK21: 11th international learning analytics and knowledge conference (pp. 77–87).
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Bauer, M. W. (2007). Content analysis: An introduction to its methodology – by Klaus Krippendorff; From words to numbers: Narrative, data and social science – by Roberto Franzosi. British Journal of Sociology, 58, 329–331.


Ben-David, A. (2008). About the relationship between ROC curves and Cohen's kappa. Engineering Applications of Artificial Intelligence, 21, 874–882.
Chakravarthi, B. R., Priyadharshini, R., Muralidaran, V., Suryawanshi, S., Jose, N., Sherly, E., & McCrae, J. P. (2020). Overview of the track on sentiment analysis for Dravidian languages in code-mixed text. In Forum for information retrieval evaluation (pp. 21–24).
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Falotico, R., & Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality and Quantity, 49, 463–470.
Farrow, E., Moore, J., & Gasevic, D. (2019). Analysing discussion forum data: A replication study avoiding data contamination. https://doi.org/10.1145/3303772.3303779
Farrow, E., Moore, J., & Gašević, D. (2020). Dialogue attributes that inform depth and quality of participation in course discussion forums. In The 10th international conference on learning analytics and knowledge (LAK '20) (pp. 129–134). Frankfurt, Germany.
Ferreira, R., Kovanović, V., Gašević, D., & Rolim, V. (2018). Towards combined network and text analytics of student discourse in online discussions. In Artificial intelligence in education (pp. 111–126). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-93843-1_9
Garrison, D. R., & Anderson, T. (2011). E-learning in the 21st century: A framework for research and practice (2nd ed.). New York, NY, USA: Routledge.
Garrison, D. R., Anderson, T., & Archer, W. (1999). Critical inquiry in a text-based environment: Computer conferencing in higher education. The Internet and Higher Education, 2, 87–105. https://doi.org/10.1016/S1096-7516(00)00016-6
Garrison, D. R., Anderson, T., & Archer, W. (2001). Critical thinking, cognitive presence, and computer conferencing in distance education. American Journal of Distance Education, 15, 7–23. https://doi.org/10.1080/08923640109527071
Garrison, D. R., Anderson, T., & Archer, W. (2010). The first decade of the community of inquiry framework: A retrospective. The Internet and Higher Education, 13, 5–9. https://doi.org/10.1016/j.iheduc.2009.10.003
Garrison, D. R., Cleveland-Innes, M., & Fung, T. S. (2010). Exploring causal relationships among teaching, cognitive and social presence: Student perceptions of the community of inquiry framework. The Internet and Higher Education, 13, 31–36.
Gašević, D., Kovanović, V., & Joksimović, S. (2017). Piecing the learning analytics puzzle: A consolidated model of a field of research and practice. Learning: Research and Practice, 3, 63–78.
Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264. https://doi.org/10.1016/S0304-3800(02)00257-0
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). Cambridge: MIT Press.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). IEEE.
Gunning, D. (2017). Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web, 2.
Guo, S. X., Sun, X., Wang, S. X., Gao, Y., & Feng, J. (2019). Attention-based character-word hybrid neural networks with semantic and structural information for identifying of urgent posts in MOOC discussion forums. IEEE Access, 7, 120522–120532.
Hayati, H., Chanaa, A., Idrissi, M. K., & Bennani, S. (2019). Doc2vec & naïve bayes: Learners' cognitive presence assessment through asynchronous online discussion TQ transcripts. International Journal of Emerging Technologies in Learning, 14.
Hu, Y., Donald, C., Giacaman, N., & Zhu, Z. (2020). Towards automated analysis of cognitive presence in MOOC discussions: A manual classification study. In The 10th international conference on learning analytics and knowledge (LAK '20) (pp. 135–140). Frankfurt, Germany.
Kamath, C. N., Bukhari, S. S., & Dengel, A. (2018). Comparative study between traditional machine learning and deep learning approaches for text classification. In Proceedings of the ACM symposium on document engineering 2018 (pp. 1–11).
Kovanović, V., Gašević, D., & Hatala, M. (2014a). Learning analytics for communities of inquiry. Journal of Learning Analytics, 1, 195–198.
Kovanović, V., Joksimović, S., Gašević, D., & Hatala, M. (2014b). Automated cognitive presence detection in online discussion transcripts. In Workshop proceedings of LAK 2014, Indianapolis, IN, USA. URL: http://ceur-ws.org/Vol-1137/.
Kovanović, V., Joksimović, S., Waters, Z., Gašević, D., Kitto, K., Hatala, M., & Siemens, G. (2016). Towards automated content analysis of discussion transcripts: A cognitive presence case. In Proceedings of the sixth international conference on learning analytics and knowledge (pp. 15–24). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2883851.2883950
Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence (pp. 2267–2273). Austin, Texas, USA.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, S., & Lee, C. (2020). Revisiting spatial dropout for regularizing convolutional neural networks. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-09054-7
Liu, C. J., & Yang, S. C. (2014). Using the community of inquiry model to investigate students' knowledge construction in asynchronous online discussions. Journal of Educational Computing Research, 51, 327–354. https://doi.org/10.2190/EC.51.3.d
Mcklin, T. E. (2004). Analyzing cognitive presence in online courses using an artificial neural network (Ph.D. thesis). Atlanta, GA, USA. URL: https://scholarworks.gsu.edu/msit_diss/1/. AAI3190967.
McKnight, P. E., & Najab, J. (2010). Mann-Whitney U test. The Corsini Encyclopedia of Psychology, 1–1.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38.
Neto, V., Rolim, V., Ferreira, R., Kovanović, V., Gašević, D., Lins, R. D., & Lins, R. (2018). Automated analysis of cognitive presence in online discussions written in Portuguese. In European conference on technology enhanced learning (pp. 245–261). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-98572-5_19
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
Rivard, R. (2013). Measuring the MOOC dropout rate. Inside Higher Ed, 8, 2013.
Rolim, V., Ferreira, R., Kovanović, V., & Gašević, D. (2019). Analysing social presence in online discussions through network and text analytics. In Proceedings of the 19th IEEE international conference on advanced learning technologies (ICALT'19) (pp. 163–167). New Jersey, USA: IEEE. https://doi.org/10.1109/ICALT.2019.00058
Rourke, L., Anderson, T., Garrison, D. R., & Archer, W. (1999). Assessing social presence in asynchronous text-based computer conferencing. The Journal of Distance Education, 14, 50–71.
Rourke, L., Anderson, T., Garrison, D. R., & Archer, W. (2001). Methodological issues in the content analysis of computer conference transcripts. International Journal of Artificial Intelligence in Education (IJAIED), 12, 8–22.
Rourke, L., & Kanuka, H. (2009). Learning in communities of inquiry: A review of the literature. International Journal of E-Learning & Distance Education, 23, 19–48.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Samek, W., & Müller, K. R. (2019). Towards explainable artificial intelligence. In Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 5–22). Springer.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825.
Strijbos, J. W. (2011). Assessment of (computer-supported) collaborative learning. IEEE Transactions on Learning Technologies, 4, 59–73.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9). https://doi.org/10.1109/CVPR.2015.7298594. arXiv:1409.4842.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29, 24–54.
Van Asch, V. (2013). Macro- and micro-averaged evaluation measures (Vol. 49). Belgium: CLiPS.
Wang, X., Yang, D., Wen, M., Koedinger, K., & Rosé, C. P. (2015). Investigating how student's cognitive behavior in MOOC discussion forums affect learning gains. International Educational Data Mining Society.
Waters, Z., Kovanović, V., Kitto, K., & Gašević, D. (2015). Structure matters: Adoption of structured classification approach in the context of cognitive presence classification. In Information retrieval technology (pp. 227–238). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-28940-3_18
Wei, X., Lin, H., Yang, L., & Yu, Y. (2017). A convolution-LSTM-based deep neural network for cross-domain MOOC forum post classification. Information, 8, 1–16. https://doi.org/10.3390/info8030092
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 38–45). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
Yang, S., Wang, Y., & Chu, X. (2020). A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526.
Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13, 55–75.
