
Student cluster analysis based on Moodle data and academic performance indicators

Marian Bucos Bogdan Drăgulescu


Communications Department Communications Department
Politehnica University Timișoara Politehnica University Timișoara
Timișoara, România Timișoara, România
marian.bucos@upt.ro bogdan.dragulescu@upt.ro

Abstract—The present work considers the possibility of results of four clustering algorithms: X-means, K-Means,
using Moodle course logs and student performance indicators Hierarchical, and Expectation Maximization. A low number
within the Database Systems course to apply the K-Means of clusters were produced (2 or 3 depending on the course and
clustering algorithm. Clusters of students are identified and algorithm), and the authors did not observe groups that had an
explained to partition students with similar study behaviours obvious behaviour difference. They explain this phenomenon
and performance. Moreover, the understanding of the five by the small number of students and homogenous groups in
groups emerged in cluster analysis allowed us to identify a terms of age, tech use and former experience.
cluster that contains 86% of students at risk of not completing
the course activity. One important aspect that differentiates our Pradana et al. in [5] used multiple clustering algorithms
study from other similar works is the use of data collected over (K-Means, Hierarchical, and Louvain) to see the most
a long period of time, from 2015 to 2019. The final data set, appropriate clustering technique in analyzing Moodle log
obtained after preprocessing, contains no fewer than 185,206 activity data. They focused on four different Moodle courses,
course logs. over a five-month period. Certain actions extracted from the
events in the Moodle logs were used in the clustering process.
Keywords—student clusters, clustering analysis, Moodle data, The results revealed that the Louvain algorithm can divide
student performance, educational data mining more evenly and precisely compared to K-Means and
I. INTRODUCTION Hierarchical cluster techniques.
Higher education institutions are responsible for providing Other approaches propose using clustering to improve the
students with learning scenarios for developing technical performance of predicting students at risk in tandem with
skills and to certify that those skills are acquired by the classification algorithms. Such examples are: a hybrid
students. At the same time, due to economic criteria, the algorithm that shows a strong relationship between student
number of students per teacher is increasing. This makes it behaviour and their academic performance [6], using a Decision
harder for educators to identify the students at risk of Tree classifier and K-Means clustering to predict students’
failure who could be helped. One solution is to take advantage GPA [7], and using K-Means clustering and classification
of the data collected by educational platforms that have become methods to predict course achievements for students through
ubiquitous in the past decade. their procrastination behaviour [8].
Knowledge discovery from data is the scope of the Data In the above-mentioned studies, some shortcomings can
Mining research field. The main function is the application of be identified: small datasets, mixed data from multiple
specific methods to develop models that can detect previously courses, not all studies evaluate the resulting clusters, the
unknown patterns [1]. Applying Data Mining methods for number of clusters may be too small for some use cases.
analyzing educational data defines the field of Educational Hence, our research question for this student clustering study
Data Mining (EDM). According to Baker and Yacef, the most is as follows: What student clusters emerge in the Database
used approaches in EDM are classification, clustering, Systems course based on browsing history events logged in
relationship mining, and pattern discovery [2]. On the other Moodle platform and performance activity grades? In addition
hand, the authors argued that one of the areas that most often to identifying clusters and understanding their specificity, the
attracts attention in EDM research is identifying factors that present work continues our previous research, which aims to
are associated with student failure. identify students at risk of failing activities or courses [9] [10].
As mentioned above a popular data mining technique is This article deals with the problem of identification of
the use of clustering algorithms to predict students at risk or Moodle course student groups using cluster analysis. This
to identify the predictors of this outcome. Bharara et al. in [3] paper is organized as follows. The next section describes the
used the K-Means clustering method to identify meaningful method of the research, including data collection and data
indicators and to study the relationships between these preprocessing, and the clustering algorithm used. Section 3
metrics, analyzing the effects on student’s performance. presents the experiment carried out and the results of the
Evaluating the clusters produced the authors reported that the student clustering study. Section 4 discusses the outcomes of
parental involvement features and learning platform this work. Section 5 presents the conclusions and offers
interactional features vary with demographic features and insights for further research.
affect students’ learning behaviour and overall performance.
II. METHODOLOGY
In another study, the authors proposed clustering students As mentioned earlier, the objective of this paper is to
by mining Moodle log with the first objective to define identify different clusters of students in the Database Systems
relevant clustering features [4]. They used three courses with course at the Politehnica University Timișoara. The general
a low number of students (15, 30, and 56) and compared the process we will follow when developing this clustering model

978-1-7281-9513-1/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: Consortium - Algeria (CERIST). Downloaded on April 19,2022 at 00:05:22 UTC from IEEE Xplore. Restrictions apply.
can be summarized as follows: data collection, data A total of 32 student records were discarded from DS2 due
preprocessing, data modelling, model validation and results to incomplete information (students without class activity), or
interpretation. duplicate information (re-enrolled students). The data set
contains a total of 597 instances after cleaning.
A. Data collection
The data used in this study was collected from the Since the first data set (DS1) contains browsing history
Database Systems course, from the Faculty of Electronics, events carried out during the 12 weeks of study and multiple
Telecommunications, and Information Technologies, at the records are available for each student, we focused on grouping
Politehnica University Timișoara. The data sets include and aggregating them for each one. In a feature engineering
Moodle course logs and course grade books exported from our process, first, we grouped records for each student and
university learning management system (UPT Virtual aggregated them. By counting unique date values for each
Campus). This study considers Database Systems course data, student, we introduced a new attribute Course visits.
gathered over five years, from 2015 to 2019. A new grouping operation on DS1 data set considered the
Moodle course logs contain detailed information about attributes Moodle ID and Visit date, where the attribute Visit
student activity, including student’s ID number, the date and date was extracted from the Unix time attribute. This time a
time when course-specific information was collected, the standard count aggregation function was used to obtain the
context and name of student actions, the address of the number of events accessed by each student daily. Performing
machine from which the access was made. It should be noted a second grouping operation on the data set that resulted in the
that course tutors can export and capitalize on such log data from previous step, another two attributes were created: Min events,
the Moodle course administration section. and Max events. These attributes were introduced by setting
appropriate aggregation functions (min, max) for each group.
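The grouping and aggregation steps described for DS1 can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the authors' code: the tiny `logs` table is hypothetical, only the attribute names Moodle ID, Unix time, Visit date, Course visits, Min events, and Max events come from the text, and the visit date is assumed to be the UTC calendar day.

```python
import pandas as pd

# Hypothetical cleaned log records; only the column names follow the paper.
logs = pd.DataFrame({
    "Moodle ID": [1, 1, 1, 2, 2, 2, 2],
    "Unix time": [1443686400, 1443690000, 1443772800,
                  1443686400, 1443686460, 1443686520, 1444291200],
})

# Visit date is extracted from the Unix time attribute (UTC day assumed).
logs["Visit date"] = pd.to_datetime(logs["Unix time"], unit="s", utc=True).dt.date

# Course visits: number of distinct days on which each student accessed the course.
course_visits = (
    logs.groupby("Moodle ID")["Visit date"].nunique().rename("Course visits")
)

# Number of events per student per day (standard count aggregation).
daily_events = logs.groupby(["Moodle ID", "Visit date"]).size()

# Second grouping: min and max of the daily event counts for each student.
min_max = daily_events.groupby(level="Moodle ID").agg(["min", "max"])
min_max.columns = ["Min events", "Max events"]

# One row per student with the engineered browsing attributes.
features = pd.concat([course_visits, min_max], axis=1).reset_index()
print(features)
```

The same pattern (group by Moodle ID, then aggregate) would apply to any other per-student indicator derived from the logs.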
The initial data set (DS1), obtained by merging log data
from all instances of the Database Systems course, includes Attributes introduced by feature engineering represent
no fewer than 237,441 records. For each event record, 11 numerical indicators of a student's browsing history in the
attributes were collected, such as: Time, User full name, online course. The data set obtained by joining the resulting
Affected user, Event context, Component, Event name, data sets consists of unique student records with a total of 597
Description, Origin, IP address, Moodle ID, Unix time. x 4 data points. Only attributes that make sense in cluster
analysis have been preserved: Moodle ID, Course visits, Min
Also, Moodle tutors can export course grade books from events, and Max events.
the course grades administration section. A grade book
contains all the grades for each student in a course. In our case, The same feature engineering process took place on the
the grade books of the Database Systems instances from 2015 DS2 data set. By applying min, max, and mean aggregation
until 2019 follow the same format: Moodle ID, Meeting 01, functions, the attributes Min grade, Max grade, and Activity
Meeting 02, Meeting 03, …, Meeting 14, Final activity grade. grade were introduced. The data set, obtained by removing the
attributes that are no longer of interest, consists of a total
Regarding the Database Systems course considered for number of 597 x 4 data points, with the following attributes:
this study, students participate in 14 mandatory face-to-face Moodle ID, Min grade, Max grade, and Activity grade.
practical activity meetings during the semester (two of which
are allocated for recovery of lost activities), but they do not Data normalization is a common preprocessing practice,
receive grades for all these meetings. Therefore, most Meeting also required before using K-Means clustering algorithm. To
attributes have null values. In this case, we considered only avoid cases in which attributes have different weight in the
the data corresponding to the first 12 study weeks. The second decision process we applied standard normalization [11]. To
data set (DS2), obtained by merging the Database Systems perform normalization, we used the mean (μ) and the standard
course grade books data, includes 629 records. deviation (σ) for each attribute.

B. Data preprocessing $z = \frac{x - \mu}{\sigma}$ (1)


The data preprocessing refers to any type of processing
performed on our initial data sets to prepare them for applying Each normalized value is obtained by subtracting the mean
clustering methods. After completing the data collection of the corresponding attribute and dividing the value by the
process, we have two data sets. The first data set (DS1) standard deviation. The standard normalization ensures that
contains information about online user activity in the Database each attribute has zero mean and a
Systems course, while the second one (DS2) stores standard deviation equal to one.
information about student performance in class.
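The standard normalization in (1) can be illustrated with a short NumPy sketch. The sample values are hypothetical, and the use of the population standard deviation (ddof = 0, which is also what Scikit-learn's StandardScaler uses) is an assumption, since the text does not state which estimator was applied.

```python
import numpy as np

# Hypothetical raw values for three attributes (rows are students).
X = np.array([
    [49.0, 27.0, 6.0],
    [41.0, 47.0, 8.0],
    [30.0, 26.0, 4.0],
    [38.0, 59.0, 5.5],
])

mu = X.mean(axis=0)      # per-attribute mean
sigma = X.std(axis=0)    # per-attribute population standard deviation (assumed ddof=0)

# Equation (1): subtract the mean and divide by the standard deviation.
Z = (X - mu) / sigma

print(Z.mean(axis=0))    # approximately 0 for every attribute
print(Z.std(axis=0))     # approximately 1 for every attribute
```

Standardization changes only the location and scale of each attribute, so attributes measured on different ranges contribute comparably to the Euclidean distances used by K-Means.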
After data preprocessing, our experiment benefits from a new
The data cleaning process on our initial data sets was done data set obtained by merging the last version of DS1 and DS2
separately, for each one. For the first data set, the cleanup data sets with index Moodle ID. The final data set contained a
operation considered removing several types of records: total of 597 x 7 data points, with the following attributes:
records that did not belong to students (course interventions Moodle ID (MID), Course visits (CVD), Min events (MIED),
from tutors or Moodle administrators), records of re-enrolled Max events (MXED), Min grade (MIGR), Max grade
students, records that contain null values, or records of (MXGR), and Activity grade (AVGR). Two attributes were
duplicate events (similar student actions in very short time- removed from the final data set before cluster analysis. The
frame). Since the present work aims at student clustering attribute Moodle ID is the unique identifier of each student,
according to their online activities and grades for 12 study while the attribute Min events has extremely low variance.
weeks, some more records were discarded from the data set
(activities carried out outside the study period considered). C. Cluster analysis
After removing all these log records, the first data set contains Cluster analysis represents a set of exploratory techniques
a total of 185,206 instances. that are used to identify groups of similar objects within the

data. The most popular clustering method is the K-Means
algorithm [12][13] as it is simple to understand and
implement. The K-Means algorithm is an iterative process that
partitions a data set into distinct groups (clusters), while data
points within a cluster are as similar as possible. Each cluster
is characterized by the mean of the data points assigned to that
cluster (cluster centre or centroids).
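The iterative process described above can be sketched in a few lines of NumPy. This is a minimal illustration of the assignment and update steps, not the Scikit-learn implementation used in the study; the function name `kmeans`, the random initialization from data points, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points (simple choice, not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment so that labels match the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Toy data: two visibly separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```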
In this study, we used the Silhouette analysis to choose the
optimal value for the number of clusters (k). Silhouette
analysis measures the separation distance between the
resulting clusters. The implementation of
the Silhouette analysis is done by calculating the average
Fig. 2. Profiling variables - Heat Map
Silhouette coefficient for different values of k. The Silhouette
coefficient is calculated by using (2), where x is the mean The presence of five different clusters can be inferred. The
distance between a sample and all other points in the same information provided in Fig. 2, showing the graphical
cluster and y is the mean distance between a sample and all representation of the variables per cluster, can be used to
other points in the nearest cluster [14]. understand what type of students belong to each cluster.
$s = \frac{y - x}{\max(x, y)}$ (2) The first cluster (Cluster 0) contains a total of 131 students.
We can see that students from this cluster have the most days with
The Silhouette coefficient values vary between -1 and 1. access to the online course (49.60 for CVD), even if the
A value close to 1 means that the sample is close to its cluster, maximum number of daily events is low (27.00 for MXED).
while a value close to -1 denotes that the sample is assigned to The mean values for their face-to-face meetings variables are:
the wrong cluster. 3.37 for MIGR, 8.52 for MXGR, and 6.05 for AVGR.
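As a small worked example of the Silhouette coefficient in (2), the sketch below computes it for single samples, with x the mean distance to the other points of the same cluster and y the mean distance to the points of the nearest other cluster. The helper name `silhouette_sample`, the one-dimensional toy data, and the convention of returning 0 for a single-point cluster are assumptions made for the illustration.

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette coefficient of sample i, following (2)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                      # exclude the sample itself
    if not same.any():
        return 0.0                       # assumed convention for singleton clusters
    x = dists[same].mean()               # mean intra-cluster distance
    y = min(dists[labels == c].mean()    # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (y - x) / max(x, y)

# Toy data: the last point (0.5) lies among cluster 0 but is labelled as cluster 1.
X = np.array([[0.0], [1.0], [4.0], [5.0], [0.5]])
labels = np.array([0, 0, 1, 1, 1])

s_good = silhouette_sample(X, labels, 0)   # well-placed sample, positive value
s_bad = silhouette_sample(X, labels, 4)    # misassigned sample, negative value
print(s_good, s_bad)
```

For the misassigned point, x = 4.0 and y = 0.5, so the coefficient is (0.5 - 4.0) / 4.0, a strongly negative value.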
In the second cluster (Cluster 1) we can find 78 students.
III. RESULTS
This cluster contains the students with the best average values
Our clustering experiment was performed using Scikit- of variables, that denotes performance in face-to-face
learn [15], a popular, extensive, and powerful Python package meetings. Even the values of the variables corresponding to
for implementing machine learning. The Scikit-learn package accessing online resources have good values. The mean values
provides tools for data preprocessing, data modelling, model for their variables are: 46.80 for MXED, 41.30 for CVD, 7.11
validation, and so forth. To conduct our experiment, we used for MIGR, 9.34 for MXGR, and 8.26 for AVGR.
the most common K-Means clustering algorithm
implementation from Scikit-learn. The third cluster (Cluster 2) is composed of 93 students.
These students have the best value in terms of maximum
The final data set covers 597 unique students, which we events per day, a good value for daily course visits, but not so
use to build the clusters. The data set good marks in face-to-face activity meetings. The mean
contains numerical variables like Course visits (days), the values for their variables are: 58.90 for MXED, 37.60 for
number of days in which each student accessed the Database CVD, 4.16 for MIGR, 7.07 for MXGR, and 5.60 for AVGR.
Systems course on Moodle university platform, or Activity
grade, the average grade of each student in face-to-face course The fourth cluster (Cluster 3) consists of 179 students,
meetings during the first 12 weeks of study. being the most populated cluster of all. They are part of the
two poorest performing clusters in terms of accessing online
The K-Means clustering algorithm is applied to this data resources, with 25.90 in MXED and 31.60 in CVD. However,
set with k between 2 and 10, followed by the calculation of the these students have appreciable values of the maximum grade
average silhouette coefficient for each k. The results indicate (8.85 in MXGR) and medium values of the average activity
values for average silhouette coefficient between 0.355 and grade (6.74 in AVGR), but only 4.57 in MIGR.
0.396.
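The selection of k by average silhouette coefficient can be sketched with Scikit-learn as follows. The data here are synthetic (generated with `make_blobs` around five hand-picked centers) and only stand in for the standardized student data set; the `n_init` and `random_state` values are likewise assumptions, since the text does not report the settings used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized data: five separated groups, five attributes.
centers = np.eye(5) * 10.0
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Fit K-Means for k = 2..10 and record the average silhouette coefficient.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette coefficient is retained.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("selected k:", best_k)
```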
The last cluster (Cluster 4) contains 116 students. They
have the lowest performance for all variables considered. The
mean values are: 26.10 for MXED, 29.50 for CVD, 2.73 for
MIGR, 5.93 for MXGR, and 4.32 for AVGR.
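Per-cluster mean values such as those reported above can be obtained by grouping the student records by their assigned cluster label. The sketch below uses a hypothetical four-row table and a hypothetical `cluster` column; only the variable abbreviations (CVD, MXED, AVGR) come from the text.

```python
import pandas as pd

# Hypothetical student records in original units, with cluster labels already assigned.
students = pd.DataFrame({
    "CVD":  [50, 48, 30, 28],
    "MXED": [27, 29, 26, 24],
    "AVGR": [6.0, 6.2, 4.3, 4.5],
    "cluster": [0, 0, 4, 4],
})

# Cluster sizes and the mean of each profiling variable per cluster.
sizes = students["cluster"].value_counts().sort_index()
profile = students.groupby("cluster").mean().round(2)

print(sizes)
print(profile)
```

Computing the profile on the unstandardized values keeps the means interpretable as days, events, and grades.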
IV. DISCUSSIONS
The student cluster analysis was conducted at the
Politehnica University of Timișoara using course data
gathered over a period of five years, from 2015 to 2019. The
results of this study reveal that five different student clusters
can be identified within our Database Systems course, based
on browsing history events logged in the Moodle platform and
activity grades. The optimal value for the number of clusters
(five) used in the K-Means algorithm was validated by the
maximum average silhouette coefficient (0.3963).
We can better visualize differences between clusters by
plotting the means of the profiling variables for each cluster.
Fig. 1. Average silhouette coefficient depending on k value
As shown in Fig. 3, Cluster 4 can be labelled At-Risk and

includes students with the lowest grade performance and V. CONCLUSIONS
minimal involvement in accessing the course's online This study shows that the approach used herein is a
resources. Within this group are found over 86% of students practical way to define homogenous student groups in the
who do not complete the activity in the 14th week. We must Database Systems course based on the K-Means clustering
reiterate that the data set used within the cluster analysis from algorithm and student online activities recorded within
this study considers the first 12 weeks of study. We also find Moodle course logs from our university learning management
low involvement in accessing online resources from students system. The results offer the premises for monitoring and
belonging to Cluster 3. We labelled this cluster Minimal identifying the students who present the risk of not completing
Involvement Works, because, despite the low involvement in the course activity.
accessing online resources, students in this cluster complete
the activity without much effort. Future research might focus on evaluating the
performance of multiple clustering algorithms on current data
Another group in which we identify students who did not sets, or even on including in research other long-term courses
complete the activity, although far fewer, is Cluster 2. Only 8% from our university learning management system.
of students who did not complete the activity at the end of the
14 weeks of study are part of this group. We labelled this The information gained in this clustering study may
cluster Unsuccessful Hard Working because the facilitate the development of effective clustering Moodle
representatives of this cluster have a medium value in terms based tools both at the university level and within the
of the number of days they access online resources (CVD), educational community that uses Moodle.
and the highest value of the maximum number of events
accessed daily (MXED). However, the average value of the REFERENCES
Activity grade (AVGR) variable does not exceed 5.60. We can [1] O. Maimon and L. Rokach, “Introduction to knowledge discovery and
say that within this group there are students who have data mining,” in Data mining and knowledge discovery handbook,
problems in understanding certain aspects of the course. Springer, 2009, pp. 1–15.
[2] R. S. Baker and K. Yacef, “The state of educational data mining in
In Cluster 0 we find the students who present most of the 2009: A review and future visions,” JEDM-J. Educ. Data Min., vol. 1,
days with access to the online course, but with a small number no. 1, pp. 3–17, 2009.
[3] S. Bharara, S. Sabitha, and A. Bansal, “Application of learning
of maximum daily events. We labelled this cluster Potential analytics using clustering data Mining for Students’ disposition
Loyalists. A final group, Cluster 1, belongs to the Champions, analysis,” Educ. Inf. Technol., vol. 23, no. 2, pp. 957–984, Mar. 2018,
students who have the best study performance, and high doi: 10.1007/s10639-017-9645-7.
values both for the number of days they access the course and [4] A. Bovo, S. Sanchez, O. Héguy, and Y. Duthen, “Clustering moodle
for the maximum number of events held daily. data as a tool for profiling students,” in 2013 Second International
Conference on E-Learning and E-Technologies in Education (ICEEE),
Another element that can be noticed in Fig. 3 is the Sep. 2013, pp. 121–126, doi: 10.1109/ICeLeTE.2013.6644359.
presence of two groups of variables for each cluster. The first [5] C. Pradana, S. S. Kusumawardani, and A. E. Permanasari,
“Comparison Clustering Performance Based on Moodle Log Mining,”
group, defined by the variables Min grade, Max grade and IOP Conf. Ser. Mater. Sci. Eng., vol. 722, p. 012012, Jan. 2020, doi:
Activity grade, determines the performance of students from 10.1088/1757-899X/722/1/012012.
each cluster in face-to-face meetings. The second group, [6] B. K. Francis and S. S. Babu, “Predicting Academic Performance of
defined by the variables Max events and Course visits, Students Using a Hybrid Data Mining Approach,” J. Med. Syst., vol.
describes the students' behaviour in terms of accessing online 43, no. 6, p. 162, Apr. 2019, doi: 10.1007/s10916-019-1295-4.
[7] M. Shovon, H. Islam, and M. Haque, “An Approach of Improving
resources. Based on these groups we can designate two Students Academic Performance by using k means clustering
clusters (Cluster 3 and Cluster 4) whose students exhibit algorithm and Decision tree,” ArXiv Prepr. ArXiv12116340, 2012.
similar behaviours in terms of accessing online resources. [8] Y. Yang, D. Hooshyar, M. Pedaste, M. Wang, Y.-M. Huang, and H.
Similarly, two other clusters can be specified (Cluster 0 and Lim, “Predicting course achievement of university students based on
Cluster 3) whose students have close study performances. As their procrastination behaviour on Moodle,” Soft Comput., Jul. 2020,
doi: 10.1007/s00500-020-05110-4.
stated in [16], using this approach we can cluster students with
[9] M. Bucos and B. Drăgulescu, “Predicting student success using data
similar preferences for group formation purposes, allowing generated in traditional educational environments,” TEM J., vol. 7, no.
students to work in a more motivating environment. 3, Art. no. 3, 2018.
[10] B. Drăgulescu, M. Bucos, and R. Vasiu, “Predicting assignment
submissions in a multi-class classification problem,” TEM J., vol. 4,
no. 3, Art. no. 3, 2015.
[11] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and
Techniques. Elsevier, 2011.
[12] S. P. Lloyd, “Least squares quantization in PCM,”
IEEE Trans. Inf. Theory, Mar. 1982, doi: 10.1109/TIT.1982.1056489.
[13] J. MacQueen, “Some methods for classification and analysis of
multivariate observations,” in Proceedings of the fifth Berkeley
symposium on mathematical statistics and probability, California,
1967, vol. 1, pp. 281–297.
[14] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, pp. 53–
65, Nov. 1987, doi: 10.1016/0377-0427(87)90125-7.
[15] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” J.
Mach. Learn. Res., vol. 12, no. 85, pp. 2825–2830, 2011.
[16] J. A. Ruipérez-Valiente, P. J. Muñoz-Merino, and C. Delgado Kloos,
“Detecting and clustering students by their gamification behavior with
badges: A case study in engineering education,” Int. J. Eng. Educ., vol.
33, no. 2-B, pp. 816–830, 2017.

Fig. 3. Profiling variables - Snake Plot

