Sustainable e-Learning by Data Mining—Successful Results in
a Chilean University
Aurora Sánchez 1, Cristian Vidal-Silva 2,*, Gabriela Mancilla 1, Miguel Tupac-Yupanqui 3 and José M. Rubio 4
1 Department of Administration, Universidad Católica del Norte, Angamos 0610, Antofagasta 1270709, Chile
2 Faculty of Engineering, School of Videogame Development and Virtual Reality Engineering,
University of Talca, Talca 3460000, Chile
3 EAP, Ingeniería de Sistemas e Informática, Universidad Continental, Huancayo 12000, Peru
4 Escuela de Computación e Informática, Facultad de Ingeniería, Ciencia y Tecnología, Universidad Bernardo
O’Higgins, Santiago 8370993, Chile
* Correspondence: cvidal@utalca.cl; Tel.: +56-9-62002702
Abstract: People are increasingly open to using online education, mainly to break the distance and
time barriers of in-person education. This type of education is sustainable at all levels, and its
relevance has increased even more during the pandemic. Consequently, educational institutions are
storing large volumes of data containing relevant information about their operations, but they do not
know why students succeed or fail. The Knowledge Discovery in Databases (KDD) process could
help address this challenge by extracting innovative models to identify the main patterns and factors
that could affect the success of their students in online education programs. This work uses the
CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology to analyze data from
the Distance Education Center of the Universidad Católica del Norte (DEC-UCN) from 2000 to 2018.
CRISP-DM was chosen because it represents a proven process that integrates multiple methodologies
to provide an effective meta-process for data knowledge projects. DEC-UCN is one of the first centers
to implement online learning in Chile, and this study analyses 18,610 records in this period. The
study applies data mining, the most critical KDD phase, to find hidden data patterns to identify
the variables associated with students’ success in online learning (e-learning) programs. This study
found that the main variables explaining student success in e-learning programs are age, gender,
degree study, educational level, and locality.
Keywords: CRISP-DM; e-learning; data mining; KDD; DEC-UCN; students’ success
Citation: Sánchez, A.; Vidal-Silva, C.; Mancilla, G.; Tupac-Yupanqui, M.; Rubio, J.M. Sustainable e-Learning by
Data Mining—Successful Results in a Chilean University. Sustainability 2023, 15, 895.
https://doi.org/10.3390/su15020895
Academic Editor: Tarah Wright
Received: 11 October 2022
Revised: 10 December 2022
Accepted: 20 December 2022
Published: 4 January 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article
distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Current advances in education and technology help people develop competencies
in defined areas at home [1]. As [2] highlight, online learning is a model that
has revolutionized education thanks to the inclusion of Information and Communication
Technologies (ICTs) and the growth of Educational Data Mining (EDM). Educational
institutions pay attention to this revolution because of the use of new methodologies in the
educational process [3]. Multiple studies evaluate the success of online learning
technology platforms, mainly based on the DeLone and McLean information
systems success model, to measure and assess the success and sustainability of electronic learning
systems [4,5]. However, despite the rapid growth of online learning and EDM, institutions
that offer online courses face many problems, and the variables that impact
student success in distance education are still unknown.
Tools for identifying behavioral data patterns and factors for the success of online
learning already exist [6]. This study fills the gap regarding the variables for student
success in e-learning programs by adapting the data mining methodology CRISP-DM
(Cross-Industry Standard Process for Data Mining) to discover the variables of success [7].
E-learning could readily meet the needs, features, and requirements of potential students
who select this modality of study [8–10], even more so during the pandemic [11]. No
previous research exists in South American countries that identifies determinants of
student success in e-learning programs.
Knowledge Discovery in Databases (KDD), commonly known as data mining [12,13],
is a process for pattern discovery and predictive modeling in large databases [14].
KDD makes extensive use of data mining methods, automated techniques, and algorithms
for pattern recognition and identifying hidden patterns in e-learning environments [15].
Characteristically, data mining uses machine learning methods developed in the domain of
artificial intelligence [16]. Data mining uses statistical, mathematical, artificial intelligence,
and machine learning techniques to extract and identify pertinent information and related
knowledge hidden in large volumes of raw data [17]. Data mining is technically the process
of finding correlations or patterns between thousands of fields in large databases [15]. Data
mining finds these patterns and relationships by using data analysis and machine learning
tools and techniques to build models [18,19].
Data mining comprises various techniques for pre-processing, analyzing, and inter-
preting data. Most researchers in the area agree that we could organize them into pattern
recognition and machine learning. Pattern recognition aims to identify implicit objects
and relations, and machine learning techniques are mainly applied to extract generalized
knowledge from data for use in prediction tasks. Researchers can use classification tech-
niques in data mining to predict group membership for data occurrences. Consequently,
data mining involves more than collecting and managing data because it includes analysis
and prediction. Classification techniques allow the processing of a wider variety of data
than regression, and they are growing in popularity. There is a great variety of algorithms
for classification purposes. Scheuer and McLaren [20] propose a model to identify the
most influential factors that predict student academic performance. They predict students’
passing or failing status by considering and defining their academic performance (high,
medium, or low).
Educational Data Mining (EDM) is concerned with developing, investigating, and ap-
plying machine learning, data mining, and statistical methods to detect patterns in extensive
collections of data from educational institutions that would otherwise be impossible to
analyze using traditional computing techniques [2,10,21]. In this sense, in recent years,
the use of deep learning techniques has emerged in EDM. Hence, developing data mining
competencies represents a research area. For example, the work of [22] presents positive
experiences to enhance knowledge acquisition about data mining via the game-based
approach. Regarding the implementation of EDM systems, the work of Almaiah et al. [23]
discusses traditional issues and the success of using modern programming languages.
Problem Statement, Goal, and Contributions

In EDM, the data of interest is not limited to individual student interactions in an
educational system. We can consider different administrative and demographic data such
as gender, age, and grades for discovering patterns. As [2] discusses, EDM applications
exist to find educational patterns such as cognitive skills, motivational effects, and social
emotions. The works of [24–26] present successful results of EDM regarding administrative
issues of e-learning systems and the effects on students’ performance, factors that influence
the use and success of mobile learning, and critical challenges and factors to determine
students’ success in pandemic time. However, no study has yet reported the use of EDM to discover
patterns in students’ success in online education in South American countries. This work
poses the following research question: Can the use of data mining tools in education allow
the identification of student success patterns in e-learning programs? To answer it,
the primary goal of this article is to determine variables associated with student success in
e-learning programs by using the CRISP-DM methodology [7] with data from a university
in Chile, a developing country. This study analyses the causes of success or failure from
the set of student variables, considering their demographic and performance features.
The main contributions of this paper are the following:
• First, this article identifies potential variables for success or failure in e-learning
programs, not only academic factors, through a systematic literature review.
• Second, this article defines a repeatable data mining application for identifying stu-
dents’ success patterns in e-learning environments using a large set of data. This
analysis was not feasible with other methods.
• Third, this article provides an example of using multi-year historical data starting
when e-learning programs became a phenomenon in Chile and other countries
(the year 2000). Other institutions in the region could repeat this application.
The remainder of this paper is organized as follows. Section 2 defines the main
concepts of e-learning, CRISP-DM, the data mining process, and previous data mining in
education experiences. Section 3 describes the applied methodology and case study data:
we define the main steps of the data mining process, the data source, concepts, and expected
results. Section 4 then presents the results obtained to validate our hypothesis. Section 5
discusses the usefulness of this work in similar contexts, particularly for online educational
institutions and programs, concerning which variables are relevant to consider. The paper
concludes with a discussion of future work in Section 6.
2. e-Learning and Data Mining

The e-learning process is characterized by the recording of most of the traditional learning
process variables, which range from student entry data to the efficiency and ease of use of
the applied platforms [1]. The large volume of data in the e-learning process provides the
opportunity to analyze that data using knowledge discovery tools. The KDD (Knowledge
Discovery in Databases) process looks for hidden patterns in the large volumes of data that
information systems usually store. Those patterns can be high-value information for the
decision-making process in organizations [27]. Figure 1 [14,28] illustrates the KDD process. Data
mining is an essential KDD phase that applies algorithms to find hidden behavior patterns
in the data [29].
Figure 1. KDD methodology process.
The application of data mining techniques has two primary purposes: building models
and detecting patterns [30]. Model building seeks to produce a summary of the data set
that identifies and describes its main characteristics. Pattern detection seeks to identify small
deviations from the norm to detect unusual behavior patterns, through the discovery of patterns and
rules and through content searches. When it is not possible to build models for the data set, one
can look for behavior patterns. Pattern and rule discovery seeks frequent combinations and
associations of attributes found in database transactions (for example, products purchased
together). Techniques based on association rules usually address that issue.
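As an illustration of pattern and rule discovery, the following minimal Python sketch computes the support and confidence of a single association rule. The transactions, the item names, and the helper functions `support` and `confidence` are hypothetical and are not taken from the study.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


# Rule under study: {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Association rule algorithms typically keep only the rules whose support and confidence exceed user-chosen thresholds.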
2.1. CRISP-DM
The CRISP-DM method is one of the most efficient methodologies for developing projects
that apply data mining [31,32]. The objective of CRISP-DM is to allow different users to share
a common vocabulary, methodology, and tools in data mining activities. CRISP-DM
is organized into six phases, from general to specific tasks:
1. Business Understanding Phase: The first phase, the analysis of the problem, includes
understanding the project’s objectives and requirements from a business or
institutional perspective.
2. Data Comprehension Phase: The second phase, data analysis, includes the initial
data collection and the identification of data quality.
3. Data Preparation Phase: This phase includes general data selection tasks for applying
modeling techniques (variables and samples), data cleaning, generation of additional
variables, integration of different data sources, and format changes.
4. Modeling Phase: In this phase, selecting the most appropriate modeling techniques
takes place to generate and evaluate the model. The parameters used in the model
generation depend on the characteristics of the data.
5. Evaluation Phase: In the evaluation phase, the model is evaluated, not from the data
point of view, but against the problem’s success criteria. If the generated model
is valid based on the success criteria established in the first phase, the model is exploited.
6. Implementation Phase: At this stage, in addition to the implementation of the model,
the results must be presented and documented in an understandable way to increase
knowledge.
Figure 2 [33] illustrates CRISP-DM stages.
Figure 2. The CRISP-DM methodology process.
2.2. Data Mining Techniques

This research applied five techniques, naive Bayes, random forest, AdaBoost, decision
trees (J48), and neural networks, which are recognized as successful algorithms for
classification purposes [34,35]. Some of the main characteristics of those classifiers are:
• Naive Bayes: It is based on Bayes’ theorem with an assumption of independence
among predictors. It assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature. Naive Bayes is widely used in text
classification. It is mainly used for clustering and classification purposes,
relying on the conditional probability of each outcome [36].
• Random Forest: This classifier combines multiple prediction trees, each of which maintains a
hierarchical division of the underlying data space. The hierarchical division of the
data space creates partitions that are more skewed in terms of their
distribution of terms [37].
• AdaBoost: The name is short for Adaptive Boosting, and it is a meta-algorithm.
This algorithm maintains a distribution or set of weights over the training set. Initially,
all weights are set equally, but on each round, the weights of incorrectly classified samples
are increased so that the weak learner focuses on these samples. AdaBoost has the
ability to minimize the error and maximize the margin with respect to features [38].
• Decision Trees (J48): The J48 algorithm builds a decision tree that classifies the class
attribute based on the input attributes. The algorithm is based on the C4.5 algorithm
developed by Quinlan [39]. The algorithm uses a greedy search method to create
decision trees and allows changing different parameters to obtain a better classification
accuracy [40].
• Neural Networks: The development of the neural network uses a non-linear optimization
model. Unlike the results and parameters provided by other analyses, its output is not easy
to interpret clearly. The classifier used in this research is the multilayer perceptron,
which builds a neural network in the form of a cascade with one or more hidden layers [41].
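The re-weighting idea behind AdaBoost can be sketched in a few lines of Python. This is a minimal illustration of one round of the common discrete AdaBoost update, not the AdaBoostM1 implementation used in the study, which may differ in detail. The function name `adaboost_round` and the four-sample example are assumptions made for illustration.

```python
import math


def adaboost_round(weights, misclassified):
    """One discrete AdaBoost re-weighting step.

    weights       -- current sample weights (expected to sum to 1)
    misclassified -- booleans, True where the weak learner was wrong
    Returns (alpha, new_weights), where alpha is the weak learner's vote.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are scaled up, correct ones are scaled down.
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Four samples with equal initial weights; the weak learner misses the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [False, False, False, True])
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, which is why the next weak learner concentrates on it.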
2.3. Data Mining in Education

E-learning is the result of the adaptation and use of information technologies in
education [42–44]. The works of Wani et al. [45–47] highlight the importance of e-learning in
higher education using a virtual training environment for the development of professional
competencies. The success of e-learning systems is associated with different variables, such
as information technology and the program’s modality. Multiple studies exist that
evaluate e-learning technology platforms’ success, most of which used the information
systems success model of DeLone and McLean [4,48–52]. Despite the rapid growth of
e-learning, institutions face problems in teaching courses in that modality. The variables
that impact student success in e-learning are still unknown in countries such as Chile
and in many other developing countries. Tools such as data mining that
permit identifying behavioral data patterns could identify factors for e-learning success.
Higher distance education permits developing competencies regarding the current
market demands without geographical, economic, and time barriers [53,54]. In this way,
quality e-learning programs would contribute to developing countries such as Chile. For ex-
ample, Cidral et al. [55] reviewed the success of e-learning platforms in Brazil. That study
concluded that users’ satisfaction with program content quality and easy-to-use interfaces
are critical issues for success. Several authors have studied the success or failure of students
in e-learning programs, such as [56], who analyzed the causes of online university dropouts
in a systematic analysis of the literature; they defined a classification of the variables
involved: student, institution, teachers, media, and degree of social and academic integration.
Those authors indicate that knowledge of personal, social, and demographic characteristics
can be essential to predict student success and failure in e-learning programs.
By applying the data mining process of [20], Figure 3 shows our proposed model
to predict student academic performance. Decision tree and random forest are well-known
classification data mining techniques, whereas CHAID ID3 represents a version of the
CHi-square Automatic Interaction Detection algorithm, one of the oldest decision tree algorithms,
and RMS stands for Root Mean Square [34].
Figure 3. Proposed model to predict the most influential factors of students at risk.
3. Methodology
This study seeks to determine the success of the online learning modality provided by
the Distance Education Center of the Universidad Católica del Norte (DEC-UCN). It uses
data mining to support the case analysis methodology and to learn about the initial conditions
of students in educational programs.
This study worked with data on the admission and final results of DEC-UCN students
between 2000 and 2018. The total number of students was 12,264. The study stages were
developed from the CRISP-DM model to analyze the database information and apply the
corresponding tools. We used particular data mining techniques and algorithms, such as
decision trees, descriptive statistics, and neural networks. The computational tool used was
SPSS Statistics 22 [57]. The benefit of this tool is that it provides an easy understanding
of data mining decision making.
3.1. Institution Background

In the form of programs leading to a professional degree, the origin of DEC-UCN dates
back to 1982. The DEC-UCN was instituted in 1996. Located in Antofagasta, the capital city
of Chile’s second region, the DEC-UCN is under the office of the university’s academic vice-rector.
The pedagogical model of the DEC-UCN is in harmony with the PE-UCN design
(PEdagogical model at the Universidad Católica del Norte). In other words, it is a pedagogical
model that is accountable to the institutional mission, supported by education in human
values, based on training by competencies, and attentive to the constant changes
in society. This academic unit develops continuous training spaces through 100% online
programs, using an education model focused on technologies to support distance education.
One of DEC-UCN’s biggest problems is the lack of knowledge about the success variables
of students who choose to study in a distance program.
3.2. Selection and Understanding of Data

This study required the data of the students enrolled in the period between January
2000 and December 2018, stored on a local server (LICANCABUR) with restricted access.
This server provides services to the Oracle Developer 2000 database system called ANTEC,
the official database of the DEC-UCN. Other complementary data are stored in files and
printed matter managed by the managers of the different areas. We then obtained
18,610 DEC-UCN records, exported from the ANTEC database system through SQL (Structured
Query Language) queries that generated an Excel file with the requested data.
Regarding the data source, DEC-UCN offered online education mainly for technical majors
in the areas of management, business, and computer programming, with a record of 430,
558, and 296 students, respectively. Since the records span several years, some students
appear in more than one record.
In the ANTEC database system, we used the ANTEC browser to perform the necessary
SQL queries. We selected tables STUDENT, STUDENT_ADDRESS, PROGRAM,
and STUDENT_PROGRAM to obtain the personal data and the prior academic status that
a student reached in a given program. The excluded data do not affect the sample since
they did not contain reliable information for the investigation. Figure 4 presents an extract
of the relational model of the ANTEC system.
Figure 4. Excerpt from the relational model of the ANTEC database system.
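The kind of query used to combine the four tables can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the Oracle-based ANTEC system, and every column name (`rut`, `program_id`, `academic_status`, and so on), the join keys, and the single made-up row are hypothetical, since the actual schema is only shown in Figure 4.

```python
import sqlite3

# In-memory SQLite stand-in; column names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE STUDENT (rut TEXT PRIMARY KEY, sex TEXT, birth_date TEXT);
    CREATE TABLE STUDENT_ADDRESS (rut TEXT, commune TEXT, city TEXT);
    CREATE TABLE PROGRAM (program_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE STUDENT_PROGRAM (rut TEXT, program_id INTEGER,
                                  academic_status TEXT);
    INSERT INTO STUDENT VALUES ('1-9', 'F', '1980-05-01');
    INSERT INTO STUDENT_ADDRESS VALUES ('1-9', 'Antofagasta', 'Antofagasta');
    INSERT INTO PROGRAM VALUES (1, 'Example Program');
    INSERT INTO STUDENT_PROGRAM VALUES ('1-9', 1, 'Graduated');
    """
)

# Join personal data, address, and program outcome into one flat record.
query = """
    SELECT s.rut, s.sex, a.city, p.name, sp.academic_status
    FROM STUDENT s
    JOIN STUDENT_ADDRESS a ON a.rut = s.rut
    JOIN STUDENT_PROGRAM sp ON sp.rut = s.rut
    JOIN PROGRAM p ON p.program_id = sp.program_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Each resulting row corresponds to one student in one program, which is consistent with the observation that a student can appear in more than one record.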
3.3. Data Preparation

We applied the necessary changes to the files with the SPSS Statistics 22 tool due to its
presentation capabilities that make the result more understandable to the end-user.
1. Data selection and cleaning: first, we selected the data attributes, considering the
objective and data quality problems. As a result, we selected the following tables
and attributes: (i) STUDENT (RUT of the student, sex, date of birth, profession,
nationality, marital status, education); (ii) STUDENT_ADDRESS (RUT of the student,
commune, city); (iii) PROGRAM (name of the program, date of registration, type
of program); (iv) STUDENT_PROGRAM (RUT of the student, academic situation,
final grade of the program). For the analysis, we applied the following filter: students
who enrolled in the programs between 2000 and 2018, that is, (enrollment date >= 2000) and
(enrollment date <= 2018).
2. Data Quality: a problem presented by the data is the amount of missing values in some
attributes, such as the name of the program. We decided to keep the records with
unknown values because eliminating them would exclude rows with valid values
in the target set. Moreover, we categorized the data because the applied
techniques for the analysis (classification) mainly use categorical data to facilitate their
construction and interpretation.
3. Construction of the Data: to fulfill the project’s objective, an attribute called region was
created, which derives from the attribute commune and city. The region attribute takes
the value corresponding to the region that the commune and city belong to. When the
data selection and construction process was complete, the changes were saved to the
files for their use later in the modeling stage. The new files are in SPSS format.
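The preparation steps above can be sketched in Python as follows. The study performed them in SPSS, so this is only an illustration: the field names, the three made-up records, the `prepare` function, and the small `CITY_TO_REGION` lookup are assumptions.

```python
from datetime import date

# Hypothetical lookup from city to region (a tiny illustrative subset).
CITY_TO_REGION = {
    "Antofagasta": "Antofagasta",
    "Santiago": "Metropolitana",
    "Talca": "Maule",
}

# Made-up records; the field names are assumptions, not the ANTEC schema.
records = [
    {"rut": "1-9", "city": "Antofagasta", "enrollment_date": date(1999, 8, 1)},
    {"rut": "2-7", "city": "Santiago", "enrollment_date": date(2005, 3, 10)},
    {"rut": "3-5", "city": None, "enrollment_date": date(2018, 11, 2)},
]


def prepare(records):
    """Keep enrollments from 2000 to 2018 and derive a `region` attribute."""
    prepared = []
    for rec in records:
        if not 2000 <= rec["enrollment_date"].year <= 2018:
            continue
        # Unknown values become an explicit category; the row is not dropped.
        region = CITY_TO_REGION.get(rec["city"], "Unknown")
        prepared.append({**rec, "region": region})
    return prepared


prepared = prepare(records)
print([(r["rut"], r["region"]) for r in prepared])
```

Keeping an explicit "Unknown" category mirrors the decision to retain records with missing values rather than lose the valid attributes they carry.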
3.4. Modeling
We carried out the data modeling for the DEC-UCN at a global level. In this study,
the classification model predicts the student profile associated with success in programs
with an online learning modality. Considering the research and results of [58–60], we
applied the AdaBoostM1 and tree J.48 algorithms along with the naive Bayes and random
forest algorithms for classification. The classification model takes as its dependent variable
“state”, which is a categorical variable, and the category “Graduated” as the highest level
of success of a student in the model. From this perspective, academic success is measured
by the students in the category “Graduated” from a program they started.
For the formulation of the model, we applied neural networks to identify relationships
between the variables and determine their importance concerning the target variable.
For constructing the decision trees, we initially used two algorithms: (i) the C5.0 algorithm
that presents rules that allow a clearer understanding of the generative partitioning; (ii) the
CHAID algorithm that, from a statistical point of view (based on the significance of the
chi-square test), constructs the trees by comparing the categories and merging those that do
not present significant differences in their results. Subsequently, a decision tree algorithm is selected
based on the results obtained (case prediction) and the analysis of the construction of the
tree itself.
To assess prediction quality, the study built a confusion
matrix for each algorithm, which was necessary to calculate the metrics of Precision, Recall,
F1, Accuracy, and the Matthews correlation coefficient. Table 1 defines the procedure and
characteristics of those measures [61].
Table 1. Data mining metrics.
Metric: Definition
Precision: It is used to measure the positive patterns that are correctly predicted from the total predicted patterns in a positive class [62].
Recall: It measures the fraction of positive patterns that are correctly classified [62].
Accuracy: It measures the ratio of correct predictions over the total number of instances evaluated [63].
F1: Metric that represents the harmonic mean between recall and precision values [41].
Matthews correlation coefficient (MCC): Measure that is not affected by the dataset problem of being unbalanced. MCC is a correlation coefficient between observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than a random prediction, and −1 indicates complete disagreement between prediction and observation [64].
ROC curve: It is a graphic representation of the relationship between the true-positive and false-positive ratios of the classifier. The area under the ROC curve (AUC) provides an approach to evaluate which model is better on average. A model will be considered to discriminate better than chance if the curve lies above the diagonal of no discrimination, i.e., if the AUC is higher than 0.5 [65].
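The metrics in Table 1 follow directly from the cells of a binary confusion matrix. The sketch below shows the standard formulas. The function name `binary_metrics` and the four counts are hypothetical and were chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.7 while the MCC is only about 0.41, which illustrates why MCC is reported alongside accuracy when classes are unbalanced.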
4. Results
The statistical results are the behavior patterns that influence students’ success and
failure in the online learning modality. The programs with the largest number of students
are Human Resources Management, Environmental Management, Family Mediation,
Psychopedagogy, Total Quality Management, Integrated Management, and Educational
Orientation (see Table 2).
Table 2. List of educational programs with the largest number of students.
Program Frequency Percentage

Human Resources Management 1996 10.7
Environmental Management 1617 8.7
Family Mediation 1506 8.1
Psychopedagogy 1499 8.0
Total Quality Management 1169 6.3
Integrated Management: Quality, Environment, and Safety 1009 5.4
Educational Orientation 894 4.8
Higher Education 648 3.5
Education and Professional Technical High School Teacher 545 2.9
Primary Education Teacher with a Minor in NB1 and NB2 509 2.7
Family Counseling 507 2.7
Behavioral Management Techniques Applied to Children and Adolescents 499 2.7
Education and Primary Education Teacher 467 2.5
Criminal Procedural Law: “Accusatory System or Oral Trial” 446 2.4
Communication and Language Disorder 418 2.2
Educational Administration 402 2.2
Administration of Technical-Pedagogical Units 388 2.1
Preparation and Evaluation of Investment Projects 326 1.7
Minor in Language and Communication for Teachers of the Second Cycle of Language and Communication 292 1.6
Minor in Education in Mathematics for Teachers of the Second Cycle of Basic General Education 247 1.3
Degree in Education and Primary Education Teacher 237 1.3
Management in Corporate Communication 179 1.0
Continuous Improvement 173 0.9
Higher level in Executive Secretariat 154 0.8
Mathematics Education for Primary Education Teachers 145 0.8
Pedagogical Management for Higher Level Technical Training 129 0.7
Formulation and evaluation of projects 98 0.5
Others 2180 11.6
Initially, we present the analysis of the data using decision trees. This analysis shows
that the first level of the tree identified the variable “type of programs” as the main predictor
of student success at the DEC-UCN, from left to right, from nodes 20 to 22. The type of
program with the highest percentage of graduates is Continuous Improvement. With a
p-value of 0.001, a chi-square of 66.4, and 8 degrees of freedom (df), we can observe that
students from the Metropolitan, Magallanes, Tarapacá, and Bio Bio regions obtained
the highest percentage of graduated students with 76.8%, followed by the Aysén and Los
Lagos regions with 67.9%. Figure 5 illustrates the mentioned results.
The second type of program with the highest percentage of graduate students is
Training and Technical Courses, with 53.8%. In this type of program, students with non-university
professions obtained a larger share of qualifications than students with university professions
from the art and health science areas. Students prefer this type of program due to its high
graduation percentage compared to professional and technical courses (see Figure 6).
Figure 5. Decision Tree I: Program with the highest percentage of graduates.
Figure 6. Decision Tree II: Second program with the highest percentage of graduates.
Students with the highest percentage of elimination belong to undergraduate degree
programs. Figure 7 depicts a total of 853 students in undergraduate programs. With a
p-value of 0.002, a chi-square of 22.251, and 3 degrees of freedom (df), students who
completed primary and secondary school but did not complete professional
or higher-level technician studies show the highest elimination trend: 362 of 454, that is,
79.7% of this group. In the same context, students who completed higher education studies
present an elimination trend of 66.2%, 264 of 399 students.
Figure 7. Decision Tree III: Program with the highest percentage of students eliminated.
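The chi-square test that drives CHAID splits can be illustrated with the counts quoted above. The sketch below is a simplification: it collapses the outcome into eliminated versus not eliminated, so it yields a 2 x 2 table with one degree of freedom and is not expected to reproduce the reported statistic of 22.251 with 3 degrees of freedom, which was presumably computed over more outcome categories. The function name `chi_square` is an assumption.

```python
# Observed counts from the text: rows are prior-education groups,
# columns are (eliminated, not eliminated).
observed = [
    [362, 454 - 362],  # no completed professional or technical studies
    [264, 399 - 264],  # completed higher education studies
]


def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


elimination_rates = [row[0] / sum(row) for row in observed]
statistic = chi_square(observed)
print([f"{rate:.1%}" for rate in elimination_rates], round(statistic, 2))
```

The two elimination rates match the 79.7% and 66.2% quoted in the text, and the large statistic indicates that the difference between the two groups is unlikely to be due to chance.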
The analysis of the data using neural networks gave additional information about
the main variables that could predict the success of students in e-learning programs. Table 3
shows that, when using neural networks to identify the variables, the analysis achieves
60.8% correct predictions. The model in Figure 8 showed that the determinant factors for
academic success for all programs, from the highest to the lowest, are age, program code,
profession, scholarity, type of program, region, and finally sex, according to the student’s
final academic situation in the most successful programs, which gives a reasonable first
approximation regarding the topic.
The study compared the performance of
the classification algorithms used, as defined in the research model. These results indicated
that AdaBoostM1 and naive Bayes were the algorithms with the lowest performance.
Table 4 shows that their precision, recall, and F-measure indicators were comparatively low.
The AdaBoostM1 algorithm achieved a correct classification of 62.15% compared to naive
Bayes, with 61.7%. Their MCC values are also close to zero (0.118 and 0.007), so their
prediction is not much better than chance. Their ROC values are also quite close to 0.5,
which is not an indication of good prediction. The tree J.48 and random forest algorithms
had the best results. The random forest algorithm stands out as the one with the best
result, with 64.5% of the instances correctly classified, achieving the best prediction of
graduate students. In addition, this algorithm obtains the best MCC value, indicating a
better relationship between the observed data and the prediction. Its ROC value of 0.652,
well above the rest of the algorithms, is also a sign of its good performance.
Table 3. Importance grade of variables in the global program type using neural networks.
Sample    Observed            Predicted
                                   Act     Rem     Abn   Trans    Grad    Cert   Correct %
Training  Active (Act)               0       2       0       0       0      34        0.0%
          Removed (Rem)              0    6102       0       0       0    1140       84.3%
          Abnegated (Abn)            0     426       0       0       0      20        0.0%
          Transferred (Trans)        0      17       0       0       0       3        0.0%
          Graduated (Grad)           0     302       0       0       0     201        0.0%
          Certified (Cert)           0    2906       0       0       0    1720       37.2%
          Overall %               0.0%   75.8%    0.0%    0.0%    0.0%   24.2%       60.8%
Testing   Active (Act)               0       2       0       0       0      16        0.0%
          Removed (Rem)              0    2565       0       0       0     479       84.3%
          Abnegated (Abn)            0     181       0       0       0       8        0.0%
          Transferred (Trans)        0      13       0       0       0       0        0.0%
          Graduated (Grad)           0     103       0       0       0      97        0.0%
          Certified (Cert)           0    1300       0       0       0     753       36.7%
          Overall %               0.0%   75.0%    0.0%    0.0%    0.0%   24.5%       60.2%
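The percentages in Table 3 follow directly from the counts. As a minimal sketch (not part of the original analysis, and assuming that rows are observed classes and columns are predicted classes), the following Python code recomputes the "Correct %" column, the "Overall %" row, and the overall accuracy for the training sample. The function names are illustrative only.

```python
# Training-sample counts copied from Table 3.
# Assumption: rows = observed class, columns = predicted class.
LABELS = ["Act", "Rem", "Abn", "Trans", "Grad", "Cert"]
TRAINING = [
    [0, 2, 0, 0, 0, 34],       # Active
    [0, 6102, 0, 0, 0, 1140],  # Removed
    [0, 426, 0, 0, 0, 20],     # Abnegated
    [0, 17, 0, 0, 0, 3],       # Transferred
    [0, 302, 0, 0, 0, 201],    # Graduated
    [0, 2906, 0, 0, 0, 1720],  # Certified
]


def per_class_recall(matrix):
    """Share of each observed class predicted correctly (the 'Correct %' column)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def predicted_share(matrix):
    """Share of all cases assigned to each predicted class (the 'Overall %' row)."""
    total = sum(sum(row) for row in matrix)
    return [sum(row[j] for row in matrix) / total for j in range(len(matrix[0]))]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all cases."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


recall = per_class_recall(TRAINING)
shares = predicted_share(TRAINING)
accuracy = overall_accuracy(TRAINING)

for label, r, s in zip(LABELS, recall, shares):
    print(f"{label:5s} correct={r:6.1%}  predicted share={s:6.1%}")
print(f"overall accuracy={accuracy:.1%}")
```

Read this way, the table shows that the network only ever predicts the Removed and Certified classes, so the other four classes have 0.0% correct predictions regardless of their size.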
Table 4. Classification results of applied algorithms.
Algorithm      TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area
AdaBoostM1     0.622    0.532    0.593      0.622   0.574      0.118  0.543     0.559
Naïve Bayes    0.617    0.616    0.548      0.617   0.475      0.007  0.535     0.554
Random Forest  0.645    0.435    0.634      0.645   0.636      0.222  0.652     0.658
TREE J.48      0.643    0.494    0.625      0.643   0.607      0.184  0.604     0.609
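The comparison drawn from Table 4 can be reproduced with a few lines of code. The sketch below (an illustration, not the authors' procedure) copies three columns from Table 4, ranks the algorithms by MCC, and reports how far each ROC area lies above 0.5, the value expected from random guessing. The dictionary keys and the helper `rank_by` are hypothetical names chosen for this example.

```python
# Values copied from Table 4 (TP rate, MCC, and ROC area only).
TABLE_4 = {
    "AdaBoostM1":    {"tp_rate": 0.622, "mcc": 0.118, "roc_area": 0.543},
    "Naive Bayes":   {"tp_rate": 0.617, "mcc": 0.007, "roc_area": 0.535},
    "Random Forest": {"tp_rate": 0.645, "mcc": 0.222, "roc_area": 0.652},
    "TREE J.48":     {"tp_rate": 0.643, "mcc": 0.184, "roc_area": 0.604},
}

CHANCE_ROC = 0.5  # ROC area of a classifier that guesses at random


def rank_by(metric, table=TABLE_4):
    """Return algorithm names sorted from best to worst on the given metric."""
    return sorted(table, key=lambda name: table[name][metric], reverse=True)


ranking = rank_by("mcc")
roc_margin = {
    name: round(metrics["roc_area"] - CHANCE_ROC, 3)
    for name, metrics in TABLE_4.items()
}

for name in ranking:
    print(f"{name:14s} MCC={TABLE_4[name]['mcc']:.3f}  ROC above chance={roc_margin[name]:.3f}")
```

Ranking by MCC, by ROC area, or by TP rate gives the same order here: random forest, tree J.48, AdaBoostM1, naive Bayes.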
Figure 8. Determinant factors for academic success.

5. Discussion
Advances in technology now permit the massive application of data mining. As
Soria-Barreto et al. [66] remark, computing tools and technologies make e-learning more
effective. In this research, we aimed to identify the variables behind student success or
failure in the e-learning programs at DEC-UCN by applying the CRISP-DM methodology,
which is one of the most widely used methodologies in this research field. Using decision
tree and neural network techniques, we identified factors that determined student success
in online programs. Those results contribute to a greater understanding of these factors
in the current context of distance education in Chile. Our study identified the types of
programs with the greatest success in terms of the student’s final academic situation and
the programs with the greatest failure. The programs with the greatest failure are
undergraduate and bachelor degrees, which require more time and dedication to complete.
The number of programs without a degree continues to increase because of their
short-term nature.
This study is highly relevant for e-learning programs because its data come from the
database of the oldest online program in Chile. The database contained student records
from 2000 to 2018 inclusive; that is, 18,610 records in nineteen years. Our results
identified variables that determine the success and failure of students: student success
and failure largely depend on age, sex, previous education, job, and region.
Understanding each program’s academic success factors is decisive for selecting students
and disseminating the programs. These results support the organization’s know-how to
establish policies for promoting programs and retaining students in online learning
modalities. The variables found are relevant for online education in Chile and
neighboring countries because educational institutions can consider them when organizing
their programs.
6. Conclusions, Recommendation, and Future Work

This study showed that data mining techniques are essential for discovering patterns in
educational data. Educational institutions can apply the described data mining
techniques to analyze their own data. Because we used data from a university in a
developing country, institutions in countries such as Chile could use results from the
presented techniques. From them, we draw the following main conclusions:
• The use of educational data mining, particularly the CRISP-DM methodology, greatly
contributes to systematization and efficiency in identifying patterns in distance
education data. The study allowed us to systematize data held in various sources
and formats on the distance education platform of the institution under study
(DEC-UCN) and to provide valuable information for future analyses in this context.
• Data mining tools can offer advantages over purely statistical tools because they
are exploratory and allow work on different dimensions of the same problem. These
analysis tools are also flexible enough to handle categorical and numerical
variables in the same analysis.
• The performance analysis of the different classification algorithms indicated that
the random forest and decision tree algorithms gave the best prediction of results
and, therefore, identified the variables that could better explain the performance
of students in e-learning programs. The decision tree proved to be a beneficial tool
for finding relationships between variables that previously used analysis tools had
not identified, mainly because a decision tree uses techniques less restrictive than
statistics. Those techniques do not require, for example, conditions of data
normality and are tolerant of noise in the data.
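The last two conclusions can be illustrated with the split criterion that decision trees build on. The sketch below is a simplified illustration on hypothetical toy records, not the DEC-UCN data and not the authors' implementation. It computes information gain for one categorical split and one numerical threshold split. C4.5-style trees such as J48 normalize this quantity into a gain ratio, which is omitted here for brevity. Nothing in the computation assumes normally distributed data.

```python
from collections import Counter
from math import log2

# Hypothetical toy records, invented for illustration only:
# (program type, age, final academic situation)
RECORDS = [
    ("degree", 24, "removed"),
    ("degree", 31, "removed"),
    ("degree", 45, "certified"),
    ("diploma", 29, "removed"),
    ("diploma", 38, "certified"),
    ("diploma", 52, "certified"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, split):
    """Entropy reduction obtained by grouping records on split(record)."""
    groups = {}
    for record in records:
        groups.setdefault(split(record), []).append(record[-1])
    parent = entropy([record[-1] for record in records])
    children = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return parent - children


# Categorical variable: group by program type.
gain_program = information_gain(RECORDS, lambda r: r[0])
# Numerical variable: group by an assumed age threshold of 35.
gain_age = information_gain(RECORDS, lambda r: r[1] <= 35)

print(f"gain(program type) = {gain_program:.3f}")
print(f"gain(age <= 35)    = {gain_age:.3f}")
```

In this toy sample the age threshold separates the two outcomes completely, so a tree would choose it as the first split. The same function handles the categorical and the numerical variable without any change.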
For DEC-UCN, the study results will allow the organization to focus its admission efforts
on retaining students who are potentially more exposed and prone to dropping out. We are
currently applying big data techniques and data mining for pattern discovery to compare
their results and determine the best approach.
Author Contributions: Formal analysis, M.T.-Y. and J.M.R.; Investigation, A.S., C.V.-S. and G.M.;
Data curation, M.T.-Y. and J.M.R.; Project administration, A.S. All authors have read and agreed to
the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data is part of the UCN database.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Coman, C.; Țîru, L.G.; Meseșan-Schmitz, L.; Stanciu, C.; Bularca, M.C. Online teaching and learning in higher education during
the coronavirus pandemic: Students’ perspective. Sustainability 2020, 12, 10367. [CrossRef]
2. Koedinger, K.R.; D’Mello, S.; McLaughlin, E.A.; Pardos, Z.A.; Rosé, C.P. Data mining and education. WIREs Cogn. Sci. 2015, 6,
333–353.
3. Asín, A.; Peinado, J.; Jurado, P. La sociedad del conocimiento y las TICs: Una inmejorable oportunidad para el cambio docente.
In Pixel-Bit: Revista de Medios y Educación Nº 34; Universidad de Sevilla: Seville, Spain, 2009; pp. 179–204, ISSN 1133-8482.
4. Delone, W.H.; McLean, E.R. The DeLone and McLean Model of Information Systems Success: A Ten-Year Update. J. Manag. Inf.
Syst. 2003, 19, 9–30.
5. Alsabawy, A.; Cater-Steel, A.; Soar, J. A Model to Measure E-Learning Systems Success. Meas. Organ. Inf. Syst. Success New
Technol. Pract. 2012, 39, 293–317. [CrossRef]
6. Herrera, M.; Ruiz, S.; Romagnano, M.R.; Ganga, L.; Lund, M.I.; Torres, E. Aplicando métodos y técnicas de la ciencia de los
datos a datos universitarios. In Proceedings of the XXI Workshop de Investigadores en Ciencias de la Computación (WICC 2019),
Universidad Nacional de San Juan, San Jose, Argentina, 21 October 2019.
7. Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández Orallo, J.; Kull, M.; Lachiche, N.; Ramírez Quintana, M.J.;
Flach, P.A. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Trans. Knowl. Data
Eng. 2019, 33, 3048–3061. [CrossRef]
8. Hussin, W.N.T.W.; Harun, J.; Shukor, N.A. A Review on the Classification of Students’ Interaction in Online Social Collaborative
Problem-based Learning Environment: How Can We Enhance the Students’ Online Interaction? Univ. J. Educ. Res. 2019,
7, 125–134. [CrossRef]
9. Fukuzawa, S.; Cahn, J. Technology in problem-based learning: Helpful or hindrance? Int. J. Inf. Learn. Technol. 2019, 36, 66–76.
[CrossRef]
10. Valverde-Berrocoso, J.; Garrido-Arroyo, M.d.C.; Burgos-Videla, C.; Morales-Cevallos, M.B. Trends in educational research about
e-learning: A systematic literature review (2009–2018). Sustainability 2020, 12, 5153. [CrossRef]
11. Ocaña, J.M.; Morales-Urrutia, E.K.; Pérez-Marín, D.; Pizarro, C. Can a learning companion be used to continue teaching
programming to children even during the COVID-19 pandemic? IEEE Access 2020, 8, 157840–157861. [CrossRef]
12. Palacios, C.A.; Reyes-Suárez, J.A.; Bearzotti, L.A.; Leiva, V.; Marchant, C. Knowledge Discovery for Higher Education Student
Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile. Entropy 2021, 23, 485. [CrossRef]
13. Gao, P.; Wu, W.; Yang, Y. Discovering Themes and Trends in Digital Transformation and Innovation Research. J. Theor. Appl.
Electron. Commer. Res. 2022, 17, 1162–1184. [CrossRef]
14. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. From data mining to knowledge discovery in databases. AI Mag. 1996, 17, 37–54.
15. Nájera, A.B.U.; de la Calleja Mora, J. Brief review of educational applications using data mining and machine learning. Redie. Rev.
Electrón. De Investig. Educ. 2017, 19, 84–96.
16. Cummins, M.R. Nonhypothesis-driven research: Data mining and knowledge discovery. In Clinical Research Informatics; Springer:
Berlin/Heidelberg, Germany, 2019; pp. 341–356.
17. Sugiyarti, E.; Jasmi, K.A.; Basiron, B.; Huda, M.; Shankar, K.; Maseleno, A. Decision support system of scholarship grantee
selection using data mining. Int. J. Pure Appl. Math. 2018, 119, 2239–2249.
18. Witten, I.H.; Frank, E. Data mining: Practical machine learning tools and techniques with Java implementations. ACM SIGMOD Rec.
2002, 31, 76–77. [CrossRef]
19. Ngo, T. Data mining: Practical machine learning tools and techniques, by Ian H. Witten, Eibe Frank, Mark A. Hall. ACM SIGSOFT
Softw. Eng. Notes 2011, 36, 51–52. [CrossRef]
20. Scheuer, O.; McLaren, B.M. Educational data mining. Encycl. Sci. Learn. 2012, 1075–1079.
21. Hernández-Blanco, A.; Herrera-Flores, B.; Tomás, D.; Navarro-Colorado, B. A systematic review of deep learning approaches to
educational data mining. Complexity 2019, 2019, 1306039. [CrossRef]
22. Cengiz, M.; Birant, K.U.; Yildirim, P.; Birant, D. Development of an interactive game-based learning environment to teach data
mining. Int. J. Eng. Educ. 2017, 33, 1598–1617.
23. Almaiah, M.A.; Almulhem, A. A conceptual framework for determining the success factors of e-learning system implementation
using Delphi technique. J. Theor. Appl. Inf. Technol. 2018, 96, 5962–5976.
24. Almaiah, M.A.; Alyoussef, I.Y. Analysis of the effect of course design, course content support, course assessment and instructor
characteristics on the actual use of E-learning system. IEEE Access 2019, 7, 171907–171922. [CrossRef]
25. Almaiah, M.A.; Alismaiel, O.A. Examination of factors influencing the use of mobile learning system: An empirical study. Educ.
Inf. Technol. 2019, 24, 885–909. [CrossRef]
26. Almaiah, M.A.; Al-Khasawneh, A.; Althunibat, A. Exploring the critical challenges and factors influencing the E-learning system
usage during COVID-19 pandemic. Educ. Inf. Technol. 2020, 25, 5261–5280. [CrossRef] [PubMed]
27. Hendrickx, T.; Cule, B.; Meysman, P.; Naulaerts, S.; Laukens, K.; Goethals, B. Mining Association Rules in Graphs Based on
Frequent Cohesive Itemsets. In Proceedings of the Advances in Knowledge Discovery and Data Mining; Cao, T., Lim, E.P., Zhou, Z.H.,
Ho, T.B., Cheung, D., Motoda, H., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 637–648.
28. Moro, S.; Cortez, P.; Laureano, R. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology;
EUROSIS-ETI: Ostend, Belgium, 2011.
29. Ghazal, M.M.; Hammad, A. Application of knowledge discovery in database (KDD) techniques in cost overrun of construction
projects. Int. J. Constr. Manag. 2022, 22, 1632–1646. [CrossRef]
30. Hand, D.J.; Smyth, P.; Mannila, H. Principles of Data Mining; MIT Press: Cambridge, MA, USA, 2001.
31. Dåderman, A.; Rosander, S. Evaluating Frameworks for Implementing Machine Learning in Signal Processing: A Comparative Study of
CRISP-DM, SEMMA and KDD; KTH, School of Electrical Engineering and Computer Science (EECS): Stockholm, Sweden, 2018.
32. Wiemer, H.; Drowatzky, L.; Ihlenfeldt, S. Data Mining Methodology for Engineering Applications (DMME)—A Holistic Extension
to the CRISP-DM Model. Appl. Sci. 2019, 9, 2407. [CrossRef]
33. Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International
Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1,
pp. 29–39.
34. Phyu, T.N. Survey of classification techniques in data mining. In Proceedings of the International Multiconference of Engineers
and Computer Scientists, London, UK, 1–3 July 2009; Volume 1.
35. Soofi, A.A.; Awan, A. Classification techniques in machine learning: Applications and issues. J. Basic Appl. Sci. 2017, 13, 459–465.
[CrossRef]
36. Mahesh, B. Machine learning algorithms-a review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386.
37. Phan, T.N.; Kuch, V.; Lehnert, L.W. Land Cover Classification using Google Earth Engine and Random Forest Classifier—The
Role of Image Composition. Remote Sens. 2020, 12, 2411. [CrossRef]
38. Hameed, K.; Chai, D.; Rassau, A. A sample weight and adaboost cnn-based coarse to fine classification of fruit and vegetables at
a supermarket self-checkout. Appl. Sci. 2020, 10, 8667. [CrossRef]
39. Quinlan, J. C4.5: Programs for Machine Learning; Ebrary online; Elsevier Science: Amsterdam, The Netherlands, 2014.
40. Badawi, S.A.Q.; Takruri, M.; Albadawi, Y.; Khattak, M.A.K.; Nileshwar, A.K.; Mosalam, E. Four Severity Levels for Grading the
Tortuosity of a Retinal Fundus Image. J. Imaging 2022, 8, 258. [CrossRef]
41. Chaves, L.; Marques, G. Data mining techniques for early diagnosis of diabetes: A comparative study. Appl. Sci. 2021, 11, 2218.
[CrossRef]
42. Martínez-Cerdá, J.F.; Torrent-Sellens, J.; González-González, I. Socio-technical e-learning innovation and ways of learning in the
ICT-space-time continuum to improve the employability skills of adults. Comput. Hum. Behav. 2020, 107, 105753. [CrossRef]
43. Pozón-López, I.; Kalinic, Z.; Higueras-Castillo, E.; Liébana-Cabanillas, F. A multi-analytical approach to modeling of customer
satisfaction and intention to use in Massive Open Online Courses (MOOC). Interact. Learn. Environ. 2020, 28, 1003–1021.
[CrossRef]
44. Gilar-Corbi, R.; Pozo-Rico, T.; Castejón, J.L. Desarrollando la Inteligencia Emocional en Educación Superior: Evaluación de la Efectividad
de un Programa en tres Países; Universidad Nacional de Educación a Distancia (España): Madrid, Spain, 2019.
45. Wani, H.A. The relevance of e-learning in higher education. ATIKAN 2013, 3.
46. Meskhi, B.; Ponomareva, S.; Ugnich, E. E-learning in higher inclusive education: Needs, opportunities and limitations. Int. J.
Educ. Manag. 2019, 33, 424–437. [CrossRef]
47. Saqr, M.; Alamro, A. The role of social network analysis as a learning analytics tool in online problem based learning. BMC Med.
Educ. 2019, 19, 160. [CrossRef] [PubMed]
48. Al-Fraihat, D.; Joy, M.; Sinclair, J. Evaluating E-learning systems success: An empirical study. Comput. Hum. Behav. 2020,
102, 67–86. [CrossRef]
49. Romi, I.M. A Model for e-Learning Systems Success: Systems, Determinants, and Performance; Palestine Polytechnic University:
Hebron, Palestine, 2017.
50. Hayashi, A.; Chen, C.; Ryan, T.; Wu, J. The role of social presence and moderating role of computer self efficacy in predicting the
continuance usage of e-learning systems. J. Inf. Syst. Educ. 2020, 15, 5.
51. Damabi, M.; Firoozbakht, M.; Ahmadyan, A. A Model for Customers Satisfaction and Trust for Mobile Banking Using DeLone
and McLean Model of Information Systems Success. J. Soft Comput. Decis. Support Syst. 2018, 5, 21–28.
52. Donovan, E.; Guzman, I.R.; Adya, M.; Wang, W. A Cloud Update of the DeLone and McLean Model of Information Systems
Success. J. Inf. Technol. Manag. 2018, 29, 23–34.
53. Németh, T. How to back up Modules with blended learning The e-Learning platform of FAME. Prosperitas 2019, 6, 102–141.
[CrossRef]
54. Radha, S.; Michael Mariadhas, J.; Subramani, A.; Akbar Jan, N. Role of e-learning and digital media resources in employability of
management students. Online J. Distance Educ. e-Learn. 2019, 7, 116–123.
55. Cidral, W.A.; Oliveira, T.; Di Felice, M.; Aparicio, M. E-learning success determinants: Brazilian empirical study. Comput. Educ.
2018, 122, 273–290. [CrossRef]
56. García Aretio, L. El problema del abandono en estudios a distancia. Respuestas desde el Diálogo Didáctico Mediado. RIED. Rev.
Iberoam. De Educ. Distancia 2019, 22, 245–270. [CrossRef]
57. Weinberg, S.L.; Abramowitz, S.K. Statistics Using IBM SPSS: An Integrative Approach, 3rd ed.; Cambridge University Press:
Cambridge, UK, 2016.
58. Li, M.; Xu, H.; Deng, Y. Evidential Decision Tree Based on Belief Entropy. Entropy 2019, 21, 897. [CrossRef]
59. Zhao, L.; Lee, S.; Jeong, S.P. Decision Tree Application to Classification Problems with Boosting Algorithm. Electronics 2021,
10, 1903. [CrossRef]
60. Chiu, Y.P. Social Recommendations for Facebook Brand Pages. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 71–84. [CrossRef]
61. Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag.
Process 2015, 5, 1.
62. Nhu, V.H.; Janizadeh, S.; Avand, M.; Chen, W.; Farzin, M.; Omidvar, E.; Shirzadi, A.; Shahabi, H.; Clague, J.; Jaafari, A.; et al.
Gis-based gully erosion susceptibility mapping: A comparison of computational ensemble data mining models. Appl. Sci. 2020,
10, 2039. [CrossRef]
63. Tsiakmaki, M.; Kostopoulos, G.; Kotsiantis, S.; Ragos, O. Implementing AutoML in educational data mining for prediction tasks.
Appl. Sci. 2019, 10, 90. [CrossRef]
64. Chicco, D.; Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection
fraction alone. BMC Med. Inform. Decis. Mak. 2020, 20, 1–16. [CrossRef] [PubMed]
65. Jiménez-Valverde, A. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure
in species distribution modelling. Glob. Ecol. Biogeogr. 2012, 21, 498–507. [CrossRef]
66. Soria-Barreto, K.; Ruiz-Campo, S.; Al-Adwan, A.S.; Zuniga-Jara, S. University students intention to continue using online learning
tools and technologies: An international comparison. Sustainability 2021, 13, 13813. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.

Sustainability 2023, 15, 895. https://doi.org/10.3390/su15020895 https://www.mdpi.com/journal/sustainability
