Article in Computers & Electrical Engineering, February 2021
DOI: 10.1016/j.compeleceng.2020.106903


PREPRINT
Enhancing Prediction of Student Success: Automated Machine Learning Approach

Hassan Zeineddine, PhD*, Udo Braendle, PhD*, Assaad Farah, PhD*

*AUD/IBM Center of Excellence for Smarter Logistics, American University in Dubai, Dubai, PO Box 28282, UAE

Abstract

Students’ success has recently become a primary strategic objective for most institutions of higher education. With

budget cuts and increasing operational costs, academic institutions are paying more attention to sustaining students’

enrollment in their programs without compromising rigor and quality of education. With the scientific advancements

in Big Data Analytics and Machine Learning, universities are increasingly relying on data to predict students’

performance. Many initiatives and research projects have addressed the use of students’ behavioral and academic data to

classify students and predict their future performance using advanced statistics and Machine Learning. To allow for

early intervention, this paper proposes the use of Automated Machine Learning to enhance the accuracy of predicting

student performance using data available prior to the start of the academic program.

Key words: Automated Machine Learning, Prediction Accuracy, Student Performance, Pre-Admission Data,

Ensemble Model, Higher Education.

1. Introduction

Student retention is a pressing issue for academic institutions around the globe, given tight budgets and limited

resources [1]. The average dropout rate in Organization for Economic Co-operation and Development (OECD)

countries is around 45% [2]. Accordingly, higher education establishments are creating and setting intervention

strategies to remedy this problem. Researchers and practitioners agree that such strategies are most effective if applied

in a student’s first year of study. Hence, a lot of focus has been placed on predicting, as early as possible, vulnerable

students who are prone to drop their courses [2, 3].

Recently, predictive analysis has relied on Machine Learning to support business decision-making. Applications

in finance, operations and risk management are good attestations of the relevance of Machine Learning research in

various business functions. Evermann et al., for example, used machine learning to predict business process

performance [4], and Carneiro et al. to spot credit-card fraud [5].

More and more, Machine Learning is used in the field of higher education management. Specifically, there has

been an increased interest in adopting Machine Learning to predict student performance and identify students at risk

based on initial data gathered during their years of study, as surveyed in the work of Miguéis et al. [6]. Less work has addressed the prediction of student performance using data available prior to students starting their academic journey [2, 3].

Given the complexity of choosing an optimal prediction model for a given dataset from a wide pool of predictive

methods and different hyper-parameter values per model, the automation of this process can help increase the

prediction accuracy [7, 8, 9]. In relation to that, Automated Machine Learning (AutoML) is a technique meant to

derive the best classification model and corresponding hyper-parameters for a given decision-making problem. This

technique can add value if used in predicting student performance. Yet, the review of the literature in this area shows

a lack of empirical work using AutoML. Our research work relies on AutoML to help increase the accuracy of

predicting student performance using data available upon entering an academic program.

The rest of the paper is organized as follows. Section 2 will discuss the Theoretical Background. Section 3 presents

the Methodology and Section 4 highlights the results. Section 5 concludes.

2. Theoretical Background

The topic of predicting student performance in academic institutions has attracted the attention of researchers and

academic administrators for the past two decades [10]. The literature is mainly focusing on two fronts: identifying the

most critical attributes for predicting student performance and finding the best prediction method for enhancing the

prediction accuracy [2, 6, 10, 11, 12, 13, 14].

In relation to identifying critical attributes, several factors may affect a student’s performance such as social and

economic standing, psychological elements, demographics, school systems, and social networks [15]. Reviews of the

common attributes used in predicting student performance discussed several factors and categorized them as either

internal or external [16]. Attributes such as assignment marks, quizzes, class tests and attendance are classified as

internal assessment [17]. Several papers have also used cumulative grade point average (CGPA) as their main internal

attributes to assess student performance [16]. In terms of external assessment, one needs to mention student

demographics such as gender, age, family background, special needs, etc. [18]. Other popular external attributes are

socio-demographic characteristics, extra-curricular activities, high school background and social interaction network

[18, 19, 20]. Several researchers have also used psychometric factors such as personal interest, study habits, and family

support [21, 22].

Several machine learning methods have been used in the literature to predict student performance, mainly Logistic

Regression, Decision Tree, Artificial Neural Network, Naive Bayes, K-Nearest Neighbor, Support Vector Machines,

and different Ensemble methods. The next paragraphs discuss the use of these methods in predicting student

performance and focus on the prediction accuracy.

2.1. Logistic Regression

Regression methods for predicting student performance use a finite set of relationships among the dependent and

independent variables, generating a predictive function that models these associations [12, 18, 19, 23]. The logistic

regression method for predicting student performance is normally used to describe the associations between a

number of independent variables that could be categorized as binary, categorical and continuous [2, 13, 21, 24]. The

level of prediction accuracy using the logistic regression is around 70% using variables such as career aspirations,

CGPA, psychological scores, and personal interests.

2.2. Decision Tree

Many researchers have used the Decision Tree prediction method for its clarity and ease in exposing small and

large data sets and forecasting the value [6, 18, 21, 23, 25]. The logic when applying decision tree techniques is

equivalent to a series of IF-THEN statements, which can help in simplifying the understanding of this method. There

are several papers that have used this method to predict student performance using key indicators such as student

grades in specific courses and current CGPA [13, 22, 26]. The accuracy of prediction using this method while relying

on data prior to students starting an academic program is around 70% [2], and reaches 90% when using data gathered

after joining the program [16].

2.3. Artificial Neural Network

An Artificial Neural Network (ANN) can detect all existing interactions among independent variables. It has been

widely used as a method in educational data mining. The ANN’s ability to detect with high confidence complex

associations between independent and dependent variables makes it a powerful tool in predicting student performance

[12, 13, 23, 24, 25, 26]. The most common variables used in forecasting student performance using neural networks are student attitude towards learning, admission data, CGPA, and grades in specific courses. This technique led to up to 98% accuracy in predicting student performance using data gathered after students join an institution, and had an accuracy of around 70% using data prior to students starting their academic journey [2, 16].

2.4. Naive Bayes

Naïve Bayes is another method used to predict student performance. It uses all attributes existing in the data and

makes comparisons among independent variables to show the significance and effect of each of these predictors. The

papers that used this method predominantly considered variables such as grades, scholarships, CGPA, high school

background, demographics, social network data and internal assessments. Research using Naïve Bayes relied mostly

on data gathered after students had started their academic journey [6, 13, 21, 23, 24], with a minimum accuracy of

50% and a maximum of 76% [16].

2.5. K-Nearest Neighbors

The K-Nearest Neighbors is a simple algorithm that classifies a data point based on the prevalent class of its K nearest neighbors. The data in this technique encompasses a number of multivariate attributes that are used for

classification. The K-Nearest Neighbors method is quick in predicting student performance in terms of level of

learning (slow, medium, good and excellent learner) [13, 21, 23]. Its accuracy rate was slightly above 60% when using

psychomotor factors, and reached 83% when using data extracted from internal assessments, CGPA, and extra-curricular activities [16].

2.6. Support Vector Machine

Support Vector Machine (SVM) is a supervised learning method that classifies data points by segregating them

using an N-dimensional hyperplane, where N is the number of attributes characterizing a data point. This method has

helped researchers in predicting student performance when working with small samples [6, 12, 13, 21, 23, 25]. The

SVM also proves to be effective when dealing with overlapped data. Earlier research used CGPA, extra-curricular

activities, psychomotor tests and internal assessments in predicting student performance [19] and reached an accuracy

of around 80% [16].

2.7. Ensemble of Methods

There is a general consensus that combining prediction methods produces more accurate and more robust

prediction results [27]. The collective decision of all methods is the result of a probabilistic averaging or a voting

scheme. To ensure an increase in accuracy over individual methods, the methods in an ensemble should have a fair

level of uncorrelated errors [28]. In other words, each constituent method should yield better accuracy than the other

methods in the set if applied individually on a different segment of the data space. In addition, none of the methods

will be able to yield optimal accuracy if applied on the entire data space. Several papers addressed the topic of

predicting student performance using the Ensemble Method [2, 3, 6, 11, 24]. Specifically, those papers relied on

Random Forest, Boosted Trees, Bagged Trees, and Information Fusion. Delen [11] reported 82% accuracy in predicting students’ performance within their first year of studies using the Information Fusion approach. Miguéis et al. [6] reported

95% accuracy in predicting students’ performance within their first year of studies using Boosted Trees, relying on

earned grades and completed credits. Hoffait and Schyns’ work [2] was distinguished with their use of Random Forest

based on data gathered prior to admission. They extended different ensemble models with a special algorithm to

increase their prediction accuracy. The algorithm aims at identifying a subset of students who are most likely to fail,

out of the general set of students who are predicted to fail. It ensures that the prediction accuracy rate, using the

identified subset, should be equal to a confidence level defined by the decision maker. Applied on Random Forest, the algorithm identified 21.2% of students from the set of those who were facing a high risk of

failure, with a confidence of 91%. However, when considering the entire set of students, the authors reported close to

70% accuracy for predicting Fail, and close to 59% accuracy for predicting Pass.

2.8. Automated Machine Learning

After reviewing the widely used prediction methods, it is important to re-emphasize the value of automation for

choosing an optimal prediction model, given the complexity of such a task. Various AutoML applications have

recently been described in the literature [7, 8, 9]. The study of Tuggener et al. [9] confirms the superiority of auto-generated machine learning models over human-designed models. Luo et al. highlight the cost of building and

generalizing Machine Learning models that often requires hundreds of manual iterations to identify a suitable

prediction model and corresponding hyper-parameters, and encourage medical researchers to adopt AutoML for cost

efficiency. Salvador et al. [7] conducted an experimental analysis examining the search space of 812 billion possible

combinations of methods and categorical hyper-parameters, for 21 publicly available data sets, and 7 data sets from

real chemical production processes. Relying on their results, they encouraged practitioners to use AutoML on a broad

variety of classification problems. Stadelmann et al. reported practical use of AutoML in analyzing house- and client-related data at PricewaterhouseCoopers [8].

In light of the reviewed literature, there is an evident need to use AutoML in an attempt to improve the accuracy

of predicting student performance. Particularly, such a need is prominent when predicting students’ performance based

on data prior to starting their first academic year, where the accuracy level is around 70%. Increasing the accuracy of

prediction, based on data available from day one, is not only of high value for researchers but also for practitioners

focusing on student success and retention.

This paper relies on an automatic search algorithm in machine learning to identify the optimal model to predict

student success at the start of their first year in a university – using data available prior to starting a new program. This

can help in an early intervention approach to mitigate their risk of failure.

3. Methodology

In this study, we rely on AutoML to derive the best classification model and corresponding hyper-parameters.

Amongst the most popular tools that offer AutoML features are Auto-Weka [28] and Auto-sklearn [29]. We chose to

run the Auto-Weka search algorithm with the hyper-parameter optimization option. Figure 1 represents the automated

machine learning process that looped through the list of predictive methods and corresponding hyper-parameter values

to identify the model with the best accuracy. The search algorithm concluded with an Ensemble Model of multiple

methods that yielded the best classification accuracy out of all the auto-tested combinations of prediction methods and

corresponding hyper-parameters. The prediction mechanism of the identified Ensemble Model is based on a voting

scheme that adopts the prediction outcome resulting from the majority of the constituent methods. The constituent

methods of the ensemble are:

▪ Artificial Neural Network

▪ K-Nearest Neighbors

▪ K-Means Clustering

▪ Naïve Bayes

▪ Support Vector Machine

▪ Logistic Regression

▪ Decision Tree
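The voting mechanism of such an ensemble can be sketched in a few lines of Python (an illustrative sketch only; the toy per-method predictions below are placeholders, not outputs of the study’s trained models):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the constituent methods' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-method predictions for one student (PASS/FAIL labels).
per_method = {
    "ann": "FAIL", "knn": "FAIL", "kmeans": "PASS", "naive_bayes": "FAIL",
    "svm": "PASS", "logistic": "FAIL", "decision_tree": "FAIL",
}
print(majority_vote(list(per_method.values())))  # FAIL (5 of 7 methods agree)
```

With an odd number of seven constituent methods, a two-class vote can never tie.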

Figure 1 – Automated Machine Learning Process. (Flowchart: for each ML method, iterate over all combinations of its hyper-parameter values, train the method on the training dataset and test it on the testing dataset; once all methods and combinations are done, choose the best ML method.)
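The search loop of Figure 1 can be sketched as a minimal grid-search skeleton (the method names, hyper-parameter grids, and placeholder scoring function below are illustrative assumptions, not Auto-Weka’s actual search space or optimization strategy):

```python
import itertools
import random

def evaluate(method, params):
    # Placeholder scorer: a real system would train `method` with `params` on
    # the training dataset and return its accuracy on the testing dataset.
    return random.Random(str((method, params))).random()

# Hypothetical search space: each method with candidate hyper-parameter values.
space = {
    "knn": {"k": [1, 3, 5]},
    "svm": {"degree": [2, 3], "c": [0.1, 1.0]},
}

best = None
for method, grid in space.items():                   # loop over ML methods
    for combo in itertools.product(*grid.values()):  # loop over combinations
        params = tuple(zip(grid.keys(), combo))
        score = evaluate(method, params)
        if best is None or score > best[0]:
            best = (score, method, dict(params))     # keep the best model
print(best[1])  # the winning method name
```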

3.1. Artificial Neural Network

Mimicking the neural connections and interactions in the human brain, an ANN models a brain neuron using a mathematical function F(x) [30]. The ANN simulates the interconnections among neurons by nesting functions based on a network model. The function’s parameter x is a vector of size n, x = [x_1, x_2, …, x_n]. We can represent this function as:

F(x) = S(Σ_{i=1}^{n} ω_i F(x_i))    (1)

If x is a scalar, F(x) = x. The factor ω is a weight that will be learned through training the network on historical data.

S is a transfer function that normalizes the output within a specific range of values. The adopted transfer function in

this study is the Sigmoid function that modulates values between 0 and 1 as follows:

S(x) = 1 / (1 + e^(−x))    (2)

The ANN is a hierarchical model made of multiple layers. Each layer has a number of nodes (neurons) that connect via unidirectional links with all nodes in the downstream layer. There is no connection to upstream or same-layer nodes. Normally, there are three types of layers: the input layer, a set of middle layers, and the output layer as

shown in figure 2. The architecture of the ANN adopted in this study is made of an input layer representing the

different categorical values of the adopted data features, 2 middle layers having 12 and 7 neurons respectively, and an

output layer made of one neuron representing the binary outcome.

Figure 2 – ANN Hierarchy (input layer x_1 … x_n, middle layers, and a single-node output layer)
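A forward pass through middle layers of 12 and 7 sigmoid neurons can be sketched as below (illustrative only: the weights are random rather than trained, and the four encoded input features are hypothetical):

```python
import math
import random

def sigmoid(x):
    # Transfer function S(x) = 1 / (1 + e^(-x)), squashing output into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    # Each neuron: sigmoid of the weighted sum of all upstream activations.
    return [sigmoid(sum(w * a for w, a in zip(ws, inputs))) for ws in weights]

def forward(x, layer_sizes, rng):
    # Untrained random weights; a real network would learn them from data.
    for size in layer_sizes:
        weights = [[rng.uniform(-1, 1) for _ in x] for _ in range(size)]
        x = layer(x, weights)
    return x

rng = random.Random(0)
features = [1.0, 0.0, 1.0, 0.5]           # toy encoded input features
out = forward(features, (12, 7, 1), rng)  # hidden layers of 12 and 7, one output
print(round(out[0], 3))                   # a value in (0, 1): the binary score
```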

3.2. K-Nearest Neighbors

The K-Nearest Neighbors method classifies a data point based on the dominant class of its K-nearest neighboring

points within a training data set. The distance between two data points is measured using a specific function, such as

the Euclidean, Manhattan, and Chebychev functions [30]. In our study, the adopted distance function is the Euclidean, and K was set to 1. The Euclidean distance between two data points x and y, where x and y are vectors of size n, is:

E(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)^2)    (3)
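With K = 1, the classifier reduces to adopting the class of the single closest training point; a minimal sketch (the encoded student records are hypothetical):

```python
import math

def euclidean(x, y):
    # E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor_class(point, training):
    # With K = 1, the prediction is simply the class of the closest training point.
    _, label = min(training, key=lambda t: euclidean(point, t[0]))
    return label

# Hypothetical encoded student records: (feature vector, outcome).
training = [([1.0, 0.0, 2.0], "PASS"),
            ([0.0, 1.0, 0.0], "FAIL"),
            ([1.0, 1.0, 2.0], "PASS")]
print(nearest_neighbor_class([1.0, 0.0, 1.5], training))  # PASS
```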

3.3. K-Means Clustering

The K-Means Clustering method assigns a data point to one class out of K different classes. Before the

classification, the clustering algorithm arranges the data points of a training set into K different clusters, which eventually represent the classes. In this study, K was equal to 2 since we have two different classes: Pass and Fail.

The assignment of a data point to a cluster is decided based on its distance from the centroid of each cluster. The

centroid is the average data point of all points in the cluster. The adopted distance function for this algorithm is the

Euclidean Distance [30].

In our study, after the training phase to assign historical data points into two different clusters, the K-Means

Clustering classifier is able to predict the outcome of a new data point by assigning it to the cluster that has the closest

centroid.
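The centroid-based assignment can be sketched as follows (the two toy clusters below are hypothetical, not the study’s learned clusters):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    # The centroid is the coordinate-wise average of all points in the cluster.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def assign(point, clusters):
    # Predict by assigning the new point to the cluster with the closest centroid.
    return min(clusters, key=lambda label: euclidean(point, centroid(clusters[label])))

# Hypothetical clusters learned from historical records (K = 2: Pass / Fail).
clusters = {"PASS": [[2.0, 3.0], [3.0, 3.0]],
            "FAIL": [[0.0, 0.0], [1.0, 0.0]]}
print(assign([2.5, 2.0], clusters))  # PASS
```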

3.4. Naïve Bayes

The Naive Bayes classifier is a simple technique that predicts outcomes based on the Bayesian theorem. The training of the Naïve Bayes classifier is fast compared to other computationally intensive models. It classifies a data point

x based on the conditional probability of being in a class C given the values of its constituent scalars [x1, x2, …, xn],

without relying on any additional parameter. The class that has the highest probability of occurrence given the inputs

will be the predicted class [14].

The probability of being in the class C given x = [x_1, x_2, …, x_n] is as follows:

P(C|x) = (P(C) / P(x)) · Π_{i=1}^{n} P(x_i|C)    (4)
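Since P(x) is the same for every class, a Naïve Bayes prediction only needs to compare P(C) · Π P(x_i|C) across classes; a sketch with hypothetical categorical features and made-up probabilities:

```python
def naive_bayes_predict(x, priors, likelihoods):
    # Score each class by P(C) * prod_i P(x_i | C); the shared denominator P(x)
    # can be dropped when only the most probable class is needed.
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[c][i][value]
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical categorical features: (scholarship, admitted on probation).
priors = {"PASS": 0.68, "FAIL": 0.32}
likelihoods = {
    "PASS": [{"FULL": 0.30, "NONE": 0.70}, {"YES": 0.10, "NO": 0.90}],
    "FAIL": [{"FULL": 0.10, "NONE": 0.90}, {"YES": 0.40, "NO": 0.60}],
}
print(naive_bayes_predict(["NONE", "YES"], priors, likelihoods))  # FAIL
```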

3.5. Support Vector Machine

The SVM classifier derives boundaries between data points that belong to different classes. Points within certain

boundaries are normally part of a common class. The ideal scenario is when the data points belonging to different

classes are separable via a linear boundary. However, in most cases this is not possible due to data overlaps as shown

in figure 3. SVM casts the data points to a new higher dimension space in which the data becomes linearly separable

with a hyperplane, using a specific kernel function. This technique is based on Cover’s Theorem, which states that non-linearly separable data points are highly likely to be separable by a hyperplane if projected to a higher-dimensional

space via some non-linear transformation. The boundary hyperplane will be realized by referencing the borderline

data points, which are called the support vectors. The identified support vectors should be away from the boundary by

a given margin. The kernel function not only takes care of casting to a new space but also provides the dot product

between two data points x and y for measuring distances, hence reducing the computational overhead. We relied in

this study on the polynomial kernel function F of degree d as shown below [30]:

F(x, y) = (Σ_{i=1}^{n} x_i · y_i + c)^d    (5)

Figure 3 – Non-linearly separable overlapping data classes
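The polynomial kernel of equation (5) is a one-liner; for example, with c = 1 and d = 2 (the input vectors are arbitrary toy values):

```python
def poly_kernel(x, y, c=1.0, d=2):
    # F(x, y) = (x . y + c)^d: the dot product in the higher-dimensional space,
    # computed without ever constructing that space explicitly.
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

print(poly_kernel([1.0, 2.0], [3.0, 4.0]))  # (3 + 8 + 1)^2 = 144.0
```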

3.6. Logistic Regression

The Logistic Regression classifier transforms the output of a linear regression function f(x) into a value between 0 and 1 using the logistic function L, as described below [30]. It reflects the odds of class occurrence with respect to the given features.

L(f(x)) = 1 / (1 + e^(−f(x)))    (6)
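A sketch of the logistic transform applied to a hypothetical learned linear function (the weights and features below are made up for illustration):

```python
import math

def logistic(z):
    # L(f(x)) = 1 / (1 + e^(-f(x))): maps any real score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned linear function f(x) = w . x + b over encoded features.
w, b = [-1.2, 0.8], 0.3
x = [1.0, 0.0]                 # e.g. two binary-encoded student attributes
p = logistic(sum(wi * xi for wi, xi in zip(w, x)) + b)
print(round(p, 3))             # 0.289: the predicted class probability
```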

3.7. Decision Tree

A Decision Tree classifier learns from a set of historic data points and generates a corresponding tree-like structure.

The features and respective values are analyzed and structured in a hierarchical tree-like topology, which helps in

answering questions by a simple root-to-leaf traversal. The root and all other decision nodes are connected to two or

more downstream nodes (all representing answers to decision questions). A leaf node has no downstream connections

and represents the final answer to the series of questions captured in the path of nodes preceding it up to the root [30].

Figure 4 is a snapshot from a section of the Decision Tree pertaining to this study.

Figure 4 – Section of the Decision Tree (No – CGPA Not Below 2.0; Yes – CGPA Below 2.0)
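The IF-THEN traversal can be sketched as follows (the tree fragment is a hypothetical illustration in the spirit of Figure 4, not the study’s actual tree):

```python
def predict(node, student):
    # Walk the tree: each decision node asks about one feature; a leaf holds
    # the final PASS/FAIL answer (equivalent to nested IF-THEN rules).
    while isinstance(node, dict):
        node = node["branches"][student[node["feature"]]]
    return node

# Hypothetical tree fragment over two categorical features.
tree = {
    "feature": "probation",
    "branches": {
        "YES": "FAIL",
        "NO": {"feature": "course_load",
               "branches": {"HIGH": "PASS", "NORM": "PASS", "MODR": "FAIL"}},
    },
}
print(predict(tree, {"probation": "NO", "course_load": "HIGH"}))  # PASS
```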

3.8. Data Sources

We have collected the data for this study from different sources within academic institutions in the United Arab

Emirates. Specifically, we relied on student records from Admission, Registrar and Student Service offices. Our

sample included records of 1491 students, of whom 1014 were in good academic standing.

We faced three main challenges when building the predictive model based on this sample: data inconsistency,

imbalance and overlap. For students who have spent at least a semester in a university program, several data features

would exist and should help in producing predictive models with high precision. For example, we can rely on several

features to predict students’ success in a particular course or program such as grades in key courses, exams, past terms’

CGPAs, probations, warnings, class participations, and extra-curricular engagements. For new entrants, in the absence

of this data, other variables that are available upon admission are required to build a precise predictive model. These

variables represent common attributes of the admitted students such as age, gender, ethnicity, study program, course

load, on-campus residency, probation, and school education system. We used all of these variables in this study.

Furthermore, to address the differences in the high school systems and the inconsistency in evaluation schemes, we

relied on the students’ placement in developmental English and Math courses that are based on scores from standard

exams such as TOEFL, IELTS, English ACCUPLACER, Math ACCUPLACER, and SAT. We have used 13 data

features in developing this predictive model, as described in Table 1, and transformed their values to categorical

ranges.

The imbalance between the number of passing (1014) and failing (477) students biases the predictive model. We

needed to apply a careful data balancing technique to ensure better precision without compromising the learning value

from the data. We chose the Synthetic Minority Oversampling Technique (SMOTE) [31] to create extra data points

in the training data set in order to balance the data classes. Table 2 shows the percentage of failing

students for each category under each data feature.
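The core SMOTE idea, interpolating a synthetic minority point between a real minority point and a minority-class neighbor, can be sketched as below (a minimal single-nearest-neighbor version; the full technique samples among K nearest neighbors, and the toy points are hypothetical):

```python
import math
import random

def smote_sample(minority, rng):
    # Pick a minority point, find its nearest minority neighbor, and
    # interpolate a synthetic point at a random position between the two.
    base = rng.choice(minority)
    neighbor = min((p for p in minority if p is not base),
                   key=lambda p: math.dist(base, p))
    t = rng.random()
    return [b + t * (n - b) for b, n in zip(base, neighbor)]

rng = random.Random(42)
failing = [[1.0, 0.0], [1.2, 0.1], [0.9, 0.3]]   # toy minority-class points
synthetic = smote_sample(failing, rng)
print([round(v, 2) for v in synthetic])           # lies between two real points
```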

Students having similar data might end up having different outcomes causing confusion to a prediction method.

Due to this data overlap, a method resorts to a particular stochastic guess within certain probabilistic limits to predict

an outcome, leading to reduced prediction accuracy. Our proposed ensemble of multiple predictive methods increases

the prediction accuracy since it relies on voting amongst different methods. In other words, the prediction outcome of

the Ensemble Model is the most recurring classification among the set of methods.

Table 1
Data Features.

Feature: Program
Values: BBA, ENG, BAIS, ARC, ID, VC, BCIS, GEN
Description: Program of study in the university. The considered programs are: Bachelor of Business Administration (BBA), Engineering (ENG), Bachelor of Arts in International Studies (BAIS), Architecture (ARC), Interior Design (ID), Visual Communications (VC), Bachelor of Communication and Information Studies (BCIS), and General (GEN).

Feature: School System
Values: HSD, IB, IGCSE, BAC, OTH
Description: School system from which a student is coming, as per the UAE Ministry of Education: High School Diploma (HSD), International Baccalaureate (IB), International General Certificate of Secondary Education (IGCSE), Baccalaureate (BAC), Other (OTH).

Feature: Ethnicity
Values: NAMR, AUS, ASIA, SAMR, EURO, LEVN, PERS, GCC, AFRC, NAFR, SASA, NASA
Description: The ethnic community to which the student belongs: North American (NAMR), Australian (AUS), Asian (ASIA), South American (SAMR), European (EURO), Levantine (LEVN), Persian (PERS), Arab Gulf (GCC), African (AFRC), North African (NAFR), South Asian (SASA), North Asian (NASA).

Feature: Gender
Values: Male, Female

Feature: Age Group
Values: AGE20+, AGE19-
Description: The age is inferred from the date of birth and grouped under two categories: 19-and-below, and 20-and-above.

Feature: Scholarship
Values: NONE, QUART, HALF, FULL
Description: The scholarship status of the student: no scholarship (NONE), 25% scholarship (QUART), 50% scholarship (HALF), and 100% scholarship (FULL).

Feature: Transfer Status
Values: TRC, TRN, NON
Description: The transfer status of the student: transferred from another university with no credits counted (TRN), transferred from another university with some credits counted (TRC), not transferred from any university, i.e. coming directly from high school, which is the case for the majority of students (NON).

Feature: Admitted on Probation
Values: YES, NO
Description: The admission-on-probation status: student admitted on probation (YES), student admitted with no probation (NO).

Feature: In Dorm
Values: YES, NO
Description: The dorm occupancy: student lives in a campus-based dormitory (YES), student does not live in a campus-based dormitory (NO).

Feature: Course Load
Values: HIGH, MODR, NORM
Description: The course load is inferred from the number of registered courses (credits): high load (HIGH) is 6 or more courses (18+ credits), normal load (NORM) is 5 courses (15 credits), moderate load (MODR) is 4 courses or fewer (12 credits or fewer).

Feature: Math Level
Values: MATH1, MATH2, MATH3, MATH4, MATH5
Description: The level of math skills upon admission, based on the math placement test. The lowest level is MATH1 and the highest can go up to MATH5, depending on the program of study.

Feature: English Level
Values: ENGL1, ENGL2, ENGL3, ENGL4, ENGL5
Description: The level of English skills upon admission, based on the English placement test. The lowest level is ENGL1 and the highest can go up to ENGL5, depending on the program of study.

Feature: Result
Values: PASS, FAIL
Description: The outcome based on the student’s cumulative Grade Point Average (CGPA), ranging from 0 to 4, in their first semester at the university. If the CGPA is below 2, the student is considered to be failing (FAIL); otherwise, the student is considered to be passing (PASS).

Table 2
Descriptive Data (percentage of failing students per category under each feature).

Program: GEN (60%), BBA (42%), ENG (34%), ARC (26%), BAIS (24%), VC (19%), ID (15%), BCIS (13%)
School System: HSD (33%), IGCSE (32%), BAC (26%), IB (19%)
Ethnicity: NASA (42%), GCC (39%), SASA (38%), NAFR (38%), AFRC (35%), PERS (32%), MEST (28%), EURO (21%)
Gender: Female (14%), Male (49%)
Age Group: AGE19- (30%), AGE20+ (49%)
Scholarship: NONE (37%), FULL (19%), QUART (10%), HALF (0%)
Transfer Status: AS (93%), TRN (54%), NHS (29%), TRC (26%)
Admitted on Probation: YES (60%), NO (30%)
In Dorm: YES (31%), NO (32%)
Course Load: LOW (65%), MOD (67%), NORM (35%), HIGH (18%)
Math Level: NONE (15%), MATH1 (48%), MATH2 (36%), all other math levels (average of 25%)
English Level: NONE (23%), ENGL1 (44%), ENGL2 (18%), all other English levels (average of 10%)
Result: 31.9% of the dataset were failing students

4. Results

We used 10-fold cross-validation to test the accuracy of the resulting Ensemble Model. The model is trained on

90% of the points and tested with 10% over 10 different runs. It is important to note that the data points that are

allocated for testing as part of the 10% split are different each time. Figure 5 is a schematic representation of the cross-

validation process adopted in Weka for this study.

Figure 5 – 10-fold Cross-Validation of the Model Accuracy. (Flowchart: split the dataset 90%/10%; apply SMOTE to the 90% training split; train the Ensemble Model on the SMOTEd data; test it on the 10% held-out split; repeat for 10 loops and aggregate the results.)
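The 10-fold scheme can be sketched as index bookkeeping (the SMOTE, training, and testing steps are left as a comment; only the fold construction is shown):

```python
import random

def ten_fold_indices(n, rng):
    # Shuffle once, then deal the indices into 10 disjoint test folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::10] for i in range(10)]

n = 1491                                  # records in the study's sample
folds = ten_fold_indices(n, random.Random(0))
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(n) if i not in held_out]
    # ...apply SMOTE to the training rows only, train the Ensemble Model,
    # test on the held-out rows, and aggregate results over the 10 loops...
print(len(folds), sum(len(f) for f in folds))  # 10 1491
```

Applying SMOTE inside the loop, to the training split only, keeps synthetic points out of the test folds.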

Table 3 lists the classification methods with their corresponding accuracy rates when applied on our data set. In

addition to the overall accuracy, the table differentiates between the accuracy of predicting Fail and Pass. This

differentiation is important to assess the efficiency of these methods in targeting students at risk.

Table 3
Methods Comparison.

Classification Method | Accuracy Rate of Predicting Fail Students | Accuracy Rate of Predicting Pass Students | Overall Accuracy Rate | Kappa Statistic
Ensemble Model | 83.0% | 72.5% | 75.9% | 0.50
Artificial Neural Network | 73.6% | 69.8% | 71.0% | 0.39
K-Nearest Neighbors | 77.4% | 65.4% | 69.2% | 0.37
K-Means Clustering | 74.2% | 36.4% | 48.5% | 0.08
Naïve Bayes | 76.7% | 69.8% | 72.0% | 0.42
Support Vector Machine | 56.0% | 82.2% | 73.8% | 0.38
Logistic Regression | 73.0% | 69.8% | 70.8% | 0.38
Decision Tree | 76.7% | 65.7% | 69.2% | 0.37

Further, Table 3 highlights the kappa coefficient (κ), which is a statistic representing the level of agreement between two different classifiers. It factors in the possibility of accidental agreements. In our case, the agreement is measured between the modeled classifier and the observed process.

k = (P_o − P_e) / (1 − P_e)    (7)

P_o is the probability of making the right prediction, i.e. the accuracy measure. P_e is the probability of accidental agreement between the classifiers. In a binary system with two predictors, P_e = P_1(a)·P_2(a) + P_1(b)·P_2(b), where P_i(n) is the probability of classifier i predicting class n. A kappa coefficient between 0.4 and 0.75 is considered good according to Fleiss’ scale; a kappa below 0.4 is poor, and above 0.75 is excellent. Our Ensemble Model achieved a kappa of 0.50, which is nearly 20% higher than what is achieved using each prediction model separately (on the same data). This implies that our Ensemble Model, resulting from the automatic search, leaves less chance for accidental guessing.
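Equation (7) in code, with a hypothetical binary example (the probabilities below are made up for illustration, not the study’s figures):

```python
def kappa(p_o, p_e):
    # k = (P_o - P_e) / (1 - P_e): agreement corrected for chance agreement.
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical binary case: both "classifiers" (model and observed process)
# predict class a 60% of the time, and the model is right 80% of the time.
p1_a = p2_a = 0.6
p_e = p1_a * p2_a + (1 - p1_a) * (1 - p2_a)   # 0.36 + 0.16 = 0.52
print(round(kappa(0.8, p_e), 3))              # 0.583
```

When P_o equals P_e, the classifier does no better than chance and κ is 0.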

5. Conclusion

The reported work in this paper contributes to the body of knowledge in the field of predicting student academic

success. Specifically, it relies on AutoML to increase the prediction accuracy of student performance using data

features available prior to the students starting their new academic program, i.e. pre-start data. In effect, the accuracy

of predicting student performance using pre-start data has never exceeded 70%, as found in the current literature [2,

3]. In our study, we achieved 75.9% overall accuracy through the use of AutoML, with a kappa of 0.50. Accordingly,

we encourage researchers in this field to adopt AutoML in their search for an optimal student performance prediction

model, especially when using pre-start data.
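As a brief illustration of why a combined model can outperform its members, the sketch below merges three hypothetical binary classifiers by majority vote; the predictors and toy labels are invented for exposition and are not the paper's actual models:

```python
def majority_vote(predictions):
    """Combine per-model binary predictions (0 = Fail, 1 = Pass) by majority vote."""
    votes = list(zip(*predictions))  # one tuple of votes per student
    return [1 if sum(v) > len(v) / 2 else 0 for v in votes]

truth   = [0, 0, 0, 1, 1, 1]
model_a = [0, 0, 1, 1, 1, 1]   # 5/6 correct
model_b = [0, 1, 0, 1, 1, 0]   # 4/6 correct
model_c = [1, 0, 0, 1, 0, 1]   # 4/6 correct
combined = majority_vote([model_a, model_b, model_c])
accuracy = sum(c == t for c, t in zip(combined, truth)) / len(truth)
print(combined, accuracy)  # → [0, 0, 0, 1, 1, 1] 1.0
```

Because the three models err on different students, the vote recovers every label even though no single member is perfect; AutoML searches automate the discovery of such complementary combinations.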

Besides improving the overall prediction accuracy, it is of paramount importance to improve the accuracy of

predicting the failing students, who need immediate attention and support from specialized units within academic

institutions. The maximum accuracy rate reported in the literature for predicting failure of new-start students is

70%. In our case, the auto-generated Ensemble Model predicts failing students with an accuracy of 83%, after

balancing the data using Synthetic Minority Oversampling Technique. Such a result emphasizes the importance of

balancing data using advanced statistical techniques to achieve better prediction, especially if the minority class is of

interest. The authors acknowledge the overgeneralization limitation of using SMOTE. Yet, since the data set

contains a sizeable minority, the risk of creating synthetic values outside of the minority set, which overlap with the

majority set, is rather minor.
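The interpolation at the heart of SMOTE can be sketched in a few lines. The version below is a simplified illustration of the idea (synthesizing minority samples between nearest neighbours), not the configuration used in the study; production work would typically use the imbalanced-learn library's implementation:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between each sample and one of its k nearest
    minority-class neighbours (numeric features assumed)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                     # random minority sample
        j = rng.choice(neighbours[i])           # one of its neighbours
        gap = rng.random()                      # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Toy minority set: 6 points in a 2-D feature space
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_new = smote(X_min, n_new=4, k=3, rng=0)
print(X_new.shape)  # → (4, 2)
```

Because each synthetic point lies on a segment between two minority samples, a sizeable, well-clustered minority keeps the new points inside minority territory, which is the overgeneralization caveat noted above.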

The resulting increase in prediction accuracy of students at risk allows academic institutions to be more efficient

in supporting those students while utilizing the least amount of resources. Future studies may rely on descriptive

statistics to analyze the role of different psychographic variables and their impact on the predictive model. It would

be interesting for upcoming studies to test auto-generated ensemble models in predicting student career success using

academic and psychographic data.

References

[1] M. Tight, Student retention and engagement in higher education, Journal of Further and Higher Education, Mar

2019, DOI: 10.1080/0309877X.2019.1576860.

[2] A.S. Hoffait, M. Schyns, Early detection of university students with potential difficulties, Decision Support

Systems 101 (2017) 1–11.

[3] J.P. Vandamme, N. Meskens, J.F. Superby, Predicting academic performance by data mining methods,

Education Economics 15 (4) (2007) 405–419.

[4] J. Evermann, J.R. Rehse, P. Fettke, Predicting process behaviour using deep learning, Decision Support Systems

100 (2017) 129-140.

[5] N. Carneiro, G. Figueira, M. Costa, A data mining based system for credit-card fraud detection in e-tail, Decision

Support Systems 95 (2017) 91-101.

[6] V.L. Miguéis, Ana Freitas, Paulo J.V. Garcia, André Silva, Early segmentation of students according to their

academic performance: A predictive modelling approach, Decision Support Systems 115 (2018) 36-51.

[7] M. M. Salvador, M. Budka, B. Gabrys, Automatic Composition and Optimization of Multicomponent Predictive

Systems With an Extended Auto-WEKA, IEEE Transactions on Automation Science and Engineering 16 (2) 2019.

[8] T. Stadelmann, M. Amirian, I. Arabaci, M. Arnold, G. F. Duivesteijn, I. Elezi, M. Geiger, S. Lӧrwald, B.B.

Meier, K. Rombach, Deep learning in the wild, IAPR Workshop on Artificial Neural Networks in Pattern

Recognition, Springer, 2018, pp. 17–38.

[9] L. Tuggener, M. Amirian, K. Rombach, S. Lӧrwald, A. Varlet, C. Westermann, T. Stadelmann, Automated

Machine Learning in Practice, State of the Art and Recent Results, Proceedings of the 6th IEEE Swiss Conference

on Data Science (SDS’19), Bern, Switzerland, June 14, 2019.

[10] A. Pena-Ayala, Educational data mining: a survey and a data mining-based analysis of recent works, Expert

Systems with Applications 41 (4) (2014) 1432–1462.

[11] D. Delen, A comparative analysis of machine learning techniques for student retention management, Decision

Support Systems 49 (4) (2010) 498–506.

[12] S. Huang, N. Fang, Predicting student academic performance in an engineering dynamics course: a comparison

of four types of predictive mathematical models, Computers & Education 61 (2013) 133–145.

[13] F. Marbouti, H.A. Diefes-Dux, K. Madhavan, Models for early prediction of at-risk students in a course using

standards-based grading, Computers & Education 103 (2016) 1–15.

[14] C. Márquez-Vera, A. Cano, C. Romero, A.Y.M. Noaman, H. Mousa Fardoun, S. Ventura, Early dropout

prediction using data mining: a case study with high school students, Expert Systems 33 (1) (2016) 107–124.

[15] M. Richardson, C. Abraham, R. Bond, Psychological correlates of university students' academic performance: a

systematic review and meta-analysis, Psychological Bulletin 138 (2) (2012) 353–387.

[16] A.M. Shahiri, H. Wahidah, A.R. Nur’aini, A Review on Predicting Student's Performance Using Data Mining

Techniques, Procedia Computer Science 72 (2015) 414–422.

[17] Z.K. Papamitsiou, V. Terzis, A.A. Economides, Temporal learning analytics for computer based testing,

Proceedings of the Fourth International Conference on Learning Analytics And Knowledge, LAK ’14, ACM, New

York, NY, USA, 2014, pp. 31–35.

[18] S. Natek, M. Zwilling, Student data mining solution-knowledge management system related to higher education

institutions, Expert Systems with Applications 41 (14) (2014) 6400–6407.

[19] M. Mayilvaganan, D. Kalpanadevi, Comparison of classification techniques for predicting the performance of

students academic environment, 2014 International Conference on Communication and Network Technologies

(ICCNT), IEEE, 2014, pp. 113–118.

[20] G. Putnik, E. Costa, C. Alves, H. Castro, L. Varela, V. Shah, Analysing the correlation between social network

analysis measures and performance of students in social network-based engineering education, International Journal

of Technology and Design Education 26 (3) (2016) 413–437.

[21] G. Gray, C. McGuinness, P. Owende, An application of classification models to predict learner progression in

tertiary education, Advance Computing Conference (IACC), 2014 IEEE International, 2014, pp. 549–554.

[22] T. Mishra, D. Kumar, S. Gupta, Mining students' data for prediction performance, 2014 Fourth International

Conference on Advanced Computing Communication Technologies, 2014, pp. 255–262.

[23] P. Strecht, L. Cruz, C. Soares, J. Mendes-Moreira, R. Abreu, A Comparative Study of Classification and

Regression Algorithms for Modelling Students' Academic Performance, International Educational Data Mining

Society, Madrid, 2015.

[24] C. Romero, P.G. Espejo, A. Zafra, J.R. Romero, S. Ventura, Web usage mining for predicting final marks of

students that use Moodle courses, Computer Applications in Engineering Education 21 (1) (2013) 135–146.

[25] E.B. Costa, B. Fonseca, M.A. Santana, F.F. de Araújo, J. Rego, Evaluating the effectiveness of educational data

mining techniques for early prediction of students' academic failure in introductory programming courses,

Computers in Human Behavior 73 (2017) 247–256.

[26] C. Romero, S. Ventura, Data mining in education, Wiley Interdisciplinary Reviews: Data Mining and

Knowledge Discovery 3 (1) (2013) 12–27.

[27] G. Seni, J. Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions,

Morgan and Claypool, 2010.

[28] L. Kotthoff, C. Thornton, H.H. Hoos, F. Hutter, K. Leyton-Brown, Auto-WEKA 2.0: Automatic model

selection and hyperparameter optimization in WEKA, Journal of Machine Learning Research 18 (2017) 1-5.

[29] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated

machine learning, Advances in Neural Information Processing Systems 28 (2015) 2962–2970.

[30] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, Springer, 2013.

[31] G. Douzas, F. Bacaoa, F. Last. Improving imbalanced learning through a heuristic oversampling method based

on k-means and SMOTE. Information Sciences 465 (2018) 1-20.

Bios:

Hassan Zeineddine holds a PhD in computer sciences from the University of Ottawa in Canada. He has 15 years of

industry experience associated with several leading telecommunication companies in North America. Hassan’s

current research interests are in the fields of data analytics, operations research, logistics and supply chains

collaboration. His other research interests include process modeling and simulations.

Assaad Farah holds a PhD in Management from the University of Bath in the United Kingdom. In addition to his

academic responsibilities, he is an executive educator and consultant mainly for the UAE public sector. Prior to that,

he worked in the aeronautical and mobile industry in Canada. His research focus revolves around knowledge

management, strategic HRM and AI.

Udo Braendle has worked in practice and for universities for more than 15 years. His research mainly focuses on

management science, regulation and the social and environmental behavior of firms. He has published widely on

these issues in leading journals, such as the Social Responsibility Journal, Journal of Management and Governance,

and the Business Strategy Review.
