
Student Dropout Analysis

1. Introduction

Education is the cornerstone of human progress. It has a direct bearing on the development of the
entire country, not just on the individual. The goal of India's School Education Vision 2030 is to
replace the subpar outcomes of the current system with a high-quality education for every child. By
2030, school enrolment is projected to reach 30 million students, up from 25 million in 2010, and
enrolment has indeed risen in recent years.

However, the annual dropout rate has not decreased and remains high in high school, at 17% against
a baseline of 4%. Despite the increase in participation, India is still a long way from achieving
universal education. Dropping out of school can be caused by a variety of factors, including
unanticipated life events, loss of enthusiasm, and a family's financial situation. Low living
standards, high unemployment, high rates of illiteracy, and sluggish GDP growth are all
consequences of the Indian education system's inability to retain students.

In general, school dropout has several causes, such as academic performance, inaccessibility of
schools, a harsh teaching environment, and financial problems, which can be grouped into a few
broad categories:
1. School-centred

2. Student-centred

3. Parent-centred

Students drop out for these reasons and, if detected early, dropouts can be prevented with
appropriate measures. Several studies have been conducted in recent decades to identify the main
causes of dropout in India, but none of them have focused on identifying the students who drop out.

Addressing this issue requires proactive measures to identify at-risk students and provide timely
interventions. Machine learning (ML) has emerged as a promising tool for analyzing complex
educational data to predict dropout risk and facilitate targeted interventions. By leveraging ML
algorithms, institutions can harness vast datasets encompassing student demographics, academic
performance, and socio-economic factors to develop predictive models.

This paper makes the following contributions:

1. We collect data such as academic records, financial status, social class, and medical
information about students, which are the main factors that drive dropout.

2. We identify the key characteristics that cause students to drop out.

3. We test different machine learning algorithms to predict which students are at risk during the
current session.
Data collection is a difficult step in this study because schools, especially in rural areas, often
do not keep proper records and do not make data readily available to the public. However, UDISE, an
initiative of the National University of Educational Planning and Administration, began tracking
students in 2016. The system tracks the academic journey of students in about 1.5 million
government and private schools in India. Data collection currently continues on an annual basis and
will move towards tracking through Aadhaar card numbers in the future.

These models enable educators to identify students at risk of dropout early on, allowing for
tailored support and intervention strategies. Our project explores the application of ML to student
dropout analysis, discusses the implications for educational practice, and aims to advance dropout
analysis methodologies.

Fig: Student dropout UDISE+ Booklet 2020-21 by the Ministry of Education, Government of India.
2. Literature Review

This literature review presents an in-depth analysis of findings related to the techniques used in
student dropout analysis. The search conducted for this survey indicated that approximately
1,009,105 articles are available for the keyword "dropout".

1. In 2020, Fisnik Dalipi [1] researched the prediction of student dropout on
MOOC platforms. The paper aims to give an overview of the phenomenon of MOOC
student dropout prediction using machine learning techniques. It investigates
the difficulties of predicting and explaining student dropouts in MOOCs and
offers insights and recommendations for developing effective machine-learning
solutions. Several machine learning designs are examined, including K-means,
Support Vector Machines (SVM), Decision Trees (DT), deep neural networks
(DNN), recurrent neural networks (RNN), and natural language processing (NLP)
models. Reported accuracies reach 89%, with AUC values of 0.710, demonstrating
the viability of these models for MOOC dropout prediction.
2. The novelty of this work lies in unifying clickstream data and
student-provided data under a standard, such as learning object metadata
standards, to improve the prediction models. The paper also highlights the
opportunity to improve students' social engagement in MOOCs for more robust
dropout prediction.
3. In 2020, Ahmed A. Mubarak [2] provided a way to predict student dropout at
an early stage so that instructors can intervene with the student during the
course. Two types of models were built: the first is sequential logistic
regression and the second is an input-output hidden Markov model. The models
predict dropout based on the student's weekly activity status and also
evaluate the student at the end of the course. This approach to dropout
prediction gives useful feedback to course instructors.
4. In 2024, Monteverde-Suárez [3] predicted students' academic performance and
examined related factors. The study aims to forecast the academic progress of
first-year medical students using machine learning models and artificial
neural networks (ANN) based on sociodemographic data and academic history. The
techniques used were Educational Data Mining (EDM) methods, machine learning
classification techniques, and ANN and Naive Bayes (NB) models. The ANN models
showed slightly better performance in accuracy, sensitivity, and specificity.
Both models were more sensitive when classifying regular students and more
specific when classifying irregular students. The study also identified the
percentage of correct answers on the diagnostic test as the best factor for
predicting students' academic achievement.
5. In this paper, Luton Wang [4] noted that many people are enrolling in
online MOOC courses, and with rising participation came rising dropout, a
growing concern for online learning. By studying MOOC data, they examined
different types of student behavior and predicted dropout from signals such as
the interval time between events and the time-series data, which are difficult
to model. They proposed a time-controlled long short-term memory (LSTM) neural
network model capable of predicting a student's behavior early on, at various
intervals of time. The model's gates combine long-term and short-term
information to improve performance.
6. In this paper, Warit Tenpipat [5] used classification models to predict
student dropout from an institute. The models used were gradient-boosted
trees, decision trees, and random forests. The data was taken from the
registrar's office of the university KMUTT and preprocessed in several steps.
The conclusion shows that there is little difference between the results of
the three classification models, although the gradient-boosted trees had the
highest accuracy. Factors such as GPA and academics helped the university
predict student dropout.
7. The author Dr. P. Asha [6] stated that student dropout has become a major
concern for institutes and universities. In this paper, the university's data
was processed using Hadoop with HDFS, Hive, MapReduce, Sqoop, and R. This made
it possible to identify good, average, and poor performers in academics as
well as in other activities such as sports and events. The aim was to support
poor performers who were considering dropping the course; such data also helps
the department implement steps to prevent dropout.
8. In this paper, Di Sun [7] noted that online class enrolment rates are
rising rapidly, which benefits open online platforms, but dropout remains
their key challenge. To address this, the author introduced a model that
predicts how much of a particular course syllabus a student will cover. The
model was built using a recurrent neural network together with a URL embedding
layer, and a layer representation of learning resources was used to address
the problem. Compared with a baseline model, the proposed model gave better
and more efficient results.
9. The dropout of students is a nightmare for every university or institute.
The author Phillip Benachour [8] used a classification method based on time
series that examined student behavior and online modules. They introduced a
model based on the time series forest (TSF) algorithm. This model made
predictions without requiring pedagogical experts from the education industry.
It gave good results even when only part of the data was used, achieving an
accuracy of 0.84 with just 5% of the data processed.
10. The Thai-Nichi Institute of Technology in Thailand faces dropout issues,
mostly among first- and second-year students. The author of this paper,
Kittinan Limsathitwong [9], created a website for evaluating students on a
particular subject. Several models were built using algorithms such as Random
Forest and Decision Tree. The Decision Tree achieved a precision of 0.80, a
recall of 0.92, and an F1-measure of 0.85. This model helped reduce student
exits from the institute by focusing on the students who required proper
guidance, so that they could improve their learning and academic performance.
11. Student dropout is increasing, and at the school level it occurs mostly
among students of standards IX and X, with more dropouts in standard IX than
in standard X. There are many reasons, such as family issues, academic
pressure, and money problems. To reduce this, author Mahesh Mardolkar [10]
provided a method to tackle these dropout issues. He preferred KNN as the
prediction method, as it is easy to implement and handles various types of
data. RStudio was used to present the analysis graphically. A dropout/no-dropout
category was created to help teachers identify students who are about to drop
out, so that plans can be made to reduce their issues and help them focus on
their studies.
12. In this paper, the author G. A. S. Santos [11] described the situation in
Brazilian society. As students drop out, the country faces a financial loss,
since universities and institutes are financed by public resources. Therefore,
a model called EvolvedDTree was created using various machine learning
techniques. The model gave good results in terms of average F-score and 95%
accuracy. A genetic algorithm was used, along with decision trees and cluster
stratified sampling. Students with a GPA of 5.9 who had spent a year in the
institute were the at-risk students, of whom one third were first-year
students; the authors therefore suggested monitoring these at-risk students.
13. In Chile, dropout was a concern that caused many problems and costs for
institutes and people. In this paper, author Felipe A. Bello [12] used ML
techniques to prevent student dropout. The dropout rate was 21.9% in a
six-year study of an informatics engineering course. Four cohorts of data were
used for random forest feature selection, after which a decision tree was
built using the identified features. The accuracy of the tree was 97.21%, but
81.01% on new data. Some of the factors they found were related to studies and
academics. Several variables were considered, and from these they identified
the variable of highest value. Cultural and socio-economic activities were
among the crucial factors, and based on the behavior of these variables they
predicted whether a student would drop out or not.
14. The author Lin Qiu [13] stated that MOOCs provide a good option for
learners worldwide, but dropout rates are also increasing. Traditional methods
were heavily manual, involved a lot of work, and were not very effective at
prediction. Therefore, the author created a two-dimensional CNN-based model
called DP-CNN. The data consisted of students browsing different pages on the
sites, i.e., clickstream data, from which student information was extracted.
The results were better than those of other models, and the proposed model was
end-to-end, which reduced the complexity of the prediction pipeline.
15. This paper gives ways to prevent dropout in online learning, which is a
critical situation for any institute. The author Alvaro Ortigosa [14]
introduced models to prevent student dropout based on the C5.0 algorithm. The
data was collected from 11,000 students over five years. They introduced a
model called SPA and obtained 11,700 risk scores from more than 5,700
students, and recorded 13,000 retention actions. They also used a white-box
prediction method in production, which gave better results. They share lessons
learned from the problems faced while building the model, to help those who
want to create a complete dropout system in real-world cases.
16. The following entries summarize further related work (author, year, proposed work, model used, accuracy):

1. Vinayak Hegde, Prageeth P P [1] (2018): The increase in student dropout from educational
institutes became a major concern. This paper predicts dropout using the Naive Bayes
classification algorithm, implemented in R. The collected data was processed with
data-preprocessing techniques, and the important factors in student dropout were considered.
Dropout is predicted at an early stage so that institutes can rescue the student's future.
Model used: Naive Bayes. Accuracy: correctly classified instances: 36 (72%); incorrectly
classified instances: 14 (28%); out of 50 total instances.

2. Nafisa Tasnim, Mahit Kumar Paul, A. H. M. Sarowar Sattar [2] (2019): The authors proposed a
threshold-based approach to recognize student dropout. While extracting the required features,
the corresponding information gains and attribute values are computed, from which a threshold
value is calculated. Two methods were used: the original dataset, and the dataset after
detecting outliers. Model used: threshold-based approach. Precision on the original data:
Dataset A: Threshold 0.9514, LR 0.9420, Naïve Bayes 0.8927, SVM 0.9450; Dataset B: Threshold
0.9646, LR 0.9628, Naïve Bayes 0.8841, SVM 0.9753. Precision after outlier removal: Dataset A:
Threshold 0.9645, LR 0.9639, Naïve Bayes 0.9358, SVM 0.9788; Dataset B: Threshold 0.9845,
LR 1.0000, Naïve Bayes 0.9763, SVM 1.0000.

3. Marcell Nagy, Roland Molontay [3] (2018): The authors used various machine learning
algorithms to identify at-risk students and predict dropout from institutes, using
enrolment-time data. Missing data was handled by imputation. After feature selection and
extraction, various classifiers were trained with various input settings and tested using
10-fold cross-validation. Models used: feature selection and extraction, tree-based algorithms,
Naive Bayes, KNN, linear models, deep learning. Accuracy: Decision Tree 63%, Random Forest
65.5%, Generalized Linear Model 67%, Naïve Bayes 68.3%, Adaptive Boost 68.8%, k-NN 69%,
Logistic Regression 70.3%, Gradient Boosted Trees 70.6%, Deep Learning 73.5%.
3. Methodology
This section explains how we determined who the "at-risk students" were, as shown in Fig. 1.

Fig. 1 Workflow diagram

3.1 Data Procurement


The ultimate goal of this project is to locate potential dropouts, which necessitates obtaining
comprehensive data on each enrolled student. This comprehensive data was acquired from the
UDISE and included demographic information about the students. This system documents the
academic path of every student enrolled in one of the roughly 1.5 million public and private
schools in India [8]. The dataset used in this study, covering classes 10 and 11, was provided by
the block Bodwad in the state of Maharashtra. Bodwad town serves as the headquarters of Bodwad
Taluka, which is located in the Jalgaon District of Maharashtra State, India [10] (Table 1).

Table 1 Number of schools in district Bodwad


The 20,000 records in the dataset were obtained using the UDISE Student Data Capture Format.
(Annexure I—the DCF's first page) [8]. The following variable values were first taken into
account when creating the predictive model:
1. Date of birth
2. Date of admission
3. Gender
4. Disadvantaged∗
5. Social category (general, scheduled caste, scheduled tribe, other backward classes)∗
6. Religion (Hindu, Muslim, Christian, Sikh)
7. Below poverty line (BPL)∗
8. Physical disability
9. Free education beneficiary
10. Attendance records
11. Examination marks

To diversify the students' data, several modifications were made to these variables. As UDISE only
began gathering student data in 2016–17, examination and attendance information were not
accessible. Therefore, dummy data was constructed, and random values between 0 and 100% were
entered. To train our models for this study, 17% of the database's students were selected at
random, in line with the dropout rates reported in the literature review, and designated as output
class "dropped out."
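The dummy-data construction described above can be sketched as follows. This is an illustrative
sketch under our own assumptions, not the actual UDISE pipeline: the field names and the helper
function are hypothetical, attendance and examination values are drawn uniformly from 0–100, and a
random 17% of students are labeled as dropped out.

```python
import random

def make_dummy_records(n, dropout_frac=0.17, seed=1):
    """Generate placeholder attendance/exam values in [0, 100] and mark
    a random fraction of students as dropped out, mirroring the setup above."""
    rng = random.Random(seed)
    records = [{"attendance": rng.uniform(0, 100),
                "exam_marks": rng.uniform(0, 100),
                "dropped_out": 0} for _ in range(n)]
    # Label a random 17% of the students as the "dropped out" output class.
    for record in rng.sample(records, int(n * dropout_frac)):
        record["dropped_out"] = 1
    return records

records = make_dummy_records(100)
```

A fixed seed is used only to make the sketch reproducible; the study itself simply drew random values.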
3.2 Data Analysis

Data Cleaning: Errors such as redundant rows, incorrect attribute values, and missing data are
unavoidable when collecting large amounts of data. The following methods are employed to address
these errors:

• The dataset's duplicate rows (which had the same feature set) were eliminated.
• Random values unique to these characteristics were assigned if a row containing three or fewer
features had null or incorrect values. The row was eliminated from the dataset if more than three
attributes had inaccurate or null values.
17,359 students made up the dataset once the data was cleaned.
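The two cleaning rules above (drop duplicates; impute rows with at most three bad values, drop the
rest) can be sketched in plain Python. This is an illustrative sketch under our own assumptions —
rows represented as dicts and missing values as None — not the exact pipeline used in the study.

```python
import random

def clean(rows, max_missing=3, seed=0):
    """Drop duplicate rows; impute rows with <= max_missing missing values
    with random values, and drop rows with more missing values."""
    rng = random.Random(seed)
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:          # duplicate feature set -> eliminate
            continue
        seen.add(key)
        missing = [k for k, v in row.items() if v is None]
        if len(missing) > max_missing:
            continue             # too many bad attributes -> drop the row
        for k in missing:        # otherwise assign a random stand-in value
            row[k] = rng.uniform(0, 100)
        cleaned.append(row)
    return cleaned
```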

Feature Analysis: Analyzing the characteristics that lead students to discontinue their studies is
crucial for forecasting student attrition. Doing so reduces the noise in the dataset, which
improves the models' effectiveness.
Using the "date of birth" data, a new feature called "age" was created. The correlation between
each characteristic in the dataset and the output label was determined, as seen in Figs. 2, 3, 4, and
5.
The figures below suggest that age and disability had nearly equal correlations with output class
for their respective values; hence, these factors were removed.
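The per-feature correlation with the output label can be computed with the Pearson coefficient. A
minimal stdlib sketch of such a helper (our own illustration; the study's exact correlation
measure is not stated):

```python
def pearson(xs, ys):
    """Pearson correlation between a feature column and the output label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Features whose correlations are nearly identical across their values, as observed for age and
disability, carry little signal and can be dropped.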

Fig. 2 From left to right: correlation of religion, social category, and attendance with output class

Fig. 3 Correlation of homelessness (left) and below the poverty line (right) with output class
Fig. 4 From left to right: correlation of disadvantaged, free education recipient, disability, and gender
with output class

Fig. 5 Correlation of exam marks (left) and age (right) with output class

As seen in Figs. 6, 7, 8, 9, and 10, data is further analyzed by visualizing in the form of bar
graphs and histograms.
The frequency charts above make it clear that low exam scores and low attendance are the main
causes of student dropout. In addition to this, female students and those living below the poverty
line are more likely to drop out of school.
Nine features were chosen as a result of the feature analysis, as listed below:

1. Attendance
2. Examination marks
3. Gender
4. Homelessness
5. Below the poverty line
6. Religion
7. Social category
8. Disadvantaged
9. Free education beneficiary.
Fig. 6 Comparison of dropped-out and retained students based on gender

Fig. 7 Comparison of dropped-out and retained students based on below poverty line (left) and
disadvantaged (right)

Fig. 8 Comparison of dropped-out and retained students based on homelessness (left) and free education
(right)
Fig. 9 Comparison of dropped-out and retained students based on religion (left) and social category
(right)

Fig. 10 Comparison of dropped-out and retained students based on attendance (left) and examination
marks (right)

Data Preprocessing: To prepare the obtained data for training the prediction model, it is
preprocessed. Due to their categorical nature, gender, socioeconomic class, and religion were
one-hot encoded. The free-education and below-poverty-line attributes were mapped to the binary
values 1 and 0, standing for YES and NO. Examination marks and attendance were normalized using the
Z-score to bring all feature values to the same scale. Since each feature is given the same weight,
standardization of the input data guarantees faster convergence of the model [11].
The following equation is used to compute the Z-score:

z = (x − μ) / σ    (1)
where:

x: Original value
μ: Mean of data
σ: Standard deviation of data
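The three preprocessing steps — one-hot encoding, binary YES/NO mapping, and Z-score normalization
per Eq. (1) — might look like this in plain Python. The category list and field names are
illustrative assumptions, not the exact encoding used in the study.

```python
def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def z_score(values):
    """Normalize a feature column with z = (x - mu) / sigma, as in Eq. (1)."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

RELIGIONS = ["Hindu", "Muslim", "Christian", "Sikh"]  # categories from the dataset

# Encode one (hypothetical) student record: one-hot religion + binary BPL flag.
row = {"religion": "Muslim", "bpl": "YES", "exam_marks": 55}
encoded = one_hot(row["religion"], RELIGIONS) + [1 if row["bpl"] == "YES" else 0]
```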
Deploying Machine Learning Algorithms: Supervised learning algorithms have been utilized to
classify students into two groups based on their likelihood of dropping out or staying. Supervised
algorithms operate by learning how to map input variables to output variables. The algorithm learns
the relationship between the input and output variables and iteratively predicts the outcome on
fresh data until the error is reduced to a manageable level [12].
The dataset obtained from UDISE is randomly split into training and testing sets in a 3:2 ratio.
Logistic regression, k-nearest neighbors, support vector machines, neural networks, and gradient
boost classifiers are the techniques utilized for this task. The model with the highest weighted
accuracy (6) is selected.
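The 3:2 random split can be sketched with the standard library. This is a hypothetical helper; the
study does not specify a seed, which we fix here only so the sketch is reproducible.

```python
import random

def split(records, train_frac=0.6, seed=42):
    """Randomly split records into training and testing sets in a 3:2 ratio."""
    shuffled = records[:]                 # copy so the original order is kept
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```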

K-Nearest Neighbors (KNN): Using the training data, a student in the testing set is classified by
finding its k (a predetermined positive integer) nearest neighbors. The distance between the
training data and the testing sample determines the neighbors. The k nearest neighbors then vote to
determine the testing sample's output class [13]. In this study, we computed the neighbors using
the Euclidean distance between the feature vectors of the training data and the testing sample.

Fig. 11 Plot of weighted accuracy over a range of values of k

Euclidean distance = √( Σ_{i=1..n} (x_i − y_i)² )    (2)

where:

x_i: ith attribute of student A

y_i: ith attribute of student B

n: number of attributes of each student.

This model gave the best weighted accuracy at k = 7 (Fig. 11).
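A minimal version of the KNN classification described above — Euclidean distance as in Eq. (2) and
a majority vote at k = 7 — could be written as follows. This is an illustrative stdlib sketch, not
the implementation used in the study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors, as in Eq. (2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, sample, k=7):
    """train: list of (feature_vector, label) pairs.
    Classify sample by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda fl: euclidean(fl[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```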

Logistic Regression: This statistical model estimates the probability that a given student belongs
to a class in order to forecast the output label. The probability is computed with the sigmoid
function [14]. Let β and X represent the feature weight vector and the training set, respectively.
The probability that the ith student belongs to the positive class in a binary classification is:

P(y_i = 1 | X, β) = 1 / (1 + e^(−βᵀX))    (3)

Stochastic average gradient descent, which converges more quickly than conventional stochastic
gradient descent because it incorporates previous gradient values, is used here for training, where
the feature weights are adjusted [15].
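The sigmoid probability above can be evaluated directly. A small sketch, with hypothetical feature
weights chosen only for illustration (negative weights on attendance and marks, so that lower
values raise the predicted dropout probability):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout_probability(beta, x):
    """P(y = 1 | x, beta) = sigmoid(beta . x), the model's positive-class probability."""
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))
```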

Support Vector Machine (SVM): It performs classification by identifying a hyperplane that
maximizes the margin between the two classes. Support vectors are the vectors closest to this
hyperplane [16]. This is accomplished by selecting an appropriate kernel that yields the best
results while transforming the data into the necessary form [17]. The complete dataset is subjected
to tenfold cross-validation to identify the optimal kernel among the Gaussian, polynomial, linear,
and radial basis functions; the linear kernel is found to provide the best weighted accuracy.
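The tenfold cross-validation used for kernel selection relies on splitting the data into ten
folds, training on nine and testing on the tenth. A stdlib sketch of the fold generator (our own
illustrative helper, not the study's code):

```python
def ten_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation;
    the last fold absorbs any remainder when n is not divisible by k."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[(k - 1) * fold:]
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        yield train, test
```

Each candidate kernel would be scored by averaging its weighted accuracy over the ten held-out folds.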

Decision Trees and Gradient Boost: Decision trees work by building a graphical tree structure in
which decisions are made iteratively at each node until a leaf node is reached [18]. Leaves
represent class labels, while the other nodes represent dataset attributes, with a branch for each
possible outcome. In the testing stage, we start at the root and take the branch determined by the
feature values at each node; on arriving at a leaf, the student is categorized as either retained
or dropped out. Decision trees are known to handle categorical and numerical data without the need
for label or one-hot encoding.
The greedy gradient boost technique minimizes the model's loss by adding weak classifiers to the
model [19]. Weak classifiers, such as decision trees, are added one after the other while the
ensemble's existing classifiers are kept frozen. As trees are added, gradient descent minimizes the
loss function.
A set of 50 decision trees with a learning rate of unity was employed.
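The sequential "fit a weak learner to the residuals while freezing the ensemble" idea can be
illustrated with decision stumps under squared loss. This is a toy sketch of the boosting
mechanism, not the exact classifier configuration used in the study; the 50 trees and unit learning
rate are carried over only as defaults.

```python
def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=50, lr=1.0):
    """Add stumps one by one; earlier stumps stay frozen while each new
    stump is fitted to the current residuals (negative gradient of squared loss)."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return stumps, pred
```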

Neural Network: This machine-learning model aims to replicate the functionality of the human brain
[11]. Neural networks are arranged in layers, with one or more neurons in each layer. The input
layer presents the pattern, which is then passed to the hidden layers where the data is processed.
In a binary classification problem, one neuron in the output layer uses an activation function to
determine the correct class label. The error is computed, and the parameters at each layer are
adjusted using backpropagation. In this manner, the network receives several batches of training
data and becomes trained after a finite number of epochs [11]. Neural networks make it simple to
approximate nonlinear functions.
A batch of 200 training samples is supplied to the network in each epoch during training. The
training data is reshuffled every epoch, since the model learns most quickly from the most
surprising data [11]. The network with two hidden layers of nine neurons each produced the best
results. The sigmoid activation function (3) has been employed in both the output and hidden
layers.
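A forward pass through the described topology — nine inputs, two hidden layers of nine sigmoid
neurons each, and one sigmoid output neuron — can be sketched as follows. The weights here are
randomly initialized for illustration only; they are not trained values, and the helpers are our
own.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each neuron outputs sigmoid(weighted sum of inputs + bias).
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def init_params(sizes, seed=0):
    """Random (untrained) weights/biases for consecutive layer sizes."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.uniform(-1, 1) for _ in range(n_out)]
        params.append((W, b))
    return params

def forward(x, params):
    h = x
    for W, b in params:
        h = layer(h, W, b)
    return h[0]  # single output neuron: probability-like dropout score

params = init_params([9, 9, 9, 1])          # 9 inputs, two hidden layers of 9, 1 output
p = forward([0.5] * 9, params)
```

Training would repeat such forward passes over shuffled batches of 200 samples, adjusting the
parameters by backpropagation after each batch.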
