
MADDA WALABU UNIVERSITY

COLLEGE OF COMPUTING AND INFORMATICS


DEPARTMENT OF COMPUTER SCIENCE
MSC PROGRAM

BIG DATA ANALYTICS RESEARCH PROPOSAL


ON
PREDICTING ADABA WOREDA STUDENTS' ACADEMIC
SUCCESS USING BIG DATA AND MACHINE LEARNING.

Done By: Abu Husen

ID: PGE/49446/15

2016 E.C

Submitted to: Dr. Kuulaa Qaqqabaa

Madda Walabu University, Ethiopia


February 25, 2024
ABSTRACT
Predicting students' academic success is one of the central topics of Educational Data
Mining, which aims to extract useful information and new patterns from educational data.
Understanding the drivers of student success may help educators develop pedagogical
methods, and it provides a tool for personalized feedback and advice.

This proposal describes the development of a system to predict student academic success or
failure by utilizing various factors such as personal information, academic evaluation, student
activities, environment, and attendance.

Machine learning algorithms such as KNN and SVM were initially compared and used, but were
found to be insufficient because of the growing number of students and the diversity of the data. To overcome
this issue, Big Data technology was implemented to distribute processing and improve efficiency
without compromising accuracy. The system achieved a recognition rate in predicting student
success, demonstrating the effectiveness of the approach in monitoring and enhancing student
performance.

Several factors have been proposed to influence student academic success, for example the
student's gender, previous educational background, the existence of a special statute, and the
parents' educational level. The data belong to Adaba Secondary School students, under the
Adaba Woreda Education Office, for the 2013 to 2015 academic years. In addition, the study
examined which factors are the strongest predictors of students' academic success.

Keywords: Data Mining, Big Data Analytics, Machine Learning, SVM, KNN, Education,
Prediction, Accuracy

TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
1.1 BACKGROUND OF THE STUDY
1.2 STATEMENT OF PROBLEM
1.3 OBJECTIVE OF THE STUDY
1.3.1 General Objective
1.3.2 Specific Objectives
1.4 RESEARCH QUESTION
1.5 IMPORTANCE OF THE STUDY
1.6 SCOPE AND LIMITATIONS OF THE STUDY
1.6.1 Scope of the Study
1.6.2 Limitations of the Study
1.7 Stakeholders of the project
2. LITERATURE REVIEW
3. METHODOLOGY
3. System architecture used for student prediction
4. REFERENCES

LIST OF FIGURES

Figure 1: Student Intake and flow from Primary to Secondary Schools
Figure 2: Performance on Learning Assessments
Figure 3: Adaba Woreda Schools Stakeholders in predicting student academic success
Figure 4: System Architecture developed to predict student academic success

1. INTRODUCTION
Data analysis is an analytical tool that includes a comprehensive and sophisticated set of
procedures and algorithms for extracting meaningful information from studied data. In recent
years, it has been employed in almost every sector, including health, economics, social services,
human resources, education, industry, and government [1]. In Ethiopia, education is an area
where a huge amount of data is generated and accumulated. As a result, Big Data technologies
are being used in the education sector; this application is known as Educational Data Mining
because it mines educational data. It supports prediction, grouping, association
extraction, model discovery, and presentation of data. It is utilized for diverse objectives,
including assessing learners and developing.
My research proposal focuses on students' academic achievement. Student academic
performance is a key indicator of educational quality and institutional success. A successful
student is one who has completed their program and passed every semester at school. Student
academic success is characterized as a set of indicators that capture engagement, assessment
completion, and learning. In this proposal, however, I use the grade point average (GPA) of the
semesters or the total mark transfer to quantify student academic progress.
In this proposal, I present a method and machine learning algorithms for predicting the
academic success of secondary school students, because students face many changes, both in
teaching methods and in evaluation methods, and need assistance to be successful throughout
their educational life cycle. A variety of factors influence students' academic progress. I divide
them into five categories: students' personal information, academic evaluation, students'
activities in school, psychological factors, and environmental factors. I then use feature
selection methods to identify the features that are effective for predicting student academic
success.
The present research proposal is structured as follows. The opening section explains why Big
Data Analytics (BDA) and Machine Learning are vital for improving research. This part
comprises the problem, study scope, limitations, questions, and research deliverable
methodologies. The Literature Review provides background for the issue and summarizes past
research in this area. The Methodology section describes the technique used, as well as the
models and performance measures used to illustrate the big data and machine learning
approach. Finally, the Conclusion section offers the work's principal findings.

1.1 BACKGROUND OF THE STUDY

In Ethiopia, the academic year begins in September and ends in July, and the official primary
school entrance age is 7. The system is structured so that the primary school cycle lasts 6 years,
lower secondary lasts 4 years, and upper secondary lasts 2 years. Ethiopia has a total of
21,418,000 students enrolled in primary and secondary education. Of these students, about
16,200,000 (76%) are enrolled in primary education. The study shows the highest level of
education reached by youth ages 15-24 in Ethiopia. Although youth in this age group may still be
in school and working towards their educational goals, it is notable that approximately 16% of
youth have no formal education and 54% of youth have attained at most incomplete primary
education, meaning that in total 70% of 15-24 year olds have not completed primary education in
Ethiopia [2].

Figure 1: Student Intake and flow from Primary to Secondary Schools


School Participation and Efficiency
The percentage of out-of-school children in a country shows what proportion of children are not
currently participating in the education system and are therefore missing out on the benefits of
school. In Ethiopia, 25% of children of official primary school age are out of school. The profile
also reports the proportion of children out of school by different characteristics wherever data is
available. For example, approximately 25% of boys of primary school age are out of school,
compared to 25% of girls of the same age. For children of primary school age in Ethiopia, the
biggest disparity is between the poorest and the richest children. Nearly 55% of female youth of
secondary school age are out of school, compared to 46% of male youth of the same age. For
youth of secondary school age, the biggest disparity is likewise between the poorest and the
richest youth.

The Performance on Learning


This section provides information on indicators of learning, which lends insight into the quality
of educational provision. In this profile, learning is measured through literacy rates, which are
important because literacy is a foundational skill needed to attain secondary levels of learning,
and national performance on learning assessments. According to the UNESCO Institute for
Statistics (UIS), compared to other countries, Ethiopia ranks at the 29th percentile in access and at
the 8th percentile in learning. The data source compares youth and adult literacy rates and shows
that, in Ethiopia, the literacy rate is 55% among the youth population; this is lower than the
average youth literacy rate in other low income countries.

Figure 2: Performance on Learning Assessments


According to the Adaba Woreda Education Office (AWEO), the 5000 students registered in
secondary schools in 2015 represented the lowest number of enrolments in secondary
education, reaching the lowest rate in the last decade (2013-2015). In 2016 the trend continued,
and the Woreda had 4571 students enrolled in Secondary Education (SE). Grade repetition has
also been identified by the Office for Education Growth and Development as one of the main
problems of the Woreda education system. The Office reports that the share of early school
leavers (15-24 year-olds in the Woreda who have not completed 1st cycle and 2nd cycle
secondary education and are not enrolled in any further training or education) is substantial, and
that many of them fail to pursue additional training. One of the main goals set by AWEO is the
reduction of student dropout and year repetition rates, together with metrics to measure success
in improving equity, performance, and school dropout rates.

The prediction of students' academic success relies on different features. The most
frequently used features are the grade point average (GPA) and internal assessments (such as
exam marks, assignment marks, and quizzes), followed by student demographic data (such as
gender, age, and residence) and external assessments (such as the final exam mark for a specific
subject). Moreover, high school background, scholarship, and extracurricular activities are also
used by researchers [3].

The study presents a data mining methodology and machine learning algorithms to create a
model that predicts Adaba Secondary School students' academic success, a key indicator of the
quality of education and institutional success, for the Adaba Woreda Education Office in West
Arsi Zone, which currently has approximately 4571 students enrolled in secondary school
programs.

In general, the study recognizes developments in data analytics and machine learning, as well as
the importance of properly applying these tools to improve educational outcomes for students
enrolled in the Woreda's schools, based on their total mark transfer, grade point average (GPA),
and semester assessments. The study's purpose is to contribute to the field by developing accurate
and reliable prediction models that can improve educational practices and decision-making,
resulting in improved student performance and satisfaction.

1.2 STATEMENT OF PROBLEM

Predicting student success with big data and machine learning algorithms is a problem that seeks
to use the quantity of data available in educational institutions to uncover trends and
characteristics that influence student progress. This issue is essential because it has the potential
to transform education by enabling educators to proactively identify students who are at risk
of academic failure or require further assistance. Schools can use big data and machine learning
algorithms to create individualized interventions that improve student results and address
individual needs.

The challenge lies in effectively analyzing large volumes of data and applying sophisticated
machine learning algorithms to identify meaningful patterns and relationships. It requires
expertise in data collection, data preprocessing, feature engineering, model selection, and
evaluation. Additionally, ethical considerations regarding data privacy and bias need to be
addressed to ensure that the predictions and interventions are fair and reliable.

Overall, the problem of predicting student academic success using big data and machine learning
algorithms presents an exciting opportunity for educators to leverage technology and data mining
insights to support student achievement and promote educational equity. By utilizing big data
technologies, the system achieved a remarkable recognition rate in predicting student success,
showcasing the effectiveness of the approach in monitoring and improving student outcomes.

1.3 OBJECTIVE OF THE STUDY

1.3.1 General Objective

The main objective of the study is to predict the academic success and failure of Adaba Woreda
secondary school students by creating accurate and reliable models that identify the factors
influencing student success, such as personal information, academic evaluation, student
activities, environment, and attendance.

1.3.2 Specific Objectives

By leveraging the power of big data and machine learning, the specific objectives are to create
predictive models that can:
1. Identify at-risk students: This early identification allows educators to intervene and
provide the necessary support to improve student outcomes.
2. Personalize interventions: Predictive models can help identify the specific areas where
students are struggling or where they need additional support.
3. Guide resource allocation: This ensures that resources are directed towards the students
who need them the most, improving overall efficiency.
Overall, the objective is to use big data and machine learning algorithms to inform decision
making, personalize interventions, and optimize educational practices to maximize student
success and improve educational outcomes.

1.4 RESEARCH QUESTION
The following research questions are addressed and explored in the proposal:
Q1: At which academic level is students' academic success predicted?
Q2: What key features are used to predict students’ academic success?
Q3: What machine learning models are used for the prediction?
Q4: Which model performs best?
Q5: How accurate and reliable are the prediction models in forecasting student success?
These research questions aim to explore the effectiveness, impact, and implications of using big
data analytics and machine learning to predict student academic success.

1.5 IMPORTANCE OF THE STUDY

The study on predicting student success addresses the critical issue of academic failure among
students by providing a tool for early prediction. By accurately predicting student success or
failure, teachers and administrators can identify at-risk students and provide them with the
necessary assistance to improve their academic performance and success.
Generally, the study of predicting student success using big data analytics and machine learning
is important because it enables early intervention, supports personalized learning, optimizes
resource allocation, informs data-driven decision-making, promotes educational equity,
facilitates continuous improvement, and advances research in educational data analytics.

1.6 SCOPE AND LIMITATIONS OF THE STUDY
1.6.1 Scope of the Study
The scope of the study on the prediction of student success using big data analytics and machine
learning can encompass various aspects related to student outcomes and the application of
predictive models. Here are some dimensions within the scope that the study considers:
Predictive Variables: These predictors can include academic-related factors (grades, test
scores), demographic information, socio-economic status, attendance records, behavioral data,
learning styles, engagement metrics, and other relevant data sources.
Prediction Models: It may consider various techniques, such as decision trees, KNN, neural
networks, random forests, or SVM models, and compare their performance in predicting student
success outcomes.
Student Success Outcomes: The scope may include a specific focus on academic outcomes,
such as total grade mark, grade point average, retention rates, or graduation rates.
Evaluation Metrics: Common evaluation metrics include accuracy, precision, recall, F1-score, or
other relevant measures that indicate the predictive power of the models.
The identified dimensions of the scope will allow researchers to generate meaningful insights
and contribute to the field of predicting student success using big data analytics and machine
learning.
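To make the evaluation metrics concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed for a binary success/failure prediction. The function name `evaluate`, the `Metrics` container, and the sample labels are assumptions introduced for illustration only; they are not results from the Adaba dataset.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Hypothetical labels (illustrative only): 1 = success, 0 = failure.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
print(evaluate(actual, predicted))
```

Precision and recall are reported alongside accuracy because, when most students succeed, a model can reach high accuracy while still missing many of the at-risk students.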

1.6.2 Limitations of the Study


This proposal points out that the methods used for predicting student success may not be fully
comprehensive given the growing number of students, specialties, learning methods, and the
diversity of data sources, which can impact data processing time.
Some potential limitations of the study include Data Quality and Availability, Interpretability of
Models, Human Factors and Bias, Complex Nature of Student Success, and Overreliance on
Models. Recognizing these limitations is essential for a comprehensive understanding of the
potential pitfalls and challenges associated with the prediction of student success using big data
analytics and machine learning. It allows researchers and practitioners to address these
limitations and potential biases, ensuring responsible and ethical use of these techniques in
education.

1.7 STAKEHOLDERS OF THE PROJECT
There are several key stakeholders who have a vested interest in the outcomes and potential
impacts of the project. The key stakeholders can include:

• Parents and Students: Parents and students are primary stakeholders who have a strong
interest in student success. Their vision typically revolves around obtaining high-quality
education, achieving academic goals, and personal growth.
• Teachers and Educators: They may see the potential of predictive models in optimizing
their teaching strategies, providing targeted support, and adjusting instructional approaches
to meet individual student requirements.
• Educational Institutions: Local schools and other educational institutions are crucial
stakeholders in this project. Their vision and strategic goals generally revolve around
providing quality education, enhancing student outcomes, and promoting student success.
• Administrators and Policy Makers: School administrators, district officials, Peer Groups,
and policymakers have a vested interest in ensuring educational excellence, equitable
resource allocation, and improved student outcomes.

Figure 3: Adaba Woreda Schools Stakeholders in predicting student academic success

2. LITERATURE REVIEW
Various studies have highlighted the potential of big data analytics and machine learning in
identifying patterns and generating accurate predictions about student performance and outcomes.

Researchers have explored different data sources and variables to predict student success.
Prior work discusses the challenge of analyzing data effectively given the increasing number of
students and the diversity of data sources [4]. Studies have shown that the combination of
multiple data sources improves the accuracy of predictions.
Previous studies have explored the prediction of student performance using KNN, Neural
Network, Decision Trees, Regression, and SVM, highlighting the importance of factors such as
internet behavior and registration data [5]. Different research axes have been identified,
including factors influencing academic success prediction and data mining methods for model
construction [6].
Factors considered in predicting student success include personal information, academic
evaluation, student activities, psychological aspects, and environment. Various Big Data Analysis
techniques and machine learning algorithms have been employed in predicting student success,
such as SVM and KNN [7].
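As an illustration of how the KNN classifier mentioned above makes a prediction, here is a minimal sketch using only the Python standard library. The feature choice (a scaled previous mark and an attendance rate), the sample rows, and the function name `knn_predict` are hypothetical and are not taken from the cited studies or from the Adaba data. The sketch assumes the features have already been scaled to comparable ranges.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training rows")
    # Euclidean distance from the query to every training row.
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled rows: [previous_mark, attendance_rate].
train_X = [[0.90, 0.95], [0.80, 0.90], [0.85, 0.80],
           [0.40, 0.60], [0.35, 0.50], [0.50, 0.40]]
train_y = ["success", "success", "success",
           "failure", "failure", "failure"]

print(knn_predict(train_X, train_y, [0.82, 0.88], k=3))
```

An odd value of k is used here so that a two-class vote cannot tie. Because every prediction scans the whole training set, the cost grows with the number of students, which is one reason distributed processing becomes attractive at scale.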

Researchers have used various machine learning algorithms to predict student success. Models
have been developed to predict various outcomes, such as academic achievement, dropouts,
course completion, graduation rates, and engagement [8].
Researchers have employed techniques like correlation analysis, feature importance ranking, and
domain expert input to determine the most relevant features for prediction. Feature engineering
approaches have also been explored to derive new features that capture complex relationships
within the data [9].
Studies have evaluated the performance of prediction models using metrics like accuracy,
precision, recall, and F1-score. Researchers have compared the performance of different algorithms
and explored ensemble models to improve prediction accuracy. Cross-validation and holdout
validation techniques have been used to assess model performance and generalizability [10].
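The cross-validation idea can be sketched as follows: the rows are shuffled once and divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. The helper name `k_fold_indices`, its defaults, and the fixed seed are assumptions made for this illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 10 student records split into 3 folds.
for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(test_idx))
```

Averaging a metric over the k test folds gives a more stable estimate of generalizability than a single holdout split, at the cost of training the model k times.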
While there are promising findings in the literature, it is important to note that the field is
evolving, and further research is needed to advance the application of big data analytics and
machine learning in predicting student success.

3. METHODOLOGY
In the field of education, predicting student success using big data analytics involves utilizing
large data sets to identify patterns and make informed predictions about a student's academic
performance and future success. Here is a methodology that can be used in the prediction of
student success using big data analytics:
1. Data Collection: Collect relevant data about students, including demographics, academic
records, attendance, behavioral data, extracurricular activities, and any other information that
may be useful in predicting student success.
2. Data Preprocessing: Clean and preprocess the collected data to eliminate errors,
inconsistencies, and missing values. This step also involves transforming the data into a
suitable format for analysis.
3. Feature Selection: Identify the most relevant features that can be used to predict student
success. This can be done through techniques such as correlation analysis, feature importance
ranking, or domain expertise. Three types of feature selection methods, namely filter,
wrapper, and embedded methods, are compared to improve classification results.
4. Data Analysis: Apply various data analysis techniques, such as statistical analysis, data
mining, machine learning, and predictive modeling, to uncover patterns and relationships in
the data that can be used to predict student success.
5. Model Development: Develop prediction models using the selected features and the
appropriate machine learning algorithms. These models should be trained on a subset of the
data and validated using another subset to ensure their accuracy and reliability.
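The preprocessing, feature selection, and train/validation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposal's implementation: missing values are filled with the column mean, feature selection uses a simple filter method based on Pearson correlation, and all column names, values, and the 0.5 threshold are hypothetical.

```python
import math
import random


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, target, threshold=0.5):
    """Filter method: keep features whose |correlation| with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


def holdout_split(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into training and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


# Hypothetical records; column names and values are illustrative only.
raw = {
    "attendance_rate": [0.95, 0.60, None, 0.90, 0.50, 0.85, 0.40, 0.92],
    "previous_mark": [82, 48, 75, 88, 40, None, 35, 90],
    "roll_number": [1, 2, 3, 4, 5, 6, 7, 8],  # deliberately uninformative
}
passed = [1, 0, 1, 1, 0, 1, 0, 1]

clean = {name: impute_mean(values) for name, values in raw.items()}
selected = select_features(clean, passed, threshold=0.5)
train_idx, valid_idx = holdout_split(len(passed))
print(selected, len(train_idx), len(valid_idx))
```

A filter method is shown because it is the cheapest of the three families named in step 3. Wrapper and embedded methods would instead score feature subsets through the model itself.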

3. System architecture used for student prediction
The system architecture for student prediction using big data analytics and machine learning may
involve several components working together to process, analyze, and predict student outcomes.
Here is a high-level overview of typical system architecture:

[Figure 4 diagram. Components shown: Student Dataset (LMS, School Data), Data Collection,
Data Preprocessing, Data Analyzing, Feature Selection, Storage in HDFS, Classification,
Model Development, Model Evaluation, Decision Making.]

Figure 4: System Architecture developed to predict student academic success
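To illustrate why partitioned storage helps with processing, the sketch below computes per-school pass rates in a map-and-reduce style: each partition is summarized independently, and the partial summaries are then merged. It is plain Python run on one machine and does not use any Hadoop or HDFS API; the partition layout, school names, and function names are assumptions made for this illustration. In a distributed deployment the map step would run on the nodes holding each partition.

```python
from collections import Counter
from functools import reduce


def map_partition(records):
    """Map step: count passes and totals per school within one partition."""
    counts = Counter()
    for school, passed in records:
        counts[(school, "total")] += 1
        if passed:
            counts[(school, "passed")] += 1
    return counts


def merge(left, right):
    """Reduce step: combine two partial count tables."""
    return left + right


def pass_rates(partitions):
    """Map over every partition, reduce the partial counts, and derive pass rates."""
    totals = reduce(merge, map(map_partition, partitions), Counter())
    schools = {school for school, _ in totals}
    return {s: totals[(s, "passed")] / totals[(s, "total")]
            for s in sorted(schools)}


# Hypothetical partitions of (school, passed) records.
partitions = [
    [("School A", True), ("School A", False), ("School B", True)],
    [("School A", True), ("School B", False), ("School B", False)],
]
print(pass_rates(partitions))
```

Because each partition is summarized on its own, only the small count tables need to be moved and merged, not the raw student records.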

4. REFERENCES

[1] The Future of Big Data and Analytics in K-12 Education, "Education Week," big-data-and-
analytics.
[2] World Bank, National Education Profile, 2018.
[3] A. S. A. W. A. &. H. A. K. Hashim, "Student Performance Prediction Model based on
Supervised Machine Learning Algorithms," IOP Conference Series: Materials Science and
Engineering, 2020.
[4] E. A.-S. A. Z. A. &. A. A. Alshdaifat, "The impact of data normalization on predicting
student performance: A case study from Hashemite University.," International Journal of
Advanced Trends in Computer Science and Engineering, 9(, p. 4580–4588, 2020.
[5] S. K. M. S.-P. f. Hussain, "Predicting Students’ Academic Performance at Secondary and
Intermediate Level Using Machine Learning. Ann. Data. Science.," 2021.
[6] J. W. H. P. R. W. Xing Xu, "Prediction of academic performance associated with internet
usage behaviors using machine learning algorithms," Computers in Human Behavior, 98 , p.
166–173, 2019.
[7] B. Z. M. Ahmed Mueen, "Modeling and Predicting Students' Academic Performance Using
Data Mining Techniques," International Journal of Modern Education and Computer
Science 11, pp. 36-42 , 2016.
[8] H. &. Y. J. Lu, "Student Performance Prediction Model Based on Discriminative Feature
Selection," International Journal of Emerging Technologies in Learning (iJET), vol. 10, p.
55–68, 2018.
[9] M. G. M. F. N. e. a. Radovic, "Minimum redundancy maximum relevance feature selection
approach for temporal gene expression data.," BMC Bioinformatics 18, 9 , 2017.
[10] P. D.M.W., J. Mach. Learn. Technol, p. 37–63, 2011.
[11] A. B. Zorić, "Predicting Students’ Academic Performance Based on Enrolment Data,"
International Journal of Innovation and Economic Development, vol. 6 , pp. 54-61, 4
October 2020 .
[12] Zlatko J. Kovačić, "Early Prediction of Student Success: Mining Students Enrolment Data,"
Proceedings of Informing Science & IT Education Conference (InSITE) , 2010.
[13] M. A. M. A. A. e. a. Arowolo, "Optimized hybrid investigative based dimensionality
reduction methods for malaria vector using KNN classifier," J Big Data 8, 2021.
[14] H. &. Z. J. Hu, "Application of Teaching Quality Assessment Based on Parallel Genetic
Support Vector Algorithm in the Cloud Computing Teaching System.," International Journal
of Emerge, 2016.
[15] P. D.M.W., p. 37–63, 2011.
