
Visvesvaraya Technological University
Jnana Sangama, Belagavi - 590 014
A PROJECT REPORT ON
Student Performance Prediction
Submitted by
MOHAMMED ADNAN : 4PM17CS047

MOHAMMED AZEEM SHARIF : 4PM17CS048
SHOAIB AHMED : 4PM17CS075
UMAR FAROOQ : 4PM16CS093
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
Under the Guidance of

Dr. MANU A P
Prof., Dept. of CSE
PES INSTITUTE OF TECHNOLOGY AND MANAGEMENT

(Approved by AICTE, New Delhi, Affiliated to VTU, Belagavi, ISO 9001 Certified)
NH 206, Sagar Road, Shivamogga - 577 204

August 6, 2021
PESITM, NH 206, Sagar Road, Shivamogga - 577 204, Karnataka
Department of Computer Science and Engineering
CERTIFICATE
Certified that the project work entitled “Student Performance Prediction”
carried out by Mohammed Adnan (4PM17CS047), Mohammed Azeem Sharif (4PM17CS048),
Shoaib Ahmed (4PM17CS075) and Umar Farooq (4PM17CS093), bonafide
students of PESITM, Shivamogga in partial fulfillment for the award of Bachelor
of Engineering in Computer Science & Engineering of the Visvesvaraya Technological
University, Belagavi during the year 2020-21. It is certified that all
corrections/suggestions indicated for Internal Assessment have been incorporated in the
Report deposited in the departmental library.
The project report has been approved as it satisfies the academic requirements in
respect of Project work prescribed for the said Degree.
Dr. Manu A P Dr. Chatrapathy K Dr. Chaitanya Kumar M V

Project Supervisor HOD, CSE Principal, PESITM
External Viva
Name of the Examiner Signature with Date
1.
2.
PESITM, NH 206, Sagar Road, Shivamogga - 577 204, Karnataka
Department of Computer Science and Engineering
DECLARATION
We, Mohammed Adnan (4PM17CS047), Mohammed Azeem Sharif (4PM17CS048),
Shoaib Ahmed (4PM17CS075) and Umar Farooq (4PM17CS093), students of 8th
semester B.E. in Computer Science & Engineering, PESITM, Shivamogga, hereby
declare that the final year B.E. major project report entitled Student Performance
Prediction, which is being submitted to PESITM, Shivamogga during the
year 2020-21, is a record of original work done by us under the supervision of Dr.
Manu A P, Prof., Dept. of CSE, PESITM, Shivamogga. This project work is submitted
in partial fulfilment of the requirements for the award of the Degree of Bachelor
of Engineering in Computer Science and Engineering. The material contained
in this report has not been submitted to any University or Institution for the award
of any degree.
(Mohammed Adnan) (Mohammed Azeem Sharif )

(4PM17CS047) (4PM17CS048)
(Shoaib Ahmed) (Umar Farooq)

(4PM17CS075) (4PM17CS093)
Place: PESITM, Shivamogga.

Date: August 6, 2021
Acknowledgements
We take this opportunity to extend our deep sense of gratitude to our Project guide
Dr. Manu A P, Prof., Department of CSE, PESITM, for his keen interest and invaluable
help.
We would also like to express our sincere gratitude to Dr. Likewin Thomas,
Assoc. Prof., Dept. of CSE, PESITM, for the kind support and guidance as project
co-ordinator.
We are very much indebted and thankful to Dr. Chatrapathy K., Prof. and Head,
Dept. of CSE, PESITM, for his valuable guidance, encouragement and support.
We are highly grateful to Dr. Chaitanya Kumar M V, Principal, PESITM, for
permitting us to carry out this project work in the institution. Finally, we would
like to thank all the teaching and non-teaching staff of Dept. of CSE for their kind
co-operation. The support provided by the College, the IT Department and the
Departmental library is gratefully acknowledged.
MOHAMMED ADNAN
MOHAMMED AZEEM SHARIF
SHOAIB AHMED
UMAR FAROOQ
Place: PESITM, Shivamogga

Date: August 6, 2021
Abstract
As a competitive environment prevails among educational institutions, the challenge
is to increase the quality of education through data mining. Student performance
is a major concern in higher education. In this project, a machine learning technique
has been used to build a regressor that can predict the upcoming semester marks of
students. The proposed model employs a supervised machine learning algorithm, the
Decision Tree algorithm. The importance of several different attributes, or “features”,
is considered in order to determine which of these are correlated with student marks.
Some of these features are selected to predict a student’s marks in the immediately
following semester. Features such as semester marks, attendance marks, scores obtained
in the 10th and 12th examinations, study time and other aspects were selected to
conduct this work. With early prediction and intervention, better results can be expected
in final exams. Students can view their predicted pass percentage for further studies.
Contents
Abstract
1 Introduction
1.1 Machine Learning
1.1.1 Types of Machine Learning
1.1.2 Decision Tree Algorithm
1.2 Flask
1.3 MongoDB
2 Literature Survey
2.1 Background
2.2 Related Studies
3 Existing System
3.1 Why We Need Educational Data Mining and What Can It Do?
3.2 Methodology
3.3 Data Selection and Pre-processing
3.4 Result
4 Proposed System
4.1 Objectives
4.2 Problem Statement
4.2.1 Marks Prediction
4.3 System Architecture
4.3.1 Use Case Diagram
4.3.2 State Diagram
4.3.3 Activity Diagram
4.4 Methodology
5 Result Analysis
6 Conclusion
List of Figures
1.1 Types of Machine Learning
1.2 Types of Decision Tree Algorithms
4.1 Proposed System Architecture
4.2 Use Case Diagram
4.3 State Diagram
4.4 Activity Diagram
5.1 Home Page
5.2 Option Page
5.3 3rd Semester Input Page
5.4 3rd Semester Prediction Output
5.5 4th Semester Input Page
5.6 4th Semester Prediction Output
5.7 MongoDB
Chapter 1
Introduction
This chapter introduces Machine Learning, the types of machine learning, the
Decision Tree algorithm, the types of decision tree algorithms, the Flask framework
and MongoDB.
1.1 Machine Learning

Machine learning is the science of getting computers to learn without being
explicitly programmed. It is closely related to computational statistics, which focuses
on making predictions using computers. In its application to business problems,
machine learning is also referred to as predictive analytics. Machine learning focuses
on the development of computer programs that can access data and use it to learn for
themselves. The learning process begins with observations or data, such as examples,
direct experience, or instruction, in order to look for patterns in the data and make
better decisions in the future based on the examples provided. The main aim is to
allow computers to learn automatically without human intervention or assistance and
to adjust their actions accordingly.
1.1.1 Types of Machine Learning

Supervised Learning Approach: Supervised learning is typically applied to
classification problems, where the goal is for the computer to learn a classification
system that we have created. It is suitable for any problem where deducing a
classification is useful and the classification is easy to determine. Supervised learning
is the most common technique for training neural networks and decision trees. A
supervised learning algorithm takes new, unseen inputs and determines which labels
they should be assigned, based on the previous training data. The purpose of the
supervised learning model is to predict the correct label for newly presented input
data.
Figure 1.1: Types of Machine Learning
Unsupervised Learning Approach: Unsupervised learning is a form of learning
that allows us to approach problems with little or no idea of what our results
should look like. We can derive structure from data by clustering it based on
relationships among the variables in the data. With unsupervised learning there is no
feedback based on the prediction results. Basically, it is a type of self-organized
learning that helps find previously unknown patterns in data sets without pre-existing
labels.
Reinforcement Learning: Reinforcement learning is a learning method in which an
agent interacts with its environment by producing actions and discovering errors or
rewards. Trial-and-error search and delayed reward are the most relevant
characteristics of reinforcement learning. This allows machines and software agents to
automatically determine the ideal behaviour within a specific context in order to
maximize their performance. Simple reward feedback is required for the agent to learn
which action is best.
1.1.2 Decision Tree Algorithm

Decision trees are powerful and popular tools for classification and regression. A
decision tree represents rules, which can be understood by people and used in a
knowledge system such as a database. A decision tree is a hierarchical model for
supervised learning in which a local region is identified by a sequence of recursive
splits in a small number of steps. A decision tree is composed of internal decision
nodes and terminal leaves. The decision tree is also a non-parametric model in the
sense that no parametric form is assumed and the size and shape of the tree are not
fixed beforehand; instead, the tree grows during learning. A decision tree is drawn in
the form of a tree containing decision nodes, leaf nodes, edges and paths.
Figure 1.2: Types of Decision Tree Algorithms
Classification Trees: A classification tree is an algorithm in which the target
variable is fixed or categorical. The algorithm is then used to identify the “category”
into which the target variable is most likely to fall. This includes binary
classification, where the class variable can take one of two discrete values.
Regression Trees: A regression tree is an algorithm in which the target
variable is continuous and the algorithm is used to predict its value. A problem is of
the regression type when the target variable does not belong to a class but must
instead be a number.
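To make the regression case concrete, the following is a minimal sketch of a one-level regression tree (a "stump") written with the Python standard library only. It is not the implementation used in this project; the function names and the attendance/marks values are hypothetical and chosen purely for illustration. The split is chosen by minimising the sum of squared errors on each side, and each leaf predicts the mean of its training targets.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]

def predict(stump, x):
    """Predict the leaf mean on the side of the threshold where x falls."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical data: attendance percentage -> semester marks.
attendance = [55, 60, 65, 80, 85, 90]
marks = [40, 42, 44, 70, 72, 74]
stump = best_split(attendance, marks)
print(stump, predict(stump, 58), predict(stump, 88))
```

A full regression tree applies the same split search recursively to each side until a stopping rule is met.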
1.2 Flask
Flask is a Python web framework that allows us to build web applications. Flask is a
small framework that provides the basic features of a web application. It does not
require any particular external tools or libraries. A web application framework, or
web framework, is a collection of modules and libraries that helps developers write
applications without having to handle low-level details such as protocols and thread
management. Flask is based on the WSGI (Web Server Gateway Interface) toolkit and the
Jinja2 template engine. It does not have a database abstraction layer, form validation,
or other components for which existing third-party libraries provide common
functions. Flask supports extensions that can add application features as if they were
implemented in Flask itself. Extensions are available for object-relational
mappers, form validation, upload handling, various open authentication
technologies and many common framework-related tools.
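Since Flask is built on WSGI, the interface itself can be sketched with the Python standard library alone. The example below is not Flask code and not the application built in this project; it is a minimal WSGI callable, with a hypothetical helper `call_app` that invokes it in-process, to show the request/response contract that a Flask application object also implements.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A bare WSGI callable: receives the request environ, returns body chunks."""
    path = environ.get("PATH_INFO", "/")
    if path == "/":
        status, body = "200 OK", b"Student Performance Prediction"
    else:
        status, body = "404 Not Found", b"Not Found"
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]

def call_app(app, path):
    """Hypothetical helper: call a WSGI app without a server, return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys a server would provide
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body

print(call_app(application, "/"))
print(call_app(application, "/missing"))
```

A framework such as Flask hides this plumbing behind routing, request objects and templates, which is why it is convenient for quick prototypes.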
The Features of Flask
• Development Server and debugger
• Integrated support for unit testing
• RESTful request dispatching
• Uses Jinja templating
• Support for secure cookies (client side sessions)
• 100% WSGI 1.0 compliant
• Unicode-based
• Extensive documentation
• Google App Engine compatibility
• Extensions available to enhance features desired
Advantages of Flask
• Higher compatibility with latest technology
• Technical experimentation
• Easier to use for simple cases
• Codebase size is relatively smaller
• High scalability for simple applications
• Easy to build a quick prototype
• Routing URL is easy
• Easy to develop and maintain applications
• Database integration is easy
• Small core and easily extensible
• Minimal yet powerful platform
• Lots of resources available online
1.3 MongoDB
MongoDB is a document-oriented NoSQL database used for high-volume data storage. A
NoSQL database is very useful for working with large sets of distributed data. Instead
of using tables and rows as in traditional relational databases, MongoDB uses
collections and documents. Documents consist of key-value pairs, which are the basic
unit of data in MongoDB. Collections contain sets of documents and function as the
equivalent of relational database tables. MongoDB supports a wide variety of data
types. It is one of the most widely used non-relational database technologies for big
data applications and other processing tasks involving data that does not fit well into
a rigid relational model. MongoDB offers ad hoc queries, indexing, load balancing,
aggregation, server-side JavaScript execution and other features.
How MongoDB works
MongoDB uses records made up of documents, which contain a data structure composed
of field and value pairs. The document is the basic unit of data in MongoDB. Documents
are similar to JSON, but use a variant called Binary JSON (BSON). The advantage of
using BSON is that it accommodates more data types. The fields in these documents are
similar to the columns of a relational database. The values contained can be of a
variety of data types, including other documents, arrays and arrays of documents,
according to the MongoDB user manual. Documents also include a primary key as a
unique identifier. Sets of documents are called collections, which act as the
equivalent of relational database tables. Collections may contain any type of data,
but the data in a collection cannot be spread across different databases. The NoSQL
DBMS uses a single-master architecture for data consistency, with secondary databases
that maintain copies of the primary database. Operations are automatically replicated
to those secondary databases for automatic failover.
MongoDB Features
• Each database contains collections, which in turn contain documents. Each
document may have a different number of fields. The size and content of each
document may differ from one another.
• The structure of a document is closely aligned with how developers construct
their classes and objects in their programming language. Developers often
say that their classes are not rows and columns but have a clear structure of
key-value pairs.
• Rows (or documents, as they are called in MongoDB) do not require a pre-defined
schema; instead, fields can be created on the fly.
• The data model available within MongoDB allows you to represent hierarchical
relationships, store arrays, and handle other complex structures easily.
Key Components of MongoDB
• _id: This is a required field in every MongoDB document. The _id field represents
a unique value in the MongoDB document and acts as the document’s primary
key. If you create a new document without an _id field, MongoDB will
create the field automatically, adding a unique identifier of 24 hexadecimal
digits to each document in the collection.
• Collection: This is a grouping of MongoDB documents. A collection is the
equivalent of a table created in any other RDBMS such as Oracle or MS
SQL. A collection exists within a single database.
• Cursor: This is a pointer to the result set of a query. Clients can iterate through
a cursor to retrieve results.
• Database: This is a container for collections, just as in an RDBMS a database is
a container for tables. Each database gets its own set of files on the file system.
A MongoDB server can store multiple databases.
• Document: A record in a MongoDB collection is called a document. A
document consists of field names and values.
• Field: A name-value pair in a document. A document has zero or more
fields. Fields are analogous to columns in relational databases.
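The components listed above can be illustrated without a running database. The sketch below models a collection as a plain Python list of dictionaries; it does not use MongoDB or any MongoDB driver, and `insert_one`, `find` and all field names are hypothetical stand-ins chosen for illustration. The generated `_id` is a random 24-character hexadecimal string that only imitates the shape of a real ObjectId.

```python
import json
import secrets

def insert_one(collection, document):
    """Store a copy of the document, adding an _id if it does not have one."""
    stored = dict(document)
    # Stand-in for an ObjectId: 12 random bytes shown as 24 hex characters.
    # (A real ObjectId is not purely random; this only mimics its length.)
    stored.setdefault("_id", secrets.token_hex(12))
    collection.append(stored)
    return stored["_id"]

def find(collection, **criteria):
    """Yield matching documents one at a time, much as a cursor is iterated."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in criteria.items()):
            yield doc

students = []  # a "collection": documents need not share the same fields
insert_one(students, {"usn": "S001", "sem3_marks": 68, "attendance": 82})
insert_one(students, {"usn": "S002", "sem3_marks": 74, "tenth_score": 88.5,
                      "subjects": [{"code": "CS31", "marks": 70}]})

results = list(find(students, usn="S002"))
print(json.dumps(results, indent=2))
```

The second document carries an extra field and an embedded array of documents, which is the schema flexibility described in the feature list.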
Chapter 2
Literature Survey
This chapter presents the literature survey, in which various papers related to the
project are reviewed. It includes the IJERT paper by Havan Agrawal and
Harshil Mavani, Department of Information Technology.
The 2001 National Research Council report [1] identified the critical need to develop
innovative approaches that enable higher-education institutions to retain students,
ensure their timely graduation, and ensure they are well-trained and workforce-ready
in their fields of study. The ability to predict student grades in future enrolment
terms provides valuable information to help students, advisers, and educators reach
these goals. As the volume and variety of data collected in traditional university
settings continue to expand, new opportunities to use big data analytics arise. In
this paper, a system is developed to accomplish one such task: predicting
students’ course grades for the next enrolment term in a traditional university
setting. Students take courses over a sequence of academic terms [2].
2.1 Background
Machine Learning is a set of techniques that gives computers the ability to learn
without being explicitly programmed by humans [3]. ML has supported a wide range
of applications such as medical diagnostics, business analysis, DNA sequence grouping,
robotics, predictive analysis, etc. We are particularly interested in the area
of predictive analysis, where ML allows us to implement complex models that are
used for prediction purposes. ML algorithms are classified into two main streams:
supervised and unsupervised. In this project, supervised learning has been used.
Supervised Learning (SL) seeks algorithms that are able to reason from given
instances in order to produce general hypotheses, which then make predictions about
future instances [4]. In other words, the purpose of SL is to build a concise model of
the distribution of class labels in terms of predictor features. Rule Induction is an
effective SL predictive tool, which has achieved an accuracy rate of 94 percent when
predicting dropouts among new nursing students, using 3978 records on 528 students [5].
Data Mining
The application of machine learning has been portrayed here. The general pipeline
used for essentially all machine learning problems consists of:
1. Define the problem.
2. Collect Data.
3. Design features.
4. Train the model.
5. Test the model.
The problem addressed here is to predict the results of students in their fourth
semester based on their results in the third semester and their current
internal assessment scores [6].
Educational information can be collected from a variety of sources. We have
collected real-world data from 100 high schools in Hubei province, as it is important
for administrators to predict entrance scores and assist at-risk students in order to
improve the quality of education. The training data is a combination of different
types of information:
• Background and demographic data: Gender, age, health status, family status
etc.
• Previous study details: High school entrance scores, primary school GPA etc.
• School assessment data: Type of school, school level etc.
• Learning data: Scores in all high school subjects (mid-term assessment, final
term assessment, average)
• Personal data: Personality, attention, psychology-related data, etc.
Table 2.1: Neural Network Accuracy
All raw data collected is converted to numerical values, and we then center
and scale the data values by subtracting the mean and dividing by the
standard deviation of each feature to ensure that every value varies over the same
range. After adjusting each input vector, the entire dataset is whitened [7] to make
the input less redundant.
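The centering and scaling step described above can be sketched as follows, using only the Python standard library. This is an illustrative sketch rather than the code of the cited study: the function name and sample values are hypothetical, the population standard deviation is assumed, and the subsequent whitening step is not shown.

```python
from statistics import mean, pstdev

def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    m = mean(column)
    s = pstdev(column)
    if s == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(value - m) / s for value in column]

# Hypothetical raw marks for one feature.
marks = [40, 50, 60, 70, 80]
scaled = standardize(marks)
print([round(value, 3) for value in scaled])
```

After this transformation each feature has zero mean and unit standard deviation, so features measured on different scales contribute comparably to the model.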
Setting up the data was straightforward. We use data from a CSV file. First,
we omit unnecessary columns from the data set. After that we drop the student ID
column. The student ID is unnecessary and the study column is also not required
because these two columns make no difference to the desired result. All our
data is in numerical form.
After that we produce a correlation graph. From the correlation graph we can say
that the final mark is highly dependent on midterm marks and midterm CT
scores [8].
The first step is to collect the data required for the research task.
To narrow our analysis, we may identify attributes in the data set that are
irrelevant and delete those that are not used for analysis. After collecting the data,
the data is converted into the desired form. This process is called data pre-processing.
The more accurately the raw data is pre-processed, the more accurate the
measurements made on the relevant data.
The next step after pre-processing is to find incomplete or incorrect data in
the data set and delete it in order to get correct results. Removing such unwanted
data is called data cleaning. Next, we can select any technique such as
linear regression, support vector machines, Naive Bayes classification or decision tree
algorithms for better classification [9].
The training dataset size was increased in increments of 10, starting from 40, for 17
subjects. The test set was of 10 students, to predict a single subject. The accuracy
results are summarized in Table 2 [10].
Attribute Selection
The attributes that have been used most frequently are the CGPA and internal
assessment. Ten of the 30 papers used the CGPA as their key indicator to predict student
performance. The main reason why most researchers use CGPA is that it has
a tangible value for future educational and career mobility. It can also
be considered an indication of realized academic potential [11]. The result shows that
CGPA is the most significant input variable, with 0.87, compared to other variables.
A similar observation is made in the study of Christian and Ayub [12]. In this study,
internal assessments were categorized as assignment marks, quizzes, web activity, class
tests and attendance. All of these attributes are grouped into a single attribute
called internal assessment [13].
Procedures
• Decision Trees (DTs) are classic algorithms, which are organized in a tree-like
structure in which each internal node represents a ‘test’ on an attribute.
The goal is to achieve perfect classification with a minimal number of decisions,
although this is not always possible due to noise or inconsistencies in the data.
The core algorithm for building decision trees is called ID3, which employs a
top-down, greedy search through the space of possible branches with no backtracking.
The main challenge while building the tree is to decide on which attribute to
split the data at a certain step in order to have the ‘best’ split [14]. For
regression, the ID3 algorithm can be used to construct a decision tree by using
standard deviation reduction in place of Information Gain (IG).
• Naive Bayes algorithm (NB) is a simple method for classification based
on the theory of probability, i.e., Bayes’ theorem (Witten and Frank,
2000). It is called naive because it simplifies problems by relying on two important
assumptions: it assumes that the predictive attributes are conditionally
independent given the class, and it supposes that there are no hidden
attributes that could affect the process of prediction. This classifier represents
a promising approach to the probabilistic discovery of knowledge, and
it provides a very efficient algorithm for data classification.
• Statistical methods and neural networks are considered to be poorly
suited to data mining purposes. The models obtained under these
frameworks are often regarded as black-box methods, which can achieve excellent
levels of accuracy but are very difficult for people to understand [15].
• K-Nearest Neighbor approach took less time to classify student performance
as a slow learner, an average learner, a good learner or an excellent
learner. K-Nearest Neighbor provides good accuracy in estimating the
detailed pattern of student progress in higher education. In addition,
K-Nearest Neighbor showed the highest accuracy (83 percent) when using the
combination of three indicators, namely internal assessment, CGPA and
extra-curricular activities, in predicting student performance [3].
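The standard deviation reduction (SDR) criterion mentioned in the decision tree item above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the population standard deviation is used, the attributes are categorical, and the rows, attribute names and `sdr` function are hypothetical rather than taken from the cited work.

```python
from collections import defaultdict
from statistics import pstdev

def sdr(rows, attribute, target):
    """Std. dev. of the target minus the weighted std. dev. after splitting."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    total = len(rows)
    weighted = sum(len(g) / total * pstdev(g) for g in groups.values())
    return pstdev([row[target] for row in rows]) - weighted

# Hypothetical training rows with two categorical attributes.
rows = [
    {"attendance": "high", "study_time": "long", "marks": 80},
    {"attendance": "high", "study_time": "short", "marks": 70},
    {"attendance": "low", "study_time": "long", "marks": 50},
    {"attendance": "low", "study_time": "short", "marks": 40},
]

# The attribute with the largest reduction is chosen for the split.
best = max(["attendance", "study_time"], key=lambda a: sdr(rows, a, "marks"))
print(best, round(sdr(rows, "attendance", "marks"), 3))
```

In this toy data, splitting on attendance leaves much more homogeneous groups of marks than splitting on study time, so it gives the larger reduction.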
2.2 Related Studies

There are many studies in the literature that explore ways to use machine
learning techniques for a variety of educational purposes. One focus of these studies
is on identifying high-risk students, as well as identifying factors that affect student
performance.
A study by Kotsiantis et al. [16] is one of the first to investigate the use of
machine learning techniques for prediction in an educational setting. The contribution
of this study was that it pioneered and paved the way for several such studies.
While machine learning techniques had previously been used in a number of settings, it
was the first time that these techniques were used in an educational environment.
The scope of the work was to study whether machine learning methods could be
useful in this regard. There was no prior work that used machine learning techniques
to predict student dropout. Some studies only statistically relate certain student
characteristics to dropping out, without being able to predict which students
will drop out. To this end, the study applies several learning algorithms to the data
provided by the Hellenic Open University. It was shown that learning algorithms can
enable tutors, using only students’ numerical data available from the beginning of the
academic year, to identify drop-out-prone students with satisfactory accuracy. This
accuracy improves as new information emerges from the curriculum during the academic
year.
It was shown that learning algorithms predict the drop-out of new students with
satisfactory accuracy and are useful in efforts to prevent and reduce student
drop-out. Accuracy reaches 63
Bhardwaj and Pal [17] conducted research to determine the factors that significantly
affect student performance. They used Bayesian classification in their
study. Predicting student performance with high accuracy is beneficial for identifying
students with low academic achievement early. Students who are
identified can then be assisted by the teacher in order to improve their performance
in the upcoming curriculum.
The objectives of this paper, aimed at assisting students in higher education,
were:
• Generating a data source of predictive variables,
• Identifying different factors which affect a student’s learning behavior and
performance during their academic career,
• Constructing a predictive model using classification data mining techniques on
the basis of the identified predictive variables, and
• Validating the developed model for students studying in Indian Universities
or Institutions.
This study used a step-based approach during the training phase, and the
classification success of the algorithms was obtained for different phases of a semester.
There were three steps:
1. 1st step: Information of attendance for first four weeks, and grade of 1st as-
signment,
2. 2nd step: Information of attendance for first seven weeks, grade of 1st, 2nd
assignments, and midterm grade
3. 3rd step: Information of attendance for first ten weeks, grade of 1st, 2nd and
3rd assignments, final exam grade, and midterm grade.
After the datasets for testing and training were prepared, the experiment was
started. In the study, three decision schemes were used in addition to three
individual machine learning algorithms. The three machine learning algorithms, used
individually to classify students as successful or failing, were Naïve
Bayes, K-Star and C4.5. Time-invariant attributes were excluded and only time-varying
data were used in the proposed study. The results show that the exclusion of
time-invariant data has no significant impact on the overall results.
In the study by Havan Agrawal and Harshil Mavani [18], a model was
proposed to predict the academic performance of students. The algorithm employed
is a machine learning technique called Neural Networks. The importance
of several different attributes is considered in order to determine which are correlated
with student performance.
Initially, the linear relationship between a student’s previous academic performance
and future marks was considered. This relationship was expressed using Multivariate
Linear Regression, which uses the past semester marks of a student and the marks scored
by the student’s senior batches to predict the future marks of the student.
The error statistics were as follows:
Average error = 6
Accurate = 296
Erroneous = 124
Accuracy Rate = 70.48%
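The reported accuracy rate is consistent with the two counts above if it is taken to be the share of accurate predictions among all predictions. That definition is an assumption, since the formula is not stated here; the short check below uses only the figures listed.

```python
accurate = 296   # predictions counted as accurate
erroneous = 124  # predictions counted as erroneous

total = accurate + erroneous
# Assumed definition: accuracy = accurate / (accurate + erroneous)
accuracy_rate = 100 * accurate / total
print(f"{accurate} / {total} = {accuracy_rate:.2f} percent")
```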
An application was built that employed neural networks. The application reads
data from .csv files. When a prediction is required, it dynamically
trains a network of 3 layers and provides a prediction of marks. This study confirms
that past performance has a significant influence on students’ performance,
and that the performance of neural networks improves as the dataset size
increases.
Chapter 3
Existing System
Data mining can be used to support decision-making in the education system. The
decision tree classifier is one of the most widely used methods for analysing data,
based on a divide-and-conquer approach. This paper discusses the use of decision trees
in educational data mining. Decision tree algorithms are applied to the past
performance data of engineering students to produce a model, and this model can be used
to predict student performance. It will help to identify in advance students who may
fail and allow the teacher to provide appropriate guidance.
3.1 Why We Need Educational Data Mining and What Can It Do?
The education system holds a large amount of educational data. This data can be
student data, teacher data, alumni data, resource data etc. Educational data mining
is used to find patterns in this data for decision-making. There are two types of
education system:
1. Traditional education system: In this system there is direct contact
between students and teachers. Student records, which include details such as
attendance and marks, can be kept manually or digitally. Student performance is
measured using this information.
2. Web-based learning system: Also known as e-learning. It is becoming increasingly
popular as students can study from anywhere at any time. In a web-based
system, various data about students is automatically collected through logs.
The results of educational data mining can be used by different members of the
education system [1], [2]. Students can use them to identify activities, resources and
learning tasks to improve their learning. Teachers can use them to get more objective
feedback, identify at-risk students and guide them to help them succeed, find
the most common mistakes and organize course content effectively. On the other
hand, administrators can use them to decide which courses they will offer, which
students can contribute the most to the institution etc.
Most of the features chosen to create the model are based on previous student
performance, as we feel that a student’s previous performance reflects his or her
future performance in most cases. Also, social data is hard to obtain. For example,
students are reluctant to disclose information such as parental income and may
provide incorrect information. In this paper, we would like to find the relationship
between past performance and future performance. Various data mining methods can be
used in educational systems, such as clustering, classification, outlier detection,
association rule mining and sequential pattern mining.
Classification builds structures from examples of past decisions that can be used
to make decisions for unseen cases. Data classification is a two-step process. In
the first step, the model is built by analyzing data tuples from the training data,
each described by a set of attributes. For each tuple in the training data, the value
of the class label attribute is known. The classification algorithm is applied to the
training data to create the model. In the second step, test data are used to estimate
the accuracy of the model. If the accuracy is acceptable, the model can be used to
classify unknown data tuples (i.e., tuples whose class label is unknown). The basic
techniques for classification are decision tree induction, Bayesian classification,
Bayesian belief networks and neural networks. Other approaches such as genetic
algorithms, rough sets, fuzzy logic and case-based reasoning can also be used for
classification.
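The two-step process described above (build a model on labelled training tuples, then measure accuracy on held-out test tuples) can be illustrated with a minimal sketch. The one-attribute rule learner, the attribute name hsc_band and the toy tuples are assumptions made only for this illustration; they are not the classifier or the data of the existing work.

```python
from collections import Counter

def train_one_rule(rows, labels, attribute):
    """Step 1: for each value of `attribute`, learn the most common class label."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute], Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return rules, default

def predict(model, row, attribute):
    """Classify one tuple whose class label is unknown."""
    rules, default = model
    return rules.get(row[attribute], default)

def accuracy(model, rows, labels, attribute):
    """Step 2: fraction of test tuples whose predicted label equals the known label."""
    correct = sum(predict(model, r, attribute) == y for r, y in zip(rows, labels))
    return correct / len(rows)

# Hypothetical toy tuples, for illustration only.
train_rows = [{"hsc_band": "high"}, {"hsc_band": "high"},
              {"hsc_band": "low"}, {"hsc_band": "low"}]
train_labels = ["PASS", "PASS", "FAIL", "FAIL"]
test_rows = [{"hsc_band": "high"}, {"hsc_band": "low"}, {"hsc_band": "low"}]
test_labels = ["PASS", "FAIL", "PASS"]

model = train_one_rule(train_rows, train_labels, "hsc_band")
test_accuracy = accuracy(model, test_rows, test_labels, "hsc_band")
print(f"test accuracy: {test_accuracy:.2f}")
```

If the measured accuracy is judged acceptable, the same predict function would be applied to tuples whose label is not known.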
3.2 Methodology
A decision tree is a tree structure similar to a flow chart, in which each internal
node is drawn as a rectangle and each leaf node as an oval. Every internal node
has two or more child nodes. Each internal node contains a split, which tests the
value of an expression of the attributes. Arcs from an internal node to its children
are labeled with the distinct outcomes of the test. Each leaf node has a class label
associated with it.
The decision tree is built from a training set, which consists of data tuples.
Each tuple is completely described by a set of attributes and a class label. Attributes
can have discrete or continuous values. Decision trees are used to classify data
tuples whose class label is unknown. Based on the attribute values of the tuple, a
path is traced from the root to a leaf. The class of that leaf is the class predicted
by the decision tree for that tuple.
The task of building a tree from a training set is called tree induction. Most
existing tree induction systems adopt a greedy (i.e., non-backtracking), top-down,
divide-and-conquer approach. Starting with an empty tree and the entire training
set, the following algorithm is applied to the training data (where each tuple is
associated with a class label) until no further split is possible.
Algorithm:
1. Create a node N.
2. If all of the tuples in the partition belong to the same class, return N
as a leaf node labelled with that class.
3. If the attribute list is empty, return N as a leaf node labelled with the most
common class in the partition.
4. Select the splitting attribute so that the partitions created at each branch
are as pure as possible.
5. Label node N with the splitting criterion, which serves as the test at that
node.
6. Remove the splitting attribute from the attribute list if it is discrete-valued.
7. Let Pi be the partition formed by the i-th outcome of the test.
8. If any Pi is empty, attach to node N a leaf labelled with the majority class of
the partition at N.
9. Otherwise, recursively apply the entire process to each partition.
10. Return N.
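The ten steps above can be sketched in Python for discrete-valued attributes. The algorithm does not name a purity measure, so this sketch assumes weighted entropy as the splitting criterion. The attribute names, domains and toy tuples are hypothetical and serve only to exercise the code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a list of class labels (0 when all labels are equal)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_attribute(rows, labels, attributes):
    """Step 4: choose the attribute whose partitions have the lowest weighted entropy."""
    def weighted_entropy(attr):
        parts = {}
        for row, label in zip(rows, labels):
            parts.setdefault(row[attr], []).append(label)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return min(attributes, key=weighted_entropy)

def build_tree(rows, labels, attributes, domains):
    if len(set(labels)) == 1:                 # step 2: pure partition
        return labels[0]
    if not attributes:                        # step 3: no attributes left
        return majority(labels)
    attr = best_attribute(rows, labels, attributes)     # steps 4 and 5
    remaining = [a for a in attributes if a != attr]    # step 6
    node = {"attribute": attr, "branches": {}}          # step 1: node N
    for value in domains[attr]:                         # step 7: one Pi per outcome
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        if not idx:                                     # step 8: empty partition
            node["branches"][value] = majority(labels)
        else:                                           # step 9: recurse
            node["branches"][value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx], remaining, domains)
    return node                                         # step 10

def classify(tree, row):
    """Trace the path from the root to a leaf using the tuple's attribute values."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical toy training set, for illustration only.
domains = {"atype": ["central", "management"], "medium": ["english", "other"]}
rows = [{"atype": "central", "medium": "english"},
        {"atype": "central", "medium": "other"},
        {"atype": "management", "medium": "english"},
        {"atype": "management", "medium": "other"}]
labels = ["PASS", "PASS", "PASS", "FAIL"]

tree = build_tree(rows, labels, list(domains), domains)
print(classify(tree, {"atype": "management", "medium": "other"}))
```

Continuous attributes would need a threshold test at each node instead of one branch per value; that case is left out to keep the sketch short.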
3.3 Data Selection and Pre-processing

Data of 346 students was collected from the first year of engineering for the
academic years 2009-10 and 2010-11. The information was collected through the
registration form completed by each student at the time of admission. The form
includes personal information (category, gender, etc.), previous performance data
(SSC or 10th-standard marks, HSC or 10+2 marks, etc.), address and contact number.
From these, the attributes that could affect the students’ results were selected, as
shown in Table 1. Many of the attributes reflect past student performance. The reason
for focusing on previous performance data is that it is readily available in the
administrative department of the institute. If a student has performed well
previously, he or she is likely to perform well in subsequent exams too. The
attributes are given below.
Branch - The course offered by the institute: Computer Engineering (COMP),
Information Technology (IT), Electronics and Telecommunication (ETC), Electronics
(ELX), Mechanical Engineering (MECH).
HSCPercent - The percentage of marks obtained by the student in the higher
secondary class.
HSCMaths - Marks in Mathematics.
HSCPCM - Total marks in Physics, Chemistry and Mathematics (i.e., out of 300).
HSCCET - Marks obtained in the common entrance test. The entrance test is
compulsory for admission to an engineering course. The Maharashtra state
CET is out of 200.
SSCpercent - The percentage of marks obtained by the student in the secondary
school class.
SSCMaths - The percentage of marks in Mathematics.
SSCSci - The percentage of marks in Science.
Atype - The admission type, which may be through the central process or through the
management of the institute.
SSCMedium - Medium of instruction of the student at secondary school level.
FEresult - Result of the student in the First Year of Engineering. It takes the values
PASS, FAIL or BACKLOG. In general, if a student fails in up to three theory and
two practical subjects of an academic year, or vice versa, he/she is awarded a backlog
and promoted to the next class, provided there is no backlog from the previous year.
3.4 Result
This study shows that the past academic performance of students can be used to
build a decision tree model that predicts student performance in the first-year
engineering examination. From the confusion matrix it is clear that the true
positive rate of the model for the FAIL class is 0.907, which means that the model
successfully identifies students who are likely to fail. These students can be
referred for appropriate counselling to improve their results. Model accuracy may
improve if attributes that reflect current performance (e.g., attendance, test marks,
etc.) and other factors are added.
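As a small worked illustration of how a per-class rate is read off a confusion matrix, the sketch below computes the true positive rate of one class as the correctly predicted count divided by the total number of actual instances of that class. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the confusion matrix of the study.

```python
def true_positive_rate(confusion, cls):
    """confusion[actual][predicted] holds a count; returns TP / (all actual cls)."""
    row = confusion[cls]
    total = sum(row.values())
    return row[cls] / total if total else 0.0

# Hypothetical counts, for illustration only (not the study's results).
confusion = {
    "PASS":    {"PASS": 40, "FAIL": 5,  "BACKLOG": 5},
    "FAIL":    {"PASS": 2,  "FAIL": 16, "BACKLOG": 2},
    "BACKLOG": {"PASS": 6,  "FAIL": 4,  "BACKLOG": 20},
}

fail_tpr = true_positive_rate(confusion, "FAIL")  # 16 of the 20 actual FAIL tuples
print(f"true positive rate for FAIL: {fail_tpr:.3f}")
```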
Chapter 4
Proposed system
This chapter presents the problem statement of the project, its objectives, the
architectural diagrams, the methodology used to solve the problem and the steps to
install the project on an operating system.
4.1 Objectives
1. Providing an interface to predict marks: Student Marks Predictor is a web
application built with the Flask framework and powered by machine learning.
A machine learning technique is used to create a model that predicts the marks
of a given student from a set of parameters, or features. The model is trained
on a dataset of several hundred rows of student data and tested on a small
portion of the same dataset. The project provides a user-friendly interface in
which the user enters the data of a particular student. The entered values are
encoded with the appropriate encoder and passed to the machine learning model,
and the marks predicted by the model are then displayed.
2. Identifying the areas in which improvement is needed: Once the marks are
predicted, the user gets an idea of where the student needs to improve in order
to score more in the upcoming semester. The project uses more than 20 features
to predict the marks, and the simple interface lets the user change any of these
values and predict again. In this way the user can identify the particular
features in which the student needs to improve. For example, if the student
increases study time, the chances of scoring higher marks increase.
3. Providing flexibility of parameters: The set of parameters is flexible, which
makes it possible to improve the accuracy of the marks prediction. If more
parameters or features are needed, they can be added easily to make the
machine learning model more accurate.
4.2 Problem Statement

The identified problem statement of the project is to predict the future academic
performance of a student.
4.2.1 Marks Prediction

Analysing the performance of a student based on marks alone is difficult. This
project allows teachers, parents and students to analyse performance easily through
the predicted marks of a student and shows where improvement is needed to score
more marks.
4.3 System Architecture

The system architecture presents a student performance model used to evaluate the
features that may affect a student’s academic success. Figure 4.1 shows the main
steps of the proposed methodology. The methodology starts by collecting data from
various sources such as the college database and surveys in the form of
questionnaires. This step is followed by data cleaning, which deals with filling in
missing values and transforming the collected data into a suitable format. After
that, feature selection is applied to choose the highest-ranked features. We applied
the Pearson correlation method for feature selection.
Figure 4.1: Proposed System Architecture
The process is then followed by the evaluation of results and patterns to
generate the knowledge representation.
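The Pearson correlation step mentioned above can be sketched as follows: each candidate feature is scored by the absolute value of its correlation with the target marks, and the highest-ranked features are kept. The report does not state the cut-off or the feature values, so the top_k parameter, the column names and the numbers below are assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)

def rank_features(columns, target, top_k):
    """Keep the top_k features with the largest absolute correlation to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical feature columns and marks, for illustration only.
columns = {
    "study_time":  [1, 2, 3, 4, 5],
    "travel_time": [3, 1, 2, 3, 1],
    "failures":    [3, 2, 2, 1, 0],
}
marks = [50, 55, 65, 70, 80]

selected = rank_features(columns, marks, top_k=2)
print(selected)
```

The absolute value is used so that a strongly negative relation, such as past failures against marks, ranks as highly as a strongly positive one.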
4.3.1 Use Case Diagram
Figure 4.2: Use Case Diagram
The use case diagram in Figure 4.2 depicts the possible interactions between the
user and the Student Performance Predictor interface. Arrows from the user to the
system indicate the user making a prediction request by supplying the necessary
information, while arrows from the system indicate the system’s response to the
request. Arrows from the administrator to the system indicate that the administrator
provides the necessary backend assistance. The administrator is included because
the system needs someone to manage and service it.
4.3.2 State Diagram
Figure 4.3: State Diagram
State diagrams depict the behavior of a system, describing how an object in the
system transitions from one state to another. Figure 4.3 shows the state diagram
for the Student Performance Predictor.
4.3.3 Activity Diagram

An activity diagram visually presents a series of actions or flow of control in a system
similar to a flowchart or a data flow diagram. Activity diagrams are often used in
business process modeling. They can also describe the steps in a use case diagram.
Figure 4.4: Activity Diagram

Activities modeled can be sequential and concurrent. In both cases an activity
diagram will have a beginning (an initial state) and an end (a final state).
4.4 Methodology
The proposed system works through the following steps.
Step-1 Starting Student Marks Predictor: To get started, launch the project by
executing the command “python app.py”, which starts the Flask server.
Step-2 Selecting the semester to predict: The user selects the semester for which
the student’s marks are to be predicted by clicking the corresponding button in
the front end.
Step-3 Input the features: After selecting the semester, the user fills in the
features that are given as input to the machine learning model. The features
included in our project are:
• Sex (gender of student)
• Age (age of student)
• Mother’s job
• Father’s job
• Travel time to college
• Hosteller
• Study time spent by student
• Subjects failed in previous semesters
• Educational support by parents
• Extra paid classes for other courses
• Extracurricular activities
• Willingness to pursue higher education
• Internet facility
• Free time
• Going out with friends or family
• Health condition
• Tenth percentage
• PUC/Diploma percentage
• Previous semesters CGPA scores
Step-4 Get predicted marks: When the features are submitted, the model predicts the
marks from the features given by the user, and the predicted marks are displayed
together with all the input features.
Step-5 Saving in database: After the marks are displayed, the project stores the
predicted marks, together with all the features used, in the database for further
use. The user can fetch the data from the database by the student’s name.
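Steps 3 to 5 (encode the submitted features, predict, store, and fetch by name) can be sketched without any web or database dependency. Everything named here is a stand-in chosen for illustration: the ENCODERS table, the stub_model formula and the in-memory dictionary are assumptions, not the project's actual encoders, trained regressor or MongoDB collection.

```python
# Hypothetical category encodings; the report does not list the real ones.
ENCODERS = {
    "sex": {"F": 0, "M": 1},
    "hosteller": {"no": 0, "yes": 1},
    "internet": {"no": 0, "yes": 1},
}

def encode(form):
    """Step 3: turn raw form values into a numeric feature dictionary."""
    features = {}
    for name, value in form.items():
        if name == "name":
            continue  # the student's name is a lookup key, not a model input
        features[name] = ENCODERS[name][value] if name in ENCODERS else float(value)
    return features

def stub_model(features):
    """Placeholder for the trained regressor; the formula is made up."""
    return round(features["previous_sgpa"] * 9.0 + features["study_time"], 2)

database = {}  # in-memory stand-in for the database collection

def predict_and_store(form, model=stub_model):
    """Steps 4 and 5: predict the marks, then save inputs and prediction by name."""
    predicted = model(encode(form))
    database[form["name"]] = {**form, "predicted_marks": predicted}
    return predicted

def fetch_by_name(name):
    """Return the stored record for a student, or None if there is none."""
    return database.get(name)

# Hypothetical form submission, for illustration only.
form = {"name": "Student A", "sex": "F", "hosteller": "yes",
        "internet": "yes", "study_time": "3", "previous_sgpa": "8.0"}
predicted = predict_and_store(form)
print(predicted, fetch_by_name("Student A"))
```

In the real application the same three functions would sit behind a Flask route, with the dictionary replaced by database calls.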
Decision Tree Regressor

The decision tree regressor is a machine learning algorithm. It is used in this
project to predict the marks of a student from a set of parameters. There are two
types of decision tree: the decision tree classifier and the decision tree regressor.
A decision tree classifier returns the result as a class, for example pass or fail,
so the user sees one of at most two outcomes. A decision tree regressor returns the
result as a number. For example, a regression tree can be fitted to a sine curve
with added noisy observations; as a result, it learns local approximations of the
sine curve. In this project the decision tree regressor is used to predict the marks
of a student. The algorithm takes features from the dataset as input. The related
and appropriate features are selected and passed to the algorithm, and the
regressor makes its prediction based on these features.
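To make the difference from classification concrete, the following is a minimal from-scratch sketch of a regression tree: each split is chosen to minimise the summed squared error of the two sides, and each leaf predicts the mean of its training targets. It is an illustration of the technique under simplifying assumptions, not the library implementation used in the project, and the two features and the marks are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (feature_index, threshold) that lowers squared error most, or None."""
    best, best_error = None, sse(ys)
    for j in range(len(xs[0])):
        # Skip the largest value so that both sides of a split are non-empty.
        for t in sorted({x[j] for x in xs})[:-1]:
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            error = sse(left) + sse(right)
            if error < best_error:
                best, best_error = (j, t), error
    return best

def fit(xs, ys, max_depth=2):
    split = best_split(xs, ys) if max_depth > 0 else None
    if split is None:
        return mean(ys)  # leaf: predict the mean target of this partition
    j, t = split
    left = [(x, y) for x, y in zip(xs, ys) if x[j] <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x[j] > t]
    return {"feature": j, "threshold": t,
            "left": fit([x for x, _ in left], [y for _, y in left], max_depth - 1),
            "right": fit([x for x, _ in right], [y for _, y in right], max_depth - 1)}

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical rows of [study_time, previous_sgpa] with marks, for illustration only.
xs = [[1, 6.0], [2, 6.5], [3, 8.0], [4, 8.5]]
ys = [55.0, 60.0, 78.0, 82.0]

tree = fit(xs, ys, max_depth=1)
print(predict(tree, [2, 7.0]), predict(tree, [4, 9.0]))
```

With a depth of one, the tree makes a single split and each side predicts an average, which shows why the output of a regression tree is a number rather than a class label.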
Chapter 5
Result Analysis
In the previous chapter the implementation and the user interaction with the model
were discussed. In this chapter the results of the project are discussed. Snapshots
showing the output of the Student Marks Predictor are given below.
Figure 5.1: Home Page
This is the home page. Its header bar contains the links home (the current page),
team (team details), technologies (technologies used in the project), contact
(contact details of the admin) and result (output page). The page also has a
“Get Started” button to begin a prediction.
Figure 5.2: Option Page
The option page provides the user with two options: 3rd semester and 4th semester
prediction. Selecting either option takes the user to the corresponding data input
page.
Figure 5.3: 3rd Semester Input Page
This is the student detail entry page, where one can enter academic and personal
details in order to predict the 3rd semester percentage and SGPA. The inputs are
student name, sex, age, mother’s occupation, father’s occupation, travel time,
hosteller, study time, subjects failed, education support, extra paid,
extracurricular, higher education, internet, free time, going out, health, tenth
score, PU score, 1st semester SGPA and 2nd semester SGPA.
Figure 5.4: 3rd Semester Prediction Output
The data entered by the user is saved in the MongoDB database along with the
predicted percentage. The page then displays all the details along with the
predicted percentage and SGPA of the 3rd semester.
Figure 5.5: 4th Semester Input Page
This is the student detail entry page, where one can enter academic and personal
details in order to predict the 4th semester percentage and SGPA. The inputs are
student name, sex, age, mother’s occupation, father’s occupation, travel time,
hosteller, study time, subjects failed, education support, extra paid,
extracurricular, higher education, internet, free time, going out, health, tenth
score, PU score, 1st semester SGPA, 2nd semester SGPA and 3rd semester SGPA.
Figure 5.6: 4th Semester Prediction Output

The data entered by the user is saved in the MongoDB database along with the
predicted percentage. The page then displays all the details along with the
predicted percentage and SGPA of the 4th semester.
Figure 5.7: MongoDB
This is the database used for storing and retrieving data. The data entered in the
interface is stored here, and the result displayed after prediction is retrieved
from the database.
Chapter 6
Conclusion
Student Marks Predictor is a web application that predicts the marks of a student
from appropriate features using a machine learning algorithm, the decision tree
regressor. In this project, we tried to enable teachers, parents and students to see
the predicted marks of the upcoming semester from a set of parameters, so that the
student gets a picture of where performance needs to improve in order to score more.
The project is made available to every student and teacher for evaluating student
performance. It can be used in real time to evaluate the performance of a student
from the student’s daily activities, and it tells the student in which particular
areas more effort is needed to improve marks.
Future Scope
For future work, the system can be extended with more distinctive attributes to
get more accurate results, which would help to improve students’ learning outcomes.
Also, experiments could be done with other data mining algorithms to get a broader
view and more valuable and accurate outputs. Different software tools may be used,
and a wider range of factors may be considered.
Bibliography
[1] N. R. Council, Building a Workforce for the Information Economy. National
Academies Press, 2001.
[2] Mack Sweeney, Jaime Lester, & Huzefa Rangwala, “Next-Term Student Grade
Prediction”, in IEEE International Conference on Big Data, 2015.
[3] Navamani J., & Kannammal A., “Predicting performance of schools by applying
data mining techniques on public examination results”, Res. J. Appl. Sci. Eng.
Technol. 2015, 9, 262–271.
[4] Kotsiantis, S., “Supervised machine learning: A review of classification
techniques”, Informatica 2007, 31, 249–268.
[5] Juan L. Rastrollo-Guerrero, Juan A. Gomez-Pulido, & Arturo Duran-Dominguez,
“Analyzing and Predicting Students’ Performance by Means of Machine Learning:
A Review”, Appl. Sci. 2020.
[6] Pushpa S K, Manjunath T N, Mrunal T V, Amartya Singh, & C Suhas, “Class
Result Prediction using Machine Learning”, 2017.
[7] Bo Guo, Rui Zhang, Guang Xu, Chuangming Shi, & Li Yang, “Predicting Students
Performance in Educational Data Mining”, International Symposium on
Educational Technology, 2015.
[8] H.M. Rafi Hasan, Mohammad Touhidul Islam, AKM Shahariar Azad Rabby, &
Syed Akhter Hossain, “Machine Learning Algorithm for Student’s Performance
Prediction”, IEEE, 2019.
[9] Boddeti Sravani, & Myneni Madhu Bala, “Prediction of Student Performance
Using Linear Regression”, International Conference for Emerging Technology
(INCET), 2020.
[10] Havan Agrawal, & Harshil Mavani, “Student Performance Prediction using
Machine Learning”, IJERT, 2015, ISSN: 2278-0181.
[11] Z. Ibrahim, & D. Rusli, “Predicting students’ academic performance: comparing
artificial neural network, decision tree and linear regression”, in 21st Annual SAS
Malaysia Forum, 5th September, 2007.
[12] Christian, & M. Ayub, “Exploration of classification on using binary tree for
predicting students’ performance”, in Data and Software Engineering (ICODSE),
2014 International Conference, IEEE, 2014, pp. 1-6.
[13] Amirah Mohamed Shahiri, Wahidah Husain, & Nur’aini Abdul Rashid, “A Review
on Predicting Student’s Performance using Data Mining Techniques”, Procedia
Computer Science 72 (2015) 414–422.
[14] Thi-Oanh Tran, Hai-Trieu Dang, Viet-Thuong Dinh, Thi-Minh-Ngoc Truong,
Thi-Phuong-Thao Vuong, & Xuan-Hieu Phan, “Performance Prediction for Students:
A Multi-Strategy Approach”, ISSN: 1314-4081.
[15] Osmanbegovic, Edin, & Suljic, Mirza, “Data Mining Approach for Predicting
Students Performance”, Econstor, pp. 3-12.
[16] S. Kotsiantis, C. Pierrakeas, & P. Pintelas, “Preventing student dropout in
distance learning systems using machine learning techniques”, AI Techniques in
Web-based Educational Systems at Seventh International Conference on
Knowledge-Based Intelligent Information and Engineering Systems, pp. 3-5,
September 2003.
[17] Brijesh Kumar Bhardwaj, & Saurabh Pal, “Data Mining: A prediction for
performance improvements using classification”, IJCSIS, Vol. 9, April 2011.
[18] Havan Agrawal, & Harshil Mavani, “Student Performance Prediction using
Machine Learning”, IJERT, Volume 04, Issue 03, March 2015.
