Student Performance Analysis Using Machine Learning

AN INTERNSHIP REPORT
STUDENT PERFORMANCE ANALYSIS USING
MACHINE LEARNING
Submitted by
BALAJI N R 113219031022
SUDARSAN M S 113219031147
VIGNESH K 113219031158
VIGNESH RAJ V 113219031159
in partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
VELAMMAL ENGINEERING COLLEGE, CHENNAI-66.

(An Autonomous Institution, Affiliated to Anna University, Chennai)
2021-2022
VELAMMAL ENGINEERING COLLEGE
CHENNAI -66
BONAFIDE CERTIFICATE
Certified that this internship report “Student Performance Analysis using
Machine Learning” is the bonafide work of BALAJI N R (113219031022),
SUDARSAN M S (113219031147), VIGNESH K (113219031158), VIGNESH
RAJ V (113219031159) carried out at “PANTECH SOLUTIONS” during
07.12.2021 to 07.01.2022.
SIGNATURE                                    SIGNATURE
Dr. B. MURUGESHWARI                          Mrs. LOVELIT JOSE
HEAD OF THE DEPARTMENT                       SUPERVISOR
Dept. of Computer Science and Engineering    Computer Science and Engineering
Velammal Engineering College                 Velammal Engineering College
Ambattur - Red Hills Road                    Ambattur - Red Hills Road
Chennai – 600 066.                           Chennai – 600 066.
CERTIFICATE FROM INDUSTRY
CERTIFICATE OF EVALUATION
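COLLEGE NAME : VELAMMAL ENGINEERING COLLEGE
BRANCH : COMPUTER SCIENCE AND ENGINEERING
SEMESTER : V
Sl. No | Name of the student who has done the Internship | Title of the Internship | Name of Faculty Coordinator with designation
1 | Balaji N R | Student Performance Analysis Using Machine Learning | Mrs. Lovelit Jose
2 | Sudarsan M S | Student Performance Analysis Using Machine Learning | Mrs. Lovelit Jose
3 | Vignesh K | Student Performance Analysis Using Machine Learning | Mrs. Lovelit Jose
4 | Vignesh Raj V | Student Performance Analysis Using Machine Learning | Mrs. Lovelit Jose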
This report of internship work, submitted by the above students in partial fulfillment
for the award of the Bachelor of Engineering Degree in Computer Science and Engineering
in Anna University, was evaluated and confirmed to be a report of the work done by the
above students and then assessed.
Submitted for Internal Evaluation held on........................
Examiner 1 Examiner 2 Examiner 3

TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
1.1 EXISTING SYSTEM
1.1.1 DISADVANTAGES
1.2 PROPOSED SOLUTION
1.2.1 ADVANTAGES
1.2.2 SYSTEM ARCHITECTURE
2. MODULES
2.1 MODULES
2.1.1 DATA COLLECTION
2.1.2 DATA PRE-PROCESSING
2.1.2.1 FORMATTING
2.1.2.2 CLEANING
2.1.2.3 SAMPLING
2.1.3 FEATURE EXTRACTION
2.1.4 EVALUATION MODEL
3. DATA FLOW DIAGRAM
3.1 DATA FLOW DIAGRAM
3.2 UML DIAGRAM
3.3 CLASS DIAGRAM
3.4 SEQUENCE DIAGRAM
3.5 ACTIVITY DIAGRAM
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING
4.1.1 SUPERVISED LEARNING
4.1.1.1 DEFINITION
4.1.1.2 ALGORITHM
4.1.2 UNSUPERVISED LEARNING
4.1.2.2 ALGORITHM
4.1.3 REINFORCEMENT LEARNING
5. REQUIREMENTS
5.1 SOFTWARE REQUIREMENTS
5.2 PYTHON LIBRARIES
6. SOURCE CODE WITH OUTPUT
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
7.2 FUTURE WORK
REFERENCES
ACKNOWLEDGEMENT
I wish to acknowledge with thanks the significant contribution of the
management of our college, our Chairman, Dr.M.V.Muthuramalingam, and our Chief
Executive Officer, Thiru. M.V.M. Velmurugan, and their extensive support.
I would like to thank Dr. S. SATHISHKUMAR, Principal of Velammal
Engineering College, for giving me the opportunity to do this project.
I wish to express my gratitude to our Head of the Department, Dr. B.
Murugeshwari, for her moral support and for her valuable and innovative suggestions,
constructive interaction, constant encouragement and unending help, which have enabled me
to complete the project.
I wish to express my humble thanks to the company PANTECH
SOLUTIONS and to the External Guide, Mr. Praveen Kumar, Software Developer, for their
invaluable guidance in shaping this project.
I wish to express my sincere gratitude to my faculty coordinator, Mrs. Lovelit Jose,
Assistant Professor, Department of Computer Science and Engineering, for her guidance,
without which this project would not have been possible.
I am grateful to all the staff members of the Department of Computer Science
and Engineering for providing the necessary facilities to carry out the project. I would
especially like to thank my parents for providing me with the unique opportunity to work,
and for their encouragement and support at all levels. Finally, my heartfelt thanks to The
Almighty for guiding me throughout my life.
Abstract:
Performance analysis in outcome-based learning is a system that strives for
excellence at different levels and in diverse dimensions of the student’s field of
interest.
This paper proposes a complete educational data mining (EDM) framework in the form
of a rule-based recommender system that is developed not only to analyze and predict
the student’s performance, but also to exhibit the reasons behind it.
The proposed framework analyzes the students’ demographic data and their
study-related and psychological characteristics to extract all possible knowledge
from students, teachers and parents.
It seeks the highest possible accuracy in academic performance prediction using a
set of powerful data mining techniques.
The framework succeeds in highlighting the student’s weak points and providing
appropriate recommendations. A realistic case study conducted on 200 students
shows the performance of the proposed framework in comparison with the existing
ones.
LIST OF FIGURES
FIGURE NO FIGURE NAME
1.1 System Architecture
3.1 DataFlow Level 0
3.4 UML Case Diagram
3.5 Class Diagram
3.6 Sequence Diagram
3.7 Activity Diagram
6.1 Student Data Display
6.2 Shape Display
6.3 Column Display
6.4 Gender Count
6.5 Test Preparation Course
6.6 Ethnicity Count
6.7 Description
6.8 Frequency Counts
1. INTRODUCTION
1.1 Existing System:
Previous predictive models focused only on the student’s demographic data, such
as gender, age, family status, family income and qualifications, in addition to
study-related attributes, including homework and study hours as well as previous
achievements and grades. These previous works were limited to predicting
academic success or failure, without illustrating the reasons for the prediction.
Most previous research gathered more than 40 attributes in the data set to predict
the student’s academic performance. These attributes came from the same data
category, whether demographic, study related or both, which led to a lack of
diversity in the prediction rules.
1.1.1 Disadvantages:
• As a result, the generated rules did not fully extract the knowledge about the
reasons behind the student’s dropout.
• Apart from the previously mentioned work, there were statistical analysis
models from the perspective of educational psychology, in which a couple of
studies examined the correlation between mental health and academic
performance.
• The recommendations were too brief and did not illustrate the
methodologies needed to apply them.
1.2 Proposed System:
The proposed framework first merges the demographic and study-related
attributes with the field of educational psychology, by adding the student’s
psychological characteristics to the previously used data set (i.e., the students’
demographic data and study-related attributes). After surveying the factors
previously used for predicting the student’s academic performance, we picked the
most relevant attributes based on their rationale and their correlation with
academic performance.
1.2.1 Advantage:
The proposal aims to analyze the student’s demographic data, study-related details
and psychological characteristics in terms of final state, to determine whether the
student is on the right track, struggling or even failing. In addition, the proposed
model is compared extensively with previous related models.
1.2.2 System Architecture:
[Figure: block diagram with blocks labelled Dataset, Data pre-processing, Feature extraction, ML algorithms, Machine learning model, Data classification, Classifier, Result]
Fig 1.1 – System Architecture
2. MODULES
2.1 Modules
• DATA COLLECTION
• DATA PRE-PROCESSING
• FEATURE EXTRACTION
• EVALUATION MODEL
2.1.1 DATA COLLECTION
The data used in this paper is a set of student details from school records. This step is
concerned with selecting the subset of all available data that you will be working with.
ML problems start with data, preferably lots of data (examples or observations) for
which you already know the target answer. Data for which you already know the target
answer is called labelled data.
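As a minimal sketch of what labelled data looks like, the snippet below builds a tiny table in pandas and separates the input features from the known target. The column names (`study_hours`, `previous_grade`, `result`) and the values are hypothetical and are not taken from the report's dataset.

```python
import pandas as pd

# Hypothetical labelled records: every row already has a known target ("result").
records = pd.DataFrame(
    {
        "study_hours": [2, 8, 5, 1],
        "previous_grade": [55, 88, 70, 40],
        "result": ["fail", "pass", "pass", "fail"],  # the known target answer
    }
)

# Separate the input features from the label the model should learn to predict.
features = records.drop(columns=["result"])
labels = records["result"]

print(features.shape)
print(labels.tolist())
```

Keeping the label in its own variable makes it harder to accidentally feed the answer to the model as an input.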
2.1.2 DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
2.1.2.1 Formatting
2.1.2.2 Cleaning
2.1.2.3 Sampling
2.1.2.1 Formatting:
The data you have selected may not be in a format that is suitable for you to
work with. The data may be in a relational database and you would like it in a
flat file, or the data may be in a proprietary file format and you would like it in
a relational database or a text file.
2.1.2.2 Cleaning:
Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to
address the problem. These instances may need to be removed. Additionally,
there may be sensitive information in some of the attributes, and these attributes
may need to be anonymized or removed from the data entirely.
2.1.2.3 Sampling:
There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller
representative sample of the selected data that may be much faster for exploring
and prototyping solutions before considering the whole dataset.
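The three steps above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the inline CSV text, the `student_name` column and the 50% sample size are made up for the example and are not the report's actual data or settings.

```python
import io
import pandas as pd

# Hypothetical raw export; in practice this could come from a database or a spreadsheet.
raw_csv = io.StringIO(
    "student_name,gender,math score,reading score\n"
    "A,female,72,\n"
    "B,male,69,90\n"
    "C,female,,95\n"
    "D,male,47,57\n"
    "E,female,76,78\n"
    "F,male,88,81\n"
)

# Formatting: load the flat file into a DataFrame and normalise the column names.
df = pd.read_csv(raw_csv)
df.columns = [name.strip().replace(" ", "_") for name in df.columns]

# Cleaning: drop incomplete rows and remove the identifying column.
clean = df.dropna().drop(columns=["student_name"])

# Sampling: take a reproducible 50% sample for quick prototyping.
sample = clean.sample(frac=0.5, random_state=0)

print(len(df), len(clean), len(sample))
```

Dropping incomplete rows is the simplest cleaning choice; filling missing values (for example with the column mean) is an alternative when data is scarce.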
2.1.3 FEATURE EXTRACTION:
The next step is feature extraction, which is an attribute reduction process.
Unlike feature selection, which ranks the existing attributes according to their
predictive significance, feature extraction actually transforms the attributes. The
transformed attributes, or features, are linear combinations of the original
attributes. Finally, our models are trained using a classifier algorithm. We use
the classify module of the Natural Language Toolkit library in Python. Part of the
gathered labelled dataset is used for training, and the rest of our labelled data
is used to evaluate the models. A machine learning algorithm was used to classify
the pre-processed data. The chosen classifier was Random Forest. Such algorithms
are very popular in text classification tasks.
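To illustrate how extracted features can be linear combinations of the original attributes, the sketch below uses principal component analysis (PCA) from scikit-learn. PCA is our choice for the illustration only; the report does not state which extraction technique was applied, and the data here is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for three correlated exam scores (not the report's data).
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
scores = np.hstack([ability + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Feature extraction: project the 3 original attributes onto 2 new features.
pca = PCA(n_components=2)
extracted = pca.fit_transform(scores)

# Each row of components_ holds the weights of one linear combination.
print(extracted.shape)
print(pca.components_.shape)
print(pca.explained_variance_ratio_)

# Reproduce the transform by hand to show it really is a linear combination.
manual = (scores - pca.mean_) @ pca.components_.T
print(np.allclose(manual, extracted))
```

The hand-written projection at the end matches the library output, which is the sense in which each new feature is a weighted sum of the original attributes.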
2.1.4 EVALUATION MODEL
Model evaluation is an integral part of the model development process. It helps
to find the model that best represents our data and to estimate how well the chosen
model will work in the future. Evaluating model performance with the data used for
training is not acceptable in data science because it can easily produce
overoptimistic and overfitted models. There are two methods of evaluating
models in data science: hold-out and cross-validation. To avoid overfitting,
both methods use a test set (not seen by the model) to evaluate model
performance. The performance of each classification model is estimated from its
averaged score. The results are presented in visual form, with the classified
data shown as graphs. Accuracy is defined as the percentage of correct
predictions for the test data. It can be calculated by dividing the number
of correct predictions by the number of total predictions.
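The two evaluation methods and the accuracy calculation can be sketched with scikit-learn as below. The data is synthetic (generated with `make_classification`), and the 25% test split, 50 trees and 5 folds are assumptions for the example rather than the settings used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labelled data standing in for the student records.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold-out: keep 25% of the rows unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Accuracy = correct predictions / total predictions on the test set.
predictions = model.predict(X_test)
correct = int((predictions == y_test).sum())
holdout_accuracy = correct / len(y_test)
print(f"hold-out accuracy: {holdout_accuracy:.2f}")

# Cross-validation: average the accuracy over 5 train/test splits.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5
)
print(f"cross-validation accuracy: {cv_scores.mean():.2f}")
```

Hold-out is cheaper, while cross-validation gives a more stable estimate because every row is used for testing exactly once.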
3. DATA FLOW DIAGRAM
3.1 DATA FLOW DIAGRAM
LEVEL 0
[Figure: blocks labelled Dataset collection, Pre-processing, Random selection, Trained and testing dataset]
Fig 3.1 DataFlow Level 0
LEVEL 1
[Figure: blocks labelled Dataset collection, Pre-processing, Feature extraction, Apply algorithm]
Fig 3.2 DataFlow Level 1
LEVEL 2
[Figure: blocks labelled EDA, Classify the dataset, Train algorithm, Predict result]
Fig 3.3 DataFlow Level 2
3.2 UML DIAGRAM
USE CASE DIAGRAM
Fig 3.4 UML Case Diagram
3.3 CLASS DIAGRAM:
Fig 3.5 Class Diagram
3.4 SEQUENCE DIAGRAM
Fig 3.6 Sequence Diagram
3.5 ACTIVITY DIAGRAM:
Fig 3.7 Activity Diagram
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING

o Machine learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough comes with the
idea that a machine can learn on its own from data (i.e., examples) to produce accurate
results.
o Machine learning combines data with statistical tools to predict an output. This output
is then used by companies to derive actionable insights. Machine learning is closely
related to data mining and Bayesian predictive modeling. The machine receives data as
input and uses an algorithm to formulate answers.
o A typical machine learning task is to provide a recommendation. For those who have
a Netflix account, all recommendations of movies or series are based on the user's
historical data. Tech companies are using unsupervised learning to improve the user
experience with personalized recommendations.
o Machine learning is also used for a variety of tasks such as fraud detection, predictive
maintenance, portfolio optimization, task automation and so on.
4.1.1 SUPERVISED LEARNING
4.1.1.1 Definition
An algorithm uses training data and feedback from humans to learn the relationship between
given inputs and a given output. For instance, a practitioner can use marketing expense and
weather forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will then
predict the output for new data.
There are two categories of supervised learning:
• Classification task.
• Regression task.
Classification
Imagine you want to predict the gender of a customer for a commercial. You will start by
gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; it can only be male
or female. The objective of the classifier will be to assign a probability of being male
or female (i.e., the label) based on the information (i.e., the features you have collected).
Once the model has learned how to recognize male or female, you can use new data to make
a prediction. For instance, you have just received new information from an unknown customer,
and you want to know whether the customer is male or female. If the classifier predicts
male = 70%, it means the algorithm is 70% sure that this customer is male and 30% sure
that the customer is female.
The label can have two or more classes. The above example has only two classes, but
if a classifier needs to predict objects, it may have dozens of classes (e.g., glass, table,
shoes, etc., where each object represents a class).
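A minimal sketch of a two-class classifier that returns a probability for each class is shown below. It uses logistic regression from scikit-learn on synthetic data; the four generated features merely stand in for attributes such as height, weight or salary, and the first row is reused as if it were a new customer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data; the features stand in for height, weight, salary, etc.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)

# Treat one row as a new customer: the model returns one probability per class.
new_customer = X[:1]
probabilities = classifier.predict_proba(new_customer)[0]
predicted_label = classifier.predict(new_customer)[0]

print(probabilities.round(2))
print(predicted_label)
```

The two probabilities always sum to 1, and the predicted label is the class with the larger probability.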
Regression
When the output is a continuous value, the task is a regression. For instance, a financial
analyst may need to forecast the value of a stock based on a range of features such as equity,
previous stock performance and macroeconomic indices. The system will be trained to
estimate the price of the stock with the lowest possible error.
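As a small illustration of regression, the sketch below fits a linear model to synthetic data with a continuous target and reports the average error. The two features, the coefficients (3, -2, 5) and the noise level are invented for the example and have no connection to real stock data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic data: a continuous target that depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.5, size=100)

# Fitting minimises the squared error between predictions and known targets.
regressor = LinearRegression().fit(X, y)
predicted = regressor.predict(X)

print(regressor.coef_.round(2), round(float(regressor.intercept_), 2))
print(round(float(mean_absolute_error(y, predicted)), 2))
```

The fitted coefficients land close to the values used to generate the data, which is what "lowest possible error" means in this simple setting.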
4.1.1.2 Algorithms:
Algorithm Name | Description | Type
Linear regression | Finds a way to correlate each feature to the output to help predict future values. | Regression
Logistic regression | Extension of linear regression that is used for classification tasks. The output variable is binary (e.g., only black or white) rather than continuous (e.g., an infinite list of potential colors). | Classification
Decision tree | Highly interpretable classification or regression model that splits data-feature values into branches at decision nodes (e.g., if a feature is a color, each possible color becomes a new branch) until a final decision output is made. | Regression, Classification
Naive Bayes | The Bayesian method is a classification method that makes use of Bayes' theorem. The theorem updates the prior knowledge of an event with the independent probability of each feature that can affect the event. | Classification
Support vector machine | Support Vector Machine, or SVM, is typically used for the classification task. The SVM algorithm finds a hyperplane that optimally divides the classes. It is best used with a non-linear solver. | Regression (not very common), Classification
Random forest | The algorithm is built upon the decision tree to improve accuracy drastically. Random forest generates many simple decision trees and uses the 'majority vote' method to decide which label to return. For the classification task, the final prediction is the one with the most votes; for the regression task, the average prediction of all the trees is the final prediction. | Regression, Classification
AdaBoost | Classification or regression technique that uses a multitude of models to come up with a decision but weighs them based on their accuracy in predicting the outcome. | Regression, Classification
Gradient-boosting trees | Gradient-boosting trees is a state-of-the-art classification/regression technique. It focuses on the error committed by the previous trees and tries to correct it. | Regression, Classification
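The 'majority vote' idea behind random forest can be made visible by asking each tree of a fitted forest for its own prediction and counting the votes, as sketched below on synthetic data. Note that scikit-learn's `RandomForestClassifier` itself averages the trees' class probabilities rather than counting hard votes, so the two results usually agree but are not guaranteed to match in close cases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (labels 0 and 1), not the report's dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

# Ask every individual tree for its prediction on the first test row.
row = X_test[:1]
votes = np.array([int(tree.predict(row)[0]) for tree in forest.estimators_])

# Count the votes per class and take the most frequent one.
vote_counts = np.bincount(votes, minlength=2)
majority_label = int(vote_counts.argmax())

print(vote_counts)
print(majority_label, forest.predict(row)[0])
```

Using an odd number of trees (25 here) avoids ties in the hand-counted vote for a two-class problem.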
4.1.2 UNSUPERVISED LEARNING
4.1.2.1 Definition
In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., it explores customer demographic data to identify patterns).
You can use it when you do not know how to classify the data and you want the algorithm
to find patterns and classify the data for you.
4.1.2.2 Algorithm:
Algorithm | Description | Type
K-means clustering | Puts data into some number of groups (k), each containing data with similar characteristics (as determined by the model, not in advance by humans). | Clustering
Gaussian mixture model | A generalization of k-means clustering that provides more flexibility in the size and shape of groups (clusters). | Clustering
Hierarchical clustering | Splits clusters along a hierarchical tree to form a classification system. Can be used to cluster loyalty-card customers. | Clustering
Recommender system | Helps to define the relevant data for making a recommendation. | Clustering
PCA/t-SNE | Mostly used to decrease the dimensionality of the data. The algorithms reduce the number of features to 3 or 4 vectors with the highest variances. | Dimension reduction
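Two of the algorithms in the table, k-means clustering and PCA, can be sketched with scikit-learn as follows. The data is synthetic (generated with `make_blobs`), and the choice of three clusters and two components is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic unlabelled data with three natural groups in five dimensions.
X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

# Clustering: k is chosen by the analyst; the group contents are found by the model.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: keep the two directions with the highest variance.
reduced = PCA(n_components=2).fit_transform(X)

print(sorted(set(cluster_ids.tolist())))
print(reduced.shape)
```

No labels are passed to either algorithm, which is what distinguishes this from the supervised examples.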
4.1.3 REINFORCEMENT LEARNING
4.1.3.1 Definition
Reinforcement learning is a subfield of machine learning in which systems are trained
by receiving virtual "rewards" or "punishments," essentially learning by trial and error.
Google's DeepMind has used reinforcement learning to beat a human champion at the
game of Go. Reinforcement learning is also used in video games to improve the gaming
experience by providing smarter bots.
Some of the most famous algorithms are:
• Q-learning
• Deep Q network
• State-Action-Reward-State-Action (SARSA)
• Deep Deterministic Policy Gradient (DDPG)
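To make the trial-and-error idea concrete, here is a minimal sketch of tabular Q-learning on a toy problem of our own invention: an agent on a line of five positions receives a reward of 1 only when it reaches the rightmost position. The environment, the learning rate, the discount factor and the exploration rate are all assumptions chosen for the illustration.

```python
import random

random.seed(0)

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

# Q-table: one estimated value per (state, action) pair, initialised to zero.
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(200):  # episodes
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            best = max(q_table[state])
            action = random.choice([a for a in (0, 1) if q_table[state][a] == best])
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0  # the "reward" signal

        # Q-learning update: move the estimate towards reward + discounted best future value.
        target = reward + GAMMA * max(q_table[next_state])
        q_table[state][action] += ALPHA * (target - q_table[state][action])
        state = next_state

# Greedy policy read from the table for the non-terminal states.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

After training, the greedy policy read from the table moves right from every position, since that is the shortest path to the reward.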
5. REQUIREMENTS
5.1 SOFTWARE REQUIREMENTS:
• Python
• Anaconda Navigator
• Python built-in modules
5.2 PYTHON LIBRARIES:
o NumPy
o Pandas
o Matplotlib
o Scikit-learn (sklearn)
o Seaborn
6. SOURCE CODE
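# Run in a Jupyter notebook; %matplotlib inline is a notebook magic command.
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# Load the student records and preview the first 10 rows.
df = pd.read_csv("StudentsPerformance.csv")
df.head(10)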
Fig 6.1 Student Data Display
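# Number of rows and columns in the dataset.
df.shape
Fig 6.2 Shape Display
# Names of the columns.
df.columns
Fig 6.3 Column Display
# Number of students of each gender.
df['gender'].value_counts()
Fig 6.4 Gender Count
# Number of students who did and did not complete the test preparation course.
df['test preparation course'].value_counts()
Fig 6.5 Test Preparation Course
# Number of students in each race/ethnicity group.
df['race/ethnicity'].value_counts()
Fig 6.6 Ethnicity Count
# Summary statistics for the numeric columns.
df.describe()
Fig 6.7 Description
# Count, unique values and most frequent value for the text (object) columns.
df.describe(include=['O'])
Fig 6.8 Frequency Counts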
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION:
Analysing student performance is a major problem, and it is important that it is
addressed. The work reported here applies machine learning techniques with
supervised learning algorithms to student records in order to understand how the
algorithms perform. We analysed the performance of each student and categorized
it into three classes (high, average and low) with an accuracy of 79%.
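One possible way to derive the three classes is to bin each student's average score, as sketched below with pandas. This is an assumption made for illustration: the report does not state how the classes were defined, so the cut-offs (50 and 75), the score column names and the sample values are all hypothetical.

```python
import pandas as pd

# Hypothetical scores; the column names are assumed, not confirmed by the report.
df = pd.DataFrame(
    {
        "math score": [35, 62, 91, 78, 55],
        "reading score": [40, 70, 95, 72, 48],
        "writing score": [38, 65, 93, 80, 52],
    }
)

# Average the three subject scores for each student.
df["average score"] = df[["math score", "reading score", "writing score"]].mean(axis=1)

# Assumed cut-offs: up to 50 = low, above 50 up to 75 = average, above 75 = high.
df["performance"] = pd.cut(
    df["average score"],
    bins=[0, 50, 75, 100],
    labels=["low", "average", "high"],
    include_lowest=True,
)
print(df[["average score", "performance"]])
```

The resulting `performance` column could then serve as the label for a supervised classifier.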
7.2 FUTURE WORK:
In the future, we plan to provide a technical solution that improves the efficiency of
student performance analysis. A user interaction model could be developed to accept
student records dynamically and to send staff an alert message about students who
are performing poorly. We could build the prediction using a neural network and
expect improved results. We can also add non-academic attributes alongside the
academic attributes.
