
AN INTERNSHIP REPORT

STUDENT PERFORMANCE ANALYSIS USING

MACHINE LEARNING

Submitted by

BALAJI N R 113219031022
SUDARSAN M S 113219031147
VIGNESH K 113219031158
VIGNESH RAJ V 113219031159

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

VELAMMAL ENGINEERING COLLEGE, CHENNAI-66.


(An Autonomous Institution, Affiliated to Anna University, Chennai)

2021-2022
VELAMMAL ENGINEERING COLLEGE
CHENNAI -66

BONAFIDE CERTIFICATE

Certified that this internship report “Student Performance Analysis using
Machine Learning” is the bonafide work of BALAJI N R (113219031022),
SUDARSAN M S (113219031147), VIGNESH K (113219031158) and VIGNESH
RAJ V (113219031159), carried out at “PANTECH SOLUTIONS” from
07.12.2021 to 07.01.2022.

SIGNATURE SIGNATURE

Dr. B. MURUGESHWARI Mrs. LOVELIT JOSE


HEAD OF THE DEPARTMENT SUPERVISOR
Dept. of Computer Science and Engineering Computer Science and Engineering
Velammal Engineering College Velammal Engineering College
Ambattur - Red Hills Road Ambattur - Red Hills Road
Chennai – 600 066. Chennai – 600 066.
CERTIFICATE FROM INDUSTRY
CERTIFICATE OF EVALUATION

COLLEGE NAME : VELAMMAL ENGINEERING COLLEGE


BRANCH : COMPUTER SCIENCE AND ENGINEERING
SEMESTER :V

Sl. No   Name of the student who    Title of the Internship        Name of Faculty Coordinator
         has done the Internship                                   with designation

1        Balaji N R                 Student Performance Analysis   Mrs. Lovelit Jose
2        Sudarsan M S               Using Machine Learning
3        Vignesh K
4        Vignesh Raj V

This report of internship work, submitted by the above students in partial fulfillment
for the award of the Bachelor of Engineering degree in Computer Science and Engineering
of Anna University, was evaluated and confirmed to be a report of the work done by the
above students, and was then assessed.

Submitted for Internal Evaluation held on........................

Examiner 1 Examiner 2 Examiner 3


TABLE OF CONTENTS

CHAPTER NO.   TITLE

              ABSTRACT
              LIST OF FIGURES

1.            INTRODUCTION
              1.1 EXISTING SYSTEM
                  1.1.1 DISADVANTAGES
              1.2 PROPOSED SOLUTION
                  1.2.1 ADVANTAGES
                  1.2.2 SYSTEM ARCHITECTURE

2.            MODULES
              2.1 MODULES
                  2.1.1 DATA COLLECTION
                  2.1.2 DATA PRE-PROCESSING
                        2.1.2.1 FORMATTING
                        2.1.2.2 CLEANING
                        2.1.2.3 SAMPLING
                  2.1.3 FEATURE EXTRACTION
                  2.1.4 EVALUATION MODEL

3.            DATA FLOW DIAGRAM
              3.1 DATA FLOW DIAGRAM
              3.2 UML DIAGRAM
              3.3 CLASS DIAGRAM
              3.4 SEQUENCE DIAGRAM
              3.5 ACTIVITY DIAGRAM

4.            DOMAIN SPECIFICATION
              4.1 MACHINE LEARNING
                  4.1.1 SUPERVISED LEARNING
                        4.1.1.1 DEFINITION
                        4.1.1.2 ALGORITHM
                  4.1.2 UNSUPERVISED LEARNING
                        4.1.2.1 DEFINITION
                        4.1.2.2 ALGORITHM
                  4.1.3 REINFORCEMENT LEARNING
                        4.1.3.1 DEFINITION

5.            REQUIREMENTS
              5.1 SOFTWARE REQUIREMENTS
              5.2 PYTHON LIBRARIES

6.            SOURCE CODE WITH OUTPUT

7.            CONCLUSION AND FUTURE WORK
              7.1 CONCLUSION
              7.2 FUTURE WORK

              REFERENCES
ACKNOWLEDGEMENT

I gratefully acknowledge the significant contribution of the management of our college,
Chairman Dr. M.V. Muthuramalingam and Chief Executive Officer Thiru. M.V.M. Velmurugan,
and thank them for their extensive support.

I would like to thank Dr. S. SATHISHKUMAR, Principal of Velammal Engineering College,
for giving me this opportunity to do this project.

I wish to express my gratitude to our Head of the Department, Dr. B. Murugeshwari, for
her moral support and for her valuable and innovative suggestions, constructive
interaction, constant encouragement and unending help, which have enabled me to
complete the project.

I wish to express my humble thanks to the company PANTECH SOLUTIONS and the External
Guide, Mr. Praveen Kumar, Software Developer, for their invaluable guidance in shaping
this project.

I wish to express my sincere gratitude to my faculty coordinator, Mrs. Lovelit Jose,
Assistant Professor, Department of Computer Science and Engineering, for her guidance,
without which this project would not have been possible.

I am grateful to all the staff members of the Department of Computer Science and
Engineering for providing the necessary facilities to carry out the project. I would
especially like to thank my parents for providing me with the unique opportunity to work,
and for their encouragement and support at all levels. Finally, my heartfelt thanks to The
Almighty for guiding me throughout my life.
Abstract:

Performance analysis in outcome-based learning is a system that strives for
excellence at different levels and across diverse dimensions of the student’s
interests.

This paper proposes a complete educational data mining (EDM) framework in the
form of a rule-based recommender system that is developed not only to analyze and
predict the student’s performance, but also to exhibit the reasons behind it.

The proposed framework analyzes the students’ demographic data, study-related
attributes and psychological characteristics to extract all possible knowledge from
students, teachers and parents. It seeks the highest possible accuracy in academic
performance prediction using a set of data mining techniques.

The framework highlights the student’s weak points and provides appropriate
recommendations. A case study conducted on 200 students indicates that the
proposed framework performs well in comparison with the existing ones.

LIST OF FIGURES

FIGURE NO   FIGURE NAME

1.1         System Architecture
3.1         DataFlow Level 0
3.2         DataFlow Level 1
3.3         DataFlow Level 2
3.4         UML Case Diagram
3.5         Class Diagram
3.6         Sequence Diagram
3.7         Activity Diagram
6.1         Student Data Display
6.2         Shape Display
6.3         Column Display
6.4         Gender Count
6.5         Test Preparation Course
6.6         Ethnicity Count
6.7         Description
6.8         Frequency Counts

1. INTRODUCTION

1.1 Existing System:

Previous predictive models focused only on the student’s demographic data, such as
gender, age, family status, family income and qualifications, together with
study-related attributes such as homework and study hours and previous
achievements and grades. These works were limited to predicting academic success
or failure, without illustrating the reasons behind the prediction. Most previous
studies gathered more than 40 attributes in their data sets to predict the student’s
academic performance. These attributes came from the same data categories
(demographic, study-related, or both), which led to a lack of diversity in the
predicting rules.

1.1.1 Disadvantages:

• As a result, the generated rules did not fully extract the knowledge of the
reasons behind the student’s dropout.
• Apart from the previously mentioned work, there were statistical analysis
models from the perspective of educational psychology that conducted only a
couple of studies to examine the correlation between mental health and
academic performance.
• The recommendations were too brief and did not illustrate the methodologies
needed to apply them.
1.2 Proposed System:

The proposed framework first merges the demographic and study-related attributes
with the field of educational psychology, by adding the student’s psychological
characteristics to the previously used data set (i.e., the students’ demographic
and study-related data). After surveying the factors previously used for
predicting the student’s academic performance, we picked the most relevant
attributes based on their rationale and correlation with academic performance.

1.2.1 Advantages:

The proposal aims to analyze the student’s demographic data, study-related details
and psychological characteristics in terms of the final state, to determine whether
the student is on the right track, struggling or even failing. It also includes an
extensive comparison of the proposed model with previous related models.
1.2.2 System Architecture:

[Figure: block diagram with the components Dataset, Data pre-processing, Feature
extraction, ML algorithms, Machine learning model, Data classification, Classifier
and Result]

Fig 1.1 – System Architecture
2. MODULES

2.1 Modules

• DATA COLLECTION

• DATA PRE-PROCESSING

• FEATURE EXTRACTION

• EVALUATION MODEL

2.1.1 DATA COLLECTION

The data used in this paper is a set of student details from school records. This step is
concerned with selecting the subset of all available data that you will be working with.
ML problems start with data, preferably lots of data (examples or observations) for
which you already know the target answer. Data for which you already know the target
answer is called labelled data.
2.1.2 DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
2.1.2.1 Formatting
2.1.2.2 Cleaning
2.1.2.3 Sampling

2.1.2.1 Formatting:

The data you have selected may not be in a format that is suitable for you to
work with. The data may be in a relational database and you would like it in a
flat file, or the data may be in a proprietary file format and you would like it in
a relational database or a text file.
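
As a minimal sketch of the formatting step, the snippet below moves a table from a relational database into flat CSV text with pandas. The in-memory SQLite database, the `students` table and its columns are hypothetical stand-ins, not the data source used in this project.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a school records system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER, gender TEXT, math_score INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "female", 72), (2, "male", 65), (3, "female", 90)],
)
conn.commit()

# Read the relational table into a DataFrame.
students = pd.read_sql_query("SELECT * FROM students", conn)
conn.close()

# Serialise it as flat CSV text; pass a file path to to_csv to write a file instead.
csv_text = students.to_csv(index=False)
print(csv_text)
```

The same DataFrame could be written back to a database or to another file format, so the choice of target format can follow whatever the later steps expect.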

2.1.2.2 Cleaning:

Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to
address the problem. These instances may need to be removed. Additionally,
there may be sensitive information in some of the attributes, and these attributes
may need to be anonymized or removed from the data entirely.
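
The cleaning options described above can be sketched with pandas as follows. The small table, the `student_name` column and the choice of the median as a fill value are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Made-up records; student_name plays the role of a sensitive attribute.
records = pd.DataFrame({
    "student_name": ["A", "B", "C", "D"],
    "gender": ["female", "male", None, "male"],
    "math score": [72, np.nan, 90, 47],
})

# Option 1: remove incomplete instances (rows with any missing value).
complete = records.dropna()

# Option 2: fix missing numeric values, here with the column median.
filled = records.copy()
filled["math score"] = filled["math score"].fillna(filled["math score"].median())

# Remove the sensitive attribute from the data entirely.
anonymized = complete.drop(columns=["student_name"])
print(anonymized)
```

Whether to drop or fill depends on how much data is missing; dropping is simpler but shrinks the dataset.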

2.1.2.3 Sampling:

There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller
representative sample of the selected data that may be much faster for exploring
and prototyping solutions before considering the whole dataset.
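
A representative sample can be drawn in one line with pandas. The sketch below uses a synthetic 100-row table and a 20% sampling fraction, both of which are arbitrary choices for illustration.

```python
import pandas as pd

# Synthetic table standing in for the full selected data.
full = pd.DataFrame({
    "gender": ["female", "male"] * 50,
    "math score": range(100),
})

# Simple random sample of 20% of the rows; random_state makes it repeatable.
sample = full.sample(frac=0.2, random_state=42)

# Stratified alternative: 20% from each gender, so the class mix is preserved.
stratified = full.groupby("gender").sample(frac=0.2, random_state=42)
print(len(sample), stratified["gender"].value_counts().to_dict())
```

The stratified variant is usually the safer choice when a category is rare, because a simple random sample may under-represent it.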

2.1.3 FEATURE EXTRACTION:

The next step is feature extraction, which is an attribute reduction process.
Unlike feature selection, which ranks the existing attributes according to their
predictive significance, feature extraction actually transforms the attributes. The
transformed attributes, or features, are linear combinations of the original
attributes. Finally, our models are trained using a classifier algorithm. We use
the classify module of the Natural Language Toolkit library in Python. Part of the
labelled dataset gathered is used for training, and the rest of our labelled data
is used to evaluate the models. Machine learning algorithms were used to classify
the pre-processed data. The chosen classifier was Random Forest. This algorithm
is very popular in text classification tasks.
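
The statement that extracted features are linear combinations of the original attributes can be illustrated with principal component analysis (PCA) from scikit-learn. PCA is used here only as a familiar example of feature extraction, and the three correlated "score" columns are synthetic; neither is claimed to be what the project itself used.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for three correlated exam scores.
ability = rng.normal(65, 12, size=200)
scores = np.column_stack(
    [ability + rng.normal(0, 5, size=200) for _ in range(3)]
)

# Reduce the three original attributes to two extracted features.
pca = PCA(n_components=2)
features = pca.fit_transform(scores)

# Each extracted feature is a weighted sum of the (centred) original
# attributes; the weights are the rows of pca.components_.
print(pca.components_)
print(pca.explained_variance_ratio_)
```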

2.1.4 EVALUATION MODEL


Model evaluation is an integral part of the model development process. It helps
to find the best model that represents our data and to estimate how well the
chosen model will work in the future. Evaluating model performance with the data
used for training is not acceptable in data science because it can easily generate
overoptimistic and overfitted models. There are two methods of evaluating
models in data science: hold-out and cross-validation. To avoid overfitting,
both methods use a test set (not seen by the model) to evaluate model
performance. The performance of each classification model is estimated based on
its averaged results. The results are presented in visual form, with the classified
data represented as graphs. Accuracy is defined as the percentage of correct
predictions for the test data. It can be calculated easily by dividing the number
of correct predictions by the number of total predictions.
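
The two evaluation methods and the accuracy formula can be sketched with scikit-learn as below. The data is generated with `make_classification` purely for illustration, and the 25% test split, 5 folds and 100 trees are assumed settings rather than the ones used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labelled data standing in for the student records.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold-out: keep 25% of the rows unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy = number of correct predictions / number of total predictions.
manual_accuracy = (predictions == y_test).sum() / len(y_test)
holdout_accuracy = accuracy_score(y_test, predictions)

# 5-fold cross-validation: every row is used for testing exactly once.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)
print(holdout_accuracy, cv_scores.mean())
```

Cross-validation costs more computation than a single hold-out split, but its averaged score is less sensitive to which rows happen to land in the test set.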

3. DATA FLOW DIAGRAM

3.1 DATA FLOW DIAGRAM


LEVEL 0

[Figure: Dataset collection -> Pre-processing -> Random selection -> Trained & testing dataset]

Fig 3.1 DataFlow Level 0
LEVEL 1

[Figure: Dataset collection -> Pre-processing -> Feature extraction -> Apply algorithm]

Fig 3.2 DataFlow Level 1
LEVEL 2

[Figure: EDA -> Classify the dataset -> Train algorithm -> Predict result]

Fig 3.3 DataFlow Level 2
3.2 UML DIAGRAM
USE CASE DIAGRAM

Fig 3.4 UML Case Diagram
3.3 CLASS DIAGRAM:

Fig 3.5 Class Diagram

3.4 SEQUENCE DIAGRAM

Fig 3.6 Sequence Diagram

3.5 ACTIVITY DIAGRAM:

Fig 3.7 Activity Diagram

4. DOMAIN SPECIFICATION

4.1 MACHINE LEARNING


o Machine learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough comes with the
idea that a machine can learn on its own from data (i.e., examples) to produce accurate
results.
o Machine learning combines data with statistical tools to predict an output. This output
is then used by businesses to derive actionable insights. Machine learning is closely
related to data mining and Bayesian predictive modeling. The machine receives data as
input and uses an algorithm to formulate answers.
o A typical machine learning task is to provide a recommendation. For those who have
a Netflix account, all recommendations of movies or series are based on the user's
historical data. Tech companies are using unsupervised learning to improve the user
experience with personalized recommendations.
o Machine learning is also used for a variety of tasks such as fraud detection, predictive
maintenance, portfolio optimization, task automation and so on.

4.1.1 SUPERVISED LEARNING

4.1.1.1 Definition

An algorithm uses training data and feedback from humans to learn the relationship of given
inputs to a given output. For instance, a practitioner can use marketing expense and weather
forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will then
predict the output for new data.
There are two categories of supervised learning:

• Classification task.
• Regression task.

Classification
Imagine you want to predict the gender of a customer for a commercial. You will start
by gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; in this example it is
recorded as either male or female. The objective of the classifier is to assign a
probability of being male or female (i.e., the label) based on the information (i.e., the
features) you have collected. Once the model has learned how to recognize male or female,
you can use new data to make a prediction. For instance, you have just received new
information from an unknown customer, and you want to know whether the customer is male
or female. If the classifier predicts male = 70%, it means the algorithm estimates a 70%
probability that this customer is male and a 30% probability that the customer is female.
The label can have two or more classes. The above example has only two classes, but
if a classifier needs to predict objects, it may have dozens of classes (e.g., glass,
table, shoes, etc., where each object represents a class).
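
The idea of a classifier returning a probability per class can be sketched with scikit-learn's `LogisticRegression`. To keep the example close to this report's subject, it uses a made-up pass/fail label with two invented features (hours studied and a previous score) rather than the customer data described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features (hours studied, previous score) and a binary label (1 = pass).
X = np.array([[1, 40], [2, 45], [3, 50], [4, 58],
              [5, 62], [6, 70], [7, 75], [8, 85]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one probability per class; the values sum to 1.
new_student = np.array([[5, 60]])
proba = clf.predict_proba(new_student)[0]
label = clf.predict(new_student)[0]
print(dict(zip(clf.classes_, proba)), label)
```

The predicted label is simply the class with the highest probability, which is how a statement such as "male = 70%" turns into a final decision.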

Regression

When the output is a continuous value, the task is a regression. For instance, a financial
analyst may need to forecast the value of a stock based on a range of features such as
equity, previous stock performance and macroeconomic indices. The system is trained to
estimate the price of the stock with the lowest possible error.
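
A regression task can be sketched in a few lines with scikit-learn. The four data points below are invented (hours studied against a continuous score), and mean squared error is used as one common way of measuring "the lowest possible error".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data: one input feature and a continuous target.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([52.0, 61.0, 68.0, 79.0])

# Fit a straight line that minimises the squared error on the training data.
reg = LinearRegression().fit(X, y)
pred = reg.predict(X)
mse = mean_squared_error(y, pred)
print(reg.coef_[0], reg.intercept_, mse)
```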

4.1.1.2 Algorithms:

Algorithm       Description                                                      Type
name

Linear          Finds a way to correlate each feature to the output to help      Regression
regression      predict future values.

Logistic        Extension of linear regression that is used for classification   Classification
regression      tasks. The output variable is binary (e.g., only black or
                white) rather than continuous (e.g., an infinite list of
                potential colors).

Decision        Highly interpretable classification or regression model that     Regression,
tree            splits data-feature values into branches at decision nodes       Classification
                (e.g., if a feature is a color, each possible color becomes a
                new branch) until a final decision output is made.

Naive Bayes     The Bayesian method is a classification method that makes use    Classification
                of Bayes' theorem. The theorem updates the prior knowledge of
                an event with the independent probability of each feature
                that can affect the event.

Support         Support Vector Machine, or SVM, is typically used for the        Regression (not
vector          classification task. The SVM algorithm finds a hyperplane that   very common),
machine         optimally divides the classes. It is best used with a            Classification
                non-linear solver.

Random          The algorithm is built upon the decision tree to improve         Regression,
forest          accuracy drastically. Random forest generates many simple        Classification
                decision trees and uses the 'majority vote' method to decide
                which label to return. For the classification task, the final
                prediction is the one with the most votes; for the regression
                task, the final prediction is the average prediction of all
                the trees.

AdaBoost        Classification or regression technique that uses a multitude     Regression,
                of models to come up with a decision but weighs them based on    Classification
                their accuracy in predicting the outcome.

Gradient-       Gradient-boosting trees is a state-of-the-art                    Regression,
boosting        classification/regression technique. It focuses on the error     Classification
trees           committed by the previous trees and tries to correct it.
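
The way a random forest combines its trees, as described in the table, can be sketched in plain Python. The per-tree outputs below are invented, and the two helper functions are hypothetical names used only for this illustration; library implementations may combine trees differently in detail.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_labels).most_common(1)[0][0]


def average_prediction(tree_values):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_values)


# Invented outputs of five classification trees and three regression trees.
votes = ["high", "average", "high", "low", "high"]
estimates = [70.0, 74.0, 72.0]

print(majority_vote(votes))           # high
print(average_prediction(estimates))  # 72.0
```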

4.1.2 UNSUPERVISED LEARNING

4.1.2.1 Definition

In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., it explores customer demographic data to identify patterns).

You can use it when you do not know how to classify the data, and you want the algorithm
to find patterns and classify the data for you.

4.1.2.2 Algorithm:
Algorithm          Description                                                   Type

K-means            Puts data into some number of groups (k), each containing     Clustering
clustering         data with similar characteristics (as determined by the
                   model, not in advance by humans).

Gaussian mixture   A generalization of k-means clustering that provides more     Clustering
model              flexibility in the size and shape of groups (clusters).

Hierarchical       Splits clusters along a hierarchical tree to form a           Clustering
clustering         classification system. Can be used to cluster loyalty-card
                   customers.

Recommender        Helps to define the relevant data for making a                Clustering
system             recommendation.

PCA/t-SNE          Mostly used to decrease the dimensionality of the data. The   Dimension
                   algorithms reduce the number of features to 3 or 4 vectors    reduction
                   with the highest variances.
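
As a small illustration of clustering without labels, the sketch below runs k-means from scikit-learn on six invented two-dimensional points that form two obvious groups. The value k = 2 is chosen by hand for this toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabelled points: three near (1, 1) and three near (8, 8).
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

# Ask for k = 2 groups; the algorithm decides which point goes where.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(labels, km.cluster_centers_)
```

The cluster numbers themselves (0 or 1) are arbitrary; only the grouping of the points carries meaning.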

4.1.3 REINFORCEMENT LEARNING

4.1.3.1 Definition

Reinforcement learning is a subfield of machine learning in which systems are trained
by receiving virtual "rewards" or "punishments", essentially learning by trial and error.
Google's DeepMind has used reinforcement learning to beat a human champion at the
game of Go. Reinforcement learning is also used in video games to improve the gaming
experience by providing smarter bots.
Some of the most famous algorithms are:
• Q-learning
• Deep Q network
• State-Action-Reward-State-Action (SARSA)
• Deep Deterministic Policy Gradient (DDPG)

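
Learning by trial and error with rewards can be sketched with tabular Q-learning, the first algorithm in the list above. The five-state corridor environment, the reward of 1 at the goal and the values of the learning rate and discount factor are all assumptions made for this toy example.

```python
import random

N_STATES = 5             # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)       # index 0 = move left, index 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (illustrative)


def step(state, action_index):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_update(Q, state, action_index, reward, nxt, done):
    """Move Q[s][a] toward reward + GAMMA * (best value of the next state)."""
    target = reward if done else reward + GAMMA * max(Q[nxt])
    Q[state][action_index] += ALPHA * (target - Q[state][action_index])


random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                 # episodes
    state, done = 0, False
    while not done:
        a = random.randrange(2)      # purely random exploration
        nxt, reward, done = step(state, a)
        q_update(Q, state, a, reward, nxt, done)
        state = nxt

# Greedy policy read off the learned table for the non-terminal states.
policy = ["right" if q[1] > q[0] else "left" for q in Q[:-1]]
print(policy)
```

Because Q-learning is off-policy, the agent can explore completely at random here and still learn the values of the best policy, which in this corridor is to keep moving right.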
5. REQUIREMENTS

5.1 SOFTWARE REQUIREMENTS:

• Python
• Anaconda Navigator
• Python built-in modules

5.2 PYTHON LIBRARIES:

o NumPy
o Pandas
o Matplotlib
o Scikit-learn (sklearn)
o Seaborn
6. SOURCE CODE

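# Standard library
import warnings
from datetime import datetime

# Data handling and plotting libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas import Series

# Jupyter magic: render plots inside the notebook
%matplotlib inline

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Load the student records and preview the first 10 rows
df = pd.read_csv("StudentsPerformance.csv")
df.head(10)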

Fig 6.1 Student Data Display

df.shape

Fig 6.2 Shape Display

df.columns

Fig 6.3 Column Display

df['gender'].value_counts()

Fig 6.4 Gender Count

df['test preparation course'].value_counts()

Fig 6.5 Test Preparation Course

# Count the students in each race/ethnicity group and display the result
a = df['race/ethnicity'].value_counts()
a

Fig 6.6 Ethnicity Count

df.describe()

Fig 6.7 Description

df.describe(include=['O'])

Fig 6.8 Frequency Counts

7. CONCLUSION AND FUTURE WORK

7.1 CONCLUSION:
Analysing student performance is a major problem, and it is important to address it.
The work reported here applies machine learning techniques with supervised learning
algorithms to student records in order to understand how the algorithms perform. We
analysed the performance of the students and categorized it into three classes
(high, average and low) with an accuracy of 79%.
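
One possible way to derive the three classes is sketched below with `pandas.cut`. The score column names, the four sample rows and the cut-offs at 50 and 75 are assumptions made for illustration; the report does not state the thresholds that were actually used.

```python
import pandas as pd

# Invented sample rows; the score column names are assumed.
scores = pd.DataFrame({
    "math score": [72, 45, 90, 58],
    "reading score": [74, 50, 95, 60],
    "writing score": [70, 43, 93, 62],
})

# Average the three subject scores for each student.
scores["average"] = scores[
    ["math score", "reading score", "writing score"]
].mean(axis=1)

# Hypothetical cut-offs: up to 50 is low, above 50 up to 75 is average,
# above 75 is high.
scores["performance"] = pd.cut(
    scores["average"],
    bins=[0, 50, 75, 100],
    labels=["low", "average", "high"],
    include_lowest=True,
)
print(scores[["average", "performance"]])
```

A column built this way could then serve as the target label for a supervised classifier.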

7.2 FUTURE WORK:

In the future, we intend to provide a technical solution that helps improve student
performance. A user interaction model could be developed so that student records can
be supplied dynamically, and it could send staff an alert message about students
who have low performance. We could also build the predictor using a neural network,
which may give improved results, and add non-academic attributes along with the
academic attributes.

REFERENCES:

1. https://iopscience.iop.org/article/10.1088/1757-899X/1055/1/012122/pdf
2. https://www.ijser.org/researchpaper/Students-Performance-Analysis-Using-Machine-Learning-Tools.pdf
3. https://www.activestate.com/resources/quick-reads/what-is-pandas-in-python-everything-you-need-to-know
4. https://scikit-learn.org/stable
5. https://www.codecademy.com/article/scikit-learn
6. https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python

