Alen George
2021-2023
MAR ATHANASIUS COLLEGE OF ENGINEERING, KOTHAMANGALAM
(Affiliated to APJ Abdul Kalam Technological University, TVM)
ACKNOWLEDGEMENT
First and foremost, I thank God Almighty for his divine grace and blessings in
making all this possible. May he continue to lead me in the years to come.
I would like to express my special gratitude and thanks to my mini project guide,
Prof. Beena Jacob, Assistant Professor, Department of Computer Applications, for her
guidance, constant supervision and support, as well as for providing the necessary
information regarding the mini project.
I profusely thank the other professors in the department and all other staff of
MACE for their guidance and inspiration throughout my course of study. No words can
express my humble gratitude to my beloved parents, who have guided me in all
walks of my journey. My thanks and appreciation also go to my friends and everyone
who willingly helped me with their abilities.
ABSTRACT
An enormous amount of digital data comes from social media, research, agriculture,
medical records and other sources. Universities and technical organizations face high
competition, and their challenge lies in analysing their students' performance. The
most important challenges are in admission, student placement and the curriculum. The
two most important processes during which data are collected and analysed are
admission and placement. A university's standing in the market depends heavily on the
academic performance and placement of its students. Apart from academic performance,
various other factors help in understanding a student's final performance. In this
project, data mining techniques are used to understand the performance of students
and to group them into categories, since a student must consistently improve to
compete in today's world. Almost every university has its own management system for
student records. The University of Malaysia Sarawak (UNIMAS), for example, has a
student management system, but lecturers are not permitted to access it: due to its
privacy settings, access is restricted to top management such as the Deans and Deputy
Deans of Undergraduate and Student Development. This project therefore proposes a
system named 'Academic Analytics Using Machine Learning' to keep track of students'
results. The proposed system predicts each student's performance, which in turn helps
lecturers identify students who are likely to perform poorly in their courses.
The proposed system predicts student performance through rules generated via data
mining; the data mining technique used in this project is classification. The dataset
consists of 6 features (Gender, S1 CGPA, S2 CGPA, S3 CGPA, Overall CGPA, Target
Class). Several algorithms were considered, and by comparing their accuracies we
chose the Support Vector Machine (SVM), which achieves 96% accuracy on the given
dataset.
TABLE OF CONTENTS
1 Introduction
2 Supporting Literature
  2.1 Literature Review
  2.2 Findings and Proposals
3 System Analysis
  3.1 Analysis of Dataset
    3.1.1 About the Dataset
    3.1.2 Explore the Dataset
  3.2 Data Pre-processing
    3.2.1 Data Cleaning
    3.2.2 Analysis of Feature Variables
    3.2.3 Analysis of Class Variables
  3.3 Data Visualization
  3.4 Analysis of Algorithm
    3.4.1 Accuracy Comparison
  3.5 Project Pipeline
  3.6 Feasibility Analysis
    3.6.1 Technical Feasibility
    3.6.2 Economic Feasibility
    3.6.3 Operational Feasibility
  3.7 System Environment
    3.7.1 Software Environment
    3.7.2 Hardware Environment
4 System Design
  4.1 Model Building
    4.1.1 Model Planning
    4.1.2 Training
    4.1.3 Testing
6 Model Deployment
7 Git History
8 Conclusions
9 Future Work
10 Appendix
  10.1 Minimum Software Requirements
  10.2 Minimum Hardware Requirements
11 References
Academic analytics using machine learning
1. INTRODUCTION
Machine learning is a specialization within the broad field of AI. Machine learning
works towards comprehending the complexity of various kinds of collected data and
identifying the right model for the data by trying several models. This can be
effectively systemized for easier interpretation and use by people. Machine learning
lies within computer science but differs from the basic computing algorithms used for
problem solving. In machine learning, algorithms are designed in a way that allows
the system or computer to process the input information, create training sets and
produce the desired output using statistical estimation. Students are the greatest
asset of any university. Universities and students play a very important role in
producing high-quality graduates through their academic performance. Academic
performance is the level of accomplishment of a student's educational goals, which
can be measured and tested through examinations, assessments and other forms of
measurement. However, academic performance varies: for different reasons, students
may reach different levels of achievement. Performance evaluation is one of the
fundamental parts of a student's personal and professional development. Performance
evaluations highlight students' strengths and identify areas that require improvement
as goals. By being able to analyse the performance of their students, teachers can
direct their attention to the necessary areas, advise and guide the students along
the right path, and acknowledge and reward their achievements.
2. SUPPORTING LITERATURE
2.1. Literature Review
Paper [1] argues that student performance prediction is very important for
understanding a student's progress rate. To predict student performance, the authors
begin by collecting data sets: students' class test, attendance, presentation,
assignment, midterm and final examination marks. For the best accuracy rate, they
propose using K-Nearest Neighbors and a Decision Tree Classifier. The proposed model
predicts student performance across three semesters, and the training and testing
sets give optimum results and accuracy. They obtain their best results with the
K-Nearest Neighbors and Decision Tree Classifier models, with 89.74 percent and
94.44 percent accuracy respectively.
From the above three papers, we learn that different approaches are used for student
performance analysis. The first paper uses KNN and a Decision Tree Classifier; the
proposed model predicts student performance across three semesters, and the training
and testing sets give optimum results and accuracy. In the second paper, a model was
developed to predict the grades of students taking the same course in the next term,
using logistic regression, linear discriminant analysis, K-nearest neighbours,
classification and regression trees, Gaussian Naive Bayes, and support vector
machines on historical grade data from one of the undergraduate courses. In the third
paper, WEKA was used to examine the feasibility of linear regression and a multilayer
perceptron in terms of accuracy, performance, and error rate. According to the
findings, the support vector machine has the highest accuracy, at 94.88%.
3. SYSTEM ANALYSIS
3.1. Analysis of Dataset
3.1.1. About the Dataset
The dataset I used was built by collecting information through a survey. It contains
details of various students, including marks from the different semester and
sessional exams, as well as each student's gender.
https://drive.google.com/file/d/1mwfn2PWHVPRQg3IqP2xlUTMlyVo6XdNb/view?
usp=share_link
This dataset contains about 600 records, each holding the academic details a student
has to answer. By analysing the values in each record, we can predict whether the
student will pass or fail in the upcoming semester. The dataset has 6 features, i.e.
the gender and the marks of the various semester and sessional exams, with pass or
fail as the class label.
The attributes in this dataset are the CGPA of S1, S2, S3, Session 1 and Session 2,
and age. The class label is whether the student will pass or fail in the upcoming
semester, which is found by analysing this dataset.
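The structure described above can be sketched with pandas. The column names and
values below are assumptions for illustration only; the real file comes from the
Drive link above (e.g. via `pd.read_csv`):

```python
import pandas as pd

# Hypothetical records mirroring the survey dataset's assumed columns;
# the real data would be loaded with, e.g., pd.read_csv("student_dataset.csv").
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "S1_CGPA": [7.2, 8.5, 5.1, 9.0],
    "S2_CGPA": [7.0, 8.8, 4.9, 9.1],
    "S3_CGPA": [7.4, 8.6, 5.0, 8.9],
    "Overall_CGPA": [7.2, 8.6, 5.0, 9.0],
    "Target": ["pass", "pass", "fail", "pass"],
})

print(df.shape)                      # (records, features + class label)
print(df["Target"].value_counts())   # class distribution of pass/fail
```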
Accuracy = (a + d) / (a + b + c + d), where a is the number of true positives, b the
false positives, c the false negatives, and d the true negatives.
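As a quick worked example of this formula (the counts below are made up, not from
the project's confusion matrix):

```python
# Confusion-matrix counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives (illustrative values only).
a, b, c, d = 50, 3, 2, 45

# Accuracy = correct predictions / all predictions.
accuracy = (a + d) / (a + b + c + d)
print(accuracy)  # 0.95
```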
Algorithm Used
Support Vector Machine
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points or vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed a
Support Vector Machine. Consider two different categories separated by a decision
boundary or hyperplane: the SVM algorithm finds the points of each class that lie
closest to the boundary. These points are the support vectors. The distance between
the support vectors and the hyperplane is called the margin, and the goal of SVM is
to maximize this margin. The hyperplane with the maximum margin is called the
optimal hyperplane.
The dimensions of the hyperplane depend on the number of features in the dataset:
if there are 2 features, the hyperplane is a straight line; if there are 3 features,
the hyperplane is a 2-dimensional plane. We always create a hyperplane with the
maximum margin, i.e. the maximum distance to the nearest data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its
position are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by a single straight line, it is termed linearly
separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by a straight line, it is termed non-linear data, and
the classifier used is called a Non-linear SVM classifier.
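The two cases can be sketched with scikit-learn's `SVC` on synthetic data (a toy
illustration, not the project's dataset): well-separated blobs are handled by a
linear kernel, while interleaving "moons" need a non-linear (RBF) kernel.

```python
from sklearn import svm
from sklearn.datasets import make_blobs, make_moons

# Linearly separable case: two well-separated blobs, linear kernel suffices.
X_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_clf = svm.SVC(kernel="linear").fit(X_lin, y_lin)

# Non-linear case: interleaving half-moons cannot be split by a straight
# line, so the RBF kernel maps them into a separable space.
X_non, y_non = make_moons(n_samples=100, noise=0.1, random_state=0)
rbf_clf = svm.SVC(kernel="rbf").fit(X_non, y_non)

# The fitted support vectors are the extreme points defining each margin.
print(linear_clf.support_vectors_.shape)
print(rbf_clf.score(X_non, y_non))
```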
Data collection: A dataset with appropriate parameters such as Gender, the marks of
the different semesters (S1, S2, S3) and sessions (Session 1, Session 2), Overall
CGPA, and a class variable taking the values pass and fail.
Data Pre-processing: Put the acquired dataset into an organized format. Data
cleaning is the pre-processing method we chose; data cleaning routines attempt to
fill in missing values, smooth out noisy data and correct inconsistencies.
Missing values can be handled by:
Ignoring the tuple: this is usually done when the class label is missing.
Using a global constant to fill the missing value: replace all missing attribute
values with the same constant, such as a label like "Unknown" or NA. This method is
simple.
Using the attribute mean, or the attribute mean of all samples belonging to the same
class as the given tuple.
Noisy data can be handled by:
Binning: Binning methods smooth a sorted data value by consulting its
“neighbourhood”. The sorted values are distributed into a number of buckets or bins.
Since binning methods consult the neighbourhood of values, they perform local
smoothing.
Regression: Data can be smoothed by fitting the data to a function such as with
regression. Linear regression and Multiple Linear Regression can be used.
Clustering: Outliers may be detected by clustering, where similar values are
organized into groups or clusters. Intuitively, values that fall outside the set of
clusters may be considered outliers.
The dataset taken is already pre-processed, so pre-processing techniques are not
needed for it. For assurance, however, pre-processing steps for handling missing
values and duplicated values are applied.
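Those assurance checks can be sketched in pandas, using the mean-imputation rule
described above (the toy frame and its column names are assumptions for
illustration):

```python
import pandas as pd

# Toy frame with one missing CGPA and one exact duplicate record.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "F"],
    "S1_CGPA": [7.2, None, 8.5, 8.5],
    "Target": ["pass", "fail", "pass", "pass"],
})

# Fill a missing numeric value with the attribute mean (see rule above).
df["S1_CGPA"] = df["S1_CGPA"].fillna(df["S1_CGPA"].mean())

# Drop exact duplicate records.
df = df.drop_duplicates().reset_index(drop=True)

print(df.isna().sum().sum())   # no missing values remain
print(len(df))                 # one duplicate removed
```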
Training and Testing: The model was trained and then saved in .pkl format using
pickle. Testing is done by loading the saved model and performing predictions
through Python code. Accuracy comparison is made by splitting the dataset into
training and test data. After testing, a user interface was developed for prediction
and connected to the model using the Flask framework.
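The save-and-reload step can be sketched as follows. The synthetic data stands in
for the survey dataset, and `model.pkl` is an assumed filename:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Train a stand-in SVM on synthetic data (the real model uses the survey dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
model = SVC(kernel="linear").fit(X, y)

# Save the trained model in .pkl format using pickle.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (e.g. inside the Flask app), load the saved model and predict.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:1]))
```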
Technical Feasibility
Economic Feasibility
Operational Feasibility
3.6.1. Technical Feasibility
The various software used for the development of this application are the
following:
Python
Numpy
Google Colab
Github
Git is an open-source version control system that was started
by Linus Torvalds. Git is similar to other version control systems
such as Subversion, CVS, and Mercurial, to name a few. Version control
systems keep these revisions straight, storing the modifications in a
central repository. This allows developers to easily collaborate, as they
can download a new version of the software, make changes, and upload
the newest revision. Every developer can see these new changes,
download them, and contribute. Git is the preferred version control
system of most developers, since it has multiple advantages over the
other systems available: it stores file changes more efficiently and
ensures file integrity better.
The social networking aspect of GitHub is probably its most
powerful feature, doing more than almost any other feature offered to
help projects grow. Project revisions can be discussed publicly,
so a mass of experts can contribute knowledge and collaborate to
advance a project forward.
4. SYSTEM DESIGN
4.1. Model Building
4.1.1. Model Planning
The model is generated using the SVM algorithm, which gave high accuracy,
and is used for prediction. The accuracy comparison is made by splitting the
dataset into training and testing data: one portion is used for training the
model and the other for testing it. 70% of the dataset is used as training data
and the remaining 30% as testing data.
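The 70/30 split and accuracy measurement can be sketched with scikit-learn. The
synthetic data below is a stand-in for the survey dataset, so the printed accuracy
is illustrative, not the project's 96% figure:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the survey dataset: 600 records, 6 features,
# binary pass/fail label.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)

# 70% of the records train the model; the remaining 30% test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

model = SVC().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```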
4.1.3. Testing
6. MODEL DEPLOYMENT
This figure shows the user interface of the application. The
interface is very simple and easy to understand. There are 6 fields for
entering the user's details, and a drop-down list to select the gender.
A Predict button predicts the result. Validation of the numeric fields
is done in HTML, and to make all fields mandatory, validation is done
when values are taken from the form to the model.
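A minimal Flask sketch of this flow, under stated assumptions: the field names, the
`/predict` route, and the `predict_result` stub (a simple CGPA threshold standing in
for the pickled SVM) are all hypothetical, chosen only for illustration:

```python
from flask import Flask, request

app = Flask(__name__)

def predict_result(features):
    # Stand-in for model.predict on the pickled SVM: here a simple
    # threshold on overall CGPA plays the model's role (assumption).
    return "pass" if float(features["overall_cgpa"]) >= 5.0 else "fail"

@app.route("/predict", methods=["POST"])
def predict():
    # All six form fields are mandatory; reject the request otherwise.
    required = ["gender", "s1_cgpa", "s2_cgpa", "s3_cgpa",
                "session_marks", "overall_cgpa"]
    if any(field not in request.form for field in required):
        return "All fields are mandatory", 400
    return predict_result(request.form)

# app.run(debug=True)  # start the development server
```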
7. GIT HISTORY
8. CONCLUSIONS
9. FUTURE WORK
In this project, the prediction is not updated dynamically within the system's
source code. In future, a dynamic prediction model could be implemented by
retraining the prediction model whenever a new training set is fed into the system.
Moreover, the prediction could be offered for other courses as well. There is a
large number of institutions of higher education, and they operate in a very
complex and highly competitive environment. Predicting a student's academic
performance is one of the most important steps towards efficient education and a
university's profitability, especially for private universities fully funded by
tuition fees. It affects the modification of existing programs and the creation of
new ones. With accelerated IT development and lower prices, universities have
started to collect huge amounts of data about their students. These data can be
further analyzed with machine learning methods and techniques. A special application
of machine learning in the educational environment has emerged: an interdisciplinary
area that brings together techniques from statistics, artificial intelligence,
database systems, machine learning, pattern recognition, data visualization,
knowledge acquisition and information theory to find useful patterns and, thus, help
understand students' behavior and how they learn.
10. APPENDIX
10.1. Minimum Software Requirements
Software: Google Colab
11. REFERENCES