MACHINE LEARNING
Submitted by
BALAJI N R 113219031022
SUDARSAN M S 113219031147
VIGNESH K 113219031158
VIGNESH RAJ V 113219031159
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
2021-2022
VELAMMAL ENGINEERING COLLEGE
CHENNAI -66
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
1 Balaji N R
2 Sudarsan M S
3 Vignesh K
4 Vignesh Raj V
Student Performance Analysis Using Machine Learning
Mrs. Lovelit Jose
This report of internship work, submitted by the above students in partial fulfillment
of the requirements for the award of the Bachelor of Computer Science and Engineering
degree at Anna University, was evaluated and confirmed to be a report of the work done
by the above students, and was then assessed.
1. INTRODUCTION
1.1 EXISTING SYSTEM
1.1.1 DISADVANTAGES
1.2 PROPOSED SOLUTION
1.2.1 ADVANTAGES
1.2.2 SYSTEM ARCHITECTURE
2. MODULES
2.1 MODULES
2.1.1 DATA COLLECTION
2.1.2 DATA PRE-PROCESSING
2.1.2.1 FORMATTING
2.1.2.2 CLEANING
2.1.2.3 SAMPLING
2.1.3 FEATURE EXTRACTION
2.1.4 EVALUATION MODEL
3. DATA FLOW DIAGRAM
3.1 DATA FLOW DIAGRAM
3.2 UML DIAGRAM
3.3 CLASS DIAGRAM
3.4 SEQUENCE DIAGRAM
3.5 ACTIVITY DIAGRAM
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING
4.1.1 SUPERVISED LEARNING
4.1.1.1 DEFINITION
4.1.1.2 ALGORITHM
4.1.2 UNSUPERVISED LEARNING
4.1.2.1 DEFINITION
4.1.2.2 ALGORITHM
4.1.3 REINFORCEMENT LEARNING
4.1.3.1 DEFINITION
5. REQUIREMENTS
5.1 SOFTWARE REQUIREMENTS
5.2 PYTHON LIBRARIES
REFERENCES
ACKNOWLEDGEMENT
Performance analysis in outcome-based learning is a system that strives for
excellence at different levels and across diverse dimensions of the students'
interests.
The proposed framework analyzes the students' demographic data, study-related
attributes and psychological characteristics to extract all possible knowledge from
students, teachers and parents.
The framework succeeds in highlighting each student's weak points and in providing
appropriate recommendations. A realistic case study conducted on 200 students
shows that the proposed framework performs better than the existing ones.
1. INTRODUCTION
1.1 Existing System:
Previous predictive models focused only on the student's demographic data, such as
gender, age, family status, family income and qualifications, together with
study-related attributes such as homework, study hours, and previous achievements
and grades. These earlier works were limited to predicting academic success or
failure, without explaining the reasons behind the prediction. Most previous studies
gathered more than 40 attributes in their data sets to predict the student's academic
performance. These attributes came from the same data category (demographic,
study-related, or both), which led to a lack of diversity in the prediction rules.
1.1.1 Disadvantages:
• As a result, the generated rules did not fully extract the knowledge about the
reasons behind student dropout.
• Apart from the work mentioned above, there were earlier statistical analysis
models from the perspective of educational psychology that conducted a couple
of studies to examine the correlation between mental health and academic
performance.
• The recommendations were too brief and did not explain the methodologies
needed to apply them.
1.2 Proposed System:
The proposed framework firstly focuses on merging the demographic and study
related attributes with the educational psychology fields, by adding the student’s
psychological characteristics to the previously used data set (i.e., the students’
demographic data and study related ones). After surveying the previously used
factors for predicting the student’s academic performance, we picked the most
relevant attributes based on their rationale and correlation with the academic
performance.
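The correlation-based part of this attribute selection can be sketched as follows. This is only a minimal illustration, not the procedure actually used in this work: the toy records, the column names (study_hours, absences, shoe_size, final_grade) and the 0.5 threshold are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy records; the column names are illustrative only.
records = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10, 12],
    "absences": [9, 8, 6, 5, 2, 1],
    "shoe_size": [40, 42, 38, 43, 39, 41],
    "final_grade": [45, 52, 61, 70, 83, 90],
})

# Pearson correlation of every attribute with the target column.
correlations = records.corr()["final_grade"].drop("final_grade")

# Keep attributes whose absolute correlation exceeds an assumed threshold.
THRESHOLD = 0.5
relevant = correlations[correlations.abs() > THRESHOLD].index.tolist()

print(correlations.round(2))
print("Relevant attributes:", relevant)
```

The absolute value is used so that strongly negative relationships (here, absences) are kept as well as strongly positive ones.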
1.2.1 Advantage:
The proposal aims to analyze the student's demographic data, study-related details
and psychological characteristics in terms of the final state, to determine whether
the student is on the right track, struggling, or even failing. In addition, the
proposed model is compared extensively with previous related models.
1.2.2 System Architecture:
[Figure: system architecture diagram. Blocks: dataset, data pre-processing, feature
extraction, ML algorithms, machine learning model, data classification (classifier),
result]
2. MODULES
2.1 Modules
• DATA COLLECTION
• DATA PRE-PROCESSING
• FEATURE EXTRACTION
• EVALUATION MODEL
2.1.1 DATA COLLECTION
The data used in this report is a set of student details from school records. This step
is concerned with selecting the subset of all available data that you will be working with.
ML problems start with data, preferably lots of data (examples or observations) for
which you already know the target answer. Data for which you already know the target
answer is called labelled data.
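As a small illustration of labelled data, the sketch below separates the input features from the known target answer. The records and the column names (study_hours, attendance_pct, result) are hypothetical and are not taken from the school records used in this work.

```python
import pandas as pd

# Hypothetical labelled records: "result" is the already-known target answer.
labelled = pd.DataFrame({
    "study_hours": [3, 7, 1, 9],
    "attendance_pct": [70, 92, 55, 98],
    "result": ["fail", "pass", "fail", "pass"],
})

X = labelled.drop(columns=["result"])  # input features (observations)
y = labelled["result"]                 # labels (target answers)

print(X.shape, y.tolist())
```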
2.1.2 DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling it.
Three common data pre-processing steps are:
2.1.2.1 Formatting
2.1.2.2 Cleaning
2.1.2.3 Sampling
2.1.2.1 Formatting:
The data you have selected may not be in a format that is suitable for you to
work with. The data may be in a relational database and you would like it in a
flat file, or the data may be in a proprietary file format and you would like it in
a relational database or a text file.
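A minimal sketch of the first case, moving a relational table into a flat CSV layout, is given here. It assumes an SQLite database and pandas; the table name students and its columns are hypothetical, and an in-memory database is created only so that the example is self-contained.

```python
import io
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, math_score INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "A", 72), (2, "B", 65)],
)
conn.commit()

# Read the relational table into a DataFrame.
students = pd.read_sql_query("SELECT * FROM students ORDER BY id", conn)
conn.close()

# Write the same rows out in flat CSV form (to a string buffer here).
buffer = io.StringIO()
students.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to to_csv instead of the buffer would write the flat file to disk.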
2.1.2.2 Cleaning:
Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to
address the problem. These instances may need to be removed. Additionally,
there may be sensitive information in some of the attributes, and these attributes
may need to be anonymized or removed from the data entirely.
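The cleaning options described above can be sketched with pandas as follows. The data and column names are hypothetical, and filling gaps with the column median is just one assumed way of "fixing" missing values, not necessarily the one used in this work.

```python
import pandas as pd

# Hypothetical raw records with gaps and one sensitive attribute.
raw = pd.DataFrame({
    "student_name": ["A", "B", "C", "D"],  # sensitive attribute
    "math_score": [72, None, 65, 88],
    "reading_score": [70, 81, None, 90],
})

# Option 1: remove the instances that are incomplete.
complete_rows = raw.dropna()

# Option 2: fix missing values by filling each numeric column with its median.
filled = raw.fillna(raw.median(numeric_only=True))

# Remove the sensitive attribute from the data entirely.
anonymized = filled.drop(columns=["student_name"])

print(len(complete_rows), "complete rows")
print(anonymized)
```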
2.1.2.3 Sampling:
There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller
representative sample of the selected data that may be much faster for exploring
and prototyping solutions before considering the whole dataset.
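A simple random sample is one way to obtain such a smaller subset. The sketch below assumes pandas; the 1000-row table and the 10% fraction are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical full data set of 1000 records.
full = pd.DataFrame({
    "student_id": range(1000),
    "score": [i % 101 for i in range(1000)],
})

# Draw a 10% simple random sample without replacement.
# random_state fixes the seed so the sample is reproducible.
sample = full.sample(frac=0.1, random_state=42)

print(len(full), "rows in full data,", len(sample), "rows in sample")
```

Simple random sampling does not guarantee that every class keeps its original proportion; a stratified sample may be preferable when some classes are rare.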
data in the form of graphs. Accuracy is defined as the percentage of correct
predictions for the test data. It can be calculated easily by dividing the number
of correct predictions by the number of total predictions.
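The accuracy definition above translates directly into code. The following is a small sketch in plain Python; the example labels are made up and do not come from the test data of this work.

```python
def accuracy(y_true, y_pred):
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Made-up labels: 3 of the 5 predictions are correct.
actual = ["high", "low", "average", "high", "low"]
predicted = ["high", "low", "low", "high", "average"]
score = accuracy(actual, predicted)
print(score)
```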
3. DATA FLOW DIAGRAM
3.1 DATA FLOW DIAGRAM
LEVEL 0
[Figure: Dataset collection; Pre-processing; Random selection; Trained & testing dataset]
LEVEL 1
[Figure: Dataset collection; Pre-processing; Feature extraction; Apply algorithm]
LEVEL 2
[Figure: EDA; Classify the dataset; Train algorithm; Predict result]
3.2 UML DIAGRAM
USE CASE DIAGRAM
3.3 CLASS DIAGRAM
3.4 SEQUENCE DIAGRAM
3.5 ACTIVITY DIAGRAM
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING
4.1.1 SUPERVISED LEARNING
4.1.1.1 Definition
An algorithm uses training data and feedback from humans to learn the relationship between
given inputs and a given output. For instance, a practitioner can use marketing expense and
weather forecast as input data to predict the sales of cans.
You can use supervised learning when the output is known for the training data. The trained
algorithm then predicts the output for new data.
There are two categories of supervised learning:
Classification task.
Regression task.
Classification
Imagine you want to predict the gender of a customer for a commercial. You start by
gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers, and in this database
it can only be male or female. The objective of the classifier is to assign a probability
of being male or female (i.e., the label) based on the information (i.e., the features)
you have collected. Once the model has learned how to recognize male or female, you can
use new data to make a prediction. For instance, you have just received new information
from an unknown customer, and you want to know whether that customer is male or female.
If the classifier predicts male = 70%, the algorithm is 70% sure that this customer is
male and 30% sure that the customer is female.
The label can have two or more classes. The above example has only two classes, but
a classifier that needs to predict objects may have dozens of classes (e.g., glass,
table, shoes, etc.; each object represents a class).
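The probability output described above can be sketched with scikit-learn. This is only an illustration under stated assumptions: logistic regression is chosen here as one possible classifier, only two features (height and weight) are used, and the customer data is synthetic rather than taken from a real database.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic customers: columns are height (cm) and weight (kg).
rng = np.random.default_rng(0)
female = rng.normal(loc=[160.0, 55.0], scale=5.0, size=(50, 2))
male = rng.normal(loc=[178.0, 80.0], scale=5.0, size=(50, 2))
X = np.vstack([female, male])
y = np.array(["female"] * 50 + ["male"] * 50)

# Train the classifier on the labelled customers.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Estimate class probabilities for a new, unknown customer.
new_customer = np.array([[170.0, 68.0]])
proba = clf.predict_proba(new_customer)[0]
for label, p in zip(clf.classes_, proba):
    print(f"{label}: {p:.0%}")
```

The probabilities returned for one customer always sum to 1, so a 70% score for one class implies 30% for the other in the two-class case.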
Regression
When the output is a continuous value, the task is a regression. For instance, a financial
analyst may need to forecast the value of a stock based on a range of features such as
equity, previous stock performance and macroeconomic indices. The system is trained to
estimate the price of the stock with the lowest possible error.
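A minimal regression sketch is given below. It assumes scikit-learn's linear regression and mean squared error as the error measure; the two input features and the coefficients used to generate the synthetic target are arbitrary and carry no financial meaning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: a continuous target built from two features plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=100)

# Fit the model by minimizing the squared error on the training data.
model = LinearRegression().fit(X, y)
mse = mean_squared_error(y, model.predict(X))

print("coefficients:", model.coef_.round(2))
print("intercept:", round(float(model.intercept_), 2))
print("training MSE:", round(float(mse), 3))
```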
4.1.1.2 Algorithms:

Linear regression
  Description: Finds a way to correlate each feature to the output to help predict
  future values.
  Type: Regression

Naive Bayes
  Description: The Bayesian method is a classification method that makes use of
  Bayes' theorem. The theorem updates the prior knowledge of an event with the
  independent probability of each feature that can affect the event.
  Type: Regression, Classification

Support vector machine
  Description: Support Vector Machine, or SVM, is typically used for the
  classification task. The SVM algorithm finds a hyperplane that optimally divides
  the classes. It is best used with a non-linear solver.
  Type: Regression (not very common), Classification

Random forest
  Description: The algorithm is built upon decision trees to improve the accuracy
  drastically. Random forest generates many simple decision trees and uses the
  'majority vote' method to decide which label to return. For the classification
  task, the final prediction is the one with the most votes; for the regression
  task, the final prediction is the average prediction of all the trees.
  Type: Regression, Classification
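As an illustration of the random forest entry, the sketch below trains a forest on a synthetic three-class problem. The data set, the number of trees and the train/test split are assumptions for the example. Note that scikit-learn's RandomForestClassifier combines the trees by averaging their predicted class probabilities, which is a soft variant of the majority vote described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for labelled records.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train 100 decision trees, each on a bootstrap sample of the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Fraction of correct predictions on the held-out test data.
test_accuracy = forest.score(X_test, y_test)
print(f"trees: {len(forest.estimators_)}, test accuracy: {test_accuracy:.2f}")
```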
4.1.2 UNSUPERVISED LEARNING
4.1.2.1 Definition
In unsupervised learning, an algorithm explores input data without being given an explicit
output variable (e.g., it explores customer demographic data to identify patterns).
You can use it when you do not know how to classify the data and you want the algorithm
to find patterns and group the data for you.
4.1.2.2 Algorithm:
K-means clustering
  Description: Puts data into some number of groups (k), each containing data with
  similar characteristics (as determined by the model, not in advance by humans).
  Can be used, for example, to cluster loyalty-card customers.
  Type: Clustering

PCA / t-SNE
  Description: Mostly used to decrease the dimensionality of the data. The
  algorithms reduce the number of features to 3 or 4 vectors with the highest
  variances.
  Type: Dimension reduction
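Both kinds of unsupervised algorithm can be sketched with scikit-learn. The example below is illustrative only: the data is synthetic (three well-separated groups in five dimensions), and the choices of k = 3 clusters and 2 principal components are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic, unlabelled data: three groups of 40 points in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([rng.normal(c, 1.0, size=(40, 5)) for c in centers])

# K-means: assign every point to one of k = 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# PCA: project the 5 features onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("cluster sizes:", np.bincount(labels))
print("reduced shape:", X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```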
4.1.3.1 Definition
5. REQUIREMENTS
5.2 PYTHON LIBRARIES
o NumPy
o Pandas
o Matplotlib
o scikit-learn (sklearn)
o Seaborn
6. SOURCE CODE
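# Notebook cells for a first exploration of the data set (run in Jupyter).
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from pandas import Series

# IPython magic: render plots inline. Valid only inside a Jupyter notebook.
%matplotlib inline

warnings.filterwarnings("ignore")  # hide library warnings in the notebook output

# Load the data set and preview the first 10 rows.
df = pd.read_csv("StudentsPerformance.csv")
df.head(10)

df.shape    # (number of rows, number of columns)
df.columns  # column names

# Frequency of each category in the selected columns.
df['gender'].value_counts()
df['test preparation course'].value_counts()
a = df['race/ethnicity'].value_counts()

# Summary statistics of the numeric columns.
df.describe()
Fig 6.7 Description
# Summary statistics of the categorical (object) columns.
df.describe(include=['O'])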
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION:
Analyzing student performance is an important problem that needs to be addressed.
The work reported in this thesis applies machine learning techniques with supervised
learning algorithms to student records in order to understand how the algorithms
perform. We analyze the performance of each student and categorize it into three
classes (high, average and low) with an accuracy of 79%.
In the future, we intend to provide technical solutions that help improve student
performance. A user interaction model could be developed so that student records can
be entered dynamically and staff receive an alert message about students with low
performance. The prediction could also be built using a neural network, which may
give improved results. Non-academic attributes could be added alongside the
academic attributes.
REFERENCES:
1. https://iopscience.iop.org/article/10.1088/1757-899X/1055/1/012122/pdf
2. https://www.ijser.org/researchpaper/Students-Performance-Analysis-Using-Machine-Learning-Tools.pdf
3. https://www.activestate.com/resources/quick-reads/what-is-pandas-in-python-everything-you-need-to-know
4. https://scikit-learn.org/stable
5. https://www.codecademy.com/article/scikit-learn
6. https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python