
DATA PREPROCESSING STEPS:

For the given dataset of grades achieved by different students, we performed the following
preprocessing on the data:

First of all, we displayed the dataset to get a full picture of it. Then we checked the columns
provided in the dataset and found its dimensions using the shape attribute.
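
A minimal sketch of this inspection step, assuming the data is loaded with pandas (the file name is an assumption, since the report does not include the code):

    import pandas as pd

    # Load the grades dataset (file name assumed for illustration)
    df = pd.read_csv("grades.csv")

    print(df)          # display the dataset to analyze the full picture
    print(df.columns)  # columns provided in the dataset
    print(df.shape)    # dimensions of the dataset: (rows, columns)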

The data was then inspected for null values in any column. We applied a method that returned the
number of null values in each column. To deal with the missing values, we needed an approach that
could handle all the null values efficiently without having any significant impact on data accuracy.
Hence, we computed the mode of each column and replaced that column's null values with its mode,
as sketched below.
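
Continuing the sketch above, this step might look as follows (isnull and fillna are standard pandas calls; the exact code used in the report is not shown, so this is an assumption):

    # Number of null values in each column
    print(df.isnull().sum())

    # Replace the null values in every column with that column's mode
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])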

Next, we looked for categorical data in the dataset and wrote a function that builds key-value pairs
for the categorical columns with respect to the features. We then searched for the unique values in
each grade column; the number of unique values was found to be 13. We assigned a GPA to each value
of the grade columns as follows:

A+ = 4.0, A = 4.0, A- = 3.7, B+ = 3.3, B = 3.0, B- = 2.7, C+ = 2.3, C = 2.0, C- = 1.7, D+ = 1.3,
D = 1.0, F = 0.0, WU = 0.0
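
Written as a Python dictionary, the mapping looks like this (the exact grade labels in the dataset are assumed to match these strings):

    # Grade-to-GPA mapping (13 unique grade values, taken from the report)
    grade_to_gpa = {
        "A+": 4.0, "A": 4.0, "A-": 3.7,
        "B+": 3.3, "B": 3.0, "B-": 2.7,
        "C+": 2.3, "C": 2.0, "C-": 1.7,
        "D+": 1.3, "D": 1.0,
        "F": 0.0, "WU": 0.0,
    }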

After that, we transformed all the grades in each column into their respective GPAs. Furthermore, the
seat number was removed from the dataset in order to obtain simpler data for making predictions. We
also implemented a function that retrieves all the courses taught up to a given year: with parameter 1
it returns only the first-year courses, with parameter 2 the courses of the first and second years, and
with parameter 3 the courses of the first, second and third years, as sketched below. The courses are
the features and the CGPA is set as the target of the model.
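
A sketch of these steps, continuing from the preprocessing above. The "Seat No." and "CGPA" column names, and the assumption that course codes such as "CS-105" encode the year in the first digit of their number, are illustrative guesses rather than details taken from the report:

    # Transform all grades into their respective GPAs
    df = df.replace(grade_to_gpa)

    # Drop the seat number, since it carries no predictive information
    df = df.drop(columns=["Seat No."])  # column name assumed

    def courses_upto_year(df, year):
        """Return the course columns taught up to the given year (1, 2 or 3).

        Assumes course codes such as 'CS-105', where the first digit of the
        numeric part encodes the year in which the course is taught.
        """
        selected = []
        for col in df.columns:
            if col == "CGPA":
                continue
            digits = "".join(ch for ch in col if ch.isdigit())
            if digits and int(digits[0]) <= year:
                selected.append(col)
        return selected

    # Courses are the features, the final CGPA is the target
    features = courses_upto_year(df, 1)  # pass 1, 2 or 3
    X = df[features]
    y = df["CGPA"]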

MODEL AND ALGORITHM:


We have applied two machine learning algorithms to predict the final CGPA of a student at the end of
the fourth year with the help of the GPAs obtained in the courses of the 1st, 2nd and 3rd years.

MODELS USED:
Model 1: predicts the final CGPA based on the GPAs of the first year only.

Model 2: predicts the final CGPA based on the GPAs of the first two years.

Model 3: predicts the final CGPA based on the GPAs of the first three years.

ALGORITHMS USED:
We have implemented a Linear Regression model and a KNN Regressor model for predicting the final CGPAs.
1. Linear Regression:
Linear regression is the most commonly used model for predictive analysis of continuous data. It
attempts to model the relationship between two variables by fitting a linear equation to observed data.
One variable is considered to be an explanatory variable, and the other is considered to be a dependent
variable.
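
In equation form, with the course GPAs as explanatory variables x_1, ..., x_p and the final CGPA as the dependent variable y, the fitted relationship is

    y = β_0 + β_1·x_1 + β_2·x_2 + ... + β_p·x_p + ε

where the coefficients β_i are estimated by least squares (a standard formulation, not taken verbatim from the report).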

First of all, we split the data into 70% training data and 30% test data, and then applied the Linear
Regression model to the training data, as sketched below.
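
A minimal sketch of this split and fit, assuming scikit-learn and the X and y defined in the preprocessing sketch (the report does not name the library, so this is an assumption):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # 70% of the rows for training, 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Fit the linear regression model on the training data
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)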

For Model 1:

The training accuracy for the first-year model was found to be 84.23%. We then verified that there were
no NaN values in the test data and proceeded with its prediction, as sketched below. The mean squared
error was calculated to be 6% and the test accuracy was 81%.
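
The evaluation could look like the sketch below; here "accuracy" is interpreted as the R² score returned by scikit-learn's score method, which is our assumption since the report does not say how accuracy is defined for regression:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # Training "accuracy" (R^2 score on the training split)
    print("train score:", lin_reg.score(X_train, y_train))

    # Verify that the test split contains no NaN values before predicting
    assert not np.isnan(X_test.to_numpy()).any()

    y_pred = lin_reg.predict(X_test)
    print("test MSE:  ", mean_squared_error(y_test, y_pred))
    print("test score:", lin_reg.score(X_test, y_test))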

For Model 2:

The 2nd model was handled in the same way. The training accuracy came out to 90%, while the test data
gave a mean squared error of 3% and an accuracy of 92%.

For Model 3:

The training accuracy was found to be 92.568%. The test data for this model gave the lowest mean
squared error, 1%, and an accuracy of 97%, which is the best among all three models.

2. KNN Regressor:
KNN regression is a non-parametric method that, in an intuitive manner, approximates the association
between independent variables and the continuous outcome by averaging the observations in the same
neighborhood.
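
The idea can be illustrated with a small NumPy sketch; the numbers below are made up purely for illustration and are not from the dataset:

    import numpy as np

    # Toy data: one explanatory variable (e.g. a first-year GPA) and the
    # continuous outcome (e.g. the final CGPA); values are invented.
    X_toy = np.array([[2.0], [2.5], [3.0], [3.5], [4.0]])
    y_toy = np.array([2.1, 2.6, 3.1, 3.4, 3.9])

    def knn_predict(x_query, X, y, k=3):
        # Distance from the query point to every training point
        distances = np.linalg.norm(X - x_query, axis=1)
        # Indices of the k nearest neighbours
        nearest = np.argsort(distances)[:k]
        # The KNN regression estimate is the average of their outcomes
        return y[nearest].mean()

    print(knn_predict(np.array([3.2]), X_toy, y_toy, k=3))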

For Model 1:

We first calculated the training accuracy, which came out as 84.64%. For the test data, the prediction
gave a mean squared error of 6.99% and an accuracy of 79%. We also handled the case of a single input
provided by the user: the input is first reshaped with NumPy into a two-dimensional array (one column,
with the number of rows inferred by NumPy) and then passed on for prediction, as sketched below.
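
A sketch of the KNN regressor on the same 70/30 split, again assuming scikit-learn. Note that scikit-learn's predict expects a 2-D array of shape (n_samples, n_features), so the sketch reshapes a single student's GPAs into one row; the report describes the reshape as one column with the rows inferred, so treat the exact call below as our interpretation. The value of k is also an assumption:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Fit the KNN regressor on the same training split as before
    knn = KNeighborsRegressor(n_neighbors=5)  # k = 5 is an assumed value
    knn.fit(X_train, y_train)

    print("train score:", knn.score(X_train, y_train))
    print("test score: ", knn.score(X_test, y_test))

    # Handling a single input from the user: reshape one student's course
    # GPAs into a 2-D array before prediction
    single_input = np.asarray(X_test.iloc[0])
    print("predicted CGPA:", knn.predict(single_input.reshape(1, -1))[0])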

For Model 2:

The 2nd model was handled in a similar way. The training accuracy was obtained as 87.7%, while the
mean squared error and accuracy of the test data were 3% and 91% respectively.

For Model 3:

The training accuracy for this model was 89.3%. The test data showed an accuracy of 95% with a mean
squared error of 1.5%.

GRAPHICAL COMPARISON OF MODELS:
FOR LINEAR REGRESSION:

[Figure: Comparison of the accuracy percentage of the three models for Linear Regression. Test accuracy: Model 1 = 81%, Model 2 = 92%, Model 3 = 97%.]

FOR KNN REGRESSOR:

[Figure: Comparison of the accuracy percentage of the three models for KNN Regressor. Test accuracy: Model 1 = 79%, Model 2 = 91%, Model 3 = 95%.]


PERFORMANCE OF MACHINE LEARNING SYSTEMS:
LINEAR REGRESSION:
The accuracy achieved by implementing linear regression is quite good in all three cases.

For model 1, the difference between the training and test accuracies is 3.23%. For model 2, the
difference is 2%, and for model 3 it is 4.432%, which is quite acceptable. Moreover, the accuracy of the
first model lies in the range of 80% to 90%, whereas the accuracies of models 2 and 3 are at or above
90% for both training and test data.

Based on all these values, we can state that our models are a good fit.

KNN REGRESSOR:
The KNN regressor also provided high accuracies. In this case, model 1 showed a difference of 5.64%
between the training and test accuracies. For model 2, the difference improved to 3.3%, while model 3
had a difference of 5.7%, which is still acceptable. The accuracies of all the models for both training
and test data are mostly above 80%, which is a very good result.

Hence, based on this analysis of the accuracies, we can say that our models are a good fit.

The dataset provided to us contained the records of 571 students, which is not a very large number.
One reason for the high accuracies could be the small size of the dataset as well as the simplification
of the data. But irrespective of the reason, the accuracy of the predictions is quite impressive.
