Semester-Project
12.01.2024
Literature Review:
In this project we were requested to perform a predictive analysis of SGPA and CGPA for
students in the fifth semester based on relevant features, treating it as a regression
problem. The initial phase involved meticulous data preprocessing steps, including loading
the dataset, addressing missing values, removing duplicates, and transforming categorical
variables. Exploratory data analysis (EDA) was performed using Seaborn and Sweetviz
libraries, generating insightful visualizations for data distribution, outliers, and
relationships.
Work-Breakdown Structure:
WBS Screenshot:
Methodology:
Data Preprocessing:
In data preprocessing, we first used the pandas library to read the given dataset into a DataFrame and displayed its contents.
We then inspected the data types and the mean of each numeric column.
Next we filled all the empty cells in the dataset; alternatively we could have dropped the columns with missing values, but we chose to fill them instead.
We selected features for training and testing by dropping irrelevant ones, and split the data into 70% training and 30% testing.
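The steps above can be sketched as follows. This is a minimal, self-contained illustration: the tiny inline DataFrame and its column names (`matric_pct`, `inter_pct`, `SGPA`) are assumptions standing in for the project's real dataset, which would be loaded with `pd.read_csv(...)`.

```python
# Minimal sketch of the preprocessing steps described above.
# The columns here are illustrative; the real dataset has many more features.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "matric_pct": [80.0, None, 75.0, 90.0, 80.0, 85.0, 70.0, 88.0, 92.0, 78.0],
    "inter_pct":  [70.0, 65.0, None, 85.0, 70.0, 80.0, 60.0, 82.0, 90.0, 72.0],
    "SGPA":       [3.2, 2.9, 3.0, 3.8, 3.2, 3.5, 2.7, 3.6, 3.9, 3.1],
})

print(df.dtypes)                            # data types of each column
print(df.mean(numeric_only=True))           # mean of each numeric column

df = df.drop_duplicates()                   # remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))  # fill empty cells instead of dropping

X = df.drop(columns=["SGPA"])               # features
y = df["SGPA"]                              # continuous target

# 70% training / 30% testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```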
AI Prediction Models:
This prediction task cannot be handled by classification models; it requires regression models, because our target variable is continuous, not discrete. We therefore used a Linear Regression model, Support Vector Regression (SVR), a Neural Network, an XGBoost Regressor, and a Random Forest Regressor, all of which are regression models.
1. Linear Regression:
First we import LinearRegression from scikit-learn.
Then we trained the model by fitting it on the training data (X_train and y_train) and used the trained model to make predictions on the test set (X_test).
After that we evaluated the model by calculating and printing metrics such as Mean Squared Error (MSE) and R-squared to assess its performance. Finally, we visualized the predictions by plotting the actual values against the predicted values using Matplotlib.
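The linear-regression workflow described above looks roughly like this. The two-feature synthetic dataset is an illustrative stand-in for the project's real X_train/X_test produced in preprocessing.

```python
# Sketch of the linear-regression workflow, on synthetic stand-in data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(50, 100, size=(200, 2))   # e.g. matric %, intermediate %
y = 0.02 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)               # train on the training data
y_pred = model.predict(X_test)            # predict on the test set

mse = mean_squared_error(y_test, y_pred)  # evaluation metrics
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")

# The report then plots actual vs. predicted with Matplotlib, e.g.:
# import matplotlib.pyplot as plt
# plt.scatter(y_test, y_pred); plt.xlabel("Actual"); plt.ylabel("Predicted")
```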
2. Support Vector Regression (SVR):
We imported SVR from scikit-learn and trained the model by fitting it on the training data (X_train and y_train). Then we used the trained SVR model to make predictions on the test set (X_test). Next we evaluated the model by calculating and printing Mean Squared Error (MSE) and R-squared to assess its performance, and visualized the results in a plot.
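The SVR step follows the same pattern; only the model class changes. Again the synthetic data is an assumption standing in for the project's real split.

```python
# Sketch of the SVR workflow, on synthetic stand-in data.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(50, 100, size=(200, 2))
y = 0.02 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = SVR(kernel="rbf")                 # default RBF-kernel SVR
model.fit(X_train, y_train)               # fit on the training data
y_pred = model.predict(X_test)            # predict on the test set

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")
```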
3. Neural Network:
We instantiate a StandardScaler object from scikit-learn, which is used to standardize the features: fit_transform fits the scaler on the training data (X_train) and transforms it, and transform then scales the test data (X_test) with the same scaler.
After that we instantiate a Multi-layer Perceptron (MLP) regressor from scikit-learn, with two hidden layers of 100 and 50 neurons, and train it on the standardized training data.
We set the maximum number of training iterations (epochs) via max_iter and fix random_state for reproducibility.
Then we calculated the mean squared error between the actual (y_test) and predicted values (y_pred), and accessed the training loss over epochs through the regressor's loss curve.
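In code, the scaling and MLP steps described above look roughly like this; the synthetic dataset and the max_iter value of 500 are illustrative assumptions, while the (100, 50) hidden-layer sizes match the report.

```python
# Sketch of the MLP workflow: scale, train, evaluate, inspect loss curve.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(50, 100, size=(200, 2))
y = 0.02 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data, then transform
X_test_s = scaler.transform(X_test)        # same scaler applied to the test data

mlp = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
mlp.fit(X_train_s, y_train)                # train on the standardized data
y_pred = mlp.predict(X_test_s)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
print("epochs recorded in loss curve:", len(mlp.loss_curve_))
```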
4. XGBoost Regressor:
In this model, we instantiate an XGBoost Regressor and fit it to the training set (X_train and y_train) using the fit method. After that we made predictions on the test set (X_test). Next we calculated the Mean Squared Error (MSE) and R-squared between the actual (y_test) and predicted values (y_pred) and plotted the actual against the predicted values.
5. Random Forest Regressor:
In the random forest regressor, we instantiate a Random Forest Regressor with 100 trees (n_estimators) and a fixed random seed (random_state). We then used the fit method to train the model on the training set (X_train and y_train) and used the trained model to make predictions on the test set (X_test). After that we calculated the Mean Squared Error (MSE) and R-squared between the actual (y_test) and predicted values (y_pred) and plotted the actual against the predicted values.
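The random-forest step can be sketched as below; the 100 trees and fixed seed match the report, while the synthetic data is an illustrative stand-in.

```python
# Sketch of the Random Forest workflow, on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(50, 100, size=(200, 2))
y = 0.02 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)               # train the 100-tree forest
y_pred = model.predict(X_test)            # predict on the test set

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")
```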
In conclusion, the Random Forest Regressor performed best, with a lower Mean Squared Error (MSE) and a higher R-squared than all the other models.
Exploratory Data Analysis:
The EDA process undertaken in this project leverages the Seaborn and Sweetviz libraries to gain insights into the dataset's characteristics.
Seaborn's box plots were generated for distinct subsets of columns within the one-hot
encoded DataFrame. These visualizations, organized into four blocks, each analyzing 20
columns, serve to illuminate the distribution of data and highlight potential outliers.
Analyzing the first 19 columns of a student data set using box plots revealed various characteristics.
The distribution of data was largely symmetrical, except for slight skewness in some opinion-based
columns. Central tendencies, represented by medians, differed significantly across variables,
indicating diverse typical values. The spread of data points varied similarly, with some columns
exhibiting wider ranges in responses than others. A few potential outliers existed, particularly in the
"surprise quizzes" category, hinting at extreme cases of stress and discouragement. These
observations suggested possible connections between family background and preferred learning
styles, diverse coping mechanisms for anxieties, and potentially strong reactions to surprise quizzes.
Specific Observations:
Visual Insights
This is the graphical user interface of the hosted website. The user can input the Matric and Intermediate percentages and the semester GPAs from the first through the fourth semester to get the predicted SGPA and CGPA of the fifth semester. All five models are integrated with the GUI, and based on these inputs each model predicts its corresponding output value.
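The way the GUI maps form inputs to a prediction can be sketched as below. Everything here is an illustrative assumption: the six input fields, the synthetic training data, and the use of a single Random Forest stand in for the site's five trained models.

```python
# Sketch: turning the six GUI inputs into a model prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic training data with the six assumed input fields:
# matric %, intermediate %, and GPAs of semesters 1-4.
percentages = rng.uniform(50, 100, (200, 2))
gpas = rng.uniform(2.0, 4.0, (200, 4))
X = np.hstack([percentages, gpas])
# Stand-in target: fifth-semester SGPA correlated with the past GPAs.
y = gpas.mean(axis=1) + rng.normal(0, 0.05, 200)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# One user's form input: matric %, inter %, GPA of semesters 1-4.
user_input = np.array([[85.0, 78.0, 3.1, 3.3, 3.4, 3.2]])
predicted_sgpa = model.predict(user_input)[0]
print(f"Predicted fifth-semester SGPA: {predicted_sgpa:.2f}")
```

In the hosted site, the same input vector would be passed to each of the five trained models in turn, giving one prediction per model.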
OUTPUT:
All the models predict the corresponding SGPA and CGPA respectively.
Conclusion:
In conclusion, the project successfully conducted a predictive analysis of SGPA and CGPA
for the fifth semester, employing various regression models such as Linear Regression,
Support Vector Regression, Neural Network, XGBoost Regressor, and Random Forest
Regressor. The comprehensive methodology, encompassing meticulous data
preprocessing, model training, and evaluation using metrics like Mean Squared Error and
R-squared, revealed that the Random Forest Regressor model outperformed others. The
inclusion of a user-friendly graphical interface enhances the project's practicality,
providing users with a seamless tool for predicting SGPA and CGPA based on their
educational details.
THE END