Final Project - Report

Bablu Banik
Jerridan Bonbright-Schultz
Javier Gonzalez
Trinidad Ramirez
Professor Ergezer
CST 383 - Introduction to Data Science
25 February 2022
Final Project - Alcohol Consumption Among Students & How it Affects their Academic Performance
Introduction
For this team project, we analyzed alcohol consumption among students and looked at
various factors that contribute to a student's tendency to consume alcohol. We obtained our
data from Kaggle’s Student Alcohol Consumption dataset.
Our goal for this project was to predict:
1. The level of alcohol consumption among students.
2. The factors affecting a student's grades.
3. The student's grade based on the features available to us.
Because of alcohol’s negative effects on cognition, we hypothesized a negative
correlation between alcohol consumption and students’ academic performance. In the
process, we attempted to gain insight into the leading causes of student alcohol
consumption and the factors contributing to a student’s final grade.
Selection of Data
This project uses a dataset on student alcohol consumption retrieved from Kaggle. The
data comes from students of Math and Portuguese courses, which are stored in two separate
files. For our purposes, we used only the data from the math class. This dataset
includes a number of interesting fields ranging from light demographic information to social
characteristics relating to students’ personal lives, such as the amount of free time and quality of
family relationships. The data contained no null values, which eliminated much of the need for
preprocessing. Some approaches required feature engineering, such as converting
categorical columns to numeric values.
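The conversion of categorical columns to numeric values can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing code: the three-row sample is invented, and the column names (sex, Pstatus, schoolsup, G3) are assumed to match the Kaggle math file.

```python
import pandas as pd

# ASSUMPTION: hypothetical three-row sample; the column names are assumed
# to follow the Kaggle math-class file.
students = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Pstatus": ["T", "A", "T"],        # parents living together (T) or apart (A)
    "schoolsup": ["yes", "no", "no"],  # extra educational support
    "G3": [10, 6, 15],                 # final grade
})

# Map a yes/no column to 1/0.
students["schoolsup"] = students["schoolsup"].map({"yes": 1, "no": 0})

# One-hot encode the remaining categorical columns. This produces
# indicator columns such as Pstatus_T.
encoded = pd.get_dummies(students, columns=["sex", "Pstatus"], dtype=int)

# Confirm the encoded sample still has no null values.
assert encoded.isnull().sum().sum() == 0
print(encoded.columns.tolist())
```

One-hot encoding is one reasonable choice here because categories such as parental status have no natural ordering; a binary yes/no column can simply be mapped to 1 and 0.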
Methods
Tools:
● NumPy, Pandas, Matplotlib, and Seaborn for data exploration and visualization
● SciPy library for zscore
● Scikit-learn as the ML library
● GitHub for version control and submission
● VS Code, Jupyter Notebook, and Google Colab were used as IDEs
● Graphviz was used to construct a visualization of the DecisionTreeRegressor model
Inference methods used with Scikit-learn:
● ML Models: KNeighborsRegressor, LinearRegression, & DecisionTreeRegressor
● Utilities: export_graphviz, StandardScaler, train_test_split
Results
K Nearest Neighbors (KNN):
Kaggle’s Student Alcohol Consumption dataset supports several approaches for predicting
student outcomes and intervening before problems arise. For example, daily and weekly
alcohol consumption can be used to predict whether a student will pass or fail a future
exam. The final grade (G3) data can be divided into two groups: a grade less than or equal
to 5 corresponds to a “fail” status and the rest correspond to a “pass” status. Using the
KNN regression model and analyzing daily and weekly alcohol consumption, we can predict
whether a student may fail in the future. The main intention is to identify vulnerable
students, coach them better, help them reduce alcohol consumption, and finally enable them
to succeed in future exams. This model predicts well, with some anomalies that will be
addressed in future iterations.
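The pass/fail approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's notebook code: the data is synthetic, the two predictors stand in for daily and weekly alcohol consumption on a 1 to 5 scale, and the choice of 5 neighbors is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# ASSUMPTION: synthetic stand-ins for daily and weekly alcohol consumption (1-5).
X = rng.integers(1, 6, size=(n, 2)).astype(float)
# ASSUMPTION: synthetic final grade on a 0-20 scale, made up for illustration only.
g3 = np.clip(14 - 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 2, n), 0, 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, g3, test_size=0.3, random_state=0
)

# Scale using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
predicted_g3 = knn.predict(scaler.transform(X_test))

# A predicted grade of 5 or less is labeled "fail", anything higher "pass".
PASS_THRESHOLD = 5
predicted_status = np.where(predicted_g3 <= PASS_THRESHOLD, "fail", "pass")
print(dict(zip(*np.unique(predicted_status, return_counts=True))))
```

Because the model is a regressor, the pass/fail label is derived by thresholding the predicted grade; a classifier trained directly on the two labels would be an alternative design.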
Linear Regression:
For the Linear Regression approach, we used a number of predictor variables that
seemed significant in determining the student’s grade. These included the parents’
education, study time, overall alcohol consumption, and how often the student goes out.
The data was split into training and test sets and scaled before being fit to the linear
regression model. This approach did not yield promising results for predicting the
student’s final grade accurately. A key indicator of this was the root mean squared error
(RMSE) of over 4.
As a test, the same linear regression model was used to predict the number of failures a
student had acquired. For this experiment, the predictors were changed to include new columns
such as the course grades. This resulted in a model with a root mean squared error of roughly
0.7, which is more acceptable for making accurate predictions. This led us to believe that a
linear regression model may be poorly suited to our purpose, while something like KNN may
lead to predictions of higher accuracy.
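The split, scale, fit, and evaluate workflow described in this section can be sketched as follows. The data is synthetic and the five columns are only placeholders for predictors such as parental education, study time, alcohol consumption, and going out, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# ASSUMPTION: five synthetic predictors on a 1-5 scale (placeholders only).
X = rng.integers(1, 6, size=(n, 5)).astype(float)
# ASSUMPTION: synthetic 0-20 grade with substantial noise, for illustration only.
y = np.clip(
    10 + 0.5 * X[:, 0] + 0.4 * X[:, 2] - 0.3 * X[:, 3] + rng.normal(0, 4, n),
    0,
    20,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit the scaler on the training split, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)

# RMSE is the square root of the mean squared prediction error.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))

# Baseline for comparison: always predict the mean training grade.
baseline = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline))

print(f"model RMSE: {rmse:.2f}, mean-baseline RMSE: {baseline_rmse:.2f}")
```

Comparing the model's RMSE against a mean-only baseline is one way to judge whether an error such as 4 on a 0 to 20 scale reflects any real predictive signal.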
Decision Trees:
After reviewing some of the data and visualizations, we started by choosing a few
features such as 'failures', 'famrel', and 'goout' to see how well they would predict students’
weekday and weekend alcohol consumption in our regression tree model. After obtaining a
base RMSE with these predictors, we expected that including more predictors pertaining to
the student’s family background (e.g. 'Pstatus_T', 'Medu', 'Fedu') and increasing the tree
depth from 2 to 5 would lower the model’s error and give a better prediction of students’
weekday and weekend alcohol consumption. Unfortunately, it did not. Instead, two features,
1) going out with friends (‘goout’) and 2) free time after school (‘freetime’), yielded the
best prediction results for weekday and weekend alcohol consumption.
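The depth comparison described above can be sketched as follows. This is a rough illustration with synthetic data: the three predictors stand in for 'goout', 'freetime', and 'famrel', and the target stands in for a 1 to 5 alcohol consumption score, so the printed errors are not the project's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 300

# ASSUMPTION: synthetic 1-5 stand-ins for 'goout', 'freetime', and 'famrel'.
goout = rng.integers(1, 6, n)
freetime = rng.integers(1, 6, n)
famrel = rng.integers(1, 6, n)
# ASSUMPTION: synthetic 1-5 alcohol consumption target, for illustration only.
alcohol = np.clip(
    np.round(0.6 * goout + 0.2 * freetime + rng.normal(0, 1, n)), 1, 5
)

X = np.column_stack([goout, freetime, famrel])
X_train, X_test, y_train, y_test = train_test_split(
    X, alcohol, test_size=0.3, random_state=2
)

# Fit one tree per depth and record the test RMSE for each.
rmse_by_depth = {}
for depth in (2, 5):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    predictions = tree.predict(X_test)
    rmse_by_depth[depth] = np.sqrt(mean_squared_error(y_test, predictions))

for depth, rmse in rmse_by_depth.items():
    print(f"max_depth={depth}: test RMSE = {rmse:.3f}")
```

A deeper tree fits the training data more closely but does not necessarily lower the test error, which is consistent with the outcome reported above.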

Next, we focused on weekday and weekend alcohol consumption and how it affected
students’ final grade (‘G3’). Surprisingly, alcohol consumption was not the best predictor of a
student’s final grade. According to the Decision Tree Regression model, the first and second
period grades (‘G1’ & ‘G2’ respectively) were the best predictors for the final grade target.
Of all the factors contributing to a student’s final grade, previous grade performance
provided the highest level of accuracy in predicting academic performance.
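One way to inspect which predictors a regression tree relies on is its feature_importances_ attribute, and export_graphviz can produce a DOT description of the tree for Graphviz. The sketch below uses synthetic data in which the final grade is deliberately built from the second period grade, so the ranking it prints is true by construction and is not evidence about the real dataset. The names Dalc and Walc are assumed labels for weekday and weekend alcohol consumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

rng = np.random.default_rng(3)
n = 300
feature_names = ["G1", "G2", "Dalc", "Walc"]

# ASSUMPTION: synthetic data for illustration only.
g1 = rng.integers(3, 20, n).astype(float)
g2 = np.clip(g1 + rng.normal(0, 1.5, n), 0, 20)
dalc = rng.integers(1, 6, n).astype(float)
walc = rng.integers(1, 6, n).astype(float)
# ASSUMPTION: G3 is constructed from G2 plus noise, so grades dominate by design.
g3 = np.clip(g2 + rng.normal(0, 1.0, n), 0, 20)

X = np.column_stack([g1, g2, dalc, walc])
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, g3)

# Importances sum to 1; higher values mean the feature drove more of the splits.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")

# With out_file=None, export_graphviz returns the DOT source as a string,
# which Graphviz can then render.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=feature_names, filled=True
)
print(dot_source[:60])
```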
Discussion
Because this dataset is small and very specific, it is hard to say with certainty
the exact level of alcohol consumption among students or to identify all of the factors that
affect a student’s academic performance. There is reason to believe that with a larger dataset
of this nature, more accurate predictions can be made for the student’s final grade. One
interesting finding was the ability to make reasonably accurate predictions of the number of
failures a student had acquired in the past.
We went into this project with many preconceived notions about what caused students to
consume alcohol and how it affects their academic performance. Many of us thought that the
predictors relating to family data or past academic failures would have the biggest impact on
predicting the likelihood of alcohol consumption and, in turn, overall grades. The three
machine learning models chosen for this project demonstrated to us that the answers to these
questions are more complex and nuanced than we previously thought. Other factors have to be
considered, such as how much students go out with their friends, their previous grades, their
study habits, and how much free time they have. For future studies, it would help to
have a larger dataset and to gather data from a more diverse group of students to yield better
prediction outcomes.
Summary
This project taught us a lot about student alcohol consumption and how it relates to
academic performance, but there is still more to be uncovered. Our findings showed that
an increase in students’ alcohol consumption doesn’t necessarily translate to lower
academic performance. Other factors beyond alcohol consumption must be considered and
proved to be better predictors when combined with other features. Previous grade scores
(‘G1’ & ‘G2’) and other social activities like free time outside of school (‘freetime’) and going out
with friends (‘goout’) yielded better results as predictor features.
