
Bablu Banik

Jerridan Bonbright-Schultz
Javier Gonzalez
Trinidad Ramirez
Professor Ergezer
CST 383 - Introduction to Data Science
25 February 2022

Final Project - Alcohol Consumption Among Students & How it Affects their Academic Performance

Introduction
For this team project, we analyzed alcohol consumption among students and looked at

various factors that contribute to a student's tendency to consume alcohol. We obtained our

data from Kaggle’s Student Alcohol Consumption dataset.

Our goal for this project was to predict:

1. The level of alcohol consumption among students.

2. The factors affecting a student's grades.

3. The student's grade based on the features available to us.

Because of alcohol’s negative effects on cognition, we hypothesized a negative

correlation between alcohol consumption and students’ academic

performance. In the process, we attempted to gain valuable insights into the leading causes of

student alcohol consumption and contributing factors to a student’s final grade.

Selection of Data
This project uses a dataset on student alcohol consumption retrieved from Kaggle. The

data comes from students in Math and Portuguese courses and is split into two separate

files. For our purpose, we chose to utilize only the data from the math class. This dataset

includes a number of interesting fields ranging from light demographic information to social
characteristics relating to students’ personal lives such as the amount of free time and quality of

family relationships. The data contained no null values, which eliminated much of the need for

preprocessing. Some approaches still required light feature engineering, such as

converting categorical columns to numeric values.
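
As an illustration of the categorical-to-numeric conversion described above, here is a minimal sketch using pandas. The rows are made up for the example, and the use of get_dummies with drop_first=True is an assumption on our part: it is one common way to obtain a column named 'Pstatus_T', but the report does not state which encoding call was actually used.

```python
import pandas as pd

# Hypothetical rows; column names follow the Kaggle math-class file.
df = pd.DataFrame({
    "Pstatus": ["T", "A", "T", "T"],         # parents together (T) or apart (A)
    "schoolsup": ["yes", "no", "no", "yes"],  # extra educational support
    "Medu": [4, 2, 3, 1],                     # mother's education, already numeric
})

# Confirm there are no missing values before encoding.
missing_total = int(df.isna().sum().sum())
print("missing values:", missing_total)

# One-hot encode the categorical columns. drop_first=True keeps a single
# indicator column per binary feature (for example, Pstatus_T).
encoded = pd.get_dummies(
    df, columns=["Pstatus", "schoolsup"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Numeric columns such as Medu pass through unchanged, so the encoded frame can be handed directly to a scikit-learn model.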

Methods
Tools:

● Numpy, Pandas, Matplotlib, and Seaborn for data exploration and visualization

● SciPy for computing z-scores (zscore)

● Scikit-learn for machine learning models

● GitHub for version control and submission

● VS Code, Jupyter Notebook, and Google Colab as development environments

● Graphviz for visualizing the DecisionTreeRegressor model

Inference methods used with Scikit-learn:

● ML Models: KNeighborsRegressor, LinearRegression, & DecisionTreeRegressor

● Utilities: export_graphviz, StandardScaler, train_test_split

Results

K Nearest Neighbors (KNN):

Kaggle’s Student Alcohol Consumption dataset lends itself to several approaches for

predicting student outcomes and intervening before problems arise.

For example, weekday and weekend alcohol consumption can be used to predict whether a

student will pass or fail a future exam. The final grade (G3) can be divided into two groups:

grades less than or equal to 5 are labeled “fail” and the rest are labeled

“pass”. Using the KNN regression model on weekday and weekend alcohol

consumption, we can predict whether a student may fail in the future. The main intention is to identify

vulnerable students, coach them better, help them reduce alcohol consumption, and finally
enable them to succeed in future exams. The model predicts well overall, with some anomalies

that will be addressed in future iterations.
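
The pass/fail workflow described above can be sketched roughly as follows. This is not our original notebook code: the data is randomly generated as a stand-in for the math-class file, the column names Dalc and Walc are assumed to match the Kaggle file, and n_neighbors=5 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the math-class data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Dalc": rng.integers(1, 6, n),  # weekday alcohol consumption, 1-5
    "Walc": rng.integers(1, 6, n),  # weekend alcohol consumption, 1-5
    "G3": rng.integers(0, 21, n),   # final grade, 0-20
})

X = df[["Dalc", "Walc"]]
y = df["G3"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Regress the final grade on the two alcohol features.
knn = KNeighborsRegressor(n_neighbors=5)  # k chosen arbitrarily for the sketch
knn.fit(X_train, y_train)
predicted_grade = knn.predict(X_test)

# Apply the threshold from the report: a grade of 5 or lower counts as "fail".
FAIL_THRESHOLD = 5
predicted_status = np.where(predicted_grade <= FAIL_THRESHOLD, "fail", "pass")
actual_status = np.where(y_test <= FAIL_THRESHOLD, "fail", "pass")

# Share of held-out students whose predicted status matches the actual one.
agreement = float((predicted_status == actual_status).mean())
print(f"pass/fail agreement on held-out rows: {agreement:.2f}")
```

Because the synthetic grades are unrelated to the alcohol columns, the agreement printed here says nothing about the real data; the sketch only shows how a regression output can be turned into a pass/fail label.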

Linear Regression:
For the linear regression approach, we chose a number of

predictor variables that seemed significant in determining the student’s grade. These

included the parents’ education, study time, overall alcohol consumption, and how

often the student goes out. The data was split into test and training sets and scaled before being

fit to the linear regression model. This approach did not yield promising results for

predicting the student’s final grade accurately. A key indicator of this was the

root mean squared error of over 4 on a grade scale of 0 to 20.
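
The split, scale, fit, and evaluate steps above might look like the sketch below. The predictors and target are synthetic placeholders, so the printed numbers will not match our reported error. The mean-prediction baseline is an addition for context and was not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
# Synthetic predictors standing in for columns such as parents' education,
# study time, alcohol consumption, and going out (all on a 1-5 scale here).
X = rng.integers(1, 6, size=(n, 4)).astype(float)
# Synthetic target on a 0-20 scale with only a weak link to the predictors.
y = np.clip(10 + 0.5 * X[:, 1] - 0.4 * X[:, 2] + rng.normal(0, 4, n), 0, 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit the scaler on the training split only, then reuse it on the test split
# so no information from the test rows leaks into training.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

predictions = model.predict(scaler.transform(X_test))
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))

# Baseline: always predict the mean training grade.
baseline_rmse = float(np.sqrt(np.mean((y_test - y_train.mean()) ** 2)))
print(f"model RMSE: {rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

Comparing the model's RMSE against the baseline is one way to judge whether an error of a given size reflects a weak model or simply a noisy target.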

As a test, the same linear regression model was used to predict the number of failures a

student had acquired. For this experiment, the predictors were changed to include new columns

such as the course grades. This resulted in a model with a root mean squared error of roughly

0.7, which is more acceptable for making accurate predictions. This led us to believe that a linear

regression model may be unfit for our purpose while something like KNN may lead to

predictions of higher accuracy.

Decision Trees:
After reviewing some of the data and visualizations, we started by choosing a few

features such as 'failures', 'famrel', and 'goout' to see how well they predicted students’ weekday and

weekend alcohol consumption in our regression tree model. After obtaining a baseline RMSE with

these predictors, we expected that including more

predictors pertaining to the student’s family background (e.g. 'Pstatus_T', 'Medu', 'Fedu') and

increasing the tree depth from 2 to 5 would lower the model’s error on weekday

and weekend alcohol consumption. Unfortunately, it did not. Two features, 1) going out with friends

(‘goout’) and 2) free time after school (‘freetime’), yielded the best prediction results for weekday

and weekend alcohol consumption.
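
The depth comparison described above can be sketched as follows. The data is synthetic and the helper name rmse_for_depth is hypothetical; the point is only to show how the test-set RMSE of a shallow tree and a deeper tree can be compared on the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 300
goout = rng.integers(1, 6, n)     # going out with friends, 1-5
freetime = rng.integers(1, 6, n)  # free time after school, 1-5
# Synthetic weekend consumption loosely tied to going out (illustrative only).
walc = np.clip(np.round(0.6 * goout + rng.normal(1, 1, n)), 1, 5)

X = np.column_stack([goout, freetime])
X_train, X_test, y_train, y_test = train_test_split(
    X, walc, test_size=0.3, random_state=2
)

def rmse_for_depth(max_depth):
    """Fit a regression tree of the given depth and return its test-set RMSE."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=2)
    tree.fit(X_train, y_train)
    residuals = y_test - tree.predict(X_test)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Compare the two depths mentioned in the report.
rmse_by_depth = {depth: rmse_for_depth(depth) for depth in (2, 5)}
for depth, rmse in rmse_by_depth.items():
    print(f"max_depth={depth}: test RMSE = {rmse:.3f}")
```

A deeper tree always fits the training rows at least as well, but its test RMSE can stay flat or get worse, which is consistent with what we observed when moving from depth 2 to depth 5.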


Next, we focused on weekday and weekend alcohol consumption and how it affected

students’ final grade (‘G3’). Surprisingly, alcohol consumption was not the best predictor for a

student’s final grade. According to the Decision Tree Regression model, the first and second

period grades (‘G1’ & ‘G2’ respectively) were the best predictors for the final grade target.

Despite all other contributing factors to a student’s final grade, their

previous grades provided the highest level of accuracy when

predicting academic performance.
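
One way to see which predictors a regression tree relies on is to inspect its impurity-based feature importances and export the fitted tree for Graphviz. The sketch below uses synthetic grades that track G2 by construction, so it demonstrates the mechanics rather than reproducing our result. The column names Dalc and Walc are assumed from the Kaggle file.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

rng = np.random.default_rng(3)
n = 300
feature_names = ["G1", "G2", "Dalc", "Walc"]
g1 = rng.integers(0, 21, n)                       # first period grade
g2 = np.clip(g1 + rng.integers(-2, 3, n), 0, 20)  # second period grade
dalc = rng.integers(1, 6, n)                      # weekday alcohol, 1-5
walc = rng.integers(1, 6, n)                      # weekend alcohol, 1-5
# Synthetic final grade that follows G2 closely (an assumption of this sketch).
g3 = np.clip(g2 + rng.integers(-1, 2, n), 0, 20)

X = np.column_stack([g1, g2, dalc, walc])
tree = DecisionTreeRegressor(max_depth=3, random_state=3).fit(X, g3)

# Impurity-based importances; they sum to 1 across all features.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")

# With out_file=None, export_graphviz returns the DOT source as a string,
# which Graphviz can then render to an image.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=feature_names, filled=True
)
```

The DOT string can be written to a file or passed to a Graphviz renderer to produce the kind of tree diagram listed under Tools.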

Discussion
Due to the nature of this very specific and small dataset, it’s hard to say with certainty

the exact level of alcohol consumption among students and all extenuating factors that affect a

student’s academic performance. There is reason to believe that with a larger dataset of this

nature, more accurate predictions can be made for the student’s final grade. One interesting

finding that arose was the ability to make semi-accurate predictions of the number of failures a

student had acquired in the past.

We went into this project with many preconceived notions on what caused students to

consume alcohol and how it affects their academic performance. Many of us thought that the

predictors relating to family data or past academic failures would have the biggest impact on

predicting the likelihood of alcohol consumption and in turn, affect overall grades. The three

machine learning models chosen for this project demonstrated to us that the answers to these

questions are more complex and nuanced than we previously thought. Other factors have to be

considered, such as how often students go out with their friends, their previous grades, their

study habits, and how much free time they have on hand. For future studies, it would help to

have a larger dataset and to gather data from a more diverse group of students to yield better

prediction outcomes.
Summary
This project taught us a lot about student alcohol consumption and how it relates to their

academic performance but there is still more that needs to be uncovered. Our findings led us to

discover that an increase in students’ alcohol consumption doesn’t necessarily translate to lower

academic performance. Other factors beyond alcohol consumption must be considered and

have been shown to be better predictors when combined with other features. Previous grade scores

(‘G1’ & ‘G2’) and other social activities like free time outside of school (‘freetime’) and going out

with friends (‘goout’) yielded better results as predictor features.
