Final Project - Report
Jerridan Bonbright-Schultz
Javier Gonzalez
Trinidad Ramirez
Professor Ergezer
CST 383 - Introduction to Data Science
25 February 2022
Introduction
For this team project, we analyzed alcohol consumption among students and looked at
various factors that contribute to a student's tendency to consume alcohol. We obtained our
performance. In the process, we attempted to gain valuable insights on the leading causes of
Selection of Data
This project uses a dataset on student alcohol consumption retrieved from Kaggle. The
data comes from students in Math and Portuguese courses and is split into two separate
files. For our purposes, we used only the data from the math class. This dataset
includes a number of interesting fields, ranging from light demographic information to social
characteristics of students’ personal lives, such as the amount of free time and the quality of
family relationships. The data contained no null values, which eliminated much of the need for
preprocessing. Some feature engineering was used for certain approaches which involved
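As a rough illustration of the null check described above, the sketch below counts missing values and one-hot encodes a categorical column. The small DataFrame is hand-made stand-in data, not the Kaggle file. The 'Pstatus_T' column named in the Decision Trees section is consistent with one-hot encoding, but treating that as the feature engineering used here is an assumption.

```python
import pandas as pd

# Hand-made stand-in for a few columns of the math-class file (not real data).
df = pd.DataFrame({
    "Pstatus": ["T", "A", "T", "T"],
    "Medu": [4, 1, 3, 2],
    "Fedu": [4, 1, 2, 2],
    "G3": [15, 6, 10, 11],
})

# Count missing values per column; all zeros means no imputation is needed.
null_counts = df.isnull().sum()

# One-hot encode the categorical column (assumed step).
# drop_first=True drops the first category ("A") and keeps only Pstatus_T.
encoded = pd.get_dummies(df, columns=["Pstatus"], drop_first=True)

print(null_counts.to_dict())
print(list(encoded.columns))
```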
Methods
Tools:
● Numpy, Pandas, Matplotlib, and Seaborn for data exploration and visualization
Results
Kaggle’s Student Alcohol Consumption dataset lends itself to several approaches for
predicting, and attempting to prevent, poor academic outcomes.
For example, daily and weekly alcohol consumption can be used to predict whether a
student will pass or fail a future exam. The final grade (G3) can be divided into two groups:
a grade less than or equal to 5 is treated as a “fail” status, and the rest as a
“pass” status. Using the KNN regression model and analyzing daily and weekly alcohol
consumption, we can predict whether a student may fail in the future. The main intention is to
identify vulnerable students, coach them better, help them reduce alcohol consumption, and finally
enable them to succeed in future exams. This model predicts well with some anomalies
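The pass/fail split at G3 <= 5 and a nearest-neighbor prediction from alcohol consumption could be sketched as follows. This is not the report's code: the data is synthetic, the variable names `dalc` and `walc` are assumed stand-ins for the daily and weekly consumption columns, and the sketch uses scikit-learn's `KNeighborsClassifier` because the target is binary, whereas the report describes a KNN regression model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins (assumption): consumption levels on a 1-5 scale.
dalc = rng.integers(1, 6, size=n)
walc = rng.integers(1, 6, size=n)

# Synthetic final grade on a 0-20 scale, built to fall as consumption rises.
g3 = np.clip(16 - 1.5 * dalc - walc + rng.normal(0, 2, size=n), 0, 20)

X = np.column_stack([dalc, walc])
y = (g3 > 5).astype(int)  # 1 = pass, 0 = fail (G3 <= 5)

# Stratify so both classes appear in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"test accuracy on synthetic data: {accuracy:.2f}")
```

Because the grades here are generated from the consumption values, the accuracy only shows the mechanics of the approach and says nothing about the real dataset.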
Linear Regression:
When using linear regression as our approach, we chose a number of
predictor variables that seemed significant in determining the student’s grade. These
included the parents’ education, study time, overall alcohol consumption, and how
often the student goes out. The data was split into training and test sets and scaled before being
fit to the linear regression model. This approach did not yield promising results for
accurately predicting the student’s final grade. A key indicator of this was the
As a test, the same linear regression model was used to predict the number of failures a
student had accumulated. For this experiment, the predictors were changed to include new columns
such as the course grades. This resulted in a model with a root mean squared error of roughly
0.7, which is more acceptable for making accurate predictions. This led us to believe that a linear
regression model may be unfit for our purpose while something like KNN may lead to
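The split, scale, fit, and RMSE steps described in this section can be sketched as below. The four predictors and the target are synthetic stand-ins (assumptions) for parent education, study time, alcohol consumption, and going out, so the printed numbers are not the report's results. Since RMSE is expressed in the units of the target, the sketch also prints the RMSE of always predicting the training mean, which gives a reference point for judging a value on a given target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# Synthetic predictors (assumption), each on its own small integer scale.
X = np.column_stack([
    rng.integers(0, 5, size=n),  # parent education
    rng.integers(1, 5, size=n),  # study time
    rng.integers(1, 6, size=n),  # alcohol consumption
    rng.integers(1, 6, size=n),  # going out
])

# Synthetic target: a linear signal plus noise.
y = (10 + 0.5 * X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] - 0.3 * X[:, 3]
     + rng.normal(0, 1, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# The scaler is fit on the training split only, then applied to the test split.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
# Reference point: RMSE of always predicting the training mean.
baseline = float(np.sqrt(np.mean((y_test - y_train.mean()) ** 2)))
print(f"model RMSE: {rmse:.2f}, mean-baseline RMSE: {baseline:.2f}")
```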
Decision Trees:
After reviewing some of the data and visualizations, we started by choosing a few
features, such as 'failures', 'famrel', and 'goout', to see how they would affect students’ weekday and
weekend alcohol consumption in our regression tree model. After obtaining a baseline RMSE with
these predictors, we expected that including more
predictors pertaining to the student’s family background (e.g. 'Pstatus_T', 'Medu', 'Fedu') and
increasing the tree depth from 2 to 5 would lower the model’s error and give a better prediction
of students’ weekday and weekend alcohol consumption. Unfortunately, it did not, and the
features 1) going out with friends
(‘goout’) and 2) free time after school (‘freetime’) yielded the best prediction results for weekday
students’ final grade (‘G3’). Surprisingly, alcohol consumption was not the best predictor for a
student’s final grade. According to the Decision Tree Regression model, the first and second
period grades (‘G1’ & ‘G2’ respectively) were the best predictors for the final grade target.
Despite all other contributing factors to a student’s final grade, it appears that it was their
previous grade performance that provided the highest level of accuracy when it comes to
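A minimal sketch of comparing regression trees at depths 2 and 5 and reading off feature importances is shown below. The data is synthetic and deliberately built so that the final grade tracks the second-period grade, so the output mirrors the pattern described above by construction and is not evidence for it. The column name 'Walc' is an assumed stand-in for weekend alcohol consumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 400

# Synthetic stand-ins (assumption): grades on a 0-20 scale, others on 1-5.
g1 = rng.integers(3, 20, size=n).astype(float)
g2 = np.clip(g1 + rng.normal(0, 1.0, size=n), 0, 20)
goout = rng.integers(1, 6, size=n).astype(float)
walc = rng.integers(1, 6, size=n).astype(float)
g3 = np.clip(g2 + rng.normal(0, 1.0, size=n), 0, 20)  # G3 follows G2

features = ["G1", "G2", "goout", "Walc"]
X = np.column_stack([g1, g2, goout, walc])

X_train, X_test, y_train, y_test = train_test_split(
    X, g3, test_size=0.25, random_state=2
)

# Fit one tree per depth and record its test RMSE.
rmse_by_depth = {}
for depth in (2, 5):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    residuals = y_test - tree.predict(X_test)
    rmse_by_depth[depth] = float(np.sqrt(np.mean(residuals ** 2)))

# Importances of the last fitted tree (depth 5); they sum to 1.
importances = dict(zip(features, tree.feature_importances_))
print(rmse_by_depth)
print({name: round(float(value), 3) for name, value in importances.items()})
```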
Discussion
Because this dataset is small and very specific, it is hard to say with certainty
the exact level of alcohol consumption among students or to identify all the factors that affect a
student’s academic performance. There is reason to believe that with a larger dataset of this
nature, more accurate predictions can be made for the student’s final grade. One interesting
finding that arose was the ability to make semi-accurate predictions of the number of failures a
We went into this project with many preconceived notions about what causes students to
consume alcohol and how it affects their academic performance. Many of us thought that the
predictors relating to family data or past academic failures would have the biggest impact on
predicting the likelihood of alcohol consumption and, in turn, overall grades. The three
machine learning models chosen for this project showed us that the answers to these
questions are more complex and nuanced than we previously thought. Other factors have to be
considered, such as how often students go out with their friends, their previous grades, their
study habits, and how much free time they have. For future studies, it would help to
have a larger dataset and to gather data from a more diverse group of students to yield better
prediction outcomes.
Summary
This project taught us a lot about student alcohol consumption and how it relates to
academic performance, but there is still more to be uncovered. Our findings led us to
discover that an increase in students’ alcohol consumption doesn’t necessarily translate to lower
academic performance. Other factors beyond alcohol consumption must be considered and
have been shown to be better predictors when combined with other features. Previous grade scores
(‘G1’ & ‘G2’) and other social activities like free time outside of school (‘freetime’) and going out