School of Engineering and Technology
Department of Computer Science and Engineering
Jain Global Campus, Kanakapura Taluk - 562112
Ramanagara District, Karnataka, India
A Project Report on
“Wine Quality Analysis”
For the partial fulfilment of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Brahadeesh Kishore
16BT6CS006
Karishma Kurickal
16BT6CS011
CERTIFICATE
This is to certify that the project work titled “Wine Quality Analysis” for the course
Machine Learning (16CIC73) during the 7th semester was carried out by Brahadeesh Kishore
(16BT6CS006) and Karishma Kurickal (16BT6CS011), who are bonafide students at the School of
Engineering & Technology, JAIN (Deemed-to-be-University), Bangalore, in partial fulfilment
of the requirements for the award of the degree of Bachelor of Technology in Computer
Science and Engineering, during the year 2019 - 2020.
Prof. Shilpa Das
Assistant Professor
Dept. of Computer Science and Engineering,
School of Engineering & Technology,
JAIN (Deemed-to-be-University)
Date:

Dr. Narayana Swamy Ramaiah
Head of the Department,
Dept. of Computer Science and Engineering,
School of Engineering & Technology,
JAIN (Deemed-to-be-University)
Date:
TABLE OF CONTENTS
Chapter 1
1. INTRODUCTION
1.1 Problem Definition
1.2 Objectives
1.3 Methodology
1.4 Software Requirements
1.5 Tool Description
Chapter 2
2. IMPLEMENTATION
2.1 Design and Implementation
2.1.1 Implementation Mechanism
2.1.2 Major Considerations for Implementation
2.1.3 Source Code
2.2 Machine Learning Algorithms Used
Chapter 3
3. RESULTS AND DISCUSSION
Chapter 4
4. CONCLUSION
REFERENCES
Chapter 1
Introduction
(Introduction to the problem chosen, the domain of the problem, which other problems
were there, and why did you choose this particular problem, should be given here within 1 –
2 paragraphs.)
Other Issues in the Domain (along with the chosen issue) –
• i.
• ii.
• iii.
• iv.
Brief Explanation about relevance of chosen issue, and why is it important.
1.1. Problem Definition
The problems this project aims to solve are:
• Predicting the quality rating of a new kind of wine, given the properties and quality
ratings of many other types of wine
• Testing multiple algorithms to see which performs best on the given data
1.2. Objectives
The objectives of this project are the following:
• To compare and contrast various machine learning algorithms and their training
set/test set performance on the given dataset
• To correctly classify the quality of the wine types in the test set, based on their
features
1.3. Methodology
The architecture/workflow to solve the addressed problems should be explained, along with
how you plan to implement the proposed model.
1.4. Software Requirements
This project uses the following tools for its implementation:
1. Python: The programming language used to implement this project.
2. Jupyter Notebook: Used to provide an interactive environment for Python for the
implementation of this project.
3. Google Colab: Used to run Jupyter Notebooks remotely and save them to Google
Drive, hence increasing portability. Google Colab also provides additional hardware
resources, reducing execution times.
4. Scikit-Learn: Used to implement the machine learning models used in this
project, as well as to measure their accuracies.
5. Matplotlib: Used to plot graphs that show the trends in
accuracy/performance of the various machine learning algorithms implemented.
6. Pandas: Used to analyse and clean our dataset.
1.5. Tool Description
The various tools used in this project are described below.
1. Python
Python is an interpreted, high-level programming language created by Guido van
Rossum and first released in 1991. It emphasizes code readability, using indentation
to delimit blocks and line breaks to end statements. Its ease of use and clear syntax
make it a language of preference for data analysis.
2. Jupyter Notebook
It is a web application that provides an interactive computing environment for
Python. Using this, one can create and share documents that contain equations,
visualizations such as graphs, text, as well as live code. For these reasons, it is a
widely used tool for data analysis.
3. Google Colab
Google Colab is a free cloud service that allows you to write and run Jupyter
Notebooks on the cloud, rather than having to install Jupyter Notebook and the
necessary packages on your machine. It provides the advantage of portability, since
Jupyter Notebooks are saved to your Google Drive, and can hence be run anywhere.
The main advantage of Google Colab is the system resources that are provided,
which help speed up training time for machine learning models. This includes GPUs
and additional RAM. This is helpful since many local systems may not have adequate
system resources to train these models as quickly.
4. Scikit-Learn
It is a Python package used for data modelling. It provides a number of supervised
and unsupervised machine learning models. Scikit-learn makes it simple to train
models: in most cases, a few function calls on the input data are all that is needed.
5. Matplotlib
This is a plotting library for the Python programming language, and is used to make
plots and graphs based on provided data. Plots help us understand trends and patterns
and identify correlations. Understanding trends in data by simply looking at numbers
is difficult, so matplotlib provides visuals for this purpose.
6. Pandas
Although Scikit-learn provides models for training on data, it is not concerned
with the preparation and cleaning of this data. That is where pandas
comes in. Pandas is an open source library used to import datasets in a variety of
formats and to analyse and clean this data. It is built on NumPy, with
performance-critical parts written in Cython and C, and supports vectorized
operations. That is, an operation can be applied to an entire row or column at once,
eliminating the need for an explicit for loop over its elements. This makes such
operations considerably faster than equivalent Python loops.
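The vectorized style described for pandas can be illustrated with a minimal sketch. The toy values and the derived ratio below are assumptions made only for illustration; they are not taken from the project's dataset or code.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "free sulfur dioxide": [30.0, 14.0, 45.0],
    "total sulfur dioxide": [97.0, 132.0, 170.0],
})

# Explicit loop: one Python-level operation per row.
ratio_loop = []
for free, total in zip(df["free sulfur dioxide"], df["total sulfur dioxide"]):
    ratio_loop.append(free / total)

# Vectorized: a single expression applied to the whole column at once.
ratio_vectorized = df["free sulfur dioxide"] / df["total sulfur dioxide"]

print(ratio_vectorized.round(3).tolist())
```

Both approaches compute the same values; the vectorized form is shorter and avoids the per-row overhead of the Python loop.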
Chapter 2
Implementation
2.1. Design & Implementation
The implementation of this project involved the following steps:
1. Import the necessary packages.
2. Import the required dataset and clean the dataset.
3. Split the dataset into a training and test set.
4. Apply three algorithms: K Nearest Neighbors, Naïve Bayes, and Random Forests.
5. Compare the accuracies on the training and test set.
6. Generate graphs comparing train and test set accuracies for different algorithms.
The implementation is described in detail below.
2.1.1. Implementation Mechanism
The three machine learning algorithms that we will implement are K Nearest Neighbors,
Random Forests, and Naïve Bayes. First and foremost, the dataset had to be imported and
cleaned. Since we are using Google Colab, we must upload the dataset to the cloud. The
code snippet is given below.
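import pandas as pd
from google.colab import files
# Opens a file picker and uploads the selected file to the Colab session
files.upload()
# Read the uploaded CSV file into a DataFrame
df = pd.read_csv("[Link]")
df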
A brief summary of the dataset is returned.
As we can see, every column except for type and quality contains continuous numerical
values. The quality is what we are trying to predict. Since the type column consists of
strings, we first convert them to numeric values. We execute this line of code next:
df['type'].unique()
output: array(['white', 'red'], dtype=object)
Since there are only two types of wine, namely white and red, we can encode these as 0s
and 1s respectively. Additionally, we drop all rows containing null values.
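import numpy as np
# Encode the wine type: white -> 0, red -> 1
df['type'] = np.where(df['type'] == 'white', 0, 1)
# Drop every row that contains at least one null value
df = df.dropna(axis=0)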
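The effect of these two cleaning steps can be checked on a small example. The sketch below uses a hypothetical four-row DataFrame with made-up values, not the actual wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the wine dataset; values are made up.
toy = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "pH": [3.0, np.nan, 3.3, 3.5],
    "quality": [6, 5, 7, 5],
})

# white -> 0, anything else (here only red) -> 1
toy["type"] = np.where(toy["type"] == "white", 0, 1)

# Drop every row that has at least one missing value
toy = toy.dropna(axis=0)

print(toy["type"].tolist())  # [0, 0, 1]
print(len(toy))              # 3
```

The second row is removed because its pH value is missing, leaving three rows with a fully numeric type column.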
Dataset after cleaning:
Now that we are done cleaning the dataset we can create two vectors, one for inputs and
one for labels, then divide these into training and test sets. We will use 70% of the data as
the training set and the remaining 30% as the test set. Our labels will be the wine’s quality
rating. We split the data using scikit-learn’s train_test_split function.
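from sklearn.model_selection import train_test_split
# Features: every column except the label, so that quality cannot leak into the inputs.
# The 'type' column is already numeric after the encoding step.
X = df.drop(columns=['quality'])
# Labels: the quality rating
y = df['quality']
# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)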
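To show what the 70/30 split produces, here is a minimal sketch on synthetic data. The three columns and their values are placeholders, and `random_state` is an addition made here so that the split is repeatable; the report's own code does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset: 100 rows, 3 made-up columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "type": rng.integers(0, 2, size=100),
    "alcohol": rng.uniform(8.0, 14.0, size=100),
    "quality": rng.integers(3, 9, size=100),
})

X = toy.drop(columns=["quality"])  # features
y = toy["quality"]                 # labels

# random_state is an assumption added for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)
```

With 100 rows, 70 go to the training set and 30 to the test set, and the label column is absent from the feature matrix.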
The data is now ready for the ML algorithms. The necessary imports are
given below.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
The code to train the K Nearest Neighbors classifier is given below.
# 10 neighbours; p=1 selects the Manhattan (L1) distance
knn = KNeighborsClassifier(n_neighbors=10, p=1)
knn.fit(X_train, y_train)
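The remaining two classifiers can be trained and scored in the same way. The following is a sketch under stated assumptions rather than the project's actual code: it runs on synthetic data instead of the wine dataset, and the Random Forest and Naïve Bayes hyperparameters are scikit-learn defaults because the report does not list them.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (not the wine dataset): 300 samples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-0.5, 0.5])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=10, p=1),
    "Naive Bayes": GaussianNB(),
    # Default hyperparameters are an assumption; the report does not list them.
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record (train accuracy, test accuracy).
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = metrics.accuracy_score(y_train, model.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, model.predict(X_test))
    accuracies[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Keeping the models in a dictionary makes it straightforward to compare train and test accuracy side by side, which is the comparison this project sets out to make.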
2.1.2. Major Considerations for Implementation
While cleaning the data, it is important to convert all string values to numeric values before
proceeding, since scikit-learn models cannot take strings as input directly. An
example of encoding strings to values would be [‘a’, ‘b’, ‘c’] -> [0, 1, 2]. In our case, we only
had 2 different string values, so we could simply encode them as 0s and 1s. Rows
containing null or undefined values were dropped. Since we want to achieve the highest
possible accuracy, we did not risk replacing null values with a predefined value and
preferred dropping those rows altogether. Since we had enough training data, this step was
feasible.
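The general string-to-integer encoding mentioned above can be sketched in a few lines. The helper name `encode_categories` is hypothetical and is not part of the project's code or of any library.

```python
def encode_categories(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            # Assign the next unused integer to a newly seen category.
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping


encoded, mapping = encode_categories(["a", "b", "c", "a"])
print(encoded)  # [0, 1, 2, 0]
print(mapping)  # {'a': 0, 'b': 1, 'c': 2}
```

With only two categories, as in the type column, this reduces to the 0/1 encoding used in this project.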
Additionally, since K-NN is a distance-based algorithm, we need to scale the data
appropriately. If the data is not scaled, features with larger ranges contribute more to
the distance than others. For example, the contribution of the total sulfur dioxide feature
to the prediction of the wine’s quality rating would be much greater than that of, say,
sulphates, since the former has a wider range of values. We tested K-Nearest Neighbors
both with and without scaling features between 0 and 1 and noted the accuracy.
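The effect of scaling on K-NN can be illustrated with a minimal sketch. It assumes scikit-learn's `MinMaxScaler` as the tool for scaling to the range 0 to 1 (the report does not name the scaler used) and runs on synthetic data in which only the narrow-range feature carries information.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: one wide-range noise feature and one narrow-range
# informative feature. The comparisons to wine features are only analogies.
rng = np.random.default_rng(0)
wide = rng.uniform(0.0, 300.0, size=400)   # large range, like total sulfur dioxide
narrow = rng.uniform(0.3, 1.0, size=400)   # small range, like sulphates
X = np.column_stack([wide, narrow])
y = (narrow > 0.65).astype(int)            # label depends only on the narrow feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Without scaling: distances are dominated by the wide-range feature.
knn_raw = KNeighborsClassifier(n_neighbors=10, p=1).fit(X_train, y_train)
acc_raw = knn_raw.score(X_test, y_test)

# With scaling: fit the scaler on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=10, p=1).fit(X_train_scaled, y_train)
acc_scaled = knn_scaled.score(X_test_scaled, y_test)

print(f"unscaled: {acc_raw:.3f}, scaled: {acc_scaled:.3f}")
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.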
2.1.3. Source Code
Partial / Complete Source Code
2.2. Machine Learning Algorithms Used
1. K Nearest Neighbors
Chapter 3
Results and Discussion
Detailed graphs for comparison between the 3-4 methods used, indicating performance with
respect to the dataset, explanation of the graphs, what do they indicate, why do they
perform as they do.
Accuracy achieved in values should also be given here, for all the methods.
Chapter 4
Conclusion
Conclude here what did you propose to do, how much you did, how well did you obtain
results, this should be short story on the entire work while you explain, like a revisit to the
entire project.
References