Professional Documents
Culture Documents
net/publication/311919082
CITATIONS READS
12 9,080
2 authors:
All content following this page was uploaded by Yeşim Er on 16 January 2017.
Abstract: The main purpose of this study is to predict wine quality based on physicochemical data. In this study, two large separate data
sets which were taken from UC Irvine Machine Learning Repository were used. These data sets contain 1599 instances for red wine and
4898 instances for white wine with 11 features of physicochemical data such as alcohol, chlorides, density, total sulfur dioxide, free
sulfur dioxide, residual sugar, and pH. First, the instances were successfully classified as red wine and white wine with the accuracy of
99.5229% by using Random Forests Algorithm. Then, the following three different data mining algorithms were used to classify the
quality of both red wine and white wine: k-nearest-neighbourhood, random forests and support vector machines. There are 6 quality
classes of red wine and 7 quality classes of white wine. The most successful classification was obtained by using Random Forests
Algorithm. In this study, it is also observed that the use of principal component analysis in the feature selection increases the success rate
of classification in Random Forests Algorithm.
This journal is © Advanced Technology & Science 2013 IJISAE, 2016, 4(Special Issue), 23–26 | 23
properties, such as acidity and alcohol composition. Due to are tested for each method. The best classification results of each
privacy and logistic issues, only physicochemical (inputs) and method are obtained when k value is chosen as 10.
sensory (the output) variables are available (e. g. There is no data Firstly, both datasets are separated into training and testing set by
about grape types, wine brand, wine selling price, etc.). using 10-fold cross-validation method. Afterwards, both datasets
Table 1 and Table 2 present the 11 different physicochemical are randomly divided into two groups called training and testing
properties and data statistics (minimum, maximum, mean, and set. First set involves randomly 80% of dataset, and the other set
standard deviation values of all instances for each feature) of involves the resting data.
white wine and red wine sets respectively.
2.3. Data Mining Techniques
IJISAE, 2016, 4(Special Issue), 23–26 This journal is © Advanced Technology & Science 2013
classify the wine samples as red wine and white wine. A model SVM 48.1 58.3 52.7 70.6
was built using each method and applied to the wine data set. The Cross
k-NN 64.3 64.8 64.5 72.7
classification results of the three classification algorithms are Validation
RF 66.8 69.6 67.8 86.4
evaluated both test modes which are 10-fold cross-validation, and
80% percentage split. SVM 49.7 59.1 53.9 70.8
Also, some of the standard performance measures (statistics) are Percentage
k-NN 65.6 65.6 65.5 72.8
Split
calculated to evaluate the performance of the algorithms. The
RF 69.6 71.9 70.5 87.2
standard performance measures are recall, precision, F measure,
and ROC values.
Table 3 presents the correctly classified instances results of the The most successful classification result of white wine sample
classification of red and white wine samples. qualities as 11 classes was obtained by using Random Forest
algorithm for both test modes. The accuracy of each cross-
Table 3. Performance results of the classification of red and white wine validation and percentage split mode with this algorithm is
samples
70.3757%, and 68.6735% respectively.
Test Modes Classifier Precision Recall F Measure ROC The most successful classification result of red wine sample
(%) (%) (%) (%) qualities as 11 classes was obtained by using Random Forest
algorithm for both test modes. The accuracy of each cross-
SVM 99.1 99.1 99.1 98.6
validation and percentage split mode with this algorithm is
Cross
k-NN 99.2 99.2 99.2 99.0 69.606%, and 71.875% respectively.
Validation
RF 99.5 99.5 99.5 99.8 Then, for increasing of classification success in this study, the
SVM 99.2 99.2 99.2 98.6 number of features was reduced by using PCA algorithm and the
Percentage process wine quality classification was repeated by using SVM,
k-NN 99.2 99.2 99.2 98.7
Split k-NN, and RF algorithms.
RF 99.5 99.5 99.5 99.9 Table 6 and Table 7 present the correctly classified instances
results of the classification of white and red wine sample qualities
The most successful classification result of red and white wine after applying PCA respectively.
samples was obtained by Random Forests Algorithm for both test Table 6. Performance results of the classification of white wine sample
modes. The accuracy of each cross-validation and percentage qualities after applying PCA
split mode with this algorithm is 99.5229%, and 99.4611%
respectively. Test Modes Classifier Precision Recall F Measure ROC
Table 3 clearly shows that Random Forests algorithm (%) (%) (%) (%)
outperforms from the other algorithms in two test modes. SVM 39.8 52.2 44.0 66.3
For the classification of the quality of both red and white wine Cross
k-NN 64.5 64.7 64.6 74.4
samples, the classification experiment is measured by the Validation
RF 70.7 69.9 68.8 86.9
accuracy percentage of classifying the instances correctly into its
class according to quality features which are ranged between 0 SVM 39.4 51.2 43.6 65.9
(very bad) and 10 (very excellent) as 11 different classes and Percentage
k-NN 63.5 63.6 63.5 73.7
Split
totally 22 classes.
RF 68.1 67.4 66.3 85.4
Table 4 and Table 5 present the correctly classified instances
results of the classification of white and red wine sample qualities Table 7. Performance results of the classification of red wine sample
respectively. qualities after applying PCA
For k-nearest-neighbourhood classifiers, different k values are
Test Modes Classifier Precision Recall F Measure ROC
tested for each test mode. When k value is increased, the
(%) (%) (%) (%)
achievement of the classification decreases. For this reason, k
value is taken as 1. SVM 47.8 58.0 52.4 69.7
Cross
Table 4. Performance results of the classification of white wine sample k-NN 64.3 64.8 64.5 72.7
Validation
qualities RF 68.4 71.2 69.4 86.4
Test Modes Classifier Precision Recall F Measure ROC SVM 47.8 56.9 51.9 69.3
Percentage
(%) (%) (%) (%) k-NN 68.0 67.8 67.8 74.4
Split
SVM 39.6 52.1 44.1 66.7 RF 71.4 73.4 72.1 87.8
Cross
k-NN 65.1 65.4 65.2 75.0
Validation
RF 71.0 70.4 69.5 87.3 The most successful classification result of white wine sample
qualities after applying PCA was obtained by using Random
SVM 39.4 51.2 43.7 65.8
Forest algorithm for both test modes. The accuracy of each cross-
Percentage
k-NN 63.0 63.3 63.0 73.1 validation and percentage split mode with this algorithm is
Split
RF 69.8 68.7 67.4 85.7 69.9061%, and 67.449% respectively.
The most successful classification result of red wine sample
Table 5. Performance results of the classification of red wine sample qualities after applying PCA was obtained by using Random
qualities
Forest algorithm for both test modes. The accuracy of each cross-
Test Modes Classifier Precision Recall F Measure ROC validation and percentage split mode with this algorithm is
(%) (%) (%) (%) 71.232%, and 73.4375% respectively.
IJISAE, 2016, 4(Special Issue), 23–26 This journal is © Advanced Technology & Science 2013