Project Report
Contents
1. Abstract
2. Problem Description
3. Data Exploration
   a. Correlation
4. Data Preprocessing
5. Data Summarization
6. Results
   a. Decision Tree
   b. Naïve Bayes
   c. Random Forest
   d. SVM
7. Conclusion
Abstract
Often considered a “superfood,” mushrooms have become a staple of many cuisines, as well as a healthy addition to many meals. While technically a fungus rather than a vegetable, mushrooms provide many of the benefits that vegetables do, including being fat-free, low in calories, and low in sodium. Mushrooms also contain B vitamins, which help break down fats, carbohydrates, and proteins, as well as vitamin D, which helps fight common diseases and supports weight management. The nutritional merits of mushrooms are clear. However, not every mushroom is edible for humans; in fact, some mushrooms are poisonous and, if consumed, can lead to death.
Our mission in this project is to work towards building a model that can predict whether a mushroom is edible or poisonous. We built several different models throughout our project, including a decision tree, a random forest, an SVM, and a Naïve Bayes classifier.
Our study would be extremely beneficial to people in survival situations, as well as to those who tend to obtain their food from the land around them rather than from a grocery store. People thrust into survival situations might turn to mushrooms, if available, for their nutritional value, but it would be advantageous for them to know which mushrooms are safe to eat.
Problem Description
The problem that many people face with mushrooms is that there are over 14,000 classified mushroom species throughout the world. As a result, it can be extremely difficult to distinguish edible species from poisonous ones, and consuming poisonous mushrooms can lead to kidney failure, liver damage, and even death. Through the use of predictive analytics, we aim to build a model that can help correctly classify mushrooms as either edible or poisonous.
Data Exploration
Correlation
The relationships between the predictor variables and the target variable were plotted on a heat map, shown in Figure 1. Among the 22 attributes, odor showed the strongest association with the target variable.
Figure 1
Insights from the Relationship Between the Target Variable and Other Variables
The plot helps to explore the contribution of each attribute in deciding the edibility of a mushroom. The more exclusively an attribute's values belong to one class, the more it contributes to the decision making; attributes that contribute nothing conclusive can be eliminated directly. From the plot alone we can already read off a few relationships between attribute values and whether a mushroom is edible or poisonous.
Based on Figure 2, if the odor is fishy, spicy, or foul, then the mushroom is likely poisonous. Conversely, if the odor is anise or almond, then the mushroom is likely edible. This attribute carries substantial weight in deciding the class of the mushroom.
Figure 2
Data Preprocessing
The first step we took in our data preprocessing was to limit the number of rows that we would observe. The entire dataset contains slightly over 8,000 data points; we took a random sample of 4,000 observations to make our predictive models run more smoothly. Additionally, we used summary statistics to explore our new, smaller dataset and to check for missing values. Thankfully, our dataset contained no missing values, so we were not forced to impute anything, which helped with the efficiency of our models.
Our data was initially encoded using individual letters to describe each mushroom characteristic. We chose to convert these letters to numeric factor values to help with visually plotting our data. Additionally, in the “Class” column of our dataset, mushrooms were initially labeled with a “p” for poisonous or an “e” for edible. We felt that clarifying this classification would help, so we simply replaced the individual letters with the words “poisonous” and “edible,” respectively. Lastly, the column “veil-type” had two possible values – “p” for partial or “u” for universal. However, in our dataset, every value in the “veil-type” column was “p.” As a result, we removed this column due to its lack of contribution to the model. After these steps, we were ready to begin building our models.
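The steps above can be sketched in miniature. This is an illustration in Python rather than the project's R code (Appendix A), using a tiny made-up table in place of the real dataset:

```python
import random

# Tiny stand-in for the mushroom table: each row is a dict of categorical codes.
base = [
    {"class": "p", "odor": "f", "veil-type": "p"},
    {"class": "e", "odor": "a", "veil-type": "p"},
    {"class": "e", "odor": "n", "veil-type": "p"},
    {"class": "p", "odor": "s", "veil-type": "p"},
]
rows = [dict(r) for r in base * 2]  # 8 rows standing in for the 8,124 observations

random.seed(1)
sample = random.sample(rows, 4)  # analogous to sampling 4,000 of the 8,124 rows

# Spell out the class labels: "p" -> "poisonous", "e" -> "edible".
label = {"p": "poisonous", "e": "edible"}
for row in sample:
    row["class"] = label[row["class"]]

# Drop any predictor column with a single constant value (veil-type is all "p").
constant = [c for c in sample[0] if c != "class" and len({r[c] for r in sample}) == 1]
for row in sample:
    for c in constant:
        del row[c]
```

The same logic in R amounts to `sample()`, a factor relevel, and removing the constant column, as Appendix A shows.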
Data Summarization
In the entire dataset, we have 8,124 observations and 23 variables. Of the 23 variables, 22 are categorical predictors and 1 is the target variable, which indicates whether a mushroom is edible or poisonous.
Figure 3
Our variables consist mainly of physical descriptions of parts of the mushroom, such as cap shape, cap color, gill size, and stalk size; see APPENDIX C – Mushroom Anatomy for reference. We believe that by utilizing the different characteristics and attributes of the mushrooms in the dataset, we will be able to accurately predict whether a selected mushroom is edible or poisonous.
As previously mentioned, there are no missing values in the dataset. This not only helped us use our time efficiently, but also let our models run faster than they would have if we had been forced to impute missing data using mean values, random variable selection, or explicit handling of null values.
Results
As stated earlier, we built models using a decision tree, a random forest, an SVM, and a Naïve Bayes classifier. Because our dataset is entirely categorical, we found that logistic regression was simply not going to be suitable for us without first encoding the predictors numerically.
Decision Tree
We first built a decision tree model in hopes of gaining a stronger understanding of our
data and the relationships among its many variables. We used a 70/30 breakdown to build our
model – 70% of our data used for the training dataset and 30% used for the testing dataset. This
worked out to 2,823 and 1,177 observations, respectively. The dataset's many categorical variables suggested that a decision tree might be ideal. There were two false negative outcomes in which a mushroom was classified as edible but was actually poisonous. This model returned an accuracy of 0.9983, with a recall of 1, a precision of 0.9968153, and an F-score of 0.9984051. Furthermore, the decision tree model returned a 95% confidence interval of (0.9969, 0.9998) for its accuracy, further cementing the strength of this model.
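As a sanity check, these numbers are internally consistent: 2 misclassifications out of 1,177 test observations gives the stated accuracy, and the F-score is the harmonic mean of the stated precision and recall. In Python:

```python
# Decision tree: 2 false negatives out of 1,177 test observations.
accuracy = (1177 - 2) / 1177
print(round(accuracy, 4))  # 0.9983

# F-score as the harmonic mean of the reported precision and recall.
precision, recall = 0.9968153, 1.0
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 7))  # 0.9984051
```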
Based on Figure 4 below, odor is the most important variable, and consequently is the
first split. If the odor is creosote (c), foul (f), musty (m), pungent (p), spicy (s), or fishy (y), then
it is poisonous. Odors that are almond (a), anise (l), or none (n) will move on to be determined by
spore_print_color, which is the second most important variable and the second split. From there,
a green (r) spore print color will classify the mushroom as poisonous, and anything else will
move on to be further evaluated. This continues until the remaining variables no longer improve the classification.
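The first two splits described above can be written out directly as a rule. A hypothetical sketch (single-letter codes as in Figure 4; everything past the second split is left undecided here):

```python
# creosote (c), foul (f), musty (m), pungent (p), spicy (s), fishy (y)
POISONOUS_ODORS = {"c", "f", "m", "p", "s", "y"}

def classify(odor, spore_print_color):
    """Apply only the first two splits of the decision tree in Figure 4."""
    if odor in POISONOUS_ODORS:
        return "poisonous"
    # Odor is almond (a), anise (l), or none (n): check the spore print color.
    if spore_print_color == "r":  # green spore print
        return "poisonous"
    return "needs further splits"

print(classify("f", "n"))  # poisonous
print(classify("a", "r"))  # poisonous
print(classify("n", "w"))  # needs further splits
```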
Figure 4
Naïve Bayes
Following our decision tree model, we built a model using a Naïve Bayes classifier. The early indication was that this model is not as accurate as the decision tree. In the initial confusion matrix, there were 75 misclassifications between edible and poisonous mushrooms, resulting in an accuracy of 0.9363. This model returned a recall, precision, and F-score of 0.9984051, 0.8952654, and 0.9433107, respectively.
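As a rough illustration of how a Naïve Bayes classifier scores categorical attributes, here is a toy sketch in Python (the project's actual model is the R code in Appendix B; the five training rows and the single odor feature here are made up):

```python
from collections import Counter, defaultdict

# Toy categorical Naive Bayes with Laplace smoothing; one feature (odor).
train = [("foul", "poisonous"), ("spicy", "poisonous"),
         ("almond", "edible"), ("anise", "edible"), ("none", "edible")]

class_counts = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)
for odor, label in train:
    feat_counts[label][odor] += 1

odors = {odor for odor, _ in train}

def predict(odor):
    scores = {}
    for label, n in class_counts.items():
        prior = n / len(train)
        likelihood = (feat_counts[label][odor] + 1) / (n + len(odors))  # Laplace
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

print(predict("foul"))    # poisonous
print(predict("almond"))  # edible
```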
Random Forest
The third model we built was a random forest. With random forest models being among the strongest classifiers, we felt we absolutely needed to include one in our consideration. The confusion matrix for the random forest model returned no misclassifications, giving an accuracy of 1.0, with recall, precision, and F-score all 1.0 as well.
We were intent on finding the most important variable for predicting edibility in our model. Using the Mean Decrease in Gini from varImpPlot, we observed that odor was the most important variable, shown in Figure 5.
Figure 5
SVM
The last model we built was an SVM. SVM models are well suited to classification, and we felt this might be another strong candidate for predicting the edibility of a mushroom. Like the random forest model, the confusion matrix for our SVM model returned no misclassifications and an accuracy of 1.0. In addition, the recall, precision, and F-score were all 1.0.
Conclusion
Building a strong model and obtaining results is certainly an important aspect of modelling, but it is equally important to be able to select the best model and interpret its results. It is important to note that while all models are technically incorrect, some can be more useful than others.
In classification problems like ours, a random forest essentially gives you the probability of a data point belonging to one of the classes in your dataset – for us, either poisonous or edible. On the other hand, SVM models use the concept of a hyperplane that separates the classes involved in the problem and classifies a data point into one of those classes.
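The hyperplane idea can be illustrated in a few lines. A hypothetical sketch in Python with made-up two-dimensional weights (a real SVM learns w and b from the data):

```python
# Hyperplane idea: the sign of w.x + b picks the class (made-up 2-D weights).
w, b = (1.5, -2.0), 0.25

def side(x):
    return "edible" if w[0] * x[0] + w[1] * x[1] + b >= 0 else "poisonous"

print(side((2.0, 0.5)))  # 1.5*2 - 2*0.5 + 0.25 = 2.25  -> edible
print(side((0.0, 1.0)))  # -2 + 0.25 = -1.75            -> poisonous
```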
Ultimately, both models gave us the same results, but we consider the random forest the better model over SVM. One main reason is that as the dataset grows, a random forest remains faster than an SVM, which has a higher computational complexity; in practice, SVM can become impractical beyond roughly 20,000 rows. The table below compares all of the models on accuracy, precision, recall, F-score, and AUC.
          Decision Tree  Naïve Bayes  Random Forest  SVM
Accuracy  0.9983         0.9363       1              1
Precision 0.9968153      0.8952654    1              1
Recall    1              0.9984051    1              1
F-score   0.9984051      0.9433107    1              1
AUC       0.9993         0.9594       1              1
APPENDIX A – Data Preprocessing
#overview of the mushrooms data with mean, median, mode, length, etc and NAs
summary(xm)
#names of columns
names(xm)
APPENDIX B – Model Building
#Decision Tree
tree <- ctree(edibility~., trainData)
print(tree)
plot(tree,type='simple')
summary(tree)
p1 <- predict(tree, trainData)
tab1 <- table(Predicted=p1,Actual=trainData$edibility)
tab1
p2 <- predict(tree, testData)
tab2 <- table(Predicted=p2,Actual=testData$edibility)
tab2
recall(tab2)
precision(tab2)
Fscore <- 2 * precision(tab2) * recall(tab2) / (precision(tab2) + recall(tab2))
Fscore
confusionMatrix(p1,trainData[["edibility"]])
confusionMatrix(p2,testData[["edibility"]])
#NaiveBayes
model <- naiveBayes(edibility~.,trainData)
pred <- predict(model,testData)
confusionMatrix(data = pred, reference = testData$edibility, positive = "edible")
tab1 <- table(Predicted=pred,Actual=testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore
#RandomForest
model_rf <- randomForest(edibility ~ ., ntree = 500, data = trainData)
predict_rf = predict(model_rf, testData[-1])
confusionMatrix(data = predict_rf, reference = testData$edibility, positive = "edible")
tab1 <- table(Predicted=predict_rf,Actual=testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore
#SVM
model_svm <- svm(edibility ~. , data=trainData, cost = 1000, gamma = 0.01)
test_svm <- predict(model_svm, newdata = testData)
confusionMatrix(data = test_svm, reference = testData$edibility, positive = "edible")
tab1<-table(test_svm, testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore
APPENDIX C – Mushroom Anatomy
References
www.kaggle.com/uciml/mushroom-classification.
anatomy-caps-stems.