
Table of Contents

1. Abstract
2. Problem Description
3. Data Exploration
   a. Correlation
   b. Insights from the Relationship Between the Target Variable and Other Variables
4. Data Preprocessing
5. Data Summarization
6. Results
   a. Decision Tree
   b. Naïve Bayes
   c. Random Forest
   d. SVM
7. Conclusion
8. APPENDIX A – Data Preprocessing
9. APPENDIX B – Model Building
10. APPENDIX C – Mushroom Anatomy
11. References

Abstract

Often considered a “superfood,” mushrooms have become a staple of many cuisines, as well as a healthy addition to many meals. Although a mushroom is technically the fruiting body of a fungus rather than a vegetable, it provides many of the benefits that vegetables provide, including being fat-free, low in calories, and low in sodium. Mushrooms also contain B vitamins, which help break down fats, carbohydrates, and proteins, as well as vitamin D, which supports immune health. The nutritional merits of mushrooms are clear. However, not every mushroom is edible for humans; in fact, some mushrooms are poisonous and, if consumed, can lead to death.

Our mission in this project is to build a model that can predict whether a mushroom is edible or poisonous. We built several different models, including a decision tree, a random forest, an SVM, and a Naïve Bayes classifier.

Our study would be extremely beneficial to people who find themselves in survival situations, as well as to those who tend to obtain their food from the land around them rather than from a grocery store. People thrust into survival situations might turn to mushrooms, if available, because of their nutritional value, but it would be advantageous for them to know which mushrooms they can eat and which they should avoid.

Problem Description

The problem many people face with mushrooms is that there are over 14,000 classified mushroom species throughout the world. As a result, it can be extremely difficult to determine whether a given mushroom is edible or poisonous, and consuming a poisonous mushroom can lead to kidney failure, liver damage, and even death. Through the use of predictive analytics, we aim to build a model that can correctly classify mushrooms as either edible or poisonous.

Data Exploration
Correlation

The relationships between the variables and the target variable were plotted on a heat map, shown in Figure 1. Among the 22 attributes, odor showed the highest exclusiveness, meaning its values separate the edible and poisonous classes more cleanly than those of any other attribute.

Figure 1
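
The report does not state which association measure the heat map uses, so the sketch below, which assumes the cleaned data frame xm from APPENDIX A, uses Cramér's V, a chi-squared-based statistic for pairs of categorical variables, as one plausible way to reproduce a plot like Figure 1.

library(ggplot2)

#Cramér's V for a pair of categorical variables
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt((chi2 / sum(tab)) / (min(nrow(tab), ncol(tab)) - 1)))
}

#pairwise association matrix over all columns
vars <- names(xm)
assoc <- outer(vars, vars, Vectorize(function(a, b) cramers_v(xm[[a]], xm[[b]])))
dimnames(assoc) <- list(vars, vars)

#reshape to long form and draw the heat map
long <- as.data.frame(as.table(assoc))
ggplot(long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  labs(fill = "Cramér's V") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))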

Insights from the Relationship Between the Target Variable and Other Variables

The plot helps us explore how much a single attribute contributes to deciding the edibility of a mushroom. The higher an attribute's exclusiveness, the more it contributes to the decision, while several attributes separate the classes so poorly that we can eliminate them directly. A few relationships can be read straight off the plot by checking whether the corresponding mushrooms are edible or poisonous.

Based on Figure 2, if a mushroom's odor is fishy, spicy, or foul, the mushroom is likely poisonous. Conversely, if the odor is anise or almond, the mushroom is likely edible. This attribute carries the most weight in deciding the class of a mushroom.

Figure 2
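
A plot in the spirit of Figure 2 can be sketched as below, again assuming the cleaned data frame xm from APPENDIX A: it shows the proportion of edible and poisonous mushrooms within each odor level.

library(ggplot2)

#stacked bars of class proportions within each odor level
ggplot(xm, aes(x = odor, fill = edibility)) +
  geom_bar(position = "fill") +
  labs(y = "proportion", title = "Edibility by odor")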

Data Preprocessing

The first step in our data preprocessing was to limit the number of rows we would observe. The entire dataset contains slightly over 8,000 data points; we took a random sample of 4,000 observations to make our predictive models run more smoothly. We then used summary statistics to explore the new, smaller dataset and to check for missing values. Thankfully, our dataset contained no missing values, so we were not forced to impute any, which helps the accuracy of our models.

Our data initially used individual letters to encode each mushroom characteristic. We converted these letters to factors to help with visually plotting our data. Additionally, in the “Class” column of our dataset, mushrooms were initially classified with a “p” for poisonous or an “e” for edible; to make this classification clearer, we simply replaced the letters with the words “poisonous” and “edible,” respectively. Lastly, the column “veil-type” had two possible values, “p” for partial or “u” for universal, but in our dataset every value in this column was “p.” As a result, we removed the column due to its lack of contribution to the model. The full preprocessing code appears in APPENDIX A. After these steps, we were ready to begin building our models.

Data Summarization

In the entire dataset, we have 8,124 observations and 23 variables: 22 categorical predictors and 1 target variable, which indicates whether the mushroom is edible or poisonous.

Figure 3
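
The figures above can be verified with a quick check like the one below, assuming the full, unsampled dataset held in xm with the column names applied in APPENDIX A.

dim(xm)             #8124 observations, 23 variables
table(xm$edibility) #counts of edible vs. poisonous mushrooms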

Our variables consist mainly of physical descriptions of parts of the mushroom, such as the cap shape, cap color, gill size, stalk size, and various other characteristics; the anatomy they describe is pictured in APPENDIX C – Mushroom Anatomy for reference. We believe that by utilizing the different characteristics and attributes of the mushrooms in the dataset, we will be able to accurately predict whether a selected mushroom is edible or poisonous.

As previously mentioned, there are no missing values in the dataset. This not only helped us use our time efficiently, but also helped our models run faster than they would have if we had been forced to impute missing data using mean values, random variable selection, or by building null-value handling into the models.

Results

As stated earlier, we built models using a decision tree, a random forest, an SVM, and Naïve Bayes. Because our dataset is entirely categorical, we found that logistic regression was simply not suitable for us, as it relies on numeric predictor values to be successful.

Decision Tree

We first built a decision tree model in hopes of gaining a stronger understanding of our data and the relationships among its many variables. We used a 70/30 split to build the model, with 70% of our data used for training and 30% for testing, which worked out to 2,823 and 1,177 observations, respectively. The dataset's many categorical variables suggested that a decision tree might be ideal. The model produced two misclassifications in which a poisonous mushroom was labeled edible, and it achieved an accuracy of 0.9983 with a recall, precision, and F-score of 1, 0.9968153, and 0.9984051, respectively. Furthermore, the decision tree returned a 95% confidence interval for accuracy of (0.9969, 0.9998), further cementing the strength of the model.
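
As a consistency check, the F-score follows directly from the reported precision and recall: F-score = 2 × precision × recall / (precision + recall) = 2 × 0.9968153 × 1 / (0.9968153 + 1) ≈ 0.9984051, matching the value above.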

Based on Figure 4 below, odor is the most important variable and consequently forms the first split. If the odor is creosote (c), foul (f), musty (m), pungent (p), spicy (s), or fishy (y), the mushroom is classified as poisonous. Odors of almond (a), anise (l), or none (n) move on to be split by spore_print_color, the second most important variable. From there, a green (r) spore print color classifies the mushroom as poisonous, and anything else moves on for further evaluation. Splitting continues until the remaining variables are no longer informative.

Figure 4

Naïve Bayes

Following our decision tree model, we built a model using the Naïve Bayes classifier. This model proved less accurate than the decision tree: its confusion matrix contained 75 misclassifications between edible and poisonous mushrooms, resulting in an accuracy of 0.9363, with a recall, precision, and F-score of 0.9984051, 0.8952654, and 0.9433107, respectively.

Random Forest

The third model we built was a random forest. Random forests are among the strongest classification models, so we felt we absolutely needed to include one in our consideration. The confusion matrix for the random forest returned no misclassifications, along with an accuracy of 1.0 and a recall, precision, and F-score all of 1.0.

We were intent on finding the variable most important for predicting edibility in our model. Using the Mean Decrease Gini values from varImpPlot, we observed that odor has the highest predictive power.

Figure 5
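
The plot in Figure 5 comes from the randomForest package's importance tools; a minimal sketch, assuming the fitted model_rf from APPENDIX B, is:

library(randomForest)

varImpPlot(model_rf, main = "Variable importance (Mean Decrease Gini)")
importance(model_rf)  #the underlying MeanDecreaseGini values per variable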

SVM

The last model we built was an SVM. SVM models are very useful for classification, and we felt one might further help us predict the edibility of a mushroom. Like the random forest, the confusion matrix for our SVM returned no misclassifications, with an accuracy, recall, precision, and F-score all of 1.0.

Conclusion

Building a strong model and obtaining results is certainly an important aspect of modeling, but it is equally important to select the best model and interpret its results. It is worth noting that while all models are technically wrong, some can be more useful than others in reporting your findings.

In classification problems like the one we have taken on, a random forest essentially gives you the probability that an observation belongs to each of the classes in your dataset, for us either poisonous or edible. An SVM, on the other hand, uses a hyperplane that separates the classes and assigns each data point to one side of it, once again, in our case, either poisonous or edible.

Ultimately, both models gave us the same results, but the random forest would be the better model over the SVM. One main reason is scale: as the dataset grows, the random forest remains faster, whereas the SVM has a higher computational complexity and starts to become unusable with more than roughly 20,000 rows. The table below compares all the models on accuracy, precision, recall, F-score, and AUC.

Model       Decision Tree   Naïve Bayes   SVM   Random Forest
Accuracy    0.9983          0.9363        1     1
Precision   0.9968153       0.8952654     1     1
Recall      1               0.9984051     1     1
F-score     0.9984051       0.9433107     1     1
AUC         0.9993          0.9594        1     1
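
The report does not state how the AUC values were computed; one plausible sketch, assuming the fitted model_rf and testData from APPENDIX B and using the pROC package as an illustrative choice, is:

library(pROC)

#class probabilities for the test set, then the ROC curve and its AUC
prob_rf <- predict(model_rf, testData[-1], type = "prob")[, "poisonous"]
roc_rf <- roc(response = testData$edibility, predictor = prob_rf)
auc(roc_rf)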

APPENDIX A – Data Preprocessing

#packages used below (dplyr for sample_n, purrr for map_df)
library(dplyr)
library(purrr)

#xm is assumed to already hold the raw mushroom data,
#e.g. xm <- read.csv("mushrooms.csv")  (path hypothetical)

#random sample of mushroom data down to 4000 rows
xm <- xm %>% sample_n(4000, replace = FALSE)

#overview of the mushroom data with summary statistics and NA counts
summary(xm)

#find the location of any NAs
which(is.na(xm))

#names of columns
names(xm)

#rename the columns to descriptive names
colnames(xm) <- c("edibility", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
"gill_attachement", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root",
"stalk_surface_above_ring", "stalk_surface_below_ring", "stalk_color_above_ring",
"stalk_color_below_ring", "veil_type", "veil_color", "ring_number", "ring_type",
"spore_print_color", "population", "habitat")

#convert every column to a factor
xm <- xm %>% map_df(function(.x) as.factor(.x))

#replace the "e"/"p" class codes with "edible"/"poisonous"
levels(xm$edibility) <- c("edible", "poisonous")

#drop veil_type, which takes only one value in this dataset
xm <- subset(xm, select = -veil_type)
which(is.na(xm))

#70/30 train/test split
set.seed(0822211)
samp <- sample(2, nrow(xm), replace = TRUE, prob = c(0.7, 0.3))
trainData <- xm[samp == 1, ]
testData <- xm[samp == 2, ]

APPENDIX B – Model Building

#packages used below
library(party)          #ctree
library(e1071)          #naiveBayes, svm
library(randomForest)   #randomForest
library(caret)          #confusionMatrix, recall, precision

#Decision Tree
tree <- ctree(edibility ~ ., trainData)
print(tree)
plot(tree, type = 'simple')
#training and test predictions with confusion tables
p1 <- predict(tree, trainData)
tab1 <- table(Predicted = p1, Actual = trainData$edibility)
tab1
p2 <- predict(tree, testData)
tab2 <- table(Predicted = p2, Actual = testData$edibility)
tab2
recall(tab2)
precision(tab2)
Fscore <- 2 * precision(tab2) * recall(tab2) / (precision(tab2) + recall(tab2))
Fscore
confusionMatrix(p1, trainData[["edibility"]])
confusionMatrix(p2, testData[["edibility"]])

#Naive Bayes
model <- naiveBayes(edibility ~ ., trainData)
pred <- predict(model, testData)
confusionMatrix(data = pred, reference = testData$edibility, positive = "edible")
tab1 <- table(Predicted = pred, Actual = testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore

#Random Forest
model_rf <- randomForest(edibility ~ ., ntree = 500, data = trainData)
predict_rf <- predict(model_rf, testData[-1])
confusionMatrix(data = predict_rf, reference = testData$edibility, positive = "edible")
tab1 <- table(Predicted = predict_rf, Actual = testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore

#SVM
model_svm <- svm(edibility ~ ., data = trainData, cost = 1000, gamma = 0.01)
test_svm <- predict(model_svm, newdata = testData)
confusionMatrix(data = test_svm, reference = testData$edibility, positive = "edible")
tab1 <- table(Predicted = test_svm, Actual = testData$edibility)
recall(tab1)
precision(tab1)
Fscore <- 2 * precision(tab1) * recall(tab1) / (precision(tab1) + recall(tab1))
Fscore

APPENDIX C – Mushroom Anatomy

References

1. UCI Machine Learning. “Mushroom Classification.” Kaggle, 1 Dec. 2016, www.kaggle.com/uciml/mushroom-classification.

2. Robinson, Nick. R&R Cultivation, 31 July 2020, rrcultivation.com/blogs/mn/mushroom-anatomy-caps-stems.
