
Christine Feng, CS 3600, Spring 2017

Project 4a Analysis

Dummy Dataset 1:

The accuracy rate for this dataset was 100%, and the tree size was 3. According to the tree
printed in the screenshot above, if the element at index 5 of the example is 0, then the label for
that example will be 1, and vice versa; that is, if example[5] == 1, the label will be 0. Examining
the actual datasets confirms this: every training and test example whose element at index 5
is 0 is labeled 1, and vice versa. Since this simple classification rule holds for every example,
the accuracy rate after running the examples through the decision tree algorithm is 100%.
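The learned rule can be sketched as a one-line classifier (a hypothetical reconstruction of the depth-1 tree described above, not the project's actual tree object; the example vectors are made up):

```python
def classify_dummy1(example):
    """Depth-1 decision tree for Dummy Dataset 1: split on the
    attribute at index 5 and return the opposite value."""
    return 1 if example[5] == 0 else 0

# Made-up examples that follow the dataset's rule exactly, so the
# classifier labels every one correctly and accuracy is 100%.
examples = [([1, 0, 1, 1, 0, 0, 1], 1),
            ([0, 1, 0, 0, 1, 1, 0], 0)]
correct = sum(classify_dummy1(x) == label for x, label in examples)
accuracy = correct / len(examples)
```

Since the rule is deterministic and every example obeys it, no test example can be misclassified, which is why the measured accuracy is exactly 100%.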

Dummy Dataset 2:

The accuracy rate for this dataset was 65%, and the tree size was 11. Comparing the trees
for Dummy Datasets 1 and 2, it's clear that the tree for Dummy Dataset 2 is more complicated
than the one for Dummy Dataset 1, with more branches and possibilities. For example, following
the tree from the top, if example[2] == 0 AND example[0] == 0, then the example will be
labeled 0. In this case alone, the decision tree must look at 2 attributes before labeling. In
another case, if example[2] == 1 AND example[5] == 0 AND example[6] == 0, then the example
will be labeled 0. In other words, Dummy Dataset 2 takes more attributes to arrive at a
classification than Dummy Dataset 1 does, which is why it has a larger tree size and a lower
accuracy rate.
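The two paths described above can be sketched as a nested-dictionary tree with a small traversal routine (the attribute indices and the two labeled paths come from the text; leaves the text does not mention are filled with None as placeholders):

```python
# Partial tree for Dummy Dataset 2, reconstructed from the two paths
# described above. Each internal node stores the attribute index it
# tests; branches not covered in the text are None placeholders.
tree = {
    "attr": 2,
    0: {"attr": 0, 0: 0, 1: None},
    1: {"attr": 5, 0: {"attr": 6, 0: 0, 1: None}, 1: None},
}

def classify(node, example):
    """Walk the tree, testing one attribute per level, until a leaf
    (any non-dict value) is reached."""
    while isinstance(node, dict):
        node = node[example[node["attr"]]]
    return node

# example[2] == 0 and example[0] == 0 -> label 0 (two attribute tests)
path_a = classify(tree, [0, 0, 0, 0, 0, 0, 0])
# example[2] == 1, example[5] == 0, example[6] == 0 -> label 0 (three tests)
path_b = classify(tree, [0, 0, 1, 0, 0, 0, 0])
```

The traversal makes the size difference concrete: where Dummy Dataset 1's tree tests a single attribute, this tree needs two or three tests before it can emit a label.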


Cars Dataset:

The accuracy rate for the cars dataset was 94.3%, with a tree size of 408. The tree for this
dataset is larger than that of either dummy dataset; however, this dataset still has a high
accuracy rate because it has only 7 attributes and a large pool of examples to learn from and,
accordingly, to build an accurate decision tree from. Additionally, a few attributes in
the cars dataset have a great influence on the final classification. For example, safety (the
attribute on which the tree initially splits) and the number of persons the car can hold (all cars
that can only hold 2 people are immediately classified as unacceptable, according to the tree)
seem to affect a car's label more than the other attributes do. The existence of such influential
attributes contributes to the accuracy of the algorithm.
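Why an influential attribute like safety ends up as the root split can be illustrated with an information-gain calculation (a generic ID3-style sketch on made-up data, not the project's actual code or the real cars dataset; the attribute roles here are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Expected entropy reduction from splitting on one attribute."""
    n = len(examples)
    remainder = 0.0
    for value in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: attribute 0 (a stand-in for "safety") fully determines
# the label, while attribute 1 is pure noise. ID3 picks the
# attribute with the highest gain as the split, so attribute 0 wins.
examples = [("low", "a"), ("low", "b"), ("high", "a"), ("high", "b")]
labels = ["unacc", "unacc", "acc", "acc"]
gains = [info_gain(examples, labels, a) for a in range(2)]
```

An attribute that separates the classes cleanly drives the remainder term toward zero, which is exactly why a highly influential attribute is chosen early and keeps the tree both smaller and more accurate.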

Connect4 Dataset:

The accuracy rate for this dataset was 75.7%, with a tree size of 41521. The format of the
dataset indicates that there are 42 attributes, one for each combination of 7 columns and 6
rows, each with 3 possible values: O, X, or blank. This is a much larger number of attributes
than in any of the other datasets, so the resulting tree is much larger as well. Thankfully, this
dataset came with a large example set; however, unlike the cars dataset, no single attribute
seems to have a heavier influence on the final labeling than the others. The random and
widely varied nature of this dataset makes it less accurate to classify than the cars dataset.
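The 42-attribute format can be sketched by flattening a 7-column by 6-row grid into a single attribute vector (a hypothetical encoding for illustration; the dataset's actual cell ordering may differ):

```python
# A Connect Four position as 7 columns x 6 rows; "b" marks a blank cell.
board = [["b"] * 6 for _ in range(7)]
board[3][0] = "x"   # X drops a piece in the center column
board[2][0] = "o"   # O responds one column over

# Flatten into 42 attributes, one per (column, row) cell, each taking
# one of three values: "x", "o", or "b" -- matching the 7 * 6 = 42
# attribute count described above.
attributes = [cell for column in board for cell in column]
```

With 42 three-valued attributes and no single cell dominating the outcome, the tree has to branch on many cells before reaching a leaf, which is consistent with the 41521-node tree reported above.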

The cars dataset provided for this project sheds light on several potential applications for
decision trees, particularly in a sales context. For example, a retail website could look at
browsing trends: a user who gravitates toward the women's section of a clothing website,
shops in the business-casual section of the site, and sorts the search results from lowest
price to highest (so we could assume that user is young) could be classified with a
"young female professional" label, and the website could use similar labels for other user
demographic groups to deliver useful product suggestions. In this case, the attributes could be
price range (values: very low ($1-$10 per item), low ($10-$20 per item), medium ($20-$30 per
item), high ($30+ per item)), gender, age group, and style (possible values: business casual,
trendy, formal, casual, androgynous, etc.), all of which would help categorize users.
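Such a classifier would consume the same kind of discrete attribute vectors as the cars dataset. A minimal sketch, with entirely invented attribute names, values, and decision logic standing in for a learned tree:

```python
# Hypothetical shopper profile expressed as discrete attributes,
# mirroring the attribute/value format of the cars dataset.
user = {
    "section": "womens",
    "style": "business-casual",
    "price_sort": "low-to-high",
    "price_range": "low",        # $10-$20 per item
}

def label_user(u):
    """Toy stand-in for a learned decision tree: test a couple of
    attributes and fall back to a default label otherwise."""
    if u["section"] == "womens" and u["style"] == "business-casual":
        return "young female professional"
    return "unclassified"

label = label_user(user)
```

In practice the branching logic would be learned from labeled browsing histories rather than hand-written, exactly as the cars tree was learned from labeled car descriptions.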

The Connect4 dataset could be combined with a CSP (constraint satisfaction problem) setup,
like the one we wrote for a previous project. For example, after a player puts a piece in a certain
square, they are most likely to follow that move by placing a piece in a second
square that is adjacent or diagonal to the first piece. Incorporating this knowledge could help cut
down on the number of branches in the decision tree, decreasing the number of possible
outcomes and increasing the accuracy rate.


Poker Dataset:

The accuracy rate for this dataset was 74.05%, with a tree size of 199174. This large tree size
stems from the fact that classifying a poker hand often depends on the values and suits of
every single card, so there is an element of unpredictability involved. In other words, there isn't
any one attribute that has a greater influence on the ultimate classification of each
example; that is, the first card's rank being an ace doesn't have a larger overall influence on
the final label than any other card having that same rank. However, with almost 16
thousand examples, this dataset did provide the AI a large pool of data to learn from and to
build a decision tree with.
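The interdependence of the attributes can be sketched with a flattened hand encoding (this assumes the common 10-attribute poker-hand layout of one suit and one rank per card; the sample hand is made up):

```python
# One five-card hand as (suit, rank) pairs: suits 1-4, ranks 1-13.
hand = [(1, 1), (2, 1), (3, 9), (4, 5), (1, 13)]

# Flatten into the attribute vector a decision tree would see:
# [s1, r1, s2, r2, ..., s5, r5] -- 10 attributes per example.
attributes = [value for card in hand for value in card]

# A label such as "one pair" depends on a relationship across cards
# (here, two aces), never on any single attribute in isolation.
ranks = [rank for _, rank in hand]
has_pair = any(ranks.count(r) == 2 for r in set(ranks))
```

Because the class is a function of cross-card relationships like this one, no single split can isolate a class early, and the tree is forced to grow very deep and wide before its leaves become pure.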

Although this dataset was used to classify various poker hands based on the
ranks and suits of each of 5 cards, this knowledge could also be applied to other situations. For
example, computer vision AI uses similar procedures to identify facial expressions: instead of
checking whether the first card's rank is Jack or the third card's suit is hearts, a computer
vision AI would look at the mouth for an upward or downward curve (suggesting happiness or
sadness, respectively), at the eyebrows for an inward furrow or a neutral position (suggesting
anger or contentment, respectively), and so on.