
A BRIEF INTRODUCTION TO DECISION TREES

Decision trees are used in data mining to discover patterns of information in data. Once built, the decision tree can be used to predict outputs for new data using patterns observed in the data used to build the tree. In that way a decision tree can be thought of as a data structure for storing experience. For example, the first time you play a new game you have no idea what the best strategy is, so you make moves based on your experience from other games. As you gain more experience with that game, you learn how it differs from other games you've played, what works and what does not. The more you play the game, especially if playing with a variety of other players, the more nuanced your mental game-playing decision tree will become. You may also learn that some aspects of game play require more mental attention than others.
Sample problem
Imagine you are given a coin and are asked to predict whether the front or back of the coin will show when you drop it. You do that a number of times and determine that it shows heads about half the time and tails the other half. On average, your prediction will only be right half the time. Now imagine you are given two 6-sided dice and asked to predict the sum that will come up the most across 100 tosses of the dice. Do the potential sums [2..12] all have the same chance of occurring? Could you reliably guess the number? Let's check the results for each possibility:

If the first die is a 1, the possible sums are:

2 3 4 5 6 7

If the first die is a 2, the possible sums are:

3 4 5 6 7 8

If the first die is a 6, the possible sums are:

7 8 9 10 11 12

Individually these sets of inputs and outputs do not give us much information, but a pattern emerges when
they are combined.
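To see the pattern, here is a minimal sketch, separate from the book's engine, that tallies every combination of two dice using Python's Counter:

from collections import Counter

# Tally how often each sum occurs across the 36 equally likely rolls.
sumCounts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
for total, count in sorted(sumCounts.items()):
    print('{:>2} occurs {} time(s) in 36'.format(total, count))
# 7 occurs 6 times, more than any other sum, so it is the best guess.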
Decision trees are very good at discovering patterns that lead to specific results in raw data. This strength
is also their weakness because data that is skewed by many samples of a particular type can result in an
unbalanced tree that favors the majority and fragments the minority.

Python
This book uses the Python programming language to explore decision trees. Why Python? Because Python is a low-ceremony, powerful and easy-to-read language whose code can be understood by entry-level programmers. I explain the occasional Python feature, but should you encounter a programming construct you've never seen before and can't intuit, Python.org and StackOverflow.com are great places to find explanations. If you have experience with another programming language then you should have no difficulty learning Python by induction while also exploring decision trees.
example Python syntax

# this is a comment
import math  # imports make code from other modules available

# code blocks are initiated by a trailing colon followed by indented lines


class Circle:  # define a class
    def __init__(self, radius):  # constructor with parameter radius
        self.radius = radius  # store the parameter in a class variable

    def get_area(self):  # function that belongs to the class
        return math.pi \
               * self.radius \
               * self.radius  # trailing \ continues the expression
                              # on the next line


# code that is not in a class is executed immediately
for i in range(1, 10):
    if (i & 1) == 0:
        continue
    circle = Circle(i)  # create an instance
    print("A circle with radius {0} has area {1:0.2f}".format(
        i, circle.get_area()  # `print` writes output to the console
    ))

You can run the code above in your browser at: https://repl.it/EWUh

Like blacksmiths, programmers create their own tools. We frequently prototype a solution by using tools
we already have available, not unlike using a pair of pliers to pull a nail. Once we get a good
understanding of the problem, however, we usually restart with a better combination of tools or build a
problem-specific one.

In this book we will co-evolve a decision tree engine while examining increasingly difficult projects with the engine. By co-evolving the engine you'll know exactly how it works, so you'll be able to use its features effectively to curate data in your own projects. The engine will be a by-product of applying decision trees to the different projects in this book. If you were to co-evolve the engine with a different set of projects, or even the projects in this book in a different order, you would end up with a different engine. But, by co-evolving the engine you will gain experience with some of the features available in commonly used packages and see how they can affect the performance of your code.
About the author
I am a polyglot programmer with more than 15 years of professional programming experience. Occasionally I step out of my comfort zone and learn a new language to see what that development experience is like and to keep my skills sharp. This book grew out of my experiences while learning Python, but it isn't about Python.

When learning a new programming language, I start with a familiar project and try to learn enough of the
new language to solve it. For me, writing machine learning tools, like a decision tree or genetic algorithm
engine, is my familiar project. This allows me to discover the expressiveness of the language, the power
of its tool chain, and the size of its development community as I work through the idiosyncrasies of the
language.
About the text
The Python 3.5 code snippets in this book were programmatically extracted from working code files using the tags feature of AsciiDoctor's include directive.
BRANCHING
20 questions
Let's begin by learning a little bit about decision trees from a game we played as kids. Reach way back in your memories to a game called 20 Questions. It is a simple game for two people where one picks something they can see around them and the other has to identify that thing in 20 questions or fewer. When first learning to play the game kids tend to use very specific questions.

Is it a dog? No
Is it a picture? No
Is it a chair? No
Is it a table? Yes

That works reasonably well for the first few games, but then each player starts trying to pick more obscure items in the room in order to run the other person out of questions. Eventually we learn to start with more generic questions. (Think of a chair.)

Is it alive? No
Does it use electricity? No
Is there only 1 in the room? Yes
Is it more than 5 years old? No
Is it made of wood? Yes
...

At this point the person who is better at categorizing things generally wins.

However, one thing we automatically do when playing the game is make use of domain knowledge. For
example, after this sequence:

Is it alive? Yes
Is it an animal? Yes

Would we ask "Does it have leaves?" No. The reason is, of course, because we know that animals do not have leaves. We use our memory of what we've asked so far, and our knowledge of the implied domain, animals, to improve our questions.

A decision tree doesn't have domain knowledge. All it has to work with is the data we provide. If we give it noisy, skewed or insufficient data to find the best patterns, then the results we get will also be flawed.
Decision trees use a variety of techniques to compensate for noisy data and they can find patterns a
human would not see.
First decision tree
We'll use the data in the following table to try to predict where a person was born.

Name     Gender  Born
William  male    Germany
Louise   female  Texas
Minnie   female  Texas
Emma     female  Texas
Henry    male    Germany

We can use a greedy algorithm to turn that data into a decision tree. Start by finding all the unique
attribute-value pairs, for example Gender=male, and the number of times each occurs in the data.

An attribute is a data feature, often a column such as Name or Gender.

attribute-value  count
Name=William     1
Name=Louise      1
Name=Minnie      1
Name=Emma        1
Name=Henry       1
Gender=male      2
Gender=female    3
Next designate the root of the tree, the first decision point, as the attribute-value pair that has the highest count. If there is a tie, pick one of them at random. There's no tie in our data, so the root will be Gender=female. Now split all the data rows into two subsets based on that attribute-value pair.

Gender = female

Name    Gender  Born
Louise  female  Texas
Minnie  female  Texas
Emma    female  Texas

Gender != female

Name     Gender  Born
William  male    Germany
Henry    male    Germany

If all rows in a subset have the same result value, birthplace in this example, then we're done with that subset. Otherwise repeat the process with the attribute-value pairs in that subset, excluding any attribute-value pairs that match the entire set. This process is known as recursive partitioning.

In this simple problem we end up with a single rule:

Gender = female

Here's the resultant decision tree:


First Program
It is time to write some Python. By the way, if you do not already have a favorite Python development
environment, I highly recommend JetBrains' PyCharm IDE.

Data
Let's start by converting the data we used above to a list that contains lists of data elements:
dtree.py

data = [['Name', 'Gender', 'Born'],
        ['William', 'male', 'Germany'],
        ['Louise', 'female', 'Texas'],
        ['Minnie', 'female', 'Texas'],
        ['Emma', 'female', 'Texas'],
        ['Henry', 'male', 'Germany'],
        ]

outcomeLabel = 'Born'

Build the tree


Next we need to build the heart of the decision tree engine. It starts by extracting the indexes of the
columns that could be used to build the tree, and the index of the column that holds the outcome value.

attrIndexes = [index for index, label in enumerate(data[0]) if
               label != outcomeLabel]
outcomeIndex = data[0].index(outcomeLabel)

The tree will be stored as a linked list in array form. Each node in the array will contain its list index, the decision data (attribute column index and value), and child node indexes, if any. For example:

0, 1, female, 2, 3

This node's list index is 0. It checks for female in the Gender field, index 1 in the data rows. The rows that match are next split using the node at index 2. Those that do not match are next split using node 3. Since each branch has only two possible outcomes this is a binary decision tree. This implies a list of nodes and a way to track which node we're working on.

nodes = []
lastNodeNumber = 0
Next we need a work queue with an initial work item containing three things:

the parent node index, -1 since this is the root,
this node's index, and
the indexes of all the data rows this node will split.

workQueue = [(-1, lastNodeNumber, set(i for i in range(1, len(data))))]


while len(workQueue) > 0:

LIST COMPREHENSIONS

i for i in range(1, len(data)) is an example of a comprehension, a powerful Python feature that enables us to build a collection by saying what we want instead of how to get it. Wrapped in set(), it is equivalent to:

temp = set()
for i in range(1, len(data)):
    temp.add(i)

The Python compiler may be able to write faster code for list comprehensions in some cases.

The first action inside the loop is to get the values from the next work item in the queue.

parentNodeId, nodeId, dataRowIndexes = workQueue.pop()

When there are multiple variables on the left side of an assignment, Python unpacks the item being assigned into that many parts. If one of the targets is prefixed with a star, any residual values are collected into that starred variable.
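For example, a small illustration that is not part of the engine:

parentNodeId, nodeId, dataRowIndexes = (-1, 0, {1, 2, 3})  # counts must match
first, *rest = [10, 20, 30, 40]  # first is 10, rest collects [20, 30, 40]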

We then check to see if all the data rows for that work item have the same outcome value. When a subset
of data rows all have the same outcome value that subset is called pure. Pure subsets cannot be split any
further. When a pure subset is found we add a leaf node to the tree and proceed to the next work item.

uniqueOutcomes = set(data[i][outcomeIndex] for i in dataRowIndexes)
if len(uniqueOutcomes) == 1:
    nodes.append((nodeId, uniqueOutcomes.pop()))
    continue

Otherwise the subset must be split based on some attribute. To select that attribute we first gather counts for all attribute-value pairs present in the data rows associated with this work item. What happens next depends on the split algorithm being used. Some split algorithms allow N subsets as long as each subset is larger than a particular threshold value. For now we're going to use a greedy algorithm, which means we'll pick the combination that appears the most in the data.

attrValueResults = []
for attrIndex in attrIndexes:
    for rowIndex in dataRowIndexes:
        row = data[rowIndex]
        value = row[attrIndex]
        attrValueResults.append((attrIndex, value))
potentials = [i for i in Counter(attrValueResults).most_common(1)]
attrIndex, attrValue = potentials[0][0]

The Counter class counts how many times each item in the list occurs. most_common(1) returns
the most frequently observed item and its count. The Counter class comes from a library called
collections. To use it we have to add an import statement at the top of the file, as follows:

from collections import Counter
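As a small aside, here is how Counter and most_common behave on a tiny list; this is an illustration only, not engine code:

from collections import Counter

pairs = [(1, 'female'), (1, 'female'), (1, 'male')]
print(Counter(pairs).most_common(1))  # [((1, 'female'), 2)]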

Next, we split all the rows in the work item into two sets depending on whether they match the attribute-value pair or not.

matches = {rowIndex for rowIndex in dataRowIndexes if
           data[rowIndex][attrIndex] == attrValue}
nonMatches = dataRowIndexes - matches

{} around a collection or list comprehension is a quick way of creating a set object from the items in that collection. Above we
use set-logic to get the non-matches.
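For example, a small illustration of set comprehensions and set difference, separate from the engine code:

evens = {i for i in range(10) if (i & 1) == 0}  # {0, 2, 4, 6, 8}
allDigits = set(range(10))
odds = allDigits - evens  # set difference: {1, 3, 5, 7, 9}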

Work items for each group are then added to the work queue.
lastNodeNumber += 1
matchId = lastNodeNumber
workQueue.append((nodeId, matchId, matches))
lastNodeNumber += 1
nonMatchId = lastNodeNumber
workQueue.append((nodeId, nonMatchId, nonMatches))

And finally, a branch (or decision) node is added to the node list.

nodes.append((nodeId, attrIndex, attrValue, matchId, nonMatchId))

Display the tree


Now we need some way to see the result. Let's simply sort the nodes by their indexes and display the sorted list.

nodes = sorted(nodes, key=lambda n: n[0])
print(nodes)

If you run the code you get output like the following:

[(0, 1, 'female', 1, 2), (1, 'Texas'), (2, 'Germany')]

That isn't so easy to understand, is it? Let's write each node on a separate line and show the attribute labels instead of raw indexes:

def is_leaf(node):
    return len(node) == 2


for node in nodes:
    if is_leaf(node):
        print('{}: {}'.format(node[0], node[1]))
    else:
        nodeId, attrIndex, attrValue, nodeIdIfMatch, nodeIdIfNonMatch = node
        print('{}: {}={}, Yes->{}, No->{}'.format(
            nodeId, data[0][attrIndex], attrValue, nodeIdIfMatch,
            nodeIdIfNonMatch))

Now run again to get expanded output as follows:

0: Gender=female, Yes->1, No->2
1: Texas
2: Germany

That's better.

The following visual representation resembles the one we built by hand earlier, doesn't it?

Prediction
Now that we have a working decision tree engine, let's see how we can use it to predict the birthplace of a person it has not seen before. Start with the test data:

testData = ["Alice", "female"]

To create the prediction we start with the root node of the decision tree and apply its attribute-value check to the test data. Then we follow the matching or non-matching branch depending on the result of the attribute-value check. When we reach a leaf node we've found the prediction value to use.

currentNode = nodes[0]
while True:
    if is_leaf(currentNode):
        print("predict: {}".format(currentNode[1]))
        break
    nodeId, attrIndex, attrValue, nodeIdIfMatch, \
        nodeIdIfNonMatch = currentNode
    currentNode = nodes[nodeIdIfMatch if
                        testData[attrIndex] == attrValue
                        else nodeIdIfNonMatch]

0: Gender=female, Yes->1, No->2
1: Texas
2: Germany
predict: Texas

Because the decision tree makes its decision based on the Gender column, and the input has female in that column, we expect the decision tree to predict Texas, and it does. And if you change the gender to male it will predict Germany.

Congratulations, you've just programmatically built and used a decision tree.


Separate the specific use case data from the engine
We have a working engine but it is currently somewhat intertwined with the data we're using, so the next task is to extract the test data into a separate file. Start by creating a new file named test.py.

Next move the data, outcomeLabel and testData assignment statements to the new file.

test.py

data = [['Name', 'Gender', 'Born'],
        ['William', 'male', 'Germany'],
        ['Louise', 'female', 'Texas'],
        ['Minnie', 'female', 'Texas'],
        ['Emma', 'female', 'Texas'],
        ['Henry', 'male', 'Germany'],
        ]

outcomeLabel = 'Born'

testData = ["Alice", "female"]

The test file can now be updated to use code in the dtree file via an import statement at the top of the
file:
test.py

import dtree
Make the engine reusable
In dtree.py we'll add a function to encapsulate the code that builds the node list from the input data. We can do that by wrapping all the code in that file down to, but not including, print(nodes) in a function as follows:
dtree.py

def build(data, outcomeLabel):
    attrIndexes = [index for index, label in enumerate(data[0]) if
                   label != outcomeLabel]
    outcomeIndex = data[0].index(outcomeLabel)
    ...
    nodes = sorted(nodes, key=lambda n: n[0])
    return DTree(nodes, data[0])

Note that at the end the print(nodes) statement has been replaced with a statement that returns an
object of type DTree.

The rest of the code in the file is related to predicting a result. We don't want the caller to have to know the structure of the node list, so we'll encapsulate it in a new class named DTree. Start with the class constructor:

class DTree:
    def __init__(self, nodes, attrNames):
        self._nodes = nodes
        self._attrNames = attrNames
    ...

Next we'll convert the is_leaf function to a private static function in the class (note the increased indentation, which makes the function a child element of DTree).

    @staticmethod
    def _is_leaf(node):
        return len(node) == 2

Also, prefixing the function name with an underscore is how private functions are indicated in Python. By convention, private functions and data should only be accessed by other functions in the same class (or module, depending on the scope), so _is_leaf should only be called by other functions in the DTree class.
Now wrap the code that we used to display the tree in a function named __str__. That is a special
function name that the Python runtime looks for when you ask an object to make a displayable version of
itself.

Instead of printing the node and link details immediately, we will add them to a string that will be returned to the caller at the end of the function. We also need to add manual newlines (\n) at the end of each string we plan to display if we don't want them to run together.

    def __str__(self):
        s = ''
        for node in self._nodes:
            if self._is_leaf(node):
                s += '{}: {}\n'.format(node[0], node[1])
            else:
                nodeId, attrIndex, attrValue, nodeIdIfMatch, \
                    nodeIdIfNonMatch = node
                s += '{}: {}={}, Yes->{}, No->{}\n'.format(
                    nodeId, self._attrNames[attrIndex], attrValue,
                    nodeIdIfMatch, nodeIdIfNonMatch)
        return s

Notice how uses of data and functions that belong to the DTree class are prefixed with self.

The final function in the DTree class wraps the prediction logic. This function too has been changed to
return the result instead of printing it.

    def get_prediction(self, data):
        currentNode = self._nodes[0]
        while True:
            if self._is_leaf(currentNode):
                return currentNode[1]
            nodeId, attrIndex, attrValue, nodeIdIfMatch, \
                nodeIdIfNonMatch = currentNode
            currentNode = self._nodes[nodeIdIfMatch if
                                      data[attrIndex] == attrValue
                                      else nodeIdIfNonMatch]
Use the engine
Back in test.py, we'll use dtree.build() to create a prediction object for the data, then call print to see the tree structure.

data = [['Name', 'Gender', 'Born'],
        ['William', 'male', 'Germany'],
        ['Louise', 'female', 'Texas'],
        ['Minnie', 'female', 'Texas'],
        ['Emma', 'female', 'Texas'],
        ['Henry', 'male', 'Germany'],
        ]

outcomeLabel = 'Born'

tree = dtree.build(data, outcomeLabel)
print(tree)
...

Next, use the tree to predict a person's birthplace, then display that prediction.

testData = ["Alice", "female"]


predicted = tree.get_prediction(testData)
print("predicted: {}".format(predicted))

Finally, run the code to make sure everything works. You should get output like the following:

0: Gender=female, Yes->1, No->2
1: Texas
2: Germany

predicted: Texas

Great!
Summary
In this chapter we created an engine that uses a greedy algorithm to build a decision tree from data that
can be split on a single attribute. When built with data having Name, Gender, and Born attributes, the
decision tree is able to predict the birthplace of a person it has not seen before. However, the engine
cannot yet handle data that needs multiple attribute comparisons. As you work your way through this book
you will learn more about decision trees by evolving the engine to handle complex data and to make good
predictions even when the data has noise.
Final Code
The code for each chapter in this book is available from:

https://github.com/handcraftsman/TreeBasedMachineLearningAlgorithms
MULTIPLE BRANCHES

In the last chapter we had the following simple set of data which could be split into two pure subsets
using only the Gender attribute.

Name     Gender  Born
William  male    Germany
Louise   female  Texas
Minnie   female  Texas
Emma     female  Texas
Henry    male    Germany

Gender can no longer produce pure subsets when we add the following row to the data because not all
females will have Texas in the Born column.

Name   Gender  Born
Alice  female  Germany

To verify, let's add the new row to the data array:

test.py

data = [['Name', 'Gender', 'Born'],
        ...
        ['Alice', 'female', 'Germany'],
        ]

If you run the code now it will error out with IndexError: list index out of range
because after the first split all the rows in the next work item have the same gender. As a result, gender is
the most common attribute again so it enqueues another work item with all of the data rows, and so on.
The fix is to ignore attribute-value pairs that match all the rows in the work item.
dtree.py
def build(data, outcomeLabel):
    ...
    attrValueResults = []
    for attrIndex in attrIndexes:
        for rowIndex in dataRowIndexes:
            row = data[rowIndex]
            value = row[attrIndex]
            attrValueResults.append((attrIndex, value))
    potentials = [i for i in Counter(attrValueResults).most_common()
                  if i[1] < len(dataRowIndexes)]
    attrIndex, attrValue = potentials[0][0]
    ...

Now run the test; it will produce a tree and correctly predict that Alice was born in Germany.

...
predicted: Germany

Now let's examine the decision tree that was built:

0: Gender=female, Yes->1, No->2
1: Name=Alice, Yes->3, No->4
2: Germany
3: Germany
4: Texas
...

The first problem here is that your result is probably different. Why? Because Counter only cares about ordering by the number of times an item appears. It doesn't specify what should happen when multiple items have the same count, so the order of items with the same count is non-deterministic. That is easily fixed, however. We can simply add a line to sort the potentials list by the count (array index 1), descending, and then by the attribute index and value (array index 0), ascending.
dtree.py

...
potentials = [i for i in Counter(attrValueResults).most_common() if
              i[1] < len(dataRowIndexes)]
potentials = sorted(potentials, key=lambda p: (-p[1], p[0]))
...
The - in front of a term causes that term to be sorted descending.
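Here is a small illustration of that sort key, using made-up counts rather than the engine's data:

pairs = [((1, 'male'), 2), ((0, 'Emma'), 2), ((1, 'female'), 3)]
print(sorted(pairs, key=lambda p: (-p[1], p[0])))
# [((1, 'female'), 3), ((0, 'Emma'), 2), ((1, 'male'), 2)]
# Highest count first; ties are broken by the attribute index and value.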

Now the tree output will always be deterministic, and yours should match that above.

Here's what it looks like when rendered with GraphViz:

Since the engine has already used gender to split the data it must use the Name field to split the female rows in the second branch. Alice is the best choice because she's the only one not born in Texas, but that's not why she was picked. She was picked because when all the Name based attribute-value pairs were sorted, Alice ended up first in the list.

If we change her name to Sophie


test.py

data = [['Name', 'Gender', 'Born'],
        ...
        ['Sophie', 'female', 'Germany'],
        ]

the resultant tree will look much different:


But we don't want that mess! We want the decision tree to be as shallow as possible so that we can make a prediction within as few branches as possible. To make that happen we need a way to help it determine that Sophie is a better choice than Louise. How do we do that? Well, what if we consider the number of unique outcomes in each subset? The ideal situation is when the attribute-value pair splits the outcomes so that each subset has only 1 type of outcome (aka pure subsets). The next best is when one of the subsets is pure and the other is not, and the worst is when neither subset is pure, in which case we would ideally like the one that splits the results into subsets that are as pure as possible.

attribute-value  # pure rows  % pure
Name=Emma        1            16.6
Name=Louise      1            16.6
Name=Minnie      1            16.6
Name=Sophie      4            100

Aha! Checking the number of output values seen in the resultant subsets looks promising. Let's try it.
Test
First we'll add a helper function that returns the purity percentage of a group of rows.
dtree.py

def _get_purity(avPair, dataRowIndexes, data, outcomeIndex):
    attrIndex, attrValue = avPair
    matchIndexes = {i for i in dataRowIndexes if
                    data[i][attrIndex] == attrValue}
    nonMatchIndexes = dataRowIndexes - matchIndexes
    matchOutcomes = {data[i][outcomeIndex] for i in matchIndexes}
    nonMatchOutcomes = {data[i][outcomeIndex] for i in nonMatchIndexes}
    numPureRows = (len(matchIndexes) if len(matchOutcomes) == 1 else 0) \
        + (len(nonMatchIndexes) if len(nonMatchOutcomes) == 1 else 0)
    percentPure = numPureRows / len(dataRowIndexes)
    return percentPure

It starts by splitting the input into matching and non-matching subsets. It then gets the unique set of outcomes for each subset and uses those to determine how many data rows end up pure. It divides that count by the total number of rows to get a percentage. When both subsets are pure, it returns 1.
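As a rough check, assuming the six-row data from this chapter (with Sophie) and column indexes 0, 1 and 2 for Name, Gender and Born, a call might look like this:

rowIndexes = set(range(1, len(data)))
print(_get_purity((1, 'female'), rowIndexes, data, 2))
# Only the two non-matching (male) rows form a pure subset, so 2 of the 6
# rows count as pure and the result is about 0.33.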

Now change the line where we sort the potentials to call the function above and prefer those
attribute-values that have a higher purity value.

...
potentials = sorted(potentials,
                    key=lambda p: (-p[1],
                                   -_get_purity(p[0],
                                                dataRowIndexes,
                                                data, outcomeIndex),
                                   p[0]))
...

We still need the tree to be deterministic, so the fallback is still to sort the attribute index-value pairs alphabetically, ascending - that's the final parameter, p[0].

Now if we run the code the tree splits on Gender first then on whether the Name is Sophie, verifying that
the purity code works.

0: Gender=female, Yes->1, No->2
1: Name=Sophie, Yes->3, No->4
2: Germany
3: Germany
4: Texas

And to further verify we can check that it still works when Sophie is replaced with Alice in the training
data.

0: Gender=female, Yes->1, No->2
1: Name=Alice, Yes->3, No->4
2: Germany
3: Germany
4: Texas

Excellent!
Entropy
Let's add another attribute, Marital Status, so we can learn about entropy.
test.py

data = [['Name', 'Gender', 'Marital Status', 'Born'],
        ['William', 'male', 'Married', 'Germany'],
        ['Louise', 'female', 'Single', 'Texas'],
        ['Minnie', 'female', 'Single', 'Texas'],
        ['Emma', 'female', 'Single', 'Texas'],
        ['Henry', 'male', 'Single', 'Germany'],
        ['Theo', 'male', 'Single', 'Texas'],
        ]

Also note that Alice has been replaced by Theo. We also need to make a compensating change to the test
data.

testData = ['Sophie', 'female', 'Single']

Run this data and the resultant decision tree is:

0: Marital Status=Single, Yes->1, No->2
1: Gender=female, Yes->3, No->4
2: Germany
3: Texas
4: Name=Henry, Yes->5, No->6
5: Germany
6: Texas
Unfortunately, this is an inefficient decision tree. To make it easier to see why it is inefficient, let's modify the output to include the number of rows assigned to each subset.

First capture the number of rows affected in the branch node.


dtree.py

def build(data, outcomeLabel):
    ...
    nodes.append((nodeId, attrIndex, attrValue, matchId, nonMatchId,
                  len(matches), len(nonMatches)))
    ...

Then include those counts in the __str__ output (items 5 and 6 in the format string below).

def __str__(self):
    ...
    else:
        nodeId, attrIndex, attrValue, nodeIdIfMatch, \
            nodeIdIfNonMatch, matchCount, nonMatchCount = node
        s += '{0}: {1}={2}, {5} Yes->{3}, {6} No->{4}\n'.format(
            nodeId, self._attrNames[attrIndex], attrValue,
            nodeIdIfMatch, nodeIdIfNonMatch, matchCount,
            nonMatchCount)
    ...

The get_prediction function also needs a compensating change since it doesn't care about the new values in the branch node:

def get_prediction(self, data):
    ...
    nodeId, attrIndex, attrValue, nodeIdIfMatch, \
        nodeIdIfNonMatch = currentNode[:5]
    ...

The colon in the array index of currentNode[:5] means it is taking a slice of the array. We're basically saying we only want the items from indexes 0-4 of currentNode. Slices are another powerful feature of Python.
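A quick illustration of slicing, separate from the engine:

node = (0, 1, 'female', 1, 2, 3, 3)
print(node[:5])  # (0, 1, 'female', 1, 2) - indexes 0 through 4 only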

Run again and the updated graph looks like this:


Notice that in the first branch only one of the rows takes the No path and ends up at the Germany leaf
node, while 5 of the rows take the Yes path to the Gender=female branch.

If we force the tree to use Gender in the first branch, however, we get a much better tree:

The decision tree looks the same structurally, but when examined from the point of view of the amount of
uncertainty in the outcome after the first test, this is a much better decision tree because it is able to
uniquely classify 1/2 of the rows in the first branch.

What we're doing is calculating how predictable the data is. The more predictable the data is, the lower the entropy, or uncertainty, and the easier it is for the decision tree to uniquely and correctly classify a row of data.
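The engine below uses a simpler bias calculation, but purely for illustration, the Shannon entropy of a branch's outcomes could be computed like this:

import math
from collections import Counter

def entropy(outcomes):
    counts = Counter(outcomes)
    total = len(outcomes)
    return sum(-(count / total) * math.log2(count / total)
               for count in counts.values())

print(entropy(['Texas', 'Texas', 'Texas']))             # 0.0 - fully predictable
print(entropy(['Texas', 'Germany', 'Texas', 'Texas']))  # about 0.81
print(entropy(['Texas', 'Germany']))                    # 1.0 - maximum uncertainty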

To put it another way, the tree above can uniquely classify 3/6, or 50 percent of the data rows with one
attribute comparison, and 4/6, or 66 percent after the second. Note that because Marital Status only
uniquely classifies one row at this point, it is no better or worse than using a value from the Name field,
as those also only uniquely identify one row in our training data. Finally, it classifies 100 percent of the
rows after the third attribute comparison. The decision tree that has Marital Status at the root, however,
only uniquely classifies 1/6, or about 17 percent of the data in the first comparison. That then jumps to 66
percent and 100 percent in the next two branches.

Implementation-wise this means we don't care at all about finding the attribute-value pair that matches the most rows. Instead we care about how evenly an attribute-value pair splits the rows into subsets. That may not result in the optimal tree overall, but it does mean each branch gets a lot more information out of the rows. At this point most decision tree engines calculate the standard deviation of the outcomes in the two branches to find the attribute-value pair that provides the lowest sum. But that's a time-consuming calculation, so we're going to do something simpler that works well enough for our purposes.

To begin with, since we don't care about row counts anymore, we can simplify the way the potentials are gathered and just collect all the unique attribute-value pairs in the work item's data rows into a set.

dtree.py

def build(data, outcomeLabel):
    ...
    if len(uniqueOutcomes) == 1:
        nodes.append((nodeId, uniqueOutcomes.pop()))
        continue

    uniqueAttributeValuePairs = {(attrIndex, data[rowIndex][attrIndex])
                                 for attrIndex in attrIndexes
                                 for rowIndex in dataRowIndexes}
    potentials = ...

Next, we need to make changes to _get_purity. First rename it to _get_bias to better reflect its
new purpose.

def _get_bias(avPair, dataRowIndexes, data, outcomeIndex):
    ...

Then, in addition to the purity, we'll calculate how evenly the attribute-value pair comparison splits the rows into two subsets.
...
percentPure = numPureRows / len(dataRowIndexes)

numNonPureRows = len(dataRowIndexes) - numPureRows
percentNonPure = 1 - percentPure
split = 1 - abs(len(matchIndexes) - len(nonMatchIndexes)) / len(
    dataRowIndexes) - .001
splitBias = split * percentNonPure if numNonPureRows > 0 else 0
return splitBias + percentPure

We add a small bias against splitting evenly (that's the -.001) so that we prefer pure subsets to evenly split ones, and add the result to the value returned.
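To make the formula concrete, here is a small worked example with made-up counts, not taken from the chapter's data:

matchCount, nonMatchCount, totalRows = 4, 2, 6
numPureRows = 2                          # only the 2-row subset is pure
percentPure = numPureRows / totalRows    # 0.333...
percentNonPure = 1 - percentPure         # 0.666...
split = 1 - abs(matchCount - nonMatchCount) / totalRows - .001
splitBias = split * percentNonPure       # about 0.44
print(splitBias + percentPure)           # about 0.78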

Finally, back in the build function, order the attribute-value pairs by the calculated bias value,
descending, with fallback to the attribute index and value.

def build(data, outcomeLabel):
    ...
    potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                    outcomeIndex), avPair[0], avPair[1])
                        for avPair in uniqueAttributeValuePairs)
    attrIndex, attrValue = potentials[0][1:]
    ...

Notice that we're no longer using Counter.

Now when the code is run it produces a much better decision tree, as intended.

0: Gender=female, 3 Yes->1, 3 No->2
1: Texas
2: Name=Theo, 1 Yes->3, 2 No->4
3: Texas
4: Germany
Exercise
Make a backup copy of your code then try changing _get_bias to calculate the sum of the standard
deviation of the outcomes as follows: For each branch use a Counter to collect the unique outcome
values and their counts and put the counts into an array. This in essence allows us to assign a numeric
value, the array index, to each unique outcome value. That makes it possible to calculate the average and
standard deviation of the outcomes. The standard deviation tells us how different the values in the branch
are from the average. The more variety in the outcome values the higher the standard deviation value will
be. If the outcomes are all the same then the standard deviation will be zero. Multiply the standard
deviation by the number of rows in that branch to weight it. Then sum the weighted result from both
branches and choose the attribute-value pair that produces the lowest sum - that involves changing the sort
order of potentials. Then run the test to see any difference in the output.
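One possible reading of the exercise is sketched below. It is not the author's solution, and _weighted_outcome_deviation is a made-up helper name: each outcome in a branch is mapped to a numeric index, the population standard deviation of those indexes is taken, and the result is weighted by the branch size. The two weighted values would then be summed and the lowest sum preferred, which requires changing the sort order of potentials.

from statistics import pstdev

def _weighted_outcome_deviation(outcomes):
    # Assign each unique outcome a numeric value (its position in a sorted
    # list), then measure how much the branch varies from its average.
    uniqueOutcomes = sorted(set(outcomes))
    values = [uniqueOutcomes.index(outcome) for outcome in outcomes]
    deviation = pstdev(values) if len(values) > 1 else 0
    return deviation * len(outcomes)  # weight by the number of rows

print(_weighted_outcome_deviation(['Texas', 'Texas', 'Texas']))      # 0.0
print(_weighted_outcome_deviation(['Texas', 'Germany', 'Germany']))  # > 0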
Summary
In this chapter we updated the engine to be able to generate multi-level decision trees. We used the purity
and uncertainty produced by each possible attribute-value pair to choose the one that provides the best
split at each branch in the tree, although this does not necessarily find the optimal tree. We also made the
result deterministic when multiple attribute-value pairs have the same bias value. You can find a lot of
scholarly articles online about finding the best split for a given set of nodes.
CONTINUOUS ATTRIBUTES

All the attributes we've used so far have had discrete values (Married, Single, male, female, etc.). As a result we've been able to use equality for comparison. However, there's another type of attribute that has a large set of potential values, like age or zip code, or even an unlimited range like distance or price. These are called continuous attributes and they need a comparison like greater-than to split the range of potential values into two groups.

The first problem then is to tell the engine which attributes are continuous. To facilitate that, let's add an optional parameter so the user can provide the labels for the continuous attributes in the data:
dtree.py

def build(data, outcomeLabel, continuousAttributes=None):
    ...
    attrIndexes = [index for index, label in enumerate(data[0]) if
                   label != outcomeLabel]
    outcomeIndex = data[0].index(outcomeLabel)
    continuousAttrIndexes = set()
    if continuousAttributes is not None:
        continuousAttrIndexes = {data[0].index(label) for label in
                                 continuousAttributes}
        if len(continuousAttrIndexes) != len(continuousAttributes):
            raise Exception(
                'One or more continuous column names are duplicates.')
    ...

An error will be raised if a provided label is wrong or appears more than once.

That'll work, but it would be even nicer if the engine could figure out which attributes are continuous by itself. If it could, then the parameter would only be necessary in situations where the engine can't figure it out, or when we don't want a column to be treated as continuous. So, how can we make that work? Well, one thing we know is that continuous data is numeric. We could simply test all the values for a given attribute to see if they are numeric, and if so mark that attribute as containing continuous values.

To do that we first need to import Number from the numbers library.

from numbers import Number


Then add an else case to the continuousAttributes if-block to check whether all the values are numeric. When they are, add that attribute's index to the continuousAttrIndexes set.

...
else:
for attrIndex in attrIndexes:
uniqueValues = {row[attrIndex] for rowIndex, row in
enumerate(data) if rowIndex > 0}
numericValues = {value for value in uniqueValues if
isinstance(value, Number)}
if len(uniqueValues) == len(numericValues):
continuousAttrIndexes.add(attrIndex)
...
Support alternate match operators
Next, before we try to use continuous attributes, we need to change all the places where we've hard-coded the use of == for attribute-value comparison to use a provided operator instead. This will make it easy to add support for greater-than for continuous attributes in the next step.

The following import makes it possible for us to pass built-in comparison functions as variables.

import operator
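For example, a small illustration only:

import operator

compare = operator.eq
print(compare('male', 'male'))  # True, same as 'male' == 'male'
compare = operator.gt
print(compare(37, 17))          # True, same as 37 > 17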

Now add the comparison operator to the attribute-value pair information when collecting the unique
attribute-value pairs from each work item:

uniqueAttributeValuePairs = {
    (attrIndex, data[rowIndex][attrIndex], operator.eq)
    for attrIndex in attrIndexes
    for rowIndex in dataRowIndexes}

Next, include the comparison operator (avPair[2] below) when building the potentials list, and
unpack the operator as isMatch when we pick the best attribute-value pair.

potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                outcomeIndex),
                     avPair[0], avPair[1], avPair[2])
                    for avPair in uniqueAttributeValuePairs)
attrIndex, attrValue, isMatch = potentials[0][1:]

Then use isMatch instead of == when building the set of matches.

matches = {rowIndex for rowIndex in dataRowIndexes if
           isMatch(data[rowIndex][attrIndex], attrValue)}

The last change to the build function is to include isMatch when adding the branching node to the
node list.

nodes.append((nodeId, attrIndex, attrValue, isMatch, matchId,
              nonMatchId, len(matches), len(nonMatches)))

The operator must also be extracted and used in place of == in the _get_bias function.
def _get_bias(avPair, dataRowIndexes, data, outcomeIndex):
    attrIndex, attrValue, isMatch = avPair
    matchIndexes = {i for i in dataRowIndexes if
                    isMatch(data[i][attrIndex], attrValue)}
    ...

The __str__ function must also be updated to unpack isMatch and use it (parameter 7 in the format
string) to determine whether = or > is added between the attribute and value in the string.

def __str__(self):
    ...
    else:
        nodeId, attrIndex, attrValue, isMatch, nodeIdIfMatch, \
            nodeIdIfNonMatch, matchCount, nonMatchCount = node
        s += '{0}: {1}{7}{2}, {5} Yes->{3}, {6} No->{4}\n'.format(
            nodeId, self._attrNames[attrIndex], attrValue,
            nodeIdIfMatch, nodeIdIfNonMatch, matchCount,
            nonMatchCount, '=' if isMatch == operator.eq else '>')

The final change to the dtree.py code is to unpack and use the match operator in place of == in the
get_prediction function.

def get_prediction(self, data):
    ...
    nodeId, attrIndex, attrValue, isMatch, nodeIdIfMatch, \
        nodeIdIfNonMatch = currentNode[:6]
    currentNode = self._nodes[nodeIdIfMatch if
                              isMatch(data[attrIndex], attrValue)
                              else nodeIdIfNonMatch]

Now run the test to make sure everything still works. Your result should still look like this:

0: Gender=female, 3 Yes->1, 3 No->2
1: Texas
2: Name=Theo, 1 Yes->3, 2 No->4
3: Texas
4: Germany

Great!
Use continuous attributes
Now we're ready to use continuous attributes. The first change is to the way we get the attribute-value pairs in the build function. We're going to exclude attribute indexes that are in continuousAttrIndexes when we build uniqueAttributeValuePairs because we aren't going to check values in those columns using equality.

def build(data, outcomeLabel, continuousAttributes=None):
    ...
    uniqueAttributeValuePairs = {
        (attrIndex, data[rowIndex][attrIndex], operator.eq)
        for attrIndex in attrIndexes
        if attrIndex not in continuousAttrIndexes
        for rowIndex in dataRowIndexes}
    ...

Now we have to think about what we want to happen with continuous attributes. The comparison operator we'll be using, greater-than, implies something about how we handle the data. First, it implies that we'll sort the data by the value of the continuous attribute. Also, since we're checking for a difference instead of equality, we can limit the amount of work we do by only performing the check at discontinuities (when the attribute value changes). Third, we don't want to test every discontinuity because that could take a lot of time, for example with prices in a grocery store, so the number of checks we perform will be limited to the square root of the number of rows. Once we have the list of indexes we want to check, we can create attribute-value pairs and pass them to _get_bias along with the attribute-value pairs being evaluated by equality, and take the best.

First we'll introduce a generator function that takes a list of sorted values and returns the indexes of the discontinuities, ordered by distance from the center index.

...
def _generate_discontinuity_indexes_center_out(sortedAttrValues):
    center = len(sortedAttrValues) // 2
    left = center - 1
    right = center + 1
    while left >= 0 or right < len(sortedAttrValues):
        if left >= 0:
            if sortedAttrValues[left] != sortedAttrValues[left + 1]:
                yield left
            left -= 1
        if right < len(sortedAttrValues):
            if sortedAttrValues[right - 1] != sortedAttrValues[right]:
                yield right - 1
            right += 1
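As a quick illustration, using the sorted Age values that appear later in this chapter, the generator yields indexes working outward from the middle:

ages = [14, 16, 17, 18, 37, 47]
print(list(_generate_discontinuity_indexes_center_out(ages)))
# [2, 3, 1, 4, 0] - index 2 (17 next to 18) is closest to the center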

Next is a function that takes the output from the generator and keeps at least 1 but not more than a given
maximum number of indexes.

...
def _get_discontinuity_indexes(sortedAttrValues, maxIndexes):
    indexes = []
    for i in _generate_discontinuity_indexes_center_out(sortedAttrValues):
        indexes.append(i)
        if len(indexes) >= maxIndexes:
            break
    return indexes

Third is a function that iterates over all the continuous attributes, extracts and sorts their values, calls the
above function to get a certain number of discontinuities, and adds an attribute-value pair for each to a set,
which is returned at the end.

import math
...
def _get_continuous_av_pairs(continuousAttrIndexes, data, dataRowIndexes):
    avPairs = set()
    for attrIndex in continuousAttrIndexes:
        sortedAttrValues = [i for i in sorted(
            data[rowIndex][attrIndex] for rowIndex in dataRowIndexes)]
        indexes = _get_discontinuity_indexes(
            sortedAttrValues,
            max(math.sqrt(len(sortedAttrValues)),
                min(10, len(sortedAttrValues))))
        for index in indexes:
            avPairs.add((attrIndex, sortedAttrValues[index], operator.gt))
    return avPairs

Finally, we need to call the above function from build to get the attribute-value pairs for the discontinuities and add them to the set of attribute-value pairs we created for equality comparison. All of these are then passed to _get_bias for evaluation.

...
        for rowIndex in dataRowIndexes}
    continuousAttributeValuePairs = _get_continuous_av_pairs(
        continuousAttrIndexes, data, dataRowIndexes)
    uniqueAttributeValuePairs |= continuousAttributeValuePairs
    potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                    outcomeIndex),

Now we're ready to try it out. We'll start off with Age:
test.py

data = [['Name', 'Age', 'Born'],
        ['William', 37, 'Germany'],
        ['Louise', 18, 'Germany'],
        ['Minnie', 16, 'Texas'],
        ['Emma', 14, 'Texas'],
        ['Henry', 47, 'Germany'],
        ['Theo', 17, 'Texas'],
        ]

testData = ['Sophie', 19]

Run the code and we get the correct result. Everyone whose age is greater than 17 was born in Germany,
and everyone else was born in Texas.

0: Age>17, 3 Yes->1, 3 No->2
1: Germany
2: Texas

Now let's add Gender and Marital Status to see if they confuse the engine.
data = [['Name', 'Gender', 'Marital Status', 'Age', 'Born'],
        ['William', 'male', 'Married', 37, 'Germany'],
        ['Louise', 'female', 'Single', 18, 'Germany'],
        ['Minnie', 'female', 'Single', 16, 'Texas'],
        ['Emma', 'female', 'Single', 14, 'Texas'],
        ['Henry', 'male', 'Married', 47, 'Germany'],
        ['Theo', 'male', 'Single', 17, 'Texas'],
        ]

testData = ['Sophie', 'female', 'Single', 17]

Run the code and we still get the correct result:

0: Age>17, 3 Yes->1, 3 No->2
1: Germany
2: Texas

Great!
Extract _get_potentials
You may have noticed that the build function has become quite long. We can shorten it somewhat by
extracting the code used to create the potentials into a separate function.

dtree.py

def _get_potentials(attrIndexes, continuousAttrIndexes, data,
                    dataRowIndexes, outcomeIndex):
    uniqueAttributeValuePairs = {
        (attrIndex, data[rowIndex][attrIndex], operator.eq)
        for attrIndex in attrIndexes
        if attrIndex not in continuousAttrIndexes
        for rowIndex in dataRowIndexes}
    continuousAttributeValuePairs = _get_continuous_av_pairs(
        continuousAttrIndexes, data, dataRowIndexes)
    uniqueAttributeValuePairs |= continuousAttributeValuePairs
    ...
    potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                    outcomeIndex),
                         avPair[0], avPair[1], avPair[2])
                        for avPair in uniqueAttributeValuePairs)
    return potentials

The calling code block in build will now look like this:

...
if len(uniqueOutcomes) == 1:
    nodes.append((nodeId, uniqueOutcomes.pop()))
    continue

potentials = _get_potentials(attrIndexes, continuousAttrIndexes,
                             data, dataRowIndexes, outcomeIndex)
attrIndex, attrValue, isMatch = potentials[0][1:]
...
Summary
In this chapter we added the ability to split numeric attributes using greater-than. This gives us the
ability to work with much larger and more realistic training data.
PRUNING

Typing the training data into the test file will quickly become tedious and error-prone. Let's add the ability to read the data from a comma-separated-value (CSV) file.

Start by adding a convenience function to dtree for reading CSV files:

dtree.py

import csv
...
def read_csv(filepath):
    with open(filepath, 'r') as f:
        reader = csv.reader(f)
        data = list(reader)
    return data
...

By default the CSV reader imports every value as a string. If the data contains continuous columns we may want to convert those to integers. We'll do that in a separate function named prepare_data.

def prepare_data(data, numericColumnLabels=None):
    if numericColumnLabels is not None and len(numericColumnLabels) > 0:
        numericColumnIndexes = [data[0].index(label) for label in
                                numericColumnLabels]
        for rowIndex, row in enumerate(data):
            if rowIndex == 0:
                continue
            for numericIndex in numericColumnIndexes:
                f = float(data[rowIndex][numericIndex]) if len(
                    data[rowIndex][numericIndex]) > 0 else 0
                i = int(f)
                data[rowIndex][numericIndex] = i if i == f else f
    return data

If the training data had hundreds of numeric columns then we might opt to make the function detect which
were numeric by their content, but this meets our immediate needs.

Next, create census.csv containing the following data. You can also download it from:

https://github.com/handcraftsman/TreeBasedMachineLearningAlgorithms/tree/master/ch04:
Name,Gender,Marital Status,Age,Relationship,Born
August,male,Married,32,Head,Germany
Minnie,female,Married,28,Wife,Texas
Emma,female,Single,9,Daughter,Texas
Theo,male,Single,3,Son,Texas
William,male,Married,37,Head,Germany
Sophie,female,Married,22,Wife,Germany
Louise,female,Single,4,Daughter,Texas
Minnie,female,Single,2,Daughter,Texas
Emma,female,Single,1,Daughter,Texas
Henry,male,Married,33,Head,Germany
Henrietta,female,Married,28,Wife,Germany
Henry,male,Single,9,Son,Texas
Frank,male,Single,7,Son,Texas
Hermann,male,Single,4,Son,Texas
Louise,female,Single,3,Daughter,Texas
Charles,male,Single,1,Son,Texas
Hermann,male,Married,39,Head,Germany
Dora,female,Married,31,Wife,Germany
Hennie,female,Single,8,Daughter,Texas
Lisette,female,Single,5,Daughter,Texas
Fritz,male,Single,3,Son,Texas
Minnie,female,Single,3,Daughter,Texas
Charles,male,Married,68,Head,Germany
Louise,female,Married,64,Wife,Germany
Katie,female,Single,21,Daughter,Germany
Charles,male,Single,18,Son,Germany
Henry,male,Single,2,Nephew,Texas
Horace,male,Married,27,Head,Texas
Lucy,female,Married,25,Wife,Texas
Henry,male,Married,61,Head,Germany
Louise,female,Married,51,Wife,Germany
Fritz,male,Single,18,Son,Germany
Otto,male,Single,16,Son,Texas
Bertha,female,Single,15,Daughter,Texas
Nathlie,female,Single,10,Daughter,Texas
Elsa,female,Single,8,Daughter,Texas
August,male,Single,6,Son,Texas
Henry,male,Single,2,Nephew,Texas
William,male,Married,66,Head,Germany
Minnie,female,Married,89,Wife,Germany
Hermann,male,Married,43,Head,Germany
Emily,female,Married,47,Wife,Germany
Henry,male,Single,19,Son,Texas
Olga,female,Single,18,Daughter,Texas
Paul,male,Single,16,Son,Texas
Ernst,male,Single,15,Son,Texas
Emil,male,Single,12,Son,Texas
Ed,male,Single,11,Son,Texas
Otto,male,Single,9,Son,Texas
Ella,female,Single,7,Daughter,Texas
William,male,Married,47,Head,Germany
Emily,female,Married,42,Wife,Germany
Lena,female,Single,15,Daughter,Texas
Christian,male,Single,14,Son,Texas
Bertha,female,Single,12,Daughter,Texas
Ella,female,Single,9,Daughter,Texas
Mollie,female,Single,6,Daughter,Texas
Hettie,female,Single,1,Daughter,Texas

test.py can now be changed to get its data from the file like this:

test.py

data = dtree.read_csv('census.csv')
data = dtree.prepare_data(data, ['Age'])
...
testData = ['Elizabeth', 'female', 'Married', 19, 'Daughter']

When this code is run it produces a tree with 27 nodes; notice the structure. Of the 13 branch nodes, 7 use age, 4 use name, 1 uses gender and 1 uses marital status. The first three branches split the data almost evenly each time. The problem is that after the 4th branch the tree starts to fan out into very small nodes and use the Name attribute to determine the birthplaces of the remaining people.

This would be a great decision tree if we only planned to apply it to the data that was used to build the tree. But it is not such a good tree if we plan to use it to predict birthplaces of people in new data. The reason is that the values in the 5th-6th level branches are too granular. They also use characteristics that are too specific to the data used to build the tree, like having the name August or being between 16 and 18 years old. This means the tree works substantially better on the initial data than it would on future data. That's called overfitting the data.

There are three common methods used to reduce overfitting in order to improve a decision tree's ability to predict future data:

prune while building the tree, or top-down,

prune after building the tree, or bottom-up, and

error driven, which could be implemented top-down or bottom-up.

Top-down pruning includes stopping when:

the data rows in the work item all have the same outcome (we already do this),

the data rows are identical except for the outcomes (this is noisy data), and

the number of data rows in a subset is smaller than some threshold.

Bottom-up pruning includes:

replacing a branch with its most common leaf node, and

splitting the tree-building data into two groups, a building set and a validation set, and using the validation set like future data to remove less valuable nodes. This is called cross-validation.

Error reduction includes variations on:

splitting the tree-building data into building and validation sets, building the tree, then continuously replacing the node that has the worst split ratio with its most common leaf node, until the error rate crosses a given threshold, and

alternating tree-building and pruning.
Prune small subsets
Getting back to our census data tree, 10 of the 14 leaf nodes represent 3 or fewer rows of data. The tree would work much better on future data if their parent branches were replaced with their most common leaf. Let's add support for an optional threshold value that can be used to eliminate those leaf nodes. First the optional parameter.
dtree.py

def build(data, outcomeLabel, continuousAttributes=None,
          minimumSubsetSizePercentage=0):
    if minimumSubsetSizePercentage > 0:
        minimumSubsetSizePercentage /= 100
    minimumSubsetSize = int(minimumSubsetSizePercentage * len(data))

The parameter is a percentage instead of a specific count so that it automatically scales with the number
of data rows. We automatically convert it to a decimal value if necessary, and calculate a specific count.
That count, minimumSubsetSize, is used in _get_bias to return a negative bias value when either
resultant subset would have fewer than minimumSubsetSize rows.

def _get_bias(avPair, dataRowIndexes, data, outcomeIndex, minimumSubsetSize):
    attrIndex, attrValue, isMatch = avPair
    matchIndexes = {i for i in dataRowIndexes if
                    isMatch(data[i][attrIndex], attrValue)}
    nonMatchIndexes = dataRowIndexes - matchIndexes
    if len(matchIndexes) < minimumSubsetSize or len(
            nonMatchIndexes) < minimumSubsetSize:
        return -1

We have to make a compensating change to pass minimumSubsetSize to _get_bias from _get_potentials.

def _get_potentials(attrIndexes, continuousAttrIndexes, data,
                    dataRowIndexes, outcomeIndex, minimumSubsetSize):
    ...
    potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                    outcomeIndex, minimumSubsetSize),
                         avPair[0], avPair[1], avPair[2])
                        for avPair in uniqueAttributeValuePairs)
    return potentials

and in the build function.

potentials = _get_potentials(attrIndexes, continuousAttrIndexes,
                             data, dataRowIndexes, outcomeIndex,
                             minimumSubsetSize)
if len(potentials) == 0 or potentials[0][0] > 0:
    nodes.append((nodeId, uniqueOutcomes.pop()))
    continue

Since we want to eliminate the nodes that have 3 or fewer rows and the data file has 58 data rows, we
can set the threshold value to 6 percent.
test.py

tree = dtree.build(data, outcomeLabel, minimumSubsetSizePercentage=6)

Run the code again and the resultant tree is much smaller:

0: Gender=female, 29 Yes->1, 29 No->2
1: Age>12, 14 Yes->9, 15 No->10
2: Age>15, 15 Yes->3, 14 No->4
3: Age>32, 8 Yes->5, 7 No->6
4: Texas
5: Germany
6: Age>18, 3 Yes->7, 4 No->8
7: Texas
8: Texas
9: Age>28, 6 Yes->11, 8 No->12
10: Texas
11: Germany
12: Marital Status=Married, 4 Yes->13, 4 No->14
13: Texas
14: Texas
However, once again your result may be different because we allowed a bit of randomness to slip in. As
a result, the outputs of the tree above are substantially different from the original data.

For example, in the tree detail below both of the leaf nodes below Age>18 (nodes 7 and 8) now predict
Texas.

What happened? The problem is this change:

if len(potentials) == 0 or potentials[0][0] > 0:
    nodes.append((nodeId, uniqueOutcomes.pop()))
    continue
uniqueOutcomes is a set object, which means the values it contains are unordered. Thus calling uniqueOutcomes.pop() is equivalent to picking a random value from the set. Clearly that's not what we want. The fix is to use a Counter instead of a set:

while len(workQueue) > 0:
    parentNodeId, nodeId, dataRowIndexes = workQueue.pop()
    uniqueOutcomes = Counter(
        data[i][outcomeIndex] for i in dataRowIndexes).most_common()
    if len(uniqueOutcomes) == 1:
        nodes.append((nodeId, uniqueOutcomes.pop(0)[0]))
        continue
    ...
    if len(potentials) == 0 or potentials[0][0] > 0:
        nodes.append((nodeId, uniqueOutcomes.pop(0)[0]))
        continue

This fixes leaf nodes where there is a clear difference, but there's still some randomness when the number of rows for each outcome is equal (see nodes below node 12 in the following tree detail).

We could resolve this by sorting by the outcome text too, but that isn't a true representation of the data, and could potentially still end up with both leaf nodes producing the same output. A better solution is to introduce a new type of leaf node that contains both the potential outcomes and their probabilities.

if len(potentials) == 0 or potentials[0][0] > 0:
    nodes.append((nodeId, [(n[0], n[1] / len(dataRowIndexes))
                           for n in uniqueOutcomes]))
    continue

Which produces output like this:


0: Gender=female, 29 Yes->1, 29 No->2
1: Age>12, 14 Yes->9, 15 No->10
2: Age>15, 15 Yes->3, 14 No->4
3: Age>32, 8 Yes->5, 7 No->6
4: Texas
5: Germany
6: Age>18, 3 Yes->7, 4 No->8
7: [('Texas', 0.6666666666666666), ('Germany', 0.3333333333333333)]
8: [('Germany', 0.5), ('Texas', 0.5)]
9: Age>28, 6 Yes->11, 8 No->12
10: Texas
11: Germany
12: Marital Status=Married, 4 Yes->13, 4 No->14
13: [('Texas', 0.5), ('Germany', 0.5)]
14: [('Texas', 0.75), ('Germany', 0.25)]

predicted: [('Texas', 0.5), ('Germany', 0.5)]

And those tree nodes look like this (note the percentages in the leaf nodes):

However, returning the list of potential outcomes and their probabilities from get_prediction is
messy, and would require the calling code to check for that kind of result.

Instead, let's make get_prediction choose a random outcome based on the probabilities in that leaf node.

import random
...
def get_prediction(self, data):
    currentNode = self._nodes[0]
    while True:
        if self._is_leaf(currentNode):
            node = currentNode[1]
            if type(node) is not list:
                return node
            randPercent = random.uniform(0, 1)
            total = 0
            for outcome, percentage in node:
                total += percentage
                if total > randPercent:
                    return outcome
            return node[-1][0]
        nodeId, attrIndex, ...

Now the predicted result is still a simple value from the outcome column but based on the frequency of
that outcome in that particular branch.
output

...
predicted: Germany

Great! We now have the ability to reduce the specificity of the decision tree structure without
substantially impacting its accuracy.
Error reduction
As previously mentioned, another way of preventing the tree from making decisions based on less relevant
columns is to optionally use a portion of the tree-building data for validation. Let's add that capability.

def build(data, outcomeLabel, continuousAttributes=None,
          minimumSubsetSizePercentage=0, validationPercentage=0):
    if validationPercentage > 0:
        validationPercentage /= 100
    validationCount = int(validationPercentage * len(data))
    if minimumSubsetSizePercentage > 0:
        ...

Now split the row indexes into those used for building the tree and those used for validation.

lastNodeNumber = 0
dataIndexes = {i for i in range(1, len(data))}
validationIndexes = set()
if validationCount > 0:
validationIndexes = set(
random.sample(range(1, len(data)), validationCount))
dataIndexes -= validationIndexes
workQueue = [(-1, lastNodeNumber, dataIndexes, validationIndexes)]
while len(workQueue) > 0:
parentNodeId, nodeId, dataRowIndexes, validationRowIndexes = \
workQueue.pop()

The validation row indexes must also be passed to _get_potentials.

potentials = _get_potentials(attrIndexes, continuousAttrIndexes,
                             data, dataRowIndexes, outcomeIndex,
                             minimumSubsetSize,
                             validationRowIndexes)

The only usage in _get_potentials is to pass them through to _get_bias:

def _get_potentials(attrIndexes, continuousAttrIndexes, data,
                    dataRowIndexes, outcomeIndex, minimumSubsetSize,
                    validationRowIndexes):
    ...
    potentials = sorted((-_get_bias(avPair, dataRowIndexes, data,
                                    outcomeIndex, minimumSubsetSize,
                                    validationRowIndexes),
                         avPair[0], avPair[1], avPair[2])
                        for avPair in uniqueAttributeValuePairs)
    return potentials

They must also be split into matching and non-matching sets at the end of build and included in the child
node data.

matches = {rowIndex for rowIndex in dataRowIndexes if
           isMatch(data[rowIndex][attrIndex], attrValue)}
nonMatches = dataRowIndexes - matches
validationMatches = {
    rowIndex for rowIndex in validationRowIndexes if
    isMatch(data[rowIndex][attrIndex], attrValue)}
nonValidationMatches = validationRowIndexes - validationMatches
lastNodeNumber += 1
matchId = lastNodeNumber
workQueue.append((nodeId, matchId, matches, validationMatches))
lastNodeNumber += 1
nonMatchId = lastNodeNumber
workQueue.append((nodeId, nonMatchId, nonMatches,
                  nonValidationMatches))
nodes.append((nodeId, attrIndex, attrValue, isMatch, matchId,
              nonMatchId, len(matches), len(nonMatches)))

Finally, in _get_bias the data rows in the validation set are split using the given attribute-value pair. If
either resultant set is empty then we don't use that attribute-value pair.

def _get_bias(avPair, dataRowIndexes, data, outcomeIndex, minimumSubsetSize,
              validationRowIndexes):
    attrIndex, attrValue, isMatch = avPair
    if len(validationRowIndexes) > 0:
        validationMatchIndexes = {i for i in validationRowIndexes if
                                  isMatch(data[i][attrIndex], attrValue)}
        validationNonMatchIndexes = validationRowIndexes - \
                                    validationMatchIndexes
        if len(validationMatchIndexes) == 0 or len(
                validationNonMatchIndexes) == 0:
            return -2
    matchIndexes = ...

To use the new option, just provide a validation percentage when calling build. The larger the
percentage, the more likely rare attribute-value pairs will be used in the tree. Also, since the validation
set is random, the tree is different every time.
test.py
tree = dtree.build(data, outcomeLabel, validationPercentage=6)

Notice that the structure of this sample result is quite compact compared to those we've been seeing so far
in this chapter.
sample output

0: Age>14, 28 Yes->1, 27 No->2
1: Age>28, 14 Yes->3, 14 No->4
2: Texas
3: Germany
4: [('Texas', 0.6428571428571429), ('Germany', 0.35714285714285715)]

Run it a few times so you can see the variation in the decision trees produced with different validation
percentages. Then try it using both a validation percentage and minimum subset size percentage to see
how the two settings might work together.
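For example, a combined call might look like this - a sketch reusing the test.py variables from this
chapter; the specific percentages are arbitrary:

tree = dtree.build(data, outcomeLabel, minimumSubsetSizePercentage=6,
                   validationPercentage=6)
print(tree)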
Summary
In this chapter we explored ways of reducing the amount of error encoded in the tree. This is particularly
important in keeping the predicted values from being biased by rare data or conflicting information in the
original data. In the next chapter we'll look at another way to accomplish that goal.
RANDOM FORESTS

Despite the pruning innovations we've added to the decision tree engine, it still does not handle noisy
data very well. For example, as we saw in previous chapters, if the census data tree is first split on Age
instead of Gender, it completely changes the structure of the tree and may impact the accuracy of the
predictions. One solution to this is a random forest. Random forests recover well from noisy data
because they aggregate, and possibly weight, the results of many small decision trees created from subsets
of the data and subsets of the attributes in the data, thus reducing the impact of bad data. This is known as
ensemble learning.

Random forests resolve another problem in working with large data sets as well - computation time.
Consider what happens when your data has 50,000 attributes and 100 million rows. An example of this
might be self-driving car sensor inputs where the output is what the car should do next. A random forest
resolves the computation problem by using a fraction of the data sampled at random. The total work is
much smaller than evaluating all attributes across all rows of data, without substantially impacting the
quality of the predictions.

The downside of ensemble learning is that while the tool provides good results, we can no longer easily
ascertain why it makes a particular decision.
Implementation
Structurally, a random forest is a wrapper around a collection of decision trees, so we'll start by passing
it all the data and telling it the outcome label and which attributes contain continuous values, if any.
forest.py

import dtree
import math
import random

class Forest:
def __init__(self, data, outcomeLabel, continuousAttributes=None,
dataRowIndexes=None, columnsNamesToIgnore=None):
...

Next we need to decide how many data rows and attributes each tree should use. You can play around
with various numbers but it turns out that a good size is the square root of the number of rows.

...
self.data = data
self.outcomeLabel = outcomeLabel
self.continuousAttributes = continuousAttributes \
if columnsNamesToIgnore is None \
else [i for i in continuousAttributes if
i not in columnsNamesToIgnore]
self.numRows = math.ceil(math.sqrt(
len(data) if dataRowIndexes is None else len(dataRowIndexes)))
...

Instead of making a copy of the rows we'll pass a list of row indexes and attribute indexes to use to build
the tree. We need to exclude the header row from the row indexes and the outcome column from the
attribute indexes. We also need to decide how many trees to build. Again, you can and should play with
this number, but it turns out that you get diminishing value above about 200 trees.

...
self.outcomeIndex = data[0].index(outcomeLabel)
columnIdsToIgnore = set() if columnsNamesToIgnore is None else set(
data[0].index(s) for s in columnsNamesToIgnore)
columnIdsToIgnore.add(self.outcomeIndex)
self.attrIndexesExceptOutcomeIndex = [i for i in range(0, len(data[0]))
if i not in columnIdsToIgnore]
self.numAttributes = math.ceil(
math.sqrt(len(self.attrIndexesExceptOutcomeIndex)))
self.dataRowIndexes = range(1, len(
data)) if dataRowIndexes is None else dataRowIndexes
self.numTrees = 200
self.populate()
...

Lastly, we need a way to populate the forest by creating the random trees. We'll put this in a separate
function so we can rebuild the forest whenever we want.

...
def _build_tree(self):
return dtree.build(self.data, self.outcomeLabel,
continuousAttributes=self.continuousAttributes,
dataIndexes={i for i in random.sample(
self.dataRowIndexes, self.numRows)},
attrIndexes=[
i for i in random.sample(
self.attrIndexesExceptOutcomeIndex,
self.numAttributes)])

def populate(self):
self._trees = [self._build_tree() for _ in range(0, self.numTrees)]
Update the decision tree
Now we need to add support for the new parameters, dataIndexes and attrIndexes, to the
build function in dtree.py in order to make it use subsets of the rows and attributes to build the tree.

dtree.py

def build(data, outcomeLabel, continuousAttributes=None,
          minimumSubsetSizePercentage=0, validationPercentage=0,
          dataIndexes=None, attrIndexes=None):
    if validationPercentage > 0:
        validationPercentage /= 100
    validationCount = int(validationPercentage *
                          (len(data) if dataIndexes is None else len(
                              dataIndexes)))
    ...

Support for attrIndexes is easy as we just need to add a None check around populating the existing
variable.

...
minimumSubsetSize = int(minimumSubsetSizePercentage *
(len(data) if dataIndexes is None else len(
dataIndexes)))
if attrIndexes is None:
attrIndexes = [index for index, label in enumerate(data[0]) if
label != outcomeLabel]
outcomeIndex = data[0].index(outcomeLabel)
...

We also need to have build sample the validation indexes from the provided data indexes.

...
lastNodeNumber = 0
if dataIndexes is None:
dataIndexes = {i for i in range(1, len(data))}
elif not isinstance(dataIndexes, set):
dataIndexes = {i for i in dataIndexes}
validationIndexes = set()
if validationCount > 0:
validationIndexes = set(
random.sample([i for i in dataIndexes], validationCount))
dataIndexes -= validationIndexes
workQueue = [(-1, lastNodeNumber, dataIndexes, validationIndexes)]
while len(workQueue) > 0:
parentNodeId, nodeId, dataRowIndexes, validationRowIndexes = \
workQueue.pop()
...
Forest prediction
Finally, back in the Forest class we need a way to get the aggregate prediction from the random forest.

forest.py

from collections import Counter
...
def get_prediction(self, dataItem):
    sorted_predictions = self._get_predictions(dataItem)
    return sorted_predictions[0][0]

def _get_predictions(self, dataItem):
    predictions = [t.get_prediction(dataItem) for t in self._trees]
    return Counter(p for p in predictions).most_common()
Test
Now we can change the test code to use the random forest. First read the CSV data file.
test.py

import dtree
from forest import Forest

data = dtree.read_csv('census.csv')
continuousColumns = ['Age']
data = dtree.prepare_data(data, continuousColumns)
outcomeLabel = 'Born'
...

Then build the forest and get the result it predicts.

...
forest = Forest(data, outcomeLabel, continuousColumns)
testData = ['Elizabeth', 'female', 'Married', 16, 'Daughter']
predicted = forest.get_prediction(testData)
print("predicted: {}".format(predicted))

Run this code and it will probably predict Germany but it may also predict Texas. Why? Well, when we
use the decision tree directly we're using all the data to make a prediction, so the result is always the
same - Germany. The random forest, on the other hand, chooses random data rows and columns 200 times
to make different decision trees and then takes the most common prediction from those trees. To get a
better idea of how right or wrong Germany might be we can count how often each prediction occurs
across 100 runs, rebuilding the forest each time.

from collections import Counter
...
forest = Forest(data, outcomeLabel, continuousColumns)
predictions = []
for _ in range(0, 100):
    predictions.append(forest.get_prediction(testData))
    forest.populate()
counts = Counter(predictions)
print("predictions: {}".format(counts.most_common()))

sample result

predictions: [('Germany', 52), ('Texas', 48)]


There are two potential reasons why it isn't predicting Germany 100 percent of the time. The first is that
Age ends up being the deciding factor in many of the trees. The second is that we're working with too few
rows of data. The census data file we're using simply doesn't have enough data to justify the use of a
random forest. No problem, let's use a bigger data file.
Survivors of the Titanic
Download the train.csv file for the survival information on the Titanic disaster from
https://www.kaggle.com/c/titanic/data - you may be asked to create an account to download the file. It is
worth doing so as this site has many interesting data sets and challenges.

The file has the following columns:

PassengerId - unique number for each row

Survived - 1 if they survived, otherwise 0

Pclass - the person's ticket class: 1, 2, or 3

Name - structured, example: "Gschwend, Mrs. John (Elizabeth Guntly)"

Sex - "male" or "female"

Age - integer, decimal, or blank if unknown

SibSp - 1 if spouse is aboard, otherwise number of siblings aboard

Parch - number of parents or children aboard

Ticket - ticket number, examples: "A/5 121", "314159"

Fare - how much they paid

Cabin - examples: "D15" or multiple like "D15 D17", or blank

Embarked - code for the city where they boarded: "S", "C", "Q"

The PassengerId, Pclass, Age, SibSp, Parch, and Fare columns only contain numbers, so let's treat them as
continuous value columns. That will give the decision tree the flexibility to, for example, group
passengers in 1st and 2nd class, or children with more than 2 siblings aboard.

Let's create a new file named titanic.py. Here's the full code for constructing the random forest
from the Titanic data.
titanic.py

import dtree
from forest import Forest
import random

continuousColumns = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']


data = dtree.read_csv('train.csv')
data = dtree.prepare_data(data, continuousColumns)
outcomeLabel = 'Survived'
columnsToIgnore = ['PassengerId', 'Name', 'Ticket', 'Cabin']
trainingRowIds = random.sample(range(1, len(data)), int(.8 * len(data)))
forest = Forest(data, outcomeLabel, continuousColumns, trainingRowIds, columnsToIgnore)

Let's ignore the PassengerId, Name, Ticket, and Cabin columns for now because they're probably more or
less unique per passenger.

We will allow the forest to pick rows from 80 percent of the data.

Next we'll ask the random forest for a survival prediction for each of the rows in the 20 percent we didn't
use for training and compare against the known value - this is cross-validation again.

correct = sum(1 for rowId, row in enumerate(data) if
              rowId > 0 and
              rowId not in trainingRowIds and
              forest.get_prediction(row) == row[1])
total = (len(data) - 1 - len(trainingRowIds))
print("{} ({:.1%}) of {} correct".format(correct, correct / total, total))

sample result

135 (75.8%) of 178 correct

Not bad. The problem is that the result varies every time we run it. It would be nice to know whether 75.8%
is near the average, or abnormally high or low. We can do that.

Benchmark
Let's add a Benchmark class to forest.py. It will have a run function that takes a function to call. It
expects the given function to return a number - the percentage correct. The run function will call the
provided function 100 times and will display the running average and standard deviation for the first 10
rounds and every 10th round thereafter.
forest.py

import statistics
...
class Benchmark:
@staticmethod
def run(f):
results = []
for i in range(100):
result = f()
results.append(result)
if i < 10 or i % 10 == 9:
mean = statistics.mean(results)
print("{} {:3.2f} {:3.2f}".format(
1 + i, mean,
statistics.stdev(results, mean) if i > 1 else 0))

You may need to install the statistics module on your system. This can be accomplished from the command line with
python -m pip install statistics

Next we'll convert the Titanic survival prediction to a function. To make sure we get different results,
we'll create the forest inside the function using a random set of training data.
titanic.py

from forest import Benchmark
...
def predict():
    trainingRowIds = random.sample(range(1, len(data)), int(.8 * len(data)))
    forest = Forest(data, outcomeLabel, continuousColumns, trainingRowIds,
                    columnsToIgnore)
    correct = sum(1 for rowId, row in enumerate(data) if
                  rowId > 0 and
                  rowId not in trainingRowIds and
                  forest.get_prediction(row) == row[1])
    return 100 * correct / (len(data) - 1 - len(trainingRowIds))

Then run the benchmark.

...
Benchmark.run(predict)

Here's my result:

1 73.60 0.00
2 73.31 0.00
3 73.03 0.56
4 73.74 1.48
5 73.15 1.84
6 73.22 1.65
7 73.92 2.38
8 73.17 3.04
9 73.47 2.98
10 73.99 3.25
20 74.80 3.29
30 75.58 3.72
40 75.79 3.47
50 75.70 3.50
60 75.84 3.63
70 75.55 3.57
80 75.62 3.58
90 75.69 3.50
100 75.69 3.55

This means that, averaging 100 runs, the random forest correctly predicts 75.69 percent of the survivors,
and 68 percent of the time (one standard deviation) it predicts between 72.14 (75.69 - 3.55) and 79.24
(75.69 + 3.55) percent of the survivors correctly.

Improving survival prediction


Can we improve upon that? Probably. Remember, we're currently ignoring the PassengerId, Name,
Ticket, and Cabin columns and there may be something not person-specific in one or more of those
columns.

Looking at the data file there don't appear to be any passengers listed next to their spouse, parent, or
child, so PassengerId is probably a simple row number for randomized rows picked from the full data set.
That means it will not be useful to us. We can check that assumption by running again after removing
PassengerId from the ignored columns list.
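In titanic.py that check is just a matter of leaving PassengerId out of the ignore list - a sketch, with
everything else staying the same:

# PassengerId is no longer ignored; Name, Ticket, and Cabin still are
columnsToIgnore = ['Name', 'Ticket', 'Cabin']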

My results over multiple benchmark runs show the impact of the addition of PassengerId to be
consistently negative. Its use reduces our ability to predict the survivors by 2 percentage points on
average.
sample results from 3 runs

100 72.13 3.49
100 72.26 2.89
100 71.12 3.44

Let's try a different column.

Cabin
From Wikipedia https://en.wikipedia.org/wiki/RMS_Titanic we learn that the Titanic hit an iceberg at
11:40pm, ship time. Most people would probably have been asleep at the time. Let's see if adding the
person's presumed location to the available columns affects our prediction accuracy. As we did with
PassengerId, we'll simply remove Cabin from the excluded columns list and run again.

sample results from 3 runs

100 75.51 3.14
100 76.64 3.11
100 76.22 3.15

The results show negligible impact versus not using the Cabin value. Maybe there's something else we can
do with the information in that field.

Let's apply a Counter to the values to see what we have.

from collections import Counter


...
data = dtree.read_csv('train.csv')
data = dtree.prepare_data(data, continuousColumns)
cabinColumnIndex = data[0].index('Cabin')
print(Counter(data[i][cabinColumnIndex] for i in range(1, len(data))).most_common())
...

partial result:

[('', 687), ('B96 B98', 4), ('C23 C25 C27', 4), ('G6', 4), ('E101', 3), ('D', 3), ('F33',
3), ... ('T', 1), ('C47', 1), ('D48', 1) ...

From this we learn that the majority of passengers were not assigned a cabin, or we don't know their
cabin assignment. However, we also learn that the cabin numbers are structured. They start with a letter,
indicating the deck containing the cabin, possibly followed by a room number.

If interested, you can learn more about where the decks and rooms were located within the ship at
http://www.titanic2ship.com/rms-titanic-deck-plans/. For example, rooms on the starboard side of the ship had odd numbers and
those on the port side had even numbers.

Feature engineering
When we manipulate the data to simplify it, fill in missing values, or combine pieces of information to
create new attributes, were performing feature engineering. One example of this might be filling in
missing ages by using the median age of all passengers or finding another person with similar column
values to fill in that piece of data.
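As an illustration of that first example, here is a minimal sketch - not code from this book - of filling in
missing ages with the median; it assumes the data list from titanic.py and that a missing Age appears
as an empty string before prepare_data is called:

import statistics

ageColumnIndex = data[0].index('Age')
# collect the ages that are present, as numbers
knownAges = [float(data[i][ageColumnIndex]) for i in range(1, len(data))
             if len(str(data[i][ageColumnIndex])) > 0]
medianAge = statistics.median(knownAges)
# fill in the blanks with the median age
for i in range(1, len(data)):
    if len(str(data[i][ageColumnIndex])) == 0:
        data[i][ageColumnIndex] = medianAge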

In this case we're going to replace the full cabin number, if any, with just the deck letter.

columnsToIgnore = ['PassengerId', 'Name', 'Ticket']
for i in range(1, len(data)):
    if len(data[i][cabinColumnIndex]) == 0:
        continue
    data[i][cabinColumnIndex] = data[i][cabinColumnIndex][0]
print(Counter(data[i][cabinColumnIndex] for i in range(1, len(data))).most_common())

Counter output

[('', 687), ('C', 59), ('B', 47), ('D', 33), ('E', 32), ('A', 15), ('F', 13), ('G', 4),
('T', 1)]

And the benchmark result:


sample results from 3 runs

100 76.13 3.13
100 76.20 2.96
100 75.88 3.64

Hmm. That's about the same result as using the full Cabin value, meaning we don't appear to have gained
any useful information by taking just the deck letter. Not surprising really, as less than 10 percent of the
passengers, at most, have a cabin on any particular deck.

What about something simpler, like 0 if the person didn't have a cabin and 1 if they did?

columnsToIgnore = ['PassengerId', 'Name', 'Ticket']
cabinColumnIndex = data[0].index('Cabin')
for i in range(1, len(data)):
    data[i][cabinColumnIndex] = 0 if len(data[i][cabinColumnIndex]) == 0 else 1
print(Counter(data[i][cabinColumnIndex] for i in range(1, len(data))).most_common())

Counter output

[(0, 687), (1, 204)]


That isn't as good as not using the Cabin column at all.
sample results from 3 runs

100 76.61 2.88
100 75.82 3.42
100 75.89 3.27

One last attempt. Let's see if having a port or starboard cabin makes a difference. For those who have a
cabin, if the cabin number is odd we'll assign starboard, otherwise port. We'll use a regular expression
to find the first cabin number.

import re
...
columnsToIgnore = ['PassengerId', 'Name', 'Ticket']
cabinColumnIndex = data[0].index('Cabin')
for i in range(1, len(data)):
if len(data[i][cabinColumnIndex]) == 0:
continue
match = re.match(r'^[^\d]+(\d+)', data[i][cabinColumnIndex])
if not match:
data[i][cabinColumnIndex] = ''
continue
cabin = int(match.groups(1)[0])
data[i][cabinColumnIndex] = 'starboard' if (cabin & 1) == 1 else 'port'
print(Counter(data[i][cabinColumnIndex] for i in range(1, len(data))).most_common())

Counter output

[('', 691), ('port', 108), ('starboard', 92)]

sample results from 3 runs

100 75.66 3.43
100 76.66 3.30
100 76.30 3.15

About the same again, although the standard deviation did improve. Not good enough; let's try another
column.

Name
The data in the Name column is structured and it turns out there's something that might be of use. Here are
some structural examples:
Guntly, Miss. Elizabeth
Gschwend, Mr. John

Let's use a regular expression to extract the person's title:

columnsToIgnore = ['PassengerId', 'Ticket', 'Cabin']
nameColumnIndex = data[0].index('Name')
for i in range(1, len(data)):
    if len(data[i][nameColumnIndex]) == 0:
        continue
    name = data[i][nameColumnIndex]
    match = re.match(r'^[^,]+, ([^.]+)\..*', name)
    if match is None:
        continue
    data[i][nameColumnIndex] = match.groups(1)
print(Counter(data[i][nameColumnIndex] for i in range(1, len(data))).most_common())

Counter output

[(('Mr',), 517), (('Miss',), 182), (('Mrs',), 125), (('Master',), 40), (('Dr',), 7),
(('Rev',), 6), (('Major',), 2), (('Col',), 2), (('Mlle',), 2), (('Mme',), 1), (('Sir',),
1), (('Capt',), 1), (('Don',), 1), (('the Countess',), 1), (('Jonkheer',), 1), (('Ms',),
1), (('Lady',), 1)]

These titles provide potentially useful personal attributes like gender, marital status, social class,
nationality, age, and profession, and result in a solid improvement.
sample results from 3 runs

100 78.78 3.06
100 78.75 3.13
100 78.62 2.81

It is clear that something in the titles gives a clue to the person's survival.
Exercises
Now that you see how feature engineering works you should experiment with other fields like Ticket. This
is your chance to be creative and try to get every advantage you can from the data and any domain
knowledge you may have.

You might also want to see if any of the cabin variants work better if the person's title is also used. It is
not uncommon for an apparently useless field to become useful in the presence of another engineered
field.

Another method of feature engineering is to combine two or more fields or to add new fields. For
example, you could add a new continuous column, like Family Size, and populate it with the sum of Parch
and SibSp. This might allow you to discover whether small families have a survival advantage over large
ones, or whether the Fare is per group or per person (Fare divided by the size of the family).
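A minimal sketch of that Family Size idea might look like the following - not code from this book; it
assumes the titanic.py variables and that SibSp and Parch have already been converted to numbers by
prepare_data:

sibSpIndex = data[0].index('SibSp')
parchIndex = data[0].index('Parch')
data[0].append('Family Size')  # new column header
for i in range(1, len(data)):
    # the person plus their siblings/spouse and parents/children aboard
    data[i].append(1 + data[i][sibSpIndex] + data[i][parchIndex])
continuousColumns.append('Family Size')  # let the tree split on ranges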

Do you think people of English descent had a survival advantage? You could write a function that uses the
person's first name(s) and some lists of popular French, German, etc. names to guess at the person's
ancestry.

How would you determine which field is the most important predictor of survival? You could ignore all
but one field to see how well that one attribute predicts survival, then iterate through the fields. Try it. Are
there any surprises? Now try pairs. Can you think of a way to automate this?
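One possible way to approach that last question - a sketch, not code from this book - is to wrap the
existing benchmark in a loop that ignores every column except the one being tested:

allColumns = [label for label in data[0] if label != outcomeLabel]
for keep in allColumns:
    columnsToIgnore = [label for label in allColumns if label != keep]
    trainingRowIds = random.sample(range(1, len(data)), int(.8 * len(data)))
    forest = Forest(data, outcomeLabel, continuousColumns, trainingRowIds,
                    columnsToIgnore)
    correct = sum(1 for rowId, row in enumerate(data) if
                  rowId > 0 and
                  rowId not in trainingRowIds and
                  forest.get_prediction(row) == row[1])
    total = len(data) - 1 - len(trainingRowIds)
    print("{}: {:.1%} correct".format(keep, correct / total))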
Summary
This chapter introduced random forests, a powerful tool for reducing the amount of data we have to
process in order to get good predictions, while also greatly improving our ability to work around bad
data. It also introduced you to the concept of feature engineering, an important skill in data science. There
are many ways to manipulate data to make it more accessible to the tools you are using. Finally, you were
introduced to Kaggle, a popular website with many potentially interesting data science challenges for you
to use to grow your knowledge.
REGRESSION TREES

We've worked with categorical and continuous attributes and categorical outcomes. Now we're going to
look into the situation where the outcome attribute has continuous values. This is called regression.

To better understand the concept let's use the census data to try to predict a person's age:

import dtree

continuousAttributes = ['Age']
data = dtree.read_csv('census.csv')
data = dtree.prepare_data(data, continuousAttributes)
outcomeLabel = 'Age'

tree = dtree.build(data, outcomeLabel, continuousAttributes)
print(tree)

When you run that you get a massive tree that categorizes everyone by their age. Here's a snippet of the
rendered tree:

Unfortunately it also uses the Age attribute when branching. We can fix that by excluding the outcome
index from the continuous attribute indexes in build, if present.
Handle numeric predictions
dtree.py

...
if outcomeIndex in continuousAttrIndexes:
continuousAttrIndexes.remove(outcomeIndex)

nodes = []
...

Let's also prune the tree a bit.


test.py

...
tree = dtree.build(data, outcomeLabel, continuousAttributes,
minimumSubsetSizePercentage=6)
print(tree)

Run again and it does what we want; here's a snippet.

This tells us that all 19 of the unmarried women were between 1 and 21 years old and that, for example,
11% were 15 years old.

Now let's think about what we'd want get_prediction to return for a continuous attribute. Should it
pick a random weighted value from the collection like it does for categorical attributes? That would not
be ideal if you were trying to decide someone's salary based on the breadth of their experience, would it?
It is probably better if the result is consistent.

What about taking the average? That solves the issue raised above but there are two new potential
problems, depending on how you want to handle your data. First, the average value is probably going to
be one that isn't in the data. For example, if the collection values were [2, 3, 5, 7] the average would be
17/4 or 4.25, which is not a value in the collection. The other problem involves the number of significant
digits and rounding in the result. Using the example above, should the function return 4, 4.2, 4.3, or 4.25?
Or perhaps it should round to 5, the nearest number in the data.

Another option is to return the value that occurs most often, but how do you break ties? And what
happens when all the values are different?

A fourth option is to return the median value from the collection. This results in a consistent value that is
also in the original data.

I'm sure you can think of other options. I prefer to return the median value, so that's what we'll implement.

First we'll add a variable in build to track the fact that the outcome is a continuous attribute, and pass it
to the DTree constructor.

dtree.py

...
if outcomeIndex in continuousAttrIndexes:
continuousAttrIndexes.remove(outcomeIndex)
outcomeIsContinuous = True
else:
outcomeIsContinuous = False

nodes = []
...
return DTree(nodes, data[0], outcomeIsContinuous)

The DTree constructor stores the value.

...
class DTree:
def __init__(self, nodes, attrNames, outcomeIsContinuous=False):
self._nodes = nodes
self._attrNames = attrNames
self._outcomeIsContinuous = outcomeIsContinuous
...
Then in get_prediction, if the outcome attribute is continuous we'll sort the value-percentage pairs
by value and then set the percent we want to .5 so we get the median value.

def get_prediction(self, data):
    ...
    if type(node) is not list:
        return node
    if self._outcomeIsContinuous:
        node = sorted(node, key=lambda n: n[0])
    randPercent = .5 if self._outcomeIsContinuous else \
        random.uniform(0, 1)
    total = 0
    ...
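To see why a fixed percent of .5 returns a middle value that actually exists in the data, here is a small
standalone illustration - not code from the book - using the [2, 3, 5, 7] example from earlier, where each
value has an equal 25 percent share:

node = sorted([(2, 0.25), (3, 0.25), (5, 0.25), (7, 0.25)])
randPercent = .5
total = 0
for outcome, percentage in node:
    total += percentage
    if total > randPercent:
        # prints 5 - the first value whose cumulative share exceeds the
        # midpoint, and a value that exists in the original data
        print(outcome)
        break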

Now add a prediction request to the test file:


test.py

...
testData = ['Elizabeth', 'female', 'Single', -1, 'Daughter', 'Germany']

predicted = tree.get_prediction(testData)
print("predicted: {}".format(predicted))

Finally, run it to see the median age.

predicted: 8

Great!

By adding detection of the situation where the outcome attribute is also a continuous attribute, and
adjusting the output of the get_prediction function when that happens, we've turned our
classification tree into a classification and regression tree. Now you see why the two are often discussed
together and referred to as CART.
Exercise
Try to predict the Fare amount of passengers in the Titanic data. You may want to engineer the Fare
values to be per person instead of per family. Does using features from the Name field improve your
ability to predict the fare? What about Cabin deck or features from the Ticket?
Summary
This chapter introduced the concept of regression. We also made a couple of changes that facilitate
building regression trees and obtaining prediction values from them.
BOOSTING

Now we're going to learn a popular way to improve the predictive capability of random forests. Let's
start with the data file for another classification problem from Kaggle. You can download it from:

https://www.kaggle.com/uciml/mushroom-classification

If you look at the contents of mushrooms.csv you'll notice that every field has been reduced to a
single-character code. The codes are explained somewhat on the project page. You may also notice
that one of the columns has the code ?, meaning missing. Presumably that means they don't have the data,
rather than that the mushroom's stalk-root is missing. That would be a place where we could use some
feature engineering to guess at the possible value, but we're trying to decide whether the mushrooms are
poisonous or edible, so why take the risk of making a change that influences the outcome the wrong way?
Either way, it isn't necessary for our purposes. There are no continuous columns, like height or width, so
we do not have any data preparation of that kind to perform either.

We'll begin by reading the data from the CSV file.


test.py

import dtree
import random
import forest

data = dtree.read_csv('mushrooms.csv')
outcomeLabel = 'class'
outcomeLabelIndex = data[0].index(outcomeLabel)
continuousAttributes = []

Next we'll add a benchmarking function that randomly selects 1 percent of the data to build a decision
tree. It then uses the remainder of the data to test the decision tree's ability to predict whether the
mushrooms are edible or poisonous.
test.py

def predict():
trainingRowIds = random.sample(range(1, len(data)),
int(.01 * len(data)))
tree = dtree.build(data, outcomeLabel, continuousAttributes,
dataIndexes=trainingRowIds)
correct = sum(1 for rowId, row in enumerate(data) if
rowId > 0 and
rowId not in trainingRowIds and
tree.get_prediction(row) == row[outcomeLabelIndex])
return 100 * correct / (len(data) - 1 - len(trainingRowIds))

forest.Benchmark.run(predict)

results from 3 runs

100 92.90 3.27
100 93.66 2.87
100 92.91 3.42

For comparison we'll replace the decision tree with a random forest and run again.
test.py

def predict():
trainingRowIds = random.sample(range(1, len(data)),
int(.01 * len(data)))
f = forest.Forest(data, outcomeLabel, continuousAttributes,
trainingRowIds)

results from 3 runs

100 92.22 2.57
100 92.60 2.41
100 92.34 2.49

As you can see, the random forest has a better (smaller) standard deviation, meaning it groups the results
more tightly, but it had about the same to fractionally worse ability to make correct predictions as the
decision tree. Why is that?

The random forest is just a collection of randomly generated decision trees. We hope that by asking
the same question of 200 different random groupings of the training data we'll get better answers. But on
this problem we do not. The reason is that every one of those trees gets an equal vote, even though some are
wrong more than half of the time. What if we were to increase the voting power of trees that provide
correct predictions more often and reduce the voting power of those that provide incorrect predictions
more often? That's called boosting.
Add voting power
We'll implement boosting in two rounds. The first round adds an optional parameter to the Forest
constructor and implements the voting power concept as a floating-point weight for each tree, initialized
to 0.5.
forest.py

class Forest:
def __init__(self, data, outcomeLabel, continuousAttributes=None,
dataRowIndexes=None, columnsNamesToIgnore=None,
boost=False):
...
self.numTrees = 200
self.boost = boost
self.weights = [.5 for _ in range(0, self.numTrees)]
self.populate()

Then, instead of simply using a Counter in _get_predictions to sum the number of votes for each
prediction, we now sum the weights of the trees grouped by the predicted outcome.
forest.py

import operator

...
def _get_predictions(self, data):
predictions = [t.get_prediction(data) for t in self._trees]
counts = {p: 0 for p in set(predictions)}
for index, p in enumerate(predictions):
counts[p] += self.weights[index]
return sorted(counts.items(), key=operator.itemgetter(1),
reverse=True)

This set of changes gives every tree the same voting power when we're not boosting. You can run the
code to verify that it still works.
Adjust the weights
Now we'll add code to adjust the weight of each tree up or down by a small fraction based on whether its
prediction is correct or incorrect, respectively.

We'll start with a change to _get_predictions to return both the sorted predictions and the list of
raw predicted outcomes.
forest.py

def _get_predictions(self, data):
    ...
    return sorted(counts.items(), key=operator.itemgetter(1),
                  reverse=True), \
        predictions

And a compensating change to get_prediction to ignore the 2nd returned value.

def get_prediction(self, data):
    sorted_predictions, _ = self._get_predictions(data)

Then add a guard in populate to return early when we're not boosting.

def populate(self):
self._trees = [self._build_tree() for _ in range(0, self.numTrees)]

if not self.boost:
return
...

The rest is new code in populate. It starts with a loop that will run until no weight adjustments are
made, or 10 rounds, whichever comes first. We could run more rounds, or make it configurable, but 10
rounds are enough for this problem. An inner loop gets the predictions from each tree for each row of
training data.

...
outcomeLabelIndex = self.data[0].index(self.outcomeLabel)
anyChanged = True
roundsRemaining = 10
while anyChanged and roundsRemaining > 0:
anyChanged = False
roundsRemaining -= 1
for dataRowIndex in self.dataRowIndexes:
dataRow = self.data[dataRowIndex]
sorted_predictions, predictions = self._get_predictions(
dataRow)

If the outcome for that row was predicted correctly then it goes on to the next row. Otherwise it sets the
flag to indicate that a weight will be changed this round.

...
expectedPrediction = dataRow[outcomeLabelIndex]
if expectedPrediction == sorted_predictions[0][0]:
continue
anyChanged = True
...

It then calculates the difference between the sum of the weights of the trees that predicted the wrong
outcome and that of those that predicted the correct outcome.

...
actualPrediction = sorted_predictions[0][0]
lookup = dict(sorted_predictions)
expectedPredictionSum = lookup.get(expectedPrediction)
difference = sorted_predictions[0][1] if \
expectedPredictionSum is None else \
sorted_predictions[0][1] - expectedPredictionSum
...

That value is then divided by the number of training data rows because each row will get a chance to
adjust the weight if necessary. If the result is zero it is set to a neutral value.

...
maxDifference = difference / len(self.dataRowIndexes)
if maxDifference == 0:
maxDifference = .5 / len(self.dataRowIndexes)
...

Finally, the weight of each tree that predicted the correct outcome is increased by a small random
fraction, with the maximum final weight being no greater than 1. And the weight of each tree that
predicted the incorrect winning outcome is decreased by a small random fraction. If any tree's weight
reaches or goes below zero, that tree is replaced with a new tree and its weight is reset to the default.

...
for index, p in enumerate(predictions):
if p == expectedPrediction:
self.weights[index] = min(1, self.weights[
index] + random.uniform(0, maxDifference))
continue
if p == actualPrediction:
self.weights[index] = max(0, self.weights[
index] - random.uniform(0, maxDifference))
if self.weights[index] == 0:
self._trees[index] = self._build_tree()
self.weights[index] = 0.5

Another way to perform weight corrections is to use a function, such as an s-shaped, or sigmoid, curve, to limit the rate of weight
changes in areas where the weights are more likely to be correct. With a sigmoid function, for example, corrections near 0 and 1
are large because those weights are probably wrong (it is unlikely that a tree built with the same data will be consistently right or
consistently wrong), while corrections near .5 are small.
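One way to realize that idea - an interpretation of the note above, not the book's implementation - is to
scale each correction by a sigmoid of the weight's distance from the neutral value of .5; the
correction_scale name and steepness value below are invented for the sketch:

import math

def correction_scale(weight, steepness=10):
    # distance from the neutral weight of .5, in the range [0, .5]
    distance = abs(weight - 0.5)
    # rescaled sigmoid: near 0 when the weight is close to .5,
    # approaching 1 as the weight nears 0 or 1
    return 2 * (1 / (1 + math.exp(-steepness * distance)) - 0.5)

# in populate, the adjustment could then be scaled, for example:
# delta = random.uniform(0, maxDifference) * correction_scale(self.weights[index])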

That's it.

Now set boost to True

test.py

def predict():
trainingRowIds = random.sample(range(1, len(data)),
int(.01 * len(data)))
f = forest.Forest(data, outcomeLabel, continuousAttributes,
trainingRowIds, boost=True)
...

and run the test again.


results from 3 runs

100 94.86 1.91
100 94.78 1.74
100 94.78 1.82

Not bad. Boosting earned a 1 to 2 percentage point overall improvement in the ability to predict the
correct outcome while also achieving that accuracy more consistently.
Exercise
Try using 2, 5 and 10 percent of the data for training. Try using boost when building the forest for the
Titanic data.
Summary
In this chapter we learned how adjusting the voting power of the randomly selected decision trees in the
random forest can improve its overall ability to predict outcomes. This is a common technique for tuning
the categorization process to the data being categorized. As a result, there are many different boosting
algorithms.
AFTERWORD

This book has given you a solid introduction to tree-based machine learning algorithms. There is still a lot
more to learn on this topic but you now know enough to teach yourself, and that will lead to true mastery.
For your next step you have several options; they include:

use the dtree and forest modules from this book to explore classification problems in your field of
expertise,
switch to a different Python-based module and repeat some of the experiments in order to spin up your
knowledge of that module,
learn about another machine learning tool like genetic algorithms or neural networks.

Good luck!

Clinton Sheppard

Twitter: @gar3t
Goodreads: https://www.goodreads.com/handcraftsman

Other books by Clinton Sheppard

Get a hands-on introduction to machine learning with genetic algorithms using Python. Step-by-step
tutorials build your skills from Hello World! to optimizing one genetic algorithm with another, and finally
genetic programming; thus preparing you to apply genetic algorithms to problems in your own field of
expertise.
Genetic algorithms are one of the tools you can use to apply machine learning to finding good, sometimes
even optimal, solutions to problems that have billions of potential solutions. This book gives you
experience making genetic algorithms work for you, using easy-to-follow example projects that you can
fall back upon when learning to use other machine learning tools and techniques. Each chapter is a step-
by-step tutorial that helps to build your skills at using genetic algorithms to solve problems using Python.

Available from major stores including Amazon, Apple and Barnes & Noble, in paperback, ePub, Kindle
and PDF formats.

https://github.com/handcraftsman/GeneticAlgorithmsWithPython

You might also like