Adding in the new POIs in this example, none of whom we have financial information
for, has introduced a subtle problem: our lack of financial information about
them can be picked up by an algorithm as a clue that they're POIs. Another way to
think about this is that there's now a difference in how we generated the data for
our two classes--non-POIs all come from the financial spreadsheet, while many POIs
are added in by hand afterwards. That difference can trick us into thinking we have
better performance than we do. Suppose you use your POI detector to decide whether
a new, unseen person is a POI, and that person isn't on the spreadsheet. Then all
their financial data would contain NaN, but the person is very likely not a POI
(there are many more non-POIs than POIs in the world, even at Enron)--yet you'd be
likely to accidentally identify them as one!
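A quick way to spot this kind of leakage is to compare the missing-value rate per class. The sketch below is illustrative only: it assumes a dictionary of person -> feature dict where missing values appear as the string "NaN" and the label is stored under a "poi" key (a hypothetical layout, not necessarily the one your data uses).

```python
def nan_rate_by_class(data_dict, feature):
    """Fraction of people missing `feature`, split by POI / non-POI."""
    counts = {True: [0, 0], False: [0, 0]}  # label -> [num_nan, total]
    for person, features in data_dict.items():
        label = features["poi"]
        counts[label][1] += 1
        if features.get(feature, "NaN") == "NaN":
            counts[label][0] += 1
    return {label: num_nan / total
            for label, (num_nan, total) in counts.items()}

# Toy data: the hand-added POIs have no financial info at all.
toy = {
    "PERSON_A": {"poi": True,  "salary": "NaN"},
    "PERSON_B": {"poi": True,  "salary": "NaN"},
    "PERSON_C": {"poi": False, "salary": 365788},
    "PERSON_D": {"poi": False, "salary": 267102},
}
print(nan_rate_by_class(toy, "salary"))
# Every POI is missing salary and no non-POI is, so "salary is NaN"
# perfectly separates the classes -- a red flag for leakage.
```

If the NaN rates differ sharply between classes for a feature, an algorithm can exploit missingness itself as a label proxy.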

The lesson here is that, when generating or augmenting a dataset, you should be
exceptionally careful if your data come from different sources for different
classes. It can easily lead to the type of bias or mistake shown here.
There are ways to deal with this. For example, you wouldn't have to worry about
this problem if you used only email data--in that case, discrepancies in the
financial data wouldn't matter because financial features aren't being used. There
are also more sophisticated ways of estimating how much of an effect these biases
can have on your final answer; those are beyond the scope of this course.
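The email-only workaround might look like the following sketch. The feature names here are illustrative assumptions, not taken from any particular dataset; the point is simply to drop the financial fields before they reach the classifier.

```python
# Hypothetical list of email-derived feature names.
EMAIL_FEATURES = ["to_messages", "from_messages",
                  "from_poi_to_this_person", "from_this_person_to_poi"]

def email_only(features):
    """Keep just the email features (and the label) from one record,
    so NaN-filled financial fields cannot leak the class."""
    return {k: v for k, v in features.items()
            if k in EMAIL_FEATURES or k == "poi"}

record = {"poi": False, "salary": "NaN", "to_messages": 807}
print(email_only(record))  # -> {'poi': False, 'to_messages': 807}
```

Because both classes' email features come from the same source, this filtering removes the source mismatch entirely rather than trying to correct for it.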

For now, the takeaway message is to be very careful about introducing features that
come from different sources depending on the class! It's a classic way to
accidentally introduce biases and mistakes.
