Professional Documents
Culture Documents
ers have taken the field over the past 150 years, providing a large, extensively-
documented cohort of men who can serve as a proxy for even larger, less well-
documented populations. Indeed, we can use this baseball player data to answer
questions like:
• How often do people return to live in the same place where they were
born? Locations of birth and death have been extensively recorded in this
data set. Further, almost all of these people played at least part of their
career far from home, thus exposing them to the wider world at a critical
time in their youth.
• To what extent have heights and weights been increasing in the population
at large?
There are two particular themes to be aware of here. First, the identifiers
and reference tags (i.e. the metadata) often prove more interesting in a data set
than the stuff we are supposed to care about, here the statistical record of play.
Second is the idea of a statistical proxy, where you use the data set you have
to substitute for the one you really want. The data set of your dreams likely
does not exist, or may be locked away behind a corporate wall even if it does.
A good data scientist is a pragmatist, seeing what they can do with what they
have instead of bemoaning what they cannot get their hands on.