from sklearn import *


and other dead-giveaways that you’re a fake data scientist

Conor Lazarou Mar 22, 2020 · 9 min read

Like most news outlets writing about “hackers”, fake data scientists happily use any random code they find online
(image source: Picography)

These days it seems like everyone and their dog is marketing themselves as data
scientists—and you can hardly blame them, with “data scientist” being declared the
Sexiest Job of the 21st Century and carrying the salary to boot. Still, blame them we will,
since many of these posers grift their way from company to company despite having
little or no practical experience and even less of a theoretical foundation. In my
experience interviewing and collaborating with current and prospective data
scientists, I’ve found a handful of tells that separate the posers from the genuine
articles. I don’t mean to belittle self-taught and aspiring data scientists — in fact, I
think this field is especially appropriate for passionate self-learners — but I definitely
mean to belittle the sort of person who takes a single online course and ever after styles
themselves an expert, despite having no knowledge of (or interest in) the fundamental
theory of the field. I’ve compiled this list of tells so that, if you’re a hiring manager and
don’t know what you’re looking for in a data scientist, you can filter out the slag, and if
you’re an aspiring data scientist and any of these resonate with you, you can fix them
before you turn into a poser yourself. Here are three broad domains of data science
faux pas with specific examples that will land your resume in the bin.

1: You Don’t Bother with Data Exploration


Data exploration is the first step in any machine learning project. If you don’t take the
time to familiarize yourself with your data and get well-acquainted with its features
and quirks, you’re going to waste a significant amount of time barking up the wrong
decision tree before arriving at a usable product — if you even make it there.

A) You don’t visualize your data


Exploratory data visualization is the best way to start any data-related project. If you’re
applying machine learning, it’s likely that you’re working with a high volume of high-
dimensional data; perusing a .csv in Excel or running a df.describe() is not a suitable
alternative to proper data visualization. Francis Anscombe illustrated the importance
of data visualization with his famous quartet:

Anscombe’s Quartet (Source: Wikimedia Commons)



The datasets in each panel all have essentially identical summary statistics: the x
and y means, x and y sample variances, correlation coefficients, R-squared values, and
lines of best fit are all (nearly) identical. If you don’t visualize your data and rely on
summary stats, you might think these four datasets have the same distribution, when a
cursory glance shows that this is obviously not the case.
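
You can verify this for yourself: the quartet ships with seaborn as a sample
dataset. A minimal sketch (assuming pandas, matplotlib, and seaborn are installed):

import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet is bundled with seaborn as a tidy DataFrame
df = sns.load_dataset("anscombe")

# The summary stats look interchangeable across the four datasets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))

# ...but one scatter plot per dataset tells four very different stories
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
plt.show()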

Data visualization allows you to identify trends, artifacts, outliers, and distributions in
your data; if you skip this step, you might as well do the rest of the project blindfolded,
too.

B) You don’t clean your data


Data is messy: values get entered wrong; conversions run awry; sensors go haywire. It’s
important that you resolve these issues before you waste months and months on a
dead-end project, and it’s mission critical that you resolve them before pushing your
models to production. Remember: garbage in ⇒ garbage out.

There are lots of good ways to identify problems with your data and no good ways to
identify them all. Data visualization is a good first step (have I mentioned this?), and
although it can be a tedious and manual process it pays for itself many times over.
Other methods include automatic outlier detection and conditional summary stats.

For an example, consider this histogram of human heights:

A histogram of adult human heights

Training a model with this data would doubtless lead to poor results. But, by inspecting
the data, we find that the 100 “outliers” in fact had their height entered in metres
rather than centimetres. This can be corrected by multiplying these values by 100.
Properly cleaning the data not only prevents the model from being trained on bad data,
but, in this case, let us salvage 100 data points that might otherwise have been thrown
out. If you don’t clean your data properly, you’re leaving money on the table at best and
building a defective model at worst.
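
As a sketch of what that fix might look like in pandas (the DataFrame and column
name here are hypothetical):

import pandas as pd

# Hypothetical data: most heights in centimetres, a few entered in metres
df = pd.DataFrame({"height_cm": [172.0, 165.5, 1.80, 180.2, 1.62]})

# No adult is under 3 cm tall, so anything below 3 was almost certainly
# entered in metres rather than centimetres
entered_in_metres = df["height_cm"] < 3

# Salvage those rows instead of throwing them out
df.loc[entered_in_metres, "height_cm"] *= 100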

C) You don’t bother with feature selection and engineering


One of the cool things about neural networks is that you can often throw all your raw
data at it and it will learn some approximation of your target function. Sorry, typo, I
meant one of the worst things. It’s convenient, sure, but inefficient and brittle. Worst of
all, it makes beginner data scientists reliant on deep learning when it’s often the case
that a more traditional machine learning approach would be appropriate, sending
them on a slow descent to poserdom. There’s no “right” way to do feature selection and
engineering, but there are a few key outcomes to strive for:

Dimensionality Reduction: More data isn’t always better. Often, you want to
reduce the number of features before fitting your model. This typically involves
removing irrelevant and redundant data, or combining multiple related fields into
one.

Data Formatting: Computers are dumb. You need to convert your data into a
format that your model will easily understand: neural networks like numbers
between -1 and 1; categorical data should be one-hot encoded; ordinal data
(probably) shouldn’t be represented as a single floating point field; it may be
beneficial to log transform your exponentially-distributed data. Suffice it to say,
there’s a lot of model-dependent nuance in data formatting.

Creating Domain-Specific Features: It’s often productive to create your own
features from data. If you have count data, you may want to convert it into a
relevant binary threshold field, such as “≥100” vs “<100”, or “is 0” vs “is not 0”. If
you have continuous data x and z, you may want to include fields x², xz, and z²
alongside x and z in your feature set. This is a highly problem-dependent practice,
but if done right can drastically improve model performance for some types of
models.
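
Here’s a quick sketch of a few of these transformations in pandas and numpy (the
column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "green", "red", "blue"],    # categorical
    "income": [30_000, 45_000, 250_000, 60_000],  # skewed, roughly exponential
    "visits": [0, 3, 150, 12],                    # count data
    "x": [1.0, 2.0, 3.0, 4.0],
    "z": [0.5, 1.5, 2.5, 3.5],
})

df = pd.get_dummies(df, columns=["colour"])  # one-hot encode categoricals
df["log_income"] = np.log1p(df["income"])    # log transform skewed data
df["visits_gte_100"] = (df["visits"] >= 100).astype(int)  # binary threshold
df["x2"], df["xz"], df["z2"] = df["x"] ** 2, df["x"] * df["z"], df["z"] ** 2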

Most laypeople think that machine learning is all about black boxes that magically
churn out results from raw data; please don’t contribute to this misconception.

2: You Fail to Choose an Appropriate Model Type


Machine learning is a broad field with a rich history, and for much of that history it
went by the name “statistical learning”. With the advent of easy-to-use open source
machine learning tools like Scikit-Learn and TensorFlow, combined with the deluge of
data we now collect and a ubiquity of fast computers, it’s never been easier to
experiment with different ML model types. However, it’s not a coincidence that
removing the requirement that ML practitioners actually understand how different
model types work has led to many ML practitioners not understanding how different
model types work.

A) You just try everything


The github repos of aspiring data scientists are littered with Kaggle projects and online
course assignments-cum-portfolios that look like this:

from sklearn import *


for m in [SGDClassifier, LogisticRegression, KNeighborsClassifier,
          KMeans, KNeighborsClassifier, RandomForestClassifier]:
    m.overfit(X_train, y_train)

This is an obvious giveaway that you don’t understand what you’re doing, and it’s a
damned shame that so many online courses recommend this course of action. It’s a
waste of time and easily leads to inappropriate model types being selected because
they happened to work well on the validation data (you remembered to hold out a
validation set, right? Right?). The type of model used should be selected based on the
underlying data and the needs of the application, and the data should be engineered to
match the chosen model. Selecting a model type is an important part of the data
science process, and direct comparison between a handful of appropriate models may
be warranted, but blindly applying every tool you can in order to find the one with “the
best number” is a major red flag. In particular, this belies an underlying problem which
is that…

B) You don’t actually understand how different model types work


Why might a KNN classifier not work so well if your inputs are “car age in years” and
“kilometres traveled”? What’s the problem with applying linear regression to predict
global population growth? Why isn’t my random forest classifier working on my
dataset with a 1000-category one-hot-encoded variable? If you can’t answer those
questions, that’s okay! There are lots of great resources to learn how each of these
techniques work; just be sure to read and understand them before you apply for a job in
the field.
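
For the first question, the issue is scale: KNN distances are dominated by whichever
feature has the largest range, so kilometres (tens of thousands) drown out years
(single digits). A toy illustration of the usual fix:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy features: [car age in years, kilometres traveled]
X = np.array([[2, 30_000], [3, 31_000], [10, 30_500], [11, 150_000]])
y = np.array([0, 0, 1, 1])

# Standardizing puts both features on a comparable footing, so age
# actually contributes to the distance computation
X_scaled = StandardScaler().fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_scaled, y)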

The bigger problem here isn’t that people don’t know how different ML models work,
it’s that they don’t care and aren’t interested in the underlying math. If you like
machine learning but don’t like math, you don’t really like machine learning; you have
a crush on what you think it is. If you don’t care to learn how models work or are fit to
data, then you’ll have no hope of troubleshooting them when they inevitably go awry.
The problem is only exacerbated when…

C) You don’t know if you want accuracy or interpretability, or why you have to pick

All model types have their pros and cons. An important trade-off in machine learning is
that between accuracy and interpretability. You can have a model that does a poor job
of making predictions but is easy to understand and effectively explains the process,
you can have a black box which is very accurate but whose inner workings are an
enigma, or you can land somewhere in the middle.

Which type of model you choose should be informed by which of these two traits is
more important for your application. If the intent is to model the data and gain
actionable insights, then an interpretable model, such as a decision tree or linear
regression, is the obvious choice. If the application is production-level prediction such
as image annotation, then interpretability takes a backseat to accuracy and a random
forest or neural network is likely more appropriate.
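
As a sketch of what “interpretable” buys you: a small decision tree can literally
print its reasoning, which no 500-tree random forest will do for you.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The entire decision process, readable by any stakeholder
print(export_text(tree, feature_names=load_iris().feature_names))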

In my experience, data scientists who don’t understand this trade-off and those who
beeline for accuracy without even considering why interpretability matters are not the
sort you want training models for you.

3: You Don’t Use Effective Metrics and Controls


Despite making up 50% of the words and 64% of the letters, the “science” component
of data science is often ignored. It’s not uncommon for poser data scientists to blindly
apply a single metric in a vacuum as their model evaluation. Unwitting stakeholders
are easily wowed by bold claims like “90% accuracy” which are technically correct but
wildly inappropriate for the task at hand.

A) You don’t establish a baseline model


I have a test for pancreatic cancer which is over 99% accurate. Incredible, right? Well,
it’s true, and you can try it for yourself by clicking this link.

If you saw a red circle with a line through it, you’ve tested negative. If you saw a green
check mark, you’re lying. The point is, 99% of people don’t have pancreatic cancer
(more, actually, but let’s just assume it’s 99% for the sake of this example), so my silly
little “test” is accurate 99% of the time. Therefore, if accuracy is what we care about,
any machine learning model used for diagnosing pancreatic cancer should perform at
least as well as this uninformative, baseline model. If the hotshot you’ve hired fresh
out of college claims he’s developed a tool with 95% accuracy, compare those results to
a baseline model and make sure his model is performing better than chance.
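
Scikit-learn even ships an uninformative baseline for exactly this purpose; a
minimal sketch (assuming the train/test splits already exist):

from sklearn.dummy import DummyClassifier

# “Predict that nobody has cancer”: the baseline any real model must beat
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))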

B) You use the wrong metric


Continuing the diagnostic example above, it’s important to make sure you’re using the
right metric. For cancer diagnosis, accuracy is actually a bad metric; it’s often
preferable to decrease your accuracy if it means an increase in sensitivity. What’s the
cost associated with a false positive? Patient stress, as well as wasted time and
resources. What’s the cost of a false negative? Death. An understanding of the real-
world implications of your model and an appreciation of how those implications
govern metric selection clearly delineate real data scientists from their script-kiddie
lookalikes.
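
In scikit-learn terms, that means reaching past accuracy; a sketch (assuming y_test
and y_pred already exist):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Accuracy rewards calling everyone healthy; recall (sensitivity) asks:
# of the true cancer cases, how many did we actually catch?
print("Accuracy:   ", accuracy_score(y_test, y_pred))
print("Sensitivity:", recall_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # the false negatives live here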

C) You bungle the train/test split


This is a big one, and it’s far too common. Properly testing a model is absolutely
essential to the data science process. There are many ways this can go awry: not
understanding the difference between validation and test data, performing data
augmentation before splitting, not plugging data leaks, ignoring data splitting
altogether… There’s not much to say about this other than that if you don’t know or
care how to create a proper holdout set, all your work has been a waste of time.
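
As a sketch of the right order of operations: split first, then fit any
preprocessing on the training data only, so nothing about the test set leaks into
the model (X and y assumed to exist):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split before any preprocessing or augmentation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only; the test set is transformed
# with the training statistics, never refit
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)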

…to import tensorflow as tf

These are only a handful of tells that give up the game. With enough experience,
they’re easy to spot, but if you’re just starting out in the field it can be hard to separate
the Siraj Ravals of the world from the Andrew Ngs. Now, I don’t mean to gatekeep the
field to aspiring data scientists; if you feel attacked by any of the above examples, I’m
glad to hear it because it means you care about getting things right. Keep studying, keep
climbing so that you too can be endlessly irked by the sea of posers.
