Like most news outlets writing about “hackers”, fake data scientists happily use any random code they find online. (Image source: Picography)
These days it seems like everyone and their dog is marketing themselves as a data scientist, and you can hardly blame them, with “data scientist” being declared the Sexiest Job of the Century and carrying the salary to boot. Still, blame them we will, since many of these posers grift their way from company to company despite having little or no practical experience and even less of a theoretical foundation. In my experience interviewing and collaborating with current and prospective data scientists, I’ve found a handful of tells that separate the posers from the genuine articles. I don’t mean to belittle self-taught and aspiring data scientists; in fact, I think this field is especially well suited to passionate self-learners. But I definitely mean to belittle the sort of person who takes a single online course and ever after styles themselves an expert, despite having no knowledge of (or interest in) the field’s fundamental theory. I’ve compiled this list of tells so that, if you’re a hiring manager who doesn’t know what to look for in a data scientist, you can filter out the slag, and if you’re an aspiring data scientist and any of these resonate with you, you can fix them before you turn into a poser yourself. Here are three broad domains of data science faux pas, with specific examples that will land your resume in the bin.
The four datasets of Anscombe’s quartet all have essentially identical summary statistics: the x and y means, the x and y sample variances, the correlation coefficients, the R-squared values, and the lines of best fit are all (nearly) the same. If you don’t visualize your data and rely on summary stats alone, you might think these four datasets have the same distribution, when a cursory glance at the plots shows that this is obviously not the case.
Data visualization allows you to identify trends, artifacts, outliers, and distributions in
your data; if you skip this step, you might as well do the rest of the project blindfolded,
too.
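You can verify this yourself: seaborn happens to ship Anscombe’s quartet as a built-in example dataset. A minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet ships with seaborn: columns are "dataset", "x", "y"
df = sns.load_dataset("anscombe")

# Near-identical summary statistics across all four datasets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...but one scatter plot per dataset reveals four very different shapes
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()
```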
There are lots of good ways to identify problems with your data and no good ways to
identify them all. Data visualization is a good first step (have I mentioned this?), and
although it can be a tedious and manual process, it pays for itself many times over.
Other methods include automatic outlier detection and conditional summary stats.
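As a sketch of those two methods (the data here is synthetic and the column names are made up):

```python
import numpy as np
import pandas as pd

# Made-up data: heights grouped by a hypothetical "country" column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": rng.choice(["NL", "JP"], size=1000),
    "height_cm": rng.normal(170, 10, size=1000),
})

# Conditional summary stats: describe each group separately
print(df.groupby("country")["height_cm"].describe())

# Crude automatic outlier detection: flag rows > 3 standard deviations out
z = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()
print(df[z.abs() > 3])
```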
Imagine a dataset of human heights in centimetres where 100 entries cluster around 1.7 instead of 170. Training a model with this data would doubtless lead to poor results. But, by inspecting the data, we find that the 100 “outliers” in fact had their height entered in metres rather than centimetres. This can be corrected by multiplying those values by 100.
Properly cleaning the data not only prevents the model from being trained on bad data,
but, in this case, let us salvage 100 data points that might otherwise have been thrown
out. If you don’t clean your data properly, you’re leaving money on the table at best and
building a defective model at worst.
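A minimal sketch of that particular fix, assuming a hypothetical height_cm column:

```python
import pandas as pd

# Hypothetical column: mostly centimetres, a few rows mistakenly in metres
df = pd.DataFrame({"height_cm": [172.0, 165.5, 1.81, 190.2, 1.65, 178.0]})

# No adult is under 3 cm tall, so anything below 3 must be in metres
entered_in_metres = df["height_cm"] < 3

# Salvage those rows rather than discarding them as outliers
df.loc[entered_in_metres, "height_cm"] *= 100
print(df)
```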
Dimensionality Reduction: More data isn’t always better. Often, you want to
reduce the number of features before fitting your model. This typically involves
removing irrelevant and redundant data, or combining multiple related fields into
one.
Data Formatting: Computers are dumb. You need to convert your data into a format that your model can easily digest: neural networks like inputs scaled to a small range such as -1 to 1; categorical data should be one-hot encoded; ordinal data (probably) shouldn’t be represented as a single floating-point field; and it may be beneficial to log-transform exponentially distributed data. Suffice it to say, there’s a lot of model-dependent nuance in data formatting (both this and dimensionality reduction are sketched in code after this list).
All of this preparation can be tedious, but if done right it can drastically improve model performance for some types of models.
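To make those bullet points concrete, here is a hedged sketch of a few such transformations on made-up data; which ones are appropriate depends entirely on your model:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Made-up raw data
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],    # categorical
    "shirt_size": ["S", "M", "L", "M"],             # ordinal
    "income": [30_000, 45_000, 1_200_000, 52_000],  # heavily skewed
})

# Categorical data: one-hot encode
df = pd.get_dummies(df, columns=["colour"])

# Ordinal data: ordered integers, not an arbitrary float or one-hot columns
df["shirt_size"] = df["shirt_size"].map({"S": 0, "M": 1, "L": 2})

# Skewed data: log-transform, then squash into the range neural networks like
df["income"] = np.log(df["income"])
df[["income"]] = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df[["income"]])
print(df)

# Dimensionality reduction: project redundant columns onto the few
# directions that carry 95% of the variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))           # 5 real degrees of freedom
X_wide = latent @ rng.normal(size=(5, 40))   # 40 redundant columns
X_reduced = PCA(n_components=0.95).fit_transform(X_wide)
print(X_wide.shape, "->", X_reduced.shape)   # far fewer columns, little lost
```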
Most laypeople think that machine learning is all about black boxes that magically
churn out results from raw data; please don’t contribute to this misconception.
A) You throw every model at the data and keep whichever scores best

This is an obvious giveaway that you don’t understand what you’re doing, and it’s a damned shame that so many online courses recommend this course of action. It’s a
waste of time and easily leads to inappropriate model types being selected because
they happened to work well on the validation data (you remembered to hold out a
validation set, right? Right?). The type of model used should be selected based on the
underlying data and the needs of the application, and the data should be engineered to
match the chosen model. Selecting a model type is an important part of the data
science process, and direct comparison between a handful of appropriate models may
be warranted, but blindly applying every tool you can in order to find the one with “the
best number” is a major red flag. In particular, it betrays an underlying problem, which is that…
B) You don’t know how any of these models actually work

Why might a KNN classifier not work so well if your inputs are “car age in years” and “kilometres traveled”? What’s the problem with applying linear regression to predict
global population growth? Why isn’t my random forest classifier working on my
dataset with a 1000-category one-hot-encoded variable? If you can’t answer those
questions, that’s okay! There are lots of great resources to learn how each of these
techniques work; just be sure to read and understand them before you apply for a job in
the field.
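For the curious, here is a sketch of the first question on synthetic data. KNN is distance-based, so an unscaled “kilometres traveled” column drowns out “car age in years” entirely:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic cars: age in years (0-20), distance traveled in km (0-300,000)
n = 1000
age = rng.uniform(0, 20, n)
km = rng.uniform(0, 300_000, n)
X = np.column_stack([age, km])
y = (age + km / 15_000 > 20).astype(int)  # made-up "worn out" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unscaled: nearest-neighbour distances are dominated by the km column,
# so the age feature is effectively invisible to the model
knn = KNeighborsClassifier().fit(X_train, y_train)
print("unscaled:", knn.score(X_test, y_test))

# Scaled: both features now contribute comparably to the distance metric
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("scaled:  ", knn.score(scaler.transform(X_test), y_test))
```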
The bigger problem here isn’t that people don’t know how different ML models work,
it’s that they don’t care and aren’t interested in the underlying math. If you like
machine learning but don’t like math, you don’t really like machine learning; you have
a crush on what you think it is. If you don’t care to learn how models work or are fit to
data, then you’ll have no hope of troubleshooting them when they inevitably go awry.
The problem is only exacerbated when…
C) You don’t know if you want accuracy or interpretability, or why you have
to pick
All model types have their pros and cons. An important trade-off in machine learning is
that between accuracy and interpretability. You can have a model that does a poor job
of making predictions but is easy to understand and effectively explains the process,
you can have a black box which is very accurate but whose inner workings are an
enigma, or you can land somewhere in the middle.
Which type of model you choose should be informed by which of these two traits is
more important for your application. If the intent is to model the data and gain
actionable insights, then an interpretable model, such as a decision tree or linear
regression, is the obvious choice. If the application is production-level prediction such
as image annotation, then interpretability takes a backseat to accuracy and a random
forest or neural network is likely more appropriate.
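To see the trade-off in miniature, here is a toy sketch with scikit-learn (on a real production problem the gap is usually far starker):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Interpretable end of the trade-off: a shallow tree you can print and explain
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(data.feature_names)))
print("tree accuracy:  ", tree.score(X_test, y_test))

# Accurate-but-opaque end: hundreds of trees whose combined vote is a black box
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
print("forest accuracy:", forest.score(X_test, y_test))
```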
In my experience, data scientists who don’t understand this trade-off and those who
beeline for accuracy without even considering why interpretability matters are not the
sort you want training models for you.
Fake data scientists, and the people who hire them, are easily wowed by bold claims like “90% accuracy” which are technically correct but wildly inappropriate for the task at hand.
Imagine a free “pancreatic cancer test”: I show you an image of a red circle with a line through it and tell you that if you can see it, you’ve tested negative. If you saw a green check mark instead, you’re lying. The point is, 99% of people don’t have pancreatic cancer (more, actually, but let’s just assume it’s 99% for the sake of this example), so my silly little “test” is accurate 99% of the time. Therefore, if accuracy is what we care about,
any machine learning model used for diagnosing pancreatic cancer should perform at
least as well as this uninformative, baseline model. If the hotshot you’ve hired fresh
out of college claims he’s developed a tool with 95% accuracy, compare those results to
a baseline model and make sure his model is performing better than chance.
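That sanity check is nearly a one-liner with scikit-learn’s DummyClassifier. A hedged sketch on synthetic screening data with a 1% positive rate (the feature and models here are made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic screening data: 1% positive rate, one weakly informative feature
n = 20_000
y = (rng.random(n) < 0.01).astype(int)
X = (y + rng.normal(0.0, 1.5, n)).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Uninformative baseline: always predict "no cancer" -> ~99% accuracy
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# An actual model, class-weighted so it doesn't just mimic the baseline
model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

for name, clf in [("baseline", baseline), ("model   ", model)]:
    pred = clf.predict(X_test)
    print(f"{name} accuracy={accuracy_score(y_test, pred):.2f} "
          f"recall={recall_score(y_test, pred):.2f}")
# The baseline "wins" on accuracy while catching zero cancer cases;
# recall tells the story that accuracy hides.
```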
These are only a handful of tells that give up the game. With enough experience, they’re easy to spot, but if you’re just starting out in the field it can be hard to separate
the Siraj Ravals of the world from the Andrew Ngs. Now, I don’t mean to gatekeep the field from aspiring data scientists; if you feel attacked by any of the above examples, I’m
glad to hear it because it means you care about getting things right. Keep studying, keep
climbing so that you too can be endlessly irked by the sea of posers.