
14 Best Data Science Books to Read Right Now
From textbooks to introductory
tomes and mass-market nonfiction.

Mae Rice
March 8, 2020
Updated: July 8, 2021

Chico Camargo, a postdoctoral
researcher in data science at
the Oxford Internet Institute,
came to data science from a
background in biology.

“Biology is big, messy and complex,” he
told Built In, “so I was drawn toward
tools that could help me make some
sense out of that.”

Usually, humans make sense of the
natural world’s complexity with our
own natural tools: our brains and our
senses. Data science augments those
innate capacities, though, with
algorithms and predictive models.

Camargo was especially drawn to
unsupervised machine learning and
natural language processing, which
help humans with everything from
detecting signs of metastasizing cancer
to understanding foreign languages
with Google Translate.

At this point, in fact, data science has
gotten so sophisticated that it doesn’t
just enhance our natural abilities — it
mimics them.

Take deep learning, for example. It
“uses multiple layers [of algorithms] to
progressively extract higher-level
features from raw input,” Camargo
explained.

Human vision works in a similarly
layered way. “The first layers of
neurons in our visual system are
responsible for identifying light and
dark,” Camargo said, “while the deeper
layers respond to patterns like curves
and straight lines.” Ultimately, the “nth”
layer of neurons recognizes the visual
for what it is: “Aha, it’s a face!”

In a way, data science has become
humanity’s sixth sense. Yet it’s also
probably the sense the average person
understands the least. So for anyone
hoping to learn more, we asked three
experts to recommend their favorite
data science books. Our panel
included:

Zach Miller, lead data scientist
at CreditNinja
Jeff Herman, lead data science
instructor at the Flatiron School
Chico Camargo, postdoctoral
researcher in data science at the
Oxford Internet Institute

The resulting reading list ranges from
technical machine learning and math
textbooks to sociological studies of
how algorithms impact our daily lives.

General Interest Books

EVERYBODY LIES: BIG DATA,
NEW DATA, AND WHAT THE
INTERNET CAN TELL US
ABOUT WHO WE REALLY
ARE BY SETH STEPHENS-
DAVIDOWITZ

CAMARGO: This book is like
Freakonomics in the age of data
science. It’s 100 percent not a technical
book. Every chapter tells some peculiar
story illustrating a data science
concept — like, there’s one chapter
about Google searches, another about
news, another about image data, etc.
It’s a bunch of stories of people being
creative and finding patterns in the
most random things, because these
random things actually reveal a lot. The
book has that name because you can lie
about what you eat and read, and you
can lie about who you’re going to vote
for — but if I have access to your
search history, I can figure out the
truth. It’s a book for people that are
curious about what data science is and
what it can do — especially when it
comes to social data. The author
finishes by saying the next Freud will
be a data scientist, the next Foucault
will be a data scientist, the next Marx
will be a data scientist. I think that’s a
bit much perhaps, because data
science doesn’t answer every question
ever. But it’s a fun book, to be read with
a grain of salt.

NAKED STATISTICS: STRIPPING THE
DREAD FROM DATA BY
CHARLES WHEELAN

HERMAN: This book gives a lot of
examples of how statistical concepts
apply in the real world. Wheelan does
not go into a lot of theory, but he has
some pretty interesting examples and a
kind of dry sense of humor. This is the
only statistics book that’s ever made
me laugh, and it’s the book that we
recommend our incoming students at
the Flatiron School read beforehand.
Our students come from a wide variety
of statistics backgrounds, but I’ve
always gotten really positive feedback
on it. It’s ideal for beginners, but I also
think that if you’ve never read it and
you’re in data science, it’s a great read.

WEAPONS OF MATH
DESTRUCTION: HOW BIG
DATA INCREASES
INEQUALITY AND
THREATENS DEMOCRACY BY
CATHY O’NEIL

CAMARGO: The author of this book,
Cathy O’Neil, used to be an academic
mathematician. Then she went to Wall
Street, then she went to Occupy Wall
Street and now she’s an activist raising
awareness of how algorithms rule our
lives, and how they are not as neutral
or unbiased as we like to believe. The
book is a collection of stories of
algorithms’ real-world applications,
and a lot of them are about people who
were classified as unworthy by an
algorithm. Like, someone purchased an
item at a particular shop and
automatically got their credit card limit
lowered, or a college student couldn’t
get a job at a local grocery store
because the algorithm said so.

She doesn’t just say “boo hoo, bad
algorithm, bad machine!” though — she
makes an effort to explain the
mechanisms that might make an
algorithm racist, for instance. So, why
is a policing algorithm sending officers
to black neighborhoods more often?
Well, what happened in that case is
that the algorithm was fed data on
previous police patrols, which were
more often in black neighborhoods. So
the algorithm learned that those
neighborhoods are the ones that
receive more patrols. The algorithm
simply reproduced what it was taught.
The book makes you think a lot about
how you can design algorithms and
data science practices to deal with
that.
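
The feedback loop Camargo describes can be illustrated with a toy simulation. This is a hypothetical sketch, not an example from the book: the area names, patrol counts and incident rate are invented, and it assumes incidents are only recorded where officers are already patrolling.

```python
def allocate_patrols(recorded_incidents, total_patrols):
    """Assign patrols in proportion to recorded incidents (the 'learned' rule)."""
    total = sum(recorded_incidents.values())
    return {
        area: total_patrols * count / total
        for area, count in recorded_incidents.items()
    }


def simulate(initial_patrols, true_rate, rounds):
    """Repeatedly retrain the allocation on the data produced by earlier patrols."""
    patrols = dict(initial_patrols)
    total_patrols = sum(patrols.values())
    for _ in range(rounds):
        # Assumption: an incident only enters the data if a patrol is there to record it.
        recorded = {area: patrols[area] * true_rate[area] for area in patrols}
        patrols = allocate_patrols(recorded, total_patrols)
    return patrols


# Hypothetical starting point: area_a historically received more patrols,
# even though both areas have the same underlying incident rate.
initial = {"area_a": 70.0, "area_b": 30.0}
rates = {"area_a": 0.1, "area_b": 0.1}
final = simulate(initial, rates, rounds=10)
print(final)
```

Under these assumptions the original 70/30 split never corrects itself, even though the two areas are identical: the allocation rule keeps reproducing the pattern in the data it was given.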

ALGORITHMS OF
OPPRESSION BY SAFIYA
NOBLE

CAMARGO: This book has a few
stories, with very simple “data,” which
the author explores in depth. I found it
a very interesting read, because the
author’s background is almost
diametrically opposed to mine. She’s
100 percent qualitative, telling stories
based on “small data” with a lot of
context.

In one of these stories, the author,
Safiya Noble, was organizing a party for
her niece and other children, and she
searched something like “black girls”
on Google. To her surprise, she didn’t
find pictures of children. She found
websites like “HOT BLACK SINGLES IN
YOUR AREA.” For other search terms,
like “Latina girls” and “Asian girls,” she
found the same stuff.

The reason this happened, she
explained, is Google’s revenue model.
The algorithm will serve whatever ad
pays the most. And it becomes a
troubling situation, because even
though Google is an advertising
company, we use it like a public library
— like some sort of publicly accessible
repository of information. I found it a
very sobering read.

Beginner-Friendly Textbooks

AN INTRODUCTION TO
STATISTICAL LEARNING:
WITH APPLICATIONS IN R BY
GARETH JAMES, DANIELA
WITTEN, TREVOR HASTIE
AND ROBERT TIBSHIRANI

HERMAN: When I was first learning
data science, most statistical textbooks
were kind of unreadable. They went in
depth on theory and didn’t really show
the application side. This book doesn’t
go as deep statistically as a lot of other
books, but it gives you enough
knowledge to be successful as a data
scientist, and it goes over the key
machine learning algorithms. One of
the issues people have with data
science is that algorithms are these
black boxes where you put data in and
you get data out and you have no idea
what happens in the middle. This book
gives you enough statistical knowledge
to understand what’s going on in that
black box.

It’s geared toward people that don’t
have any programming or statistics
background. That being said, I’ve
actually read this book multiple times.
Even if you’re an experienced data
scientist, a lot of statistical concepts,
you kind of forget about them over
time. As you work in a job, you’re not
going to be using every single
algorithm. You get comfortable. This
book allows you to say, okay, maybe I
should try this other algorithm.

DATA SCIENCE FROM
SCRATCH: FIRST PRINCIPLES
WITH PYTHON BY JOEL GRUS

MILLER: This book is about how to
write data science algorithms in
Python. It’s a mix between a textbook
and a normal book — a great entryway
book, very appropriate for a
layperson. So for instance, if I wanted
to learn the machine learning
algorithm Naive Bayes, this book says,
“We’re going to literally program Naive
Bayes as if it doesn’t exist in the world.
We’re going to learn the math first and
then write the code as part of that.
We’ll build this algorithm together with
nothing but Python.”

You probably want to know a little bit
of Python and a little bit of statistics
going in, but this book assumes almost
no depth of knowledge. It’s not one of
those books that’s like, “This is left to
the reader because it’s easy.” And it will
teach you all the standard machine
learning algorithms, probably 10 or 15
different ones.
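
To give a sense of what building Naive Bayes “with nothing but Python” might look like, here is a minimal sketch using only the standard library. It is not taken from the book: the function names, the toy spam/ham messages and the use of Laplace (add-one) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Start from the log prior: how common the label is overall.
        score = math.log(count / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the product.
            numerator = word_counts[label][word] + 1
            denominator = words_in_label + len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, invented for this example.
TOY_DATA = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_naive_bayes(TOY_DATA)
print(predict(model, "win a prize"))
print(predict(model, "monday meeting"))
```

Working in log-probabilities rather than multiplying raw probabilities is a common choice here, since products of many small numbers can underflow.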

HANDS-ON
MACHINE LEARNING WITH
SCIKIT-LEARN, KERAS AND
TENSORFLOW BY AURÉLIEN
GÉRON

HERMAN: This book will teach you
how to run predictive analytics. In the
data science world, there are two main
programming languages: Python and R.
There are pros and cons to both, but
this book is specifically for Python.
Scikit-Learn, Keras and TensorFlow are
all libraries of machine learning and
deep learning functions within the
Python programming language.

You have to be pretty good at these
libraries to be a data scientist. When I
was starting out, I would reference this
book daily. To this day I probably look
at it at least monthly as a reference,
because he really goes deep into
explaining how each algorithm works.
A lot of algorithms have a lot of knobs
or levers that you can turn — so
depending on what the data is doing,
you might change the algorithm a little
bit. The author explains what those
different knobs and levers are in a way
that a beginner can understand, but
someone with more experience can
appreciate the level of detail that he
goes into.

THINK STATS BY ALLEN B. DOWNEY

MILLER: Data science is a mix of three
different disciplines. One is
programming and computer science;
one is linear algebra, stats, very math-
heavy analytics; and then one is
machine learning and algorithms. The
ideal data scientist is really good at all
of them. But that doesn’t always
happen, so this book is about building
out that analytics, math and stats side
of your data science knowledge. How
do you do testing, how do you
determine whether your solutions are
working and the distributions are right,
and how do you use that math stuff to
solve business problems?

It’s textbook-y, but it isn’t a hardcore
textbook. It also merges the statistical
analysis with how you would write it in
Python. Early in my career, I found
statistics fairly easy, but making
statistics into a program was more
challenging. I found this very helpful
for making that connection.

GROKKING DEEP LEARNING
BY ANDREW W. TRASK

CAMARGO: This book is an
introductory textbook for the beginner
who wants to go beyond usage and
understand a bit of how deep learning
works. People who develop deep
learning tools are usually drawing from
a lot of mathematics: multivariate
calculus, linear algebra, optimization,
often some physics too. But you don’t
need all these things to understand
what deep learning is doing. In the
author’s words, “If you’ve passed high
school mathematics and hacked
around in Python, you’re ready for this
book.” It covers some very general and
fundamental bits, such as gradient
descent, backpropagation and
regularization, which are used in so
many advanced tools that you cannot
progress without a decent
understanding of them.
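
Of the fundamentals Camargo lists, gradient descent is the easiest to show in a few lines. The sketch below is an illustration rather than code from the book: it fits a straight line to five invented points by repeatedly stepping against the gradient of the mean squared error. The function name, learning rate and step count are assumptions chosen so that this small example converges.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step downhill: move each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1, so the fit can be checked by eye.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Training a deep network follows the same loop in spirit; backpropagation is the method used to compute the gradients when there are many layers of parameters instead of two numbers.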

I think books like this are important
because thanks to online tutorials, you
can get to a point where you’re
implementing complex stuff without
actually understanding how it works —
all you need is Python and an internet
connection. And that is troublesome,
sometimes. People can waste resources
by using deep neural networks where a
linear regression would do (using a
bazooka to kill a fruit fly, in a sense) or
by implementing algorithms that lead
to decisions that harm people, without
the programmers realizing that’s
happening.

LINEAR ALGEBRA DONE RIGHT BY
SHELDON AXLER

MILLER: This book is an
undergraduate math textbook. It’s
designed for a mid-level linear algebra
course, which is something every data