1. Introduction
2. Chapter 1 Data Science and Big Data
    1. 1.1 Digging into Big Data
    2. 1.2 Big Data Industries
    3. 1.3 Birth of Data Science
    4. 1.4 Key Points
3. Chapter 2 Importance of Data Science
    1. 2.1 History of the Data Science Field
    2. 2.2 The New Paradigms
    3. 2.3 The New Mindset and the Changes It Brings
    4. 2.4 Key Points
4. Chapter 3 Types of Data Scientists
    1. 3.1 Data Developers
    2. 3.2 Data Researchers
    3. 3.3 Data Creatives
    4. 3.4 Data Businesspeople
    5. 3.5 Mixed/Generic Type
    6. 3.6 Key Points
5. Chapter 4 The Data Scientist’s Mindset
    1. 4.1 Traits
    2. 4.2 Qualities and Abilities
    3. 4.3 Thinking
    4. 4.4 Ambitions
    5. 4.5 Key Points
6. Chapter 5 Technical Qualifications
    1. 5.1 General Programming
    2. 5.2 Scientific Background
    3. 5.3 Specialized Know-How
    4. 5.4 Key Points
7. Chapter 6 Experience
    1. 6.1 Corporate vs. Academic Experience
    2. 6.2 Experience vs. Formal Education
    3. 6.3 How to Gain Initial Experience
    4. 6.4 Key Points
8. Chapter 7 Networking
    1. 7.1 More than Just Professional Networking
    2. 7.2 Relationship with Academia
    3. 7.3 Relationship with the Business World
    4. 7.4 Key Points
9. Chapter 8 Software Used
    1. 8.1 Hadoop Suite and Friends
    2. 8.2 OOP Language
    3. 8.3 Data Analysis Software
    4. 8.4 Visualization Software
    5. 8.5 Integrated Big Data Systems
    6. 8.6 Other Programs
    7. 8.7 Key Points
10. Chapter 9 Learning New Things and Tackling Problems
    1. 9.1 Workshops
    2. 9.2 Conferences
    3. 9.3 Online Courses
    4. 9.4 Data Science Groups
    5. 9.5 Requirements Issues
    6. 9.6 Insufficient Know-How Issues
    7. 9.7 Tool Integration Issues
    8. 9.8 Key Points
11. Chapter 10 Machine Learning and the R Platform
    1. 10.1 Brief History of Machine Learning
    2. 10.2 The Future of Machine Learning
    3. 10.3 Machine Learning vs. Statistical Methods
    4. 10.4 Uses of Machine Learning in Data Science
    5. 10.5 Brief Overview of the R Platform
    6. 10.6 Resources for Machine Learning and R
    7. 10.7 Key Points
12. Chapter 11 The Data Science Process
    1. 11.1 Data Preparation
    2. 11.2 Data Exploration
    3. 11.3 Data Representation
    4. 11.4 Data Discovery
    5. 11.5 Learning from Data
    6. 11.6 Creating a Data Product
    7. 11.7 Insight, Deliverance and Visualization
    8. 11.8 Key Points
13. Chapter 12 Specific Skills Required
    1. 12.1 The Data Scientist’s Skill-Set in the Job Market
    2. 12.2 Expanding Your Current Skill-Set as a Programmer / SW Developer
    3. 12.2.1 OO Programmer
    4. 12.2.2 Software Developer
    5. 12.2.3 Other Programming-Related Career Tracks
    6. 12.3 Expanding Your Current Skill-Set as a Statistician or Machine Learning Practitioner
    7. 12.3.1 Statistics Background
    8. 12.3.2 Machine Learning / A.I. Background
    9. 12.3.3 Mixed Background
    10. 12.4 Expanding Your Current Skill-Set as a Data-Related Professional
    11. 12.4.1 Database Administrator
    12. 12.4.2 Data Architect/Modeler
    13. 12.4.3 Business Intelligence Analyst
    14. 12.5 Developing the Data Scientist’s Skill-Set as a Student
    15. 12.6 Key Points
14. Chapter 13 Where to Look for a Data Science Job
    1. 13.1 Contact Companies Directly
    2. 13.2 Professional Networks
    3. 13.3 Recruiting Sites
    4. 13.4 Other Methods
    5. 13.5 Key Points
15. Chapter 14 Presenting Yourself
    1. 14.1 Focus on the Employer
    2. 14.2 Flexibility and Adaptability
    3. 14.3 Deliverables
    4. 14.4 Differentiating Yourself from Other Data Professionals
    5. 14.5 Self-Sufficiency
    6. 14.6 Other Factors to Consider
    7. 14.7 Key Points
16. Chapter 15 Freelance Track
    1. 15.1 Pros and Cons of Being a Data Science Freelancer
    2. 15.2 How Long You Should Do It for
    3. 15.3 Other Relevant Services You Can Offer
    4. 15.4 Example of a Freelance Data Science Opportunity
    5. 15.5 Key Points
17. Chapter 16 Experienced Data Scientists Case Studies
    1. 16.1 Dr. Raj Bondugula
    2. 16.2 Praneeth Vepakomma
    3. 16.3 Key Points
18. Chapter 17 Senior Data Scientist Case Study
    1. 17.1 Basic Professional Information and Background
    2. 17.2 Views on Data Science in Practice
    3. 17.3 Data Science in the Future
    4. 17.4 Advice to New Data Scientists
    5. 17.5 Key Points
19. Chapter 18 Call for New Data Scientists
    1. 18.1 Ads for Entry-Level Data Scientists
    2. 18.2 Ads for Experienced Data Scientists
    3. 18.3 Ads for Senior Data Scientists
    4. 18.4 Online Job Searching Tips
    5. 18.5 Key Points
20. Final Words
21. Glossary of Computer and Big Data Terminology
22. Appendix 1 Useful Websites
23. Appendix 2 Relevant Articles
24. Appendix 3 Offline Resources
25. Index
The Definitive Guide to
Becoming a Data Scientist

first edition

Zacharias Voulgaris, PhD


Published by:
Technics Publications, LLC
2 Lindsley Road
Basking Ridge, NJ 07920
USA
http://www.TechnicsPub.com
Cover design by Mark Brye
Cartoons by Sarah Silverberg

Edited by Carol Lehn


All rights reserved. No part of this book may be reproduced
or transmitted in any form or by any means, electronic or
mechanical, including photocopying, recording or by any
information storage and retrieval system, without written
permission from the publisher, except for the inclusion of
brief quotations in a review.
The author and publisher have taken care in the preparation
of this book, but make no expressed or implied warranty of
any kind and assume no responsibility for errors or
omissions. No liability is assumed for incidental or
consequential damages in connection with or arising out of
the use of the information or programs contained herein.
All trade and product names are trademarks, registered
trademarks, or service marks of their respective companies,
and are the property of their respective holders and should
be treated as such.
Copyright © 2014 by Zacharias Voulgaris, PhD
ISBN, print ed. 978-1-935504-69-6
ISBN, ePub ed. 978-1-935504-75-7
First Printing 2014
Library of Congress Control Number: 2014935091
Introduction
A year and a half ago, I had no clear idea what a data scientist
was and why it was an important role. Immersed in a dead-end
job in an e-marketing company, I had started to forget all of the
stuff I had learned through the many difficult years of my
education. I am not sure what triggered my resolve to look into the matter further (at that time there were no decent books on the topic, and I had no one to mentor me), but I do remember coming to the realization that this was my life’s vocation.
Naturally, there were problems with this new type of work – lots of things I hadn’t learned and no idea how to learn them, especially if you factor in my 50-hour-per-week schedule and
the fact that there wasn’t a decent data science course
anywhere in the country in which I was living. But I did power
through, my resolve fueled by the conviction that this was
something worthwhile and enjoyable. And if I happened to fail
in my pursuit, at least I would have picked up some useful skills
in the process.
This book is for people who have the same desire to learn
about this fascinating field. When I started my quest into the
data science world, I had to learn the hard way, through trial
and error, as well as through hard research via articles, videos
and other sources on the Web. Fortunately, it will be much
easier for you. That’s why I wrote this book: so that you have a
manual, of sorts, to provide you with guidelines for this
challenging transition.
Data science is a very rewarding field that deals with a fascinating new entity in the data world: big data, something that poses quite an intriguing challenge since there is no straightforward way of dealing with it effectively. This leaves a
lot of room for creativity and a wider array of possibilities that
you are called to explore as a data scientist. In addition,
through this role you have the opportunity to develop aspects of
yourself that no other role in the IT field provides: namely
creativity, communication, direct links with the business world,
etc. Through all this you have a chance of providing something
useful to the organization you work for (which can be a
company, government agency, or even a charity) through the
intelligent use of the data that is available. Since this data is
bound to be large, diverse, and quite messy, it is not something
you would normally find in a tidy database. Hence the term big
data and the role of the data scientist, the professional who
deals with big data in a scientific, creative and understandable
manner.
Over the past few years, there has been heightened awareness
of big data and its implications in business, as well as its impact
on the job market. But what is big data exactly? And how is it
different from traditional data? The short definition of big data is
“data that cannot be handled by a single computer.” Although
this is usually due to its very large size, there are a few other
reasons. In general, it is defined by four main characteristics,
usually referred to as the four Vs of big data:

Volume. Contrary to “normal” data, big data is significantly larger; i.e., it ranges from a few Terabytes (TB) to a few Zettabytes (ZB). The latter is a billion TB, or a trillion Gigabytes (GB). That’s a lot of data! In 2010, the data of the whole world was about 1 ZB – that’s roughly 125 billion 8 GB media players (a quick check of this arithmetic follows this list)! What’s more, this number has been increasing rapidly over the past few years, and there is no sign of it stopping any time soon. This very high amount of data that characterizes big data, in combination with the fact that big data cannot be processed efficiently using a single machine (even a supercomputer), has brought about the use of parallel computing (a cluster of computers working together via a network connection), something that is inherent in the vast majority of data science projects.
Variety. Big data is also quite varied, coming from
non-traditional as well as traditional sources. The
data we are used to processing is structured data,
the kind of data usually found in databases. We
know what its data type and size are, and we
generally know what’s supposed to be in each
field. Big data, however, includes unstructured
and semi-structured data as well. Unstructured
data lacks a pre-defined structure in its
subcomponents (e.g., data found in Facebook
posts, tweets, phone call transcripts, etc.), while semi-structured data has some structure, falling somewhere between structured and unstructured data (e.g., data in machine logs and email headers).
Velocity. Another important characteristic of big
data is velocity, or the rate at which it arrives at
the enterprise and is processed. Traditional data
is thought to be slower and fairly static in terms of
how it is developed and transferred from the
location it is generated to the location it is
processed. Contrast this with big data, which is
constantly moving, and moving fast (though there
may be some exceptions to this rule). This means
that it needs to be processed quickly, in real-time
if possible, in order to harness its potential. For
example, a financial services company may need
to analyze over 5 million market messages every
second, with a latency of about 30 microseconds.
Veracity. This last one was added relatively
recently, so there are still many references to the
three Vs of big data in books and articles on the
topic. Big data is also characterized by veracity,
an attribute that relates to the quality
(trustworthiness) of the data. As one would
expect, there is a lot of noise in all of this data.
Working with big data effectively means being
able to discern the noise from the signals that
may hide within. This is a challenging process that
requires advanced analytical techniques. If one is
not careful, it is easy to draw conclusions backed
by statistical significance that don’t have any real
value, or that may lead to questionable decisions.
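For those who like to verify such figures, here is a quick back-of-the-envelope check of the Volume arithmetic above, written as a small R snippet (just a sketch, assuming decimal units, i.e., 1 ZB = 10^21 bytes and 1 GB = 10^9 bytes):

    zettabyte <- 1e21   # bytes, assuming decimal units
    terabyte  <- 1e12
    gigabyte  <- 1e9
    zettabyte / terabyte        # 1e+09: a zettabyte is a billion terabytes
    zettabyte / gigabyte        # 1e+12: or, equivalently, a trillion gigabytes
    zettabyte / (8 * gigabyte)  # 1.25e+11: about 125 billion 8 GB media players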

There are two more Vs that are sometimes included, Variability and Visibility, but there has not been consensus on these characteristics yet.
It doesn’t take much to realize that making effective use of big
data is a challenge. Ignoring it is no longer an option in many industries, as its information potential is becoming more and more evident and the ways to make use of it are constantly increasing.
Think of Amazon and Netflix, for example. Their clever use of
big data has given them a competitive advantage and has
opened new roads for their industries. If you were in the online
shopping business, for example, and you had a large customer
base that supplied you with large amounts of data, imagine
what you could learn about buying patterns, the demographics
of your customers, and the opportunities you could take
advantage of by analyzing the data.
Building on this newly acquired knowledge, you could go one
step further: namely, design a widget or an app that makes use
of the insights you have derived and helps its user to gain
similar insights into their experience with the environment of the
data (in this case, the online shop). That’s actually one of the
reasons Amazon became so successful. It not only offered a
large variety of products to its users, but made the whole
experience of shopping easier and more enjoyable through the
use of interesting features on its site, such as its recommender
system. This and many other similar mini-programs that are
based on intelligent analysis of big data are usually referred to
as data products and constitute the goal of the majority of data
scientists. There are data scientists, however, who are not
directly involved in the creation of these products and focus on
engineering ways of facilitating other data scientists in their
work. So the field is quite diverse in the particular tasks data
scientists can undertake through the application of their specific
skill-sets.
So the question is not whether to hop on the big data bandwagon, but how. This is where the data scientist comes in. The
data scientist is a fairly new role in the industry, and since its
introduction to the job market, it has grown in popularity. It
involves all the different aspects of dealing with data,
particularly big data, in an intelligent and very methodical
manner, in order to create a useful product (the aforementioned
data product). The product is usually a widget or an app that
can provide meaningful information the users do not already
know (the last part is something that is stressed by John
Foreman, a very successful and experienced data scientist).
Big data has brought about new paradigms in data processing and data visualization, equipping the data scientist with powerful tools that require a different mindset and a different skill-set to accompany them.
Many people confuse the data scientist with the data analyst.
However, they are quite different roles, much like space flight is
different from traditional flight. A data analyst uses techniques
that may work with data that borders on being big data, but may
be inefficient and lack the flexibility of the techniques employed
by a data scientist. The former relies on a series of pre-made
models to derive useful information from the data and creates
reports for a businessperson to view. The latter often develops
his own models or uses a completely data-driven approach in
his analyses, often resulting in something that many other
people can use, not just a businessperson in his company. The
data analyst will create intuitive plots in his reports. The data
scientist will create an interactive dashboard that will plot all the
essential information in real-time.
In other words, data analysis is a very useful tool, but if one is
to make use of the data the world is immersed in today, one
needs to not only be efficient with data analysis techniques, but
also gain a working knowledge of other aspects of data science
that will be described in this book. Being a data analyst is great,
but it will limit you to certain types of datasets that involve structured data only, and among these datasets you will only be able to deal with the relatively small ones. If you want to take a
stab at the larger and more complicated ones, you’ll need to
learn the ways of the data scientist.
Being a data scientist is not only about know-how, though; to
someone who’s interested, it can also be a very enjoyable and
intriguing occupation. The domain of the data scientist is
constantly changing as new technologies are developed,
making it a very dynamic field. He1 is at the cutting edge of
science and gets to communicate with interesting people, some
of whom drive these changes. Data science is an inter-
disciplinary field, so the data scientist expands his worldviews
by learning to think in a more systemic way, integrating things
from various fields. Most importantly, he often gets to be
creative in the way he deals with the problems that arise and
the ways data can be processed.
Being a data scientist is also a great profession. For example,
given that it is a new role that can provide a strategic
advantage to an organization (and there aren’t many people
trained to do the role properly), the data scientist can be very well paid – usually more than other IT professionals with the same years of experience, according to Indeed.com. In addition, a
data scientist has the opportunity to develop a wide variety of
skills, making him a very versatile and adaptable professional
who may have the opportunity to communicate with all kinds of
people in the industry and the scientific world and work in
different industry sectors. This is particularly useful in times of
financial turmoil, when job-hunting becomes challenging for
specialized professionals.
This book comprises eighteen chapters, covering the basic
aspects of the transition to the data science world. In the first
few chapters you will learn more about what the field entails
(what data science and big data are; why data science is very
important, especially nowadays; and the different types of data
scientists). Afterwards, you will have a chance to learn about
what it takes to be a data scientist (the data scientist’s mindset,
his technical qualifications, the experience that is required for
this role, and a few things about networking). Next, you will
have an opportunity to learn about the everyday life of a data
scientist (what software he uses, the importance of learning
new things in this line of work, the kind of problems he
encounters, and the main stages of the data science process).
In the chapter that follows, you will be presented with the
various migration paths from existing roles (what to do and
learn if you are a programmer/software developer, if you are a
statistician or machine learning practitioner, if you are a data-
related professional, or if you are a student). Afterwards, you
will be given some practical and down-to-earth advice on what
you need to do to land your first data science job (where to
look, how to present yourself as a would-be data scientist, and
what you need to consider if you wish to follow the freelance
track). Finally, you will have a chance to read about some real-
world data scientists, their experiences and their views on the
matter, as well as some real job posting examples for data
scientist positions. At the end of the book, there is a glossary of
the most important terms that have been introduced, as well as
three appendices – a list of useful sites, some relevant articles
on the Web, and a list of offline resources for further reading.
There is also a comprehensive index at the end of this text.
Throughout the book, the Kea bird is used to represent the data
scientist. The Kea is known for its intelligence, innovative attitude, and curiosity, and is one of the rarest species in its category. These attributes are the distinguishing features of the Kea and are shared by the data science professional.
I sincerely hope that this book is useful and, perhaps, even
enjoyable for you. The transition itself is quite demanding
(especially if you are in the beginning of your professional life),
but it is an intriguing and rewarding experience. And when you
eventually become a data scientist, the field continues to be
just as interesting. Not a role for the faint-hearted, being a data
scientist is a wonderful experience on many levels and can be
a fascinating journey. Are you ready to embark on it?

Dr. Zacharias Voulgaris


Although I use “he” to refer to a data scientist throughout the text, the role
can be undertaken by both men and women.
Chapter 1
Data Science and Big Data

Data science is a response to the difficulties of working with big data and other data analysis challenges we collectively face today. We examined this briefly in the introduction, but that was just scratching the surface. In fact, there is so much literature on big data that this whole chapter will still not be able to do it justice. It will, however, give you a good idea of its importance in today’s world. Furthermore, it will help you understand what all the hype around big data is about (hype that has increased significantly over the past year), and why data science is so important.
Big data is a fundamental asset for today’s businesses, and it is
not a coincidence that the majority of businesses today are
using, or are in the process of adopting, the corresponding
technology. Despite all the hype about it in various media, this
is not a fad. There are specific advantages to using this asset,
and the fact that it is growing more abundant is an indication
that it is imperative to do something about it, and do it fast!
Perhaps it is not useful for certain industries right now as big
data tends to be quite chaotic or even non-existent for them.
Those who do have it and make intelligent use of it, though,
reap its benefits and stand a good chance of being more
successful in today’s competitive economic ecosystems.

1.1 Digging into Big Data


Big data is abundant and contains information that is relevant to
the business problems at hand. If you are a manager of an e-
commerce company, for example, the data you collect on your
servers regarding your customers and the visitors to your site
is rich with information that, when analyzed properly, can be
used to increase your sales, enhance your site’s design, and
improve your customer service. It can also provide you with
ideas on marketing strategies and ways to improve your
company’s overall strategy; all that from a bunch of ones and
zeroes that dwell on your servers. You just need to extract the
information from them, allocating a small part of your resources.
Not a bad trade-off, for sure. We’ll come back to this example
later on.
Not every amalgamation of data qualifies as big data,
although most Web-related data falls under this umbrella. This
is because big data is characterized by the four Vs2.

Fig. 1.1 The four Vs of big data.


As we have already seen, these are:

Volume – Big data consists of large quantities of data. This translates into several TB up to a few ZB.
This data may be distributed across various
locations, often in several computer networks
connected through the Internet. Generally, any
amount of data that is too large to be processed by
a single computer satisfies the Volume criterion of
big data. This alone is an issue that requires a
different approach to data processing, something
that gave rise to the parallel computing technology
known as MapReduce.
Velocity – Big data is also in motion, usually at high
transfer speeds. It is often referred to as data
streams, which are frequently too difficult to archive
(the speed alone is a great issue, considering the
limited amount of storage space a computer network
has). That is why only certain parts of it are
collected. Even if it were possible to collect all of it, it
would not be cost effective to store big data for long,
so the collected data is periodically jettisoned in
order to save space, keeping only summaries of it
(e.g., average values and variances). This problem
is expected to become more serious in the near
future as more and more data is being generated at
higher and higher speeds.
Variety – In the past, data used to be more or less
homogeneous, which also made it more
manageable. This is not the case with big data,
which stems from a variety of sources and,
therefore, varies in form. This translates into
different structures among the various data sources
and the existence of semi-structured and completely
unstructured data. Structured data is what is found
in traditional databases, where the structure of the
content of the data is predefined in fields of
specified sizes. Semi-structured data has some structure, but it is not consistent (see the contents of a .JSON file, for example; a brief sketch contrasting these forms of data follows this list), making it difficult to work with. Even more challenging is unstructured data
(e.g., plain text) that has no structure whatsoever. In
most cases big data is semi-structured, though
rarely do its sources share the same form. In the
past few years, unstructured and semi-structured
data have constituted the vast majority of all big
data.
Veracity – This is one aspect of big data that is often
neglected by the literature, partly because it is
relatively new although equally important. It has to
do with how reliable the data is, something that is
taken into account in the data science process
(which is different from the traditional data analysis
process, as we will see in Chapter 11). Veracity
involves the signal-to-noise ratio; i.e., figuring out
what in the big data is valid for the business, which
is an important concept in information theory. Big
data tends to have varied veracity as not all of its
sources are equally reliable. Increasing the veracity
of the available data is a major big data challenge.
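To make the Variety point above a bit more tangible, here is a minimal sketch in R contrasting the three forms of data just described (base R only; the field names, the JSON string, and the sample values are all made up for illustration):

    # Structured data: predefined fields of known types, as in a database table.
    structured <- data.frame(customer_id = c(101, 102),
                             amount      = c(19.99, 45.50),
                             country     = c("US", "JP"))

    # Semi-structured data: keys and values are present, but the fields can vary
    # from record to record, as in a .JSON document or a machine log entry.
    semi_structured <- '{"user": "jdoe", "rating": 4, "tags": ["fast", "cheap"]}'

    # Unstructured data: plain free text with no predefined fields at all.
    unstructured <- "Loved the product, though it arrived two days late..."

    str(structured)  # the fields and their types are known in advance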

Note that a piece of data may have one or more of these characteristics and still not be classified as big data. Big data has all four of these. Big data is a serious issue as it is not easy, even for a supercomputer, to manage it effectively, let alone perform a useful analysis of it.
In the example we started with, a typical set of data that you
would encounter would have the following qualities:

The volume of data would be very large, with a tendency to become larger, especially if your site
monitors several aspects of its visitors’ behavior.
This data may easily account for several TB a year.
It would flow constantly as visitors come and go and
new visitors pay a visit to your site. This translates
to continuous network activity on your servers,
which is basically a data stream from the Web
flowing into your server logs.
The data you would collect from your visitors would
vary greatly, ranging from simple Web statistics
(time spent on each page, time of the visit, number
of pages visited, etc.) to text entered on the site
(assuming you have some kind of review system,
like most serious e-commerce sites) and several
other types of data (e.g., ratings from customers for
various products, transaction data, etc.).
Naturally, not everything you observe on your site’s
servers will be trustworthy. Some of your visitors
may be bots sent by hackers or other users for
shady purposes, while other visitors may be your
competitors spying on you! Some visitors may have
spelling errors in their reviews, or leave random or
spam messages on the site for whatever reason.
Even if you have some kind of filtering system, it is
inevitable that your site will collect some useless
data over time.

Based on all of the above observations, do you think that you are dealing with big data in this company or not? Why? If you
have understood the above concepts, you should be confident
in replying positively to this question. Each one of the bullet
points describing the data situation in that company has to do
with one of the Vs of big data.

1.2 Big Data Industries


Naturally, not all industries are equally affected by the big data
movement. Depending on how much they rely on data and how
profitable information is to them, they may be looking at a
goldmine or one more asset that can wait. Based on recent
statistics, the following industries appear to have benefited, or
are inclined to benefit the most from big data:

Retail (particularly in terms of productivity boost)
Telecommunications (particularly in terms of revenue increase)
Consulting
Healthcare
Air transportation
Construction
Food products
Steel and manufacturing in general
Industrial instruments
Automobile industry
Customer care
Financial services
Publishing
Logistics

Note that the benefit is not always directly related to the bottom
line, but it is definitely of significant business value. For
example, by employing big data technologies in healthcare,
physicians can use previous data to gain a better
understanding of the patients’ issues, yielding a better
diagnosis and enabling them to take better care of their patients
in general. This can eventually result in greater efficiencies in
the medical system, translating into lower costs through the
intelligent use of medical information derived from that data.
Another example comes from customer care, where big data can help companies learn from bad customer experiences. By effective use
of big data technologies, companies can gain a better
understanding of what their customers like and don’t like in
near real-time. This can help them amend their strategies in
dealing with these customers and give them insight into how to
improve their services in the future.
Note that there are many other industries that have the
potential for gaining from big data, but based on their current
status, it is not a worthwhile option for them. For example, the
art industry is still not big on big data, since the data involved in
this field is limited to descriptions of artwork and, in some
cases, digitized forms of these works of art. However, it is
possible that this may change in the future depending on how
the artists act. For example, if a certain gallery makes use of
sensors monitoring the number of people who view a certain
painting, and in combination with other data (e.g., number of
people who bought tickets to the various exhibitions that hosted
that painting), they could gradually build a large database that
would contain data about the sensor readings, the ticket sales,
and even the comments some people leave on the gallery’s
blog about the various paintings. All this can potentially yield
useful information about which pieces of art are more popular
(and by how much), as well as what the optimum ticket prices
should be for the gallery’s exhibitions throughout the year.
All this is great, but how is it of any real use to you? Well,
higher profit margins and the potential to significantly boost
productivity are not going to happen on their own. It is naïve to
think that just installing a big data package and assigning it to
an employee (even if they are a skilled employee) could result
in measurable gains. In order to take advantage of big data, a
company needs to hire qualified people who can undertake the
task of turning this seemingly chaotic bundle of data into useful
(actionable) information. This is the problem that all data
scientists are asked to solve and one of the driving forces of all
developments in the field that came to be known as data
science.

1.3 Birth of Data Science


The field of data science resulted from the attempt to discover potential insights residing in big data and to overcome the challenges reflected in the four Vs described previously. This was possible through the combination of
various technological advances of modern computing.
Specifically, parallel computing, sophisticated data analysis
processes (mainly through machine learning), and powerful
computing at lower prices made this feasible. What’s more, the
continuously accelerating progress of the IT infrastructure and
technology will enable us to generate, collect, and process
significantly more data in the not-so-distant future. Through all
this, data science addresses the issues of big data on a
technical level through the application of the intelligence and
creativity that is employed in the development and use of these
technologies. As a result, big data becomes somewhat manageable and is at least able to provide some useful information to make the whole process worthwhile.
It’s important to note that data science is not a fad, but
something that is here to stay and bound to evolve rapidly. If
you were an IT professional when the World Wide Web came
about, you might have seen it as a luxury or a fad that wouldn’t
catch on, but those who managed to see its real value and the
potential it held made very lucrative careers out of it. Imagine
being one of the first people to learn HTML, CSS and
JavaScript, or one of the first to create digital graphics to be
used for websites. It would be like holding a winning lottery
ticket, especially if you were good at your job. This is the
situation with data science today. It would probably not be so
well-known if it weren’t for so many people writing about its
benefits. Still, most professionals and many students are not
aware of what data science really means.
If you assimilate the aforementioned facts about big data, you
will understand that data science is the solution to a real
problem that is only going to become more pronounced in the
years to come. This problem, as mentioned earlier, is reflected
in the four Vs of big data, the characteristics that make it
difficult to deal with using conventional technologies. As
technology is on its side, data science is bound to become
more robust and more diverse in the coming decade or so.
There are already some post-graduate programs making an
appearance in the academic world3, and there are plenty of
respectable researchers writing papers on data science topics.
This is not a coincidence. It shows a trend for the development
of an infrastructure of knowledge and know-how that will
nourish this field.
It is not very clear exactly when data science was born (there
have been people working on this field as researchers for
several decades), but the first conference where it received the
spotlight was in 1996 (“Data Science, Classification, and
Related Methods” by IFCS). It wasn’t until September 2005,
however, when the term “data scientist” first appeared in the
literature. Specifically, in a report released that year4, data
scientists were defined as “the information and computer
scientists, database and software engineers and programmers,
disciplinary experts, curators and expert annotators, librarians,
archivists, and others, who are crucial to the successful
management of a digital data collection.” In June 2009, the importance of the role of the data scientist became more apparent, with the publication of Nathan Yau’s article “Rise of the Data Scientist” in FlowingData5. Since then, references to and
literature on data science have increased rapidly. Just take a
look at how many conferences are being organized for it
nowadays, appealing to both academics and people in the
industry! What’s more, as several large companies that are
leaders in their sectors (e.g., Amazon) make use of data
science in their everyday workflow, it is quite likely that this
trend will continue. Also, as the role of the data scientist adapts
to the ever-changing requirements of the data world, it has
come to include several things such as the application of state-
of-the-art data analysis techniques, not just the original
responsibilities.

1.4 Key Points

Big data is a recent phenomenon where there is a large quantity of data, in quick motion, varying from structured to unstructured (with everything else in between), and with different reliability levels. This is often referred to as the four Vs of big data: Volume, Velocity, Variety, and Veracity.
Dealing with big data is a challenging problem due
to these four Vs. Data science is our response to the
challenges that big data represents.
Data scientists are the people that make sense of
big data. Through the use of state-of-the-art
technologies and know-how, they manage to derive
actionable information from it, usually in the form of
a data product.
Big data occurs in a variety of industries; taking
advantage of it can have a profound effect on them
in terms of productivity boost and revenue increase.
Data science has been around for over two decades
but has only recently taken off as the corresponding
technology was developed (parallel computing,
intelligent data analysis methods, and powerful
computing at a very low cost).
The role of the data scientist first made an
appearance in the literature in 2005, while it started
becoming quite popular in 2009. In an article in
Harvard Business Review, data science was called
the “sexiest” profession of the 21st century.6
Data science is expected to continue to grow in
terms of business value, technology, available
knowledge and know-how, and popularity in the
years to come.

Actually, some people include an additional two Vs, variability and visibility,
which refer to the fact that Big Data changes over time and is hardly
visible to users.
One of them, created by Berkeley, costs around $60,000, which is
significantly more than the high-priced MBAs you see elsewhere. This
is a clear indication that people in the academic world as well as in the
industry are taking data science quite seriously.
Long-lived Digital Data Collections: Enabling Research and Education in the
21st Century, available at http://www.nsf.gov/pubs/2005/nsb0540
The article is still available online at the time of this writing. You can access
it at http://flowingdata.com/2009/06/04/rise-of-the-data-scientist
Davenport, Thomas H., and D. J. Patil. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review, October 2012.
Chapter 2
Importance of Data Science

In the previous chapter, we got a glimpse of how data science came about and how it is related to big data. We also looked into the major milestones of this field and why it has become popular in recent years. However, this was just scratching the surface, since data science has much to offer on many more levels. In order to get a better understanding, we will look into its history, the new paradigms it entails, and the new mindset it brings about, along with the changes that accompany it.

2.1 History of the Data Science Field


The term “data science” was around before big data came into
play (just like the term “data” preceded computers by four
centuries or so). In 1962, when John W. Tukey7 wrote his book
The Future of Data Analysis8, he foresaw the rise of new type
of data analysis that was more of a science than a
methodology. In 1974, Peter Naur published a book entitled
Concise Survey of Computer Methods,9 in both Sweden and
the United States. Although this was merely an overview of the
data processing methods of the time, this book contained the
first definition of data science as “the science of dealing with
data, once they have been established, while the relation of the
data to what they represent is delegated to other fields and
sciences.” So back then, anyone proficient with computers who
also understood the semantics of the data to some extent was
a data scientist. No fancy tools, no novel paradigms, no new
science behind it. It’s no surprise that the term took a while to
catch on.
As computer technology and statistics started to converge later
that decade, Tukey’s vision began to materialize, albeit quite
subtly. It wasn’t until the late 1980s, though, that it started to
gain ground through one of data science’s most well-known
methods: data mining. As the years advanced, the scientific
processing of data rose to new heights, and data science came
into the spotlight of academic research through a conference in
1996 called “Data Science, Classification, and Related
Methods.” This conference, which was organized by the
International Federation of Classification Societies (IFCS), took
place in Kobe, Japan. It made data science more well-known to
the circles of researchers and distinguished it from other data
analysis terms, such as classification, which are not as broad
as data science. This helped gradually make data science an
independent field.
In the next year (1997), the Data Mining and Knowledge
Discovery journal was launched, defining data mining as
“extracting information from large databases.” This was the first
data science method to gain popularity and respect in the
scientific community as well as in the industry. This method will
be revisited in the data science process in Chapter 11.
The role of data science started to become more apparent at
the end of the 1990s as databases grew larger. This was
voiced very eloquently by Jacob Zahavi in December 1999 in
his article “Mining Data for Nuggets of Knowledge”10:
“Conventional statistical methods work well with small data
sets. Today’s databases, however, can involve millions of rows
and scores of columns of data… Scalability is a huge issue in
data mining. Another technical challenge is developing models
that can do a better job analyzing data, detecting non-linear
relationships and interaction between elements… Special data
mining tools may have to be developed to address web-site
decisions.” This depicted very clearly how imperative the need for a new framework of data analysis was, something that aided the emergence of data science as a field to address that need.
In the 2000s, publications about data science started to appear
at an increasing rate, though they were mainly academic.
Journals and books on data science became more common
and attracted interest among researchers. In September 2005,
the term “data scientist” was first defined (albeit somewhat
generically) in a government report, as we saw in the previous
chapter. Later on, in 2007 the Research Center for Dataology
and Data Science was established in Shanghai, China.
2009 was a great year for data science. Yangyong Zhu and Yun
Xiong, two of the researchers in the aforementioned research
center, declared in their publication “Introduction to Dataology
and Data Science,”11 that data science was a new science,
distinctly different from natural science and social science. In
addition, in January of that year, Hal Varian (Google’s Chief
Economist) stated to the press that the next sexy job in the coming decade would be that of the statistician12 (a term sometimes
used for data scientists when addressing people who are not
entirely familiar with the topic). Finally, in June of that year,
Nathan Yau’s article “Rise of the Data Scientist”13 was
published on FlowingData, making the role of the data scientist
much more familiar to the non-academic world.
In the current decade (2010s), data science publications have
become abundant, although there is still no decent source of
information about how to effectively become a data scientist
apart from this book you are reading. The term “data science”
gained a more concrete definition, the essence of which was
summarized in September 2010 by Drew Conway’s Venn
diagram (Fig. 2.1).
Fig. 2.1 Conway’s Venn diagram about Data Science.
This diagram illustrates the key components of data
science as well as how it differs from the field of machine
learning and traditional research. By “danger zone” he
probably means the hackers/crackers that compromise
the security of many computer systems today. Image
source: Drew Conway.

His quote provides further understanding of the fundamentals for becoming a data scientist: “…one needs to learn a lot as
they aspire to become a fully competent data scientist.
Unfortunately, simply enumerating texts and tutorials does not
untangle the knots. Therefore, in an effort to simplify the
discussion, and add my own thoughts to what is already a
crowded market of ideas, I present the Data Science Venn
Diagram… hacking skills, math and stats knowledge, and
substantive expertise.”14
Finally, in October of 2012, Hal Varian’s quote about this
decade’s sexy job grew into a whole article in Harvard Business
Review (“Data Scientist: The Sexiest Job of the 21st
Century”15) making an even larger population aware of the
importance of the role of the data scientist in the years to come.
It is noteworthy that parallel to these publications and
conferences, there has been a lot of online social activity in
terms of data science. The first official data science group was
created on LinkedIn in June 2009 (known as Data Scientists
group16), and currently also has an independent site
(datascientists.net as well as datascientists.com, its original
name). Other data science groups have been available online
since 2008, although as of 2010, their number has risen at an
increasing rate along with online postings for data scientist
jobs. This will be covered in a bit more detail in Chapter 13. It
should also be noted that over the past few years, there have
been a lot of non-academic conferences on data science.
These conferences are usually rich in workshops and are
targeted at data professionals, project managers and
executives.

2.2 The New Paradigms


Data science has brought about or popularized some new
paradigms that constitute great tools for any data professional.
The main ones are:

MapReduce – A parallel, distributed algorithm for splitting a complex task into a series of simpler tasks
and solving them in a very efficient manner, thus
increasing the speed of performing the complex task
and lowering the cost of computing resources.
Although this algorithm existed before, its wide use
in data science has made it more well known.
Hadoop Distributed File System (HDFS) – An open-
source platform designed to make use of parallel
computing technology, it basically makes dealing
with big data manageable by breaking it into smaller
chunks that are split over a network of computers.
Advanced Text Analytics – Often referred to as
Natural Language Processing (NLP), this is the field
of data analysis that involves techniques for
processing unstructured textual data to extract
useful information and business intelligence from it.
Before data science, this field was far less developed.
Large scale data programming languages (e.g., Pig,
R, ECL, etc.) – Programming languages that work
with large datasets (especially big data) in an
efficient manner. These were underdeveloped or
completely absent before data science appeared.
Alternative database structures (e.g., HBase,
Cassandra, MongoDB, etc.) – Databases for
archiving, querying and editing big data using
parallel computing technologies.

You may be familiar with the New Technology File System (NTFS) employed by every modern Windows OS. This is a fairly satisfactory file system that works without too many problems for most PCs. It would be impossible to use for handling large amounts of data across a network of connected computers, however; NTFS has a limit of 256 TB, which is insufficient for many big data applications. Unix-based file systems face
similar restrictions, which is why when Hadoop was developed,
a new type of file system had to be created, one that was
optimal for a computer cluster. HDFS allows the user to view all
the files on the cluster and perform some basic operations on
them as if they are on a single computer (even if most of these
files are scattered across the entire network).
At the heart of Hadoop lies MapReduce, which is the paradigm
that enables the network to crunch the data efficiently with
limited risk of failure. All the data is replicated in case one of the
computers of the cluster (usually called nodes) fails. There are
a number of supervising nodes that are in charge of scheduling
the tasks and managing the data flow. First, all of the data is
mapped through a set of cluster nodes referred to as mappers.
Once it is processed by the mappers, a set of nodes
undertakes the task of reducing the resulting processed data
into more useful outputs. This set of nodes, referred to as
reducers, may include mappers that have finished their job as
well. Everything is coordinated by the supervising node(s),
ensuring that the outputs of every stage are stored securely (in
multiple copies) across the cluster. Once the whole process
terminates, the outputs are provided to the user. The
MapReduce paradigm involves a lot of programming that can
be quite tedious. Its big advantage is that it ensures the
process finishes relatively quickly, making efficient use of all
available resources, while at the same time minimizing the risk
of data loss through hardware failure (something quite common
for the largest clusters).
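To give a flavor of the process just described, here is a minimal, single-machine sketch of MapReduce-style word counting written in base R (the document strings are made up; a real Hadoop job would run the map and reduce steps on separate cluster nodes and shuffle the intermediate pairs over the network):

    documents <- c("big data is big",
                   "data science deals with big data")

    # Map phase: each "mapper" turns one document into (word, 1) pairs.
    mapped <- lapply(documents, function(doc) {
      words <- strsplit(doc, " ")[[1]]
      setNames(rep(1, length(words)), words)
    })

    # Shuffle phase: group the intermediate (word, 1) pairs by key (word).
    pairs   <- unlist(mapped)
    grouped <- split(unname(pairs), names(pairs))

    # Reduce phase: each "reducer" sums the counts for one word.
    word_counts <- sapply(grouped, function(counts) Reduce(`+`, counts))
    print(word_counts)

The same three steps apply on a real cluster; they are simply spread across many machines, which is what allows the approach to scale.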
Text analytics have been around for a while, but data science
introduced some advanced techniques that make the previous
techniques seem almost primitive. Modern (advanced) text
analytics allow the user to process large amounts of text data,
pinpointing patterns in them very quickly while allowing for
common problems such as misspelled words, multi-word terms
split over a sentence, etc. Advanced text analytics may be able
to pinpoint sentiment (!) in social media posts, recognizing if
someone’s comments are literal or sarcastic, something that is
extremely difficult for a machine to accomplish without the use
of these advanced methods. This advancement was made
possible via the application of artificial intelligence algorithms in
a Hadoop environment.
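As a deliberately naive illustration of the general idea (and nothing more), the base R sketch below scores two made-up posts against tiny, hypothetical lists of positive and negative words; real advanced text analytics systems are vastly more sophisticated, particularly when it comes to things like sarcasm:

    posts <- c("Great service, really fast delivery!",
               "Terrible experience, the package was damaged.")

    positive_words <- c("great", "fast", "good", "love")
    negative_words <- c("terrible", "damaged", "slow", "bad")

    score_post <- function(text) {
      # Tokenize crudely: lowercase, strip punctuation, split on whitespace.
      words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
      sum(words %in% positive_words) - sum(words %in% negative_words)
    }

    sapply(posts, score_post)  # a positive score suggests positive sentiment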
Large scale data programming languages, such as Pig, R, and
ECL, were developed to tackle big data and integrate well with
the Hadoop environment (actually, Pig is part of the Hadoop
ecosystem). R, which was developed before the advent of big
data, underwent a major upgrade that allows it to connect with
Hadoop and handle files in HDFS. As programming languages
are not too difficult to develop nowadays, it is possible that at
the time you are reading this book, other new languages in this
category have been developed, so it is good to keep your eyes
open. By the end of this decade, it is possible that the current
languages will no longer be the first choice for a data scientist
(although it is quite likely that R will be around for a while due to
its immense user community).
New alternative database structures came about thanks to data
science. These structures include Hash Table (e.g., JBoss data
grid, Riak), B-Tree (e.g., MongoDB, CouchDB), and Log
Structured Merge Tree (e.g., HBase, Cassandra). Unlike
traditional databases, these types of schemas are designed for
big data, so they are very flexible in how they read/write data
records in a database. Each has its own advantages and
disadvantages, but they all handle big data better than traditional SQL databases, which struggle when the number of records or the number of fields increases beyond a certain level. For example,
if you have a very large database (big data warehouse)
consisting of a million fields and a billion records, finding a
simple maximum value of a given field using a traditional
database will take longer than anyone is willing to wait. The
same query in a columnar database (e.g., HBase) will take a
fraction of a second.
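The toy sketch below (in-memory base R with made-up values, and in no way a real HBase or SQL query) hints at why: in a column-oriented layout each field is stored together, so a query that needs only one field reads a single column, whereas a row-oriented layout forces every record to be visited and unpacked.

    n <- 1e5  # number of records in this toy example

    # Column-oriented layout: each field lives in its own contiguous vector.
    price_col <- round(runif(n, 1, 100), 2)
    qty_col   <- sample(1:10, n, replace = TRUE)

    # Row-oriented layout: the same data, with each record bundling its fields.
    rows <- Map(function(p, q) list(price = p, qty = q), price_col, qty_col)

    # "Maximum price" in the column layout touches just one vector...
    max(price_col)

    # ...while in the row layout every record must be visited and unpacked.
    max(vapply(rows, function(r) r$price, numeric(1)))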
All of these paradigms are based on the notion that a team of
computers, in the form of a cluster, works significantly better than
any single (super)computer, given that there are enough
members in that team. The innovation lies in the intelligent and
customized approaches to planning the essential tasks so that
they are efficiently handled by the computer cluster; in essence,
optimizing the process of dealing with the problem at hand. It is
no coincidence that these paradigms have exhibited increased
popularity since their creation and that they continue to evolve
rapidly. There is a lot of interest (and money) invested in these
technologies; learning them now is bound to pay off in the near
future.

2.3 The New Mindset and the Changes It Brings
By now, you’ve probably figured out that data science is not
merely a set of clever tools, methodologies, and know-how. It is
a whole new way of thinking about data altogether. Naturally,
this paradigm shift brings about certain changes in the way
people work on related projects, how they engage with the
problems at hand and on how they develop themselves as
professionals.
Data science requires us to think more systematically,
combining an imaginative approach to problems with solid
pragmatism. This translates into a way of thinking that
resembles that of a good civil engineer, combining an artistic
perspective (through design) with hard-core engineering and
time management. Planning is a crucial aspect of working with
big data as different ways of doing the same task may have
vastly different demands on resources without any significant
difference in the results.
The changes this new mindset brings are evident in the way a
data scientist functions. The data scientist usually works as part
of a varied team consisting of data modelers, businesspeople,
and other professionals (depending on the industry). It is very
rare to see a data scientist work on his own for long periods of
time as a traditional waterfall model programmer would, for
instance.
In addition, the data scientist handles problems by taking
advantage of current literature, connecting with a variety of
professionals who may be more knowledgeable on the problem
he is facing, and breaking problems down into manageable
sub-problems that he gradually solves.
The skills a data scientist needs to be successful are not
uncommon individually. A data scientist should be able to learn
new things easily. With the fast pace of development of big data
technologies, a data scientist must have an agile mind that is
quick to grasp new methods and familiarize itself with new
tools.
A data scientist must also be proactive, anticipating things that
will be needed in his work, problems that may arise, and
anything else that will require his time. Existing methods may
need to be fine-tuned or customized for the problem at hand,
and changes in the method may be needed.
A data scientist needs to be flexible, adapting easily to a new
business domain, new team members, and new tools (the
software he uses when starting a job may be quite different
from what he ends up using later in that job). He needs to be
adept at networking and should understand the value of the
skills he is missing so he takes steps to develop them. Overall,
almost all of the skills that a data scientist has are highly
transferable and applicable to a large variety of situations. As a
result, he is a potent professional who can be an asset to any
team, especially an IT one.
We will go into how the shift of mindset and the skills required
manifest themselves in practice in much more detail in Chapter
4.

2.4 Key Points

Data science is older than most people think, but it only started gaining ground in the past decade (2000s).
Drew Conway’s well-known Venn diagram, created
in September 2010, effectively summarizes the
essence of data science.
Data science has brought about some new
paradigms that change the way we deal with data,
the main ones being:

MapReduce
Hadoop Distributed File System (HDFS)
Advanced Text Analytics
Large scale data programming
languages (e.g., Pig, R, ECL, etc.)
Alternative database structures (e.g.,
HBase, Cassandra, MongoDB, etc.)

Data science’s paradigm shift in the way we deal with data caused certain important changes in our lives as data professionals as it brought about a whole new mindset that is essential for dealing with big data.
The new mindset that data science promotes brings
about several changes in the data scientist’s
professional life and in the way he interacts with
others.

Tukey was a remarkable statistician who invented the Tukey honest significant difference (HSD) test, which is often used in combination with the well-known ANOVA method. You can find more on the Tukey method at the following site:
http://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm
John W. Tukey: The Future of Data Analysis, Ann. Math. Statist. Volume 33,
Number 1, 1962.
Peter Naur: Concise Survey of Computer Methods, 397 p. Studentlitteratur,
Lund, Sweden, ISBN 91-44-07881-1, 1974.
http://knowledge.wharton.upenn.edu/article/mining-data-for-nuggets-of-
knowledge
Yangyong Zhu, Yun Xiong. Introduction to Dataology and Data Science.
2009. This paper is available at the website:
http://www.dataology.fudan.edu.cn/s/98/t/316/51/0d/info20749.htm
http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_
challenges_managers
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/#comment-
30739
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
http://www.linkedin.com/groups?home=&gid=2013423&trk=anet_ug_hm
Chapter 3
Types of Data Scientists

Just as there are no two snowflakes that are exactly the same,
there are also no two data scientists who have identical skill-
sets or identical roles. The big data world has a wide variety of
problems, causing some natural differentiation in the specific
roles that a data scientist may undertake. In addition, the
profession has not been properly defined yet, so depending on
various aspects of one’s background, such as education, the
data scientist role can be further differentiated. Based on some
research that was done on the topic by a group of scientists
(Harlan Harris, Sean Murphy, and Marck Vaisman, who recently
published the book Analyzing the Analyzers17), there are four
types of data scientists: data developers, data researchers,
data creatives, and data businesspeople. Often encountered
among the most experienced professionals of the field is a fifth
type, a mixed/generic combination of these. While there is a
certain overlap among all of these categories (e.g., they are all
familiar with data analysis methodologies, big data technology,
and the data science process), they are generally quite different
from one another in several ways. Let’s examine each one of
them in more detail.

3.1 Data Developers


Data developers usually focus on the more technical issues of
data management and data analysis. In other words, their day-
to-day work involves getting the data from various sources and
organizing it in large databases, querying those databases for
meaningful results, and analyzing the results to derive useful
information from them. Data developers have a tendency to be
programmers with strong coding and machine learning skills,
since these are the skills that are most essential for this
particular specialty. Their business or statistics skills may be
relatively immature, depending on their education and work
experience. Data developers are ideal for certain parts of the
data science work, the bigger picture of which we will examine
later on once the specific parts become clear (see Chapter 11).
Data developers may not produce the most robust analyses,
which is why they usually team up with other data professionals
and designers. Still, they provide value for the companies for
which they work, and they can always develop the skills they
lack through courses, workshops, etc.
Data developers can be found in a variety of industries and are
often employed in smaller companies or as part of a data
science team in larger companies. People coming from an IT
background may tend to become this type of data scientist
since it comes naturally to them. They can enhance their skills
by taking courses in business and statistics, parallel to
acquiring experience in the industry. A data developer is usually
found in entry-level (junior) data scientist roles, although he
may take up more managerial roles as he develops his skill-set.

3.2 Data Researchers


Data researchers usually come from the academic world,
demonstrating a strong background in statistics or any of the
sciences that employ statistics (e.g., social sciences). They also
tend to have PhDs in a significantly higher proportion than any
other type of data scientist. Business skills are usually not
their strong suit, but they are excellent analysts. This particular
attribute of theirs is great in cases where a lot of
groundbreaking work needs to take place (e.g., in the case of
an organization that has never done data science before and/or
has no clear idea of what to do with the data it has).
Data researchers are often a very good asset for larger
organizations as part of a data science team along with other
professionals who complement this type of data scientist by
contributing programming and business skills, things that are
essential for the creation of useful data products. As data
researchers are adept at learning new things, they can quickly
pick up additional skills, expanding their skill-set and becoming
more flexible professionals if there is a need for it.

3.3 Data Creatives


Data creatives usually have considerable academic experience
and are exceptionally good at big data technologies (i.e.,
software designed for big data governance and analysis),
machine learning, and programming. They tend to be devoted
users of open source software and boast a broad-based skill-
set. This enables data creatives to move with little effort from
one role to another, acting like the Swiss Army knives of the
data science field. Not the most business savvy of
professionals, they are good at doing the day-to-day work of a
data scientist but may require help in making others see its
value.
Data creatives are a great asset for smaller companies, where
flexibility is of fundamental importance in an employee. Yet they
can easily work in a larger company, particularly if they team up
with more business-oriented professionals. Missing skills can
usually be acquired through work experience.

3.4 Data Businesspeople


Data businesspeople are usually the senior data scientists who
lead data science teams (which they sometimes build from
scratch). They are adept in business skills and are great project
managers. Their focus is mainly on increasing the revenue of a
company, and they are concerned with the bigger picture.
Nevertheless, they can also be down-to-earth since they have
substantial technical expertise.
Data businesspeople tend to be found in larger organizations or
their own start-ups. They are great at dealing with other
professionals, particularly businesspeople, and often have
extensive experience in every aspect of the data science
process. This kind of data scientist usually has other data
scientists and data professionals working for him and has a
project management role in the data science projects in which
he is involved.
3.5 Mixed/Generic Type
Mixed/generic data scientists are like data businesspeople but
without the broad experience or the intense business focus.
They are more balanced than the other types of data scientists
and are more likely to grow into the higher echelons of the field
faster than the first three types. Their skill-set includes
programming, statistics, and business skills, and they are very
flexible, much like the data creatives but with better
understanding of the business world. Most new data scientists
who study data science at a younger age tend to be of this type
since they develop their skills in a more holistic manner
(something that is reflected in the syllabus of the data science
courses).
Mixed/generic data scientists are good for any kind of company,
can work very well independently as well as part of a team, and
are quite enthusiastic about the field (which is why they have
acquired this wide variety of skills). Based on the growing
supply of data science courses and the maturing of the field, it
is expected that many data scientists in the future will be of this
type, even though they may have other types of differentiations,
just like programmers today are more balanced and versatile
than programmers in the early days of computing.

3.6 Key Points

There are five different types of data scientists:

Data developers
Data researchers
Data creatives
Data businesspeople
Mixed/generic

The data developers are experts in programming, but may lack other parts of the data scientist skill-set. They usually come from the IT industry.
The data researchers are experts in data analysis
techniques and possess state-of-the-art knowledge
in machine learning and other fields. They usually
have a PhD and have been or are involved in
academic research.
The data creatives are more holistically developed
as data science professionals than the other two
types, have a bias towards using open-source
software, and are very versatile. They come from all
kinds of industries, though usually they are
computer scientists already.
The data businesspeople (aka senior data
scientists) are the highest level of data scientist and
usually have managerial roles, closer to the
business world than to data science per se. They
usually come from a mixed background that
includes a degree in management.
The mixed/generic type of data scientists are the
most balanced, having developed all of the aspects
of data science more or less equally. They have less
breadth of experience than data businesspeople,
are very versatile, and come from all types of
backgrounds. Usually, the mixed/generic data
scientist evolves into the data businesspeople type.

Harlan Harris, Sean Murphy, Marck Vaisman: Analyzing the Analyzers, O’Reilly, June 2013. http://www.oreilly.com/data/free/analyzing-the-analyzers.csp
Chapter 4
The Data Scientist’s
Mindset

People tend to have a very superficial view of what a data
scientist is (if they can even distinguish the term from the data
analyst or from the traditional scientist). This is clearly reflected
in the books and articles that are available today on this role18.
Rarely will you find a text that attempts to go deeper into what a
data scientist really is.
Like any professional, a data scientist is characterized not just
by a set of skills, but by a particular set of traits, qualities, a
way of thinking, and ambitions. Let us look at each one of
these key aspects of this mindset one by one in order to obtain
a better understanding of it and create a framework about what
being a data scientist really is.

4.1 Traits
A data scientist has a variety of professional characteristics and
traits that usually reflect the kind of work he specializes in, so
this list is not set in stone and is more of a guideline to
understand this role better. First and foremost, a data scientist
has a healthy curiosity about the things he observes, such as
potential patterns or relationships between two attributes or
features, unusual distributions, etc. If you want to be a data
scientist worth the money you earn, you need to have an
inquiring mind.
This does not mean that you need to be curious about
everything and get lost in perpetual random quests for answers.
Curiosity has to be accompanied by the discipline to focus on
down-to-earth, long-term interests that are more grounded than
a fleeting curiosity, which can be impulsive and superficial. A
data scientist is interested in the phenomena he observes in
the data he deals with, wanting to get to the bottom of them. A
statistical analysis of what’s there may be a good first step for
him, but he is not satisfied until he has a good answer for the
reason of these phenomena, the root cause behind the
statistical metrics he calculates. This allows him to explain the
root cause to other people in the company in the form of a
story.

Fig. 4.1 Curiosity is a very useful trait to have as a data scientist.

This leads to another trait that is somewhat akin to curiosity: an
interest in experimentation. Namely, the data scientist has the
courage and the imagination to try out new things, develop new
ideas and put them into practice, design experiments and
validate new notions that he develops. He is not afraid to build
a model that no one else has built before, always being fully
aware of the risks in terms of resource usage, etc. All this is a
disciplined and practical form of experimentation where the
ideas stem from the data available. Otherwise, there is the risk
of misusing the data to project notions that are not there, a
common mistake among data analysts lacking scientific
discipline in their work. Experimentation is crucial, though,
because it allows the data scientist to find new ways of
interpreting the data and helping it transmute into information
that can be useful to other people. This is an important point.
The output of the experiments needs to be understandable to
the non-technical members of his team; otherwise, it is probably
immature. So experimentation is applied on many levels.
Representation of results is one of them, and although not the
most intellectually challenging to the data scientist, it is
definitely no less important than the other tasks he undertakes.
Other traits that the data scientist has are creativity and
systematic work. These are mentioned together because they
are often applied together in data science and are equally
important. The data scientist is an artist of sorts, in the sense
that he is involved in design and other creative endeavors in his
line of work. He values out-of-the-box thinking and regularly
applies it to the problems he tackles. Although knowledgeable
in various data analysis methods, he is not restricted by this
palette of methodologies. Instead, he may use a combination of
them, or even something completely new, tailored for the
particular problem he faces. This is an important aspect of the
data scientist that distinguishes him from a traditional data
analyst and statistician. Creativity goes hand in hand with
experimentation, making it an organic growth approach to
tackling problems. Without creativity, experimentation may not
quickly lead to results (think of a scientist researching a
treatment for a disease; without creativity he may have to
spend a large amount of time and other resources trying out
potential solutions, many of which he could avoid testing
altogether by applying a more creative and efficient approach).
Creativity is, therefore, invaluable to the data scientist and a
fundamental aspect of his thinking.
The data scientist is not, however, an artist per se. That’s why
every creative thought he has is accompanied by several not-
so-creative actions. This is where systematic work comes in.
Think of any inventor (e.g., Thomas Edison) and how many
hours of often tedious work were spent on honing and applying
their creative insight. In a sense, having a particularly creative
idea is not all that difficult. Finding one that is applicable to a
given problem is harder, but still not the main challenge,
either. However, putting this idea into practice, working out all
the engineering details of it, and getting useful results in a
manageable timeframe: that is a real accomplishment. This is
feasible through systematic work, which is not just hard work,
but work done in a methodical and efficient way, something
typical of any type of scientific endeavor. The trait of working
systematically expresses itself as the discipline, organization
and rhythm through which the data scientist manages to ground
the creative ideas he comes up with.
Last, but certainly not least, of the essential skills of the data
scientist is that of communication. Data science is not an
isolated field. It is an interdisciplinary one, and, as such, it is closely
connected to other fields. In the data scientist role, this
translates into a series of connections or collaborations with
other professionals in the organization. These professionals are
usually in a variety of specialties and may have a different
understanding of the various levels of the information the data
scientist deals with. For the data scientist to be good at his role,
he needs to be able to explain not only his methodology and
results to his colleagues and his managers, but also the value
of the whole process. It is this connectivity to other people that
gives value to the data scientist’s role. You don’t do data
science on your own (unless you are just practicing). Besides,
requirements and problem parameters are not always clear cut,
needing to be defined through a process of interviews with
other professionals in the organization as well as communiqués
with middle/upper management. The data scientist needs to be
able to not only communicate his results, but also understand
clearly what is expected of him and engage in a constructive
conversation to determine the best possible parameters of the
projects he undertakes. He needs to be able to manage the
ensuing expectations and make sure that others, especially
those in managerial roles, see practical value in what he can
provide (without expecting miracles).
Although there are other traits that a data scientist may have,
the traits described above are the most essential ones for a
good data scientist. When applied with discernment and
intelligence, they can help his role develop organically and
effectively.

4.2 Qualities and Abilities


Hand in hand with traits are the qualities and abilities of a data
scientist, which often depend on his particular specialty.
However, there are certain ones that are found in every type of
data scientist. The most important of these are the following:
Model Building. This is a fundamental ability of a data scientist,
involving the design and implementation of mathematical
models that can be used to solve the data-related problems he
is asked to tackle. Stemming from the need to scientifically
explain and predict certain phenomena that are reflected in the
available data, model building is a key skill for every data
scientist. This is also one of those things that differentiate him
from most statisticians and the majority of data professionals. It
involves understanding and creativity as well as a great deal of
imagination. The models built by a data scientist are
implemented in an interactive environment, so it goes without
saying that a certain amount of programming also takes place
and that the models created take into account the available
resources, using them in a very effective and efficient manner.
Building a model, though, is not that easy. The model has to be
as simple as possible without being too simple. For example, a
simple model may be able to predict how many people will
attend a football match based on how large the fan clubs of the
participating teams are, the expected weather on that day, and
the time of the year, while an overly simple model may try to
predict the same thing using only one of these features. The
model has to be able to generalize so that it can predict a lot of
different cases that may not be entirely akin to the ones that
were used to create it. It has to be easy to change and
understood by everyone who uses it, especially those who may
need to fine-tune it. Model building can be based on
mathematics, a computational algorithm, or, more often, a
combination of both. The key thing is efficiency, so this is
something that the data scientist needs to factor into the whole
process. What good would a perfect model be if it took weeks
to provide any results, or if it required a huge number of
computers to run it? Also, it goes without saying that a data
scientist needs to be able to evolve and fine-tune his models,
customizing them to different circumstances and adapting them
to the data when it changes.
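To make the attendance example above more concrete, here is a minimal sketch of such a model in Python, assuming scikit-learn and NumPy are installed; the numbers are invented, and the three features simply mirror the ones named in the text (fan club size, expected weather, time of year).

```python
# Minimal sketch: predicting football match attendance from a few features.
# All numbers below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [combined fan club size, forecast temperature (C), month of year]
X = np.array([
    [50000, 18, 9],
    [20000, 10, 1],
    [35000, 22, 5],
    [60000, 15, 11],
    [15000,  8, 2],
])
y = np.array([42000, 12000, 25000, 47000, 9000])  # observed attendance

model = LinearRegression().fit(X, y)

# An "overly simple" model, as described in the text, would use only one
# of these features (here, just the fan club size).
simple_model = LinearRegression().fit(X[:, :1], y)

# Predict attendance for an upcoming match
print(model.predict([[40000, 20, 10]]))
print(simple_model.predict([[40000]]))
```

The specific algorithm is beside the point; the contrast between the two fits is what illustrates the text's distinction between a simple model and an overly simple one.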
Planning. This is an obvious quality for anyone in the data-
related professions, but it is especially useful for a data scientist
as it is very easy to get carried away with analyzing the
available data, experimenting with various models, and not
dedicating sufficient time for other tasks such as documenting
the process and the results or creating the corresponding
visuals, comprehensive presentations and reports. In addition,
a data scientist needs to be able to factor in potential delays,
technical issues, communication lags, etc. in order to make
sure that he can meet all the deadlines of the projects he
undertakes. He needs to be able to think like a project manager
and have a practical approach to assessing time durations of
different tasks and plotting a realistic and efficient plan of action
for all the projects he undertakes.
Problem Solving. This is a key quality for any scientist,
particularly a data scientist; it involves being able to focus on
solutions rather than on the restrictions that a problem
presents. Often, the data scientist has not encountered these
solutions before, so it requires a certain amount of imagination
and creativity. It means being able to look at the problem at
hand from different angles, with different eyes and an open
mind.
Problem solving often involves finding ways to hack existing
technologies to work around a problem. Data science is rarely
clearly defined (similar to most academic endeavors), and
every problem it deals with is unique. That’s why a data
scientist is often more akin to the hacker than the scientist as
he may have to tackle problems through a lateral thinking
approach (see next subchapter for details) and walk outside the
beaten path. Also, he may need to develop new tools for
tackling the problems he faces (i.e., making sense of chaotic
big data), building code from scratch or doing major
modifications to the existing code.
Learning Fast. Being able to learn new things and learn them
fast is a priceless quality for any profession. However, in a field
with constant and rapid changes such as data science, it is
particularly useful. It also attests to mental agility and promotes
creativity, both invaluable aspects of the mindset suitable for
someone who wants to tackle big data problems. Learning fast
means being very methodical, selective, and able to assess
different sources of knowledge. It requires great discipline and
mental plasticity. Almost anyone can learn like that at a
relatively early age, but being able to maintain this openness
throughout adulthood is a challenge to most people. A data
scientist accepts this challenge and does not let age dictate
what he can or cannot learn, nor how fast he can do so. His
disciplined and nimble mind makes sure of that.
Key elements for learning fast are motivation and being able to
perceive the applicability of new material. If you keep this in
mind, it will be easier to develop this ability and use it effectively
in your journey as a data scientist.
Adaptability. An essential quality for a data scientist is the
ability to adapt to new circumstances and new situations. In a
way, data science is like a safari; you have some idea of what
the available game is, but you don’t know when or where you’ll
find it. One thing is certain: you will need to be versatile and
able to find ways to adapt your know-how and techniques to the
(often unique) data problems you will be asked to tackle. Also,
the methods you are going to use may need to be adapted to
be capable of handling the form of the available data. This
quality also enables the data scientist to work in different
industries without being restricted to any one in particular. After
all, the language of information is universal.
Teamwork. As mentioned earlier, a data scientist needs to be
able to communicate and collaborate effectively with other
professionals who are often not familiar with his field. He needs
to be a good team player and not let the uniqueness of his role
corrupt his character, turning him into a person full of himself
and incapable of working well with others. A data scientist is
assertive when it comes to presenting and defending his work
but is also modest and open to new ideas from his colleagues.
He puts the interest of the team before his own interests and is
secure enough to not always have to prove himself. He is
intelligent enough to figure things out on his own, but also
mature enough to see that through brainstorming and
collaboration, he can arrive at the same (or better) results
significantly faster. An independent professional, he is
disciplined enough to work on his own but is also a good
contributor to team meetings and is easy to work with.
Flexibility. Flexibility is another important quality for a data
scientist to have. It is akin to adaptability and the mental agility
mentioned earlier. It enables the data scientist to be versatile
and non-rigid when dealing with data problems, technical
issues and other obstacles (challenges) in his work. Flexibility is
crucial when it comes to new problems or new data structures
that have never been encountered before. It hones a can-do
attitude that allows him to deal with novel situations effectively,
efficiently and creatively. This simple quality is the glue that ties
all the other qualities and abilities together, enabling the data
scientist to organically evolve his techniques and even his
thinking, thus making him an invaluable asset to his
organization.
Research. This has nothing to do with academic research,
although it is scientific in essence. The data scientist is able to
understand and evaluate the current state of the art in his field
and find all the knowledge resources that are required for his
tasks. This entails more than looking things up on a search
engine or a knowledge base, though. Finding quality sources is
crucial for tackling the challenging problems of big data, and it
requires a trained eye to see which methods are applicable and
efficient when applied to a specific problem. It also entails
putting together documents describing new methods he
develops in a concise, scientifically robust and replicable way.
Whether or not these documents are publishable is another
matter and not related to how useful the described techniques
are.
This ability ties very well with learning fast as it enables the
data scientist to be self-sufficient when it comes to learning. In
addition, it makes it possible for him to train others as well as
have something to share at data science conferences and other
relevant events if he so chooses. Needless to say, it is
particularly vital in the initial stages of his career, especially if he
has a disposition towards innovation.
Attention to Detail. A data scientist needs to be attentive to
details since that is usually where useful information lurks. Also,
a small detail may cause syntactical or, even worse, logical
errors in his programs, slowing him down and compromising his
deadlines. Apart from the efficiency boost, this ability is very
useful in other ways as well. For example, certain details in the
available data may hint at using one data analysis approach
over another, or point towards a particular set of features that could
simplify the problem significantly, even improving the results of
the analysis. Also, attention to detail can help the data scientist
pinpoint anomalous data points in a data set, enabling him to
predict problems in advance.
Reporting. Last, but certainly not least, reporting is a useful
ability for a data scientist. It entails creating documents that
summarize his work, creating visuals that depict his results,
putting these results in perspective, creating comprehensive
presentations, etc. A data scientist’s reports need to be
understandable by non-technical people while still maintaining
scientific rigor. Also, reporting provides a way to document
progress on various projects in an easy-to-access manner.
Reporting employs organization and communication and forges
a link between the data science world and the business world.
Other qualities and abilities may be required for certain
specialties of a data scientist, but having the ability to learn new
things may compensate for any lack of skills that you may have.
Fig. 4.2 A data scientist is not your average IT
professional.

4.3 Thinking
The data scientist’s way of thinking is the most important
attribute to keep in mind since it often distinguishes him from
other types of professionals. In general, a data scientist thinks
in a combinatorial, non-linear way. His thinking needs to
combine both traditional and lateral thinking and be versatile in
employing either pattern when dealing with the challenges that
arise in his work.
His thinking is creative when it comes to designing and
implementing his models or investigating which approach
should be used for tackling a particular problem. His thinking is
not bound by unnecessary restrictions when creating or
updating the algorithms he decides to use for his data analysis.
In that sense, his thinking often resembles that of an artist, a
designer and an architect. He does not hesitate to experiment
with different approaches and methodologies and is poised to
try out different ways to visualize the available data insightfully.
Colors and shapes are his tools and can be as applicable as
numbers in expressing the information that is waiting to be
discovered. In a way, his thinking is very similar to that of the
explorer who sets out to find new lands, but his realm is the
vast seas of data in the cyberspace universe.
A data scientist’s thinking is also grounded and practical,
especially when it comes to building something with limited
resources in a constrained timeframe. In that sense, it is similar
to the thinking of a civil engineer who opts to make the most of
the available space and budget without dwelling much on fancy
designs. Just like a civil engineer, a data scientist does not
neglect the given requirements and tailors his creative
approach to the restrictions of the task at hand. Perhaps he
could derive ten or fifteen different metrics from a given dataset
to monitor the evolution of a given variable, but he only needs
four or five of them. And from the dozens of beautiful graphs he
could create to depict that dataset over time, he picks only a
couple that summarize it most effectively. A data scientist is
also an engineer of sorts and always thinks and behaves in a
pragmatic and down-to-earth manner.
The data scientist’s thinking is also self-reflective and, in a way,
meta-cognitive. He investigates different ways of thinking about
things and evaluates his current thinking processes. In
essence, a data scientist should be aware of how his mind
works and, therefore, be willing to admit to gaps in knowledge
(and do something about them). He continually looks for flaws
in his own methods and takes the necessary steps to fix them.
He is proactive and takes responsibility for how his mind
functions and the inputs it uses. He is not afraid to say that he
doesn’t know something and makes every effort to acquire the
relevant resources to help him understand it sufficiently and
quickly. This allows him to be a better team member and greatly
facilitates communication with others.
Most importantly, the mind of the data scientist evolves over
time. Modern neuroscience confirms the brain’s life-long ability
to change and create new connections within itself. The
thinking of the data scientist today is not the same as it was last
year, and it is not going to be the same next year. His mind
embraces change and uses it to upgrade itself through new
experiences, new knowledge and new know-how. In some
professions, it may be sufficient to have more or less static
thinking, but data science is not one of them. The data scientist
is similar to the entrepreneurs, the managers and the inventors,
continuously learning new things and adapting his thinking to
the ever-changing circumstances of our fast-paced world.
Of course, the thinking of a data scientist is not limited to the
above meta-descriptions, and a book subchapter may not be
capable of doing it justice. The above guidelines do, however,
pinpoint some of its main aspects and hopefully provide
incentive for looking into it in greater depth through a conscious
evaluation of your thinking as you learn more about data
science in general.
Fig. 4.3 Thinking is an important aspect of the data
scientist’s mindset.

4.4 Ambitions
It seems a bit unconventional for a book like this to talk about a
professional’s ambitions as this is something that is very
personal and somewhat relative. However, there are certain
aspirations that are more or less common to data scientists;
understanding them may provide useful insight into his mindset.
A data scientist aspires to master big data in its many forms.
Being able to deal with a particular data set in this domain is
great, but often not enough. Someone who cares for data
science finds ways, often through interaction with other
professionals in this field, to be on top of the data that is out
there, meaning that he comprehends fully what each data type
can offer to an organization, what useful information he can
potentially derive from it and what costs acquiring each data
type entails. This stems from the dream of continuous
improvement, which is quite feasible in fields like this where
more and more tools become available as new data analysis
methods are developed all the time.
Data scientists also constantly want to learn new things. This
wish ties quite well with the previous ambition of mastering big
data since learning, especially when related to diverse things
that include the realm of big data, has been proven to aid in the
development of creativity and mental agility. These are
essential aspects of the role of the data scientist, and
cultivating them makes perfect sense. A data scientist’s
interests are not limited to the data science techniques that he
may use in his everyday work. He is also interested in new
developments in artificial intelligence, distributed computing,
information security, new programming languages and machine
learning, among other fields.
Fig. 4.4 A data scientist is not without ambitions.

Finally, a data scientist aspires to familiarize himself with the
open problems and challenges that exist in the big data world
as well as the opportunities that are available through the
intelligent processing of company data. He may want to
research new ways of tackling problems through the use of new
technologies, development of new methods, etc., or he may
look into how specific business requirements can be fulfilled
through the use of certain kinds of data that are available or
can be acquired in a cost-effective manner.
The aforementioned ambitions are examples of what a data
scientist wants professionally. The bottom line is that he is not
static, but always wanting to be more than what his job
description implies. If you keep this in mind, along with the
other aspects of his mentality, you will have a clearer
understanding of the mindset required for this intriguing role. As
a result, you will be able to place the specific skills and
knowledge that he has in context with his work and have a
more holistic view of this profession.

4.5 Key Points

The most important traits a data scientist has are:

Curiosity
Experimentation
Creativity and Systematic Work
Communication

The main qualities and abilities of a data scientist are:

Model Building
Planning
Problem Solving
Learning Fast
Adaptability
Teamwork
Flexibility
Research
Attention to Detail
Reporting

A data scientist aspires, among other things, to:

Master big data in its many forms
Constantly learn new things
Familiarize himself with the open problems and challenges that exist in the big data world as well as the opportunities that are available

Recently, the author came across a post on Quora (a question-and-answer site)
where the poster listed a series of 10 steps you need to take in
order to become a data scientist. Most of them were focused on
specific skills, the majority of which are of questionable value. This
clearly illustrates the limited understanding many people have about
what being a data scientist involves and how this misinformation
propagates.
Chapter 5
Technical Qualifications

Similar to many other jobs nowadays, a robust set of technical
qualifications is essential before you can apply for a data science
job. The mindset of the data scientist, which was described in
the previous chapter, is like an operating system you need to
have installed in your mind, but it needs to be augmented with
particular software (i.e., your technical skills) to enable you to
get the job done. These skills fall into three broad categories:
general programming, scientific background and specialized
know-how (software and techniques). Naturally, all these
qualifications will vary greatly from one company to another, but
having a core set of skills across all of these categories may
help you qualify for most data science jobs.
In this chapter, we will look into the most commonly expected
qualifications for a data scientist position today. We’ll look into
the general programming skills required, the scientific
background you will be expected to have and the specialized
know-how you need to possess related to data analysis and
data engineering.

5.1 General Programming


Unlike in other branches of science, programming is a must-have
for any data scientist. Professionals in academia may be able to
get by without knowing how to code, but in data science you
need to know languages that are:

Robust
Popular in the industry
Scalable, especially when it comes to large data
sets

The (general purpose) languages that appear most commonly in data scientist job openings are:

Java
Python
C++ / C#
Perl

SQL is also required, but this is a more specialized language.


You can get by in data science without knowing Perl or Java,
but you won’t manage without SQL, since at one point or
another you will need to access a database and run queries on
it. In addition, SQL is the foundation for other database-related
query languages, so knowing it will enable you to work with
somewhat similar languages such as Hive Query Language,
AQL, BigSQL, and the query languages of various NoSQL databases.
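As a small, hedged illustration of the kind of query work described above, the sketch below uses Python's built-in sqlite3 module; the table and column names are invented for the example.

```python
# Minimal sketch: creating a small table and querying it with SQL.
# The table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 340.5), ("North", 98.2)],
)

# A typical aggregation query a data scientist might run
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```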
Notice that the aforementioned general programming
languages all support object-oriented (OO) programming; this is not a
coincidence. There are other great languages (the most
widespread of which is C) that may not work for you as a data
scientist because the trend for the past few years is towards
OO languages. (Fortunately, C has an OO counterpart, C++.)
One of the main reasons for this is that an OO language
enables you to create more sophisticated projects quite easily
and then combine your code with others’ code very effectively.
That’s a really big plus when working on a team tackling big
data when agility is key.
Although there is a lot of interest in Python, it is by no means
better as a language than any of the other ones mentioned. It
does have a wide variety of libraries, though, and its relatively
simple syntax makes it easy for someone with no programming experience to pick up.
If you have done programming before, you may want to
consider a more robust language such as Java. It is a good
idea to master at least one language, but it doesn’t hurt to
familiarize yourself with more than one since you never know
when they will come in handy.
Note that knowing how to program well in one or more of these
languages may not be enough. You will need to have some
data processing experience with them, particularly with large
data sets. After all, that’s what you’ll be using them with!
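As one hedged example of what data processing experience with large data sets can look like in practice, the sketch below assumes pandas is installed and that a CSV file too large to load at once (the file name and column names are hypothetical) is aggregated in chunks.

```python
# Minimal sketch: aggregating a CSV that is too large to load at once.
# "transactions.csv" and its column names are hypothetical.
import pandas as pd

totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    # Aggregate each chunk, then merge the partial results
    partial = chunk.groupby("customer_id")["amount"].sum()
    for customer, amount in partial.items():
        totals[customer] = totals.get(customer, 0.0) + amount

print(len(totals), "customers aggregated")
```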

5.2 Scientific Background


This is a key aspect of the qualifications bundle of a data
scientist, differentiating him from other IT professionals. A data
scientist has at least a master’s degree in a technical field
(usually computer science, statistics, mathematics, systems
engineering, or something along these lines). Alternatively, a
background in a non-technical field but with sufficient technical
experience from previous jobs is also an acceptable option.
Having a PhD, though, is a major advantage, regardless of your
background, especially if your research has a quantitative
component to it and if you are looking into a position with a
higher salary. There are data scientists out there who have
PhDs in very diverse disciplines such as psychology and
physics.
A PhD can provide considerable experience in data analysis,
especially if the research done for it is on real-world datasets.
Acquiring a PhD is not considered to be formal work
experience, but in reality, the experience and skills gained can
be as useful as actual work experience. In fact, most of the
professional attributes that real-world experience provides you
with (time management, reliability, teamwork, etc.) are also
skills you learn working on a PhD, especially if you are part of a
research lab. So if you have a PhD that has provided you with
applicable skills, you may want to refer to them in your resume
as well as in interviews during the hiring process, depending on
the people you’ll be working for. That’s a judgment call you’ll
need to make since not everyone values PhDs the same way.
A solid theoretical understanding and practical know-how of
various advanced analytical techniques is also required as part
of a scientific background. If you lack knowledge in advanced
analytics, be prepared to offer something that no one else can,
such as state-of-the-art knowledge of processing data
effectively. The aforementioned techniques include (but are not
limited to) data mining, machine learning and predictive
modeling (aka predictive analytics).
All of the above techniques are great tools that you need to
know intimately. However, what binds them all together is a
strong mathematics and statistics background, which is also an
essential qualification that employers are looking for. This
doesn’t mean that you need to know all theorems and their
proofs, but you do need to be familiar with most of them and,
most importantly, know how to use them with the data you have
available. Not everything will work with all types of data, of
course. Overall, you need to know enough to be able to do the
following in a way that’s second nature to you:

Discern which tool to use when.
Fine-tune the tool you decide to use, customizing it to the problem at hand.
Know what to do with the results your tool yields.
Think of alternative approaches to solving a problem and be able to rank them in terms of resource requirements.

A solid understanding of the theory behind the techniques you
are applying is crucial. To gain this understanding, you need to
have taken several classes on mathematics and statistics and
not be intimidated by anything in those fields. If you don’t know
something, you need to be able to learn it by leveraging what
you have already learned. You can do this by taking a seminar,
an online class or even just reading a couple of books.
The scientific background you are expected to have as a data
scientist will also enable you to formulate testable hypotheses,
apply a reproducible methodology to the data at hand, make
good use of the data science process (see Chapter 11) and
have a thorough understanding of the results. Moreover, you
will be able to fine-tune your methods, know where something
has gone wrong and come up with alternative approaches to a
problem. It is very hard to overestimate the importance of
having a scientific background.

5.3 Specialized Know-How


Being a data scientist requires some specialized know-how that
distinguishes him from other professionals. It is important that
you have mastery of at least one of these statistics tools:

R (the most advanced statistical analysis platform; open-source)
SPSS (another great statistical tool; proprietary)
SAS (a very popular statistical tool in the industry; proprietary)
Stata (another good statistical tool; proprietary)

Some employers might also include Matlab in the list, since
Matlab enables you to do any data analysis conceivable with
minimal code and comes with its own advanced integrated
development environment (IDE) that makes debugging and
development a walk in the park. The big drawback of Matlab is
that its license is quite expensive, especially for commercial
applications.
If you are not sure on which tool to focus, it is recommended
that you go with R. Over the past few years, R has become
more popular for several good reasons: R is open-source (and
therefore completely free), it has a very large user-community, it
is easy to install and customize, it is fairly easy to learn, there is
ample documentation for it as well as several books for a
variety of levels, and it comes with a wide variety of libraries
(known as packages) that enable you to do many complex
tasks easily without having to do much coding. Note that
although R has all the characteristics of an OO language (and
all of the data structures in its workspace are treated as
objects), it is still considered by most people to be a statistics
tool.
If you already know Matlab reasonably well, you may want to
learn another tool just in case an employer is not familiar with it
or is unwilling to purchase a license or two. Note that the
transition from Matlab to R and vice versa is quite easy,
especially if you are somewhat familiar with OO programming.
Experience with big data storage frameworks is also an
essential qualification. As we saw in a previous chapter, big
data requires a different set of paradigms, one of which is novel
database schemas. So, large-scale data frameworks like
Hadoop, Hive, large-scale partitioned relational databases, etc.,
are something you need to be familiar with as a data scientist.
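To make the MapReduce paradigm behind frameworks like Hadoop a little more tangible, here is the classic word-count example written as a pair of Hadoop Streaming scripts in Python; this is a minimal sketch rather than production code, and it can be tested locally with a pipeline such as cat input.txt | python mapper.py | sort | python reducer.py before ever touching a cluster.

```python
# mapper.py -- emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts for each word (input arrives sorted by key)
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")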
Finally, some experience in working with large datasets (TB
class) is also very useful. Although this experience may not be
required, it is something you can gain very quickly and doesn’t
entail any additional know-how. Other qualifications that may be
required include:

Visualization – this is an important aspect of the data science process, which has to do with the creation of graphics (usually plots, heat maps, graphs, etc.) that aim to help the user get a good idea of what the data illustrates without having to look at tables or statistics. Visualization is oftentimes done through the data analysis tool you are using (a minimal plotting sketch follows this list).
Relational databases – depending on your project,
you may need to work with relational databases. It
will be useful to become familiar with them,
especially if you do that while learning SQL.
Consumer modeling – this is a type of modeling that
has to do with creating and using consumer profiles
in order to understand the company’s target group
better and facilitate all the marketing endeavors that
employ this information. It is particularly useful if you
are working for a company in the retail industry.
Big data integrated processing system (e.g., IBM’s
BigInsights, Knime, Alpine, and Pivotal, just to name
a few) – although it is not likely that this will be a
requirement for a data scientist job posting, being
familiar with a system like this provides you with a
better understanding of the bigger picture of big data
processing and allows you to focus on the most
creative aspects of your job since it does all the low-
level work for you and helps you deal with the
problem using a high-level approach.
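As a small illustration of the visualization point in the list above, here is a minimal sketch using matplotlib (assumed to be installed); the figures are invented purely to show how quickly a summary plot can be produced from tabular results.

```python
# Minimal sketch: a bar chart summarizing (invented) results.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [120, 340, 95, 210]  # made-up figures, in thousands

plt.bar(regions, revenue)
plt.title("Revenue by region (illustrative data)")
plt.ylabel("Revenue (thousands)")
plt.savefig("revenue_by_region.png")  # or plt.show() in an interactive session
```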

As the data science field matures, it is likely that additional
specialized know-how will be required in order to be a data
scientist. However, the qualifications identified in this chapter
are bound to remain essential, particularly the data analysis
tools. It is recommended that you keep up to date with the
newest developments in the field so that you know how to
adjust your training strategy and avoid wasting your resources
on things that you may not need.

5.4 Key Points

As a data scientist, you need a specific set of technical skills that are the tools you will use in your everyday job.
You need to be familiar with one or more object-
oriented programming languages such as Java or
Python. Having mastery of at least one of them is
imperative.
You need to have a solid scientific background
(even if your education is non-technical), making
you adept in the following:

The scientific process
The theory behind various data analysis techniques
Using the above techniques in practice
Formulating and testing various
hypotheses
Understanding the results of a data
analysis method
Having a PhD in a technical discipline can be quite
useful when it comes to data science, as it can
compensate for lack of work experience, but it is not
a prerequisite.
You need to have some specialized knowledge that
is particular to the job of a data scientist, including:

Sufficient knowledge of one or more data analysis tools (e.g., R, SPSS, SAS, Stata, or Matlab) and mastery of at least one of them.
Experience with big data storage
frameworks (e.g., Hadoop, Hive, etc.).
Other know-how that may or may not be
a prerequisite for getting a data science
job, such as visualization, relational
databases, consumer modeling, a big
data integrated processing system, and,
of course, experience working with
datasets in the big data domain.

The data science field evolves rapidly, so you need to keep up with the changes, particularly in the tools used, so that you can adjust your training strategy accordingly.
Chapter 6
Experience

Like many other job openings, data scientist job ads usually
specify a requirement of at least two years of experience in a
data-related endeavor. Although this is a very ambiguous
requirement (someone can be a master data scientist and still
not have enough experience for a particular position), it is
definitely worth looking into more. In this chapter we will
examine in more detail the why’s and how’s of experience in
this intriguing field.

6.1 Corporate vs. Academic Experience


There is a big debate about the applicability of different types of
experience in relation to industry. That is, corporate experience
is perceived differently from academic experience for any data
science job.
More often than not, data scientists work in a corporate
environment. Less frequent are positions in large government
organizations such as NASA, FBI or the CIA. Those who have
worked in such an environment can attest to how different it is
from that of the academic world, which has a completely
different set of values. Both environments, however, offer many
similar aspects of experience, such as:

Working with others
Following a time plan
Adhering to organizational policies
Having a professional stance when dealing with one’s tasks and with other people, particularly clients
Working in an academic setting generally enables you to
interact with bright people who are interested in innovation and
have a close connection with the state of the art in the field you
are working in. Academic life is often related to teaching (or
helping out with the teaching of) a class or two and cultivating
your presentation skills as well as being exposed to the fresh
mindset of younger people. All this helps you accumulate
experiences that are harder to find in a non-academic
environment.
In the corporate world, however, professionals tend to get paid
more and receive more immediate feedback for the work they
do. In academia you may be working on something, and only
years later might someone acknowledge that what you did is
good (as is the case for many theoretical physicists). In the
corporate world, you are likely to be praised within a year and
get a bonus as a token of your organization’s appreciation. In
addition, in the corporate world you are more likely to find job
opportunities, many of which you will be made aware of without
any effort on your part (e.g., through recruiters), something
that rarely happens in academia. Moreover, in the corporate
world you often get to see the fruits of your work manifested
through products or news reports, especially if you are an
experienced professional. Unless you are a Nobel-level
scientist, this is unlikely to happen to you in the academic
world. Finally, the conferences in the corporate world are more
inclusive and appeal to everyone interested in the field, while in
academia they are mainly for the few specialists who already
do research work in that field.
Although both corporate and academic types of experience
have their advantages, which one you choose to work in is a
matter of personal preference at the end of the day. It would be
ideal to take the best of both worlds, but this is not always
feasible. If your experience is limited to the academic world,
you can still make a robust argument about how it can be
useful for a data science job, especially if you have experience
working on real-world data through research projects. Most of
the skills you acquire through working in the academic world
are transferable to the data science world. Still, as data
scientists are mainly recruited in the corporate world, it would
make more sense to invest in some experience in the industry if
you have the choice.

6.2 Experience vs. Formal Education


What’s the exact tradeoff between the time spent gathering
work experience and the time spent pursuing a university
degree? This is the million dollar question and a point of debate
for many professionals. Is it possible for lack of experience to
be counter-balanced by sufficient formal education?
Most employers think that this is not the case, which is why
they have specific requirements when it comes to a data
scientist job posting. There are jobs where an advanced degree
can be substituted for experience, but in data science there is a
demand for work experience in the vast majority of cases
(exceptions are some start-ups).
In essence, if you have substantial formal education
(particularly a PhD) in appropriate areas of study (e.g. computer
science, engineering, applied mathematics, etc.), you can
undertake most data science tasks. This is important to keep in
mind. Experience enables you to do things better, but lack of
experience doesn’t have to be a complete negative. If you are
good with the tools you are using, even if you haven’t been
using them for long you can still be successful in data science.
Being fresh out of the university can be a plus if you have
studied something akin to data science (or even better, if you
have done a master’s degree in data science or machine
learning). This can give you an edge when it comes to know-
how and familiarity with the latest tools or data analysis
techniques.
Still, work experience can help you make all this knowledge
and know-how tangible and useful. There is no way around it;
you need to have some experience in order to get a data
science job, even if you can perform data science without it. So
how do you gain some experience so that you have a fighting
chance in the data science job market?

6.3 How to Gain Initial Experience


To gain experience that can jump-start your career, you need to
first pinpoint the industry you plan to get involved with. This is
useful for any position because companies see experience
relevant to their industry as a big plus.
Then you need to find some relevant data you can work with,
just for practice, using the UCI Machine Learning Repository19 and
other open collections of datasets intended for this purpose. A
query on any search engine is bound to provide you with
sufficient results to get you started. This won’t give you
experience that you can put on your resume, but it will get you
acquainted with the data of that industry. Once you are
confident about its structure (or lack of structure) and about
how your algorithms work with it, you are ready for the next
step.
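As a hedged example of this practice step, the sketch below assumes you have downloaded the classic Iris dataset from the UCI repository and saved it locally as iris.data; the file name and column names follow the dataset's documentation but should be treated as assumptions, and pandas is used to load and summarize it.

```python
# Minimal sketch: loading and summarizing a practice dataset with pandas.
# "iris.data" is a local copy of the UCI Iris dataset; the column names
# below follow the dataset's documentation.
import pandas as pd

columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv("iris.data", header=None, names=columns)

print(df.shape)                       # how many rows and columns
print(df["species"].value_counts())   # class balance
print(df.describe())                  # basic summary statistics per feature
```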
Next, find a data science competition or project open to
anyone, take part in it and do well! Participating will give you
some invaluable experience, and finishing in a respectable
position in the competition is something you can include on
your resume. It is advisable that you compete in several
competitions, focusing on those that involve large datasets if
possible. If it is an option, you can recruit some friends or
colleagues to form a small team to maximize your chances of
success and gain some teamwork experience. A great place to
find such competitions is Kaggle20, which is also a priceless
resource for accessing the data science job market (more on
that in the Chapter 13).
Another good way to gain some work experience in this field,
particularly if you are a student, is by getting an internship for a
relevant role (even if it is unpaid). You will gain a good
understanding of the corporate environment, gain some basic
work ethics and become acquainted with data processing to
some extent. If you are lucky, you may work with a full-time
data scientist and have a chance to learn from him directly. The
value of a mentor cannot be overstated.
If you are a master’s student studying something similar to data
science, you may want to do a case study for your final project.
This could involve tackling a particular data analysis problem
that a company is willing to outsource. The catch is that they
won’t spend a dime on it, but they will offer you some useful
real-world experience. It’s similar to an internship, but you have
better control of the whole process since it is your project. Also,
you have the opportunity to create proper documentation on it
and have someone edit it for you, resulting in a respectable,
professional-looking report in the form of a thesis. The latter
you can share with potential employers as a distillation of your
relevant work experience as long as you do not reveal any
confidential information that is for the company’s internal use
only. Be sure to check with a company representative first.
Although not as useful as a source of practical experience, you
can get involved in one or more data science groups in your
community and volunteer to help with organizing their events.
These can be simple Meetup groups, which are also a great
place to learn more about the field. Depending on the events
these groups have, you should have an opportunity to meet
working data scientists, familiarize yourself with local company
reps (who are assuming a recruiter role) and learn some hands-
on organizational skills that may be valued in your future
workplace. This is also a very useful networking tool, as we’ll
see in later chapters of this book.
Finally, after you have exhausted all other avenues, you can
pursue an apprenticeship with a seasoned data scientist, which can be thought of as an unpaid internship. One possible source of opportunities is the Data Science Central site21. An apprenticeship of this kind makes all the topics discussed in this book concrete and applicable.
At the end of the day, your future employer will care about what
you can do for them and how you can benefit their company’s
bottom line, so be prepared to make a convincing argument for
what you bring to the table.

6.4 Key Points

Experience is an essential requirement for the vast majority of data science jobs across the industry spectrum. Experience enables you to be more efficient in your work. It also facilitates communications and provides you with more in-depth knowledge of the methods and tools you are using as a data scientist.
Both corporate and academic types of experience have their advantages and can be used as work experience for a data science job.
Ways to get initial work experience include, but are not limited to:
Participate in a few data science competitions at Kaggle, with or without a team
Obtain an internship with a company for a relevant role
If you are a master’s student, do your thesis on a case study of a company having a data-related problem
Volunteer in a data science group
Join an apprenticeship such as one available in Data Science Central.

19 http://archive.ics.uci.edu/ml
20 www.kaggle.com
21 http://www.datasciencecentral.com/group/data-science-apprenticeship
Chapter 7
Networking

Networking is a very important aspect of being a data scientist, especially if you are in the initial stages of your career. You
never know how a professional acquaintance can be of use to
you in your life in data science. This is because this field is
interconnected with other professions and its reputation has
grown significantly over the past few years, making others
(especially business people) more interested in meeting and
connecting with people in this intriguing field. Many data
scientists have already started taking advantage of this trend by
attending networking groups dedicated to data science topics.
In this chapter, we will examine what networking for a data
scientist entails, how it is different from other professionals’
networking and how it can cultivate relationships that are essential for his career (namely within academia and the business world). It will not touch on how networking is used to pursue a data scientist job, however, as that subject will be covered in detail in Chapter 13.

7.1 More than Just Professional Networking


For a data scientist, networking is an integral part of his job,
enabling him to learn more about techniques, tools and other
things he ought to know in order to be a better data scientist. It
is an educational opportunity that he cannot afford to neglect,
considering the pace at which things are moving in the data
science field. It is also possible that through such meetings he
may find a mentor, which can be very beneficial, especially in
the initial stages of his career.
Networking is also a chance for the data scientist to connect
with business people, not just for finding out about business
opportunities, but to gain a better understanding of how the
business world is faring and how data science contributes to it.
This is not a simple task since it involves talking to many
different people in various industries to get a solid and reliable
first-hand understanding of the matter, forming his own opinions
about it and possibly his own solutions to the problems that are
out there. Also, the language that business people use is quite
different than that used by tech geeks, so it takes a bit of
getting used to this kind of networking.
In addition to these more tangible benefits, networking can help
you develop a professional way of presenting yourself to
others. You may have heard of the “elevator speech,” which
was introduced for screenplay writers hoping to secure a deal
for their stories by bumping into a producer in the elevator of
the company they worked at. During the one minute or so of the
elevator ride, they had to be able to present their story idea
clearly, make that person interested in their idea and sketch the
main benefits of this idea for that person (i.e. how the story’s
originality can help the producer’s bottom line). Networking has
similar opportunities, but you are promoting yourself as a data
scientist.
Networking can also be crucial for a data scientist’s projects. Normally he would have a team of his own, but it is not uncommon to look for collaborators or business partners. Since people work better with those they communicate well with, networking can serve as a filtering process for starting a data science project. By presenting yourself well, you position yourself as a potential recruit or partner.

7.2 Relationship with Academia


A data scientist is not an academic, but that does not mean that
he shouldn’t have a connection with academia. Whether he
hangs out with academics in order to discuss the latest
innovations in fields of common interest (e.g., machine learning,
parallel computing, etc.) or takes classes at a local university,
he should maintain a relationship with the academic world.
Data scientists understand the language of science, which
enables them to communicate effectively with researchers, ask
interesting questions and even think of potential applications of
state-of-the-art research in the researcher’s field. The
academics who develop research projects are familiar enough
with their field to propose their own applications, and it is not
unusual for some of them to have connections with the industry
(usually to supplement their relatively limited income). The data
scientist is similar to this type of academic and can benefit
professionally from a mutually beneficial relationship – the data
scientist learns about the latest innovations, while the academic
learns about industry problems that he can investigate
scientifically. However, for this relationship to develop, it is
usually the data scientist who has to take the first step by
networking at conferences, workshops and other events open
to both academics and other professionals. These events may
cost more than seems reasonable, but for someone who is serious about data science, they may be worth the investment.
Networking can yield a lot of collaboration potential that, in turn,
can provide an edge to the company the data scientist works
for. Think about how the large companies of the Western world
have benefitted from such an approach. Using cutting-edge
know-how in an industry can provide an enormous boost, if
done properly. The data scientist can contribute significantly
towards such an objective by, for example, getting involved in a
research project and learning about new research trends first-
hand, then introducing them to his organization. This kind of
project takes a lot of time (a typical journal paper may take one
to two years until it is finally printed) and it is not uncommon for
there to be a monetary cost involved. (Yes, in the academic
world you often need to pay to get your stuff published,
especially if it’s in a worthwhile journal!) However, if you are part of a team, these issues are mitigated because your contribution is limited in scope, keeping the more frustrating aspects of academic bureaucracy at a healthy distance. Sounds like an interesting
option to consider, doesn’t it?

7.3 Relationship with the Business World


A data scientist should extend networking efforts to the
business world, too. The most notable benefit of this strategy is
that you become intimately familiar with real-world problems,
what various industries are in need of and how other
professionals are tackling the data-related challenges of the
real world. The business world can also keep you grounded. It
is very easy to get carried away and let your mind wander in the
fascinating gardens of mathematical modeling, data analysis
theories and the like, but these things won’t pay any bills. The
data scientist always needs to keep in mind what all his work is
for, to keep up to date about the latest developments in the
business arena and to think of potential applications of big data
in this setting. Business networking supports these objectives.
A very intriguing aspect of networking in the business world is
the potential for strategically beneficial business opportunities.
These are not just job opportunities, which will be covered later
in the book (Chapter 13), but other opportunities such as the
creation of a new group, collaborations with other professionals
on an independent project and even business research
partnerships. The key thing here is to be open and think outside
the box, something that should come naturally if you express
your data scientist side.
One important point to remember when engaging in business
networking is to keep the technical jargon to a minimum and
show genuine interest in what other people are doing. We’ll
look into this in more detail in Chapter 14, where various self-
presentation tactics will be discussed.

7.4 Key Points

Networking is a very important aspect of being a data scientist, especially in the initial stages of your career.
Networking can help you develop your
communication skills and adapt yourself to approach
different types of people, something essential in the
data scientist’s work.
Networking can be an invaluable source of useful
knowledge about the latest innovations related to
the data science field or other fields that are
adjacent to it.
A data scientist should maintain a healthy
relationship with academia, through networking, to
keep himself updated with the latest advances and
for potentially beneficial partnerships.
A data scientist needs to remain grounded by
maintaining contact with the business world through
networking. This can help him gain a better
understanding of what is needed, about new
potential applications of big data and interesting
business opportunities that are not limited to job
openings.
Chapter 8
Software Used

Being a data scientist entails using certain software, some of which we discussed in previous chapters of this book. This
software covers the basic technical know-how that you need in
order to apply for a data scientist position. The actual position
may go beyond the initial job description as is often the case in
IT jobs. That’s good, in a way, because it provides opportunities
for learning new things, which is an integral part of being in the
fascinating field of data science.
In this chapter, we will explore the types of software that are
commonly used in a data science setting. Not all of these
programs will be used in the data scientist position you will get,
but being aware of them may help you understand your options
better. In particular, we will examine the Hadoop suite and a
few of its most promising alternatives (such as Spark, Storm,
etc.), the various object-oriented programming languages that
come into play (Java, C++, C#, Ruby and Python), the data
analysis software that is available (R, Matlab, SPSS, SAS or
Stata), the visualization programs that you may end up using
and the integrated big data system (e.g., IBM’s BigInsights,
Cloudera, etc.) that may be available for you to use. We’ll also
see other programs that you may encounter such as GIT, Excel,
Eclipse, Emcien and Oracle. Note that this list of software will
give you an idea of what to expect although it may not reflect
the actual programs you will be using; some companies may
require specialized software for their industry, which you will probably be asked to get acquainted with as soon as you are hired. Familiarity with most of the software in this list should make that a relatively easy and straightforward task for you.
8.1 Hadoop Suite and Friends
Hadoop has become synonymous with big data software over
the past few years; it is the backbone of a data scientist’s
arsenal. It is important to know that Hadoop is not just a
program, but more like a suite of tools (similar to MS Office).
This suite is designed to handle, store and process big data. It
also includes a scheduler (Oozie) and a metadata and table
management framework (HCatalog). All data processing jobs in
Hadoop are distributed over the computer cluster on which you
have Hadoop installed. These jobs can be object-oriented
programming (OOP) code, data analysis programs, data
visualization scripts, or anything else that has a finite process
time and is useful for the data analysis task. Hadoop makes
sure that whatever you want to do with your data is done
efficiently and is monitored in a straightforward way.
Hadoop does not have a particularly user-friendly software
environment, as you can see in Fig. 8.1 where a screenshot of
a typical Hadoop job is shown.
Fig. 8.1 Screenshot of a Task Dashboard in Hadoop.
The Hadoop suite is comprised of the following modules, all of
which are important:
MapReduce – created by Google, this is the main
component of Hadoop; as mentioned in a previous
part of this book, it is the heart of any big data
technology. Although it is inherently linked with
Hadoop, it can also be found in other big data
programs such as MPP and NoSQL databases
(e.g., MongoDB). Originally a proprietary Google technology, MapReduce emerged as an open source implementation via Hadoop after generous funding by Yahoo in 2006, reaching “Web scale”
two years later. One of the most well-known
algorithms for parallel computing, it makes use of a
computer cluster to query a dataset, break it down
into pieces and process it over the various nodes of
the cluster.
HDFS – short for Hadoop Distributed File System,
this is the file system that Hadoop uses. For
anything to be processed by Hadoop, it has to be
imported to the HDFS, where it is backed up across
the network of the computers the Hadoop
installation runs on. Its data limit is approximately 30
PB.
Pig – a high-level programming language for the
various Hadoop computations. You can view it as
the control module of the various operations of the
Hadoop ecosystem. Its capabilities are extensible.
Hive – a data warehouse program with SQL-like
access, designed for data spread over the Hadoop
computer cluster. Its capabilities are extensible.
HBase, Sqoop and Flume – the database
components of Hadoop. HBase is a column-oriented
database that runs on a layer on top of the HDFS. It
is based on BigTable by Google and has a data limit
of about 1 PB. Also, it is somewhat slower than
directly accessing the data on the HDFS. Not that
great for processing data stored in it, HBase is good
for archiving and counting time-series data. Sqoop
is a program that enables importing data from
relational databases into HDFS. Flume is similar
although it focuses on collecting and importing log
and event data from various sources.
Mahout – a library of machine learning and data
mining algorithms used for processing data stored in
the HDFS.
Zookeeper – Hadoop has a whole bestiary of
components, so a configuration management and
coordination program is imperative. Zookeeper
ensures that the whole suite remains integrated and
relatively easy to use.

There are also a few other components of the Hadoop suite that are supplementary to these core ones. The best way to
familiarize yourself with them is to download Hadoop and play
around with it. If you prefer, you can read a tutorial instead (or,
even better, a manual) while trying to solve a benchmark
problem.
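A classic benchmark of this kind is word count: finding how many times each word appears in a collection of text files stored on HDFS. The sketch below, written in Java, closely follows the canonical WordCount example from the Hadoop documentation; the class names are illustrative and minor API details may vary between Hadoop versions, but it shows how the map and reduce phases described earlier translate into code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this class into a jar and submit it to the cluster with something like hadoop jar wordcount.jar WordCount <input path> <output path>, letting Hadoop take care of splitting the input, shuffling the intermediate (word, 1) pairs and writing the reducers’ output back to HDFS.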
Hadoop is not your only option when it comes to big data
technology. An interesting alternative that is not as well known
as it should be is Storm (used by Twitter, Alibaba, Groupon and
several other companies). Storm is significantly faster than
Hadoop, is also open source and is generally easy to use,
making it a worthy alternative. Unlike Hadoop, Storm doesn’t
run MapReduce jobs, running topologies instead. The key
difference is that a MapReduce job ends eventually, while a
topology runs forever or until it is killed by the user. (You can
think of it as a background process that runs on your OS
throughout its operation). The topology can be visualized as a
graph of computation, processing data streams. The sources of
these data streams are called “spouts” (symbolized as taps),
and they are linked to “bolts” (symbolized by lightning bolts). A
bolt consumes any number of input streams, does some
processing and potentially emits new streams. You can see an
example of a Storm topology in Fig. 8.2.
Fig. 8.2 Example of a Topology in the Storm Software, a
worthwhile Hadoop alternative. Creating a topology like
this one is somewhat easier and more intuitive than a
MapReduce sequence.
A topological approach to data processing guarantees that it
will produce the right results even in the case of failure (since
topologies run continuously), meaning that if one of the
computers in the cluster breaks down, this will not compromise
the integrity of the job that has been undertaken by the cluster.
It should be noted that Storm topologies are programs usually
written in Java, Ruby, Python and Fancy. The Storm software is
written in Java and Clojure (a functional language that works
well with Java), and its source code repository is one of the most popular open source projects in this space.
The advantages of this software are its ability to process data in
real-time; its simple API; the fact that it’s scalable, fault tolerant,
easy to deploy and use, free and open source and able to
guarantee data processing; and that it can be used with a
variety of programming languages. It also has a growing user
community, spanning the West and East Coasts of the
USA as well as London and several other places.
Although Storm is a very popular and promising Hadoop
alternative, providing flexibility and ease of use, there are other
players boasting of similar qualities that also challenge
Hadoop’s dominance in the big data world. The most
worthwhile ones (at the time of this writing) are:

Spark – developed by the UC Berkeley AMP lab, Spark is one of the newest players in the
MapReduce field. Its aim is making data analytics
fast in both writing and running. Unlike many
systems of the field, Spark allows in-memory
querying of data instead of just using disk I/O. As a
result, Spark performs better than Hadoop on many
iterative algorithms. It is implemented in Scala (see
next section) and, at the time of this writing, its main
users are UC Berkeley researchers and Conviva.
BashReduce – being just a script, BashReduce
implements MapReduce for standard Unix
commands (e.g., sort, awk, grep, join, etc.), making
it a different alternative to Hadoop. It supports
mapping/partitioning, reducing and merging.
It doesn’t have a distributed file system at all; BashReduce simply distributes files to worker machines, with an inevitable lack of fault-tolerance, among other things. It is less complex than Hadoop and
allows for more rapid development. Apart from its
lack of fault-tolerance, it also lacks flexibility
because BashReduce only works with certain Unix
commands. BashReduce was developed by Erik
Frey (from the online radio station last.fm) and his
associates.
Disco Project – initially developed by Nokia
Research, Disco has been around for several years
without becoming well known. The MapReduce jobs
are written in simple Python, while Disco’s backend
is written in Erlang, a scalable functional language
with built-in support for concurrency, fault tolerance
and distribution, making it ideal for a MapReduce
system. Similar to Hadoop, Disco distributes and
replicates data, but it doesn’t have its own file
system. The job scheduler of this system is also notably efficient.
GraphLab – developed at Carnegie Mellon and
designed for machine learning applications,
GraphLab aims to facilitate the design and
implementation of efficient and correct parallel
machine learning algorithms. GraphLab has its own
version of the map stage, called the update phase.
Unlike MapReduce, the update phase can both read
and modify overlapping sets of data. Its graph-
based approach makes machine learning on graphs
more controllable and improves dynamic iterative
algorithms.
HPCC Systems – with its own framework for
massive data analytics, HPCC makes an attempt to
facilitate writing parallel-processing workflows
through the use of Enterprise Control Language
(ECL), a declarative, data-centric language
(somewhat similar to SQL, Datalog and Pig). HPCC
is written in C++, which some argue makes in-memory querying much faster. HPCC is a
promising alternative to Hadoop since it has its own
distributed file system.
Sector/Sphere – developed in C++, this system
promises high performance 2-4 times faster than
Hadoop. It is composed of two parts: Sector, a
scalable and secure distributed file system, and
Sphere, a parallel data processing engine that can
process Sector data files on the storage nodes with
very simple programming interfaces. It has good
fault-tolerance, WAN support and is compatible with
legacy systems (requiring few modifications). It is a
worthy alternative to Hadoop that has been around
since 2006.

Parallel to all these systems, there are several projects that can
facilitate the work undertaken by Hadoop, working in a
complementary way, so if you are going to learn Hadoop, you
may want to check them out once you’ve got all the basics
down. The most well-known of these projects are the following:

Drill – this is a Hadoop add-on that focuses on providing an interface for interactive analysis of the
datasets stored in the Hadoop cluster. It often
makes use of MapReduce to perform batch analysis
on big data in Hadoop, and, thanks to its Dremel-inspired design, it is capable of handling much larger datasets very fast.
This is possible through its ability to scale to a very
large number of servers (its design goal is at least
10000 of them), making it a good option if you plan
to work with really big data. A good tool to look into,
particularly if you plan to use Hadoop.
D3.js – short for Data Driven Documents, D3.js is an
open source JavaScript library that enables you to
manipulate documents that display big data. This
collection of programs can create dynamic graphics
using Web technologies (e.g., HTML5, SVG and
CSS). Also, it provides many visualization methods
such as chord diagrams, bubble charts,
dendrograms and node-link trees. Due to the fact
that it is open source, this list is constantly
expanding. D3.js was designed to be very fast and
compatible with programs across various hardware
platforms. Although it may not replace a full-blown
data visualization program (see subchapter 8.4), it is
a good add-on to have in mind. D3.js was
developed by Michael Bostock, a New York Times
graphics editor.
Kafka – a messaging system originally developed at
LinkedIn to serve as the basis for the social network’s activity stream and operational data
processing pipeline. Since then, its user base has
expanded, encompassing a variety of different
companies for various data pipeline and messaging
uses. It is quite efficient and integrates well with the
Hadoop ecosystem. It runs on Java, and thus on any operating system.
Julia – this is actually more of a data analysis tool,
but it is designed to be run in a distributed
computing environment such as Hadoop. It is
robust, easy to use, similar to Matlab and R (see
subchapter 8.3), and very fast. It’s a worthy add-on
to the Hadoop suite and, if you are inclined towards
programming, a good language to add to your
programming skill-set.
Impala – a distributed query execution engine
designed to run against data that is stored natively
in Apache HDFS and Apache HBase. Developed by
Cloudera, it focuses on databases and does not
make use of MapReduce at all. This allows it to
return results in real-time since it avoids the
overhead of MapReduce jobs.

8.2 OOP Language


A data scientist needs to be able to handle an object-oriented
programming (OOP) language and handle it well. Comparing
the various OOP languages is beyond the scope of this book,
so for the sake of example, Java will be discussed in this
subchapter as it is well-known in the industry. Just like most
OOP languages, Java doesn’t come with a graphical user
interface (GUI), which is why many people prefer Python (which ships with a basic graphical environment from its developers). However,
Java is very fast and elegant, and there is abundant
educational material both online and offline. A typical Java
program can be seen in Fig. 8.3.
Fig. 8.3 A Typical Java program for determining if a year
is a leap year. The program is viewed in an editor that
recognizes Java code.

Note that the highlighting of certain words and lines is done by the editor automatically (though this is not always the case, e.g., when using Notepad). Also, spacing is largely optional and is there to make the script easier to read. Note that most programs tend to be lengthier and more complicated than this simple example, yet they can usually be broken down into simple components like the one shown here.
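The program in the figure is shown only as a screenshot, but the logic behind it is simple enough to reproduce. Here is a minimal, self-contained Java version of the same idea, written from scratch for illustration (so it will not match the figure line for line):

public class LeapYear {

    // A year is a leap year if it is divisible by 4, unless it is a century
    // year, in which case it must also be divisible by 400.
    public static boolean isLeapYear(int year) {
        return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    }

    public static void main(String[] args) {
        int year = Integer.parseInt(args[0]); // e.g., java LeapYear 2016
        if (isLeapYear(year)) {
            System.out.println(year + " is a leap year.");
        } else {
            System.out.println(year + " is not a leap year.");
        }
    }
}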
Programming can be soul-crushing if you need to allocate a lot
of your time to writing the scripts (usually on a text editor like
Notepad++ or Textpad). To alleviate this, several integrated
development environments (usually referred to as IDEs) have
been developed over the years. These IDEs provide an
additional layer to the programming language, integrating its
engine, compiler and other components in a more user-friendly
environment with a decent GUI. One such IDE, particularly
popular among Java developers, is Eclipse (see Fig. 8.4),
which also accommodates several other programming
languages and even data analysis packages like R.
Fig. 8.4 Screenshot of Eclipse Running Java. Eclipse is
an excellent Java IDE (suitable for other programming
languages as well).
Other OOP languages you may want to consider are:

C++ – a language just as good as Java, very popular and fast
Ruby – a powerful OOP language alternative
JavaScript – despite the similar name, a distinct language from Java that dominates web development
Python – a good OOP language, especially for beginners
C# – a popular language in the industry, developed by Microsoft

All of these are free and easy to learn via free tutorials (the IDE
of the last one, Visual Studio, is proprietary software, however).
Also, they all share some similarities, so if you are familiar with
the basic OOP concepts, such as encapsulation, inheritance
and polymorphism, you should be able to handle any one of
them. Note that all of these programming languages are of the
imperative paradigm (in contrast with the declarative/functional
paradigm that is gradually becoming more popular). The
statements that are used in this type of programming are
basically commands to the computer for actions that it needs to
take. Declarative/functional programming, on the other hand,
focuses more on the end result without giving details about the
actions that need to be taken.
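To make these concepts a little more concrete, here is a small, hypothetical Java sketch that touches on all three: the fields are encapsulated behind methods, a concrete class inherits from an abstract one, and a base-type reference behaves polymorphically. The class names and the toy "model" scenario are made up purely for illustration.

// Encapsulation: the field is private and exposed only through a method.
abstract class Model {
    private final String name;

    Model(String name) { this.name = name; }

    String getName() { return name; }

    // Each subclass supplies its own prediction logic (polymorphism).
    abstract double predict(double[] features);
}

// Inheritance: MeanModel reuses and specializes the Model base class.
class MeanModel extends Model {
    private final double mean;

    MeanModel(double mean) {
        super("mean baseline");
        this.mean = mean;
    }

    @Override
    double predict(double[] features) {
        return mean; // a trivial model that ignores its input
    }
}

public class OopDemo {
    public static void main(String[] args) {
        Model model = new MeanModel(3.5); // base-type reference, subclass behavior
        System.out.println(model.getName() + " predicts " +
                model.predict(new double[] {1.0, 2.0}));
    }
}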
Although at the time of this writing, OOP languages are the
norm when it comes to professional programming, there is
currently a trend towards functional languages (e.g., Haskell,
Clojure, ML, Scala, Erlang, OCaml, Clean, etc.). These
languages have a completely different philosophy and are
focused on the evaluation of functional expressions rather than
the use of variables or the execution of commands in achieving
their tasks.
The big plus of functional languages is that they are easily
scalable (which is great when it comes to big data) and much less error-prone, since they avoid a shared global workspace. Still, they are somewhat slower for most data science applications than their OOP counterparts, although some of them (e.g., OCaml and Clean) can be as fast as C22 when it comes to numeric computations. If this trend continues in the years to come, you may want to look into adding one of these languages to your skill-set as well, just to be safe.
there can be an overlap between functional languages and
traditional OOP languages such as those described previously.
For example, Scala is a functional OOP language, one that’s
probably worth looking into.
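The contrast between the two styles is easiest to see side by side. The following Java sketch (Java 8 or later, since it borrows the functional-style Stream API) computes the same summary twice, once imperatively and once declaratively; the numbers are made up for illustration:

import java.util.Arrays;
import java.util.List;

public class StyleContrast {
    public static void main(String[] args) {
        List<Double> values = Arrays.asList(2.0, 3.5, 8.0, 1.5, 6.0);

        // Imperative style: spell out each step the computer must take.
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (v > 2.0) {
                sum += v;
                count++;
            }
        }
        System.out.println("Imperative mean of values > 2: " + sum / count);

        // Declarative/functional style: describe the result you want.
        double mean = values.stream()
                .filter(v -> v > 2.0)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
        System.out.println("Declarative mean of values > 2: " + mean);
    }
}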

8.3 Data Analysis Software


What good would all the programming be for a data scientist if
there was nothing to complement it and give meaning to it?
That’s where all the data analysis software comes in. There are
several options, the most powerful of which are Matlab and R.
Tempting as it may be, we will not compare them here, as the choice is usually a matter of preference. Interestingly, they are so
similar in their syntax and function that it shouldn’t take you
more than two to three weeks to learn one if you know the other
at a satisfactory level.
As R is somewhat more popular, mainly due to the fact that it is
open source and has a huge community of users that
contribute to it regularly, we will focus on it in this book. For
those who are more inclined towards Matlab and are familiar
with its advantages over R and other data analysis tools, keep
an open mind. R also has an edge over other data analysis
alternatives and is straightforward to write and run programs in,
often without the need to include loops (a programming
structure that generally slows down analysis done in a high-
level programming language). Instead, it makes use of vector
operations, which can also extend to matrices. This
characteristic is known as vectorization and makes sense for
data analysis scripts only (OOP languages are inherently fast,
so loops are not an issue for them).
The R programming environment is very basic (similar to
Python, in a way) but still user-friendly enough, especially for
small programs. The screenshot in Fig. 8.5 gives you an idea of
what the environment is like.
Fig. 8.5 The R Environment (vanilla flavor). As can be
seen here, although the programming environment is
quite user-friendly, it lacks many useful accessories like
an IDE.
R is great as a data analysis tool and its GUI is quite well made.
However, if you are serious about using this tool, you’ll need to
invest some time in learning and customizing an IDE for it.
There are several of them available (most of which are free),
but the one that stands out is RStudio (see Fig. 8.6 for a
screenshot).
Fig. 8.6 One of the many R IDEs, RStudio. You can see
here that in addition to the console (bottom-left window),
it also has a script editor (top-left), a workspace viewer
(top-right) and a plot viewer (bottom-right) among several
other useful features that facilitate writing and running R
programs.

Other alternatives to R for data analysis applications are:

Matlab/Octave – this was the king of data analysis long before R became well-known in the industry.
Although Matlab is proprietary, it has a few open
source counterparts, the best of which is Octave.
Both Matlab and Octave are great for beginners,
have a large variety of applications and employ
vectorization, just like R. However, the toolboxes
(libraries) of Matlab are somewhat expensive, while
Octave doesn’t have any at all.
SPSS – this is one of the best statistical programs
available and is widely used in research. Quite easy
to learn, it can do any data analysis, though not as
efficiently as R. Also, it is proprietary, just like
Matlab. It is preferred by academics and industry
professionals alike.
SAS – a very popular program for statistics,
particularly in the corporate world. Relatively easy to
learn, it also has a good scripting language that can
be used to create more sophisticated data analyses.
However, it too is proprietary.
Stata – a good option for a statistical package, Stata
is one of the favorite tools of statisticians. Also a
proprietary piece of software, it has lost popularity
since R became more widespread in the data
analysis world.

Note that all of these, apart from Octave, are proprietary software, so they may not ever be as popular as R or attract as large user communities. If
you are familiar with statistics and understand programming,
they shouldn’t be very difficult for you to learn; with Matlab, you
don’t need to be familiar with statistics at all in order to use it.
We will revisit R in subchapter 10.5, where we will examine how
this software is used in a machine learning framework.

8.4 Visualization Software


The importance of visualizing the results of a data analysis is
hard to overstate. That is why it is worth adding a dedicated visualization tool to your software arsenal.
Although all data analysis programs provide some decent
visualization tools, it often helps to have a more specialized
alternative such as Tableau, which can make the whole process
much more intuitive and efficient (see Fig. 8.7 for a screenshot
of this software to get an idea of its usability and GUI).
Tableau is, unfortunately, proprietary software and is somewhat
costly. However, it allows for fast data visualization, blending
and exporting of plots. It is very user-friendly, easy to learn, has
abundant material on the web, is fairly small in size (<100 MB)
and its developers are very active in educating users via
tutorials and workshops. It runs on Windows (any version from
XP onwards) and has a two-week trial period. Interestingly, it is
part of the syllabus of the “Introduction to Data Science” course
of the University of Washington.
Fig. 8.7 Screenshot of Tableau, an excellent visualization
program. As you can see, it’s quite intuitive and offers a
variety of features.

In the industry, Tableau appears to have a leading role compared to other data visualization programs. Though more
suitable for business intelligence applications, it can be used for
all kinds of data visualization tasks, and it allows easy sharing
of the visualizations it produces via email or online. It also offers
interactive mapping and can handle data from different sources
simultaneously.
If you are interested in alternatives to this software, you can
familiarize yourself with one or more of the following programs:
Spotfire – a great product by TIBCO, ideal for visual
analytics. It can integrate well with geographic
information systems and modeling and analytics
software, and it has unlimited scalability. Its price is
on the same level as Tableau.
Qlikview – a very good alternative, ideal for data
visualization and drilldown tasks. It is very fast and
provides excellent interactive visualization and
dashboard support. It has a great UI and set of
visual controls, and it is excellent in handling large
datasets in memory. However, it is limited by the
RAM available (scalability issue) and is relatively
expensive.
Prism – an intuitive BI software that is easy to
implement and learn. Focusing primarily on
business data, it can create dashboards,
scoreboards, query reports, etc., apart from the
usual types of plots.
inZite – an interesting alternative, offering both
appealing visualization and dashboards features.
Very fast and intuitive.
Birst – a good option, offering a large collection of
interactive visualizations and analytics tools. It can
create pivot tables and drill into data with
sophisticated, straightforward reporting functions.
SAP Business Objects – this software offers point-
and-click data visualization functionality in order to
create interactive and shareable visualizations as
well as interactive dashboards. Naturally, it
integrates directly with other SAP enterprise
products.

Generally, data visualization programs are relatively easy to learn, so this is not an issue when adding them to your software arsenal. Before dedicating a lot of time to mastering any one of them, make sure that it integrates well with the other programs you plan to use. Also, take a look at which visualization programs appear in most of the ads that mention the other programs you are interested in.

8.5 Integrated Big Data Systems


Although not essential, it is good to be familiar with at least one
integrated big data system. One such system, which is quite
good despite the fact that it is still in its initial versions, is IBM’s
BigInsights platform. The idea is to encapsulate most of the
functions of Hadoop into a user-friendly package that has a
decent GUI as well. As a bonus, it can also do some data
visualization and scheduling, things that are useful to have in
an all-in-one suite so that you can focus on other aspects of
data science work. BigInsights runs on a cluster/server and is
accessible via a web browser. A screenshot of the BigInsights
platform can be seen in Fig. 8.8.
Fig. 8.8 IBM’s BigInsights platform running in the Mozilla
Firefox browser. As you can see, it has a very good GUI
and is quite user-friendly.
The big advantage of an integrated big data system is its GUI,
which when combined with good documentation makes the
whole system user-friendly, straightforward and relatively easy
to learn. Also, as the GUI takes care of all the Hadoop
operations, it allows you to focus on more high-level aspects of
the data science process, freeing you from much of the low-
level programming that’s needed.
An alternative to BigInsights is Cloudera, which is well known in
the industry and more robust. Other worthy alternatives include
Knime, Alpine Data Labs’ suite, the Pivotal suite, etc. It is quite
likely that by the time you read these lines there will be other
integrated big data systems available, so be sure to become
familiar with what they are and what they offer.

8.6 Other Programs


The above list of programs would be incomplete if some
auxiliary ones were not included. These programs may vary
from company to company, but they are generally a good place
to start when it comes to refining your software arsenal. For
example, the GIT version control program is one that definitely
deserves your attention since you are quite likely to need one
such program, especially if you are going to work on a large
project along with other people (usually programmers). You can
see a screenshot of its interface and its most commonly used
commands in Fig. 8.9.

Fig. 8.9 The GIT version control program. Not the most
intuitive program available, but very rich in terms of
functionality and quite efficient in its job.
Note that there are several GUI add-ons for GIT available for all
major operating systems. One that is particularly good for the
Windows OS is GIT Extensions (open source), although there
are several GUIs for other OSs as well. This particular GUI add-
on makes the use of GIT much more intuitive while preserving
the option of using its command prompt (something that’s not
always the case with GIT GUIs).
It would be sacrilege to omit the Oracle SQL Developer
software since it is frequently used for accessing the structured
data of a company whose DBMS is Oracle. Although this
particular software is probably going to be less essential in the
years to come due to big data technology spreading rapidly, it is
still something useful to know when dealing with data science
tasks. You can see a screenshot of this program in Fig. 8.10.
Fig. 8.10 The Oracle SQL Developer database software,
a great program for working with structured data in
company databases and data warehouses.
The key part of this software is SQL, so in order to use it to its
full potential, you need to be familiar with this query language.
As we saw in an earlier chapter, this is a useful language to
know as a data scientist even if you don’t have to use it that
much. This is because there are several variants of it that are
often used in big data database programs.
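As a rough illustration of how SQL typically enters a data scientist’s code, here is a small Java sketch that runs a query against an Oracle database through JDBC. The connection string, credentials and the sales table are entirely hypothetical, and the Oracle JDBC driver (ojdbc) would need to be on the classpath; the point is simply that the heart of the task is the SQL statement itself.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OracleQueryDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and credentials.
        String url = "jdbc:oracle:thin:@dbhost:1521:orcl";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret")) {
            // Hypothetical table: one row per sale, with region, revenue and sale_month columns.
            String sql = "SELECT region, SUM(revenue) AS total "
                       + "FROM sales WHERE sale_month = ? "
                       + "GROUP BY region ORDER BY total DESC";

            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "2014-06");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("region") + ": " + rs.getDouble("total"));
                    }
                }
            }
        }
    }
}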
Some other useful programs to be familiar with when in a data
science position are:
MS Excel – the well-known spreadsheet application
of the MS Office suite. Though ridiculously simple
compared to other data analysis programs, it is still
used today and may come in handy for inspecting
raw data in .csv files, for example, or when creating
summaries of the results of your analyses. Just like
the rest of the MS Office suite, it is proprietary
though there are several freeware alternatives that
have comparable functionality to MS Excel (e.g.,
Calc from Open Office).
MS Outlook – an equally well-known MS Office suite
application designed for handling emails, calendars,
to-do lists and contact information. There are
several freeware alternatives to it, but it is often
encountered in workplaces. It will be very useful to
know if you’ll be using it every day for handling
internal and external communications,
appointments, etc. It is also proprietary.
Eclipse – as mentioned earlier, this is one of the
most popular IDEs for OOP languages as well as
other languages (even R). Very robust and
straightforward, it makes programming more user-
friendly and efficient. It is open source and cross
platform.
Emcien – a good graph analysis program for dealing
with complicated datasets, particularly semi-
structured and non-numeric ones. A good program
to look into if you are interested in more advanced
data analysis, particularly graph based. It is not a
substitute for other data analysis programs,
however, and it is proprietary.
Filezilla (or any other FTP client program) – useful if
you need to transfer large files or require a certain
level of security in transferring your files to other
locations over the internet. It is open source.
8.7 Key Points

A data scientist makes use of a variety of programs in his everyday work, the most representative of
which are described in this chapter and include:
Hadoop/Spark, an OOP language (such as Java), a
data analysis platform (such as R), visualization
software, an integrated big data system (such as
IBM’s BigInsights) and other auxiliary programs
(such as GIT and Oracle). Additional programs may
be required depending on the company and its
industry.
Hadoop is the Cadillac of big data software, and its
suite is comprised of a variety of components,
including a file system (HDFS), a method for
distributing the data to a computer cluster
(MapReduce), a machine learning program
(Mahout), a programming language (Pig), database
programs (Hive, HBase, etc.), a scheduler (Oozie),
a metadata and table management framework
(HCatalog) and a configuration management and
coordination program (Zookeeper) among others.
There are several alternatives to the Hadoop suite
such as Storm, Spark, BashReduce, the Disco
project, etc.
There are a few programs that can facilitate the
work undertaken by Hadoop, working in parallel to it:
Drill, Julia, D3.js and Impala among others.
As a data scientist, you need to be able to handle at
least one OOP language such as Java, C++, Ruby,
Python, C#, etc. OOP languages are currently the
most widespread programming paradigm although
lately there has been a trend towards functional
languages.
Functional programming languages (such as
Clojure, OCaml, Clean, ML, Scala and Haskell) are
good assets to have, particularly if you are good at
programming and want to expand your
programming skill-set.
You must be intimately familiar with at least one of
the data analysis tools that are used nowadays: R,
Matlab/Octave, SPSS, SAS or Stata. Of these only
R and Octave are open source, with the former
being the most popular choice overall.
Tableau is one of the best choices for data
visualization software although there are several
other worthy alternatives such as Spotfire, Qlikview,
Birst, inZite, Prism and SAP Business Objects.
Big data integrated systems, such as IBM’s
BigInsights platform, are also worth looking into
since they make the whole data science process
more efficient and insulate you from much of the
low-level programming required for MapReduce.
Some other programs worth familiarizing yourself
with are GIT (or any other version control program),
Oracle, MS Excel, MS Outlook, Eclipse, Emcien and
Filezilla (or any other FTP client program). Naturally,
the more programs you know (even programs not
included in this list), the better off you are, provided
you know them well enough and they are useful in a
business setting.

22 C is one of the best structural programming languages ever created and constitutes a benchmark when it comes to speed. Although it is not
used so much anymore since the OOP paradigm took over, its OOP
counterparts C++ and C# are still very popular and powerful
languages. C is also the basis of Matlab, one of the best data analysis
programming platforms out there. Because C is a very low-level
language, it is not intuitive and can often be challenging to work with
for complex programs.
Chapter 9
Learning New Things and
Tackling Problems

Learning new things is an integral part of being a data scientist, especially since innovations are fairly common in the field.
However, if you are new to data science, you are bound to have
a lot of gaps in your knowledge, so learning the missing
material is essential. Of course, if you want to evolve
professionally, this is something you would do in any
profession. However, in data science there are new programs
coming out all the time, so even if you were the best data
scientist in the world right now, your skills would be bound to be
somewhat obsolete in a few years if you decided not to keep
abreast of the developments in the field.
Tackling problems is similar to learning new things in that it
requires the same flexibility and mental agility. Although this is
common with many IT-related professions, in data science
problems are a bit more commonplace, mainly because it’s an
interdisciplinary field. However, by tackling the problems that
arise with a positive attitude and a creative approach, you’ll
also learn more new things than you’d normally be able to learn
otherwise.
In this chapter we’ll examine various ways that you can
upgrade your knowledge and, more importantly, your skill-set
right now as well as while on the job. In the first four
subchapters you’ll find out how you can learn from workshops, conferences, online courses (often referred to as
MOOCs) and data science groups. In the later subchapters,
you’ll learn about the various problems that may arise in your
work as a data scientist: namely, resource issues, requirements
issues, insufficient know-how for a task you undertake, and
integration issues.

9.1 Workshops
Workshops are the most efficient way to learn something new,
especially when it comes to technical know-how. Fortunately,
due to the increased popularity of the data science field there
are numerous workshops available from which to learn any
aspect of the field.
Workshops tend to be somewhat expensive (several hundred
dollars each) but they are a good investment, especially if you
are good at picking up new knowledge and know-how. Free
alternatives for learning new things will be covered in
subchapter 9.3. How to find the best workshops will be
discussed later in this section.
So why bother with workshops if there are other ways to learn
new things? Well, workshops provide networking opportunities,
can enhance your resume (if you have no other data science
related qualifications), and often provide more useful
knowledge and know-how than university courses, regardless
of the university. This is because university courses are often
based on the available literature in scientific books, journal
papers and conference proceedings and are designed to give
students the foundation on which to build more advanced
knowledge.
Workshops are also very time efficient, squeezing into a few
hours material that would normally take days to learn on your
own. They are often hard and demand all of your concentration,
but they enable you to learn something you would normally not
have the time or resources to learn on your own.
The key things to keep in mind when choosing to register for a
workshop are what you are going to learn and how it can be
useful for your job as a data scientist. This sounds obvious, but
it is really easy to get sold on workshops that you don’t need
since they all appear quite appealing at the sites that promote
them.
To ensure that you stay focused on the appropriate workshops,
make a list of the skills and knowledge that you want or need,
then research workshops that are being offered. Update your
list if you find workshops that offer something you haven’t
thought of; if there are several workshops that offer it, it is
usually something useful to know in the industry. Finally, pick
the workshop that is most suitable for what you want or need,
taking into account its location, the time of the year it’s offered
and, of course, its price. You can’t go wrong with a strategy like
that.

9.2 Conferences
Conferences are like workshops but are designed for larger
groups of people. They offer some innovative pieces of
knowledge based on research and case studies as well as
more foundational information for those who are newer to the
subject of the conference. More often than not, conferences
offer workshops to attract more people. Note that in this book
we are referring to non-academic conferences, since the
academic ones have a different mission and scope.
Conferences are a great way to learn a variety of new things in
a short period of time, meet new people, exchange war stories
and get acquainted with other challenges in the field.
Conferences are quite interactive and provide great mental
stimulation, very similar to some good university classes, but
without the stress of exams and written assignments. They are
usually costly, making them a viable option mainly for full-time
professionals. However, given the benefits they can provide,
they are a worthy alternative for anyone interested in expanding
his skill-set and data science knowledge. Fortunately,
companies often cover at least some (if not all) of the expenses
of their employees who are participating in such conferences.
The big advantage of this option for learning new things is that
it is very time efficient, especially when combined with a couple
of workshops. If you can relate this new knowledge to an
existing problem you are facing, that’s even better. The bottom
line is that if you are open to new things, a conference can
prove to be a very fruitful experience that may enrich your
understanding of data science and your particular role, too. You
can find out about the various conferences that are being
offered by searching the web directly or through the various
data science groups (see subchapter 9.4).

9.3 Online Courses


Although the world today has a lot of issues, it’s also the first
time in our history that refined knowledge23 on a large variety of subjects is publicly available at no cost. This is thanks to the various online courses, particularly MOOCs24.
The first MOOCs appeared in 2008 and have grown in
popularity and in variety since then. The largest MOOC
provider, Coursera, is an initiative of two faculty members of
Stanford University, Prof. Daphne Koller and Prof. Andrew Ng.
The courses on this site span from calculus to philosophy to
history of art. Since one of the founders, Prof. Ng, is a leading
machine learning expert, there are several worthwhile courses
on data science (Prof. Ng’s course “Machine Learning” is one of
the best MOOCs out there, not just within the Coursera site).
Coursera’s website (www.coursera.org) is user-friendly and
straightforward, and so are its applications for smartphones and
tablets to facilitate the use of the site’s content while you’re on
the move.
There are several other places where you can find MOOCs, the
most well-known of which are:

Udacity – this MOOC site covers a variety of courses on science (especially computer science),
design and business.
edX – focusing mainly on science, this site also
offers some courses on humanities and
business/economics.
Khan Academy – a great site for the younger
students, this is a good resource for mathematics
and science courses.
Codecademy – if you want to learn or just practice
programming, this is a good place to start. The
focus is on web-based programming, though.
Open Learning Initiative – having a relatively small
collection of MOOCs, this site focuses on quality
and variety. It is still quite new but appears to be
promising.
Open Yale Courses – a very well-organized archive
of courses offered in the famous university, this is a
place where you can find good quality material
(videos, transcripts, etc.) on various subjects to
download and use at your own pace.
OpenLearn (Open University) – one of the most
established free learning resources, which offers
high-quality, reliable information. It has a variety of courses
(even language-related ones) and a large
community of users. Worth looking into if you have
time. Note that this is a serious MOOC provider,
requiring a certain level of commitment.
canvas.net – a great MOOC site with a large variety
of subjects, although most of them are not directly relevant to data science.
openHPI – a relatively small MOOC provider
focusing on web-related technologies. Still, it is quite
relevant to data science.
NovoED – an interesting place for MOOCs, mainly
on business and management as well as a few
other subjects. Despite its limited variety, it has
some very good courses from Stanford University.
MongoDB – this is a highly specialized MOOC
provider focusing on the MongoDB database
framework. Still, it is very relevant to data science,
especially if you are interested in this particular
piece of big data technology.
Open2Study – this is an excellent resource for small
courses (lasting 4 weeks). It focuses on business
and management MOOCs, but has also some
computer science courses and a few other kinds of
MOOCs.
All the above alternatives are great, but it’s good to keep in
mind that none of them come anywhere close to the Coursera
site in terms of quality and popularity (a typical data science
course at Coursera has 50,000-100,000 students enrolled). In
addition, the courses of Coursera are quite interactive, and if
you commit to them, they can be a very enjoyable experience.
However, if you can’t find the course you are looking for on that
particular site, it is worth taking a look at the alternative MOOC
providers to supplement your learning.
The (Coursera) MOOCs on data science that are definitely
worth looking into are:

Web Intelligence and Big Data (Indian Institute of Technology) – a great place to learn about big data,
the MapReduce algorithm and how they relate to the
Web (which is one of the main sources of big data
nowadays). The instructor, Prof. Gautam Shroff, is
very knowledgeable and methodical, making this
course a must for any aspiring data scientist. A
Certificate of Accomplishment (CA) is available.
Statistics One (Princeton) – a great course for
learning the basics and more of this fundamental
subject. Also, a great place to practice using the R
data analysis platform. No CA is available, though.
Computing for Data Analysis (Johns Hopkins) – a
great place to learn about R and use it in various
data analysis applications. Difficult if you are
unfamiliar with R since it is quite short (4 weeks).
CA is available.
Machine Learning (Stanford) – as mentioned earlier,
this is the course given by one of the founders of
Coursera, Prof. Ng. One of the best courses
available on this subject, with a variety of topics
related to data science. Note that this is just an
introductory course, though, so it doesn’t go into
depth on any of the methods presented in it.
Programming language used: Matlab/Octave. CA is
available.
Data Analysis (Johns Hopkins) – this course
focuses on data analysis methodology using R, so
some familiarity with it is very useful. CA is
available.
Statistics: Making Sense of Data (University of
Toronto) – a great place to learn the basics of
statistics and have a good time doing so. Plenty of
examples, interesting case studies and very
charismatic instructors. CA is available.
Introduction to Data Science (University of
Washington) – this is a must for any aspiring data
scientist. Unfortunately, there have been no
upcoming sessions of this MOOC for months, yet
the lectures of the course are available at the
MOOC’s webpage. Programming languages used in
this course: Python (ver. 2.7), R and SQL. CA is
available.
Machine Learning (University of Washington) – this
is different from Prof. Ng’s course, but it is quite
good. It covers topics that the other machine
learning course doesn’t, and you can use whatever
programming language you prefer for the
assignments. CA is available.
Passion Driven Statistics (Wesleyan) – an
interesting course covering topics beyond an
introductory statistics course. However, the statistics
package used is SAS (not the easiest tool to learn),
and it is not available in certain countries due to
licensing issues. CA is available.
Introduction to Databases (Stanford) – one of the
first courses on Coursera, this MOOC provides a
good introduction to the subject although it is
recommended that you have a solid background in
computer science already. This is a self-study
course, so you can do it at your own pace. No
information about CA availability.
Note that as more and more universities develop MOOCs, there
may be new data science courses that are not on this list. So
keep your eyes open and ask around. Oftentimes, the Coursera
forums are a great place to get informed about courses similar
to the ones you are taking, plus you can get some useful
feedback on how good they are from classmates of yours who
have taken them. A great place to get additional evaluations of
the various courses is Coursetalk (coursetalk.org), so check
this out too before enrolling for a course to make the most of
your time. Finally, Coursera has recently started offering
specializations, which are basically bundles of courses from a
university, with an exam or project at the end and a specialized
certificate if you pass all the classes. Not all specializations are
free. Currently there is one specialization for data science,
offered by Johns Hopkins University.
9.4 Data Science Groups
Data science groups, which are popping up all over the place,
are one of the most enjoyable ways to learn, especially if you
like socializing and networking. If you live in a large city, the
chances are that you will find one in your area. They are a great
place to network and make acquaintances that may lead to job
opportunities. You can read more about that in Chapter 13.
Since data science is a buzzword, some data science groups
use the name in order to get a lot of people involved without
living up to their promise. So always check out a group’s
organizer(s) before joining it since time is a very valuable
resource and there may be better ways of using it for your data
science endeavors. If the organizer is an actual data scientist or
someone who strikes you as very knowledgeable on the
subject, you can hop onboard. In addition, check out the events
that the group hosts. If they include a lot of talks by respectable
professionals who are related to the field, it is a worthwhile
choice. If most of the meetings are just conversations among
the members, maybe you can skip that one. Finally, make sure
that there are several members in the group (the more the
better) to ensure that you will meet lots of interesting
professionals, improving your chances for learning. Note that a
group about machine learning or data mining is also relevant,
so don’t consider only groups with “data science” in their
names.
You can learn from a data science group in (at least) two ways.
First of all, you can attend the events where a knowledgeable
speaker presents a data science topic. This person could be a
researcher in the field or an industry professional, possibly even
a developer of some promising new big data program. We
already talked a bit about Storm and how great an alternative it
is to Hadoop. This piece of software became popular through
the developers’ various presentations. Imagine attending one
such presentation and being one of the first people to learn
about the software. If you played your cards right and acted on
the knowledge, you would have an edge when it came to this
program. And if a company was looking for someone who was
familiar with it, you’d be one of the people to be shortlisted.
The other way to learn with a data science group is through
active conversation with the other members of the group. (This
approach is useful for all kinds of professional events, by the
way.) This means actively participating in a conversation,
asking meaningful questions, providing brief and focused
replies, etc. If you enter a conversation and let yourself vent
about the problems you are facing at work or about topics that
are of no interest to the others, don’t expect to learn much or
keep the other participants of the conversation interested for
long. Listening is the key here as well as being able to ask
questions that will make the other person think and offer
meaningful responses, engaging them in a creative debate.
Apart from all these sources of learning, there are also the
various data science websites and blogs, which you are
probably somewhat familiar with. Books may be great at
providing you with some reliable fundamental knowledge about
the field, but when it comes to staying updated, nothing beats
the Web. It would be a futile task to try to list all the various
online resources on data science, especially considering how
quickly the Web is changing. However, there are a few that are
definitely worth looking into (see Appendices 1 and 2). One that
seems particularly easy to digest is the Data Science 101 blog
– http://datascience101.wordpress.com.
9.5 Requirements Issues
Requirements issues are a type of problem you may encounter
although this greatly depends on the company you are in (or
your clients, if you are working as an independent consultant or
freelancer). Many IT professionals encounter problems with
requirements, and it is not unusual to see similar problems in a
data science setting. Issues with requirements have to do with
the miscommunication and misunderstanding of a project’s
requirements as well as how they are implemented in a working
prototype. That’s surely something that some good
communication can fix, right?
Well, it’s a bit more complicated than that. When two parties with
completely different backgrounds and priorities communicate,
even if they communicate well, there may be subtle differences
in how each side understands what was agreed once the
requirements are filtered through what is actually feasible
(taking into account the resource limitations described
previously). For example, your manager may want you to
create a prediction model for a particular dataset, but after
examining it you realize that this may not be feasible with the
data you have or the tools you can muster (let alone the
timeframe in which this needs to be done). The requirements
need to be reasonable, but reasonable is a relative term when it
comes to something that has not been created yet. The data
product you create may not work the way you imagine, so you
and your manager or client need to agree on a set of
requirements that also outline the desired end result. You need
to be able to manage your client’s expectations, help them
understand the limitations of your tools and hardware and find a
solution that is mutually satisfactory. That takes a lot of
creativity, diplomacy and communication, not to mention
patience.
9.6 Insufficient Know-How Issues
Lack of knowledge is the most commonly encountered issue for
people who are new to a data science job, though it can affect
more established data scientists as well,
especially when changing industries. If you have insufficient
know-how, it’s better to admit it and offer a strategy for
overcoming the issue rather than hiding it and pretending you
know everything. Try to augment your knowledge from one or
more of the following sources:
Reliable article – an article written by someone
knowledgeable. Reliable articles are usually found
on established data science portals and journals as
well as on LinkedIn.
Relevant technical book – pick books written
specifically on the topic on which you want to expand
your knowledge. Technics Publications has many
reliable books on a variety of data-related topics, so
that would be a good place to start. In general, try to
avoid books that are written for non-technical people
when seeking a specific piece of know-how. Even
this book may be less than ideal for filling most of
your technical know-how gaps.
Reliable website – try a technical blog or forum
rather than a generic one that is there just to
generate traffic for its ads. What’s important is that
it’s run by someone who knows what he is talking
about. If you are familiar with SEO, look into the
site’s source code to get hints of whether it is worth
the traffic it receives and whether or not it has been
developed by a professional.
Worthwhile workshop – this source was covered in
subchapter 9.1.
Specialist – if the above options don’t work out, or if,
for whatever reason, you prefer to bypass them,
consult a fellow data scientist. Here’s where
networking pays off. You can find a fellow data
scientist in a data science group (see subchapter
9.4), physically or online. It would be best to avoid
shooting an email to the author of this book asking
technical questions, though, and you should always
be considerate of others’ schedules before contacting
them. Note that in the initial stages of your data
science career, you may want to have such a
specialist as a mentor since you may have a
number of things to ask that may not be readily
answered in the aforementioned sources.
Issues related to insufficient know-how may be challenging, but
with enough humility, open-mindedness and research, it’s only
a matter of time before you resolve them. No one was born
knowing everything, and no one knows all there is to know in
this ever-changing field. So don’t hesitate to seek the missing
pieces of your know-how puzzle and evolve as a professional
through this experience.
9.7 Tool Integration Issues
Tool integration issues are common in the work of a data
scientist. You may have a great tool (e.g., Matlab), but the other
software you work with doesn’t integrate with its scripts. Or you
may develop a great data analysis script in R (which pretty
much every data science software integrates with), but the
format of the data file or data stream you have to process is not
recognizable by your script.
In general, when dealing with an integration issue you need to
research the technologies used and how other people have
resolved (or at least tried to resolve) the problem at hand.
Clever use of a search engine is a big asset here, so make use
of it extensively. You will also need to employ your
communication skills, your creativity and, of course, patience.
Here’s another instance when the contacts you’ve made
through networking can be useful. Solving tool integration
issues will enable you to become more intimately acquainted
with the software you use and allow you to exercise your
creativity and research skills, both of which are essential to your
role.
9.8 Key Points
Keeping your knowledge up to date is a very
important aspect of being a data scientist, especially
when new innovations come about in the field.
Workshops are the most efficient way to learn
something new, particularly about technical
subjects. Workshops may be expensive, but they
can be a worthwhile investment of your resources
because they can enhance your resume and often
provide more useful knowledge than university
courses.
Conferences are an excellent way to expand your
understanding of data science, get updated on
recent innovations, meet lots of interesting people in
the field and learn about useful things you can apply
to the problems you are facing in your data science
endeavors.
Online courses, particularly MOOCs, are one of the
best ways to increase and refine your knowledge of
a variety of topics. There are several data science
related courses available on Coursera, the most
established MOOC provider.
Data science groups are a great and quite enjoyable
way to learn new things about the field. You need to
find a group that hosts a lot of educational events and
has many members in the data science field (not
just beginners), and to practice active conversation
when socializing with the other members of the
group.
Resource issues are quite common and involve
dealing with the limited resources that are available
for data analysis tasks.
Requirements issues are commonplace and involve
miscommunication, misunderstanding and
misinterpretation in terms of implementation of the
requirements your manager or client has. These
issues can be effectively tackled by employing
creativity, diplomacy and communication as well as
patience.
Insufficient know-how for your work can be
overcome by reading a good article/book/website or
by consulting a specialist.
Integration issues are quite common in the IT world
and involve getting different programs as well as
datasets of various formats to work together. This is
particularly difficult when working with newly
developed programs. You can overcome this kind of
issue by employing good communication, creativity
and patience.
This is knowledge that is produced by scientists and/or academics, includes
proper references and has a certain amount of usefulness to it.
Refined knowledge has a certain value to it, and in most cases it is
hard to find unless you know where to look.
MOOC = Massive (or Massively) Open Online Course: a usually free online
course open to anyone and potentially having a huge number of
enrolled participants, according to dictionary.com.
Chapter 10
Machine Learning and the R Platform
We have already learned a little about the R platform and how it
is one of the tools of choice for many data scientists. In this
chapter, we’ll see a bit more about what R can do, and how you
can expand your knowledge of this versatile software so that
when you start learning it in depth, you will find it to be easier
and more interesting.
Before we get a bit technical, though, we’ll take a look at
machine learning25, one of the core disciplines of data science
and one that has received much less attention than it deserves.
We’ll examine how it came to be a discipline, how things are
looking in its future, how it compares with statistics and how it is
incorporated into data science. Afterwards, we’ll see how R
manages to combine this promising discipline with the more
established one of statistics through a brief overview of the
platform. Finally, we’ll see what resources are available for both
machine learning and R.
10.1 Brief History of Machine Learning
Machine learning has come a long way since its invention. In a
way, it has followed the development of computers and the
science that accompanies this technology. However, machine
learning applies to a variety of machines, ranging from
stationary computers to robots to mobile devices (the most
representative of which is the smartphone, which makes use of
pattern recognition and clustering among other machine
learning techniques), and even cars (Fig. 10.1). Basically,
anything that can have computing power is capable of
incorporating machine learning programs.
Fig. 10.1 Machine learning in action.
Machine learning came about in the 1950s and involved
systems that were based on rules to process various types of
information, requiring a lot of programming from the developer.
Also, their use was restricted to a very specific domain, and they
could evolve little once they were deployed. As time
went by and more applied research was done on the machine
learning field, more sophisticated and more agile systems came
about, the most notable of which were the Artificial Neural
Networks (ANNs) that were very popular in the 1990s. The idea
was to emulate certain cognitive functions of the human brain
through a relatively simple but quite scalable mathematical
model that was able to attain great generalization and even
retain information once it was trained.
With the development of more and more tools, machine
learning grew to be a very practical field with many real-world
applications, based on advanced data processing systems that
had an assumption-free approach to the problems they tackled.
This was often referred to as pattern recognition and still
remains a very important aspect of data analysis, especially
non-statistical data analysis. In fact, most of the non-statistical
data analyses conducted make use of pattern recognition
techniques.
One very important technique, which was also a fundamental
tool for pattern recognition applications and in hardware
implementations of machine learning, was fuzzy logic. This
method entailed the use of non-rigid logic sets for more flexible
modeling of conditions, especially in situations where the state
of a system was described linguistically (e.g., hot, warm,
lukewarm, cold, etc.). Since its introduction by Prof. Zadeh in 1965,
fuzzy logic has been very popular and has been used in
conjunction with ANNs for an even more robust machine
learning system known as neuro-fuzzy classifier (aka ANFIS).
Fuzzy logic is one of the cornerstones of artificial intelligence
and is still used widely today.
In the 2000s, a new order of machine learning systems was
developed through the innovation of adaptive programming.
This gave rise to more versatile machines that could run
adaptive programs that would undertake a variety of tasks
ranging from data collection, data processing, learning from
experience, information abstraction and optimization for
efficiency and/or accuracy. Also, several more programming
languages came about (mainly for the Web), and the computer
science profession grew in both relevance and popularity.
Nowadays, machine learning has become an inherent aspect of
any serious computer system that deals with data. It is really
hard to find something that is completely devoid of a machine
learning algorithm, even if it is a simple clustering one.
(Clustering-based search engines have become more
widespread, with Clusty and Ixquick being probably the most
representative ones in that aspect.)
One thing that stands out, though, is deep learning, which is
basically the use of highly evolved ANNs (aka deep belief
networks) that incorporate statistics in a highly complex ANN
structure. This enables them to achieve generalization that was
never possible before, even in cases where the user doesn’t
have any domain knowledge whatsoever. A very interesting
example of deep learning in action was the case of Prof. G.
Hinton, who won first prize in a Kaggle competition using this
technology even though he had no prior knowledge of the
domain of the data involved26. Note that although deep learning
is closely associated with speech recognition techniques
(where it was originally applied with tremendous success), its
use now covers several other applications, making this a very
promising (if not the most promising) field of machine learning
today.
10.2 The Future of Machine Learning
It is truly difficult to speculate on the future of such a
sophisticated discipline, especially when the corresponding
technology that employs it is changing so rapidly. However, we
expect to see machine learning branching out in various
directions as it is already quite diverse.
One direction of machine learning, which could prove to be a
major branch, is algorithms that are optimized for smaller
devices such as smart watches and phones. This is not so far-
fetched considering that most Internet access nowadays is
done through some kind of mobile device. And if there is one
thing that is quite appealing to a machine learning system, it is
the Web, because of the large number of applications of machine
learning techniques, such as clustering and pattern recognition,
on Web data.
Another major direction or branch could be statistical A.I., which
was very popular in the 1990s and still maintains its value
today. As statistical models continue to evolve, it is possible
that the corresponding machine learning algorithms can take off
as an independent field.
The most promising direction, however, is deep learning (deep
belief neural networks) and the emulation of a complete human
brain. Prof. Ng, one of the world-renowned experts in the field,
is already working on this project in Silicon Valley. Although this
project is still in its infancy, it is possible that it will soon
constitute a machine learning branch on its own.
Another quite promising direction is the merging of machine
learning with some A.I. techniques that are not already part of
machine learning, such as evolutionary computation algorithms.
These have been used very successfully for a variety of
applications since their optimization potential is quite
exceptional when it comes to complex problems. Evolutionary-
based machine learning algorithms are not only feasible, but
already somewhat popular today. If these evolutionary
algorithms continue to develop, it is quite likely that this can
become an independent branch of machine learning.
The most interesting trend, from a scientific perspective, would
be the development of systems that perform an assumption-
free analysis of the data at hand. Although there are a few such
systems already in place, there has not yet been a theoretical
framework of this approach, nor serious and focused research
on the field. Assumption-free data analysis is employed by
ANNs and could prove to be the reason behind their success. A
formal framework of this philosophy is bound to expand our
understanding of the data world and give rise to a series of
interesting machine learning applications that would be backed
by a concrete theoretical core.
There will no doubt be other evolutionary routes of machine
learning that are completely unknown today since they have yet
to be developed. The bottom line is that this is a promising field
that has yet to realize its true potential as well as provide
benefits for data science and the world in general. So stay
tuned for new developments27…
10.3 Machine Learning vs. Statistical Methods
Machine learning is very promising, but is it better than
statistics as a data analysis methodology? There is no clear-cut
answer for this question since each discipline has its own
merits. Machine learning is great at dealing with data without
making any assumptions about its distribution, and it is very
efficient and has a lot of potential given the continuous
advancement of computational technology. It receives a great
deal of research interest, which enables refinement of its
current methods. However, statistics is a well-established field,
has clear-cut and heavily researched proofs behind all of its
models and is very effective with large datasets. Also, its
models are relatively easy to implement, and it is generally
easier to attain a working knowledge of.
How does a data scientist decide which one is better for a
particular application or task? First, he must get acquainted with
both of them. So let’s take a look at some of the fundamental
tasks of the data science process and how statistics, machine
learning or both can be used in each one of them.
Fundamental processes:
Basic description of the data – Statistics is the
clear winner here since it has specific metrics that
are designed for that particular task. Its various
methods (often referred to as descriptive
statistics) give a very clear overview of the data
and provide some useful intuition about how to
deal with it. Machine learning, however, doesn’t
deal with this kind of task since it is designed to
tackle different types of problems.
In-depth understanding of the data – Machine
learning is somewhat better in this aspect as it
can provide more accurate and more insightful
information on the data at hand. You could use
statistics as well with some success, but if you are
familiar with machine learning, you can perform
miracles here.
Data exploration – Both machine learning and
statistics are good for this task because they are
adept at providing the measures and plots
necessary to gain an understanding of what
potential patterns are there. The use of
histograms and other plots is especially useful for
this.
Data analysis processes:
Model building (simple models) – Statistics is
probably better due to the large amount of
literature on this subject. Also, statistical models
are generally much simpler to understand and
evaluate.
Model building (complex models) – Machine
learning is probably better in this aspect even
though you can do some quite complicated stuff if
you are a statistics guru. Still, machine learning is
easier to work with for this type of model (e.g.,
using genetic programming).
Automated somewhat intelligent processes –
Machine learning is a clear winner as it has a
significant overlap with artificial intelligence
technologies.
Deep learning based on data – Machine learning
is ideal for this task since all kinds of learning
technology that is not strictly model-based is in
the machine learning domain. Also, there has
lately been a lot of research in this application.
Predictive analytics – Machine learning is great
for this task although statistics could be used as
well (e.g., for time-series analysis).
Confidence estimation – Statistics is the clear
winner here, primarily because all of the
measures it estimates (through techniques usually
referred to as inferential statistics) are
accompanied by formulas to calculate the
corresponding confidence intervals.
Learning from unknown data – Machine learning
is king here as there is a lot of research in this
field on algorithms for these tasks (often referred
to as unsupervised learning).
Learning from known data – Machine learning is
again great for this application as there is a lot of
research in this field on algorithms for these tasks,
too (often referred to as supervised learning).
However, there are several statistics methods that
can do a quite good job, too.
Finding outliers – Both statistics and machine
learning are good for this type of application.
Machine learning has anomaly detection
algorithms that can provide accurate and easily
interpretable results. Statistics can also provide a
solution to this kind of problem (though it’s not that
great for complex datasets).
Finding associations – Another example of how
machine learning excels. Actually there is a lot of
research for this type of application through the
use of association rule mining techniques,
especially for large datasets.
Other tasks:
Data visualization – Both statistics and machine
learning can do well on this task though you may
want to look into specialized software instead.
Finding recommendations – Machine learning is
excellent for finding recommendations as there is
a whole field of machine learning (recommender
systems) that deals with it exclusively.
Dealing with neat (structured) datasets – Both
statistics and machine learning are a good option
for this.
Dealing with unstructured data – Graph-based
systems, which are a type of machine learning
algorithm, are ideal for this.
Explaining the basis of a model – Statistics has a
lot of theory to back up any of its models, so
explaining the basis of a model is a fairly
straightforward task.
By taking all of the above into consideration, you can develop
your own unique approach to dealing with data analysis
challenges, using both machine learning and statistics
techniques effectively in a data science framework.
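To make the comparison more concrete, here is a minimal R sketch, using the built-in iris dataset purely for illustration, that contrasts a statistical summary and a simple statistical model with a basic machine learning model; the rpart package used for the decision tree typically ships with standard R distributions, but install it first if it is missing.

# Descriptive statistics: a quick statistical overview of the data
data(iris)
summary(iris)                      # means, quartiles, class counts

# A simple statistical model: linear regression on one variable
fit_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(fit_lm)                    # coefficients plus confidence information

# A simple machine learning model: a decision tree from the rpart package
library(rpart)
fit_tree <- rpart(Species ~ ., data = iris)
print(fit_tree)                    # human-readable splitting rules

The point is not that one approach wins, but that each answers a different kind of question about the same data.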

10.4 Uses of Machine Learning in Data Science
Any data scientist worth his title will employ machine learning in
the data science process, partly because it is difficult to tackle
data analysis tasks through conventional means. Although
statistics can do a great job of describing the data and
providing some inferences, if you really want to go into the
subtleties of the data’s structure and explore it in depth,
machine learning is the way to go.
Machine learning methods can effectively evaluate the
information content of data, providing useful insight into the
value of a particular feature (useful aspect of a meaningful
pattern) or a group of features. This can have several useful
applications of its own such as feature selection and other
forms of data reduction. There is a substantial amount of research
work on this subject; even if some machine learning techniques
don't have a robust mathematical theory behind them, they are
very effective and quite scientific. This task is closely linked with
feature evaluation, which involves estimating how useful or
information-rich any given feature, or set of features, is for a
specific problem (usually classification or regression analysis).
Feature selection and other data reduction methods are
extremely useful when dealing with large datasets. Machine
learning methods can help you in that task by providing efficient
and effective tools that will also give you peace of mind. There
is no need to worry about statistical significance tests and other
relevant metrics as most machine learning methods are
automated and easy to use, providing easy to interpret results.
To ensure that you make the most of these techniques, it is
best to practice on smaller datasets, especially datasets that
have already been studied extensively for data analysis
applications. Such datasets are publicly available from
various sources on the web such as the UCI machine learning
repository.
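As a hedged illustration of feature evaluation, the sketch below ranks the variables of the iris dataset by how useful they are for predicting the species, using the randomForest package (an assumption here is that the package has been installed, e.g. with install.packages("randomForest")).

library(randomForest)

data(iris)
set.seed(1)                                   # for reproducible results
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Rank the features by mean decrease in accuracy; higher = more informative
imp <- importance(rf, type = 1)
imp[order(imp, decreasing = TRUE), , drop = FALSE]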
Machine learning can also help you organize your data by
assigning labels to it (clustering) or by pinpointing outliers
(anomaly detection) through sophisticated and easy-to-use
techniques. The best part is that machine learning methods
lend themselves to lots of tweaking, so you can customize your
method of choice to the application at hand. Make sure you
have a thorough understanding of the theory before playing
around with these algorithms as they have subtleties that may
be overlooked even by an experienced programmer. Reading
an overview article on the topic or a specialized book on
machine learning would be a good place to start.
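Here is a minimal sketch, using only base R, of the two organizing tasks just mentioned: assigning cluster labels with k-means and flagging possible outliers by their distance from the nearest cluster center. The choice of three clusters and the cut-off of five points are arbitrary illustrations rather than recommendations.

data(iris)
x <- scale(iris[, 1:4])                  # numeric features, standardized

set.seed(1)
km <- kmeans(x, centers = 3)             # clustering: assign a label to each point
iris$cluster <- km$cluster

# Simple anomaly detection: distance of each point from its cluster centroid
d <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))
head(iris[order(d, decreasing = TRUE), ], 5)   # the 5 most "unusual" points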
Advanced machine learning techniques such as deep learning
may be especially useful for a data scientist, assuming he is
comfortable with this technology and knows how to use it
effectively. Deep learning can help the data scientist find useful
patterns in data without the need for any domain knowledge on
his part, making the whole data analysis process much more
efficient and effective.
There are other ways you can apply machine learning to the
data science process. The best way to become familiar with
them is to get acquainted with the various techniques, practice
them (a machine learning course would be a great place to
start), and then experiment a bit with some real-world data. One
way to learn more about machine learning in action for data
science applications is to load R onto your computer and learn
it. We’ll look a bit more into this great data analysis tool, as well
as its relationship with machine learning technology, in the
subchapters that follow.
Through the use of machine learning in data science
applications, you can learn more about data analysis strategies
and gain a deeper understanding of algorithms and how to
implement them effectively. Finally, and most importantly, you can
connect with other data scientists and learn more about their
work, which will in many cases involve at least a few machine
learning techniques.
10.5 Brief Overview of the R Platform
As we learned in an earlier chapter, R is a great platform for
data analysis applications, combining the diversity of tools
found in Matlab with the accessibility of open source software
that has a strong emphasis on statistical analysis and other
techniques that are implemented in various built-in scripts and
libraries. Let’s take a closer look into what this great program
can offer you as a data scientist and how various machine
learning techniques can be implemented in it.
R, especially if it is accompanied by a proper integrated
development environment (IDE) such as RStudio (as seen in
Chapter 8), can undertake the following tasks in a
straightforward and user-friendly manner:
Load and store data efficiently (it takes about 5
minutes to learn the corresponding commands)
Calculate descriptive statistics of any dataset
Create both simple and complex data models
Calculate inference statistics for a variety of
distributions and statistical tests for all types of
data (numeric, categorical, etc.)
Create professional-looking plots, including
interactive plots, by using the appropriate
library/package
With R you can also:
Find help on a variety of topics, ranging from highly
technical (e.g., how to parse large data logs that
are not in a simple format) to more scientific ones
(e.g., how the standard deviation is defined). A
very large community of users and a very detailed
free online manual are available.
Create scripts and wrappers to perform all kinds
of data analysis using built-in functions (e.g.,
mean, sum, etc.) or more advanced specialized
functions (e.g., time-series analysis, clustering,
and other functions). The functions can be found
in easy to download and install libraries (referred
to as “packages”).
Expand your know-how in statistics through the
practical application of statistical knowledge on
real-world data (either your own or from the large
variety of datasets that come with R).
Get acquainted with the modern tools of data
analysis through R’s various libraries.
Share your experiences with other users and help
them deal with problems you have overcome (it’s
hard to overestimate the importance of the
community aspect and how useful it can be).
Develop your own toolboxes/libraries and share
them with the rest of the world, making yourself
more well-known and your programs useful to
others. This is a great way to get feedback on
your code and gain priceless experience in the
field.
Learn how to work with version control (this
feature is not available in the plain console-based
R, but it is integrated into the RStudio IDE).
Apart from all these, you can also read tutorials and books on
R, obtaining a better understanding of its capabilities and
learning how to make the most of it in your everyday work as a
data scientist.
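To give a taste of how quickly these basic tasks can be done, here is a small sketch; the file name data.csv is a placeholder for whatever dataset you happen to have at hand, and ggplot2 is just one example of an extra package.

# Load data (the file name here is hypothetical)
df <- read.csv("data.csv", stringsAsFactors = FALSE)

str(df)              # structure of the dataset: variable names and types
summary(df)          # descriptive statistics for every column

hist(df[[1]])        # quick plot of the first column (assuming it is numeric)

# Install and load an extra package, then store the data efficiently
install.packages("ggplot2")    # run once
library(ggplot2)
saveRDS(df, "data.rds")        # compact, fast-to-reload R format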
R has a series of great machine learning libraries that you can
employ in your data analyses, saving you the trouble of having
to code everything from scratch. The most important of these
libraries, which are usually referred to as packages, are the
following (as of the time of this writing):
Artificial Neural Networks packages:
nnet: simple ANNs. Comes with the
base R program
RSNNS: good package providing a UI
for the Stuttgart Neural Network
Simulator, a great tool for learning
about the function of ANNs; used
mainly by researchers.
Recursive Partitioning packages:
rpart: ideal for “CART” type decision
trees. Comes with the base R
program
tree: another package for various
decision trees
RWeka: an interface for the well-
known WEKA toolbox, which contains
a large variety of machine learning
programs
Cubist: good package for rule-based
models
C50: package for C5.0 type decision
trees for classification applications
party: good package for recursive
partitioning algorithms (contains two
of them)
LogicReg: package for logic
regression applications
maptree: good package for visualizing
decision trees
Random Forests (groups of collaborating decision
trees) packages:
randomForest: the standard random forest
algorithm for classification and regression
ipred: complete package of random
forest programs, including
classification applications, ensembles
and several others
varSelRF and Boruta: two distinct
packages focusing on the use of
random forests for variable (feature)
selection applications
bigrf: random forests for large
datasets using parallel computing.
Regularized and Shrinkage Methods packages:
lasso2 and lars: packages for
regression models with some
constraints
penalized: package for penalized
regression models employing a
different implementation of lasso and
ridge algorithms
ahaz: package providing semi-
parametric models using lasso
penalties
earth: package having programs
using multivariate adaptive regression
splines
Boosting packages:
gbm: package containing a variety of
gradient boosting methods
GAMBoost: package specializing in
boosting methods for generalized
additive models in particular
mboost: package containing a rich
boosting framework for generalized
linear and additive as well as
nonparametric models
Support Vector Machines (SVM) and Kernel
Methods packages:
e1071: a package containing the
svm() function, which provides an
interface to the LIBSVM library
kernlab: package implementing a
flexible framework for kernel-based
learning (including, but not limited to,
SVMs)
rdetools: package providing tools for
estimating the relevant dimension in
kernel feature spaces
Bayesian Methods packages:
BayesTree: package implementing
several methods for combining weak
learners based on Bayesian Additive
Regression Trees (BART)
tgp: a good package containing a
variety of processes for regression
and classification based on various
models such as Bayesian CART.
Optimization using Genetic Algorithms packages:
rgp and rgenoud: packages
containing optimization programs
based on genetic algorithms
Rmalschains: package implementing
memetic algorithms with local search
chains.
Association Rules packages:
arules: excellent package providing
both data structures for efficient
handling of sparse binary data as well
as interfaces to implementations of
the Apriori and Eclat algorithms for
the creation of association rules
based on an item-set
Fuzzy Rule-based Systems packages:
frbs: package containing various
implementations of fuzzy logic for
regression and classification
applications
Model selection and validation packages:
e1071: apart from SVMs, this
package has a couple of functions
that can be used for the estimation of
the error rate in a model
svmpath: package providing a way to
optimize an SVM for performance,
finding the best fit for the cost
parameter C
ROCR: good package for ROC
analysis as well as other visualization
methods
caret: package containing a variety of
functions for predictive models
including parameter tuning and
measures for variable importance
Naturally, most of the above libraries may seem a bit abstract to
you, especially if you have limited experience in machine
learning. However, as you learn more about this fascinating
field, they will begin to make more sense to you and may even
become your favorite part of R. It is good to keep this list handy
as it will be a very useful reference in your work as a data
scientist. In addition, as the R community is quite large, it would
be very useful to update this list every now and then as new
machine learning packages and updated versions of the
existing ones come out over time.
Note that apart from the aforementioned packages, R has a
variety of other great libraries for a large variety of data analysis
applications (e.g., on time series data). It cannot be stressed
enough that all these libraries are open source and are,
therefore, completely free, just like the R platform itself. The
Matlab equivalent libraries (which are also fewer in number)
come at a hefty price of $1000 each. Although the Matlab
libraries are great, most of the R libraries are equally good and
they are expanding constantly.
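To give a sense of how little code it takes to put one of these packages to work, here is a hedged sketch using the svm() function from the e1071 package listed above; it assumes the package has been installed (e.g. with install.packages("e1071")) and uses the classic iris dataset purely for illustration.

library(e1071)

data(iris)
set.seed(1)
train_idx <- sample(nrow(iris), 100)          # simple train/test split

model <- svm(Species ~ ., data = iris[train_idx, ])
pred  <- predict(model, iris[-train_idx, ])

# Confusion matrix: how well the SVM classifies the held-out flowers
table(predicted = pred, actual = iris$Species[-train_idx])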

10.6 Resources for Machine Learning and R
If you want to learn more about the field of machine learning
and the R platform, there is a variety of resources you can
use. Good places to start are:
Machine Learning related:
Machine Learning online course by
Prof. Andrew Ng
(https://www.coursera.org/course/ml)
Machine Learning online course by
Prof. Pedro Domingos
(https://www.coursera.org/course/mac
hlearning)
Practical Machine Learning course by
Dr. Jeff Leek
(https://www.coursera.org/course/pred
machlearn)
Pattern Classification (2nd edition) by
Richard Duda et al (probably the best
reference book ever written on this
subject)
Machine Learning in Action by Peter
Harrington (book focuses on Python
for the implementation of the
algorithms described)
“Machine Learning in R” presentation
(http://www.slideshare.net/kerveros99/
machine-learning-in-r)
Machine Learning Connection online
group (linkedin.com)
Neural Networks for Machine
Learning course by Prof. Geoffrey
Hinton
(https://www.coursera.org/course/neur
alnets).
R related:
cran.R-project.org site (best resource
for finding an organized list of R
libraries as well as the R platform
itself)
journal.r-project.org site (great
resource for the latest significant
developments in the R platform by
users around the world)
R in a Nutshell by Joseph Adler (a
great reference book on R, but not
suitable for learning R as it is targeted
at people already familiar with the
platform)
Introduction to Machine Learning with
R: Data Science Step-by-Step by
Daniel Gutierrez
Getting Started with RStudio by John
Verzani (a great resource for learning
all about the RStudio IDE)
Computing for Data Analysis course
by Prof. Roger Peng
(https://www.coursera.org/course/com
pdata)
Statistics One course by Prof. Andrew
Conway
(https://www.coursera.org/course/stat
s1)
Data Analysis course by Prof. Jeff
Leek
(https://www.coursera.org/course/data
analysis)
Statistics: Making Sense of Data
course by Prof. Alison Gibbs and Prof.
Jeffrey Rosenthal
(https://www.coursera.org/course/intro
stats)
The R Project for Statistical
Computing online group
(linkedin.com)
R Programming online group
(linkedin.com).
Using the above resources as a basis, in addition to all the
other material in this chapter, you can get acquainted with the
topic of machine learning and the R platform. Using the
understanding you will gain, you can expand this list as you see
fit. Just make sure that you are systematic in studying the
material in these resources as it can be quite overwhelming
otherwise!
10.7 Key Points
Machine learning is an intriguing field that
constitutes a core aspect of data science.
Although machine learning has become popular in
the past few years, it’s been around since the 1950s
and has grown to include a variety of systems for
data analysis such as decision trees, artificial neural
networks (ANNs), random forests, clustering
algorithms and lately deep learning (deep belief
networks).
Machine learning is one of the most promising fields
for both research and application areas.
Both machine learning and statistics have their
benefits. For simpler applications statistics is fine,
but for more complicated cases, machine learning is
preferable.
There are several uses of machine learning in data
science, especially in the data analysis stage.
R is a great data analysis platform, having several
libraries for machine learning (among other fields).
With a solid understanding of its libraries, it can be a
very useful tool for all types of analyses as well as
some visualizations.
There is an abundance of both online and offline
resources for machine learning and for R. Using a
combination of online courses, books and
professional groups can make learning much more
efficient and enjoyable.
The subfield of computer science dealing with intelligent algorithms that
enable machines, particularly computers, to learn new things that go
beyond what was initially programmed into them.
See article for details:
http://www.nytimes.com/2012/11/24/science/scientists-see-advances-
in-deep-learning-a-part-of-artificial-intelligence.html?_r=0
Although not absolutely necessary, you may want to join a machine learning
group just to keep in touch with the recent updates on the field. A
couple of good ones are Machine Learning Connection and Pattern
Recognition, Data Mining, Machine Intelligence and Learning, both of
which are on LinkedIn.
Chapter 11
The Data Science Process
In this chapter, we’ll see how the different aspects of the data
scientist fit together organically to form a certain process that
defines his work. We will see how the data scientist makes use
of his qualities and skills to formulate hypotheses, discover
noteworthy information, create what is known as a data product
and provide insight and visualizations of the useful things he
finds, all through a data-driven approach that allows the data to
tell its story. The whole process is quite complicated and often
unpredictable, but the different stages are clearly defined and
are straightforward to comprehend. You can think of it as the
process of finding, preparing and eventually showcasing (and
selling) a diamond, starting from the diamond mine all the way
to the jewelry store. Certainly not a trivial endeavor, but one
that’s definitely worth learning about, especially if you value the
end result (and know people who can appreciate it). Let us now
look into the details of the process, which includes data
preparation, data exploration, data representation, data
discovery, learning from data, creating a data product and,
finally, insight, deliverance and visualization (see Fig 11.1).
Fig. 11.1 Different stages of the data science process.
Note that understanding this process and being able to apply it
is a fundamental part of becoming a good data scientist.
11.1 Data Preparation
Data preparation is probably the most time-consuming and
uninteresting part of the data science process, partly because it
involves minimal creativity and little skill. However, it is a very
important step, and if it doesn’t receive the attention it needs,
you will have problems in the steps that follow. In general, data
preparation involves reading the data and cleansing it. This is
the first step in turning the available data into a dataset, i.e. a
group of data points, usually normalized, that can be used with
a data analysis model or a machine learning system (often
without any additional preprocessing). There are datasets for a
variety of data analysis applications available in data
repositories that are often used for benchmarking machine
learning algorithms, the most well-known of which is the UCI
machine learning repository, which we’ve already noted as a
good place to find practice problems.
Reading the data is relatively straightforward. However, when
you are dealing with big data, you often need to employ an
HDFS to store the data for further analysis and the data needs
to be read using a MapReduce system (if you use Hadoop or a
similar big data ecosystem). The latter will help in both entering
the data into an HDFS cluster and also in employing a number
of machines in that cluster to cut down the required time
significantly. Fortunately, there are premade programs for doing
all this, so you will not need to write a lot of code for reading
your data. However, you may need to supply it in JSON or
some other similar format. Also, if your data is in a
completely custom form, you may need to write your own
program(s) for accessing and restructuring it into a format that
can be understood by the mappers and the reducers of your
cluster. This is a very important step, especially if you want to
save resources by implementing the optimum data types for the
various variables involved. More on that sub-process in the
data representation subchapter later on.
When reading a very large amount of data, it is wise to first do a
sample run on a relatively small subset of your data to ensure
that the resulting dataset will be useable and useful for the
analysis you plan to perform. Some preliminary visualization of
the resulting sample dataset would also be quite useful as this
will ensure that the dataset is structured correctly for the
different analyses you will do in the later stages of the process.
Cleansing the data is a very time-consuming part of data
preparation and requires a level of understanding of the data.
This step involves filling in missing values, often removing
those records that may contain corrupt or problematic data and
normalizing28 the data in a way that makes sense for the
analysis that ensues. To comprehend this point better, let us
examine the rationale behind normalization and how
distributions (mathematical models of the frequency of the
values of a variable) come into play.
When we think of data, particularly relatively large amounts of
data, we often use distributions to map them in our minds.
Although the most commonly used distribution is the normal
distribution (N), there are several others that often come into
play, such as the uniform distribution (U), the Student’s t distribution
(T), the Poisson distribution (P) and the binomial distribution (B),
among several others (see Fig. 11.2 for examples; distribution
B was omitted as it is the discrete counterpart of distribution N). Of
course, a group of data points may not follow any one of these
distributions, but in order to make use of the various statistical
tools that have been developed over the years, we often make
use of one of these distributions as a template for the data we
have. Normalization enables us to see how the data we have
fits these distributions and whether a data point is an outlier or
not (i.e., whether it has an extreme value for a given
distribution). Note that normalization applies only to numeric
data, particularly continuous variables.
Fig. 11.2 Histogram examples of different distributions.
From top to bottom: Normal, Uniform, Student and
Poisson.
Cleansing the data also involves dealing with many of these
outliers (in rare cases, all of them). This means that they may
need to be removed or the model may need to be changed to
accommodate their existence. This is a call you need to make
as there is no fool-proof strategy for this kind of situation;
sometimes outliers are useful to include in your analysis. What
you decide depends on factors such as the number of extreme
data points and the types of variables that make up your data.
Also, whether or not you remove any outliers depends on how
sensitive your model is to their existence. Normally, when
dealing with big data, outliers shouldn’t be an issue, but it
depends on their values; extremely large or small values may
affect the basic statistics of the dataset, especially if there are
many outliers in it.
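As a small, hedged illustration of these cleansing decisions, the following base-R sketch fills in a missing value with the median and flags extreme values with the common 1.5 x IQR (boxplot) rule; both are just popular choices, not a universal recipe, and the toy numbers are invented.

# A toy numeric variable with a missing value and an extreme value
x <- c(4.1, 3.9, NA, 4.3, 4.0, 4.2, 3.8, 4.4, 25.0)

# Fill in the missing value with the median of the observed values
x[is.na(x)] <- median(x, na.rm = TRUE)

# Flag potential outliers with the usual boxplot (1.5 * IQR) rule
out <- boxplot.stats(x)$out
out                                   # here: 25 -- decide whether to keep it
x_clean <- x[!x %in% out]             # one option: drop the flagged values

Whether a flagged value is then dropped or kept is exactly the judgment call described above.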
Normalizing your data will sometimes change the shape of its
distribution, so it makes sense to try out a few normalizing
approaches before deciding on one. The approaches that are
most popular are:
Subtracting the mean and dividing by the standard
deviation, (x – μ) / σ. This is particularly useful for
data that follows a normal distribution; it usually
yields values between -3 and 3, approximately.
Subtracting the mean and dividing by the range, (x –
μ) / (max-min). This approach is a bit more generic;
it usually yields values between -0.5 and 0.5,
approximately.
Subtracting the minimum and dividing by the range,
(x-min) / (max-min). This approach is very generic
and always yields values between 0 and 1,
inclusive.
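In R, all three approaches are one-liners; here is a minimal sketch applied to an arbitrary numeric vector (the values are made up for illustration).

x <- c(12, 15, 9, 22, 18, 30)                    # an arbitrary numeric variable

z_score    <- (x - mean(x)) / sd(x)              # roughly -3 to 3 for normal data
mean_range <- (x - mean(x)) / (max(x) - min(x))  # roughly -0.5 to 0.5
min_max    <- (x - min(x)) / (max(x) - min(x))   # always between 0 and 1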

When dealing with text data, which is often the case if you need
to analyze logs or social media posts, a different type of
cleansing is required. This involves one or more of the
following:
removing certain characters (e.g., special characters
such as @, *, and punctuation marks)
making all words either uppercase or lowercase
removing certain words that convey little information
(e.g., “a”, “the”, etc.)
removing extra or unnecessary spaces and line
breaks
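A hedged sketch of this kind of cleansing using base R string functions follows; the example post and the tiny stop-word list are invented, and packages such as tm offer a more complete toolkit.

post <- "Check out   @DataSci101 -- the BEST blog on #datascience!!!"

clean <- tolower(post)                 # make all words lowercase
clean <- gsub("[^a-z ]", " ", clean)   # drop special characters and punctuation
clean <- gsub("\\s+", " ", clean)      # collapse extra spaces and line breaks
clean <- trimws(clean)

# Remove a few low-information ("stop") words
stopwords <- c("the", "on", "a", "an", "and", "out")
words <- strsplit(clean, " ")[[1]]
paste(words[!words %in% stopwords], collapse = " ")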

All these data preparation steps (and other methods that may
be relevant to your industry) will help you turn the data into a
dataset. Having done that, you are ready to continue to the next
stages of the data science process. Make sure you keep a
record of what you have done though, in case you need to redo
these steps or describe them in a report.
11.2 Data Exploration
After the data has been cleaned and shaped into a useful and
manageable form, it is ready to be processed. First, some
exploration of it is performed to figure out the potential
information that could be hiding within it. There is a common
misconception that the more data one has, the better the
results of the analysis will be. A data scientist, however, does
not embrace this belief. It is very easy to fall victim to the
illusion that a large dataset is all you need, but more often than
not such a dataset will contain noise and several irrelevant
attributes. All of these wrinkles will need to be ironed out in the
stages that follow, starting with data exploration.
According to Techopedia (www.techopedia.com), “data
exploration is an informative search used by data consumers to
form true analysis from the information gathered.” It involves
carefully selecting which data to use for a particular data
analysis task from the data warehouse in which the dataset is
stored. It can be thought of as the manual approach to
searching through the data in order to find relevant or
necessary information. The alternative to this, employing an
automatic approach to the same objective, is data mining. Note
that data exploration and data mining can also be performed on
an unstructured pool of data. Quite often, data exploration is
done in parallel to data preparation. It is an essential
prerequisite to any analysis that ensues.
The big advantage of human-based data exploration over data
mining is that it makes use of a human being’s intuition.
Sophisticated data mining algorithms may work well with some
datasets and are very efficient, but they may not always
pinpoint certain key aspects of the dataset that an individual
(particularly one who is familiar with the data domain) may spot
through a data exploration approach. Ideally, a data scientist
will do both types but will rely primarily on the data exploration
approach.
11.3 Data Representation
Data representation is the step of the data science process that
comes right after data exploration. According to the McGraw-
Hill Dictionary of Scientific & Technical Terms, it is “the manner
in which data is expressed symbolically by binary digits in a
computer.” This basically involves assigning specific data
structures to the variables involved and serves a dual purpose:
completing the transformation of the original (raw) data into a
dataset and optimizing the memory and storage usage for the
stages to follow. For example, if you have a variable that takes
the values 1, 2, 3, etc., then it would be more meaningful to
allocate the corresponding data into an integer data structure
rather than a double or a float one. Also, if you have variables
that take only two values (true or false), it makes sense to
represent them with a logical data structure. Note that
regardless of how you represent your data, you can always
switch from one data type to another if necessary (e.g., in the
case where you want to merge a Boolean variable with a few
numeric ones for a regression model).
All this may seem very abstract to someone who has never
dealt with data before, but it becomes very clear once you start
working with R or any other statistical analysis package.
Speaking of R, the data structure of a dataset in that
programming platform is referred to as a data frame, and it is
the most complete structure you can find as it includes useful
information about the data (e.g. names, modality, etc.).
However, certain Python libraries also employ this kind of
structure. R also allows for names for the variables that make
up the dataset in an easily accessible field, making the whole
data analysis process that follows much more user-friendly and
straightforward, even for beginners.
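Here is a small illustrative sketch of data representation in R, assigning an integer, a logical and a factor type to the columns of a data frame; the variables are invented for illustration.

df <- data.frame(
  visits    = as.integer(c(1, 2, 3, 5)),                  # counts: integer, not double
  is_member = c(TRUE, FALSE, TRUE, TRUE),                 # two-valued: logical
  plan      = factor(c("basic", "pro", "basic", "pro"))   # categorical: factor
)

str(df)            # confirm the data structure of each variable

# Switching representation later is always possible, e.g. logical to numeric
df$is_member_num <- as.numeric(df$is_member)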

11.4 Data Discovery
Data discovery is the core of the data science process.
Although there is no definition on which everyone agrees, data
discovery can be seen as the process that involves finding
patterns in a dataset through hypothesis formulation and
testing. It makes use of several statistical methods to prove the
significance of the relationships that the data scientist observes
among the variables of the dataset or among different clusters
of data points (i.e., how unlikely each is to happen by chance).
In essence, it is the only known cure to the problem that
plagues big data: too many relationships! Data discovery
enables you to filter out less robust relationships based on
statistics and also throw away the less meaningful relationships
based on your judgment.
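As a simple, hedged illustration of this filtering step, the sketch below tests whether a relationship observed in the built-in mtcars dataset is statistically significant; a small p-value only indicates that the relationship is unlikely to be due to chance, while judging whether it is meaningful remains your call.

data(mtcars)

# Hypothesis: heavier cars tend to have lower fuel efficiency
test <- cor.test(mtcars$wt, mtcars$mpg)
test$estimate     # correlation coefficient (strength of the relationship)
test$p.value      # probability of seeing this by chance if no relationship existed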
Unfortunately, there is no fool-proof methodology for data
discovery although there are several tools that can be used to
make the whole process more manageable. How effective you
are regarding this stage of the data science process will
depend on your experience, your intuition and how much time
you spend on it. Good knowledge of the various data analysis
tools (especially machine learning techniques) can prove very
useful here.
Data discovery can be quite overwhelming, especially when
dealing with complex datasets, so it makes sense to apply
some data visualization, which we will examine in more depth in
a separate subchapter later on. Good knowledge of statistics
will definitely help as it will enable you to focus most of your
energy on the intuitive aspect of the process. In addition,
experience with scientific research in data analysis will also
prove to be priceless in this stage.
11.5 Learning from Data
Learning from data is a crucial stage in the data science
process and involves a lot of intelligent (and often creative)
analysis of a dataset using statistical methods and machine
learning systems. In general, there are two types of learning:
supervised and unsupervised. The former involves any system
or algorithm that helps a computer learn how to distinguish and
predict new data points based on a training set, which it uses to
understand the data and draw generalizations from. The latter
has to do with enabling the computer to learn on its own what
the data structure can reveal about the data itself; this is done
through the use of certain evaluation metrics that help the
computer know when it has found a worthy result. The results
of this type of learning are often used afterwards in supervised
learning.
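As a rough sketch of the two types of learning, using R's built-in iris dataset:

    data(iris)

    # Unsupervised: k-means finds structure on its own (the labels are not used)
    clusters <- kmeans(iris[, 1:4], centers = 3)
    table(clusters$cluster, iris$Species)   # compare the found groups to the truth

    # Supervised: learn to predict a label from a training set, then
    # generalize to new data points
    set.seed(1)
    train_idx <- sample(nrow(iris), 100)
    model <- glm(I(Species == "versicolor") ~ Sepal.Length + Sepal.Width,
                 data = iris[train_idx, ], family = binomial)
    preds <- predict(model, iris[-train_idx, ], type = "response") > 0.5
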
It may seem that combining unsupervised and supervised learning guarantees a more or less automated way of learning from
data. However, without feedback from the user/programmer,
this process is unlikely to yield any good results for the majority
of cases. (This feedback can take the form of validation or
corrections/tweaks that provide more meaningful results.) It is
possible, though, for the process to be quite autonomous for a
specific type of problem that has been analyzed extensively or
in the case of deep learning networks. It is good to remember
that all of these are quite robust tools, but it is the user of these
tools (the data scientist) that makes them useful and able to
yield worthy results consistently. For example, artificial neural
networks (ANNs), a very popular artificial intelligence tool that
emulates the way the human brain works, are a great tool for
supervised learning, but if you don’t know how to use them
properly, they are bound to yield poor results and/or require
extensive amounts of time to learn from a dataset.
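If you want to see what an ANN looks like in practice, here is a minimal sketch assuming the nnet package is installed (the dataset and parameter values are purely illustrative); the size and decay settings are exactly the kind of knobs that, if tuned poorly, lead to the weak results mentioned above:

    library(nnet)                      # a small feed-forward ANN implementation
    data(iris)
    set.seed(1)
    train_idx <- sample(nrow(iris), 100)
    net <- nnet(Species ~ ., data = iris[train_idx, ],
                size = 4, decay = 0.01, maxit = 200, trace = FALSE)
    # Hold-out accuracy on the remaining data points
    mean(predict(net, iris[-train_idx, ], type = "class") ==
         iris[-train_idx, "Species"])
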
Another point to keep in mind is that some data learning tools
are better than others for specific problems even if all of them
could work on a dataset prepared in the previous stages of the
data science process. Knowing which tool to use can make the
difference between a successful data analysis and a poor one;
something to keep in mind at all times.

11.6 Creating a Data Product


All of the aforementioned parts of the data science process are
precursors to developing something concrete that can be used
as a product of sorts. This part of the process is referred to as
creating a data product and was defined by influential data
scientist Hilary Mason as “a product that is based on the
combination of data and algorithms.”29
Let’s elaborate on this simple but dense definition. A data
product is something a company can sell to its clients and
potentially make a lot of money from because it is something
that can be supplied in practically unlimited quantities for a
limited cost. So it can be quite valuable. Why? Because it provides useful information to the organizations that buy it. How? It takes the data that these organizations have (and value) and applies an intelligent data processing method (the algorithms that Ms. Mason mentions) to extract this information from it. That, after all, is what data science does and why it is
useful in today’s information-driven world. So a data product is
not some kind of luxury that marketing people try to force us to
buy. It is something the user cares about and could find good
use for, something that has a lot of (intelligent) work behind it,
something that is tailored for every particular user (information
consumer). So it is something worth building and something
worth paying for.
Typical examples of data products are all the network statistics
and graphs that LinkedIn provides to its members (particularly
its premium and golden members); the results pages of a good
search engine (i.e., one that provides relevant results based on
your query and adds useful metadata to them such as how
popular and how reliable the webpages are); a good
geographic information system, such as MapQuest, that
provides useful geographic information overlaid on the map of
the place(s) you are querying, etc. Note that many data product
providers don’t ask you for money in exchange for what they
offer. Also note that the vast majority of them are online.
A data product is not something new though. Data products
have been around since the dawn of data technologies. Even
printed newspapers can be seen as a data product of sorts.
However, today’s data products are a bit different. They make
use of big data in one way or another and do so in an extremely
fast manner. This is accomplished through the use of efficient
algorithms and parallel computing; in other words, data
science.
Not all of the various outputs that the learning algorithms yield,
based on the processed data you feed them, turn into data
products. A data scientist picks the most useful and most
relevant of his results and packages them into something that
the end user can understand and find useful. For example,
through rigorous data analysis of the various features of the car
industry, you may discover several useful facts about modern
cars and their road behavior. The average user may not be
interested in how many cylinders there are in an SUV,
especially if he lives outside the US. If, however, you tell that user that the average fuel efficiency of his car is X, that it fluctuates in such and such ways over the course of the day, and that by avoiding certain routes he can save
about X gallons per week, and that this translates into X dollars
saved, based on the current fuel prices, then he may want to
listen to you and pay a premium for this information. So a data
product is similar to having a data expert in your pocket who
can afford to give you useful information at very low rates due
to the economies of scale employed.
To create a data product, you need to understand the end users
and become familiar with their expectations. You also need to
exercise good judgment on the algorithms you will use and
(particularly) on the form that the results will take. You need to
strip any technical jargon that they may not comprehend from
the resulting information and make the product as
straightforward as possible. Imagination is key here.
Graphs, particularly interactive ones, are a very useful form in
which to present information if you want to promote it as a data
product (more on that in the subchapter that follows). In
addition, clever and simple-to-use applications can be another
form of a data product. And these are just the beginning. Who
knows what form data products will take in the years to come?
As long as you have something worthwhile and interesting to
say based on the data, when it comes to ways of promoting this
to a user, the sky is the limit.

11.7 Insight, Deliverance and Visualization


Other aspects of the data science process that make the results
more comprehensible to everyone and more applicable to the
company include insight, deliverance and visualization. Apart
from the creation of data products, described in the previous
subchapter, data science involves research into the data, the
goal of which is to determine and understand more of what’s
happening below the surface and how the data products
perform in terms of usefulness to the end users, maintainability,
etc. This often leads to new iterations of data discovery, data
learning, etc., making data science an ongoing, evolving
activity, oftentimes employing the agile framework frequently
used in software development today.
In this final stage of the data science process, the data scientist
delivers the data product he has created and observes how it is
received. The user’s feedback is crucial as it will provide the
information he needs to refine the analysis, upgrade it and even
redo it from scratch if necessary. Also in this stage, the data
scientist may get ideas on how he can generate similar data
products (or completely new ones) based on the users’ newest
requirements. This is a very creative part of the data science
process and one that provides very useful experience since the
learnings from these activities are what distinguish an
experienced professional from the data science amateur (who
may be equally skilled and be a very promising professional,
nevertheless, but lack the intuition of the experienced
professional). So pay close attention to this stage.
Visualization involves the graphical representation of data so
that interesting and meaningful information can be obtained by
the viewer (who is oftentimes the end user). Visualization is
usually part of data product creation, but it also comes about
afterwards. It is a way of summarizing the essence of the
analyzed data graphically in a way that is intuitive, intelligent
and oftentimes interactive. It is intuitive because it keeps the
terminology and the overwhelming quantity of numbers at bay,
allowing the data scientist to think differently about the data by
employing a more holistic perspective. This is similar to data
exploration, in a way, but also quite different. In the data exploration stage, you don’t know what you are going to find, so your search is broader. In the final stage
of the data science process, you have a deeper understanding
of what’s going on and know what is important in the data. For
example, when exploring your data, you may find that for
describing the health of oranges the features weight and
softness are good, while length of stem is an irrelevant feature.
When in the visualization stage of your analysis, you may be
able to delineate the exact boundaries of the normal class of
healthy oranges and pinpoint the structure of this class, as well
as the characteristics of the unhealthy oranges, based on each
one of these features and their combinations.
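A minimal R sketch of such a plot, using hypothetical oranges data made up for this example, might look like this:

    set.seed(7)
    healthy   <- data.frame(weight = rnorm(200, 160, 10),
                            softness = rnorm(200, 5, 0.5), class = "healthy")
    unhealthy <- data.frame(weight = rnorm(200, 130, 20),
                            softness = rnorm(200, 7, 1.2), class = "unhealthy")
    oranges <- rbind(healthy, unhealthy)

    # A simple scatterplot already delineates the boundary between the classes
    plot(oranges$weight, oranges$softness,
         col = ifelse(oranges$class == "healthy", "darkgreen", "red"),
         xlab = "Weight (g)", ylab = "Softness score",
         main = "Healthy vs. unhealthy oranges")
    legend("topright", legend = c("healthy", "unhealthy"),
           col = c("darkgreen", "red"), pch = 1)
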
Through visualization, you become more aware of what you
don’t know (there are more known unknowns than unknown
unknowns in the puzzle) and are therefore able to handle the
uncertainty of the data much better. This means that you are
more aware of the limitations of your models as well as the
value of the data at hand. Still, you have accomplished
something and can show it to everyone through some
appealing graphs. You may even be able to make these graphs
interactive, providing more information about what is going on.
In essence, visualization makes your models come to life and
tell their own story, which may contain more information than
what is reflected in the numbers you got from them and used in
the data products. This is akin to what traditional (core)
research is all about.
This stage of the data science process is also quite enjoyable
as there is room for improvisation and making use of some
attractive software designed specifically for this purpose. Still, it
is not the graphs you generate that make this process
interesting and useful; it is what you do with them. A good data
scientist knows that there may still be something to discover, so
these graphs can bring about insight, which is the most
valuable part of the data science process. This translates into
deeper understanding and usually to some new hypotheses
about the data. You see the dataset anew, armed not only with
the understanding that you have gained, but with an open mind
as well. You question things again, even if you have a good
idea of what’s happening, in order to get to the bottom of them
based on the intuition that there is something more out there.
This insight urges you to start over without throwing away
anything you’ve done so far. It brings about the improvements
you see in data products all over the world, the clever upgrades
of certain data applications and, most importantly, the various
innovations in the big data world. So this final stage of the data
science process is not the end but rather the last part of a cycle
that starts again and again, spiraling to greater heights of
understanding, usefulness and evolution. You can take
advantage of this opportunity to write about what you have
done and what you are planning to do, then share it with your
supervisors to show them that you have done something
worthwhile with the resources that you have been given and
that you are eager to continue making good use of your time.
Sounds like a win-win situation, doesn’t it?

11.8 Key Points


The main parts of the process can be summarized
in the following seven steps:

1. Data Preparation. Data preparation involves getting the data ready for analysis through cleansing and normalization of the numeric variables. It also involves recognizing the format of the data and accessing it accordingly.
2. Data Exploration. “Data exploration is an
informative search used by data
consumers to form true analysis from the
information gathered”
(www.Techopedia.com). This stage has to
do with searching through the data for
meaningful patterns, pinpointing useful
aspects of it (features), creating
provisional plots of it and generally
obtaining an understanding of what’s
going on and what information may be
lurking within the dataset.
3. Data Representation. This is “the manner
in which data is expressed symbolically by
binary digits in a computer” (McGraw-Hill
Dictionary of Scientific & Technical Terms).
It is related to assigning specific data
structures to the variables involved,
effectively transforming the (raw) data into
a proper dataset. Also essential for good
memory resource management.
4. Data Discovery. This has to do with
discovering patterns in the available
dataset through hypothesis formulation
and testing. It entails a lot of statistics, as
well as common sense, to make sense of
the data and find useful aspects of it to
work with.
5. Data Learning. This is related to an
intelligent analysis of the discovered
patterns through the (often creative) use
of statistics and machine learning
algorithms. It aims to make something
practical and useful out of these patterns
and forms the basis of the data product(s)
to follow.
6. Creating a Data Product. This involves the
most important aspect of the process:
creating something useful out of the data
and sharing it with other people in the
form of a product. A data product is
defined as “a product that is based on the
combination of data and algorithms”
(Hilary Mason).
7. Insight, Deliverance and Visualization.
This has to do with delivering the data
product to the end user (as well as
receiving feedback and fine-tuning or
making plans for upgrading the product),
visualizing the processed data to highlight
the findings and investigate other aspects
of the dataset and deriving insight from all
these in order to start a new cycle of data
science in this or another dataset.

For this and any other statistical terminology, please refer to the glossary at
the end of the book.
Hilary Mason, “How to Know When You Need a Data Scientist”, LinkedIn
article, January 2013.
Chapter 12
Specific Skills Required

Migrating to a data scientist role takes more than having the right mindset and knowing what the different data science tools
are. It takes a certain kind of skill-set which needs to be both
present in your mind and well-presented in your resume. In this
chapter we will look into the specifics of that skill-set and how
you can develop it coming from four major backgrounds:
programming, statistics or machine learning, data modeling,
and studentship. Furthermore, you will get acquainted with the
field from a practical perspective, seeing how the data
scientist’s skills relate to what you are already doing and
making the whole transition not only an educational but also an
enjoyable process.

12.1 The Data Scientist’s Skill-Set in the Job Market


As a data scientist candidate you are expected to possess a
variety of technical skills, briefly described in Chapter 5. These
are the so-called hard skills on the recruiter’s check list and
therefore play a major role in the job-hunting process. However,
it would be unwise to try to develop them in the order they are
listed as there are certain subtleties that need to be taken into
account. Since you are most likely already versed in some of
these skills, depending on your current professional status, it
would be best to gradually expand your current skill-set so that
it ends up including all of the following skills:

Data analysis skills
    Cleaning data
    Creating models
    Applying statistics on data
    Applying and developing machine learning algorithms
    Validating models
    Performing data visualization

Programming skills
    One or more of the specialized data analysis platforms (R / Matlab / SPSS / SAS)
    One or more OOP languages (Python, C++, Java, C#, Perl, etc.)
    Other programming skills relevant to the industry (e.g., familiarity with HTML/CSS in the case of the Web industry)

Data management skills (particularly for big data)
    Hadoop (particularly Hive/HBase, HDFS and MapReduce)
    SQL
    NoSQL
    Other data management skills relevant to the company

Business skills
    Familiarity with the Waterfall or the Agile frameworks
    Understanding of how a company operates
    Knowledge of the industry sector
    Other business skills relevant to the company and industry

Communication skills (technically a soft skill)
    Delivering engaging presentations / storytelling
    Report-writing
    Listening skills
    Being able to translate customer requirements into specific action items
    Other communication skills relevant to the company

It is tempting to think that one group of skills is more important than everything else and choose to focus on just that. However,
as a data scientist you’ll need all of them even if you don’t have
them in balance. In fact, it is quite unlikely that you’ll have them
all in balance from the very beginning unless you start from a
clean slate and gradually develop them through a university
degree or a meticulously planned set of courses and books.

12.2 Expanding Your Current Skill-Set as a Programmer / SW Developer


Being a programmer or a software developer is actually
relatively close to being a data scientist. However, depending
on how experienced you are with data analysis and big data
technologies, you may need to pick up a few skills to ensure
that you are marketable as a data science professional. The
exact skills and knowledge you’ll need depend on your
particular profession as well as on your experience.
In this section, we will look into the skills you need to develop,
the knowledge you need to acquire and how you can do all
that, depending on where you come from. In particular, we’ll
examine it from the perspective of you being an OO
programmer, a software prototype developer or something else
in that industry (e.g., a programming architect or a project
manager). Whatever the case, we’ll make sure that you
understand what it takes to migrate from your current
professional state into a promising and fulfilling career in the
data science field.

12.2.1 OO Programmer
If you are already in the object-oriented programming game,
you are familiar with data structures and how to implement an
algorithm efficiently in one or more OOP languages. You may
even be adept at conserving resources and optimizing your
code to meet a particular objective. So you have a decent head
start towards becoming a data scientist since, as we have
already seen, these are some of the essential skills you need to
be a player in the data science game.
Unless you already have experience with Matlab or R,
vectorization is something you need to learn, especially if you
plan to work with one of these data analysis tools. Vectorization
involves applying one operation to multiple pairs of operands at the same time (writing loop-free code) instead of processing one pair of operands at a time and looping around to the next pair. The fewer loops in your code,
the faster it will run on a Matlab or R platform as well as on any
other data analysis tools that employ vectorization. This is
because vectorized functions are built-in programs that are
optimized and implemented in C or some other low-level
language, enabling them to run super-fast. This is a great point
to remember, especially when you are dealing with large
datasets. A vectorized approach may be many times faster than
one using loops even if your lines of code are kept to a
minimum. If you learn R, you will naturally learn vectorization
because most tutorials don’t cover loops; if they do, they do so
briefly at a later stage of the tutorial. Also, R has a large variety
of built-in functions that save you the trouble of having to create
loops doing the same thing on your own. So it lends itself to
cleaner, faster vectorized scripts.
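The following sketch (with an arbitrary toy calculation) shows the contrast in R:

    x <- runif(1e6)

    # Loop-based approach: one pair of operands at a time
    total <- 0
    for (i in seq_along(x)) {
      total <- total + x[i]^2
    }

    # Vectorized approach: one call to built-in functions implemented in C,
    # typically many times faster on large vectors
    total_vec <- sum(x^2)

    all.equal(total, total_vec)   # same result, very different running times
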
But the purpose of this chapter is not to broadcast the merits of
the R language; R can do that for itself. The point is that an OO
programmer will be able to quickly assimilate the data analysis
software used in data science, whether this is R, Matlab or any
other software. The mental discipline that is required for
effective OO programming work can be applied to any other
software required. Even big data technologies, such as
Hadoop, are not going to be a challenge for you if you have this
quality. You will need to learn all these technologies, though,
and it may be somewhat time consuming. For this, you can use
the resources in Appendix 1 as well as all the other sources
mentioned in the first part of Chapter 9, Learning New Things
and Tackling Problems. How long it will take will depend on how
dedicated you are and how much time you can devote to it.
You will want to pay close attention to the data visualization
software as this is probably something that you are the least
familiar with in your current work. It shouldn’t pose much of a
challenge as all the programming required in such a piece of
software is minimal, if not non-existent. Just familiarize yourself
with one or more data visualization packages, and you will be
good to go.
You will need to study the data analysis literature and mine it for
know-how that you will need as a data scientist. You’ll
particularly need to study statistics, if you haven’t taken a
course on this subject already, and most importantly machine
learning. You may not have time to go very deep on either one
of them, but at least make sure you know enough to ace a
statistics or machine learning course.
Finally, you need to learn more about how the end-user thinks,
what he requires, how to interpret these requirements and how
to communicate effectively in a non-technical language.
Basically, hone the soft skills that can make you a software
developer or a systems engineer (though the latter requires
more than just this stuff). This is very important in cultivating the
data scientist mindset and performing this role, as we’ve seen
in Chapter 4.
Naturally, once you’ve learned all these things, you need to
practice. You can start with the Kaggle challenges or the
datasets available in the UCI machine learning repository. Just
be sure that you acquire some hands-on experience before
putting yourself out there as a data scientist for hire.
12.2.2 Software Developer
As a software developer, you are bound to be familiar with GUIs
and the importance of the (usually non-technical) user of your
work. This familiarity is invaluable. Being able to think as the
user thinks allows you to appreciate their point of view and
understand their concerns. Therefore, for a role in data science,
you will need to focus your attention on your other technical
skills.
As a developer, you must already be familiar with two or more
programming languages, most likely the .NET framework and
C# or possibly C++ and Java. That’s a great starting point. Just
like your OO programming colleague, you have all the
programming background to be a data scientist, so you should
expand this by incorporating knowledge of big data technology
and data analysis tools.
Your programming background and familiarity with the end-user
will allow you to focus your efforts on the gaps in your
knowledge. Similar to the OO programmer, you will need to
develop your knowledge of visualization software and statistics.
You will also need to go deeper on the machine learning know-
how as this is something many data scientists (all those not
belonging to the researcher category) often lack. If you don’t
know about clustering and pattern recognition, you need to gain
an understanding of them as well as deep learning and other
state-of-the-art machine learning techniques. Joining a relevant
group is one strategy for achieving that objective.
As in the case of the programmer, you will need some hands-on
experience with these new skills before being marketable as a
data scientist. The methods described in the previous
subchapter for acquiring this experience are applicable here,
too. For more details about how to get the initial experience,
you can review subchapter 6.3.

12.2.3 Other Programming-Related Career Tracks


Of course, you may be in the IT sector and not be an OO
programmer or a software developer. For example, you may be
a web designer, a web programmer, a QA analyst or a
database administrator (a case we’ll cover in a separate subchapter, as it has a special relationship with the data
science world). What can you do, then, in order to become a
data scientist?
In these cases you should expand your technical skills,
focusing on the data science skills you need to cultivate the
most. If you are a web programmer, you may want to work on
your data analysis know-how, while if you are a QA analyst, you
may want to refresh your programming skills first.
Whatever the case, you’ll need to brush up on your statistics,
read up on the machine learning literature, get up to date on
the latest developments in these fields and familiarize yourself
with all the relevant software (which we have already examined
thoroughly in Chapter 9), for starters.
Similar to software developers, you may have experience
dealing with users directly (e.g., if you are a web developer),
which can aid you in your communication skills and
requirements interpretation, giving you a chance to focus on the
development of the hard skills.
Again, it’s essential that you become acquainted with all the big
data technologies and get plenty of practice before marketing
yourself as a data scientist.

12.3 Expanding Your Current Skill-Set as a Statistician or Machine Learning Practitioner


The transition from a statistician or a machine learning/A.I.
practitioner to a data scientist is fairly smooth. That’s because
someone in this field already has a working knowledge of the
core of data science, i.e., data analysis and some object-oriented programming. Even if you are not familiar with all the
specialized theory and know-how, or if your programming skills
are very basic, you can easily pick up what you are missing and
attain all the formal qualifications needed. What’s more, all the
experience you have in your current field may be considered
relevant data science experience (especially if you combine
both statistics and machine learning). Unlike the other
professions described in the previous subchapter, a statistician’s or
machine learning practitioner’s way of thinking is quite close to
that of the data scientist and can easily evolve into that of a
data science role.
Let us now examine what you will need to learn in order to
make this transition. We’ll examine coming from a statistics
background (with packages like R being your bread and butter),
a machine learning / A.I. background (with intelligent
information systems being your playmates) or a mixed
background, which is often the case for many new
professionals. Let us get started now and see how you can
organically transform your skills into those of a data scientist.

12.3.1 Statistics Background


Coming from a statistics background, you have a clear
advantage over the classic programmer and the data-related
professional: you already know the theory behind the majority
of the data analysis needed, and you have some hands-on
experience with it. Statistics may be a fairly straightforward
subject to learn, but it takes a lot of work to master its use for
real-world problems. Having succeeded at this daunting task, at
least to some extent, means that you are good at learning
challenging material, and so everything else in data science
should be feasible for you. Before you know it, you can expand
your skill-set so that it more closely resembles that of the data
scientist as described in Chapter 5.
First, you will need to expand your theoretical knowledge so
that it includes the details of the modern datasets known as the
big data domain. Then you’ll have to get acquainted with the
relevant paradigms that have been developed for it and invest
some time in learning at least a couple of programming
languages. If you are already familiar with R, SPSS, SAS or
some other data analysis package, this shouldn’t be too
difficult. Finally, you’ll need to learn some supplementary
material to complete your skill set, making your resume similar
to a data scientist’s. Let’s look into each one of these in detail.
12.3.1.1 Theoretical Material to Learn
Statistics may be a great tool, but the data it deals with, at least
in most cases, is somewhat limited in size. That’s not to say
that a statistician never deals with large datasets, but the data that defines the data world today is a whole new ballgame.
As mentioned in the first part of this book, big data is a big deal
and a big challenge. So unless you have studied some of the
new literature of data analysis, you are in for a big surprise. Big
data requires a whole new approach, usually the MapReduce
paradigm and several tools for dealing with unstructured as well
as semi-structured data streams as we saw in Chapter 8. First
things first though. Try to find as many sources as you can on
MapReduce as well as on its ecosystems, the group of
toolboxes that have been developed to tackle big data using
this paradigm.
Once you understand the concepts of the MapReduce
architecture, you can learn about the ecosystems that employ
it: mainly Hadoop and Spark. Familiarize yourself with their
toolboxes and dig into their technical aspects (see subchapter
8.1 for details). If you don’t want to get your hands dirty, you
may want to look into an integrated platform such as IBM’s InfoSphere BigInsights (commercially licensed). It all depends on
how technical you want to get.
At a minimum, you will need to know all about big data,
MapReduce and some software that employs these
technologies. If you are not sure about how technical you want
to get, it is recommended that you start with a high-level
MapReduce platform. Once you are confident with it, proceed
to learn lower-level details. All this material may be a bit
daunting for someone who is not a computer scientist, so it is
recommended that you take a university course on the subject
(or a good MOOC) and get some hands-on experience in the
field.
In parallel, you will need to get acquainted with some computer
science theory. Pay close attention to algorithm complexity,
algorithm design and data structures in particular. You don’t
need to be an expert in information theory, though, as this is not
too relevant to data science. The key thing to remember is that
your resources are limited, and even if you are using a large
cluster of computers, you’ll need to be able to run efficient
programs on it. So don’t jump straight to coding; learn a few
things about program design first.
All this may seem a bit overwhelming, particularly if you haven’t
ever taken any computer science courses. However, there are
several courses you can take on these subjects; consult
Appendix 1 for details. In addition, you can read one or more of
the books listed in Appendix 3 or at least buy them to use as
reference material.
12.3.1.2 Languages to Learn
The ability to write programs is probably the most important skill
you will need in your transformation to a data scientist. Unlike
other scientists, the data scientist must be a mean programmer
who can efficiently implement his ideas into working programs
(even if he is not as adept as a professional programmer). To
do this, you will need to learn at least one programming language, preferably a powerful one; if you are not already familiar with R, you will need to learn at least two (R plus one more). Which specific languages you learn is not all that important as long as at least one of them is an object-oriented one. A very
popular choice for this type of language is Python due to its
simplicity and the fact that there is a plethora of packages
available for use in your programs (saving you a lot of time in
development and debugging). However, if you are up for it
there are better options, the most popular of which are Java,
C# and C++ (not to be confused with its predecessor, C, which
is not an OO language). All of these languages (especially
C++) are quite fast and therefore any programs written in them
are quite scalable. You can find a large community of users for
any one of these four OO languages, making the process of
learning them much easier than it would be by reading a book
or taking a course on them.
Ideally, you will employ several sources when learning any one
of these languages. For someone who is not familiar with
programming, this can be time consuming and frustrating at
times. It is recommended that you pick a language, take a
course on it (online or offline), read a book or two and practice
a lot. You don’t have to design any fancy algorithms at this
stage, just learn the syntax and the various functions of the
language. You can implement some of the algorithms you
already know from statistics and expand from there.
A very important aspect of programming is the development
environment you use (often referred to as an IDE). An excellent
choice for this is Eclipse, which has various versions for a
variety of programming languages. Python has its own
development environment, but it is very basic so you may want
to consider getting acquainted with Eclipse even if you decide
on that language. Just like these languages, Eclipse is free to use, and there are plenty of tutorials for it. Consult the
relevant appendices for details.
If you decide to get more technical with Hadoop, you will also
need to learn Pig and Hive. The former is used for the
MapReduce programming, while the latter is for creating and
running queries for data that is spread over the cluster. If you
decide to avoid low-level programming in MapReduce, you will
still need to learn the high-level big data platform of your choice
well enough to be able to customize it. This entails learning
some programming for it. If you know R already, that’s a big
plus.
If all this programming sounds intimidating, don’t let it scare you
off. Once you know one programming language, it is
significantly easier to learn other ones. The logic is usually very
similar and you only need to familiarize yourself with its
particular characteristics: syntax, packages, functions, etc.
Besides, there is nothing that’s too difficult to overcome with
enough practice.
12.3.1.3 Other Material to Get Acquainted With
Apart from the above, you will need to learn a query language
such as SQL. This is much simpler than the aforementioned
languages, and it shouldn’t take you more than a couple of
weeks although you may want to practice a bit after that to
make it second nature for you. SQL is designed to work with
structured data such as the data that you find in databases.
This is not something akin to big data, which is usually
unstructured, but it may be that you will need to retrieve some
data from the company’s database, which you’ll need to be able
to do on your own. Also, knowing SQL is very useful as a basis
since there are SQL-like languages that are used with big data,
e.g., AQL (Annotated Query Language) and NoSQL (Not Only
SQL) among others.
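If you want a feel for basic SQL without leaving R, one option (assuming the DBI and RSQLite packages are installed) is to run queries against an in-memory database:

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "cars", mtcars)    # a built-in R dataset loaded as a table

    # A typical structured query: average mpg per number of cylinders
    dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg
                     FROM cars GROUP BY cyl ORDER BY cyl")
    dbDisconnect(con)
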
Another useful thing to learn, which does not fit in any of the bins
above, is graph analysis. A relatively old field of mathematics, it
has recently experienced a resurgence as its benefits have
found fertile ground in the big data world. GraphLab, for instance, is a specialized (and free) piece of software that deals with graph data processing and visualization. Even though it
is not a necessary thing to learn, it would be useful to at least
learn it at a basic level since you already have the background
for it.
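To get a taste of what basic graph analysis looks like, here is a minimal sketch using the igraph package in R (an assumption made purely for illustration, since GraphLab itself is a separate product) on a tiny made-up network:

    library(igraph)
    edges <- matrix(c("Ann", "Bob",
                      "Bob", "Cat",
                      "Cat", "Ann",
                      "Cat", "Dan"), ncol = 2, byrow = TRUE)
    g <- graph_from_edgelist(edges, directed = FALSE)

    degree(g)        # how connected each node is
    betweenness(g)   # which nodes sit on the most shortest paths
    plot(g)          # a quick visualization of the network
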
If you learn all the above, you have a fighting chance of
entering the data science field gracefully and without feeling
inferior to the more technical professionals who aspire to the
same thing. Learning what you need may take from a few
months to two years or more. During that time you may also
acquire some useful experience, especially if you plan your
course carefully. A small bonus is that all this will beef up your
resume with useful skills that are in demand beyond the data
science field. Sounds like a good tradeoff for your time,
wouldn’t you agree?

12.3.2 Machine Learning / A.I. Background


A background like this is ideal for a data scientist. It combines
programming, some knowledge of statistics, and often some
hacking skills. More importantly, it provides you with access to
the state-of-the-art research in the technology that constitutes
the heart of data science: machine learning.
Even an experienced machine learning / A.I. practitioner is
bound to have some gaps in his knowledge when entering the
data science domain. In that case, here are some useful things
to consider looking into before marketing yourself as a data
scientist.
12.3.2.1 Theoretical Material to Learn
Being a machine learning / A.I. practitioner may give you some
statistical knowledge, but unless you are doing research on
hybrid approaches to machine learning, your statistics may
need some upgrading. You may want to expand your know-how
to include less well-known methods and become familiar with
the theory behind the methods you already know. It would be
useful to do this while learning R, if you are not already familiar
with it. There are other statistical packages, of course, but if you
have to choose one to learn well for regular use, it should be R.
In addition to being a very robust tool for statistical analysis
(and data analysis in general), it can provide you with insight
into how certain methods work. Moreover, it is designed for a
wide variety of users and does not assume more than a basic
knowledge of statistics, which as a machine learning / A.I.
practitioner you should have. Furthermore, you may already be
familiar with Matlab, so learning R should be a piece of cake for
you.
Learning statistics in depth can be a daunting task, so it is
recommended that you look into university courses on the
subject, even online ones. A class or MOOC that has some
hands-on practice on a statistical package would be most
beneficial.
However, statistics is merely the beginning. You will need to get
acquainted with the MapReduce paradigm, as well as
distributed computing in general. If you feel comfortable enough
with the paradigm, consider designing a mapper and a reducer
for practice in a language of your choice. Your learning style will
dictate if you need to take a course on the subject or if a good
book and some tutorials will be sufficient.
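As a purely conceptual sketch (not a distributed implementation), the classic word-count example can be expressed in R with its Map and reduce-style functions, just to make the two steps of the paradigm tangible:

    lines <- c("big data is big", "data science is fun")

    # Map step: each line emits (word, 1) pairs
    mapped <- unlist(Map(function(line) {
      words <- strsplit(line, " ")[[1]]
      setNames(rep(1, length(words)), words)
    }, lines))

    # Reduce step: group the pairs by key (word) and sum the counts
    reduced <- tapply(mapped, names(mapped), sum)
    reduced
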
Finally, you will need to learn a few things about databases, if
you are not already familiar with them. A good working
knowledge of the various types of data structures can go a long
way. Most languages, even the less robust ones like Python,
can handle multiple data structures. However, if you don’t know
them in some depth, you may not be able to take advantage of
these features. Data science involves a lot of resource
management, and using the right data structures can help you
create efficient programs and access diverse data sources
more easily.
12.3.2.2 Languages to Learn
Many machine learning and A.I. practitioners get by using
Matlab, Octave or Python. All of these are great tools, but not
sufficient for a data scientist. Invest some time to learn one of
the more robust languages, such as Java, Scala, C++ or C#, for
starters. If languages are not your thing and you had a hard
time learning Matlab, then Python is always a popular option.
There are also several courses available. Python is not as
intuitive as Matlab, but it has a wide variety of packages and is
completely free. It is also adept at handling large datasets.
In addition, you should have at least some working knowledge
of R. Apart from being a very intuitive and high-level way to
implement any statistical method, it has robust parallel
computing capabilities, good memory management and a wide
variety of packages including one for large datasets. There is
also a very big user community for R. New statistical methods
created by researchers are usually first implemented and made
available in R. Finally, unlike other statistical programs, R (and
all its packages) is completely free.
Finally, you will need to get familiar with at least one of the big
data integrated platforms, such as IBM’s BigInsights, and with
the underlying technology, Hadoop. If you are more interested
in working on Hadoop without relying on a platform, you may
want to learn about Pig and Hive, for starters, so that you can
create your own Hadoop code.
Learning a language can be time consuming, but as a machine
learning/A.I. practitioner you are already good with algorithms.
Therefore, learning a language should be manageable for you.
If you have time, you may want to learn more than two
languages, including R, as different employers often value
different languages. So once you master one of the robust
languages, get at least some working knowledge of another
language.
12.3.2.3 Other Material to Get Acquainted With
As mentioned in Section 12.3.1.3, some working knowledge of
SQL is essential to your role as a data scientist. If you haven’t
touched this language since your university days, you may
want to refresh your skills and practice on a variety of datasets.
While you are at it, you may want to get acquainted with
NoSQL, as well, since that’s what is used for data in the big
data domain.
If you are confident about your programming skills and enjoy
algorithm design, you may prefer to write your programs in R or
some other language, linking them to the platform you are
using. Make sure you learn how this is done and practice on
some benchmark datasets. This could be more useful than
mastering a language since there is no one particular language
(yet) that you can rely on completely for all your data science
endeavors.
In addition, you will need to practice building data analysis
models using your newly acquired knowledge of statistics.
These models don’t have to be purely statistical, but the more
statistical elements you incorporate in them the better. Also,
look into ways to combine different techniques in these models
and test them on some benchmark models. Machine learning
and A.I. techniques are excellent, but statistical techniques can
be quite good, especially when dealing with numeric data.
Since data science applications usually deal with this kind of
data, you may want to make use of statistics more in your
models.

12.3.3 Mixed Background


Coming from a mixed background in statistics and machine
learning or A.I. has a lot of advantages and makes the
transition to data science easier than from any other
background. You should already be familiar with statistics
theory, various machine learning techniques and may know a
programming language or two, but you need to become
acquainted with parallel computing, the MapReduce paradigm
and databases. If you don’t know an OOP language well, you
will need to expand your knowledge of programming. Also, you
will need to work on your data mining techniques, practice with
various benchmark datasets and refine your R programming.
If you come from a mixed background, you may want to invest
in gaining more experience in data analysis because that is
your forte. All the experience you already have is an invaluable
asset, so make sure you integrate that into the data scientist
core that you are building. Don’t hesitate to practice on
datasets you have already worked with by incorporating the
new methods you learn. Focus on better resource management
and honing your hacking skills. Review subchapter 6.3 on how
to get initial experience, and study the data science process
(Chapter 11) so that it becomes second nature to you.
Whether you are a statistics person, a machine learning
practitioner, or a combination of both, you will need to enhance
your skills before you are ready to enter the data science
market. Focus on your strengths and think of ways to
complement them with the knowledge and skills you are
missing. Just be sure that along with all of the skills you learn,
you also develop the mindset needed for success that we saw
in Chapter 4 so that you are more than a moving data science
library. It is best to view all of these skills as ways to expand
your thinking and develop yourself to tackle data science
problems. As the years go by, the tools are bound to change,
but what you’ve learned through cultivating them will remain.
It’s this aspect of education that will define you as a data
scientist and, if you play your cards right, it’s what will help you
land your first data scientist job.

12.4 Expanding Your Current Skill-Set as a Data-Related Professional


As a professional in a data related field, you are already familiar
with data types and structured data, so your emphasis should
be on other aspects of data science such as OO programming,
data visualization, data analysis, honing your communication
skills and getting some hands-on experience.
In this section, we’ll look into three main data-related jobs:
database administrator, data architect/modeler and BI analyst
as well as how you can make the transition to data science
from each one of them. For a full description of the required
skills for a data scientist role, you may refer to the
corresponding chapters (Chapter 8 for the necessary software
and Chapters 4-7 for all the soft skills and practices).

12.4.1 Database Administrator


If you are a database administrator, you have a clear
understanding of what a clean and ordered dataset looks like,
how data can be gathered from a variety of sources and the different types of data that exist, and you have expertise in one
or more database management systems and SQL-based
software.
You are probably familiar with user requirements and are able
to interpret what they mean. You are familiar with
querying strategies and are confident about importing and
exporting (mainly structured) data in various formats from a
database or a data warehouse.
In order to migrate to the data science world effectively, you’ll
need to get acquainted with the big data technologies, starting
with Hadoop. This shouldn’t be difficult, considering that at least
one of the components of Hadoop (Hive) is similar to SQL. In
addition, if you are familiar with database schemas, creating an
HBase database shouldn’t be much of a challenge. Finally, the
NoSQL language is, in a way, an extension of SQL (although it
includes much more), making it somewhat easier for you to
learn.
Learning programming may be more challenging for you, but
chances are that you are already somewhat familiar with
programming even if you are not a master of an OO language.
After taking a couple of courses (or reading a few good books),
you should be able to handle that aspect of data science, too. If
you haven’t done any programming before, Python may be a
good place to start as it’s probably the simplest OO language
(though not the most powerful one).
You will need to invest time to learn visualization. Programs like
Tableau and Flare, among many others, are great options for
this task. Although they are not too difficult, they will require
some time to learn and practice.
Once you’ve got your mind adjusted to this learning sprint, you
may want to step it up a notch by tackling the statistics and
machine learning aspect of the field. Regardless of your
technical background, you’ll need to devote quite some time to
this, especially if you haven’t seen a statistics book since your
university days. In order to save time and to make the whole
process more interesting, you may want to learn this material
while taking up R, Matlab or some other data analysis package.
You’ll find that a lot of it will make more sense once you see it in
practice. You’ll also come to appreciate these wonderful pieces
of software. Note that you’ll need to learn about vectorization as
well. Some hands-on knowledge of linear algebra can be very
beneficial towards that.
Moreover, you may want to hone your communication skills,
cultivating the art of storytelling. If you find that challenging,
learn about the business world, read a few articles about
investments and other business-related topics, speak to various
business people and familiarize yourself with the business
mindset. (Good documentaries on the topic may be useful as
well.) You don’t have to get an MBA in order to do that although
an intro to economics or finance course may be very helpful.
Even if you are great at communicating technical details
efficiently, you’ll need to be able to communicate holistically as
well, establishing links with non-technical people and
expressing things in an engaging way, as if you are telling a
very interesting story to them. Report-writing may also help in
that aspect, especially if your reports are targeted at people
unfamiliar with the technical aspect of your field.
Finally, it’s a good idea to acquire some experience by applying
what you have learned on benchmark problems and/or
datasets from online competitions.

12.4.2 Data Architect/Modeler


If you are a data architect (data modeler), you are probably
familiar with the business side of the database world and have
substantial experience with requirements and planning. You
already possess many of the skills of a database administrator,
and you probably have hands-on experience dealing with a
variety of data types. All these are essential skills that can give
you a good head start in your data science endeavors.
To take better advantage of this head start, you may want to
invest in the scientific know-how related to the field or expand
the business skills you have, aiming at a senior data scientist
post. The former includes statistics and machine learning,
which you can learn through courses and reading a few books.
As for business skills, you may want to invest in developing
project management skills, getting acquainted with the
corresponding software and learning more about the data
science process and how it can be broken down into specific
independent tasks that can be delegated to others. Naturally,
you’ll need to be comfortable doing each one of these tasks
yourself, so some hands-on experience with the data science
process is also essential (refer to Chapter 11 for details).
In order to gain this experience, you need to familiarize yourself
with the big data technologies. The best place to start would be
the Hadoop databases, namely Hive and HBase. NoSQL is
also useful to know and in demand. As you are already familiar
with database design, understanding big data technologies
should come more naturally to you, making your transition to
the data science world smoother. You’ll need to expand your
knowledge to include MapReduce and all the other aspects of
Hadoop.
As a data architect/modeler, you may already be familiar with
some OO programming. If not, you will need to learn at least
one OO language fluently. This can facilitate the next step: data
analysis tools.
Just like the database admin, you’ll need to spend quite some
time on learning data analysis tools, perhaps while you learn
about the statistics and machine learning methods you will be
using. You don’t need to know everything about these
packages, but knowing some programming can be to your
advantage. Try implementing various programming methods in
your data analysis package of choice (e.g., R). Note that even
though R and Matlab are not advertised as OO languages, they
do support classes and every single thing in their workspaces is
treated as an object. So don’t be fooled by their simple
interfaces – they each have a real beast in their core!
Unfortunately, there is no way around learning vectorization as
this is essential if you want to create efficient programs in either
one of these packages. A solid understanding of linear algebra
can be quite helpful for that.
Finally, although your communication skills are probably quite
decent, you may want to practice presenting things, such as the
models you develop and plots of the data, in a storytelling
fashion, as this is something very useful in a data scientist role.
This can be effectively combined with learning about data
visualization.

12.4.3 Business Intelligence Analyst


Working in business intelligence (BI) gives you a firm grasp of
the value of data, particularly in a business setting. If you are in
that field, you can readily see how the big data movement can
benefit the business world through BI. It is assumed that you
already know about data types and that data visualization is
your bread and butter. You may not be familiar with the
particular data analysis packages that were described in this
book, but it shouldn’t take you long to familiarize yourself with
any one of them. You are probably comfortable with statistics
although you should expand your repertoire of statistical
methods and add some machine learning techniques into the
mix as well. Start from what you already know, or something
you are somewhat familiar with, and you can’t go wrong.
If you aren’t much of a programmer, you will want to take a
couple of courses in an OO language of choice (Python is
probably the easiest option). Courses like “Intro to
Programming” would be ideal. If you are so inclined, to save
some time you can study the book Machine Learning in Action,
where a variety of machine learning techniques are introduced
and implementations in Python are made available. You’ll still
need to read up on machine learning from other sources or
take a course on the subject to ensure you know enough. In
parallel, you can expand your statistical knowledge as this is
bound to have a synergistic effect that will complement your
studies in machine learning.
Making the transition to the data science world will, of course,
entail familiarizing yourself with the big data technology. If you
are not entirely comfortable with SQL, master it. Then learn
about NoSQL, Hive and HBase. Afterwards, learning about
HDFS, MapReduce and the other Hadoop components should
be easier.
Learning a data analysis package is the next logical step for
your transition to the data science world. Both R and
Matlab/Octave are great places to start since learning either
one of them will make you capable of running any kind of data
analysis method you’ll ever need, but if you are already working
with SPSS or SAS, you can expand your data analysis skills
through them. Keep in mind that an open source alternative is
often preferred by companies, unless, of course, they already
have a license for some proprietary software.
Finally, you will want to polish your communication skills,
particularly when it comes to presenting what you have found to
people who are unfamiliar with the problem domain and lack
technical expertise. This is common in the BI world, so it
shouldn’t be too challenging for you. Still, more practice would
be useful. You’ll also need to get some experience using
benchmark datasets or competition problems (see subchapter
6.3 for details).

12.5 Developing the Data Scientist’s Skill-Set as a Student


If you are a student you may feel left behind due to your limited
experience in a field related to data science. However, you may
have an advantage over the professionals who are about to
enter the field. This is because you have the opportunity to
cultivate a more balanced skill-set from the very beginning. If
you take full advantage of this, it may save you time in your
development as a data scientist since you’ll be honing your
skills in a more organic way.
The best way to start would be to figure out your strengths and
weaknesses in relation to the skill-set described in the
beginning of this chapter. Afterwards, you can develop a plan to
cultivate an existing strength and a weakness at the same time
so that you are not overwhelmed (in the case of working only
on a weakness) or overconfident (in the case of working only
on a strength). So if you are good with programming but not so
good with business concepts, you can work on these two skills
in parallel. Namely, you can expand your programming skill by
either learning the language you know in more depth or by
learning a new language that is used in data science. At the
same time, you can take a course on business models, microeconomics, finance or business administration to get a feel for
how companies and the economy in general function. Reading
business articles can also be very helpful in that respect.
You need to be sure to incorporate a lot of hands-on exercises
while you are developing those skills. All the data analysis
techniques you learn are useless if you don’t know how to
apply them for a specific data analysis problem. Even the most
mundane things can become quite interesting if you engage with them in a hands-on way. Enjoying the whole
process of learning can actually be a great benefit to your
development as a data scientist. If you feel uninspired
sometimes, it would be useful to read up on stories of
successful data scientists and talk to them if possible (via a
Meetup group, for example). The more tangible the role of data scientist is for you, the easier it will be to eventually adopt it. The
skills you cultivate will then make more sense and be more
meaningful. This is key in the long run since the initial
enthusiasm does not always linger till the end of your training.
However, if you are committed to your goal, you may find ways
to revive it and develop a more lasting passion for the role,
something that will reflect in the quality of your work.

12.6 Key Points

The transition from being a student, OO programmer, software developer or other related
career tracks to data science can be smooth and
relatively easy given that you have the focus,
discipline and determination to do it.
As an OO programmer, you need to make sure that
you do the following (preferably in this order):

Learn vectorization
Learn about data analysis tools such as
Matlab, R, etc.
Study statistics and machine learning
Get acquainted with big data tech
Get acquainted with how end-users
think and understand them

As a software prototype developer, you need to include in your to-do list all the things mentioned for
the OO programmer (except the last one).
If you are in another career track related to the
above, you need to adjust the list of things to do in
the OO programmer section to your specific needs,
giving emphasis to programming and data analysis
tools and theory.
As an OO programmer, you’ll need to get plenty of practice on large datasets such as the ones found on the Kaggle site and in the UCI machine learning
repository.
As a statistician you’ll need to learn more about
machine learning and programming, get acquainted
with big data technologies and expand your
business skills, preferably in that order.
As a ML/A.I. practitioner, you’ll need to learn more
about statistics, expand your programming skills,
learn about big data technologies and expand your
business skills, preferably in that order.
As someone who knows both statistics and ML/A.I.,
you’ll need to focus on expanding your programming
skills, getting acquainted with big data technologies
and learning more about the business world,
preferably in that order.
As a professional in a data-related field, you are
already familiar with data types and structured data,
so your emphasis should be on other aspects of
data science such as OO programming, data
visualization, data analysis, honing your
communication skills and getting some hands-on
experience.
If you are a DB administrator, you’ll need to add the
following to your to-do list (recommended in this
particular order):

Big data technology, starting with Hive and NoSQL
OO programming language, possibly
Python since it’s easier
Data visualization software such as
Tableau, Flare, etc.
Statistics and machine learning,
preferably parallel to a data analysis
package such as R or Matlab
Communication skills, focusing on
storytelling and presentations
Practice with benchmarks or
competition datasets

If you are a data architect/modeler, you’ll need to focus on the following things (preferably in this order):

Statistics and machine learning
Project management, if you aspire to
get a senior data scientist position
Big data technology, starting with Hive,
HBase and NoSQL
OO programming (expand existing
knowledge so that you have fluency in
at least one language)
Data analysis packages such as R or
Matlab (possibly combined with OO
programming)
Data visualization and presentation
skills (storytelling)
Practice with benchmarks or
competition datasets
If you are a BI analyst, you’ll need to concentrate on
the following skills (in the suggested order, if
possible):

OO Programming with a simple language like Python if you are new to
programming. (Possibly combine this
with studying machine learning via
Machine Learning in Action.)
SQL and big data technology, starting
with Hive, NoSQL and HBase.
Data analysis tools such as R or Matlab
(though expanding the knowledge of
one you already know is also an option).
Communication skills (storytelling).
Practice with benchmarks or
competition datasets.

If you are a student, you need to learn all of the core skills of data science, data analysis, programming,
big data technologies, business skills, etc., in a
balanced manner. It is recommended that you
identify your strengths and weaknesses first.
Regardless of which discipline you are coming from,
it is very important to pay close attention to communication skills, as they are crucial for a data
science role.
Chapter 13
Where to Look for a Data
Science Job

So you have acquired all of the skills necessary to be a data scientist and your mindset is on the right track. What’s next?
Having the right skills does not automatically make you the right
candidate for a data science job. You need to be systematic,
persistent, patient and somewhat aggressive in your job
hunting if you want to have a fighting chance at landing a data
science job. This is because most employers today are
unfortunately not entirely familiar with the field and what it
offers, so their expectations of data scientists are somewhat
unrealistic. Nevertheless, more and more companies are
becoming aware of the benefits of having a data scientist in
their ranks, so there are plenty of jobs out there.
Finding a data science job and getting your foot in the door can be challenging, but it is feasible if you know where to look. In this
chapter, we’ll take a look at the major avenues of job research
and how you can use them to your benefit in getting a data
science job. We will look into how you can contact companies
directly (and what to do to maximize your chances through this
bold approach), how you can make use of professional
networks to find job opportunities and make valuable
professional connections, how you can employ recruiting sites
(without getting lost in the masses) and what other alternative
tactics are available.
Before beginning your search, you need to decide what kind of
data science job you are after. The more specific you are about
this, the better. For example, you may want to get involved in a
particular industry or you may only be looking for something
full-time, etc. The more specific you are, the better your
chances of finding a position in which you’ll be content and of
avoiding jobs that are not relevant to what you want. In
addition, you need to be able to summarize all your skills,
qualifications and qualities in an effective way, both written and
orally. This is particularly helpful if you are planning to use
recruiters.

13.1 Contact Companies Directly


Contacting companies directly is probably the most promising
method for finding a job as a data scientist. This is usually the
final step of a long process of preparation that makes it
possible for this approach to have a chance of working out.
Before you contact a company, you need to learn as much
about it as you can. Looking it up on a search engine is not
nearly enough. You need to know enough to show that you
have a genuine interest in the organization and also to
convince yourself that it’s worthwhile. If possible, try connecting
with one or more of the organization’s employees, clients or
even its associates through networking. Find out through
research how much the organization has to gain by
incorporating big data technologies in its data management
department and how feasible such an endeavor is. Learn who
is in charge of information technology recruiting and what they
value in an employee. Come up with ideas to propose to that
person in case they are not familiar with the field. Be
considerate and proactive.
It is also important to note that you have a better chance of
landing a job if the timing is right. If you approach a company
before they have started to learn about the potential of big data,
they probably won’t listen to you; if they do, they are unlikely to
hire you. If you approach a company too late, after they have
already started building a data science team, you may be
thwarted by the competition and the company’s recruiting
politics. Of course it is very difficult to know when the timing will
be right, but you can gauge if the company is ready when you
contact people who are associated with it. Make sure that when
you finally approach a manager or human resources officer, you
are confident about the timing.
If you are a student, consider the possibility of an unpaid
internship. Although an internship does not guarantee they will
keep you afterwards, at least you will get to know the business
from inside (how it functions, what it needs, etc.), make some
useful professional connections and learn about what you can
do in practice, which is often somewhat different from what you
can do in theory or when tackling a benchmark dataset.
Connecting with someone who works at the company in a
position related to the one you are after is ideal. Of course, that
doesn’t guarantee success either, but it may get you a meeting
(even if it’s over the phone) with someone who can make
recruiting decisions. A meeting like that is worth a thousand
applications over the internet or by snail-mail. Also, the
feedback you may receive will be an invaluable aid for refining
your approach in case things don’t work out.
Contacting a company directly is great for another reason as
well. It shows that you can think outside the box, a quality that
is highly valued in the data science world. Few applicants are
creative and bold enough to try out this approach, so if you do,
you will stand out. At the very least you will have made an
impression and opened a door for the future; never
underestimate the value of a connection as it may yield results
when you least expect it.
When contacting a company that is not recruiting for a specific
position, avoid using email. It sounds strange, but the reality of
the matter is that unless you know the person that you are
contacting electronically, your email is bound to remain unread
and will probably find its way to the trash folder before the end
of the week. Think about it for a minute: how often do you reply
to emails from people you don’t know, especially if you have
other things you need to attend to? Always opt for a video
communication (e.g., via Skype), a chat over the phone or
ideally a face-to-face meeting if possible. After the first
connection is made, you may follow up with an email, but since
this time your name will be familiar, they will be more likely to
read it and take it into consideration.
The emails you write to that company, once a communication
channel has been opened, are crucial. You may want to think of
each email as an end-of-term assignment for a challenging
class. Needless to say, they are as important as your actual
resume and everything else you use to present yourself. That’s
because if they fail to grab the reader’s attention, you risk the
possibility of no-one reading your resume. More on that topic in the next chapter.
Moreover, you need to treat each company you contact as if
they have the only job opening available to you. You may have
a dozen companies lined up, but in this job-hunting tactic you
cannot afford to rely on the numbers. Each one of them is
unique and should be treated as such; otherwise you are just
wasting your time. You may want to consider taking a break
between two consecutive approaches, giving yourself time to
assimilate the experience of the previous one before moving on
to the next. You do not have to wait for a final response from
one company before approaching the next because it could
take anywhere from several days to several weeks. Once you
set off for the next company in line, focus on that particular
company as if there is no other alternative.
Finding companies worth checking out in person is not that
difficult. Practically any company that is large enough to have a
data management or BI department can qualify for your job
quest. Keep in mind that companies that have a strong social
media presence or deal with large markets are bound to be in
great need of a data scientist as they will have a lot to gain from
harnessing the big data they probably possess. Before deciding to approach a company, you will need to take into account factors such as its location (especially if it is in a different country or region), the company’s culture (even within the same country, different regions can have noticeable cultural differences that may be an issue for some people), how comfortable you would be there and your ability to adapt. Consider these matters carefully, along with any other factors you can think of, before picking out companies from a business directory or a website.
The whole process of finding a job opportunity through this approach is quite time-consuming and often exhausting, but in
some ways it is better than sending off impersonal applications
in response to job postings even if these applications are
accompanied by personalized cover letters. That’s because you
have a chance to refine your approach and hone your self-
promotion skills using direct feedback. We will take a look at
responding to job postings later on in this chapter. The bottom
line is that direct contact is much more promising and needs to
be applied systematically and patiently. It is a good investment
of your time, and, if you pay enough attention to it, it will be
helpful to you in a variety of ways.

13.2 Professional Networks


If contacting companies directly does not feel right for you, or if
you want to just build up towards that, you can try your hand at
professional networks as a job-hunting strategy. This includes
professional associations, data science conferences and
professional groups on social media.
Professional social media networks are becoming more and
more popular as an effective means of recruiting people, finding
potential business partners and forging professional
collaborations. However, in order to make the most of them you
need to find the ones that are the most efficient investment of
your time and to actively participate in them. Let’s look at each
of these in more detail.
When evaluating a professional network, you need to consider
the professional status of the members, the potential for
advancing your knowledge in the field and the regular
occurrence of events. There may also be a cost involved (at
least in some of them, usually to cover the cost of particular
events). However, if you are serious about networking, the
expense may be a good investment for you since you are also
going to invest your most valuable asset: your time.
The professional status of the members in a professional
network is crucial. Obviously, you don’t want to join a group
where everyone is on the same level as you, an aspirant to
data science. Nor do you want to join a group where everyone
is at the management level. You will need to find a group where
there is a healthy mix of professional statuses among the
members with a slight bias towards the more experienced ones.
This will give you an opportunity to learn by listening and
observing when socializing with the other members even over
the internet. You may also be able to find a mentor through the
organization.
The potential for advancing your knowledge in data science is
one of the most important factors to keep in mind. If the group
is active and there is a lot of material flying around among its
members, you will have the opportunity to pick up a lot of useful
information including unfamiliar terminology, new tools to
explore or just some anecdotes of educational value. Regularly attending a group with experienced practitioners can be as
educational as a course on data science, and possibly more
useful, since you will have the opportunity to learn how a
practitioner deals with real-world situations. When around the
more knowledgeable members of such a group, show interest
in what they have to say and ask meaningful questions. It
sounds obvious, but you’ll be surprised at how few people do
that, and it gives you a chance to distinguish yourself from
other less-experienced members of the group. Membership in
such a group will give you food for thought and may plant the
seeds for future personal development in the science of the
field and the actual work of the data scientist.
The occurrence of regular events is another significant factor
when choosing a professional network. In fact, this is the
reason some of these groups are quite successful while others
barely manage to survive. Events are the life-force of a group
and can be extremely beneficial for its members, even simple
networking events. Face-to-face events are most beneficial;
attending online events such as teleconferences and online
workshops can also be useful even if they are somewhat
impersonal. Events can be very educational and usually attract
some knowledgeable data scientists, especially those involved
in the development of new technologies in data science and
those who have done serious research in the past. This can
offer you access to state-of-the-art knowledge of the field and
enhance your understanding of the problems in the big data
domain. So pay close attention to groups that organize events and spare no expense to attend. A group that holds events on a regular basis is preferable, since regular events maximize your chances of attending at least a few of them, but even events held once or twice a year are better than nothing.
A few years ago, there were only a handful of social media
platforms30, so it was very easy to keep up with them. Now it
seems that every other site on the Web is (or tries to be) a
social medium of sorts! However, among all of them, only a few
stand out as a place to host data science groups, namely
LinkedIn31, Meetup, DataScienceCentral, and lately Kaggle32.
There may be others, but at the time of writing, these are the
major ones. You may be tempted to comb the Web for new and
alternative social media sites that have a data science
component, but it is better not to spread yourself too thin. Start
with one or two groups, and if you feel you can manage more,
expand from there. A constant presence on a few groups is
bound to be more fruitful than an occasional visit to more
groups every now and then.
Every one of these sites has its advantages, so it is a good idea
to study each before deciding on which ones to focus. It is
worth mentioning that Meetup is somewhat different from all the other sites, as it is not merely a social medium and is often not considered a social medium at all by the social media gurus. In
essence, it is a meta-community of people around the globe
that consists of several local communities, each one comprised
of at least one specialized social group in that area. All this
structure is organized by a very straightforward site that
functions like a social medium. However, if someone wants to
create a group in Meetup, this group has to be mainly in the
real world since strictly online groups are not allowed. People in
Meetup groups meet regularly and develop a strong community
that often transcends the Meetup group. Unfortunately, the
richness of these groups varies greatly from place to place, so
there may not be many data science Meetup groups in your
area yet. If there are any, though, you may want to join them
and, if you have time, become actively involved. As the vast
majority of the Meetup world is based on volunteers, you can legitimately present any work you do in these groups as volunteer experience on your resume, which can be a plus.
Most Meetup groups are relatively small and their members are
amateurs, but there are some groups that are truly mind-
blowing in how well they are organized and the potential they
have. These are usually found in large metropolitan areas,
however. It is worth mentioning that the majority of the Meetup
groups are completely free for their members. The only fees are
for the Meetup site, which may be shouldered by the
organizer(s) of the group. There may also be small fees for
some events, depending on how expensive they are to
organize. Often a Meetup group is associated with one or more
sponsors who cover the basic expenses, allowing the group to
avoid charging dues to its members. The bottom line is that
Meetup is an excellent initiative and can be a great professional
networking avenue for anyone in the data science field.
Data science conferences are definitely worth considering,
particularly if you are passionate about the role of the data
scientist and you don’t see it as merely a job. Of course they
can be quite costly, but they are often worth the high price.
Unlike academic conferences (which are good mainly for the
conference organizers), data science conferences offer a
healthy balance of talks, workshops and professional
networking. The talks are carefully selected to be of interest to
a wide audience interested in data science and usually refrain
from very specialized topics that are useful to a minority.
As for the workshops, they are designed so that everyone can
learn something useful that can be applied to a real-world
problem. They are usually costly, but registering for several of
them may get you a good discount. These workshops cover a
variety of applications, so you are bound to find one that is of
particular interest to you. Attending a workshop may mean that
you miss several conference sessions, but this is inevitable due
to the limited time that such a conference takes (otherwise it
would be too long for most people to be able to attend).
As for the professional networking part, this is similar to what
takes place in a Meetup group, but it can involve people from
all over the world. Among the conference attendees there may
also be some head-hunters, which may give you an opportunity
to learn about positions you might otherwise not hear of.
To make the most of your networking strategy, develop a plan.
It’s best to start with professional networking via social media,
joining two or three of them and actively participating in them.
Definitely try out a data science Meetup group, especially if you
live in a large city. Then, as you gain some social experience
and get more acquainted with the data science field on a
practical level, check out a relevant conference or two. If you
are on a budget, opt for the early bird registration, which has a
good discount, and be very selective in the workshops you
decide to attend. Finally, if you find you need more and you’re
not ready to hit the market yet, you can join a data science
association.
It is worth mentioning that it is a good idea to have some
professionally made business cards with you to give away to (at
least some of) the people you meet in your networking and job-
hunting endeavors. If you are not pleased with the job you
currently have, or if you are unemployed, you can put “Data
Science Consultant” or something along these lines as your job
title on the card.

13.3 Recruiting Sites


Seeking a job through recruiting sites (or sites with job ads) is
probably the least effective tactic for any field, especially one
that is still under formation. However, if you submit a large
enough number of applications and/or contact enough
recruiters, you may find just what you’re looking for.
Recruiters make a profit based on successful placements, so if
you don’t seem promising enough, they may not take your
case. In addition, contrary to what they may say, many of them
know less about the field than you did when you started reading this book, so take what they say with a grain of salt.
Many larger companies work exclusively with certain recruiters,
so they may have knowledge of positions that are not known or
advertised on the open market. Other recruiters have some
connections, but no special insight into job availability – you
can search the Web as well as they can. Don’t discount their
connections, though, because they may be able to get you in
the door when sending your resume on your own wouldn’t.
Take advantage of the benefits a recruiter may offer, but don’t
stop searching and networking on your own.
Job sites can be promising, particularly if they are designed for
data science and IT jobs in general. Before you send in any
applications, though, it will be necessary to tailor your resume
for each particular job opening (see next chapter for doing that
effectively). Your cover letter needs to be adjusted to be as
convincing as possible and should be tailored for each
particular job opening, too. Most sites have their own templates
with which to post your work and education history, so the
format and structure of your resume may go to waste, for the
most part. However, the wording and the details will come into
play, contributing to a worthy and eye-catching online resume
on these sites. That’s just the first step, though. After that is
when the real struggle begins.
You need to be very systematic and organized when it comes
to using these sites. Several job openings may be posted on
more than one of them with different application templates, so
you need to keep track of all the applications you send to avoid
sending multiple applications to the same company. This will
save you time and also make you look more professional. Be
selective when it comes to the job postings to which you
choose to reply. If there is an opening that requires much more
experience than you have, or if they want you to know
programming languages that you are not familiar with, don’t
bother with that post. The old advice of “apply to as many job
openings as you can” is outdated, not to mention naïve. If you
try to follow this advice, you may find yourself wasting all your
time in this tactic and will eventually be disappointed due to the
inherently small number of responses you’ll receive. You don’t
want to be too picky either, though. Find a healthy number of
job openings that you know you can handle effectively and
proceed from there.
When you come across a job posting on one of these sites,
read it carefully several times. If you are completely confident
that you have most of the skills it takes for that particular data
scientist job, note it down and carry on. Make sure that you can
still meet the deadline for the application, though. There are
often job postings that have expired or expire very soon, so it’s
best not to waste any time on them. Once you have gone
through all the job postings on the sites you are using, you are
ready to start working on the applications.
Similar to contacting a company directly, you need to treat each
application as something unique and worth all of your attention.
Naturally, an application is fairly impersonal, but it is still an
opportunity for you to distinguish yourself and test the waters of
the data science sea. As you prepare your application, review
the requirements again and again, trying to read between the
lines to clearly understand what it is that is expected of a
successful candidate. Many of the requirements may be
unrealistic because the recruiters want to aim high so that they
have room for compromise. It is preferable that you
demonstrate you can do every single thing they ask for even if
you are not the most experienced applicant out there. Make
sure that you convey the message that you are a team player
(even if you don’t say it explicitly as it’s a cliché), that you are
willing to learn new things (and learn them fast!) and that you
truly care about the industry the company is in. See yourself
from your future manager’s perspective and you can’t go
wrong. After all, you’re going for a win-win situation, something
that needs to come across in your application.
In addition to the well-known recruiting sites that cover all the
different industries (sites that you ought to look into, starting
from Appendix 1), there are a couple of specialized sites for
data scientists in particular. These are Kaggle and
DataScienceCentral although there may be more by the time
you read this book. It is worth looking into them first because
they are already well established.
Kaggle is an excellent site for people interested in data science
from a practical perspective. It began as a site for data analysis
competitions where companies around the world offered prizes
for the best solutions to a data-related problem they were
facing. As the Kaggle community grew, it eventually became a
social medium as we saw in the previous subchapter. Kaggle
has always been great for finding data-related jobs, particularly
those in the field of data science (actually this has become a
prominent key phrase in its SEO strategy, something that
speaks volumes to those who understand e-marketing). What
distinguishes it from other recruiting sites, though, is that you
can have an interaction with the recruiters through the site, see
what other applicants have asked and get an idea of what is
expected of you apart from the dry text on the requirements of
the job posting. Often the recruiters are individuals who are
preparing a start-up and are looking for talent. Although such
jobs don’t carry the prestige of working in a well-known or at
least established company, they offer invaluable experience
and the possibility of an exceptionally high ROI if the company
proves to be successful. In addition, if you are good with data
analysis, you may distinguish yourself from other applicants by
winning a competition or two. Note that although Kaggle is a
truly excellent site in many ways, it is targeted mainly towards
people who are entering the data science field, data science
freelancers, and headhunters.
DataScienceCentral is a more interesting site for data science
aspirants as it is targeted to a wide population of people
interested in the field including several full-time data scientists.
Apart from being a remarkable resource for data science
material, news and professional networking, it also exhibits job
opportunities and is appealing to recruiters of data scientists.
This is definitely a site worth checking regularly and making use
of if you are serious about becoming and remaining a data
scientist. Also a social medium, this site offers very rich content
that is continuously renewed. It has over 16,000 members from
different parts of the world and has several good job
opportunities in various locations. You can create an account
for free or sign in using your Twitter, Google or Yahoo! account
credentials.
Finally, another recruiting site worth looking into, even if it is
somewhat generic, is LinkedIn—the most worthwhile social
medium out there for professionals of all levels. Recently, it has
experienced a boom in big data awareness, and several data
science related groups have been organized: Data Mining,
Statistics, Big Data, and Data Visualization; BIG DATA
Professionals Architects Scientists Analytics Experts
Developers Cloud Computing NoSQL BI; Data Science, BI &
Predictive Analytics; and Data Scientists, to name a few. Even
DataScienceCentral has its own group on LinkedIn. These
groups are themselves useful for professional networking, as we saw previously, but they also offer a lot of job opportunities. In addition, as LinkedIn contains all of your professional
information (to the extent you keep your profile complete and
updated), it is easy for someone to check out your professional
background by visiting your LinkedIn profile. Moreover,
recruiters can see your activity in these groups to assess your
communication skills as well as your personality, to some extent
—things that could easily help them make a decision about
calling you for an interview or not. LinkedIn is a great resource
for job hunting and one that you need to take full advantage of.
If you can afford it and decide to pursue this tactic more
aggressively, you should consider getting the premium
membership to enhance the whole process.

13.4 Other Methods


In addition to the aforementioned job-hunting tactics, there are
a few other methods for hunting for a data science job. Though
not as effective as those already mentioned, they may be worth
a shot, especially if you pursue them with persistence and
commitment. For example, you can try attending job fairs and
focus on companies that are interested in data scientists or
data analysts. Present yourself as a data scientist and be
prepared to explain to them how you would be able to
contribute more to the company than a simple data analyst
would. Also, be prepared to listen to what the company
representatives have to say and try to pinpoint specific ways
you could help them by harnessing the data they have and
tackling specific business problems they are facing.
If you are a student, be sure to check out bulletin boards at the
universities in your city (particularly the ones with an IT
department) as this could also be a good source of internships
or job opportunities in data science.
Finally, if you are comfortable building Web pages, you may want to
adopt a more proactive approach and put together a website to
promote yourself. This kind of site is like an online resume, and
it is not uncommon for independent contractors and
researchers to have one. On this site you can include samples
of the code you have written, relevant projects you have
worked on (these could be projects you have done during your
studies) and detailed information about your data science
background, organized in brief and comprehensive Web pages.
It would also be a good idea to include a list of roles you are
interested in as well as places to which you are willing to
relocate. Be sure to include an email address and either a
Skype address or a phone number to make it easy for potential
employers to contact you. Try to come up with a few tactics of
your own. Who knows? Maybe your creativity can be put to use
for more than just big data analysis.

13.5 Key Points

In order to have a good chance of finding a job in data science, you need to be systematic, persistent,
patient and somewhat aggressive in your job
hunting.
It is very important to be as specific as possible
about what exactly you are looking for and what
compromises you are willing to make (e.g.,
relocating) before starting your job quest.
One of the most effective methods for landing a data
science job is through contacting a company
directly. This entails:

learning about the company and figuring out how it can employ big data
technologies beneficially
finding the right time to make your move
applying for an internship there if you
are a student
connecting with someone in that
company

Professional networking is an easier method for finding a data science job. It involves professional associations, data science conferences and professional groups on social media.
Great online places for networking are LinkedIn,
Meetup, DataScienceCentral and Kaggle. Meetup in
particular is very useful because it entails face-to-
face networking through its various data science
groups.
Using job sites can be a great way to get in touch
with recruiters and job opportunities in general.
Good places to start are LinkedIn groups, Kaggle
and DataScienceCentral.
There are several alternative methods for finding a
data science job. These include:

job fairs
university bulletin boards
personal website (online work portfolio
and resume)
other (use your imagination!)

30. Facebook (and its predecessors), LinkedIn and Twitter.
31. There are several groups here, both international and local. We’ll look into them at the end of subchapter 13.3. You can also do your own search through the LinkedIn search engine.
32. This site has been around for a while, but only recently did it become a social medium, adopting a new design that focuses on the social dimension. Before that, it was a simple competition site for machine learning practitioners and data analysts.
Chapter 14
Presenting Yourself

In the previous chapter, we examined the various strategies you can employ for finding a job opening, expanding your
professional network and gaining insight into the industry.
However, all this is of limited effectiveness if not accompanied
by the right attitude and self-presentation skills. Note that this is
more than just writing an appealing resume (which can easily
be outsourced nowadays) and a good cover letter (which can
also be outsourced!).
Presenting yourself is all about having conviction and conveying it effectively, often without much verbal communication.
It’s what image-makers do for their clients, and since chances
are you don’t have the budget to hire one, you will need to do
their job yourself! Note that the conviction and air of confidence
that you need to portray must be based on having solid
abilities, so if your skill-set is limited, you will need to work on
that first (see Chapters 8 and 12).
In this chapter, we’ll look into several guidelines about
presenting yourself: in your cover letter, on the phone and, of
course, in person. We’ll examine the importance of focusing on
the employer and his company’s needs, the value of flexibility
and adaptability, the significance of the deliverables in a data
scientist role and how you can guarantee them, the ways you
can differentiate yourself from other data professionals (who
may be after the same position), the value of being self-
sufficient as a professional and a few other factors you may
want to consider about improving the way you present yourself.
The advice in this chapter is applicable in other fields as well,
not just data science. However, you need to employ at least
some of these strategies if you want to have a fighting chance
of making it past the first stages of the interview process.

14.1 Focus on the Employer


The alpha-male approach that has been dominant in job
hunting for many years may not be the most effective strategy
when it comes to landing a data science job. Of course, it is
great if you are a go-getter and exhibit a strong, somewhat
aggressive approach to tackling problems and making things
happen, but this may not be what an employer is looking for.
With all of your social media information at their disposal and a sense of uncertainty about the technical jargon on a data scientist resume, your potential employer may feel overwhelmed if you start shooting big data technical terms at them out of the blue.
You need to focus your whole approach on the employer (i.e., what’s in it for them) and communicate in terms that they can understand, in an unambiguous manner that shows knowledge and confidence.
technically savvy with no people skills because they’ll think you
are just a geek. If you come across as overly confident, they
may think you are playing them. What they need is someone
who is upright, balanced and cares about the company;
someone who will go above and beyond just what is on his job
description. Can you be that person?
Few people are naturally charismatic marketers, and since you
are not in the marketing game, chances are marketing yourself
is not your strongest suit. The only way to overcome this is
through practice. You’ll need to accept that you may not get one
of the first few positions you apply for due to your lack of
experience unless you are such a great fit for the job that the
employer is willing to overlook that. It’s a small price to pay for
the opportunities that practicing your interviewing skills will
open for you later. Focusing on the employer will be useful not
only for the various stages of the hiring process, but also for
performing the actual job afterwards. You don’t need to be a
data scientist to figure out that having a good relationship with
your manager benefits you, your employer and everyone else
you’ll be working with. A healthy relationship can only bring
about good things for your career as a data scientist and for
your resume.

14.2 Flexibility and Adaptability


We talked about flexibility and adaptability briefly earlier in the
book (Chapter 4), emphasizing their importance in the data
scientist mindset. However, here they are described in a
different light.
Flexibility and adaptability are all about how you can stretch
and adapt your skills and experience to fit a job description and
its requirements as well as how you can amend any gaps in
knowledge you may have. They can be demonstrated by being
honest about what you know and explaining how the skills you
have can be adapted/enhanced to meet the firm’s needs. This
also shows your creativity and interest in enhancing your skills
to benefit the company.
For example, if you know R or Matlab, you can adapt from one
to the other quite quickly, and if you are flexible enough, you
can use either one to get the job done. Despite their functions
being somewhat different, the underlying logic of the two data
analysis tools is pretty much identical, so shifting from one to
the other is quite feasible.
Flexibility and adaptability are also important when it comes to
selecting the positions you wish to apply for. Say you are
looking for a standard data scientist position, but all you have a
shot at is an entry-level (junior) data scientist post. Will you go
for it? If you are flexible enough, then yes. Besides, experience
is experience. There is no doubt that it’s better on your resume
than Kaggle competitions and practicing on benchmark
datasets.
If you can only find a data scientist position in a domain you are
not familiar with, you can still demonstrate your adaptability by
becoming familiar with that industry and using your data
science skills to tackle its problems.

14.3 Deliverables
So you know all the relevant software and you’ve read your
statistics and machine learning books so much that you’ll have
a hard time reselling these books, but does that mean that you
can do the job and do it well? It all boils down to the
deliverables involved.
The deliverables of a particular data science position may vary
significantly since different employers have different business
needs for their (big) data, which differs significantly from
industry to industry. They may want you to undertake a project
management role—if not right from the start, then a few months
down the road. This is not uncommon for a senior data scientist
position (business data scientist type). You may know your stuff
well, but at the end of the day, your future employer needs to
make sure that you won’t be sitting in front of your workstation
all day and that you’ll exhibit some human resource
management skills. After all, you have good communication
skills, right? So what’s stopping you from becoming a project
manager or an assistant team leader?
A potential employer is looking for what you can bring to the company if you are hired. You can say that you are able to
deliver every single item listed in the responsibilities section of
the job description and explain exactly how you can do that. But
you can also be a bit more creative and bring some new ideas
to the table, preferably something that you have thought
through beforehand. Step into the employer’s shoes for a
minute and evaluate the two possibilities from their point of
view. Would you hire you?
The deliverables factor is something that ties in with each one
of your skills, too. You didn’t learn R because of its pretty
interface, nor did you learn Hadoop because of its nice
documentation, and you certainly didn’t learn Java because of what its fans say about it. You learned each of these technologies because they can deliver something valuable to you and to your work. So when you have a chance
to talk about your technical skills, you should point out how they
can benefit your potential employer because that’s what he will
care about the most. Remember subchapter 14.1 and the
importance of focusing on the employer. Your interview is your
chance to apply what you’ve learned and convince him that you
have something to offer that he would be unwise to pass on.
The same applies to your other abilities, the so-called soft skills.
In truth, there is nothing soft about them because if you use
them well, they can have some really hard effects that will
benefit everyone around you. Sure, there is a certain prestige
around knowing a particular piece of software at an expert
level, but being able to communicate well can be as important,
if not more so, depending on the particular position. You can
learn a piece of software in a few months, so even if you don’t
know how to use the big data package that a company prefers,
that’s not an issue as long as you’ve worked on similar
software. However, you need the ability to communicate well
right from the start. During the interview process, you want to
show that you can use your soft skills to provide lots of
deliverables because that could be what distinguishes you from
all other applicants.

14.4 Differentiating Yourself from Other Data Professionals
Distinguishing yourself from the competition is essential when
seeking a data scientist placement. Some of your competitors
will be people who are worthy data professionals who have
done some studying, taken a couple of courses and decided to
brand themselves as data scientists. They may have no idea
about the scientific method, the data science process or any of
the qualities that constitute the data scientist mindset, but this
gap of know-how and thinking is not reflected on their resumes.
So how can you differentiate yourself from them and
demonstrate that you are a real data scientist who can do what
it takes to make their big data talk?
Let’s look at the points you can emphasize to distinguish
yourself from your competitors in an unambiguous way:

Machine learning experience. As a data scientist,
you can do more than t-tests and correlation
analysis as you have a good grasp of machine
learning techniques, both in theory and in practice.
This translates into intelligent and very efficient data
processing that can yield promising results without
the use of any ad hoc models.
Big data know-how. Obviously, your expertise
extends to the distributed computing domain and
you embrace big data, knowing how to tame it with
the relevant technology and know-how. This, by
itself, should give you enough differentiation and a
strong competitive advantage.
Strong communication skills. You are confident and
skilled at telling a story about your findings through
the data science process you follow because you
understand everything in more depth (even if you
are not the most experienced person). This should
be evident in the way you present yourself during
your interviews.
Scientific approach to data analysis. With enough data, you can draw all kinds of conclusions and find a lot of interesting relationships in the data. In fact, you can find statistically significant results in completely useless combinations of variables (a toy illustration of this pitfall follows this list). This doesn’t mean that anyone cares about such results. Your selling point is that your results are driven by meaningful questions that you ask beforehand (when you formalize your hypotheses), and every step of the analysis is based on a methodology that is scientifically sound and can be easily replicated.
Familiarity with data analysis tools. Some data
professionals will be familiar with a lot of the
software that you use too, but they are less likely to
know the ins and outs of R, Matlab and other data
analysis tools, something that could give you an
edge. You can perform data analysis tasks using
Java or Python, but the aforementioned data
analysis tools are exceptionally good for data
exploration, data discovery and, to some extent,
data visualization. Knowing both programming and
data analysis tools gives you a clear advantage.
Other factors. There are other small things that may
differentiate you from would-be data scientists,
some of which seem insignificant on their own but together make up something powerful and
significant. These factors have to do with the data
scientist mentality and several other qualities that
you possess and may take for granted most of the
time (e.g., problem-solving, ability to think outside
the box, ability to come up with effective ways to
quantify qualitative data, etc.).
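As promised above, here is a small sketch in Python (assuming NumPy and SciPy are available; the data is artificially generated noise) of the pitfall mentioned under the scientific approach: screening many unrelated variables against an outcome will usually produce a few "statistically significant" correlations purely by chance, which is why hypotheses should come before the analysis.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    outcome = rng.normal(size=100)            # a random "outcome" variable
    predictors = rng.normal(size=(100, 50))   # 50 random, unrelated predictors

    # Test every predictor against the outcome and count the "significant" ones.
    significant = [i for i in range(predictors.shape[1])
                   if stats.pearsonr(outcome, predictors[:, i])[1] < 0.05]
    print(len(significant), "of 50 pure-noise variables appear significant at p < 0.05")

Typically a handful of these pure-noise variables pass the usual 0.05 threshold, even though none of them is related to the outcome.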

Differentiating yourself from wannabe data scientists is one thing, but what about differentiating yourself from other data
professionals (e.g., data architects) who have good technical
skills and may be well known in the industry? What makes you
different from them and more suitable for the particular role you
are pursuing? Why would someone care about your additional
skills and not about their additional experience? These are
questions you need to address first for yourself and then for the
potential employer if you really want that job.
Let’s take a look at all the factors that differentiate you from the
other data professionals (database administrators, business
intelligence analysts, etc.) in an unambiguous way:

Data analysis know-how. It is not uncommon for
data professionals to have limited knowledge of
statistics and/or machine learning, things that are
your bread and butter as a data scientist. Even if
your employer is not that knowledgeable in them,
they should still appreciate the benefits and inherent
value of your skills in the big data world. Remember
that many of them still think of data scientists as
statisticians who can handle big data; you need to
explain how much more your training and skills can
accomplish for them.
Big data know-how. Your familiarity with big data is a
huge plus since most data professionals don’t know
the relevant technology.
Strong communication skills. As in the previous
case, this is an advantage when it comes to working
in the modern business environment, which is quite
diverse and communication driven.
Familiarity with data analysis tools. Even if your
competitors know a few tricks about using statistics
and machine learning on various datasets (pretty
much everyone has heard of clustering, for
example), your comfort and experience with
specialized data analysis packages such as R and
Matlab will allow you to produce meaningful results
for your potential employer faster than individuals
who don’t have experience with the packages.
Scientific approach to everything you do. You are
quite familiar with the scientific method, even if you
don’t realize it. This could be used to your
advantage since many employers value a
methodical, disciplined and organized approach to
tasks.
Other factors. There are several other smaller
factors that may distinguish you from other data
professionals. They are related to the data scientist
mentality and other qualities that you may have and
take for granted (e.g., being able to see what useful data is missing, having the ability to come up
with useful models, employing a more creative
approach to problem solving, etc.). Even if your
employer is not savvy when it comes to data
science, these things will come through if you are
aware of them and value them enough.

14.5 Self-Sufficiency
The definition of self-sufficiency used in this book is “being
independent in a proactive and somewhat creative way.” It
means knowing what needs to be done and doing it with little to
no guidance, especially when it comes to your own domain. You need to own your domain and plan your work accordingly.
Like most things you talk about on your resume, in your cover
letter and during networking sessions, you need to be able to
demonstrate your self-sufficiency with examples drawn from
your professional experience by referring to specific cases
where you participated in or led a project, taking initiative and
showing creativity. Finding an innovative approach to a
problem, developing a clever feature in a data analysis
package or handling a difficult situation through a creative
approach, all without relying on a supervisor, are examples of
self-sufficiency. This is fairly common in the research world, although it is not valued there as much as it should be. The same
initiative in industry could result in a raise, a bonus or perhaps
even a promotion, while in the research world it is usually taken
for granted. So if you are in research, it is high time you learned
to value this attribute of yours and sell it properly to an
employer who can appreciate it.

14.6 Other Factors to Consider


Interview-appropriate personal presentation, language, physical
appearance, etc. are also important factors but are beyond the
scope of this book. There are many books and websites that
address the personal and interpersonal aspects of interviewing.
You would be wise to take advantage of the advice that is
available from these sources. Some of them can be found in
Appendix 1.

14.7 Key Points

Presenting yourself is more than just writing a good resume and a nice cover letter. It entails a lot of
things that refine your first impression, whether this
is via a letter, a phone call or a face-to-face meeting
with a potential employer.
Focusing on the employer is important to keep in
mind when presenting yourself for a data science
job. Specifically, you’ll need to understand what they
require from the use of big data, listen carefully to
what they expect from the person in the position
they are hiring, be able to explain to them what you
can offer in terms of benefits for the company and
the bottom line and communicate effectively. Ask
lots of questions and show a genuine interest in the
company and the position.
Flexibility and adaptability can be demonstrated by
being honest about what you know and explaining
how the skills you have can be adapted or
enhanced to meet the organization’s needs. This
also shows your creativity and interest in developing
your skills to benefit the company.
Deliverables relate to what you need to deliver to
fulfill the requirements of the data scientist position
in which you are interested, as well as to the effect your
specific skills and know-how can have on the bottom
line of the company that may hire you. The
deliverables can also refer to other benefits you can
bring to the organization such as initiative, ideas,
improvements in their existing BI processes, etc.
Differentiating yourself from your competition is very
important in this field. It involves selling the specific
technical skills, know-how and non-technical skills
you have that make you stand out from wannabe
data scientists and other data professionals
(database administrators, business intelligence
analysts, etc.).
Self-sufficiency is a must-have quality for any
profession nowadays, but especially in data science.
It means owning your work, acting responsibly,
showing initiative and managing your workload
without relying on a supervisor. It is highly valued and well worth highlighting when you present yourself.
For other factors that are important to keep in mind
when presenting yourself for an interview, see books
and articles on the subject. Such factors include:
physical appearance
language (including body language)
business cards
research (learning about your audience
before meeting with them)
Chapter 15
Freelance Track

Working as a freelancer or consultant is a great way to gain experience and learn more about the field without having to
specialize in a particular industry. At the same time, you can
familiarize yourself with different domains, getting a better
understanding of the business world. However, like other
freelance jobs, it can involve a hectic and sometimes chaotic
schedule and long hours. Freelancers and consultants can
have downtime (periods when they’re not working), so payment
may not be steady. In addition, if you don’t deliver exactly what
the client expects, you may not be paid at the end of the
project. Nevertheless, it is an option worth considering if you
have confidence in your abilities and find that the data scientist
market is not favorable to you or if you have other ways to pay
your bills while doing data science as a freelancer.
Being a freelance data scientist involves everything that a
normal in-house data scientist position involves with the
exception that the wages are usually hourly or project-based
and each assignment is of limited duration. If you work as a
data scientist for a company, you will be paid regularly and may
have ongoing responsibilities beyond the projects you are
involved in. Freelancing is definitely not for the faint hearted,
but delivering a satisfactory outcome can be a great milestone
for your life in the data science world and can jumpstart your
career as a data scientist.
In this chapter we will examine the pros and cons of being a
data science freelancer, investigate how long you should do
freelance work, talk about other services you can offer relevant
to the field and provide an example of a real freelance data
science opportunity to make all the points discussed in this
chapter more concrete. Note that even if you are not looking
into becoming a freelance data scientist now, the information in
this chapter can still be useful to you.
15.1 Pros and Cons of Being a Data Science Freelancer
Although the freelance track in data science is demanding, it
can be quite rewarding as well. The million dollar question,
though, is whether it is rewarding enough to be worth your
while. Let’s examine the pros and cons and discuss it in more
depth afterwards.
Pros:
- You are independent and cultivate self-sufficiency
- You can gain invaluable experience in a variety of industry domains
- You have the potential to make more money (you can get involved in a number of projects)
- You are your own boss (except for the clients you report to)
- You gain in reputation from every project you complete
- Lots of networking opportunities
- You have a chance to learn about different types of people and hone your communication skills
- You have more freedom to explore different approaches to tackling data science challenges

Cons:
- You have no job security whatsoever
- You may not acquire sufficient domain knowledge in any particular industry to be considered an expert in it
- You may spend a lot of money promoting your business/services through a marketing campaign or a marketing consultant
- If something goes wrong, you shoulder most of the responsibility
- It is difficult to get started
- Hectic schedule and long hours
- You don't usually have a chance to develop a long-term relationship with clients/collaborators
- You have no-one to give you guidelines about what approaches are better for the problem at hand
Note that, in the interest of objectivity, this comparison does not yield a clear-cut verdict. The importance of each of the factors above depends on what is important to you. You need to evaluate each factor from the perspective of your particular expectations, values, beliefs and general lifestyle.
Remember that the freelance track is not mutually exclusive with other career options. You may do some consulting work while
having a regular job (assuming your employer is okay with this
and that you have enough time to juggle all the responsibilities
of your day job), getting the best of both worlds. However, if this
is not an option, you can do freelance work as an initial stage
for something more long term. It is not unusual for a freelance
job to evolve into a respectable business or for an organization
to hire you as a full-time employee. The possibilities are limited
only by your imagination and your ambition.
The bottom line is if you can make it in the freelance world, you
can be successful just about anywhere. Freelancing and
consulting gives you an opportunity to acquire invaluable
experience. It is highly encouraged for more experienced professionals who wish to make a transition to the entrepreneurial world, as well as for young professionals who have limited job prospects but are fully aware of their abilities. It is
not a great option for more financially conservative individuals,
though, especially in today’s economy.
15.2 How Long You Should Do It
It’s hard to say how long you should remain a freelancer,
especially with the turbulent economic climate we are
experiencing at the time of this writing. However, if you take a
look at several LinkedIn profiles of data scientists employed by
companies, it will become clear that a couple of years or so is
usually sufficient time to gain all the required experience to get
a full-time data science job at a company (though shorter times
could be enough, depending on your skill-set). That doesn’t
mean that you shouldn’t do it longer if you are comfortable with
the lifestyle of a freelancer, have some money put aside to help
you weather down periods and/or don’t have a family to
support.
If you have a full-time job that doesn't take up all your extra time and energy (as is the case for many academics), you may want to do freelancing on the side as an extra source of income. That's particularly useful if you like your job and/or have financial responsibilities that require a steady income (e.g., children, a mortgage or student debt).
15.3 Other Relevant Services You Can Offer
Apart from basic big data analyses, you can also offer other
relevant services as a freelance data scientist. For example,
you can do programming gigs for your clients. There is a lot of
demand for OO programmers and it pays decent money. In
addition, it is good practice for you since programming is an
integral part of your job anyway. C# seems to be in the highest
demand in the industry today, but you can find gigs for Java,
C++ and even Python. Just make sure that you know the
language at an expert level before undertaking any of these
jobs.
In addition to offering programming services, you can also
undertake a data scrubbing gig, something you do anyway in
any data science project. For example, it could be that
someone is good at performing data analysis, but the data they
have needs to be ordered and cleansed. If they know that this
is a service you offer, they may consider hiring you for this task.
And if you don’t get many clients for this service, at least you
appear to be a versatile freelancer, which is always good.
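To give a flavor of what a small scrubbing task might look like, here is a minimal sketch in R (the file and column names are hypothetical, chosen purely for illustration):

    # Read a hypothetical raw file of sales records
    sales <- read.csv("raw_sales.csv", stringsAsFactors = FALSE)
    # Drop exact duplicate rows
    sales <- unique(sales)
    # Remove stray whitespace from a text column
    sales$product <- trimws(sales$product)
    # Discard rows with missing or negative amounts
    sales <- sales[!is.na(sales$amount) & sales$amount >= 0, ]
    # Save the cleaned data set for the client
    write.csv(sales, "clean_sales.csv", row.names = FALSE)

Real scrubbing gigs are rarely this tidy, but even a short script along these lines can save a client hours of manual work.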
Tutoring for a data analysis tool, a programming language or
anything else you are good enough at is another service worth
considering. You may want to tailor this service for
professionals who want to develop a particular skill fast and
provide it in a way that accommodates their busy schedules.
Alternatively, you can target students, although you'll have to adjust your rates to remain competitive. It is
particularly hard to get any real revenue from this service
nowadays since there is a wide variety of free alternatives
available (Coursera courses being the most well-known option),
but it definitely doesn’t hurt to include it as part of the services
you offer.
Finally, if you are skilled at writing, you can try to find some
editing work, particularly for students who have a hard time
writing a presentable thesis. This kind of service can be
educational for you as well, especially if your clients are
students of a discipline that involves a lot of data analysis work.
You may be able to think of other services that you can include
in your freelance endeavor, maximizing your chances of
earning some money and getting free advertising as a bonus
from happy clients who spread the word, making your services more widely known.
15.4 Example of a Freelance Data Science
Opportunity
To make this more concrete, here is an actual opportunity found
on Kaggle.com for freelance data science work for a company
in Africa. It includes some useful annotations by the author of
this book to illustrate points discussed in this and other
chapters. Do not feel frustrated if you cannot fulfill some of
these requirements as this is merely one of the many gigs you
can pursue as a freelance data scientist.
Background
We are a South African company looking for an experienced data
scientist that would be interested in doing freelance work on an
ongoing project with regards to analytics, data mining, data science
and knowledge discovery.
The ideal individual would have experience in applied data mining
and knowledge discovery projects.
Summary
The candidate will work with a team of professionals in
addition to key decision-makers as a data-driven
advisor and consultant.
The candidate will conduct ad-hoc statistical analyses,
data mining, apply and test models, run test/control
scenarios and present results and recommendations to
both technical and non-technical audience.
The candidate desirably should combine marketing &
business acumen with deep analytical skills to drive
impactful insights.
Responsibilities
Perform detailed data exploration and validation to
separate genuine phenomena from spurious anomalies
(e.g., outlier detection).
Develop, implement and evaluate complex statistical
models to predict or describe user behavior and
campaign patterns.
Help design and analyze structured experiments to
understand/measure the effects of changes in various
campaign factors driven by the developed models.
Present new insights and analyses that inform
decisions and help achieve continued success in
developing innovative solutions in engaging people
participating in the various campaign programs.
Collaborate across functions to design and deploy our
next generation analytics system.
Required Experience and Qualifications
Proven experience as a data scientist and
demonstrated employment of a variety of analytical
methods using applied statistics and data mining
according to the corresponding business objectives.
Minimum of 5 years of a professional level enterprise
data science experience, for a company of similar size
and complexity. Working experience in the fields of
mobile advertising, mobile marketing, internet, media,
social or online gaming preferred.33
Thorough knowledge of supervised and unsupervised
modeling techniques.
Ability to communicate clearly, in non-technical
language the impact of the data modeling results to
non-technical business stakeholders and decision
makers.34
Employ and share with our teams, best practices to
promote collaboration, knowledge and skills
development.
Deep knowledge of classical statistical methods,
Bayesian analysis, machine learning and data mining.
Expert knowledge of R or Matlab is required.
Proficiency in data management, SQL and shell
scripting.
Experience working with large data sets, using
analytical databases (Vertica, ParAccel) is an
advantage, working with distributed computing tools
(Map/Reduce, Hadoop, Hive, etc.) is a plus.35
Attention to detail, data accuracy and quality of output.
MS or PhD in applied statistics, mathematics,
economics, or a related quantitative field.
Curious about data and just about anything else.
Outstanding problem solving skills.
Hands on, results-driven who can work with extreme
efficiency, excited to learn new things and effective
problem solver.36
Further information regarding the project
The project is 1-year in length.
A weekly report of insights and deductions based on
analytical findings will need to be forwarded to us.
The chosen candidate will be paid on an hourly-basis
(please include hourly fees in response).37
Other freelance opportunities may not be so clear-cut and
detailed (obviously this employer has worked with data
scientists before and/or has a good idea of what the field
involves). However, this example illustrates that being a freelance data scientist is quite a feasible option and can be a good alternative for people who are relatively new to the field.
We will examine some more real-world examples of data
science work in the case studies section of this book (Chapters
16 and 17).
15.5 Key Points
Freelance work in the data science field is
challenging but can be quite rewarding, especially if
you are either experienced (and have a steady job
on the side) or if you are very new, without many
financial responsibilities, and you wish to gain some
professional experience.
Being a freelance data scientist involves more or
less everything that a normal in-house data scientist
position entails, although the wages are usually
hourly and the project is of limited duration.
There are both pros and cons to being a freelancer,
but whether it is worth it for you depends on your
expectations, your values and your lifestyle choices.
Working as a freelance data scientist can help you
acquire invaluable experience. If freelancing doesn’t
pan out as a long-term career, you can still use the
experience and your professional connections to
land a full-time job in the industry. If it does work out,
you can turn it into a respectable business and hire
others to help you undertake more data science
projects.
Based on research on various data scientists on
LinkedIn, it is clear that a couple of years or so is
usually enough time to gain the experience
necessary to get a full-time data science job at a
company.
When embarking on the freelance track, it is crucial
that you do some financial planning beforehand to
ensure that the whole endeavor won't land you in serious debt if it turns out not to be viable for you.
There are several services you can offer as a
freelance data scientist, in parallel to your regular
work, such as:
Programming gigs
Data scrubbing gigs
Tutoring professionals or students
Helping students on their theses
Looking at real-world examples of freelance data
science gigs can help you gain invaluable insight
into what is expected of you in the freelance world
and in the data science world in general.
Often, freelance gigs are not very clearly defined
(the included example is a special case where the
employer is quite clear about what they want).
Here the employer clearly states the domain knowledge that is relevant to
their industry. This is probably the most difficult requirement to meet unless you are already in this industry.
This part of the ad is intentionally in bold, something that clearly illustrates
the importance of good communication skills in the data science
domain.
This is a tricky requirement. Obviously they want you to be familiar with big data technology, but even if you are new to it, they will still consider you.
Does this ring a bell? If not, please review Chapter 4.
In such cases, you need to do plenty of research on what other freelance data scientists charge so that you are neither considered too expensive nor undervalue your worth.
Chapter 16
Experienced Data Scientists
Case Studies
We’ll begin the case studies with the story of two experienced
data scientists who work in the retail and law enforcement industries. In both cases, we'll get to know them
better with some basic professional and background
information, then proceed with their views on data science in
practice, how they see data science in the future and finally
what advice they have for you, the aspiring data scientist. At
the end of the chapter, we’ll have some take-away points, as
usual, to help you remember the key lessons of these
interviews.

16.1 Dr. Raj Bondugula
16.1.1 Basic Professional Information and Background
Dr. Bondugula has worked for Home Depot for the past few
months although he’s been in the data science field for many
years. He comes from a machine learning background and
spent several years in academia, so he is more of a researcher
type of data scientist. He is formally trained in computer vision
and natural language processing (NLP), fields that are very
relevant to data science as they both involve a great deal of
challenging data. Although most of his work is in these two
fields, his expertise goes beyond them; he has also spent
several years practicing data science in the industry. At one
point in his career, he worked for the Department of Defense on
computer clusters.
Dr. Bondugula has been involved in associations, namely IEEE
and the Computational Intelligence Society, specializing in
fuzzy logic. He was also active in the research arena,
contributing a number of papers in bioinformatics during his
academic phase. Currently, he is involved in conferences
focusing on data science technologies. He is also open to the
idea of joining a data science group when he has more time.
16.1.2 Views on Data Science in Practice
This data scientist has a very mature and clear perception of
what data science is, something that is uncommon among other data scientists, particularly those new to the field. Dr. Bondugula
views dealing with data as an extension of machine learning
practices, where the scope is scaled up while everything else
pretty much stays the same. For him, Hadoop is an easy way to
meet the challenge of parallelization for those unfamiliar with
parallel computer programming. It not only saves a lot of time,
but also a great deal of money (millions of dollars).
Dr. Bondugula runs a data science team for Home Depot.
Along with his team, he handles data science projects from the
conceptual level all the way to implementation and validation.
Afterwards, they partner with IT to create their data products.
As the whole group is relatively new to Home Depot (a little
over a year, at the time of this writing), he is still the only official
data scientist there although the team now includes some
members who are adept in big data technology and a few
machine learning practitioners. The people he works most
closely with are a small subset of this group.
For Dr. Bondugula, the most important thing in this line of work
is in-depth technical knowledge of the tools used and the ability
to adapt to the problem at hand including modifying the tools if
necessary. This “fundamental understanding of the techniques,”
as he calls it, enabled him to use the same methods in a variety
of domains, adapting them to fundamentally different problems
effectively and efficiently.
His everyday work involves a variety of things. Sometimes he
and his team are asked to improve internal manual processes
employing data science (using NLP, for example). Other times
they come up with novel ideas to improve customer
satisfaction, e.g., through a recommender system they have
developed for the company’s website. They also undertake
Web intelligence tasks at times in order to ensure quality in the
function of the website (e.g., pinpoint broken links).
Although Dr. Bondugula is not a senior data scientist yet (i.e.,
he doesn’t have other data scientists reporting to him), he is
well on his way to becoming one. For him, there is a very fuzzy
line between the two classes of data scientists; the division is
not as clear-cut as it appears to be in job applications. He also
finds that the titles of “data scientist” and “machine learning
specialist” are pretty much the same thing since the former has
been around only for a few years in the market.
Dr. Bondugula thinks that the retail industry lends itself to data
science because of the amount of data available and the data-
related problems faced by the industry. That’s what makes it
interesting, too. He finds that most of the time he needs to
apply and adapt existing methods rather than invent his own
(as is often the case in the R&D departments of data-driven
companies). An example of a data product he and his team are
developing is a non-personalized, content-based recommender
system for the company’s website. If a customer is looking to
buy a bath faucet from Home Depot through its website, they
may want to buy other bath products or other related items as
well. His data science system will find those products and
display them on the Web page for the customer to view,
emulating the experience they would have if they were
physically in the store.
He finds that although data science had not previously been essential in this industry, nowadays it is a very useful tool that is of great importance to it. The reason is that it satisfies a need
that was always there.
There is no doubt that Dr. Bondugula loves his job. He makes
that clear when he says that his day job doesn’t begin when he
gets in the office, but rather as soon as he wakes up, when he
starts thinking about the data science problems he is currently
tackling. It is evident that not only he is satisfied with this line of
work, but he is also very enthusiastic about it, stating that he is
“having fun” every day in his work.
16.1.3 Data Science in the Future
Dr. Bondugula envisions that in the future, Extract, Transform,
and Load (ETL) tools will become redundant and be replaced
by Hadoop, which is, in his view, the most promising piece of
data science technology today along with its “family”: Mahout,
HDFS, etc. It is a technology that makes a very challenging
task (computer parallelization) relatively easy, albeit not simple.
Regarding Hadoop evolution, he foresees that it will employ
more data analysis paradigms beyond the MapReduce one that
is widely used in data science today.
For Dr. Bondugula, the most challenging part of data science,
which will probably be the focus of the field in the future, is
forming the right questions to yield useful and meaningful
answers from big data. Open questions like “what’s interesting
about this data?” may not be popular in the future because they
may not yield very insightful answers (even if they are
scientifically valid). He believes that the source of a hypothesis
(scientific inquiry) should come from business knowledge. This
is where creativity, an inherently human attribute that is less likely to be taken over by computers, comes into play, at least
in this field.
On a personal level, Dr. Bondugula is confident that regardless
of the domain he works with in the future, he’ll do just fine even
if he needs to learn all the relevant domain knowledge from
scratch. As one would expect, he plans to continue in this line
of work for many years to come.
16.1.4 Advice to New Data Scientists
Dr. Bondugula advises new data scientists to “become an
expert in one field, be it statistics, machine learning or, say,
Java programming, and then try to get into the other ones.” You
also need to be prepared to accept help from other people as
you won’t be able to solve every single problem on your own.
Moreover, networking and communication skills also matter a
lot, so you need to develop a varied skill-set, which includes
“soft” skills, too.
16.2 Praneeth Vepakomma
16.2.1 Basic Professional Information and Background
Mr. Vepakomma works at PublicEngines Inc., a company that
develops software for law enforcement that includes analytics
and advanced predictive products. A unique niche in the business world, his current line of work uses data science to create advanced spatio-temporal predictive models and algorithms that predict crime, helping law enforcement agencies use their resources accurately and efficiently. Mr. Vepakomma has worked as a researcher for
about five years, three of which he has spent in the industry as
a data scientist.
A member of the American Statistical Association (ASA) and
American Mathematical Society (AMS) (similar to IEEE but for
mathematicians) and former member of the Data Science
Atlanta meetup group, where he also once gave a talk, Mr.
Vepakomma is actively involved in research and regularly
participates in conferences such as ECML and PKDD. The
academic research he does usually pertains to the intersection of advanced mathematics, machine learning and statistics, both in the theoretical and the applied realms.
A very amicable person, Mr. Vepakomma is the personification
of many data scientist qualities. He has great communication skills, curiosity and interest in many things, and a willingness to constantly learn, on top of his considerable technical strengths. He strongly believes in the importance of having a
technical breadth of knowledge across many sub-domains
apart from a depth of expertise in a few.
16.2.2 Views on Data Science in Practice
Mr. Vepakomma believes that it is very important to interact with all project participants throughout the development of a product, not just in the aspects of algorithm development and core problem solving, but also in the aspects of business strategy. He advocates that the most important things to have on your resume as a data scientist are a strong quantitative background, very good communication skills and experience in having led a data science project. (The latter is not essential for junior data scientists, but is always valued.) A track record of problem solving that has led to end products with a high value proposition or disruptive impact within the market is another pointer that will boost your resume.
Mr. Vepakomma is part of a team of ten people, mostly
consisting of engineers. Apart from daily interactions with them,
he also has a direct line of communication with executives
about strategy and execution related matters in regular
meetings. His everyday work includes problem-solving, R&D,
coming up with evaluation metrics, developing the algorithmic
backend and creating optimization hacks for scaling the
devised solution to save computational time and resources. In
addition, he also does some academic research on the side
and strongly believes in the power of collaborative research.
Based on his experience, Mr. Vepakomma thinks it is important to anticipate and investigate all the things that could go wrong when implementing a model, to ensure better fault tolerance in the end result; many minor aspects are bound to go wrong sooner or later if that attitude of looking out for faults is not inculcated. Regardless of how state-of-the-art the core mathematical model is, minor faults in the product can still arise in the pipeline that makes use of this mathematical secret sauce. That said, when it comes to developing a quality data product, he believes that sticking to the scientific method is the most important strategic guideline for product development and execution.
The data products he has been involved in at PublicEngines that are currently on the market are "Command Central Predictive" and "Command Central." The former is a
crime prediction software that provides an accurate prediction
of criminal activity in a small area (much smaller than a typical
heat map), thereby yielding actionable information that proves
invaluable to the police and law enforcement agencies. This
level of focused tactical information generated by this multiple
patent-pending predictive product helps reinforce ‘directed,
actionable patrol plans’ and increase ‘resource-efficiency’ so
that law enforcement agencies can use it to positively impact
their communities through accurate and efficient policing. The
“Command Central” software is a platform that provides spatial
and temporal crime analytics, while Command Central
Predictive is focused on spatio-temporal predictive models and
algorithms.
Working in a breadth of industries or domains would be quite
easy for Mr. Vepakomma because, as he explains, the domain
doesn’t matter that much for a data scientist. What really
matters is the presence of a right environment to produce hi-
tech disruptive technologies with high value propositions. It all
starts with the presence of a right environment and the right
skillsets, he emphasizes. He quotes John Tukey, saying that
Tukey liked Statistics and Applied mathematics because he got
to work in everyone’s backyard. This mindset has continued on
into the realm of Data Science as well, he says. However, he
prefers working in a company that has a worthwhile strategy
when dealing with the product development, and he doesn’t
favor work environments whose monetization strategy is solely
focused on aggressive marketing and is not naturally backed or
driven by high-quality and impactful technological products.
According to Mr. Vepakomma, there is a wonderful trade-off
within the four pillars formed by the size of the market, the
quality/relevance of the product, the marketing/delivery
channels and the competition. A weakness in any of these four criteria has to be aggressively compensated for by the rest in order to sustain and develop financial traction while (most importantly) continuing to correct for the weaknesses.
He believes that the roster of domains and industries where a
data scientist can be an ‘A-player’ and contribute with high,
lasting impact is pretty long. Naturally, he is very satisfied with
his job and is very excited about working in the domain of law
enforcement and predictive policing. He finds that having a
noticeable impact on society through the data products he and
his team develop is a great motivator.
16.2.3 Data Science in the Future
According to Mr. Vepakomma, the future of data science is
extremely bright because the amount of available data is growing exponentially, thereby opening up many new opportunities for monetizing products and services through the development of mathematical/statistical models and algorithms that bring intelligent use cases out of this data. He finds that people with varied yet formal quantitative
backgrounds can come together to make this possible under
the umbrella of a data science centric organization.
16.2.4 Advice to New Data Scientists
Mr. Vepakomma believes that it is very important to learn new
skills and constantly develop your existing ones while working
on a breadth and depth of technical sub-domains. He also
believes that the environment in which you work is very
important regardless of the domain of the company you work
for. A good environment will greatly help you, particularly in the
early stages of your career, while a bad one is bound to hold
you back. If you are a new data scientist, it helps to nurture yourself for one to two years by working with a competent and experienced data scientist (as part of their team) before going
solo.
16.3 Key Points
Hadoop is a very important tool and also plays an
equally important role in the evolution of data
science.
The most important thing in this line of work is in-
depth technical knowledge (especially of the
relevant tools), including the ability to adapt tools to
the problem at hand.
The everyday life of a data scientist involves both
optimizing existing processes as well as creating
novel ones to improve the customer experience.
The titles of “data scientist” and “machine learning
specialist” are pretty much synonymous in practice.
The retail industry lends itself to data science due to
the amount of data available and the data-related
problems the industry is facing. However, most of
the data science work in this industry has to do with
applying and adapting existing methods rather than
developing innovative ones.
The most challenging part of data science, which is
probably going to be the focus of the field in the
future, is forming the right questions in order to find
useful and meaningful answers from big data.
If you want to be a good data scientist, it is very
important to master one particular field before
moving on to the next one.
The work environment, as well as the industry you
are in, plays an important role in your development
as a data scientist, particularly in the early stages of
your career.
The future of data science is bright, and the field is
bound to grow to be more varied in the years to
come.
Chapter 17
Senior Data Scientist Case
Study
Senior data scientists are very difficult to reach because of the
demands on their time. However, these are the people who
have very useful insights about data science, and they are
generally better equipped to offer actionable advice compared
to other data scientists. In a way, they are the most mature
professionals in the field and inhabit the role that most data
scientists aspire to (including the author of this book).
In this chapter, we will look at a researcher type of data scientist
from the Greater Atlanta area, Dr. Nikolaos Vasiloglou. We will
examine his background, his views on data science in practice,
how he sees the field evolving in the future and what tips he
has for new data scientists (and aspiring data scientists).
Finally, we’ll end with a summary of the main points from this
particular case study.
17.1 Basic Professional Information and Background
Dr. Vasiloglou is a machine learning specialist, i.e., a data
scientist who specializes in the machine learning aspect of the
field. He works in the software development and mobile
advertising industries. Although he has been working as a data
scientist for about five years, he has been involved in the field
much longer. His PhD was in scalable machine learning
techniques, a topic that integrates seamlessly with data
science.
Dr. Vasiloglou has been involved in several local groups related
to the field, mainly through meetup.com. He was the founder of
Machine Learning by Example, a group for students of machine
learning (the group is no longer active), a member of Data
Science Atlanta (the largest data science group in the state)
and groups for Hadoop and programming languages. He also organizes the MLconf conferences, an industry-oriented conference series on machine learning.
He believes that there are two things on his resume that played
an important role in jumpstarting his career in data science:
internships in well-known companies such as Google and
having a PhD in machine learning from a good university
(Georgia Tech). For those who are unable to list either one of
these credentials on their resume, he recommends getting the
machine learning certificate from Stanford University (Prof. Ng’s
physical class, not the MOOC on Coursera).
Dr. Vasiloglou is part of a 4-member team at one of the companies for which he works and works on his own at the other company. In the team, he is responsible for all of its members
and manages them by creating the architectural framework in
which they work and by planning the projects in which they are
involved.
Dr. Vasiloglou is a very professional individual who at the same
time is very down-to-earth and approachable. He can be a fine
role model for those who plan to make data science their life-
long career.
17.2 Views on Data Science in Practice
This data scientist’s views on data science are based on his
experience in the field and his research interests, which revolve
around scalable machine learning techniques. His everyday
work includes daily report monitoring (for jobs left to run
overnight), brainstorming and mini group meetings, debugging
problematic code, reading newsletters and conference
proceedings and revising current problems (e.g., deep learning
networks) to keep himself abreast of new technologies in the
field.
According to Dr. Vasiloglou, a senior data scientist differs from
the other grades of data scientist in two ways. First, a senior
data scientist has more knowledge, know-how and more
experience, which translates into more efficient work and a
wider variety of potential techniques to employ when tackling a
given problem. Second, a senior data scientist is capable of
architecting a problem solution involving considerable work that
may be divided among several people and of starting a new
project (e.g., based on a conversation with a client and the data
that he is given).
Examples of data products that he has developed (or
participated in the development of) over the years include:
Botnet identification (finding infected machines based on network traffic data)
Library of machine learning methods that are fast
and efficient
Forecasting model based on a traditional relational
database
Although he has been practicing in the industry for the past few
years, he values the role of researchers in the field and
believes that a data scientist ought to be a bridge between
academia and the industry, something that he seems to have
accomplished very effectively based on what he says about his
life as a data scientist. Since information theory is universal, he
believes that he could transition to another industry relatively
easily. He finds the sectors of drug discovery and forensics
particularly interesting for a data scientist today.
17.3 Data Science in the Future
Dr. Vasiloglou acknowledges the possibility of data science
becoming more automated—even completely automated. Still,
he sees a lot of merit in having the state-of-the-art know-how as
the field is constantly evolving and will no doubt continue to do
so. He also expects more programming languages, particularly
functional ones (e.g., Scala), to be very popular when it comes
to data science in the years to come.
17.4 Advice to New Data Scientists
Dr. Vasiloglou believes in the importance of well-founded (solid)
knowledge, so he advises newcomers to study mathematics
(through books, papers, courses, etc.), especially younger
people who are still in college/university. He also finds merit in
competitions (e.g., those in Kaggle), which he recommends for
people preparing to enter the field. Such competitions offer lots
of useful experience with various types of datasets and give
you a chance to put into practice a variety of the data analysis
techniques you have learned. He also suggests that
newcomers learn software development through OO and
functional programming languages. He doesn’t favor any
particular language because programming skills are highly
transferrable.
Dr. Vasiloglou is a champion of equilibrium when it comes to
developing your data science skills. Therefore, all of the above
recommendations need to be taken into account and followed
in an organic and holistic way so that you end up with a
balanced skill set.
17.5 Key Points
Being a senior data scientist is somewhat different
than being a typical data scientist because it entails
more knowledge and know-how, more experience,
the ability to architect a solution to a problem and
the ability to start a new project.
In order to get a senior data scientist position,
having an internship in a major company or
obtaining a PhD from a good university are
important. However, if you don’t have either one of
these credentials on your resume, you can opt for a
certificate in machine learning from Stanford
University (Prof. Ng’s classroom course).
Drug discovery and forensics are interesting
industries where data science can prove to be very
useful.
Transitioning to another industry is relatively easy
because information theory is universal.
In order to become a data scientist, you need to
develop the following in a balanced way:
Well-founded (solid) knowledge of
mathematics
Experience through competitions (e.g.,
Kaggle)
Software development in OOP and
functional languages
Chapter 18
Call for New Data
Scientists
Now that you’ve made it this far and have taken to heart the
guidance in the chapters, let’s look at what you need to know
when you’re ready to start your job quest in the data science
world. Gaining some perspective of the types of job
opportunities advertised may be quite useful.
In this chapter, we’ll take a look at different types of ads:
namely, entry-level, experienced, and senior data scientist ads.
In addition, we’ll discuss some relevant tips for online searching
and present a few samples of ads for data scientist positions
that are currently open.
18.1 Ads for Entry-Level Data Scientists
There are relatively few ads for entry-level data scientists,
which is something of a mystery considering that a junior data
scientist is bound to yield the highest ROI for a company
starting a data science project. Basically, an entry-level data
scientist has minimal experience and a lower level of expertise
is expected (simply working knowledge of R or any other data
analysis tool may be sufficient). Compensation may not reflect
the amount of knowledge that is required for this position. After
you gain some experience, you can proceed to more
challenging roles, yielding better compensation. Here are a
couple of examples of entry-level data scientist jobs so that you
get an idea of what to expect, but the requirements are bound
to have changed a bit by the time you read this.
Title: Junior “Big Data” Software Engineer
Summary
This position involves creating and using custom software tools to
gather, manipulate, pre-process, and filter large data-sets. We’re
looking for a candidate who enjoys working in a fast-paced startup
environment with an interest in artificial intelligence technologies and
the semantic web. This position will involve working with more
senior engineers, with opportunities for mentorship and career
advancement. If you are a self-motivated, highly creative,
engineering-focused person, with an interest in AI and related
technologies, come talk to us. If you watched IBM Watson compete
on Jeopardy and thought “I want to build that,” talk to us.
Skill Requirements
Solid understanding of Linux software development
Experience with Linux command-line text manipulation
tools
Desire and ability to work with large data-sets
Experience in Agile development and object-oriented
design
Basic Linux system admin knowledge
Self-motivation & creativity
Familiarity with multiple languages (Java, Python, etc.) is
a plus.
(Source: Kaggle.com)
Title: Junior Data Scientist
Summary
This is an exciting opportunity for an experienced data and analytics
professional to join a leading brand name company within an
innovative and growing team. This position sits within a growing
analytics function offering the chance to play a key role in the further
development of customer insight and business analysis.
This role will play a key part in developing and delivering algorithms
and new analytical approaches to better understand and enable pricing
and business analytics. The role will involve the following key
responsibilities:
Using multiple data sets and sources to streamline
analysis and generate algorithms to develop analytical
frameworks
Work closely with other internal teams and senior stakeholders to better understand the available data, with
a view to identifying personalization driven products and
solutions
Develop programming language based scripts (SAS, SQL
or R) to help in the creation of market leading customer
insight strategies
Work with clients to improve and build upon their
understanding of their digital channels and
personalization
Mentor and lead junior team members.
Skill Requirements
To be shortlisted for this position, you must have the following
ESSENTIAL skills and experience:
Experience working in an advanced analytics function
Strong working knowledge of SAS, R, Python or SQL for
advanced statistics and programming/modeling
Degree in a numerical discipline e.g., Math, Stats,
Physics, Economics etc. from a top tier university
Experience developing and implementing algorithms and
analytic principles
Good communication skills – ability to communicate
technical strategy into easily understandable concepts.
(Source: Harnham.com)
18.2 Ads for Experienced Data Scientists
Ads for more experienced data scientists are the most
commonplace. They are most often related to larger companies
or startups with big funding behind them. Basically, when a
company is looking for an experienced data scientist, it is data
savvy and may even be data driven. Reading ads for
experienced data scientists can be very educational for anyone
interested in developing a career in this field because they
demonstrate the skills and experience needed. Remember that
these ads express what is in demand at the moment, so don’t
take their requirements as gospel. Needs will change over the
years as more advanced technologies come about and the data
science world takes a more formal shape. Here are a couple of
examples of such ads:
Title: Data Scientist
Summary
Work on large data sets of structured, semi-structured,
and unstructured data to discover hidden knowledge
about the client’s business and develop methods to
leverage that knowledge within their line of business
The successful candidate will combine strengths in
mathematics and applied statistics, computer science,
visualization capabilities, and a healthy sense of
exploration and knowledge acquisition
Work closely with various teams across the company to
identify and solve business challenges utilizing large
structured, semi-structured, and unstructured data in a
distributed processing environment
Develop predictive statistical, behavioral or other models
via supervised and unsupervised machine learning,
statistical analysis, and other predictive modeling
techniques
Drive the collection of new data and the refinement of
existing data sources
Analyze and interpret the results of product experiments
Collaborate with the engineering and product teams to
develop and support our internal data platform to support
ongoing analyses.
Skills Requirements
M.S. or Ph.D. in a relevant technical field (e.g., applied
mathematics, statistics, physics, computer science,
operations research), or years of experience in a relevant
role
Extensive experience solving analytics problems using
quantitative approaches
A proven passion for generating insights from data
Strong knowledge of statistical methods generally, and
particularly in the areas of modeling and business
analytics
Comfort manipulating and analyzing complex, high-
volume, high-dimensionality data from varying sources
Ability to communicate complex quantitative analysis in
a clear, precise, and actionable manner
Fluency with at least one scripting language such as
Python, Java, or C/C++
Expertise with relational databases and SQL. NoSQL is a
big plus
Experience working with large data sets, experience
working with distributed computing tools a plus
(Map/Reduce, Hadoop, Hive, etc.)
Expert knowledge of an analysis tool such as R, D3,
Matlab, SAS, Weka with the ability to transfer that
knowledge to different tools
Experience with Fraud analytics is a nice to have.
(Source: Linkedin)
Title: Data Scientist – London – £70,000 [Note: Though quite
atypical, some ads do mention the salary.]
Summary
My client, a market leader within the market research sector with
offices based worldwide is now looking to hire a talented data
scientist to come on board in central London.
With medium term plans to build a data function of 10 scientists in
UK and continuous learning and support from the established team in
the USA, this presents an exciting opportunity to join a data oriented
company with big growth plans and career progression.
Now at 3 million members worldwide, the company is generating and
storing huge amounts of data on a daily basis. The successful scientist
will produce statistical models and complex algorithms used for
extracting, testing, hypothesizing and providing meaningful insights
used to inform and make business decisions by organizations across
the globe.
Skills Requirements
The successful candidate will have the following skills and
experience:
BSC/MSC/PHD with computer science, mathematics or
related area
3yrs+ experience within a data science or analytical role
Relevant working experience working with vast data sets
The ability to work as an individual or as part of a team
Extensive experience with at least one statistical language
(R or Matlab preferred)
Proficiency in at least one programming language
(Python or Java preferred)
Strong SQL skills.
Above all you must have an inquisitive nature and a real passion for
data!
(Source: Linkedin)
18.3 Ads for Senior Data Scientists
Ads for senior data scientists are fewer though encountered
more often than those for junior data scientists. Senior data
scientists are basically the top-tier data scientists, the ones who
have sailed all kinds of oceans and have fought against
monsters of data. They usually end up in a business-oriented
position where they deal directly with management, and often with the company's clients themselves. Note that if you try the
freelance track, you’ll basically be taking a senior data scientist
role even if you don’t refer to it this way. This is because you’ll
need to undertake all the different aspects of that role including
the link to the business world, the project organization, the
architecture design, etc.
You’re probably not going to be hunting for this type position
right now, but it’s good to be aware of what’s out there in case
you want to drive towards it quickly and you have enough
expertise to make it happen. Experience can be gained
relatively easily once you are committed to your goal, are
focused, and know what you are doing. Here is an example of a
senior data scientist position from a US company.
Title: Senior Data Scientist
Summary
As a senior member of the data sciences team, you will be responsible
for managing and executing critical R&D projects, while providing
thought leadership, along with significant personal contributions.
Working in a highly collaborative environment, you will drive product
innovation and partner with Engineering and Product teams to
prototype and launch data-driven features and products. You will
develop deep domain expertise in digital advertising and generate key
insights that influence business decisions and technological solutions.
In addition, you will be active in the data sciences community and
contribute to attracting, retaining and growing the best talent in a
performance-driven organization.
Skills requirements
Required Qualifications
PhD in a quantitative discipline (e.g., statistics, computer
science, physics), or MS with equivalent experience
10+ years of hands-on experience in analysis and
modeling of large complex datasets
A passion for innovating with data sciences at scale –
applying modern algorithms to massive datasets and
creating measureable business value
Excellent interpersonal and communication skills, with a
strong written and verbal presentation
Proven ability to take ownership of a project and lead
R&D with minimal supervision
Track record of successful implementations of
quantitative, data-driven products in a business
environment
Deep understanding and hands-on experience with
optimization, data mining, machine learning or natural
language processing techniques
Superb understanding of algorithms, scalability and
various tradeoffs in a big data setting
Expert level in R, Matlab or a similar environment;
proficiency in SQL
Ability to personally put together a system of disjoint
components that implements a working solution to the
problem
Experience programming in at least one compiled
language (C/C++ preferred).
Preferred Qualifications
Experience analyzing internet scale sparse datasets
(billions of rows, thousands of columns)
Expertise in using Hadoop and/or MPP databases (e.g.,
Netezza, Vertica, RedShift) for complex data assembly
and transformation
Digital advertising or web technology experience
Experience with real-time bidding, electronic trade
execution or high-frequency trading algorithms.
(Source: Linkedin)
18.4 Online Job Searching Tips
Other ads for data scientist positions may masquerade as ads
for different types of positions, so you may want to include the
following keywords in your search for data scientist jobs online:
Data Engineer
Big Data (Software) Engineer
Chief Scientist
Senior Scientist
Big Data Analyst
Hadoop Programmer / Developer
Big Data Scientist
Big Data Analytics
Research Scientist – Data
VP, Data Science
Data Mining Scientist
Machine Learning Developer
Machine Learning Specialist
Statistician
In addition, if you decide to look into websites specifically
designed for job hunting, it makes sense to upload your resume
and keep it up to date. This will make the application process
much quicker and help you target a variety of companies
simultaneously. Always have someone check your resume
before putting it anywhere, however, preferably a professional
editor. Also, be sure to keep your LinkedIn profile in a professional state, especially if you plan to network in parallel with targeting online job ads. Here are some examples of useful
sites for data scientist ads:
Indeed.com – everything on this site is about job
hunting for all kinds of jobs, including data science
ones.
LinkedIn.com – there are separate groups on this social medium that act like job boards, in addition to LinkedIn's built-in job search function.
DataScienceCentral.com – job-board area under
Jobs option.
Kaggle.com – primarily for data analysis
competitions, this site also has a forum for data
science jobs.
Keep in mind that all of these sites represent just one strategy for landing a data science job. Don't forget that there are other paths to the same goal, so make use of networking as well. A connection with a person
working for a company you are applying to could lead to a job
offer for another position if the one you are applying for doesn’t
work out. So draw your own plan of action for making it happen
in this fascinating field. It won’t be easy, but rest assured it is
definitely worth it!
18.5 Key Points
Familiarizing yourself with the various data scientist
openings, even the ones for more advanced
positions, can be very useful in your search for data
science jobs.
There are few job ads in the field for junior data
scientists, at least at the present time. Most of the
ads out there for data scientists are for experienced
ones, followed by those for chief data scientists.
When looking for a data science position, it’s useful
to search for openings using various keywords, not
just “data scientist,” as different companies may
refer to the role with different names.
It’s good to have a resume in a professional state
online when searching for a data science job using a
job hunting site.
Having a presentable LinkedIn profile can help you
significantly in your search for a data science
position through networking.
Some useful sites for data scientist ads include:
Indeed.com
LinkedIn.com
DataScienceCentral.com
Kaggle.com
Final Words
In this book, we have seen what the field of data science entails
and how the profession of the data scientist came to be. We
described what big data is and how it differs from traditional
data through its main characteristics: volume, variety, velocity
and veracity. We also looked into the different types of data
scientists and the skill-sets of each one. We dug into what the
role of the data scientist requires in terms of the relevant
mindset, technical skills, experience and how he connects to
other people. We also zoomed in on the daily life of a data
scientist, examining the problems he may encounter and how
he tackles with them, what programs he uses and how he
expands his knowledge and know-how. We then looked into
how you can become a data scientist based on where you are
starting from: a programming, machine learning, data-related or
student background. Moreover, we went step-by-step through
the process of landing a data scientist job: where you need to
look, how you would present yourself to a potential employer
and what it takes to follow a freelancer path. Finally, we looked
at case studies of experienced and senior-level data scientists
in an attempt to get a better perspective of what this role is in
practice.
Now it is your turn to put all this knowledge to good use.
Whether you are opting for a position in a large organization or
planning to work as a freelancer, you have a lot of interesting
and educational challenges in front of you. This is practical
knowledge that cannot fit in a book. Just remember to stay
current on what is happening in the data science field so that
you always remain competitive. Enrich your toolbox and
knowledge-base constantly; good places to start are the
websites, articles and books that are listed in the appendices.
The book’s glossary can also be used as a hands-on reference
for a variety of relevant terms.
The data science field is still in its toddler years, and few are
those who are perceptive enough to foresee its potential. As
distributed computing gains more ground, data storage
becomes cheaper, data transfer becomes faster and, most
importantly, people begin reaping the fruits of big data, we
should expect it to become a big part of our everyday lives. This
should lead to data science becoming a major profession in the
not-so-distant future. And as big data technology continues to
evolve, more and more interesting ways of making use of
existing data will become available. The data scientist will
continue to be an ever-fascinating role that will rely as much on
creativity as it does on technical skills. By then, there will
probably be university departments specializing in this field,
and future data scientists will look back on the data scientists of
this decade, the pioneers of the field, with great admiration.
Glossary
of Computer and Big Data
Terminology
Big data terminology has developed during the last few years.
This glossary alphabetically lists some big data definitions along with some related computer terms that a newcomer in the field will find useful. A basic understanding of computers is
required to fully harness the information in this glossary.
A
Aggregation – the process through which data is searched,
gathered and presented.
Algorithm – a mathematical process that can perform a
specific analysis or transformation on a piece of data.
Analytics – the discovery and communication of insights
derived from data, or the use of software-based algorithms and
statistics to derive meaning from data.
Analytics Platform – software and/or hardware that provide
the tools and computational power needed to build and perform
many different analytical queries.
Anomaly Detection – the systematic search for data items in a
dataset that deviate from a projected pattern or expected
behavior. Anomalies are often referred to as outliers,
exceptions, surprises or contaminants, and they usually provide
critical and actionable information.
Application (App) – a program designed to perform
information processing tasks for a specific purpose or activity.
Artificial Intelligence (A.I.) – the field of computer science
related to the development of machines and software that are
capable of perceiving their environment and taking appropriate
action when required (in real-time), even learning from those
actions. Some A.I. algorithms are widely used in data science.
B
Behavioral Analytics – analytics that inform about the how,
why and what (instead of just the who and when) occurs in data
related to human behavior. Behavioral analytics investigates
humanized patterns in the data.
Big Data – data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage and process
them within a tolerable elapsed time. Big data sizes are a
constantly moving target, ranging from a few dozen terabytes to
many petabytes of data in a single data set. Big data is
characterized by its 4 Vs: volume, velocity, variety and veracity.
Big Data Scientist – an IT professional who is able to
use/develop the essential algorithms to make sense out of big
data and communicate the derived information effectively to
anyone interested. Also known as a data scientist.
Big Data Startup – a young company that has developed new
big data technology.
Business Intelligence – the theories, methodologies and
processes to make data, particularly business-related data,
understandable and more actionable.
Byte (B) – an acronym for “binary term.” A sequence of bits
that represents a character. Each byte has 8 bits.
C
Central Processing Unit (CPU) – the brains of an information
processing system; the processing component that controls the
interpretation and execution of instructions in a computer.
Classification Analysis – a systematic process for obtaining
important and relevant information about data using
classification algorithms.
Cloud – a broad term that refers to any Internet-based
application or service that is hosted remotely.
Cloud Computing – a computing system whose processing is
distributed over a network that uses server farms to store data
in a distant location (see also, data centers).
Clustering Analysis – the process of identifying objects that
are similar to each other and grouping them in order to
understand the differences and the similarities within the data.
Clustering is a common form of unsupervised learning and is
a fundamental part of data exploration and data discovery.
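As an illustration (not from the book), here is a minimal clustering sketch in R, using the built-in iris data set and the base kmeans() function:
    data(iris)                            # built-in data set with four numeric measurements per flower
    features <- iris[, 1:4]               # keep only the numeric columns (drop the species label)
    set.seed(42)                          # make the clustering reproducible
    fit <- kmeans(features, centers = 3)  # group the observations into 3 clusters
    table(fit$cluster, iris$Species)      # compare the clusters to the known species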
Comparative Analysis – a step-by-step procedure of
comparisons and calculations used to detect patterns within
very large data sets.
Complex Structured Data – data that is composed of two or
more complex, complicated and interrelated parts that cannot
be easily interpreted by structured query languages and tools.
Computer Generated Data – data generated by computers
such as log files. This constitutes a large part of big data in the
world today.
Concurrency – performing and executing multiple tasks and
processes at the same time.
Correlation Analysis – a statistical technique for determining a
relationship between variables and whether that relationship is
negative or positive. Although it does not imply causation,
correlation analysis can yield very useful information about the
data and help the data scientist handle it more effectively.
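For example, a minimal correlation sketch in R (illustrative only), using the built-in mtcars data set:
    data(mtcars)
    cor(mtcars$wt, mtcars$mpg)       # correlation between car weight and fuel efficiency (negative)
    cor.test(mtcars$wt, mtcars$mpg)  # the same relationship, with a significance test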
Customer Relationship Management (CRM) – the practice of
managing a company’s interactions with its customers,
including the related sales and business processes. Big data
will affect CRM strategies.
D
Dashboard – a graphical representation of the analyses
performed by algorithms, usually in the form of plots and
gauges.
Data – a quantitative or qualitative value. Common types of
data include sales figures, marketing research results, readings
from monitoring equipment, user actions on a website, market
growth projections, demographic information and customer
lists.
Data Access – the act or method of viewing or retrieving stored
data.
Data Aggregation Tools – methods for transforming scattered
data from numerous sources into a new, single source.
Data Analytics – the application of software to derive
information or meaning from data. The end result might be a
report, an indication of status or an action taken automatically
based on the information received.
Data Analyst – someone who analyzes, models, cleanses,
and/or processes data. Data analysts usually don’t perform
predictive analytics, and when they do, it’s usually through the
use of a simple statistical model.
Data Architecture and Design – the way enterprise data is
structured. The actual structure or design varies depending on
the eventual end result required. Data architecture has three
stages or processes: conceptual representation of business
entities, the logical representation of the relationships among
those entities and the physical construction of the system to
support the functionality.
Database – a digital collection of data and the structure in
which the data is organized (structured). The data is typically
entered into and accessed via a database management system
(DBMS).
Database Administrator (DBA) – a person who is responsible
for supporting and maintaining the integrity of the structure and
content of a database.
Database-as-a-Service (DaaS) – a database hosted in the
cloud and sold on a metered basis. Examples include Heroku
Postgres and Amazon Relational Database Service.
Database Management System (DBMS) – integrated software
for collecting, storing and providing access to data that is
practical to use even by non-specialists.
Data Center – a physical location that houses the servers for
storing data. Data centers might belong to a single organization
or sell their services to many organizations.
Data Cleansing – the process of reviewing and revising data in
order to delete duplicates, correct errors and provide
consistency.
Data Collection – any process that captures any type of data.
Data Custodian – a person responsible for the database
structure and the technical environment including the storage of
data.
Data-Directed Decision Making – using data to support
making crucial decisions.
Data Exhaust – the data that a person creates as a byproduct
of a common activity: for example, a cell call log or Web search
history.
Data Governance – a set of processes or rules that ensure the
integrity of the data and that data management best practices
are met.
Data Integration – the process of combining data from different
sources and presenting it in a single view.
Data Integrity – the measure of trust an organization has in the
accuracy, completeness, timeliness and validity of the data.
Data Management Association (DAMA) – a non-profit
international organization for technical and business
professionals “dedicated to advancing the concepts and
practices of information and data management.”
Data Management – according to the Data Management
Association, data management incorporates the following
practices needed to manage the full data lifecycle in an
enterprise:
data governance
data architecture, analysis and design
database management
data security management
data quality management
reference and master data management
data warehousing and business intelligence management
document, record and content management
metadata management
contact data management
Data Migration – the process of moving data between different
storage types or formats, or between different computer
systems.
Data Mining – the process of finding certain patterns or
information from data sets in an automated way. This is one
popular way to perform data exploration.
Data Modeling – the development of a graphical representation
of the structure of data, used either to communicate the data
needed for business processes between functional and
technical people, or to communicate to an application
development team how data will be stored and accessed.
Data Science – a recent term that has multiple definitions but is
generally accepted as a discipline that incorporates statistics,
data visualization, computer programming, data mining,
machine learning and database engineering to solve complex
problems.
Data Scientist – a practitioner of data science. Also known as
big data scientist.
Data Security – the practice of protecting data from destruction
or unauthorized access.
Data Set – a collection of data, usually in a structured form.
Data sets are represented as data frame objects in R.
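For instance, a tiny data set expressed as an R data frame (a made-up example, not from the book):
    ds <- data.frame(id        = 1:3,
                     age       = c(34, 28, 45),
                     purchased = c(TRUE, FALSE, TRUE))  # three observations, three variables
    str(ds)                                             # inspect the structure of the data set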
Data Structure – a specific way of storing and organizing data.
Data Visualization – a visual abstraction of data designed for
the purpose of deriving meaning or communicating information
more effectively.
Data Virtualization – a data integration approach that provides
a unified view of data from multiple sources (databases,
applications, file systems, websites, big data stores, etc.)
without physically moving the data, thereby improving data
insights.
Discriminant Analysis – a statistical analysis that takes
advantage of known groups or clusters in data to derive the
classification rule. It involves cataloguing the data as well as
distributing it into groups, classes or categories.
Distributed File System – a system that offers simplified,
highly available access to storing, analyzing and processing
data.
Distributed Processing System – a form of local area network
in which each user has a fully functional computer, but all users
can share data and application software. The data and software
are distributed among the linked computers, not stored in one
central computer.
Document Store Database – a document-oriented database
that is especially designed to store, manage and retrieve
documents, which are typically semi-structured data.
E
Enterprise Resource Planning (ERP) – a software system
that allows an organization to coordinate and manage all its
resources, information and business functions.
E-Science – traditionally defined as computationally intensive
science involving large data sets. More recently broadened to
include all aspects and types of research that are performed
digitally.
Event Analytics – a process that shows the series of steps
that led to an action.
Exploratory Analysis – finding patterns within data without
standard procedures or methods. It is a means of discovering
the data and finding the data set’s main characteristics. Usually
referred to as data exploration, it constitutes an important part
of the data science process.
Exabyte – approximately 1000 petabytes or 1 billion gigabytes.
Today, we create one exabyte of new information globally on a
daily basis.
Extract, Transform and Load (ETL) – a process for populating
data in a database and data warehouse by extracting the data
from various sources, transforming it to fit operational needs
and loading it into the database.
F
Failover – switching automatically to a different server or node
if one fails. This is a very useful property of a computer cluster
and keeps data analysis processes reliable.
Fault-Tolerant Design – a system designed to continue
working even if certain parts fail.
Federal Information Security Management Act (FISMA) – a
US federal law that requires all federal agencies to meet certain
standards of information security across their systems.
File Transfer Protocol (FTP) – a set of guidelines or standards
that establishes the format in which files can be transmitted
from one computer to another.
G
Gamification – using game elements in a non-game context.
This is a very useful way to create data, which is why it has
been called the friendly scout of big data.
Gigabyte – a measurement of the storage capacity of a
computer. One gigabyte represents more than 1 billion bytes.
Gigabyte may be abbreviated G, GB or Gig; however, GB is
clearer since G also stands for the metric prefix giga (meaning
1 billion).
Graph Database – databases that use graph structures (a
finite set of ordered pairs or certain entities), with edges,
properties and nodes for data storage. It provides index-free
adjacency, meaning every element is directly linked to its
neighboring element.
Grid Computing – connecting different computer systems from
various locations, often via a cloud, to reach a common goal.
H
Hadoop – an open-source framework that is built to enable the
process and storage of big data across a distributed file system.
Hadoop is currently the most widespread and most developed
big data platform available.
Hadoop Distributed File System (HDFS) – a distributed file
system designed to run on commodity hardware.
HBase – an open source, non-relational, distributed database
running in conjunction with Hadoop. It is particularly useful for
archiving purposes.
High-Performance-Computing (HPC) – using
supercomputers to solve highly complex and advanced
computing problems.
Hypertext – a technology that links text in one part of a
document with related text in another part of the document or in
other documents. A user can quickly find the related text by
clicking on the appropriate keyword, key phrase, icon or button.
Hypertext Transfer Protocol (HTTP) – the protocol used on
the World Wide Web that permits Web clients (Web browsers)
to communicate with Web servers. This protocol allows
programmers to embed hyperlinks in Web documents using
hypertext markup language (HTML).
I
Indexing – the ability of a program to accumulate a list of
words or phrases that appear in a document, along with their
corresponding page numbers, and to print or display the list in
alphabetical order.
Information Processing – the coordination of people,
equipment and procedures to handle the storage, retrieval,
distribution and communication of information. The term
information processing embraces the entire field of processing
words, figures, graphics, videos and voice input by electronic
means.
In-Database Analytics – the integration of data analytics into
the data warehouse.
Information Management – the practice of collecting,
managing and distributing information of all types: digital,
paper-based, structured and unstructured.
In-Memory Data Grid (IMDG) – the storage of data in memory,
across multiple servers, for the purpose of greater scalability
and faster access or analytics.
In-Memory Database – a database management system that
stores data in the main memory instead of on the disk, resulting
in very fast processing, storing and loading of the data.
Internet – a system that links existing computer networks into a
worldwide network. The Internet may be accessed by means of
commercial online services (such as America Online) and
Internet service providers (ISPs).
Internet of Things (IoT) – ordinary devices that are connected
to the Internet at any time and anywhere via sensors. IoT is
expected to contribute substantially to the growth of big data.
Internet Service Provider (ISP) – an organization that
provides access to the Internet for a fee. Companies like
America Online are more properly referred to as commercial
online services because they offer many other services in
addition to Internet access.
Intranet – a private network established by an organization for
the exclusive use of its employees. Firewalls prevent outsiders
from gaining access to an organization’s intranet.
J
Juridical Data Compliance – the need to comply with the laws
of the country where your data is stored. Relevant when you
use cloud solutions and when the data is stored in a different
country or continent.
K
Key Value Database – a database in which data is stored with a
primary key (a uniquely identifiable record), making it easy and
fast to look up. The data stored in a key-value database is
normally some kind of primitive of the programming language.
Kilobyte – a measurement of the storage capacity of a
computer. One kilobyte represents 1024 bytes. Kilobyte may be
abbreviated K or KB; however, KB is the clearer abbreviation,
since K also stands for the metric prefix kilo (meaning 1000).
L
Latency – a measure of time delay in a system.
Legacy System – an old system, technology or computer
system that is not supported any more.
Load Balancing – distributing workload across multiple
computers or servers in order to achieve optimal results and
utilization of the system.
Location Data (Geo-Location Data) – GPS data describing a
geographical location. Very useful for data visualization among
other things.
Log File – a file that a computer, network or application creates
automatically to record events that occur during operation (e.g.,
the time a file is accessed).
M
Machine Data – data created by machines via sensors or
algorithms.
Machine Learning (ML) – the field of computer science related
to the development and use of algorithms to enable machines
to learn from what they are doing and become better over time.
Although there is a large overlap between ML and artificial
intelligence, they are not the same. ML algorithms are an
integral part of data science.
MapReduce – a software framework for processing vast
amounts of data using parallelization.
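The idea can be sketched on a single machine with base R’s Map() and Reduce() functions; this toy word count is purely illustrative and is not how an actual MapReduce job is run on a cluster:
    chunks <- list("big data is big", "data science uses big data")           # two "splits" of input text
    mapped <- Map(function(chunk) table(strsplit(chunk, " ")[[1]]), chunks)   # map: word counts per chunk
    counts <- Reduce(function(a, b) {
      merged <- c(a, b)                      # concatenate the partial counts
      tapply(merged, names(merged), sum)     # add up the counts for each word
    }, mapped)
    counts                                   # total count for each word across all chunks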
Massively Parallel Processing (MPP) – using many different
processors (or computers) to perform certain computational
tasks at the same time.
Master Data Management (MDM) – management of core non-
transactional data that is critical to the operation of a business
to ensure consistency, quality and availability. Examples of
master data are customer or supplier data, product information,
employee data, etc.
Megabyte – a measurement of the storage capacity of a
computer. One megabyte represents more than 1 million bytes.
Megabyte may be abbreviated M or MB; however, MB is clearer
since M also stands for the metric prefix mega (meaning 1
million).
Memory – the part of a computer that stores information. Often
synonymous with Random Access Memory (RAM), the temporary
memory that allows information to be stored randomly and
accessed quickly and directly without the need to go through
intervening data.
Metadata – any data used to describe other data; for example,
a data file’s size or date of creation.
MongoDB – a popular open-source NoSQL database.
MPP Database – a database optimized to work in a massively
parallel processing environment.
Multi-Dimensional Database – a database optimized for
online analytical processing (OLAP) applications and for data
warehousing.
Multi-Threading – the act of breaking up an operation within a
single computer system into multiple threads for faster
execution. Multi-threading lets a single PC with a modern
CPU act like a small computer cluster by making use of all of
its CPU cores.
MultiValue Database – a type of NoSQL, multidimensional
database that understands 3-dimensional data directly. Its
records are primarily giant strings, which makes it well suited
to manipulating HTML and XML strings directly.
Memetic Algorithm – a special type of evolutionary algorithm
that combines a steady state genetic algorithm with local
search for real-valued parameter optimization.
N
Natural Language Processing (NLP) – a field of computer
science involved with interactions between computers and
human languages. NLP is widely used in text analytics and is a
popular subfield of data science.
Network Analysis – analyzing connections and the strength of
the ties between nodes in a network. Viewing relationships
among the nodes in terms of the network or graph theory.
NewSQL – a class of modern relational database systems that
aim to combine the scalability of NoSQL systems with the SQL
interface and transactional guarantees of traditional relational
databases. The term is even newer than NoSQL.
NoSQL – a class of database management system that does
not use the relational model. NoSQL is designed to handle
large data volumes that do not follow a fixed schema and is
ideally suited for very large data volumes that do not require
the relational model. It is sometimes referred to as “Not only
SQL” because such databases do not adhere to traditional
relational database structures, and they often relax strict
consistency in exchange for higher availability and horizontal
scaling.
Normalization – the process of transforming a numeric
variable so that its values are in the same range as other
normalized variables. This allows for easier comparisons and
more efficient ways of handling a set of variables.
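A minimal min-max normalization sketch in R (the values are illustrative, not from the book):
    x <- c(10, 25, 40, 55, 100)                  # raw values on an arbitrary scale
    x_norm <- (x - min(x)) / (max(x) - min(x))   # rescale to the 0-1 range
    round(x_norm, 3)                             # 0.000 0.167 0.333 0.500 1.000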
O
Object Database – databases that store data in the form of
objects as used by object-oriented programming. They are
different from relational or graph databases, and most of them
offer a query language that allows objects to be found with a
declarative programming approach.
Online Analytical Processing (OLAP) – the process of
analyzing multidimensional data using three operations:
consolidation (the aggregation of available data), drill-down (the
ability for users to see the underlying details) and slice and dice
(the ability for users to select subsets and view them from
different perspectives).
Online Transactional Processing (OLTP) – the process of
providing users with access to large amounts of transactional
data so that they can derive meaning from it.
Open Data Center Alliance (ODCA) – a consortium of global
IT organizations whose goal is to speed the migration to cloud
computing.
Open Source – a type of software code that has been made
freely available for download, modification and redistribution.
Operational Database – databases that record the regular
operations of an organization; they are generally very important
to a business. Organizations generally use online transaction
processing, which allows them to enter, collect and retrieve
specific information about the company.
Optimization Analysis – the algorithm-driven optimization of
products during their design cycle. It allows companies to
virtually design many different variations of a product and to
test each one against pre-set variables.
Ontology – a representation of knowledge as a set of
concepts within a domain and the relationships between those
concepts. Very useful when designing a database.
Outlier Detection – an outlier is an object that deviates
significantly from the general average within a dataset or a
combination of data. It is numerically distant from the rest of the
data and therefore indicates that something is going on that
requires additional analysis. Usually referred to as anomaly
detection.
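As a simple illustration (one of many possible approaches, not one prescribed by the book), outliers can be flagged in R with a z-score rule:
    set.seed(1)
    x <- c(rnorm(100), 12)          # 100 typical values plus one extreme value
    z <- (x - mean(x)) / sd(x)      # distance from the mean, in standard deviations
    x[abs(z) > 3]                   # values more than 3 standard deviations away are flagged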
P
Parallel Data Analysis – breaking up an analytical problem
into smaller components and running algorithms on each of
those components at the same time. Parallel data analysis can
occur within the same system or across multiple systems.
Parallel Method Invocation (PMI) – the ability to allow
programming code to call multiple functions in parallel.
Parallel Processing – the ability to execute multiple tasks at
the same time.
Parallel Query – a query that is executed over multiple system
threads for faster performance.
Pattern Recognition – identifying patterns in data via
algorithms to make predictions of new data coming from the
same source. Pattern recognition is also referred to as
supervised learning and constitutes a major part of machine
learning.
Performance Management – the process of monitoring
system or business performance against predefined goals to
identify areas that need attention.
Petabyte – 1024 terabytes or 1 million gigabytes. The CERN
Large Hadron Collider generates approximately 1 petabyte per
second.
Predictive Analysis (Predictive Analytics) – the most
valuable analysis within big data as it helps predict what
someone is likely to buy, visit or do as well as how someone will
behave in the (near) future. It uses a variety of different data
sets such as historical, transactional, social, or customer profile
data to identify risks and opportunities.
Predictive Modeling – the process of developing a model to
predict a trend or outcome.
Program – an established sequence of instructions that tells a
computer what to do. The term program means the same thing
as software.
Protocol – a set of standards that permits computers to
exchange information and communicate with each other.
Q
Quantified Self – a modern movement related to the use of
applications to track one’s every move during the day in order
to gain a better understanding of one’s behavior.
Query – asking for information to answer a certain question,
usually in a database context.
Query analysis – the process of analyzing a search query for
the purpose of optimizing it for the best possible result.
R
R – an open-source programming language and software
environment for statistical computing and graphics. The R
language is widely used among statisticians and data miners
for developing statistical software and data analysis. R’s
popularity has increased substantially in recent years.
Real Time – a descriptor for events, data streams or processes
that have an action performed on them as they occur.
Real-Time Data – data that is created, processed, stored,
analyzed and visualized within milliseconds of its creation.
Recommendation Engine (Recommender System) – an
algorithm that analyzes a user’s purchases and actions on an
e-commerce site and then uses that data to recommend
complementary products.
Record – a collection of all the information pertaining to a
particular subject.
Records Management – the process of managing an
organization’s records throughout their entire lifecycle from
creation to disposal.
Reference Data – data that describes an object and its
properties. The object may be physical or virtual.
Regression Analysis – a statistical technique for defining the
dependency between continuous variables. It assumes a one-
way causal effect from one variable to the response of another
variable.
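For example, a minimal linear regression sketch in R using the built-in cars data set (illustrative only):
    data(cars)
    model <- lm(dist ~ speed, data = cars)      # model stopping distance as a function of speed
    summary(model)                              # coefficients, R-squared and p-values
    predict(model, data.frame(speed = 21))      # predicted stopping distance at a speed of 21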
Report – the presentation of information derived from a query
against a dataset, usually in a predetermined format.
Risk Analysis – the application of statistical methods on one or
more datasets to determine the likely risk of a project, action or
decision.
Root-Cause Analysis – the process of determining the main
cause of an event or problem.
Routing Analysis – using many different variables to find the
optimal route for a certain means of transport in order to
decrease fuel costs and increase efficiency.
S
Scalability – the ability of a system or process to maintain
acceptable performance levels as workload or scope increases.
Schema – the structure that defines the organization of data in
a database system.
Semi-Structured Data – a form of data that does not conform
to a formal structure the way structured data does, but
contains tags or other markers to enforce a hierarchy of
records. Semi-structured data is commonly found in JSON
objects.
Server – a physical or virtual computer that serves requests for
a software application and delivers those requests over a
network.
Signal Analysis – the analysis of measurements of time-varying
or spatially varying physical quantities, often to assess the
performance of a product. Signal analysis is frequently used
with sensor data.
Similarity Searches – finding the object in a database that is
closest to a query, where the objects can be of any data type.
Simulation Analysis – a simulation is the imitation of the
operation of a real-world process or system. A simulation
analysis helps to ensure optimal product performance by taking
into account many different variables.
Smart Grid – the smart grid refers to the concept of adding
intelligence to the world’s electrical transmission systems with
the goal of optimizing energy efficiency. Enabling the smart grid
will rely heavily on collecting, analyzing and acting on large
volumes of data.
Software-as-a-Service (SaaS) – application software that is
used over the Web by a thin client or Web browser. Salesforce
is a well-known example of SaaS.
Solid-State Drive (SSD) – also called a solid-state disk; a
device that uses memory ICs to persistently store data.
Spatial Analysis – the process of analyzing spatial data such
as geographic or topological data to identify and understand
patterns and regularities within data distributed in geographic
space. This is usually performed in a special type of system
called a geographic information system (GIS).
Storm – an open source distributed computation system
designed for processing multiple data streams in real time.
Structured Data – data that is identifiable because it is
organized in a structure such as rows and columns. The data
resides in fixed fields within a record or file, or the data is
tagged correctly and can be accurately identified.
Structured Query Language (SQL) – a programming
language for managing and retrieving data from a relational
database. Standard SQL is not directly applicable in the big
data domain, although SQL-like query layers exist for big data
platforms.
T
Terabyte – approximately 1000 gigabytes. A terabyte is the
data volume of about 300 hours of high-definition video.
Text Analytics – the application of statistical, linguistic and
machine learning techniques on text-based sources to derive
meaning or insight.
Thread – a series of posted messages that represents an
ongoing discussion of a specific topic in a bulletin board
system, a newsgroup or a Web site.
Time Series Analysis – the process of analyzing well-defined
data obtained through repeated measurements over time. The
data has to be measured at successive points in time, spaced
at identical intervals.
Topological Data Analysis – focusing on the shape of
complex data and identifying clusters and any statistical
significance that is present within that data.
Transmission Control Protocol/Internet Protocol (TCP/IP) –
a collection of over 100 protocols that are used to connect
computers and networks.
Transactional Data – data that describes an event or
transaction that took place.
Transparency – operating in such a way that whatever is
taking place is open and apparent to whomever is interested.
U
Unstructured Data – data that is text heavy, in general, but
may also contain dates, numbers and facts.
V
Value – the benefits that organizations can reap from analysis
of big data.
Variability – one of the characteristics of big data, variability
means that the meaning of the data can change (and rapidly).
For example, in multiple tweets the same word can have totally
different meanings.
Variety – one of the major characteristics of big data. Data
today comes in many different formats: structured data, semi-
structured data, unstructured data and even complex structured
data.
Velocity – one of the major characteristics of big data. The
speed at which the data is created, stored, analyzed and
visualized.
Veracity – one of the major characteristics of big data, veracity
refers to the correctness of the data. Organizations need to
ensure that both the data and the analyses performed on it are
correct.
Visualization – visualizations are complex graphs that can
include many variables of data while still remaining
understandable and readable. With the right visualizations, raw
data can be put to use.
Volume – one of the major characteristics of big data. It refers
to the total quantity of data, beginning at terabytes and growing
higher over time.
W
Weather Data – an important open, public data source that can
provide organizations with a lot of insights when combined with
other sources.
X
XML Database – databases that allow data to be stored with its
markup tags. XML databases are often linked to document-
oriented databases. The data stored in an XML database can
be queried, exported and serialized into any format needed.
Y
Yottabyte – approximately 1000 zettabytes, or about 250 trillion
DVDs. The entire digital universe today is still only a small
fraction of a yottabyte, although it is estimated to roughly
double in size every 18 months.
Z
Zettabyte – approximately 1000 exabytes or 1 billion terabytes.
It is expected that by 2016, more than 1 zettabyte will cross our
networks globally each year.
Appendix 1
Useful Websites
www.kaggle.com [networking, data analysis competitions, job posts] – Probably the most popular website for data scientists as well as machine learning practitioners. If you haven’t bookmarked it yet, do so now!
www.linkedin.com [networking, job posts, online resume] – The most useful social medium out there; ideal for any professional in any stage of his career. Lots of groups to join and a great place to connect with other data scientists and learn.
www.datasciencecentral.com [articles, job posts, networking, news, background knowledge] – One of the most popular high-quality portals for data science. Ideal for both the newbie and the somewhat experienced data scientist. A great place to seek new innovations and learn stuff.
www.coursera.com [online learning, networking] – The most developed MOOC provider, founded by a couple of professors from Stanford University. It has several courses on data science topics, among others. The forums for each course are an excellent place to network.
https://datascience101.wordpress.com [news] – A cozy place for new people in the field to get updated about recent developments, useful courses, etc.
http://cran.r-project.org/ [R, news, software] – A great place to get more familiar with developments in R and download useful packages to run on it.
www.indeed.com [job posts, online resume] – The most widespread job-hunting site.
http://whatsthebigdata.com [news, background knowledge, articles, big data] – An interesting place to expand your understanding of big data and read up on new developments in the field.
http://www.bigdatauniversity.com [online learning, big data] – One of the best resources for technical know-how on big data technology. It is created and maintained by IBM.
http://www.r-project.org [R, news, software] – The developers’ website for one of the most popular data analysis open source platforms.
http://www.eclipse.org [Eclipse, news, software] – The developers’ website for the most popular open-source OOP integrated development environment (IDE).
http://hadoop.apache.org [big data, Hadoop, software] – The developers’ website for the most popular big data technology platform.
http://www.careerealism.com [job hunting] – A great resource for job-hunting tips, including presenting yourself and interviewing.
http://stackoverflow.com [technical questions] – A great resource for finding answers to various technical questions (e.g., for R).
http://coursetalk.org [online learning] – A portal for MOOC students from a variety of online learning portals, posting their views on the courses they are taking or have taken.
www.edX.org [online learning] – A great MOOC provider covering a variety of topics, including data science.
www.class-central.com [online learning] – A great resource for discovering the various MOOCs that are available from different sites, like Coursera, edX, etc.
www.technicspub.com [offline learning] – The best resource for technical books including, but not limited to, data science.
www.java.com [OOP, Java, software] – The developers’ website for one of the most popular OOP languages.
www.python.org [OOP, Python, software] – The developers’ website for one of the most popular OOP languages for data analysis.
www.gnu.org/software/octave [OOP, Octave, software] – The developers’ website for one of the most popular open source data analysis platforms. Very similar to Matlab.
www.mathworks.com/matlabcentral [OOP, Matlab, software] – The developers’ website for one of the most popular proprietary data analysis platforms.
www.tableausoftware.com [data visualization, software] – The developers’ website for one of the most popular data visualization pieces of software (proprietary).
www-01.ibm.com/software/data/infosphere/biginsights [big data, software] – The developers’ website for one of the most promising ecosystems for handling big data (based on Hadoop).
http://git-scm.com [version control, software] – The developers’ website for one of the most popular version control programs.
www.oracle.com [database management, software] – The developers’ website for one of the most popular database management systems (proprietary).
www.datascience201.com?kid=21QD6 [online learning] – Interesting resource for learning new things in data science; aimed at novices.
www.udacity.com [online learning] – A great MOOC provider covering a variety of topics, including data science.
www.cs.toronto.edu/~hinton [inspirational] – Homepage of one of the most famous data scientists in the world.
www.meetup.com [networking, offline learning] – One of the best social media sites, based on self-organized groups of people getting together for a variety of reasons. It includes professional groups for networking and educational purposes.
Appendix 2
Relevant Articles
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist [data scientist role] – Article about the popularity of data scientists over the past few years, the technical skills involved and other relevant information about the role.
http://datacommunitydc.org/blog/2013/01/the-rise-of-data-products [data products] – Article about the recent popularity of data products.
http://www.cooldailyinfographics.com/post/describing-how-different-industries-have-capitalized-big-data [infographic, big data] – Infographic about big data’s role across the various industries today.
http://gigaom.com/2011/09/30/big-data-equals-big-opportunities-for-businesses-infographic [infographic, big data] – Another infographic about big data’s role in various industries.
http://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science [data science history] – Article providing the major milestones of data science from its very early days until today.
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science [data science history] – Another article about the history of data science.
http://www.verious.com/tutorial/big-data-storage-mediums-data-structures [technical info, big data] – Useful article about data storage and databases for storing large datasets.
http://howtojboss.com/2013/02/13/big-data-storage-mediums-data-structures [technical info, big data] – Useful article about data storage and databases.
http://www.techopedia.com/definition/28789/data-exploration [data science process, data exploration] – Useful article (and resource, in general) about data exploration and its role in the data science process.
http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop [technical info, big data tech] – Article about the most well-known alternatives to Hadoop for handling big data.
http://www.enterpriseappstoday.com/data-management/4-hot-open-source-big-data-projects.html [technical info, big data tech] – Another article about Hadoop alternatives.
http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html [technical info, big data tech] – Overview of the Hadoop ecosystem.
https://apandre.wordpress.com/tools/comparison/ [data science process, data visualization] – Article comparing various data visualization tools that are popular today.
http://www.mlplatform.nl/what-is-machine-learning [data science history, machine learning] – Brief description of some of the main milestones of machine learning.
http://sge.wonderville.ca/machinelearning/history/history.html [data science history, machine learning] – More extensive description of machine learning, covering all of its milestones.
http://cran.r-project.org/web/views/MachineLearning.html [machine learning, R packages] – Full list of all machine learning related libraries (packages) for R.
http://www.datasciencecentral.com/profiles/blogs/data-scientist-core-skills [data scientist skills] – Article on the skills required for a data scientist position today.
http://data-informed.com/glossary-of-big-data-terms [big data, glossary] – One of the sources of the information in this book’s glossary.
www.bigdata-startups.com/abc-big-data-glossary-terminology [big data, glossary] – Another one of the sources of the information in this book’s glossary.
http://www.mhhe.com/business/buscom/gregg/docs/appd.pdf [glossary] – Another one of the sources of the information in this book’s glossary.
Appendix 3
Offline Resources
Hal Daumé III, A Course in Machine Learning (2012) [machine learning, ebook] – Decent overview of the various machine learning techniques, targeted at beginners in the field. Still in draft format.
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms (2013) [data analysis, ebook] – Very good book on data analysis, focusing on data mining methods. Still in draft format.
Jeffrey Stanton, An Introduction to Data Science (2013) [R, ebook] – Contrary to what the title suggests, this is a book about R for data science applications, targeted at absolute beginners.
John Verzani, Getting Started with RStudio (2011) [RStudio, ebook] – Good guide for the RStudio IDE of the R platform.
Bill Franks, Taming the Big Data Tidal Wave (2012) [big data] – Business-oriented book on the value of big data in the world today.
Pete Warden, Big Data Glossary (2011) [big data] – Much more than a glossary of big data terms, this book gives you an overview of big data technologies and how they are used.
O’Reilly Media, Big Data Now: 2012 Edition (2012) [big data, case study] – Another book about big data technologies and their use. Includes a good case study for healthcare.
Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets (2013) [big data, data analysis, data mining] – Good reference book for data analysis techniques for very large datasets.
McKinsey Global Institute, Big Data: The Next Frontier of Innovation, Competition and Productivity (2011) [big data, ebook, summary] – Summary of the main points of the big data movement and how it affects the business world.
Peter Harrington, Machine Learning in Action (2012) [machine learning] – Hands-on guide for various machine learning methods for data analysis using Python.
Derek Rowntree, Statistics Without Tears: An Introduction for Non-Mathematicians (2000) [statistics] – Good introduction to main statistical concepts for beginners. Excellent examples and very application-oriented.
Nate Silver, The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t (2012) [statistics, data analysis] – Great inspirational book for data analysis enthusiasts. The author describes several case studies and how several data analysis approaches failed to yield any useful results, while others succeeded.
Joseph Adler, R in a Nutshell, 2nd Edition (2012) [R] – A great reference book on R, but not suitable for learning R as it is targeted at people already familiar with the platform.
Paul Zikopoulos et al., Harness the Power of Big Data (2013) [big data, BigInsights] – An interesting book to use as a reference for big data terminology and the various frameworks that it entails. Quite heavy on the promotion of the particular big data platform developed by IBM, however.
Richard Duda et al., Pattern Classification (2nd edition) [machine learning, Matlab] – Probably the best reference book ever written on this subject.
Index
(Note: bold indicates definition of term)
adaptability, 44, 51, 193, 199
advanced text analytics, 24
Advanced Text Analytics, 29
aggregation, 237
agile, 28, 114, 145
ahaz, 126
AI. See Artificial Intelligence
algorithm, 237
Alibaba, 78
Amazon, 4, 17, 241
American Mathematical Society, 213
American Statistical Association, 213
AMS. See American Mathematical Society
analytics, 237
analytics platform, 237
Analyzing the Analyzers, 31
anomaly detection, 120, 237, 250
application, 237
AQL, 54, 159
Artificial Intelligence, 26, 50, 115, 119, 238
Artificial Neural Network, 125
arules, 127
ASA. See American Statistical Association
association rules, 127
BashReduce, 80, 96
Bayesian Methods, 127
BayesTree, 127
behavioral analytics, 238
Berkeley, 80
big data, 2, 238
characteristics of, 2–4, 10–12
example of, 9
mastering of, 49
big data ecosystem, 134
big data program, 107
big data scientist, 238
big data startup, 238
big data system, 75, 92, 96
big data technology, 31, 77, 78, 94, 104, 153, 169, 173, 206, 210, 238,
258
BigInsights, 59, 75, 92, 93, 96, 97, 157, 162, 264
bigrf, 126
BigSQL, 54
Birst, 91
bolts, 78
Bostock, Michael, 82
bubble chart, 82
business intelligence, 238
Business Objects, 91, 97
byte, 238
C#, 54, 85, 150, 153, 162, 204
C++, 54, 75, 81, 84, 85, 96, 150, 153, 158, 162, 204, 229, 232
C50, 126
Calc, 95
canvas.net, 103
caret, 128
Carnegie Mellon, 81
Cassandra, 24, 26, 29
central processing unit, 238
chord diagram, 82
CIA, 61
civil engineer, 27, 47
classification analysis, 239
Clean, 85, 97
Clojure, 79, 85, 97
cloud, 239
cloud computing, 239
Cloudera, 75, 83, 93
clustering analysis, 239
Codeacademy, 103
communication, 40, 48, 73, 165, 188, 196, 197, 213, 214, 231, 246
comparative analysis, 239
complex structured data, 239
computer generated data, 239
computer science, 55, 104, 113
Computing for Data Analysis, 105, 130
concurrency, 239
consumer modeling, 58
Conviva, 80
correlation analysis, 239
CouchDB, 26
Coursera, 102, 104, 105, 106, 112, 204, 220, 259
Coursetalk, 106
CPU. See central processing unit
creating a data product, 133, 142
creativity, 1, 15, 39, 41, 43, 50, 109, 111, 112, 134, 189, 193, 198,
199, 212, 226, 236
CRM. See Customer Relationship Management
Cubist, 125
curiosity, 37, 38, 214
Customer Relationship Management, 239
D3.js. See Data Driven Documents
DaaS. See Database-as-a-Service
DAMA. See Data Management Association
dashboard, 240
data, 240
data access, 240
data aggregation tools, 240
data analyst, 240
data analytics, 240
data businesspeople, 33
data center, 241
data cleansing, 241
data collection, 241
data creative, 33
data custodian, 241
data developer, 32
data discovery, 133, 140, 196, 239
data driven documents, 82
data exhaust, 241
data exploration, 133, 138, 139, 145, 196, 205, 239, 242, 244, 262
data governance, 241
data integration, 241
data integrity, 241
data management, 242
Data Management Association, 241
data migration, 242
data mining, 20, 21, 56, 139, 164, 205, 206, 232, 242, 263
data modeling, 242
data preparation, 133, 134, 135, 138, 139
data processing, 5, 11, 19, 55, 65, 76, 79, 81, 82, 114, 115, 142, 195
data representation, 133
data researcher, 32
data science, 242
drivers for, 15
history of, 17, 19
Data Science 201 blog, 108
data scientist, 2, 5, 17, 242
aspects of a, 2
data analyst vs., 5
from business intelligence analyst to, 168
from data architect to, 167
from data modeler to, 167
from database administrator to, 165
from OO programmer to, 151
from software prototype developer to, 153
traits of a, 37
data security, 242
data set, 242
data structure, 243
data virtualization, 243
data visualization, 5, 76, 82, 89, 90, 91, 92, 97, 141, 152, 165, 168,
169, 172, 196, 215, 242, 243, 247, 259, 262
database, 240
Database Administrator, 240
Database Management System, 241
Database-as-a-Service, 241
data-directed decision making, 241
Datalog, 81
DataScienceCentral, 181, 187, 188, 233
datascientists.com, 23
datascientists.net, 23
dataset, 134
DBA. See Database Administrator
DBMS. See Database Management System
Deep Belief, 115, 117
Deep Learning, 115, 117
dendrogram, 82
Disco, 80, 96
discriminant analysis, 243
distributed file system, 243
distributed processing system, 243
distributions, 38, 124, 135, 136
document store database, 243
Dremel, 82
Drew Conway’s Venn diagram, 22
Drill, 82, 96
DuckDuckGo, 115
e1071, 127, 128
earth, 7, 34, 38, 48, 126, 220
ECL, 24, 26, 29, 81
Eclipse, 75, 84, 95, 97, 158, 258
edX, 103, 258, 259
Emcien, 75, 95, 97
encapsulation, 85
Enterprise Resource Planning, 243
Erlang, 80, 85
ERP. See Enterprise Resource Planning
e-science, 243
ETL. See Extract, Transform and Load
event analytics, 243
Evolutionary Computation, 117
exabyte, 244
Excel, 75, 95, 97
exploratory analysis, 244
Extract, Transform and Load, 244
Facebook, 3, 181
failover, 244
Fancy, 79
fault-tolerant design, 244
FBI, 61
Federal Information Security Management Act, 244
File Transfer Protocol, 244
FISMA. See Federal Information Security Management Act
Flare, 166, 173
flexibility, 44, 51, 193, 199
FlowingData, 17, 22
Flume, 77
Foreman, John, 5
four Vs of big data, 2
frbs, 128
Frey, Erik, 80
FTP. See File Transfer Protocol
fuzzy logic, 115, 128, 209
Fuzzy Rule-based Systems, 128
GAMBoost, 127
gamification, 244
gbm, 126
Genetic Programming, 119
gigabyte, 244
GIT, 75, 93, 94, 96, 97
Google, 21, 77, 188, 220
graph analysis, 95, 159
graph database, 245
GraphLab, 81, 160
grid computing, 245
Groupon, 78
Hadoop, 24, 25, 26, 29, 60, 76, 77, 78, 79, 96, 150, 217, 245, 258,
259, 262
Hadoop Distributed File System, 24, 29, 77, 245
Harris, Harlan, 31
Harvard Business Review, 18, 23
Haskell, 85, 97
HBase, 24, 26, 29, 77, 83, 96, 150, 165, 167, 169, 173, 245
HCatalog, 76, 96
HDFS, 26, 29, 77, 78, 83, 96, 134, 150, 169, 212, 245, See Hadoop
Distributed File System
High-Performance-Computing, 245
Hive, 54, 60, 77, 96, 150, 159, 162, 165, 167, 169, 172, 173, 206, 229
HPC. See High-Performance-Computing
HPCC Systems, 81
HTTP. See Hypertext Transfer Protocol
hypertext, 245
Hypertext Transfer Protocol, 245
IFCS. See International Federation of Classification Societies
Impala, 83, 96
in-Database analytics, 246
Indeed.com, 6, 233
indexing, 245
inference statistics, 120, 124
information management, 246
information processing, 246
inheritance, 85
In-Memory Data Grid, 246
in-memory database, 246
insight, deliverance, and visualization, 133
integrated development environment, 84
International Federation of Classification Societies, 20
internet, 246
Internet of Things, 246
Internet Service Provider, 246
intranet, 246
inZite, 91, 97
IoT. See Internet of Things
ipred, 126
ISP. See Internet Service Provider
Java, 54, 59, 75, 79, 82, 83, 84, 96, 150, 153, 158, 162, 194, 196, 204,
213, 226, 229, 230, 259
JSON, 134
Julia, 82, 96
juridical data compliance, 247
Kafka, 82
Kaggle, 64, 66, 116, 153, 172, 181, 187, 193, 222, 223, 226, 233
Kernel Method, 127
kernlab, 127
key value database, 247
Khan Academy, 103
kilobyte, 247
Knime, 59, 93
Koller, Daphne, 102
lars, 126
lasso2, 126
last.fm, 80
latency, 247
learning from data, 133, 141
legacy system, 247
LinkedIn, 23, 82, 109, 118, 142, 143, 181, 188, 203, 208, 233
load balancing, 247
location data, 247
log file, 247
Log Structured Merge Tree, 26
LogicReg, 126
machine data, 247
machine learning, 71, 113, 247
Mahout, 78, 96, 212
MapQuest, 143
MapReduce, 11, 24, 25, 29, 77, 78, 79, 80, 81, 82, 83, 96, 97, 104,
134, 150, 156, 157, 159, 161, 163, 168, 169, 212, 248
maptree, 126
Mason, Hilary, 142
Massively Parallel Processing, 248
Master Data Management, 248
mathematics, 55, 159
Matlab, 57, 58, 60, 75, 82, 88, 97, 105, 123, 150, 152, 162, 166, 168,
169, 173, 193, 196, 197, 206, 229, 230, 232, 259, 264
mboost, 127
MDG. See In-Memory Data Grid
MDM. See Master Data Management
Meetup, 65, 181, 182, 183, 184
megabyte, 248
memetic algorithm, 249
memory, 248
metadata, 248
mixed/generic data scientist, 34
ML. See machine learning
MongoDB, 24, 26, 29, 77, 104, 248
MOOCs, 99, 102, 103, 104, 105, 106, 112, 259
MPP. See Massively Parallel Processing
MPP database, 248
multi-dimensional database, 248
multi-threading, 248
MultiValue database, 249
Murphy, Sean, 31
NASA, 61
Natural Language Processing, 209, 232, 249
Naur, Peter, 19
network analysis, 249
New Technology File System, 24
NewSQL, 249
Ng, Andrew, 102, 129
NLP. See Natural Language Processing
nnet, 125
node-link tree, 82
Nokia, 80
normalization, 135, 136, 147, 249
NoSQL, 249
NovoED, 104
NTFS. See New Technology File System
object database, 249
object oriented programming, 76
object-oriented, 54, 75, 83, 151, 226, 249
OCaml, 85, 97
Octave, 88, 97, 105, 162, 169, 259
ODCA. See Open Data Center Alliance
OLAP. See Online Analytical Processing
OLTP. See Online Transactional Processing
Online Analytical Processing, 250
Online Transactional Processing, 250
ontology, 250
OO. See object-oriented
OOP. See object oriented programming
Oozie, 76, 96
Open Data Center Alliance, 250
Open Learning Initiative, 103
open source, 250
Open Yale Courses, 103
Open2Study, 104
openHPI, 103
OpenLearn, 103
operational database, 250
optimization analysis, 250
Optimization using Genetic Algorithms, 127
Oracle, 75, 94, 96, 97
outlier, 136
outlier detection, 250
Outlook, 95, 97
parallel computing, 2, 11, 15, 18, 24, 71, 77, 126, 143, 162, 163
parallel data analysis, 251
Parallel Method Invocation, 251
parallel processing, 251
parallel query, 251
party, 126
pattern recognition, 113, 115, 116, 154, 251
penalized, 126
performance management, 251
Perl, 54, 59, 150
petabyte, 251
Pig, 24, 26, 29, 77, 81, 96, 159, 162
PMI. See Parallel Method Invocation
polymorphism, 85
predictive analysis, 251
predictive modeling, 251
Prism, 91, 97
problem solving, 43
program, 251
protocol, 252
Python, 54, 75, 79, 80, 83, 85, 87, 96, 105, 129, 140, 150, 158, 161,
259
Qlikview, 91, 97
quantified self, 252
query, 252
query analysis, 252
Quora, 37
R, 57, 252
randomForest, 126
rdetools, 127
real time, 252
real-time data, 252
recommendation engine, 252
Recommender Systems, 121
record, 252
records management, 252
Recursive Partitioning, 125
reference data, 252
regression analysis, 253
relational database, 58
report, 253
Research Center for Dataology and Data Science, 21
resume, 149, 156, 185, 192, 198, 199, 214, 233, 257
rgenoud, 127
rgp, 127
risk analysis, 253
Rmalschains, 127
ROCR, 128
root-cause analysis, 253
routing analysis, 253
rpart, 125
RSNNS, 125
Ruby, 75, 79, 84, 96
RWeka, 125
SaaS. See Software-as-a-Service
SAS, 57, 60, 75, 89, 97, 106, 150, 156, 169, 227, 229
Scala, 80, 85, 86, 97, 162, 222
scalability, 253
schema, 253
Sector, 81
semi-structured data, 3, 253
sentiment, 25
server, 253
signal analysis, 253
signal-to-noise ratio. See veracity
similarity search, 253
simulation analysis, 253
smart grid, 254
social media, 25, 137, 178, 179, 181, 182, 184, 190, 192, 260
Software-as-a-Service, 254
Solid-State Drive, 254
Spark, 75, 80, 96, 156
spatial analysis, 254
Sphere, 81
Spotfire, 90, 97
spouts, 78
SPSS, 57, 60, 75, 88, 97, 150, 156, 169
SQL, 26, 54, 58, 77, 81, 94, 95, 105, 150, 159, 163, 165, 169, 173,
206, 227, 229, 230, 232, 249, 254
Sqoop, 77
SSD. See Solid-State Drive
Stanford University, 102, 104, 220, 223, 257
Stata, 57, 60, 75, 89
Statistical A.I., 116
statistics, 55
Statistics One, 105, 130
Storm, 75, 78, 79, 96, 107, 254
structured data, 3, 254
Support Vector Machine, 127
SVM. See Support Vector Machine
svmpath, 128
systematic work, 39, 40, 51
systems engineering, 55
Tableau, 89, 90, 97, 166, 173
TCP/IP. See Transmission Control Protocol/Internet Protocol
teamwork, 44, 51
Techopedia, 138, 147
terabyte, 254
text analytics, 254
tgp, 127
thread, 255
time series analysis, 255
topological data analysis, 255
transactional data, 255
Transmission Control Protocol/Internet Protocol, 255
transparency, 255
Tukey, John W., 19
Twitter, 78, 181, 188
Udacity, 103
unstructured data, 3, 255
Vaisman, Marck, 31
value, 255
variability, 255
Varian, Hal, 21, 23
variety, 11, 18, 255
varSelRFandBoruta, 126
vectorization, 86, 88, 151, 166, 168, 171
velocity, 3, 11, 18, 256
veracity, 3, 12, 18, 256
visualization, 58, 89, 144, 145, 148, 188, 256
volume, 10, 11, 18, 19, 256
weather data, 256
Web Intelligence and Big Data, 104
web statistics, 12
Xiong, Yun, 21
XML database, 256
Yahoo!, 188
Yau, Nathan, 17, 22
yottabytes, 256
Zahavi, Jacob, 20
zettabyte, 256
Zhu, Yangyong, 21
Zookeeper, 78, 96