Note that the benefit is not always directly related to the bottom
line, but it is definitely of significant business value. For
example, by employing big data technologies in healthcare, physicians can use historical data to gain a clearer understanding of their patients’ issues, yielding more accurate diagnoses and better care overall. This can eventually result in greater efficiency in the medical system, translating into lower costs through the intelligent use of the medical information derived from that data.
Another example comes from customer care, where big data can help companies learn from bad customer experiences. By making effective use of big data technologies, companies can gain a better understanding of what their customers like and don’t like in near real-time. This can help them adjust their strategies for dealing with these customers and give them insight into how to improve their services in the future.
Note that there are many other industries that have the potential to gain from big data but, given their current state, it is not yet a worthwhile option for them. For example, the art industry is still not big on big data, since the data involved in this field is limited to descriptions of artwork and, in some cases, digitized forms of these works of art. However, this may change in the future, depending on how galleries and artists embrace technology. For example, if a certain gallery used sensors to monitor the number of people who view a given painting and combined those readings with other data (e.g., the number of people who bought tickets to the various exhibitions that hosted that painting), it could gradually build a large database containing the sensor readings, the ticket sales, and even the comments visitors leave on the gallery’s blog about the various paintings. All of this could potentially yield useful information about which pieces of art are more popular (and by how much), as well as what the optimum ticket prices should be for the gallery’s exhibitions throughout the year.
All this is great, but how is it of any real use to you? Well,
higher profit margins and the potential to significantly boost
productivity are not going to happen on their own. It is naïve to
think that just installing a big data package and assigning it to
an employee (even if they are a skilled employee) could result
in measurable gains. In order to take advantage of big data, a
company needs to hire qualified people who can undertake the
task of turning this seemingly chaotic bundle of data into useful
(actionable) information. This is the problem that all data
scientists are asked to solve and one of the driving forces of all
developments in the field that came to be known as data
science.
Actually, some people include an additional two Vs, variability and visibility,
which refer to the fact that Big Data changes over time and is hardly
visible to users.
One of them, created by Berkeley, costs around $60,000, which is
significantly more than the high-priced MBAs you see elsewhere. This
is a clear indication that people in the academic world as well as in the
industry are taking data science quite seriously.
Long-lived Digital Data Collections: Enabling Research and Education in the
21st Century, available at http://www.nsf.gov/pubs/2005/nsb0540
The article is still available online at the time of this writing. You can access
it at http://flowingdata.com/2009/06/04/rise-of-the-data-scientist
Davenport, Thomas H., and D.J. Patil. “Data Scientist: The Sexiest Job of
the 21st Century.” Harvard Business Review, October 2012.
Chapter 2
Importance of Data Science
MapReduce
Hadoop Distributed File System (HDFS)
Advanced Text Analytics
Large scale data programming languages (e.g., Pig, R, ECL, etc.)
Alternative database structures (e.g., HBase, Cassandra, MongoDB, etc.)
Just as there are no two snowflakes that are exactly the same,
there are also no two data scientists who have identical skill-
sets or identical roles. The big data world has a wide variety of
problems, causing some natural differentiation in the specific
roles that a data scientist may undertake. In addition, the
profession has not been properly defined yet, so depending on
various aspects of one’s background, such as education, the
data scientist role can be further differentiated. Based on some
research that was done on the topic by a group of scientists
(Harlan Harris, Sean Murphy, and Marck Vaisman, who recently
published the book Analyzing the Analyzers), there are four
types of data scientists: data developers, data researchers,
data creatives, and data businesspeople. Often encountered
among the most experienced professionals of the field is a fifth
type, a mixed/generic combination of these. While there is a
certain overlap among all of these categories (e.g., they are all
familiar with data analysis methodologies, big data technology,
and the data science process), they are generally quite different
from one another in several ways. Let’s examine each one of
them in more detail.
Data developers
Data researchers
Data creatives
Data businesspeople
Mixed/generic
4.1 Traits
A data scientist has a variety of professional characteristics and
traits that usually reflect the kind of work he specializes in, so
this list is not set in stone and is more of a guideline to
understand this role better. First and foremost, a data scientist
has a healthy curiosity about the things he observes, such as
potential patterns or relationships between two attributes or
features, unusual distributions, etc. If you want to be a data
scientist worth the money you earn, you need to have an
inquiring mind.
This does not mean that you need to be curious about
everything and get lost in perpetual random quests for answers.
Curiosity has to be accompanied by the discipline to focus on
down-to-earth, long-term interests that are more grounded than
a fleeting curiosity, which can be impulsive and superficial. A
data scientist is interested in the phenomena he observes in
the data he deals with, wanting to get to the bottom of them. A
statistical analysis of what’s there may be a good first step for
him, but he is not satisfied until he has a good answer for the reason behind these phenomena, the root cause underlying the statistical metrics he calculates. This allows him to explain the
root cause to other people in the company in the form of a
story.
4.3 Thinking
The data scientist’s way of thinking is the most important
attribute to keep in mind since it often distinguishes him from
other types of professionals. In general, a data scientist thinks
in a combinatorial, non-linear way. His thinking needs to
combine both traditional and lateral thinking and be versatile in
employing either pattern when dealing with the challenges that
arise in his work.
His thinking is creative when it comes to designing and
implementing his models or investigating which approach
should be used for tackling a particular problem. His thinking is
not bound by unnecessary restrictions when creating or
updating the algorithms he decides to use for his data analysis.
In that sense, his thinking often resembles that of an artist, a
designer and an architect. He does not hesitate to experiment
with different approaches and methodologies and is poised to
try out different ways to visualize the available data insightfully.
Colors and shapes are his tools and can be as applicable as
numbers in expressing the information that is waiting to be
discovered. In a way, his thinking is very similar to that of the
explorer who sets out to find new lands, but his realm is the
vast seas of data in the cyberspace universe.
A data scientist’s thinking is also grounded and practical,
especially when it comes to building something with limited
resources in a constrained timeframe. In that sense, it is similar
to the thinking of a civil engineer who opts to make the most of
the available space and budget without dwelling much on fancy
designs. Just like a civil engineer, a data scientist does not
neglect the given requirements and tailors his creative
approach to the restrictions of the task at hand. Perhaps he
could derive ten or fifteen different metrics from a given dataset
to monitor the evolution of a given variable, but he only needs
four or five of them. And from the dozens of beautiful graphs he
could create to depict that dataset over time, he picks only a
couple that summarize it most effectively. A data scientist is
also an engineer of sorts and always thinks and behaves in a
pragmatic and down-to-earth manner.
The data scientist’s thinking is also self-reflective and, in a way,
meta-cognitive. He investigates different ways of thinking about
things and evaluates his current thinking processes. In
essence, a data scientist should be aware of how his mind
works and, therefore, be willing to admit to gaps in knowledge
(and do something about them). He continually looks for flaws
in his own methods and takes the necessary steps to fix them.
He is proactive and takes responsibility for how his mind
functions and the inputs it uses. He is not afraid to say that he
doesn’t know something and makes every effort to acquire the
relevant resources to help him understand it sufficiently and
quickly. This allows him to be a better team member and greatly
facilitates communication with others.
Most importantly, the mind of the data scientist evolves over
time. Modern neuroscience confirms the brain’s life-long ability
to change and create new connections within itself. The
thinking of the data scientist today is not the same as it was last
year, and it is not going to be the same next year. His mind
embraces change and uses it to upgrade itself through new
experiences, new knowledge and new know-how. In some
professions, it may be sufficient to have more or less static
thinking, but data science is not one of them. The data scientist
is similar to entrepreneurs, managers and inventors, continuously learning new things and adapting his thinking to the ever-changing circumstances of our fast-paced world.
Of course, the thinking of a data scientist is not limited to the
above meta-descriptions, and a book subchapter may not be
capable of doing it justice. The above guidelines do, however,
pinpoint some of its main aspects and hopefully provide
incentive for looking into it in greater depth through a conscious
evaluation of your thinking as you learn more about data
science in general.
Fig. 4.3 Thinking is an important aspect of the data
scientist’s mindset.
4.4 Ambitions
It seems a bit unconventional for a book like this to talk about a
professional’s ambitions as this is something that is very
personal and somewhat relative. However, there are certain
aspirations that are more or less common to data scientists;
understanding them may provide useful insight into his mindset.
A data scientist aspires to master big data in its many forms.
Being able to deal with a particular data set in this domain is
great, but often not enough. Someone who cares for data
science finds ways, often through interaction with other
professionals in this field, to be on top of the data that is out
there, meaning that he comprehends fully what each data type
can offer to an organization, what useful information he can
potentially derive from it and what costs acquiring each data
type entails. This stems from the dream of continuous
improvement, which is quite feasible in fields like this where
more and more tools become available as new data analysis
methods are developed all the time.
Data scientists also constantly want to learn new things. This
wish ties quite well with the previous ambition of mastering big
data since learning, especially when related to diverse things
that include the realm of big data, has been proven to aid in the
development of creativity and mental agility. These are
essential aspects of the role of the data scientist, and
cultivating them makes perfect sense. A data scientist’s
interests are not limited to the data science techniques that he
may use in his everyday work. He is also interested in new
developments in artificial intelligence, distributed computing,
information security, new programming languages and machine
learning, among other fields.
Fig. 4.4 A data scientist is not without ambitions.
Curiosity
Experimentation
Creativity and Systematic Work
Communication
Model Building
Planning
Problem Solving
Learning Fast
Adaptability
Teamwork
Flexibility
Research
Attention to Detail
Reporting
Recently, the author came across a post on Quora (a forum for geeks) where the poster listed a series of 10 steps you need to follow in order to become a data scientist. Most of them focused on specific skills, the majority of which were of questionable relevance. This clearly illustrates the limited understanding many people have about what being a data scientist involves and how this misinformation propagates.
Chapter 5
Technical Qualifications
Robust
Popular in the industry
Scalable, especially when it comes to large data sets
Java
Python
C++ / C#
Perl
Like many other job openings, data scientist job ads usually
specify a requirement of at least two years of experience in a
data-related endeavor. Although this is a very ambiguous
requirement (someone can be a master data scientist and still
not have enough experience for a particular position), it is
definitely worth looking into further. In this chapter, we will examine in more detail the whys and hows of experience in this intriguing field.
http://archive.ics.uci.edu/ml
www.kaggle.com
http://www.datasciencecentral.com/group/data-science-apprenticeship
Chapter 7
Networking
Parallel to all these systems, there are several projects that can facilitate the work undertaken by Hadoop, working in a complementary way, so if you are going to learn Hadoop, you may want to check them out once you’ve got all the basics down. The most well-known of these projects are the following:
All of these are free and easy to learn via free tutorials (the IDE
of the last one, Visual Studio, is proprietary software, however).
Also, they all share some similarities, so if you are familiar with
the basic OOP concepts, such as encapsulation, inheritance
and polymorphism, you should be able to handle any one of
them. Note that all of these programming languages follow the imperative paradigm (in contrast with the declarative/functional paradigm that is gradually becoming more popular). The statements used in this type of programming are basically commands telling the computer which actions to take. Declarative/functional programming, on the other hand, focuses more on describing the desired end result without spelling out the actions that need to be taken.
Although at the time of this writing, OOP languages are the
norm when it comes to professional programming, there is
currently a trend towards functional languages (e.g., Haskell,
Clojure, ML, Scala, Erlang, OCaml, Clean, etc.). These
languages have a completely different philosophy and are
focused on the evaluation of functional expressions rather than
the use of variables or the execution of commands in achieving
their tasks.
The big plus of functional languages is that they scale easily (which is great when it comes to big data) and tend to be less error-prone, since they don’t rely on a global workspace. Still, they are somewhat slower for most data science applications than their OOP counterparts, although some of them (e.g., OCaml and Clean) can be as fast as C when it comes to numeric computations. If this trend continues in the years to come, you may want to look into adding one of these languages to your skill-set as well, just to be safe. Note that there can be an overlap between functional languages and traditional OOP languages, such as those described previously. For example, Scala is a functional OOP language, one that’s probably worth looking into.
Note that all of these are proprietary software, so they may not
ever be as popular as R or attract as large user communities. If
you are familiar with statistics and understand programming,
they shouldn’t be very difficult for you to learn; with Matlab, you
don’t need to be familiar with statistics at all in order to use it.
We will revisit R in subchapter 10.5, where we will examine how
this software is used in a machine learning framework.
Fig. 8.9 The GIT version control program. Not the most
intuitive program available, but very rich in terms of
functionality and quite efficient in its job.
Note that there are several GUI add-ons for GIT available for all
major operating systems. One that is particularly good for the
Windows OS is GIT Extensions (open source), although there
are several GUIs for other OSs as well. This particular GUI add-
on makes the use of GIT much more intuitive while preserving
the option of using its command prompt (something that’s not
always the case with GIT GUIs).
It would be sacrilege to omit the Oracle SQL Developer
software since it is frequently used for accessing the structured
data of a company whose DBMS is Oracle. Although this
particular software is probably going to be less essential in the
years to come due to big data technology spreading rapidly, it is
still something useful to know when dealing with data science
tasks. You can see a screenshot of this program in Fig. 8.10.
Fig. 8.10 The Oracle SQL Developer database software,
a great program for working with structured data in
company databases and data warehouses.
The key part of this software is SQL, so in order to use it to its
full potential, you need to be familiar with this query language.
As we saw in an earlier chapter, this is a useful language to
know as a data scientist even if you don’t have to use it that
much. This is because there are several variants of it that are
often used in big data database programs.
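As a quick illustration of the kind of query you would run in such a tool, here is a minimal sketch that uses Python’s built-in SQLite driver rather than Oracle; the sales table and its columns are hypothetical, but the SQL itself is standard:

import sqlite3

# Set up a throwaway in-memory database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("North", 80.0), ("South", 300.0)])

# A typical aggregation query: total sales per region, largest first.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
for region, total in conn.execute(query):
    print(region, total)    # South 300.0, then North 200.0

Dialects differ in their extensions (PL/SQL in Oracle, HiveQL in the Hadoop ecosystem, and so on), but the core SELECT/GROUP BY logic shown here carries over to virtually all of them.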
Some other useful programs to be familiar with when in a data
science position are:
MS Excel – the well-known spreadsheet application
of the MS Office suite. Though ridiculously simple
compared to other data analysis programs, it is still
used today and may come in handy for inspecting
raw data in .csv files, for example, or when creating
summaries of the results of your analyses. Just like
the rest of the MS Office suite, it is proprietary
though there are several freeware alternatives that
have comparable functionality to MS Excel (e.g.,
Calc from OpenOffice).
MS Outlook – an equally well-known MS Office suite
application designed for handling emails, calendars,
to-do lists and contact information. There are
several freeware alternatives to it, but it is often
encountered in workplaces. It will be very useful to
know if you’ll be using it every day for handling
internal and external communications,
appointments, etc. It is also proprietary.
Eclipse – as mentioned earlier, this is one of the
most popular IDEs for OOP languages as well as
other languages (even R). Very robust and
straightforward, it makes programming more user-
friendly and efficient. It is open source and cross
platform.
Emcien – a good graph analysis program for dealing
with complicated datasets, particularly semi-
structured and non-numeric ones. A good program
to look into if you are interested in more advanced
data analysis, particularly graph based. It is not a
substitute for other data analysis programs,
however, and it is proprietary.
Filezilla (or any other FTP client program) – useful if
you need to transfer large files or require a certain
level of security in transferring your files to other
locations over the internet. It is open source.
8.7 Key Points
9.1 Workshops
Workshops are the most efficient way to learn something new,
especially when it comes to technical know-how. Fortunately,
due to the increased popularity of the data science field there
are numerous workshops available from which to learn any
aspect of the field.
Workshops tend to be somewhat expensive (several hundred
dollars each) but they are a good investment, especially if you
are good at picking up new knowledge and know-how. Free
alternatives for learning new things will be covered in
subchapter 9.3. How to find the best workshops will be
discussed later in this section.
So why bother with workshops if there are other ways to learn
new things? Well, workshops provide networking opportunities,
can enhance your resume (if you have no other data science
related qualifications), and often provide more useful
knowledge and know-how than university courses, regardless
of the university. This is because university courses are often
based on the available literature in scientific books, journal
papers and conference proceedings and are designed to give
students the foundation on which to build more advanced
knowledge.
Workshops are also very time efficient, squeezing into a few
hours material that would normally take days to learn on your
own. They are often hard and demand all of your concentration,
but they enable you to learn something you would normally not
have the time or resources to learn on your own.
The key things to keep in mind when choosing to register for a
workshop are what you are going to learn and how it can be
useful for your job as a data scientist. This sounds obvious, but
it is really easy to get sold on workshops that you don’t need
since they all appear quite appealing at the sites that promote
them.
To ensure that you stay focused on the appropriate workshops,
make a list of the skills and knowledge that you want or need,
then research workshops that are being offered. Update your
list if you find workshops that offer something you haven’t
thought of; if there are several workshops that offer it, it is
usually something useful to know in the industry. Finally, pick
the workshop that is most suitable for what you want or need,
taking into account its location, the time of the year it’s offered
and, of course, its price. You can’t go wrong with a strategy like
that.
9.2 Conferences
Conferences are like workshops but are designed for larger
groups of people. They offer some innovative pieces of
knowledge based on research and case studies as well as
more foundational information for those who are newer to the
subject of the conference. More often than not, conferences
offer workshops to attract more people. Note that in this book
we are referring to non-academic conferences, since the
academic ones have a different mission and scope.
Conferences are a great way to learn a variety of new things in
a short period of time, meet new people, exchange war stories
and get acquainted with other challenges in the field.
Conferences are quite interactive and provide great mental
stimulation, very similar to some good university classes, but
without the stress of exams and written assignments. They are
usually costly, making them a viable option mainly for full-time
professionals. However, given the benefits they can provide,
they are a worthy alternative for anyone interested in expanding
his skill-set and data science knowledge. Fortunately,
companies often cover at least some (if not all) of the expenses
of their employees who are participating in such conferences.
The big advantage of this option for learning new things is that
it is very time efficient, especially when combined with a couple
of workshops. If you can relate this new knowledge to an
existing problem you are facing, that’s even better. The bottom
line is that if you are open to new things, a conference can
prove to be a very fruitful experience that may enrich your
understanding of data science and your particular role, too. You
can find out about the various conferences that are being
offered by searching the web directly or through the various
data science groups (see subchapter 9.4).
Other tasks:
Apart from all these, you can also read tutorials and books on
R, obtaining a better understanding of its capabilities and
learning how to make the most of it in your everyday work as a
data scientist.
R has a series of great machine learning libraries that you can
employ in your data analyses, saving you the trouble of having
to code everything from scratch. The most important of these
libraries, which are usually referred to as packages, are the
following (as of the time of this writing):
Boosting packages:
R related:
In this chapter, we’ll see how the different aspects of the data
scientist fit together organically to form a certain process that
defines his work. We will see how the data scientist makes use
of his qualities and skills to formulate hypotheses, discover
noteworthy information, create what is known as a data product
and provide insight and visualizations of the useful things he
finds, all through a data-driven approach that allows the data to
tell its story. The whole process is quite complicated and often
unpredictable, but the different stages are clearly defined and
are straightforward to comprehend. You can think of it as the
process of finding, preparing and eventually showcasing (and
selling) a diamond, starting from the diamond mine all the way
to the jewelry store. Certainly not a trivial endeavor, but one
that’s definitely worth learning about, especially if you value the
end result (and know people who can appreciate it). Let us now
look into the details of the process, which includes data
preparation, data exploration, data representation, data
discovery, learning from data, creating a data product and,
finally, insight, deliverance and visualization (see Fig 11.1).
Fig. 11.1 Different stages of the data science process.
Note that understanding this process and being able to apply it
is a fundamental part of becoming a good data scientist.
When dealing with text data, which is often the case if you need
to analyze logs or social media posts, a different type of
cleansing is required. This involves one or more of the
following:
All these data preparation steps (and other methods that may
be relevant to your industry), will help you turn the data into a
dataset. Having done that, you are ready to continue to the next
stages of the data science process. Make sure you keep a
record of what you have done though, in case you need to redo
these steps or describe them in a report.
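By way of illustration, here is a minimal sketch in Python of a few typical text-cleansing steps (lowercasing, stripping punctuation, removing common stop words); the exact steps and the tiny stop-word list are illustrative assumptions rather than a complete recipe:

import re

# A tiny, illustrative stop-word list; real applications use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

def clean_text(raw):
    """Lowercase, strip punctuation, tokenize and drop stop words."""
    lowered = raw.lower()
    letters_only = re.sub(r"[^a-z0-9\s]", " ", lowered)   # keep letters, digits, spaces
    tokens = letters_only.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The server returned an ERROR: timeout of 30s exceeded!"))
# ['server', 'returned', 'error', 'timeout', '30s', 'exceeded']

Keeping each step as a separate, named operation also makes it easy to record exactly what was done to the raw data, which helps with the record-keeping mentioned above.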
For this and any other statistical terminology, please refer to the glossary at
the end of the book.
Hilary Mason, “How to Know When You Need a Data Scientist”, LinkedIn
article, January 2013.
Chapter 12
Specific Skills Required
Programming skills
Business skills
12.2.1 OO Programmer
If you are already in the object-oriented programming game,
you are familiar with data structures and how to implement an
algorithm efficiently in one or more OOP languages. You may
even be adept at conserving resources and optimizing your
code to meet a particular objective. So you have a decent head
start towards becoming a data scientist since, as we have
already seen, these are some of the essential skills you need to
be a player in the data science game.
Unless you already have experience with Matlab or R,
vectorization is something you need to learn, especially if you
plan to work with one of these data analysis tools. Vectorization involves applying an operation to whole arrays of operands at once, writing loop-free code, instead of processing one pair of operands at a time and looping around to the next pair. The fewer loops in your code,
the faster it will run on a Matlab or R platform as well as on any
other data analysis tools that employ vectorization. This is
because vectorized functions are built-in programs that are
optimized and implemented in C or some other low-level
language, enabling them to run super-fast. This is a great point
to remember, especially when you are dealing with large
datasets. A vectorized approach may be many times faster than
one using loops even if your lines of code are kept to a
minimum. If you learn R, you will naturally learn vectorization
because most tutorials don’t cover loops; if they do, they do so
briefly at a later stage of the tutorial. Also, R has a large variety
of built-in functions that save you the trouble of having to create
loops doing the same thing on your own. So it lends itself to
cleaner, faster vectorized scripts.
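Although the discussion here centers on Matlab and R, the same idea carries over directly to Python with NumPy; the following is a minimal, illustrative sketch of the loop-based versus the vectorized approach (the array sizes and values are made up):

import numpy as np

prices = np.random.rand(100_000)       # made-up data for illustration
quantities = np.random.rand(100_000)

# Loop-based: one pair of operands at a time.
def revenue_loop(p, q):
    total = 0.0
    for i in range(len(p)):
        total += p[i] * q[i]
    return total

# Vectorized: the multiplication and the sum run in optimized low-level code
# that operates on the whole arrays at once.
def revenue_vectorized(p, q):
    return float(np.sum(p * q))

assert np.isclose(revenue_loop(prices, quantities),
                  revenue_vectorized(prices, quantities))

On arrays of this size, the vectorized version typically runs many times faster than the explicit loop, for exactly the reason described above: the work is delegated to optimized, low-level routines.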
But the purpose of this chapter is not to broadcast the merits of
the R language; R can do that for itself. The point is that an OO
programmer will be able to quickly assimilate the data analysis
software used in data science, whether this is R, Matlab or any
other software. The mental discipline that is required for
effective OO programming work can be applied to any other
software required. Even big data technologies, such as
Hadoop, are not going to be a challenge for you if you have this
quality. You will need to learn all these technologies, though,
and it may be somewhat time consuming. For this, you can use
the resources in Appendix 1 as well as all the other sources
mentioned in the first part of Chapter 9, Learning New Things
and Tackling Problems. How long it will take will depend on how
dedicated you are and how much time you can devote to it.
You will want to pay close attention to the data visualization
software as this is probably something that you are the least
familiar with in your current work. It shouldn’t pose much of a
challenge as all the programming required in such a piece of
software is minimal, if not non-existent. Just familiarize yourself
with one or more data visualization packages, and you will be
good to go.
You will need to study the data analysis literature and mine it for
know-how that you will need as a data scientist. You’ll
particularly need to study statistics, if you haven’t taken a
course on this subject already, and most importantly machine
learning. You may not have time to go very deep on either one
of them, but at least make sure you know enough to ace a
statistics or machine learning course.
Finally, you need to learn more about how the end-user thinks,
what he requires, how to interpret these requirements and how
to communicate effectively in a non-technical language.
Basically, hone the soft skills that can make you a software
developer or a systems engineer (though the latter requires
more than just this stuff). This is very important in cultivating the
data scientist mindset and performing this role, as we’ve seen
in Chapter 4.
Naturally, once you’ve learned all these things, you need to
practice. You can start with the Kaggle challenges or the
datasets available in the UCI machine learning repository. Just
be sure that you acquire some hands-on experience before
putting yourself out there as a data scientist for hire.
12.2.2 Software Developer
As a software developer, you are bound to be familiar with GUIs
and the importance of the (usually non-technical) user of your
work. This familiarity is invaluable. Being able to think as the
user thinks allows you to appreciate their point of view and
understand their concerns. Therefore, for a role in data science,
you will need to focus your attention on your other technical
skills.
As a developer, you must already be familiar with two or more programming languages, most likely the .NET framework and C#, or possibly C++ and Java. That’s a great starting point. Just
like your OO programming colleague, you have all the
programming background to be a data scientist, so you should
expand this by incorporating knowledge of big data technology
and data analysis tools.
Your programming background and familiarity with the end-user
will allow you to focus your efforts on the gaps in your
knowledge. Similar to the OO programmer, you will need to
develop your knowledge of visualization software and statistics.
You will also need to go deeper on the machine learning know-
how as this is something many data scientists (all those not
belonging to the researcher category) often lack. If you don’t
know about clustering and pattern recognition, you need to gain
an understanding of them as well as deep learning and other
state-of-the-art machine learning techniques. Joining a relevant
group is one strategy for achieving that objective.
As in the case of the programmer, you will need some hands-on
experience with these new skills before being marketable as a
data scientist. The methods described in the previous
subchapter for acquiring this experience are applicable here,
too. For more details about how to get that initial experience, you can revisit subchapter 6.3.
Learn vectorization
Learn about data analysis tools such as
Matlab, R, etc.
Study statistics and machine learning
Get acquainted with big data tech
Get acquainted with how end-users
think and understand them
job fairs
university bulletin boards
personal website (online work portfolio
and resume)
other (use your imagination!)
14.3 Deliverables
So you know all the relevant software and you’ve read your
statistics and machine learning books so much that you’ll have
a hard time reselling these books, but does that mean that you
can do the job and do it well? It all boils down to the
deliverables involved.
The deliverables of a particular data science position may vary
significantly since different employers have different business
needs for their (big) data, which differs significantly from
industry to industry. They may want you to undertake a project
management role—if not right from the start, then a few months
down the road. This is not uncommon for a senior data scientist
position (business data scientist type). You may know your stuff
well, but at the end of the day, your future employer needs to
make sure that you won’t be sitting in front of your workstation
all day and that you’ll exhibit some human resource
management skills. After all, you have good communication
skills, right? So what’s stopping you from becoming a project
manager or an assistant team leader?
A potential employer is looking for what you can bring to the
company if you are hired. You can say that you are able to
deliver every single item listed in the responsibilities section of
the job description and explain exactly how you can do that. But
you can also be a bit more creative and bring some new ideas
to the table, preferably something that you have thought
through beforehand. Step into the employer’s shoes for a
minute and evaluate the two possibilities from their point of
view. Would you hire you?
The deliverables factor is something that ties in with each one
of your skills, too. You didn’t learn R because of its pretty interface, nor did you learn Hadoop because of its nice documentation, and you certainly didn’t learn Java because of what its fans say about it. You learned each one of these technologies because they can deliver something valuable to you and bring usefulness to your work. So when you have a chance
to talk about your technical skills, you should point out how they
can benefit your potential employer because that’s what he will
care about the most. Remember subchapter 14.1 and the
importance of focusing on the employer. Your interview is your
chance to apply what you’ve learned and convince him that you
have something to offer that he would be unwise to pass on.
The same applies to your other abilities, the so-called soft skills.
In truth, there is nothing soft about them because if you use
them well, they can have some really hard effects that will
benefit everyone around you. Sure, there is a certain prestige
around knowing a particular piece of software at an expert
level, but being able to communicate well can be as important,
if not more so, depending on the particular position. You can
learn a piece of software in a few months, so even if you don’t
know how to use the big data package that a company prefers,
that’s not an issue as long as you’ve worked on similar
software. However, you need the ability to communicate well
right from the start. During the interview process, you want to
show that you can use your soft skills to provide lots of
deliverables because that could be what distinguishes you from
all other applicants.
14.5 Self-Sufficiency
The definition of self-sufficiency used in this book is “being
independent in a proactive and somewhat creative way.” It
means knowing what needs to be done and doing it with little to
no guidance, especially when it comes to your own domain.
You need to own it and plan it accordingly.
Like most things you talk about on your resume, in your cover
letter and during networking sessions, you need to be able to
demonstrate your self-sufficiency with examples drawn from
your professional experience by referring to specific cases
where you participated in or led a project, taking initiative and
showing creativity. Finding an innovative approach to a
problem, developing a clever feature in a data analysis
package or handling a difficult situation through a creative
approach, all without relying on a supervisor, are examples of
self-sufficiency. This is fairly common in the research world, although it is not valued there as much as it should be. The same initiative in industry could result in a raise, a bonus or perhaps
even a promotion, while in the research world it is usually taken
for granted. So if you are in research, it is high time you learned
to value this attribute of yours and sell it properly to an
employer who can appreciate it.
Responsibilities
Programming gigs
Data scrubbing gigs
Tutoring professionals or students
Helping students on their theses
Looking at real-world examples of freelance data
science gigs can help you gain invaluable insight
into what is expected of you in the freelance world
and in the data science world in general.
Often, freelance gigs are not very clearly defined
(the included example is a special case where the
employer is quite clear about what they want).
Here the employer clearly states the domain knowledge that is relevant to
their industry. This is probably the most difficult requirement to meet unless you are already in this industry.
This part of the ad is intentionally in bold, something that clearly illustrates
the importance of good communication skills in the data science
domain.
This is a tricky requirement. Obviously they want you to be familiar with big
data technology, but if you are new, they will consider you.
Does this ring a bell? If not, please review Chapter 4.
In such cases, you need to do plenty of research on what other freelance
data scientists charge so that you neither come across as too expensive nor undervalue your worth.
Chapter 16
Experienced Data Scientists
Case Studies
We’ll begin the case studies with the stories of two experienced data scientists who work in the retail and law enforcement sectors. In both cases, we’ll get to know them
better with some basic professional and background
information, then proceed with their views on data science in
practice, how they see data science in the future and finally
what advice they have for you, the aspiring data scientist. At
the end of the chapter, we’ll have some take-away points, as
usual, to help you remember the key lessons of these
interviews.
Although he has been practicing in the industry for the past few
years, he values the role of researchers in the field and
believes that a data scientist ought to be a bridge between
academia and the industry, something that he seems to have
accomplished very effectively based on what he says about his
life as a data scientist. Since information theory is universal, he
believes that he could transition to another industry relatively
easily. He finds the sectors of drug discovery and forensics
particularly interesting for a data scientist today.
Now that you’ve made it this far and have taken to heart the
guidance in the chapters, let’s look at what you need to know
when you’re ready to start your job quest in the data science
world. Gaining some perspective on the types of job opportunities advertised may be quite useful.
In this chapter, we’ll take a look at different types of ads:
namely, entry-level, experienced, and senior data scientist ads.
In addition, we’ll discuss some relevant tips for online searching
and present a few samples of ads for data scientist positions
that are currently open.
(Source: Kaggle.com)
Title: Junior Data Scientist
Summary
This is an exciting opportunity for an experienced data and analytics
professional to join a leading brand name company within an
innovative and growing team. This position sits within a growing
analytics function offering the chance to play a key role in the further
development of customer insight and business analysis.
This role will play a key part in developing and delivering algorithms
and new analytical approaches to better understand and enable pricing
and business analytics. The role will involve the following key
responsibilities:
Using multiple data sets and sources to streamline
analysis and generate algorithms to develop analytical
frameworks
Work closely with other internal teams and senior stakeholders to better understand the available data, with a view to identifying personalization-driven products and solutions
Develop programming language based scripts (SAS, SQL
or R) to help in the creation of market leading customer
insight strategies
Work with clients to improve and build upon their
understanding of their digital channels and
personalization
Mentor and lead junior team members.
Skill Requirements
To be shortlisted for this position, you must have the following
ESSENTIAL skills and experience:
(Source: Harnham.com)
Skills Requirements
Above all you must have an inquisitive nature and a real passion for
data!
(Source: Linkedin)
18.3 Ads for Senior Data Scientists
Ads for senior data scientists are relatively few, though they are encountered more often than those for junior data scientists. Senior data scientists are basically the top-tier data scientists, the ones who have sailed all kinds of oceans and have fought against monsters of data. They usually end up in a business-oriented position where they deal directly with management, and often with the company’s clients themselves. Note that if you try the
freelance track, you’ll basically be taking a senior data scientist
role even if you don’t refer to it this way. This is because you’ll
need to undertake all the different aspects of that role including
the link to the business world, the project organization, the
architecture design, etc.
You’re probably not going to be hunting for this type of position
right now, but it’s good to be aware of what’s out there in case
you want to drive towards it quickly and you have enough
expertise to make it happen. Experience can be gained
relatively easily once you are committed to your goal, are
focused, and know what you are doing. Here is an example of a
senior data scientist position from a US company.
Title: Senior Data Scientist
Summary
As a senior member of the data sciences team, you will be responsible
for managing and executing critical R&D projects, while providing
thought leadership, along with significant personal contributions.
Working in a highly collaborative environment, you will drive product
innovation and partner with Engineering and Product teams to
prototype and launch data-driven features and products. You will
develop deep domain expertise in digital advertising and generate key
insights that influence business decisions and technological solutions.
In addition, you will be active in the data sciences community and
contribute to attracting, retaining and growing the best talent in a
performance-driven organization.
Skills requirements
Required Qualifications
PhD in a quantitative discipline (e.g., statistics, computer
science, physics), or MS with equivalent experience
10+ years of hands-on experience in analysis and
modeling of large complex datasets
A passion for innovating with data sciences at scale –
applying modern algorithms to massive datasets and
creating measurable business value
Excellent interpersonal and communication skills, with a
strong written and verbal presentation
Proven ability to take ownership of a project and lead
R&D with minimal supervision
Track record of successful implementations of
quantitative, data-driven products in a business
environment
Deep understanding and hands-on experience with
optimization, data mining, machine learning or natural
language processing techniques
Superb understanding of algorithms, scalability and
various tradeoffs in a big data setting
Expert level in R, Matlab or a similar environment;
proficiency in SQL
Ability to personally put together a system of disjoint
components that implements a working solution to the
problem
Experience programming in at least one compiled
language (C/C++ preferred).
Preferred Qualifications
(Source: Linkedin)
Data Engineer
Big Data (Software) Engineer
Chief Scientist
Senior Scientist
Big Data Analyst
Hadoop Programmer / Developer
Big Data Scientist
Big Data Analytics
Research Scientist – Data
VP, Data Science
Data Mining Scientist
Machine Learning Developer
Machine Learning Specialist
Statistician
Keep in mind that all of these are just one strategy for landing a
data science job. Don’t forget that there are other paths to the same goal, so be sure to make use of networking. A connection with a person
working for a company you are applying to could lead to a job
offer for another position if the one you are applying for doesn’t
work out. So draw your own plan of action for making it happen
in this fascinating field. It won’t be easy, but rest assured it is
definitely worth it!
Indeed.com
LinkedIn.com
DataScienceCentral.com
Kaggle.com
Final Words
In this book, we have seen what the field of data science entails
and how the profession of the data scientist came to be. We
described what big data is and how it differs from traditional
data through its main characteristics: volume, variety, velocity
and veracity. We also looked into the different types of data
scientists and the skill-sets of each one. We dug into what the
role of the data scientist requires in terms of the relevant
mindset, technical skills, experience and how he connects to
other people. We also zoomed in on the daily life of a data
scientist, examining the problems he may encounter and how
he tackles them, what programs he uses and how he
expands his knowledge and know-how. We then looked into
how you can become a data scientist based on where you are
starting from: a programming, machine learning, data-related or
student background. Moreover, we went step-by-step through
the process of landing a data scientist job: where you need to
look, how you would present yourself to a potential employer
and what it takes to follow a freelancer path. Finally, we looked
at case studies of experienced and senior-level data scientists
in an attempt to get a better perspective of what this role is in
practice.
Now it is your turn to put all this knowledge to good use.
Whether you are opting for a position in a large organization or
planning to work as a freelancer, you have a lot of interesting
and educational challenges in front of you. This is practical
knowledge that cannot fit in a book. Just remember to stay
current on what is happening in the data science field so that
you always remain competitive. Enrich your toolbox and
knowledge-base constantly; good places to start are the
websites, articles and books that are listed in the appendices.
The book’s glossary can also be used as a hands-on reference
for a variety of relevant terms.
The data science field is still in its toddler years, and few are
those who are perceptive enough to foresee its potential. As
distributed computing gains more ground, data storage
becomes cheaper, data transfer becomes faster and, most
importantly, people begin reaping the fruits of big data, we
should expect it to become a big part of our everyday lives. This
should lead to data science becoming a major profession in the
not-so-distant future. And as big data technology continues to
evolve, more and more interesting ways of making use of
existing data will become available. The data scientist will
continue to be an ever-fascinating role that will rely as much on
creativity as it does on technical skills. By then, there will
probably be university departments specializing in this field,
and future data scientists will look back on the data scientists of
this decade, the pioneers of the field, with great admiration.
Glossary of Computer and Big Data Terminology
Big data terminology has developed during the last few years.
This glossary alphabetically lists some big data definitions
along with some related computer terms that a newcomer to the field will find useful. A basic understanding of computers is
required to fully harness the information in this glossary.
A
Aggregation – the process through which data is searched,
gathered and presented.
Algorithm – a mathematical process that can perform a
specific analysis or transformation on a piece of data.
Analytics – the discovery and communication of insights
derived from data, or the use of software-based algorithms and
statistics to derive meaning from data.
Analytics Platform – software and/or hardware that provide
the tools and computational power needed to build and perform
many different analytical queries.
Anomaly Detection – the systematic search for data items in a
dataset that deviate from a projected pattern or expected
behavior. Anomalies are often referred to as outliers,
exceptions, surprises or contaminants, and they usually provide
critical and actionable information.
Application (App) – a program designed to perform
information processing tasks for a specific purpose or activity.
Artificial Intelligence (A.I.) – the field of computer science
related to the development of machines and software that are
capable of perceiving their environment and taking appropriate
action when required (in real-time), even learning from those
actions. Some A.I. algorithms are widely used in data science.
B
Behavioral Analytics – analytics that inform about the how,
why and what (instead of just the who and when) occurs in data
related to human behavior. Behavioral analytics investigates
humanized patterns in the data.
Big Data – data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage and process
them within a tolerable elapsed time. Big data sizes are a
constantly moving target, ranging from a few dozen terabytes to
many petabytes of data in a single data set. Big data is
characterized by its 4 Vs: volume, velocity, variety and veracity.
Big Data Scientist – an IT professional who is able to
use/develop the essential algorithms to make sense out of big
data and communicate the derived information effectively to
anyone interested. Also known as a data scientist.
Big Data Startup – a young company that has developed new
big data technology.
Business Intelligence – the theories, methodologies and
processes to make data, particularly business-related data,
understandable and more actionable.
Byte (B) – short for “binary term.” A sequence of bits
that represents a character. Each byte has 8 bits.
C
Central Processing Unit (CPU) – the brains of an information
processing system; the processing component that controls the
interpretation and execution of instructions in a computer.
Classification Analysis – a systematic process for assigning data items to predefined categories (classes) using classification algorithms, yielding important and relevant information about the data.
Cloud – a broad term that refers to any Internet-based
application or service that is hosted remotely.
Cloud Computing – a computing system whose processing is
distributed over a network that uses server farms to store data
in a distant location (see also, data centers).
Clustering Analysis – the process of identifying objects that
are similar to each other and grouping them in order to
understand the differences and the similarities within the data.
Clustering is usually referred to as unsupervised learning and is
a fundamental part of data exploration and data discovery.
Comparative Analysis – a step-by-step procedure of comparisons and calculations used to detect patterns within very large data sets.
Complex Structured Data – data that is composed of two or
more complex, complicated and interrelated parts that cannot
be easily interpreted by structured query languages and tools.
Computer Generated Data – data generated by computers
such as log files. This constitutes a large part of big data in the
world today.
Concurrency – performing and executing multiple tasks and
processes at the same time.
Correlation Analysis – a statistical technique for determining a
relationship between variables and whether that relationship is
negative or positive. Although it does not imply causation,
correlation analysis can yield very useful information about the
data and help the data scientist handle it more effectively.
Customer Relationship Management (CRM) – managing
sales and business processes. Big data will affect CRM
strategies.
D
Dashboard – a graphical representation of the analyses
performed by algorithms, usually in the form of plots and
gauges.
Data – a quantitative or qualitative value. Common types of
data include sales figures, marketing research results, readings
from monitoring equipment, user actions on a website, market
growth projections, demographic information and customer
lists.
Data Access – the act or method of viewing or retrieving stored
data.
Data Aggregation Tools – tools and methods for transforming scattered data from numerous sources into a single, unified source.
Data Analytics – the application of software to derive
information or meaning from data. The end result might be a
report, an indication of status or an action taken automatically
based on the information received.
Data Analyst – someone who analyzes, models, cleanses,
and/or processes data. Data analysts usually don’t perform
predictive analytics, and when they do, it’s usually through the
use of a simple statistical model.
Data Architecture and Design – the way enterprise data is
structured. The actual structure or design varies depending on
the eventual end result required. Data architecture has three
stages or processes: conceptual representation of business
entities, the logical representation of the relationships among
those entities and the physical construction of the system to
support the functionality.
Database – a digital collection of data and the structure in
which the data is organized (structured). The data is typically
entered into and accessed via a database management system
(DBMS).
Database Administrator (DBA) – a person who is responsible
for supporting and maintaining the integrity of the structure and
content of a database.
Database-as-a-Service (DaaS) – a database hosted in the
cloud and sold on a metered basis. Examples include Heroku
Postgres and Amazon Relational Database Service.
Database Management System (DBMS) – integrated software for collecting, storing and providing access to data that is practical to use even by non-specialists.
Data Center – a physical location that houses the servers for
storing data. Data centers might belong to a single organization
or sell their services to many organizations.
Data Cleansing – the process of reviewing and revising data in
order to delete duplicates, correct errors and provide
consistency.
Data Collection – any process that captures any type of data.
Data Custodian – a person responsible for the database
structure and the technical environment including the storage of
data.
Data-Directed Decision Making – using data to support
making crucial decisions.
Data Exhaust – the data that a person creates as a byproduct
of a common activity: for example, a cell call log or Web search
history.
Data Governance – a set of processes or rules that ensure the
integrity of the data and that data management best practices
are met.
Data Integration – the process of combining data from different
sources and presenting it in a single view.
Data Integrity – the measure of trust an organization has in the
accuracy, completeness, timeliness and validity of the data.
Data Management Association (DAMA) – a non-profit
international organization for technical and business
professionals “dedicated to advancing the concepts and
practices of information and data management.”
Data Management – according to the Data Management
Association, data management incorporates the following
practices needed to manage the full data lifecycle in an
enterprise:
data governance
data architecture, analysis and design
database management
data security management
data quality management
reference and master data management
data warehousing and business intelligence
management
document, record and content management
metadata management
E
Enterprise Resource Planning (ERP) – a software system
that allows an organization to coordinate and manage all its
resources, information and business functions.
E-Science – traditionally defined as computationally intensive
science involving large data sets. More recently broadened to
include all aspects and types of research that are performed
digitally.
Event Analytics – a process that shows the series of steps
that led to an action.
Exploratory Analysis – finding patterns within data without
standard procedures or methods. It is a means of discovering
the data and finding the data set’s main characteristics. Usually
referred to as data exploration, it constitutes an important part
of the data science process.
Exabyte – approximately 1000 petabytes or 1 billion gigabytes.
Today, we create one exabyte of new information globally on a
daily basis.
Extract, Transform and Load (ETL) – a process for populating
data in a database and data warehouse by extracting the data
from various sources, transforming it to fit operational needs
and loading it into the database.
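A minimal ETL sketch in Python using only the standard library (the file name, table name and column names are assumptions made for illustration):

    import csv
    import sqlite3

    # Extract: read raw records from a (hypothetical) CSV export.
    with open("sales.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: keep only rows with an amount and convert it to a number.
    clean = [(r["customer"], float(r["amount"])) for r in rows if r["amount"]]

    # Load: insert the transformed rows into a database table.
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    con.commit()
    con.close()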
F
Failover – switching automatically to a different server or node
if one fails. This is a very useful property of a computer cluster
and ensures scalability in data analysis processes.
Fault-Tolerant Design – a system designed to continue
working even if certain parts fail.
Federal Information Security Management Act (FISMA) – a
US federal law that requires all federal agencies to meet certain
standards of information security across their systems.
File Transfer Protocol (FTP) – a set of guidelines or standards
that establishes the format in which files can be transmitted
from one computer to another.
G
Gamification – using game elements in a non-game context. It is a
very useful way of generating data, which is why it has been called
the friendly scout of big data.
Gigabyte – a measurement of the storage capacity of a
computer. One gigabyte represents more than 1 billion bytes.
Gigabyte may be abbreviated G, GB or Gig; however, GB is
clearer, since G also stands for the metric prefix giga (meaning
1 billion).
Graph Database – a database that uses graph structures (nodes,
edges and their properties) for data storage. It provides index-free
adjacency, meaning every element is directly linked to its
neighboring elements.
Grid Computing – connecting different computer systems from
various locations, often via a cloud, to reach a common goal.
H
Hadoop – an open-source framework that is built to enable the
processing and storage of big data across a distributed file system.
Hadoop is currently the most widespread and most developed
big data platform available.
Hadoop Distributed File System (HDFS) – a distributed file
system designed to run on commodity hardware.
HBase – an open source, non-relational, distributed database
running in conjunction with Hadoop. It is particularly useful for
archiving purposes.
High-Performance-Computing (HPC) – using
supercomputers to solve highly complex and advanced
computing problems.
Hypertext – a technology that links text in one part of a
document with related text in another part of the document or in
other documents. A user can quickly find the related text by
clicking on the appropriate keyword, key phrase, icon or button.
Hypertext Transfer Protocol (HTTP) – the protocol used on
the World Wide Web that permits Web clients (Web browsers)
to communicate with Web servers. This protocol allows
programmers to embed hyperlinks in Web documents using
hypertext markup language (HTML).
I
Indexing – the ability of a program to accumulate a list of
words or phrases that appear in a document, along with their
corresponding page numbers, and to print or display the list in
alphabetical order.
Information Processing – the coordination of people,
equipment and procedures to handle the storage, retrieval,
distribution and communication of information. The term
information processing embraces the entire field of processing
words, figures, graphics, videos and voice input by electronic
means.
In-Database Analytics – the integration of data analytics into
the data warehouse.
Information Management – the practice of collecting,
managing and distributing information of all types: digital,
paper-based, structured and unstructured.
In-Memory Data Grid (IMDG) – the storage of data in memory,
across multiple servers, for the purpose of greater scalability
and faster access or analytics.
In-Memory Database – a database management system that
stores data in the main memory instead of on the disk, resulting
in very fast processing, storing and loading of the data.
Internet – a system that links existing computer networks into a
worldwide network. The Internet may be accessed by means of
commercial online services (such as America Online) and
Internet service providers (ISPs).
Internet of Things (IoT) – ordinary devices that are connected
to the Internet at any time and from anywhere via sensors. IoT is
expected to contribute substantially to the growth of big data.
Internet Service Provider (ISP) – an organization that
provides access to the Internet for a fee. Companies like
America Online are more properly referred to as commercial
online services because they offer many other services in
addition to Internet access.
Intranet – a private network established by an organization for
the exclusive use of its employees. Firewalls prevent outsiders
from gaining access to an organization’s intranet.
J
Juridical Data Compliance – the need to comply with the laws
of the country where your data is stored. Relevant when you
use cloud solutions and when the data is stored in a different
country or continent.
K
Key Value Database – a database in which each record is stored
under a primary key that uniquely identifies it, making it easy and
fast to look up. The data stored under a given key is normally some
kind of primitive of the programming language.
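The principle can be illustrated with Python's built-in dictionary acting as a toy in-memory key-value store; production systems such as Redis work on the same look-up-by-key idea, but this is only a sketch with made-up keys:

    # A toy key-value store: each record is reached via its unique key.
    store = {}

    store["user:1001"] = {"name": "Alice", "country": "NL"}  # put
    store["user:1002"] = {"name": "Bob", "country": "US"}

    record = store.get("user:1001")  # fast lookup by primary key
    print(record["name"])            # -> Alice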
Kilobyte – a measurement of the storage capacity of a
computer. One kilobyte represents 1024 bytes. Kilobyte may be
abbreviated K or KB; however, KB is the clearer abbreviation,
since K also stands for the metric prefix kilo (meaning 1000).
L
Latency – a measure of time delay in a system.
Legacy System – an old system, technology or computer
system that is not supported any more.
Load Balancing – distributing workload across multiple
computers or servers in order to achieve optimal results and
utilization of the system.
Location Data (Geo-Location Data) – GPS data describing a
geographical location. Very useful for data visualization among
other things.
Log File – a file that a computer, network or application creates
automatically to record events that occur during operation (e.g.,
the time a file is accessed).
M
Machine Data – data created by machines via sensors or
algorithms.
Machine Learning (ML) – the field of computer science related
to the development and use of algorithms to enable machines
to learn from what they are doing and become better over time.
Although there is a large overlap between ML and artificial
intelligence, they are not the same. ML algorithms are an
integral part of data science.
MapReduce – a software framework for processing vast
amounts of data using parallelization.
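A minimal word-count sketch of the map/reduce idea in plain Python; a real MapReduce framework distributes the map and reduce phases across a cluster, whereas here everything runs locally for illustration:

    from collections import defaultdict

    documents = ["big data is big", "data science uses big data"]

    # Map phase: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group the pairs by key (the word).
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: sum the counts for each word.
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}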
Massively Parallel Processing (MPP) – using many different
processors (or computers) to perform certain computational
tasks at the same time.
Master Data Management (MDM) – management of core non-
transactional data that is critical to the operation of a business
to ensure consistency, quality and availability. Examples of
master data are customer or supplier data, product information,
employee data, etc.
Megabyte – a measurement of the storage capacity of a
computer. One megabyte represents more than 1 million bytes.
Megabyte may be abbreviated M or MB; however, MB is clearer
since M also stands for the metric prefix mega (meaning 1
million).
Memory – the part of a computer that stores information. Often
synonymous with Random Access Memory (RAM), the temporary
memory that allows information to be stored randomly and
accessed quickly and directly without the need to go through
intervening data.
Metadata – any data used to describe other data; for example,
a data file’s size or date of creation.
MongoDB – a popular open-source NoSQL database.
MPP Database – a database optimized to work in a massively
parallel processing environment.
Multi-Dimensional Database – a database optimized for
online analytical processing (OLAP) applications and for data
warehousing.
Multi-Threading – the act of breaking up an operation within a
single computer system into multiple threads for faster
execution. Multi-threading lets a single PC with a modern
multi-core CPU make use of all of its cores at once, somewhat
like a small computer cluster.
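A minimal sketch in Python using the standard-library thread pool; the URLs are placeholders, and threading in Python pays off mainly for I/O-bound work such as downloads:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    urls = ["https://example.com", "https://example.org"]  # hypothetical pages

    def fetch(url):
        # Each download runs in its own thread instead of waiting in line.
        with urllib.request.urlopen(url) as response:
            return url, len(response.read())

    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, size in pool.map(fetch, urls):
            print(url, size, "bytes")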
MultiValue Database – a type of NoSQL, multidimensional
database that handles three-dimensional data directly. Its records
are essentially large strings, which makes it well suited to
manipulating HTML and XML strings directly.
Memetic Algorithm – a special type of evolutionary algorithm
that combines a steady state genetic algorithm with local
search for real-valued parameter optimization.
N
Natural Language Processing (NLP) – a field of computer
science involved with interactions between computers and
human languages. NLP is widely used in text analytics and is a
popular subfield of data science.
Network Analysis – analyzing connections and the strength of
the ties between nodes in a network. Viewing relationships
among the nodes in terms of the network or graph theory.
NewSQL – a class of modern relational database systems that aim
to combine the scalability of NoSQL systems with the transactional
guarantees of traditional SQL databases. The term is even newer
than NoSQL.
NoSQL – a class of database management systems that do not
use the relational model. NoSQL is designed to handle large data
volumes that do not follow a fixed schema, making it ideally suited
to data that does not require the relational model. It is sometimes
read as "Not only SQL" because such databases do not adhere to
traditional relational database structures; they typically trade strict
consistency for higher availability and horizontal scaling.
Normalization – the process of transforming a numeric
variable so that its values are in the same range as other
normalized variables. This allows for easier comparisons and
more efficient ways of handling a set of variables.
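A minimal min-max normalization sketch in plain Python, one common way to bring a variable into the 0–1 range (the values are illustrative):

    values = [12.0, 45.0, 7.0, 30.0]

    lo, hi = min(values), max(values)
    # Rescale every value into the 0-1 range so the variable can be
    # compared directly with other normalized variables.
    normalized = [(v - lo) / (hi - lo) for v in values]
    print(normalized)  # [0.13..., 1.0, 0.0, 0.60...]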
O
Object Database – databases that store data in the form of
objects as used by object-oriented programming. They are
different from relational or graph databases, and most of them
offer a query language that allows objects to be found with a
declarative programming approach.
Online Analytical Processing (OLAP) – the process of
analyzing multidimensional data using three operations:
consolidation (the aggregation of available data), drill-down (the
ability for users to see the underlying details) and slice and dice
(the ability for users to select subsets and view them from
different perspectives).
Online Transactional Processing (OLTP) – the process of
handling large numbers of short, transaction-oriented operations
(such as inserts, updates and lookups), giving users fast access
to transactional data as it is created.
Open Data Center Alliance (ODCA) – a consortium of global
IT organizations whose goal is to speed the migration to cloud
computing.
Open Source – a type of software code that has been made
freely available for download, modification and redistribution.
Operational Database – databases that record the regular
operations of an organization; they are generally very important
to a business. Organizations generally use online transaction
processing, which allows them to enter, collect and retrieve
specific information about the company.
Optimization Analysis – the algorithm-driven optimization of a
product during its design cycle. It allows companies to virtually
design many different variations of a product and to test each one
against pre-set variables.
Ontology – the representation of knowledge as a set of concepts
within a domain and the relationships between those concepts.
Very useful when designing a database.
Outlier Detection – an outlier is an object that deviates
significantly from the general average within a dataset or a
combination of data. It is numerically distant from the rest of the
data and therefore indicates that something is going on that
requires additional analysis. Usually referred to as anomaly
detection.
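A minimal z-score sketch in Python, one simple approach among many; the data and the threshold of two standard deviations are assumptions made for illustration:

    import statistics

    data = [10, 12, 11, 13, 12, 95, 11]  # 95 is numerically distant from the rest

    mean = statistics.mean(data)
    stdev = statistics.stdev(data)

    # Flag values lying more than 2 standard deviations from the mean.
    outliers = [x for x in data if abs(x - mean) / stdev > 2]
    print(outliers)  # [95]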
P
Parallel Data Analysis – breaking up an analytical problem
into smaller components and running algorithms on each of
those components at the same time. Parallel data analysis can
occur within the same system or across multiple systems.
Parallel Method Invocation (PMI) – the ability to allow
programming code to call multiple functions in parallel.
Parallel Processing – the ability to execute multiple tasks at
the same time.
Parallel Query – a query that is executed over multiple system
threads for faster performance.
Pattern Recognition – identifying patterns in data via
algorithms to make predictions of new data coming from the
same source. Pattern recognition is also referred to as
supervised learning and constitutes a major part of machine
learning.
Performance Management – the process of monitoring
system or business performance against predefined goals to
identify areas that need attention.
Petabyte – 1024 terabytes or 1 million gigabytes. The CERN
Large Hadron Collider generates approximately 1 petabyte per
second.
Predictive Analysis (Predictive Analytics) – the most
valuable analysis within big data as it helps predict what
someone is likely to buy, visit or do as well as how someone will
behave in the (near) future. It uses a variety of different data
sets such as historical, transactional, social, or customer profile
data to identify risks and opportunities.
Predictive Modeling – the process of developing a model to
predict a trend or outcome.
Program – an established sequence of instructions that tells a
computer what to do. The term program means the same thing
as software.
Protocol – a set of standards that permits computers to
exchange information and communicate with each other.
Q
Quantified Self – a modern movement related to the use of
applications to track one’s every move during the day in order
to gain a better understanding of one’s behavior.
Query – asking for information to answer a certain question,
usually in a database context.
Query Analysis – the process of analyzing a search query for
the purpose of optimizing it for the best possible result.
R
R – an open-source programming language and software
environment for statistical computing and graphics. The R
language is widely used among statisticians and data miners
for developing statistical software and data analysis. R’s
popularity has increased substantially in recent years.
Real Time – a descriptor for events, data streams or processes
that have an action performed on them as they occur.
Real-Time Data – data that is created, processed, stored,
analyzed and visualized within milliseconds of its creation.
Recommendation Engine (Recommender System) – an
algorithm that analyzes a user’s purchases and actions on an
e-commerce site and then uses that data to recommend
complementary products.
Record – a collection of all the information pertaining to a
particular subject.
Records Management – the process of managing an
organization’s records throughout their entire lifecycle from
creation to disposal.
Reference Data – data that describes an object and its
properties. The object may be physical or virtual.
Regression Analysis – a statistical technique for estimating the
dependency between continuous variables. It assumes a one-way
causal effect from one variable (the predictor) to the response of
another variable.
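A minimal sketch of a simple linear regression in Python (assuming numpy; the data points are made up), fitting a straight line y ≈ a·x + b by least squares:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # explanatory variable
    y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])  # response variable

    # Fit y = a*x + b by least squares; degree 1 means a straight line.
    a, b = np.polyfit(x, y, 1)
    print(round(a, 2), round(b, 2))  # slope close to 2, intercept close to 0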
Report – the presentation of information derived from a query
against a dataset, usually in a predetermined format.
Risk Analysis – the application of statistical methods on one or
more datasets to determine the likely risk of a project, action or
decision.
Root-Cause Analysis – the process of determining the main
cause of an event or problem.
Routing Analysis – using many different variables to find the
optimal route for a certain means of transport in order to
decrease fuel costs and increase efficiency.
S
Scalability – the ability of a system or process to maintain
acceptable performance levels as workload or scope increases.
Schema – the structure that defines the organization of data in
a database system.
Semi-Structured Data – a form of structured data that does not
conform to a formal structure the way fully structured data does. It
contains tags or other markers to enforce a hierarchy of records.
Semi-structured data is often found in JSON objects.
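A minimal sketch of semi-structured data in Python: a JSON object whose nested field names act as the markers of the hierarchy (the record itself is made up):

    import json

    # A hypothetical semi-structured record: the field names act as tags,
    # but there is no fixed schema that every record must follow.
    raw = '{"user": "alice", "visits": 3, "profile": {"country": "NL"}}'

    record = json.loads(raw)
    print(record["profile"]["country"])  # -> NL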
Server – a physical or virtual computer that serves requests for
a software application and delivers those requests over a
network.
Signal Analysis – the analysis of measurements of time-varying
or spatially varying physical quantities, for example to assess the
performance of a product. Signal analysis is frequently used with
sensor data.
Similarity Searches – finding the closest object to a query in a
database where the data object can be of any type of data.
Simulation Analysis – a simulation is the imitation of the
operation of a real-world process or system. A simulation
analysis helps to ensure optimal product performance by taking
into account many different variables.
Smart Grid – the smart grid refers to the concept of adding
intelligence to the world’s electrical transmission systems with
the goal of optimizing energy efficiency. Enabling the smart grid
will rely heavily on collecting, analyzing and acting on large
volumes of data.
Software-as-a-Service (SaaS) – application software that is
used over the Web by a thin client or Web browser. Salesforce
is a well-known example of SaaS.
Solid-State Drive (SSD) – also called a solid-state disk; a
device that uses memory ICs to persistently store data.
Spatial Analysis – the process of analyzing spatial data such
as geographic or topological data to identify and understand
patterns and regularities within data distributed in geographic
space. This is usually performed in a special type of system
called a geographic information system (GIS).
Storm – an open source distributed computation system
designed for processing multiple data streams in real time.
Structured Data – data that is identifiable because it is
organized in a structure such as rows and columns. The data
resides in fixed fields within a record or file, or the data is
tagged correctly and can be accurately identified.
Structured Query Language (SQL) – a programming
language for retrieving data from a relational database. SQL is
not directly applicable in the big data domain.
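A minimal sketch of an SQL query issued from Python against an in-memory SQLite database (the table and its contents are made up for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (name TEXT, country TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)",
                    [("Alice", "NL"), ("Bob", "US")])

    # A typical SQL query: retrieve the rows that satisfy a condition.
    for row in con.execute("SELECT name FROM customers WHERE country = 'NL'"):
        print(row)  # -> ('Alice',)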
T
Terabyte – approximately 1000 gigabytes. A terabyte is the
data volume of about 300 hours of high-definition video.
Text Analytics – the application of statistical, linguistic and
machine learning techniques on text-based sources to derive
meaning or insight.
Thread – a series of posted messages that represents an
ongoing discussion of a specific topic in a bulletin board
system, a newsgroup or a Web site.
Time Series Analysis – the process of analyzing well-defined
data obtained through repeated measurements of time. The
data has to be well-defined and measured at successive points
in time spaced at identical time intervals.
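A minimal sketch of one common time-series operation, a moving average over measurements taken at identical intervals, in plain Python (the readings are illustrative):

    readings = [20.1, 20.4, 21.0, 22.3, 21.8, 22.9]  # equally spaced measurements

    window = 3
    # Average each run of `window` consecutive readings to smooth out noise.
    moving_avg = [sum(readings[i:i + window]) / window
                  for i in range(len(readings) - window + 1)]
    print([round(v, 2) for v in moving_avg])  # [20.5, 21.23, 21.7, 22.33]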
Topological Data Analysis – focusing on the shape of
complex data and identifying clusters and any statistical
significance that is present within that data.
Transmission Control Protocol/Internet Protocol (TCP/IP) –
a collection of over 100 protocols that are used to connect
computers and networks.
Transactional Data – data that describes an event or
transaction that took place.
Transparency – operating in such a way that whatever is
taking place is open and apparent to whomever is interested.
U
Unstructured Data – data that has no predefined structure; it is
generally text-heavy but may also contain dates, numbers and facts.
V
Value – the benefits that organizations can reap from analysis
of big data.
Variability – one of the characteristics of big data, variability
means that the meaning of the data can change (and rapidly).
For example, in multiple tweets the same word can have totally
different meanings.
Variety – one of the major characteristics of big data. Data
today comes in many different formats: structured data, semi-
structured data, unstructured data and even complex structured
data.
Velocity – one of the major characteristics of big data. The
speed at which the data is created, stored, analyzed and
visualized.
Veracity – one of the major characteristics of big data, veracity
refers to the correctness of the data. Organizations need to
ensure that both the data and the analyses performed on it are
correct.
Visualization – visualizations are complex graphs that can
include many variables of data while still remaining
understandable and readable. With the right visualizations, raw
data can be put to use.
Volume – one of the major characteristics of big data. It refers
to the total quantity of data, beginning at terabytes and growing
higher over time.
W
Weather Data – an important open, public data source that can
provide organizations with a lot of insights when combined with
other sources.
X
XML Database – databases that allow data to be stored with its
markup tags. XML databases are often linked to document-
oriented databases. The data stored in an XML database can
be queried, exported and serialized into any format needed.
Y
Yottabyte – approximately 1000 zettabytes, or 250 trillion
DVDs. The entire digital universe today has been estimated at
1 yottabyte, a figure expected to double every 18 months.
Z
Zettabyte – approximately 1000 exabytes or 1 billion terabytes.
It is expected that in 2016 more than 1 zettabyte of traffic will
cross our networks globally over the course of the year.
Appendix 1
Useful Websites