
Chapter 1: Introduction to Data Science

In This Chapter

• What data science is and the history of its evolution
• The journey to becoming a data scientist
• Building a data science team
• The data science project life cycle
• Managing data science projects

Data science has recently become a common topic of conversation at almost every data-driven organization.
Along with the term “big data,” the rise of the term “data science” has been so rapid that it’s frankly confusing.

What exactly is data science and why has it suddenly become so important?

In this chapter we provide an introduction to data science from a practitioner’s point of view, explaining some of
the terminology around it, and looking at the role a data scientist plays in this era of big data.

Companion materials to Practical Data Science with Hadoop and Spark by Ofer Mendelevitch, Casey Stella, and Douglas Eadline.
Copyright © 2017 Pearson Education, Inc. PowerPoints copyright © 2019; version 1.0, prepared January 2019.
Chapter Outline

• What Is Data Science?
• Example: Search Advertising
• A Bit of Data Science History
• Statistics and Machine Learning
• Innovation from Internet Giants
• Data Science in the Modern Enterprise
• Becoming a Data Scientist
• The Data Engineer
• The Applied Scientist
• Transitioning to a Data Scientist Role
• Soft Skills of a Data Scientist
• Building a Data Science Team
• The Data Science Project Life Cycle
• Ask the Right Question
• Data Acquisition
• Data Cleaning: Taking Care of Data Quality
• Explore the Data and Design Model Features
• Building and Tuning the Model
• Deploy to Production
• Managing a Data Science Project
• Summary

Figure 1.1 Iterative process of data science discovery.

Figure 1.2 Search ads shown on an Internet search results page.

Figure 1.3 The skillset of the data scientist.

Figure 1.4 Expanded version of Figure 1.1 further illustrating the iterative nature of data science.

Chapter Discussion

What Is Data Science?


The definition of data science (page 4) used throughout the book is as follows: Data science is the exploration
of data via the scientific method to discover meaning or insight, and the construction of software systems that
utilize such insight in a business context.

This definition emphasizes two key aspects of data science, which are analogous to the traditional practice of
science and engineering:

1. Exploring data using the scientific method (discovery)
2. Implementing software systems that can use the resulting insight (engineering)

The figure on page 4 also emphasizes the cyclic nature of inquiry. It is rare that the first hypothesis does not
need refinement.

Example: Search Advertising


Use click-through rate (CTR) as an example of scientific refinement and applied engineering. Algorithms based
on what will get people to "click" (the hypothesis) are developed and tested. The best ones are put into practice
(engineering).
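
To make the CTR example concrete for students, the following is a minimal sketch, not the method used by any
search provider. It assumes CTR is simply clicks divided by impressions; the `AdVariant` class, the variant
names, and the counts are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdVariant:
    """One candidate ad (a hypothesis about what people will click)."""
    name: str
    impressions: int  # times the ad was shown (hypothetical counts)
    clicks: int       # times the ad was clicked

    @property
    def ctr(self) -> float:
        # Click-through rate = clicks / impressions; guard against zero impressions.
        return self.clicks / self.impressions if self.impressions else 0.0


def best_variant(variants):
    """Return the variant with the highest observed CTR."""
    return max(variants, key=lambda v: v.ctr)


# Discovery: test competing hypotheses on (made-up) traffic.
variants = [
    AdVariant("headline_a", impressions=10_000, clicks=180),
    AdVariant("headline_b", impressions=10_000, clicks=240),
]

# Engineering: the best-performing variant is the one put into practice.
winner = best_variant(variants)
print(f"{winner.name}: CTR = {winner.ctr:.2%}")
```

In practice the comparison would also need to account for sample size and noise before declaring a winner;
the sketch only shows the measure-and-select step.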

A Bit of Data Science History


Indicate that statistics and machine learning (this includes popular deep learning tools like TensorFlow) are
different approaches to the same problem: gaining insight from data. Statistics has been in use far longer than
machine learning, and the two approach the problem in very different ways. The statistical approach assumes the
data are generated by a given stochastic data model (some random probability distribution), while machine learning
uses algorithmic models to characterize data and makes no assumptions about the underlying model or
mechanism. Because machine learning does not require an underlying model, it can be more widely applied
than statistical methods. It is also possible that machine learning may provide insights into possible statistical
models.
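
The contrast between the two approaches can be illustrated with a small sketch. This is an illustration under
stated assumptions, not an example from the book: the synthetic data, the choice of a straight-line model with
Gaussian noise for the "statistical" side, and the choice of k-nearest neighbors for the "algorithmic" side
are all hypothetical and picked only because they are easy to read.

```python
import random
import statistics

# Synthetic data (assumption): a noisy straight line, y = 2x + 1 + noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]


def fit_line(xs, ys):
    """Statistical approach: assume y = slope * x + intercept + Gaussian noise
    and estimate the two parameters by ordinary least squares."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, x_new, k=5):
    """Algorithmic approach: no assumed functional form; predict by averaging
    the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return statistics.fmean(y for _, y in nearest)


slope, intercept = fit_line(xs, ys)
x_query = 5.0
print("data-model prediction: ", slope * x_query + intercept)
print("algorithmic prediction:", knn_predict(xs, ys, x_query))
```

The fitted slope and intercept are interpretable statements about the assumed data model; the nearest-neighbor
prediction makes no such statement and would work unchanged on data that is not linear.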


Almost all of the large Internet companies use some form of machine learning, and many are moving into AI-based
methods. Note: AI and machine learning are often used to mean the same thing. We prefer the following
definitions to help students understand the difference:

• Data science produces insights and relationships
• Machine learning produces predictions
• Artificial intelligence produces actions

In addition, the advent of "cheap storage" has made data retention an easy decision for most modern
enterprises. And, since they have the data, why not use it?

Becoming a Data Scientist


Emphasize the distinction between a Data Engineer and an Applied Data Scientist. A data engineer can be
considered an experienced software engineer who is highly skilled in building high-quality production-grade
software systems with a specialization in building fast (and often distributed or scalable) data pipelines.

A data scientist, on the other hand, is practiced at knowing what tools to use given an analytics problem. They often
possess a good understanding of the math behind algorithms and know the trade-offs associated with various
methodologies. They usually have experience in research methods, but their work is not "academic" in nature. That is,
the results of their research are often used in production environments and are not the subject of research papers or
presentations. This situation is similar to an industrial chemist who develops new compounds that result in
new products.

Becoming a data scientist is usually a transition rather than a creation. A computer engineer needs to gain an
understanding of statistics and machine learning and how to investigate a problem. As an applied scientist, your
transition will require programming skills and the ability to build solutions from many different components.


Point out that the "data scientist" is a relatively new role and the definition (and needed skill sets) are not written
in stone. In general, a data scientist is a blend of data engineer and applied scientist, supported by a
set of "soft skills":

1. Curiosity, which helps keep things moving past the obvious.
2. Learning, because the tools and methods are always changing and improving.
3. Persistence, because if it were easy and worked the first time, anyone could do it.
4. Storytelling, which creates a simplified "big picture," because not everyone is a data scientist or knows what
your results mean (or how you even got the results).

Building a Data Science Team


Because a data scientist is a "rare bird" (and expensive), it is best to build teams from both data engineers and
applied scientists and allow the team to evolve into a true data science team.

The Data Science Project Life Cycle


Figure 1.4 is an expanded and more detailed version of Figure 1.1. The figure is also drawn slightly differently and
has multiple "cyclic" pathways that reflect the iterative nature of a data science investigation. The important point
is that the approach is cyclic. For instance, the results may be poor because the data quality is poor, which may
necessitate better data cleaning or even looking for similar (but better) data. There is both a science portion and
an engineering portion in the graph.

Science cycle (investigation):

1. Ask the Right Question
2. Data Acquisition (or finding the data)
3. Data Cleaning (Taking Care of Data Quality; not all data are neat and tidy)
4. Explore the Data and Design Model Features
5. Building and Tuning the Model (cycling between 5 and 4 is almost a given)
6. Evaluate Results; if acceptable, move on to implementation; otherwise, go back to 2, 3, 4, or 5


Engineering (implementation):

1. Deploy to Production
2. Suggest another question
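
The cyclic investigation described above can be sketched as a loop that returns to an earlier step when the
evaluation is not acceptable. This is a toy illustration only, not the book's procedure: the readings, the
median-based cleaning rule, the mean as the "model," the known target used for evaluation, and every function
name are hypothetical.

```python
import statistics

# Hypothetical raw readings containing two obviously bad records.
RAW = [10.1, 9.8, 10.3, 250.0, 9.9, 10.0, -120.0, 10.2]
TRUE_VALUE = 10.0  # assumed known reference value, used only to evaluate results


def clean(data, max_dev):
    """Data cleaning: keep values within max_dev of the median."""
    median = statistics.median(data)
    return [x for x in data if abs(x - median) <= max_dev]


def build_model(data):
    """Model building: here the 'model' is simply the mean of the cleaned data."""
    return statistics.fmean(data)


def evaluate(model):
    """Evaluation: absolute error against the reference value."""
    return abs(model - TRUE_VALUE)


def science_cycle(raw, tolerance=0.2, max_rounds=5):
    """Repeat clean -> model -> evaluate, tightening the cleaning rule each round."""
    max_dev = 1000.0
    history = []
    for round_no in range(1, max_rounds + 1):
        model = build_model(clean(raw, max_dev))
        error = evaluate(model)
        history.append((round_no, max_dev, error))
        if error <= tolerance:
            return model, history  # acceptable: hand off to engineering
        max_dev /= 10  # poor result: go back to data cleaning with a stricter rule
    return None, history  # no acceptable result, which is also a valid outcome


model, history = science_cycle(RAW)
for round_no, max_dev, error in history:
    print(f"round {round_no}: max_dev={max_dev:g}, error={error:.2f}")
```

Returning `None` after the round limit mirrors the point that an investigation may end without a usable
result; only an accepted model moves on to deployment.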

Managing a Data Science Project


Bottom line: real-world data science projects can be messy due to unknown data quality and the difficulty in
assessing progress. In particular, things like machine learning have no underlying model, so there is no way to
measure how far away you are from any predicted result. Managing a messy project requires focus and some
intuition.

Another important point to keep in mind is that some models may never "converge" and may provide only a
weak (or nonexistent) insight. That is, the experiment was not able to produce any results, which is a perfectly
acceptable outcome in many scientific endeavors.

Chapter Summary

In this chapter

• We defined data science as the art and science of discovering insight from data and building software
systems to apply this insight in a business context.
• We reviewed the history of data science in academia and industry, how it started in both the statistics and
machine learning communities and was made practical through innovation from big Internet companies like
Yahoo! and Google.
• We discussed the role of a data scientist as one with the combined skillset of a data engineer and an applied
scientist and the challenges with building a data science team.
• We looked at the data science life cycle, from asking the right question, through data quality control,
preprocessing, modeling, evaluation, and deployment to production.

Exercises

1. What is a working definition of Data Science?


See page 4

2. What is the difference between statistics and machine learning?


The statistical approach assumes the data are generated by a given stochastic (some random probability
distribution) data model while machine learning uses algorithmic models to characterize data and makes no
assumptions about the underlying model or mechanism. (page 7)

3. Outline the difference between a data engineer and an applied data scientist.
A data engineer can be considered an experienced software engineer who is highly skilled in building
production-grade software systems with a specialization in building fast (and often distributed or scalable)
data pipelines. A data scientist has a deeper understanding of the mathematics behind the tools and is
practiced at knowing what tools to use given an analytics problem. (pages 8 and 9)

4. Name the eight skill sets needed by a data scientist.


Distributed systems, data processing, computer science, software engineering, data analysis, experimental
design, machine learning, statistics (see page 10, Figure 1.3)

5. Using the chart in Figure 1.4, describe each component of the data science project life cycle. Explain where
the change from science to engineering happens.
(see page 14)
