Big Data CH01
In This Chapter
Data science has recently become a common topic of conversation at almost every data-driven organization.
Along with the term “big data,” the rise of the term “data science” has been so rapid that it’s frankly confusing.
What exactly is data science and why has it suddenly become so important?
In this chapter we provide an introduction to data science from a practitioner’s point of view, explaining some of
the terminology around it, and looking at the role a data scientist plays in this era of big data.
Companion materials to Practical Data Science with Hadoop and Spark by Ofer Mendelevitch, Casey Stella, and Douglas Eadline.
Copyright © 2017 Pearson Education, Inc. PowerPoints copyright © 2019; version 1.0, prepared January 2019.
Chapter Outline
Figure 1.1 Iterative process of data science discovery.
Figure 1.2 Search ads shown on an Internet search results page.
Figure 1.3 The skillset of the data scientist.
Figure 1.4 Expanded version of Figure 1.1 further illustrating the iterative nature of data science.
Chapter Discussion
This definition emphasizes two key aspects of data science, which are analogous to the traditional practice of
science and engineering.
The figure on page 4 also emphasizes the cyclic nature of inquiry. It is rare that the first hypothesis does not
need refinement.
Almost all of the large Internet companies use some form of machine learning, and many are moving into AI-based
methods. Note: the terms AI and machine learning are often used to mean the same thing. We prefer the following
definition to help students understand the difference.
In addition, the advent of "cheap storage" has made data retention an easy decision for most modern
enterprises. And, since they have the data, why not use it?
A data scientist, on the other hand, is practiced at knowing which tools to use for a given analytics problem. They
often possess a good understanding of the math behind the algorithms and know the trade-offs associated with various
methodologies. They usually have experience in research methods but are not "academic" in nature. That is,
the results of their research are often used in production environments and are not the subject of research papers
or presentations. This situation is similar to that of an industrial chemist who develops new compounds that result
in new products.
Becoming a data scientist is usually a transition rather than a creation. A computer engineer needs to gain an
understanding of statistics and machine learning and of how to investigate a problem. An applied scientist making
the transition will need programming skills and the ability to build solutions from many different components.
Point out that the "data scientist" is a relatively new role, and its definition (and the needed skill sets) are not
written in stone. In general, a data scientist is a blend of data engineer and applied scientist, supported by a
set of necessary "soft skills."
Engineering (implementation):
1. Deploy to Production
2. Suggest another question
Another important point to keep in mind is that some models may never "converge" and may provide only a
weak (or nonexistent) insight. That is, the experiment was not able to produce any results, which is a perfectly
acceptable outcome in many scientific endeavors.
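The choice between the two engineering outcomes above can be sketched in a few lines of Python. This is only an illustration and not code from the book: the toy threshold "model", the function names, and the 0.8 accuracy cutoff are all assumptions made for the example. It shows one pass through quality control, modeling, and evaluation, ending in either deployment or a return to the question stage.

```python
from statistics import mean


def quality_control(rows):
    """Drop records with missing values (a stand-in for data quality control)."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_threshold(rows):
    """Toy model (hypothetical): the midpoint between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' agrees with the label."""
    return mean(1 if (x > threshold) == (y == 1) else 0 for x, y in rows)


def run_experiment(train, test, min_accuracy=0.8):
    """One pass through the cycle; the 0.8 cutoff is an arbitrary assumption."""
    train, test = quality_control(train), quality_control(test)
    threshold = fit_threshold(train)
    score = accuracy(threshold, test)
    if score >= min_accuracy:
        decision = "deploy to production"
    else:
        decision = "suggest another question"
    return score, decision
```

A weak score here is not a failure of the process. It simply sends the team back to refine the hypothesis, which matches the iterative picture in Figure 1.4.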
Chapter Summary
In this chapter
• We defined data science as the art and science of discovering insight from data and building software
systems to apply this insight in a business context.
• We reviewed the history of data science in academia and industry, how it started in both the statistics and
machine learning communities and was made practical through innovation from big Internet companies like
Yahoo! and Google.
• We discussed the role of a data scientist as one with the combined skillset of a data engineer and an applied
scientist and the challenges with building a data science team.
• We looked at the data science life cycle, from asking the right question, through data quality control,
pre-processing, modeling, evaluation, and deployment to production.
Exercises
3. Outline the difference between a data engineer and an applied data scientist.
A data engineer can be considered an experienced software engineer who is highly skilled in building
production-grade software systems with a specialization in building fast (and often distributed or scalable)
data pipelines. A data scientist has a deeper understanding of the mathematics behind the tools and is
practiced at knowing what tools to use given an analytics problem. (pages 8 and 9)
5. Using the chart in Figure 1.4, describe each component of the data science project life cycle. Explain where
the change from science to engineering happens.
(see page 14)