Professional Documents
Culture Documents
Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• There’s been a lot of hype in the media about
“data science” and “Big Data.”
eyebrow-raising about Big Data
and data science
• There’s a lack of definitions around the most basic
terminology
• What is “Big Data ? How big is big?
• What does “data science” mean?
• relationship between Big Data and data science?
• Is data science the science of Big Data?
• Is data science only the stuff going on in companies like
Google and Facebook and tech companies?
• Why do many people refer to Big Data as crossing disciplines
and
• data science as only taking place in tech?
• There’s a distinct lack of respect for the researchers in
academia and industry
• who have been working on this kind of stuff for years
• work by statisticians, computer scientists, mathematicians,
engineers, and scientists of all types.
• media describes it, machine learning algorithms were just
invented last week and data was never “big” until Google
came along
• The hype is crazy - masks reality and increases
the noise-to-signal ratio.
• Statisticians already feel that they are studying
and working on the “Science of Data.”
• data science is not just a rebranding of
statistics or machine learning
• Statisticians already feel that they are studying
and working on the “Science of Data.”
Getting Past the Hype
• Rachel’s experience going from getting a PhD
in statistics to working at Google
• the stuff I was working on at Google was different than
anything I had learned at school
• what I’d learned in school provided a framework and way of
thinking that I relied on daily,
• provided a solid theoretical and practical foundation
necessary to do my work.
• also many skills I had to acquire on the job at Google that I
hadn’t learned in school
• I had a statistics background and picked up more
computation, coding, and visualization skills, as well as
domain expertise while at Google
• Another person coming in as a computer scientist or a social
scientist or a physicist would have different gaps and would fill
them in accordingly
• as individuals, we each had different strengths and gaps, yet
we were able to solve problems by putting ourselves together
into a data team
• It’s a general, there’s a gap between what you learned in
school and what you do on the job.
Why Now
• We have massive amounts of data about many aspects of our
lives, and
• an abundance of inexpensive computing power.
• Shopping, communicating, reading news, listening to music,
searching for information, expressing our opinions
• Also offline behaviors
• It’s not just Internet data - finance, the medical industry,
• pharmaceuticals, bioinformatics, social welfare, government,
education, retail, and the list goes on
• Not only the massive data but data itself, often in real time,
becomes the building blocks of data products
• On the Internet, this means Amazon recommendation
systems, friend recommendations on Facebook, film and
music recommendations
• In finance, this means credit ratings, trading algorithms, and
models
• In education, this is starting to mean dynamic personalized
learning and assessments
• government, this means policies based on data.
• Technology makes this possible: infrastructure for large-scale
data processing, increased memory, and bandwidth wasn’t
true a decade ago
Datafication
• Definition: A process of "taking all aspects of
life and turning them into data:''
• For Example:
– "Google's augmented-reality glasses “datafy” the
gaze.
– Twitter “datafies” stray thoughts.
– Linkedin “datafies” professional networks:
• We are being datafied, or rather our actions are
• when we “like” someone or something online, we are
intending to be datafied.
• when we merely browse the Web, we are unintentionally, or
at least passively datafied
• When we walk around in a store, or even on the street, we
are being datafied in a completely unintentional way, via
sensors, cameras, or Google glasses.
• Once we datafy things, we can transform their purpose and
turn the information into new forms of value.
The Current Landscape (with a
Little History)
• So, what is data science?
• Is it new, or is it just statistics or analytics rebranded?
• Is it real, or is it pure hype?
• if it’s new and if it’s real, what does that mean?
• one way to understand what’s going on in this industry is to
look online and see what current discussions are taking place.
Metamarket CEO Mike
Driscoll’s answer:
• Drew Conway's Venn diagram of data science
from 20l0,
skills of data geeks
• Statistics (traditional analysis)
• Data munging (parsing, scraping, and
formatting data)
• Visualization (graphs, tools, etc.)
Data Science Jobs
Job descriptions:
• experts in computer science,
• statistics,
• communication,
• data visualization, and to have
• extensive domain expertise
Chapter 2, Pages 15 - 34
Big Data Statistics (pages 17 -33)
• Prerequisites – skills
– basic: stats, linear algebra, coding.
• Data model
• Statistical model – fitting?
• Mathematical model
What is a model?
• A model is our attempt to understand and
represent the nature of reality through a
particular lens.
• A model is an artificial construction where all
extraneous detail has been removed or
abstracted
• The mathematical expressions
Chapter 2, Pages 34 - 50
Exploratory Data Analysis (EDA)
• EDT is a critical part of data science process.
• Represents a philosophy or way of doing
statistics.
• No hypotheses and there is no model.
• need to understand how the data is behaving,
• best way to do that is looking at it and getting
your hands dirty (EDA).
Basic tools
• The basic tools of EDA are plots, graphs and
summary statistics
Exploratory Data Analysis (EDA)
• it’s a method of systematically going through the
data:
• plotting distributions of all variables (using box
plots), plotting time series of data, transforming
variables,
• looking at all pairwise relationships between
variables using scatterplot matrices,
• generating summary statistics for all of them :
computing their mean, minimum, maximum, the
upper and lower quartiles, and identifying outliers.
• Is a mindset about your relationship with the
data to better understand
• Gain intuition,
• understand the shape of it,
• connect your understanding of the process
that generated the data.
• EDA happens between you and the data and
isn’t about proving anything to anyone else.
Philosophy of EDA
• There are important reasons anyone working with
data should do EDA.
• gain intuition about the data;
• to make comparisons between distributions;
• for sanity checking (data is on the scale you
expect);
• to find out where data is missing or if there are
outliers;
• and to summarize the data..
• Plotting data and making comparisons can get
you extremely far,
• and is far better to do than getting a dataset
and immediately running a model just
because you know it
EDA methods
Read the data
1. Read the data from auto-mpg.csv
>auto <- read.csv("auto-mpg.csv“)
2. Get the summary statistics:
>summary(auto)
str() function for an overview of a data
frame
> str(auto)
Finding the mean and standard
deviation
> mean(auto$mpg)
> sd(auto$mpg)
Extracting a subset of a dataset
1. Index by position. Get model_year and
car_name for the first three cars:
> auto[1:3, 8:9]
> auto[1:3, c(8,9)]
2. Index by name. Get model_year and car_name
for the first three cars:
> auto[1:3,c("model_year", "car_name")]
3. Retrieve all details for cars with the highest or
lowest mpg
> auto[auto$mpg == max(auto$mpg) |
auto$mpg == min(auto$mpg), ]
Excluding columns
> auto[,c(-1,-9)]
> auto[,-c(1,9)]
Generating standard plots such as
histograms, boxplots, and scatterplots
• Histograms
> hist(acceleration)
You can customize everything
> hist(acceleration, col="blue",
xlab = "acceleration",
Ylab = “Frequency”,
main = "Histogram of acceleration")
Boxplots
• boxplot for mpg :
> boxplot(mpg, xlab = "Miles per gallon")
Boxplot for mpg ~ model year
> boxplot(mpg ~ model_year, xlab = "Miles per
gallon")
boxplot of mpg by cylinders:
> boxplot(mpg ~ cylinders)
Scatterplots
• scatterplot for mpg by horsepower:
> plot(mpg ~ horsepower)
Data Science Process
A Data Scientist’s Role in This process