
Doing Data Science

Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• There’s been a lot of hype in the media about “data science”
and “Big Data,” and plenty of eyebrow-raising about both.
• There’s a lack of definitions around the most basic
terminology:
• What is “Big Data”? How big is big?
• What does “data science” mean?
• What is the relationship between Big Data and data science?
• Is data science the science of Big Data?
• Is data science only the stuff going on at tech companies like
Google and Facebook?
• Why do many people refer to Big Data as crossing disciplines,
but to data science as only taking place in tech?
• There’s a distinct lack of respect for the researchers in
academia and industry who have been working on this kind of
stuff for years: statisticians, computer scientists,
mathematicians, engineers, and scientists of all types.
• The way the media describes it, machine learning algorithms
were just invented last week and data was never “big” until
Google came along.
• The hype is crazy: it masks reality and increases
the noise-to-signal ratio.
• Statisticians already feel that they are studying
and working on the “Science of Data.”
• data science is not just a rebranding of
statistics or machine learning
Getting Past the Hype
• Rachel’s experience going from getting a PhD
in statistics to working at Google:
• The stuff I was working on at Google was different from
anything I had learned at school.
• What I’d learned in school provided a framework and way of
thinking that I relied on daily, and a solid theoretical and
practical foundation necessary to do my work.
• There were also many skills I had to acquire on the job at
Google that I hadn’t learned in school.
• I had a statistics background and picked up more
computation, coding, and visualization skills, as well as
domain expertise while at Google
• Another person coming in as a computer scientist or a social
scientist or a physicist would have different gaps and would fill
them in accordingly
• as individuals, we each had different strengths and gaps, yet
we were able to solve problems by putting ourselves together
into a data team
• In general, there’s a gap between what you learn in
school and what you do on the job.
Why Now?
• We have massive amounts of data about many aspects of our
lives, and
• an abundance of inexpensive computing power.
• Shopping, communicating, reading news, listening to music,
searching for information, expressing our opinions: all of it
generates data.
• So do our offline behaviors.
• It’s not just Internet data - finance, the medical industry,
• pharmaceuticals, bioinformatics, social welfare, government,
education, retail, and the list goes on
• Not only is there massive data, but the data itself, often
arriving in real time, becomes the building block of data products.
• On the Internet, this means Amazon recommendation
systems, friend recommendations on Facebook, film and
music recommendations
• In finance, this means credit ratings, trading algorithms, and
models
• In education, this is starting to mean dynamic personalized
learning and assessments
• In government, this means policies based on data.
• Technology makes this possible: infrastructure for large-scale
data processing, increased memory, and bandwidth that simply
didn’t exist a decade ago.
Datafication
• Definition: a process of “taking all aspects of
life and turning them into data.”

• For Example:
– "Google's augmented-reality glasses “datafy” the
gaze.
– Twitter “datafies” stray thoughts.
– Linkedin “datafies” professional networks:
• We are being datafied; or rather, our actions are.
• When we “like” someone or something online, we intend to
be datafied.
• When we merely browse the Web, we are unintentionally, or
at least passively, datafied.
• When we walk around in a store, or even on the street, we
are datafied in a completely unintentional way, via
sensors, cameras, or Google glasses.
• Once we datafy things, we can transform their purpose and
turn the information into new forms of value.
The Current Landscape (with a Little History)
• So, what is data science?
• Is it new, or is it just statistics or analytics rebranded?
• Is it real, or is it pure hype?
• If it’s new and if it’s real, what does that mean?
• one way to understand what’s going on in this industry is to
look online and see what current discussions are taking place.
Metamarkets CEO Mike Driscoll’s answer, citing the skills of
data geeks:
• Statistics (traditional analysis)
• Data munging (parsing, scraping, and
formatting data)
• Visualization (graphs, tools, etc.)
Drew Conway’s Venn diagram of data science from 2010:
hacking skills, math and statistics knowledge, and
substantive expertise.
Data Science Jobs
Job descriptions ask candidates to be experts in:
• computer science,
• statistics,
• communication, and
• data visualization,
and to have extensive domain expertise.
Observation: Nobody is an expert in everything, which is
why it makes more sense to create teams of people who
have different profiles and different expertise; together, as
a team, they can specialize in all those things.
Data Science Profile
Data Science Team
What is Data Science, Really?
• In Academia: an academic data scientist is a scientist, trained in
anything from social science to biology, who works with large
amounts of data, and must grapple with computational problems
posed by the structure, size, messiness, and the complexity and
nature of the data, while simultaneously solving a real-world
problem.

• In Industry: Someone who knows how to extract meaning from and
interpret data, which requires both tools and methods from
statistics and machine learning, as well as being human. She spends
a lot of time in the process of collecting, cleaning, and “munging”
data, because data is never clean. This process requires persistence,
statistics, and software engineering skills; skills that are also
necessary for understanding biases in the data, and for debugging
logging output from code.
Doing Data Science

Chapter 2, Pages 15 - 34
Big Data Statistics (pages 17-33)

• Statistical thinking in the Age of Big Data


• Statistical Inference
• Populations and Samples
• Big Data Examples
• Big Assumptions due to Big Data
• Modeling
Statistical Thinking in the Age of Big
Data
What is Big Data?
• First, it is a bundle of technologies.
• Second, it is a potential revolution in
measurement.
• And third, it is a point of view, or philosophy,
about how decisions will be—and perhaps
should be—made in the future.
Statistical Thinking – Age of Big Data
• Prerequisites – skills:
– Basic: stats, linear algebra, coding.
– Analytical: data preparation, modeling,
visualization, communication.
Statistical Inference
• The world: complex, random, uncertain; a big data-generating
machine.
• As we commute to work on subways and in cars,
• as we shop, email, browse the Internet, and watch the stock
market,
• as we’re building things, eating things,
• and talking to our friends and family about things,
• all of these processes potentially produce data.
– Data are small traces of real-world processes.
– Which traces we gather are decided by our data
collection or sampling method.

• Note: two forms of randomness exist:
– Underlying the process (a system property)
– Collection methods (human error)
• We need a solid method to extract meaning and
information from random, dubious data.
– This is Statistical Inference!
• This overall process of going from the world to the data,
and then from the data back to the world, is the field of
statistical inference.
• More precisely, statistical inference is the discipline that
concerns itself with the development of procedures,
methods, and theorems that allow us to extract meaning
and information from data that has been generated by
stochastic (random) processes.
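A minimal sketch in R (the language used in the EDA recipes later in
these notes) of that round trip, with a made-up stochastic process: we
simulate noisy observations around a known true mean, then use only
the sample to infer it.
> set.seed(42)
> data <- rnorm(100, mean = 10, sd = 2)  # the "world": a stochastic process, true mean 10
> mean(data)                             # point estimate: from the data back to the world
> t.test(data)$conf.int                  # 95% confidence interval for the true mean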
Populations and Samples
• Population: not necessarily the population of India or of
the world.
• It could be any set of objects or units, such as
tweets or photographs or stars.
• N = the size of the set of observations.
• population - all emails sent last year by
employees at a huge corporation, BigCorp
• Observation - sender’s name, the list of recipients,
date sent, text of email, number of characters in the
email, number of sentences in the email, number of
verbs in the email, and the length of time until first
reply.
• Sample: a subset of the units, of size n, taken in order
to examine the observations and to draw
conclusions and make inferences about the
population.
• In the BigCorp email example: take the list of all the
employees, select 1/10th of those people at random, and
take all the email they ever sent. (See the sketch below.)
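A hedged sketch of that sampling step in R, assuming a hypothetical
character vector employees holding every BigCorp staff name:
> # employees is a hypothetical vector of all BigCorp staff names
> set.seed(1)
> sampled <- sample(employees, size = floor(length(employees) / 10))
> # next step: collect all email ever sent by the people in `sampled`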
Populations and Samples of Big
Data
• If we had all the email in the first place, why
would we need to take a sample?
• Sampling solves some engineering challenges; the data
you need at hand really depends on what your goal is.
• Bias: conclusions drawn from the data should not be
extended beyond the users (or units) it actually covers.
• Sampling introduces a sampling distribution.
• New kinds of data:
• Traditional: numerical, categorical, or binary
• Text: emails, tweets, articles
• Records: user-level data, time-stamped event
data, JSON-formatted log files (see the sketch after this list)
• Geo-based location data
• Network
• Sensor data
• Images
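As a small illustration of one of these newer formats, JSON-formatted
logs can be pulled into R with the jsonlite package; the file name
here is a hypothetical placeholder:
> library(jsonlite)                  # assumes the jsonlite package is installed
> events <- fromJSON("events.json")  # hypothetical JSON log file
> str(events)                        # inspect the resulting structure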
Big Data Domain - Assumptions
• Other Bad or Wrong Assumptions
– N = 1 vs. N = ALL (multiple layers) (pages 25-26)
• Big Data introduces a 2nd degree to the data context.
• There are infinite levels of depth and breadth in the data.
• Individuals become populations. Populations become
populations of populations – to the nth degree. (meta-data)
– My Example:
• 1 billion Facebook posts (one from each user) vs. 1 billion
Facebook posts from one unique user.
• 1 billion tweets vs. 1 billion images from one unique user.
• Danger: Drawing conclusions from incomplete
populations. Understand the boundaries/context.
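A quick sanity check in R along the lines of my example, assuming a
hypothetical posts data frame with a user_id column: the same N can
hide very different populations.
> nrow(posts)                    # total number of posts (N)
> length(unique(posts$user_id))  # how many distinct users produced them
> # N users behind N posts vs. 1 user behind N posts are very
> # different populations, and support very different conclusions.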
Modeling
• What’s a model?

• Data model
• Statistical model – fitting?
• Mathematical model
What is a model?
• A model is our attempt to understand and
represent the nature of reality through a
particular lens.
• A model is an artificial construction where all
extraneous detail has been removed or abstracted away.
• A model can be expressed as mathematical expressions
or as a diagram of data flow.
How do you build a model?
• Understand the relationships between data
variables – EDA.
• The building blocks of these models are probability
distributions.
Probability Distributions
• Probability distributions are the foundation of
statistical models.
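As a small base-R illustration (not from the book), the density
curves of two common distributions can be drawn with curve():
> curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
+       main = "Standard normal density")
> curve(dexp(x, rate = 1), from = 0, to = 5,
+       main = "Exponential density")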
Fitting a model
• Fitting a model means that you estimate the
parameters of the model using the observed
data.
• Overfitting: you use your dataset to estimate the
parameters of your model, but the model isn’t that
good at capturing reality beyond your sampled data.
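A minimal fitting sketch in R, with simulated data so the true
parameters are known; the 10th-degree polynomial at the end is a
deliberately overflexible model of the kind that overfits:
> set.seed(1)
> x <- runif(50, 0, 10)
> y <- 3 + 2 * x + rnorm(50, sd = 1)  # true intercept 3, true slope 2
> fit <- lm(y ~ x)                    # fitting = estimating the parameters
> coef(fit)                           # estimates should land near 3 and 2
> overfit <- lm(y ~ poly(x, 10))      # tracks this sample closely,
>                                     # but generalizes poorly beyond it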
Doing Data Science

Chapter 2, Pages 34 - 50
Exploratory Data Analysis (EDA)
• EDA is a critical part of the data science process.
• It represents a philosophy, or way of doing,
statistics.
• There are no hypotheses and there is no model.
• You need to understand how the data is behaving,
• and the best way to do that is to look at it and get
your hands dirty (EDA).
Basic tools
• The basic tools of EDA are plots, graphs and
summary statistics
Exploratory Data Analysis (EDA)
• it’s a method of systematically going through the
data:
• plotting distributions of all variables (using box
plots), plotting time series of data, transforming
variables,
• looking at all pairwise relationships between
variables using scatterplot matrices,
• generating summary statistics for all of them :
computing their mean, minimum, maximum, the
upper and lower quartiles, and identifying outliers.
• EDA is a mindset about your relationship with the
data, aimed at better understanding it:
• gain intuition about the data,
• understand the shape of it,
• connect your understanding of the process
that generated the data.
• EDA happens between you and the data and
isn’t about proving anything to anyone else.
Philosophy of EDA
• There are important reasons anyone working with
data should do EDA:
• to gain intuition about the data;
• to make comparisons between distributions;
• for sanity checking (making sure the data is on the
scale you expect);
• to find out where data is missing or if there are
outliers;
• and to summarize the data.
• Plotting data and making comparisons can get
you extremely far,
• and is far better to do than getting a dataset
and immediately running a model just
because you know how.
EDA methods
Read the data
1. Read the data from auto-mpg.csv
> auto <- read.csv("auto-mpg.csv")
2. Get the summary statistics:
>summary(auto)
3. Use the str() function for an overview of the data
frame:
> str(auto)
Finding the mean and standard
deviation
> mean(auto$mpg)
> sd(auto$mpg)
Extracting a subset of a dataset
1. Index by position. Get model_year and
car_name for the first three cars:
> auto[1:3, 8:9]
> auto[1:3, c(8,9)]
2. Index by name. Get model_year and car_name
for the first three cars:
> auto[1:3,c("model_year", "car_name")]
3. Retrieve all details for cars with the highest or
lowest mpg
> auto[auto$mpg == max(auto$mpg) |
auto$mpg == min(auto$mpg), ]
Excluding columns
> auto[,c(-1,-9)]
> auto[,-c(1,9)]
Generating standard plots such as
histograms, boxplots, and scatterplots
• Histograms (columns referenced via auto$ so no attach() is needed):

> hist(auto$acceleration)
You can customize everything:
> hist(auto$acceleration, col = "blue",
+      xlab = "Acceleration",
+      ylab = "Frequency",
+      main = "Histogram of acceleration")
Boxplots
• boxplot for mpg:
> boxplot(auto$mpg, xlab = "Miles per gallon")
• boxplot of mpg by model year (mpg on the y-axis; the original
xlab mislabeled the axis):
> boxplot(mpg ~ model_year, data = auto,
+         xlab = "Model year", ylab = "Miles per gallon")
• boxplot of mpg by cylinders:
> boxplot(mpg ~ cylinders, data = auto)
Scatterplots
• scatterplot of mpg by horsepower:
> plot(mpg ~ horsepower, data = auto)
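Scatterplot matrices
• all pairwise relationships at once (the scatterplot-matrix idea
from the EDA checklist); the column names assume the auto-mpg
data loaded earlier:
> pairs(auto[, c("mpg", "horsepower", "weight", "acceleration")])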
Data Science Process
A Data Scientist’s Role in This Process
