Subtitle

So this is a crash course in data science.
And so the first question you might ask yourself is, what is data science?
So this is a blog post that I wrote for our blog, Simply Statistics, that talked about
how the key word in data science is science, and not data. So the key issue when you're
analyzing a data set, or when you're trying to use data to help your business or to help
your organization move forward, is to know that data science is only useful when you're
actually using that data to answer a specific, concrete question that could be useful
for your organization. So it turns out that this definition actually ends up in the
actual definition of data science in Wikipedia, and, as we know, Wikipedia is the main
source of most people's information, so we literally defined data science.

So a couple of examples of this will illustrate what I mean by data science.
So one of the examples that you hear about
a lot when you hear about data science is Moneyball. And so, in Moneyball, the idea
was, can we build a winning baseball team if we have a really limited budget?
Now, they used quantification of player skills, and a new metric that's more useful,
to answer that question. But the key underlying question that they were asking,
the key reason why this is a data science problem, was: could we use the data that
we collected to answer this specific question, which is building a low-budget
baseball team?

A second question would be, how do we find
the people who vote for Barack Obama and make sure that those people end
up at the polls on polling day? And so this is an example from
a study of Barack Obama's data team, where they went and they actually
tried to analyze the data and run experiments to identify those people. And they
ended up being a surprising group of people that weren't necessarily the moderate
voters that everybody thought they would be, that could be swayed to go out and
vote for Barack Obama. And so, this is again an example where
there was some high-level technical work, basically A/B testing on websites and
things like that, that was used to collect and identify the data
that they would use to answer the question. But at the core,
the data science problem was, can we use data to answer this
question of voter turnout, and the right kind of voter turnout, to make
sure a particular team wins an election?

So another data science
question is the Netflix Prize. So here the idea was, Netflix wants to
keep people watching movies, and so to get them to watch those movies, you
need to keep producing recommendations of movies that they might like to watch
after they've finished watching one. And so the question here is, how can we
show people movies that they'd like to see so they'll keep watching? And then
they used data, basically the preferences of other people like that person, to try
to predict which movies they would like. So this is another example where there
were some high-level machine learning techniques that were used
to do these predictions, but at the core the question was, how can we
identify movies that people will like?

And so I've talked a lot about how data
science is about answering questions with data, and that's definitely true, but there
are also some other components to the problem. So, data science is involved in
formulating those quantitative questions, identifying the data that could be used
to answer the questions, cleaning it, making it nice, then analyzing the data,
whether that's with machine learning, or with statistics, or
with neural networks or whatever, and then communicating that
answer to other people. And so another component of that, that
often gets left out in these discussions, is basically the engineering
component of it. So one example of that
is this Netflix Prize. So in the Netflix Prize, they had a whole
bunch of teams competing to try to predict how best to show people what
movies to watch next, and the team that won blended together a large
number of machine learning algorithms. In other words, they predicted the result
with a large number of machine learning algorithms and
then cleverly averaged them together. But it turns out that's really
computationally hard to do, and so Netflix never actually ended up implementing the
winning solution on their system, because there wasn't enough computing power to do
that at a scale where they could do it for all their customers.

So this is an
example of how there are different components to the data science process.
There's the actual data science, the actual learning from data, and discovering
what the right prediction model is. And then there's the implementation
component, which is often lumped into data engineering, which
is how you actually implement or scale that technology to be able to apply
it to, say, a large customer base or to a large number of people all at once.

And so there are these trade-offs
that always come up in data science: the trade-offs between interpretability
and accuracy, or interpretability and speed, or interpretability and
scalability, and so forth. So you can basically imagine that there
are all these different components to a model: whether it's
interpretable, simple, accurate, fast, and scalable. And you have to sort of make
judgments about which of those things are important for the particular problem
that you're trying to solve.

And so, another component of this is
being able to identify what's hype and what's not. And so, this is an example
of the hype cycle, where there's sort of the peak of inflated expectations, followed
by a period when everybody gets disillusioned with the technology, and
then a plateau of productivity. We're sort of just coming out to that part
with data science, where people are really starting to figure out how to use
data science to solve key problems. And so
we're about to see a lot of productivity. You can kind of think about it
as the 1999 for data science. So it's an exciting time to
be involved in it, because even relatively simple data
science tools, used well to answer very specific questions, can have a
major impact on you and your organization.
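
To make the blending idea from the Netflix Prize example a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the ratings, the three "models", and the function names `rmse` and `blend` are all made up for this example, and it uses a plain weighted average, not the winning team's actual method. It shows that when models make different kinds of errors, their average can be closer to the truth than any single model.

```python
from math import sqrt


def rmse(predictions, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actual)]
    return sqrt(sum(squared_errors) / len(actual))


def blend(model_predictions, weights=None):
    """Weighted average of several models' predictions (equal weights by default)."""
    n_models = len(model_predictions)
    if weights is None:
        weights = [1.0] * n_models
    total_weight = sum(weights)
    n_items = len(model_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, model_predictions)) / total_weight
        for i in range(n_items)
    ]


# Hypothetical true ratings for five movies (made-up data).
actual = [4.0, 3.0, 4.0, 2.0, 4.0]

# Hypothetical predictions from three different models (made-up data).
# Each one is wrong in a different way on each movie.
model_a = [4.6, 2.8, 3.2, 2.4, 4.3]
model_b = [3.6, 3.5, 4.2, 1.4, 3.5]
model_c = [4.1, 2.3, 4.5, 2.6, 3.8]

blended = blend([model_a, model_b, model_c])

for name, preds in [("A", model_a), ("B", model_b), ("C", model_c), ("blend", blended)]:
    print(f"model {name}: RMSE = {rmse(preds, actual):.3f}")
```

On this toy data the blend has a lower error than any single model because the individual errors partly cancel. The cost is the one described in the talk: every prediction now requires running all of the models, which is the kind of accuracy versus scalability trade-off that kept the winning solution out of production.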
