ask yourself is, what is data science? So this is a blog post that I wrote for our blog, Simply Statistics, that talked about how the key word in data science is science, not data. So the key issue when you're analyzing a data set, or when you're trying to use data to help your business or your organization move forward, is to know that data science is only useful when you're actually using that data to answer a specific, concrete question that could be useful for your organization. It turns out that this definition actually ended up in the definition of data science on Wikipedia, and, as we know, Wikipedia is the main source of most people's information, so we literally defined data science. So a couple of examples will illustrate what I mean by data science. One of the examples that you hear about a lot when you hear about data science is Moneyball. In Moneyball, the idea was, can we build a winning baseball team if we have a really limited budget? Now, they used quantification of player skills, and a new metric that's more useful for answering that question. But the key underlying question that they were asking, the key reason why this is a data science problem, was: could we use the data that we collected to answer this specific question, which is how to build a low-budget baseball team? A second question would be, how do we find the people who vote for Barack Obama and make sure that those people end up at the polls on polling day? This is an example from a study of Barack Obama's data team, where they actually analyzed the data and ran experiments to identify those people. And they ended up being a surprising group of people, who weren't necessarily the moderate voters that everybody thought they would be, who could be swayed to go out and vote for Barack Obama.
And so this is again an example where there was some high-level technical work, basically A/B testing on websites and things like that, used to collect and identify the data that they would use to answer the question. But at the core, the data science problem was, can we use data to answer this question of voter turnout, and the right kind of voter turnout, to make sure a particular team wins an election? Another data science question is the Netflix prize. Here the idea was, Netflix wants to keep people watching movies, and to get them to watch those movies, you need to keep producing recommendations of movies that they might like to watch after they've finished watching one. And so the question here is, how can we show people movies that they'd like to see so they'll keep watching? And then they use data, basically the preferences of other people like that person, to try to predict which movies they would like. So this is another example where there are some high-level machine learning techniques that were used to do these predictions, but at the core the question was, how can we identify movies that people will like? And so I've talked a lot about how data science is about answering questions with data, and that's definitely true, but there are also some other components to the problem. Data science is involved in formulating those quantitative questions, identifying the data that could be used to answer the questions, cleaning it, making it nice, then analyzing the data, whether that's with machine learning, or with statistics, or with neural networks or whatever, and then communicating that answer to other people. And another component of that, which often gets left out in these discussions, is basically the engineering component of it. So one example of that is this Netflix prize.
So in the Netflix prize, they had a whole bunch of teams competing to try to predict how best to show people what movies to watch next, and the team that won blended together a large number of machine learning algorithms. In other words, they predicted the result with a large number of machine learning algorithms and then cleverly averaged them together. But it turns out that's really computationally hard to do, and so Netflix never actually ended up implementing the winning solution on their system, because there wasn't enough computing power to do that at a scale where they could do it for all their customers. So this is an example of how there are different components to the data science process. There's the actual data science, the actual learning from data and discovering what the right prediction model is. And then there's the implementation component, which is often lumped into data engineering, which is how you actually implement or scale that technology to be able to apply it to, say, a large customer base or a large number of people all at once. And so there are these trade-offs that always come up in data science: the trade-offs between interpretability and accuracy, or interpretability and speed, or interpretability and scalability, and so forth. So you can basically imagine that there are all these different components to a model, whether it's interpretable, simple, accurate, fast, and scalable. And you have to make judgments about which of those things are important for the particular problem that you're trying to solve. Another component of this is being able to identify what's hype and what's not. And so this is an example of the hype cycle, where there's the peak of inflated expectations, followed by the point when everybody gets disillusioned with the technology, and then a plateau of productivity.
We're sort of just coming to that part with data science, where people are really starting to figure out how to use data science to solve key problems. And so we're about to see a lot of productivity. You can kind of think about it as 1999 for data science. So it's an exciting time to be involved in it, because even relatively simple data science tools, used well to answer very specific questions, can have a major impact on you and your organization.
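
To make the Netflix prize point about blending a bit more concrete, here is a minimal sketch of what averaging several models' predictions can look like. It is not the winning team's method. The ratings and the three sets of model predictions are made up for illustration, and the weighting rule (weight each model by the inverse of its mean squared error on held-out ratings) is just one simple assumption about how the averaging could be done.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def blend_weights(model_predictions, actual):
    """Weight each model by the inverse of its mean squared error.

    This rule is an assumption chosen for simplicity; better-performing
    models get larger weights, and the weights sum to 1.
    """
    inverse_mse = [1.0 / max(rmse(p, actual) ** 2, 1e-12) for p in model_predictions]
    total = sum(inverse_mse)
    return [w / total for w in inverse_mse]


def blend(model_predictions, weights):
    """Weighted average of the models' predictions, one value per rating."""
    n_ratings = len(model_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, model_predictions))
        for i in range(n_ratings)
    ]


# Made-up held-out ratings (1 to 5 stars) and made-up predictions
# from three hypothetical models.
actual = [4.0, 3.0, 5.0, 2.0, 4.0]
model_predictions = [
    [4.5, 3.5, 4.5, 2.5, 3.5],
    [3.5, 2.5, 5.5, 1.5, 4.5],
    [4.0, 3.0, 4.0, 3.0, 4.0],
]

weights = blend_weights(model_predictions, actual)
blended = blend(model_predictions, weights)

for index, preds in enumerate(model_predictions):
    print(f"model {index} RMSE: {rmse(preds, actual):.3f}")
print(f"blended RMSE: {rmse(blended, actual):.3f}")
```

In this toy data the models make different mistakes, so the blend does better than any single model. It also shows the engineering cost in miniature: to serve one blended prediction, every model in the blend has to produce its own prediction first, and that cost grows with the number of models and the number of customers.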