John B. Rollins, Ph.D.
IBM Analytics | IBM Corporation
Foundational Data Science Methodology
2015 IBM Corporation
Introduction
Why we are interested in data science
- Solve problems and answer questions
- Gain useful insights through modeling to predict outcomes or discover
underlying patterns
Rapidly evolving technologies
- Platform growth
- In-database analytics
- Text analysis
- Automation
Data science methodology
Why?
- To provide a guiding strategy
What?
- General strategy that guides the processes and activities within a given
domain
- Does not depend on particular technologies or tools
- Not a set of techniques or recipes
- Provides the data scientist with a framework for how to proceed to obtain
answers
Methodology diagram
[Diagram: the methodology shown as an iterative cycle of stages: Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]
Business understanding
Every project begins with business understanding.
- Clearly defining project objectives and requirements from the business
perspective is key to a successful solution
- Business sponsors are most critical in this stage
  - Define problem and solution requirements
- Business sponsors stay involved throughout the project
  - Provide domain expertise
  - Review intermediate findings
  - Ensure that the work generates the intended solution
Analytic approach
With a clear definition of the business problem, we define the analytic
approach to solving the problem.
- Express the problem in the context of statistical and machine learning techniques
- Identify suitable technique(s)
- Examples
  - Classification to predict response to a promotion ("yes" or "no")
  - Clustering and associations for customer segmentation and market basket
    analysis
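As a rough illustration of the associations example, the sketch below counts how often pairs of items appear together in the same basket. The transactions and the `pair_support` helper are hypothetical and not part of the methodology; a real project would more likely use a dedicated association-rules algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
```

Pairs with high support are candidates for market basket rules; the cut-off to use is a business decision.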
Data compilation
The chosen analytic approach determines the
data requirements.
- Content, formats, representations
Initial data collection is performed.
- Available data resources (structured, unstructured,
semi-structured) relevant to the problem domain
- Decide whether to obtain less-accessible data
elements
- Revise data requirements or collect more data,
if needed
Then data understanding is gained.
- Descriptive statistics and visualization
- Content, quality, initial insights about data
- Additional data collection to fill gaps, if needed
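A minimal sketch of the descriptive statistics step, assuming a single numeric column in which `None` marks a missing value. The `ages` data and the `describe` helper are made up for illustration.

```python
import statistics

# Hypothetical column of customer ages; None marks a missing value.
ages = [34, 45, None, 29, 45, 61, None, 38]

def describe(values):
    """Summarize content and quality of one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }

summary = describe(ages)
```

A high `missing` count is the kind of gap that might prompt additional data collection.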
Data preparation
Data preparation encompasses all activities to construct the data set.
- Data cleaning
  - Missing or invalid values
  - Eliminating duplicate rows
  - Formatting properly
- Combining multiple data sources
- Transforming data
- Feature engineering
- Text analysis
Accelerate data preparation by
automating common steps
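The cleaning steps listed above could be automated along these lines. This is only a sketch: the record layout, the field names, and the choice to store unparseable amounts as `None` are assumptions for illustration.

```python
# Hypothetical raw records, as might result from combining two sources.
raw_rows = [
    {"id": "1", "city": " boston ", "spend": "120.50"},
    {"id": "2", "city": "Chicago", "spend": ""},
    {"id": "1", "city": " boston ", "spend": "120.50"},  # exact duplicate
    {"id": "3", "city": "DENVER", "spend": "n/a"},
]

def clean(rows):
    """Drop duplicate rows, normalize formatting, and mark invalid values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # eliminate duplicate rows
        seen.add(key)
        try:
            spend = float(row["spend"])
        except ValueError:
            spend = None  # missing or invalid value
        cleaned.append({
            "id": int(row["id"]),
            "city": row["city"].strip().title(),  # consistent formatting
            "spend": spend,
        })
    return cleaned

prepared = clean(raw_rows)
```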
Modeling
Modeling focuses on developing models.
- Predictive or descriptive models
- According to the previously-defined analytic approach
- Training set for predictive modeling
Highly iterative process
- Intermediate insights lead to refinements in data preparation & model specification
- Multiple algorithms & parameters to find best model for a given technique
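To illustrate trying several parameter values for one technique, the sketch below fits a simple k-nearest-neighbors classifier for a few values of `k` and keeps the one that scores best on held-out data. The data, the candidate values, and the helper names are hypothetical, and k-nearest-neighbors is just one possible technique.

```python
from collections import Counter

# Hypothetical labeled data: (feature value, class label).
train = [(1.0, "no"), (1.5, "no"), (1.8, "yes"), (2.0, "no"),
         (3.0, "yes"), (3.5, "yes"), (4.0, "yes")]
validation = [(1.2, "no"), (1.75, "no"), (2.9, "yes"), (3.8, "yes")]

def knn_predict(train_set, x, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train_set, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_set, eval_set, k):
    """Fraction of held-out points that are predicted correctly."""
    hits = sum(knn_predict(train_set, x, k) == y for x, y in eval_set)
    return hits / len(eval_set)

candidate_ks = [1, 3, 5]  # odd values avoid tied votes with two classes
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)  # first k with the highest score
```

With this toy data, `k = 1` is misled by the one noisy training point near 1.8, while larger values of `k` smooth it out.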
Model evaluation
Model evaluation is performed during model development and before
model deployment.
- Understand the model's quality
- Ensure that it properly addresses the business problem
Diagnostic measures
- Suitable to the modeling technique used
- Testing set
- Refine model as needed
Statistical significance tests
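For a classification model, diagnostic measures on a testing set might be computed as in this sketch. The labels and predictions are invented, and accuracy, precision, and recall are only one possible choice of measures; the right ones depend on the modeling technique and the business problem.

```python
# Hypothetical testing-set labels and model predictions
# for a promotion-response classifier.
actual    = ["yes", "no", "yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

def evaluate(actual, predicted, positive="yes"):
    """Compute accuracy, precision, and recall from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = evaluate(actual, predicted)
```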
Deployment and feedback
Once finalized, the model is deployed into a production environment.
- May be in a limited / test environment until model is proven
- Involves additional groups, skills, and technologies
  - Solution owner
  - Marketing
  - Application developers
  - IT administration
Feedback to assess model performance
- Gathering and analysis of feedback to assess
the model's performance and impact
- Iterative process for model refinement and redeployment
- Accelerate through automated processes
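One way to automate part of the feedback loop is a simple check that flags the model for refinement when its recent accuracy falls too far below the level measured at evaluation time. The `needs_refinement` helper, the baseline of 0.75, and the tolerance of 0.05 are all assumptions for illustration.

```python
def needs_refinement(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Flag the model when recent accuracy drops more than
    `tolerance` below the baseline.

    recent_outcomes holds True where a deployed prediction proved correct.
    """
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance, recent_accuracy

# Hypothetical feedback gathered after deployment.
feedback = [True, True, False, True, False, False, True, False, True, False]
flag, recent = needs_refinement(0.75, feedback)
```

Here the recent accuracy is well below the baseline, so the check would signal that the model is due for refinement and redeployment.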
Ongoing value through good methodology
The methodology diagram illustrates the iterative nature of problem-solving in
a data science project.
Through feedback, refinement, and redeployment, models are continually
improved and adapted to evolving conditions.
The model continues to provide value to the organization for as long as
the solution is needed.