
Topic 2: Business in practice and
the CRISP-DM framework


Vincent Hoang (2022), Lecture 2
Camm et al. (2016), Chapter 1
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Paper presented
at the Proceedings of the 4th international conference on the practical applications of knowledge discovery and
data mining.
A CRISP-DM framework
• CRISP-DM – CRoss Industry Standard Process for Data Mining.
• Developed in the 1990s by a group of five companies: SPSS, Teradata, Daimler AG, NCR, and OHRA.
• An open standard process model that describes common approaches used by data mining experts; it is the most widely used analytics model.

Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Paper presented at the Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining.
What is data mining?
• Data mining is a complex process of examining large sets of data to identify patterns and then using them for valuable business insights.
• Data mining can be considered part of descriptive and predictive analytics.
◦ In descriptive analytics, data-mining tools identify patterns in data.
◦ In predictive analytics, data-mining tools predict the future or support better decisions that will impact future performance.
A CRISP-DM framework
• The arrows show the most important dependencies between stages.
• This sequence is not fixed.
• The cycle reflects the ongoing nature of data mining work.

Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Paper presented at the Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining.
A case study
• An online retailer has spent 90% of its marketing budget on attracting visitors to its online store through many channels of advertising. Its CEO and other directors feel that their online advertising campaigns are quite successful, with “remarkable” growth in traffic to their website and in “new” customers over the last few years. However, total revenue and profit do not meet their expectations, and this threatens their business model.
• Drawing on their own business understanding and experience, they want to increase average revenue per customer per year, which they believe is below that of their main competitors.
• The Marketing Director proposes a strategy that incentivises customers to repurchase after their first purchase. Obviously, there is a budget constraint in doing this. The board of directors decides to use business analytics.
A case study: Performance Lawn Equipment
• Use the data set “PerformanceLawnEquipmentData” and the Performance Lawn Equipment (Case study) document.
• In groups of four or five, each of you plays one of the following roles:
(1) Sales Director (looking after sales performance),
(2) Production and Delivery Director (looking after production and delivery of all orders),
(3) Support Service Director (looking after customer services, complaints, accounting, etc.),
(4) CEO (Chief Executive Officer), who looks after overall business management and administration including human resource management, and
(5) Assistant to the CEO, whose main task is to manage the business analytics projects that the company plans to do.
Tasks: Using the CRISP-DM framework, discuss the questions on the following slides among your group. (Be prepared to present a brief summary of your answers, so please take notes while discussing.)
Business Understanding
• Determine business objectives:
◦ The background
◦ The problems / opportunities
◦ Project objectives (initiatives)
◦ Success criteria
• Class discussion (10 minutes):
◦ What are the business problems or opportunities?
◦ Set clear objectives (there could be more than one objective).
◦ Agree on success criteria for the BA proposal.
◦ What types of analytics are involved?
Business Understanding
• Assess situations: go into details of:
◦ Resources available
◦ Requirements and constraints (assumptions)
◦ Risks and contingencies
◦ Costs and benefits
• Identify and document what resources are available:
◦ Data
◦ Staff & experience
◦ Budget
• What resources are required? Then identify constraints.
• What other risks are involved?
• Evaluate costs and benefits.
• Understand how action can be taken based on the likely outcomes (how to deploy?), not only in the development stage but also in the deployment stage.
• Data analysis goals
Business Understanding
• Project plan:
◦ Business problems
◦ Business goals
◦ Resources & constraints
◦ Data analysis goals
◦ Initial assessments of tools and techniques
• Tools and techniques, e.g.:
◦ Classification of customers
◦ Linear regression, logistic regression, regression tree
◦ Optimisation: what level of incentives obtains higher total revenue or total profit? (See the sketch below.)
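To make the optimisation item concrete, here is a minimal sketch of choosing the incentive level that maximises expected profit per customer. The response curve, average order value and budget cap are illustrative assumptions, not figures from the case study.

```python
# Sketch: pick the incentive that maximises expected profit per customer.
# All numbers and the response function are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

AVG_ORDER_VALUE = 80.0   # assumed average repurchase order value ($)
MAX_BUDGET = 20.0        # assumed maximum incentive per customer ($)

def repurchase_prob(incentive):
    # Hypothetical diminishing-returns response: 10% base rate, rising with incentive
    return 0.10 + 0.35 * (1 - np.exp(-incentive / 8.0))

def expected_profit(incentive):
    # Profit per customer = P(repurchase) * (order value - incentive cost)
    return repurchase_prob(incentive) * (AVG_ORDER_VALUE - incentive)

# Maximise profit by minimising its negative, subject to the budget bound
res = minimize_scalar(lambda x: -expected_profit(x),
                      bounds=(0.0, MAX_BUDGET), method="bounded")
print(f"Best incentive: ${res.x:.2f}, "
      f"expected profit/customer: ${expected_profit(res.x):.2f}")
```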
Data Understanding
• Collect initial data
• Describe data:
◦ Data volume and properties
◦ Accessibility and availability of attributes
◦ Attribute types, range, correlation and identities
◦ Basic descriptive analysis
• Explore data:
◦ Visualise and identify relationships
• Verify data quality
• High level:
◦ Identify data sources and data fields.
◦ Review data strategy and documentation.
◦ What data are relevant, and in what formats (database, text files, Excel, etc.)?
◦ Crucially, target data fields that map to business/analytical objectives.
Data Understanding
• Collect initial data
• Describe data:
◦ Data volume and properties
◦ Accessibility and availability of attributes
◦ Attribute types, range, correlation and identities
◦ Basic descriptive analysis
• Explore data:
◦ Visualise and identify relationships
• Verify data quality
• Low level:
◦ Explore data.
◦ Look for patterns.
◦ Use uni- or bi-variate analysis to establish relationships (often using visualisation tools), as in the sketch below.
◦ Test hypotheses.
◦ Identify anomalies.
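A minimal sketch of this low-level exploration with pandas. The column names ("annual_revenue", "num_orders") and the synthetic data are hypothetical stand-ins, not the case-study schema.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the case-study data (columns are hypothetical)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "annual_revenue": rng.normal(500, 150, 200),
    "num_orders": rng.poisson(6, 200),
})

# Univariate analysis: distribution of each attribute
print(df.describe())

# Bivariate analysis: correlation between attribute pairs
print(df.corr())

# Simple anomaly check: values more than 3 standard deviations from the mean
rev = df["annual_revenue"]
print(df[(rev - rev.mean()).abs() > 3 * rev.std()])
```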
Data Preparation
• Select data
• Clean data:
◦ Correct, remove or ignore noise
◦ Special values and their meaning
◦ Outliers and aggregation
• Construct data:
◦ Create new attributes from available attributes
• Integrate data from multiple sources
• Format data: e.g. string values to numerical values
• Data Understanding helps design this step.
• This step could take more time than expected, especially in newer projects.
• Create new data variables (e.g. revenue / cost = profitability), as sketched below.
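The two tasks named above, constructing a new attribute and converting string values to numerical values, might look like the following pandas sketch; the columns are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data with a string attribute to be encoded
df = pd.DataFrame({
    "revenue": [1200.0, 950.0, 400.0],
    "cost": [800.0, 700.0, 500.0],
    "region": ["North", "South", "North"],
})

# Construct a new attribute from available attributes
df["profitability"] = df["revenue"] / df["cost"]

# Format data: encode the string attribute as numeric codes
df["region_code"] = df["region"].astype("category").cat.codes

print(df)
```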
Modelling
• Select modelling techniques: e.g. regression
• Generate test design: e.g. splitting data into training, test and validation sets
• Build models
• Assess models
• Apply a variety of modelling techniques.
• The choice of techniques depends on:
◦ Analytical and business objectives
◦ Data quantities and data types
• Modelling approaches:
◦ Hypothesis led: theories, knowledge or experience tell us which variables (data fields) to use.
◦ Data led: add more fields at the beginning and incrementally reduce them, and/or let the algorithms do that.
◦ In many cases it is best to combine these two approaches.
• Assess models against the agreed objectives and success criteria (a build-and-assess sketch follows).
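As a build-and-assess sketch on assumed synthetic data, the three techniques named in the project plan (linear regression, logistic regression, regression tree) could be fitted and compared with scikit-learn like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic, purely illustrative data: three predictors, one numeric target,
# one binary target (e.g. will the customer repurchase?)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y_value = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=500)
y_class = (y_value > 0).astype(int)

X_train, X_test, yv_train, yv_test, yc_train, yc_test = train_test_split(
    X, y_value, y_class, test_size=0.3, random_state=0)

# Candidate models for the numeric target, assessed on held-out data
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
    model.fit(X_train, yv_train)
    print(type(model).__name__, "R^2 =", round(model.score(X_test, yv_test), 3))

# Candidate model for the binary target (classification of customers)
clf = LogisticRegression().fit(X_train, yc_train)
print("LogisticRegression accuracy =", round(clf.score(X_test, yc_test), 3))
```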
Modelling
• Select modelling techniques: e.g. regression
• Generate test design: e.g. splitting data into training, test and validation sets
• Build models
• Assess models
• Modelling approaches:
◦ Hypothesis led: do the Directors (& their staff) know everything?
Modelling
• Select modelling techniques: e.g. regression
• Generate test design: e.g. splitting data into training, test and validation sets
• Build models
• Assess models
• Partitioning a data set is splitting the data randomly into two, sometimes three, smaller data sets: Training, Validation and Test (see the sketch below).
◦ Training: the subset of data used to create (build) a model.
◦ Validation: the subset of data that remains unseen when building the model and is used to tune the model parameter estimates.
◦ Test (hold-out): a subset of data used to measure overall model performance and compare the performance of different candidate models.

• In our case study:
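As a generic illustration of the three-way partition (not the case-study data): two calls to scikit-learn's train_test_split produce training, validation and test subsets. The 60/20/20 proportions are an illustrative choice, not a rule from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # placeholder feature matrix
y = np.arange(100)                  # placeholder target

# First split off the training set (60%)...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
# ...then split the remainder equally into validation (20%) and test (20%)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```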


Evaluation
• Evaluate results
• Review process:
◦ Activities missed or repeated
◦ Steps followed
◦ Failures and misleading results, etc.
• Determine next steps:
◦ Potential for deployment of each result
◦ Potential for improvement
◦ Alternative continuations
• Refine process plan
• Take ACTIONS!
• It is essential that the models are tested against unseen data.
• Evaluate against the success criteria agreed in the Business Understanding phase.
• Evaluate how well the model performs against a given value criterion, e.g. revenue (a sketch follows).
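One way to evaluate against a value criterion such as revenue is sketched below: score a classifier's predictions on unseen test data by the net revenue they would generate, rather than by accuracy alone. The order value, incentive cost and labels are illustrative assumptions.

```python
import numpy as np

AVG_ORDER_VALUE = 80.0   # assumed revenue per successful repurchase ($)
INCENTIVE_COST = 10.0    # assumed incentive cost per targeted customer ($)

def net_value(y_true, y_pred):
    """Net revenue on unseen (test) data: revenue from correctly targeted
    repurchasers minus the incentive cost of everyone targeted."""
    targeted = y_pred == 1
    revenue = AVG_ORDER_VALUE * np.sum(targeted & (y_true == 1))
    cost = INCENTIVE_COST * np.sum(targeted)
    return revenue - cost

# Hypothetical test-set labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(f"Net value on test set: ${net_value(y_true, y_pred):.2f}")
```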
Deployment
• Plan deployment
• Plan monitoring and maintenance:
◦ What could change in the environment?
◦ How will accuracy be monitored?
◦ When should model(s) not be used anymore?
◦ What if the business objectives of the use of the model change?
• Produce final report
• Review project
• Deployment could be simple or complex (e.g. embedding a model in an operational system to predict in real time and automate decisions).
• It is important to distinguish between a model in the modelling phase and in the deployment phase:
◦ In the modelling phase, many different models and modelling options are often built and evaluated.
◦ In the deployment phase, the winning models are often fixed (in the short term).
Deployment
• Plan deployment
• Plan monitoring and maintenance:
◦ What could change in the environment?
◦ How will accuracy be monitored?
◦ When should model(s) not be used anymore?
◦ What if the business objectives of the use of the model change?
• Produce final report
• Review project
• On-going evaluation (monitoring) is needed if models are to be used over time (a minimal monitoring sketch follows).
• Some models have a longer life than others.
• Some models are developed to adapt themselves to changing environments/circumstances.
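A minimal monitoring sketch, assuming a rolling-accuracy check against a threshold agreed at evaluation time; the window size and threshold are illustrative values, not figures from the lecture.

```python
from collections import deque

class AccuracyMonitor:
    """Flag a deployed model for review when its rolling accuracy on fresh,
    labelled outcomes drifts below an agreed threshold."""

    def __init__(self, window=100, threshold=0.75):
        self.results = deque(maxlen=window)   # rolling record of hits/misses
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def needs_review(self):
        if len(self.results) < self.results.maxlen:
            return False                      # not enough evidence yet
        return sum(self.results) / len(self.results) < self.threshold

monitor = AccuracyMonitor(window=50, threshold=0.75)
# In production, record() would be called as true outcomes arrive, e.g.:
# monitor.record(model.predict(x), observed_outcome)
```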
CRISP-DM process reprised
• The process is iterative: stages are done again and again to improve both the process and the outcomes.
• The order of stages in the process is not fixed.
IBM’s new ASUM-DM framework that extends CRISP-DM
• McKinsey (2018), Achieving business impact with data
