Edd Biddle
Watson Senior Data Scientist
How to use this document
It contains a mixture of content, including:
• A business case study based on a fictitious e-retailer company – highlighted in a green box
The colour coding is designed to help identify slides to skip or focus on, depending on the level of detail required
The CRISP-DM Methodology
[Pie chart: approximate share of project effort across the six CRISP-DM phases – Business Understanding, Data
Understanding, Data Preparation, Modeling, Evaluation and Deployment – with shares of 40%, 35%, 10%, 5%, 5% and 5%.]
Business Understanding
Take time to explore what the business expects to gain from the data analysis. Understand how the results of the
analysis will be used.
The focus of this task is on the Business Problem, not the solution or the underlying technology that may solve the
problem. The focus needs to stay on the business – bringing in technology too soon will steer the solution towards that
technology, and the danger is that the actual business problem will be forgotten or not fully answered.
Task List
• Gather background information about the current business situation – As Is Scenario and pain points
• Document specific business objectives decided upon by key decision makers – To Be Scenario
• Agree upon success criteria – How can we measure success / sign off
Document:
• Describe the problem to be solved
• Specify all business questions as precisely as possible
• Determine any other business requirements (e.g. not losing an existing customer whilst increasing cross-sell
opportunities)
• Specify expected benefits in business terms (e.g. reducing churn amongst high-value customers by 10%)
Business Understanding
It is important to define success criteria so that you can say the task is complete.
• Subjective – the criteria are less concrete and harder to measure or quantify
• “Discover clusters of effective treatments”
• “Identify non-obvious insights within the text”
Case Study
Louise, the VP of marketing, is feeling the pressure from ever-growing competition from new web retailers.
Customer acquisition is expensive, and she has decided that she needs to cultivate existing customer relationships in
order to maximise the value of the company's current customers.
Now that the business goals are clear, they need to be translated into data analysis goals. For example, if the
business objective is to reduce churn, then corresponding data analysis goals might be set.
A key question to be answered is whether the data provided by the customer contains the correct information to
answer the business problem.
Business Understanding
Case Study
Jeff expands the business objectives by exploring the data that the company has available
• Use historical information about previous purchases to generate a model that links "related" items. When users look
at an item description, provide links to other items in the related group (market basket analysis).
• Use Web logs to determine what different customers are trying to find, and then redesign the site to highlight these
items. Each different customer "type" will see a different main page for the site (profiling).
• Use Web logs to try to predict where a person is going next, given where he or she came from and has been on your
site (sequence analysis).
Business Understanding
Success must also be defined in technical terms to keep the data analysis on track.
• Describe the methods for model assessment (e.g. accuracy, performance, etc.)
• Define subjective measurements as best you can and determine the arbiter of success
Data Understanding
This phase involves taking a closer look at the data available for analysis: accessing it and exploring it.
Quantitative questions
• How many documents?
• Average number of pages?
• Smallest, largest and average size of document?
Quality questions
• How are the documents formatted?
• Are there particular locations / sections of the document that are rich with data and others not so much?
• Do they need splitting up into smaller sub-documents?
• Are there complex structures, like tables or images, contained within the documents? If so, what information within
them is important and how are you going to extract that information?
• Are the documents digital or do they need to be OCR’d?
• What languages need to be supported?
Data Understanding
Document Segmentation – Why do it?
When dealing with large documents, or even fairly small ones, it is often good practice to chunk them up into
smaller sub-documents. There are a number of reasons to do this, and they will depend on the data and the use case:
• Information at the top of a document is less likely to be related or correlated with information several pages
further down. Text analytics systems draw inferences from co-location, so related text should stay together.
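A minimal chunking sketch, assuming plain-text documents split on paragraph boundaries; the maximum chunk size and file name are purely illustrative and would depend on the data and use case.

```python
# Document segmentation sketch: split a plain-text document into sub-documents
# of bounded size, breaking on paragraph boundaries so that co-located (and
# therefore likely related) text stays in the same chunk.

def chunk_document(text, max_chars=2000):   # 2000-character limit is illustrative
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

sub_documents = chunk_document(open("report.txt").read())   # illustrative file
print(len(sub_documents), "chunks")
```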
Quantitative questions
• How many rows and how many attributes are available in the data?
Quality questions
• How much missing data is there? Attributes will have missing field values
• What is the sparsity of the attributes – do we need to consider combining attributes together?
• For example, in market basket analysis do we need to aggregate similar products together?
• Are there correlations between different attributes? Can we remove some attributes based on these correlations?
• What sort of categorical data is available and how is it represented?
• Do we have historical data with known results (e.g. whether a customer churned)?
• Plot different fields against each other to see how they interact
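Most of these questions can be answered with a few lines of pandas; the file and column names below are illustrative assumptions, not part of the case study data.

```python
import pandas as pd

df = pd.read_csv("customers.csv")           # illustrative file name

print(df.shape)                             # how many rows and attributes
print(df.isna().mean().sort_values())       # proportion of missing values per attribute
print(df.nunique())                         # cardinality / sparsity of each attribute
print(df.select_dtypes("number").corr())    # correlations between numeric attributes
print(df["churned"].value_counts())         # historical known results (illustrative column)
```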
Data Understanding
• Use different plots to explore the data
• Box plots for categorical data
• Scatter plots for continuous data
The following example charts explore different attributes and their relation to price within a used car dataset
• Drive wheels (categorical): explore the variance in price across the categorical field drive wheels. In this example
the price range for rear-wheel-drive cars is distinct, whilst the prices for front-wheel-drive and four-wheel-drive cars
are almost indistinguishable.
• Engine size (continuous): explore the relationship between the continuous variable engine size and price to see if
engine size is a good predictor of price. The scatter plot shows a linear relationship between the two – as the engine
size increases, so does the price.
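For example, assuming the used car data is loaded into a pandas DataFrame with columns named drive_wheels, engine_size and price (illustrative names), the two charts could be produced like this.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cars = pd.read_csv("used_cars.csv")        # illustrative file; column names assumed

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Box plot: variance in price across the categorical field drive_wheels
sns.boxplot(x="drive_wheels", y="price", data=cars, ax=ax1)

# Scatter plot: relationship between the continuous variable engine size and price
ax2.scatter(cars["engine_size"], cars["price"])
ax2.set_xlabel("engine_size")
ax2.set_ylabel("price")

plt.tight_layout()
plt.show()
```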
Data Understanding
• How are different data sources connected, if at all? Is there a common key?
• How do we merge data sources?
• What sort of data manipulation might be needed?
• Do values need to be normalised to align different sources together?
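A hedged sketch of merging two sources on a common key with pandas; the file names, join key and unit conversion are assumptions for illustration only.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")    # illustrative sources
purchases = pd.read_csv("purchases.csv")

# Join the two sources on a common key (customer_id is assumed here)
merged = customers.merge(purchases, on="customer_id", how="left")

# Align scales across sources if needed, e.g. one source records spend in pence
merged["spend_gbp"] = merged["spend_pence"] / 100.0
```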
Data Understanding
Basic characteristics of the data
Summarise the information gathered above into a data collection report. Capture the quantity and quality of the data.
Key characteristics:
• Coding schemes
• Values are often represented by codes, such as M and F for gender
• A numeric key linked to a product list
Data Understanding
Key questions to ask:
• Are you able to prioritise relevant attributes? Do you need to get a business SME to provide further insight?
[Diagram: data sources are selected (e.g. via a content manager) and combined into a de-normalised dataset for analysis.]
Handling of outliers and missing values – Outliers and missing values can produce biased results. The mitigation of outliers and the transformation of
missing data into meaningful information can improve data quality enormously.
Variable Selection – Variables can be superfluous when they present the same or very similar information as others. Dependent or highly correlated variables
can be found by performing statistical tests such as bivariate statistics and linear or polynomial regression. Dependent variables should be reduced by selecting
one variable to represent the others, or by composing a new variable for all correlated ones using factor or component analysis. Variable reduction will improve
model performance.
Data Preparation
Involves the following tasks:
• Selecting a sample subset of data
• Filter on rows that target particular customers / products that help answer the data analysis and business goals
• Filter on attributes relevant to data analysis and business goals
• Aggregating records
• Based on group by like operations
• Feature reduction – if your data is sparse or you have lots of attributes then it might be worth considering reducing
the number of features
• Principal Component Analysis will show you which attributes have the biggest impact on the data
• Linear regression might show you that two attributes are strongly correlated, so you only need to keep one of those
attributes and can remove the other
• Normalisation
• Normalise numeric fields to use the same range. Features with large values compared to features with small
values can be given more weight by various algorithms
• Normalisation eliminates the unit of measurement by rescaling data, often to a value between 0 and 1
Missing Data occurs when there is no data value stored for a particular observation or field within the data. Missing data
are a common occurrence and can have significant effect on the conclusions that can be drawn from the data.
Data normalization is a process of adjusting values measured on different scales to a notionally common scale
Example
If an analyst scores a piece of information on a scale of 1-5 where 1 = not useful and 5 = very useful
The usefulness of the information has been scored by two systems that use different scales.
In order to compare the analyst vs the machine the values need to be normalised to a common scale
One approach to do this would be to convert the percentage scores to the same scale the analyst used
0-20 = 1
21-40 = 2
41-60 = 3
61-80 = 4
81-100 = 5
Data Preparation - Normalisation
There are three common techniques used for data normalisation; these are not exhaustive and there are a multitude of
alternative techniques.
Simple Feature Scaling – divides each value by the maximum value, bringing all values into the range 0-1:
x_new = x / x_max
Min-Max Feature Scaling – a variation on simple feature scaling which takes into account where the value lies in
between the min and max score:
x_new = (x - x_min) / (x_max - x_min)
Z-Score – normalises based on the mean and standard deviation; scores typically range from -3 to 3:
x_new = (x - μ) / σ
where μ = population mean and σ = population standard deviation (an alternative is to use the sample mean and standard deviation)
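A minimal sketch of all three techniques applied to a single numeric column with pandas; scikit-learn's MinMaxScaler and StandardScaler offer equivalent functionality. The values are made up.

```python
import pandas as pd

x = pd.Series([12000, 18500, 22000, 31000, 45000])    # illustrative prices

simple  = x / x.max()                            # simple feature scaling: range 0-1
min_max = (x - x.min()) / (x.max() - x.min())    # min-max feature scaling: range 0-1
z_score = (x - x.mean()) / x.std()               # z-score: roughly -3 to 3
```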
Data Preparation – Categorical data
Categorical or discrete variables are those that represent a fixed number of possible values, rather than a continuous
number.
Although some data mining algorithms can handle categorical data as it is, many require it to be converted into a
numerical representation.
There are a number of approaches that enable this conversion; these are often referred to as encoding techniques.
Note: High dimensionality can cause model complexity. A dataset with more dimensions requires more parameters for
the model to understand and that means more rows to reliably learn the parameters. If the number of rows in a dataset
is fixed, the addition of extra dimensions without adding more information to learn from can have a detrimental effect on
eventual model accuracy
• Other Approaches
• There are a number of other approaches that make use of the mean and standard deviation of the dependent
variable in order to give a distribution across the one-hot encoding (i.e. each additional dimension is given a
value between 0 and 1). Popular ones include “Backward Difference Coding” and “Polynomial Coding”.
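As an illustration, one-hot encoding – the most common of these techniques – can be done with pandas; the column names and values below are made up. Note how each category becomes an extra dimension, which is why high-cardinality fields can inflate model complexity.

```python
import pandas as pd

df = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"],
                   "price": [32000, 11000, 18000, 9500]})

# One-hot encoding: each category of drive_wheels becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["drive_wheels"])
print(encoded)
```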
Data Preparation
• Split the data into training, validation and test data sets
• Validation dataset – having selected a model that performs well on training data, we run the model on the validation
data set.
• Typically ranges from 10% - 20% of the available data
• Allows us to test for overfitting
• Test dataset – contains data that has never been used in the training
• Typically ranges from 5% - 20% of the available data
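One way to create the three datasets is with scikit-learn's train_test_split; the 70/15/15 split and file name below are illustrative choices within the ranges above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")    # illustrative dataset

# Hold back a test set that is never used during training, then split the
# remainder into training and validation data (roughly 70% / 15% / 15% overall).
train_val, test = train_test_split(df, test_size=0.15, random_state=42)
train, val = train_test_split(train_val, test_size=0.15 / 0.85, random_state=42)

print(len(train), len(val), len(test))
```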
Modelling
Selection of the best suited modelling techniques to use for a given business problem. This step not only includes
defining the appropriate technique or mix of techniques to use, but also the way in which the techniques should be
applied:
• Classification
• Associations
• Clustering
• Anomaly detection
• Prediction
• Similar patterns
• Similar time sequences
The selection of the method will normally be obvious; for example, market basket analysis in retail will use the
association technique, which was originally developed for this purpose.
The challenge is usually not which technique to use, but the way in which the technique should be applied. All the
techniques will require some form of parameter selection which requires experience of how the techniques work and
what the various parameters do.
Modelling is an iterative process where a number of experiments will be conducted in order to find the optimal model,
parameter and feature selection.
Modelling
Unsupervised Machine Learning
• Feature Reduction – Pearson Correlation, Principal Components Analysis, Factor Analysis
• Association – Apriori, Sequence, Association Rules
• Clustering – Hierarchical Clustering, Anomaly Detection, K Means, Kohonen
Supervised Machine Learning
• Classification – Decision Tree, Support Vector Machine, Bayesian Network
• Predictive – Regression, Neural Network, Statistical
Modelling
Naïve Bayes Classifier
• Description: A conditional probability model that uses the method of maximum likelihood
• When to use: Where variables in the data set are independent. Normally requires integer feature counts (e.g. word
counts for text classification) but also works on fractional counts such as tf-idf
• Strengths: Highly scalable, fast
• Weaknesses: Has been shown to be outperformed by Decision Trees and Random Forests. Produces poorly calibrated
probabilities – favours extreme probability values
Sequence
• Description: Uses a two-pass method for finding frequent sequences within the dataset
• When to use: When temporal analysis of the data is important – market basket analysis over time
Clustering Techniques
Cluster Analysis: The task of grouping a set of objects in such a way that objects in the same group are more similar
to each other than to those in other groups. Cluster algorithms minimize the distance within a cluster and maximize
the distance between clusters
Hierarchical Clustering
• Description: Builds a hierarchy of clusters. There are two main approaches: bottom up – each observation starts as
its own cluster and pairs of clusters are joined as you move up the hierarchy; top down – starts with one big cluster
and then splits are performed recursively as you move down the hierarchy. Results are usually presented in a
dendrogram
• Strengths: No knowledge of the ideal number of clusters required. Easy to implement
• Weaknesses: The algorithm can never undo what was previously done. Time consuming for large datasets. Can be
sensitive to outliers
Clustering Techniques
MACHINE LEARNING PREDICTIVE MODELS
K Means
• Description: Aims to partition n observations into k clusters in which each observation belongs to the cluster with
the nearest mean
• When to use: On large data sets with lots of variables
• Strengths: If the number of variables is large then k-means is computationally faster than hierarchical clustering
techniques. K-means produces tighter clusters than hierarchical techniques
• Weaknesses: Difficult to predict the k value. Different initial partitions can result in different final clusters – this is
particularly evident on smaller data sets. We do not know which variable contributes more to the clustering process
Kohonen
• Description: A type of neural network that performs clustering using a self-organising map. Tries to uncover
patterns in a set of input fields
• Strengths: Does not use a target field. Does not need to know the number of clusters in advance
• Weaknesses: More computationally expensive than k-means. Harder to interpret the results
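A minimal K-means sketch using scikit-learn on made-up data. As noted above, different initial partitions can change the result, so a fixed random_state is used; the choice of k = 3 is an arbitrary illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 5)                      # illustrative numeric data

X_scaled = StandardScaler().fit_transform(X)    # k-means is distance based, so scale first
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

print(kmeans.labels_[:10])        # cluster assignment per observation
print(kmeans.cluster_centers_)    # the k cluster means
```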
Feature Reduction
Feature Reduction: Process of selecting a subset of relevant features for use in model construction based on the
premise that data contains many features that are either redundant or irrelevant and can therefore be removed
without incurring much loss of information
STATISTICAL FEATURE REDUCTION TECHNIQUES
Pearson Correlation Coefficient
• Description: Measures linear correlation between two variables. It gives a value between -1 and 1, where 1 is a total
positive correlation, -1 is total negative correlation and 0 is no correlation
• When to use: On datasets that are normally distributed. A good first analysis step in understanding your data and
how different variables are related to each other
• Strengths: Easy to calculate
• Weaknesses: Outliers can cause misleading results. Sensitive to the data distribution. Lack of correlation may not
mean there is no relationship – it could be non-linear
Principal Component Analysis (PCA)
• Description: Uses orthogonal transformations to convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal components
• When to use: Simplification of models. Shorter training times. To avoid the curse of dimensionality – sparse
datasets. Enhanced generalization by reducing overfitting
• Strengths: Fast and simple to implement
• Weaknesses: Reliant on finding linear correlations within datasets. Sensitive to the scaling of variables – consider
normalising first
Factor Analysis
• Description: Similar to PCA, but looks for joint variations across multiple variables
• When to use: See PCA
• Strengths: Not so susceptible to error variance
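A short PCA sketch with scikit-learn on made-up data; because PCA is sensitive to the scaling of variables, the features are standardised first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 20)                     # illustrative wide dataset

X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to scaling
pca = PCA(n_components=0.95)                    # keep enough components to explain 95% of variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # fewer, linearly uncorrelated features
print(pca.explained_variance_ratio_)    # contribution of each principal component
```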
Model Evaluation
To determine if the model is a good fit we look at the following combination of:
• Do the predicted values make sense?
• Visualization
• Numerical values for evaluation
• Comparison of multiple models
The next stage of the CRISP-DM method is to evaluate your results against the business goals stated at the beginning
of the process and to determine whether the success criteria have been met.
Task List
Document your assessment of whether the data analysis results meet the business success criteria. Consider the
following questions in your report:
• Are your results stated clearly and in a form that can be easily presented?
• Can you rank the models and findings in order of their applicability to the business goals?
• In general, how well do these results answer your organization's business goals?
• What additional questions have your results raised? How might you phrase these questions in business terms?
Hyper Parameter Tuning
Brief Explanation
Hyperparameters are the settings of an algorithm that can be adjusted to optimise performance
• Model parameters, such as slope and intercept in a linear regression model are learned during training
• Hyperparameters must be set by the data scientist before training
Python libraries like Scikit-Learn implement a set of sensible default hyperparameters, but these are not guaranteed to
be optimal for a problem.
The best hyperparameters are usually impossible to determine ahead of time and there is an element of trial and error
required to properly tune a model.
• To determine optimal settings try many different combinations and evaluate the performance of each model
• It is important to perform this evaluation on the validation and test data sets to avoid model overfitting.
Hyper Parameter Tuning
• Cross validation is a statistical technique which involves partitioning the data into subsets, training the model on
some subsets and using the remaining subset to evaluate the model’s performance.
• Hyperparameter tuning is part of the cross-validation model evaluation, which means that you are potentially going
to be evaluating hundreds of models in order to properly identify the optimal configuration
[Diagram: five cross-validation experiments – in each experiment a different fold of the data is held out for validation
while the remaining folds are used for training.]
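A hedged sketch of combining cross validation with hyperparameter tuning using scikit-learn's GridSearchCV; the classifier, parameter grid and synthetic data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each hyperparameter combination is evaluated with 5-fold cross validation,
# so 3 x 3 x 5 = 45 models are trained before the best configuration is chosen.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))   # final check on data never used for tuning
```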
A regression plot shows the relationship between two variables and whether a linear model is appropriate.
A residual plot shows the error between predicted and observed values to determine whether a linear model is appropriate:
• A linear model is appropriate if the residual error is distributed randomly. The residual error would have a zero mean
and equal variance
• A linear model is not appropriate when the residual error is not random, e.g. where the values of the error change with x
Distribution plots compare the distribution of predicted values against the distribution of actual values
Here are two examples of distribution plots created to evaluate two different multi-linear regression models
The one on the left shows that the predicted values are closer to the target value
The one on the right shows that the model predicted values for the higher priced cars are inaccurate
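A sketch of producing a residual plot and a distribution plot for a fitted regression model; the data is synthetic and the column meanings (engine size, price) are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(1, 5, size=(300, 1))               # e.g. engine size
y = 3000 * X[:, 0] + rng.normal(0, 2000, 300)      # e.g. price, with noise

model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Residual plot: random scatter around zero suggests a linear model is appropriate
ax1.scatter(X[:, 0], residuals)
ax1.axhline(0, color="red")
ax1.set_title("Residual plot")

# Distribution plot: compare predicted values against actual values
sns.kdeplot(y, ax=ax2, label="actual")
sns.kdeplot(predicted, ax=ax2, label="predicted")
ax2.set_title("Distribution plot")
ax2.legend()

plt.show()
```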
Precision and Recall
f1 is the harmonic mean of precision and recall – it punishes you for a disparity between precision and recall:
f1 = 2 * (p * r) / (p + r)
Example of Precision and Recall
[Worked example figure: precision and recall values giving f1 = 0.5]
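A small sketch computing precision, recall and f1 with scikit-learn; the labels below are made up for illustration and are not the slide's original example.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]    # illustrative ground truth
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]    # illustrative predictions

p = precision_score(y_true, y_pred)   # 2 true positives out of 4 predicted positives = 0.5
r = recall_score(y_true, y_pred)      # 2 true positives out of 4 actual positives   = 0.5
f1 = 2 * (p * r) / (p + r)            # harmonic mean = 0.5

print(p, r, f1, f1_score(y_true, y_pred))
```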
ROC and PR Curves
A receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier
system as its discrimination threshold is varied.
The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
The true-positive rate is also known as sensitivity
A PR curve plots precision – how many true positives out of all that have been predicted as positive – against recall
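A sketch of plotting both curves from a classifier's predicted probabilities with scikit-learn; the classifier and synthetic data are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]     # probability of the positive class

# ROC curve: true positive rate (sensitivity) vs false positive rate across thresholds
fpr, tpr, _ = roc_curve(y_test, scores)
# PR curve: precision vs recall across thresholds
precision, recall, _ = precision_recall_curve(y_test, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, scores):.2f}")
ax1.set_xlabel("False positive rate"); ax1.set_ylabel("True positive rate"); ax1.legend()
ax2.plot(recall, precision)
ax2.set_xlabel("Recall"); ax2.set_ylabel("Precision")
plt.show()
```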
• Deployment
• Making model predictions / classifications easily available to the business
• Evaluation
• Measuring the quality of the deployed models
• Management
• Tracking model quality over time
• Monitoring
• Improving deployed models with feedback
Deployment
Deployment is making the model easily accessible and available to the business
• The model moves from the development environment into a production environment
• Live data is fed through the model to be scored or classified (depending on which model has been selected)
• Live data will need to be prepared correctly into a format that the deployed model is expecting
• The output of the deployed model is actionable insights which some external application can consume
[Diagram: in the development environment, historical data feeds model development and evaluation to produce a
trained model; the model is then deployed in the production environment, where it produces actionable insights.]
• Accuracy of prediction – this is largely done during the model evaluation in the model development phase, but
further accuracy tests can be performed on the live system (Machine Learning Metric)
• Business impact – is the deployed model influencing the business success criteria defined in the business
understanding phase (Business Metric)
Management and Monitoring – Tracking model quality over time, improving based on feedback
Important features
• Versioning
• Logging
• Provenance
• Dashboards
• Reports
• Capturing and responding to feedback
Planning, Monitoring and Maintenance
In a full-fledged deployment and integration of modeling results, your data analysis work may be ongoing. For
example, if a model is deployed to predict sequences of e-basket purchases, this model will likely need to be
evaluated periodically to ensure its effectiveness and to make continuous improvements. Similarly, a model deployed
to increase customer retention among high-value customers will likely need to be tweaked once a particular level of
retention is reached. The model might then be modified and re-used to retain customers at a lower but still
profitable level on the value pyramid.
Task list
• For each model or finding, which factors or influences (such as market value or seasonal variation) need to be
tracked?
• How can the validity and accuracy of each model be measured and monitored?
• How will you determine when a model has "expired"? Give specifics on accuracy thresholds or expected changes
in data, etc.
• What will occur when a model expires? Can you simply rebuild the model with newer data or make slight
adjustments? Or will changes be pervasive enough as to require a new data mining project?
• Can this model be used for similar business issues once it has expired? This is where good documentation
becomes critical for assessing the business purpose for each data mining project.
Model Update Strategies
There are a number of strategies for migrating to a new model, two of which are covered in this presentation.
A / B testing
• Measures the performance of two models and, when one outperforms the other, the new model is published.
• For example, an association model for product recommendations based on website page navigation might
measure the click-through rate as a business success criterion
Example: Model V1 receives 2000 visits and achieves a 10% click-through rate; Model V2 receives 2000 visits and
achieves a 30% click-through rate, so Model V2 gets promoted.
Multi-armed bandit
• Focus on how to learn faster by performing exploration and exploitation
• Exploration – split 10% of your data equally between the models to be evaluated
• Exploitation – use the other 90% of the data on the best performing model
• Exploration is ongoing and runs simultaneously, allowing for a model switch if another model starts to
outperform the current best.
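A simplified epsilon-greedy sketch of the exploration/exploitation idea: 10% of visits are split evenly across the candidate models and the remaining 90% go to the best performer so far. The click-through rates are simulated purely for illustration.

```python
import random

# Simulated "true" click-through rates for two candidate models (illustration only)
true_ctr = {"model_v1": 0.10, "model_v2": 0.30}
clicks = {m: 0 for m in true_ctr}
visits = {m: 0 for m in true_ctr}
EXPLORE_FRACTION = 0.10

def serve_visit():
    if random.random() < EXPLORE_FRACTION:
        model = random.choice(list(true_ctr))     # exploration: even split across models
    else:
        # exploitation: route traffic to the best performer so far
        model = max(true_ctr, key=lambda m: clicks[m] / visits[m] if visits[m] else 0.0)
    visits[model] += 1
    clicks[model] += random.random() < true_ctr[model]   # simulated click

for _ in range(10000):
    serve_visit()

for m in true_ctr:
    print(m, visits[m], round(clicks[m] / visits[m], 3))
```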
Deployment
A successful deployment of the e-retailer's data mining results requires that the right information reaches the right
people.
Decision makers. Decision makers need to be informed of the recommendations and proposed changes to the site, and
provided with short explanations of how these changes will help. Assuming that they accept the results of the study, the
people who will implement the changes need to be notified.
Web developers. People who maintain the Web site will have to incorporate the new recommendations and
organization of site content. Inform them of what changes could happen because of future studies, so they can lay the
groundwork now. Getting the team prepared for on-the-fly site construction based upon real-time sequence analysis
might be helpful later.
Database experts. The people who maintain the customer, purchase, and product databases should be kept apprised of
how the information from the databases is being used and what attributes may be added to the databases in future
projects.
Above all, the project team needs to keep in touch with each of these groups to coordinate the deployment of results
and planning for future projects.