
CRISP-DM

Cross Industry Standard Process for Data Mining

Edd Biddle
Watson Senior Data Scientist
How to use this document

This document is a detailed walk through of the CRISP-DM methodology

It contains a mixture of

• Basic presentation of ideas and concepts

• A business case study based on a fictitious e-retailer company – highlighted in a green box

• Deeper dive technical content – highlighted in a red box

The colour coding is designed to help identify slides to skip or focus on depending on the type of detail required
The CRISP-DM Methodology

Effort Distribution - Percent

[Pie chart: typical distribution of effort across the six CRISP-DM phases – Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment. Two phases dominate at roughly 35% and 40% of the effort, with the remaining phases at around 5-10% each.]
Business Understanding
Business Understanding
Take time to explore what the business expects to gain from the data analysis. Understand how the results of the
analysis will be used.

The focus of this task is on the business problem, not the solution or the underlying technology that may solve the
problem. The focus needs to stay on the business – bringing in technology too soon will steer the solution towards that
technology, and the danger is that the actual business problem is forgotten or not fully answered.

Task List

• Gather background information about the current business situation – As Is Scenario and pain points
• Document specific business objectives decided upon by key decision makers – To Be Scenario
• Agree upon success criteria – How can we measure success / sign off

Defining Business Objectives

Document:
• Describe the problem to be solved
• Specify all business questions as precisely as possible
• Determine any other business requirements (e.g. not losing an existing customer whilst increasing cross-sell
opportunities)
• Specify expected benefits in business terms (e.g. reducing churn amongst high-value customers by 10%)
Business Understanding

Business Success Criteria

It is important to define success criteria so that you can say the task is complete.

There are two types of success criteria:

• Objective – The criteria are quantifiable and can be measured


• “Customer churn will be reduced by 10%”
• “Product up sell will increase by 3%”

• Subjective – The criteria are less concrete and harder to measure or quantify
• “Discover clusters of effective treatments”
• “Identify non-obvious insights within the text”

It is important to connect each business objective to a success criterion


Business Understanding

Case Study

Louise, the VP of marketing, is feeling the pressure from ever-growing competition from new web retailers

Customer acquisition is expensive and she has decided that she needs to cultivate existing customer relationships in
order to maximise the value of the company’s current customers

She commissions a study with the following objectives

• Improve cross-sales by making better recommendations


• Increase customer loyalty with a more personalised service

She identifies the following success criteria

• Cross-sales increase by 10%


• Customers spend more time and see more pages on the site per visit.
Business Understanding
Determining Data Analysis Goals

Now that the business goals are clear, we need to translate them into data analysis goals. For example, if the
business objective is to reduce churn then the following analysis goals might be set:

• Identify high-value customers based on recent purchase data


• Build a model using available customer data to predict the likelihood of churn for each customer
• Assign each customer a rank based on both churn propensity and customer value

A key question to be answered is whether the data provided by the customer contains the correct information to
answer the business problem
Business Understanding

Case Study

Louise tasks her business analyst Jeff to undertake the study

Jeff expands the business objectives by exploring the data that the company has available

• Use historical information about previous purchases to generate a model that links "related" items. When users look
at an item description, provide links to other items in the related group (market basket analysis).

• Use Web logs to determine what different customers are trying to find, and then redesign the site to highlight these
items. Each different customer "type" will see a different main page for the site (profiling).

• Use Web logs to try to predict where a person is going next, given where he or she came from and has been on your
site (sequence analysis).
Business Understanding

Data Analysis Success Criteria

Success must also be defined in technical terms to keep the data analysis on track.

• Describe the methods for model assessment (e.g. accuracy, performance etc)

• Define benchmarks for evaluating success – provide specific numbers

• Define subjective measurements as best you can and determine the arbiter of success
Data Understanding
Data Understanding
This phase involves taking a closer look at the data available for analysis – accessing it and exploring it.

For unstructured data:


• spend a few hours opening and reading a number of the documents.
• Capture examples of the different ways information is reported pertinent to the business problem to be solved
• Identify any formatting problems, especially if the document is a PDF – drag your mouse down the page to highlight
the text and see its layout. The order in which the text is highlighted will be the order in which any NLP model will be
presented with the data
• Create a text index of the data and explore key concepts related to the business problem

Quantitative questions
• How many documents?
• Average number of pages?
• Smallest, largest and average size of document?

Quality questions
• How are the documents formatted?
• Are there particular locations / sections of the document that are rich with data and others not so much?
• Do they need splitting up into smaller sub-documents?
• Are there complex structures, like tables or images, contained within the documents? If so, what information within
them is important and how are you going to extract that information?
• Are the documents digital or do they need to be OCR’d?
• What languages need to be supported?
Data Understanding
Document Segmentation – Why do it?

When dealing with large documents or even fairly small documents it is often good practice to chunk them up into
smaller sub-documents. There are a number of reasons to do this and they will depend on the data and the use case

• Information at the top of a document is less likely to be related or correlated with information several pages
further down. Text analytics systems draw inferences based on co-location

• Stronger assumptions can be drawn on smaller chunks of information:


• Implicit relationships can be assumed between Entities, like people, places and organisations, which are
mentioned within the same paragraph or section

• For supervised machine learning, training on smaller chunks has the following advantages:


• Splitting the documents up into small chunks and randomly presenting them to a human annotator for mark-up
makes the human think more like the machine, because it removes contextual information and knowledge that can
introduce bias.
• Smaller documents are easier for a human to annotate, which reduces the chance that key information is missed
Data Understanding
For structured data:
• Understand what information each field is providing
• Run basic statistical analysis on each field – calculate the mean, min, max and standard deviation. Do these values
make sense? For example, if the field is age and the data is for credit cards, does an age of 4 make sense? (A short sketch follows at the end of this slide)
• Capture any data formatting requirements – are there numeric fields that are written out as text? For example 36
months or 10%
• Think about whether fields need to be normalised
• Multiple variations representing the same piece of information
• Identify categorical data – need to work out how this data will be used within models – see later slide

Quantitative questions

• How many rows and how many attributes are available in the data?

Quality questions
• How much missing data is there – which attributes have missing field values?
• What is the sparsity of the attributes – do we need to consider combining attributes together?
• For example in market basket analysis do we need to aggregate similar products together?
• Are there correlations between different attributes? Can we remove some attributes based on these correlations?
• What sort of categorical data is available and how is it represented?
• Do we have historical data with known results (e.g. churned)
• Plot different fields against each other to see how they interact
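As an illustration of the basic statistical checks above, here is a minimal sketch using pandas (the file name and column names are hypothetical):

import pandas as pd

df = pd.read_csv("customers.csv")               # hypothetical input file

print(df.shape)                                 # how many rows and how many attributes
print(df.describe())                            # mean, min, max, std for each numeric field
print(df.isna().sum())                          # missing values per attribute
print(df["age"].describe())                     # sanity check: does the age range make sense?
print(df.select_dtypes("object").nunique())     # candidate categorical fields and their cardinality
print(df.corr(numeric_only=True))               # correlations between numeric attributes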
Data Understanding
• Use different plots to explore the data
• Box plots for Categorical data
• Scatter plots for continuous data

The following example charts explore different attributes and their relation to price within a used car dataset (a plotting sketch follows).

Chart 1 explores the variance in the categorical field drive wheels – in this example the price range for rear wheel drive cars is distinct, whilst the prices for forward and 4 wheel drive cars are almost indistinguishable.

Chart 2 explores the relationship between the continuous variable engine size and price to see if engine size is a good predictor for price. The scatter plot shows that there is a linear relationship between the two: as the engine size increases, so does the price.
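A minimal plotting sketch using seaborn (the dataset and the column names drive_wheels, engine_size and price are hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cars = pd.read_csv("used_cars.csv")                    # hypothetical used car dataset

# Box plot for a categorical field: spread of price per drive wheel type
sns.boxplot(x="drive_wheels", y="price", data=cars)
plt.show()

# Scatter plot for a continuous field: is engine size a good predictor of price?
sns.scatterplot(x="engine_size", y="price", data=cars)
plt.show()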
Data Understanding

General questions for both unstructured and structured

• How are the different data sources connected, if at all – is there a common key?
• How do we merge the data sources?
• What sort of data manipulation might be needed?
• Do values need to be normalised to align different sources together?
Data Understanding
Basic characteristics of the data

Summarise the information gathered above into a data collection report. Capture the quantity and quality of the data.
Key characteristics:

• Amount of data – Number of records and number of fields.


• Large volumes may require taking a subset.
• A large number of fields may require some form of feature reduction to be performed

• Value types – data can be numeric, categorical, boolean or string.


• The type of data can help determine what data analysis technique is used
• What preparation is required

• Coding schemes
• Values are often represented by codes, such as M and F for gender
• A numeric key linked to a product list
Data Understanding
Key questions to ask:

• Does the data include characteristics relevant to the business problem?

• Are you able to prioritise relevant attributes? Do you need to get a business SME to provide further insight?

• Which attributes seem promising for further analysis?

• Have your explorations revealed new characteristics about the data?

• Has the exploration altered the data analysis goals?

• Has the exploration altered the business goals?


• Is the data adequate to answer the business problem?
• Do you need to find more data or change the business goal?
• Are there other goals that could be answered?

• Construct a number of “Thought Experiments” on how the data can be analysed


Data Preparation
Data Preparation
Data preparation is one of the most important and often most time-consuming parts of data analysis. It is good practice to record any data
preparation tasks in a Data Preparation Report.

[Diagram: data from the data sources and content manager is selected into a de-normalised data set, which is then cleansed, filtered and aggregated]
Having populated the data model, the next step is to ensure that the data fulfils the requirements of completeness, exactness and relevance:

• Validity of chosen variables – visual inspection
• Handling of outliers and missing values
• Variable selection – removal of redundant variables

Validity of chosen variables – perform univariate statistical analysis to visually inspect the maximum, minimum, mean and standard deviation values for
each variable to detect implausible distributions – e.g. a negative age.

Handling of outliers and missing values – outliers and missing values can produce biased results. The mitigation of outliers and the transformation of
missing data into meaningful information can improve data quality enormously.

Variable selection – variables can be superfluous by presenting the same or very similar information as others. Dependent or highly correlated variables
can be found by performing statistical tests such as bivariate statistics, linear and polynomial regression. Dependent variables should be reduced by selecting
one variable to represent the others, or by composing a new variable from all correlated ones using factor or component analysis. Variable reduction will improve model
performance.
Data Preparation
Involves the following tasks:
• Selecting a sample subset of data
• Filter on rows that target particular customers / products that help answer the data analysis and business goals
• Filter on attributes relevant to data analysis and business goals

• Merging data sets or records


• Requires a common key to join the data sets together

• Aggregating records
• Based on group-by-like operations

• Deriving new attributes


• Often when merging data sets, especially where there are one-to-many relationships, it can be useful to derive new
attributes – for example, with a customer data set and a customer purchases data set, the purchases might be condensed
down into a new derived attribute such as mean or total spend (see the sketch after this list)

• Formatting and sorting the data for modelling


• Sequence and temporal algorithms might need data to be pre-sorted into a particular order
• Categorical data fields might need to be converted from textual categories to numerical ones
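A minimal sketch of merging, aggregating and deriving attributes with pandas (the file names and the customer_id / amount columns are hypothetical):

import pandas as pd

customers = pd.read_csv("customers.csv")        # one row per customer
purchases = pd.read_csv("purchases.csv")        # many rows per customer

# Aggregate the one-to-many purchases down to one row per customer (group-by operation)
spend = (purchases.groupby("customer_id")["amount"]
                  .agg(total_spend="sum", mean_spend="mean")
                  .reset_index())

# Merge the derived attributes back onto the customer records using the common key
prepared = customers.merge(spend, on="customer_id", how="left")

# Customers with no purchases end up with missing values, so fill them with 0
prepared[["total_spend", "mean_spend"]] = prepared[["total_spend", "mean_spend"]].fillna(0)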
Data Preparation
• Removing or replacing blank or missing values
• Exclude rows where the missing attribute is key in making the decision
• Fill missing attributes with 0 or estimated values where the rest of the row adds value to the analysis

• Feature reduction – if your data is sparse or you have lots of attributes then it might be worth considering reducing
the number of features
• Principal Component Analysis will show you which attributes have the biggest impact on the data
• Linear regression might show you that two attributes are strongly correlated, so you only need to keep one of those
attributes and can remove the other

• Normalisation
• Normalise numeric fields to use the same range. Features with large values compared to features with small
values can be given more weight by various algorithms
• Normalisation eliminates the unit of measurement by rescaling data, often to a value between 0 and 1

• Replace / correct data and measurement errors


Data Preparation – Missing Data

Missing data occurs when there is no data value stored for a particular observation or field within the data. Missing data
is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Common approaches to handling missing data

• Calculate a value to replace the missing value


• The simplest approach is to calculate the mean for the field in question and replace the missing value with the
mean value (sketched at the end of this slide)
• More complex statistical methods can be applied which consider other known values for the data record in
question

• Delete or remove the record from the data set


• This approach is often adopted when the target field for a predictive model has missing data.
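A minimal sketch of both approaches using pandas (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("customers.csv")

# Approach 1: replace a missing value with the mean of the field
df["income"] = df["income"].fillna(df["income"].mean())

# Approach 2: remove records where the target field for the predictive model is missing
df = df.dropna(subset=["churned"])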
Data Preparation - Normalisation

Data normalization is a process of adjusting values measured on different scales to a notionally common scale

Example

If an analyst scores a piece of information on a scale of 1-5 where 1 = not useful and 5 = very useful

If a machine scores the same piece of information, but as a percentage

The usefulness of the information has been scored by two systems that use different scales.

In order to compare the analyst vs the machine the values need to be normalised to a common scale

One approach to do this would be to convert the percentage scores to the same scale the analyst used

0-20 = 1
21-40 = 2
41-60 = 3
61-80 = 4
81-100 = 5
Data Preparation - Normalisation

There are 3 common techniques used for data normalisation, these are not exhaustive and there are a multitude of
alternative techniques

Simple Feature Scaling – used to bring all values into the range 0-1

Min-Max Feature Scaling – a variation on the simple feature scaling which takes into account where the value lies
between the min and max score

Z-Score – normalises based on the mean and standard deviation – scores typically range from -3 to 3,

where μ = population mean and σ = population standard deviation (an alternative is to use the sample mean and standard deviation). The three formulas are sketched below.
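The three techniques, sketched as plain Python functions (x is a single value, values is the full list of observations for that field):

def simple_feature_scaling(x, values):
    # x' = x / max: brings non-negative values into the range 0-1
    return x / max(values)

def min_max(x, values):
    # x' = (x - min) / (max - min): accounts for where x lies between the min and max
    return (x - min(values)) / (max(values) - min(values))

def z_score(x, mu, sigma):
    # x' = (x - mu) / sigma, with mu the mean and sigma the standard deviation; typically -3 to 3
    return (x - mu) / sigma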
Data Preparation – Categorical data
Categorical or discrete variables are those that represent a fixed number of possible values, rather than a continuous
number.

There are 2 types of categorical data variables


• Nominal – categories labelled without any order of precedence, e.g. “London”, “Paris”, “Berlin”
• Ordinal – categories labelled where an order of precedence exists, e.g. “Low”, “Medium”, “High”

Although some data mining algorithms can handle categorical data as it is, many require it to be converted into a
numerical representation.

There are a number of approaches that enable this conversion, these are often referred to as encoding techniques

• Approach 1 – Label Encoding


• Assign each unique category value a set numeric value
• This approach is good for ordinal categorical data, but could be misleading for nominal data
• Easy to implement
Ordinal example: Low = 1, Medium = 2, High = 3
Nominal example: London = 1, Paris = 2, Berlin = 3, Rome = 4

Note: some algorithms might see a value of 3 as three times the magnitude of a value of 1
Data Preparation – Categorical data
• Approach 2 – One Hot Encoding
• Converts each category value into a new column and assign a 1 or a 0 (true or false)
• This overcomes the weighting issue of label encoding
• This has the potential to add a large number of dimensions to your data (both encodings are sketched at the end of this slide)

Note: High dimensionality can cause model complexity. A dataset with more dimensions requires more parameters for
the model to understand and that means more rows to reliably learn the parameters. If the number of rows in a dataset
is fixed, the addition of extra dimensions without adding more information to learn from can have a detrimental effect on
eventual model accuracy

• Approach 3 – Custom Binary Encoding


• This is a hybrid of label and one hot encoding where the number of categories is first reduced into buckets and
then one hot encoding is applied
• For example, if the category was world cities, the cities might first be grouped into geographical regions or by
population size, and then these abstracted labels converted into one hot encoding
• This approach attempts to minimise the “curse of dimensionality” problem
• Natural categorical groupings might be hard to determine / may collapse the data down too much

• Other Approaches
• There are a number of other approaches that make use of the mean and standard deviation of the dependent
variable in order to give a distribution across the one hot encoding (i.e. each additional dimension is given a
value between 0 and 1). Popular ones include “Backward Difference Coding” and “Polynomial Coding”
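A minimal sketch of label encoding and one hot encoding using pandas (toy data):

import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "Berlin", "Rome"],
                   "priority": ["Low", "Medium", "High", "Low"]})

# Label encoding suits the ordinal field: the order Low < Medium < High is meaningful
df["priority_encoded"] = df["priority"].map({"Low": 1, "Medium": 2, "High": 3})

# One hot encoding suits the nominal field: one new 0/1 column per category
df = pd.get_dummies(df, columns=["city"])

print(df)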
Data Preparation
• Split the data into training, validation and test data sets (a split sketch follows this list)

• Training data set - is used to train the model.


• It can vary, but typically we use 60% of the available data

• Validation dataset – having selected a model that performs well on training data, we run the model on the validation
data set.
• Typically ranges from 10% - 20% of the available data
• Allows us to test for overfitting

• Test dataset – contains data that has never been used in the training
• Typically ranges from 5% - 20% of the available data
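A minimal sketch of a 60% / 20% / 20% split using scikit-learn (the feature matrix X and target y here are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)            # placeholder feature matrix
y = np.random.randint(0, 2, 1000)      # placeholder binary target

# First hold out 60% for training, then split the remainder equally into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)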
Modelling
Modelling
Selection of the best-suited modelling techniques to use for a given business problem. This step not only includes
defining the appropriate technique or mix of techniques to use, but also the way in which the techniques should be
applied.

Some standard techniques

• Classification
• Associations
• Clustering
• Anomaly detection
• Prediction
• Similar patterns
• Similar time sequences

The selection of the method will normally be obvious; for example, market basket analysis in retail will use the association
technique, which was originally developed for this purpose.

The challenge is usually not which technique to use, but the way in which the technique should be applied. All the
techniques will require some form of parameter selection which requires experience of how the techniques work and
what the various parameters do.

Modelling is an iterative process where a number of experiments will be conducted in order to find the optimal model,
parameter and feature selection.
Modelling

[Diagram: business use cases mapped to families of techniques]

Use cases: Market Basket Analysis, Market Segmentation, Risk Analysis, Profiling, Direct Mail (CRM), Event Analysis, Portfolio Selection, Recommendation Engine, Fraud Detection, Temporal Analysis, Credit Approval, Defect Analysis, Probability Forecasting, Web Usage

Technique families:
• Data Preparation – Feature Reduction: Pearson Correlation, Principal Components Analysis, Factor Analysis
• Association (unsupervised machine learning): Apriori, Sequence, Association Rules
• Clustering (unsupervised machine learning): Hierarchical Clustering, Anomaly Detection, K Means, Kohonen
• Classification and Predictive (supervised machine learning): Regression, Decision Tree, Support Vector Machine, Bayesian Network, Neural Network, Statistical models
Modelling

The difference between Statistical and Machine Learning techniques

Statistical Modelling → Machine Learning

• More mathematically based → fewer assumptions
• Sub-field of mathematics → sub-field of computer science
• Uses equations → uses algorithms
• Human effort → minimal human effort (please note there can be a significant human effort to train supervised machine learning models)
• Model → network / graph
• Parameters → weights
• Fitting → learning
• Regression / classification → supervised learning
• Density estimation / clustering → unsupervised learning
Classification Techniques
Classification: Predicts a class label where the dependent variable is categorical
STATISTICAL CLASSIFICATION MODELS

Logistic Regression / Multinomial Logistic Regression
• Description: A regression model – a set of statistical processes for estimating the relationship among variables
• When to use: LR – where the dependent variable is binary categorical; MLR – where the dependent variable has more than two outcome categories
• Strengths: Low variance and so is less prone to over-fit; predicted probabilities are well calibrated
• Weaknesses: Assumes that there is a single decision boundary; only works on numerical data, so alpha categorical data needs to be converted into dummy numeric values

Naïve Bayes Classifier
• Description: A conditional probability model that uses the method of maximum likelihood
• When to use: Where variables in the data set are independent; normally requires integer feature counts (e.g. word counts for text classification) but also works on fractional counts such as tf-idf
• Strengths: Highly scalable; fast
• Weaknesses: Has been shown to be out-performed by Decision Trees and Random Forests; produces poorly calibrated probabilities – favours extreme probability values

MACHINE LEARNING CLASSIFICATION MODELS

Support Vector Machines – linear, polynomial, radial basis function and sigmoid kernels
• Description: A non-linear, non-parametric technique used to perform a binary classification
• When to use: When data is not regularly distributed or the distribution is unknown
• Strengths: Good at analysing data with a very large number of predictor fields; good at handling data that is not regularly distributed
• Weaknesses: High algorithmic complexity and extensive memory requirements – affects the speed and size of both training and testing; the non-parametric nature of the algorithm makes results hard to visualise
Classification Techniques - Continued
Decision Tree – variants include C5.0, Classification and Regression, Chi-square Automatic Interaction Detection (CHAID) and Quick Unbiased Efficient Statistical Tree (QUEST)
• Description: Uses a tree-like graph to model decisions and their possible consequences
• When to use: Where there is a mix of numeric and categorical alpha fields (CHAID requires all numeric)
• Strengths: Can be applied when there is not just one underlying decision boundary; CHAID can create non-binary trees, meaning that splits can have more than 2 branches
• Weaknesses: Complex decision trees are open to over-fitting; complex trees are computationally hungry

Random Forests
• Description: Constructs a multitude of smaller, less complex decision trees on randomly selected features and subsets of the data
• When to use: Large, complex data sets with lots of features
• Strengths: Smaller, less complex trees are fast to train; less prone to overfitting
• Weaknesses: Slow to create predictions once trained

K-Nearest Neighbours
• Description: A method for classifying cases based on their similarity to other cases; a way to recognise patterns of data without requiring an exact match to any stored patterns or cases
• When to use: A good baseline method to try before more advanced techniques
• Strengths: The model is easy to understand and often gives reasonable performance without a lot of adjustments; building can be very fast
• Weaknesses: When the training set is very large (in number of records or number of fields) prediction can be slow; does not perform well on very large feature spaces (100 or more features); does not perform well on sparse data sets (where most features have 0 values)
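For orientation, a minimal sketch fitting three of the classifiers above with scikit-learn on synthetic data (untuned; it only shows the common fit / score pattern):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)   # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42),
              KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)                                 # train on the training set
    print(type(model).__name__, model.score(X_test, y_test))    # accuracy on held-out data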
Predictive Modelling
Techniques
Predictive: Predicts a value where the dependent variable is continuous
STATISTICAL PREDICTIVE MODELS

Regression – Linear Regression (LR), Multiple Linear Regression (MLR), Polynomial Regression (PR), Cox Regression (CR), Isotonic Regression (IR)
• Description: A regression model – a set of statistical processes for estimating the relationship among variables
• When to use: LR – one independent variable used to make a prediction; MLR – multiple independent variables used to make a prediction; PR – non-linear data; CR – time-to-event predictions such as customer churn; IR – does not assume any form of target function such as linearity
• Strengths: Low variance and so is less prone to over-fit
• Weaknesses: Assumes that there is a single decision boundary; only works on numerical data, so alpha categorical data needs to be converted into dummy numeric values

MACHINE LEARNING PREDICTIVE MODELS

Neural Networks
• Description: Can approximate a wide range of predictive models with minimal demands on model structure and assumptions
• Strengths: Automatically detects whether a linear or non-linear model is required
• Weaknesses: Not easy to interpret the relationship between the target and the predictors
Association Rule Learning
Techniques
Association Rules: Discover interesting relations between variables in large datasets
MACHINE LEARNING ASSOCIATION TECHNIQUES

Association Rule
• Description: Captures association rules by discovering regularities between variables in a large data set
• When to use: Developed for market basket analysis – for example, if you buy onions and potatoes then you are likely to also buy a burger
• Weaknesses: Does not consider the order of the items within a transaction or across transactions

Apriori
• Description: Similar to association rule, but looks at the frequency of items in the data corpus to group frequent items together
• When to use: For large datasets
• Strengths: Faster to train

Sequence
• Description: Uses a two-pass method for finding frequent sequences within the dataset
• When to use: When temporal analysis of the data is important – e.g. market basket analysis over time
Clustering Techniques
Cluster Analysis: The task of grouping a set of objects in such a way that objects in the same group are more similar
to each other than to those in other groups. Cluster algorithms minimize the distance within a cluster and maximize
the distance between clusters

STATISTICAL CLUSTERING TECHNIQUES

Hierarchical Clustering
• Description: Builds a hierarchy of clusters. There are two main approaches: bottom up, where each observation starts as its own cluster and pairs of clusters are joined as you move up the hierarchy; and top down, which starts with one big cluster and recursively splits it as you move down the hierarchy. Results are usually presented in a dendrogram
• Strengths: No knowledge of the ideal number of clusters is required; easy to implement
• Weaknesses: The algorithm can never undo what was previously done; time consuming for large datasets; can be sensitive to outliers
Clustering Techniques
MACHINE LEARNING CLUSTERING TECHNIQUES

K Means
• Description: Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
• When to use: On large data sets with lots of variables
• Strengths: If the number of variables is large, k-means is computationally faster than hierarchical clustering techniques; k-means produces tighter clusters than hierarchical techniques
• Weaknesses: Difficult to predict the k value; different initial partitions can result in different final clusters – this is particularly evident on smaller data sets; you do not know which variable contributes more to the clustering process; the arithmetic mean is not robust to outliers

Anomaly Detection
• Description: A clustering technique which detects anomalies or outliers within a dataset
• When to use: To identify some kind of problem, such as bank fraud or a structural defect, within the dataset
• Strengths / Weaknesses: There are multiple techniques that can be applied, dependent on the dataset and problem

Kohonen
• Description: A type of neural network that performs clustering using a self-organising map; tries to uncover patterns in a set of input fields
• Strengths: Does not use a target field; does not need to know the number of clusters
• Weaknesses: More computationally expensive than k-means; harder to interpret the results
Feature Reduction
Feature Reduction: Process of selecting a subset of relevant features for use in model construction based on the
premise that data contains many features that are either redundant or irrelevant and can therefore be removed
without incurring much loss of information
STATISTICAL FEATURE REDUCTION TECHNIQUES

Pearson Correlation Coefficient
• Description: Measures the linear correlation between two variables. It gives a value between -1 and 1, where 1 is total positive correlation, -1 is total negative correlation and 0 is no correlation
• When to use: On datasets that are normally distributed; a good first analysis step in understanding your data and how different variables are related to each other
• Strengths: Easy to calculate
• Weaknesses: Outliers can cause misleading results; sensitive to the data distribution; a lack of correlation may not mean there is no relationship – it could be non-linear

Principal Component Analysis (PCA)
• Description: Uses orthogonal transformations to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
• When to use: Simplification of models; shorter training times; to avoid the curse of dimensionality in sparse datasets; enhanced generalisation by reducing overfitting
• Strengths: Fast and simple to implement
• Weaknesses: Reliant on finding linear correlations within datasets; sensitive to the scaling of variables – consider normalising first

Factor Analysis
• Description: Similar to PCA, but looks for joint variations across multiple variables
• When to use: See PCA
• Strengths: Not so susceptible to error variance
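A minimal PCA sketch with scikit-learn on synthetic data; note the scaling step, since PCA is sensitive to the scale of the variables:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 20)                       # synthetic 20-feature data set

X_scaled = StandardScaler().fit_transform(X)      # normalise first: PCA is scale sensitive
pca = PCA(n_components=5)                         # keep the 5 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)              # variance explained by each component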
Model Evaluation
Model Evaluation

To determine if the model is a good fit we look at the following combination of:
• Do the predicted values make sense?
• Visualization
• Numerical values for evaluation
• Comparison of multiple models

A model can be configured or tuned by adjusting the hyperparameters of the model


Evaluation

The next stage of the CRISP-DM method is to evaluate your results against the business goals stated at the beginning
of the process and to determine whether the success criteria have been met.

Task List

Document your assessment of whether the data analysis results meet the business success criteria. Consider the
following questions in your report:

• Are your results stated clearly and in a form that can be easily presented?

• Are there particularly novel or unique findings that should be highlighted?

• Can you rank the models and findings in order of their applicability to the business goals?

• In general, how well do these results answer your organization's business goals?

• What additional questions have your results raised? How might you phrase these questions in business terms?
Hyper Parameter Tuning
Brief Explanation

Hyperparameters are the settings of an algorithm that can be adjusted to optimise performance

• Model parameters, such as slope and intercept in a linear regression model are learned during training
• Hyperparameters must be set by the data scientist before training

Python libraries like Scikit-Learn implement a set of sensible default hyperparameters, but these are not guaranteed to
be optimal for a problem.

The best hyperparameters are usually impossible to determine ahead of time and there is an element of trial and error
required to properly tune a model.

• To determine optimal settings try many different combinations and evaluate the performance of each model

• It is important to perform this evaluation on the validation and test data sets to avoid model overfitting.
Hyper Parameter Tuning

Random Forest Example

In order to illustrate hyperparameter tuning it is easier to go through an example

A random forest algorithm has the following hyperparameters


• n_estimators – number of trees in the forest
• max_features – max number of features considered for splitting a node
• max_depth – max number of levels (depth) allowed in each tree
• min_samples_split – min number of data points required in a node before the node is split
• min_samples_leaf – min number of data points allowed in a leaf node
• bootstrap – method for sampling data points (with or without replacement)

To tune the model, combinations from the following grid could be tried:

bootstrap: [True, False]
max_depth: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None]
max_features: ['auto', 'sqrt']
min_samples_leaf: [1, 2, 4]
min_samples_split: [2, 5, 10]
n_estimators: [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
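A sketch of how such a grid might be searched with scikit-learn's RandomizedSearchCV (synthetic data and a reduced grid so it runs quickly; max_features is omitted because its accepted values differ between scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 50, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [200, 600, 1000],
}

# Randomly sample 20 combinations and evaluate each with 5-fold cross validation
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_grid,
                            n_iter=20, cv=5, random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)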
Model Evaluation

Two standard measures are used to evaluate linear regression models (a sketch of both follows this list):

• Mean Square Error


• Calculate the distance between the predicted and actual value of the target field and square it
• Calculate the mean of all the observed errors
• R-Squared
• Is a measure to determine how close the data is to the fitted regression line
• Is the percentage of variation of the target variable that is explained by the linear model
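A minimal sketch computing both measures with scikit-learn (synthetic data):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 3)                                            # synthetic features
y = X @ np.array([3.0, -2.0, 1.0]) + np.random.normal(0, 0.1, 200)    # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))   # mean of the squared prediction errors
print("R^2:", r2_score(y_test, y_pred))             # share of target variation explained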
Hyper Parameter Tuning and cross validation
• Consider using K-fold cross validation technique to minimise sample variability between training and test set

• Cross validation is a statistical technique which involves partitioning the data into subsets, training the data on a
subset and using the other subset to evaluate the model’s performance.

• Hyperparameter tuning is part of the cross validation model evaluation, which means that you are potentially going
to be evaluating hundreds of models in order to properly identify the optimal configuration

[Diagram: 5-fold cross validation – the total available data is split into 5 folds; in each of 5 experiments a different fold is held out for validation and the remaining folds are used for training]
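A minimal 5-fold cross validation sketch with scikit-learn (synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each of the 5 folds takes a turn as the validation set; the rest is used for training
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores, scores.mean())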


Regression Plots

Shows the relationship between two variables and whether a linear model is appropriate:

• The strength of the correlation


• The direction of the relationship (positive or negative)
• Validates that the model makes sense – e.g. are there invalid values like a negative price
Residual Plots

Plots the error between predicted and observed values to determine whether a linear model is appropriate

• A linear model is appropriate if the residual error is distributed randomly. The residual error would have a zero mean
and equal variance
• A linear model is not appropriate when the residual error is not random where the values of the error change with x

[Plots: a good linear model fit, where the residuals scatter randomly around zero, vs a bad linear model fit, where the residuals show a clear pattern]
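A minimal residual plot sketch with seaborn (synthetic, roughly linear data):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.rand(200) * 10
y = 3 * x + np.random.normal(0, 1, 200)   # synthetic data with a linear relationship

# For a good linear fit the residuals should scatter randomly around zero
sns.residplot(x=x, y=y)
plt.show()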


Distribution Plots

Distribution plots compare the distribution of the predicted values against the distribution of the actual values

Here are two examples of distribution plots created to evaluate two different multi-linear regression models

The one on the left shows that the predicted values are closer to the target value
The one on the right shows that the model predicted values for the higher priced cars are inaccurate
Precision and Recall

Precision is the fraction of retrieved instances that are relevant

Recall is the fraction of relevant instances that are retrieved

Precision (p) = # of aligned mentions / # of mentions produced by the system

Recall (r) = # of aligned mentions / # of mentions in the reference (ground truth)

f1 is the harmonic mean of precision and recall – it punishes you for a disparity between precision and recall

f1 = 2 * p * r / (p + r)
Example of Precision and Recall

Suppose a computer program recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats.

Of the 8 dogs identified:

• 5 actually are dogs – true positives
• The rest are cats – false positives
• The 7 dogs that were not recognized are false negatives

The program's precision is 5/8

The program's recall is 5/12

f1 = 0.5
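The same numbers, checked with scikit-learn (label 1 = dog, 0 = cat; only the 15 objects that affect the calculation are listed):

from sklearn.metrics import precision_score, recall_score, f1_score

# Ground truth: the 12 real dogs plus the 3 cats the program flagged as dogs
y_true = [1] * 12 + [0] * 3
# Program output: 5 dogs found (true positives), 7 dogs missed, 3 cats flagged (false positives)
y_pred = [1] * 5 + [0] * 7 + [1] * 3

print(precision_score(y_true, y_pred))   # 5/8  = 0.625
print(recall_score(y_true, y_pred))      # 5/12 = 0.417 (approx.)
print(f1_score(y_true, y_pred))          # 0.5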
ROC and PR Curves

A receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier
system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
The true-positive rate is also known as sensitivity

A PR curve plots precision against recall – precision being how many true positives there are out of everything predicted as positive

PR is thought to be more informative when there is a high class imbalance
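A minimal sketch of both curves with scikit-learn (synthetic, imbalanced data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)   # imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, probs)                        # ROC: true positive rate vs false positive rate
precision, recall, _ = precision_recall_curve(y_test, probs)  # PR: precision vs recall over the same thresholds
print("ROC AUC:", roc_auc_score(y_test, probs))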


Deployment
Deployment

There are 4 key factors to model deployment

• Deployment
• Making model predictions / classifications easily available to the business

• Evaluation
• Measuring the quality of the deployed models

• Management
• Tracking model quality over time

• Monitoring
• Improving deployed models with feedback
Deployment
Deployment is making the model easily accessible and available to the business

• The model moves from the development environment into a production environment
• Live data is fed through the model to be scored or classified (depending on which model has been selected)
• Live data will need to be prepared correctly into a format that the deployed model is expecting
• The output of the deployed model is actionable insights which some external application can consume
[Diagram: in the development environment, historical data feeds model development and evaluation to produce a trained model; the trained model is deployed into the production environment, where live data is fed through it to produce actionable insights; feedback flows back to development and live data becomes historical data]
Deployment

Evaluation – measuring the quality of deployed models

There are 2 measurements of evaluation

• Accuracy of prediction – this is largely done during the model evaluation in the model development phase, but
further accuracy tests can be performed on the live system (Machine Learning Metric)

• Business impact – is the deployed model influencing the business success criteria defined in the business
understanding phase (Business Metric)

Management and Monitoring – Tracking model quality over time, improving based on feedback

Important features
• Versioning
• Logging
• Provenance
• Dashboards
• Reports
• Capturing and responding to feedback
Planning, Monitoring and Maintenance
In a full-fledged deployment and integration of modeling results, your data analysis work may be ongoing. For
example, if a model is deployed to predict sequences of e-basket purchases, this model will likely need to be
evaluated periodically to ensure its effectiveness and to make continuous improvements. Similarly, a model deployed
to increase customer retention among high-value customers will likely need to be tweaked once a particular level of
retention is reached. The model might then be modified and re-used to retain customers at a lower but still
profitable level on the value pyramid.

Task list

• For each model or finding, which factors or influences (such as market value or seasonal variation) need to be
tracked?

• How can the validity and accuracy of each model be measured and monitored?

• How will you determine when a model has "expired"? Give specifics on accuracy thresholds or expected changes
in data, etc.

• What will occur when a model expires? Can you simply rebuild the model with newer data or make slight
adjustments? Or will changes be pervasive enough as to require a new data mining project?

• Can this model be used for similar business issues once it has expired? This is where good documentation
becomes critical for assessing the business purpose for each data mining project.
Model Update Strategies
There are a number of strategies for migrating to a new model, 2 of which are covered in this presentation

A / B testing
• Measures the performance of two models and, when one outperforms the other, the new model is published.

• For example an association model for product recommendations based on website page navigation might
measure the click through rate as a business success criteria
[Example: Model V1 receives 2000 visits and achieves a 10% click-through rate; Model V2 receives 2000 visits and achieves a 30% click-through rate, so Model V2 gets promoted]

Multi-armed bandit
• Focus on how to learn faster by performing exploration and exploitation

• Exploration – split 10% of your data equally between the models to be evaluated
• Exploitation - Use the other 90% of the data on the best performing model

• Exploration is ongoing and runs simultaneously, allowing for a model switch if another model starts to
outperform the current one (a minimal sketch follows).
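A minimal epsilon-greedy sketch of the idea (simulated click-through rates; the 10% exploration share above becomes the epsilon):

import random

true_ctr = {"v1": 0.10, "v2": 0.30}        # simulated click-through rates for the two models
clicks = {m: 0 for m in true_ctr}
serves = {m: 0 for m in true_ctr}

for _ in range(10000):
    if random.random() < 0.1:              # exploration: 10% of traffic split across the models
        model = random.choice(list(true_ctr))
    else:                                  # exploitation: route to the best observed model so far
        model = max(true_ctr, key=lambda m: clicks[m] / serves[m] if serves[m] else 0.0)
    serves[model] += 1
    if random.random() < true_ctr[model]:  # simulate whether this visit results in a click
        clicks[model] += 1

print({m: round(clicks[m] / serves[m], 3) for m in true_ctr})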
Deployment

E-Retail Example continued

A successful deployment of the e-retailer's data mining results requires that the right information reaches the right
people.

Decision makers. Decision makers need to be informed of the recommendations and proposed changes to the site, and
provided with short explanations of how these changes will help. Assuming that they accept the results of the study, the
people who will implement the changes need to be notified.

Web developers. People who maintain the Web site will have to incorporate the new recommendations and
organization of site content. Inform them of what changes could happen because of future studies, so they can lay the
groundwork now. Getting the team prepared for on-the-fly site construction based upon real-time sequence analysis
might be helpful later.

Database experts. The people who maintain the customer, purchase, and product databases should be kept apprised of
how the information from the databases is being used and what attributes may be added to the databases in future
projects.

Above all, the project team needs to keep in touch with each of these groups to coordinate the deployment of results
and planning for future projects.
