22 Partition Modeling and Datamining-1

Lean 6 Sigma
Introduction to the Partition Analysis
• DMAIC and Data Mining
• What is Data Mining?
• JMP and Data Mining
• The Press Band Data - Background
• The Press Band Data - Partitioning
• Case Study: Defect Reduction
• Summary
"We are drowning in information and starving for knowledge."
- Rutherford D. Rogers
There's Gold In Them Thar Data (maybe)!
The original data miner?
© 2016 Philip J. Ramsey, Ph.D.

DMAIC and Data Mining
In both transactional and manufacturing situations, large
observational data sets relating to the processes of interest are
frequently available.
These data sets can be “mined” in order to:
• Identify well-scoped projects;
• Provide background information on relationships
  between predictors and responses, prior to further data
  collection or simply as background knowledge;
• Suggest causal relationships and potential solutions;
• Suggest hypotheses to be explored in the Analyze phase;
• Identify anomalies;
• Reduce the number of predictors to be studied.
DMAIC and Data Mining
As such, data mining can support the Define, Measure, Analyze,
and Improve phases of DMAIC.
In the Define phase of a project, partitioning can be used to
define, or refine, the scope of a project.
[Figure: DMAIC cycle: Define, Measure, Analyze, Improve, Control]

What Is Data Mining?
Data Mining is the analysis of large observational data sets with
the goal of finding unsuspected relationships.
By “large” data sets, we mean either a large number of records,
and/or a large number of variables measured on each
record.
The other key word in the definition is “observational”.
• Often, data sets used in data mining were collected for
  purposes other than those of the data mining study.
• Consequently, these data sets often lack the integrity and
  relevance found in data sets collected as part of a
  well-defined study.
JMP and Data Mining
JMP’s Partition platform is a version of Classification and
Regression Tree Analysis (CART™).
Both response and factors can be either continuous or categorical.
Continuous factors are split into two partitions according to
cutting values.
Categorical factors are split into two groups of levels.
If the response is continuous, the fitted values are the averages
within groups.
If the response is categorical, the fitted values are the response
rates (estimated probabilities) within groups.
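
The fitted values described above can be sketched in a few lines of Python. This is only an illustration, not JMP's implementation; the function names and the small data set are hypothetical.

```python
from statistics import mean

def split_fitted_values(x, y, cut):
    """Split rows on a continuous factor at `cut` and return the fitted
    value (the mean of y) for each side: (x < cut, x >= cut)."""
    left = [yi for xi, yi in zip(x, y) if xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi >= cut]
    return mean(left), mean(right)

def response_rates(labels):
    """Fitted values for a categorical response: the proportion of
    each level within the group."""
    n = len(labels)
    return {level: labels.count(level) / n for level in sorted(set(labels))}

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 20.0, 22.0]
left_fit, right_fit = split_fitted_values(x, y, cut=2.5)
rates = response_rates(["BAND", "NOBAND", "NOBAND", "NOBAND"])
```

Here the two groups get fitted values 11.0 and 21.0, and the categorical group gets a fitted BAND rate of 0.25.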

JMP and Data Mining
The splits are determined by maximizing the LogWorth criterion,
which reflects the degree of separation for a potential split.
For a continuous response, the LogWorth is related to the sums of
squares due to the differences between means on each side of
the split.
For a categorical response, it is related to the likelihood ratio chi-
square statistic – we skip the theoretical details.
JMP’s Partition platform is useful for both exploring
relationships and for modeling.
The platform is visually oriented and easy to use.
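
For a continuous response, the idea of searching for the cut point that best separates the group means can be sketched as follows. This is a simplified illustration: it ranks candidate cuts by the raw reduction in the sum of squares and does not reproduce JMP's LogWorth, which is -log10 of an adjusted p-value. The function names and data are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_cut(x, y):
    """Return (cut, reduction) for the split x < cut versus x >= cut
    that gives the largest reduction in the sum of squares of y."""
    total = sum_sq(y)
    best = (None, 0.0)
    # Every distinct x value except the smallest is a candidate cut.
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        reduction = total - sum_sq(left) - sum_sq(right)
        if reduction > best[1]:
            best = (cut, reduction)
    return best

# Hypothetical data: the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
cut, reduction = best_cut(x, y)
```

The search selects the cut at x = 4, where the two group means differ the most.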

Case Study – Fisher Iris Data
R.A. Fisher (father of modern statistics) published a famous dataset
in which he attempted to classify Iris plants into the correct
species based upon petal and sepal measurements.
There are three species: Virginica, Versicolor, and Setosa.
Four continuous measurements (sepal length, sepal width, petal
length, and petal width) are used to attempt to correctly classify
the plants into their species.
We will apply the Partition platform in JMP to see if we can find a
significant classification model.
The data are available in the JMP Help Sample Data directory: Help >
Sample Data > Iris.JMP.

Based on LogWorth, our first partition is on Petal Length. This is the
best classifier for Species.
[Figure: Partition for Species; vertical axis Species (setosa,
versicolor, virginica) on a 0.00 to 1.00 scale; All Rows]

RSquare   N     Number of Splits
0.000     150   0

All Rows
Count   G^2
150     329.58369

Level        Prob
setosa       0.3333
versicolor   0.3333
virginica    0.3333

Candidates
Term           Candidate G^2    LogWorth
Sepal length   115.8732799      32.09622458
Sepal width     58.8743943      14.76067659
Petal length   190.9542505 *    57.33863312
Petal width    190.9542505      56.63837050
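
The G^2 values in this report are consistent with the likelihood ratio form G^2 = 2 * sum of n_k * ln(n / n_k) over the response levels in a node, with the candidate G^2 for a split being the parent's G^2 minus the G^2 of the two child nodes. The sketch below checks this against the Iris numbers. The function name is hypothetical; the child counts (50 setosa on one side, 50 versicolor and 50 virginica on the other) follow from the counts and probabilities JMP reports after the first split. LogWorth is not reproduced here.

```python
import math

def g2(counts):
    """Likelihood ratio G^2 of a node from its class counts:
    2 * sum of n_k * ln(n / n_k), skipping empty classes."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(n / c) for c in counts if c > 0)

# Class counts in the order (setosa, versicolor, virginica).
root = g2([50, 50, 50])          # All Rows
right = g2([0, 50, 50])          # Petal length >= 3.0
left = g2([50, 0, 0])            # Petal length < 3.0 (pure node)

# Candidate G^2 for the split: reduction from parent to children.
split_g2 = root - left - right

print(round(root, 5), round(right, 5), round(split_g2, 5))
```

The three values agree with the 329.58369 reported for All Rows, the 138.62944 reported for the Petal length >= 3.0 node, and the 190.9542505 candidate G^2 for Petal length.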

The Fit Y by X plot demonstrates that the largest G^2 or LogWorth
value was achieved for Petal Length = 3. No split on any of the
other three variables achieves a higher LogWorth.
[Figure: Bivariate Fit of Criterion By Petal length]

Our first partition is on Petal Length.
Notice that the Setosa species all have Petal Length < 3.0 and none
of the other species occur in this node.

All Rows
Count   G^2         LogWorth
150     329.58369   57.338633

Level        Prob
setosa       0.3333
versicolor   0.3333
virginica    0.3333

                                        Prob      Prob         Prob
Node                Count   G^2         setosa    versicolor   virginica
Petal length>=3.0   100     138.62944   0.0000    0.5000       0.5000
Petal length<3.0     50     0           1.0000    0.0000       0.0000

A Fit Y by X plot shows why the cut point was selected.
[Figure: Bivariate Fit of Petal width By Petal length, points marked
by Species (setosa, versicolor, virginica); the Setosa cluster is
labeled]
The second split is on Petal Width.

Node                    Count   G^2         LogWorth
All Rows                150     329.58369   57.338633
  Petal length>=3.0     100     138.62944   25.997261
    Petal width<1.8      54     33.317509
    Petal width>=1.8     46     9.6353844
  Petal length<3.0       50     0

A Fit Y by X plot shows why the second cut point was selected.
[Figure: Bivariate Fit of Petal width By Petal length, points marked
by Species; the Virginica, Versicolor, and Setosa clusters are
labeled]

A third split is made on Petal Length.

Node                      Count   G^2         LogWorth
All Rows                  150     329.58369   57.338633
  Petal length>=3.0       100     138.62944   25.997261
    Petal width<1.8        54     33.317509   3.3691232
      Petal length<5.0     48     9.7214225
      Petal length>=5.0     6     7.63817
    Petal width>=1.8       46     9.6353844
  Petal length<3.0         50     0

A Fit Y by X plot shows the third split is less definitive.
[Figure: Bivariate Fit of Petal width By Petal length, points marked
by Species, with Split 1, Split 2, and Split 3 drawn and the
Virginica, Versicolor, and Setosa clusters labeled]

The Leaf Report depicts the counts and probabilities for each of the
4 nodes. Notice that the Petal Length < 3.0 node was not split
further since it is already 100% setosa.

A Mosaic Plot of Species by Leaf (or node) number reveals how
well the Partition model has worked.
[Figure: Mosaic Plot of Species (setosa, versicolor, virginica) by
Leaf Number (1 to 4)]
The model predictions are nothing more than the estimated
proportions or probabilities for each of the classes in each of the
terminal nodes.
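As an example, below is the prediction formula for Versicolor.
[Figure: saved prediction formula for Versicolor]
Similarly, below is the prediction formula for Virginica.
[Figure: saved prediction formula for Virginica]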
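
The idea that a prediction is just the class proportions of the terminal node a row falls into can be sketched in Python using the three Iris splits (Petal length at 3.0, Petal width at 1.8, Petal length at 5.0). Only the leaf sizes (50, 46, 48, 6) and the all-setosa leaf come from the reports; the versicolor and virginica counts within the other three leaves are assumptions, chosen because they reproduce the reported G^2 values. All names are hypothetical, and JMP's saved formula may not match these raw proportions exactly.

```python
import math

LEVELS = ("setosa", "versicolor", "virginica")

# Class counts per leaf in LEVELS order. The versicolor/virginica
# counts are ASSUMED (they reproduce the reported leaf G^2 values).
LEAF_COUNTS = {
    "PL<3.0": (50, 0, 0),
    "PL>=3.0 & PW>=1.8": (0, 1, 45),
    "PL>=3.0 & PW<1.8 & PL>=5.0": (0, 2, 4),
    "PL>=3.0 & PW<1.8 & PL<5.0": (0, 47, 1),
}

def g2(counts):
    """Node G^2 = 2 * sum of n_k * ln(n / n_k) over non-empty classes."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(n / c) for c in counts if c > 0)

def leaf_for(petal_length, petal_width):
    """Follow the three splits to the terminal node for one plant."""
    if petal_length < 3.0:
        return "PL<3.0"
    if petal_width >= 1.8:
        return "PL>=3.0 & PW>=1.8"
    if petal_length >= 5.0:
        return "PL>=3.0 & PW<1.8 & PL>=5.0"
    return "PL>=3.0 & PW<1.8 & PL<5.0"

def predict_probs(petal_length, petal_width):
    """Predicted probabilities = class proportions in the row's leaf."""
    counts = LEAF_COUNTS[leaf_for(petal_length, petal_width)]
    n = sum(counts)
    return {level: c / n for level, c in zip(LEVELS, counts)}

probs = predict_probs(petal_length=4.5, petal_width=1.5)
```

For a plant with petal length 4.5 and petal width 1.5, the row lands in the 48-row leaf, so under these assumed counts the predicted probability of versicolor is 47/48.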

The Press Band Data - Background
We will illustrate JMP’s partition platform and data mining using
a data set from a continuous improvement project for a
rotogravure printing process.
In this printing process:
• An engraved chrome-plated cylinder is rotated in a bath of
  ink,
• Excess ink is removed,
• Paper is pressed against the inked image,
• Once the job is complete, the engraved image is removed
  from the cylinder, and the cylinder is reused.

[Figure: the rotogravure printing process, annotated as follows.]
The image to be printed is incised into the cylinder.
The cylinder is rotated through the ink fountain.
The doctor blade acts as a squeegee, removing excess ink.
The paper is pressed against the cylinder by the impression roll,
transferring the image.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version
published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. Subject to disclaimers.

A defect called banding can sometimes occur, ruining the
product.
Banding consists of grooves that appear in the cylinder at
some point during the print run.
Once detected, the run is halted, and the cylinder is removed and
repaired (repolished or replated).
This process can take anywhere from one-half hour to six
hours.
Understanding the conditions that lead to banding is critical and
could save a printer enormous amounts of money.

We will use a data set that contains observational data on
banding.
The target variable is “Band Occurred?”, whose values are
BAND and NOBAND.
The available predictors (possible causes) for banding consist of
11 categorical variables and 18 continuous variables.
Before proceeding with Partition Analysis it is important to
examine the data for quality issues.
It is especially important to perform a missing value analysis,
because missing values are an important issue in this data set.

[Figure: a portion of the Press Band data set.]
There are 540 records and 39 variables.

The team used JMP’s Distribution Platform to generate the
analysis below, which indicates that banding occurred in 42%
of jobs.
The team would like to understand the root causes of banding.

First perform a missing data pattern analysis.
To investigate data quality, all 30 variables of interest can be
entered into JMP’s distribution platform.
This results in histograms for continuous variables and bar
charts for nominal variables, along with descriptive statistics
for each variable.
The team notes that some continuous variables have unruly
distributions, and that some levels of nominal variables are
sparsely populated.
Some of the nominal variables, such as Cylinder No., have too
many levels to be useful as predictors of Banding.
The presence of many missing values can be a problem for
any statistical analysis.
This is especially true if the missing data patterns are non-random.
In these cases there are often underlying reasons for the missing
values, and these missing data patterns may actually provide
information about the response.
As an example, suppose applicants for credit cards leave blank
the questions related to current debt load and income.
It is quite likely that in this case the missing values indicate a
poor credit risk.
For these scenarios we say the missing values are informative.
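
One simple way to look for informative missing values is to compare the response rate for rows where a predictor is missing against rows where it is present. The sketch below does this for a made-up credit example; the function name and all of the data are hypothetical, and a large gap between the two rates is only a hint that missingness carries information, not a formal test.

```python
def rate_by_missingness(values, responses, event):
    """Return (event rate where the value is missing,
    event rate where it is present). None marks a missing value."""
    missing = [r for v, r in zip(values, responses) if v is None]
    present = [r for v, r in zip(values, responses) if v is not None]

    def rate(rs):
        return rs.count(event) / len(rs) if rs else float("nan")

    return rate(missing), rate(present)

# Hypothetical applicants: income left blank by three of them.
income = [52000, None, 61000, None, 48000, 75000, None, 39000]
risk = ["good", "bad", "good", "bad", "good", "good", "good", "bad"]

miss_rate, present_rate = rate_by_missingness(income, risk, "bad")
```

In this made-up data the bad-risk rate is 2/3 among applicants with missing income and 1/5 among the rest.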
A complete discussion of missing values is beyond our scope;
however, strategies exist to deal with missing values.
One common strategy is referred to as imputation of missing
values.
Imputation simply refers to a process of substituting some value
or values for the missing values.
The theory of imputation can be very complex and can involve
studying the process of missing value generation and even
modeling that process.
The literature on imputation is voluminous and textbooks have
been written on the subject.
However, often a simple imputation strategy will work well.
A couple of relatively simple and very common strategies can be
grouped as:
• Predict the missing values from the non-missing values of the
  predictors or features.
• Replace the missing values with a most typical value such as
  the mean or median, or the most frequent class for a nominal
  variable.
The latter option is the most common and the one we will explore, but
be aware that many other, sometimes complex, alternatives exist.
Basically, this requires adding two columns to a data table: one codes
the values as missing or non-missing and the other contains the
imputed values; JMP does this automatically in some platforms.
The missing/non-missing indicator column tells us whether the
missing values affect the prediction of the target.
We now proceed with the Partition analysis of the data and select the
imputation option in the launch dialog window.
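
The two-column approach described above (a missing/non-missing indicator plus an imputed column) can be sketched as follows for a continuous variable, using the median as the typical value. This is an illustration of the general idea only and is not a description of what JMP does internally; the function name and the ink pct values are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Return (is_missing, imputed) for a continuous column.
    None marks a missing value; missing entries are replaced by the
    median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    is_missing = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return is_missing, imputed

# Hypothetical ink pct readings with two missing values.
ink_pct = [55.0, None, 62.5, 58.0, None]
is_missing, imputed = impute_with_indicator(ink_pct)
```

Both new columns can then be offered as predictors: the imputed column stands in for the original variable, and the indicator lets the model pick up any effect of missingness itself.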

The Press Band Data - Partitioning
Once the Partition report window opens it is a good practice to
select a couple of display options from the main report menu.
Select Show Split Prob and Show Split Count.

To the right is the initial Report window.
Points corresponding to the runs are jittered; runs with banding are
red and lie in the area of the graph beneath the horizontal divider
at 42%.
Blue points (no banding) are shown above the line.
To the right are the candidate variables for our first split.
Note that ink pct and solvent pct have almost the same LogWorth,
with ink pct slightly preferred.
Both of the variables seem to behave similarly with respect to the
response Band?
JMP will select ink pct for the first split with 64.1 as the cut
point.

We split once.
Note that JMP chooses the variable ink pct as the splitting
variable.
Notice that missing values are placed in the left-hand side of the
split.
It appears the missing value pattern for ink pct may be informative
for Band?

The second split is on press speed for ink pct < 64.1.
With ink pct < 64.1 and press speed >= 2189, 90.53% of these runs
have no banding.
Here the missing values are placed on the left split for press
speed.
We should investigate the missing value patterns more closely.

Here is the tree after five splits.
Notice that press shows up in the third split.
But press is an asset number and contains no useful information.
Prune back 3 splits and lock out press so it cannot be selected.
After pruning back the tree, select Lock Columns from the report
menu.
Then select press as the variable to lock out.
Now begin splitting again and press will be ignored.

After 5 splits our new tree does not contain press.

After 8 splits the tree diagram has become too large to easily view;
in the report menu is an option for a Small Tree View that makes the
tree structure easier to see.
Select Small Tree View from the main report menu.

The analyst is usually interested in which potential variables have
the largest effect on the response.
The Column Contributions report (select it in the report menu) ranks
the variables in terms of their ability to classify Band? correctly.
The top 3 variables seem to explain most of the variation in Band?
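
A ranking of this kind can be sketched by crediting each column with the G^2 reduction of every split made on it. The sketch assumes that this is how a column's contribution is defined; it is not taken from the Press Band report. Because the Press Band G^2 values are not reproduced in these notes, the example uses the three Iris splits, with G^2 values copied from the Iris tree reports earlier in this deck. The function name is hypothetical.

```python
from collections import defaultdict

# (split column, parent G^2, G^2 of the two child nodes), copied from
# the Iris tree reports.
splits = [
    ("Petal length", 329.58369, (138.62944, 0.0)),
    ("Petal width", 138.62944, (33.317509, 9.6353844)),
    ("Petal length", 33.317509, (9.7214225, 7.63817)),
]

def column_contributions(splits):
    """ASSUMED definition: a column's contribution is the total G^2
    reduction over the splits made on it. Returns, per column,
    (total G^2, portion of the overall reduction)."""
    totals = defaultdict(float)
    for column, parent, children in splits:
        totals[column] += parent - sum(children)
    grand = sum(totals.values())
    return {col: (g, g / grand) for col, g in totals.items()}

contrib = column_contributions(splits)
```

Under this assumption Petal length accounts for roughly two thirds of the total G^2 reduction in the Iris tree and Petal width for the remainder; the sepal measurements contribute nothing because they were never used in a split.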
As trees get large, they become difficult to read.
JMP provides, in addition to the Small Tree View, a Leaf
Report (main report menu), which gives the rule set (paths)
and a graphic display of the terminal nodes of the tree model
and their classification ability.

In the main report menu JMP offers an important report referred
to as Show Fit Details and this report gives us a synopsis of
how well our model predicts the response.
The Confusion Matrix provides a report on the accuracy of the model
predictions.
Notice that many misclassifications occur.
Our model is inaccurate.
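
What a confusion matrix tabulates can be sketched as a count of (actual, predicted) pairs, with the misclassification rate being the share of rows where the two disagree. The function names and the six runs below are hypothetical and are not the Press Band results.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))

def misclassification_rate(actual, predicted):
    """Proportion of rows where the prediction differs from the truth."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical runs, for illustration only.
actual = ["BAND", "BAND", "NOBAND", "NOBAND", "NOBAND", "BAND"]
predicted = ["BAND", "NOBAND", "NOBAND", "BAND", "NOBAND", "BAND"]

cm = confusion_matrix(actual, predicted)
rate = misclassification_rate(actual, predicted)
```

In this made-up example two of the six runs are misclassified: one BAND run predicted as NOBAND and one NOBAND run predicted as BAND.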

Before concluding that ink pct is a major factor in banding, we
should dig deeper into the missing values for the variable.
For the last 55 consecutive press runs, both ink pct and solvent pct
are missing and all runs had banding.
This makes it questionable whether the data can support any firm
conclusions of cause and effect.

Once you have achieved the best model, usually based upon
model performance on the Validation set, the Prediction
Formula can be saved to the data table (under Save Columns
in the main report menu).
A Partition model formula is saved to the data table that can be
used for future observations added to the data table.
The predicted probabilities of Band or NoBand are simply the
observed proportions of the two conditions that occur in each
of the terminal nodes of the decision tree.
Therefore the prediction for each record is simply the observed
proportions for the tree node that the record falls into.

Summary
Partition methods:
• Assist in data exploration;
• Help with variable reduction;
• Inform variable recoding (grouping levels of categorical
  variables into fewer categories);
• Often allow the building of better models than would
  traditional regression methods;
• Are intuitive and easily understood by Lean 6 Sigma
  project team members;
• Combined with DOE, can greatly enhance project
  success.
Summary
We provide the following guidance to Lean 6 Sigma project
teams:
A Lean 6 Sigma project team should consider using tree-based
methods when any of the following hold:
• There is a large observational data set to explore, either in
  the Define phase, or later in the project;
• The team’s data contains multi-level categorical variables;
• The data is unruly (many outliers, missing data);
• The data may contain complex interactions.

Skills Practice 1
We will use the dataset Equity.JMP to perform a Partition
analysis.
The dataset is found in the JMP Sample Data directory (Help >
Sample Data > Business and Demographic > Equity).
The data consists of information collected on applicants for home
equity loans and the goal is to develop a predictive model to
determine whether an applicant is a Good or Bad loan risk.
The binary response variable Y is BAD where 1 indicates a bad
loan risk and 0 indicates a good risk.
There are 12 potential predictor variables or Xs.
Skills Practice 1
In order to better interpret the response variable BAD, we will
recode the column so that 1 becomes Bad and 0 becomes Good.
Open the Column Info window for BAD and change the data
type to Character.
Next, in order to recode, highlight the BAD column header and
select Cols > Utilities > Recode.
Once in the Recode dialog window change the 1s and 0s to Bad
and Good values respectively.
Finally, after entering the recoded values of BAD, click on Done
and select the option Formula Column and then click on OK
to create the recoded BAD column.
Skills Practice 1
The new recoded BAD (usually labeled BAD 2) column is now
our response column for the Partition modeling.
Before performing an analysis, use the Missing Data Pattern
option in the Tables menu to determine if any columns
contain excessive missing values (say > 90%).
To check the proportion of missing values, from the Missing
Data Pattern table open the Distribution platform and select
the rightmost 13 “variable” columns as Ys.
The Distribution reports for the variables provide the percentage
of missing values.
Do any of the variables appear to have an excessive number of
missing values? Explain.
Skills Practice 1
We are now ready to perform a Partition analysis and remember
the response Y is the recoded BAD column.
Enter all of the 12 potential predictors into the X window.
In the lower left-hand corner specify the Validation Proportion
as 0.30. Also, make certain Informative Missing is checked.
Now perform a Partition Analysis. In developing your model and
determining the number of splits use the Column
Contributions and the Confusion Matrices in Show Fit
Details; also feel free to use any other items from the Partition
report.
In particular compare the Training and Validation reports to
determine when to stop splitting.
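
The effect of a Validation Proportion of 0.30 can be pictured as a random holdout: about 30% of the rows are set aside and used only to check the model, while the tree is grown on the rest. The sketch below is a generic illustration, not a description of how JMP assigns rows; the function name, the seed, and the row count of 1000 are hypothetical.

```python
import random

def holdout_split(n_rows, validation_proportion=0.30, seed=1):
    """Randomly assign row indices to training and validation sets.
    Returns (training_rows, validation_rows), each sorted."""
    rng = random.Random(seed)          # fixed seed so the split repeats
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_val = round(n_rows * validation_proportion)
    return sorted(rows[n_val:]), sorted(rows[:n_val])

# Hypothetical table of 1000 rows.
train_rows, validation_rows = holdout_split(1000)
```

A common stopping rule with such a split is to keep splitting while the validation fit statistics improve and to stop, or prune back, once they level off or get worse even though the training statistics keep improving.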
Skills Practice 1
Answer the following questions based upon your analysis:
1. How many splits were performed?
2. Do one or more of the predictors appear to have informative
missing values for BAD? Explain.
3. How did you determine when to stop splitting?
4. What variables are the most important predictors of BAD? Hint:
Column Contributions is a good source of this information.
5. Based on the Validation fitting details, how well does your
model actually classify the applicants as Good or Bad?
The Confusion Matrix is very helpful.
6. Would you recommend a loan institution use this model to
classify potential customers? Why or why not?
Skills Practice 2
Use the Partition platform to explore the LostCustomers.jmp data.
Recall that Status is the response, and Quote, Time to Delivery
and Part Type are the factors.
Do we gain any additional insights regarding the conditions
under which we are more likely to lose, or not lose, an order?
