
Predictive Modeling with PASW® Modeler


For more information about SPSS Inc., an IBM Company software products, please visit our Web site
at http://www.spss.com or contact:

SPSS Inc., an IBM Company


233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412 USA
Tel: +1-312-651-3000
Fax: +1-312-651-3668

SPSS®, PASW® Modeler, PASW® Modeler Professional, PASW® Statistics, PASW® Data Collection,
PASW® Data Collection Interviewer, PASW® Data Collection Interviewer Web and PASW® Reports
for Surveys are registered trademarks of SPSS Inc., an IBM Company.

Microsoft, Excel, Windows, Fox Pro and SQL Server are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates.

Project phases are based on the CRISP-DM process model. Copyright © 1997–2003 by CRISP-DM
Consortium (http://www.crisp-dm.org).

dBase is a trademark of dataBased Intelligence, Inc.

UNIX is a registered trademark of The Open Group.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks
of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Predictive Modeling with PASW® Modeler


Copyright © 2010 by SPSS Inc., an IBM Company
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior written permission of the publisher.

TABLE OF CONTENTS

LESSON 1: PREPARING DATA FOR MODELING ........................... 1-1


1.1 INTRODUCTION ............................................................................................................. 1-2
1.2 CLEANING DATA .......................................................................................................... 1-3
1.3 BALANCING DATA........................................................................................................ 1-4
1.4 NUMERIC DATA TRANSFORMATIONS .......................................................................... 1-6
1.5 BINNING DATA VALUES ............................................................................................... 1-9
1.6 DATA PARTITIONING .................................................................................................. 1-12
1.7 ANOMALY DETECTION ............................................................................................... 1-14
1.8 FEATURE SELECTION FOR MODELS............................................................................ 1-19
SUMMARY EXERCISES............................................................................................................ 1-24
LESSON 2: DATA REDUCTION: PRINCIPAL COMPONENTS ..... 2-1
2.1 INTRODUCTION ............................................................................................................. 2-1
2.2 USE OF PRINCIPAL COMPONENTS FOR PREDICTION MODELING AND CLUSTER ANALYSES
...................................................................................................................................... 2-1
2.3 WHAT TO LOOK FOR WHEN RUNNING PRINCIPAL COMPONENTS OR FACTOR ANALYSIS
...................................................................................................................................... 2-3
2.4 PRINCIPLES ................................................................................................................... 2-3
2.5 FACTOR ANALYSIS VERSUS PRINCIPAL COMPONENTS ANALYSIS............................... 2-4
2.6 NUMBER OF COMPONENTS ........................................................................................... 2-5
2.7 ROTATIONS ................................................................................................................... 2-5
2.8 COMPONENT SCORES ................................................................................................... 2-6
2.9 SAMPLE SIZE ................................................................................................................ 2-6
2.10 METHODS ..................................................................................................................... 2-6
2.11 OVERALL RECOMMENDATIONS ................................................................................... 2-7
2.12 EXAMPLE: REGRESSION WITH PRINCIPAL COMPONENTS ............................................ 2-7
SUMMARY EXERCISES............................................................................................................ 2-21
LESSON 3: DECISION TREES/RULE INDUCTION.......................... 3-1
3.1 INTRODUCTION ............................................................................................................. 3-1
3.2 COMPARISON OF DECISION TREE MODELS .................................................................. 3-2
3.3 USING THE C5.0 NODE ................................................................................................. 3-3
3.4 VIEWING THE MODEL................................................................................................... 3-8
3.5 GENERATING AND BROWSING A RULE SET................................................................ 3-15
3.6 UNDERSTANDING THE RULE AND DETERMINING ACCURACY ................................... 3-18
3.7 UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION ......................... 3-27
3.8 FURTHER TOPICS ON C5.0 MODELING ....................................................................... 3-28
3.9 MODELING CATEGORICAL OUTPUTS WITH OTHER DECISION TREE ALGORITHMS ... 3-33
3.10 MODELING CATEGORICAL OUTPUTS WITH CHAID .................................................. 3-33
3.11 MODELING CATEGORICAL OUTPUTS WITH C&R TREE ............................................. 3-40
3.12 MODELING CATEGORICAL OUTPUTS WITH QUEST .................................................. 3-46
3.13 PREDICTING CONTINUOUS FIELDS ............................................................................. 3-49
SUMMARY EXERCISES............................................................................................................ 3-57


LESSON 4: NEURAL NETWORKS ....................................................... 4-1


4.1 INTRODUCTION TO NEURAL NETWORKS ......................................................................4-1
4.2 TRAINING METHODS .....................................................................................................4-2
4.3 THE MULTI-LAYER PERCEPTRON .................................................................................4-3
4.4 THE RADIAL BASIS FUNCTION .....................................................................................4-4
4.5 WHICH METHOD? .........................................................................................................4-5
4.6 THE NEURAL NETWORK NODE .....................................................................................4-6
4.7 MODELS PALETTE .......................................................................................................4-15
4.8 THE NEURAL NET MODEL ..........................................................................................4-16
4.9 VALIDATING THE LIST OF PREDICTORS ......................................................................4-23
4.10 UNDERSTANDING THE NEURAL NETWORK ................................................................4-25
4.11 UNDERSTANDING THE REASONING BEHIND THE PREDICTIONS ..................................4-28
4.12 MODEL SUMMARY ......................................................................................................4-31
4.13 BOOSTING AND BAGGING MODELS ............................................................................4-31
4.14 MODEL BOOSTING WITH NEURAL NET.......................................................................4-32
4.15 MODEL BAGGING WITH NEURAL NET ........................................................................4-41
SUMMARY EXERCISES ............................................................................................................4-47
LESSON 5: SUPPORT VECTOR MACHINES .................................... 5-1
5.1 INTRODUCTION .............................................................................................................5-1
5.2 THE STRUCTURE OF SVM MODELS ..............................................................................5-1
5.3 SVM MODEL TO PREDICT CHURN ................................................................................5-5
5.4 EXPLORING THE MODEL .............................................................................................5-14
5.5 A MODEL WITH A DIFFERENT KERNEL FUNCTION .....................................................5-17
5.6 TUNING THE RBF MODEL ...........................................................................................5-20
SUMMARY EXERCISES ............................................................................................................5-23
LESSON 6: LINEAR REGRESSION...................................................... 6-1
6.1 INTRODUCTION .............................................................................................................6-1
6.2 BASIC CONCEPTS OF REGRESSION ................................................................................6-2
6.3 AN EXAMPLE: ERROR OR FRAUD DETECTION IN CLAIMS ............................................6-4
6.4 USING LINEAR MODELS NODE TO PERFORM REGRESSION ........................................6-15
SUMMARY EXERCISES ............................................................................................................6-26
LESSON 7: COX REGRESSION FOR SURVIVAL DATA ................ 7-1
7.1 INTRODUCTION .............................................................................................................7-1
7.2 WHAT IS SURVIVAL ANALYSIS? ...................................................................................7-2
7.3 COX REGRESSION .........................................................................................................7-5
7.4 COX REGRESSION TO PREDICT CHURN .........................................................................7-6
7.5 CHECKING THE PROPORTIONAL HAZARDS ASSUMPTION ...........................................7-19
7.6 PREDICTIONS FROM A COX MODEL ............................................................................7-22
SUMMARY EXERCISES ............................................................................................................7-36


LESSON 8: TIME SERIES ANALYSIS ................................................. 8-1


8.1 INTRODUCTION ............................................................................................................. 8-1
8.2 WHAT IS A TIME SERIES? ............................................................................................. 8-3
8.3 A TIME SERIES DATA FILE ........................................................................................... 8-5
8.4 TREND, SEASONAL AND CYCLIC COMPONENTS........................................................... 8-7
8.5 WHAT IS A TIME SERIES MODEL? .............................................................................. 8-10
8.6 INTERVENTIONS ......................................................................................................... 8-11
8.7 EXPONENTIAL SMOOTHING ........................................................................................ 8-12
8.8 ARIMA ...................................................................................................................... 8-13
8.9 DATA REQUIREMENTS................................................................................................ 8-15
8.10 AUTOMATIC FORECASTING IN A PRODUCTION SETTING ........................................... 8-16
8.11 FORECASTING BROADBAND USAGE IN SEVERAL MARKETS ..................................... 8-16
8.12 APPLYING MODELS TO SEVERAL SERIES ................................................................... 8-40
SUMMARY EXERCISES............................................................................................................ 8-47
LESSON 9: LOGISTIC REGRESSION ................................................. 9-1
9.1 INTRODUCTION TO LOGISTIC REGRESSION .................................................................. 9-1
9.2 A MULTINOMIAL LOGISTIC ANALYSIS: PREDICTING CREDIT RISK ............................. 9-4
9.3 INTERPRETING COEFFICIENTS .................................................................................... 9-13
SUMMARY EXERCISES............................................................................................................ 9-19
LESSON 10: DISCRIMINANT ANALYSIS ........................................ 10-1
10.1 INTRODUCTION ........................................................................................................... 10-1
10.2 HOW DOES DISCRIMINANT ANALYSIS WORK? .......................................................... 10-2
10.3 THE DISCRIMINANT MODEL ....................................................................................... 10-3
10.4 HOW CASES ARE CLASSIFIED .................................................................................... 10-3
10.5 ASSUMPTIONS OF DISCRIMINANT ANALYSIS ............................................................. 10-4
10.6 ANALYSIS TIPS ........................................................................................................... 10-5
10.7 COMPARISON OF DISCRIMINANT AND LOGISTIC REGRESSION .................................. 10-5
10.8 AN EXAMPLE: DISCRIMINANT.................................................................................... 10-6
SUMMARY EXERCISES.......................................................................................................... 10-19
LESSON 11: BAYESIAN NETWORKS ............................................... 11-1
11.1 INTRODUCTION ........................................................................................................... 11-1
11.2 THE BASICS OF BAYESIAN NETWORKS ...................................................................... 11-1
11.3 TYPE OF BAYESIAN NETWORKS IN PASW MODELER ............................................... 11-4
11.4 CREATING A BAYES NETWORK MODEL ..................................................................... 11-5
11.5 MODIFYING BAYES NETWORK MODEL SETTINGS ................................................... 11-21
SUMMARY EXERCISES.......................................................................................................... 11-28
LESSON 12: FINDING THE BEST MODEL FOR CATEGORICAL
TARGETS .......................................................................................... 12-1
12.1 INTRODUCTION ........................................................................................................... 12-1
SUMMARY EXERCISES.......................................................................................................... 12-22
LESSON 13: FINDING THE BEST MODEL FOR CONTINUOUS
TARGETS .......................................................................................... 13-1
13.1 INTRODUCTION ........................................................................................................... 13-1
SUMMARY EXERCISES.......................................................................................................... 13-19


LESSON 14: GETTING THE MOST FROM MODELS .................... 14-1


14.1 INTRODUCTION ...........................................................................................................14-1
14.2 COMBINING MODELS WITH THE ENSEMBLE NODE.....................................................14-2
14.3 USING PROPENSITY SCORES .....................................................................................14-11
14.4 META-LEVEL MODELING .........................................................................................14-17
14.5 ERROR MODELING ....................................................................................................14-22
SUMMARY EXERCISES ..........................................................................................................14-30
APPENDIX A : DECISION LIST ........................................................... A-1
INTRODUCTION ........................................................................................................................ A-1
A DECISION LIST MODEL ........................................................................................................ A-1
COMPARISON OF RULE INDUCTION MODELS .......................................................................... A-4
RULE INDUCTION USING DECISION LIST ................................................................................. A-5
UNDERSTANDING THE RULES AND DETERMINING ACCURACY .............................................. A-8
UNDERSTANDING THE MOST IMPORTANT FACTORS IN PREDICTION .................................... A-14
EXPERT OPTIONS FOR DECISION LIST ................................................................................... A-15
INTERACTIVE DECISION LIST ................................................................................................ A-18
SUMMARY EXERCISES ........................................................................................................... A-34


Lesson 1: Preparing Data for Modeling


Overview
• Preparing and cleaning data for modeling
• Balancing data using the Distribution and the Balance nodes
• Transforming the data with the Derive node
• Grouping data with the Binning node
• Partitioning the data into training and testing samples with the Partition node
• Detecting unusual cases with the Anomaly node
• Selecting predictors with the Feature Selection node

Data
In this lesson we use data from a telecommunications company, churn.txt for several examples. The
file contains records for 1477 of the company’s customers who have at one time purchased a mobile
phone. It includes such information as length of time spent on local, long distance and international
calls, the type of billing scheme and a variety of basic demographics, such as age and gender. The
customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers.
We want to use data mining to understand what factors influence whether an individual remains as a
customer or leaves for an alternative company. The data are typical of what is often referred to as a
churn example (hence the file name). We also use a similar data file named rawdata.txt to illustrate
several steps in data preparation, and a third file customer_dbase.sav, also from a telecommunications
firm, to demonstrate how to detect anomalous records and select fields for modeling.

Note about Type Nodes in this Course


Streams presented in this course contain Type nodes, although in most instances the Types tab in the
Source node would serve the same purpose.

PASW Modeler and PASW Modeler Server


By default, PASW Modeler will run in local mode on your desktop machine. If PASW Modeler
Server has been installed, then PASW Modeler can be run in local mode or in distributed (client-
server) mode. In this latter mode, PASW Modeler streams are built on the client machine, but run by
PASW Modeler Server. Since the data files used in this training course are relatively small, we
recommend you run in local mode. However, if you choose to run in distributed mode make sure the
training data are either placed on the machine running PASW Modeler Server or that the drive
containing the data can be mapped from the server. To determine in which mode PASW Modeler is
running on your machine, click Tools…Server Login (from within PASW Modeler) and see whether
the Connection option is set to Local or Network. This dialog is shown below.


Figure 1.1 Server Login Dialog in PASW Modeler

Note Concerning Data for this Course


Data for this course are assumed to be stored in the folder c:\Train\ModelerPredModel. At SPSS
training centers, the data will be located in a folder of that name. Note that if you are running PASW
Modeler in distributed (Server) mode (see note above), then the data should be copied to the server
machine or the directory containing the data should be mapped from the server machine.

1.1 Introduction
Preparing data for modeling can be a lengthy but essential and extremely worthwhile task. If data are
not cleaned and modified/transformed as necessary, it is doubtful that the models you build will be
successful. In this lesson we will introduce a number of techniques that enable such data preparation.

We will begin with a brief discussion concerning the handling of blanks and cleaning of data,
although this is covered in greater detail in the Introduction to PASW Modeler and Data Mining
course.

Following this, we will introduce the concept of data balancing and how it is achieved within PASW
Modeler. A number of data transformations will also be introduced as possible solutions to skewed
data.

We will discuss how to create training and validation samples of the data automatically with the use
of data partitioning.


1.2 Cleaning Data


In most cases, datasets contain problems or errors such as missing information, outliers, and/or
spurious values. Before modeling begins, these problems should be corrected or at least minimized.
The higher the quality of data used in data mining, the more likely it is that predictions or results are
accurate.

PASW Modeler provides a number of ways to handle blank or missing information and several
techniques to detect data irregularities. In this section we will briefly discuss an approach to data
cleaning.

Note: If there is interest the trainer may refer to the stream Dataprep.str located in the
c:\Train\ModelerPredModel directory. This stream contains examples of the techniques detailed in
the following paragraphs.

After the data have been read into PASW Modeler, and if necessary all relevant data sources have
been combined, the first step in data cleaning is to assess the overall quality of the data. This often
involves:

• Using the Types tab of a source node or the Type node to fully instantiate data, usually
achieved by clicking the Read Values button within the source or Type node, or by passing
the data from a Type node into a Table node and allowing PASW Modeler to auto-type.
• Flagging missing values (white space, null and value blanks) as blank definitions within a
source node or the Type node.
• Using the Data Audit node to examine the distribution and summary statistics (minimum,
maximum, mean, standard deviation, number of valid records) for data fields.

Once the condition of the data has been assessed, the next step is to attempt to improve the overall
quality. This can be achieved in a variety of ways:

• Using the Generate menu from the Data Audit node’s report, a Select node that removes
records with blank fields can be automatically created (particularly relevant for a model’s
output field).
• Fields with a high proportion of blank records can be filtered out using the Generate menu
from the Data Audit node’s report to create a Filter node.
• Blanks can be replaced with appropriate values using the Filler node. Possible appropriate
values within a continuous field can range from the average, mode, or median, to a value
predicted using one of the available modeling techniques. In addition, missing values can be
imputed by using the Data Audit node.
• The Type node and Types tab in source nodes provide an automatic checking process that
examines values within a field to determine whether they comply with the current
measurement level and bounds settings. If they do not, fields with out-of-bound values can
either be modified, or those records removed from passing downstream.

After these actions are completed, the data will have been cleaned of blanks and out-of-bounds
values. It may also be necessary to use the Distinct node to remove any duplicate records.
Once the data file has been cleaned, you can then begin to modify it further so that it is suitable for
the modeling technique(s) you plan to use.
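
For readers who like to see the logic spelled out, the sketch below mirrors these cleaning steps in
plain Python/pandas rather than in PASW Modeler; the file name, column names, and thresholds are
illustrative assumptions, not prescriptions.

```python
import pandas as pd

# Illustrative sketch only -- these are not PASW Modeler operations.
df = pd.read_csv("rawdata.txt")                      # assumed comma-delimited file

# Audit the data: distributions and summary statistics (Data Audit node analogue)
print(df.describe(include="all"))

# Remove records with a blank target field (generated Select node analogue)
df = df.dropna(subset=["CHURNED"])                   # assumed target field name

# Filter out fields with a high proportion of blanks (generated Filter node analogue)
sparse_fields = df.columns[df.isna().mean() > 0.70]
df = df.drop(columns=sparse_fields)

# Replace remaining blanks in a continuous field with its median (Filler node analogue)
df["LOCAL"] = df["LOCAL"].fillna(df["LOCAL"].median())

# Remove duplicate records (Distinct node analogue)
df = df.drop_duplicates()
```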


1.3 Balancing Data


Once the data have been cleaned you should examine the distribution of the key fields you will be
using in modeling, including the output field (if you are creating a predictive model). This is achieved
most easily using the Data Audit node, but either the Distribution node (for categorical data), the
Histogram node (for continuous data), or the Graphboard node (for either type) will produce charts
for single fields.

If the distribution of a categorical target field is heavily skewed in favor of one of the categories, you
may encounter problems when generating predictive models. For example, if only 3% of a mailing
database have responded to a campaign, a neural network trained on this data might try to classify
every individual as a non-responder to achieve 97% accuracy—great but not very useful!

One solution to overcome this problem is to balance the data, which will overweight the less frequent
categories. This can be accomplished with the Balance node, which works by either reducing the
number of records in the more frequent categories, or boosting the records in the less frequent
categories. It can be automatically generated from the distribution and histogram displays.

When balancing data we recommend using the reduce option in preference to the boosting option.
The latter duplicates records and thus magnifies problems and irregularities, since a relatively small
number of cases can become heavily weighted. However, when working with small datasets, reducing
the data is often not feasible and boosting is the only sensible solution to imbalances within the data.
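
As a rough sketch of what reducing and boosting amount to, the following Python function
downsamples or duplicates records by category; it is an illustration written outside PASW Modeler,
and the field name and seed are assumptions.

```python
import pandas as pd

# Illustrative sketch only: balance a categorical target by reducing or boosting.
def balance(df, target, boost=False, seed=123):
    counts = df[target].value_counts()
    size = counts.max() if boost else counts.min()
    parts = []
    for value in counts.index:
        group = df[df[target] == value]
        # Reducing samples without replacement; boosting duplicates records.
        parts.append(group.sample(n=size, replace=boost, random_state=seed))
    return pd.concat(parts)

# reduced = balance(churn, "CHURNED")               # shrink to the smallest category
# boosted = balance(churn, "CHURNED", boost=True)   # duplicate up to the largest category
```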

Note
A better solution than balancing data at this stage is to sample from the original dataset(s) to create a
training file with a roughly equal number of cases in each category of the output field. The test
datasets should, however, match the unbalanced population proportions for this field to provide a
realistic test of the generated models.

The Partition node makes it easy to create training and validation data partitions from a single data
file, but that node doesn't solve the problem of a skewed distribution for a field, since it cannot
overweight one or more categories.

We will illustrate data balancing by examining the distribution of the field CHURNED within the file
churn.txt. This field records whether the customer is current, a voluntary leaver, or an involuntary
leaver (we attempt to predict this field in the lessons that follow).

Open the stream Cpm1.str (located in c:\Train\ModelerPredModel)


Run the Table node and familiarize yourself with the data
Close the Table window
Connect a Distribution node to the Type node
Edit the Distribution node and set the Field: to CHURNED
Click the Run button


Figure 1.2 Distribution of the CHURNED Field

The proportions of the three groups are rather unequal and data balancing may be useful when trying
to predict this field.

This output can be used directly to create a Balance node, but first we must decide whether we wish
to reduce or boost the current data. Reducing the data will drop over 73% of the records, but boosting
the data will involve duplicating the involuntary leavers from 132 records to over 830. Neither of
these methods is ideal but in this case we choose to reduce the data to eliminate the magnification of
errors.

Click Generate…Balance Node (reduce)


Close the Distribution plot window

A generated Balance node will appear in the Stream Canvas.

Drag the Balance node to the right of the Type node and connect it between the Type and
Distribution nodes
Run the stream from the Distribution node

Figure 1.3 Distribution of the CHURNED Field after Balancing the Data

When balancing data it is advisable to enable a data cache on the Balance node to freeze the selected
sample. This is because the Balance node randomly reduces or boosts the data, so a different sample
will be selected each time the data are passed through the node.


At this point the data are balanced and can be passed into a modeling node, such as the Neural Net
node. Once the model has been built, it is important that the testing and assessment of the model
should be done based on the unbalanced data.

Close the Distribution plot window

1.4 Numeric Data Transformations


When working with numeric data, the act of data balancing, as detailed above, is a rather drastic
solution to the problem of skewed data and usually isn’t appropriate. There are a variety of numerical
transformations that provide a more sensible approach to this problem and that result in a flat or
flatter distribution.

The Derive node can be used to produce such transformed fields within PASW Modeler. To
determine which transformation is appropriate, we need to view the data using a histogram. We’ll use
the field LOCAL in this example, which measures the number of minutes of local calls per month.

Add a Histogram node to the stream


Connect the Histogram node to the Type node
Edit the Histogram node and select LOCAL in the Field list (not shown)
Run the node

Figure 1.4 Histogram of the LOCAL Field

This distribution has a strong positive skewness. This condition may lead to poor performance of a
neural network predicting LOCAL since there is less information (fewer records) on those individuals
with higher local usage. What we need is a transformation that inverts the original skewness, that is,
skews it to the left. If we get the transformation correct, the data will become relatively balanced.
When you transform data you normally try to create a normal distribution or a uniform (flat)
distribution.


For our problem, the distribution of LOCAL closely follows that of a negative exponential, e^(-x), so the
inverse is a logarithmic function. We will therefore try a transformation of the form ln(x + a), where a
is a constant and x is the field to be transformed. We need to add a small constant because some of the
records have values of 0 for LOCAL, and the log of 0 is undefined. Typically the value of a would be
the smallest actual positive value in the data.

Close the Histogram window


Add a Derive node from the Field Ops palette and connect the Type node to it
Edit the Derive node and set the Derive Field name to LOGLOCAL
Select Formula in the Derive As list
Enter log(LOCAL + 3) in the Formula text box (or use the Expression Builder)
Click on OK

Figure 1.5 Derive Node to Create LOGLOCAL

Connect the Derive node to the existing Histogram node


Edit the Histogram node and set the Field to LOGLOCAL
Click the Run button


Figure 1.6 Histogram of the Transformed LOCAL Field Using a Logarithmic Function

Although this distribution is not perfectly normal it is a great improvement on the distribution of the
original field.

Close the Histogram window

The above is a simple example of a transformation that can be used. Table 1.1 gives a number of
other possible transformations you may wish to try when transforming data, together with their
CLEM expression.

Table 1.1 Possible Numerical Transformations


Transformation              CLEM Expression
e^x                         Exp(x)   (x is the name of the field to be transformed)
ln(x + a)                   Log(x + a)   (a is a numerical constant)
ln((x - a) / (b - x))       Log((x - a) / (b - x))   (a and b are numerical constants)
log10(x + a)                Log10(x + a)
sqrt(x)                     Sqrt(x)
1 / e^(mean(x) - x)         1 / exp(@GLOBAL_AVE(x) - x)   (@GLOBAL_AVE is the average of the
                            field x, set using the Set Globals node in the Output palette)

Note
Because the original field LOCAL has been transformed, predictions from a model will be made in the
log of that field. To transform back to the original scale, you raise the base of the logarithm to the predicted value:
10 for the standard log, or e for the natural log (e.g., 10^(predicted value)). So, for example, if the
model predicts a value of 1.4 for LOGLOCAL, that is actually 10^1.4, or 25.12, for LOCAL (or LOCAL +
constant).
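
To make the arithmetic concrete, here is a small Python sketch of the transformation and the
back-transformation; it assumes a base-10 log, matching the example in the note (if the natural log is
used instead, substitute np.log and e).

```python
import numpy as np
import pandas as pd

# Illustrative sketch only: log-transform a skewed field and back-transform a prediction.
churn = pd.read_csv("churn.txt")                    # comma-delimited churn data

# Forward transformation, as in the Derive node: log of (LOCAL + 3)
churn["LOGLOCAL"] = np.log10(churn["LOCAL"] + 3)

# Back-transformation of a prediction made on the log scale
predicted_log = 1.4
predicted_local = 10 ** predicted_log - 3           # 10^1.4 is about 25.12; subtract the constant
print(predicted_local)
```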

1.5 Binning Data Values


Another method of transforming a continuous field involves modifying it to create a new categorical
field (flag, nominal, ordinal) based on the original field’s values. For example, you might wish to
group age into a new field based on fixed width categories of 5 or 10 years. Or, you might wish to
transform income into a new field based on the percentiles (based on either the count or sum) of
income (e.g., quartiles, deciles).

This operation is labeled binning in PASW Modeler, since it takes a range of data values and
collapses them into one bin where they are all given the same data value. It is certainly true that
binning data loses some information compared to the original distribution. On the other hand, you
often gain in clarity, and binning can overcome some data distribution problems, including skewness.
Moreover, oftentimes there is interest in looking at the effect of a predictor at natural cutpoints (e.g.,
one standard deviation above the mean). In addition, when performing data understanding, it might be
easier to view the relationship between two or more continuous fields if at least one is binned.

Binning can be performed with bins based on fixed widths, percentiles, the mean and standard
deviation, or ranks.
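
A minimal Python analogue of tile (equal count) binning is shown below for orientation; pd.qcut
with five quantiles plays the role that the Binning node plays in the example that follows. It is a
sketch, not Modeler code.

```python
import pandas as pd

# Illustrative sketch only: quintile (equal count) binning of a continuous field.
churn = pd.read_csv("churn.txt")

# Five bins with approximately equal record counts; retbins=True returns the thresholds
codes, thresholds = pd.qcut(churn["LOCAL"], q=5, labels=False, retbins=True)
churn["LOCAL_TILE5"] = codes + 1                     # label the bins 1 through 5

print(thresholds)                                    # bin edges, analogous to the Bin Values tab
print(churn["LOCAL_TILE5"].value_counts())
```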

We can use the original field LOCAL to show an example of binning. We know this field is highly
positively skewed, and it has many distinct values. Let’s group the values into five bins by requesting
binning by quintiles, and then examine the relationship of the binned field to CHURNED. The
Binning node is located in the Field Ops palette.

Add a Binning node to the stream near the Type node


Connect the Type node to the Binning node
Edit the Binning node and set the Bin fields to LOCAL
Click the Binning method dropdown and select Tiles (equal count) method
Click the Quintile (5) check box

By default, a new field will be created from the original field name with the suffix _TILEN, where N
stands for the number of bins to be created (here five). Percentiles can be based on the record count
(in ascending order of the value of the bin field, which is the standard definition of percentiles), or on
the sum of the field.


Figure 1.7 Completed Binning Node to Group LOCAL by Quintiles

The Bin Values tab allows you to view the bins that have been created and their upper and lower
limits. However, understandably, information on generated bins is not available until the node has
been run in order to allow the thresholds to be determined.

Click OK

To study the relationship between binned LOCAL (LOCAL_TILE5) and CHURNED, we could use a
Matrix node, since both fields are categorical, but we can also use a Distribution node, which will be
our choice here.

Add a Distribution node to the stream and attach it to the Binning node
Edit the Distribution node and select LOCAL_TILE5 as the Field
Select CHURNED as the Overlay field
Click Normalize by color checkbox (not shown)
Click Run


Figure 1.8 Distribution of CHURNED by Binned LOCAL

There is an interesting pattern apparent. Essentially all the involuntary churners are in the first
quintile of LOCAL_TILE5 (notice how the number of cases in each category is almost exactly the
same). Perhaps we got lucky when specifying quintiles as the binning technique, but we have found a
clear pattern that might not have been evident if LOCAL had not been binned.

We would next wish to know what the bounds are on the first quintile, and to see that we need to edit
the Binning node.

Close the Distribution plot window


Edit the Binning node for LOCAL
Click the Bin Values tab
Select 5 from the Tile: menu


Figure 1.9 Bin Thresholds for LOCAL

We observe that the upper bound for Bin 1 is 10.38 minutes. That means that the involuntary churners
essentially all made less than 10.38 minutes of local calls, since they all fall into this bin (quintile).

Given this finding, we might decide to use the binned version of LOCAL in modeling, or try two
models, one with the original field and then one with the binned version.

1.6 Data Partitioning


Models that you build (train) must be assessed with separate testing data that was not used to create
the model. The training and testing data should be created randomly from the original data file. They
can be created with either a Derive or Sample node, but the Partition node allows greater flexibility.

With the Partition node, PASW Modeler has the capability to directly create a field that can split
records between training, testing (and validation) data files. Partition nodes generate a partition field
that splits the data into separate subsets or samples for the training and testing stages of model
building. When using all three subsets, the model is built with the training data, refined with the
testing data, and then tested with the validation data.

The Partition node creates a categorical field with the role automatically set to Partition. The set field
will either have two values (corresponding to the training and testing files), or three values (training,
testing, and validation).

PASW Modeler model nodes have an option to enable partitioning, and they will recognize a field
with role “partition” automatically (as will the Evaluation node). When a generated model is created,
predictions will be made for records in the testing (and validation) samples, in addition to the training
records. Because of this capability, the use of the Partition node makes model assessment more
efficient.
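
The sketch below, written in Python rather than Modeler, shows the essential idea of a partition
field: a reproducible random 70/30 split controlled by a seed (the sizes and seed match the example
that follows, and the partition labels are only illustrative).

```python
import numpy as np
import pandas as pd

# Illustrative sketch only: derive a training/testing partition field.
churn = pd.read_csv("churn.txt")

rng = np.random.default_rng(seed=999)                # fixed seed makes the split repeatable
churn["Partition"] = np.where(rng.random(len(churn)) < 0.70,
                              "1_Training", "2_Testing")

train = churn[churn["Partition"] == "1_Training"]
test = churn[churn["Partition"] == "2_Testing"]
print(churn["Partition"].value_counts(normalize=True))   # close to 70/30
```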


To illustrate the use of data partitioning, we will create a partition field for the churn data with two
values, for training and testing. Although the Partition node assists in selecting records for training
and testing, its output is a new field, and so it can be found in the Field Ops Palette.

Add a Partition node to the stream and connect the Type node to it
Edit the Partition node

The name of the partition field is specified in the Partition field text box. The Partitions choice
allows you to create a new field with either 2 or 3 values, depending on whether you wish to create 2
or 3 data samples.

The size of the files is specified in the partition size text boxes. Size is relative and given in percentages
(which do not have to add to 100%). If the sum of the partition sizes is less than 100%, the records
not (randomly) included in a partition will be discarded.

The Generate menu allows you to create Select nodes that will select records in the training, testing,
and validation samples.

We’ll change the size of the training and testing partitions, and input a random seed so our results are
comparable.

Figure 1.10 Partition Node Settings


Change the Training partition size: to 70


Change the Testing partition size: to 30
Change the Seed value to 999 (not shown)
Click OK
Attach a Distribution node to the Partition node
Edit the Distribution node and select Partition in the Field list
Run the Distribution node

Figure 1.11 Distribution of the Partition Field

The new field Partition has close to a 70/30 distribution. It can now be directly used in modeling as
described above, or separate files can be created with use of the Select node. We will use the partition
field in a later lesson, so we’ll save the stream for use in later lessons.

Close the Distribution window


Click on File…Save Stream As
Save the stream with the name Lesson1_Partition

1.7 Anomaly Detection


Data mining usually involves very large data files, sometimes with millions of records. In such
situations, we may not be concerned about whether some records are odd or unusual based on how
they compare to the bulk of records in the file. Odd cases, unless they are relatively frequent (and
then they can hardly be labeled “unusual”), will not cause problems to most algorithms when we try
to predict some outcome.

For analysts with smaller data files, though, anomalous records can be a concern, as they can distort
the outcomes of a modeling process. The most salient example of this comes from classical statistics,
where regression, and other methods that fall under the rubric of the General Linear Model, can be
strongly affected by outliers and deviant points.

PASW Modeler includes an Anomaly node that searches for unusual cases in an automatic manner.
Anomaly detection is an exploratory method designed for the quick detection of unusual cases or
records that should be candidates for further analysis. These should be regarded as suspected
anomalies, which, on closer examination, may or may not turn out to be real concerns. You may find
that a record is perfectly valid but choose to screen it from the data for purposes of model building.
Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error in the
data collection process.


The procedure is based on clustering the data using a set of user-specified fields. A case that is
deviant compared to the norms (distributions) of all the cases in that cluster is deemed anomalous.
The procedure helps you quickly detect unusual cases during data exploration before you begin
modeling. It is important to note that the definition of an anomalous case is statistical and not
particular to any specific industry or application, such as fraud in the finance or insurance industry
(although it is possible that the technique might find such cases).

Clustering is done using the TwoStep cluster routine (also available in the TwoStep node). In addition
to clustering, the Anomaly node scores each case to identify its cluster group, creates an anomaly index
that measures how unusual the case is, and identifies which fields contribute most to the anomalous
nature of the case.
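
For intuition, here is a hedged Python sketch of cluster-based anomaly scoring. TwoStep clustering
is not available in this sketch, so k-means stands in for it, and the file name, field list, and
distance-based index are illustrative assumptions rather than the Anomaly node's actual computation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative sketch only: cluster the data, then score each record by how far it
# sits from its own cluster's center, in the spirit of the Anomaly node.
data = pd.read_csv("customer_dbase.csv")                   # assumed export of the .sav file
fields = ["longmon", "tollmon", "cardmon", "wiremon"]      # assumed account fields

X = StandardScaler().fit_transform(data[fields])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centers = kmeans.cluster_centers_[kmeans.labels_]          # each record's own cluster center
data["peer_group"] = kmeans.labels_
data["anomaly_index"] = np.sqrt(((X - centers) ** 2).sum(axis=1))

# Flag roughly the most deviant 1% of records, echoing the default cutoff
cutoff = data["anomaly_index"].quantile(0.99)
data["anomaly_flag"] = data["anomaly_index"] >= cutoff
```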

We’ll use a new data file to demonstrate the Anomaly node’s operation. The file,
customer_dbase.sav, is a richer data file that is also from a telecommunications company. It has an
outcome field churn which measures whether a customer switched providers in the last month. There
is no target field for anomaly detection, but in most instances you will want to use the same set of
fields in the Anomaly node that you plan to use for modeling. There is an existing stream file we can
use for this example. The Anomaly node is found in the Modeling palette since it uses the TwoStep
clustering routine.

Click File…Open Stream


Double-click on Anomaly_FeatureSelect.str in the c:\Train\ModelerPredModel directory
Run the Table node and view the data
Close the Table window
Place an Anomaly node in the stream and connect it to the Type node
Edit the Anomaly node, and then click the Fields tab

Figure 1.12 Anomaly Node Fields Tab

You will typically specify exactly which fields should be used to search for anomalous cases. In these
data, there are several fields that measure various aspects of the customer’s account, and we want to

use all these here (there are also demographic fields, but in the interests of keeping this example
relatively simple, we will restrict somewhat the number and type of fields used).

Click Use custom settings button


Click the Field chooser button, and select all the fields from longmon to ebill (they are
contiguous)
Click OK
Click the Model tab

Figure 1.13 Anomaly Node Model Settings

By default, the procedure will use a cutoff value that flags 1% of the records in the data. The cutoff is
included as a parameter in the model being built, so this option determines how the cutoff value is set
for modeling but not the actual percentage of records to be flagged during scoring. Actual scoring
results may vary depending on the data.

The Number of anomaly fields to report specifies the number of fields to report as an indication of
why a particular record is flagged as an anomaly. The most anomalous fields are defined as those that
show the greatest deviation from the field norm for the cluster to which the record is assigned.

We’ll use the defaults for this example.

Click Run
Right-click on the Anomaly model in the Models Manager, and select Browse
Click Expand All button


Figure 1.14 Browsing Anomaly Generated Model Results

We see that three clusters (labeled “Peer Groups”) were created automatically (although we didn’t
view the Expert options, the default number of clusters to be created is set between 1 and 15). In the
first cluster there are 1267 records, and 18 have been flagged as anomalies (about 1.4%, close to the
1% cutoff value). The Model browser window doesn’t tell us which cases are anomalous in this
cluster, but it does provide a list of fields that contributed to defining one or more cases as anomalous.
Of the 18 records identified by the procedure, 16 are anomalous on the field lnwireten (the log of
wireless usage over tenure in months [time as a customer]). This was a derived field created earlier in
the data exploration process. The average contribution to the anomaly index from lnwireten is .275.
This value should be used in a relative sense in comparison to the other fields.

To see information for specific records we use the generated Anomaly model on the stream canvas.
We will sort the records by the $O-AnomalyIndex field, which contains the index values.

Add a Sort node from the Record Ops palette to the stream and connect the Anomaly
generated model node to the Sort node
Edit the Sort node and select the field $O-AnomalyIndex as the sort field
Change the Sort Order to Descending


Figure 1.15 Sorting Records by Anomaly Index

Click OK
Connect a Table node to the Sort node
Run the Table node

Figure 1.16 Records Sorted by Anomaly Index with Fields Generated by Anomaly Model

For each record, the model creates 9 new fields. The field $O-PeerGroup contains the cluster
membership. The next six fields contain the top three fields that contributed to this record being an
anomaly and the contribution of that field to the anomaly index (we can request fewer or more fields
on which to report in the Anomaly node Model tab). Thus we see that the three most anomalous
cases, with an anomaly index of 5.0, all are in cluster 2. The first two of these are most deviant on
longmon and longten.
Knowing which fields made the greatest contribution to the anomaly index allows you to more easily
review the data values for these cases. You don't need to look at all the fields, but instead can
concentrate on specific fields detected by the model for that case. In the interests of time, we won’t
take this next step here, but you might want to try this in the exercises.

What we can briefly show are the options available when an Anomaly generated model is added to
the stream.

Close the Table window


Edit the Anomaly generated model node in the stream
Click on the Settings tab

Figure 1.17 Settings Tab Options for Anomaly Generated Models

Note in particular that in large files, there is an option available to discard non-anomalous records,
which will make investigating the anomalous records much easier. Also, you can change the number
of fields on which to report here.

Close the Anomaly model Browser window

1.8 Feature Selection for Models


Just as data files can have many records in data-mining problems, there are often hundreds, or
thousands, of potential fields that can be used as predictors. Although some models can naturally use
many fields—decision trees, for example—others cannot or are inefficient, at best, with too many
fields. As a result, you may have to spend an inordinate amount of time to examine the fields to
decide which ones should be included in a modeling effort.

To shortcut this process and narrow the list of candidate predictors, the Feature Selection node can
identify the fields that are most important—most highly related—to a particular target/outcome
field. Reducing the number of fields required for modeling will allow you to develop models more
quickly, but also permit you to explore the data more efficiently.

Feature selection has three steps:

1) Screening: In this first step, fields are removed that have too much missing data, too little
variation, or too many categories, among other criteria. Also, records are removed with
excessive missing data.
2) Ranking: In the second step, each predictor is paired with the target and an appropriate test of
the bivariate relationship between the two is performed. This can be a crosstabulation (chi-square
test) for categorical fields or a Pearson correlation coefficient if both fields are continuous. The
probability values from these bivariate analyses are turned into an importance measure by
subtracting the p value of the test from 1 (thus a low p value leads to an importance near 1). The
predictors are then ranked on importance (a brief sketch of this step appears after this list).
3) Selecting: In the final step, a subset of predictors is identified to use in modeling. The number
of predictors can be identified automatically by the model, or you can request a specific
number.
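
A hedged sketch of the ranking step is given below: a chi-square test or Pearson correlation produces
a p value, and importance is computed as 1 minus that p value. The function names and the way field
types are detected are illustrative assumptions, not the Feature Selection node's implementation.

```python
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

# Illustrative sketch only: importance = 1 - p from a simple bivariate test.
def importance(df, predictor, target):
    x, y = df[predictor], df[target]
    if pd.api.types.is_numeric_dtype(x) and pd.api.types.is_numeric_dtype(y):
        _, p = pearsonr(x, y)                              # both fields continuous
    else:
        _, p, _, _ = chi2_contingency(pd.crosstab(x, y))   # categorical association
    return 1 - p

# ranked = sorted(((f, importance(data, f, "churn")) for f in predictors),
#                 key=lambda pair: pair[1], reverse=True)
```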

Feature selection is also located in the Modeling palette and creates a generated model node. This
node, though, does not add predictions or other derived fields to the stream. Instead, it acts as a filter
node, removing unnecessary fields downstream (with parameters under user control).

We’ll try feature selection on the customer database file. Note that although we are using feature
selection after demonstrating anomaly detection, you may want to use these two in combination. For
example, you can first use feature selection to identify important fields. Then you can use anomaly
detection to find unusual cases on only those fields.

Add a Feature Selection node to the stream and connect it to the Type node
Edit the Feature Selection node and click the Fields tab
Click the Use custom settings button
Select churn as the Target field (not shown)
Select all the fields from region to news (near the bottom) as Inputs (be careful not to select
churn again)
Click the Model tab


Figure 1.18 Model Tab for Feature Selection to Predict Churn

By default fields will initially be screened based on the various criteria listed in the Model tab. A field
can have no more than 70% missing data (which is rather generous, and you may wish to modify this
value). There can be no more than 90% of the records with the same value, and the minimum
coefficient of variation (standard deviation/mean) is 0.1. All of these are fairly liberal standards.
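
The screening rules above are simple enough to state as code; the following sketch applies the three
default thresholds to a single field. It is an illustration of the rules as described, not the node's
actual implementation.

```python
import pandas as pd

# Illustrative sketch only: the default screening thresholds described above.
def passes_screening(series):
    if series.isna().mean() > 0.70:                          # more than 70% missing
        return False
    if series.value_counts(normalize=True).max() > 0.90:     # one value in over 90% of records
        return False
    if pd.api.types.is_numeric_dtype(series):
        mean = series.mean()
        if mean != 0 and abs(series.std() / mean) < 0.1:     # coefficient of variation below 0.1
            return False
    return True

# kept = [f for f in predictors if passes_screening(data[f])]
```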

Click the Options tab

Figure 1.19 Options for Feature Selection


After being ranked, fields are selected based on importance, and only those deemed Important are
retained in the model. This can be changed to select the top N fields by ranking of importance, or
by selecting all fields that meet a minimum level of importance. Four options are available for
determining the importance of categorical predictors, with the default being the Pearson chi-square
value.

We will use all default settings for these data.

Click Run
Right-click on the churn Feature Selection generated model and select Browse

Figure 1.20 Feature Selection Browser Window


We selected 127 potential predictors. Seven were rejected in the screening stage because of too much
missing data or too little variation. Of the remaining 120 fields, the model selected 63 as being
important, so it has reduced our tasks of data review and model building considerably. The model
ranked the fields by importance (importance is rounded off to a maximum value of 1.000). If you
scroll down the list of fields in the upper pane, you will eventually see fields with low values of
importance that are unrelated to churn. All fields with their box checked will be passed downstream if
this node is added to a data stream.

The set of important fields includes a mix, with some demographic (age, employ), account-related
(tenure, ebill), and financial status (cardtenure) types.

From here, the generated Feature Selection model in the stream will filter out the unimportant fields.

Note
When using the Feature Selection node, it is important to understand its limitations. First, importance
of a relationship is not the same thing as the strength of a relationship. In data mining, the large data
files used allow very weak relationships to be statistically significant. So just because a field has an
importance value near 1 does not guarantee that it will be a good predictor of some target field.
Second, nonlinear relationships will not necessarily be detected by the tests used in the Feature
Selection node, so a field could be rejected yet have the potential of being a good predictor (this is
especially true for continuous predictors).


Summary Exercises
A Note Concerning Data Files
In this training guide files are assumed to be located in the c:\Train\ModelerPredModel directory.

The exercises in this lesson use the data file churn.txt. The following table provides details about the
file.

churn.txt contains information from a telecommunications company. The data are comprised of
customers who at some point have purchased a mobile phone. The primary interest of the company is
to understand which customers will remain with the organization or leave for another company.

The file contains the following fields:

ID Customer reference number
LONGDIST Time spent on long distance calls per month
International Time spent on international calls per month
LOCAL Time spent on local calls per month
DROPPED Number of dropped calls
PAY_MTHD Payment method of the monthly telephone bill
LocalBillType Tariff for locally based calls
LongDistanceBillType Tariff for long distance calls
AGE Age
SEX Gender
STATUS Marital status
CHILDREN Number of Children
Est_Income Estimated income
Car_Owner Car owner
CHURNED (3 categories):
Current – Still with company
Vol – Leavers who the company wants to keep
Invol – Leavers who the company doesn’t want

In these exercises we will perform some exploratory analysis on the Churn.txt data file and prepare
these data so that they are ready for modeling.

1. Read the file c:\Train\ModelerPredModel\Churn.txt—this file is comma delimited and includes
field names—using a Var. File node. Browse the data and familiarize yourself with the data
structure within each field.

2. Check to see if there are blanks (missing values) within the data; if you find any problems,
decide how you wish to deal with these and take appropriate steps.

3. Look at the distribution of the CHURNED field. This field probably requires balancing. Try
“boosting” the data to balance the field, since we used reducing in the lesson.


4. If you think that both of these methods are too harsh (either in terms of duplicating data too
much or reducing data so there are too few cases), edit the balance node and see if you can
find a way of reducing the impact of balancing.

5. If you are going to use this data for modeling, do you wish to cache this node?

6. Use the Data Audit node to look at the distribution of some of the fields that will be used as
inputs. Does the distribution of these fields appear appropriate? If not, try to find a
transformation that may help the modeling process. (Note: The instructor may have already
spoken about the field LOCAL—you may want to transform this field, as discussed in the
lesson).

7. Look at the field International. Do you think this field will need transforming or binning?
Can you find a transformation that helps with this field? If not, why do you think this is?

8. Think about whether there are potentially any other fields that could be derived from existing
data that may help out with the modeling process. If so, create those fields.

9. Try using the Anomaly node on these data to detect unusual records. Don’t use the field
CHURNED. Do you find any commonalities among most of the anomalous records?

10. If you have made any data transformations, balanced the data, or derived any fields, you may
want to create a Supernode that reduces the size of your current stream.

11. Save your stream as Exer1.str.

12. For those with extra time. Use the Anomaly node to detect anomalous cases in the
customer_dbase.sav file, as we did in the lesson. Then add the generated Anomaly node to
the stream and investigate these unusual cases in more detail. Would you retain them for
modeling, or not? Why?

13. For those with more extra time. Use the Data Audit node or other methods to search for
outlier data values on continuous fields. If you find some, what might be done to reduce their
impact on modeling?


Lesson 2: Data Reduction: Principal Components
Objectives
• Review principal components analysis, a technique used to perform data reduction prior to
modeling
• Run a principal components analysis on a dataset of waste production

Data
We use a file containing information about the amount of solid waste in thousands of tons (WASTE)
in various locations along with information about land use, including number of acres used for
industrial work (INDUST), fabricated metals (METALS), trucking and wholesale trade (TRUCKS),
retail trade (RETAIL), and restaurants and hotels (RESTRNTS). The data set appears in Chatterjee and
Hadi (1988, Sensitivity Analysis in Linear Regression. New York: Wiley).

2.1 Introduction
Although it is used as an analysis technique in its own right, in this lesson we discuss principal
components primarily as a data reduction technique in support of statistical predictive modeling (for
example, regression or logistic regression) and clustering.

We first review the role of principal components and factor analysis in segmentation and prediction
studies, and then discuss what to look for when running these techniques. Some background
principles will be covered along with comments about popular factor methods. We provide some
overall recommendations. We will perform a principal components analysis on a set of fields
recording different types of land usage, all of which are to be used to predict the amount of waste
produced from that land.

2.2 Use of Principal Components for Prediction Modeling and Cluster Analyses
In the areas of segmentation and prediction, principal components and factor analysis typically serve
in the ancillary role of reducing the many fields available to a core set of composite fields
(components or factors) that are used by cluster, regression or logistic regression. These techniques,
though, can also be used with other data mining methods, including neural networks and Bayesian
networks.

Statistical prediction models such as regression, logistic regression, and discriminant analysis, when
run with highly correlated input fields, can produce unstable coefficient estimates (the problem of near
multicollinearity). In these models, if any input field can be almost or perfectly predicted from a
linear combination of the other inputs (near or pure multicollinearity), the estimation will either fail or
be badly in error. Prior data reduction using factor or principal components analysis is one approach
to reducing this risk.

Although we have described this problem in the context of statistical prediction models, neural
network coefficients can also become unstable under these circumstances. However, since the
interpretation of neural network coefficients is relatively rarely done, this issue is less prominent.


In addition, neural network training time increases with more inputs, so reducing the number of inputs
while retaining important variable information is normally a good practice.

Bayesian networks that model the conditional probabilities among a set of fields function best, and
are much easier to interpret, with a relatively small number of inputs, so data reduction can be a
useful step before modeling here as well.

Rule induction methods will run when predictors are highly related. However, if two continuous
predictors are highly correlated and have about the same relationship to the target, then the predictor
with the slightly stronger relationship to the target will enter into the model. The other predictor is
unlikely to enter into the model, since it contributes little in addition to the first predictor. While this
may be adequate from the perspective of accurate prediction, the fact that the first field entered the
model, while the second didn't, could be taken to mean that the first was important and the second
was not. However, if the first were removed, the second predictor would have performed nearly as
well. Such relationships among inputs should be revealed as part of the data understanding and data
preparation step of a data mining project. If this were not done, or if it were done inadequately, then
the data reduction performed by principal components or factor analysis might be necessary (for
statistical methods) and helpful (for both statistical and machine learning methods).

In some surveys done for segmentation purposes, dozens of customer attitude measures or product
attribute ratings may be collected. Although cluster analysis can be run using a large number of
cluster fields, two complications can develop. First, if several fields measure the same or very similar
characteristics and are included in a cluster analysis, then what they measure is weighted more
heavily in the analysis. For example, suppose a set of rating questions about technical support for a
product is used in a cluster analysis with other unrelated questions. Since distance calculations used in
the PASW Modeler clustering algorithms are based on the differences between observations on each
field, then other things being equal, the set of related items would carry more weight in the analysis.
To exaggerate to make a point, if two fields were identical copies of each other and both were used in
a cluster analysis, the effect would be to double the influence of what they measure. In practice you
rarely ask the same number of rating questions about each attribute (or psychographic) area. So
principal components and factor analysis are used to explicitly combine the original input fields
into independent composite fields, to guide the analyst in constructing subscales, or to aid in selection
of representative sets of fields (some analysts select three fields strongly related to each factor or
component to be used in cluster analysis). Clustering is then performed on these fields.

A second reason factor or principal components might be run prior to clustering is for conceptual
clarity and simplification. If a cluster analysis were based on forty fields it would be difficult to look
at so large a table of means or a line chart and make much sense of them. As an alternative, you can
perform rule induction to identify the more influential fields and summarize those. If factor or
principal components analysis is run first, then the clustering is based on the themes or concepts
measured by the factors or components. Or, as mentioned above, clustering can be done on equal-
sized sets of fields, where each set is based on a factor. If the factors (components) have a ready
interpretation, it can be much easier to understand a solution based on five or six factors, compared to
one based on forty fields. As you might expect, factor and principal components analyses are more
often performed on “soft” measures—attitudes, beliefs, and attribute ratings— and less often on
behavioral measures like usage and purchasing patterns.

Keep in mind that factor and principal components analysis are considered exploratory data
techniques (although there are confirmatory factor methods; for example, Amos can be used to test
specific factor models). So as with cluster analysis, do not expect a definitive, unassailable answer.


When deciding on the number and interpretation of factors or components, domain knowledge of the
data, common sense, and a dose of hard thinking are very valuable.

2.3 What to Look for When Running Principal Components or Factor Analysis
There are two main questions that arise when running principal components and factor analysis: how
many (if any) components are there, and what do they represent? Most of our effort will be directed
toward answering them. These questions are related because, in practice, you rarely retain factors or
components that you cannot identify and name. Although the naming of components has rarely
stumped a creative researcher for long, which has led to some very odd-sounding “components,” it is
accurate enough to say that interpretability is one of the criteria when deciding to keep or drop a
component. When choosing the number of components, there are some technical aids (eigenvalues,
percentage of variance accounted for) we will discuss, but they are guides and not absolute criteria.

To interpret the components, a set of coefficients, called loadings or lambda coefficients, relating the
components (or factors) to the original fields, is very important. They provide information as to
which components are highly related to which fields and thus give insight into what the components
represent.

2.4 Principles
Factor analysis operates (and principal components usually operates) on the correlation matrix
relating the continuous fields to be analyzed. The basic argument is that the fields are correlated
because they share one or more common components, and if they didn’t correlate there would be no
need to perform factor or component analysis. Mathematically a one-factor (or component) model for
three fields can be represented as follows (Vs are fields (or variables), F is a factor (or component),
Es represent error variation that is unique to each field (uncorrelated with the F component and the E
components of the other variables)):

V1 = L1*F1 + E1
V2 = L2*F1 + E2
V3 = L3*F1 + E3

Each field is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2, L3 - the
lambdas) plus a unique or random component. If the factor were measurable directly (which it isn’t)
this would be a simple regression equation. Since these equations can’t be solved as given (the Ls, Fs
and Es are unknown), factor and principal components analysis take an indirect approach. If the
equations above hold, then consider why fields V1 and V2 correlate. Each contains a random or
unique component that cannot contribute to their correlation (Es are assumed to have 0 correlation).
However, they share the factor F1, and so if they correlate the correlation should be related to L1 and
L2 (the factor loadings). When this logic is applied to all the pairwise correlations, the loading
coefficients can be estimated from the correlation data. One factor may account for the correlations
between the fields, and if not, the equations can be easily generalized to accommodate additional
factors. There are a number of approaches to fitting factors to a correlation matrix (least squares,
generalized least squares, maximum likelihood), which has given rise to a number of factor methods.
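
To make this logic concrete, the hypothetical Python sketch below simulates data from a one-factor model with assumed loadings and shows that the correlation between any two fields is approximately the product of their loadings; the loading values and sample size are invented for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
loadings = np.array([0.8, 0.7, 0.6])                      # assumed L1, L2, L3

# One common factor plus a unique error term for each field; error variances are chosen
# so that each V has variance 1.
F = rng.standard_normal(n)
E = rng.standard_normal((3, n)) * np.sqrt(1 - loadings ** 2)[:, None]
V = loadings[:, None] * F + E                             # rows are V1, V2, V3

print(np.round(np.corrcoef(V), 3))
# The off-diagonal correlations are close to L1*L2 = 0.56, L1*L3 = 0.48, and L2*L3 = 0.42.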

What is a factor? In market research factors are usually taken to be underlying traits, attitudes or
beliefs that are reflected in specific rating questions. You need not believe that factors or components
actually exist in order to perform a factor analysis, but in practice the factors are usually interpreted,
given names, and generally spoken of as real things.

2.5 Factor Analysis versus Principal Components Analysis
Within the general area of data reduction there are two highly related techniques: factor analysis and
principal components analysis. They can both be applied to correlation matrices with data reduction
as a goal. They differ in a technical way having to do with how they attempt to fit the correlation
matrix. We will pursue the distinction since it is relevant to which method you choose. The diagram
below is a correlation matrix composed of five continuous fields.

Figure 2.1 Correlation Matrix of Five Continuous Fields

Principal components analysis attempts to account for the maximum amount of variation in the set of
fields. Since the diagonal of a correlation matrix (the ones) represents standardized variances, each
principal component can be thought of as accounting for as much as possible of the variation
remaining in the diagonal. Factor analysis, on the other hand, attempts to account for correlations
between the fields, and therefore its focus is more on the off-diagonal elements (the correlations). So
while both methods attempt to fit a correlation matrix with fewer components or factors than fields,
they differ in what they focus on when fitting. Of course, if a principal component accounts for most
of the variance in fields V1 and V2 , it must also account for much of the correlation between them.
And if a factor accounts for the correlation between V1 and V2 , it must account for at least some of
their (common) variance. Thus, there is definitely overlap in the methods and they usually yield
similar results. Often factor is used when there is interest in studying relations among the fields, while
principal components is used when there is a greater emphasis on data reduction and less on
interpretation. However, principal components is very popular because it can run even when the data
are multicollinear (one field can be perfectly predicted from the others), while most factor methods
cannot. In data mining, since data files often contain many fields likely to be multicollinear or near
multicollinear, principal components is used more often. This is especially the case if statistical
modeling methods, which will not run with multicollinear predictors, are used. Both methods are
available in the PCA/Factor node; by default, the principal components method is used.

2.6 Number of Components


When factor or principal components analysis is run there are several technical measures that can
guide you in choosing a tentative number of factors or components. The first indicator would be the
eigenvalues. Eigenvalues are fairly technical measures, but in principal components analysis, and
some factor methods (under orthogonal rotations), their values represent the amount of variance in the
input fields that is accounted for by the components (or factors). If we turn back to the correlation
matrix in Figure 2.1, there are five fields and therefore 5 units of standardized variance to be
accounted for. Each eigenvalue measures the amount of this variance accounted for by a factor. This
leads to a rule of thumb and a useful measure to evaluate a given number of factors. The rule of
thumb is to select as many factors as there are eigenvalues greater than 1. Why? If the eigenvalue
represents the amount of standardized variance in the fields accounted for by the factor, then if it is
above 1, it must represent variance contained in more than one field. This is because the maximum
amount of standardized variance contained in a single field is 1. Thus, if in our five-field analysis the
first eigenvalue were 3, it must account for variation in several fields. Now an eigenvalue can be less
than 1 and still account for variation shared among several fields (for example 30% of the variation of
each of three fields for an eigenvalue of .9), so the eigenvalue of 1 rule is only applied as a rule of
thumb. Another aspect of eigenvalues (for principal components and some factor methods) is that
their sum is the same as the number of fields, which is equal to the total standardized variance in the
fields. Thus you can convert the eigenvalue into a measure of percentage of explained variance,
which is helpful when evaluating a solution. Finally, it is important to mention that in applications in
which you need to be able to interpret the results, the components must make sense. For this reason,
factors with eigenvalues over 1 that cannot be interpreted may be dropped and those with eigenvalues
less than 1 may be retained.
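
The arithmetic behind the eigenvalue rule can be illustrated with the short Python sketch below, which uses a made-up correlation matrix for five fields (it is not the Modeler computation): the eigenvalues sum to 5, dividing each by 5 gives the proportion of variance explained, and the greater-than-1 rule retains two components here.

import numpy as np

# A made-up 5 x 5 correlation matrix (ones on the diagonal).
R = np.array([
    [1.00, 0.65, 0.60, 0.20, 0.15],
    [0.65, 1.00, 0.55, 0.25, 0.10],
    [0.60, 0.55, 1.00, 0.15, 0.20],
    [0.20, 0.25, 0.15, 1.00, 0.70],
    [0.15, 0.10, 0.20, 0.70, 1.00],
])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]     # eigenvalues in descending order
print(np.round(eigvals, 3))                        # they sum to 5.0 (the number of fields)
print(np.round(eigvals / len(R), 3))               # proportion of variance per component
print(int((eigvals > 1).sum()))                    # eigenvalue-greater-than-1 rule: 2 components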

2.7 Rotations
When factor analysis succeeds you obtain a relatively small number of interpretable factors that
account for much of the variation in the original set of fields. Suppose you have eight fields and factor
analysis returns a two-factor solution. Formally, the factor solution represents a two-dimensional
space. Such a space can be represented with a pair of axes as shown below.

While each pair of axes defines the same two-dimensional space, the coordinates of a point would
vary depending on which pair of axes was applied. This creates a problem for factor methods since
the values for the loadings or lambda coefficients vary with the orientation of axes and there is no
unique orientation defined by the factor analysis itself. Principal Components does not suffer from
this problem since its method produces a unique orientation. This difficulty for factor analysis is a
fundamental mathematical problem. The solutions to it are designed to simplify the task of
interpretation for the analyst. Most involve, in some fashion, finding a rotation of the axes that
maximizes the variance of the loading coefficients, so some are large and some small. This makes it
easier for the analyst to interpret the factors. This is the best that can currently be done, but the fact
that factor loadings are not uniquely determined by the method is a valid criticism leveled against it
by some statisticians. We will discuss the various rotational schemes in the Methods section below.


Figure 2.2 Two Dimensional Space

2.8 Component Scores


If you are satisfied with a factor analysis or principal components solution, you can request that a new
set of fields be created that represent the scores of each data record on the factors. These are
calculated by summing the product of each original field and a weight coefficient (derived from the
lambda coefficients). These factor score fields can then be used as the inputs for prediction and
segmentation analyses. They are usually normalized to have a mean of zero and standard deviation of
one. An alternative some analysts prefer is to use the lambda coefficients to judge which fields are
highly related to a factor, and then compute a new field which is the sum or mean of that set of fields.
This method, while not optimal in a technical sense, keeps (if means are used) the new scores on the
same scale as the original fields (this of course assumes the fields themselves share a common scale),
which can make the interpretation and the presentation straightforward. Essentially, subscale scores
are created based on the factor results, and these scores are used in further analyses.
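
As a hypothetical illustration of the two approaches (using scikit-learn and made-up ratings data rather than Modeler), the sketch below computes normalized component scores and, alternatively, simple subscale means; which fields belong to which subscale is assumed for the example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.integers(1, 11, size=(200, 6)).astype(float)    # six made-up ratings on a 1-10 scale

# Option 1: component scores, rescaled to mean 0 and standard deviation 1.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
scores = StandardScaler().fit_transform(scores)
print(scores.mean(axis=0).round(2), scores.std(axis=0).round(2))

# Option 2: subscale means, assuming fields 0-2 load on one component and fields 3-5 on the other.
subscale_1 = X[:, :3].mean(axis=1)                      # stays on the original 1-10 scale
subscale_2 = X[:, 3:].mean(axis=1)
print(subscale_1[:5].round(2), subscale_2[:5].round(2))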

2.9 Sample Size


Since principal components analysis is a multivariate statistical method, the rule of thumb for sample
size (commonly violated) is that there should be from 10 to 25 times as many records as there are
continuous fields used in the factor or principal components analysis. This is because principal
components and factor analysis are based on correlations and for p fields there are p* (p-1)/2
correlations. Think of this as a desirable goal and not a formal requirement (technically if there are p
fields there must be p+1 observations for factor analysis to run—but don’t expect reasonable results).
If your sample size is very small relative to the number of input fields, you should turn to principal
components.

2.10 Methods
There are several popular methods within the domain of factor and principal components analyses.
The common factor methods differ in how they go about fitting the correlation matrix. A traditional
method that has been around for many years—for some it means factor analysis— is the principal
axis factor method (often abbreviated as PAF). A more modern method that carries some technical
advantages is maximum likelihood factor analysis. If the data are ill behaved (say near
multicollinear), maximum likelihood, the more refined method, is more prone to give wild solutions.
In most cases results using the two methods will be very close, so either is fine under general
circumstances. If you suspect there are problems with your data, then principal axis may be a safer
bet. The other factor methods are considerably less popular. One factor method, called Q factor
analysis, involves transposing the data matrix and then performing a factor analysis on the records
instead of the fields. Essentially, correlations are calculated for each pair of records based on the
values of the input fields. This technique is related to cluster analysis, but is used infrequently today.
Besides the factor methods, principal components can be run and, as mentioned earlier, must be run
when the inputs are multicollinear.

Similarly, there are several choices in rotations. The most popular by far is the varimax rotation,
which attempts to simplify the interpretation of the factors by maximizing the variances of the input
fields’ loadings on each factor. In other words, it attempts to find a rotation in which some fields
have high and some low loadings on each factor, which makes it easier to understand and name the
factors. The quartimax rotation attempts to simplify the interpretation of each field in terms of the
factors by finding a rotation yielding high and low loadings across factors for each field. The equimax
rotation is a compromise between the varimax and quartimax rotation methods. These three rotations
are orthogonal, which means the axes are perpendicular to each other and the factors will be
uncorrelated. This is considered a desirable feature since statements can be made about independent
factors or aspects of the data. There are nonorthogonal rotations available (axes are not
perpendicular); popular ones are oblimin and promax (runs faster than oblimin). Such rotations are
rarely used in data mining, since the point of data reduction is to obtain relatively independent
composite measures, and it is easier to speak of independent effects when the factors are uncorrelated.
Finally, principal components does not require a rotation, since there is a unique solution associated
with it. However, in practice, a varimax rotation is sometimes done to facilitate the interpretation of
the components.
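
For readers curious about what a varimax rotation does mechanically, the sketch below is a minimal NumPy implementation of the standard varimax criterion applied to a made-up loading matrix; it is a teaching sketch, not the Modeler algorithm.

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    # Orthogonally rotate a (fields x factors) loading matrix using the varimax criterion.
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (1.0 / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

# Made-up unrotated loadings for five fields on two components.
L = np.array([[0.70,  0.40],
              [0.75,  0.35],
              [0.65,  0.45],
              [0.60, -0.55],
              [0.55, -0.60]])
print(np.round(varimax(L), 2))   # after rotation, each field's loadings are more clearly separated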

2.11 Overall Recommendations


For data mining applications, principal components is more commonly performed than factor analysis
because of the expected high correlations among the many continuous inputs that are often analyzed,
and because there isn't always strong interest in interpreting the results. Varimax rotation is usually
done (although it is not necessary for principal components) to simplify the interpretation. If there are
not many highly correlated fields (or other sources for ill-behaved data, for example, much missing
data), then either principal axis or maximum likelihood factor can be performed. Maximum likelihood
has technical advantages, but can produce an ugly solution if the data are not well conditioned (a
statistical criterion).

2.12 Example: Regression with Principal Components


To demonstrate principal components, we will run a linear regression analysis predicting a target
(amount of waste produced) as a function of several related inputs (amount of acreage put to different
uses). After examining the regression results, we will run principal components analysis and use the
first few component score fields as inputs to the regression.

Note
A complete example of linear regression is provided in Lesson 6. Our intent here is not to teach linear
regression, but instead to use this technique to illustrate how principal components can be used in
conjunction with that modeling technique.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory


Double-click on PrincipalComponents.str

When the stream first opens, the following warning dialog is displayed. In version 14.0 of Modeler,
the Linear Models node, an enhanced version of the Regression node, was added; the Regression node
will be replaced in a future release.

Figure 2.3 Regression Node Expiration Warning

Click OK
Right-click on the Table node connected to the Type node, then click Run
Examine the data, and then close the Table window
Double-click on the Type node

Figure 2.4 Type Node for Linear Regression Analysis

The INDUST, METALS, TRUCKS, RETAIL, and RESTRNTS fields (which measure the number of
acres of a specific type of land usage) will be used as inputs to predict the amount of solid waste
(WASTE).

Close the Type node window


Double-click on the Regression node named WASTE at the top of the Stream canvas
Click the Expert tab, and then click the Expert option button
Click the Output button, and then make sure that the Descriptives check box is checked


Figure 2.5 Requesting Descriptive Statistics in a Linear Regression Node

To check for correlation among the inputs, we request descriptive statistics (Descriptives). This will
display correlations for all the fields in the analysis, among other statistics. (Note that we could have
obtained these correlations from the Statistics node.) We can obtain more technical information about
correlated predictors by checking the Collinearity Diagnostics check box.

Click OK, and then click the Run button


Right-click the Regression generated model node named Waste in the Models Manager
window, then click Browse
Click the Summary tab
Expand the Analysis topic in the Summary tab (if necessary)

Figure 2.6 Linear Regression Browser Window (Summary Tab)


The estimated regression equation appears in the Summary tab under Analysis; notice that two of the
inputs have negative coefficients.

Click the Advanced tab


Scroll to the Pearson Correlation section of the Correlations table in the Advanced tab of
the browser window

Figure 2.7 Correlations for Input Fields and Target Field

All correlations are positive and there are high correlations between the METALS and TRUCKS fields
(.893) and between the RESTRNTS and RETAIL fields (.920). Since some of the inputs are highly
correlated, this might create stability problems (large standard errors) for the estimated regression
coefficients due to near multicollinearity.
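
Outside of Modeler, a common numeric check for this is the variance inflation factor (VIF), which can be read off the diagonal of the inverse of the predictors' correlation matrix; values much above 5 or 10 are usually taken as a warning sign. The sketch below uses made-up correlations that are only similar in spirit to those in Figure 2.7.

import numpy as np

# Made-up correlation matrix for the five predictors, with two highly correlated pairs.
R = np.array([
    [1.00, 0.45, 0.40, 0.30, 0.25],   # INDUST
    [0.45, 1.00, 0.89, 0.35, 0.30],   # METALS
    [0.40, 0.89, 1.00, 0.40, 0.35],   # TRUCKS
    [0.30, 0.35, 0.40, 1.00, 0.92],   # RETAIL
    [0.25, 0.30, 0.35, 0.92, 1.00],   # RESTRNTS
])

# VIFs are the diagonal elements of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R))
for name, v in zip(["INDUST", "METALS", "TRUCKS", "RETAIL", "RESTRNTS"], vif):
    print(f"{name:9s} VIF = {v:5.2f}")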

Scroll to the Model Summary table

Figure 2.8 Regression Model Summary

The regression model with five predictors accounted for about 83% (adjusted R Square) of the
variation in the target field (waste).

Scroll to the Coefficients table


Figure 2.9 Linear Regression Coefficients

Two of the significant coefficients (INDUST and RETAIL) have negative regression coefficients,
although they correlate positively (see Figure 2.7) with the target field. Although there might be a
valid reason for this to occur, this coupled with the fact that RETAIL is highly correlated with another
predictor is suspicious. Also, those familiar with regression should note that the estimated beta
coefficient for RESTRNTS is above 1, which is another sign of near multicollinearity. It is possible
that this situation could have been avoided if a stepwise method had been used (this is left as an
exercise). However, we will take the position that the current set of inputs is exhibiting signs of near
multicollinearity and we will run principal components as an attempt to improve the situation.

Close the Regression browser window


Double-click the PCA/Factor model node (named Factor) in the stream canvas

Figure 2.10 PCA/Factor Dialog


In Simple mode (see Expert tab), the only options involve selection of the factor extraction method
(some of these were discussed in the Methods section). Notice that Principal Components is the
default method.

Click the Run button


Right-click the PCA/Factor generated model node named Factor in the Models Manager
window, then click Browse

Figure 2.11 PCA/Factor Browser Window (Five-Component Solution)

Five principal components were found. Since there were originally five input fields, reducing them to
five principal components does not constitute data reduction (but it does solve the problem of
multicollinearity). If the solution were successful, we would expect that the variation within the five
input fields would be concentrated in the first few components and we could check this by examining
the Advanced tab of the browser window. However, instead we will use the Expert options to have
the PCA/Factor node select an optimal number of principal components.

Close the PCA/Factor browser window


Double-click on the PCA/Factor modeling node named Factor
Click the Expert tab, and then click the Expert Mode option button


Figure 2.12 Expert Options

The Extract factors option indicates that while in Expert mode, PCA/Factor will select as many
factors as there are eigenvalues over 1 (we discussed this rule of thumb earlier in the lesson). You can
change this rule or specify a number of factors; this might be done if you prefer more or fewer factors
than the eigenvalue rule provides. By default, the analysis will be performed on the correlation
matrix; principal components can also be applied to covariance matrices, in which case fields with
greater variation will have more weight in the analysis. This is really all we need to proceed, but let's
examine the other Expert options.

Notice that the Only use complete records check box becomes active when the Expert Mode is
selected. By default, PCA/Factor will only use records with complete information on the input fields.
If this option is not checked, then a pairwise technique is used: for a record with missing values on one
or more fields used in the analysis, the fields with valid values will still be used. However, the created
factor score fields will be set to $null$ for these records. Also, substantial amounts of missing data,
when Only use complete records is not selected, can lead to numeric instabilities in the algorithm.

The Sort values check box in the Component/Factor format section will have PCA/Factor list the
fields in descending order by their loading coefficients on the factor/component for which they load
highest. This makes it very easy to see which fields relate to which factors and is especially useful
when many input fields are involved. To further aid this effort, by suppressing loading coefficients
less than .3 in absolute value (the Hide values below option) you will only see the larger loadings
(small values are replaced with blanks) and not be distracted by small loadings. Although not
required, these options make the interpretive task much easier when many fields are involved.

Make sure the Sort values check box is checked
Make sure the Hide values below check box is checked
Set the Hide values below value to 0.3


Click the Rotation button

Figure 2.13 Expert Options (Factor/Component Rotation)

By default, no rotation is performed, which is often the case when principal components is run. The
Delta and Kappa text boxes control aspects of the Oblimin and Promax rotation methods,
respectively.

Click Cancel
Click the Run button
Right-click the PCA/Factor generated model node, named Factor, in the Models Manager
window, then click Browse
Click the Model tab


Figure 2.14 PCA/Factor Browser Window (Two-Component Solution)

The PCA/Factor browser window contains the equations to create component (in this case) or factor
score fields from the inputs. Two components were selected based on the eigenvalue greater than 1
rule (recall five were selected in the original analysis under the Simple mode). The coefficients are so
small because the components are normalized to have means of 0 and standard deviations of 1, while
most inputs have values that extend into the thousands. To interpret the components, we turn to the
advanced output.

Click the Advanced tab


Scroll to the Communalities table in the Expert Output browser window


Figure 2.15 Communalities Summary

The communalities represent the proportion of variance in an input field explained by the factors
(here principal components). Since initially, as many components are fit as there are inputs, the
communalities in the first column (Initial) are trivially 1. They are of interest when a solution is
reached (Extraction column). Here the communalities are below 1 and measure the percentage of
variance in each input field that is accounted for by the selected number of components (two). Any
fields having very small communalities (say .2 or below) have little in common with the other inputs,
and are neither explained by the components (or factors), nor contribute to their definition. Of the five
inputs, all but INDUST have a large proportion of their variance accounted for by the two
components, and INDUST itself has a communality of .44 (44%).

Scroll to the Total Variance Explained table in the Advanced tab of the browser window

Figure 2.16 Total Variance Explained (by Components) Table

The Initial eigenvalues area contains all (5) eigenvalues, along with the percentage of variance (of the
fields) explained by each and a cumulative percentage of variance. We see in the Extraction Sums of
Squared Loadings section that there are two eigenvalues over 1, the first being about twice the size of
the second. Two components were selected and they collectively account for about 82 percent of the
variance of the 5 inputs. The third eigenvalue is .73, which might be explored as a third component if
more input fields were involved (reducing from five fields to three components is not much of a
reduction). The remaining two components (fourth and fifth) are quite small. While not pursued here,
in practice we might try out a solution with a different number of components.

Scroll to the Component Matrix table in the Advanced tab of the browser window


Figure 2.17 Component Matrix (Component or Factor Loadings)

PCA/Factor next presents the Component (or Factor) Matrix that contains the unrotated loadings. If a
rotation were requested, this table would appear in addition to a table containing the rotated loadings.
The input fields form the rows and the components (or factors if a factor method were run) form the
columns. The values in the table are the loadings. If any loading were below .30 (in absolute value),
blanks would appear in its position due to our option choice. While it makes no difference here, the
option helps focus on the larger (absolute value closer to 1) loadings.

The first component seems to be a general component, having positive loadings on all the input fields
(recall that they all correlated positively—see Figure 2.7). In some sense, it could represent the total
(weighted) amount of land used in these activities. The second component has both positive and
negative coefficients, and seems to represent the difference between land usage for trucking and
wholesale trade, fabricated metals, and industrial work, versus retail trade, restaurants and hotels.
This might be considered a contrast between manufacturing/industrial and service-oriented use of
land. This pattern, all fields with positive loadings on the first component (factor) and contrasting
signs on coefficients of the second and later components (factors), is fairly common in unrotated
solutions. If we requested a rotation, the fields would group into the two rotated components
according to their signs on the second component.

We should note that when interpreting components or factors, the loading magnitude is important;
that is, fields with greater loadings (in absolute value) are more closely associated with the
components and are more influential when interpreting the components.

We know that the two components account for 82 percent of the variation of the original input fields
(a substantial amount), and that we can interpret the components. Now we will rerun the linear
regression with the components as inputs.

Close the PCA/Factor browser window


Double-click on the Type node located to the right of the PCA/Factor generated model
node named Factor


Figure 2.18 Type Node Set Up for Principal Components Regression

The two component score fields ($F-Factor-1, $F-Factor-2) are the only fields that will be used as
inputs; the original land usage fields have their role set to None. If both the land usage fields and the
component score fields were inputs to the linear regression, we would have only exacerbated the near
multicollinearity problem (as an exercise, explain why).

Close the Type node window


Run the Regression modeling node, named Waste, located in the lower right section of the
Stream canvas
Right-click the Regression generated model node named Waste in the Models Manager,
then click Browse
Click the Summary tab
Expand the Analysis topic


Figure 2.19 Linear Regression (Using Components as Inputs) Browser Window

The prediction equation for waste is now in terms of the two principal component fields. Notice that
the coefficient for the second component has a negative sign, which we will consider when examining
the expert output.

Click the Advanced tab


Scroll to the Model Summary table

Figure 2.20 Model Summary (Principal Components Regression)

The regression model with two principal component fields as inputs accounts for about 73% of the
variance (adjusted R square) in the Waste field. This compares with the 83% in the original analysis
(Figure 2.8). Essentially, we are giving up 10% explained variance to gain more stable coefficients
and possibly a simpler interpretation. The requirements of the analysis would determine whether this
tradeoff is acceptable.

Scroll to the Coefficients table


Figure 2.21 Coefficients Table (Principal Components Regression)

Both components are statistically significant. The positive coefficient for $F-Factor-1 indicates, not
surprisingly, that as overall land usage increases, so does the amount of waste. The negative coefficient
for the second component (which represented a contrast of manufacturing/industrial versus
service-oriented land use) indicates that, controlling for total land usage, waste production goes down
as manufacturing/industrial land use increases relative to service-oriented usage. Or, to put it another
way, as service-oriented land use increases relative to manufacturing/industrial use, waste production
increases.

As mentioned before, the interpretation of the component, and thus the regression, results might be
made easier by rotating (say using a varimax rotation) the components (you might ask your instructor
to demonstrate this approach). Notice that the components, unlike the original fields (see Figure 2.9),
have no beta coefficients above 1, indicating that the potential problem with near multicollinearity has
been resolved.

It is important to note that while we have shifted from a regression with five inputs to a regression
with two components, the five inputs are still required to produce predictions because they are needed
to create the component score fields.
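
The overall strategy can also be sketched outside of Modeler. The hypothetical scikit-learn pipeline below standardizes the five inputs, keeps the first two principal components, and regresses WASTE on the component scores; the file name and column layout are assumptions made for the illustration, not the actual training data file.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv("waste.csv")                              # assumed file and column names
X = df[["INDUST", "METALS", "TRUCKS", "RETAIL", "RESTRNTS"]]
y = df["WASTE"]

# Standardize, reduce to two components, then regress the target on the component scores.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R squared:", round(pcr.score(X, y), 3))
print("Component coefficients:", pcr.named_steps["linearregression"].coef_.round(3))

# Predictions for new areas still need all five land-usage fields, because those fields
# are required to compute the component scores.
print("Prediction:", pcr.predict(X.iloc[[0]]).round(1))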

Additional Readings
Those interested in learning more about factor and principal components analysis might consider the
book by Kline (1994), Jae-On Kim’s introductory text (1978) and his book with Charles W Mueller
(1979), and Harry Harman’s revised text (1979).


Summary Exercises
The exercises in this lesson use the file waste.dat. The table provides details about the file.

Waste.dat contains information from a waste management study in which the amount of solid waste
produced within an area was related to type of land usage. Interest is in relating land usage to amount
of waste produced for planning purposes. Inputs were found to be highly correlated and the dataset is
used to demonstrate principal components regression. The file contains 40 records and the following
fields:

INDUST Acreage (US) used for industrial work


METALS Acreage used for fabricated metal
TRUCKS Acreage used for trucking and wholesale trade
RETAIL Acreage used for retail trade
RESTRNTS Acreage used for restaurants and hotels
WASTE Amount of solid waste produced

1. Working with the current stream from the lesson, request a varimax rotation of the principal
components analysis. Interpret the component coefficients. Use the component score fields
from this generated model node as inputs to the Regression node predicting waste. Does the
R square change? Explain this. Do the regression coefficients change? How would you
interpret them?

2. With the same data, use the Extraction Method drop-down list in the PCA/Factor node to run
a factor analysis instead (using principal axis factoring or maximum likelihood) with no
rotation. Compare the results to those obtained by the principal components in the lesson. Are
they similar? In what way do they differ? Now rerun the factor analysis, requesting a varimax
rotation. How do these results compare to those obtained in the first exercise? Do you find
anything that leads you to prefer one to the other?


Lesson 3: Decision Trees/Rule Induction


Overview
• Introduce the features of the C5.0, CHAID, C&R Tree and QUEST nodes
• Create models for categorical targets
• Understand how CHAID and C&R Tree model a continuous output

Data
We will use the dataset churn.txt that we used in Lesson 1. This data file contains information on
1477 of a telecommunication company’s customers who have at some time purchased a mobile
phone. The customers fall into one of three groups: current customers, involuntary leavers and
voluntary leavers. In this lesson, we use decision tree models to understand which factors influence
group membership.

Following recommended practice, we will use a Partition Node to divide the cases into two partitions
(subsamples), one to build or train the model and the other to test the model (often called a holdout
sample). With a holdout sample, you are able to check the resulting model performance on data not
used to fit the model. The holdout data sample also has known values for the target field and therefore
can be used to check model performance.
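
Readers who want to see the holdout idea in miniature outside of Modeler can look at the hypothetical scikit-learn sketch below, which assumes an already prepared, numerically encoded copy of the churn data (the file name and encoding are assumptions); a decision tree stands in for the Modeler algorithms.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn_encoded.csv")                      # assumed, numerically encoded data
X = df.drop(columns=["ID", "CHURNED"])
y = df["CHURNED"]

# 50/50 training/holdout split, mirroring the default Partition node settings.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=1).fit(X_train, y_train)
print("Training accuracy:", round(accuracy_score(y_train, tree.predict(X_train)), 3))
print("Holdout accuracy: ", round(accuracy_score(y_test, tree.predict(X_test)), 3))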

A second dataset Insclaim.dat, used with the C&R Tree node, contains 293 records based on patient
admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields
(grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal
is to build a predictive model for the insurance claim amount and use this model to identify outliers
(patients with claim values far from what the model predicts), which might be instances of errors or
fraud made in the claims. Such analyses can be performed for error or fraud detection in instances
where audited data (for which the outcome error/no error or fraud/no fraud) are not available.

3.1 Introduction
PASW Modeler contains four different algorithms for constructing a decision tree (more generally
referred to as rule induction): C5.0, CHAID, QUEST, and C&R Tree (classification and regression
trees). They are similar in that they can all construct a decision tree by recursively splitting data into
subgroups defined by the predictor fields as they relate to the target. However, they differ in several
important ways. (PASW Modeler also includes the Decision List node, which develops models to
identify subgroups or segments that show a higher or lower likelihood of a binary (yes or no) target
relative to the overall sample. These models are tree-like, but they are different enough that Decision
List is reviewed separately in Appendix A.)

We begin by reviewing a table that highlights some distinguishing features of the algorithms. Next,
we will examine the various options for the algorithms in the context of predicting a categorical
output. Within each section we discuss when it is advisable to use the expert options within these
nodes.

3.2 Comparison of Decision Tree Models


The table below lists some of the important differences between the decision tree/rule induction
algorithms available within PASW Modeler.


Table 3.1 Some Key Differences between the Four Decision Tree Models

Type of Split for Categorical Predictors: C5.0 = Multiple; CHAID = Multiple; QUEST = Binary; C&R Tree = Binary
Continuous Target: C5.0 = No; CHAID = Yes (1); QUEST = No; C&R Tree = Yes
Continuous Predictors: C5.0 = Yes; CHAID = No (2); QUEST = Yes; C&R Tree = Yes
Criterion for Predictor Selection: C5.0 = Information measure; CHAID = Chi-square; QUEST = Statistical (F test for continuous); C&R Tree = Impurity (dispersion) measure
Can Cases with Missing Predictor Values be Used? C5.0 = Yes, uses fractionalization; CHAID = Yes, missing becomes a category; QUEST = Yes, uses surrogates; C&R Tree = Yes, uses surrogates
Priors: C5.0 = No; CHAID = No; QUEST = Yes; C&R Tree = Yes
Pruning Criterion: C5.0 = Upper limit on predicted error; CHAID = Stops rather than overfit; QUEST = Cost-complexity pruning; C&R Tree = Cost-complexity pruning
Build Trees Interactively: C5.0 = No; CHAID = Yes; QUEST = Yes; C&R Tree = Yes
Supports Bagging/Boosting: C5.0 = Yes; CHAID = Yes; QUEST = Yes; C&R Tree = Yes

(1) Modeler has extended the logic of the CHAID approach to accommodate ordinal and continuous
target fields.
(2) Continuous predictors are binned into ordinal fields containing by default approximately equal sized
categories.

Note: C&R Tree and QUEST produce binary splits (two branch splits) when growing the tree, while
C5.0 and CHAID can produce more than two subgroups when splitting occurs. However, if we had a
predictor of measurement level nominal or ordinal with four categories, each of which was distinct
in relation to the target field, C&R Tree and QUEST could perform successive binary splits on this
field. This would produce a result equivalent to a multiple split at a single node, but requires
additional tree levels.

All methods can handle predictors and targets that are categorical (flag, nominal, and ordinal).
CHAID and C&R Tree can use a continuous target field, while all but CHAID can use a continuous
predictor or input (although see footnote 2).

The trees that each method grows will not necessarily be identical because the methods use very
different criteria for selecting a predictor. CHAID and QUEST use more standard statistical methods,
while C5.0 and C&R Tree use non-statistical measures, as explained below.

Missing (blank) values are handled in three different ways. C&R Tree and QUEST use the substitute
(surrogate) predictor field whose split is most strongly associated with that of the original predictor to
direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in
proportion to the distribution of the predictor field and passes a weighted portion of the case down
each tree branch. CHAID uses all the missing values as an additional category in model building.


Three of the four methods prune trees after growing them quite large, while CHAID instead stops
before a tree gets too large.

For all these reasons, you should not expect the four algorithms to produce identical trees for the
same data. You should expect that important predictors would be included in trees built by any
algorithm.

Those interested in more detail concerning the algorithms can see the PASW Modeler 14.0 Algorithms
Guide. Also, you might consider C4.5: Programs for Machine Learning (Morgan Kauffman, 1993)
by Ross Quinlan, which details the predecessor to C5.0; Classification and Regression Trees
(Wadsworth, 1984) by Breiman, Friedman, Olshen and Stone, who developed CART (Classification
and Regression Tree) analysis; the article by Loh and Shih (1997, “Split Selection Methods for
classification trees,” Statistica Sinica, 7: 815-840) that details the QUEST method; and for a
description of CHAID, “The CHAID Approach to Segmentation Modeling: CHI-squared Automatic
Interaction Detection,” Lesson 4 in Richard Bagozzi, Advanced Methods of Marketing Research
(Blackwell, 1994).

3.3 Using the C5.0 Node


We will use the C5.0 node to create a rule induction model. The generated model node contains the
rule induction model in either decision tree or rule set format. By default, the C5.0 node is labeled with
the name of the output field. The C5.0 model can be browsed, and predictions can be made by passing
new data through it in the Stream Canvas.

Before a data stream can be used by the C5.0 node—or essentially any node in the Modeling
palette—the measurement levels of all fields used in the model must be instantiated (either in the
source node or a Type node). That is because all modeling nodes use this information to set up the
models. As a reminder, the table below shows the roles that can be assigned to a field.

Table 3.2 Role Settings


Input The field acts as an input or predictor within the modeling.
Target The field is the output or target for the modeling.
Both Allows the field to act as both an input and a target in modeling. This role is
suitable for the association rule and sequence detection algorithms only; all other
modeling techniques will ignore the field.
None The field will not be used in machine learning or statistical modeling. Default if
the field is defined as Typeless.
Partition Indicates a field used to partition the data into separate samples for training,
testing, and (optional) validation purposes.
Split Indicates that a model should be built for each value of a field. For flag, nominal
and ordinal fields only.
Frequency Used as a frequency weighting factor. Supported by CHAID, QUEST, C&R Tree,
and the Linear node.

Role can be set by clicking in that column for a field within the Type node or the Type tab of a source
node and selecting the role from the drop-down menu. Alternatively, this can be done from the Fields
tab of a modeling node.

If the Stream Canvas is not empty, click File…New Stream


Place a Var. File node from the Sources palette
Double-click the Var. File node


Move to the c:\Train\ModelerPredModel directory and double-click on the Churn.txt file


As delimiter, check the Comma option if necessary
Set the Strip lead and trail spaces: option to Both
Click OK to return to the Stream Canvas
Place a Partition node from the Field Ops palette to the right of the Var. File node named
Churn.txt
Connect the Var.File node named Churn.txt to the Partition node
Place a Type node from the Field Ops palette to the right of the Partition node
Connect the Partition Node to the Type node

Next we will add a Table node to the stream. This not only will force PASW Modeler to instantiate
the data but also will act as a check to ensure that the data file is being correctly read.

Place a Table node from the Output palette above the Type node in the Stream Canvas
Connect the Type node to the Table node
Right-click the Table node
Run the Table node

The values in the data table should look reasonable (not shown).

Click File…Close to close the Table window


Double-click the Type node
Click in the cell located in the Measurement column for ID (current value is Continuous),
and select Typeless from the list
Click in the cell located in the Role column for CHURNED (current value is Input) and select
Target from the list

Figure 3.1 Type Node Ready for Modeling


Notice that ID will be excluded from any modeling as the role is automatically set to None for a
Typeless field. The CHURNED field will be the target field for any predictive model and all fields but
ID and Partition will be used as predictors.

Click OK
Place a C5.0 node from the Modeling palette to the right of the Type node
Connect the Type node to the C5.0 node

The name of the C5.0 node should immediately change to CHURNED.

Figure 3.2 C5.0 Modeling Node Added to Stream

Double-click the C5.0 node


Figure 3.3 C5.0 Node Model Tab


The Model name option allows you to set the name for both the C5.0 and resulting C5.0 rule nodes.
The form (decision tree or rule set, both will be discussed) of the resulting model is selected using the
Output type: option.

The Use partitioned data option is checked so that the C5.0 node will make use of the Partition field
created by the Partition node earlier in the stream. Whenever this option is checked, only the cases the
Partition node assigned to the Training sample will be used to build the model; the rest of the cases
will be held out for Testing and/or Validation purposes. If unchecked, the field will be ignored and
the model will be trained on all the data. Here, we use the default setting for the Partition node, so
50% of cases will be used for training and 50% for testing.

The Build model for each split option enables you to use a single stream to build separate models for
each possible value of a flag, nominal, or ordinal input field, which is specified as a split field in
the Fields tab or upstream Type node. With split modeling, you can easily build the best-fitting model
for each possible field value in a single execution of the stream.

The Cross-validate option provides a way of validating the accuracy of C5.0 models when there are
too few records in the data to permit a separate holdout sample. It does this by partitioning the data
into N equal-sized subgroups and fitting N models. Each model uses (N-1) of the subgroups for training,
then applies the resulting model to the remaining subgroup and records the accuracy. Accuracy
figures are pooled over the N holdout subgroups and this summary statistic estimates model accuracy
applied to new data. Since N models are fit, N-fold validation is more resource intensive and reports
the accuracy statistic, but does not present the N decision trees or rule sets. By default N, the number
of folds, is set to 10.
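
The mechanics of N-fold cross-validation can be sketched with the hypothetical scikit-learn code below (a decision tree stands in for C5.0, and the encoded data file is an assumption carried over from the earlier sketch): ten trees are fit, each is scored on the fold it did not see, and the accuracies are pooled.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("churn_encoded.csv")                      # assumed, numerically encoded data
X = df.drop(columns=["ID", "CHURNED"])
y = df["CHURNED"]

# 10-fold cross-validation: each model is tested on the subgroup it was not trained on.
scores = cross_val_score(DecisionTreeClassifier(min_samples_leaf=10, random_state=1),
                         X, y, cv=10)
print(scores.round(3))
print("Estimated accuracy on new data:", round(scores.mean(), 3))
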
For a predictor field that has been defined as categorical, C5.0 will normally form one branch per
value in the set. However, by checking the Group symbolics check box, the algorithm can be set so
that it finds sensible groupings of the values within the field, thus reducing the number of rules. This
is often desirable. For example, instead of having one rule per region of the country, group symbolic
values may produce a rule such as:

Region [South, Midwest] …


Region [Northeast, West] …

By default, C5.0 builds a single decision tree or rule set that, once trained, can be used for predictions.
However, it can also be instructed to build a number of alternative models for the same data by selecting
the Boosting option. Under this option, when the model makes a prediction it consults each of the
alternative models before making a decision. This can often provide more accurate predictions, but takes
longer to train. Also, the resulting model is a set of decision trees whose individual predictions are
combined by voting, which is not simple to interpret.

The algorithm can be set to favor either Accuracy on the training data (the default) or Generality to
other data. In our example, we favor a model that is expected to better generalize to other data and so
we select Generality.

Click Generality option button

C5.0 will automatically handle errors (noise) within the data and, if known, you can inform PASW
Modeler of the expected proportion of noisy or erroneous data. This option is rarely used.


As with all of the modeling nodes, after selecting the Expert option or tab, more advanced settings are
available. In this course, we will discuss the Expert options briefly. The reader is referred to the
Modeler 14 Modeling Nodes documentation for more information on these settings.

Click the Expert option button

Figure 3.4 C5.0 Node Model Tab Expert Options

By default, C5.0 will produce splits if at least two of the resulting branches have at least two data
records each. For large datasets you may want to increase this value to reduce the likelihood of rules
that apply to very few records. To do so, increase the value in the Minimum records per child branch
box.

Click the Simple Mode option button, and then click Run

A C5.0 Rule model, labeled with the predicted field (CHURNED), will appear in the Models palette
of the Manager.

The C5.0 Rule model is also added automatically to the stream, connected to the Type node. A dotted
line connects it to the C5.0 Modeling node, indicating the source of the model (not shown). Each time
the model is rerun, the model in the stream will be replaced.

3.4 Viewing the Model


Once the C5.0 Rule node is in the stream it can be edited.


Right-click the C5.0 generated model node named CHURNED in the stream palette, then
click Edit

The Model Viewer window has two panes. The left one shows the root node of the tree and the first
split; the right pane displays a graph of predictor importance measures.

According to what we see of the tree so far, LOCAL is the first split in the tree. Further, we see that if
LOCAL <= 4.976 the Mode value for CHURNED is InVol. The Mode is the modal (most frequent)
output value for the branch, and it will be the predicted value unless there are other fields that need to
be taken into account within that branch to make a prediction. When LOCAL <= 4.976 the branch
terminates, which is visually apparent because of the arrow. This means that every customer in this
range of values on LOCAL is predicted to be an involuntary churner.

In the second half of the first split where LOCAL > 4.976, the Mode value is Current. In this instance,
no predictions of CHURNED are visible, and to view the predictions we need to further unfold the
tree.

Predictor importance is enabled by default on the Analyze tab in the C5.0 modeling node (or any
modeling node for which it can be calculated). Predictor importance takes into account the whole tree
and is calculated on the test partition, if one is available (as is true here). Predictor importance values
sum to 1.0 so the relative importance of each predictor can be directly compared. Importantly,
predictor importance does not relate to model accuracy; instead, it is a measure of how much
influence a field has on model prediction, i.e., changes in a field lead to changes in model predictions.

Figure 3.5 Browsing the C5.0 Rule Node


The bar chart shows that the field LOCAL, used on the first split, is by far the most important in
predicting CHURNED. However, we haven’t seen the whole tree, and critically, we aren’t yet ready
to use the test partition data, so we won’t examine predictor importance any further at the moment.

To unfold the branch LOCAL > 4.976, just click the expand button.

Click to unfold the branch LOCAL > 4.976

Figure 3.6 Unfolding a Branch

SEX is the next split field. Now we see that SEX is the best predictor for persons who spend more than
4.976 minutes on local calls. The Mode value for Males is Current and for Females is Vol. However,
at this point we still cannot make any predictions because there is an expand symbol to the left of each
value of SEX, which means that other fields need to be taken into account before we can make a
prediction. Once again we can unfold each separate branch to see the rest of the tree, but we will take
a shortcut:

Click the All button in the Toolbar


Figure 3.7 Fully Unfolded Tree

We can see several nodes usually referred to as terminal nodes that cannot be refined any further. In
these instances, the mode is the prediction. For example, if we are interested in the Current Customer
group, one group we would predict to remain customers are persons where LOCAL > 4.976, SEX =
M, International <= 0.905, and AGE > 29. To get an idea about the number and percentage of
records within such branches we ask for more details.

Click Show or hide instance and confidence figures in the toolbar


Figure 3.8 Instance and Confidence Figures Displayed (highlighted branch predicts Current)

The instances figure tells us that there are 218 persons who met those criteria. The confidence figure for
this set of individuals is 1.0, which represents the proportion of records within this set correctly
classified (predicted to be Current and actually being Current). That means it is 100% accurate on
this group! If we were to score another dataset with this model, how would persons with the same
characteristics be classified? Because PASW Modeler assigns the group the modal category of the
branch, everyone in the new dataset who met the criteria defined by this rule would be predicted to
remain Current Customers.

If you would like to present the results to others, an alternative format is available that helps visualize
the decision tree. The Viewer tab provides this alternative format.

Click the Viewer tab


Click the Decrease Zoom tool (to view more of the tree). (You may also need to expand
the size of the window.)


Figure 3.9 Decision Tree in the Viewer Tab

The root of the tree shows the overall percentages and counts for the three categories of CHURNED.
The modal category is shaded in each node. We see that there are 719 customers in the training
partition.

The first split is on LOCAL, as we have seen already in the text display of the tree. Similar to the text
display, we can decide to expand or collapse branches. In the right corner of some nodes a – or + is
displayed, referring to an expanded or collapsed branch, respectively. For example, to collapse the
tree at node 2:

Click the – in the lower right corner of node 2 (shown in Figure 3.10)


Figure 3.10 Collapsing a Branch

In the Viewer tab, toolbar buttons are available for zooming in or out; showing frequency information
as graphs and/or as tables; changing the orientation of the tree; and displaying an overall map of the
tree in a smaller window (tree map window) that aids navigation in the Viewer tab. When it is not
possible to view the whole tree at once, such as now, one of the more useful buttons in the toolbar is
the Tree map button because it shows you the size of the tree. A red rectangle indicates the portion of
the tree that is being displayed. You can then navigate to any portion of the tree you want by clicking
on any node you desire in the Tree map window.

Click the + in the lower right corner of node 2 to expand the branch again


Click the Tree map button in the toolbar
Enlarge the Treemap until you see the node numbers (shown in Figure 3.11)


Figure 3.11 Decision Tree in the Viewer Tab with a Tree Map

3.5 Generating and Browsing a Rule Set


When building a C5.0 model, the C5.0 node can be instructed to generate the model as a rule set, as
opposed to a decision tree. A rule set is a collection of IF … THEN rules grouped by the prediction
they make.

A rule set can also be produced from the Generate menu when browsing a C5.0 decision tree model.

In the C5.0 Rule Model Viewer window, click Generate…Rule Set

Figure 3.12 Generate Ruleset Dialog


Note that the default Rule set name appends the letters “RS” to the output field name. You may
specify whether you want the C5.0 Ruleset node to appear in the Stream Canvas (Canvas), the
generated Models palette (GM palette), or both. You may also change the name of the rule set and set
lower limits on the support (percentage of records having the particular values on the input fields) and
confidence (accuracy) of the produced rules (percentage of records having the particular value for the
output field, given the values for the input fields).
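
As a quick illustration of these two quantities, the following Python sketch computes support and
confidence for a single rule; the records and the rule condition are made up for the example.

def rule_support_confidence(records, condition, target_field, predicted_value):
    """Support: share of all records meeting the rule's conditions.
    Confidence: share of those records that actually have the predicted value."""
    matching = [r for r in records if condition(r)]
    support = len(matching) / len(records)
    correct = sum(1 for r in matching if r[target_field] == predicted_value)
    confidence = correct / len(matching) if matching else 0.0
    return support, confidence

# Hypothetical example: three records and the rule "LOCAL > 4.976 => Current"
records = [{"LOCAL": 6.2, "CHURNED": "Current"},
           {"LOCAL": 3.1, "CHURNED": "InVol"},
           {"LOCAL": 8.0, "CHURNED": "Vol"}]
print(rule_support_confidence(records, lambda r: r["LOCAL"] > 4.976,
                              "CHURNED", "Current"))   # approximately (0.67, 0.5)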

Set Create node on: to GM Palette


Click OK

Figure 3.13 Generated C5.0 Rule Set Node (the generated rule set for CHURNED)

Click File…Close to close the C5.0 Rule browser window


Right-click the C5.0 Rule Set node named CHURNEDRS in the generated Models palette in
the Manager, then click Browse


Figure 3.14 Browsing the C5.0 Generated Rule Set

Click All button to unfold


Click Show or hide instance and confidence figures button in the toolbar

The numbered rules now expand as shown below.


Figure 3.15 Fully Expanded C5.0 Generated Rule Set

For example, Rule #1 (Current) has this logic: If a person makes more than 4.976 minutes of local
calls a month, is Male, makes less than or equal to .905 minutes of International calls, is less than or
equal to 29 years old, and has an estimated income greater than 38,950.50, then we would predict
Current. This form of the rules allows you to focus on a particular conclusion rather than having to
view the entire tree.
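
Written out in code, such a rule is simply a conjunction of conditions. The Python sketch below
transcribes Rule #1 to show how a rule-set prediction reads; the argument names are ours and the
thresholds are taken from the rule as displayed.

def rule_1_predicts_current(local, sex, international, age, est_income):
    """Return True when a record satisfies Rule #1 and would be predicted Current."""
    return (local > 4.976 and sex == "M" and international <= 0.905
            and age <= 29 and est_income > 38950.50)

print(rule_1_predicts_current(6.0, "M", 0.5, 25, 40000.0))   # True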

If the Rule Set is added to the stream, a Settings tab will become available that allows you to export
the rule set in SQL format, which permits the rules to be directly applied to a database.

Click File…Close to close the Rule set browser window

3.6 Understanding the Rule and Determining Accuracy


The predictive accuracy of the rule induction model is not given directly within the C5.0 model node.
To get that information, you can use an Analysis node. However, at this stage we will use Matrix
nodes and Evaluation Charts to determine how good the model is.


Creating a Data Table Containing Predicted Values


We use the Table node to examine the predictions from the C5.0 model.

Place a Table node from the Output palette below the generated C5.0 Rule model named
CHURNED
Connect the generated C5.0 Rule model named CHURNED to the Table node
Right-click the Table node, then click Run and scroll to the right in the table

Figure 3.16 Two New Fields Generated by the C5.0 Rule Node

Two new columns appear in the data table, $C-CHURNED and $CC-CHURNED. The first represents
the predicted value for each record and the second the confidence value for the prediction.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values


We will view a matrix (crosstab table) to see where the predictions were more, and less, correct, and
then we evaluate the model graphically with a gains chart.

Place two Select nodes from the Records palette, one to the lower right of the generated
C5.0 node named CHURNED, and one to the lower left
Connect the generated C5.0 node named CHURNED to each Select node


First we will edit the Select node on the left that we will use to select the Training sample cases:
Double-click on the Select node on the left to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 1_Training
Click OK, and then click OK again to close the dialog

Figure 3.17 Completed Selection for the Training Partition

Now we will edit the Select node on the right to select the Testing sample cases:

Double-click on the Select node on the right to edit it


Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 2_Testing
Click OK, and then click OK again to close the dialog

Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette below the Select node
Connect the Select node to the Matrix node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $C-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click the Output tab and, using the Custom name option, name the Matrix node for the
Training sample Training and the one for the Testing sample Testing (this will make it easier
to keep track of which output we are looking at)


Click OK

For each actual churned category, the Percentage of row choice will display the percentage of records
predicted into each of the target categories.

Run each Matrix node

Figure 3.18 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model correctly predicts about 78.7% of the Current
category, 100% of the Involuntary Leavers, and 97.0% of the Voluntary Leavers. The
results with the testing sample compare favorably, which suggests that the model will perform well
with new data.

Click File…Close to close the Matrix windows

Evaluation Chart Node


The Evaluation Chart node offers an easy way to evaluate and compare predictive models in order to
choose the best model for your application. Evaluation charts show how models perform in predicting
particular outcomes. They work by sorting records based on the predicted value and confidence of the
prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of a
criterion for each quantile, from highest to lowest.

To produce a gains chart for the Current group:

Place an Evaluation chart node from the Graphs palette to the right of the generated C5.0
Rule node named CHURNED
Connect the generated C5.0 Rule node named CHURNED to the Evaluation chart node

Outcomes are handled by defining a specific value or range of values as a hit. Hits usually indicate
success of some sort (such as a sale to a customer) or an event of interest (such as someone given
credit being a good credit risk). Flag output fields are straightforward; by default, hits correspond to
true values. For Set output fields, by default the first value in the set defines a hit. For the churn data,
the first value for the CHURNED field is Current. To specify a different value as the hit value, use the
Options tab of the Evaluation node to specify the target value in the User defined hit group. There are
five types of evaluation charts, each of which emphasizes a different evaluation criterion. Here we
discuss Gains and Lift charts. For information about the others, which include Profit and ROI charts,
see the Modeler 14 Modeling Nodes documentation.

Figure 3.19 Evaluation Chart Dialog

Gains are defined as the proportion of total hits that occurs in each quantile. We will examine the
gains when the data are ordered from those most likely to those least likely to be in the current
category (based on the confidence of the model prediction).

The Chart type option supports five chart types with Gains chart being the default. If Profit or ROI
chart type is selected, then the appropriate options (cost, revenue and record weight values) become
active so information can be entered. The charts are cumulative by default (see Cumulative plot check
box), which is helpful in evaluating such business questions as “how will we do if we make the offer
to the top X% of the prospects?” The granularity of the chart (number of points plotted) is controlled
by the Plot drop-down list and the Percentiles choice will calculate 100 values (one for each
percentile from 1 to 100). For small data files or business situations in which you can only contact
customers in large blocks (say some number of groups, each representing 5% of customers, will be
contacted through direct mail), the plot granularity might be decreased (to deciles (10 equal-sized
groups) or vingtiles (20 equal-sized groups)).


A baseline is quite useful since it indicates what the business outcome value (here gains) would be if
the model predicted at the chance level. The Include best line option will add a line corresponding to
a perfect prediction model, representing the theoretically best possible result applied to the data where
hits = 100% of the cases.

The Separate by partition option in the node provides an easy and convenient way to validate the
model by displaying not only the results of the model using the training data, but in a separate chart,
showing how well it performed with the testing or holdout data. Of course, this assumes that you
made use of the Partition Node to develop the model.

Click the Include best line checkbox (not shown)


Click Run

Figure 3.20 Gains Chart of the Current Customer Group

The vertical axis of the gains chart is the cumulative percentage of the hits, while the horizontal axis
represents the ordered (by model prediction and confidence) percentile groups. The diagonal line
presents the base rate, that is, what we expect if the model is predicting the outcome at the chance
level. The upper line (labeled $BEST-CHURNED) represents results if a perfect model were applied
to the data, and the middle line (labeled $C-CHURNED) displays the model results. The three lines
connect at the extreme [(0, 0) and (100, 100)] points. This is because if either no records or all records
are considered, the percentage of hits for the base rate, best model, and actual model are identical.
The advantage of the model is reflected in the degree to which the model-based line exceeds the base-
rate line for intermediate values in the plot and the area for model improvement is the discrepancy
between the model line and the Best (perfect model) line. If the model line is steep for early
percentiles, relative to the base rate, then the hits tend to concentrate in those percentile groups of
data. At the practical level, this would mean for our data that many of the current customers could be
found within a small portion of the ordered sample.
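
The cumulative gains values themselves are easy to compute once records are ranked by the model.
The Python sketch below (an illustration, not Modeler's internal code) ranks records by confidence and
reports, for each percentile group, the percentage of all hits captured so far; the baseline would simply
be 1%, 2%, …, 100%.

def cumulative_gains(confidences, is_hit, n_quantiles=100):
    """Percentage of all hits captured in the top 1/n, 2/n, ... of ranked records."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    total_hits = sum(is_hit) or 1
    gains, running, prev = [], 0, 0
    for q in range(1, n_quantiles + 1):
        cut = round(q * len(order) / n_quantiles)
        running += sum(is_hit[i] for i in order[prev:cut])
        gains.append(100.0 * running / total_hits)
        prev = cut
    return gains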

The gains line ($C-CHURNED) in the Training data chart rises steeply relative to the baseline,
indicating the hits for the Current outcome are concentrated in the percentiles predicted most likely to
contain current customers according to the model. Just over 75% of the hits are contained within the
first 40 percentiles. The gains line in the chart using Testing data is very similar, which suggests that
this model can be reliably used to predict current customers with new data.

You can hover over a line and a popup will display the value at that point, as shown in Figure 3.21.

Figure 3.21 Gains Chart for the Current Customer Group

Click File…Close to close the Evaluation chart window

Changing Target Category for Evaluation Charts


By default, an Evaluation chart will use the first target outcome category to define a hit. To change
the target category on which the chart is based, we must specify the condition for a User defined hit in
the Options tab of the Evaluation node. To create a gains chart in which a hit is based on the
Voluntary Leaver category:

Double-click the Evaluation node


Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click Insert button


Figure 3.22 Specifying the Hit Condition within the Expression Builder

The condition (Vol as the target value) defining a hit was created using the Expression Builder.

Click OK


Figure 3.23 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Run


Figure 3.24 Gains Chart for the Voluntary Leaver Category (Interaction Enabled)

The gains chart for the Voluntary Leavers category is better (steeper in the early percentiles) than that
for the Current category. For example, the top 40 model-ordered percentiles in the Training data chart
contain over 85% of the Voluntary Leavers, compared with 75.3% in the same chart when we looked
at Current Customers.

Click File…Close to close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As


Move to the c:\Train\ModelerPredModel directory
Type C5 in the File name: text box
Click Save

3.7 Understanding the Most Important Factors in Prediction

An advantage of rule induction models over neural networks is that the decision tree form makes it
clear which fields are having an impact on the predicted field. There is no great need to use
alternative methods such as web plots and histograms to understand how the rule is working. Of
course, you may still use the techniques we will demonstrate in Lesson 4 for neural networks to help
understand the model, but they often are not needed.

As we noted, predictor importance is calculated on the testing data partition. In addition to using that
information, viewing the tree also provides information about importance, as the most important
fields in the predictions can be thought of as those that divide the tree in its earliest stages. Thus, in
this example the most important field in predicting churn is LOCAL. Once the model divides the data
into two groups, those who do more local calling and those who do less, it focuses separately on
each group to determine which predictors indicate whether a customer will remain loyal to
the company, voluntarily leave, or even be dropped as a customer. The process continues until the
nodes either cannot be refined any further or the stopping rules are reached, at which point tree growth
stops.

In Figure 3.25 we show the C5.0 model node with an expanded tree along with the predictor
importance chart. The order in which splits occur in the tree parallels the relative importance of the
fields.

Figure 3.25 Expanded Tree and Predictor Importance Chart

3.8 Further Topics on C5.0 Modeling


Now that we have introduced you to the basics of C5.0 modeling, we will discuss the Expert options
which allow you to refine your model even further. This time, we will use an existing stream rather
than building one from scratch.

Click File…Open Stream


Double-click on DecisionTrees.str

The simple options within the C5.0 node allow you to use Boosting, specify the Expected noise (%)
and whether the resulting tree favors Accuracy or Generality. Noisy (inconsistent) data contain
records in which the same, or very similar, predictor values lead to different target values. While C5.0
will handle noise automatically, if you have an estimate of it, the method can take this into account
(see the section on Minimum Records and Pruning for more information on the effect of specifying a
noise value).

The expert mode allows you to fine-tune the rule induction process.

Double-click on the C5.0 node named CHURNED


Click the Model tab
Click the Expert Mode option button

Figure 3.26 Expert Options Available within the C5.0 Dialog (Model Tab)

When constructing a decision tree, the aim is to split the data into subsets that are, or seem to be
heading toward, single-class collections of records on the target field. That is, ideally the terminal
nodes contain only one category of the output field. At each point of the tree, the algorithm could
potentially partition the data based on any one of the input fields. To decide which is the “best” way
to partition the data—to find a compact decision tree that is consistent with the data—the algorithms
construct some form of test that usually works on the basis of maximizing a local measure of
progress.

Gain Ratio Selection Criterion


Within C5.0, the Gain Ratio criterion, based on information theory, is used when deciding how to
partition the data.

In the following sections, we will describe, in general terms, how this criterion measures progress.
However the reader is referred to C4.5: Programs for Machine Learning by J. Ross Quinlan (Morgan
Kaufmann, San Mateo CA, 1993) for a more detailed explanation of the original algorithm.


The criterion used in the predecessors to C5.0 selected the partition that maximizes the information
gain. Information gained by partitioning the data based on the categories of field X (an input or
predictor field) is measured by:

GAIN(X) = INFO(DATA) – INFOX(DATA)

Where INFO(DATA) represents the average information needed to identify the class (outcome
category) of a record within the total data.

And INFOX(DATA) represents the expected information requirement once the data has been
partitioned into each outcome of the current field being tested.

The information theory that underpins the criterion of gain can be given by the statement:
“The information conveyed by a message depends on its probability and can be measured in bits as
minus the logarithm to the base 2 of that probability. So, if for example there are 8 equally probable
messages, the information conveyed by any one of them is – log2 (1 / 8) or 3 bits”. For details on how
to calculate these values the reader is referred to Lesson 2 in C4.5: Programs for Machine Learning.

Although the gain criterion gives good results, it has a flaw in that it favors partitions that have a large
number of outcomes. Thus a categorical predictor with many values has an advantage over one with
few categories. The gain ratio criterion, used in C5.0, rectifies this problem.

The bias in the gain criterion can be rectified by a kind of normalization in which the gain attributable
to tests with many outcomes is adjusted. The gain ratio represents the proportion of information
generated by dividing the data in the parent node into each of the categories of field X that is useful,
i.e., that appears helpful for classification.

GAIN RATIO(X) = GAIN(X) / SPLIT INFOX(DATA)

Where SPLIT INFOX(DATA) represents the potential information generated by partitioning the data
into n outcomes, whereas the information gain measures the information relevant to classification.

The C5.0 algorithm will choose to partition the data based on the outcomes of the field that
maximizes the information gain ratio. This maximization is subject to the constraint that the
information gain must be large, or at least as great as the average gain over all tests examined. This
constraint avoids the instability of the gain criterion, when the split is near trivial and the split
information is thus small.
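
The following toy Python sketch restates these formulas (it is not the C5.0 source code): info() is the
average information in bits, and the gain ratio divides the information gain by the split information of
the candidate predictor.

import math
from collections import Counter

def info(labels):
    """Average information (in bits) needed to identify a record's class."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(split_values, labels):
    """GAIN(X) and GAIN RATIO(X) when the labels are partitioned by the values of X."""
    total = len(labels)
    subsets = {}
    for value, label in zip(split_values, labels):
        subsets.setdefault(value, []).append(label)
    info_x = sum(len(sub) / total * info(sub) for sub in subsets.values())
    gain = info(labels) - info_x
    split_info = info(split_values)   # potential information of the partition itself
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Hypothetical example: a REGION predictor and a two-category target
regions = ["S", "S", "S", "N", "N", "W", "W", "W"]
target = ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"]
print(gain_ratio(regions, target))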

Two other parameters the expert options allow you to control are the severity of pruning and the
minimum number of records per child branch. In the following sections we will introduce each of
these in turn and give advice on their settings.

Pruning and Attribute Winnowing Within C5.0


Within C5.0, once the tree has been built, it can be pruned back to create a more general (and less
bushy) tree. Within the expert mode, the Pruning severity option allows you to control the extent of
the pruning. The higher this number, the more severe the pruning and the more general the resulting
tree.

The algorithm used to decide whether a branch should be pruned back toward the parent node is
based on comparing the predicted errors for the “sub-tree” (i.e. unpruned branches) with those for the
“leaf” (or pruned node). Error estimates for leaves and sub-trees are calculated based on a set of
unseen cases the same size as the training set. The formula used to calculate the predicted error rates
for a leaf involves the number of cases within the leaf, the number of these cases that have been
incorrectly classified within this leaf and confidence limits based on the binomial distribution. The
reader is referred to Lesson 4 in C4.5: Programs for Machine Learning for a more detailed
description of error estimations and pruning in general.

A second phase of pruning (global pruning) is then applied by default. It prunes further based on the
performance of the tree as a whole, rather than at the sub-tree level considered in the first stage of
pruning. This option (Use global pruning) can be turned off, which generally results in a larger tree.

After initially analyzing the data, the Winnow attributes option will discard some of the inputs to the
model before building the decision tree. This can produce a model that uses fewer input fields yet
maintains near the same accuracy, which can be an advantage in model deployment. This option can
be especially effective when there are many inputs and where inputs are statistically related.

Minimum Records per Child Branch


One other consideration when building a general decision tree is that the terminal nodes within the
tree are not too small in size. Within the C5.0 dialog, you control the Minimum records per child
branch, which specifies that at any split point in the tree, at least two sub-trees must cover at least this
number of cases. The default is two cases but increasing this number can be useful for noisy datasets
and tends to produce less bushy trees.

How to Use Pruning and Minimum Records per Branch


As previously mentioned, within the C5.0 dialog the Simple mode allows you to specify both the
Expected noise (%) and whether the resulting tree favors Accuracy or Generality.

• If the algorithm is set to favor Accuracy, the Pruning Severity is set to 75 and the Minimum
records per branch is 2; hence, although the tree is accurate there is a degree of generality by
not allowing the nodes to contain only one record.
• If the algorithm is set to favor Generality the Pruning Severity is set to 85 and the Minimum
records per branch is 5.
• If the Expected noise (%) is used the Minimum records per branch is set to half of this value.

Once a tree has been built using the simple options, the expert options may be used to refine the tree
in these two common ways.

• If the resulting tree is large and has too many branches increase the Pruning Severity
• If there is an estimate for the expected proportion of noise (relatively rare in practice), set the
Minimum records per branch to half of this value.

Boosting
C5.0 has a special method for improving its accuracy rate, called boosting. It works by building
multiple models in a sequence. The first model is built in the usual way. Then, a second model is built
in such a way that it focuses especially on the records that were misclassified by the first model. Then
a third model is built to focus on the second model's errors, and so on. Finally, cases are classified by
applying the whole set of models to them, using a weighted voting procedure to combine the separate
predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0
model, but it also requires longer training. The Number of trials option allows you to control how
many models are used for the boosted model.
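
Scoring with a boosted model amounts to a weighted vote across the component trees. The fragment
below is a generic weighted-majority-vote sketch, not C5.0's exact weighting scheme; the component
models are hypothetical callables that return a class label.

from collections import defaultdict

def boosted_prediction(models, weights, record):
    """Combine the component models' predictions with a weighted vote."""
    votes = defaultdict(float)
    for model, weight in zip(models, weights):
        votes[model(record)] += weight
    return max(votes, key=votes.get)

# Three toy component models and their weights
models = [lambda r: "Current", lambda r: "Vol", lambda r: "Current"]
print(boosted_prediction(models, [0.5, 0.8, 0.6], record={}))   # "Current"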

While boosting might appear to offer something for nothing, there is a price. When model building is
complete, more than one tree is used to make predictions. Therefore, there is no simple description of
the resulting model, nor of how a single predictor affects the target field. This can be a serious
deficiency, so boosting is normally used when the chief goal of an analysis is predictive accuracy, not
understanding.

Misclassification Costs
The Costs tab allows you to set misclassification costs. When using a tree to predict a categorical
output, you may wish to assign costs to misclassifications (where the tree predicts incorrectly) to bias
the model away from “expensive” mistakes. The Misclassifying controls allow you to specify the cost
attached to each possible misclassification. The default costs are set at 1.0 to represent that each
misclassification is equally costly. When unequal misclassification costs are specified, the resulting
trees tend to make fewer expensive misclassifications, usually at the cost of an increased number of
the relatively inexpensive misclassifications.
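
The effect of a cost matrix is easiest to see in a simple scoring sketch: given class probabilities for a
record and a table of misclassification costs, the prediction with the lowest expected cost is chosen.
This illustrates the idea only; the decision tree algorithms also use costs while growing the tree, not
just at scoring time, and the probabilities and costs below are made up.

def min_cost_prediction(class_probs, costs):
    """costs[actual][predicted] is the cost of predicting `predicted`
    for a record whose true class is `actual` (0 on the diagonal)."""
    classes = list(class_probs)
    def expected_cost(pred):
        return sum(class_probs[actual] * costs[actual][pred]
                   for actual in classes if actual != pred)
    return min(classes, key=expected_cost)

# Hypothetical costs: missing a voluntary leaver is five times as costly as other errors
probs = {"Current": 0.55, "Vol": 0.40, "InVol": 0.05}
costs = {"Current": {"Current": 0, "Vol": 1, "InVol": 1},
         "Vol": {"Current": 5, "Vol": 0, "InVol": 1},
         "InVol": {"Current": 1, "Vol": 1, "InVol": 0}}
print(min_cost_prediction(probs, costs))   # "Vol", even though "Current" is more probable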

Propensity Scores
Importance measures and propensity scores are available from the Analyze tab.

Click Analyze tab

By default, importance scores will be calculated for model evaluation, as we have seen.

There are two check boxes to request raw and adjusted propensity scores. Propensity scores are used
for flag fields only, and they indicate the likelihood of the True value defined for the field.
Raw propensity scores are derived from the model based on the training data only. If the model
predicts the true value (will respond), then the propensity is the same as P, where P is the probability
of the prediction (often the confidence). If the model predicts the false value, then the propensity is
calculated as (1 – P).

Propensity scores differ from confidence scores, which apply to the model prediction, whether true or
false. In cases where the prediction is false, for example, a high confidence actually means a high
likelihood not to respond. Propensity scores overcome this limitation to allow easier comparison
across all records. For example, a no prediction with a confidence of 0.65 translates to a raw
propensity of 0.35 (or 1 – 0.65).
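
The conversion from confidence to raw propensity is just the rule stated above, as this small Python
sketch shows (a flag target is assumed):

def raw_propensity(prediction, confidence, true_value=True):
    """Propensity equals the confidence when the model predicts the true value,
    and (1 - confidence) otherwise."""
    return confidence if prediction == true_value else 1.0 - confidence

print(raw_propensity(False, 0.65))   # a "no" prediction at 0.65 -> propensity 0.35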

Raw propensities are based purely on estimates given by the model, which may be overfitted, leading
to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at
how the model performs on the test or validation partitions and adjusting the propensities to give a
better estimate accordingly. A partition is required to calculate adjusted propensities.


Figure 3.27 Analyze Tab Options

Close the C5.0 modeling node

3.9 Modeling Categorical Outputs with Other Decision Tree Algorithms

As we saw in Table 3.1, C5.0 can only be used to model categorical targets. QUEST has the same
limitation. The other two algorithms, CHAID and C&R Tree, can be used to model both categorical
and continuous targets. Before we discuss how to create models with continuous targets, let’s take a
look at the various options for modeling categorical targets in CHAID, C&R Tree, and QUEST. You can
certainly try one of these techniques on the churn.txt data to compare to C5.0, but for the most part,
the output format is very similar.

3.10 Modeling Categorical Outputs with CHAID


First, we’ll look at the CHAID node and the options available there.

Double-click the CHAID node named CHURNED


Click the Fields tab (if necessary)


Figure 3.28 CHAID Node Dialog Fields Tab

There are four tabs to control various aspects of the modeling process. The Fields tab, as with many
modeling nodes, allows you to select the target field and the predictors, or inputs. Here, the fields are
already set because of roles assigned in the Type node.

Click the Build Options tab


Figure 3.29 Objective Settings in Build Options Tab

The Build Options tab enables you to control six different areas: overall objectives; basics, such as the
specific CHAID algorithm and tree depth; stopping rules; costs of making an error; how to combine
ensembles of models; and advanced statistical specifications, such as the significance level used for
splitting a node.

The default objective is to build a single tree. You can set one of two modes: Generate model builds
the model, while Launch Interactive session launches the Interactive Tree feature, which we will discuss
in a later section.

Multiple trees can be built by using bagging or boosting (see Lesson 4 for a discussion of these
options). If you have a server connection, CHAID can create models for very large datasets by
dividing the data into smaller data blocks and building a model on each block. The most accurate
models are then automatically selected and combined into a single final model.

Click the Basics settings


Figure 3.30 Basics Settings in Build Options Tab

The options on the Basics panel enable you to choose the algorithm. For a single CHAID tree model,
there are two methods, standard or exhaustive CHAID. The latter is a modification of CHAID
designed to address some of its weaknesses. Exhaustive CHAID examines more possible splits for a
predictor, thus improving the chances of finding the best predictor (at the cost of additional
processing time).

The Maximum tree depth specifies the maximum number of levels below the root node. The default
depth is 5. Since CHAID doesn’t prune a bushy tree, the user can specify the depth with the Custom
setting. This setting should depend on the size of the data file, the number of predictors, and the
complexity of the desired tree.

Click the Stopping Rules settings


Figure 3.31 Stopping Rules Settings in Build Options Tab

The options on the Stopping Rules panel enable you to specify the rules to be applied to cease
splitting nodes in the tree.

You set the minimum branch sizes to prevent splits that would create very small subgroups. These
can be specified either as an absolute number of records or as a percentage of the total number of
records. By default, a parent branch to be split must contain at least 2% of the records; a child branch
must contain at least 1%. It is often more convenient to work with the absolute number of records
rather than a percent, but in either case, you will very likely modify these values to get a smaller, or
larger, tree.

Click the Ensembles settings


Figure 3.32 Ensembles Settings in Build Options Tab

These settings determine the behavior of ensembling that occurs when boosting, bagging, or very
large datasets are requested in Objectives. Options that do not apply to the selected objective are
ignored.

For bagging and very large datasets, rules must be used to combine the predictions from two or more
models. The rules differ depending upon whether the target is categorical or continuous, with voting
used for the former and the mean of the predictions for the latter. Other options are available.
Boosting always uses a weighted majority vote to score categorical targets and a weighted median to
score continuous targets.

When using boosting or bagging, by default 10 models or bootstrap samples, respectively, are
created. This number is usually sufficient but can be changed.

Click the Advanced settings


Figure 3.33 Advanced Settings in Build Options Tab

To select the predictor for a split, CHAID uses a chi-square test in the table defined at each node by a
predictor and the target field. CHAID chooses the predictor that is the most significant (smallest p
value). If that predictor has more than 2 categories, CHAID compares them and collapses together
those categories that show no differences in the target. This category merging process stops when all
remaining categories differ at the specified testing level (Significance level for splitting:). It is
possible for CHAID to split merged categories, controlled by the Allow resplitting of merged
categories check box. (Note that a categorical predictor with more than 127 discrete categories will
be ignored by CHAID.) There is a comparable significance level for merging.

For continuous predictors, the values are binned into a maximum of 10 groups, and then the same
tabular procedure is followed as for flag and categorical types.

Because many chi square tests are performed, CHAID automatically adjusts its significance values
when testing the predictors. These are called Bonferroni adjustments and are based on the number of
tests. You should normally leave this option turned on; in small samples or with only a few
predictors, you could turn it off to increase the power of your analysis.
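
A much-simplified sketch of this selection step is shown below; it assumes the pandas and scipy
libraries are available, crosstabs each predictor against the target, applies a crude Bonferroni
correction, and keeps the predictor with the smallest adjusted p value. The category-merging loop of
the real CHAID algorithm is omitted.

import pandas as pd
from scipy.stats import chi2_contingency

def best_chaid_predictor(df, target, predictors, alpha=0.05):
    """Return the predictor with the smallest Bonferroni-adjusted chi-square p value."""
    adjusted = {}
    for field in predictors:
        table = pd.crosstab(df[field], df[target])
        _, p_value, _, _ = chi2_contingency(table)
        # Crude adjustment; real CHAID adjusts for the category merges examined
        adjusted[field] = min(1.0, p_value * len(predictors))
    best = min(adjusted, key=adjusted.get)
    return (best, adjusted[best]) if adjusted[best] <= alpha else (None, None)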

The Overfit prevention set (%) setting controls the percent of records that are internally separated into
an overfit prevention set. This is an independent set of data used to track errors during training in
order to prevent the tree from modeling chance variation in the data. The default is 30%. This setting
is unrelated to the separation of data before modeling into training and testing partitions. The
modeling is done only with the training data, and thus the separation into an overfitting set is done
only within the training data.

Note
The overfit prevention data split is not used by CHAID, but instead by C&R Tree and QUEST. For
those decision tree methods, the overfit set is used during tree pruning.

Unlike other models, CHAID uses missing, or blank, values when growing a tree. All blank values
are placed in a missing category that is treated like any other category for nominal predictors. For
ordinal and continuous predictors, the process of handling blanks is a bit different, but the effect is the
same (see the PASW Modeler14.0 Algorithms Guide for detailed information). If you don’t want to
include blank data in a model, it should be removed beforehand.

3.11 Modeling Categorical Outputs with C&R Tree


We move next to the C&R Tree node to predict a categorical output field.

Click Cancel to close the CHAID dialog


Double-click on the C&R Tree node named CHURNED
Click the Build Options tab

The same settings options are available for C&R Tree as for CHAID in the Objective settings.
It is also possible, as with CHAID, to grow a tree interactively.


Figure 3.34 Classification and Regression Trees (C&R Tree) Build Options Tab

Click the Basics settings


Figure 3.35 Basics Settings for Build Options Tab

The default tree depth, as with CHAID, is five levels below the root node.

Pruning within C&RT


The Prune tree to avoid overfitting check box will invoke pruning. The Maximum difference in risk
(in standard Errors) allows C&R Tree to select the simplest tree whose risk estimate (which is the
proportion of errors the tree model makes when equal misclassification costs and empirical priors are
used) is close to that of the subtree with the smallest risk. The value in the text box indicates how
many standard errors difference are allowed in the risk estimate between the final tree and the tree
with the smallest risk. As this is increased, the pruning will be more severe.

Surrogates
Surrogates are used to deal with missing values on the predictors. For each split in the tree, C&R Tree
identifies the input fields (the surrogates) that are most similar statistically to the selected split field.
When a record to be classified has a missing value for a split field, its value on a surrogate field can
be used to make the split.
The Maximum surrogates option controls how many surrogate predictor fields will be stored at each
node. Retaining more surrogates slows processing, and the default (5) is usually adequate.


The Stopping Rules settings are identical to those for CHAID, so we won’t review them here. We do
note that, unlike CHAID, while the default values may seem small, it is important to keep in mind
that pruning is an important component of C&R Tree, and it can trim back some of the small
branches.

Click on Costs & Priors settings

Figure 3.36 Costs & Priors Settings for Build Options Tab

The misclassification costs are identical to those for the other decision tree models we have discussed.

Priors in C&RT
Historically, priors have been used to incorporate knowledge about the base population rates (here of
the output field categories) into the analysis. Breiman et al. (1984) point out that if one target category
has twice the prior probability of occurring as another, this effectively doubles the cost of
misclassifying a case from the first category, since it is counted twice. Thus by specifying a larger
prior probability for a response category, you can effectively increase the cost of its misclassification.
Since priors are only given at the level of the base rate for the output field categories (with J
categories there are J prior probabilities), use of them implies that the misclassification of a record
actually in output category j has the same cost regardless of the category into which it is
misclassified (that is, C(k|j) = C(j) for all k not equal to j).

By default, the prior probabilities are set to match the probabilities found in the training data. The
Equal for all classes option allows you to set all priors equal (might be used if you know your sample
does not represent the population and you don’t know the population distribution on the target), and
you can enter prior probabilities (Custom option). The prior probabilities should sum to 1 and if you
enter custom priors that reflect the desired proportions, but do not sum to 1, the Normalize button will
adjust them. Finally, priors can be adjusted based on misclassification costs (see Breiman’s comment
above) entered in the Costs tab.

The Ensembles settings are identical to that for CHAID, so we don’t review them.

Click the Advanced settings

Figure 3.37 Advanced Settings in Build Options Tab


Impurity Criterion
The criterion that guides tree growth in C&R Tree with a categorical output field is called impurity. It
captures the degree to which responses within a node are concentrated into a single output category.
A pure node is one in which all cases fall into a single output category, while a node with the
maximum impurity value would have the same number of cases in each output category. Impurity can
be defined in a number of ways and two alternatives are available within the C&R Tree procedure.
The default, and more popular measure, is the Gini measure of dispersion. If P(t)i is the proportion of
cases in node t that are in output category i, then the Gini measure is:

Gini = 1 − ∑ P(t)i²

Alternatively:

Gini = ∑i≠j P(t)i P(t)j

If two nodes have different distributions across three response categories (for example (1,0,0) and
(1/3, 1/3, 1/3)), the one with the greater concentration of responses in a single category (the first one)
will have the lower impurity value (for (1,0,0) the impurity is 1 – (1² + 0² + 0²), or 0; for (1/3, 1/3,
1/3) the impurity is 1 – ((1/3)² + (1/3)² + (1/3)²), or .667). The Gini measure ranges between 0 and 1,
although the maximum value is a function of the number of output categories.

Thus far we have defined impurity for a single node. It can be defined for a tree as the weighted
average of the impurity values from the terminal nodes. When a node is split into two child nodes, the
impurity for that branch is simply the weighted average of their impurities. Thus if two child nodes
resulting from a split have the same number of cases and their individual impurities are .4 and .6, their
combined impurity is .5*.4 + .5*.6. When growing the tree, C&R Tree splits a node on the predictor
that produces the greatest reduction in impurity (comparing the impurity of the parent node to the
impurity of the child nodes). This change in impurity from a parent node to its child nodes is called
the improvement and under Expert options you can specify the minimum change in impurity for tree
growth to continue. The default value is .0001 and if you are considering modifying this value, you
might calculate the impurity at the root node (the overall output proportions) to establish a point of
reference.
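
These calculations are easy to reproduce. The Python sketch below computes the Gini impurity of a
node from its class counts and the improvement produced by a binary split (the counts are illustrative):

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def improvement(parent, left, right):
    """Reduction in impurity when a parent node is split into two child nodes."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

print(gini([10, 0, 0]))                      # 0.0, a pure node
print(gini([4, 4, 4]))                       # about 0.667, maximally impure for 3 classes
print(improvement([8, 8], [7, 1], [1, 7]))   # about 0.281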

The problems with using impurity as a criterion for tree growth are that you can almost always reduce
impurity by enlarging the tree and any tree will have 0 impurity if it is grown large enough (if every
node has a single case, impurity is 0). To address these difficulties, the developers of the classification
and regression tree methodology (see Breiman, Friedman, Olshen, and Stone, Classification and
Regression Trees, Wadsworth, 1984) developed a pruning method based on a cross-validated cost
complexity measure (as discussed above).

By default, the Gini measure of dispersion is used. Breiman and colleagues proposed Twoing as an
alternative impurity measure. If the target has more than two output categories, twoing will create
binary splits of the response categories in order to calculate impurity. Each possible combination of
output categories split into two groups will be separately evaluated for impurity with each predictor,
and the best split across predictors and target category combinations is chosen. Ordered Twoing
(inactive because the target field is nominal, not ordinal) applies Twoing as described above, except
that the output category combinations are limited to those consistent with the rank order of the
categories. For example, if there are five output categories numbered 1,2,3,4 and 5, Ordered Twoing
would examine the (1,2) (3,4,5) split, but the (1,4) (2,3,5) split would not be considered, since only
contiguous categories can be grouped together.

Of the methods, the Gini measure is most commonly used.

The option to prevent model overfitting is available here, as it was with CHAID.

3.12 Modeling Categorical Outputs with QUEST


We turn next to the QUEST node for predicting a categorical field.

Click Cancel to close the C&R Tree dialog


Double-click on the QUEST node named CHURNED
Click the Build Options tab

QUEST (Quick Unbiased Efficient Statistical Tree) is a binary classification method that was
developed, in part, to reduce the processing time required for large C&R Tree analyses with many
fields and/or records. It also tries to reduce the tendency in decision tree methods to favor predictors
that allow more splits (see Loh and Shih, 1997).


Figure 3.38 QUEST Build Options Tab

The settings areas are the same as for C&R Tree. In the Objectives settings, the same selections are
available as for the other methods.

Click the Advanced settings


Figure 3.39 QUEST Advanced Settings in Build Options Tab

QUEST separates the tasks of predictor selection and splitting at a node. Like CHAID, it uses
statistical tests to pick a predictor at a node. For each continuous or ordinal predictor, QUEST
performs an analysis of variance, and then uses the significance of the F test as a criterion. For
nominal predictors (measurement level flag and nominal), chi-square tests are performed. The
predictor with the smallest significance value from either the F or chi-square test is selected.
Although not evident from the dialog box options, Bonferroni adjustments are made, as with CHAID
(not under user control).

QUEST is more efficient than C&R Tree because not all splits are examined, and category
combinations are not tested when evaluating a predictor for selection.

After selecting a predictor, QUEST determines how the field should be split (into two groups) by
doing a quadratic discriminant analysis, using the selected predictor on groups formed by the target
categories. The details are rather complex and can be found in Loh and Shih (1997). The
measurement level of the predictor will determine how it is treated in this method. While quadratic
discriminant analysis allows for unequal variances in the groups and makes one fewer assumption
than does linear discriminant analysis, it does assume that the distribution of the data is multivariate
normal, which is unlikely for predictors that are flags and sets.

QUEST uses an alpha (significance) value of .05 for splitting in the discriminant analysis, and you
can modify this setting. For large files you may wish to reduce alpha to .01, for example.

Pruning, Stopping, Surrogates


QUEST follows the same pruning rule as does C&R Tree, using a cost-complexity measure that takes
into account the increase in error if a branch is pruned, using a standard error rule. The Stopping
choices are the same as for CHAID and C&R Tree. QUEST also uses surrogates to allow predictions
for missing values, employing the same methodology as C&R Tree.

3.13 Predicting Continuous Fields


Two of the decision tree models, CHAID and C&R Tree, can predict a continuous field. We will
briefly review the options available for this type of target and then run an example.

Continuous Outputs with C&R Tree


When a continuous field is used as the target field in C&R Tree, the algorithm runs in the way
described earlier in this lesson. For a continuous output field (the regression trees portion of the
algorithm), the impurity criterion is still used but is based on a measure appropriate for a continuous
field: within-node variance. It captures the degree to which records within a node are concentrated
around a single value. A pure node is one in which all cases have the same output value, while a node
with a large impurity value (in principle, the theoretical maximum would be infinity) would contain
cases with very diverse values on the output field. For a single node, the variance (or standard
deviation squared) of the output field is calculated from the records within the node. When generating
a prediction, the algorithm uses the average value of the target field within the terminal node.
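
In code, the regression-tree impurity and prediction are just the variance and the mean of the target
within a node, as in this brief sketch (the claim values are made up):

def node_impurity(values):
    """Within-node variance of the target field."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def node_prediction(values):
    """A terminal node predicts the mean target value of its training records."""
    return sum(values) / len(values)

claims = [4200.0, 4150.0, 4300.0, 5600.0]
print(node_impurity(claims), node_prediction(claims))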

Click File…Open Stream


Double-click on CRTree.str

Figure 3.40 PASW Modeler Stream with C&R Tree Model Node (Continuous Output Field)

The data file consists of patient admissions to a hospital. The goal is to build a model predicting
insurance claim amount based on hospital length of stay, severity of illness group and patient age.

Right-click on the Table node and select Run


Review the data and then close the Table window


Double click on the C&R Tree node (labeled CLAIM)
Click on Build Options tab

The Build Options tab for the C&R Tree dialog with a continuous output is the same as for a
categorical output. However, if we explore the various settings, we find that some—priors and
misclassification costs—are inactive because they are not relevant for a continuous output field.

Otherwise, setting up the model-building parameters, and executing the tree, is identical to the
process for a categorical target. The generated model will display the predicted mean for the
insurance claim amount in each terminal node.

Figure 3.41 C&R Tree Build Options Tab for a Continuous Output Field

We will illustrate the use of interactive tree-building as well in this example.

Click the Launch interactive session button


Click Run


Figure 3.42 Interactive Tree Builder to Predict Claim with C&R Tree

Because the target field is continuous, different statistics appear in the nodes. The nodes display the
mean, number of cases, and percentage of the sample. Thus, the mean insurance claim for persons in
this data file is slightly over $4680, which we would predict for each person if we didn’t know how
long they stayed in the hospital, their age, or how severely ill they were. Once the tree is grown, we
should get some insight into the characteristics of patients that separate high insurance claims from
low ones.

As before, we are using about 70% of the data to fit the model and reserving the remainder to test for
overfitting. For C&R Tree, this reserved data is used to prune the tree.

We could grow the tree one level at a time. However, if we do so, the tree cannot then be easily pruned,
and pruning is an important feature of creating a tree that will perform well on new data.

Right-Click on the Node 0 and select Grow Tree and Prune


Figure 3.43 Interactive Tree Builder Fully Grown and Pruned Tree

The results indicate that LOS (Length of Stay) is the best predictor. The average insurance claim for
persons who stay more than 2 days in the hospital is $5637.07, while the mean claim for persons who
stay 1 or 2 days is $4369.51. This certainly makes sense.

The predictions are further refined by the split on ASG (severity of illness) for those with a length of
stay of 1 or 2 days. This field is coded 0, 1, and 2. Those with the lowest severity (0) in Node 3 have
a mean insurance claim of $4194.15; those who have more severe illnesses have a mean claim of
$4659.76.

You can make different splits in the tree based on business information or just to try alternative trees.
To see this option:

Right-Click on the Node 0 and select Grow Branch with Custom Split


Figure 3.44 Choosing a Custom Split with C&R Tree

The current best split is listed, on LOS, along with the values that will be used to split the tree. You
can change the split values by selecting the Custom option button.

The Predictors dialog shows the top predictors and the Improvement value for the optimal split on
each.

Click Predictors button


Figure 3.45 Predictors and Improvement Values for that Split

You can choose another predictor here to split the node. In this instance, the Improvement value for
LOS is clearly the largest, so it would be better to retain it as the first split.

Click Cancel, and then Cancel again

For a continuous output, we can investigate model performance in several ways. No model is
automatically created when using interactive mode, so we need to generate a model first.

Click Generate…Generate Model


Click OK in the Generate New Model dialog

Figure 3.46 Generate New Model Dialog

Close the Interactive Tree window


Move the generated model CLAIM1 near the Type node
Connect the Type node to the generated model CLAIM1
Add an Analysis node to the stream
Connect the Analysis node to the generated model CLAIM1
Run the Analysis node


Figure 3.47 Analysis Output for C&R Tree Model to Predict CLAIM

The Mean Absolute Error is perhaps the most useful statistic, and it has a value of 824.11. This is the
average amount of model prediction error. The mean value of CLAIM is about 4,631, so the
average error is somewhat under 20% of the mean. The analyst has to decide whether this is
sufficiently accurate, given the goals of a data-mining project.

The correlation between the predicted value and actual value of CLAIM is .507, which is good, but
not outstanding. On the other hand, there are only three terminal nodes in the tree, and so only three
predicted values of CLAIM. Given that, this correlation isn’t bad at all.

You can explore the model further by using other nodes to see the relationship between the predictors
and the value of $R-CLAIM. However, the decision tree is very clear as to how the predictors are
being used to make predictions. You may, though, want to investigate how AGE relates to $R-CLAIM,
since AGE isn't used in the tree (you would find there is essentially no relationship, which is why it
wasn't used).

Close the Analysis window

Continuous Outputs with CHAID


When CHAID is used with a continuous target, the overall approach is identical to what we have
discussed above, but the specific tests used to select predictors and merge categories differ.


An analysis of variance test is used for predictor selection and merging of categories, with the target
as the dependent variable. Nominal and ordinal predictors are used in their untransformed form.
Continuous predictors are binned as described above for CHAID into at most 10 categories; then an
analysis of variance test is used on the transformed field.

The field with the lowest p value for the ANOVA F test is selected as the best predictor at a node, and
the splitting and merging of categories proceeds based on additional F tests.
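
As a rough illustration of this selection logic (an assumed sketch, not Modeler's internal code), the fragment below bins a continuous predictor into roughly ten groups and computes the ANOVA F test of the binned predictor against a continuous target; the predictor with the smallest p value would be the strongest candidate for the split. The field names and data are invented.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
age = rng.integers(20, 80, size=200)                       # continuous predictor
claim = 3000 + 30 * age + rng.normal(0, 500, size=200)     # continuous target

# Bin the predictor into (at most) 10 groups using decile cut points.
cuts = np.percentile(age, np.arange(10, 100, 10))
bins = np.digitize(age, cuts)

groups = [claim[bins == b] for b in np.unique(bins)]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")              # smaller p => stronger candidate
```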

If you are interested, you can try CHAID with this same data and target.


Summary Exercises
The exercises in this lesson use the data file churn.txt that was also used for examples in this lesson.
The following table provides details about the file.

churn.txt contains information from a telecommunications company. The data are comprised of
customers who at some point have purchased a mobile phone. The primary interest of the company is
to understand which customers will remain with the organization or leave for another company.

The file contains the following fields:

ID Customer reference number


LONGDIST Time spent on long distance calls per month
International Time spent on international calls per month
LOCAL Time spent on local calls per month
DROPPED Number of dropped calls
PAY_MTHD Payment method of the monthly telephone bill
LocalBillType Tariff for locally based calls
LongDistanceBillType Tariff for long distance calls
AGE Age
SEX Gender
STATUS Marital status
CHILDREN Number of Children
Est_Income Estimated income
Car_Owner Car owner
CHURNED (3 categories)
Current – Still with company
Vol – Leavers who the company wants to keep
Invol – Leavers who the company doesn’t want

In these exercises we will explore the various training methods and options for the rule induction
techniques within PASW Modeler. In all the exercises that follow, you can use a Partition node to split
the data into Training and Testing partitions to make these modeling exercises more realistic. You
may or may not want to split the data 50/50 into two partitions depending on the number of records in
the file.

1. Begin a new stream with a Var.file node connected to the file Churn.txt.

2. Use C5.0 and at least one other decision tree method to predict CHURNED and compare the
accuracy of both. What do you learn from this? Which rule method performs “best”?

3. Now browse the rules that have been generated by the methods. Which model appears to be
the most manageable and/or practical? Do you think there is a trade-off between accuracy and
practicality?

4. Try switching from Accuracy to Generality in C5.0. Does this have much effect on the size
and accuracy of the tree?


5. Experiment with the model options within the methods you selected to see how they affect
tree growth. Can you increase the accuracy without making the model overly complicated?
Experiment with the minimum values for parent and child nodes and see how this influences
the size of the tree.

6. In C5.0, use the Winnow attributes expert option and see if it reduces the number of inputs
used in the model (hint: for an easier comparison, generate a Filter node from the generated
C5.0 node with Winnow attributes checked and with it unchecked).

7. Of all the models you have run, which do you think is the “best”? Why?

You may wish to save the stream (use the name Exer3.str) that you have just created.


Lesson 4: Neural Networks


Overview
• Describe the structure and types of neural networks
• Build a neural network
• Browse and interpret the results
• Evaluate the model
• Illustrate the use of bagging and boosting

Data
In this lesson we will use the dataset churn.txt. We continue to use a Partition Node to divide the
cases into two partitions (subsamples), one to build or train the model and the other to test the model
(often called a holdout sample).

4.1 Introduction to Neural Networks


Historically, neural networks attempted to solve problems using methods modeled on how the brain
was thought to operate. Today they are simply viewed as powerful modeling techniques.

A typical neural network consists of several neurons arranged in layers to create a network. Each
neuron can be thought of as a processing element that is given a simple part of a task. The
connections between the neurons provide the network with the ability to learn patterns and
interrelationships in data. The figure below gives a simple representation of a common neural
network (a Multi-Layer Perceptron).

Figure 4.1 Simple Representation of a Common Neural Network

When using neural networks to perform predictive modeling, the input layer contains all of the fields
used to predict the target. The output layer contains an output field: the target of the prediction. The
input and output fields can be continuous or categorical (in PASW Modeler, categorical fields are
transformed into a numeric form (dummy or binary set encoding) before processing by the network).
The hidden layer contains a number of neurons at which outputs from the previous layer combine. A
network can have any number of hidden layers, although these are usually kept to a minimum. All
neurons in one layer of the network are connected to all neurons within the next layer.

While the neural network is learning the relationships between the data and results, it is said to be
training. Once fully trained, the network can be given new, unseen data and can make a decision or
prediction based upon its experience.

When trying to understand how a neural network learns, think of how a parent teaches a child how to
read. Patterns of letters are presented to the child and the child makes an attempt at the word. If the
child is correct she is rewarded and the next time she sees the same combination of letters she is likely
to remember the correct response. However, if she is incorrect, then she is told the correct response
and tries to adjust her response based on this feedback. Neural networks work in the same way.

4.2 Training Methods


One of the advantages of PASW Modeler is the ease with which you are able to build a neural
network without in fact knowing too much about how the algorithms work. Nevertheless, it helps to
understand a bit about these methods, so we will begin by briefly describing the two different types of
training methods: a Multi-Layer Perceptron (MLP) and a Radial Basis Function Network (RBFN)
model.

As was noted above, a neural network consists of a number of processing elements, often referred to
as “neurons,” that are arranged in layers. Each neuron is linked to every neuron in the previous layer
by connections that have strengths or weights attached to them. The learning algorithm controls the
adaptation of these weights to the data; this gives the system the capability to learn by example and
generalize for new situations.

The main consideration when building a network is to locate the best, or global solution, within a
domain; however, the domain may contain a number of sub-optimal solutions. The global solution
can be thought of as the model that produces the least possible error when records are passed through
it.

To understand the concept of global error, imagine a graph created by plotting the hidden weights
within the neural network against the error produced. Figure 4.2 gives a simple representation of such
a graph. With any complex problem there may be a large number of feasible solutions, thus the graph
contains a number of sub-optimal solutions or local minima (the “valleys” in the plot). The trick to
training a successful network is to locate the overall minimum or global solution (the lowest point),
and not to get “stuck” in one of the local minima or sub-optimal solutions.


Figure 4.2 Representation of the Error Domain Showing Local and Global Minima

There are many different types of supervised neural networks (that is, neural networks that require
both inputs and an output field). However, within the world of data mining, two are most frequently
used. These are the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN).
In the following paragraphs we will describe the main differences between these types of networks
and describe their advantages and disadvantages.

4.3 The Multi-Layer Perceptron


The MLP network consists of layers of neurons, with each neuron linked to all neurons in the
previous layer by connections of varying weights. All MLP networks consist of an input layer, an
output layer and at least one hidden layer. The hidden layer is required to perform non-linear
mappings. The number of neurons within the system is directly related to the complexity of the
problem, and although a multi-layered topology is feasible, in practice there is rarely a need to have
more than one hidden layer.

Within a Multi-Layer Perceptron, each hidden layer neuron receives an input based on a weighted
combination of the outputs of the neurons in the previous layer. The neurons within the final hidden
layer are, in turn, combined to produce an output. This predicted value is then compared to the correct
output and the difference between the two values (the error) is fed back into the network, which in
turn is updated. This feeding of the error back through the network is referred to as back-propagation.
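
The toy NumPy fragment below illustrates the back-propagation cycle just described on a tiny made-up dataset: a forward pass through one hidden layer, comparison of the prediction with the correct output, and a weight update driven by the error fed back through the network. It is a deliberately simplified sketch, not the algorithm as implemented in the Neural Net node.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))                 # 4 records, 3 input fields
y = np.array([[0.], [1.], [1.], [0.]])      # flag target encoded 0/1

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # input -> hidden layer (5 neurons)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # hidden -> output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5                                        # learning rate

for _ in range(100):
    hidden = sigmoid(X @ W1 + b1)               # weighted combination plus activation
    output = sigmoid(hidden @ W2 + b2)          # predicted value
    err = output - y                            # error to feed back through the network
    grad_out = err * output * (1 - output)
    grad_hid = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * hidden.T @ grad_out;  b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * X.T @ grad_hid;       b1 -= lr * grad_hid.sum(axis=0)

print("training error:", float(np.mean((output - y) ** 2)))
```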

To illustrate this process we will take the simple example of a child learning the difference between
an apple and a pear. The child may decide in making a decision that the most useful factors are the
shape, the color and the size of the fruit—these are the inputs. When shown the first example of a
fruit she may look at the fruit and decide that it is round, red in color and of a particular size. Not
knowing what an apple or a pear actually looks like, the child may decide to place equal importance
on each of these factors—the importance is what a network refers to as weights. At this stage the
child is most likely to randomly choose either an apple or a pear for her prediction.

On being told the correct response, the child will increase or decrease the relative importance of each
of the factors to improve her decision (reduce the error). In a similar fashion a MLP network begins
with random weights placed on each of the inputs. On being told the correct response, the network
adjusts these internal weights. In time, the child and the network will hopefully make correct
predictions.

To visualize how a MLP works, imagine a problem where you wish to predict a target field consisting
of two groups, using only two input fields. Figure 4.3 shows a graph of the two input fields plotted
against one another, overlaid with the target. Using a non-linear combination of the inputs, the MLP
fits an open curve between the two classes.

Figure 4.3 Decision Surface Created Using the Multi-Layer Perceptron

The advantages of using a MLP are:


• It is effective on a wide range of problems
• It is capable of generalizing well
• If the data are not clustered in terms of their input fields, it will classify examples in the
extreme regions
• It is currently the most commonly used type of network and there is much literature
discussing its applications

The disadvantages of using a MLP are:


• It can take a great deal of time to train
• It does not guarantee finding the best global solution

4.4 The Radial Basis Function


The Radial Basis Function (RBF) is a more recent type of network and is responsive to local regions
within the space defined by the input fields.

Figure 4.4 shows a graphical representation of how a RBF fits a number of basis functions to the
problem described in the previous section. The RBF can be thought of as performing a type of clustering
within the input space, encircling individual clusters of data by a number of basis functions. If a data
point falls within the region of activation of a particular basis function, then the neuron corresponding
to that basis function responds most strongly. The concept of the RBF is extremely simple; however,
the selection of the centers of each basis function is where difficulties arise.

Figure 4.4 Operation of a Radial Basis Function

The advantages of using a RBF network are:


• It is quicker to train than a MLP
• It can model data that are clustered within the input space.

The disadvantages of using a RBF network are:


• It is difficult to determine the optimal position of the function centers
• The resulting network often has a poor ability to represent the global properties of the data.

Within PASW Modeler the RBF algorithm uses the K-means clustering algorithm to determine the
number and location of the centers in the input space.
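
The fragment below sketches this idea under simple assumptions (two Gaussian basis functions with a fixed width, centers placed by a small hand-rolled K-means); it is meant only to show how a record close to a center produces a strong response from the corresponding neuron, not to reproduce Modeler's RBF algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, .5, (30, 2)), rng.normal(3, .5, (30, 2))])  # two clusters of records

# Tiny K-means loop to place the basis-function centers in the input space.
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

def rbf_activations(x, centers, width=1.0):
    d2 = ((x - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * width ** 2))     # strongest response near a center

print(rbf_activations(np.array([0.1, 0.2]), centers).round(3))
print(rbf_activations(np.array([3.1, 2.9]), centers).round(3))
```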

4.5 Which Method?


Due to the random nature of neural networks, the models built using each of the algorithms will tend
to perform with varying degrees of accuracy depending on the initial weights and starting positions.
When building neural networks it is sensible to try both algorithms and either choose the one with the
best overall performance, or, use both models to gain a majority prediction.


4.6 The Neural Network Node


The Neural Net node is used to create a neural network and can be found in the Modeling palette.
Once trained, a Generated Neural Net node labeled with the name of the predicted field will appear in
the Generated Models palette and on the stream. This node represents the trained neural network. Its
properties can be browsed and new data can be passed through this node to generate predictions.

We can use the stream saved in Lesson 3.

Click File…Open Stream


Move to the c:\Train\ModelerPredModel directory
Double-click on C5.str
Delete the C5.0 modeling node and the C5.0 generated model node, leaving the other nodes
Run the Table node
Click File…Close to close the Table window
Double-click the Type node

Figure 4.5 Type Node Ready for Modeling

Notice that ID will be excluded from any modeling as the role is automatically set to None for a
Typeless field. The CHURNED field will be the target field for any predictive model and all fields but
ID and Partition will be used as predictors.


Click OK
Place a Neural Net node from the Modeling palette to the right of the Type node
Connect the Type node to the Neural Net node

Figure 4.6 Neural Net Node (CHURNED) Added to Data Stream

Notice that once the Neural Net node is added to the data stream, its name becomes CHURNED, the
field we wish to predict.

Double-click the Neural Net node


Click the Build Options tab (if necessary)


Figure 4.7 Neural Net Dialog: Build Options Tab

The Build Options tab enables you to control five different areas, including overall objectives, the
specific neural net algorithm, stopping rules, how to combine ensembles of models, and advanced
statistical specifications, including how to handle missing data and the size of an overfit prevention
set.

The default objective is to build a new single neural network. You can instead use boosting or
bagging (explained in sections below) to enhance model accuracy or stability. If you have a server
connection, Neural Net can create models for very large datasets by dividing the data into smaller
data blocks and building a model on each block. The most accurate models are then automatically
selected and combined into a single final model.

Click the Basics settings


Figure 4.8 Basics Settings in Build Options Tab

The options on the Basics panel enable you to choose one of two algorithms. The multilayer
perceptron allows for more complex relationships at the possible cost of increasing the training and
scoring time. The radial basis function may have lower training and scoring times, at the possible cost
of reduced predictive power compared to the MLP.

The hidden layer(s) of a neural network contains unobservable neurons (units). The value of each
hidden unit is some function of the predictors; the exact form of the function depends in part upon the
network type. A multilayer perceptron can have one or two hidden layers; a radial basis function
network can only have one hidden layer. By default the model will choose the best number of hidden
units in each hidden layer, although you can specify this yourself. Normally, it is best to allow the
algorithm to make this choice.

Click the Stopping Rules settings


Figure 4.9 Stopping Rules Settings in Build Options Tab

This area allows you to control the rules that determine when to stop training multilayer perceptron
networks; these settings are ignored when the radial basis function algorithm is used. Training
proceeds through at least one cycle (data pass), and can then be stopped based on three criteria.

By default, PASW Modeler stops when it appears to have reached its optimally trained state; that is,
when accuracy in the (internal) test dataset seems to no longer improve. Alternatively, you can set a
required accuracy value, a limit to the number of cycles through the data, or a time limit in minutes.
We use the default in these examples.

Click the Ensembles settings


Figure 4.10 Ensembles Settings in Build Options Tab

The settings in this section are used when boosting, bagging or very large datasets are modeled as
requested in the Objectives section. In this case, two or more models need to be combined to make a
prediction. Ensemble predicted values for categorical targets can be combined using voting, highest
probability, or highest mean probability. Voting selects the category that has the highest probability
most often across the base models. Highest probability selects the category that achieves the single
highest probability across all base models. Highest mean probability selects the category with the
highest value when the category probabilities are averaged across the individual models. Ensemble
predicted values for continuous targets can be combined using the mean or median of the predicted
values from the individual models.
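
The following small sketch applies the three combining rules to invented per-model probabilities for a single record, just to make the definitions concrete; the probabilities and the use of the churn categories are hypothetical.

```python
import numpy as np

categories = ["Current", "Vol", "InVol"]
# rows = base models, columns = predicted probability for each category
probs = np.array([[0.50, 0.30, 0.20],
                  [0.45, 0.40, 0.15],
                  [0.20, 0.70, 0.10],
                  [0.40, 0.35, 0.25]])

votes = probs.argmax(axis=1)                                        # each model's predicted class
voting = categories[np.bincount(votes, minlength=3).argmax()]       # most frequent winner
highest_probability = categories[np.unravel_index(probs.argmax(), probs.shape)[1]]
highest_mean_probability = categories[probs.mean(axis=0).argmax()]

print(voting, highest_probability, highest_mean_probability)        # the rules need not agree
```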

You can also specify the number of base models to build for boosting and bagging; for bagging, this is
the number of bootstrap samples.

Click the Advanced settings


Figure 4.11 Advanced Settings in Build Options Tab

Over-training is one of the problems that can occur within neural networks. As the data pass
repeatedly through the network, it is possible for the network to learn patterns that exist in the sample
only and thus over-train. That is, it will become too specific to the training sample data and lose its
ability to generalize. By selecting the Overfit prevention set option (checked by default), only a
randomly selected proportion of the training data is used to train the network (this is separate from a
holdout sample created in the Partition node). By default, 70% of the data is selected for training the
model, and 30% for testing it. Once the training proportion of data has made a complete pass through
the network, the rest is reserved as a test set to evaluate the performance of the current network. By
default, this information determines when to stop training and provides feedback information. We
advise you to leave this option turned on. Note that with a Partition node in use, and with Overfit
prevention set turned on, the Neural Net model will be trained on 70 percent of the training sample
selected by the Partition Node, and not on 70% of the entire dataset.

Since the neural network initiates itself with random weights, the behavior of the network can be
reproduced by setting the same random seed by checking the Replicate Results checkbox (selected by
default) and specifying a seed. It is advisable to run several trials on a neural network to ensure that
you obtain similar results using different random seed starting points. The Generate button will create
a pseudo-random integer between 1 and 2147483647, inclusive.

The Neural Net node requires valid values for all input fields. There are two options to handle
missing values for the predictors (records with missing values on the target are always ignored). By
default, listwise deletion will be used, which deletes any record with a missing (blank) value on one
or more of the input fields. As an alternative, Modeler can impute the missing data. For categorical
fields, the most frequently occurring category (the mode) is substituted; for continuous fields, the
average of the minimum and maximum observed values is substituted.
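
A small pandas sketch of these two substitution rules is shown below; the column names are borrowed from the churn data for familiarity, but the values are invented and this is not the node's internal code.

```python
import pandas as pd

df = pd.DataFrame({"PAY_MTHD": ["CC", "CC", None, "CH"],
                   "LONGDIST": [10.0, None, 30.0, 22.0]})

# Categorical field: substitute the most frequently occurring category (the mode).
df["PAY_MTHD"] = df["PAY_MTHD"].fillna(df["PAY_MTHD"].mode()[0])

# Continuous field: substitute the average of the minimum and maximum observed values.
midpoint = (df["LONGDIST"].min() + df["LONGDIST"].max()) / 2
df["LONGDIST"] = df["LONGDIST"].fillna(midpoint)

print(df)
```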

It is important to check the data in advance of running a Neural Net using the Data Audit node to
decide which records and fields should be passed to the neural network for modeling. Otherwise you
run the risk of a model being built using data values supplied by these imputation rules even if that is
not your preference. Also, you can take control of the missing value imputation by using, for
example, the Data Audit node to change missing values to valid values with several optional methods
before using the Neural Net node.

Click the Model Options tab


Figure 4.12 Model Options Tab

The Make Available for Scoring area contains options for controlling how the model is scored. The
predicted value (for all targets) and confidence (for categorical targets) are always computed when
the model is scored. The computed confidence can be based on the probability of the predicted value
(the highest predicted probability) or the difference between the highest predicted probability and the
second highest predicted probability.
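
As a quick illustration (with made-up probabilities for one record), the two confidence definitions can be computed as follows.

```python
import numpy as np

probs = np.array([0.62, 0.28, 0.10])           # predicted probabilities for the target categories
top_two = np.sort(probs)[::-1][:2]

confidence_highest = top_two[0]                # highest predicted probability: 0.62
confidence_margin = top_two[0] - top_two[1]    # difference from the second highest: 0.34
print(confidence_highest, confidence_margin)
```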

Propensity scores—the likelihood of the true outcome—can be created for flag targets, as we also saw
for decision tree models. The model produces raw propensity scores; adjusted propensity scores are
not available.

We are ready to run the Neural Net node. Although we have discussed the various options, we will
use all defaults for this first model.

Click Run


4.7 Models Palette


The Neural Net model is placed in both the stream, attached to the Type node, and in the Models
Palette. The Models tab in the Manager holds and manages the results of the machine learning and
statistical modeling operations. There are two context menus available within the Models palette. The
first menu applies to the entire model palette.

Right-click in the background (empty area) in the Models palette

Figure 4.13 Context Menu in the Models Palette

This menu allows you to open a model in the palette, save the models palette and its contents, open a
previously saved models palette, clear the contents of the palette, or to add the generated models to
the Modeling section of the CRISP-DM project window. If you use PASW Collaboration and
Deployment Services to manage and run your data mining projects, you can store the palette in the
repository, or retrieve a palette or model from it.

The second menu is specific to the generated model nodes.

Right-click the generated Neural Net node named CHURNED in the Models palette


Figure 4.14 Context Menu for Nodes in the Models Palette

This menu allows you to rename, annotate, and browse the generated model node. A generated model
node can be deleted, exported as PMML (Predictive Model Markup Language) and stored in the
PASW Collaboration and Deployment Services Repository, or saved in a file for future use.

4.8 The Neural Net Model


We can now explore the neural net model created to predict CHURNED.

Double-click the Neural Net model in the stream

The Model Tab contains five views or summaries of the model. The first is the Model Summary,
which is a higher level view of model performance (see next figure).

The table identifies the target, the type of neural network trained, the stopping rule that stopped
training (shown if a multilayer perceptron network was trained), and the number of neurons in each
hidden layer of the network. Here, eight hidden neurons were used in one hidden layer.

The bar chart displays the accuracy of the final model, which is 81.1% (compare to that for the C5.0
model in the previous lesson). For a categorical target, this is simply the percentage of records for
which the predicted value matches the observed value. Since we are using a Partition node, this is the
percentage correct on the Training partition. The full Training partition is used for this calculation,
including the overfit prevention set that was used internally during model building.


Figure 4.15 Model Summary Information for Neural Net Model

Note: Accuracy for a Continuous Target


For a categorical target the accuracy is simply the percentage correct. It is worth noting that if the
target is continuous, then accuracy within PASW Modeler is defined as the average of the following
expression across all n records:

Accuracy = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{\lvert \text{Target Value}_i - \text{Predicted Target Value}_i \rvert}{\text{Range of Target Values}}\right)
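
Read literally, the expression can be computed as in this small sketch; the values are invented, and the use of the absolute difference is assumed.

```python
import numpy as np

actual    = np.array([4000., 4500., 5200., 6100.])
predicted = np.array([4200., 4400., 5600., 5800.])
target_range = actual.max() - actual.min()

accuracy = np.mean(1 - np.abs(actual - predicted) / target_range)
print(round(float(accuracy), 3))     # closer to 1.0 means smaller relative errors
```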

Predictor importance
Next we can look at predictor importance.

Click the Predictor Importance chart panel

For models such as neural nets, predictor importance takes on, well, added importance because there
is no single equation or other representation of the model available in the generated model nugget
(but we can view the coefficients, as demonstrated below). The same is true, for example, with SVM
models. Predictor importance is based on sensitivity analysis, which is a method to determine how
variation in the model inputs leads to variation in the predicted values. The more important a
predictor, the more changes in its values change the outcome values. In PASW Modeler, importance
is calculated by sampling repeatedly from combinations of values in the distribution of the predictors
and then assessing the effect on the target. The importance values are then normalized to sum to 1.0
so that they can be compared.
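
The sketch below conveys the flavor of such a sensitivity analysis using a simple permutation-style perturbation; it is an assumed illustration rather than Modeler's actual sampling procedure, and the stand-in model function is invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def model(X):
    # Stand-in for a trained network's scoring function.
    return 0.8 * X[:, 0] + 0.2 * X[:, 1] + 0.0 * X[:, 2]

X = rng.normal(size=(500, 3))
base = model(X)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(X[:, j])          # perturb this predictor's values only
    importance.append(np.mean((model(Xp) - base) ** 2))

importance = np.array(importance) / np.sum(importance)   # normalize so the values sum to 1.0
print(importance.round(3))                       # predictor 0 dominates, predictor 2 is near 0
```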

The most important predictor of CHURNED is LONGDIST followed by LOCAL and then
International. These fields all measure usage of the phone service. The first customer demographic
variable in importance is SEX. You may wish to compare these fields to the ones selected by the C5.0
model in Lesson 3.

Figure 4.16 Predictor Importance in Model

Predictor importance is not a substitute for exploring the model and seeing how it actually functions,
as we do later, but it is a first step at reviewing a model.

Next we can view how well the model performs at predicting each category of CHURNED.

Click the Classification panel

For categorical targets, this section displays the cross-classification of observed versus predicted
values in a heat map, plus the percent in each cell. We look, as usual, on the diagonal to see the
correct predictions. The neural net does best at predicting those in the InVol category, with lowest
accuracy for Current customers. As a reminder, this table uses the Training partition. The depth of
shading of each cell is based on the percentage, with darker shading corresponding to higher
percentages. There are three other table styles available, selected from the Styles dropdown.


Figure 4.17 Classification of Observed versus Predicted Values

The Neural Network Structure


You can see the neural network itself in the next section of output.

Click on the Neural Network panel

The network can be displayed in several views. These icons display the network in,
respectively:
• Standard style, with inputs on the left and outputs on the right
• Inputs on the top and outputs on the bottom
• Inputs on the bottom and outputs on the top

Also available is a slider to limit the display of inputs based on predictor importance.


Figure 4.18 Neural Network Structure

It is difficult to see the full network in the standard view, so we’ll switch to the view with inputs on
the top.

Click on the middle network icon


Maximize the window’s width

There are two different display styles, which are selected from the Style dropdown list.

• Effects. This displays each predictor and target as one node in the diagram irrespective of
whether the measurement scale is continuous or categorical. That is the current view.
• Coefficients. This displays multiple indicator nodes for categorical predictors and targets. The
connecting lines in the coefficients-style diagram are colored based on the estimated value of
the synaptic weight.

Move both sliders to their endpoints so all fields are displayed


Figure 4.19 Neural Network With Inputs at Top

In this network there is one hidden layer, containing eight neurons, and the output layer still contains
only one neuron corresponding to the target field CHURNED.

Click the Style dropdown and select Coefficients

In the Coefficients view all the neurons are visible. The input layer is made up of one neuron per
continuous field. Categorical fields will have one neuron per value. In this example, there are seven
continuous, five flag, and one nominal field with three values, totaling twenty neurons. There are also
Bias neurons to set the scale of the input. The field CHURNED is represented by three neurons for its
three categories.


Figure 4.20 Neural Network In Coefficients View

The connecting lines in the diagram are colored based on the estimated value of the synaptic weight,
with darker blues corresponding to a greater weight. If you hover the cursor over a link between
neurons, the weight will be displayed in a popup (weights vary from –1.0 to +1.0).

It is visually evident that neural network models are very complicated to summarize easily. This is
because of the very large number of connections (each of the input neurons is connected to each
hidden neuron, and then each hidden neuron is connected to each output neuron). In effect, there are
many, many equations in the network, and so the influence or effect of any one input field would
have to be summed across these many equations.

We conclude our review of the Model Viewer output by looking at the Settings tab.

Click the Settings tab


Figure 4.21 Settings Tab for Neural Network

The options here are equivalent to those on the Model Options tab in the Neural Network modeling
node. The type of confidence can be requested, along with the number of probability fields for
categorical targets. One new option is to generate SQL for the model, allowing pushback of the model
scoring stage to the database. This is only available for the multilayer perceptron.

Click OK to close the Neural Network Model Viewer

4.9 Validating the List of Predictors


Because the neural network results depend on the initial random starting point, it is important to rerun
the model with a different seed to be sure that the results are consistent. It is entirely possible that
because of the seed we chose that one or more of the fields the Neural Network found to be important
in influencing CHURNED might not be selected again with a different seed. Therefore, it is crucial to
run the Neural Network model enough times until you are convinced about which predictors are the
most important in influencing your target. We will rerun the model just once and compare it with the
one we just ran. Normally, you would need to rerun it several times.


Double-click the Neural Network modeling node


Click Build Options tab
Click Advanced
Change the random seed in the Random seed: text box to 444
Click Run
Edit the generated Neural Net model in the stream

We see in the Model Summary pane that the overall accuracy has decreased by about 3.6%. Also,
intriguingly, the number of neurons in the hidden layer is now 3, not 8 (the number of hidden neurons
is determined in part by the random seed).

Figure 4.22 Model Summary Information for Neural Net Model After Changing Seed

Click the Predictor Importance panel

The Predictor Importance is not identical to that in our first model, but it is very similar. The three usage
fields are the most important. In fact, the top six fields are the same, in the same order, as the first
model. After that the order changes, but importance is not something that should be viewed as a fixed
and definite value (such as a regression coefficient). Instead, importance is a rough measure of a
predictor’s influence on the overall network output.


Figure 4.23 Predictor Importance after Changing the Seed

So although the accuracy dropped a bit, generally these results are encouraging about the stability of
the model. If we look at the Classification table, we will learn that accuracy dropped for the Current
and Vol categories more than for InVol.

Normally we would rerun the model a few more times with different seeds to further convince
ourselves that the top predictors of CHURNED remain the same and that accuracy remains fairly
constant, but we will stop here and attempt to further understand the model.

Click OK to close the Neural Network Model Viewer

4.10 Understanding the Neural Network


A common criticism of neural networks is that they are opaque; that is, once built, the reasoning
behind their predictions is not clear. For instance, does making a lot of international calls mean that
you are likely to remain a customer, or instead leave voluntarily? In the following sections we will
use some techniques available in PASW Modeler to help you evaluate the network and discover its
structure.

Creating a Data Table Containing Predicted Values


The first step is to pass the data to a Table node and look at the output fields.

Connect the generated Neural Net model named CHURNED to the nearby Table node
Run the Table node


Figure 4.24 Table Showing the Two Fields Created by the Generated Net Node

The generated Neural Net node calculates two new fields, $N-CHURNED and $NC-CHURNED, for
every record in the data file with valid data for the model. The first represents the predicted
CHURNED value and the second a confidence value for the prediction. The latter is only appropriate
for categorical targets and will be in the range of 0.0 to 1.0, with the more confident predictions
having values closer to 1.0. We can observe that the first record, which is contained in the Training
partition, was correctly predicted to be a voluntary churner.

Close the Table window

Comparing Predicted to Actual Values


In data-mining projects it is advisable to see not only how well the model performed with the data we
used to train the model, but also with the data we held out for testing purposes. The Neural Net model
only displays results for the Training partition, so we need to use a Matrix node to create the
equivalent table for the Testing partition.

Because the Matrix node does not have an option to automatically split the results by partition we
must manually divide the Training and Testing samples with Select nodes. This will allow us to
create a separate matrix table for each sample. We already have the Select and Matrix nodes in the
stream from the C5.0 stream.

Connect the generated Neural Net model named CHURNED to each Select node


We then need to specify the correct field in the Matrix nodes.

Double-click on each Matrix node to edit it


Put $N-CHURNED in the Columns:
Run each Matrix node

Figure 4.25 Matrix of Actual and Predicted Churned for Training and Testing Samples

For the training data, the model correctly predicts 75.8% of the current customers, 95.8% of the
involuntary leavers, and 74.9% of the voluntary leavers. For the testing data, the model is doing
slightly better on the current customers (76.3%), but not quite as well on the other two categories
(91.8% and 70.7%, respectively).

When you decide whether to accept a model, and you report on its accuracy, you should use the
results from the Testing (or Validation) sample. The model’s performance on the Training data may
be too optimized for that particular sample, so its performance on the Testing sample will be the best
indication of its performance in the future.

Close the Matrix windows

Overall Accuracy with an Analysis Node


An Analysis node will allow us to assess the overall accuracy of the model on each data partition. It is
often true when predicting a categorical target with more than two categories that overall accuracy is
less important than accuracy at predicting specific outcomes, but usually analysts prefer to know
overall accuracy, too. And, decision-makers regularly ask about it.

The Analysis node in the stream is ready for our use.

Connect the generated Neural Net model node to the Analysis node
Click Run

Overall percent correct for the Training partition is 77.47%; overall percent correct for the Testing
partition is 75.73%. This small reduction in accuracy from the Training to Testing data is typical, and
it falls well within acceptable limits. You can see that the Testing data partition is slightly larger than
that for the Training data, as they were created randomly.


Figure 4.26 Analysis Node Output

Close the Analysis Output browser window

Evaluation Charts
The Evaluation node is included in the stream, and if you wish, you can run Evaluation charts for the
Neural Network model to further study and compare the performance on the training and testing data
partitions.

4.11 Understanding the Reasoning behind the Predictions


One method of trying to understand how a neural network is making its predictions is to apply an
alternative machine learning technique, such as rule induction, to model the neural network
predictions. Here, though, we will use more straightforward methods to understand the relationships
between the predicted values and the fields used as inputs.

Categorical Input with Categorical Target


Based on the predictor importance chart, a categorical input of moderate importance is SEX. Since it
and the target field are categorical we can use a distribution plot with a symbolic overlay to
understand how gender relates to the CHURNED predictions.


Place a Distribution node from the Graphs palette near the Select node for the Training
partition
Connect the Select node to the Distribution node
Double-click the Distribution node
Select SEX from the Fields: list
Select $N-CHURNED as the Color Overlay field
Click the Normalize by color check box (not shown)
Click Run

The Normalize by color option creates a bar chart with each bar the same length. This helps to
compare the proportions in each overlay category.

Figure 4.27 Distribution Plot Relating Sex and Predicted Churned ($N-CHURNED)

The chart illustrates that the model predicts that the majority of females are voluntary leavers,
while the bulk of males are predicted to remain current customers. The large difference in the
proportion of each category of CHURNED for males compared to females is an illustration of why
SEX is an important predictor. And this plot would help you describe the model in any summary
reports you write.

Close the Distribution plot window

We next look at a histogram plot with an overlay.

Continuous Input with Categorical Target


The most important continuous input for this model is LONGDIST. Since the target field is
categorical, we will use a histogram of LONGDIST with the predicted value as an overlay to try to
understand how the network is associating long distance minutes used with CHURNED.


Place a Histogram node from the Graphs palette near the Select node for the Training
sample
Connect the Select node to the Histogram node
Double-click the Histogram node
Click LONGDIST in the Field: list
Select $N-CHURNED in the Overlay Color field list
Click on the Options tab
Click on Normalize by Color (not shown)
Click Run

Figure 4.28 Histogram with Overlay of Predicted Churned by Long Distance Minutes

Here the only clear pattern we see is that Involuntary Leavers tend to be people who do little or no
long distance calling. In contrast, it appears that the amount of long distance calling was not as much
an issue when it came to predicting whether or not a person would remain a current customer or
voluntarily choose to leave.

You may wish to try these same graphs with the Testing Partition.

Note: Use of Data Audit Node


We explored the relationship between just two input fields (LONGDIST and SEX) and the prediction
from the neural net ($N-CHURNED), and used Distribution and Histogram nodes to create the plots.
If more inputs were to be viewed in this way, a more efficient approach would be to use the Data
Audit node because overlay plots could easily be produced for multiple input fields (the overlay plots
can’t be normalized, though).

To save this stream for later work:


Click File…Save Stream As


Move to the c:\Train\ModelerPredModel directory (if necessary)
Type NeuralNet in the File Name: text box
Click Save

4.12 Model Summary


In summary, we appear to have built a neural network that is reasonably good at predicting the three
different CHURNED groups. The overall accuracy was about 77% with the Training data, and 76%
with the Testing data. Focusing on the Testing or unseen data, the model is most accurate at
predicting the Involuntary Leaver group but somewhat less successful predicting the Current
Customers and Voluntary Leaver. Considering that the model was correct almost three-quarters of the
time even in the case of these latter two groups, it is certainly within the realm of possibility that the
model may be considered a success. Of course, this would depend on whether these accuracy rates
met or exceeded the minimum requirements defined at the beginning of the data-mining project. In
terms of how predictors relate to the model, the most important factors in making its predictions are
LONGDIST, International, LOCAL, and SEX. The network appears to associate females with the
Voluntary Leaver group and predicts that males will remain Current Customers. The model also tends
to predict that the people who are most likely to be dropped by the company (Involuntary Leavers)
are those who do little or no long distance calling. (There are many other relationships in the data
between the predictors and the CHURNED that we didn’t investigate, of course.)

4.13 Boosting and Bagging Models


There are two additional techniques available in the Neural Net node, and other modeling nodes in
PASW Modeler, to create reliable and accurate predictive models. These techniques build a
collection, or ensemble, of models, and then combine the results of the models to make a prediction.
However, the techniques don’t simply create a number of models on exactly the same data, which
wouldn’t provide any particular advantage. Instead, they resample or reweight the data for each
additional model, which leads to a different model each time. This turns out to be a winning strategy
for creating effective ensembles of models.

These two techniques are called boosting and bagging. We will provide a brief description of how
each operates, then simple examples of both.

Boosting. The key concept behind model boosting is that successive models are built to predict the
cases misclassified from earlier models. Thus, as the number of models increases, the number of
misclassified cases should decrease. The method works by applying a model to the data in the normal
fashion, with each record assigned an equal weight. After the first model is constructed, predictions
are made, and weights are created that are inversely proportional to the accuracy of classification.
Then a second model is created, using the weighted data. Then predictions are made from the second
model, and then weights are created that are inversely proportional to the accuracy of classification
from the second model. This process continues for a (usually) small number of iterations. When
done, the model predictions can then be combined by voting, by highest probability, etc.
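
The following schematic sketch (a hypothetical base learner and invented data, not the Neural Net node's boosting algorithm) shows the reweighting loop just described: each pass up-weights the records the previous model got wrong, and the component predictions are combined by voting.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted one-split learner on the first column of X (hypothetical base model)."""
    make = lambda cut: (lambda Z: (Z[:, 0] > cut).astype(int))
    cuts = np.unique(X[:, 0])
    best = min(cuts, key=lambda c: np.sum(w * (make(c)(X) != y)))
    return make(best)

def boost(X, y, n_models=5):
    n = len(y)
    weights = np.full(n, 1.0 / n)                   # start with equal record weights
    models = []
    for _ in range(n_models):
        predict = fit_stump(X, y, weights)
        models.append(predict)
        wrong = predict(X) != y
        weights = np.where(wrong, weights * 2.0, weights)   # emphasize misclassified records
        weights /= weights.sum()
    # Combine the component models by voting (here: majority of 0/1 predictions).
    return lambda Z: (np.mean([m(Z) for m in models], axis=0) > 0.5).astype(int)

X = np.array([[1.], [2.], [3.], [4.], [5.], [6.]])
y = np.array([0, 0, 0, 1, 1, 1])
ensemble = boost(X, y)
print(ensemble(X))    # [0 0 0 1 1 1]
```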
Bagging. The term “bagging” is derived from the phrase “Bootstrap aggregating.” In this method,
new training datasets are generated which are of the same size as the original training dataset. This is
done by using simple random sampling with replacement from the original data. By doing so, some
records will be repeated in each new dataset. This type of sample is called a bootstrap sample. Then, a
model is constructed for each bootstrap sample, and the results are combined with the usual methods
(voting, averaging for continuous targets). Cases are weighted normally with this method in each
bootstrap sample.
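
A minimal sketch of bootstrap aggregation under these assumptions follows; the base learner (a simple majority-class predictor) and all names are invented, and the point is only the sampling-with-replacement and voting steps.

```python
import numpy as np

def bag(X, y, fit_model, n_models=10, seed=1):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement: some records repeat
        models.append(fit_model(X[idx], y[idx]))
    def predict(Z):                                  # combine by voting (categorical target)
        preds = np.array([m(Z) for m in models])
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
    return predict

# Example base learner: always predict the majority class of its bootstrap sample.
majority = lambda X, y: (lambda Z: np.full(len(Z), np.bincount(y).argmax()))
ensemble = bag(np.zeros((8, 1)), np.array([0, 0, 0, 1, 1, 1, 1, 1]), majority)
print(ensemble(np.zeros((3, 1))))   # most bootstrap samples favour class 1
```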


Boosting can be used on datasets of almost any size and characteristics. It is designed to increase
model accuracy, first and foremost. Bagging should be avoided on very small datasets, especially
those with lots of outliers, where the outliers can affect the samples that are constructed. Bagging can
increase accuracy, but its primary purpose is to reduce model variability.

When these methods are used, the Model Viewer will provide different views from that when a single
model is constructed. Included will be the results from each model and details on each, plus some
indication of the variability of the model results (for bagging).

It is absolutely necessary to have a test or validation dataset on which the boosted or bagged models
can be assessed. These models, even more so than a regular model, are highly tuned to the training
data, and so their performance must be evaluated on data not used for model-building. Bagging and
boosting are not guaranteed to improve model performance on new data, but the idea is that creating
several models is worth the tradeoff of reusing the training data several times.

Typically, only a small number of boosted or bagged models need to be constructed; the default
number is 10 in PASW Modeler.

The outcome of boosting or bagging is still only one model nugget, and it can be used the same as any
other standard model nugget. The downside to boosting or bagging is that no one equation, or
decision tree, or the equivalent, can represent the model, so it can be hard to describe and characterize
how the model makes predictions. Thus, you should investigate the relationship between the
predictors and the target field, as we did above, to gain model understanding.

4.14 Model Boosting with Neural Net


We will use boosting to predict CHURNED, using the default settings but changing the random seed
once more.

Edit the Neural Net modeling node named CHURNED


Click Build Options tab
Click Objectives
Click Enhance model accuracy (boosting)


Figure 4.29 Requesting Model Boosting

Click Ensembles settings

We viewed the Ensembles settings earlier in the lesson. The number of component models to create
for boosting or bagging is 10, and that can be changed here. The default choice for combining models
for categorical targets is voting, and two other choices using probability are available.

We’ll use the default choices.


Figure 4.30 Ensemble Model Scoring

Click Advanced settings


Change the Random seed value to 5555 (not shown)
Click Run

You will notice that execution does take much longer than when running a single neural net model.
Once the model has finished:

Edit the Neural Net Model CHURNED


Figure 4.31 Boosted Model Accuracy

The Model Summary view has three measures of accuracy. The bar chart displays the accuracy of the
final model, compared to a reference model and a naive model. The reference model is the first model
built on the original unweighted data. The naïve model represents the accuracy if no model was built,
and assigns all records to the modal category (Current). The naive model is not computed for
continuous targets.

The ensemble of 10 models is perfectly accurate—100%! That is encouraging, but we'll have to see
how it performs on the Testing data partition.

Click Predictor Importance panel


Figure 4.32 Predictor Importance for Boosted Model

The same fields are important as in the original Neural Net model, although now SEX is the second
most important, and it and LONGDIST have higher importance values than any other field.

Click the Predictor Frequency panel


Figure 4.33 Predictor Frequency for Boosted Model

In some modeling methods, such as decision trees, the predictor set can vary across component
models. The Predictor Frequency plot is a dot plot that shows the distribution of predictors across
component models in the ensemble. Each dot represents one or more component models containing
the predictor. Predictors are plotted on the y-axis, and are sorted in descending order of frequency;
thus the topmost predictor is the one that is used in the greatest number of component models and the
bottommost one is the one that was used in the fewest. The top 10 predictors are shown. However, all
predictors are used in each Neural Net component model in the ensemble, so this plot is not useful
here.

Click the Ensemble Accuracy panel


Figure 4.34 Ensemble Accuracy for Boosted Model

The Ensemble Accuracy line plot shows the overall accuracy of the model ensemble as each model is
added. Generally, accuracy will increase as models are added, and we see that ensemble model
accuracy reached 100% after only five models (you can hover the cursor over the line and view a
popup of accuracy at that point). The line plot can be used to decide how fast the ensemble accuracy
is increasing, and whether it is worthwhile to increase (or decrease) the number of models in another
modeling run.

Click on Component Model Details panel


Figure 4.35 Component Model Details for Boosted Model

Information about each of the models, in order of their creation, is supplied in the Component Model
Details panel. Included are model accuracy, the number of predictors, and the number of synapses
(weights), which is directly related to the number of neurons in the network. Not surprisingly, as the
models attempted to model cases that were still mispredicted by earlier models, model accuracy
decreased, although not dramatically. You can sort the rows in ascending or descending order by the
values of any column by clicking on the column header.

This information can be used to decide whether the model should be rerun, with additional
component models, or perhaps different modeling settings.

Boosted Model Performance


Because this model has been added to the stream, we can immediately check its performance on the
Testing data partition.

Click OK
Run the Analysis node


Figure 4.36 Analysis Node Output for Boosted Model

As we saw when browsing the model, the boosted model is 100.0% accurate on the Training data. On
the Testing data, the accuracy is 76.12%. This large drop-off is typical for boosted models and is
illustrative of why the Testing data performance is the true guide to model performance on new data.

The level of accuracy is decent, but the accuracy with one model was 75.73%, almost the same. Of
course, every small increase in accuracy may be important. Compared to the single neural network,
the ensemble is performing better with current customers, but not as well with those who left
involuntarily. All of these factors must be taken into account when deciding which model is preferred.

We will not investigate the boosted neural net model further, but you can do so using the existing
nodes in the stream, or others you add. If you do you will discover that the boosted model has similar
relations between the key predictors and the target field CHURNED.


4.15 Model Bagging with Neural Net


We next try bagging with a neural net model. We will use the same settings in the Neural Net
modeling node, just changing to bagging.

Close the Analysis node


Edit the Neural Net modeling node named CHURNED
Click Objectives
Click Enhance model stability (bagging)

Figure 4.37 Requesting Model Bagging

Click Run

You will notice that execution does take much longer than when running a single neural net model.
Once the model has finished:

Edit the Neural Net Model CHURNED


Figure 4.38 Bagged Model Accuracy

The Model Summary view has the same three measures of accuracy, and a measure of model
diversity (variability). The reference and naïve model accuracies are the same as for the boosted
model, as the reference model is a model built on all the training data, as with the boosted model.

The ensemble of 10 models is very accurate at 96.94%, although not completely accurate as was the
boosted model.

For bagged models there is a dropdown to display accuracy for the different model combining rules.
All of these can be shown on one chart by selecting the Show All Combining Rules check box.

Click Show All Combining Rules check box


Figure 4.39 Bagged Model Accuracy for all Voting Methods

For this bagged model, the rule with the highest accuracy is to use the highest mean probability. You
can try all three types of combining rules on the Testing data to pick the best performing model.

Below the Quality bar chart is a bar chart labeled Diversity. This chart displays the "diversity of
opinion" among the component models used to build the bagged ensemble, presented in greater is
more diverse format, normalized to vary from 0 to 1. It is a measure of how much predictions vary
across the base models. Although the label indicates that larger is better, this isn’t necessarily so. The
true test is how well the bagged models perform on the Testing data.

Figure 4.40 Bagged Model Diversity


We skip the Predictor Importance information, which is very similar to that for the boosted ensemble
model. For the same reason, we skip the Predictor Frequency panel, which is identical to that for the
boosted model, since all predictors are used in each component model.

Click the Ensemble Accuracy panel

Figure 4.41 Component Model Accuracy for Bagged Model

The Component Accuracy chart is a dot plot of predictive accuracy for the component models. Each
dot represents one or more component models with the level of accuracy plotted on the y-axis. Hover
over any dot to obtain the id for the corresponding individual component model. The chart also
displays color coded lines for the accuracy of the ensemble as well as the reference model and naïve
models. A checkmark appears next to the line corresponding to the model that will be used for
scoring. What we can see from the dot plot is that the level of accuracy of the 10 bagged models is
very comparable.

Unlike with boosted models, we cannot see the overall accuracy as each model is added. This is
because the models have no logical sequence.

Click on Component Model Details panel


Figure 4.42 Component Model Details for Bagged Model

Information about each of the models created, in order of their creation, is supplied in the Component
Model Details panel. This is the same type of information as supplied for boosted models.

Here, there is no trend in accuracy as new bootstrap samples are taken, and there really shouldn’t be.
Also, overall accuracy is very similar for each of the models. And since the diversity measure was
low (.08), we know that the model predictions were also very similar. The fact that the models were
so comparable in performance may be a good thing, but we won’t know until we view the bagged
model with the Testing data.
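
For readers who want to see the bagging idea outside of Modeler, the sketch below builds a small bagged ensemble of neural networks with scikit-learn. It is a minimal analogue of the node's behavior, not Modeler code, and the synthetic data only stands in for the churn training partition.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the training data: 10 inputs and a 3-category target.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=5555)

# Ten component networks, each trained on a bootstrap sample of the training records.
ensemble = BaggingClassifier(MLPClassifier(max_iter=500, random_state=5555),
                             n_estimators=10, bootstrap=True, random_state=5555)
ensemble.fit(X, y)

# Accuracy of each component model, and of the combined ensemble, on the training data.
component_acc = [round(m.score(X, y), 3) for m in ensemble.estimators_]
print(component_acc)
print(round(ensemble.score(X, y), 3))

As in the Component Model Details panel, the component accuracies should be broadly similar, with no trend from one bootstrap sample to the next.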

Bagged Model Performance


Because this model has been added to the stream, we can immediately check its performance on the
Testing data partition.

Click OK
Run the Analysis node


Figure 4.43 Analysis Node Output for Bagged Model

As we saw when browsing the model, the bagged model is correct on 96.94% of the records on the
Training data. On the Testing data, the accuracy is 76.25%. This large drop-off is typical for bagged
models and is illustrative, as with boosted models, of why the Testing data performance is the true
guide to model performance on new data. The model performs better than the single neural network
(75.73%), but not by much. It does, though, do better on those who left involuntarily, but not as well
on those who left voluntarily, compared to the boosted model.


Summary Exercises
The exercises in this lesson use the file charity.sav. The following table provides details about the
file.

charity.sav comes from a charity and contains information on individuals who were mailed a
promotion. The file contains details including whether the individuals responded to the campaign,
their spending behavior with the charity and basic demographics such as age, gender and mosaic
(demographic) group. The file contains the following fields:

response Response to campaign


orispend Pre-campaign expenditure
orivisit Pre-campaign visits
spendb Pre-campaign spend category
visitb Pre-campaign visits category
promspd Post-campaign expenditure
promvis Post-campaign visits
promspdb Post-campaign spend category
promvisb Post-campaign visit category
totvisit Total number of visits
totspend Total spend
forpcode Post Code
mos 52 Mosaic Groups
mosgroup Mosaic Bands
title Title
sex Gender
yob Year of Birth
age Age
ageband Age Category

In this set of exercises you will use a neural network to predict the field Response to campaign.

1. Begin with a blank Stream canvas. Place a Statistics source node on the Stream canvas and
connect it to the file charity.sav. Tell PASW Modeler to use variable and value labels.

2. Attach a Type and Table node in a stream to the source node. Run the stream and allow
PASW Modeler to automatically define the types of the fields.

3. Edit the Type node. Set all of the fields to role NONE.

4. We will attempt to predict response to campaign (Response to campaign) using the fields
listed below. Set the role of all five of these fields to input and the Response to campaign
field to target.

Pre-campaign expenditure
Pre-campaign visits
Gender
Age
Mosaic Bands (which should be changed to nominal measurement level)


5. Attach a Neural Net node to the Type node. Run the Neural Net node with the default
settings.

6. Once the model has finished training, browse the generated Net node in the stream. What is
the predicted accuracy of the neural network? What were the most important fields within the
network?

7. Connect the generated Net node to a Matrix node and create a data matrix of actual response
against predicted response. Which group is the model predicting well?

8. Use some of the methods introduced in the lesson, as well as others, such as web plots and
histograms (or use the Data Audit node with an overlay field), to try to understand the
reasoning behind the network’s predictions.

9. Change the random seed and rerun the neural net and recheck its performance.

10. Try a radial basis function neural network to see if you can improve on the model
performance.

11. Try a boosted or bagged model to see if you can improve on model performance compared to
the models you created above.

12. For those with extra time: Use C5.0 or other decision tree methods to predict Response to
campaign from the charity.sav data. How do the rule induction models compare with the
neural network models built here? Which are the most accurate? Which are the easiest to
understand?

13. Save a copy of the stream as Exer4.str.


Lesson 5: Support Vector Machines


Objectives
• Review the foundations of the Support Vector Machines model
• Use an SVM model to predict customer churn
• Try several different kernel functions and model parameters to improve the model
• Discuss how missing data is handled in SVM models

Data
In this lesson we use the data file customer_dbase.sav, which like churn.txt is also from a
telecommunications firm. We will use an SVM model to predict customer churn. The file contains
fields measuring both customer demographics and customer use of telecommunications service to use
as predictors. SVM models can use many input fields, so this file will allow us to demonstrate that
feature. We continue to use a Partition node to split the data file.

5.1 Introduction
Support vector machine (SVM) is a robust classification technique that is used to predict either a
categorical or a continuous outcome field. SVM is particularly well suited to analyzing data with a
large number of predictor fields. Broadly, an SVM works by mapping data into a higher-dimensional space
where the data points can be categorized or predicted accurately, even if there is no easy way to
separate the points in the original space. This involves using a kernel function to map the
data from the original space into the new space. An SVM, like a neural net model, does not provide
output in the form of an equation with coefficients on the predictor fields, although predictor
importance is available with the model. Thus, as with a neural net, to understand the model rather than
use it simply as a black box that makes predictions, you will need to do additional analysis.

We will use an SVM in this lesson to predict customer churn. First we provide
some background and theory about how an SVM model is calculated and what features of the model
are under user control. Developing an acceptable SVM model usually requires trying various model
settings rather than accepting the default node settings.

5.2 The Structure of SVM Models


SVM models were developed to handle difficult classification/prediction problems where the
“simple” linear models were unable to accurately separate the categories of an outcome field. A
typical complicated problem, in two dimensions, is shown in Figure 5.1. Assume that the X and Y
axis represent two predictors, while the circles and squares represent the two categories of a target
field we wish to predict.


Figure 5.1 Predicting a Binary Outcome Field

There is no simple straight line that can separate the categories, but the curve drawn around the
squares shows that there is a complex curve that will completely separate the two categories.

The central task of SVM is to transform the data from this space into another space where the curve
that separates the data points will be much simpler. Typically this means transforming the data so that
a hyperplane (in higher dimensional space) can be used to separate the points.

The mathematical function used for the transformation is known as a kernel function. After
transformation, the data points might be represented as shown in Figure 5.2.

Figure 5.2 Kernel Transformation of Original Data

The squares and circles can now be separated by a straight line in this two-dimensional space. The
filled in circles and squares are the cases (called vectors in the SVM literature) that are on the
boundary between the two classes. They are the same points in both Figure 5.1 and Figure 5.2. The
filled in circles and squares are all the data that is needed to separate the two categories, and these key
points are called support vectors because they support the solution and boundary definition. Because


SVM models were developed in the machine learning tradition, this technique was called support
vector machine, hence the model name.

Even though it appears that we have a solution, there is more than one straight line (hyperplane) that
could be used to separate the two categories, as illustrated in Figure 5.3.

Figure 5.3 Multiple Possible Separating Lines

SVM models try to find the best hyperplane that maximizes the margin (separation) between the
categories while balancing the tradeoff of potentially overfitting the data. The narrower the margin
between the support vectors, the more accurate the model will be on the current data. Thus, a
separating line as shown in Figure 5.4 maximizes the margin between the support vectors.

Figure 5.4 Creating Maximum Separation Between Support Vectors

However, although this separator is 100% accurate, it may be too narrow to perform well on new
data, as illustrated in Figure 5.5. Here, in a new dataset, there are circles or squares that fall on the wrong
side of the support vectors and so will be classified in error.


Figure 5.5 Misclassified Cases in Training Data

To allow for this, SVM models include a regularization or weight factor C that is added as a penalty
term in the function used by the SVM model. The algorithm attempts to maximize the margin
between the support vectors while minimizing error. As described below, you can try various values
of C to find an optimal model.
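
The sketch below, a minimal scikit-learn illustration on synthetic data rather than Modeler output, shows how C behaves in practice: the support vectors are available from the fitted model, and increasing C typically reduces training error at the risk of overfitting.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two-predictor synthetic data with overlapping classes.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           class_sep=0.8, random_state=42)

for C in (0.1, 1, 10):
    model = SVC(kernel="rbf", C=C).fit(X, y)
    print("C =", C,
          "| support vectors:", model.support_vectors_.shape[0],
          "| training accuracy:", round(model.score(X, y), 3))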

Although the description of an SVM has been in the context of a categorical target, an SVM can be
applied to predict a continuous field. See the Modeler 14.0 Algorithms Guide for more information.

Kernel Functions
Four different types of kernel function are available in the SVM node.

• Linear: Simple function that works well when nonlinear relationships in the data are
minimal
• Polynomial: A more complex function that allows for higher order terms
• RBF (Radial Basis Function): Equivalent to the neural network of this type. Can fit highly
nonlinear data.
• Sigmoid: Equivalent to a two-layer neural network. Can also fit highly nonlinear data.

Some of these functions have other parameters that you can modify to find an optimal model, such as
the degree of the polynomial, or the gamma factor that controls the influence of the function. As with
the factor C, there is a tradeoff with gamma values between accuracy and overfitting.

You can anticipate that you will not find the best model using the default settings in the SVM node.
Just as with decision trees, where you usually need to change the depth of a tree, a pruning parameter,
or the minimum number of cases in terminal nodes, SVM models must be tuned to perform better.
One method to fit many models efficiently is to use the Auto Classifier (if the target field is a flag) or
the Auto Numeric nodes (if the target is continuous), as appropriate, which allow you to run several
versions of a model at one time (see Lessons 12 and 13, respectively). For example, you could run 10
different SVMs with 10 different values of C.
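
Expressed outside of Modeler, the same idea looks like a grid search. The sketch below is a scikit-learn analogue of what the Auto Classifier node does (not Modeler code, and the synthetic data only stands in for a training partition); it fits an RBF SVM for every combination of C and gamma and reports the settings with the best cross-validated accuracy.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a training partition with 19 inputs.
X, y = make_classification(n_samples=500, n_features=19, random_state=1)

param_grid = {"C": [1, 3, 5, 10], "gamma": [0.1, 0.16, 0.2, 0.32]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)            # the best C and gamma combination found
print(round(search.best_score_, 3))   # its mean cross-validated accuracy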

SVM models make predictions by use of the separating hyperplane; the equation of that
hyperplane and the support vectors themselves are possible outputs from the model. However,
neither of these will provide much insight into the model unless the dimensionality of the space is


very low, which is rarely the case. As a consequence, the SVM node doesn’t provide this output,
although you can request predictor importance (which is not directly related to either the support
vectors or the hyperplane definition). So to understand a model, you will need to explore how the
predictors are related to the predicted values.

With this background in the basics of an SVM model, we can now apply an SVM to the churn data.

5.3 SVM Model to Predict Churn


The customer database we use in this lesson contains a field (churn) that measures whether or not a
customer of the telecommunications firm has renewed their service. We will attempt to
predict this flag field with several inputs.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory


Double-click on SVM.str
Run the Table node
Close the Table window
Edit the Type node


Figure 5.6 Type Node for SVM Model to Predict churn

There are 19 fields with Role Input that will be used as predictors. SVM models can easily handle
hundreds of predictors, but we limit the number of predictors here for a practical reason. Requesting
predictor importance for an SVM model can greatly increase execution time (by over a factor of 10).
Therefore, so that we don’t wait excessively for models to run, we limit the inputs.

Close the Type window


Add an SVM node to the stream
Connect the Type node to the SVM node
Edit the SVM node


Figure 5.7 SVM Node Model Tab

All the model settings are available in the Expert tab.

Click Expert tab


Click Expert option button

Figure 5.8 Expert Options for SVM Models


As mentioned in the previous section, there are four types of kernels that can be selected to
effectively create different types of models. The default is a RBF (Radial Basis Function), and we use
that initially.

The Regularization parameter (C) controls the trade-off between maximizing the margin of the
support vectors and minimizing the training error. Its value should normally be between 1 and 10
inclusive, with the default being 10. Increasing the value improves the classification accuracy (or
reduces the regression error at predicting a continuous outcome) for the training data, but this can also
lead to overfitting. In general, it is usually better to reduce C.

The Regression precision (epsilon) is used when the measurement level of the target field is
continuous. Errors in the model predictions are accepted if they are under this value. Increasing
epsilon may result in faster modeling, but at the expense of accuracy.

The RBF gamma value is enabled only if the kernel type is set to RBF. Gamma should normally be
between 3/k and 6/k, where k is the number of input fields. For example, if there are 12 input fields,
values between 0.25 and 0.5 would be worth trying. Increasing the value improves the classification
accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting, in
a similar manner to the Regularization parameter. For our problem, with 19 predictors, gamma should
be between .16 and .32, so we will need to change the default value of .10.
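
Worked out for the 19 inputs in this stream, the rule of thumb gives the range quoted above:

k = 19                 # number of input fields
print(3 / k, 6 / k)    # roughly 0.158 to 0.316, hence the value of 0.2 used shortly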

The Gamma value is enabled only if the kernel type is set to Polynomial or Sigmoid. As with RBF
gamma, increasing the value improves the classification accuracy for the training data, but this can
also lead to overfitting.

The Bias value is enabled only if the kernel type is set to Polynomial or Sigmoid. Bias sets the coef0
value in the kernel function. The default value 0 is suitable in most cases.

You can use the Degree value when the Kernel type is Polynomial to control the complexity
(dimension) of the mapping space. The default is 3 (equivalent to a term such as X³).

Change the Regularization parameter to 3


Change the RBF Gamma value to 0.2


Figure 5.9 Expert Settings for SVM Model

Click Analyze tab

In the Analyze tab, unlike most models, the calculation of predictor importance is not checked on by
default. This calculation can be lengthy, and we will not request it in our initial model run.

Figure 5.10 Options in Analyze Tab


Click Run

After the model has finished execution:

Right-click the generated SVM model and select Browse

Figure 5.11 SVM Generated Model Summary Tab

Close the SVM model browser window


Add an Analysis node to the stream
Attach the SVM generated model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box


Figure 5.12 Analysis Node Settings

Click Run


Figure 5.13 Analysis Node Output

The SVM model is accurate on 81.3% of the cases in the Training partition. Looking at the
classification table, the model is extremely accurate for those customers who don’t churn (value of 0);
here the accuracy is almost 94%. But on those customers who did churn (value of 1), the model
accuracy is only about 44%, which is probably not satisfactory. We haven’t developed a final model
yet, so we won’t bother with performance on the Testing data.

Close the Analysis output browser

Requesting Predictor Importance


To understand the model better, we can now request predictor importance. Even though this first
model is not satisfactory, this will show which fields most affect the predictions of churn.

Edit the SVM modeling node


Click the Analyze tab
Click Calculate predictor importance check box


Figure 5.14 Requesting Predictor Importance

Click Run

The model will take much longer to run. When it is done:

Right-click the generated SVM model and select Browse


Figure 5.15 Predictor Importance in Predicting churn

The most important field, by far, is equip (the predictor importance chart displays the variable label instead of
the variable name when a label is available), which records whether or not a customer rents equipment from the
telecommunications firm. The second most important field, ebill, records whether or not a customer
pays bills electronically. The most important demographic field is gender. One strategy based on this
chart is to drop some of the fields of low importance. We will instead change model parameters.

Close the SVM model browser

5.4 Exploring the Model


Before trying a different model, we briefly illustrate examining how the SVM model makes
predictions. Since we are working with the training data, we need to use a Select node to work only
with the data in that partition.

Add a Select node from the Record Ops palette to the stream near the Type node
Attach the Type node to the Select node
Edit the Select node
Use the Expression Builder to create the condition Partition = “1_Training”


Figure 5.16 Select Node to Select Training Data

Click OK
Connect the Select node to the SVM model node in the stream
Click Replace to replace the connection
Add a Distribution node to the stream near the SVM model
Connect the SVM model node to the Distribution node
Edit the Distribution node
Specify equip as the Field
Specify churn as the Overlay field
Click Normalize by color check box

These selections will show us the relationship between the most important predictor and the original
values of churn.


Figure 5.17 Requesting a Distribution Graph with equip and churn

Click Run

We see in Figure 5.18 that most customers who don’t rent equipment (value of 0; you can click the
Label tool to display value labels) did not churn, but a sizeable fraction of customers who did rent equipment churned.

Figure 5.18 Distribution Graph of equip and churn

Now we look at model predictions.

Close the Distribution graph window


Edit the Distribution node


Change the Overlay field to $S-churn (not shown)


Click Run

Figure 5.19 Distribution Graph of equip and Predicted churn

The model predictions are similar to the previous graph, only more extreme. The model predicts
almost no customers who don’t rent equipment will churn. It predicts about the same percentage of
customers who do rent equipment will churn.

Close the Distribution graph window

We could continue this process with other predictors, using appropriate nodes to see the relationship
between them and the original and predicted values of churn.

We next try to improve our first model.

5.5 A Model with a Different Kernel Function


We have three other kernel functions to try, along with changing model parameters. The linear model
may be too simple, so let’s start with the Polynomial.

Edit the SVM modeling node


Click the Analyze tab
Click Calculate predictor importance to deselect it
Click the Expert tab
Click the Kernel type: dropdown and select Polynomial
Change the Degree value to 2

We’ll use a simpler polynomial by one degree.


Figure 5.20 Requesting a Model with Polynomial Kernel

Click Run

When the model is done executing:

Attach the Type node to the new generated SVM model


Edit the SVM model
Click Annotations tab
Click Custom and name the model Polynomial – 2
Click OK
Attach the generated SVM model to the Analysis node by replacing the connection
Run the Analysis node


Figure 5.21 Analysis Node Output

This model performs about the same as the first one that used an RBF kernel, but it does a bit better at
predicting customers who churned (about 45%). So all things being equal, this model might be
preferred.

To save time, we ran a model with a linear kernel function. The Analysis node results are displayed in
Figure 5.22. That model doesn’t perform any better, either overall or at predicting customers who
churn.


Figure 5.22 Analysis Node for Linear Kernel Model

Close the Analysis Output browser

We won’t show the results of a model using a Sigmoid kernel, as it performs much more poorly (you
can try it if you wish).

5.6 Tuning the RBF Model


To illustrate tuning a model even further, we can rerun the SVM model node with an RBF kernel.
We’ll change the value of C, and even though, in theory, increasing it may make the model generalize
less well, we’ll give it a try to see the effect.

Edit the SVM modeling node


Click on Expert tab
Change the Kernel type to RBF (not shown)
Change the Regularization parameter to 5
Click Run

After the model has run:

Connect the SVM model to the Analysis node


Run the Analysis node


Figure 5.23 Analysis Node with RBF Kernel Model

The model did a bit better overall than the original RBF-based model (see Figure 5.13). And it did
better at predicting the customers who churned (47%). So this model appears to be better even with a
higher value of C.

SVM Models and Missing Data


In Figure 5.23 for the Analysis node output, note that there are three records for which a prediction
could not be made (the column with the label $null$). In SVM models, records with missing values
for any of the input or output fields are excluded from model building. In these data, there are 3
records with missing values on longten, and two of these customers also have a missing value for
cardten. (You can use a Select node and Table node to view the data and see these records.) By
chance, all three customers were in the Testing Partition.

If the amount of missing data in a file is a small fraction of the total records, you may be willing to
tolerate the loss of some records from model building and scoring. But if the amount of missing data
is significant, you will need to take some action before using the SVM node. You can use the Type
node Check option to change missing values to a valid value. You could also do this yourself with a
Filler node. Or you could use a Data Audit node to impute missing values in a sophisticated fashion.
Of course, if you do this when creating a model, you will also need to use the same methods to handle
missing data before scoring new data with the model.
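
As one illustration of handling missing values before the SVM node (an alternative to the Type node or Filler node routes just described, not a Modeler feature), the pandas sketch below fills missing values with each field's mean. The field names longten and cardten match the text; the data values are invented.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "longten": [12.5, np.nan, 30.0, np.nan],
    "cardten": [5.0, np.nan, 7.5, 4.0],
})

# Replace each missing value with the field mean so that no records are dropped.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)

Whatever fill values are used during model building must be reapplied, unchanged, when scoring new data.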

Model Comparison
We have several results from several models, and picking a final model is based, in part, on the
accuracy on the Testing partition. For ease of reference, we have created the table below which
summarizes the results from Figure 5.13, Figure 5.21, Figure 5.22, and Figure 5.23.


Table 5.1 Model Performance on Testing Partition


Model Type                   Overall Accuracy   Accuracy at Predicting   Accuracy at Predicting
                                                Non-Churners             Churners
RBF (C=3, RBF gamma=0.2)     77.81%             92.1%                    36.3%
Polynomial (Degree=2)        79.14%             92.9%                    39.2%
Linear                       79.73%             93.7%                    39.4%
RBF (C=5, RBF gamma=0.2)     77.65%             91.2%                    38.3%

The best model on all three measures used a Linear kernel, while the next best model used a
Polynomial kernel. One never knows what model will do best until trying several types of models
with various settings. None of these models does very well at predicting those customers who
churned, so we might continue model building by adding or deleting fields, modifying fields, or
continuing to change model parameters.

It should be clear by now that to be well-organized, you will want to develop a strategy of changing
model parameters in a systematic way, and keeping close track of the model parameters and results.
As noted above, the Auto Classifier or Auto Numeric nodes can be very helpful.

We don’t need to save the stream in this lesson.


Summary Exercises
The exercises in this lesson use the file charity.sav. The following table provides details about the
file.

charity.sav comes from a charity and contains information on individuals who were mailed a
promotion. The file contains details including whether the individuals responded to the campaign,
their spending behavior with the charity and basic demographics such as age, gender and mosaic
(demographic) group. The file contains the following fields:

response Response to campaign


orispend Pre-campaign expenditure
orivisit Pre-campaign visits
spendb Pre-campaign spend category
visitb Pre-campaign visits category
promspd Post-campaign expenditure
promvis Post-campaign visits
promspdb Post-campaign spend category
promvisb Post-campaign visit category
totvisit Total number of visits
totspend Total spend
forpcode Post Code
mos 52 Mosaic Groups
mosgroup Mosaic Bands
title Title
sex Gender
yob Year of Birth
age Age
ageband Age Category

In this set of exercises you will attempt to predict the field Response to campaign using an SVM
model.

1. Begin with a clear Stream canvas. Place a Statistics source node on the Stream canvas and
connect it to the file charity.sav. Tell PASW Modeler to Read Labels as Names.

2. Attach a Type and Table node in a stream to the source node. Run the stream and allow
PASW Modeler to automatically define the types of the fields.

3. Edit the Type node. Set all of the fields to role NONE.

4. We will attempt to predict response to campaign (Response to campaign) using the fields
listed below. Set the role of all five of these fields to Input and the Response to campaign
field to Target.

Pre-campaign expenditure
Pre-campaign visits
Gender
Age


Mosaic Bands (which should be changed to nominal measurement level)

5. Attach an SVM node to the Type node. Create a model with the default settings using an RBF
kernel type. Request predictor importance.

6. Once the model has finished training, browse the model. Which inputs are most important?
Add the generated model to the stream and use an Analysis node to check its accuracy. On
which category does the model do better, responders or non-responders?

7. Now try other SVM models, by both changing the kernel type and parameters associated with
that kernel, or by changing the regularization parameter C. Which model does best at
predicting response to campaign? Is the most important field the same in all the models?

8. After you’ve found a satisfactory model, explore how the input fields relate to the model
predictions.


Lesson 6: Linear Regression


Objectives
• Review the concepts of linear regression
• Use the Regression node to model medical insurance claims data
• Demonstrate the Linear node to perform regression modeling

Data
We use the data file InsClaim.dat, which contains 293 records based on patient admissions to a
hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity
of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a
predictive model for the insurance claim amount and use this model to identify outliers (patients with
claim values far from what the model predicts), which might be instances of errors or fraud made in
the claims.

6.1 Introduction
Linear regression is a method familiar to just about everyone these days. It is the classic general linear
model (GLM) technique, and it is used to predict a target that is interval or ratio in scale
(measurement level continuous) with predictors that are also interval or ratio. In addition, categorical
input fields can be included by creating dummy variables (fields). The Regression model node
performs linear regression in PASW Modeler.

Linear regression assumes that the data can be modeled with a linear relationship. To illustrate, the
figure below contains a scatterplot depicting the relationship between the length of stay for hospital
patients and the dollar amount claimed for insurance. Superimposed on the plot is the best-fit
regression line.

The plot may look a bit unusual in that there are only a few values for length of stay, which is
recorded in whole days, and few patients stayed more than three days.


Figure 6.1 Scatterplot of Hospital Length of Stay and Insurance Claim Amount

Although there is a lot of spread around the regression line and a few outliers, it is clear that there is a
positive trend in the data such that longer stays are associated with greater insurance claims. Of
course, linear regression is normally used with several predictors; this makes it impossible to display
the complete solution with all predictors in convenient graphical form, but it is useful to look at
bivariate scatterplots.

6.2 Basic Concepts of Regression


In the plot above, to the eye (as well as to one's economic sense) there seems to be a positive relation
between length of stay and the amount of a health insurance claim. However, it would be more useful
in practice to have some form of prediction equation. Specifically, if some simple function can
approximate the pattern shown in the plot, then the equation for the function would concisely describe
the relation and could be used to predict values of one field given knowledge of the other. A straight
line is a very simple function and is usually what researchers start with, unless there are reasons
(theory, previous findings, or a poor linear fit) to suggest otherwise. Also, since the goal of much
research involves prediction, a prediction equation is valuable. However, the value of the equation
would be linked to how well it actually describes or fits the data, and so part of the regression output
includes fit measures.

The Regression Equation and Fit Measure


In the plot above, insurance claim amount is placed on the Y (vertical axis) and the length of stay
appears along the X (horizontal) axis. If we are interested in insurance claim as a function of the
length of stay, we consider insurance claim to be the output field and length of stay as the input or
predictor field. A straight line is superimposed on the scatterplot along with the general form of the
equation:

Yi = A + B * Xi + ei


Here, B is the slope (the change in Y per one unit change in X), A is the intercept (the value of Y
when X is zero), and ei is the model residual or error for the ith observation.

Given this, how would one go about finding a best-fitting straight line? In principle, there are various
criteria that might be used: minimizing the mean deviation, mean absolute deviation, or median
deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is
the one that minimizes the sum of the squared deviations of the points about the line.
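
The sketch below shows the least-squares criterion at work on a handful of invented length-of-stay and claim values; NumPy's polyfit returns the slope and intercept that minimize the sum of squared deviations.

import numpy as np

los = np.array([1, 1, 2, 2, 3, 3, 4])                    # length of stay in days
claim = np.array([4.1, 4.6, 5.0, 5.9, 6.2, 7.1, 7.8])    # claim amount in $1,000s

B, A = np.polyfit(los, claim, deg=1)   # slope and intercept of the best-fit line
residuals = claim - (A + B * los)

print(round(B, 2), round(A, 2))
print(round((residuals ** 2).sum(), 3))   # the quantity being minimized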

Returning to the plot of insurance claim amount and length of stay, we might wish to quantify the
extent to which the straight line fits the data. The fit measure most often used, the r-square measure,
has the dual advantages of being measured on a standardized scale and having a practical
interpretation. The r-square measure (which is the correlation squared, or r2, when there is a single
input field, and thus its name) is on a scale from 0 (no linear association) to 1 (perfect prediction).
Also, the r-square value can be interpreted as the proportion of variation in one field that can be
predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the
variation in one field if we know values of the other. You can think of this value as a measure of the
improvement in your ability to predict one field from the other (or others if there is more than one
input field).

Multiple regression represents a direct extension of simple regression. Instead of a single input field
(Yi = A + B * Xi + ei) multiple regression allows for more than one input field.

Yi = A + B1 * X1i + B2 * X2i + B3 * X3i + . . . + ei

in the prediction equation. While we are limited to the number of dimensions we can view in a single
plot, the regression equation allows for many input fields. When we run multiple regression we will
again be concerned with how well the equation fits the data, whether there are any significant linear
relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are
interested in the relative importance of the independent fields in predicting the output field.

Residuals and Outliers


Viewing the plot, we see that many points fall near the line, but some are more distant from it. For
each point, the difference between the value of the dependent field and the value predicted by the
equation (value on the line) is called the residual (ei). Points above the line have positive residuals
(they were under predicted), those below the line have negative residuals (they were over predicted),
and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large
residuals are of interest because they represent instances where the prediction line did poorly. As we
will see shortly in our detailed example, large residuals (gross deviations from the model) have been
used to identify data errors or possible instances of fraud (in application areas such as insurance
claims, invoice submission, or telephone and credit card usage).
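
A minimal sketch of this idea, using invented claim values and a common (but arbitrary) two-standard-deviation cutoff, is shown below; only the claim that the model predicts very poorly is flagged for review.

import numpy as np

actual = np.array([5200, 6100, 4900, 7400, 21000, 5600], dtype=float)
predicted = np.array([5000, 6300, 5100, 7000, 7200, 5500], dtype=float)

residuals = actual - predicted
cutoff = 2 * residuals.std()

flagged = np.abs(residuals) > cutoff
print(residuals)
print(flagged)   # only the fifth claim, far above its predicted value, is flagged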

Assumptions
Regression is usually performed on data for which the input and output fields are interval scale. In
addition, when statistical significance tests are performed, it is assumed that the deviations of points
around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be
independent of the predicted values (values on the line), which implies that the variation of the
residuals around the line is homogeneous (homogeneity of variance). PASW Modeler can provide
summaries and plots useful in evaluating these latter issues. One special case of the assumptions
involves the interval scale nature of the predictor field(s). A field coded as a dichotomy (say 0 and 1)


can technically be considered as an interval scale. An interval scale assumes that a one-unit change
has the same meaning throughout the range of the scale. If a field’s only possible codes are 0 and 1
(or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus
dichotomous or flag fields (for example gender) can be used as predictors in regression. It also
permits the use of categorical predictor fields if they are converted into a series of flag fields, each
coded 0 or 1; this technique is called dummy coding.

Note that the Regression node in PASW Modeler will only accept continuous inputs or ordinal fields
that contain numeric values (they will be treated as continuous, then). Thus if you have categorical
inputs, you must convert them to numeric dummy fields (using the SetToFlag node to create 0,1
dummy-coded fields, followed by a Type node to set the measurement level of these fields to
continuous), before they can be used by the Regression node.
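
As an illustration of dummy coding outside of Modeler (an alternative to the SetToFlag route just described; the region values are invented), pandas can build the 0/1 fields directly, dropping one category to avoid redundancy:

import pandas as pd

df = pd.DataFrame({"region": ["northwest", "southwest", "northeast", "southwest"]})

dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
print(dummies)
#    region_northwest  region_southwest
# 0                 1                 0
# 1                 0                 1
# 2                 0                 0
# 3                 0                 1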

6.3 An Example: Error or Fraud Detection in Claims


To illustrate linear regression we turn to a dataset containing insurance claims (CLAIM) for a single
medical treatment performed in a hospital (in the US, a single DRG or diagnostic related group). In
addition to the claim amount, the data file also contains patient age (AGE), length of hospital stay
(LOS) and a severity of illness category (ASG). This last field is based on several health measures and
higher scores indicate greater severity of the illness.

The plan is to build a regression model that predicts the total claim amount for a patient on the basis
of length of stay, severity of illness and patient age. Assuming the model fits adequately, we are then
interested in those patients that the model predicts poorly. Such cases can simply be instances of poor
model fit, or the result of predictors not included in the model, but they also might be due to errors on
the claims form or fraudulent entries.

Thus we are approaching the problem of fraud detection by identifying exceptions to the prediction
model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the
model, they may be more likely to be fraudulent or contain errors.

Some organizations perform random audits on claims applications and then classify them as
fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to
correctly classify new claims; logistic, discriminant, rule induction and neural networks have been
used for this purpose. However, when such a target field is not available, fraud detection then
involves searching for and identifying exceptional instances. Here, an exceptional instance is one that
the model predicts poorly.

We are using regression to build the model; if there were reasons to believe the model were more
complex (for example, contained nonlinear relations or complex interactions), then a neural network
or rule induction model could be applied, or higher-order and interaction terms could be added to the
regression equation. We begin by opening an existing stream.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory


Double-click on LinearRegress.str

Before the stream opens, the warning message in Figure 6.2 appears. The Regression node will be
replaced in later versions of Modeler by the Linear Models node, which is newly added to Modeler
14.0. We will continue to use the Regression node in this lesson, but also demonstrate the Linear
Models node at the end of the lesson.


Figure 6.2 Warning Message About Regression Node Expiration

Click OK
Double-click on the Type node

Figure 6.3 Type Node for Claims Data

Note that the severity of illness field (ASG) is continuous, although it has only the three integer values
0, 1, and 2. We will leave it as continuous since these values fall on an ordered scale (higher values
indicate greater severity).

The predictors for linear regression must be numeric. If you have predictors that are truly categorical
(nominal or ordinal), such as region of the U.S. (e.g., northwest, southwest, etc.), they must be
represented by dummy fields (coded either 0 or 1). However, the Regression node will not
automatically create dummy fields for these categories. You will need to create dummy fields
yourself using the SetToFlag node, and then enter the dummy fields in the model, leaving one out so
as not to create redundancy (ask your instructor for more detail if you are interested).

Close the Type node dialog


Double-click on the Regression node (named CLAIM)


Figure 6.4 Linear Regression Model Dialog

Simple options include whether a constant (intercept) will be used in the equation and the Method of
input field selection. By default (Enter), all inputs will be included in the linear regression equation.
With such a small number of predictor fields, we will simply add them all into the model together.
However, in the common situation of many input fields (most insurance claim forms would contain
far more information) a mechanism to select the most promising predictors is desirable. This could be
based on the domain knowledge of the business expert (here perhaps a health administrator). In
addition, an option may be chosen to select, from a larger set of independent fields, those that in some
statistical sense are the best predictors (Stepwise Method). In the stepwise method, the best input field
(according to a statistical criterion) is entered into the prediction equation. Then the next best input
field is entered, and so on, until a point is reached when no further input fields meet the criterion. The
stepwise method includes a check to insure that the fields entered into the equation before the current
step still meet the statistical criterion when the additional inputs are added. Variations on the stepwise
method (Forward—inputs are added one by one, as described above, but are never removed;
Backward—all inputs are entered, then the least significant input is removed, and this process is
repeated until only statistically significant inputs remain) are available as well.
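
For comparison, the sketch below runs a forward selection with scikit-learn. It is an analogue of the stepwise idea rather than the Regression node's method: SequentialFeatureSelector chooses fields by cross-validated score, not by the significance tests described here, and the synthetic data is only a stand-in.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data in which only 3 of the 8 inputs carry information.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)

selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # True for the inputs chosen by forward selection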

Click the Fields tab


Figure 6.5 Regression Fields Tab

The weighted least squares option (Use weight field check box) supports a form of regression in
which the variability of the output field is different for different values of the input fields; an
adjustment can be made for this if an input field is related to this degree of variation. In practice, this
option has been rarely used in data mining.

We see here the option to specify a partition field when there is such a field (but it doesn’t have the
default name of Partition). You can also specify one or more input fields as split fields; a model is
then built for each possible combination of the values of the selected split fields.

Click the Expert tab


Click the Expert Mode option button


Figure 6.6 Expert Options (Missing Values and Tolerance)

By default, the Regression node will only use records with valid values on the input and output fields
(this is often called listwise deletion). This option can be unchecked, in which case PASW Modeler
will attempt to use as much information as possible to estimate the Regression model, including
records where some of the fields have missing values. It does this through a method called pairwise
deletion of missing values. However, we recommend not using this option unless you are a very
experienced user of regression; using incomplete records in this manner can lead to computational
problems in estimating the regression equation. Instead, if there is a large amount of missing data, you
may wish to substitute valid values for the missing data before using the Regression node.

The Singularity tolerance will not allow an input field in the model unless at least .0001 (.01 %) of its
variation is independent of the other predictors. This prevents the linear regression model estimation
from failing due to multicollinearity (linear redundancy in the predictors). Most analysts would
recommend increasing the default tolerance value to at least .05, though, and also checking explicitly
for multicollinearity.

Click the Model tab, and then click Stepwise on the Method drop-down list
Click the Expert tab, and then click the Stepping button

Figure 6.7 Stepping Criteria and Tolerance Expert Options


You control the criteria used for input field entry and removal from the model. By default, an input
field must be statistically significant at the .05 level for entry and will be dropped from the model if
its significance value increases above .1.

Click Cancel
Click Output button

Figure 6.8 Advanced Output Options

These options control how much supplementary information concerning the regression analysis
displays. The results will appear in the Advanced tab of the generated model node in HTML format.
Confidence bands (95%) for the estimated regression coefficients can be requested (Confidence
interval). Summaries concerning relationships among the inputs can be obtained by requesting their
Covariance matrix or Collinearity Diagnostics. The latter are especially useful when you need to
identify the source and assess the level of redundancy in the predictors, and the more predictors you
have, the more likely that some may be highly correlated (you can ask your instructor for more
information on these diagnostics). Part and partial correlations measure the relationship between an
input and the output field, controlling for the other inputs. Descriptive statistics (Descriptives) include
means, standard deviations, and correlations; these summaries can also be obtained from the Statistics
or Data Audit node. The Durbin-Watson statistic can be used when running regression on time series
data and evaluates the degree to which adjacent residuals are correlated (regression assumes residuals
are uncorrelated).

Click Cancel
Click the Simple option button
Click Model tab, and then click Enter on the Method drop-down list (not shown)
Click Run button

After the model has run:

Edit the Regression generated model node in the stream


Click the Summary tab, and then expand the Analysis summary


Figure 6.9 Linear Regression Browser Window (Analysis Summary)

This Analysis summary contains only the equation relating the predictor fields to the output. We
could interpret the coefficients here, but since we don’t know whether they are statistically significant
or not, we will postpone this until we examine additional information in the Advanced tab. To reach
the more detailed results:

Click the Advanced tab


Increase the size of the window to see more of the output

The advanced output is formatted in HTML.

After listing the dependent (output) and independent (input) fields, Regression provides several
measures of how well the model fits the data. First is the multiple R, which is a generalization of the
correlation coefficient. If there are several input fields (our situation) then the multiple R represents
the unsigned (positive) correlation between the output and the optimal linear combination of the input
fields. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the R-square
measure can be interpreted as the proportion of variance of the output that can be predicted from the
input field(s). Here it is about 32% (.318), which is far from perfect prediction, but still substantial.
The adjusted R-square represents a technical improvement over the R-square in that it explicitly
adjusts for the number of input fields and sample size, and as such is preferred by many analysts.
Generally the two R-square values are very close in value; in fact, if they differ dramatically in
multiple regression, it is a sign that you have used too many inputs relative to your sample size, and
the adjusted R-square value should be more trusted. In our results, they are very close.
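
The adjustment itself is simple arithmetic. Applied to the values reported for this model (an R-square of .318, 293 records, and 3 inputs; the figure in the actual output may differ slightly if fewer records were used), it barely changes the result:

r2, n, p = 0.318, 293, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # about 0.311, very close to the unadjusted value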


Figure 6.10 Model Summary and Overall Significance Tests

While the fit measures indicate how well we can expect to predict the output or how well the line fits
the data, they do not tell whether there is a statistically significant relationship between the output and
input fields. The analysis of variance table presents technical summaries (sums of squares and mean
square statistics), but here we refer to variation accounted for by the prediction equation. We are
interested in determining whether there is a statistically significant (non-zero) linear relation between
the output and the input field(s) in the population. Since our analysis contains three input fields, we
test whether any linear relation differs from zero in the population from which the sample was taken.
The significance value accompanying the F test gives us the probability that we could obtain one or
more sample slope coefficients (which measure the straight-line relationships) as far from zero as
what we obtained, if there were no linear relations in the population. The result is highly significant
(significance probability less than .0005—the table value is rounded to .000—or 5 chances in
10,000). Now that we have established that there is a significant relationship between the claim
amount and one or more input fields and obtained fit measures, we turn to interpret the regression
coefficients.

Here we are interested in verifying that several expected relationships hold: (1) claims will increase
with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will
increase with age. Strictly speaking, this step is not necessary in order to identify cases that are
exceptional. However, in order to be confident in the model, it should make sense to a domain expert
in hospital claims. Since interpretation of regression models is made directly from the regression
coefficients, we turn to those next.


Figure 6.11 Regression Coefficients

The second column contains a list of the input fields plus the intercept (Constant). The estimated
coefficients in the B column are those we saw when we originally browsed the Linear Regression
generated model node; they are now accompanied by supporting statistical summaries.

Although the B coefficient estimates are important for prediction and interpretive purposes, analysts
usually look first to the t test at the end of each row to determine which input fields are significantly
related to the output field. Since three inputs are in the equation, we are testing if there is a linear
relationship between each input field and the output field after adjusting for the effects of the two
other inputs. Looking at the significance values (Sig.) we see that all three predictors are highly
significant (significance values are .004 or less). If any of the fields were found to be not significant,
you would typically rerun the regression after removing these input field(s).

The column labeled B contains the estimated regression coefficients we would use to deploy the
model via a prediction equation. The coefficient for length of stay indicates that on average, each
additional day spent in the hospital was associated with a claim increase of about $1,106. The
coefficient for admission severity group tells us that each one-unit increase in the severity code is
associated with a claim increase of $417. Finally, the age coefficient of about –$33 suggests that
claims decrease, on average, by $33 as patient age increases one year. This is counterintuitive and
should be examined by a domain expert (here a physician). Perhaps the youngest patients are at
greater risk or perhaps the type of insurance policy, which is linked somehow to age, influences the
claim amount. If there isn’t a convincing reason for this negative association, the data values for age
and claims should be examined more carefully (perhaps data errors or outliers are influencing the
results). Such oddities may have shown up in the original data exploration. We will not pursue this
issue here, but it certainly would be done in practice.

The constant or intercept of $3,027 is the predicted claim amount for someone with 0 days in the hospital, in the least severe illness category (0), and with age 0. This is clearly impossible.
This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital
(it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects
well beyond where there are any data. Thus the intercept cannot represent an actual patient, but still is
needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond
where the data are observed, since the assumption is that the same pattern continues. Here it clearly
cannot!
The Std. Error (of B) column contains standard errors of the estimated regression coefficients. These
provide a measure of the precision with which we estimate the B coefficients. The standard errors can
be used to create a 95% confidence band around the B coefficients (available as an Expert Output
option). In our example, the regression coefficient for length of stay is $1,106 and the standard error
is about $104. Thus we would not be surprised if in the population the true regression coefficient
were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that
the true population coefficient would be $300 or $2,000.

If we wish to predict claims based on length of stay, severity code and age, the formula would use the
estimated B coefficients:

Predicted Claims = $3,027 + $1,106 * length of stay + $417 * severity code – $33 * age
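
To deploy this equation outside Modeler, the coefficients can simply be applied to a record's input values. A minimal sketch in Python, scoring one hypothetical patient (the input values are made up for illustration):

# Score one hypothetical patient with the estimated regression coefficients
los, severity, age = 5, 3, 60   # hypothetical length of stay, severity code, and age
predicted_claim = 3027 + 1106 * los + 417 * severity - 33 * age
print(predicted_claim)          # 7828, i.e. a predicted claim of about $7,828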

Betas are standardized regression coefficients and are used to judge the relative importance of each of
several input fields. They are important because the values of the regression coefficients (Bs) are
influenced by the standard deviations of the input fields and their scale, and the beta coefficients
adjust for this. Here, not surprisingly, length of stay is the most important predictor of claim amount,
followed by severity group and age. Betas typically range from –1 to 1 and the further from 0, the
more influential the predictor field.

Predictor importance measures are also provided by PASW Modeler, but for statistical models such
as linear regression or discriminant analysis, they add little to your understanding (click the Model tab
to see the Predictor Importance graph). For the claims data, the predictor importance values have
roughly the same relative magnitude as the Betas, but the Betas are better suited for comparing predictor importance in a regression equation.

Points Poorly Fit by Model


The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate
substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but
they are inconsistent with the majority of cases and thus merit scrutiny. We first create a field that
stores the residuals, or errors in prediction, which we will then sort and display in a table.

Close the Regression generated model node


Place a Derive node from the Field Ops palette to the right of the Regression generated
model node
Connect the Regression generated model node to the Derive node
Edit the Derive node
Enter the new field name DIFF into the Derive field: text box
Enter the formula CLAIM – ‘$E-CLAIM’ into the Formula text box


Figure 6.12 Computing an Error (Residual) Field

The DIFF field measures the difference between the actual claim value (CLAIM) and the claim value
predicted by the model ($E-CLAIM). Since we are most interested in the large positive errors, we will
sort the data by DIFF before displaying it in a table.

Click OK to complete the Derive node


Place a Sort node to the right of the Derive node
Connect the Derive node to the Sort node
Edit the Sort node
Select DIFF as the Sort by field
Select Descending in the Order column (not shown)
Click OK to process the Sort request

Place a Table node to the right of the Sort node


Connect the Sort node to the Table node
Run the Table node


Figure 6.13 Errors Sorted in Descending Order

There are two records for which the claim values are much higher than the regression prediction. Both
are about $6,000 more than expected from the model. These would be the first claims to examine
more carefully. We could also examine the last few records for large over-predictions, which might
be errors as well.
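
The same check can be sketched outside Modeler as well. A minimal example in Python with pandas, assuming a data frame holding the actual claim and the model's prediction (the column names and toy values are invented for illustration, not Modeler's generated field names):

import pandas as pd

# Toy stand-in: one row per claim, with the actual value and the model prediction
df = pd.DataFrame({"CLAIM": [12000, 4500, 7300, 9800],
                   "PRED":  [6100, 4300, 7000, 9900]})
df["DIFF"] = df["CLAIM"] - df["PRED"]            # residual: actual minus predicted
print(df.sort_values("DIFF", ascending=False))   # largest under-predictions first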

6.4 Using Linear Models Node to Perform Regression


The Linear Models node was added to Modeler 14.0 to create linear models to predict a continuous
target with one or more predictors. Models are created with an equation that assumes a simple linear
relationship between the target and predictors.

The Linear Models node has more features than the Regression node, including the ability to create
the best subset model, several criteria for model selection, the option to limit the number of
predictors, and the use of bagging and boosting, as discussed earlier in this course. In addition, there
is a feature to automatically prepare the data for modeling, by transforming the target and predictors in
order to maximize the predictive power of the model. This includes outlier handling, adjusting the
measurement level of predictors, and merging similar categories. The Linear Models node automatically
creates dummy variables from categorical fields (that have nominal or ordinal measurement level), which
is another definite advantage.

The Linear Models node also uses the new Model Viewer to display a wealth of information about the
model that helps to evaluate it and understand the effect of the predictors.

For this example, we will reproduce the model from the Regression node and concentrate on the new
features and output, rather than attempt to find a more accurate model.


Close the Table node


Add a Linear node to the stream near the Type node
Connect the Linear node to the Type node
Edit the Linear node
Click the Build Options tab

Figure 6.14 Build Options Tab for Linear Models Node

There are five areas within the Build Options tab that correspond to the standard ones included with
most Classification nodes. We will create a standard model and not use bagging and boosting.

Click Basics settings


Figure 6.15 Basics Settings for Build Options Tab for Linear Models Node

The Linear node can use a variety of automatic methods to prepare the data for modeling. These
include changing the measurement levels of predictors (e.g., continuous predictors with fewer than five
values are changed to ordinal), outlier handling to reduce extreme values, special missing value
handling (e.g., missing values for nominal predictors are replaced with the mode), and supervised
merging of predictor categories to reduce the number of fields to be processed (helpful when there are
large numbers of predictors).

To keep things comparable, we will turn off this option.

Click Automatically prepare data check box to deselect this option


Click Model Selection setting


Figure 6.16 Model Selection Settings for Build Options Tab for Linear Models Node

There are three model selection methods available:


• None, which is equivalent to the Enter method for Regression with all fields entered on one
step
• Forward stepwise, which is standard stepwise model selection, although using by default an
Information Criterion measure for entry/removal rather than the F statistic
• Best Subsets, which tries "all possible" models, or at least a much larger subset of the
possible models than forward stepwise, to choose the best model based on various criteria.

Click the Model selection method: dropdown and select None

The default criterion for entry/removal is an information-theoretic measure (akin to AIC or BIC); R square, the F statistic, and an overfit prevention criterion are also available. The
significance values for entry and removal can be modified.

You can customize the maximum number of effects (predictors, including those coded as dummy
variables) in the final model, and you can customize the maximum number of model-building steps if
using forward stepwise or best subsets methods.
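
To make the forward stepwise idea concrete, here is a minimal sketch of forward selection driven by an information criterion (AIC, via statsmodels). This only illustrates the general approach; it is not the exact criterion or algorithm the Linear node uses:

import numpy as np
import statsmodels.api as sm

def forward_select_aic(X, y, candidates):
    # Greedy forward selection: add the predictor that lowers AIC the most at each
    # step, and stop as soon as no remaining predictor improves the criterion.
    selected = []
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only baseline
    remaining = list(candidates)
    while remaining:
        aics = {f: sm.OLS(y, sm.add_constant(X[selected + [f]])).fit().aic
                for f in remaining}
        best_f = min(aics, key=aics.get)
        if aics[best_f] >= best_aic:
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_aic = aics[best_f]
    return selected

Called as forward_select_aic(df[["LOS", "ASG", "AGE"]], df["CLAIM"], ["LOS", "ASG", "AGE"]), it would return the predictors in their order of entry.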


The Ensembles and Advanced settings are similar to those for other modeling nodes, so we won’t
review them here.

Click Model Options tab

Figure 6.17 Model Options Tab for Linear Models Node

Neither probability nor propensity is available for the Linear Models node because it is predicting a continuous target.

Click Run

After the model has executed:

Edit the CLAIM Linear Models generated model


Figure 6.18 Model Summary View of Linear Model

The Model Summary includes information on how the model was constructed. The Accuracy chart
displays adjusted R square (.311 for the model).

The second panel displays information on Automatic data preparation, but none was requested for this
example.

Click the Predictor Importance panel


Figure 6.19 Predictor Importance for Linear Model

Predictor importance is about equal for ASG and AGE, with LOS being most important. Importance
for a Linear model is calculated differently than for a Regression model. For Linear Models, a leave-one-predictor-out method is used, with model statistics compared with and without each predictor (see the PASW® Modeler 14 Algorithms Guide for details).
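
A rough way to see the leave-one-predictor-out idea (not PASW Modeler's exact computation; see the Algorithms Guide for that) is to refit the model with each predictor removed and compare the loss of fit. A minimal sketch in Python with scikit-learn, assuming X is a pandas data frame of predictors and y is the target:

from sklearn.linear_model import LinearRegression

def loo_importance(X, y):
    # Drop each predictor in turn and record how much R-square falls; a bigger
    # drop suggests a more important predictor. Values are normalized to sum to 1.
    full_r2 = LinearRegression().fit(X, y).score(X, y)
    drops = {}
    for col in X.columns:
        X_reduced = X.drop(columns=col)
        r2 = LinearRegression().fit(X_reduced, y).score(X_reduced, y)
        drops[col] = full_r2 - r2
    total = sum(drops.values())
    return {col: d / total for col, d in drops.items()}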

Click on the Effects panel


Figure 6.20 Model Effects Diagram for Linear Model

The Model Effects display shows each predictor in the model, sorted from top to bottom by predictor
importance. Connecting lines in the diagram are weighted based on effect significance, with greater
line width corresponding to more significant effects (smaller p-values).

There is a standard predictor importance slider, and also a significance slider that controls which effects are shown in the view, beyond those shown based on predictor importance. Effects with significance
values greater than the slider value are hidden. This does not change the model, but simply allows you
to focus on the most important effects.

The Style dropdown has a Table choice that will display the ANOVA table for the model.

Click Coefficients panel


Figure 6.21 Coefficients Diagram for Linear Model

The Coefficients panel displays the value of each coefficient in the model. If there are categorical
predictors, each dummy predictor created to represent these will be displayed separately.

In the default Diagram style, the chart displays the intercept first, and then sorts effects from top to
bottom by decreasing predictor importance. Connecting lines in the diagram are colored and weighted
based on coefficient significance, with greater line width corresponding to more significant
coefficients (smaller p-values). Blue corresponds to a positive effect, orange-yellow to a negative
effect.

You can also hover over a line and a popup will display the coefficients.

Click the Style: dropdown and select Table


Figure 6.22 Coefficients in Table Style for Linear Model

The Table view shows the coefficients, significance tests, and importance for the individual model
coefficients. After the intercept, the effects are sorted from top to bottom by decreasing predictor
importance. Within effects containing factors, coefficients are sorted by ascending order of data
values. These are the same regression coefficients and significance values as we obtained from the
Regression model.

There are three other panels that help assess whether the data and model meet statistical assumptions
for linear modeling. These include:

1. Predicted by Observed scatterplot, which should display a linear relationship between these
two fields. In addition, outlying cases can be identified that may influence model coefficients.
2. Residuals histogram, which allows you to check for the normality of the errors assumption.
However, with the typically large data files encountered in data mining, this assumption
becomes less critical.
3. Outliers table using Cook’s Distance, a measure of how much the residuals of all records would change if a particular record were excluded from the calculation of the model coefficients. A large Cook's distance indicates that excluding a record would change the coefficients substantially, so that record should be considered influential. Some analysts use a cutoff of around .10 and above to identify influential records (a small sketch of this computation follows the list).
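
Cook's distance can also be computed directly for an ordinary least squares fit. A minimal sketch with statsmodels, using the claims field names but an invented toy data frame in place of the real records:

import pandas as pd
import statsmodels.api as sm

# Toy stand-in for the claims data; in practice this would be the real records
df = pd.DataFrame({"LOS":   [2, 3, 5, 4, 10, 6, 3, 7],
                   "ASG":   [1, 2, 3, 2, 4, 3, 1, 2],
                   "AGE":   [45, 60, 38, 72, 55, 41, 66, 50],
                   "CLAIM": [5200, 6900, 9100, 6400, 21500, 9800, 5600, 10400]})

X = sm.add_constant(df[["LOS", "ASG", "AGE"]])
fit = sm.OLS(df["CLAIM"], X).fit()
cooks_d = fit.get_influence().cooks_distance[0]   # one Cook's distance per record
print(df[cooks_d > 0.10])                         # records above the informal cutoff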

Click on the Outliers panel


Figure 6.23 Outliers Table for Linear Model

For these data, the outliers have large values of CLAIM and so are harder to predict with the model.

Finally, there is an Estimated Means panel, which displays the model value of the target on the vertical
axis for each value of the predictor on the horizontal axis, holding all other predictors constant at their
mean. It provides a useful visualization of the total effects of each predictor on the target across the
full range of that predictor.

As we have reviewed, the Linear Models node provides more modeling options, and more visual
output, than the Regression node.


Summary Exercises
The exercises use the data file InsClaim.dat that was used in this lesson. The following table provides
details about the file.

InsClaim.dat contains insurance claim information from patients in a hospital. All patients were in
the same diagnosis related group (DRG). Interest is in building a prediction model of total charges
based on patient information and then identifying exceptions to the model (error or fraud detection).
The file contains about 300 records and the following fields:

ASG Severity of illness code (higher values mean more


seriously ill)
AGE Age
LOS Length of hospital stay (in days)
CLAIM Total charges in US dollars (total amount claimed on
form)

1. Using the insurance claims data, use the Stepwise method and compare the equation to the
one obtained using the Enter method. Are you surprised by the result? Why or why not? Try
the Forward and Backward methods. Do you find any differences?

2. Instead of examining errors in the original scale, analysts may prefer to express the residual
as a percent deviation from the prediction. Such a measure may be easier to communicate to a
wider audience. Add a Derive node that calculates a percent error. Name this field
PERERROR and use the following formula: 100* (CLAIM – '$E-CLAIM')/'$E-CLAIM'.
Compare this measure of error to the original DIFF. Do the same records stand out? What
conditions is percent error most sensitive to? Use the Histogram node to produce histograms
for either of the error fields, generate a Select node to select records with large errors, and
then display them in a table.

3. Use the Neural Net modeling node to predict CLAIM using a neural network. How does its
performance compare to linear regression? What does this suggest about the model?

4. Fit a C&R Tree model and make the same comparison. Examine the errors from the better of
the neural net and C&R Tree models (as you judge them). Do the same records consistently
display large errors?


Lesson 7: Cox Regression for Survival Data

Overview
• What is Survival Analysis?
• What to Look for in Survival Analysis
• Cox Regression
• Checking the Proportional Hazards Assumption
• Predictions from a Cox Model

Data
In this lesson we use the data file customer_dbase.sav that contains 5,000 records from customers of a
telecommunications firm. The firm has collected a wide variety of consumer information on its
customers, and we are interested in studying the length of time customers retain their primary credit
card—in other words, we wish to model the time for these customers to churn—not renew—their
primary credit card. We will use several predictors to model churn to learn their effect on time to
churn.

7.1 Introduction
Survival analysis studies the length of time to an event of interest. The analysis can involve no
predictors, or it can investigate survival time as a function of one or more predictor fields. The
technique was originally used in medical research to study the amount of time patients survive
following onset of a disease (hence the name survival analysis). In data mining, it has been applied to
model such diverse outcomes as length of time a person subscribes to a newspaper or magazine, the
time employees spend with a company, the time to failure for electrical or mechanical components,
the time to make a second purchase from an online retailer, or the length of tenure for renters of
commercial properties.

The Cox node in PASW Modeler can perform both univariate (no predictors) and multivariate (Cox
regression) survival analysis. The former type of analysis is called Kaplan-Meier, often used to
compare survival time for treatment and control groups in medical studies. In data mining, there are
many possible predictors, so Cox regression, a semi-parametric technique, is used. In this lesson we
will review the concepts and theory behind survival analysis and Cox regression, and then perform a
Cox regression predicting churn.


7.2 What is Survival Analysis?


Survival analysis examines the length of time to a critical event, possibly as a function of predictor
fields. Since time has interval scale properties (actually ratio scale), such methods as regression or
ANOVA might first come to mind as possible analytical methods to use with time ordered data..
However, survival data often contains censored values: observations for which the final outcome is
not known, yet for which some information is available. For example, in a data set of subscribers to a
newspaper, some will have cancelled their subscription, but many others will still be subscribers at
the time of data collection or analysis. The former group comprises those who did not “survive,” while the latter group did survive until the end of the study. Their outcome is censored, which simply
means that we don’t know when they will end their subscription. Ordinary regression, or other
techniques, has no easy way of handling a data value of 15+ years, meaning we know the subscriber
has received the newspaper for at least 15 years so far, with an unknown future survival. Survival
analysis can explicitly incorporate and handle such information (see next section).

The main outcome measure in survival analysis is the length of time to a critical event. Such
summaries as the mean and median survival time with their accompanying standard errors are useful.
The mean survival time is not simply the sample mean value, but is estimated using the cumulative
survival plot (discussed below) that adjusts for censored data.

An important summary is the survival function over time. The cumulative survival plot (or table)
displays the estimated probability of surviving longer than the time displayed in the chart (or table).
An example of a cumulative survival plot for two treatment groups in a medical study appears below.

Figure 7.1 Survival Plot for Two Treatment Groups

We see the probability of surviving longer than a given time starts out at 1.0 (when time= 0) and
decreases over time. The probability is adjusted at the time of each critical event (here a death). There
were censored observations (patients who could not be contacted beyond a certain point, died of other
causes, or outlived the study) that appear as small plus signs (+) along the plot. They are used in the
denominator when calculating survival probability up until the time of their censoring (since they are
known to be alive up until that point), and discarded thereafter.

Note that the plot is an empirical curve, adjusted for each critical event. This approach does not make
distribution assumptions about the shape of the curve and is called nonparametric. This is the
approach taken by the Kaplan-Meier method; there are parametric models of survival analysis that
would fit smooth curves to the data viewed above. For this reason, when comparing survival
functions of different groups using the Kaplan-Meier method, nonparametric tests are used.

A related function is called the hazard function. It measures the rate of occurrence per unit time of the
critical event, at a given instant. As such it represents the risk of the event occurring at a particular
time. A medical researcher would be interested in looking at time points with elevated hazard rates, as
would a renter of retail space, since these would be times of greater risk of the event (death, tenant
leaving). PASW Modeler can produce cumulative hazard plots. For more detailed discussion of these
issues see Lee (1992), Kleinbaum (1996), or Klein and Moeschberger (1997).

Censoring
A distinguishing feature of survival analysis is its use of censored data. Users of other advanced
statistical methods are familiar with missing data, which are observations for which no value was
recorded. Censored values contain some information, but the final value is hidden or not yet known.
For example, in a time to churn analysis, at the end of the data collection phase for the study, a
customer may still be a subscriber, effectively outliving the study. The analyst would thus know that
the customer survived for at least n number of months, but does not know the final date when the
subscription will end.

Censored data is included in the calculation of the survival function. A censored case is used in the
denominator to calculate survival probabilities for data involving events occurring with shorter time
values than the censored case. As an example, if we know a customer survives 60 months and is then
censored, use is made of the fact that the customer did not churn during the first 60 months. After the
time of censoring, the censored value is dropped from any survival calculations.
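
A tiny worked illustration of how censored cases enter the calculation (a hand-rolled Kaplan-Meier sketch in Python with made-up times; the Cox node handles all of this automatically):

# Each pair is (time, event): event=1 means the customer churned at that time,
# event=0 means the customer was censored (still a subscriber when data collection ended)
data = [(1, 1), (2, 0), (3, 1), (3, 1), (5, 0), (6, 1)]

survival = 1.0
at_risk = len(data)
for t in sorted(set(time for time, _ in data)):
    churned = sum(1 for time, event in data if time == t and event == 1)
    if churned:
        survival *= (at_risk - churned) / at_risk   # censored cases still count in the denominator
        print(f"time {t}: estimated S(t) = {survival:.3f}")
    at_risk -= sum(1 for time, _ in data if time == t)  # drop both events and censored cases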

Data Sampling and Censoring


Technically, the type of censoring we have been discussing is called “right-censoring.” If we think of
time flowing from left to right on a horizontal axis, with the time values increasing, then a case that is
right-censored extends beyond the end of data collection, as exemplified in the chart in Figure 7.2.

The mirror image of this is left-censoring, of which there are two types. “Left censoring” occurs when
the case is never a part of the study because it occurred before the time the study began. “Left-
truncated” cases are those which have an unmeasured beginning time, then are measured, so that
when the event of interest occurs, we can only say that they have been a customer for at least as long
as measured (but it could be much longer).


Figure 7.2 Types of Censoring in Survival Data

Both left censoring and left truncation can cause problems for models because they lead to a biased
sample. In data-mining applications, it is common to have customer history data taken from a cross-
sectional extraction from existing databases, using all customers active as of some fixed date when
the study begins. This approach will systematically undersample customers with short survival times
(those who are left-censored) and thus overestimates survival. Left truncation is less of an issue with
business databases because the time a customer begins is normally well known (although when
companies merge and combine customers, incompatible data systems can lead to uncertainties in
customer history).

To solve the left-censored problem, it is better to sample on a history of customers over time,
sampling not cross-sectionally but over some defined time period. This means that you don’t need to
choose all those customers who began, say, on a fixed date. Instead, customers can enter, and leave the
study (because they churned), over some long time interval. This doesn’t imply that a survival study
must actually go on for many months or years in real-time; instead, it means that the data sampling
must be done over an extended time interval.

Too much right-censored data can also be a problem simply because the event of interest—here
churning—will have occurred too infrequently.

Why Not Regression or Logistic Regression?


Those who first encounter survival data sometimes wonder why linear regression can’t be used to
predict survival time, or why logistic regression or other methods to predict dichotomous outcomes
can’t be used to predict whether an event has occurred or not. Besides the censoring issue, there are
several reasons, but the key one related to linear regression is that the residuals from the regression
model are unlikely to be normally distributed, even for large sample sizes. This is because the time to
event distribution is likely to be non-normal, even bimodal in many real-world applications (there are
certain intervals when more customers are likely to churn, such as at the end of their contract period).

Logistic regression doesn’t assume a particular distribution of the residuals, but it also doesn’t handle
censored data appropriately. It is possible to follow a sample of say, 1000 customers, from the time
they obtain a credit card until the last one has dropped that card (many years later). This type of
dataset has no right-censored data because the status of every customer is known along with when the
event of interest occurred. However, collecting this type of data is often impractical; moreover, since
conditions change rapidly in many businesses, the effects of predictors on credit card churn may
change over time, so collecting data over a long time interval may, perversely, lead to a less accurate
model. This again argues for using Cox Regression rather than another technique.

Sampling on the Outcome Field


Survival analysis is based on the idea that there is one type of event that causes “death” in a study. In
the context of business data, this means that models should be developed to predict only one type of
churn. As illustration, if we knew that some customers dropped their DSL internet service to sign up
with a cable internet provider, while other customers dropped their DSL service because they will be
using broadband service available through satellite, it would be better to develop a different model for
each group. Often, though, what happens to a customer after they cancel a service or subscription is
unknown, but if so, it does imply that a model may be mixing customers who have different
influences that lead them to churn. In turn, this will lead to less accurate models.

7.3 Cox Regression


Cox regression is a survival model that represents hazard as a function of time and predictor fields
that can be continuous or categorical. Because it allows for multiple predictors, it is more general than
the Kaplan-Meier method. It is considered a semi-parametric model because it does not require a
particular functional form to the baseline hazard or survival curves. This allows Cox regression to be
applied to datasets exhibiting very different survival patterns. As we will see shortly, the model does
assume that the ratio of the hazard rate between any two individuals or groups remains constant over
time (for this reason it is also called Cox’s proportional hazard model). If this assumption is not met,
the Cox model has been extended to incorporate time-varying predictors, in other words, interaction
terms between predictors and time (although PASW Modeler doesn’t provide an automatic way to
create and incorporate such effects).

In the Cox Regression model, the hazard function at time t as a function of predictors X1 to Xp, can be
expressed as:

h(t | X1, X2, ..., Xp) = h0(t) * exp(B1*X1 + B2*X2 + ... + Bp*Xp)

The hazard is a measure of the potential for the event to occur at a particular time t, given that the
event did not yet occur. Larger values of the hazard function indicate greater potential for the event to
occur.

The hazard function is expressed as the product of two components: a base hazard function that changes over time and is independent of the predictors, h0(t), and factor or covariate effects, exp(B1*X1 + B2*X2 + ... + Bp*Xp), that are independent of time and adjust the base hazard function. The shape of the base hazard function is unspecified and is empirically determined by the data (the nonparametric aspect of the model). Since the model effects relate to the hazard through the exponential function, the exponentiated value of a model coefficient (say exp(B1)) can be interpreted as the change in the hazard function associated with a one-unit change in the predictor (X1), controlling for other effects in the model.

The separation of the model into two components, one a function of time alone and the other a
function of the predictors alone, implies that the ratio of the hazard functions for any two individuals
or groups will be a constant over time (the base hazard function cancels out, leaving the ratio constant
over time and based on the differences in predictor values). If this model assumption (the effects of
the predictors are constant over time) is not met, then the Cox Regression model will not provide the
best fit to the data.
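
A small numeric illustration of why the ratio stays constant: for any two customers the baseline hazard h0(t) cancels, leaving only the difference in their predictor values (the coefficients and predictor values below are made up):

import numpy as np

b = np.array([0.4, -0.08])        # hypothetical coefficients B1, B2
x_a = np.array([1.0, 30.0])       # predictor values for customer A
x_b = np.array([0.0, 30.0])       # predictor values for customer B

for h0_t in (0.01, 0.05, 0.20):   # any baseline hazard value, at any time t
    ratio = (h0_t * np.exp(x_a @ b)) / (h0_t * np.exp(x_b @ b))
    print(round(ratio, 4))        # always exp(b @ (x_a - x_b)) = exp(0.4), about 1.4918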

Since the hazard function is related to the cumulative survival function, this model can alternatively
be expressed in terms of cumulative survival. The value of the survival function is the probability that
the given event has not occurred by time t. Survival plots, based on the model, are also helpful in
studying the model and are more intuitive for most people.

Note
The Cox node can perform a Kaplan-Meier analysis by specifying a model with no input fields.

Missing Data
The Cox regression uses only records with valid data for all input and target fields. Thus, if a value is
missing for a field, that case will not be used in the analysis. If you have a significant amount of
missing data, you may want to either not use fields with lots of missing data, or you may wish to
estimate/impute missing data to increase the number of valid cases for the analysis.

7.4 Cox Regression to Predict Churn


The customer database we use in this lesson contains a field (cardtenure) that measures the length of
time a customer has, or did hold, their primary credit card. This is the survival time field, which must
be of measurement level continuous. The target field, which must be of measurement level flag, is
churn, which records whether or not a customer switched cards. We will use several predictors to
model churn, mainly customer demographics.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory


Double-click on Cox Regression.str
Run the Table node
Close the Table window
Edit the Type node


Figure 7.3 Type Node for Customer Database

There are over 100 fields in the file, and we only want to use a few for the Cox Regression to keep
this example manageable. The best approach is to set the Role for all fields to None, and then
selectively change specific fields.

Right-click on any field and select Select All from the context menu
Right-click on any field and select Set Role…None
Change the Role for gender, age, ed, income, marital, cardfee, and cardtenure to Input
Change the Role of churn to Target
Click OK

Cox regression will create appropriate dummy fields to represent fields that are categorical without
user intervention.

It can be instructive to look at the distribution of the key fields of cardtenure and churn before
proceeding further.

Add a Distribution node to the stream


Connect the Type node to the Distribution node
Specify churn as the Field
Click Run

Out of the 5000 records, about three-quarters still have their primary credit card, while one-quarter changed their card. These are reasonable proportions of the two outcomes for a Cox regression.


Figure 7.4 Distribution of churn

Next let’s look at the distribution of cardtenure.

Close the Distribution window


Add a Histogram node to the stream
Connect the Type node to the Histogram node
Specify cardtenure as the Field
Click Run

The cardtenure values are in years, and the distribution ranges from 0 (people who have recently
obtained a new primary credit card) to 40 (people who have had their primary credit card for a very
long time). The distribution drops over time, as expected, and it is choppy, rather than smooth, which
is often typical of this type of data.

Figure 7.5 Histogram of cardtenure


As a final step in data exploration, let’s look at the relationship between these two key fields.

Close the Histogram window


Edit the Histogram node
Specify churn as the Color overlay field
Click Options
Click Normalize by color (not shown)
Click Run

At first, the smoothly declining percentage of churn=1 values over cardtenure may come as a surprise. But actually, as with many products, most consumers who switch do so early. So churn
rates are initially around 50% for the first few years. But then, over time, switching the primary credit
card still occurs, but less frequently as a percentage of those customers who have survived that long.
This trend continues right up to the end of data collection at 40 years. This will help us understand
predictions from a Cox regression model in a later section.

Figure 7.6 Overlay Histogram of cardtenure by churn

We are now ready to add the Cox regression node to the stream and review its settings.

Add a Cox modeling node to the stream


Connect the Type node to the Cox node
Edit the Cox node

We selected the input and target fields in the Type node, and the Cox node correctly lists churn as the
target field. However, we also need to specify which field contains survival times.


Figure 7.7 Fields Tab in Cox Node

Click Survival time field chooser and select cardtenure


Click Model tab

As with other regression-based models, Cox regression can estimate a model either by using all the
predictors, or by performing forwards or backwards stepwise model-building. In this example, we use
the default choice of Enter and use all the predictors. If you have many predictors and want to use one
of the stepwise methods, you definitely should use a testing or validation sample or partition.

Complex models can be built by specifying a Model type of Custom and then selecting specific terms.
This allows you to incorporate interaction terms into a model, for example.

If you would prefer to see separate analyses for discrete groups of customers, a Groups: field can be
selected. This field should be categorical, and a separate model will be developed for each category in
this field. Alternatively, such a field can be included in the model as a predictor, but if you believe
that coefficients and survival times are quite different for the various groups, or you want to assess
whether this is the case, estimating separate models can be a useful approach.


Figure 7.8 Model Tab in Cox Node

Click the Expert tab


Click the Expert options button
Click the Output button

By default only the most basic output is supplied by the Cox node, and neither the survival nor hazard plots are included.

The check box for Display baseline function will display the baseline hazard function and cumulative
survival at the mean of the covariates.


Figure 7.9 Expert Output Options

Click Display baseline function check box


Click Survival and Hazard check boxes under Plots area

When the plots selections are made, the bottom of the dialog box becomes active, and the fields
included in the model are listed, with their value set to the mean. This is necessary because the
survival and hazard functions depend on the values of the predictors, and you must use constant
values for the predictors to plot the functions versus time. The default is to use the mean, but you can
enter your own values for the plot using the grid. This would allow you to get plots for survival for a
particular type of customer. For categorical inputs, indicator coding is used, so there is a regression
coefficient for each category (except the last). For a categorical input, the mean value for each
dummy field is equal to the proportion of cases in the category corresponding to that indicator
contrast.

You can also request a separate line for each value of a categorical field on any plot. We’ll do so for
marital (the field used here does not have to be an input to the model). This is not equivalent to
adding a Groups field to the model.

Click the Plot a separate line for each value dropdown and select marital


Figure 7.10 Completed Advanced Output Dialog Selections

We will now briefly look at the Settings tab.

Click OK
Click Settings tab

The Settings tab has several options to specify how a Cox model should be applied to make
predictions. The model can be scored at regular time intervals, over one or more time periods, with
the unit of time defined by the field used in the model. Alternatively, another time field can be listed.

In many cases, customers or the equivalent will already have a survival time which must be taken into
account (not everyone is beginning as a new customer who has just acquired a product or
subscription), so the setting Past survival time allows you to select a field which contains this
information (this is often the same field as used for survival time itself, such as cardtenure in the
current example).

These options become relevant when a model has already been developed, so we won’t say more
about them at this point in the lesson.


Figure 7.11 Settings Tab in a Cox Node

Click Run

After the model runs:

Right-click the Cox model in the stream named churn and select Edit

The Categorical Variable Coding table shows the dummy variable coding for the categorical variables
in the model. Unlike in other PASW Modeler nodes, Cox regression uses indicator coding, with the
last category as the reference category. This means that for flag variables such as gender, which is
coded 0 for males and 1 for females, the coding within Cox regression reverses this ordering. This is
very important to note for interpretation of the model.

If you prefer the original ordering, you can change the values of a field with a Reclassify node.


Figure 7.12 Categorical Variable Coding

The next set of tables includes tests of the model as a whole. Since all predictors are entered at once,
the values reported in the Change From Previous Step and Change From Previous Block sections are
identical. Here we are testing whether the effect of one or more of the predictor fields is significantly
different from zero in the population. This is analogous to the overall F test used in regression
analysis. The results indicate that at least one predictor is significantly related to the hazard because
the significance values are well below .05 or .01. (An omnibus test is also done using the score
statistic, which is used in stepwise predictor selection.)

Figure 7.13 Omnibus Tests of Model Coefficients


Figure 7.14 Variables in the Equation Table

The next table—Variables in the Equation—contains information on the individual effects of each
predictor. To interpret these coefficients, recall that the model predicts the hazard directly, not
survival time, and that in the scale of the predictors, the natural log of the hazard is being predicted.
Therefore, the B coefficient relates the change in natural log of the hazard to a one unit change in a
predictor, controlling for other predictors. As such they are difficult to understand (although positive
values are associated with increasing hazard and lower survival time, while negative values are
associated with decreasing hazard and increasing survival times). For this reason, the Exp(B) column
is usually used when interpreting the results.

The significance of each predictor is tested using the Wald statistic and the associated probability
values are reported in the Sig. column. Here, four of the predictors are significant, but gender and
cardfee are not.

The positive B values for ed and marital indicate, respectively, that increasing education and being
unmarried (check the coding) are associated with increasing hazard for churn. The negative B values
for age and income indicate that increasing age and income lead to reduced hazard for churn.

The Exp(B) column presents the estimated change in risk (hazard) associated with a one-unit change
in a predictor, controlling for the other predictors. When the predictor is categorical and indicator
coding is used, Exp(B) represents the change in hazard when comparing the reference category to
another category and is referred to as relative risk. Exp(B) is also called the hazard ratio, since it
represents the ratio of the hazards for two individuals who differ by one unit in the predictor of
interest. The Exp(B) value for marital is 1.469; this means that, other things being equal, the hazard
for customers who are unmarried is 1.469 times greater than the hazard for married customers.

This does not mean that an unmarried customer will churn 1.469 times faster, or that 1.469 times as
many unmarried customers as married customers will churn in a given time interval. It simply means
that an unmarried customer has odds of 1.469:1 that he or she will churn compared to married
customers.

For a continuous predictor such as age, the hazard ratio refers to the effect on the hazard of each one-unit change. So a one-year increase in age multiplies the hazard by .923. A ten-year difference in age (comparing, say, a 40-year-old to a 30-year-old customer) is calculated by multiplying the hazard ratios and so corresponds to .923^10 = .449, which is a substantial reduction in the risk of churning for the older customer (older customers tend to hang on to their credit cards).
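
The same arithmetic, written out (the two hazard ratios are the Exp(B) values quoted above):

age_hr_per_year = 0.923
print(round(age_hr_per_year ** 10, 3))   # 0.449: hazard ratio for a ten-year age difference
marital_hr = 1.469
print(round(1 / marital_hr, 3))          # about 0.681: the same marital contrast seen from the married side
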
The next, rather lengthy table is the Survival table which contains the baseline cumulative hazard, the
survival estimates, and the cumulative hazard. The first portion of the table is displayed in Figure
7.15. Although these summaries are technical, it is worth noting that a hazard can take on values
above 1 (it is in theory unbounded on the positive side). The Baseline Cum(ulative) Hazard is the
hazard rate for the model when all predictors are set to 0. There is a row for each time value in the
data.

Associated with this are both the Survival values (with a standard error estimate), and a Cum(ulative)
Hazard function, both calculated at the mean of all the predictors (or covariates, as labeled in the
table). Survival begins at or near 1.0 and drops off toward 0; Cumulative hazard begins near 0 and
increases.

Figure 7.15 Survival Table

The survival time gives the estimated survival rate at a specific time for customers who have mean
values on the predictors (a “hypothetical” customer). So for our model, survival—retaining the
primary credit card—has dropped to .933, or 93.3%, by the sixth year.

These values are plotted in the graphs that follow.


Figure 7.16 Cumulative Survival Function

The cumulative survival values are plotted in the Survival Function chart. Time held primary credit
card is on the horizontal axis, and cumulative survival is on the vertical axis. Until the last few years,
cumulative survival for a typical customer declines steadily at about a constant pace, then more
steeply over the last two or three years. The curve is not smooth, but jagged because survival for these
data is measured by the year (rather than in months or even weeks).

Figure 7.17 Cumulative Hazard Function


The cumulative hazard function plot is somewhat the mirror image of the survival plot; again, note
that the hazard can take on values above 1. As cumulative survival decreases, cumulative hazard—the
chance of not retaining the primary credit card—increases.

There are survival and hazard graphs with separate lines for each category of marital. Looking at the
cumulative survival function for what is labeled "patterns 1-2" shows that the survival curve for
unmarried customers is always below the line for married customers. This means that the cumulative
survival for unmarried customers is always lower than for married customers, which is consistent
with the regression coefficient for marital, which had an Exp(B) value of 1.469 and indicated that
there was an increased hazard for unmarried customers. Differences in survival gradually increase
over time between these two groups until about 35 years, where estimates grow less precise. This type
of plot is a useful adjunct to interpretation of model coefficients and helps to gain additional model understanding.

Figure 7.18 Cumulative Survival Function for Married and Unmarried Customers

Close the Cox model Browser window

That completes a review of the most important types of output for a Cox model.

7.5 Checking the Proportional Hazards Assumption


Cox regression is based on a proportional hazards model, but we don’t know whether that assumption
is valid for these data (that the hazard functions of any two individuals or groups remain in constant
proportion over time). There are several approaches to test this for predictors. For our purposes, the
two chief methods are to:

• Examine the survival or hazard plots with the categorical predictor as the factor


• Examine the survival or log-minus-log plot in Cox Regression with the categorical predictor
specified as a Groups variable

We will illustrate by specifying marital as the Groups field within Cox Regression and examining the
survival and log-minus-log plots.

Edit the Cox modeling node


Click the Expert tab
Click the Output button
Click Log minus log check box
Remove marital from the Plot a separate line for each value box (not shown)
Click OK
Click Model tab
Select marital as the Groups field

Figure 7.19 Marital Selected As Groups Field

A separate base hazard function will be fit to each category of the Groups field. If marital does not
conform to the proportional hazards assumption, this should be revealed in the survival and log minus
log plots, which will present a line for each category, i.e., married and unmarried customers.

Click Run
Right-click on the generated model and select Browse

Since the focus of this analysis is on the proportional hazards assumption, we will not examine the
model results but move directly to the diagnostic plots.


Figure 7.20 Survival Plot for Marital Categories

The survival plots for the married and unmarried customers remain roughly parallel over time,
suggesting that the hazard ratio for the two groups is reasonably constant over time. We can ignore
the last few years where there are few customers and so estimates are less precise.

Because marital is not used as a predictor in the model (it is a groups field), the expected survival
values allow us to assess whether marital would meet the proportional hazards assumption in this
context.


Figure 7.21 Log Minus Log Plot for Marital Categories

Another way of examining the proportional hazards assumption is through the ln(-ln) plot. Again, we
simply have to judge whether the lines for the different categories are parallel. Here the two survival
lines do initially diverge after a few years, and then slowly converge over time until about year 30.
However, compared to the full range of the data, the divergence is small between the two curves.
Although this requires some judgment, in this instance we will conclude that the proportional
hazards assumption is met for marital.

If the assumption is not met, a time-varying covariate model can be fit to the data. Or you can drop
this predictor from the model, which is a viable option when you have dozens of predictors.

Checking the proportional hazards assumption should be done, in theory, for all significant
categorical predictors in the model.

We can now look at the predictions made by the model.

7.6 Predictions from a Cox Model


The Cox regression model node can be added to a stream to make predictions. Given that the data are
collected over time, the idea of a “prediction” of the model is complicated, and in fact the model can
make several predictions for the same customer, over different time periods, i.e., survival over the
next time period, over the next five time periods, and so forth.

We want to rerun the basic model without marital specified as a Groups field.


Close the Cox generated model browser


Edit the Cox modeling node
Click the Model tab
Remove marital as the Groups field
Click Run

After the model has run:

Add a Table node to the stream and connect it to the generated Cox model node
Run the Table node
Scroll to the right to see the new fields

Figure 7.22 Output Fields from a Cox Regression Model with Default Settings

Four fields are created by default from a Cox model. $C-churn-1 is the prediction of the Cox model
for whether or not this customer will churn in the time interval that the user has requested (we will
review the default interval next). The field $CP-churn-1 contains the probability associated with this
prediction (whether it is associated with the churn or no churn condition). As can be seen from the
table, the probabilities are very high for almost all the predictions. The last two columns contain the
probabilities for the churn=0 and churn=1 conditions (the field $CP-0-1 stands for the predicted
probability associated with churn=0 for the first predicted interval). All the predicted probabilities
seem to be taken from the $CP-0-1 field.

It is very important to understand that the predictions we see in the Table do not take current survival
time for a customer into account. This may seem odd, given that the model was developed with
survival time data, and model nuggets usually make predictions based on values of the predictors, but
because predicting into the future is so central to the use of Cox models, no survival time value is
used by default in the model. To see this we need to view the Settings tab on the Cox generated
model.


Close the Table window


Edit the Cox model node
Click the Settings tab

Figure 7.23 Settings Tab Options on Cox Generated Model

The options available within this tab are the same as those available in the modeling node, allowing a
model to be developed that simultaneously makes predictions. Note that there is no Time field listed,
nor any Past survival time field. By default, survival will be predicted for a time interval of 1.0,
which is defined in units of the time field used to create the model (here cardtenure). So the
prediction will be for one year, for one time period (Number of time periods to score).

But—and here is the key to understanding how predictions are made with Cox regression—since no
past survival time field is specified, the predictions are equivalent to predicting whether each
customer will keep their primary credit card for 1 year after receiving it. These predictions are not
whether a customer who has survived this long (as measured by cardtenure) will keep his or her card
for another year. Since the odds of a customer churning in the first year are not terribly large, very
few customers will be predicted to churn.

To see what the predictions of the model are for each customer for their actual survival time (which
varies by customer), we need to set the time field to cardtenure. This will request the Cox model
predict into the “future” the number of time intervals represented by cardtenure for each customer.
Recall from the previous example that, when no past survival time is listed, the model effectively
begins at time 0.

Click Time field option button


Specify the field as cardtenure


Click OK
Run the Table node

Figure 7.24 Predictions of Cox Model at Current Survival Time

In the first 20 records visible in Figure 7.24 there are still no predictions that a customer will churn.
However, the probabilities of churning ($CP-1-1) are much higher than previously (see Figure 7.22).
These are the predictions of the model for the actual survival times of each customer.

We can now use an Analysis node to review the model predictions for churn.

Close the Table window


Add an Analysis node to the stream
Connect the Cox generated model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box (not shown)
Click Run


Figure 7.25 Predictions for Cox Model

Overall the model is correct for 72.7% of the customers. However, looking at the Coincidence Matrix,
we see that the model performs very well for customers who didn’t churn, but not nearly as well for
those who did. But we are using only a few predictors to keep the example simple; in a real-life data-
mining project using Cox regression, performance is very likely to be much better than this.

Given what we have learned in the two instances of prediction we have reviewed, let’s return to the
Cox generated model to discuss prediction in more depth.

Close the Analysis window


Edit the Cox generated model


Figure 7.26 Cox Model Settings Tab

The section of the Settings tab labeled Predict survival at future times specified as: allows the user to specify future time either as regular time intervals or with an actual time field. This section is separate from the specification of past (really current) survival time. As we have seen, if no past survival time is listed, the model predicts as if each customer were at time 0. Let's set the past survival time to cardtenure and predict one time interval (one year) into the future.

Click Regular intervals option button


Specify cardtenure as the Past survival time field


Figure 7.27 Cox Model to Predict One Year Beyond Current Survival Time

Click OK
Run the Table node


Figure 7.28 Cox Predictions One Year Beyond Current Survival

There are two points to notice about these data. First, for case 16, there are values of $null$ for all
four output fields. If we scroll to the left and check the value of cardtenure for this customer, we will
learn that it is 40. Predictions cannot be made outside the range of survival values in the data, so for
any customer already at the upper end of the survival range, no predictions can be made.

Second, if you compare Figure 7.28 to Figure 7.24, you may see differences that are at first puzzling.
The customer in the first row has a cardtenure value of 2. Thus, the model we just ran is predicting
one year ahead for this customer, or to year 3, and the probability of churning in year 3 is 0.135. But
if we look back at Figure 7.24 we see a probability of churning of 0.251, which is greater. You might
wonder how the probability of churning can go down; the probability of the terminal event occurring
should either stay constant or increase over time (unless we have time-varying covariates). The
answer is that predictions that take into account past survival are actually conditional predictions, and
the probability listed is a conditional probability. This means that, in Figure 7.28, the probability of
0.135 is the probability of this customer churning by the end of year 3, given that he or she has
survived (not churned) through year 2.
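
To make this concrete, the conditional probability can be expressed in terms of the survival function S(t), the probability of not churning by time t. The short Python sketch below is purely illustrative: the survival values are invented, not taken from this model, but it shows how the probability of churning by year 3 given survival through year 2 (1 − S(3)/S(2)) can be smaller than the unconditional probability of churning by year 3 (1 − S(3)).

# Hypothetical survival probabilities S(t) = P(customer has not churned by year t).
# These values are invented for illustration; they are not output from the model above.
S = {0: 1.00, 1: 0.90, 2: 0.75, 3: 0.65}

# Unconditional probability of churning by year 3 (a prediction made from time 0)
p_uncond = 1 - S[3]                # 0.35

# Conditional probability of churning by year 3, given survival through year 2
p_cond = 1 - S[3] / S[2]           # about 0.13

print(round(p_uncond, 3), round(p_cond, 3))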

Thus, if we used the Analysis node to examine predictions from this model, we would see very few
customers predicted to churn. This is because the probability of churning in the next year is always
rather small for these data, so there are few predictions that a customer will churn.


Once you are satisfied with a model, you will want to score customers to find those most likely to
churn. A common situation is to predict some time into the future, given past survival time. We’ll
predict five years into the future.

There is no need to predict for those customers who have already churned; we would only be
interested in predicting churn for customers who are still current and may churn at some point in the
future. So we need to select those customers with churn=0. Then we can sort the data stream by the
likelihood of churning in a given future time interval.

Close the Table window


Place a Select node from the Record Ops palette in the stream to the right of Cox model
Connect the Cox generated model to the Select node
Edit the Select node
Enter the text churn = 0 in the Condition: box

Figure 7.29 Selecting Customers who have not Churned

Click OK
Add a Sort node to the stream to the right of the Select node
Connect the Select node to the Sort node
Edit the Sort node
Select $C-churn-1 as the first sort field in descending order
Select $CP-1-1 as the second sort field, also in descending order


Figure 7.30 Sorting by Predicted Value and Probability to Churn

Click OK
Connect the Sort node to the Table node, replacing the connection
Edit the Cox generated model

To see survival five years into the future, taking into account current survival for each customer, we
need to make only one change.

Change the Time interval value to 5.0


Figure 7.31 Settings to Predict Five Years into the Future taking into Account Current Survival

Since the number of time periods to score is still 1, this will score each record five years into the
future, conditional on the past survival time recorded for that customer.

Click OK
Run the Table node attached to the Sort node


Figure 7.32 Cox Predictions Five Years Beyond Current Survival

There are predictions now that customers will churn ($C-churn-1=1). All these customers have a
value of $CP-1-1 above .50. If we scroll down to row 71, we see that there are 71 customers
predicted to churn. However, this number depends on the default cutoff value of .500, which you are
free to adjust up or down as necessary. Decreasing the cutoff will cast a wider net to find more
customers who might churn in the specified five-year time period. Customers are ordered by model-
predicted probability of churning, so we can spend marketing dollars or other resources on contacting
them in order of likelihood to churn.

If you scroll down to record 3266 you will begin to see values of $null$ again for all the output fields.
This occurs when a prediction is made outside the survival range of the model. As we all know from
basic regression analysis, predicting outside the bounds of the predictor fields is to be avoided. In Cox
regression, if we attempt to predict outside the bounds of the time field, the model will not pass on a
predicted value downstream. In every instance, these null values come about when current survival
time is 36 years or greater. Given the small number of customers at such large survival times, these customers can
either be ignored or handled separately (see PASW Modeler Application Examples on Cox
Regression for hints).

To understand more about the model predictions into the future, we’ll examine the relationship
between time and the model prediction with an overlay histogram.


Close the Table window


Add a Histogram node to the stream near the Sort node
Connect the Sort node to the Histogram node
Edit the Histogram node
Specify cardtenure as the Field
Specify $C-churn-1 as the Overlay Color field (not shown)
Click Options
Click Normalize by color
Click Run

Figure 7.33 Histogram of Survival by Model Prediction

When reviewing this chart, keep in mind that, based on current survival, we are predicting survival
five years into the future. This explains why there is a sharp cutoff after 35 years, since 36 + 5 extends
beyond the data range of cardtenure of 40 years. Most of the predictions that people will switch their
primary credit card occur for customers with relatively low current survival times—under 10 years—
which is consistent with past experience and common sense.

Close the Histogram window

If you wanted to graph survival at cardtenure + 5 years, which is what the model is now predicting, you could use a Derive node to create a field with that equation and then use it in the Histogram in place of cardtenure.

To recap how to make predictions with a Cox generated model, follow this advice:


1) If you use the default setting, you are predicting from time 0 one time period into the future.
You usually don’t want this for existing customers, but this could certainly be appropriate for
a data file with new customers.
2) If you want to predict survival at the current survival time for existing customers, specify the
survival time field as the Time field on the Settings tab.
3) If you want to predict future survival, given past survival, specify the survival time field as
the Past survival time field, and specify either time intervals and periods to score (if you score
more than one period, you will get more than one prediction), or a Time field.


Summary Exercises
The exercises in this lesson use the data file customer_dbase.sav that we have been using throughout
the lesson examples.

1. Begin with a current stream, or alternatively just begin a new stream to access the
customer_dbase.sav data. Choose a completely different set of predictors than the
demographic fields used in the lesson to predict churn, although you will continue to use
cardtenure as survival time.

2. Estimate a Cox regression model with this new set of predictors. What are the significant
predictors?

3. Choose a categorical predictor or two that are significant and plot survival curves for each
value of the predictors. What did you learn?

4. Test the assumption of proportional hazards for your model with one or more categorical
predictors. Is the assumption met or not?

5. Using your model, predict survival 3 and 6 years into the future. Select only those customers
who have not churned. How many customers are predicted to churn at 3 years into the future?
At 6 years?


Lesson 8: Time Series Analysis


Objectives
• Explain what is meant by time series analysis
• Outline how time series models work
• Demonstrate the main principles behind a time series forecasting model
• Forecast several series at one time
• Produce forecasts with a time series model on new observations

8.1 Introduction
It is often essential for organizations to plan ahead, and to do this they need to forecast future events. To minimize errors when planning for the future, it is necessary to collect information, on a regular basis over time, on any factors which may influence plans. Once a catalogue of past and current information has been collected, patterns can be identified, and these patterns help make forecasts into the future. Even though many organizations collect historic information relevant to the planning process, forecasts are often made on an ad-hoc basis. This often leads to large forecasting errors and costly mistakes in the planning process. Statistical techniques provide a more scientific basis for forecasts: a more structured approach ensures careful planning and reduces the chance of making costly errors. Statisticians have developed a whole area of techniques, known as time series analysis, which is devoted to forecasting.

Examples of Time Series Analysis


In order to understand how time series analysis works it is useful to give an example. Suppose that a
company wishes to forecast the growth of its sales into the future. The benefit of making the forecast
is that if the company has an idea of future sales it can plan the production process for its product. In
doing so, it can minimize the chances of underproducing and having product shortages or,
alternatively, overproducing and having excess stock which will need to be stored at additional cost.

Prior to being able to make the forecast, the company will need to collect information on its sales over
time in order to gain a full picture of how sales have changed in the past. Once this information has
been collected it is possible to plot how sales change over time. An example of this is shown in
Figure 8.1. Here information on the sales of a product has been collected each month from January
1982 until December 1995.


Figure 8.1 Plot of Sales Over Time

This is a simple example that demonstrates the idea of time series. Time series analysis looks at
changes over time. Any information collected over time is known as a time series. A time series is
usually numerical information collected over time on a regular basis.

One of the most common uses of time series analysis is to forecast future values of a series. There are
a number of statistical time series techniques which can be used to make forecasts into the future. In
the above example the forecast would be the future values of sales.

Some time series methods can also be used to identify which factors have been important in affecting the series you wish to forecast; for example, they can determine whether an advertising campaign has had a significant and beneficial effect on sales. It is also possible to use time series analysis to quantify the
likely impact of a change in advertising expenditure on future sales.

Other examples of time series analysis and forecasting include:

• Governments using time series analysis to predict the effects of government policies on
inflation, unemployment and economic growth.

• Traffic authorities analyzing the effect on traffic flows following the introduction of parking
restrictions in city centers.


• Analysts studying how stock market prices change over time. By being able to predict when stock market prices will rise or fall, decisions can be made about the right times to buy and sell shares.

• Companies predicting the effects of pricing policies or increased advertising expenditure on the sales of their product.

• A company wishing to predict the number of telephone calls at different times during the day,
so it can arrange the appropriate level of staffing.

Time series analysis is used in many areas of business, commerce, government and academia, and its
value cannot be overstated.

A number of time series techniques can be found within the Time Series node in PASW Modeler.
This node provides analysts with both a flexible and powerful way to analyze time series data.

8.2 What is a Time Series?


A time series is a field whose values represent equally spaced observations of a phenomenon over
time. Examples of time series include quarterly interest rates, monthly unemployment rates, weekly
beer sales, annual sales of cigarettes, and so on. In terms of a data file, time periods constitute the rows
(cases) in your file.

Time series analysis is usually based on aggregated data. If we take the monthly sales shown in
Figure 8.1, each sale is recorded on a transactional basis with an attached date and/or time stamp.
There is usually no business need requiring sales forecasts on a minute-by-minute basis, while there is
often great interest in predicting sales on a weekly, monthly, or quarterly basis. For this reason,
individual transactions and events are typically aggregated at equally spaced time points (days,
weeks, months, etc.), and forecasting is based on these summaries. Also, most software programs that
perform time series analysis, including PASW Modeler, expect each row (case) of data to represent a
time period, while the columns contain the series to be forecast.

Classic time series involves forecasting future values of a time series based on patterns and trends
found in the history of that series (exponential smoothing and simple ARIMA) or on predictor fields
measured over time (multivariate ARIMA, or transfer functions).

Time Series Models versus Econometric Models


Time Series models are models constructed without drawing on any theories concerning possible
relationships between the fields. In univariate models, the movements of a field are explained solely
in terms of its own past and its position with respect to time. ARIMA models are the premier time
series models for single series.

By way of contrast, econometric models are constructed by drawing on theory to suggest possible
relationships between fields. Given that you can specify the form of the relationship, econometrics
provides methods for estimating the parameters, testing hypotheses, and producing predictions. Your
model might consist of a single equation, which can be estimated by some variant of regression, or a
system of simultaneous equations, which can be estimated by two-stage least squares or some other
technique.


The Classical Regression Model


The classical linear regression model is the conventional starting point for time series and
econometric methods. Peter Kennedy, in A Guide to Econometrics (5th edition, 2003, MIT Press),
provides a convenient statement of the model in terms of five assumptions:

• The dependent variable can be expressed as a linear function of a specific set of independent
variables plus a disturbance term (error).
• The expected value of the disturbance term is zero.
• The disturbances have a constant variance and are uncorrelated.
• The observations on the independent variable(s) can be considered fixed in repeated samples.
• The number of observations exceeds the number of independent variables and there are no
exact linear relationships between the independent variables.

While regression can serve as a point of departure for both time series and econometric models, it is
incumbent on you (the researcher) to generate the plots and statistics which will give some indication
of whether the assumptions are being met in a particular context.

Assumption 1 is concerned with the form of the specification of the model. Violations of this
assumption include omission of important regressors (predictors), inclusion of irrelevant regressors,
models nonlinear in the parameters, and varying coefficient models.

When assumption 2 is violated, there is a biased intercept.

Assumption 3 assumes constant variance (homoscedasticity) and no autocorrelation. (Autocorrelation is the correlation of a variable with itself at a fixed time lag.) Violations of the assumption are the reverse: non-constant variance (heteroscedasticity) and autocorrelation.
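
As an aside, lag-1 autocorrelation is easy to check yourself on a series or on regression residuals. The sketch below is a minimal Python illustration (assuming the pandas library is available) with an invented series; Series.autocorr() returns the correlation of the series with itself shifted by the given lag.

import pandas as pd

# Invented example series (for instance, regression residuals ordered over time)
resid = pd.Series([1.2, 0.8, 1.1, 1.6, 1.4, 0.9, 0.7, 1.3, 1.8, 1.5])

# Correlation of the series with itself lagged one time period
print(round(resid.autocorr(lag=1), 3))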

Assumption 4 is often called the assumption of fixed or nonstochastic independent variables. Violations of this assumption include errors in measurement in the variables, use of lagged values of
the dependent variable as regressors (common in time series analysis), and simultaneous equation
models.

Assumption 5 has two parts. If the number of observations does not exceed the number of
independent variables, then your problem has a necessary singularity and your coefficients are not
estimable. If there are exact linear relationships between independent variables, software might
protect you from the consequences. If there are near-exact linear relationships between your
independent variables, you face the problem of multicollinearity.

In regression, parameters can be estimated by least squares. Least squares methods do not make any
assumptions about the distribution of the disturbances. When you make the assumptions of the
classical linear regression model and add to them the assumption that the disturbances are normally
distributed, the regression estimators are maximum likelihood estimators (ML). It also can be shown
that the least-squares methods produce Best Linear Unbiased estimates (BLU). The BLU and ML
properties allow estimation of the standard errors of the regression coefficients and the standard error
of the estimate, and therefore enable the researcher to do hypothesis testing and calculate confidence
intervals.


8.3 A Time Series Data File


To show you what a time series data file looks like, we open a PASW Modeler stream.
Click File…Open Stream and move to the c:\Train\ModelerPredModel folder
Double-click on Time Series Intro.str
Run the Table node

Figure 8.2 A Time Series Data File

Each column in the Table window corresponds to a given field. The important point to note concerning
the organization of time series data is that each row in the Table window corresponds to a particular
period of time. Each row must therefore represent a sequential time period. The above example shows
a data file containing monthly data for sales starting in January 1982. In order to use standard time
series methods it is important to collect, or at least be able to summarize, the information over equal
time periods. Within a time series data file it is essential that the rows represent equally spaced time
periods. Even time periods for which no data was collected must be included as rows in the data file
(with missing values for the fields).
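
The same requirement applies outside PASW Modeler. As a hypothetical illustration in Python with pandas (not Modeler functionality), a monthly series can be re-indexed to a complete, equally spaced set of months so that any month without data appears as a row with a missing value:

import pandas as pd

# Monthly sales with one month (March 1982) missing from the raw data
sales = pd.Series(
    [120.0, 135.0, 150.0],
    index=pd.to_datetime(["1982-01-01", "1982-02-01", "1982-04-01"]),
)

# Re-index to month-start frequency; March 1982 becomes a row with a missing value
monthly = sales.asfreq("MS")
print(monthly)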

The Time Plot Chart


The data file contains the recorded sales of a product over a fourteen-year period. The simplest way
of identifying patterns in your data is to plot your information over the relevant time period, and this
is essential for time series analysis. In PASW Modeler a sequence chart called a Time Plot chart is
used to show how time series change over time. The Time Plot chart plots the value of the field of


interest on the vertical axis, with time represented on the horizontal axis. A Time Plot chart can show
several fields (series) on the same chart. Points are joined up to display a line graph which shows any
patterns in your data.

Close the Table window


Double-click on the Time Plot node to the right of the Time Intervals node to open it
Use the Field selector tool to select Sales

Figure 8.3 Time Plot Dialog

Click Run

There is an option to Display series in separate panels which can be used to generate a separate chart
for each series if you want to plot several of them at once. If you do not check this option, all fields
are plotted on one chart. Figure 8.4 shows how sales have changed over the fourteen years.


Figure 8.4 Sequence Plot of Sales

The sequence chart is the most powerful exploratory tool in time series analysis and it can be used to
identify trend, seasonal and cyclical patterns in a time series. There is a clear regularity (repeating
pattern) to the time series, and the volume of sales generally increases over time. These are the key
features we will need to model.

8.4 Trend, Seasonal and Cyclic Components


After identifying important patterns that have occurred in the past, time series analysis uses this
information to forecast into the future. In Figure 8.4 there are clear patterns in past sales. These
patterns can be divided into three main categories: trend, seasonal components and cycles.

Trend Patterns
Trend refers to the smooth upward or downward movement characterizing a time series over a long
period of time. This type of movement is particularly reflective of the underlying continuity of
fundamental demographic and economic phenomena. Trend is sometimes referred to as secular trend
where the word secular is derived from the Latin word saeculum, meaning a generation or age.
Hence, trend movements are thought of as long-term movements, usually requiring 15 or 20 years to
describe (or the equivalent for series with more frequent time intervals). Trend movements might be
attributable to factors such as population change, technological progress, and large-scale shifts in
consumer tastes.

For example, if we could examine a time series on the number of pairs of shoes produced in the
United States extending annually, say, from the 1700s until the present, we would find an underlying
trend of growth throughout the entire period, despite fluctuations around this general upward
movement. If we compared the recent figures against those near the beginning of the series, we would find the recent numbers are much larger. This is because of the increase in
population, because of the technical advances in shoe-producing equipment enabling vastly increased


levels of production, and because of shifts in consumer tastes and levels of affluence which have
meant a larger per capita requirement of shoes than in earlier times.

In Figure 8.4 there is a clear upward trend in the data as sales have continued to increase from 1982
until 1995, albeit less pronounced from the beginning of 1991.

Cyclical Patterns
Cyclical patterns (or fluctuations), or business cycle movements, are recurrent up and down
movements around the trend levels which have a duration of anywhere from about 2 to 15 years. The
duration of these cycles can be measured in terms of their turning points, or in other words, from
trough to trough or peak to peak. These cycles are recurrent rather than strictly periodic. The height
and length (amplitude and duration) of cyclical fluctuations in industrial series differ from those of
agricultural series, and there are differences within these categories and within individual series.
Hence, cycles in durable goods activity generally display greater relative fluctuations than consumer
goods activity and a particular time series of, say, consumer goods activity may possess business
cycles which have considerable variations in both duration and amplitude.

Economists have produced a large number of explanations of business cycle fluctuations including
external theories which seek the causes outside the economic system, and internal theories in terms of
factors within the economic system that lead to self-generating cycles.

Since it is clear from the foregoing discussion that there is no single simple explanation of business
cycle activity and that there are different types of cycles of varying length and size, it is not surprising
that no highly accurate method of forecasting this type of activity has been devised. Indeed, no
generally satisfactory mathematical model has been constructed for either describing or forecasting
these cycles, and perhaps never will be. Therefore, it is not surprising to find that classical time series
analysis adopts a relatively rough approach to the statistical measurement of the business cycle. The
approach is a residual one; that is, after trend and seasonal variations have been eliminated from a
time series, by definition, the remainder or residual is treated as being attributable to cyclical and
irregular factors. Since the irregular movements are by their very nature erratic and not particularly
tractable to statistical analysis, no explicit attempt is usually made to separate them from cyclical
movements, or vice versa. However, the cyclical fluctuations are generally large relative to these
irregular movements and ordinarily no particular difficulty in description or analysis arises from this
source. Therefore, unless you have data available over a long period of time, cyclic patterns are not
usually fit by forecasting models.

Seasonal Patterns
Seasonal variations are periodic patterns of movement in a time series. Such variations are considered
to be a type of cycle that completes itself within the period of a calendar year, and then continues in a
repetition of this basic pattern. The major factors in this seasonal pattern are weather and customs,
where the latter term is broadly interpreted to include patterns in social behavior as well as
observance of various holidays such as Christmas and Easter. Series of monthly or quarterly data are
ordinarily used to examine these seasonal patterns. Hence, regardless of trend or cyclical levels, one
can observe in the United States that each year more ice cream is sold during the summer months than
during the winter, whereas more fuel oil for home heating purposes is consumed in the winter than
during the summer months. Both of these cases illustrate the effect of climatic factors in determining
seasonal patterns. Also, department store sales generally reveal a minor peak during the months in
which Easter occurs and a larger peak in December, when Christmas occurs, reflecting the shopping
customs of consumers associated with these dates.


Seasonal patterns need not be linked to a calendar year. For example, if we studied the daily volume
of packages delivered by a private delivery service, the periodic pattern might well repeat weekly
(heavier deliveries mid-week, lighter deliveries on the weekend). Here the period for the seasonal
pattern could be seven days. Of course, if daily data were collected over several years, then there may
well be a yearly pattern as well, and just which time period constitutes a season is no longer clear.

The number of time periods that occur during the completion of a seasonal pattern is referred to as the
series periodicity. How often the time series data are collected usually depends on the type of
seasonality that the analyst expects to find.

• For hourly data, where data are collected once an hour, there is usually one seasonal
pattern every twenty-four hours. The periodicity is most likely to be 24.
• For monthly data, where each month a new time period of data is collected, there is
usually one seasonal pattern every twelve months. The periodicity is thus likely to be 12.
• For daily data, where data are collected once every day, there is usually one seasonal
pattern per week. The periodicity is therefore 7 if the data refer to a seven-day week or 5
if no data are collected on Saturdays and Sundays.
• For quarterly data, where data are collected once every three months, there is usually one
seasonal pattern per year. The periodicity is therefore 4.
• For annual data, where data are collected once a year, there is no seasonal pattern. The
periodicity is therefore none (undefined).

Of course, changes can occur in seasonal patterns because of changing institutional and other factors.
Hence, a change in the date of the annual automobile show can change the seasonal pattern of
automobile sales. Similarly, the advent of refrigeration techniques with the corresponding widespread
use of home refrigerators has brought about a change of seasonal pattern of ice cream sales. The
techniques of measurement of seasonal variation which we will discuss are particularly well suited to
the measurement of relatively stable patterns of seasonal variation, but can be adapted to cases of
changing seasonal movements as well.

In Figure 8.4, there appears to be a rise in sales during the early part of the year while sales tend to
fall to a low around November. Finally, there is some recovery in sales leading up to the Christmas
period of each year.

Irregular Movements
Irregular movements are fluctuations in time series that are erratic in nature, and follow no regularly
recurrent or other discernible pattern. These movements are sometimes referred to as residual
variations, since, by definition, they represent what is left over in an economic time series after trend,
cyclical, and seasonal elements have been accounted for. These irregular fluctuations result from
sporadic, unsystematic occurrences such as wars, earthquakes, accidents, strikes, and the like. In the
classical time series model, the elements of trend, cyclical, and seasonal variations are viewed as
resulting from systematic influences leading to gradual growth, decline, or recurrent movements.
Irregular movements, however, are considered to be so erratic that it would be fruitless to attempt to
describe them in terms of a formal model. Irregular movements can result from a large number of
causes of widely differing impact.

8.5 What is a Time Series Model?


A time series model is a tool used to predict future values of a series by analyzing the relationship
between the values observed in the series and the time of their occurrence. Time series models can be


developed using a variety of time series statistical techniques. If there has been any trend and/or
seasonal variation present in the data in the past then time series models can detect this variation, use
this information in order to fit the historical data as closely as possible, and in doing so improve the
precision of future forecasts.

Time Series techniques in PASW Modeler can be categorized in the following ways:

Pure time series models
• Exponential Smoothing

Causal time series models
• Linear Time Series Regression
• Intervention Analysis

Both Pure and Causal
• ARIMA

Pure Versus Causal Time Series Models


A distinction can be made between pure and causal time series models.

Pure Time Series Models


Pure time series models utilize information solely from the series itself. In other words, pure time
series forecasting makes no attempt to discover the factors affecting the behavior of a series. For
example, if the aim were to forecast future sales for a product, then a pure time series model would
use just the data collected on sales. Information on other explanatory forces such as advertising
expenditure and economic conditions would not be used when developing a pure time series model.
In such models it is assumed that some pattern, or combination of patterns, in the series to be forecast recurs over time. By identifying and extrapolating that pattern, forecasts can be developed for subsequent time periods. The main advantage of pure time series modeling is that it is a quick and
simple way of developing a forecast model. Also, such models rely upon little statistical theory. One
obvious disadvantage of pure time series models, such as exponential smoothing, is that they cannot
identify important factors influencing the series. Another drawback is that it is not possible to
accurately predict the impact of any decisions taken by an organization on the future values of the
series.

Causal time series models


Causal time series models such as regression and ARIMA will incorporate data on influential factors
to help predict future values of a series. In such models, a relationship is modeled between a target
field (the time series being predicted), time, and a set of predictor fields (other associated factors also
measured over time). The first task of forecasting is to find the cause-and-effect relationship. In our
sales example, a causal time series technique such as regression would indicate whether advertising
expenditure or the price of the product has been an important influence on sales and if it has, whether
each factor has had a positive or negative influence on sales. The real advantage of an explanatory
model is that a range of forecasts corresponding to a range of values for the different fields can be
developed. For example, causal time series models can assess what effect a $100,000 increase
in advertising expenditure will have on future sales, or alternatively a $150,000 increase in
advertising expenditure.


The main drawbacks of causal time series models are that they require information on several fields
in addition to the target that is being forecast and usually take longer to develop. Furthermore, the
model may require estimation of the future values of the independent factors before the target can be
forecast.

8.6 Interventions
Time series may experience sudden shifts in level, upward or downward, as a result of external
events. For example, sales volume may briefly increase as the result of a direct marketing campaign
or a discount offering. If sales were limited by a company’s capacity to manufacture a product, then
bringing a new plant online would shift the sales level upward from that date onward. Similarly,
changes in tax laws or pricing may shift the level of a series. The idea here is that some outside
intervention resulted in a shift in the level of the series.

In this context, a distinction is made between a pulse—that is, a sudden, temporary shift in the series
level—and a step, a sudden, permanent shift in the series level. A bad storm, or a one-time, 30-day
rebate offer, might result in a pulse, while a change in legislation or a large competitor’s entry into a
market could result in a step change to the series. Time series models are designed to account for
gradual, not sudden, change. As a result, they do not natively fit pulse and step effects very well.
However, if you can identify events (by date) that you believe are associated with pulse or step
effects, they can be incorporated into time series models (they are called intervention effects) and
forecasts.

Below we see an example of a pulse intervention. In April 1975 a one-time tax rebate occurred in an
attempt to stimulate the US economy, then in recession. Note that the savings rate reached its
maximum (9.7%) during this quarter. The intervention can be modeled and used in scenarios to assess
the effect of a tax rebate on savings rates in the future.


Figure 8.5 U. S. Savings Rate (Seasonally Adjusted)—Tax Rebate in April 1975

8.7 Exponential Smoothing


The Expert Modeler in PASW Forecasting considers two classes of time series models when
searching for the best forecasting model for your data: exponential smoothing and ARIMA. In this
section we provide a brief introduction to simple exponential smoothing.

Exponential smoothing is a time series technique that can be a relatively quick way of developing
forecasts. This technique is a pure time series method; this means that the technique is suitable when
data has only been collected for the series that you wish to forecast. In comparison, ARIMA models
can accommodate predictor fields and intervention effects.

Exponential smoothing takes the approach that recent observations should have relatively more
weight in forecasting than distant observations. “Smoothing” implies predicting an observation by a
weighted combination of the previous values. “Exponential” smoothing implies that the weights
decrease exponentially as the observations get older. “Simple” (as in simple exponential smoothing)
implies that a slowly changing level is all that is being modeled. Exponential smoothing can be extended to model different combinations of trend and seasonality; the exponential smoothing family includes many models built in this fashion.

An analyst using custom exponential smoothing typically examines the series to make some broad
characterizations (is there trend, and if so what type? Is there seasonality [a repeating pattern], and if
so what type?) and fits one or more models. The best model fit is then extrapolated into the future to


make forecasts. One of the main advantages of exponential smoothing is that models can be easily
constructed. The type of exponential smoothing model developed will depend upon the seasonal and
trend patterns inherent in the series you wish to forecast. An analyst building a model might simply
observe the patterns in a sequence chart to decide which type of exponential smoothing model is the
most promising one to generate forecasts. In PASW Forecasting, when the Expert Modeler examines
the series, it considers all appropriate exponential smoothing models when searching for the most
promising time series model.

Simple exponential smoothing (no trend, no seasonality) can be described in two algebraically
equivalent ways. One common formula, known as the recurrence form, is as follows:

S(t) = α * y(t) + (1 − α) * S(t−1)

Also, the forecast:

y(m) = S(t)

where y(t) is the observed value of the time series in period t, S(t-1) is the smoothed level of the series at
time t-1, α (alpha) is the smoothing parameter for the level of the series, and S(t) is the smoothed level
of the series at time t, computed after y(t) is observed, and y(m) is the model estimated m step ahead
forecast at time t. Intuitively, the formula states that the current smoothed value is obtained by
combining information from two sources: the current point and the history embodied in the series.
Alpha (α) is a weight ranging between 0 and 1. The closer alpha is to 1, the more exponential
smoothing weights the most recent observation and the less it weights the historical pattern of the
series. The smoothed value for the current case becomes the forecast value.
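
To make the recurrence concrete, here is a minimal Python sketch of simple exponential smoothing written directly from the formula above; it is not how PASW Modeler implements the technique internally, and the series values and the alpha of 0.3 are arbitrary.

def simple_exponential_smoothing(y, alpha):
    """Return the smoothed levels S(t) for the observed series y."""
    levels = [y[0]]  # initialize the level with the first observation
    for value in y[1:]:
        # Recurrence form: S(t) = alpha * y(t) + (1 - alpha) * S(t-1)
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels

sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
levels = simple_exponential_smoothing(sales, alpha=0.3)

# With no trend or seasonality, every m-step-ahead forecast equals the last smoothed level
print(round(levels[-1], 2))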

This is the simplest form of an exponential smoothing model. As mentioned above, extensions of the
exponential smoothing model can accommodate several types of trend and seasonality, yielding a
general model capable of fitting single-series data.

8.8 ARIMA
Many of the ideas that have been incorporated into ARIMA models were developed in the 1970s by
George Box and Gwilym Jenkins, and for this reason ARIMA modeling is sometimes called Box-
Jenkins modeling. ARIMA stands for AutoRegressive Integrated Moving Average, and the
assumption of these models is that the variation accounted for in the series field can be divided into
three components:

• Autoregressive (AR)
• Integrated (I) or Difference
• Moving Average (MA)

An ARIMA model can have any component, or combination of components, at both the nonseasonal
and seasonal levels. There are many different types of ARIMA models and the general form of an
ARIMA model is ARIMA(p,d,q)(P,D,Q), where:

• p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA
model (and P the order of the seasonal autoregressive process)
• d refers to the order of nonseasonal integration or differencing (and D the order of the
seasonal integration or differencing)


• q refers to the order of the nonseasonal moving average process incorporated in the model
(and Q the order of the seasonal moving average process).

So for example an ARIMA(2,1,1) would be a nonseasonal ARIMA model where the order of the
autoregressive component is 2, the order of integration or differencing is 1, and the order of the
moving average component is also 1. ARIMA models need not have all three components. For
example, an ARIMA(1,0,0) has an autoregressive component of order 1 but no difference or moving
average component. Similarly, an ARIMA(0,0,2) has only a moving average component of order 2.
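
For readers who want to see this notation in action outside PASW Modeler, the hedged sketch below fits an ARIMA(2,1,1) to a short invented series using the Python statsmodels library (assumed to be installed); the numbers exist only to make the example runnable.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Invented monthly series, purely for illustration
y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
    index=pd.date_range("1982-01-01", periods=24, freq="MS"),
)

# ARIMA(p=2, d=1, q=1): AR order 2, one nonseasonal difference, MA order 1
fitted = ARIMA(y, order=(2, 1, 1)).fit()

# Forecast the next three periods
print(fitted.forecast(steps=3))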

Autoregressive
In a similar way to regression, ARIMA models use input fields to predict a target field (the series
field). The name autoregressive implies that the series values from the past are used to predict the
current series value. In other words, the autoregressive component of an ARIMA model uses the
lagged values of the series target, that is, values from previous time points, as predictors of the
current value of the series. For example, it might be the case that a good predictor of current monthly
sales is the sales value from the previous month.

The order of autoregression refers to the time difference between the series target and the lagged
series target used as a predictor. If the series target is influenced by the series target two time periods
back, then this is an autoregressive model of order two and is sometimes called an AR(2) process. An
AR(1) component of the ARIMA model is saying that the value of series target in the previous period
(t-1) is a good indictor and predictor of what the series will be now (at time period t). This pattern
continues for higher-order processes.

The equation representation of a simple autoregressive model (AR(1)) is:

y(t) = Φ1 * y(t−1) + e(t) + a

Thus the series value at the current time point (y(t)) is equal to the sum of: (1) the previous series
value (y(t-1)) multiplied by a weight coefficient (Φ1); (2) a constant a (representing the series mean);
and (3) an error component at the current time point (e(t)).
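
A quick way to get a feel for an AR(1) process is to simulate one directly from this equation. The short Python sketch below uses arbitrary values for the weight coefficient and constant; each new value depends on the previous value plus random error.

import random

random.seed(1)

phi, a = 0.7, 5.0   # arbitrary AR(1) weight coefficient and constant
y = [20.0]          # arbitrary starting value

# y(t) = phi * y(t-1) + a + e(t), where e(t) is random noise
for _ in range(11):
    y.append(phi * y[-1] + a + random.gauss(0, 1))

print([round(v, 2) for v in y])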

Moving Average
The autoregressive component of an ARIMA model uses lagged values of the series values as
predictors. In contrast to this, the moving average component of the model uses lagged values of the
model error as predictors.

Some analysts interpret moving average components as outside events or shocks to the system. That
is, an unpredicted change in the environment occurs, which influences the current value in the series
as well as future values. Thus the error component for the current time period relates to the series’
values in the future.

The order of the moving average component refers to the lag length between the error and the series
target. For example, if the series target is influenced by the model’s error lagged one period, then this
is a moving average process of order one and is sometimes called an MA(1) process. An MA(1)
model would be expressed as:
y(t) = Φ1 * e(t−1) + e(t) + a


Thus the series value at the current time point (y(t)) is equal to the sum of several components: (1) the
previous time point’s model error (e(t-1)) multiplied by a weight coefficient (here Φ1); (2) a constant
(representing the series mean); and (3) an error component at the current time point (e(t)).

Integration
The Integration (or Differencing) component of an ARIMA model provides a means of accounting
for trend within a time series model. Creating a differenced series involves subtracting adjacent series values in order to evaluate the remaining component of the model. The trend removed
by differencing is later built back into the forecasts by Integration (reversing the differencing
operation). Differencing can be applied at the nonseasonal or seasonal level, and successive
differencing, although relatively rare, can be applied. The form of a differenced series (nonseasonal)
would be:

x(t) = y(t) − y(t−1)

Thus the differenced series value (x(t)) is equal to the current series value (y(t)) minus the previous
series value (y(t-1)).
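
Differencing itself is a one-line operation in most tools. As a hypothetical illustration with pandas (not the Time Series node itself), the diff() method computes x(t) = y(t) − y(t−1); the first differenced value is missing because there is no previous observation.

import pandas as pd

y = pd.Series([112, 118, 132, 129, 121, 135])

# Nonseasonal differencing: x(t) = y(t) - y(t-1)
print(y.diff().tolist())   # [nan, 6.0, 14.0, -3.0, -8.0, 14.0]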

Multivariate ARIMA
ARIMA also permits a series to be predicted from values in other data series. The relations may be at
the same time point (for example, a company spending on advertising this month influences the
company’s sales this month) or in a leading or lagging fashion (for example, the company’s spending
on advertising two months ago influence the company’s sales this month). Multiple predictor series
can be included at different time lags. A very simple example of a multivariate ARIMA model
appears below:

y(t) = b1 * x(t−1) + e(t) + a

Here the series value at the current time point (y(t)) is equal to the sum of several components: (1) the
value of the predictor series at the previous time point (x(t-1)) multiplied by a weight coefficient (b1);
(2) a constant; and (3) an error component at the current time point (e(t)). In a practical context, this
model could represent monthly sales of a new product as a function of direct marketing spending the
previous month.

Complex ARIMA models that include other predictor series, autoregressive, moving average, and
integration components can be built in the Time Series node.

8.9 Data Requirements


In time series analysis, each time period at which the data are collected yields one sample point to the
time series. The idea is that the more sample points you have, the clearer the picture of how the series
behaves. It is not reasonable to collect just two months' worth of data on the sales of a product and, on
the basis of this, expect to be able to forecast two years into the future. This is because your sample
size is only two (one sixth of the seasonal span) and you wish to forecast 24 data points, or months,
ahead (two full seasonal spans). Therefore the way to view the collection of time series information is
that the more data points you have, the greater your understanding of the past will be, and the more
information you have to use to predict future values in the series.


The first important question to be answered is how many data points are required before it is possible
to develop time series forecasts. Unfortunately, there is no clear-cut answer to this, but the following
factors influence the minimum amount of data required:

• Periodicity
• How often the data are collected
• Complexity of the time series model

It is important to note that some time series techniques incorporating seasonal effects require several
seasonal spans of time series data before it is possible to use them. Usually four or more seasons of observations is a good rule of thumb when attempting seasonal modeling. For example, four years (seasonal spans) of quarterly or monthly data would be sufficient, as there are four replications of the seasonal cycle. At the same time, four years of annual data is not enough historic data, as the sample size is only four. The four-year rule is not, however, a rigid rule, as
time series can be developed and used for forecasting with less historic data.

Two final thoughts: first, the more complex the time series model, the larger the time series sample
size should be. Second, time series models assume that the same patterns appear throughout the
series. If you are fitting a long series in which a dramatic change occurred that might influence the
fundamental relations that exist over time (for example, deregulation in the airline and telecom
industries), you may obtain more accurate prediction using only the recent (after the change) data to
develop the forecasts.

8.10 Automatic Forecasting in a Production Setting


In common data mining applications, analysts need to create forecasts for dozens of series on a
regular basis. Typical examples are for inventory control for many different products/parts, or for
demand forecasting within segments of customers (geographical, customer type, etc.). In principle,
this task is no more complex than what we have already reviewed in the previous lessons. But in
practice, it can be demanding simply because of the large number of series which could require data
exploration, checking of residuals, etc.

Fortunately, the Expert Modeler in the Time Series node will automatically find a best-fitting
model for any number of series that are added to the target list, with little work on your part (you can
also use one or more predictor fields that would apply to all the target series). Although you could, if
you had the time, do some preliminary work to determine the characteristics of the series, if you need
to make regular forecasts on a weekly or monthly basis, it is likely that you won’t have the time to
devote to this effort.

After models are fit to several series—each series will have its own unique model—you can then
easily apply those models in the future, without having to re-estimate or rebuild the models. This will
be very time efficient. Of course, when enough time passes, you will most likely want to re-estimate
the models, just in case any fundamental processes have changed in the drivers of specific series.

8.11 Forecasting Broadband Usage in Several Markets


Our example of production forecasting involves a national broadband provider who wants to produce
forecasts of user subscriptions in order to predict bandwidth usage. To keep the example relatively
manageable, we will use only five time series in the example, although there are 85 series altogether
(but we also forecast the total for all series). The file broadband_1.sav contains the monthly number
of subscribers for each series from January 1999 to December 2003. After fitting models to the series,


we want to produce forecasts for the next three months, which will be adequate to prepare for changes
in demand/usage.

We’ll open the data file and do some data exploration.

Close the Time Plot graph window


Click on File…Open Stream
Double-click on the file broadband1
Run the Table node

Figure 8.6 Broadband Time Series Data

The file contains information on 85 markets. Rather than looking at all of them at once, we will focus
only on Markets 1 through 5. The Filter node to the right of the source node will filter out the markets
we don’t need.

Close the Table window


Double-click the Filter node


Figure 8.7 Filter Node Dialog

The next step is to examine sequence charts of each series, but before doing so we will need to define
the periodicity of each series. The Time Series modeling node, and the Time Plot node, require that
the periodicity be defined. This is accomplished in the Time Intervals node which is found in the
Fields Ops palette.

Place a Type Node to the right of the Filter node


Connect the Filter node to the Type node
Place a Time Intervals node to the right of the Type node
Connect the Type node to the Time Intervals node
Double-click on the Time Intervals node


Figure 8.8 Time Intervals Dialog

The Time Intervals node allows you to specify intervals and generate labels for time series data to be
used in a Time Series model or a Time Plot node. A full range of time intervals is supported, from
seconds to years. You can also specify the periodicity—for example, five days per week or eight
hours per day.

In this node you can specify the range of records to be used for estimating a model, and you can
choose whether to exclude the earliest records in the series and whether to specify holdouts. Doing so
enables you to test the model by holding out the most recent records in the time series data in order to
compare their known values with the estimated values for those periods. You can also specify how
many time periods into the future you want to forecast, and you can specify future values for use in
forecasting by downstream Time Series modeling nodes.

The Time Interval dropdown is used to define the periodicity of the series. By default it is set to None.
While it is not absolutely required that you specify a periodicity, unless you do so the Expert Modeler will not consider models that adjust for seasonal patterns. In this case, because we have
collected data on a monthly basis, we will define our time interval as months.

Click on the Time Interval dropdown and select Months


Figure 8.9 Time Intervals Dialog with Periodicity Defined

The next step is to label the intervals. You can either start labeling from the first record, which in the
case of this data file is January 1999, or build the labels from a field that identifies the time or date of
each measurement. In order to use the Start labeling from first record method, you must specify the
starting date and/or time to label the records. This method assumes that the records are already
equally spaced, with a uniform interval between each measurement. Any missing measurements
would be indicated by empty rows in the data. You can use the Build from data method for series that
are not equally spaced. This requires that you have a field that contains the time or date of each
measurement. PASW Modeler will automatically impute values for the missing time points so that
the series will have equally spaced intervals. In addition, this method requires a date, time,
or timestamp field in the appropriate format to use as input. For example if you have a string field
with values like Jan 2000, Feb 2000, etc., you can convert this to a date field using a Filler node. This
is the method that we are going to use.
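
As an analogy only (this is Python, not the Modeler expression language), the conversion that to_date(DATE_) performs in the Filler node is similar to parsing month-year strings into real dates with pandas:

import pandas as pd

labels = ["Jan 1999", "Feb 1999", "Mar 1999"]

# Parse the month-year strings into proper date values
print(pd.to_datetime(labels, format="%b %Y"))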

Click OK
Insert a Filler node between the Filter node and the Type Node


Figure 8.10 Stream After Adding the Filler Node

Double-click on the Filler node


Select DATE_ in the Fill in fields box
Select Always from the Replace: dropdown
Type or use the expression builder to insert to_date(DATE_) in the Replace with: box

Figure 8.11 Completed Filler Node

Click OK

Next, let’s set up the Type node so that the role for all the target series we want to forecast is set to
target, and the role for the newly converted DATE_ field is set to None. We will also need to
instantiate the data.

Double-click on the Type node


Set the role on all the fields from Market_1 to Total to Target
Set the role on the DATE_ field to None
Click on Read Values button to instantiate the data

Figure 8.12 Completed Type Node

Click OK

Now we can complete the Time Intervals settings.

Double-click on the Time Intervals node


Click on Build from data
Use the menu on the Field: option to select DATE_


Figure 8.13 Time Intervals Dialog with Date Field added

The New field name extension is used to apply either a prefix or suffix to the new fields generated by
the node. By default it is the prefix $TI_.

Click on the Build tab


Figure 8.14 Build Tab Dialog

The Build tab allows you to specify options for aggregating or padding fields to match the specified
interval. These settings apply only when the Build from data option is selected on the Intervals tab.
For example, if you have a mix of weekly and monthly data, you could aggregate or “roll up” the
weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to
weekly and pad the series by inserting blank values for any weeks that are missing, or by
extrapolating missing values using a specified padding function. When you pad or aggregate data, any
existing date or timestamp fields are effectively superseded by the generated TimeLabel and
TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that
measure time as a duration are preserved—such as a field that measures the length of a service call
rather than the time the call started—as long as they are stored internally as time fields rather than
timestamp.
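
As a rough outside-of-Modeler illustration of aggregating and padding, the pandas sketch below (hypothetical weekly values) rolls a weekly series up to months and, alternatively, pads gaps in the weekly series with a function of the series:

    import numpy as np
    import pandas as pd

    # Hypothetical weekly series with two missing measurements
    idx = pd.date_range("2003-01-06", periods=8, freq="W-MON")
    weekly = pd.Series([10, 12, np.nan, 11, 13, 14, np.nan, 15], index=idx)

    monthly = weekly.resample("MS").sum()    # "roll up" weekly values to a monthly interval
    padded = weekly.fillna(weekly.mean())    # or keep weekly and pad gaps with the series mean
    print(monthly)
    print(padded)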

Click the Estimation tab


Figure 8.15 Estimation Tab Dialog

The Estimation tab of the Time Intervals node allows you to specify the range of records used in
model estimation, as well as any holdouts. These settings may be overridden in downstream modeling
nodes as needed, but specifying them here may be more convenient than specifying them for each
node individually. The Begin Estimation option specifies when you want the estimation period to
begin. You can either begin the estimation period at the beginning of the data or exclude older values
that may be of limited use in forecasting. Depending on the data, you may find that shortening the
estimation period can speed up performance (and reduce the amount of time spent on data
preparation), with no significant loss in forecasting accuracy. The End Estimation option allows you
to either estimate the model using all records up to the end of the data or “hold out” the most recent
records in order to evaluate the model. For example, if you hold out the last three records and then
specify 3 for the number of records to forecast, you are effectively “forecasting” values that are
already known, allowing you to compare observed and predicted values to gauge the model’s
effectiveness at forecasting into the future. We will use the default settings.
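
The idea of an estimation period and holdouts can be sketched outside Modeler as a simple split of the series (hypothetical series; the last three observations are held out for validation):

    import numpy as np
    import pandas as pd

    # Hypothetical monthly series of 60 observations
    y = pd.Series(np.arange(60, dtype=float),
                  index=pd.date_range("1999-01-01", periods=60, freq="MS"))

    estimation = y.iloc[:-3]   # records used to fit the model
    holdout = y.iloc[-3:]      # known values to compare against "forecasts" of those periods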

Click the Forecast tab


Figure 8.16 Forecast Tab Dialog

The Forecast tab of the Time Intervals node allows you to specify the number of records you want to
forecast and to specify future values for use in forecasting by downstream Time Series modeling
nodes. These settings may be overridden in downstream modeling nodes as needed, but again
specifying them here may be more convenient than specifying them for each node individually.

The Extend records into the future option lets you specify the number of time points you wish to
forecast beyond the estimation period. Note that these time points may or may not be in the future,
depending on whether or not you held out some historic data for validation purposes. For example, if
you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and
only 1 future value. The Future indicator field is used to label the generated field to indicate whether
a record contains forecast data. The default value for the label is $TI_Future. The Future Values to
Use in Forecasting allows you to specify future values for any predictor fields you use. Future values
for any predictor fields are required for each record that you want to forecast, excluding holdouts. For
example, if you are forecasting next month's revenues for a hotel based on the number of reservations,
you need to specify the number of reservations you actually expect. Note that fields selected here may
or may not be used in modeling; to actually use a field as a predictor, it must be selected in a
downstream modeling node. This dialog box simply gives you a convenient place to specify future
values so they can be shared by multiple downstream modeling nodes without specifying them
separately in each node. Also note that the list of available fields may be constrained by selections on
the Build tab. For example, if Specify fields and functions is selected in the Build tab, any fields not
aggregated or padded are dropped from the stream and cannot be used in modeling. The Future value
functions option lets you choose from a list of functions or specify a value of your own. For example,
you could set the value to the most recent value. The available functions depend on the measurement
level of the field.
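
Extending the records into the future simply means adding new time points beyond the last observed one and flagging them; a minimal pandas sketch (hypothetical dates; the field name follows the $TI_Future convention mentioned above):

    import pandas as pd

    last_observed = pd.Timestamp("2003-12-01")
    future_index = pd.date_range(last_observed, periods=4, freq="MS")[1:]   # Jan-Mar 2004
    future = pd.DataFrame({"$TI_Future": 1}, index=future_index)            # flagged as forecast rows
    print(future)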

Click on the Extend records into the future check box


Specify that you would like to forecast 3 records beyond the estimation period


Figure 8.17 Completed Forecast Tab Dialog

Click OK

The next step is to examine each series with a Sequence chart. We will display all the fields on the
same chart.

Place a Time Plot node from the Graphs palette below the Time Intervals node
Attach the Time Intervals node to the Time Plot node
Double-click on the Time Plot node
Select all the series from Market_1 to Total
Uncheck the Display Series in separate panels box


Figure 8.18 Completed Time Plot Dialog

Click Run


Figure 8.19 Sequence Chart Output for Each Series

From this graph, it is clear that Broadband usage has been increasing rapidly in the US over this
period, so we see a steady, very smooth increase for all fields. The numbers for Market_3 do begin to
dip in the last couple of months, but perhaps this is temporary. There is clearly no seasonality in these
data, which makes sense. The number of broadband subscriptions does not rise and fall throughout
the year.

If we use this fact, we can reduce the time needed by the Expert Modeler to fit models to these series,
since requesting that seasonality be considered will increase processing time.

Additionally, because the series we’ve viewed here are so smooth, with no obvious outliers, we’ll not
request outlier detection. This will also save on processing time. Note, though, that if you are in doubt
about this, it is safer to use outlier detection during modeling.

Close the Time Plot graph


Place a Time Series node from the Modeling palette near the Time Intervals node
Connect the Time Intervals node to the Time Series node

Here is the stream so far:


Figure 8.20 Stream with Time Series Node Attached

Double-click on the Time Series node

Figure 8.21 Time Series Node

The default Method of modeling is the Expert Modeler, which automatically selects the best
exponential smoothing or ARIMA model for one or more series (there can be a different model for
each series). As an alternative, you can use the menu to specify a custom Exponential Smoothing or
ARIMA model. In addition, there is a Continue estimation using


existing model(s) option, which allows you to apply an existing model to new data without redoing
the model selection from the beginning. In this way you can save time: the parameters are re-estimated
and a new forecast produced using the same model settings as before, but based on more recent data.
Thus, if the original model for a particular time series was Holt's linear trend, the same type of model
is used for re-estimating and forecasting for that data; the modeler does not reattempt to find the best
model type for the new data. We will use the Expert Modeler in this example.
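
The Expert Modeler's search is internal to PASW Modeler, but the basic idea of trying candidate model forms and keeping the best by an information criterion can be sketched with statsmodels (hypothetical series; only two candidates are compared here):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical smooth, trending monthly series standing in for one broadband market
    y = pd.Series(100 + 5 * np.arange(60) + np.random.normal(0, 2, 60),
                  index=pd.date_range("1999-01-01", periods=60, freq="MS"))

    holt = ExponentialSmoothing(y, trend="add").fit()    # Holt's linear trend
    arima = ARIMA(y, order=(1, 1, 1)).fit()              # one simple ARIMA candidate
    best = holt if holt.bic < arima.bic else arima       # keep the lower-BIC model

    print(best.forecast(3))    # forecast 3 periods beyond the estimation period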

In addition, you can specify the confidence intervals you want for the model predictions and residual
autocorrelations. By default, a 95% confidence interval is used. You can set the maximum number of
lags shown in tables and in plots of autocorrelations and partial autocorrelations.

You must include a Time Intervals node upstream from the Time Series node. Otherwise, the dialog
will indicate that no time interval has been defined and the stream will not run. In this example, the
settings indicate that the model will be estimated from all the records and that forecasts will be made
3 time periods beyond the estimation period.

Click the Criteria button

Figure 8.22 Criteria Dialog

The All models option should be checked if you want the Expert Modeler to consider both ARIMA
and exponential smoothing models. The other two modeling options can be used if you want the
Expert Modeler to only consider Exponential smoothing or ARIMA models. The Expert Modeler
considers seasonal models option is available only when a periodicity has been defined for the series.
When this option is selected, the Expert Modeler considers both seasonal and nonseasonal models; if
it is not selected, the Expert Modeler only considers nonseasonal models. We will uncheck this option


because the sequence charts clearly show that there were no seasonal patterns in broadband
subscriptions.

The Events and Interventions option enables you to designate certain fields as event or intervention
fields. Such a field marks periods when the time series data might have been affected by events
(predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a
power outage or employee strike). To be included in this list, a field must have a flag, nominal, or
ordinal measurement level and must be numeric (e.g., 1/0, not T/F, for a flag field).

Click the Outliers tab

Figure 8.23 Outliers Dialog

The Detect Outliers automatically option is used to locate and adjust for outliers. Outliers can lead to
forecasting bias either up or down, erroneous predictions if the outlier is near the end of the series,
and increased standard errors. Because there were no obvious outliers in the sequence chart, we will
leave this option unchecked.

Click Cancel
Click Run

After the model runs:

Right-click on the generated model named 6 fields in the Models palette


Click Browse


Figure 8.24 Time Series Model Output (Simple View)

The Time Series model displays details of the models the Expert Modeler selected for each series. In
this case, it chose the Holt's linear trend exponential smoothing model for the first four series and the
last one (Total), and the Winters' additive exponential smoothing model for the fifth series. Given the
likely similar patterns in the series, it is not surprising that the same model was chosen for most of the
series. The default output shows for each series the model type, the number of predictors specified,
and the goodness-of-fit measure (stationary R-squared is the default). This measure is usually
preferable to an ordinary R-squared when there is a trend or seasonal pattern. If you have specified
outlier methods, there is a column showing the number of outliers detected. The default output also
includes a Ljung-Box Q statistic, which tests for autocorrelation of the error. Here we see that the
result was significant for the Market_2, Market_4, and Total series. Below we will examine some
residuals plots to see why the results were significant.

The default view (Simple) displays the basic set of output columns. For additional goodness-of-fit
measures, you can use the View menu to select the Advanced option. The check boxes to the left of
each model can be used to choose which models you want to use in scoring. All the boxes are
checked by default. The Check all and Un-check all buttons in the upper left act on all the boxes in a


single operation. The Sort by option can be used to sort the rows in ascending or descending order of
a specified column. As an alternative, you can also click on the column heading itself to change the
order.

Click on the View: menu and select Advanced

Figure 8.25 Time Series Model Output (View = Advanced)

The Root Mean Square Error (RMSE) is the square root of the mean squared error. The Mean
Absolute Percentage Error (MAPE) is obtained by taking the absolute error for each time period,
dividing it by the actual series value, averaging these ratios across all time points, and multiplying by
100. The Mean Absolute Error (MAE) takes the average of the absolute values of the errors. The
Maximum Absolute Percentage Error (MaxAPE) is the largest absolute forecast error expressed as a
percentage. The Maximum Absolute Error (MaxAE) is the largest forecast error, positive or negative.
And finally, Normalized Bayesian Information Criterion (Norm BIC) is a general measure of the
overall fit of a model that attempts to account for model complexity.
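
These measures are easy to compute directly from observed and predicted values; here is a small numpy sketch (hypothetical numbers) that follows the definitions above:

    import numpy as np

    def fit_measures(actual, predicted):
        # Compute the fit measures described above from two arrays of values
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        err = actual - predicted
        pct = 100 * np.abs(err) / np.abs(actual)
        return {"RMSE": np.sqrt(np.mean(err ** 2)),
                "MAE": np.mean(np.abs(err)),
                "MAPE": np.mean(pct),
                "MaxAE": np.max(np.abs(err)),
                "MaxAPE": np.max(pct)}

    print(fit_measures([100, 110, 120], [98, 113, 119]))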

From this table, you can easily scan the statistics to look for better, or poorer, fitting models. We can
see here that Market_5 has the highest Stationary R-squared value (0.544) and Total has a very low
one (0.049). However, the Total series has a lower MAPE than any of the other series.

The summary statistics at the bottom of the output provide the mean, minimum, maximum, and
percentile rank values for the standard fit measures. Here we see that the value for Stationary
R-squared at the highest percentile (Percentile 95) is 0.544. This means that Market_5 should be ranked
in the highest percentile based on this statistic, and the Total series should be ranked in the lowest.

Model Parameters
Time series models are represented by specific equations, and each model therefore has coefficients
or parameters associated with its various terms. These parameters can provide additional insight into a
model and the series that it predicts.

Click Parameters tab

A Holt's linear trend model has two parameters: level and trend. The level of the series is modeled
with Alpha, which varies from 0 (older observations count just as much as current observations) to 1
(the current observation is used exclusively). The alpha estimate is 1.0, and it is significant at the .05
level, so this smoothly varying series can be modeled for level by ignoring previous observations
when predicting an observation.

Gamma is a parameter which models the trend in the series, and it varies from 0 (trends are based on
all observations) to 1 (the trend is based on only the most recent observations). The gamma value of
0.3 (also significant at the .05 level) indicates that the series trend (which is clearly increasing over
time) requires using some of the past observations as well as more recent ones. Note that the value of
gamma does not describe the trend itself.

Figure 8.26 Parameters of Holt's Linear Trend Model for Market_1

An experienced analyst can use the parameters for a model to compare one model to another, to
compare changes over time in a model (and thus a series), and to assess whether a model makes
sense. You may want to review the parameters for the other series’ models.
Now let’s look at the Residual plots.

Click on the Residuals tab


Figure 8.27 Residuals Output for the Market_1 Series

The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function
(PACF) of the residuals (the differences between expected and actual values) for each target field.
The ACF values are the correlations between the current value and the previous time points. By
default, 24 autocorrelations are displayed. The PACF values look at the correlations after controlling
for the series values at the intervening time points. If all of the bars fall within the 95% confidence
limits (the blue highlighted area), then there are no significant autocorrelations in the series. That
seems to be the case with the Market_1 series. However, as we saw in Figure 8.24, the Market_2
series seemed to have significant autocorrelation based on the Ljung-Box Q statistic. Let’s take a look
at the residuals plot for the Market_2 series to see if we can see why that statistic was significant for
that series.
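
The same diagnostics can be reproduced outside Modeler; a minimal statsmodels sketch (hypothetical residuals) computes the ACF, PACF, and Ljung-Box Q test:

    import numpy as np
    from statsmodels.tsa.stattools import acf, pacf
    from statsmodels.stats.diagnostic import acorr_ljungbox

    resid = np.random.normal(0, 1, 60)          # hypothetical model residuals

    acf_values = acf(resid, nlags=24)           # correlations with earlier time points
    pacf_values = pacf(resid, nlags=24)         # correlations controlling for intervening lags

    # Small p-values indicate a non-random (autocorrelated) pattern in the errors
    print(acorr_ljungbox(resid, lags=[18]))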

Use the Display plot for model: option to select the Market_2 series


Figure 8.28 Residuals Output for the Market_2 Series

Here we see that there is significant autocorrelation at lag 6 in both the ACF and PACF plots. Thus,
the results of the Ljung-Box Q statistic and these two plots are consistent: there is a non-random
pattern in the errors. This does not mean that the current model can’t be used for forecasting; it
may perform adequately for the broadband provider. But it does suggest that the model can be
improved. The Expert Modeler is an automatic modeling technique, and it normally finds a reasonably
good model, but that doesn’t mean that some tweaking on your part isn’t appropriate.

Click OK
Place a Table node near the generated Time Series model
Connect the generated model to the Table node
Run the Table node


Figure 8.29 Table Output Showing Fields Created by Time Series Model

The table now contains a forecast value for each time point for each series, along with upper and
lower confidence limits. In addition, there is a field called $TI_Future that indicates whether a record
contains forecast data. For records that extend into the future, the value of this field will be “1”.

Scroll to the bottom of the table and then slightly to the right


Figure 8.30 Table Output with Future Values Displayed

Notice that the original series all have null values on these last three records because they are
projected into the future. On the right hand side in Figure 8.30 we can see the forecast values for
future months (January 2004 to March 2004) for the Market_1 series.

Finally, let’s create a chart showing the forecast for one of the series.

Close the Table window


Place a Time Plot node near the generated model on the stream canvas
Connect the Time Plot node to the generated model
Select the following fields to be plotted: Market_5, $TS-Market_5, $TSLCI-Market_5,
$TSUCI-Market_5
Uncheck the Display Series in separate panels option
Click Run


Figure 8.31 Sequence Chart for Market_5 along with Forecast and Upper & Lower Confidence
Limits

From this chart, it appears that the model fits this series very well.

Close the Time plot graph window


Click on File…Save Stream As and name the file Broadband.str

8.12 Applying Models to Several Series


We just produced models for six series, along with forecasts for the next three months. Suppose that
three months have passed and we now have actual data for January to March 2004 (for which we made
forecasts initially). Now it is April 2004 and we want to make forecasts for the next three months
(April to June 2004) using the same models we developed before, without having to re-identify the
models now that the file has been updated with three months of new records. We do this with the
Continue estimation using existing model(s) option in the Time Series node, which applies the models
we just created to the updated data file. (We leave aside whether the correct forecast period is three
months, more, or less.)

Click File…Open Stream…Broadband2.str (in the C:\Train\ModelerPredModel folder)


Copy the generated Time Series model node from Broadband.str (or add it to the stream from the
Models manager)
Paste the generated model into Broadband2.str


Figure 8.32 Broadband2.str with the Generated Model from Broadband.str

This node contains the settings from the time series models we just created. Normally, at this point,
with any other type of PASW Modeler generated model, we would make predictions on new data by
attaching this node to the Type node and running the generated model. This would automatically
make predictions for new cases.

Time series data, though, is different. Unlike other types of data files, where there is usually no
special order to the cases (in terms of modeling), order makes a difference in a time series. To reuse
our settings, but also use the new data (from January to March) to make estimates, we must create a
new Time Series node directly from the generated Time Series model.

Right-click on the generated model and select Edit


Click on Generate…Generate Modeling Node

This places a time series modeling node onto the palette.

Close the time series modeling output and delete the copied generated model from the
stream canvas
Connect the Time Intervals node to the new Time Series node

Figure 8.33 Broadband2.str with the Time Series Node Generated from the Previous Model


We don’t have to specify any target fields because the models, with all specifications, are already
stored in the generated time series modeling node. We simply insert the model node and decide
whether the model should be re-estimated or not. Assuming that you have recently estimated models,
you might be willing to act as if the model type for each series still holds. You can avoid redoing
models and apply the existing models to the new data by using the method Continue estimation using
existing model(s) option. This choice means that PASW Modeler will reuse the stored model form
(the type of exponential smoothing or ARIMA model). Thus, for example, if the original
model for a particular time series was Holt's linear trend, the same type of model is used for re-
estimating and forecasting for that data; the system does not attempt to find the best model type for
the new data. The Expert Modeler will estimate new model parameters based on the additional data.

If instead you wish to re-estimate the model type, then you can uncheck the Continue estimation
check box. Although it will clearly take more computing time to redo the models, you may prefer this
choice unless you have many, many time series which are very long. However, if you are making
forecasts every month (week, etc.) based on just one additional month (week, etc.) of data, it may not
be worth the effort to redo the complete models every month. In that case, you may wish to re-
estimate the parameters every month, but re-estimate the models themselves every few months.
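
Outside Modeler, "continue estimation" corresponds to keeping the same model form and re-estimating only its parameters on the extended series; a statsmodels sketch (hypothetical updated series) follows:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Hypothetical: the original 60 months plus the 3 newly observed months
    y_updated = pd.Series(100 + 5 * np.arange(63) + np.random.normal(0, 2, 63),
                          index=pd.date_range("1999-01-01", periods=63, freq="MS"))

    # Same model form as before (Holt's linear trend); only the parameters are re-estimated
    refit = ExponentialSmoothing(y_updated, trend="add").fit()
    print(refit.forecast(3))    # forecasts for the next three unseen months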

Double-Click on the Time Series node

By default, the Time Series node will use the existing models for estimation.


Figure 8.34 Time Series Model Node with Continue Estimating Using Existing Models

Click Run to place a new model in the Models Manager


Browse the new model


Figure 8.35 Time Series Model Output

As we can see, the models used for each series are the same as before (see Figure 8.24), although the
statistics are not (examine stationary R-squared, for example). Let’s review the parameter estimates.

Click Parameters tab


Figure 8.36 Parameters for Market_1 Holt's Linear Trend Model Estimated with New Data

The alpha parameter is still 1.0, but the gamma parameter is now almost zero (0.001) and it is non-
significant, so it is effectively equal to zero. This means that the trend in the series is modeled using
all the data, rather than more recent observations, compared to the original model for the Market_1
series.

Now let’s take a look at the new forecasts for April, May and June.

Close the Time Series model browser


Attach a Table node to the new Time Series model
Run the Table node


Figure 8.37 Table Node Output with New Forecasts

There are now null values for the original series for April, May, and June 2004, but there are
predictions for these months. In addition, you can compare the predictions for the first three months
of 2004 made with these data to the original predictions we made above. They will not be the same
because we now have three months of additional data with which to improve the model.

In summary, in this lesson we demonstrated a typical application of time series analysis in data
mining by showing how to make forecasts for several series at once. We then used these models but
re-estimated them on new data to make new forecasts at a future date for those same series. The
process of applying the models to new data can be repeated as often as necessary.


Summary Exercises
The exercises in this lesson are written for the data file broadband_1.sav.

1. Using the same dataset that was used in the lesson (broadband_1.sav), rerun the Time Series
node, using different series from the ones used in the lesson to fit a model and then produce
forecasts.

2. Try rerunning the models requesting outlier detection. Does this make any difference in the
generated models?

3. For those with extra time: Try specifying your own exponential smoothing model(s) or an
ARIMA model, if you are knowledgeable about these methods, to see whether you can obtain
a better model than that found by the Expert Modeler for one or more of the series.


Lesson 9: Logistic Regression


Objectives
• Review the concepts of logistic regression
• Use the technique to model credit risk

Data
A risk assessment study in which customers with credit cards were assigned to one of three
categories: good risk, bad risk-profitable (some payments missed or other problems, but were
profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a
number of demographics, including age, income, number of children, number of credit cards, number
of store credit cards, having a mortgage, and marital status, were available for about 2,500 records.

9.1 Introduction to Logistic Regression


Logistic regression, unlike linear regression, develops a prediction equation for a categorical target
field that contains two or more unordered categories (the categories could be ordered, but logistic
regression does not take the ordering into account). Thus it can be applied to such situations as:

• Predicting which brand (of the major brands) of personal computer an individual will
purchase
• Predicting whether or not a customer will close her account, accept an offering, or switch
providers

Logistic regression technically predicts the probability of an event (of a record being classified into a
specific category of the target field). The logistic function is shown in Figure 9.1. Suppose that we
wish to predict whether someone buys a product. The function displays the predicted probability of
purchase based on an incentive.

Figure 9.1 Logistic Model for Probability of Purchase

We see the probability of making the purchase increases as the incentive increases. Note that the
function is not linear but rather S-shaped. The implication of this is that a slight change in the
incentive could be effective or not depending on the location of the starting point. A linear model


would imply that a fixed change in incentive would always have the same effect on probability of
purchase. The transition from low to high probability of purchase is quite gradual. However, with a
logistic model the transition can occur much more rapidly (steeper slope) near the .5 probability
value.

To understand how the model functions, we need to review some equations. The logistic model
makes predictions based on the probability of an outcome. Binary (two target category) logistic
regression can be formulated as:

Prob(event) = \frac{e^{\alpha + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}}{1 + e^{\alpha + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}}

where X1, X2, …, Xk are the input fields.

This can also be expressed in terms of the odds of the event occurring.

Odds(event) = \frac{Prob(event)}{1 - Prob(event)} = \frac{Prob(event)}{Prob(no\ event)} = e^{\alpha + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}

where the outcome is one of two categories (event, no event). If we take the natural log of the odds,
we have a linear model, akin to a standard regression equation:

\ln(Odds(event)) = \alpha + B_1 X_1 + B_2 X_2 + \dots + B_k X_k

With two categories, a single odds ratio summarizes the outcome. However, when there are more than
two target categories, ratios of the category probabilities can still describe the outcome, but additional
ratios are required. For example, in the credit risk data used in this lesson there are three target
categories: good risk, bad risk–profit, and bad risk–loss. Suppose we take the Good Risk category as
the reference or baseline category and assign integer codes to the target categories for identification:
(1) Bad Risk–Profit, (2) Bad Risk–Loss, (3) Good Risk. For the three categories we can create two
probability ratios:

g(1) = \frac{\pi(1)}{\pi(3)} = \frac{Prob(Bad\ Risk - Profit)}{Prob(Good\ Risk)}

and

g(2) = \frac{\pi(2)}{\pi(3)} = \frac{Prob(Bad\ Risk - Loss)}{Prob(Good\ Risk)}

where \pi(j) is the probability of being in target category j.

Each ratio is based on the probability of a target category divided by the probability of the reference or
baseline target category. The remaining probability ratio (Bad Risk–Profit / Bad Risk–Loss) can be
obtained by taking the ratio of the two ratios shown above. Thus the information in J target categories
can be summarized in (J−1) probability ratios.


In addition, these target-category probability ratios can be related to input fields in a fashion similar to
what we saw in the binary logistic model. Again using the Good Risk category as the reference or
baseline, we have the following model:

\ln\left(\frac{\pi(1)}{\pi(3)}\right) = \ln\left(\frac{Prob(Bad\ Risk - Profit)}{Prob(Good\ Risk)}\right) = \alpha_1 + B_{11} X_1 + B_{12} X_2 + \dots + B_{1k} X_k

and

\ln\left(\frac{\pi(2)}{\pi(3)}\right) = \ln\left(\frac{Prob(Bad\ Risk - Loss)}{Prob(Good\ Risk)}\right) = \alpha_2 + B_{21} X_1 + B_{22} X_2 + \dots + B_{2k} X_k

Notice that there are two sets of coefficients for the three-category case, each describing the ratio of a
target category to the reference or baseline category. If we complete this logic and create a ratio
containing the baseline category in the numerator, we would have:

\ln\left(\frac{\pi(3)}{\pi(3)}\right) = \ln\left(\frac{Prob(Good\ Risk)}{Prob(Good\ Risk)}\right) = \ln(1) = 0 = \alpha_3 + B_{31} X_1 + B_{32} X_2 + \dots + B_{3k} X_k

This implies that the coefficients associated with \ln(\pi(3)/\pi(3)) are all 0 and so are not of interest. Also,
the ratio relating any two target categories, excluding the baseline, can be easily obtained by
subtracting their respective natural log expressions. Thus:

\ln\left(\frac{\pi(1)}{\pi(2)}\right) = \ln\left(\frac{\pi(1)}{\pi(3)}\right) - \ln\left(\frac{\pi(2)}{\pi(3)}\right), or

\ln\left(\frac{Prob(Bad\ Risk - Profit)}{Prob(Bad\ Risk - Loss)}\right) = \ln\left(\frac{Prob(Bad\ Risk - Profit)}{Prob(Good\ Risk)}\right) - \ln\left(\frac{Prob(Bad\ Risk - Loss)}{Prob(Good\ Risk)}\right)

We are interested in predicting the probability of each target category for specific values of the
predictor fields. This can be derived from the expressions above. The probability of being in target
category j is:

\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}, where J is the number of target categories.

In our example with the three risk categories, for category (1):


\frac{g(1)}{g(1) + g(2) + g(3)} = \frac{\pi(1)/\pi(3)}{\pi(1)/\pi(3) + \pi(2)/\pi(3) + \pi(3)/\pi(3)} = \frac{\pi(1)}{\pi(1) + \pi(2) + \pi(3)} = \frac{\pi(1)}{1}

And substituting for the g(j)’s, we have an equation relating the predictor fields to the target category
probabilities.

\pi(1) = \frac{e^{\alpha_1 + B_{11} X_1 + B_{12} X_2 + \dots + B_{1k} X_k}}{e^{\alpha_1 + B_{11} X_1 + \dots + B_{1k} X_k} + e^{\alpha_2 + B_{21} X_1 + \dots + B_{2k} X_k} + e^{\alpha_3 + B_{31} X_1 + \dots + B_{3k} X_k}}
       = \frac{e^{\alpha_1 + B_{11} X_1 + \dots + B_{1k} X_k}}{e^{\alpha_1 + B_{11} X_1 + \dots + B_{1k} X_k} + e^{\alpha_2 + B_{21} X_1 + \dots + B_{2k} X_k} + 1}
In this way, the logic of binary logistic regression can be naturally extended to permit analysis of
categorical target fields with more than two categories.

9.2 A Multinomial Logistic Analysis: Predicting Credit Risk
We will perform a multinomial logistic analysis that attempts to predict credit risk (three categories)
for individuals based on several financial and demographic input fields. The data file has been split
into two, and we use risktrain.txt to train the model. We are interested in fitting a model, interpreting
and assessing it, and obtaining a prediction equation. Possible input fields are shown below.

Field name Field description

AGE age in years
INCOME income (in British pounds)
GENDER f=female, m=male
MARITAL marital status: single, married, divsepwid (divorced, separated or widowed)
NUMKIDS number of dependent children
NUMCARDS number of credit cards
HOWPAID how often paid: weekly, monthly
MORTGAGE have a mortgage: y=yes, n=no
STORECAR number of store credit cards
LOANS number of other loans
INCOME1K income divided by 1,000 (i.e., in thousands of British pounds)

The target field is:

Field name Field description


RISK credit risk: 1= bad risk-loss, 2=bad risk-profit, 3= good risk


To access the data:

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory


Double-click on Logistic.str
Run the Table node, examine the data, and then close the Table window
Double-click on the Type node

The target field is credit risk (RISK). Notice that only four input fields are used. This is done to
simplify the results for this presentation. As an exercise, the other fields will be used as predictors.
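
For comparison outside Modeler, a multinomial logistic model with the same four predictors can be sketched with statsmodels (illustrative only; the tab-delimited read and the integer coding of RISK are assumptions, and statsmodels uses the first category as the baseline rather than good risk):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumption: risktrain.txt is tab-delimited and uses the field names from the lesson
    risk = pd.read_csv("risktrain.txt", sep="\t")
    risk["RISK_CODE"] = pd.Categorical(risk["RISK"]).codes   # integer-code the target

    model = smf.mnlogit("RISK_CODE ~ INCOME1K + NUMKIDS + C(MARITAL) + C(MORTGAGE)",
                        data=risk).fit()
    print(model.summary())    # one set of coefficients per non-baseline target category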

Figure 9.2 Type Node for Logistic Analysis

Close the Type node dialog


Double-click on the Logistic Regression model node named RISK
Click on the Model tab


Figure 9.3 Logistic Regression Dialog

In the Model tab, you can choose whether a constant (intercept) is included in the equation. The
Procedure option is used to specify whether a binomial or multinomial model is created. The options
that will be available to you in the dialog box will differ according to which modeling procedure you
select.

Binomial is used when the target field has two discrete values, such as good risk/bad risk, or
churn/not churn. Whenever you use this option, you will in addition be asked to declare which of
your flag or categorical fields should be treated as categorical, the type of contrast you want
performed, and the reference category for each predictor. The default contrast is Indicator, which
indicates the presence or absence of category membership. However, in fields with some implicit
order, you may want to use another contrast such as Repeated, which compares each category with
the one that precedes it. The default reference or base category is the First category. If you prefer,
you can change this to the Last category.

Multinomial should be used when the target field is a categorical field with more than two values.
This is the correct choice in our example because the RISK field has three values: bad loss, bad profit,
and good risk. Whenever you use this option, the Model type option will become available for you to
specify whether you want a main effects model, a full factorial model, or a custom model. By default,
a model including the main effects (no interactions) of factors (categorical inputs) and covariates
(continuous inputs) will be run. This is similar to what the Regression model node will do (unless


interaction terms are formally added). The Full factorial option would fit a model including all factor
interactions (in our example, with two categorical predictors, the two-way interaction of MARITAL
and MORTGAGE would be added).

Notice that there are Method options (as there were for linear regression), so stepwise methods can be
used when the Main Effects model type is selected. When a number of input fields are available, the
stepwise methods provide a method of input field selection based on statistical criteria.

The Base Category for target option is used to specify the reference category. The default is the First
category in the list, which in this case is bad loss. Note: This field is unavailable if the contrast
setting is Difference, Helmert, Repeated, or Polynomial.

Select the Multinomial Procedure option (if necessary)


Click on the Specify button to the right of Base category for target. This will open the
Insert Value dialog box
Click on good risk

Figure 9.4 Insert Value Dialog

Click the Insert button

This will change the base target category. The result is shown in Figure 9.5.


Figure 9.5 Logistic Regression Dialog with Good Risk as the Base Target Category

Click on the Expert tab


Click the Expert Mode option button


Figure 9.6 Logistic Expert Mode Options

The Scale option allows adjustment to the estimated parameter variance-covariance matrix based on
over-dispersion (variation in the outcome greater than expected by theory, which might be due to
clustering in the data). The details of such adjustment are beyond the scope of this course, but you can
find some discussion in McCullagh and Nelder (1989).

If the Append all probabilities checkbox is selected, predicted probabilities for every category of the
target field will be added to each record passed through the generated model node. If it is not selected,
a predicted probability field is added only for the predicted category.

Click the Output button


Make sure the Likelihood ratio tests check box is selected
Make sure the Classification table check box is selected

By default, summary statistics and (partial) likelihood ratio tests for each effect in the model appear in
the output. Also, 95% confidence bands will be calculated for the parameter estimates. We have
requested a classification table so we can assess how well the model predicts the three risk categories.


Figure 9.7 Logistic Regression Advanced Output Options

In addition, a table of observed and expected cell probabilities can be requested (Goodness of fit chi-
square statistics). Note that, by default, cells are defined by each unique combination of a covariate
(continuous input) and factor (categorical input) pattern, and a response category. Since a continuous
predictor (INCOME1K) is used in our analysis, the number of cell patterns is very large and each
might have but a single observation. These small counts could possibly yield unstable results, and so
we will forego goodness of fit statistics. The asymptotic correlation of parameter estimates can
provide a warning for multicollinearity problems (when high correlations are found among parameter
estimates). Iteration history information is requested to help debug problems if the algorithm fails to
converge, and the number of iteration steps to display can be specified. Monotonicity measures can be
used to find the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the
percentage of the total number of pairs that each represents. The Somers' D, Goodman and Kruskal's
Gamma, Kendall's tau-a, and Concordance Index C are also displayed in this table. Information
criteria provides Akaike’s information criterion (AIC) and Schwarz’s Bayesian information criterion
(BIC).

Click OK
Click Convergence button

Figure 9.8 Logistic Regression Convergence Criteria


The Logistic Regression Convergence Criteria options control technical convergence criteria.
Analysts familiar with logistic regression algorithms might use these if the initial analysis fails to
converge to a solution.

Click Cancel
Click Run
Browse the Logistic Regression generated model node named RISK in the Models
Manager window
Click the Advanced tab, and then expand the browsing window

The advanced output is displayed in HTML format.

Figure 9.9 Record Processing Summary

The marginal frequencies of the categorical inputs and the target are reported, along with a summary
of the number of valid and missing records. A record must have valid values on all inputs and the
target in order to be included in the analysis. We have nearly 2,500 records for the analysis.


Figure 9.10 Model Fit and Pseudo R-Square Summaries

The Final model chi-square statistic tests the null hypothesis that all model coefficients are zero in the
population, equivalent to the overall F test in regression. It has ten degrees of freedom that correspond
to the parameters in the model (seen below), is based on the change in –2LL (–2 log likelihood) from
the initial model (with just the intercept) to the final model, and is highly significant. Thus at least
some effect in the model is significant. The AIC and BIC fit measures are also displayed; smaller
values indicate better fit. Because each of them decreased, we can conclude that the model fit
improved with the addition of the predictors.

Pseudo r-square measures try to measure the amount of variation (as functions of the chi-square lack
of fit) accounted for by the model. The model explains only a modest amount of the variation (the
maximum is 1, and some measures cannot reach this value).

Figure 9.11 Likelihood Ratio Tests

The Model Fitting Criteria table provided an omnibus test of effects in the model. Here we have a test
of significance for each effect (in this case the main effect of an input field) after adjusting for the
other effects in the model. The caption explains how it is calculated. All effects are highly significant.
Notice that the intercepts are not tested in this way, but tests of the individual intercepts can be found
in the Parameter Estimates table. In addition, we can use this table to rank order the importance of
the predictors. For instance, if we focus on the -2 LL value, removing INCOME1K as a predictor
would increase the -2 LL value by 302.422. Clearly, the removal of this
predictor would have far more impact on the overall fit than if we were to eliminate any of the other
predictors. The further -2LL gets from zero, the worse the fit. Thus, we can conclude that
INCOME1K is the most important predictor, followed by MARITAL, NUMKIDS, and MORTGAGE.

For those familiar with binary (two category) logistic regression, note that the values in the df
(degrees of freedom) column are double what you would expect for a binary logistic regression
model. For example, the covariate income (INCOME1K), which is continuous, has two degrees of
freedom. This is because with three target categories, there are two probability ratios to be fit,
doubling the number of parameters. Income has by far the largest chi-square value compared to the
other predictors with two (or even four) degrees of freedom.

9.3 Interpreting Coefficients


The most striking feature of the Parameter Estimates table is that there are two sets of parameters.
One set is for the probability ratio of “bad risk–loss” to “good risk,” which is labeled “bad loss.” The
other set is for the probability ratio of “bad risk–profit” to “good risk,” labeled “bad profit.” You can
view the estimates in equation form in the Model tab, but the Advanced tab contains more
supplementary information.

Figure 9.12 Parameter Estimates

For each of the two target probability ratios, each predictor is listed, plus an intercept, with the B
coefficients and their standard errors, a test of significance based on the Wald statistic, and the
Exp(B) column, which is the exponentiated value of the B coefficient, along with its 95% confidence


interval. As with ordinary linear regression, these coefficients are interpreted as estimates for the
effect of a particular input, controlling for the other inputs in the equation.

Recall that the original (linear) model is in terms of the natural log of a probability ratio. The intercept
represents the log of the expected probability ratio of two target categories when all continuous inputs
are zero and all categorical fields are set to their reference category (last group) values. For
covariates, the B coefficient is the effect of a one-unit change in the input on the log of the probability
ratio. Examining income (INCOME1K) in the “bad loss” section, an increase of 1 unit (equivalent to
1,000 British pounds) changes the log of the probability ratio between “bad loss” and “good risk” by
–.056, a decrease. But what does this mean in terms of probabilities? Moving to the Exp(B) column,
we see the value is .945 for INCOME1K (in the “bad loss” section of the table). Thus increasing
income by 1 unit (or 1,000 British pounds) multiplies the expected ratio of the probability of being a
bad loss to the probability of being a good risk by a factor of .945. In other words, increasing income
reduces the expected probability of being a “bad loss” relative to being a “good risk,” by a factor of
.945 per 1,000 British pounds. This finding makes common sense.
If we examine the income coefficient in the “bad profit” section of the table, we see that in a similar
way (Exp(B) = .878) the expected probability of being a “bad profit” relative to being a good risk
decreases as income increases. Thus increasing income, after controlling for the other fields in the
equation, is associated with decreasing the probability of having a “bad loss” or “bad profit” outcome
relative to being a “good risk.” This relationship is quantified by the values in the Exp(B) column and
the Sig column indicates that both coefficients are statistically significant.
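
The Exp(B) column is simply e raised to the B coefficient, so the relationship can be verified directly (the coefficients are taken from the table above; the 10-unit case is an added illustration):

    import numpy as np

    print(np.exp(-0.056))        # about .945: the "bad loss" income effect per 1,000 pounds
    print(np.exp(-0.130))        # about .878: the "bad profit" income effect per 1,000 pounds
    print(np.exp(-0.056 * 10))   # about .57: the effect of a 10,000-pound income increase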

Turning to the number of children (NUMKIDS), we see that its coefficient is significant for the “bad
loss” ratio, but not the “bad profit” ratio. Examining the Exp(B) column for NUMKIDS in the “bad
loss” section, the coefficient estimate is 2.267. For each additional child (one unit increase in
NUMKIDS), the expected ratio of the probability of being a “bad loss” to being a “good risk” more
than doubles. Thus, controlling for other predictors, adding a child (one unit increase) doubles the
expected probability of being a “bad loss” relative to a “good risk.” However, controlling for the
other predictors, the number of children has no significant effect on the probability ratio of being a
“bad profit” relative to a “good risk.”

The Logistic node uses a General Linear Model coding scheme. Thus for each categorical input (here
MARITAL and MORTGAGE), the last category value is made the reference category and the other
coefficients for that input are interpreted as offsets from the reference category. In examining the
table we see that the last categories for MARITAL (single) and MORTGAGE (y) have B coefficients
fixed at 0. Because of this the coefficient of any other category can be interpreted as the change
associated with shifting from the reference category to the category of interest, controlling for the
other input fields. Since the reference category coefficients are fixed at 0, they have no associated
statistical tests or confidence bands.

Looking at the MARITAL input field, its two coefficients (for divsepwid and married categories) are
significant for both the “bad loss” and “bad profit” summaries. In the “bad loss” section, we see the
estimated Exp(B) coefficient for the “MARITAL=divsepwid” category is .284, while that for
“MARITAL=married” is 2.891. Thus we could say that, after controlling for other inputs, compared
to those who are single, those who are divorced, separated or widowed have a large reduction (.284)
in the expected ratio of the probability of being a “bad loss” relative to a “good risk.” Put another
way, the divorced, separated or widowed group is expected to have fewer “bad losses” relative to
“good risks” than is the single group. On the other hand, the married group is expected to have a
much higher (by a factor of almost 3) proportion of “bad losses” relative to “good risks” than the
single group. The explanation of why being married versus single should be associated with an


increase of “bad losses” relative to “good risks” should be worked out by the analyst, perhaps in
conjunction with someone familiar with the credit industry (domain expert). If we examine the
MARITAL Exp(B) coefficients for the “bad profit” ratios, we find a very similar result.

Finally, MORTGAGE is significant for both the “bad loss” and “bad profit” ratios. Since having a
mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that compared
to the group with a mortgage, those without a mortgage have a greater expected probability of being
“bad losses” (1.828) or “bad profits” (2.526) relative to “good risks.” In short, those without
mortgages are less likely to be good risks, controlling for the other predictors.

In this way, the statistical significance of inputs can be determined and the coefficients interpreted.
Note that if a predictor were not significant in the Likelihood Ratio Tests table, then the model should
be rerun after dropping the field. Although NUMKIDS is not significant for both sets of category
ratios, the joint test (Likelihood Ratio Test) indicates it is significant and so we would retain it.

Classification Table
The classification table, sometimes called a misclassification table or confusion matrix, provides a
measure of how well the model performs. With three target categories we are interested in the overall
accuracy of model classification, the accuracy for each of the individual target categories, and
patterns in the errors.

Figure 9.13 Classification Table

The rows of the table represent the actual target categories while the columns are the predicted target
categories. We see that overall, the predictive accuracy of the model is 62.4%. Although marginal
counts do not appear in the table, by adding the counts within each row we find that the most
common target category is bad profit (1,475). This constitutes 60.1% of all cases (2,455).
Thus the overall predictive accuracy of our model is not much of an improvement over the simple
rule of always predicting “bad risk–profit.” However, we should recall that this simple rule would
never make a prediction of “bad risk–loss” or “good risk.”

In examining the individual categories, the “bad risk–profit” group is predicted most accurately
(87.3%), while the other categories, “bad risk–loss” (15.9%) and “good risk” (36.8%) are predicted
with much less accuracy. Not surprisingly, most errors in prediction for these latter two target
categories are predicted to be “bad risk–profit.”
The classification table allows us to evaluate a model from the perspective of predictive accuracy.
Whether this model would be adequate depends in part on the value of correct predictions and the
cost of errors. Given the modest improvement of this model over simply classifying all cases as “bad
risk–profit,” in practice an analyst would see if the model could be improved by adding additional
predictors and perhaps some interaction terms.
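
A classification table of this kind is just a cross-tabulation of observed and predicted categories; a small pandas sketch (hypothetical values standing in for RISK and $L-RISK) follows:

    import pandas as pd

    actual = pd.Series(["bad profit", "bad profit", "good risk", "bad loss", "good risk"])
    predicted = pd.Series(["bad profit", "good risk", "good risk", "bad profit", "good risk"])

    print(pd.crosstab(actual, predicted, margins=True))   # rows = observed, columns = predicted
    print("Overall percent correct:", (actual == predicted).mean())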


Finally, it is important to note that the predictions were evaluated on the same data used to fit the
model and for this reason may be optimistic. A better procedure is to keep a separate validation
sample on which to evaluate the predictive accuracy of the model.

Making Predictions
We now have the estimated model coefficients. How does the Logistic generated model node make
predictions from the model? First, let’s see the actual predictions by adding the generated model to
the stream.

Close the Model browsing window


Add a Table node to the stream and connect the Logistic generated model to the Table
node
Run the Table node

Figure 9.14 Predicted Value and Probability from Logistic Model

The field $L-RISK contains the most likely prediction from the model (here “good risk”). The
probabilities for all three target categories must sum to 1; the model prediction is the category with
the highest probability. That probability is contained in the field $LP-RISK. So for the first case, the
prediction is “good risk” and the predicted probability of this occurring is .692 for this combination of
input values. You prefer that the probability be as close to 1 as possible (the lowest possible value for
the predicted category is .333; Why?).

To illustrate how the actual calculation is done, let’s take an individual who is single, has a mortgage,
no children, and has an income of 35,000 British pounds (INCOME1K = 35.00). What is the predicted


probability of her (although gender was not included in the model) being in each of the three risk
categories? Into which risk category would the model place her?

Earlier in this lesson we showed the following (where π (j) is the probability of being in target
category j):

\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)}, where J is the number of target categories

If we substitute the parameter estimates in order to obtain the estimated probability ratios, we have:

\hat{g}(1) = e^{0.438 - 0.056 \cdot Income1k + 0.818 \cdot Numkids - 1.260 \cdot Marital1 + 1.062 \cdot Marital2 + 0.603 \cdot Mortgage1}

\hat{g}(2) = e^{4.285 - 0.130 \cdot Income1k + 0.153 \cdot Numkids - 1.220 \cdot Marital1 + 1.021 \cdot Marital2 + 0.927 \cdot Mortgage1}

and

\hat{g}(3) = 1

where, because of the coding scheme for the categorical inputs (Factors):

Marital1 = 1 if Marital=divsepwid; 0 otherwise
Marital2 = 1 if Marital=married; 0 otherwise
Mortgage1 = 1 if Mortgage=n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:

\hat{g}(1) = e^{0.438 - 0.056 \cdot 35.0 + 0.818 \cdot 0 - 1.260 \cdot 0 + 1.062 \cdot 0 + 0.603 \cdot 0} = e^{-1.522} = .218

\hat{g}(2) = e^{4.285 - 0.130 \cdot 35.0 + 0.153 \cdot 0 - 1.220 \cdot 0 + 1.021 \cdot 0 + 0.927 \cdot 0} = e^{-0.265} = .767

\hat{g}(3) = 1
And the estimated probabilities are:

\hat{\pi}(1) = \frac{.218}{.218 + .767 + 1} = .110

\hat{\pi}(2) = \frac{.767}{.218 + .767 + 1} = .386

\hat{\pi}(3) = \frac{1}{.218 + .767 + 1} = .504


Since the third group (good risk) has the greatest expected probability (.504), the model predicts that
the individual belongs to that group. The next most likely group to which the individual would be
assigned would be group 2 (bad risk–profit) because its expected probability is the next largest (.386).
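
The arithmetic above is easy to check; here is a short numpy sketch reproducing the worked example (coefficients taken from the parameter estimates shown earlier):

    import numpy as np

    # Single, has a mortgage, no children, income of 35,000 pounds (INCOME1K = 35.0)
    income1k, numkids = 35.0, 0
    marital1 = marital2 = mortgage1 = 0    # single and mortgage = y are the reference categories

    g1 = np.exp(0.438 - 0.056 * income1k + 0.818 * numkids
                - 1.260 * marital1 + 1.062 * marital2 + 0.603 * mortgage1)
    g2 = np.exp(4.285 - 0.130 * income1k + 0.153 * numkids
                - 1.220 * marital1 + 1.021 * marital2 + 0.927 * mortgage1)
    g3 = 1.0

    print(np.array([g1, g2, g3]) / (g1 + g2 + g3))   # about [0.110, 0.386, 0.504]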

Additional Readings
Those interested in learning more about logistic regression might consider David W. Hosmer and
Stanley Lemeshow’s Applied Logistic Regression, 2nd Edition, New York: Wiley, 2000.


Summary Exercises
The exercises in this lesson use the data file risktrain.txt, detailed in the following text box.

RiskTrain.txt contains information from a risk assessment study in which customers with credit
cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed
or other problems, but profitable for the issuing company), and bad risk-loss. In addition to the risk
classification field, a number of demographics are available for about 2,500 cases. We want to
predict credit risk from the demographic fields. The file contains the following fields:

ID ID number
AGE Age
INCOME Income in British pounds
GENDER Gender
MARITAL Marital status
NUMKIDS Number of dependent children
NUMCARDS Number of credit cards
HOWPAID How often is customer paid by employer (weekly, monthly)
MORTGAGE Does customer have a mortgage?
STORECAR Number of store credit cards
LOANS Number of outstanding loans
RISK Credit risk category
INCOME1K Income in thousands of British pounds (field
derived within PASW Modeler)

1. Continuing with the stream from the lesson, add the other available inputs, excluding
INCOME (which is linearly related to income1k), and ID, to a logistic regression model and
evaluate the results. Do the additional fields substantially improve the predictive accuracy of
the model? Examine the estimated coefficients for the significant inputs. Do these
relationships make sense?

2. Rerun the Logistic node, dropping those inputs that were not significant in the last analysis.
Does the accuracy of the model change much? Does the interpretation of any of the
coefficients change substantially?

3. Rerun the Logistic node, this time using the Stepwise method. Do the input fields selected
match those retained in Exercise 2?

4. Run a rule induction model (using C5.0 or CHAID) on this data, using all fields but ID and
INCOME as inputs. How does the accuracy of this model compare to that found by logistic
regression? What does this suggest about the relations in the data? Do the inputs used by the
model correspond to the inputs that were found to be significant in the logistic regression
analysis?

5. Run a neural net model on this data, again excluding ID and INCOME as inputs. Make sure
you request predictor importance. Does the neural network outperform the other models? Are
the important predictors in the neural network model the same as the significant input fields
in the logistic regression?


Lesson 10: Discriminant Analysis


Objectives
• How Does Discriminant Analysis Work?
• The Elements of a Discriminant Analysis
• The Discriminant Model
• How Cases are Classified
• Assumptions of Discriminant Analysis
• Analysis Tips
• A Two–Group Discriminant Example

Data
To demonstrate discriminant analysis we use data from a study in which respondents answered,
hypothetically, whether they would accept an interactive news subscription service (via cable). We
wish to identify those groups most likely to adopt the service. Several demographic fields are
available, including education, gender, age, income (in categories), number of children, number of
organizations the respondent belonged to, and the number of hours of TV watched per day. The target
measure was whether they would accept the offering or not.

10.1 Introduction
Discriminant analysis is a technique designed to characterize the relationship between a set of fields, often called the predictors or response fields, and a grouping field with a relatively small number of
categories. By modeling the relationship, discriminant can make predictions for categories of the
grouping field (target). To do so, discriminant creates a linear combination of the predictors that best
characterizes the differences among the groups. The technique is related to both regression and
multivariate analysis of variance, and as such it is another general linear model technique. Another
way to think of discriminant analysis is as a method to study differences between two or more groups
on several fields simultaneously.

Common uses of discriminant include:

1. Deciding whether a bank should offer a loan to a new customer.


2. Determining which customers are likely to buy a company’s products.
3. Classifying prospective students into groups based on their likelihood of success at a school.
4. Identifying patients who may be at high risk for problems after surgery.


10.2 How Does Discriminant Analysis Work?


Discriminant analysis assumes that the domain of interest is composed of separate and distinct populations, as represented by the categories of the grouping field. The grouping field can have
two or more categories. Furthermore, we assume that each population is measured on a set of
fields—the predictors—that follow a multivariate normal distribution. Discriminant attempts to find
the linear combinations of the predictors that best separate the populations. If we assume two input
fields, X and Y, and two groups for simplicity, this situation can be represented as in Figure 10.1.

Figure 10.1 Two Normal Populations and Two Predictor Fields, with Discriminant Axis

The two populations or groups clearly differ in their mean values on both the X and Y-axes.
However, the linear function—in this instance, a straight line—that best separates the two groups is a
combination of the X and Y values, as represented by the line running from lower left to upper right
in the scatterplot. This line is a graphic depiction of the discriminant function, or linear combination
of X and Y, that is the best predictor of group membership. In this case with two groups and one
function, discriminant will find the midpoint between the two groups that is the optimum cutoff for
separating the two groups (represented here by the short line segment). The discriminant function and
cutoff can then be used to classify new observations.

If there are more than two predictors, then the groups will (hopefully) be well separated in a
multidimensional space, but the principle is exactly the same. If there are more than two groups, more
than one classification function can be calculated, although not all the functions may be needed to
classify the cases. Since the number of predictors is almost always more than two, scatterplots such as
Figure 10.1 are not always that helpful. Instead, plots are often created using the new discriminant
functions, since it is on these that the groups should be well separated.

The effect of each predictor on each discriminant function can be determined, and the predictors can
be identified that are more important or more central to each function. Nevertheless, unlike in
regression, the exact effects of the predictors are not typically seen as of ultimate importance in
discriminant analysis. Given the primary goal of correct prediction, the specifics of how this is
accomplished are not as critical as the prediction itself (such as offering loans to customers who will pay them back). In addition, as will be demonstrated below, the predictors do not directly predict the
grouping field, but instead a value on the discriminant function, which, in turn, is used to generate a
group classification.

10.3 The Discriminant Model


The discriminant model has the following mathematical form for each function:

FK = D0 + D1X1 + D2X2 + … + DpXp

where FK is the score on function K, the Di’s are the discriminant coefficients, and the Xi’s are the
predictor or response fields (there are p predictors). The maximum number of functions K that can be
derived is equal to the minimum of the number of predictors (p) or the quantity (number of groups –
1). In most applications, there will be more predictors than categories of the grouping field, so the
latter will limit the number of functions. For example, if we are trying to predict which of three offers customers will choose, (3 − 1) = 2 discriminant functions can be derived.

When more than one function is derived, each subsequent function is chosen to be uncorrelated, or
orthogonal, to the previous functions (just as in principal components analysis, where each
component is uncorrelated with all others, see Lesson 7). This allows for straightforward partitioning
of variance.

Discriminant creates a linear combination of the predictor fields to calculate a discriminant score for
each function. This score is used, in turn, to classify cases into one of the categories of the grouping
field.

10.4 How Cases Are Classified


There are three general types of methods to classify cases into groups.

1. Maximum likelihood or probability methods: These techniques assign a case to group k if its
probability of membership is greater for group k than for any other group. These probabilities
are posterior probabilities, as defined below. This method relies upon assumptions of
multivariate normality to calculate probability values.

2. Linear classification functions: These techniques assign a case to group k if its score on the
function for that group is greater than its score on the function for any other group. This
method was first suggested by Fisher, so these functions are often called Fisher linear
discriminant functions (which is how PASW Modeler refers to them).

3. Distance functions: These techniques assign a case to group k if its distance to that group’s
centroid is smaller than its distance to any other group’s centroid. Typically, the Mahalanobis
distance is the measure of distance used in classification.

When the assumption of equal covariance matrices is met, all three methods give equivalent results.

PASW Modeler uses the first technique, a probability method based on Bayesian statistics, to derive a
rule to classify cases. The rule uses two probability estimates. The prior probability is an estimate of
the probability that a case belongs to a particular group when no information from the predictors is
available. Prior probabilities are typically either determined by the number of cases in each category
of the grouping field, or by assuming that the prior probabilities are all equal (so that if there are three groups, the prior probability of each group would be 1/3). We have more to say about prior
probabilities below.

Second, the conditional probability is the probability of obtaining a specific discriminant score (or
one further from the group mean) given that a case belongs to a specific group. By assuming that the
discriminant scores are normally distributed, it is possible to calculate this probability.

With this information and by applying Bayes’ rule, the posterior probability is calculated, which is
defined as the likelihood or probability of group membership, given a specific discriminant score. It is
this probability value that is used to classify a case into a group. That is, a case is assigned to the
group with the highest posterior probability.
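
As a rough sketch of this rule (not PASW Modeler's internal computation), the Python below combines prior probabilities with normal conditional densities of a discriminant score to produce posterior probabilities. The priors, group centroids, and score are all invented for the illustration.

import math

def normal_density(x, mean, sd=1.0):
    # Conditional density of a discriminant score, assuming normality within each group
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

priors = {"No": 0.515, "Yes": 0.485}        # hypothetical priors (e.g., computed from group sizes)
centroids = {"No": -0.50, "Yes": 0.53}      # hypothetical group means on the discriminant function
score = 0.80                                # discriminant score for a new case

numerators = {g: priors[g] * normal_density(score, centroids[g]) for g in priors}
total = sum(numerators.values())
posteriors = {g: numerators[g] / total for g in numerators}

print(posteriors, max(posteriors, key=posteriors.get))   # assign to the group with the highest posterior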

Although PASW Modeler uses a probability method of classification, you will most probably use a
method based on a linear function to classify new data. This is mainly for ease of calculation because
calculating probabilities for new data is computationally intensive compared to using a classification
function. This will be illustrated below.

10.5 Assumptions of Discriminant Analysis


As with other general linear model techniques, discriminant makes some fairly rigorous assumptions
about the population. And as with these other techniques, it tends to be fairly robust to violations of
these assumptions.

Discriminant assumes that the predictor fields are measured on an interval or ratio scale (continuous).
However, as with regression, discriminant is often used successfully with fields that are ordinal, such
as questionnaire responses on a five- or seven-point Likert scale. Nominal fields can be used as
predictors if they are given dummy coding. The grouping field can be measured on any scale and can
have any number of categories, though in practice most analyses are run with five or fewer categories.

Discriminant assumes that each group is drawn from a multivariate normal population. This
assumption is often violated in practice, but moderate departures from normality are usually not a problem, especially as sample size increases. If this assumption is violated, the tests of significance and
the probabilities of group membership will be inexact. If the groups are widely separated in the space
of the predictors, this will not be as critical as when there is a fair amount of overlap between the
groups.

When the number of categorical predictor fields is large (as opposed to interval–ratio predictors),
multivariate normality cannot hold by definition. In that case, greater caution must be used, and many
analysts would choose to use logistic regression instead. Most evidence indicates that discriminant
often performs reasonably well with such predictors, though.

Another important assumption is that the covariance matrices of the various groups are equal. This is
equivalent to the standard assumption in analysis of variance about equal variances across factor
levels. When this is violated, distortions can occur in the discriminant functions and the classification
equations. For example, the discriminant functions may not provide maximum separation between
groups when the covariances are unequal. If the covariance matrices are unequal but the input fields’
distribution is multivariate normal, the optimum classification rule is the quadratic discriminant
function. But if the matrices are not too dissimilar, the linear discriminant function performs quite
well, especially when the sample sizes are small. This assumption can be tested with the Explore
procedure or with the Box’s M statistic, displayed by Discriminant.


For a more detailed discussion of problems with assumption violation, see P.A. Lachenbruch
(Discriminant Analysis. 1975. New York: Hafner) or Carl Huberty (Applied Discriminant Analysis.
1994. New York: Wiley).

10.6 Analysis Tips


In addition to the assumptions of discriminant, some additional guidelines are helpful. Many analysts
would recommend having at least 10 to 20 times as many cases as predictor fields to ensure that a
model doesn’t capitalize on chance variation in a particular sample. For accurate classification,
another common rule is that the number of cases in the smallest group should be at least five times the
number of predictors. In the interests of parsimony, Huberty recommends having a goal of only 8 to
10 response fields in the final model. Although in applied work this may be too stringent, keep in
mind that more is not always better.

Outlying cases can affect the results by biasing the values of the discriminant function coefficients.
Looking at the Mahalanobis distance for a case or examining the probabilities is normally an effective
check for outliers. If a case has a relatively high probability of being in more than one group, it is
difficult to classify. Analyses can be run with and without outliers to see how results are affected.

Multicollinearity is less of a problem in Discriminant Analysis because the exact effect of a predictor
field is typically not the focus of an analysis. When two fields are highly correlated, it is difficult to
partition the variance between them, and the coefficient estimates are often unstable. Still, the
accuracy of prediction may be little affected. Multicollinearity can be more of a problem when
stepwise methods of field selection are used, since fields can be removed from a model for reasons
unrelated to that field’s ability to separate the groups.

10.7 Comparison of Discriminant and Logistic Regression


Discriminant and logistic regression have the same broad purpose: to build a model predicting which
category (or group) individuals belong to based on a set of interval scale predictors. Discriminant
formally makes stronger assumptions about the predictors, specifically that for each group they
follow a multivariate normal distribution with identical population covariance matrices. Based on this
you would expect discriminant to be rarely used since this assumption is seldom met in practice.
However, Monte Carlo simulation studies indicate that multivariate normality is not critical for
discriminant to be effective.

Discriminant follows from a view that the domain of interest is composed of separate populations,
each of which is measured on variables that follow a multivariate normal distribution. Discriminant
attempts to find the linear combinations of these measures that best separate the populations. This is
represented in Figure 10.1. The two populations are best separated along an axis (discriminant
function) that is a linear combination of x and y. The midpoint between the two populations is the cut-
point. This function and cut-point would be used to classify future cases.

Logistic regression, as we have seen in Lesson 9, is derived from a view of the world in which
individuals fall more along a continuum.

This difference in formulation led discriminant to be employed in credit analysis (there are those who
repay loans and those who don’t), while logistic regression was used to make risk adjustments in
medicine (depending on demographics, health characteristics and treatment you are more or less
likely to survive a disease). Despite these different origins, discriminant and logistic give very similar
results in practice. Monte Carlo simulation work has not found one to be superior to the other over very general circumstances. There is, of course, the obvious point that if the data are based on
samples from multivariate normal populations, then discriminant outperforms logistic regression.

One consideration when choosing between these two methods involves how many dichotomous
predictor fields (or dummy coded nominal or ordinal fields) are used in the analysis. Because of the
stronger assumptions made about the predictors by discriminant, the more categorical fields you have,
the more you would lean toward logistic regression. Within the domain of response-based
segmentation, discriminant analysis is seen more often on the business side, while logistic models are more common when the problem is formulated from a marketing perspective as a choice model.

Note that neither discriminant nor logistic will produce a list of groups more or less associated with
various target categories. Rather they will indicate which predictor fields (some may represent
demographic characteristics) are relevant to the category. From the prediction equation or other
summary measures you can determine the combinations of characteristics that most likely lead to the
desired target category.

Recommendations
Logistic regression and discriminant analysis give very similar results in practice. Since discriminant
does make stronger assumptions about the nature of your predictors (formally, multivariate normality
and equal covariance matrices are assumed), as more of your predictor fields are categorical (and thus
need to be dummy coded) or dichotomous, you would move in the direction of logistic regression.
Certain research areas have a tradition of using only one of the methods, which may also influence
your choice.

10.8 An Example: Discriminant


To demonstrate discriminant analysis we use data from a study in which respondents indicated
whether they would accept an interactive news subscription service (via cable). Most of the predictor
fields are continuous in scale, the exceptions being GENDER (a dichotomy) and INC (an ordered
categorical field). We would expect few if any of these to follow a normal distribution, but will
proceed with discriminant.

Note that the predictor fields for discriminant must be numeric, although they can be categorical.
Most importantly, if you have predictors that are truly categorical, such as region of the U.S. (e.g.,
northwest, southwest, etc.), even with numeric coding, Discriminant will not create dummy
variables/fields for these categories. You will need to create dummy variables yourself (use the
SetToFlag node), and then enter the dummy variables in the model, leaving one out so as not to create
redundancy. In this current example we don’t face this issue.
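
Inside PASW Modeler the SetToFlag node handles this for you. If you were preparing similar data outside Modeler, the dummy coding could be sketched in pandas roughly as follows (the field and category names are purely illustrative):

import pandas as pd

df = pd.DataFrame({"region": ["northwest", "southwest", "northeast", "southwest"]})

# One 0/1 flag per category, then drop one category so it serves as the reference
dummies = pd.get_dummies(df["region"], prefix="region").astype(int)
dummies = dummies.drop(columns=["region_northeast"])

df = pd.concat([df, dummies], axis=1)
print(df)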

As in our other examples, we will move directly to the analysis although ordinarily you would run
data checks and exploratory data analysis first.

Click File…Open Stream and then move to the c:\Train\ModelerPredModel folder


Double-click on Discriminant.str
Right-click on the Table node and select Run to view the data


Figure 10.2 The Interactive News Study Data

Place a Discriminant node from the Modeling palette to the right of the Type node
Connect the Type node to the Discriminant node

The name of the Discriminant node will immediately change to NEWSCHAN, the target field.

Figure 10.3 Discriminant Node Added to the Stream

Double-click on the Discriminant node


Click on the Model tab


Figure 10.4 Discriminant Dialog

The Use partitioned data option can be used to split the data into separate samples for training and
testing. This may provide an indication of how well the model will work with new data. We will not
use this option in this example, but instead will take advantage of a different option for validating the
model (Leave-one-out classification) that is built into the Discriminant procedure.

The Build model for each split option enables you to use a single stream to build separate models for
each possible value of a field which is set to role Split in the Type node. All the resulting models will
be accessible from a single model nugget.

The Method option allows you to specify how you want the predictors entered into the model. By
default, all of the terms are entered into the equation. If you do not have a particular model in mind,
you can invoke the Stepwise option that will enter predictors into the equation based on a statistical
criterion. At each step, terms that have not yet been added to the model are evaluated, and if the best
of those terms adds significantly to the predictive power of the model, it is added. Some analysts
prefer to enter all the predictor fields into the equation and then evaluate which are important.
However, if there are many correlated predictor fields, you run the risk of multicollinearity, in which
case a Stepwise method may be preferred. A drawback is that the Stepwise method has a strong
tendency to overfit the training data. When using this method, it is especially important to verify the
validity of the resulting model with a hold-out test sample or new data (which is common practice in
data mining).

Click on the Method button and select Stepwise


Figure 10.5 Discriminant Analysis with Method Stepwise

Click on the Expert tab


Click on Expert mode

Figure 10.6 Discriminant Expert Options

You can use the Prior Probabilities area to provide Discriminant with information about the
distribution of the target in the population. By default, before examining the data, Discriminant
assumes an observation is equally likely to belong to each group. If you know that the sample
proportions reflect the distribution of the target in the population then you can use the Compute from
group sizes option to instruct Discriminant to make use of this information. For example, if a target category is very rare, Discriminant can make use of this fact in its prediction equation. We don't know what the proportions would be in the population, so we retain the default.

The Use covariance matrix option is often useful whenever the homogeneity of variance assumption is not
met. In general, if the groups are well separated in the discriminant space, heterogeneity of variance
will not be terribly important. However, in situations when you do violate the equal variance
assumption, it may be useful to use the Separate-groups covariance matrices to see if your predictions
change by very much. If they do, that would suggest that the violation of the equal variance
assumption was serious. It should be noted that using separate-groups covariance matrices does not
affect the results prior to classification. This is because PASW Modeler does not use the original
scores to do the classification. Thus, the use of the Fisher classification functions is not equivalent to
classification by PASW Modeler with separate covariance matrices.

Click the Output button


Figure 10.7 Discriminant Advanced Output Dialog

Checking Univariate ANOVAs will have PASW Modeler display significance tests of between-group
(target categories) differences on each of the predictors. The point of this is to provide some hint as to
which fields will prove useful in the discriminant function, although this is precisely what
discriminant will resolve. The Box’s M statistic is a direct test of the equality of covariance matrices.
The covariance matrices are ancillary output and very rarely viewed in practice. However, you might
view the within-groups correlations among the predictors to identify highly correlated predictors.


Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to make
predictions for future observations (customers). Both sets of coefficients produce the same predictions
when equal covariance matrices are assumed. If there are only two target categories (as is our
situation), either set of coefficients is easy to use. If you want to try “what if” scenarios using a
spreadsheet, the unstandardized coefficients, which involve a single equation in the two-category
case, are more convenient. If you run discriminant with more than two target categories, then Fisher's
coefficients are easier to apply as prediction rules.

Casewise results can be used to display the codes for the actual group, predicted group, posterior
probabilities, and discriminant scores for each case. The Summary table, also known by several other
names including Classification table, Misclassification Table, and Confusion table, displays the
number of cases correctly and incorrectly assigned to each of the groups based on the discriminant
analysis. The Leave-one-out classification classifies each case based on discriminant coefficients
calculated while the case is excluded from the analysis. This is a jackknife method and provides a
classification table that should generalize at least slightly better to other samples. You can also
produce a Territorial map, which is a plot of the boundaries used to classify cases into groups, but the
map will not be displayed if there is only one discriminant function (the maximum number of
functions is equal to the number of categories – 1 in the target field).

The Stepwise options allow you to display a Summary of statistics for all fields after each step.

Click the Means, Univariate ANOVAS, and Box’s M check boxes in the Descriptives area
Click the Fisher’s and Unstandardized check boxes in the Function Coefficients area
Click the Summary table and Leave-one-out classification check boxes in the
Classification area


Figure 10.8 Discriminant Advanced Output dialog after Option Selection

Click OK
Click the Stepping button

Figure 10.9 Stepping Dialog

Wilks’ lambda is the default and probably the most common method. The differences between the
methods are somewhat technical and beyond the scope of this course. You can change the statistical criterion for field entry. For example, you might want to make the criterion more stringent when
working with a large sample.

Click Cancel
Click the Run button
Browse the Discriminant generated model in the Models Manager window
Click the Advanced tab, and then expand the browsing window
Scroll to the Classification Results

Figure 10.10 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important
summary since it tells us how well we can expect to predict the target. The actual (known) groups
constitute the rows and the predicted groups make up the columns. Of the 227 people surveyed who
said they would not accept the offering, the discriminant model correctly predicted 157 of them; thus
accuracy for this group is 69.2%. For the 214 respondents who said they would accept the offering,
66.4% were correctly predicted. Overall, the discriminant model was accurate in 67.80% of the cases.

Is this good? Will this model work well with new data? The answer to the first question will largely
depend on what level of predictive accuracy you required before you began the project. One way we
can assess the success of the model is to compare these results with the predictions we would have
made if we simply guessed the larger group. If we simply did that, we would be correct in 227 of 441
(227 + 214) instances, or about 51.5% of the time. The 67.8% correct figure, while certainly far from
perfect accuracy, does far better than guessing. The Cross-validated portion of the table gives us an
idea about how accurate this model will be with new data. The percent of correctly classified cases
has decreased slightly from 67.8% to 67.3% for the cross-validation. Because these results are
virtually identical, it appears the model is valid.

Since we are interested in discovering which characteristics are associated with someone who accepts
the news channel offer, we proceed.

Scroll back to the Group Statistics pivot table


Figure 10.11 Group Statistics

Viewing the means by themselves is of limited use, but notice the group that would accept the service
is about 7 years older than the group that would not accept, whereas the daily hours of TV viewing
are almost identical for the two groups. The standard deviations are very similar across groups, which
is promising for the equal covariance matrices assumption.

Scroll to the Tests of Equality of Group Means pivot table

Figure 10.12 Univariate F Tests

The significance tests of between-group differences on each of the predictor fields provide hints as to
which will be useful in the discriminant function (recall we are using Wilks' criterion as a stepwise method). Notice Age in Years has the largest F (is most significant) and will be first selected in the
stepwise solution. This table looks at each field ignoring the others, while discriminant adjusts for the
presence of the other fields in the equation (as would regression).

Scroll to the Box’s M test results

Figure 10.13 Box’s M Test Results

Because the significance value is well above 0.05, we cannot reject the null hypothesis that the
covariance matrices are equal. However, the Box’s M test is quite powerful and leads to rejection of
equal covariances when the ratio N/p is large, where N is the number of cases and p is the number of
fields. The test is also sensitive to lack of multivariate normality, which applies to these data. If the
variances were unequal, the effect on the analysis would be to create errors in the assignment of cases
to groups.

Scroll to the Eigenvalues and Wilks’ Lambda portion of the output


Figure 10.14 Summaries of Discriminant Function (Eigenvalues and Wilks’ Lambda)

These two tables are overall summaries of the discriminant function. The canonical correlation
measures the correlation between a field (or fields when there are more than two groups) contrasting
the groups and an optimal (in terms of maximizing the correlation) linear combination of the
predictors. In short, it measures the strength of the relationship between the predictor fields and the
groups. Here, there is a modest (.363) canonical correlation.

Wilks’ lambda provides a multivariate test of group differences on the predictors. If this test were not
significant (it is highly significant), we would have no basis on which to proceed with discriminant
analysis. Now we view the individual coefficients.


Scroll down until you see the Standardized Coefficients and Structure Matrix

Figure 10.15 Standardized Coefficients and Structure Matrix

Standardized discriminant coefficients can be used as you would use standardized regression
coefficients in that they attempt to quantify the relative importance of each predictor in the
discriminant function. The only three predictors that were selected by the stepwise analysis were
Education, Gender and Age. Not surprisingly, age is the dominant factor. The signs of the
coefficients can be interpreted with respect to the group means on the discriminant function. Notice
the coefficient for gender is negative. Other things being equal, as you shift from a man (code 0) to a
woman (code 1), this results in a one unit change, which when multiplied by the negative coefficient
will lower the discriminant score, and move the individual toward the group with a negative mean
(those that don’t accept the offering). Thus women are less likely to accept the offering, adjusting for
the other predictors.

The Structure Matrix displays the correlations between each field considered in the analysis and the
discriminant function(s). Note that income category correlates more highly with the function than
gender or education, but it was not selected in the stepwise analysis; this is probably because income
correlated with predictors that entered earlier. The standardized coefficients and the structure matrix
provide ways of evaluating the discriminant fields and the function(s) separating the groups.

Scroll down until the Canonical Discriminant Function Coefficients and Functions at Group Centroids are visible


Figure 10.16 Unstandardized Coefficients and Group Means (Centroids)

In Figure 10.1 we saw a scatterplot of two separate groups and the axis along which they could be
best separated. Unstandardized discriminant coefficients, when multiplied by the values of an
observation, project an individual onto this discriminant axis (or function) that separates the groups. If
you wish to use the unstandardized coefficients for prediction purposes, you would simply multiply a
prospective customer’s education, gender and age values by the corresponding unstandardized
coefficients and add the constant. Then you compare this value to the cut-point (by default the
midpoint) between the two group means (centroids) along the discriminant function (the cut-point
appears in Figure 10.1). If the prospective customer's value is greater than the cut point, you predict the customer will accept; if the score is below the cut point, you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios; for example, if we have a male with 16 years of education, at what age would such an individual become a good prospect? To answer this we determine the age value that moves the
discriminant score above the cut-point.
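
The logic of this scoring rule is easy to sketch in a few lines of code. The coefficient, constant, and cut-point values below are placeholders; the real ones would be read from the Canonical Discriminant Function Coefficients and Functions at Group Centroids tables in Figure 10.16.

# Placeholder unstandardized coefficients, constant, and cut-point (substitute the real output values)
coef = {"educ": 0.20, "gender": -0.45, "age": 0.07}
constant = -4.5
cut_point = 0.0   # midpoint between the two group centroids

def discriminant_score(educ, gender, age):
    return constant + coef["educ"]*educ + coef["gender"]*gender + coef["age"]*age

score = discriminant_score(educ=16, gender=0, age=50)            # a male with 16 years of education, age 50
print(score, "accept" if score > cut_point else "not accept")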

Scroll down until you see the Classification Function Coefficients


Figure 10.17 Fisher Classification Coefficients

Fisher function coefficients can be used to classify new observations (customers). If we know a
prospective customer’s education (say 16 years), gender (Female=1) and age (30), we multiply these
values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30
–20.85) which yield a numeric score. We repeat the process using the coefficients for the Yes group
and obtain another score. The customer is then placed in the target group for which she has the higher
score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets,
databases) for predictive purposes.
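
The same arithmetic is easy to carry into a spreadsheet or a few lines of code. In the sketch below, the No-group coefficients are the values quoted above; the Yes-group values are placeholders that you would read from Figure 10.17.

# Fisher classification coefficients: No-group values from the text, Yes-group values are placeholders
fisher = {
    "No":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "constant": -20.85},
    "Yes": {"educ": 2.10, "gender": 1.50, "age": 0.40, "constant": -25.00},   # hypothetical
}

def fisher_score(group, educ, gender, age):
    c = fisher[group]
    return c["constant"] + c["educ"]*educ + c["gender"]*gender + c["age"]*age

scores = {g: fisher_score(g, educ=16, gender=1, age=30) for g in fisher}
print(scores, max(scores, key=scores.get))   # classify the customer into the group with the higher score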

We did not test for the normality assumptions of discriminant analysis in this example. In general,
normality does not make a great deal of difference, but heterogeneity of the covariance matrices can,
especially if the sample group sizes are very different. Here the sample sizes were about the same.

As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the
costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the
prediction was far from perfect we were able to identify the relations between the demographic fields
and the choice.


Summary Exercises
The exercises in this lesson use the data file credit.sav. The following table provides details about
fields in the file.

Credit.sav has the same fields as risktrain.txt except that they are all numeric so that we can use them
all in a Discriminant Analysis. The file contains the following fields:

ID ID number
AGE Age
INCOME Income
GENDER Gender
MARITAL Marital status
NUMKIDS # of dependent children
NUMCARDS # of credit cards
HOWPAID How often is customer paid by employer (weekly, monthly)
MORTGAGE Does customer have a mortgage?
STORECAR # of store cards held
LOANS # other loans
RISK Credit risk category
INCOME1K Income in thousands of British pounds (field
derived within PASW Modeler)

1. Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and
connect it to Credit.sav.

2. Attach a Type node to the Source node, and a Table node to the Type node. Run the Table
and allow PASW Modeler to automatically type the fields.

3. Attach a SetToFlag node to the Type node and create separate dummy fields for each
category of the marital field. Make sure that you code the True value as 1 and the False value
as 0. This is important because Discriminant expects numeric data for the inputs.

4. Attach a Type node to the SetToFlag node.

5. Edit the second Type node and change the role for risk to Target, and to None for id, marital, income1k, and marital_3 (or another of the marital dummy fields of your choice, to serve as the reference category). Leave the role as Input for all
the rest of the fields.

6. Use a Distribution node to examine the distribution of risk.

7. Attach a Discriminant node to the second Type node and run the analysis. How many
classification functions are significant? What fields are important predictors?

8. How accurate is the model as a whole? On which category is it more accurate?


Lesson 11: Bayesian Networks


Objectives
• The Basics of Bayesian Networks
• Types of Bayesian Networks in PASW Modeler
• Creating models with the Bayes Net node
• Modifying Bayes Network Model Settings

Data
We will use the dataset churn.txt that we have employed in several previous lessons. This data file
contains information on 1477 customers of a telecommunication company who have at some time
purchased a mobile phone. The customers fall into one of three groups: current customers,
involuntary leavers and voluntary leavers. In this lesson, we use a Bayes Net to predict group
membership. A partition node will be used to split the data.

11.1 Introduction
Bayesian analysis has been introduced to data mining with Bayesian networks, which are graphic
representations of the probabilistic relationships among a set of fields. These networks are very
general and can be used to represent causal relationships, can have multiple target fields, and often
allow an analyst to specify the existence (or non-existence) of certain relationships using domain
knowledge and experience.

The Bayes Net node provides the ability to use two different types of Bayesian networks to predict a
categorical target. Bayes Net can use predictors on any scale, but continuous (Range) fields will be
automatically binned into five groups. In theory a Bayes Net can use many predictors, but since every
field will be categorical, cells with low or zero counts are more likely, especially if some categorical
predictors have many categories. This is less an issue with very large data files.

Bayes Nets are an alternative to other methods of prediction for categorical targets, including decision
trees, neural nets, logistic/multinomial regression, or SVM models. Unlike many other PASW
Modeler models, a graphical depiction of the model in the form of a Bayesian network is available in
the generated model to further model understanding, although there is no predictive equation with
coefficients for individual predictors as in some other models.

The Bayes Net node is included in the Classification module.

11.2 The Basics of Bayesian Networks


Bayesian analysis is an area of statistics that is based on a different approach to probability than frequentist statistics, which is, for example, the standard approach used to calculate the probability
values for a t-test. The frequentist approach defines probability as the limit of an outcome’s relative
frequency in a large number of trials, and it assumes that a priori knowledge plays no role in
determining probability. In contrast, Bayesian statistics incorporate prior knowledge or belief about
an event or outcome, so that one has both prior and posterior probabilities.

Bayesian analysis, and Bayes’ theorem, on which it is based, is named after the Reverend Thomas
Bayes, who studied how to compute a distribution for the parameter of a binomial distribution. There are several ways to state Bayes' theorem. If we wish to test a hypothesis H that is conditional on
evidence from some Data, then one general statement of Bayes’ theorem is:

P(H|Data) = P(Data|H) * P(H) / P(Data),

where P(H|Data) means the probability of H given the Data.

The issue of prior probabilities enters because P(H) is the prior probability of H given no other information, i.e., before the data collected for our study are taken into account. This probability can be subjective, or it can be based
on more objective prior knowledge, such as the proportion of persons who buy a new refrigerator in a
year (for a model where we are trying to predict who will buy a new refrigerator).
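
As a quick, purely hypothetical illustration of the theorem (the numbers are invented): suppose 10 percent of customers buy a new refrigerator in a year, 60 percent of buyers have visited the retailer's appliance web page, and 15 percent of all customers have visited that page. The posterior probability of buying, given a page visit, is then:

p_h = 0.10         # P(H): prior probability of buying a refrigerator this year
p_data_h = 0.60    # P(Data|H): probability of a page visit among buyers
p_data = 0.15      # P(Data): overall probability of a page visit

p_h_data = p_data_h * p_h / p_data   # Bayes' theorem
print(round(p_h_data, 2))            # 0.4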

We won't work through the theorem any further here; you can find many good worked examples on websites and in elementary texts on Bayesian statistics. More to the point, you don't really need to understand Bayes' theorem in detail to use the Bayes Net node or its output. A portion of the output is a joint probability table, but that is really nothing other than a bivariate or multi-way crosstabulation of the fields that are found to be dependent (values of one field depend on, or are related to, values of another, although this dependence does not imply causality: correlation does not equal causation, as we have all been taught).

Otherwise, the output fields from a Bayes Net model are similar to those from other models and
include a prediction and the probability of that prediction.

A Bayesian network is a graphical model based on a directed acyclic graph (DAG). First, a directed
graph is shown in Figure 11.1 for comparison. Directed graphs are composed of vertices or nodes (the
circles) that represent fields in a model, and arrows between the nodes that are called variously arcs,
arrows, or directed edges.

Figure 11.1 Simple Example of a Directed Graph

In comparison, a directed acyclic graph is shown in Figure 11.2. Here, for any node n, there is no
path, following the arrows, that begins and ends on n. You can try that for any of the nodes in the
graph.


Figure 11.2 Directed Acyclic Graph to Predict RESPONSE

A Bayesian network is a model that represents a set of data with a directed acyclic graph and that uses
that information to make predictions. Nodes that are connected have probabilistic dependencies.
Nodes that are not connected (broadly speaking) are conditionally independent, which means that
these other nodes add no more information to the relationship, given the nodes that are interconnected
(more about this below). So in the graph in Figure 11.2, the field VISITB is conditionally independent
of ORIVISIT.

A Bayesian network can display causal relationships between nodes with the arcs and arrows.
However, the networks constructed by the Bayes Net node are not designed to represent causal
relationships, for several important reasons. In data-mining, more emphasis is placed on the ability of
a model to make accurate predictions rather than represent causal influences, i.e., the effect of field A
on outcome C is direct and also indirect through field B. The networks constructed by the Bayes Net
node are optimized for prediction. Second, software by itself, despite any claims otherwise, cannot
successfully find causal relationships without user input. That is why in structural equation modeling,
the user must set up the structure of the model and then test whether the data support that structure or
model. Finally, data mining problems often incorporate many potential predictors, making
specification of causal links more and more complex.

The end result of these points is that it is possible to glean information from a network in PASW
Modeler, but you need to be cautious when doing so and not over interpret the model.

Bayesian networks in general are often resistant to problems caused by missing data, and they can
make predictions for cases with missing data. However, the Bayes Net node by default uses listwise
deletion, where any missing data causes it to delete a case from analysis. Why this is so and how it
affects model-building is explained with an example below.

Bayesian networks as implemented in PASW Modeler are designed to use only categorical data for
which probability statements can be readily constructed. This means that only categorical targets can
be used. If a continuous predictor is used, it will be binned into five roughly equally-spaced bins. This
may not always be appropriate for skewed or other non-symmetrical distributions. If you have
predictors like that, you may wish to manually bin these fields using a Binning node before the
Bayesian Network node. For example, you could use Optimal Binning where the Supervisor field is
the same as the Bayesian Network node Target field.
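
To see the kind of grouping this produces, the pandas sketch below cuts a continuous field into five equal-width bins; the file format and field name are assumptions based on this lesson's data, and Modeler's own boundaries may differ slightly.

import pandas as pd

df = pd.read_csv("churn.txt", sep="\t")     # assuming a tab-delimited file

# Five equal-width bins across the observed range, similar in spirit to the node's default binning
df["Est_Income_bin"] = pd.cut(df["Est_Income"], bins=5)
print(df["Est_Income_bin"].value_counts().sort_index())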


11.3 Types of Bayesian Networks in PASW Modeler


The Bayes Net node provides two types of Bayesian networks. To understand them, it helps to first
discuss a Naïve Bayes network.

In Figure 11.3 this type of network is displayed. There is a target field A and a set of predictors B, C,
and D. A is a parent node of the other nodes, and nodes B, C, and D are therefore child nodes of A.
This is reminiscent of the graphical view of a decision tree, but you should not try to equate the two.
Although we are attempting to predict A, the arcs point toward the predictors. This is a consequence
of Bayes theorem, where the prior probability of the data, given the outcome, is included in the
numerator of the equation. This probability is represented by the arrows flowing away from A, the
target.

Figure 11.3 Naïve Bayes Network

Of course, we include fields that are meaningful predictors of the target, so these arrows shouldn’t be
confusing. For example, if we want to predict customers who will make a second purchase from an
online retailer, we can include such things as income, gender, and prior purchase behavior. All of
those will influence a second purchase, but not the reverse.
The other key characteristic of a Naïve Bayes network is that there are no links or dependencies
between the predictors. This is the simplest possible network.
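
To make the direction of the arrows concrete, here is a tiny, entirely hypothetical Naïve Bayes calculation: the posterior for each category of A is proportional to its prior times the product of the conditional probabilities of the observed B, C, and D values given that category (all numbers are invented).

# Hypothetical prior and conditional probability tables for the network in Figure 11.3
prior_a = {"yes": 0.3, "no": 0.7}
p_b_given_a = {"yes": 0.6, "no": 0.2}   # P(observed B value | A)
p_c_given_a = {"yes": 0.5, "no": 0.4}   # P(observed C value | A)
p_d_given_a = {"yes": 0.8, "no": 0.3}   # P(observed D value | A)

scores = {a: prior_a[a] * p_b_given_a[a] * p_c_given_a[a] * p_d_given_a[a] for a in prior_a}
total = sum(scores.values())
print({a: round(s / total, 3) for a, s in scores.items()})   # predict the category with the larger posterior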

With this as background, we can now consider the two networks available in the Bayes Net node.

Tree-Augmented Naïve Bayes (TAN). This type of network extends Naïve Bayes by allowing each
predictor to depend on one other predictor in addition to the target field. Again, this dependence is not
necessarily causal dependence but simply probabilistic dependence given the data at hand. Figure
11.2 shows a TAN network, where you can see that no predictor has more than two arrows pointing
toward it, where all the arrows point away from the target RESPONSE, and where one predictor
(ORISPEND) has no dependency on other predictors.

The conditional probability tables produced by the Bayes Net node will reflect this structure, so a
table for VISITB will include RESPONSE and SPENDB.

Markov Blanket. This type of network selects the set of nodes in the dataset that contain the target’s
parents, its children, and its children’s parents. This is illustrated in Figure 11.4 where once again the
target field is RESPONSE. There were many more potential predictors available than are displayed in
the network, but once a Markov Blanket has been defined, the target node is conditionally independent of all other nodes (predictors), and so those predictors are not used in the network
(model). Essentially, a Markov blanket identifies all the fields that are needed to predict the target.
Notice that arrows can go both to and from the target field in a Markov Blanket.

This type of network should, all things being equal, be more accurate than a TAN, especially with a
large number of fields. However, with large datasets the processing time will be significantly greater.
To reduce the amount of processing, you can use the Feature Selection options on the Expert tab to
have PASW Modeler use only the fields that have a significant bivariate relationship to the
target. As before, arrows from the target to another field don’t indicate causal influence in that
direction.

Figure 11.4 Example of a Markov Blanket Network

You now understand the basics of a Bayesian network and the types of networks produced with the
Bayes Net node. We can begin using Bayesian networks to predict customer churn.

11.4 Creating a Bayes Network Model


We will use the churn data file that we have used in several other lessons. This will allow comparison
to these other techniques.

Click File…Open Stream and move to the c:\Train\ModelerPredModel folder


Double-click on Bayes Net.str
Run the Table node
Close the Table window
Edit the Type node


Figure 11.5 Type Settings for Churn Data

All available input fields will be used (with the exception of ID). The field CHURNED has three
categories.

Close the Type window


Edit the Bayes Net node named CHURNED

There are two types, or structures, of networks available, as explained above.

If you have many fields, you may wish to include a first step of feature selection that will reduce the
number of inputs. This option can be turned on with the Include feature selection preprocessing step
check box.


Figure 11.6 Model Tab in Bayes Net Node

Recall that the probabilities modeled in a Bayesian network are built up from a series of tables,
and so there can be a significant fraction of cells with small or even zero cell counts. This can pose a
computational difficulty; in addition, there is a danger of overfitting the model. The Bayes adjustment
for small cell counts check box reduces these problems by applying smoothing to reduce the effect of
any zero-counts.

If a model has previously been trained, the results shown on the model nugget Model tab are
regenerated and updated each time the model is run if you select the Continue training existing model
check box. You would do this when you have added new or updated data to an existing stream with a
model.

Click Expert tab


Click Expert options button

Missing Values in Bayes Net Models


The default option for a Bayes Net is to use only complete records (Use only complete records check
box). This is equivalent to standard listwise deletion, so if a record has a missing value for any field,
that record won’t be used in creating a model (or in scoring from an existing model). If this option is
unchecked, the Bayes Net will do the equivalent of pairwise deletion, using as much information as
possible.


However, as with any algorithm that uses pairwise deletion, at least two issues become salient. The
number of cases used for the analysis now becomes ill-defined. This may not be critical for most
data-mining projects, but you should be aware of this issue. Perhaps more important, the estimates of
the model can be unstable and be affected by small changes in the data. This could make model
validation more difficult.

If there is a significant amount of missing data, you may wish to estimate/impute some of the missing
data values, although this raises its own complications.

Computationally, the best solution is to use listwise deletion, but that is ideal only when missing data are a small percentage of the file.

Other Bayes Net Expert Options


The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to
carry out independence testing and remove unnecessary links from the network (to find parents and
children of the target field). This can be especially useful when processing data with strong
dependencies among many fields. The default setting for Maximal conditioning set size is 5.

Because tests involving a high number of conditioning fields require more time and memory for
processing, you can limit the number of fields to be included. If you reduce the maximum
conditioning set size, though, the resulting network may contain some superfluous links. You can also
use this setting if you are using a TAN network by requesting feature selection in the Model tab.

The Feature Selection area is available for Markov Blanket models or with TAN models when feature
selection is turned on. You can use this option to restrict the maximum number of inputs used when
processing the model in order to speed up model building. If feature selection is turned on, the default
maximum number of fields to be used in the network is 10. If there are important fields that should be
used in a network, you can specify them in the Inputs always selected box.

The Bayes Net conducts tests of independence on two-way and larger tables to construct the network.
A Likelihood ratio test is used by default, but you can request that a standard Pearson chi-square be
used instead. The significance level of the test can be set, but only if feature selection or a Markov
Blanket network are requested.


Figure 11.7 Expert Options for Bayes Net

At this point we won’t change any defaults.

Click Run

After the model runs:

Right-click and Browse the generated Bayes Net model


Figure 11.8 Bayes Net Model Browser for TAN Model

As with most predictive models, predictor importance is included in the right half of the model
browser window. The most important predictors are clearly SEX, LONGDIST, and International (you
might want to compare this to other models that we developed to predict CHURNED with these data,
such as the decision trees in Lesson 3).

The actual Bayesian TAN model is displayed in the left half of the model browser. The network
graph of nodes displays the relationship between the target and its predictors, as well as the
relationship between the predictors. The importance of each predictor is shown by the density of its
color; a darker blue color shows an important predictor. The target CHURNED has a red node.

You can use the mouse to drag nodes around the graph to more easily view relationships.

Click on the node for CHURNED and drag it more into the center of the network (see Figure
11.9)


Figure 11.9 Bayesian Network Graph

There are several things to notice about the network diagram.

There is a path from CHURNED to every input field. The arrows all point away from CHURNED
even though it is the target field. This makes CHURNED a parent of all the input nodes. These facts
are simply a consequence of how a TAN is defined and don’t mean that somehow churn status is
affecting the input fields. The arrows do indicate which fields will be included in conditional
probability tables, as we will see shortly.

Second, a TAN network allows paths between a predictor and one other predictor (plus the
connection with the target). You can see this if you examine the network closely; no predictor has
more than two arrows going toward it.
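To make that structural rule concrete, a TAN can be summarized as a parent map in which every input has the target as a parent plus at most one other input. The sketch below is only an illustration, loosely based on the links described in the text, not the exact network in Figure 11.9.

```python
# Illustrative TAN parent structure (links are approximate, for illustration only):
# every input has CHURNED as a parent and at most one other input as a parent.
parents = {
    "LONGDIST":      ["CHURNED"],
    "International": ["CHURNED", "LONGDIST"],
    "Car_Owner":     ["CHURNED", "LONGDIST"],
    "LOCAL":         ["CHURNED", "LONGDIST"],
    "Est_Income":    ["CHURNED", "LOCAL"],
}

# Check the TAN constraint: no input has more than one non-target parent.
assert all(len([p for p in ps if p != "CHURNED"]) <= 1 for ps in parents.values())
```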

Third, the links and arrows do have some meaning, but not causal influence. For example, there is an
arrow from LOCAL to Est_Income. Since the number of minutes on average of local phone service
isn’t going to affect one’s income, the direction of the arrow doesn’t indicate causality, but instead a
conditional dependency or interrelatedness. Or consider the paths going from LONGDIST to
International, Car_Owner, and LOCAL. The arrows between LONGDIST and the other two measures
of phone service usage do probably indirectly indicate something meaningful, but not cause and
effect. Instead, the arrows are a sign that there are probabilistic dependencies among these fields.
From our understanding of the data, we might conclude that these dependencies exist because of
different groups of customers who have similar phone use patterns. For example, one group could be
customers who make local calls but not many long distance calls; another group could be those who
make lots of long distance and international calls.

As mentioned earlier, the Bayes Net node bins continuous predictors into five categories, splitting the
range into five equally-spaced groups. You can view the bin values by hovering with the mouse over
a node for a continuous predictor.
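As a rough illustration of this kind of equal-width binning (done outside of Modeler), here is a minimal pandas sketch. The field name Est_Income follows the example, the sample values are invented, and the exact boundary handling Modeler uses may differ.

```python
import pandas as pd

# Invented sample of the Est_Income field
df = pd.DataFrame({"Est_Income": [12000, 25500, 47800, 63900, 81000, 150000]})

# pd.cut with an integer bin count splits the observed range into
# equal-width intervals, analogous to the five bins described above
df["Est_Income_bin"] = pd.cut(df["Est_Income"], bins=5)
print(df)
```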


Hover the mouse over the node for Est_Income

The first bin runs from 0 to 20,054.807. The last bin contains all customers with incomes above
79,888.377.

Figure 11.10 Distribution of Est_Income

We are now in the Basic view of the network (see the View dropdown in the lower left corner). We
can switch to the Distribution view.

Click View dropdown and select Distribution

Figure 11.11 Distribution View of TAN Network


The Distribution view displays the conditional probabilities for each node in the network as a mini-
graph. Bayesian networks work only with categorical data, so the graphs are all bar charts. The
simplest one is for the target field, which shows its distribution unrelated to any other field (because
the arrows point away from it).

You can hover the mouse pointer over a graph to display its values in a popup ToolTip.

Hover the mouse over the bottom bar in the graph for CHURNED

In Figure 11.12 we have isolated just this portion of the network.

Figure 11.12 Percentage of Customers who are Current from ToolTip

The probabilities for the input nodes are more complicated because most are conditional with the
target and another field.

Hover the mouse over the graph for Car_Owner and move it from top to bottom

As you move the mouse over the mini-graph for Car_Owner, you see probabilities listed along with
values of CHURNED and LONGDIST. This is because there are arrows from those two fields pointed
toward Car_Owner.

We can learn more from viewing the conditional probability table for an input node. When you select
a node in either Basic or Distribution view, the associated conditional probabilities table is displayed
in the right half of the model browser. This table contains the conditional probability value for each
node category and each combination of values in its parent nodes.

Click on the mini-graph for Car_Owner to select it


Figure 11.13 Conditional Probability Table for Car_Owner

These conditional probabilities are based on the actual data. Thus, if we created a table with
CHURNED, Car_Owner, and (binned) LONGDIST, we would find, for example, that of those
customers who have the lowest value of LONGDIST (<5.996) and who are current customers (first
row in table), 20% (.20) own a car and 80% (.80) do not (for reference, about 30% of all customers in
the Training partition own a car).
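Outside of Modeler, the same kind of conditional probability table can be reproduced with a crosstab. The sketch below uses a tiny invented stand-in for the Training partition; the binned field name LONGDIST_bin and all of the values are hypothetical.

```python
import pandas as pd

# Tiny invented stand-in for the Training partition; only three fields matter here.
train = pd.DataFrame({
    "CHURNED":      ["Current", "Current", "Current", "Vol", "Vol", "InVol"],
    "LONGDIST_bin": ["<5.996"] * 6,          # hypothetical binned LONGDIST values
    "Car_Owner":    ["N", "N", "Y", "Y", "N", "N"],
})

# Rows are combinations of the parent fields, columns are Car_Owner categories,
# and each row is normalized to sum to 1.0 -- a conditional probability table.
cpt = pd.crosstab(index=[train["CHURNED"], train["LONGDIST_bin"]],
                  columns=train["Car_Owner"],
                  normalize="index")
print(cpt)
```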

However, since we are interested in predicting CHURNED, we can look at this table a bit differently.
If we hold LONGDIST constant (looking at the first 3 rows in the table where LONGDIST <5.996),
we can see how car ownership varies by churn status. Customers who are voluntary churners (Vol)
are more likely to own a car. Customers who are current are the least likely. It is the use of these
probabilities that allows the TAN model to make predictions.

Of course, this conditional probability table only includes two inputs. Since a customer will have a
value on all inputs (the default of listwise deletion), there will be many conditional probability
distributions that must be taken into account when making a prediction. And that is what the model
does with the help of Bayes theorem to combine probabilities (see the PASW Modeler 14 Algorithms
Guide for details).
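A rough sketch of that combination step, with invented numbers: for each candidate value of the target, multiply its prior probability by the relevant conditional probabilities and renormalize. A real TAN also conditions each input on its one predictor parent; that extra conditioning is omitted here for brevity, so this is closer to a Naïve Bayes calculation.

```python
# All numbers are invented for illustration.
prior = {"Current": 0.60, "Vol": 0.25, "Invol": 0.15}

# P(observed input value | CHURNED) for two inputs
p_car_owner_yes = {"Current": 0.20, "Vol": 0.35, "Invol": 0.25}
p_longdist_low  = {"Current": 0.30, "Vol": 0.40, "Invol": 0.90}

scores = {c: prior[c] * p_car_owner_yes[c] * p_longdist_low[c] for c in prior}
total = sum(scores.values())
posterior = {c: round(s / total, 3) for c, s in scores.items()}
print(posterior)   # the category with the largest posterior is the prediction
```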

All these cells have at least one customer because there are probabilities listed in each one. Other
conditional probability tables have zeros because no customer fit that particular pattern of values.
This condition will be important in our discussion of the model predictions next.

Close the Bayes Net model browser


Add an Analysis node to the stream and connect it to the Bayes Net model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run

Figure 11.14 Analysis Node Output for TAN Model

On the Training partition the model is accurate on 78.16% of the cases. It is most accurate on current
customers, least accurate on involuntary churners. The model doesn’t do very well at all on the
Testing partition and is only accurate on 69.79% of the cases overall.

There are no missing data in this data file, yet something odd appears in the Coincidence Matrix for
the Testing data. There is a fourth predicted value of $null$, i.e., missing. Why would the Bayes
network predict a missing value or, more accurately, be unable to make a prediction for 13 cases?
(The accuracy statistics don’t drop the missing cases and count them as an incorrect prediction.)

The fact that missing predictions appear in the Testing partition but not the Training partition is the
tip-off to the cause. As mentioned above, the conditional probabilities are used to make predictions,
e.g., the probability of .20 for a customer with low long distance service who owns a car and is a
current customer. But if a combination of values (a cell in a table) exists in the Training data but not
in the Testing data, the network cannot make predictions for a customer with those characteristics.

Fortunately there are only 13 customers who have a missing predicted value, which is less than 2% of
the file. This is probably acceptable. It does illustrate the importance of having a large and varied
enough training dataset so that all possible combinations have one or more records.

We'll next try the other type of network structure, a Markov Blanket, using the default settings
otherwise.


Close the Analysis output browser


Edit the Bayes Net modeling node
Click Model tab
Click Markov Blanket
Click Run

After the model has run:

Right-click and Browse the generated Bayes Net model

Figure 11.15 Bayes Net Model Browser for Markov Blanket Model

This model looks very different from the TAN network. First, not all the predictors are used. Second,
the arrows go from the inputs to the target field, which is the direction we expect for a causal
predictive model but, as with the TAN model, the arrows should not be used to indicate causal
influence. Arrows in a Markov Blanket can point away from the target field.

Third, there are no connections between the inputs. This isn’t always the case in a Markov Blanket,
but is more likely than with a TAN network. In fact, this network is equivalent to a Naïve Bayes
classifier.

The top three fields on the predictor importance chart are identical to those for the TAN network,
although the order is different.

Let’s view the conditional probability table for CHURNED (the tables for the inputs are
uninteresting).

Click on the node for CHURNED
Expand the right half of the model browser to view the probability table


Figure 11.16 Conditional Probability Table for CHURNED

The table is very large, so we can’t display it all in the figure above. Because all four inputs have
arrows pointing toward CHURNED, its conditional probability table contains all four of these fields.

The first thing we can see is that there are many cells with a probability value of 0 which indicates
that there were no customers with that combination of values.

Second, this type of table is easier to think about and use in the context of predicting
CHURNED because we can choose various combinations of values of the inputs and see what the
distribution of CHURNED is. So, for example, if we select males who make very few calls in any
category (the first row in the table), we see that they are very unlikely to be voluntary churners (.051
probability).

Let’s see how well the Markov Blanket model does at predicting CHURNED.

Close the Bayes Net model browser


Run the Analysis node


Figure 11.17 Analysis Node Output for Markov Blanket Model

This model, with fewer predictors, is less accurate, which is perhaps not surprising. Overall
accuracy on the Training data is 73.16%. The performance on the Testing data is also lower than the
TAN model's, though not by as much.

There are missing predictions for 39 customers, many more than the TAN model. This is because the
conditional probability table for CHURNED has many zero cells, all of which lead to a missing
prediction (because of multiplication by 0 in the probability equations).

If the amount of missing data is small, we can rerun these models and request the Bayes adjustment
for small cell counts, which effectively adds a small amount to any cell with a zero count. We’ll
return to the TAN network, which was more accurate.
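A minimal sketch of the idea behind the adjustment: add a small pseudo-count to every cell before normalizing, so that zero-count patterns no longer wipe out the prediction. The smoothing constant used here (0.5) is an assumption, not the value Modeler actually adds.

```python
import numpy as np

# Counts for one row of a conditional probability table; one cell is empty.
counts = np.array([12.0, 0.0, 5.0])

# Without adjustment, the zero count becomes a zero probability, and any
# prediction that multiplies by it is wiped out.
print(counts / counts.sum())

# With a small pseudo-count added to every cell (Laplace-style smoothing),
# the zero-count pattern can still contribute to a prediction.
adjusted = counts + 0.5               # 0.5 is an assumed smoothing constant
print(adjusted / adjusted.sum())
```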

Close the Analysis output browser


Edit the Bayes Net modeling node
Click Model tab (if necessary)
Click TAN option button
Click Bayes adjustment for small cell counts option button
Click Run

After the model runs:

Right-click and Browse the generated Bayes Net model


Figure 11.18 TAN Network Model with Bayes Adjustment

The model is essentially identical to what we saw before. To see how the Bayes adjustment has
affected the network, we need to view a conditional probability table.

Click on the node for LOCAL


Expand the right half of the model browser to view the table


Figure 11.19 Conditional Probability Table for LOCAL

We can see that many cells have a gray background. All of these cells have a zero count and
so have been given a Bayes adjustment. This means that probabilities for these patterns (cells) in the data can now be
estimated.

We can see this by viewing model predictions with the Analysis node.

Click Annotations tab


Click Custom
Type the text Bayes adjustment – TAN
Click OK
Add this model to the stream by the Type node
Connect the Type node to the Bayes adjustment model
Add an Analysis node to the stream and connect it to the Bayes adjustment model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run


Figure 11.20 Analysis Node Output for TAN Model with Bayes Adjustment

The accuracy on the Training partition remains at 78.3% because it is not affected by the Bayes
adjustment. But there are now no predictions of $null$ for the Testing Partition, so predictions can be
made for all these cases. And this increased the accuracy on the Testing data to 70.71%.

Using a Bayes adjustment doesn't guarantee that there will be no missing predictions. In fact, if you
run the Markov Blanket model with a Bayes adjustment, there will still be 39 cases with a missing prediction. This
is because an adjustment can be made for an existing pattern by adding a small amount to a cell with
zero count, but if a pattern is completely missing from the Training data, it still won’t be possible to
make a prediction in the Testing data.

As mentioned in an earlier section, using the Bayes adjustment is fine when the amount of missing
data is a small portion of the data file, but when there is a large amount of missing data, another
solution should be employed.

At this point, we can continue to use the TAN network but can change the Maximal conditioning set
size parameter.

11.5 Modifying Bayes Network Model Settings


As with SVM models, and many other types of models (neural networks, decision trees), finding the
best model requires some experimentation with other settings. Two key options for a Bayes Net are
feature selection and the Maximal conditioning set size.

Close the Analysis output browser


Edit the Bayes Net modeling node


Click Include feature selection preprocessing step check box

Figure 11.21 Requesting Feature Selection for TAN Model

Although we only have about a dozen input fields, to change the Maximal conditioning set size for a
TAN model, we need to request feature selection (this is because of how the model is calculated
without taking parent and child nodes into account).

Click Expert tab, then click Expert option button


Figure 11.22 Feature Selection and Maximal Conditioning Set Size Options

We don’t need to specify any inputs to always be in the network, and we’ll leave the Maximum
number of inputs at 10, which means that the TAN network can only include 10 of the 12 possible
inputs.

The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to
carry out independence testing and remove unnecessary links from the network. The TAN network
can also use a conditioning set to do feature selection. The higher the value for Maximal conditioning
set size, the more time and memory are required for processing, but a higher value can be especially
useful when the data have strong dependencies among many fields.

We don’t expect strong relationships among the predictors, so we’ll reduce the value to 3, and then
run the model.

Change the Maximal conditioning set size to 3


Click Run

When the model has been generated:

Right-click the generated model and select Browse


Figure 11.23 TAN Network with Maximal Conditioning Set Size=3

The resulting TAN network is much simpler than the original and includes only 4 fields. These are
the same four fields that were included in the Markov Blanket network, and they were four of the top
five in predictor importance in the original TAN network. As before, arrows point away from
CHURNED to the inputs. No arrows point to LONGDIST from any other input.

Click on the node for LONGDIST

Figure 11.24 Conditional Probability Table for LONGDIST

The conditional probability table for LONGDIST only includes CHURNED because that is its only
parent. We see that involuntary churners are very likely to have little long distance call usage
(probability .992 in the lowest category), while the probabilities are spread more evenly for the other
two types of customers.

This network is very simple, but how will it do in predicting CHURNED? We’ll use an Analysis node
to get the answer.


Close the Bayes Net model browser


Add the generated Bayes Net model to the stream
Connect the new model to one of the Analysis nodes, replacing the connection
Run the Analysis node

Figure 11.25 Analysis Node Output for Modified TAN Network

The overall accuracy on the Training data has declined to 72.6%, a substantial drop. However, notice
that the accuracy on the Testing data is 71.24%, which is an increase of almost 1%. And, in the final
analysis, how the network does on the Testing data is the key criterion.

It would appear, although this is only an educated guess, that the more complicated TAN models
somewhat overfit the data, and that the Markov Blanket wasn’t quite complicated enough.

As is standard in data-mining modeling, we would continue developing variants of a Bayes Net
model to try to find a handful of candidates to undergo further testing. However, there are fewer
parameters to modify than, say, for an SVM model, so that process shouldn't be too burdensome.

Close the Analysis output browser

We’ll conclude the discussion of Bayes Net models by seeing how the predicted values are related to
the inputs.

In this latest model, the field LONGDIST has only CHURNED in its conditional probability table.
The model binned LONGDIST into five categories (visible in Figure 11.24), but we won't take the time
to recreate those bins with a Reclassify node. We'll just use a Histogram with an overlay to look
at the general relationship.

There are Select nodes at the bottom of the stream that will select the Training or Testing partitions.
We’ll use the one for the Training data.


Add a Select node to the stream


Connect the TAN model named Bayes adjustment – TAN to the Select node
Add a Histogram node to the stream below the Select node, and connect these two nodes
Edit the Histogram node
Select LONGDIST as the Field and CHURNED as the Color Overlay field
Click Options tab
Click Normalize by color
Click Run

Figure 11.26 Histogram of LONGDIST with CHURNED as Overlay on Training Data

As the conditional probability table suggests, essentially all the involuntary churners have low values
on LONGDIST. The proportion of customers who are current or voluntary churners is about equal
across values of LONGDIST, and the pattern in the histogram echoes this. Now we’ll look at the
predicted values of CHURNED.

Close the Histogram window


Edit the Histogram node
Change the Color Overlay field to $B-CHURNED
Click Run


Figure 11.27 Histogram of LONGDIST with Predicted CHURNED as Overlay on Training Data

The patterns are extremely similar, although there is more range in the values of LONGDIST for
predicted involuntary churners (but all values are in the first bin for that field). Although there are
other inputs in the network, these two histograms are similar because the only direct parent of
LONGDIST is CHURNED itself.

Although you can work directly from the conditional probability tables if you are adept at reading that type of
information, you will likely want to conduct this type of analysis, comparing the inputs with the
original and predicted values, to understand how a Bayes Net model makes its predictions.

You may wish to continue this analysis with the TAN model. You can try these same histograms on
the Testing partition. Or you can use another input, such as SEX (Hint: use a Distribution node).


Summary Exercises
The exercises in this lesson use the file charity.sav. The following table provides details about the
file.

charity.sav comes from a charity and contains information on individuals who were mailed a
promotion. The file contains details including whether the individuals responded to the campaign,
their spending behavior with the charity and basic demographics such as age, gender and mosaic
(demographic) group. The file contains the following fields:

response Response to campaign


orispend Pre-campaign expenditure
orivisit Pre-campaign visits
spendb Pre-campaign spend category
visitb Pre-campaign visits category
promspd Post-campaign expenditure
promvis Post-campaign visits
promspdb Post-campaign spend category
promvisb Post-campaign visit category
totvisit Total number of visits
totspend Total spend
forpcode Post Code
mos 52 Mosaic Groups
mosgroup Mosaic Bands
title Title
sex Gender
yob Year of Birth
age Age
ageband Age Category

In this set of exercises you will attempt to predict the field Response to campaign using a Bayes Net
model.

1. If you have previously saved a stream that accesses the file charity.sav, you can use that
stream. Otherwise, use a Statistics source node to read this file. Tell PASW Modeler to Read
Labels as Names.

2. Attach a Type and Table node in a stream to the source node. Run the stream and allow
PASW Modeler to automatically define the types of the fields.

3. Edit the Type node. Set all of the fields to role NONE.

4. We will attempt to predict response to campaign (Response to campaign) using the fields
listed below. Set the role of all five of these fields to Input and the Response to campaign
field to Target.

Pre-campaign spend category


Pre-campaign visits category
Gender


Age
Mosaic Bands (which should be changed to measurement level nominal)

5. Attach a Bayes Net node to the Type node. First create a TAN network with the default
settings.

6. Once the model has finished training, browse the generated Bayes Net model. What are the
most important fields? Are all fields used? Can you look at the conditional probability tables
and learn anything about the network? How does predictor importance compare to the Neural
Net results in Lesson 4 or the SVM results in Lesson 5?

7. Place the generated Bayes Net node on the Stream canvas and connect the Type node to it.
Connect the generated Net node to an Analysis node and create a matrix of actual response
against predicted response. How well does this model do in predicting response to the
campaign? How does its performance compare to other models?

8. Now create a Markov Blanket network and answer the same questions as in #6 and 7.
Additionally, compare and contrast the two models. What are the differences? Which model
does better at predicting response to campaign?

9. Use various methods to explore how the two most important predictors are related to
predictions of the model.

10. For those with extra time: Try using a dataset with more fields, such as customer_dbase.sav,
to predict an outcome with a more complex network. If you do so, you can use some of the
Expert settings in the Bayes Net node.


Lesson 12: Finding the Best Model for Categorical Targets
Objectives
• Introduce the Auto Classifier Node
• Use the Auto Classifier Node to predict customers who will churn

Data
In this lesson we will use the dataset churn.txt that we have used in several previous lessons. We will
build models to predict whether a customer is loyal or not, and continue to use a Partition Node to
divide the cases into two segments (subsamples), one to build or train the models and the other to test
the models.

12.1 Introduction
When you are creating a model, it isn’t possible to know in advance which modeling technique will
produce the most accurate result. Often several different models may be appropriate for a given data
file and target, and normally it is best to try more than one. For example, suppose you are trying to
predict a binary target (buy/not buy). Potentially, you could model the data with a Neural Net, any of
the Decision Tree algorithms, an SVM model, a Bayes Net, Logistic Regression, Nearest Neighbor,
Decision List, or Discriminant Analysis. Unfortunately this process can be quite time consuming.

The Auto Classifier node allows you to create and compare models for categorical targets using a
number of methods all at the same time, and then compare the results. You can select the modeling
algorithms that you want to use and the specific options for each. You can also specify multiple
variants for each model. For instance, rather than choose between the Multilayer Perceptron or Radial
Basis Function methods for a neural net model, you can try them both. The Auto Classifier node
generates a set of models based on the specified options and ranks the candidates based on the criteria
you specify. The supported algorithms include Neural Net, all decision trees (C5.0, C&R Tree,
QUEST, and CHAID), Logistic Regression, Decision List, Bayes Net, Discriminant, Nearest
Neighbor and SVM.

To use this node, a single target field with categorical measurement level (flag, nominal or ordinal)
and at least one predictor field are required. Predictor fields can be continuous or categorical, with the
limitation that some predictors may not be appropriate for some model types. For example, ordinal
fields used as predictors in C&R Tree, CHAID, and QUEST models must have numeric storage (not
string), and will be ignored by these models if specified otherwise. Similarly, continuous predictor
fields can be binned in some cases (as with CHAID). The requirements are the same as when using
the individual modeling nodes.

When an automated modeling node is executed, the node estimates candidate models for every
possible combination of options, ranks each candidate model based on the measure you specify, and
saves the best models in a composite automated model nugget.

We continue to use the Churn.txt file which we used in many earlier lessons. However, we will
combine the Voluntary and Involuntary Leavers into a single category in order to use the Auto
Classifier.


Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder


Double-click on FindBestModel.str
Place an Auto Classifier node from the Modeling palette to the right of the Type node
Connect the Type Node to the Auto Classifier node
Edit the Derive node named LOYAL

Figure 12.1 Creation of Flag Field Identifying Loyal Customers

In the Derive node we use the field CHURNED to create a new target with the name LOYAL. This
target will be a flag, with a value of Leave when CHURNED is not equal to Current; this means that
customers who are voluntary or involuntary leavers will have values of Leave. Current customers will
have a value of Stay.
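For reference, the Derive node's recode is equivalent to the following minimal pandas sketch; the sample CHURNED values are invented stand-ins for the real categories in churn.txt.

```python
import numpy as np
import pandas as pd

# Tiny invented stand-in for churn.txt; only the CHURNED field matters here.
churn = pd.DataFrame({"CHURNED": ["Current", "Vol", "InVol", "Current"]})

# Equivalent of the Derive node: Leave when CHURNED is not "Current", else Stay.
churn["LOYAL"] = np.where(churn["CHURNED"] == "Current", "Stay", "Leave")
print(churn)
```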

Close the Derive node


Edit the Auto Classifier node


Figure 12.2 Auto Classifier Node

The Auto Classifier will use partitioned data if available. It will also create separate models for each
value of a split field. The number of models to use and display in the Auto Classifier is 3 by default;
the top-ranking 3 models are listed according to the specified ranking criterion, and you can increase
or decrease this number. The Rank models by option allows you to
specify the criteria used to rank the models. Note that the True value defined for the target field is
assumed to represent a hit when calculating profits, lift, and other statistics (discussed below). We
have defined Leave as the True category in the Derive node because we are more interested in
locating persons who will leave as customers than those who will stay.

Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually
better to initially rank the models by the Training partition since the Testing data should only be used
after you have some acceptable models.

Predictor importance can also be calculated; this option is turned on by default, but it can significantly
increase execution time.

Click on the Rank models by dropdown to see the different ranking options


Figure 12.3 Ranking Options for Models

Overall accuracy refers to the percentage of records that is correctly predicted by the model relative
to the total number of records. Area under the curve (ROC curve) provides an index for the
performance of a model based on a Gains curve from an Evaluation chart. The further the curve is
above the baseline, the more accurate the model, hence the greater the area. Profit (Cumulative) is the
sum of profits across cumulative percentiles (sorted in terms of confidence for the prediction), based
on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top
percentile, increases steadily, and then decreases. For a good model, profits will show a well-defined
peak, which is reported along with the percentile where it occurs. For a model that provides no
information, the profit curve will be relatively straight and may be increasing, decreasing, or level,
depending on the cost/revenue structure that applies. Lift (Cumulative) refers to the ratio of hits in
cumulative quantiles relative to the overall sample (where quantiles are sorted in terms of confidence
for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as
high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles
and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no
information, the lift will hover around 1.0. Number of fields ranks models based on the number of
fields used.
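To make the lift definition concrete, here is a minimal sketch computing cumulative lift for the top quantile from arrays of hit flags and prediction confidences; all values are invented and the function name is hypothetical.

```python
import numpy as np

def cumulative_lift(y_hit, confidence, quantile=0.30):
    """Lift for the top `quantile` of records, sorted by prediction confidence."""
    order = np.argsort(-confidence)            # most confident records first
    top_n = int(len(y_hit) * quantile)
    top_rate = y_hit[order][:top_n].mean()     # hit rate in the top quantile
    overall_rate = y_hit.mean()                # hit rate in the whole sample
    return top_rate / overall_rate

# Toy data: 1 = hit (e.g., Leave), 0 = miss
y_hit = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
confidence = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.3, 0.6, 0.1, 0.5, 0.2])
print(cumulative_lift(y_hit, confidence))      # > 1 means the model concentrates hits at the top
```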

The Profit Criteria section is used to define the costs, revenue and weight values for each record, for
only flag targets. Profit equals the revenue minus the cost for each record. Profits for a quantile are
simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but
costs apply to all records. Use the Costs option to specify the cost associated with each record. You
can either specify a Fixed or Variable cost. Use the fixed costs option if the costs are the same for
each record. If the costs are variable, select the field which has the cost associated for each record.
The Revenue option is used to specify the amount of revenue associated with each record. Again, this
value can be either Fixed or Variable. The Weight option should be used if your data represent more
than one unit. This option allows you to use frequency weights to adjust the results. For fixed
weights, you will need to specify the weight value (the number of units per record). For variable
weights, use the Field Selector button to select a field as the weight field. Note that model profit will
have nothing to do with monetary profit unless you specify actual cost and revenue values.
Nevertheless, the defaults will still give you some sense of how good the model is compared to other
models. For example, if it costs you 5 dollars to send out a promotion, and you get 10 dollars in
revenue for each positive response, the model with the highest cumulative profit would be the one
with the most hits.
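And a companion sketch for cumulative profit, using the 5-dollar cost and 10-dollar revenue from the example above; the hit flags and confidences are invented, and this is only an illustration of the calculation, not Modeler's exact implementation.

```python
import numpy as np

def cumulative_profit(y_hit, confidence, cost=5.0, revenue=10.0):
    """Running profit as records are added in order of prediction confidence."""
    order = np.argsort(-confidence)
    hits = y_hit[order]
    # cost applies to every record contacted; revenue applies only to hits
    per_record = hits * revenue - cost
    return np.cumsum(per_record)

y_hit = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])
confidence = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
print(cumulative_profit(y_hit, confidence))    # profit typically peaks where the hits thin out
```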

Lift Criteria is used to specify the percentile to use for lift calculations. The default is 30.


Select Overall accuracy from the Rank models by: dropdown


Click Training partition button under Rank models using:
Click the Expert tab

Figure 12.4 Auto Classifier Expert Tab

The Expert tab allows you to select from the available model types and to specify stopping rules and
misclassification costs. By default, all models are selected except KNN and SVM. However, it is
important to note that the more models you select, the longer the processing time will be. You can
uncheck a box if you don’t want to consider a particular algorithm. The Model parameters option can
be used to change the default settings for each algorithm, or to request different versions of the same
model.

In this example, we will request both Neural Net model algorithms and accept the default values for
all the other models.

Click on the Model Parameters cell for Neural Net and select Specify


Figure 12.5 Algorithms Simple Tab for Neural Net Models

Click in the Neural network model row in the Options cell and select Both from the
dropdown list


Figure 12.6 Selecting Neural Network Model Setting

The Auto Classifier node will now try both types of neural network models.

The Expert tab within the Algorithm settings dialog allows detailed changes to specific models.

Click Expert tab


Figure 12.7 Algorithm Settings Expert Tab for Neural Net

The settings in this dialog are those that would be available in the Neural Net node.

Note that the Set random seed parameter is set to a numeric value. This means each time the Auto
Classifier node is run, the same neural net model will be found for these data and target (if all other
settings are the same). If you find a neural net model that performs well (or any other model dependent
on a random seed), then you may wish to change the random seed and rerun the Auto Classifier to
check for model stability.

Click the Simple tab, and then click OK


Click Stopping rules… button

Figure 12.8 Stopping Rules Dialog


Stopping rules can be set to restrict the overall execution time to a specific number of hours. All
models generated to that point will be included in the results, but no additional models will be
produced. In addition, you can request that execution be stopped once a model has been built that
meets all the criteria specified in the Discard tab.

Click Cancel
Click the Discard Tab

Figure 12.9 Auto Classifier Discard Tab

The Discard tab allows you to automatically discard models that do not meet certain criteria. These
models will not be listed in the summary report. You can specify a minimum threshold for overall
accuracy, lift, profit, and area under the curve, and a maximum threshold for the number of fields
used in the model. Optionally, you can use this dialog in conjunction with Stopping rules to stop
execution the first time a model is generated that meets all the specified criteria.

Click the Settings tab


Figure 12.10 Auto Classifier Settings Tab

The Settings tab of the Auto Classifier node allows you to pre-configure the score-time options that
are available on the nugget. For flag targets you can select from the following Ensemble methods:
Voting, Confidence-weighted voting, Raw propensity-weighted voting (flag targets only), Highest
confidence wins, and Average raw propensity (flag targets only). For voting methods, you can specify
how ties are resolved. You can choose one of the tied values randomly, choose the tied value that was
predicted with the highest confidence, or with the largest absolute raw propensity.
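A minimal sketch of the difference between simple voting and confidence-weighted voting for a single record scored by three models. The predictions and confidences are invented, and this only illustrates the general idea, not Modeler's exact tie-breaking logic.

```python
from collections import defaultdict

# Invented predictions and confidences from three models for one record
predictions = [("Leave", 0.51), ("Stay", 0.99), ("Leave", 0.45)]

# Simple voting: each model contributes one vote.
votes = defaultdict(int)
for label, _ in predictions:
    votes[label] += 1
print("simple voting:", max(votes, key=votes.get))               # Leave wins 2 votes to 1

# Confidence-weighted voting: each vote is weighted by the model's confidence.
weighted = defaultdict(float)
for label, conf in predictions:
    weighted[label] += conf
print("confidence-weighted:", max(weighted, key=weighted.get))   # Stay wins, 0.99 to 0.96
```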

Misclassification Costs
In some contexts, certain kinds of errors are more costly than others. For example, it may be more
costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-
risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the
relative importance of different kinds of prediction errors.

Misclassification costs are basically weights applied to specific outcomes. These weights are factored
into the model and may actually change the prediction (as a way of avoiding more costly mistakes).
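A minimal sketch of how a cost matrix can shift a prediction toward the cheaper kind of error, assuming the model produces class probabilities. All numbers are invented, and the individual Modeler algorithms incorporate costs in their own ways.

```python
import numpy as np

classes = ["low risk", "high risk"]
probs = np.array([0.60, 0.40])          # model favours "low risk" on raw probability

# cost[i, j] = cost of predicting class j when the true class is i (invented values)
cost = np.array([[0.0, 1.0],            # true low risk, predicted high risk
                 [5.0, 0.0]])           # true high risk, predicted low risk (the costly error)

expected_cost = probs @ cost            # expected cost of each possible prediction
print(dict(zip(classes, expected_cost)))
print("prediction:", classes[int(np.argmin(expected_cost))])     # -> "high risk"
```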

Misclassification costs are not taken into account when ranking or comparing models using the Auto
Classifier node. A model that includes costs may produce more errors than one that doesn't and may
not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms
because it has a built-in bias in favor of less expensive errors.

Click the Expert tab


Click the Misclassification costs button

The cost matrix shows the cost for each possible combination of predicted category and actual
category. By default, misclassification costs are set to 0.0 for the cells with correct predictions, and
1.0 for cells that represent errors of prediction (misclassification). To enter custom cost values, select
Use misclassification costs checkbox and enter custom values into the cost matrix.

Figure 12.11 Misclassification Costs Dialog

We don’t need to specify costs for this example.

Click OK
Click Run

Since the Auto Classifier can take some time to compute all the models, especially if variants of
models are requested, a feedback dialog is presented while the node is running.


Figure 12.12 Execution Feedback Dialog

Once the Auto Classifier is finished:

Edit the LOYAL model nugget in the stream

Figure 12.13 Auto Classifier Results for Testing Set

Although we requested that the models be ranked on the Training partition, the default View is of the
Testing set (Partition). So let’s switch to the Training data.

Click the View: dropdown and select Training set


Figure 12.14 Auto Classifier Results for Training Set

Here we see the top ranking three models contained in the nugget, including C5.0, CHAID, and
Decision List (see the Appendix for information on that type of model). The number to the right of
the model type indicates whether this is the first, second, etc. model of that type created by the Auto
Classifier. The best model is the C5.0 model, which is over 90% accurate overall at predicting
LOYAL. The order of the models on accuracy is the same on the Training data as the Testing data.

Ranking models by Lift would place Decision List first, followed by C5.0. As with any modeling
exercise, you need to choose the model criterion that is most appropriate for your specific data-
mining problem. But unlike when using one algorithm at a time, the Auto Classifier makes it easy to
compare models on several factors, as all the various ranking measures are displayed in separate
columns.

You can use the Sort by: option or click on a column header to change the column used to sort the
table. In addition, you can use the Show/hide columns menu tool to show or hide specific
columns.

Delete Unused Models will permanently delete all models that are unchecked in the Use? column.

The bar chart thumbnails show how the model predictions are related to the actual value of the target
field LOYAL. The full-sized plot includes up to 1000 records and will be based on a sample if the
dataset contains more records. Let’s see how the C5.0 model does.

Double-click on the Graph thumbnail for the C5.0 model


Figure 12.15 Distribution Graph of LOYAL by Predicted LOYAL for C5.0 Model

The predicted value ($C-LOYAL) is overlaid on the actual value of the target field. Ideally, if the
model was 100% accurate, each bar would be of only one color because the overlap would be perfect.

We can see from the graph that the model does fairly well at predicting customers who will stay and
extremely well for those who will leave, as most of the bar for Leave is blue (expand the window
vertically to see this) and most of the bar for Stay is red, meaning there is great overlap between the
actual and predicted values.

Close the Graphboard window

It is also possible to see how well the ensemble of three models does in predicting LOYAL.

Click the Graph tab


Figure 12.16 Accuracy and Predictor Importance of Ensemble of Models

As with the C5.0 model, the predicted value is overlaid on the actual value of LOYAL. The graph
includes both the training and testing partitions. Note that the accuracy for Leave is somewhat less
than for the C5.0 model alone; the models are combined by confidence-weighted voting, the default.

Also displayed is the predictor importance for the combined models. No one predictor is especially
important.

In order to further compare the performance of these three models, we can generate an evaluation
chart directly from the Auto Classifier nugget.

Click Generate…Evaluation Chart(s)


Figure 12.17 Evaluation Chart Selection Dialog

Since we have used accuracy to rank and select the models, we’ll use Lift to further evaluate the
models on another criterion.

Click Lift
Click OK

Figure 12.18 Lift Charts for Models

The best possible model is represented by the dark green line labeled $BEST. Initially the three
models have about the same lift value, but eventually the C5.0 model surpasses the other two models.
This is more evidence that the C5.0 model is the best performer.


Close the Evaluation chart window

You can also view each model in its standard Model Viewer. As illustration:

Click on the Model tab


Double-click on the C5.0 Model cell in the Model column
Click All button on toolbar

Figure 12.19 C5.0 Model

The Model Viewer has the same detail and options as for a C5.0 model created from that modeling
node. In this way you can explore specific characteristics of a model.

Click OK
Close the Auto Classifier Model


We would normally continue to explore models here to see which ones are most satisfactory. One of
the tricky things about using the Auto Classifier (and the Auto Numeric node in the next lesson) is
that you may have many models from which to choose. When you are comparing only two or three
models, it is easy to simply look at the results on the Training partition, and then on the Testing
partition, to decide which model to select. But with many models, what is the appropriate
methodology to follow for model selection?

Ideally, you select the candidate models before looking at the Testing data, although some analysts
would argue that the only thing that matters is performance on the Testing data. However, our advice
is to pick a set of possible models to assess, not all models generated, but more than just 1 or 2.

This will require looking at the evaluation chart, looking at other ways of ranking the models, looking
at how the models make their predictions (what are the important fields; what are the decision tree
rules, etc.), seeing which categories the models predict more accurately, and perhaps picking
minimum levels of lift, accuracy, or other measures.

For this example, let’s next use an Analysis node to evaluate the performance of the ensembled three
best models.

Place an Analysis node to the right of the Auto Classifier nugget named LOYAL
Connect the LOYAL model node to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run


Figure 12.20 Analysis Node Output for Ensemble of Models

We see that the ensembled model is reasonably accurate on both data partitions, with accuracy in the
training data of 87.40%, and in the testing data of 82.19%. From the Coincidence Matrix, the model
correctly identifies about 93% (316 of 338) of Leavers in the Training partition and 91% (280 of 307)
of Leavers in the Testing partition. For the current customers who will stay the ensembled model
didn’t yield the same degree of accuracy.

Close the Analysis window


Double-click the Auto Classifier nugget named LOYAL

Since the most accurate model on the Testing data was C5.0, let’s examine its accuracy further with
an Analysis node. We can select the C5.0 model in the model column and then create a generated
model.

Double-click the C5.0 model nugget in the Model column


In the Model Viewer, click Generate…Model to Palette
Click OK, and then OK again


Figure 12.21 C5.0 Model Added to the Models Palette

Move the generated C5.0 to the Stream Canvas


Connect the C5.0 model to the Type node
Place an Analysis node to the right of the C5.0 model
Connect the C5.0 model to the Analysis node

Figure 12.22 Revised Stream with the Addition of the C5.0 Model and an Analysis Node

Edit the Analysis node


Click Coincidence matrices (for symbolic targets) (not shown)
Click Run


Figure 12.23 Analysis Node Output for C5.0 Model

We observe that the C5.0 model is very accurate on both data partitions, though accuracy fell a bit, as
expected, on the Testing partition. It is more accurate than the ensemble of three models. The C5.0
model correctly identifies almost all of the Leavers in the Testing partition (302 out of 307, or about
98.4%!). And while it didn’t predict current customers who will stay with the same degree of
accuracy, it still did very well with this group.

You could investigate the other candidate models in a similar way to see which ones do better on
which category of customers. When all this work is done, you will have a winning model, either one
of the models, or a combination of the models.


Summary Exercises
The exercises in this lesson use the data file charity.sav. The following table provides information on
this file.

charity.sav comes from a charity and contains information on individuals who were mailed a
promotion. The file contains details including whether the individuals responded to the campaign,
their spending behavior with the charity and basic demographics such as age, gender and mosaic
(demographic) group. The file contains the following fields:

response Response to campaign


orispend Pre-campaign expenditure
orivisit Pre-campaign visits
spendb Pre-campaign spend category
visitb Pre-campaign visits category
promspd Post-campaign expenditure
promvis Post-campaign visits
promspdb Post-campaign spend category
promvisb Post-campaign visit category
totvisit Total number of visits
totspend Total spend
forpcode Post Code
mos 52 Mosaic Groups
mosgroup Mosaic Bands
title Title
sex Gender
yob Year of Birth
age Age
ageband Age Category

1. Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and
connect it to charity.sav.

2. Try to predict Response to campaign using all the available model choices in the Auto
Classifier. Use the defaults first. Which model is best, and which is worst? You can choose
the criterion for ranking models, or use more than one. Which models use fewer inputs?

3. Now change some of the model settings on one or more models and rerun the Auto Classifier.
Request more than 3 models. Does the order of models change?

4. Pick two or more models and generate a model for each. Add them to the stream and use an
Analysis node or other nodes to further compare their predictions. Which model would you
use, and why?

5. Then use an Analysis node with the Auto Classifier model to compare the predictions of the
ensemble of models. Does the ensemble do better than any individual model?


Lesson 13: Finding the Best Model for Continuous Targets
Objectives
• Introduce the Auto Numeric Node
• Use the Auto Numeric Node to predict birth weight of babies

Data
In this lesson we use the dataset birthweight.sav. This file contains information on the births of about
380 babies and characteristics of their mothers, such as age and various health measures (smoking,
history of hypertension, etc.). Researchers are interested in accurately predicting birth weight months
in advance so that interventions can be done for potential low birth weight babies to increase their
chances of survival. This dataset is relatively small, which is typical of many medical studies, but as
good practice we will still use a Partition node with the data.

13.1 Introduction
In the previous lesson we learned how to automate the production of models to predict categorical
targets with the Auto Classifier node. In this lesson we discuss the Auto Numeric node, which in an
analogous manner can automate the production of models for targets that are numeric with a
continuous level of measurement.

The Auto Numeric node allows you to create and compare models for continuous targets using a
number of methods all at the same time, and then compare the results. You can select the modeling
algorithms that you want to use and the specific options for each. You can also specify multiple
variants for each model. The supported algorithms include Neural Net, C&R Tree, CHAID,
Regression, Linear, Generalized Linear Models, KNN and SVM.

To use this node, a single target field of measurement level continuous and at least one predictor field
are required. Predictor fields can be categorical or continuous, with the limitation that some predictors
may not be appropriate for some model types. For example, C&R Tree models can use categorical
string fields as predictors, while linear regression models cannot use these fields and will ignore them
if specified. The requirements are the same as when using the individual modeling nodes.

The format of this lesson will match that of the previous lesson on the Auto Classifier. We begin by
opening an existing stream file and reviewing the data.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder


Double-click on NumericPredictor.str
Run the Table node


Figure 13.1 Birthweight Data File

In the Statistics source node, we have checked the option to Read labels as names so the variable
labels become the field names in PASW Modeler. The field we want to predict is in the last column
(Birth Weight in Grams) which measures actual birth weight. There is also a separate field (Low Birth
Weight) which indicates whether the birth weight was below a certain threshold. We won’t use that
field in this example. All other fields can be used as predictors except for id, of course.

The Partition node splits the data into equal parts for training and testing.

We need to set the role for the fields in the model.

Close the Table window


Edit the Birthweight.sav source node
Click on the Types tab

We need to fully instantiate the data so that PASW Modeler has values for all fields. We also need to
change the role of id and Low Birth Weight to None, and then the role of Birth Weight in Grams to
Target.

Click Read Values button


Click OK


Figure 13.2 Types Tab for Birthweight Data

Change the role of id and Low Birth Weight to None


Change the role of Birth Weight in Grams to Target (not shown)
Click OK

Before attempting to model birth weight, let’s look at its distribution with a Histogram.

Add a Histogram node to the stream


Attach the Source node to the Histogram node
Edit the Histogram node
Specify the Field as Birth Weight in Grams (not shown)
Click Run


Figure 13.3 Histogram of Birth Weight

The distribution of birth weight is approximately normal, peaking around 3,000 grams, or about 6.6
pounds. Many physical and biological quantities have a normal distribution, which makes modeling
less challenging. When a continuous field is distributed normally, just about any technique can be
used to predict it. Also, there aren’t too many outliers on either the low or high end. This is because
babies born alive can only be so small, or large. This also makes creating models less problematic.

We can now add an Auto Numeric node to the stream.

Add an Auto Numeric node to the right of the Partition node


Connect the Partition node to the Auto Numeric node
Edit the Auto Numeric node
Click the Model tab if necessary


Figure 13.4 Auto Numeric Node

The Auto Numeric node will use partitioned data and build a model for each split, if available.
Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually
better to initially rank the models by the Training partition since the Testing data should only be used
after you have some acceptable models.

As with the Auto Classifier, the number of models to use is 3 by default. Predictor importance is
turned on by default but it may lengthen the execution time.

Specify the Number of models to use: as 8


Click on the Rank models by menu to see the different ranking options


Figure 13.5 Ranking Options for Models

The Rank models by option allows you to specify the criteria used to rank the models. Because we are
predicting a continuous target, the choices to rank models are suited for this type of target. They
include:

• Absolute value of correlation between observed and predicted values


• Number of predictors used
• Relative error, which is defined as the ratio of the error variance for the model to the variance
of the target field

If the relationship between predicted and observed values is not linear, the correlation is not a good
measure of fit or ranking. We’ll view scatterplots to make that determination.

The options to discard models are listed here in the Model tab dialog. You can specify criteria that
correspond to the ranking options to discard candidate models. The more models you generate, the
more likely you are to use these options, but we don’t need to do so for this example.

When using a continuous target, profit can’t be defined based on a predicted category, nor can
misclassification costs be defined (a model can certainly directly predict revenue or profit, but that
isn’t the same as defining profit based on a categorical target).
Click Training partition button under Rank models using:


Click the Expert tab

Figure 13.6 Auto Numeric Expert Tab

The Expert tab allows you to select from the available model types and to specify stopping rules. By
default, six model types are checked and will be used (all except KNN and SVM). The Model parameters
option can be used to change the default settings for each algorithm or to request different versions of
the same model.

In this example, we will request both Neural Net models and also a stepwise Regression model.

Click on the Model Parameters cell for Neural Net and select Specify
Click in the Neural network model row in the Options cell and select Both from the
dropdown list


Figure 13.7 Changing Neural Network Model Setting

The Expert tab within the Algorithm settings dialog allows detailed changes to specific models.

As with the Auto Classifier example, the random seed is set to a fixed value for the Neural Net
models, and so each time it is run the same model will be found for these data and target (if all
other settings are identical).

Now we’ll request a stepwise regression model as well.

Click OK
Click on the Model Parameters cell for Regression and select Specify
Click in the Method row in the Options cell and select Specify


Figure 13.8 Regression Parameter Editor

The Enter method is used by default, but we’ll add Stepwise.

Click Stepwise check box


Click OK
Click OK

Figure 13.9 Auto Numeric Dialog Completed


We have requested that a total of 8 models be constructed.

We won’t examine the Stopping rules dialog, which is identical to that for the Auto Classifier in
terms of options and operation.

Click the Settings tab

Figure 13.10 Auto Numeric Settings Tab

The Settings tab of the Auto Numeric node allows you to pre-configure the score-time options that
are available on the nugget. For a continuous target, the ensemble scores will be generated by
averaging the predicted value of each model used.

Click Run

Since the Auto Numeric node can take some time to compute all the models, especially if variants of
models are requested, a feedback dialog is presented while the node is running.


Figure 13.11 Auto Numeric Execution Feedback

Once the Auto Numeric model is finished:

Edit the Birth Weight in Grams model nugget in the stream

Although we requested that the models be ranked on the Training partition, the default View is of the
Testing set (Partition). So let’s switch to the Training data.

Click the View: dropdown and select Training set


Figure 13.12 Auto Numeric Results for Training Set

There are wide differences in model performance. The best model, the first Neural Net with a
Multilayer Perceptron, has a correlation between the predicted and actual values of 0.543; the worst
model, the second Regression (Stepwise method), has a correlation of only 0.255. The Regression 2
model, which used stepwise selection, includes only one predictor (Presence of Uterine Irritability);
you can find this information by double-clicking on the model icon.

The relative error is the ratio of the variance of the observed values from those predicted by the model
to the variance of the observed values from the mean. In practical terms, it compares how well the
model performs relative to a null or intercept model that simply returns the mean value of the target
field as the prediction. For a good model, this value should be less than 1, indicating that the model is
more accurate than the null model. The same differences are evident in the relative error, where
smaller numbers closer to zero are better. The Neural Net is again the best model, followed by a
Generalized Linear model. This is an instance of automatic modeling where the best model clearly
stands out from the others.
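
To make the relative error concrete, here is a minimal sketch of the calculation outside of PASW
Modeler, written in Python with numpy (neither is used in this course); the observed and predicted
values below are invented for illustration.

    import numpy as np

    observed = np.array([2500.0, 3100.0, 2800.0, 3400.0, 2950.0])   # hypothetical birth weights
    predicted = np.array([2700.0, 3000.0, 2900.0, 3200.0, 3000.0])  # hypothetical predictions

    # Variance of the observed values around the model's predictions...
    model_error = np.mean((observed - predicted) ** 2)
    # ...divided by the variance of the observed values around their mean (the null model).
    null_error = np.mean((observed - observed.mean()) ** 2)

    relative_error = model_error / null_error
    print(relative_error)   # values below 1 mean the model beats the mean-only prediction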

Scatterplot thumbnails are provided for each model to show how the model predictions are related to
the actual value of the target field. As noted above, if the relationship between the actual and
predicted fields isn’t linear, then the correlation is not an appropriate measure to use for ranking
models.

Let’s see how the Neural Net model does.


Double-click on the graph thumbnail for the Neural Net 1 model

Figure 13.13 Distribution Graph of Birth Weight by Predicted Birth Weight for Neural Net

If the model predictions were perfect, all the points would fall on a straight line running from the
lower left to upper right. Although the neural net model is far from perfect, that is the general
tendency of the points.

To see what a poor model’s predictions look like in a scatterplot, let’s open the graph for the C&R
Tree model.

Close the Graphboard window


Double-click on the graph thumbnail for the C&R Tree model


Figure 13.14 Distribution Graph of Birth Weight by Predicted Birth Weight for C&R Tree

The difference between the plots is immediately evident. The decision tree model is clearly a poor
performer. In fact, it seems to predict only two values for birth weight because the number of distinct
predictions depends on the number of terminal nodes in the tree.

Close the Graphboard window

Because we are predicting a continuous field, evaluation charts are not available to further assess the
models.

We can look at how well the ensemble of models does in predicting birth weight.

Click the Graph tab


Move the slider in the Predictor Importance pane to the fourth position from the left


Figure 13.15 Scatterplot and Predictor Importance of Ensemble of Models

Predictor importance is calculated from the test partition; while no one predictor is very important,
the more important predictors include whether the mother smokes, the number of physician visits, and a
history of hypertension.

The scatterplot is similar to those we viewed above, although the position of the variables on the axes
is flipped so that the predicted value of birth weight is on the X axis and the observed value is on the
Y axis. We expect to see a reasonably strong linear correlation between these two values, which is
apparent here.

The next step is to look at the performance of the models on the Testing partition.

Click Model tab


Click View dropdown and select Testing set


Figure 13.16 Auto Numeric Results for Testing Set

The results are somewhat different, perhaps because of the small sample sizes used in the two data
partitions. The best model is now the generalized linear model, with the neural net second. CHAID
has fallen to number 5, while the first regression model is now third. There is also less difference
between the first two models, and in fact, the neural net model has lower relative error than the
generalized linear model, so it may still be the best performer.

Despite the changes in model performance, we know that we should focus on the results on the
Testing partition when selecting final models. You may want to engage in a class discussion about
how to use the Training and Testing data to select models.

For this example, we will select the top three models: Generalized Linear 1, Neural Net 1, and
Regression 1. We will check whether the ensembled model makes more accurate predictions than the
Neural Net 1 model alone.

Double-click the Neural Net 1 icon in the Model column


Click Generate…Model to Palette
Click OK
Deselect the Neural Net 2, CHAID 1, Linear 1, C&R Tree 1 and Regression 2 models in
the Use? column
Click OK

We can use an Analysis node to evaluate the models further, as we did with the Auto Classifier.

Drag the Neural Net 1 model nugget to the right of the Birth Weight in Grams nugget
Connect the two nuggets


Figure 13.17 Revised Stream with the Addition of the Neural Net Model

Place an Analysis node to the right of the Neural Net 1 model


Connect the Neural Net 1 model to the Analysis node
Right-click the Analysis node and select Run


Figure 13.18 Analysis Node Output

The Analysis node provides various summary measures for the ensembled model (Generalized
Linear, Neural Net and Regression) and the Neural Net model on its own. These include the model
minimum and maximum error, the mean error, the mean absolute error (the better measure of the
two), standard deviation and the correlation. By mean absolute error, the ensembled model performed
better.
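
As an aside, the reason the mean absolute error is usually the more informative of the two is that
positive and negative errors cancel out in the plain mean error. A tiny Python illustration (the
numbers are invented and are not from this stream):

    errors = [-200.0, 250.0, -60.0, 10.0]     # hypothetical (observed - predicted) differences

    mean_error = sum(errors) / len(errors)                        # 0.0: signed errors cancel out
    mean_abs_error = sum(abs(e) for e in errors) / len(errors)    # 130.0: typical size of an error
    print(mean_error, mean_abs_error)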

You could investigate the other ensembled models in a similar way to see which combination does
best at predicting birth weight.

You can also explore, using standard methods, how the ensembled model makes predictions, i.e., how
the input fields relate to the predicted value of birth weight.


Summary Exercises
The exercises in this lesson are written for the data file charity.sav.

charity.sav is from a charity and contains information on individuals who were mailed a promotion.
The file contains details including whether the individuals responded to the campaign, their spending
behavior with the charity and basic demographics such as age, gender and mosaic (demographic)
group. The file contains the following fields:

response Response to campaign


orispend Pre-campaign expenditure
orivisit Pre-campaign visits
spendb Pre-campaign spend category
visitb Pre-campaign visits category
promspd Post-campaign expenditure
promvis Post-campaign visits
promspdb Post-campaign spend category
promvisb Post-campaign visit category
totvisit Total number of visits
totspend Total spend
forpcode Post Code
mos 52 Mosaic Groups
mosgroup Mosaic Bands
title Title
sex Gender
yob Year of Birth
age Age
ageband Age Category

1. Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and
connect it to charity.sav.

2. Try to predict Post-campaign expenditure using all the available model choices in the Auto
Numeric node. Use the defaults first. Which model is best, and which is worst? You can
choose the criterion for ranking models, or use more than one. Which models use fewer
inputs?

3. Now change some of the model settings on one or more models and rerun the Auto Numeric
node. Does the order of models change?

4. Pick two or more models and generate a model for each. Add them to the stream and use an
Analysis node and other nodes to further compare their predictions. Which model would you
use, and why? How do they compare to the ensemble of models?


Lesson 14: Getting the Most from Models


Objectives
• Discuss common approaches to improving the performance of a model in data mining
projects
• Use an Ensemble node to combine model predictions
• Use propensity scores to score records
• Use meta-modeling to improve model performance
• Model errors in prediction

Data
In this lesson we will use the dataset churntrain.txt that is a variant of the churn.txt file that we have
used in several previous lessons. The data have been split into a separate training file, and we will
build or use models constructed with it (there is also a churnvalidate.txt file that can be used for
model testing).

14.1 Introduction
Throughout this course we have looked at several different modeling techniques, including neural
networks, decision trees and rule induction, regression and logistic regression, Bayes Nets, SVM
models, and discriminant analysis. After building a model we have usually performed some type of
diagnostic analysis that helps with the interpretation of the model, and we have also done additional
analyses to help determine where a model is more and less accurate.

In this lesson we develop and extend the model building skills learned so far. The key concept in
these examples is that models built with an algorithm in PASW Modeler should usually (unless
accuracy is very high and satisfactory) be viewed not as the endpoint of an analysis, but as a way
station on the path to a robust solution. There are various methods to improve models, only some of
which we discuss here, and you are likely to come up with your own as you become experienced
using PASW Modeler and read references on data mining.

We provide methods in this lesson for how to improve a model, but there is no one simple answer as
to how this should be done. That is because the appropriate method is highly dependent upon
characteristics of the existing model that has been built. Potential things to consider when improving
the performance of a model are:

• The modeling technique used


• The measurement level of the target field (categorical or continuous)
• Which parts of the model are under-performing, i.e., are less accurate
• The distribution of confidence values for the existing model.

We begin the lesson with the Ensemble node, which is an automated method of combining the
predictions from two or more models. We then discuss propensity scores and show how they can be
used to score a model. Following from this, we consider other methods of combining models,
including modeling the error from a model.


14.2 Combining Models with the Ensemble Node


Many authors of data-mining books and articles recommend developing more than one predictive
model for a given project. This is usually good advice because there are so many model types
available, and a priori, it isn’t normally possible to forecast which model will do better. Moreover,
PASW Modeler makes it easy to try several models on the same data, including two nodes that
automate the building of many models simultaneously.

If you do develop several models, though, you then have the question of how to use them to make
predictions. The simplest approach is to use the best model, but what is the “best” model? Is it the
most accurate overall, or the one that is most accurate at predicting the most critical category? And if
the models are predicting a continuous target, there are several possible definitions of best model.

Since we have two or more models, another approach is to combine their predictions in some suitable
manner, on the theory that two heads (models) are better than one. And in prediction, that is often
true. There are a variety of methods to combine models. You could:

• Let the models vote, with the category predicted most frequently the “winner”
• Pick the model prediction with the highest confidence
• Let the models vote, but weight the voting by model confidence
• Average the model predictions if predicting a continuous field

And there are several other possibilities, including using the propensity scores now available for most
models in PASW Modeler 14.0 (but only for flag fields). All the methods in the bullet list are
available in the Ensemble node, which is designed to make combining models a simple process.

Each Ensemble node generates a field containing the combined prediction, with the name based on
the target field and prefixed with $XF_, $XS_, or $XR_, depending on the output field type (flag, set,
or range, respectively). We’ll use a preexisting stream file with four generated models to demonstrate
the Ensemble node with the churn data. The Ensemble node is located in the Field Ops palette
because it creates new fields.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder


Double-click on Ensemble.str
Run the Table node


Figure 14.1 Churntrain Data File

The training data has just over 1,100 records. We will predict the field CHURNED, which has a
nominal measurement level with three categories (you can run the Distribution node to review its
distribution). The Type node already has all the appropriate settings.

Close the Table window

Looking at the stream (in Figure 14.2), we used four modeling nodes—CHAID, Neural Net, Bayes
Net, and SVM—to create four models that have already been placed in the stream to save time. The
models were created with all available predictors.


Figure 14.2 Ensemble Stream

Before using an Ensemble node, let’s see how well these four models predict CHURNED.

Add an Analysis node to the stream near the last generated model and attach this model to
the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run

There is a lot of output, so we show this in two figures. Figure 14.3 shows the results for all four
models.


Figure 14.3 Analysis Node Results for Four Models

The most accurate model overall is the SVM, at 82.85%. The least accurate model is the Bayes Net at
79.15%. Interestingly, although there are only 100 customers in the InVol category, which should
make this group more difficult to predict, 3 of the 4 models do very well with this group, and the
CHAID model is 100% accurate, although it was not the best model overall. This illustrates the
potential advantage of combining models.


Figure 14.4 Analysis Node Results When the Models Agree

The last two tables in the Analysis browser window show the accuracy when the model predictions
are combined in a simplistic fashion. All four models make the same prediction for 69.13% of the
cases. For this segment of the file, those predictions have an accuracy of 91.91%. Clearly, combining
models can improve performance.

The Ensemble Node


However, the models don’t make the same prediction for 30.87% of the customers, a sizeable fraction
of the file. What to do with these records? What prediction should be made for them? This is where
the Ensemble node can provide much assistance.

Close the Analysis output browser window


Place an Ensemble node from the Field Ops palette near the last generated model
Attach the last generated model to the Ensemble node
Edit the Ensemble node


Figure 14.5 Ensemble Node Settings Tab

The Ensemble node will use the settings in the last Type node or Types tab from a Source node, so it
recognizes that CHURNED is the target field. Ensemble nodes can use flag, nominal, or continuous
target fields from the upstream model nodes.

Because the results from many models are being placed in one stream, and each one of those models
generates at least two fields, by default the Ensemble node filters out all those generated fields (Filter
out fields generated by ensembled models check box). If you want to continue to compare individual
model predictions downstream, or use their predictions (see note below), then you will want to
deselect this option.

The key setting is the Ensemble method, which determines how the model predictions will be
combined.

Click Ensemble method dropdown


Figure 14.6 Ensemble Method Options

The choices available for Ensemble method will vary based on the measurement level of the target
field. For a categorical target, there are three choices:

1) Voting: The node counts the number of times each value is predicted, and selects the value
with the highest total.
2) Confidence-weighted voting: The node does not simply count the fact that a prediction was made,
but instead uses the confidence of that prediction. So if a model predicts value A with a
confidence of .80, then the node counts .80 as the “vote.” These weights are summed, and the
value with the highest total is selected.
3) Highest confidence wins: In this method, the best model, as measured by its confidence, is
used for each prediction.
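
To make the three methods just listed concrete, the following short Python sketch combines a set of
hypothetical predictions and confidences; it only illustrates the logic and is not how the Ensemble
node is implemented.

    from collections import Counter, defaultdict

    # Hypothetical (category, confidence) predictions from four models for one record
    preds = [("Current", 0.60), ("Vol", 0.80), ("Vol", 0.55), ("Current", 0.90)]

    # 1) Voting: the most frequently predicted category wins
    votes = Counter(category for category, confidence in preds)
    print(votes)                                   # Current: 2, Vol: 2 (a tie in this example)

    # 2) Confidence-weighted voting: sum the confidences for each category
    weighted = defaultdict(float)
    for category, confidence in preds:
        weighted[category] += confidence
    print(max(weighted, key=weighted.get))         # Current (1.50 versus 1.35 for Vol)

    # 3) Highest confidence wins: use the single most confident model
    print(max(preds, key=lambda p: p[1])[0])       # Current (confidence 0.90)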

If the target field is a flag, there are four other options available, all based on the propensity score
(discussed in more detail in the next section). The Ensemble prediction can be based on propensity-
weighted voting, or on average propensity. This can be done for either raw propensity or adjusted
propensity (which is based on a validation or testing data partition, and so is only available in those
situations).

If the target field is numeric (continuous), the only available method is to average the model
predictions.

We’ll use the default method of confidence-weighted voting.

If you use one of the voting methods, and there is a tie, the Ensemble node can break the tie in two
ways. A random selection can be made, or the model with the highest confidence can be selected.
This latter choice seems like a better one, so we’ll use that.


Click Highest confidence option button


Click OK

There is no Run option for an Ensemble node because the node creates new fields but is not a Model
or Output node.

We can first view the results of combining the four models with a Table node.

Add a Table node to the stream near the Ensemble node


Attach the Ensemble node to the Table node
Run the Table node

Figure 14.7 Table with Output Fields from Ensemble Node

The Ensemble node created two new fields, the prediction ($XS-CHURNED) and its confidence
($XSC-CHURNED). In this instance, the confidence is the sum of the confidences of the models that
voted for the winning prediction, divided by the total number of models.
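
For example, with made-up numbers: if three of four models predict Vol with confidences of 0.9, 0.8
and 0.7, and Vol wins the confidence-weighted vote, the reported confidence would be
(0.9 + 0.8 + 0.7) / 4 = 0.6.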

We can use another Analysis node to check the performance of the combined prediction.


Close the Table window


Add an Analysis node to the stream near the Ensemble node
Connect the Ensemble node to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets)
Click Run

Figure 14.8 Analysis Node Output for Ensemble Node Prediction

Overall “model” accuracy is 84.12%, which is better than any individual model by about 1.3%. When
accuracy is important, this improvement would very likely be crucial. The model continues to predict
the InVol category almost perfectly, and does very well for the other two categories (we could use a
Matrix node to get exact percentages correct in each category).

This stream, with the Ensemble node, can now be used to score new data.

Looking back at Figure 14.4, we recall that when all four models agreed on the prediction, the
prediction was accurate for 91.91% of the customers. That is much better than 84.12%. So another
approach when using the Ensemble node is to follow this methodology:

1) If all models agree on a prediction, use that prediction


2) When they don’t agree, use the prediction from the Ensemble node

This method can’t be used for continuous fields.


Is there any downside to combining models? The usual objection is that you cannot now explain why
a specific prediction was made. If someone asks “What characteristics of this particular customer
caused him/her to be predicted to be a voluntary churner?” that information will not be available,
since the models are combined. Still, if you are using a single Neural Net or SVM model, the same
holds true, so this is not a fatal objection. You can, as with any model, examine the predictions from
the Ensemble node and see how they relate to the input fields, and that will be helpful. But full model
understanding is sacrificed here for accuracy.

Close the Analysis output browser.

In our next example, we will learn how to use a model to score records to rank them by the propensity
of a model prediction.

14.3 Using Propensity Scores


Confidence values obtained from a model in PASW Modeler reflect the level of confidence that the
model has in the prediction of a given output, and they are only available for categorical targets.
Confidence values make no distinction between categories of a target field; thus, for a flag with
values of “yes” and “no,” confidence values can vary from 0 to 1 for predictions in each category.
Consequently, a high degree of confidence does not help us determine whether that customer will
continue or cancel their service (it instead indicates the confidence that the model has in its
prediction, whatever that is).

Sometimes it would be helpful to have a score so that, for a specific category of interest—such as
customers who churned—a high score means a prediction of churn, and a low score indicates the
customer is current. This type of score can be used in choosing cases for future actions—intervention,
marketing efforts, and so forth.

To create a score as just described, most PASW Modeler models calculate a propensity score for a
flag field (propensity scores are not available for nominal, ordinal or continuous fields). A propensity
score is actually based on the probability of a prediction. The raw propensity score is based only on
the training data (if using a Partition node), or the whole file, otherwise. When the model predicts the
true value defined for the target field, the propensity is the same as P, where P is the probability of the
prediction. If the model predicts the false value, then the propensity is calculated as (1 – P).
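
Stated as a small Python sketch (purely illustrative, with hypothetical prediction values; this is
not Modeler code):

    def raw_propensity(prediction, probability, true_value="Leave"):
        # Propensity toward the flag's True value: equal to the probability of the
        # prediction when the True value is predicted, and 1 - probability otherwise.
        return probability if prediction == true_value else 1.0 - probability

    print(raw_propensity("Leave", 0.75))     # 0.75
    print(raw_propensity("Current", 0.75))   # 0.25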

The adjusted propensity score is only available if a Partition field is being used, and it is calculated
based on model performance on the Testing data. Adjusted propensities attempt to compensate for
model overfitting on the Training data. We aren’t using a Partition node in this example so can only
use raw propensity scores.

Now for many models in PASW Modeler, such as all decision trees, the confidence is equal to the
probability of the model prediction (why would that be?). For other models, such as a neural net, the
confidence and the probability are not equivalent, although they are usually close in value.

This simple transformation of the probability allows you to easily score a data file with the propensity
(probability) of an outcome to occur.

For this example, we continue using the dataset churntrain.txt. A Derive node has been added to the
beginning of the stream to create a modified version of the CHURNED field. It converts CHURNED
into the field LOYAL which measures whether or not a customer continued with the company. LOYAL
groups together both voluntary and involuntary leavers into one group, so comparisons can be made
with customers who remain loyal.

We begin by opening the corresponding stream.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory (if necessary)
Double-click on Propensity.str

A Derive node calculates the new field LOYAL. Then both a neural net and CHAID model were
trained to predict the field LOYAL, using the ChurnTrain.txt data. Their generated models were then
added to the stream connected to the Type node.

Figure 14.9 Stream with Two Models and new Field LOYAL

Let’s look at the Derive node, which calculates the LOYAL field.

Edit the Derive node LOYAL

The Derive node creates a flag field. The True value will be Leave, and this value will be assigned to
a record when CHURNED is not equal to Current. Then LOYAL will be False for customers who are
still current. The new field is defined in this manner because we are interested in finding customers
who might churn.


Figure 14.10 Derive Node Combining Churning Customers in One Category

Close the Derive node dialog

Propensity scores are not calculated by default for a model but must be requested (unlike confidence
values). For CHAID models, raw propensities can be calculated either at the time of model creation,
or later from the model nugget. For Neural Nets, propensities must be calculated when the model is
created, not afterwards. We did this in the Model Options tab of the Neural Net node before
generating the model in the stream. To request propensity scores for CHAID models after the model
is created:

Edit the CHAID generated model


Click the Settings tab

There is a check box to request raw propensity scores. Because we aren’t using a Partition field, the
option for adjusted propensity scores is grayed out.


Figure 14.11 Settings Tab in CHAID Model

Click Calculate raw propensity scores


Click OK

To illustrate the distribution of propensity scores and how they differ from confidence values, we’ll
look at both fields for the CHAID model.

Add a Table node to the stream near the CHAID model


Connect the CHAID model to the Table node
Run the Table node


Figure 14.12 Table with CHAID Model Confidence and Propensity

Recall that when the model predicts that a customer will Leave, the propensity is equal to the
probability. And, for the first record, with a prediction of Leave, the confidence ($RC-LOYAL) is
equal to the propensity ($RRP-LOYAL), where the “RP” stands for raw propensity. This means that
the probability for a CHAID model is equal to the confidence.

For record 4, where the prediction is Current, the propensity is 1 – confidence.

To make all this very clear, we will view histograms of confidence and propensity with an overlay of
the model prediction.

Close the Table window


Add a Histogram node to the Stream canvas, and connect the CHAID model to it
Edit the Histogram node
Select $RC-LOYAL as the Field to display and $R-LOYAL as the Color Overlay Field (not
shown)
Run the Histogram node


Figure 14.13 Histogram of Confidence Value by Predicted Loyalty

We can see that the confidence values range from .50 to 1.0, but that a high confidence doesn’t
necessarily indicate that we expect a customer to leave or stay, since there are customers in both
categories at high confidence values (we would find the same pattern if we used the values of LOYAL,
the actual status of customers). In fact, the highest confidence is associated with customers who are
current.

Now we can create the histogram with the propensity scores.

Close the Histogram window


Edit the Histogram node
Select $RRP-LOYAL as the Field to display
Run the Histogram node

The distribution of the propensity is bimodal. Those customers predicted to leave have scores ranging
from .50 to 1, and those predicted to remain have scores below .50.

Propensity scores have a similar distribution for the neural net model.


Figure 14.14 Histogram of Propensity Scores Overlaid by Predicted Loyalty

The propensity score can now be used to score a database, as is commonly done in many data-mining
applications, so that customers can, for example, be selected for a marketing campaign based on their
propensity to leave. Sort the file by propensity, and begin choosing customers with the highest
propensities first.
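
If the scored records were exported from the stream, the same selection could be made outside
Modeler; a small pandas sketch (the customer IDs and propensity values below are invented, and the
field name follows the $RRP-LOYAL naming used in this example):

    import pandas as pd

    scored = pd.DataFrame({
        "ID": [101, 102, 103, 104],
        "$RRP-LOYAL": [0.91, 0.22, 0.67, 0.48],   # hypothetical raw propensities to leave
    })
    top = scored.sort_values("$RRP-LOYAL", ascending=False)   # highest propensity first
    campaign = top.head(2)     # contact as many customers as the campaign budget allows
    print(campaign)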

In addition to scoring new records, there is another use of propensity scores. A score field can be used
in a new model to improve the prediction of LOYAL. The score fields do not perfectly predict the
value of LOYAL (remember we have been using the predicted value of LOYAL, not the actual values,
in our histograms; try running the histograms with LOYAL itself to see the difference), but they
apparently have a high degree of potential predictive power. Clearly, this is based purely upon the
way that CHAID or the neural network has differentiated between customers who will leave or stay,
but if the model has a high degree of accuracy (which it does in this case), then the propensity score
may act as a very good predictor for another modeling technique. If a more complex model that takes
output from one model as inputs for another (often called a meta-model) were to be built, information
on the score values from the CHAID model could be used as an input to a neural network. We shall
look at this form of meta-modeling in the next section.

14.4 Meta-Level Modeling


The idea of meta-level modeling is to build a model based upon the predictions, or results, of another
model. In the previous section, we used a stream which contained both a trained neural network and a
CHAID rule induction model. We then created propensity scores for each to use separately.
We can use the propensity score, though, from the CHAID model as one of the inputs to a modified
neural network model. We know that the CHAID algorithm can predict loyalty with higher accuracy;
thus it is hoped that by inputting the propensity scores into a neural network analysis, the neural
network may be able to correctly predict some of the remaining cases that the CHAID model
incorrectly classified.
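
As a rough illustration of this two-stage workflow (not the course stream itself, which uses CHAID
and Neural Net nodes in PASW Modeler), the same idea can be sketched in Python with scikit-learn on
synthetic data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1000)
    X = rng.normal(size=(1000, 10))               # stand-in for the original inputs
    y = (X[:, 0] + rng.normal(size=1000)) > 0     # stand-in for the LOYAL flag

    # Stage 1: a tree model supplies a propensity-like score and a predicted category
    tree = DecisionTreeClassifier(min_samples_leaf=15, random_state=0).fit(X, y)
    stage1_score = tree.predict_proba(X)[:, 1]
    stage1_pred = tree.predict(X).astype(float)

    # Stage 2: a neural network trained on the original inputs plus the stage-1 outputs
    X_meta = np.column_stack([X, stage1_score, stage1_pred])
    meta = MLPClassifier(max_iter=500, random_state=1000).fit(X_meta, y)
    print("training accuracy:", meta.score(X_meta, y))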

Click File…Close Stream and click No if asked to save changes


Click File…Open Stream
Double-click on Metamodel.str

The figure below shows the completed stream loaded into PASW Modeler. It is fairly complex but
not difficult.

Figure 14.15 Meta-Model Stream

A Type node has been inserted after the two generated model nodes. If we are to build a model based
upon results obtained from previous models, each of the newly created fields will need to be
instantiated and have its role set. We will be using both the new propensity score field and the
predicted value from the CHAID model.

If we run the two Analysis nodes attached to the two generated models, we would find that the neural
net and CHAID models had accuracies of 80.42% and 83.75%, respectively when trying to predict
the field LOYAL.

When doing this type of meta-modeling, you must make a decision about which fields should be
inputs to the new model. You can use all the original fields, or reduce their number since the CHAID
propensity score and predicted category will effectively contain much of the predictive power of the
original fields. If the number of inputs isn’t large, then including them along with the two new fields
in the new neural network will not appreciably slow training time, and that is the approach we take
here. But you may wish to drop at least some of the fields that had little influence on the model, since
including all fields can lead to over-fitting.

We’ll begin by examining the downstream Type node.

Run the Table node attached to the Type node downstream of the generated models
Close the Table window


Edit the Type node attached to the CHAID generated model

Figure 14.16 Type Node Settings

In this example, we will use all the original input fields as predictors, plus the predicted value of
LOYAL from the CHAID model ($R-LOYAL) and the propensity score ($RRP-LOYAL). The target
field remains LOYAL.

A Neural Net node has been attached to the Type node (and renamed MetaModel_LOYAL). We’ve
set the random seed to 1000 so that everyone will obtain the same solution, and we use the Quick
training method.

Let’s run the model.


Close the Type node
Run the neural network MetaModel_LOYAL
Edit the generated model
Click on the Predictor Importance graph


Figure 14.17 Predictor Importance from Meta-Model for LOYAL

We can see that, not surprisingly, the field $RRP-LOYAL is by far the dominant input within the
model. The actual predicted value from CHAID, $R-LOYAL, is only the fifth most important input.

We can check the accuracy of the meta-model with an Analysis node.

Close the Neural Net model browser


Add an Analysis node to the stream and connect the generated meta-model to the
Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run


Figure 14.18 Analysis Node Output for Meta-Model for LOYAL

In the portion of the output comparing $N1-LOYAL with LOYAL, we observe the overall accuracy of
the meta-model is 84.93%. This is much better than the original neural net model (and it is even about
1% better than the CHAID model).

This is a realistic example of how using the results of one model to improve another model can work
in practice. Sometimes it is said that doing so is misguided, and it is true that in classical statistics one
doesn’t combine models in this manner. In data mining, however, this is an acceptable methodology, but
you must always validate the final meta-model on a testing or validation sample. As an exercise, you
can use the file ChurnValidate.txt to validate this meta-model.


14.5 Error Modeling


Error modeling is another form of meta-modeling that can be used to build a better model than the
original, and it is often recommended in texts on data mining. In essence, this method is
straightforward. Cases with errors of prediction are isolated and modeled separately. Almost
invariably, some of these cases can now be accurately predicted with a different modeling technique.

However, there is a catch to this technique. In both the training and test data files we have a target
field to check the accuracy of a model. Thus, in the churn data, we know whether a customer
remained or left. But in real life, that is exactly what we are trying to predict. So how can we create a
model that uses the fact that an error of prediction has occurred since, when applying the model, we
won’t know whether the model is in error until it is too late, i.e., the event we are trying to predict has
occurred?

The answer to this dilemma is that, of course, we can’t, so we have to find a viable substitute strategy.
The most common approach is to find groups of cases with similar characteristics for which we make
a greater proportion of errors. We then create separate models for these cases, assuming that the same
pattern will hold in the future. It is, as always, crucial to validate the models with a holdout sample
when using this technique.

In this section we build an error model on the churn data in order to investigate where the initial
neural network is under-performing, and then improve it by modeling the cases more prone to
prediction errors with a C5.0 model.

In this stream, the True value for LOYAL is reversed from our previous examples and is defined as a
customer who will stay with their service. The False value is then a customer who will leave. Since we
would like to model errors for predicting both categories, it doesn’t make as much difference here
which category is associated with True.

Close the Analysis output browser


Close the current stream (you don’t need to save it), and clear the Models Manager
Click File…Open Stream
Double-click on Errors.str
Switch to small icons (right-click Stream canvas, click Icon Size…Small)

The figure below displays the error-model stream in the PASW Modeler Stream canvas. The upper
stream in the canvas includes the generated model from the neural network and attaches a Derive
node to it. The Derive node compares the original target field (LOYAL) with the network prediction of
the target ($N-LOYAL), calculating a flag field (CORRECT) with a value of “True” if the prediction
of the neural network is correct, and “False” if it was not. You can open it and review it if you wish.

The first goal of the error model is to use a rule induction technique, which can isolate where the
neural network model is under-performing. This will be done by using the C5.0 algorithm to predict
the field CORRECT.
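
In outline, those two steps amount to something like the following pandas/scikit-learn sketch (the
records and the AGE and SEX columns are invented, and a CART tree stands in for C5.0, which
scikit-learn does not provide):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.DataFrame({                       # hypothetical stand-in for the scored churn data
        "LOYAL":    ["True", "False", "True", "False"],
        "$N-LOYAL": ["True", "True",  "True", "False"],
        "AGE":      [34, 51, 27, 63],
        "SEX":      [1, 0, 0, 1],
    })

    # Step 1: flag whether the neural network's prediction was correct
    df["CORRECT"] = df["LOYAL"] == df["$N-LOYAL"]

    # Step 2: model CORRECT from the original inputs (C5.0 in the course)
    inputs = df[["AGE", "SEX"]]
    error_model = DecisionTreeClassifier(min_samples_leaf=2).fit(inputs, df["CORRECT"])
    df["Split"] = error_model.predict(inputs)   # True ~ Correct, False ~ Incorrect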

We chose a C5.0 model because its transparent output will provide the best understanding of where
the neural network is under-performing. In order to ensure that the C5.0 model returns a relatively
simple model, the expert options have been set so that the minimum records per branch is 15. Setting
this value is a judgment call based on the number of records in the training data and the number of
rules with which you wish to work (another approach would be to winnow attributes, an expert
option).


A Type node is used to set the new field CORRECT to role Target, and the original inputs to the
neural network to Input. It would need to be fully instantiated before training the C5.0 model.

Figure 14.19 Error Modeling from a Neural Network Model

In this example, the model has already been trained and added to the stream, labeled C5 Error Model.
Let’s browse this model.

Edit the C5.0 generated model node labeled C5 Error Model

We generated a ruleset from the C5.0 model because it makes it easier to view those rules for the
False values of CORRECT. Again, we are trying to predict the values of CORRECT, which means we
are trying to predict whether the neural network was accurate or not. There are two rules for a False
value.

Click the Show all levels button


Click the Show or Hide Instances and Confidence button (so instances and
confidence values are visible)

The two rules both have reasonable values of confidence, ranging from .59 to .68 (although you might
prefer them to be a bit higher). Rule 1 tells us that for male customers who make less than about a
tenth of a minute of long distance calls per month, we predict the value of CORRECT to be False, i.e.,
the wrong prediction. Rule 2 is more complicated.

It would be better to have more than two rules predicting False, but this will suffice for purposes of
illustration.


Figure 14.20 Decision Tree Ruleset Flagging Where Errors Occur Within the Neural Network

The next step is to split the training data into two groups based on the ruleset, one for predictions of
True and the other for False. We can do this by generating a Rule Tracing Supernode from the Rule
browser window and applying a Reclassify or Derive node to truncate the values of the new field to
just True and False. We will use the Reclassify node to modify the Rule field so that it only has two
categories, which we will rename as Correct and Incorrect. Let’s check the distribution of this field.

Close the C5.0 Model browser window


Run the Distribution node named Split


Figure 14.21 Distribution of Split Field

The neural network accuracy was about 80.5%. The distribution of Split doesn’t match this because
we limited the records per branch to no lower than 15, and because the C5.0 model can’t perfectly
predict when the neural network was accurate or not.

There are clearly enough cases with a value of Correct (1032) to predict with a new model, but there
are only 78 cases with a value of Incorrect, which is a bit low for accurate modeling. The best
solution is to create a larger initial sample so that the cases predicted to be incorrect by the C5.0
model would be represented by a larger number of cases. If that isn’t possible, you can use a Balance
node and boost the number of cases in the Incorrect category (although this is not an ideal solution,
either). Since this is an example of the general method, we won’t do either here; instead we will see how
much we can improve our model with no special sampling.

Looking back at the stream, we next added a Type node to set the role of FALSE_TRUE and Split to
None so that they are not used in the modeling process. We wish to use only the original predictors.

The stream then branches into two after the Type node. The upper branch uses a Select node to select
only those records with predictions expected to be correct, while the lower branch selects those
records with predictions expected to be incorrect. We reemphasize that the split of the training data is
not based on the target field. Instead, only demographic and customer status fields were used to create
the field Split used for record selection. It is for this reason that this model can, if successful, be used
in a production environment to make predictions on new data where the outcome is unknown.

After the data are split, the customers for whom we generally made correct predictions are modeled
again with a neural network. We do so because these cases were modeled well before with a neural
network, so the same should be true now. And, with the problematic cases removed, we expect the
network to perform better.


For the customer group for which predictions were generally wrong, we use a C5.0 model to try a
new technique, since the neural network tended to mispredict for this group. We could certainly try
another neural network, however, or any other modeling technique.

After the models are created, they are added to the stream, and Analysis nodes are then attached to
assess how well each performed. Let’s see how well we did.

Close the Distribution plot window


Run both Analysis nodes in the lower stream

The neural network model for the group of customers whose original predictions were generally correct is
correct 85.56% of the time, a substantial improvement over the base figure of 80.51%. The C5.0
model is even more accurate, correctly predicting who will leave or stay for 86.84% of the cases that
were originally difficult to accurately predict. Clearly, using the errors in the original neural network
to create new models has led to a substantial improvement with little additional effort. If you take this
approach, you would, as usual, explore each model to see which fields are the better predictors and
how this differs in each model.

Figure 14.22 Model Accuracy for Two Groups

So far so good, but we’d still like to automate the solution so that the data all flow in one stream
rather than in two, and we can therefore make a combined prediction for LOYAL on new data. This is
easy to do. To demonstrate, we open a stream with a modified version of the current one.

Close the current stream and don’t save it if asked


Click File…Open Stream
Double-click on Combined_predictions.str
Switch to small icons (right-click Stream canvas, click Icon Size…Small)

We have combined the two generated models in sequence in this modified stream. You might think
that we could simply combine the output from each model, since each was trained on a different
group of cases and thus will make predictions only for those cases, but this isn’t the case. Although
each model was trained on only a portion of the data, each will make predictions for all the cases.
(Why? To verify this, run the Table node.)

Figure 14.23 Combined Predictions Stream

But the solution is simple. We know that the value of the field Split tells us which model’s output to
use, and we do so in the Derive node named Prediction.

Edit the Derive node named Prediction

This node creates a new field called Prediction. When Split is equal to Correct, the value of
Prediction is set to the output of the neural network. Otherwise, the value of Prediction is set
to the output of the C5.0 model. Thus, we have a new field that contains the combined prediction
from the best model for that group of customers.
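
Expressed outside Modeler, the Derive node’s logic is a simple per-record conditional; here is a
Python sketch with hypothetical column names for the two model outputs (the actual names depend on
the generated models):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Split":   ["Correct", "Incorrect", "Correct"],
        "nn_pred": ["True", "False", "False"],    # hypothetical neural network predictions
        "c5_pred": ["False", "False", "True"],    # hypothetical C5.0 predictions
    })
    df["Prediction"] = np.where(df["Split"] == "Correct", df["nn_pred"], df["c5_pred"])
    print(df)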


Figure 14.24 Derive Node to Create Prediction Field

We know that the baseline neural network had an accuracy of 80.51%, and made 216 errors. We will
do much better with these two models. To see how much, we can run the Matrix node that
crosstabulates Prediction and LOYAL.

Close the Derive node


Run the Matrix node named Prediction x LOYAL

The combined models have made only 159 errors, quite an improvement. This translates to an
accuracy of 85.64%, or an increase of about 5.1% over the original neural network model.


Figure 14.25 Comparison of Prediction and LOYAL

The process of modeling errors need not stop here. Although there will clearly be diminishing returns
as the number of errors decreases, it is certainly possible to attempt to separately model the remaining
errors from the combined model. At the very least, you would still want to investigate those
customers whose behavior remains difficult to model.

Eventually you would validate the models with the ChurnValidate.txt dataset. We won’t do that here
because the stream with the C5.0 model predicting errors in the original neural network has only 33
records, not enough for a reasonable validation. Obviously, the validation dataset should be of
sufficient size, just as with the training file.

We should also note that this same technique could be used for target fields that are continuous, either
integers or real numbers. In that case, the errors are relative, not absolute, but some numeric bounds
can be specified to differentiate cases deemed to be in error from those with sufficiently accurate
predictions. Then the former group of cases can be handled in a similar manner as was done above.


Summary Exercises
In these exercises we will use the streams created in this lesson.

1. Use the stream Metamodel.str. Rerun the MetaModel_LOYAL neural network model,
removing all the original inputs from the model and thus using only the modified confidence
score and the predicted value from the CHAID model. How does this affect model
performance? Add this generated model to the stream and validate it with the
ChurnValidate.txt data file. Was the model validated, in your judgment?

2. Use the stream Errors.str. Instead of using a C5.0 model to predict cases with proportionally
more errors, try another type of model (your choice). How well does this perform compared
to the C5.0 model? How does it compare to the accuracy of the original neural network?
What do you recommend we use for these cases that were predicted in error?


Appendix A : Decision List


Overview
• Introduce the Decision List model
• Compare rule induction by Decision List with the decision trees nodes
• Outline the main differences between a decision tree and a decision rule
• Understand how Decision List models a categorical target
• Review the Interactive Decision List modeling feature
• Use partitioned data to test a model (optional, already covered in former lesson)

Data
In this appendix we use the data file churn.txt, which contains information on 1477 customers of a
telecommunications firm who have at some time purchased a mobile phone. The customers fall into
one of three groups: current customers, involuntary leavers and voluntary leavers. Unlike the models
developed in Lesson 3, here we want to understand which factors influence the voluntary leaving of a
customer, rather than trying to predict all three categories.

Introduction
PASW Modeler contains five different algorithms for performing rule induction: C5.0, CHAID,
QUEST, C&R Tree (classification and regression trees) and Decision List. The first four are similar
in that they all construct a decision tree by recursively splitting data into subgroups defined by the
predictor fields as they relate to the target. However, they differ in several ways that are important to
the user (see Lesson 3).

Decision List predicts a categorical target, but it does not construct a decision tree; instead, it
repeatedly applies a decision rules approach. To give you some sense of a Decision List model we
begin by browsing such a model and viewing its characteristics. After that we continue by reviewing
a table that highlights some distinguishing features of the rule induction algorithms. Finally, we will
outline the difference between decision trees and decision rules and the various options for the
Decision List algorithm in the context of predicting categorical fields.

A Decision List Model


Before diving into the details of the Decision List node, we review a decision list model.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel directory


Double-click DecisionList.str


Figure A.1 Decision List Stream

Right-click the Decision List node CHURNED[Vol]


Select Run

Once the Decision List generated model is in the Models palette, the model can be browsed.

Right-click the Decision List node named CHURNED[Vol] in the Models palette
Click Browse

The results are presented as a list of decision rules, hence Decision List. If you are familiar with the
C5.0 model output you will see a distinct likeness to the Rule Set presentation of a C5.0 model.

Figure A.2 Browsing the Decision List Model


The first row gives information about the training sample. The sample has 719 records (Cover (n)) of
which 267 meet the target value Vol (Frequency). Consequently, the percentage of records meeting
the target value is 37.13% (Probability).

A numbered row represents a model rule and consists of an id, a Segment, a target value or Score
(Vol) and a number of measures (here: Cover (n), Frequency and Probability). As you can see, a
segment is described by one or more conditions, and each condition in a segment is based on a
predictive field, e.g. SEX = F, INTERNATIONAL > 0 in the second segment.

All predictions are for the Vol category, as this is what is defined in the Decision List modeling node.
The accuracy of predicting this category is listed for each segment in the Probability column, and
accuracy is reasonably high for most segments.

As a whole our model has 5 segments and a Remainder. The maximum number of predictive fields in
a segment is 2. Each segment is not too small (see measure Cover (n)); the smallest has 52 records.
This is not by chance. The maximum number of segments in the model, the maximum number of
predictive fields in a segment and the minimum number of records in a segment are all set in the
Decision List node, as we will see later.

We now review in some detail the Decision List model.

The Target
A characteristic of Decision List is that it models a particular value of a categorical target. In the
Decision List model at hand we have modeled the voluntary leaving of a customer as represented by
target value CHURNED = Vol.

The Remainder Segment


The Remainder segment is yet another defining characteristic of the Decision List model. Unlike with
decision trees, there will be a group of customers for which no prediction is made (the Remainder).
The Decision List algorithm is particularly suitable for business situations where you are interested in
a relatively small but extremely good (in terms of response) subset of the customer base. Think of
customer selection for a marketing campaign, where there is a limited campaign budget available. So
the marketer will only be interested in the top N customers she can afford to approach given her
budget, and the rest (the Remainder) will be excluded from the campaign.

Overlapping Segments
In our model the 5 segments and the Remainder form a non-overlapping segmentation of the training
sample, meaning that a customer (or a record) belongs to exactly one segment or to the Remainder.
So the total of the Cover (n) for all segments including the Remainder should match the Cover (n) of
the training sample. This basic requirement affects the way a particular segment should be interpreted
when reading the model.

The Nth segment should be interpreted as:


The record is in segment N and not(segment N-1) and not(segment N-2) and ... and
not(segment 1)

Example


Given our model, a female customer with International > 0 and AGE from 43 to 58 satisfies both
segment 1 and segment 2. However, she will be regarded as a member of segment 1. The rules are
applied in the order you see them listed for the segments, so this customer is assigned to segment 1.

A customer belongs to segment 2 if:


not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and International > 0

And a customer belongs to segment 3 if:


not (SEX = F and International > 0) [the segment 2 conditions]
and not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and 73 < AGE <= 89

This mechanism prevents multiple counting of customers in overlapping segments. Be aware that the
order of the segments in the model affects the segment a customer belongs to and so also the
measures Cover (n), Frequency and Probability for each model segment.
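
The first-match logic can be pictured as a simple loop over the ordered segments. A Python sketch,
using the two segment conditions from the example above (the function and data are illustrative only,
not generated by Modeler):

    def assign_segment(record, segments):
        # segments is an ordered list of (name, condition) pairs; the first
        # condition the record satisfies determines its segment.
        for name, condition in segments:
            if condition(record):
                return name
        return "Remainder"

    segments = [
        ("Segment 1", lambda r: r["SEX"] == "F" and 42 < r["AGE"] <= 58),
        ("Segment 2", lambda r: r["SEX"] == "F" and r["INTERNATIONAL"] > 0),
    ]
    customer = {"SEX": "F", "AGE": 45, "INTERNATIONAL": 2}
    print(assign_segment(customer, segments))   # Segment 1, even though Segment 2 also matches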

This is a consequence of the iterative method by which Decision List generates rules. In a later
section we will cover in detail how this rule induction mechanism works. For now it is sufficient to
realize that the Decision List algorithm constructs a list of decision rules using a very different
mechanism than the splitting used in the decision tree algorithms. This is the reason why the
Decision List algorithm is not a tree but a rule algorithm.

Comparison of Rule Induction Models


The table below lists some of the important differences between the rule induction algorithms
available within PASW Modeler. The first four columns are repeated from Lesson 3 for ease of
comparison.

Table A.1 Some Key Differences Between the Five Rule Induction Models

Model Criterion                             | C5.0                           | CHAID                           | QUEST                               | C&R Tree                       | Decision List
Split Type for Categorical Predictors       | Multiple                       | Multiple (1)                    | Binary                              | Binary                         | Multiple
Continuous Target                           | No                             | Yes                             | No                                  | Yes                            | No
Continuous Predictors                       | Yes                            | No (2)                          | Yes                                 | Yes                            | No (2)
Criterion for Predictor Selection           | Information measure            | Chi-square                      | Statistical (F test for continuous) | Impurity (dispersion) measure  | Statistical
Can Cases Missing Predictor Values be Used? | Yes, uses fractionalization    | Yes, missing becomes a category | Yes, uses surrogates                | Yes, uses surrogates           | Yes, missing becomes a category
Priors                                      | No                             | No                              | Yes                                 | Yes                            | No
Pruning Criterion                           | Upper limit on predicted error | Stops rather than overfits      | Cost-complexity pruning             | Cost-complexity pruning        | Stops rather than overfits
Build Models Interactively                  | No                             | Yes                             | Yes                                 | Yes                            | Yes
Supports Boosting                           | Yes                            | Yes                             | Yes                                 | Yes                            | Yes

(1) SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target fields.
(2) Continuous predictors are binned into ordinal fields containing by default approximately equal sized categories.

Unlike the decision tree algorithms, Decision List does not create subgroups by splitting but by either
adding a new predictor or by narrowing the domain of the existing predictor(s) in the group (decision
rule approach) and in consequence tree-splitting issues are not applicable here.

Decision List can handle targets that are of measurement level flag, nominal, and ordinal. Decision
List is designed to model a specific category of a categorical target, so effectively it predicts a binary
outcome (target or not target). The algorithm treats continuous predictors by binning them into
ordinal fields with approximately equal numbers of records in each category.

In generating rules, just like CHAID and QUEST, Decision List uses more standard statistical
methods, as explained below.

The way missing values are handled is set with Expert options. Either missing values in a predictor
are ignored when that predictor is used to form a subgroup, or, as in CHAID, missing values are
treated as an additional category in model building.

The process of rule generation halts based on settings such as the maximum number of predictors in a
rule, explicit group size related settings, and the statistical confidence required.

Rule Induction Using Decision List


The Decision List modeling node must appear in a stream containing fully instantiated fields (either
in a Type node or the Types tab in a source node). Within the Type node or Types tab, the field to be
predicted (or explained) must have role target or it must be specified in the Fields tab of the modeling
node. All fields to be used as predictors must have their role set to input (in Types tab or Type node)
or be specified in the Fields tab. Any field not to be used in modeling must have its role set to none.
Any field with role both will be ignored by Decision List.

The Decision List node is labeled with the name of the target field and target category. Like most
other models, a Decision List model may be browsed and predictions can be made by passing new
data through it in the Stream Canvas.

The target field must have categorical values, and Decision List will model on a particular value of
the target field. That target value is set in the Decision List node. The other values of the target field
will then be regarded as a second category value, appearing as the value $null$ in predictions.

In this example we will attempt to predict which customers voluntarily cancel their mobile phone
contract. Rather than rebuild the source and Type nodes, we use the existing stream opened
previously. We’ll delete the Decision List node so we can review the default settings.

Close the Decision List Browser window


Delete the CHURNED[Vol] node and the generated CHURNED[Vol] model node
Place a Decision List node from the Modeling palette to the upper right of the Type node in
the Stream Canvas
Connect the Type node to the Decision List node (see Figure A.3)

The name of the Decision List node should immediately change to No Target Value.

Figure A.3 Decision List Modeling Node Added to Stream

The node is named “No Target Value” because the target field CHURNED has three values and
Decision List predicts only one specific target value, which has not yet been specified.

Double-click the Decision List node to edit it

Note the message stating that a target value must be specified.

Figure A.4 Decision List Dialog - Initial

The Model name option allows you to set the name for both the Decision List modeling node and the
resulting generated model node. The Use partitioned data option is checked so that the Decision List
node will make use of the Partition field created by the Partition node earlier in the stream.

By default the model is built automatically, as the Mode is set to Generate model. By selecting
Launch interactive session it is possible to create the model interactively.

The Target value has to be set explicitly to Vol.

Click the button to the right of the Target value


Click Vol, then click Insert

With Decision List you are able to generate rules better than average or worse than average,
depending on your goal (where the average is the overall probability of the target value). This is set
by the Search Direction value of Up or Down. An upward search looks for segments with a
probability higher than the average; a downward search will create segments with a probability lower
than the average.

A decision rule model contains a number of segments. The maximum is set in Maximum number of
segments.

Each segment is described by one or more predictors, also known as attributes in the Decision List
node. The maximum number of predictive fields to be used in a segment is set in Maximum number of
attributes. You may compare this setting with Levels below root setting in CHAID and QUEST,
prescribing the maximum tree depth.

The Maximum number of attributes setting implies a stopping criterion for the algorithm. Just like the
stopping criteria of CHAID, Decision List also has settings related to segment size: As percentage of
previous segment (%) and As absolute value (N). The percentage setting states that a segment can
only be created if it contains at least a certain percentage of records of its parent. Compare this with a
branch point in a tree algorithm. The absolute value setting is straightforward: a segment only
qualifies for the model if it is not too small, thus serving the generality requirement of a predictive
model. The larger of these two settings takes precedence.

Note that whereas in CHAID’s stopping criteria you must choose either a percentage or an absolute
value approach, Decision List combines the two by using the percentage requirement for the parent
and the absolute value requirement for the child.

The model’s accuracy is controlled by Confidence interval for new conditions (%). This is a
statistical setting, and the most commonly used value is 95, the default. Depending on the business
case and on how costly an erroneous prediction is, you may increase or decrease this confidence
value.

Understanding the Rules and Determining Accuracy


The predictive accuracy of the rule induction model is not given directly within the Decision List
node. Obtaining that information with an Analysis node can be confusing, if not misleading, because
Decision List explicitly reports only on the particular target value that was modeled; the other
value(s) are regarded as $null$. Instead, we will use Matrix nodes and Evaluation charts to determine
how good the model is.

We use the Table node to examine the predictions from the Decision List model.

Click Run to run the model


Place a Table node from the Output palette below the generated Decision List node
Connect the generated Decision List node to the Table node
Right-click the Table node, then click Run and scroll to the right in the table

Figure A.5 Three New Fields Generated by the Decision List Node

Three new columns appear in the data table, $D-CHURNED, $DP-CHURNED and $DI-CHURNED.
The first represents the predicted target value for each record, the second the probability and the third
shows the ID of the model segment a record belongs to. The sixth segment is the Remainder.

Note that the predicted value is either Vol or $null$, demonstrating that the Decision List algorithm
predicts a particular value of the target field to the exclusion of the others.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values


We will use a matrix to see where the predictions were correct, and then we evaluate the model
graphically with a gains chart.

Place two Select nodes from the Records palette, one to the upper right of the generated
Decision List node and one to the lower right
Connect the generated Decision List node to each Select node

First we will edit the Select node on the upper right that we will use to select the Training sample
cases:

Double-click on the Select node on the upper right to edit it


Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the equal sign button
Click the Select from existing field values button and insert the value 1_Training (not
shown)
Click OK

Click Annotations tab


Select Custom and enter value Training
Click OK

Figure A.6 Completed Selection for the Training Partition

Now do the same for the Select node on the lower right to select the Testing sample cases:
insert the Partition value “2_Testing” and annotate the node as “Testing.”

Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette near the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $D-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click on the Output tab and custom name the Matrix node for the Training sample as
Training and the Testing sample as Testing (this will make it easier to keep track of
which output we are looking at)
Click OK

For each actual CHURNED category, the Percentage of row choice will display the percentage of records
predicted in each of the target categories.

Run each Matrix node

Figure A.7 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model predicts about 82.0% of the Vol (Voluntary
Leavers) category correctly. The results with the testing sample compare favorably (80.5% accurate),
which suggests that the model will perform well with new data.

Note that technically no prediction for the other two categories is correct, since the model doesn’t
predict Current or InVol but just $null$. But we can combine these results by hand to obtain the
accuracy. The percentage of correct not Vol predictions is:

(313 + 48) / ((313 + 68) + (48 + 23)) * 100 = 79.9%.
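
Spelled out as a small Python sketch, using the same counts as the calculation above:

# Training-sample counts for the two "not Vol" categories;
# the columns are the prediction ($null$ versus Vol).
counts = {
    "Current": {"$null$": 313, "Vol": 68},
    "InVol":   {"$null$": 48,  "Vol": 23},
}

correct = sum(row["$null$"] for row in counts.values())             # 313 + 48 = 361
total = sum(row["$null$"] + row["Vol"] for row in counts.values())  # 452

print(round(100 * correct / total, 1))  # 79.9 = % of correct "not Vol" predictions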

We could have made this calculation easier by creating a two-valued target field based on
CHURNED, thus creating a 2 by 2 matrix. Decision List would create the same rules for such a field.

To produce a gains chart for the Voluntary group:

Close both Matrix windows


Place the Evaluation chart node from the Graphs palette to the right of the generated
Decision List node named CHURNED[Vol]
Connect the generated Decision List node to the Evaluation chart node
Double-click the Evaluation chart node, and click the Include best line checkbox

By default, an Evaluation chart will use the first target category to define a hit. To change the target
category on which the chart is based, we must specify the condition for a User defined hit in the
Options tab of the Evaluation node.

Click the Options tab


Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click Insert button

Figure A.8 Specifying the Hit Condition within the Expression Builder

Click OK

Figure A.9 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Run

Figure A.10 Gains Chart for the Voluntary Leaving Group

The gains line ($D-CHURNED) in the Training data chart rises steeply relative to the baseline,
indicating that hits for the Voluntary Leaving category are concentrated in the percentiles predicted
most likely to contain this type of customer, according to the model.
Hold the cursor over the model line in the Training partition at the 40th percentile

Approximately 77% of the hits were contained within the first 40 percentiles.
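
As a rough illustration of how such a gains figure is computed, the pandas sketch below (with
invented scores; the column names are not Modeler field names) sorts records by predicted
probability and measures what share of all hits falls in the top 40% of records:

import pandas as pd

df = pd.DataFrame({
    "actual_vol": [1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
    "prob_vol":   [0.91, 0.15, 0.83, 0.68, 0.22, 0.40, 0.77, 0.10, 0.35, 0.55],
})

df = df.sort_values("prob_vol", ascending=False)   # most likely leavers first
top = df.head(int(round(0.4 * len(df))))           # the first 40 percentiles
gain = 100 * top["actual_vol"].sum() / df["actual_vol"].sum()
print(round(gain, 1))  # % of all hits captured in the first 40 percentiles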

Figure A.11 Gains Chart for the Voluntary Leaving Group (Interaction Enabled)

The gains line in the chart using Testing data is very similar, which suggests that this model can be
reliably used to predict voluntary leavers with new data.

Close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As


Move to the c:\Train\ModelerPredModel directory (if necessary)
Type DecisionList Model in the File name: text box
Click Save

Understanding the Most Important Factors in Prediction


An advantage of rule induction models, as with decision trees, is that the rule form makes it clear
which fields are having an impact on the predicted field. There is no great need to use alternative
methods such as web plots and histograms to understand how the rule is working. Of course, you may
still use the techniques described in the previous lessons to help understand the model, but they often
are not needed.

In the Decision List algorithm the most important fields in the predictions can be thought of as those
that define the best subgroups in the sample used for training the model at a certain stage in the
process. Thus in this example the most important fields when using the whole training sample are
SEX and AGE. Because the sample used for training the model gradually decreases during the
stepwise rule discovery process, there will be other predictive fields coming to the surface as being
most important. This intuitively makes sense. So in step 2, when finding the best second segment,
using the whole training sample except the first segment, the most important fields turn out to be SEX
and International. Similarly, when finding segment 3 and using the whole training sample except for
the first two segments, SEX and AGE are again the most important predictors.

The process continues until the algorithm is not able to construct segments satisfying the
requirements, or stopping criteria are reached.

Expert Options for Decision List


Now that we have introduced you to the basics of Decision List modeling, we will discuss the Expert
options which will allow you to refine your model even further.

Double-click on the Decision List node named CHURNED[Vol] to edit it

Expert mode options allow you to fine-tune the rule induction process.

Click the Expert tab


Click the Expert Mode option button

Figure A.12 Decision List Expert Options

Binning
Binning is a method of transforming a numeric field (with measurement level continuous) into a
number of categories/intervals. The Number of bins input will set the maximum number of bins to be
constructed. Whether this maximum will actually be the number of bins depends on other settings as
well.

There are two main binning methods, Equal Count and Equal Width. Equal Width transforms a
numeric field into a number of fixed-width intervals. Equal Count is a more balanced binning
method: it creates intervals containing approximately equal numbers of records.
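
Outside Modeler, the difference between the two methods can be illustrated with a short pandas
sketch (pd.cut approximates Equal Width and pd.qcut approximates Equal Count; the values are
invented):

import pandas as pd

age = pd.Series([18, 22, 25, 30, 34, 41, 47, 52, 58, 63, 71, 84])

equal_width = pd.cut(age, bins=4)   # four intervals of equal width
equal_count = pd.qcut(age, q=4)     # four intervals with roughly equal record counts

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())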

The three settings below this control details of the model process, described below.

If Allow missing values in conditions is checked, the Decision List algorithm will regard being empty
or undefined as a particular category that can be used as a condition in a segment. That may result in a
segment such as “SEX = F and AGE IS MISSING”.

The Decision List Algorithm


The Decision List algorithm constructs lists of rules based on the predictions of a tree. However, the
tree is generated quite differently from the way it is done in the decision tree algorithms, so the word
“tree” has to be regarded as a way to visualize the solution area and the rule generation process of the
Decision List algorithm.

Process Hierarchy
In order to understand the Decision List rule generation process, we must first realize that a decision
list contains segments, with each segment containing one or more conditions, and each condition
being based on one predictive field. This hierarchy is directly reflected in the rule generation process:
a main cycle of generating the list’s segments and a sub cycle for each segment of constructing the
segment’s conditions based on the predictive fields.

The main cycle is also called the List cycle and the sub cycle is called the Rule cycle. In constructing
the conditions on the lowest process level the algorithm also has a Split cycle where the binning is
performed in case of continuous predictive fields.

Qualification
A key question is: what makes one list better than another, and what makes one segment better than
another?

For a list the accuracy is defined by:


List% = 100 * SUM(Frequency) / SUM(Cover(n)), the Remainder excluded

On a segment level, a segment at hand is better than another segment if:


(1) the probability of the target value in the segment is higher, and
(2) there is no overlap between the confidence interval of the segment at hand and the confidence
interval of the other segment. This interval is directly related to the Confidence interval for new
conditions (%) setting in the Simple mode of the Decision List dialog and is defined as
Probability ± Error, where Error is the statistical error in the prediction of the Probability.
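
Both criteria can be sketched in a few lines of Python. The segment figures below are invented, and
the Error term is approximated with a normal-theory standard error, which is an assumption; the
exact error formula used by Modeler is not documented in this appendix.

import math

segments = [
    {"name": "Segment 1", "cover": 120, "frequency": 92},
    {"name": "Segment 2", "cover": 200, "frequency": 110},
]

def probability(seg):
    return seg["frequency"] / seg["cover"]

def error(seg, z=1.96):  # roughly a 95% confidence half-width (assumed approximation)
    p = probability(seg)
    return z * math.sqrt(p * (1 - p) / seg["cover"])

# List% = 100 * SUM(Frequency) / SUM(Cover(n)), Remainder excluded
list_pct = 100 * sum(s["frequency"] for s in segments) / sum(s["cover"] for s in segments)
print(round(list_pct, 2))

# Segment 1 is better than segment 2 only if its probability is higher AND the
# intervals Probability +/- Error do not overlap.
p1, e1 = probability(segments[0]), error(segments[0])
p2, e2 = probability(segments[1]), error(segments[1])
print(p1 > p2 and (p1 - e1) > (p2 + e2))  # True for these invented figures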

List Generation – Simple


To simplify the argument we will describe the process given the setting Model search width = 1,
meaning we will not create multiple lists simultaneously to choose from in the end. So we will
assume one List cycle here.

Rule Generation
Given the above, the rule generation process starts with the full Training sample to search for
segments. The solution area is generated as follows: on the first rule level segments are constructed
based on 1 predictive field. The best 5 (Rule search width) will be selected as starting points for a
second rule level, resulting in a set of segments each described by 2 predictive fields. Again the best 5
are selected for the third rule level. This continues until the last rule level, which is 5 (Maximum
number of attributes), so in principle the fifth-level segments are described by five predictive
fields.

It is not always possible to refine a segment at the next level by adding a new predictive field; one
reason is the minimum group size set in As absolute value (N). As a result, the algorithm may produce
segments described by fewer than five predictive fields. Refining a segment at the next level can also
be done not by adding a new predictive field but by reconsidering an existing one, which is controlled
by Allow attribute re-use (e.g., “Age between (20, 60)” at level 1 could be refined to
“Age between (25, 55)” at level 2). This is why, at rule level N, there may be segments with fewer
than N predictive fields.

A segment that is not refined anymore is called a final result, which is comparable to terminal nodes
in a decision tree.

If Model search width =1, out of all these final results the algorithm will return the best 5 (Maximum
number of segments) based on the target value’s probability. Our previous model did create all five.

The decision rule process may not be able to use all the “freedom” as set in the Rule search width (5)
and in Maximum number of attributes (5). The main reasons are typically group size requirements
and/or the statistical confidence requested.
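
The beam-search flavour of the Rule cycle can be caricatured in a short Python sketch. This is only a
rough sketch under simplifying assumptions: it ignores attribute re-use, the confidence test and the
percentage-of-parent requirement, and the scoring function shown is simply the probability of the
target value.

def rule_cycle(data, conditions, score, rule_search_width=5,
               max_attributes=5, min_size=15):
    """Return candidate segments (lists of conditions), best score first."""
    beam = [[]]                                   # start from the whole sample
    results = []
    for _ in range(max_attributes):
        refined = []
        for conds in beam:
            for cond in conditions:               # try to add one more condition
                new_conds = conds + [cond]
                covered = [r for r in data if all(c(r) for c in new_conds)]
                if len(covered) >= min_size:      # the segment must not be too small
                    refined.append((score(covered), new_conds))
        if not refined:                           # nothing qualifies any more
            break
        refined.sort(key=lambda t: t[0], reverse=True)
        beam = [conds for _, conds in refined[:rule_search_width]]
        results.extend(refined[:rule_search_width])
    return sorted(results, key=lambda t: t[0], reverse=True)

# Example score: the probability of the target value among the covered records.
score = lambda rows: sum(r["CHURNED"] == "Vol" for r in rows) / max(len(rows), 1)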

List Generation – Boosting


Just like C5.0, Decision List has a “boosting” mechanism. This is reflected in the setting Model
search width. In describing the decision list algorithm we assumed Model search width to be 1.

By setting a higher value (say 2) you direct the Decision List algorithm to consider 2 alternatives for
each segment. Thus the algorithm will deliver the best 2 segments after each Rule cycle. In our model
this means that we instructed the algorithm to build a list of 5 segments: the List cycle will have
5 iterations of the Rule cycle (= Maximum number of segments), and each Rule cycle will have
5 iterations (= Maximum number of attributes).

For the first segment on the list the Rule cycle will return the top 2 segments (= Model search width).
Thus, now 2 lists are created each with 1 segment and a Remainder. On each of the 2 lists the Rule
cycle will be performed on the Remainder. This will result in 4 lists, each with 2 segments and a
Remainder. Out of these 4 lists the top 2 based on List% are selected to find a third, and so forth.
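
A tiny arithmetic sketch of this list-level search with Model search width = 2 (purely illustrative):

model_search_width = 2
max_segments = 5

lists_kept = 1                                    # we start from a single empty list
for segment_number in range(1, max_segments + 1):
    candidates = lists_kept * model_search_width  # each kept list is extended 2 ways
    lists_kept = min(candidates, model_search_width)
    print(f"segment {segment_number}: {candidates} candidate lists, keep {lists_kept}")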

When working in Interactive mode, the Maximum Number of alternatives setting is active. When the
model is automatically generated, its value is set to 1.

Be aware that the Model search width and the Rule search width have a direct impact on the data-
mining processing time.

Interactive Decision List


Decision lists can be generated automatically, allowing the algorithm to find the best model, as we
did with the first example. As an alternative, you can use the Interactive List Builder to take control
of the model building process. You can grow the model segment-by-segment, you can select specific
predictors at a certain intermediate point, and you can ensure that the list of segments is not too
complex, so that it remains practical for the business problem.

To use the List Builder, we simply specify a decision list model as usual, with the one addition of
selecting Launch interactive session in the Decision List node’s Model tab. We’ll use the Decision
List Interactive session to predict the voluntary leavers.

Close the Decision List modeling node


Click File…Open Stream
Double-click on DecisionList Interactive.str
On the Stream canvas double-click the Decision List node named CHURNED[Vol]
Click the Model tab
Click the Launch interactive session option button

Figure A.13 Decision List Model Tab with Interactive session enabled

Note that we have modified some of the default settings, such as the maximum number of attributes,
the maximum number of segments, and the absolute value of the minimum segment size. Click on the
Expert tab to review those settings as well.

When the model runs, a generated Decision List model is not added to the Model Manager area.
Instead, the Decision List Viewer opens, as shown in Figure A.14.

Click Run to open the Decision List Viewer

Figure A.14 The Decision List Viewer

The easy-to-use, task-based Decision List Viewer graphical interface takes the complexity out of the
model building process, freeing you from the low-level details of data mining techniques. It allows
you to devote your full attention to those parts of the analysis requiring user intervention, such as
setting objectives, choosing target groups, analyzing the results, selecting the optimal model and
evaluating models.

The workspace consists of one pane and two pop-up tabs: the Working model pane, the Alternatives
tab, and the Snapshots tab. The Working model pane (Figure A.14) displays the current model,
including the mining tasks and other actions that apply to the working model. The Alternatives tab
and Snapshots tab are generated when you click Find Segments. The Alternatives tab lists all
alternative mining results for the selected model or segment in the working model pane, and the
Snapshots tab displays current model snapshots (a snapshot is a model representation at a specific
point in time).

Note: The generated, read-only model displays only the working model pane and cannot be modified.

In the working model pane you can see two rules. The first gives information about the training
sample: the sample has 719 records (Cover (n)), of which 267 meet the target value (Frequency).
Consequently, the percentage of records meeting the target value is 37.13% (Probability).

The second, called Remainder, is now the first segment in our model and contains the whole training
sample. This will be the starting point for building our Decision List model.

Right-click the Remainder segment


From the dropdown list select Find Segments

Figure A.15 Model Albums Dialog (Alternatives tab)

The pop-up window states that the mining task was performed on the Remainder segment and has
completed. Two alternatives were generated by this data mining task; recall that for this task the
Model search width is set to 2. The first alternative (Alternative 1) contains 7 segments, and the
model represented by this list has an average probability of 59.36%.

The second alternative has 8 segments and the corresponding model has an average probability of
56.13%.

Let’s view each of the two alternatives

Click on Alternative 1

The result will be displayed in the Alternative Preview pane.

Figure A.16 Preview of an Alternative

Click on Alternative 2 (not shown)

You will see that these two alternatives differ in their 7th segment: the second has a 7th segment based
on AGE, but the first does not include this rule. Another interesting segment is the Remainder.
The first alternative has a Remainder of 281 records and misses 7 voluntarily leaving customers,
whereas the second alternative has a Remainder of 254 records and misses 6 of these customers.

Assume that we prefer the first alternative but we want to capture some more of the voluntary leavers
in the model. First we must promote the first alternative list to our working model; from there we will
continue the model building process.

Right-click on Alternative 1
Select Load (or click the Load button on the bottom)
Click OK

The result will be displayed in the Working model panel.

Figure A.17 Loading an Alternative to the Working Model

We can now create a Gains chart for the working model.

Click Gains tab

Figure A.18 Gains Chart of Working Model

The results look encouraging on both the training data and the testing data. The segments included in
the model are represented by the solid line; the excluded portion (Remainder) is represented by the
dashed line.

Let’s put both the Working model (Alternative 1) and Alternative 2 on display in the Gains chart, so
that we can choose the better model.

Click the Viewer tab

Click the Take Snapshot button


Click the toolbar button to view the alternatives
Right-click on Alternative 2
Select Load (or click the Load button on the bottom)
Click OK (for now, Alternative 2 is a working model)
Click the Gains tab
Click Chart Options

Figure A.19 Chart Options

Select the checkbox for Snapshot 1 (Snapshot 1 is Alternative 1; we can click the toolbar
button to view the snapshot)
Click OK

Figure A.20 Gains Chart of Alternatives

Although model performance is similar, Alternative 2 (the Working Model) performs slightly worse
than Alternative 1 (Snapshot 1).

Click Viewer tab


In the Working model pane, right-click on segment SEX = F

Figure A.21 Options to Modify a Segment in the Model

Choices in the context menu allow you to modify the segments created by the data mining task. For
example, you may decide to delete a segment or to exclude it from scoring. You can even edit a
segment: you could add an extra condition to the segment ‘SEX = F’, or you could modify the lower
and upper boundary values of EST_INCOME in segment 6 (Edit Segment Rule).

Model Assessment
We have used the Gains chart above to get an overall view of the model. You can assess the model on
a segment level by using the model measures. There are five types of measures available.

From the menu, click Tools…Organize Model Measures

Figure A.22 Organize Model Measures Dialog

When building a Decision List model, you have five types of measures at your disposal (Display): a
Pie Chart and four numerical measures.

Each measure has a Type, the Data Selection it will operate on (here Training Data) and whether it
will be displayed in the model (Show).

The Pie Chart displays the part of the Training sample that is described by a segment. The other
coverage measure is Cover (n), which shows the number of records of the Training sample in that
segment. The Frequency measure displays the number of records in the segment with the target value,
Probability is the ratio of Frequency to Cover (n), and Error returns the statistical error of the
Probability.

It is possible to add new measures to your model by clicking the Add new model measure button.
We’ll create a measure (call it %Test) showing the probability of each segment on the Testing
partition. Furthermore, we will rename Probability to %Train.

Click the Add new model measure button

This will create a new row named Measure 6.

Double-click in the Name cell for Measure 6 and change the name to %Test
Click the dropdown list for Type and change to Probability

Figure A.23 Creating a New Measure

Click the dropdown list for Data Selection and change to Testing Data
Click the Show checkbox for %Test
Double-click in the Name cell for Probability and change its name to %Train, then hit Enter

Figure A.24 Completed %Test Measure and Renamed Probability Measure

Click OK

Figure A.25 New Measures Added to the Working Model

Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value
calculations and profit formulas directly within the model building process to simulate cost/benefit
scenarios. The link with Excel allows you to export data to Excel, where it can be used to create
presentation charts, calculate custom measures, such as complex profit and ROI measures, and view
them in Decision List Viewer while building the model.

The following steps are valid only when MS Excel is installed. If Excel is not installed, the options
for synchronizing models with Excel are not displayed.

Suppose that we have created a template in Excel where, based on the Probability and on the
Coverage of a segment, we calculate the amount of loss we will suffer should the customers in a
segment actually leave voluntarily.

Click Tools and select Organize Model Measures


Click Yes for Calculate custom measures in Excel (TM)
Click Connect to Excel (TM) … button
Browse to C:\Train\ModelerPredModel\ and select Template_churn_loss.xlt
Click Open

Figure A.26 The Excel Workbook for the Churn Case

Switch to PASW Modeler using the ALT-Tab keys on your keyboard (if necessary)

Figure A.27 Excel Input Fields

The Choose Inputs for Custom Measures window reveals that Excel expects two fields for input:
Probability and Cover.

On the other hand four fields are available to add to your model:

Loss = Probability * Cover * Loss – Cover * Variable Cost


%Loss = 100 * Loss / Sum(Loss), the fraction of the total loss for which a segment accounts
Cumulative = Cumulative Loss
%Cumulative = % Cumulative Loss

By default all are selected.
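
As a sketch of what such a template computes, expressed in Python (the loss-per-churner and
variable-cost figures are assumed values, not taken from the actual Excel template, and the segment
probabilities and covers are invented):

LOSS_PER_CHURNER = 500.0   # assumed revenue lost per voluntarily leaving customer
VARIABLE_COST = 5.0        # assumed contact cost per customer in a segment

segments = [               # %Train (probability) and Cover (n) per segment
    {"probability": 0.77, "cover": 120},
    {"probability": 0.64, "cover": 150},
    {"probability": 0.52, "cover": 90},
]

for s in segments:
    s["loss"] = s["probability"] * s["cover"] * LOSS_PER_CHURNER - s["cover"] * VARIABLE_COST

total_loss = sum(s["loss"] for s in segments)
running = 0.0
for s in segments:
    s["pct_loss"] = 100 * s["loss"] / total_loss      # %Loss
    running += s["loss"]
    s["pct_cumulative"] = 100 * running / total_loss  # %Cumulative
    print(s)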

Clicking on an empty Model Measure cell in the dialog will open a drop down list with all the
measures available in your model.

Click in the Model Measure cell for Probability and select %Train
Click in the Model Measure cell for Cover and select Cover (n)

Figure A.28 Mapping Excel Input File to the Decision List Model Measure

Click OK

In the Organize Model Measures window you will see which measures are available for input to your
model. By default all are selected.

Deselect measure %Test (not shown)


Click OK

Figure A.29 The Decision List Model with External Measures

As you can see, segment 4 is responsible for more than 20% of the total loss expected (reflected by its
measure %Loss), and the first four segments for more than 50% (reflected by the %Cumulative
measure for the fourth segment).

So if the business objective was to select a set of customers in a retention campaign to reduce the
expected loss by at least 50%, the list manager would probably choose the first 4 segments to be
scored.

If you wish to exclude a segment from the model, you can do so from the context menu.

Right-click on Segment 5

Figure A.30 Manually Excluding Segments from Scoring Based on External Measures

Interactive Decision Lists are not a model, but instead are a form of output, like a table or graph.
When you are satisfied with the list you have built, you can generate a model to be used in the stream
to make predictions.

Click Generate…Generate Model


Click OK in the resulting dialog box (not shown)
Close the Interactive Decision List Viewer window

A generated Decision List model appears in the upper left corner of the Stream Canvas. It can be
edited, attached to other nodes, and used like any other generated model. The only difference is in
how it was created.

Summary Exercises
The exercises in this appendix are written for the data file Newschan.sav.

1. Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and
connect it to Newschan.sav.

2. Try to predict with Decision List whether or not someone responds to a cable news channel
offer (NEWSCHAN). Start by using the default settings. Use all available predictors. How
many segments were created? What fields were used? Does this model seem adequate?

3. Try different models by changing various settings, including the minimum segment size,
allowing attribute reuse, confidence interval (change to 90%), or some of the expert settings.
Can you find a better model?
