
NESUG 2007

Statistics and Data Analysis

Variable Selection and Transformation of Variables in SAS Enterprise Miner 5.2
Kattamuri S. Sarma, Ph.D.
Ecostat Research Corp., White Plains NY

Introduction
In predictive modeling and data mining one is often confronted with a large number of inputs
(explanatory variables). The number of potential inputs to choose from may be as large as 2000
or higher. Some of these inputs may not have any relation to the target. An initial screening is
therefore necessary to eliminate irrelevant variables to keep the number of inputs to a manageable
size. The Variable Selection node of SAS Enterprise Miner provides alternative methods for
eliminating irrelevant variables and selecting variables which have predictive power. In the
process of variable selection, the Variable Selection node creates binned variables from
interval scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is
more strongly correlated with the target variable than the original input, indicating a non-linear
relationship between the input and the target. The grouped variables are created by collapsing or
grouping the categories of a nominal input. With fewer categories, the grouped variables are
easier to use in modeling than the original ungrouped variables.
The predictive power of the inputs can sometimes be enhanced by making suitable
transformations. One can use the Transform Variables node to select the best mathematical
transformation for any given input, based on criteria such as maximizing normality or
maximizing correlation with the target. The Transform Variables node can also be used for
optimally binning the interval inputs and creating dummy variables from categorical inputs.
Variable selection and transformation are also done by the Decision Tree node. The inputs that
give significant splits in creating a decision tree are selected by the Decision Tree node and
passed to the next node, which may be the Regression or Neural Network node. In addition to
variable selection, the Decision Tree node creates a special categorical variable which indicates
the leaf node to which a given record is assigned.
This paper discusses the details of the variable selection methods, transformations and the options
available in these three nodes.

The Variable Selection node


Two methods of variable selection are available in the Variable Selection node: the
R-Square method and the Chi-Square method.
R-Square Method
The R-Square method can be used with a binary as well as with an interval-scaled target.

In the R-Square method, variable selection is performed in two steps. In the first step, the R-Square
between each input and the target is calculated, and all inputs with an R-Square above a specified
threshold are selected. The inputs selected in the first step enter
the second step of variable selection.
Step 1
In this step, a preliminary selection is made based on the Minimum R-Square property of
the Variable Selection node, which the user can specify (see Display 7).
For each interval-scaled input, the Variable Selection node calculates two measures of correlation
with the target. One is the R-Square between the target and the original input.
The other is the R-Square between the target and the binned version of the input variable. The
binned variable is a categorical variable created by the Variable Selection node from each
continuous (interval-scaled) input. The levels of this categorical variable are the bins. In
Enterprise Miner, this binned variable is referred to as an AOV16 variable. The number of levels
or categories of the binned variable (AOV16) is at most 16, corresponding to 16 intervals that are
equal in width.
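The two Step 1 measures can be sketched in plain Python. This is an illustration under stated assumptions, not Enterprise Miner's implementation: the function names `r_square_linear` and `r_square_binned` and the toy data are hypothetical, and the binned R-Square is taken to be the between-bin share of the total sum of squares (one-way ANOVA) over 16 equal-width bins.

```python
from collections import defaultdict

def r_square_linear(x, y):
    """Squared Pearson correlation between input x and target y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def r_square_binned(x, y, n_bins=16):
    """R-Square of y on an equal-width binned copy of x (one-way ANOVA)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    groups = defaultdict(list)
    for a, b in zip(x, y):
        # The maximum value is placed in the last bin, not in a 17th one.
        idx = min(int((a - lo) / width), n_bins - 1)
        groups[idx].append(b)
    my = sum(y) / len(y)
    ss_total = sum((b - my) ** 2 for b in y)
    ss_between = sum(len(g) * (sum(g) / len(g) - my) ** 2 for g in groups.values())
    return ss_between / ss_total

# Toy data (assumed): a U-shaped, non-linear relationship.
x = [i / 10 for i in range(-50, 51)]
y = [a * a for a in x]
linear_r2 = r_square_linear(x, y)
binned_r2 = r_square_binned(x, y)
print(f"original input: {linear_r2:.3f}, binned input: {binned_r2:.3f}")
```

On this toy data the original input has almost no linear R-Square with the target, while the binned version captures most of the variation, which is the non-linear situation in which a binned input is preferred.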
In the case of nominal-scaled categorical inputs with a continuous target, R-Square is calculated
using one-way ANOVA. Here you have the option of using either the original or the grouped
variables. Grouped variables are the new variables created by collapsing the levels of categorical
variables. For example, suppose there is a categorical (nominal) variable called LIFESTYLE,
which indicates the lifestyle of the customer. It may take on values such as Foreign Traveler,
Urban Dweller, etc. If the variable LIFESTYLE has 100 levels or categories, it can be
collapsed to fewer levels or categories by setting the Group Variables property to Yes, as shown
in Display 7.
Step 2
In the second step, a sequential forward selection process is used. This process starts by selecting
the input variable that has the highest correlation coefficient with the target. A regression
equation (model) is estimated with the selected input. At each successive step of the sequence, an
additional input variable that provides the largest incremental contribution to the Model R-Square
is added to the regression. If the lower bound for the incremental contribution to the Model
R-Square is reached, the selection process stops. The lower bound for the incremental contribution
to the Model R-Square can be specified by setting the Stop R-Square property (see Display 7) to
the desired value.
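A minimal sketch of the forward selection in Step 2 follows, assuming ordinary least squares and a stopping rule on the incremental Model R-Square. The function `forward_select`, the threshold value passed as `stop_r2`, and the toy inputs are hypothetical; the node's internal algorithm may differ. After each selection, the target and the remaining inputs are residualized on the chosen input, so the next gain is the true increment to the Model R-Square.

```python
def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_select(inputs, y, stop_r2):
    """Greedy forward selection; inputs maps a name to a list of values."""
    resid_y = center(y)
    ss_total = dot(resid_y, resid_y)
    resid_x = {name: center(col) for name, col in inputs.items()}
    selected, model_r2 = [], 0.0
    while resid_x:
        # Incremental R-Square each candidate would add to the current model.
        gains = {}
        for name, col in resid_x.items():
            sxx = dot(col, col)
            gains[name] = 0.0 if sxx < 1e-12 else dot(col, resid_y) ** 2 / (sxx * ss_total)
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break  # lower bound on the incremental contribution reached
        selected.append(best)
        model_r2 += gains[best]
        q = resid_x.pop(best)
        qq = dot(q, q)
        # Remove the part explained by the chosen input from y and the rest.
        beta = dot(q, resid_y) / qq
        resid_y = [a - beta * b for a, b in zip(resid_y, q)]
        for name in list(resid_x):
            g = dot(q, resid_x[name]) / qq
            resid_x[name] = [a - g * b for a, b in zip(resid_x[name], q)]
    return selected, model_r2

# Toy data (assumed): the target depends on x1 and x2 but not on x3.
x1 = [float(i) for i in range(20)]
x2 = [float((i * 7) % 5) for i in range(20)]
x3 = [float((-1) ** i) for i in range(20)]
y = [3 * a + 2 * b for a, b in zip(x1, x2)]
selected, model_r2 = forward_select({"x1": x1, "x2": x2, "x3": x3}, y, stop_r2=0.0005)
print(selected, round(model_r2, 4))
```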
Chi-Square Method
This criterion can be used when the target is binary. When this criterion is selected, the selection
process does not have two distinct steps, as in the case of the R-square criterion. Instead, a tree is
constructed. The inputs selected in the construction of the tree are passed to the next node with
the assigned Role of Input.

Using Decision Tree node for Variable Selection


The Decision Tree node of Enterprise Miner can also be used for variable selection and
transformation.
The inputs which create significant splits in the development of the tree are passed to the
next node with the role of Input. These are the variables selected by the Decision Tree
node and they can be used in the Regression node or in the Neural Network node as
inputs. In addition to selecting variables, the Decision Tree node also creates a special
categorical variable called _NODE_ and optionally passes it to the next node as an input.
The variable _NODE_ can be used as a class input in the Regression node.
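To make the two outputs concrete, the sketch below uses a small hand-written tree rather than a fitted one. Everything in it is hypothetical (the split inputs AGE and INCOME, the thresholds, the node numbers, and the helper names); it only illustrates that the selected variables are the inputs used in splits and that _NODE_ is the identifier of the leaf a record falls into.

```python
# Hypothetical tree: internal nodes split on one input; leaves carry a node id.
tree = {
    "input": "AGE", "threshold": 40,
    "left": {"leaf": 2},
    "right": {
        "input": "INCOME", "threshold": 50000,
        "left": {"leaf": 4},
        "right": {"leaf": 5},
    },
}

def leaf_node(record, node):
    """Follow the splits until a leaf is reached and return its node id."""
    while "leaf" not in node:
        branch = "left" if record[node["input"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def split_inputs(node):
    """Collect the inputs that appear in at least one split."""
    if "leaf" in node:
        return set()
    return {node["input"]} | split_inputs(node["left"]) | split_inputs(node["right"])

records = [
    {"AGE": 25, "INCOME": 30000, "TENURE": 3},
    {"AGE": 52, "INCOME": 42000, "TENURE": 10},
    {"AGE": 61, "INCOME": 88000, "TENURE": 1},
]
node_ids = [leaf_node(r, tree) for r in records]   # values of _NODE_
selected_inputs = split_inputs(tree)                # TENURE is never used
print(node_ids, sorted(selected_inputs))
```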

The Transform Variables node


Transformations for Interval Inputs
Simple Transformations
The available simple transformations are Log, Square Root, Inverse, Square,
Exponential, and Standardize. They can be applied to any interval-scaled input. These
simple transformations can be used irrespective of whether the target is categorical or
continuous.
Binning Transformations
In Enterprise Miner, there are three ways of binning an interval-scaled variable. To use
these as default transformations, select the Transform Variables node, and set the
value of the Interval Inputs property to Bucket, Quantile, or Optimal in the Default
Methods section.
Bucket:
The Bucket option creates buckets by dividing the input into n equal-sized
intervals and grouping the observations into the n buckets. The resulting number of
observations in each bucket may differ from bucket to bucket. For example if AGE is
divided into the four intervals 0–25, 25–50, 50–75, and 75–100, then the number of
observations in the interval 0–25 (bin 1) may be 100, the number of observations in the
interval 25–50 (bin 2) may be 2000, the number of observations in the interval
50–75 (bin 3) may be 1000, and the number of observations in the interval 75–100 (bin
4) may be 200.
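Equal-width bucketing can be sketched as follows. The function `bucket_bins`, the explicit range arguments, the rule that a boundary value goes to the upper interval, and the sample ages are assumptions made for illustration.

```python
from collections import Counter

def bucket_bins(values, n_bins, lo, hi):
    """Assign each value to one of n_bins equal-width intervals on [lo, hi]."""
    width = (hi - lo) / n_bins
    # A value on a boundary goes to the upper interval; hi stays in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) + 1 for v in values]

ages = [3, 18, 25, 30, 42, 49, 51, 60, 77, 100]
bins = bucket_bins(ages, n_bins=4, lo=0, hi=100)
counts = Counter(bins)
print(bins)
print(dict(counts))
```

The intervals have the same width, but the counts per bucket differ, as in the AGE example.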
Quantile:
This option groups the observations into quantiles (bins) with an equal number of
observations in each. If there are 20 quantiles, then each quantile consists of 5% of the
observations.
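A rank-based sketch of quantile binning is shown below. The function `quantile_bins` and the sample values are hypothetical, and tied values are split by sort order, which a production implementation might handle differently.

```python
from collections import Counter

def quantile_bins(values, n_bins):
    """Assign bins 1..n_bins so that each holds (nearly) the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins

# 100 distinct toy values split into 20 quantiles of 5 observations (5%) each.
values = [(i * 37) % 101 for i in range(100)]
bins = quantile_bins(values, n_bins=20)
counts = Counter(bins)
print(sorted(counts.items()))
```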

Optimal Binning for Relationship to Target:

This transformation is available for binary targets only. The input is split into a
number of bins, and the splits are placed so as to make the distribution of the
target levels (for example, response and non-response) in each bin significantly
different from the distribution in the other bins.
Best Power Transformations
The Transform Variables node selects the best power transformations from among
X, log(X), sqrt(X), e^X, X^(1/4), X^2, and X^4, where X is the input. Four
criteria are available for choosing the best transformation:
Maximum Normal: To find the transformation that maximizes normality, sample
quantiles from each of the transformations listed above are compared with the
theoretical quantiles of a normal distribution. The transformation that yields quantiles
that are closest to the normal distribution is chosen.
Suppose Y is obtained by applying one of the above transformations to X. For
example, the 0.75-sample quantile of the transformed variable Y is that value of Y at or
below which 75% of the observations in the data set fall. The 0.75-quantile for a
standard normal distribution is 0.6745, given by P(Z ≤ 0.6745) = 0.75, where Z is a
normal random variable with mean 0 and standard deviation 1. The 0.75-sample
quantile for Y is compared with 0.6745, and similarly the other quantiles are compared
with the corresponding quantiles of the standard normal distribution.
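The quantile comparison can be sketched with the Python standard library. The sketch assumes that each transformed variable is standardized before its sample quantiles are compared with standard normal quantiles, and that closeness is measured by a sum of squared differences; the source does not state the exact distance used, and the names `normality_gap` and `best_normalizing_transform` are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev

CANDIDATES = {
    "X": lambda x: x,
    "log(X)": math.log,
    "sqrt(X)": math.sqrt,
    "e^X": math.exp,
    "X^(1/4)": lambda x: x ** 0.25,
    "X^2": lambda x: x ** 2,
    "X^4": lambda x: x ** 4,
}

def normality_gap(values):
    """Sum of squared gaps between standardized sample quantiles and N(0, 1) quantiles."""
    n = len(values)
    m, s = mean(values), pstdev(values)
    z = sorted((v - m) / s for v in values)
    std_normal = NormalDist()
    return sum((z[i] - std_normal.inv_cdf((i + 0.5) / n)) ** 2 for i in range(n))

def best_normalizing_transform(x):
    gaps = {name: normality_gap([f(v) for v in x]) for name, f in CANDIDATES.items()}
    return min(gaps, key=gaps.get)

# Toy input (assumed): log-normal values, so log(X) should look most normal.
x = [math.exp(NormalDist().inv_cdf((i + 0.5) / 50)) for i in range(50)]
best = best_normalizing_transform(x)
q75 = NormalDist().inv_cdf(0.75)  # the 0.75-quantile of the standard normal
print(best, round(q75, 4))
```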
Maximum Correlation: This is available only for continuous targets. The
transformation that yields the highest linear correlation with the target is chosen.
Equalize Spread with Target Levels: This method requires a class target. The method
first calculates variance of a given transformed variable within each target class. Then
for each transformation it calculates the variance of these variances. It chooses the
transformation that yields the smallest variance of the variances.
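The variance-of-variances criterion can be sketched directly from this description. The helper `spread_score`, the reduced candidate set, and the toy data are assumptions for illustration; no rescaling of the transformed variables is applied, since the source does not mention any.

```python
import math
from collections import defaultdict
from statistics import pvariance

# A subset of the candidate transformations, kept small for the example.
CANDIDATES = {
    "X": lambda x: x,
    "log(X)": math.log,
    "sqrt(X)": math.sqrt,
    "X^2": lambda x: x ** 2,
}

def spread_score(transformed, target):
    """Variance, across target classes, of the within-class variances."""
    by_class = defaultdict(list)
    for value, level in zip(transformed, target):
        by_class[level].append(value)
    class_variances = [pvariance(vals) for vals in by_class.values()]
    return pvariance(class_variances)

# Toy data (assumed): spread grows with the level of X, so log(X) equalizes it.
x = [math.exp(k / 2) for k in range(5)] + [math.exp(3 + k / 2) for k in range(5)]
target = ["response"] * 5 + ["non-response"] * 5
scores = {name: spread_score([f(v) for v in x], target) for name, f in CANDIDATES.items()}
best = min(scores, key=scores.get)
print(best)
```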
Optimal Maximum Equalize Spread with Target Level: This method requires a
class target. It chooses the method that equalizes spread with the target.
Transformations of Class Inputs
For class inputs, two types of transformations are available.
Group Rare Levels transformation:
This transformation combines the rare levels into a separate group, _OTHER_. To define
a rare level, you define a cutoff value.
Dummy Indicators Transformation:
This transformation creates dummy (0/1 indicator) variables from the levels of a class input.
To choose one of these available transformations, select the Transform Variables node
and set the value of the Class Inputs property to the desired transformation.
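Both class-input transformations can be sketched in a few lines. The sketch assumes the cutoff is a proportion of observations and that one indicator is created per level; the function names, the cutoff value, and the level "Retiree" are hypothetical.

```python
from collections import Counter

def group_rare_levels(values, cutoff):
    """Replace levels whose share of observations is below cutoff with _OTHER_."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= cutoff else "_OTHER_" for v in values]

def dummy_indicators(values):
    """Return one 0/1 indicator column per level of a class input."""
    levels = sorted(set(values))
    return {level: [int(v == level) for v in values] for level in levels}

lifestyle = ["Urban Dweller"] * 6 + ["Foreign Traveler"] * 3 + ["Retiree"]
grouped = group_rare_levels(lifestyle, cutoff=0.2)
dummies = dummy_indicators(grouped)
print(sorted(dummies))
```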

Transformation after Variable Selection


If you have a large number of inputs, you can make an initial variable selection, then
transform the selected variables and use them in the Regression node or another modeling tool. This
scenario is shown in Display 1.
Display 1

Transformation before Variable Selection


If you have only a small number of inputs (a hundred or fewer), you can transform the
variables first, and then select the best variables from the transformed and original
variables. This scenario is shown in Display 2.
Display 2

Variable Selection and Transformation of Variables using the Decision Tree
As described before, the Decision Tree node selects variables which produce significant
splits, and passes them to the next node. In addition, the Decision Tree node creates a
categorical variable called _NODE_. For any given record the value of this variable is the
leaf node to which the record is assigned. Display 3 shows the process flow diagram for
using the Decision Tree node for variable selection and transformation.

Display 3

Display 4 shows the property settings of the Decision Tree node for variable selection
and variable transformation.

Display 4: Decision Tree node

To use the Decision Tree node for variable selection and transformation, you
should set the Variable Selection property to YES, the Leaf Variable property to YES,
and the Leaf Role property to Input, as shown in Display 4. For a detailed discussion of the
Decision Tree node see Predictive Modeling with SAS Enterprise Miner by the
author of this paper.

Property Settings of the nodes


In any process flow diagram the first node is the Input Data node, which makes the data
set available for the project. The property panel of the Input Data node is shown in
Display 5.

Display 5: Input Data node

To make the data available for the project, one must first create a data source.
Creation of a data source is illustrated step-by-step in the book Predictive Modeling with
SAS Enterprise Miner. From the property panel shown in Display 5, it can be seen
that the name of the data set is NESUG2007 and it is in the library assigned to T1.

Display 6 shows the Data Partition node.

Display 6: Data Partition node

From the property panel shown in Display 6, it can be seen that 40% of the records are
allocated for training, 30% for validation and 30% for test and the data is split by the
default method. For binary targets the default method is stratified sampling.
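A stratified 40/30/30 split can be sketched as below. This is an illustration, not the Data Partition node's algorithm: the function `stratified_partition`, the seed, the field name `target`, and the toy records are hypothetical. Splitting within each target level keeps the response rate roughly the same in all three partitions.

```python
import random
from collections import defaultdict

def stratified_partition(records, target, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split records into train/validate/test within each level of the target."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[target]].append(rec)
    parts = {"train": [], "validate": [], "test": []}
    for level_records in by_level.values():
        rng.shuffle(level_records)
        n = len(level_records)
        n_train = round(n * fractions[0])
        n_valid = round(n * fractions[1])
        parts["train"] += level_records[:n_train]
        parts["validate"] += level_records[n_train:n_train + n_valid]
        parts["test"] += level_records[n_train + n_valid:]  # the remainder
    return parts

# Toy data (assumed): 100 records with a 10% response rate.
records = [{"id": i, "target": 1 if i < 10 else 0} for i in range(100)]
parts = stratified_partition(records, "target")
sizes = {name: len(rows) for name, rows in parts.items()}
responders = {name: sum(r["target"] for r in rows) for name, rows in parts.items()}
print(sizes, responders)
```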
Display 7 shows the property panel of the Variable Selection node.

Display 7: Variable Selection node

Display 8 shows the property panel of the Transform Variables node.


Display 8: Transform Variables node

The transformations chosen in Display 8 are Maximum Normal for
interval inputs and Dummy Indicators for class inputs. These are the default methods.
However, one can open the Variables window of the Transform Variables node and
specify different transformations for different inputs.
Display 9 shows the transformations available for interval inputs in Enterprise Miner, and
Display 10 shows the transformations available for class inputs.

Display 9: Transformations for Interval Inputs

Display 10: Transformations for Class Inputs

Reference

Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical
Solutions for Business Applications. Cary, NC: SAS Institute Inc., 2007.
