Variable Selection and Transformation Using SAS Enterprise Miner

SAS Enterprise Miner 5.2

Kattamuri S. Sarma, Ph.D.

Ecostat Research Corp., White Plains NY

Introduction

In predictive modeling and data mining, one is often confronted with a large number of inputs (explanatory variables). The number of potential inputs to choose from may be 2,000 or more, and some of these inputs may have no relation to the target. An initial screening is therefore necessary to eliminate irrelevant variables and keep the number of inputs to a manageable size. The Variable Selection node of SAS Enterprise Miner provides alternative methods for eliminating irrelevant variables and selecting variables that have predictive power. In the process of variable selection, the Variable Selection node creates binned variables from interval-scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is more strongly correlated with the target variable than the original input, indicating a non-linear relationship between the input and the target. The grouped variables are created by collapsing, or grouping, the categories of a nominal input. With fewer categories, the grouped variables are easier to use in modeling than the original ungrouped variables.

The predictive power of the inputs can sometimes be enhanced by making suitable transformations. One can use the Transform Variables node to select the best mathematical transformation for any given input, based on criteria such as maximizing normality or maximizing correlation with the target. The Transform Variables node can also be used for optimally binning the interval inputs and for creating dummy variables from categorical inputs.

Variable selection and transformation are also done by the Decision Tree node. The inputs that give significant splits in creating a decision tree are selected by the Decision Tree node and passed to the next node, which may be a Regression or Neural Network node. In addition to variable selection, the Decision Tree node creates a special categorical variable that indicates the leaf node to which a given record is assigned.

This paper discusses the details of the variable selection methods, the transformations, and the options available in these three nodes.

Two methods of variable selection are available in the Variable Selection node: the R-Square method and the Chi-Square method.

R-Square Method

The R-Square method can be used with a binary as well as an interval-scaled target.

NESUG 2007

In the R-Square method, variable selection is performed in two steps. In the first step, the R-Square between each input and the target is calculated, and all variables with a correlation above a specified threshold are selected. The variables selected in the first step then enter the second step of variable selection.

Step 1:

The selection threshold is a property of the Variable Selection node, which the user can specify (see Display 7).

For each interval-scaled input, the Variable Selection node calculates two measures of correlation with the target. One is the R-Square between the target and the original input. The other is the R-Square between the target and a binned version of the input. The binned variable is a categorical variable created by the Variable Selection node from each continuous (interval-scaled) input; the levels of this categorical variable are the bins. In Enterprise Miner, this binned variable is referred to as an AOV16 variable. The number of levels, or categories, of the AOV16 variable is at most 16, corresponding to 16 intervals of equal width.
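The AOV16 idea can be sketched outside Enterprise Miner. The following Python sketch is illustrative only: the function names and the one-way-ANOVA R-Square shortcut are assumptions, not the node's actual implementation.

```python
def aov16_bins(values, n_bins=16):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # the maximum value falls on the top edge; clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def anova_r_square(y, groups):
    """R-Square of y explained by group membership:
    between-group sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    sst = sum((v - mean_y) ** 2 for v in y)
    by_group = {}
    for g, v in zip(groups, y):
        by_group.setdefault(g, []).append(v)
    ssb = sum(len(vs) * (sum(vs) / len(vs) - mean_y) ** 2
              for vs in by_group.values())
    return ssb / sst

# A U-shaped (quadratic) relation: the binned input captures it,
# while a straight line through the raw input would not.
x = [i / 10 for i in range(100)]
y = [(v - 5) ** 2 for v in x]
bins = aov16_bins(x)
print(round(anova_r_square(y, bins), 3))
```

With a quadratic target, the binned input explains almost all of the variance even though the raw linear correlation is near zero, which is exactly the non-linear pattern the AOV16 variable is meant to expose.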

In the case of nominal-scaled categorical inputs with a continuous target, R-Square is calculated using one-way ANOVA. Here you have the option of using either the original or the grouped variables. Grouped variables are new variables created by collapsing the levels of the categorical variables. For example, suppose there is a categorical (nominal) variable called LIFESTYLE, which indicates the lifestyle of the customer and takes on values such as Foreign Traveler, Urban Dweller, and so on. If the variable LIFESTYLE has 100 levels or categories, it can be collapsed to fewer by setting the Group Variables property to Yes, as shown in Display 7.
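The grouping of a nominal input can be illustrated with a small sketch. The merging rule below (levels sharing a rounded mean target value are merged) is a simplification chosen for illustration; it is not the grouping algorithm Enterprise Miner documents.

```python
from collections import defaultdict

def group_levels(levels, target, precision=1):
    """Map each original level to a group id based on its mean target value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lvl, t in zip(levels, target):
        sums[lvl] += t
        counts[lvl] += 1
    return {lvl: round(sums[lvl] / counts[lvl], precision) for lvl in sums}

lifestyle = ["Foreign Traveler", "Urban Dweller", "Suburban", "Rural"] * 25
response  = [0.90, 0.10, 0.12, 0.88] * 25   # Foreign Traveler and Rural behave alike
mapping = group_levels(lifestyle, response)
grouped = [mapping[lvl] for lvl in lifestyle]
print(len(set(lifestyle)), "levels collapsed to", len(set(grouped)), "groups")
```

Levels whose mean responses are close end up in the same group, so a variable with many levels collapses to a handful of groups that are easier to use in modeling.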

Step 2:

In the second step, a sequential forward selection process is used. The process starts by selecting the input variable that has the highest correlation with the target; a regression equation (model) is estimated with that input. At each successive step, the input variable that provides the largest incremental contribution to the model R-Square is added to the regression. When the incremental contribution falls below a lower bound, the selection process stops. This lower bound can be specified by setting the Stop R-Square property (see Display 7) to the desired value.
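A minimal sketch of this sequential forward selection, assuming ordinary least squares for the regression and a Stop R-Square threshold of 0.005 (both illustrative choices, not the node's internals):

```python
import numpy as np

def model_r2(X, y):
    """R-Square of an ordinary least squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    return 1 - (resid @ resid) / (centered @ centered)

def forward_select(inputs, y, stop_r2=0.005):
    """Greedy forward selection; stops when the R-Square gain < stop_r2."""
    selected, current_r2 = [], 0.0
    remaining = list(inputs)
    while remaining:
        gains = {name: model_r2(np.column_stack([inputs[n] for n in selected + [name]]), y)
                 for name in remaining}
        best = max(gains, key=gains.get)
        if gains[best] - current_r2 < stop_r2:   # lower bound reached: stop
            break
        selected.append(best)
        current_r2 = gains[best]
        remaining.remove(best)
    return selected, current_r2

rng = np.random.default_rng(0)
n = 500
x1, x2, noise = rng.normal(size=(3, n))
inputs = {"x1": x1, "x2": x2, "noise": noise}
y = 3 * x1 + x2 + rng.normal(scale=0.5, size=n)
selected, final_r2 = forward_select(inputs, y)
print(selected, round(final_r2, 3))
```

The strongest input enters first, weaker but genuine inputs follow, and selection halts once no remaining input adds more than the Stop R-Square amount.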

Chi-Square Method

This criterion can be used when the target is binary. When it is selected, the selection process does not have two distinct steps as in the R-Square method. Instead, a tree is constructed, and the inputs selected in the construction of the tree are passed to the next node with the assigned role of Input.


The Decision Tree node of Enterprise Miner can also be used for variable selection and transformation. The inputs that create significant splits in the development of the tree are passed to the next node with the role of Input. These are the variables selected by the Decision Tree node, and they can be used as inputs in the Regression node or in the Neural Network node. In addition to selecting variables, the Decision Tree node creates a special categorical variable called _NODE_ and optionally passes it to the next node as an input. The variable _NODE_ can be used as a class input in the Regression node.
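Outside Enterprise Miner, the same idea can be imitated with scikit-learn, used here only as a stand-in for the Decision Tree node: the features that appear in splits play the role of the selected inputs, and the leaf index returned by apply() plays the role of _NODE_.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # only inputs 0 and 2 matter

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Features that produced splits: the analogue of the selected inputs
used = sorted({int(f) for f in tree.tree_.feature if f >= 0})
# Leaf assignment for each record: the analogue of the _NODE_ variable
node = tree.apply(X)
print("inputs used in splits:", used)
print("distinct leaves:", len(set(node)))
```

The leaf variable can then be fed to a downstream regression as a categorical input, just as _NODE_ is used as a class input in the Regression node.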

Transformations for Interval Inputs

Simple Transformations

The available simple transformations are Log, Square Root, Inverse, Square, Exponential, and Standardize. They can be applied to any interval-scaled input, irrespective of whether the target is categorical or continuous.

Binning Transformations

In Enterprise Miner, there are three ways of binning an interval-scaled variable. To use these as default transformations, select the Transform Variables node and set the value of the Interval Inputs property to Bucket, Quantile, or Optimal in the Default Methods section.

Bucket:

The Bucket option creates buckets by dividing the range of the input into n intervals of equal width and grouping the observations into the resulting n buckets. The number of observations may differ from bucket to bucket. For example, if AGE is divided into the four intervals 0-25, 25-50, 50-75, and 75-100, then bin 1 (0-25) may contain 100 observations, bin 2 (25-50) may contain 2,000, bin 3 (50-75) may contain 1,000, and bin 4 (75-100) may contain 200.

Quantile:

This option groups the observations into quantiles (bins) with an equal number of observations in each. If there are 20 quantiles, then each quantile contains 5% of the observations.

Optimal:

This transformation is available for binary targets only. The input is split into a number of bins, and the splits are placed so as to make the distribution of the target levels (for example, response and non-response) in each bin significantly different from the distribution in the other bins.
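The difference between the Bucket and Quantile options can be seen in a short sketch; the bin counts and the AGE-like sample are made up for illustration.

```python
from collections import Counter

def bucket(values, n=4):
    """Equal-width bins: equal intervals, possibly unequal counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [min(int((v - lo) / width), n - 1) for v in values]

def quantile(values, n=4):
    """Equal-count bins: each bin gets the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n - 1)
    return bins

# A skewed AGE-like sample: most observations sit in the middle of the range
age = [20] * 5 + [30] * 40 + [40] * 40 + [80] * 15
print("bucket counts:  ", sorted(Counter(bucket(age)).items()))
print("quantile counts:", sorted(Counter(quantile(age)).items()))
```

Bucket produces very uneven bin counts on skewed data (and may leave some buckets empty), while Quantile puts exactly the same number of observations in every bin.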

Best Power Transformations

The Transform Variables node selects the best power transformation from among x, log(x), sqrt(x), e^x, x^(1/4), x^2, and x^4, where x is the input. Four criteria are available for choosing the best transformation:

Maximum Normal: To find the transformation that maximizes normality, sample quantiles from each of the transformations listed above are compared with the theoretical quantiles of a normal distribution. The transformation that yields quantiles closest to those of the normal distribution is chosen.

Suppose Y is obtained by applying one of the above transformations to X. The 0.75 sample quantile of the transformed variable Y, for example, is the value of Y at or below which 75% of the observations in the data set fall. The 0.75 quantile of the standard normal distribution is 0.6745, given by P(Z <= 0.6745) = 0.75, where Z is a normal random variable with mean 0 and standard deviation 1. The 0.75 sample quantile of Y is compared with 0.6745, and the other sample quantiles are similarly compared with the corresponding quantiles of the standard normal distribution.
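A hedged sketch of the Maximum Normal idea, with an abbreviated candidate list and a simple quantile-distance score standing in for whatever measure the node actually uses:

```python
import math
from statistics import NormalDist, mean, stdev

# candidate power transformations (abbreviated list; input must be positive)
CANDIDATES = {
    "x":       lambda v: v,
    "log(x)":  math.log,
    "sqrt(x)": math.sqrt,
    "x**2":    lambda v: v ** 2,
}

def normality_distance(values, probs=(0.25, 0.5, 0.75)):
    """Scale-free distance between sample quantiles and the quantiles of a
    normal distribution fitted to the same values (smaller = more normal)."""
    s = sorted(values)
    nd = NormalDist(mean(values), stdev(values))
    return sum(abs(s[int(p * (len(s) - 1))] - nd.inv_cdf(p)) / nd.stdev
               for p in probs)

def best_transform(values):
    scores = {name: normality_distance([f(v) for v in values])
              for name, f in CANDIDATES.items()}
    return min(scores, key=scores.get)

# A right-skewed (lognormal-like) input: log(x) should look most normal
x = [math.exp(NormalDist().inv_cdf((i + 0.5) / 200)) for i in range(200)]
print(best_transform(x))
```

For the skewed sample above, the log transformation brings the sample quantiles closest to those of a normal distribution, so it is the one selected.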

Maximum Correlation: This criterion is available only for continuous targets. The transformation that yields the highest linear correlation with the target is chosen.

Equalize Spread with Target Levels: This method requires a class target. It first calculates the variance of a given transformed variable within each target class; then, for each transformation, it calculates the variance of these within-class variances. The transformation that yields the smallest variance of the variances is chosen.

Optimal Maximum Equalize Spread with Target Level: This method also requires a class target. It chooses the transformation that best equalizes the spread across the target levels.
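The Equalize Spread calculation is easy to state in code. In this sketch the candidate list and data are illustrative; the criterion itself (variance of the within-class variances) follows the description above.

```python
import math
from statistics import variance

def spread_score(values, classes, transform):
    """Variance of the within-class variances of the transformed input."""
    by_class = {}
    for v, c in zip(values, classes):
        by_class.setdefault(c, []).append(transform(v))
    within = [variance(vs) for vs in by_class.values()]
    return variance(within)

x   = [1, 2, 3, 4, 10, 20, 30, 40]        # class 1 is ten times more spread out
cls = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {"x": lambda v: v, "log(x)": math.log}
best = min(candidates, key=lambda name: spread_score(x, cls, candidates[name]))
print(best)
```

Because the second class is a constant multiple of the first, taking logs makes the two within-class variances essentially identical, so the log transformation wins under this criterion.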

Transformations of Class Inputs

For class inputs, two types of transformations are available.

Group Rare Levels Transformation:

This transformation combines the rare levels into a separate group, _OTHER_. To define which levels are rare, you specify a cutoff value.

Dummy Indicators Transformation:

This transformation creates a 0/1 dummy (indicator) variable for each level of the class input. To choose one of these transformations, select the Transform Variables node and set the value of the Class Inputs property to the desired transformation.
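Both class-input transformations can be sketched in a few lines; the 20% cutoff and the column ordering are illustrative choices, while _OTHER_ matches the group name used by the node.

```python
from collections import Counter

def group_rare(levels, cutoff=0.20):
    """Fold levels whose share of records is below `cutoff` into _OTHER_."""
    share = Counter(levels)
    n = len(levels)
    return [lvl if share[lvl] / n >= cutoff else "_OTHER_" for lvl in levels]

def dummy_indicators(levels):
    """One 0/1 indicator column per distinct level."""
    names = sorted(set(levels))
    return names, [[int(lvl == name) for name in names] for lvl in levels]

lifestyle = ["Urban"] * 6 + ["Suburban"] * 3 + ["Nomad"]
grouped = group_rare(lifestyle)              # Nomad (10% < 20%) becomes _OTHER_
names, dummies = dummy_indicators(grouped)
print(names)
```

Grouping rare levels first keeps the number of dummy columns small, which is why the two transformations are often applied in that order.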


If you have a large number of inputs, you can make an initial variable selection, then transform the selected variables and use them in the Regression node or another modeling tool. This scenario is shown in Display 1.

Display 1

If you have only a small number of inputs (a hundred or fewer), you can transform the variables first and then select the best variables from among the transformed and original variables. This scenario is shown in Display 2.

Display 2

Decision Tree

As described before, the Decision Tree node selects variables that produce significant splits and passes them to the next node. In addition, it creates a categorical variable called _NODE_; for any given record, the value of this variable is the leaf node to which the record is assigned. Display 3 shows the process flow diagram for using the Decision Tree node for variable selection and transformation.


Display 3

Display 4 shows the property settings of the Decision Tree node for variable selection and variable transformation.


To use the Decision Tree node for variable selection and transformation, set the Variable Selection property to Yes, the Leaf Variable property to Yes, and the Leaf Role property to Input, as shown in Display 4. For a detailed discussion of the Decision Tree node, see Predictive Modeling with SAS Enterprise Miner by the author of this paper.


In any process flow diagram, the first node is the Input Data node, which makes the data set available to the project. The property panel of the Input Data node is shown in Display 5. Before the data can be used in the project, a data source must first be created; the creation of a data source is illustrated step by step in the book Predictive Modeling with SAS Enterprise Miner. From the property panel shown in Display 5, it can be seen that the name of the data set is NESUG2007 and that it is in the library assigned to T1.


From the property panel shown in Display 6, it can be seen that 40% of the records are allocated for training, 30% for validation, and 30% for test, and that the data are split by the default method. For binary targets, the default method is stratified sampling.

Display 7 shows the Properties panel of the Variable Selection node.


The transformations chosen in Display 8 are Maximum Normal for interval inputs and Dummy Indicators for class inputs; these are the default methods. However, one can open the Variables window of the Transform Variables node and specify different transformations for different inputs.

Display 9 shows the transformations available for interval inputs in Enterprise Miner, and

Display 10 shows the transformations available for class inputs.


Reference

Sarma, Kattamuri S. (2007), Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications, Cary, NC: SAS Institute Inc.
