
Filtering Unwanted Cases

The SAS Enterprise Miner Filter tool enables you to remove unwanted records from an analysis. Use
these steps to build a diagram that will read a data source and filter records. Create a new diagram called
Segmentation Analysis.
1. Drag the CENSUS2000 data source to the Segmentation Analysis workspace window.

2. Select the Sample tab to access the Sample tool group.

If you explore the data in CENSUS2000, you will notice that there are a number of records that
have a value of 0 for Median Household Income and Average Household Size. The Explore
window below shows the histogram windows for each variable, with the data that has a Median
Household Income of 0 highlighted. Clicking on any of the bars in one of the histograms highlights
the same set of records in all the other histograms. The zero average household size seems to be
evenly distributed across the longitude (LOCX), latitude (LOCY), and density percentile
(RegDens) variables. It also seems concentrated on low incomes and populations.

A portion of the CENSUS2000 window is shown below. Records 28 and 33 (among others) have
a value of 0 for Average Household Size. These records also have unusual values in the
remaining non-geographic fields. For example, the Median Household Income is listed as $0,
the Region Density Percentile is missing, and the Region Population is 0.

If you sort the records in the CENSUS2000 window by ascending order of the Average
Household Size, you can examine these records together. (You click the Average Household
Size column header once to sort by descending order, and click it again to sort by ascending
order.) With these records grouped together, it is easy to see that most of the cases with
an Average Household Size of 0 have a value of 0 or missing on the remaining non-geographic
attributes. There are some exceptions, but you might decide that cases such as this are not of
interest for analyzing household demographics.

3. Drag the Filter tool (fourth from the left) from the tools palette into the Segmentation Analysis
workspace window.

You can use filters to exclude certain observations, such as extreme outliers and errant data
that you do not want to include in your mining analysis. Filtering extreme values from the
training data tends to produce better models because the parameter estimates are more stable.
You might also want to filter your data to focus on a particular subset of the original data
source.
4. Connect the CENSUS2000 data to the Filter node.
You have just created a process flow. The process flow, at this point, reads the raw CENSUS2000
data and filters unwanted observations. However, you must specify which observations are unwanted.
To do this, you must change the settings of the Filter node.

5. Select the Filter node and examine the Properties panel.

The Properties panel displays the analysis methods used by the node when run. By default, the node
will filter cases in rare levels in any class input variable and cases exceeding three standard deviations
from the mean on any interval input variable.

You can control the number of standard deviations by using the Advanced Properties sheet,
which is discussed later.
Because the CENSUS2000 data source only contains interval inputs, only the Interval Variables
criterion is considered.
6. Change the Default Filtering Method property (under the Interval Variables grouping) to
User-Specified Limits.

7. Select the Interval Variables ellipsis (...). SAS Enterprise Miner then informs you that it is updating
the path. When the update is complete, the Interactive Interval Filter window opens.

You are warned at the top of the window that the train or raw data set does not exist. This indicates
that you are restricted from the interactive filtering elements of the node (which are available after a
node has been run). Nevertheless, you can enter filtering information.

8. Type 0.1 in the Filter Lower Limit field for the input variable MeanHHSz.

9. Select OK to close the Interactive Interval Filter window. You are returned to the SAS Enterprise
Miner interface window.
When the diagram is run, all cases with an average household size less than 0.1 are filtered from
subsequent analysis steps.
10. Right-click the Filter node and click Run from the shortcut menu.

11. Select Results in the window. The Filter node's Results window opens.

The Output window indicates that 1081 observations were excluded from the TRAIN
(CENSUS2000) data.
12. Close the Results window. The CENSUS2000 data is ready for pattern discovery analyses.
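The two filtering rules described in this section can be sketched outside SAS Enterprise Miner. The Python below is only an illustration, not the Filter node's implementation: the function names and the sample records are made up, and the field is assumed to have no missing values.

```python
from statistics import mean, stdev

def filter_lower_limit(rows, field, lower):
    """Keep rows whose `field` value is at least `lower` (user-specified limit)."""
    kept = [r for r in rows if r[field] >= lower]
    return kept, len(rows) - len(kept)

def filter_std_dev(rows, field, n_std=3.0):
    """Keep rows within n_std sample standard deviations of the mean of `field`."""
    values = [r[field] for r in rows]
    mu, sigma = mean(values), stdev(values)
    kept = [r for r in rows if abs(r[field] - mu) <= n_std * sigma]
    return kept, len(rows) - len(kept)

# Made-up records for illustration only; not taken from CENSUS2000.
rows = [
    {"MeanHHSz": 2.4}, {"MeanHHSz": 0.0}, {"MeanHHSz": 3.1},
    {"MeanHHSz": 0.0}, {"MeanHHSz": 2.7},
]
kept, excluded = filter_lower_limit(rows, "MeanHHSz", 0.1)
print(f"kept {len(kept)} rows, excluded {excluded}")
```

The lower-limit rule mirrors step 8 (a Filter Lower Limit of 0.1 on MeanHHSz); the standard-deviation rule mirrors the default behavior for interval inputs.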

Rejecting Data Source Variables


Before you run a cluster analysis, you must select and evaluate the inputs that you want to use. In general,
you should seek inputs that have the following attributes:

meaningful to the analysis objective

relatively independent

limited in number

a measurement level of Interval

compatible measurement scales (or be standardized if not)

low skewness (at least in the training data)

Using the Variables window, you can explore the inputs you have selected for the analysis. Histograms for
each input variable enable you to see problems with your data that might need to be fixed, such as skewed
data or data that needs standardization, before you run the cluster analysis.

Not all the inputs in the CENSUS2000 data set will be used for the segmentation analysis. The variables
LocX, LocY, and (arguably) RegPop describe characteristics of the geographic regions and not the
people who live there. Because this analysis is an analysis of demographic and not geographic
characteristics, you should reject these variables. Using the Input Data node, change the role for the three
variables to rejected.
Only the variables with a role of Input will be used in subsequent steps of the segmentation
analysis. Explore these variables using the Input Data node. The default setting of Sample
Method is Top, but you can specify a different method in the Sample Properties window. You
can also use the Preferences window to change the default setting of Sample
Method to Random so that you do not need to change the method in the Sample Properties
window each time that you use the Explore window.
By default, the Explore window selects a sample of 10,000 observations. You can change
the Fetch Size to Max to increase the sample size to 60,000 observations. If your data source
contains fewer than 60,000 observations and the Fetch Size is set to Max, the Explore window
uses the full set of observations for exploration.
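The difference between the Top and Random sample methods, and the effect of the fetch size, can be illustrated with a short Python sketch. The function name, the seed, and the stand-in data are assumptions made for this example; this is not how the Explore window is implemented.

```python
import random

def explore_sample(rows, fetch_size=10_000, method="top", seed=12345):
    """Return up to fetch_size rows: the first rows ("top") or a random draw."""
    n = min(fetch_size, len(rows))
    if method == "top":
        return rows[:n]
    if method == "random":
        # A fixed seed makes the random sample repeatable.
        return random.Random(seed).sample(rows, n)
    raise ValueError(f"unknown method: {method}")

rows = list(range(25_000))            # stand-in for 25,000 observations
top = explore_sample(rows)            # first 10,000 rows
rnd = explore_sample(rows, method="random")
full = explore_sample(rows, fetch_size=60_000)  # fewer rows than the limit
print(len(top), len(rnd), len(full))
```

A Top sample can be misleading when the data source is sorted, which is one reason to prefer a Random sample for exploration.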

The histograms reveal two issues that must be resolved before attempting a meaningful segmentation
analysis.
The distribution of MedHHInc is highly skewed. Unless this problem is resolved, cases in the tail of
the distribution will be isolated into orphan segments. It is common practice to transform highly
skewed inputs to regularize the shape of their distribution.
The input ranges for the inputs MeanHHSz, MedHHInc, and RegDens differ by several orders
of magnitude. Unless this problem is resolved, the input with the largest range will dominate in the
k-means algorithm used by the Cluster tool. In a later demonstration, you use an option in the Cluster
tool to standardize the range of the inputs.
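Skewness can be quantified as well as judged from a histogram. The sketch below uses one common moment-based definition of sample skewness; other software may report a bias-adjusted variant, so the values need not match another tool's output exactly. The input values are made up.

```python
from statistics import mean

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])  # second central moment
    m3 = mean([(v - mu) ** 3 for v in values])  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]   # one large value in the tail
print(skewness(symmetric), skewness(right_skewed))
```

A value near zero indicates a roughly symmetric distribution; a large positive value indicates a long right tail, as described for MedHHInc.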

Transforming Analysis Inputs

The k-means clustering algorithm is sensitive to distributions with outlying cases. As was noted above, a
handful of cases have large values for MedHHInc. To avoid creating several segments with a small
number of cases, you should consider transforming this input to have a less extreme distribution. The
Transform Variables tool enables you to perform such data regularizations.
1. Close the Explore window.
2. Select the Modify tab.
3. Drag a Transform Variables tool into the diagram workspace.
4. Connect the Filter node to the Transform Variables node.

5. Select Formulas from the Properties panel for the Transform Variables node.
6. Select OK to update the path. The Formulas window opens.

The Formulas window lets you interactively create customized transformations of analysis variables.
The top half of the window displays plots of new and existing variables. The lower half displays the
names of new variables, existing variables, or other information, depending on the selected tab.
Operations are controlled by five icons at the lower left of the window.
Use the Formulas window to create an appropriate transformation of the MedHHInc variable.

7. Select the Create icon. The Add Transformation dialog box opens.

The top half of the Add Transformation dialog box shows metadata information about the new
variable; the bottom half shows the formula for the transformation.
8. Type LogMedHHInc for the Name property.

You can either type the formula directly in the lower half of the Add Transformation dialog box or use
the Expression Builder.
9. Select Build. The Expression Builder opens.

The Expression Builder lets you interactively pick transformations and select from existing variables.
Used correctly, this can limit mistakes when entering expressions.
10. Select the Mathematical category folder. A list of mathematical operators is shown in the Functions
pane.

11. Select LOG(argument) and select the Insert button. The LOG function is placed in the Expression
Text area with the required numeric argument highlighted.

12. Select the Variables List tab. The lower half of the Expression Builder now shows a list of variables
in the analysis data.

13. Select MedHHInc ⇒ Insert.



MedHHInc is inserted as the argument of the LOG function.

14. Select OK to close the Expression Builder. The defined expression appears in the Add Transformation
dialog box.

15. Select OK to close the Add Transformation dialog box. The newly created LogMedHHInc variable
is listed in the bottom half of the Formulas window.

Recall that the original MedHHInc variable was skewed right. What does the distribution of the new
LogMedHHInc input look like?

16. Select Preview.



The distribution of LogMedHHInc is shown in the lower half of the Formulas window. (The number
of histogram bars has been increased to 30 using the Graph Properties window.)

An interesting problem has occurred. The new variable has outlying values and is now slightly left
skewed. Because small outlying values are just as harmful to segmentations as large outlying values,
you might question the value of the log transformation. A simple adjustment to the transformation
will correct this problem.

17. Select the Edit Expression icon. The Expression Builder opens.

18. Edit the Expression Text area to have the following formula:
LOG(MAX(MedHHInc,10000))

This adjustment truncates the lower tail of the distribution: the newly created variable equals the
logarithm of the larger of MedHHInc and 10,000.

19. Select OK to close the Expression Builder. You return to the Formulas window.

Note that the plot has not been updated (except for mysteriously returning to the default number of
bins).
20. Select Refresh Plot. Increase the number of histogram bars to 30 using Graph Properties.

The distribution for this variable is now nicely compact and nearly symmetric.
21. Select OK to close the Formulas window.
The variable LogMedHHInc is now part of the analysis data, ready for use in a segmentation analysis.
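For readers who want to check the transformation outside the Formulas window, the following Python sketch is an analogue of the expression LOG(MAX(MedHHInc,10000)). The SAS LOG function is the natural logarithm, as is math.log. The function name and the income values are made up for illustration.

```python
import math

def log_med_hh_inc(med_hh_inc, floor=10_000):
    """Analogue of LOG(MAX(MedHHInc, 10000)): floor the input, then take ln."""
    return math.log(max(med_hh_inc, floor))

# Made-up incomes; every value below the floor maps to log(10000).
incomes = [0, 2_500, 10_000, 35_000, 200_000]
transformed = [log_med_hh_inc(x) for x in incomes]
for x, t in zip(incomes, transformed):
    print(f"{x:>8} -> {t:.4f}")
```

Besides pulling in small outlying values, the floor keeps the logarithm defined for an income of 0, and the result is still a non-decreasing function of the original income.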

Setting Cluster Tool Options

The Cluster tool performs k-means cluster analyses, a widely used method for cluster and segmentation
analysis. This demonstration shows you how to use the tool to segment the cases in the CENSUS2000
data set.
1. Select the Explore tab.
2. Locate and drag a Cluster tool into the diagram workspace.
3. Connect the Transform Variables node to the Cluster node.

To create meaningful segments, you will need to set the Cluster node to do the following:
ignore the MedHHInc input. (It has been replaced by the newly created LogMedHHInc input.)
standardize the inputs to have a similar range.
A node's Variables property determines which variables are used in an analysis.
1. Select the Variables property for the Cluster node. The Variables window opens. Click Update
Path to enable the LogMedHHInc variable to be displayed. MedHHInc is automatically suppressed.

The Cluster node will create segments using the inputs LogMedHHInc, MeanHHSz, and RegDens.

2. Select OK to close the Variables window.


Segments are created based on the (Euclidean) distance between each case in the space of selected inputs.
If you want to use all the inputs to create clusters, these inputs should have similar measurement scales.
Calculating distances using standardized distance measurements (subtracting the mean and dividing by
the standard deviation of the input values) is one way to ensure this. You can standardize the input
measurements using the Transform Variables node. However, it is easier to use the built-in property in the
Cluster node.

Where is the built-in standardization property? It turns out that only a handful of node properties are
shown by default in SAS Enterprise Miner. To see the full extent of options for an analysis node, you
must view the node's advanced property sheet.

1. Select View ⇒ Property Sheet ⇒ Advanced. The full range of node options is now available for
changing.

2. Select Internal Standardization ⇒ Standardization. Distances between points are calculated based
on standardized measurements.

Another way to standardize an input is by subtracting the input's minimum value and dividing by
the input's range. This is called range standardization. Range standardization rescales the
distribution of each input to the unit interval, [0,1].
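Both kinds of standardization are easy to state in code. The Python sketch below is illustrative only: the function names and the three made-up cases are assumptions, and the sample (n - 1) standard deviation is used, which might not be the exact divisor the Cluster node applies.

```python
import math
from statistics import mean, stdev

def z_standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def range_standardize(values):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values for three cases; not taken from CENSUS2000.
mean_hh_sz = [2.0, 2.5, 4.0]
med_hh_inc = [30_000.0, 90_000.0, 31_000.0]

raw = list(zip(mean_hh_sz, med_hh_inc))
std = list(zip(z_standardize(mean_hh_sz), z_standardize(med_hh_inc)))

# On the raw scale, income differences swamp household-size differences,
# so case 0 looks far closer to case 2 than to case 1.
print(math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2]))
# After standardization, both inputs contribute on a comparable scale.
print(math.dist(std[0], std[1]), math.dist(std[0], std[2]))
```

In the raw distances the household-size difference between cases 0 and 2 is essentially invisible; after standardization it counts about as much as the income difference between cases 0 and 1.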

Creating Clusters with the Cluster Tool


After you have selected your inputs and manipulated the data to prepare it for the analysis, you can decide
whether you want SAS Enterprise Miner to determine the number of clusters to create or whether you
want to specify a number of clusters for the analysis. It is often useful to have SAS Enterprise Miner
determine the number of clusters automatically the first time you run the analysis. By default, the Cluster
tool attempts to automatically determine the number of clusters in the data. A three-step process is used.
Step 1 A large number of cluster seeds are chosen (50 by default) and placed in the input space. The
Euclidean distance from each case in the training data to each cluster seed (center) is
calculated. Cases are assigned to the closest cluster center. Because the distance metric is
Euclidean, it is important for the inputs to have compatible measurement scales. Unexpected
results can occur if one input's measurement scale differs greatly from the others. In this
manner, cases in the training data are assigned to the closest seed, and an initial clustering of
the data is completed. The means of the input variables in each of these preliminary clusters
are substituted for the original training data cases representing the seeds in the second step of
the process. Cases are reassigned to the closest cluster center. Cluster centers are updated and
cases are reassigned until the process converges. On convergence, final cluster assignments
are made. Each case is assigned to a unique segment. The segment definitions can be stored
and applied to new cases outside of the training data.
Step 2 A hierarchical clustering algorithm (Ward's method) is used to sequentially consolidate the
clusters that were formed in the first step. At each step of the consolidation, a statistic called
the cubic clustering criterion (CCC) is calculated. The first consolidation in which the CCC
exceeds 3 provides the third step with the number of clusters to use. If no consolidation yields
a CCC in excess of 3, the maximum number of clusters is selected.
More details on the CCC can be found in the following 59-page technical report:
https://support.sas.com/documentation/onlinedoc/v82/techreport_a108.pdf
Step 3 The number of clusters determined by the second step provides the value for k in a k-means
clustering of the original training data cases.
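The assign-and-update loop that Steps 1 and 3 rely on can be sketched in a few lines of plain Python. This is a generic k-means iteration under simple assumptions (seeds supplied by the caller, an empty cluster keeps its old center), not the Cluster node's implementation; the Ward consolidation and the CCC calculation of Step 2 are not shown. The points are made up.

```python
import math

def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and center updates until assignments stop changing."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [closest(p, centers) for p in points]
        if new_labels == labels:      # converged: no case changed cluster
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Made-up two-dimensional cases forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, seeds=[(0, 0), (10, 10)])
print(labels)
print(centers)
```

The returned centers are what would be stored to score new cases: a new case is assigned to the segment whose center is closest.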
Choosing meaningful inputs is clearly important for interpretation and explanation of the generated
clusters. Independence and limited input count make the resulting clusters more stable. An interval
measurement level is recommended for k-means to produce nontrivial clusters. Low skewness and
kurtosis on the inputs avoid creating single-case outlier clusters.
Enterprise Miner has three methods for calculating cluster distances:

Average: the distance between two clusters is the average distance between pairs of observations,
one in each cluster. This method:
o Tends to join clusters with small variances.

o Is slightly biased to finding clusters with equal variance.

o Avoids the extremes of either large clusters or tight compact clusters.



Centroid: the distance between two clusters is the Euclidean distance between their centroids or
means. This method is more robust to outliers than most of the other hierarchical methods, but
does not generally perform as well as Ward's method or the Average method.

Ward:

This method does not use cluster distances to combine clusters. Instead, it joins the clusters such
that the variation inside each cluster will not increase drastically. This method:
o Tends to join clusters with few observations.

o Minimizes the variance within each cluster. Therefore, it tends to produce homogeneous
clusters and a symmetric hierarchy.
o Is biased toward finding clusters of equal size (similar to k-means) and approximately
spherical shape. It can be considered as the hierarchical analogue of k-means.
o Is poor at recovering elongated clusters.
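The three criteria can be written down compactly. The Python sketch below uses the standard textbook definitions (average pairwise distance, distance between centroids, and the Ward merge cost, which is the increase in the total within-cluster sum of squares caused by a merge). The function names and the two small clusters are made up; this is not taken from the Cluster node's code.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def average_distance(a, b):
    """Average Euclidean distance over all pairs with one point in each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Euclidean distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * centroid_distance(a, b) ** 2

# Two made-up clusters of two points each.
a = [(0, 0), (0, 2)]
b = [(4, 0), (4, 2)]
print(average_distance(a, b), centroid_distance(a, b), ward_cost(a, b))
```

The size factor na * nb / (na + nb) in the Ward cost is small when either cluster has few observations, which is why the method tends to join small clusters first.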

Five seed initialization methods are available in the Cluster node:

First: Select the first k complete cases as the initial seeds. Also developed by MacQueen (see
below).

MacQueen: Chooses the centers randomly from the data points. The rationale behind this method
is that random selection is likely to pick points from dense regions, i.e., points that are good
candidates to be centers. However, there is no mechanism to avoid choosing outliers or points that
are too close to each other. This is the default method.

Full Replacement: Select initial seeds that are very well separated using a full replacement
algorithm. Use this if you see clusters that are too clumped together.

Princomp: Select evenly spaced seeds along the first principal component.

Partial Replacement: Select initial seeds that are well separated using a partial replacement
algorithm.
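The two simplest methods in the list, First and MacQueen as described here, can be sketched as follows. This is a rough Python illustration with made-up function names and cases; missing values are represented by None, and the replacement-based and principal-component methods are not shown. The Cluster node's own implementations may differ in detail.

```python
import random

def is_complete(case):
    """A case is complete when none of its input values is missing (None)."""
    return all(v is not None for v in case)

def first_seeds(cases, k):
    """First: take the first k complete cases as the initial seeds."""
    return [c for c in cases if is_complete(c)][:k]

def random_seeds(cases, k, seed=12345):
    """MacQueen, as described above: choose k complete cases at random."""
    complete = [c for c in cases if is_complete(c)]
    return random.Random(seed).sample(complete, k)

# Made-up cases; the first one has a missing value and is skipped.
cases = [(1.0, None), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0), (8.0, 9.0)]
print(first_seeds(cases, 2))
print(random_seeds(cases, 2))
```

Neither function checks how far apart the chosen seeds are, which is the weakness that the replacement methods address.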

1. Run the Cluster node and select Results. The Results - Cluster window opens.

The Results - Cluster window contains four embedded windows. The Segment Plot window attempts
to show the distribution of each input variable by cluster. The Mean Statistics window lists various
descriptive statistics by cluster. The Segment Size window shows a pie chart describing the size of
each cluster formed. The Output window shows the output of various SAS procedures run by the
Cluster node.
Apparently, the Cluster node has found three clusters in the CENSUS2000 data. Because the number of
clusters is based on the cubic clustering criterion, it could be interesting to examine the values of this statistic
for various cluster counts.

2. Select View ⇒ Summary Statistics ⇒ CCC Plot. The CCC Plot window opens.

The CCC statistic is zero for Number of Clusters equal to one. The statistic decreases for Number of
Clusters equal to two and then rapidly increases. It reaches a maximum at Number of Clusters equal
to 14 and then slowly decreases. The number of clusters selected is the first instance that the CCC
goes from increasing to decreasing. In the above graph, the CCC decreases from 1 cluster to 2,
increases from 2 clusters to 4, then decreases from 4 clusters to 5. So, 4 clusters is chosen as the
optimal number.
In theory, the number of clusters in a data set is revealed by the peak of the CCC versus Number of
Clusters plot. However, when no distinct concentrations of data exist, the utility of the CCC statistic
is somewhat suspect. SAS Enterprise Miner attempts to establish reasonable defaults for its analysis
tools. The appropriateness of these defaults, however, strongly depends on the analysis objective and
the nature of the data.
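The selection rule stated above (take the first cluster count at which the CCC stops increasing and starts decreasing) can be expressed as a small function. The Python below is a sketch of that rule only; the CCC values are hypothetical numbers chosen to mimic the shape described, not values read from the plot.

```python
def first_ccc_peak(ccc_by_k):
    """ccc_by_k[i] holds the CCC for i + 1 clusters.

    Return the first cluster count whose CCC is higher than both neighbors,
    or None if the sequence has no such interior peak.
    """
    for i in range(1, len(ccc_by_k) - 1):
        if ccc_by_k[i] > ccc_by_k[i - 1] and ccc_by_k[i] > ccc_by_k[i + 1]:
            return i + 1
    return None

# Hypothetical CCC values for 1 through 7 clusters: a drop at 2,
# a rise through 4, a dip at 5, then a further rise.
ccc = [0.0, -3.0, 1.0, 6.0, 5.0, 7.0, 9.0]
print(first_ccc_peak(ccc))
```

With these hypothetical values the rule stops at the first local peak even though later cluster counts have a higher CCC, which illustrates why the automatic choice deserves a second look.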

Specifying the Segment Count

You might want to increase the number of clusters created by the Cluster node. You can do this by
changing the CCC cutoff property or by specifying the desired number of clusters.
1. On the Properties panel for the Cluster node, select Specification Method ⇒ User Specify.

The User Specify setting creates a number of segments indicated by the Maximum Number of
Clusters property listed above it (in this case, 10).
2. Run the Cluster node and select Results. The Results - Cluster window opens, this time
showing a total of 10 generated segments.

Segment frequency counts vary from 10 cases to more than 8,000 cases.

Exploring Segments

While the Results window shows a variety of data summarizing the analysis, it is difficult to understand
the composition of the generated clusters. If the number of cluster inputs is small, the Graph wizard can
aid in interpreting the cluster analysis.
1. Close the Results - Cluster window.
2. Select Exported Data from the Properties panel for the Cluster node. The Exported Data - Cluster
window opens.

This window shows the data sets that are generated and exported by the Cluster node.
3. Select the Train data set and select Explore. The Explore window opens.

You can use the Graph Wizard to generate a three-dimensional plot of the CENSUS2000 data.
4. Select Actions ⇒ Plot. The Select a Chart Type window opens.
5. Select the icon for a three-dimensional scatter plot.

6. Select Next >. The Graph Wizard proceeds to the next step, Select Chart Roles.
7. Select roles of X, Y, and Z for MeanHHSz, MedHHInc, and RegDens, respectively.

8. Select Role ⇒ Color for _SEGMENT_.



9. Select Finish.

The Explore window opens with a three-dimensional plot of the CENSUS2000 data.

Even though LogMedHHInc was used for the segmentation analysis, it is easier to interpret
the results using the original variable, MedHHInc. This works because the two variables are
monotonically related.
10. Rotate the plot by holding the CTRL key and dragging the mouse.
Each square in the plot represents a unique postal code. The squares are color-coded by cluster
segment.

To further aid interpretability, add a distribution plot of the segment number.


1. Select Actions ⇒ Plot.
2. Select a Bar chart.

3. Select Next >.


4. Select Role ⇒ Category for the variable _SEGMENT_.

5. Select Finish. A histogram of _SEGMENT_ opens.



By itself, this plot is of limited use. However, when the plot is combined with the three-dimensional
plot, you can easily interpret the generated segments.
6. Select the tallest segment bar in the histogram, segment 8.

7. Select the three-dimensional plot. Cases corresponding to segment 8 are highlighted.



8. Rotate the three-dimensional plot to get a better look at the highlighted cases.

Cases in this largest segment correspond to households averaging between two and three members,
low population density, and median household incomes between $20,000 and $50,000.
9. For further interpretation, you can make a scatter plot of longitude and latitude to see where people in
this cluster reside.

Although these inputs are not used to create the clusters, there is an interesting correlation.

The geographic plot suggests that cases in segment 8 are located in the American heartland. The low
population density suggests rural rather than urban or suburban settings. You could accurately call this
segment Middle America.

Similar analyses can be made of the other segments.

By closing and tiling the windows, you can see many aspects of the cluster analysis
simultaneously.

Profiling Segments

You can gain a great deal of insight by creating plots as in the previous demonstration. Unfortunately, if
more than three variables are used to generate the segments, the interpretation of such plots becomes
difficult.
Fortunately, there is another useful tool in SAS Enterprise Miner for interpreting the composition of
clusters: the Segment Profile. This tool enables you to compare the distribution of a variable in an
individual segment to the distribution of the variable overall. As a bonus, the variables are sorted by how well
they characterize the segment.
1. Drag a Segment Profile tool from the Assess tool palette into the diagram workspace.
2. Connect the Cluster node to the Segment Profile node.

3. Run the Segment Profile node and select Results. The Results - Segment Profile window opens.

4. Maximize the Profile window.

Features of each segment become apparent. For example, segment 8, when compared to the overall
distributions, has a lower Region Density Percentile, more central Median Household Income,
and slightly higher Average Household Size.

5. Maximize the Variable Worth: _SEGMENT_ window.

The window shows the relative worth of each variable in characterizing each segment. For example,
segment 8 is largely characterized by the RegDens variable.

Again, similar analyses can be employed to describe the other segments. The advantage of the
Segment Profile window (compared to direct viewing of the segmentation) is that the descriptions can
be more than three-dimensional.
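The idea of comparing a variable's distribution inside a segment with its overall distribution can be imitated with a simple score. The Python sketch below bins a variable, computes the share of cases per bin overall and within the segment, and sums half the absolute differences. This score is a stand-in chosen for illustration; it is not the worth statistic that the Segment Profile node reports, and the function names and data are made up.

```python
def bin_shares(values, edges):
    """Share of values in each bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

def profile_gap(segment_values, all_values, n_bins=5):
    """0 when the segment matches the overall distribution; near 1 when it does not.

    Assumes all_values contains at least two distinct values.
    """
    lo, hi = min(all_values), max(all_values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    seg = bin_shares(segment_values, edges)
    overall = bin_shares(all_values, edges)
    return 0.5 * sum(abs(s - o) for s, o in zip(seg, overall))

# Made-up variable: the segment holds only the lowest fifth of the values.
all_values = list(range(100))
segment_values = list(range(20))
print(profile_gap(segment_values, all_values))
```

Computing such a score for every input and sorting the inputs by it gives a rough ordering of which variables distinguish a segment most.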