Data Mining with IBM SPSS Modeler 14.2

University of Arkansas, David Douglas
Clustering via k-means and Kohonen SOM

Last Updated 12/13/2012 17:52:08 a12/p12

IBM SPSS Modeler 14.2 and clustering

To illustrate using IBM SPSS Modeler 14.2 for clustering with k-means and Kohonen/SOM, create a stream flow as shown below. Some of the nodes are just for viewing data, so the stream flow is not as complicated as it initially appears.

First, edit the Excel File node to connect to the Prospect.xls file. As always, click the Read Values button on the Type tab. The ID and LOC variables need to be excluded, which can be done on the Type tab of the Excel source node. Again, because this is unsupervised modeling, there is no target variable, and the Direction for all the variables to be used in the model should be set to Input.

Open the k-means node; the default tab is Model. As always, you have the option of providing a custom name. By default, Use partitioned data is checked. For illustration purposes, set the Specified number of clusters to 5. For more detailed output information, you can check the Generate distance field check box.

Clicking the Expert tab allows setting the maximum number of iterations as well as tolerance levels. It also lets you change the encoding value for sets, which has a default value of 0.70711 instead of 1. See below for an explanation, but for now no further changes are needed, so execute the node. As with most of IBM SPSS Modeler 14.2's model nodes, executing the k-means node results in a model nugget on the canvas as well as one placed in the GMP.

Run the k-means node and right-click the model nugget on the canvas to review the results. A model summary and a cluster quality measure appear in the left pane, and a pie chart with the sizes of the clusters appears in the right pane. These are the default settings; note that both panes have a dropdown box that lets the user select the desired views. The Cluster Sizes pie chart provides the percent for each of the five clusters.

Click the View: dropdown box in the left pane and select Clusters. Select Cluster Comparison from the dropdown box in the right pane. Then click cluster-1 in the left pane, not in the Cluster Comparison in the right pane. Note that the legend at the top of the left pane indicates that the darker the color, the more important the variable. Also, note that moving the mouse into a cell will provide the importance value and frequency. Because of the Auto Data Prep node, all the variables have a suffix of _transformed.
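The roles of the Specified number of clusters, maximum iterations, and tolerance settings can be illustrated with a minimal k-means sketch in Python. This is not Modeler's implementation: the farthest-point initialization, the default values, and the six sample records are assumptions made only for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two records."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=20, tolerance=1e-6):
    """Basic k-means; returns (centers, labels)."""
    # Assumed initialization: start from the first record, then repeatedly
    # add the record farthest from the centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))

    labels = []
    for _ in range(max_iter):  # stop after max_iter passes at the latest
        # Assignment step: each record joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its records.
        largest_shift = 0.0
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # leave the center of an empty cluster where it is
            new_center = [sum(col) / len(members) for col in zip(*members)]
            shift = max(abs(a - b) for a, b in zip(new_center, centers[c]))
            largest_shift = max(largest_shift, shift)
            centers[c] = new_center
        if largest_shift < tolerance:  # centers barely moved: converged
            break
    return centers, labels

# Hypothetical records already scaled to the 0-1 range.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
centers, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A larger tolerance or a smaller iteration limit ends the loop sooner, at the cost of centers that may not have settled.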

From the cluster comparison, you can see that cluster-1 has an average age very close to the population average age. The cluster contains only records for married males with a climate value of 20 who do not own homes. A cluster by cluster comparison can be made in this way.

Clusters are presented in order of the number of records in the cluster. The first three clusters (cluster-1, cluster-3 and cluster-4) are considerably larger than the last two clusters (cluster-2 and cluster-5). The right pane has the additional options of variable importance and cell distributions. Double-click the Age_transformed variable and review its distribution. Also note that double-clicking a cell in the left pane will create a distribution, as shown for OwnHome to the right.

To provide additional information about the clusters, you can select desired clusters and generate a Select node. Using standard Windows selection techniques, select the first three clusters (columns). While the columns are selected, click the Generate menu option and select the Select Node from the drop-down list. The generated node will be placed in the upper left-hand corner of the stream canvas. For our illustration, the three most populous clusters (the ones to the left) are selected in order to generate the Select node.

As shown in the first canvas drawing, drag the generated Select node from the upper left-hand corner of the canvas to the right of the k-means model nugget and connect from the model nugget to the generated Select node. No editing is required for the generated Select node. Add three nodes to view the data: a Histogram, a Plot, and a Distribution node. Connect the nodes as was previously shown.

Open the Plot node and try combinations of variables against the created variable $KM-K-Means. This particular illustration selects Sex for the X field, Married for the Y field and $KM-K-Means for the Overlay Color: field. If you really want to jazz up the display, also select a variable such as Climate for the animation field. See the plot below, where cluster-3 contains married and single females and cluster-4 contains married and single males. The size of the dots should indicate comparatively how many records are in each cluster for each level of gender and marital status.

Open the Distribution Node and set the Field and Overlay entries as shown; also check the Normalize by color checkbox. Then run the node. The graphic output displays and is saved in the Outputs tab in the upper right window.

The display below is shown with the Normalize by color check box checked. The columns on the right provide the percents and counts for each cluster. Proportionally, cluster-1 has considerably fewer homeowners and cluster-4 has considerably more homeowners. Review the clusters for the other categorical variables.

For interval variables, use the Histogram Node. Open the Histogram Node and select Income for the Field value and $KM-K-Means for the Overlay variable (this window is not shown). Run the Histogram Node to get the following graph; note that the higher income values are all in cluster-4. Review the other interval variables, Age and FICO.
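The proportional comparison of homeowners across clusters can be sketched in Python as a share of each overlay value within each cluster. The records below are hypothetical and are not taken from Prospect.xls; the sketch assumes the cluster is the Field and OwnHome is the Overlay, which the text does not state explicitly.

```python
from collections import Counter

# Hypothetical (cluster, own_home) pairs, invented for illustration only.
records = [
    ("cluster-1", "N"), ("cluster-1", "N"), ("cluster-1", "N"), ("cluster-1", "Y"),
    ("cluster-4", "Y"), ("cluster-4", "Y"), ("cluster-4", "Y"), ("cluster-4", "N"),
]

def overlay_shares(pairs):
    """Share of each overlay value within each cluster (each cluster sums to 1)."""
    totals = Counter(cluster for cluster, _ in pairs)  # records per cluster
    cells = Counter(pairs)                             # records per (cluster, value)
    return {(cluster, value): n / totals[cluster]
            for (cluster, value), n in cells.items()}

shares = overlay_shares(records)
print(shares[("cluster-1", "Y")])  # 0.25
print(shares[("cluster-4", "Y")])  # 0.75
```

Comparing shares rather than raw counts keeps a large cluster from looking homeowner-heavy simply because it has more records.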

IBM SPSS Modeler's Kohonen (SOM) Node

The Kohonen (SOM) Node, although using a different algorithm, also does clustering. Its basic assumption, that clusters are formed from patterns that share similar features, is consistent with the k-means clustering algorithm. See the initial discussion of Kohonen (SOM) for a conceptual understanding of how it works.

Open the Kohonen Node. As always, you have the option of providing a custom name for the node. For our example, no changes are needed for the Model tab. If you wish to replicate the run, you would need to provide a random seed.

Click the Expert tab. Because this algorithm works similarly to a neural net without the hidden layer(s), some of the expert settings apply a similar logic. For illustrative purposes, set the Width to 2 and the Length to 2. This requires clicking the Expert option. After setting the Width and Length values, use the default settings for our example.

Run the node. When the model is run, a grid will appear and the colors will change as the data is passed through the Kohonen Node. Red indicates the cells winning the most instances. This may happen quickly enough that you miss it.

Double-click the model nugget to review the results. Note that the browse views are identical to those of the K-Means Node, so no further explanation is needed. Note that with a setting of 2 by 2, only one cluster is created. Change the setting to 1 by 5 and run the model again. Although 5 possible clusters could have been generated, only 4 clusters were created, as shown on the right. Also, note that they are referred to as: X=0, Y=0; X=0, Y=1; X=0, Y=2; and X=0, Y=4. Expand and review all 4 clusters, looking for uniqueness in each cluster via the Cluster Comparison pane.
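How a Width by Length grid turns records into clusters labeled by X and Y can be sketched with a small self-organizing map in Python. This is a simplified illustration, not Modeler's Kohonen implementation: the learning-rate schedule, the fixed neighborhood of adjacent cells, the defaults, and the sample records are all assumptions.

```python
import random

def train_som(points, width, length, epochs=50, eta=0.3, seed=1):
    """Train a tiny SOM and return the winning (x, y) cell for each record."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    dim = len(points[0])
    cells = [(x, y) for x in range(width) for y in range(length)]
    # One weight vector per grid cell, started at random values in [0, 1).
    weights = {cell: [rng.random() for _ in range(dim)] for cell in cells}

    def winner(p):
        # The winning cell is the one whose weights are closest to the record.
        return min(cells, key=lambda cell: sum((a - w) ** 2
                                               for a, w in zip(p, weights[cell])))

    for epoch in range(epochs):
        rate = eta * (1 - epoch / epochs)  # learning rate decays linearly
        for p in points:
            wx, wy = winner(p)
            for cell in cells:
                grid_dist = abs(cell[0] - wx) + abs(cell[1] - wy)
                if grid_dist > 1:
                    continue  # only the winner and adjacent cells learn
                step = rate if grid_dist == 0 else rate / 2
                weights[cell] = [w + step * (a - w)
                                 for a, w in zip(p, weights[cell])]
    return [winner(p) for p in points]

# Hypothetical records already scaled to the 0-1 range.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
labels = train_som(points, width=1, length=5)
print(sorted(set(labels)))  # the cells that won at least one record
```

Cells that never win a record produce no cluster, which is why a grid with 5 cells can yield fewer than 5 clusters.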

One could generate a Select Node as before, but this is not necessary when using all the clusters (four in this case). Further analysis of the nodes can be accomplished by using the Plot, Distribution and Histogram Nodes directly attached to the model nugget. Because the Plot, Distribution and Histogram Nodes are used similarly to the K-Means example, they will not be discussed here.

The Kohonen node creates three new variables: $KX-Kohonen, $KY-Kohonen and $KXY-Kohonen. You may wish to use a Table node to review the values for these variables. The last created variable is useful for representing a cluster. Illustrative examples are:

Plot Node: X Field: $KX-Kohonen, Y Field: $KY-Kohonen, Overlay: Sex
Distribution Node: Field: $KXY-Kohonen, Overlay: OWNHOME
Histogram Node: Field: Income, Overlay: $KXY-Kohonen

Of course, you will want to try other combinations of variables in exploring the clusters.

Part of the overall stream flow is shown below. The TwoStep Clustering was also run, but no additional output is illustrated here. Also shown is the IBM SPSS Modeler 14.2 Auto Cluster node. If you connect the Auto Data Prep node to the Auto Cluster node and run it, you will find that it determines the best cluster model is the TwoStep, based on a Silhouette value. A portion of the results is shown below. According to Help, the Silhouette value is an index that measures both cluster cohesion and separation. Search for Silhouette Ranking Measures in Help for details.
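The idea of an index that combines cohesion and separation can be sketched with the classic silhouette coefficient in Python. This is the textbook definition and is an assumption here; Modeler's own Silhouette calculation may differ in detail. The sample records are hypothetical.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-record clusters
            continue
        # Cohesion: mean distance to the other records in the same cluster
        # (the record's distance to itself contributes 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the records of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical records: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette(points, [0, 0, 1, 1])
mixed_up = silhouette(points, [0, 1, 0, 1])
print(well_separated > mixed_up)  # True
```

Values near 1 indicate tight, well-separated clusters, and values below 0 indicate records that sit closer to another cluster than to their own.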

IBM SPSS Modeler 14.2 Notes for missing values and standardization

IBM SPSS Modeler 14.2 uses the same standardization and missing-value handling techniques for both k-means and Kohonen/SOM models. Fields of Range type are transformed to a 0 to 1 range as follows:

New Value = (Value - Lower bound) / Range

Values lower than the lower bound will be set to the lower bound; likewise, values above the upper bound will be assigned the upper bound. Flag fields are coded such that false = 0 and true = 1. For Set fields, each value of the set will have a new temporary input field assigned to it, coded as either 0 or 1 (a dummy variable). Thus, a set with three values will have three new inputs. Actually, these new inputs use a value of .707 instead of 1, because 1's tend to dominate the cluster. All of this is done automatically for you.

The approach for missing values is to replace them with neutral values. For range and flag fields with missing values (blanks and nulls), the missing value is replaced with .5; recall that these types of variables were transformed to values from 0 to 1. Thus, .5 is in theory relatively neutral. For set fields, the derived fields are all set to zero.

What you should think about:
1. Clustering requires being inquisitive and having domain knowledge.
2. There is no quantitative method to determine which is the most useful cluster or clusters.
3. A small cluster may be the most useful cluster.
4. Even though the Silhouette Ranking Measure is a way to determine the "best" clusters, it by no means can identify which of the clusters created via the various cluster nodes could be useful. It can only average cohesiveness and separation.
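The standardization and missing-value rules described in the notes can be sketched in Python as follows. The function names, the field bounds, and the Climate categories are hypothetical; the sketch only mirrors the rules stated in the text and is not Modeler's code. It also shows one way to read the 0.70711 default: it is the square root of 0.5, so two records that differ on a set field end up a squared distance of 1 apart, the same as two records that differ on a flag.

```python
import math

SET_ENCODING = math.sqrt(0.5)  # about 0.70711, the default encoding value for sets

def scale_range(value, lower, upper):
    """Range field: clip to the bounds, then rescale to 0-1; missing -> 0.5."""
    if value is None:
        return 0.5
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)

def encode_flag(value):
    """Flag field: false -> 0, true -> 1; missing -> 0.5."""
    if value is None:
        return 0.5
    return 1.0 if value else 0.0

def encode_set(value, categories):
    """Set field: one derived input per category; missing -> all zeros."""
    return [SET_ENCODING if value == c else 0.0 for c in categories]

# Hypothetical bounds and categories, for illustration only.
print(scale_range(30, 20, 70))    # 0.2
print(scale_range(90, 20, 70))    # 1.0 (clipped to the upper bound)
print(scale_range(None, 20, 70))  # 0.5
print(encode_set(None, [10, 20, 30]))  # [0.0, 0.0, 0.0]

# Two records with different set values are a squared distance of 1 apart.
a = encode_set(10, [10, 20, 30])
b = encode_set(20, [10, 20, 30])
set_gap = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(set_gap, 6))  # 1.0
```

With an encoding value of 1 instead, the same two records would be a squared distance of 2 apart, so a set field would weigh more heavily than a flag or range field.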
