Data Mining with IBM SPSS Modeler 14.2

Clustering via K-Means and Kohonen SOM

To illustrate using IBM SPSS Modeler 14.2 for clustering with k-means and Kohonen/SOM, create a stream flow as shown below. Some of the nodes are just for viewing data, so the stream flow is not as complicated as it initially appears. First, the Excel File node should be edited to connect to a Prospect.xls file. And as always, click the Read Values button on the Type tab. The ID and LOC variables need to be excluded, which can be done on the Type tab of the Excel source node. Again, because this is unsupervised modeling, there is no target variable, and the Direction for all the variables to be used in the model should be set to Input.

Open the k-means node; the default tab is Model. As always, you have the option of providing a custom name. By default, Use partitioned data is checked. For illustration purposes, set the Specified number of clusters to 5. For more detailed output information, you can check the Generate distance field check box.
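For readers who want to see what a k-means run does outside Modeler, the following is a minimal Python sketch of the textbook procedure (Lloyd's algorithm). It is not Modeler's implementation; the function names, the toy records, the seed and the iteration limit are assumptions made for this illustration, and the inputs are assumed to be already scaled to the 0 to 1 range.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(records, k, max_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)            # fixed seed so a run can be repeated
    centers = rng.sample(records, k)     # k records picked at random as starting centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(r, centers[c]))
            for r in records
        ]
        if new_labels == labels:         # no record changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:                  # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical records: (age, income), both already scaled to the 0-1 range.
records = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.90), (0.85, 0.80), (0.50, 0.10)]
centers, labels = kmeans(records, k=2)
print(labels)
```

The number of clusters is an input here, just as it is in the node's Specified number of clusters setting; the algorithm does not choose it for you.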

Clicking the Expert tab allows setting the maximum number of iterations as well as tolerance levels. It also lets you change the encoding value for sets, which has a default value of 0.70711 instead of 1. See below for an explanation, but for now, run the k-means node and right-click the model nugget on the canvas to review the results. As with most of IBM SPSS Modeler 14.2's model nodes, executing the k-means node results in a model nugget on the canvas as well as one placed in the GMP.

A model summary and a cluster quality appear in the left pane, and a pie chart with the sizes of the clusters appears in the right pane. These are the default settings; note that both panes have a dropdown box to allow the user to select desired views. The Cluster Sizes pie chart provides the percent for each of the five clusters.

Click the View: dropdown box in the left pane and select Clusters. Also, select Cluster Comparison from the dropdown box in the right pane. Then click cluster-1 in the left pane, not the Cluster Comparison in the right pane. Note: the legend at the top of the left pane indicates that the darker the color, the more important the variable. Because of the Auto Data Prep node, all the variables have a suffix of _transformed.

Clusters are presented in order of number of records in the cluster. The first three clusters (cluster-1, cluster-3 and cluster-4) are considerably larger than the last two clusters (cluster-2 and cluster-5). The selected cluster contains only records with a climate value of 20: married males who do not own homes.

The right pane has the additional options of variable importance and cell distributions. Double-click the Age_transformed variable and review its distribution; you can see that cluster-1 has an average age very close to the population average age. Also note that double-clicking a cell in the left pane will create a distribution, as shown for OwnHome to the right. A cluster-by-cluster comparison can be made in this way.
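The comparison of a cluster's average against the population average can also be checked by hand. The sketch below is a small Python illustration of that idea; the ages, the cluster labels and the helper name cluster_means are made up for the example and are not taken from the Prospect data.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(values, labels):
    """Mean of `values` within each cluster label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: mean(members) for label, members in groups.items()}

# Hypothetical ages and cluster assignments (not the Prospect data).
ages = [25, 31, 47, 52, 38, 40]
labels = ["cluster-1", "cluster-1", "cluster-3", "cluster-3", "cluster-4", "cluster-4"]

overall = mean(ages)
by_cluster = cluster_means(ages, labels)
for name, m in sorted(by_cluster.items()):
    # A difference near zero means the cluster looks like the population on this variable.
    print(f"{name}: mean age {m:.1f}, population {overall:.1f}, difference {m - overall:+.1f}")
```

A variable whose cluster mean sits close to the population mean does little to distinguish that cluster; large differences are the ones worth interpreting.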

To provide additional information about the clusters, you can select desired clusters and generate a Select node. For our illustration, the most populous three clusters (the ones to the left) are selected in order to generate the Select node. Using Windows techniques, select the first three clusters (columns). While the columns are selected, click the Generate menu option and select Select Node from the drop-down list. The generated node will be placed in the upper left-hand corner of the stream canvas. As shown in the first canvas drawing, drag the generated Select node from the upper left-hand corner of the canvas to the right of the k-means model nugget and connect from the model nugget to the generated Select node. No editing is required for the generated Select node.

Add three nodes to view the data: a Histogram, Plot, and Distribution node. Connect the nodes as was previously shown. Open the Plot node and try combinations of variables against the created variable $KM-K-Means. This particular illustration selects Sex for the X field, Married for the Y field and $KM-K-Means for the Overlay Color: field. If you really want to jazz up the display, also select a variable such as Climate for the animation field. The size of the dots should indicate comparatively how many records are in each cluster for each level of gender and marital status. See the plot below, where cluster-3 contains married and single females and cluster-4 contains married and single males.
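Conceptually, the generated Select node keeps only the records whose cluster field matches one of the chosen clusters. A rough Python equivalent is sketched below; the record layout, the sample rows and the function name select_largest_clusters are assumptions for illustration, and only the field name $KM-K-Means comes from the walkthrough.

```python
from collections import Counter

def select_largest_clusters(records, cluster_field, n):
    """Keep the records that belong to the n most populous clusters."""
    counts = Counter(r[cluster_field] for r in records)
    keep = {name for name, _ in counts.most_common(n)}
    return [r for r in records if r[cluster_field] in keep]

# Hypothetical scored records; each carries the cluster assigned by the model.
records = [
    {"Sex": "F", "$KM-K-Means": "cluster-3"},
    {"Sex": "F", "$KM-K-Means": "cluster-3"},
    {"Sex": "M", "$KM-K-Means": "cluster-4"},
    {"Sex": "M", "$KM-K-Means": "cluster-4"},
    {"Sex": "M", "$KM-K-Means": "cluster-1"},
    {"Sex": "M", "$KM-K-Means": "cluster-1"},
    {"Sex": "M", "$KM-K-Means": "cluster-1"},
    {"Sex": "F", "$KM-K-Means": "cluster-2"},
]
selected = select_largest_clusters(records, "$KM-K-Means", n=3)
print(len(selected), "of", len(records), "records kept")
```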

Open the Distribution node and set the Field and Overlay entries as shown. Then run the node; the graphic output displays and is saved in the Outputs tab in the upper right window.

The display below is shown with the Normalized by color check box checked. The columns on the right provide the percents and counts for each cluster. Proportionally, cluster-1 has considerably fewer homeowners and cluster-4 has considerably more homeowners. Review the clusters for the other categorical variables.

For interval variables, use the Histogram node. Open the Histogram node and select Income for the Field value and $KM-K-Means for the Overlay variable (this window is not shown). Run the Histogram node to get the following graph; note that the higher income values are all in cluster-4. Review the other interval variables, Age and FICO.
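The proportional comparison that a normalized distribution chart makes visible is a row-wise percentage: within each value of the field, what share of records has each overlay value. A minimal Python sketch follows, assuming the cluster is the field and OwnHome is the overlay; the sample rows and the function name normalized_distribution are hypothetical.

```python
from collections import Counter, defaultdict

def normalized_distribution(records, field, overlay):
    """For each value of `field`, the share of records with each `overlay` value."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[field]][r[overlay]] += 1
    return {
        value: {ov: n / sum(c.values()) for ov, n in c.items()}
        for value, c in counts.items()
    }

# Hypothetical scored records (not the Prospect data).
rows = [
    {"$KM-K-Means": "cluster-1", "OwnHome": "No"},
    {"$KM-K-Means": "cluster-1", "OwnHome": "No"},
    {"$KM-K-Means": "cluster-1", "OwnHome": "No"},
    {"$KM-K-Means": "cluster-1", "OwnHome": "Yes"},
    {"$KM-K-Means": "cluster-4", "OwnHome": "Yes"},
    {"$KM-K-Means": "cluster-4", "OwnHome": "Yes"},
    {"$KM-K-Means": "cluster-4", "OwnHome": "Yes"},
    {"$KM-K-Means": "cluster-4", "OwnHome": "No"},
]
shares = normalized_distribution(rows, "$KM-K-Means", "OwnHome")
for cluster, dist in sorted(shares.items()):
    print(cluster, {k: f"{v:.0%}" for k, v in sorted(dist.items())})
```

Normalizing this way makes clusters of very different sizes comparable, which raw counts do not.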

IBM SPSS Modeler's Kohonen (SOM) Node

The Kohonen (SOM) node, although using a different algorithm, also does clustering. Its basic assumption that clusters are formed from patterns that share similar features is consistent with the k-means clustering algorithm. See the initial discussion of Kohonen (SOM) for a conceptual understanding of how it works. Because this algorithm works similarly to a neural net without the hidden layer(s), some of the expert settings apply a similar logic.

Open the Kohonen node. As always, you have the option of providing a custom name for the node. For our example, no changes are needed for the Model tab. If you wish to replicate the run, you would need to provide a random seed. Click the Expert tab. For our example, use the default settings, except that, for illustrative purposes, set the Width to 2 and the Length to 2. This requires clicking the Expert option. After setting the Width and Length values, run the node. When the model is run, red indicates the cells winning the most instances. This may happen quickly enough that you miss it.

Double-click the model nugget to review the results. Note that the browse views are identical to those of the K-Means node, so no further explanation is needed. Note that with a setting of 2 by 2, only one cluster is created. Change the setting to 1 by 5 and run the model again. Although 5 possible clusters could have been generated, only 4 clusters were created, as shown on the right. Expand and review all 4 clusters, looking for uniqueness in each cluster via the Cluster Comparison pane; note that they are referred to as X=0, Y=0; X=0, Y=1; X=0, Y=2; and X=0, Y=4.
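To make the grid idea concrete, here is a minimal self-organizing map sketch in Python. It is a simplified textbook version, not Modeler's algorithm: the learning-rate and neighbourhood schedules, the parameter names (epochs, eta, seed) and the toy records are all assumptions for illustration. It does show two points from the walkthrough: a fixed seed makes a run repeatable, and a grid with 5 cells can end up with fewer than 5 occupied clusters.

```python
import random

def winner(weights, record):
    """Grid cell whose weight vector is closest to the record."""
    return min(
        weights,
        key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], record)),
    )

def train_som(records, width, length, epochs=50, eta=0.3, seed=1):
    """Train a width x length map; returns {(x, y): weight vector}."""
    rng = random.Random(seed)                      # same seed, same map
    dim = len(records[0])
    weights = {
        (x, y): [rng.random() for _ in range(dim)]
        for x in range(width)
        for y in range(length)
    }
    for epoch in range(epochs):
        rate = eta * (1 - epoch / epochs)          # learning rate decays linearly
        radius = 1 if epoch < epochs // 2 else 0   # later epochs update the winner only
        for r in records:
            bx, by = winner(weights, r)
            for (x, y), w in weights.items():
                if abs(x - bx) <= radius and abs(y - by) <= radius:
                    for i in range(dim):           # pull the cell toward the record
                        w[i] += rate * (r[i] - w[i])
    return weights

# Hypothetical records, already scaled to the 0-1 range.
data = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9), (0.5, 0.5)]
som = train_som(data, width=1, length=5, seed=42)
occupied = sorted({winner(som, r) for r in data})
print([f"X={x}, Y={y}" for x, y in occupied])
```

Cells that never win a record simply do not appear as clusters, which is why the number of clusters reported can be smaller than Width times Length.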

The Kohonen node creates three new variables: $KX-Kohonen, $KY-Kohonen and $KXY-Kohonen. The last created variable is useful for representing a cluster. You may wish to use a Table node to review the values for these variables. Further analysis of the nodes can be accomplished by using the Plot, Distribution and Histogram nodes directly attached to the model nugget. Because the Plot, Distribution and Histogram nodes are used similarly as in the K-Means example, they will not be discussed here. Illustrative examples are:

Plot node: X Field: $KX-Kohonen; Y Field: $KY-Kohonen; Overlay: Sex
Distribution node: Field: $KXY-Kohonen; Overlay: OWNHOME
Histogram node: Field: Income; Overlay: $KXY-Kohonen

Part of the overall stream flow is shown below. The TwoStep clustering was also run, but no additional output is illustrated here. Also shown is the IBM SPSS Modeler 14.2 Auto Cluster node. If you connect the Auto Data Prep node to the Auto Cluster node and run it, you will find that it determines the best cluster model is the TwoStep, based on a Silhouette value. A portion of the results is shown below. If you check the Help menu, the Silhouette value is an index that measures both cluster cohesion and separation. Search for Silhouette Ranking Measures in Help for details.
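As a rough illustration of what a silhouette value captures, the Python sketch below computes the textbook silhouette coefficient: for each record, a is the mean distance to the other members of its own cluster (cohesion), b is the smallest mean distance to the members of any other cluster (separation), and the score is (b - a) / max(a, b). Modeler's ranking measure may be computed differently; the function name and the toy points are assumptions for this example.

```python
from math import dist

def silhouette(records, labels):
    """Average textbook silhouette coefficient; needs at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, (r, own) in enumerate(zip(records, labels)):
        same = [dist(r, o) for j, (o, lab) in enumerate(zip(records, labels))
                if lab == own and j != i]
        if not same:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)         # cohesion: mean distance within own cluster
        b = min(                          # separation: nearest other cluster on average
            sum(dist(r, o) for o, lab in zip(records, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != own
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated groups (hypothetical points).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])   # sensible grouping
bad = silhouette(points, [0, 1, 0, 1])    # grouping that mixes the two groups
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters and values near or below 0 indicate overlapping ones. The score is an average over all records, so it says nothing about whether any particular cluster is useful.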

What you should think about:

1. Clustering requires being inquisitive and having domain knowledge.
2. There is no quantitative method to determine which is the most useful cluster or clusters.
3. A small cluster may be the most useful cluster.
4. Even though the Silhouette Ranking Measure is a way to determine "best" clusters, it by no means can identify what could be a useful cluster among the clusters created via the various cluster nodes. It can only average cohesiveness and separation.

IBM SPSS Modeler 14.2 notes for missing values and standardization: IBM SPSS Modeler 14.2 uses the same standardizing and handling of missing values techniques for both k-means and Kohonen/SOM models. All of this is done automatically for you.

Fields of Range type are transformed to a 0 to 1 range as follows:

New Value = (Value - Lower bound) / Range

where Range is the upper bound minus the lower bound. Values lower than the lower bound will be set to the lower bound; likewise, values above the upper bound will be assigned the upper bound. Flag fields are coded such that false = 0 and true = 1. For Set fields, each value of the set will have a new temporary input field assigned to it, coded as either 0 or 1 (a dummy variable). Thus, a set with three values will have three new inputs. Actually, these new inputs use a value of 0.707 (approximately the square root of 0.5) instead of 1, because 1's tend to dominate the cluster.

The approach for missing values is to replace them with neutral values. For Range and Flag fields with missing values (blanks and nulls), the missing value is replaced with 0.5; recall that these types of variables were transformed to values from 0 to 1, so 0.5 is in theory relatively neutral. For Set fields, the derived fields are all set to zero.
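The recoding rules described in these notes can be sketched in Python as follows. This is an illustration of the rules as stated in the text, not Modeler's code; the function names and the example values are assumptions, and None stands in for a blank or null.

```python
import math

SET_ENCODING = math.sqrt(0.5)   # about 0.70711, the default encoding value for sets

def scale_range(value, lower, upper):
    """Range field -> 0..1; out-of-bound values are clipped; missing -> 0.5."""
    if value is None:
        return 0.5
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)

def code_flag(value):
    """Flag field: false -> 0, true -> 1, missing -> 0.5."""
    if value is None:
        return 0.5
    return 1.0 if value else 0.0

def code_set(value, categories, encoding=SET_ENCODING):
    """Set field: one derived input per category; a missing value gives all zeros."""
    return [encoding if value == c else 0.0 for c in categories]

climates = [10, 20, 30]               # hypothetical set with three values
print(scale_range(45, lower=20, upper=70))
print(code_flag(True), code_flag(None))
print(code_set(20, climates), code_set(None, climates))
# Two records that differ only in a set value end up a distance of about 1 apart.
print(math.dist(code_set(10, climates), code_set(20, climates)))
```

One consequence of the 0.70711 value is visible in the last line: two records with different values of a set field differ in two derived inputs, and with this encoding their Euclidean distance is about 1, the same as for two records that differ on a single flag field. With an encoding of 1, the distance would be about 1.414, giving set fields more weight.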

