Mr. Digvijay D. Desai.

Gower's General Similarity Coefficient

Gower's General Similarity Coefficient is one of the most popular measures of proximity for mixed data types. For details of mixed data types, see the Data Types section below. Gower's General Similarity Coefficient $s_{ij}$ compares two cases $i$ and $j$, and is defined as follows:

$$s_{ij} = \frac{\sum_k w_{ijk}\, s_{ijk}}{\sum_k w_{ijk}}$$

where $s_{ijk}$ denotes the contribution provided by the $k$th variable, and $w_{ijk}$ is usually 1 or 0 depending upon whether or not the comparison is valid for the $k$th variable; if differential variable weights are specified, it is the weight of the $k$th variable, or 0 if the comparison is not valid. The effect of the denominator $\sum_k w_{ijk}$ is to divide the sum of the similarity scores by the number of variables for which the comparison is valid or, if variable weights have been specified, by the sum of their weights.

Ordinal and Continuous Variables
Gower defines the value of $s_{ijk}$ for ordinal and continuous variables as follows:

$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{r_k}$$

where $r_k$ is the range of values for the $k$th variable. For continuous variables $s_{ijk}$ ranges between 1, for identical values $x_{ik} = x_{jk}$, and 0, for the two extreme values $x_{\max}$ and $x_{\min}$.

Binary Variables
For a binary variable (or dichotomous character), Gower defines the component of similarity and the weight according to the table below, where + denotes that attribute $k$ is "present" and - denotes that attribute $k$ is "absent".

Value of attribute k
Case i        +   +   -   -
Case j        +   -   +   -
$s_{ijk}$     1   0   0   0
$w_{ijk}$     1   1   1   0

Thus $s_{ijk} = 1$ if cases $i$ and $j$ both have attribute $k$ "present", or 0 otherwise, and the weight $w_{ijk}$ causes negative matches to be ignored. If negative matches are not to be ignored, the variable should be specified as a nominal variable (see below). If all your variables are binary, then Gower's General Similarity Coefficient is equivalent to Jaccard's Similarity Coefficient A/(A+B+C), where A counts the attributes present in both cases, B and C count the attributes present in only one of the two cases, and the negative matches counted in cell D are ignored.

Nominal Variables
The value of $s_{ijk}$ for nominal variables is 1 if $x_{ik} = x_{jk}$, or 0 if $x_{ik} \neq x_{jk}$. Thus $s_{ijk} = 1$ if cases $i$ and $j$ have the same "state" for attribute $k$, or 0 if they have different "states", and $w_{ijk} = 1$ if both cases have observed states for attribute $k$.

Differential Variable Weights
It was noted above that the weight $w_{ijk}$ for the comparison on the $k$th variable is usually 1 or 0. However, if you assign differential weights to your variables in ClustanGraphics, then $w_{ijk}$ is either the weight of the $k$th variable or 0, depending upon whether the comparison is valid or not. This allows larger weights to be given to important variables, or another type of external scaling of the variables to be specified. If the weight of any variable is zero, then the variable is effectively ignored in the calculation of proximities. Such variables are "masked" for clustering, but remain available for cluster profiling, to assist in the interpretation of a resulting cluster analysis.

General Distance Coefficients
If you specify mixed data types in ClustanGraphics and select Gower's Similarity Coefficient in Compute/Proximities, your proximity matrix will be calculated according to the above definitions. However, the clustering options available using Gower are restricted to those applicable to similarity measures, and not to dissimilarities. Thus, for example, you will not be able to optimize the Euclidean Sum of Squares without first transforming your proximities into distances. For details, see the corresponding General Distance Coefficient.

Our implementation of Gower's General Similarity Coefficient is another example of the great flexibility provided in Clustan software. Mixed data types frequently occur in social surveys and databases, but you are unlikely to find that other software for cluster analysis or neural networks adequately caters for such practical diversity. Gower's General Similarity Coefficient has been available in Clustan since 1984, and in ClustanGraphics since release 5 in 2001. A worked example of Gower's coefficient with psychiatric data is given below.
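The definitions above (the weighted sum, the per-type components, and differential weights) can be sketched in Python. This is a minimal illustration under stated assumptions, not the ClustanGraphics implementation: the function name `gower_similarity`, the type labels, the use of `None` for a missing value, and the coding of binary variables as 1 (present) and 0 (absent) are all choices made here for the example.

```python
def gower_similarity(xi, xj, types, ranges=None, weights=None):
    """Gower similarity between two cases xi and xj (hypothetical helper).

    types[k]   -- "continuous", "ordinal", "binary" or "nominal"
    ranges[k]  -- range r_k, needed only for continuous/ordinal variables
    weights[k] -- differential variable weight (default 1 for every variable)
    Assumptions: None marks a missing value; binary values are 1/0.
    """
    n = len(types)
    ranges = [None] * n if ranges is None else ranges
    weights = [1.0] * n if weights is None else weights
    num = den = 0.0
    for k in range(n):
        a, b = xi[k], xj[k]
        if a is None or b is None:
            continue  # comparison not valid, so w_ijk = 0
        if types[k] in ("continuous", "ordinal"):
            s = 1.0 - abs(a - b) / ranges[k]
            w = weights[k]
        elif types[k] == "binary":
            s = 1.0 if (a and b) else 0.0
            # a negative match (both absent) gets weight 0 and is ignored
            w = weights[k] if (a or b) else 0.0
        elif types[k] == "nominal":
            s = 1.0 if a == b else 0.0
            w = weights[k]
        else:
            raise ValueError(f"unknown variable type: {types[k]}")
        num += w * s
        den += w
    if den == 0:
        raise ValueError("no valid comparisons between the two cases")
    return num / den


# All-binary data: the result should agree with Jaccard's A / (A + B + C).
case_i = [1, 1, 0, 0, 1]
case_j = [1, 0, 1, 0, 1]
jaccard_check = gower_similarity(case_i, case_j, ["binary"] * 5)

# Mixed data, then the same pair with the nominal variable masked (weight 0).
kinds = ["continuous", "nominal", "binary"]
mixed = gower_similarity([2.0, "a", 1], [4.0, "a", 0], kinds, ranges=[4.0, None, None])
masked = gower_similarity([2.0, "a", 1], [4.0, "a", 0], kinds,
                          ranges=[4.0, None, None], weights=[1, 0, 1])
```

In the all-binary pair, A = 2, B = 1, C = 1 and D = 1, so the negative match drops out of both numerator and denominator. Setting a weight to zero removes that variable from both sums, which is the masking behaviour described above.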

ClustanGraphics allows you to run very powerful clustering algorithms on different data types with or without missing values and differential case or variable weighting. Having read your data, either specify your variable types using an Auto Script, or select Edit/Data Types and specify them interactively using the following dialogue:


The example shown here illustrates four types of variables allowed in ClustanGraphics - binary, nominal, ordinal and continuous, and two data transformations - range or z-scores. These apply as follows:

Binary      Two codes other than missing, the higher code signifying "yes" or "present", the lower code signifying "no" or "absent" (e.g. CreditAllowed, meaning whether the client has credit terms).

Nominal     Integer codes having no logical numerical order (e.g. AccountType or ...).

Ordinal     Integer codes having a logical numerical order (e.g. VolumeLevel, by band).

Continuous  Wide range of numerical values on a continuous or semi-continuous scale (e.g. InvoiceValue, or the actual value of the current contract).

When you have completed a cluster analysis with mixed data types, the results are easily and flexibly presented in our cluster model dialogue, shown here.

Data Types
On first entry, ClustanGraphics examines your data and tries to infer the type of each variable from whether its values are integers and how often each value occurs. This may be correct; for example, if all your variables are binary, each has only two possible values and will be interpreted as binary. Nominal and ordinal variables will both be interpreted as nominal, so you should change the type of any such variable that is ordinal. To do this, click on the type cell and select from the drop-down list (right).

Variable Transformations
ClustanGraphics allows you to transform ordinal or continuous variables. The transformation options are none, range or z-scores. Range divides each value by the range of valid values, so that the transformed values range between zero and 1. Z-scores transforms the values so that they have a mean of zero and a standard deviation of 1. To specify the transformation of any variable, click on the variable transform cell and select from the drop-down list (right). More details of data transformations are here. Transformations are not available for binary or nominal variables. A binary variable is stored as a present/absent score for each case (e.g. CreditAllowed is either true or false). Likewise, a nominal variable is stored as a present/absent score for each category represented by an integer code (e.g. ClientSector=5 is held as true for sector 5 and false for all other sector codes).
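The two transformations can be sketched in Python as follows. This is an illustration only, with several assumptions that the text does not settle: the helper names `range_transform` and `z_scores` are hypothetical, `None` stands for a missing value, the range transformation subtracts the minimum before dividing (assumed here because that is what makes the result span 0 to 1), and the z-scores use the population standard deviation. The invoice values are made up for the example.

```python
from statistics import mean, pstdev


def range_transform(values):
    """Scale valid values to [0, 1]; missing values (None) stay missing.

    Assumption: the minimum is subtracted before dividing by the range.
    """
    valid = [v for v in values if v is not None]
    lo, hi = min(valid), max(valid)
    span = hi - lo  # a constant variable (span == 0) cannot be scaled this way
    return [None if v is None else (v - lo) / span for v in values]


def z_scores(values):
    """Standardise valid values to mean 0 and standard deviation 1.

    Assumption: population standard deviation (pstdev), not the sample one.
    """
    valid = [v for v in values if v is not None]
    m, sd = mean(valid), pstdev(valid)
    return [None if v is None else (v - m) / sd for v in values]


# Hypothetical InvoiceValue column with one missing entry.
invoice_value = [200.0, 500.0, None, 350.0]
scaled = range_transform(invoice_value)
standardised = z_scores(invoice_value)
```

Missing values are left out of the minimum, maximum, mean and standard deviation, in keeping with the text's reference to the range of valid values.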

Variable Weights
With ClustanGraphics you can have different weights for each variable. The standard default is a weight of 1, so that all variables have equal weight. If you want to give some variables more emphasis than others you can specify differential variable weights. To do this, click the variable weight cell and type a new weight value (right). Your current choice of weights can also be reviewed and changed in the Edit/Weights dialogue, on the Edit menu.

Masking Variables
If you specify a weight of zero, the variable will be masked from the cluster analysis. In this case, the Edit/Data Types dialogue will show the variable as masked, and its entries will be grayed (right). This is helpful if you want to carry background variables that are "inactive", that is, not used for clustering but still available for interpretation in cluster profiling.

Variable Names
The Edit/Data Types dialogue allows you to change the names of variables. Simply click on a variable's name and edit it in situ (right). Your current choice of variable names can also be reviewed and changed in the Edit/Labels dialogue, on the Edit menu.

Variable Summaries
If you point the cursor at any variable and click the right mouse button, a summary of the current parameters for that variable will be displayed. This helps you check that you have selected the correct type and transformation for the variable (right). You can display a summary table for all your variables, by clicking the Summary button. An abbreviated table of Data Types specifications can be printed by clicking the Print button.

Confirming Data Types
When you click OK in the Edit/Data Types dialogue, you will be asked whether you wish the changed specifications to be confirmed. At this point you can, if you wish, revert to the type settings previously recorded, or you can update to the new settings entered into the dialogue. Don't forget to save your ClustanGraphics file so that your changes will be correctly reproduced when you next open your file. You are now ready to run a cluster analysis on mixed data types. The current options are hierarchical cluster analysis using Compute Proximities, Nearest Neighbours, k-Means Analysis and Classify Cases. For further details, please refer to the file DataTypes.doc which accompanies ClustanGraphics, or see the worked example of Gower's Similarity Coefficient with mixed data types below.

This is a worked example of Gower's Similarity Coefficient, taken from Cluster Analysis, Third Edition, by Brian S. Everitt, Arnold, London, pp. 45-46. Everitt illustrates the coefficient using the following data for five psychiatrically ill patients:

Case       Weight  Anxiety  Depression  Hallucination  Age
Patient1   120     1        1           1              1
Patient2   150     2        2           1              2
Patient3   110     3        2           2              3
Patient4   145     1        1           2              3
Patient5   120     1        1           2              1

The above data can be easily read by ClustanGraphics. Simply select the values in the table and copy them to an Excel file, then click File/New/Data in ClustanGraphics and choose Excel Spreadsheet as the file format to read the file and the headings and case labels. Next, select Edit/Data Types and change the type specifications of Anxiety and Age to nominal. Note here that this is possibly an incorrect definition, since these two variables appear to be ordinal; however, we shall specify nominal to be consistent with the type definitions in Everitt's example.


Click OK and accept the changed data type specifications. Note that it is not necessary to transform the Weight variable because transformation by range is standard in Gower's coefficient.

Now select Prox/Compute, noting that ClustanGraphics has recognized that the variables comprise mixed data types. Select Gower's Coefficient from the list of similarity and dissimilarity coefficients available for mixed data types.


When you press OK the proximity matrix will be computed. You may also wish at this stage to cluster the data hierarchically. To check the values for Gower's coefficient, click View/Prox. There are unfortunately two errors in the similarity matrix shown on page 46 of Everitt's book: coefficients s25 and s45 are wrongly reported. You can easily check by hand that the correct Gower similarity coefficients have been computed by ClustanGraphics.
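The hand check can also be scripted. The Python sketch below applies the definitions given earlier to the five patients, under these assumptions: the flattened table is read column by column; Weight is continuous with range 150 - 110 = 40; Anxiety and Age are nominal, as in Everitt's example; Depression and Hallucination are binary, with code 2 meaning "present"; all variable weights are 1. The names `PATIENTS`, `TYPES`, `PRESENT` and `gower` are hypothetical, and the values it produces follow from the formulas alone; they are not copied from ClustanGraphics or from Everitt.

```python
# Columns: Weight, Anxiety, Depression, Hallucination, Age
PATIENTS = {
    "Patient1": (120, 1, 1, 1, 1),
    "Patient2": (150, 2, 2, 1, 2),
    "Patient3": (110, 3, 2, 2, 3),
    "Patient4": (145, 1, 1, 2, 3),
    "Patient5": (120, 1, 1, 2, 1),
}
TYPES = ("continuous", "nominal", "binary", "binary", "nominal")
PRESENT = 2  # assumption: the higher binary code signifies "present"


def gower(a, b, weight_range):
    """Gower similarity for two patients, all variable weights equal to 1."""
    num = den = 0.0
    for x, y, kind in zip(a, b, TYPES):
        if kind == "continuous":
            s, w = 1.0 - abs(x - y) / weight_range, 1.0
        elif kind == "nominal":
            s, w = float(x == y), 1.0
        else:  # binary: negative matches get weight 0
            s = float(x == PRESENT and y == PRESENT)
            w = float(x == PRESENT or y == PRESENT)
        num += w * s
        den += w
    return num / den


weights = [row[0] for row in PATIENTS.values()]
weight_range = max(weights) - min(weights)
names = list(PATIENTS)
S = {(p, q): gower(PATIENTS[p], PATIENTS[q], weight_range)
     for p in names for q in names}

# Print the full similarity matrix, one row per patient.
for p in names:
    print(p, " ".join(f"{S[p, q]:.4f}" for q in names))
```

Under this reading of the data, the two coefficients singled out above come to s25 = 0.25/5 = 0.05 and s45 = 2.375/4 = 0.59375. For s45, the shared absence of depression is a negative match and so contributes nothing to either sum.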
