
2011 15th European Conference on Software Maintenance and Reengineering

Improved Similarity Measures For Software Clustering


Rashid Naseem, Onaiza Maqbool, Siraj Muhammad

Dept. of Computer Science, Quaid-I-Azam University, Islamabad; Elixir Technologies Pakistan (PVT) LTD
Email: rnsqau@gmail.com, onaiza@qau.edu.pk, msiraj83@gmail.com

Abstract: Software clustering is a useful technique to recover the architecture of a software system. The results of clustering depend upon the choice of entities, features, similarity measures and clustering algorithms. Different similarity measures have been used for determining similarity between entities during the clustering process. In the software architecture recovery domain, the Jaccard and the Unbiased Ellenberg measures have shown better results than other measures for binary and non-binary features respectively. In this paper we analyze the Russell and Rao measure for binary features to show the conditions under which its performance is expected to be better than that of Jaccard. We also show how our proposed Jaccard-NM measure is suitable for software clustering and propose its counterpart for non-binary features. Experimental results indicate that our proposed Jaccard-NM measure and the Russell & Rao measure perform better than the Jaccard measure for binary features, while for non-binary features, the proposed Unbiased Ellenberg-NM measure produces results which are closer to the decomposition prepared by experts.

Index Terms: Software Clustering, Jaccard-NM Measure, Jaccard Measure, Unbiased Ellenberg-NM Measure, Russell & Rao Measure

I. INTRODUCTION

Software clustering has engaged the interest of researchers in the last two decades, primarily as a technique to facilitate understanding of legacy software systems. When the architectural documentation is not available, or the documentation has not been updated to reflect changes in the software over time, software clustering may be used for software modularization and architecture recovery [1], [2]. Besides clustering, other techniques used for this purpose are association rule mining [3], concept analysis [4] and graphical visualization [5].

The clustering process is used to modularize a software system or to recover sub-systems by grouping together software entities that are similar to each other. Thus entities within a cluster have similar characteristics or features, and are dissimilar from entities in other clusters. To determine similarity based on features of an entity, a similarity measure is employed. Many different similarity measures are available. The choice of a measure depends on the characteristics of the domain in which it is applied. In the software domain, the most commonly used similarity measure for hierarchical clustering is the Jaccard coefficient for binary features [6], [7], while for non-binary features the Unbiased Ellenberg and Information Loss measures have been shown to produce better results as compared to other measures [1].
1534-5351/11 $26.00 © 2011 IEEE. DOI 10.1109/CSMR.2011.9

In this paper, we describe our proposed Jaccard-NM [8] measure for binary features, and compare it with the Jaccard and Russell & Rao [9] measures. We present different cases to show deficiencies in the Jaccard measure which may deteriorate clustering results, and show how in these cases the Russell & Rao and Jaccard-NM measures are expected to have better performance. For non-binary features we propose the Unbiased Ellenberg-NM measure, and compare its performance with the Unbiased Ellenberg and Information Loss measures. We also analyze cases where these measures produce arbitrary decisions. We call a decision arbitrary when more than two entities have equal similarity value. In this situation, clustering algorithms select the two entities to be clustered arbitrarily. This arbitrary decision may create problems [1]. Thus the contributions of this paper can be summarized as:

1) Analysis of the Jaccard, Jaccard-NM and Russell & Rao measures for binary features and a comparison of their strengths and weaknesses.
2) Definition of a new similarity measure for non-binary features and its comparison with well known existing measures used for software clustering.
3) Internal and external evaluation of clustering results. Internal assessment is carried out using arbitrary decisions taken by proposed and existing similarity measures. External assessment is carried out by comparing manually prepared software decompositions with automatically produced clustering results using MoJoFM.

This paper is organized as follows. In Section 2 we describe related work. An overview of clustering is presented in Section 3. In Section 4 we present an analysis of similarity measures for binary features and define a new measure for non-binary features. Section 5 describes our experimental setup. In Section 6, we analyze our experimental results. Finally, in Section 7, we present the conclusions and future work.

II. RELATED WORK

To find similarity between entities, various similarity measures have been used. Davey and Burd evaluated different similarity measures including the Jaccard, Sorensen-Dice, Canberra and Correlation coefficients [7]. From experimental results they concluded that the Jaccard and Sorensen-Dice similarity measures perform identically, and they recommended the Jaccard similarity measure for software clustering when features are binary.

Anquetil and Lethbridge compared different similarity measures including Jaccard, Simple Matching, Sorensen-Dice, Correlation, Taxonomic and Canberra [6]. For clustering they used the Complete linkage, Weighted linkage, Unweighted linkage and Single linkage algorithms. They concluded that the Jaccard and Sorensen-Dice similarity measures produce good results because they do not consider absence of a feature (d) as a sign of similarity, while Simple Matching and other similarity measures consider absence of a feature to be a sign of similarity and thus do not produce satisfactory results. In 2003 Anquetil and Lethbridge evaluated different features, similarity measures and clustering algorithms. From experimental results they once again concluded that the Jaccard similarity measure produces good results [10].

Saeed et al. developed a new linkage algorithm called the Combined algorithm [11]. They compared this algorithm with Complete linkage using different similarity measures including Jaccard, Sorensen-Dice, Simple Matching and the Correlation coefficient. They concluded that the behavior of the Correlation coefficient is similar to that of the Jaccard similarity measure when the number of absent features is very large as compared to present features.

In 2004, Maqbool and Babri developed the Weighted Combined algorithm, and proposed the Unbiased Ellenberg similarity measure [12]. In that paper they evaluated the Complete linkage, Combined and Weighted Combined algorithms using the Jaccard, Euclidean distance, Pearson correlation coefficient, Ellenberg and Unbiased Ellenberg similarity measures. Their results suggested that the Weighted Combined algorithm produces better results than the Complete and Combined algorithms, especially with the Unbiased Ellenberg measure.

Andritsos and Tzerpos developed an algorithm called LIMBO (scaLable InforMation BOttleneck algorithm) in 2005 [2]. They applied LIMBO to three different data sets and compared the results with the ACDC, NAHC-lib, SAHC, SAHC-lib, Single Linkage, Complete Linkage, Weighted Average Linkage and Unweighted Average Linkage algorithms. They concluded that, on average, LIMBO performed better than the other algorithms.

In 2006, Mitchell and Mancoridis described their Bunch clustering tool, which uses search techniques (hill-climbing and genetic algorithms) to find optimal solutions [13]. The Bunch tool was developed in 1998 [14] and over time modified to include new features (e.g., omnipresent module detection and deletion) [15], [16]. Bunch uses a Module Dependency Graph (MDG), where modules are entities and edges are static relationships among entities. Bunch makes partitions of the MDG and uses a fitness function, Modularization Quality (MQ), to calculate the quality of graph partitions.

Harman et al. investigated the effect of noise in the input information available for software module clustering [17]. To guide the search they examined two fitness functions: Modularization Quality (MQ) and the Evaluation Metric function (EVM). For evaluation purposes they used six real software systems, three Perfect module dependency graphs and three Random module dependency graphs, and concluded that in the presence of noise EVM performs better than MQ for real and perfect MDGs. Results also show that EVM is more robust than MQ for smaller software systems.

In 2010 Naseem et al. proposed a new similarity measure called Jaccard-NM [8] for binary features. They evaluated this measure using the Complete linkage, Weighted average and Unweighted average algorithms. From the experimental results they concluded that, in general, Jaccard-NM produces better results than the Jaccard similarity measure for binary features.

Besides the software domain, a comparison of similarity measures has also been carried out in other domains. In these domains, the Jaccard measure does not necessarily perform better than other measures, as it does in the case of software, due to different domain characteristics. For example, Willett used thirteen similarity measures including Tanimoto, Russell & Rao, and Simple Matching to find the similarity between molecular fingerprints for virtual screening [18]. He concluded that the Tanimoto, Baroni-Urbani/Buser, Kulczynski(2), Fossum and Ochiai/Cosine coefficients perform reasonably well across the range of molecular size [18]. Dalirsefat et al. compared three similarity measures, Jaccard, Sorensen-Dice and Simple Matching, to find the similarity between biological organisms. They concluded from their experimental results that when the organisms are closely related, Jaccard or Sorensen-Dice give satisfactory results [19]. Moreover, Jaccard and Sorensen-Dice produce closely similar results because these two measures exclude negative co-occurrences. These results are similar to those obtained for software.

III. OVERVIEW OF CLUSTERING

In the clustering process, entities are grouped together based on their features. In this section, we provide an overview of the steps in clustering.

A. Selection of Entities and Features

Selection of entities and features depends on the type of software system, and also on the required architectural view.
For modularization of structured software systems, researchers have selected different entities, e.g. files, processes and functions. Features may be global variables or user defined types used by an entity [6]. For object oriented software systems, entities may be classes [20] and features are typically defined by the relationships between classes, e.g. inheritance or containment. In the software domain, features are usually binary, i.e. they indicate the presence or absence of a characteristic or relationship. Before applying a clustering algorithm, a software system must be parsed to extract entities and features. The result is an N x P matrix, where N is the number of entities and P is the number of features. Table I presents an N x P matrix of a small software system containing 4 entities and 6 binary features.

B. Selection of Similarity Metrics

In the second step, a similarity measure is applied to compute similarity between every pair of entities, resulting in a similarity matrix. Selection of similarity measure should


TABLE I
(N x P) FEATURE MATRIX FOR A SMALL SYSTEM

Entity  f1  f2  f3  f4  f5  f6
E1      0   1   1   0   1   0
E2      1   1   1   0   0   1
E3      1   0   0   1   0   1
E4      1   0   0   1   1   0

TABLE IV
SIMILARITY MEASURES FOR NON-BINARY FEATURES

S. No  Name                       Mathematical representation
1      Ellenberg                  0.5 Ma / (0.5 Ma + Mb + Mc)
2      Unbiased Ellenberg         0.5 Ma / (0.5 Ma + b + c)
3      Gleason measure            Ma / (Ma + Mb + Mc)
4      Unbiased Gleason measure   Ma / (Ma + b + c)

be done carefully, because selecting an appropriate similarity measure may influence clustering results more than selection of a clustering algorithm [21]. Table II lists some well known similarity measures for binary features.

TABLE II
SIMILARITY MEASURES FOR BINARY FEATURES

S. No  Name              Mathematical representation
1      Jaccard           a / (a + b + c)
2      Russell & Rao     a / (a + b + c + d)
3      Simple Matching   (a + d) / (a + b + c + d)
4      Sokal-Sneath      a / (a + 2(b + c))
5      Rogers-Tanimoto   (a + d) / (a + 2(b + c) + d)
6      Gower-Legendre    (a + d) / (a + 0.5(b + c) + d)

has been observed that in software clustering, the features are asymmetric, i.e. the presence of a feature (1) has more weight than its absence (0). The absence of features does not indicate similarity between two entities, e.g. if two classes both do not use a variable, it does not mean that they are similar. For non-binary features, Unbiased Ellenberg, the counterpart of the Jaccard similarity measure, produces better results for software clustering [1], [12].

C. Application of a Clustering Algorithm

The next step is to apply a clustering algorithm, which can be categorized as hierarchical or non-hierarchical. Agglomerative Hierarchical Clustering (AHC) algorithms are based on the bottom-up approach. In this approach, an algorithm considers entities as singleton clusters, and at every step clusters the two most similar entities together. At the end, the algorithm makes one large cluster which contains all entities. Although non-hierarchical algorithms have also been used in the software domain [23], [24], there are some advantages of using AHC algorithms. For example, there is no need of prior information about the number of clusters. Moreover, the hierarchical structure of a software system is naturally represented through hierarchical algorithms. The disadvantage is that we have to select a cutoff point, which represents the number of steps after which to stop the algorithm.

Widely used agglomerative hierarchical algorithms for software architecture recovery are Complete Linkage (CL), Single Linkage (SL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL). When two entities are merged into a cluster, similarity between the newly formed cluster and other clusters/entities is calculated differently by these algorithms. Suppose we have three entities E1, E2 and E3. Using these algorithms, similarity between E1 and the newly formed cluster E23 is calculated as [22]:

Complete Linkage: Similarity(E1, E23) = min(Similarity(E1, E2), Similarity(E1, E3)).
Single Linkage: Similarity(E1, E23) = max(Similarity(E1, E2), Similarity(E1, E3)).
Weighted Average Linkage: Similarity(E1, E23) = 1/2 Similarity(E1, E2) + 1/2 Similarity(E1, E3).
Unweighted Average Linkage: Similarity(E1, E23) = (Similarity(E1, E2)

The Jaccard-NM measure for binary features proposed by us in [8] is given by:

\[ \text{Jaccard-NM} = \frac{a}{2(a + b + c) + d} \tag{1} \]

In Table II and Equation 1, a, b, c and d can be determined using Table III. For two entities X and Y, a is the number of features that are present (1) in both entities X and Y, b represents features that are present in X but absent in Y, c represents features that are not present in X but present in Y, and d represents the number of features that are absent (0) in both entities. n = a + b + c + d is the total number of features.

TABLE III
CONTINGENCY TABLE

                    Y: 1 (Presence)   Y: 0 (Absence)   Sum
X: 1 (Presence)     a                 b                a + b
X: 0 (Absence)      c                 d                c + d
Sum                 a + c             b + d            n = a + b + c + d
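The counts of Table III and four of the measures in Table II, together with Equation 1, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the two example vectors are hypothetical, and the vectors are assumed to be non-empty lists of 0/1 values of equal length.

```python
from typing import Sequence, Tuple


def contingency(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) of Table III for two binary feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # present in both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only in X
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only in Y
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # absent in both
    return a, b, c, d


def jaccard(x, y):
    a, b, c, _ = contingency(x, y)
    # Convention assumed here: two all-zero vectors get similarity 0.
    return a / (a + b + c) if (a + b + c) else 0.0


def russell_rao(x, y):
    a, b, c, d = contingency(x, y)
    return a / (a + b + c + d)


def simple_matching(x, y):
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)


def jaccard_nm(x, y):
    a, b, c, d = contingency(x, y)
    return a / (2 * (a + b + c) + d)  # Equation 1


# Hypothetical example: two entities described by seven binary features.
x = [1, 1, 0, 1, 0, 0, 0]
y = [1, 1, 0, 0, 1, 0, 0]
print(contingency(x, y))  # (2, 1, 1, 3)
print(jaccard(x, y), russell_rao(x, y), simple_matching(x, y), jaccard_nm(x, y))
```

Note that d enters the numerator only in Simple Matching; in Russell & Rao and Jaccard-NM it appears in the denominator alone.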

Table IV lists some well known similarity measures for non-binary features. In Table IV, since the features are non-binary, Ma represents the sum of features that are present in both entities X and Y, Mb represents the sum of features that are present in X but absent in Y, and Mc represents the sum of features that are not present in X but are present in Y. In the software domain, it has been shown that the Jaccard measure produces better results than other measures for binary features [6], [7]. One reason for this is that it does not consider d (absence of feature/negative match) [11], [22]. It


× size(E2) + Similarity(E1, E3) × size(E3)) / (size(E2) + size(E3)).

The Complete linkage algorithm supports formation of small but cohesive clusters, while the Single linkage algorithm makes large, non-cohesive but stable clusters. The results of the Weighted and Unweighted Average Linkage algorithms lie between these two.

Two recently proposed hierarchical algorithms for software clustering are the Weighted Combined Algorithm (WCA) [12] and LIMBO [2]. When two entities are merged into a cluster using linkage algorithms, information about the number of entities accessing a feature is lost [12]. WCA and LIMBO overcome this limitation of linkage algorithms by making a new feature vector for the newly formed cluster. This feature vector contains information about the number of entities accessing a feature. Unlike linkage algorithms, these algorithms update the feature matrix after every step. Suppose we have two entities Ei and Ej with normalized feature vectors fi and fj, respectively. The new feature vector fij is calculated for both algorithms as:

\[ f_{ij} = \frac{f_i + f_j}{n_i + n_j} = \frac{f_{ik} + f_{jk}}{n_i + n_j}, \quad k = 1, 2, \ldots, p \]

The Information Loss (IL) measure is used with LIMBO to calculate the information loss between any two entities/clusters. The entities are chosen for grouping together into a new cluster when their IL is minimum. The IL, represented by I, is briefly described below (for details and examples see [2]). Information loss is given as:

\[ I = [p(E_i) + p(E_j)] \cdot D_{js}[f_i, f_j] \]

max(mno(A, B)) is the maximum possible number of move and join operations needed to convert any decomposition into B. A higher MoJoFM value (up to 100%) denotes greater correspondence between the two decompositions and hence better results, while lower MoJoFM values (down to 0%) indicate that the decompositions are very different.

In internal assessment, some internal characteristic of clusters may be used to evaluate the quality of results. Arbitrary decisions represent an internal quality measure [1]. An arbitrary decision is taken by an algorithm when there is more than one maximum value for similarity between entities (or, for distance and information loss measures, more than one minimum value).

IV. AN ANALYSIS OF SIMILARITY MEASURES AND FEATURE VECTOR CASES

In this section, we analyze similarity measures for binary features and propose a new measure for non-binary features.

A. Analysis of similarity measures

As described in Section III-B, for software clustering, measures that do not contain d produce better results. This is because features in software are asymmetric, and a 1 and a 0 do not have equal weight. 0 indicates the absence of a feature, and hence d indicates that features are not being shared between entities. For software, the absence of a feature in two entities does not indicate similarity. For example, if two classes do not access the same global function, it does not mean that the two classes are similar.

To show that the presence of d in a measure does not necessarily deteriorate results, consider Table V, which shows 4 entities, E1-E4. E1 and E2 share two features, so the value of a is 2. Each of them accesses one feature that the other entity does not, so b = 1 and c = 1. E3 and E4 share three features, so a = 3. As with E1 and E2, each of them accesses one feature that the other entity does not, so b = 1 and c = 1, as shown in Figure 1.

TABLE V
SOFTWARE SYSTEM A

Entities  f1  f2  f3  f4  f5  f6  f7
E1        1   1   0   1   0   0   0
E2        1   1   0   0   1   0   0
E3        1   1   1   0   0   1   0
E4        1   1   1   0   0   0   1

For each singleton entity, p(Ei) = p(Ej) = 1/n, where n is the total number of entities. Djs is the Jensen-Shannon divergence, defined as follows:

\[ D_{js} = \frac{p(E_i)}{p(E_{ij})} D_{kl}[f_i \| f_{ij}] + \frac{p(E_j)}{p(E_{ij})} D_{kl}[f_j \| f_{ij}] \]

Dkl is the relative entropy (also called the Kullback-Leibler (KL) divergence), which measures the difference between two probability distributions, given as:

\[ D_{kl}[f_i \| f_j] = \sum_{k=1}^{p} f_{ik} \log \frac{f_{ik}}{f_{jk}} \]
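The information loss computation can be sketched as follows. This is a hedged illustration, not LIMBO itself: the function names are hypothetical, base-2 logarithms are assumed (the text does not fix a base), p(Eij) is taken to be p(Ei) + p(Ej), and the merged distribution fij is assumed to be the probability-weighted mixture of fi and fj, which keeps every term of Dkl finite. Both fi and fj are assumed to be normalized so that each sums to 1.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_kl[p || q] with base-2 logs; terms where p_k == 0 contribute 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def information_loss(f_i: Sequence[float], f_j: Sequence[float],
                     p_i: float, p_j: float) -> float:
    """I = [p(Ei) + p(Ej)] * D_js[f_i, f_j] for two clusters."""
    p_ij = p_i + p_j
    # Assumption: merged distribution is the weighted mixture of f_i and f_j.
    f_ij = [(p_i * a + p_j * b) / p_ij for a, b in zip(f_i, f_j)]
    d_js = ((p_i / p_ij) * kl_divergence(f_i, f_ij)
            + (p_j / p_ij) * kl_divergence(f_j, f_ij))
    return p_ij * d_js


# Hypothetical example with n = 4 singleton entities, so p(E) = 1/4 each.
same = information_loss([0.5, 0.5], [0.5, 0.5], 0.25, 0.25)
disjoint = information_loss([1.0, 0.0], [0.0, 1.0], 0.25, 0.25)
print(same, disjoint)  # 0.0 0.5
```

Merging two entities with identical feature distributions loses no information, while merging entities with no feature in common loses the most, which is why the pair with minimum I is merged first.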

D. Evaluation of Results

In external assessment, the automatically prepared decompositions are compared with the decompositions prepared by human experts. For this purpose different measures may be used. A well known measure is MoJoFM [25], a recent version of MoJo [26]. MoJoFM is an external assessment measure which calculates the percentage of Move and Join operations needed to convert the decomposition produced by a clustering algorithm to an expert decomposition [25]. To compare the result A of our algorithm with expert decomposition B, we have:

\[ \mathit{MoJoFM}(M) = \left(1 - \frac{mno(A, B)}{\max(mno(A, B))}\right) \times 100\% \tag{2} \]

TABLE VI
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM A

Entities  E1    E2    E3    E4
E1        0
E2        0.5   0
E3        0.4   0.4   0
E4        0.4   0.4   0.6   0

where mno(A, B ) is the minimum number of move and join operations needed to convert from A to B and


features. Consider the following two cases, which indicate how the presence of d in the Jaccard-NM and Russell & Rao measures may improve performance as compared to Jaccard.

- Case 1: The value of a is different among entities, but the similarity as per Jaccard is the same.

TABLE X
SOFTWARE SYSTEM B

Entities  f1  f2  f3  f4
E1        1   1   0   0
E2        1   1   0   0
E3        1   1   1   1
E4        1   1   1   1

Fig. 1. Relationships between entities in software system A

TABLE VII
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM A

Entities  E1    E2    E3    E4
E1        0
E2        0.18  0
E3        0.08  0.08  0
E4        0.08  0.08  0.25  0

TABLE VIII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM A

Entities  E1    E2    E3    E4
E1        0
E2        0.28  0
E3        0.28  0.28  0
E4        0.28  0.28  0.42  0

Fig. 2. Relationships between entities in software system B

TABLE IX
SIMILARITY MATRIX USING SIMPLE MATCHING FOR SOFTWARE SYSTEM A

Entities  E1    E2    E3    E4
E1        0
E2        0.71  0
E3        0.57  0.57  0
E4        0.57  0.57  0.71  0
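Which pair each measure would merge first for software system A can be checked with a short script. The sketch below is illustrative only: the feature vectors are transcribed from Table V, the helper names are hypothetical, and an "arbitrary decision" is taken to mean that more than one pair attains the maximum similarity.

```python
import math
from itertools import combinations

# Feature vectors of software system A, transcribed from Table V (f1..f7).
SYSTEM_A = {
    "E1": [1, 1, 0, 1, 0, 0, 0],
    "E2": [1, 1, 0, 0, 1, 0, 0],
    "E3": [1, 1, 1, 0, 0, 1, 0],
    "E4": [1, 1, 1, 0, 0, 0, 1],
}


def counts(x, y):
    """Contingency counts (a, b, c, d) for two binary vectors."""
    a = sum(p & q for p, q in zip(x, y))
    b = sum(p & (1 - q) for p, q in zip(x, y))
    c = sum((1 - p) & q for p, q in zip(x, y))
    d = len(x) - a - b - c
    return a, b, c, d


# No guard for a + b + c == 0; every pair in Table V shares a feature.
MEASURES = {
    "Jaccard": lambda a, b, c, d: a / (a + b + c),
    "Jaccard-NM": lambda a, b, c, d: a / (2 * (a + b + c) + d),
    "Russell & Rao": lambda a, b, c, d: a / (a + b + c + d),
    "Simple Matching": lambda a, b, c, d: (a + d) / (a + b + c + d),
}


def most_similar_pairs(system, measure):
    """Return every pair that attains the maximum similarity."""
    scores = {
        (p, q): measure(*counts(system[p], system[q]))
        for p, q in combinations(sorted(system), 2)
    }
    best = max(scores.values())
    return [pair for pair, s in scores.items() if math.isclose(s, best)]


for name, measure in MEASURES.items():
    print(name, most_similar_pairs(SYSTEM_A, measure))
```

Under these assumptions only Simple Matching returns two candidate pairs, (E1, E2) and (E3, E4); the other three measures single out (E3, E4).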

An example feature matrix with 4 entities (E1-E4) and 4 features (f1-f4) of a software system B for this case is presented in Table X and shown in Figure 2. In this system the value of a is 2 for entities E1 and E2. For entities E3 and E4, the value of a is 4. The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XI - Table XIII. It can be seen from Table XI that using the Jaccard measure, the pair E1 and E2 and the pair E3 and E4 are found to be equally similar. It may be better to choose E3 and E4 for clustering rather than E1 and E2, as they share a larger number of features. Both Jaccard-NM and Russell & Rao find E3 and E4 to be more similar, so an arbitrary decision is avoided.
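As a worked check of Case 1, the three measures can be evaluated by hand from Table X. For the pair (E1, E2) the counts are a = 2, b = c = 0, d = 2, and for (E3, E4) they are a = 4, b = c = d = 0:

```latex
\begin{aligned}
\text{Jaccard:}       &\quad \frac{2}{2 + 0 + 0} = 1,
                      &\quad \frac{4}{4 + 0 + 0} &= 1 \\
\text{Jaccard-NM:}    &\quad \frac{2}{2(2 + 0 + 0) + 2} = \frac{1}{3},
                      &\quad \frac{4}{2(4 + 0 + 0) + 0} &= \frac{1}{2} \\
\text{Russell \& Rao:}&\quad \frac{2}{2 + 0 + 0 + 2} = \frac{1}{2},
                      &\quad \frac{4}{4 + 0 + 0 + 0} &= 1
\end{aligned}
```

Jaccard cannot separate the two pairs because neither has any mismatch (b = c = 0), whereas d in the denominator of the other two measures penalizes the pair that shares fewer of the system's features.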
The similarity matrix according to the Jaccard measure is given in Table VI. The similarity matrices according to the Jaccard-NM, Russell & Rao and Simple Matching measures (all of which contain d) are given in Table VII - Table IX. It can be seen from Table VI - Table VIII that the Jaccard, Jaccard-NM and Russell & Rao measures find E3 and E4 to be most similar. From Figure 1, it is clear that E3 and E4 should indeed be considered most similar. However, due to the presence of d in the numerator of the Simple Matching coefficient, it finds E1 & E2 and E3 & E4 to be equally similar, resulting in an arbitrary decision where either of these pairs may be grouped. From this example, it is clear that the significant factor here is whether d is present in the numerator or the denominator of a measure. Its presence in the numerator deteriorates results (as for the Simple Matching coefficient). However, if it is present in the denominator only, it does not indicate similarity but it is a useful indicator of the proportion of common and total

TABLE XI
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM B

Entities  E1    E2    E3    E4
E1        0
E2        1     0
E3        0.5   0.5   0
E4        0.5   0.5   1     0

TABLE XII
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM B

Entities  E1    E2    E3    E4
E1        0
E2        0.3   0
E3        0.25  0.25  0
E4        0.25  0.25  0.5   0


TABLE XIII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM B

Entities  E1    E2    E3    E4
E1        0
E2        0.5   0
E3        0.5   0.5   0
E4        0.5   0.5   1     0

TABLE XVI
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM C

Entities  E1    E2    E3    E4
E1        0
E2        0.25  0
E3        0.18  0.18  0
E4        0.18  0.18  0.33  0

- Case 2: The value of a is high among entities, but they are not completely similar.

An example feature matrix with 4 entities (E1-E4) and 6 features (f1-f6) of a software system C for this case is presented in Table XIV and Figure 3. The corresponding similarity matrices using the Jaccard, Jaccard-NM and Russell & Rao measures are given in Table XV - Table XVII. It can be seen that entities E1 and E2 are found to be most similar by Jaccard. However, Jaccard-NM and Russell & Rao find E3 and E4 to be most similar, which may be more appropriate.

TABLE XIV
SOFTWARE SYSTEM C

Entities  f1  f2  f3  f4  f5  f6
E1        1   1   0   0   0   0
E2        1   1   0   0   0   0
E3        1   1   1   1   1   0
E4        1   1   1   1   0   1

TABLE XVII
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM C

Entities  E1    E2    E3    E4
E1        0
E2        0.33  0
E3        0.33  0.33  0
E4        0.33  0.33  0.6   0

when Russell & Rao already exists. To answer this question, consider the following example:

- Case 3: The value of a is the same but the values of b and c are not.

Consider Table XVIII and Figure 4, which have four entities (E1-E4) and five features (f1-f5). All pairs of entities have the same value of a, equal to three, but entities E1 and E2 have b = 0 and c = 0 while E3 and E4 have b = 1 and c = 1.

TABLE XVIII
SOFTWARE SYSTEM D

Entities  f1  f2  f3  f4  f5
E1        1   1   1   0   0
E2        1   1   1   0   0
E3        1   1   1   1   0
E4        1   1   1   0   1

Fig. 3. Relationships between entities in software system C

TABLE XV
SIMILARITY MATRIX USING JACCARD FOR SOFTWARE SYSTEM C

Entities  E1    E2    E3    E4
E1        0
E2        1     0
E3        0.4   0.4   0
E4        0.4   0.4   0.6   0

Fig. 4. Relationships between entities in software system D

Through Case 1 and Case 2, we have shown that both the Jaccard-NM and Russell & Rao measures are expected to provide better results as compared to the Jaccard measure. The question arises as to why we need to define Jaccard-NM

The corresponding similarity matrices using the Russell & Rao and Jaccard-NM measures are given in Table XIX and Table XX respectively. It can be seen from Table XIX that Russell & Rao results in arbitrary decisions among all entities. But it can be seen from Table XX that Jaccard-NM reduces arbitrary decisions and gives preference to E1 and E2 to form a cluster in the first step. Hence in certain cases, the results of Jaccard-NM and Russell & Rao are different, with Jaccard-NM reducing the arbitrary decisions which have a negative impact on the clustering results.
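The effect described for Case 3 can be reproduced with a small agglomerative clustering sketch. The code below is an illustration under stated assumptions, not the experimental implementation: the feature vectors are transcribed from Table XVIII, the Complete Linkage update of Section III-C is used, a step counts as an arbitrary decision when more than one pair of clusters attains the maximum similarity, and ties are broken by simply taking the first such pair. All function names are hypothetical.

```python
import math
from itertools import combinations

# Feature vectors of software system D, transcribed from Table XVIII (f1..f5).
SYSTEM_D = {
    "E1": [1, 1, 1, 0, 0],
    "E2": [1, 1, 1, 0, 0],
    "E3": [1, 1, 1, 1, 0],
    "E4": [1, 1, 1, 0, 1],
}


def counts(x, y):
    """Contingency counts (a, b, c, d) for two binary vectors."""
    a = sum(p & q for p, q in zip(x, y))
    b = sum(p & (1 - q) for p, q in zip(x, y))
    c = sum((1 - p) & q for p, q in zip(x, y))
    return a, b, c, len(x) - a - b - c


def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)


def jaccard_nm(x, y):
    a, b, c, d = counts(x, y)
    return a / (2 * (a + b + c) + d)


def complete_linkage(system, measure):
    """Return (merge order, number of steps that had a tie for the best pair)."""
    clusters = [(name,) for name in sorted(system)]
    sim = {frozenset((p, q)): measure(system[p[0]], system[q[0]])
           for p, q in combinations(clusters, 2)}
    merges, arbitrary = [], 0
    while len(clusters) > 1:
        pairs = list(combinations(clusters, 2))
        best = max(sim[frozenset(pair)] for pair in pairs)
        tied = [pair for pair in pairs
                if math.isclose(sim[frozenset(pair)], best)]
        if len(tied) > 1:
            arbitrary += 1  # several equally good pairs: arbitrary decision
        p, q = tied[0]      # assumed tie-break: first pair in listing order
        merged = p + q
        clusters = [cl for cl in clusters if cl not in (p, q)]
        for other in clusters:
            # Complete Linkage: keep the minimum similarity to the members.
            sim[frozenset((merged, other))] = min(sim[frozenset((p, other))],
                                                  sim[frozenset((q, other))])
        clusters.append(merged)
        merges.append(merged)
    return merges, arbitrary


rr_merges, rr_arbitrary = complete_linkage(SYSTEM_D, russell_rao)
nm_merges, nm_arbitrary = complete_linkage(SYSTEM_D, jaccard_nm)
print(rr_arbitrary, nm_arbitrary)  # 2 1
print(nm_merges[0])                # ('E1', 'E2')
```

With Russell & Rao the very first merge is already a tie among all six pairs, whereas Jaccard-NM selects (E1, E2) without a tie in the first step.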


TABLE XIX
SIMILARITY MATRIX USING RUSSELL & RAO FOR SOFTWARE SYSTEM D

Entities  E1    E2    E3    E4
E1        0
E2        0.6   0
E3        0.6   0.6   0
E4        0.6   0.6   0.6   0

TABLE XX
SIMILARITY MATRIX USING JACCARD-NM FOR SOFTWARE SYSTEM D

Entities  E1    E2    E3    E4
E1        0
E2        0.37  0
E3        0.33  0.33  0
E4        0.33  0.33  0.3   0

TABLE XXI
BRIEF DESCRIPTION OF DATA SETS

S. No.  Description                                        PLP     SAVT    PLC
1       Total number of source code lines                  50661   27311   51768
2       Total number of header (.h) files                  30      70      27
3       Total number of implementation (.cpp, .cxx) files  28      37      27
4       Total number of classes                            72      97      69

B. Entities and Features

B. Unbiased Ellenberg-NM: A new similarity measure for non-binary features

Unbiased Ellenberg is a Jaccard-like measure for non-binary features, as given in Equation 3.

\[ \text{Unbiased Ellenberg} = \frac{0.5\,\mathit{Ma}}{0.5\,\mathit{Ma} + b + c} \tag{3} \]
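Equation 3, and the -NM variant that this section goes on to define in Equations 4 to 6, can be sketched for two non-binary feature vectors as follows. This is a hedged illustration: the function names and example vectors are hypothetical, a feature is treated as present when its value is greater than zero, and Ma is assumed to be the sum of the values that both entities hold for their shared features.

```python
from typing import Sequence, Tuple


def ellenberg_terms(x: Sequence[float], y: Sequence[float]) -> Tuple[float, int, int, int, int]:
    """Return (Ma, a, b, c, d) for two non-negative feature vectors."""
    # Assumption: Ma adds up both entities' values over the shared features.
    ma = sum(xi + yi for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    a = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    b = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi > 0)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return ma, a, b, c, d


def unbiased_ellenberg(x, y) -> float:
    ma, _, b, c, _ = ellenberg_terms(x, y)
    denom = 0.5 * ma + b + c
    return 0.5 * ma / denom if denom else 0.0  # Equation 3


def unbiased_ellenberg_nm(x, y) -> float:
    ma, a, b, c, d = ellenberg_terms(x, y)
    denom = 0.5 * ma + 2 * (b + c) + a + d
    return 0.5 * ma / denom if denom else 0.0  # Equation 6


# Hypothetical example: feature values count how often a feature is used.
x = [2, 1, 0, 0]
y = [2, 0, 3, 0]
print(ellenberg_terms(x, y))       # (4, 1, 1, 1, 1)
print(unbiased_ellenberg(x, y))    # 0.5
print(unbiased_ellenberg_nm(x, y)) # 0.25
```

As with Jaccard-NM, the total number of features n = a + b + c + d enters only the denominator, so features absent from both entities lower the similarity instead of raising it.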

Since all systems are object-oriented, we selected class as an entity. From the different relationships that exist between classes, we selected eleven sibling (indirect) relationships [20] listed in Table XXII, since the similarity measures listed in Table II can only be applied to indirect relationships. We used these relationships because they occur frequently within object-oriented systems.

C. Similarity Measures

To find the similarity between entities having binary features we selected the Jaccard, Jaccard-NM and Russell & Rao similarity measures. For non-binary features we selected the Unbiased Ellenberg and Information Loss measures and compared their results with our newly proposed measure, Unbiased Ellenberg-NM.

D. Algorithms

To cluster the most similar entities we selected agglomerative clustering algorithms including Complete linkage, Weighted average and Unweighted average, described in Section III-C. We also selected the Weighted Combined Algorithm [12] and LIMBO [2].

E. Assessment

We obtained expert decompositions for each test system and compared our automatically produced clustering results with the expert decompositions at each step of hierarchical clustering using MoJoFM [25]. Results are reported by selecting the maximum MoJoFM value obtained during the clustering process. For internal assessment, the results obtained by the measures were evaluated by the number of arbitrary decisions taken during the clustering process.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

A. External evaluation of results for binary features

In this section, we present experimental results of the Complete Linkage (CL), Weighted Average Linkage (WAL) and Unweighted Average Linkage (UWAL) algorithms using the Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR) similarity measures. Table XXIII and Figure 5 present the results of the comparison between automatically obtained decompositions and expert decompositions using MoJoFM. From Figure 5 one can see that

The cases discussed in Section IV-A can also occur in a non-binary feature matrix. Therefore, to solve these problems, we propose a new measure called Unbiased Ellenberg-NM. Our new measure is defined as follows.

\[
\begin{aligned}
\text{Unbiased Ellenberg-NM} &= \frac{0.5\,\mathit{Ma}}{0.5\,\mathit{Ma} + b + c + n} && (4) \\
&= \frac{0.5\,\mathit{Ma}}{0.5\,\mathit{Ma} + b + c + (a + b + c + d)} && (5) \\
&= \frac{0.5\,\mathit{Ma}}{0.5\,\mathit{Ma} + 2(b + c) + a + d} && (6)
\end{aligned}
\]

V. EXPERIMENTAL SETUP

In this section, we describe the test systems and clustering setup for our experiments.

A. The Test Systems

To conduct clustering experiments, we selected three object oriented software systems which have been developed in Visual C++ [20]. These are proprietary software systems that run under Windows platforms. Statistical Analysis Visualization Tool (SAVT) is an application which provides functionality related to statistical data and result visualization. Printer Language Converter (PLC) is a part of another system, which provides conversion of intermediate data structures to printer language. Print Language Parser (PLP) is a parser of a well known printer language. It transforms plain text and stores output in intermediate data structures. A brief description is given in Table XXI.


TABLE XXII
INDIRECT RELATIONSHIPS BETWEEN CLASSES THAT WERE USED FOR EXPERIMENTS

Name                         Description
Same Inheritance Hierarchy   Two or more classes that are derived from the same class
Same Class Containment       Represents that classes contain objects of the same class
Same Class in Methods        Represents classes containing objects of the same class declared in a method locally or as a parameter
Same Generic Class           Represents that two classes are used as instantiating parameters to the same generic class
Same Generic Parameter       The relationship between two generic classes which have the same class as their parameter
Same File                    The source code of two or more classes is written in the same file
Same Folder                  Two or more classes reside in the same folder
Same Global Function Access  Two or more classes access the same global functions
Same Macro Access            Two or more classes access the same macro
Same Global Variable Access  Two or more classes access the same global variable

in all data sets Jaccard-NM and Russell & Rao give results equal to or better than Jaccard for all algorithms. From Table XXIII and Figure 6 it can be seen that, on average, Jaccard-NM and Russell & Rao produce significantly better results than the Jaccard similarity measure for all linkage algorithms.

Fig. 6. Average MoJoFM using Jaccard(J), Jaccard-NM(JNM) and Russell & Rao(RR)

Ellenberg-NM measures and LIMBO using the Information Loss measure. Figure 7 indicates that Unbiased Ellenberg-NM gives better results as compared to the Unbiased Ellenberg and Information Loss measures. We analyze the reason for the better results of Unbiased Ellenberg-NM in the next section.
Fig. 5. Experimental results using MoJoFM values for Complete(CL),Unweighted Average(UWAL) and Weighted Average(WAL) using Jaccard(J), Jaccard-NM(JNM) and Russell & Rao(RR) similarity measures

TABLE XXIII M O J O FM VALUES OF JACCARD , JACCARD -NM AND RUSSELL & R AO MEASURES FOR ALL DATA S ETS AND L INKAGE A LGORITHMS PLP JNM RR 60 55 46 46 52 54 53 52 SAVT JNM 54 54 53 54 PLC JNM RR 65 64 50 52 55 55 57 57

CL UWAL WAL Average

J 51 43 46 47

J 54 49 48 50

RR 58 55 49 54

J 61 47 56 55

B. External evaluation of results for non-binary features Figure 7 and Table XXIV show results of applying Weighted Combined algorithm using Unbiased Ellenberg and Unbiased

Fig. 7. MoJoFM results for Weighted Combined (WC) using Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) measures and the Information Loss (IL) measure


TABLE XXIV
EXPERIMENTAL RESULTS USING MOJOFM VALUES FOR UNBIASED ELLENBERG (UE) AND UNBIASED ELLENBERG-NM (UENM) USING WEIGHTED COMBINED ALGORITHM AND LIMBO USING INFORMATION LOSS MEASURE FOR ALL DATA SETS

| Data set | UE | UENM | IL |
|---|---|---|---|
| PLP | 70 | 73 | 74 |
| SAVT | 68 | 74 | 68 |
| PLC | 68 | 71 | 67 |

Fig. 8. Average number of arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)

Fig. 9. Experimental results for arbitrary decisions using Complete Linkage (CL) with Jaccard (J), Jaccard-NM (JNM) and Russell & Rao (RR)
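The arbitrary decisions plotted in Figures 8 and 9 can be illustrated with a small Python sketch. It assumes that an arbitrary decision arises whenever more than one entity pair shares the maximum similarity at a clustering step, that Russell & Rao is a / (a + b + c + d), and that Jaccard-NM is a / (a + b + c + n) with n = a + b + c + d, by analogy with Eq. (4). The three entity pairs and all function names are hypothetical.

```python
import math


def jaccard(a, b, c, d):
    """Jaccard: ignores d, the features absent from both entities."""
    total = a + b + c
    return a / total if total else 0.0


def russell_rao(a, b, c, d):
    """Russell & Rao: shared features over all n = a + b + c + d features."""
    n = a + b + c + d
    return a / n if n else 0.0


def jaccard_nm(a, b, c, d):
    """Jaccard-NM, assumed here to be a / (a + b + c + n) by analogy with Eq. (4)."""
    n = a + b + c + d
    denominator = a + b + c + n
    return a / denominator if denominator else 0.0


def ties_at_maximum(similarities):
    """Number of pairs sharing the maximum similarity (more than 1 means a tie)."""
    best = max(similarities)
    return sum(1 for s in similarities if math.isclose(s, best))


# Hypothetical (a, b, c, d) counts for three entity pairs over n = 5 features.
# The first two pairs overlap perfectly but share a different number of features.
pairs = [(1, 0, 0, 4), (3, 0, 0, 2), (2, 1, 0, 2)]

for measure in (jaccard, jaccard_nm, russell_rao):
    values = [measure(*counts) for counts in pairs]
    print(measure.__name__, [round(v, 3) for v in values], ties_at_maximum(values))
```

In this example Jaccard assigns similarity 1 to both perfectly overlapping pairs, so the choice between them is arbitrary, whereas the two measures that involve n rank the pair with three shared features highest.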

C. Internal evaluation using arbitrary decisions

Figure 8 presents the arbitrary decisions taken as a result of applying the Jaccard, Jaccard-NM and Russell & Rao measures throughout the clustering process for all test systems. We can see from Figure 9 that in the first thirteen steps of the clustering process for PLP, in the first quarter for SAVT and in the first half for PLC, the Jaccard similarity measure results in more arbitrary decisions than Jaccard-NM and Russell & Rao. This is due to entity pairs that have a Jaccard similarity value equal to 1 while their values of a differ, which creates a large number of arbitrary decisions. This is case 1, which we have defined and for which we proposed Jaccard-NM. In this case Russell & Rao also gives better results. It can be seen from Figure 9 that for PLP, the number of arbitrary decisions by Russell & Rao is higher than for Jaccard and Jaccard-NM. It can also be seen that for SAVT and PLC the behavior of Jaccard-NM and Russell & Rao is almost the same. This difference in the PLP data set is due to case 3 defined in Section IV-A.

The average arbitrary decisions for the Unbiased Ellenberg, Unbiased Ellenberg-NM and Information Loss measures are presented in Figure 10. It was expected that the number of arbitrary decisions by Unbiased Ellenberg-NM would be lower than for the other similarity measures. The experimental results confirm our expectations. We can see that Information Loss results in fewer arbitrary decisions while Unbiased Ellenberg results in more [1]. Moreover, our new measure Unbiased Ellenberg-NM results in fewer arbitrary decisions than Information Loss, thus producing the best clustering results.

Fig. 10. Average number of arbitrary decisions using Weighted Combined (WC) with Unbiased Ellenberg (UE) and Unbiased Ellenberg-NM (UENM) and Limbo using Information Loss (IL) measure

Thus from our analysis and experimental results we conclude that:

• When the feature vector has d = 0, Jaccard-NM and Russell & Rao become equivalent to the Jaccard measure.
• Russell & Rao depends on a only.
• Jaccard-NM and Russell & Rao produce better clustering results than Jaccard by reducing arbitrary decisions.
• Unbiased Ellenberg-NM substantially decreases the number of arbitrary decisions as compared to Unbiased Ellenberg and Information Loss for non-binary features, producing significantly better clustering results.

VII. CONCLUSIONS

Various binary and non-binary similarity measures have been used during clustering for software architecture recovery. Each of the measures has its own characteristics. Previous research suggests that similarity measures which do not consider the absence of features, d, perform well for software clustering and those that include d do not. Amongst the measures not containing d, the Jaccard measure produces the best results. In this paper, we analyzed the performance of the Jaccard measure (which does not contain d) and of the Jaccard-NM and Russell & Rao measures (which contain d) using various cases that may arise in the feature matrix of a software system. We identified deficiencies of the Jaccard measure and showed how Jaccard-NM and Russell & Rao give better results than Jaccard. This is because they use d not to determine similarity, but to determine the proportion of common and total features. We also showed how Jaccard-NM is capable of reducing arbitrary decisions, which may be problematic during the clustering process. We also defined the non-binary counterpart of Jaccard-NM, Unbiased Ellenberg-NM, and compared its performance with the Unbiased Ellenberg and Information Loss measures. Similar to Jaccard-NM, it reduces arbitrary decisions and results in better clusters. In the future, it will be interesting to evaluate the performance of the Jaccard-NM, Russell & Rao and Unbiased Ellenberg-NM measures on other systems.

ACKNOWLEDGMENT

The authors would like to thank Mr. Abdul Qudus Abbasi for providing the software test systems.

REFERENCES
[1] O. Maqbool and H. A. Babri, "Hierarchical clustering for software architecture recovery," IEEE Trans. Software Eng., vol. 33, no. 11, pp. 759-780, November 2007.
[2] P. Andritsos and V. Tzerpos, "Information theoretic software clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150-165, February 2005.
[3] C. Tjortjis, L. Sinos, and P. Layzell, "Facilitating program comprehension by mining association rules from source code," Proc. Int'l Workshop Program Comprehension, pp. 125-132, May 2003.
[4] P. Tonella, "Concept analysis for module restructuring," IEEE Trans. Software Eng., vol. 27, pp. 351-363, Apr 2001.
[5] M. Consens, A. Mendelzon, and A. Ryman, "Visualizing and querying software structures," Proc. of the Int'l Conference on Software Engineering (ICSE), vol. 133, pp. 138-156, May 1992.
[6] N. Anquetil and T. C. Lethbridge, "Experiments with clustering as a software remodularization method," Proc. Working Conference Reverse Engineering (WCRE), pp. 235-255, 1999.
[7] J. Davey and E. Burd, "Evaluating the suitability of data clustering for software remodularization," Proc. Working Conf. Reverse Eng., pp. 268-276, November 2000.
[8] R. Naseem, O. Maqbool, and S. Muhammad, "An improved similarity measure for binary features in software clustering," Proc. of the Int'l Conference on Computational Intelligence, Modelling and Simulation (CIMSim), pp. 111-116, September 2010.
[9] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43-48, 2010.
[10] N. Anquetil and T. Lethbridge, "Comparative study of clustering algorithms and abstract representations for software remodularisation," Software, IEE Proceedings, vol. 150, no. 3, pp. 185-201, 2003.
[11] M. Saeed, O. Maqbool, H. A. Babri, S. Hassan, and S. Sarwar, "Software clustering techniques and the use of combined algorithm," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 301-306, March 2003.
[12] O. Maqbool and H. A. Babri, "The weighted combined algorithm: a linkage algorithm for software clustering," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 15-24, 2004.
[13] B. S. Mitchell and S. Mancoridis, "On the automatic modularization of software systems using the Bunch tool," IEEE Trans. Software Eng., vol. 32, no. 3, pp. 193-208, March 2006.
[14] S. Mancoridis, B. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner, "Using automatic clustering to produce high-level system organizations of source code," Proc. 6th Int'l Workshop on Program Comprehension, pp. 45-53, 1998.
[15] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner, "Bunch: A clustering tool for the recovery and maintenance of software system structures," IEEE Int'l Conference on Software Maintenance, p. 50, 1999.
[16] B. S. Mitchell and S. Mancoridis, "Using heuristic search techniques to extract design abstractions from source code," Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1375-1382, 2002.
[17] M. Harman, S. Swift, and K. Mahdavi, "An empirical study of the robustness of two module clustering fitness functions," Proc. Genetic and Evolutionary Computation Conference, pp. 1029-1036, June 2005.
[18] P. Willett, "Similarity-based approaches to virtual screening," Biochemical Society Transactions, vol. 31, no. 3, pp. 603-606, Jun 2003.
[19] S. Dalirsefat, A. da Silva Meyer, and S. Mirhoseini, "Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori," Journal of Insect Science, vol. 71, pp. 1-8, 2009.
[20] A. Q. Abbasi, "Application of appropriate machine learning techniques for automatic modularization of software systems," MPhil. thesis, Quaid-i-Azam University Islamabad, 2008.
[21] Z. Wen and V. Tzerpos, "Evaluating similarity measures for software decompositions," Proc. Int'l Conf. Software Maintenance, pp. 368-377, September 2004.
[22] N. Anquetil, C. Fourier, and T. C. Lethbridge, "Experiments with hierarchical clustering algorithms as software remodularization methods," Proc. Working Conf. Reverse Eng., 1999.
[23] Y. Kanellopoulos, P. Antonellis, C. Tjortjis, and C. Makris, "k-attractors: A clustering algorithm for software measurement data analysis," Proc. 19th IEEE Int'l Conference on Tools with Artificial Intelligence, pp. 358-365, 2007.
[24] A. Lakhotia, "A unified framework for expressing software subsystem classification techniques," Journal of Systems and Software, vol. 36, pp. 211-231, 1997.
[25] Z. Wen and V. Tzerpos, "An effectiveness measure for algorithms," Proc. Int'l Workshop Program Comprehension, pp. 194-203, June 2004.
[26] M. Shtern and V. Tzerpos, "A framework for the comparison of nested software decompositions," Proc. of the 11th IEEE Working Conf. Reverse Engineering, pp. 284-292, 2004.
