
www.ietdl.org
Published in IET Software Received on 14th February 2012 Revised on 25th April 2012 doi: 10.1049/iet-sen.2012.0027

ISSN 1751-8806

Enhancing comprehensibility of software clustering results


F. Siddique O. Maqbool
Department of Computer Science, Quaid-i-Azam University, Islamabad, Pakistan E-mail: onaiza@qau.edu.pk; omaqbool@gmail.com

Abstract: As requirements of organisations change, so do the software systems within them. When changes are carried out under tough deadlines, software developers often do not follow software engineering principles, which results in deteriorated structure of the software. A badly structured system is difficult to understand for further changes. To improve structure, re-modularisation may be carried out. Clustering techniques have been used to facilitate automatic re-modularisation. However, clusters produced by clustering algorithms are difficult to comprehend unless they are labelled appropriately. Manual assignment of labels is tiresome, thus efforts should be made towards automatic cluster label assignment. In this study, the authors focus on facilitating comprehension of software clustering results by automatically assigning meaningful labels to clusters. To assign labels, the authors use term weighting schemes borrowed from the domain of information retrieval and text categorisation. Although some term weighting schemes have been used by researchers for software cluster labelling, there is a need to analyse the term weighting schemes and related issues to identify the strengths and weaknesses of these schemes for software cluster labelling. In this context, the authors analyse the behaviour of seven well-known term weighting schemes. Also, they perform experiments on five software systems to identify software characteristics which affect the labelling behaviour of the term weighting schemes.

1 Introduction

Software systems must be updated to meet changing requirements. When a system is under continuous change, its structure and design deviate from the original design [1]. Such a software system is difficult to comprehend, which increases the time required to maintain it [2]. According to Kuhn et al. [3], the comprehension of software systems takes up to 60% of maintenance cost. Thus it is important to keep the system well structured by decomposing it into manageable modules. Several techniques have been proposed to decompose, re-modularise or recover the architecture of software systems, including concept analysis [4], latent semantic analysis [3, 5], graph partitioning [6-8], evolution strategies [9, 10] and software clustering [11, 12]. Clustering groups items such that entities in a cluster are similar to each other (high intra-similarity) and dissimilar to entities in other clusters (low inter-similarity). For software systems, intra-similarity and inter-similarity are analogous to cohesion and coupling, respectively [13], whereas a cluster is analogous to a software module or sub-system. A good modularisation of a software system should attain high cohesion and low coupling. Since the clustering concept is general, besides being applied for software architecture recovery, clustering techniques have been used to solve problems in various disciplines. For example, they have been used to improve recommender systems for web resources by learning user interests and ontologies [14, 15], to find relevant services [16], and to discover ontologies for
IET Softw., 2012, Vol. 6, Iss. 4, pp. 283-295 doi: 10.1049/iet-sen.2012.0027

the semantic web [17, 18]. Clustering has also been used to group together nodes in mobile ad hoc networks (MANETs) to improve performance [19] and to self-organise wireless sensor networks [20]. Clustering is regarded as a quasi-automatic technique [21], with the main objective of presenting the data in a comprehensible way. Although much research has been done on proposing new clustering algorithms and improving the accuracy and effectiveness of software clustering algorithms, relatively less work has been done on the comprehension of software clustering results [22]. Although clustering helps to decompose a software system into a manageable form, developers still need to go through the source code of entities (items such as functions, classes or files) in order to comprehend the functionality of each cluster or to locate a particular functionality. If these clusters are assigned labels, this can improve understanding of clustering results, as well as of the main functionality of each cluster. One way to comprehend software clustering results is to assign labels to clusters manually. However, this requires understanding of source code, which is tiresome and time consuming because of the size of software systems. Tzerpos and Holt [22], and Maqbool and Babri [23] argue that if labels can be assigned to software clusters by automatic means, it would help to understand software systems. According to Kuhn et al. [3], entity identifiers and comments are a basic source of information for understanding software systems, since this is where developers put their knowledge of

© The Institution of Engineering and Technology 2012

software systems. Languages are meant for communication, and the same is the case with programming languages, which communicate at two levels: first, at the human-machine level through programming and second, at the human-human level through comments and identifiers. Therefore, by incorporating the developers' knowledge about source code, automatic label assignment may help in comprehending the software system at hand. In order to assign labels to clusters automatically, it is important to select a meaningful concept that represents the functionality of entities in a cluster. For this purpose term weighting schemes (TWS) may be used. TWS are mainly borrowed from the field of information retrieval (IR) and text categorisation (TC), and are used to find the weight of terms with respect to their importance in a document as well as in a collection of documents. Hence, they may be tailored for labelling software clusters. Although TWS have been used in IR to select important terms in documents, there are very few studies in which they have been applied to software cluster labelling. Two TWS, that is, term frequency and inverse document frequency, are used in [23] for labelling software clusters. The results of these two schemes are compared based on experiments on three test systems. A brief analysis of four TWS is presented in [24]. Labels are assigned to clusters using these four TWS for two software systems. Since naming conventions (e.g. for functions, files, folders etc.) within the systems are not described, it is difficult to evaluate their effect on labels assigned by the different TWS and to highlight correspondence between them. Our focus in this paper is to analyse the behaviour of well-known TWS used in IR for labelling software clusters. To successfully apply TWS to software cluster labelling, it is necessary to identify the strengths and weaknesses of these schemes in the software domain.
These strengths and weaknesses may depend on characteristics of a software system and its naming conventions. In this regard, our main research contributions in this paper are as follows:

• We explain the behaviour of seven well-known TWS used in IR and TC, that is, term frequency, inverse document frequency, relevant frequency, odds-ratio, chi-square, information gain and gain ratio, for software cluster labelling. We use an entity-cluster distribution of representative terms in software clusters for a detailed analysis of each TWS.

• We analyse software characteristics, that is, naming conventions in software systems, and suggest which TWS is expected to perform better than others, and which TWS will fail to assign meaningful labels, when a software system follows a certain naming convention. This analysis helps in choosing an appropriate weighting scheme for labelling a software system.

• We conduct experiments to cluster five software systems and assign labels to clusters automatically using the seven TWS. Based on experimental results, characteristics of different schemes are discussed in detail.

The rest of this paper is organised as follows. Section 2 briefly describes steps in clustering for software modularisation. Section 3 discusses TWS and their characteristics for software labelling. In Section 4, we describe some common naming conventions and their effects on labelling. Section 5 details the setup for our experiments. Section 6 presents the results of clustering and labelling using the TWS and summarises their strengths and weaknesses. In Section 7, we briefly discuss work related to

clustering, cluster labelling and TWS. Finally, in Section 8 we present the conclusions.

2 Clustering for software re-modularisation

Clustering techniques have been widely used to create taxonomies or to group large data sets in various disciplines, for example, biology and astronomy. Two broad categories of clustering algorithms, as described in [25], are partitional and hierarchical. Partitional algorithms divide data into a certain number of clusters, which should be known in advance. After initially partitioning the data, each step of the algorithm modifies the contents of the clusters so as to optimise a defined criterion. The result is a flat decomposition. Hierarchical algorithms, on the other hand, produce a hierarchy or nested decomposition of clusters. Since a hierarchical representation is more appropriate for the modules of a software system, which may be divided into sub-systems at multiple levels, hierarchical algorithms have been widely used for architecture recovery and modularisation [26]. In this section, we briefly describe the steps in hierarchical clustering for software re-modularisation. Since clustering is used to group together entities based on their characteristics or features, an initial concern is the selection of entities for clustering, which are usually files [27], functions [9] or classes [28]. Each entity is associated with a set of features. Features are used to find similarities between entities, thus appropriate feature selection is necessary. Commonly used features for software modularisation include functions called by an entity, and user-defined types and global variables referred to by an entity [26]. After selection of entities and features, a binary feature matrix is usually built, in which 1 represents the presence of a feature in an entity and 0 represents its absence. The next step in clustering is to create a similarity matrix holding the similarity between every pair of entities. The higher the similarity value, the higher the coupling between the two entities. Thus the entity pair with the highest similarity is merged to form a cluster. A list of well-known similarity measures for software clustering can be found in [26].
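The feature matrix and similarity steps described above can be illustrated with a minimal sketch. It assumes the Jaccard coefficient as the similarity measure, which is only one of the measures surveyed in [26] and is not necessarily the one used in the experiments; the entity names and feature columns are hypothetical.

```python
from itertools import combinations

def jaccard(u, v):
    """Jaccard similarity between two binary feature vectors (assumed measure)."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 0.0

def most_similar_pair(features):
    """Return (similarity, name1, name2) for the most similar pair of entities."""
    best = None
    for e1, e2 in combinations(sorted(features), 2):
        s = jaccard(features[e1], features[e2])
        if best is None or s > best[0]:
            best = (s, e1, e2)
    return best

# Hypothetical entities (functions) with binary features. Columns:
# [calls draw(), calls erase(), uses global 'canvas', uses type 'Point']
features = {
    "draw_line":   [1, 0, 1, 1],
    "draw_circle": [1, 0, 1, 1],
    "erase_line":  [0, 1, 1, 1],
    "read_config": [0, 0, 0, 0],
}

# The pair returned here is the one a hierarchical algorithm would merge first.
print(most_similar_pair(features))
```

In a full hierarchical algorithm this step would be repeated, with the merged pair replaced by a single cluster whose similarity to the remaining entities is recomputed.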
When the two most similar entities have been merged into a cluster, the next step is to compute the similarity between the newly formed cluster and the other clusters or entities. For this, various clustering algorithms exist, for example, single linkage and complete linkage [29]. A more recently proposed algorithm for software clustering is the weighted combined algorithm (WCA) [30], which associates a new feature vector with the newly formed cluster. After clustering results have been produced, they must be evaluated. In external assessment, the automatic decomposition produced by a clustering algorithm is evaluated by comparison with an expert decomposition. A well-known measure for comparison is MoJoFM [31], whose value ranges between 0 and 100%. A higher value of MoJoFM indicates that the automatic decomposition is closer to the expert decomposition. Another measure, defined in [32], is ELF. It is defined in terms of segments. A cluster from an automatic decomposition is called a segment if the majority of its entities are also present in a cluster of the expert decomposition. A cluster from the automatic decomposition is called an ExtraneousCluster (E) if it is not a segment of any cluster in the expert decomposition. A cluster from the expert decomposition is called a LossCluster (L) if it does not have any segment in the automatic decomposition. Fragmentation (F) represents the number of clusters in the automatic decomposition into which a cluster from the expert decomposition is fragmented on average (a lower value is better).
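The ELF definitions above can be sketched as follows. This is an illustrative reading, not the implementation from [32]: it assumes that "majority" means a strict majority of a cluster's entities, and that F is the number of segments divided by the number of expert clusters that have at least one segment. The decompositions are hypothetical.

```python
def elf(automatic, expert):
    """Return (E, L, F) for two decompositions given as {cluster: set_of_entities}."""
    segments_of = {name: [] for name in expert}  # expert cluster -> its segments
    extraneous = 0
    for a_name, a_entities in automatic.items():
        # Expert cluster sharing the most entities with this automatic cluster.
        best = max(expert, key=lambda x: len(a_entities & expert[x]))
        overlap = len(a_entities & expert[best])
        if overlap * 2 > len(a_entities):        # strict majority: a segment
            segments_of[best].append(a_name)
        else:                                    # segment of no expert cluster
            extraneous += 1
    lost = sum(1 for segs in segments_of.values() if not segs)
    covered = len(expert) - lost
    n_segments = sum(len(segs) for segs in segments_of.values())
    # Assumed definition: average number of segments per covered expert cluster.
    fragmentation = n_segments / covered if covered else 0.0
    return extraneous, lost, fragmentation

# Hypothetical expert and automatic decompositions of eight entities.
expert = {
    "io":   {"read", "write", "open"},
    "net":  {"send", "recv"},
    "gui":  {"paint"},
    "util": {"log", "fmt"},
}
automatic = {
    "A1": {"read", "write"},
    "A2": {"open"},
    "A3": {"send", "recv", "fmt"},
    "A4": {"paint", "log"},
}
print(elf(automatic, expert))
```

Here A1 and A2 are segments of "io", A3 is a segment of "net", and A4 has no majority in any expert cluster, so "gui" and "util" are lost.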

3 Application of TWSs for software cluster labelling


TWSs are mainly borrowed from the field of IR and TC. TC is the task of automatically assigning unlabelled documents to a predefined labelled set of categories. It is a supervised task that uses prior information about the membership of documents in categories to build a classifier [33]. When such a classifier is built for each category, the chosen category (i.e. the category under process) is called the positive (+ve) category and the remaining categories are called the negative (-ve) category. Thus a classification problem having several categories is reduced to multiple binary classification problems [34]. In TC, documents are first converted into a compact format, that is, index terms, which can be syllables, words or phrases [34]. As documents contain a large set of index terms and every term might not contribute equally to the respective document, feature reduction schemes are used to reduce the large feature space. Debole and Sebastiani [33] suggested that, as these schemes are useful for feature reduction and show good results [35], they are also practical for weighting terms. In the context of software clustering, the cluster which is being labelled (i.e. the cluster under process) can be thought of as analogous to the +ve category (hence called the +ve cluster), and the other clusters are analogous to the -ve category (thus called the -ve cluster). Documents are represented by entities or clusters. Terms are information associated with software entities contained in clusters, for example, comments or identifiers. For each term, a contingency table may be computed as given in Table 1. Here C denotes the positive (+ve) cluster, C̄ stands for the negative (-ve) cluster, a denotes the number of entities in the +ve cluster in which term t is present, b denotes the number of entities in the +ve cluster in which t is absent, c denotes the number of entities in the -ve cluster in which t is present, d denotes the number of entities in the -ve cluster in which t is absent and N is the total number of entities.
To explain the various TWS in the software context, we use Fig. 1, which represents the entity-cluster distribution of eight terms t1-t8. For each term, its entity-cluster distribution is represented by a column. The horizontal line divides the distribution into +ve and -ve clusters. The shaded area represents the number of entities that contain the term and the non-shaded area represents the number of entities that do not contain the term. For example, for t1, above the horizontal line, the white non-shaded area depicts the number of entities in the +ve cluster that do not contain the term, that is, b, and the black-shaded area represents the number of entities that contain the term, that is, a. Below the horizontal line, the black-shaded area represents the number of entities that contain the term, that is, c, and the white non-shaded area depicts the number of entities in the -ve cluster that do not contain the term, that is, d.
Table 1 Document category contingency table

|   | t | t̄ |
|---|---|---|
| C | a | b |
| C̄ | c | d |
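As an illustration of how the counts a, b, c and d of Table 1 could be obtained when terms are taken from entity identifiers, the following sketch tokenises identifiers and counts term presence in the +ve and -ve clusters. The `tokenise` helper and the identifiers are hypothetical; the splitting rule (underscores, case changes, digits, with numeric tokens dropped) is an assumption made for the example.

```python
import re

def tokenise(identifier):
    """Split an identifier into lower-case terms and drop purely numeric ones."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return {p.lower() for p in parts if not p.isdigit()}

def contingency(term, positive, negative):
    """Return (a, b, c, d) of Table 1 for `term`.

    `positive` and `negative` are lists of entity identifiers in the
    +ve cluster and in all remaining (-ve) clusters, respectively.
    """
    a = sum(1 for e in positive if term in tokenise(e))
    c = sum(1 for e in negative if term in tokenise(e))
    return a, len(positive) - a, c, len(negative) - c

# Hypothetical clusters of function identifiers.
positive = ["draw_line", "draw_circle", "init_canvas"]
negative = ["net_send", "net_init", "readFile", "draw_text"]

for term in ("draw", "init", "net"):
    print(term, contingency(term, positive, negative))
```

In every case a + b equals the size of the +ve cluster and a + b + c + d equals N, the total number of entities.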

Fig. 1 Example of entity-cluster distribution of eight terms in +ve and -ve clusters

3.1 Term frequency (tf)

In this scheme, weights are assigned to terms based upon frequency, that is, the higher the frequency of a term in the +ve cluster, the higher the weight of the term. The value of tf is given as

$$\mathrm{tf} = a \tag{1}$$

tf does not require information about the -ve cluster; weights are computed using information obtained from the +ve cluster only. In Fig. 1, t4, t5 and t6 have the highest frequency, hence they are assigned the highest weight by tf. The descending order of terms in Fig. 1 with respect to tf is t4 = t5 = t6, t1 = t2, t3, t8 and t7.

3.2 Inverse document frequency (idf)

idf gives more weight to a term if it occurs in fewer documents [36]. The idea behind this approach is that a term occurring in many documents represents a domain concept rather than some specific concept and should be given less importance. idf is calculated as

$$\mathrm{idf} = \log\frac{N}{a + c} \tag{2}$$

where a and c are as defined in Table 1. In Fig. 1, t7 will be given the highest weight by idf because it occurs in the minimum number of documents. idf does not take into account the distribution among +ve and -ve clusters. idf will assign equal weight to t2 and t4, although the two terms have different distributions in the +ve and -ve clusters, as both terms have an equal value of a + c. The descending order of terms in Fig. 1 with respect to the weight of idf is t7, t3, t2 = t4, t1 = t5, t8 and t6.

3.3 Relevant frequency (rf)

rf is based upon the assumption that if a term is highly concentrated in the +ve cluster compared with the -ve cluster, this term is more important for that cluster [34]. Consider terms t2 and t4: both terms occur in the same number of documents, hence idf will give the same weight to both terms. However, rf will give more weight to t4 as it is more concentrated in the +ve cluster. Its value is computed as

$$\mathrm{rf} = \log\left(2 + \frac{a}{\max(1, c)}\right) \tag{3}$$

It is relevant to note that since the denominator in (3) contains max(1, c), the value of rf remains the same for c = 0 and c = 1. The descending order of terms in Fig. 1 with respect to the weight of rf is t4, t3, t5, t2, t1, t7, t6 and t8.

3.4 Odds-ratio (or)

It is a statistical measure which is used to find the probability of an event in two groups, and is given by

$$\mathrm{or} = \frac{ad}{bc} \tag{4}$$

For software clustering, it can be used to find important terms between two clusters. If the weight is greater than 1, then the term is more likely for the +ve cluster and vice versa. Its shortcoming is that if any of a, b, c or d is zero, its weight is zero. Consider t4 and t5; t4 has b = c = 0 and t5 has b = 0. It is clear from Fig. 1 that both terms are important as labels. However, or will assign them zero weight. The descending order of terms in Fig. 1 with respect to the weight of or is t2, t1, t7, t8, t3 = t4 = t5 = t6 = 0.

3.5 Chi-square (χ²)

It is an information theoretic [37] measure which measures the degree to which a term and a cluster are independent [38]. A term with the minimum χ² value is the most independent of the cluster. Since our objective is to find how much a term and a cluster depend on each other, the term with the highest χ² value may be selected [33]. χ² is given as

$$\chi^2 = \frac{N\,(ad - bc)^2}{(a + c)(a + b)(c + d)(b + d)} \tag{5}$$

χ² favours terms with a low value of c and a higher a to b ratio. Consider terms t1 and t2: χ² will give more weight to t2, because both have the same values for a and b, but the lower c in t2 means that t2 and the +ve cluster are more inter-dependent than t1 and the +ve cluster. Now consider t2 and t5: both have the same values for c and d, whereas a is higher for t5, which means that the +ve cluster is more dependent on t5 than on t2. The descending order of terms in Fig. 1 with respect to the weight of χ² is t4, t5, t3, t2, t1, t8, t6 and t7.

3.6 Information gain (ig)

It is an information theoretic function and measures the bits of information that one random variable holds about another random variable [38, 39]. In the context of software clusters, it can be regarded as how much information a term contains about the +ve cluster in which it is contained. If a term and a cluster are independent, then its value is zero. A higher value of ig means that the term contains more information about the +ve cluster. It can be computed as

$$\mathrm{ig} = \frac{a}{N}\log\frac{aN}{(a + c)(a + b)} + \frac{b}{N}\log\frac{bN}{(a + b)(b + d)} + \frac{c}{N}\log\frac{cN}{(a + c)(c + d)} + \frac{d}{N}\log\frac{dN}{(b + d)(c + d)} \tag{6}$$

The decreasing ordering of ig for terms in Fig. 1 is t4, t5, t3, t2, t1, t8, t6 and t7.

3.7 Gain ratio (gr)

It is also an information theoretic measure and is defined as the ratio of ig(t, C) and the entropy of term t and cluster C. Compared with ig, gr can apply unbiased weights. It can be computed as

$$\mathrm{gr} = \frac{\mathrm{ig}}{-\left(\dfrac{a + b}{N}\log\dfrac{a + b}{N} + \dfrac{c + d}{N}\log\dfrac{c + d}{N}\right)} \tag{7}$$

The decreasing ordering of gr for terms in Fig. 1 is t4, t5, t3, t2, t1, t8, t6 and t7.

4 Software characteristics and their effect on labelling

Software systems normally follow a certain naming convention for files, functions, classes and packages, which may affect labelling behaviour. Below are some naming conventions as well as their possible consequences on the selection of label terms. To elaborate a naming convention's effect on labels we use the eight terms given in Fig. 1.

4.1 System-based naming convention

In a system-based naming convention, a common abbreviation may be a prefix of every entity in the system. For example, if a system named Operating System is under consideration, then OS might be the suffix or prefix of almost every entity. In this case terms may resemble t6 in the clusters of an automated clustering. Since OS reflects an overall concept of the system rather than any specific concept, weighting schemes should give less weight to t6. Another option is that the system is divided into sub-systems based on the naming convention, and an acronym representing the sub-system is the prefix or suffix of the entities of that sub-system. For example, if two sub-systems in system OS are Process Management (PM) and Memory Management (MM), then functions of these sub-systems may have the suffix or prefix PM and MM, respectively. Then the most frequently occurring terms may resemble t1, t2, t4 and t5. If the system acronym is also included with each function, then function names might start with OS_PM and OS_MM. In such a case, the required names of the sub-systems are PM and MM, and t4 would be the better label term.

4.2 Folder-based naming convention

The main idea of the folder-based naming convention is the same as that of the system-based naming convention. In the folder-based scheme, developers usually divide the system into sub-systems using folders. A common acronym of the folder is the prefix of the system name. For sub-folders, either only the sub-folder acronym is attached or the folder_sub-folder acronym is attached. Both cases are the same as described in Section 4.1.

4.3 Concept-based naming convention

In a concept-based naming convention, an entity identifier normally represents the main concept or its functionality. The concept can be divided into various levels, that is, from files to classes to functions. File names normally represent the main concept of all functions or classes contained in the file. If a concept term is shared between entities in a system, then we can have terms resembling t1-t5. It is also possible that the main concept is defined at folder level and the concept term is not shared between entities; then the most common case is t7, which does not represent cluster functionality very well.

4.4 Conceptless naming convention

There are also cases when entity names are ambiguous or composed of non-tokenisable terms, and there is no way to separate the individual terms of an entity. For example, the identifier nsjcapistd has no special characters, capitals or numeric terms which might help to tokenise it. In such cases we have terms resembling t7 or, even worse, each term occurs only once in the complete system, which results in the selection of all terms contained in a cluster as a label.

5 Experimental setup

5.1 Test systems description

For experiments we selected five systems: two structured systems in C (Xfig [40] and Chocolate Doom [41]), two object-oriented systems in Java (Weka [42] and Compost [43]) and one structured/object-oriented system in C/C++ (Mozilla [44]). To obtain a detailed analysis we chose a sub-system from Xfig, that is, the d files, which contain 94 functions. We refer to this sub-system as Xfigd. For Chocolate Doom, we also selected a sub-system (CDnet), which contains 159 functions. Relevant statistics of the systems are presented in Table 2.

5.2 Software clustering process

To automatically decompose a software system using clustering, it is important to make appropriate decisions regarding entity, feature and algorithm selection according to the system's characteristics.

5.2.1 Entity selection: Xfigd and CDnet are structured, medium-sized systems, therefore we chose functions as entities. Mozilla is a large system, developed using both object-oriented and structured approaches. Thus, we selected files as clustering entities. Weka and Compost are medium-sized, object-oriented systems for which we chose classes as entities.

5.2.2 Feature selection: For function-/class-based clustering, we chose functions invoked, global variables referred to and user-defined types accessed by a function/class. For file-based clustering, we used the file calling feature. A file fi calls fj if functions of fj are invoked by a function declared in fi, and global variables and user-defined types defined in fj are accessed by a function declared in fi. We also used function/class/file identifiers as features. Mature software developers usually assign meaningful identifiers that describe the main functionality. Usually, two identifiers identifying entities which share some functionality also share one or more terms. To benefit from this idea, a tokenisation process was applied to identifiers, that is, terms like Net_SVModule1 were separated into (Net, SV, Module, 1), numeric terms were removed, and stemming was applied to reduce words to their roots [45]. We applied stemming by using the Snowball Stemmer [46].

5.2.3 Clustering algorithm: For clustering, we chose WCA [30] with the Ellenberg Unbiased similarity measure, as WCA produces a more understandable cluster hierarchy than other algorithms, for example, complete linkage [23].

5.3 Cluster label assignment process

To apply labels, the label assignment process is incorporated within the clustering algorithm. At every clustering step, as two clusters are merged, a label is assigned to the newly formed cluster. The advantage of assigning labels at every step is that we can have labels not only for major clusters but for every sub-cluster.

5.3.1 Term selection and preprocessing: For terms, we chose entity identifiers, as they are omnipresent [21], describe the higher-level objective of an entity and are easy to extract [29]. To obtain individual terms, the tokenisation procedure is the same as for features, except for the stemming process. For labels, we did not apply stemming because we need labels for human comprehension.

5.3.2 Selection of terms document-cluster contingency table scheme: To apply labels to a cluster, it is necessary to assign weights to terms. Then the term with the highest weight is assigned as label to the cluster. To calculate weights, the first step is to create the term contingency table (Section 3). For weighting we treated every entity identifier as a single document in both the positive and the
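Table 2 Test systems' statistics

| Systems | Version | LOC | Total no. header files (.h) | Total no. source files (.c, .cpp, .java) | Functions | Classes | Global variables | User-defined types |
|---|---|---|---|---|---|---|---|---|
| Xfig | 3.2.3 | 75 K | 86 | 99 | 1661 | - | 1746 | 828 |
| Chocolate Doom | 1.3.0 | 30 K | 140 | 146 | 1534 | - | 636 | 278 |
| Mozilla | 1.3 | 4.0 M | 69 | 115 (.c), 1018 (.cpp) | - | - | - | - |
| Weka | 3.4 | 100 K | - | 544 | 3336 | 331 | 182 | 0 |
| Compost | 0.4 | 50 K | - | 453 | 4296 | 469 | 217 | 0 |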

negative cluster. Thus (a + b) is the cardinality of the cluster being labelled and (c + d) is N - (a + b).

5.3.3 Selection of weighting schemes: The next step is to compute weights for terms, to decide which term is more important to be chosen as label. To weigh terms, we experimented with the seven weighting schemes described in Section 3. Let w_TWS(t_i) be the weight of term t_i computed by a TWS and tf_i the frequency of t_i in the +ve cluster; then local weights with respect to the +ve cluster are computed as

$$W_{\mathrm{TWS}}(t_i) = \mathrm{tf}_i \times w_{\mathrm{TWS}}(t_i)$$

(Local weights are computed in this manner for all TWS except tf. For tf, the ordering of terms with respect to weight does not change after this calculation, so we do not compute tf × tf.) Terms with the highest weight are selected as the label of the cluster.

5.4 Evaluation criteria for software clustering and labelling results

For evaluation purposes we used external evaluation, in which the results of an automatic decomposition and its cluster labels are compared against an expert decomposition. For comparison of the automatic decomposition, we selected MoJoFM. Along with MoJoFM, we also selected ELF. The expert decomposition and labels of Xfig are taken from Maqbool and Babri [23], available at [47]. Mozilla's expert decomposition and labels are taken from [27], available online at [48]. For Weka, we used the expert decomposition presented in [12], and for labels, we used the names of packages. The Compost expert decomposition was obtained from [43] and for its labels, we used the names of packages. For CD, we contacted an expert from the software industry, with three years of development experience, to develop an expert decomposition as well as to assign labels to clusters. We are using a hierarchical clustering algorithm, where results can be seen at different levels of abstraction. It is important to decide on a cut-off point, where the system is represented by meaningful clusters which are closer to the expert decomposition. For this we chose the clustering step with the highest MoJoFM value, where the automatic decomposition is most similar to the expert decomposition [24]. Labelled clusters formed at the highest MoJoFM are compared against expert labels to judge their meaningfulness.
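The weighting and label selection steps of Sections 5.3.2 and 5.3.3 can be sketched as follows, using the seven schemes of Section 3 computed from the counts a, b, c and d. This is an illustrative sketch rather than the authors' implementation: natural logarithms are assumed (the base does not change the ranking of terms), zero cells in ig and gr are treated as contributing nothing, and the term counts are hypothetical.

```python
import math

def _ig_term(x, n, row, col):
    """One summand of ig: (x/N) * log(x*N / (row*col)), taken as 0 when x == 0."""
    return 0.0 if x == 0 else (x / n) * math.log(x * n / (row * col))

def tws_weights(a, b, c, d):
    """Weights of one term under the seven schemes, from its counts a, b, c, d."""
    n = a + b + c + d
    ig = (_ig_term(a, n, a + c, a + b) + _ig_term(b, n, a + b, b + d)
          + _ig_term(c, n, a + c, c + d) + _ig_term(d, n, b + d, c + d))
    # Entropy of the +ve / -ve split, used as the denominator of gain ratio.
    entropy = -sum(p * math.log(p) for p in ((a + b) / n, (c + d) / n) if p > 0)
    chi_den = (a + c) * (a + b) * (c + d) * (b + d)
    return {
        "tf": float(a),
        "idf": math.log(n / (a + c)) if a + c else 0.0,
        "rf": math.log(2 + a / max(1, c)),
        # As stated in Section 3.4, a zero cell makes the odds-ratio weight zero.
        "or": (a * d) / (b * c) if min(a, b, c, d) > 0 else 0.0,
        "chi2": n * (a * d - b * c) ** 2 / chi_den if chi_den else 0.0,
        "ig": ig,
        "gr": ig / entropy if entropy else 0.0,
    }

def label(term_counts, scheme):
    """Return the term(s) with the highest local weight tf * w (plain tf for 'tf')."""
    scores = {}
    for term, counts in term_counts.items():
        w = tws_weights(*counts)
        scores[term] = w["tf"] if scheme == "tf" else w["tf"] * w[scheme]
    top = max(scores.values())
    return sorted(t for t, s in scores.items() if math.isclose(s, top))

# Hypothetical counts (a, b, c, d) for a +ve cluster of 5 entities out of N = 20.
term_counts = {
    "draw": (4, 1, 1, 14),   # concentrated in the +ve cluster
    "init": (3, 2, 9, 6),    # spread proportionally over both clusters
    "os":   (5, 0, 15, 0),   # system-wide prefix, present in every entity
}
for scheme in ("tf", "idf", "rf", "or", "chi2", "ig", "gr"):
    print(scheme, label(term_counts, scheme))
```

With these made-up counts, plain tf selects the system-wide term "os", whereas the other six schemes select "draw", which mirrors the concern raised in Section 4.1 about terms resembling t6.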
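Table 3 Systems' results with iteration at highest MoJoFM value

| System | Iteration: Total | Iteration: At MoJoFMh | MoJoFMh | No. of clusters: ED | No. of clusters: AD | No. of clusters: AD3 | ELF: E | ELF: L | ELF: F |
|---|---|---|---|---|---|---|---|---|---|
| Xfigd | 93 | 76 | 65.48 | 10 | 18 | 10 | 0 | 0 | 1.8 |
| CDnet | 158 | 135 | 61.59 | 8 | 24 | 13 | 2 | 1 | 3.14 |
| Mozilla | 257 | 219 | 67.06 | 6 | 39 | 23 | 0 | 0 | 6.33 |
| Weka | 330 | 261 | 53.21 | 26 | 69 | 32 | 8 | 5 | 2.90 |
| Compost | 468 | 387 | 65.05 | 24 | 69 | 33 | 7 | 1 | 2.69 |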

6 Experimental results and analysis

In this section, we present the results of our experiments. First, we present the clustering results for various test systems. The next section details the labelling results and presents their analysis.

6.1 Comparison of clustering results

Table 3 presents a comparison of software clustering results for the various systems. It presents the total number of iterations in clustering for each test system, the iteration number at the highest MoJoFM value (MoJoFMh), the MoJoFMh value (a higher value is better), the number of clusters in the expert decomposition (ED), the number of clusters in the automatic decomposition (AD) at MoJoFMh, the number of clusters in AD with cardinality greater than 3 (AD3) and ELF at MoJoFMh. From Table 3 it can be seen that MoJoFM results are above 60% for all systems except Weka, where the numbers of lost and extra clusters are also higher than in the other systems. For Xfigd and Mozilla, none of the clusters are lost or extra, and Xfigd contains less fragmentation than Mozilla and the other systems. In Mozilla, fragmentation is relatively high. An analysis shows that the Mozilla expert decomposition contains two clusters that are larger than the other clusters. These two clusters are segmented into several clusters in the automatic decomposition, which results in smaller, manageable and comprehensible [22] clusters although higher fragmentation is indicated. In CDnet and Compost, the number of lost clusters is just 1. Moreover, a higher number of extraneous clusters indicates a more detailed AD.

6.2 Analysis of TWS

In Table 4, we present, for each system, the percentage of meaningful labels assigned by each TWS (a detailed comparison of the labels assigned automatically by the different TWS is presented in Tables 5-9 in the appendix). The last row presents the average percentage of meaningful labels by TWS for all systems.

Table 4 Percentage of meaningful labels assigned by TWS

System | tf, % | tf-idf, % | tf-rf, % | tf-or, % | tf-χ², % | tf-ig, % | tf-gr, %
Xfigd | 72.72 | 81.81 | 90.91 | 81.81 | 90.91 | 90.91 | 90.91
CDnet | 40 | 80 | 60 | 60 | 90 | 80 | 80
Mozilla | 17.39 | 56.52 | 26.08 | 34.78 | 48.82 | 39.13 | 39.13
Weka | 57.69 | 61.53 | 53.84 | 15.38 | 61.53 | 61.53 | 61.53
Compost | 85.18 | 85.18 | 92.59 | 44.44 | 96.29 | 96.29 | 96.29
Average, % | 54.59 | 73.00 | 64.68 | 47.28 | 77.51 | 73.57 | 73.57
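The averages in the last row of Table 4 can be recomputed directly from the per-system percentages. The following is a minimal Python sketch of that arithmetic; the dictionary layout, the scheme keys (for example `tf-chi2` for the chi-square scheme) and the function name are illustrative choices, not part of the original study.

```python
# Percentage of meaningful labels per system, copied from Table 4.
# Order of values in each list: Xfigd, CDnet, Mozilla, Weka, Compost.
TABLE_4 = {
    "tf":      [72.72, 40.0, 17.39, 57.69, 85.18],
    "tf-idf":  [81.81, 80.0, 56.52, 61.53, 85.18],
    "tf-rf":   [90.91, 60.0, 26.08, 53.84, 92.59],
    "tf-or":   [81.81, 60.0, 34.78, 15.38, 44.44],
    "tf-chi2": [90.91, 90.0, 48.82, 61.53, 96.29],
    "tf-ig":   [90.91, 80.0, 39.13, 61.53, 96.29],
    "tf-gr":   [90.91, 80.0, 39.13, 61.53, 96.29],
}


def average_per_scheme(table):
    """Return the mean percentage over all systems for each scheme."""
    return {scheme: sum(values) / len(values) for scheme, values in table.items()}


averages = average_per_scheme(TABLE_4)

# Schemes ordered from highest to lowest average percentage.
ranking = sorted(averages, key=averages.get, reverse=True)

for scheme in ranking:
    print(f"{scheme:8s} {averages[scheme]:6.2f}")
```

The recomputed means agree with the reported averages to within rounding, and the ordering places the chi-square scheme first and the odds-ratio scheme last.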

© The Institution of Engineering and Technology 2012

IET Softw., 2012, Vol. 6, Iss. 4, pp. 283-295 doi: 10.1049/iet-sen.2012.0027
Table 5 Xfigd: comparison between labels of TWS and expert labels

No. | tf | tf-idf | tf-rf | tf-or | tf-χ² | tf-ig | tf-gr | Expert label
1 (4/8) | cancel_create | cancel | cancel | cancel | cancel | cancel | cancel | cancel (10)
2 (5/9) | char | char | char | char | char | char | char | char manipulation (8)
3 (7/12) | create + | create + | create + | create + | create + | create + | create + | create (12)
1 (4/8) | cancel_create | create | create | create | create | create | create |
4 (4/4) | blinking_cursor | blinking | blinking_cursor | cursor | blinking | blinking | blinking | cursor (6)
- | - | - | - | - | - | - | - | getPoints (3)
5 (10/13) | drawing_init + | init + | init + | init + | init + | init + | init + | initialise (17)
6 (5/5) | init | spline | spline | spline | spline | init | init |
7 (4/6) | draw_erase | erase | erase | erase | erase | erase | erase | prefix/postfix manipulation (7)
8 (12/12) | drawing_selected | selected | selected | circlebydiameter_circlebyradius_ellipsebydiameter_ellipsebyradius | selected | selected | selected | select (12)
9 (5/5) | subspline | subspline | subspline | subspline | subspline | subspline | subspline | subspline (6)
10 (5/6) | text | text | text | text | text | text | text | text manipulation (13)

Table 6 CDnet: comparison between labels of TWS and expert labels

No. | tf | tf-idf | tf-rf | tf-or | tf-χ² | tf-ig | tf-gr | Expert label
1 (9/9) | cl_net + | cl + | cl + | resend + | cl + | cl + | cl + | client (23)
2 (6/6) | cl_net | cl | cl | shutdown | cl | cl | cl |
3 (6/7) | net | conn | conn | conn | conn | conn | conn | connection (18)
- | - | - | - | - | - | - | - | GUI (7)
- | - | - | - | - | - | - | - | IO (8)
4 (6/8) | net_ticcmd + | ticcmd + | ticcmd + | expand + | ticcmd + | ticcmd + | ticcmd + | packet (44)
5 (17/17) | net | int | int | read | read | int | int |
6 (5/6) | net | query | query | query | query | query | query | query (14)
- | - | - | - | - | - | - | - | SDL (10)
7 (4/6) | net + | advance_module_window + | net + | add | advance_module_window + | advance_module_window + | advance_module_window + | server (35)
8 (5/7) | net + | game + | net + | game + | game + | game + | game + |
9 (12/15) | net + | send + | net | sv + | sv + | sv + | sv + |
10 (8/8) | net_sv | sv | sv | num | sv | sv | sv |

Table 7 Mozilla: comparison between labels of TWS and expert labels

No. | tf | tf-idf | tf-rf | tf-or | tf-χ² | tf-ig | tf-gr | Expert label
1 (13/17) | ns + | dtd + | ns + | html + | dtd + | ns + | ns + | HTML parser
2 (3/4) | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse | driver_expat_hashtable_strdup_xmlparse |
3 (8/9) | img + | img + | img + | gif + | img + | img + | img + | image Lib
4 (4/4) | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster | jdcoefct_jdhuff_jdinput_jdmaster |
5 (30/38) | jsapi_prmem + | jsapi_prmem + | jsapi_prmem + | gif + | ns + | ns + | ns + | Java script
6 (8/8) | jsdhash + | jsdhash + | jsdhash + | jsarray_jsdhash_jsinterp_jsnum_jsregexp_jsopcode_jsscope + | jsdhash + | jsdhash + | jsdhash + |
7 (4/4) | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + | xpcexception_xpcmodule_xpcstack_xpcthrower + |
8 (7/9) | jsd + | jsd + | jsd + | jsd + | jsd + | jsd + | jsd + |
9 (5/5) | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + | component_js_jsdate_jsmath_loader_moz_prmjtime_xpcstring + |
10 (4/4) | jsd_jsdebug | jsdebug | jsdebug | jsd | jsdebug | jsdebug | jsdebug |
11 (4/4) | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | prerror_prthread_prtpd_ptthread | nsprbub
12 (13/16) | ns + | data_rdf + | ns + | data_rdf + | ns + | ns + | ns + | user interface
13 (5/8) | ns + | service + | ns + | service + | service + | service + | service + |
14 (3/5) | ns + | file + | ns + | script_token + | file + | file + | file + |
15 (17/17) | ns + | context + | context + | context + | context + | ns + | ns + |
16 (9/9) | ns + | x + | ns + | x + | x + | ns + | ns + |
17 (4/4) | ns + | color_name + | color_name + | ns + | color_name + | color_name + | color_name + |
18 (5/5) | ns + | resource + | ns + | rdf + | resource + | resource + | resource + |
19 (7/7) | ns | window | window | window | window | window | window |
20 (17/33) | ns + | string + | ns + | ns + | ns + | ns + | ns + | utility
21 (6/7) | ns + | string + | ns + | string + | string + | string + | string + |
22 (6/10) | ns + | jar + | jar + | unix + | jar + | jar + | jar + |
23 (5/6) | ns_plugin | plugin | plugin | x | plugin | plugin | plugin |

Table 8 Weka: comparison between labels of TWS and expert labels

No. | tf | tf-idf | tf-rf | tf-or | tf-χ² | tf-ig | tf-gr | Expert label
1 (6/6) | literal + | literal + | literal + | individual + | literal + | literal + | literal + | associations (13)
2 (3/6) | list | list | linked | linked | list | list | list |
3 (2/4) | handler_option_set + | option + | handler_option_set + | handler_set + | option + | option + | option + | attribute selection (31)
4 (6/6) | evaluator + | evaluator + | evaluator + | evaluator + | evaluator + | evaluator + | evaluator + |
5 (4/6) | srt + | srt + | srt + | srt + | srt + | srt + | srt + |
6 (10/13) | eval | eval | eval | attribute | eval | eval | eval | bayes (10)
7 (6/12) | curve_matrix | curve | curve | matrix | curve | curve | curve | evaluation (9)
8 (6/11) | regression | regression | regression | regression | regression | regression | regression | functions (7)
9 (7/7) | neural | neural | neural | node | neural | neural | neural | neural (9)
10 (8/8) | mixture | mixture | mixture | matrix | mixture | mixture | mixture | pace (13)
11 (4/4) | kernal | kernal | kernal | kernal_normalized_ploy_rbf | kernal | kernal | kernal | support vector (8)
12 (5/5) | k_star | star | k_star | cache | star | star | star | lazy (9)
13 (5/9) | discretise | discretise | discretise | cost | discretise | discretise | discretise | meta (11)
14 (13/14) | antd_rule | antd | antd | antd | antd | antd | antd | rules (26)
- | - | - | - | - | - | - | - | trees (7)
15 (3/6) | node | way | way | prediction_two | way | node | node | adtree (6)
16 (6/9) | split | split | split | split | split | split | split | j48 (13)
- | - | - | - | - | - | - | - | m5 (9)
- | - | - | - | - | - | - | - | lmt (6)
- | - | - | - | - | - | - | - | classifiers (8)
- | - | - | - | - | - | - | - | clusterers (7)
17 (4/6) | loader + | loader + | loader + | loader + | loader + | loader + | loader + | core converters (9)
18 (4/4) | loader | loader | loader | abstract | loader | loader | loader |
3 (2/4) | handler_option_set + | option + | handler_option_set + | handler_set + | option + | option + | option + | core (33)
19 (3/4) | fast_vector + | fast + | fast + | vector + | fast + | fast + | fast + |
20 (5/5) | exception | exception | exception | type | exception | exception | exception | datagenerators (3)
21 (10/10) | estimator | estimator | conditional | conditional | estimator | estimator | estimator | estimators (14)
22 (5/7) | instance_listener_result + | listener_result + | listener_result + | listener_result + | listener_result + | listener_result + | listener_result + | experiment (20)
23 (4/4) | remote | remote | remote | task | remote | remote | remote |
24 (5/5) | filter | filter | filter | unsupervised | filter | filter | filter | filters (5)
- | - | - | - | - | - | - | - | filters supervised (6)
25 (6/11) | series_time + | series_time + | series_time + | abstract_cobweb_order + | series_time + | series_time + | series_time + | filters unsupervised (39)
26 (8/9) | to + | to + | to + | to + | to + | to + | to + |
27 (3/4) | sparse + | sparse + | sparse + | sparse + | sparse + | sparse + | sparse + |
28 (3/6) | remove | remove | remove | remove | remove | remove | remove |

It is relevant to note that the evaluation of meaningfulness is based on the subjective assessment of a human expert, since it is difficult to define a quantitative measure of how well a label represents the semantic content of a cluster. We assess a label to be meaningful in the following three cases. First, a label is considered meaningful if it matches the label assigned by the expert; for example, the label query assigned by tf-idf to segment (6) in CDnet (Table 6) is considered meaningful since it matches the expert label. Second, a label is considered meaningful if it is an abbreviation of the label assigned by the expert, for example, the label conn assigned by tf-idf to segment (3) in CDnet (Table 6). For the third case, it should be noted that a sub-system in the expert decomposition may be segmented into multiple clusters in the automatic decomposition. This is especially so for large sub-systems within the expert decomposition; for example, in CDnet, the sub-system Server is segmented into four clusters. Consider segment (8), whose entities are NET_CL_ParseGameStart, NET_SV_ParseGameDataACK, NET_SV_ParseGameData, NET_SV_ParseGameStart, NET_SV_GameEnded, NET_SV_LatestAcknowledged and NET_WaitForStart. It is clear from the entity identifiers that this segment performs functionality related to Game at the server side. In this case, we consider a label meaningful if it represents the functionality of the entities contained in the segment, even if the label does not match the expert label exactly. Before discussing the results of the various TWS, we describe the naming conventions followed in our test systems.

6.2.1 Naming conventions in the test systems: Xfigd follows a concept-based naming convention for functions. Its function identifiers are tokenisable and share common concept terms; for example, functions related to initialisation contain the common concept term init_ as a prefix. As a result, clusters obtained automatically usually contain a term similar to t1-t5, and the remaining terms of each cluster resemble t7 (a term contained in few entity identifiers in the +ve cluster as well as in the -ve cluster).

CDnet follows a system-based naming convention for files, and a system- and concept-based naming convention for functions, such that each sub-system is assigned an acronym and the files of that sub-system have that acronym as a common prefix. For example, each file name in our selected sub-system has the common prefix Net_. A function contained in the file Net_Query.c has the name Net_Query_Init. In this case, segments of the automatic decomposition contain one or two terms similar to any of t1-t5, each segment contains a term similar to t6 (Net) and the remaining terms in each segment resemble t7.

Mozilla follows a folder-based shared concept naming convention along with concept-less naming. Sub-systems are divided into folders and some sub-system names contain the acronym of the folder; for example, the majority of files in the Java Script sub-system contain the common acronym jsd. Another common acronym, which is the prefix of the majority of files of the system, is ns (an acronym of name space). File names are tokenisable as well as non-tokenisable. When file names are non-tokenisable, the majority of cluster terms resemble t7.

Weka's sub-systems are divided into folders and subfolders based on concept-naming conventions. Some class names are singletons (class names consisting of only one term, e.g. Remove, Add) and, if composite and tokenisable, their terms are generally not shared with other class names. Very few class names share the concept of their sub-system. The Weka case is similar to Mozilla, with the difference that Mozilla's terms are non-tokenisable, whereas Weka's terms do not share a common concept.

Compost follows a shared concept-based naming convention. The majority of class names contain shared concept terms and a few class names are singleton terms. For example, files contained in the sub-systems Reference and List share the common suffixes reference and list, respectively. An example of singleton terms arises in the sub-system Expression, which contains the files Literal.java and Operator.java. Here, although each class name reflects its own main idea, such cases make it difficult to grasp the main functionality of the sub-system in which these classes are present. Since this system primarily follows a shared concept-based naming convention (the same as Xfigd), the majority of its labelling cases resemble those of Xfigd.
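To make the notion of a tokenisable identifier concrete, the following Python sketch splits identifiers on underscores and camel-case boundaries and counts in how many identifiers of a segment each term occurs. It is only an illustration under stated assumptions: the splitting rules and the function name `tokenise` are hypothetical, no stemming is applied, and the tokeniser used in the study may differ. The identifiers are those listed above for segment (8) of CDnet.

```python
import re
from collections import Counter


def tokenise(identifier):
    """Split an identifier into lower-case terms on '_' and camel-case boundaries."""
    terms = []
    for part in identifier.split("_"):
        # 'ParseGameStart' -> Parse, Game, Start; acronyms such as 'ACK' stay whole.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [term.lower() for term in terms]


# Entity identifiers of segment (8) in CDnet, as listed in the text.
SEGMENT_8 = [
    "NET_CL_ParseGameStart",
    "NET_SV_ParseGameDataACK",
    "NET_SV_ParseGameData",
    "NET_SV_ParseGameStart",
    "NET_SV_GameEnded",
    "NET_SV_LatestAcknowledged",
    "NET_WaitForStart",
]

# Number of identifiers in the segment that contain each term.
identifier_frequency = Counter(
    term for name in SEGMENT_8 for term in set(tokenise(name))
)

for term, count in identifier_frequency.most_common(4):
    print(term, count)
```

In this segment the prefix net occurs in every identifier, whereas sv and game each occur in five of the seven, which matches the observation that the segment concerns Game functionality at the server side. A non-tokenisable name such as circlebydiameter is returned by this sketch as a single term.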

Table 9 Compost: comparison between labels of TWS and expert labels

No. | tf | tf-idf | tf-rf | tf-or | tf-χ² | tf-ig | tf-gr | Expert label
1 (3/4) | service | service | service | default | service | service | service | compost services (10)
2 (2/2) | port | port | port | call_port | port | port | port | architecture (3)
3 (4/8) | info_type | info | info | info | info | info | info | abstraction (15)
4 (9/11) | class_component | component | component | manager | component | component | component | composition (19)
5 (3/6) | hook | hook | hook | naming_point | hook | hook | hook | hook (8)
6 (3/6) | generic_super | super | super | super | super | super | super | declared hook (8)
7 (4/8) | factory_member_method_parameter | parameter | parameter | parameter | parameter | parameter | parameter | implicit hook (12)
8 (5/5) | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | append_bind_extract_prepend_remove | composer (6)
9 (4/7) | walker + | walker + | walker + | walker + | walker + | walker + | walker + | convenience (21)
10 (5/5) | ast_iterator | ast | ast | filter | ast | ast | ast |
11 (5/7) | file + | file + | file + | file + | file + | file + | file + | IO (13)
12 (3/5) | comment | comment | comment | contract_doc | comment | comment | comment | Java (26)
13 (3/5) | code | code | code | parser | code | code | code | byte Code (9)
14 (9/14) | declaration | declaration | declaration | declaration | declaration | declaration | declaration | declaration (19)
- | - | - | - | - | - | - | - | modifier (12)
- | - | - | - | - | - | - | - | expression (6)
15 (8/10) | literal | literal | literal | literal | literal | literal | literal | literal (8)
16 (33/34) | assignment + | assignment + | assignment + | or + | assignment + | assignment + | assignment + | operator (46)
17 (8/8) | decrement_increment_post_pre | decrement_increment_post_pre | decrement_increment_post_pre | operator | decrement_increment_post_pre | decrement_increment_post_pre | decrement_increment_post_pre |
18 (7/8) | reference + | reference + | reference + | name + | reference + | reference + | reference + | reference (22)
19 (12/12) | reference | reference | reference | constructor | reference | reference | reference |
20 (11/16) | statement | statement | statement | statement | statement | statement | statement | statement (26)
21 (98/109) | list + | list + | list + | list + | list + | list + | list + | list (120)
22 (19/30) | element | element | model | model | element | element | element | pattern (5)
23 (2/4) | info_source + | source + | source + | source + | source + | source + | source + | service (20)
24 (4/7) | operation | operation | operation | operation | operation | operation | operation |
25 (4/4) | token | token | token | parser | token | token | token | tools (11)
26 (8/8) | hash + | hash + | hash + | hash + | hash + | hash + | hash + | util (16)
27 (4/4) | order | order | order | identity_lexical_natural_order | order | order | order |

6.2.2 TWS results: Term frequency: As given in Table 4, for systems with shared common concepts (Xfigd and Compost), tf results are better than for other systems. In these systems, the majority of cases are such that each cluster contains one or two terms which resemble any of t1-t5 and all remaining terms resemble t7. For this type of system, tf assigns meaningful labels. However, tf cannot distinguish between terms like (t1 and t2) and (t4, t5 and t6), as illustrated in Fig. 1. This has two consequences. Firstly, labels assigned by tf are relatively longer than labels selected by other TWS (for example, see segments 1-8 in Table 5). Secondly, since t4, t5 and t6 are given the highest weight by tf, it tends to select terms which represent domain concepts as labels (for example, see Table 6 for CDnet, where the common prefix net is selected as label, and Table 7 for Mozilla, where ns is selected as label by tf for 13 out of the 23 segments). Therefore, for such systems, tf favours domain concept terms.

Inverse document frequency: idf gives more weight to a term if it occurs in fewer documents. In software clustering, if a term occurs in only a minimal number of identifiers (a rare term), it cannot reflect the main idea of the segment. As illustrated in Fig. 1, t7 is assigned the highest weight by idf. Another drawback of idf is that it does not consider how the term is spread across the +ve and -ve clusters and thus cannot distinguish between terms like t2 and t4, as both will be given equal weights [34]. However, we noted in our experiments that when idf weights are multiplied by tf, t4 is often given a higher weight, as required. Overall, idf results are better for systems like Mozilla and Weka, which contain few tokenisable terms. Therefore we can conclude that for systems which do not share many common terms, tf-idf has the ability to provide more meaningful labels.
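The behaviour of tf, idf and their product described above can be reproduced on a toy example. The Python sketch below is a minimal illustration under stated assumptions: each tokenised entity identifier is treated as one document, tf is the number of occurrences of a term in the +ve cluster, and idf is log(N / n_t), with N the number of identifiers in the system and n_t the number containing the term. These definitions, the function name and the toy identifiers are assumptions made for illustration, not the exact formulation of the study.

```python
import math
from collections import Counter


def term_weights(positive, negative):
    """Return {term: (tf, idf, tf * idf)} for every term of the +ve cluster."""
    documents = positive + negative
    n_docs = len(documents)
    # Number of identifiers (documents) containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Occurrences of each term inside the +ve cluster.
    tf = Counter(term for doc in positive for term in doc)
    weights = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = (freq, idf, freq * idf)
    return weights


# Hypothetical tokenised identifiers: a +ve cluster and the rest of the system.
POSITIVE = [["net", "query", "init"], ["net", "query", "send"], ["net", "query", "parse"]]
NEGATIVE = [["net", "sv", "run"], ["net", "cl", "run"], ["net", "conn", "open"],
            ["net", "io", "read"], ["net", "gui", "open"]]

weights = term_weights(POSITIVE, NEGATIVE)
best_by_idf = max(weights, key=lambda t: weights[t][1])
best_by_tfidf = max(weights, key=lambda t: weights[t][2])
print(best_by_idf, best_by_tfidf)
```

In this toy data, tf alone gives the system-wide prefix net and the concept term query the same weight, idf alone prefers one of the rare terms (init, send or parse), and the product selects query, which mirrors the discussion of t4, t6 and t7 above.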
Relevant frequency: For systems which follow the shared common concept convention, for example, Xfigd and Compost, tf-rf has shown good results overall. However, for systems which share common prefixes, for example, CDnet and Mozilla, tf-rf tends to select domain concepts as labels (for example, see segments 7-9 in Table 6 for CDnet, where net is selected as label, and segments 12-14 in Table 7 for Mozilla, where ns is selected as label). Domain concepts are similar to t6, which occurs frequently not only in the +ve cluster but also in the -ve cluster. rf assigns a lower weight to terms like t6 in order to select the terms which are frequent in the +ve cluster and infrequent in the -ve cluster. However, because of the small range of rf, after multiplication with term frequency the weight of terms like t6 often becomes higher than that of other terms. Therefore tf-rf tends to choose domain concepts for systems which share common prefixes.

Odds-ratio: As explained in Section 3, if any of the values a, b, c or d in the document-cluster contingency table of a term is zero, then the OR weight is also zero. OR is borrowed from TC where, because of large data sets, it is rare for a, b, c or d to be zero, so it gives satisfactory results in that domain. In software clustering, on the other hand, there is a high chance of a zero value in the document-cluster contingency table. Owing to this zero-value shortcoming, OR has not shown good results overall. The labels selected by OR are different from, and less meaningful than, those of other schemes (for example, see segments 1, 4 and 5 in Table 9 for Compost). Therefore OR fails to assign appropriate labels for software clustering.

Chi-square: χ² has consistently given better results, except for Mozilla, for which it ranks second. Overall, its results are better than those of all other weighting schemes. Usually, in software clustering we have d ≫ a, b and c. Also, in software clustering d does not vary much, so the factors influencing term weights are a, b and c.
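The role of the contingency counts can be illustrated with a small Python sketch. Section 3 is not reproduced here, so the definitions below are assumptions that follow the usual text categorisation convention: a and b are the numbers of identifiers in the +ve cluster that do and do not contain the term, and c and d are the corresponding numbers for the -ve cluster. The formulas used are the standard ones, OR = ad/(bc), χ² = N(ad - bc)² / ((a + c)(b + d)(a + b)(c + d)) and rf = log2(2 + a / max(1, c)); the exact forms used in the study may differ, and the multiplication with tf is omitted. The zero-value rule for OR is taken from the text above. All function names and the toy identifiers are hypothetical.

```python
import math


def contingency(term, positive, negative):
    """Return (a, b, c, d) for a term, given tokenised +ve and -ve identifiers."""
    a = sum(term in doc for doc in positive)   # +ve identifiers containing the term
    b = len(positive) - a                      # +ve identifiers without the term
    c = sum(term in doc for doc in negative)   # -ve identifiers containing the term
    d = len(negative) - c                      # -ve identifiers without the term
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """OR weight; zero whenever any contingency cell is zero, as stated in the text."""
    if 0 in (a, b, c, d):
        return 0.0
    return (a * d) / (b * c)


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2 x 2 contingency table."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def relevance_frequency(a, c):
    """rf weight; depends only on a and c."""
    return math.log2(2 + a / max(1, c))


# Hypothetical tokenised identifiers: a +ve cluster and the rest of the system.
POSITIVE = [["net", "query", "init"], ["net", "query", "send"], ["net", "query", "run"]]
NEGATIVE = [["net", "sv", "run"], ["net", "cl", "run"], ["net", "conn", "open"],
            ["net", "io", "read"], ["net", "gui", "open"]]

terms = {term for doc in POSITIVE for term in doc}
cells = {term: contingency(term, POSITIVE, NEGATIVE) for term in terms}
best_by_or = max(terms, key=lambda t: odds_ratio(*cells[t]))
best_by_chi2 = max(terms, key=lambda t: chi_square(*cells[t]))
print(best_by_or, best_by_chi2)
```

In this toy data the term query occurs in every +ve identifier and in no -ve identifier, so b and c are both zero: OR gives it zero weight and instead prefers run, the only term whose four cells are all non-zero, whereas χ² gives query the highest weight.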
For software clustering, a higher ratio of a to b is important as it helps to pick the term on which the +ve cluster is more dependent. At the same time, a lower c indicates that the term contributes little to the -ve cluster. Hence χ² has picked more appropriate weights than other TWS. In [34], it was reported that χ² results were worse than idf, rf and or in TC. However, in our experiments in the software domain, it provided more meaningful labels than other TWS.

Information gain and gain ratio: It can be seen from Table 4 that the results of both these schemes are the same for all systems. Their results are generally better than those of other schemes (except χ²), but for systems which contain common prefixes, tf-ig and tf-gr tend to select domain concepts as labels.

Related work

In recent years, there has been much work on developing algorithms for software clustering. Andritsos and Tzerpos [11] proposed the clustering algorithm LIMBO (scaLable InforMation BOttleneck), with the main objective of minimising information loss during clustering. In [30], Maqbool and Babri presented a new linkage algorithm called WCA, which maintains the access pattern of software entities during clustering. WCA enhances the combined algorithm presented in [49]. Patel et al. [12] proposed a hybrid approach that combines dynamic analysis and static dependencies for software system clustering. In [50], a new clustering approach is presented which incorporates hierarchical and partitional clustering. Mitchell and Mancoridis used search-based techniques in their Bunch tool [8]. Spek et al. [5] used the latent semantic indexing (LSI) technique to recover architectural concepts.

Researchers have used automatic label assignment or visualisation techniques to improve software comprehension. Tzerpos [51] introduced a software clustering approach that employs the idea of pattern-driven software clustering and assigns labels to software clusters automatically. Maqbool and Babri [23] used two TWS, tf and idf, to assign labels to software clusters automatically. In [24], an analysis of four weighting schemes is presented. Kuhn et al. [3] introduced the idea of semantic clustering and applied automatic labels to clusters by selecting the top n terms from the LSI index. Ducasse and Lanza [52] and Abdeen et al. [53] have used visualisation techniques to support comprehension of classes and packages. Patel et al. [12] used UML diagrams to show software clustering results.

Various TWS have been used in the field of IR and TC to apply weights to terms with respect to their importance in a document or corpus. In [34], a detailed overview of supervised and unsupervised term weighting schemes is presented. Debole and Sebastiani used the feature reduction functions χ², ig and gr to weight terms, an approach called supervised term weighting [33]. A novel feature selection scheme called probability based is proposed in [54]. In [55], a modified χ² is presented to select the feature set for small categories. Various TWS are evaluated for the question categorisation (QC) task in [56]. The authors proposed three new TWS for QC and analysed their behaviour for TC. In [57], the authors proposed that TWS can also be considered as term occurrence probabilities in the +ve and -ve categories.

Conclusions

Clustering techniques are used to decompose software systems by grouping similar entities into modules (clusters). Although one of the main objectives of clustering is to enhance comprehensibility, developers still need to put in effort to understand the functionality of clusters. Assigning appropriate labels to clusters can make clustering results much easier to understand. For automatic label assignment, it is necessary to select terms which reflect the main intention of clusters. In the field of IR and TC, TWS are used to find the importance of terms in a document or a corpus. Thus TWS can also be used to find important terms in software clusters. In this paper, we addressed issues related to the use of TWS for software cluster labelling. These issues include the evaluation of TWS to determine which schemes perform well in the software domain, and the identification of software characteristics which affect the labelling behaviour of TWS. We conducted experiments on five software systems whose entity identifiers follow different naming conventions. Our results show that TWS may be used to assign meaningful labels to clusters when entity identifiers are descriptive, tokenisable and contain terms shared between entities. Moreover, our results reveal that the performance of the TWS differs depending on software characteristics. Overall, tf-χ² gave better results than the other schemes. tf and tf-rf tend to assign domain concept terms as labels for systems which contain common prefixes. For a system whose entity identifiers contain non-tokenisable or singleton terms, tf-idf gives better results. tf-or consistently gave inferior performance because of its zero-value problem. The results of tf-ig and tf-gr were the same for all systems, and were similar to those of tf-χ². However, in some cases these two weighting schemes chose domain terms rather than the more appropriate labels chosen by tf-χ². The analysis presented in this paper provides guidelines for selecting TWS for cluster labelling. We may conclude that when a TWS is selected carefully, depending on a software system's naming conventions, it produces meaningful labels that can considerably enhance the comprehensibility of clustering results.
References

1 Lehman, M.M.: Programs, life cycles, and laws of software evolution, Proc. IEEE, 1980, 68, (9), pp. 1060-1076
2 Praditwong, K., Harman, M., Yao, X.: Software module clustering as a multi-objective search problem, IEEE Trans. Softw. Eng., 2011, 37, (2), pp. 264-282
3 Kuhn, A., Ducasse, S., Girba, T.: Semantic clustering: identifying topics in source code, Inf. Softw. Technol., 2007, 49, (3), pp. 230-243
4 van Deursen, A., Kuipers, T.: Identifying objects using cluster and concept analysis. Int. Conf. on Software Engineering, 1999, pp. 246-255
5 van der Spek, P., Klusener, S., van de Laar, P.: Towards recovering architectural concepts using latent semantic indexing. Proc. 12th European Conf. on Software Maintenance and Reengineering, 2008, pp. 253-257
6 Mancoridis, S., Mitchell, B.S., Chen, Y., Gansner, E.R.: Bunch: a clustering tool for the recovery and maintenance of software system structures. Proc. Int. Conf. on Software Maintenance, 1999, pp. 50-59
7 Mamaghani, A.S., Meybodi, M.R.: Clustering of software systems using new hybrid algorithms. Proc. Ninth IEEE Int. Conf. on Computer and Information Technology, 2009, pp. 20-25
8 Mitchell, B.S., Mancoridis, S.: On the automatic modularization of software systems using the bunch tool, IEEE Trans. Softw. Eng., 2006, 32, (3), pp. 193-208
9 Khan, B., Sohail, S.: Using es based automated software clustering approach to achieve consistent decompositions. Fifteenth Asia-Pacific Software Engineering Conf., 2008, pp. 429-436
10 Khan, B., Sohail, S., Younus Javed, M.: Evolution strategy based automated software clustering approach, Adv. Softw. Eng. Appl., 2008, pp. 27-34
11 Andritsos, P., Tzerpos, V.: Information-theoretic software clustering, IEEE Trans. Softw. Eng., 2005, 31, (2), pp. 150-165
12 Patel, C., Hamou-Lhadj, A., Rilling, J.: Software clustering using dynamic analysis and static dependencies. Proc. 13th Eur. Conf. on Software Maintenance and Reengineering, 2009, pp. 27-36
13 Lung, C.-H., Zaman, M., Nandi, A.: Application of clustering techniques to software partitioning, recovery and restructuring, J. Syst. Softw., 2004, 73, (2), pp. 227-244
14 Shepitsen, A., Gemmell, J., Mobasher, B., Burke, R.: Personalized recommendation in social tagging systems using hierarchical clustering. Proc. 2008 ACM Conf. on Recommender Systems, 2008, pp. 259-266
15 Schickel-Zuber, V., Faltings, B.: Using hierarchical clustering for learning the ontologies used in recommendation systems. Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2007, pp. 599-608
16 Ma, J., Zhang, Y., He, J.: Efficiently finding web services using a clustering semantic approach. Proc. 2008 Int. Workshop on Context Enabled Source and Service Selection, Integration and Adaptation, 2008
17 Clerkin, P., Cunningham, P., Hayes, C.: Ontology discovery for the semantic web using hierarchical clustering. Proc. Semantic Web Mining Workshop, 2001
18 Thanh Tho, Q., Cheung Hui, S., Fong, A.C.M., Hoang Cao, T.: Automatic fuzzy ontology generation for semantic web, IEEE Trans. Knowl. Data Eng., 2006, 18, (6), pp. 842-856
19 Gavalas, D., Pantziou, G., Konstantopoulos, C., Mamalis, B.: Abp: a low-cost, energy-efficient clustering algorithm for relatively static and quasi-static MANETS, Int. J. Sensor Netw., 2009, 4, (4), pp. 260-269
20 Hebden, P., Pearcea, A.R.: Distributed asynchronous clustering for self-organisation of wireless sensor networks. Proc. Conf. on Intelligent Sensing and Information Processing, 2006, pp. 37-42
21 Ducasse, S., Pollet, D.: Software architecture reconstruction: a process-oriented taxonomy, IEEE Trans. Softw. Eng., 2009, 35, (4), pp. 573-591
22 Tzerpos, V., Holt, R.C.: Acdc: an algorithm for comprehension-driven clustering. Proc. Seventh Working Conf. on Reverse Engineering, 2000, pp. 258-267
23 Maqbool, O., Babri, H.A.: Automated software clustering: an insight using cluster labels, J. Syst. Softw., 2006, 79, (11), pp. 1632-1648
24 Siddique, F., Maqbool, O.: Analyzing term weighting schemes for labeling software clusters. Proc. 15th Eur. Conf. on Software Maintenance and Reengineering, 2011, pp. 85-88
25 Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review, ACM Comput. Surv., 1999, 31, (3), pp. 264-323
26 Maqbool, O., Babri, H.A.: Hierarchical clustering for software architecture recovery, IEEE Trans. Softw. Eng., 2007, 33, (11), pp. 759-780
27 Andreopoulos, B., An, A., Tzerpos, V., Wang, X.: Multiple layer clustering of large software systems. Proc. 12th Working Conf. on Reverse Engineering, 2005, pp. 79-88
28 Sindhgatta, R., Pooloth, K.: Identifying software decompositions by applying transaction clustering on source code. Proc. 31st Annu. Int. Conf. on Computer Software and Applications, 2007, vol. 1, pp. 317-326
29 Anquetil, N., Lethbridge, T.C.: Experiments with clustering as a software remodularization method. Proc. Sixth Working Conf. on Reverse Engineering, 1999, pp. 235-255
30 Maqbool, O., Babri, H.A.: The weighted combined algorithm: a linkage algorithm for software clustering. Proc. Eighth Eur. Conf. on Software Maintenance and Reengineering, 2004, pp. 15-24
31 Wen, Z., Tzerpos, V.: An effectiveness measure for software clustering algorithms. Proc. 12th IEEE Int. Workshop on Program Comprehension, 2004, pp. 194-203
32 Shtern, M., Tzerpos, V.: Refining clustering evaluation using structure indicators. Proc. IEEE Int. Conf. on Software Maintenance, 2009, pp. 297-305
33 Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. Eighteenth ACM Symp. on Applied Computing, 2003, pp. 784-788
34 Lan, M., Lim Tan, C., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell., 2009, 31, (4), pp. 721-735
35 Forman, G.: An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res., 2003, 3, pp. 1289-1305
36 Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval, Inf. Process. Manage., 1988, 24, (5), pp. 513-523
37 Sebastiani, F.: Machine learning in automated text categorization, ACM Comput. Surv., 2002, 34, (1), pp. 1-47
38 Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. Proc. Fourteenth Int. Conf. on Machine Learning, 1997, pp. 412-420
39 Montanes, E., Diaz, I., Ranilla, J., Combarro, E.F., Fernandez, J.: Scoring and selecting terms for text categorization, IEEE Intell. Syst., 2005, 20, (3), pp. 40-47
40 WebsiteXfig: Xfig source system, http://www.xfig.org/, 2006
41 ChocoWebsite: Chocolate doom source system, http://www.chocolate-doom.org/, 2010
42 WebsiteWeka: Weka source system, http://weka.pentaho.com/, 2010
43 WebsiteCOMPOST: Compost source system, http://www.info.uni-karlsruhe.de/compost, 2009
44 WebsiteMozilla: Mozilla source system, http://www.mozilla.org/, 2003
45 Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval (Addison Wesley and ACM Press, 1999)
46 WebsiteStemmer: Porter stemmer, http://tartarus.org/martin/PorterStemmer/, 2010
47 WebsiteXFig: Xfig expert decomposition and labels, http://www.suraj.lums.edu.pk/reverseeng, 2006
48 WebsiteMozilla: Mozilla expert decomposition http://www.cse.yorku.ca/billa/MULICsoftware05/MULICWCREconf mozillaAuth:kos, 2005
49 Saeed, M., Maqbool, O., Babri, H.A., Hassan, S.Z., Sarwar, S.M.: Software clustering techniques and the use of combined algorithm. Proc. Seventh Eur. Conf. on Software Maintenance and Reengineering, 2003, pp. 301-306
50 Zhang, Q., Qiu, D., Tian, Q., Sun, L.: Object-oriented software architecture recovery using a new hybrid clustering algorithm. Proc. Seventh Int. Conf. on Fuzzy Systems and Knowledge Discovery, 2010, pp. 2546-2550
51 Tzerpos, V.: Comprehension-driven software clustering. PhD thesis, University of Toronto, 2001
52 Ducasse, S., Lanza, M.: The class blueprint: visually supporting the understanding of classes, IEEE Trans. Softw. Eng., 2005, 31, (1), pp. 75-90
53 Abdeen, H., Ducasse, S., Pollet, D., Alloui, I.: Package fingerprints: a visual summary of package interface usage, Inf. Softw. Technol., 2010, 52, (12), pp. 1312-1330
54 Lui, Y., Tong Loh, H., Sun, A.: Imbalanced text classification: a term weighting approach, Expert Syst. Appl., 2009, 36, (1), pp. 690-701
55 Dai, L., Hu, J., Liu, W.: Using modified chi square and rough set for text categorization with many redundant features. Proc. Int. Symp. Computational Intelligence and Design, 2008, vol. 1, pp. 182-185
56 Quan, X., Wenyin, L., Qiu, B.: Term weighting schemes for question categorization, IEEE Trans. Pattern Anal. Mach. Intell., 2011, 33, (5), pp. 1009-1021
57 Erenel, Z., Altincay, H., Varoglu, E.: A symmetric term weighting scheme for text categorization based on term occurrence probabilities. Proc. Fifth Int. Conf. on Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control, 2009, pp. 1-4

10 Appendix

Tables 5-9 present a comparison of the labels assigned automatically by different TWS. In Tables 5-9, the first column gives the segment's unique id. Suppose we have a segment adi with cardinality |adi| and a sub-system edj from the expert decomposition with cardinality |edj|; then the common entities are given as (|adi ∩ edj| / |adi|). The remaining columns give the labels assigned by each TWS, and the last column gives the expert label along with the cardinality of the sub-system. A cluster with one or two entities is not useful for comprehension [51], therefore we only select clusters with cardinality greater than three for label comparison. In the tables, we use _ to represent a composite label; for example, if a cluster is assigned the two labels init and spline, this is indicated as init_spline. If a cluster from the expert decomposition has two segments, a + sign is used to represent the cluster labels.
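The notation in the first column of Tables 5-9 can be computed with simple set operations. The Python sketch below is illustrative only: the function names and the entity names are hypothetical, and the cardinality threshold is the one stated above.

```python
def overlap(segment, subsystem):
    """Return (|ad_i ∩ ed_j|, |ad_i|) for a segment and an expert sub-system."""
    segment, subsystem = set(segment), set(subsystem)
    return len(segment & subsystem), len(segment)


def format_overlap(segment, subsystem):
    """Format the overlap in the (common/segment size) style of Tables 5-9."""
    common, size = overlap(segment, subsystem)
    return f"({common}/{size})"


def select_for_comparison(segments, minimum=3):
    """Keep only segments whose cardinality is greater than the minimum."""
    return [segment for segment in segments if len(segment) > minimum]


# Hypothetical entities of one expert sub-system and two automatic segments.
EXPERT_SUBSYSTEM = ["init_a", "init_b", "init_c", "init_d", "init_e"]
SEGMENTS = [
    ["init_a", "init_b", "init_c", "draw_x"],
    ["draw_y", "draw_z"],
]

for segment in select_for_comparison(SEGMENTS):
    print(format_overlap(segment, EXPERT_SUBSYSTEM))
```

Here the first segment shares three of its four entities with the expert sub-system and is reported as (3/4), while the second segment is too small to be selected for label comparison.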
