Software Engineering Research Group

Final thesis
Philips Medical Archive Splitting
Marco Glorie
THESIS
submitted in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE
in
SOFTWARE ENGINEERING
by
Marco Glorie
born in Heemskerk, the Netherlands
Software Engineering Research Group
Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
www.ewi.tudelft.nl
MR Department
Philips Medical Systems
Veenpluis 6
5684 PC Best, the Netherlands
www.medical.philips.com
© 2007 Marco Glorie. All rights reserved.
Author: Marco Glorie
Student id: 1133772
Email: marcoglorie@hi.nl
Abstract
Philips Medical Systems makes MR-scanners - among many other products - for medical purposes such as diagnostics. MR stands for Magnetic Resonance, and the technology enables MR-scanners to analyze the human body for potential diseases. The software for these MR-scanners is complex and has been evolving for more than 27 years. The software archive has evolved into a single 9 MLOC archive, which is now hard to maintain. This thesis investigates ways to split the single archive into multiple archives that can be developed independently. We discuss formal concept analysis and cluster analysis as analysis methods to obtain such archives from the single archive, using the Symphony framework. We conclude with an evaluation of the analysis methods used.
Thesis Committee:
Chair: Prof. Dr. A. van Deursen, Faculty EEMCS, TU Delft
University supervisor: Dr. A. Zaidman, Faculty EEMCS, TU Delft
Company supervisor: Drs. L. Hofland, Philips Medical Systems
Preface
This document could not have been created without the help of both Philips Medical Systems and Delft University of Technology. Therefore I would like to thank Lennart Hofland for his support, advice and reviewing activities. I also thank Andy Zaidman and Arie van Deursen for supporting me with reviews and advice on the subject.
During my project I was surrounded by colleagues from the ‘image processing and platform’ group. I really appreciated that they invited me to their group meetings; this gave me good insight into how they work at Philips and what difficulties they are facing. I also enjoyed the several group trips, such as our ‘adventurous’ GPS-tour in Veldhoven. Thanks for that.
Finally my thanks also go to Mulo Emmanuel, Christian Schachten, Jihad Lathal and Frederic Buehler, who all made my stay in Best an exciting one.
Marco Glorie
Best, the Netherlands
August 3, 2007
Contents
Preface iii
Contents v
List of Figures vii
1 Introduction 1
2 Problem definition 3
2.1 The MR case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Symphony 7
3.1 Reconstruction design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Reconstruction execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Symphony for the MR case . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.1 Reconstruction design for the MR case . . . . . . . . . . . . . . . 9
3.3.2 Reconstruction execution for the MR case . . . . . . . . . . . . . . 10
4 Formal Concept Analysis 11
4.1 Formal context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Formal concept, extent and intent . . . . . . . . . . . . . . . . . . . . . . . 12
4.3 Subconcept-superconcept-relation and concept lattice . . . . . . . . . . . . 12
4.4 Concept lattice construction . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.5 Applications of FCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.6 Application to the MR case . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.6.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.6.2 Features for FCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.6.3 Scalability issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6.4 Concept quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.7.1 Project documentation features . . . . . . . . . . . . . . . . . . . . 23
4.7.2 MR-specificity and layering features . . . . . . . . . . . . . . . . . 24
4.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Cluster Analysis 29
5.1 Entities to cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Similarity measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Cluster analysis algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Applications of cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 Application to the MR case . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.5.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.5.2 Scalability issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5.3 Visualization and validation . . . . . . . . . . . . . . . . . . . . . 39
5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusions and future work 47
6.1 Formal Concept Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography 49
A MR Architecture 53
List of Figures
2.1 The building block structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The scope of a fictive project . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 Reconstruction execution interactions . . . . . . . . . . . . . . . . . . . . . . 9
4.1 Crude characterization of mammals . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Extent and intent of the concepts for the mammal example . . . . . . . . . . . 13
4.3 Concept lattice for the mammal example . . . . . . . . . . . . . . . . . . . . . 13
4.4 An example context to partition the archive into multiple archives using project
documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.5 Example of how features are accumulated to higher level building blocks . . . . 21
4.6 Flow of data scheme of analysis . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 Results of analysis of FCA with features extracted from project documentation 24
4.8 Results of analysis of FCA with MR-specificity and layering features . . . . . . 25
4.9 Results of analysis of FCA with MR-specificity and layering features, with im-
posed threshold of 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.10 Results of analysis of FCA with MR-specificity and layering features on one
subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.11 Resulting concept subpartitions from the set of concepts of the analysis on one
subsystem with no threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.12 Results of analysis of FCA with MR-specificity and layering features on one
subsystem, with imposed threshold of 25 (left) and 50 (right) . . . . . . . . . . 27
5.1 Example of a merge of a group of building blocks . . . . . . . . . . . . . . . . 38
5.2 Flow of data scheme of analysis . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Results of analysis on complete hierarchy . . . . . . . . . . . . . . . . . . . . 40
5.4 Results of analysis on one subsystem only . . . . . . . . . . . . . . . . . . . . 40
5.5 Results of analysis with one subsystem in detail . . . . . . . . . . . . . . . . . 41
5.6 Results of analysis with one subsystem in detail - improved start solution . . . 41
5.7 Results of analysis with one subsystem in detail - improved start solution, round 2 42
5.8 Results of analysis with two predefined large clusters . . . . . . . . . . . . . . 43
5.9 Results of analysis with two predefined large clusters, round 2 . . . . . . . . . 43
A.1 Main MR system components and their physical connections . . . . . . . . . . 53
A.2 Descriptions of the main proposed archives (MR Software Architecture Vision
2010, October 4th 2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 1
Introduction
Philips Medical Systems (PMS) develops and produces complex systems to aid the medical
world with monitoring, diagnostic and other activities. Among these systems is the MR-
scanner, a product that uses Magnetic Resonance technology to analyze the human body for
potential diseases. The software for these MR-scanners is complex and has been evolving
for more than 27 years. The system is a combination of hardware and software and contains
(real-time) embedded software modules. Many software technologies are used and also
third party off-the-shelf modules are integrated in the software.
Currently the software is developed using branching, with a single 9 MLOC software archive consisting of approximately 30,000 files. Merging the multiple software streams causes integration problems, and the many dependencies among the files mean that the feature that takes the longest time to develop determines the release time of the complete project.
This way of developing is the result of many years of software evolution, but the MR department of PMS sees the current development process as inefficient and wants to improve its efficiency. A team has been formed to improve the development process by evolving the current architecture into a new architecture that can be maintained more easily.
The team envisions the new architecture as one consisting of about 7 archives that can be developed independently. In order to obtain these independent archives, the current software archive should be analyzed to determine which modules should form these archives, and clear interfaces should be defined between the archives.
We use Symphony as an architecture reconstruction framework to obtain a splitting of the current archive, using analysis methods available in the literature. Existing algorithms and tools are discussed and an approach is presented to use these analysis methods to obtain a splitting of the current archive.
The outline of this thesis is as follows: in chapter 2 we present the MR case and the research questions extracted from it. The research questions are answered in the following chapters. In chapter 3 we discuss Symphony, which is used as framework for the analyses. Then in chapter 4 we discuss formal concept analysis and in chapter 5 cluster analysis, as analysis methods to obtain a splitting of the archive. Finally, in chapter 6 conclusions are drawn from the researched matters.
Chapter 2
Problem definition
In order to arrive at the problem definition of this thesis, the MR case is discussed first and the problem definition is extracted from it. In section 2.1 the MR case is discussed and in section 2.2 the problem definition of this thesis is presented.
2.1 The MR case
Philips Medical Systems (PMS) develops evolutionary medical systems such as MR-scanners,
based on Magnetic Resonance technology. The medical activities of Philips date back to
1918, when it introduced a medical X-ray tube. Today PMS is a global leader in diagnos-
tic imaging systems, healthcare information technology solutions, patient monitoring and
cardiac devices.
The MR case concerns the software for the MR-scanners produced by PMS. The soft-
ware is developed on multiple sites (Netherlands, USA and India) and there are about 150
developers working on the software spread out over these 3 sites. Currently, to minimize
project risks, parallel software streams are used (branching).
Current development process The current way of developing the software for MR-scanners is seen by the MR department as inefficient. A first reason for this inefficiency is that, because multiple software streams are used, at some point in the development process these need to be merged.
To give a basic overview of how this works: the streams work with copies of the parts of the software archive that are to be modified by developers. When two streams are merged it is possible that there is an overlap in the changed sections of the archive copies. This process of development causes many merging problems.
A second reason is the fact that there are many dependencies between the building blocks in the archive, which cause the release moment to be determined by the release moment of the slowest feature.
The ‘Software Architectuur Team’ (SWAT) of MR is responsible for providing a strat-
egy to cope with the current problems. The team has created a vision that the future software
archive should consist of about 7 independent archives. Each of these different archives
should act as a component and they should be combinable with each other independently.
In order to implement this vision the current archive should be split into pieces.
System’s characteristics and building block structure To provide some insight into the
MR case, an overview of some of the system’s characteristics is presented. The software for
MR-scanners has been evolving for more than 27 years. The application-suite is built using
a lot of different software technologies, such as C, C++, C# and Perl and many parts of it
can be considered legacy software. Also, many off-the-shelf components are integrated
into the system. This has resulted in the current situation, wherein the code base consists of
approximately 9 MLOC, located in around 30,000 files.
Files are contained in entities called building blocks; the files are subdivided into nearly
600 building blocks. Furthermore, a lot of dependencies exist between these building
blocks.
The code archive is organized by structuring building blocks in a tree-structure, a so-
called building block structure or building block hierarchy. At the highest level of ab-
straction in this building block structure there are the so-called subsystems, which contain
multiple lower level building blocks. However, the tree-structure of building blocks does
not represent the high level architecture of the system. Figure 2.1 shows the hierarchy of
building blocks.
There exists detailed documentation on the architecture, such as UML diagrams and
project documentation. More details on the architecture can be found in appendix A. For
this thesis mainly project documentation is used: The documentation specifies the purpose
of the projects and - among other details - which building blocks are expected to be under
the scope of projects.
For the last two years these scopes have been documented and around 50 documents are
available. An example of a description of a scope of a (fictive) project can be seen in figure
2.2.
2.2 Problem definition
In order to achieve a splitting of the archive for the MR case, the software should be analyzed to obtain archives that can be developed independently. Therefore the following research question is central in this thesis:
• Which analysis methods are available in order to split an archive into multiple inde-
pendent archives?
To come to a solution for this problem, we envision that we should first refine our central research question into the following subsidiary research questions:
• Are there tools or algorithms available to perform the ‘splitting’ analysis?
• How can the analysis methods be applied to the MR case?
Figure 2.1: The building block structure
Figure 2.2: The scope of a fictive project
• Are these analysis methods scalable enough to cope with large scale industrial soft-
ware?
We discuss formal concept analysis and cluster analysis as analysis methods to split the software archive into multiple independent archives. To apply the analysis methods, algorithms or tools performing the analysis are required; therefore existing algorithms and tools are discussed for this purpose.
To apply the analysis methods to the MR case, we first discuss how the analysis methods are applied in the available literature, and an approach is presented for how to come to a splitting of the current MR archive.
As the archive for the MR case concerns a large scale industrial software archive, scala-
bility issues have to be taken into account. Therefore scalability issues of applying analysis
methods for the MR case are discussed and approaches are proposed to cope with these
issues.
After performing analyses on the MR case with the researched analysis methods, the results are evaluated with the domain experts at PMS to validate the outcomes. Finally a conclusion
will be drawn from the researched matters.
Chapter 3
Symphony
The problem of how to obtain multiple independent archives from a single archive can be seen as an architecture reconstruction problem. Symphony is a view-driven software architecture reconstruction method and is used in this thesis to provide a framework to come to a solution of the defined problem. [32]
Central to Symphony are views and viewpoints to reconstruct software architectures. The terminology of views and viewpoints is adapted from the IEEE 1471 standard. A view is a “representation of a whole system from the perspective of a related set of concerns”. A view conforms to a viewpoint. A viewpoint is a “specification of the conventions for constructing and using a view.” [25]
Symphony is a framework describing the process of architecture reconstruction. The
Symphony process consists of two stages: [32]
1. Reconstruction Design
2. Reconstruction Execution
The two stages are typically iterated because reconstruction execution reveals informa-
tion about the system that can serve as an input for a better understanding of the problem
and a renewed reconstruction design. In section 3.1 the reconstruction design stage is dis-
cussed and in section 3.2 the reconstruction execution. Finally in section 3.3 is discussed
what Symphony means for the MR case.
3.1 Reconstruction design
The first stage of Symphony is the reconstruction design. During this stage the problem
is analyzed, viewpoints for the ‘target views’ are selected, ‘source views’ are defined and
‘mapping rules’ from source to target views are designed. [32] These terms are explained
below.
The source view is a view of a system that “can be extracted from artifacts of that system, such as source code, build files, configuration information, documentation or traces”. [32]
The target view is a view of the system that describes the “as-implemented architecture
and contains the information needed to solve the problem/perform the tasks for which the
reconstruction process was carried out”. [32]
The mapping rules define how the target view is to be created from the source view and depend on, for example, which analysis method is used. When the problem, views and mapping rules have been defined during the reconstruction design, the next step is to perform the reconstruction execution.
The reconstruction design stage consists of the problem elicitation and the concept determination. During the problem elicitation the problem that needs to be solved is analyzed. After the problem has been analyzed, the concept determination follows, where concepts relevant to the problem are identified and a recovery strategy is presented: the target viewpoint, source viewpoint and mapping rules are defined or refined.
So during the first stage of Symphony - the reconstruction design - the problem is ana-
lyzed and the viewpoints and mapping rules are defined or refined. Refined, because there
could be new insights into the problem after the reconstruction execution stage.
3.2 Reconstruction execution
The second stage of Symphony is the reconstruction execution. During the reconstruction
execution stage the system is analyzed, source views are extracted and the mapping rules
are applied to obtain the target views, earlier defined during the reconstruction design. [32]
This thesis follows the extract-abstract-present approach used by Symphony. Sym-
phony presents three steps to apply this approach:
1. Data gathering
2. Knowledge inference
3. Information interpretation
The data gathering step extracts information from the system’s artifacts that can be
used to recover selected architectural concepts from the system. These artifacts include the
source code, but also build-files or documentation of the system.
During the knowledge inference step the mapping rules, defined during reconstruction
design, are applied to the source view to create a target view.
Finally during the information interpretation step the created target views are analyzed
and interpreted to solve the problem. Results from this stage lead to a refined problem
definition or to a refined reconstruction design to eventually come to a satisfying solution
of the problem.
Figure 3.1 - adopted from [32] - describes the three steps of the reconstruction execution. The figure also shows that the knowledge inference step can be repeated, by inferring knowledge from intermediate results of a previous inference step.
Figure 3.1: Reconstruction execution interactions
3.3 Symphony for the MR case
In this section we discuss how Symphony is used as a framework for the MR case and how
the structure of this thesis follows the steps in the Symphony reconstruction process. In
section 3.3.1 we discuss the first phase of Symphony: the reconstruction design, applied to
the MR case and in section 3.3.2 the second phase: the reconstruction execution.
3.3.1 Reconstruction design for the MR case
The framework of Symphony is used for the MR case because the problem of splitting the
single archive into multiple independent archives can be seen as an architecture reconstruc-
tion effort. The first stage of Symphony, the reconstruction design, starts with the problem
elicitation. This has been defined in chapter 2, the problem definition.
The concept determination of Symphony, as part of the reconstruction design, for the
MR case means defining the source views, target views and the mapping rules.
Two different analysis methods are used to analyze how a splitting can be achieved for
the MR case: Formal concept analysis and cluster analysis. For each of the two differ-
ent analysis methods a different source viewpoint and different mapping rules are defined,
which serve as different bases for the reconstruction design. Chapters 4 and 5 describe the
two approaches.
Source views For both analysis methods static relations between entities are taken as
source viewpoint. Also the existing building block hierarchy is taken as source view for
both analysis methods.
Mapping rules The mapping rules differ for both analysis methods. The mapping rules
are inherent to the theory of each of the two analysis methods. For formal concept analysis
this can be found in chapter 4 and for cluster analysis this can be found in chapter 5.
Target views The target view is identical for both approaches and can readily be selected
based on the description of the MR case and the problem definition. The problem is how
to split the single archive into multiple independent archives which leads to a Module view-
point. [16] The viewpoint identifies high-level architecture abstractions in the system and
describes relationships among them.
3.3.2 Reconstruction execution for the MR case
The processes of analysis proposed in the sections ‘Application to the MR case’ for both
analysis methods follow the extract-abstract-present approach of the reconstruction execu-
tion stage of Symphony. For formal concept analysis this can be found in section 4.6 and
for cluster analysis this can be found in section 5.5.
Data gathering During the data gathering step static relations between entities are acquired for both analysis methods. This section describes how these relations are obtained.
A commercial tool called ‘Sotograph’ is available at PMS to statically extract relations from the code archive. These relations include the following types: aggregation, catch, componentcall, componentimplementation, componentinterfacecall, externalias, externdeclaration, frienddeclaration, inheritance, metadata, throw, throws, typeaccess and write. [31]
The relations are analyzed on the method/function level. Relations on higher abstraction levels - such as the file or building block level - are obtained by accumulating the relations to and from the lower abstraction levels.
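To make the accumulation step concrete, the following minimal Python sketch lifts function-level relations to a higher abstraction level. The data shapes (a list of function-to-function relations and a mapping from each function to its enclosing building block) are our own assumptions for illustration and do not reflect Sotograph's actual data model.

from collections import Counter

def accumulate_relations(function_relations, owner_of):
    """Lift method/function-level relations to file or building block level.

    function_relations: iterable of (from_function, to_function) pairs.
    owner_of: maps a function to its enclosing unit at the desired level.
    Both shapes are hypothetical and only illustrate the accumulation step.
    """
    counts = Counter()
    for src, dst in function_relations:
        a, b = owner_of[src], owner_of[dst]
        if a != b:               # relations internal to a unit are not counted
            counts[(a, b)] += 1  # number of static relations from unit a to unit b
    return counts

# Two function-level calls end up as one building block level dependency of weight 2.
relations = [("f1", "g1"), ("f2", "g1")]
owner = {"f1": "bbA", "f2": "bbA", "g1": "bbB"}
print(accumulate_relations(relations, owner))  # Counter({('bbA', 'bbB'): 2})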
Sotograph only analyses the parts of the archive that are written in C, C++ and the proprietary language GOAL-C. As Sotograph does not yet provide an integrated view with the C# parts of the archive, this thesis focuses on the parts that are not written in C#. This means that the scope of the analyses in this thesis is limited to around 15,000 files and 360 building blocks, consisting of around 4 million LOC.
Sotograph maintains a database containing the information from the static code analysis, including building block hierarchy information. This database can be seen as a populated repository containing the extracted source view.
Knowledge inference and information interpretation For each of the two analysis meth-
ods the knowledge inference and information interpretation differ, because of the two dif-
ferent underlying theories. The details can be found in chapter 4 for formal concept analysis
and in chapter 5 for cluster analysis.
Chapter 4
Formal Concept Analysis
In order to split the software archive for the MR case, the archive should be analyzed to obtain archives that can be developed independently. This chapter discusses Formal Concept Analysis (FCA) as an analysis method for this purpose.
FCA was introduced by Wille around 1980. [37] The analysis method is used for structuring data into units. These units are “formal abstractions of concepts of human thought allowing meaningful and comprehensible interpretation”. [10]
The authors state that a concept can be philosophically understood as “the basic units of thought formed in dynamic processes within social and cultural environments”. The philosophical view is that a concept is defined by its extension, that is, all the objects that belong to the concept, and by its intension, that is, all the attributes that apply to all the objects of the concept. To be less formal: FCA provides a way to identify sensible groupings of objects that have common attributes. [10]
The following subsections will describe how FCA works. In section 4.1 the term formal
context is defined and in section 4.2 the term formal concept. The subconcept-superconcept-
relation and the concept lattice are explained in section 4.3, and algorithms to construct the
concept lattice from a context are discussed in section 4.4. Section 4.5 discusses appli-
cations of FCA and in section 4.6 the applicability to the MR case is explored. Then in
section 4.7 the results of the analyses are discussed and the results are evaluated in section
4.8. Finally in section 4.9 a summary is given.
4.1 Formal context
The formal context defines which objects have which attributes and is the environment in which formal concepts can be defined. (One should read the terms objects and attributes in an abstract way; in FCA these objects and attributes do not have the meaning they have in OO-programming, for example.)
The original mathematical definition uses capitals that designate the original German words. [10] The definition in this thesis uses capital letters designating the English translations and is as follows:
A triple (O, A, R) is called a formal context when O and A are finite sets (the objects and
the attributes respectively) and R is a binary relation between O and A that is:
R ⊆ O × A. (o, a) ∈ R is read: object o has attribute a.
In figure 4.1 an example of a context is shown by indicating relations between objects
and attributes. This table illustrates a ‘crude characterization of mammals’ adopted from
Siff and Reps. [28]
Figure 4.1: Crude characterization of mammals
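To make the definition concrete, the mammal context of figure 4.1 can be written down directly as a triple (O, A, R). The following minimal Python sketch does so; the exact incidence relation is assumed to match the table of Siff and Reps and only serves as an illustration.

# The formal context (O, A, R) of the mammal example as plain data.
objects = {"cats", "dogs", "dolphins", "gibbons", "humans", "whales"}
attributes = {"four-legged", "hair-covered", "intelligent", "marine", "thumbed"}

# R as a set of (object, attribute) pairs: (o, a) in R reads "object o has attribute a".
relation = {
    ("cats", "four-legged"), ("cats", "hair-covered"),
    ("dogs", "four-legged"), ("dogs", "hair-covered"),
    ("gibbons", "hair-covered"), ("gibbons", "intelligent"), ("gibbons", "thumbed"),
    ("humans", "intelligent"), ("humans", "thumbed"),
    ("dolphins", "intelligent"), ("dolphins", "marine"),
    ("whales", "intelligent"), ("whales", "marine"),
}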
4.2 Formal concept, extent and intent
A formal concept, or simply concept, is a maximal set of objects that all share the same attributes. The mathematical definition is as follows: [10]
When for X ⊆ O we define X′ := {a ∈ A | (o, a) ∈ R for all o ∈ X} and for Y ⊆ A we define Y′ := {o ∈ O | (o, a) ∈ R for all a ∈ Y}, then a pair (X, Y) is a formal concept of (O, A, R) if and only if X ⊆ O, Y ⊆ A, X′ = Y and X = Y′.
In figure 4.1 it can be seen that both the cats and the dogs are four-legged and hair-covered. When we define set X as {cats, dogs} and set Y as {four-legged, hair-covered}, then ({cats, dogs}, {four-legged, hair-covered}) is a formal concept of the context of the mammal example. Set X is called the extent and set Y the intent. [28]
Figure 4.2 shows the extent and the intent of the concepts for the example of the crude characterization of mammals, adopted from Siff and Reps. [28]
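Continuing the small context sketched in section 4.1, the two derivation operators and the condition X′ = Y, X = Y′ translate directly into code; the function names are our own and only illustrate the definition.

def common_attributes(X, relation, attributes):
    """X': the attributes shared by all objects in X."""
    return {a for a in attributes if all((o, a) in relation for o in X)}

def common_objects(Y, relation, objects):
    """Y': the objects that have all attributes in Y."""
    return {o for o in objects if all((o, a) in relation for a in Y)}

def is_formal_concept(X, Y, relation, objects, attributes):
    """(X, Y) is a formal concept iff X' = Y and Y' = X."""
    return (common_attributes(X, relation, attributes) == Y and
            common_objects(Y, relation, objects) == X)

# The concept ({cats, dogs}, {four-legged, hair-covered}) from this section:
print(is_formal_concept({"cats", "dogs"}, {"four-legged", "hair-covered"},
                        relation, objects, attributes))  # True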
4.3 Subconcept-superconcept-relation and concept lattice
Concepts are related to each other when the subconcept-superconcept-relation holds; in fact the concepts are naturally ordered by this relation: [10]
(X₁, Y₁) ≤ (X₂, Y₂) :⇔ X₁ ⊆ X₂ (⇔ Y₂ ⊆ Y₁)
Figure 4.2: Extent and intent of the concepts for the mammal example
Figure 4.3: Concept lattice for the mammal example
For instance the concept ({cats, dogs}, {hair-covered, four-legged}) is a subconcept of the concept ({cats, gibbons, dogs}, {hair-covered}). The subconcept-superconcept-relation forms a complete partial order over the set of concepts. This ordered set is called the concept lattice of the context (O, A, R). This lattice has on top the concept ({all objects}, {no attributes}) and as bottom the concept ({no objects}, {all attributes}). In figure 4.3 the concept lattice of the mammal example is shown, adopted from Siff and Reps. [28]
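The ordering itself is no more than a subset test on the extents (or, equivalently, a reversed subset test on the intents), as the following small sketch of the two concepts mentioned above illustrates.

def is_subconcept(c1, c2):
    """(X1, Y1) <= (X2, Y2) iff X1 is a subset of X2 (equivalently, Y2 subset of Y1)."""
    (x1, _y1), (x2, _y2) = c1, c2
    return x1 <= x2  # subset test on the extents

cats_dogs = ({"cats", "dogs"}, {"hair-covered", "four-legged"})
hair_covered = ({"cats", "gibbons", "dogs"}, {"hair-covered"})
print(is_subconcept(cats_dogs, hair_covered))  # True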
4.4 Concept lattice construction
To construct a concept lattice for a context several algorithms exist. There are two types of
algorithms for computing the concept lattice for a context: batch algorithms and iterative
algorithms. Batch algorithms construct the concept lattice bottom-up or top-down. An
example of a top-down approach is Bordat’s algorithm, which starts with the node ({all objects}, {no attributes}). From this node it generates all the children, which are then linked to their parent. This is iteratively repeated for every newly generated pair. [3]
An example of a batch algorithm that constructs the concept lattice bottom-up is Snelt-
ing’s algorithm. Snelting shows that the worst case running time of the construction algo-
rithm is exponential in the number of elements in the lattice. However Snelting empirically
determines that in practice, the concept lattice can typically be constructed in O(n³) time when using sets of size n for the objects and attributes. [29]
Iterative algorithms do not build the concept lattice from scratch but update the concept lattice incrementally. When a new object is added to the concept lattice, depending on the element and its relations, some nodes in the lattice are modified and new pairs are created. These algorithms have been created because the idea was that it takes less time to update a concept lattice than to build it from scratch. An example of an iterative algorithm is Godin’s algorithm. [15]
Godin compares some batch and iterative algorithms to see in which cases the algorithms should be used. Surprisingly, he shows that his simplest proposed iterative algorithm outperforms the batch algorithms when the number of instances grows. Further he shows empirically that on average the incremental update of his proposed algorithm is done in time “proportional to the number of instances previously treated”. He states that although the theoretical worst case is exponential, when one supposes there is a fixed upper bound to the number of attributes related to an object, the worst case analysis of the algorithm shows linear growth with respect to the number of objects. [15]
There exists a tool called ConExp (Concept Explorer) that can be used to construct and visualize a concept lattice, given a context. [6] The tool takes a context as input and can output the corresponding concept lattice in a format readable by other applications.
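For a context as small as the mammal example, the set of concepts can even be enumerated naively by closing every subset of objects. The sketch below, which reuses the context of the earlier sketches, is exponential in the number of objects and is only meant to illustrate what the batch and iterative algorithms discussed above compute far more efficiently.

from itertools import chain, combinations

def derive_attrs(X, relation, attributes):
    return frozenset(a for a in attributes if all((o, a) in relation for o in X))

def derive_objs(Y, relation, objects):
    return frozenset(o for o in objects if all((o, a) in relation for a in Y))

def all_concepts(objects, attributes, relation):
    """Naive batch construction: close every subset of objects.

    Exponential in |O|; algorithms such as Bordat's, Snelting's or Godin's
    avoid this, but on a toy context the resulting set of concepts is the same.
    """
    concepts = set()
    subsets = chain.from_iterable(combinations(objects, k)
                                  for k in range(len(objects) + 1))
    for subset in subsets:
        intent = derive_attrs(set(subset), relation, attributes)
        extent = derive_objs(intent, relation, objects)
        concepts.add((extent, intent))
    return concepts

# Applied to the mammal context this yields the concepts of the lattice in figure 4.3.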
4.5 Applications of FCA
FCA has been applied in many different settings and for different purposes. To explore
how FCA can be applied to identify multiple archives for the MR case the several different
settings for which FCA is used are discussed.
Snelting for example used FCA to study how members of a class hierarchy are used in
the executable code of a set of applications by examining relationships between variables
and class members, and relationships among class members. [30]
Eisenbarth et al. use FCA embedded in a semi-automatic technique for feature identification. They identify the “computational units that specifically implement a feature as well as the set of jointly or distinctly required computational units for a set of features.” [8, 2]
Godin and Mili try to obtain multiple archives from a single archive by identifying generalizations of (OO-)classes using FCA. As starting point they use the set of interfaces of classes. As objects they use the set of classes and as attributes the set of class properties, including attributes and methods. [13, 14]
The previous examples show that FCA is applicable for a wide range of purposes. The
reason why FCA can be used for different settings is that the objects, attributes and the
relations are of free choice. For each purpose a different choice of the three sets is needed.
For example Snelting uses variables as objects and class members as attributes.
In order to see how FCA can be applied for the MR case we discuss the work of Siff
and Reps. They used FCA in a process to identify modules in legacy code. In their paper
they present these three steps in the process: [28]
1. Build context where objects are functions defined in the input program and attributes
are properties of those functions.
2. Construct the concept lattice from the context with a lattice construction algorithm.
3. Identify concept partitions: collections of concepts whose extents partition the set of objects. Each concept partition corresponds to a possible modularization of the input program.
Siff and Reps thus choose functions as objects and properties of functions as attributes
to group together functions that have common attributes and thus are assumed to be located
in the same module. They use the concept partition to retrieve a possible modularization.
[28]
A concept partition is a set of concepts whose extents are non-empty and form a partition of the set of objects O, given a context (O, A, R). More formally this means that CP = {(X₀, Y₀), ..., (Xₙ, Yₙ)} is a concept partition iff the extents of the concepts blanket the object set and are pairwise disjoint: [26, 28]
X₀ ∪ ... ∪ Xₙ = O and ∀i ≠ j, Xᵢ ∩ Xⱼ = ∅
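This definition translates directly into a check; the sketch below assumes that concepts are represented as (extent, intent) pairs of sets, as in the earlier sketches.

def is_concept_partition(concepts, objects):
    """CP: the extents are non-empty, cover the object set O and are pairwise disjoint."""
    extents = [set(extent) for extent, _intent in concepts]
    if not extents or any(not e for e in extents):
        return False
    covers = set().union(*extents) == set(objects)
    disjoint = all(extents[i].isdisjoint(extents[j])
                   for i in range(len(extents))
                   for j in range(i + 1, len(extents)))
    return covers and disjoint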
Tonella uses the same entities as objects and attributes as Siff and Reps, but he uses the concept subpartition (CSP) to retrieve a possible modularization. Tonella proposes concept subpartitions because the constraint that the concept extents cover the complete object set is overly restrictive. He states that for concept partitions “important information that was identified by concept analysis is lost without reason” when concepts are disregarded because they cannot be combined with other concepts. He defines that CSP = {(X₀, Y₀), ..., (Xₙ, Yₙ)} is a concept subpartition iff: [26]
∀i ≠ j, Xᵢ ∩ Xⱼ = ∅
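The corresponding check for a concept subpartition simply drops the coverage requirement, which the following lines (under the same assumptions as above) make explicit.

def is_concept_subpartition(concepts):
    """CSP: only pairwise disjointness of the extents is required, not full coverage of O."""
    extents = [set(extent) for extent, _intent in concepts]
    return all(extents[i].isdisjoint(extents[j])
               for i in range(len(extents))
               for j in range(i + 1, len(extents)))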
Where CPs can be directly mapped to object partitions - that is, partitions of the object set - CSPs have to be extended to the object set using the so-called partition subtraction: [26]
The partition subtraction of an object subpartition SP from an object partition P gives the subpartition complementary to SP with respect to P. It can be obtained by subtracting the union of the sets in SP from each set in P.
P sub SP = {Mₖ = Mᵢ − ⋃_{Mⱼ ∈ SP} Mⱼ | Mᵢ ∈ P}
P sub SP is itself a subpartition because sets in P are disjoint and remain such after
the subtraction. The partition subtraction is used in the subpartition extension to obtain the
object set partitioning: [26]
An object subpartition SP can be extended to an object partition Q, with reference to an
original partition P, by the union of SP and the subtraction of SP from P. The empty
set is not considered an element of Q.
Q = SP ∪ (P sub SP) − {∅}
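The partition subtraction and the extension to an object partition Q can be sketched as follows; the building block names in the example are hypothetical and only show the bookkeeping.

def partition_subtraction(P, SP):
    """P sub SP: subtract the union of the sets in SP from each set in P."""
    removed = set().union(*SP) if SP else set()
    return [M - removed for M in P]

def extend_to_partition(SP, P):
    """Q = SP union (P sub SP), with the empty set dropped."""
    return [M for M in list(SP) + partition_subtraction(P, SP) if M]

P = [{"bbA", "bbB"}, {"bbC", "bbD"}]   # original object partition
SP = [{"bbB", "bbC"}]                  # object subpartition to be extended
print(extend_to_partition(SP, P))      # e.g. [{'bbB', 'bbC'}, {'bbA'}, {'bbD'}]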
There is a remark to be made about the CSPs: if there is no overlap between the object sets of a set of concepts, the number of CSPs resulting from this set of concepts grows enormously: the number of CSPs is then the number of ways to partition a set of n elements into k nonempty subsets. (This number is given by the Stirling numbers of the second kind.)
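For illustration, the Stirling numbers of the second kind can be computed with their usual recurrence; summing over all k gives the total number of partitions of an n-element set (the Bell number), which already runs into the billions for n = 15.

def stirling2(n, k):
    """Number of ways to partition a set of n elements into k non-empty subsets."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Recurrence: S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(sum(stirling2(15, k) for k in range(1, 16)))  # 1382958545 partitions of a 15-element set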
The previously presented applications of FCA, the process presented by Siff and Reps,
and the definitions of concept partition and concept subpartition can be used to apply FCA
to the MR case. [28, 26]
4.6 Application to the MR case
This section discusses how FCA can be applied to the MR case. In section 4.6.1 the approach of applying FCA to the MR case is discussed, in section 4.6.2 the ‘features’ in the context that is used are elaborated in more detail and in section 4.6.3 scalability issues concerning the presented approach are discussed. Finally in section 4.6.4 the notion of ‘concept quality’ is defined to aid the analysis of the archive.
4.6.1 Approach
To obtain multiple independent archives for the MR case using FCA, the first step is to build a suitable context. For the object set in the context, initially the set of files in the MR archive was chosen. The reason is that a file can be seen as a good abstraction of functionality; Choi states that functions in a file are semantically related and form a “cohesive logical module”. [5]
However, early experiments with formal concept analysis on files gave so many scalability issues that a level of abstraction one step higher was chosen from the archive of the MR case. This level of abstraction results in the set of building blocks.
The choice of taking building blocks as the set of objects is not only the most logical with respect to the level of abstraction; domain experts of PMS also indicate that building blocks are designed to encapsulate particular functionality within the archive.
The attributes have to be chosen in such a way that building blocks that are highly related
to each other appear in concepts of the context. The following two sets of attributes are
combined in the context for this purpose and indicate that a building block:
1. is highly dependent on another building block, that is: uses a function or data structure
of a file within a particular building block. The term ‘highly dependent’ is used
to discriminate between the heavy use of a building block and occasional use. A
threshold is used to filter relations on the degree of dependency.
2. has particular features associated with it. These particular features are: MR speci-
ficity, layering and information about what building blocks are affected during soft-
ware projects. Such features are extracted from architecture overviews and project
documentation or are defined by domain experts.
The reason why these two sets of attributes are identified is that the first set assumes that
building blocks which are highly dependent on each other should reside in the same archive.
The second set of attributes assumes that building blocks that share the same features, such
as building blocks that are all very MR-specific, should be grouped in the same archive.
The first set of attributes is defined in the ‘as-is’ architecture: the code archive. The
interdependencies of the building blocks in the architecture are extracted from the code
archive using the commercial tool Sotograph and the degree of dependency is determined.
[31] Sotograph determines the degree of dependency by summing up static dependencies
up to the desired abstraction level. A threshold is used to be able to discriminate between a
high degree of dependency and a low degree.
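As an illustration of how the threshold turns weighted dependencies into binary context attributes, consider the following sketch; the attribute naming ('uses:<building block>') and the input shape are our own assumptions and not the format used by Sotograph or the analysis tooling.

def high_dependency_attributes(dependency_counts, threshold):
    """Turn weighted building block dependencies into binary context attributes.

    dependency_counts: {(from_bb, to_bb): number_of_static_relations}, e.g. the
    output of the accumulation sketch in section 3.3.2 (hypothetical shape).
    A building block gets the attribute "uses:<bb>" for every dependency whose
    weight reaches the threshold; occasional use stays below it and is dropped.
    """
    attrs = {}
    for (src, dst), weight in dependency_counts.items():
        if weight >= threshold:
            attrs.setdefault(src, set()).add("uses:" + dst)
    return attrs

print(high_dependency_attributes({("bbA", "bbB"): 40, ("bbA", "bbC"): 3}, threshold=25))
# {'bbA': {'uses:bbB'}}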
The second set of attributes is defined in the documentation of the architecture and is available from domain experts of the MR department. The features associated with the building blocks are discussed in more detail in section 4.6.2.
So the two sets of attributes that form the attributes of the context are a combination
of the ‘as-is’ architecture extracted from the code archive and features extracted from the
‘as-built’ architecture, according to the documentation and the domain experts. In figure
4.4 a simple example of a context is shown.
Now, having defined what the context is for using FCA for the MR case, this thesis adapts the process proposed by Siff and Reps, adjusted for this particular purpose. [28] It follows the extract-abstract-present approach of Symphony: [32]
1. Build context where objects are building blocks defined in the software archive and
attributes indicate that building blocks are highly related, using the two sets of at-
tributes previously defined in this section.
2. Construct the concept lattice from the context with a lattice construction algorithm
discussed in section 4.3.
3. Identify concept subpartitions: collections of concepts whose extents partition (a subset of) the set of objects, using the concept subpartition (CSP) proposed by Tonella. [26] Each concept subpartition corresponds to a possible partitioning of the archive into multiple independent archives.
Figure 4.4: An example context to partition the archive into multiple archives using project
documentation
4.6.2 Features for FCA
As mentioned before there are two sets of attributes used for the concept analysis approach.
The first set of attributes indicates whether building blocks are highly dependent on other
objects. The second set of attributes defines the several features of the building blocks.
For the features two approaches have been chosen:
• Information about building blocks affected during projects
• MR-specificity and layering
These two approaches give two different results, which are evaluated further in this
thesis. Both results can be used to define a final splitting of the archive, in cooperation with
the systems architects.
Information extracted from project documentation
The first approach makes use of several software project documents. The documents describe which building blocks are under their scope, which means which building blocks are expected to be changed during the project. This scope is determined by the system’s architects prior to the start of the project. The scope can consist of building blocks that are scattered through the entire archive, but because projects often are used to implement certain functionality, there is a relation between the building blocks in the scope. Figure 2.2 shows an example of the scope of a fictive project.
This particular relation is used to group the building blocks together in the form of
concepts after the construction of the context. The fact that this grouping possibly crosscuts
the archive makes this feature interesting to use for FCA in combination with the high
dependency relations between building blocks.
To give an example: given a project that implements a certain feature, named ‘projectA-
feature1’, there is documentation at PMS that describes that ‘buildingblockA’, ‘building-
blockB’ and ‘buildingblockC’ are under the scope of this project, which crosscuts the code
archive with respect to the building block hierarchy. Now the feature ‘projectA-feature1’ is
assigned to each of the three building blocks in the scope.
The information about scopes of projects in PMS documentation is only available for
projects from the last two years. Before this time this specific information was not docu-
mented. Further, when looking at the extracted features from the documentation, it is clear
that not all building blocks have one or more features designated to them. So the features do
not cover the complete object set of building blocks in the context.
This has consequences for deducing concepts from a context with these features. The building blocks to which no features have been designated will be grouped based on the attributes that indicate high dependency on other building blocks. However, this attribute could also be absent, either because there are no dependencies from this building block to other building blocks or because the number of dependencies to another building block is below the chosen threshold. This should be kept in mind while analyzing the results.
There are around 50 documents available to extract information from, each with a scope
of building blocks. So around 50 different features are assigned to the building blocks in
the archive. The documents have been produced for two long-term projects.
While extracting information from the documentation some difference in level of detail was noticed. That is, some project scopes were defined in great detail with respect to the building blocks in the hierarchy, while others were only defined on a very high level of abstraction. In the documentation there were examples of scopes that were defined as two or three ‘subsystems’ in the archive.
For these specific cases the feature was accumulated down the building block hierarchy.
For example when project documentation states that the scope of ‘projectA-feature1’ is
‘platform’, all underlying building blocks in the building block structure of ‘platform’ in
the archive are given the feature ‘projectA-feature1’, including ‘platform’ itself.
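The accumulation of a project scope down the hierarchy can be sketched as a simple tree walk; the building block names and the adjacency-map representation of the hierarchy are hypothetical.

def assign_scope_feature(feature, scope, children):
    """Assign a project feature to every building block in the documented scope,
    accumulating it down the building block hierarchy.

    scope: the building blocks named in the project document (possibly whole subsystems).
    children: adjacency map of the building block tree.
    Returns a mapping {building_block: {features}}.
    """
    features = {}
    todo = list(scope)
    while todo:
        bb = todo.pop()
        features.setdefault(bb, set()).add(feature)
        todo.extend(children.get(bb, []))   # push the whole subtree below bb
    return features

tree = {"platform": ["bbA", "bbB"], "bbA": ["bbA1"]}
print(assign_scope_feature("projectA-feature1", ["platform"], tree))
# 'platform', 'bbA', 'bbA1' and 'bbB' all receive the feature 'projectA-feature1'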
So the idea of this approach is that the building blocks will be grouped based on the fact that they are related through certain features of the software they implement or are involved with. This can be a different grouping than a grouping based on just high dependencies between the building blocks. We think it is interesting to use both types of features in the context for analysis, as a combination of the ‘as-is’ architecture and the ‘as-built’ architecture.
MR-specificity and layering
The other approach takes into account the MR-specificity and layering of the entities in the archive. The MR-specificity can be explained as follows: some building blocks are only found in MR software, while others are common to all medical scanner applications or even to other applications, such as database management entities or logging functionality.
For the layering a designated scale is used for the building blocks that states whether
a building block is more on the service level or more on the application/UI level. One can
think of a process dispatcher on the service level and a scan-define UI at the application/UI
level.
When these features are assigned to each building block, building blocks that share the same characteristics of MR-specificity and layering are grouped by deducing concepts from a context that includes these features as attributes. The underlying reason for choosing these specific features, MR-specificity and layering, is the wish of Philips Medical to evolve to a more homogeneous organization in terms of software applications. After the analysis it is interesting to consider opportunities such as reusing building blocks that are common in medical scanner software from other departments, or developing them together as reusable building blocks.
The MR-specificity and layering information of the building blocks in the archive is not available in documentation. Therefore, to obtain these features, domain experts from PMS assigned them to the building blocks. This was done using a rough scale for the MR-specificity: {very specific, specific, neutral, non-specific, very non-specific}. For the layering a similar scale holds, ranging from the application/UI level to the service level. This means that the complete object set of building blocks is covered, that is, each entity has a feature indicating the MR-specificity and a feature indicating the layering. So when taking one building block into account, there are 25 combinations possible with respect to the MR-specificity and layering.
Building blocks that have certain common features - such as a group of building blocks
that are ‘very MR-specific’ and are on the ‘application/UI level’ - are grouped based on
these features. Of course this can be a different grouping than the grouping based on the
high dependencies between building blocks. We think it is interesting to use both types
of features in the context for FCA, as a combination of the ‘as-built’ architecture and the
‘as-is’ architecture.
4.6.3 Scalability issues
When using the process proposed in the previous section, scalability issues come into play, because the set of building blocks in the archive is taken as the set of objects in the context. The archive consists of around 360 building blocks (not taking into account the C#-coded parts of the archive), which results in a large context with the attributes defined, and a large corresponding concept lattice.
Because identifying concept subpartitions (CSPs) from the concept lattice results in an enormous number of CSP-collections, which have to be evaluated manually, steps must be undertaken to cope with this amount.
Leveled approach To cope with the large amount of results this thesis proposes a leveled approach. The approach makes use of the hierarchical structuring of the MR archive: the archive is modularized in high-level ‘subsystems’, which consist of multiple ‘building blocks’, which again are structured in a hierarchy. Figure 2.1 shows the hierarchical structuring.
By analyzing parts of the hierarchy in detail, the resulting concepts from that analysis are merged for the next analysis. This makes the context and concept lattice of the next analysis round smaller, and the resulting number of CSPs is expected to decrease as well. We elaborate on this further in this section.
First we give a more detailed description of the leveled approach: by using the leveled approach some parts of the archive can be analyzed in detail, while keeping
the other parts at a high level. The results from such analysis, such as groupings of ‘lower
level’ building blocks, can be accumulated to a next analysis where another part is analyzed
in detail. These groupings are accumulated by merging the building blocks into a single
fictive building block to make the context and resulting concept lattice smaller. This is
repeated until all building blocks are analyzed in detail and the results are accumulated.
The attributes for each of the two approaches are assigned to the building blocks. When
a part of the hierarchy of the archive is not examined in detail the attributes are accumulated
to the entity that is examined globally. Figure 4.5 shows how this works. In the figure
the ‘platform-subsystem’ in the hierarchy of the archive is analyzed globally and the others
in detail. The figure shows that all the features in the lower levels in the top table of the
‘platform-subsystem’ are accumulated to ‘platform’ which can be seen in the lower table.
Figure 4.5: Example of how features are accumulated to higher level building blocks
The analysis itself was performed using a newly developed tool - named ‘Analysis Se-
lector’ - that uses the output of the analysis performed by Sotograph, which recognizes the
hierarchy of the archive and relations between building blocks. Further, separate input files
are given to the tool for the features, either extracted from the project planning or from
MR-specificity and generality documentation.
The tool enables the selection of parts of the hierarchy to analyze in detail and enables
viewing the resulting concept subpartitions and merging resulting groupings in the concept
subpartitions for further analysis. After selecting which parts should be analyzed in detail and which should not, the context is created using the accumulated attributes.
This context can be exported to a format that an existing tool, ‘ConExp’, can use as input. [6] Given a context, ConExp creates the corresponding concept lattice. This can be exported again to serve as input for the newly developed tool, which can deduce concept subpartitions from the concept lattice. Figure 4.6 shows how this works.
Figure 4.6: Flow of data scheme of analysis
Merging concepts Considering the number of CSPs resulting from the number of concepts, the number of concepts taken into consideration should be kept small when calculating the concept subpartitions. This can be accomplished by merging the extents of the
concepts (the object sets of the concepts) resulting from the context of an analysis. We will
call this merging a concept from now on.
When a concept is merged, the objects of that concept are grouped into a so-called ‘merge’, which is a fictive object with the same attributes as the original concept. It is expected that the context for a successive analysis round is now reduced in size and that fewer concepts will result from the next analysis round.
This process of merging and recalculating the context and concepts can be continued until a small number of concepts results from the defined context. From these concepts the CSPs can then be calculated.
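A possible way to implement this merge step, under the assumption that the context is kept as a simple mapping from objects to their attribute sets, is sketched below; this is a sketch, not the actual implementation of the ‘Analysis Selector’ tool.

def merge_concept(context_attrs, concept):
    """Group the objects of a concept's extent into one fictive 'merge' object.

    context_attrs: {object: set_of_attributes} view of the context (assumed shape).
    concept: an (extent, intent) pair chosen for merging, e.g. by its concept quality.
    Following section 4.6.3, the fictive object keeps the attributes of the original
    concept, so the context for the next analysis round becomes smaller.
    """
    extent, intent = concept
    merged_name = "merge(" + ",".join(sorted(extent)) + ")"
    reduced = {obj: attrs for obj, attrs in context_attrs.items() if obj not in extent}
    reduced[merged_name] = set(intent)
    return reduced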
Merging ‘some’ concepts raises the question which concepts to choose for merging. Picking concepts from a set of, for example, 300 concepts resulting from an analysis step is difficult. To aid in choosing the concepts, a quality function of a concept is developed, which we will call the concept quality.
This quality function is developed to ease the choice of concepts to group and is not part of FCA. The quality function takes into account how ‘strong’ a concept is, that is, on how many attributes the grouping is based. Also, relatively small groups of objects in a concept with relatively large groups of attributes can be ranked as a ‘strong’ concept. The following section elaborates the notion of ‘concept quality’ for this purpose.
4.6.4 Concept quality
The concept quality is developed as an extension for FCA to ease choosing concepts that
should be merged for a next analysis round as described in section 4.6.3. In order to give
each concept a value indicating the quality of the concept a quality function is needed. This
function indicates how strong the grouping of the concept is.
Basically the concept quality is based on two measures:
• The relative number of attributes of a concept
• The relative number of objects of a concept
A grouping of a concept is based on the maximal set of attributes that a group of objects share. Intuitively, when a small number of objects is grouped by a large number of attributes, this indicates a strong grouping of these objects and therefore a high concept quality value is assigned. Conversely, when a small number of objects share only a single attribute, this can intuitively be seen as a less strong grouping than one with a large number of attributes, and therefore a lower quality value is assigned.
A similar degree of quality is devised for the number of objects. Given a set of attributes, when a small number of objects share this set, this can intuitively be seen as a strong grouping; a large number of objects sharing this set, on the other hand, is seen as a less strong grouping.
So the quality function is also based on the number of objects in a concept sharing a set of attributes. The smaller the number of objects sharing a set of attributes, the higher the resulting quality value of the specific concept.
Because a degree of attributes in a concept and a degree of objects in a concept are measured for each concept, ceiling values are defined. As ceiling values the maximum number of objects and the maximum number of attributes over a given set of concepts are taken. The maximum number of objects in a set of concepts is called ‘MaxObjects’ and the maximum number of attributes in a set of concepts is called ‘MaxAttributes’.
Now we define the concept quality for a concept c as follows, with values in the range [0, 100]:

Quality(c) = \frac{MaxObjects - \#Objects}{MaxObjects} \cdot \frac{\#Attributes}{MaxAttributes} \cdot 100
Given this quality function, all concepts resulting from a context are evaluated and, based on the values, concepts are merged into single entities. These merges of building blocks are taken as one entity for the next round of analysis, with the purpose of decreasing the number of concepts.
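As a small illustration of the quality function, the sketch below computes Quality(c) for a list of concepts given as (extent, intent) pairs; the data in the example is hypothetical:

def concept_quality(concepts):
    # concepts: list of (extent, intent) pairs; extent = objects, intent = attributes.
    max_objects = max(len(extent) for extent, _ in concepts)
    max_attributes = max(len(intent) for _, intent in concepts)
    return [((max_objects - len(extent)) / max_objects)
            * (len(intent) / max_attributes) * 100
            for extent, intent in concepts]

# Two building blocks sharing three features score higher than four building
# blocks sharing a single feature, matching the intuition described above.
concepts = [({"bb1", "bb2"}, {"f1", "f2", "f3"}),
            ({"bb1", "bb2", "bb3", "bb4"}, {"f1"})]
print(concept_quality(concepts))   # [50.0, 0.0]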
4.7 Results
The analysis has been performed for the two approaches: the approach with the project documentation features and the approach with the MR-specificity and layering features. Both approaches make use of the hierarchy of the archive. Because the two approaches give different results, they are discussed separately in this section.
4.7.1 Project documentation features
Analysis has been performed on the complete building block structure, with no threshold
imposed on the relations. This means that every static relation between building blocks is
counted as a high dependency attribute in the context. This is combined with the information
extracted from the project documentation. Figure 4.7 shows the number of concepts plotted
against the number of analysis rounds.
Figure 4.7: Results of analysis of FCA with features extracted from project documentation
The full context results in 3031 concepts. The number of concepts is too big to create
concept subpartitions (CSPs) from. Figure 4.7 shows that the number of concepts decreases over the successive analysis rounds.
After 25 rounds the number of concepts has decreased to 379. However, calculating
CSPs from this set of concepts resulted in more than 10 million CSPs.
When the resulting (intermediate) concepts were discussed with the domain experts, it could be seen that the objects of the concepts were mainly grouped on the high dependency attributes in the context. This can be explained by the fact that the features extracted
from existing project documentation covered around 30% of the complete set of around 360
building blocks.
So when grouping objects in the defined context by concepts, the main factor of grouping is defined by the dependencies. This degree of influence of the high dependency attributes can be decreased by using a threshold on the static relations, but this also means that there will be more objects in the context that have no attributes assigned to them. Then no information is available to group these objects, so these objects will not be grouped in concepts.
For this reason we have chosen not to continue the analysis with a threshold for the features extracted from project documentation, and we decided not to apply FCA with these features on parts of the hierarchy.
4.7.2 MR-specificity and layering features
This section presents the results from analysis on the context with the MR-specificity and
layering features.
Full building block structure The first results of the analysis with the MR-specificity and
layering features are from analysis of the complete hierarchy of building blocks, without
any threshold imposed on the static dependencies, used in the context. Figure 4.8 shows the
results.
Figure 4.8: Results of analysis of FCA with MR-specificity and layering features
The full context results in 3011 concepts. Similar to the analysis on the project docu-
mentation features, this number of concepts is too big to create CSPs from. So concepts
were chosen to be merged each round. Figure 4.8 shows that the number of concepts decreases over the number of analysis rounds.
After 27 rounds the number of concepts has decreased to 351. Calculating CSPs from
this set of concepts resulted in more than 10 million CSPs.
The following results are from the analysis on the complete hierarchy of building blocks
with the MR-specificity and layering features, but now with a threshold of 25 imposed on
the static dependencies.
The full context results in 719 concepts. This number of concepts is too big to create
CSPs from. Therefore concepts were chosen to be merged each round. Figure 4.9 shows a
decrease of the number of concepts over the successive analysis rounds.
After 6 analysis rounds the number of concepts has decreased to 378. Calculating CSPs
from this set of concepts also resulted in more than 10 million CSPs.
One subsystem in detail Figure 4.10 shows the results of analysis on one subsystem
- called ‘platform’ - with the MR-specificity and layering features. This analysis has no
threshold imposed on the static dependencies.
There are 458 resulting concepts from the defined context. By merging concepts using
the concept quality measure, after 30 analysis rounds there are 490,000 CSPs resulting from
the set of 95 concepts. Figure 4.11 shows the number of resulting CSPs from the last five
analysis rounds.
Figure 4.9: Results of analysis of FCA with MR-specificity and layering features, with
imposed threshold of 25
Figure 4.10: Results of analysis of FCA with MR-specificity and layering features on one
subsystem
The same analysis has been performed with two different thresholds imposed on the
static relations: one of 25 and one of 50. The results of this analysis can be seen in figure
4.12. The left figure shows the results with a threshold of 25 and the right figure shows the
results with a threshold of 50.
The analysis with the threshold of 25 imposed starts with a context that results in 338
concepts, the analysis with the threshold of 50 imposed starts with a context that results in
287 concepts.
For the analysis with the 25 threshold after 7 rounds there was a set of 311 concepts, for
the analysis with the 50 threshold there was a set of 269 concepts.
Figure 4.11: Resulting concept subpartitions from the set of concepts of the analysis on one
subsystem with no threshold
Figure 4.12: Results of analysis of FCA with MR-specificity and layering features on one
subsystem, with imposed threshold of 25 (left) and 50 (right)
4.8 Evaluation
In this section the results of section 4.7 are discussed and we reflect on how our proposed approach works.
When we look at the results of both analyses on the complete building block structure, we can see that the number of concepts indeed decreases by applying the leveled approach, aided by the concept quality values. We expected that by decreasing the number of concepts over the several analysis rounds, the number of concept subpartitions (CSPs) would also decrease.
Indeed the number of resulting CSPs decreases, if we consider both the analyses on the
complete building block hierarchy. As each CSP represents a possible partitioning of the set
of building blocks, all the resulting CSPs have to be evaluated by domain experts. However
the number of CSPs is far too big for domain experts to evaluate.
The huge number of CSPs can be explained by the degree of overlap of the objects in a resulting set of concepts. For example, when each combination of two concepts in this set of concepts has an empty intersection of their object sets, the number of CSPs resulting from this set grows exponentially.
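A small sketch can make this explosion tangible. Assuming that a concept subpartition is a set of concepts whose object sets are pairwise disjoint, every non-empty subset of a collection of pairwise-disjoint concepts is itself a CSP, so n such concepts already yield 2^n - 1 CSPs:

from itertools import combinations

def count_csps(extents):
    # Count the subsets of concepts whose extents are pairwise disjoint.
    n, count = len(extents), 0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            if all(extents[i].isdisjoint(extents[j])
                   for i, j in combinations(subset, 2)):
                count += 1
    return count

# 15 pairwise-disjoint concepts already give 2**15 - 1 = 32767 CSPs.
print(count_csps([{i} for i in range(15)]))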
Also we observed a lot of concepts in the sets with a small number of objects. Concepts
with a small number of building blocks as the object set have a high chance that they can be
combined with other concepts, as then the chance of an overlap with building blocks in the
other concepts is small. More combinations of concepts result in more CSPs.
But if we consider a set of around 100 concepts and a resulting number of around 500,000 CSPs, we do not expect to get a significantly smaller number of concept subpartitions if we choose to use different attributes in the starting context, for example other features, or a more detailed scale for the MR-specificity and layering features. We base that on the mathematical definition of the CSP, which allows small sets of concepts to result in enormous numbers of CSPs.
So we think that the leveled approach works, considering the decreasing number of
concepts. However, the approach of deducing CSPs from a set of concepts - in order to
obtain partitionings of the set of building blocks - is in our opinion not applicable to a
problem of the size of the MR case.
4.9 Summary
We proposed FCA in this chapter as an analysis method to obtain a splitting of the archive
for the MR case. FCA provides ways to identify sensible groupings of objects that have
common attributes.
For the context of FCA building blocks are taken as objects and for the attributes two
sets are combined: a set of attributes indicating that building blocks are highly related and
a set of attributes that represent certain features of the building blocks. These features are
derived from existing project documentation and domain experts.
From this context concepts are derived, which are used to generate concept subpar-
titions. Each concept subpartition represents a possible partitioning of the object set of
building blocks, which can serve as a basis for defining a splitting of the archive.
The process of analysis works with a leveled approach, which means that some parts of
the building block hierarchy in the archive are analyzed in detail, while other parts are ana-
lyzed at a higher level. Results from a previous analysis are used for a next analysis where
more parts are analyzed in detail. Selecting which parts are used for successive analysis rounds is aided by the newly defined concept quality.
After evaluating the results we see that the leveled approach, in combination with the
concept quality, enables us to lower the number of resulting concepts from the context.
However, the resulting number of concept subpartitions from the number of concepts is enormous.
Therefore we conclude that the approach of deducing concept subpartitions from a set of concepts is not suited for a problem of the size of the MR case.
Chapter 5
Cluster Analysis
This chapter discusses cluster analysis as an analysis method to split the software archive for
the MR case. Cluster analysis can be used to analyze a system for clusters. The basic idea is to group highly dependent objects; such groups of highly dependent objects are called clusters.
To give a more formal definition of the term cluster: “continuous regions of space con-
taining a relatively high density of points, separated from other such regions by regions
containing a relatively low density of points”. [9]
When applying cluster analysis three main questions need to be answered according to
Wiggerts: [36]
1. What are the entities to be clustered?
2. When are two entities to be found similar?
3. What algorithm do we apply?
To discuss what cluster analysis is we follow Wiggerts by answering the three raised
questions. Section 5.1 discusses the entities to be clustered, section 5.2 discusses when en-
tities are found to be similar and should be grouped in clusters and in section 5.3 algorithms
to perform the clustering are discussed. In section 5.4 applications of cluster analysis in the literature are discussed and in section 5.5 we discuss how cluster analysis can be applied
to the MR case. In section 5.6 the results of the analysis are discussed and the results are
evaluated in section 5.7. Finally a summary is given in section 5.8.
5.1 Entities to cluster
The entities that are to be clustered are artifacts extracted from source code. However this
can be done in several ways and at different abstraction levels, depending on the desired
clustering result.
Hutchens identifies potential modules by clustering on data-bindings between proce-
dures. [17] Schwanke also identifies potential modules but clusters call dependencies be-
tween procedures and shared features of procedures to come to this abstraction. [27]
Another level of abstraction with respect to the entities to be clustered is presented by van Deursen. He identifies potential objects by clustering highly dependent data record fields in Cobol. He chooses to apply cluster analysis to the usage of record fields, because he assumes that record fields that are related in the implementation are also related in the application domain and therefore should reside in an object. [33]
‘High-level system organizations’ are identified by Mancoridis by clustering modules
and the dependencies between the modules using a clustering tool called Bunch. [20, 19]
Anquetil also identifies abstractions of the architecture but clusters modules based on file
names. [1]
5.2 Similarity measures
When the first question, of what entities are to be clustered, is answered, the next question that arises is: when are two entities to be found similar or dependent? In other words, what similarity measure is used to group similar entities into clusters?
As mentioned above there are different possibilities to define the similarity of entities,
such as data-bindings [17] and call dependencies between entities [27]. Wiggerts divides the similarity measures found in the literature into two groups: [36]
1. Relationships between entities.
2. For each entity the scores for a number of variables.
The first group is essentially based on the notion of: “the more relations between entities
the more similar the entities are”. The entities can be seen as nodes in a graph and the
relations as weighted edges between the nodes. The relations between the entities can be
dynamic or static relations, such as: inheritance/include, call or data-use relations. The
several types of relations can be used together by summing up the relations of the different
kinds. This type of similarity measure is applied by Mancoridis. [20, 19]
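A minimal sketch of this first kind of similarity measure, assuming hypothetical extracted relations, simply sums the different relation types between two entities into one weighted edge:

from collections import defaultdict

relations = [                      # hypothetical extracted relations
    ("A", "B", "call"), ("A", "B", "include"), ("B", "C", "data-use"),
]

edges = defaultdict(int)
for src, dst, _kind in relations:
    edges[(src, dst)] += 1         # every relation adds to the edge weight

# edges == {("A", "B"): 2, ("B", "C"): 1}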
The second group of similarity measures is based on features or attributes of the entities,
such as the presence of a particular construct in the source code of the entities or the degree
of usage of particular constructs in the source code of the entities. Schwanke applies this
type of similarity measure. [27] Also Anquetil who performs cluster analysis based on
source-code file names is an example of this group. [1]
Wiggerts states that “the goal of clustering methods is to extract an existing ‘natural’
cluster structure”. He recognizes that different clustering methods may result in different
clusterings. [36] Using different similarity measures results in different clusterings of the
entities. So the choice of similarity measure determines the result, and one should therefore think carefully about which similarity measure should be used for a particular clustering purpose.
5.3 Cluster analysis algorithms
Now that the first two questions raised by Wiggerts are answered - what entities are to be clustered and how similarity between entities is determined - the next question is raised: how
are similar entities grouped into clusters? The grouping can be achieved by using different
types of algorithms. Wiggerts states that a “particular algorithm may impose a structure
rather than find an existing one”. [36] So not only the choice of similarity measure affects
the final result but also the choice of the algorithm to do the clustering with.
Wiggerts divides the existing algorithms in literature in four categories: [36]
• graph theoretical algorithms
• construction algorithms
• optimization algorithms
• hierarchical algorithms
Graph theoretic algorithms The graph theoretic algorithms work on graphs with the
entities as nodes and relations between the entities as edges. The algorithms basically try
to find special subgraphs in the graph. The finding of these special subgraphs in a graph is
supported by graph theory. [36]
An example of the application of such an algorithm is given by Botafogo. He uses a graph
theoretical algorithm to cluster nodes and links in hypertext structures in more abstract
structures. [4] Gibson applies a graph theoretical algorithm to identify large dense sub-
graphs in a massive graph; a graph showing connections between hosts on the World Wide
Web. [11]
Construction algorithms Construction algorithms assign entities to clusters in one pass.
An example of such a construction algorithm is given by Gitman. [12] He uses fuzzy set theory to derive a clustering from a data set. Another example of such an algorithm is the bisection algorithm discussed by Laszewski. [34] The algorithm is used for dividing points
or entities in a two dimensional plane into a number of partitions containing an approxi-
mately equal number of points. The membership of a point to a partition is determined by
its position in the two dimensional plane.
Optimization algorithms Optimization algorithms work with a function describing the
quality of a partitioning using a similarity measure. The algorithm then tries to improve the
value of the quality describing function by placing entities in other clusters or by merging
or creating clusters.
Mancoridis has shown that the number of possible partitions given a set of entities grows
exponentially with respect to the number of entities. For example 15 entities already result
in 1,382,958,545 distinct partitions. [20] This means that a naive exhaustive algorithm that
calculates the quality value for each possible partition is not applicable to larger sets of
entities.
Doval proposed an algorithm to deal with the exponential growth of possible partitions.
[7] He uses a genetic algorithm and a ‘modularization quality’ function to cluster a graph
consisting of entities as nodes and relations as edges. The algorithm starts with a random
clustering and tries to come to a sub-optimal clustering by applying selection-, mutation-
and crossover-operators on a population of clusterings. The working of genetic algorithms is
out of the scope of this thesis, for more details on genetic algorithms we refer to [20, 19, 7].
Doval demonstrates that the algorithm works for a small case study (20 entities). [7]
Mancoridis uses a genetic algorithm in the tool Bunch to come to a sub-optimal clustering
of given entities. [22] He presented a case study concerning 261 entities and 941 relations
between the entities. The results showed that the approach worked in this case study by
verification using a clustering provided by a system’s expert.
The genetic algorithm in Bunch has several options that can be set by the user, which
can influence the quality of the clustering. The main option is the number of successive
generations of the genetic algorithm, which linearly influences the time the algorithm takes
to finish.
Another optimization algorithm is the hill-climbing technique, a local search algorithm.
This algorithm is also implemented in the Bunch tool and starts with a random partitioning
of a graph consisting of entities as nodes and relations as edges. Using the ‘modularization
quality’ function - as with the genetic algorithm - the hill-climbing algorithm starts improving the random partitioning by systematically rearranging the entities “in an attempt to find
an improved partition with a higher modularization quality”. [22]
As a random starting solution can cause a hill-climbing algorithm to converge to a local optimum, simulated annealing can be used to prevent that. Simulated annealing
“enables the search algorithm to accept, with some probability, a worse variation as the new
solution of the current iteration”. [22] This ‘worse variation’ can be seen as a mutation of
the sub-optimal solution.
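To make this local-search idea concrete, the sketch below shows a generic hill-climbing step with a simulated-annealing style acceptance rule; the quality and neighbours functions are placeholders, and this is not Bunch's actual implementation:

import math, random

def climb(partition, quality, neighbours, temperature=1.0, cooling=0.95,
          rounds=1000):
    best = current = partition
    for _ in range(rounds):
        candidate = random.choice(neighbours(current))   # move one entity
        delta = quality(candidate) - quality(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools down (simulated annealing).
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current = candidate
        if quality(current) > quality(best):
            best = current
        temperature *= cooling
    return best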
Hierarchical algorithms Hierarchical algorithms build hierarchies of clusterings in such a way that “each level contains the same clusters as the first lower level except for two clusters which are joined to form one cluster”. There are two types of such algorithms: [36]
• agglomerative algorithms
• divisive algorithms
Agglomerative hierarchical algorithms start with N clusters, each containing one entity (N is the number of entities). The algorithms then iteratively choose the two clusters that are most similar - using a particular similarity measure - and merge these two clusters into one cluster. At each step two clusters are merged into one cluster, so after N-1 steps all entities are contained in one cluster. Now each level in the hierarchy represents a clustering. [36]
Divisive hierarchical algorithms do it the other way around. The algorithms start with one cluster containing all entities. The algorithms then choose at each step one cluster to split into two clusters. After N-1 steps there are N clusters, each containing one entity. [36]
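A compact sketch of the agglomerative variant, with the similarity between clusters left as an arbitrary function supplied by the caller, could look as follows:

def agglomerate(entities, similarity):
    clusters = [frozenset([e]) for e in entities]    # N singleton clusters
    hierarchy = [list(clusters)]
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest similarity and merge it.
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        hierarchy.append(list(clusters))             # one level per merge
    return hierarchy                                 # N levels in total

# Example with a hypothetical relation-count based similarity:
rel = {("a", "b"): 3, ("b", "c"): 1}
sim = lambda c1, c2: max(rel.get((u, v), 0) + rel.get((v, u), 0)
                         for u in c1 for v in c2)
levels = agglomerate(["a", "b", "c"], sim)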
5.4 Applications of cluster analysis
Cluster analysis has been applied for many different purposes and in many settings. As
mentioned in the previous sections the choice of entity, similarity measure and algorithm
determines the result of the cluster analysis. Examples of applications of cluster analysis
have been given in the previous sections to illustrate these choices. This section will discuss
the applications of cluster analysis with the applicability to the MR case in mind.
Wierda for example uses clustering to “group classes of an existing object-oriented sys-
tem of significant size into subsystems”, based on structural relations between the classes.
He showed with a case study on a system consisting of around 2700 classes that cluster anal-
ysis was applicable in practice for the given problem size using a hill-climbing algorithm
combined with simulated annealing. [35]
Another example in the literature is the research done by Lung. Lung presents an approach to restructure a program based on clustering at the function level and applies cluster analysis to the statements in modules of around 6500 LOC. He states that the approach worked well; however, no time measurements were included in the paper to prove the practical applicability. [18]
Other examples in the literature perform cluster analysis on a far smaller number of entities, but mainly aim at providing evidence of the usefulness of cluster analysis for specific goals.
[20, 19, 21, 33]
5.5 Application to the MR case
In this section we discuss how cluster analysis can be applied to the MR case. First, section 5.5.1 discusses the approach we take, section 5.5.2 discusses how scalability issues are coped with, and finally section 5.5.3 discusses how (intermediate) results are visualized and validated by domain experts.
5.5.1 Approach
In order to discuss the applicability of cluster analysis to the MR case this section tries to
answer the three main questions raised by Wiggerts: [36]
• What are the entities to be clustered?
• When are two entities to be found similar?
• What algorithm do we apply?
The three questions are answered in this section. After answering the three questions, the process of analysis is discussed.
5.5.1.1 Entities: building blocks
In order to define what the entities to be clustered are, the goal of the clustering should be clear, and that is to identify multiple independent archives. This goal resembles the goal set by Mancoridis: identifying high-level system organizations by clustering modules and the relations between the modules. [20, 19]
As entities we initially chose the set of files in the MR archive, as a file can be seen as
a good abstraction of functionality. [5] However, the number of 15000 files gave such scalability problems in early experiments that we chose to go one abstraction level higher. The scalability problems are caused by the fact that the number of partitions grows exponentially with the number of entities to cluster. [20] The next level of abstraction in the MR case archive is the set of 360 building blocks.
5.5.1.2 Similarity measure: relations
The next question is when are two entities to be found similar. The choice of similarity
measure affects the resulting clustering and thus should be chosen by focusing on the desired
result. The desired result is obtaining independent archives from the original code archive.
To achieve independence between the archives there should be minimal relations between the obtained archives, in order to be able to define clear interfaces between the archives. Also, within an archive there should be a high degree of relations between the modules in it, because highly related modules should be grouped into an archive.
The relations between modules are thus the most suited as similarity measure and should be extracted from the current code archive of the MR department. Because of the need to make a distinction between a strong dependency (lots of relations) and a weak one, the similarity measure should measure the number of relations or degree of dependency between modules.
The choice of similarity measure is also guided by the available information at PMS, which consists of static dependencies between entities in the archive. In cooperation with the domain experts at PMS the choice was made to take these static relations as similarity measure for the cluster analysis. Other types of dependencies could also be used as relations, such as dynamic dependencies, version information or file names. [?, 1]
As mentioned before, the MR department has a commercial tool for extracting static
dependencies from the code archive, called Sotograph. The tool can handle the parts of the
current code archive written in C, C++ and GOAL-C.
5.5.1.3 Algorithm: optimization
The last question that needs to be answered is what algorithm to use. The choice of the
algorithm also affects the resulting clustering and therefore with the goal of this research in
mind the algorithm that should be used for the MR case is an optimization algorithm.
The reason is the goal that archives should be obtained from the code archive that can be developed independently. This means, as mentioned before, that there should be a low degree of relations between the archives and a high degree of relations between the modules in an archive. These two statements can be translated into a function that evaluates a clustering and can be used to obtain the optimal clustering based on this function.
We have chosen a genetic algorithm to cope with the exponential growth of the
number of possible partitions. We assume that a genetic algorithm is less likely to get stuck
in local optima than the hill-climbing algorithm.
The reason for this assumption is the fact that with a genetic algorithm mutation is
embedded in the process of selecting what partitions are continued with, which means that
the algorithm can make jumps of varying size in the search space. Also in the algorithm
sub-optimal solutions are recombined (cross-over function), which allows large structured
jumps in the search space of the problem.
Hill-climbing in combination with simulated annealing can also prevent getting stuck at local optima, because simulated annealing is basically an (often small) mutation of the sub-optimal solution. However, no cross-over is provided. In the remainder of this section the theory needed for the optimization algorithm is discussed.
Intra-Connectivity To come to a function that can be used to evaluate a clustering, first we need to translate the notion of the high degree of relations between the modules within an archive. Therefore we adopt the definition of Intra-Connectivity (A) from Mancoridis. [20] In his paper he describes intra-connectivity as the “measurement of connectivity between the components that are grouped together in the same cluster”. The components or modules that are grouped together in the same cluster should be highly dependent. That is why Mancoridis defines the intra-connectivity as follows:

A_i = \frac{\mu_i}{N_i^2}

The intra-connectivity measurement A_i is defined for a cluster i consisting of N_i components with \mu_i intra-edge dependencies. The measurement A_i is a fraction of the maximum number of intra-edge dependencies, because \mu_i is divided by N_i^2. Here N_i^2 is the maximum number of intra-edge dependencies that can exist between N_i modules.
The measurement A_i has values in the range [0, 1]. It has the value 0 when there are no intra-edge dependencies and the value 1 if cluster i consists of components that are fully connected with each other (including themselves).
Inter-Connectivity To translate the notion of the low degree of relations between archives, the definition of Inter-Connectivity (E) from Mancoridis is adopted. [20] Mancoridis describes inter-connectivity as the “measurement of the connectivity between two distinct clusters”. As mentioned before, the degree of dependency between the clusters should be minimized. To be able to minimize this automatically, Mancoridis defines the inter-connectivity as follows:

E_{i,j} = \begin{cases} 0 & \text{if } i = j \\ \frac{\varepsilon_{i,j}}{2 N_i N_j} & \text{if } i \neq j \end{cases}

The inter-connectivity measurement E_{i,j} is defined between clusters i and j, consisting of N_i and N_j components respectively, with \varepsilon_{i,j} inter-edge dependencies. The measurement E_{i,j} is a fraction of the maximum number of inter-edge dependencies possible between two clusters, which is 2 N_i N_j.
The measurement E_{i,j} also has values in the range [0, 1]. It has the value 0 when there are no inter-edge dependencies between the modules of the two clusters and the value 1
when all the modules in each of the two clusters are dependent on all the modules in the other cluster.
Modularization Quality Now the Modularization Quality (MQ) measurement can be de-
fined in order to be able to give a value for how good a particular partitioning is. This is
done using the following formula defined by Mancoridis and integrating the two previous
measures: ‘intra-connectivity’ and ‘inter-connectivity’. [20]
MQ = \begin{cases} \frac{1}{k} \sum_{i=1}^{k} A_i \, - \, \frac{1}{k(k-1)/2} \sum_{i,j=1}^{k} E_{i,j} & \text{if } k > 1 \\ A_1 & \text{if } k = 1 \end{cases}
The formula rewards the creation of highly dependent clusters while penalizing the cre-
ation of too many inter-edges between the clusters. Using this formula, clustered partitions
can be evaluated and compared. The tool Bunch, which uses this formula, is able to auto-
matically obtain (sub-)optimal clustered partitions. [19]
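The definitions of A_i, E_{i,j} and MQ can be summarized in a short sketch. The fragment below follows the unweighted formulas given in this section and is only an illustration, not the (improved, weighted) measurement implemented in Bunch; clusters are lists of modules and deps is a set of directed (source, target) dependency pairs:

def modularization_quality(clusters, deps):
    k = len(clusters)

    def A(c):
        # Intra-connectivity: fraction of the N^2 possible intra-edges present.
        mu = sum((u, v) in deps for u in c for v in c)
        return mu / (len(c) * len(c))

    def E(ci, cj):
        # Inter-connectivity: fraction of the 2*Ni*Nj possible inter-edges present.
        eps = sum(((u, v) in deps) + ((v, u) in deps) for u in ci for v in cj)
        return eps / (2 * len(ci) * len(cj))

    if k == 1:
        return A(clusters[0])
    intra = sum(A(c) for c in clusters) / k
    inter = sum(E(clusters[i], clusters[j])
                for i in range(k) for j in range(i + 1, k)) / (k * (k - 1) / 2)
    return intra - inter

# Example: two clusters with mostly internal dependencies score positively.
deps = {("a", "b"), ("b", "a"), ("c", "d"), ("a", "c")}
print(modularization_quality([["a", "b"], ["c", "d"]], deps))   # 0.25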
Bunch Bunch uses as input a so-called Module Dependency Graph (MDG). Formally, MDG = (M, R) is a graph where M is the set of entities and R ⊆ M × M is the set of ordered pairs ⟨u, v⟩ that represent the source-level dependencies between the entities u and v. [19] This means that the chosen entities (building blocks) and similarity measure (relations) can be mapped to such a graph.
Mitchell and Mancoridis developed an improvement of the MQ measurement previ-
ously discussed, which also supports weighted edges between entities. The improved MQ
measurement has been integrated into the Bunch tool. [23]
Also Bunch provides a feature called ‘user-directed clustering’. When this feature is
enabled, the user is able to “cluster some modules manually, using their knowledge of the
system design while taking advantage of the automatic clustering capabilities of Bunch to
organize the remaining modules”. [19]
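To illustrate what such an MDG amounts to in practice, the sketch below writes a set of hypothetical weighted dependencies as a plain edge list with one "source target weight" line per pair; the exact input format expected by Bunch may differ, so this is an assumption rather than a description of the tool:

# Hypothetical weighted static dependencies between building blocks.
deps = {("blockA", "blockB"): 12, ("blockB", "blockC"): 3}

with open("mr_archive.mdg", "w") as out:
    for (src, dst), weight in sorted(deps.items()):
        # One ordered pair <u, v> per line, together with its weight.
        out.write(f"{src} {dst} {weight}\n")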
5.5.1.4 Process of analysis
Having defined what the entities to be clustered are, what the similarity measure is and what
algorithm to use, the next point of discussion is the process of analysis. As the algorithm
proposed is likely to produce suboptimal partitions there should be a way to iteratively come
to a satisfying solution.
The genetic algorithm proposed by Doval starts with a random initial population, that is
a random partitioning. After that genetic operators are applied to this population for some
maximum number of times. Then, the fittest population is shown, i.e. the population that
has the highest Modularization Quality from all solutions generated by the algorithm. [7]
The main problem with this algorithm is the fact that it starts with a random partitioning, which can be a very bad partitioning. When a bad partitioning is taken as a starting point, the algorithm is likely to find a poor suboptimal solution, which would imply a bad partitioning of the original single archive into multiple archives.
Therefore a good starting point is necessary. This paper presents a process to iteratively
improve the starting point of the genetic algorithm by using the suboptimal solution of the
algorithm and by reviewing the solution with domain experts, such as architects and devel-
opers of the MR department. It follows the extract-abstract-present approach of Symphony [32]; a sketch of this iteration loop is given after the list:
1. Obtain the Module Dependency Graph of the code archive using the commercial tool
Sotograph.
2. Use the genetic algorithm provided by the tool Bunch to obtain a clustering of the
MDG.
3. Review the results with domain experts by visualizing the results.
4. Use newly obtained knowledge of the system as starting solution for the genetic al-
gorithm provided by Bunch. This knowledge is acquired in the previous step where
domain experts review the results.
5. Repeat steps 2 to 4 until satisfying results are obtained, validated by the domain experts and the value of the MQ function.
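As a rough sketch of how this iteration could be driven, the fragment below wires the five steps together; extract_mdg, run_bunch and review_with_experts are hypothetical wrappers around Sotograph, Bunch and the manual review step, not actual interfaces of those tools:

def iterative_clustering(archive, extract_mdg, run_bunch, review_with_experts):
    mdg = extract_mdg(archive)                   # step 1: Sotograph export
    start_solution = None
    while True:
        clustering = run_bunch(mdg, start_solution)                   # step 2
        accepted, validated_merges = review_with_experts(clustering)  # step 3
        if accepted:
            return clustering                    # step 5: validated result
        start_solution = validated_merges        # step 4: reuse expert knowledge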
When new architectural knowledge about the clusters is obtained and the search space of the algorithm is made smaller, the optimization value MQ of the initial solution is improved, meaning that the genetic algorithm can obtain a better optimization.
As mentioned above the process is repeated until satisfying results are obtained. Whether
the results are satisfying is determined by domain experts and this can be seen as valida-
tion of the process. This validation is difficult when results are not visualized. Therefore a
plug-in for the tool Sotograph was written. More details on the visualization can be found in section 5.5.3.
5.5.2 Scalability issues
When applying the process proposed in the previous section, knowledge obtained about the system is used to come to a better clustering solution. However, there are scalability issues concerning this approach.
The search space for the analysis is extremely large, as the number of possible cluster-
ings grows exponentially with the number of building blocks. As such, a first run clustering
of the building blocks will probably result in a clustering from which domain experts cannot
extract usable information.
Therefore we propose to use a leveled approach, already discussed in section 4.6.3 for
FCA. The leveled approach makes use of the hierarchical structure of the archive. The
archive is composed of multiple ‘subsystems’, which consist of multiple ‘building blocks’.
Figure 2.1 shows the hierarchical structuring.
Certain parts of the archive can be analyzed in detail while other parts of the archive are
analyzed at a high level. Results from analysis of one part of the archive can be used for a
next iteration of the analysis.
For example, one subsystem can be analyzed in detail, while other subsystems are ana-
lyzed globally. When a subsystem is analyzed globally, the relations from all underlying
building blocks in the hierarchy are accumulated to be relations from the higher level sub-
system. Also relations from one building block to the underlying building blocks in the
hierarchy are adjusted to the higher level subsystem.
When after an analysis round some building blocks are clustered together, these building
blocks can be merged into one entity, called a ‘merge’. This merge is in fact a fictive building
block. The relations are accumulated to the merge as mentioned above.
An example of a merge is shown in figure 5.1. Here the building blocks ‘platform’, ‘ba-
sicsw’, ‘computeros’ and ‘configuration’ are merged into one new merge, named ‘merge’.
The figure shows that the relations of the named building blocks are accumulated and designated to the new building block ‘merge’. Also the relations to the merged building blocks are changed to relations to the merge.
Figure 5.1: Example of a merge of a group of building blocks
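A minimal sketch of this accumulation step, assuming the dependencies are kept as a weight per ordered pair of building blocks (the actual Analysis Selector implementation may differ), is the following:

from collections import defaultdict

def accumulate(deps, merged_blocks, merge_name="merge"):
    # deps: dict mapping (source, target) to a weight; merged_blocks: set of names.
    new_deps = defaultdict(int)
    for (src, dst), weight in deps.items():
        src = merge_name if src in merged_blocks else src
        dst = merge_name if dst in merged_blocks else dst
        if src != dst:                    # drop relations inside the merge itself
            new_deps[(src, dst)] += weight
    return dict(new_deps)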
The analysis itself is performed using a newly developed tool, called the ‘Analysis Se-
lector’. This tool is also used to enable the leveled approach for FCA. The tool uses the
output of the static dependency analysis of Sotograph. From this output the hierarchy of the
building blocks can also be extracted.
The tool enables the selection of parts of the archive to be analyzed in detail, while other
parts are kept at a higher level. When the user has made his selection the necessary relation
accumulation is performed and a Module Dependency Graph (MDG) is created. This MDG
is used as input for the tool Bunch. This tool actually performs the cluster analysis using
the built in genetic algorithm.
These results can be imported using a plug-in in the Sotograph tool to visualize the
results and get more metrics on the clustering result. More information about this can be
found in section 5.5.3. Domain experts can now select certain parts of the clustering to be
merged for a next analysis step. The parts can be selected in the Analysis Selector, after
which a next round of analysis can start. Figure 5.2 shows how this works.
Figure 5.2: Flow of data scheme of analysis
5.5.3 Visualization and validation
The reviewing of the result of the clustering with domain experts can be aided by visualizing
the results. Early results were visualized with a tool called ‘dotty’, which displays the
clustering of the modules and the dependencies between the modules. [24] However we
have chosen to use Sotograph as visualization tool, because we think that the tool has better
visualization capabilities and also offers several useful metrics, such as LOC of clusters and
dependency metrics. Therefore we developed a plug-in for the Sotograph tool. [31]
The results of the analyses are loaded into the Sotograph tool using the developed plug-
in. Sotograph maintains a database of information extracted from static analysis of the
code archive. This information includes the dependencies between archive entities, size of
entities and many other metrics. The tool enables the user to visualize the clustering, get
more detailed information on the dependencies between the clusters and within the clusters,
see what entities are placed in a particular cluster and retrieve the sizes of the clusters.
Results of the cluster analysis are evaluated by the domain experts at PMS in the tool
and building blocks are selected as merges for a next analysis round, as discussed in the
previous section 5.5.2. Figure 5.2 provides an overview of our approach.
5.6 Results
This section presents the results of the several analyses on the building block hierarchy.
For each analysis the scope of the analysis and the evaluation of the domain experts are
discussed.
Complete building block hierarchy Cluster analysis was performed on the complete
building block hierarchy. The input file given to Bunch consisted of 369 nodes (building
blocks) and 2232 edges (weighted dependencies). The genetic algorithm in Bunch was
executed three times. Figure 5.3 shows the results for this analysis.
Figure 5.3: Results of analysis on complete hierarchy
As can be seen in figure 5.3 the MQ values do not differ significantly, therefore each of
the three results was visualized in Sotograph and was evaluated by domain experts at PMS.
However the domain experts could not extract clusters of building blocks from the three
clusterings to merge for a next analysis round. The reasons for that include:
• The three executions resulted in three completely different clusterings, which made
them hard to compare.
• The resulting clusters differed in size (LOC) significantly, up to a factor 20000.
• On average, around 25% of the clusters contained only one (low-level) building block.
• For some clusters dependencies between building blocks placed in the clusters were
not evident or were not there at all.
One subsystem in detail only This analysis was performed on one subsystem only (plat-
form). All the building blocks were analyzed in detail. The input file given to Bunch
consisted of 121 building blocks as nodes and 409 weighted dependencies as edges. Figure
5.4 shows the results of three executions of the genetic algorithm in Bunch.
Figure 5.4: Results of analysis on one subsystem only
All the three results were visualized in Sotograph. Domain experts have evaluated all
three clusterings. Domain experts at PMS could not extract groupings of building blocks
from the three clusterings for a next analysis round. The reasons for that include:
• The three executions resulted in three completely different clusterings, which made
them hard to compare.
• Big differences in the size of the clusters with respect to LOC, up to a factor of 100.
• For some clusters dependencies between building blocks placed in the clusters were
not evident or were not there at all.
One subsystem in detail This analysis was performed on the hierarchy with one subsys-
tem in detail (platform), while keeping the others at the higher level. The input file given
to Bunch consisted of 138 nodes (building blocks) and 655 edges (weighted dependencies).
Figure 5.5 shows the results of three executions of the genetic algorithm in Bunch.
Figure 5.5: Results of analysis with one subsystem in detail
All the three results were visualized in Sotograph. Domain experts have evaluated all
three clusterings. Domain experts at PMS could not extract groupings of building blocks
from the three clusterings for a next analysis round. The reasons for that include:
• The result was difficult to understand because of the presence of multiple subsystems in one cluster. This cluster contained about 40% of the total amount of LOC of the archive.
• Big differences in the size of the clusters with respect to LOC, up to a factor of 200.
• On average, 20% of the clusters contained only one (low-level) building block.
• For some clusters dependencies between building blocks placed in the clusters were
not evident or were not there at all.
One subsystem in detail - improved start solution This analysis was performed on the
building block hierarchy with one subsystem in detail (platform), and the other subsystems
at the higher level. Each of these higher level subsystems was placed in a cluster to start with, which is enabled in Bunch by giving a so-called ‘user-directed’ clustering input file. The idea behind this is that all the building blocks in the platform subsystem will now be divided among the other subsystems.
The input file for Bunch consisted of 138 building blocks as nodes and 655 weighted
dependencies as edges, together with a file that designated all higher level subsystems to a cluster. Figure
5.6 shows the results of three executions of the genetic algorithm in Bunch.
Figure 5.6: Results of analysis with one subsystem in detail - improved start solution
Domain experts have evaluated the three results. The following points were observed:
• The three executions resulted in three completely different clusterings, which made
them hard to compare.
• On average, 20% of the clusters contained only one (low-level) building block.
• For some clusters dependencies between building blocks placed in the clusters were
not evident or were not there at all.
• The third clustering has a nice distribution of the subsystems over the clusters; no cluster contains multiple subsystems.
Having evaluated the three results, it was decided with the domain experts that the results from the third clustering would be taken for a next analysis round. It was decided to take all clusters containing more than 3 building blocks as a merge for the next analysis. The remaining clusters were seen by the domain experts as not logical, and therefore the building blocks contained in these clusters were used as single entities for the second analysis round. Figure
5.7 shows the results of one execution of the analysis. (We have chosen to perform one
execution, because of lack of time to run and review three executions for each setting at the
end of the project.)
Figure 5.7: Results of analysis with one subsystem in detail - improved start solution, round
2
The input file for Bunch consisted of 42 nodes and 226 edges. When visualizing the
results of the analysis in Sotograph the domain experts observed that there was one huge
cluster containing around 40% of the lines of code of the total archive. This is not desirable for PMS and therefore it was decided not to continue with the analysis of one subsystem while keeping the other subsystems at the highest level.
Two predefined large clusters - improved start solution This analysis was performed
on the building block hierarchy, having 2 predefined large clusters as start solution. This
start solution with the 2 predefined clusters is achieved by merging the building blocks in the
clusters before the analysis is executed. The clusters were defined by domain experts: one
cluster contained mainly image handling software and the other contained mainly scanning
software. The first cluster contains 2 subsystems from the building block hierarchy and the
latter cluster contains 11 subsystems from the building block hierarchy. The remaining 2
subsystems were analyzed in detail.
As a start solution there is already a ‘bipartition’ of the archive. The idea behind the
analysis is to see how the remaining building blocks are divided among the two partitions,
or see how the remaining building blocks are grouped in new partitions. Figure 5.8 shows
the results of one execution of the genetic algorithm.
Figure 5.8: Results of analysis with two predefined large clusters
The input file given to Bunch consisted of 129 building blocks as nodes and 485 weighted
dependencies as edges. Domain experts observed the following items:
• A few building blocks were divided among the two predefined clusters.
• The majority of the building blocks were clustered in 27 clusters
• For some of the 27 clusters dependencies between building blocks placed in the clus-
ters were not evident or were not there at all.
With the domain experts it was decided to continue with the analysis by merging the building blocks that were clustered with the two predefined clusters. The remaining building blocks (in the 27 other clusters) were not merged and were used as single entities for the next analysis round. Figure 5.9 shows the results of one analysis with the genetic algorithm in
Bunch.
Figure 5.9: Results of analysis with two predefined large clusters, round 2
The input file given to Bunch consisted of 109 building blocks as nodes and 385 weighted
dependencies as edges. Domain experts observed the following items:
• A few building blocks were divided among the two predefined clusters.
• The majority of the building blocks were clustered in 18 clusters
• For some of the 18 clusters dependencies between building blocks placed in the clus-
ters were not evident or were not there at all.
Because the building blocks that were clustered with the two predefined clusters had
very low dependencies or even no dependencies on the other building blocks in the clusters,
we decided that continuing with this analysis - by merging the added building blocks to the
two predefined clusters - was not useful. Therefore we decided not to continue with the
analysis on the two predefined clusters.
5.7 Evaluation
In this section the results of section 5.6 are discussed and we reflect on how our proposed approach works.
Cluster analysis on the building block hierarchy did not result in a satisfying clustering that could serve as a basis for the splitting of the archive. The main reason for that is that it is difficult for the domain experts to extract merges from an analysis round when the results are not ‘acceptable’.
The reasons why the intermediate results were not acceptable for the domain experts
are the following:
• Different executions of the algorithm result in different clusterings, which makes it
hard to compare the results
• Multiple clusters contain one building block
• Multiple clusters contain building blocks that are not related
This means that the proposed process with the leveled approach only works if results
from previous analysis rounds are at least to some degree acceptable, that is domain experts
can filter out meaningful clusters.
The fact that the intermediate results could not be used by the domain experts in most
cases is also caused by the genetic algorithm of Bunch. The algorithm produces results
with clusters that highly differ in size (LOC). The system’s architects found that this is not
desirable for the MR case. ’Cluster size equality’ is not a requirement made explicit in this
thesis, however this requirement appeared to be an important criterion for evaluating the
results by the domain experts.
The genetic algorithm also in some cases produced results with a cluster that contained
up to 40% of the size of the archive. This is also not desirable for the MR case.
5.8 Summary
This chapter proposed cluster analysis as a method to obtain multiple independent archives
from the original code archive of the MR case. As entities for the analysis building blocks are taken, static dependencies are used as similarity measure, and a genetic algorithm executes the actual analysis.
Analysis has been performed using the leveled approach on several parts of the archive.
This means that some parts of the archive are analyzed in detail, while other parts are ana-
lyzed at a higher level. By using the leveled approach results on one part of the archive can
be used for a next analysis, where other parts are analyzed in detail.
Domain experts evaluated the results by visualizing them in Sotograph. They could not
filter groupings of building blocks from clusterings that could serve as ‘merges’ for a next
analysis because of the following reasons:
• Different executions of the algorithm result in different clusterings, which makes it
hard to compare the results
• Multiple clusters contain one building block
• Multiple clusters contain building blocks that are not related
The intermediate results of the analysis were not ‘acceptable’ for the domain experts,
and therefore they could not filter ‘merges’ for successive analysis rounds. This means that
no splitting of the archive could be obtained using the proposed approach.
Chapter 6
Conclusions and future work
This thesis aims at ways to split the software archive of MR into multiple archives that can
be developed independently. The framework of Symphony is used to come to a solution
of this problem. Two analysis methods are used within the framework: Formal Concept
Analysis (FCA) and cluster analysis.
In this thesis we have made the following contributions:
• We discussed how FCA and cluster analysis can be applied to a large-scale, non-trivial industrial software archive. In the literature we could not find cases with comparable goals and scale.
• We presented the leveled approach, which makes use of the existing hierarchy in the
software archive, in order to cope with scalability issues that come with applying
FCA and cluster analysis on a large-scale software archive.
• We presented the concept quality, a measure that aids in choosing concepts from a set
of concepts.
Conclusions and future work for FCA are discussed in section 6.1 and for cluster anal-
ysis in section 6.2.
6.1 Formal Concept Analysis
The first analysis method that we have presented is FCA. FCA provides ways to identify sensible groupings of objects that have common attributes. For the context of FCA we defined
building blocks as objects and for the attributes two sets are combined: a set of attributes
indicating that building blocks are highly dependent on each other - using a threshold - and
a set of attributes that represent certain features of the building blocks. These features are
derived from existing project documentation and domain experts.
From this context we derived concepts, which are used to generate concept subpartitions
(CSPs). Each CSP represents a possible partitioning of the object set of building blocks,
which can serve as a basis of defining a splitting of the archive.
The process of analysis works with a leveled approach, which means that some parts
of the building block hierarchy in the archive are analyzed in detail, while other parts are
analyzed at a higher level. Results from previous analyses are used for successive analysis
rounds, where more parts are analyzed in detail. Selecting which parts are used for the successive analysis rounds is aided by the newly defined concept quality.
When looking at the results with the domain experts at PMS, the idea of applying the
leveled approach in combination with the concept quality works well: the number of concepts decreases significantly. However, the resulting number of CSPs is enormous. Therefore
we think that FCA in this setting - obtaining CSPs from the set of concepts, given a context
of reasonable size - is practically not applicable.
FCA is in our opinion very suited for recognizing certain patterns or groupings of ob-
jects based on attributes, but not suited for an analysis that should result in a precise non-
overlapping partitioning of the object set.
6.2 Cluster analysis
The second analysis method is cluster analysis. Cluster analysis is used to group highly
dependent objects in clusters. We have chosen building blocks as entities to cluster. Static
dependencies are used as similarity measure and we use a genetic algorithm to evaluate
clusterings on a clustering quality measure.
The process of analysis works with a leveled approach, meaning that some parts of the
code archive can be analyzed in detail, while other parts of the archive are analyzed at a
higher level. This enables us to use results from previous analyses for successive analyses,
by merging clusters of building blocks that are validated by domain experts.
However the domain experts could not filter groupings of building blocks from clus-
terings that could serve as ‘merges’ for successive analysis rounds. One reason for that is
the fact that different executions of the genetic algorithm resulted in different clusterings,
which made it hard for the domain experts to compare the results.
The other reason is the fact that some clusters in the resulting clusterings were not
acceptable for the domain experts: the size of resulting clusters differed significantly and
some clusters contained only one building block, or contained building blocks that were
hardly or not related.
This means that no splitting of the archive could be obtained using the proposed ap-
proach. However we do think that using this leveled approach is a good way to cope with
the scalability issues, with the remark that intermediate results should be ‘good’ enough to
use for successive analysis rounds.
These ‘good’ intermediate results could be achieved by implementing in the clustering algorithm some of the implicit requirements the domain experts pose on clusterings, for example parameters that can achieve equally sized clusters, or parameters that can set a range of the
number of desired clusters. This might help to input domain knowledge into the analysis
process, which could narrow the search space of the algorithm.
Bibliography
[1] N. Anquetil and T. C. Lethbridge. Recovering software architecture from the names
of source files. Journal of Software Maintenance, 11(3):201–221, 1999.
[2] D. Bojic, T. Eisenbarth, R. Koschke, D. Simon, and D. Velasevic. Addendum to “Locating features in source code”. IEEE Trans. Softw. Eng., 30(2):140, 2004.
[3] J. P. Bordat. Calcul pratique du treillis de galois d’une correspondance. Mathema-
tiques et Sciences Humaines, 96:31–47, 1986.
[4] R. A. Botafogo and B. Shneiderman. Identifying aggregates in hypertext structures.
In HYPERTEXT ’91: Proceedings of the third annual ACM conference on Hypertext,
pages 63–74, New York, NY, USA, 1991. ACM Press.
[5] S. C. Choi and W. Scacchi. Extracting and restructuring the design of large systems.
IEEE Software, 7(1):66–71, 1990.
[6] ConExp. http://sourceforge.net/projects/conexp, 2007.
[7] D. Doval, S. Mancoridis, and B. S. Mitchell. Automatic clustering of software systems
using a genetic algorithm. In STEP ’99: Proceedings of the Software Technology and
Engineering Practice, page 73, Washington, DC, USA, 1999. IEEE Computer Society.
[8] T. Eisenbarth, R. Koschke, and D. Simon. Locating features in source code. IEEE
Transactions on Software Engineering, 29(3):210–224, 2003.
[9] B. Everitt. Cluster Analysis. Heinemann Educational Books, London, 1974.
[10] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations.
Springer-Verlag New York, Inc., Secaucus, NY, USA, 1997.
[11] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive
graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large
data bases, pages 721–732. VLDB Endowment, 2005.
[12] I. Gitman and M.D. Levine. An algorithm for detecting unimodal fuzzy sets and its
application as a clustering technique. IEEE Transactions on Computers, C-19:583–
593, 1970.
[13] R. Godin and H. Mili. Building and Maintaining Analysis-Level Class Hierarchies
Using Galois Lattices. In Proceedings of the OOPSLA ’93 Conference on Object-
oriented Programming Systems, Languages and Applications, pages 394–410. ACM
Press, 1993.
[14] R. Godin, H. Mili, G. W. Mineau, R. Missaoui, A. Arfi, and T. Chau. Design of class
hierarchies based on concept (Galois) lattices. Theory and Practice of Object Systems,
4(2):117–134, 1998.
[15] R. Godin, R. Missaoui, and H. Alaoui. Incremental concept formation algorithms
based on Galois (concept) lattices. Computational Intelligence, 11:246–267, 1995.
[16] C. Hofmeister, R. Nord, and D. Soni. Applied software architecture. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[17] D. H. Hutchens and V. R. Basili. System structure analysis: clustering with data
bindings. IEEE Trans. Softw. Eng., 11(8):749–757, 1985.
[18] C. Lung, X. Xu, M. Zaman, and A. Srinivasan. Program restructuring using clustering
techniques. J. Syst. Softw., 79(9):1261–1279, 2006.
[19] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner. Bunch: A clustering
tool for the recovery and maintenance of software system structures. In ICSM ’99:
Proceedings of the IEEE International Conference on Software Maintenance, pages
50–59, Washington, DC, USA, 1999. IEEE Computer Society.
[20] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner. Using automatic
clustering to produce high-level system organizations of source code. In IWPC ’98:
Proceedings of the 6th International Workshop on Program Comprehension, pages
45–52, Washington, DC, USA, 1998. IEEE Computer Society.
[21] B. S. Mitchell and S. Mancoridis. Comparing the decompositions produced by soft-
ware clustering algorithms using similarity measurements. In IEEE International Con-
ference on Software Maintenance (ICSM), pages 744–753. IEEE Computer Society,
2001.
[22] B. S. Mitchell and S. Mancoridis. On the automatic modularization of software sys-
tems using the bunch tool. IEEE Trans. Softw. Eng., 32(3):193–208, 2006.
[23] B. S. Mitchell and S. Mancoridis. Using heuristic search techniques to extract
design abstractions from source code. In GECCO ’02: Proceedings of the Genetic and
Evolutionary Computation Conference, pages 1375–1382, San Francisco, CA, USA,
2002. Morgan Kaufmann Publishers Inc.
[24] S. C. North and E. Koutsofios. Application of graph visualization. In
Proceedings of Graphics Interface ’94, pages 235–245, Banff, Alberta, Canada, 1994.
[25] IEEE P1471-2000. Recommended practice for architectural description of software
intensive systems. Technical Report IEEE P1471–2000, The Architecture Working
Group of the Software Engineering Committee, Standards Department, IEEE, Piscat-
away, New Jersey, USA, September 2000.
[26] P. Tonella. Concept analysis for module restructuring. IEEE Trans. Softw. Eng.,
27(4):351–363, 2001.
[27] R. W. Schwanke. An intelligent tool for re-engineering software modularity. In ICSE
’91: Proceedings of the 13th international conference on Software engineering, pages
83–92, Los Alamitos, CA, USA, 1991. IEEE Computer Society Press.
[28] M. Siff and T. Reps. Identifying modules via concept analysis. In Proceedings of the
International Conference on Software Maintenance, pages 170–179. IEEE Computer
Society Press, 1997.
[29] G. Snelting. Reengineering of configurations based on mathematical concept analysis.
ACM Transactions on Software Engineering and Methodology, 5(2):146–189, 1996.
[30] G. Snelting and F. Tip. Reengineering class hierarchies using concept analysis. In
SIGSOFT ’98/FSE-6: Proceedings of the 6th ACM SIGSOFT international symposium
on Foundations of software engineering, pages 99–110, New York, NY, USA, 1998.
ACM Press.
[31] Sotograph. http://www.software-tomography.com/html/sotograph.html, 2007.
[32] A. van Deursen, C. Hofmeister, R. Koschke, L. Moonen, and C. Riva. Symphony:
View-driven software architecture reconstruction. In WICSA ’04: Proceedings of the
Fourth Working IEEE/IFIP Conference on Software Architecture (WICSA’04), pages
122–134, Washington, DC, USA, 2004. IEEE Computer Society.
[33] A. van Deursen and T. Kuipers. Identifying objects using cluster and concept
analysis. In ICSE ’99: Proceedings of the 21st international conference on Software
engineering, pages 246–255, Los Alamitos, CA, USA, 1999. IEEE Computer Society
Press.
[34] G. von Laszewski. A Collection of Graph Partitioning Algorithms: Simulated Anneal-
ing, Simulated Tempering, Kernighan Lin, Two Optimal, Graph Reduction, Bisection.
Technical Report SCCS 477, Northeast Parallel Architectures Center at Syracuse Uni-
versity, April 1993.
[35] A. Wierda, E. Dortmans, and L. Somers. Using version information in
architectural clustering - a case study. In CSMR ’06: Proceedings of the Conference
on Software Maintenance and Reengineering, pages 214–228, Washington, DC, USA,
2006. IEEE Computer Society.
[36] T. A. Wiggerts. Using clustering algorithms in legacy systems remodularization. In
WCRE ’97: Proceedings of the Fourth Working Conference on Reverse Engineering
(WCRE ’97), page 33, Washington, DC, USA, 1997. IEEE Computer Society.
[37] R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts.
Ordered sets, Reidel, Dordrecht-Boston, 1982.

the values are in the range [0. Both approaches make use of the hierarchy of the archive. Intuitively when a small number of objects is grouped by a large number of attributes. Because a degree of attributes in a concept and a degree of objects in a concept is measured for each concept. sharing a set of attributes. this indicates a strong grouping of this objects and therefore is assigned a high concept quality value. this intuitively can be seen as a less strong grouping than with a large number of attributes. when a small number of objects share a single attribute.7 Results A grouping of a concept is based on the maximal set of attributes that a group of objects share. As ceiling value the maximum number of objects and attributes respectively are taken given a set of concepts. This means that every static relation between building blocks is counted as a high dependency attribute in the context. with the purpose of decreasing the number of concepts.7. These merges of building blocks are taken as one entity for the next round of analysis. 4. the approach with the project documentation features and the approach with the MR specificity and layering features. So the quality function is also based on the degree of objects in a concept sharing a set of attributes. on the other hand a large number of objects sharing this set as less strong grouping. The results of the analysis using the two approaches are discussed in this section. Given a set of attributes. and therefore is assigned a lower quality value. the two approaches are discussed separately. with no threshold imposed on the relations. Because the two approaches give two different results. when a small number of objects share this set this can intuitively be seen as a strong grouping.100]: Quality(c) = #Attributes MaxOb jects − #Ob jects × × 100 MaxOb jects MaxAttributes Given this quality function all resulting concepts from a context are evaluated and based on the values. concepts are merged into single entities. The maximum number of objects in a set of concepts is called the ‘MaxObjects’ and the maximum number of attributes in a set of attributes is called the ‘MaxAttributes’. Conversely. Now we define the concept quality for a concept is as follows.7 Results The analysis has been performed for the two approaches.1 Project documentation features Analysis has been performed on the complete building block structure. 4. the lower the resulting quality function value is of the specific concept. A similar degree of quality is devised for the number of objects. This is combined with the information 23 . a ceiling value is defined.Formal Concept Analysis • The relative number of attributes of a concept • The relative number of objects of a concept 4. The smaller the number of objects.

In figure 4. Therefore no information is available to group these objects. This degree of influence of the high dependency attributes can be decreased by using a threshold on the static relations. Figure 4.7 shows the number of concepts plotted against the number of analysis rounds. This can be explained by the fact that the features extracted from existing project documentation covered around 30% of the complete set of around 360 building blocks. The number of concepts is too big to create concept subpartitions (CSPs) from. When the resulting (intermediate) concepts were discussed with the domain experts.7 Results Formal Concept Analysis extracted from the project documentation. Figure 4. we could be seen that the objects of the concepts were mainly grouped on the high dependency attributes in the context.7: Results of analysis of FCA with features extracted from project documentation The full context results in 3031 concepts. calculating CSPs from this set of concepts resulted in more than 10 million CSPs.4.2 MR-specificity and layering features This section presents the results from analysis on the context with the MR-specificity and layering features.7 can be seen that the number of concepts decreases over the successive analysis rounds. So when grouping objects in the defined context by concepts the main factor of grouping is defined by the dependencies. For this reason we have chosen not to continue the analysis with a threshold with features extracted from project documentation and we decided not to apply FCA with these features on parts on the hierarchy. 4. so these objects will not be grouped in concepts. After 25 rounds the number of concepts has decreased to 379. However. 24 .7. but this also means that there will be more objects in the context that have no attributes assigned to.

This analysis has no threshold imposed on the static dependencies. Figure 4. without any threshold imposed on the static dependencies. So concepts were chosen to be merged each round. this number of concepts is too big to create CSPs from.8 shows that the number of concepts decrease over the number of analysis rounds. After 6 analysis rounds the number of concepts has decreased to 378. Figure 4. Calculating CSPs from this set of concepts also resulted in more than 10 million CSPs. After 27 rounds the number of concepts has decreased to 351.9 shows a decrease of the number of concepts over the successive analysis rounds. Calculating CSPs from this set of concepts resulted in more than 10 million CSPs. 25 .Formal Concept Analysis 4. The full context results in 719 concepts. By merging concepts using the concept quality measure. but now with a threshold of 25 imposed on the static dependencies. Figure 4. Figure 4.7 Results Full building block structure The first results of the analysis with the MR-specificity and layering features are from analysis of the complete hierarchy of building blocks. Figure 4. The following results are from the analysis on the complete hierarchy of building blocks with the MR-specificity and layering features. used in the context.11 shows the number of resulting CSPs from the last five analysis rounds. There are 458 resulting concepts from the defined context.8: Results of analysis of FCA with MR-specificity and layering features The full context results in 3011 concepts. Similar to the analysis on the project documentation features. This number of concepts is too big to create CSPs from. Therefore concept were chosen to be merged each round.with the MR-specificity and layering features.called ‘platform’ .10 shows the results of analysis on one subsystem .8 shows the results. One subsystem in detail Figure 4. after 30 analysis rounds there are 490000 CSPs resulting from the set of 95 concepts.

9: Results of analysis of FCA with MR-specificity and layering features. The results of this analysis can be seen in figure 4.4.10: Results of analysis of FCA with MR-specificity and layering features on one subsystem The same analysis has been performed with two different thresholds imposed on the static relations: one of 25 and one of 50. The left figure shows the results with a threshold of 25 and the right figure shows the results with a threshold of 50. For the analysis with the 25 threshold after 7 rounds there was a set of 311 concepts. with imposed threshold of 25 Figure 4.12. The analysis with the threshold of 25 imposed starts with a context that results in 338 concepts. the analysis with the threshold of 50 imposed starts with a context that results in 287 concepts. for the analysis with the 50 threshold there was a set of 269 concepts.7 Results Formal Concept Analysis Figure 4. 26 .

when each combination of two concepts in this set of concepts has an intersection of their objects set that is empty.12: Results of analysis of FCA with MR-specificity and layering features on one subsystem. However the number of CSPs is far too big for domain experts to evaluate. Indeed the number of resulting CSPs decreases. For example. if we consider both the analyses on the complete building block hierarchy. When we look at the results of both analyses on the complete building block structure we can see that indeed the number of concepts decrease by applying the leveled approach. The huge number of CSPs can be explained by the degree of overlap of the objects in a resulting set of concepts. As each CSP represents a possible partitioning of the set of building blocks. Also we observed a lot of concepts in the sets with a small number of objects. Concepts 27 . the resulting number of CSPs from this set grows exponentially. Expected was that by decreasing the number of concepts over the several analysis rounds also the number of concept subpartitions (CSPs) would decrease.8 Evaluation Figure 4. with imposed threshold of 25 (left) and 50 (right) 4.8 Evaluation In this section the results of the previous section 4.11: Resulting concept subpartitions from the set of concepts of the analysis on one subsystem with no threshold Figure 4.Formal Concept Analysis 4. all the resulting CSPs have to be evaluated by domain experts. aided by the concept quality values.7 are discussed and is reflected how our proposed approach works.

9 Summary We proposed FCA in this chapter as an analysis method to obtain a splitting of the archive for the MR case. considering the decreasing number of concepts. enables us to lower the number of resulting concepts from the context. More combinations of concepts result in more CSPs. FCA provides ways to identify sensible groupings of objects that have common attributes. for example other features. 4. However.000 CSPs. After evaluating the results we see that the leveled approach. So we think that the leveled approach works. The process of analysis works with a leveled approach. However the resulting number of concept subpartitions from the number of concepts are enormous. which can serve as a basis for defining a splitting of the archive. These features are derived from existing project documentation and domain experts. 28 . in combination with the concept quality. while other parts are analyzed at a higher level. which are used to generate concept subpartitions. We base that on the mathematical definition of the CSP. we do not expect to get a significant smaller amount of concept subpartitions if we choose to use different attributes in the starting context. Selecting what parts are used for successive analysis rounds are aided by the newly defined concept quality.4. or a more detailed scale for the MR-specificity and layering features. Each concept subpartition represents a possible partitioning of the object set of building blocks.is in our opinion not applicable to a problem of the size of the MR case. which means that some parts of the building block hierarchy in the archive are analyzed in detail. For the context of FCA building blocks are taken as objects and for the attributes two sets are combined: a set of attributes indicating that building blocks are highly related and a set of attributes that represent certain features of the building blocks.9 Summary Formal Concept Analysis with a small number of building blocks as the object set have a high chance that they can be combined with other concepts. But if we consider an amount of around 100 concepts and a resulting number of around 500. Therefore we conclude that the approach of concept subpartitions from a set of concepts is not suited for a problem of the size of the MR case. which enables small sets of concepts resulting in enormous amounts of CSPs.in order to obtain partitionings of the set of building blocks . as then the chance of an overlap with building blocks in the other concepts is small. Results from a previous analysis are used for a next analysis where more parts are analyzed in detail. From this context concepts are derived. the approach of deducing CSPs from a set of concepts .

In section 5. Hutchens identifies potential modules by clustering on data-bindings between procedures. To give a more formal definition of the term cluster: “continuous regions of space containing a relatively high density of points.1 Entities to cluster The entities that are to be clustered are artifacts extracted from source code.3 algorithms to perform the clustering are discussed. [27] 29 . What are the entities to be clustered? 2.1 discusses the entities to be clustered.8. Finally a summary is given in section 5. section 5. [9] When applying cluster analysis three main questions need to be answered according to Wiggerts: [36] 1.5 we discuss how cluster analysis can be applied to the MR case. depending on the desired clustering result. [17] Schwanke also identifies potential modules but clusters call dependencies between procedures and shared features of procedures to come to this abstraction.7. What algorithm do we apply? To discuss what cluster analysis is we follow Wiggerts by answering the three raised questions.2 discusses when entities are found to be similar and should be grouped in clusters and in section 5. Cluster analysis can be used to analyze a system for clusters. The group of the highly dependent objects are called clusters.Chapter 5 Cluster Analysis This chapter discusses cluster analysis as an analysis method to split the software archive for the MR case. Section 5. When are two entities to be found similar? 3. The basics are essentially to group highly dependent objects. 5.4 applications of cluster analysis in the literature is discussed and in section 5. However this can be done in several ways and at different abstraction levels. In section 5. separated from other such regions by regions containing a relatively low density of points”.6 the results of the analysis are discussed and the results are evaluated in section 5.

because he assumes that record fields that are related in the implementation are also related in the application domain and therefore should reside in a object. such as data-bindings [17] and call dependencies between entities [27].5. call or data-use relations. [20. 5. This type of similarity measure is applied by Mancoridis. [36] Using different similarity measures results in different clusterings of the entities. [1] 5. [27] Also Anquetil who performs cluster analysis based on source-code file names is an example of this group. For each entity the scores for a number of variables. such as: inheritance/include.2 Similarity measures Cluster Analysis Another level of abstraction with respect to the entities to be clustered is presented by van Deursen. 19] Anquetil also identifies abstractions of the architecture but clusters modules based on file names. the next question is raised: How 30 . Relationships between entities. what entities are to be clustered and how similarity between entities is determined.2 Similarity measures When the first question is answered of what entities are to be clustered the next question that raises is when are two entities to be found similar or dependent? In other words what similarity measure is used to group similar entities in clusters? As mentioned above there are different possibilities to define the similarity of entities. The relations between the entities can be dynamic or static relations. The several types of relations can be used together by summing up the relations of the different kinds. 2. Van Deursen identifies potential objects by clustering highly dependent data records fields in Cobol. So the choice of similarity measure determines the result and in that way one should think carefully what similarity measure should be used for a particular clustering purpose. The first group is essentially based on the notion of: “the more relations between entities the more similar the entities are”. Wiggerts divides the similarity measure found in literature in two groups: [36] 1.3 Cluster analysis algorithms Now the first two questions raised by Wiggerts are answered. Van Deursen chooses to apply cluster analysis to the usage of record fields. Schwanke applies this type of similarity measure. [20. The entities can be seen as nodes in a graph and the relations as weighted edges between the nodes. He recognizes that different clustering methods may result in different clusterings. [33] ‘High-level system organizations’ are identified by Mancoridis by clustering modules and the dependencies between the modules using a clustering tool called Bunch. such as the presence of a particular construct in the source code of the entities or the degree of usage of particular constructs in the source code of the entities. 19] The second group of similarity measures is based on features or attributes of the entities. [1] Wiggerts states that “the goal of clustering methods is to extract an existing ‘natural’ cluster structure”.

mutation31 . Mancoridis has shown that the number of possible partitions given a set of entities grows exponentially with respect to the number of entities. a graph showing connections between hosts on the World Wide Web. [34]. For example 15 entities already result in 1. [36] So not only the choice of similarity measure affects the final result but also the choice of the algorithm to do the clustering with. The algorithm then tries to improve the value of the quality describing function by placing entities in other clusters or by merging or creating clusters. [36] An example of the application of such algorithm is given by Botafogo. [4] Gibson applies a graph theoretical algorithm to identify large dense subgraphs in a massive graph. He uses a graph theoretical algorithm to cluster nodes and links in hypertext structures in more abstract structures.545 distinct partitions.3 Cluster analysis algorithms are similar entities grouped into clusters? The grouping can be achieved by using different types of algorithms. Wiggerts states that a “particular algorithm may impose a structure rather than find an existing one”. The membership of a point to a partition is determined by its position in the two dimensional plane. [7] He uses a genetic algorithm and a ‘modularization quality’ function to cluster a graph consisting of entities as nodes and relations as edges. Optimization algorithms Optimization algorithms work with a function describing the quality of a partitioning using a similarity measure. [12] He uses fuzzy set theory to derive a clustering from a data set. The algorithms basically try to find special subgraphs in the graph. [20] This means that a naive exhaustive algorithm that calculates the quality value for each possible partition is not applicable to larger sets of entities. [11] Construction algorithms Construction algorithms assign entities to clusters in one pass.Cluster Analysis 5. The algorithm is used for dividing points or entities in a two dimensional plane into a number of partitions containing an approximately equal number of points. The finding of these special subgraphs in a graph is supported by graph theory. Another example of such algorithm is the bisection algorithm discussed by Laszewski.958. The algorithm starts with a random clustering and tries to come to a sub-optimal clustering by applying selection-.382. Doval proposed an algorithm to deal with the exponential growth of possible partitions. An example of such construction algorithm is given by Gitman. Wiggerts divides the existing algorithms in literature in four categories: [36] • graph theoretical algorithms • construction algorithms • optimization algorithms • hierarchical algorithms Graph theoretic algorithms The graph theoretic algorithms work on graphs with the entities as nodes and relations between the entities as edges.

a local search algorithm. similarity measure and algorithm 32 . After N-1 steps there are N clusters each containing one entity. This algorithm is also implemented in the Bunch tool and starts with a random partitioning of a graph consisting of entities as nodes and relations as edges. The algorithms start with one cluster containing all entities. [7] Mancoridis uses a genetic algorithm in the tool Bunch to come to a sub-optimal clustering of given entities. The working of genetic algorithms is out of the scope of this thesis. The results showed that the approach worked in this case study by verification using a clustering provided by a system’s expert. Now each level in the hierarchy represents a clustering. Doval demonstrates that the algorithm works for a small case study (20 entities). which linearly influences the time the algorithm takes to finish.using a particular similarity measure and merge the two clusters in one cluster. The genetic algorithm in Bunch has several options that can be set by the user.5. [36] Divisive hierarchical algorithms do it the other way around. a worse variation as the new solution of the current iteration”. 19.the hill-climbing algorithm start improving a random partitioning by systematically rearranging the entities “in an attempt to find an improved partition with a higher modularization quality”. Hierarchical algorithms Hierarchical algorithms build hierarchies of clusterings in such way that “each level contains the same clusters as the first lower level except for two cluster which are joined to form one cluster”. [36] 5. The algorithms then choose at each step one cluster to split in two clusters. The algorithms then iteratively choose two clusters that are most similar . The main option is the number of successive generations of the genetic algorithm. There are two types of such algorithms: [36] • agglomerative algorithms • divisive algorithms Agglomerative hierarchical algorithms start with N clusters containing each one entity (N is number of entities). which can influence the quality of the clustering. for more details on genetic algorithms we refer to [20. with some probability. Each step two clusters are merged into one cluster.4 Applications of cluster analysis Cluster Analysis and crossover-operators on a population of clusterings. Another optimization algorithm is the hill-climbing technique. Simulated annealing “enables the search algorithm to accept. After N-1 steps all entities are contained in one cluster. [22] This ‘worse variation’ can be seen as a mutation of the sub-optimal solution. 7].4 Applications of cluster analysis Cluster analysis has been applied for many different purposes and in many settings.as with the genetic algorithm . Using the ‘modularization quality’ function . [22] He presented a case study concerning 261 entities and 941 relations between the entities. [22] As a random starting solution can cause that a hill-climbing algorithm converges to a local optimum. simulated annealing can be used to prevent that. As mentioned in the previous sections the choice of entity.

5. 19] 33 . [35] Another example in literature is the research done by Lung. He showed with a case study on a system consisting of around 2700 classes that cluster analysis was applicable in practice for the given problem size using a hill-climbing algorithm combined with simulated annealing. however no time measurements were included in the paper to prove the practical applicability. 19. This section will discuss the applications of cluster analysis with the applicability to the MR case in mind.3 is discussed how (intermediate) results are visualized and validated by domain experts. 33] 5. [20.Cluster Analysis 5.5 Application to the MR case determines the result of the cluster analysis.5.1 Entities: building blocks In order to define what the entities are. but mainly aim at providing evidence of usefulness of cluster analysis for specific goals.1 Approach In order to discuss the applicability of cluster analysis to the MR case this section tries to answer the three main questions raised by Wiggerts: [36] • What are the entities to be clustered? • When are two entities to be found similar? • What algorithm do we apply? The three questions are answered in this section.5 Application to the MR case In this section we discuss how cluster analysis can be applied to the MR case. 5. Lung states that the approach worked well.5.1 the approach we take is discussed. [20. This goal resembles the goal set by Mancoridis: Identifying high-level system organizations by clustering modules and the relations between the modules. Examples of applications of cluster analysis have been given in the previous sections to illustrate these choices. Wierda for example uses clustering to “group classes of an existing object-oriented system of significant size into subsystems”. Lung presents an approach to restructure a program based on clustering at the function level and applies cluster analysis on the statements in modules of around 6500 LOC. that are to be clustered the goal of the clustering should be clear and that is to identify multiple independent archives.5. 5. [18] Other examples in literature perform cluster analysis on a far lesser amount of entities. in section 5. Also after answering the three questions the process of analysis is discussed.1. First in section 5.2 is discussed how scalability issues are coped with and finally in section 5.5. 21. based on structural relations between the classes.

We assume that a genetic algorithm is less likely to get stuck in local optima than the hill-climbing algorithm.5.5. The reason why is because of the goal that archives should be obtained from the code archive that can be developed independently. as a file can be seen as a good abstraction of functionality. the number of 15000 files gave such scalability problems in early experiments.1.5 Application to the MR case Cluster Analysis As entities we initially chose the set of files in the MR archive. C++ and GOAL-C. The scalability problems are caused by the fact that the number of partitions grow exponentially with the number of entities to cluster. The choice of similarity measure is also guided by the available information at PMS. called Sotograph. The desired result is obtaining independent archives from the original code archive. The tool can handle the parts of the current code archive written in C. that we chose to go one abstraction level higher. We have chosen for a genetic algorithm to cope with the exponential growth of the number of possible partitions. These two statements can be translated in a function that evaluates a clustering and can be used to obtain the optimal clustering based on this function.5. In cooperation with the domain experts at PMS the choice was made to take these static relations as similarity measure for the cluster analysis. This means as mentioned before that there should be a low degree of relations between the archives and a high degree of relations between the modules in an archive. Also within an archive there should be a high degree of relations between the modules in it. 5.2 Similarity measure: relations The next question is when are two entities to be found similar. 1] As mentioned before. 5.1. 34 . because the highly related modules should be grouped into an archive. [5] However. the MR department has a commercial tool for extracting static dependencies from the code archive. which are static dependencies between entities in the archive. Other types of dependencies could also be used as relations such as dynamic dependencies and version information or file names. [?.3 Algorithm: optimization The last question that needs to be answered is what algorithm to use. [20] The next level of abstract in the MR case archive is the set of 360 building blocks. Because of the need to make a distinction between a strong dependency (lots or relations) and a weak one the similarity measure should measure the number of relations or degree of dependency between modules. The choice of the algorithm also affects the resulting clustering and therefore with the goal of this research in mind the algorithm that should be used for the MR case is an optimization algorithm. The relations between modules are thus the most suited as similarity measure and should be extracted from the current code archive of the MR department. To achieve independency between the archives there should be minimal relations between the obtained archives in order to be able to define clear interfaces between the archives. The choice of similarity measure affects the resulting clustering and thus should be chosen by focusing on the desired result.

The components or modules that are grouped together in the same cluster should be highly dependent. It has the value 0 when there are no inter-edge dependencies between the modules of the two clusters and the value 1 35 . j between clusters i and j consists of Ni and N j components respectively. 1]. As mentioned before the degree of dependency between the clusters should be minimized. Intra-Connectivity To come to a function that can be used to evaluate a clustering first we need to translate the notion of the low degree of relations between archives. j is a fraction of the maximum number of inter-edge dependencies possible between two clusters which is 2Ni N j . It has the value 0 when there are no intra-edge dependencies and the value 1 if cluster i consists of components that are fully connected which each other (including themselves).5 Application to the MR case The reason for this assumption is the fact that with a genetic algorithm mutation is embedded in the process of selecting what partitions are continued with. The measurement Ei. j = 0 εi. j also has values in the range [0. [20] In his paper he describes intra-connectivity as the “measurement of connectivity between the components that are grouped together in the same cluster”. The measurement Ei. which allows large structured jumps in the search space of the problem. That is why Mancoridis defines the intra-connectivity as follows: Ai = µi Ni2 The intra-connectivity measurement Ai of cluster i consists of Ni components and µi intra-edge dependencies. However no cross-over is provided. The measurement Ai has values in the range [0. with εi. Further in this section the theory needed for the optimization algorithm is discussed. j 2Ni N j if i = j if i = j The inter-connectivity measurement Ei. which means that the algorithm can make jumps of varying size in the search space. j inter-edge dependencies. To be able to minimize this automatically Mancoridis defines the inter-connectivity as follows: Ei. because µi is divided by Ni2 . Therefore we adopt the definition of Intra-Connectivity (A) from Mancoridis. Also in the algorithm sub-optimal solutions are recombined (cross-over function). Inter-Connectivity To translate the notion of a high degree of relation between modules in an archive the definition of Inter-Connectivity (E) from Mancoridis is adapted.Cluster Analysis 5. [20]. The measurement Ai is a fraction of the maximum number of intra-edge dependencies. because simulated annealing is basically a (often small) mutation on the sub-optimal solution. Mancoridis describes inter-connectivity as the “measurement of the connectivity between two distinct clusters. Hill-climbing in combination with simulated annealing can also prevent getting stuck at local optima. 1]. Here Ni2 is the maximum number of intra-edge dependencies that can exist between Ni modules.

using their knowledge of the system design while taking advantage of the automatic clustering capabilities of Bunch to organize the remaining modules”. the algorithm is likely to find a suboptimal solution. which would imply a bad partitioning of the multiple archives solution. Mitchell and Mancoridis developed an improvement of the MQ measurement previously discussed. what the similarity measure is and what algorithm to use. Then. As the algorithm proposed is likely to produce suboptimal partitions there should be a way to iteratively come to a satisfying solution. is able to automatically obtain (sub-)optimal clustered partitions. clustered partitions can be evaluated and compared. When as a starting point a bad partitioning is taken. [19] This means that the chosen entities (files) and similarity measure (relation) can be mapped to such graph.5 Application to the MR case Cluster Analysis when all the modules in the two clusters are dependent on all the other modules in the other clusters.4 Process of analysis Having defined what the entities to be clustered are. v that represent the source-level dependencies between the entities u and v. The improved MQ measurement has been integrated into the Bunch tool. i. which uses this formula. Using this formula. j if k > 1 i. the next point of discussion is the process of analysis. [7] The main problem with this algorithm is the fact that it starts with a random partitioning which can be a very bad partitioning. After that genetic operators are applied to this population for some maximum number of times. Modularization Quality Now the Modularization Quality (MQ) measurement can be defined in order to be able to give a value for how good a particular partitioning is.1. [23] Also Bunch provides a feature called ‘user-directed clustering’. Formally. MDG = (M. The tool Bunch. [20] MQ = 1 k k ∑i=1 Ai − 1 k(k−1) 2 ∑k j=1 Ei. the user is able to “cluster some modules manually.e. [19] 5. that is a random partitioning. This is done using the following formula defined by Mancoridis and integrating the two previous measures: ‘intra-connectivity’ and ‘inter-connectivity’. the population that has the highest Modularization Quality from all solutions generated by the algorithm. The genetic algorithm proposed by Doval starts with a random initial population. R) is a graph where M is the set of entities and R ⊆ M × M is the set of ordered pairs u.5. the fittest population is shown.5. [19] Bunch Bunch uses as input a so called Module Dependency Graph (MDG). originating from the original single archive. if k = 1 A1 The formula rewards the creation of highly dependent clusters while penalizing the creation of too many inter-edges between the clusters. 36 . which also supports weighted edges between entities. When this feature is enabled.

Cluster Analysis 5. 5. the optimization value MQ of the initial solution is improved. When new architectural knowledge about the clusters is obtained and the search space of the algorithm is made smaller. as the number of possible clusterings grows exponentially with the number of building blocks. 3. This paper presents a process to iteratively improve the starting point of the genetic algorithm by using the suboptimal solution of the algorithm and by reviewing the solution with domain experts. As mentioned above the process is repeated until satisfying results are obtained. which consist of multiple ‘building blocks’. such as architects and developers of the MR department. Therefore a plug-in for the tool Sotograph is written. However there are scalability issues concerning this approach.2 Scalability issues When applying the process proposed in the previous section obtained knowledge of the system is used to come to a better clustering solution. Whether the results are satisfying is determined by domain experts and this can be seen as validation of the process. As such. validated by the domain experts and the value of the MQ function. a first run clustering of the building blocks will probably result in a clustering from which domain experts cannot extract usable information.5. Use the genetic algorithm provided by the tool Bunch to obtain a clustering of the MDG. Use newly obtained knowledge of the system as starting solution for the genetic algorithm provided by Bunch. 2. It follows the extract-abstract-present approach of Symphony: [32] 1. Certain parts of the archive can be analyzed in detail while other parts of the archive are analyzed at a high level. Review the results with domain experts by visualizing the results.1 shows the hierarchical structuring.3 for FCA.3 5. Results from analysis of one part of the archive can be used for a next iteration of the analysis. Figure 2. The archive is composed of multiple ‘subsystems’. This knowledge is acquired in the previous step where domain experts review the results.5. already discussed in section 4. This validation is difficult when results are not visualized.5 Application to the MR case Therefore a good starting point is necessary. The leveled approach makes use of the hierarchical structure of the archive. 37 . Repeat steps 2 till 4 until satisfying results are obtained. meaning that the genetic algorithm can obtain a better optimization.6. Therefore we propose to use a leveled approach. The search space for the analysis is extremely large. Obtain the Module Dependency Graph of the code archive using the commercial tool Sotograph. More details on the visualization can be found in section 5. 4.

Here the building blocks ‘platform’. ‘computeros’ and ‘configuration’ are merged into one new merge. The parts can be selected in the Analysis Selector. while other subsystem are analyzed globally.1. Figure 5. named ‘merge’. Figure 5. the relations from all underlying building blocks in the hierarchy are accumulated to be relations from the higher level subsystem. When the user has made his selection the necessary relation accumulation is performed and a Module Dependency Graph (MDG) is created. Also relations from one building block to the underlying building blocks in the hierarchy are adjusted to the higher level subsystem. called a ‘merge’. while other parts are kept at a higher level. This merge is in fact a fictive building block. after which a next round of analysis can start. called the ‘Analysis Selector’. 38 . Domain experts can now select certain parts of the clustering to be merged for a next analysis step. The relations are accumulated to the merge as mentioned above.5. More information about this can be found in section 5. ‘basicsw’.5 Application to the MR case Cluster Analysis For example one subsystem can be analyzed in detail. This tool is also used to enable the leveled approach for FCA. An example of a merge is shown in figure 5. The tool uses the output of the static dependency analysis of Sotograph. This MDG is used as input for the tool Bunch. these building blocks can be merged into one entity.3.5. This tool actually performs the cluster analysis using the built in genetic algorithm. In the figure can be seen that the relations of the named building blocks are accumulated and designated to the new building block ‘merge’. Also the relations to the building blocks are changed to relations to the merge. These results can be imported using a plug-in in the Sotograph tool to visualize the results and get more metrics on the clustering result. The tool enables the selection of parts of the archive to be analyzed in detail. When after an analysis round some building blocks are clustered together. When a subsystem is analyzed globally.2 shows how this works. From this output the hierarchy of the building blocks can also be extracted.1: Example of a merge of a group of building blocks The analysis itself is performed using a newly developed tool.

5. This information includes the dependencies between archive entities.5. as discussed in the previous section 5. [31] The results of the analyses are loaded into the Sotograph tool using the developed plugin. 39 . size of entities and many other metrics.2: Flow of data scheme of analysis 5.5. get more detailed information on the dependencies between the clusters and within the clusters. Early results were visualized with a tool called ‘dotty’. Therefore we developed a plug-in for the Sotograph tool.6 Results Figure 5. The genetic algorithm in Bunch was executed three times. Results of the cluster analysis are evaluated by the domain experts at PMS in the tool and building blocks are selected as merges for a next analysis round. see what entities are placed in a particular cluster and retrieve the sizes of the clusters. The tool enables the user to visualize the clustering.2 provides an overview of our approach. Complete building block hierarchy Cluster analysis was performed on the complete building block hierarchy. Figure 5. For each analysis the scope of the analysis and the evaluation of the domain experts are discussed. Figure 5. Sotograph maintains a database of information extracted from static analysis of the code archive.2.3 shows the results for this analysis. [24] However we have chosen to use Sotograph as visualization tool.3 Visualization and validation The reviewing of the result of the clustering with domain experts can be aided by visualizing the results. because we think that the tool has better visualization capabilities and also offers several useful metrics. such as LOC of clusters and dependency metrics. which displays the clustering of the modules and the dependencies between the modules.Cluster Analysis 5. The input file given to Bunch consisted of 369 nodes (building blocks) and 2232 edges (weighted dependencies).6 Results This section presents the results of the several analyses on the building block hierarchy.

40 . • Big difference in size of the clusters with respect to LOC. which made them hard to compare.6 Results Cluster Analysis Figure 5. Domain experts have evaluated all three clusterings. • Average of around 25% of the clusters contained one (low-level) building block. Domain experts at PMS could not extract groupings of building blocks from the three clusterings for a next analysis round.4: Results of analysis on one subsystem only All the three results were visualized in Sotograph. All the building blocks were analyzed in detail. One subsystem in detail only This analysis was performed on one subsystem only (platform). therefore each of the three results was visualized in Sotograph and was evaluated by domain experts at PMS.3: Results of analysis on complete hierarchy As can be seen in figure 5. up to a factor 20000. The reasons for that include: • The three executions resulted in three completely different clusterings. Figure 5. Figure 5. up to factor 100 • For some clusters dependencies between building blocks placed in the clusters were not evident or were not there at all. which made them hard to compare.5. The input file given to Bunch consisted of 121 building blocks as nodes and 409 weighted dependencies as edges.3 the MQ values do not differ significantly. • For some clusters dependencies between building blocks placed in the clusters were not evident or were not there at all.4 shows the results of three executions of the genetic algorithm in Bunch. The reasons for that include: • The three executions resulted in three completely different clusterings. • The resulting clusters differed in size (LOC) significantly. However the domain experts could not extract clusters of building blocks from the three clusterings to merge for a next analysis round.

is that now all the building block in the platform subsystem will be divided among the other subsystems. • Big difference in size of the clusters with respect to LOC. The idea behind this.5 shows the results of three executions of the genetic algorithm in Bunch.5: Results of analysis with one subsystem in detail All the three results were visualized in Sotograph.6: Results of analysis with one subsystem in detail .improved start solution This analysis was performed on the building block hierarchy with one subsystem in detail (platform). One subsystem in detail .Cluster Analysis 5. The input file given to Bunch consisted of 138 nodes (building blocks) and 655 edges (weighted dependencies). Domain experts have evaluated all three clusterings. The following points were observed: 41 . Figure 5.improved start solution Domain experts have evaluated the three results. This cluster contained about 40% of the total amount of LOC of the archive. Figure 5. The reasons for that include: • Difficulty understanding the result of presence of multiple subsystems in one cluster. which is enabled by Bunch by giving a so called ‘user-directed’ clustering input file.6 shows the results of three executions of the genetic algorithm in Bunch. while keeping the others at the higher level. and the other subsystems at the higher level. Figure 5.6 Results One subsystem in detail This analysis was performed on the hierarchy with one subsystem in detail (platform). up to factor 200 • Average of 20% of the clusters contained one (low-level) building block • For some clusters dependencies between building blocks placed in the clusters were not evident or were not there at all. Each of these higher level subsystems were placed in a cluster to start with. The input file for Bunch consisted of 138 building blocks as nodes and 655 weighted dependencies as edges and a file designated all higher level subsystems to a cluster. Figure 5. Domain experts at PMS could not extract groupings of building blocks from the three clusterings for a next analysis round.

The remaining clusters were seen by the domain experts as not logical and therefore the building blocks contained in these clusters are used as single entities for the second analysis round. Figure 5. • The third clustering has a nice distribution of the subsystems over the clusters. When visualizing the results of the analysis in Sotograph the domain experts observed that there was one huge cluster containing around 40 % of the lines of code of the total archive.5.7 shows the results of one execution of the analysis. or see how the remaining building blocks are grouped in new partitions.7: Results of analysis with one subsystem in detail . while keeping the other subsystems at the highest level. Figure 5. • Average of 20% of the clusters contained one (low-level) building block • For some clusters dependencies between building blocks placed in the clusters were not evident or were not there at all. The remaining 2 subsystems were analyzed in detail. having 2 predefined large clusters as start solution. because of lack of time to run and review three executions for each setting at the end of the project. which made them hard to compare. This is not desirable for PMS and therefore was decided not to continue with the analysis of one subsystem.6 Results Cluster Analysis • The three executions resulted in three completely different clusterings. 42 . The idea behind the analysis is to see how the remaining building blocks are divided among the two partitions. decided was with the domain experts that the results from the third clustering are taken for a next analysis round. The first cluster contains 2 subsystems from the building block hierarchy and the latter cluster contains 11 subsystems from the building block hierarchy.improved start solution This analysis was performed on the building block hierarchy. round 2 The input file for Bunch consisted of 42 nodes and 226 edges. Two predefined large clusters .8 shows the results of one execution of the genetic algorithm. The clusters were defined by domain experts: one cluster contained mainly image handling software and the other contained mainly scanning software.improved start solution. This start solution with the 2 predefined clusters is achieved by merging the building blocks in the clusters before the analysis is executed. As a start solution there is already a ‘bipartition’ of the archive. Decided was to take all clusters containing more than 3 building blocks as a merge for the next analysis.) Figure 5. there are no multiple subsystems present in a cluster Having evaluated the three results. (We have chosen to perform one execution.

we decided that continuing with this analysis .9 shows the results of one analysis with the genetic algorithm in Bunch. 43 . Figure 5. • The majority of the building blocks were clustered in 27 clusters • For some of the 27 clusters dependencies between building blocks placed in the clusters were not evident or were not there at all. round 2 The input file given to Bunch consisted of 109 building blocks as nodes and 385 weighted dependencies as edges. With the domain experts was decided to continue with the analysis.6 Results Figure 5. Therefore we decided not to continue with the analysis on the two predefined clusters. The remaining building blocks (in the 27 other clusters) were not merged and are used as single entity for the next analysis round. Figure 5. by merging the building blocks that were clustered with the two predefined clusters. Domain experts observed the following items: • A few building blocks were divided among the two predefined clusters. Domain experts observed the following items: • A few building blocks were divided among the two predefined clusters.was not useful.Cluster Analysis 5.8: Results of analysis with two predefined large clusters The input file given to Bunch consisted of 129 building blocks as nodes and 485 weighted dependencies as edges.9: Results of analysis with two predefined large clusters. Because the building blocks that were clustered with the two predefined clusters had very low dependencies or even no dependencies on the other building blocks in the clusters. • The majority of the building blocks were clustered in 18 clusters • For some of the 18 clusters dependencies between building blocks placed in the clusters were not evident or were not there at all.by merging the added building blocks to the two predefined clusters .

The fact that the intermediate results could not be used by the domain experts in most cases is also caused by the genetic algorithm of Bunch. that could serve as basis for the splitting of the archive.8 Summary This chapter proposed cluster analysis as a method to obtain multiple independent archives from the original code archive of the MR case. The algorithm produces results with clusters that highly differ in size (LOC).5. The main reason for that is that it is difficult for the domain experts to get merges from an analysis round when the results are not ‘acceptable’.6 are discussed and is reflected how our proposed approach works. static dependencies as similarity measure and a genetic algorithm to execute the actual analysis. that is domain experts can filter out meaningful clusters. Domain experts evaluated the results by visualizing them in Sotograph. while other parts are analyzed at a higher level. By using the leveled approach results on one part of the archive can be used for a next analysis. where other parts are analyzed in detail. ’Cluster size equality’ is not a requirement made explicit in this thesis. The reasons why the intermediate results were not acceptable for the domain experts are the following: • Different executions of the algorithm result in different clusterings. which makes it hard to compare the results • Multiple clusters contain one building block • Multiple clusters contain building blocks that are not related This means that the proposed process with the leveled approach only works if results from previous analysis rounds are at least to some degree acceptable. Analysis has been performed using the leveled approach on several parts of the archive. The genetic algorithm also in some cases produced results with a cluster that contained up to 40% of the size of the archive. Cluster analysis on the building block hierarchy did not result in a satisfying clustering.7 Evaluation Cluster Analysis 5. As entities for analysis building blocks are taken.7 Evaluation In this section the results of the previous section 5. This means that some parts of the archive are analyzed in detail. They could not filter groupings of building blocks from clusterings that could serve as ‘merges’ for a next analysis because of the following reasons: • Different executions of the algorithm result in different clusterings. 5. The system’s architects found that this is not desirable for the MR case. however this requirement appeared to be an important criterion for evaluating the results by the domain experts. which makes it hard to compare the results 44 . This is also not desirable for the MR case.

45 . and therefore they could not filter ‘merges’ for successive analysis round. This means that no splitting of the archive could be obtained using the proposed approach.Cluster Analysis • Multiple clusters contain one building block • Multiple clusters contain building blocks that are not related 5.8 Summary The intermediate results of the analysis were not ‘acceptable’ for the domain experts.

.

These features are derived from existing project documentation and domain experts.Chapter 6 Conclusions and future work This thesis aims at ways to split the software archive of MR into multiple archives that can be developed independently. which makes use of the existing hierarchy in the software archive. In this thesis we have made the following contributions: • We discussed how FCA and cluster analysis can be applied to a large-scale. For the context of FCA we defined building blocks as objects and for the attributes two sets are combined: a set of attributes indicating that building blocks are highly dependent on each other . In literature we could not find cases with comparable goals and scale. a measure that aids in choosing concepts from a set of concepts.2. in order to cope with scalability issues that come with applying FCA and cluster analysis on a large-scale software archive. Conclusions and future work for FCA are discussed in section 6. From this context we derived concepts. which can serve as a basis of defining a splitting of the archive. non-trivial industrial software archive.1 and for cluster analysis in section 6. Two analysis methods are used within the framework: Formal Concept Analysis (FCA) and cluster analysis. which are used to generate concept subpartitions (CSPs).and a set of attributes that represent certain features of the building blocks. Each CSP represents a possible partitioning of the object set of building blocks. 47 . • We presented the concept quality.using a threshold .1 Formal Concept Analysis The first analysis method that we have presented is FCA. • We presented the leveled approach. The framework of Symphony is used to come to a solution of this problem. FCA provides ways to identify sensible groupings of object that have common attributes. 6.

However, obtaining CSPs from the set of concepts results in an enormous number of CSPs. The process of analysis therefore works with a leveled approach, meaning that some parts of the code archive can be analyzed in detail, while other parts are analyzed at a higher level. Results from previous analyses are used for successive analysis rounds. Selecting which parts are used for the successive analysis rounds is aided by the newly defined concept quality.

When looking at the results with the domain experts at PMS, the idea of applying the leveled approach in combination with the concept quality works well: the number of concepts decreases significantly. FCA is in our opinion very suited for recognizing certain patterns or groupings of objects based on attributes, but not suited for an analysis that should result in a precise non-overlapping partitioning of the object set.

6.2 Cluster analysis

The second analysis method is cluster analysis. Cluster analysis is used to group highly dependent objects in clusters. We have chosen building blocks as entities to cluster. Static dependencies are used as similarity measure and we use a genetic algorithm to evaluate clusterings on a clustering quality measure.

The process of analysis works with a leveled approach, which means that some parts of the building block hierarchy in the archive are analyzed in detail, while other parts of the archive are analyzed at a higher level. This enables us to use results from previous analyses for successive analyses, by merging clusters of building blocks that are validated by domain experts, where more parts are analyzed in detail.

However, the domain experts could not filter groupings of building blocks from clusterings that could serve as 'merges' for successive analysis rounds. One reason is the fact that different executions of the genetic algorithm resulted in different clusterings, which made it hard for the domain experts to compare the results. The other reason is the fact that some clusters in the resulting clusterings were not acceptable for the domain experts: the size of the resulting clusters differed significantly, and some clusters contained only one building block or contained building blocks that were hardly or not related. This means that no splitting of the archive could be obtained using the proposed approach.

However, we do think that using this leveled approach is a good way to cope with the scalability issues, with the remark that intermediate results should be 'good' enough to use for successive analysis rounds. These 'good' intermediate results could be achieved by implementing some implicit requirements the domain experts pose on clusterings in the clustering algorithm, for example parameters that can achieve equally sized clusters, or parameters that can set a range for the number of desired clusters, which could narrow the search space of the algorithm. This might help to input domain knowledge into the analysis process.
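One way to make such implicit requirements explicit is to extend the fitness measure that the genetic algorithm optimizes with penalty terms for unequal cluster sizes and for a number of clusters outside a desired range. The sketch below is a hypothetical illustration of that idea, not the measure used by Bunch or in this thesis; the cohesion measure, the weights and the penalty shapes are assumptions.

import statistics

def fitness(clustering, deps, loc, k_range=(5, 20), w_balance=1.0, w_count=1.0):
    # clustering: cluster id -> set of building blocks (assumed data structure).
    # deps: set of (source, target) static dependencies between building blocks.
    # loc:  building block -> lines of code.
    # Higher is better: reward intra-cluster dependencies, penalize size imbalance
    # and a cluster count outside the desired range.
    member = {bb: cid for cid, blocks in clustering.items() for bb in blocks}

    # Cohesion: fraction of dependencies that stay inside one cluster.
    internal = sum(1 for s, t in deps if s in member and member[s] == member.get(t))
    cohesion = internal / len(deps) if deps else 0.0

    # Balance penalty: coefficient of variation of cluster sizes in LOC.
    sizes = [sum(loc[bb] for bb in blocks) for blocks in clustering.values()]
    balance_penalty = statistics.pstdev(sizes) / statistics.mean(sizes) if len(sizes) > 1 else 0.0

    # Count penalty: distance of the number of clusters to the desired range.
    k = len(clustering)
    low, high = k_range
    count_penalty = max(0, low - k, k - high) / high

    return cohesion - w_balance * balance_penalty - w_count * count_penalty

A genetic algorithm would evolve candidate clusterings and keep the fittest ones, so that the size and count requirements steer the search instead of being checked only after the fact.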

