
Folder Structure Evolution in Open Source Software

Andrea Capiluppi
Dipartimento di Automatica e Informatica
Politecnico di Torino – ITALY
Andrea.Capiluppi@polito.it

Maurizio Morisio
Dipartimento di Automatica e Informatica
Politecnico di Torino – ITALY
Maurizio.Morisio@polito.it

Juan F. Ramil
Computing Department
Faculty of Maths and Computing
The Open University, UK
J.F.Ramil@open.ac.uk

Abstract

Predicting when and how a software system will evolve is one of the most fascinating challenges of software engineering. No matter what approach one is using to study such evolution, empirical studies, including observations of systems used in the real world, and of their software processes, are needed in order to identify correlations, find recurring patterns, and eventually predict how systems are likely to evolve.

In the empirical study presented in this paper, we take 25 software systems released as Open Source, and observe their evolution. Our focus is not only on how much systems grow in size, but also on how code structure is adapted and modified as the system evolves. The main goal of this study is to recognize recurring patterns and practices used in evolving long-lived real world software systems.

In our study we find three dominant patterns of code structure evolution of Open Source systems: horizontal expansion, vertical expansion and vertical shrinking. By detailed study of exemplars of these three patterns one can identify under which conditions a particular pattern is more likely to prevail than the others.

1. Introduction

The long term evolution of E-type software, that is, software systems which are actively used in real world domains and environments [Lehman and Belady 1985], is an important issue for empirical study which can lead to useful insight and applicable lessons both for researchers and practitioners. On the other hand, toy systems, or prototypes, are surely worth analysis, but the conclusions drawn out of them are limited in their applicability to real world applications and domains. The empirical studies of real world software processes and products are limited by the kind of artifacts that an investigator may be able to obtain and measure: proprietary systems are in general difficult to study, since the public disclosure of data reflecting those systems tends to be restricted.

In this paper we use metrics derived from a number of open source¹ systems (OSS), in order to study the characteristics of their long term evolution, and, in particular, how their internal structure evolves. Choosing OSS systems for studying software evolution is an advantage since important amounts of data concerning software products and processes are available in freely accessible forms such as mailing lists, releases, configuration management repositories (such as the concurrent versions system, CVS), etc.

¹ The authors are aware of the distinction between Free and Open Source software. The distinction is relevant, for example, with regards to users' rights over software artifacts. In this paper, however, we will use Free and Open Software as synonyms.

In this paper we study OSS software systems from the point of view of their structural evolution. This involves the study of their enhancement, adaptation and restructuring. We are interested in finding patterns in the evolution of software code structure. Our data set is composed of 25 OSS systems which we observed in a discrete-time perspective, that is, studying each available release. The dataset globally represents 992 releases or data points. We are interested in observing source code structure and its changes, to learn from long-lived OSS systems what types of structural patterns emerge, what structural changes are more frequently made to the source code, and also to look for patterns in the evolutionary trends.

Given that code structure and, in general, system architecture can be visualized using a variety of means, we focus on the simplest possible approach: the source folder structure. By folder we mean any directory in the code repository which contains source files. Our research goal is to understand how OSS projects evolve with regards to source code internal structures. In future studies we plan to relate the source folder view of software structural evolution and other structural views (for example, obtained through design recovery [Di Lucca et al 2000]) with factors such as size and type of application, effort subsumed by the evolution and the type of software process.

2. Related work

Empirical studies on software development gained momentum after the pioneering work of Lehman and his collaborators on the study of the evolution of the proprietary operating system OS/360 [Belady and Lehman, 1976]. The initial study observed 20 releases of OS/360. The results that emerged from that investigation, and subsequent studies of other proprietary commercial software [Lehman, 1974, 1980], [Lehman and Belady, 1985], [Lehman et al, 1987], [Lehman et al, 1988], include the SPE program classification and a set of laws of E-type software evolution. The findings made in the seventies and eighties were refined and supplemented in the recent FEAST projects [Lehman et al 1998].

More recently, other researchers have studied the software evolution phenomena. For example, Kemerer and Slaughter [1999] studied the evolution of two different
proprietary systems using two approaches: one based on time series analysis, and the other based on sequence analysis. A study which identifies and categorizes software evolution patterns is also reported in [Barry et al., 2003].

During the last few years, it has been realized that OSS systems have an edge over commercial ones when it comes to availability of data: many studies have been done since the initial research involving the Apache web-server and Mozilla browser [Mockus et al. 2002].

More recent studies include those which examine single OSS projects [German 2003], [Koch and Schneider 2000], [Aoki et al. 2002], [González-Barahona et al 2001], [Stamelos et al 2002], [Godfrey and Tu 2000], and those which involve several systems [Capiluppi et al 2003], [Capiluppi 2003].

Even though the vast majority of OSS software evolution studies are based on direct trend visualisation and curve fitting, interesting new approaches to study the evolution phenomenon have been recently proposed through both quantitative [Antoniades et al 2003], and qualitative [Smith et al 2004] simulation methods.

The work presented in this paper explores the study of the evolution of the code structure, a new dimension not covered in any of the above studies. In doing so, this work aims at complementing the understanding of OSS evolution.

3. Rationale

When investigating code structure of various OSS systems, one may encounter different patterns of modifications: if we consider code structure from the perspective of its organisation and storage (one example is depicted in Figure 1), it is possible to visualize basic components (source files, source folders) as composing a tree, with the root of the tree being represented by the parent folder. When analyzing software evolution in a tree perspective, one distinguishes two dimensions:
1. vertical growth, that is, creating a sub-branch in an existing branch (upper part of Figure 1),
2. horizontal growth, that is, adding a new branch over an existing branch (lower part of Figure 1).

If we consider Figure 1 from a tree-perspective, we may also state that any vertical growth adds depth in code structure, i.e. a new level has been nested under an existing level. The upper part of Figure 1 shows that creation of folder F3 has introduced a nested level under a current level, which is composed of F1 and F2. Alternatively, as shown in the lower part of Figure 1, F3 can be added at the same level as F1 and F2, that is, without adding a new level.

The initial focus for the research reported here is based on Figure 1, and on the common assumption that evolution in software systems is generally implemented in an incremental fashion. Our aim is to understand if source code trees have a common pattern of growth, and if (and how) those patterns have an impact on the evolvability of the systems. In particular, we would like to assess a working hypothesis which is based on anecdotal observations by one of the authors. The hypothesis states that a system is likely to grow from an initial low-level tree, first by adding branches to the existing levels, and next by adding additional levels. If this or any other common evolutionary pattern is supported by empirical observations, the next question would be why such a pattern occurs and whether it can be linked to other characteristics of the software and its related domains. Moreover, the empirical study of structural evolution can help us to identify, and even predict, when and how structural changes occur and whether this can be related to transitions between stages [Rajlich and Bennett 2000], [Nakakoji et al 2002] in the evolution of a software system.

This investigation of code structure evolution in OSS requires one to address the following research questions:
• How does the source tree evolve over time or releases?
• How does the depth of the source tree relate to code size?
• How does the code structure evolution relate to the rate of functional growth and change of a system?
• What common patterns emerge in source tree growth, given the horizontal and vertical perspective introduced in Figure 1 and in the above discussion?
• How could one, by visualizing the evolving code structure, distinguish functional enhancement and adaptation activities, usually the predominant effort during the evolution of source code, from refactoring and restructuring, also called anti-regressive activities [Lehman 1974]?

[Figure 1 diagram: a parent folder with sub-folders F1 and F2; folder F3 is added either as a nested level (upper part) or at the same level as F1 and F2 (lower part)]
Figure 1 - Two possible modifications of code structure

4. Methodology

Our methodological approach can be summarized as the list of steps presented below. The list is not intended as fully sequential, since some steps are intertwined, and provide feedback to other steps:
1. Projects selection: as reported in previous work [Capiluppi et al 2003], we have created a large database with data representing over 400 OSS systems, randomly selected from a popular OSS repository. Initially, we classified these systems based on a number of process and product characteristics. For the study of structural evolution we decided to focus on the larger systems, which involve more complex and richer folder structures. For the present study, we define as 'large'
those systems composed of over 100 KLOCs of code. Furthermore, we extracted from the data set some smaller systems for which all the releases in the system's evolution were publicly available for investigation. In total, the sample for the present study includes 25 OSS systems, which is what we could investigate within the time and the resources available.
2. Attribute definition and metrics derivation: since our focus is on measuring systems' evolution, we collect a set of metrics which include system size, an indicator which is generally accepted as a surrogate of the functional power of the system. Section 5 provides a description of this and other attributes.
3. Parsing tools: automatic data extraction is key in systems' evolution analysis. In this study we used off-the-shelf, freely available utilities [XSCC] for counting lines of code. In addition, we built our own tools for parsing source trees (these tools are available to anyone who wishes to replicate this study). Next, we used the dot graphic tool [Graphviz] for extracting source trees out of data, and, finally, a PERL script to quantify the number of changes made in-between subsequent releases.
4. Data analysis and pattern recognition: basic plots and visualisations were used as a means to identify recurring patterns.
5. Interpretation: in addition to observing (and recognizing) patterns, one needs to formulate possible explanations for them, based on existing literature (e.g. [Lehman and Belady 1985], [Rajlich and Bennett 2000]), new observations by the authors and hints provided by the documentation of the observed systems.

5. Definition of attributes

5.1. Source code size

The vast majority of studies on the evolution of software systems so far have involved one type or another of source code size metrics [Lehman and Belady 1985], with only some exceptions [e.g. Anton and Potts 2001]. In this study we measured source code size in three different forms:
1. LOCs: the total amount of lines of code, which we usually counted through off-the-shelf utilities (wc -l, for instance).
2. SLOCs: the total amount of source lines of code, i.e. the LOCs remaining after blank lines and comments have been purged.
3. KBs: the size of a source file in kilobytes.

5.2. Code Structure

5.2.1. Code components

Research has been done aiming at correlating various structural evolutionary metrics to fault and failure discovery rates [Nikora and Munson 2003] based on the view that evolutionary characteristics may be directly related to a few common evolution attributes measured at the file or module level. Other studies in search of patterns of software evolution have concentrated on metrics [Barry et al 2003] but not on visualizations of software structure.

The present study focuses on a coarser level of granularity by measuring attributes of the whole system. For example, we deal with code structure in three different forms:
1. source files, that is, all files that are supposed to contain source code (e.g., “*.c”);
2. source folders, that is, directories containing at least one source file;
3. folder levels, that is, each level in the code structure where topologically folders may be placed.

Files, folders and levels together form a structure which may be interpreted as a simple architectural view of the system.

5.2.2. Folders level

Observing Figure 1, one would wish to know in which sequence F1 and F2 were added to the system. In order to investigate this, we use the term encapsulated to refer to a folder that is contained inside another one. Each encapsulation is associated with a specific depth inside the source file structure; therefore each encapsulation may be related to a depth-attribute, which we call level. Our interest is therefore to analyze the characteristics of folder levels, and observe maximum depths, the size of each level, patterns of growth, and break points in the evolution of source folder trees.

5.3. Modification types

Different approaches for classifying maintenance and evolution activity have been proposed over the years, e.g. [Kemerer and Slaughter 1999], [Chapin et al 2001]. The application of these classification schemes in an empirical study involves considerable work. In this study we focused on two types of activity, and on one measure derived from them, all obtained by identifying which files have been added, modified or deleted between two releases. This is relatively simple to identify:
1. source additions, i.e. the set of source files added in-between two subsequent releases or over a given period of time (e.g. one month);
2. source deltas, i.e. the set of files modified or deleted in-between two subsequent releases or over a given period of time (e.g. one month);
3. number of touched files (or files handled [Lehman and Belady 1985]), i.e. the cardinality of the union of source additions and source deltas. The percentage of touched files at release (or period) j is calculated as the number of files touched at release (or period) j, divided by the total number of files present at the previous release (or period), j-1.

6. Patterns in structural evolution

6.1. Evolution of size

In this section we briefly summarize our findings with regards to the evolution of source code, correlations between different measures and the composition and structure of source trees in the 25 OSS systems studied. The size and the length of the evolution period studied for each of these systems are presented in table A1 in the appendix.
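The structural attributes of Section 5.2 and the touched-files measure of Section 5.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the parsing tools or the PERL script used in the study: a release is represented as a hypothetical mapping from relative file path to file content, source files are recognized by an assumed set of extensions, files stored directly in the root folder are taken to sit at Level 1, and a file counts as modified when its content differs between two releases.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: which extensions count as "source files" (the paper gives "*.c" as an example).
SOURCE_EXTENSIONS = {".c", ".h"}


def is_source(path: str) -> bool:
    """True if the path looks like a source file under the assumed extension set."""
    return PurePosixPath(path).suffix in SOURCE_EXTENSIONS


def folder_level(path: str) -> int:
    """Level of the folder holding `path`; assumption: the root folder is Level 1."""
    return len(PurePosixPath(path).parts)


def files_per_level(release: dict[str, str]) -> dict[int, int]:
    """Number of source files found at each folder level of one release."""
    counts = Counter(folder_level(p) for p in release if is_source(p))
    return dict(sorted(counts.items()))


def source_folders(release: dict[str, str]) -> set[str]:
    """Folders containing at least one source file ('.' denotes the root folder)."""
    return {str(PurePosixPath(p).parent) for p in release if is_source(p)}


def touched_percentage(previous: dict[str, str], current: dict[str, str]) -> float:
    """Touched files at release j as a percentage of the files present at release j-1."""
    prev = {p: c for p, c in previous.items() if is_source(p)}
    curr = {p: c for p, c in current.items() if is_source(p)}
    if not prev:
        raise ValueError("previous release contains no source files")
    additions = curr.keys() - prev.keys()                              # source additions
    deltas = {p for p in prev if p not in curr or curr[p] != prev[p]}  # modified or deleted
    return 100.0 * len(additions | deltas) / len(prev)


# Hypothetical two-release example (paths and contents are invented for illustration).
release_1 = {"main.c": "a", "gc/alloc.c": "b", "gc/mark.c": "c", "gc/sweep.c": "e", "README": "x"}
release_2 = {"main.c": "a", "gc/alloc.c": "b2", "gc/sweep.c": "e", "gc/util/list.c": "d", "README": "y"}

print(files_per_level(release_1))
print(files_per_level(release_2))
print(touched_percentage(release_1, release_2))
```

In this invented example one file is added, one is modified and one is deleted, so three files are touched against four source files in the previous release. Note that, because additions are counted in the numerator but not in the denominator, the percentage defined in Section 5.3 can exceed 100 for a release that adds many files.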
Whilst visualizing the code evolution of the 25 systems studied, one interesting invariance emerged when plotting the average size of source files as a function of the system size. In almost all the cases (except for the IMLIB system), the average size of source files displays values not greater than 20 kilobytes or so.

In Figure 2, we observe that almost all projects stabilize the average size of their source files, although in general their total size grows over releases. There are particularly interesting cases when these stabilization points are reached after a digressive trend (plots not shown here due to space limitations).

[Figure 2 plot: "Project size vs average source file size"; x-axis: project latest size [KB]; y-axis: average source file size [KB] at latest state]
Figure 2 - Average size of individual files as a function of total size of the system, measured at the most recent release

If we perform the same analysis for the second basic source component (source folders), we observe that the average source folder size, measured by adding up the size of files located in each folder and then taking the average over all folders in the same system, varies widely amongst systems (Figure 3). In the plot one can observe, close to the origin, the smallest system of the sample, whose size is around 800 LOCs at the most recent release.

[Figure 3 plot: "Project size vs latest average folders size"; x-axis: project latest release size [KB]; y-axis: latest state source folders average size]
Figure 3 - Average folder size, as a function of the total size of the system, both measured at the most recent release

Figure 3 is more scattered than Figure 2, which implies that there is a higher variability in the average amount of files per folder in the systems studied. This suggests that the study of source folder evolution provides an orthogonal, complementary view to that provided by studying source file evolution only.

Figure 4 presents the plot of system size in kilobytes versus the size in number of files, both measured at the latest release: there seems to be a linear relationship, which is what one would have expected from the roughly stable average size of files displayed in Figure 2.

[Figure 4 plot: "Projects size vs. number of files"; x-axis: project's latest state size [KB]; y-axis: latest state nr. of files]
Figure 4 - Project size as a function of number of files at the most recent release

6.2. Evolution of code structure

When we observe the evolution of the folder structure, some recurring patterns can be recognized. In a first attempt to categorize these patterns, we were able to identify three main cases. Here we briefly describe all of them, while in the next Sections we present some illustrative exemplars of each of these three types. Before discussing the types, we need to briefly introduce the notion of articulated source tree. By articulated source tree we mean a tree which consists of at least two levels, which in turn implies the presence of at least one sub-branch in the source folder structure. The three structural patterns which emerged are the following:
1. Horizontally expanding: a first pattern is characterized by the early presence of an articulated source tree at the first release available for study. The articulated tree continues to exist during the subsequent releases; no vertical growth is observed (that is, the number of levels does not grow), but there is horizontal growth. We observed this pattern in 10 out of 25 analyzed projects.
2. Vertically shrinking: a second pattern is characterized by an initial articulated source tree which evolves into a source tree with a smaller number of levels. This vertical shrinking is not accompanied in general by horizontal shrinking: in other words, some levels get lost in the evolution of the source tree (vertical dimension), but we do not observe a decrease of the number of source folders (horizontal dimension). We observed this pattern in 4 out of 25 projects.
3. Vertically expanding: a third recognized evolution pattern starts with a simple tree structure which then evolves adding at least one level. We observed this in 11 out of 25 projects. In the majority of the cases the pattern followed is a vertical expansion from an early articulated source tree. However, there are 3 systems from this set of 11 whose first observation was a simple source tree (consisting of 1 level only), which in turn evolved into an articulated one.

It is worth noting that a horizontally shrinking pattern did not emerge in any of the systems studied. That pattern simply did not exist in the dataset.
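The three patterns of Section 6.2 can be expressed as a simple decision rule. The sketch below is one possible reading, not the procedure used in the study: it assumes each release is summarized by a hypothetical pair (number of levels, number of source folders) and compares only the first and the last observed release, so that levels which are added and later discarded do not count as vertical growth. The names `Pattern`, `is_articulated` and `classify` are introduced here for illustration.

```python
from enum import Enum


class Pattern(Enum):
    HORIZONTALLY_EXPANDING = "horizontally expanding"
    VERTICALLY_SHRINKING = "vertically shrinking"
    VERTICALLY_EXPANDING = "vertically expanding"
    UNCLASSIFIED = "unclassified"


def is_articulated(levels: int) -> bool:
    """An articulated source tree consists of at least two levels."""
    return levels >= 2


def classify(observations: list[tuple[int, int]]) -> Pattern:
    """Classify a release history given as (levels, source_folders) pairs.

    Assumption: only the first and last observed releases are compared.
    """
    if not observations:
        raise ValueError("at least one release is required")
    first_levels, first_folders = observations[0]
    last_levels, last_folders = observations[-1]
    if last_levels > first_levels:
        return Pattern.VERTICALLY_EXPANDING
    if last_levels < first_levels:
        return Pattern.VERTICALLY_SHRINKING
    if last_folders > first_folders:
        return Pattern.HORIZONTALLY_EXPANDING
    # Same depth and no folder growth: none of the three patterns applies.
    return Pattern.UNCLASSIFIED


# Earliest and latest observations reported in this paper for Gwydion-Dylan:
# 7 levels and 64 source folders, then 5 levels and 137 source folders.
print(classify([(7, 64), (5, 137)]))

# Hypothetical history in which a fourth level appears and is later discarded.
print(classify([(3, 10), (4, 12), (3, 14)]))
```

Comparing only the endpoints is a deliberate simplification. A stricter variant could inspect every intermediate release, for example to separate systems whose depth never changed from those that temporarily gained a level.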
6.3. Horizontally expanding

The first evolutionary pattern that we have identified is based on a structure whose vertical dimension remains constant over the entire observed evolution of the application: we observe, in general, a horizontal growth of new branches and leaves, but there is no growth in the vertical dimension, that is, the maximum depth keeps the same value. In some specific cases, new vertical levels were added in the evolution of the system, but then they were discarded in later releases (e.g. the Grace system). In the following sub-sections, we analyze a subset of the systems which display this first pattern, and we give some background information on their evolution in order to better understand and interpret the observed behaviors.

6.3.1. ARLA

The ARLA project made available its first public release in February 1998, and its most recent release is labeled 0.35.12 (February 2003). In total, 35 major releases were developed, and 62 releases are available through the project's web sites, which therefore include 27 minor releases. The ARLA project's main purpose is to achieve similar functionality to the IBM AFS file system. It is likely that ARLA has currently achieved even more functionality than AFS. Its application domain is distributed file systems management, a domain in which a lot of knowledge is available and openly shared. In this respect, this system is similar to flagship OSS successes (such as Linux or Apache). In ARLA’s evolution there have been two basic ways of enhancing and evolving the system: adding common features for the system (e.g. support for specific network protocols), and adding ports so that the system supports different architectures.

Observing its folder makeup, as measured by the number of files per folder level (Figure 5), we observe that the majority of the files have been located at Levels 2 and 3. Level 4 experienced a sudden midlife increase at around release 25, accompanied by a sudden decrease at Level 3. Several new folders were added at Level 4, and others were moved there from other parts of the system, which also significantly affected Level 3.

[Figure 5 plot: "ARLA - growth of levels"; x-axis: releases; y-axis: number of files per level; series: Level-1, Level-2, Level-3, Level-4]
Figure 5 - Number of files per level for the ARLA system

In Figure 6, we observe the makeup of the evolution of ARLA as depicted by the total number of source files, and touched files over releases. The growth trend can be interpreted as two segments of decaying growth rate with a midlife growth regeneration point at about release 32. The trend presents similarities with those observed in commercial systems [Ramil and Smith 2002]. The plot of files per level as in Figure 5 is useful for identifying the mid-life growth rate regeneration points, and Figure 5 suggests that such regeneration in growth rate was linked to a restructuring of the system.

In Figure 6 the number of files touched per release presents only one major peak at release 50, that is, around ¾ of the system's life-cycle (95 percent or so of the size of the system at the previous release was touched), while all other peaks of files touched do not go beyond 60 percent. Except for the outlier around release 50, one can observe a predominantly decreasing trend with a super-imposed oscillation in this attribute. The peaks correspond to the major releases. In the case of the ARLA system, the decreasing growth rate in the last third of its evolution history can be linked to a possible move of the system into a “servicing stage” [Rajlich and Bennett 2000], [Nakakoji et al 2001], as revealed by the declining evolution rate, suggested by the decreasing trend in the proportion of files touched.

[Figure 6 plot: "ARLA - adaptations"; x-axis: releases; left y-axis: number of files; right y-axis: adaptations (files touched) [%]; series: Adaptations, TOTAL # of FILES]
Figure 6 - ARLA evolutionary trends: total size and files touched per release

6.3.2. KSI

The KSI project is aimed at building a lightweight implementation of the Scheme programming language and interpreter [Scheme]: therefore, it counts on an existing system's knowledge base to build a portable and embedded environment for the Scheme executables. CVS archives do not include the very first releases of the system. This means that we are limited to studying the most recent evolution of this system. This issue typically emerges in empirical studies of software evolution in which release 1 in the data set does not correspond to the actual very first release², because the oldest data has been deleted or is unavailable.

² In this and the remainder of the figures and text of this paper, release 1 does not necessarily correspond to the first release of the system, and release should be read as release sequence number.

In spite of this, the study of the available subset of releases is meaningful, since we were able to access data for the most recent 12 releases of KSI, which span an interval of 860 days. We were also able to identify 3 cycles of major releases (3.2, 3.3 and 3.4), and we also noticed that this system is the only one in the data
set that shrinks its size from the earliest to the latest available release (from 111,288 LOCs to 100,157 LOCs). In this particular system Level 1 contains source files used for building the package only, while Level 2 and Level 3 hold nearly the whole code for the application.

It is interesting to observe the disposition of folders in a graphical fashion: since few folders are involved, the visualization of this system’s tree structure is easier and clearer than for the larger systems of our study database. Figure 7 shows the disposition of source folders in the earliest release available: each ellipse is a source folder, and all source folders at the same level have the same associated source level number (rectangles on the left of the figure). In Figure 7, the edges between two folders are annotated with the number of files contained in the lowermost side of the connection. For example, close to the connection “ksi-3.20” -> “gc”, the label 59 means that the “gc” folder contains 59 source files.

Figure 7 - KSI earliest folder structure

Figure 8 displays the KSI folder structure for the most recent release available for study: the reduction in LOCs is reflected in a reduced number of both source files and source folders. Some branches were pruned away, and the whole design has become more compact, in the sense that the number of folders at Level 2, Level 3 and Level 4 has been reduced. The profile of files touched between releases for KSI, displayed in Figure 9, shows peaks which can be related to the major releases, as in other systems.

Figure 8 - KSI folder structure at most recent release.

A possible cause for the declining growth behavior of KSI, as seen in Figure 9, is the possible existence of memory or performance constraints in the target hardware which limit further growth of the system. Another, more fundamental explanation would be that KSI is not an E-type system [Lehman and Belady 1985] in the strict sense, since it addresses a problem (the implementation of Scheme) which can be sufficiently well specified as to consider it an S-type system. In this regard, KSI's behavior is what one may expect to observe for compilers and other precisely specified programs, which are closer to the S-type than to the E-type.

[Figure 9 plot: "KSI - adaptations"; x-axis: releases (1 to 12); left y-axis: total number of files; right y-axis: files touched [%]; series: Number of files, Adaptations]
Figure 9 - KSI evolutionary trends: total size and files touched per release

6.3.3. Ganymede

Ganymede was initially developed and evolved by academic staff, and includes both an application/database server and its client. We were able to recover data on its first 12 releases. This data set includes, we believe, its very first release. Only one major release cycle is recognizable (series 1.0) for this system so far. This is reflected both in the size change and in the structure changes, which are all relatively small. Rather than showing the graph of changes, we present a table (Table 1) with data displaying the number of files per level for this application.

We observe very few additions or deletions of source components, even though the application has grown from 221,893 LOCs to 229,110 LOCs over its lifetime. There is no significant evolution with regards to structural changes: all the components seem to remain in the same place within the folder structure, and few additions are made during the system lifetime. One possible explanation for this emerges by going through the Changelog of this system: all developers and code contributors appear to be a small group, with few new contributions coming from outside this group.

One of the challenges in software evolution, which is particularly evident in OSS, is the difficulty for outsiders or new contributors to assimilate and comprehend sometimes massive amounts of code which have been generated in a closed environment. If we observe the latest available release, we realize that it is dated November 2002, which means that neither new features nor modifications have been released since then. A possible interpretation is that the feedback coming from the open source external
communities was not sufficiently strong as to guarantee a sustained evolution of the application.

# of files per level
RSN     1    2    3    4    5    6    7    8    9   10   11   12
Lev1   --   --   --   --   --   --   --   --   --   --   --   --
Lev2    3    2    3    3    3    3    3    3    3    3    3    3
Lev3  273  268  273  273  274  274  274  274  275  275  276  277
Lev4   14   14   14   14   14   14   14   14   14   14   14   14
Lev5   76   75   76   77   77   77   77   77   77   77   77   77
Lev6   95   95   95   95   95   95   95   95   95   95   95   95
Lev7   12   12   12   12   12   12   12   12   12   12   12   12

Table 1 - Ganymede source levels evolution

When we plot code additions and modifications, presented in Figure 10, we observe that, while this system is growing very slowly, code adaptations (expressed in percentage of files touched per release) are dispersed throughout the code, and represent on average 70 percent of the whole system for each release. We may conclude that, in this particular case, functional enhancement has been very limited by lack of sufficient external feedback, while “servicing” of the application [Rajlich and Bennett 2000], [Nakakoji et al 2001] was conducted by a small group of developers.

[Figure 10 plot: "Ganymede - adaptations"; x-axis: releases (1 to 12); left y-axis: total number of files; right y-axis: files touched [%]; series: total files, % files touched]
Figure 10 - Ganymede evolutionary trends: total size and files touched per release

6.4. Vertically shrinking

The second evolutionary pattern is based on a structure which becomes less articulated in the observed evolution. This means that some branches are pruned from the source tree, so that the global amount of vertical levels is lower than in the initial observations.

21 subsequent releases for this system, but they do not represent its whole life-cycle, since its earlier evolution is not available for study, either in the form of releases or in the CVS repository. The available releases reflect 4 cycles of major releases, spanning 1673 days.

We observe in Figure 11 that the first available data point is composed of 7 nested levels, which have most likely been progressively accomplished through a previous series of releases for which we do not have data.

[Figure 11 plot: "Gwydion-Dylan - growth of levels"; x-axis: releases (1 to 21); y-axis: number of files per level; series: Level-1 to Level-7]
Figure 11 - Number of files per level for Gwydion-Dylan

The folder structure of the most recently observed release is composed of only 5 levels. The number of source folders and files grows proportionally with the evolution of code (at its earliest stage: 64 source folders, 607 source files; at its latest stage: 137 source folders, 1147 source files). A midlife restructuring of the system is clearly observable in Figure 11 after release 11, which can explain the increase in growth rate experienced by the system during the last half of the observed sequence of releases (Figure 12). The behavior of the proportion of files touched for this system is displayed in Figure 12.

[Figure 12 plot: "Gwydion-Dylan adaptations"; x-axis: releases; left y-axis: files per release; right y-axis: files touched [%]; series: total files, % files touched]
Figure 12 - Gwydion-Dylan evolutionary trends: total size and files touched per release

6.4.2. Gist
As we did for the first pattern, we will present below a Gist is a set of tools for building dynamic web sites. We
subset of the systems which display this pattern: some have access at the whole story of this project (20 releases),
background information on their evolution is given in order which enables us to observe for this system how growth
to better understand and interpret the observed behaviors. and change are related from the first release onwards.
There are 4 cycles of major releases for this system, which
6.4.1. Gwydion-Dylan are clearly noticeable when one plots code changes over
Gwydion-Dylan is an object-oriented compiler releases. When observing the profile of code adaptations,
supporting rapid applications development, and aiming to in Figure 13 we observe peaks in correspondence of major
become a complete development environment. We observe
releases. These peaks are quite noticeable: during these releases, more than 80 percent of the files are touched.
As with the Ganymede system analyzed before, this system is mainly developed by a small and stable group of developers. We were interested in finding out whether trends similar to those of Ganymede would appear. However, as shown in Figure 13, the size of the Gist code is not roughly constant, as it is in Ganymede: several shrinks in size, both in global LOCs and in the number of source files, are visible at different points of the system's life-cycle. The overall trend nevertheless indicates an increasing size, with a declining rate. The growth from the first to the last release is about 20 percent, which is a low evolution rate in comparison to other systems in the dataset.
Also interesting is that the folder structure is simple at both the initial and the most recent releases of the system: growth seems to proceed on a horizontal basis, while the vertical dimension seems to shrink as new horizontal folders are added. In the latest available release, over 30 folders sit at the same nesting level, while in the earliest release we found around 20. For the Gist system, we are able to conclude that the horizontal growth of folders was effective in the evolution of the system.

Figure 13 - Gist evolutionary trends: total size and files modified per release
(plot titled GIST - adaptations: number of files on the left axis, files touched [%] on the right axis, releases 1 to 20 on the horizontal axis)

6.5. Vertically expanding

The third evolutionary pattern is based on a structure that expands during the observed evolution of the application: new branches are added in one or more sections of the tree, and new vertical levels appear. Horizontal levels may be added as well, but we found no clear relation between the two dimensions.
Two case studies are analyzed in the following sections, and additional information, besides size and structure, is provided in order to gain insights into the observed pattern.

6.5.1. Lcrzo
Lcrzo is a shared library for developing network applications. Its functionality consists of a common framework for nearly all network protocols (Ethernet, TCP, and so on). In Figure 14 we depict the evolution trend of the application over 1400 days: we were able to access 56 releases of this system, and these represent, we believe, its whole evolution history. We have been able to recognize four cycles of major releases (1.0, 2.0, 3.0 and 4.0).

Figure 14 - Number of files per level for LCRZO system
(plot titled LCRZO - growth of levels: # of files per level for Level-2 to Level-5, releases 1 to 55 on the horizontal axis)

When investigating source levels, we found 4 different vertical levels. Two sudden jumps in the number of files at Level 2 suggest that something significant happened twice in the operational lifetime of this system. We then investigated these jumps using the tree structure perspective.
In Figure 15 we depict the source folder structure before the first jump (release 22): there are 5 source levels. The number of files contained in each level represents the relative weight of the level in the overall structure.

Figure 15 - LCRZO code tree at release 22, before the first jump

Figure 15 displays the folder structure at the last release of the 2.0 cycle of releases. The subsequent release is the first of the 3.0 cycle, so we may assume that some stable status had been reached by the time it was made available.
We see in Figure 16 that some branches were pruned: in a major release, unstable features are typically excluded from what is made available. What's more, the folder “example” in Level 2 gets filled with some 28 KLOCs of new source code. On inspection, these are models, skeletons and schemas of potential new applications which can be implemented by using this library. In a sense, they provide the community with an entry point for new development (they show potential users what the system may be able to do for them).
Figure 16 - LCRZO state at the 23rd release (after first jump)

We also see (Figure 17) that, after a series of releases, the code structure is changed again, after which two different levels seem to grow in parallel. The first is dedicated to the code of the application, the second acts as a sort of incubator for new features: in the latest instances of this system, we observe new levels, that is, vertical branches in the code structure.

Figure 17 - LCRZO most recent state

Observing the plots for size and change activity (Figure 18), we note that this system underwent many rewritings and large adaptations. This is interesting because here large adaptations occur between minor releases, not only in major releases.

Figure 18 - LCRZO evolutionary trends: total size and files touched per release
(plot titled LCRZO - adaptations: files per release on the left axis, files touched per release [%] on the right axis, releases 1 to 55 on the horizontal axis)

6.5.2. Vovida SIP stack
Vovida is the system which has experienced one of the largest size deltas in the data set (13 KLOC to 650 KLOC) from the first to the most recent available release. Vovida is an open source application that implements the SIP (Session Initiation Protocol) protocol stack for multimedia sessions. It is a particularly interesting application from the point of view of the growth of levels: we have been able to access the entire lifetime of this system, and it evolved by nesting several levels (from a single level in the first release to 8 levels in the latest available).

Figure 19 - Number of files per level for VOVIDA system
(plot titled Vovida Sip Stack - growth of levels: number of files per level for Level-1 to Level-8, releases 1 to 15 on the horizontal axis)

In Figure 19 we can observe that, starting with Level 1, then Level 2 and so on, the levels were added at different moments in time, together with new source files and folders. That is to say, a massive amount of evolution effort has been made in this application in order to add new features and functionality.
As part of this large effort in code additions, we observed that the source adaptation trend displays high-valued peaks, reflecting the high evolution rate that this system has displayed (Figure 20). The rapid evolution rate can be linked to a dynamic and growing community of developers.

Figure 20 - VOVIDA evolutionary trends: total size and files modified per release
(plot titled VOVIDA - adaptations: number of files on the left axis, files touched [%] on the right axis, releases 1 to 15 on the horizontal axis)
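The "percentage of files touched per release" plotted in the adaptation figures above can be illustrated with a small computation. This is a hedged sketch only: how a touched file was detected is not spelled out in this section, so the code assumes that a file counts as touched when it is present in the newer release and is either new or differs in content from the file at the same path in the previous release. The names `snapshot` and `percent_touched` are hypothetical.

```python
import hashlib
import os


def snapshot(root):
    """Map each file path (relative to root) to a SHA-1 digest of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as handle:
                digest = hashlib.sha1(handle.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def percent_touched(previous, current):
    """Share of files in `current` that are new or changed relative to `previous`.

    Both arguments are path -> digest mappings as returned by snapshot().
    Assumption: added files count as touched; deleted files are ignored.
    """
    if not current:
        return 0.0
    touched = sum(
        1 for path, digest in current.items()
        if previous.get(path) != digest
    )
    return 100.0 * touched / len(current)
```

Applying `percent_touched` to the snapshots of each pair of consecutive releases yields one value per release. A different definition of "touched" (for example, counting deleted files, or comparing only source files) would shift the values, so the figures in the paper should not be expected to match this sketch exactly.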
7. Conclusions
We have analyzed in this paper the evolution of 25 OSS systems, possibly the widest and largest dataset in a study of this kind. The systems studied are of different sizes, ranging from 700 LOCs to 700 KLOCs, and represent a diverse set of application domains. 20 out of 25 may be categorized as large-sized systems, given that their size at the most recently observed release is greater than 100 KLOCs.

These systems are a sub-set taken from a version history database of OSS systems which we have collected for our research. The systems in this database were randomly extracted from a popular software repository dedicated to open source. In this particular study we have sought to identify interesting patterns in the evolution of these systems, with a focus on the source code. Our aim is to better understand the evolution of OSS systems and to relate traditional analyses, such as the plotting of growth trends, to the visualization of the evolving folder structure. In particular, we are interested in topological patterns, that is, when and how new source components are added and how they relate to existing components and to the existing overall structure. In this work, we define a “source file” as each single file containing source code, and a “source folder” as each directory containing at least one source file.

Our first result shows (Figure 2) that there is a stabilization point in the average size of the source files, between 5 and 20 KB. However, when investigating the relation between the average size of the source folders and the size of the system, there is no apparent correlation. This suggests that the number of source files and the number of source folders provide two complementary views. This is an improvement on previous studies, which have been based on source file counts and related metrics but have not considered the folder structure.

We analyzed the structure of the source folders, visualizing them as a tree containing branches (source folders) and leaves (source files). In doing so, we have been able to distinguish three main evolution patterns, basically related to how the folders evolve on a vertical and a horizontal dimension.

The first pattern is based on an invariant code structure on the vertical dimension: we observed this pattern in 10 systems out of the 25 analyzed. Analyzing these 10 systems further, we realized that in three of them the system was given a structure before its first public release: in other words, a core group was in charge of developing it before it became publicly available.

In general, when one is studying patterns in software evolution, the smallest systems are likely to display a less disciplined evolutionary behavior, driven by the decisions and actions of a small group of developers. On the contrary, larger projects are more likely to exhibit evolution dynamics of their own, for reasons that have been discussed in the literature [Lehman and Belady 1985]. In our study, depending on the system, we observed faster growth in some cases and slower growth in others. We then tried to identify why this was so by looking at details of the development process, and we found higher growth rates for the systems in which it was easier for potential contributors to become actual contributors, that is, where more feedback was available (the evolution of the ARLA system could take advantage of several added developers, while that of Ganymede could not). Furthermore, the percentage of files touched per release shows few peak values, nearly all of them corresponding to important releases. The range of these peaks is on average between 70 and 90 percent. One interesting case was also described (Ganymede), where very few new components were added, while on average 70 percent of the existing ones were touched at every release. This behavior requires further investigation.

The second recognizable pattern is when the vertical dimension grows. We had initially expected this to be the predominant pattern emerging from our analysis, but we found it in only 10 systems out of 25. What's more, several of these also underwent some shrinks and expansions in the depth of the code tree. The trend of files touched per version for this class of systems has in general higher peaks than in the first pattern, also because new components are added. Several peaks around 100 percent (LCRZO system, VOVIDA system) and around 80 percent (VOVIDA system) can be observed.

A third, less frequent, pattern also emerges, in which the vertical dimension shrinks. The profile of files touched per release here lies between those of pattern one and pattern two. Remarkable peaks in the range of 90 to 100 percent are recognizable when major releases are prepared (Gwydion-Dylan and GIST), but these peaks are rather sporadic and rarely recurrent.

In our future work we plan to refine the identification of patterns of structural evolution by considering metrics which reflect the evolution of both the horizontal and the vertical dimension of the code structure, and to relate this to other system characteristics by applying cluster analysis.

8. References
[Aoki et al. 2001] Aoki A., Hayashi K., Kishida K., Nakakoji K., Nishinaka Y., Reeves B., Takashima A., and Yamamoto Y., “A Case Study of the Evolution of Jun: an Object-Oriented Open-Source 3D Multimedia Library”, Proc. 23rd Intl. Conference on Software Engineering, ICSE 23, Toronto, Canada, 12-19 May 2001, pp. 524 – 533
[Anton and Potts 2001] Anton A. and Potts C., “Functional Paleontology: System Evolution as the User Sees It”, Proc. 23rd ICSE, Toronto, Canada, 12-19 May 2001, pp. 421 – 430
[Antoniades et al 2003] Antoniades P., Samoladas I., Stamelos I., Bleris G.L., “Dynamical simulation models of the Open Source Development process”. To appear in Free/Open Source Software Development, Stefan Koch (ed.), Idea Group, Inc.
[Barry et al 2003] Barry E.J., Kemerer C.F., and Slaughter S.A., “On the Uniformity of Software Evolution Patterns”, Proc. ICSE 25, Portland, Oregon, May 3 – 10, 2003, pp. 106 – 113
[Basili et al 1996] Basili, V. R. et al, “Understanding and Predicting the Process of Software Maintenance Releases”. Proc. 18th ICSE, Berlin, March 25 – 29, 1996, pp. 464 – 474
[Belady et al 1976] Belady L.A., Lehman M.M., “A Model of Large Program Development”, IBM Systems J., vol. 15, no. 1, 1976, pp. 225 – 252.
[Capiluppi 2003] Capiluppi A., “Models for the evolution of OSS projects”, Proc. of the 7th International Conference on Software Maintenance, ICSM, Amsterdam, September 22 – 26 2003, pp. 65 – 74.
[Capiluppi et al 2003] Capiluppi A., Lago P., Morisio M., “Characteristics of Open Source Projects”, Proc. of the 7th European Conference on Software Maintenance and Reengineering, CSMR, March 26 – 28 2003, pp. 317 – 327.
[Chapin et al 2001] Chapin N., Hale J.E., Khan K.M., Ramil J.F. and Tan W.G., “Types of Software Evolution and Software Maintenance”, Journal of Software Maintenance and Evolution: Res. and Practice, 13(1), January-February, pp 1 – 30, 2001
[Curtis et al 1979] Curtis B., Sheppard S.B., Milliman P., Borst M.A. and Love T., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics”, IEEE Trans. on Softw. Eng., 5(2), 1979, pp. 96 – 104
[Di Lucca et al 2000] Di Lucca G.A. et al, Recovering Class Diagrams from Data Intensive Legacy Systems, Proc. ICSM 2000, 11 – 14 Oct. 2000, San Jose CA, pp. 52 – 63
[El-Emam et al 2000] K. El-Emam, S. Benlarbi, N. Goel, W. Melo, H. Lounis, and S. Rai, “The Optimal Class Size for Object-Oriented Software: A Replicated Study”, National Research Council of Canada, NRC/ERB 1074, 2000.
[German 2003] German D., “Using software trails to rebuild the evolution of software”, International Workshop on Evolution of Large-scale Industrial Software Applications (ELISA), 23 September 2003, Amsterdam, The Netherlands, http://prog.vub.ac.be/FFSE/Workshops/ELISA-Workshop.html (as of Sept. 2003)
[Godfrey and Tu 2000] Godfrey, M., and Tu Q., “Evolution in Open Source Software: A Case Study”. Proc. of 2000 ICSM, October 11-14 2000, pp. 131 – 142
[González-Barahona et al 2001] González-Barahona J.M., Ortuño-Pérez M. A., de las Heras-Quirós P., Centeno-González J., Matellán-Olivera V., “Counting potatoes: The size of Debian 2.2”, http://people.debian.org/~jgb/debian-counting/counting-potatoes-0.2/ (as of June 2004)
[Graphviz] Graphviz - open source graph drawing software, http://www.research.att.com/sw/tools/graphviz/
[Kemerer and Slaughter 1999] Kemerer, C.F., and S. Slaughter. “An Empirical Approach to Studying Software Evolution”. IEEE Transactions on Software Engineering, 1999, 25(4), pp. 493 – 509.
[Koch and Schneider 2000] Koch S., Schneider G., “Results from Software Engineering Research into Open Source Development Projects Using Public Data”, in “Zum Tätigkeitsfeld Informationsverarbeitung und Informationswirtschaft”, Hans R. Hansen und Wolfgang H. Janko (eds.), Nr. 22, Wirtschaftsuniversität Wien, 2000.
[Lehman 1969] Lehman M.M., “The Programming Process”, IBM Res. Rep. RC 2722, Dec. 1969: 46 pp. Also as Chapter 3 in [Lehman and Belady 1985]
[Lehman 1974] Lehman M.M., “Programs, Cities, Students, Limits to Growth?”, Inaugural Lecture, in Imperial College of Science and Technology Inaugural Lecture Series, v. 9, 1970, 1974, pp. 211 – 229. Also in Programming Methodology, Gries D (ed.), Springer Verlag, 1978, pp. 42 – 62. Reprinted as Chapter 7 in [Lehman and Belady 1985]
[Lehman 1980] Lehman M.M., “Programs, Life Cycles, and Laws of Software Evolution”, Proc. Special Issue Software Eng., IEEE, vol. 68, no. 9, 1980, pp. 1,060 – 1,076
[Lehman and Belady 1985] Lehman M.M. and Belady L.A., (eds.) Program Evolution – Processes of Software Change, Academic Press, London, 1985
[Lehman et al 1997] Lehman M.M., J.F. Ramil, P.D. Wernick, D.E. Perry, and W.M. Turski, “Metrics and Laws of Software Evolution - The Nineties View”, Proc. Fourth Intl. Software Metrics Symp., Metrics '97, Albuquerque, N.M., 1997, pp. 20 – 32
[Lehman et al 1998] Lehman M. M., D. E. Perry, and J. F. Ramil. “Implications of evolution metrics on software maintenance.” Proc. of the 1998 ICSM 98, Bethesda, Maryland, Nov. 1998, pp. 208 – 217.
[Mockus et al 2002] Mockus A., Fielding R.T., Herbsleb J.D., “Two Case Studies of Open Source Development: Apache and Mozilla”. In ACM Transactions on Software Engineering and Methodology Vol. 11, No. 3, 2002, pp. 309 – 346.
[Nakakoji et al 2002] Nakakoji K., Yamamoto Y., Nishinaka Y., Kishida K., Ye Y., “Evolution Patterns of Open-Source Software Systems and Communities”. In Proceedings of International Workshop on Principles of Software Evolution (IWPSE 2002), Orlando, Florida, 19 – 20 May, 2002, pp. 76 – 85
[Nikora and Munson 2003] Nikora A.P. and Munson J.C., “Understanding the Nature of Software Evolution”, Proc. ICSM 2003, 22 – 26 Sept., Amsterdam, pp. 83 – 93
[Rajlich and Bennett 2000] Rajlich V.T. and Bennett K.H., “A Staged Model for the Software Life Cycle”, IEEE Computer, July, 2000, pp. 66 – 71
[Scheme] The Scheme programming language, project available at http://www.swiss.ai.mit.edu/projects/scheme/ (as of June 2004)
[Shankland 2000] Shankland S., “Linux kernel release falls behind schedule”, available on-line at http://news.com.com/2100-1001-240061.html?legacy=cnetandtag=st.ne.1002.thed.1003-200-1808165 (as of June 2004)
[Smith et al 2004] Smith N., Capiluppi A., Ramil J.F., 2004, “Qualitative Analysis and Simulation of Open Source Software Evolution”, in the Proc. of the 5th Int. Workshop on Software Process Simulation and Modeling, May 24 – 25 2004, pp. 103 – 112
[Stamelos 2002] Stamelos, I., Angelis, L., Oikonomou, A., Bleris, G.L., “Code Quality Analysis in Open-Source Software Development”, Information Systems Journal, 2nd Special Issue on OS Software, 12(1), January 2002, pp. 43 – 60.
[XSCC] A tool for extraction source lines of code, http://members.tripod.com/vgoenka/unixscripts/xscc.html (as of June 2004)
APPENDIX

System            Files  Folders  Files  Folders     Kbs     Kbs     LOCs     LOCs    SLOCs    SLOCs  Depth  Depth   Time
                    ini      ini    fin      fin     ini     fin      ini      fin      ini      fin    ini    fin (days)
Arla                321       31    658       69   1.831   4.091   63.663  162.218   40.009  108.838      4      4  1.820
Ganymede            473       28    478       28   5.455   5.646  221.893  229.110  123.093  126.955      6      6    558
Gwydion-dylan       607       64  1.147      137   6.606  11.012  213.688  348.644  151.145  252.997      6      5  1.673
Ghemical            586       12    555       12   6.426   6.716  217.463  226.769  171.998  180.159      4      4    454
Gimpprint             7        1    136       14     305   2.206   11.156   80.567    9.172   61.895      1      3  1.304
Gist                778       27  1.067       37   4.098   4.519  172.111  190.933  126.987  131.401      5      4  1.436
Grace                91        4    310       14   2.025   4.428   73.691  157.919   63.423  113.668      2      2  2.730
Htdig               136       16    511       24     441   3.926   21.300  153.722   14.529  102.621      3      5  2.451
Imlib                27        4     36        4   2.631   2.692   52.651   55.839   50.300   53.163      2      2  1.277
Ksi                 259       19    191       14   2.933   2.708  111.288  100.157   81.681   75.561      4      4    860
Lcrzo                19        3    235        9     197   3.658    6.409  109.323    4.955   70.517      1      6  1.435
Linuxconf           586       46  1.347      117   2.475   6.104  103.498  239.223   82.810  191.594      4      4  2.028
Mit-scheme        1.511       31  1.946       51  17.127  21.941  545.093  704.864  467.151  614.141      3      5  3.430
Motion                2        1     28        1       7     160      239    6.836      204    5.901      2      2  1.281
Mutt                120        2    201        6   1.131   2.391   48.640   96.415   37.477   70.171      2      3  2.032
Nicestep             44        4    140       17   1.173   2.414   33.990   74.441   27.555   59.729      1      2  1.168
Parted               52        6    122       16     417   1.354   16.911   51.907   12.431   38.720      3      3  1.405
Pliant              227       37    641       94   1.255   4.270   36.347  116.947   28.868  101.363      5      5  1.845
Quakeforge          396       17    696       58   3.815   5.696  172.946  233.534  123.234  175.377      3      5  1.268
Rblcheck              1        1      7        5       2      19      104      772       68      447      1      3  1.493
Rrdtool             113       10    153       26   1.926   3.025   86.138  128.211   68.695  102.298      3      4  1.634
Siagoffice           42        5    322       18     356   3.618   15.386  137.504   13.743  108.254      2      2  2.594
Vovida Sip Stack     49        1  2.618      135  13.307  19.809   13.307  665.749    7.406  398.938      1      6  1.309
Weasel               16        1     36        2     142     511    4.449   17.591    2.629   11.924      1      2    834
Xfce                207       12    450       69   1.323   8.450   46.808  277.423   35.317  225.736      2      3  1.662

Table A1 – Various size measures and length of evolution period studied (time interval, in days) for the 25 OSS systems
NOTES:
• In the table header, “ini” indicates size measured at the first publicly available release, “fin” indicates size measured at the last publicly available release.
• Columns 2 to 13 represent various size measures.
• Column 14 represents the length of the period studied for each system, measured as the interval between the first and the latest available release.
• In the values, a period is used as the thousands separator (for example, 1.673 days for Gwydion-dylan is 1673 days).
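As an illustration of how the Depth columns of Table A1 relate to the three structural patterns, the sketch below compares the initial and final depth for a few rows copied from the table. It is a deliberate simplification and not the classification procedure of the study: it looks only at the first and the last release, whereas the patterns discussed in the paper are read from the whole sequence of releases, so the two need not agree. The function name `depth_pattern` and the label strings are hypothetical.

```python
# (depth at first release, depth at last release), copied from Table A1
DEPTHS = {
    "Ganymede": (6, 6),
    "Gwydion-dylan": (6, 5),
    "Gist": (5, 4),
    "Lcrzo": (1, 6),
    "Vovida Sip Stack": (1, 6),
}


def depth_pattern(depth_ini, depth_fin):
    """Label a system by comparing the depth of its first and last release only."""
    if depth_fin > depth_ini:
        return "vertically expanding"
    if depth_fin < depth_ini:
        return "vertically shrinking"
    return "vertically invariant"


for system, (ini, fin) in DEPTHS.items():
    print(f"{system}: {depth_pattern(ini, fin)}")
```

For the five rows shown, this first-versus-last comparison agrees with the sections in which the systems are discussed; for systems whose depth grows and later shrinks, it would hide the intermediate changes.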
