
Studying the Impact of Evolution in R Libraries on

Software Engineering Research


Catherine Ramirez, Meiyappan Nagappan,
Mehdi Mirakhorli
Department of Software Engineering
Rochester Institute of Technology
Rochester, NY, USA
{cer2377, mxnvse, mxmvse}@rit.edu

Abstract—Empirical software engineering has become an integral and important part of software engineering research in both academia and industry. Every year several new theories are empirically validated by mining and analyzing historical data from open source and closed source projects. Researchers rely on statistical libraries in tools like R, Weka, SAS, SPSS, and Matlab for their analysis. However, these libraries, like any software library, undergo periodic maintenance. Such maintenance can be to improve performance, but can also be to alter the core algorithms behind the library. If indeed the core algorithms are changed, then the empirical results that have been compiled with the previous versions may not be current anymore. However, this problem exists only if (a) statistical libraries are constantly edited and (b) the results they produce differ from one version to another. Hence in this paper, we first explore whether either of the above two conditions holds true for one library in the statistical package R. We find that both conditions are true in the case of the randomForest method in the randomForest package.

Keywords—Empirical software engineering, scientific tools, mining software repositories.

I. INTRODUCTION

Over the last few decades there has been a steady increase in the amount of empirical studies within the context of software engineering research. This is evident from the creation of a journal specifically to cater to empirical software engineering research [1]. There are also several conferences like MSR [2], ESEM [3], and Promise [4] that focus entirely on empirical studies.

In empirical studies, researchers collect data on the hypotheses they postulate, and analyze the data to prove or disprove them. The analysis is carried out using statistical analysis tools like R, SPSS, SAS, or Matlab. These tools have libraries or packages for each type of statistical analysis. Software engineering researchers, like researchers in many other communities, have become reliant on these tools for their analysis.

However, like any software system [10], [11], these packages and libraries undergo constant evolution. Lehman's first law of evolution states that "A program that is used and that as an implementation of its specification reflects some other reality, undergoes continual change or becomes progressively less useful. The change or decay process continues until it is judged more cost effective to replace the system with a recreated version." [9]. The libraries and packages in statistical tools are no exception to this law.

II. METHODOLOGY

In this paper, we try to find some initial evidence of an impact on the results due to evolution in the packages and libraries of statistical tools. Our approach follows three steps, which we detail below:

1) Library Selection - We need to choose a library or package from a statistical tool that is used often by software engineering researchers. We chose the R statistical tool, since it is open source and commonly used by researchers in many fields of science and engineering. In our initial investigation, we chose the randomForest package of R, since it is one of the techniques most adopted by SE researchers.

2) Version Identification - We need to identify versions where the code of the package has undergone evolution. In order to do this, we first download all the versions of the randomForest library. We then use the Linux diff utility to determine the number of lines inserted and deleted between two consecutive versions of the package. Note that we only focus on the randomForest.default.R file in the randomForest package.

3) Impact Analysis - In the final step we assess the impact of the changes to the library/package. To do this, we use a black box testing approach. We pass the same input to the API of each version of the package/library. Note that we use only versions where there has been a change in the code of the API in question.

III. CASE STUDY RESULTS

From Figure 1, we can see that there are several versions of the randomForest package, and that in several of these versions changes have been made to the randomForest.default.R file alone. The average number of lines added is almost 29, and the average number of lines deleted is almost 23. In one version of the package (randomForest 4.4-2), 427 lines were removed and 431 lines were added. Thus the amount of change in the randomForest package is substantial.
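Step 2 of the methodology begins by downloading every archived release of the package. The following is a minimal sketch of how the release tarball URLs could be assembled; the URL scheme reflects CRAN's usual Archive/ layout, and the version numbers listed are illustrative assumptions, not the full set studied in the paper:

```shell
#!/bin/sh
# Build the download URL for an archived CRAN release of a package.
# The src/contrib/Archive/ path is an assumption about CRAN's layout.
archive_url() {
    pkg="$1"
    ver="$2"
    echo "https://cran.r-project.org/src/contrib/Archive/${pkg}/${pkg}_${ver}.tar.gz"
}

# Print the URLs for two illustrative versions; piping this list to
# `xargs -n1 curl -O` would perform the actual downloads.
for v in 4.5-1 4.6-7; do
    archive_url randomForest "$v"
done
```

The script deliberately only prints the URLs, so that fetching and unpacking can be handled by whatever download tool is available.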

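The insertion and deletion counts between consecutive releases (Section II, step 2) can be read off the output of the Linux diff utility, which prefixes deleted lines with '<' and inserted lines with '>'. A sketch, in which the unpacked source directory names are hypothetical:

```shell
#!/bin/sh
# Count the lines deleted from and inserted into a file between two
# versions, using plain diff output: '<' marks deletions, '>' insertions.
count_changes() {
    old="$1"
    new="$2"
    deleted=$(diff "$old" "$new" | grep -c '^<')
    inserted=$(diff "$old" "$new" | grep -c '^>')
    echo "$deleted deleted, $inserted inserted"
}

# Illustrative usage against two unpacked releases (the directory
# names are assumptions; the call runs only if the files exist).
if [ -f randomForest_4.5-1/R/randomForest.default.R ] &&
   [ -f randomForest_4.6-7/R/randomForest.default.R ]; then
    count_changes randomForest_4.5-1/R/randomForest.default.R \
                  randomForest_4.6-7/R/randomForest.default.R
fi
```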
978-1-4673-6923-7/15/$31.00 © 2015 IEEE. SWAN 2015, Montréal, Canada
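The black-box step (Section II, step 3) amounts to running one fixed script against each version of the package and comparing the outputs. A sketch of how that comparison could be driven from the shell; the per-version library directories (lib/4.5-1, lib/4.6-7) are hypothetical names, each assumed to hold one pre-installed version of the package:

```shell
#!/bin/sh
# Report whether two saved outputs of the same script agree.
compare_outputs() {
    if diff -q "$1" "$2" >/dev/null; then
        echo "results identical"
    else
        echo "results differ between versions"
    fi
}

# Run the same R snippet once per installed package version by
# pointing R_LIBS at a different library directory each time.
# Guarded so the sketch is a no-op where R is not available.
if command -v Rscript >/dev/null 2>&1; then
    for v in 4.5-1 4.6-7; do
        R_LIBS="lib/$v" Rscript -e '
            library(randomForest)
            data(iris); set.seed(71)
            rf <- randomForest(Species ~ ., data = iris,
                               importance = TRUE, proximity = TRUE)
            print(rf$confusion)' > "out_$v.txt" 2>&1
    done
    compare_outputs out_4.5-1.txt out_4.6-7.txt
fi
```

Pinning the random seed before each run keeps the comparison fair: any remaining difference in the saved outputs then points at the package code, not at sampling noise.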
Fig. 1. Number of deletions and insertions to the randomForest method in the randomForest package

We then picked two versions of the package where the code between them has evolved: 4.5-1 and 4.6-7. The results of executing the following code in both versions of the package are reported in Table I:

data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ .,
                        data=iris,
                        importance=TRUE,
                        proximity=TRUE)

TABLE I
RESULTS OF THE STUDY IN TWO DIFFERENT RELEASES

(a) Result for version 4.5-1
            setosa  versicolor  virginica  class.error
setosa          50           0          0         0.00
versicolor       0          47          3         0.06
virginica        0           3         47         0.06

(b) Result for version 4.6-7
            setosa  versicolor  virginica  class.error
setosa          50           0          0         0.00
versicolor       0          46          4         0.08
virginica        0           4         46         0.08

We use the iris dataset in our example, and the results from the two versions are different. This is evidence that, as statistical packages evolve, the results and conclusions made using these packages may no longer be relevant.

IV. CHALLENGES IN FULL SCALE CASE STUDY

In this paper we merely present a very small case study. However, the results of this initial investigation are very encouraging. In order to scale this study to a much larger scale, we will face the following challenges:

• Identifying all packages that are important to the SE community. We need to carry out an extensive literature survey to determine which studies have used R, and what packages were used for the statistical analysis.

• Establishing an execution environment for comparison. We need to not just use an older version of a package, but also use the corresponding version of R used in each experiment. In order to do this, we need to get all the previous versions of R, and run them, through virtual machines, on operating systems that are compatible with that version of R.

• Reproducing research experiments. We need to reproduce and re-run previously published research experiments in the community to examine if any differences exist between different versions of algorithms. If there is a mismatch in results, we then need to conduct further experiments to examine if the fundamental research hypothesis can be impacted by the differences in the output of the algorithms. Statistical tests need to be performed to analyze the validity of the results.

• Taking the algorithms' parameters into consideration. We anticipate that algorithms might behave differently with diverse execution parameters. Therefore, we need to identify mechanisms to rerun experiments with several different parameters.

• Establishing experiment traceability. We need to work on mechanisms which can help scientists easily establish a connection between their experiments and the versions of the algorithms used in their work. The lack of such traceability adds further challenges in performing benchmark experiments and comparing the outcomes of different scientific works.

V. CONCLUSIONS

We see that the changes in the randomForest API are substantial, and that the results do change when the package evolves. These preliminary results encourage us to examine this problem at a much larger scale. To do this we will need to automate the methodology. We hope to identify popular R packages whose results differ across versions, and possibly identify published results that may no longer hold with the current versions. In future work we will extend our analysis and utilize algorithms and data used by several other researchers in the SE community.

REFERENCES

[1] Empirical Software Engineering journal - http://www.springer.com/computer/swe/journal/10664
[2] Working Conference on Mining Software Repositories - http://msrconf.org/
[3] Empirical Software Engineering and Measurement - http://www.esem-conferences.org/
[4] Promise conference - http://promisedata.org/2014/
[5] The R Project for Statistical Computing - http://www.r-project.org/
[6] SPSS software - http://www-01.ibm.com/software/analytics/spss/
[7] SAS/STAT - http://www.sas.com/en_us/software/analytics/stat.html
[8] Matlab - http://www.mathworks.com/products/matlab/
[9] M. M. Lehman, "Programs, life cycles, and laws of software evolution," Proceedings of the IEEE, vol. 68, no. 9, pp. 1060-1076, Sept. 1980.
[10] M. W. Godfrey and Q. Tu, "Evolution in open source software: a case study," in Proc. of the International Conference on Software Maintenance, pp. 131-142, 2000.
[11] M. D'Ambros, H. Gall, M. Lanza, and M. Pinzger, "Analyzing software repositories to understand software evolution," in Software Evolution, pp. 37-67, Springer, 2008.
[12] B. Ghotra, S. McIntosh, and A. E. Hassan, "Revisiting the impact of classification techniques on the performance of defect prediction models," to appear in Proc. of the 37th Int'l Conf. on Software Engineering (ICSE), 2015.
