

Data Characterization of ISBSG R12 Using Data Analytics:
An Exploratory Study

Ghazi Alkhatib 1, Khalid Al-Sarayreh 2, and Alain Abran 3

1,2 The Hashemite University, Zarqa, Jordan
3 Université du Québec, Montréal, Québec, Canada
1 g.alkhatib@hu.edu.jo - 2 KhalidT@hu.edu.jo - 3 Alain.Abran@ele.etsmtl.ca

Abstract: This paper presents an exploratory study that applies three data analysis techniques (statistical analysis, data clustering, and visualization) to the ISBSG R12 data set. Both SPSS and RapidMiner are used to conduct the analysis. While the main advantage of statistical analysis is the summarization of data, the overall behavior of the data is lost, particularly the view of outlier values. The study applied two statistical techniques using SPSS: correlation analysis and the general linear model with multiple variables. The statistical analysis showed a highly significant level of relationship between and among the selected variables. In the data mining area, the clustering and visualization techniques used both SPSS and RapidMiner (RM). For the selected variables, the number of clusters was determined after several runs, in an attempt to split the one large cluster into several sub-clusters. Finally, the visualization technique demonstrates how charts can reveal concentrations and trends. Statistical analysis found high correlation between speed of delivery and manpower delivery rate, and between the independent factors of industry type and development methodology and the dependent variable of defect density. The clustering process highlighted the importance of variables related to work effort and defects in forming the clusters. The main conclusion from the visualization charts is an inverse non-linear relationship between analysis and design effort as a share of total effort and speed of delivery on one side, and total defects delivered on the other. Overall, multiple views of data analytics are needed to arrive at a clear and consistent understanding of the underlying behavior of the data in a complex data set such as ISBSG.

Key words: ISBSG release 12, data analytics, data mining, correlation, clustering,
visualization, SPSS, RapidMiner.

1 Research background: review of literature and ISBSG related work
The competitive market of software products requires software companies to deliver reliable and functional products with minimum development cost and time, and to satisfy customers whose needs keep changing [18]. Continuously changing markets require an effective software development methodology in order to enhance and measure both productivity and the effort required for each development stage.

A software development methodology is also known as a Software Development Life Cycle (SDLC). Waterfall, spiral, incremental, the Rational Unified Process (RUP), rapid application development (RAD), agile software development, and rapid prototyping are examples of SDLC models [17]. All SDLC models have a sequence of phases that must be completed in order to deliver a final software product. This study considered waterfall and agile software development.

The waterfall model was introduced by Winston W. Royce in 1970 [16] [17]. It has five consecutive phases that aim to develop large software for delivery to a customer; these phases, in order, are: analysis, design, implementation, testing, and maintenance. Each phase has to be fully completed before moving to the next. The software manager assigns available resources to each phase. In order to attain maximum productivity in every stage, [2] proposed a simulation model for the waterfall SDLC to assist project managers in determining the optimal amount of resources required in each phase within the allocated schedule and budget. Successful implementation of waterfall requires that software requirements remain unchanged and fixed throughout the entire process. Unfortunately, this is not applicable to modern products because customers tend to change their needs frequently.

More recently, lightweight software development methods have been developed to respond efficiently to market changes and continuous development in a dynamic environment. In 2001, a meeting of seventeen professional software developers produced the Manifesto for Agile Software Development, which is based on twelve principles [3]. It is acknowledged that most of the agile methods are not new [7]. Agile development supports the traditional SDLC while adding these important values: individuals and interactions, working software, customer collaboration, and responding to change.

The use of agile development methods has improved productivity, delivery time, and customer satisfaction [13]. The 2011 CHAOS report from the Standish Group compared success and failure rates of software projects between the waterfall model and agile development. The report cites a 12% success rate for waterfall projects and a 42% success rate for agile projects, based on projects conducted from 2002 through 2010 [8].

Furthermore, productivity in software development is defined as the relationship between an output and its corresponding input. According to [12], productivity considers the relation of output to effort, where output is measured using Source Lines of Code (SLOC), Function Points (FP), or later revisions of FP. These measures are usually used when applying classical development methods such as waterfall. Alternatives such as story points are used for agile development [19]. However, these measurements are not applicable to all software engineering tasks [11] [4], since some tasks in software development do not produce code as output.
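
As a simple illustration of this output-to-input view of productivity (a sketch with invented numbers, not figures from the ISBSG data set):

```python
# Illustrative only: productivity as functional output per unit of effort.
# Both values below are made-up example figures.
functional_size_fp = 480     # output, measured in Function Points
total_effort_hours = 1200.0  # input, in person-hours

productivity = functional_size_fp / total_effort_hours
print(f"Productivity: {productivity:.2f} FP per person-hour")  # 0.40
```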
Productivity in software development processes is influenced by numerous factors. The results of a comprehensive study of 126 publications by [18] showed that the main factor in increasing productivity is the capability of the developers, followed by the tools and methods used within the process. This dependence on the human factor makes productivity difficult to estimate. An empirical study by [16] of the relationship between productivity and team size showed statistical correlations between team size, effort, productivity, and project duration. The study was performed using a preprocessed ISBSG release 10.

The SDLC applied in software projects affects productivity. In agile team projects, productivity is affected by team composition and allocation, external dependencies, and staff turnover; this is the result of a study that analyzed data from two projects [14]. Moreover, [10] summarized productivity threats into ten problem areas by studying factors that compromise productivity in large agile development projects; they conducted repertory grid interviews with 13 members of a project consisting of 11 Scrum teams. These problem areas can be classified into the following major factors: organizational culture and control, relationships with external parties, business requirements and functional dissemination, and manpower management. Overall, this research supports the findings of the previous paper in areas related to external factors and manpower management.

In clustering methods, several papers have developed different techniques for different applications. The authors in [1] suggested an approach combining behavioral and targeting methods to improve the identification of specific customers for personalized ubiquitous offers. In another paper, fuzzy k-nearest neighbors is applied to phoneme segmentation [5]. Researchers in [6] proposed a temporal and functional analysis approach, based on unsupervised classification from data mining, to collect evidence locations with the time-based, functional, and relational aspects of crimes, assisting investigators in identifying anomalies and information about these crimes. Another paper used the k-means Euclidean-distance algorithm for creating training groups in order to calculate the optimal number of clusters using the Mean Squared Error, applied to a basic literature classifier [9]. RapidMiner, used in this study, performs clustering with the k-means algorithm of unsupervised machine learning [15]. SPSS, on the other hand, offers three methods for cluster analysis: K-Means Cluster, Hierarchical Cluster, and Two-Step Cluster. K-means clustering is a method for fast clustering of large data sets, and it is the one applied in this research.

2 Research methodology
Following the introduction and review of the literature, the ISBSG R12 data set was checked for related variables that would constitute a coherent body of knowledge addressing a specific area in software project management. As a result of this analysis, the research selected the area of industry type, with a concentration on front-end analysis and design and the computed share of front-end analysis and design activities in total effort. These areas are then regressed and clustered in relation to time, effort in software projects, development methodologies, adjusted function points, and delivery- and defect-related variables.

The research methodology started with a close scrutiny of the data set, which led to the use of projects with a data quality rating of A or B only. Vertically, many variables were excluded, while others were retained for the analysis. Some form of theoretical foundation supported this selection, such as the effect of analysis and design effort on software implementation, and the function point computation and its relationship to a selection of other variables, such as industry type and development methodology. Normally, a low organizational-level transaction processing system dictates different development practices than hardware-based systems, such as telecommunication and mathematically oriented systems. Horizontally, the data set was visually examined for weak or missing data in certain variables, which led to the deletion of about 1 percent of the projects (a sketch of this screening follows).
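
A minimal sketch of this screening step in pandas (the file path and the column names, such as "Data Quality Rating", are assumptions to be checked against the actual release layout, not the exact procedure used in the study):

```python
import pandas as pd

# Load the ISBSG R12 extract (hypothetical file name).
df = pd.read_excel("isbsg_r12.xlsx")

# Keep only projects rated A or B for data quality.
df = df[df["Data Quality Rating"].isin(["A", "B"])]

# Drop the small fraction (about 1 percent) of projects with weak or
# missing values in the variables selected for the analysis.
selected = ["Adjusted Function Points", "Speed of Delivery",
            "Total Defects Delivered", "Defect Density"]
df = df.dropna(subset=selected)
```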
Two software packages are used in the analysis: SPSS and RapidMiner (RM). Statistical analysis included correlation analysis and the general linear model with multiple independent and dependent variables. The clustering technique using SPSS involved several K-means runs with the number of clusters ranging from 2 to 10. The objective was to identify a couple of clusters within a group of clusters, in the hope of isolating outliers. The different cluster runs always revealed one large cluster and other smaller ones. Following the analysis of cluster composition for 2 to 10 clusters using SPSS, the eight-cluster solution produced three clusters with slight variations, with the other five having fewer than 6 member cases each. Only the three largest clusters were analyzed. As for the visualization charts, the charts were revised to eliminate outliers until an acceptable scatter diagram without many outliers was reached.
The research also derives a new variable representing total effort. In some cases, effort was not distributed into the different effort-related categories, but was instead reported as a figure under the "unrecorded" variable. This latter column contained large figures for such projects, while projects that did distribute effort into the different categories reported a zero figure there. Therefore, a new column for total effort is created by adding the distributed efforts and, for non-distributed effort, taking the unrecorded figure as the total effort (sketched below).
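
The derivation of the new total-effort column can be sketched as follows (the phase-level and unrecorded column names are hypothetical stand-ins for the ISBSG effort breakdown fields):

```python
# Phase-level effort columns (illustrative names).
phases = ["Effort Plan", "Effort Specify", "Effort Design",
          "Effort Build", "Effort Test", "Effort Implement"]

# Sum the distributed phase efforts for each project.
distributed = df[phases].fillna(0).sum(axis=1)

# Where effort was not broken down by phase, the whole figure sits in
# the unrecorded column; take it as the total in those cases.
unrecorded = df["Effort Unrecorded"].fillna(0)
df["Total Efforts"] = distributed.where(distributed > 0, unrecorded)
```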

3 Data analysis
Data analysis of the ISBSG release 12 applied three types of techniques: traditional statistical analysis, data mining clustering, and visualization.

3.1 Statistical Analysis

In this section, two statistical analysis techniques are presented: correlation and
general linear model with multi-variables.

Correlation: Table 1 presents Pearson correlations for two sets of variables, shown in two consecutive sub-tables.
Table 1. Correlation analysis among the selected variables, with Pearson correlation significance (2-tailed).

Correlations:

Variable                    Total      Effort     Resource  Defect    Speed of  Manpower
                            Defects    AandD %    Level     Density   Delivery  Delivery
                            Delivered  of Total                                 Rate
Total Defects Delivered     1          -.160*     .084      .477**    .036      -.043
Effort AandD % of Total                1          -.025     -.113     -.106     -.141*
Resource Level                                    1         -.092     .204**    .204**
Defect Density                                              1         -.085     -.104
Speed of Delivery                                                     1         .818**
Manpower Delivery Rate                                                          1

Correlations (Pearson, Sig. 2-tailed):

Variable                    Adjusted   Total     Total      Defect
                            Function   Efforts   Defects    Density
                            Points               Delivered
Adjusted Function Points    1          .553**    .190**     -.024
Total Efforts                          1         .227**     .015
Total Defects Delivered                          1          .459**
Defect Density                                              1

Note for both tables: * significant at the .05 level; ** significant at the .01 level (two-tailed test).

These two tables reveal that the selected variables are, in general, correlated. The strongest correlation is between manpower delivery rate and speed of delivery. A noteworthy observation is the negative relationship between the percentage of analysis and design effort in total effort on one side, and total defects delivered and manpower delivery rate on the other.
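
A Pearson correlation matrix of this kind can be reproduced outside SPSS; a minimal sketch in pandas and SciPy, assuming the cleaned data frame `df` from Section 2 and illustrative column names (the analysis-and-design percentage is assumed to be precomputed):

```python
from scipy.stats import pearsonr

cols = ["Total Defects Delivered", "Effort AandD Percent of Total",
        "Resource Level", "Defect Density",
        "Speed of Delivery", "Manpower Delivery Rate"]

# Pairwise Pearson correlations (pandas handles missing values per pair).
print(df[cols].corr(method="pearson").round(3))

# Two-tailed significance for a single pair, e.g. the strongest one.
pair = df[["Speed of Delivery", "Manpower Delivery Rate"]].dropna()
r, p = pearsonr(pair["Speed of Delivery"], pair["Manpower Delivery Rate"])
print(f"r = {r:.3f}, two-tailed p = {p:.4f}")
```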

General linear model with multiple variables: The study used the following source (independent) variables: data quality, industry sector, and development type; and the following dependent variables: defect density, effort of analysis and design as a share of total effort, manpower delivery rate, and resource level. Table 2 shows the general linear model using multiple dependent and independent variables, sorted from the highest significance to the least and divided into three tiers. The top tier lists defect density 4 times, with an overall count of 7, indicating a strong relationship between the dependent and independent variables. The top dependent variable in the second tier is manpower delivery rate. The results of the F tests validate the quality-of-data assessment. Relationships among the other independent and dependent variables show overall significance, except for industry sector alone, which has a 0 significance level with total efforts, resource level, and effort of analysis and design as a percent of total effort; the strongest relationship is expressed between the combination of data quality and industry sector vs. defect density. Overall, this shows the importance of industry sector and development type in software development. Among the insignificant relationships, industry type leads the list with 6 out of 9 variables. Among the dependent variables, resource level leads the list with 4 insignificant relationships. Appendix 1 contains a pictorial representation of Table 2.

Table 2. Multi-variables relationships among sources and dependent variables ranked by
significance level.
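
A comparable multivariate general linear model can be fitted with statsmodels; a hedged sketch only, since SPSS's GLM multivariate procedure and the exact variable coding are not reproduced one-for-one (column names are illustrative):

```python
from statsmodels.multivariate.manova import MANOVA

dep = ["Defect Density", "Manpower Delivery Rate", "Resource Level"]
ind = ["Industry Sector", "Development Type", "Data Quality Rating"]
sub = df[dep + ind].dropna()

# Several dependent variables regressed on the categorical sources;
# Q() quotes column names containing spaces, C() marks factors.
model = MANOVA.from_formula(
    "Q('Defect Density') + Q('Manpower Delivery Rate')"
    " + Q('Resource Level')"
    " ~ C(Q('Industry Sector')) + C(Q('Development Type'))"
    " + C(Q('Data Quality Rating'))",
    data=sub,
)
print(model.mv_test())  # Wilks' lambda, Pillai's trace, F statistics
```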

3.2 Clustering Using Data Mining Techniques

The following two figures demonstrate that the two-cluster solutions generated by SPSS (see Fig. 1) and RapidMiner (see Fig. 2) are similar.

Fig. 1. SPSS cluster generation with two clusters.

Fig. 2. Visualization of two-cluster generation using RapidMiner.

Clearly, the concentration in cluster 1 is greater than in cluster 2, identifying the need to stratify that cluster. To conduct further cluster analysis looking for patterns in that cluster, several iterations were run for cluster counts ranging from 2 to 10, using the variables shown in Table 3. The two-cluster solution showed great skewness, with one cluster holding 267 cases and the second only 8. As the number of clusters increases, the cases start to distribute across clusters, with a decrease of cases in the first cluster, as displayed in Table 3. Analyzing cluster membership across the cases, the eight-cluster solution produced three clusters with an acceptable number of cases, while the remaining five clusters had fewer than 6 cases each. The three largest clusters (1, 5, and 4) were analyzed using their cluster centers, as shown in Table 3. The variables with increasing magnitudes of cluster-center values across the three largest clusters are: defect density, total defects delivered, normalized work effort, and total effort. The other two variables, speed of delivery and effort on analysis and design, showed inconsistent cluster-center magnitudes. In the data visualization below, these latter two variables show a non-linear inverse relationship to total defects delivered, clearly indicating why they did not contribute to forming the clusters. (A sketch of the cluster-count sweep is given after Table 3.)
Table 3. Final cluster centers for variable cluster sizes
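
The sweep over cluster counts can be sketched with scikit-learn's k-means; this is an outline of the procedure, not the exact SPSS run (column names are illustrative, and standardization is added here because k-means is sensitive to variable scales):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["Normalized Work Effort", "Total Efforts", "Defect Density",
            "Total Defects Delivered", "Speed of Delivery",
            "Effort AandD Percent of Total"]
X = StandardScaler().fit_transform(df[features].dropna())

# Try k = 2..10 and watch how the cases distribute over the clusters.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, sorted(np.bincount(km.labels_), reverse=True))

# For the chosen solution, inspect the final cluster centers.
km8 = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
print(np.round(km8.cluster_centers_, 2))
```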

3.3 The Power of Visualization

This section of the paper presents two types of charts: time-dependent and variable-dependent.

Time-dependent visualization: The following variables are depicted against time: speed of delivery, total defects delivered, a computed variable representing the share of analysis and design effort in total effort, and defect density. The chart of time vs. speed of delivery using SPSS peaks during the year 2000, when developers were coping with fixing systems to handle the migration from the two-digit year code to the four-digit year code (see Fig. 3, top). Future estimates may suggest, over a period of five years, a slight increase in speed of delivery followed by a decline. The second chart, depicting time against total defects delivered, shows an approximation of a normal distribution (see Fig. 3, bottom). Future projects may start another cycle of this distribution and, in the long run, experience a gradual increase in total defects delivered.

Fig. 3. Time-dependent visualization of two variables (top: speed of delivery, bottom: total defects delivered).

The next two charts depict year of project vs. analysis and design effort as a share of total effort. Both the SPSS-generated chart (see Fig. 4, top) and the RM-generated chart (see Fig. 4, bottom) display generally the same distribution, which could be characterized as skewed to the left. Software developers have started to recognize the importance of the front-end analysis and design activity among the SDLC phases.

Fig. 4. Time-dependent visualization of analysis and design effort as a share of total effort (top: SPSS generated, bottom: RapidMiner generated).

The next chart demonstrates that defect density increases during the later years (see Fig. 5), possibly indicating the increased complexity of projects.

Fig. 5. Defect density over time.

The RM-generated chart (see Fig. 6) shows that total effort stayed consistent throughout the years, with more outliers appearing in earlier years, perhaps indicating that effort became more consistent as software developers grew more proficient.

Fig. 6. Total effort over time.

Variable-dependent visualization: Using SPSS, the two charts below reveal a striking relationship between total defects delivered vs. the percentage of analysis and design effort in total effort (see Fig. 7), and speed of delivery vs. total defects delivered (see Fig. 8). The common estimate is a non-linear inverse function: it starts with a steep decline, curves, and then levels off. Both charts indicate the importance of spending more time on front-end analysis and design to reduce defects, as well as taking adequate time in delivering software project milestones.
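
A chart like Fig. 7 can be reproduced with matplotlib; a minimal sketch, assuming the cleaned frame `df` and the illustrative column names used above:

```python
import matplotlib.pyplot as plt

ax = df.plot.scatter(x="Effort AandD Percent of Total",
                     y="Total Defects Delivered", alpha=0.6)
ax.set_xlabel("Analysis and design effort (% of total effort)")
ax.set_ylabel("Total defects delivered")
ax.set_title("Front-end effort vs. delivered defects")
plt.show()
```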

Fig. 7. Variable dependencies: analysis and design effort as a share of total effort vs. total defects delivered.

Fig. 8. Variable dependencies: total defects delivered vs. speed of delivery.

4 Conclusions and future research


The three types of data analysis showed how three different views can lead to different understandings of the underlying behavior of the ISBSG data set. Below is a list of some of the significant conclusions of these analyses:

- Statistical analysis related conclusions:

o A strong relationship exists between speed of delivery and manpower delivery rate.
o Defect density is related to speed of delivery and manpower delivery rate.
o Adjusted function points are related to total effort and total defects delivered.
o Total effort is related to total defects delivered.
o Overall, the general linear model with multiple dependent and independent variables shows significant relationships among the model variables: quality of data, industry sector, and development type as independent variables; and defect density, effort of analysis and design as a share of total effort, total effort, manpower delivery rate, and resource level as dependent variables.
o Effort of analysis and design as a percent of total effort was negatively related to total defects delivered. This relationship supports the postulate that front-end activities positively affect subsequent back-end activities.

- Data mining clustering:

o Normalized work effort, defect density, total effort, and total defects delivered impacted the formation of the clusters, while speed of delivery and effort on analysis and design had no effect in forming the clusters.

- Visualization related conclusions:

o Year 2000 projects showed a steep upswing in speed of delivery, with an expected increase in speed of delivery over the next few years.
o Total defects delivered over time approximated a normal distribution, predicting an upsurge in coming years.
o Analysis and design effort as a share of total effort did not exhibit any particular behavior over time.
o Defect density showed an increase toward the later years, possibly indicating more complex system development.
o Total effort over the years stayed consistently uniform, except for outliers in the early years up to 2005.
o Analysis and design effort as a share of total effort vs. total defects delivered, and total defects delivered vs. speed of delivery, both followed a non-linear inverse function.

A macro view of these conclusions reveals some support for the findings of prior research. For example, the authors in [14] highlighted staff turnover, and in [10] the authors cited overloading of key personnel; this relates to the finding of this study that manpower delivery rate is affected by industry type. Furthermore, the authors in [10] developed a framework that lists four root causes of the ten problem areas, including time pressure and highly complex business rules, which relate to delivery speed and industry sector in this study's findings; the framework also lists four intervening variables, including anchored methodology and business and user involvement, which relate to development type and industry type as used in this study. This study concluded that a higher speed of delivery results in introducing more defects into software projects, while manpower delivery rate (manpower management) occupied three spots in the top two tiers affected by development type and industry sector. As stressed in this study, agile development principles link the following factors/variables: business and user involvement and highly complex business rules on one side, as in [10], and the importance of front-end analysis and design on the other, as concluded in this study. The link among these variables is expounded in agile development environments, where user involvement is strongly stressed throughout the development life cycle; this is particularly critical at the front-end phase.

Further research will use Release 13 for more detailed analysis with each of the three types of techniques, using advanced data mining and statistical techniques such as outlier detection and non-linear regression, respectively.

References
1. Abdaoui, N., Khalifa, I., Faiz, S.: Sending a personalized advertisement to loyal customers in the ubiquitous environment. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 40-47. IEEE, Hammamet, Tunisia (2016).

2. Bassil, Y.: A Simulation Model for the Waterfall Software Development Life Cycle. International Journal of Engineering & Technology (iJET) 2(5), 742-749 (2012).

3. Beck, K., et al.: Principles behind the Agile Manifesto. http://www.agilemanifesto.org (2001), last accessed 2017/4/31.

4. Bellini, C., Pereira, R., Becker, J.: Measurement in software engineering: From the roadmap to the crossroads. International Journal of Software Engineering & Knowledge Engineering 18(1), 37-64 (2008).

5. Ben Fredj, I., Ouni, K.: Fuzzy k-nearest neighbors applied to phoneme recognition. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 422-426. IEEE, Hammamet, Tunisia (2016).

6. Bermad, N., Kechadi, M.: Evidence analysis to basis of clustering: Approach based on mobile forensic investigation. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 300-307. IEEE, Hammamet, Tunisia (2016).

7. Cockburn, A., Highsmith, J.: Agile software development: The business of innovation. IEEE Computer 34(9), 120-127 (2001).

8. Cohn, M.: CHAOS report from the Standish Group, http://www.mountaingoatsoftware.com/blog/agile-succeeds-three-times-more-often-than-waterfall. Posted 2011, last accessed 2017/3/31.

9. Guerfala, M., Sifaoui, A., Abdelkrim, A.: Data classification using logarithmic spiral method based on RBF classifiers. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 416-421. IEEE, Hammamet, Tunisia (2016).

10. Hannay, J., Benestad, H.: Perceived productivity threats in large agile development projects. In: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (article no. 15). ACM, New York, NY, USA (2010).

11. Hernandez-Lopez, A., Colomo-Palacios, R., García-Crespo, Á.: Software Engineering Productivity: Concepts, Issues and Challenges. International Journal of Software Engineering & Knowledge Engineering 2(1), 37-47 (2011).

12. Koch, S.: Exploring the effects of SourceForge.net coordination and communication tools on the efficiency of open source projects using data envelopment analysis. Empirical Software Engineering 14(4), 397-417 (2009).

13. Lindvall, M., et al.: Agile software development in large organizations. IEEE Computer 37(12), 26-34 (2004).

14. Melo, C., Cruzes, D.S., Kon, F., Conradi, R.: Agile team perceptions of productivity factors. In: Agile Conference (AGILE), pp. 57-66. IEEE Computer Society Press, Los Alamitos, CA, USA (2011).

15. RapidMiner Studio Manual (2014), https://docs.rapidminer.com/downloads/RapidMiner-v6-user-manual.pdf.

16. Rodriguez, D., Sicilia, M., Garcia, E., Harrison, R.: Empirical findings on team size and productivity in software development. Journal of Systems and Software 85(3), 562-570 (2012).

16. Royce, W.: Managing the Development of Large Software Systems: Concepts and Techniques. In: Proceedings of the 9th International Conference on Software Engineering (ICSE '87), pp. 328-338. IEEE Computer Society Press, Los Alamitos, CA, USA (1987).

17. Sommerville, I.: Software Engineering. 10th edn. Addison-Wesley, Boston, USA (2015).

18. Trendowicz, A., Münch, J.: Factors Influencing Software Development Productivity - State-of-the-Art and Industrial Experiences. Advances in Computers 77, 185-241 (2009).

19. Wang, Y.: On the Cognitive Informatics Foundations of Software Engineering. In: Chan, C., Kinsner, W., Wang, Y., Miller, D. (eds.) Proceedings of the Third IEEE International Conference on Cognitive Informatics, pp. 22-31. IEEE Computer Society Press, Los Alamitos, CA, USA (2004).

Appendix 1. Pictorial representation of Table 2.

The diagram maps the source (independent) variables Industry Sector (1), Development Type (2), and Data Quality (3), together with their combinations (1 and 2; 1 and 3; 2 and 3; 1, 2 and 3), to the dependent variables Defect Density, Effort Analysis & Design, Total Efforts, Manpower Delivery Rate, and Resource Level, at three significance levels: Level 1 (signf. .82-1.0), Level 2 (signf. .40-.69), and Level 3 (signf. .13-.37).