Alain Abran
École de Technologie Supérieure
Abstract: This paper presents an exploratory study that applies three data analysis techniques, statistical analysis, data clustering, and visualization, to the ISBSG R12 data set. Both SPSS and RapidMiner are used to conduct the analysis. While the main advantage of statistical analysis is the summarization of data, the overall behavior of the data is lost, particularly the view of outlier values. The study applied two statistical techniques using SPSS: correlation analysis and the general linear model with multiple variables. The statistical analysis showed a highly significant level of relationship between and among the selected variables. In the data mining area, the clustering and visualization techniques used both SPSS and RapidMiner (RM). For the selected variables, the number of clusters was determined after several runs, in an attempt to split the one large cluster into several sub-clusters. Finally, the visualization technique demonstrates how it can show concentrations and trends. Statistical analysis found high correlations between speed of delivery and manpower delivery rate, and between the independent factors of industry type and development methodology and the dependent variable of defect density. The clustering process highlighted the importance of variables related to work effort and defects in forming the clusters. The major conclusion of the visualization charts is an inverse non-linear relationship between, on one side, the share of analysis and design effort in total effort and speed of delivery, and, on the other side, total defects delivered. Overall, multiple views of data analytics are needed to arrive at a clear and consistent understanding of the underlying behavior of the data in a complex data set such as ISBSG.
Key words: ISBSG release 12, data analytics, data mining, correlation, clustering,
visualization, SPSS, RapidMiner.
sequence of phases that must be completed in order to deliver a final software product.
This study considered waterfall and agile software development.
The waterfall model was introduced by Winston W. Royce in 1970 [16] [17]. It
has five consecutive phases that aim to develop large software for delivery to a
customer; these phases (in order) are: Analysis, Design, Implementation, Testing, and
Maintenance. Each phase has to be fully completed before moving to the next. The software manager assigns the available resources to each phase. To attain maximum productivity in every stage, [2] proposed a simulation model for the waterfall SDLC to assist project managers in determining the optimal amount of resources required in each phase within the allocated schedule and budget. A successful waterfall implementation requires that software requirements remain unchanged throughout the entire process. Unfortunately, this does not hold for modern products, because customers tend to change their needs frequently.
The use of agile development methods has improved productivity, delivery time, and customer satisfaction [13]. The 2011 CHAOS report from the Standish Group compared the success and failure rates of software projects developed with the waterfall model and with agile development. The report cites a 12% success rate for waterfall projects and a 42% success rate for agile projects, based on projects conducted from 2002 through 2010 [8].
The SDLC applied in software projects affects productivity. In agile team projects, productivity is affected by team composition and allocation, external dependencies, and staff turnover; these are the findings of a study that analyzed data from two projects
[14]. Moreover, [10] summarized productivity threats into ten problem areas by studying factors that compromise productivity in large agile development projects; they conducted repertory grid interviews with 13 project members of a project consisting of 11 Scrum teams. These problem areas can be classified into the following major factors: organizational culture and control, relationships with external parties, business requirements and functional dissemination, and manpower management. Overall, this research supports the findings of the previous paper in the areas of external factors and manpower management.
2 Research methodology
Following the introduction and review of the literature, the ISBSG R12 data set was examined for related variables that would constitute a coherent body of knowledge addressing a specific area of software project management. As a result of this analysis, the research selected the area of industry type, with a concentration on front-end analysis and design, together with the computed share of front-end analysis and design activities in total effort. These variables are then regressed and clustered in relation to time, effort in software projects, development methodology, adjusted function points, and delivery- and defect-related variables.
The research methodology started with close scrutiny of the data set, which led to retaining only projects with a data quality rating of A or B. Vertically, many variables were excluded, while others were retained for the analysis. Some form of theoretical foundation supported this selection, such as the effect of analysis and design effort on software implementation, or function point computation and its relationship to a selection of other variables, such as industry type and development methodology. Normally, a low organizational-level transaction processing system dictates different development practices than hardware-based systems, such as telecommunication and mathematically oriented systems. Horizontally, the data set was visually examined for weak or missing data on certain variables, which led to the deletion of about 1 percent of the projects. Two software packages were used in the analysis: SPSS and RapidMiner (RM). Statistical analysis included correlation analysis and the general linear model with multiple independent and dependent variables. The clustering technique used several K-Means runs in SPSS, varying the number of clusters from 2 to 10. The objective was to identify a few clusters within a group of clusters, in the hope of isolating outliers. The different cluster runs always revealed one large cluster and several smaller ones. Following the analysis of cluster composition for 2 to 10 clusters using SPSS, the run with 8 clusters produced three clusters with slight variations, while the other 5 had fewer than 6 member cases each. Only the three largest clusters were analyzed. As to the visualization charts, the charts were revised to eliminate outliers until an acceptable scatter diagram without many outliers was reached.
Also, the research derives a new variable representing total effort. In some cases, effort was not distributed into the different effort-related categories, but rather was reported as a single figure under the variable Unrecorded. Looking at the figures in this latter column, large figures appeared for such projects, while projects that distributed their effort into the different categories reported a zero figure there. Therefore, a new column for total effort was created by adding the distributed efforts and taking the unrecorded figure as the total effort for non-distributed cases.
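The derivation of the total-effort column can be expressed as a small rule: sum the per-phase efforts when a breakdown exists, otherwise fall back to the unrecorded figure. The sketch below assumes hypothetical field names ("plan", "build", "unrecorded", etc.); they stand in for the ISBSG effort-breakdown columns and are not the data set's actual names.

```python
def total_effort(record):
    """Derive a total-effort figure for one project record (a dict).

    Phase keys and 'unrecorded' are illustrative stand-ins for the ISBSG
    effort-breakdown columns, not the data set's actual field names.
    """
    phases = ["plan", "specify", "design", "build", "test", "implement"]
    distributed = sum(record.get(p, 0) or 0 for p in phases)
    # When no phase breakdown was reported, the whole effort sits in 'unrecorded'.
    return distributed if distributed > 0 else record.get("unrecorded", 0) or 0

projects = [
    {"plan": 100, "specify": 200, "design": 150, "build": 400, "test": 120,
     "implement": 30, "unrecorded": 0},   # effort fully distributed
    {"unrecorded": 950},                  # effort reported only as a lump sum
]
totals = [total_effort(p) for p in projects]   # [1000, 950]
```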
3 Data analysis
Data analysis of the ISBSG release 12 applied three types of techniques: traditional statistical analysis, data mining (clustering), and visualization.
In this section, two statistical analysis techniques are presented: correlation and the general linear model with multiple variables.
Correlations (Pearson) among the selected variables:

Variable                           | TDD | EAD%   | RL    | DD     | SoD    | MDR
Total Defects Delivered (TDD)      | 1   | -.160* | .084  | .477** | .036   | -.043
Effort A&D Percent of Total (EAD%) |     | 1      | -.025 | -.113  | -.106  | -.141*
Resource Level (RL)                |     |        | 1     | -.092  | .204** | .204**
Defect Density (DD)                |     |        |       | 1      | -.085  | -.104
Speed of Delivery (SoD)            |     |        |       |        | 1      | .818**
Manpower Delivery Rate (MDR)       |     |        |       |        |        | 1
[Second correlation table: Pearson correlations with 2-tailed significance among Adjusted Function Points, Total Efforts, Total Defects Delivered, and Defect Density; the coefficient values are not recoverable from this copy.]
Note for both tables: * significant at alpha = .01; ** significant at alpha = .05 (two-tailed test).
The above two tables reveal that the selected variables are, in general, correlated. The strongest correlation is between manpower delivery rate and speed of delivery (.818). A noteworthy observation is the negative relationship between the percentage of analysis and design effort in total effort on one side, and total defects delivered and manpower delivery rate on the other side.
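The Pearson correlations reported above were produced with SPSS; the coefficient itself is easy to compute directly. The sketch below is a pure-Python implementation, applied to made-up values chosen to echo the strong positive correlation between speed of delivery and manpower delivery rate; the numbers are illustrative, not ISBSG data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up project figures: speed of delivery vs. manpower delivery rate.
speed = [10.0, 12.0, 15.0, 21.0, 30.0, 33.0]
mdr = [1.1, 1.3, 1.6, 2.0, 3.1, 3.3]
r = pearson_r(speed, mdr)   # strongly positive for these made-up values
```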
General linear model with multiple variables: The study used the following sources of independent variables: data quality, industry sector, and development type on one hand, and the dependent variables of defect density, effort of analysis and design as a percentage of total effort, manpower delivery rate, and resource level on the other. Table 2 shows the general linear model using multiple dependent and independent variables, sorted from the highest significance to the least and divided into three tiers. The top tier lists defect density 4 times, with an overall count of 7, indicating a strong relationship between the dependent and independent variables. The top variable in the second tier is manpower delivery rate. The results of the F test validate the quality-of-data assessment. Relationships among the other independent and dependent variables show overall significance, except for industry sector alone, which shows no significant relationship with total efforts, resource level, and effort of analysis and design as a percentage of total effort; the strongest relationship is expressed between and among data quality and industry sector vs. defect density. Overall, this shows the importance of industry sector and development type in software development. Among the insignificant relationships, industry type leads the list with 6 out of 9 variables. Among the dependent variables, resource level leads the list with 4 insignificant relationships. Appendix 1 contains a pictorial representation of Table 2.
Table 2. Multi-variable relationships among source and dependent variables, ranked by significance level.
The following two figures demonstrate that the two-cluster solutions generated by SPSS (see Fig. 1) and RapidMiner (see Fig. 2) are similar.
Fig. 2. Visualization of the two-cluster generation using RapidMiner.
Obviously, the concentration of cluster 1 is greater than that of cluster 2, identifying the need to stratify that cluster. To conduct further cluster analysis looking for patterns in that cluster, several iterations were run for cluster counts ranging from 2 to 10, using the variables shown in the next table. The observation is that the two-cluster solution showed great skewness, with one cluster having 267 cases and the second only 8. As the number of clusters increased, the cases started to distribute across clusters, with a decrease of cases in the first cluster, as displayed in Table 3. Analyzing the behavior of cluster membership among the cases, the eight-cluster solution showed three clusters with an acceptable number of cases, and the remaining 5 clusters with fewer than 6 cases each. The largest three clusters, numbers 1, 5, and 4, were analyzed using cluster centers, as shown in the table below. The variables with increasing magnitude of the cluster-center values across the largest 3 clusters are: defect density, total defects delivered, normalized work effort, and total effort. The other two variables, speed of delivery and effort on analysis and design, showed inconsistent magnitudes of cluster centers. In the data visualization below, these latter two variables show a non-linear inverse relationship to total defects delivered, clearly indicating why they were excluded from forming the clusters.
Table 3. Final cluster centers for variable cluster sizes
This section of the paper presents two types of charts: time-dependent and variable-dependent.
Time-dependent visualization: The following variables are depicted against time: speed of delivery, total defects delivered, a computed variable representing the share of analysis and design effort in total effort, and defect density. The chart of time vs. speed of delivery using SPSS reaches its peak during the year 2000, when developers were coping with fixing systems to handle the migration from the two-digit to the four-digit year code (see Fig. 3, top). Future estimates may suggest, over a period of five years, a slight increase in speed of delivery followed by a decline. The second chart, depicting time against total defects delivered, shows an approximation of a normal distribution (see Fig. 3, bottom). Future projects may start another cycle of normal distribution and, in the long run, experience a gradual increase in total defects delivered.
Fig. 3. Time-dependent visualization of two variables (top: speed of delivery; bottom: total defects delivered).
The following two charts depict the year of the project vs. the share of analysis and design effort in total effort. Both the SPSS-generated chart (see Fig. 4, top) and the RM-generated chart (see Fig. 4, bottom) display generally the same distribution, which could be described as skewed to the left. Software developers have started to recognize the importance of the front-end analysis and design activity among the SDLC phases.
Fig. 4. Time-dependent visualization of the share of analysis and design effort in total effort (top: SPSS-generated; bottom: RapidMiner-generated).
The next chart demonstrates that defect density increases in the later years (see Fig. 5), possibly indicating the increased complexity of projects.
The RM-generated chart (see Fig. 6) shows that total effort stayed consistent throughout the years, with more outliers appearing in the earlier years, perhaps indicating more consistent effort as software developers became more proficient.
Fig. 6. Total efforts over time.
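The outlier trimming applied before charting (the methodology section describes removing extreme cases until an acceptable scatter remains) can be sketched with a standard interquartile-range rule. The 1.5 x IQR fence below is a common convention and an assumption here, since the paper does not state its exact criterion; the effort figures are made up.

```python
def iqr_fences(values, k=1.5):
    """Tukey fences: points outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    s = sorted(values)
    def quantile(q):
        # Simple linear-interpolation quantile on the sorted list.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trim_outliers(values, k=1.5):
    """Keep only the values inside the Tukey fences, preserving order."""
    lo, hi = iqr_fences(values, k)
    return [v for v in values if lo <= v <= hi]

# Illustrative effort figures with two extreme cases.
efforts = [120, 130, 125, 140, 135, 128, 9000, 132, 8000]
clean = trim_outliers(efforts)   # the two extreme values are removed
```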
Variable-dependent visualization: Using SPSS, the two charts below reveal a striking relationship between total defects delivered and the percentage of analysis and design effort in total effort (see Fig. 7), and between speed of delivery and total defects delivered (see Fig. 8). The common estimate is a non-linear inverse function: it starts with a steep decline, curves, then levels off. Both charts indicate the importance of spending more time on front-end analysis and design to reduce defects, as well as of taking time in delivering software project milestones.
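The "non-linear inverse function" shape estimated from these charts can be checked numerically: a model of the form y = a/x + b is linear in z = 1/x, so ordinary least squares on the transformed variable fits it. The points below are made-up values lying on such a curve, not ISBSG figures.

```python
def fit_inverse(xs, ys):
    """Least-squares fit of y = a/x + b via the substitution z = 1/x."""
    zs = [1.0 / x for x in xs]
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    # Slope and intercept of the ordinary least-squares line y = a*z + b.
    a = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
         / sum((z - mz) ** 2 for z in zs))
    b = my - a * mz
    return a, b

# Made-up points on y = 50/x + 5: steep decline, then leveling off.
xs = [1.0, 2.0, 5.0, 10.0, 25.0, 50.0]
ys = [50.0 / x + 5.0 for x in xs]
a, b = fit_inverse(xs, ys)   # recovers a close to 50 and b close to 5
```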
Fig. 7. Variable dependencies: effort of analysis and design in total effort vs. total defects delivered.
Fig. 8. Variable dependencies: total defects delivered vs. speed of delivery.
o Total effort over the years stays consistently uniform, except for outliers in the early years up to 2005.
o Effort of analysis and design in total effort vs. total defects delivered, and total defects delivered vs. speed of delivery, follow a non-linear inverse function.
Further research will use Release 13 for more detailed analysis of each of the three types of techniques, using advanced data mining and statistical techniques such as outlier detection and non-linear regression, respectively.
References
1. Abdaoui, N., Khalifa, I., Faiz, S.: Sending a personalized advertisement to loyal customers in the ubiquitous environment. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 40-47. IEEE, Hammamet, Tunisia (2016).
2. Bassil, Y.: A Simulation Model for the Waterfall Software Development Life Cycle,
International Journal of Engineering & Technology (iJET) 2(5), 742-749 (2012).
4. Bellini, C., Pereira, R., Becker, J.: Measurement in software engineering: From the roadmap
to the crossroads, International Journal of Software Engineering & Knowledge Engineering
18(1), 37-64 (2008).
5. Ben Fredj, I., Ouni, K.: Fuzzy k-nearest neighbors applied to phoneme recognition. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 422-426. IEEE, Hammamet, Tunisia (2016).
6. Bermad, N., Kechadi, M.: Evidence analysis to basis of clustering: Approach based on mobile forensic investigation. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 300-307. IEEE, Hammamet, Tunisia (2016).
7. Cockburn, A., Highsmith, J.: Agile software development: The business of innovation. IEEE Computer 34(9), 120-127 (2001).
9. Guerfala, M., Sifaoui, A., Abdelkrim, A.: Data classification using logarithmic spiral method based on RBF classifiers. In: Proceedings of the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), pp. 416-421. IEEE, Hammamet, Tunisia (2016).
10. Hannay, J., Benestad, H.: Perceived productivity threats in large agile development projects. In: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (article #15). ACM, New York, NY, USA (2010).
12. Koch, S.: Exploring the effects of SourceForge.net coordination and communication tools on the efficiency of open source projects using data envelopment analysis. Empirical Software Engineering 14(4), 397-417 (2009).
13. Lindvall, M. et al.: Agile software development in large organizations. IEEE Computer 37(12), 26-34 (2004).
14. Melo, C., Cruzes, D. S., Kon, F., Conradi, R.: Agile team perceptions of productivity factors. In: Agile Conference (AGILE), pp. 57-66. IEEE Computer Society Press, Los Alamitos, CA, USA (2011).
15. Rodriguez, D., Sicilia, M., Garcia, E., Harrison, R.: Empirical findings on team size and productivity in software development. Journal of Systems and Software 85(3), 562-570 (2012).
16. Royce, W.: Managing the Development of Large Software Systems: Concepts and Techniques. In: ICSE '87 Proceedings of the 9th International Conference on Software Engineering, pp. 328-338. IEEE Computer Society Press, Los Alamitos, CA, USA (1987).
17. Sommerville, I.: Software Engineering. 10th ed. Addison Wesley, Boston, USA (2015).
18. Trendowicz, A., Münch, J.: Factors Influencing Software Development Productivity - State-of-the-Art and Industrial Experiences. Advances in Computers 77, 185-241 (2009).
19. Wang, Y.: On the Cognitive Informatics Foundations of Software Engineering. In: Chan, C., Kinsner, W., Wang, Y., Miller, D. (eds.) Proceedings of the Third IEEE International Conference on Cognitive Informatics 2004, pp. 22-31. IEEE Computer Society Press, Los Alamitos, CA, USA (2004).
Appendix 1. Pictorial representation of Table 2
[Diagram: the sources IndustrySector (1) and Development Type (2), and their combinations (1 and 3; 2 and 3; 1, 2 and 3), connected to the dependent variables Effort Analysis & Design, Manpower Delivery Rate, and Resource Level.]