
Journal of Information and Computational Science ISSN: 1548-7741

Fuzzy C Means Method for Cross-Project Software Defect Prediction

S Aleem Basha 1, Dr Prasanna Kottapalle 2

1 Research Scholar, SSSUTMS, Sehore, Bhopal, M.P
2 Associate Professor, Department of CSE, AITS, Rajampet, A.P
1 aleemshaik02@gmail.com, 2 prasanna.k642@gmail.com

Abstract
Cross-project software defect prediction helps increase the chances of delivering a
bug-free product from the software industry. The ultimate goal of predicting cross-
project software defects is to reduce the cost and time spent in the software testing
phase and thereby improve the quality of the software. Many defect data sets are
publicly available online and are used as historical data. But the data sets are not the
same; the environments and the projects differ, and most of the projects have multiple
versions. One objective of this research is to show the performance of the selected data
sets using a machine learning approach. The best-performing data set among all
selected data sets is then identified based on its performance, which will help predict
future data.
This paper presents a Fuzzy C Means algorithm proposed to predict and classify
software modules as defective or non-defective. The Fuzzy C Means algorithm
efficiently classifies software modules and predicts the accuracy of software defect
detection. The algorithm also makes use of heuristic feature selection through a fitness
function. The empirical analysis showed that the proposed approach can be used
effectively with a high accuracy rate. In addition, an accuracy comparison is applied to
compare the proposed prediction model with state-of-the-art algorithms. The collected
results showed that the proposed algorithm achieved better performance with respect to
the Accuracy measure.

Keywords: Cross-project software defect prediction, Fuzzy C Means, Accuracy, Classification, F1-Measure

1. Introduction
The use of software is increasing continuously; as a result, cross-project software
defect prediction has become an important research topic in software engineering. A
defect is a bug or error in the software source code that may cause software failures.
Finding and correcting defects is expensive in both development and maintenance.
Nowadays, software grows enormously, and concerns also arise about its size and
complexity [1]. Before delivering software to customers, it is very important to predict
and correct defects, because ensuring software quality takes a long time. Defect
prediction is therefore important for large and complex
software projects. However, the research area of cross-project software defect
prediction is wide; there are many resources available on this subject. Many techniques
are used to predict cross-project software defects, such as statistical, machine learning,
parametric and mixed-model techniques.
Software defect prediction is an important task in the software life cycle. An early
prediction of software defects guides software changes across various configurations
and improves the use of available resources. A defect shows the unexpected behavior of
the system for certain requirements. During software testing, unexpected behavior is
identified and manifested as a defect. A software defect may be described as "a fault in
the software development process that would result in the software not meeting the
desired expectation" [2]. In addition, finding and correcting defects is one of the most
costly software development activities [3, 4, 5].
Therefore, a sensible classification of software defects facilitates the efficient
allocation of test resources and helps developers improve the architectural design
[6, 7, 8]. Classification and clustering are machine learning techniques that can be used
to identify defects in software data sets. Classification involves labelling software
modules as defective or non-defective, where each module is described by a set of
software complexity metrics, using a classification model learned from the data of
previous development projects [9]. Software complexity metrics can include code size
[10], McCabe's cyclomatic complexity [11] and Halstead complexity [12].
In defect prediction research, the data set makes a great contribution; variation in the
data elements can affect the result. In a prediction study, some data are used to train the
model and some are used for validation. Here the selection of training data is important;
defect prediction techniques work well when a sufficient amount of training data is
available. Generally, two types of methods are used to build a prediction model. One is
within-project prediction, where training data and test data come from the same project.
The other is cross-project prediction, where data from one project is used as training
data to build the model and data from another project is used to test the model.
Within-project methods require sufficient historical data to build the prediction model.
However, for a new project or a project with limited historical data, a good prediction
model cannot be built.
In the cross-project method, training and test data can be taken from the same
environment, in which case their characteristics are the same. But different environments
have different characteristics; such data are known as heterogeneous, and for this
heterogeneous problem plain cross-project methods are no longer applicable. In this
study the cross-project method is used. Among the collected data sets, some are
balanced, which means they have an almost equal number of defective and non-defective
classes. But most of the data sets are unbalanced, with either a majority of defective or a
majority of non-defective data. It is difficult to distribute the data correctly, because
training a model requires a sufficient number of defective and non-defective instances.
The ultimate goal of this study is to identify the difference in performance and find the
impact of the class ratio on software defect prediction.


Finally, the main contribution of this work is the design of a new Fuzzy C Means
algorithm that efficiently classifies modules and predicts the accuracy of cross-project
software defect detection. This algorithm also makes use of heuristic feature selection
through a fitness function. In Section 2, we present basic preliminaries and related
literature associated with the detection of software defects. Section 3 gives a step-by-step
description of the proposed approach. The description of the data sets, the evaluation
methodology and the experimental analysis are presented in Section 4, and the
conclusion and future work are given in Section 5.

2. Preliminaries and Related Research


The research literature on cross-project software defect prediction is rich. A large
number of defect prediction models have been proposed, in which machine learning has
been the dominant approach. From this body of work, cross-project software defect
prediction techniques can be grouped into a few categories. In this paper, approaches to
cross-project software defect prediction are described [17]. There are many software
defect prediction studies using machine learning techniques. For example, a linear
regression approach was proposed in [2] to predict defective modules. With the available
historical data of accumulated software defects, that study predicts future software
failures.
A new framework was proposed to compare software defect prediction on different
data sets using existing classification algorithms [13], and it was observed that the
selected classification methods show good predictive accuracy and support metric-based
classification. For comparison, the area under the receiver operating characteristic curve
(AUC) [14, 15] is used, and its use is suggested particularly for comparative studies in
software defect detection. In particular, previous findings on the efficacy of Random
Forest (RndFor) for defect prediction [16] have been confirmed. The experimental
results showed that there is no significant difference in the performance of different
classification algorithms. That study covered only the classification model for
cross-project software defect prediction.
The failure prediction capability of different machine learning methods is studied in
[3], [4] to analyze their applicability. In [3], the most significant research on each
machine learning technique and the trends in machine learning based software defect
prediction were summarized. This study can be used as a benchmark when preparing
cross-project software defect predictions in future work. A systematic review of machine
learning techniques for cross-project software defect prediction is presented in [5]. It
gives an exhaustive analysis of machine learning algorithms and statistical techniques,
their use in software defect prediction and their performance, with evaluations compared
across different types of machine learning algorithms and their strengths and weaknesses
summarized. The work in [6] provided a benchmark to allow a common and useful
comparison of different approaches to defect prediction. A defect prediction system
model (SBPS) for object-oriented software was developed in [7].


2.1. Cross-Project Defect Prediction


Cross-project defect prediction attempts to build the prediction model on one or more
projects and uses this model to predict defects in other projects, but the attributes of the
projects must be identical. A new project often does not have previous data from which
to create a prediction model. Using cross-project defect prediction, a predictive model is
built from the data of other projects, which solves this problem; the model is then
applied to the new project. The authors of [18] and [19] have worked on just-in-time
cross-project defect prediction. They identified that building training and test models
using similar attributes often works well, and they also mentioned that ensemble
learning techniques work well. The authors of [20] and [21] collected data from 10
projects, containing 5,305 labeled instances. They compared composite algorithms with
CODEP Logistic, using cost effectiveness and the F-measure as evaluation metrics, and
demonstrated that CODEP with Logistic regression achieved the best performance in
terms of F-measure.
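As a small illustration of this cross-project setting, the following sketch trains a model on the metrics of one project and tests it on another. The file names, the "defective" label column and the Logistic Regression classifier are illustrative assumptions, not the exact pipelines used in the studies cited above.

```python
# Minimal sketch of the cross-project defect prediction setting (assumptions:
# hypothetical CSV files, a "defective" label column, and Logistic Regression
# as a stand-in classifier).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

train_df = pd.read_csv("ant.csv")    # source project: provides the training data
test_df = pd.read_csv("camel.csv")   # target project: the model predicts its defects

# The cross-project setting requires both projects to share the same attributes.
features = [c for c in train_df.columns if c != "defective"]

model = LogisticRegression(max_iter=1000)
model.fit(train_df[features], train_df["defective"])

predictions = model.predict(test_df[features])
print("Accuracy:", accuracy_score(test_df["defective"], predictions))
print("F1-score:", f1_score(test_df["defective"], predictions))
```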
3. Fuzzy C-Means Algorithm
This study aims to examine and review the accuracy of cross-project software defect
prediction using the computational intelligence algorithm known as the Fuzzy C-Means
algorithm. The present investigation shows the performance and capacity of the
proposed algorithm in detecting software defects and presents an empirical comparison
with existing state-of-the-art algorithms from the literature.
3.1. Description of the data set
This research has been carried out with publicly available project data; the total
number of classes is 2663. All the data sets are widely used in cross-project software
defect prediction research and are available in the OSS repository. These class-level
data sets come from projects written in Java. Multiple versions are available for the
selected data sets, but for this study the latest version of each project has been taken.
Each data set has a different class ratio: one has less than 10% non-defective classes
while another has more than 90% non-defective classes. This variation can reveal
differences in performance across the data sets and helps identify a clear distinction in
performance. Table 1 presents the statistical description of the data sets selected for
this study.
Table 1: Descriptive statistics of project datasets taken for this study

Dataset Name   No. of Classes   Non-defective   Rate   Defective   Rate
Ant            745              579             78%    166         22%
Camel          965              777             81%    188         19%
jEdit          492              481             98%     11          2%
Log4j          205               16              8%    189         92%
Synapse        256              170             66%     86         34%
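The class ratios in Table 1 can be derived directly from the labelled data. A small sketch of that computation is given below; the per-project file names and the binary "defective" label column are assumptions for illustration.

```python
# Sketch for computing the per-project class ratios reported in Table 1
# (project file names and the binary "defective" label column are assumptions).
import pandas as pd

for name in ["ant", "camel", "jedit", "log4j", "synapse"]:
    df = pd.read_csv(f"{name}.csv")
    defective = int(df["defective"].sum())   # count of defective classes
    total = len(df)
    print(f"{name}: {total} classes, "
          f"{100 * (total - defective) / total:.0f}% non-defective, "
          f"{100 * defective / total:.0f}% defective")
```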


3.2. Feature Selection through a Heuristic Fitness Function


Feature selection is an important task in cross-project software defect prediction.
Due to the diverse nature of the data sets, which contain different types of attributes,
attribute selection becomes more complex. In this paper, a heuristic attribute selection
based on a fitness function is used to assess the quality of each feature [9]. Each feature
in the data set is considered as a sequence of data elements, and the fitness function is
applied to calculate the quality of each feature:
$$\text{Fitness} = 1 - \left[ \frac{\sum_{i=1}^{n} \left( \frac{SUM_i}{C} \right)^{2}}{N} \right]$$

where N is the number of features in the data set, SUM_i is the sum of all random
sequences in container i, and C is the capacity of the container. Features whose fitness
value is less than the threshold limit are removed from the data set.
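As one possible reading of this heuristic, the sketch below bins each feature's values into containers, computes the fitness from the container sums, and drops features below a threshold. The binning scheme, the choice of capacity C and the threshold value are assumptions made for illustration only.

```python
# Hedged sketch of fitness-based feature selection. Interpretation assumptions:
# each feature's sorted values are split into "containers" (bins), SUM_i is the
# sum of values in container i, C is the container capacity (taken here as the
# largest container sum), and N is the number of features in the data set.
import numpy as np

def feature_fitness(values, n_bins=10, n_features=1):
    """Fitness = 1 - [ sum_i (SUM_i / C)^2 / N ] for a single feature column."""
    containers = np.array_split(np.sort(values), n_bins)
    sums = np.array([c.sum() for c in containers])            # SUM_i
    capacity = max(sums.max(), 1e-9)                          # C (assumed choice)
    return 1.0 - np.sum((sums / capacity) ** 2) / n_features  # N = no. of features

def select_features(X, threshold=0.2):
    """Keep the columns whose fitness is at least the threshold (assumed value)."""
    n = X.shape[1]
    scores = np.array([feature_fitness(X[:, j], n_features=n) for j in range(n)])
    return np.where(scores >= threshold)[0], scores

# Example usage on a random metrics matrix of 100 modules x 20 features.
X = np.random.default_rng(0).random((100, 20))
kept, scores = select_features(X)
print("Kept feature indices:", kept)
```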
3.3. Fuzzy C-Means Algorithm
FCM is the most prevalent and well-established method in the application of cluster
analysis. In fuzzy clustering, data elements can be assigned to more than one cluster,
with a collection of membership degrees connected to each element. The membership
degree represents the strength of the association between a data element and a specific
cluster. Fuzzy clustering is a good choice for real-world situations where there are no
sharp boundaries between clusters. The requirement of a crisp partition of the finite data
set is replaced by the weaker requirement of a fuzzy partition. However, obtaining the
pseudo fuzzy partition of the same collection is the difficult part of fuzzy clustering.
With this approach, the value of the objective function is calculated and, consequently,
a fuzzy allocation of elements to clusters is estimated.
Pseudo code of the proposed algorithm:
 Initially, the user specifies the constant k for the clustering phase; an unlabeled data
point is grouped by calculating the distance between the query point and all the
points in the data set.
 Select the initial fuzzy partition as P(0) and set the iteration counter t = 0.
 Calculate the distances between the cluster centres and the points in the feature
vectors.
 Estimate the membership function.
 If a neighbour is in the positive class, it is counted as a true positive; otherwise, it is
counted as a true negative. Accuracy is calculated based on the true positive and
true negative counts.
This pseudo code selects the fuzzy parameter m greater than one, as required for any
problem. The choice of the partition becomes difficult with increasing m, and there is
no analytical rule for selecting this value.
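The pseudo code above omits the concrete update rules. The following is a minimal numpy sketch of the standard Fuzzy C-Means iteration (centre update, distance computation and membership update), given here as an assumed reference implementation rather than the authors' exact code.

```python
# Minimal Fuzzy C-Means sketch (standard FCM update rules; not the paper's
# exact implementation). X is an (n_modules, n_metrics) matrix.
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initial fuzzy partition P(0), iteration t = 0
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Cluster centres weighted by memberships raised to the fuzzifier m
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances between every data point and every cluster centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        power = 2.0 / (m - 1.0)
        U_new = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** power, axis=2)
        if np.linalg.norm(U_new - U) < tol:   # stop when the partition stabilises
            U = U_new
            break
        U = U_new
    return centres, U

# Example: assign each module to the cluster with the strongest membership,
# e.g. defective vs. non-defective when n_clusters = 2.
X = np.random.default_rng(1).random((50, 5))
centres, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)
```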


4. Experimental Analysis
In this section, the performance of the proposed approach is evaluated based on the
commonly used accuracy measure, and the elapsed time of each run is also recorded.
The proposed FCM clustering technique marks the data with class tags in seven different
classes: Blocker, Critical, Improvement, Major, Minor, Normal and Trivial.
To evaluate the performance of the proposed algorithm we have used the Naïve
Bayes, Random Forest, Radial Basis Function, K-Nearest Neighbor, Support Vector
Machine, K-Means clustering and Fuzzy C-Means clustering machine learning
algorithms, using various performance measures, namely Accuracy and F1-Measure
[16], based on the confusion matrices generated.
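A hedged sketch of how such a comparison could be run with off-the-shelf classifiers is shown below. The metrics file, the label column, the 10-fold cross-validation protocol and the use of an RBF-kernel SVM as a stand-in for a radial basis function model are all assumptions for illustration, not the exact experimental setup.

```python
# Illustrative comparison loop over baseline classifiers (assumed data file,
# label column, and cross-validation protocol; the RBF-kernel SVM stands in
# for a radial basis function model).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

data = pd.read_csv("jedit.csv")                       # hypothetical metrics file
X, y = data.drop(columns="defective"), data["defective"]

baselines = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}
for name, clf in baselines.items():
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    f1 = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted").mean()
    print(f"{name}: accuracy={acc:.3f}, F1={f1:.3f}")
```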
4.1. Evaluation Measures
How well a classifier performs is calculated using performance metrics. This section
describes the performance measures used in this investigation. Performance metrics in
cross-project software defect prediction are derived from a confusion matrix, which
shows the actual and predicted values. The SDP classification has two outputs, positive
and negative. The F1-score and accuracy have been used here; the next two subsections
describe them.
F1-score
It is the harmonic mean of precision and recall. The mathematical form of the
F1-score is:

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
Accuracy
Accuracy is the proportion of correct responses in a classification. In this study,
there are two correct answers in the confusion matrix: TP (where the defective class
has been identified as defective) and TN (where the non-defective class has been
identified as non-defective). The mathematical form of accuracy is:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
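For concreteness, the two measures can be computed directly from the confusion-matrix counts as in the short sketch below; the TP, FP, TN and FN values are made-up numbers for illustration.

```python
# Accuracy and F1-score computed from confusion-matrix counts
# (the example counts are illustrative, not taken from the experiments).
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(tp=150, fp=10, tn=40, fn=5))   # ~0.927
print(f1_score(tp=150, fp=10, fn=5))          # ~0.952
```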
The accuracy of the proposed algorithm is tested on several data sets against known
classifiers, as shown in Table 2. The proposed algorithm obtained the highest accuracy;
its average accuracy rate across all data sets is around 95 percent. However, a low value
arose for the KNN algorithm. We believe that this is because the data set is small and
the algorithm requires a larger data set to achieve a higher accuracy value.


Table 2: Accuracy measure for the different algorithms over different data sets
Dataset Name   Bayes   Random forest   RBF     KNN     SVM     K-Means   Proposed Approach
Ant            84.45   91.56           90.33   65.92   91.97   90.02     95.11
Camel          85.25   86.39           86.38   75.13   86.01   83.65     97.20
jEdit          85.90   89.90           89.70   84.24   90.52   86.58     98.45
Log4j          84.78   82.56           84.68   79.03   82.30   80.99     93.22
Synapse        86.17   89.65           90.87   60.59   90.80   87.91     90.11

Figure 1: Accuracy Measure for the different algorithms over data sets.

Figure 1 shows the accuracy of the proposed algorithm on the different data sets
together with the existing classifiers. The figure shows that the proposed approach
achieves higher accuracy than the other classifiers, with an overall accuracy across all
data sets of around 95 percent on average. However, the lowest values appear for the
KNN algorithm. We assume that this is because the data set is small and the proposed
algorithm needs a larger data set to achieve more accurate performance.
Table 3: F1-score for the different algorithms over different data sets
Dataset Name   Bayes   Random forest   RBF    KNN   SVM   K-Means   Proposed Approach
Ant            98.4    100             94.9   79    96    94        100
Camel          96      92              92     84    93    90        99
jEdit          91      94              95     91    95    93        100
Log4j          90      89              90     86    90    88        92
Synapse        91      94              95     72    95    93        96


Figure 2: F1-score for the different algorithms over different data sets.

5. Conclusion
This paper proposed an adaptive computational intelligence approach known as the
Fuzzy C-Means clustering algorithm and evaluated the cross-project software defect
prediction model and its performance against the Naïve Bayes, Random Forest, Radial
Basis Function, K-Nearest Neighbor, Support Vector Machine, K-Means clustering and
Fuzzy C-Means clustering machine learning algorithms. The experimental results are
compiled based on Accuracy and F1-Measure. The results reveal that the proposed
technique is an effective approach for predicting future defects in cross-project software
defect prediction. As future work, we can involve other machine learning techniques
and provide an extensive comparison between them. In addition, adding more software
metrics to the learning process is a possible way to increase the accuracy of the
prediction model.

References
[1] Y. Tohma, K. Tokunaga, S. Nagase & Y. Murata (1989) “Structural approach to the
estimation of the number of residual software faults based on the hyper-
geometric distribution model”, IEEE Transactions on Software Engineering,
pp. 345-355.
[2] S. Kumaresh & R. Baskaran (2010) “Defect analysis and prevention for software
process quality improvement”, International Journal of Computer Applications,
Vol. 8, Issue 7, pp. 42-47.
[3] K. Ahmad & N. Varshney (2012) “On minimizing software defects during new
product development using enhanced preventive approach”, International
Journal of Soft Computing and Engineering, Vol. 2, Issue 5, pp. 9-12.
[4] C. Andersson (2007) “A replicated empirical study of a selection method for
software reliability growth models”, Empirical Software Engineering, Vol.12,
Issue 2, pp. 161-182.


[5] N. E. Fenton & N. Ohlsson (2000) “Quantitative analysis of faults and failures in a
complex software system”, IEEE Transactions on Software Engineering, Vol. 26,
Issue 8, pp. 797-814.
[6] T. M. Khoshgoftaar & N. Seliya (2004) “Comparative assessment of software quality
classification techniques: An empirical case study”, Empirical Software
Engineering, Vol. 9, Issue 3, pp. 229-257.
[7] T. M. Khoshgoftaar, N. Seliya & N. Sundaresh (2006) “An empirical study of
predicting software faults with case-based reasoning”, Software Quality Journal,
Vol. 14, No. 2, pp. 85-111.
[8] T. Menzies, J. Greenwald & A. Frank (2007) “Data mining static code attributes to
learn defect predictors”, IEEE Transaction Software Engineering., Vol. 33, Issue
1, pp. 2-13.
[9] K. Prasanna, M. Seetha & A. P. Siva Kumar (2014) “CApriori: Conviction based
Apriori algorithm for discovering frequent determinant patterns from high
dimensional datasets”, ICSEMR 2014.
[10] D. Shiwei (2009) “Defect prevention and detection of DSP-Software”, World
Academy of Science, Engineering and Technology, Vol. 3, Issue 10, pp. 406-409.
[11] P. Trivedi & S. Pachori (2010) “Modelling and analyzing of software defect
prevention using ODC”, International Journal of Advanced Computer Science
and Applications, Vol. 1, No. 3, pp. 75- 77.
[12] A. Hammouri, M. Hammad, M. Alnabhan & F. Alsarayrah (2018) “Software Bug
Prediction using Machine Learning Approach”, International Journal of Advanced
Computer Science and Applications (IJACSA), Vol. 9, No. 2, pp. 78-83.
[13] S. Lessmann, B. Baesens, C. Mues & S. Pietsch (2008) “Benchmarking
classification models for software defect prediction: A proposed framework and
novel findings”, IEEE Transactions on Software Engineering, Vol. 34, Issue 4,
pp. 485-496.
[14] K. El-Emam, S. Benlarbi, N. Goel, & S.N. Rai (2001) “Comparing Case- Based
Reasoning Classifiers for Predicting High-Risk Software Components”, Journal
of Systems and Software, Vol. 55, No. 3, pp. 301-320.
[15] L.F. Capretz & P.A. Lee, (1992) “Reusability and life cycle issues within an
object-oriented design methodology”, in book: Technology of Object-Oriented
Languages and Systems, pp. 139-150, Prentice-Hall.
[16] D. L. Olson & D. Delen (2008) “Advanced Data Mining Techniques”, Springer,
1st edition, p. 138, ISBN 3-540-76016-1.
[17] S. Adiu & N. Geethanjali (2013) “Classification of defects in software using
decision tree algorithm”, International Journal of Engineering Science and
Technology (IJEST), Vol. 5, Issue 6, pp. 1332-1340.
[18] C. Ni, W. Liu, Q. Gu, X. Chen & D. Chen (2017) “FeSCH: A feature selection
method using clusters of hybrid-data for cross-project defect prediction”, In IEEE
41st Annual Computer Software and Applications Conference (COMPSAC),
pp. 51-56.
[19] T. Fukushima, Y. Kamei, S. McIntosh, K. Yamashita & N. Ubayashi (2014) “An
empirical study of just-in-time defect prediction using cross-project models”, In
Proceedings of the 11th Working Conference on Mining Software Repositories,
pp. 172-181.
[20] Z. He, F. Shu, Y. Yang, M. Li & Q. Wang (2012) “An investigation on the
feasibility of cross-project defect prediction”, Automated Software Engineering,
Vol. 19, Issue 2, pp. 167-199.
[21] Y. Zhang, D. Lo, X. Xia & J. Sun (2015) “An empirical study of classifier
combination for cross-project defect prediction”, In Proceedings of the World
Congress on Engineering and Computer Science, pp. 264-269.
