performance. To address this problem, the authors introduce an under-sampling technique that is robust against sample variations and works well under dynamic changes in the class distribution.

A Roughly Balanced Bagging (RBBAG) algorithm was proposed by Seliya et al. [11] as a different type of solution based on ensemble learning. The experimental results, measured by the Geometric Mean (GM), indicated that the RBBAG method outperforms individual classifiers such as the C4.5 decision tree and the naive Bayes classifier in both performance and classification accuracy. In addition, RBBAG is able to handle class imbalance when it occurs in the test data.

Sun et al. [12] addressed the problem of data skew by using multiclass classification methods with different types of coding schemas (one-against-one, random correcting code, and one-against-all). They integrated several ensemble learning methods (Boosting, Bagging, and Random Forest) with these coding schemas. The experimental results show that the one-against-one coding schema achieves the best results.

A comparative study of ensemble learning methods for software defect prediction was presented by Wang et al. [13]. The studied models included Boosting, Bagging, Random Forest, Random Tree, Stacking, and Voting, which the authors compared against a single classifier such as Naive Bayes. The comparative analysis reported that the ensemble models outperformed the single classifier on several public datasets.

Arvinder et al. [14] proposed using ensemble learning methods for predicting defects in open source software. The authors applied three homogeneous ensemble methods (Bagging, Rotation Forest, and Boosting) to fifteen base learners to build the software defect prediction model. The results show that a Naive Bayes classifier is not recommended as a base classifier for ensemble techniques, because it does not achieve any performance gain over a single classifier.

Chug and Singh [15] examined five machine learning algorithms used for early prediction of software defects: Particle Swarm Optimization (PSO), Artificial Neural Network (ANN), Naïve Bayes (NB), Decision Tree (DT), and Linear Classifier (LC). The results of the study show that the linear classifier achieves better prediction accuracy than the other algorithms, while the ANN and DT algorithms have the lowest error rates. The study used popular metrics from the NASA datasets, such as inheritance, cohesion, and Line of Code (LOC) metrics.

Kevin et al. [16] introduced oversampling techniques as a preprocessing step for a set of individual base learners used to build an ensemble model. Three oversampling techniques were employed to overcome sampling bias, and ensemble learning was then used to increase classification accuracy by taking advantage of many classifiers. The experimental results show that ensemble learning with a resampling technique improved the accuracy and reduced the false negative rate compared to a single learner.

Ahmed et al. [17] presented machine learning approaches such as Neural Networks, Fuzzy Logic, and Linear and Logistic Regression to predict the failure of a software project. The authors used multiple linear regression analysis to determine the critical failure factors, and then employed fuzzy logic to predict the failure of a software project.

3. BACKGROUND
Ensemble learning refers to meta-learning techniques that integrate multiple classifiers computed separately over different databases and then build a classification model based on a weighted-vote technique to improve the prediction of software defects [18]. One advantage of these techniques is that they enhance the accuracy of the defect prediction model compared to a single classifier: the individual decisions of a set of learning classifiers are combined to improve the overall system. Moreover, in ensemble learning, different types of classifiers can be combined into one predictive model to improve prediction accuracy, decrease bias (Boosting), and decrease variance (Bagging). Ensemble techniques fall into two types: parallel and sequential. A parallel ensemble technique exploits the independence among the base learners; for example, the Random Forest (RF) classifier generates its base learners in parallel to reduce the average error dramatically. A sequential ensemble technique instead exploits the dependence among the base learners; for example, the AdaBoost algorithm boosts overall performance by assigning higher weights to mislabeled training examples.
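For illustration, the weighted-vote combination described above can be sketched in a few lines. The paper's experiments use the WEKA toolkit [36]; the following is a minimal scikit-learn equivalent on synthetic data, where the dataset shape, the class-imbalance ratio, and the vote weights are all illustrative choices, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a defect dataset: ~10% "defective" modules.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Heterogeneous base classifiers combined by a weighted vote
# over their predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('nb', GaussianNB())],
    voting='soft', weights=[1, 2, 1])
ensemble.fit(X_tr, y_tr)
print('ensemble accuracy:', ensemble.score(X_te, y_te))
```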
In our comparative study, several classification models have been used: an Artificial Neural Network (ANN) and a Support Vector Machine (SVM) as discriminating linear classifiers, Random Forest (RF) and J48 as decision tree classifiers, Naïve Bayes as a probabilistic classifier, and the PART algorithm as a classification-rules algorithm. The results of these methods are compared with ensemble techniques such as Bagging, Boosting, and Rotation Forest to examine the effectiveness of ensemble methods on the accuracy of the software defect prediction model.

3.1 Single Machine Learning Classifiers
3.1.1 Artificial Neural Network (ANN)
An ANN is a computational model of biological neurons; its basic unit is called a neuron [19]. The network consists of several nodes, each receiving inputs from an external source or from other nodes, and each input has an associated weight. The network transforms the weighted sum of its inputs into the output. In this research, the researchers utilized a Multi-Layer Perceptron (MLP), a feed-forward neural network that uses the back-propagation algorithm as a supervised learning technique.
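A minimal sketch of such a feed-forward network follows, again using scikit-learn as a stand-in for the paper's WEKA setup; the hidden-layer size, iteration count, and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One-hidden-layer perceptron trained by back-propagation;
# inputs are standardized so weight updates behave well.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,),
                                  max_iter=500, random_state=0))
mlp.fit(X_tr, y_tr)
print('MLP accuracy:', mlp.score(X_te, y_te))
```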
3.1.2 Support Vector Machine (SVM)
SVM is considered one of the newer trends in machine learning; it can deal with nonlinear data by using a kernel function [20]. SVM achieves high classification accuracy because of its ability to map high-dimensional, nonlinearly separable input data into a linearly separable space. The main concept of SVM is to maximize the margin distance between the different classes while minimizing the training error. The hyperplane is determined by selecting the samples closest to the margin. SVM solves classification problems by building a global function after completing the training phase over all samples. One disadvantage of such global methods is the high computational cost they require. Furthermore, a global method sometimes cannot achieve a sufficient approximation, because no parameter values can be provided in the global solution methods.
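The kernel mapping and maximum-margin idea can be illustrated with a small sketch; the ring-shaped toy data and the RBF kernel parameters here are illustrative assumptions, not the paper's configuration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nonlinearly separable data; the RBF kernel maps it into a space
# where a maximum-margin hyperplane can separate the two classes.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_tr, y_tr)
# The samples closest to the margin (support vectors) define the hyperplane.
print('support vectors per class:', svm.n_support_)
print('SVM accuracy:', svm.score(X_te, y_te))
```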
3.1.3 Locally Weighted Learning (LWL)
The basic idea of LWL [21] is to build a local model based on neighboring data instead of building a global model. Because of the influence of data points on the prediction model, each data point in the neighborhood of the current query point receives a higher weight factor than points that are very distant. Advantages of the LWL algorithm are its ability to build approximation functions and the ease with which new training points can be added incrementally.
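A minimal numpy sketch of the locally weighted idea follows, assuming a Gaussian weighting kernel, binary 0/1 labels, and a weighted vote in place of a full local base learner (WEKA's LWL wraps an arbitrary base classifier); the function name lwl_predict and the bandwidth tau are illustrative.

```python
import numpy as np

def lwl_predict(X_train, y_train, x_query, tau=1.0):
    """Locally weighted classification: points near the query get
    exponentially larger weights than distant points."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))        # Gaussian weighting kernel
    score = np.sum(w * y_train) / np.sum(w)   # weighted vote over labels
    return int(score >= 0.5)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(lwl_predict(X, y, np.array([1.0, 1.0])))  # neighbors of (1,1) vote 1
```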
3.1.4 Naïve Bayes (NB)
The Naïve Bayes classifier [22] is a simple classifier based on the Bayes rule of conditional probability. It assumes that attribute values are independent and unrelated, which is called the independent feature model. In many applications, Naïve Bayes uses the maximum likelihood method [23] to estimate its parameters.
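For illustration, the sketch below fits a Gaussian Naïve Bayes model, whose per-class feature means and variances are maximum-likelihood estimates; the synthetic data is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Each feature is treated as conditionally independent given the class;
# per-class means/variances are fit by maximum likelihood.
nb = GaussianNB()
nb.fit(X, y)
print('class priors:', nb.class_prior_)
print('posterior for one sample:', nb.predict_proba(X[:1]))
```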
3.1.5 Decision Tree: Random Forest (RF)
The RF algorithm [24] constructs small decision trees with a few features based on a random choice of the attributes. First, a simple decision tree algorithm is used to build an individual tree with a few features. Then, many such small, weak decision trees are built in parallel. Finally, majority voting or averaging is applied to combine the trees into a single, strong learner.
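The sketch below shows this construction: many shallow trees, each split drawn from a random subset of features, combined by majority vote. The tree depth, forest size, and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Many small trees; each split considers only sqrt(n_features)
# randomly chosen attributes, and the forest votes on the label.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                            max_depth=5, n_jobs=-1, random_state=0)
rf.fit(X, y)
print('forest accuracy (train):', rf.score(X, y))
```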
3.1.7 Logistic Regression
Logistic Regression performs classification based on the transformation of the target variable to take binary values in an interval. It finds the weights that fit the training examples well, and then it transforms the target using a linear function of the predictor variables to approximate the target response variable.

3.1.8 PART Algorithm
The PART algorithm [27] is a combination of the RIPPER and C4.5 algorithms; it uses rule induction to build a partial tree from the full set of training examples. The partial tree contains unexpected branches, and subtree replacement is used during tree building as a pruning strategy. Based on the value of minimum entropy, the PART algorithm expands nodes until it finds the node that corresponds to the provided value, or returns null if it finds nothing; then the pruning process starts. A subtree is replaced by one of its leaf children when the replacement is better. The PART algorithm follows the separate-and-conquer strategy based on a recursive algorithm.
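PART itself ships with WEKA [36] rather than scikit-learn. As a rough stand-in for the flavor of its output, the sketch below prunes a small decision tree and prints it as IF-THEN paths, which is how a partial tree reads as rules; the tree parameters and data are illustrative assumptions, not PART's actual procedure.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# A small pruned tree; each root-to-leaf path reads as one IF-THEN rule.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                              random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f'f{i}' for i in range(5)]))
```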
3.2 Ensemble Machine Learning Classifiers
3.2.1 Bagging Technique
The Bagging technique is one of the ensemble learning techniques [28]; it is also called Bootstrap Aggregating. As shown in Fig 1, it depends on training subsets of different sizes, called bags, collected from the training dataset. The Bagging method is used to construct each member of the ensemble: a prediction model is built for each bag, and the multiple outputs are combined by taking either a vote or an average over the class labels. First, the Bagging algorithm selects random samples with replacement from the original training dataset, and the learning algorithm is run on each bag to generate multiple outputs. Finally, the Bagging algorithm applies the predictors to the samples, combines their results by voting, and predicts the final class label for the software defect.
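A minimal sketch of this bootstrap-and-vote procedure follows; the choice of decision trees as base learners, the ensemble size, and the data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each "bag" is a bootstrap sample drawn with replacement from the
# training data; one tree is fit per bag and the trees vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print('Bagging accuracy:', bag.score(X_te, y_te))
```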
In each round, Boosting assigns higher weights to incorrectly predicted instances than to correctly predicted ones. Finally, the classifiers are combined using the voting technique to predict the final result of the defect prediction model, as shown in Fig 2.
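The sketch below shows this sequential re-weighting scheme with AdaBoost over decision stumps (the library default); the number of rounds and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sequential ensemble: each round re-weights the training set so that
# previously misclassified instances get more attention, then the
# weighted learners vote on the final label.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_tr, y_tr)
print('AdaBoost accuracy:', boost.score(X_te, y_te))
```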
The over-resampling technique increases the training time of the model, but it does not lose information, whereas the under-resampling technique decreases the training time but loses information.
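This trade-off can be made concrete with a short sketch; the class ratio and sample counts are illustrative assumptions.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.r_[np.zeros(900, int), np.ones(100, int)]  # 10% minority class
X_min, X_maj = X[y == 1], X[y == 0]

# Over-sampling: replicate minority examples. No information is lost,
# but the training set (and training time) grows.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                    random_state=0)

# Under-sampling: discard majority examples. Training is faster,
# but the information in the dropped instances is lost.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min),
                      random_state=0)

print(len(X_min_up), len(X_maj_down))  # 900 100
```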
5. EXPERIMENTAL RESULTS AND DISCUSSION
5.1 Dataset Description and Research Hypothesis
Table 2: Studied Metrics within NASA MDP datasets [35]
Datasets          PART   J48    Random Forest  Logistic  MLP    Naïve Bayes  SVM    LWL
MC2               66.51  66.47  67.65          70.81     67.1   72.68        73.35  73.93
MC2 + Resample    88.35  86.47  86.47          88.31     87.06  89.01        79.04  84.6
MW1               91.07  91.3   91.82          91.82     91.81  92.32        83.38  91.57
MW1 + Resample    94.54  93.55  93.29          94.05     95.29  94.29        83.12  92.56
KC3               91.48  90.6   91.05          89.73     90.61  90.17        85.13  89.74
KC3 + Resample    95.41  94.76  93.88          93        96.49  94.96        86.69  91.7
CM1               89.32  89.91  87.92          89.11     90.7   89.91        84.36  90.1
CM1 + Resample    95.06  94.07  91.48          93.47     96.04  95.25        79.98  89.9
KC1               85.81  86.05  85.81          85.95     84.91  86.47        82.39  84.48
KC1 + Resample    91.27  91.55  85.43          87.04     91.5   93.02        81.78  84.1
PC1               93.86  93.59  92.5           93.5      93.59  93.86        88.43  93.14
PC1 + Resample    97.38  96.75  93.68          95.3      97.29  97.29        87.9   93.95
PC4               90.88  90.53  91.15          90.88     87.72  90.81        86.07  87.79
PC4 + Resample    95.54  93.76  91.01          92.8      94.04  95.34        85.12  87.45
Datasets          PART   J48    Random Forest  Logistic  MLP    Naïve Bayes  SVM    LWL
MC2               73.24  73.9   68.9           72.06     67.1   70.85        74.52  65.22
MC2 + Resample    90.18  88.35  85.15          90.18     88.31  90.22        80.96  88.97
MW1               90.57  90.34  89.1           90.33     91.32  91.82        86.11  91.32
MW1 + Resample    94.79  95.04  92.06          92.55     95.04  95.04        87.87  93.81
KC3               90.15  87.32  90.62          88.85     90.17  89.73        84.7   91.05
KC3 + Resample    95.42  95.85  93.23          93.67     96.5   96.28        89.52  93.88
CM1               87.54  87.93  88.71          87.71     90.7   89.52        85.16  89.71
CM1 + Resample    95.45  94.86  91.67          91.09     96.84  96.64        82.96  90.1
KC1               84.72  84.06  85.76          85.19     84.48  86.33        82.44  84.72
KC1 + Resample    91.98  91.98  85.48          86        92.07  93.59        81.82  84.01
PC1               93.5   92.96  92.59          92.68     93.41  94.22        90.06  93.14
PC1 + Resample    97.2   97.47  93.95          94.94     97.56  97.65        87.81  93.5
PC4               90.4   90.53  91.29          90.06     87.72  91.08        87.38  89.03
PC4 + Resample    95.54  95.41  90.95          92.32     94.45  95.27        86.56  89.78
Fig. 6: The Effect of Resample on MC2 with Single Learners (series: MC2 with Resample, MC2 without Resample; x-axis: PART, J48, Random Forest, Logistic, MLP, Naïve Bayes, SVM, LWL)
Fig. 10: The Accuracy of the Proposed Model and Previous Studies (Accuracy with Previous Studies on KC1 vs. Accuracy with Proposed Model on KC1)
6. CONCLUSION AND FUTURE WORK
In this research, the researchers have analyzed the accuracy and performance of three ensemble learning methods (Boosting, Bagging, and Rotation Forest) built on 8 single base learners over the NASA software defect prediction (SDP) datasets. The results are as follows:

1) The accuracy of most of the single learners is enhanced on most of the 7 NASA dataset samples by using the resample technique as a preprocessing step, as shown in Figures 6-9, which take the MC2 dataset as an example in our case study.

2) The researchers do not recommend using SVM or Logistic Regression as a single base learner with the three homogeneous ensemble methods Boosting, Bagging, and Rotation Forest.

3) Accuracy and performance are gained by using Bagging with the MLP and PART base learners. With Boosting, performance accuracy is gained for PART, Random Forest, and Naïve Bayes, whereas Rotation Forest gains performance with 4 base learners: PART, J48, MLP, and Random Forest.

4) Accuracy and performance are lost by using SVM with all three homogeneous ensemble methods; with Rotation Forest, there are no accuracy or performance losses except with SVM.
Thus, Rotation Forest is the best method, and the researchers recommend it in this research study for its advantage in generalization ability. In future work, we will test more ensemble algorithms with different base classifiers on more software datasets, and we will test more preprocessing techniques to enhance the results.

Table 9: Comparison between Previous Studies and our Proposed Model with the Resample Technique

7. REFERENCES
[15] P. Singh and A. Chug, "Software defect prediction analysis using machine learning algorithms", In International Conference on Computing and Data Science, 2017, pp. 775-781.
[16] H. Shamsul, L. Kevin and M. Abdelrazek, "An ensemble oversampling model for class imbalance problem in software defect prediction", IEEE Access, 2018.
[17] A. Abdel Aziz, N. Ramadan and H. Hefny, "Towards a Machine Learning Model for Predicting Failure of Agile Software Projects", IJCA, Vol. 168, No. 6, 2017.
[18] S. Wang and X. Yao, "Using class imbalance learning for software defect prediction", IEEE Transactions on Reliability, Vol. 62, No. 2, 2013, pp. 434-443.
[29] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches", IEEE Transactions on Systems, Man, and Cybernetics, Part C, Vol. 42, No. 4, 2012, pp. 463-484.
[30] E. Menahem, L. Rokach, and Y. Elovici, "Troika: An improved stacking schema for classification tasks", Information Sciences, Vol. 179, No. 24, 2009, pp. 4097-4122.
[31] V. Francisco, B. Ghimire and J. Rogan, "An assessment of the effectiveness of a random forest classifier for land-cover classification", ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 67, 2012, pp. 93-104.
[32] Y. Saeys, T. Abeel and Y. Van de Peer, "Robust feature selection using ensemble feature selection techniques", In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2008, pp. 313-325.
[33] Y. Ma, K. Qin, and S. Zhu, "Discrimination analysis for predicting defect-prone software modules", Journal of Applied Mathematics, 2014.
[34] Q. Song, Z. Jia and M. Shepperd, "A general software defect-proneness prediction framework", IEEE Transactions on Software Engineering, Vol. 37, No. 3, 2011, pp. 356-370.
[35] Y. Jiang, B. Cukic, T. Menzies and N. Bartlow, "Comparing design and code metrics for software quality prediction", PROMISE, 2008, pp. 11-18.
[36] M. Hall, E. Frank, G. Holmes and B. Pfahringer, "The WEKA data mining software: an update", ACM SIGKDD Explorations, Vol. 11, No. 1, 2009, pp. 10-18.
[37] T. Menzies, J. Greenwald and A. Frank, "Data mining static code attributes to learn defect predictors", IEEE Transactions on Software Engineering, Vol. 33, No. 1, 2007, pp. 2-13.
[38] J. Demšar, "Statistical comparisons of classifiers over multiple data sets", Journal of Machine Learning Research, Vol. 7, 2006, pp. 1-30.
[39] S. Aleem, L. Fernando Capretz, and F. Ahmed, "Comparative performance analysis of machine learning techniques for software bug detection", CSCP, 2015, pp. 71-79.