CHAPTER THREE
RESEARCH METHODOLOGY
3.0 Introduction
Research methodology encompasses all the techniques and procedures adopted to carry out a study systematically. In it, the researcher explains the different steps generally taken to study a research problem and the logic behind them; hence, it describes the scientific approach adopted for this study.
According to Irny and Rose (2005), methodology does not set out to provide solutions; it is, therefore, not the same thing as a method. Instead, it offers the theoretical underpinning for understanding which method, set of methods or best practices can be applied to the research problem at hand.
In this chapter the research methods are carefully presented alongside the research design
which details all the processes, methods and tools applied during the research.
Research design has been described as a blueprint or outline used in carrying out research in a way that exercises maximum control over factors that affect the validity of the results (Polit and Hungler).
This work presents a comparative study of several machine learning techniques for the detection of defects in software code snippets. The machine learning algorithms tested in this work include Logistic Regression, K-Nearest Neighbor (kNN), Decision Trees, Gaussian Naive Bayes, Random Forest, Bernoulli Naive Bayes, Adaptive Boosting and Neural Networks. In addition, this work uses Neighborhood Component Analysis (NCA) to project the computed features into a new space where they can be linearly separable, as against the widely and commonly used Principal Component Analysis (PCA); NCA learns this projection with the goal of maximizing the prediction accuracy of the downstream classification algorithms.
These techniques were tested on multiple publicly available datasets, including the mozilla4, Kc1, Ar1 and Pc1 datasets. Details about the datasets are provided in Table 1 below, including the number of samples in each dataset, the language of the original projects and the number of features computed from each dataset. Each of these datasets has a different number of computed features, which makes this a challenging problem to solve. The idea was to evaluate and compare multiple models and select a subset which performs best on this task.
The research design consists of the processes that were taken to achieve the research objectives. These processes include data collection, data preparation, data processing, data analysis, model building and performance evaluation.
The datasets in this work were taken from the PROMISE and NASA database repositories, which are publicly available in the NASA Metrics Data Program's repository. A description of each dataset is given below.
Mozilla4: This dataset was written in the C++ programming language. It consists of 15544 modules. The authors have provided a pre-processed version of the dataset, where they computed 5 static code attributes (features), including McCabe, Halstead and LOC measures, and one label indicating the defect or non-defect status of each module.
Kc1: This dataset also contains C++ language code and was provided with 43 pre-computed features. The authors have provided a total of 2108 modules/samples, where they precomputed the features from each module/sample. This dataset contains a total of 1107
Ar1 & Ar4: These datasets were written in the C programming language. Each of them contains a total of 121 modules/samples, and each of these modules has 30 pre-computed features. In each of the datasets, there are 8 defective examples and 112 non-defective examples.
As can be observed from the above description, these datasets have different numbers of features, and these features are different in nature. Similarly, all these datasets are examples of imbalanced datasets where the defective class is rare.
Table 1
Taken from: Dynamic Detection of Software Defects Using Supervised Learning Techniques
New model development should be part of your research design. After using the existing models, you have to come up with or design a new model, or an extension of an existing model, for better performance. This is what your research topic has suggested.
3.2.1. Data Pre-Processing
The first step in carrying out this experiment was to perform pre-processing of the data to ensure that only relevant features are left for the next process.
The data pre-processing for this study involved cleaning and removing unwanted and irrelevant rows and columns from the data sets. The data pre-processing was geared towards reducing the execution time and complexity of the machine learning models. All the data sets were double checked for null values, and the data were normalized using the standard scaler (Z-score normalization) technique.
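A minimal sketch of how this cleaning step could be carried out with pandas is given below; the file name and the label column name used here are assumed placeholders, not necessarily those of the actual datasets.
import pandas as pd

# Load one of the datasets; "kc1.csv" is an assumed placeholder file name.
df = pd.read_csv("kc1.csv")

# Remove duplicate rows and rows containing null values, then separate
# the computed features from the defect label column.
df = df.drop_duplicates().dropna()
X = df.drop(columns=["defects"])   # "defects" is an assumed label column name
y = df["defects"]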
Upon completing the data pre-processing, the next task was to prepare and split the data sets into two, with one part being used as the training data and the second used as the test data.
In this case, to evaluate the performance of each model, we performed a 10-fold cross validation process where each dataset was divided into 10 randomly chosen splits/folds. In the first iteration, 9 of the folds were used as the training set and the 10th fold was used as the test set. Performance metrics like accuracy, precision and recall were computed for the test set. This process was repeated 10 times such that each fold was used once as a test set and was also involved in the training process. Figure 3.2 shows the processes that are performed in each fold of the train and test split. (I don't understand figure 3.2. Please explain the figure and what each of its parts represents.)
Table 2 shows the first 20 rows and 10 columns of the jm1 data set. The provided data set is split into 10 equal parts through the k-fold cross validation technique (explain this technique and how you performed it). Each part is assigned a number from 1 to 10. Using the cross validation technique, a model is trained k times; in this work k is 10. In the first iteration, part 1 is kept for testing and the model is trained on parts 2 to 10. In the second iteration, part 2 is kept for testing and each model is trained on parts 1 and 3 to 10. In the third iteration, part 3 is kept for testing and the model is trained on parts 1, 2 and 4 to 10. The process was repeated until all ten parts had been used for training as well as testing.
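A minimal sketch of how this 10-fold procedure could be implemented with scikit-learn is shown below; X and y are assumed to hold the features and labels prepared in the pre-processing step, and Logistic Regression is used here only as a stand-in for any of the models compared in this work.
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 10 randomly shuffled folds; StratifiedKFold keeps the defect/non-defect
# ratio similar in every fold (plain KFold would also work here).
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X, y):
    model = LogisticRegression(max_iter=100)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    predictions = model.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx], predictions))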
Table 2: First 20 rows and 10 columns of the jm1 data set (McCabe and Halstead static code metrics), grouped by cross-validation part.

Part | loc | v(g) | ev(g) | iv(g) | n    | v        | l    | d      | i      | e
1    | 1.1 | 1.4  | 1.4   | 1.4   | 1.3  | 1.3      | 1.3  | 1.3    | 1.3    | 1.3
1    | 1   | 1    | 1     | 1     | 1    | 1        | 1    | 1      | 1      | 1
2    | 91  | 9    | 3     | 2     | 318  | 2089.21  | 0.04 | 27.68  | 75.47  | 57833.24
2    | 109 | 21   | 5     | 18    | 381  | 2547.56  | 0.04 | 28.37  | 89.79  | 72282.68
3    | 505 | 106  | 41    | 82    | 2339 | 20696.93 | 0.01 | 75.93  | 272.58 | 1571507
3    | 107 | 25   | 7     | 14    | 619  | 4282.78  | 0.02 | 52.91  | 80.95  | 226588.8
4    | 74  | 11   | 1     | 8     | 294  | 1917.93  | 0.03 | 28.77  | 66.66  | 55178.46
4    | 602 | 136  | 123   | 123   | 2785 | 25942.69 | 0.01 | 105.26 | 246.47 | 2730637
5    | 29  | 2    | 1     | 2     | 140  | 718.1    | 0.1  | 9.93   | 72.35  | 7127.8
5    | 36  | 3    | 1     | 1     | 254  | 1447.91  | 0.04 | 23.72  | 61.05  | 34338.99
6    | 70  | 11   | 1     | 6     | 434  | 3047.71  | 0.04 | 26.63  | 114.46 | 81152.69
6    | 109 | 20   | 12    | 4     | 223  | 1322.55  | 0.04 | 23.91  | 55.32  | 31619.49
7    | 37  | 3    | 1     | 3     | 187  | 1095.44  | 0.04 | 24.97  | 43.87  | 27356.45
7    | 90  | 29   | 8     | 9     | 488  | 3387.95  | 0.05 | 20.69  | 163.74 | 70100.61
8    | 19  | 2    | 1     | 2     | 71   | 351.75   | 0.08 | 12.66  | 27.79  | 4451.81
8    | 152 | 21   | 4     | 5     | 430  | 2850.62  | 0.03 | 33.3   | 85.6   | 94929.66
9    | 22  | 5    | 5     | 4     | 100  | 539.23   | 0.07 | 13.94  | 38.68  | 7516.89
9    | 69  | 12   | 1     | 1     | 536  | 3745.93  | 0.04 | 23.38  | 160.24 | 87570.07
10   | 49  | 9    | 1     | 8     | 191  | 1166.73  | 0.05 | 21.31  | 54.76  | 24859.27
10   | 48  | 3    | 1     | 2     | 248  | 1470.82  | 0.03 | 33.59  | 43.78  | 49408.04
After splitting the data sets, correlation-based feature selection for each dataset was carried out next. The idea is that if there are multiple features which are highly correlated with each other, they might be presenting the same information and hence be redundant. For each feature, this study computed its correlation with all the other features using Pearson's Correlation Coefficient. If two features had a correlation higher than 0.95, one of the features was randomly dropped, resulting in a reduced feature space. The formula of Pearson's correlation is:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
where x is considered as one variable, y is considered as another variable, and x̄ and ȳ are their respective means.
The correlation-based feature selection (CFS) method is a filter approach and therefore independent of the final classification model. It evaluates feature subsets based only on intrinsic properties of the data which, as the name already suggests, are correlations. The goal is to find a feature subset with low feature-to-feature correlation that maintains or increases the predictive power. Figures 3.3, 3.4, 3.5 and 3.6 show the correlation analysis of each dataset computed using Pearson's Correlation Coefficient. (Please note that these are not figures but rather tables. Label them all as tables and explain some entries in each of the tables for easy understanding and meaning. There should be column headers for understanding. What are the values in the tables for? Present the tables and their explanations one after the other for easy referencing and cross-checking.)
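One possible way to perform this correlation-based dropping in code is sketched below; the 0.95 threshold matches the one stated above, while the variable names are assumptions carried over from the pre-processing sketch.
import numpy as np

# Absolute Pearson correlation between every pair of features.
corr = X.corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one feature from every pair whose correlation exceeds 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)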
Table xx (Figure 3.3): Correlation Analysis of the pc1 data set
The pc1 data set (Table xx) has a total of 21 features, of which 11 features have a correlation value higher than 0.95. Seven of the highly correlated features, including branchCount and lOCode, are dropped. (Explain this table further and say what its usefulness is in your research or model development. Why did you drop the seven features that are highly correlated?)
Another of the data sets has 16 features with a correlation value greater than 0.95. The following 13 features are dropped out of these 16. These features are dropped before applying the machine learning models. (These need further explanation of the data set and of why 13 features are dropped. We need to understand these data sets and how and what they are used for in the research.)
The ar4 data set has 29 features, and there are 18 features that have a correlation value greater than 0.95; among these 18 features, 16 features are dropped. All of these features are not utilized for machine learning model training and testing. These features are blank loc, executable_loc, total operands, total operators, halstead vocabulary, halstead length, halstead volume, halstead effort, halstead error, halstead time, branch count, decision count, and call pairs, among others.
The features computed from the datasets are on different scales and have different ranges. For some of the machine learning models, the computed features need to be normalized into a similar range. This study carried out standardization of each feature: the mean was subtracted and the result was then divided by the standard deviation, as given below:
z = (x − μ) / σ
where μ is the mean and σ is the standard deviation of the feature. This technique is also called Z-score Normalization, and the resultant features will have zero mean and unit standard deviation. This enables all the features to have the same scale. Table 3.2 shows a few instances of the kc1 data set and Table 3.3 is the normalized form of the data.
Table 3.3: Normalized (Z-score) values of the e, b, t and lOCode features of the kc1 data set

e        | b        | t        | lOCode
-1.19793 | 2.628743 | -1.19675 | -1.69488
-1.19796 | 1.730539 | -1.19718 | -1.73783
0.510146 | -0.33533 | 0.510053 | 1.010952
-0.28423 | -0.48503 | -0.28455 | -0.19164
-1.00772 | -1.02395 | -1.00825 | -0.87884
-0.68477 | -0.69461 | -0.68521 | -0.27754
-0.27147 | -0.51497 | -0.27179 | -0.01984
1.443663 | -0.06587 | 1.443833 | 0.882103
-0.23371 | -0.51497 | -0.23401 | -0.01984
-0.05717 | -0.45509 | -0.05743 | 0.023107
0.500799 | -0.24551 | 0.500702 | 0.366705
0.065063 | -0.45509 | 0.064837 | 0.538504
2.41529  | 0.443114 | 2.41575  | 1.998797
What do the rows and columns signify in the table above? I mean the headings and the values. How did you normalize? What tools, method and process did you use for normalization? Explain.
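A minimal sketch of this normalization step using scikit-learn's StandardScaler is given below; X_reduced is assumed to be the feature matrix left after the correlation-based feature dropping.
from sklearn.preprocessing import StandardScaler

# Z-score normalization: subtract the mean of each feature and divide by
# its standard deviation so that all features are on the same scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_reduced)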
The feature standardization was followed by the dimensionality reduction step. In this step, Neighborhood Component Analysis (NCA) was used to project the computed features into a new space where they can be linearly separable. In general, researchers use Principal Component Analysis (PCA) for dimensionality reduction, to find components in the new vector space which explain the most variance in the dataset. However, variance is not necessarily correlated with the usability of the features for the classification task at hand. In this case, this study used NCA, which takes the class labels and class separation into account when finding the new projections, because its objective is to maximize the classification accuracy of the projected features. In the NCA formulation, xᵢ denotes the feature vector of the i-th sample and yᵢ its output class label.
You have only explained how PCA works. You have not explained what you did with it or how you applied it in your research. How did you carry out dimensionality reduction with PCA? What tools, method and process did you use for dimensionality reduction? Explain.
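As a hedged sketch of how the NCA projection described above could be computed with scikit-learn (the number of components used here is an illustrative assumption, not a value stated in this study):
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Learn a linear projection that favours class separation rather than
# maximum variance; the class labels y are used during fitting.
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_nca = nca.fit_transform(X_scaled, y)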
The next step is to train and test several machine learning algorithms for the detection of defects in software modules. The selected algorithms in this work represent a wide variety of machine learning algorithms which make different assumptions about the problem at hand. For each model, this study applied the default hyper-parameters for comparison. Each model was trained and tested on each of the datasets.
Also, it is important to note that all selected models are classification models, several of them using ensemble techniques, because of the nature of the data sets, the task at hand, and the expected outcome. All models are built using Python as the development tool alongside libraries such as sklearn, pandas, NumPy and matplotlib. The models that are utilized in this artifact are:
Logistic Regression
Logistic regression is a type of linear classifier that can predict whether a given object lies in class '1' or class '0'. Logistic regression was considered because it is mostly deployed to solve classification problems. Based on the available data and task, the logistic regression model is used to create and ascertain the probability distribution that corresponds to the extracted features. This forms one of the major reasons it has been considered for this research, especially since the most common type of problem it addresses is binary classification. The linear part of the model is
y = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂
where 𝑥₁ and 𝑥₂ are the independent variables and 𝑏₀, 𝑏₁, 𝑏₂ are the weights added to each feature. Logistic regression uses a sigmoidal function for converting the continuous output into 0 or 1 form; it is a threshold-based classifier.
For this research, we set our default logistic regression model with an l2 regularization penalty and a maximum of 100 iterations. The threshold for the sigmoidal function is kept at 0.5: an output of the model with a value less than 0.5 is considered as 0, while an output value of 0.5 or above is considered as 1. The model is trained on the training data and evaluated on the testing data. The evaluation metrics are computed and discussed in the results section of chapter 4.
At first, the model is imported from the sklearn library and the logistic regression model is appended to the list of models.
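A minimal sketch of how this could look, following the same append pattern used for the other models in this work (the short name 'LR' is an assumption; the parameters mirror those stated above):
from sklearn.linear_model import LogisticRegression

# Default l2 penalty and 100 iterations; sklearn applies the 0.5 threshold
# internally when predict() is called. 'models' is the list of (name, model)
# pairs used throughout this work.
models.append(('LR', LogisticRegression(penalty='l2', max_iter=100)))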
You have only explained how LR works. You have not explained what you did with it or how you applied it in your research. Explain how logistic regression was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
K-Nearest Neighbor (KNN)
K-Nearest Neighbor (KNN) is a simple machine learning algorithm based on the supervised learning technique. KNN is chosen for this research because of its ability to solve classification problems like the one being carried out in this research, where the intention is to classify defects and non-defects present in the data set. The KNN algorithm is also used in this research because of its ability to classify new data points into the most suitable category of the data. The steps involved in the KNN algorithm are discussed below:
6. Assign the new data points to the category for which the number of neighbors is maximum.
In this research, the process of building the KNN model using the KNN algorithm included fitting the KNN algorithm on the training data and predicting the test results. After training the model, the results are tested using a new dataset, i.e., the test dataset. The value of k is kept at 2, as there are only two output classes in each data set. The Euclidean distance formula is used to find the distance of each new data point from the training points:
d = √[(x₂ − x₁)² + (y₂ − y₁)²]
where d is the distance between two points, (x₁, y₁) are the coordinates of one point and (x₂, y₂) are the coordinates of the other point.
The KNN model is imported from the sklearn library and initiated through the following code.
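One possible form of that code, as a sketch (the short name 'KNN' and the shared models list are assumptions consistent with the other snippets in this work):
from sklearn.neighbors import KNeighborsClassifier

# k = 2 neighbours; the Euclidean distance is sklearn's default metric.
models.append(('KNN', KNeighborsClassifier(n_neighbors=2)))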
You have only explained how KNN works. You have not explained what you did with it or how you applied it in your research. Explain how KNN was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Random Forest Model
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems and is based on the concept of ensemble learning, which combines multiple classifiers to solve a complex problem and to improve the performance of the model. This algorithm was chosen mainly because of its ensemble learning concept, and because it is believed that detecting and predicting defects in software is a very complex classification problem that requires a sophisticated algorithm like random forest.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, produces the final output.
For this research, the random forest algorithm was used to build the model using selected data points (subsets) and by providing the number N of decision trees to be built. This helps the model to predict and assign new data points to the category with the highest number of occurrences. The depth of each tree is up to three branches. The random state is kept at 0 during the design and training of the model. The model is applied on all four data sets, and the results are discussed in chapter 4.
You have only explained how RF works. You have not explained what you did with it or how you applied it in your research. Explain how RF was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Adaptive boosting (AdaBoost) is also used in building one of the models for this research. This technique, being an ensemble technique that combines different other models inside it, is deemed suitable for this research because of its ability to classify features based on assigned weights. This machine learning technique is similar to the Decision Tree because it also makes use of the decision tree algorithm: it initially assigns equal weights to all data points and then assigns more weight to points that are wrongly classified. The wrongly classified points are then given more priority in the next round of model training, and this is repeated until a lower error is obtained.
5. Update Weights
7. Final Predictions
The random forest classifier is imported from the sklearn library and the model is appended to the 'models' list. The n_estimators parameter is the number of trees used in the random forest model.
from sklearn.ensemble import RandomForestClassifier
models.append(('RF', RandomForestClassifier(n_estimators=100)))
You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how AdaBoost was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Naïve Bayes
Naïve Bayes is a probabilistic algorithm for supervised learning that draws influence from Bayes Theorem. Bayes theorem is a formula that offers the conditional probability of an event A happening given that another event B has already occurred:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the probability of A given B, P(B|A) is the probability of B given A, and P(A) and P(B) are the individual probabilities of A and B.
Now, for the purpose of this research, this research adopted the classification model generated from this theorem. In applying the Naïve Bayes model, this research experimented with the Bernoulli Naïve Bayes classifier, which is used when the predictors are Boolean in nature: each feature x has only two values, i.e., 0 and 1, and the model estimates the probability of successful and failed predictions for each feature.
The Bernoulli Naïve Bayes model is imported from the sklearn library and appended to the models list, and the cross validation procedure is applied on the model to analyze its results. The results are discussed with graphs in chapter 4.
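A sketch of what this step could look like, following the same append pattern as the other models (the short name 'BNB' is an assumed label):
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes with its default parameters.
models.append(('BNB', BernoulliNB()))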
You have only explained how NB works. You have not explained what you did with it or how you applied it in your research. Explain how the NB model was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Adaptive Boosting (AdaBoost)
AdaBoost stands for adaptive boosting. This technique is used for ensemble purposes in machine learning. Mostly, one-level decision trees are used with the AdaBoost technique; such a tree in the AdaBoost model is known as a decision stump. At the start, the AdaBoost model gives equal weights to all data points. After one iteration, the model increases the weights of the data points that were classified wrongly, so in the next iteration these wrongly classified data points are given more consideration, and the process continues until the error on the data points is decreased. The formula for the initial weight of each data point is w(i) = 1/N, where N is the total number of data samples.
The performance (amount of say) of a stump is calculated through the following formula: performance = ½ · ln((1 − Total Error) / Total Error). After finding the stump value, the weights are updated, and the process is repeated.
The AdaBoost model is imported from the ensemble library of sklearn. The model is then stored in the AdaC variable and appended to the models list through the following code.
from sklearn.ensemble import AdaBoostClassifier
models.append(('AdaC', AdaBoostClassifier(n_estimators=100)))
You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how the AdaBoost model was carried out on the data, the problems encountered and their solutions, as well as the tools, methods and processes used.
Extra Trees Classifier
The extremely randomized tree (extra tree) classifier is an ensemble technique that combines the output of multiple trees in a forest. The extra tree classification technique aggregates the results of multiple de-correlated decision trees. This algorithm is similar to random forest but differs in two respects:
Extra tree uses the whole data set for training, while random forest splits the data into subsets.
Extra tree chooses the splitting of nodes randomly, while random forest chooses the optimum split.
The Gini Index is found for each feature through the following formula: Gini = 1 − Σᵢ pᵢ², where pᵢ is the probability of class i. The feature having the lowest Gini Index is selected for the split.
The extra tree classifier is imported from the ensemble library of sklearn. The classifier is stored in the variable ExtC and is appended to the models list for implementation. The number of trees is set through the n_estimators parameter.
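A sketch of this step (the value of n_estimators shown here is an assumption, since the exact number used is not stated):
from sklearn.ensemble import ExtraTreesClassifier

# Extremely randomized trees appended to the same models list.
models.append(('ExtC', ExtraTreesClassifier(n_estimators=100)))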
You have only explained how extra tree works. You have not explained what you did with it or how you applied it in your research. Explain how extra tree was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Artificial Neural Networks (ANNs)
An Artificial Neural Network is one of the models deployed in this research because of its ability to mimic how the human brain processes information. Its structure allows us to use the artificial neurons of the ANN to learn the features of the provided data while also learning the relationship between the dependent and independent variables available in the data. The fundamental working principle of an ANN involves learning by adjusting the different weights between the various neurons in order to learn the relationship between the input features and the output.
The objective function in the case of a neural network is the sum-of-squares error function, which gives the network information as to how incorrect or diverged the output is from the expected result. This information about the error is then used to adjust the weights so as to minimize the error function. The aim of the learning process is to evaluate the error function at each iteration
and re-adjust the weights to attain a local minimum. The concept of a layer in a neural
network is defined as the neurons and their corresponding weights residing in that layer.
Consequently, for every neural network there are three types of layers, namely, Input
layer, Hidden layers, and Output layer. The neuron in every layer first computes the weighted sum of its inputs and then applies the activation function on the weighted input to determine the output as either 0 or 1. There are various activation functions that are used, for instance, the ReLU function, the Signum function and, the most common one, the sigmoid function given as 1 / (1 + e⁻ˣ). Figure 3.9 shows the basic structure of an artificial neural network.
The artificial neural network model is imported from the sklearn library and appended to the models list. The size of the hidden layer is 10 and the maximum number of iterations is 500.
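A sketch of this step, assuming sklearn's MLPClassifier is the neural network implementation referred to above:
from sklearn.neural_network import MLPClassifier

# One hidden layer of 10 neurons and up to 500 training iterations,
# matching the values stated above.
models.append(('ANN', MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)))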
You have only explained how ANN works. You have not explained what you did with it or how you applied it in your research. Explain how ANN was carried out or applied on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Gradient Boosting
Gradient Boosting is an ensemble technique that builds a strong prediction model from several weak prediction models. The aim of the boosting model is to reduce the value of the loss function: as the loss function reduces, the performance of the model increases. The trees in the model predict outputs for the data, and techniques like majority voting or averaging are used for combining them into a final prediction.
The loss value for each data sample is found through the chosen loss function. The information gain for splitting the data is based on the entropy, given by the following formula:
E = −Σᵢ₌₁..C pᵢ log₂(pᵢ)
The gradient boosting classifier is imported from the ensemble library of sklearn. The model is saved in the variable gb and appended to the models list. The number of trees is 100 in this model, with a learning rate of 1.0 and a maximum depth of 1.
from sklearn.ensemble import GradientBoostingClassifier
models.append(('gb', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)))
You have only explained how GB works. You have not explained what you did with it or how you applied it in your research.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a supervised technique that can be used for classification as well as for reducing the dimension of features. The LDA algorithm easily classifies patterns in binary classes. LDA
considers the data in a linear form and draws a hyperplane between the outputs of the features. This hyperplane increases the distance between the means of the two classes and reduces the scatter within each class.
The model is imported and saved in the variable lda. The model is assigned for further calculations using the following Python code.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
models.append(('LDA', LinearDiscriminantAnalysis()))
You have only explained how LDA works. You have not explained what you did with it or how you applied it in your research.
SVM
Support Vector Machine (SVM) is a supervised learning algorithm that is widely used for classification purposes. The SVM model is best suited when the output class label is binary; however, it also performs well in multi-class classification. The SVM model plots the data in 'n' dimensions, where the number of dimensions depends on the number of features in the data. The SVM model draws a hyperplane between the outputs of the data and differentiates the classes on the basis of distances between centres. SVM uses different distance formulas for finding the distance of the class data points from the hyperplane.
The SVM classification library is imported from sklearn. The model is saved in the "SVM" variable and appended to the models list through the following code.
from sklearn.svm import SVC
models.append(('SVM', SVC(gamma=0.1, C=1.)))
You have only explained how SVM works. You have not explained what you did with it or how you applied it in your research. For instance, for all the algorithms or models you have listed the steps that are followed in performing or applying them. Why not carry out those steps and report exactly what you did and how you did each step?
Performance Evaluation
For the performance evaluation of each model, we computed the following metrics for each fold and then computed the average and standard deviation of each metric. In this report, the following four evaluation metrics were computed for each fold and each model. Before discussing the metrics, some terminologies need to be clarified, namely True Positive, False Positive, False Negative and True Negative.
True Positives are values of the testing data that are true and that the model also predicted as true.
False Positives are values of the testing data that are false but that the model predicted as true.
False Negatives are values of the testing data that are true but that the model predicted as false.
True Negatives are values of the testing data that are false and that the model predicted as false.
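A sketch of how these per-fold metrics could be computed for every model with scikit-learn's cross_validate helper; X_scaled and y are assumed to be the prepared features and labels (with the defective class encoded as 1), and 'models' is the list built in the previous sections.
from sklearn.model_selection import cross_validate

# Compute the four metrics in every fold, then report the mean and
# standard deviation per model.
scoring = ['accuracy', 'precision', 'recall', 'f1']
for name, model in models:
    results = cross_validate(model, X_scaled, y, cv=10, scoring=scoring)
    for metric in scoring:
        fold_scores = results['test_' + metric]
        print(name, metric, fold_scores.mean(), fold_scores.std())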
You have only explained what performance evaluation is. Explain how you carried out performance evaluation in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.1 Accuracy
This is the most common type of scoring method and essentially the most misused one. This type of method is viable for certain types of classification problems. It is calculated as the total number of correct predictions made over the total number of predictions. The formula of accuracy is:
Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
You have only explained what accuracy is. Explain how you carried out the process of finding accuracy in this research. What are the tools, methods and processes applied during the process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.2 Precision
Precision is defined as the correctly predicted classifications over the total predicted classifications. High precision relates to the model's low error in classifying data points that do not belong to a certain class as members of that class. The precision metric becomes more important when the class labels are imbalanced and the prediction of the true class is important.
3.2.6.3 Recall
Recall is defined as the correctly predicted classifications over all the classifications of members of a certain class. This metric gives information as to how many objects that belong to the class in question were missed or classified outside that class.
Recall = True Positive / (True Positive + False Negative)
You have only explained what recall is. Explain how you carried out the process of finding the recall value in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.4 F1-Score
The F1-Score essentially brings in both the False Positives and the False Negatives to weigh the error in decision making. It is defined as the harmonic mean of precision and recall. Ideally, we would want to list all the true positive observations that exist for a particular class while being careful to omit all those that do not belong to that class. If we could do that, then we would have both high precision and high recall, and this consequently ensures a high F1-Score for the model. An important thing to note here is that even if the precision is remarkably high, having a low recall will always dominate and pull the F1-Score down.
You have only explained what the F1-score is. Explain how you carried out the process of finding the F1-score value in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
You have to design or develop a new model, or an extension of an existing model, just as your topic, statement of the problem and aim have suggested in chapter one. I have not seen this in your chapter three or four. After developing a new model, you then apply it on the same data sets; then you can do a comparison to know whether the new model is better than the existing models or not. Add model development in chapter three.
CHAPTER FOUR
You have to follow your research design steps in presenting the whole of chapter four and in the order listed below, for example:
3. Feature normalization/standardization
8. Developed model
9. Discussions