CHAPTER THREE
RESEARCH METHODOLOGY
3.0 Introduction
Research methodology encompasses all the techniques and procedures adopted to carry out a study systematically. In it, the researcher explains the different steps generally taken to study a research problem and the logic behind them; hence, it describes the scientific approach adopted for this study.
According to Irny and Rose (2005), methodology does not set out to provide solutions; it is, therefore, not the same thing as a method. Instead, it offers the theoretical underpinning for understanding which method, set of methods or best practices can be applied to the research problem at hand.
In this chapter the research methods are carefully presented alongside the research design
which details all the processes, methods and tools applied during the research.
Research design has been described as a blueprint or outline used in carrying out research in a way that exercises maximum control over factors that affect the validity of the results (Polit and Hungler).
This work presents a comparative study of several machine learning techniques for the detection of defects in software code snippets. The machine learning algorithms tested in this work include Logistic Regression, K-Nearest Neighbor (kNN), Decision Trees, Gaussian Naive Bayes, Random Forest, Bernoulli Naive Bayes, Adaptive Boosting and Neural Networks. In addition, this work uses Neighborhood Component Analysis (NCA) to project the computed features into a new space where they can be linearly separable, as against the widely and commonly used Principal Component Analysis (PCA); NCA learns this projection with the goal of maximizing the prediction accuracy of the downstream classification algorithms.
These techniques were tested on multiple publicly available datasets, including the mozilla4, Kc1, Ar1 and Pc1 datasets. Details about the datasets are provided in Table 1 below, including the number of samples in each dataset, the language of the original projects and the number of features computed from each dataset. Each of these datasets has a different number of computed features, which makes this a challenging problem to solve. The idea was to evaluate and compare multiple models and select a subset which performs best on this task.
The research design consists of the processes that were taken to achieve the research objectives. These processes include data collection, data preparation, data processing, data analysis, model building and performance evaluation.
The datasets in this work were taken from the PROMISE and NASA database repositories, which are publicly available in the NASA Metrics Data Program's repository. A description of each dataset is given below.
Mozilla4: This dataset was written in the C++ programming language. It consists of 15544 modules. The authors have provided a pre-processed version of the dataset, where they computed 5 static code attributes (features), including McCabe, Halstead and LOC measures, and one label indicating the defect or non-defect status of each module.
Kc1: This dataset also contains C++ language code and was provided with 43 pre-computed features. The authors have provided a total of 2108 modules/samples, where they precomputed the features from each module/sample. This dataset contains a total of 1107
Ar1 & Ar4: These datasets were written in the C programming language. Each of them contains a total of 121 modules/samples, and each of these modules has 30 pre-computed features. In each of the datasets, there are 8 defective examples and 112 non-defective examples.
As can be observed from the above description, these datasets have different numbers of features, and these features are different in nature. Similarly, all these datasets are examples of imbalanced datasets where the defective class is rare.
Table 1
Taken from: Dynamic Detection of Software Defects Using Supervised Learning Techniques
New model development should be part of your research design. After using the existing models, you have to come up with or design a new model, or an extension of an existing model, for better performance. This is what your research topic has suggested.
3.2.1. Data Pre-Processing
The first step in carrying out this experiment was to perform pre-processing of the data to ensure that only relevant features are left for the next process.
The data pre-processing for this study involved cleaning and removing unwanted and irrelevant rows and columns from the data sets. The data pre-processing was geared towards reducing the execution time and complexity of the machine learning models. All the data sets were double checked for null values, and the data were normalized using the standard scaler (Z-score normalization) technique.
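A minimal sketch of how this cleaning step could be carried out with pandas is given below; the file name and the label column name used here are assumed placeholders, not necessarily those of the actual datasets.
import pandas as pd

# Load one of the datasets; "kc1.csv" is an assumed placeholder file name.
df = pd.read_csv("kc1.csv")

# Remove duplicate rows and rows containing null values, then separate
# the computed features from the defect label column.
df = df.drop_duplicates().dropna()
X = df.drop(columns=["defects"])   # "defects" is an assumed label column name
y = df["defects"]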
Upon completing the data pre-processing, the next task was to prepare and split the data sets into two, with one part being used as the training data and the second used as the test data.
In this case, to evaluate the performance of each model, we performed a 10-fold cross validation process where each dataset was divided into 10 randomly chosen splits/folds. In the first iteration, 9 of the folds were used as the training set and the 10th fold was used as the test set. Performance metrics like accuracy, precision and recall were computed for the test set. This process was repeated 10 times such that each fold was used once as a test set and was also involved in the training process. Figure 3.2 shows the processes that are performed in each fold of the train and test split. (I don't understand figure 3.2. Please explain the figure and what each of its parts represents.)
Table 2 shows the first 20 rows and 10 columns of the jm1 data set. The provided data set is split into 10 equal parts through the k-fold cross validation technique (explain this technique and how you performed it). Each part is assigned a number from 1 to 10. Using the cross validation technique, a model is trained k times; in this work k is 10. In the first iteration, part 1 is kept for testing and the model is trained on parts 2 to 10. In the second iteration, part 2 is kept for testing and each model is trained on parts 1 and 3 to 10. In the third iteration, part 3 is kept for testing and the model is trained on parts 1, 2 and 4 to 10. The process was repeated until all ten parts had been used for training as well as testing.
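A minimal sketch of how this 10-fold procedure could be implemented with scikit-learn is shown below; X and y are assumed to hold the features and labels prepared in the pre-processing step, and Logistic Regression is used here only as a stand-in for any of the models compared in this work.
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 10 randomly shuffled folds; StratifiedKFold keeps the defect/non-defect
# ratio similar in every fold (plain KFold would also work here).
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X, y):
    model = LogisticRegression(max_iter=100)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    predictions = model.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx], predictions))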
Table 2: First 20 rows and 10 columns of the jm1 data set (McCabe and Halstead static code metrics), grouped by cross-validation part.

Part | loc | v(g) | ev(g) | iv(g) | n    | v        | l    | d      | i      | e
1    | 1.1 | 1.4  | 1.4   | 1.4   | 1.3  | 1.3      | 1.3  | 1.3    | 1.3    | 1.3
1    | 1   | 1    | 1     | 1     | 1    | 1        | 1    | 1      | 1      | 1
2    | 91  | 9    | 3     | 2     | 318  | 2089.21  | 0.04 | 27.68  | 75.47  | 57833.24
2    | 109 | 21   | 5     | 18    | 381  | 2547.56  | 0.04 | 28.37  | 89.79  | 72282.68
3    | 505 | 106  | 41    | 82    | 2339 | 20696.93 | 0.01 | 75.93  | 272.58 | 1571507
3    | 107 | 25   | 7     | 14    | 619  | 4282.78  | 0.02 | 52.91  | 80.95  | 226588.8
4    | 74  | 11   | 1     | 8     | 294  | 1917.93  | 0.03 | 28.77  | 66.66  | 55178.46
4    | 602 | 136  | 123   | 123   | 2785 | 25942.69 | 0.01 | 105.26 | 246.47 | 2730637
5    | 29  | 2    | 1     | 2     | 140  | 718.1    | 0.1  | 9.93   | 72.35  | 7127.8
5    | 36  | 3    | 1     | 1     | 254  | 1447.91  | 0.04 | 23.72  | 61.05  | 34338.99
6    | 70  | 11   | 1     | 6     | 434  | 3047.71  | 0.04 | 26.63  | 114.46 | 81152.69
6    | 109 | 20   | 12    | 4     | 223  | 1322.55  | 0.04 | 23.91  | 55.32  | 31619.49
7    | 37  | 3    | 1     | 3     | 187  | 1095.44  | 0.04 | 24.97  | 43.87  | 27356.45
7    | 90  | 29   | 8     | 9     | 488  | 3387.95  | 0.05 | 20.69  | 163.74 | 70100.61
8    | 19  | 2    | 1     | 2     | 71   | 351.75   | 0.08 | 12.66  | 27.79  | 4451.81
8    | 152 | 21   | 4     | 5     | 430  | 2850.62  | 0.03 | 33.3   | 85.6   | 94929.66
9    | 22  | 5    | 5     | 4     | 100  | 539.23   | 0.07 | 13.94  | 38.68  | 7516.89
9    | 69  | 12   | 1     | 1     | 536  | 3745.93  | 0.04 | 23.38  | 160.24 | 87570.07
10   | 49  | 9    | 1     | 8     | 191  | 1166.73  | 0.05 | 21.31  | 54.76  | 24859.27
10   | 48  | 3    | 1     | 2     | 248  | 1470.82  | 0.03 | 33.59  | 43.78  | 49408.04
After splitting the data sets, correlation-based feature selection for each dataset was carried out next. The idea is that if there are multiple features which are highly correlated with each other, they might be presenting the same information and hence be redundant. For each feature, this study computed its correlation with all the other features using Pearson's Correlation Coefficient. If two features had a correlation higher than 0.95, one of the features was randomly dropped, resulting in a reduced feature space. The formula of Pearson's correlation is:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
where x is considered as one variable, y is considered as another variable, and x̄ and ȳ are their respective means.
The correlation-based feature selection (CFS) method is a filter approach and therefore independent of the final classification model. It evaluates feature subsets based only on intrinsic properties of the data which, as the name already suggests, are correlations. The goal is to find a feature subset with low feature-to-feature correlation that maintains or increases the predictive power. Figures 3.3, 3.4, 3.5 and 3.6 show the correlation analysis of each dataset computed using Pearson's Correlation Coefficient. (Please note that these are not figures but rather tables. Label them all as tables and explain some entries in each of the tables for easy understanding and meaning. There should be column headers for understanding. What are the values in the tables for? Present the tables and their explanations one after the other for easy referencing and cross-checking.)
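One possible way to perform this correlation-based dropping in code is sketched below; the 0.95 threshold matches the one stated above, while the variable names are assumptions carried over from the pre-processing sketch.
import numpy as np

# Absolute Pearson correlation between every pair of features.
corr = X.corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one feature from every pair whose correlation exceeds 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)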
Table xx (Figure 3.3): Correlation Analysis of the pc1 data set
The pc1 data set (Table xx) has a total of 21 features, of which 11 features have a correlation value higher than 0.95. Seven of the highly correlated features, including branchCount and lOCode, are dropped. (Explain this table further and say what its usefulness is in your research or model development. Why did you drop the seven features that are highly correlated?)
Another of the data sets has 16 features with a correlation value greater than 0.95. The following 13 features are dropped out of these 16. These features are dropped before applying the machine learning models. (These need further explanation of the data set and of why 13 features are dropped. We need to understand these data sets and how and what they are used for in the research.)
The ar4 data set has 29 features, and there are 18 features that have a correlation value greater than 0.95; among these 18 features, 16 features are dropped. All of these features are not utilized for machine learning model training and testing. These features are blank loc, executable_loc, total operands, total operators, halstead vocabulary, halstead length, halstead volume, halstead effort, halstead error, halstead time, branch count, decision count, and call pairs, among others.
The features computed from the datasets are on different scales and have different ranges. For some of the machine learning models, the computed features need to be normalized into a similar range. This study carried out standardization of each feature: the mean was subtracted and the result was then divided by the standard deviation, as given below:
z = (x − μ) / σ
where μ is the mean and σ is the standard deviation of the feature. This technique is also called Z-score Normalization, and the resultant features will have zero mean and unit standard deviation. This enables all the features to have the same scale. Table 3.2 shows a few instances of the kc1 data set and Table 3.3 is the normalized form of the data.
Table 3.3: Normalized (Z-score) values of the e, b, t and lOCode features of the kc1 data set

e        | b        | t        | lOCode
-1.19793 | 2.628743 | -1.19675 | -1.69488
-1.19796 | 1.730539 | -1.19718 | -1.73783
0.510146 | -0.33533 | 0.510053 | 1.010952
-0.28423 | -0.48503 | -0.28455 | -0.19164
-1.00772 | -1.02395 | -1.00825 | -0.87884
-0.68477 | -0.69461 | -0.68521 | -0.27754
-0.27147 | -0.51497 | -0.27179 | -0.01984
1.443663 | -0.06587 | 1.443833 | 0.882103
-0.23371 | -0.51497 | -0.23401 | -0.01984
-0.05717 | -0.45509 | -0.05743 | 0.023107
0.500799 | -0.24551 | 0.500702 | 0.366705
0.065063 | -0.45509 | 0.064837 | 0.538504
2.41529  | 0.443114 | 2.41575  | 1.998797
What do the rows and columns signify in the table above? I mean the headings and the values. How did you normalize? What tools, method and process did you use for normalization? Explain.
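A minimal sketch of this normalization step using scikit-learn's StandardScaler is given below; X_reduced is assumed to be the feature matrix left after the correlation-based feature dropping.
from sklearn.preprocessing import StandardScaler

# Z-score normalization: subtract the mean of each feature and divide by
# its standard deviation so that all features are on the same scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_reduced)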
The feature standardization was followed by the dimensionality reduction step. In this step, Neighborhood Component Analysis (NCA) was used to project the computed features into a new space where they can be linearly separable. In general, researchers use Principal Component Analysis (PCA) for dimensionality reduction, to find components in the new vector space which explain the most variance in the dataset. However, variance is not necessarily correlated with the usability of the features for the classification task at hand. In this case, this study used NCA, which takes the class labels and class separation into account when finding the new projections, because its objective is to maximize the classification accuracy of the projected features. In the NCA formulation, xᵢ denotes the feature vector of the i-th sample and yᵢ its output class label.
You have only explained how PCA works. You have not explained what you did with it or how you applied it in your research. How did you carry out dimensionality reduction with PCA? What tools, method and process did you use for dimensionality reduction? Explain.
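As a hedged sketch of how the NCA projection described above could be computed with scikit-learn (the number of components used here is an illustrative assumption, not a value stated in this study):
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Learn a linear projection that favours class separation rather than
# maximum variance; the class labels y are used during fitting.
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_nca = nca.fit_transform(X_scaled, y)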
The next step is to train and test several machine learning algorithms for the detection of defects in software modules. The selected algorithms in this work represent a wide variety of machine learning algorithms which make different assumptions about the problem at hand. For each model, this study applied the default hyper-parameters for comparison. Each model was trained and tested on each of the datasets.
Also, it is important to note that all selected models are classification models, several of them using ensemble techniques, because of the nature of the data sets, the task at hand, and the expected outcome. All models are built using Python as the development tool alongside libraries such as sklearn, pandas, NumPy and matplotlib. The models that are utilized in this artifact are:
Logistic Regression
Logistic regression is a type of linear classifier that can predict whether a given object lies in class '1' or class '0'. Logistic regression was considered because it is mostly deployed to solve classification problems. Based on the available data and task, the logistic regression model is used to create and ascertain the probability distribution that corresponds to the extracted features. This forms one of the major reasons it has been considered for this research, especially since the most common type of problem it addresses is binary classification. The linear part of the model is
y = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂
where 𝑥₁ and 𝑥₂ are the independent variables and 𝑏₀, 𝑏₁, 𝑏₂ are the weights added to each feature. Logistic regression uses a sigmoidal function for converting the continuous output into 0 or 1 form; it is a threshold-based classifier.
For this research, we set our default logistic regression model with an l2 regularization penalty and a maximum of 100 iterations. The threshold for the sigmoidal function is kept at 0.5: an output of the model with a value less than 0.5 is considered as 0, while an output value of 0.5 or above is considered as 1. The model is trained on the training data and evaluated on the testing data. The evaluation metrics are computed and discussed in the results section of chapter 4.
At first, the model is imported from the sklearn library and the logistic regression model is appended to the list of models.
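A minimal sketch of how this could look, following the same append pattern used for the other models in this work (the short name 'LR' is an assumption; the parameters mirror those stated above):
from sklearn.linear_model import LogisticRegression

# Default l2 penalty and 100 iterations; sklearn applies the 0.5 threshold
# internally when predict() is called. 'models' is the list of (name, model)
# pairs used throughout this work.
models.append(('LR', LogisticRegression(penalty='l2', max_iter=100)))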
You have only explained how LR works. You have not explained what you did with it or how you applied it in your research. Explain how logistic regression was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
K-Nearest Neighbor (KNN)
K-Nearest Neighbor (KNN) is a simple machine learning algorithm based on the supervised learning technique. KNN is chosen for this research because of its ability to solve classification problems like the one being carried out in this research, where the intention is to classify defects and non-defects present in the data set. The KNN algorithm is also used in this research because of its ability to classify new data points into the most suitable category of the data. The steps involved in the KNN algorithm are discussed below:
6. Assign the new data points to the category for which the number of neighbors is maximum.
In this research, the process of building the KNN model using the KNN algorithm included fitting the KNN algorithm on the training data and predicting the test results. After training the model, the results are tested using a new dataset, i.e., the test dataset. The value of k is kept at 2, as there are only two output classes in each data set. The Euclidean distance formula is used to find the distance of each new data point from the training points:
d = √[(x₂ − x₁)² + (y₂ − y₁)²]
where d is the distance between two points, (x₁, y₁) are the coordinates of one point and (x₂, y₂) are the coordinates of the other point.
The KNN model is imported from the sklearn library and initiated through the following code.
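One possible form of that code, as a sketch (the short name 'KNN' and the shared models list are assumptions consistent with the other snippets in this work):
from sklearn.neighbors import KNeighborsClassifier

# k = 2 neighbours; the Euclidean distance is sklearn's default metric.
models.append(('KNN', KNeighborsClassifier(n_neighbors=2)))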
You have only explained how KNN works. You have not explained what you did with it or how you applied it in your research. Explain how KNN was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Random Forest Model
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems and is based on the concept of ensemble learning, which combines multiple classifiers to solve a complex problem and to improve the performance of the model. This algorithm was chosen mainly because of its ensemble learning concept, and because it is believed that detecting and predicting defects in software is a very complex classification problem that requires a sophisticated algorithm like random forest.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, produces the final output.
For this research, the random forest algorithm was used to build the model using selected data points (subsets) and by providing the number N of decision trees to be built. This helps the model to predict and assign new data points to the category with the highest number of occurrences. The depth of each tree is up to three branches. The random state is kept at 0 during the design and training of the model. The model is applied on all four data sets, and the results are discussed in chapter 4.
You have only explained how RF works. You have not explained what you did with it or how you applied it in your research. Explain how RF was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Adaptive boosting (AdaBoost) is also used in building one of the models for this research. This technique, being an ensemble technique that combines different other models inside it, is deemed suitable for this research because of its ability to classify features based on assigned weights. This machine learning technique is similar to the Decision Tree because it also makes use of the decision tree algorithm: it initially assigns equal weights to all data points and then assigns more weight to points that are wrongly classified. The wrongly classified points are then given more priority in the next round of model training, and this is repeated until a lower error is obtained.
5. Update Weights
7. Final Predictions
The random forest classifier is imported from the sklearn library and the model is appended to the 'models' list. The n_estimators parameter is the number of trees used in the random forest model.
from sklearn.ensemble import RandomForestClassifier
models.append(('RF', RandomForestClassifier(n_estimators=100)))
You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how AdaBoost was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Naïve Bayes
Naïve Bayes is a probabilistic algorithm for supervised learning that draws influence from Bayes Theorem. Bayes theorem is a formula that offers the conditional probability of an event A happening given that another event B has already occurred:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the probability of A given B, P(B|A) is the probability of B given A, and P(A) and P(B) are the individual probabilities of A and B.
Now, for the purpose of this research, this research adopted the classification model generated from this theorem. In applying the Naïve Bayes model, this research experimented with the Bernoulli Naïve Bayes classifier, which is used when the predictors are Boolean in nature: each feature x has only two values, i.e., 0 and 1, and the model estimates the probability of successful and failed predictions for each feature.
The Bernoulli Naïve Bayes model is imported from the sklearn library and appended to the models list, and the cross validation procedure is applied on the model to analyze its results. The results are discussed with graphs in chapter 4.
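A sketch of what this step could look like, following the same append pattern as the other models (the short name 'BNB' is an assumed label):
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes with its default parameters.
models.append(('BNB', BernoulliNB()))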
You have only explained how NB works. You have not explained what you did with it or how you applied it in your research. Explain how the NB model was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Adaptive Boosting (AdaBoost)
AdaBoost stands for adaptive boosting. This technique is used for ensemble purposes in machine learning. Mostly, one-level decision trees are used with the AdaBoost technique; such a tree in the AdaBoost model is known as a decision stump. At the start, the AdaBoost model gives equal weights to all data points. After one iteration, the model increases the weights of the data points that were classified wrongly, so in the next iteration these wrongly classified data points are given more consideration, and the process continues until the error on the data points is decreased. The formula for the initial weight of each data point is w(i) = 1/N, where N is the total number of data samples.
The performance (amount of say) of a stump is calculated through the following formula: performance = ½ · ln((1 − Total Error) / Total Error). After finding the stump value, the weights are updated, and the process is repeated.
The AdaBoost model is imported from the ensemble library of sklearn. The model is then stored in the AdaC variable and appended to the models list through the following code.
from sklearn.ensemble import AdaBoostClassifier
models.append(('AdaC', AdaBoostClassifier(n_estimators=100)))
You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how the AdaBoost model was carried out on the data, the problems encountered and their solutions, as well as the tools, methods and processes used.
Extra Trees Classifier
The extremely randomized tree (extra tree) classifier is an ensemble technique that combines the output of multiple trees in a forest. The extra tree classification technique aggregates the results of multiple de-correlated decision trees. This algorithm is similar to random forest but differs in two respects:
Extra tree uses the whole data set for training, while random forest splits the data into subsets.
Extra tree chooses the splitting of nodes randomly, while random forest chooses the optimum split.
The Gini Index is found for each feature through the following formula: Gini = 1 − Σᵢ pᵢ², where pᵢ is the probability of class i. The feature having the lowest Gini Index is selected for the split.
The extra tree classifier is imported from the ensemble library of sklearn. The classifier is stored in the variable ExtC and is appended to the models list for implementation. The number of trees is set through the n_estimators parameter.
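A sketch of this step (the value of n_estimators shown here is an assumption, since the exact number used is not stated):
from sklearn.ensemble import ExtraTreesClassifier

# Extremely randomized trees appended to the same models list.
models.append(('ExtC', ExtraTreesClassifier(n_estimators=100)))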
You have only explained how extra tree works. You have not explained what you did with it or how you applied it in your research. Explain how extra tree was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Artificial Neural Networks (ANNs)
An Artificial Neural Network is one of the models deployed in this research because of its ability to mimic how the human brain processes information. Its structure allows us to use the artificial neurons of the ANN to learn the features of the provided data while also learning the relationship between the dependent and independent variables available in the data. The fundamental working principle of an ANN involves learning by adjusting the different weights between the various neurons in order to learn the relationship between the input features and the output.
The objective function in the case of a neural network is the sum-of-squares error function, which gives the network information as to how incorrect or diverged the output is from the expected result. This information about the error is then used to adjust the weights so as to minimize the error function. The aim of the learning process is to evaluate the error function at each iteration
and re-adjust the weights to attain a local minimum. The concept of a layer in a neural
network is defined as the neurons and their corresponding weights residing in that layer.
Consequently, for every neural network there are three types of layers, namely, Input
layer, Hidden layers, and Output layer. The neuron in every layer first computes the weighted sum of its inputs and then applies the activation function on the weighted input to determine the output as either 0 or 1. There are various activation functions that are used, for instance, the ReLU function, the Signum function and, the most common one, the sigmoid function given as 1 / (1 + e⁻ˣ). Figure 3.9 shows the basic structure of an artificial neural network.
The artificial neural network model is imported from the sklearn library and appended to the models list. The size of the hidden layer is 10 and the maximum number of iterations is 500.
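A sketch of this step, assuming sklearn's MLPClassifier is the neural network implementation referred to above:
from sklearn.neural_network import MLPClassifier

# One hidden layer of 10 neurons and up to 500 training iterations,
# matching the values stated above.
models.append(('ANN', MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)))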
You have only explained how ANN works. You have not explained what you did with it or how you applied it in your research. Explain how ANN was carried out or applied on the data, the problems encountered and their solutions, and the tools, methods and processes used.
Gradient Boosting
Gradient Boosting is an ensemble technique that builds a strong prediction model from several weak prediction models. The aim of the boosting model is to reduce the value of the loss function: as the loss function reduces, the performance of the model increases. The trees in the model predict outputs for the data, and techniques like majority voting or averaging are used for combining them into a final prediction.
The loss value for each data sample is found through the chosen loss function. The information gain for splitting the data is based on the entropy, given by the following formula:
E = −Σᵢ₌₁..C pᵢ log₂(pᵢ)
The gradient boosting classifier is imported from the ensemble library of sklearn. The model is saved in the variable gb and appended to the models list. The number of trees is 100 in this model, with a learning rate of 1.0 and a maximum depth of 1.
from sklearn.ensemble import GradientBoostingClassifier
models.append(('gb', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)))
You have only explained how GB works. You have not explained what you did with it or how you applied it in your research.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a supervised technique that can be used for classification as well as for reducing the dimension of features. The LDA algorithm easily classifies patterns in binary classes. LDA
considers the data in a linear form and draws a hyperplane between the outputs of the features. This hyperplane increases the distance between the means of the two classes and reduces the scatter within each class.
The model is imported and saved in the variable lda. The model is assigned for further calculations using the following Python code.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
models.append(('LDA', LinearDiscriminantAnalysis()))
You have only explained how LDA works. You have not explained what you did with it or how you applied it in your research.
SVM
Support Vector Machine (SVM) is a supervised learning algorithm that is widely used for classification purposes. The SVM model is best suited when the output class label is binary; however, it also performs well in multi-class classification. The SVM model plots the data in 'n' dimensions, where the number of dimensions depends on the number of features in the data. The SVM model draws a hyperplane between the outputs of the data and differentiates the classes on the basis of distances between centres. SVM uses different distance formulas for finding the distance of the class data points from the hyperplane.
The SVM classification library is imported from sklearn. The model is saved in the "SVM" variable and appended to the models list through the following code.
from sklearn.svm import SVC
models.append(('SVM', SVC(gamma=0.1, C=1.)))
You have only explained how SVM works. You have not explained what you did with it or how you applied it in your research. For instance, for all the algorithms or models you have listed the steps that are followed in performing or applying them. Why not carry out those steps and report exactly what you did and how you did each step?
Performance Evaluation
For the performance evaluation of each model, we computed the following metrics for each fold and then computed the average and standard deviation of each metric. In this report, the following four evaluation metrics were computed for each fold and each model. Before discussing the metrics, some terminologies need to be clarified, namely True Positive, False Positive, False Negative and True Negative.
True Positives are values of the testing data that are true and that the model also predicted as true.
False Positives are values of the testing data that are false but that the model predicted as true.
False Negatives are values of the testing data that are true but that the model predicted as false.
True Negatives are values of the testing data that are false and that the model predicted as false.
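A sketch of how these per-fold metrics could be computed for every model with scikit-learn's cross_validate helper; X_scaled and y are assumed to be the prepared features and labels (with the defective class encoded as 1), and 'models' is the list built in the previous sections.
from sklearn.model_selection import cross_validate

# Compute the four metrics in every fold, then report the mean and
# standard deviation per model.
scoring = ['accuracy', 'precision', 'recall', 'f1']
for name, model in models:
    results = cross_validate(model, X_scaled, y, cv=10, scoring=scoring)
    for metric in scoring:
        fold_scores = results['test_' + metric]
        print(name, metric, fold_scores.mean(), fold_scores.std())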
You have only explained what performance evaluation is. Explain how you carried out performance evaluation in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.1 Accuracy
This is the most common type of scoring method and essentially the most misused one. This type of method is viable for certain types of classification problems. It is calculated as the total number of correct predictions made over the total number of predictions. The formula of accuracy is:
Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
You have only explained what accuracy is. Explain how you carried out the process of finding accuracy in this research. What are the tools, methods and processes applied during the process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.2 Precision
Precision is defined as the correctly predicted classifications over the total predicted classifications. High precision relates to the model's low error in classifying data points that do not belong to a certain class as members of that class. The precision metric becomes more important when the class labels are imbalanced and the prediction of the true class is important.
3.2.6.3 Recall
Recall is defined as the correctly predicted classifications over all the classifications of members of a certain class. This metric gives information as to how many objects that belong to the class in question were missed or classified outside that class.
Recall = True Positive / (True Positive + False Negative)
You have only explained what recall is. Explain how you carried out the process of finding the recall value in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
3.2.6.4 F1-Score
The F1-Score essentially brings in both the False Positives and the False Negatives to weigh the error in decision making. It is defined as the harmonic mean of precision and recall. Ideally, we would want to list all the true positive observations that exist for a particular class while being careful to omit all those that do not belong to that class. If we could do that, then we would have both high precision and high recall, and this consequently ensures a high F1-Score for the model. An important thing to note here is that even if the precision is remarkably high, having a low recall will always dominate and pull the F1-Score down.
You have only explained what the F1-score is. Explain how you carried out the process of finding the F1-score value in this research. What are the tools, methods and processes applied during the performance evaluation process? What are the problems encountered, if any, and how did you resolve them?
You have to design or develop a new model, or an extension of an existing model, just as your topic, statement of the problem and aim have suggested in chapter one. I have not seen this in your chapter three or four. After developing a new model, you then apply it on the same data sets; then you can do a comparison to know whether the new model is better than the existing models or not. Add model development in chapter three.
CHAPTER FOUR
You have to follow your research design steps in presenting the whole of chapter four and in the order listed below, for example:
3. Feature normalization/standardization
8. Developed model
9. Discussions