
CHAPTER THREE

RESEARCH METHODOLOGY

3.0 Introduction:

Research methods include all the techniques and procedures used in conducting research, whereas research methodology is the approach through which research problems are solved systematically. It is the science of studying how research is conducted. In this field the researcher acquaints himself with the different steps generally taken to study a research problem. Hence, the scientific approach adopted for conducting research is referred to as methodology (Shanti & Shashi, 2017).

According to Irny and Rose (2005), methodology does not set out to provide solutions; it is, therefore, not the same thing as a method. Instead, it offers the theoretical underpinning for understanding which method, set of methods, or best practices can be applied to specific cases to derive specific results.

In this chapter the research methods are presented alongside the research design, which details all the processes, methods and tools applied during the research.

3.1 Research Design

Research design has been described as a blueprint or outline used in carrying out research in a way that exercises maximum control over the factors that affect the validity of the results (Polit & Hungler, 1999).

This work presents a comparative study of several machine learning techniques for the detection of defects in software code snippets. The machine learning algorithms tested in this work include Logistic Regression, K-Nearest Neighbour (KNN), Decision Trees, Gaussian Naive Bayes, Random Forest, Bernoulli Naive Bayes, Adaptive Boosting, Neural Networks and Support Vector Machines.


This work was designed using Neighborhood Component Analysis (NCA) to project the computed features into a new space where they become more linearly separable, as opposed to the widely and commonly used Principal Component Analysis (PCA). Neighborhood Component Analysis is a non-parametric method for selecting features with the goal of maximizing the prediction accuracy of regression and classification algorithms.

These techniques were tested on multiple publicly available datasets, including the mozilla4, Kc1, Ar1 and Pc1 datasets. Details about the datasets are provided in Table 1 below, including the number of samples in each dataset, the language of the original source code and the number of features computed from each dataset. Each of these datasets has a different number of computed features, which makes this a challenging problem to solve. The idea was to evaluate and compare multiple models and select the subset which performs best on this task.

The research design consists of the processes that were followed to achieve the research objectives. These processes include data collection, data preparation, data processing, data analysis, model building and model evaluation.

3.1.1 Description of the datasets:

The datasets in this work were taken from the PROMISE and NASA repositories, which are publicly available in the NASA Metrics Data Program repository. A description of each dataset is given below:

Mozilla4: This dataset was written in the C++ programming language. It consists of 15544 modules. The authors provide a pre-processed version of the dataset in which they computed five static code attributes (features), including McCabe, Halstead and LOC measures, plus one label indicating the defect or non-defect status of each module.

Kc1: This dataset also contains C++ code and is provided with 43 pre-computed features related to KCOL. The authors provide a total of 2108 modules/samples, of which 325 modules belong to the defective class.


Pc1: This dataset was written in the C programming language and includes a total of 40 pre-computed features for each module/sample. It contains a total of 1107 samples, 76 of which are defective modules.

Ar1 & Ar4: These datasets were also written in the C programming language. Each of them contains a total of 121 modules/samples, and each module has 30 pre-computed features. In each of these datasets there are 8 defective examples and 112 non-defective examples.

As can be observed from the descriptions above, these datasets have different numbers of features, and the features differ in nature. Similarly, all of these datasets are examples of imbalanced datasets in which the defective class is rare.

Table 1: Description of the datasets

Data Set    No. of Modules    Language    No. of Attributes
Mozilla4    15544             C++         6
KC1         2108              C++         43
Ar1         121               C           30
Ar4         121               C           30
Pc1         1107              C           40

Taken from: Dynamic Detection of Software Defects Using Supervised Learning Techniques

3.2 Data Processing, Feature Computation and Model Training

The flowchart below shows the different stages of the design. Detailed descriptions of each step are given below.

New model development should be part of your research design. After using the existing models, you have to come up with or design a new model, or an extension of an existing model, for better performance. This is what your research topic has suggested.
3.2.1. Data Pre-Processing

The first step in carrying out this experiment was to pre-process the data to ensure that only relevant features are left for the next process. Data pre-processing for this study involved cleaning and removing unwanted and irrelevant rows and columns from the data sets, and was geared towards reducing the execution time and complexity of the machine learning models. All the data sets were double-checked for null values, and the data was normalized using the standard scaler technique discussed in Section 3.2.4.

3.2.2 Data preparation (train and test split):

Upon completing the data pre-processing, the next task was to prepare and split the data sets into two parts, one used as the training data and the other used as the test data. In this case, to evaluate the performance of each model, a 10-fold cross-validation process was performed in which each dataset was divided into 10 randomly chosen splits/folds. In the first iteration, 9 of the folds were used as the training set and the 10th fold was used as the test set. Performance metrics such as accuracy, precision and recall were computed for the test set. This process was repeated 10 times so that each fold was used once as the test set and was involved in the training process in the remaining iterations. Figure 3.2 shows the processes that are performed in each fold of the train and test split. (I don't understand figure 3.2. Please explain the figure and what each of the boxes represents.)


Figure 3.2: K-Fold Cross Validation (what is k-fold cross validation? And how did you
perform it?)

Table 2 shows the first 20 rows and 10 columns of the jm1 data set. The data set is split into 10 equal parts using the k-fold cross-validation technique (explain this technique and how you performed it). Each part is assigned a number from 1 to 10. Using cross-validation, a model is trained k times; in this work k is 10. In the first iteration, part 1 is kept for testing and the model is trained on parts 2 to 10. In the second iteration, part 2 is kept for testing and each model is trained on parts 1 and 3 to 10. In the third iteration, part 3 is kept for testing and the model is trained on parts 1, 2 and 4 to 10. The process is repeated until all ten parts have been used for both training and testing.

Table 2: First 20 rows and 10 columns of the jm1 data set (static code metric values). The left-hand column gives the cross-validation part (1 to 10) to which each pair of rows is assigned.

Part  Metric values
1     1.1   1.4   1.4   1.4   1.3   1.3       1.3   1.3    1.3     1.3
1     1     1     1     1     1     1         1     1      1       1
2     91    9     3     2     318   2089.21   0.04  27.68  75.47   57833.24
2     109   21    5     18    381   2547.56   0.04  28.37  89.79   72282.68
3     505   106   41    82    2339  20696.93  0.01  75.93  272.58  1571507
3     107   25    7     14    619   4282.78   0.02  52.91  80.95   226588.8
4     74    11    1     8     294   1917.93   0.03  28.77  66.66   55178.46
4     602   136   123   123   2785  25942.69  0.01  105.26 246.47  2730637
5     29    2     1     2     140   718.1     0.1   9.93   72.35   7127.8
5     36    3     1     1     254   1447.91   0.04  23.72  61.05   34338.99
6     70    11    1     6     434   3047.71   0.04  26.63  114.46  81152.69
6     109   20    12    4     223   1322.55   0.04  23.91  55.32   31619.49
7     37    3     1     3     187   1095.44   0.04  24.97  43.87   27356.45
7     90    29    8     9     488   3387.95   0.05  20.69  163.74  70100.61
8     19    2     1     2     71    351.75    0.08  12.66  27.79   4451.81
8     152   21    4     5     430   2850.62   0.03  33.3   85.6    94929.66
9     22    5     5     4     100   539.23    0.07  13.94  38.68   7516.89
9     69    12    1     1     536   3745.93   0.04  23.38  160.24  87570.07
10    49    9     1     8     191   1166.73   0.05  21.31  54.76   24859.27
10    48    3     1     2     248   1470.82   0.03  33.59  43.78   49408.04
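A minimal sketch of this 10-fold procedure, assuming scikit-learn's KFold and an illustrative stand-in for the data (the actual experiments use the datasets listed in Table 1):

import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for one of the prepared datasets: 20 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))      # feature matrix
y = rng.integers(0, 2, size=20)    # defect (1) / non-defect (0) labels

# Split into 10 randomly chosen folds; each fold is used exactly once as the test set.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ...train a model on (X_train, y_train) and evaluate it on (X_test, y_test)...
    print(f'Iteration {i}: {len(train_idx)} training samples, {len(test_idx)} test samples')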

3.2.3 Correlation based Feature Selection:

After splitting the data sets, correlation-based feature selection was carried out for each dataset. The idea is that if multiple features are highly correlated with each other, they are likely presenting the same information and are hence redundant. For each feature, this study computed its correlation with all the other features using Pearson's correlation coefficient. If two features had a correlation higher than 0.95, one of the two features was randomly dropped, resulting in a reduced feature space. The formula for Pearson's correlation is:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

where x is one variable, y is the other variable, x̄ is the mean of the x column and ȳ is the mean of the y column.

The correlation-based feature selection (CFS) method is a filter approach and is therefore independent of the final classification model. It evaluates feature subsets based only on intrinsic properties of the data, namely, as the name suggests, correlations. The goal is to find a feature subset with low feature-feature correlation, to avoid redundancy, and high feature-class correlation, to maintain or increase predictive power. Figure 3.3, figure 3.4, figure 3.5 and figure 3.6 (please, these are not figures but rather tables. Label all as tables and explain some entries in each of the tables for easy understanding and meaning. There should be column headers for understanding. What are the values in the tables for? Present the tables and their explanations one after the other for easy referencing and cross-checking) show the computation of the feature correlations using Pearson's correlation coefficient.
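A minimal sketch of this correlation-based filtering, assuming pandas is used to compute the Pearson correlation matrix and to drop one feature from every pair whose correlation exceeds 0.95 (the function name and the pc1.csv path in the usage comment are illustrative, not the study's own code):

import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    # Absolute Pearson correlation between every pair of features.
    corr = df.corr(method='pearson').abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])   # keep the first feature of the pair, drop the second
    return df.drop(columns=sorted(to_drop))

# Hypothetical usage:
# features = pd.read_csv('pc1.csv').drop(columns=['defects'])
# reduced_features = drop_highly_correlated(features)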
Table xx: Figure 3.3: Correlation Analysis of pc1 data set

The pc1 data set (Table xx) has a total of 21 features, 11 of which have a correlation value higher than 0.95. Seven of these highly correlated features are dropped after the correlation analysis: V, B, T, total_Op, total_Opnd, branchCount and lOCode. (Explain this table further and say what its usefulness is in your research or model development. Why did you drop the seven features that are highly correlated?)

Table yy: Figure 3.5: Correlation Analysis of ar1 data set


The ar1 data set (Table yy) has a total of 29 features. Among these, 16 features have a correlation value greater than 0.95, and the following 13 of those 16 are dropped before applying the machine learning models: executable_loc, total_operands, total_operators, halstead_vocabulary, halstead_length, halstead_volume, halstead_error, halstead_time, branch_count, decision_count, condition_count, cyclomatic_complexity and design_complexity. (Give further explanation of the data set and why 13 features are dropped. We need to understand these data sets and how and what they are used for in the research.)



Table yx: Figure 3.6: Correlation Analysis of ar4 data set

The ar4 data set also has 29 features, 18 of which have a correlation value greater than 0.95; 16 of these 18 features are dropped and are not used for machine learning model training and testing. These features are blank_loc, executable_loc, total_operands, total_operators, halstead_vocabulary, halstead_length, halstead_volume, halstead_effort, halstead_error, halstead_time, branch_count, decision_count, call_pairs, condition_count, cyclomatic_complexity and design_complexity. (Same as above.)

3.2.4 Data Standardization/Normalization:

The features computed from the datasets are on different scales and have different ranges. For some of the machine learning models, the computed features need to be normalized to a similar range. This study therefore standardized each feature: the mean of the feature was subtracted and the result divided by the feature's standard deviation, as given below:

z = (x − μ) / σ

where μ is the mean of the feature and σ its standard deviation. This technique is also called Z-score normalization; the resulting features have a mean of 0 and a standard deviation of 1, which places all the features on the same scale. Table 3.2 shows a few instances of the kc1 data set and Table 3.3 shows the normalized form of the same data.

Table 3.2: Sample of the kc1 data set before normalization (feature columns e, b, t and lOCode)

e         b     t        lOCode
1.3       1.3   1.3      2
1         1     1        1
21378.61  0.31  1187.7   65
11436.73  0.26  635.37   37
2381.95   0.08  132.33   21
6423.73   0.19  356.87   35
11596.34  0.25  644.24   41
33061.94  0.4   1836.77  62
12069     0.25  670.5    41
14278.39  0.27  793.24   42
21261.63  0.34  1181.2   50
15808.22  0.27  878.23   54
45222.23  0.57  2512.35  88

(What do the rows and columns signify in the table above? I mean the headings and the values?)

Table 3.3: Normalized data (Z-score normalized form of the kc1 sample in Table 3.2)

e         b         t         lOCode
-1.19793  2.628743  -1.19675  -1.69488
-1.19796  1.730539  -1.19718  -1.73783
0.510146  -0.33533  0.510053  1.010952
-0.28423  -0.48503  -0.28455  -0.19164
-1.00772  -1.02395  -1.00825  -0.87884
-0.68477  -0.69461  -0.68521  -0.27754
-0.27147  -0.51497  -0.27179  -0.01984
1.443663  -0.06587  1.443833  0.882103
-0.23371  -0.51497  -0.23401  -0.01984
-0.05717  -0.45509  -0.05743  0.023107
0.500799  -0.24551  0.500702  0.366705
0.065063  -0.45509  0.064837  0.538504
2.41529   0.443114  2.41575   1.998797

(What do the rows and columns signify in the table above? I mean the headings and the values? How did you normalize? What tools, method and process did you use for normalization? Explain.)
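A minimal sketch of this standardization step, assuming scikit-learn's StandardScaler (the "standard scaler technique" referred to in Section 3.2.1); the values are a small subset of Table 3.2:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative subset of the kc1 features shown in Table 3.2.
data = pd.DataFrame({
    'e': [21378.61, 11436.73, 2381.95],
    'b': [0.31, 0.26, 0.08],
    't': [1187.7, 635.37, 132.33],
    'lOCode': [65, 37, 21],
})

# Z-score normalization: subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
normalized = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(normalized)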

3.2.5 Dimensionality Reduction (Neighborhood Component Analysis):

The feature standardization was followed by a dimensionality reduction step. In this step, Neighborhood Component Analysis (NCA) was used to project the computed features into a new space where they become more linearly separable. In general, researchers use unsupervised dimensionality reduction techniques such as Principal Component Analysis (PCA), which finds components in the new vector space that explain the most variance in the dataset. However, variance is not necessarily related to how useful the features are for the classification task at hand. This study therefore used NCA, a supervised dimensionality reduction technique which takes the class labels and class separation into account when finding the new projections, because it has the potential to give a better separation. The training data can be represented mathematically as:

S = {(xᵢ, yᵢ), i = 1, 2, …, n}

where xᵢ is the feature vector of the i-th sample and yᵢ is its output class label.
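A minimal sketch of this projection, assuming scikit-learn's NeighborhoodComponentsAnalysis (available from scikit-learn 0.21 onwards); X and y are stand-ins for the standardized features and defect labels:

from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Stand-in for the standardized feature matrix X and defect labels y.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Supervised projection: NCA uses the class labels to learn a transformation
# that improves class separation, here reducing the features to two components.
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_projected = nca.fit_transform(X, y)
print(X_projected.shape)  # (200, 2)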

You have only explained how PCA works. You have not explained what you did with it or how you applied it in your research. How did you carry out dimensionality reduction with PCA? What tools, methods and processes did you use for dimensionality reduction? Explain.

3.2.6 Machine Learning Models:

The next step was to train and test several machine learning algorithms for the detection of defects in software modules. The algorithms selected in this work represent a wide variety of machine learning approaches which make different assumptions about the problem at hand. For each model, this study applied the default hyper-parameters for comparison, and each model was evaluated using the 10-fold cross-validation technique.

It is also important to note that all selected models are classification models, several of them based on ensemble techniques, because of the nature of the data sets, the task at hand, and the expected outcome. All models were built using Python as the development tool alongside libraries such as scikit-learn (sklearn), pandas, NumPy and matplotlib. The models utilized in this artifact are discussed one by one, and the sketch below shows how they are collected and evaluated together.
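The following is a minimal, illustrative sketch of that setup, assuming the scikit-learn API used in the per-model snippets below (the data here is a synthetic stand-in for the pre-processed datasets):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the study X and y come from the pre-processed datasets.
X, y = make_classification(n_samples=300, n_features=15, weights=[0.9, 0.1], random_state=0)

# (name, estimator) pairs; the remaining classifiers are appended in the same way below.
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))

# Evaluate every model with 10-fold cross-validation and report mean accuracy.
for name, model in models:
    scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')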

• Logistic Regression

Logistic regression is a type of linear classifier that predicts whether a given object belongs to class '1' or class '0'. Logistic regression was considered because it is widely deployed to solve classification problems. Based on the available data and task, the logistic regression model is used to estimate the probability distribution that corresponds to the extracted features. A major reason it was considered for this research is that the most common type of problem addressed by a logistic regression classifier is one where the dependent variable is binary. The underlying regression equation is represented as

y = b₀ + b₁x₁ + b₂x₂

where y is the dependent variable that needs to be predicted, x₁ and x₂ are independent variables, and b₀, b₁ and b₂ are the weights assigned to each feature. Logistic regression uses the sigmoidal function to convert the continuous output into 0 or 1 form, using a threshold that is set before modelling. The formula of the sigmoidal function is p = 1 / (1 + e^(−y)).

Figure 3.7 shows the graph of sigmoidal function.



Figure 3.7: Graph of sigmoidal function

For this research, the logistic regression model was kept at its defaults: an l2 regularization penalty and a maximum of 100 iterations. The threshold for the sigmoidal function was kept at 0.5: a model output below 0.5 is considered 0, while an output of 0.5 or above is considered 1. The model was trained on the training data and evaluated on the testing data, and the evaluation metrics are reported and discussed in the results section of Chapter 4.

At first, the model is imported from the sklearn library. The logistic regression model is initialized and stored under the name 'LR'.

from sklearn.linear_model import LogisticRegression


models.append(('LR', LogisticRegression()))
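A minimal sketch of the 0.5 decision threshold described above (illustrative stand-in data; scikit-learn's predict() applies the same default threshold internally):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data; penalty='l2' and max_iter=100 are scikit-learn's defaults, matching the text.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
LR = LogisticRegression(penalty='l2', max_iter=100).fit(X, y)

probabilities = LR.predict_proba(X)[:, 1]          # sigmoid output: P(class 1) per module
predictions = (probabilities >= 0.5).astype(int)   # apply the 0.5 threshold -> 0 or 1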

You have only explained how LR works. You have not explained what you did with it or how you applied it in your research. Explain how logistic regression was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.

• K-Nearest Neighbour (KNN)

K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique. KNN was chosen for this research because of its ability to solve classification problems like the one being carried out here, where the intention is to classify the defects and non-defects present in the data sets, and because of its ability to assign new data points to the most suitable category of the data. The steps involved in the KNN algorithm are discussed below:

1. Provide the training data and their output class labels to the model.

2. Find and select the optimal value of K.

3. Compute the distance between a new data point and the existing data points.

4. Select the K nearest neighbours according to the distance formula.

5. Count the output classes among the K nearest neighbours.

6. Assign the new data point to the category for which the number of neighbours is maximum.

7. The KNN model is created.

In this research, the process of building the KNN model included fitting the KNN algorithm on the training data and predicting the test results. After training the model, the results were tested on new data, i.e., the test dataset. The value of k was kept at 2, as there are only two output classes in each data set. The Euclidean distance formula was used to find the distance between data points:

d = √[(x₂ − x₁)² + (y₂ − y₁)²]

where d is the distance between two points, (x₂, y₂) are the coordinates of one point and (x₁, y₁) are the coordinates of the other point.

The KNN model is imported from the sklearn library and initialized with the following code.

from sklearn.neighbors import KNeighborsClassifier


models.append(('KNN', KNeighborsClassifier()))
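Note that KNeighborsClassifier() in the snippet above is instantiated with its defaults (n_neighbors=5); a configuration that explicitly matches the stated k = 2 and Euclidean distance would look like the following (illustrative, not the study's own code):

from sklearn.neighbors import KNeighborsClassifier

# Explicit configuration matching the description above (k = 2, Euclidean distance).
knn = KNeighborsClassifier(n_neighbors=2, metric='euclidean')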

You have only explained how KNN works. You have not explained what you did with it or how you applied it in your research. Explain how KNN was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
• Random Forest Model

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. This algorithm was chosen mainly because of its ensemble learning concept, since detecting and predicting defects in software is a complex classification problem that requires a sophisticated algorithm such as random forest. As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output.

For this research, the random forest algorithm was used to build a model from selected subsets of data points and a specified number N of decision trees. This helps the model predict and assign new data points to the category with the highest number of occurrences. The steps involved in creating the random forest classification model are:

1. Create subsets of the training data using the replacement (bootstrap) technique.

2. Create a tree for each subset of data.

3. Combine the outputs of all decision trees.

4. Choose the output having the majority vote.


The number of trees used in designing the random forest model is 100, the maximum depth of each tree is three, and the random state is kept at 0 during the design and training of the model (a configuration sketch is given below). The model is applied to all the data sets and the results are discussed in Chapter 4.
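As a hedged illustration (not the study's own code), a RandomForestClassifier configured as described above would look like this; the snippet further below only sets n_estimators, so max_depth and random_state are added here to reflect the stated design:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, maximum depth of 3, fixed random state, as described above.
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)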

You have only explained how RF works. You have not explained what you did with it or how you applied it in your research. Explain how RF was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.

Adaptive boosting (AdaBoost) was also used to build one of the models for this research. AdaBoost is a boosting machine learning technique that combines ensemble techniques, thereby combining several other models inside it, and is deemed suitable for this research because of its ability to classify features based on assigned weights. The technique is similar to the decision tree because it also makes use of the decision tree algorithm: it first assigns equal weight to all data points and then assigns more weight to points that are wrongly classified. The wrongly classified points are given more priority in the next round of model training, and this is repeated until a lower error is reached.

The algorithm for creating the AdaBoost model is defined in the steps below:

1. Assign equal weights to all the observations.

2. Classify random samples using stumps.

3. Calculate the total error.

4. Calculate the performance of the stump.

5. Update the weights.

6. Repeat the weight updates in each iteration.

7. Make the final predictions.

The random forest classifier is imported from the sklearn library and appended to the 'models' list; n_estimators is the number of trees used in the random forest model.
from sklearn.ensemble import RandomForestClassifier
models.append(('RF', RandomForestClassifier(n_estimators=100)))

You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how AdaBoost was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.

• Bernoulli Naïve Bayes Model

Naive Bayes is a basic but effective probabilistic classification model in machine learning that draws on Bayes' theorem. Bayes' theorem is a formula that gives the conditional probability of an event A happening given that another event B has already happened. Its mathematical formula is as follows:

P(A|B) = P(B|A) · P(A) / P(B)

where:

• A and B are two events;
• P(A|B) is the probability of event A given that event B has already happened;
• P(B|A) is the probability of event B given that event A has already happened;
• P(A) is the independent probability of A;
• P(B) is the independent probability of B.

For the purposes of this research, the classification form of Bayes' theorem was adopted, which is represented as

P(y|X) = P(X|y) · P(y) / P(X)

where:

• X = (x₁, x₂, x₃, …, xN) is the list of independent predictors;
• y is the class label;
• P(y|X) is the probability of label y given the predictors X.

In applying the Naïve Bayes model, this research experimented with the Bernoulli Naïve Bayes classifier, which is used when the predictors are Boolean (binary) in nature. The Bernoulli distribution it is based on is given below:

P(x) = pˣ · q^(1−x), with q = 1 − p

In the above formula, p is the probability of a successful prediction, q is the probability of a failed prediction, and x can take only the two output class labels, i.e., 0 and 1.

The Bernoulli Naïve Bayes model is imported from the sklearn library and appended to the list of models.

from sklearn.naive_bayes import BernoulliNB


models.append(('BND', BernoulliNB()))
Since the problem here is binary classification, the Bernoulli Naïve Bayes model was applied and its results analysed; the results are discussed with graphs in Chapter 4.

You have only explained how NB works. You have not explained what you did with it or how you applied it in your research. Explain how the NB model was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.

• AdaBoost Classification Model

AdaBoost stands for adaptive boosting. This technique is used for ensemble purposes in machine learning. Usually one-level decision trees are used with the AdaBoost technique; such a tree in the AdaBoost model is known as a decision stump. At the start, the AdaBoost model gives equal weights to all data points. After one iteration, the model increases the weights of the data points that are classified wrongly, so in the next iteration these wrongly classified data points are given more consideration, and the process continues until the error on the data points is reduced. The formula for the initial weights is given below:

w = 1 / N

where N represents the total number of data points.

The steps of the AdaBoost classification model are:

• Assign an equal weight to each data point.
• Calculate the Gini index for each feature.
• Select the feature with the lowest Gini index as the root feature.
• Calculate the performance of the stump.
• Update the weights of the data points.
• Repeat the above steps to reduce the error.

The Gini index is found for each feature using the standard Gini impurity formula, Gini = 1 − Σᵢ pᵢ², and the feature having the smallest Gini index is considered the root feature. The performance of the stump is then calculated (½ · ln((1 − TotalError) / TotalError) in the standard AdaBoost formulation); after finding the stump value, the weights are updated and the process is repeated.
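For reference, a minimal sketch of the Gini index computation (the standard Gini impurity formula; illustrative, not taken from the study's code):

def gini_index(class_probabilities):
    # Gini impurity: 1 minus the sum of the squared class probabilities.
    return 1.0 - sum(p ** 2 for p in class_probabilities)

# Example: a node holding 80% non-defective and 20% defective samples.
print(gini_index([0.8, 0.2]))  # approximately 0.32; lower values indicate purer splits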

The AdaBoost model is imported from the ensemble library of sklearn, stored under the name 'AdaC' and appended to the list of models through the following code.

from sklearn.ensemble import AdaBoostClassifier
models.append(('AdaC', AdaBoostClassifier(n_estimators=100)))

You have only explained how AdaBoost works. You have not explained what you did with it or how you applied it in your research. Explain how the AdaBoost model was carried out on the data, the problems encountered and their solutions, as well as the tools, methods and processes used.

• Extra Tree Classification Model

The extremely randomized tree (extra tree) classifier is an ensemble technique that combines the output of multiple trees in a forest; it aggregates the results of multiple de-correlated decision trees. The algorithm is similar to random forest but differs in two ways: the extra tree classifier uses the whole data set for training each tree, while random forest trains each tree on a subset of the data; and the extra tree classifier chooses the splits of the nodes randomly, while random forest chooses the optimum split.

The steps involved in extra tree classification are:

1. Select the whole data set for classification.

2. Provide k features for each tree.

3. Select the best feature for splitting.

4. Use the Gini index for feature splitting.

5. Create multiple de-correlated decision trees.

6. Combine the output of all trees.

7. Select the final output through majority voting or averaging.

The Gini index is found for each feature using the same Gini impurity formula given above, and the feature with the smallest Gini index is considered the root feature.

The extra tree classifier is imported from the ensemble library of sklearn, stored in the variable ExtC and appended to the list of models for implementation. The number of trees used by the classifier is 100.

from sklearn.ensemble import ExtraTreesClassifier


models.append(('ExtC', ExtraTreesClassifier(n_estimators=100)))

You have only explained how the extra tree classifier works. You have not explained what you did with it or how you applied it in your research. Explain how extra tree classification was carried out on the data, the problems encountered and their solutions, and the tools, methods and processes used.
• Artificial Neural Networks (ANNs)

An Artificial Neural Network (ANN) is one of the models deployed in this research because of its ability to mimic how the human brain processes information. Its structure allows the artificial neurons of the ANN to learn the features of the provided data while also learning the relationship between the dependent and independent variables available in the data. The fundamental working principle of an ANN involves learning by adjusting the weights between the various neurons so as to capture the relationship between the dependent and the independent variables.

The objective function in the case of this neural network is the sum-of-squares error function, which tells the network how far its output diverges from the expected result. This information about the error is then used to adjust the weights in a manner that further reduces the error function. The aim of the learning process is to evaluate the error function at each iteration and re-adjust the weights to attain a minimum. A layer in a neural network is defined as the neurons and their corresponding weights residing in that layer; consequently, every neural network has three types of layers: the input layer, the hidden layers and the output layer. The neuron in every layer first computes the weighted input and then applies an activation function to the weighted input to determine the output as either 0 or 1. Various activation functions are used, for instance the ReLU function, the signum function and, most commonly, the sigmoid function given as 1 / (1 + e^(−x)). Figure 3.9 shows the basic structure of the layers of the ANN model.


Figure 3.9: ANN Architecture

The artificial neural network model is imported from the sklearn library and appended to the list of models. The size of the hidden layer is 10 and the maximum number of iterations is 500.

from sklearn.neural_network import MLPClassifier


models.append(('MLP', MLPClassifier(alpha=0.0001, hidden_layer_sizes=10,
max_iter=500)))

You have only explained how an ANN works. You have not explained what you did with it or how you applied it in your research. Explain how the ANN was carried out or applied on the data, the problems encountered and their solutions, and the tools, methods and processes used.

• Gradient Boosting Model

Gradient boosting is an ensemble technique that is based on decision trees, which act as weak prediction models. The aim of the boosting model is to reduce the value of a loss function; as the loss function decreases, the performance of the model increases. The trees in the model predict outputs for the data, and techniques such as majority voting or averaging are used to obtain the final output.


The steps of the gradient boosting model are:

1. Create the training data.

2. Find the loss value through the loss function.

3. Split the data for optimal feature selection.

4. Use information gain for splitting.

5. Use majority voting for the final prediction.

The loss value for each data sample is found through the model's loss function. The information gain used for splitting the data is based on the entropy

E = − Σᵢ₌₁ᶜ pᵢ log₂(pᵢ)

where pᵢ is the proportion of samples of class i at the node and C is the number of classes.

The gradient boosting classifier is imported from the ensemble library of sklearn, saved in the variable gb and appended to the list of models. The number of trees in this model is 100, with a learning rate of 1, a maximum depth of 1 and a random state of 0.

from sklearn.ensemble import GradientBoostingClassifier
models.append(('gb', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                                max_depth=1, random_state=0)))

You have only explained how GB works. You have not explained what you did with it or how you applied it in your research.

• Linear Discriminant Analysis Model

Linear Discriminant Analysis (LDA), also called normal discriminant analysis, is used to reduce the dimensionality of the features. The LDA algorithm easily classifies patterns into binary classes. LDA treats the data as linear and draws a hyperplane between the outputs of the features; this hyperplane increases the distance between the means of the two classes and reduces the variance within each class.

Figure 3.10: LDA Example

The steps involved in LDA are the following:

1. Calculate the dimensions for each output class.
2. Calculate the scatter matrices for each output class.
3. Calculate the eigenvectors.
4. Sort the eigenvectors.
5. Transform the data using the sorted eigenvectors.

The model is imported, saved in the variable lda and assigned for further calculations using the following Python code.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
models.append(('LDA', LinearDiscriminantAnalysis()))
You have only explained how LDA works. You have not explained what you did with it or how you applied it in your research.

• SVM

A Support Vector Machine (SVM) is a supervised learning algorithm that is widely used for classification. The SVM model is best suited to cases where the output class label is binary, although it also performs well in multi-class classification. The SVM model plots the data in an n-dimensional space, where the number of dimensions depends on the number of features in the data. It then draws a hyperplane between the classes and differentiates them on the basis of their distances (margins) from that hyperplane, using different distance or kernel functions to do so.

The steps involved in the SVM algorithm are as follows:

1. Identify the classes in the output labels.

2. Consider one class as 0 and the other as 1.

3. Use a cost function or kernel function to find the margin between the labels.

The SVM classifier is imported from sklearn, saved under the name 'SVM' and appended to the list of models. The gamma value is 0.1 with C = 1.

from sklearn.svm import SVC
models.append(('SVM', SVC(gamma=0.1, C=1.)))

You have only explained how SVM works. You have not explained what you did with it or how you applied it in your research. For instance, for all the algorithms or models you have listed the steps that are followed in performing or applying them. Why not carry out those steps and report exactly what you did and how you did it?
Performance Evaluation

For the performance evaluation of each model, the following metrics were computed for each fold, and then the average and standard deviation of each metric were computed. In this report, the following four evaluation metrics were computed for each fold and each model. Before discussing the metrics, some terminology needs to be clarified, namely True Positive, False Positive, True Negative and False Negative:

• True Positives are test samples that are actually positive and that the model also predicted as positive.

• False Positives are test samples that are actually negative but that the model predicted as positive.

• False Negatives are test samples that are actually positive but that the model predicted as negative.

• True Negatives are test samples that are actually negative and that the model also predicted as negative.
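A minimal sketch of how these per-fold metrics could be computed, assuming scikit-learn's cross_validate and an illustrative classifier and dataset (the study's own pipeline is assumed to follow the steps described above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Stand-in for a prepared dataset with a rare defective class.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)

# Accuracy, precision, recall and F1 are computed on the test fold of each of the 10 splits,
# then the mean and standard deviation of each metric are reported.
scoring = ['accuracy', 'precision', 'recall', 'f1']
results = cross_validate(RandomForestClassifier(n_estimators=100), X, y, cv=10, scoring=scoring)
for metric in scoring:
    values = results['test_' + metric]
    print(f'{metric}: {np.mean(values):.3f} (std {np.std(values):.3f})')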

You have only explained what performance evaluation is. Explain how you carried out performance evaluation in this research. What tools, methods and processes were applied during the performance evaluation process? What problems were encountered, if any, and how did you resolve them?

3.2.6.1 Accuracy

This is the most common scoring method and, essentially, the most misused one; it is viable only for certain types of classification problems. It is calculated as the total number of correct predictions made over the total number of predictions. The formula for accuracy is given below:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

You have only explained what accuracy is. Explain how you carried out the process of finding accuracy in this research. What tools, methods and processes were applied during the process? What problems were encountered, if any, and how did you resolve them?

3.2.6.2 Precision

Precision is defined as the correctly predicted classifications over the total predicted classifications. High precision relates to the model's low rate of wrongly assigning data points that do not belong to a certain class to that class. The precision metric becomes more important when the class labels are imbalanced and the prediction of true positives matters more than that of true negatives.

Precision = True Positive / (True Positive + False Positive)

3.2.6.3 Recall

Recall is defined as the correctly predicted classifications over all the actual members of a certain class. It tells us how many objects that belong to the class in question were missed or classified outside that class.

Recall = True Positive / (True Positive + False Negative)

You have only explained what recall is. Explain how you carried out the process of finding the recall value in this research. What tools, methods and processes were applied during the performance evaluation process? What problems were encountered, if any, and how did you resolve them?

3.2.6.4 F1-Score

The F1 score is another metric which at first sight is hard to understand intuitively. It brings in both the false positives and the false negatives to weigh the error in decision making, and is defined as the harmonic mean of precision and recall. Ideally, we would want to identify all the true positive observations that exist for a particular class while being careful to exclude all those that do not belong to that class; if we could do that, we would have both high precision and high recall, and consequently a high F1 score. An important thing to note is that even if the precision is remarkably high, a low recall will dominate and bring the F1 score down, and vice versa.

F1-score = 2 * (precision * recall) / (precision + recall)


You have only explained what the F-score is. Explain how you carried out the process of finding the F-score value in this research. What tools, methods and processes were applied during the performance evaluation process? What problems were encountered, if any, and how did you resolve them?

You have to design or develop a new model, or an extension of an existing model, just as your topic, statement of the problem and aim suggested in chapter one. I have not seen this in your chapter three or four. After developing a new model, you then apply it to the same data sets, and then you can do a comparison to know whether the new model is better than the existing models or not. Add model development to chapter three.

CHAPTER FOUR

RESULTS & ANALYSIS

You have to follow your research design steps in presenting the whole of chapter four and, in each step, present and discuss the results obtained:

1. Data pre-computation of features

2. Feature selection: correlation computation

3. Feature normalization/standardization

4. Dimensionality reduction (using NCA)

5. Training and evaluating models

6. Supervised classifiers: logistic regression … ANN

7. Decision: defect, non-defect

8. Developed model

9. Discussions
