

Classifying non-functional requirements using RNN variants for quality software development

Conference Paper · August 2019

DOI: 10.1145/3340482.3342745

4 authors:

Md. Abdur Rahman, University of Dhaka
Ariful Haque, Masaryk University
Md. Nurul Ahad Tawhid, Victoria University Melbourne
Md Saeed Siddik, University of Dhaka

All content following this page was uploaded by Md. Nurul Ahad Tawhid on 09 September 2019.


Classifying Non-functional Requirements using RNN Variants for Quality Software Development

Md. Abdur Rahman
Centre for Advanced Research in Sciences, University of Dhaka, Bangladesh
mukul.arahman@gmail.com

Md. Ariful Haque
Institute of Information Technology, University of Dhaka, Bangladesh
arifulmit17@gmail.com

Md. Nurul Ahad Tawhid
Institute of Information Technology, University of Dhaka, Bangladesh
tawhid@iit.du.ac.bd

Md. Saeed Siddik
Institute of Information Technology, University of Dhaka, Bangladesh
saeed.siddik@iit.du.ac.bd

ABSTRACT
Non-Functional Requirements (NFR) are a set of quality attributes required for software architectural design. They are usually scattered across the Software Requirements Specification (SRS) and must be extracted for quality software development that meets user expectations. Researchers have shown that functional and non-functional requirements are mixed together within the same SRS, so distinguishing them requires a mammoth effort. Automatic NFR classification is a feasible way to characterize those requirements, and several techniques have been recommended for it, e.g., information retrieval (IR) and linguistic knowledge. However, conventional supervised machine learning methods suffer from the word representation problem and usually require hand-crafted features. The proposed research overcomes these limitations by using RNN variants to categorize NFR. The NFR are interrelated and one task happens after another, which is the ideal situation for an RNN. In this approach, requirements are processed to eliminate unnecessary content, features are extracted using word2vec, and the resulting vectors are fed as input to the RNN variants LSTM and GRU. Performance has been evaluated on the PROMISE dataset using several statistical analyses. The validation precision, recall, and f1-score of LSTM are 0.973, 0.967 and 0.966 respectively, which are higher than those of the CNN and GRU models. LSTM also correctly classified a minimum of 60% and a maximum of 80% of unseen requirements. In addition, the classification accuracy of LSTM is 6.1% better than that of GRU. These results indicate that RNN variants can lead to better classification results, and that LSTM is more suitable for NFR classification from textual requirements.

CCS CONCEPTS
• Software and its engineering → Requirements analysis.

KEYWORDS
Non-Functional Requirements, NLP, Deep Learning, RNN

ACM Reference Format:
Md. Abdur Rahman, Md. Ariful Haque, Md. Nurul Ahad Tawhid, and Md. Saeed Siddik. 2019. Classifying Non-functional Requirements using RNN Variants for Quality Software Development. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE ’19), August 27, 2019, Tallinn, Estonia. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3340482.3342745

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MaLTeSQuE ’19, August 27, 2019, Tallinn, Estonia
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6855-1/19/08...$15.00
https://doi.org/10.1145/3340482.3342745

1 INTRODUCTION
The software requirements specification is the most important artifact that describes the features and behaviors of a software application. It includes a variety of components that define the intended functionality required by users, and these are composed of Functional Requirements (FR) and Non-Functional Requirements (NFR). Functional requirements describe software behavior directly listed by stakeholders, whereas non-functional requirements are the expected but unlisted requirements needed to perform functional tasks [1]. Functional and non-functional requirements are mingled within the same documents, which makes them hard to separate manually. Since NFR are not given by stakeholders, researchers have shown that unaddressed and inappropriate NFR classification leads to project failure or increased production cost, which may affect software quality evolution through frequent change requests [2, 3]. Therefore, early NFR detection supports quality software development and reduces development cost. However, manual NFR classification escalates development time and maintenance cost, which may ruin software quality.

For automatic textual data classification, various approaches have been proposed, such as information retrieval [2], linguistic knowledge [4] and machine learning [5]. Among those, very few directly address requirement classification. An IR based NFR classification strategy was presented in [2]. Textual requirements were classified as functional and non-functional using linguistic knowledge [4]. In 2017, Matsumoto et al. proposed a method to identify ambiguity, inconsistency, incompleteness, and redundancy of NFR in a software requirements specification [6]. Lu and Liang proposed a new technique to classify app user reviews into three categories: functional, non-functional and others [7].

However, all of those models were developed using machine learning algorithms, which usually depend on prerequisite hand-crafted features [8]. More significantly, word representation in the machine learning approach has several problems, such as loss of context, sparse representation and arbitrary encodings. For example, Dhaka
and Estonia may be represented as Id224 and Id453 respectively, meaning, for Estonia, that the 453rd entry of a long sparse vector is 1. Such a representation does not provide any useful information to the system regarding similarities between individual symbols. This shortcoming of machine learning can be overcome using a deep learning based distributed vector representation, which attempts to learn multiple levels of representation for handling increased complexity.

In the literature, several machine learning approaches have been reported for text and NFR classification [2, 4, 5]. Only a few studies apply deep learning to NFR classification, including one that represents text data in a low-dimensional vector space and classifies it using a Convolutional Neural Network (CNN) [9]. However, hardly any research addresses NFR classification with a Recurrent Neural Network (RNN). RNN is one of the most popular architectures used in NLP, because its recurrent structure is suitable for processing variable-length text [10]. An RNN utilizes distributed representations of words by converting the words of each text into vectors that form a matrix.

This paper presents an efficient NFR classification technique using RNN variants to categorize requirements into pre-defined labels. First, the textual requirements are processed to eliminate unnecessary text, symbols, etc. from the dataset. Then the processed documents are vectorized using the word2vec algorithm and fed into the neural network model. The RNN variants Long Short Term Memory (LSTM) [8] and Gated Recurrent Unit (GRU) [11] are used for NFR model construction. The implemented classifiers have been applied to categorize NFR into pre-defined labels. Finally, the performance of the proposed classifiers has been evaluated on the PROMISE [12] dataset using several statistical analyses.

The validation and testing results of the proposed models show that LSTM achieves the highest scores, ahead of CNN and GRU. The reported average precision, recall and f1-score for LSTM validation are 0.973, 0.967 and 0.966 respectively, which are high and indicate that the model was trained well. In the model, the boundary scores are 0.95 and 1.00, which indicate that the model's minimum classification validity is 95% and its maximum is 100%. The reported average test accuracy of LSTM is higher than that of GRU. LSTM's lowest and highest accuracy are 0.60 and 0.80 respectively, which means that a minimum of 60% and a maximum of 80% of unseen requirements are correctly classified by the model. The reported average precision, recall, f1-score and accuracy are 71.7%, 71.5%, 70% and 71.5% respectively. In addition, LSTM's standard deviation is always lower than GRU's: it is 0.081, 0.062, 0.057 and 0.062 for precision, recall, f1-score and accuracy respectively. The low deviation indicates that the model is stable. The main contribution of this paper is to investigate how well deep learning methods perform for multi-class NFR classification.

The rest of this paper is organized as follows. Section 2 discusses the existing work related to this research. Sections 3 and 4 present the proposed method and the result analysis respectively. Finally, Section 5 concludes this paper with future research directions.

2 RELATED WORK
Classifying non-functional requirements through analysis of textual natural language is an emerging field in software engineering research that aims to improve software quality. NFR may be classified using information retrieval based [2], linguistic knowledge based [4], CNN based [9] and other approaches, which are briefly discussed in this section.

The Information Retrieval (IR) technique presented in [2] plays a vital part in detecting and classifying NFR. In the IR approach, a set of indicator terms was identified for each NFR category to classify user requirements. A probabilistic weight for each potential indicator term was calculated from the requirements. This strategy requires less effort for NFR classification than semi-automated classification methods. However, it suffers from the need to evaluate the correctness of candidate NFR manually.

Requirements were classified into functional and non-functional using linguistic knowledge in [4]. The method works at the sentence level, as the characteristics of functional and non-functional requirements remain within the scope of sentences. The results were cross-checked ten times and showed that the method significantly improved classification performance. However, more training and testing data could be introduced to justify the performance.

Using a requirements frame model, a method was proposed to detect redundancy, ambiguity, inconsistency and incompleteness of NFR in an SRS [6]. The research focused on the requirements related to response time and usability. The specific requirements were retrieved using respective keywords, and multiline statements were converted to single sentences manually. However, this manual statement conversion is time consuming and error-prone.

Combining different feature extraction and machine learning techniques, an NFR classification method was presented in [5] to find the best pair. BoW and variations of TF-IDF were used as feature extraction strategies, and eight machine learning algorithms were applied as NFR classifiers. The empirical results showed that TF-IDF (character level) combined with SGD SVM performed best in categorizing NFR into pre-defined labels. However, NFR classification efficiency could be improved by incorporating RNN and word2vec.

Functional and non-functional requirements were classified using a supervised machine learning approach in [13], where both binary and multi-class classifiers were introduced. The research focused on accurate identification of four NFR: usability, security, operational and performance. The problem of an imbalanced dataset was handled using under- and over-sampling techniques. For feature extraction, BoW strategies were used, and part-of-speech tags were identified as the most informative features. The experiment was performed using a support vector machine classifier.

A machine learning technique was presented to classify app user reviews in [7], where BoW, TF-IDF, CHI2 and AUR-BoW techniques were applied with the NB, J48, and Bagging algorithms. The research focused on four types of NFR: reliability, usability, portability, and performance. The paper concluded that an imbalanced and small dataset badly affects classification results in a machine learning environment. However, no direction has been given for handling this dataset problem.

Winkler and Vogelsang [1] presented a convolutional neural network based approach to automatically classify content elements of a natural language requirements specification. Their approach can be used to classify content elements in documents. To train the neural network, this model used 10K content elements extracted from 89 industrial requirements specifications, and reported a precision of 0.73 and a recall of 0.89. However, requirements are interconnected and one task happens after another, so RNN may perform better than CNN.


Using a CNN model, a method was proposed in [9] to classify NFR into different pre-defined categories, where text was converted to a low-dimensional vector space using the word2vec algorithm. The requirement text was represented using embedding vectors and fed into the CNN model. The effectiveness of the proposed model was evaluated on the PROMISE dataset. The research applied a k-fold cross validation strategy to improve accuracy and reduce bias in prediction, and used different statistical analyses to compare the results of the proposed model with others. The results showed that CNN can classify NFR into different categories efficiently; however, RNN could be incorporated for better accuracy.

The analysis of existing approaches shows that different strategies have been proposed for NFR classification, such as machine learning based and CNN based ones. However, very little research has been found on automatic NFR classification using deep learning methods. In addition, no guideline has yet been found for classifying NFR using RNN, which is significant in the NLP domain.

3 METHODOLOGY
Multi-class NFR classification using RNN is proposed in this framework to facilitate quality software development. This automatic NFR classification method uses the word2vec algorithm to extract features from the requirement dataset, and two RNN variants, LSTM and GRU, are applied as NFR classifiers. Textual software requirements are processed to eliminate unnecessary text, symbols, etc. and vectorized using the word2vec algorithm for RNN modeling. The whole working process of this framework is divided into the following four steps.

• Dataset Pre-processing
• Word Vectorization
• RNN Model Construction
• Model Training and Testing

The details of the above steps are described in the following sub-sections.

Figure 1: Overview of the Proposed Method

3.1 Dataset Pre-processing
The first step of this framework is cleaning and pre-processing the software requirement dataset. This involves removing special characters and stop words, case-folding, lemmatization and tokenization. Special characters and symbols are usually non-alphanumeric characters, which add extra noise to the experimental dataset. To remove this noise, special characters have been removed using regular expressions. Words written in upper-case or lower-case have the same meaning in English; therefore, case-folding is applied so that only one case is considered. Stop words usually refer to the most common words such as "a", "an", "the", which do not influence the semantics of a software requirement. The stop words corpus of nltk, one of the most widely used in natural language processing, has been used in this implementation. Lemmatization has been used to obtain the base form of each word, which also groups the inflected forms of a word together. Numbers are also converted to words in order to enrich the dataset. Finally, tokenization splits longer text into smaller pieces, or tokens. The requirements have been tokenized into sentences, which are then split into distinct words.

3.2 Word Vectorization
A deep learning model cannot process natural language sentences directly; therefore, a word vectorization technique is required. This allows the model to recognize patterns even if the words occurring in the pattern vary slightly. The requirement text has been converted to word vectors using the word2vec model with the skip-gram feature extraction method. Word2vec maps a single word to a real-valued vector v. In the word2vec technique, the vector distance between two given words is small if the two words are used in similar contexts; otherwise it is large. The sentence transformation in word2vec can be written as $m \in \mathbb{R}^{n \times l}$, where m, $\mathbb{R}$, n, and l represent the matrix, the set of real numbers, the embedding vector size and the length of the sentence respectively.

This strategy uses Google's pretrained word embedding model, which is trained on over 100 billion words. At the time of tokenization, each unique word is turned into a unique number, which is used as the respective word index. The vector of a particular word is extracted from the pretrained word embedding and kept in an embedding matrix, whose elements are initialized to zero. These elements are then replaced by the embedding vector of the particular word. Thus the embedding matrix contains all the word embedding vectors, and zero vectors for the words that are not found in the pretrained word embeddings.

The unique word vectors were fetched into an embedding matrix whose size is determined by the vocabulary size and the embedding dimension. The features of the documents are extracted as embedded word vectors with a dimension of 300 for each word in each sentence. The embedding vectors were imported as pre-trained word vectors, and transfer learning is used to learn the word embedding. The
embedding matrix is fed as weights into the embedding layer of the neural network.

3.3 RNN Model Construction
Two RNN variant models have been implemented in this NFR classification framework on top of the word2vec vectorization algorithm. The RNN variants LSTM and GRU are prominent in the natural language domain, especially for text classification [11], and are therefore implemented in the NFR classification framework. The sequential RNN architecture mainly consists of three layers: the input layer, the hidden layer and the output layer.

3.3.1 Input Layer: The first layer prepares word embeddings according to the model's input requirement using the sequence length and embedding dimension. The parameters in the embedding layer are not trainable, to ensure proper transfer learning. For training the deep learning model, transfer learning is used in the first layer in order to improve context learning of the words, and the vector weights remain unchanged.

3.3.2 Hidden Layer: The LSTM and GRU algorithms have been used in the hidden layer, which takes one word embedding per time step and processes it to produce an output. Dropout and recurrent dropout have been applied here to reduce over-fitting of the model. The algorithms learn the context of the sentence and pass the output of the last time step to the dense layer activation function. The leaky ReLU activation function of Equation (1) has been used in this case, where α is a small constant.

\[
\mathrm{LeakyReLU}(z) =
\begin{cases}
z, & z > 0 \\
\alpha z, & \alpha z \le 0
\end{cases}
\tag{1}
\]

3.3.3 Output Layer: This is the last layer of the model, where the final output for each requirement is produced. This dense layer uses the softmax activation function of Equation (2) to help predict the target class. The softmax function calculates the probability distribution of the requirement over the pre-defined categories. It produces a probability for each class, and the class with the maximum probability is used as the predicted class. Each probability value ranges between 0 and 1, and the sum of all the probabilities is equal to 1. Equation (2) computes the exponential (e-power) of the given input and the sum of the exponential values of the entire input set. The ratio of the exponential of the input value to the sum of exponential values is the output of the softmax function.

\[
p(y \mid x) = \frac{\exp(w_y^{t} x)}{\sum_{y'} \exp(w_{y'}^{t} x)}
\tag{2}
\]

The model was built using the sequential API of keras. This API stacks the layers one over another and uses the output of one layer as the input of the next layer. The dimensions of the embedding, LSTM/GRU and Leaky ReLU layers were 300. The softmax dense layer had a dimension of 10, which is equal to the total number of targeted NFR categories.

3.4 Model Training and Testing
The implemented LSTM and GRU models have been trained to evaluate how effectively the proposed framework can classify NFR into the given categories. Training makes predictions with the current model, calculates how incorrect the predictions are, and updates the network parameters to minimize this error and make the model better. This process is repeated until the model has converged and can no longer learn. The following parameters have been used to tune this process.

Metric: This is used to measure the performance of the proposed model. Accuracy has been used as the metric for this framework.

Loss function: This function is used to calculate a loss value of the model in the training process. Training attempts to minimize this value by tuning the network weights. This framework uses a loss function that is suitable for multiclass categorical classification.

Optimizer: This is a significant function in the model that decides how the network weights will be updated based on the output of the loss function.

The actual training happens using the fit method. In each training iteration, a batch-size number of samples is processed and the weights are updated a single time. The training process completes an epoch when the model has seen the entire training dataset. At the end of each epoch, the validation dataset is used to evaluate the model's learning accuracy. This process is repeated for a predetermined number of epochs.

4 EXPERIMENT AND RESULT ANALYSIS
The dataset and the experiments for the RNN variants are discussed in this section, together with the result analysis and an efficiency comparison of different techniques for automatic NFR classification.

4.1 Dataset
The OpenScience tera-PROMISE software requirement dataset [12], which contains both functional and non-functional (11 categories) requirements, has been used for experimental analysis. The dataset is small and consists of 625 requirement sentences, of which 255 are functional and 370 are non-functional requirements. The NFR are labeled with eleven categories: availability, legal, look and feel, maintainability, operational, performance, scalability, security, usability, fault tolerance, and portability. However, the dataset is dominated by functional requirements, which indicates that the class distribution is imbalanced. Therefore, sampling strategies for dealing with imbalanced data have been applied, as a balanced distribution can improve the classification accuracy. Also, one category containing a single value has been removed to reduce class distribution bias in the dataset.

4.2 Experiment
In this framework, the word embedding matrix is used as weights in the embedding layer. The labels were converted to integers using a dictionary of the unique classes, which maps each label to the number representing its class. The training and validation sentences were tokenized to find all the unique words. After tokenizing the sentences, zero padding was applied according to the length of the longest sentence in the dataset.

The training, validation and testing requirement sentences were converted to the same length to ensure that sentences of equal length are fed as input to the model. The length of the longest sentence in the dataset was used for this purpose. Then the class labels were encoded as integers within the range of the total number of classes minus
one. The testing dataset was selected manually to ensure that instances of all classes are present. The training set is split using the k-fold cross validation process. Before training started, class weights were calculated so that the model pays more attention to the underrepresented classes with a low number of instances. The model was fitted using the training and validation data, where the validation data size was equal to 1/k of the k-fold cross validation. At training time, the model takes labels as one-hot encoded vectors in order to produce separate probabilities for each class in the output softmax function of Equation (2). The model was trained for several epochs with a fixed batch size, and the output was kept in a history object.

The dropout probability was set to 0.5 in order to avoid overfitting during training. The learning algorithm selected for this problem was Adam [14] with learning rate λ = 0.001, β1 = 0.9, β2 = 0.999 and ϵ = 1e−08, and the loss function was categorical cross-entropy. The number of epochs used to train the LSTM and GRU was fixed to 200. The trained model had a total of 1,144,810 parameters, of which 814,510 were trainable and 330,300 were non-trainable.

4.3 Evaluation Metric
Precision, recall, f1-score, and accuracy metrics have been used to measure how well the model learned to classify NFR. The precision, recall and f1-score values are averages over the ten classes used: availability, legal, look and feel, maintainability, operational, performance, scalability, security, usability and fault tolerance.

4.4 Result Analysis
The labeled and unlabeled requirement datasets were used for training and testing respectively. The model was trained with k-fold cross validation, where k was set to 10. The validation and testing results obtained for each fold are listed in Table 1 and Table 2 respectively. The comparative experimental results for CNN [9], GRU and LSTM are shown in Table 1 and plotted in Figure 2. The GRU and LSTM test accuracy are reported in Table 2.

Table 1: CNN[9], GRU and RNN Results

        Precision               Recall                  F1-score
Fold    CNN     GRU     LSTM    CNN     GRU     LSTM    CNN     GRU     LSTM
1       0.80    0.95    0.98    0.79    0.95    0.97    0.77    0.95    0.97
2       0.75    0.97    0.95    0.76    0.97    0.97    0.73    0.97    0.96
3       0.79    0.99    0.98    0.76    0.99    0.94    0.75    0.99    0.95
4       0.81    0.97    0.98    0.81    0.97    0.97    0.80    0.97    0.97
5       0.81    0.97    0.98    0.78    0.97    0.97    0.76    0.97    0.97
6       0.83    0.97    0.95    0.78    0.97    0.97    0.76    0.97    0.96
7       0.84    0.94    0.96    0.85    0.94    0.94    0.82    0.94    0.94
8       0.85    0.88    0.98    0.77    0.88    0.97    0.76    0.88    0.97
9       0.85    1.00    1.00    0.81    1.00    1.00    0.81    1.00    1.00
10      0.80    0.97    0.97    0.75    0.97    0.97    0.74    0.97    0.97
Avg     0.813   0.961   0.973   0.785   0.961   0.967   0.77    0.961   0.966
Std     0.032   0.033   0.015   0.027   0.033   0.017   0.03    0.033   0.015

According to Table 1, the LSTM model's average validation scores are higher than those of CNN and GRU, which indicates that the model is trained well. In Table 1, the lowest precision, recall and f1-score are 0.75, 0.75 and 0.73 respectively, all of which come from the CNN experiment. On the other hand, LSTM's lowest and highest scores in Table 1 are 0.95 and 1.00, which indicates that the minimum classification validity is 95% and the maximum is 100%. The reported average precision, recall and f1-score for LSTM are 0.973, 0.967, and 0.966 respectively. Analysis of these results shows that LSTM outperformed the other models. In addition, it is very stable, as reflected in the low standard deviations of 0.015, 0.017 and 0.015 in precision, recall and f1-score respectively. The validation performance was recalculated after each epoch and the weights were updated recursively.

Figure 2: Validation Result of CNN, GRU and RNN

The validation results for CNN, GRU and LSTM are depicted graphically in Figure 2, where the Y-axis and X-axis represent validation values and techniques respectively. The box-and-whisker plots for precision, recall and f1-score have been grouped by classifier method. According to the plot, the lower, middle and upper quartiles are always better for the RNN variants than for CNN.

Table 2: GRU and LSTM Test Scores

        Precision       Recall          F1-score        Accuracy
Fold    GRU     LSTM    GRU     LSTM    GRU     LSTM    GRU     LSTM
1       0.77    0.79    0.60    0.75    0.64    0.74    0.60    0.75
2       0.83    0.67    0.75    0.70    0.74    0.66    0.75    0.70
3       0.78    0.84    0.70    0.80    0.68    0.79    0.70    0.80
4       0.67    0.60    0.65    0.60    0.61    0.60    0.65    0.60
5       0.83    0.73    0.70    0.75    0.71    0.71    0.70    0.75
6       0.75    0.76    0.70    0.75    0.68    0.72    0.70    0.75
7       0.77    0.62    0.70    0.65    0.68    0.65    0.70    0.65
8       0.30    0.78    0.35    0.75    0.27    0.73    0.35    0.75
9       0.72    0.63    0.65    0.65    0.66    0.65    0.65    0.65
10      0.75    0.75    0.70    0.75    0.68    0.75    0.70    0.75
Avg     0.717   0.717   0.65    0.715   0.635   0.70    0.65    0.715
Std     0.153   0.081   0.113   0.062   0.133   0.057   0.113   0.062
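
The Avg and Std rows of Table 2 can be recomputed from the per-fold values. The following is a minimal Python sketch for the accuracy columns only. The names `TEST_ACCURACY` and `summarize` are illustrative and do not come from the paper, and the use of the sample (n - 1) standard deviation is an assumption; it approximately reproduces the reported Std values, but the paper does not state which estimator was used.

```python
from statistics import mean, stdev

# Per-fold test accuracy transcribed from Table 2 (folds 1 to 10).
TEST_ACCURACY = {
    "GRU":  [0.60, 0.75, 0.70, 0.65, 0.70, 0.70, 0.70, 0.35, 0.65, 0.70],
    "LSTM": [0.75, 0.70, 0.80, 0.60, 0.75, 0.75, 0.65, 0.75, 0.65, 0.75],
}


def summarize(scores):
    """Return (average, sample standard deviation) of per-fold scores."""
    # Assumption: the Std row uses the sample (n - 1) estimator.
    return mean(scores), stdev(scores)


summary = {model: summarize(scores) for model, scores in TEST_ACCURACY.items()}

for model, (avg, std) in summary.items():
    scores = TEST_ACCURACY[model]
    print(f"{model}: avg={avg:.3f} std={std:.3f} "
          f"min={min(scores):.2f} max={max(scores):.2f}")
```

The same helper can be applied to the precision, recall and f1-score columns to check the remaining entries of the Avg and Std rows.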


The testing results of the two RNN variants GRU and LSTM are listed in Table 2 and depicted in Figure 3. The comparison of testing results in Table 2 covers GRU and LSTM only, because the accuracy of CNN was not reported by Ju et al. [9]. The precision, recall, f1-score and accuracy were calculated from the true and predicted labels. As stated in Table 2, the average scores of LSTM are at least as high as those of GRU (the average precision is tied at 0.717). LSTM's lowest and highest accuracy are 0.60 and 0.80 respectively, which means that the model correctly classifies a minimum of 60% and a maximum of 80% of unseen requirements. The reported average precision, recall, f1-score and accuracy are 71.7%, 71.5%, 70% and 71.5% respectively. On the other hand, LSTM's standard deviation is always lower than GRU's: it is 0.081, 0.062, 0.057 and 0.062 for precision, recall, f1-score and accuracy respectively. The low deviations indicate that the LSTM model is stable.

Figure 3: RNN Test Result of GRU and RNN

The test results of the RNN variants are graphically represented in Figure 3. Where, Y-axis and X-axis represents testing

5 CONCLUSION
An effective requirement identification framework has been proposed that uses RNN variants to classify NFR into pre-defined categories. This automatic NFR classification method uses the word2vec algorithm for feature extraction and an RNN deep learning model for text classification. The requirement text has been processed to eliminate unnecessary text, symbols, etc. from the dataset. The processed documents were vectorized using the word2vec algorithm and fed to the RNN. The standard PROMISE dataset has been used to evaluate the proposed model. The reported results showed that the model's minimum classification validity is 95% and its maximum is 100%. The average testing precision, recall, f1-score and accuracy are 71.7%, 71.5%, 70% and 71.5% respectively. The LSTM model is very stable, as reflected in the low standard deviation of its test scores (<0.082) in precision, recall and f-measure. According to the experimental results, it can be concluded that deep learning algorithms, especially LSTM, are useful for classifying software non-functional requirements. Incorporating a large-scale dataset, further feature extraction techniques such as GloVe and fastText, and ensemble classifiers are the future research directions of this framework.

REFERENCES
[1] Jonas Winkler and Andreas Vogelsang. Automatic classification of requirements based on convolutional neural networks. In 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), pages 39–45. IEEE, 2016.
[2] J Cleland-Huang, R Settimi, X Zou, and P Solc. The detection and classification of non-functional requirements with application to early aspects. In 14th Int. Requirements Engineering Conference (RE’06), pages 39–48. IEEE, 2006.
[3] Mirza Rehenuma Tabassum, Md Saeed Siddik, Mohammad Shoyaib, and Shah Mostafa Khaled. Determining interdependency among non-functional requirements to reduce conflict. In 2014 International Conference on Informatics, Electronics & Vision (ICIEV), pages 1–6. IEEE, 2014.
[4] I Hussain, L Kosseim, and O Ormandjieva. Using linguistic knowledge to classify non-functional requirements in SRS documents. In Int. Conf. on Application of Natural Language to Information Systems, pages 287–298. Springer, 2008.
[5] Md. Ariful Haque, Md. Abdur Rahman, and Md Saeed Siddik. Non functional requirements classification with machine learning: An empirical study. In International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pages 629–633. IEEE, 2019.
[6] Y Matsumoto, S Shirai, and A Ohnishi. A method for verifying non-functional requirements. Procedia Computer Science, 112:157–166, 2017.
[7] Mengmeng Lu and Peng Liang. Automatic classification of non-functional requirements from augmented app user reviews. In 21st International Conference on Evaluation and Assessment in Software Engineering, pages 344–353. ACM, 2017.
[8] Jenq-Haur Wang, Ting-Wei Liu, Xiong Luo, and Long Wang. An lstm approach
values and techniques respectively. The box whisker plots for sta- to short text sentiment classification with word embeddings. In Proceedings of the
tistical analysis named precision, recall, f1-score and accuracy have 30th Conference on Computational Linguistics and Speech Processing (ROCLING
been grouped by the RNN variants LSTM and GRU. According 2018), pages 214–223, 2018.
[9] N Almanza R, Reyes Ju, and Guillermo Licea. Towards supporting software
the plot, an outlier has been found in GRU, but the LSTM plotted engineering using deep learning: A case of software requirements classification.
as packed. The lower, middle and upper quartile starting range is In 5th International Conference in Software Engineering Research and Innovation
(CONISOFT), pages 116–120. IEEE, 2017.
always better for LSTM compared to GRU. [10] P Zhou, Zhenyu Qi, S Zheng, Jiaming Xu, H Bao, and Bo Xu. Text classification
This experiment covers three deep learning methods named CNN improved by integrating bidirectional lstm with two-dimensional max pooling.
[9], LSTM, and GRU for finding the optimal NFR classifier. After In 26th Int. Conference on Computational Linguistics, pages 3485–3495, 2016.
[11] Jakub Nowak, Ahmet Taspinar, and Rafał Scherer. Lstm recurrent neural networks
analyzing entire experiment, LSTM performs best for categorizing for short text and sentiment classification. In International Conference on Artificial
NFR to ensure quality software development. Intelligence and Soft Computing, pages 553–562. Springer, 2017.
[12] Openscience tera-promise software requirement (last accessed: July 01, 2019).
URL https://terapromise.csc.ncsu.edu/!/#repo/view/head/requirements/nfr.
4.5 Threats to Validity [13] Zijad Kurtanović and Walid Maalej. Automatically classifying functional and non-
functional requirements using supervised machine learning. In 25th International
The classification results have been obtained for 10 NFR categories. Requirements Engineering Conference (RE), pages 490–495. IEEE, 2017.
However, during this work it was evident the lack of publicly avail- [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
able large datasets to applied deep earning in software requirement International Conference on Learning Representations, 12 2014.

classification. Therefore, that limits the experimentation of more


complex and interesting models.
