
Empirical Software Engineering (2019) 24:3659–3695

https://doi.org/10.1007/s10664-019-09716-7

Mining non-functional requirements from App store reviews

Nishant Jha1 · Anas Mahmoud1

Published online: 7 June 2019


© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
User reviews obtained from mobile application (app) stores contain technical feedback that
can be useful for app developers. Recent research has been focused on mining and catego-
rizing such feedback into actionable software maintenance requests, such as bug reports and
functional feature requests. However, little attention has been paid to extracting and synthe-
sizing the Non-Functional Requirements (NFRs) expressed in these reviews. NFRs describe
a set of high-level quality constraints that a software system should exhibit (e.g., security,
performance, usability, and dependability). Meeting these requirements is a key factor for
achieving user satisfaction, and ultimately, surviving in the app market. To bridge this gap, in
this paper, we present a two-phase study aimed at mining NFRs from user reviews available
on mobile app stores. In the first phase, we conduct a qualitative analysis using a dataset of
6,000 user reviews, sampled from a broad range of iOS app categories. Our results show that
40% of the reviews in our dataset signify at least one type of NFRs. The results also show
that users in different app categories tend to raise different types of NFRs. In the second
phase, we devise an optimized dictionary-based multi-label classification approach to auto-
matically capture NFRs in user reviews. Evaluating the proposed approach over a dataset
of 1,100 reviews, sampled from a set of iOS and Android apps, shows that it achieves an
average precision of 70% (range [66% - 80%]) and average recall of 86% (range [69% -
98%]).

Keywords Requirements elicitation · Non-functional requirements · Application store · Classification

Communicated by: David Lo, Meiyappan Nagappan, Fabio Palomba, and Sebastiano Panichella
Anas Mahmoud
amahmo4@lsu.edu

Nishant Jha
njha1@lsu.edu

1 Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, LA, USA

1 Introduction

Over the past decade, mobile devices, such as smart phones and tablets, have become acces-
sible worldwide. This has led to a spike in the demand for software to support these devices.
To meet these evolving market demands, mobile application (app) stores (e.g., Google Play
and the Apple App Store) have emerged as a new model of online distribution platforms
(Petsas et al. 2013; Basole and Karla 2012). These platforms have adopted an open business
model, focusing on lowering the barriers of entry for small-scale companies and individuals,
thus, raising the competition in the app market to unprecedented levels1 .
To survive market competition, app release decisions should be driven by a deep
knowledge of the current landscape of competition along with end-users’ expectations,
preferences, and needs (Coulton and Bamford 2011; Harman et al. 2012; Ihm et al. 2013;
Finkelstein et al. 2014; Lee and Raghu 2011; Nayebi and Ruhe 2017b; Shah et al. 2016).
Staying close to the customer not only minimizes the risk of failure, but also serves as a
key factor in achieving market competence as well as managing and sustaining innovation
(Fu et al. 2013; Pagano and Maalej 2013; Groen et al. 2017; Bano et al. 2017; Gómez et al.
2017). To acquire such feedback, conventional app stores provide a mechanism for app
users to express their opinions about apps in the form of textual reviews and meta-data (e.g.,
star ratings). Recent analysis of these reviews has revealed that they contain technical infor-
mation that can be useful to app developers (Martin et al. 2016b). Such information can be
utilized to help developers understand their users’ needs and resolve any issues that went
undetected during in-house testing (Groen et al. 2017; Jha and Mahmoud 2017a; Martin
et al. 2017; Maalej et al. 2016).
Existing research on mining app store reviews has been focused on extracting and
classifying technically informative reviews into bug reports and feature requests (Jha and
Mahmoud 2017a; Carreño and Winbladh 2013; Chen et al. 2014). However, little atten-
tion has been paid to mining Non-Functional Requirements (NFRs). NFRs describe a set
of quality attributes that a software system should exhibit, such as its security, usability,
and performance (Glinz 2007; Cleland-Huang et al. 2007; Mahmoud and Williams 2016).
These attributes enforce a variety of design constraints throughout the software develop-
ment process (Mairiza et al. 2010). Addressing these constraints is essential for achieving
user satisfaction and maintaining market viability (Glinz 2007; Chung et al. 2009; Gross
and Yu 2001).
In general, the lack of research effort on mining NFRs from user reviews can be attributed
to the vague understanding of what NFRs actually are. Unlike functional requirements,
which are typically explicitly identified, NFRs tend to be implicit and latent within the
text (Glinz 2007; Mairiza et al. 2010; Mahmoud and Williams 2016; Gross and Yu 2001;
Gotel et al. 2012). For example, the review “The system shall respond within 60 seconds”
emphasizes a performance NFR. This same NFR is also emphasized in other reviews but
using different terminologies, such as “It takes so long for the app to give me results”, or
“I’ve waited for a whole minute and still no result”. These challenges highlight the need
for a more in-depth analysis of user reviews to detect such issues, focusing on the latent
concepts that the text signifies.
To address these challenges, in this paper, we present a study aimed at mining NFRs
from app store reviews. Our analysis is conducted using 6,000 user reviews sampled from

1 https://www.statista.com/topics/1729/app-stores/

the reviews of 24 different iOS apps, extending over a broad range of app categories.
Specifically, our contributions in this paper are:
– We conduct a qualitative analysis to determine the presence and distribution of different
types of NFRs in user reviews and over different app categories (application domains).
– We devise a dictionary-based multi-label classification approach for automatically
classifying user reviews, raising valid quality issues, into different categories of NFRs.
The remainder of this paper is organized as follows. Section 2 reviews related work and
motivates our research. Section 3 describes the data collection process along with the main
findings of our qualitative analysis. Section 4 investigates the performance of automated
classification techniques in classifying user reviews into different generic NFR categories.
Section 5 discusses our main findings, their potential impact, and the main threats to the
study's validity. Finally, Section 6 concludes the paper and outlines prospects for future work.

2 Background and Motivation

In this section, we discuss NFRs in the context of mobile software, review existing literature
related to our work in this paper, and motivate and describe our main research questions.

2.1 NFRs in the Context of Mobile Software

Non-Functional Requirements (NFRs) describe a set of operational constraints that a software
system should exhibit. These constraints are related to the utility of the system, such
as its usability, reliability, security, and accessibility (Glinz 2007; Cleland-Huang et al.
2005). Unlike functional requirements, which can be explicitly satisfied, NFRs can only be
satisficed, or partially met through functional measures. For instance, usability can be sat-
isficed by using user-friendly GUI elements, while security can be satisficed by encryption
algorithms and multi-factor authentication mechanisms.
Explicitly identifying NFRs early in the software process is critical for making the
right design decisions, and later for evaluating architectural alternatives for the system
(Nuseibeh 2001). However, NFRs are often overlooked during the requirements elicitation
phase, where the main emphasis is on getting the system’s functional features explicitly
and formally defined (Chung et al. 2009; Cleland-Huang et al. 2005). This can be partially
attributed to the vague understanding of what NFRs actually are, and the lack of effective
NFR elicitation, modeling, and documentation methods (Glinz 2007; Cleland-Huang et al.
2005). This problem becomes even more challenging in the context of mobile app develop-
ment. Specifically, mobile software has several distinct attributes that impose a unique set of
NFRs on its operational characteristics (Dehlinger and Dixon 2011; Forman and Zahorjan
1994; Wasserman 2010). These attributes can be described as follows:

– Hardware and computational capabilities: Mobile devices are constrained in several
aspects, such as the smaller screen size, fluctuating network bandwidth, limited battery
life, input modalities, and multi-tasking support. These restrictions impose a new set of
NFRs on mobile software development. For example, the small screen size has changed
the way developers think about usability, or the look and feel, of the system (Harrison
et al. 2013), while the limited battery life has imposed new restrictions on the perfor-
mance aspects of the system, such as its energy efficiency, speed, and overall resource
consumption (Hindle et al. 2014).

– Ubiquity: Mobile devices have become ubiquitous. People use their phones to track
their eating, drinking, sleeping, and exercising habits, search for information, manage
their finances, shop and pay for their merchandise, and get directions while traveling.
This ubiquity has led to the emergence of new types of user-driven NFRs. For example,
recent research has shown that location-tracking services generate more concerns for
privacy (Li et al. 2017), while business-related apps (e.g., online banking and credit
managing apps) are getting people more concerned about their data protection and
security (He et al. 2015). In fact, this over-reliance on mobile apps has made people
less forgiving when it comes to software reliability (e.g., having to restart the device),
which has led to stricter dependability concerns, such as the maintainability and avail-
ability of the app. Furthermore, the emergence of Internet of Things (IoT) technologies
has imposed several new constraints on apps’ portability and compatibility, as well as
their ability to continuously and seamlessly connect to other devices and access various
services in the IoT network (Mahatanankoon et al. 2005).

In summary, adherence to NFRs in mobile software can have a direct impact on the over-
all user experience. App developers should be aware of their users' concerns as they emerge
and should be able to integrate these concerns into the development process (Wasserman
2010). This emphasizes the need for effective methods that are capable of capturing these
requirements as accurately and exhaustively as possible.

2.2 Related Work

The research on app store review analysis has noticeably advanced in the past few years.
Numerous studies have been conducted on review classification, summarization, and prior-
itization. A comprehensive survey of these studies can be found in Martin et al. (2017). In
this section, we selectively review and discuss important work related to our analysis in this
paper.
Groen et al. (2017) investigated the presence of NFRs in mobile app user reviews. By
tagging online reviews, the authors found that users mainly expressed usability and relia-
bility concerns, focusing on aspects such as operability, adaptability, fault tolerance, and
interoperability. The authors further proposed a set of linguistic patterns to automatically
capture usability concerns in user reviews. Evaluating these patterns using a large dataset
of reviews showed that they can be used to identify statements about user quality concerns
with high precision. However, very low recall levels were reported.
Ciurumelea et al. (2017) proposed a taxonomy to analyze reviews and codes of mobile
apps for better release planning. The authors defined mobile specific categories of user
reviews that can be highly relevant for developers during software maintenance (e.g., com-
patibility, usage, resources, pricing, and complaints). A prototype that uses machine learning
and information retrieval techniques was then introduced to classify reviews and recom-
mend source code files that are likely to be modified to handle issues raised in the reviews.
The proposed approach was evaluated using 39 open source apps from Google Play. The
results showed that the proposed approach can organize reviews according to the prede-
fined taxonomy with an average precision of 80% and average recall of 94%. Furthermore,
the results showed that the approach was able to localize code artifacts according to user
reviews with an average precision of 51% and average recall of 79%.
Guzman and Maalej (2014) proposed an automated approach to help developers filter,
aggregate, and analyze app reviews. The proposed approach uses a collocation finding algo-
rithm to extract any fine-grained requirements mentioned in the review. These requirements

are grouped into more meaningful high-level features using topic modeling. The authors
used over 32,210 reviews extracted from seven iOS and Android apps to conduct their anal-
ysis. The results showed that the proposed approach managed to successfully capture and
group the most common feature requests in the reviews.
McIlroy et al. (2016) analyzed the multi-labelled nature of user reviews in the Google Play
Store and the Apple App Store. A qualitative analysis of the data showed that a substan-
tial amount (30%) of user reviews raised more than one technical issue (feature requests,
functional complaints, and privacy issues). The authors experimented with several classifica-
tion and multi-labelling techniques to automatically assign multiple labels to user reviews.
The results showed that a combination of Pruned Sets with threshold extension (PSt) (Read
et al. 2008) and Support Vector Machines (SVM) achieved the best performance. The
authors further demonstrated several useful applications of their multi-labelling approach
for stakeholders using various usage scenarios.
Panichella et al. (2015) proposed a supervised approach for classifying mobile app
reviews into several categories of technical feedback (e.g., bug reports and feature requests).
The authors extracted a set of linguistic features from each review, including the most
important words, the main sentiment of the review, and any linguistic patterns that repre-
sented potential maintenance requests. Different types of classifiers were then trained using
various combinations of these features. The results showed that Decision Trees (Quinlan
1986), trained over recurrent linguistic patterns and sentiment scores, achieved the best
performance in terms of precision and recall.
Maalej and Nabil (2015) introduced several probabilistic techniques for classifying
app reviews into bug reports, feature requests, user experiences, and ratings. The authors
experimented with several binary and multi-class classifiers, including Naive Bayes (NB),
Decision Trees, and Maximum Entropy. A dataset of 4,400 manually labeled reviews, sam-
pled from Google Play and the Apple App Store, was used to evaluate the performance of
these different classifiers. The results showed that binary classifiers (NB) were more accu-
rate for predicting the review type than multi-class classifiers. The results also revealed that
review features, such as star-rating, tense, sentiment scores, and length, as well as text anal-
ysis techniques, such as stemming and lemmatization, had a limited impact on the accuracy
of classification.
Kurtanović and Maalej (2017) proposed an approach for mining user rationale from user
reviews. Specifically, the authors used a grounded theory approach and peer content analysis
to investigate how users argue and justify their decisions of upgrading, installing, or switch-
ing software applications in their reviews. The proposed approach was evaluated using
32,414 reviews sampled from 52 software applications from the Amazon Store. The results
showed performance, compatibility, and usability issues to be the most pervasive. The
authors also evaluated multiple text classification techniques and different configurations to
predict user rationale in reviews. The results also showed that different rationale concepts
can be detected with high levels of precision (80%) and recall (99%).
Chen et al. (2014) presented AR-Miner, a computational framework that helps developers
to identify the most informative user app reviews. Uninformative reviews were initially
filtered out using Expectation Maximization for NB—a semi-supervised text classification
algorithm. The remaining reviews were then analyzed and categorized into different groups
using topic modeling (Blei et al. 2003). These groups were ranked by a review ranking
scheme based on their potential information value. The proposed approach was evaluated
on a manually classified dataset of app reviews collected from four popular Android apps.
The results showed high accuracy levels in terms of precision, recall, and the quality of the
ranking.

Carreño and Winbladh (2013) applied topic modeling and sentiment analysis classifica-
tion to identify user comments relevant to requirement changes. Specifically, the authors
processed user comments to extract the main topics mentioned as well as some sentences
representative of those topics. Evaluating the proposed approach over three datasets of man-
ually classified user reviews showed promising performance levels in terms of accuracy and
effort-saving.
Villarroel et al. (2016) introduced CLAP (Crowd Listener for releAse Planning). CLAP
categorizes and prioritizes user reviews to aid in release planning. Technically, the authors
used DBSCAN (Ester et al. 1996), a density-based algorithm for discovering clusters of
related reviews. Random Forests algorithm was used to label each cluster as high or low
priority based on factors such as the size of the cluster, its average review rating, and hard-
ware devices mentioned in its reviews. CLAP was evaluated in industrial settings and using
expert judgment. The results showed that CLAP was able to accurately categorize and
cluster reviews and make meaningful release planning recommendations.
Johann et al. (2017) identified five sentence patterns (e.g., enumerations and conjunc-
tions) and 18 part-of-speech patterns (e.g., verb-noun-noun) that are frequently used
in review text to refer to app features. The main advantage of using such patterns over
classification algorithms, such as SVM and NB, is that no large training and configuration
data are required. The proposed approach was evaluated using reviews extracted from ten
different apps. The results showed that linguistic patterns outperformed other models that
relied on more computationally expensive techniques, such as sentiment analysis and topic
modeling.

2.3 Research Questions

Our brief review shows that the lion's share of existing work has been focused on extracting
and synthesizing functional requirements (feature requests) from user reviews, with only a
few papers focusing on mining NFRs. To bridge this gap, in this paper, we focus on extract-
ing and classifying NFRs present in app store reviews. Our objectives are to a) investigate
the presence and distribution of NFRs in user reviews, and b) devise an automated approach
to capture and classify these quality attributes into general categories of NFRs. To guide our
analysis, we formulate the following research questions:

– RQ1: Do user reviews on app stores express any types of NFRs?
The first objective of our analysis is to examine the presence of NFRs in app user
reviews. In other words, we determine the percentage of user reviews, on average, that
contain user concerns which can be translated into valid NFRs.
– RQ2: Do users of different types of apps (e.g., different app categories and free vs. paid
apps) raise different types of NFRs?
Popular app stores, such as the Apple App Store and Google Play, classify
apps under several broad categories (genre) and subcategories of generic application
domains. For instance, the Gaming genre includes subcategories such as Sport, Board,
Card, Educational, and Racing. This sort of classification is intended to enable cus-
tomers to discover apps more effectively. In our analysis, we will explore if users in
different categories raise different types of NFRs. We further explore if these NFRs
differ between free and paid apps. Such information can help app developers to under-
stand the specific needs of app users in different categories and can be further used to
enable a more accurate automated classification process.

– RQ3: Can NFRs be automatically identified and classified?
Assuming app store reviews contain NFRs, filtering large numbers of reviews man-
ually can be a tedious and error-prone task. Therefore, our third objective in this paper
is to determine if NFRs present in technically informative app store reviews can be
automatically and accurately identified and classified.

3 Qualitative Analysis

In this section, we describe our qualitative analysis, including our data collection procedure,
the manual classification process, and the main findings of our analysis.

3.1 Data Collection

Our data collection procedure can be broken down into three main steps: category selection,
app sampling, and review collection. These steps can be described as follows:
1. The first step in our data collection procedure was concerned with identifying the set of
app categories to sample our data from. To determine these categories, we used the list
of most popular apps2 provided by the Apple App Store. This list is typically updated
on a daily basis. We retrieved our list, shown in Fig. 1, on June 15th, 2017 and selected
the categories of the top 100 apps in the list. This resulted in a set of 12 categories
(Table 1).
2. In the second step, we selected the individual apps to be included in our analysis from
the list of app categories identified in the previous step. Specifically, the top 100 apps
from each category were selected. Apps that did not have enough user reviews (≤ 500),
or had restrictions placed on API access, were removed. For each application domain,
two apps were randomly selected from the list. To ensure a representative sample, each
pair of apps consisted of a paid and a free app. Our final list of apps consisted of 24
apps in total (Martin et al. 2015). These apps are shown in Table 1.
3. The third step of our data collection procedure included collecting user reviews from
the set of apps identified in step 2. Technically, we developed a user review sniffing
tool that uses the RSS3 feed generator of iOS apps to extract their reviews. The tool
accepted the iOS app’s ID as input and returned the most recent reviews of the app. For
each app, we collected the 500 most recent user reviews, as sketched below.
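The snippet below illustrates this collection step. It is a minimal sketch that assumes Apple's public customer-reviews RSS/JSON feed; the exact endpoint URL format is an assumption and may change, and the sketch simplifies our actual sniffing tool.

import json
import urllib.request

def fetch_recent_reviews(app_id, pages=10, country="us"):
    """Fetch recent App Store reviews for an app via the iTunes RSS feed.

    NOTE: the endpoint below is an assumption based on Apple's public
    customer-reviews feed; its format and availability may change.
    """
    reviews = []
    for page in range(1, pages + 1):
        url = (f"https://itunes.apple.com/{country}/rss/customerreviews/"
               f"page={page}/id={app_id}/sortby=mostrecent/json")
        with urllib.request.urlopen(url) as response:
            feed = json.loads(response.read().decode("utf-8"))
        for entry in feed.get("feed", {}).get("entry", []):
            # Entries without a "content" field (e.g., app metadata) are skipped.
            if "content" in entry:
                reviews.append({
                    "title": entry["title"]["label"],
                    "text": entry["content"]["label"],
                    "rating": entry["im:rating"]["label"],
                })
    return reviews

# Example (hypothetical app ID): collect roughly the 500 most recent reviews.
# recent = fetch_recent_reviews("284882215", pages=10)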

3.2 Manual Classification

To conduct our qualitative analysis, we randomly sampled 250 reviews from the list of
reviews collected for each app. These reviews were then manually examined by three human
annotators to classify them into reviews raising valid user quality concerns (NFRs) and oth-
ers (miscellaneous). NFR reviews were then classified into more fine-grained categories of
NFRs. Around 250 NFRs have been recognized in the literature (Mairiza et al. 2010). These
NFRs extend over a broad range of categories and sub-categories of quality attributes. A
trivial classification approach would be to treat each quality attribute as a separate class.
However, carrying out such a labor-intensive analysis manually can result in erroneous data

2 https://www.apple.com/itunes/charts
3 https://rss.itunes.apple.com/us/?urlDesc=

Fig. 1 The top 100 apps in the Apple App Store and their categories as of July 2017

especially since some of these categories are vaguely defined and very closely related. Fur-
thermore, such classification would result in too fine-grained classes, which can limit the
generalization capabilities of any future automated classifiers.

Table 1 The set of free and paid sample apps used in our analysis

#    Domain             Free               Paid
1    Photos & Videos    KeepSafe           Enlight
2    Weather            AccuWeather        NOAA Radar Pro
3    Utilities          Avira Antivirus    Swype
4    Health & Fitness   FitBit             7 Minute Workout
5    Navigation         Google Maps        Boating USA
6    Entertainment      Prisma             ProCreate
7    Business           Adobe Acrobat      HotSchedule
8    Music              Pandora            Jukebox
9    Education          Lumosity           StarWalk
10   Lifestyle          Zillow             My Baby's Beat
11   Reference          Google Translate   WolframAlpha
12   Games              Pokémon Go         Super Mario Run

To overcome these limitations, we adopt the classification proposed by Kurtanović
and Maalej (2017), in which NFRs raised in user reviews were
classified into the general categories of Dependability, Usability, Performance, and
Supportability. Table 2 shows the low-level concrete NFRs that are classified under each
generic NFR category. In general, these categories can be described as follows:
– Dependability: this category includes user reviews that raise concerns about the depend-
ability, or question the trustworthiness, of the app, such as its reliability, availability,
and security. Examples of these reviews include, “Is the app secure enough to depend
on it?”, “Can we trust the app to keep everything private or to encrypt the data?”, and
“Does the app meet all legal requirements?”.
– Usability: usability includes reviews that raise issues related to the user interface (i.e., look
and feel, consistency, attractiveness, and layout), and the ease-of-use of the app (e.g.,
predictability, learnability, accessibility, understandability, readability, configurability,
and documentation). Examples of such reviews include, "I use one line but I can barely
hear the other person, even if my volume is at max” and “Extremely convoluted layout,
much more difficult to use than it needs to be, but at least it’s accurate”.
– Performance: this category includes user reviews that are concerned with the perfor-
mance of the app, such as its response time, scalability, and resource consumption.
Examples of such reviews include, “It’s ok but takes forever to load maps” and “I have
over a 1000 pictures in my dashboard and the scroll is glitchy”.
– Supportability: this category contains any user reviews that express issues related to the
app's ability to connect or function across multiple devices or platforms (e.g., compat-
ibility, interoperability, adaptability, and portability), in addition to any
concerns about updates or maintenance issues of the app (e.g., installability, testability,
and modifiability). Examples of such reviews include, “Do not get this app because
it cannot be used on older iPads using iOS 5” and “iOS 9 update won’t load and
repeatedly fails to update”.
Our manual classification of user reviews into the different categories of NFRs can be
described as a multi-label classification process. Specifically, a user review can be classi-
fied under multiple categories if it raises more than one NFR-related issue. This step was
necessary to ensure the accuracy of classification; as mentioned earlier, NFRs are inher-
ently vague—a single review can express multiple issues at the same time. For example,

Table 2 The four main generic categories of NFRs along with set of quality attributes classified under each
category

Usability            Dependability        Performance            Supportability
Look and Feel        Availability         Response Time          Adaptability
Accessibility        Reliability          Efficiency             Modifiability
Predictability       Safety/Privacy       Speed                  Compatibility
Learnability         Security             Scalability            Portability
Understandability    Exploitability       Throughput             Interoperability
Configurability      Accuracy             Resource Consumption   Installability
Readability          Stability            Load Time              Serviceability
Coherency            Legal                Startup Time           Maintainability
Documentation        Backup & Recovery

the review “The app switches screen orientation whenever I click save, the app then sud-
denly crashes when I try switch back” is classified under Dependability and Usability, and
the review “As soon as I connect to my apple watch, the app suddenly becomes slow to
operate” is classified under Performance and Supportability.
To facilitate the manual classification process, we created a special-purpose tool to aid
our judges (annotators) in their classification effort4 . The tool saves the results of classi-
fication in a SQL database for further processing. A review is assigned to any of the four
classes only if that class receives at least two votes. Our judges included
two Ph.D. students in Software Engineering and a senior undergraduate student in Com-
puter Science. The undergraduate student was enrolled in the Software Engineering class
one semester earlier. This class included a semester-long term project on NFRs. Both Ph.D.
students reported three years of industrial software development experience, while the
undergraduate student reported one year of industrial programming experience, acquired
through multiple summer internships. A discussion session was held before the task was
assigned to the judges. The objectives of this session were to discuss the different categories
of NFRs that we were targeting, introduce the judges to the classification tool, and resolve
any issues or concerns they had. No time constraint was enforced to avoid any fatigue issues
and to ensure the integrity of the data (Wohlin et al. 2012). The manual classification process
took approximately two months5 .
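To illustrate the voting rule, the following hypothetical sketch shows how the two-vote threshold can be applied to the judges' individual label assignments; it does not reproduce the actual tool's implementation.

from collections import Counter

NFR_LABELS = ["Dependability", "Usability", "Performance", "Supportability"]

def aggregate_votes(votes_per_judge, min_votes=2):
    """Assign an NFR label to a review only if at least `min_votes` judges chose it.

    votes_per_judge: list of label sets, one set per judge,
                     e.g., [{"Usability"}, {"Usability", "Performance"}, set()]
    Returns the set of labels assigned to the review (empty set = miscellaneous).
    """
    counts = Counter(label for labels in votes_per_judge for label in labels)
    return {label for label in NFR_LABELS if counts[label] >= min_votes}

# Example: two of the three judges voted Usability, one also voted Performance.
print(aggregate_votes([{"Usability"}, {"Usability", "Performance"}, set()]))
# -> {'Usability'}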

3.3 Results and Discussion

The outcome of our manual classification process is shown in Fig. 2. The results revealed
that, out of the 6,000 user reviews examined, 39.40% (2,364) raised at least one
NFR, out of which 79% (1,869) received more than two votes (RQ1). For the remain-
ing 60.6% (3,636) of reviews, our data shows that 86% (3,114) received no NFR votes
while 14% (522) received only one vote. Usability (18.63%) and Dependability (17.22%)
were the two most frequently raised NFRs, followed by Supportability (8.22%) and then
Performance (1.91%). Out of the 1033 Dependability reviews, 296 were related to app
crashing. Out of the 119 Performance reviews, 64 were related to slow performance. Out of
the 494 Supportability reviews, 348 were related to app update issues, and out of the 1153
Usability reviews, 306 were related to user interface issues.
Figure 3 shows the number of NFRs raised in the set of reviews sampled from each app
category. The results show that the Education, Photos & Videos, and Music categories
have the lowest numbers of NFR reviews, while users in the Health & Fitness, Business, and
Utilities categories raised the largest numbers of NFRs. In general, our results show that some
categories of NFRs tend to be category specific, while others are common to almost all
categories (RQ2). For example, the majority of user complaints in the Business category are
related to Dependability, where users are raising concerns about apps randomly crashing or
errors in calculations and security. This can be attributed to the fact that users expect apps
managing their business to be accurate, secure, and to protect their data and conform to the
legal standards. Supportability issues were very frequent in the Health & Fitness category.
This can be explained based on the fact that users expect these apps to synchronize with their
other wearable fitness devices. Such devices (e.g., smart watches) usually work through a
Bluetooth connection that may not be reliable and has limited data transfer speeds. Anytime

4 https://github.com/seelprojects/ManualReviewClassifier
5 The data is available at: http://seel.cse.lsu.edu/data/emse19.zip

Fig. 2 The number of NFRs per app category detected in the reviews sampled from each app category

the device fails to connect with an external device, users seem to complain about such issues
in their reviews.
A large number of Performance issues were also raised in Health & Fitness apps. In
general, users often use these apps while performing fitness activities, such as running or
cardio, thus they expect the app to keep up with their motion in real-time. Our qualitative
analysis also shows that almost all app categories suffer from Usability issues, where 42%
of all user NFRs in each category are related to Usability. These results confirm previous
findings that users mostly talk about usability issues in their reviews (Groen et al.
2017).
We further analyze the results in terms of free vs. paid apps. Figure 4 shows the num-
ber of NFRs raised in free and paid apps. In general, users raised more Dependability,
Performance, and Supportability concerns in the paid apps, while the number of Usability
concerns raised was lower. In general, all free apps had more Usability concerns when
compared to the paid apps from the same category. We also found that paid apps received
longer reviews. The average review length from free apps was 30 words while the average
review length from paid apps was 35 words. In general, users tend to be more critical and
discontented when there is an issue with paid apps, thereby leaving longer reviews while

Fig. 3 The total number of NFRs detected in each app category

expressing their concerns. Finally, it is important to point out that these results were obtained
by analyzing iOS app reviews. Analyzing reviews from other app stores (e.g., Google Play)
might lead to different conclusions (McIlroy et al. 2016).

Fig. 4 The number of NFRs manually detected in the free and paid apps in our dataset

4 Automated Classification

Under this phase of our analysis, we investigate the performance of automated classification
techniques in classifying user reviews into the different NFR categories identified earlier.
Specifically, our problem can be described as a multi-label classification problem, where
each review can be classified under more than one label. Formally, a multi-label classifi-
cation problem can be described as follows: let L be a finite and non-empty set of labels
{l_1, ..., l_L}, let Y be an input space, and let Z be the output space, defined as a subset of the
set of labels L. A multi-label classification task can be given by D = {(y_1, z_1), ..., (y_n, z_n)} ⊂
Y × Z, where (y_1, z_1) is the classification of the data instance y_1 ∈ Y under the label
z_1 ∈ Z. In what follows, we describe our performance measures and classification techniques,
and discuss our results in greater detail.

4.1 Evaluation Measures

The outcome of multi-label classification can be either fully correct, partially correct, or
completely incorrect. Therefore, the standard precision and recall metrics, typically used in
binary classification tasks, need to be accompanied with other measures that can account
for partial correctness. To account for such information, we use the multi-label evaluation
measures proposed by Godbole and Sarawagi (2004).

Example Let χ be an instance space consisting of n data samples. Let L = {λ_1, λ_2, ..., λ_k}
be a finite set of class labels. Let Y = ⟨y_1, y_2, ..., y_n⟩ represent the vector space for the
correct labels of each instance x_i ∈ χ, where y_i ∈ {0, 1}^k, 1 ≤ i ≤ n. In a multi-label
classification task, a classifier, h, predicts Z = ⟨z_1, z_2, ..., z_n⟩, z_i ∈ {0, 1}^k, 1 ≤ i ≤ n,
which represents the vector space for the predicted labels of each instance x_i ∈ χ. Given these
assumptions, consider the following example of three reviews, x_1, x_2, and x_3, each belonging
to at least one of the three labels λ_1, λ_2, and λ_3. This example shows three different prediction
scenarios. More specifically, review x_1 is predicted partially correctly, review x_2 is predicted
correctly, and review x_3 is predicted completely incorrectly.

(1)  [Example matrix of the correct labels Y and the predicted labels Z for x_1, x_2, and x_3]
Given this example, the performance of h can be measured as follows:
– Subset Accuracy (SA): also referred to as Exact Match (EM), is the number of pre-
dictions that are completely correct divided by the total number of classified data
instances.

$$ SA = \frac{1}{n}\sum_{i=1}^{n} I(Y_i = Z_i) \qquad (2) $$

where I(·) is the indicator function.
– Hamming Score (HS): is the proportion of correctly predicted labels over the total
number (predicted and actual) of labels identified for a data instance.

$$ HS = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \qquad (3) $$

– Hamming Loss (HL): is a measure of how many times, on average, a class label is incor-
rectly predicted. This measure takes into account the prediction error (an incorrectly
predicted label) along with the missing error (a label not predicted), normalized over
the total number of classes and the total number of examples (Sorower 2010). HL =
0 implies that there is no error in the prediction. Practically, the smaller the value of
Hamming Loss, the better an algorithm performs.

$$ HL = \frac{1}{|n|}\sum_{i=1}^{|n|} \frac{xor(Y_i, Z_i)}{|l|} \qquad (4) $$
– Recall, Precision, and F-Measure: in addition to the above measures, we compute the
macro-averaged Precision (P), Recall (R), and F-Measure (F_β) values for h. These
measures are computed independently for each classification label and averaged over
all the labels. Precision is calculated as the ratio of the number of correctly classified
instances under a specific label (t_p) to the total number of classified instances under the
same label (t_p + f_p).

$$ P = \frac{t_p}{t_p + f_p} \qquad (5) $$

Recall is calculated as the ratio of t_p to the total number of instances belonging to
that label (t_p + f_n).

$$ R = \frac{t_p}{t_p + f_n} \qquad (6) $$

The F-measure represents the weighted harmonic mean of precision and recall, where β is
used to emphasize either precision or recall. In our analysis, we use β = 2 to emphasize
recall over precision, assuming that errors of commission (false positives) are easier to
deal with than errors of omission (false negatives). Formally, F_2 is calculated as follows:

$$ F_2 = \frac{5PR}{4P + R} \qquad (7) $$
Such values can be used to estimate how much of a tool's imprecision can be tolerated
to gain the benefit of its recall, in situations where it takes much more time to
manually find a true positive than it does to reject a tool's false positive.
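To make these measures concrete, the sketch below computes Subset Accuracy, Hamming Score, Hamming Loss, and the macro-averaged precision, recall, and F2 for binary label matrices Y and Z. The three example vectors mirror the partially correct / correct / incorrect scenarios described above; they are illustrative and are not taken from our dataset.

import numpy as np

def multilabel_scores(Y, Z, beta=2.0):
    """Compute multi-label measures for true labels Y and predictions Z.

    Y, Z: binary matrices of shape (n_samples, n_labels).
    """
    Y, Z = np.asarray(Y), np.asarray(Z)
    n, k = Y.shape

    # Subset Accuracy (Exact Match): fraction of instances predicted exactly right.
    subset_accuracy = np.mean([np.array_equal(Y[i], Z[i]) for i in range(n)])

    # Hamming Score: |Y_i AND Z_i| / |Y_i OR Z_i| averaged over instances.
    intersection = np.logical_and(Y, Z).sum(axis=1)
    union = np.logical_or(Y, Z).sum(axis=1)
    hamming_score = np.mean(np.where(union == 0, 1.0,
                                     intersection / np.maximum(union, 1)))

    # Hamming Loss: fraction of label positions predicted incorrectly.
    hamming_loss = np.mean(Y != Z)

    # Macro-averaged precision, recall, and F-beta over the labels.
    precisions, recalls = [], []
    for j in range(k):
        tp = np.sum((Y[:, j] == 1) & (Z[:, j] == 1))
        fp = np.sum((Y[:, j] == 0) & (Z[:, j] == 1))
        fn = np.sum((Y[:, j] == 1) & (Z[:, j] == 0))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    P, R = np.mean(precisions), np.mean(recalls)
    F = (1 + beta**2) * P * R / (beta**2 * P + R) if P + R else 0.0
    return subset_accuracy, hamming_score, hamming_loss, P, R, F

# Three reviews, three labels: x1 partially correct, x2 fully correct, x3 fully incorrect.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
Z = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(multilabel_scores(Y, Z))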

4.2 Binary Relevance Classification

In multi-label classification, a data sample can be classified under one or more labels. Such
classification is usually performed by decomposing the problem into multiple, independent
classification tasks (Ghamrawi and McCallum 2005). One commonly used and straightfor-
ward approach to perform multi-label classification is Binary Relevance (BR). BR assumes
label independence. Specifically, it decomposes a problem with n classes (labels) into n
binary problems, as shown in Fig. 5. In each problem, a single label i is considered to be the
correct class and the rest of classes (n − 1) are considered to be incorrect. BR then learns a
single binary model for each of the n binary problems (Tsoumakas et al. 2009). The output
of the classifier is the union of predictions, where a data sample is assigned to any label i
if the classifier classified it under that label. Despite its sometimes-unrealistic label inde-
pendence assumption, BR has been successfully used in tasks such as assigning keywords to
scientific papers, illnesses to patients, and emotional expressions to human faces (Luaces
et al. 2012).

Fig. 5 An illustration of multi-label classification using Binary Relevance (BR)
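As an illustration of the BR decomposition in Fig. 5, scikit-learn's OneVsRestClassifier trains one independent binary classifier per label and unions their predictions. The toy reviews and label vectors below are made up for illustration only, and LinearSVC stands in for the SVM classifier discussed in the next subsection; this is a minimal sketch, not our full configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy training data: each review may carry several of the four NFR labels
# (column order: Dependability, Performance, Supportability, Usability).
reviews = [
    "the app crashes every time I log in",
    "takes forever to load and drains my battery",
    "won't sync with my watch after the update",
    "the layout is confusing and the buttons are tiny",
]
labels = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Binary Relevance: one independent binary SVM per label; the prediction is the
# union of the four binary decisions.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)
br_svm = OneVsRestClassifier(LinearSVC()).fit(X, labels)

print(br_svm.predict(vectorizer.transform(["the app is slow and keeps crashing"])))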

4.2.1 Classification Settings

In our analysis, we employ two commonly-used text classification algorithms as a basis for
BR: Naive Bayes (NB) and Support Vector Machines (SVM). Both algorithms have been
extensively used in related literature to extract and classify user reviews (McIlroy et al. 2016;
Panichella et al. 2015; Maalej and Nabil 2015; Jha and Mahmoud 2017a). Furthermore,
both algorithms are commonly used in multi-label classification tasks (Brinker et al. 2006;
Elisseeff and Weston 2001). These algorithms can be described as follows:
– Naive Bayes: NB is a simple, yet efficient, linear probabilistic classifier that is based
on Bayes’ theorem (Lewis 1998). It assumes conditional independence between the
attributes’ values for each class. In the context of text classification, NB adopts the
Bag-of-Word (BOW) approach where the features of the model are the individual words
from the text. Such data is typically represented using a two-dimensional word × document
matrix. The entry (i, j) in the matrix can be either a binary value that indicates whether
the document d_i contains the word w_j or not (i.e., {0, 1}), or the normalized term
frequency of the word w_j appearing in the document d_i (McCallum et al. 1998). In the
domain of app review classification, NB has been shown to be very competitive in its
performance, detecting different types of reviews with decent levels of accuracy over
multiple datasets (Panichella et al. 2015; Maalej and Nabil 2015).
– Support Vector Machines: SVM is a supervised machine learning algorithm that is
used for classification and regression analysis (Burges 1998). Technically, SVM tries
to find optimal hyperplanes for linearly separable patterns in the dataset and then max-
imizes the margin around the separating hyperplanes. The position of the dividing
hyperplanes is determined by the location of the support vectors, which are the criti-
cal instances in the training dataset. SVM classifies the data by mapping input vectors
into an N-dimensional space, and deciding on which side of the defined hyperplane the
new data instance lies. SVMs have been empirically shown to be very effective in high

dimensional feature spaces and sparse instance vectors (Joachims 1998). In the domain
of app review classification, SVM has also been shown to perform very well (McIlroy
et al. 2016; Jha and Mahmoud 2017a, b).
In addition to the underlying classifier, we consider the following classification configu-
rations:
– Text pre-processing: we use text reduction strategies to reduce the number of features
(words) the classifier has to deal with, thus removing any information that might nega-
tively impact the predictive power of the classifier. For example, we remove English
words that are considered too generic (e.g., the, in, will). These words are highly
unlikely to carry any generalizable information to the classifier. Text reduction strate-
gies also include techniques such as stemming and lemmatization. Stemming is the
process of removing derivational suffixes as well as inflections so that word variants can
be conflated into the same roots or stems. Lemmatization, on the other hand, reduces
the inflectional forms of words into the base or dictionary form of the word (Bird et al.
2009). Both techniques are commonly used in app review classification tasks (Maalej
et al. 2016; Panichella et al. 2015). The results often show a marginal impact of these
techniques on the precision of classification. In our analysis, we use stemming for its
lower overhead. Specifically, lemmatization techniques are often exponential to the text
length, while stemming is known for its linear time complexity (Bird et al. 2009).
– App category: we introduce the app category, or domain name (D), as a classification
feature. We base this decision on the results of our qualitative analysis. Specifically, our
analysis has revealed that users in different app categories tend to raise different NFRs.
Therefore, such information might carry some distinctive value to the classifier.
– Sentiment Analysis: we further introduce sentiment analysis (SEN) as a classification
feature. A pre-assumption is that, different categories of NFRs are typically expressed
using different sentiments. Sentiment analysis determines whether a text conveys posi-
tive, neutral, or negative feelings (Wilson et al. 2005). To conduct our analysis, we use
VADER—Valence Aware Dictionary and sEntiment Reasoner, a more recent and pop-
ular rule-based sentiment classifier which is designed to identify sentiments in social
media texts (Hutto and Gilbert, E 2014). VADER uses grammatical and syntactical
cues to identify sentiment intensity in user reviews, such as punctuation (e.g., num-
ber of exclamation points), capitalization (e.g., “I HATE THIS APP” is considered to
be more intense than “I hate this app”), degree modifiers (e.g., “The new sync fea-
ture is extremely good” is considered to be more intense than “The new sync feature is
good”), the contrastive conjunction “but” to shift the polarity, and tri-gram examination
to identify negation (e.g., “The app isn’t really all that great”) (Ribeiro et al. 2015).
The output of VADER is a sentiment score, s, between -1 (extremely negative) and
1 (extremely positive). We use the following scale to assign the positive, neutral, and
negative sentiment to each review:

$$ \text{sentiment} = \begin{cases} \text{positive}, & \text{if } s \geq 0.5 \\ \text{negative}, & \text{if } s \leq -0.5 \\ \text{neutral}, & \text{otherwise} \end{cases} $$
Figure 6 shows the proportions of different sentiments raised in each NFR category,
including the miscellaneous category. The results show that the majority (77%) of the
miscellaneous reviews are expressed with a positive sentiment. This shows that if a
user is not expressing any concerns related to the app then the review is most likely to
be a praise. For example, "This app is amazing" and "I love the new send anywhere
feature". Such information may be used by automated classifiers to separate reviews
raising NFRs from the miscellaneous ones. A minimal sketch of this scoring step is shown
after this list.

Fig. 6 The percentage of NFRs in our dataset with different types of sentiments
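This is a minimal sketch of the scoring step, assuming the vaderSentiment Python package; the thresholds follow the scale defined above.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def review_sentiment(text, threshold=0.5):
    """Map VADER's compound score (between -1 and 1) to a sentiment class."""
    s = analyzer.polarity_scores(text)["compound"]
    if s >= threshold:
        return "positive"
    if s <= -threshold:
        return "negative"
    return "neutral"

for review in ["This app is amazing", "The app isn't really all that great"]:
    print(review, "->", review_sentiment(review))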

4.2.2 Classification Tool

To evaluate the performance of BR under the different classification settings, we use the
Scikit-learn toolkit (Pedregosa et al. 2011). Scikit-learn is a Python library that integrates a
wide range of state-of-the-art machine learning algorithms for supervised and unsupervised
classification problems (Pedregosa et al. 2011). To classify our data, we start by splitting the
data into a 70:30 ratio for training and testing, respectively. The dataset and its correspond-
ing labels are split using Scikit-learn's split function, as shown in line 2 of Listing 1. The
parameter random_state is the seed used by the random number generator to randomize
the dataset. The CountVectorizer class is then used to transform the reviews into bag-of-
words (BOW) vectors. CountVectorizer provides a rich set of parameters to pre-process
the input text, for example, removing English stop-words, converting to lowercase, and
defining a tokenizer to perform custom pre-processing. Currently, CountVectorizer does
not provide a stemming functionality; therefore, we use NLTK's (Bird et al. 2009) Porter
stemmer (lines 5-14). NLTK6, the Natural Language Toolkit, is a platform for building Python
programs to work with human language data. The implementation of NLTK's Porter stem-
mer is passed as the tokenizer parameter during the transformation of reviews into vectors
using the CountVectorizer class (lines 17-20). To remove the stop-words, we use Scikit-
learn's stop-words list. A classifier is then defined and used to predict the output labels (lines
22-27). The precision and recall scores are generated using Scikit-learn's built-in library
to measure the classifier's accuracy (lines 30-31). This operation is performed for each of

6 https://www.nltk.org/

the four NFR labels in our dataset. The output labels are then combined to generate a vector
space of predicted labels (i.e., (1)).
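The sketch below approximates the steps described above (70:30 split, a Porter-stemming tokenizer passed to CountVectorizer, stop-word removal, and one binary classifier per NFR label). It simplifies the Code 1 listing rather than reproducing it; LinearSVC is used here as one possible classifier, and none of the parameter choices are implied to be those of our actual setup.

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    """Tokenize a review and reduce each token to its Porter stem."""
    return [stemmer.stem(token) for token in word_tokenize(text)]

def classify_label(reviews, labels):
    """Train and evaluate a single binary classifier for one NFR label.

    reviews: list of review strings; labels: list of 0/1 values for one NFR class.
    """
    # 70:30 train/test split; random_state seeds the shuffling.
    train_x, test_x, train_y, test_y = train_test_split(
        reviews, labels, test_size=0.3, random_state=42)

    # Bag-of-words vectors with stop-word removal, lowercasing, and stemming.
    vectorizer = CountVectorizer(tokenizer=stemming_tokenizer,
                                 stop_words="english", lowercase=True)
    train_vectors = vectorizer.fit_transform(train_x)
    test_vectors = vectorizer.transform(test_x)

    classifier = LinearSVC().fit(train_vectors, train_y)
    predicted = classifier.predict(test_vectors)
    return precision_score(test_y, predicted), recall_score(test_y, predicted), predicted

# Repeating classify_label() for each of the four NFR labels and stacking the
# predicted columns yields the vector space of predicted labels (i.e., (1)).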

4.2.3 Results

The classification results, in terms of the different performance measures, are shown in
Table 3. In general, BR_NB performed poorly, with an average F2 = 0.26 and HS = 0.32.
Stemming and stop-word removal had a limited impact on the performance of BR_NB,
achieving an increase of 10% in precision and 5% in recall. In general, BR_SVM performed
better than BR_NB, achieving F2 = 0.58 and HS = 0.44, and similar to BR_NB, stemming and
removing stop-words had a small positive impact on the classification accuracy of BR_SVM.
On average, the classification accuracy of BR_SVM+STM and BR_SVM+STM+SW is comparable to
one another. The relatively better performance of BR_SVM in comparison to BR_NB can be
attributed to its overfitting avoidance tendency, an inherent behavior of margin maximiza-
tion which does not depend on the number of features (Brusilovsky et al. 2007). Therefore,
it has the potential to scale up to high-dimensional data spaces with sparse instances (e.g.,
textual data).

Code 1 A sample Python code to generate a classifier using the Scikit-learn tool and the NLTK library
Table 3 The performance of Binary Relevance (BR) under the different classification configurations
(NB: Naive Bayes, SVM: Support Vector Machines, STM: Stemming, SW: Stop-Word removal, D: App
category, SEN: Sentiment score)

                          Dep.         Perf.        Sup.         Us.          Average
Classification settings   P     R      P     R      P     R      P     R      HS    SA    HL    P     R     F2
NB                        0.69  0.47   0     0      0.58  0.13   0.64  0.33   0.32  0.26  0.22  0.48  0.23  0.26
NB + STM                  0.67  0.54   0     0      0.68  0.22   0.68  0.39   0.38  0.31  0.20  0.51  0.29  0.31
NB + STM + SW             0.68  0.53   0.25  0.03   0.71  0.23   0.68  0.35   0.36  0.30  0.21  0.58  0.28  0.31
SVM                       0.65  0.55   0.58  0.53   0.54  0.44   0.54  0.45   0.44  0.35  0.20  0.58  0.49  0.51
SVM + STM                 0.65  0.54   0.59  0.59   0.54  0.48   0.57  0.51   0.48  0.38  0.19  0.59  0.53  0.54
SVM + STM + SW            0.66  0.56   0.65  0.59   0.59  0.48   0.56  0.46   0.47  0.38  0.19  0.62  0.53  0.55
NB + STM + D              0.69  0.54   0.20  0.03   0.69  0.26   0.67  0.38   0.39  0.32  0.20  0.56  0.30  0.33
NB + STM + SW + D         0.69  0.53   0.25  0.03   0.72  0.27   0.67  0.35   0.37  0.31  0.20  0.58  0.29  0.32
SVM + STM + D             0.65  0.56   0.59  0.59   0.57  0.48   0.55  0.49   0.47  0.38  0.19  0.59  0.53  0.54
SVM + STM + SW + D        0.67  0.58   0.69  0.62   0.62  0.52   0.56  0.45   0.49  0.40  0.18  0.64  0.54  0.56
NB + STM + SEN            0.66  0.56   0     0      0.67  0.23   0.68  0.38   0.38  0.31  0.21  0.50  0.29  0.31
NB + STM + SW + SEN       0.67  0.54   0.25  0.03   0.71  0.22   0.67  0.36   0.36  0.29  0.21  0.58  0.29  0.32
SVM + STM + SEN           0.64  0.54   0.61  0.59   0.56  0.48   0.56  0.50   0.47  0.38  0.19  0.59  0.53  0.54
SVM + STM + SW + SEN      0.66  0.58   0.68  0.59   0.59  0.47   0.55  0.46   0.47  0.38  0.19  0.62  0.53  0.55

Our results also show that considering sentiment as a classification feature leads to a
slight improvement in accuracy, detected at BR_SVM+STM+SW+SEN, which achieved F2 = 0.55
and HS = 0.47. To utilize the app category name (D) as a classification feature, the list of
features for each review is altered to include the general category of the app. The results
show that including the category name generally improves the classification accuracy. The
best outcome is obtained when the category name is added as a classification feature to
BR_SVM+STM+SW (i.e., BR_SVM+STM+SW+D), resulting in F2 = 0.56 and HS = 0.49.
In general, the poor performance of the BR method can be attributed to the fact that it
treats every class independently of all other classes (Cheng and Hüllermeier 2009; Bi and
Kwok 2014). A closer look at the distribution and overlap of categories in our dataset reveals
that several labels occur with each other frequently, as shown in Fig. 7. For example, 23%
of all Dependability reviews, 55% of all Performance reviews, 37% of all Supportability
reviews, and 23% of all Usability reviews overlap with at least one other category. This
can be explained based on the fact that NFRs are vague in nature, not well-defined, and
often intertwine with each other. Such relations are not taken into account when using BR,
which partially explains why it performs poorly in our dataset. Furthermore, when multiple
independent classifiers are constructed during the training phase, the dataset becomes unbal-
anced. In particular, for each binary classifier, the number of negative instances becomes
extremely high when compared to the number of positive instances. Therefore, the words
(classification features) that are more distinctive to the negative class are assigned higher
weights. The classifier then favors the negative instances and fails to accurately predict a
positive instance.

4.3 Dictionary-Based Classification

Previous research has shown that NFRs can be classified using term matching (Cleland-
Huang et al. 2006). The main assumption is that there is a set of keywords, or indicator
terms, that trigger certain NFR issues. For example, words such as slow, lag, and delay
typically indicate an issue with performance. These terms are what a human expert would
use as signals to detect an NFR in a user review. Therefore, if these terms are captured and
indexed, they can be used to automatically detect these issues, or in other words, imitate the
human classification process. In what follows, we describe this approach in greater detail.

Fig. 7 A Venn diagram of the distribution of classification labels and their overlap in the training dataset

Table 4 Lists of indicator terms manually identified for each category of NFRs

NFR class        Indicator terms
Dependability    accuracy, authenticate, bug, correct, count, crash, delete, disappear, error, fail,
                 failure, fix, glitch, issue, log, login, lost, password, privacy, recognize, reliable,
                 reset, restart, restore, shut, unstable, username, wrong
Performance      battery, buffer, delay, drain, fast, freeze, improve, jerk, speed, lag, load, memory,
                 notify, optimize, slow, stuck, wait
Supportability   dropbox, bluetooth, calibrate, cellular, cloud, compatible, compute, connect, iOS,
                 iPhone, iTunes, import, internet, ipad, language, laptop, offline, onedrive, phone,
                 require, server, service, support, sync, update, upgrade, version, watch, wifi
Usability        access, ad, alarm, autocorrect, bookmark, brush, button, change, click, color,
                 confuse, control, create, crop, difficult, draw, figure, filter, format, hard, hear,
                 highlight, icon, interface, keyboard, landscape, listen, option, order, orient, pick,
                 pixel, portrait, rate, read, repetitive, replace, resize, scan, screen, select, shape,
                 shuffle, style, switch, tap, track, turn, tutorial, type, unusable, verify, view,
                 volume, website, widget, zoom

4.3.1 Extracting and Evaluating NFR Indicator Terms

To extract the set of indicator terms, we adopted an iterative process. At each iteration, both
authors went through the set of reviews classified under each NFR category. For each
review, we looked for terms that might have triggered the classification of that review under
its NFR category. These terms were extracted and added to our indicator term dictionary.
The coverage of the dictionary was then examined using a simple term matching analysis
over the dataset. Basically, if a review in our dataset contained one of the indicator terms
in our dictionary, the review was classified under the NFR class of that term. This process
was repeated for three iterations until no more distinctive terms could be identified; the resulting
dictionary is shown in Table 4. The generated dictionary achieved a recall of 0.85 over the
dataset, thus providing evidence that we were able to capture most of the indicator
terms. The loss in recall came from cases where there were no generic indicator terms in
the reviews. For example, the review “it [AccuWeather] had snow yesterday and it was not
even raining” expresses a Dependability issue; basically the app is not making dependable
weather forecasts. However, we were not able to identify any generic terms that would help
to classify this review as Dependability.
To assess the quality of the generated dictionary, we examine its ability to accurately
capture NFRs present in previously unseen reviews. Specifically, we prepared a new dataset
of reviews for testing purposes. This dataset consisted of 600 reviews sampled from a
set of 12 apps randomly selected from the different iOS app categories identified ear-
lier. These apps are: iMovie, Weather Und, TrueCaller, SweatCoin, Transit,
Netflix, OneDrive, Amazon Music, iTunes U, Realtor, Dictionary, and
Temple Run. The most recent 50 user reviews (collected during the first week of January,
2018) for each app were classified using the same manual classification procedure used to
prepare our original dataset7 . To conduct our analysis, we loop through the 600 reviews in
the test set, if a review contains a term that belongs to the set of indicator terms of any of the

7 Dataset is available at: http://seel.cse.lsu.edu/data/emse19.zip



NFR categories, the review is classified under that category. Given that we follow a multi-
label classification procedure, it is possible for a review to be classified under multiple NFR
labels.
The results of this process are shown in the first row of Table 5. In general, while this classification
scheme managed to achieve a high recall (0.94), it suffered in terms of precision
(0.51). A closer look at the data reveals that the low precision can be attributed to
longer reviews. Specifically, in shorter reviews, users tend to be more straightforward in
expressing their concerns. In most of these reviews, a single indicator term can be suffi-
cient to classify the review. For example, the following three reviews can be classified under
Supportability, Usability, and Performance, respectively, with a high degree of certainty due
to the presence of the terms sync, shuffle, and lagged.

– I cannot sync my [Fitbit] to my iPhone.
– The app [Jukebox] doesn’t seem to have any form of shuffle button.
– The game [Pokémon Go] lagged and didn’t give me any coins.

However, our qualitative analysis has revealed that in longer reviews, users tend to
express both positive and negative concerns about the app. Consequently, users may use
indicator terms that do not necessarily indicate an NFR issue. For example, in the following
review:

I like this app [Pandora] but for some reason it doesn’t automatically see my wifi,
its so annoying I have to click connect every time I come back home or go to school or
work.

the term click in the above review does not indicate a Usability concern. The review,
however, can be classified under the Supportability class due to the presence of the indicator
terms connect and wifi. In such cases, simply assigning an NFR class based on a single
indicator term typically results in more false positives, and thus a lower precision.
To work around this problem, for longer reviews, we enforce a two-indicator-term match-
ing rule. Particularly, for reviews over a certain length (in number of words), two indicator
terms must be present in the review for it to be classified under a specific NFR category.
To determine a suitable length threshold, we run an exhaustive search.
Specifically, we initially remove English stop-words (e.g., the, is, at, a, which, and on).
We then start with a four-word review length as the initial cut-off point (threshold) between
shorter and longer reviews. If a review is four words or fewer, it is assigned to a spe-
cific NFR category if it contains at least one indicator term from that category. For reviews
with more than four words, two or more indicator terms from the NFR category are required
for the review to be classified under that category. To approximate a near-optimal length,
we gradually increase the short/long review cut-off point and measure the performance
in terms of the different performance measures.
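A sketch of this threshold search is shown below (our illustration, not the authors' code). It assumes matching is done on exact lower-cased tokens, reuses the INDICATOR_TERMS dictionary from the earlier sketch, shows only an illustrative subset of stop-words, and micro-averages precision and recall over label assignments, which may differ slightly from the paper's per-class averaging.

```python
# Sketch of the review-length threshold search: short reviews need one
# indicator term per class, longer reviews need at least two.
STOP_WORDS = {"the", "is", "at", "a", "which", "on", "and", "to", "it"}

def classify(review: str, cutoff: int) -> set:
    words = [w for w in review.lower().split() if w not in STOP_WORDS]
    required = 1 if len(words) <= cutoff else 2
    labels = set()
    for nfr, terms in INDICATOR_TERMS.items():
        hits = sum(1 for w in words if w in terms)
        if hits >= required:
            labels.add(nfr)
    return labels

def sweep_cutoffs(reviews, gold_labels, cutoffs=range(4, 21)):
    """Print micro-averaged precision/recall for each candidate cut-off."""
    for c in cutoffs:
        tp = fp = fn = 0
        for review, gold in zip(reviews, gold_labels):
            pred = classify(review, c)
            tp += len(pred & gold)
            fp += len(pred - gold)
            fn += len(gold - pred)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        print(f"cutoff={c:2d}  precision={p:.2f}  recall={r:.2f}")
```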
The results of this analysis are shown in Fig. 8. In general, at smaller cut-off points,
the precision is relatively high and the recall is low (too many false negatives). The recall
gradually increases as the threshold increases; however, the precision decreases. At a 12-
word cut-off point, the performance achieves a reasonable balance between precision and recall.
Table 5 The performance of the dictionary-based approach under different configuration settings

Classifier settings           Dep.        Per.        Sup.        Us.         Average
                              P     R     P     R     P     R     P     R     HS    HL    SA    P     R     F2
TM                            0.60  0.92  0.39  0.98  0.45  0.99  0.60  0.85  0.62  0.19  0.48  0.51  0.93  0.80
TM + (12)                     0.76  0.75  0.74  0.74  0.67  0.95  0.69  0.70  0.69  0.12  0.60  0.72  0.79  0.77
TM + TF-IDF + (12)            0.70  0.78  0.48  0.91  0.60  0.97  0.66  0.72  0.66  0.14  0.55  0.61  0.85  0.79
TM + TF + (12)                0.72  0.81  0.66  0.93  0.66  0.98  0.67  0.75  0.69  0.12  0.61  0.68  0.87  0.82
TM + TF + (12) + (Android)    0.74  0.89  0.71  0.88  0.66  0.92  0.80  0.69  0.74  0.10  0.69  0.72  0.85  0.82

TM: basic term matching, (12): 12-word review length threshold applied, TF/TF-IDF: term matching with TF and TF-IDF super indicator terms support respectively, and Android: the performance on Android reviews

Fig. 8 The performance of dictionary matching at different cut-off points (review length)

The average performance in terms of the different performance measures at threshold
12 is shown in the second row of Table 5. At this point, precision = 0.72 and recall = 0.79.

4.3.2 Optimizing Accuracy

To further improve the classification accuracy, we examined several cases of the misclassi-
fied reviews. The analysis showed that some indicator terms have more distinctive value to
their NFR categories than others. We refer to these terms as the set of super indicator terms.
In other words, the presence of any super indicator term in a review is enough to indicate
the presence of an NFR even if no other indicator terms are present. Consider, for example,
the following review:

I love the [Fitbit] app, it helps me a lot, but when I want to add another Fitbit or do
anything it lags my entire iPhone, and I never get to do what I was trying to do.

In this relatively long review, only the indicator term lag is present. Therefore, the review
was not classified under Performance. However, our manual analysis has shown that the word

lag is always associated with Performance. Given this observation, we update our classifica-
tion procedure to consider such super indicator terms. To identify these terms, we use term-
frequency techniques commonly employed to identify important words in text collections. These techniques
can be described as follows:
– Relative term frequency (TF): the TF weight of a word can be calculated as:

  $TF(w_{ij}) = \frac{fw_{ij}}{N_j}$    (8)

  where $fw_{ij}$ is the frequency of the word $w_{ij}$ in the set of reviews classified under the NFR category $j$ and $N_j$ is the total number of unique words in these reviews.
– Term frequency-inverse document frequency (TF-IDF): TF-IDF is a measure of term specificity. It accounts for a word’s scarcity across all reviews by using the inverse document frequency (IDF) of the word. IDF penalizes words that are too frequent in the reviews. Formally, TF-IDF can be computed as:

  $TF\text{-}IDF(w_{ij}) = TF(w_{ij}) \times \log \frac{|R_j|}{|\{r_i : w_{ij} \in r_i \wedge r_i \in R_j\}|}$    (9)

  where $TF(w_{ij})$ is the frequency of the word $w_{ij}$ in the collection of reviews classified under the NFR category $j$, $|R_j|$ is the total number of reviews classified under $j$, and $|\{r_i : w_{ij} \in r_i \wedge r_i \in R_j\}|$ is the number of reviews in $R_j$ that contain the word $w_{ij}$.
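A small sketch of how these scores could be computed is shown below (our own illustration, not the authors' code); `reviews_by_class` is an assumed mapping from each NFR class to its tokenized training reviews, and the helper names are ours.

```python
# Sketch of computing TF (Eq. 8) and TF-IDF (Eq. 9) scores per NFR class.
import math
from collections import Counter

def tf_scores(reviews_by_class):
    scores = {}
    for nfr_class, reviews in reviews_by_class.items():
        counts = Counter(w for review in reviews for w in review)
        n_unique = len(counts)  # N_j: number of unique words in the class's reviews
        scores[nfr_class] = {w: c / n_unique for w, c in counts.items()}
    return scores

def tfidf_scores(reviews_by_class):
    tf = tf_scores(reviews_by_class)
    scores = {}
    for nfr_class, reviews in reviews_by_class.items():
        n_reviews = len(reviews)                            # |R_j|
        df = Counter(w for review in reviews for w in set(review))  # reviews containing w
        scores[nfr_class] = {
            w: tf[nfr_class][w] * math.log(n_reviews / df[w])
            for w in tf[nfr_class]
        }
    return scores

def top_k(scores, k=10):
    """Top-k scoring candidate super indicator terms for each class."""
    return {c: sorted(s, key=s.get, reverse=True)[:k] for c, s in scores.items()}
```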
The top 10 TF and TF-IDF scoring words for each NFR category are shown in Tables 6
and 7, respectively. Note that some of these words did not appear in our list of manually
extracted keywords as they were considered too generic. For example, the word problem
appeared as one of the most frequent words in all categories. Therefore, it was not
included in our set of manually identified terms. To control for such very frequent words
that appear in multiple NFR categories, we do not include them in our classification. These
words are marked with an asterisk in Table 6. TF-IDF, on the other hand, seems to be
immune to this problem as it penalizes words that are too frequent in the text collection.
To test the impact of using the set of super indicator terms on performance, we re-ran our
review-length optimization analysis taking into consideration the set of terms identified by

Table 6 The top 10 terms in each NFR class as identified by TF

Dependability   Performance     Supportability   Usability
fix*            slow            update*          time*
time*           phone*          iOS*             fix*
update*         battery         iPhone*          feature
crash           time*           sync             screen
phone*          problem*        fix*             update*
problem*        lag             time*            option
log             iOS*            iPad             hear
issue           notification    support          version
iPhone*         speed           FitBit           easy
accurate        fix*            problem*         problem*

Terms that appear in multiple NFR categories (marked with *) are removed to improve the classification accuracy

Table 7 The top 10 indicator terms in each NFR class as identified by TF-IDF

Dependability   Performance   Supportability   Usability
crash           lag           compatible       heartbeat
value           drain         ipod             acrobat
username        slow          nice             hear
authenticate    storm         spout            interface
schedule        store         blah             word
inaccurate      i5            64bit            white
login           complain      require          pick
correct         locator       document         easier
certain         satisfy       support          upload
redownload      lighter       integrate        menu

both TF and TF-IDF. Specifically, if a review contains an indicator term from Table 6 or 7,
it gets assigned to the category of that indicator term regardless of its length. The procedure
then continues normally over multiple cut-off points (review length) as shown in Figs. 9 and
10. The best results obtained when using TF-IDF and TF super indicator terms are shown
in the third and fourth rows of Table 5, respectively. Using the list of TF terms, a reasonable
performance (F2 = 0.82) is achieved at the cut-off point of 12 words. At that point (shown
by the vertical line in Fig. 9), precision = 0.69, recall = 0.86, HS = 0.69, and HL = 0.12.

Fig. 9 The performance of the dictionary-based approach at different cut-off points (review length in words)
after considering the TF super indicator terms

Fig. 10 The performance of the dictionary-based approach at different cut-off points (review length in words)
after considering the TF-IDF super indicator terms

Using TF-IDF, the performance also seems to be balanced (F2 = 0.80) at the cut-off point
of 12 words. At that point (shown by the vertical line in Fig. 10), precision = 0.66, recall =
0.84, HS = 0.67, and HL= 0.13.
The relatively better performance of TF in comparison to TF-IDF can be attributed to
the fact that, in text classification tasks, selecting the most frequent words rather than
the most informative words is more likely to give better results (Apté et al. 1994;
Gokcay and Gokcay 1995). For instance, in our analysis, once the words shared across categories
are removed from the TF list, the remaining words (e.g., crash, support, sync, and
screen) are more representative of the NFR classes than the list of TF-IDF words, such as
heartbeat, value, store, and lighter.
In conclusion, our TF, review-length-optimized, term-matching classification process
can be described as shown in Fig. 11. To examine the generalizability of this approach,
we collected a second dataset of reviews from the Google Play (Android) app store. This
dataset included the most recent 50 reviews from the Google Play copy of each app used
in our first test set, excluding the apps iMovie and iTunes U since they do not have
Android versions. These data were collected during the first week of November 2018. The
same manual annotation procedure (using the same judges who classified our original data)
was used to manually classify the new Android data. The results of applying the approach
in Fig. 11 are shown in the fifth row of Table 5. In general, the approach maintained its
accuracy, achieving an average precision of 0.72 and an average recall of 0.82, thus increasing
the confidence in the generalizability of our results.

Fig. 11 A textual description of our final NFR classification procedure. This description is simplified for
explanatory reasons. Different, more efficient implementations are possible
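As a rough illustration of the procedure outlined in Fig. 11, the sketch below combines the Table 4 dictionary, the 12-word length threshold, and the super indicator terms. This is our own simplified reading, not the MARC implementation; the parameter names are ours, and super_terms is assumed to hold the per-class TF terms of Table 6 after the shared terms are removed.

```python
# Simplified sketch of the final dictionary-based, length-optimized classifier.
CUTOFF = 12  # review-length threshold (in words, after stop-word removal)

def classify_review(review: str,
                    indicator_terms: dict,   # Table 4 terms per NFR class
                    super_terms: dict,       # Table 6 TF terms per class (shared terms removed)
                    stop_words: set) -> set:
    words = [w for w in review.lower().split() if w not in stop_words]
    labels = set()
    for nfr_class, terms in indicator_terms.items():
        # A single super indicator term is sufficient, regardless of review length.
        if any(w in super_terms.get(nfr_class, ()) for w in words):
            labels.add(nfr_class)
            continue
        # Otherwise, short reviews need one indicator term, long reviews need two.
        hits = sum(1 for w in words if w in terms)
        required = 1 if len(words) <= CUTOFF else 2
        if hits >= required:
            labels.add(nfr_class)
    return labels

# Example usage (with the illustrative dictionaries defined earlier):
# classify_review("the app lags my entire iPhone", INDICATOR_TERMS,
#                 {"Performance": {"lag", "slow"}}, STOP_WORDS)
```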

5 Impact, Tool Support, and Threats to Validity

In this section, we discuss our results and their potential impact on the state of practice, the
implementation of our proposed solution, and the main threats to our study’s validity.

5.1 Discussion and Expected Impact

Our analysis has revealed that different categories of NFRs can be accurately captured by
looking for indicator terms that are distinctive to these categories (RQ3 ). Specifically, due
to the inherently complex nature of the problem (i.e., multi-label classification), a simpler
approach can provide more accurate results than more computationally expensive classifi-
cation techniques (Cleland-Huang et al. 2006). In fact, term matching was found to be an
effective approach for classifying other types of software artifacts, such as classifying open
source commit comments into different types of software maintenance requests (Hattori and
Lanza 2008).
It is important to point out that relying on single-term (unigram) matching can be an
oversimplification of the problem; more complex language models might be necessary to
achieve better accuracy. In the context of mining app reviews, language structures such
as recurrent sentence patterns (e.g., [someone] thinks [something] should
[verb]) (Panichella et al. 2015), regular expressions (e.g., (?i)(?<!EN Negation
)(EN Emphasis)(intuitive)) (Groen et al. 2017), and Part-of-Speech patterns
(Verb Determiner Noun) (Johann et al. 2017) have been used to detect feature
requests and quality issues in app reviews. However, the main problem with such language
structures, or patterns, stems from the complexity of the language they attempt to model.
Specifically, online human language evolves at a relatively faster pace than spoken lan-
guage (Eisenstein et al. 2014; Javarone and Armano 2013). Furthermore, such language
tends to be only loosely restricted, both syntactically (e.g., neologisms and abbreviations)
and grammatically. Therefore, language structures have to be continuously refined,

or tailored, to fit the evolving patterns in reviews. This in turn can lead to complex and
highly customized, or over-fitted, patterns that would not generalize well outside the scope
of training data. On the other hand, term matching, with some sort of optimization, can be
less prone to these problems; only individual terms need to be occasionally added, removed,
or reweighted to maintain accuracy, and thus the chances of over-fitting are lower. In fact, an
interesting direction of future work would be to combine both approaches, for example,
using an NFR dictionary along with generic language patterns to detect NFRs at a higher
level of accuracy.
In terms of impact, NFR information can serve as an input for the release engineer-
ing process (Regnell et al. 2007). Our expectation is that such information can complement
other types of information developers typically get from app store reviews (e.g., feature
requests and bug reports (Jha and Mahmoud 2018)) as well as other channels of user feed-
back (e.g., social media) (Nayebi et al. 2018; Williams and Mahmoud 2017b). Specifically,
due to their abstract nature, NFRs can provide more high-level and system-wide feedback
than specific feature requests or bug reports. Using such information, developers can adjust
their release engineering strategies to focus on features that enhance the desirable NFRs of
the system, or address the shortcomings of the poorly executed ones (Nayebi et al. 2016a,
2017a; Martin et al. 2016a; Williams and Mahmoud 2018). This can be particularly use-
ful for smaller businesses and startups trying to break into the app market (Nguyen Duc
and Abrahamsson 2016b; Giardino et al. 2014). Startups are different from traditional com-
panies in the sense that they have to immediately and accurately identify and implement
a product (Gralha et al. 2018), often known as the Minimum Viable Product (MVP), that
delivers actual customer value (Nguyen Duc and Abrahamsson 2016b; Paternoster et al.
2014). NFR information can serve as a core asset that helps startup companies, oper-
ating under significant time and market pressure and with little operating history, to get a
quick and comprehensive understanding of the most pressing NFR issues in their domain
of operation. Such knowledge can then be utilized to redirect developers’ effort toward
important user concerns in their app category.

5.2 Implementation and Tool Support

In terms of tool support, the proposed dictionary-based approach is implemented in our
working tool MARC (Mobile Application Review Classifier) (Jha and Mahmoud 2017b).
Release 3.0 of MARC8 enables users to download the most recent reviews of iOS apps,
classify and summarize these reviews, and identify the main NFRs raised in the reviews.
MARC also provides options for users to adjust their configuration settings. To help users
maintain the tool’s accuracy, the list of NFR indicator terms is provided in separate text
files under the folder InputData/TrainingDataset to enable users to add or remove
words from the dictionary.

5.3 More on Fβ Score

Recent research has suggested that other, and often larger, values of β can provide more
realistic lower bounds for evaluating natural language IR-based tools (Berry 2017). Such
values can be used to estimate how much of a tool’s imprecision can be tolerated to gain the
benefit of the tool’s recall in a situation in which it takes much more time to manually
8 https://github.com/seelprojects/MARC-3.0

find a true positive than it does to reject a tool’s false positive. Specifically, according to
Berry (2017), β can be approximated as the ratio of:
– The time for a human to manually find a true positive in the original documents and
– The time for a human to reject a tool-presented false positive
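For reference, the standard $F_\beta$ measure used throughout combines precision ($P$) and recall ($R$) while weighting recall $\beta$ times as heavily as precision:

$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$

so that $F_2$ favors recall moderately and $F_7$ favors it heavily.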
To estimate such values, we selected three random sets of 50 reviews (3 × 50) from the
apps FitBit, Pandora, and Zillow. Each set was then assigned to a judge to vet (i.e., identify
reviews raising valid NFRs). The judges were asked to roughly estimate the time it took to
correctly identify and classify a true positive (τfind). Each judge was then assigned a new set
of reviews to vet, this time using MARC. Specifically, MARC was used to automatically
classify the reviews and the judges were asked to vet the outcome, keeping track of the time
it took to reject a false positive (τvet). Our results (in Table 8) show that a more appropriate
β value for evaluating our tool hovers around 7. Using F7 changes the results in Table 5 as
shown in Table 9. On average, when recall is emphasized over precision, relying solely on
term matching, without any sort of optimization, results in the best performance.
The judges’ recall (Rj = 93%) in Table 8 can be thought of as the humanly achievable
high recall (HAHR) under ideal manual settings, and this HAHR is the recall that a tool
should beat if the tool will be used in a setting in which it is critical to have recall as close
to 100% as possible. The tool achieves only 78% recall, because some recall was sacrificed
in exchange for high precision. Perhaps that sacrifice was not a good tradeoff. The fact that
the time it took judges to discard a false positive (τvet) using the tool was on average 1/7 of
the time they needed to consider one candidate in the manual search suggests that even with
low precision, using a tool will be faster than doing the task manually.
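As a rough back-of-envelope illustration (ours, not the authors’): if the tool’s precision is $P$, a human vets on average $1/P$ tool-presented candidates for every true positive found, so the vetting cost per true positive is

$\frac{\tau_{vet}}{P} \approx \frac{\tau_{find}}{7P},$

which is smaller than the manual cost $\tau_{find}$ whenever $P > 1/7 \approx 0.14$, a bound comfortably cleared even by the unoptimized term-matching configuration.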
Furthermore, given our data, we can estimate β for our task (T) before any tool has
been built, to see if building a tool for the task is worth the effort. Specifically, let λ be the
fraction of potential true positives in the set of reviews; then $\beta_T = \frac{1}{\lambda}$. Our quantitative
analysis revealed that λ ≈ 0.4, thus our βT = 2.5. The ratio between βT and the tool’s β
(2.5 to 7) provides further evidence of the time savings the tool achieves over the manual
task.

5.4 Threats to Validity

The study presented in this paper has several limitations that could potentially threaten the
validity of the results. In what follows, we discuss these threats along with our mitigation
strategies in greater detail.

5.4.1 Internal Validity

Internal validity refers to confounding factors that might affect the causal relations estab-
lished throughout the experiment (Wohlin et al. 2012). A potential threat to the proposed

Table 8 Approximating a more appropriate β to evaluate our tool (MARC), where Rj is the recall of the
judge, Rt is the recall of the tool, and Pt is the precision of the tool

Judge   Review Set1   τfind   Rj    Review Set2   Rt    Pt    τvet   β

1       FitBit        0.83    87%   Zillow        72%   70%   0.12   7
2       Pandora       1       94%   FitBit        85%   68%   0.17   5.9
3       Zillow        0.78    97%   Pandora       76%   72%   0.1    7.8

Table 9 The dictionary-based approach’s performance as measured by F7

Classification settings   P      R      F2     F7

TM                        0.51   0.93   0.80   0.92
TM + (12)                 0.72   0.79   0.77   0.79
TM + TF-IDF + (12)        0.61   0.85   0.79   0.84
TM + TF + (12)            0.68   0.87   0.82   0.87

study’s internal validity is the fact that subjective human judgment was used to prepare our
ground-truth datasets. This includes the manual classification of the training and testing
datasets and the identification of indicator terms for each NFR category. Despite the sub-
jectivity concerns, it is not uncommon in text classification tasks to use human judgment
to prepare the ground-truth. Therefore, these threats are inevitable; however, they can be
partially mitigated by following a systematic classification procedure using multiple judges.
Furthermore, the manual classification process extended over a period of two months
(on average 100 reviews/day). This might raise some concerns about the quality of the ground-
truth. However, this relatively long period of time can be justified for two reasons.
First, our judges were not paid to carry out the task. Therefore, we could not enforce a spe-
cific number of hours or reviews per day. In fact, to ensure the integrity of the data, our
judges were advised to take their time to avoid any fatigue issues. Second, the task of man-
ually classifying/labeling NFRs is more complicated than the task of binary-classifying a
review into a bug report or a feature request. As mentioned earlier, unlike functional require-
ments or explicit bug reports, NFRs tend to be vague and often latent in the text. This
resulted in a relatively longer manual classification process.
Other internal validity issues might arise from the specific algorithms (SVM and NB),
text processing techniques (stemming), sentiment analysis method (VADER), and tools
(Scikit-learn) used in our analysis. For example, we used VADER to analyze the senti-
ment of the reviews. Other sentiment analysis techniques, such as SentiStrength and the
Stanford CoreNLP, have also been used in the literature (Panichella et al. 2015; Bakiu and
Guzman 2017; Williams and Mahmoud 2017b). However, the decision to use VADER over
SentiStrength can be justified based on the fact that VADER was specifically designed
to identify sentiments expressed in social media (microblogs) and short-text (Hutto and
Gilbert, E 2014). In our analysis, we assumed that, given their short-text attributes, VADER
was an appropriate technique for analyzing the sentiment of app store reviews. In fact, an
independent investigation of VADER versus other state-of-the-art sentiment analysis meth-
ods revealed that VADER was overall more effective than other methods across several
benchmarks of textual data (Ribeiro et al. 2015). In the context of app user review classifi-
cation, recent research has shown that the difference in performance among these methods
tends to be marginal (Lin et al. 2018; Jongeling et al. 2017; Williams and Mahmoud 2017a).
Therefore, our expectation is that any such effect would be minimal.
Other internal validity threats can stem from the fact that, in our analysis, we relied only
on the textual content of the reviews and their sentiment as classification features. In the
literature, other types of meta-data attributes, such as the star rating, author, app version, or
submission time of the review, have also been considered as classification features (Maalej
et al. 2016; Guzman and Maalej 2014). However, the results showed that such attributes had
an overall low precision for predicting the type of the review. Nonetheless, we acknowledge
that considering such attributes in the context of NFR classification might lead

to different results. For example, some specific versions of an app might have more NFR
issues raised in their reviews than others. In these versions, detecting NFRs can be easier.

5.4.2 External Validity

Threats to external validity impact the generalizability of the results obtained in the study
(Wohlin et al. 2012). In particular, the results of our experiment might not generalize beyond
the specific experimental settings used in this paper. One potential threat to our external
validity is the datasets used in our experiment. Our dataset is limited in size and is generated
from a limited number of apps. To mitigate this threat, we made sure that the reviews were
selected from a diverse set of free and paid apps, covering a broad range of app categories.
Furthermore, to examine the generalizability of our approach, we collected and manually
classified a new dataset of Android reviews and tested our optimized dictionary-matching
procedure on it. Our approach achieved comparable accuracy over this new dataset of
Android reviews.

5.4.3 Construct Validity

Construct validity is the degree to which the various performance measures accurately cap-
ture the concepts they intend to measure (Wohlin et al. 2012). In our experiment, there
were minimal threats to construct validity as the standard performance measures (Precision,
Recall, HS, SA, and HL), which are commonly used for evaluating multi-label classification
problems, were used to assess the performance of the different classification methods inves-
tigated in our analysis. We believe that these metrics sufficiently captured and quantified
the different aspects of performance we were interested in measuring.

6 Conclusions and Future Work

In this paper, we tackled the problem of detecting and classifying NFRs in mobile app user
reviews available on app stores. Our analysis was divided into two main phases. In the first
phase, we conducted a qualitative analysis over a dataset of 6,000 reviews sampled from a
broad range of iOS mobile apps. These reviews were manually classified by multiple human
judges into four main categories of NFRs: Dependability, Performance, Supportability, and
Usability. The results revealed that around 40% of user reviews in our dataset signified at
least one type of NFR (RQ1 ). The results also showed that users in different app store categories
tend to express different NFRs. Furthermore, users tend to be more critical (raise more NFR
concerns) in paid apps in comparison to free apps (RQ2 ).
In the second phase of our analysis, we evaluated the performance of different clas-
sification techniques in automatically capturing the different types of NFRs raised in user
reviews. The first technique, Binary Relevance (BR), decomposes a classification problem
with n classes (labels) into n binary problems. A separate binary classifier is then learned
for each problem. In addition to the textual content of the review, our classification features
included text pre-processing (stemming and stop-word removal) along with the sentiment
score of the review and the category (application domain) of the app. NB and SVM were
used as a basis for BR classification. The results showed that BR achieved a relatively low
accuracy. Classification features such as the sentiment score of the review and its category
name were found to have a limited impact on the accuracy of classification. Overall, the
best results of BR were achieved using SVM, when both stemming and stop-word removal

were applied and the app category name of reviews was considered as a classification
feature.
The second classification approach included using a dictionary of indicator terms to
detect different types of NFRs in the reviews. The terms in the dictionary were extracted
manually from the list of reviews classified under each NFR category. This approach was
applied to a test set of 1,100 reviews sampled from 12 iOS apps and 10 Android apps. The
results showed that relying on single indicator-term matching can result in low preci-
sion, especially in longer reviews. To enhance the classification accuracy, we approximated
an optimal review-length cut-off point for term matching. Specifically, for reviews longer
than 12 words, more than one indicator term was required to classify a review under a spe-
cific NFR category. Our analysis also revealed that there were specific terms that carried
more information value for their NFR categories. We referred to these terms as the set of super indi-
cator terms. The presence of only one of these terms was found to be sufficient to classify
reviews, even the longer ones, under the NFR category the term belongs to. Such terms can
be automatically identified using their relative frequency in the reviews.
Our results in this paper can be utilized to redirect app developers’ effort toward impor-
tant user concerns in their app category. Our future work in this domain will be focused on
integrating NFR information into the design and development process of mobile apps. Our
objective is to study the value of exploiting such information on the quality of apps and their
success and failure in the app market (e.g., ratings, downloads, and revenue). Our overar-
ching goal is to enable a more systematic and effective release engineering process that can
address user needs and sustain app growth and success.

Acknowledgements We would like to extend our gratitude to Dr. Daniel M. Berry from the University of
Waterloo for his contribution to this work. This work was supported in part by the Louisiana Board of Regents
Research Competitiveness Subprogram (LA BoR-RCS), contract number: LEQSF(2015-18)-RD-A-07 and
by the LSU Economic Development Assistantships (EDA) program.

References

Apté C, Damerau F, Weiss S (1994) Towards language independent automated learning of text categorization
models. In: Special interest group on information retrieval, pp 23–30
Bakiu E, Guzman E (2017) Which feature is unusable? Detecting usability and user experience issues from
user reviews. In: International requirements engineering conference workshops, pp 182–187
Bano M, Zowghi D, da Rimini F (2017) User satisfaction and system success: An empirical exploration of
user involvement in software development. Empir Softw Eng 22(5):2339–2372
Basole R, Karla J (2012) Value transformation in the mobile service ecosystem: A study of app store
emergence and growth. Serv Sci 4(1):24–41
Berry D (2017) Evaluation of tools for hairy requirements and software engineering tasks. In: International
requirements engineering conference workshops, pp 284–291
Bi W, Kwok J (2014) Multilabel classification with label correlations and missing labels. In: AAAI
conference on artificial intelligence, pp 1680–1686
Bird S, Klein E, Loper E (2009) Natural Language Processing with Python. O’Reilly Media
Blei D, Ng A, Jordan M (2003) Latent Dirichlet Allocation. J Mach Learn Res 3:993–1022
Brinker K, Fürnkranz J, Hüllermeier E (2006) A unified model for multilabel classification and ranking. In:
European conference on artificial intelligence, pp 489–493
Brusilovsky P, Kobsa A, Nejdl W (2007) The Adaptive Web: Methods and Strategies of Web Personalization.
Springer, Berlin, pp 335–336
Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc
2(2):121–167

Carreño L, Winbladh K (2013) Analysis of user comments: An approach for software requirements evolution.
In: International conference on software engineering, pp 582–591
Chen N, Lin J, Hoi S, Xiao X, Zhang B (2014) AR-Miner: Mining informative reviews for developers from
mobile app marketplace. In: International conference on software engineering, pp 767–778
Cheng W, Hüllermeier E (2009) A simple instance-based approach to multilabel classification using the
mallows model. In: International workshop on learning from multi-label data, pp 28–38
Chung L, do Prado Leite JCS (2009) On non-functional requirements in software engineering. Springer,
Berlin, pp 363–379
Ciurumelea A, Schaufelbühl A, Panichella S, Gall H (2017) Analyzing reviews and code of mobile apps for
better release planning. In: International conference on software analysis, evolution and reengineering,
pp 91–102
Cleland-Huang J, Settimi R, BenKhadra O, Berezhanskaya E, Christina S (2005) Goal-centric traceabil-
ity for managing non-functional requirements. In: International conference on software engineering,
pp 362–371
Cleland-Huang J, Settimi R, Zou X, Solc P (2006) The detection and classification of non-functional
requirements with application to early aspects. In: Requirements engineering, pp 39–48
Cleland-Huang J, Settimi R, Zou X, Solc P (2007) Automated classification of non-functional requirements.
Requir Eng 12(2):103–120
Coulton P, Bamford W (2011) Experimenting through mobile apps and app stores. Int J Mob Hum Comput
Interact 3(4):55–70
Dehlinger J, Dixon J (2011) Mobile application software engineering: Challenges and research directions.
In: Workshop on mobile software engineering, pp 29–32
Eisenstein J, O’Connor B, Smith N, Xing E (2014) Diffusion of lexical change in social media. PLoS ONE
9:1–13
Elisseeff A, Weston J (2001) A kernel method for multi-labelled classification. In: International conference
on neural information processing systems: natural and synthetic, pp 681–687
Ester M, Kriegel H-P, Sander J, Xu X et al (1996) A density-based algorithm for discovering clusters in large
spatial databases with noise. Knowl Discov Data Min 96(34):226–231
Finkelstein A, Harman M, Jia Y, Martin W, Sarro F, Zhang Y (2014) App store analysis: Mining app stores
for relationships between customer, business and technical characteristics. University College London,
Tech. Rep. RN/14/10
Forman G, Zahorjan J (1994) The challenges of mobile computing. Computer 27(4):38–47
Fu B, Lin J, Li L, Faloutsos C, Hong J, Sadeh N (2013) Why people hate your app: Making sense of user
feedback in a mobile app store. In: Knowledge discovery and data mining, pp 1276–1284
Ghamrawi N, McCallum A (2005) Collective multi-label classification. In: International conference on
information and knowledge management, pp 195–200
Giardino C, Wang X, Abrahamsson P (2014) Why early-stage software startups fail: A behavioral framework.
In: International conference of software business, pp 27–41
Glinz M (2007) On non-functional requirements. In: International requirements engineering conference,
pp 21–26
Godbole S, Sarawagi S (2004) Discriminative methods for multi-labeled classification. In: Advances in
knowledge discovery and data mining, pp 22–30
Gokcay D, Gokcay E (1995) Generating titles for paragraphs using statistically extracted keywords and
phrases. Syst Man Cybern 4:3174–3179
Gómez M, Adams B, Maalej W, Monperrus M, Rouvoy R (2017) App store 2.0: From crowdsourced
information to actionable feedback in mobile ecosystems. IEEE Softw 34(2):81–89
Gotel O, Cleland-Huang J, Hayes J, Zisman A, Egyed A, Grünbacher P, Dekhtyar A, Antoniol G, Maletic J
(2012) The grand challenge of traceability (v1. 0). In: Software and systems traceability, pp 343–409
Gralha W, Damian D, Wasserman A, Goulao M, Araújo J (2018) The evolution of requirements practices in
software startups. In: International conference on software engineering
Groen E, Kopczynska S, Hauer M, Krafft T, Doerr J (2017) Users - The hidden software product quality
experts? Requirements Engineering, pp 80–89
Gross D, Yu E (2001) From non-functional requirements to design through patterns. Requir Eng 6(1):18–36
Guzman E, Maalej W (2014) How do users like this feature? A fine grained sentiment analysis of app reviews.
In: Requirements engineering, pp 153–162
Harman M, Jia Y, Zhang Y (2012) App store mining and analysis: MSR for app stores. In: Mining software
repositories, pp 108–111
Harrison R, Flood D, Duce D (2013) Usability of mobile applications: Literature review and rationale for a
new usability model. J Interact Sci 1(1):1–16

Hattori L, Lanza M (2008) On the nature of commits. In: International conference on automated software
engineering, pp 63–71
He W, Tian X, Shen J (2015) Examining security risks of mobile banking applications through blog mining.
In: Modern artificial intelligence and cognitive science conference, pp 103–108
Hindle A, Wilson A, Rasmussen K, Barlow J, Charles J, Romansky S (2014) GreenMiner: A hardware
based mining software repositories software energy consumption framework. In: Working conference
on mining software repositories, pp 21–21
Hutto C, Gilbert E (2014) VADER: A parsimonious rule-based model for sentiment analysis of social media
text. In: International AAAI conference on weblogs and social media
Ihm S, Loh W, Park Y (2013) App analytic: A study on correlation analysis of app ranking data. In:
International conference on cloud and green computing, pp 561–563
Javarone M, Armano G (2013) Emergence of acronyms in a community of language users. Eur Phys J B
86(11):474
Jha N, Mahmoud A (2017a) Mining user requirements from application store reviews using frame semantics.
In: Requirements engineering: foundation for software quality, pp 273–287
Jha N, Mahmoud A (2017b) MARC: A Mobile application review classifier. In: Requirements engineering:
foundation for software quality, workshops, pp 1-15
Jha N, Mahmoud A (2018) Using frame semantics for classifying and summarizing application store reviews.
Empir Softw Eng 23(6):3734–3767
Joachims T (1998) Text categorization with Support Vector Machines: Learning with many relevant features,
pp 137–142
Johann T, Stanik C, Maalej W et al (2017) Safe: A simple approach for feature extraction from app
descriptions and app reviews. In: Requirements engineering, pp 21–30
Jongeling R, Sarkar P, Datta S, Serebrenik A (2017) On negative results when using sentiment analysis tools
for software engineering research. Empir Softw Eng 22(5):2543–2584
Kurtanović Z, Maalej W (2017) Mining user rationale from software reviews. In: Requirements engineering,
pp 61–70
Lee G, Raghu T (2011) Product portfolio and mobile apps success: Evidence from app store market. In:
Americas conference information systems, pp 3912–3921
Lewis D (1998) Naive (Bayes) at forty: The independence assumption in information retrieval. In: European
conference on machine learning, pp 4–15
Li J, Yan H, Liu Z, Chen X, Huang X, Wong D (2017) Location-sharing systems with enhanced privacy in
mobile online social networks. IEEE Syst J 11(2):439–448
Lin B, Zampetti F, Bavota G, Di Penta M, Lanza M, Oliveto R (2018) Sentiment analysis for software
engineering: How far can we go? In: International conference on software engineering, pp 94–104
Luaces O, Díez J, Barranquero J, Coz J, Bahamonde A (2012) Binary relevance efficacy for multilabel
classification. Prog Artif Intell 1(4):303–313
Maalej W, Nabil H (2015) Bug report, feature request, or simply praise? On automatically classifying app
reviews. In: Requirements engineering, pp 116–125
Maalej W, Kurtanović Z, Nabil H, Stanik C (2016) On the automatic classification of app reviews. Requir
Eng 21(3):311–331
Mahatanankoon R, Joseph Wen H, Lim B (2005) Consumer-based m-commerce: Exploring consumer
perception of mobile applications. Comput Stand Interfaces 27(4):347–357
Mahmoud A, Williams G (2016) Detecting, classifying, and tracing non-functional software requirements.
Requir Eng 21(3):357–381
Mairiza D, Zowghi D, Nurmuliani N (2010) An investigation into the notion of non-functional requirements.
In: Association for computing machinery symposium on applied computing, pp 311–317
Martin W, Harman M, Jia Y, Sarro F, Zhang Y (2015) The app sampling problem for app store mining. In:
Working conference on mining software repositories, pp 123–133
Martin W, Sarro F, Harman M (2016a) Causal impact analysis for app releases in google play. In:
International symposium on foundations of software engineering, pp 435–446
Martin W, Sarro F, Jia Y, Zhang Y, Harman M (2016b) A survey of app store analysis for software
engineering. IEEE Transactions on Software Engineering
Martin W, Sarro F, Jia Y, Zhang Y, Harman M (2017) A survey of app store analysis for software
engineering. IEEE Trans Softw Eng 43(9):817–847
McCallum A, Nigam K et al (1998) A comparison of event models for Naive Bayes text classification. In:
AAAI-98 workshop on learning for text categorization, vol 752, pp 41–48
Mcllroy S, Ali N, Khalid H, Hassan A (2016) Analyzing and automatically labelling the types of user issues
that are raised in mobile app reviews. Empir Softw Eng 21(3):1067–1106

Nayebi M, Adams B, Ruhe G (2016a) Release practices for mobile apps – what do users and developers
think?. In: International conference on software analysis, evolution, and reengineering, pp 552–562
Nguyen Duc A, Abrahamsson P (2016b) Minimum viable product or multiple facet product? The role of mvp
in software startups. In: Agile processes in software engineering and extreme programming, pp 118–130
Nayebi M, Farahi H, Ruhe G (2017a) Which version should be released to app store?. In: International
symposium on empirical software engineering and measurement, pp 324–333
Nayebi M, Ruhe G (2017b) Optimized functionality for super mobile apps. In: International requirements
engineering conference, pp 388–393
Nayebi M, Cho H, Ruhe G (2018) App store mining is not enough for app improvement. Empir Softw Eng
23(5):2764–2794
Nuseibeh B (2001) Weaving together requirements and architectures. Computer 34(3):115–119
Pagano D, Maalej W (2013) User feedback in the appstore: An empirical study. In: Requirements
engineering, pp 125–134
Panichella S, Sorbo A, Guzman E, Visaggio C, Canfora G, Gall H (2015) How can I improve my app? Clas-
sifying user reviews for software maintenance and evolution. In: International conference on software
maintenance and evolution, pp 281–290
Paternoster N, Giardino C, Unterkalmsteiner M, Gorschek T, Abrahamsson P (2014) Software development
in Startup companies: A systematic mapping study. Inf Softw Technol 56(10):1200–1218
Pedregosa F et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Petsas T, Papadogiannakis A, Polychronakis M, Markatos E, Karagiannis T (2013) Rise of the planet of
the apps: A systematic study of the mobile app ecosystem. In: Conference on internet measurement,
pp 277–290
Quinlan R (1986) Induction of Decision Trees. Mach Learn 1(1):81–106
Read J, Pfahringer B, Holmes G (2008) Multi-label classification using ensembles of pruned sets. In:
International conference on data mining, pp 995–1000
Regnell B, Höst M, Berntsson Svensson R (2007) A quality performance model for cost-benefit analysis
of non-functional requirements applied to the mobile handset domain. In: Requirements engineering:
foundation for software quality, pp 277–291
Ribeiro F, Araújo M, Gonçalves P, Benevenuto F, Gonçalves M (2015) SentiBench: A benchmark comparison
of state-of-the-practice sentiment analysis methods. arXiv:1512.01818
Shah F, Sabanin Y, Pfahl D (2016) Feature-based evaluation of competing apps. In: International workshop
on app market analytics, pp 15–21
Sorower M (2010) A literature survey on algorithms for multi-label learning, vol 18. Oregon State University,
Corvallis
Tsoumakas G, Dimou A, Spyromitros E, Mezaris V, Kompatsiaris I, Vlahavas I (2009) Correlation-based
pruning of stacked binary relevance models for multi-label learning. In: International workshop on
learning from multi-label data, pp 101–116
Villarroel L, Bavota G, Russo B, Oliveto R, Di Penta M (2016) Release planning of mobile apps based on
user reviews. In: International conference on software engineering, pp 14–24
Wasserman A (2010) Software engineering issues for mobile application development. In: The FSE/SDP
workshop on future of software engineering research, pp 397–400
Williams G, Mahmoud A (2017a) Analyzing, classifying, and interpreting emotions in software users’ tweets.
In: International workshop on emotion awareness in software engineering, pp 2–7
Williams G, Mahmoud A (2017b) Mining Twitter feeds for software user requirements. In: International
requirements engineering conference, pp 1–10
Williams G, Mahmoud A (2018) Modeling user concerns in the app store: A case study on the rise and fall
of Yik Yak. In: International requirements engineering conference, pp 64–75
Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level sentiment analysis.
In: Human language technology and empirical methods in natural language processing, pp 347–354
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A (2012) Experimentation in Software
Engineering. Springer, Berlin

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Nishant Jha is a software engineer and data scientist at WebCE in Dallas, Texas. He received his PhD in Computer Science and
Engineering in 2018 from Louisiana State University. His main
research interests include requirements engineering, app store analy-
tics, human computer interaction, and data mining.

Anas Mahmoud is an Assistant Professor of Computer Science and Engineering at Louisiana State University. He received the MS
and PhD in Computer Science and Engineering in 2009 and 2014
from Mississippi State University. His main research interests include
requirements engineering, app store analytics, static code analysis,
program comprehension, code navigation, natural language analysis
of software, program refactoring, and information foraging.
