
Automatic Sentence Annotation for More Useful Bug Report Summarization

Abstract—Bug reports have long been an important and useful software artifact in software development projects, with software developers referring to them for various information needs. As bug reports can become long due to a large number of comments from various DevOps tools and conversations between developers, users of bug reports may need to spend a lot of time reading them. To address this problem, previous studies developed bug report summarizers to extract comments from bug reports which were deemed to be important. Comment importance was determined based on human-created gold-standard summaries. We believe creating such gold-standard summaries for evaluating bug report summarization approaches is not a good practice for a number of reasons. Consequently, we developed an automatic sentence annotation method to identify content in bug report comments which allows bug report users to customize a view for their specific task-dependent information needs.

Index Terms—bug report summarization, text tagging, text annotation, natural language processing

I. INTRODUCTION

Software users and software developers file bug reports1 when they find unexpected behaviours or when they need additional functionalities after a software product is released. Those reports are stored in a bug tracking system such as Bugzilla2 or Jira3. Software project members, such as triagers, developers, and project managers, refer to these reports for completing their work. For example, a bug report triager reads a bug report to decide the most suitable developer to fix a bug, or compares two bug reports to determine whether they are duplicates. A software developer reads through several bug reports to understand the project history before fixing a new bug. A project manager examines a collection of bug reports to make decisions about the next product release. In all of these examples, a bug report user has to read the bug report description (i.e. the first comment) and a few or all of the following comments. While some bug reports may be small and concise, others may be large due to any combination of long comment threads, long enumerated lists inside of comments, or large stack traces or code snippets.

1 We use the colloquial term "bug report" to refer to all change requests, feature requests, or tasks created in a software project's issue tracking system.
2 https://www.bugzilla.org verified 03/05/2020
3 https://www.atlassian.com/software/jira verified 03/05/2020

A bug report can be viewed as a communication thread. Once a reporter submits a bug description, others, including other developers and the reporter themself, add comments. The description contains important information, such as steps to reproduce the bug, the expected software behaviour and the actual behaviour of the software product. Comments contain questions, clarifications, possible solutions, and attachments such as patches or screenshots. As a bug tracking system records most of this information as free-form text, we observed different representations of the information in bug reports. For example, the steps to reproduce a bug may be presented as an enumerated list in one bug report and as a paragraph in another report. Even structured data such as code snippets, stack traces and error messages can appear as or alongside free-form text in bug reports.

As bug reports are often long and contain a lot of textual information, researchers and practitioners are interested in creating bug report summaries [1]. The purpose of creating summaries is to reduce the time spent on reading a bug report for a particular software engineering task. For example, Rastkar, Murphy, and Murray did a user study involving bug report summaries for detecting duplicate bug reports [2], [3]. The findings of Lotufo, Malik, and Czarnecki's user study indicated that software developers found summaries useful for a variety of tasks, such as looking for a solution or a work-around for a bug, finding the status of an open bug, and triage activities including searching for duplicate bugs, determining bug priority, and closing bug reports [4].

Although several bug report summarization approaches have been proposed [2]–[8], all of these studies focus on creating a fixed-size summary. However, the restrictions imposed by a fixed-size summary approach can reduce the quality of the summary by missing information that is important or useful to a bug report user [2]–[4]. Therefore, the summary of a bug report should be flexible such that the user can include or exclude information according to their information needs [1].

Previous bug report summarization efforts created condensed bug report summaries that did not provide for user interaction [2], [3], [5]–[7]. Although Lotufo, Malik, and Czarnecki provided two summary views, a condensed summary and an interlaced summary that displays the entire bug report with the extracted summary sentences highlighted [4], their approach also did not allow any user interaction with the summary. After evaluating the results and comments of the user studies conducted by Rastkar, Murphy, and Murray and by Lotufo, Malik, and Czarnecki, we concluded that a fixed summary that misses information sought by some bug report users is a significant flaw in these bug report summarization approaches. Therefore, we introduce an approach to automatically annotate the contents of a bug report such that an interface can allow bug report users to find and interact with the bug report to meet their task-specific information needs.
The contributions of this paper are as follows:
• We present a bug report schema describing its content and structural dependencies.
• We present a sentence labelling schema to identify intentions of comments on bug reports.
• We present a set of labelling modules to assign labels to sentences in bug report comments according to the labelling schema.
• We present a tool that allows bug report users to create a customized summary for their specific information needs according to the labels assigned to the sentences.

The rest of the paper is organized as follows. First, a summary of previous bug report summarization research is provided. Next, we give detailed arguments for why these previous approaches may not be the best answer for solving the bug report summarization problem. We then present our approach for automatically annotating the sentences in bug report comments and provide an example of how these annotations can be used to customize the view of a bug report to meet a specific bug report user's information needs. Finally, we present an evaluation of our automatic bug report sentence annotation process.

II. RELATED WORK

This section provides an overview of previous works regarding bug report summarization, visualisation of bug report summaries, and tagging of the contents in bug report comments.

A. Bug Report Summarization

Within the automatic bug report summarization literature, there are three primary studies: Rastkar, Murphy, and Murray [2], [3], Mani, Catherine, Sinha, and Dubey [5], and Lotufo, Malik, and Czarnecki [4]. In short, it was found that the two unsupervised learning models of Mani, Catherine, Sinha, and Dubey and of Lotufo, Malik, and Czarnecki performed better than the supervised learning model proposed by Rastkar, Murphy, and Murray.

The initial study of automatic bug report summarization was conducted by Rastkar, Murphy, and Murray [2], [3]. This study provided a significant contribution in establishing a corpus for automatic bug report summarization with 36 bug reports from four open source software projects. Three human annotators were hired to create both abstractive4 and extractive5 summaries for each bug report. Rastkar, Murphy, and Murray applied a supervised learning model to create extractive summaries by classifying a bug report sentence as being in the summary or not. Their results indicated that, even though the average precision was 57% for leave-one-out cross-validation, the model performed poorly when trying to summarize a bug report outside of the training corpus. This was found to be because their supervised learning model was highly sensitive to the training data.

4 An "abstractive summary" is created based on the semantics of the text.
5 An "extractive summary" is composed of selected sentences from the text.

The next study, by Mani, Catherine, Sinha, and Dubey [5], took an unsupervised learning approach to creating automatic bug report summaries. This approach was taken as a result of the observed shortcoming of Rastkar, Murphy, and Murray's supervised learning approach. Similar to Rastkar, Murphy, and Murray, Mani, Catherine, Sinha, and Dubey created their own bug report corpus with 19 bug reports and hired two annotators to create both abstractive and extractive summaries. Also, they used the bug report corpus created by Rastkar, Murphy, and Murray to evaluate the quality of the generated extractive summaries. Finally, their study introduced an automatic "noise" identifier intended to differentiate between greeting sentences, code snippets and stack traces. The authors found that the extractive summaries improved when the noise was removed.

The final primary study of automatic bug report summarization was conducted by Lotufo, Malik, and Czarnecki [4]. Like Mani, Catherine, Sinha, and Dubey, the authors focused on creating an unsupervised bug report summarizer. In their approach, Lotufo, Malik, and Czarnecki considered the similarity between a sentence and the bug report title, the similarity between two sentences, and the use of a heuristic to measure the agreement of two sentences. These characteristics were used to compare the quality of the generated extractive summaries and led to a more generalized approach for evaluating bug report summaries.

As part of creating their corpus, the annotators of the Rastkar, Murphy, and Murray corpus [2], [3] labelled the intent of a sentence as a 'problem', 'suggestion', 'fix', 'agreement', or 'disagreement'. The label 'meta' was used for sentences such as code or URLs. Huai, Li, Wu, and Wang [7] improved upon Rastkar, Murphy, and Murray's summarizer by incorporating sentence intentions into a logistic regression classification model, using the calculated probabilities of the different intentions in Rastkar, Murphy, and Murray's corpus to modify the returned probabilities.

Jiang, Li, Ren, Xuan, and Jin [9] argued that feature engineering is an important part of supervised learning models. The authors proposed to select features for a bug report summarizer using a crowd-sourced platform. The platform broadcasts the task (i.e. create a bug report summary) and asks volunteers to submit their responses. A response contains summary sentences and reasons for choosing the sentences. After analyzing the sentences and the reasons, the authors created eleven features for bug report summarization. Three of the eleven features were used in Rastkar, Murphy, and Murray's bug report summarizer [2], [3] and one of the features (bug description sentence) was used in Lotufo, Malik, and Czarnecki's unsupervised bug report summarizer [4]. The evaluation indicated that the logistic regression with crowd-sourced features out-performed the three bug report summarizers created by Rastkar, Murphy, and Murray [2], [3], Mani, Catherine, Sinha, and Dubey [5], and Lotufo, Malik, and Czarnecki [4].
One of the most recent works on bug report summarization proposed to apply deep neural networks for feature extraction. Liu, Yu, Li, Guo, Wang, and Mao [10] integrated an auto-encoder network for feature extraction and applied it to Rastkar, Murphy, and Murray's corpus. A novel metric called 'believability' measured the degree to which a sentence is approved or disapproved within the conversation, and the sentence vectors were modified according to the corresponding believability score. The performance evaluation showed that the model out-performed the bug report summarizers created by Rastkar, Murphy, and Murray and by Mani, Catherine, Sinha, and Dubey. However, the model's precision and Pyramid precision were lower than those of the unsupervised bug report summarizer created by Lotufo, Malik, and Czarnecki.

B. Visualization of Bug Report Summaries

Lotufo, Malik, and Czarnecki addressed the problem of bug report summary visualization by creating two different views: one that only displays the summary sentences and another that shows the entire bug report and highlights the summary sentences. Their user study found that 56% preferred the condensed summary view whereas 46% preferred the interlaced view. Participants indicated that the interlaced view was preferred as they did not trust the automatic summarizer and the sentence highlighting allowed such users to read through the comments quickly. However, the interlaced view was found to not be useful for very long bug reports. Yeasmin, Roy, and Schneider [11] extended this work by using different colours for the summary sentences so that a user can focus his/her attention on the specific colour-coded sentences. We use a similar multi-coloured approach in our task-relevant bug report summarizer.

C. Tagging Bug Report Comments

As part of their work on providing an efficient interface for bug report triaging, Bortis and van der Hoek commented that information in bug report summaries should be interpretable at-a-glance but still have the descriptive information readily available [1]. In their work, tags were used to mark bug reports with common characteristics. Similarly, in our work, we developed tags to mark sentences with common attributes such as code, URLs, and quotes from other comments.

Mani, Catherine, Sinha, and Dubey [5] found that identifying and removing noise when training the classifier improved the performance of the classification model. They identified syntax, stack traces, and error messages (labelled as code) as well as greeting sentences (labelled as other) as noise. Similar to Mani, Catherine, Sinha, and Dubey's approach, we identify but do not emphasize the sentences that have noise tags (e.g. URL, code, and off-topic). We provide an interface that presents information to understand the bug report with the flexibility of viewing other information if desired. Therefore, unlike Mani, Catherine, Sinha, and Dubey's approach, which completely removed noise sentences for the user, we allow selective access to any information in the bug report.

As previously mentioned, Rastkar, Murphy, and Murray's corpus contains tags that describe the intent of a sentence. In our work, we chose to refrain from using these tags because of the inherent problems with human-annotated information. Instead, we propose a broad set of tags that can be extracted based on keywords (or key phrases) and regular patterns such as URLs or enumerated lists.

III. PRIOR APPROACHES AIM AT THE WRONG TARGET

Although the previous works provided significant contributions to the summarization of bug reports, we believe that they were aiming for the wrong target. In all of the prior works, the target was to create an extractive summary that was as close as possible to a "gold standard" summary (GSS) created by human annotators. However, based on observations in prior work, specifically those by Rastkar, Murphy, and Murray [2], [3] and our own work [removed for blind review], we have concluded that such gold-standard summaries are neither attainable in reality nor wanted in practice.

In the previous studies, the metrics used for evaluating the summarization systems were those that are traditionally used for a recommender system: precision, recall, and F-score. Pyramid precision [12], a variation of the traditional precision metric accounting for variation in the annotators, was also used. The annotator-created extractive summaries were used as the gold standard when computing these metrics.

However, we found two issues with the gold-standard summaries used for evaluation in these works. The first issue is related to the word count restrictions placed on the automatic summarization techniques, which were not applied (or at least not enforced) on the gold-standard summaries. This resulted in situations where the constraints placed on the automatic approach made it improbable, if not impossible, for the automatic approach to achieve a high score in the evaluation.

The second issue regards the agreement, or rather the disagreement, between the annotators. Both Rastkar, Murphy, and Murray and Mani, Catherine, Sinha, and Dubey reported a moderate level of agreement (kappa 0.4) between the human annotators. This means that the annotators themselves cannot agree on what is a gold-standard summary. Although pyramid precision attempts to address this issue, it has its own flaws. Specifically, this metric depends on the number of annotators (i.e. the number of tiers in the pyramid) and the reliability of the annotators. Nenkova, Passonneau, and McKeown found that at least five annotators are needed to have a stable pyramid score for a summary [12], and the number of annotators was below this threshold for all previous work.

Moreover, having five or more bug report summary annotators is not realistic for software engineering projects. Assume that a summarization approach focuses on producing bug report summaries for project managers. We would need five or more project managers to reasonably establish the gold-standard extractive summaries. Considering the resource limitations of most software projects, it is unlikely that a project would have five project managers for creating an extractive summary corpus with the needed agreement level. And project managers are only one type of bug report user.

This leads to a final problem with the goal of previous approaches: they did not consider the different information needs that project members have for the summaries. In fact, it is this difference in information needs between project members that likely leads to annotator disagreement. For example, in Rastkar, Murphy, and Murray's user study, they found that 58% of users would like to have reproduction steps in the summary when determining whether a bug report is a duplicate or not. Similarly, bug reporting guidelines for projects (e.g. [13]–[15]) and studies on what are the important items for a bug report [16] show similar disagreements about what constitutes a "good bug report". Such disagreements about bug report contents inevitably carry over into annotations for bug report summaries. In other words, bug report summaries need to be tailored to the information needs of the user.

Due to all of these points, we believe that creating a "one-size-fits-all" gold standard corpus is an unrealistic goal.
IV. AUTOMATED ANNOTATION FOR BUG REPORT SUMMARIZATION

This section presents our automated approach for annotating sentences in a bug report to support task-specific views for summarizing bug reports.

A. Bug Report Schema for Summarization

Bug reports contain both structured and unstructured data [16]. Most of the comments appear as free-form text, a form of unstructured data. Structured data in bug reports include code snippets, stack traces, error reports, and attachments [17]. After careful examination of the bug reports in Rastkar, Murphy, and Murray's corpus, which we considered to be a representative sample, we developed a schema of a bug report for summarization (Figure 1). According to this schema, a bug report is composed of sentences. The content of some of these sentences is of interest to a bug report user, and we can use features and keywords to identify the interesting sentences. Note that the set of "interesting sentences" need not be the same for every bug report user. Some of the sentences have dependencies on other sentences in the bug report. We can use these dependencies between sentences to apply techniques such as clue words [18] and topic modelling.

Fig. 1. Bug report summarization schema.

The sentences in a bug report have various intentions. We further refine the Content node of the bug report summarization schema with the four most common sentence intentions that we observed in bug reports (Figure 2).

Fig. 2. Bug report sentence intention schema.

1) Description: The first group of sentences (i.e. comment) is the Description. The description provides details about the bug. Some bug descriptions are written as a paragraph (i.e. unstructured) and others are well-structured. In all cases, the three intentions of Steps to Reproduce, Actual Outcome, and Expected Outcome are found here. In a well-structured bug report description, these intentions are easy to identify. We found that recently reported bug reports were more likely to contain this structure. However, when examining the bug reports in Rastkar, Murphy, and Murray's corpus, where some bug reports were reported as early as the year 2000, automatically identifying these intentions was more challenging as the descriptions were mostly of the unstructured form. We created a label Des to annotate sentences in the first comment.

Once a bug is reported, other bug report users (e.g. triagers, developers, project managers) post comments. Upon careful examination of the comments subsequent to the first comment (i.e. the bug description), we found the following three intentions.

2) Clarification: If a bug description or any other comment is not clear enough, a bug report user requests further information, with a response appearing in a later comment. We identify such groups of sentences as having an intention of Clarification. As developers typically quote the original sentence(s) from the previous comment(s) when they request further information or when they respond to an information request, this creates a communication structure similar to that of email threads. We created the labels Org and Qt to annotate original and quoted sentences respectively. The label CW was created to annotate sentences that are related to the quoted sentences.

3) Resolution: Sometimes when a bug reporter identifies a problem, they also provide a solution. Alternatively, a solution may be provided by another bug report user, typically the developer who is responsible for resolving the report. We refer to this intention as Resolution. To annotate sentences related to bug resolution, we introduced a label Res.

4) Plan: A bug report user may introduce an idea to solve an issue but not provide the actual solution as with a Resolution comment. In this case, the comment is considered to have a Plan intention.
a) Other labels: We mentioned five labels that are directly related to sentence intentions. Apart from those labels, we created three additional labels to annotate sentences. The labels OT, URL, and Code represent off-topic sentences, URLs, and code snippets respectively. Such sentences appear within different intentions in bug reports. Therefore, those three labels are not directly related to any of the intentions introduced above.

We created one more label, Topic, to annotate sentences that participate in the ongoing conversation of the bug report. This label is also not directly related to an intention.

B. Sentence Labels

We propose ten labels to identify the content found in bug report sentences (Table I). The first seven labels (Des, CW, Org, Qt, Topic, Res and Plan) are assigned to sentences that are considered to contain useful information for bug report users. The last three labels (OT, URL, and Code) are assigned to sentences considered to be unimportant or noise, which can then be filtered out for the user.

TABLE I
DESCRIPTION OF LABELS

Label  Description
Des    Bug report description, including steps to reproduce, expected outcome and actual outcome
CW     Clue words
Org    Original sentence that is quoted in a later comment
Qt     Quoted sentence from a previous comment
Topic  Topic word is included
Res    Resolution statements (contain fix, patch, attachment)
Plan   Proposals for changes to design or implementation
OT     Off-topic sentence
URL    Sentence contains a URL
Code   Code, stack traces, error messages, etc.

1) Description [Des]: The bug report's description contains essential information for bug report users. For example, steps to reproduce, observed results, and expected results are all useful to triagers when determining duplicate bug reports or to developers when creating a fix.

2) Clue Word [CW], Original [Org], and Quoted [Qt]: Developers often quote sentences from other developers when they need further information about that comment or when they need to respond to that comment. For example, if a developer posts a question, another developer will post the answer by first quoting the question and then replying to it. Or if a developer posts a way to solve a bug, another developer will post a comment accepting the solution or requesting more information by quoting the solution. Since these quotes and responses create a chain-like communication between developers, we use the labels Org for an original sentence (i.e. start node) and Qt for when that sentence is quoted (i.e. end node). The label CW is attached to a sentence if both the response and the quote have words in common. CW denotes the presence of "clue words" as proposed by Carenini, Ng, and Zhou [18] for email summarization.

3) Topic [Topic]: We observed that developers sometimes respond to other comments without quoting the previous comment(s). As these communication threads are not explicit, we cannot directly identify such communication patterns in the bug report. This can result in missing important information and such sentences not receiving a label. The Topic label addresses this problem by identifying sentences that participate in the dominant topic of the bug report.

4) Resolution [Res]: Once a bug report is created, it goes through different states, as indicated by the report's status, with the actual states differing between bug tracking systems [19]. Within these different states, developers try to reproduce the bug, provide screenshots or mock-ups of user interfaces, create partial solutions, or create patches that may be accepted as the solution by another developer. Such attachments, partial patches and final solutions are of interest to bug report users [16].

5) Plan [Plan]: It is common for a discussion in a bug report to revolve around the planning of how to fix a bug or implement a feature. Such sentences are given the Plan label.

6) Off-Topic [OT]: During our inspection of bug reports, we found comments which did not contribute to the ongoing conversation. Those comments were usually greetings or appreciations towards another developer. The label OT is assigned to such a sentence to reflect that it is an off-topic comment in the bug report.

7) URL [URL]: URLs are often added to bug reports either manually by bug report users or automatically by DevOps tools. For example, we observed links to other bug reports and to version control commits. Links to other bug reports allow developers to investigate the content of these other bug reports. This may help a developer in solving a current bug, or a triager to create a "super bug" that gathers together a set of reports that have similar behaviour. Links to version control systems, such as Git and Subversion, allow developers to inspect the changes made by a patch. Although URLs may be considered as noise for summarization, a bug report user could benefit from having these links accessible when a deep investigation of all of the information is needed. Therefore, we consider being able to identify URLs automatically and allowing users to view them as important.

8) Code [Code]: Bug report comments can contain code snippets, error messages, software configuration parameters, stack traces and software version names and numbers. Although this information may be relevant to a bug report, it may also not be essential for understanding the bug report. Mani, Catherine, Sinha, and Dubey identified such information as noise in their study because the summarization models performed poorly when noise was present. We assign the label Code to such sentences so that the user can decide whether to show or hide such information in the summary view.

C. Automatic Labelling of Bug Report Sentences

For a given bug report, each sentence is extracted from a comment and passed to various modules for labelling. Figure 3 provides an overview of the process. The process proceeds in the following manner.

First, a sentence is converted to lower case, stop words and irrelevant symbols such as emojis are removed, and the words are lemmatized. Digits and punctuation symbols are not removed as these are necessary to identify URLs and code snippets.
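The following is a minimal sketch of this preprocessing step, assuming Python with NLTK's stop word list and WordNet lemmatizer; the specific libraries used in our implementation are not named here, so the helper below should be read as illustrative rather than as the exact code.

    import re
    from nltk.corpus import stopwords          # requires nltk.download('stopwords')
    from nltk.stem import WordNetLemmatizer    # requires nltk.download('wordnet')

    STOP_WORDS = set(stopwords.words('english'))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(sentence):
        """Lower-case, drop stop words and non-ASCII symbols (e.g. emojis),
        and lemmatize, while keeping digits and punctuation intact."""
        text = sentence.lower()
        text = re.sub(r'[^\x00-\x7f]', ' ', text)   # strip emojis and other symbols
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        return [LEMMATIZER.lemmatize(t) for t in tokens]

    # Example: preprocess("The bugs were reported at 10:30")
    #   -> ['bug', 'reported', '10:30']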
Next, the text is passed to the various labelling modules. The following is a detailed description of each of the labelling modules. Note that the order of calling the URL Detector, Code Detector, and Off-Topic Classifier is important because the Code Detector and Off-Topic Classifier depend on the output of the previously called module. However, the rest of the modules are independent of each other and may be called in any order.

1) Description Labeller: The label Des is applied to a sentence in the first bug report comment if it is determined to contain either steps to reproduce the bug, observed behaviour or expected behaviour. When reporting an issue in issue tracking systems such as Bugzilla, it is not required to fill in all or any textual details for steps to reproduce, observed behaviour, and expected behaviour. When those details are present explicitly, we used regular expressions to capture the explicitly stated steps to reproduce, observed behaviour and expected behaviour. The regular expressions used to capture those statements are: step reproduce, actual results, expect results.

If these regular expressions fail, another regular expression is used for identifying any enumerated lists, with the assumption that such a list is describing the steps for reproduction. We used the regular expression ^\)?\d+(\.|\)) to capture enumerated lists.

Finally, the sentences in the bug report description are compared with the bug report title. If there is a common word in both the sentence and the title, we apply the Des label to the sentence.
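The three strategies can be combined as in the sketch below; the key phrases and the enumerated-list pattern are the ones listed above, while the exact way they are applied to the lemmatized text is simplified for illustration.

    import re

    # Key phrases from the module (matched against the preprocessed text).
    STRUCTURE_PATTERNS = [r'step reproduce', r'actual results', r'expect results']
    # Enumerated-list pattern, e.g. "1." or "1)" at the start of a sentence.
    ENUM_PATTERN = re.compile(r'^\)?\d+(\.|\))')

    def label_description(sentence, title_words):
        """Return True when a sentence from the first comment should get Des."""
        text = sentence.lower()
        if any(re.search(p, text) for p in STRUCTURE_PATTERNS):
            return True                      # explicitly structured description
        if ENUM_PATTERN.match(text):
            return True                      # enumerated steps to reproduce
        return bool(set(text.split()) & set(title_words))   # word overlap with the title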
2) URL Detector: All comments are passed into this module to find sentences containing URLs. We use regular expressions to identify URLs in comments and assign them the label URL.
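The exact expression is not reproduced in this paper; a sketch with a common, simplified URL pattern is:

    import re

    URL_PATTERN = re.compile(r'(https?|ftp)://\S+', re.IGNORECASE)  # illustrative only

    def label_url(sentence):
        """Assign the URL label when a sentence contains a web link."""
        return URL_PATTERN.search(sentence) is not None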
3) Code Detector: By analysing Rastkar, Murphy, and Murray's corpus and identifying common patterns of noise present in bug reports, we created regular expressions to capture those patterns. The Code Detector module identifies programming language syntax, error messages, stack traces, and configuration parameters. The identified sentences are labelled as Code. This module ignores the sentences identified by the URL Detector.
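The patterns below are illustrative stand-ins for the corpus-derived expressions, which are more extensive; they show the kinds of constructs the module looks for.

    import re

    CODE_PATTERNS = [
        re.compile(r'\w+(\.\w+)+\([^)]*\)'),        # method calls, e.g. Platform.getResourceString(...)
        re.compile(r'^\s*at\s+[\w.$]+\(.*\)\s*$'),  # Java stack trace frames
        re.compile(r'\b(Exception|Error)\b.*:'),    # error messages
        re.compile(r'^\s*[\w.]+\s*=\s*\S+\s*$'),    # configuration parameters (key = value)
    ]

    def label_code(sentence, already_url):
        """Assign Code unless the URL Detector already claimed the sentence."""
        if already_url:
            return False
        return any(p.search(sentence) for p in CODE_PATTERNS)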
4) Off-topic Classifier: To determine the off-topic comments, we follow Chowdhury and Hindle's approach for filtering off-topic comments from Python IRC channels [20]. We extracted comments from YouTube6 and Stack Overflow7 as positive (i.e. off-topic) and negative (i.e. on-topic) examples, respectively, to train a classification model. The YouTube dataset consists of comments from YouTube channels matching the keywords cooking, news and sports, as these were considered likely to contain a good representation of common off-topic comments with respect to software development. For the Stack Overflow dataset, we did not filter questions by programming language or technology type because the bug report corpus contains data from four software projects. We created a diverse dataset by extracting ~3000 lines of comments from each source.

6 https://developers.google.com/youtube/v3/getting-started
7 https://api.stackexchange.com/docs

During preprocessing, we converted all text to lowercase and removed all punctuation symbols, extra white spaces, and stop words. The comments from Stack Overflow were manually filtered because we found some greeting comments; we excluded those comments when training the models. Then we lemmatized the words to remove the effect of variations of the same word. We trained a Support Vector Machine (SVM) classifier8 to categorize the bug report comments into off-topic or on-topic comments.

8 A Naïve Bayes classifier was also trained, but we found it produced too many false positives.

This module ignores sentences previously labelled by the URL Detector and Code Detector, and assigns the OT label to sentences classified as "off-topic".
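A sketch of the training step is shown below; scikit-learn's TfidfVectorizer and LinearSVC are assumptions made for illustration, as the toolkit and feature representation used in our implementation are not specified here.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_off_topic_classifier(youtube_comments, stackoverflow_comments):
        """Train an SVM with YouTube comments as off-topic (1) and
        Stack Overflow comments as on-topic (0) examples."""
        texts = list(youtube_comments) + list(stackoverflow_comments)
        labels = [1] * len(youtube_comments) + [0] * len(stackoverflow_comments)
        model = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words='english'),
            LinearSVC())
        model.fit(texts, labels)
        return model

    # During labelling, a sentence not already tagged URL or Code receives OT
    # when model.predict([sentence])[0] == 1.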
5) Clue-word Detector: The entire bug report is passed into this module to determine the clue-word relationships between sentences. The identified sentences are considered when determining the Clarification intention. The Clue-word Detector module uses the approach proposed by Carenini, Ng, and Zhou [18] for email summarization and used by Rastkar, Murphy, and Murray in their bug report summarization approach. This module assigns the labels CW, Org and Qt.
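A minimal sketch of the clue-word relationship is shown below; how quoted sentences are recognised in the raw comment text (for example by ">" prefixes or "In reply to" headers) is an implementation detail not spelled out here, so only the word-overlap step is illustrated.

    def tokens(sentence):
        # Stand-in for the full preprocessing step (lower-casing, lemmatization).
        return set(sentence.lower().split())

    def clue_words(quoted, reply):
        """Following Carenini, Ng, and Zhou, clue words are the words a reply
        shares with the sentence it quotes."""
        return tokens(quoted) & tokens(reply)

    # Labelling sketch: the sentence in the earlier comment gets Org, its quoted
    # copy gets Qt, and a reply sentence gets CW when clue_words() is non-empty.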
6) Topic Modeller: To determine the dominant topic of a bug report, we apply a topic modelling approach from the field of Natural Language Processing. Topic modelling is a technique used to identify topics in a pool of documents from different fields [21]. However, in our case, we are not interested in identifying the different topics in a bug report. Instead, we utilize the technique to capture the words of the dominant topic in the bug report by extracting only one topic from the topic model. Then we compare the sentences against the set of topic words to find sentences that participate in the dominant conversation of the bug report. We assign the label Topic to these sentences.

In this study, we used Latent Dirichlet Allocation (LDA) [21] to extract topic words. We created a corpus for each bug report using a bag-of-words model and extracted ten topic words. Since we were only interested in the most dominant conversation, the number of topics was set to one.
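A sketch of this step using gensim's LDA implementation is shown below; gensim is an assumption made for illustration, and the tokenization is simplified.

    from gensim import corpora, models

    def dominant_topic_words(sentences, topn=10):
        """Fit a one-topic LDA model over a bug report's sentences (treated as a
        bag-of-words corpus) and return the top words of that single topic."""
        tokenized = [s.lower().split() for s in sentences]   # simplified tokenization
        dictionary = corpora.Dictionary(tokenized)
        bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]
        lda = models.LdaModel(bow_corpus, num_topics=1, id2word=dictionary)
        return [word for word, _ in lda.show_topic(0, topn=topn)]

    def label_topic(sentence, topic_words):
        """Assign Topic when the sentence mentions any dominant-topic word."""
        return bool(set(sentence.lower().split()) & set(topic_words))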
7) Resolution Detector: All comments, except the first one, are passed into this module to find sentences with the Resolution intention. We use regular expressions to identify attachments and comments that reflect the final solution or an intermediate step towards the final solution. The label Res is assigned to a sentence if that sentence has keywords such as push, fix, patch, attach, and commit. Similar to clue word matching, we do root word matching instead of exact word matching.
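A sketch of the keyword matching is shown below; prefix matching on lower-cased tokens stands in for the root-word matching, and the regular expressions for recognising attachments are omitted.

    RESOLUTION_ROOTS = {'push', 'fix', 'patch', 'attach', 'commit'}

    def label_resolution(sentence):
        """Assign Res when any keyword root appears in the sentence, so that
        variants such as 'fixed', 'patches' or 'attachment' also match."""
        tokens = sentence.lower().split()
        return any(tok.startswith(root) for root in RESOLUTION_ROOTS for tok in tokens)

    # Example: label_resolution("I attached a patch that fixes the crash.") -> True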
Fig. 3. Overview of the labelling and summarization process

D. Plan Detector

As we do not yet have a specific pattern or set of keywords to identify Plan sentences, we did not implement such a labelling module. This is due to the diversity of the problems described in bug reports and developers' personal preferences when composing comments. For example, if the proposed solution to a problem is small, such as changing a parameter or adding one line to the source code, the comment could be a single sentence. On the other hand, a proposed solution may be complex and have multiple lines of code snippets and/or comments. Also, developers express their Plan intention in a variety of ways, such as enumerated lists and paragraphs. Consequently, we were unable to derive a general enough heuristic to assign the Plan label to comments with a sufficient level of accuracy. However, Chaparro et al.'s work [22] on discourse patterns may provide a starting point for work in this direction.

V. RESULTS AND DISCUSSION

To develop and test our approach for automatically labelling bug report sentences, we chose to use the bug report corpus curated by Rastkar, Murphy, and Murray. This bug report corpus is a collection of XML documents where sentences are identified by a <Sentence> tag.

The choice to use Rastkar, Murphy, and Murray's bug report corpus was made for two reasons. First, the dataset is publicly available9 and is the bug report corpus used in the previous two studies of bug report summarization by Mani, Catherine, Sinha, and Dubey [5] and Lotufo, Malik, and Czarnecki [4]. Second, as the comment sentences are already identified and verified, this removes errors introduced by a sub-optimal sentence tokenisation approach on a raw bug report dataset.

9 https://github.com/HuaiBeibei/IBRS-Corpus verified 01/05/2020
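For readers who want to reproduce this setup, sentences can be pulled from a corpus file as sketched below; only the <Sentence> tag is fixed by the corpus, and the surrounding document structure is not described here.

    import xml.etree.ElementTree as ET

    def load_sentences(xml_path):
        """Collect the text of every <Sentence> element in one corpus file."""
        tree = ET.parse(xml_path)
        return [s.text.strip() for s in tree.iter('Sentence') if s.text]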
A. Manual Labelling of Sentences

To evaluate the quality of our automatic sentence labelling approach, four volunteers were asked to annotate by hand the sentences in Rastkar, Murphy, and Murray's corpus with the labels from Section IV-B. One annotator examined and labelled the entire corpus. The other volunteers were assigned non-overlapping thirds of the corpus to label independently. Once the labelling was completed, the first annotator met individually with the other annotators to resolve any disagreements. The agreed-upon annotations were then compared with the labels assigned by the approach. Note that our annotation mechanism does not require the annotators' personal judgement, as we were searching for lexical or structural characteristics in sentences. Therefore, annotating sentences to indicate their intention(s) and annotating sentences to indicate whether they are summary sentences are two different tasks. The latter task involves personal judgement that introduces bias into the summary, which is why multiple annotators and agreement rates are important when creating bug report summary corpora. We found that any disagreements in sentence labelling were due to mistakes made during the annotation and/or misinterpreted guidelines. After the resolution meetings, the agreement rate was 100%.
B. Results of Automatic Labelling

Table II shows the precision, recall and F-score (Equations 1, 2 and 3) for each type of label assigned by our approach. The remainder of this section discusses these results for each labelling module.

Precision = (# of labels correctly assigned) / (total # of labels assigned)    (1)

Recall = (# of labels correctly assigned) / (total # of labels assigned by annotators)    (2)

F-score = (2 × precision × recall) / (precision + recall)    (3)

TABLE II
QUALITY OF AUTOMATIC ANNOTATIONS

Label             Precision  Recall   F-Score
Des               90.91%     48.34%   63.12%
CW                65.33%     63.64%   64.47%
Org               92.31%     98.63%   95.36%
QT                92.68%     98.7%    95.6%
RES               83.97%     77.06%   80.37%
OT (SVM)          58.23%     50%      53.8%
OT (Naïve Bayes)  54.43%     50.59%   52.44%
URL               86.84%     80.48%   83.54%
Code              63.81%     45.12%   52.86%
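These metrics are computed per label; a small helper that mirrors Equations (1)-(3), with each annotation represented as a set of sentence identifiers, is:

    def precision_recall_fscore(assigned, gold):
        """Equations (1)-(3) over the sentences that received a given label
        automatically (assigned) and manually (gold)."""
        correct = len(assigned & gold)
        precision = correct / len(assigned) if assigned else 0.0
        recall = correct / len(gold) if gold else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score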
1) Description Labeller: The Description Labeller achieved a 91% precision with a recall of 48%. After analyzing the false negatives, we found that the module missed the root-form similarity of words such as rescale and scaling. For example, in bug report GIMP #164995 the term scale was in the title and the words rescaled and scaling appeared in the bug report description, but the module's lemmatizer failed to reduce these words to the root form scale, and the label Des was not assigned.

Most of the bug reporters used plain text to describe expected and actual software behaviour in the description. However, some reporters included stack traces along with other text. This resulted in the Code Detector identifying code fragments and other program-related sentences that were part of the description. However, recall that the Description Labeller uses three different strategies to assign a Des label: words in common with the title, enumerated lists, and structured text. As a result, the Description Labeller would assign the Des label to the same sentences as those identified by the Code Detector if the sentences appeared after the identified steps to reproduce, actual behaviour, and expected behaviour. For example, lines 18-151 of the description in Bugzilla #429126 are a stack trace. As the expected behaviour is explained in lines 15-17, the Description Labeller labelled every sentence afterwards (i.e. the stack trace) as Des. In other words, the Description Labeller falsely labelled 133 sentences in this bug report.

When investigating the reasons for the low recall rate (48.34%), we found that the Description Labeller failed to distinguish between the use of some words in prose and in code, whereas a person would make this distinction. For example, the title of Eclipse #260502 has the word class inside brackets (i.e. "class") and the word class is repeated in the first comment. To a person, it is obvious that the reporter is describing class documentation; however, to the Description Labeller these sentences appeared to be a piece of code and were not labelled.

2) Clue-word Detector: The Clue-word Detector achieved a 65% precision and a 64% recall. We found that the detector did well at capturing original (Org) and quoted (QT) sentences. However, capturing clue words (CW) was not as effective.

When investigating the clue words missed by the detector, we found that certain words were not converted to the expected root form by the lemmatizer. For example, localization was not converted to the same root as localize. Although a person can identify such words as clue words, the Clue-word Detector failed to identify words having similar semantics.

3) Off-topic Classifier: The Off-topic Classifier had a precision and recall of 58% and 50%, respectively.10 These results appear to be related to the positive data set (i.e. from YouTube) mostly containing short greeting sentences (e.g. "Thank you", "Thanks", and "Much appreciated"). Consequently, we found that the classifier successfully detected short greetings, but failed when the sentences were long. For example, in Eclipse #224588 the sentence "Again, I think it's a brilliant idea Martin." was not correctly labelled as being off-topic.

10 A classifier trained using a Naïve Bayes algorithm was found to be slightly worse, with a precision and recall of 54% and 51%, respectively.

4) Code Detector: Our Code Detector achieved precision and recall values of 64% and 45%, respectively. Regarding these values, the regular expression patterns in the Code Detector captured syntax embedded in regular English sentences as opposed to proper code snippets. Such sentences were not marked as code by the human annotators because they observed that the programmers were discussing the code, not providing code for context or as a solution.

When creating our labelling modules, we chose not to remove symbols and stop words, which are important for detecting programming language syntax. This resulted in some sentences that contain brackets, parentheses, and special symbols, which are common in programming language syntax, being detected as code. For example, Eclipse #223734 contains the sentence "According to the JavaDoc in Platform.getResourceString(Bundle,String)...", which was labelled as code according to the regular expression \w+[\.]?\w+\(((\w+,?\s?)?+)\) used in the module. Another example is Eclipse #260502, which describes an issue in a JavaDoc file. The reporter provided an example of the order of defining a set of classes and interfaces. As these comments contain the words class and interface, we could not write a regular expression pattern that identifies such a comment as code but also correctly excludes sentences that contain the same words.
5) Resolution Detector: We found our Resolution Detector to be effective, with precision and recall rates of 84% and 77%, respectively. Recall that the Resolution Detector searched for keywords such as patch, push, commit, attach, and fix. Most of the comments that contained such keywords, in their various forms, were found by the human annotators to describe either the solution or an intermediate form of the final resolution for the bug report. However, certain sentences were questions that inquired about a patch or a fix, or a further explanation that contained one of the keywords. For example, Eclipse #154119 contains the sentence "It would be possible push parts of TextFieldNavigationHandler down to Platform Text...", which contains the keyword push but is not a resolution statement.

6) URL Detector: The URL Detector was effective in correctly identifying links, with a precision of 87% and a recall of 80%. Although URL regular expressions are well known and developed, the reason that this detector did not achieve a higher score was URL patterns in stack traces. We found that some stack traces contained the text http://localhost, which was labelled as URL by the module. Such URLs were not considered useful to bug report users by the annotators.
VI. THREATS TO VALIDITY

One threat to the validity of our work is the generalizability of our labelling process. Our regular expression patterns were written by investigating the sentences in Rastkar, Murphy, and Murray's corpus. This may mean that there are bug reports in some software projects where these regular expressions would fail. However, we believe this threat to be small, as the corpus contains bug reports extracted from four different projects where each project belongs to a different domain. Therefore, we expect that our modules cover different code styles, error reports, and stack traces in various programming languages such as Java and C/C++.

Another threat to validity relates to the labelling of sentences in Rastkar, Murphy, and Murray's corpus by human annotators. We used two annotators for each report and resolved disagreements when they occurred. However, it is possible that both annotators may have agreed on an incorrect label. Similarly, it may seem that this approach would suffer from the same annotator disagreement problem as expressed in Section III. However, we consider that the annotation of sentences with a fixed set of well-defined labels is much less error-prone and subjective than the selection of sentences for summarization.

VII. A TASK RELEVANT BUG REPORT SUMMARIZER

Having labelled the bug report sentences by their content type or intention, an interface can be provided to the bug report users which presents the labelling and allows the user to filter the sentences according to their information needs. Figure 4 provides an example of the user interface for such a tool that supports the creation of task-relevant bug report summaries. The interface is modelled after Mozilla's Bugzilla11 interface to leverage familiarity with that or similar systems.

11 https://bugzilla.mozilla.org/

Fig. 4. User interface of customizable bug report summary

The tool provides access to all of the comments from the bug report instead of displaying only the automatically selected sentences as in a traditional bug report summarizer. Similar to Bugzilla and other issue tracking systems, the user can choose to expand or collapse individual comments at will.

At the top of the interface are checkboxes for the various labels, allowing the user to show or hide sentences according to the particular sentence label. On the right-hand side is a list of the dominant topic words within the bug report, allowing the user to filter the sentences by topic.

As mentioned, each sentence is displayed with a label indicating to which sentence category it belongs. The labels are colour-coded to aid in understanding the relevance of the sentence category. For example, the Resolution label is green to indicate a high perceived relevance to a user, and the Off-topic annotation is yellow to indicate that the sentence is likely not relevant to the user. In Figure 4, the user has selected to focus on sentences with the labels Des, CW, Qt, Org, Topic, and Res. Sentences that are visible but are not in one of these categories are shown greyed out, such as "LOL, me too", which has an OT label.

When the user selects a topic word that they are interested in, the interface highlights the word in the comments to focus the user's attention. In Figure 4 the user selected preference in the topic word list, and the comments with this word are highlighted (e.g. at the top of the screen).
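The show/hide behaviour behind the label checkboxes amounts to partitioning the labelled sentences into emphasised and greyed-out groups; a minimal sketch of that logic (not the actual interface code) is:

    def build_view(labelled_sentences, selected_labels):
        """Split a labelled bug report into emphasised and greyed-out sentences:
        a sentence is emphasised when at least one of its labels is selected;
        nothing is removed from the view."""
        emphasised, greyed = [], []
        for text, labels in labelled_sentences:
            (emphasised if labels & selected_labels else greyed).append(text)
        return emphasised, greyed

    # Example:
    # build_view([("LOL, me too", {"OT"}), ("Patch attached.", {"Res"})], {"Des", "Res"})
    #   -> (["Patch attached."], ["LOL, me too"])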
VIII. CONCLUSION AND FUTURE WORK

In this study, we focused on creating customizable bug report summaries based on our belief that fixed-length summaries cannot support the different information needs of diverse software development roles. Since different users are interested in a variety of information, we developed a set of labelling modules to identify the content or intention of sentences in bug report comments. The empirical results indicate that, in general, these labelling modules achieve sufficient precision and recall scores for practical use. The code detector and off-topic classifier received F-scores slightly over 50%. Even though those scores were not as high as those of the other modules, they provide a good first step in the direction of bug report sentence labelling.

In the future, we plan to curate a better training set and examine the use of NLP word embedding techniques to improve the effectiveness of the Off-Topic Classifier module. Also, we plan to examine how to improve the performance of the Code Detector module.

Although we implemented a proof-of-concept task-relevant bug report summarizer, it was designed to work with data from Rastkar, Murphy, and Murray's corpus. We plan to implement a prototype that works with data from a "live" bug repository and conduct a user study to determine how useful our automated sentence annotations are to software project members. Also, the user study will involve project members with different roles to better understand the importance of each label to their tasks. Such a study will provide a baseline for future approaches towards task-relevant bug report summarizers.

REFERENCES

[1] G. Bortis and A. van der Hoek, "Porchlight: a tag-based approach to bug triaging," in 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, D. Notkin, B. H. C. Cheng, and K. Pohl, Eds. IEEE Computer Society, 2013, pp. 342–351. [Online]. Available: https://doi.org/10.1109/ICSE.2013.6606580
[2] S. Rastkar, G. C. Murphy, and G. Murray, "Summarizing software artifacts: A case study of bug reports," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 505–514. [Online]. Available: https://doi.org/10.1145/1806799.1806872
[3] S. Rastkar, G. C. Murphy, and G. Murray, "Automatic summarization of bug reports," IEEE Trans. Softw. Eng., vol. 40, no. 4, pp. 366–380, Apr. 2014. [Online]. Available: https://doi.org/10.1109/TSE.2013.2297712
[4] R. Lotufo, Z. Malik, and K. Czarnecki, "Modelling the 'hurried' bug report reading process to summarize bug reports," Empirical Softw. Engg., vol. 20, no. 2, pp. 516–548, Apr. 2015. [Online]. Available: http://dx.doi.org/10.1007/s10664-014-9311-2
[5] S. Mani, R. Catherine, V. S. Sinha, and A. Dubey, "Ausum: Approach for unsupervised bug report summarization," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ser. FSE '12. New York, NY, USA: ACM, 2012, pp. 11:1–11:11. [Online]. Available: http://doi.acm.org/10.1145/2393596.2393607
[6] H. Jiang, N. Nazar, J. Zhang, T. Zhang, and Z. Ren, "PRST: A pagerank-based summarization technique for summarizing bug reports with duplicates," International Journal of Software Engineering and Knowledge Engineering, vol. 27, no. 6, pp. 869–896, 2017. [Online]. Available: https://doi.org/10.1142/S0218194017500322
[7] B. Huai, W. Li, Q. Wu, and M. Wang, "Mining intentions to improve bug report summarization," in The 30th International Conference on Software Engineering and Knowledge Engineering, Hotel Pullman, Redwood City, California, USA, July 1-3, 2018, Ó. M. Pereira, Ed. KSI Research Inc. and Knowledge Systems Institute Graduate School, 2018, pp. 320–319. [Online]. Available: https://doi.org/10.18293/SEKE2018-096
[8] X. Li, H. Jiang, D. Liu, Z. Ren, and G. Li, "Unsupervised deep bug report summarization," in Proceedings of the 26th Conference on Program Comprehension, ICPC 2018, Gothenburg, Sweden, May 27-28, 2018, F. Khomh, C. K. Roy, and J. Siegmund, Eds. ACM, 2018, pp. 144–155. [Online]. Available: https://doi.org/10.1145/3196321.3196326
[9] H. Jiang, X. Li, Z. Ren, J. Xuan, and Z. Jin, "Toward better summarizing bug reports with crowdsourcing elicited attributes," IEEE Trans. Reliab., vol. 68, no. 1, pp. 2–22, 2019. [Online]. Available: https://doi.org/10.1109/TR.2018.2873427
[10] H. Liu, Y. Yu, S. Li, Y. Guo, D. Wang, and X. Mao, "Bugsum: Deep context understanding for bug report summarization," in ICPC '20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, July 13-15, 2020. ACM, 2020, pp. 94–105. [Online]. Available: https://doi.org/10.1145/3387904.3389272
[11] S. Yeasmin, C. K. Roy, and K. A. Schneider, "Interactive visualization of bug reports using topic evolution and extractive summaries," in 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, 2014, pp. 421–425. [Online]. Available: https://doi.org/10.1109/ICSME.2014.66
[12] A. Nenkova, R. Passonneau, and K. McKeown, "The pyramid method: Incorporating human content selection variation in summarization evaluation," ACM Trans. Speech Lang. Process., vol. 4, no. 2, p. 4–es, May 2007. [Online]. Available: https://doi.org/10.1145/1233912.1233913
[13] S. Tatham, "How to report bugs effectively," https://www.chiark.greenend.org.uk/~sgtatham/bugs.html, accessed: 2020-02-03.
[14] "Apache tomcat bug reporting," https://tomcat.apache.org/bugreport.html#How to write a bug report, accessed: 2020-02-03.
[15] "Bug report writing guidelines - mozilla," https://developer.mozilla.org/en-US/docs/Mozilla/QA/Bug writing guidelines, accessed: 2020-02-03.
[16] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, "What makes a good bug report?" in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. SIGSOFT '08/FSE-16. New York, NY, USA: Association for Computing Machinery, 2008, pp. 308–318. [Online]. Available: https://doi.org/10.1145/1453101.1453146
[17] N. Bettenburg, R. Premraj, T. Zimmermann, and S. Kim, "Extracting structural information from bug reports," in Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Proceedings, A. E. Hassan, M. Lanza, and M. W. Godfrey, Eds. ACM, 2008, pp. 27–30. [Online]. Available: https://doi.org/10.1145/1370750.1370757
[18] G. Carenini, R. T. Ng, and X. Zhou, "Summarizing email conversations with clue words," in Proceedings of the 16th International Conference on World Wide Web, ser. WWW '07. New York, NY, USA: Association for Computing Machinery, 2007, pp. 91–100. [Online]. Available: https://doi.org/10.1145/1242572.1242586
[19] J. Anvik, L. Hiew, and G. C. Murphy, "Coping with an open bug repository," in Proceedings of the 2005 OOPSLA workshop on Eclipse Technology eXchange, ETX 2005, San Diego, California, USA, October 16-17, 2005, M. D. Storey, M. G. Burke, L. Cheng, and A. van der Hoek, Eds. ACM, 2005, pp. 35–39. [Online]. Available: https://doi.org/10.1145/1117696.1117704
[20] S. A. Chowdhury and A. Hindle, "Mining stackoverflow to filter out off-topic IRC discussion," in 12th IEEE/ACM Working Conference on Mining Software Repositories, MSR 2015, Florence, Italy, May 16-17, 2015, M. D. Penta, M. Pinzger, and R. Robbes, Eds. IEEE Computer Society, 2015, pp. 422–425. [Online]. Available: https://doi.org/10.1109/MSR.2015.54
[21] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003. [Online]. Available: http://jmlr.org/papers/v3/blei03a.html
[22] O. Chaparro, J. Lu, F. Zampetti, L. Moreno, M. D. Penta, A. Marcus, G. Bavota, and V. Ng, "Detecting missing information in bug descriptions," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, E. Bodden, W. Schäfer, A. van Deursen, and A. Zisman, Eds. ACM, 2017, pp. 396–407. [Online]. Available: https://doi.org/10.1145/3106237.3106285
