You are on page 1of 17

CodeReviewer: Pre-Training for Automating Code Review

Activities
Zhiyu Li∗† Shuai Lu∗ Daya Guo†
Peking University Microsoft Research Asia Sun Yat-sen University

Nan Duan‡ Shailesh Jannu Grant Jenks


Microsoft Research Asia LinkedIn LinkedIn

Deep Majumder Jared Green Alexey Svyatkovskiy


LinkedIn LinkedIn Microsoft
arXiv:2203.09095v1 [cs.SE] 17 Mar 2022

Shengyu Fu Neel Sundaresan


Microsoft Microsoft
ABSTRACT and aim at fully understanding code changes [4]. Given its bene-
Code review is an essential part to software development lifecycle fits, code review has been widely adopted in both open-source and
since it aims at guaranteeing the quality of codes. Modern code industrial projects. As shown by Yang et al. [45], in open-source
review activities necessitate developers viewing, understanding and projects, such as Qt, there are ten thousand reviews taking place
even running the programs to assess logic, functionality, latency, every month (~22k reviews in case of Qt). However, it’s not free
style and other factors. It turns out that developers have to spend to take all advantages of code reviews. To make a correct judge-
far too much time reviewing the code of their peers. Accordingly, ment whether to accept the pull request, developers have to put
it is in significant demand to automate the code review process. In much time and effort into code reviewing, checking code from ev-
this research, we focus on utilizing pre-training techniques for the ery aspect, including logic, functionality, complexity, code style,
tasks in the code review scenario. We collect a large-scale dataset documentation, etc. For example, Yang et al. [45] report that only
of real world code changes and code reviews from open-source 1,437 reviewers have given more than 1M reviews in 4 years in
projects in nine of the most popular programming languages. To project Qt. Accordingly, it is in significant demand to automate the
better understand code diffs and reviews, we propose CodeReviewer, code review process.
a pre-trained model that utilizes four pre-training tasks tailored Many researchers have explored ways to assist reviewers and
specifically for the code review senario. To evaluate our model, we committers (i.e., code authors) to reduce their workload in the
focus on three key tasks related to code review activities, including code review process, such as recommending the best reviewer [9,
code change quality estimation, review comment generation and 37], recommending or generating the possible review comments
code refinement. Furthermore, we establish a high-quality bench- [18, 40, 41] and even revising the code before submitting it for
mark dataset based on our collected data for these three tasks and review [40]. This paper shares the same goal to automate some
conduct comprehensive experiments on it. The experimental results specific tasks related to code review. We target three scenarios
demonstrate that our model outperforms the previous state-of-the- from both reviewers’ and committers’ perspectives. The first is
art pre-training approaches in all tasks. Further analysis show that called code change quality estimation, aiming to predict whether
our proposed pre-training tasks and the multilingual pre-training a code diff needs a review comment. It assists reviewers to pick
dataset benefit the model on the understanding of code changes a code diff chunk that might have issues among a large number
and reviews. of code chunks. The second task is review generation which can
dramatically reduce the time cost for the reviewer. The last scenario
KEYWORDS is for committers, which is code refinement according to the previous
code and reviewer’s comment.
Code review, deep learning, datasets, pre-training
Motivated by the wide adaptation of deep learning (DL) tech-
niques for software engineering and the rapid development of pre-
1 INTRODUCTION training techniques, we leverage pre-training for automating code
Code review, a process of manually inspecting source code by team- review activities. Recently, researchers have proposed many pre-
mates other than the code author, is recognized as a critical part of trained models on source code [7, 14, 17, 43]. However, they can
the software development lifecycle [13]. Studies have shown the hardly handle the code review process. Chen et al. [7] propose
huge benefits of code reviews [1, 3, 31, 32]. Compared with the Codex, a large scale pre-trained model based on GPT-3 [6]. Codex
traditional code review process formalized by Fagan in 1976 [12], has been proven to be the most powerful model for code generation
which avoids introducing errors and defects but is cumbersome, tasks. Since it is trained on the source code files in raw format, it
modern code review activities involve fewer formal requirements knows very little about code review. We demonstrate that it can-
∗ Equal not generate any meaningful comment in the review generation
contribution.
† Work done during internship at Microsoft Research. task based on our evaluation. Tufano et al. [40] attempt to use
‡ Corresponding author is Nan Duan nanduan@microsoft.com. a pre-trained model for code review automation. However, their
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan

Code diff w/o comment Code Diff


from torch import ( Quality Estimation
- reshard_output,

CodeReviewer
+ _reshard_output,
)
𝐷𝐷 → 𝑅𝑅
Review Generation
from torch import (
Pull Requests in - reshard_output,
GitHub + _reshard_output,
)
Code Refinement
Reviewer: need some Pre-train an encoder-decoder
small doc changes
transformer model
Code diff with comment

Figure 1: Overview of the workflow of CodeReviewer.

pre-training dataset is collected from Stack Overflow and Code- 2 CODE REVIEW AUTOMATION TASKS
SearchNet [21], which is not directly related to code review process. In this section, we give a formulated description of the code review
Furthermore, they take the normal form of source code as the model process, provide the formal definitions of three key tasks abstracted
input, ignoring the special format of code diff that can help the from the code review process, and clarify the data format of the
model better understand code changes. tasks. The employed symbols are listed in Table 1.
To tackle the problems, we pre-train CodeReviewer, an encoder-
decoder transformer model. Different from Tufano et al. [40]’s work, 2.1 Code Review
CodeReviewer is pre-trained on a large dataset in code review sce-
In the code review process, contributors (code change author, 𝑃𝐶 )
nario, consisting of code diff hunks and code review comments. We
update the source code to accomplish new features or fix bugs in
propose four pre-training tasks, including diff tag prediction, de-
the old version. The original code and updated code are denoted
noising code diff, denoising review comment, and review comment
as 𝐶 0 and 𝐶 1 . Once the code changes (𝐷 : 𝐶 0 → 𝐶 1 ) are ready for
generation to make CodeReviewer better understand code diffs and
review, the author of these changes creates a pull request to start the
generate review comments. Figure 1 shows the overview of the
code review process. Other peer contributors (reviewers, 𝑃𝑅 ) will
workflow of our CodeReviewer.
review the code changes and provide comments or suggestions (𝑅𝑛𝑙 )
To make CodeReviewer more suitable for the code review pro-
on them if necessary. Based on the comments, the author makes
cess, we build a large-scale dataset of code changes and corre-
revisions and provides a newer version of code 𝐶 2 . We call the
sponding review comments. Such code review data is collected
activities up to now as a review round. Note that the review process
from GitHub pull requests in high-quality projects with high stars
is not finished yet. The reviewers can further give suggestions on
covering nine of the most popular programming languages. Using
the revisions. And the author may also revise the code again. After a
GitHub code review data, we create pre-training and benchmark
few review rounds, a pull request will finally be approved (changes
datasets for three tasks related to the code review process. To the
are merged into the main branch) or discarded.
best of our knowledge, it is the largest multilingual code review
There are usually multiple commits in a pull request, and each
dataset with complete information of code changes and reviews.
commit may be related to multiple code changes across different
We further evaluate our CodeReviewer model on our benchmark
files. Thus, to model the whole review process is difficult and chal-
dataset. Compare with previous state-of-the-art (SOTA) genera-
lenging [19]. As an early step, we focus on automating the re-
tive models for code, experimental results show that our model
view process of a single commit. In the review round, two distinct
outperforms previous works on all three tasks. Further analysis
roles are involved – contributor 𝑃𝐶 who commits a code change
proves the effectiveness of our proposed pre-training tasks and the
(𝐷 : 𝐶 0 → 𝐶 1 ), and reviewer 𝑃𝑅 who gives advice and comments
multilingual high-quality dataset.
(𝑅𝑛𝑙 ). The goal of code review automation is not to replace these
To summarize, the contributions of this work are:
two roles by machine but to help reduce their workload. To that
end, we focus on three specific code review automation tasks in
both scenarios of different roles. An overview of all three tasks is
• A pre-trained model for automating three different tasks in shown in figure 2.
the code review process.
• Novel pre-training tasks for code changes understanding 2.2 Code Change Quality Estimation
and generation. Code change quality estimation task is to predict whether a code
• A large-scale code review related pre-training dataset and change is high-quality and ready to be accepted in the review pro-
a high-quality benchmark dataset for evaluation in nine cess. This is a binary classification task, i.e., Y = {0, 1}. The input
programming languages. is a code change, i.e., X = {𝐷 (𝐶 0, 𝐶 1 )}. As stated before, there are
• Our dataset, code, and model will be open-sourced to facili- usually lots of changes across different source code files in a pull
tate the development of code review automation. request. It takes reviewers a large amount of time to review all the
CodeReviewer: Pre-Training for Automating Code Review Activities

Quality
estimation
Code diff A No review
-import java.sql.statement; needed
+import java.sql.Statement;
import java.sql.Connection;
import java.sql.DriverManager;
Review generation Code refinement
Code diff B
-import java.util.*;
+import java.util.*; Need review I think "import *" is not +import java.util.List;
import org.apache.commons.lang3.StringUtils; allowed in Kylin's static +import java.util.Properties;
import org.apache.kylin.common.util.DBUtils;
code analysis. import org.apache.commons.lang3.StringUtils;
import org.apache.kylin.common.util.DBUtils;

Figure 2: Overview of code review automation tasks.

Table 1: Summary of notations and symbols in this paper. the reviewers may choose an ideal one from them, being rescued
from writing comments manually.
Notation Definition
𝑃𝐶 , 𝑃𝑅 Author and reviewer of the code change 2.4 Code Refinement
𝐶 0, 𝐶 1, 𝐶 2 Different versions of source code In the code refinement task, the model takes as input both the
𝐷 Code change between 𝐶 0 and 𝐶 1 original code 𝐶 1 written by the contributor and the review com-
𝑅𝑛𝑙 Review comment written in natural language ment 𝑅𝑛𝑙 from the reviewer, and targets to generate the revised
X, Y Model input and output space version code 𝐶 2 implementing the requirements mentioned in 𝑅𝑛𝑙 .
𝑐 1, 𝑐 2, · · · , 𝑐𝑛 Source code tokens Different from the code refinement task in other software devel-
𝑤 1 , 𝑤 2 , · · · , 𝑤𝑚 Review comment tokens opment scenarios where only source code is taken as input [39],
L Training loss function we design specifically for the code review process to utilize the
review comment as guidance for refining. Code refinement is a task
designed for assisting contributors. When they open pull requests
and get review comments as feedback, the model helps to revise
changes. But it always turns out that most of the changes are minor
the submitted code automatically based on the comments.
and don’t need a comment or suggestion. To improve the efficiency
of code review, we define and advance to automating the code
change quality estimation task. Given the estimates, reviewers can 2.5 Data Format
give questionable code changes a higher priority to review and pay
All three tasks are related to source code or code changes. In the
more attention to them, saving time and effort. Contributors can
code review process, there are usually multiple related files and
also leverage the estimates to improve low-quality code changes
functions [19]. A source code file or a function/method is too large
before submitting them to reviewers.
to process, because there may be multiple review comments and
code revisions spreading across a file or function/method. Instead,
2.3 Code Review Generation we define the inputs of the three tasks at diff hunk level.
Code review generation is a sequence generation task. The output In pull requests, the original version and revised version of code
is a predicted review comment, i.e., Y = {𝑤 1, · · · , 𝑤𝑛 } where 𝑤𝑖 are shown in a special format: diff (Figure 2). The diff file is gener-
is a natural language word and 𝑛 ∈ N is the length of review ated by comparing two files before and after the change. The diff
comment. The input is still a code change, i.e., X = {𝐷 (𝐶 0, 𝐶 1 )}, tool finds sequences of lines common to both files, interspersed
with its context. In some previous works [18, 40, 41], researchers with groups of differing lines called hunks. Specifically, a diff hunk
use the changed code as input but not the code diff, without taking is a sequence of source code lines surrounded by a few unchanged
into account that review comments have to focus on the changed lines. Diff hunk includes the lines deleted from the original file and
part. It’s not recommended for reviewers to give suggestions to added to the new file. The diff format is frequently used by pro-
the code context which has not been revised. In the code review grammers to show the changes in the code clearly. Moreover, the
process, there are usually common issues in some code changes. diff format is an efficient representation for code changes because
For example, reviewers often write “This method should be private” the unchanged lines occur only once, and the changes are aligned
to suggest the contributor add a private decorator to a method. with “-” and “+” tags at the begging of each line. In the first two
We attempt to learn these patterns and generate such comments tasks, we formulate the input as a diff hunk representing the code
automatically to lighten the burden of reviewers. In the real-world change. In the code refinement task, we extract the input lines (𝐶 1 )
scenario, the model generates a few review comment candidates and and output lines (𝐶 2 ) from the revision diff hunk.
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan

Figure 3: The process of building our dataset.

3 CODE REVIEW DATASET pull-request-based development. ETCR tool can be set up requir-
Nowadays, many developers use collaborative software develop- ing only a GitHub API key and repository name. The ETCR uses
ment platforms like GitHub1 and Gerrit2 not only to share their a Kotlin-based web crawler and works in four stages to collect
code but also to perform code review activities. GitHub platform data of pull requests, commits, comments and source files via the
allows contributors to make all these review details publicly avail- GitHub API. All the data is stored in a relational database and thus
able, including the exact content of code changes, review comments can be accessed conveniently. We first use the ETCR tool [20] to
and their authors, and happening time in each pull request, which collect meta-information of pull requests and review comments,
makes it possible to collect and analyze code review data from open- including the git commit hash and changed file name related to
source projects. Hence, we build our CodeReview dataset based on the comments. Using the meta-information, we can further query
pull requests data collected from open-source projects, covering GitHub API to get code changes (including the original file, new file,
nine programming languages, in GitHub. and the code diff) corresponding to the review comments. These
code changes and comments make up our CodeReview data. By
3.1 Project Selection now, we have collected all required data for building datasets for
the three downstream tasks.
To ensure the quality of our dataset, we collect pull request data
from publicly available high-quality open-source repositories. We
first sort the projects by “popularity” as indicated by the number
of stars. To improve the generalizability of the models trained on 3.3 Dataset Construction
our dataset, we collect projects in nine of the most popular pro-
The pre-training dataset is a set of code changes with or without
gramming languages in GitHub, including C, C++, C#, Go, Java,
review comments. However, it requires further data processing
JavaScript, PHP, Python, and Ruby. Then, we keep the top 10,000
to build datasets for the three downstream tasks: ❶ Code change
projects for each of the nine programming languages and remove
quality estimation: All commented code changes are regarded as
those projects that do not explicitly permit the re-distribution of
suspicious code that introduces bugs or conflicts with code spec-
their data. To be specific, all repositories with popular licenses such
ifications. Other code changes without comments are labeled as
as Apache-2.0 license and MIT license 3 are kept. If the contributors
correct. As the number of code changes without comments is about
write in the license file that they allow for re-distribution of their
2-3 times the number of commented code changes, we perform
data, their projects are also collected. In order to ensure the quality
random down-sampling on the changes without review comments
of the project and acquire code review data as much as possible, we
to build a balanced dataset. ❷ Review comment generation: Code
further sort the projects by the number of pull requests and filter
changes with review comments will be used for this task. We filter
out projects with less than 1,500 pull requests. This process keeps
out those comments written by the code changes author. When
only active projects with many contributors and removes reposito-
there is more than 1 comment related to a diff hunk, we only keep
ries that are forked from other projects as the pull request number
the earliest one. ❸ Code refinement: We traverse each pull request
is not inherited. Then we start crawling pull request information
in the projects. For each commented code change, we check all
of the projects.
the commits in this pull request to find whether there is a later
commit that updates this part of code again. To be specific, when
3.2 Review Data Collection the reviewer writes a comment on the code diff 𝐷 : 𝐶 0 → 𝐶 1 and
We use GitHub REST API to collect pull requests data from the the revised source code lines in 𝐶 1 are further modified to a newer
projects. GitHub API offers an HTTP request-based API for access- version 𝐶 2 in a later commit, we assume that the comments on the
ing repository information. By sending HTTP requests to GitHub, early change helped the contributor to make the later updates. Then
we can access the branches, commits, pull requests, code diff, re- the triplets (𝐶 1, 𝑅𝑛𝑙 , 𝐶 2 ) are collected to build the code refinement
view comments, etc in Json format conveniently. Many researchers dataset. In some cases, a comment is related to multiple future code
have used GitHub API to collect and analyze pull request data revisions or multiple comments contribute to a revision together.
in their work [20, 34]. To advance the research of code review, All these samples are removed from the dataset because modeling
Heumüller et al. [20] develop the ETCR infrastructure for mining the interactions between multiple comments and multiple changes
code review datasets from any GitHub project which practices is beyond the discussion of this paper. Even though we collect data
1 https://github.com/ from projects with high star numbers and PR numbers, the dataset
2 https://www.gerritcodereview.com/ quality still varies, especially when it comes to review comments.
3 Apache-2.0, GPL-3.0, MIT, BSD-2.0, BSD-3.0, BSL-1.0, GPL-2.0 license, etc. To filter out low-quality comments, we clean the pre-training and
CodeReviewer: Pre-Training for Automating Code Review Activities

downstream datasets carefully following the steps described in 4.3 Pre-training Tasks
Appendix A. The key challenge of automating code review activities is to un-
derstand code changes and capture the relationship between code
3.4 Data Split changes and corresponding review comments. Therefore, we design
We use the CodeReview data to build the pre-training dataset and four pre-training tasks to improve the abilities of CodeReviewer.
the benchmark dataset for the three downstream tasks. To prevent
the dataset from information leakage, we split the data in project 4.3.1 Diff Tag Prediction. As mentioned in Section 2.5, diff is a
level. The code repositories with more than 2,500 pull requests special format containing rich information in code changes. Com-
are used to build the pre-training dataset and training set of the pared with encoding both the original and new version code, the
benchmark dataset. Other code repositories with [1, 500, 2, 500) pull diff format prevents duplicates of unchanged lines. Thus, learning
requests are used to build the validation and test dataset. code diff as a special format of source code is critical for the model
in the code review scenario.
4 CODEREVIEWER We use Diff Tag Prediction (DTP) task to equip the model with
the ability to understand the special line tags in code diff. Note that
In this section, we describe our CodeReviewer model in detail,
we replace “+” and “-” tags at the beginning of diff lines by special
including the model architecture, the input-output representations
tags [𝐴𝐷𝐷], [𝐷𝐸𝐿] and involve [𝐾𝐸𝐸𝑃] in the model input. In DTP
of the model, and the pre-training tasks designed for code review.
training, the three kinds of special tags are replaced by [𝑀𝐴𝑆𝐾]
We develop CodeReviewer based on Transformer [42] and design
tokens in the model input. We train our model to predict special
four pre-training tasks related to the code review process to improve
tokens [𝐴𝐷𝐷], [𝐷𝐸𝐿] or [𝐾𝐸𝐸𝑃] at corresponding positions. By
the model’s capacity for automating code review activities.
training on DTP, the model learns to distinguish whether a line is
unchanged or updated. This helps CodeReviewer to understand the
4.1 Model Architecture
code diff format. Concretely, model predicts a probability distribu-
The CodeReviewer is an encoder-decoder model based on Trans- (𝑖) (𝑖) (𝑖)
tion p (𝑖) = (𝑝 0 , 𝑝 1 , 𝑝 2 ) for the 𝑖-th masked tag position and a
former [42]. We adopt the same architecture as Text-To-Text-Transfer
standard cross-entropy loss is used to train our model:
Transformer (T5) model [30]. The CodeReviewer model consists of
12 Transformer encoder layers and 12 decoder layers. There are 12
attention heads in each layer and the hidden size is 768. The total ∑︁  
(𝑖) (𝑖) (𝑖) (𝑖) (𝑖) (𝑖)
L𝐷𝑇 𝑃 = − 𝑦0 log 𝑝 0 + 𝑦1 log 𝑝 1 + 𝑦2 log 𝑝 2 (1)
parameter size of the model is 223M.
𝑖
We initialize CodeReview with the parameters of CodeT5 [43].
We further pre-train the model with our four pre-training tasks. 4.3.2 Denoising Objective. Denoising objective is an unsupervised
Once pre-trained, CodeReviewer is finetuned and evaluated on the pre-training objective introduced by Lewis et al. [23] to acquire a
downstream tasks respectively. general understanding of a corpus. The original denoising objective
randomly masks spans in the input sentence with a corrupted rate of
4.2 Input-Output Representations 15% and predicts these spans. By learning to predict masked spans,
CodeReviewer takes different inputs and outputs for different tasks. the model gains general knowledge about the corpus distribution.
For code understanding tasks or comment generation tasks, the To better understand code diff and review comments, we design
model takes code diff as input. For the code refinement task, the two denoising objectives for CodeReviewer: denoising code diff
model takes as input both the original source code and the review (DCD) and denoising review comment (DRC). In the DCD task, we
comment. randomly select 15% code lines and mask them. We corrupt inputs
Following the standard way of input processing in Transformer, on line-level but not span-level in order to keep the integrity format
the input is treated as a token sequence. We use the same RoBERTa of code diff. It’s worth noting that to reduce the difficulty of this
[25] tokenizer as CodeT5 to split the source code and review com- task, the line tags ([𝐴𝐷𝐷], [𝐷𝐸𝐿] and [𝐾𝐸𝐸𝑃]) are preserved to
ment to the tokens. A special token [𝐶𝐿𝑆] is prepended to the inform the model whether a line is updated or removed. The DCD
sequence, forming the input as {[𝐶𝐿𝑆], 𝑐 1, 𝑐 2, · · · , 𝑐𝑛 } where 𝑐𝑖 task aims to help the model learn the distribution of code changes,
is source code token and 𝑛 is the sequence length. To help our augmenting both the encoder and decoder. Encoder producing a
model understand the diff format better, the special line tags “-” better representation of code changes benefits both understanding
and “+” in diff file indicating line deletion and line insertion are and generation tasks. Pre-training the decoder on DCD helps the
replaced with special tokens [𝐷𝐸𝐿] and [𝐴𝐷𝐷]. We also insert a model generate better source code in the code refinement task.
[𝐾𝐸𝐸𝑃] before each unchanged line. When there are both source In DRC task, the model is given a corrupted review comment as
code tokens and review comment tokens in the input, a [𝑀𝑆𝐺] input and is required to recover the masked spans. Specifically,
token is inserted to separate the two sequences. Thus the input is we randomly mask spans with 20% corrupted rate and train the
{[𝐶𝐿𝑆], 𝑐 1, 𝑐 2, · · · , 𝑐𝑛 , [𝑀𝑆𝐺], 𝑤 1, 𝑤 2, · · · , 𝑤𝑚 }. decoder to generate them. Training with the DRC task is expected
The model outputs consist of two components: token represen- to benefit comment-related tasks. We describe the loss of DCD and
tations from the encoder and generated token sequence from the DRC tasks as:
decoder. For classification tasks, the representation of [𝐶𝐿𝑆] token 𝑘
is used to generate the predictions. For sequence generation tasks,
∑︁
L𝐷𝐶𝐷 = − log 𝑃𝜃 (𝑐𝑡 |cmask, c<𝑡 ) (2)
the decoder outputs the predicted token sequence. 𝑡 =1
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan

Figure 4: Pre-training tasks of CodeReviewer.

𝑘
∑︁ 5 STUDY DESIGN
L𝐷𝑅𝐶 = − log 𝑃𝜃 (𝑤𝑡 |wmask, w<𝑡 ) (3)
𝑡 =1
To investigate the performance of our CodeReviewer model on the
code review tasks, we perform a large-scale study to answer the
where cmask, wmask are masked code diff and masked review com- following research questions (RQ):
ment, and c<𝑡 , w<𝑡 are the span sequence of code diff and review RQ1: How does CodeReviewer perform on the code change
comment generated so far. quality estimation task? We provide as input to CodeReviewer
a code change 𝐷 1 : 𝐶 0 → 𝐶 1 , and ask the model to give a binary
4.3.3 Review Comment Generation. In each pre-training task men- prediction whether the code change needs a review.
tioned above, only one modal is involved, which means that the
model only learns to understand source code or review comments RQ3: How does CodeReviewer perform on the review gen-
in one task. However, the most challenging part of code review eration task? In this RQ, CodeReviewer is also provided a code
is to capture the relationship between code changes and review change 𝐷 1 : 𝐶 0 → 𝐶 1 , but we assess the ability of CodeReviewer to
comments. To equip our model with this capability, we utilize bi- generate a natural language comment 𝑅𝑛𝑙 as the reviewers would
modal data (code changes in programming languages and related do.
review comments in natural language) in the pre-training task. For RQ3: How does CodeReviewer perform on the code refine-
generalizability, we use a simple conditional generation task: re- ment task? Given a few lines of source code submitted for code
view comment generation (RCG). In RCG, the model is given a code review and the feedback from reviewers written in natural language,
change as input and asked to generate the review comment written the model is required to refine the submitted code to implement
by the human reviewer. We use the negative log-likelihood loss: the requirements of the review comments.
In RQ1-RQ3, we evaluate CodeReviewer on different code review
𝑘
∑︁ tasks. To assess the contribution of each pre-training task and our
L𝑅𝐶𝐺 = − log 𝑃 (𝑤𝑡 |c, w<𝑡 ) (4) multilingual dataset, we further investigate RQ4 and RQ5.
𝑡 =1
RQ4: What role does each pre-training task play in CodeRe-
where c is code change and w<𝑡 is the comment generated so far. viewer? In RQ1-RQ3, the evaluated CodeReviewer model is pre-
trained on all four pre-training tasks before being applied to the
downstream tasks. We remove the pre-training tasks one by one
4.4 Fine-Tuning
and evaluate the resulting models again to expose the influence of
We group all downstream tasks into classification tasks and gen- each pre-training task on different downstream tasks.
eration tasks. For classification tasks such as code change quality
estimation, we only use the pre-trained encoder. The representation RQ5: Can multilingual dataset benefit model performance
in the last layer of the special token [𝐶𝐿𝑆] at the beginning ℎ 0 is on understanding single programming language? Existing re-
fed into a linear classifier to produce the prediction. For genera- search show that multilingual training dataset can benefit for model
tion tasks such as code refinement, the entire pre-trained encoder- performance compared with monolingual dataset on neural ma-
decoder model is used. A standard token negative log-likelihood chine translation and code translation [8, 48], especially for low-
loss is leveraged to optimize the probability of the target sequence. resource languages. We evaluate the performance of CodeReview
CodeReviewer: Pre-Training for Automating Code Review Activities

Table 2: Statistics of pre-training dataset. Wang et al. [43]. They propose to pre-train CodeT5 with identifier-
aware denoising tasks and bimodal dual generation objective, which
Meta Info Data # makes CodeT5 the SOTA model for multiple code-related down-
Language
stream tasks including code summarization, code generation, code
Project # PRs† Data Size w/ comment w/o comment
translation and code refinement. The base version of CodeT5 con-
Python 195 1,451k 72.8G 887k 518k sists of 12 encoder layers and decoder layers with the parameter
Java 175 1,073k 54.8G 876k 467k size 220M. We use CodeT5 to refer to this model.
Go 146 951k 40.4G 728k 410k
C++ 133 999k 82.1G 474k 202k
JavaScript 194 1,354k 30.6G 425k 293k
5.2 Evaluation Metrics
C 77 441k 135.4G 292k 110k We provide a brief description of the evaluation metrics used for
C# 77 463k 28.2G 324k 199k the three downstream tasks in this section.
Php 92 574k 16.0G 215k 157k
Code change quality estimation. It is a binary classification task.
Ruby 72 626k 3.8G 90k 126k
We use accuracy, precision, recall, and F1 to evaluate the model
Total 1,161 7,933k 463.2G 4,311k 2,481k predictions. Note that when computing the later 3 metrics, the
† Pull request numbers of the projects in total. code changes with issues (requires for comments and updates) are
treated as the positive class.
Table 3: Statistics of benchmark datasets. Review comment generation. We compute the BLEU (Bilingual
Evaluation Understudy) [29] score of the predictions. BLEU is
widely used to assess the quality of automatically generated text
Dataset Train # Valid # Test # LOC in generation tasks such as machine translation and code genera-
Quality Estimation ∼266k ∼31k ∼31k ∼11M tion. We use the BLEU-4 variant, which computes the overlap of
Review Comment Generation ∼118k ∼10k ∼10k ∼1.8M 𝑛-grams between the generated text and the reference text with 𝑛
Code Refinement ∼150k ∼13k ∼13k ∼1.3M from 1 up to 4. As review comments are diverse and non-unique,
we further apply human evaluation on generated comments from
two perspectives: information and relevance. ❶ Information: in
this metric, we evaluate how informative the comment is for the
pre-trained and fine-tuned on monolingual dataset and compare it
contributor to revise the source code. Comments like “declare this
with the model trained on the full dataset to show the influence of
variable as private” are more informative than those like “why do
multilingual dataset.
we need this?”. ❷ Relevance: in this metric, we evaluate to what
Dataset We build our pre-training dataset and 3 downstream task extent the review comment is related to the corresponding code
datasets as described in Section 3.3. Table 2 summarizes the statistics change. The comments point out the issues in the code changes
of the pre-training dataset. For benchmark datasets, details are will get high scores, while those not related or disobey the logic in
shown in Table 3. the code changes should get low scores. The comments are labeled
with a 1-5 score for each metric. We describe in detail of the two
5.1 Baseline Models metrics in the Appendix B.
To demonstrate the superiority of our multilingual code review Code refinement. For this task, we compute the BLEU score be-
related pre-training dataset and carefully designed pre-training tween generated code and target code and the exact match (EM)
tasks, we compare our CodeReviewer model with three baselines, rate. BLEU only evaluates the similarity between the prediction
including a state-of-the-art (SOTA) model architecture Transformer and the target, while a small change in source code can result in
[42] trained from scratch and two pre-trained models: T5 for code compile error and execution error. So exact match is a more impor-
review [40] and CodeT5 [43]. tant metric in this task. Only if a prediction is exactly the same as
Transformer. Transformer [42] is a SOTA model architecture for the target, will it be considered as correct.
many classification and generation tasks. The key component of
Transformer is the multi-head attention module and the parameter- 5.3 Implementation Details
ized linear transformation layers. We use the same model setting We implement our model with the popular deep learning develop-
as CodeT5-base, with a 12-layer encoder and a 12-layer decoder. ment framework PyTorch 4 and the python package transformers
T5 for code review. Tufano et al. [40] attempt to use pre-trained developed by HuggingFace 5 . We pre-train our CodeReivewer model
model for code review automation. They pre-train a small version with 2 DGX-2 servers with 16 NVIDIA V100-32G GPUs on each
of T5 model with the denoising objective proposed by Raffel et al. server. The learning rate and batch size in the pre-training stage
[30] on their own dataset including Stack Overflow dumps and are set to 0.0002 and 768. We use AdamW optimizer with linear
CodeSearchNet [21]. The model consists of 61M parameters, with warmup to optimize the model for 250k steps. The warmup step
6 layers for both encoder and decoder. We refer to this model with number is set to 2500. When fine-tuning our CodeReivewer and
T5 for convenience later. baseline models on the three downstream tasks, we use batch size
CodeT5-base. CodeT5 is a SOTA unified pre-trained encoder-decoder 4 https://pytorch.org/

model for code understanding and generation tasks proposed by 5 https://huggingface.co/


Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan

Table 4: Results on code change quality estimation. Table 5: Results on review comment generation. The perfect
scores of two metrics in human evaluation set are both 5.
Model (Layers #) Precision Recall F1 Accuracy
Test set Human evaluation set
Transformer (12) 74.50 46.07 56.93 65.16 Model (Layers #)
T5 (6) 70.82 57.20 63.29 66.82 BLEU Information Relevance
CodeT5 (12) 70.36 58.96 64.16 67.07 Transformer (12) 4.76 3.08 2.50
CodeReviewer (12) 78.60 65.63 71.53 73.89 T5 (6) 4.39 2.54 1.62
CodeT5 (12) 4.83 3.03 2.40
CodeReviewer (12) 5.32 3.60 3.20
72 and learning rate 0.0003. Each experiment on downstream tasks
is performed on a server with 4 V100 GPUs. When evaluating the
review comment generation task and the code refinement task, we Table 6: Results on code refinement.
use beam search with size 10.
Model (Layers #) BLEU EM
6 RESULTS ANALYSIS
NaïveCopy 58.75 0.00
In this section, we answer each research question based on our
T5 (6) 77.03 15.08
experiment results. we first focus on three research questions con-
CodeT5 (12) 80.82 24.41
cerning our model’s performance in each task and how it compares
to other state-of-the-art baseline models. We further demonstrate CodeReviewer (12) 82.61 30.32
the effects of our pre-training tasks and multilingual dataset.

6.1 RQ1: Performance on code change quality is about 1/4 of the other models (223M). We conjecture that it is too
estimation challenging for a small model to generate the review comments.
Table 4 shows the results on the code change quality estimation task. In this task, we also conduct a qualitative analysis of the world’s
From the table, we can see that whether the baselines are trained largest language model on source code, Codex [7], which has been
from scratch or pre-trained models, CodeReviewer outperforms proven to be the most powerful code generation model and ca-
them significantly on all four metrics. Specifically, our CodeRe- pable of zero-shot task transfer. Since we don’t get access to the
viewer model improves F1 and accuracy by 8.24% and 7.07% com- model and can’t fine-tune it, we alternately use GitHub Copilot
pared with T5. The improvement over CodeT5 is also over/about [15] which is powered by Codex. By providing several code diff
7%, which demonstrates that our pre-training tasks help CodeRe- and review comment pairs as prompt, we find that Codex can’t
viewer understand code changes better. Besides, the performance generate any meaningful reviews but copying comments in the
of Transformer trained from scratch is inferior to the other three given examples. Figure 5a shows an example of outputs generated
models, indicating the importance of pre-training. by different models including Codex. More cases are shown in the
Appendix C.
6.2 RQ2: Performance on review generation
6.3 RQ3: Performance on code refinement
Table 5 shows the results on the review generation task. CodeRe-
viewer produces a higher BLEU score than the baseline models. Table 6 shows the results on the code refinement task. The Naïve-
However, the BLEU score of our model is still lower than 10, in- Copy method directly copy the input code as the refinement re-
dicating it is a hard task. As mentioned before, review comments sult. It produces a decent BLEU score but 0.00 Exact Match (EM).
are diverse and non-unique. Referring to the example illustrated CodeReviewer successfully generates the repaired code exactly the
in Figure 5a, our model prediction conveys similar intent as the same as ground truth for more than 30% cases, which is two times
ground truth. But their words differ greatly so that the BLEU score as the result of T5 and 25% more than CodeT5 relatively, demon-
is lower than Codex. To investigate the models’ performance better, strating the superior ability of CodeReviewer to understand the
we perform a human evaluation on their predictions. review comments and refine code based on them. The BLEU score
We firstly choose 300 random samples in Java from the test set of CodeReviewer is also higher than T5 and CodeT5. Figure 5b
and manually select 100 of them with high-quality code changes shows an example of a perfect prediction generated by our model.
and review comments. For the selected samples, we invite 6 profes- Note that the Transformer model failed to converge in 10 epochs,
sional sophisticated programmers to score the reference comment producing about 0 BLEU score, so the result is not listed in the table.
and generated comments of each model as described in Section 5.2.
Results are listed in Table 5. Our model improves the information 6.4 RQ4: Influence of pre-training tasks
score and relevance score of the generated comments by about 20% We demonstrate the superior performance of CodeReviewer on
relatively compared with the baseline models, which means com- different downstream tasks in RQ1-RQ3, proving that the well-
ments generated by CodeReviewer are more useful and adequate. designed pre-training tasks benefit for automating code review
The performance of T5 is inferior to the other models, including activities. In this RQ, we dive deeper and expose the contribution
the Transformer baseline. Because the parameter size of T5 (61M) of each pre-training task. We further pre-train three models, each
CodeReviewer: Pre-Training for Automating Code Review Activities

public class DockerOptions {


for (int i = 0; i < maxContainerCount; i++) {
node.add(caps, new DockerSessionFactory(clientFactory, docker, image, caps));
}
- LOG.info(String.format(
+ LOG.finest(String.format(
"Mapping %s to docker image %s %d times", caps, name,
Ground truth: This change prevents a user understanding how their server is configured. Best to leave at `info` level.
T5: I don't think this change is needed. BLEU:4.93
CodeT5: Why do we need this change? BLEU:2.97
Codex: I think this is a good example of how to use the docker BLEU:8.83
CodeReviewer: This is an important debugging help and shouldn't be lower than the default visible INFO. BLEU:7.97

(a) An example of the review generation task. The Codex output is obtained by Copilot.
Review comment: This should only happen once in initialization, can you move it out of this method?
CodeReview:
private string ReplaceUnit(string resultText)
{
- var unitMap = Config.UnitMap.OrderBy(o => o.Key.Length).ToDictionary(o => o.Key, p => p.Value);
- foreach (var unit in unitMap.Keys)
+ foreach (var unit in Config.UnitMap.Keys)
{
resultText = resultText.Replace(unit, Config.UnitMap[unit]);
}
...
}

(b) An example of the code refinement task. To make clear, we show the code diff between the inputs and model outputs instead of their
original form.

Figure 5: Examples of the review generation task and the code refinement task.

Table 7: Ablation study on code change quality estimation. Table 8: Ablation study of multi-lingual dataset on the code
change quality estimation task.
Model Precision Recall F1 Accuracy
Java C# Ruby
CodeReviewer 78.60 65.63 71.53 73.89 Metric
-CMT 77.14 66.35 71.34 73.35 Multi Single Multi Single Multi Single
-ENC 79.59 62.64 70.10 73.29 Accuracy 74.04 72.45 74.80 72.21 82.70 79.92
-DEC 79.87 60.94 69.13 72.80 F1 70.53 69.34 76.52 75.53 89.23 88.10

of which is trained with some pre-training tasks removed, and


evaluate their performance on the code change quality estimation
task. The models CodeReviewer-DTP, CodeReviewer-DCD and
CodeReviewer-CMT represent CodeReviewer trained without the
Diff Tag Prediction task, the Denoising Code Diff task, and the programming language, we build monolingual datasets for Java, C#
other two comment related tasks (Denoising Review Comment and and Ruby language, repectively. Java represents popular languages
Review Comment Generation) respectively. and Ruby represents the low-resource languages as shown in Ta-
Table 7 shows the results. The drop of the model performance ble 2. For each of the three languages, we pre-train and finetune
demonstrates the importance of the pre-training tasks. For the code the CodeReviewer on the monolingual datasets and compare the
change quality estimation task, all the pre-training tasks help our performance on the code change quality estimation task with the
model understand the code diff better. But the diff tag prediction original CodeReviewer pre-trained and finetuned on full multilin-
task and the denoising code diff task are more important. The results gual datasets. The results are listed in Table 8.
are consistent with our motivation in Section 4.3. The multilingual CodeReviewer outperforms the three monolin-
gual models consistently, improving the accuracy by 2.32% and the
6.5 RQ5: Influence of multilingual dataset F1 score by 1.10% on average. We conclude that our multilingual
In the previous experiments, CodeReviewer is trained on the datasets dataset benefits the CodeReviewer for understanding specific lan-
consisting of nine programming languages. To investigate whether guages significantly. This reveals the superiority of our multilingual
multilingual datasets help our model better understand a single dataset over the datasets for single programming language.
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan

7 RELATED WORKS the relation of different hunks in a pull request by encoding each
hunk and computing attention scores across diff hunks to fuse the
7.1 Pre-training for code-related tasks
information. Li et al. [24] formalize automatic code review as a
Deep Learning techniques have been widely adopted in software multi-instance learning task, in which each hunk is an instance and
engineering research [10, 44]. In recent years, motivated by the the target is to predict whether a pull request will be accepted.
great impact of pre-training technique in natural language (NL) Other researchers focus on tasks related to review comments.
processing area [6, 11, 23, 30], researchers have made attempts to To save the time reviewers spent on writing reviews related to
investigate whether pre-trained models for programming language common issues such as coding style and documentations, Gupta
(PL) can further facilitate software development. and Sundaresan [18] propose the LSTM-based model DeepMem.
Since pre-trained models can address several downstream tasks DeepMem learns the relations between code changes and review
by fine-tuning, previous works target at different scopes of code- comments, and recommends review comments automatically based
related tasks with their models. Kanade et al. [22] pre-train CuBERT on existing code reviews and code changes. Siow et al. [34] also rec-
in a Python code corpus mostly for classification tasks like variable- ommend code reviews in a retrieval-based manner. They propose an
misuse classification, wrong binary operator, etc. CodeBERT [14] LSTM-based multi-level embedding attentional model CORE aim-
and GraphCodeBERT [17] are bi-directional transformer models ing at capturing the semantic information in both source code and
pre-trained on NL-PL pairs in six programming languages, the latter reviews. Tufano et al. [41] use deep learning techniques to automate
introduces new pre-training tasks designed for source code data a different task in code review. They train a Transformer model to
flow. Both models have shown effectiveness on code-text crossing revise the contributors’ code to implement the requirements from
tasks like NL code search and code summarization, and other code the review comments. In the previous work, the researchers usually
understanding tasks like code clone detection and defect detection train a small model for a specific task in the code review process.
thanks to bi-directional attention mechanism. CodeGPT [26], GPT- Differently, we propose a large model pre-trained on four general
C [36] and Codex [7] focus on generative tasks like code completion pre-training tasks for code review, producing superior performance
and code generation because they are pre-trained with decoder- across three different code review tasks.
only transformer architecture. As for encoder-decoder models like
PLBART [2] and CodeT5 [43] and unified models like UniXcoder
[16], they can be applied for both understanding and generation 7.3 Code review datasets
tasks that take source code or text as inputs. To advance the research of automating code reviews activities,
All the above models pay no attention to tasks in the code review researchers also make effort to collect and publish code review
process. A distinct feature of code review tasks is that code changes datasets. Tufano et al. [41] create two datasets with 17k abstracted
should be considered as inputs. Recently, Tufano et al. [40] pre- code changes for predicting method-level code changes. The Code
trained a T5 model for automating code review activities. However, Review Open Platform (CROP) by Paixao et al. [28] is a dataset
their work is different from ours on two points. First, they just use based on specific Gerrit projects consisting of 507k comments. The
the pre-training objectives of T5 [30] and don’t take into consid- dataset from Mukadam et al. [27] has fewer comments (106k) and
eration how to integrate code changes into pre-training. Second, provides only metadata of source code. Yang et al. [46] propose the
their pre-training data is not directly related to code review, but we dataset with the largest number of comments. But they also provide
collect data from GitHub pull requests for pre-training. only metadata of source code. Compared with them, we propose
the largest, multilingual code review dataset with completed infor-
7.2 Automating code review activities mation collected from over 1k GitHub repositories with more than
As indicated by previous empirical studies, code review is an impor- 7.9M pull requests in total.
tant part in the software development lifecycle and involves a sig-
nificant amount of effort and time of reviewers [5, 32]. Researchers
are paying more attention to automating code review activities, 8 THREATS TO VALIDITY
including reviewer recommendation [38, 47], comment location Threats to internal validity relate to the roles played by the
prediction [19, 33], review comment recommendation [18, 34] and model architecture and hyper-parameters setting. We do a small-
code refinement [41]. range grid search on learning rate and batch size settings, leaving
Thongtanunam et al. [38] reveal that 4%-30% of reviews have other hyper-parameters the same as those in CodeT5 [43]. It is
code reviewer assignment problems. Thus, they propose a file expected that more hyper-parameter tuning would bring more
location-based tool RevFinder to recommend appropriate code re- improvements.
viewers. To tackle the same problem, Zanjani et al. [47] design the
system cHRev which utilizes code review histories to recommend Threats to external validity are mostly related to the dataset we
reviewers for a code change. While these researchers aim at im- collect and use in this paper. Since the dataset is collected from
proving code review at an early stage, others are devoted to solving GitHub, it is built upon only open-source projects, not industrial
the more challenging tasks during the code review process. Shi projects. Besides, even though we explore many ways to do filtering,
et al. [33] propose the DACE framework based on CNN and LSTM in the high-quality projects, there still exists review comments that
to predict whether a hunk in the code change will be accepted by have little to do with the code change, leading noises in our dataset.
the reviewers. Hellendoorn et al. [19] use the Transformer architec- Threats to construct validity include the rationality of evalua-
ture to solve this task. Furthermore, they also attempt to capture tion metrics. We argue that the BLEU score which is widely used in
CodeReviewer: Pre-Training for Automating Code Review Activities

text generation tasks is not suitable for the review generation task. [16] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022.
So we conduct a human annotation to better evaluate the results. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. arXiv
preprint arXiv:2203.03850 (2022).
[17] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long
Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2021. GraphCodeBERT:
9 CONCLUSION Pre-training Code Representations with Data Flow. In ICLR.
In this paper, we focus on pre-training techniques for automating [18] Anshul Gupta and Neel Sundaresan. 2018. Intelligent code reviews using deep
learning. In Proceedings of the 24th ACM SIGKDD International Conference on
code review activities. We start by formulating three tasks in the Knowledge Discovery and Data Mining (KDD’18) Deep Learning Day.
code review process with the code change format. We collect and organize a large-scale dataset from GitHub for code review pre-training and a benchmark for evaluation on the three tasks. It is the largest dataset in the code review scenario, covering nine of the most popular programming languages on GitHub. Based on this, we introduce CodeReviewer, a transformer-based encoder-decoder model pre-trained on our dataset with four pre-training tasks designed for code review. The experimental results demonstrate that our model outperforms the state-of-the-art models pre-trained on source code in all three tasks.

REFERENCES
[1] A Frank Ackerman, Priscilla J Fowler, and Robert G Ebenau. 1984. Software inspections and the industrial production of software. In Proc. of a symposium on Software validation: inspection-testing-verification-alternatives. 13–40.
[2] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2655–2668.
[3] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, David Notkin, Betty H. C. Cheng, and Klaus Pohl (Eds.). IEEE Computer Society, 712–721. https://doi.org/10.1109/ICSE.2013.6606617
[4] Moritz Beller, Alberto Bacchelli, Andy Zaidman, and Elmar Jürgens. 2014. Modern code reviews in open-source projects: which problems do they fix?. In 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, Premkumar T. Devanbu, Sung Kim, and Martin Pinzger (Eds.). ACM, 202–211. https://doi.org/10.1145/2597073.2597082
[5] Amiangshu Bosu and Jeffrey C. Carver. 2013. Impact of Peer Code Review on Peer Impression Formation: A Survey. In ESEM. IEEE Computer Society, 133–142.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[8] Ting-Rui Chiang, Yi-Pei Chen, Yi-Ting Yeh, and Graham Neubig. 2021. Breaking Down Multilingual Machine Translation. CoRR abs/2110.08130 (2021).
[9] Moataz Chouchen, Ali Ouni, Mohamed Wiem Mkaouer, Raula Gaikovina Kula, and Katsuro Inoue. 2021. WhoReview: A multi-objective search-based approach for code reviewers recommendation in modern code review. Appl. Soft Comput. 100 (2021), 106908. https://doi.org/10.1016/j.asoc.2020.106908
[10] Prem Devanbu, Matthew Dwyer, Sebastian Elbaum, Michael Lowry, Kevin Moran, Denys Poshyvanyk, Baishakhi Ray, Rishabh Singh, and Xiangyu Zhang. 2020. Deep learning & software engineering: State of research and future directions. arXiv preprint arXiv:2009.08525 (2020).
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[12] Michael E. Fagan. 2002. Design and Code Inspections to Reduce Errors in Program Development (Reprint). In Software Pioneers, Manfred Broy and Ernst Denert (Eds.). Springer Berlin Heidelberg, 575–607. https://doi.org/10.1007/978-3-642-59412-0_35
[13] Michael E. Fagan. 2002. A History of Software Inspections. In Software Pioneers. Springer Berlin Heidelberg, 562–573.
[14] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1536–1547.
[15] GitHub. 2021. GitHub Copilot · Your AI pair programmer. https://copilot.github.com/.
[19] Vincent J Hellendoorn, Jason Tsay, Manisha Mukherjee, and Martin Hirzel. 2021. Towards automating code review at scale. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1479–1482.
[20] Robert Heumüller, Sebastian Nielebock, and Frank Ortmeier. 2021. Exploit those code reviews! Bigger data for deeper learning. In ESEC/SIGSOFT FSE. ACM, 1505–1509.
[21] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019).
[22] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning. PMLR, 5110–5121.
[23] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[24] Heng-Yi Li, Shu-Ting Shi, Ferdian Thung, Xuan Huo, Bowen Xu, Ming Li, and David Lo. 2019. DeepReview: Automatic code review using deep multi-instance learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 318–330.
[25] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[26] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
[27] Murtuza Mukadam, Christian Bird, and Peter C Rigby. 2013. Gerrit software code review data from Android. In 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 45–48.
[28] Matheus Paixao, Jens Krinke, Donggyun Han, and Mark Harman. 2018. CROP: Linking code reviews to source code changes. In Proceedings of the 15th International Conference on Mining Software Repositories. 46–49.
[29] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[31] Peter C. Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE '13, Saint Petersburg, Russian Federation, August 18-26, 2013, Bertrand Meyer, Luciano Baresi, and Mira Mezini (Eds.). ACM, 202–212. https://doi.org/10.1145/2491411.2491444
[32] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Frances Paulisch and Jan Bosch (Eds.). ACM, 181–190. https://doi.org/10.1145/3183519.3183525
[33] Shu-Ting Shi, Ming Li, David Lo, Ferdian Thung, and Xuan Huo. 2019. Automatic code review by learning the revision of source code. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4910–4917.
[34] Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, and Yang Liu. 2020. CORE: Automating Review Recommendation for Code Changes. In SANER. IEEE, 284–295.
[35] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33 (2020), 16857–16867.
[36] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1433–1443.

[37] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review. In 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015, Montreal, QC, Canada, March 2-6, 2015, Yann-Gaël Guéhéneuc, Bram Adams, and Alexander Serebrenik (Eds.). IEEE Computer Society, 141–150. https://doi.org/10.1109/SANER.2015.7081824
[38] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 141–150.
[39] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29.
[40] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using Pre-Trained Models to Boost Code Review Automation. CoRR abs/2201.06850 (2022). arXiv:2201.06850 https://arxiv.org/abs/2201.06850
[41] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards Automating Code Review Activities. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 163–174. https://doi.org/10.1109/ICSE43902.2021.00027
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
[43] Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CoRR abs/2109.00859 (2021).
[44] Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2020. A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research. arXiv preprint arXiv:2009.06520 (2020).
[45] Xin Yang, Raula Gaikovina Kula, Norihiro Yoshida, and Hajimu Iida. 2016. Mining the modern code review repositories: a dataset of people, process and product. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, Austin, TX, USA, May 14-22, 2016, Miryung Kim, Romain Robbes, and Christian Bird (Eds.). ACM, 460–463. https://doi.org/10.1145/2901739.2903504
[46] Xin Yang, Raula Gaikovina Kula, Norihiro Yoshida, and Hajimu Iida. 2016. Mining the modern code review repositories: A dataset of people, process and product. In Proceedings of the 13th International Conference on Mining Software Repositories. 460–463.
[47] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2015. Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering 42, 6 (2015), 530–543.
[48] Ming Zhu, Karthik Suresh, and Chandan K Reddy. 2022. Multilingual Code Snippets Training for Program Translation. (2022).

A DATA FILTERING

A.1 General Data Filtering
We perform the following cleaning steps on both the pre-training and downstream datasets. First, we eliminate emojis, non-ASCII characters, trailing spaces, and multiple white spaces from comments to keep the dataset in a unified, normalized format. We further remove comments that contain external URLs, as they often refer to other commits or pull requests. In the comment generation task, comments containing source code as suggestions are also removed because we focus on generating natural language comments. Note that these comments are kept in the code refinement task, as the code suggestions in a comment may be direct guidance for contributors to revise the code. Comments that are too short (fewer than 3 words) or too long (more than 150 words) are considered non-informative or lengthy, respectively, and are removed. Besides, in our human investigation, we notice that some comments are not written in English and some are praise for the author (such as “Nice change!” or “LGTM.”) that is not helpful for the code review process. We use the Python package “langdetect” to detect the comment language and a fine-tuned BERT-base sentiment analysis model to detect the sentiment of comments; comments that are not written in English, as well as comments that are both short and positive, are removed. At last, we apply some handwritten rules [40] to remove comments not related to the code review process. As for code changes, we believe most of the changes are meaningful, so we only remove hunks longer than 200 words (split by spaces) because they are too long for the model to understand (usually more than 512 tokens after tokenization).
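As a rough illustration of these heuristics, the Python sketch below combines the cleaning and filtering steps. The 3/150-word comment bounds and the 200-word hunk limit follow the description above; the sentiment checkpoint named here and the 5-word threshold used to decide when a positive comment counts as "short" are placeholders rather than our exact configuration, and the handwritten rules from [40] and the code-suggestion filter are omitted.

# Minimal sketch of the general comment/diff filters (not the exact pipeline).
import re
from langdetect import detect
from transformers import pipeline

NON_ASCII = re.compile(r"[^\x00-\x7F]+")   # also covers emojis
URL = re.compile(r"https?://\S+")

# Stand-in checkpoint; the text only specifies "a fine-tuned BERT-base sentiment model".
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def clean_comment(comment: str) -> str:
    comment = NON_ASCII.sub(" ", comment)          # drop emojis / non-ASCII characters
    return re.sub(r"\s+", " ", comment).strip()    # collapse multiple white spaces

def keep_comment(comment: str) -> bool:
    comment = clean_comment(comment)
    if URL.search(comment):                        # often refers to other commits / PRs
        return False
    n_words = len(comment.split())
    if n_words < 3 or n_words > 150:               # non-informative or lengthy
        return False
    try:
        if detect(comment) != "en":                # keep English comments only
            return False
    except Exception:                              # langdetect fails on degenerate input
        return False
    # Drop short praise such as "Nice change!" or "LGTM." (5-word cut-off is assumed).
    if n_words <= 5 and sentiment(comment)[0]["label"] == "POSITIVE":
        return False
    return True

def keep_hunk(hunk: str) -> bool:
    # Hunks longer than 200 whitespace-separated words usually exceed 512 tokens.
    return len(hunk.split()) <= 200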
A.2 Data Filtering For Downstream Tasks
Handwritten rules can filter out most of the low-quality comments, and applying such rules is sufficient for the pre-training dataset. However, it is not good enough for building a high-quality benchmark dataset for downstream tasks like review generation. Since reviewers are likely to write comments on a wide range of topics, some comments have limited relation to the code or convey little information, e.g., “What is this change about?”. To eliminate such data, we train a simple comment classification model on 37k review comments from our internal projects. These comments are manually labeled by professional software engineers, with tags including UNIT_TESTING, NULL_HANDLING, CODE_REFACTOR_RENAME, and so on. We then use MPNet [35], a state-of-the-art pre-trained sentence embedding model for natural language, to encode all the comments. Since comments are also natural language, we do not fine-tune MPNet; this gives us a sentence embedding for each comment. A simple SVM (Support Vector Machine) classifier is then trained with the sentence embeddings as input. Finally, we use this comment classification model to label all the comments in the downstream datasets. Comments that are tagged as UNKNOWN or classified with confidence below 50% are removed. This process filters about 5% of the comments in the downstream datasets. Additionally, to reduce the difficulty of the review generation and code refinement tasks, we filter out code changes longer than 20 lines in the corresponding tasks.
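The sketch below illustrates this embedding-and-classification step, assuming the sentence-transformers MPNet checkpoint "all-mpnet-base-v2" and scikit-learn's SVC; the checkpoint and helper names are illustrative, and the 37k labeled internal comments are of course not reproduced here.

# Minimal sketch of the comment classifier used to filter the downstream benchmark.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

encoder = SentenceTransformer("all-mpnet-base-v2")     # MPNet is kept frozen (no fine-tuning)

def train_comment_classifier(comments, tags):
    """comments: list of review comments; tags: labels such as UNIT_TESTING, UNKNOWN."""
    embeddings = encoder.encode(comments)              # one sentence embedding per comment
    clf = SVC(probability=True)                        # probabilities enable the 50% cut-off
    clf.fit(embeddings, tags)
    return clf

def keep_for_benchmark(clf, comment: str) -> bool:
    probs = clf.predict_proba(encoder.encode([comment]))[0]
    tag = clf.classes_[probs.argmax()]
    # Remove comments tagged UNKNOWN or classified with confidence below 50%.
    return tag != "UNKNOWN" and probs.max() >= 0.5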
B HUMAN EVALUATION
We introduce the standards used in our human evaluation of the review generation task.

B.1 Information metric
We score the comment from 1 to 5 according to how much information the author can get from it.
Score 5: If a comment gives a clear and concrete suggestion, it should be scored with 5.
Score 4: If a comment makes a concrete suggestion but in a euphemistic tone, or the suggestion has small logic errors, it should be scored with 4.
Score 3: If a comment is not a suggestion but only raises a concrete question, it should be scored with 3.
Score 2: If a comment raises a vague question or suggestion, it should be scored with 2.
Score 1: Comments that praise the code change, or are suitable for any scenario, or are hard to understand should get a low score of 1.
Examples are shown in Table 9.

Score 5: I think the docstring should say something like “Alias for the source option”.
Score 5: When Activity is lost in the HttpModule, we’d better restore the boot (HttpIn).
Score 4: Should we also do “reloadCache” when database is null?
Score 4: it is better to use ’e’ instead of ’e’ here
Score 3: when send log failed, why update the last sent log id?
Score 2: Why do we need to handle the case of an error?
Score 2: I don’t think this is the right place for this.
Score 1: Why is this needed?
Score 1: I’m really impressed by this change!
Table 9: Examples of different scores for the information metric.

B.2 Relevance metric
We score the comment from 1 to 5 according to the relevance between the comment and the code change. A good comment is strongly related to the code change, and its logic matches the code change.
Score 5: Comments that talk about the code change only and are logically correct should be given a high score of 5.
Score 4: If the comment is highly related to this code change but its logic does not match the code change, the score should be 4. The score should be lower if the logic is quite wrong.
Score 3: If the comment is related to the code change but cannot be inferred from the context, e.g., it contains information from other contexts that are not included in this code diff, the score should be 3.
Score 2: If the comment is related to the code diff context but not the code change itself, it should be scored with 2.
Score 1: Comments that are not related to the code change, or can be suitable for any code change, or are hard to understand, should get a low score of 1.
Examples are shown in Figure 6.

C CASE STUDY
In this section, we show more examples generated by our CodeReviewer model and other baseline models on the review generation task and the code refinement task.
Figure 7 shows two examples of the review generation task. In the first example, only CodeReviewer generates a comment similar to the ground truth, while the other models generate useless comments that are not specific to this code change. In the second example, the comments generated by CodeT5 and Codex are not related to the code change. Our model generates a comment that is highly related to the code change, but it is completely different from the ground truth, which indicates that there may be different perspectives from which to review the same code change.
Figure 8 and Figure 9 show two examples of the code refinement task. In Figure 8, CodeReviewer follows the instruction in the review comment and moves the assignment statement out of the method. So does CodeT5, but it alters other aspects that are not mentioned in the review. T5 does not get the meaning of the comment and makes no changes to the original code. In Figure 9, CodeReviewer performs exactly what the reviewer asks for, whereas CodeT5 incorrectly changes the operator and T5 makes numerous adjustments that are unrelated to the review.

Score 5
buf.WriteInt32(0);
- buf.WriteString(message);
+
+ // auto inject saylinks (say)
+ if (RuleB(Chat, AutoInjectSaylinksToSay)) {
+ std::string new_message = EQ::SayLinkEngine::InjectSaylinksIfNotExist(message);
+ buf.WriteString(new_message.c_str());
+ }
+ else {
+ buf.WriteString(message);
+ }
auto outapp = new EQApplicationPacket(OP_SpecialMesg, buf);
Review comment: `c_str()` is not needed. You can just use `buf.WriteString(new_message);`

Score 4
from functools import reduce
import scapy.modules.six as six
-from scapy.modules.six.moves import filter, range, zip
+from scapy.modules.six.moves import range, zip
Review comment: why did you remove `zip`? It is used!

Score 3
importedname = node.modname
else:
importedname = node.names[0][0].split('.')[0]
+ if node.as_string().startswith('from .'):
+ importedname = '.' + importedname
self._imports_stack.append((node, importedname))
@staticmethod
Review comment: What if the imported name already starts with `from __future__ import unicode_literals`?

Score 2
import java.util.logging.LogRecord;
* @author Wouter Zelle
+ * @deprecated This class will be complety removed in 7.0.0
*/
+@Deprecated
public class ConsoleLogHandler extends Handler
Review comment: So, the culprit was actually this class "ConsoleLogHandler", correct?

Score 1
import flatbuffers
from flatbuffers.compat import import_numpy
+from tests.namespace_test.NamespaceC.TableInC import TableInC, TableInCT
np = import_numpy()
class SecondTableInA(object):
Review comment: These changes will need to be made in the code generator

Figure 6: Examples of different scores for the relevance metric.



public class MockExecutorLoader implements ExecutorLoader {


public void unassignExecutor(int executionId) throws ExecutorManagerException {
executionExecutorMapping.remove(executionId);
}
+
+ @Override
+ public List<ExecutableFlow> fetchRecentlyFinishedFlows(long lifeTimeMs)
+ throws ExecutorManagerException {
+ return null;
+ }
}

Ground truth: I think we should return an empty list instead of null here.

T5: I don't think we need this.

CodeT5: I don't think this class belongs here.

Codex: I think this is a good idea.

CodeReviewer: return empty list instead?

public abstract class Directory implements Closeable {


* @throws AlreadyClosedException if this Directory is closed
*/
protected void ensureOpen() throws AlreadyClosedException {}
+
+ /**
+ * Implementations can override this if they are capable of reporting modification time
+ * of a file in seconds since the epoch.
+ */
+ public long fileModified(String name) throws IOException {
+ throw new UnsupportedOperationException();
+ }
+
}

Ground truth: I think we should avoid changing any lucene classes for the moment -
fileModified() can probably stay where it is?

T5: Why do we need this method? I don't see it being used anywhere else.

CodeT5: Shouldn't this be synchronized as well?

Codex: should we make this abstract?

CodeReviewer: I think this should throw `FileNotFoundException` instead of `IOException`

Figure 7: Examples of the review generation task.



Original Code:
private string ReplaceUnit(string resultText)
{
var unitMap = Config.UnitMap.OrderBy(o => o.Key.Length).ToDictionary(o => o.Key, p =>
p.Value);
foreach (var unit in unitMap.Keys)
{
resultText = resultText.Replace(unit, Config.UnitMap[unit]);
}
...
}

Review comment: This should only happen once in initialization, can you move it out of this
method?
CodeReviewer: (also the ground truth)
private string ReplaceUnit(string resultText)
{
foreach (var unit in Config.UnitMap.Keys)
{
resultText = resultText.Replace(unit, Config.UnitMap[unit]);
}
...
}
CodeT5:
private string ReplaceUnit(string resultText)
{
foreach (var unit in Config.UnitMap.OrderBy(o => o.Key.Length).ToDictionary(o => o.Key,
p => p.Value))
{
resultText = resultText.Replace(unit, Config.UnitMap[unit]);
}
...
}
T5:
private string ReplaceUnit(string resultText)
{
var unitMap = Config.UnitMap.OrderBy(o => o.Key.Length).ToDictionary(o => o.Key, p =>
p.Value);
foreach (var unit in unitMap.Keys)
{
resultText = resultText.Replace(unit, Config.UnitMap[unit]);
}
...
}

Figure 8: An example of the code refinement task.



Original Code:
try {
wallpaperManager.setBitmap(bitmap);
ViewUtil.showLongToast(context,
context.getString(R.string.wallpaper_set_successfully));
if (progressDialog.isShowing() && progressDialog != null) {
progressDialog.dismiss();
}
} catch (IOException e) { ... }

Review comment: flip the conditions, null check should be done before
CodeReviewer: (also the ground truth)
try {
wallpaperManager.setBitmap(bitmap);
ViewUtil.showLongToast(context,
context.getString(R.string.wallpaper_set_successfully));
if (progressDialog != null && progressDialog.isShowing()) {
progressDialog.dismiss();
}
} catch (IOException e) { ... }
CodeT5:
try {
wallpaperManager.setBitmap(bitmap);
ViewUtil.showLongToast(context,
context.getString(R.string.wallpaper_set_successfully));
if (progressDialog.isShowing() || progressDialog != null) {
progressDialog.dismiss();
}
} catch (IOException e) { ... }
T5:
try {
GoogleApiClientManager.setBitmap(bitmap);
ViewUtil.showLongToast(context,
context.getString(R.string.GoogleApiClient_set_successfully));
if (progressDialog.isShowing()) {
a.dismiss();
}
} catch (IOException e) { ... }

Figure 9: Another example of the code refinement task.
