A Structural Transformer with Relative Positions in Trees
for Code-to-Sequence Tasks
Abstract

We suggest two approaches to incorporate syntactic information into transformer models encoding trees (e.g. abstract syntax trees) and generating sequences. First, we use self-attention with relative position representations to consider structural relationships between nodes, using a representation that encodes movements between any pair of nodes in the tree, and demonstrate how those movements can be computed efficiently on the fly. Second, we suggest an auxiliary loss enforcing the network to predict the lowest common ancestor of node pairs. We apply both methods to source code summarization tasks, where we outperform the state-of-the-art by up to 6% F1. On natural language machine translation, our models yield competitive results. We also consistently outperform sequence-based transformers, and demonstrate that our method yields representations that are more closely aligned with the AST structure.

1 Introduction

Modeling the semantics of source code has recently emerged as a research topic, with applications in duplicate detection (Baxter et al., 1998), automatic summarization (Alon et al., 2018a), natural language database querying (Xu et al., 2017), bug triage (Mani et al., 2019) and semantic code retrieval (Gu et al., 2018). We focus on source-code related sequence generation tasks such as code summarization. Here, the most successful models learn embeddings of code elements and put a strong focus on code structure, usually exploiting the abstract syntax tree (AST) (Baxter et al., 1998; Alon et al., 2019, 2018a). In contrast, transformer networks like BERT (Devlin et al., 2018) – which are currently considered state-of-the-art in modeling natural language – are weak at exploiting such structure: their key component, self-attention, is based on a pairwise comparison of all tokens in the input sequence, whereas "structure" is only represented by adding absolute positional embeddings to the input.

To overcome these limitations, several approaches have been suggested recently. These linearize the input tree using a pre-order traversal and apply a modified "tree" transformer, either using absolute positional embeddings that encode structure (Shiv and Quirk, 2019), aggregation (Nguyen et al., 2020), or a boosting of attention weights with relative node positions (Kim et al., 2020). We continue this line of research with three contributions¹:

1. We extend relative positional embeddings (Shaw et al., 2018) to encode structural relationships between nodes in trees. To do so, we demonstrate how relative positions for trees can be computed efficiently and densely during training using simple matrix operations.

2. While auxiliary losses – such as next sentence prediction (Devlin et al., 2018) or sentence reordering (Sun et al., 2019) – have already been shown to improve accuracy on sequence tasks, we suggest a structural loss for trees by predicting lowest common ancestors.

3. We explore the two above strategies in quantitative experiments on source code understanding and machine translation. Combining our two approaches outperforms the state-of-the-art by up to 6% and offers significant improvements over sequential transformer baselines. Compared to other recent work on tree transformers, our approach either performs better (Shiv and Quirk, 2019) or performs comparably while offering much more scalable training (Nguyen et al., 2020). Finally, our approach requires only moderate adaptations to the conventional transformer architecture (namely, relative positional embeddings), and can be applied orthogonally to other tree extensions.

¹Our code and a demo are available under: hidden.
2 Related Work

Abstractive summarization condenses a source sequence into a short descriptive target sequence while maintaining its meaning (Fan et al., 2017), and has been approached with encoder-decoder architectures ranging from recurrent neural networks (optionally with attention) (Bahdanau et al., 2014) over convolutional networks (Gehring et al., 2017) to the attention-based transformer (Vaswani et al., 2017). Many summarization approaches use pointer networks (Vinyals et al., 2015) to mitigate the out-of-vocabulary problem by copying words from the input sequence (See et al., 2017), but recently attention has shifted to Byte-Pair Encoding (BPE) (Sennrich et al., 2015), which we utilize in this work.

Modeling the semantics of source code by predicting precise summaries or missing identifiers has been extensively studied in recent years (Allamanis et al., 2017). Allamanis et al. (2016) use a convolutional attention network over the tokens in a method to predict sub-tokens in method names. Recently, masked language model-based approaches (Devlin et al., 2018) have been applied as well: Feng et al. (2020) train CodeBERT on pairs of natural language and methods. However, most models treat source code as a token sequence and do not exploit the additional structural information provided by existing syntax parsers.

There have been various approaches to utilize structural information: Tai et al. (2015) propose the TreeLSTM network, which recursively encodes a tree by computing a node's representation from its children with an LSTM. The inverse problem of generating code from descriptions in natural language has been addressed by Yin and Neubig (2017) following a rule-based approach over ASTs. Similar to our work, Hu et al. (2018) linearize an AST and use a longer structure-based traversal (SBT) as input for a regular sequence-to-sequence model to predict comments. LeClair et al. (2019) summarize source code by using two encoder networks, one that encodes the SBT of the AST and another that encodes the textual information in the sequence. For the same purpose, Alon et al. (2018a)'s code2seq model encodes paths between terminal tokens in an AST using a dedicated encoder-decoder architecture with attention. In contrast to code2seq, our model encodes the full AST with all relative positions at once and implicitly models the path between any two nodes. LeClair et al. (2020) and Fernandes et al. (2018) propose a structured summarization approach for code and natural language by adding a graph neural network on top of a sequence-to-sequence encoder. The above approaches utilize neither transformer networks, nor structural losses or relative position representations.

While only few of the above methods employ transformer encoders, our work is closest to recent approaches enabling transformers to utilize syntax trees. Shiv and Quirk (2019) define absolute positional embeddings for regular trees and show that these can be used to leverage syntactic information (by converting the tree into a binary tree). Our model processes non-regular trees without modification, and orthogonally shows that relative positional representations are also appropriate to represent trees in transformers. Kim et al. (2020) utilize pre-computed relative node positions similar to ours, but only apply them for a scalar boosting of attention weights during self-attention. The hierarchical transformer (Nguyen et al., 2020) uses aggregation, masking and hierarchical embeddings during self-attention to incorporate structure into transformers. In contrast to this, our approaches pose rather mild modifications (structural embeddings) or no modification at all (structural loss) to the standard transformer architecture, at the cost of only tolerable losses in speed and memory.

3 Approach

As illustrated in Figure 1, our approach extends transformer models (Vaswani et al., 2017) to incorporate structural information from trees. For source code, those trees are abstract syntax trees (ASTs); for natural language, we use constituency parse trees. Specifically, we present two approaches: First, we extend relative position representations (Shaw et al., 2018) to encode the pairwise relations between nodes as movements (Section 3.1). Second, we introduce a structural loss enforcing the model to predict the lowest common ancestor of two nodes (Section 3.2) based on the encoder output.
[Figure 1: a pre-order traversed tree with pairwise movement relations a_ij between nodes (left); the relative transformer encoder using self-attention with relative position representations and the LCA loss on the encoder outputs z_1, ..., z_n (middle); the transformer decoder, whose inputs sum absolute positional embeddings e_pos and word embeddings e_word (right).]

Figure 1: Our approach feeds a pre-order traversed tree (left) into a transformer encoder (middle). We use self-attention with relative position representations and encode relationships between all nodes by up- and down-movements (gray dashed arrows). For example, the relation between x_6 (yellow) and x_n (blue) is +2↑1↓ (2 steps up, then right (+), then 1 step down). Our structural loss is based on predicting the lowest common ancestor (green) between randomly sampled node pairs (yellow and blue). The decoder operates on absolute positional embeddings.
We feed a tree into the transformer by replacing the conventional sequence of tokens with a pre-order sequence of nodes x = (x_1, x_2, x_3, ..., x_n), including both terminals and non-terminals. Every node i is reached via a unique path from the root (1 =) i_1, i_2, ..., i_u (= i), where i_{l-1} is the parent of i_l and u (or depth(i)) denotes the depth of node i. Based on this path, we define the ancestors and descendants of node i:

    anc(i) := {i_1, ..., i_u}
    desc(i) := {j ≠ i | i ∈ anc(j)}.

Note that while the ancestors include i, the descendants do not. Finally, we define the lowest common ancestor (LCA) of nodes i and j as

    lca(i, j) = \arg\max_{a ∈ anc(i) ∩ anc(j)} depth(a).

The node sequence x is first transformed into a sequence of real-valued input embeddings x_i ∈ R^d, which are processed by a regular transformer encoder (Vaswani et al., 2017) (for brevity we omit details here), resulting in a sequence of continuous representations z(x), or shorter z = (z_1, ..., z_n). From this, a standard auto-regressive transformer decoder generates an output token sequence y = (y_1, ..., y_m), e.g. a summary of the input program x. When generating token y_{i+1}, the decoder attends to the whole encoded sequence z as well as all previously generated symbols y_1, ..., y_i.

3.1 Relative Position Representations in Trees

Our model builds on relative position representations (RPR) (Shaw et al., 2018), which influence the attention between two tokens based on their pairwise relative position. A relative self-attention head in a transformer layer operates on an input sequence of embeddings x^L = (x^L_1, ..., x^L_n) with x^L_i ∈ R^d and outputs a new sequence x^{L+1} with x^{L+1}_i ∈ R^{d_h}. The pairwise relationship between input elements x^L_i and x^L_j is represented by learned embeddings a_{ij} ∈ R^{d_h} that are shared between attention heads, but not between layers. The output of self-attention with RPR is computed by

    x^{L+1}_i = \sum_{j=1}^{n} α_{ij} (x^L_j W^V)    (1)

where the weight coefficient α_{ij} is computed by a softmax over compatibility scores e_{ij}:

    e_{ij} = x^L_i W^Q (x^L_j W^K + a_{ij})^⊤ / \sqrt{d}    (2)

where the d×d_h matrices W^Q, W^K, W^V map the inputs to a head-specific embedding space². Note that Equation (2) is the original self-attention from Vaswani et al. (2017) if one omits a_{ij}.

²Note that Shaw et al. (2018) originally defined relative position representations a^K_{ij}, a^V_{ij} for keys and values, but showed in experiments that key embeddings seem to suffice. Correspondingly, we only use key embeddings.
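To make Equations (1) and (2) concrete, the following minimal single-head sketch (PyTorch; the naming is ours and this is not the released implementation) computes self-attention with relative key embeddings. It assumes the relative embeddings a_ij have already been looked up from the tree, as described below.

import math
import torch

def relative_self_attention(x, a, w_q, w_k, w_v):
    """Single attention head with relative position representations (Eqs. 1-2).

    x:   (n, d)       input embeddings of the n tree nodes
    a:   (n, n, d_h)  relative key embeddings a_ij
    w_q, w_k, w_v:    (d, d_h) head-specific projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (n, d_h) each
    # e_ij = q_i . (k_j + a_ij) / sqrt(d), cf. Eq. (2)
    e = (q.unsqueeze(1) * (k.unsqueeze(0) + a)).sum(-1) / math.sqrt(x.size(-1))
    alpha = torch.softmax(e, dim=-1)             # attention weights, (n, n)
    return alpha @ v                             # Eq. (1), (n, d_h)

In a multi-head layer, the same tensor of relative embeddings a_ij is shared across all heads, matching the description above.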
While Shaw et al. (2018) define the relative positional embeddings a_{ij} based on the linear distance between sequential input tokens, we extend a_{ij} to take the input tree structure into account. We encode the path between nodes i and j in a matrix M ∈ N^{n×n}: from i, we first take M_{ij} steps upward to lca(i, j) and then M_{ji} steps downward to j. We investigate two options for representing this path:

• Path length (clamped to a maximum): l(i, j) = M_{ij} + M_{ji} ∈ {0, ..., C}. Then a_{ij} is derived from an embedding table E:

    a_{ij} := E_{1_{i<j}, l(i,j)}    (3)

  where 1_{i<j} indicates whether node i is left of node j.

• Movement pattern: We distinguish between steps up (M_{ij}) and down (M_{ji}). Consider Figure 1, where we take 2 steps up and 1 step down from node x_6 to x_4 (−2↑1↓). Both values are used as indices into the embedding table:

    a_{ij} := E_{1_{i<j}, M_{ij}, M_{ji}}    (4)

  We clamp both M_{ij} and M_{ji} to a maximum of C steps, such that the embedding table has 2×(C+1)×(C+1) entries.

Computing relative node positions efficiently. The basis for relative position representations is the matrix M. We show that M – and with it a_{ij} – can be derived efficiently with matrix operations: We first represent the tree using a binary node incidence matrix N ∈ {0, 1}^{n×n} (see Appendix A for an illustration), which encodes each node's path to the root by

    N_{ij} = 1 if j ∈ anc(i), and N_{ij} = 0 otherwise.    (5)

Note that we defined i ∈ anc(i), and thus N_{ii} = 1. Based on N, we compute

• the depth of each node i by row-wise summation: depth(i) = \sum_j N_{ij},

• a (symmetric) ancestral matrix A ∈ N^{n×n} (Andriantiana et al., 2018), whose entry A_{ij} contains the level of lca(i, j) (in other words, the length of the common prefix) of two nodes i and j, by

    A = N N^⊤    (6)

• the matrix M ∈ N^{n×n} by

    M_{ij} = depth(i) − A_{ij}    (7)

Note that – since in a pre-order traversal each node's descendants directly follow the node – we can easily derive N from the size of each node's sub-tree, by filling |desc(j)|+1 rows in column j with ones, starting at the diagonal (N_{i,j} = 1 for i = j, j+1, ..., j+|desc(j)|). The number of descendants per node can be pre-computed in O(n) time and space, and the above operations can be efficiently conducted on a batch of trees on a GPU using parallel matrix multiplications in O(n³), which we found feasible for input lengths commonly used in NLP (e.g., 1024 – see Figure 4 in the Appendix).
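The following sketch (PyTorch, our own code) derives N, A, depth and M for a single pre-order traversed tree from its per-node descendant counts, and performs the movement-pattern lookup of Equation (4); the layout of the embedding table E is our assumption.

import torch

def movement_matrix(num_descendants):
    """Compute N (Eq. 5), A (Eq. 6) and M (Eq. 7) for one pre-order traversed tree.

    num_descendants: LongTensor (n,) holding |desc(j)| for every node j.
    """
    n = num_descendants.size(0)
    idx = torch.arange(n)
    # column j of N is 1 exactly for rows j, ..., j + |desc(j)| (pre-order property)
    N = ((idx.unsqueeze(1) >= idx.unsqueeze(0)) &
         (idx.unsqueeze(1) <= idx.unsqueeze(0) + num_descendants.unsqueeze(0))).long()
    A = N @ N.t()                       # ancestral matrix, Eq. (6)
    depth = N.sum(dim=1)                # depth(i) as the row-wise sum of N
    M = depth.unsqueeze(1) - A          # M_ij = steps up from i to lca(i, j), Eq. (7)
    return N, A, M

def relative_key_embeddings(E, M, C):
    """Movement-pattern lookup of Eq. (4). E is assumed to have shape (2, C+1, C+1, d_h)."""
    idx = torch.arange(M.size(0))
    left_of = (idx.unsqueeze(1) < idx.unsqueeze(0)).long()       # indicator 1_{i<j}
    return E[left_of, M.clamp(max=C), M.t().clamp(max=C)]        # (n, n, d_h)

For a batch of trees, the same operations can be carried out on padded matrices with torch.bmm, which is what makes the dense O(n³) computation mentioned above practical on a GPU.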
3.2 Structural Loss

The goal of our structural loss is to enforce the encoded representations z to adopt a notion of structural similarity from the AST. Thereby, we consider two nodes as "similar" if they are part of the same low-level syntactic unit (e.g., an if-statement), while nodes representing different passages in a text/program are considered dissimilar. We enforce this notion of similarity by training our model to predict nodes' LCAs. To do so, we concatenate the encoded representations z_i, z_j of two nodes to predict their lowest common ancestor a:

    v_{ij} = ReLU([z_i; z_j] · W + b)
    p_{lca}(a | i, j) = softmax(v_{ij} · Z)_a

where W ∈ R^{2d×d} and b ∈ R^d are learned, and Z ∈ R^{d×n} contains the stacked encoded representations. For each position a, the resulting softmax vector contains the probability p_{lca}(a | i, j) that node a is node i and j's lowest common ancestor. We then use a log-likelihood loss over node pairs (i, j):

    L_{lca}(Θ) = − \sum_{i=1}^{n} \sum_{j=1}^{n} log p_{lca}(lca(i, j) | i, j, z)    (8)

where Θ denotes the collection of model parameters and z is the encoder output. We complement this loss with a conventional cross-entropy translation loss (Vaswani et al., 2017)

    L_{trans}(Θ) = − \sum_{i=1}^{m} log p(y_i | y_{<i}, x)    (9)

More specifically, we use the label-smoothed version of Equation (9) (Szegedy et al., 2015). Here, (x, y) denotes a training pair consisting of a pre-ordered input tree x and its corresponding output sequence y. Overall, we train the model by minimizing both losses simultaneously, i.e. our loss function is L(Θ) = L_{trans}(Θ) + γ_{lca} · L_{lca}(Θ).

Sampling for Ancestor Prediction. Instead of a dense loss computation (Equation (8)), we implement an efficient random sampling of M node pairs (i, j) with their lowest common ancestor a = lca(i, j). Instead of sampling i and j and computing the corresponding a, we sample a and then draw random descendants (i, j) of a such that a = lca(i, j). This can be done in O(1) per sample: We first draw the LCA a, where the probability of drawing node a is chosen proportionally to the number of its descendants. This gives more weight to nodes at the top of the tree (which are more likely to occur as LCAs). Given ancestor a, we sample two nodes i, j from a's descendants. Since the sequence is a pre-order traversal of the tree, a's descendants can be found at positions a+1, ..., a+|desc(a)|. We distinguish two cases: (1) If a has at least two children, we draw samples i, j from two different children (or their descendants). (2) If a has only a single child, we choose i = a and j as a random descendant of a (if i and j were both descendants of a, their LCA would not be a but a's child or one of its descendants).
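The sampling scheme and the sampled version of Equation (8) could be implemented roughly as follows (a sketch with our own naming and data layout; the released code may differ, e.g. in how the two children are chosen). It assumes per-node descendant counts and child positions in the pre-order sequence, and an encoder output z of shape (n, d); in practice W and b would be the parameters of an nn.Linear.

import torch
import torch.nn.functional as F

def sample_lca_triples(num_descendants, children, num_samples):
    """Sample (i, j, lca) triples as described above.

    num_descendants: list of |desc(a)| per pre-order position a
    children:        list of lists with the pre-order positions of each node's children
    """
    # draw ancestors proportionally to their number of descendants (leaves get weight 0)
    weights = torch.tensor(num_descendants, dtype=torch.float)
    ancestors = torch.multinomial(weights, num_samples, replacement=True)
    triples = []
    for a in ancestors.tolist():
        ch = children[a]
        if len(ch) >= 2:
            # case (1): draw i and j from the subtrees of two different children of a
            c1, c2 = [ch[c] for c in torch.randperm(len(ch))[:2].tolist()]
            i = c1 + torch.randint(num_descendants[c1] + 1, (1,)).item()
            j = c2 + torch.randint(num_descendants[c2] + 1, (1,)).item()
        else:
            # case (2): a single child, pair a itself with one of its proper descendants
            i = a
            j = a + 1 + torch.randint(num_descendants[a], (1,)).item()
        triples.append((i, j, a))
    return triples

def lca_loss(z, triples, W, b):
    """Sampled log-likelihood loss of Eq. (8), averaged over the drawn pairs."""
    i, j, a = (torch.tensor(t) for t in zip(*triples))
    v = F.relu(torch.cat([z[i], z[j]], dim=-1) @ W + b)   # (m, d)
    logits = v @ z.t()                                    # score every node as the LCA
    return F.cross_entropy(logits, a)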
Model                                                      java-small          java-med            java-large
                                                           P     R     F1      P     R     F1      P     R     F1
Paths+CRFs (Alon et al., 2018b)†                           8.4   5.6   6.7     32.6  20.4  25.1    32.6  20.4  25.1
code2vec (Alon et al., 2019)†                              18.5  18.7  18.6    38.1  28.3  32.5    48.2  38.4  42.7
TreeLSTM (Tai et al., 2015)†                               40.0  31.8  35.5    53.1  41.7  46.7    60.3  48.3  53.6
2-layer BiLSTM†                                            42.6  30.0  35.2    55.2  41.8  47.5    63.5  48.8  55.2
ConvAttention (Allamanis et al., 2016)†                    50.3  24.6  33.1    60.8  26.8  37.2    60.7  27.6  38.0
Transformer (no tree) (Alon et al., 2018a)†                38.1  26.7  31.4    50.1  35.0  41.2    59.1  40.6  48.1
code2seq (Alon et al., 2018a)†                             50.6  37.4  43.0    61.2  47.1  53.2    64.0  55.0  59.2
Transformer (no tree) (Vaswani et al., 2017)               48.5  45.9  45.9    57.5  57.1  56.2    66.2  63.8  63.9
Absolute Tree Transformer (k = 64) (Shiv and Quirk, 2019)  47.2  45.6  45.0    59.3  57.9  57.3    66.6  64.2  64.3
Relative Structural Transformer                            52.7  47.6  48.6    61.3  60.0  59.4    66.6  64.6  64.5

Table 1: Micro precision (P), recall (R) and F1 for method naming. Results marked with † are taken from Alon et al. (2018a).

Model                                           B1    B2    B3     B4    BLEU-4
attendgru (LeClair and McMillan, 2019)†         -     -     -      -     17.4
ast-attendgru (LeClair et al., 2019)‡           37.1  21.1  14.27  10.9  18.7
graph2seq (Xu et al., 2018)‡                    37.6  21.3  14.1   10.6  18.6
code2seq (Alon et al., 2018a)‡                  37.5  21.4  14.4   11.0  18.8
BiLSTM+GNN-LSTM (Fernandes et al., 2018)‡       37.7  21.5  14.6   11.1  19.1
code+gnn+BiLSTM-2hops (LeClair et al., 2020)‡   39.1  22.5  15.3   11.7  19.9
Transformer (no tree) (Vaswani et al., 2017)    40.3  23.6  16.4   12.6  21.1
Relative Structural Transformer                 42.3  24.4  16.8   12.9  21.7

Table 2: Code captioning on the FunCom dataset (java). We report the cumulative BLEU-4 score, together with single n-gram scores up to 4 n-grams (B1, ..., B4), evaluated with the script released along with the dataset. Results marked with † have been reported by LeClair and McMillan (2019), results marked with ‡ by LeClair et al. (2020).

4 Experimental Setup

Most code-to-sequence models are evaluated on method naming and code captioning (or code-to-documentation) tasks (Alon et al., 2018a). Method naming aims to predict the usually short name of a method given its signature and body. Our second task – code captioning – aims at generating a longer natural language description of a given function, where the first line of its documentation is used as ground truth. On all code-related tasks we use ASTs as input. Additionally, we show that our model is also applicable to natural language, where we translate using constituency parse trees as input.

4.1 Preprocessing Source Code

When parsing code to ASTs³, the set of terminal nodes becomes large. To overcome this problem, we follow common NLP practice (Babii et al., 2019) and tokenize the identifiers associated with terminal nodes into fine-grained tokens: (1) Inspired by Alon et al. (2018a), we split all identifiers on camel case or underscores, i.e. "getNumber_Hex16" becomes "get Number Hex 16". (2) Next, we apply Byte-Pair Encoding (BPE) to the above tokens (Sennrich et al., 2015), obtaining "get Num@@ ber Hex 16". (3) To reduce the vocabulary further, we modify the BPE algorithm so that when a token is a number, it is split into single digits, similar to character-level language modeling (Al-Rfou et al., 2019). This allows us to represent all possible numbers with only 20 tokens in our vocabulary. The example becomes "get Num@@ ber Hex 1@@ 6". (4) Additionally, we replace all string and character literals in the source code with an identifier (e.g. <STRING>). Note that this tokenization yields multiple tokens for each terminal node/identifier, which alters the tree structure. When splitting a terminal node x_i into tokens (t_1, ..., t_m), we replace x_i with these tokens. Each token becomes a new node: the first token t_1 has x_i's parent as parent, and each other token has its predecessor as parent. In the example, we obtain get←Num@@←ber←Hex←1@@←6. Note that the BPE encoding is removed when computing evaluation measures such as BLEU.

³We use the tree-sitter library for parsing code into ASTs.
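A rough sketch of steps (1) and (3) of this pipeline, camel-case/underscore splitting and digit splitting, is shown below (the regular expression is our own; the BPE step itself and the literal replacement are omitted, and here digit splitting is applied directly to the tokens rather than inside the BPE algorithm).

import re

def split_identifier(identifier):
    """Step (1): split an identifier on camel case and underscores."""
    tokens = []
    for part in re.split(r"_+", identifier):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return tokens

def split_digits(tokens):
    """Step (3): represent numbers by their single digits."""
    out = []
    for tok in tokens:
        out += list(tok) if tok.isdigit() else [tok]
    return out

print(split_digits(split_identifier("getNumber_Hex16")))
# ['get', 'Number', 'Hex', '1', '6']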
Model                                          Ruby   JavaScript  Go     Python  Java   PHP    Overall
Seq2Seq (Feng et al., 2020)†                   9.64   10.21       13.98  15.93   15.09  21.08  14.32
Transformer (Feng et al., 2020)†               11.18  11.59       16.38  15.81   16.26  22.12  15.56
RoBERTa (Feng et al., 2020)†                   11.17  11.90       17.72  18.14   16.47  24.02  16.57
CodeBERT (RTD+MLM) (Feng et al., 2020)†        12.16  14.90       18.07  19.06   17.65  25.16  17.83
Transformer (no tree) (Vaswani et al., 2017)   13.90  14.61       18.08  18.19   18.20  23.12  17.68
Relative Structural Transformer                14.81  14.96       18.62  17.93   18.63  23.77  18.12

Table 3: Code captioning on the CodeSearchNet dataset. As Feng et al. (2020), we report smoothed cumulative BLEU-4 scores. Results marked with † have been reported by Feng et al. (2020).

4.2 Tasks

Task 1: Method Naming. We evaluate the method naming task on three different datasets introduced by Alon et al. (2018a): java-small, java-med, and java-large. All three datasets consist of java files from selected open-source projects, split into training, validation and test sets on project level.

The java-small dataset consists of java files from 11 different projects, of which 9 are used for training, 1 for validation and 1 for testing. It contains around 700k samples/methods. The java-med dataset consists of 800 projects for training, 100 for validation and 100 for testing, containing about 4M samples. The largest dataset, java-large, contains 9,000 projects for training, 250 for validation and 300 for testing, containing 16M samples. Method names in all three datasets have 3.1 BPE tokens on average.

Following the work of Allamanis et al. (2016) and Alon et al. (2018a), we predict the method name as a sequence of sub-tokens split on camel case and underscores (compare Section 4.1). Note that the BPE encoding applied in preprocessing is removed before evaluating the model's performance. Like Alon et al. (2018a), we report case-insensitive micro precision, recall and F-measure over the target sequence.
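This metric can be computed along the following lines (a sketch; it may differ in detail from the official code2seq evaluation scripts, e.g. in how empty predictions are handled).

from collections import Counter

def micro_prf1(predictions, references):
    """Case-insensitive micro precision/recall/F1 over predicted sub-tokens.

    predictions, references: lists of sub-token lists,
    e.g. [['get', 'number']] vs. [['get', 'hex', 'number']].
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred = Counter(t.lower() for t in pred)
        ref = Counter(t.lower() for t in ref)
        overlap = sum((pred & ref).values())       # sub-tokens found in both
        tp += overlap
        fp += sum(pred.values()) - overlap
        fn += sum(ref.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(micro_prf1([['get', 'number']], [['get', 'hex', 'number']]))
# (1.0, 0.666..., 0.8)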
Task 2: Code Captioning. We evaluate the code captioning task on the FunCom dataset introduced by LeClair and McMillan (2019) and on the CodeSearchNet dataset (Husain et al., 2019).

The FunCom dataset consists of 2.1M java function/description pairs, with descriptions parsed from the first sentence of a Javadoc comment. This dataset has been constructed from a much larger dataset of 51M java methods (Lopes et al., 2010) by filtering for English comments and removing pairs with less than three or more than 13 tokens in the description, or more than 100 tokens in the method. A target description contains 7.6 tokens on average. We reuse the provided training, validation and test split, which has been created by splitting per project. We report case-insensitive corpus-level BLEU scores (Papineni et al., 2002) and use the evaluation scripts released by LeClair and McMillan (2019) along with the dataset.

The CodeSearchNet dataset consists of function/documentation pairs in 6 different programming languages (Go, Java, JavaScript, PHP, Python, Ruby). The main task for this dataset is code search (i.e. given a documentation, find the correct function), but it has also been applied to code-to-documentation generation by Feng et al. (2020) (CodeBERT). After pre-training CodeBERT on the full dataset, the authors filtered the dataset to remove samples of poor quality and subsequently fine-tune and evaluate the CodeBERT model for code-to-documentation generation on the filtered subset. To be comparable to CodeBERT, we re-use only the filtered subset for training and evaluation of our model and use the same evaluation script⁴ that computes the smoothed BLEU-4 score as Feng et al. (2020). Furthermore, we omit splitting on camel case for this dataset, as this would yield a different tokenization that would influence the resulting BLEU score. We train a joint encoder on all languages to leverage knowledge transfer to the less frequent languages.

⁴Feng et al. (2020) use the same evaluation script for computing BLEU scores as Iyer et al. (2016).

Task 3: Neural Machine Translation. To test our model on a non-code task, we evaluate it on machine translation on the IWSLT'14 English-German (En-De and De-En) dataset. We replicate the training settings from Nguyen et al. (2020), thereby also using the same preprocessing script to create parse trees with the Stanford CoreNLP parser (Manning et al., 2014), and we also report tokenized BLEU-4. We use the same form of BPE on terminal nodes as on the code-related tasks, but omit splitting on camel case (10k subwords).

4.3 Hyperparameters and Setup

We implemented our model in PyTorch with the fairseq toolkit (Ott et al., 2019), on top of an existing transformer implementation. As our focus is on studying syntax-specific extensions, we restrict ourselves to a standard transformer architecture and optimize only hyperparameters relevant to our extensions: We replicate the same transformer architecture and most hyperparameters that have been used in the experiments of Nguyen et al. (2020), consisting of 6 transformer layers in the encoder and decoder, with 4 attention heads, 1024-dimensional feed-forward layers, d = 512 dimensional token embeddings, and shared input and output embedding matrices in the transformer decoder. Like Nguyen et al. (2020), we train our model using the Adam optimizer and an inverse square root learning rate scheduler with a linear warm-up for 4,000 updates up to a peak learning rate of 5e−4, and a dropout rate of 0.3. For our final evaluations, we generate sequences with beam search using a beam width of 5 and additionally prohibit repeating n-grams of length 2.

We learn a Byte-Pair Encoding of 16k sub-words on the training data for all code-related tasks. For the lowest common ancestor loss (Section 3.2) we sample M = min(n, 50) node pairs per method. We also optimize the following parameters manually on the validation set: the batch size, γ_lca, whether path length or movements are used for relative position representations, the clamping value C (Section 3.1), and whether the model is trained with a joint vocabulary for encoder and decoder, in which case we share the encoder's token embedding matrix with the decoder. We conduct three runs with the best hyperparameter setting and report results for the model with the highest BLEU/F1 on the validation data. The full set of hyperparameters is provided in Appendix A.1.

For the transformer baseline, we tokenize the source code in the same way as for ASTs (Section 4.1) and train a regular transformer model on the resulting token sequence (Vaswani et al., 2017) with the same set of hyperparameters.

Model                                               En-De   De-En
Tree2Seq (Shi et al., 2018)†                        24.01   29.95
Conv-Seq2Seq (Gehring et al., 2017)†                24.76   30.32
Transformer (Vaswani et al., 2017)†                 28.35   34.42
Dynamic Conv (Wu et al., 2019)†                     28.43   34.72
Hierarchical Transformer (Nguyen et al., 2020)†     29.47   35.96
Transformer (no tree) (Vaswani et al., 2017)        28.30   34.62
Relative Structural Transformer                     29.40   35.32

Table 4: Machine translation on the IWSLT'14 dataset. We report the tokenized cumulative BLEU-4 score. Results marked with † have been reported by Nguyen et al. (2020).

Relationship Type   C   γ_lca   P     R     F1
-                   -   -       56.2  56.2  55.1
-                   -   0.3     60.8  59.3  58.7
Movements           2   -       60.5  58.8  58.4
Movements           2   0.05    61.1  59.2  58.9
Movements           2   0.3     61.3  60.0  59.4
Path-Length         8   0.3     60.6  59.4  58.8

Table 5: Ablation study on java-med. All models are trained on linearized ASTs. Adding a structural loss or relative positions improves performance. Adding both gives the best results.

5 Experimental Results

Tables 1–3 cover the three code-related tasks, comparing our approach with state-of-the-art methods and a regular sequential transformer baseline. In Table 1 we additionally conduct experiments with a transformer that uses absolute tree positional embeddings (Shiv and Quirk, 2019). We investigate the use of our approach on natural language machine translation in Table 4. Finally, we also study the effect of our structural loss function and relative position representations in ablation studies.

Task 1: Method Naming. On the method naming task our model outperforms the state-of-the-art code2seq model (Alon et al., 2018a) on the java-small, java-med and java-large datasets by up to 6.2% F1. We consistently outperform the absolute tree transformer (Shiv and Quirk, 2019), which is – on java-small – outperformed by a token-level transformer. On all three datasets our token-level transformer baseline outperforms the current state-of-the-art (which does not use BPE) by up to 4.7%. This demonstrates the effectiveness of our preprocessing pipeline, particularly the use of BPE for source code. Our tree extensions improve performance further, especially on the smaller datasets. We conclude that our approach successfully adds a structural prior. A t-SNE visualization in Figure 5 confirms that our extensions yield embeddings that are more closely aligned with the AST structure.

Task 2: Code Captioning. On the code-to-documentation task on the FunCom dataset, our model outperforms the state-of-the-art by 1.8%. Including structural information proves beneficial, as we outperform a token-level transformer baseline by 0.6%. On the CodeSearchNet dataset, our approach – which is trained end-to-end – outperforms the language-model-based CodeBERT on most languages, with an overall improvement of 0.3%. Our baseline outperforms the reported transformer baselines – which use a non-code-specific BPE⁵ and no literal replacements – by 2.1%.

⁵CodeBERT re-uses the BPE of RoBERTa.

Task 3: Machine Translation. Our model is able to utilize the structural prior and improves over a token-level transformer baseline by up to 1.1%. It thereby outperforms various other recent models and yields competitive results compared to the hierarchical transformer (Nguyen et al., 2020). Also, we found our approach to be much more scalable with respect to input sequence length (Figure 4 in the Appendix), both in terms of speed (left) and memory (right)⁶. Since sequences are longer and datasets larger in code-related tasks than in machine translation, memory limitations prevent us from training the hierarchical transformer with adequate batch sizes. Figure 4 also shows that our approach's performance is comparable to that of a sequential transformer, with only negligible speed and memory losses. Overall, this indicates that our method is also applicable to natural language in general and may even be used for longer documents.

⁶We used the reference implementation of the hierarchical transformer (https://github.com/nxphi47/tree_transformer) with the same hyperparameters as ours.

Ablation Studies. We investigate the impact of our two extensions on the java-med dataset. Table 5 shows that simply linearizing the AST without any structural information results in a performance drop of 4.3% compared to the best model. Using the structural loss or relative positions independently improves performance. Even though the LCAs used for ancestor prediction can be inferred from the relative positions, the system benefits from additionally modeling them implicitly via an auxiliary loss. We hypothesize that the two approaches complement each other, as LCA prediction enforces the model to represent relative positions in the encodings z.

6 Conclusion

We propose two extensions to transformer models with relative position representations to incorporate structural information from trees: (1) relative position representations that encode the position of nodes explicitly, using a representation that encodes movements between any pair of nodes in the tree, and (2) a new structural loss based on lowest common ancestor (LCA) prediction. In a broader sense, our model offers a simple yet effective way to incorporate syntactic information into transformer models. This may be interesting for modeling long-term dependencies in NLP tasks such as relation extraction or co-reference resolution. Another interesting direction for future research would be to incorporate our structural extensions not only into supervised training, but also into language models, e.g. for unsupervised learning of source code semantics.
References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159–3166.

Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton. 2017. A survey of machine learning for big code and naturalness. CoRR, abs/1709.06182.

Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pages 2091–2100.

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018a. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400.

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018b. A general path-based representation for predicting program properties. In ACM SIGPLAN Notices, volume 53, pages 404–419. ACM.

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):40.

Eric O. D. Andriantiana, Kenneth Dadedzi, and Stephan Wagner. 2018. The ancestral matrix of a rooted tree.

Hlib Babii, Andrea Janes, and Romain Robbes. 2019. Modeling vocabulary for big code machine learning. arXiv preprint arXiv:1904.01873.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ira D. Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant'Anna, and Lorraine Bier. 1998. Clone detection using abstract syntax trees. In Proceedings of the International Conference on Software Maintenance, pages 368–377. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured neural summarization. CoRR, abs/1811.01824.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243–1252. JMLR.org.

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 933–944. IEEE.

Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, pages 200–210. ACM.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083.

Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2020. Code prediction by feeding trees to transformers.

Alex LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In 2020 IEEE/ACM International Conference on Program Comprehension.

Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for generating natural language summaries of program subroutines. CoRR, abs/1902.01954.

Alexander LeClair and Collin McMillan. 2019. Recommendations for datasets for source code summarization. CoRR, abs/1904.02660.

C. Lopes, S. Bajracharya, J. Ossher, and P. Baldi. 2010. UCI source code data sets.

Senthil Mani, Anush Sankaran, and Rahul Aralikatte. 2019. DeepTriage: Exploring the effectiveness of deep learning for bug triaging. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pages 171–179. ACM.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
Xuan-Phi Nguyen, Shafiq Joty, Steven C.H. Hoi, and Richard Socher. 2020. Tree-structured attention with hierarchical accumulation. arXiv preprint arXiv:2002.08046.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.

Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. 2018. On tree-based neural sentence modeling. arXiv preprint arXiv:1808.09644.

Vighnesh Shiv and Chris Quirk. 2019. Novel positional encodings to enable tree-based transformers. In Advances in Neural Information Processing Systems 32, pages 12081–12091. Curran Associates, Inc.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019. ERNIE 2.0: A continual pre-training framework for language understanding. CoRR, abs/1907.12412.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430.

Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin. 2018. Graph2Seq: Graph to sequence learning with attention-based neural networks.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. CoRR, abs/1704.01696.
A Appendices

A.1 Datasets and Hyperparameters

Relevant statistics about the datasets used are provided in Table 6, and additional hyperparameters for the experiments are listed in Table 7. Additionally, we provide code, data and configs for our experiments at https://removed-for-blind-review.net. We ran a hyperparameter sweep over the clamping distance and explored the values C = {2, 3, 8, 16} for movements and C = {4, 6, 8, 16, 32} for path length. For γ_lca we investigated {0.05, 0.3, 0.6, 1}. We report the best-performing hyperparameters in Table 7. For the absolute tree transformer (Shiv and Quirk, 2019) we ran a hyperparameter sweep over the maximum tree depth k = {32, 64, 256}.

For the machine translation experiments on the IWSLT'14 dataset we follow the approach of Nguyen et al. (2020) and use 5% of the ≈160k sentence pairs for validation, combine (IWSLT14.TED.dev2010, dev2012, tst2010-tst2012) for testing, and also use Stanford's CoreNLP (v3.9.2) to parse trees. We average the 5 checkpoints with the best validation BLEU.

A.2 Infrastructure Details

We vary the number of GPUs (all Nvidia GeForce GTX 1080 Ti) with the amount of training data. We train models on the java-med, java-large and code captioning datasets on 3 GPUs, but for java-small and the machine translation experiments in Table 4 we use a single GPU.

A.3 Sample Visualizations

To illustrate the effect of our structural loss on the transformer, Figure 5 visualizes the (t-SNE-transformed) representations z produced by the encoder. Edges in the visualization correspond to edges in the AST. When using only relative position representations, but no structural loss (right), AST nodes are scattered over the embedding space, where mostly the representations of identical input tokens form clusters (lower right). In contrast, our relative structural transformer using the LCA loss (left) produces clusters that strongly align with subtrees: obviously, the representation of a code component (such as a for-loop) is similar to its subcomponents (such as the statements within the loop), an indicator that attention is aligned with the tree structure and focuses more strongly on close components in the tree.

(a) Example tree: n0 is the root with children n1 and n3; n2 is a child of n1; n4, n5 and n7 are children of n3; n6 is a child of n5.

(b) Node incidence matrix N (Equation (5)):

    1 0 0 0 0 0 0 0
    1 1 0 0 0 0 0 0
    1 1 1 0 0 0 0 0
    1 0 0 1 0 0 0 0
    1 0 0 1 1 0 0 0
    1 0 0 1 0 1 0 0
    1 0 0 1 0 1 1 0
    1 0 0 1 0 0 0 1

(c) Ancestral matrix A (Equation (6)):

    1 1 1 1 1 1 1 1
    1 2 2 1 1 1 1 1
    1 2 3 1 1 1 1 1
    1 1 1 2 2 2 2 2
    1 1 1 2 3 2 2 2
    1 1 1 2 2 3 3 2
    1 1 1 2 2 3 4 2
    1 1 1 2 2 2 2 3

(d) Movements matrix M (Equation (7)):

    0 0 0 0 0 0 0 0
    1 0 0 1 1 1 1 1
    2 1 0 2 2 2 2 2
    1 1 1 0 0 0 0 0
    2 2 2 1 0 1 1 1
    2 2 2 1 1 0 0 1
    3 3 3 2 2 1 0 2
    2 2 2 1 1 1 1 0

Figure 2: Visualization of the matrices used for computing relative position representations for trees (nodes in pre-order n0, ..., n7).
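These matrices can be reproduced with a few lines of NumPy; the sketch below (our own code) builds N directly from the parent of each node in the tree of Figure 2(a) and applies Equations (6) and (7).

import numpy as np

# parents of n0, ..., n7 in pre-order (None marks the root), as in Figure 2(a)
parents = [None, 0, 1, 0, 3, 3, 5, 3]
n = len(parents)

N = np.zeros((n, n), dtype=int)
for i in range(n):
    a = i
    while a is not None:        # walk up to the root; anc(i) includes i itself
        N[i, a] = 1
        a = parents[a]

A = N @ N.T                     # ancestral matrix, Eq. (6)
depth = N.sum(axis=1)           # depth(i), the row sums of N
M = depth[:, None] - A          # movements matrix, Eq. (7)

print(M[6])                     # -> [3 3 3 2 2 1 0 2], cf. the row for n6 in Figure 2(d)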
               java-small  java-med   java-large  CodeSearchNet  FunCom     IWSLT'14
Samples Train  665,115     3,004,536  15,344,512  908,224        1,937,136  160,239
Samples Valid  23,505      410,699    320,866     44,689         106,153    7,283
Samples Test   56,165      411,751    417,003     52,561         52,561     6,750

Table 6: Statistics about the datasets used.

                                java-small          java-med            java-large            CodeSearchNet       FunCom              IWSLT'14
                                Baseline  RST       Baseline  RST       Baseline  RST         Baseline  RST       Baseline  RST       Baseline  RST
Warmup Updates                  4000      4000      4000      4000      10000     10000       4000      4000      4000      4000      4000      4000
Max Epoch                       30        30        60        60        60        60          60        60        60        60        70        70
Validation Metric               F1        F1        F1        F1        F1        F1          BLEU      BLEU      BLEU      BLEU      BLEU      BLEU
Validation Performance          44.84     46.19     56.05     58.61     63.31     63.66       8.32*     8.92*     23.42     23.91     29.69     30.07
Max Source Positions            512       512       512       512       512       512         1024      1024      512       512       1024      1024
Max Target Positions            80        80        80        80        80        80          1024      1024      30        30        1024      1024
Batch Size (tokens/batch/GPU)   8192      8192      8192      8192      8192      8192        6146      6146      6146      6146      4096      10240
Accumulate Gradients (batches)  8         8         8         8         8         8           6         6         6         6         -         -
Share Embeddings (Enc./Dec.)    Yes       Yes       Yes       Yes       No        No          Yes       Yes       No        No        No        No
Relationship Type               -         Movements -         Movements -         Path-Length -         Movements -         Movements -         Path-Length
C                               -         2         -         2         -         8           -         2         -         2         -         8
γ_lca                           -         0.3       -         0.3       -         0.3         -         0.3       -         0.3       -         0.3
Parameters (Million)            38.76     38.79     39.3      39.9      47.5      47.6        39.8      40.4      47.5      48.3      39.5      40
Average Runtime (hours)         10        12        22        25        66        71          20        23        16        19        7         9

Table 7: Additional hyperparameters for the experiments in Tables 1–4. RST denotes our Relative Structural Transformer. (*) denotes that we used a different BLEU implementation during validation than for testing. For all experiments we set label smoothing = 0.1, learning rate = 5e-4, optimizer = Adam with betas (0.9, 0.98), and weight decay = 0.0001.

boolean isPrime(int x) {
    double mRoot = Math.sqrt(x);
    for(int i=2; i <= mRoot; ++i) {
        if(x % i == 0){
            return false;
        }
    }
    return true;
}

[AST: a MethodDeclaration node with children BasicType (boolean), FormalParameter (BasicType int, x) and a body consisting of a LocalVariableDeclaration (BasicType double, VariableDeclarator with the split terminal m←Ro@@←ot and the MethodInvocation Math.sqrt with MemberReference x), a ForStatement and a ReturnStatement with a Literal (true).]

Figure 3: A java method that computes whether a number is a prime, together with its preprocessed abstract syntax tree.
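To obtain the pre-order node sequence and the per-node descendant counts used in Section 3.1 from a tree such as the one in Figure 3, a flattening step along the following lines suffices (a sketch with our own minimal Node class; the released preprocessing is more involved).

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def flatten(root):
    """Pre-order traversal returning node labels and |desc(i)| per node."""
    labels, num_desc = [], []

    def visit(node):
        pos = len(labels)
        labels.append(node.label)
        num_desc.append(0)
        for child in node.children:
            visit(child)
        num_desc[pos] = len(labels) - pos - 1   # nodes emitted below this one

    visit(root)
    return labels, num_desc

# tiny excerpt of the tree in Figure 3
tree = Node("ReturnStatement", [Node("Literal", [Node("true")])])
print(flatten(tree))   # (['ReturnStatement', 'Literal', 'true'], [2, 1, 0])

The resulting num_desc list is exactly the descendant-count input assumed by the matrix sketch after Section 3.1.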
1201 1251
1202 1252
1203 1253
1204 1254
1205 1255
1206 Transformer (bsz=2) Transformer (bsz=2) 1256
Hierarchical Transformer (bsz=2) Hierarchical Transformer (bsz=2)
1207 Relative Structural Transformer (bsz=2) Relative Structural Transformer (bsz=2) 1257
Transformer (bsz=8) Transformer (bsz=8)
Hierarchical Transformer (bsz=8) Hierarchical Transformer (bsz=8)
Seconds / Update

1208 100 1258


Relative Structural Transformer (bsz=8) 101 Relative Structural Transformer (bsz=8)
1209 1259

GiB
1210 1260
1211 1261
1212 10 1 1262
1213 100 1263
100 200 300 400 500 600 700 800 900 1000 1100 1200 100 200 300 400 500 600 700 800 900 1000 1100 1200
1214 Sample length Sample length 1264
1215 1265
Figure 4: Memory usage (GiB) and speed comparison (seconds per iteration) with respect to sequence length
1216 using a fixed batch-size (bzs – in samples per batch). We compare a regular transformer – whose results are also 1266
1217 applicable to the absolute tree transformer (Shiv and Quirk, 2019), as absolute positional representations can be 1267
1218 pre-computed, the hierarchical transformer (Nguyen et al., 2020) using the released reference implementation and 1268
1219 our relative structural transformer utilizing all proposed extensions. All benchmarks are done with the same model 1269
1220
architecture (in terms of layers, dimensions and vocabulary), the same datasets – containing only samples of a 1270
specific length – on a single Quadro RTX 8000 (48GiB).
1221 1271
1222 1272
1223 1273
1224 1274
1225 1275
1226 1276
1227 1277
1228 1278
1229 1279
1230 1280
1231 1281
1232 1282
1233 1283
1234 1284
1235 1285
1236 1286
1237 1287
1238 1288
1239 1289
1240 1290
1241 1291
Figure 5: t-SNE visualization of the encoded node representations of a model trained with (left) and without (right)
1242 1292
a structural loss on java-med. The root node is marked dark red, artificial/abstract tokens – that don’t appear in
1243 1293
source code – are red and nodes that appear in source code green. Nodes are connected by lines, as in the original
1244 AST. 1294
1245 1295
1246 1296
1247 1297
1248 1298
1249 1299

13
public Throwable blockingGetError() {
    if (getCount() != 0) {
        try {
            BlockingHelper.verifyNonBlocking();
            await();
        } catch(InterruptedException ex) {
            dispose();
            return ex;
        }
    }
    return error;
}

Target: "Block until the latch is counted down and return the error received or null if no error happened ."
Transformer (no tree): "Returns an error if there is one . Otherwise returns nil ."
Relative Structural Transformer: "This method blocks until there is an error or the end of the queue is reached ."

public static ScheduledExecutorService create(ThreadFactory factory) {
    final ScheduledExecutorService exec = Executors.newScheduledThreadPool(1, factory);
    tryPutIntoPool(PURGE_ENABLED, exec);
    return exec;
}

Target: "Creates a ScheduledExecutorService with the given factory ."
Transformer (no tree): "Create a new ScheduledExecutorService ."
Relative Structural Transformer: "Creates a ScheduledExecutorService with the given ThreadFactory ."

Figure 6: Sample predictions on CodeSearchNet.