[Figure 2 content: two example documents annotated with the perspective labels (sentence splitting, word unification, grammatical error correction, sentence deletion, sentence reordering, conjunction insertion, translation). English translations of the example documents:]

Post-revision document: "I should review English education in elementary schools. Until now, the mainstream of English education has focused on grammar and formal aspects. However, this is somewhat counterproductive for elementary school students who have not yet formed the ability to think logically. It is effective for them to learn practical English. Therefore, I believe that English education in Japan should be practical."

Revised document: "The development of social media has made it easier to get information. On the other hand, there can be difficulties in handling vast amounts of information. Also, in most cases, we only use social media to access our favorite types and sources of information. Previously, many people got the same information from newspapers and television, and thus, they could talk on an equal footing. Now, however, some people unknowingly treat their closely held opinions as complete information, so their information is biased. Therefore, social media seems to be a treasure trove of information, but it may also be a tool for maintaining biased information."
Figure 2: Example of the Japanese multi-perspective document revision dataset. The colors of the characters correspond to the perspectives in Figure 1.
dundant expressions, text style normalization, and punctuation errors. For example, in revising redundant expressions, we remove expressions with the same meaning or words that make sense without them (e.g., 一番最後→最後 # last, することができる→できる # can).

(2) Sentence splitting  Sentences longer than 60 characters, which is the length of a typical sentence in Japanese, should be divided because long sentences decrease readability. We defined this perspective with a specific numerical threshold to prevent different labelers from splitting sentences differently.
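To make the threshold concrete, the check below is a minimal sketch of how such a rule could be applied automatically; the helper name and the use of the Japanese full stop as a sentence delimiter are our own assumptions, not the authors' annotation tooling (the paper's labelers applied the threshold manually).

```python
# Minimal sketch of the 60-character splitting rule (illustrative only).
MAX_SENTENCE_LEN = 60  # typical Japanese sentence length, per the paper

def flag_long_sentences(document: str) -> list[str]:
    """Return sentences longer than MAX_SENTENCE_LEN characters,
    i.e., candidates for sentence splitting."""
    # Split on the Japanese full stop and keep it for the length count.
    sentences = [s + "。" for s in document.split("。") if s]
    return [s for s in sentences if len(s) > MAX_SENTENCE_LEN]
```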
(3) Word unification  To avoid confusing the reader, words that have the same meaning in a document should be expressed in the same way. In addition, in Japanese documents, the expressions "desu, masu" or "da, dearu" are used at the ends of sentences, and these expressions need to be unified within the same document.

(4) Subject restoration  Throughout Japanese documents, the subject is often omitted; however, subject restoration may be necessary because a sentence without a subject may not correctly convey the writer's intent to the reader. Thus, we restore the subject in sentences that lack one. Note that subject restoration should not be performed when the subject is clearly recognizable, because consecutive occurrences of the same subject may reduce readability.

(5) Sentence reordering  This perspective includes the sentence ordering task (Barzilay and Lapata, 2008). A well-organized document with a logical structure is much easier for people to read and understand. Also, sentence order is important in constructing logical structures. Thus, we reorder sentences so that a document has a logical order.

(6) Sentence deletion  A coherent document consists of multiple sentences describing a single topic, but a document that mixes multiple topics can inhibit the reader's understanding. Thus, we delete sentences that describe distinctly different topics from a document.

(7) Conjunction insertion  This perspective includes the discourse relation classification task (Liu et al., 2016). Discourse relations support a set of sentences in forming a coherent document. Also, conjunctions play a role in showing these relations and serve as a natural means of connecting sentences. Thus, we restore conjunctions in documents to improve readability. We created a list of conjunctions that shows their kinds and roles, and we asked labelers to select from this list.
3.2 Dataset specifications

Source documents: To make the source documents, we hired 161 Japanese workers through a crowdsourcing service and asked them to write paragraph-level documents in Japanese. The documents had an essay-style structure because Japanese schools teach how to write essays; thus, we expected that many workers could write essays at the same level. First, we showed the workers 48 themes, and each worker selected 1-15 of them. The 48 themes were chosen by the crowdsourcing company from actual themes that were used for exam essays in Japan. Next, the workers wrote paragraph-level documents, each of which contained 200-300 characters and four or more sentences. We could
Figure 4: Example of training and decoding the multi-perspective document revision model using switching tokens.
contexts. In the training phase, all three datasets are trained jointly by distinguishing each task with three switching tokens. In the decoding phase, the model performs the multi-perspective document revision task by feeding the switching tokens [gec_on], [ci_on], and [other_on]. Note that we can also perform the grammatical error correction or conjunction insertion task by feeding the appropriate switching tokens.

4.2 Modeling method

We define a source document as X = \{x_1, \cdots, x_m, \cdots, x_M\} and a revised document as Y = \{y_1, \cdots, y_n, \cdots, y_N\}, where M and N are the numbers of tokens in the source and revised documents, respectively. x_m and y_n are tokens that include not only characters or words but also punctuation marks. In this paper, we handle a set of sentences, so X and Y contain multiple sentences. Thus, we introduce a [CLS] token at the beginning of each sentence in all datasets to distinguish the sentences within a document.
The multi-perspective document revision model predicts the generation probabilities of a revised document Y given a source document X and switching tokens s_{1:T} = \{s_1, \cdots, s_t, \cdots, s_T\}, where T is the total number of partial tasks included in the partially-matched datasets. The generation probability of Y is defined as

P(Y \mid X, s_{1:T}; \Theta) = \prod_{n=1}^{N} P(y_n \mid y_{1:n-1}, X, s_{1:T}; \Theta),   (1)

where y_{1:n-1} = \{y_1, \cdots, y_{n-1}\} and \Theta represents the trainable parameters. s_t is the t-th switching token, represented as

s_t \in \{[t-th task_on], [t-th task_off]\}.   (2)
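As a concrete illustration, the sketch below assembles an encoder input from the switching tokens and a [CLS]-delimited document. The helper name and the convention of prepending the switching tokens are our assumptions; the paper only states that the tokens are added to the model input.

```python
# Sketch: building the model input from switching tokens and a
# [CLS]-delimited source document (helper and prepending convention
# are illustrative assumptions, not the authors' code).
def build_model_input(sentences: list[list[str]],
                      switches: list[str]) -> list[str]:
    """Attach the T switching tokens and mark each sentence with [CLS]."""
    tokens = list(switches)
    for sentence in sentences:
        tokens.append("[CLS]")  # distinguishes sentences within a document
        tokens.extend(sentence)
    return tokens

# Full multi-perspective revision: all three tasks switched on.
doc = [["今", "日", "は", "晴", "れ", "。"]]  # toy character tokens
x = build_model_input(doc, ["[gec_on]", "[ci_on]", "[other_on]"])
# x == ['[gec_on]', '[ci_on]', '[other_on]', '[CLS]', '今', ...]
```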
In this paper, we use Transformer pointer-generator networks (Deaton, 2019) for this modeling. These networks are effective for monolingual translation tasks because they contain a copy mechanism that copies tokens from the source text to help generate infrequent tokens. Note that our method does not change the architecture of the Transformer pointer-generator network but merely adds switching tokens to the model input.

Pre-training: In this paper, we use a MAsked Pointer-Generator Network (MAPGN) (Ihori et al., 2021a) for self-supervised pre-training of the seq2seq model because it is a suitable pre-training method for pointer-generator networks. In MAPGN, the pointer-generator network is pre-trained by predicting a sentence fragment y_{a:b} given a masked sequence Y_{/a:b}. Here, Y_{/a:b} denotes a sequence in which positions a to b are masked, and y_{a:b} denotes the fragment of Y from a to b. The model parameter set can be optimized from an unpaired dataset D_u that consists of a set of sentences. The training loss function L is defined as

L = -\sum_{Y \in D_u} \log P(y_{a:b} \mid y_{a-1}, Y_{/a:b}; \Theta)   (3)
  = -\sum_{Y \in D_u} \sum_{t=a}^{b} \log P(y_t \mid y_{a-1:t-1}, Y_{/a:b}; \Theta).

Note that all switching tokens have to be included in the vocabulary during pre-training.
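The following is a minimal sketch of the span masking behind Eq. (3); the mask token name and the span-sampling strategy are our assumptions (the paper masks roughly 50% of the input during pre-training, as noted in Section 5.2).

```python
import random

# Sketch of MAPGN-style span masking for Eq. (3) (mask token and
# sampling strategy are illustrative assumptions).
def mask_span(tokens: list[str], mask: str = "[MASK]"):
    """Mask a contiguous fragment y_{a:b}; return the masked sequence
    Y_{/a:b} and the target fragment y_{a:b}."""
    a = random.randrange(len(tokens))
    b = random.randrange(a, len(tokens))  # inclusive end of the span
    masked = tokens[:a] + [mask] * (b - a + 1) + tokens[b + 1:]
    return masked, tokens[a:b + 1]
```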
Fine-tuning: During fine-tuning, the matched dataset D_m and multiple partially-matched datasets \{D_{pm}^1, \cdots, D_{pm}^p, \cdots, D_{pm}^P\} are trained jointly in a single model. P is the number of partially-matched datasets. The training loss function L is defined as

L = L_m + \sum_{p=1}^{P} L_{pm}^p,   (4)
where L_m is the loss function for the main task, computed as

L_m = -\sum_{(X', Y') \in D_m} \log P(Y' \mid X', s'_{1:T}; \Theta),   (5)

where s'_{1:T} = \{s'_1, \cdots, s'_T\} are switching tokens and s'_t is represented as

s'_t = [t-th task_on].   (6)

L_{pm}^p is the loss function for the p-th partially-matched dataset, computed as

L_{pm}^p = -\sum_{(X^p, Y^p) \in D_{pm}^p} \log P(Y^p \mid X^p, s^p_{1:T}; \Theta),   (7)

where s^p_{1:T} = \{s^p_1, \cdots, s^p_T\} are switching tokens and s^p_t is represented as

s^p_t = \begin{cases} [t-th task_on] & \text{if the } t\text{-th task is in } D_{pm}^p, \\ [t-th task_off] & \text{otherwise.} \end{cases}   (8)
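Putting Eqs. (4)-(8) together, the sketch below shows one way the joint objective could be computed; `nll` (a sequence-level negative log-likelihood under the model) and the dataset structures are hypothetical stand-ins, not the authors' training code.

```python
# Sketch of the joint fine-tuning loss of Eqs. (4)-(8); `nll` is a
# hypothetical callable returning -log P(Y | X, s_{1:T}; Theta).
def switch_tokens(active: set[int], T: int) -> list[str]:
    """Eq. (8): s_t is "on" iff the t-th task is covered by the dataset."""
    return [f"[{t}-th task_{'on' if t in active else 'off'}]"
            for t in range(1, T + 1)]

def joint_loss(nll, matched, partially_matched, T: int) -> float:
    # L_m (Eqs. (5)-(6)): the matched dataset switches every task on.
    loss = sum(nll(x, y, switch_tokens(set(range(1, T + 1)), T))
               for x, y in matched)
    # L_pm^p (Eqs. (7)-(8)): each partially-matched dataset switches on
    # only the tasks it actually covers.
    for active, dataset in partially_matched:
        loss += sum(nll(x, y, switch_tokens(active, T)) for x, y in dataset)
    return loss  # Eq. (4)
```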
Decoding: The decoding problem using switching tokens is defined as

\hat{Y} = \arg\max_{Y} P(Y \mid X, s_{1:T}; \Theta).   (9)

The model can perform the multi-perspective document revision task or each partial task in accordance with the given switching tokens.
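As a usage sketch of Eq. (9), the same trained model can be pointed at different tasks purely through the switching tokens fed at decoding time; the `decode` callable is a hypothetical beam-search wrapper, and the token names follow Section 4.1.

```python
# Hypothetical usage: selecting a task at decoding time via switching
# tokens; `decode` stands in for beam search over Eq. (9).
def run_task(decode, source_tokens: list[str], mode: str):
    switches = {
        "full_revision": ["[gec_on]", "[ci_on]", "[other_on]"],
        "gec_only":      ["[gec_on]", "[ci_off]", "[other_off]"],
        "ci_only":       ["[gec_off]", "[ci_on]", "[other_off]"],
    }[mode]
    return decode(switches + source_tokens)
```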
5 Experiments

We experimentally evaluated the effectiveness of this modeling method, which can use both a matched dataset and multiple partially-matched datasets.

5.1 Dataset

In this paper, we use two partial tasks, grammatical error correction (gec) and conjunction insertion (ci), to build the multi-perspective document revision model. Accordingly, we use three datasets: the multi-perspective document revision dataset described in Section 3, a Japanese grammatical error correction dataset (Tanaka et al., 2020), and a conjunction insertion dataset. We made the conjunction insertion dataset by deleting and restoring conjunctions from the Japanese Wiki-40B dataset (Guo et al., 2020), which is a high-quality processed Wikipedia dataset. To make this dataset from paragraph-level documents, we divided the dataset into paragraphs and extracted the documents that contained conjunctions.
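A minimal sketch of this construction is given below; the conjunction list, the segmentation on the Japanese full stop, and the pairing convention are our assumptions rather than the authors' released tooling.

```python
# Sketch: building (source, target) pairs for conjunction insertion by
# deleting sentence-initial conjunctions (list and segmentation are
# illustrative assumptions).
CONJUNCTIONS = ("しかし", "また", "つまり", "したがって")  # however, also, in other words, therefore

def make_ci_pair(paragraph: str):
    """Return (conjunction-free source, original target), or None if the
    paragraph contains no sentence-initial conjunction."""
    sentences = [s for s in paragraph.split("。") if s]
    if not any(s.startswith(c) for s in sentences for c in CONJUNCTIONS):
        return None  # keep only documents that contain conjunctions
    stripped = []
    for s in sentences:
        for c in CONJUNCTIONS:
            if s.startswith(c):
                s = s[len(c):].lstrip("、")  # also drop a following comma
                break
        stripped.append(s)
    return "。".join(stripped) + "。", "。".join(sentences) + "。"
```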
In addition, we use 880k unpaired paragraph-level documents for self-supervised pre-training; these documents, which were also prepared from the Wiki-40B dataset, were not used in the conjunction insertion dataset. The details of these datasets are listed in Table 3, where "switch" refers to the switching tokens. For training and decoding, we use three switching tokens: one each for the gec task, the ci task, and the other tasks. The multi-perspective document revision model can also perform the gec and ci tasks by feeding the appropriate switching tokens, so we evaluate the performance of each partial task using its test set. Note that the Japanese grammatical error correction dataset does not have documents because it consists of single sentences.

Table 3: Details of the matched dataset and the two partially-matched datasets.

             Dataset   # of documents   # of sentences
Training        a           5,000           26,477
                b             -            506,786
                c          90,000          533,422
Validation      a             554            2,922
                b             -              8,542
                c          10,000           59,396
Test            a           1,121            6,054
                b             -              8,542
                c           1,000            6,026
switch          a       [gec_on][ci_on][other_on]
                b       [gec_on][ci_off][other_off]
                c       [gec_off][ci_on][other_off]

a. Multi-perspective document revision dataset; b. Japanese grammatical error correction dataset; c. Conjunction insertion dataset.

5.2 Setup

For evaluation purposes, we constructed 11 Transformer-based pointer-generator networks. We use the three datasets combined in different ways: each dataset alone, the matched dataset with each partially-matched dataset, and the three datasets jointly. Then, we construct the models with and without switching tokens. In addition, we apply pre-training to the models that use the matched dataset only and the three datasets jointly with switching tokens.
We used the following configurations. The encoder and decoder had a 4-layer and a 2-layer Transformer block, respectively, with 512 units. The output unit size (corresponding to the number of tokens in the pre-training data) was set to 12,773. To train the Transformer pointer-generator networks, we used the RAdam optimizer (Liu et al., 2019) and label smoothing (Lukasik et al., 2020) with a smoothing parameter of 0.1. We set the mini-batch size to 32 documents and the dropout rate in each Transformer block to 0.1. All trainable parameters were initialized randomly, and we used characters as tokens. In the pre-training, we set the number of masked tokens to roughly 50% of an input sequence. For decoding, we used the beam search algorithm with a beam size of 4.
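For reference, the setup above can be summarized as a single configuration sketch; the field names are our own, while the values are those stated in this section.

```python
# Configuration summary of Section 5.2 (field names are illustrative).
CONFIG = {
    "encoder_layers": 4,
    "decoder_layers": 2,
    "hidden_units": 512,
    "vocab_size": 12773,           # tokens in the pre-training data
    "optimizer": "RAdam",
    "label_smoothing": 0.1,
    "batch_size_documents": 32,
    "dropout": 0.1,
    "tokenization": "character",
    "pretrain_mask_ratio": 0.5,    # ~50% of the input sequence masked
    "beam_size": 4,
}
```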
5.3 Results

Tables 4 and 5 show the results for the 11 Transformer pointer-generator networks. In these tables, a, b, and c represent the multi-perspective document revision dataset, the Japanese grammatical error correction dataset, and the conjunction insertion dataset, respectively. Also, "switch" and "PT" indicate switching tokens and pre-training, respectively. We used automatic evaluation scores in terms of two metrics: GLEU (Napoles et al., 2015) and F0.5. Specifically, we calculated these metrics over characters in the generated documents and used 4-grams for GLEU. In addition, we also calculated the F1 score for conjunction insertion, denoted as C-F1, to evaluate the performance of the conjunction insertion task. We evaluated whether the system could insert conjunctions with the correct meaning because multiple conjunctions have the same sense (e.g., "but" and "however").

Table 4: Results of the multi-perspective document revision task. The "Source" row indicates results for the source documents in the multi-perspective document revision dataset.

                      Dataset   GLEU    F0.5    C-F1
Source                   -      0.886   0       0
Baseline    (1)          a      0.857   0.198   0.193
            (2) + PT            0.884   0.321   0.211
+ Datasets  (3)         a+b     0.863   0.189   0.164
            (4)         a+c     0.881   0.155   0.101
            (5)        a+b+c    0.883   0.236   0.205
+ Switch    (6)         a+b     0.887   0.278   0.163
            (7)         a+c     0.888   0.234   0.214
            (8)        a+b+c    0.889   0.282   0.270
            (9) + PT            0.892   0.333   0.274

Table 5: Results of the grammatical error correction (gec) and conjunction insertion (ci) tasks.

Task   Dataset     GLEU    F0.5    C-F1
gec    b           0.943   0.635    -
       a+b+c       0.932   0.613    -
       + switch    0.943   0.630    -
ci     c           0.964   0.198   0.230
       a+b+c       0.966   0.207   0.222
       + switch    0.967   0.239   0.263

Results of the multi-perspective document revision task: In Table 4, the performance improved as the number of partially-matched datasets increased. This indicates that switching-token-based joint modeling, which trains a matched dataset and multiple partially-matched datasets using switching tokens, is effective for modeling the multi-perspective document revision task. Moreover, distinguishing tasks is important for switching-token-based joint modeling, because the scores with switching tokens were higher than those without them. In addition, when we compared the results of lines (8) and (9), the results with pre-training outperformed those without it. This indicates that switching-token-based joint modeling can be effectively applied after self-supervised pre-training.

Figure 5 shows the generation examples of lines (2), (5), and (8) in Table 4. In the figure, the generation example of (2) shows that conversion errors decreased, but task-specific generation, such as conjunction insertion, was not performed well. On the other hand, the generation example of (8) shows that task-specific generation performed better than in (2). Thus, we suppose it is difficult to improve the performance of the multi-perspective document revision task by applying pre-training alone. Moreover, the generation example of (5) has more task-specific generation errors than (8). These facts indicate that the switching tokens are important for joint training of the matched and multiple partially-matched tasks.

Results of partial tasks: Table 5 shows that in grammatical error correction, the performance of individual modeling and switching-token-based joint modeling did not differ significantly. However, joint modeling without switching tokens under-performed individual modeling. For the conjunction insertion task, the results of joint modeling outperformed those of individual modeling. Also, the results of joint modeling with switching tokens outperformed those without switching tokens. Therefore, these results indicate that switching-token-based joint modeling can improve the performance of the multi-perspective document revision task without impairing the performance of each partial task. Note that this study aims to improve the performance
[Figure 5: Generation examples discussed in Section 5.3 (English translations). The variants differ mainly in the conjunction inserted before the final sentence:]

"... In today's society, ... "co-dependent relationships" have become a problem. In ancient times and in an environment full of enemies, one's life was one's own ... a sign of adulthood. To be dependent on someone else was to risk the lives of many."

"... a sign of adulthood. In other words, to be dependent on someone else was to risk the lives of many."

"... a sign of adulthood. Thus, to be dependent on someone else was to risk the lives of many."

"... Since in ancient times and in an environment full of enemies, one's life was one's own ... a sign of adulthood. Therefore, to be dependent on someone else was to risk the lives of many."

"... a sign of adulthood. In other words, to be dependent on someone else was to risk the lives of many."

i: Correction focused on grammatical error correction, ii: Correction focused on conjunction insertion