Towards Fully Automated Manga Translation
Ryota Hinami, Shonosuke Ishiwatari, Kazuhiko Yasuda, Yusuke Matsui
Abstract
We tackle the problem of machine translation (MT) of manga, Japanese comics. Manga translation involves two important problems in MT: context-aware and multimodal translation. Since text and images are mixed up in an unstructured fashion in manga, obtaining context from the image is essential for its translation. However, it is still an open problem how to extract context from images and integrate it into MT models. In addition, corpora and benchmarks to train and evaluate such models are currently unavailable. In this paper, we make the ...
[Figure: overview of automatic parallel corpus construction. Text regions are masked and OCRed, then (d) split into aligned Japanese-English sentence pairs (J_i, E_i), e.g. お前は "誰だ?" / "WHO ARE ...", "ロボ" / "ROBOT", which are used to train the NMT model.]
Figure 7: Outputs of the sentence-based (center) and frame-based (right) models. The values after H and B are respectively the human evaluation and BLEU scores for each page. ©Mitsuki Kuchitaka

Figure 8: Translation results with and without visual features. ©Miki Ueda, ©Satoshi Arai.
… all important information is correctly translated). All the methods explained above other than Google Translate were compared. The order of presenting the methods was randomized. In total, we collected 5 participants × 5 methods × 214 pages = 5,350 samples. See the supplemental material for details of the evaluation system. We also conducted an automatic evaluation using BLEU (Papineni et al. 2002).
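As a rough illustration (not the authors' evaluation code), corpus-level BLEU over the collected translations can be computed with sacrebleu; the variable names and data loading below are assumptions.

```python
# Hedged sketch: corpus-level BLEU with sacrebleu (not the paper's exact script).
# `hypotheses` and `references` are assumed to be parallel lists of English
# sentences, one hypothesis and one reference translation per source text.
import sacrebleu

def corpus_bleu_score(hypotheses, references):
    # sacrebleu expects a list of reference streams; here a single reference set.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Example: print(corpus_bleu_score(["who are you ?"], ["who are you ?"]))  # 100.0
```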
Results: Table 1 shows the results of the manual and automatic evaluation. The large improvement of Sentence-NMT (Manga) over Google Translate and Sentence-NMT (OS18) indicates the effectiveness of our strategy of manga corpus construction.

A pair-wise bootstrap resampling test (Koehn 2004) on the results of the human evaluation shows that Scene-NMT outperformed Sentence-NMT (Manga). On the other hand, there is no statistically significant difference between 2+2 and Sentence-NMT (Manga). These results suggest that not only the contextual information itself but also an appropriate way of grouping it is essential for accurate translation.
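As a rough sketch of how such a significance test can be run (not the authors' implementation), paired bootstrap resampling repeatedly resamples pages with replacement and counts how often one system's mean score beats the other's; all function and variable names below are illustrative.

```python
# Hedged sketch of paired bootstrap resampling (Koehn 2004) over per-page scores.
# `scores_a` and `scores_b` are assumed to be parallel lists of per-page human
# evaluation scores for two systems (e.g., Scene-NMT vs. Sentence-NMT (Manga)).
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n, wins_a = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample pages with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    # Fraction of resamples where system A beats system B; a value of roughly
    # 0.95 or higher suggests A is significantly better at the 5% level.
    return wins_a / n_resamples
```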
In contrast to the results of the human evaluation, the BLEU scores of the context-aware models (fourth to sixth lines in Table 1) are worse than that of Sentence-NMT (Manga). These results suggest that BLEU is not suitable for evaluating manga translations. Fig. 7 shows an example where Scene-NMT outperformed Sentence-NMT (Manga) in the manual evaluation but had lower BLEU scores. Here, we can see that only Scene-NMT has swapped the order of the texts. This flexibility naturally resolves the differences in word order between Japanese and English. However, it results in a worse BLEU score since the references usually maintain the original order of the texts.

Although there is no statistically significant difference between Scene-NMT and Scene-NMT w/ visual, Fig. 8 shows some promising results; pronouns ("you" and "her") that cannot be estimated from textual information are correctly translated by using visual information. These examples indicate that we need to combine textual and visual information to appropriately translate the content of manga. However, we found that a large portion of the errors of Scene-NMT w/ visual are caused by incorrect visual features. To fully understand the impact of the visual features (i.e., semantic tags) on translation, we conducted the analysis in Fig. 10: (i) and (ii) in the figure show the outputs of Scene-NMT and Scene-NMT w/ visual, respectively. The pronoun errors in (ii) are caused by the incorrect visual feature "Multiple Girls" extracted from the original image. When we overwrote the character's face with a male face, Scene-NMT w/ visual output the correct pronouns, as shown in (iii). This result shows that the Scene-NMT w/ visual model does consider visual information when determining its translations, and that it would improve if visual features were extracted more accurately. Designing such a recognition model for manga images remains future work.

Evaluation of Corpus Construction
To evaluate the performance of corpus construction, we compared the following four approaches: 1) Box: Bounding boxes from the speech bubble detector are used as text regions instead of segmentation masks. This is the baseline of a simple combination of speech bubble detection and OCR (sketched below).
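A rough sketch of this Box baseline follows (not the authors' code); `detector` and `ocr` are hypothetical stand-ins for a trained speech-bubble detector and a Japanese OCR model.

```python
# Hedged sketch of the "Box" baseline: crop raw speech-bubble bounding boxes and
# run OCR on each crop, with no segmentation-mask refinement.
# `detector(page)` is assumed to yield ((x0, y0, x1, y1), score) pairs and
# `ocr(crop)` to return the recognized Japanese text; both are hypothetical.
from PIL import Image

def box_baseline(page_path, detector, ocr, score_thresh=0.5):
    page = Image.open(page_path).convert("RGB")
    texts = []
    for (x0, y0, x1, y1), score in detector(page):
        if score < score_thresh:
            continue
        crop = page.crop((x0, y0, x1, y1))   # the raw rectangle, no mask applied
        texts.append(ocr(crop))
    return texts
```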
Figure 9: Results of fully automatic manga translation from Japanese to English and Chinese. ©Masami Taira, ©Syuji Takeya

[Figure 10 (partial): panel (i) "w/o visual", with example outputs such as "maybe I'm just tired." and "we'll wake him up later."]

Table 2: Corpus construction performance on the PubManga.
References
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017a. Feature pyramid networks for object detection. In Proc. CVPR, 2117–2125.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017b. Focal loss for dense object detection. In Proc. ICCV, 2980–2988.
Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C. L.; and Dollár, P. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312.
Lison, P.; Tiedemann, J.; and Kouylekov, M. 2018. OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. In Proc. LREC.
Maruf, S.; and Haffari, G. 2018. Document Context Neural Machine Translation with Memory Networks. In Proc. ACL.
Maruf, S.; Martins, A. F.; and Haffari, G. 2019. Selective Attention for Context-aware Neural Machine Translation. In Proc. NAACL.
Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; and Aizawa, K. 2017. Sketch-based Manga Retrieval using Manga109 Dataset. Multimedia Tools and Applications 76(20): 21811–21838.
Nakazawa, T.; Sudoh, K.; Higashiyama, S.; Ding, C.; Dabre, R.; Mino, H.; Goto, I.; Pa Pa, W.; Kunchukuttan, A.; and Kurohashi, S. 2018. Overview of the 5th Workshop on Asian Translation (WAT2018). In Proc. WAT.
Saito, M.; and Matsui, Y. 2015. Illustration2vec: a semantic vector representation of illustrations. In SIGGRAPH Asia 2015 Technical Briefs, 1–4.
Scherrer, Y.; Tiedemann, J.; and Loáiciga, S. 2019. Analysing concatenation approaches to document-level NMT in two different domains. In Proc. DiscoMT.
Shi, B.; Bai, X.; and Yao, C. 2017. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE TPAMI 39(11): 2298–2304.
Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2016. Robust scene text recognition with automatic rectification. In Proc. CVPR.
Specia, L.; Frank, S.; Sima'an, K.; and Elliott, D. 2016. A Shared Task on Multimodal Machine Translation and Crosslingual Image Description. In Proc. WMT.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. NIPS.
Tiedemann, J.; and Scherrer, Y. 2017. Neural Machine Translation with Extended Context. In Proc. DiscoMT.
Tu, Z.; Liu, Y.; Shi, S.; and Zhang, T. 2018. Learning to remember translation history with a continuous cache. TACL 6: 407–420.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In Proc. NIPS.
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In Proc. CVPR.
Voita, E.; Sennrich, R.; and Titov, I. 2019a. Context-Aware Monolingual Repair for Neural Machine Translation. In Proc. EMNLP-IJCNLP.
Voita, E.; Sennrich, R.; and Titov, I. 2019b. When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion. In Proc. ACL.
Voita, E.; Serdyukov, P.; Sennrich, R.; and Titov, I. 2018. Context-Aware Neural Machine Translation Learns Anaphora Resolution. In Proc. ACL.
Wang, L.; Tu, Z.; Way, A.; and Liu, Q. 2017. Exploiting Cross-Sentence Context for Neural Machine Translation. In Proc. EMNLP.
Werlen, L. M.; Ram, D.; Pappas, N.; and Henderson, J. 2018. Document-Level Neural Machine Translation with Hierarchical Attention Networks. In Proc. EMNLP.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs/1609.08144.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 1492–1500.
Xiong, H.; He, Z.; Wu, H.; and Wang, H. 2019. Modeling coherence for discourse neural machine translation. In Proc. AAAI.
Zhang, J.; Luan, H.; Sun, M.; Zhai, F.; Xu, J.; Zhang, M.; and Liu, Y. 2018. Improving the Transformer Translation Model with Document-Level Context. In Proc. EMNLP.
Zhou, M.; Cheng, R.; Lee, Y. J.; and Yu, Z. 2018. A Visual Attention Grounding Neural Model for Multimodal Machine Translation. In Proc. EMNLP.
Dataset details
Table A describes the details of the datasets used in our experiments. The Manga corpus is used to train our machine translation models. OpenMantra is used to evaluate machine translation. PubManga is used to evaluate corpus extraction and text ordering. Manga109 is used to train and evaluate the object detectors.

[Figure: text line extraction. Panels: (a) columns with black pixels, (b) concatenation (vertical text), (d) concatenation (horizontal text); noisy or too-narrow columns are skipped, and extracted text lines are grouped into a text frame.]
Detection performance on Manga109 (AP and AP50 for the text class, followed by AP and AP50 for the frame class):

Method                          Input size   Backbone          Text AP   Text AP50   Frame AP   Frame AP50
SSD-fork (Ogawa et al. 2018)    300          VGG               n/a       84.1        n/a        96.9
Faster R-CNN                    500          ResNet-101        65.0      92.5        91.6       97.5
Faster R-CNN                    800          ResNet-101        69.3      94.4        92.5       97.6
Faster R-CNN                    1170         ResNet-101        71.2      94.9        92.5       97.7
Faster R-CNN                    1170         ResNet-50         70.9      94.8        90.7       97.5
Faster R-CNN                    1170         ResNet-101-FPN    70.3      94.4        92.5       97.7
Faster R-CNN                    1170         ResNeXt-101       70.4      94.5        92.9       98.5
RetinaNet                       1170         ResNet-101        70.6      95.4        89.8       98.3
Ogawa et al. (2018) reported that the performance of Faster R-CNN is much poorer than that of their SSD-based model; this is because they trained Faster R-CNN as a multiclass object detector. They mentioned that it is usually difficult to train a multiclass detection model on comic images in the same way as in generic detection because some objects overlap significantly. Instead, we trained a Faster R-CNN with a single class. The table also shows several important tips for object detection in manga. For example, using a larger input size is effective for the text class, while it is not effective for the frame class because frames tend to be larger. Therefore, in practice, the computational time can be reduced by using a small-sized input for the frame class. In addition, several architectures that have had success in generic object detection (RetinaNet and the ResNeXt/FPN backbones (Lin et al. 2017b; Xie et al. 2017; Lin et al. 2017a)) do not improve the accuracy on this task.
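As an illustration of the single-class setup (not the authors' code), a Faster R-CNN restricted to one target class can be built with torchvision as sketched below. The ResNet-50-FPN backbone, pretrained weights, and parameter values are assumptions; the best text result in the table above instead used a plain ResNet-101 backbone at an input size of 1170.

```python
# Hedged sketch: a single-class Faster R-CNN (one target class + background)
# built with torchvision. The backbone and weights are stand-ins; the paper's
# detector configuration differs (see the table above).
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_single_class_detector(min_size=1170):
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights="DEFAULT",   # COCO-pretrained; an assumption, not the paper's setting
        min_size=min_size,   # larger inputs helped the text class in the table above
    )
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Two outputs: background + the single target class ("text" or "frame").
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
    return model
```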
Hyperparameters of the NMT module
We implement a Transformer (big) model with the fairseq (Ott et al. 2019) toolkit and set its default parameters in accordance with Vaswani et al. (2017). The model is trained with the Adam (Kingma and Ba 2015) optimizer. The hyperparameters of the model and optimizer are detailed in Tab. D.

Table D: Hyperparameters for training the NMT model.
# Layers of encoder: 6
# Layers of decoder: 6
# Dimensions of encoder embeddings: 1024
# Dimensions of decoder embeddings: 1024
# Dimensions of FFN encoder embeddings: 4096
# Dimensions of FFN decoder embeddings: 4096
# Encoder attention heads: 16
# Decoder attention heads: 16
β1 of Adam: 0.9
β2 of Adam: 0.98
Learning rate: 0.001
Learning rate for warm-up: 1e-07
Warm-up steps: 4000
Dropout probability: 0.3
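For concreteness, the settings in Tab. D roughly correspond to the fairseq-train options sketched below. This is an illustrative mapping rather than the authors' actual command; the data path and --max-tokens value are assumptions not given in Tab. D.

```python
# Hedged sketch: invoking fairseq-train with the Tab. D hyperparameters.
# "data-bin/manga_ja_en" and --max-tokens are placeholders/assumptions.
import subprocess

cmd = [
    "fairseq-train", "data-bin/manga_ja_en",
    "--arch", "transformer_vaswani_wmt_en_de_big",   # Transformer (big)
    "--encoder-layers", "6", "--decoder-layers", "6",
    "--encoder-embed-dim", "1024", "--decoder-embed-dim", "1024",
    "--encoder-ffn-embed-dim", "4096", "--decoder-ffn-embed-dim", "4096",
    "--encoder-attention-heads", "16", "--decoder-attention-heads", "16",
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)",
    "--lr", "0.001", "--lr-scheduler", "inverse_sqrt",
    "--warmup-init-lr", "1e-07", "--warmup-updates", "4000",
    "--dropout", "0.3",
    "--max-tokens", "4096",                          # assumption, not in Tab. D
]
subprocess.run(cmd, check=True)
```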
GUI of User Study
The evaluation system for the user study was developed as a web application. The whole GUI is visualized in Fig. C. A Japanese page and its English translated page are shown to a participant, who selects a score for each sentence using the check boxes. Unlike usual plain-text translation evaluation, this study directly compares the translated pages.

… should be detected and processed separately, which has remained as future work.

Text Cleaning Examples
Fig. E shows examples of text cleaning. Our inpainting-based method removes Japanese text even when it lies on textures, although the inpainted texture differs slightly from the original.
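As a rough stand-in for this cleaning step (the paper's actual inpainting model is not reproduced here), classical OpenCV inpainting over the detected text mask illustrates the idea; the mask source, file paths, and radius below are assumptions.

```python
# Hedged sketch: remove detected Japanese text by inpainting over its mask with
# OpenCV. This is a classical stand-in, not the paper's learned inpainting method.
import cv2

def clean_text(page_bgr, text_mask, radius=3):
    # page_bgr: H x W x 3 uint8 image; text_mask: H x W uint8, nonzero on text pixels.
    return cv2.inpaint(page_bgr, text_mask, radius, cv2.INPAINT_TELEA)

# Usage (paths are placeholders):
# page = cv2.imread("page.png")
# mask = cv2.imread("text_mask.png", cv2.IMREAD_GRAYSCALE)
# cv2.imwrite("cleaned.png", clean_text(page, mask))
```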
More End-to-End Translation Examples
Figs. F and G show more results of our fully automatic manga translation system. The left images show the input pages, while the center and right images show the pages translated into English and Chinese.
Figure D: Results of our text and frame order estimation. The bounding boxes of texts and frames are shown as green and red rectangles, respectively. The estimated order of texts and frames is depicted at the upper-left corner of each bounding box. ©Ito Kira, ©Mitsuki Kuchitaka, ©Nako Nameko
[Figure E: text cleaning examples, with original and cleaned pages shown side by side.]