A Multimodal Text Block Segmentation Framework For Photo Translation

Anonymous for Double-Blind Review

1
xxx@xxx.edu.xxx
2
xxx@xxx.edu.xxx
Abstract. With the rapid development of optical character recognition (OCR)
and machine translation, photo translation technology has brought great
convenience to people’s life and study. However, when the content of an image
is translated line by line, the contextual information carried by adjacent,
semantically related text lines is lost, which seriously degrades translation
quality and makes the output difficult to understand. To tackle this problem,
we propose a novel multimodal encoder-decoder model for text block
segmentation. Specifically, we construct a convolutional encoder to extract a
multimodal representation that combines visual, semantic, and positional
features for each text line. In the decoder stage, inspired by the pointer
network, a long short-term memory (LSTM) module is employed to output the
predicted segmentation sequence. Experimental results show that our model
outperforms the baselines by a large margin.
Keywords: Text block segmentation · Multimodal fusion · Pointer network.
1 Introduction
In recent years, photo translation technology has brought great convenience to
people’s life and study, allowing people to understand the information on signs
and menus when they are in a foreign country. A photo translation system is
mainly composed of three modules: text detection [11, 31, 32], text recognition
[19, 20, 30] and machine translation [3, 4, 22]. First, text detection is
performed on the input image to obtain the bounding box coordinates of all text
lines in the image. Then, text recognition is applied to each text line patch.
Finally, the recognized text of each line is fed into a machine translation
model, and the results are combined to obtain the predicted translation of the
entire image. With the widespread application of deep learning techniques
[8, 24], especially the maturity of neural network-based end-to-end sequence
modeling techniques [3, 20], the three techniques mentioned above have achieved
high accuracy in their respective scenarios.
However, previous translation systems usually translate the content of each
text line independently. The resulting lack of context information seriously
degrades translation quality, so people are likely to obtain translations that
are not fluent or are even difficult to understand. For instance, the second
line of a paragraph in a document scene may be only part of a complete sentence
and may be related to both the first line and the third line. If we translate
the recognition result of the second line alone, the translation could be
obscure, incomplete or unsatisfactory because the context in the preceding and
following parts of the sentence is missing. The proposed text block
segmentation technology is designed to solve exactly this problem. Through the
segmentation model, semantically related text lines are grouped into a text
block and then translated together by a translator, which can greatly improve
the user’s actual experience and the related evaluation indicators of photo
translation technology. Figure 1 shows an example of an image without text
block segmentation and the same image with text block segmentation.
Fig. 1: (a) The results of scene text detection, which are the input of a
translation system without text block segmentation. (b) The results of the
proposed method, which are the input of a translation system with text block
segmentation. Different blocks are represented by different colors.
In this paper, to tackle the problem mentioned above, we propose a novel
multimodal encoder-decoder text block segmentation model that separates text
blocks according to their semantic relationships for photo translation. In the
encoder stage, to make better use of multimodal information, we employ a
convolutional network to extract and fuse multimodal features, including the
visual embedding, text embedding and position embedding. The multimodal
features of all text lines, together with the flag vectors of the start symbol,
delimiter symbol and end symbol, are combined to form a candidate feature
queue. In the decoder stage, the text line index at each time step is predicted
by calculating the similarity between the feature output by an LSTM [6] module
and all of the features in the decoding candidate queue. When the delimiter
symbol is decoded, the prediction of the current text block ends and the
prediction of the next text block starts. When the end symbol is decoded or the
maximum decoding length is reached, the whole decoding process ends.
Our contributions are summarized as follows. First, we present a novel
framework for the text block segmentation task, which clusters semantically
related text lines into complete sentences and can greatly improve photo
translation performance. Second, we fully utilize multimodal features to
improve model accuracy. Finally, we design a quantitative sequence evaluation
to demonstrate the rationality and feasibility of our model in four different
languages.
2 Related works
In this section, we describe the previous work related to the proposed method,
including text detection, text recognition and sequence modeling.
2.1 Text Detection

Text detection methods usually adopt ideas similar to those of general object
detection methods such as Faster R-CNN [18], Cascade R-CNN [5] and YOLO [17].
However, there are notable differences between the two tasks. For example, the
shape of a text box can be arbitrary while the shape of an object box is
usually rectangular, and text lines are usually dense while objects are usually
relatively sparse. To handle these problems, semantic segmentation methods
[13, 26, 33] are applied to text detection and usually achieve good results.
2.2 Text Recognition

Early text recognition methods were based on segmentation: they recognized the
segments with character models and merged them with a language model. Since
segmentation is not accurate for handwriting and complex backgrounds,
state-of-the-art methods are often end-to-end and segmentation-free. The key to
the end-to-end approach is how to align the feature sequence with the label
sequence. CTC-based methods align them through the connectionist temporal
classification (CTC) loss [19], ACE [28] uses an aggregation cross-entropy
loss, and encoder-decoder methods [20] use the attention mechanism to align the
feature sequence and the label sequence.
2.3 Sequence modeling

Sequence modeling has been widely researched in many fields. In natural
language processing, the RNN-based sequence-to-sequence model [3] is often used
for machine translation, and in computer vision, encoder-decoder models [29]
are often used for image captioning. However, these methods cannot solve
permutation problems, in which the size of the output vocabulary depends
directly on the input. Pointer Network [25] and attention map loss [27] use an
attention mechanism to select the proper units from the input sequence as
outputs. These methods are often applied in text summarization [1], visual
question answering (VQA) [2] and text re-organization [10], which provides
insightful ideas for text block segmentation.
3 Method
In this section, we illustrate the main architecture of our method, including a
multimodal encoder and a dynamic decoder based on LSTM [6] modules.
Specifically, in Section 3.1, we elaborate how the visual, semantic and
positional features are extracted by the encoder. In Section 3.2, we describe
the global LSTM decoder with a dynamic vocabulary. The whole architecture is
shown in Figure 2.
[Figure 2 panels: Encoder with Multimodal Embedding (vision, position and
sentence embeddings, concatenated); Dynamic Vocabulary; Decoder with Dynamic
Vocabulary (LSTM with attention).]
Fig. 2: The architecture of the proposed method, including the multimodal
embedding and the LSTM-based decoder with dynamic vocabulary.
3.1 Encoder with Multimodal Embedding

Formally, we define the annotated text line set of the image as
$\tau = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ refers to the $i$-th text line.
The goal of the encoder is to obtain the embedding feature set of all text
lines, which is defined as $F = \{f_1, f_2, \ldots, f_n\}$.
The embedding of vision. First, the input image is resized so that its long
side is 1280 pixels. Then a ResNet50 [8] based backbone followed by an FPN [12]
module is used to extract the visual feature of the input image. After
obtaining the feature of the whole image, an RoI-Align [7] module is employed
to extract the visual feature of each text line according to the labeled
bounding box. Finally, we flatten the visual feature of every text line and
feed it into a linear projection head to convert it to a fixed dimension $d_v$.
As a result, we get the visual embedding set $V = \{v_1, v_2, \ldots, v_n\}$,
$v_i \in \mathbb{R}^{d_v}$.
The embedding of sentence. We build an RNN-based language model [14] for
sentence embedding. First, the content of each text line is segmented into
tokens by a tokenizer. Then the initial embeddings of these tokens are fed into
the language model one by one. Finally, the last hidden vector of dimension
$d_w$ is extracted as the embedding of the whole text line. The embeddings of
all text lines of the input image can be represented as
$W = \{w_1, w_2, \ldots, w_n\}$, $w_i \in \mathbb{R}^{d_w}$.
The embedding of position. Location information is essential for our task, as
most text lines belonging to one block are adjacent to each other in position.
We encode the position embedding of each text region from the normalized
coordinates of the upper-left and lower-right corners of the text line
bounding box, calculated as follows:
$$l_i = \left( \frac{x_1^i}{W}, \frac{y_1^i}{H}, \frac{x_2^i}{W}, \frac{y_2^i}{H} \right) \tag{1}$$
where $(x_1^i, y_1^i)$ and $(x_2^i, y_2^i)$ denote the coordinates of the
upper-left and lower-right corners of the $i$-th text line region, while $W$
and $H$ are the width and height of the input image. After acquiring the
normalized coordinates, a projection $g$ consisting of two linear layers and a
ReLU operation is applied to map the vectors to a specific dimension $d_p$:
$$p_i = g(l_i) \tag{2}$$
We use $P = \{p_1, p_2, \ldots, p_n\}$, $p_i \in \mathbb{R}^{d_p}$ to represent
the set of position features.
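The position branch can be illustrated with a minimal sketch in plain Python.
It normalizes the two box corners as in Equation 1 and passes them through a
two-layer projection as in Equation 2. The layer order (linear, ReLU, linear),
the hidden width `d_hidden`, the random initialization and all function names
are assumptions made for illustration; only the output dimension of 128 comes
from Section 4.2.

```python
import random


def normalize_box(x1, y1, x2, y2, width, height):
    """Eq. (1): corner coordinates scaled by the image width and height."""
    return [x1 / width, y1 / height, x2 / width, y2 / height]


def linear(x, weight, bias):
    """Dense layer; weight holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_layer(n_in, n_out, rng):
    """Randomly initialized layer, standing in for learned parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def position_embedding(box, image_size, layers):
    """Eq. (2): p_i = g(l_i), with g assumed to be Linear -> ReLU -> Linear."""
    (w1, b1), (w2, b2) = layers
    l_i = normalize_box(*box, *image_size)
    return linear(relu(linear(l_i, w1, b1)), w2, b2)


rng = random.Random(0)
d_hidden, d_p = 64, 128  # d_hidden is an assumption; d_p follows Section 4.2
layers = [make_layer(4, d_hidden, rng), make_layer(d_hidden, d_p, rng)]
p = position_embedding((128, 64, 640, 96), (1280, 720), layers)
print(len(p))
```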
Multimodal Fusion. So far, the visual, semantic and positional embedding
vectors of each text line in the input image have been extracted. At the end of
the encoder, the feature vectors of the three modalities are combined by simple
concatenation:
$$f_i = \mathrm{concat}(v_i, w_i, p_i) \tag{3}$$
where $f_i \in \mathbb{R}^{d_f}$ and $d_f = d_v + d_w + d_p$.
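As a quick check of the dimensions, the fusion step can be sketched as a plain
list concatenation. The sizes 256, 128 and 128 are taken from Section 4.2; the
function name `fuse` and the constant placeholder vectors are illustrative
only.

```python
def fuse(v_i, w_i, p_i):
    """Eq. (3): f_i = concat(v_i, w_i, p_i)."""
    return list(v_i) + list(w_i) + list(p_i)


d_v, d_w, d_p = 256, 128, 128  # dimensions reported in Section 4.2
f_i = fuse([0.0] * d_v, [1.0] * d_w, [2.0] * d_p)
print(len(f_i))  # d_f = d_v + d_w + d_p
```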
3.2 Decoder with LSTM

The dynamic vocabulary is composed of the output feature vectors of the
encoder and three extra learnable flag vectors, and can be written as
$D = \{f_{start}, f_{sep}, f_{end}, f_1, f_2, \ldots, f_n\}$. It is shown in
Figure 2, and the dimension of each element is equal to $d_f$. The $f_{start}$
vector is used as the input of the LSTM decoder at time step 0, the $f_{sep}$
vector stands for the separation between different text blocks, and the
$f_{end}$ vector means that the decoding process has ended.
The decoder is a cascade of LSTM modules, which are commonly used to solve
sequence problems [23]. The whole decoding process is autoregressive. The
dynamic vocabulary serves as the decoding range of the decoder at each time
step. At time step $t$, the hidden state of the LSTM is denoted as $h_t$. The
attention score between the hidden vector and each feature in the decoding
range is calculated as follows:
$$a_{t,i} = \frac{h_t \cdot f_i}{d_f} \tag{4}$$
where $f_i$ is the $i$-th feature vector in the dynamic vocabulary. After
obtaining the attention scores of all vectors in the dynamic vocabulary, a
softmax function is used to normalize the scores:
$$s_{t,i} = \frac{\exp(a_{t,i})}{\sum_l \exp(a_{t,l})} \tag{5}$$
According to the result of Equation 5, the feature with the highest score (the
argmax index) is selected as the decoding result at step $t$. This feature is
then removed from the dynamic vocabulary to avoid predicting the same text line
repeatedly in the decoding sequence. The decoder repeats the above operations
until the end symbol is decoded or the maximum decoding length is reached.
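The decoding loop described above can be sketched as follows. This is a
simplified illustration rather than the authors' implementation. The LSTM is
replaced by a hypothetical `step(x, state)` callable, and the start vector is
used only as the first input and left out of the candidate set. Only text line
features are removed after selection, so that the separator can be predicted
more than once. All three choices are assumptions, and the toy one-hot features
and scripted queries exist only to make the example runnable.

```python
import math


def attention_scores(h_t, vocab, d_f):
    """Eq. (4): dot product of h_t with each remaining entry, divided by d_f."""
    return {k: sum(a * b for a, b in zip(h_t, f)) / d_f
            for k, f in vocab.items()}


def softmax(scores):
    """Eq. (5), with the maximum subtracted for numerical stability."""
    m = max(scores.values())
    exps = {k: math.exp(a - m) for k, a in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def greedy_decode(line_feats, flags, step, max_len):
    """Return a list of line indices interleaved with 'sep', ending at 'end'.

    line_feats maps a line index to f_i; flags holds the 'start', 'sep' and
    'end' vectors; step(x, state) -> (h_t, state) stands in for the LSTM.
    """
    d_f = len(flags["start"])
    vocab = {"sep": flags["sep"], "end": flags["end"], **line_feats}
    x, state, out = flags["start"], None, []
    for _ in range(max_len):
        h_t, state = step(x, state)
        probs = softmax(attention_scores(h_t, vocab, d_f))
        choice = max(probs, key=probs.get)  # argmax over the vocabulary
        out.append(choice)
        if choice == "end":
            break
        x = vocab[choice]       # the selected feature is the next input
        if choice != "sep":
            del vocab[choice]   # a text line can be predicted only once
    return out


def one_hot(i, n=5):
    return [1.0 if j == i else 0.0 for j in range(n)]


# Toy setup: three lines with one-hot features and a scripted "LSTM".
line_feats = {1: one_hot(0), 2: one_hot(1), 3: one_hot(2)}
flags = {"start": [0.0] * 5, "sep": one_hot(3), "end": one_hot(4)}
script = [one_hot(0), one_hot(2), one_hot(3), one_hot(1), one_hot(4)]


def scripted_step(x, state):
    t = 0 if state is None else state
    return script[t], t + 1


print(greedy_decode(line_feats, flags, scripted_step, max_len=10))
# [1, 3, 'sep', 2, 'end']
```

In a trained model, `step` would be the two-layer LSTM and the vectors would
come from the encoder; the scripted queries merely reproduce the example
sequence discussed at the end of this section.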
At each time step during decoding, we compute the cross-entropy loss between
the one-hot ground truth and the predicted probability distribution via
Equation 6:
$$L(\theta) = -\sum_t \sum_i \left[ y_{t,i} \log s_{t,i} + (1 - y_{t,i}) \log(1 - s_{t,i}) \right] \tag{6}$$
where $y_t$ refers to the one-hot label at time step $t$ and $\theta$ denotes
the parameters of the model.
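To make the loss concrete, the sketch below evaluates the summed per-entry
cross-entropy of Equation 6 for a two-step toy sequence. The clamping constant
`eps` and the example probabilities are assumptions added for illustration and
do not appear in the paper.

```python
import math


def sequence_loss(targets, probs, eps=1e-12):
    """Eq. (6): cross-entropy summed over time steps t and vocabulary entries i.

    targets[t] is the one-hot label y_t and probs[t] the softmax output s_t.
    """
    total = 0.0
    for y_t, s_t in zip(targets, probs):
        for y, s in zip(y_t, s_t):
            s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
            total += y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return -total


targets = [[1, 0, 0], [0, 0, 1]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = sequence_loss(targets, probs)
print(round(loss, 4))
```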
Suppose the final decoding sequence is $\{t_1, t_3, sep, t_2, end\}$. It means
that text lines $t_1$ and $t_3$ should be linked together as one block and text
line $t_2$ forms a block on its own. In general, for every text block in the
image, the text lines in the block are combined in reading order to obtain a
semantically complete sentence. The sentence is then fed into a translator,
which thus receives the full contextual information.
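The post-processing step that turns a decoded sequence into text blocks can be
sketched as below. The token spellings `"sep"` and `"end"`, the function names
and the toy transcriptions are illustrative assumptions.

```python
def sequence_to_blocks(sequence):
    """Split a decoded sequence into text blocks at 'sep', stopping at 'end'."""
    blocks, current = [], []
    for token in sequence:
        if token == "end":
            break
        if token == "sep":
            if current:          # ignore empty blocks from repeated separators
                blocks.append(current)
            current = []
        else:
            current.append(token)
    if current:                  # sequence cut off at the maximum length
        blocks.append(current)
    return blocks


def block_text(block, transcriptions, joiner=" "):
    """Join the transcriptions of a block in decoded (reading) order."""
    return joiner.join(transcriptions[line] for line in block)


blocks = sequence_to_blocks(["t1", "t3", "sep", "t2", "end"])
print(blocks)  # [['t1', 't3'], ['t2']]

transcriptions = {"t1": "Please keep", "t2": "Exit", "t3": "off the grass"}
print([block_text(b, transcriptions) for b in blocks])
# ['Please keep off the grass', 'Exit']
```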
4 Experiments
In this section, we set up experiments to evaluate our proposed method in terms
of model accuracy and translation performance. Specifically, the training and
testing data are presented in Section 4.1 and the training details are given in
Section 4.2. In Section 4.3, we describe how we build a rule-based baseline and
a detection-based baseline. In Section 4.4 we give the evaluation metrics, and
the results of our model are presented in Section 4.5.
4.1 Dataset
We collected images of natural scenes in multiple languages, including Kazakh,
Russian, Uyghur and Vietnamese. These images are annotated and split into a
training set, a validation set and a test set. Each image in the dataset has
hierarchical labels, including text lines and text blocks. The annotation of a
text line contains the coordinates of its bounding box and the transcription,
while the annotation of a text block contains the coordinates of the block
bounding box and the transcription. The mapping between text blocks and text
lines is also stored, based on IoU (intersection over union) matching.
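The paper does not describe the matching procedure in detail. As one possible
reading, the sketch below computes the IoU of axis-aligned boxes and assigns
each text line to the block box with the highest IoU. The assignment rule, the
`(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_lines_to_blocks(line_boxes, block_boxes):
    """Map each line index to the block index with the highest IoU.

    A line that overlaps no block is mapped to None.
    """
    mapping = {}
    for i, line in enumerate(line_boxes):
        scores = [iou(line, block) for block in block_boxes]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        mapping[i] = best if best is not None and scores[best] > 0 else None
    return mapping


line_boxes = [(0, 0, 10, 2), (0, 3, 10, 5), (0, 20, 10, 22)]
block_boxes = [(0, 0, 10, 5), (0, 20, 10, 22)]
print(assign_lines_to_blocks(line_boxes, block_boxes))
# {0: 0, 1: 0, 2: 1}
```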
The number of training, validation and testing images for each language is
shown in Table 1. We train a separate model for each of the four languages with
language-specific training data and test each model on its own language.
Table 1: The number of training, validation and testing images in the four
languages.
Language    Train set  Val set  Test set
Kazakh      12540      500      563
Russian     13881      500      428
Uyghur      19747      500      513
Vietnamese  23484      500      575
4.2 Training
For the encoder, ResNet50 with an FPN module is employed as the visual
backbone. The dimension of the visual feature vectors is 256, while the
semantic and position embeddings each have dimension 128. As a result, the
dimension of the multimodal feature is 512. For the decoder, the dimension of
the LSTM hidden state is 512, and we use two cascaded LSTM cells as the base
architecture of the decoder.
We use the Adam [9] optimizer with a weight decay of 0.0005 and an initial
learning rate of 1e-4. The learning rate is then reduced epoch by epoch
following the cosine annealing schedule [21]. The model is trained for 20
epochs in total with a batch size of 16.
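For reference, a cosine annealing schedule with the reported initial learning
rate and epoch count can be computed as in the sketch below. The minimum
learning rate of 0 and the choice to anneal over exactly 20 epochs are
assumptions, since the paper does not state them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=20, base_lr=1e-4, min_lr=0.0):
    """Learning rate at the given epoch under a cosine annealing schedule."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_term)


# The rate starts at base_lr, is halved at the midpoint and reaches min_lr
# at the end of training.
for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```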
4.3 Baseline
We compare our method with two baselines: a rule-based clustering method and a
two-stage detection method. The rule-based method clusters text lines using
position rules, which is the most direct approach. First, all the text lines in
the input image are sorted from top to bottom and then from left to right.
After that, a separator flag is inserted between two adjacent sorted text lines
when the distance between them is greater than a preset threshold. For the
detection baseline, we use the popular Faster R-CNN [18] architecture to detect
the bounding box of each text block directly.
4.4 Evaluation metrics
We use two metrics to evaluate the performance of the model. One is the
accuracy of text blocks and the other is the BLEU [16] score of the translation
task. The accuracy is computed as the proportion of text blocks whose text
lines perfectly match the label. For instance, if the text block label of the
input image is $[[t_1, t_2], [t_5], [t_4], [t_3]]$ and the inferred sequence
after post-processing is $[[t_1, t_2], [t_3, t_4], [t_5]]$, then the number of
correctly matched blocks is 2, which is the size of the subset
$[[t_1, t_2], [t_5]]$. It is worth noting that the text lines within a text
block must keep the correct reading order, which is required to form the right
sentence, both in the annotation and in the prediction. In addition, we do not
take the order between different text blocks into account because it is not
necessary for translation.
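The matching rule in the example above can be sketched as a set comparison in
which each block is an ordered tuple of line indices, so that the order inside
a block matters while the order between blocks does not. The function name is
illustrative, and the sketch returns only the match count because the text does
not state which total the count is divided by.

```python
def count_matched_blocks(label_blocks, predicted_blocks):
    """Count predicted blocks that equal a labeled block, line order included."""
    label_set = {tuple(block) for block in label_blocks}
    predicted_set = {tuple(block) for block in predicted_blocks}
    return len(label_set & predicted_set)


label = [[1, 2], [5], [4], [3]]
prediction = [[1, 2], [3, 4], [5]]
print(count_matched_blocks(label, prediction))  # 2, from [1, 2] and [5]
```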
4.5 Results and Analysis
The main accuracy results are reported in Table 2. We compare the accuracy of
our proposed method with the rule-based baseline and the Faster R-CNN baseline
on the test sets of the four languages. Our approach exceeds the two baselines
by a large margin (about 15 to 20 accuracy points over Faster R-CNN in
Table 2). The rule-based baseline is very sensitive to the threshold, which we
analyze in detail later, while the Faster R-CNN network requires complex
post-processing such as NMS [15] and sorting the matched lines in each block
according to a positional rule. In addition, neither baseline uses semantic
features; both essentially perform text block segmentation based on the visual
distance between text lines. They are therefore less effective for text lines
that are semantically close but visually distant, or semantically unrelated
but visually close. In contrast, the proposed algorithm makes full use of
multimodal information: it not only performs better, but also does not depend
on a carefully chosen threshold or on post-processing.
After obtaining the text blocks, we join the content within each text block and
feed the resulting contextual sentence into a translation engine. Table 3
presents the BLEU-2 and BLEU-4 scores between the annotated translation and the
predicted translation under multiple settings. The single-line setting refers
to directly translating each text line without grouping lines into text blocks,
which is common practice in previous photo translation systems. The annotation
setting means that we use the labeled text blocks, and the result under this
setting represents the upper limit of our model. Considering the results in
Table 2 and Table 3 together, our method outperforms the other methods and is
very close to the upper limit on the BLEU score.
Table 2: Text block segmentation accuracy of the baselines and our method on
the test sets of the four languages.
Method        Kazakh  Russian  Uyghur  Vietnamese
Rule          46.73   43.64    47.57   45.05
Faster R-CNN  62.41   58.09    64.09   73.04
Our method    79.86   78.10    79.80   87.59
Table 3: The BLEU scores of translation on the test data of the four languages.
Method        Kazakh          Russian         Uyghur          Vietnamese
              BLEU-2  BLEU-4  BLEU-2  BLEU-4  BLEU-2  BLEU-4  BLEU-2  BLEU-4
Single line   46.48   32.05   64.31   50.64   65.70   51.23   62.27   46.30
Rule          49.08   34.97   66.38   53.13   67.71   53.99   64.91   50.62
Faster R-CNN  56.59   45.17   74.07   63.73   75.72   64.89   68.55   56.63
Our method    56.97   45.60   74.86   65.07   76.29   65.57   68.82   57.00
Annotation    57.29   46.21   75.99   66.63   76.75   66.23   69.25   57.65
We test the effect of different thresholds on the rule-based baseline. All text
lines are sorted in advance. To decide whether to insert a separator flag
before the current text line, we use as the threshold a certain proportion of
the mean line height of the text lines in the last text block together with the
current line. If two text lines are arranged side by side, a certain proportion
of the mean line width is used as the separation threshold instead. We denote
the former proportion as $ratio_h$ and the latter as $ratio_w$, and set
$ratio_h = 2 \cdot ratio$ and $ratio_w = ratio$. Intuitively, the larger the
ratio is, the more the text lines tend to aggregate, while the smaller the
ratio is, the more the text lines tend to separate. We compare the text block
segmentation accuracy and the BLEU score as the threshold changes in Fig. 3,
where 0 means that every text line is treated as its own text block, and ∞
means that all text lines in the image are aggregated into a single block. The
threshold used in Table 2 and Table 3 is 0.1, which balances accuracy and BLEU
score.
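The rule-based baseline can be approximated with the sketch below. Several
details are assumptions rather than the authors' exact rules: boxes are
`(x1, y1, x2, y2)` in image coordinates with the y axis pointing down, two
lines count as side by side when their vertical extents overlap, and the gap is
measured between the previous line and the current one.

```python
def should_split(boxes, last_block, i, ratio_h, ratio_w):
    """Decide whether line i starts a new block after the last block."""
    group = [boxes[j] for j in last_block] + [boxes[i]]
    prev, cur = boxes[last_block[-1]], boxes[i]
    # Side by side: the vertical extents of the two lines overlap.
    side_by_side = min(prev[3], cur[3]) > max(prev[1], cur[1])
    if side_by_side:
        mean_w = sum(b[2] - b[0] for b in group) / len(group)
        return cur[0] - prev[2] > ratio_w * mean_w
    mean_h = sum(b[3] - b[1] for b in group) / len(group)
    return cur[1] - prev[3] > ratio_h * mean_h


def rule_based_blocks(boxes, ratio):
    """Group line boxes into blocks of line indices using position rules only."""
    ratio_h, ratio_w = 2 * ratio, ratio  # as defined in the text above
    # Sort top to bottom, then left to right.
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
    blocks = []
    for i in order:
        if blocks and not should_split(boxes, blocks[-1], i, ratio_h, ratio_w):
            blocks[-1].append(i)
        else:
            blocks.append([i])
    return blocks


boxes = [(0, 0, 100, 10), (0, 11, 100, 21), (0, 60, 100, 70)]
print(rule_based_blocks(boxes, 0.1))           # [[0, 1], [2]]
print(rule_based_blocks(boxes, 0.0))           # [[0], [1], [2]]
print(rule_based_blocks(boxes, float("inf")))  # [[0, 1, 2]]
```

With these toy boxes, a ratio of 0 separates every line and an infinite ratio
merges all lines, which mirrors the two extremes described above.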
We also conduct ablation experiments on the different modal features, using the
Vietnamese test data. Because visual information is very important in natural
scenes, the basic network is built with the visual embedding only. Then the
position embedding and the semantic embedding are added to the network in turn.
The accuracy and the translation BLEU scores of the three models are shown in
Table 4. As we can see, the embedding of each modality plays a positive role in
the proposed network: accuracy rises from 78.17 to 83.08 and then to 87.59.
Table 4: Ablation studies of the three modal feature embeddings.
Modal type                Acc    BLEU-2  BLEU-4
Vision only               78.17  67.89   55.45
Vision+Position           83.08  68.51   56.56
Vision+Position+Semantic  87.59  68.82   57.00
Some visualization results are shown in Figure 4. They show that the advantage
of our model is that it can segment sentences accurately by using multimodal
information, and thus provides complete contextual information for translation.
[Figure 3 plot: horizontal axis Threshold (0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3,
0.4, 0.5, ∞); vertical axes Acc and Bleu; one Acc curve and one Bleu curve each
for Kazakh, Russian, Uyghur and Vietnamese.]
Fig. 3: Text block segmentation accuracy and BLEU-2 scores of the rule-based
method under different thresholds.
Fig. 4: Comparative visualization results between our proposed method and other
methods: (a) Rule, (b) Faster R-CNN, (c) Our method, (d) Ground truth.
5 Conclusions
In this study, we propose a novel multimodal framework for text block
segmentation. The framework improves model accuracy through the fusion of
multimodal information and an attention mechanism, and uses an LSTM decoder to
predict the text block segmentation result. In addition, we introduce a
quantitative evaluation to measure the effect of the proposed method. Our model
outperforms the baselines on both text block segmentation accuracy and
translation quality. In future work, we will try to incorporate unsupervised
pre-trained models to further improve text block segmentation accuracy.
References
1. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez,
J.B., Kochut, K.: Text summarization techniques: a brief survey. arXiv preprint
arXiv:1707.02268 (2017)
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh,
D.: VQA: Visual question answering. In: Proceedings of the IEEE international
conference on computer vision. pp. 2425–2433 (2015)
3. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly
learning to align and translate. In: 3rd International Conference on Learning
Representations, ICLR 2015 (2015)
4. Caglayan, O., Kuyu, M., Amac, M.S., Madhyastha, P., Erdem, E., Erdem, A.,
Specia, L.: Cross-lingual visual pre-training for multimodal machine translation.
arXiv preprint arXiv:2101.10044 (2021)
5. Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection.
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 6154–6162 (2018)
6. Gers, F.: Long short-term memory in recurrent neural networks. Ph.D. thesis,
EPFL (2001)
7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the
IEEE international conference on computer vision. pp. 2961–2969 (2017)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
10. Li, L., Gao, F., Bu, J., Wang, Y., Yu, Z., Zheng, Q.: An end-to-end OCR text
re-organization sequence learning for rich-text detail image comprehension. In:
European Conference on Computer Vision. pp. 85–100. Springer (2020)
11. Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: TextBoxes: A fast text detector
with a single deep neural network. In: Thirty-First AAAI Conference on Artificial
Intelligence (2017)
12. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2017)
13. Long, S., Ruan, J., Zhang, W., He, X., Wu, W., Yao, C.: TextSnake: A flexible
representation for detecting text of arbitrary shapes. In: Proceedings of the European
conference on computer vision (ECCV). pp. 20–36 (2018)
14. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent
neural network based language model. In: Interspeech, Conference of the International
Speech Communication Association, Makuhari, Chiba, Japan, September (2010)
15. Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: 18th
International Conference on Pattern Recognition (ICPR’06). vol. 3, pp. 850–855. IEEE
(2006)
16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th annual meeting of the
Association for Computational Linguistics. pp. 311–318 (2002)
17. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 779–788 (2016)
18. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object
detection with region proposal networks. In: Advances in neural information processing
systems. pp. 91–99 (2015)
19. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. IEEE
transactions on pattern analysis and machine intelligence 39(11), 2298–2304 (2016)
20. Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with
automatic rectification. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 4168–4176 (2016)
21. Smith, L.N., Topin, N.: Super-convergence: Very fast training of neural networks
using large learning rates. Artificial intelligence and machine learning for
multi-domain operations applications 11006, 369–386 (2019)
22. Üstün, A., Berard, A., Besacier, L., Gallé, M.: Multilingual unsupervised neural
machine translation with denoising adapters. In: Empirical Methods in Natural
Language Processing (2021)
23. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: NIPS (2014)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998–6008 (2017)
25. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in neural
information processing systems. pp. 2692–2700 (2015)
26. Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., Shao, S.: Shape robust text
detection with progressive scale expansion network. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 9336–9345 (2019)
27. Wu, J., Du, J., Wang, F., Yang, C., Jiang, X., Hu, J., Yin, B., Zhang, J., Dai, L.:
A multimodal attention fusion network with a dynamic vocabulary for TextVQA.
Pattern Recognition 122, 108214 (2022)
28. Xie, Z., Huang, Y., Zhu, Y., Jin, L., Liu, Y., Xie, L.: Aggregation cross-entropy for
sequence recognition. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 6538–6547 (2019)
29. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
attention. In: International conference on machine learning. pp. 2048–2057 (2015)
30. Yan, R., Peng, L., Xiao, S., Yao, G.: Primitive representation learning for scene text
recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 284–293 (2021)
31. Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: EAST: an
efficient and accurate scene text detector. In: Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition. pp. 5551–5560 (2017)
32. Zhu, Y., Chen, J., Liang, L., Kuang, Z., Jin, L., Zhang, W.: Fourier contour
embedding for arbitrary-shaped text detection. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 3123–3131 (2021)
33. Zhu, Y., Du, J.: TextMountain: Accurate scene text detection via instance
segmentation. Pattern Recognition p. 107336 (2020)
