1 Introduction
Fig. 1: (a) The results of scene text detection, which are the input of a
translation system without text block segmentation. (b) The results of the
proposed method, which are the input of a translation system with text block
segmentation. Different blocks are represented by different colors.
iter symbol and end symbol flag vector are combined to form a candidate feature
queue. In the decoder stage, the text line index at each moment is predicted by
calculating the similarity between the feature output from an LSTM [6] module
and all of the features in the decoding candidate queue. When the delimiter
symbol is decoded, it means that the prediction of a text block ends, and the
prediction of the next text block starts. When the end symbol is decoded or the
maximum decoding length is reached, the whole decoding process ends.
Our contributions are summarized as follows. First, we present a novel
framework for the text block segmentation task, which clusters semantically
related text lines into complete sentences and can greatly improve photo
translation performance. Second, we fully utilize multimodal features to improve
model accuracy. Finally, we design a quantitative sequence evaluation to
demonstrate the rationality and feasibility of our model in four different
languages.
2 Related works
In this section, we describe the previous work related to the proposed method,
including text detection, text recognition and sequence modeling.
machine translation task and in computer vision, encoder-decoder based models [29]
are often used in the image captioning task. However, these methods cannot solve
the permutation problem, where the size of the output depends directly on the
input. Pointer Network [25] and attention map loss [27] use an attention
mechanism to select the proper units from the input sequence as outputs. These
methods are often applied in text summarization [1], visual question answering
(VQA) [2] and text re-organization [10], which provides insightful ideas for
text block segmentation.
3 Method
In this section, we illustrate the main architecture of our method, including a
multimodal encoder and a dynamic decoder based on LSTM [6] modules.
Specifically, in Section 3.1, we elaborate how visual, semantic and positional
features are extracted by the encoder. In Section 3.2, we describe the global
LSTM decoder with a dynamic vocabulary. The whole architecture is shown in
Figure 2.
Fig. 2: The architecture of the proposed method, including the multimodal
embedding and the LSTM-based decoder with dynamic vocabulary.
module is used to extract the visual feature of the input image. After obtaining
the whole image feature, a ROI-Align [7] module can be employed to extract the
visual feature of each text line in terms of the labeled bounding box. Finally, we
flatten the visual features of every text line and put them into a linear projection
head to convert them to a fixed dimension $d_v$. As a result, we can get the visual
embedding set $V = \{v_1, v_2, \ldots, v_n\}$, $v_i \in \mathbb{R}^{d_v}$.
The embedding of sentence. We build an RNN-based language model [14]
for sentence embedding. First, the input content of each text line is segmented
into tokens by a tokenizer. Then the initial embeddings of these tokens are fed
into the language model one by one. Finally, the last hidden vector of dimension
$d_w$ is extracted as the embedding of the whole text line. The embeddings of all
text lines of the input image can be represented as
$W = \{w_1, w_2, \ldots, w_n\}$, $w_i \in \mathbb{R}^{d_w}$.
The embedding of position. The location information is essential for our task,
as most text lines belonging to one block are adjacent to each other in
position. We encode the position embedding for each text region via the
normalized coordinates of the upper-left and lower-right corners of the text
line bounding box, calculated as follows:

$$l_i = \left( \frac{x_1^i}{W}, \frac{y_1^i}{H}, \frac{x_2^i}{W}, \frac{y_2^i}{H} \right) \tag{1}$$

where $(x_1^i, y_1^i)$ and $(x_2^i, y_2^i)$ denote the coordinates of the
upper-left and lower-right corners of the $i$-th text line region, while $W$ and
$H$ are the width and height of the input image (here $W$ is the image width,
not the set of sentence embeddings). After acquiring the normalized coordinates,
a projection $g$ with two linear layers and a ReLU operation is applied to map
the vectors to a specific dimension $d_p$:

$$p_i = g(l_i) \tag{2}$$

We use $P = \{p_1, p_2, \ldots, p_n\}$, $p_i \in \mathbb{R}^{d_p}$ to represent the set of position features.
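The two steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in this work: the helper names (`normalize_box`, `linear`, `init_layer`, `project`), the hidden width of 64, the random weights and the placement of the ReLU between the two linear layers are all assumptions made for the example; only the normalization of Equation 1 and the output dimension of 128 (Section 4.2) come from the text.

```python
import random

def normalize_box(x1, y1, x2, y2, width, height):
    """Equation (1): divide x by the image width and y by the image height."""
    return [x1 / width, y1 / height, x2 / width, y2 / height]

def linear(vec, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_layer(d_in, d_out, rng):
    """Random weights as a stand-in for trained parameters (hypothetical)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out

def project(l_i, layer1, layer2):
    """Equation (2): g = Linear -> ReLU -> Linear (the ReLU placement is assumed)."""
    hidden = [max(0.0, h) for h in linear(l_i, *layer1)]
    return linear(hidden, *layer2)

rng = random.Random(0)
layer1 = init_layer(4, 64, rng)     # hidden width 64 is an assumption
layer2 = init_layer(64, 128, rng)   # d_p = 128, as reported in Section 4.2

# A 640x480 image with one text line box given by two opposite corners.
l_i = normalize_box(64, 48, 320, 96, width=640, height=480)
p_i = project(l_i, layer1, layer2)
print(l_i, len(p_i))
```

Because the coordinates are divided by the image size, the same layout produces the same $l_i$ regardless of image resolution.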
Multimodal Fusion. So far, the visual, semantic and positional embedding
vectors of each text line in the input image have been extracted. At the end of
the encoder, the feature vectors of these three modalities are combined by a
simple concatenation:

$$f_i = \mathrm{concat}(v_i, w_i, p_i) \tag{3}$$

where $f_i \in \mathbb{R}^{d_f}$ and $d_f = d_v + d_w + d_p$.
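As a quick check of the dimensions in Equation 3, the concatenation can be written as below. The function name `fuse` and the zero-valued placeholder vectors are hypothetical; the sizes 256, 128 and 128 are the ones reported in Section 4.2.

```python
def fuse(v_i, w_i, p_i):
    """Equation (3): concatenate the visual, semantic and position embeddings."""
    return list(v_i) + list(w_i) + list(p_i)

d_v, d_w, d_p = 256, 128, 128            # dimensions reported in Section 4.2
f_i = fuse([0.0] * d_v, [0.0] * d_w, [0.0] * d_p)
print(len(f_i))                          # d_f = d_v + d_w + d_p
```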
where $f_i$ denotes the $i$-th feature vector in the dynamic vocabulary. After
obtaining the attention scores of all vectors in the vocabulary, a softmax
function is used to normalize the scores:

$$s_{t,i} = \frac{\exp(a_{t,i})}{\sum_l \exp(a_{t,l})} \tag{5}$$

According to the result of Equation 5, the feature with the argmax index is
selected as the decoding result at step $t$. This feature is then removed from
the dynamic vocabulary to avoid predicting the same text line twice in the
decoding sequence. The decoder repeats the above operations until the end symbol
is decoded or the maximum decoding length is reached.
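The decoding loop described above can be sketched as follows. This is an illustration under several assumptions, not the model itself: the per-step query vectors stand in for the LSTM outputs (which in the real decoder depend on the previous prediction), the attention score is taken to be a plain dot product, and the separator is assumed to stay in the vocabulary so that several blocks can be closed. The names `greedy_decode`, `softmax` and `dot` are hypothetical.

```python
import math

def softmax(scores):
    """Equation (5): normalize attention scores into probabilities."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_decode(queries, vocabulary, max_len=20):
    """Pick the argmax candidate at each step and shrink the vocabulary.

    queries:    one vector per step (stand-in for the LSTM output)
    vocabulary: dict mapping a name to its feature vector; must contain
                the special entries "sep" and "end"
    """
    candidates = dict(vocabulary)            # work on a copy
    output = []
    for query in queries[:max_len]:
        names = list(candidates)
        probs = softmax([dot(query, candidates[n]) for n in names])
        best = names[max(range(len(names)), key=probs.__getitem__)]
        output.append(best)
        if best == "end":
            break
        if best != "sep":                    # text lines may be predicted only once
            del candidates[best]
    return output

# Toy vocabulary with orthogonal features so each query selects one entry.
vocab = {
    "t1": [1, 0, 0, 0, 0], "t2": [0, 1, 0, 0, 0], "t3": [0, 0, 1, 0, 0],
    "sep": [0, 0, 0, 1, 0], "end": [0, 0, 0, 0, 1],
}
queries = [vocab["t1"], vocab["t3"], vocab["sep"], vocab["t2"], vocab["end"]]
print(greedy_decode(queries, vocab))
```

With this toy input the loop returns the sequence t1, t3, sep, t2, end. If the same query is repeated, the second step can no longer return the same text line because it has been removed from the candidates.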
At each time step during decoding, we compute the cross-entropy loss between
the one-hot ground truth and the predicted probability distribution via
Equation 6:

$$L(\theta) = -\sum_t \sum_i \left[ y_{t,i} \log s_{t,i} + (1 - y_{t,i}) \log(1 - s_{t,i}) \right] \tag{6}$$

where $y_t$ refers to the label at time step $t$ and $\theta$ denotes the weights of the model.
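For concreteness, Equation 6 can be evaluated on small inputs as in the sketch below. The function name `sequence_loss` and the clamping constant `eps` are assumptions added for the example (the clamp only avoids taking the logarithm of 0).

```python
import math

def sequence_loss(targets, probs, eps=1e-12):
    """Equation (6): summed over time steps t and vocabulary entries i.

    targets: list of one-hot lists y_t
    probs:   list of predicted distributions s_t with the same shape
    """
    total = 0.0
    for y_t, s_t in zip(targets, probs):
        for y, s in zip(y_t, s_t):
            s = min(max(s, eps), 1.0 - eps)   # keep log() finite
            total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total

# One step, two candidates, uniform prediction: the loss is 2 * ln(2).
print(sequence_loss([[1, 0]], [[0.5, 0.5]]))
```

A confident correct prediction drives both terms toward zero, while a confident wrong one is penalized on both the missed target and the wrongly selected entry.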
Suppose the final decoding sequence is $\{t_1, t_3, \mathrm{sep}, t_2, \mathrm{end}\}$.
This means that text lines $t_1$ and $t_3$ should be linked together as one
block, and text line $t_2$ forms a block on its own. For every text block in
the image, the text lines in the block are combined in reading order to obtain a
semantically complete sentence. The sentence is then fed into a translator,
which thus receives the full contextual information.
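The post-processing of a decoded sequence can be sketched as below. The function names, the string tokens "sep" and "end", the space used to join lines and the toy transcripts are assumptions for illustration.

```python
def sequence_to_blocks(sequence, sep="sep", end="end"):
    """Split a decoded index sequence into text blocks at each separator."""
    blocks, current = [], []
    for token in sequence:
        if token == end:
            break
        if token == sep:
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(token)
    if current:                      # block still open when decoding stopped
        blocks.append(current)
    return blocks

def blocks_to_sentences(blocks, transcripts, joiner=" "):
    """Join the transcriptions of each block in decoded (reading) order."""
    return [joiner.join(transcripts[t] for t in block) for block in blocks]

blocks = sequence_to_blocks(["t1", "t3", "sep", "t2", "end"])
transcripts = {"t1": "No parking", "t2": "Exit", "t3": "in front of the gate"}
print(blocks)
print(blocks_to_sentences(blocks, transcripts))
```

Each resulting sentence, rather than each individual text line, is what would be passed to the translator.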
4 Experiments
In this section, we set up experiments to evaluate our proposed method via the
accuracy of the model and the translation performance. Specifically, the training
and testing data are presented in Section 4.1 and the training details are given
in Section 4.2. In Section 4.3, we describe how to build a rule-based baseline
and a detection-based baseline. In Section 4.4, we give the evaluation metrics,
and the results of our model are presented in Section 4.5.
4.1 Dataset
We collected some images of natural scenes in multiple languages, including
Kazakh, Russian, Uyghur and Vietnamese. These images are annotated and
A Multimodal Text Block Segmentation Framework for Photo Translation 7
split into a training set, a validation set and a test set. Each image in the
dataset has hierarchical labels, including text lines and text blocks. The
annotation of a text line contains the coordinates of the text bounding box and
the transcription, while the annotation of a text block contains the coordinates
of the block bounding box and the transcription. The mapping between text blocks
and text lines is also stored, based on IoU matching.
The number of images in the training and test sets for each language is shown
in Table 1. We train four models, one for each language mentioned above, with
language-specific training data, and test each model on its own language
separately.
4.2 Training
For the encoder, ResNet50 along with an FPN module is employed as our visual
backbone. The dimension of the visual feature vectors is 256, while the
dimension of the semantic and position embeddings is 128 each. As a result, the
multimodal feature has dimension 512. For the decoder, the dimension of the LSTM
hidden state is 512, and we use two cascaded LSTM cells as the base architecture
of the decoder.
We use an Adam [9] optimizer with weight decay 0.0005 and an initial learning
rate of 1e-4. The learning rate is then reduced epoch by epoch with the
CosineAnnealing schedule [21]. The model is trained for 20 epochs in total with
a batch size of 16.
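To make the schedule concrete, the sketch below evaluates the standard closed-form cosine annealing rule once per epoch with the settings quoted above. The minimum learning rate of 0 and the choice to anneal over exactly the 20 training epochs are assumptions, since they are not stated in the text; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=20, base_lr=1e-4, min_lr=0.0):
    """Learning rate used during `epoch` (0-based) under cosine annealing."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

schedule = [cosine_annealing_lr(e) for e in range(20)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

Under these assumptions the rate starts at 1e-4, reaches half of that value at epoch 10 and decays smoothly toward the minimum by the end of training.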
4.3 Baseline
We compare our method with the following two baselines: a rule-based clustering
method and a two-stage detection method. The rule-based method clusters text
lines according to position rules, which is the most direct way. First, all the
text lines in the input image are sorted from top to bottom and then from left
to right. After that, a separator flag is inserted between two adjacent sorted
text lines when the distance between them is greater than a preset threshold.
For the detection baseline, we use the popular Faster R-CNN [18] architecture to
detect the bounding box of each text block directly.
8 Anonymous for Double-Blind Review
We use two metrics to evaluate the performance of the model. One is the accuracy
of the text block and the other is the BLEU [16] score of the translation task.
The accuracy score is computed as the ratio of text blocks whose text lines
perfectly match the label. For instance, if the text block label of the input
image is $[[t_1, t_2], [t_5], [t_4], [t_3]]$ and the inference sequence after
post-processing is $[[t_1, t_2], [t_3, t_4], [t_5]]$, then the number of
correctly matched blocks is 2, which is the size of the subset
$[[t_1, t_2], [t_5]]$. It is worth noting that the text lines in a text block
need to maintain the correct reading order, which is required to form the right
sentence, both in the annotation and in the prediction. In addition, we do not
take the order between different text blocks into account, because it is not
necessary for translation.
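The matching rule in this example can be sketched as follows. The function names are hypothetical, and dividing by the number of labeled blocks is an assumption, since the denominator of the ratio is not spelled out in the text.

```python
def count_matched_blocks(label_blocks, predicted_blocks):
    """Count predicted blocks that equal a labeled block, line order included.

    Blocks are compared as ordered tuples, so [t1, t2] does not match [t2, t1];
    the order of the blocks themselves is ignored.
    """
    label_set = {tuple(block) for block in label_blocks}
    return sum(1 for block in predicted_blocks if tuple(block) in label_set)

def block_accuracy(label_blocks, predicted_blocks):
    """Matched blocks divided by labeled blocks (the denominator is assumed)."""
    return count_matched_blocks(label_blocks, predicted_blocks) / len(label_blocks)

label = [["t1", "t2"], ["t5"], ["t4"], ["t3"]]
pred = [["t1", "t2"], ["t3", "t4"], ["t5"]]
print(count_matched_blocks(label, pred), block_accuracy(label, pred))
```

For the example in the text this counts 2 matched blocks, which gives 0.5 under the assumed denominator of 4 labeled blocks.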
The main result of the accuracy score is shown in Table 2. We compare the
accuracy of our proposed method with the rule-based baseline and the Faster
R-CNN baseline on the test sets of four different languages. Our approach
exceeds the two baselines by a large margin. In fact, the rule-based baseline is
very sensitive to the threshold, which we analyze in detail later, while the
Faster R-CNN network requires complex post-processing such as NMS [15] and
sorting the matched lines in each block according to a positional rule. In
addition, these two baseline methods do not use semantic features and basically
perform text block segmentation based on the visual distance between text lines.
They are therefore less effective for text lines that are semantically close but
visually distant, or semantically unrelated but visually close. In contrast, the
proposed algorithm makes full use of multimodal information: it not only
performs better, but also does not depend on a well-chosen threshold or on
post-processing.
After obtaining the text blocks, we join the content in the same text block
and feed the contextual sentence into a translation engine. Table 3 presents
the BLEU-2 and BLEU-4 scores between the annotated translation and the
predicted translation under multiple settings. The single line setting refers to
directly translating each text line without grouping the lines into text blocks,
which is a common practice in previous photo translation systems. The annotation
setting means that we use the labeled text blocks, so the result under this
setting represents the upper limit of our model. Considering the results in
Table 2 and Table 3 together, our method clearly outperforms the other methods
and is very close to the upper limit on the BLEU score.
Table 2: Text block segmentation accuracy of the baselines and our method on
the test sets of four different languages.

Method         Kazakh   Russian   Uyghur   Vietnamese
Rule           46.73    43.64     47.57    45.05
Faster R-CNN   62.41    58.09     64.09    73.04
Our method     79.86    78.10     79.80    87.59

We test the effect of different thresholds on the rule-based baseline. All text
lines are sorted in advance. To decide whether to insert a separator flag before
the current text line, we use as the threshold a certain proportion of the mean
line height of the text lines in the last text block together with the current
line. If two text lines are arranged left and right, a certain proportion of the
mean line width is used as the separation threshold instead. Here, we define the
former proportion as $ratio_h$ and the latter as $ratio_w$, and set
$ratio_h = 2 \cdot ratio$ and $ratio_w = ratio$. Intuitively, the larger the
ratio is, the more the text lines tend to aggregate, while the smaller the ratio
is, the more the text lines tend to separate. We compare the text block
segmentation accuracy and the BLEU score as the threshold changes in Fig. 3,
where 0 means that every text line is treated as a text block and ∞ means that
all text lines in the image are aggregated into a single block. The threshold
used in Table 2 and Table 3 is 0.1, which balances accuracy and BLEU score.
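A possible reading of this thresholding rule is sketched below. Several details are assumptions made for the example and are not specified in the text: boxes are (name, x1, y1, x2, y2) tuples with y growing downward, two lines count as "arranged left and right" when their vertical ranges overlap, and the distance is the gap between facing box edges. The function names are hypothetical.

```python
def should_split(block, box, ratio_h, ratio_w):
    """Decide whether `box` starts a new block after the lines in `block`."""
    prev = block[-1]
    group = block + [box]                                  # last block plus current line
    side_by_side = box[2] < prev[4] and prev[2] < box[4]   # vertical overlap (assumed test)
    if side_by_side:
        mean_w = sum(b[3] - b[1] for b in group) / len(group)
        gap = box[1] - prev[3]                             # horizontal gap between edges
        return gap > ratio_w * mean_w
    mean_h = sum(b[4] - b[2] for b in group) / len(group)
    gap = box[2] - prev[4]                                 # vertical gap between edges
    return gap > ratio_h * mean_h

def rule_based_blocks(lines, ratio=0.1):
    """Group sorted text lines into blocks with ratio_h = 2 * ratio, ratio_w = ratio."""
    ratio_h, ratio_w = 2 * ratio, ratio
    ordered = sorted(lines, key=lambda b: (b[2], b[1]))    # top to bottom, then left to right
    blocks = []
    for box in ordered:
        if blocks and not should_split(blocks[-1], box, ratio_h, ratio_w):
            blocks[-1].append(box)
        else:
            blocks.append([box])
    return [[b[0] for b in block] for block in blocks]

lines = [("A", 0, 0, 100, 20), ("B", 0, 22, 100, 42), ("C", 0, 100, 100, 120)]
for ratio in (0, 0.1, float("inf")):
    print(ratio, rule_based_blocks(lines, ratio))
```

On this toy layout, a ratio of 0 separates every line, 0.1 merges the two closely spaced lines A and B, and an infinite ratio merges everything, which mirrors the two extremes described above.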
We also conduct ablation experiments on the different modal features, using the
Vietnamese test data. Because visual information is very important in natural
scenes, the basic network is constructed using only the visual embedding. Then,
the position embedding and the semantic embedding are added to the network
successively. The accuracy score and the translation BLEU score of these three
models are shown in Table 4. As we can see, the embedding of each modality plays
a positive role in the proposed network.
Some visualization results are shown in Figure 4. They show that the advantage
of our model is that it can segment sentences accurately by using multimodal
information, and thus provide complete contextual information for translation.
Fig. 3: Text block segmentation accuracy and BLEU-2 scores of the rule-based
method under different thresholds. (Plot: the x-axis is the threshold, from 0 to
∞; the two y-axes are Acc and BLEU; one curve per metric for each of Kazakh,
Russian, Uyghur and Vietnamese.)
Fig. 4: Comparative visualization results between our proposed method and other
methods: (a) Rule, (b) Faster R-CNN, (c) Our method, (d) Ground truth.
5 Conclusions
In this study, we propose a novel multimodal framework for text block
segmentation. This framework improves model accuracy through the fusion of
multimodal information and an attention mechanism. It uses an LSTM decoder to
predict the text block segmentation result. In addition, we introduce a
quantitative evaluation to measure the effect of the proposed method. Our model
outperforms the baselines in both text block segmentation accuracy and
translation accuracy. In future work, we will try to incorporate unsupervised
pre-trained models to improve text block segmentation accuracy.
References
1. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez,
J.B., Kochut, K.: Text summarization techniques: a brief survey. arXiv preprint
arXiv:1707.02268 (2017) 4
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh,
D.: Vqa: Visual question answering. In: Proceedings of the IEEE international
conference on computer vision. pp. 2425–2433 (2015) 4
3. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learn-
ing to align and translate. In: 3rd International Conference on Learning Represen-
tations, ICLR 2015 (2015) 1, 3
4. Caglayan, O., Kuyu, M., Amac, M.S., Madhyastha, P., Erdem, E., Erdem, A.,
Specia, L.: Cross-lingual visual pre-training for multimodal machine translation.
In: arXiv preprint arXiv:2101.10044 (2021) 1
5. Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection.
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 6154–6162 (2018) 3
6. D’Informatique, D.E., Ese, N., Esent, P., Au, E., Frasconi, P.P.: Long short-term
memory in recurrent neural networks. epfl (2001) 3, 4
7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the
IEEE international conference on computer vision. pp. 2961–2969 (2017) 5
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016) 1, 4
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014) 7
10. Li, L., Gao, F., Bu, J., Wang, Y., Yu, Z., Zheng, Q.: An end-to-end ocr text
re-organization sequence learning for rich-text detail image comprehension. In:
European Conference on Computer Vision. pp. 85–100. Springer (2020) 4
11. Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: A fast text detector
with a single deep neural network. In: Thirty-First AAAI Conference on Artificial
Intelligence (2017) 1
12. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2017) 4
13. Long, S., Ruan, J., Zhang, W., He, X., Wu, W., Yao, C.: Textsnake: A flexible rep-
resentation for detecting text of arbitrary shapes. In: Proceedings of the European
conference on computer vision (ECCV). pp. 20–36 (2018) 3
14. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent
neural network based language model. In: Interspeech, Conference of the
International Speech Communication Association, Makuhari, Chiba, Japan,
September (2015) 5
15. Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: 18th Interna-
tional Conference on Pattern Recognition (ICPR’06). vol. 3, pp. 850–855. IEEE
(2006) 8
16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. Proceedings of the 40th annual meeting of the
Association for Computational Linguistics pp. 311–318 (2002) 8
17. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 779–788 (2016) 3
18. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. pp. 91–99 (2015) 3, 7
19. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. IEEE transac-
tions on pattern analysis and machine intelligence 39(11), 2298–2304 (2016) 1,
3
20. Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with
automatic rectification. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 4168–4176 (2016) 1, 3
21. Smith, L.N., Topin, N.: Super-convergence: Very fast training of neural networks
using large learning rates. Artificial intelligence and machine learning for multi-
domain operations applications 11006, 369–386 (2019) 7
22. Üstün, A., Berard, A., Besacier, L., Gallé, M.: Multilingual unsupervised neural
machine translation with denoising adapters. In: Empirical Methods in Natural
Language Processing (2021) 1
23. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: NIPS (2014) 6
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998–6008 (2017) 1
25. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in neural
information processing systems. pp. 2692–2700 (2015) 4
26. Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., Shao, S.: Shape robust text
detection with progressive scale expansion network. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 9336–9345 (2019) 3
27. Wu, J., Du, J., Wang, F., Yang, C., Jiang, X., Hu, J., Yin, B., Zhang, J., Dai, L.:
A multimodal attention fusion network with a dynamic vocabulary for textvqa.
Pattern Recognition 122, 108214 (2022) 4
28. Xie, Z., Huang, Y., Zhu, Y., Jin, L., Liu, Y., Xie, L.: Aggregation cross-entropy for
sequence recognition. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 6538–6547 (2019) 3
29. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
attention. In: International conference on machine learning. pp. 2048–2057 (2015)
4
30. Yan, R., Peng, L., Xiao, S., Yao, G.: Primitive representation learning for scene text
recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 284–293 (2021) 1
31. Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: East: an
efficient and accurate scene text detector. In: Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition. pp. 5551–5560 (2017) 1
32. Zhu, Y., Chen, J., Liang, L., Kuang, Z., Jin, L., Zhang, W.: Fourier contour em-
bedding for arbitrary-shaped text detection. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 3123–3131 (2021) 1
33. Zhu, Y., Du, J.: Textmountain: Accurate scene text detection via instance segmen-
tation. Pattern Recognition p. 107336 (2020) 3