Paper 91-Vietnamese Sentence Paraphrase Identification
Abstract—The paraphrase identification task determines whether two text segments share the same meaning, and thereby plays a crucial role in various applications, such as computer-assisted translation, question answering, and machine translation. Although the literature on paraphrase identification in English and other popular languages is vast and growing, research on this topic in Vietnamese remains relatively untapped. In this paper, we propose a novel method to classify Vietnamese sentence paraphrases, which deploys both a pre-trained model to exploit the semantic context and linguistic knowledge to provide further information in the identification process. Two branches of neural networks built in the Siamese architecture are also responsible for learning the differences among the sentence representations. To evaluate the proposed method, we present experiments on two existing Vietnamese sentence paraphrase corpora. The results show that, for the same corpora, our method using PhoBERT to produce the feature vector yields a 94.97% F1-score on the vnPara corpus and a 93.49% F1-score on the VNPC corpus. These results are better than those of the Siamese LSTM method and the pre-trained models.

Keywords—Paraphrase identification; Vietnamese; pre-trained model; linguistics; neural networks

I. INTRODUCTION

Paraphrase identification, the task of determining whether two text segments with different wordings express similar meaning, is critical in various Natural Language Processing (NLP) applications, such as text summarization, text clustering, computer-assisted translation, and especially plagiarism detection [1]. Paraphrases can take place at different linguistic levels, ranging from word and phrase to sentence and discourse. For instance, Neculoiu et al. [2] deployed Siamese recurrent networks to determine similarity among texts, normalizing job titles that are paraphrases at the word level. Meanwhile, to detect paraphrases at the discourse level, Liu et al. [3] calculated semantic equivalence among academic articles published in 2017 to identify documents with similar themes and contents.

Paraphrase corpora are corpora that contain pairs of sentences that convey the same meaning. Regarding Vietnamese, two paraphrase corpora have been published for the language: vnPara by Bach et al. [4] and VNPC (Vietnamese News Paraphrase Corpus) by Nguyen-Son et al. [5]. Both of these corpora consist of sentence-level paraphrases. Examples of paraphrases and non-paraphrases extracted from vnPara and VNPC are shown in Tables I and II, respectively.

TABLE I. EXAMPLES OF VIETNAMESE PARAPHRASES AND THEIR TRANSLATION INTO ENGLISH

| Paraphrase (sentence 1) | Paraphrase (sentence 2) | Corpus |
|---|---|---|
| ASA sẽ tìm thấy người ngoài hành tinh trong 20 năm tới. | ASA nói có thể sẽ tìm thấy người ngoài hành tinh trong 20 năm tới. | vnPara |
| ASA will find aliens in the next 20 years. | ASA says that it is possible to find aliens in the next 20 years. | |
| Đáng chú ý, mã độc này chưa hoạt động mà ở chế độ “ngủ đông”, chờ lệnh tấn công. | Đáng chú ý mã độc này chưa hoạt động mà ở chế độ nằm vùng. | VNPC |
| Remarkably, this malware has not been working yet but is in "hibernate" mode, waiting for an attack command. | Remarkably, this malware has not been working yet but is in "stand-by" mode. | |

TABLE II. EXAMPLES OF VIETNAMESE NON-PARAPHRASES AND THEIR TRANSLATION INTO ENGLISH

| Non-paraphrase (sentence 1) | Non-paraphrase (sentence 2) | Corpus |
|---|---|---|
| Các gián điệp Trung Quốc đã tấn công hệ thống mạng của một nước thuộc khu vực Đông Nam Á. | Các gián điệp Trung Quốc đã tấn công hệ thống mạng của một tổ chức nghiên cứu lớn của chính phủ Canada, giới chức Canada ngày 29/7 cho biết. | vnPara |
| Chinese spies have attacked the network system of a country in Southeast Asia. | Chinese spies have attacked the network system of a large Canadian government research organization, Canadian officials said on 29/7. | |
| Cầu thủ trẻ đắt giá thứ ba mà Real từng đào tạo là Alvaro Negredo. | Một số cầu thủ khác từng trưởng thành từ lò đào tạo trẻ của Real là Cheryshev, Joselu, Diego Lopez và Rodrigo Moreno | VNPC |
| The third most valuable young player whom Real has trained is Alvaro Negredo. | Some other players who have grown up at Real's youth academy are Cheryshev, Joselu, Diego Lopez and Rodrigo Moreno. | |

While string matching is, in theory, the simplest solution to the paraphrase identification question, it does not yield high accuracy rates in practice. Two segments of text that are constructed with different strings can still be paraphrases. On the contrary, text segments that have overlapping substrings can denote different interpretations, and thus they are non-paraphrases.

According to Suzuki et al. [1], these two types of paraphrases and non-paraphrases are categorized as a non-trivial class, whose instances hold a key role in the paraphrase identification task. Table III presents examples of non-trivial instances extracted from VNPC. The WOR column in this table represents the word overlap rate of two given sentences, which is calculated using the Jaccard index [7], where X and Y denote the sets of words of those two sentences:
www.ijacsa.thesai.org 796 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 8, 2021
\[ \mathrm{WOR}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \tag{1} \]

TABLE III. EXAMPLES OF NON-TRIVIAL INSTANCES EXTRACTED FROM VNPC

| Sentence 1 | Sentence 2 | Type | WOR |
|---|---|---|---|
| Nadal đánh bóng ra ngoài, mất mini-break sớm. | Game đấu thứ chín, Nadal có tới bốn cú đánh bóng ra ngoài, để mất break. | paraphrase | 21.05% |
| Nadal hit the ball out and lost the mini-break early. | In the ninth game of the match, Nadal hit four balls out and lost the break. | | |
| Link sopcast xem trực tiếp U23 Đức vs U23 Nigeria trong khuôn khổ bán kết bóng đá nam Olympic 2016 được cập nhật liên tục tại đây. | Link sopcast xem trực tiếp U23 Brazil vs U23 Honduras trong khuôn khổ bán kết bóng đá nam Olympic 2016. | non-paraphrase | 77.27% |
| Sopcast link to watch U23 Germany vs U23 Nigeria in the semi-final of the Men's Olympic Football 2016 is updated continuously here. | Sopcast link to watch U23 Brazil vs U23 Honduras in the semi-final of the Men's Olympic Football 2016. | | |

The accurate identification of non-trivial paraphrases and non-paraphrases requires methods that can exploit the semantic differences of texts. Hitherto, the paraphrase identification task has been a focus of various studies in English and some other popular languages. In particular, works by Yin et al. [8], Mueller et al. [9], Jiang et al. [10], and Zhou et al. [11], among many others, have proposed various methods, ranging from simple string matching to machine learning and deep learning techniques. In contrast, research on this topic in Vietnamese remains relatively limited, with only two studies conducted by Bach et al. [4] and Nguyen-Son et al. [5].

On the one hand, previous literature on the paraphrase identification task in Vietnamese depends heavily on string-based methods. For instance, Bach et al. [4] use nine string-based similarity measures combined with seven string pairs to represent a sentence. As discussed earlier, this kind of method is rather ineffective in classifying non-trivial instances. On the other hand, while deep learning techniques can be applied to Vietnamese, they require an extensive paraphrase corpus, the construction of which demands high costs in human and machine resources. This creates apparent obstacles for conducting research on paraphrase identification in the language.

To address these problems, in this study, we propose a novel method to identify sentence paraphrases in Vietnamese that combines pre-trained models, such as the Bidirectional Encoder Representations from Transformers (BERT) model [12], XLM-R [13], and PhoBERT [14], with linguistic knowledge. The pre-trained models are used as feature extractors to embed semantic context information in the representation vectors of Vietnamese sentences and help to overcome the lack of paraphrase corpora. In addition, linguistic knowledge provides further information for the training process of the Siamese architecture. The rest of the paper is organized as follows. We present previous studies that are relevant to the current study in Section 2, and then propose a novel method to identify sentence paraphrases in Vietnamese in Section 3. Section 4 contains our experiments on evaluating the performance of this method. Section 5 concludes the work and discusses future directions.

II. RELATED WORK

Various paraphrase identification and similarity measurement methods have been proposed for a range of languages. The methods can be categorized into four groups of approaches: string-based, corpus-based, knowledge-based, and hybrid [15]. In this section, we first present the methods laid out in these four approaches and then discuss previous work conducted for the Vietnamese language.

A. String-based Approach

The advantage of this approach lies in its simplicity, as most of the methods are easy to implement. The main information is derived from the text itself, with little to no reliance on additional resources. However, this also lowers the accuracy of the approach, as none of these methods detects semantic similarity effectively, and they therefore fail to account for non-trivial cases, as discussed earlier.

First, among the similarity measures that are widely used across different applications is the Damerau-Levenshtein distance [16]. This measure considers the minimum number of operations needed to convert one text into the other. An operation can be an insertion, a deletion, or a substitution of a single character, or a transposition of two consecutive characters.

Secondly, the n-gram comparison of two texts is also a common algorithm. An n-gram is a sequence of n elements of a text sample. These n elements can be characters, phonemes, syllables, or words, depending on the tasks and applications. Alberto et al. define the formula to calculate the text similarity value using n-grams as follows [17]:

\[ \text{Similarity} = \frac{\text{Number of the same n-grams}}{\text{Total number of n-grams}} \tag{2} \]

Another popular similarity measure, not only in this vein of research but also in other fields, is the Jaccard similarity index. This measure is calculated by taking the ratio of the number of common words to the total number of distinct words of both texts [7]. Moreover, other methods, such as Euclid, Manhattan, and Cosine, typically represent texts in the form of vectors and then compute text similarity using the distance between these vectors, as shown below:

\[ \text{Euclid distance} = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} \tag{3} \]

\[ \text{Manhattan distance} = \sum_{i=1}^{n} |X_i - Y_i| \tag{4} \]

\[ \text{Cosine similarity} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \, \sqrt{\sum_{i=1}^{n} Y_i^2}} \tag{5} \]
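As a rough illustration of the string-based measures above, the following Python sketch implements the word overlap rate of (1), the vector measures of (3) to (5), and a restricted form of the Damerau-Levenshtein distance. Whitespace tokenization, the function names, and the choice of the optimal string alignment variant (which never edits the same substring twice) are assumptions made for this example, not details taken from the cited works.

```python
import math


def word_overlap_rate(sentence_a, sentence_b):
    """Jaccard index over the word sets of two sentences, as in (1)."""
    # Assumption: words are separated by whitespace (e.g. "giáo_viên" is one word).
    x, y = set(sentence_a.split()), set(sentence_b.split())
    if not x and not y:
        return 0.0  # assumption: two empty sentences have no overlap
    return len(x & y) / len(x | y)


def euclid_distance(x, y):
    """Square root of the summed squared differences, as in (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of absolute differences, as in (4)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms, as in (5)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def osa_distance(s, t):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i  # delete all characters of s[:i]
    for j in range(len(t) + 1):
        d[0][j] = j  # insert all characters of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s)][len(t)]
```

The vector functions expect two sequences of equal length; how a sentence is turned into such a vector is left open here.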
In all of these formulas, X and Y denote the two representation vectors of two corresponding segments of text.

Furthermore, while these three measures are considered methods within the string-based approach, they are also used as objective functions by methods in other approaches, especially in machine learning models. Given its straightforward implementation, the string-based approach can be found in applications that do not strictly rely on paraphrase identification. Since the processing occurs mainly on the input strings, these methods can be extended to the analysis of texts in a broad range of languages, including Vietnamese.

B. Corpus-based Approach

The methods of this approach exploit information from existing corpora to predict the similarity of input texts. The most common method in this approach is Latent Semantic Analysis (LSA) [18], which assumes that words with similar meanings co-occur in similar text segments. In this method, a matrix that represents the cohesion between words and text segments is first constructed from one or more given corpora. Then, its dimensions are reduced using the Singular Value Decomposition (SVD) technique. Finally, the similarity is calculated as the cosine similarity between the vectors that form the rows of the matrix.

Some methods use online corpora obtained from websites or search engine results. The advantage of these methods is that the extracted information is not only tremendously large but also regularly updated. For instance, the Explicit Semantic Analysis (ESA) method uses Wikipedia articles as a data source to build representation vectors for texts [19]. Likewise, Cilibrasi et al. calculate text similarity based on the statistics of results from the Google search engine for a given set of keywords [20].

In recent years, deep learning has become more and more popular because of its efficiency in solving classification problems in various fields. In the paraphrase identification task, deep learning on the Siamese architecture for neural networks is the most popular method. Siamese networks are dual-branch networks that share the same weights and are merged by an energy function. The Siamese architecture can learn information about the differences between two input samples. Recently, the Siamese LSTM model has become a well-known combination. Each input text is fed into an LSTM sequence. In [9], the outputs of the two LSTM sequences are then merged by the Manhattan distance function. Meanwhile, Neculoiu et al. [2] use an additional feed-forward neural network that fine-tunes the output of the LSTM layers before they are merged by the cosine similarity function. Neculoiu et al. also use a bi-directional LSTM sequence to exploit the bi-directional context, instead of the single LSTM sequence used in [9].

The Google AI Research team then proposed the Bidirectional Encoder Representations from Transformers (BERT, 2018) model, which uses Transformers as the model's core [12]. These Transformers are fully connected, which allowed the model to outperform the state-of-the-art models at that time on some NLP downstream tasks. The model has achieved high results in over six NLP tasks, including text similarity and paraphrase identification. We implement this BERT model in the proposed method of our study.

The introduction of the BERT model also led to the emergence of the Siamese BERT model. Reimers et al.'s (2019) Sentence BERT (SBERT) model [21] uses the Siamese architecture to fine-tune BERT with some corpora, targeting specific tasks to improve the sentence representation for each task. The results of this work in downstream tasks are better than those of the representation vectors obtained from BERT.

Based on Transformer-XL, Yang et al.'s (2019) XLNet model [22] is argued to yield better results than BERT. The research team pointed out some shortcomings of the BERT model, such as inconsistencies between pre-training and the fine-tuning task, and parallel independent word predictions. To overcome these drawbacks, they utilize both Permutation Language Modeling (PLM) and Transformer-XL [23].

Besides the pre-trained model BERT, Conneau et al. (2020) introduce the XLM-R model (XLM-RoBERTa) [13], a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5 TB of filtered CommonCrawl data in 100 languages, and Vietnamese is one of the supported languages.

Based on RoBERTa, Dat Quoc Nguyen and Anh Tuan Nguyen (2020) introduce the PhoBERT model [14]. PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performances on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

While this is a promising approach for Vietnamese, the lack of large, high-quality corpora remains an obstacle to adopting these methods for the language.

C. Knowledge-based Approach

Methods in this approach exploit linguistic knowledge from knowledge bases such as semantic networks and ontologies. WordNet [24], the most popular semantic network, is often used to extract linguistic knowledge at the lexical level to recognize the similarity between texts. Meanwhile, BabelNet [25] is a newer semantic network that covers 284 languages. The main disadvantage of BabelNet is that its free edition only provides a Java API.

There are six semantic measures, three of which are based on information content, while the remaining three are based on the connection length in the network. The former measures are proposed in Resnik (res) [26], Lin (lin) [27], and Jiang & Conrath (jcn) [28], while the latter can be found in Leacock & Chodorow (lch) [29], Wu & Palmer (wup) [30], and Path Length (path). These measures are slightly different but can be used interchangeably. Path Length is the most commonly used measure.

With the work of Le et al. [31] and BabelNet, we can apply this approach to Vietnamese. However, the Vietnamese semantic networks are not complete and are still being updated, which implies that implementing this approach alone would yield inconsistent results.
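To make the Path Length measure more concrete, the sketch below computes it over a tiny hand-made taxonomy. It assumes the common definition of path similarity as 1 / (1 + number of edges on the shortest path), and it works on words rather than on synsets, which is a simplification of how WordNet is organized. The taxonomy and the node names `educator_concept` and `person` are hypothetical and are not taken from any real semantic network.

```python
from collections import deque

# Hypothetical is-a edges (child -> parent), for illustration only.
TOY_HYPERNYMS = {
    "giáo_viên": "educator_concept",
    "nhà_giáo": "educator_concept",
    "educator_concept": "person",
    "bác_sĩ": "person",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_similarity(graph, w1, w2):
    """1 / (1 + shortest path length); 0.0 if a word is unknown or unreachable."""
    if w1 not in graph or w2 not in graph:
        return 0.0
    queue, seen = deque([(w1, 0)]), {w1}
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        if node == w2:
            return 1.0 / (1 + dist)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return 0.0


graph = build_graph(TOY_HYPERNYMS)
teacher_educator = path_similarity(graph, "giáo_viên", "nhà_giáo")  # two edges apart
teacher_doctor = path_similarity(graph, "giáo_viên", "bác_sĩ")      # three edges apart
```

In this toy network the two words share a direct parent, so their similarity is 1/3, while words that are only connected through a more general node score lower.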
B. Features Extraction Using Pre-trained Model

Heretofore, simple word embedding models such as Word2Vec [34], GloVe [35], and FastText [36] have been common methods used by many research groups to represent text in vector form. However, these models represent every word with a single vector in all contexts. In contrast, the feature vector constructed from a pre-trained model contains full information of the bi-directional context, thanks to the Transformer blocks' multi-head self-attention mechanisms and fully connected architecture.

The features extracted from the pre-trained model were the outputs of the Transformer blocks. For instance, BERT-Base with 12 Transformer blocks provided 12 real vectors for each token, and BERT-Large with 24 Transformer blocks provided 24 vectors. The dimension of each vector was the number of hidden units of each layer: 768 dimensions for BERT-Base and 1,024 dimensions for BERT-Large.

C. Semantic Vector Construction

We followed the method of [33] to construct semantic vectors for a sentence pair. A semantic vector contained information on the semantic relatedness of the words in these sentences. These vectors were constructed by using a semantic network and statistical information of a corpus. In our work, we used the Vietnamese WordNet constructed by Le et al. in 2016 [31] and the statistical information from [37].

From the list of words of the sentence pair, a set of unique words was constructed. The order of these words followed the order of the words in the sentences.

Let M be a two-dimensional matrix containing the relatedness of each pair of words. The matrix M has n rows corresponding to the n words of the considered sentence and m columns corresponding to the m words in the set of unique words. The relatedness between w1 (row r) and w2 (column c) is calculated using the formula:

\[ M(r, c) = \begin{cases} 1 & \text{if } w_1 = w_2 \\ \mathrm{PathLength}(w_1, w_2) & \text{if } w_1 \neq w_2 \\ 0 & \text{if } w_1 \text{ or } w_2 \text{ is not in WordNet} \end{cases} \tag{6} \]

where PathLength(w1, w2) is the Path Length similarity in WordNet of word w1 and word w2. Each element of the lexical vector s is the maximum value on a column of the matrix M:

\[ s(c) = \max_{1 \le r \le n} M(r, c) \tag{7} \]

TABLE IV. PROCESS OF CONSTRUCTING THE SEMANTIC VECTOR FOR THE FIRST SENTENCE

| | anh | ấy | là | giáo_viên | nhà_giáo |
|---|---|---|---|---|---|
| anh (he) | 1 | | | | |
| ấy (he) | | 1 | | | |
| là (is) | | | 1 | | |
| giáo_viên (teacher) | | | | 1 | 0.33 |
| s | 1 | 1 | 1 | 1 | 0.33 |
| Weights | I(anh) · I(anh) | I(ấy) · I(ấy) | I(là) · I(là) | I(giáo_viên) · I(giáo_viên) | I(giáo_viên) · I(nhà_giáo) |
| Semantic vector | 0.1026 | 0.1474 | 0.0654 | 0.2597 | 0.1065 |

\[ I(w) = -\frac{\log p(w)}{\log(N + 1)} = 1 - \frac{\log(n + 1)}{\log(N + 1)} \tag{9} \]

where p(w) is the relative frequency of w in a corpus, N is the number of words in the corpus, and n is the frequency of the word w in the corpus.

Table IV shows the process of constructing the semantic vector for the first sentence in this sentence pair:

- Sentence 1: anh ấy là giáo_viên (he is a teacher)
- Sentence 2: anh ấy là nhà_giáo (he is an educator)

The semantic vector must be padded with zero values to have a fixed length. According to the statistics in [37] about the average sentence length (in words), we assume that the longest sentence may have a length of 50 words, so the set of unique words of a sentence pair contains at most 100 words. Thus, we construct the semantic vector with a fixed length of 100.

D. Parts-of-speech (POS) Vector Construction

WordNet only accepts four simple parts-of-speech, namely noun, verb, adjective, and adverb, so the semantic vector does not contain full information on the sentence's parts-of-speech. Therefore, we also used a POS vector to provide more information to the model. To construct a POS vector, each word in a sentence was tagged with its part-of-speech to create a list of parts-of-speech for each sentence. These POS lists were then represented as real vectors by using the FastText model [36]. To train this model, we used the Vietnamese Treebank corpus [38] with 10,000 POS-tagged sentences. An output vector of the FastText model had a fixed length of 100.
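The semantic vector construction of Section C can be sketched in Python as follows. The sketch follows (6), the column maximum over M, the information content of (9), and the weighting suggested by Table IV, where each entry of s is multiplied by the information content of the unique word and of the sentence word closest to it. The relatedness table, the word frequencies, and the corpus size are hypothetical, so the resulting numbers are not expected to match Table IV; the function names are also assumptions of this example.

```python
import math


def unique_in_order(sentence_1, sentence_2):
    """Unique words of the pair, kept in order of first appearance."""
    unique = []
    for word in sentence_1 + sentence_2:
        if word not in unique:
            unique.append(word)
    return unique


def information_content(word, freq, corpus_size):
    """I(w) = 1 - log(n + 1) / log(N + 1), following (9)."""
    n = freq.get(word, 0)
    return 1.0 - math.log(n + 1) / math.log(corpus_size + 1)


def lexical_vector(sentence, unique_words, relatedness):
    """Per unique word: the column maximum of M and the sentence word reaching it."""
    result = []
    for u in unique_words:
        # One column of M, following (6); relatedness() returns 0.0 for unknown words.
        column = [1.0 if w == u else relatedness(w, u) for w in sentence]
        best = max(range(len(sentence)), key=column.__getitem__)
        result.append((column[best], sentence[best]))
    return result


def semantic_vector(sentence, unique_words, relatedness, freq, corpus_size, length=100):
    """Weight each column maximum by two information contents, then zero-pad."""
    vector = []
    pairs = lexical_vector(sentence, unique_words, relatedness)
    for u, (s, closest) in zip(unique_words, pairs):
        weight = information_content(u, freq, corpus_size) * information_content(
            closest, freq, corpus_size
        )
        vector.append(s * weight)
    vector = vector[:length]
    return vector + [0.0] * (length - len(vector))


# Hypothetical relatedness values and corpus counts, for illustration only.
TOY_RELATEDNESS = {frozenset(("giáo_viên", "nhà_giáo")): 0.33}
TOY_FREQ = {"anh": 5000, "ấy": 3000, "là": 9000, "giáo_viên": 400, "nhà_giáo": 50}
TOY_CORPUS_SIZE = 1_000_000


def toy_relatedness(w1, w2):
    return TOY_RELATEDNESS.get(frozenset((w1, w2)), 0.0)


sentence_1 = "anh ấy là giáo_viên".split()
sentence_2 = "anh ấy là nhà_giáo".split()
unique = unique_in_order(sentence_1, sentence_2)
s_values = [s for s, _ in lexical_vector(sentence_1, unique, toy_relatedness)]
vector_1 = semantic_vector(sentence_1, unique, toy_relatedness, TOY_FREQ, TOY_CORPUS_SIZE)
```

With these toy inputs, `s_values` reproduces the s row of Table IV, and `vector_1` has five weighted entries followed by zero padding up to length 100.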
[Figure labels, top to bottom: y, energy function, output layer, hidden layers, input layer, input vector]

Fig. 2. Overview of the Proposed Method to Identify Vietnamese Sentence Paraphrase.

Fig. 3. Forming the Representation Vector.

TABLE V. EXAMPLES FROM VNPARA AND THEIR TRANSLATION INTO ENGLISH

| Sentence 1 | Sentence 2 | Is paraphrase |
|---|---|---|
| ASA sẽ tìm thấy người ngoài hành tinh trong 20 năm tới. | ASA nói có thể sẽ tìm thấy người ngoài hành tinh trong 20 năm tới. | 1 |
| ASA will find aliens in the next 20 years. | ASA says that it is possible to find aliens in the next 20 years. | |
| Các gián điệp Trung Quốc đã tấn công hệ thống mạng của một nước thuộc khu vực Đông Nam Á. | Các gián điệp Trung Quốc đã tấn công hệ thống mạng của một tổ chức nghiên cứu lớn của chính phủ Canada, giới chức Canada ngày 29/7 cho biết. | 0 |
| Chinese spies have attacked the network system of a country in Southeast Asia. | Chinese spies have attacked the network of a prominent Canadian government research organization, Canadian officials said on July 29. | |
| Bà đã cho ra đời 15 cuốn tiểu thuyết, nhiều tập truyện ngắn và các bài bình luận văn học. | Trong suốt sự nghiệp của mình, bà đã sáng tác 15 tiểu thuyết, nhiều truyện ngắn và nhận được gần 20 giải thưởng văn học lớn. | 1 |
| She has published 15 novels, many short stories, and literary studies. | Over her entire career, she has written 15 novels and many short stories, and has received nearly 20 major literary awards. | |
3) Some Properties of the Two Corpora:

a) Number of sentence pairs per class: The vnPara corpus contains 3,000 sentence pairs and the VNPC corpus contains 3,134. The VNPC corpus contains 2,748 paraphrase pairs and 386 non-paraphrase pairs. Meanwhile, the vnPara corpus is balanced, with 1,500 paraphrase pairs and 1,500 non-paraphrase pairs.

b) Number of non-trivial sentence pairs: Research by Suzuki et al. [1] shows the importance of non-trivial paraphrase and non-paraphrase sentence pairs. The authors define a non-trivial paraphrase sentence pair as a paraphrase sentence pair with a small word overlap rate (WOR) between its two sentences. Conversely, a non-trivial non-paraphrase sentence pair is a non-paraphrase sentence pair with a large word overlap rate between its two sentences. We computed the word overlap rate of the sentence pairs in both corpora, using the formula for the Jaccard index.

Fig. 4 shows that the VNPC corpus contains more non-trivial paraphrase sentence pairs than vnPara. Fig. 5 shows that both corpora contain very few non-trivial non-paraphrase sentence pairs. The vnPara corpus contains almost no paraphrase sentence pairs with an overlap rate from 0.5.

[Fig. 4: percentage of paraphrase sentence pairs per word overlap rate range, from [0; 0.1) to [0.9; 1], for vnPara and VNPC]

[Fig. 5: percentage of non-paraphrase sentence pairs per word overlap rate range, from [0; 0.1) to [0.9; 1], for vnPara and VNPC]

B. Experimental Setup

1) Evaluation Method: To compare the result of our method with the results of the vnPara and VNPC studies, we conducted the experiment in the same manner. Each corpus was divided randomly into 5 folds to perform a 5-fold cross-validation test. We also used the same metrics, accuracy and F1 score, as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{11} \]

C. Experimental Results

The experiments with our proposed method were performed with several configurations of stop words, BERT's output layer, and BERT's output pooling strategy. We achieved the best result with the configuration in which we kept stop words, used the second-to-last layer of the pre-trained model's output, and applied an average pooling strategy to get the feature vector. When experimenting with the Siamese LSTM model of [9], we used the pre-trained Vietnamese Word2Vec model of Vu et al. [40].

Tables VII and VIII show the results of the experiments we conducted on vnPara and VNPC with several methods. Each row in the tables presents the result of one method, evaluated by accuracy and F1 score. Each table shows the available results from previous studies for Vietnamese, the results of the Siamese LSTM model [9], the results of the original pre-trained models, and the results of our method. The results of our method are presented in three rows according to three configurations of additional vectors: adding the semantic vector, adding the POS vector, and adding both the semantic vector and the POS vector.

| Method | Accuracy (%) | F1 score (%) |
|---|---|---|
| Our method (using BERT) | | |
| Feature vector (BERT) + Semantic vector | 94.12 | 94.28 |
| Feature vector (BERT) + POS vector | 72.45 | 72.22 |
| Feature vector (BERT) + Semantic vector + POS vector | 94.27 | 94.38 |
| XLM-R | 74.57 | 75.22 |
| Our method (using XLM-R) | | |
| Feature vector (XLM-R) + Semantic vector + POS vector | 93.67 | 93.85 |
| PhoBERT | 76.33 | 75.80 |
| Feature vector (PhoBERT) + Semantic vector + POS vector | 94.86 | 94.97 |

TABLE VIII. EVALUATION RESULTS OF DIFFERENT METHODS ON THE VNPC CORPUS

We also compute the F1 score of the proposed method on each word overlap rate range and present it in figures similar to Figs. 4 and 5. The calculation of the F1 score is divided into two cases, paraphrase cases and non-paraphrase cases, to assess the effect of non-trivial cases on model training.

[Figure: F1 score per word overlap rate range for vnPara and VNPC]

Fig. 6. F1 Score according to the Word Overlap Rate of Paraphrase Cases in vnPara and VNPC.

[Figure: F1 score per word overlap rate range for vnPara and VNPC]

Fig. 6 shows that the proposed method achieves an F1 score above 80% on all word overlap rate ranges in the VNPC corpus. For the vnPara corpus, the proposed method's result is below 80% for sentence pairs with a word overlap rate of 0.3 or less. In the word overlap rate range [0.1; 0.2), the proposed method achieves an F1 score of 28.57% for the vnPara corpus and 89.16% for the VNPC corpus.
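The evaluation described above can be illustrated with a small Python sketch that computes accuracy as in (10), precision as in (11), and the F1 score per word overlap rate range. Recall and F1 use their standard definitions (TP / (TP + FN) and the harmonic mean of precision and recall), since those equations are not reproduced in this section. The function names, the ten equal-width ranges, and the label encoding (1 for paraphrase, 0 for non-paraphrase) are assumptions of this example.

```python
def confusion_counts(gold, pred):
    """Count TP, TN, FP, FN for binary labels (1 = paraphrase)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def wor_range(wor):
    """Index k of the range [k/10; (k+1)/10); a rate of 1.0 falls in the last range."""
    # The small epsilon guards against floating-point values such as 0.29999999.
    return min(int(wor * 10 + 1e-9), 9)


def f1_per_range(gold, pred, wors):
    """Group sentence pairs by word overlap rate range and compute F1 in each."""
    groups = {}
    for g, p, w in zip(gold, pred, wors):
        golds, preds = groups.setdefault(wor_range(w), ([], []))
        golds.append(g)
        preds.append(p)
    scores = {}
    for k, (golds, preds) in groups.items():
        tp, _, fp, fn = confusion_counts(golds, preds)
        scores[k] = f1_score(tp, fp, fn)
    return scores
```

For a 5-fold cross-validation, these functions would be applied to the predictions of each held-out fold, and the per-fold scores would then be averaged.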
TABLE IX. EVALUATION RESULTS ON THE VIETNAMESE-TRANSLATED MSRP

When tested on the Vietnamese-translated MSRP corpus, the proposed method still obtains higher results than the pre-trained models, and its F1 score in particular is much better. This shows that our proposed method still achieves higher results than the original pre-trained models when processing translated documents.

Although we achieve good results with our method, the model itself has some disadvantages. First of all, the model requires substantial resources to operate, so it is not ready to work in practice. To build the semantic vector, the model depends heavily on the POS tagger, so mistakes of the POS tagger can lead to mistakes in the semantic vector. The pre-trained models are used passively and are not yet involved in the training process. Therefore, they have not been exploited to their full potential for this task.

This research is funded by University of Science, VNU-HCM under grant number CNTT 2020-07. This research is supported by the Computational Linguistics Center, University of Science, Vietnam National University, Ho Chi Minh City, and uses the Vietnamese Treebank, which was developed by the VLSP project.

REFERENCES

[1] Yui Suzuki, Tomoyuki Kajiwara, and Mamoru Komachi. Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems. In: (2017), pp. 36-42. DOI: 10.18653/v1/p17-3007.
[2] Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning Text Similarity with Siamese Recurrent Networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP. 2016, pp. 148-157. DOI: 10.18653/v1/w16-1617.
[3] Ming Liu, Bo Lang, and Zepeng Gu. Calculating Semantic Similarity between Academic Articles using Topic Event and Ontology. In: 37 (2017), pp. 1-21. arXiv: 1711.11508. URL: http://arxiv.org/abs/1711.11508.
[4] Ngo Xuan Bach et al. Paraphrase Identification in Vietnamese Documents. In: 2015 IEEE International Conference on Knowledge and Systems Engineering, KSE 2015. 2015, pp. 174-179. ISBN: 9781467380133. DOI: 10.1109/KSE.2015.37.
[5] Hoang-Quoc Nguyen-Son et al. Vietnamese Paraphrase Identification Using Matching Duplicate Phrases and Similar Words. In: Future Data and Security Engineering. Vol. 11251. Springer International Publishing, 2018, pp. 172-182. ISBN: 978-3-319-70003-8. DOI: 10.1007/978-3-319-70004-5. URL: http://link.springer.com/10.1007/978-3-319-70004-5.
[6] Dien Dinh, Nguyen Le Thanh. English–Vietnamese cross-language paraphrase identification using hybrid feature classes. In: J Heuristics (2019). DOI: 10.1007/s10732-019-09411-2.
[7] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. In: Bulletin de la Société Vaudoise des Sciences Naturelles 142.37 (1901), pp. 547-579. DOI: 10.5169/seals-266450. URL: http://dx.doi.org/10.5169/seals-266450.
[8] Wenpeng Yin and Hinrich Schütze. Convolutional Neural Network for Paraphrase Identification. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics, 2015, pp. 901-911. DOI: 10.3115/v1/N15-1091. URL: https://www.aclweb.org/anthology/N15-1091.
[9] Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016 2014 (2016), pp. 2786-2792.
[10] Y. Jiang, Y. Hao, and X. Zhu. A Chinese text paraphrase detection method based on dependency tree. In: 2016 IEEE 13th International Conference on Networking, Sensing, and Control (ICNSC). 2016, pp. 1-5. DOI: 10.1109/ICNSC.2016.7479003.
[11] Jie Zhou, Gongshen Liu, and Huanrong Sun. Paraphrase Identification Based on Weighted URAE, Unit Similarity and Context Correlation
dings using Siamese BERT-Networks. In: (2019). arXiv: 1908.10084. URL: http://arxiv.org/abs/1908.10084.
[22] Zhilin Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In: (2019), pp. 1-18. arXiv: 1906.08237. URL: http://arxiv.org/abs/1906.08237.
[23] Zihang Dai et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 2978-2988. DOI: 10.18653/v1/p19-1285. arXiv: 1901.02860.
[24] George A. Miller et al. Introduction to wordnet: An on-line lexical database. In: International Journal of Lexicography 3.4 (1990), pp. 235-244. ISSN: 09503846. DOI: 10.1093/ijl/3.4.235.
[25] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. In: Artificial Intelligence 193 (2012), pp. 217-250. ISSN: 00043702. DOI: 10.1016/j.artint.2012.07.001. URL: http://dx.doi.org/10.1016/j.artint.2012.07.001.
[26] Philip Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In: Proceedings of the 14th international joint conference on Artificial intelligence. Vol. 1. 1995, pp. 448-453. arXiv: 9511007 [cmp-lg]. URL: http://arxiv.org/abs/cmp-lg/9511007.
[27] Dekang Lin. Extracting Collocations from Text Corpora. In: st Workshop on Computational Terminology, Computerm '98. 1998, pp. 57-63.
[28] Jay J. Jiang and David W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of International Conference Research on Computational Linguistics. 1997, pp. 19-33. DOI: 10.1152/ajplegacy.1959.196.2.457.
[29] Claudia Leacock and Martin Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. In: Computational Linguistics Special issue on word sense disambiguation 24.1 (1998), pp. 147-165. DOI: 10.7551/mitpress/7287.003.0018.
[30] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Las Cruces, New Mexico, 1994, pp. 133-138. DOI:
Feature. In: 7th CCF International Conference, NLPCC 2018, Hohhot, 10.1152/ajplung. 1998.274.3.l351.
China, August 26 30, 2018, Proceedings, Part II. DOI: 10.1007/978-3-
319-99501-4_4. [31] Le Anh Tu and Tran Van Tri. Exploiting English-Vietnamese OALD
resource and applying in automatically translating WordNet into Viet-
[12] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transform- namese. In: The 6th Young Scientists Conference. 2016, pp. 29 36.
ers for Language Understanding. In: Mlm (2018). arXiv: 1810.04805.
URL: http: //arxiv.org/abs/1810.04805. [32] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-
based and knowledge-based measures of text semantic similarity. In:
[13] Alexis Conneau et al. Unsupervised Cross-lingual Representation The Twenty-First National Conference on Artificial Intelligence and the
Learning at Scale. In: Proceedings of the 58th Annual Meeting of the Eighteenth Innovative Applications of Artificial Intelligence Conference.
Association for Computational Linguistics. DOI: 10.18653/v1/2020.acl- Vol. 1. 2006, pp. 775 780. ISBN: 1577352815.
main.747.
[33] Yuhua Li et al. Sentence similarity based on semantic nets and
[14] Dat Quoc Nguyen and Anh Tuan Nguyen. PhoBERT: Pre-trained corpus statistics. In: IEEE Transactions on Knowledge and Data
language models for Vietnamese. In: Findings of the Association for Engineering 18.8 (2006), pp. 1138 1150. ISSN: 10414347. DOI:
Computational Linguistics: EMNLP 2020 (2020), p. 1037–1042. 10.1109/TKDE.2006.130.
[15] Wael H.Gomaa and Aly A. Fahmy. A Survey of Text Similarity Ap- [34] Tomas Mikolov et al. Efficient estimation of word representations in
proaches. In: International Journal of Computer Applications 68.13 vector space. In: 1st International Conference on Learning Representa-
(2013), pp. 13 18. DOI: 10.5120/11638-7118. tions, ICLR 2013 - Workshop Track Proceedings (2013), pp. 1 12. arXiv:
[16] Iskandar Setiadi. Damerau-Levenshtein Algorithm and Bayes Theorem 1301.3781.
for Spell Checker Optimization. In: Bandung Institute of Technology [35] Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
November (2013), pp. 1 6. DOI: 10.13140/2.1.2706.4008. GloVe: Global Vectors for Word Representation. In: Empirical Methods
[17] Alberto Barrón-Cedeño et al. Plagiarism detection across distant lan- in Natural Language Processing (EMNLP). 2014, pp. 1532 1543. URL:
guage pairs. In: Coling 2010 - the 23rd International Conference on http://www. aclweb.org/anthology/D14-1162.
Computational Linguistics. Vol. 2. August. 2010, pp. 37 45. [36] Piotr Bojanowski et al. Enriching Word Vectors with Subword Informa-
[18] Thomas K Landauer and Susan T. Dutnais. A Solution to Plato’s tion. In: Transactions of the Association for Computational Linguistics
Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, 5 (2017), pp. 135 146. ISSN: 2307-387X. DOI: 10.1162/tacl_a_00051.
and Representation of Knowledge. In: Psychological Review. Vol. 104. arXiv: 1607. 04606.
2. 1997, pp. 221 240. [37] Dien Dinh, Nhung Nguyen Tuyet, and Thuy Ho Hai. Building a corpus-
[19] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Re- based frequency dictionary of Vietnamese. In: The 12th International
latedness using Wikipedia-based Explicit Semantic Analysis. In: IJCAI Conference of The Asian Association for Lexicography. 2018, pp. 72
International Joint Conference on Artificial Intelligence. Vol. 6. 2007. 98.
ISBN: 9783642407215. DOI: 10.1007/978-3-642-40722-2_7. [38] Phuong Thai Nguyen et al. Building a large syntactically-annotated
[20] Rudi L. Cilibrasi and Paul M.B. Vitányi. The Google similarity distance. corpus of Vietnamese. In: ACL-IJCNLP 2009 - LAW 2009: 3rd Lin-
In: IEEE Transactions on Knowledge and Data Engineering 19.3 (2007), guistic Annotation Workshop, Proceedings (2009), pp. 182 185. DOI:
pp. 370 383. ISSN: 10414347. DOI: 10.1109/TKDE.2007.48. arXiv: 10.3115/1698381.1698416.
0412098 [cs]. [39] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a Similarity
[21] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embed- Metric Discriminatively, with Application to Face Verification Sumit. In:
Computer Vision and Pattern Recognition, 2005. Vol. 1. 2005, pp. 539–546. DOI: 10.1007/BF02407565.
[40] Xuan-Son Vu. Pre-trained Word2Vec models for Vietnamese. https://github.com/sonvx/word2vecVN. commit 9bb2d6ffafa4238f82dc6e85e6ade072c8322334. 2016.
[41] W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In: Proceedings of the Third International Workshop on Paraphrasing (IWP 2005), pp. 9–16.