Vietnamese Sentiment Analysis

Vietnamese Sentiment Analysis: An Overview and
Comparative Study of Fine-tuning Pretrained Language Models
DANG VAN THIN, DUONG NGOC HAO, and NGAN LUU-THUY NGUYEN,
University of Information Technology, Ho Chi Minh City, Vietnam, Vietnam National University, Ho Chi
Minh City, Vietnam
Sentiment Analysis (SA) is one of the most active research areas in the Natural Language Processing (NLP)
field due to its potential for business and society. With the development of language representation mod-
els, numerous methods have shown promising efficiency in fine-tuning pre-trained language models in NLP
downstream tasks. For Vietnamese, many pre-trained language models have also been released,
including monolingual and multilingual language models. However, these models differ in their
architectures, pre-training data, and pre-processing steps; consequently, fine-tuning these models
can be expected to yield different effectiveness. In addition, no study to date has evaluated the
performance of these models on the same datasets for the SA task. This article presents a fine-tuning
approach to investigate the performance of different pre-trained language models for the Vietnamese SA
task. The experimental results show the superior performance of the monolingual PhoBERT model and ViT5
model in comparison with previous studies and provide new state-of-the-art performances on five
benchmark Vietnamese SA datasets. To the best of our knowledge, our study is the first attempt to investigate the
performance of fine-tuning Transformer-based models on five datasets with different domains and sizes for
the Vietnamese SA task.
CCS Concepts: • Computing methodologies → Natural language processing;
Additional Key Words and Phrases: Vietnamese Sentiment Analysis, fine-tuning language models, monolin-
gual BERT model, multilingual BERT model, T5 architecture
ACM Reference format:
Dang Van Thin, Duong Ngoc Hao, and Ngan Luu-Thuy Nguyen. 2023. Vietnamese Sentiment Analysis: An
Overview and Comparative Study of Fine-tuning Pretrained Language Models. ACM Trans. Asian Low-Resour.
Lang. Inf. Process. 22, 6, Article 166 (June 2023), 27 pages.
https://doi.org/10.1145/3589131
1 INTRODUCTION
Sentiment Analysis has received much attention in the past two decades, because it has many
applications in the real world. The primary purpose of the SA task is to analyze people’s opinions
This research was supported by The VNUHCM-University of Information Technology’s Scientific Research Support Fund.
Authors’ address: D. V. Thin, D. N. Hao, and N. L.-T. Nguyen (corresponding author), University of Information Tech-
nology, Ho Chi Minh City, Vietnam, Vietnam National University, Ho Chi Minh City, Vietnam; emails: {thindv, haodn,
ngannlt}@uit.edu.vn.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 22, No. 6, Article 166. Publication date: June 2023.
toward products, services, and so on [37], which have a significant influence on other users in the
decision-making process when they shop online [31]. Analyzing user reviews can also give
important information to business organizations for improving and developing their products and
services. Recently, the development of pre-trained language models has helped many architectures
achieve state-of-the-art (SOTA) results on Natural Language Processing (NLP) downstream tasks in
general and on SA tasks in particular. For SA, fine-tuning pre-trained language models has
obtained SOTA results not only for English [13, 41, 67] but also for other languages, such as Italian
[63], Spanish [20], and Arabic [1]. These approaches are mostly based on pre-trained language
models such as BERT [13], RoBERTa [38], and XLM-R [11]. For example, Gao et al.
[19] presented an architecture based on the BERT model and achieved a new SOTA performance
compared with traditional methods or embedding-based models. Li et al. [35, 36] analyzed the
effectiveness of the BERT model for the sentiment in reviews and applied this result to the stock
market domain. Ray et al. [65] proposed a hotel recommendation system based on the combination
of BERT models and different textual features for predicting the sentiment of reviews. Recently,
in response to the COVID-19 pandemic, the authors of [67] presented a study based on BERT
models for the SA task to analyze the impact of coronavirus on social life. The authors scraped data
from Twitter and observed people’s behaviour via social media platforms.
For Vietnamese, one of the low-resource languages, many attempts have been made to evalu-
ate the performance of fine-tuning pre-trained language models on Vietnamese SA tasks [30, 43,
52, 77]. For instance, Truong et al. [77] and Nguyen et al. [43] presented a study on fine-tuning
the PhoBERT model [44] on the Vietnamese Students’ Feedback Corpus (UIT-VSFC) dataset [49],
while Huynh et al. [30] and Nguyen et al. [52] used the multilingual BERT (mBERT) [13] as the
main pre-trained language model. The common point of these works is that the authors used only
one version of BERT with different configurations and compared the performance of their models
to other machine learning/deep learning approaches. These works would have been more informative
had they also compared other pre-trained models, including the monolingual and
multilingual pre-trained language models available for the Vietnamese language, such
as mBERT [13], XLM-R [11], and monolingual variants [9, 44]. We noticed that many monolingual
language models were trained on the same training data and the same architecture based on
the original BERT model [13], but their configuration settings and pre-processing steps are
different. These differences affect the contextual representation of Vietnamese words as well as
the efficiency of fine-tuning models for downstream tasks. For that reason, we raise the research
question: Which is the most suitable pre-trained language model for the Vietnamese SA
tasks?
In this article, we investigate the performance of recent NLP approaches based on fine-tuning
recent language models, including multilingual and monolingual language models,
on five Vietnamese benchmark datasets. Our experiments aim to answer
the research question of which is the best pre-trained language model for Vietnamese in the SA
task. We also list important points when fine-tuning language models for Vietnamese, especially
pre-processing steps. This article also presents an overview of Vietnamese SA to show its
development and the research gap mentioned above for the Vietnamese language. The contributions
of this article can be summarized as follows:
• We present the results of a fine-tuning approach on five Vietnamese benchmark datasets
using six different available pre-trained language models, including monolingual and
multilingual models. Our results also provide new SOTA scores on five available benchmark
datasets for Vietnamese SA compared to previous methods, especially for complicated
datasets.
• We provide an overview of previous studies based on their approaches and techniques re-
lated to Vietnamese sentiment analysis and emotion classification.
• We release our source code and dataset collection instructions as open source to facilitate
adoption and further research on this topic.1
The remainder of this article is structured as follows: Section 2 presents a survey of the previ-
ous studies on the Vietnamese sentiment analysis topic. Section 3 gives information on various
benchmark datasets and describes the methodology using different pre-trained language models.
Section 4 presents the experimental results. Finally, our conclusions are found in the final section.
2 RELATED WORK
Sentiment analysis is known to be one of the most active research fields in NLP. Because of
its applicability, it influences many areas such as education, marketing, social networks, political
science, business, and society. Therefore, there has been a lot of work on sentiment analysis;
however, most of it has focused on English and high-resource languages such as Arabic, Chinese,
and so on. In detail, Al-Ayyoub et al. [3] presented a survey of a large number of studies on Arabic sentiment
analysis (ASA), including methods, tools, and resources. Abu Farha and Magdy [1] presented a
comprehensive survey of the most effective approaches and explored the performance of
fine-tuning transformer language models on three benchmark datasets for ASA. In addition, there
are numerous recent survey papers on sentiment analysis [2, 7, 58, 62, 87, 89], which shows that
this problem is still receiving much attention from the research community.
2.1 Unsupervised Learning

Since the early 2010s, there have been many research studies on the Vietnamese Sentiment Analysis
problem. Table 1 displays the general information of published studies on the SA task in Vietnamese,
including the approach, domain of the dataset, the level of data, and evaluation metrics.
As can be seen in Table 1, the work of Kieu and Pham [33], one of the first studies using an unsupervised
learning approach on the SA problem for the Vietnamese language, presented a rule-based
system based on the GATE framework for sentence-level reviews. The authors annotated a
new dataset for the computer product domain and used a test set instead of the cross-validation
approach. The F1-score measure was used to report the experimental results. Following that, a
few studies have been published on building a sentiment dictionary or ontology for the SA task
[42, 69, 78, 83]. These are lexical resources in which each synset is associated with polarity scores.
Vu and Park [83] made the first attempt to construct a Vietnamese SentiWordNet from a dictionary because
of the lack of a Vietnamese WordNet. Nam Nguyen et al. [42] also built a sentiment dictionary
using statistical methods for a specific domain of online products and services. Tran and Phan
[69] were the first to construct a Sentiment Ontology for Vietnamese based on observation and
linguistic rules. Recently, Vo and Yamamoto [78] introduced a new VietSentiLex dictionary for the
hotel domain. The authors built 6,231 lexicons for five sentiment classes based on English and
Japanese lexicons and demonstrated its effectiveness against other sentiment dictionaries. Based on
Vietnamese linguistic characteristics, Tran and Phan [70] used fuzzy rules to compute
the sentiment scores of adjective phrases. However, their approach requires linguistic experts and
is time-consuming when developing an efficient handcrafted opinion dictionary. Inspired by this research,
Tran and Phan [72] built and released a large set of sentiment phrases covering the Vietnamese emotional
lexicons. Phu et al. [60] proposed many rules to determine the sentiment of Vietnamese adjective
phrases in the context. As a result, they provided an adjective emotion dictionary based on the
Vietnamese language characteristics. Tran and Phan [73] used a sentiment dictionary and syntactic
dependency rules to summarise sentiment for each aspect of the review.
1 https://github.com/dangvanthin/Vietnamese-Sentiment-Classification.
Table 1. The General Information of Previous Studies on Vietnamese SA

Work | Approach | Domain | Test | Level | Metrics
Kieu and Pham [33] | Rule-based system, GATE framework | Computer products∗ | Yes | Sentence | F1-score
Vu and Park [83] | Vietnamese SentiWordNet | Vdict | Yes | Word | F1-score
Nam Nguyen et al. [42] | Dictionary-based, statistical methods | Web∗ | Yes | Sentence | Accuracy
Tran and Phan [69] | Sentiment Ontology, Double propagation, Corpus and linguistics rules | Hotel∗, Cell phone products∗ | Yes | Document | Accuracy
Tran and Phan [70] | Fuzzy rule, Sentiment phrase dictionary, SentiWordNet | Agoda∗ | Yes | Phrase | Accuracy
Phu et al. [60] | Adjective emotion dictionary, Language characteristics | N/A∗ | Yes | Phrase | Accuracy
Tran and Phan [73] | Sentiment dictionaries, Syntactic dependency rules, Ontology | Education∗ | Yes | Sentence | F1-score
Phu et al. [61] | Jaccard Measure (JM), Valence-Totaling | Social networks∗ | Yes | Document | Accuracy
Ha et al. [23] | Semi-supervised, Syntactic rules, VietSentiWordnet | Phone products∗ | Yes | Sentence | F1-score
Bach and Phuong [4] | Weakly supervised learning, Rating feature, SVM | Hotel | CV | Sentence | Accuracy, F1-score
Trinh et al. [76] | Lexicon based, Emotional dictionary, SVM | Facebook∗ | Yes | Sentence | Precision
Vo et al. [81] | Hierarchical Dirichlet Process (HDP), Lexicon-based, Corpus-based, SVM | Reviews∗ | Yes | Sentence | Accuracy
Nguyen-Nhat and Duong [55] | One-document training, Negation handling, Intensification handling, Data augmentation | Reviews | Yes | Document | F1-score
Nguyen-Thi and Duong [57] | Multiple classifiers with enhancing lexicon features: Logistic Regression, SVM, Random Forest, OVO, OVR | E-commerce∗ | CV | Phrases | F1-score
Huong and Hoang [29] | Data augmentation, SVM, NB, RF | Food∗, Product | Yes | Sentence | F1-score
Duong and Nguyen [16] | Semi-supervised learning, Pre-processing techniques, Data Augmentation | Food, Product | Yes | Sentence | F1-score
Duyen et al. [17] | SVM, Naive Bayes, Maximum Entropy, Handcraft features | Hotel∗ | CV | Sentence | Accuracy, F1-score
Bang et al. [5] | Feature selection, Decision Tree, Naive Bayes, SVM, χ² | Hotel∗ | CV | Sentence | F1-score
Tran and Phan [71] | SVM, Maximum Entropy, Naive Bayes, Feature Selection | Hotel∗ | Yes | Document | Accuracy, F1-score
Ha et al. [24] | Lifelong learning, Bigram, Bag-of-bigram features | E-commerce∗ | CV | Document | Macro F1, Micro F1
Vo et al. [79] | Bag-of-Structure | Education∗ | Yes | Sentence | Accuracy, F1, RMSE
Nguyen et al. [49] | Maximum Entropy classifier, Naive Bayes | Education∗ | CV | Sentence | F1-score
Vo et al. [80] | Multi-channel LSTM-CNN | E-commerce∗ | CV | Document | Accuracy
Nguyen et al. [53] | LSTM, Dependency Tree-LSTM, SVM | Education | Yes | Sentence | F1-score, Accuracy
Nguyen et al. [50] | BiLSTM, word2vec | Education | Yes | Sentence | Micro F1
Nguyen and Nguyen [46] | Ensemble method, Shallow algorithms (LR and SVM), Deep learning algorithms (CNN and LSTM) | VLSP, Food∗, Restaurants and Travel∗ | Yes | Document | Accuracy
Hoang et al. [26] | Self-attention Neural networks | Electronics product∗ | Yes | Document | Macro-F1
Nguyen et al. [47] | Self-attention with the Transformer architecture, SE Layer (the layer of Squeeze and Excitation), fastText embedding | Electronic product∗ | Yes | Document | F1-score, Accuracy
Le et al. [34] | CNN, LSTM, Multi-channel, BiLSTM-CNN | Education, E-commerce | Yes | Sentence | F1-score
Huynh et al. [30] | CNN, BiLSTM, Multilingual BERT, Ensemble model | VSMEC, UIT-VSFC, HSD-VLSP | Yes | Document | F1-score
Huang et al. [28] | Transfer learning, Word2Vec, LSTM, Attention mechanism | Education, Movie review | Yes | Sentence | Accuracy, F1-score
Nguyen et al. [43] | Bi-LSTM/GRU, Attention model, Recurrent CNN, Residual CNN, Transfer methods, Transfer Learning | E-commerce∗ | Yes | Document | Macro-F1
Nguyen et al. [51] | CNN, LSTM combined with word2vec and POS feature, Two channel model | Reviews | Yes | Sentence | Accuracy, F1-score, AUC
Truong et al. [77] | Fine-tuning pre-trained language model, PhoBERT | Education | Yes | Sentence | Accuracy, Weighted F1

The symbol “∗” in the “Domain” column means that previous studies annotated the new dataset. CV: cross-validation.
We group all approaches in the same category.
2.2 Semi-supervised Learning

Another approach that has been extensively studied is semi-supervised learning. Ha
et al. [23] described a model that uses syntactic rules to extract feature words and opinion words.
Subsequently, the authors applied HAC clustering and semi-supervised Support Vector Machine
(SVM)-kNN classification to group feature words. Finally, they used VietSentiWordNet
to determine the sentiment of the review. After that, Vu and Park [83] outlined the first
semi-supervised learning framework to construct a Vietnamese SentiWordNet (VSWN) automatically
from a dictionary rather than from WordNet. This VSWN is a lexical resource and has played a vital role in
the SA task. Next, Bach and Phuong [4] proposed a novel semi-supervised method based on
reviews and associated rating scores and showed its effectiveness compared with purely supervised
methods. In addition, some works utilized sentiment lexicons, phrases, and dictionaries
combined with machine learning methods [75, 76]. Specifically, Trinh et al. [76] constructed five
Vietnamese emotional dictionaries and used an SVM model to identify the emotions of the text input. Their
dictionary is essentially based on the English SO-CAL dictionary and some selected Vietnamese
words. However, this approach depends entirely on the English dictionary, and the number of
emotional expressions is quite limited. Similarly, Trinh et al. [75] presented a framework based
on a lexicon dictionary combined with supervised learning algorithms. In addition, there have been
some studies on data augmentation techniques for this topic [16, 29, 40]. Most
of the researchers applied the EDA techniques [84] and an ensemble of data augmentation operations such as
synonym replacement, random swap, random insertion, and random deletion to generate more samples
for training the model.
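The four augmentation operations named above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not the implementation used in any of the cited studies; the function names, the English example tokens, and the tiny `synonyms` dictionary are all assumptions made for the example.

```python
import random

def synonym_replacement(tokens, synonyms, n_replace, rng):
    """Replace up to n_replace tokens that have an entry in the synonym dictionary."""
    tokens = list(tokens)
    positions = [i for i, t in enumerate(tokens) if synonyms.get(t)]
    rng.shuffle(positions)
    for i in positions[:n_replace]:
        tokens[i] = rng.choice(synonyms[tokens[i]])
    return tokens

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_insertion(tokens, synonyms, n_inserts, rng):
    """Insert a synonym of a randomly chosen token at a random position."""
    tokens = list(tokens)
    for _ in range(n_inserts):
        candidates = [t for t in tokens if synonyms.get(t)]
        if not candidates:
            break
        word = rng.choice(candidates)
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(synonyms[word]))
    return tokens

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

# Hypothetical example data (English for readability).
rng = random.Random(0)
tokens = ["the", "room", "was", "clean", "and", "quiet"]
synonyms = {"clean": ["tidy"], "quiet": ["calm"]}
augmented = [
    synonym_replacement(tokens, synonyms, 1, rng),
    random_swap(tokens, 1, rng),
    random_insertion(tokens, synonyms, 1, rng),
    random_deletion(tokens, 0.2, rng),
]
for sample in augmented:
    print(" ".join(sample))
```

Each augmented sample keeps the label of the original review, which is why these operations are usually applied with small edit counts so that the sentiment is unlikely to change.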
2.3 Supervised Learning

Supervised learning is one of the most popular approaches to the SA task for many
languages. For Vietnamese, Duyen et al. [17] presented an empirical study based on traditional
machine learning methods such as SVM, Naive Bayes (NB), and the Maximum Entropy Model
combined with handcrafted features, including linguistic features and the overall score. Their work focused
on investigating the performance of different handcrafted features and gave some recommendations
for selecting features for this task. After that, Bang et al. [5] also presented a study on feature
selection techniques and used traditional machine learning methods such as SVM, NB, and Decision
Tree. With the development of deep learning architectures, there have been several successful
studies on this task. However, a weakness of deep learning architectures is that they need
more samples to train the model effectively. Published studies have reported the performance
of the following deep learning methods on Vietnamese datasets: Convolutional Neural Network
(CNN) [51], Long Short-Term Memory (LSTM) [50, 54], self-attention neural networks [26, 47], and
ensemble methods [30, 32, 34, 46, 56, 74]. In detail, Nguyen et al. [51] proposed an architecture based
on two parallel neural network models that take the word2vec feature and the pos2vec feature, respectively.
They experimented on four datasets and showed the proposed model’s effectiveness; however, the
authors compared it only with two single deep learning models (CNN and LSTM) and not with previous
studies. Nguyen et al. [50] and Nguyen et al. [54] each presented an architecture based on the
LSTM model on the student feedback dataset [49]. We notice that the experimental results are
reported in different ways. The results of the BiLSTM model [50] are reported as Macro F1-score
using the cross-validation technique, while the results of the dependency LSTM [54] are reported
on the test set with Weighted F1-score. This makes a fair comparison of
methods difficult and affects the development of future methods for the Vietnamese SA problem.
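To see why Macro F1 and Weighted F1 are not interchangeable, the following sketch computes both on a small imbalanced toy example. The label names and predictions are invented for illustration and do not come from any of the cited datasets; the definitions are the standard ones (unweighted mean of per-class F1 versus mean weighted by class support).

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy imbalanced data: 8 positive, 1 negative, 1 neutral review.
y_true = ["pos"] * 8 + ["neg"] + ["neu"]
# The classifier misses the single negative review entirely.
y_pred = ["pos"] * 8 + ["pos"] + ["neu"]

print(f"macro F1    = {macro_f1(y_true, y_pred):.3f}")     # 0.647
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")  # 0.853
```

The same predictions score about 0.65 under Macro F1 and about 0.85 under Weighted F1, so results reported with different averaging schemes cannot be compared directly.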
2.4 Deep Learning

Following the effectiveness of self-attention neural networks on English datasets, Hoang et al.
[26] and Nguyen et al. [47] studied different self-attention models on their own collected
datasets and compared them only with their own implemented models, not with previous studies. It is
therefore difficult to verify the performance of these models against published models. In addition, ensemble
models based on several individual models have been studied because of their better generalization
performance. For Vietnamese SA, Nguyen and Nguyen [46] presented an ensemble model based on
deep neural networks and traditional machine learning models. The ensemble techniques
employed in their paper include the average rule, max rule, and voting rule. Similarly, Khai Tran
and Thi Phan [32] and Tran and Phan [74] presented a meta-ensemble architecture using rule-based,
deep learning, statistical, and machine learning models. This model is able to capture
both surface features and deep features in the text; however, it is computationally expensive.
Nguyen-Thanh and Tran [56] applied a Voting Ensemble model based on traditional machine
learning algorithms such as SVM, NB, SGD, and LR, and experimented on their crawled dataset of
hotel reviews. In a different work, Le et al. [34] presented an architecture combining BiLSTM
and CNN models on e-commerce reviews. Recently, Huynh et al. [30] also presented an
ensemble based on various deep learning models and a pre-trained multilingual BERT model. The
experimental results are higher than those of single models and previous studies, but the difference is
not yet significant.
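The average, max, and voting rules mentioned above are standard ways of combining classifier outputs. The sketch below uses their common textbook definitions, which may differ in detail from the formulation in [46]; the three probability vectors are invented for illustration.

```python
from collections import Counter

def average_rule(prob_lists):
    """Average the class probabilities across models and return the argmax class."""
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / len(prob_lists) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

def max_rule(prob_lists):
    """Return the class that receives the highest probability from any single model."""
    n_classes = len(prob_lists[0])
    best = [max(p[c] for p in prob_lists) for c in range(n_classes)]
    return max(range(n_classes), key=best.__getitem__)

def voting_rule(prob_lists):
    """Each model votes for its argmax class; ties go to the lowest class index."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in prob_lists)
    top = max(votes.values())
    return min(c for c, v in votes.items() if v == top)

# Hypothetical outputs of three models over the classes [negative, neutral, positive].
model_probs = [
    [0.05, 0.05, 0.90],  # very confident "positive"
    [0.45, 0.40, 0.15],  # weakly "negative"
    [0.50, 0.35, 0.15],  # weakly "negative"
]

print(average_rule(model_probs))  # 2: one confident model dominates the mean
print(max_rule(model_probs))      # 2: highest single probability is for "positive"
print(voting_rule(model_probs))   # 0: two of three models vote "negative"
```

The example shows that the rules can disagree: probability-based rules reward one confident model, while voting rewards agreement among models.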
2.5 Transformer-based Learning

Over the past few years, there has been growing interest in fine-tuning the BERT architecture and its
variants [13] on downstream tasks in the NLP field; for Vietnamese, much work on the potential of
pre-trained BERT models has been carried out [43, 52, 77]. Truong et al. [77] explored the
performance of the pre-trained language model PhoBERT on the UIT-VSFC dataset. The authors used the
representation of the last four layers of the model as the input representation. However, the authors
tuned the hyperparameters on the test set in their experiments, which might affect the model’s
generalization performance. Typically, hyperparameter tuning is done using a validation set or
cross-validation techniques. Nguyen et al. [52] presented the performance of a multilingual
pre-trained BERT model combined with a deep learning architecture. Moreover, Nguyen et al. [52] also
presented the experimental results of the PhoBERT and mBERT models on their collected
datasets. To the best of our knowledge, there are many available pre-trained language models with
different configurations for the Vietnamese language; therefore, comparing the effectiveness of
the available pre-trained language models on the same benchmark datasets remains a critical open issue
for this task.
2.6 Summary
We summarize the common points from previous studies on the Vietnamese language as follows:
• The previous studies used only one of BERT’s versions on different datasets with differ-
ent configurations and compared the performance of their models to other machine learn-
ing/deep learning approaches.
• There have not been many studies comparing the effectiveness of models on benchmark
datasets. Therefore, it is difficult to evaluate the sustainability and scalability of a model.
• Some previous studies did not use the same evaluation metrics for the same datasets.
For those reasons, in addition to comparing the performance of different language models on
standard datasets, we report the experimental results on common metrics and compare
them with the results of previous studies on the same datasets.
Fig. 1. Overview of Encoder-Decoder architecture with the input and output examples for the educational
domain.
3 METHODOLOGY AND RESOURCES

This section presents our method of fine-tuning the pre-trained BERT models and T5 encoder-decoder
models to build sentiment classifiers on the experimental datasets. First, we present two task
descriptions. Second, we describe our overall architecture. Third, we briefly give information
on the available pre-trained language models for the Vietnamese language. Finally, we introduce the five
benchmark datasets for the Vietnamese SA task.
3.1 Task Description

Sentiment Analysis: Given a review at the sentence level or document level, this task aims to
predict the sentiment polarity of the review. In our case, the sentiment polarity is assigned to
one of three labels: positive, neutral, or negative. For the emotion task, the output is classified into
Enjoyment, Sadness, Fear, Anger, Disgust, Surprise, and Other. The “Other” class indicates that the
comment does not contain any emotion.
3.2 Architecture
First, we present two transformer-based architectures to evaluate the performance on the five
benchmarks, including Text2Text-based architecture and BERT-based architecture.
3.2.1 Text2Text Architecture. To explore the performance of Text2Text architectures, we fine-tune
the pre-trained encoder-decoder for the downstream tasks as in the original study [64]. Figure 1
presents the Text-to-Text framework for the sentiment classification task. To train the
encoder-decoder architecture, we convert the task to the text-to-text format; that is, we convert
the numeric label to the corresponding target word in the Vietnamese language. For example,
the model would be fed the sequence “ hay, ” (Good lectures, lots of
practical exercises). The model is required to predict a sequence corresponding to the target class,
in this case, the “ ” (positive). We also translate the target classes to Vietnamese words
with different meanings to explore the best output representation. For example, the “neutral” label
can be translated into Vietnamese as “ ” or “ .” Moreover, we check how each model
pre-processes its text so that we apply compatible steps, because this affects the representation
of tokens in the review.
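The label-to-text conversion described above can be sketched as follows. The Vietnamese target words in `LABEL_WORDS` are common translations chosen here for illustration only; they are an assumption and not necessarily the strings the authors used, and the helper names and the neutral fallback are likewise hypothetical.

```python
# Illustrative mapping from numeric labels to Vietnamese target words (assumed, see lead-in).
LABEL_WORDS = {0: "tiêu cực", 1: "trung tính", 2: "tích cực"}  # negative, neutral, positive
WORD_TO_LABEL = {word: label for label, word in LABEL_WORDS.items()}

def to_text2text(review, label_id):
    """Turn a (review, numeric label) pair into a source/target text pair."""
    return {"source": review, "target": LABEL_WORDS[label_id]}

def decode_prediction(generated, default=1):
    """Map a generated string back to a numeric label.

    Any output that is not a known target word falls back to `default`
    (neutral here, an arbitrary choice for this sketch).
    """
    return WORD_TO_LABEL.get(generated.strip().lower(), default)

# English gloss from the text stands in for the Vietnamese review.
example = to_text2text("Good lectures, lots of practical exercises", 2)
print(example["target"])
print(decode_prediction(example["target"]))
```

Trying several target words per class, as the text describes, then amounts to swapping the entries of the mapping and re-running fine-tuning.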
3.2.2 BERT-based Architecture. The overview of our architecture is shown in Figure 2. To begin,
the input review is processed in the pre-processing component. This pre-processing component
is one of the crucial parts of classification architectures, which helps models extract meaningful
insights from the original data. We found that each pre-trained BERT architecture processes
the text of its pre-training data in a different way. For example, Nguyen and Tuan Nguyen [44]
used the word segmentation technique before creating the vocabulary of subword types. However,
other models such as mBERT [13], XLM-R [11], and viBERT_FPT [9] did not apply this technique
when preparing the training data. This vital point is not mentioned in the studies that use
pre-trained language models [30, 43, 52, 77]. Therefore, it is vital to carefully refer to the pre-processing
steps of each model to prepare the input of models accurately. In this article, we apply several
pre-processing steps for user review data presented by Dang et al. [12], except lowercasing and word
segmentation. Then we add the basic processing steps of each pre-trained language model.
This means that we apply the pre-processing steps and then use the word segmentation technique
[82] without lowercasing for the PhoBERT model. For other pre-trained language models such
as XLM-R, viBERT4News, and so on, we apply neither word segmentation nor lowercasing
to the text input, because these models were not trained on data with these pre-processing
steps.
Fig. 2. Overview of our fine-tuning approach based on the BERT model and its variants.
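A minimal sketch of this model-dependent pre-processing is given below. The configuration table, the whitespace clean-up standing in for the shared steps of [12], and `toy_segmenter` are all assumptions for illustration; a real pipeline would call an actual Vietnamese word segmenter such as the one cited as [82].

```python
# Which models expect word-segmented input (per the description in the text).
MODEL_PREPROCESSING = {
    "PhoBERT": {"word_segmentation": True},
    "XLM-R": {"word_segmentation": False},
    "viBERT4News": {"word_segmentation": False},
}

def toy_segmenter(text, compounds):
    """Hypothetical stand-in for a word segmenter.

    Joins the syllables of known multi-syllable words with '_',
    which is the input format PhoBERT expects.
    """
    for compound in compounds:
        text = text.replace(compound, compound.replace(" ", "_"))
    return text

def preprocess(text, model_name, segment):
    """Apply shared clean-up, then the model-specific step. Case is left untouched."""
    text = " ".join(text.split())  # placeholder for the shared cleaning steps
    if MODEL_PREPROCESSING[model_name]["word_segmentation"]:
        text = segment(text)
    return text

def segment(text):
    # "Giảng viên" (lecturer) is a two-syllable word used as the only dictionary entry.
    return toy_segmenter(text, ["Giảng viên"])

review = "Giảng viên  dạy hay"
print(preprocess(review, "PhoBERT", segment))  # Giảng_viên dạy hay
print(preprocess(review, "XLM-R", segment))    # Giảng viên dạy hay
```

Keeping the per-model settings in one table makes it harder to feed a model text that was prepared differently from its pre-training data.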
As presented in Figure 2, the next component is Tokenization. This component automatically adds two
special tokens, [CLS] and [SEP], at the start and end of the input, respectively. Then,
it segments the input into tokens that are mapped to their index values in the BERT
vocabulary. It is also responsible for padding all inputs to the
same length for each dataset. Most of the studies [30, 43, 52, 77] conducted experiments using
BERT models on sentence-level datasets; therefore, the handling of long inputs has yet to be addressed for
document-level datasets in the Vietnamese SA task. Since the maximum input length of the
BERT model and its variants is limited to 512 tokens, we need a specific approach to address
this problem in the document-level datasets. In our case, we apply the simple “head only” truncation
approach proposed by Sun et al. [68] to deal with long text, because the main subject often
appears at the beginning of the review. Aside from that, for the sentence-level datasets, we use the
maximum sentence length in the training set as the max length of the whole input. However, due
to the limitation of resources, we explicitly set the maximum input length to 256 tokens, which is
the maximum length of the PhoBERT and viBERT_FPT models. After the tokenization steps, we
feed the list of token ids with the corresponding attention masks into the pre-trained BERT model to
extract the hidden states of the tokens. The BERT model with L transformer layers calculates the
corresponding contextualized representations $H^L = \{h^L_{CLS}, h^L_1, \ldots, h^L_n, h^L_{SEP}\} \in \mathbb{R}^{T \times dim_h}$, where
$dim_h$ denotes the dimension of the representation vector. Following previous works [13, 15, 68],
we extract the hidden state $h^L_{CLS}$ of the [CLS] token in the last transformer layer as the input
representation. This representation is directly fed into the fully connected layer with Softmax activation
for predicting P,

$$P = \mathrm{softmax}(W_o h^L_{CLS} + b_o), \tag{1}$$

where $W_o$ and $b_o$ are the learnable parameters of the fully connected layer.
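The input preparation and the classification head of Equation (1) can be sketched in plain Python. The special-token ids, the tiny dimensions, and the weight values are hypothetical; a real system would take the ids from the model's tokenizer and the [CLS] vector from the pre-trained encoder.

```python
import math

CLS_ID, SEP_ID, PAD_ID = 0, 2, 1  # hypothetical ids; real values come from the tokenizer

def encode(token_ids, max_len):
    """Head-only truncation, [CLS]/[SEP] insertion, padding, and attention mask."""
    body = token_ids[: max_len - 2]        # keep the head; reserve two slots for special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                  # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 = padding

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_cls, W_o, b_o):
    """Equation (1): P = softmax(W_o h_cls + b_o)."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + b for row, b in zip(W_o, b_o)]
    return softmax(logits)

# A 10-token review truncated to max_len = 8, and a 2-token review padded to 6.
print(encode(list(range(10, 20)), 8))
print(encode([10, 11], 6))

# Toy [CLS] vector with dim_h = 2 and three output classes.
probs = classify([1.0, -1.0], [[0.5, 0.0], [0.0, 0.5], [1.0, -1.0]], [0.0, 0.0, 0.0])
print(probs)
```

In the sketch the third class receives the largest logit, so it also receives the largest probability; during fine-tuning, $W_o$, $b_o$, and the encoder weights are updated jointly.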
3.3 Vietnamese Pre-trained Language Models

There are several pre-trained language models with different settings available for the Vietnamese
language, including monolingual and multilingual models. The available models are built on
various architectures and apply different pre-processing steps to the text input. Moreover,
the domain and size of the training data differ for each model. Therefore, each pre-trained
language model might be more or less effective on a specific dataset. For that reason, we decided
to use these models and evaluate their performance independently on five benchmark datasets.
The available models for Vietnamese that are employed in the experiments in this article are
briefly summarized as follows:
• PhoBERT: PhoBERT was developed based on the RoBERTa architecture [38] and proposed
by Nguyen and Tuan Nguyen [44] for the Vietnamese language. In particular, this is the first
large-scale monolingual pre-trained language model trained on word-level data, using
20 GB of Vietnamese text from two sources (Wikipedia + News). However, a limitation of
this model is its maximum input length of 256 subword tokens, while other models
allow a maximum input length of up to 512 tokens.
• viBERT4News: This model was built entirely on the original BERT architecture [13]. The
authors used the word sentence piece and BERT tokenization with the same configuration
as the official BERT case without lowercase. The training data comprises 20 GB of News
text for the Vietnamese language. This model is integrated into the ViNLP2
annotation system for Vietnamese and can be accessed on GitHub.3
• viBERT: The viBERT model was developed based on the original multilingual BERT
architecture [9] and trained on pre-processed online newspaper data. The vocabulary of this model is
modified from that of the mBERT model [13] by removing the insufficient vocabulary. This model
is trained on only 10 GB of data.
• viELECTRA: This model follows the ELECTRA pre-training approach [10], which aims to boost
the downstream performance of a pre-trained model. An advantage of ELECTRA is that it requires
significantly less compute for the pre-training stage; Bui et al. [9] therefore trained this
model on 60 GB of Vietnamese data from two sources (NewsCorpus + OscarCorpus). The pre-trained
viELECTRA model is also released as open source for further research in Vietnamese NLP.
• mBERT: Multilingual BERT was released by Devlin et al. [13] and trained on the concatenation
of monolingual Wikipedia datasets from 104 languages, including Vietnamese. mBERT is
particularly well suited to cross-lingual study because it produces multilingual
representations shared across languages.
• XLM-R: XLM-R was proposed by Conneau et al. [11] for cross-lingual language understanding
and achieved SOTA results on many downstream NLP tasks, especially for low-resource languages
that lack labelled data. Vietnamese is one of the largest languages in its training data, with
137 GB of Vietnamese text. In principle, more training data can significantly improve the
quality of a language model [38]. It is therefore interesting to evaluate this model on the
Vietnamese SA task.
• mT5: mT5 [86] is a multilingual variant of the T5 architecture trained on a new Common
Crawl corpus covering 101 languages. It is one of the multilingual pre-trained generative
models with strong performance on different benchmarks. The Vietnamese portion of the mT5
training data contains 116B tokens, about 1.87% of all languages.
• ViT5: The authors of [59] presented a Transformer-based encoder-decoder model based on the
T5 architecture for the Vietnamese language. This model is trained on a total of 138 GB of
text data filtered from the CC100 dataset. There are two versions of this model, ViT5 base and
ViT5 large. To date, its effectiveness on classification problems has not been investigated.
2 https://github.com/bino282/ViNLP.
3 https://github.com/bino282/bert4news.
Table 2. Statistics for the Experimental Dataset

Data       N       c  Domain      l_pre-avg  l_pre-max  l_avg  l_max  |V|    Test   Level
HSA        3,304   3  Hotel       13.67      161        9.00   124    1,434  CV     Sentence
VS         17,500  3  E-commerce  32.14      905        27.88  712    5,816  CV     Document
UIT-VSFC   16,174  3  Education   14.22      161        9.57   124    4,336  3,166  Sentence
VLSP       6,150   3  Technology  28.86      2,885      26.15  2,481  9,496  1,050  Document
UIT-VSMEC  6,927   7  Facebook    14.00      164        12.00  139    6,898  693    Sentence
N: The total size of the dataset. c: Number of target classes. l_pre-avg: Average length of
sentences before pre-processing. l_pre-max: Maximum length of a sentence before pre-processing.
l_avg: Average length of sentences after pre-processing. l_max: Maximum length of a sentence
after pre-processing. |V|: Vocabulary size. Test: The size of the testing set (CV denotes
cross-validation; fivefold CV was used).
3.4 Datasets
We examine five benchmark datasets that vary in size, domain, annotation level, and so on.
Table 2 summarizes the experimental datasets used in this study. These datasets are freely
available or available for research purposes. A brief introduction to each dataset follows:
• Hotel Sentiment Analysis (HSA): The HSA dataset consists of sentence-level reviews for
the hotel domain in Vietnamese. It was annotated by two people, with a high inter-annotator
agreement of 89%, for three sentiment classes: “positive,” “neutral,” and “negative.”
Each review also includes a user rating score ranging from 1 to 10. Bach and Phuong [4] and
Duyen et al. [17] report results with fivefold cross-validation measured by Accuracy, while
Vo and Yamamoto [78] report the F1-score.
• VLSP Sentiment Analysis (VLSP)4 : This dataset was presented at the first official shared
task on Sentiment Analysis for the Vietnamese language at the VLSP workshop [45]. The VLSP
dataset is annotated at the document level for the electronics domain. Notably, it was built
to be balanced across classes for the evaluation campaign. Precision, Recall, Weighted F1, and
Accuracy are used to evaluate the participants' systems and previous studies [46, 51].
4 https://vlsp.org.vn/resources-VLSP.
• Vietnamese Sentiment Analysis (VS): This is a document-level review dataset covering
various products from Vietnamese e-commerce sites. Each review is annotated with one of three
sentiment polarities (positive, negative, and neutral). The inter-annotator agreement,
measured by Cohen's Kappa, is above 0.74 [80]. The original work reported its experimental
results with fivefold cross-validation. The dataset is published on the author's GitHub.5
Surprisingly, it has not been studied in any other previous work.
• UIT-VSFC6 : This is a sentence-level dataset for sentiment analysis of student feedback.
The task is to assign a sentiment polarity to each text, framed as multi-class classification
[49]. The dataset was annotated by three people with a high inter-annotator agreement of
91.20% and is evaluated using the Weighted F1-score and Accuracy [49, 53]. However, some
studies evaluate on this dataset using the Micro F1-score [30, 50].
• Vietnamese Social Media Emotion Corpus (UIT-VSMEC): This dataset consists of annotated
sentences crawled from the Facebook social network. Each sentence is assigned one of seven
emotion labels. The dataset was annotated by four people with an agreement of 82.94% [25].
Weighted F1 and Accuracy are used for evaluation in Ho et al. [25], Huynh et al. [30], and
Nguyen and Kiet [48].
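The agreement figures quoted above are either raw percentage agreement or, for the VS dataset, Cohen's Kappa. As a minimal sketch of how the two differ for a pair of annotators, the following computes both; the label sequences are invented for illustration and are not taken from any of the five datasets.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations, for illustration only.
ann_1 = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg"]
ann_2 = ["pos", "neg", "neg", "neu", "neg", "pos", "pos", "neg"]

print("percent agreement:", percent_agreement(ann_1, ann_2))
print("Cohen's kappa:", round(cohen_kappa(ann_1, ann_2), 3))
```

Kappa is lower than raw agreement whenever some agreement is expected by chance, which is why the two kinds of figures reported for these datasets are not directly comparable.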
4 EXPERIMENTS
In our experiments, we use five pre-trained BERT-based language models: mBERT cased,7
PhoBERT_base,8 viBERT4news,9 viBERT_FPT,10 and viELECTRA.11 For the text2text models (mT5
and ViT5), we employ the following variants: mT5 base,12 mT5 large,13 ViT5 base,14
and ViT5 large.15 All models are available through the HuggingFace Transformers library [85].
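As a rough sketch of how these checkpoints could be loaded, the registry below maps each model name to the Hugging Face identifier given in footnotes 7 to 15. The `load` helper, the `TEXT2TEXT` grouping, and the choice of a sequence-to-sequence head for the T5-style models are assumptions made for illustration; the article does not list its loading code.

```python
# Hugging Face checkpoint identifiers, copied from footnotes 7-15.
CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "PhoBERT_base": "vinai/phobert-base",
    "viBERT4news": "NlpHUST/vibert4news-base-cased",
    "viBERT_FPT": "FPTAI/vibert-base-cased",
    "viELECTRA": "FPTAI/velectra-base-discriminator-cased",
    "mT5 base": "google/mt5-base",
    "mT5 large": "google/mt5-large",
    "ViT5 base": "VietAI/vit5-base",
    "ViT5 large": "VietAI/vit5-large",
}

# Encoder-decoder (text2text) models; everything else is an encoder-only model.
TEXT2TEXT = {"mT5 base", "mT5 large", "ViT5 base", "ViT5 large"}


def load(name, num_labels):
    """Hypothetical helper: load tokenizer and model (needs `transformers`)."""
    # Imported lazily so the registry can be inspected without the library.
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    if name in TEXT2TEXT:
        # Assumption: the label is generated as text, so no classification head.
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    else:
        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=num_labels
        )
    return tokenizer, model
```

For example, `load("PhoBERT_base", num_labels=3)` would download the PhoBERT checkpoint with a three-way classification head on first use.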
4.1 Hyper-parameters Settings

We fine-tune the recommended BERT-based architectures with the Hugging Face library using the
following hyper-parameters: learning rate = 5e-5, epochs = 3, warmup_steps = 500, and a batch
size of 32, 16, or 8 depending on the size of the dataset. We use AdamW [39] with the default
values β1 = 0.9 and β2 = 0.999. The maximum sequence length depends on the dataset: for the
sentence-level datasets, we use the length of the longest review in the dataset; for the
document-level datasets, we set the maximum length to 256 because of limited research
resources. For the Text2Text architecture, we likewise use the length of the longest review
for the sentence-level datasets and 512 tokens for the document-level datasets. The learning
rate is set to 5e-5 and the number of epochs to 5. The batch size depends on the size of the
model and is chosen from 32, 16, and 8. All models are run with the same random seed.
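The settings of Section 4.1 can be collected in a small configuration object. This is a minimal sketch, not the authors' code: the class name `FineTuneConfig`, the function `max_sequence_length`, and the reuse of the warmup value for the text2text models are assumptions, and the unit in which the "longest review" is measured is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Default fine-tuning settings for the BERT-based models (Section 4.1)."""
    learning_rate: float = 5e-5
    epochs: int = 3
    warmup_steps: int = 500
    batch_size: int = 32               # 32, 16, or 8 in the article
    adam_betas: tuple = (0.9, 0.999)   # AdamW defaults


def max_sequence_length(level, longest_input, text2text=False):
    """Choose the maximum sequence length for a dataset.

    Sentence-level datasets use the length of their longest input;
    document-level datasets are capped at 256 (BERT-based) or 512 (text2text).
    """
    if level == "sentence":
        return longest_input
    if level == "document":
        return 512 if text2text else 256
    raise ValueError(f"unknown dataset level: {level!r}")


bert_config = FineTuneConfig()
# Text2text models: same learning rate, 5 epochs. Keeping warmup_steps at 500
# here is an assumption; the article does not state it for these models.
t5_config = FineTuneConfig(epochs=5)

print(max_sequence_length("sentence", 124))                   # e.g. l_max of UIT-VSFC
print(max_sequence_length("document", 2481))                  # BERT-based cap
print(max_sequence_length("document", 2481, text2text=True))  # text2text cap
```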
4.2 Evaluation Metrics

All five datasets are imbalanced except the VLSP dataset [45]. As mentioned, previous studies
reported their results with different metrics, which makes it difficult to compare methods in
future research. For that reason, we evaluate the models on standard benchmark metrics,
including Accuracy, Weighted F1-score, Macro F1-score, and Micro F1-score, as in previous
studies. As shown in Grandini et al. [21], the Micro F1-score is equal to Accuracy in
single-label multi-class classification, whether or not the classes are balanced; therefore,
we use Balanced Accuracy [8] as an additional evaluation metric for imbalanced datasets. This
helps further studies compare proposed models on imbalanced datasets effectively.
5 https://github.com/ntienhuy/MultiChannel.
6 http://nlp.uit.edu.vn/datasets/.
7 https://huggingface.co/bert-base-multilingual-cased.
8 https://huggingface.co/vinai/phobert-base.
9 https://huggingface.co/NlpHUST/vibert4news-base-cased.
10 https://huggingface.co/FPTAI/vibert-base-cased.
11 https://huggingface.co/FPTAI/velectra-base-discriminator-cased.
12 https://huggingface.co/google/mt5-base.
13 https://huggingface.co/google/mt5-large.
14 https://huggingface.co/VietAI/vit5-base.
15 https://huggingface.co/VietAI/vit5-large.
Table 3. Results of Models against Other Methods on UIT-VSMEC Dataset

Model                  Accuracy  Balanced Acc  Weighted F1  Macro F1  Micro F1
CNN + fastText [25]    59.74     —             59.74        —         —
Ensemble model [30]    —         —             65.79        —         —
MLR + processing [48]  64.36     —             64.40        —         —
mT5 base               34.05     21.77         28.89        19.55     34.05
mT5 large              54.69     42.08         52.15        43.49     54.69
ViT5 base              61.47     58.65         61.41        58.85     61.47
ViT5 large             60.89     59.87         61.24        59.10     60.89
viBERT4News            39.97     29.26         32.34        25.61     39.97
viBERT_FPT             52.96     41.31         50.90        42.84     52.96
viELECTRA_FPT          50.79     39.65         48.97        40.35     50.79
mBERT                  57.72     49.64         56.79        51.16     57.72
XLM-R                  60.32     51.09         59.23        52.69     60.32
PhoBERT                64.65     57.11         63.80        59.06     64.65
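The relationship between the metrics of Section 4.2 can be checked with a few lines of plain Python. The sketch below implements Accuracy, Micro, Macro, and Weighted F1, and Balanced Accuracy (taken here as the mean per-class recall) for single-label predictions; the toy labels are invented and the function names are illustrative, not those of any particular library.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    return {c: (tp[c], fp[c], fn[c]) for c in labels}


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    counts = list(per_class_counts(y_true, y_pred).values())
    return f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))


def macro_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts.values()) / len(counts)


def weighted_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    return sum(support[c] * f1(*counts[c]) for c in counts) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    return sum(counts[c][0] / support[c] for c in support) / len(support)


# Imbalanced toy example: 8 "pos" and 2 "neg" items, one "neg" misclassified.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 8 + ["pos", "neg"]

for name, fn_ in [("accuracy", accuracy), ("micro F1", micro_f1),
                  ("macro F1", macro_f1), ("weighted F1", weighted_f1),
                  ("balanced accuracy", balanced_accuracy)]:
    print(f"{name}: {fn_(y_true, y_pred):.4f}")
```

In this example every misclassification counts once as a false positive and once as a false negative, so Micro F1 and Accuracy coincide, while Balanced Accuracy and Macro F1 drop because the minority class is only half recovered.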
4.3 Result and Analysis

This section presents our experimental results on five benchmark datasets. We first analyze the
performance of the pre-trained language models on each dataset and then compare the results
with previous studies on the same dataset.
4.3.1 Experimental Results. The experimental results for UIT-VSMEC and VLSP are reported
in Table 3 and Table 4, while Table 5, Table 6, and Table 7 present the results for the HSA,
VS, and UIT-VSFC datasets, respectively. The first section of each table presents the results
of earlier studies on that dataset. The second section reports the results of the T5 model and
its variants. The last section gives the performance of the BERTology models on that dataset.
The remainder of this section describes the results on each dataset separately.
On the emotion dataset UIT-VSMEC, PhoBERT is the best model, achieving an Accuracy of 64.65%,
about 4.33 percentage points higher than the XLM-R model. Among the T5 models, the monolingual
ViT5 outperforms the multilingual mT5 models by a large margin. Comparing PhoBERT and ViT5,
PhoBERT achieves the best scores in terms of Accuracy, Weighted F1, and Micro F1-score, while
ViT5 large gives the best Balanced Accuracy and Macro F1-score. Specifically, ViT5 large
improves on PhoBERT by 2.76 and 0.04 percentage points in Balanced Accuracy and Macro F1,
respectively.
For the document-level VS and VLSP datasets, PhoBERT and ViT5 are the two models that give the
best scores on all metrics. PhoBERT achieves better results on the VLSP dataset, whereas ViT5
achieves a higher F1 on the VS dataset, although the difference between the two models is
small. For the sentence-level VSFC and HSA datasets, PhoBERT and ViT5 consistently outperform
all the remaining BERT and T5 models. In detail, PhoBERT gives the best results in terms of
Accuracy, Weighted F1, and Micro F1-score, while ViT5 achieves the best Balanced Accuracy and
Macro F1-score on both datasets.
Table 4. Results of Models against Other Methods on VLSP Dataset

Model                            Accuracy  Balanced Acc  Weighted F1  Macro F1  Micro F1
MaxEnt + Feature Selection [45]  76.38     —             —            —         —
MLP + N-grams + TFIDF [45]       —         —             69.40        —         —
Ensemble [46]                    69.43     —             —            —         —
Two-channel LSTM [51]            69.50     —             69.50        —         —
Two-channel CNN [51]             64.10     —             64.00        —         —
mT5 base                         46.19     46.19         45.57        45.57     46.19
mT5 large                        63.05     63.05         63.27        63.27     63.05
ViT5 base                        71.71     71.71         71.70        71.70     71.71
ViT5 large                       75.81     75.81         75.66        75.66     75.81
viBERT4News                      50.48     50.48         49.33        49.33     50.48
viBERT_FPT                       70.00     70.00         69.98        69.98     70.00
viELECTRA_FPT                    67.24     67.24         67.33        67.33     67.24
mBERT                            68.48     68.48         68.53        68.53     68.48
XLM-R                            73.05     73.05         73.06        73.06     73.05
PhoBERT                          76.00     76.00         76.05        76.05     76.00
Table 5. Results of Models against Other Methods on HSA Dataset

Model                          Accuracy  Balanced Acc  Weighted F1  Macro F1  Micro F1
SVM + handcraft features [17]  76.80     —             —            —         —
Semi-supervised learning [4]   78.20     —             —            —         —
VietSentiLex [78]              —         —             77.00        —         —
mT5 base                       53.39     39.71         53.39        35.46     53.39
mT5 large                      74.40     61.99         73.07        40.64     74.40
ViT5 base                      80.57     71.43         80.19        68.24     80.57
ViT5 large                     80.72     73.16         80.80        72.94     80.72
viBERT4News                    63.83     41.68         53.07        35.32     63.83
viBERT_FPT                     77.06     63.70         74.02        62.14     77.06
viELECTRA_FPT                  75.94     64.69         74.10        62.54     75.94
mBERT                          78.57     67.76         77.15        67.31     78.57
XLM-R                          79.12     64.59         74.57        60.95     79.12
PhoBERT                        82.51     71.53         80.94        71.79     82.51
Table 6. Results of Models against Other Methods on VS Dataset

Model                       Accuracy  Balanced Acc  Weighted F1  Macro F1  Micro F1
CNN-LSTM + with token [80]  81.86     —             —            —         —
CNN-LSTM + w/o token [80]   87.72     —             —            87.47     —
mT5 base                    71.33     71.20         71.14        70.51     71.33
mT5 large                   80.34     80.23         80.46        80.30     80.34
ViT5 base                   89.27     89.29         89.37        84.77     89.27
ViT5 large                  90.41     90.31         90.40        90.31     90.41
viBERT4News                 59.16     58.36         53.36        52.67     59.16
viBERT_FPT                  87.32     87.29         87.38        87.28     87.32
viELECTRA_FPT               86.83     86.81         86.90        86.79     86.83
mBERT                       86.68     86.59         86.68        86.57     86.68
XLM-R                       87.21     87.08         87.18        87.05     87.21
PhoBERT                     90.40     90.32         90.38        90.29     90.40
We calculated the Macro F1-score of the method based on the reported results for three classes.
Table 7. Results of Models against Other Methods on UIT-VSFC [49] Dataset

Model                    Accuracy  Balanced Acc  Weighted F1  Macro F1  Micro F1
MaxEnt [49]              —         —             87.94        —         —
LD-SVM [54]              90.74     —             90.20        —         —
BiLSTM-CNN [34]          —         —             93.51        —         —
CNN [40]                 —         —             —            75.57     89.82
CNN + Augmentation [40]  —         —             —            77.16     89.38
Two-channel LSTM [51]    89.90     —             89.30        —         —
Two-channel CNN [51]     89.60     —             88.90        —         —
Ensemble [30]            —         —             —            —         92.79
PhoBERT [77]             94.28     —             93.92        —         —
mT5 base                 85.75     60.46         83.47        59.07     85.75
mT5 large                90.68     68.28         89.27        70.12     90.68
ViT5 base                92.40     81.35         92.19        82.07     92.40
ViT5 large               92.89     79.84         92.54        82.98     92.89
viBERT4News              86.67     61.01         84.32        59.34     86.67
viBERT_FPT               91.25     74.18         90.64        76.74     91.25
viELECTRA_FPT            90.52     72.96         89.87        75.37     90.52
mBERT                    92.04     75.10         91.41        78.07     92.04
XLM-R                    93.08     77.25         92.55        80.35     93.08
PhoBERT                  93.87     79.57         93.45        82.83     93.87
Overall, based on the results on the five datasets, the monolingual models PhoBERT and ViT5 are
the best among all the pre-trained language models considered. Rust et al. [66] recently found
two factors that affect the downstream performance of pre-trained language models: (1) the
size of the pre-training data and (2) a designated monolingual tokenizer. This can explain why
PhoBERT and ViT5 are the two best models in our experiments. PhoBERT is the first architecture
trained on 20 GB of word-level data, and about 85% of Vietnamese words are composed of two or
more syllables [82]. ViT5's pre-training data was filtered, and the model was trained on a
71 GB Vietnamese corpus. Moreover, both models apply pre-processing steps to build effective
vocabularies [44, 59]. Our results demonstrate that PhoBERT is currently the best pre-trained
BERT-based language model for fine-tuning on classification tasks such as sentiment analysis of
the Vietnamese language. In addition, it is interesting to note that ViT5 is the best choice
for the imbalanced datasets in terms of Balanced Accuracy and Macro F1-score. Our results
indicate that fine-tuning ViT5 models can give promising performance on imbalanced datasets
where all labels are equally important. These findings match those of earlier studies [44, 59]
on other downstream NLP tasks such as Named Entity Recognition and Text Summarization.
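To make the "word-level" input concrete: Vietnamese words are often written as several space-separated syllables, and a word segmenter joins the syllables of one word, conventionally with an underscore, before subword tokenization. The toy below is only a greedy longest-match sketch over a tiny hand-made lexicon (an assumption for illustration). It is not the segmenter used to build PhoBERT, whose authors report using RDRSegmenter from VnCoreNLP.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy longest-match word segmentation over a list of syllables."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n]).lower()
            if n == 1 or candidate in lexicon:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


# Hypothetical mini-lexicon of two-syllable words (hotel, clean, staff, friendly).
LEXICON = {"khách sạn", "sạch sẽ", "nhân viên", "thân thiện"}

text = "Khách sạn sạch sẽ nhân viên thân thiện"
segmented = segment(text.split(), LEXICON)
print(segmented)  # ['Khách_sạn', 'sạch_sẽ', 'nhân_viên', 'thân_thiện']
```

Eight syllables become four word tokens, so a syllable-level model and a word-level model see quite different input sequences for the same review.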
Based on the experimental results, we observe that the XLM-R model outperforms the monolingual
BERT models, except PhoBERT, on almost all datasets. The multilingual BERT also achieves better
performance than the monolingual BERT models (viBERT4News, viBERT_FPT, and viELECTRA) on some
datasets. Among the monolingual models, viBERT4News shows the worst results on all datasets,
especially those collected from social media sites or e-commerce platforms, such as the
UIT-VSMEC, VLSP, and VS datasets. Reviews on these platforms often contain informal language
and errors in grammar or vocabulary. One reason for the poor performance of viBERT4News might
be the mismatch between the corpus used to train the language model and these datasets. Across
Tables 3 to 7, we also observe that the ranking of mBERT and viBERT_FPT depends on the type of
dataset, although the difference is small. viBERT_FPT gains better results than mBERT on the
e-commerce reviews (the VLSP and VS datasets), while mBERT achieves better results on the
other datasets. A possible explanation is that the vocabulary of viBERT_FPT is derived from
the vocabulary of the mBERT model [9]; therefore, there is no large difference between the
performance of the two models.
4.3.2 Comparison with the State-of-the-art. Comparing the PhoBERT model against previous
methods shows that PhoBERT achieves better Accuracy and Weighted F1-score on almost all
datasets. It is important to note that we train PhoBERT with default hyper-parameters on the
training set and do not use the development set to select the best parameters in this
scenario. In detail, the PhoBERT model achieves new state-of-the-art scores on two benchmark
datasets (VS and HSA) on most of the metrics. In addition, our implementation achieves results
comparable to those of the previous study [77] on the UIT-VSFC dataset. On the UIT-VSMEC
dataset, PhoBERT outperforms previous approaches in Accuracy, although with default
hyper-parameters its Weighted F1-score remains below that of the Ensemble method [30]. The
results of the remaining models are also competitive with previous studies. We believe that
our results provide additional support for previous studies on this topic. We analyze each
dataset in detail below.
For the emotion dataset UIT-VSMEC, the PhoBERT model is also effective in terms of Accuracy
compared with other methods. In detail, our experiment achieves an Accuracy of 64.65%, about
0.29 percentage points higher than the best previous method [48]. However, our results do not
outperform the SOTA Weighted F1-score of the ensemble method [30], which can be explained by
the fact that our reported results use the default hyperparameters. In Section 4.3.4, we
present the results of hyper-parameter tuning on the development set and the test set.
For the sentiment analysis datasets VS and VLSP, PhoBERT and ViT5 still achieve better
performance than most previous approaches; on VLSP, only the Accuracy of MaxEnt with feature
selection [45] (76.38%) is slightly higher than that of PhoBERT with default hyper-parameters
(76.00%). Although the two datasets were released for research purposes in 2016 and 2017, they
have rarely been used to evaluate proposed methods in previous studies. The main reason might
be that these datasets contain comments of widely varying length, spelling and grammar errors,
informal language, and so on. For those reasons, traditional machine learning algorithms and
deep learning models do not achieve the expected results. For the VS dataset, we were
surprised that the CNN-LSTM model without word segmentation achieved higher accuracy than the
same model with word segmentation. In contrast, our empirical results demonstrate that
PhoBERT, which uses word-segmented input, produces a state-of-the-art score on this dataset.
It is also worth noting that the two datasets are fairly balanced across classes; therefore,
the five metrics take nearly the same value.
For a small SA dataset such as HSA, fine-tuning pre-trained language models still outperforms
traditional methods [4, 17]. This dataset contains a numeric feature, a user rating score that
ranges from 1 to 10. The previous methods used this feature to improve the performance of
their systems because it is closely related to the polarity of the user's review: the higher
the overall rating score, the more positive the review, and vice versa. In contrast, we discard
this feature and use only the review text in our experiments. Because this dataset is small
and imbalanced between classes, deep learning methods are not effective. However, our
experimental results demonstrate that the PhoBERT and ViT5 models still achieve new SOTA
scores even on this small dataset.
Fig. 3. Visualization by t-SNE of PhoBERT vs. XLM-R model on the development set of UIT-VSFC dataset.
For the UIT-VSFC SA dataset, our results are consistent with previous results [77] for the
PhoBERT model. However, in that paper, the authors concatenated the [CLS] outputs of the last
four layers of the pre-trained PhoBERT model, fine-tuned the model's parameters on the test
set, and reported the best result, whereas we only feed the hidden state of the [CLS] token
into the classifier layer and still achieve a comparable result. This result strengthens our
confidence that fine-tuning pre-trained language models for downstream tasks can perform
better than deep learning ensemble models with data augmentation techniques. In our view, the
result underlines the validity of the PhoBERT model with default hyperparameters. Comparing
PhoBERT with ViT5, PhoBERT still gives better results in terms of Accuracy, Weighted F1, and
Micro F1-score. To visualize the representations learned by the fine-tuned language models, we
apply the t-SNE technique to the [CLS] representations from the last four layers of each
model, shown in Figure 3 and Figure 4. The three sentiment classes in the UIT-VSFC dataset are
shown in green, yellow, and red, representing positive, neutral, and negative, respectively.
Figure 3 and Figure 4 show the t-SNE visualization of the last four layers of all models
except viBERT4News on the development set. In this visualization, some points are clearly
clustered with the wrong class. Most of these points belong to the “neutral” class, which is
difficult to identify, whereas the models perform well on the positive and negative labels.
Between the “positive” and “negative” classes, PhoBERT separates the two classes in the last
four layers more clearly than the other models, followed by XLM-R; for these two models, the
separation between the classes is almost perfect. This suggests that PhoBERT forms denser,
better-separated clusters of the classes than the other models.
Fig. 4. Visualization by t-SNE of the mBERT vs. viELECTRA vs. viBERT_FPT model on the
development set of UIT-VSFC dataset [49].
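The two ways of turning [CLS] hidden states into a sentence feature that are contrasted above (last layer only versus the concatenation of the last four layers) can be sketched without any deep learning library. The nested-list layout of `hidden_states` (layers, then tokens, then hidden units) and the deepest-first concatenation order are assumptions made for this illustration.

```python
def cls_last_layer(hidden_states):
    """[CLS] vector (token position 0) of the final layer."""
    return list(hidden_states[-1][0])


def cls_concat_last_four(hidden_states):
    """Concatenate the [CLS] vectors of the last four layers, deepest first."""
    if len(hidden_states) < 4:
        raise ValueError("need at least four layers")
    features = []
    for layer in reversed(hidden_states[-4:]):
        features.extend(layer[0])
    return features


# Toy hidden states: 5 layers, 3 tokens, hidden size 2.
# The value encodes (layer, token) so the selection is easy to follow.
hidden_states = [
    [[float(layer * 10 + tok), float(layer * 10 + tok) + 0.5] for tok in range(3)]
    for layer in range(5)
]

print(cls_last_layer(hidden_states))        # [40.0, 40.5]
print(cls_concat_last_four(hidden_states))  # 8 values, from layers 4, 3, 2, 1
```

With a real model, each development-set sentence would yield one such feature vector, and a projection such as scikit-learn's `TSNE` could then map the vectors to two dimensions for plotting.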
4.3.3 Zero-shot Cross-Domain. This section investigates the feasibility of zero-shot
cross-domain transfer learning for sentiment classification. Zero-shot cross-domain refers to
solving a task in a domain for which no training examples are available. For example, we train
the PhoBERT model on the training samples of the VSFC dataset and then evaluate it on the
testing samples of the HSA dataset. We conduct two experiments according to the level of the
datasets for the SA task: (1) document-level datasets (VS and VLSP) and (2) sentence-level
datasets (VSFC and HSA). Two architectures are used in these experiments, PhoBERT and ViT5,
both in their base versions. Accuracy and Weighted F1-score are reported in Table 8. In all
scenarios, PhoBERT outperforms ViT5, especially on the document-level datasets. For example,
fine-tuning PhoBERT on the VS training set gives a Weighted F1-score of 41.08% on the VLSP
test set, much higher than the 30.12% obtained by the ViT5 model. For the sentence-level
benchmark, PhoBERT gives a Weighted F1-score 4.71 and 4.20 percentage points higher than ViT5.
One possible reason is that PhoBERT produces better contextual representations of Vietnamese
words than the T5 model. Note also that the samples in the document-level datasets differ
across domains and contain many sentiment expressions within a paragraph. Consequently,
zero-shot cross-domain learning does not give good results on the document-level datasets.
Table 8. The Zero-shot Cross-domain Results for Two Levels of Sentiment Analysis Task

         Document-level SA                                 Sentence-level SA
         VS → VLSP                VLSP → VS                VSFC → HSA               HSA → VSFC
Model    Accuracy  Weighted F1    Accuracy  Weighted F1    Accuracy  Weighted F1    Accuracy  Weighted F1
ViT5     30.25     30.12          34.37     34.10          72.37     70.39          77.73     79.48
PhoBERT  40.90     41.08          44.99     42.61          76.69     75.10          81.69     83.68
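The zero-shot cross-domain protocol behind Table 8 amounts to a loop over (source, target) pairs in which the target domain's training data is never used. The sketch below shows only that loop: `zero_shot_eval`, the majority-class stand-in for fine-tuning, and the toy data are hypothetical, and a real run would fine-tune PhoBERT or ViT5 inside `train_fn` and also report the Weighted F1-score.

```python
from collections import Counter

# Transfer directions evaluated in Table 8.
PAIRS = [("VS", "VLSP"), ("VLSP", "VS"), ("VSFC", "HSA"), ("HSA", "VSFC")]


def zero_shot_eval(datasets, train_fn, pairs=PAIRS):
    """Train on the source training split, score on the target test split."""
    results = {}
    for source, target in pairs:
        predict = train_fn(datasets[source]["train"])   # target train split is never seen
        texts, gold = zip(*datasets[target]["test"])
        pred = [predict(text) for text in texts]
        results[(source, target)] = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return results


def majority_baseline(train_split):
    """Stand-in for fine-tuning: always predict the most frequent training label."""
    label = Counter(y for _, y in train_split).most_common(1)[0][0]
    return lambda text: label


# Hypothetical toy data with the same (text, label) layout for every dataset.
toy = {
    name: {
        "train": [("a", "positive"), ("b", "positive"), ("c", "negative")],
        "test": [("d", "positive"), ("e", "negative")],
    }
    for name in ("VS", "VLSP", "VSFC", "HSA")
}

for (source, target), acc in zero_shot_eval(toy, majority_baseline).items():
    print(f"{source} -> {target}: accuracy = {acc:.2f}")
```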
Fig. 5. Fine-tuning PhoBERT model accuracy and Weighted F 1 -score vs. the learning rate values on five
benchmark datasets.
Fig. 6. Fine-tuning PhoBERT model accuracy and Weighted F 1 -score vs. the number of epochs on five bench-
mark datasets.
4.3.4 Effect of Hyper-parameters. Devlin et al. [13] pointed out that small datasets are
sensitive to the choice of hyper-parameters. Therefore, we conduct a series of experiments to
choose the best hyper-parameters of the PhoBERT model on the development set. We explore the
effect of the learning rate and the number of training epochs. First, we keep the other
hyper-parameters at the default values given in Section 4.1 and vary the learning rate from
1e-5 to 10e-5. Then we fix the best learning rate and vary the number of epochs from 2 to 10.
Figure 5 and Figure 6 illustrate the performance of fine-tuning the PhoBERT model in terms of
Accuracy and Weighted F1-score on the five benchmark datasets. As shown in Figure 5,
fine-tuning PhoBERT with moderate learning rates such as 5e-5 and 6e-5 can increase
performance. For the number of epochs, we found that 3 to 6 epochs work well across datasets,
broadly in line with the original BERT recommendation [13]. In particular, our model with the
tuned hyperparameters achieves new SOTA performance compared with previous methods.
Specifically, the PhoBERT model with a learning rate of 6e-5 and 4 epochs gives an Accuracy of
66.09% and a Weighted F1-score of 66.00% on the UIT-VSMEC dataset. For the VLSP dataset, the
PhoBERT model with a learning rate of 6e-5 and 6 epochs achieves an Accuracy of 76.90%, which
is better than the previous top score on this dataset. Therefore, choosing suitable
hyperparameters can improve the performance of the model.
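The two-stage search described in Section 4.3.4 (learning rate first, then the number of epochs with that rate fixed) can be sketched as follows. The `evaluate` callback stands for "fine-tune and score on the development set"; `toy_dev_score` is a synthetic surface invented for the demonstration, with its peak placed at 6e-5 and 4 epochs only to echo the UIT-VSMEC setting reported above.

```python
def sequential_search(evaluate, learning_rates, epoch_grid, default_epochs=3):
    """Tune the learning rate first, then the epoch count with that rate fixed."""
    best_lr = max(learning_rates, key=lambda lr: evaluate(lr, default_epochs))
    best_epochs = max(epoch_grid, key=lambda ep: evaluate(best_lr, ep))
    return best_lr, best_epochs, evaluate(best_lr, best_epochs)


LEARNING_RATES = [k * 1e-5 for k in range(1, 11)]  # 1e-5 ... 10e-5
EPOCH_GRID = list(range(2, 11))                    # 2 ... 10 epochs


def toy_dev_score(lr, epochs):
    """Synthetic development-set score (not a real result)."""
    return 1.0 - (lr * 1e5 - 6) ** 2 / 100 - (epochs - 4) ** 2 / 100


best_lr, best_epochs, best_score = sequential_search(
    toy_dev_score, LEARNING_RATES, EPOCH_GRID
)
print(f"best learning rate: {best_lr:.0e}, best epochs: {best_epochs}")
```

This one-factor-at-a-time search needs 10 + 9 training runs per dataset instead of the 90 required by a full grid, at the cost of possibly missing interactions between the two hyper-parameters.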
Fig. 7. Confusion matrix of VLSP and UIT_VSFC dataset on the test set.
5 ERROR ANALYSIS
To better understand the behavior of the model on the datasets, we conduct an error analysis
and case study in this section. We first analyze the confusion matrices on the test sets of
some datasets. Then we examine the incorrect predictions and categorize their error types for
specific datasets.
Figure 7 shows the confusion matrices of the predicted class labels against the ground-truth
class labels for the VLSP and UIT_VSFC datasets on the test set. The left panel of Figure 7
presents the confusion matrix of the PhoBERT model on the VLSP test set. The wrong predictions
concentrate on the confusion between the “negative” and “neutral” labels, with a rate of 36%.
The samples in the VLSP dataset are document-level reviews of varying length (see Table 2),
and truncation is applied to fit the model's input length in our experiments. This is a key
reason for the poor performance of our model on this dataset, because sentiment information
can be removed in the processing steps. The right panel of Figure 7 shows the confusion matrix
of the PhoBERT model on the UIT_VSFC test set. The confusion rate between the “neutral” label
and the two remaining labels is high. There are several possible explanations for this result;
in our observation, the annotators themselves were ambiguous in assigning this label. In
addition, sentences with this label are typically short, and there are fewer training samples
for it than for the other labels. Figure 8 shows the confusion matrix on the test set of the
emotion dataset UIT-VSMEC. There are many confusions between actual and predicted labels. As
shown in Figure 8, our model confuses labels with similar meanings, such as “anger” with
“disgust” and “surprise” with “enjoyment.” Ho et al. [25] explain this by two primary reasons:
(1) the definitions of the labels and (2) the limited number of samples for some classes.
Therefore, to improve the performance of the model on this dataset, correcting or
re-annotating the dataset should be considered.
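Confusion rates such as those read off Figures 7 and 8 come from a row-normalized confusion matrix, where each row corresponds to a true class. A minimal sketch in plain Python follows; the labels and predictions are invented and do not reproduce the figures.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalize(matrix):
    """Convert counts to per-true-class rates; all-zero rows stay zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


LABELS = ["negative", "neutral", "positive"]

# Hypothetical gold labels and predictions, for illustration only.
y_true = ["negative", "negative", "negative", "neutral", "neutral",
          "positive", "positive", "positive", "positive", "neutral"]
y_pred = ["negative", "neutral", "negative", "negative", "neutral",
          "positive", "positive", "neutral", "positive", "positive"]

counts = confusion_matrix(y_true, y_pred, LABELS)
rates = row_normalize(counts)
for label, row in zip(LABELS, rates):
    print(label, [f"{r:.2f}" for r in row])
```

Off-diagonal entries of a row give the share of that true class assigned to each wrong label, which is the quantity discussed for the “negative” and “neutral” confusion on VLSP.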
Fig. 8. Confusion matrix of UIT_VSMEC dataset on the test set.
To understand the reasons for the low performance on the document-level VLSP and
sentence-level HSA datasets, we selected wrongly predicted samples from the test set of each
dataset and analyzed their error types. For the VLSP dataset, we found that the wrongly
predicted samples include sentences written without accents; accent restoration is one of the
typical tasks for Vietnamese. Although accent restoration depends on the domain, adding it to
the processing steps could improve performance on this dataset. In addition, we found that our
model fails on samples with sarcastic meanings. Table 9 shows some examples with the ground
truth and our prediction in this dataset. We believe that sarcasm in a sentence can
significantly hamper our prediction on this dataset. Moreover, our predictions often fail on
samples that contain sentiment words but are labeled as “neutral.” For example, the sentences
“ mua j7” (What a pity!, just buy j7) and “ ” (Do not want to buy cheap goods) are predicted
as “negative.” These samples clearly express negative meanings; however, the annotators
labeled them as “neutral.” Our model also gives wrong predictions on comments that compare two
products. For example, “s7 ” (S7 is follow, the price is higher but the quality is lower) has
the “positive” label because it criticizes the “s7” product rather than the current product.
However, the model still predicts this case as negative because of the phrase “ ” (the price
is higher but the quality is lower). Indeed, these are difficult examples in the test set: the
model must understand emotional phrases or words in relation to the product or context to
assign the sentiment polarity class.
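Samples written without accents can be flagged automatically before deciding whether to apply accent restoration. The check below is a heuristic sketch using only the Python standard library: it decomposes the text and looks for combining marks or the letter đ. The function names and the example strings are illustrative; some genuine Vietnamese words carry no diacritics, so very short texts may be flagged incorrectly.

```python
import unicodedata


def has_vietnamese_diacritics(text):
    """True if the text contains any tone or vowel mark, or the letter đ/Đ."""
    if "đ" in text or "Đ" in text:
        return True  # đ has no canonical decomposition, so check it directly
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def flag_unaccented(samples):
    """Return the samples that contain letters but no Vietnamese diacritics."""
    return [
        s for s in samples
        if any(ch.isalpha() for ch in s) and not has_vietnamese_diacritics(s)
    ]


# Illustrative strings: accented and unaccented spellings of the same phrases.
samples = ["máy đẹp", "may dep", "pin tốt", "pin tot", "123"]
print(flag_unaccented(samples))  # ['may dep', 'pin tot']
```

The flagged subset could then be routed to an accent restoration step or analyzed separately when measuring errors.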
Table 9. Wrong Prediction with Sarcastic Meanings on the VLSP Dataset
For the sentence-level HSA dataset, we found that the annotators were sometimes inconsistent
in labeling sentiment classes and that the dataset contains meaningless sentences. For
example, the sentence “ ” (Food is normal) is labeled “negative,” while another sentence “ ”
(breakfast is okay) is labeled “positive.” In Vietnamese, the words “ ” and “ ” have similar
meanings in this context. Moreover, if a sentence expresses two different sentiment
polarities, it is annotated as “neutral.” For example, the sentence “ , ” (Everything is quite
good, the price is reasonable, but the breakfast is too bad) has the “neutral” class. The
first and second phrases in the review are positive, while the remainder is negative, so the
overall sentiment polarity of the sentence is neutral. This annotation style in the HSA
dataset affects the classifiers and their performance. When we check the predictions on the
test set, we observe that our model makes wrong predictions on the “neutral” samples described
above.
Table 10. Examples of Wrong Annotation in the UIT_VSFC Dataset
For the sentiment UIT_VSFC dataset, we found an inconsistency in the annotation process. We
notice that some annotators label a sample with sentiment meaning as neutral, considering the
objectivity of the sentence. However, sometimes they tended to give the sentiment label based on
the content itself. For example, the sentence “ ”
(This course helps us understand the basics) is labeled as “neutral.” Table 10 shows examples
of mis-annotation in the test set of the UIT_VSFC dataset. Therefore, we recommend that
future work re-annotate this dataset to improve the performance of classifiers that can be
applied in real applications.
6 CONCLUSION
With the development of different pre-trained language models for the Vietnamese language, it
may be difficult in future studies to choose a suitable pre-trained language model for the
Vietnamese NLP field. This article is an effort to fill this gap. The article investigated the performance
of fine-tuning different pre-trained language models on a wide variety of Vietnamese benchmark
datasets for sentiment analysis tasks. The evidence from this study suggests that PhoBERT
is the best pre-trained language model for fine-tuning on text classification tasks such as sentiment
analysis and emotion classification. This model demonstrated its effectiveness on datasets of
different sizes and domains. Our experimental results established new SOTA performances
on five benchmark datasets compared with previous studies by standard evaluation metrics.
Moreover, we indicated the crucial points in using the fine-tuning approach based on pre-trained language
models for Vietnamese. From our literature survey and experimental analysis, we draw several
conclusions that we hope will guide the future directions of Vietnamese NLP. This article
is also helpful for researchers in selecting suitable language models and the way to fine-tune language
models for Vietnamese NLP tasks. We make the following recommendations:
• Future studies should verify the evaluation metrics of previous studies and use the same
metrics to compare methods in a fair manner.
• The input text should be pre-processed with the same steps that were applied to the pre-training
data of the pre-trained language model to produce the best representation for the input.
• Most BERTology models limit the length of the input (e.g., PhoBERT only supports
inputs of up to 256 tokens); therefore, it is necessary to apply the right truncation method
to reduce the length of a long document. Furthermore, we can utilize generative language
models such as T5 to tackle classification problems with long inputs while still giving the
same performance.
• From our experimental results, it can be seen that the monolingual language models outperform
the multilingual language models on different SA benchmark datasets, for both the
discriminative and generative language models. However, future work can utilize the power
of multilingual models by training on a combined dataset of multiple languages to improve
performance on specific tasks that lack training data.
• We found that pre-trained language models trained on monolingual word-segmented
corpora in Vietnamese can yield better performance, because about 85% of Vietnamese
vocabulary is composed of two or more syllables. Therefore, we recommend that
future studies apply word segmentation to the text data before training
language models for the Vietnamese language.
• Based on our experimental results, we suggest that future work use the PhoBERT
model as a first option to extract contextual representations or employ it as the backbone
model in other approaches (e.g., prompt-tuning for pre-trained language models [88]).
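As an illustration of the first recommendation, the following minimal sketch computes accuracy, balanced accuracy, macro F1, and weighted F1 from one pair of gold and predicted label lists, so that all methods can be scored by the same code. It is not the evaluation script used in this study; the function names and the choice to average only over classes that occur in the gold labels are assumptions made for the example. For single-label classification, micro F1 coincides with accuracy, so it is not computed separately.

```python
from collections import Counter


def _counts(y_true, y_pred):
    """Per-class true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class receives a false positive
            fn[gold] += 1  # gold class receives a false negative
    return tp, fp, fn


def evaluate(y_true, y_pred):
    """Return accuracy, balanced accuracy, macro F1 and weighted F1.

    Assumption: averages run over the classes present in y_true only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = _counts(y_true, y_pred)
    support = Counter(y_true)
    classes = sorted(support)

    # Recall per class; support[c] >= 1 for every class in the gold labels.
    recall = {c: tp[c] / support[c] for c in classes}
    # F1 = 2TP / (2TP + FP + FN); the denominator is >= support[c] >= 1.
    f1 = {c: 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes}

    n = len(y_true)
    return {
        "accuracy": sum(tp.values()) / n,
        "balanced_accuracy": sum(recall.values()) / len(classes),
        "macro_f1": sum(f1.values()) / len(classes),
        "weighted_f1": sum(f1[c] * support[c] for c in classes) / n,
    }


if __name__ == "__main__":
    gold = ["pos", "pos", "neg", "neu"]
    pred = ["pos", "neg", "neg", "neg"]
    for name, value in evaluate(gold, pred).items():
        print(f"{name}: {value:.4f}")
```

On imbalanced datasets the four numbers can differ considerably, which is why comparing one paper's accuracy with another paper's macro F1 is not meaningful.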
Finally, our article is also the first overview study of previous methods for the Vietnamese
language and points out a limitation of previous studies. We hope that our work can be
seen as a comprehensive study and push forward research on the Vietnamese SA topic.
In the future, it would be interesting to investigate the performance of pre-trained language
models on other Vietnamese NLP tasks, such as Word Segmentation, Named Entity Recognition,
and so on. Our method can also be applied to other low-resource languages. We will explore
developing new methods to improve performance on the experimental datasets. Further studies can apply
data augmentation methods [18] to low-resource languages to enhance the performance of models
on different datasets. In addition, instead of fine-tuning the language model directly, future work
can explore parameter-efficient fine-tuning approaches, such as adapters [27] and diff
pruning [22]. Cross-domain and cross-lingual sentiment analysis between datasets and languages
should receive more attention in future research by the Vietnamese NLP community. For
document-level datasets, such as the VLSP dataset, the truncation method might lose important information
before the text is fed into the pre-trained language model. Therefore, future studies need to pay
attention to this issue and build effective processing methods [14] and models (e.g., Longformer [6]) to deal
with document-level datasets.
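To make the truncation issue concrete, the sketch below shows three simple ways of cutting a token sequence down to a fixed budget: keeping the head, keeping the tail, or keeping a head and a tail segment together. It only illustrates the general idea and is not necessarily the truncation method used in our experiments; the function name, the strategy names, and the default head share of one quarter are assumptions chosen for the example. The budget counts content tokens only, so any positions needed for special tokens must be subtracted by the caller.

```python
def truncate(tokens, max_len, strategy="head", head_len=None):
    """Shorten a token list to at most max_len tokens.

    strategy: "head" keeps the first tokens, "tail" keeps the last tokens,
    "head_tail" keeps head_len tokens from the start and the rest from the end.
    All names and defaults here are hypothetical, for illustration only.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    tokens = list(tokens)
    if len(tokens) <= max_len:
        return tokens  # short inputs are returned unchanged

    if strategy == "head":
        return tokens[:max_len]
    if strategy == "tail":
        return tokens[-max_len:]
    if strategy == "head_tail":
        if head_len is None:
            head_len = max_len // 4  # assumed default split
        if not 0 <= head_len <= max_len:
            raise ValueError("head_len must lie between 0 and max_len")
        tail_len = max_len - head_len
        # Slice from an explicit index: tokens[-0:] would return everything.
        tail = tokens[len(tokens) - tail_len:] if tail_len else []
        return tokens[:head_len] + tail
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    review = [f"tok{i}" for i in range(600)]
    for name in ("head", "tail", "head_tail"):
        kept = truncate(review, max_len=256, strategy=name)
        print(name, len(kept), kept[0], kept[-1])
```

Whichever variant is chosen, everything outside the kept window is invisible to the model, which is the information loss discussed above for document-level reviews.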
ACKNOWLEDGMENTS
We thank the reviewers for their helpful comments.
REFERENCES
[1] Ibrahim Abu Farha and Walid Magdy. 2021. A comparative study of effective approaches for Arabic sentiment analysis.
Inf. Process. Manage. 58, 2 (2021), 102438. https://doi.org/10.1016/j.ipm.2020.102438
[2] Marvin M. Agüero-Torales, José I. Abreu Salas, and Antonio G. López-Herrera. 2021. Deep learning and multilingual
sentiment analysis on social media data: An overview. Appl. Soft Comput. 107 (2021), 107373. https://doi.org/10.1016/
j.asoc.2021.107373
[3] Mahmoud Al-Ayyoub, Abed Allah Khamaiseh, Yaser Jararweh, and Mohammed N. Al-Kabi. 2019. A comprehensive
survey of arabic sentiment analysis. Inf. Process. Manage. 56, 2 (2019), 320–342.
[4] Ngo Xuan Bach and Tu Minh Phuong. 2015. Leveraging user ratings for resource-poor sentiment classification. Proc.
Comput. Sci. 60 (2015), 322–331.
[5] Tran Sy Bang, Choochart Haruechaiyasak, and Virach Sornlertlamvanich. 2015. Vietnamese sentiment analysis based
on term feature selection approach. In Proceedings of the 10th International Conference on Knowledge Information and
Creativity Support Systems. Springer, 196–204.
[6] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR
abs/2004.05150 (2020). arXiv:2004.05150 https://arxiv.org/abs/2004.05150
[7] Marouane Birjali, Mohammed Kasri, and Abderrahim Beni-Hssane. 2021. A comprehensive survey on sentiment anal-
ysis: Approaches, challenges and trends. Knowl.-Bas. Syst. 226 (2021), 107134. https://doi.org/10.1016/j.knosys.2021.
107134
[8] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. 2010. The balanced ac-
curacy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition. IEEE,
3121–3124. https://doi.org/10.1109/ICPR.2010.764
[9] The Viet Bui, Thi Oanh Tran, and Phuong Le-Hong. 2020. Improving sequence tagging for vietnamese text using
transformer-based neural models. In Proceedings of the 34th Pacific Asia Conference on Language, Information and
Computation. Association for Computational Linguistics, Hanoi, Vietnam, 13–20.
[10] Kevin Clark, Minh-Thang Luong, Quoc Le, and Christopher D. Manning. 2020. Pre-training transformers as energy-
based cloze models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Association for Computational Linguistics, 285–294. https://doi.org/10.18653/v1/2020.emnlp-main.20
[11] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán,
Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation
learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association
for Computational Linguistics, 8440–8451. https://doi.org/10.18653/v1/2020.acl-main.747
[12] Thin Dang, Vu Nguyen, Nguyen Kiet, and Nguyen Ngan. 2019. A transformation method for aspect-based sentiment
analysis. J. Comput. Sci. Cybernet. 34, 4 (2019), 323–333.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association
for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[14] Ming Ding, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. CogLTX: Applying BERT to long texts. In Advances in
Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33.
Curran Associates, Inc., 12792–12804.
[15] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. 2020. Fine-tuning
pretrained language models: Weight initializations, data orders, and early stopping. arXiv:2002.06305. Retrieved from
https://arxiv.org/abs/2002.06305.
[16] Huu-Thanh Duong and Tram-Anh Nguyen-Thi. 2021. A review: Preprocessing techniques and data augmentation for
sentiment analysis. Comput. Soc. Netw. 8, 1 (2021), 1–16.
[17] Nguyen Thi Duyen, Ngo Xuan Bach, and Tu Minh Phuong. 2014. An empirical study on sentiment analysis for viet-
namese. In Proceedings of the International Conference on Advanced Technologies for Communications. IEEE, 309–314.
https://doi.org/10.1109/ATC.2014.7043403
[18] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy.
2021. A survey of data augmentation approaches for NLP. In Proceedings of the Joint Conference of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Lan-
guage Processing (ACL-IJCNLP’21). Association for Computational Linguistics, 968–988. https://doi.org/10.18653/v1/
2021.findings-acl.84
[19] Zhengjie Gao, Ao Feng, Xinyu Song, and Xi Wu. 2019. Target-dependent sentiment classification with BERT. IEEE
Access 7 (2019), 154290–154299. https://doi.org/10.1109/ACCESS.2019.2946594
[20] Jose Angel Gonzalez, Lluís-F. Hurtado, and Ferran Pla. 2021. TWilBert: Pre-trained deep bidirectional transformers
for spanish twitter. Neurocomputing 426 (2021), 58–69.
[21] Margherita Grandini, Enrico Bagli, and Giorgio Visani. 2020. Metrics for multi-class classification: An overview.
arXiv:2008.05756. Retrieved from https://arxiv.org/abs/2008.05756.
[22] Demi Guo, Alexander Rush, and Yoon Kim. 2021. Parameter-efficient transfer learning with diff pruning. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 4884–4896. https:
//doi.org/10.18653/v1/2021.acl-long.378
[23] Quang-Thuy Ha, Tien-Thanh Vu, Huyen-Trang Pham, and Cong-To Luu. 2011. An upgrading feature-based opinion
mining model on vietnamese product reviews. In Active Media Technology, Ning Zhong, Vic Callaghan, Ali A. Ghor-
bani, and Bin Hu (Eds.). Springer, Berlin, 173–185.
[24] Quang-Vinh Ha, Bao-Dai Nguyen-Hoang, and Minh-Quoc Nghiem. 2016. Lifelong learning for cross-domain viet-
namese sentiment classification. In Computational Social Networks, Hien T. Nguyen and Vaclav Snasel (Eds.). Springer
International Publishing, Cham, 298–308.
[25] Vong Anh Ho, Duong Huynh-Cong Nguyen, Danh Hoang Nguyen, Linh Thi-Van Pham, Duc-Vu Nguyen, Kiet Van
Nguyen, and Ngan Luu-Thuy Nguyen. 2020. Emotion recognition for vietnamese social media text. In Computational
Linguistics, Le-Minh Nguyen, Xuan-Hieu Phan, Kôiti Hasida, and Satoshi Tojo (Eds.). Springer, Singapore, Singapore,
319–333.
[26] Suong N. Hoang, Linh V. Nguyen, Tai Huynh, and Vuong T. Pham. 2019. An efficient model for sentiment analysis
of electronic product reviews in vietnamese. In Future Data and Security Engineering, Tran Khanh Dang, Josef Küng,
Makoto Takizawa, and Son Ha Bui (Eds.). Springer International Publishing, Cham, 132–142.
[27] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th
International Conference on Machine Learning (Proceedings of Machine Learning Research), Kamalika Chaudhuri and
Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 2790–2799.
[28] Yong Huang, Siwei Liu, Liangdong Qu, and Yongsheng Li. 2020. Effective vietnamese sentiment analysis model us-
ing sentiment word embedding and transfer learning. In Data Science, Pinle Qin, Hongzhi Wang, Guanglu Sun, and
Zeguang Lu (Eds.). Springer, Singapore, Singapore, 36–46.
[29] Thien Ho Huong and Vinh Truong Hoang. 2020. A data augmentation technique based on text for vietnamese senti-
ment analysis. In Proceedings of the 11th International Conference on Advances in Information Technology (IAIT2020).
Association for Computing Machinery, New York, NY, Article 13, 5 pages. https://doi.org/10.1145/3406601.3406618
[30] Huy Duc Huynh, Hang Thi-Thuy Do, Kiet Van Nguyen, and Ngan Thuy-Luu Nguyen. 2020. A simple and efficient
ensemble classifier combining multiple neural network models on social media datasets in vietnamese. In Proceed-
ings of the 34th Pacific Asia Conference on Language, Information and Computation. Association for Computational
Linguistics, 420–429.
[31] Ashish Katrekar and Big Data Analytics AVP. 2005. An Introduction to Sentiment Analysis. GlobalLogic Inc.
[32] Thien Khai Tran and Tuoi Thi Phan. 2019. Deep learning application to ensemble learning-the simple, but effective,
approach to sentiment classifying. Appl. Sci. 9, 13 (2019). https://doi.org/10.3390/app9132760
[33] Binh Thanh Kieu and Son Bao Pham. 2010. Sentiment analysis for vietnamese. In Proceedings of the 2nd International
Conference on Knowledge and Systems Engineering. IEEE, 152–157. https://doi.org/10.1109/KSE.2010.33
[34] Lac Si Le, Dang Van Thin, Ngan Luu-Thuy Nguyen, and Son Quoc Trinh. 2020. A multi-filter BiLSTM-CNN architec-
ture for vietnamese sentiment analysis. In Advances in Computational Collective Intelligence, Marcin Hernes, Krystian
Wojtkiewicz, and Edward Szczerbicki (Eds.). Springer International Publishing, Cham, 752–763.
[35] Mingzheng Li, Lei Chen, Jing Zhao, and Qiang Li. 2021. Sentiment analysis of chinese stock reviews based on BERT
model. Appl. Intell. 51, 7 (2021), 1–9. https://doi.org/10.1007/s10489-020-02101-8
[36] Menggang Li, Wenrui Li, Fang Wang, Xiaojun Jia, and Guangwei Rui. 2021. Applying BERT to analyze investor sen-
timent in stock market. Neural Comput. Appl. 33, 10 (2021), 4663–4676.
[37] Bing Liu. 2012. Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5, 1 (2012), 1–167.
[38] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv e-prints, arXiv.1907.
[39] Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the 7th Inter-
national Conference on Learning Representations (ICLR’19, New Orleans, LA, USA, May 6-9, 2019). OpenReview.net.
https://openreview.net/forum?id=Bkg6RiCqY7.
[40] Son Luu, Kiet Nguyen, and Ngan Nguyen. 2020. Empirical study of text augmentation on social media text in viet-
namese. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. Association for
Computational Linguistics, 462–470.
[41] Manish Munikar, Sushil Shakya, and Aakash Shrestha. 2019. Fine-grained sentiment classification using BERT. In
Proceedings of the Artificial Intelligence for Transforming Business and Society (AITB’19), Vol. 1. IEEE, 1–5. https://doi.
org/10.1109/AITB48515.2019.8947435
[42] Hong Nam Nguyen, Thanh Van Le, Hai Son Le, and Tran Vu Pham. 2014. Domain specific sentiment dictionary
for opinion mining of vietnamese text. In Multi-disciplinary Trends in Artificial Intelligence. Springer International
Publishing, Cham, 136–148.
[43] Cuong Nguyen, Khiem Le, Anh Tran, and Binh Nguyen. 2020. An efficient framework for vietnamese sentiment
classification. In Knowledge Innovation through Intelligent Software Methodologies, Tools and Techniques, Vol. 327. IOS
Press, 343–354. https://doi.org/10.3233/FAIA200579
[44] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for vietnamese. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). Association for Computational
Linguistics, 1037–1042. https://doi.org/10.18653/v1/2020.findings-emnlp.92
[45] Huyen Nguyen, Hung Nguyen, Quyen Ngo, Luong Vu, Vu Tran, Bach Ngo, and Cuong Le. 2019. VLSP shared task:
Sentiment analysis. J. Comput. Sci. Cybernet. 34, 4 (2019), 295–310. https://doi.org/10.15625/1813-9663/34/4/13160
[46] H. Nguyen and Q. Nguyen. 2018. An ensemble of shallow and deep learning algorithms for vietnamese sentiment
analysis. In Proceedings of the 5th NAFOSTED Conference on Information and Computer Science. IEEE, 165–170. https:
//doi.org/10.1109/NICS.2018.8606880
[47] Hien D. Nguyen, Tai Huynh, Suong N. Hoang, Vuong T. Pham, and Ivan Zelinka. 2020. Language-oriented sentiment
analysis based on the grammar structure and improved self-attention network. In Proceedings of the Evaluation of
Novel Approaches to Software Engineering (ENASE’20). 339–346.
[48] Khang Phuoc-Quy Nguyen and Nguyen Van Kiet. 2020. Exploiting vietnamese social media characteristics for textual
emotion recognition in vietnamese. In Proceedings of the International Conference on Asian Language Processing. IEEE,
276–281. https://doi.org/10.1109/IALP51396.2020.9310495
[49] Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan Vinh Nguyen, Tham Thi Hong Truong, and Ngan Luu-Thuy Nguyen.
2018. UIT-VSFC: Vietnamese students’ feedback corpus for sentiment analysis. In Proceedings of the 10th International
Conference on Knowledge and Systems Engineering. IEEE, 19–24. https://doi.org/10.1109/KSE.2018.8573337
[50] Phu X. V. Nguyen, Tham T. T. Hong, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2018. Deep learning versus
traditional classifiers on vietnamese students’ feedback corpus. In Proceedings of the 5th NAFOSTED Conference on
Information and Computer Science (NICS). IEEE, 75–80. https://doi.org/10.1109/NICS.2018.8606837
[51] Quan Nguyen, Ly Vu, and Quang Uy Nguyen. 2020. A two-channel model for representation learning in vietnamese
sentiment classification problem. J. Comput. Sci. Cybernet. 36, 4 (2020), 305–323. https://doi.org/10.15625/1813-9663/
36/4/14829
[52] Quoc Thai Nguyen, Thoai Linh Nguyen, Ngoc Hoang Luong, and Quoc Hung Ngo. 2020. Fine-tuning BERT for sen-
timent analysis of vietnamese reviews. In Proceedings of the 7th NAFOSTED Conference on Information and Computer
Science. IEEE, 302–307. https://doi.org/10.1109/NICS51282.2020.9335899
[53] Vu Duc Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2018. Variants of long short-term memory for senti-
ment analysis on vietnamese students’ feedback corpus. In Proceedings of the 10th International Conference on Knowl-
edge and Systems Engineering. IEEE, 306–311.
[54] Vu Duc Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2018. Variants of long short-term memory for senti-
ment analysis on vietnamese students’ feedback corpus. In Proceedings of the 10th International Conference on Knowl-
edge and Systems Engineering. IEEE, 306–311. https://doi.org/10.1109/KSE.2018.8573351
[55] Dang-Khoa Nguyen-Nhat and Huu-Thanh Duong. 2019. One-document training for vietnamese sentiment analysis.
In Computational Data and Social Networks, Andrea Tagarelli and Hanghang Tong (Eds.). Springer International Pub-
lishing, Cham, 189–200.
[56] Thuy Nguyen-Thanh and Giang Tran Cong Tran. 2019. Vietnamese sentiment analysis for hotel review based on
overfitting training and ensemble learning. In Proceedings of the 10th International Symposium on Information and
Communication Technology. Association for Computing Machinery, 147–153. https://doi.org/10.1145/3368926.3369675
[57] Bich-Tuyen Nguyen-Thi and Huu-Thanh Duong. 2019. A vietnamese sentiment analysis system based on multiple clas-
sifiers with enhancing lexicon features. In Industrial Networks and Intelligent Systems, Trung Quang Duong, Nguyen-
Son Vo, Loi K. Nguyen, Quoc-Tuan Vien, and Van-Dinh Nguyen (Eds.). Springer International Publishing, Cham,
240–249.
[58] Denilson Alves Pereira. 2021. A survey of sentiment analysis in the portuguese language. Artif. Intell. Rev. 54, 2 (2021),
1087–1115.
[59] Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pretrained text-to-text transformer for viet-
namese language generation. In Proceedings of the Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies: Student Research Workshop. Association for Computational
Linguistics, 136–142. https://doi.org/10.18653/v1/2022.naacl-srw.18
[60] Vo Ngoc Phu, Vo Thi Ngoc Chau, Vo Thi Ngoc Tran, and Nguyen Duy Dat. 2018. A vietnamese adjective emotion
dictionary based on exploitation of Vietnamese language characteristics. Artif. Intell. Rev. 50, 1 (2018), 93–159.
[61] Vo Ngoc Phu, Vo Thi Ngoc Chau, Vo Thi Ngoc Tran, Dat Nguyen Duy, and Khanh Ly Doan Duy. 2019. A valence-
totaling model for vietnamese sentiment classification. Evolv. Syst. 10, 3 (2019), 453–499.
[62] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, and Rada Mihalcea. 2023. Beneath the tip of the iceberg:
Current challenges and new directions in sentiment analysis research. IEEE Trans. Affect. Comput. 14, 1 (2023), 108–132.
https://doi.org/10.1109/TAFFC.2020.3038167
[63] Marco Pota, Mirko Ventura, Rosario Catelli, and Massimo Esposito. 2021. An effective BERT-based pipeline for twitter
sentiment analysis: A case study in italian. Sensors 21, 1 (2021), 133.
[64] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J
Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.
21, 140 (2020), 1–67.
[65] Biswarup Ray, Avishek Garain, and Ram Sarkar. 2021. An ensemble-based hotel recommender system using sentiment
analysis and aspect categorization of hotel reviews. Appl. Soft Comput. 98 (2021), 106935.
[66] Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. How good is your tokenizer? On the
monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers). Association for Computational Linguistics, 3118–3135. https://doi.org/10.18653/v1/2021.acl-long.243
[67] Mrityunjay Singh, Amit Kumar Jakhar, and Shivam Pandey. 2021. Sentiment analysis on the impact of coronavirus in
social life using the BERT model. Soc. Netw. Anal. Min. 11, 1 (2021), 1–11. https://doi.org/10.1007/s13278-021-00737-z
[68] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In Chi-
nese Computational Linguistics, Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu (Eds.). Springer
International Publishing, Cham, 194–206.
[69] Thien Khai Tran and Tuoi Thi Phan. 2015. Constructing sentiment ontology for vietnamese reviews. In Proceedings
of the 17th International Conference on Information Integration and Web-Based Applications and Services (iiWAS’15).
Association for Computing Machinery, New York, NY, Article 36, 5 pages. https://doi.org/10.1145/2837185.2837215
[70] Thien Khai Tran and Tuoi Thi Phan. 2016. Computing sentiment scores of adjective phrases for vietnamese. In Multi-
disciplinary Trends in Artificial Intelligence, Chattrakul Sombattheera, Frieder Stolzenburg, Fangzhen Lin, and Abhaya
Nayak (Eds.). Springer International Publishing, Cham, 288–296.
[71] Thien Khai Tran and Tuoi Thi Phan. 2016. Multi-class opinion classification for Vietnamese hotel reviews. Int. J. Intell.
Technol. Appl. Stat. 9, 1 (2016), 7–18.
[72] Thien Khai Tran and Tuoi Thi Phan. 2018. A hybrid approach for building a Vietnamese sentiment dictionary. J. Intell.
Fuzzy Syst. 35, 1 (2018), 967–978.
[73] Thien Khai Tran and Tuoi Thi Phan. 2018. Towards a sentiment analysis model based on semantic relation analysis.
Int. J. Synth. Emot. 9, 2 (2018), 54–75.
[74] Thien Khai Tran and Tuoi Thi Phan. 2020. Capturing contextual factors in sentiment classification: An ensemble
approach. IEEE Access 8 (2020), 116856–116865.
[75] Son Trinh, Luu Nguyen, and Minh Vo. 2018. Combining Lexicon-Based and Learning-Based Methods for Sentiment
Analysis for Product Reviews in Vietnamese Language. Springer International Publishing, Cham, 57–75. https://doi.org/
10.1007/978-3-319-60170-0_5
[76] Son Trinh, Luu Nguyen, Minh Vo, and Phuc Do. 2016. Lexicon-Based Sentiment Analysis of Facebook Comments in
Vietnamese Language. Springer International Publishing, Cham, 263–276. https://doi.org/10.1007/978-3-319-31277-
4_23
[77] Trong-Loc Truong, Hanh-Linh Le, and Thien-Phuc Le Dang. 2020. Sentiment analysis implementing BERT-based pre-
trained language model for vietnamese. In Proceedings of the 7th NAFOSTED Conference on Information and Computer
Science. IEEE, 362–367. https://doi.org/10.1109/NICS51282.2020.9335912
[78] Huynh Quoc Viet Vo and Kazuhide Yamamoto. 2018. VietSentiLex: A sentiment dictionary that considers the polar-
ity of ambiguous sentiment words. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and
Computation. Association for Computational Linguistics.
[79] Hung T. Vo, Hai C. Lam, Duc Dung Nguyen, and Nguyen Huynh Tuong. 2016. Topic classification and sentiment
analysis for vietnamese education survey system. As. J. Comput. Sci. Inf. Technol. 6, 3 (2016), 27–34.
[80] Quan Vo, Huy Nguyen, Bac Le, and Minh Nguyen. 2017. Multi-channel LSTM-CNN model for vietnamese sentiment
analysis. In Proceedings of the 9th International Conference on Knowledge and Systems Engineering. IEEE, 24–29. https:
//doi.org/10.1109/KSE.2017.8119429
[81] Thanh Hung Vo, Thien Tin Nguyen, Hoang Anh Pham, and Thanh Van Le. 2017. An efficient hybrid model for
vietnamese sentiment analysis. In Intelligent Information and Database Systems, Ngoc Thanh Nguyen, Satoshi Tojo,
Le Minh Nguyen, and Bogdan Trawiński (Eds.). Springer International Publishing, Cham, 227–237.
[82] Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2018. VnCoreNLP: A vietnamese
natural language processing toolkit. In Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Demonstrations. Association for Computational Linguistics, 56–60. https://doi.org/10.
18653/v1/N18-5012
[83] Xuan-Son Vu and Seong-Bae Park. 2014. Construction of vietnamese sentiwordnet by using Vietnamese dictionary.
In Proceedings of the Korea Information Processing Society Conference. Korea Information Processing Society, 745–748.
[84] Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classifi-
cation tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong,
China, 6382–6388. https://doi.org/10.18653/v1/D19-1670
[85] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim
Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien
Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Trans-
formers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45.
https://doi.org/10.18653/v1/2020.emnlp-demos.6
[86] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel.
2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association
for Computational Linguistics, Online, 483–498. https://doi.org/10.18653/v1/2021.naacl-main.41
[87] Ashima Yadav and Dinesh Kumar Vishwakarma. 2020. Sentiment analysis using deep learning architectures: A review.
Artif. Intell. Rev. 53, 6 (2020), 4335–4385.
[88] Yuan Yao, Bowen Dong, Ao Zhang, Zhengyan Zhang, Ruobing Xie, Zhiyuan Liu, Leyu Lin, Maosong Sun, and Jianyong
Wang. 2022. Prompt tuning for discriminative pre-trained language models. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics (ACL’22). Association for Computational Linguistics, Dublin, Ireland,
3468–3473. https://doi.org/10.18653/v1/2022.findings-acl.273
[89] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Data Min. Knowl. Discov.
8, 4 (2018), e1253.
Received 22 December 2021; revised 17 May 2022; accepted 17 March 2023
