Sentence Embeddings
Md. Kowsher¹, Md. Shohanur Islam Sobuj², Nusrat Jahan Prottasha¹, Mohammad Shamsul Arefin³, Yasuhiko Morimoto⁴

¹Stevens Institute of Technology, USA
²Hajee Mohammad Danesh Science and Technology University, Bangladesh
³Chittagong University of Engineering and Technology, Bangladesh
⁴Graduate School of Engineering, Hiroshima University, Japan

{ga.kowsher,shohanursobuj,jahannusratprotta}@gmail.com
sarefin@cuet.ac.bd, morimo@hiroshima-u.ac.jp
[Figure 1 diagram: three weight-sharing encoders map the anchor, positive, and negative sentences s_a, s_p, s_n to embeddings h_a, h_p, h_n; similarity is encouraged between (h_a, h_p) and dissimilarity between (h_a, h_n).]
Figure 1: Overview of the proposed methodology for achieving universal zero-shot NLI. The approach incorporates
contrastive learning with cross-lingual sentence embeddings, leveraging a large-scale pre-trained multilingual
language model trained on NLI data from diverse languages. The Siamese network-based contrastive learning
framework establishes semantic relationships among similar sentences, enabling the zero-shot NLI model to capture
meaningful semantic representations. By extending the zero-shot capability to additional languages within the
multilingual model, the approach achieves universal zero-shot NLI across a broad range of languages, including
low-resource ones. In this framework, "a" serves as an anchor, "n" as negative, and "p" as positive in defining the
relationships between three categories: entailment (E), neutral (N), or contradiction (C) (for more details, refer to
Model Architecture in 5.1).
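The weight-sharing arrangement in Figure 1 can be illustrated with a toy sketch. Nothing below comes from the paper's actual implementation: `encode` is a random stand-in for the multilingual encoder, and a single weight matrix `W` plays the role of the shared weights applied to the anchor, positive, and negative inputs.

```python
import numpy as np

def encode(x, W):
    # Toy stand-in for the shared multilingual encoder; the single weight
    # matrix W is reused for anchor, positive, and negative (shared weights).
    return np.tanh(W @ x)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))              # one set of weights, used three times

s_a = rng.standard_normal(8)                 # anchor "sentence" (as a toy vector)
s_p = s_a + 0.01 * rng.standard_normal(8)    # positive: a near-paraphrase of the anchor
s_n = rng.standard_normal(8)                 # negative: an unrelated sentence

h_a, h_p, h_n = encode(s_a, W), encode(s_p, W), encode(s_n, W)

sim_ap = cosine(h_a, h_p)   # similarity branch of Figure 1
sim_an = cosine(h_a, h_n)   # dissimilarity branch of Figure 1
```

Because the positive is a small perturbation of the anchor while the negative is unrelated, the anchor-positive similarity should come out higher than the anchor-negative one, which is the relationship the contrastive objective enforces.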
We exploit the power of contrastive learning by employing a Siamese network-based framework to establish semantic relationships among similar sentences in the 15 languages. Contrastive learning enables the model to learn robust representations that can effectively discriminate between entailment and contradiction.

By training the zero-shot NLI model with a contrastive objective on this multilingual dataset, we equip the model with the ability to generalize to unseen languages. The fine-tuned language model's zero-shot learning capabilities allow us to extend zero-shot NLI to additional languages within the multilingual model. This approach significantly broadens the language coverage and practical applicability of the NLI model, especially for low-resource languages where labeled data is scarce.

2 Related Work

Text classification is a typical task of categorizing texts into groups, including sentiment analysis, question answering, etc. Due to the unstructured nature of text, extracting useful information from it can be very time-consuming and inefficient. With the rapid development of deep learning, neural network methods such as RNNs (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and CNNs (Kim, 2014; Zhang et al., 2015) have been widely explored for efficiently encoding text sequences. However, their capabilities are limited by computational bottlenecks and the problem of long-term dependencies. Recently, large-scale pre-trained language models (PLMs) based on Transformers (Vaswani et al., 2017) have emerged as the state of the art in text modeling. These include auto-regressive PLMs such as GPT (Radford et al., 2018) and XLNet (Yang et al., 2019), and auto-encoding PLMs such as BERT (Devlin et al., 2019b), RoBERTa (Liu et al.), and ALBERT (Lan et al., 2019). The strong performance of PLMs mainly comes from the extensive knowledge in the large-scale corpora used for pre-training.

Despite the optimality of cross-entropy in supervised learning, a large number of studies have revealed drawbacks of the cross-entropy loss, e.g., vulnerability to noisy labels (Zhang et al., 2018), poor margins (Elsayed et al., 2018), and weak adversarial robustness (Pang et al., 2019). Inspired by the InfoNCE loss (Oord et al., 2018), contrastive learning (Hadsell et al., 2006) has been widely used in unsupervised learning to learn good generic representations for downstream tasks. For example, He et al. (2020) leverage a momentum encoder to maintain a look-up dictionary for encoding the input examples. Chen et al. (2020) produce multiple views of each input example through data augmentation as positive samples and compare them to the negative samples in the dataset. Gao et al. (2021) similarly apply dropout to each sentence twice to generate positive pairs. In the supervised scenario, Khosla et al. (2020) cluster the training examples by their labels to maximize the similarity of representations within the same class while minimizing the similarity between different classes. Gunel et al. (2021) extend supervised contrastive learning to the natural language domain with pre-trained language models. Lopez-Martin et al. (2022) study the network intrusion detection problem using a well-designed supervised contrastive loss.

3 Background

3.1 NLI

Natural Language Inference (NLI) is a task in natural language processing (NLP) where the goal is to determine the relationship between two sentences. Given two input sentences s_1 and s_2, the task is to classify their relationship as one of three categories: entailment (E), neutral (N), or contradiction (C).

Formally, let S_1 and S_2 be sets of sentences in two different languages, and let L = {E, N, C} be the set of possible relationship labels. Given a pair of sentences (s_1, s_2) ∈ S_1 × S_2, the task is to predict the label l ∈ L that represents the relationship between the two sentences, i.e., l = NLI(s_1, s_2).

3.2 Siamese Networks

Siamese networks are neural network architectures specifically designed for comparing the similarity or dissimilarity between pairs of inputs (Chen and He, 2021). Given two input samples x_1 and x_2, a Siamese network learns a shared representation for both inputs and measures their similarity based on this shared representation.

Let f denote the shared subnetwork of the Siamese network. The shared subnetwork consists of multiple layers, such as convolutional or recurrent layers, followed by fully connected layers. It aims to extract relevant features from the input samples and map them into a common representation space.

The Siamese network takes two input samples, x_1 and x_2, and applies the shared subnetwork to each input to obtain the respective representations:

    h_1 = f(x_1),  h_2 = f(x_2)

To measure the similarity between h_1 and h_2, a distance metric is commonly employed, such as Euclidean distance or cosine similarity. For example, cosine similarity can be calculated as:

    similarity = (h_1 · h_2) / (∥h_1∥ ∥h_2∥)

During training, Siamese networks utilize a contrastive loss function to encourage similar inputs to have close representations and dissimilar inputs to have distant representations. The contrastive loss penalizes large distances for similar pairs and small distances for dissimilar pairs.

Siamese networks have demonstrated effectiveness in various domains, enabling tasks such as similarity-based classification, retrieval, and clustering. The ability to learn meaningful representations for similarity estimation has made Siamese networks widely applicable in research and practical applications.

3.3 Contrastive learning

Let D = {(x_i, y_i)}_{i=1}^{N} be a dataset of N samples, where x_i is a sentence and y_i is a label. Let ϕ be an embedding function that maps a sentence x_i to a low-dimensional vector representation ϕ(x_i) ∈ R^d, where d is the dimensionality of the embedding space. The goal of contrastive learning is to learn an embedding function ϕ such that the similarity between the representation of a sentence x_i and its positive sample x_j is greater than that of its negative samples x_k.

Given a pair of sentences (x_i, x_j), the contrastive loss can be defined as follows:

    L = −log [ exp(sim(x_i, x_j)/θ) / ( exp(sim(x_i, x_k)/θ)
        + Σ_{k=1}^{N} [y_k = y_i] exp(sim(x_i, x_k)/θ)
        + Σ_{k=1}^{N} [y_k ≠ y_i] exp(sim(x_i, x_k)/θ) ) ]        (1)

where sim(x_i, x_j) = ϕ(x_i)ᵀϕ(x_j) / (∥ϕ(x_i)∥ ∥ϕ(x_j)∥) is the cosine similarity between the embeddings of the sentences x_i and x_j, θ is the temperature parameter that controls the sharpness of the probability distribution over the similarity scores, [y_k = y_i] is the Iverson bracket that takes the value 1 if y_k = y_i and 0 otherwise, and [y_k ≠ y_i] is the Iverson bracket that takes the value 1 if y_k ≠ y_i and 0 otherwise.

Here, h = ϕ(s) denotes the dense vector representation of a sentence s in an embedding space of dimensionality d. The embedding function ϕ is trained to generate similar embeddings for semantically equivalent sentences across different languages, while producing dissimilar embeddings for sentences with different meanings.

We formulate our NLI model as a multi-task learning problem by simultaneously optimizing two loss functions: the cross-entropy loss and the contrastive loss. The cross-entropy loss is employed to predict the NLI label l_i for a given sentence pair s_i^z = (s_{i,p}^z, s_{i,h}^z). The contrastive loss ensures that cross-lingual sentence embeddings with similar semantics are close together in the embedding space, i.e., for two languages α, β ∈ z, sim(s^α, s^β) = (h^α · h^β) / (∥h^α∥ ∥h^β∥) > τ, while dissimilar sentence embeddings are far apart, i.e., sim(s^α, s^β) = (h^α · h^β) / (∥h^α∥ ∥h^β∥) < τ. Here, τ represents the similarity threshold.
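Equation (1) can be read operationally as follows. This is a NumPy sketch under two stated assumptions, not the authors' released code: the candidate set excludes k = i, and the leading exp(sim(x_i, x_k)/θ) term of the denominator is absorbed into the two Iverson-bracket sums, which together cover every candidate k.

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(emb, labels, i, j, theta=0.1):
    # Numerator: the positive pair (x_i, x_j), temperature-scaled.
    pos = np.exp(cosine_sim(emb[i], emb[j]) / theta)
    # Denominator: the two Iverson brackets in Eq. (1) split the candidates
    # into same-label and different-label terms; together they cover all k != i.
    same = sum(np.exp(cosine_sim(emb[i], emb[k]) / theta)
               for k in range(len(emb)) if k != i and labels[k] == labels[i])
    diff = sum(np.exp(cosine_sim(emb[i], emb[k]) / theta)
               for k in range(len(emb)) if k != i and labels[k] != labels[i])
    return float(-np.log(pos / (same + diff)))

# Toy embeddings: index 1 is close to the anchor (index 0), index 2 is opposite.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
labels = [0, 0, 1, 1]

tight = contrastive_loss(emb, labels, 0, 1)   # anchor paired with a near positive
loose = contrastive_loss(emb, labels, 0, 2)   # anchor paired with a far "positive"
```

For a fixed anchor the denominator is unchanged, so the loss is strictly smaller when the positive sits closer to the anchor, which is the behavior the objective is meant to induce.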
Table 1: Performance comparison of various multilingual models on unseen and low-resource NLI datasets in both zero-shot and few-shot settings, in terms of accuracy (higher is better). Models marked with an asterisk (*) denote our proposed universal zero-shot models. The best results are highlighted in bold and the second-best results are underlined.
Figure 2: Accuracy comparison of various NLI models in both zero-shot (left) and few-shot (right) settings across different low-resource datasets. The performances of the unseen multilingual XLM-RoBERTa, seen XLM-RoBERTa, and our proposed XLM-RoBERTa* are depicted. In this context, seen refers to language data that was employed in training the zero-shot model, while unseen refers to data that was not incorporated into the zero-shot training process.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020b. Unsupervised cross-lingual representation learning at scale.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: Pre-training of deep bidirectional transformers for language understanding.

Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1735–1742.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. 2020a. The multilingual Amazon reviews corpus. arXiv preprint arXiv:2010.02573.

Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. 2020b. The multilingual Amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4563–4568, Online. Association for Computational Linguistics.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, pages 18661–18673. Curran Associates, Inc.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Md Kowsher, Abdullah As Sami, Nusrat Jahan Prottasha, Mohammad Shamsul Arefin, Pranab Kumar Dhar, and Takeshi Koshiba. 2022. Bangla-BERT: Transformer-based efficient model for transfer learning and language understanding. IEEE Access, 10:91855–91870.

Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata. 2022. JGLUE: Japanese general language understanding evaluation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2957–2966, Marseille, France. European Language Resources Association.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372.

João Augusto Leite, Diego Silva, Kalina Bontcheva, and Carolina Scarton. 2020. Toxic language detection in social media for Brazilian Portuguese: New dataset and multilingual analysis. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 914–924, Suzhou, China. Association for Computational Linguistics.

Yue Li, Xutao Wang, and Pengjian Xu. 2018. Chinese text classification model based on deep learning. Future Internet, 10(11):113.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, and Omer Levy. RoBERTa: A robustly optimized BERT pretraining approach.

Manuel Lopez-Martin, Antonio Sanchez-Esguevillas, Juan Ignacio Arribas, and Belen Carro. 2022. Supervised contrastive learning over prototype-label embeddings for network intrusion detection. Information Fusion, 79:200–228.

Andrew Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Tianyu Pang, Chuanxiong Xu, Hongtao Du, Ningyu Zhang, and Jun Zhu. 2019. Improving adversarial robustness via promoting ensemble diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6440–6449.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.

Sebastian Ruder, Ivan Vulic, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–630.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Gudbjartur Ingi Sigurbergsson and Leon Derczynski. 2020. Offensive language and hate speech detection for Danish. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3498–3508, Marseille, France. European Language Resources Association.

Md. Shohanur Islam Sobuj, Md. Kowsher, and Md. Fahim Shahriar. 2021. Bangla multi class text dataset. https://www.kaggle.com/datasets/shohanursobuj/banglamct.

Tyqiangz. 2023. Multilingual sentiments dataset. https://huggingface.co/datasets/tyqiangz/multilingual-sentiments.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Andika William and Yunita Sari. 2020. CLICK-ID: A novel dataset for Indonesian clickbait headlines. Data in Brief, 32:106231.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Hongyi Zhang, Moustapha Cisse, and Yann N. Dauphin. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

A Appendix

A.1 Hardware and Software

We perform our experiments on two NVIDIA RTX 3090 GPUs, each with 24 GB of memory. We use PyTorch (Paszke et al., 2019) as the deep learning framework and Hugging Face's Transformers library (Wolf et al., 2019) to work with the XLM-RoBERTa-large model. We use the official evaluation scripts provided with the XNLI dataset to compute the evaluation metrics.

A.2 Dataset

The datasets used in this paper are described in this section.

A.2.1 MARC-ja

The Multilingual Amazon Reviews Corpus (MARC), from which the Japanese dataset MARC-ja was built (Keung et al., 2020b), was used to create the JGLUE benchmark (Kurihara et al., 2022). This study focuses on text classification, and to that end, 4- and 5-star ratings were converted to the "positive" class, while 1- and 2-star ratings were assigned to the "negative" class. The dev and test sets contain 5,654 and 5,639 instances, respectively, compared to 187,528 instances in the training set. The extensive collection of product reviews provided by MARC-ja makes it possible to evaluate NLP models in depth. The characteristics of the dataset and the accuracy metric used for evaluation support a thorough examination of how well models perform on Japanese text classification tasks.

A.2.2 DKHate

The Danish hate speech dataset used in this study is a significant resource consisting of anonymized Twitter data carefully annotated for hate speech. The dataset offers a targeted and thorough collection for hate speech detection and was produced by Sigurbergsson and Derczynski for their article "Offensive Language and Hate Speech Detection for Danish" (Sigurbergsson and Derczynski, 2020). Each element in the collection contains a tweet and a label designating whether or not it is offensive ("OFF" or "NOT"). It has a training split of 2,960 tweets and a test split of 329 tweets.

A.2.3 Kor-3i4k

The Korean speaker intentions dataset 3i4K (Cho et al., 2018) is used in this study as a resource for modeling speaker intention. Along with manually crafted commands and inquiries, it includes commonly used Korean terms from the corpus of the Seoul National University Speech Language Processing Lab. It provides classifications for intonation-dependent utterances as well as fragments, statements, inquiries, and directives. Given the importance of intonation in languages like Korean, this dataset offers essential information for precisely determining speaker intents. With a training set of 55,134 examples and a test set of 6,121 examples, it can effectively support training and evaluating models.
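The star-to-polarity conversion described for MARC-ja in A.2.1 amounts to a simple mapping. The sketch below is illustrative rather than the JGLUE build script, and the treatment of 3-star reviews is an assumption, since the description only covers 1-2 and 4-5 stars.

```python
def rating_to_label(stars):
    """MARC-ja style construction: 4- and 5-star reviews become "positive",
    1- and 2-star reviews become "negative". Dropping 3-star reviews is an
    assumption here; the paper does not state how they are handled."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # neutral 3-star reviews are excluded in this sketch

ratings = [5, 4, 3, 2, 1]
labels = [rating_to_label(r) for r in ratings]
```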
A.2.4 Id-clickbait

The CLICK-ID dataset used in this study is made up of a selection of headlines from Indonesian news sources (William and Sari, 2020). It has two primary components: a subset of 15,000 annotated sample headlines classified as clickbait or non-clickbait, and 46,119 raw article records. Three annotators separately examined each headline during the annotation process, and the majority decision was taken as the ground truth. The annotated sample contains 6,290 clickbait and 8,710 non-clickbait headlines. For the classification task used in this study, we trained and evaluated models only on the annotated sample.

A.2.5 BanglaMCT

The BanglaMCT dataset, known as the Bangla Multi Class Text Dataset, is a comprehensive collection of Bengali news texts sourced from various newspapers (Sobuj et al., 2021; Kowsher et al., 2022). It offers two versions, MCT4 and MCT7. MCT4 consists of four tags, while MCT7 includes seven tags. The dataset contains a total of 287,566 documents for MCT4 and 197,767 documents for MCT7. The dataset is split into a balanced 50/50 ratio for training and testing, making it suitable for text classification tasks in Bengali, particularly for news-related content across different categories.

A.2.6 ToLD-br

The ToLD-Br dataset is a valuable resource for investigating toxic tweets in Brazilian Portuguese (Leite et al., 2020). The dataset provides thorough coverage of LGBTQ+phobia, xenophobia, obscenity, insult, misogyny, and racism, with contributions from 42 annotators chosen to reflect various populations. The binary version of the dataset was used in this study to evaluate whether a tweet is toxic or not. There are 21,000 examples in total, with 16,800 in the training set, 2,100 in the validation set, and 2,100 in the test set. This large dataset supports the construction and testing of models for identifying toxicity in Brazilian Portuguese tweets.

A.3 Universal Zero-shot vs Trained model

In this section, we present the experimental results of our zero-shot, and subsequently few-shot, NLI models compared against previously established datasets and trained models. Typically, models that are specifically trained for a task perform better than zero-shot models. However, our models stood up well when compared to these trained models. We demonstrate the performance of our model across various languages and tasks. In our experimental setup, including the training, validation, and test phases, we closely followed the settings defined in the baseline papers.

Model                      Dev Accuracy   Test Accuracy
Human                      0.989          0.990
Tohoku BERTBASE            0.958          0.957
Tohoku BERTBASE (char)     0.956          0.957
Tohoku BERTLARGE           0.955          0.961
NICT BERTBASE              0.958          0.960
Waseda RoBERTaBASE         0.962          0.962
XLM-RoBERTaBASE            0.961          0.962
XLM-RoBERTaLARGE           0.964          0.965
XLM-RoBERTa*               0.820          0.819
  + few shot               0.896          0.873
mDeBERTa-v3*               0.829          0.820
  + few shot               0.882          0.878

Table 4: JGLUE performance on the DEV/TEST sets of the MARC-ja dataset. The * represents our NLI model for zero-shot classification. The baseline performances are taken from (Kurihara et al., 2022).

Table 4 shows the performance of different models on the DEV and TEST sets of the MARC-ja dataset. The baseline models, such as Tohoku BERTBASE, Tohoku BERTLARGE, NICT BERTBASE, Waseda RoBERTaBASE, XLM-RoBERTaBASE, and XLM-RoBERTaLARGE, are explicitly trained models. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, initially exhibit lower accuracy but achieve notable improvement after few-shot training. This demonstrates the potential of our zero-shot approach, combined with limited fine-tuning data, to bridge the performance gap with explicitly trained models.

Table 5 presents the results from sub-task A in Danish. Existing models, such as Logistic Regression DA, Learned-BiLSTM (10 Epochs) DA, Fast-BiLSTM (100 Epochs) DA, and AUX-Fast-BiLSTM (50 Epochs) DA, are trained models. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, achieve competitive performance, and their accuracy further improves after few-shot training.

Model                              Macro F1
Logistic Regression DA             0.699
Learned-BiLSTM (10 Epochs) DA      0.658
Fast-BiLSTM (100 Epochs) DA        0.630
AUX-Fast-BiLSTM (50 Epochs) DA     0.675
XLM-RoBERTa*                       0.685
  + few shot                       0.711
mDeBERTa-v3*                       0.680
  + few shot                       0.709

Table 5: Results from sub-task A in Danish. The baseline performances are taken from (Sigurbergsson and Derczynski, 2020).

For the FCI module in the Korean language, Table 6 displays the performance comparison of different models. Existing models, including charCNN, charBiLSTM, charCNN + charBiLSTM, and charBiLSTM-Att, are trained models. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, exhibit comparable performance initially and achieve notable improvement after few-shot training.

Models                       F1 score   Accuracy
charCNN                      0.7691     0.8706
charBiLSTM                   0.7811     0.8807
charCNN + charBiLSTM         0.7700     0.8745
charBiLSTM-Att               0.7977     0.8869
charCNN + charBiLSTM-Att     0.7822     0.8746
XLM-RoBERTa*                 0.7741     0.8760
  + few-shot                 0.7913     0.8839
mDeBERTa-v3*                 0.7817     0.8722
  + few-shot                 0.7989     0.8901

Table 6: Model performance for the FCI module for the Korean language. The baseline performances are taken from (Cho et al., 2018).

In the context of clickbait headline detection in Indonesian news sites (Table 7), the average accuracy of established models like M-BERT, Bi-LSTM, CNN, and XGBoost is provided. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, demonstrate competitive performance initially and show significant enhancement after few-shot training.

Model Name        Average Accuracy
M-BERT            0.9153
Bi-LSTM           0.8125
CNN               0.7958
XGBoost           0.8069
XLM-RoBERTa*      0.7794
  + few-shot      0.8294
mDeBERTa-v3*      0.7492
  + few-shot      0.8061

Table 7: Performance comparison of clickbait headline detection in Indonesian news sites. The baseline performances are taken from (Fakhruzzaman et al., 2021).

Table 8 presents the results of Bengali multi-class text classification. The models compared include biLSTM, CNN, CNN-biLSTM, DNN, Logistic Regression, and MNB. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, initially show lower accuracy but achieve notable improvement after few-shot training.

Finally, Table 9 displays the model evaluation for toxic language detection in Brazilian Portuguese social media. Existing methods, such as BoW + AutoML, BR-BERT, M-BERT-BR, M-BERT(transfer), and M-BERT(zero-shot), are compared. Our zero-shot models, XLM-RoBERTaLARGE* and mDeBERTa-v3base*, exhibit competitive performance initially and demonstrate improvement after few-shot training. Overall, our zero-shot NLI models demonstrate the ability to perform reasonably well without explicit training on the target language. Although their initial performance might be lower compared to explicitly trained models, few-shot training significantly

       Model                  Accuracy   F1 score
MCT4   biLSTM                 0.9652     0.9653
       CNN                    0.9723     0.9723
       CNN-biLSTM             0.9673     0.9673
       DNN                    0.9707     0.9708
       Logistic Regression    0.9586     0.9587
       MNB                    0.9357     0.9359
       XLM-RoBERTa*           0.8316     0.8290
         + few-shot           0.8713     0.8639
       mDeBERTa-v3*           0.8012     0.8007
         + few-shot           0.8518     0.8600
MCT7   biLSTM                 0.9236     0.9237
       CNN                    0.9204     0.9204
       CNN-biLSTM             0.9115     0.9114
       DNN                    0.9289     0.9290
       Logistic Regression    0.9156     0.9156
       MNB                    0.8858     0.8859
       XLM-RoBERTa*           0.7418     0.7562
         + few-shot           0.8234     0.8221
       mDeBERTa-v3*           0.7441     0.7612
         + few-shot           0.8309     0.8237

Table 8: Results of Bengali multi-class text classification on the MCT4 and MCT7 versions of BanglaMCT.
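The macro F1 reported in Table 5 is the unweighted mean of per-class F1 scores. The following is a minimal generic sketch of that metric, not the official evaluation script referenced in A.1, using DKHate-style "OFF"/"NOT" labels as an example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# Tiny DKHate-style example: one "NOT" tweet is misclassified as "OFF".
score = macro_f1(["NOT", "NOT", "OFF", "OFF"], ["NOT", "OFF", "OFF", "OFF"])
```

Unlike accuracy, the macro average weights each class equally, which matters for imbalanced label distributions such as offensive vs. non-offensive tweets.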