

Analysis of Word-level Embeddings for Indic Languages on AI4Bharat-IndicNLP Corpora

Dipam Goswami
Department of Computer Science and Information Systems
BITS Pilani, India
dipamgoswami01@gmail.com

Shrikant Malviya
Center for Cognitive Computing
IIIT Allahabad, India
s.kant.malviya@gmail.com

Rohit Mishra
Center for Cognitive Computing
IIIT Allahabad, India
rohit129iiita@gmail.com

Uma Shanker Tiwary
Center for Cognitive Computing
IIIT Allahabad, India
ustiwary@gmail.com

Abstract—This paper presents an analysis of non-contextual word embeddings trained on the AI4Bharat-IndicNLP corpus, which contains 2.7 billion words covering 10 Indian languages. We share the pre-trained embeddings for research and development in Indic languages. These embeddings are evaluated on several tasks, such as word similarity, word analogy and classification on multiple datasets. The analysis of the word embeddings is expected to give researchers a better understanding of the Indic languages. We show that the Word2Vec skip-gram and FastText skip-gram embeddings are the best performing models for NLP tasks on Indic languages. All the embeddings are made freely available.

Index Terms—Word embeddings, Word2Vec, FastText, GloVe, Continuous Bag Of Words, Skip-Gram, Sentiment analysis

I. INTRODUCTION

Indian languages are low-resource languages for NLP tasks despite having billions of users. Word embeddings provide an efficient, dense representation of words as vectors, such that similar words have similar representations [1]. Word vectors, or embeddings, have been applied to many tasks in natural language processing, such as machine translation [2], question answering [3] and information retrieval [4], leading to state-of-the-art performance. However, these embeddings have to be trained on very large corpora to achieve good performance on NLP tasks; the performance of the embeddings therefore depends on the size of the corpora used for training [5].

The availability of large monolingual corpora is an important factor for training word embeddings. The resulting pre-trained models capture the semantic relations between words and are used for transfer learning in NLP tasks. Pre-trained models are generally learned using unsupervised approaches on a large and diverse monolingual corpus [5], [6]. In this paper, a large set of monolingual corpora (AI4Bharat-IndicNLP) [7] covering 10 languages from two language families (Indo-Aryan and Dravidian) is used for training the word embeddings. The statistics of the corpus are given in Table I for reference. Hindi and Tamil have the largest number of tokens, while only Oriya has fewer than 100 million tokens.

Almost all Indian languages are morphologically rich, owing to a large number of inflected word forms, and follow SOV word order. Dravidian languages are generally spoken in the southern part of the Indian subcontinent, while the Indo-Aryan languages are mostly spoken in the northern part of India. The Indo-Aryan languages include Hindi, Bengali, Sanskrit, Marathi, Punjabi, Gujarati, Oriya, etc., while the Dravidian languages include Kannada, Tamil, Telugu and Malayalam. These languages are spoken by more than 95% of the Indian population, with Hindi being the most widely spoken. Hence, Indic languages play a significant role in NLP research, and the availability of these word embeddings will impact several representative NLP tasks.

The main contributions of this paper can be summarised as:
• Pre-trained word-level embeddings1 for 10 Indian languages, trained on the AI4Bharat-IndicNLP corpus [7] using Word2Vec, FastText and GloVe, are released.
• The performance of these embeddings is analysed on various language-specific evaluation tasks, i.e. word similarity [10], analogy evaluation [11] and news article category classification [7].

We demonstrate how the Word2Vec (CBOW and skip-gram), FastText (CBOW and skip-gram) and GloVe embeddings perform on the evaluation tasks for each of the 10 Indic languages and present a comparative analysis of their performance. The analysis will serve as a guide for NLP researchers in choosing embeddings for different NLP tasks.

1 https://github.com/IL-Collection/Word-embeddings-indicnlp-corpus
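The released vectors can be inspected directly with gensim. The following is a minimal sketch, assuming the embeddings are distributed in the standard word2vec text format; the file name used here is a placeholder, not the actual name of a released file.

```python
# Minimal sketch: load one of the released embedding files and query it.
# The file name "hi-word2vec-sg-300.vec" is a placeholder.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("hi-word2vec-sg-300.vec", binary=False)

# Nearest neighbours of a Hindi word by cosine similarity.
print(wv.most_similar("भारत", topn=5))

# Cosine similarity between two words.
print(wv.similarity("भारत", "देश"))
```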

II. LITERATURE SURVEY

Word2Vec [1] made neural-network-based training of word vectors considerably more efficient. FastText [6] additionally utilises sub-word information to generate word vectors: instead of feeding whole words into the neural network architecture, FastText breaks words into character n-grams (sub-words). In contrast, GloVe [8] derives semantic relationships between words from the word co-occurrence matrix.

Word2Vec and FastText embeddings can be obtained using two methods: skip-gram (SG) and continuous bag-of-words (CBOW). The CBOW method [5] tries to predict the target word from its context, which is a bag of words contained in a fixed-size window around the target word. On the other hand, the skip-gram model introduced in [6] learns to take one word as input and predict the context words (the corresponding training objectives are summarised below, after Table I).

The pre-trained word embeddings available for many Indian languages were trained on limited corpora. The FastText project provides embeddings trained on the Wikipedia corpus (WK) [6], as well as embeddings trained on the Wikipedia + CommonCrawl corpus (WK+CC) [11]. Recently, [7] shared the AI4Bharat-IndicNLP corpora and analysed the performance of FastText skip-gram embeddings using several datasets covering word similarity [10], word analogy [11], sentiment analysis [12], [13] and news category classification [7], [15].

TABLE I
AI4BHARAT-INDICNLP CORPORA STATISTICS [7]

Language        #Sentences   #Tokens        #Types
Punjabi (pa)    6,525,312    179,434,326    594,546
Hindi (hi)      62,961,411   1,199,882,012  5,322,594
Bengali (bn)    7,209,184    100,126,473    1,590,761
Oriya (or)      3,593,096    51,592,831     735,746
Gujarati (gu)   7,889,815    129,780,649    2,439,608
Marathi (mr)    9,934,816    142,415,964    2,676,939
Kannada (kn)    14,780,315   174,973,508    3,064,805
Telugu (te)     15,152,277   190,283,400    4,189,798
Malayalam (ml)  11,665,803   167,411,472    8,830,430
Tamil (ta)      20,966,284   362,880,008    9,452,638
Total           160,678,313  2,698,780,643  38,897,865
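For reference, the training objectives behind these models can be summarised as follows. This is the standard formulation from [1], [5], [6], reproduced here for convenience; the notation is ours, not taken verbatim from the cited papers.

```latex
% Skip-gram: predict the context words from the centre word w_t
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% CBOW: predict the centre word from the surrounding context window
\frac{1}{T}\sum_{t=1}^{T} \log p\bigl(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\bigr)

% Softmax over output vectors v'_w and input vector v_{w_I}
% (approximated with negative sampling during training)
p(w_O \mid w_I) = \frac{\exp\bigl({v'_{w_O}}^{\top} v_{w_I}\bigr)}
                       {\sum_{w=1}^{W} \exp\bigl({v'_{w}}^{\top} v_{w_I}\bigr)}

% FastText: a word vector is the sum of the vectors z_g of its
% character n-grams G_w, which is how sub-word information enters
v_w = \sum_{g \in G_w} z_g
```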
III. MODEL TRAINING

We train the embeddings for the 10 Indian languages on the AI4Bharat-IndicNLP corpora (Table I) and evaluate their performance on word similarity, word analogy and text classification tasks. Hence, Word2Vec (CBOW and skip-gram), FastText (CBOW and skip-gram) and GloVe embeddings are trained for each language.

A. Word2Vec

We trained Word2Vec embeddings [1] of both the skip-gram (W-SG) and CBOW (W-CBOW) architectures with 300 dimensions using the gensim library [9]. Words occurring fewer than 5 times in the entire corpus are treated as out-of-vocabulary words. We trained the models for ten epochs, with a window size of 5 and 10 negative examples sampled for each instance. No pre-trained Word2Vec embeddings for the AI4Bharat-IndicNLP corpora were publicly available before this work.

B. FastText

The FastText embeddings [6] of both the skip-gram (F-SG) and CBOW (F-CBOW) architectures are trained with a vector size of 300 dimensions using the gensim [9] implementation of FastText. Words occurring fewer than 5 times in the entire corpus are treated as out-of-vocabulary words. We trained the models for ten epochs, with a window size of 5 and 10 negative examples sampled for each instance. In addition to the FastText skip-gram embeddings for the AI4Bharat-IndicNLP corpora shared publicly by [7], we also share FastText CBOW embeddings for all ten languages.

C. GloVe

We trained GloVe embeddings [8] with a vector size of 300 dimensions. Words with an occurrence frequency of less than five are treated as out-of-vocabulary words. The co-occurrence matrix is created using a window of size 5. We share two variants of the GloVe embeddings: GloVe-10, trained for ten epochs, and GloVe-15, trained for 15 epochs. GloVe embeddings for the AI4Bharat-IndicNLP corpora were also not previously available publicly.
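The Word2Vec and FastText configuration described above can be sketched as follows. This is a hedged illustration using the gensim 4.x API; the paper reports using gensim [9] but not its version (older releases name the parameters size and iter instead of vector_size and epochs), and the corpus file name below is a placeholder.

```python
# Sketch of the Word2Vec / FastText training setup described above:
# 300 dimensions, window 5, min_count 5, 10 negative samples, 10 epochs.
# "hi.txt" is a placeholder for one monolingual corpus file (one sentence per line).
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("hi.txt")

common = dict(vector_size=300, window=5, min_count=5, negative=10,
              epochs=10, workers=8)

w_sg = Word2Vec(sentences, sg=1, **common)      # Word2Vec skip-gram (W-SG)
w_cbow = Word2Vec(sentences, sg=0, **common)    # Word2Vec CBOW (W-CBOW)
f_sg = FastText(sentences, sg=1, **common)      # FastText skip-gram (F-SG)
f_cbow = FastText(sentences, sg=0, **common)    # FastText CBOW (F-CBOW)

# Save only the word vectors for evaluation and sharing.
w_sg.wv.save_word2vec_format("hi-w2v-sg-300.vec", binary=False)
```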
IV. EVALUATION AND ANALYSIS

We evaluate and compare the performance of the Word2Vec, FastText and GloVe embeddings on the following datasets:
• IIIT-Hyderabad word similarity dataset [10], with similarity databases for 7 Indian languages (100-200 word pairs per language).
• Facebook Hindi word analogy dataset [11].
• IIT-Patna Sentiment Analysis dataset [12] and the BBC News Articles2 classification dataset for Hindi.
• ACTSA Sentiment Analysis corpus [13] for Telugu.
• iNLTK Headlines dataset3 [15] for Gujarati, Malayalam, Marathi, Tamil and Telugu.
• Soham Bengali News Classification dataset4.
• IndicNLP News Category dataset5 [7] for 9 languages, covering the categories entertainment, sports, business, lifestyle, technology, politics and crime.

We also compare the performance of our embeddings with the publicly available embeddings trained on the Wikipedia corpus (WK) and the Wikipedia + CommonCrawl corpus (WK+CC)6.

2 https://github.com/NirantK/hindi2vec/
3 https://github.com/goru001/inltk
4 https://www.kaggle.com/csoham/classification-bengali-news-articles-indicnlp
5 https://github.com/AI4Bharat/indicnlp_corpus
6 https://fasttext.cc/docs/en/crawl-vectors.html

TABLE II
EVALUATION ON WORD-SIMILARITY [10] AND WORD-ANALOGY TASK [11]

Word Similarity (Pearson Correlation)
Lang     WK     WK+CC  W-CBOW  W-SG   F-CBOW  F-SG   GloVe-10  GloVe-15
gu       0.507  0.521  0.496   0.569  0.417   0.614  0.425     0.396
hi       0.575  0.551  0.591   0.602  0.501   0.626  0.527     0.507
mr       0.497  0.544  0.432   0.495  0.269   0.495  0.380     0.388
pa       0.467  0.384  0.346   0.362  0.314   0.428  0.314     0.304
ta       0.439  0.438  0.327   0.344  0.406   0.406  0.266     0.304
te       0.559  0.543  0.456   0.510  0.359   0.560  0.366     0.391
Average  0.507  0.496  0.441   0.480  0.378   0.521  0.380     0.382

Word Analogy (Accuracy Percentage)
Lang     WK     WK+CC  W-CBOW  W-SG   F-CBOW  F-SG   GloVe-10  GloVe-15
hi       19.76  32.93  28.16   32.66  22.65   33.48  29.70     30.41

TABLE III
C OMPARISON OF TEXT CLASSIFICATION ACCURACY OF THE WORD EMBEDDINGS ON DIFFERENT DATASETS

Lang Dataset WK WK+CC W-CBOW W-SG F-CBOW F-SG GloVe-10 GloVe-15


hi BBC Articles 72.29 67.44 69.39 73.44 65.24 73.09 70.50 69.05
IITP+ Movie [12] 41.61 44.52 46.77 48.71 39.35 47.42 44.83 42.90
IITP Product [12] 58.32 57.17 59.65 62.14 53.72 61.95 58.31 59.84
bn Soham Articles 62.79 64.78 68.53 71.93 68.53 72.50 70.30 69.80
gu 81.94 84.07 85.43 90.14 80.73 90.59 85.43 84.06
ml iNLTK 86.35 83.65 91.11 92.53 83.17 94.44 86.82 88.09
mr Headlines [15] 83.06 81.65 82.48 89.09 80.09 90.41 81.98 81.40
ta 90.88 89.09 94.02 94.32 91.33 93.72 90.43 91.48
te 46.97 44.49 48.06 53.42 42.51 56.37 47.31 47.13
te ACTSA [13] 46.03 42.51 47.32 49.72 46.95 48.79 45.47 47.13
Average 67.02 65.94 69.28 72.54 65.16 72.93 68.14 68.09

TABLE IV
ACCURACY ON INDICNLP NEWS CATEGORY DATASET [7]

Lang     WK     WK+CC  W-CBOW  W-SG   F-CBOW  F-SG   GloVe-10  GloVe-15
bn       97.00  97.07  97.28   97.28  96.57   97.28  96.93     97.14
gu       97.05  97.54  97.05   98.50  96.57   98.03  96.57     97.05
mr       96.44  97.07  98.33   99.16  96.23   99.37  97.91     98.17
or       94.00  95.93  96.77   98.00  95.97   97.73  96.90     96.90
pa       94.23  94.87  95.19   96.47  92.95   96.47  94.23     94.55
kn       96.13  96.50  96.83   97.13  96.40   97.10  96.53     96.70
ml       90.00  89.33  91.67   92.67  87.16   92.17  91.17     91.17
ta       95.98  95.81  96.32   97.18  95.21   96.58  97.17     97.09
te       98.46  98.17  98.70   99.04  97.67   98.96  98.67     98.58
Average  95.47  95.81  96.46   97.27  94.97   97.08  96.23     96.37

Table II shows the word similarity and word analogy evaluation results. For the similarity evaluation, the skip-gram models of both Word2Vec and FastText give a higher correlation for Gujarati, Hindi and Telugu, whereas the Wikipedia embeddings give a better correlation for Punjabi and Tamil. The Wikipedia + CommonCrawl embeddings are better for Marathi. Overall, the FastText skip-gram (F-SG) embeddings provide the highest average correlation on the word-similarity task. On the other hand, the Word2Vec CBOW (W-CBOW) models give a higher correlation than the FastText CBOW and GloVe models. Among the GloVe models, GloVe-15 performs better than GloVe-10. For the word-analogy evaluation task as well, both skip-gram models (W-SG and F-SG) and the WK+CC model outperform the other embeddings.
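A minimal sketch of how such an evaluation can be computed is given below. This is a generic illustration rather than the exact evaluation script used here; the file names and the assumed tab-separated word1/word2/score format of the similarity file are assumptions.

```python
# Sketch: word-similarity evaluation (Pearson correlation) and a 3CosAdd analogy query.
# "similarity-hi.txt" (word1<TAB>word2<TAB>human_score) and the embedding file
# name are placeholders for the actual evaluation resources.
from gensim.models import KeyedVectors
from scipy.stats import pearsonr

wv = KeyedVectors.load_word2vec_format("hi-w2v-sg-300.vec", binary=False)

model_scores, human_scores = [], []
with open("similarity-hi.txt", encoding="utf-8") as f:
    for line in f:
        w1, w2, score = line.strip().split("\t")
        if w1 in wv and w2 in wv:          # skip out-of-vocabulary pairs
            model_scores.append(wv.similarity(w1, w2))
            human_scores.append(float(score))

corr, _ = pearsonr(model_scores, human_scores)
print(f"Pearson correlation: {corr:.3f}")

def analogy(wv, a, b, c):
    """3CosAdd word analogy: return the word x such that a : b :: c : x.
    An analogy question counts as correct if x matches the gold answer."""
    return wv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
```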
The embeddings are also evaluated on various text classification tasks, i.e. news article topic [7], headlines topic [15] and sentiment [12], [13] classification. We used a k-Nearest Neighbour classifier with k = 4 [7] for the text classification tasks. The input text embedding is taken as the mean of all the word embeddings, following the bag-of-words assumption; a sketch of this setup is given below. The classification performance depends on how well the embedding space represents the text semantics [14].
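The following sketch illustrates this classification setup under simplifying assumptions: texts are whitespace-tokenised, out-of-vocabulary words are skipped, and the dataset loading is omitted (train_texts, train_labels, test_texts, test_labels and wv are illustrative names, not artefacts released with the paper).

```python
# Sketch: bag-of-words text classification with mean-pooled word vectors
# and a k-Nearest Neighbour classifier with k = 4, as described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def embed(text, wv):
    """Mean of the word vectors of all in-vocabulary tokens (zeros if none)."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# train_texts, train_labels, test_texts, test_labels come from one of the
# classification datasets listed above; wv is a loaded KeyedVectors model.
X_train = np.vstack([embed(t, wv) for t in train_texts])
X_test = np.vstack([embed(t, wv) for t in test_texts])

clf = KNeighborsClassifier(n_neighbors=4)
clf.fit(X_train, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```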
Table III illustrates the text classification accuracy of all the embeddings on the corresponding language datasets. For Hindi and Tamil, the Word2Vec skip-gram (W-SG) embeddings give the highest accuracy, while FastText skip-gram (F-SG) performs best for all the other languages. We observe that FastText skip-gram, on average, outperforms the other embeddings for the classification task on these datasets. The GloVe-10 model's classification accuracy is marginally higher than that of GloVe-15. In contrast, Word2Vec CBOW gives higher accuracy than the GloVe and FastText CBOW embeddings. Considering the languages other than Hindi, we get an average accuracy of 77.31 for the Word2Vec skip-gram (W-SG) embeddings, which is lower than the 78.11 achieved by the FastText skip-gram embeddings.

Table IV presents the accuracy of the embeddings on the IndicNLP News Category dataset. Word2Vec skip-gram gives the highest average accuracy among all the embeddings. However, the FastText skip-gram embeddings achieve almost the same accuracy for all the languages. In this case, GloVe-15 gives accuracy similar to that of the GloVe-10 model.

The embeddings trained on the AI4Bharat-IndicNLP corpora (except F-CBOW) give higher accuracy than the previously available WK and WK+CC embeddings, as seen in Table III and Table IV.

V. CONCLUSION

We contribute 60 word-level embeddings (Word2Vec CBOW and SG, FastText CBOW and SG, and two variants of GloVe, for 10 Indian languages) trained on the AI4Bharat-IndicNLP corpora and analyse their performance on several NLP tasks. We provide a comparative analysis of the accuracies among our embeddings and against the embeddings trained on the Wikipedia and Wikipedia + CommonCrawl corpora, to help researchers choose the embeddings best suited to their language and task. We conclude that Word2Vec skip-gram and FastText skip-gram embeddings are the best performing embedding algorithms for NLP tasks on Indic languages. Our work is limited to word-level embeddings and does not exploit sentence context. In future work, we plan to train contextual word embeddings such as BERT [16] and ELMo [17] for Indic languages and compare them on various NLP tasks using the available benchmark datasets.

REFERENCES
[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[2] M. Artetxe, G. Labaka, and E. Agirre, "An effective approach to unsupervised machine translation," arXiv preprint arXiv:1902.01313, 2019.
[3] A. Bordes, S. Chopra, and J. Weston, "Question answering with subgraph embeddings," arXiv preprint arXiv:1406.3676, 2014.
[4] F. Diaz, B. Mitra, and N. Craswell, "Query expansion with locally-trained word embeddings," arXiv preprint arXiv:1605.07891, 2016.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[6] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[7] A. Kunchukuttan, D. Kakwani, S. Golla, A. Bhattacharyya, M. M. Khapra, and P. Kumar, "AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages," arXiv preprint arXiv:2005.00085, 2020.
[8] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[9] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.
[10] S. S. Akhtar, A. Gupta, A. Vajpayee, A. Srivastava, and M. Shrivastava, "Word similarity datasets for Indian languages: Annotation and baseline systems," in Proceedings of the 11th Linguistic Annotation Workshop, 2017, pp. 91–94.
[11] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," arXiv preprint arXiv:1802.06893, 2018.
[12] M. S. Akhtar, A. Kumar, A. Ekbal, and P. Bhattacharyya, "A hybrid deep learning architecture for sentiment analysis," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 482–493.
[13] S. S. Mukku and R. Mamidi, "ACTSA: Annotated corpus for Telugu sentiment analysis," in Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, 2017, pp. 54–58.
[14] Y. Meng, J. Huang, G. Wang, C. Zhang, H. Zhuang, L. Kaplan, and J. Han, "Spherical text embedding," in Advances in Neural Information Processing Systems, 2019, pp. 8208–8217.
[15] G. Arora, "iNLTK: Natural language toolkit for Indic languages," arXiv preprint arXiv:2009.12534, 2020.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018.

