Normalization of Indonesian-English Code-Mixed Twitter Data
Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 417–424
Hong Kong, Nov 4, 2019.
© 2019 Association for Computational Linguistics
4. Translation: We merge the sequence of normalized tokens back into a complete tweet and translate the tweet into Indonesian, with the exception of named entities, which are kept in the original language (e.g. the term "The Lion King" is not translated into "Raja Singa").

To our knowledge, this is the first attempt to normalize Indonesian-English code-mixed language. For our experiment, we build a data set consisting of 825 tweets.

2 Related Work

Text normalization has been studied for Twitter data using a variety of supervised and unsupervised methods. Liu et al. (2011) modelled lexical normalization as a sequence labelling problem, generating letter-level alignments from standard words to their nonstandard variants using a Conditional Random Field (CRF).

Beckley (2015) performed the English lexical normalization task in three steps: first, compiling a substitution list from nonstandard to standard words; then, building rule-based components for the -ing and duplication patterns often found in the data set; and last, applying a sentence-level re-ranker based on the bigram Viterbi algorithm to select the best candidate among all candidates generated by the first two steps.

Sridhar (2015) proposed an unsupervised approach to lexical normalization, training a word embedding model on large English corpora and using it to create a mapping from non-standard to standard words. Hanafiah et al. (2017) approached lexical normalization for Indonesian with a rule-based method supported by a dictionary and a list of slang words; if a token is OOV, they create a candidate list from the dictionary based on its consonant skeleton.

For code-mixed data, Singh et al. (2018) created clusters of words using an embedding model (pre-trained on large corpora) for semantic features and Levenshtein distance for lexical features. One word is then picked as the parent candidate of each cluster, and the other words in the cluster are normalized into it.

Mave et al. (2018) worked on a language identification task for code-mixed data. Their experiments showed that a Conditional Random Field (CRF) model outperformed a Bidirectional LSTM.

Dhar et al. (2018) augmented existing machine translation with the Matrix Language-Frame Model proposed by Myers-Scotton (1997) to increase the performance of machine translation applied to code-mixed data.

Bhat et al. (2018) presented universal dependency parsing of a Hindi-English dataset using a pipeline comprising several modules: language identification, back-transliteration, normalization with an encoder-decoder framework, and dependency parsing with neural stacking. They found that normalization improves the performance of POS tagging and parsing models.

3 Data Set

We utilize three kinds of Twitter corpora in this study: English, Indonesian, and Indonesian-English code-mixed. We obtain a collection of 1.6M English tweets from Sentiment140 (Go et al., 2009) and 900K Indonesian tweets from Adikara (2015). We collect 49K Indonesian-English code-mixed tweets by scraping them through the Twitter API. To harvest those tweets, we first take 100 English and 100 Indonesian stopwords from Wiktionary². To fetch code-mixed tweets, we use each stopword as a query term and set the language filter to the opposite language (e.g. Indonesian tweets are collected using an English stopword as the query, and vice versa).

We randomly select 825 tweets from the code-mixed corpus for the experiment. The data is labeled by two annotators, and the gold standard is constructed for four individual tasks. The inter-annotator agreement is 97.92% for language identification and 99.23% for normalization. Our data has a code-mixing index (Das and Gambäck, 2014) of 33.37, which means that the level of mixing between the languages in the data is quite high (see Table 1 for details).

Types                      Count
Number of tweets             825
Number of tokens          22,736
Number of 'id' tokens     11,204
Number of 'en' tokens      5,613
Number of 'rest' tokens    5,919

Table 1: Data Set Detail

² https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
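The harvesting strategy above can be sketched as a query builder. This is a minimal illustration: the parameter names mirror the Twitter search API's `q` and `lang` parameters, but the function name and data shapes are our assumptions, not the authors' scraper.

```python
def codemix_queries(en_stopwords, id_stopwords):
    """Pair each stopword with the language filter of the *other* language.

    A tweet tagged as Indonesian that contains an English stopword (or vice
    versa) is likely to mix both languages.
    """
    queries = []
    for word in en_stopwords:
        queries.append({"q": word, "lang": "id"})  # English word, Indonesian tweet
    for word in id_stopwords:
        queries.append({"q": word, "lang": "en"})  # Indonesian word, English tweet
    return queries
```

Each resulting query dict would then be issued against the search endpoint; only the query-construction step is shown here.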
Figure 1: An Input-Output Example of Pipeline Model
Types of nonstandard tokens (with language and example / explanation):

- Non-standard spelling/typo ('en', 'id'): "lvie" or "youuuu"
- Informal abbreviations ('en', 'id'): "lht" for "lihat" (in English: "see"), "ppl" for "people"
- Slang words ('en', 'id'): "epic", or "gue" for "saya" (in English: "I")
- Spelled phonetically ('en', 'id'): "plis" for "please", "kalo" for "kalau" (in English: "if")
- Reduplication with "2" ('id'): In Indonesian, the plural form of a noun is written with a hyphen, e.g. "orang-orang" (in English: "people"). However, informal writing style often uses the number "2" to indicate this (e.g. "orang2").
- Hyphenless reduplication ('id'): On the other hand, the hyphen is sometimes not written at all (e.g. "orang orang").
- Contracted words ('en'): "im" for "i am"
- Combining English word with Indonesian prefix (nge-) ('en'): Indonesians tend to use the informal prefix "nge-" before a word to stress that it is a verb; for instance, "vote" is written as "ngevote".
- Combining English word with Indonesian suffix (-nya) ('en'): The suffix "-nya" in Indonesian marks the possessive pronoun (e.g. "miliknya", similar to "hers" or "his") and can also act as the definite article (in English: "the"). Informally, the suffix is attached to English words in Indonesian conversation, e.g. "jobnya" (in English: "the job").
3. For the "reduplication without hyphen" problem, we check whether the token consists of multiple words, then replace the space delimiter with "-" if it is a reduplication case.

4. For the "contracted words" problem, we normalize them by utilizing the list provided by Kooten³.

5. For the problem of "combining English word with Indonesian prefix (nge-)", we remove the prefix (nge-) from the word.

6. For the problem of "combining English word with Indonesian suffix (-nya)", we remove the suffix (-nya) and add the word "the" before the word.

There are two subtasks in the lexical normalization module: first, create a mapping from OOV tokens to their standard forms; then, build the system that combines the rule-based method with distributional semantics.

4.3.1 Build OOV and normal word mapping

Word embeddings can be used to cluster words because they model word relatedness, both syntactically and semantically. We use an embedding model to construct a mapping between each OOV token and its normal form. The word embedding model is trained using the skip-gram architecture (Mikolov et al., 2013). The procedure is as follows:

1. Collect Indonesian and English vocabulary from Kateglo⁴ and Project Gutenberg⁵.

2. For each normal_word in the vocabulary, get the 100 most similar words from the social media corpus using the embedding model.

3. For each word wi among the most similar words to normal_word:
   • If wi exists in the dictionary, wi is already normalized.
   • Otherwise, wi is OOV; add a mapping instance from the OOV word wi to normal_word.

³ https://github.com/kootenpv/contractions
⁴ http://kateglo.com/
⁵ http://www.gutenberg.org/ebooks/3201
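The mapping construction above can be sketched as follows. This is a minimal illustration, assuming a trained skip-gram model is exposed through a `most_similar(word, topn)` callable (as in gensim's `Word2Vec`); the function and variable names are ours, not the authors'.

```python
def build_oov_mapping(vocabulary, dictionary, most_similar, topn=100):
    """Map OOV words found near each normal word to that normal word.

    vocabulary   -- iterable of normal words (e.g. from Kateglo / Gutenberg)
    dictionary   -- set of all known standard words
    most_similar -- callable(word, topn) returning similar words drawn from
                    the social-media embedding model
    """
    mapping = {}
    for normal_word in vocabulary:
        for wi in most_similar(normal_word, topn):
            if wi in dictionary:
                continue  # wi is already a standard word
            # wi is OOV: record a candidate mapping wi -> normal_word
            mapping.setdefault(wi, set()).add(normal_word)
    return mapping
```

When an OOV word ends up mapped to several normal words, a later step resolves the conflict by lexical similarity (step 4 in the text).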
420
4. If an OOV word wj is mapped to several normal_word entries, choose the one with the highest lexical similarity to wj. We use the lexical similarity function of Hassan and Menezes (2013). The function is based on the Longest Common Subsequence Ratio (LCSR), the ratio between the length of the Longest Common Subsequence (LCS) and the length of the longer token (Melamed, 1999). The lexical similarity function is defined as:

    lex_sim(s1, s2) = LCSR(s1, s2) / ED(s1, s2)            (1)

    LCSR(s1, s2) = LCS(s1, s2) / MaxLength(s1, s2)         (2)

where ED(s1, s2) denotes the edit distance between the two tokens.

We apply a static mapping instead of finding the replacement online because of execution time and memory performance: looking up the mapping is much faster than computing it from the word embedding model, and the memory needed to store the mapping is much smaller than the embedding model itself.

4.3.2 Combine rules with mapping list

The normalization system employs a combination of hand-crafted rules and the word mapping as follows.

1. Skip this procedure when the input is not a word (e.g. hashtag, mention, link, emoticon).

2. If the token is a multiword (more than one single word separated by white space), split the token into a list of single words, try to normalize each word independently by applying the rules, and merge them back into one token; for instance, put the hyphen back for the reduplication case.

3. If the input is a word, transform it into lowercase. Reduce character repetition to at most two and consider every possible transformation; for instance, the word "Pleasssseee" is transformed into "pleassee", "pleasee", "pleasse", and "please". Check whether one of the generated transformations is a normal form. If not, apply the rule-based strategy. Last, use the mapping list created with the word embeddings.

4.4 Translation

This module merges the list of tokens processed in the previous modules back into one tweet and translates the tweet into grammatical Indonesian. The module needs a Machine Translation (MT) system specific to the Indonesian-English code-mixed text domain; however, no such MT system is available at the moment. Dhar et al. (2018) faced a similar problem and tackled it by adding the Matrix Language-Frame (MLF) model on top of an existing state-of-the-art MT system.

In the MLF model proposed by Myers-Scotton (1997), a code-mixed sentence can be split into the dominant language (the matrix language) and the embedded language. The matrix language grammar sets the morphosyntactic structure of the code-mixed sentence, while words are borrowed from the embedded language's vocabulary.

Thus, in this module, we first merge the list of tokens into one tweet. Then, we separate the tweet into sentences using sentence delimiters such as the period (.), comma (,), question mark (?), exclamation mark (!), etc. For each sentence, we decide its language (English or Indonesian) by counting how many words are English and how many are Indonesian: the more frequent language is the dominant (matrix) language and thus the language of the sentence, and the less frequent one is the embedded language. After deciding the language of the sentence, we translate all its words into that language. Last, if the language of the sentence is English, we translate the whole sentence into Indonesian. We use Microsoft Machine Translation⁶ as the MT system. Figure 2 shows an example of the input and output of the translation module.

5 Experiments and Evaluation

We evaluate the performance of our implementation for each module and for the pipeline model as a whole. The first two modules (tokenization and language identification) are evaluated in a supervised way using a 4-fold cross-validation setting.

Tokenization is evaluated at both the character level and the token level. Character tagging achieves an F1-score of 98.70, while token identification obtains an F1-score of 95.15.

⁶ https://www.microsoft.com/en-us/translator/business/machine-translation/
Figure 2: An Input-Output Example of Translation Module
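The sentence-language decision used by the translation module (count English vs. Indonesian words and pick the majority as the matrix language) can be sketched as below. The function name, the vocabulary sets, and breaking ties toward Indonesian are our assumptions, not the authors' implementation.

```python
def matrix_language(tokens, en_vocab, id_vocab):
    """Pick the dominant (matrix) language of one sentence by majority count.

    tokens   -- word tokens of the sentence
    en_vocab -- set of known English words
    id_vocab -- set of known Indonesian words
    """
    en = sum(1 for t in tokens if t in en_vocab)
    idn = sum(1 for t in tokens if t in id_vocab)
    # The more frequent language is the matrix language; the other is embedded.
    return "en" if en > idn else "id"
```

All words in the sentence are then translated into the matrix language before the whole sentence is passed to the MT system.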
have been translated (with the MLF Model) into Indonesian with the final tweets, and 4) comparing raw tweets which have been normalized and translated (with the MLF Model) into Indonesian with the final tweets. From the experiments, our final pipeline obtains a BLEU score of 54.07 and a WER of 31.89. From Table 6, we can see that each module contributes positively to the performance of the pipeline. The pipeline model improves the BLEU score by 8 points and the WER by 14 points compared to the baseline (raw tweets).

Pipeline                                                       BLEU    WER
Raw Tweets                                                    46.07  45.75
Raw Tweets + Translation (1 + 4 Modules)                      51.02  34.37
Raw Tweets + Translation with MLF Model (1 + 2 + 4 Modules)   51.75  34.39
Raw Tweets + Normalization + Translation with MLF Model
(Full pipeline)                                               54.07  31.89

Table 6: Pipeline Experiment Result (BLEU: higher is better, WER: lower is better)

approach for each language improves the performance significantly. At the translation module, some errors are caused by the MT system, but we could not do much about them since we use an existing MT system.

Moving forward, we would like to collect more data and enhance our techniques to improve the performance of the model. Currently, the lexical normalization module depends heavily on handcrafted rules. While the rule-based approach can increase performance, it is not a robust solution given how language evolves.

Acknowledgments

The authors acknowledge the support of Universitas Indonesia through Hibah PITTA B 2019 Pengolahan Teks dan Musik pada Sistem Community Question Answering dan Temporal Information Retrieval.

References

Putra Pandu Adikara. 2015. Normalisasi kata pada pesan/status singkat berbahasa indonesia. Master's thesis, Universitas Indonesia, Depok.
Hany Hassan and Arul Menezes. 2013. Social text normalization using contextual graph random walks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1577–1586.

Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, deletion, or substitution?: Normalizing text messages without pre-categorization nor supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 71–76. Association for Computational Linguistics.

Deepthi Mave, Suraj Maharjan, and Thamar Solorio. 2018. Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61.