
Code Mixing: A Challenge for Language Identification in the Language of
Social Media
Utsab Barman, Amitava Das†, Joachim Wagner and Jennifer Foster
CNGL Centre for Global Intelligent Content, National Centre for Language Technology
School of Computing, Dublin City University, Dublin, Ireland
† Department of Computer Science and Engineering
University of North Texas, Denton, Texas, USA
{ubarman,jwagner,jfoster}@computing.dcu.ie
amitava.das@unt.edu

Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

1 Introduction

Automatic processing and understanding of Social Media Content (SMC) is currently attracting much attention from the Natural Language Processing research community. Although English is still by far the most popular language in SMC, its dominance is receding. Hong et al. (2011), for example, applied an automatic language detection algorithm to over 62 million tweets to identify the top 10 most popular languages on Twitter. They found that only half of the tweets were in English. Moreover, mixing multiple languages together (code mixing) is a popular trend among social media users from language-dense areas (Cárdenas-Claros and Isharyanti, 2009; Shafie and Nayan, 2013). In a scenario where speakers switch between languages within a conversation, sentence or even word, the task of automatic language identification becomes increasingly important to facilitate further processing.

Speakers whose first language uses a non-Roman alphabet write using the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code mixing with a Roman-alphabet language. This can be especially observed in South-East Asia and in the Indian sub-continent. The following is a code mixing comment taken from a Facebook group of Indian university students:

Original: Yaar tu to, GOD hain. tui JU te ki korchis? Hail u man!
Translation: Buddy you are GOD. What are you doing in JU? Hail u man!

This comment is written in three languages: English, Hindi (italics), and Bengali (boldface). For Bengali and Hindi, phonetic typing has been used.

We follow in the footsteps of recent work on language identification for SMC (Hughes et al., 2006; Baldwin and Lui, 2010; Bergsma et al., 2012), focusing specifically on the problem of word-level language identification for code-mixed SMC. Our corpus for this task is collected from Facebook and contains instances of Bengali (BN)-English (EN)-Hindi (HI) code mixing.

The paper is organized as follows: in Section 2, we review related research in the area of code mixing and language identification; in Section 3, we describe our code mixing corpus, the data itself and the annotation process; in Section 4, we list the tools and resources which we use in our language identification experiments, described in Section 5. Finally, in Section 6, we conclude and provide suggestions for future research on this topic.
2 Background and Related Work

The problem of language identification has been investigated for half a century (Gold, 1967) and that of computational analysis of code switching for several decades (Joshi, 1982), but there has been less work on automatic language identification for multilingual code-mixed texts. Before turning to that topic, we first briefly survey studies on the general characteristics of code mixing.

Code mixing is a normal, natural product of bilingual and multilingual language use. Significant studies of the phenomenon can be found in the linguistics literature (Milroy and Muysken, 1995; Alex, 2008; Auer, 2013). These works mainly discuss the sociological and conversational necessities behind code mixing as well as its linguistic nature. Scholars distinguish between inter-sentence, intra-sentence and intra-word code mixing.

Several researchers have investigated the reasons for and the types of code mixing. Initial studies on Chinese-English code mixing in Hong Kong (Li, 2000) and Macao (San, 2009) indicated that mainly linguistic motivations were triggering the code mixing in those highly bilingual societies. Hidayat (2012) showed that Facebook users tend to mainly use inter-sentential switching over intra-sentential, and reported that 45% of the switching was instigated by real lexical needs, 40% was used for talking about a particular topic, and 5% for content clarification. The predominance of inter-sentential code mixing in social media text was also noted in the study by San (2009), which compared the mixing in blog posts to that in the spoken language in Macao. Dewaele (2010) claims that 'strong emotional arousal' increases the frequency of code mixing. Dey and Fung (2014) present a speech corpus of English-Hindi code mixing in student interviews and analyse the motivations for code mixing and the grammatical contexts in which code mixing occurs.

Turning to the work on automatic analysis of code mixing, there have been some studies on detecting code mixing in speech (Solorio and Liu, 2008a; Weiner et al., 2012). Solorio and Liu (2008b) try to predict the points inside a set of spoken Spanish-English sentences where the speakers switch between the two languages. Other studies have looked at code mixing in different types of short texts, such as information retrieval queries (Gottron and Lipka, 2010) and SMS messages (Farrugia, 2004; Rosner and Farrugia, 2007). Yamaguchi and Tanaka-Ishii (2012) perform language identification using artificial multilingual data, created by randomly sampling text segments from monolingual documents. King and Abney (2013) used weakly semi-supervised methods to perform word-level language identification on a dataset of 30 languages. They explore several language identification approaches, including a Naive Bayes classifier for individual word-level classification and sequence labelling with Conditional Random Fields trained with Generalized Expectation criteria (Mann and McCallum, 2008; Mann and McCallum, 2010), which achieved the highest scores. Another very recent work on this topic is that of Nguyen and Doğruöz (2013), who report on language identification experiments performed on Turkish and Dutch forum data. Experiments were carried out using language models, dictionaries, logistic regression classification and Conditional Random Fields. They find that language models are more robust than dictionaries and that contextual information is helpful for the task.

3 Corpus Acquisition

Taking into account the claim that code mixing is frequent among speakers who are multilingual and younger in age (Cárdenas-Claros and Isharyanti, 2009), we choose an Indian student community in the 20-30 year age group as our data source. India is a country with 30 spoken languages, among which 22 are official. Code mixing is very frequent in the Indian sub-continent because languages change within very short geo-distances and people generally have a basic knowledge of their neighboring languages.

A Facebook group [1] and 11 Facebook users (known to the authors) were selected to obtain publicly available posts and comments. The Facebook Graph API explorer was used for data collection. Since these Facebook users are from West Bengal, the most dominant language is Bengali (their native language), followed by English and then Hindi (the national language of India). The posts and comments in Bengali and Hindi script were discarded during data collection, resulting in 2,335 posts and 9,813 comments.

[1] https://www.facebook.com/jumatrimonial
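To make the collection step concrete, the sketch below shows how a public group feed of this kind could be fetched programmatically through the Graph API. It is an illustration only, not the procedure used for this paper (which relied on the interactive Graph API explorer): the group ID, access token and requested fields are placeholder assumptions.

    # Hypothetical sketch: fetching public posts and comments from a
    # Facebook group via the Graph API. GROUP_ID and ACCESS_TOKEN are
    # placeholders; the paper's data was collected with the interactive
    # Graph API explorer instead.
    import requests

    GRAPH_URL = "https://graph.facebook.com"
    GROUP_ID = "<group-id>"          # placeholder
    ACCESS_TOKEN = "<access-token>"  # placeholder

    def fetch_texts(group_id, token):
        """Yield post and comment texts, following pagination links."""
        url = "%s/%s/feed" % (GRAPH_URL, group_id)
        params = {"access_token": token,
                  "fields": "message,comments{message}"}
        while url:
            page = requests.get(url, params=params).json()
            for post in page.get("data", []):
                if post.get("message"):
                    yield post["message"]
                for c in post.get("comments", {}).get("data", []):
                    yield c["message"]
            # the 'next' URL already carries all query parameters
            url = page.get("paging", {}).get("next")
            params = {}

    for text in fetch_texts(GROUP_ID, ACCESS_TOKEN):
        print(text)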
3.1 Annotation

Four annotators took part in the annotation task. Three were computer science students and the other was one of the authors. The annotators are proficient in all three languages of our corpus. A simple annotation tool was developed which enabled these annotators to identify and distinguish the different languages present in the content by tagging them. Annotators were supplied with 4 basic tags (viz. sentence, fragment, inclusion and wlcm (word-level code mixing)) to annotate different levels of code mixing. Under each tag, six attributes were provided, viz. English (en), Bengali (bn), Hindi (hi), Mixed (mixd), Universal (univ) and Undefined (undef). The attribute univ is associated with symbols, numbers, emoticons and universal expressions (e.g. hahaha, lol). The attribute undef is specified for a sentence or a word for which no language tag can be attributed and which cannot be categorized as univ. In addition, annotators were instructed to annotate named entities separately. What follows are descriptions of each of the annotation tags.

Sentence (sent): This tag refers to a sentence and can be used to mark inter-sentential code mixing. Annotators were instructed to identify a sentence with its base language (e.g. en, bn, hi and mixd) or with other types (e.g. univ, undef) as the first task of annotation. Only the attribute mixd is used to refer to a sentence which contains multiple languages in the same proportion. A sentence may contain any number of inclusions, fragments and instances of word-level code mixing. A sentence can be attributed as univ if and only if it contains symbols, numbers, emoticons, chat acronyms and no other words (Hindi, English or Bengali). A sentence can be attributed as undef if it is not a sentence marked as univ and has words/tokens that cannot be categorized as Hindi, English or Bengali. Some examples of sentence-level annotations are the following:

1. English-Sentence:
[sent-lang="en"] what a.....6 hrs long...but really nice tennis.... [/sent]
2. Bengali-Sentence:
[sent-lang="bn"] shubho nabo borsho.. :) [/sent]
3. Hindi-Sentence:
[sent-lang="hi"] karwa sachh ..... :( [/sent]
4. Mixed-Sentence:
[sent-lang="mixd"] [frag-lang="hi"] oye hoye ..... angreji me kahte hai ke [/frag] [frag-lang="en"] I love u.. !!! [/frag] [/sent]
5. Univ-Sentence:
[sent-lang="univ"] hahahahahahah....!!!!! [/sent]
6. Undef-Sentence:
[sent-lang="undef"] Hablando de una triple amenaza. [/sent]

Fragment (frag): This refers to a group of foreign words, grammatically related, in a sentence. The presence of this tag in a sentence conveys that intra-sentential code mixing has occurred within the sentence boundary. Identification of fragments (if present) in a sentence was the second task of annotation. A sentence (sent) with attribute mixd must contain multiple fragments (frag), each with a specific language attribute. In the fourth example above, the sentence contains a Hindi fragment (oye hoye ..... angreji me kahte hai ke) and an English fragment (I love u.. !!!), hence it is considered a mixd sentence. A fragment can have any number of inclusions and instances of word-level code mixing. In the first example below, Jio is a popular Bengali word appearing in the English fragment Jio.. good joke, hence it is tagged as a Bengali inclusion. One could argue that the word Jio should be tagged as a separate Bengali inclusion outside the English fragment, but looking at the syntactic pattern and the sense expressed by the comment, the annotator kept it as a single unit. In the second example below, an instance of word-level code mixing, typer, has been found in an English fragment (the root English word type carries the Bengali suffix r).

1. Fragment with Inclusion:
[sent-lang="mixd"] [frag-lang="en"] [incl-lang="bn"] Jio.. [/incl] good joke [/frag] [frag-lang="bn"] "amar Babin" [/frag] [/sent]
2. Fragment with Word-Level code mixing:
[sent-lang="mixd"] [frag-lang="en"] " I will find u and marry you " [/frag] [frag-lang="bn"] [wlcm-type="en-and-bn-suffix"] typer [/wlcm] hoe glo to! :D [/frag] [/sent]
Inclusion (incl): An inclusion is a foreign word or phrase in a sentence or in a fragment which is assimilated into, or used very frequently in, the native language. Identification of inclusions can be performed after annotating a sentence and fragment (if present in that sentence). An inclusion within a sentence or fragment also denotes intra-sentential code mixing. In the example below, seriously is an English inclusion which is assimilated into today's colloquial Bengali and Hindi. The only tag that an inclusion may contain is word-level code mixing.

1. Sentence with Inclusion:
[sent-lang="bn"] Na re [incl-lang="en"] seriously [/incl] ami khub kharap achi. [/sent]

Word-Level code mixing (wlcm): This is the smallest unit of code mixing. This tag was introduced to capture intra-word code mixing and denotes cases where code mixing has occurred within a single word. Identifying word-level code mixing is the last task of annotation. Annotators were told to record the type of word-level code mixing as an attribute in (Base Language + Second Language) format. Some examples are provided below (a sketch of reading this bracketed format into word-level labels follows them). In the first example, the root word class is English and e is a Bengali suffix that has been added. In the third example, the opposite can be observed: the root word kando is Bengali, and an English suffix z has been added. In the second example, a named entity suman is present with a Bengali suffix er.

1. Word-Level code mixing (EN-BN):
[wlcm-type="en-and-bn-suffix"] classe [/wlcm]
2. Word-Level code mixing (NE-BN):
[wlcm-type="NE-and-bn-suffix"] sumaner [/wlcm]
3. Word-Level code mixing (BN-EN):
[wlcm-type="bn-and-en-suffix"] kandoz [/wlcm]
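To illustrate how the bracketed format above translates into the word-level labels used in our experiments, here is a minimal sketch of a reader for it. This is an assumption-laden illustration rather than the actual annotation tool: it assigns every token the language attribute of its innermost enclosing tag and glosses over the finer points of wlcm handling.

    # Minimal sketch of reading the bracketed annotation format into
    # (token, language) pairs. Each token inherits the attribute of the
    # innermost enclosing tag; wlcm tags carry a type= attribute rather
    # than lang=, so they simply inherit the surrounding language here.
    import re

    TAG = re.compile(r'\[(/?)(sent|frag|incl|wlcm)[^\]]*\]')
    LANG = re.compile(r'lang="([a-z]+)"')

    def word_labels(annotated):
        labels, stack, pos = [], [], 0
        for m in TAG.finditer(annotated):
            for tok in annotated[pos:m.start()].split():
                labels.append((tok, stack[-1] if stack else "undef"))
            if m.group(1):                      # closing tag
                stack.pop()
            else:                               # opening tag
                lang = LANG.search(m.group(0))
                stack.append(lang.group(1) if lang
                             else (stack[-1] if stack else "undef"))
            pos = m.end()
        return labels

    example = ('[sent-lang="bn"] Na re [incl-lang="en"] seriously '
               '[/incl] ami khub kharap achi. [/sent]')
    print(word_labels(example))
    # [('Na', 'bn'), ('re', 'bn'), ('seriously', 'en'), ('ami', 'bn'),
    #  ('khub', 'bn'), ('kharap', 'bn'), ('achi.', 'bn')]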

3.1.1 Inter Annotator Agreement

We calculate word-level inter-annotator agreement (Cohen's Kappa) between two annotators on a subset of 100 randomly selected comments. Two annotators are in agreement about a word if they both annotate the word with the same attribute (en, bn, hi, univ, undef), regardless of whether the word is inside an inclusion, fragment or sentence. Our observation that the word-level annotation process is not very ambiguous and that the annotation instructions are straightforward is confirmed by a high inter-annotator agreement (IAA), with a Kappa value of 0.884.
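For readers who want to reproduce the agreement figure on their own data, the computation amounts to the following sketch, using scikit-learn's implementation of Cohen's Kappa; the aligned label sequences shown are toy examples, not our data.

    # Sketch of the word-level agreement computation. The two annotators'
    # labels are assumed to be aligned token by token; the sequences
    # below are toy examples.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["bn", "bn", "en", "univ", "bn", "hi"]
    annotator_b = ["bn", "bn", "en", "univ", "bn", "bn"]

    print(cohen_kappa_score(annotator_a, annotator_b))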
3.2 Data Characteristics

Tag-level and word-level statistics of the annotated data, which reveal the characteristics of our data set, are given in Table 1 and Table 2 respectively. More than 56% of the sentences and almost 40% of the tokens are in Bengali, which is the dominant language of this corpus. English is the second most dominant language, covering almost 33% of the tokens and 35% of the sentences. The amount of Hindi data is substantially lower – nearly 1.75% of the tokens and 2% of the sentences. However, English inclusions (84% of all inclusions) are more prominent than Hindi or Bengali inclusions and there is a substantial number of English fragments (almost 52% of all fragments) in our corpus. This means that English is the main language involved in the code mixing.

Tags           En     Bn     Hi   Mixd  Univ   Undef
sent           5,370  8,523  354  204   746    15
frag           288    213    40   0     6      0
incl           7,377  262    94   0     1,032  1
wlcm           477 (total)
Named Entity   3,602 (total)
Acronym        691 (total)

Table 1: Tag-level statistics

Word-Level Tag   Count
EN               66,298
BN               79,899
HI               3,440
WLCM             633
NE               5,233
ACRO             715
UNIV             39,291
UNDEF            61

Table 2: Word-level statistics

3.2.1 Code Mixing Types

In our corpus, inter- and intra-sentential code mixing are more prominent than word-level code mixing, which is similar to the findings of Hidayat (2012). Our corpus contains every type of code mixing in English, Hindi and Bengali, viz. inter/intra-sentential and word-level, as described in the previous section. Some examples of the different types of code mixing in our corpus are presented below.
1. Inter-Sentential:
[sent-lang="hi"] Itna izzat diye aapne mujhe !!! [/sent] [sent-lang="en"] Tears of joy. :'( :'( [/sent]
2. Intra-Sentential:
[sent-lang="bn"] [incl-lang="en"] by d way [/incl] ei [frag-lang="en"] my craving arms shall forever remain empty .. never hold u close .. [/frag] line ta baddo [incl-lang="en"] cheezy [/incl] :P ;) [/sent]
3. Word-Level:
[sent-lang="bn"] [incl-lang="en"] 1st yr [/incl] eo to ei [wlcm-type="en+bnSuffix"] tymer [/wlcm] modhye sobar jute jay .. [/sent]

3.2.2 Ambiguous Words

Annotators were instructed to tag an English word as English irrespective of any influence of word borrowing or foreign inclusion, but an inspection of the annotations revealed that English words were sometimes annotated as Bengali or Hindi. To understand this phenomenon we processed the list of language (EN, BN and HI) word types (26,475 in total) and observed the percentage of types that were not always annotated with the same language throughout the corpus. The results are presented in Table 3. Almost 7% of the types are ambiguous (i.e. tagged with different languages during annotation). Among them, a substantial amount (5.58%) are English/Bengali.

Label(s)         Count   Percentage
EN               9,109   34.40
BN               14,345  54.18
HI               1,039   3.92
EN or BN         1,479   5.58
EN or HI         61      0.23
BN or HI         277     1.04
EN or BN or HI   165     0.62

Table 3: Statistics of ambiguous and monolingual word types

There are two reasons why this happens:

Same Words Across Languages: Some words are the same (e.g. baba, maa, na, khali) in Hindi and Bengali because both languages originated from a single language, Sanskrit, and share a good amount of common vocabulary. It also occurred in English-Hindi and English-Bengali as a result of word borrowing. Most of these are commonly used inclusions like clg, dept, question, cigarette, and topic. Sometimes the annotators were careful enough to tag such words as English and sometimes these words were tagged as the annotators' native languages. During cross-checking of the annotated data the same error patterns were observed for multiple annotators, i.e. tagging commonly used foreign words as the native language. This demonstrates that these English words are highly assimilated into the conversational vocabulary of Bengali and Hindi.

Phonetic Similarity of Spellings: Due to phonetic typing, some words share the same surface form across two and sometimes across three languages. As an example, to is a word in all three languages: it occurs 1,209 times as English, 715 times as Bengali and 55 times as Hindi in our data. The meanings of these words (e.g. to, bolo, die) are different in the different languages. This phenomenon is perhaps exacerbated by the trend towards short and noisy spelling in SMC.

4 Tools and Resources

We have used the following resources and tools in our experiments.

Dictionaries

1. British National Corpus (BNC): We compile a word frequency list from the BNC (Aston and Burnard, 1998).
2. SEMEVAL 2013 Twitter Corpus (SemevalTwitter): To cope with the language of social media we use the SEMEVAL 2013 (Nakov et al., 2013) training data for the Twitter sentiment analysis task. This data comes from a popular social media site and hence is likely to reflect the linguistic properties of SMC.
3. Lexical Normalization List (LexNormList): Spelling variation is a well-known phenomenon in SMC. We use a lexical normalization dictionary created by Han et al. (2012) to handle the different spelling variations in our data.

Machine Learning Toolkits

1. WEKA: We use the Weka toolkit (Hall et al., 2009) for our experiments in decision tree training.
2. MALLET: CRF learning is applied using the MALLET toolkit (McCallum, 2002).
3. Liblinear: We apply Support Vector Machine (SVM) learning with a linear kernel using the Liblinear package (Fan et al., 2008).

NLP Tools: For data tokenization we use the CMU Tweet-Tokenizer (Owoputi et al., 2013).

5 Experiments

Since our training data is entirely labelled at the word level by human annotators, we address the word-level language identification task in a fully supervised way.

Out of the total data, 15% is set aside as a blind test set, while the rest is employed in our experiments through a 5-fold cross-validation setup. There is a substantial amount of token overlap between the cross-validation data and the test set – 88% of the EN tokens, 86% of the Bengali tokens and 57% of the Hindi tokens of the test set are present in the cross-validation data. [2]

[2] We found 25 comments and 17 posts common between the cross-validation data and the test set. The reason for this is that users of social media often express themselves in a concise way. Almost all of these common items consisted of 1 to 3 tokens. In most cases these tokens were emoticons, symbols or universal expressions such as wow and lol. As the percentage of these comments is low, we keep them as they are.

We address the problem of word-level language identification in three different ways:

1. A simple heuristic-based approach which uses a combination of our dictionaries to classify the language of a word
2. Word-level classification using supervised machine learning with SVMs but no contextual information
3. Word-level classification using supervised machine learning with SVMs and sequence labelling using CRFs, both employing contextual information

Named entities and instances of word-level code mixing are excluded from evaluation. For systems which do not take the context of a word into account, i.e. the dictionary-based approach (Section 5.1) and the SVM approach without contextual clues (Section 5.2), named entities and instances of word-level code mixing can be safely excluded from training. For systems which do take context into account, the CRF system (Section 5.3.1) and the SVM system with contextual clues (Section 5.3.2), these are included in training, because excluding them would result in unrealistic contexts. This means that these systems can classify a word as a named entity or an instance of word-level code mixing. To avoid this, we implement a post-processor which backs off in these cases to a system which hasn't seen named entities or word-level code mixing in training (see Section 5.3).

5.1 Dictionary-Based Detection

We start with dictionary-based language detection. Generally, a dictionary-based language detector predicts the language of a word based on its frequency in multiple language dictionaries. In our data the Bengali and Hindi tokens are phonetically typed. As no such transliterated dictionary is, to our knowledge, available for Bengali and Hindi, we use the training set words as dictionaries. For words that have multiple annotations in the training data (ambiguous words), we select the majority tag based on frequency, e.g. the word to will always be tagged as English.

Our English dictionaries are those described in Section 4 (BNC, LexNormList, SemEvalTwitter) and the training set words. For LexNormList, we have no frequency information, and so we treat it as a simple word list. To predict the language of a word, dictionaries with normalized frequency are considered first (BNC, SemEvalTwitter, Training Data); if the word is not found, word list look-up is performed. The predicted language is chosen based on the dominant language(s) of the corpus if the word appears in multiple dictionaries with the same frequency or if the word does not appear in any dictionary or list.

A simple rule-based method is applied to predict universal expressions. A token is considered univ if any of the following conditions is satisfied (a sketch combining these rules with the dictionary look-up follows the list):

• All characters of the token are symbols or numbers.
• The token contains certain repetitions identified by regular expressions (e.g. hahaha).
• The token is a hash-tag, a URL or a mention-tag (e.g. @Sumit).
• The token appears in a word list (e.g. lol) compiled from the relevant 4/5 of the training data.
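The sketch below puts the pieces of this subsection together: the univ rules are applied first, then normalized-frequency look-up, then plain word-list membership, with ties and unknown words resolved in favour of the dominant language. It is a simplified illustration under stated assumptions – the dictionary contents, the repetition pattern and the dominance order are stand-ins, not the exact implementation.

    # Simplified sketch of the dictionary-based detector. freq_dicts maps
    # a language to a word-frequency dictionary (e.g. built from training
    # data); word_lists maps a language to a plain word list (e.g.
    # LexNormList for English). The dominance order is a stand-in.
    import re

    UNIV_RE = re.compile(r'^[\W\d_]+$')     # only symbols/numbers
    REPEAT_RE = re.compile(r'(\w\w+)\1')    # repetitions, e.g. "hahaha"
    UNIV_LIST = {"lol", "wow"}              # from 4/5 of the training data

    def predict(word, freq_dicts, word_lists, dominance=("bn", "en", "hi")):
        w = word.lower()
        if (UNIV_RE.match(w) or REPEAT_RE.search(w) or w in UNIV_LIST
                or word.startswith(("#", "@")) or w.startswith("http")):
            return "univ"
        # normalized frequency look-up first
        scores = {lang: freqs.get(w, 0) / float(sum(freqs.values()) or 1)
                  for lang, freqs in freq_dicts.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0 and list(scores.values()).count(scores[best]) == 1:
            return best
        # then plain word lists (no frequency information)
        for lang, words in word_lists.items():
            if w in words:
                return lang
        # ties and unknown words fall back to the dominant language
        return dominance[0]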
Table 4 shows the results of dictionary-based detection, averaged over the 5 cross-validation folds. We try different combinations and frequency thresholds of the above dictionaries. We find that using a normalized frequency is helpful and that a combination of the LexNormList and Training Data dictionaries suits our data best. Hence, we consider this our baseline language identification system.

Dictionary                            Accuracy (%)
BNC                                   80.09
SemevalTwitter                        77.61
LexNormList                           79.86
Training Data                         90.21
LexNormList+TrainingData (Baseline)   93.12

Table 4: Average cross-validation accuracy of dictionary-based detection

5.2 Word-Level Classification without Contextual Clues

The following feature types are employed:

1. Char-n-grams (G): We start with a character-n-gram-based approach (Cavnar and Trenkle, 1994), which is the most common approach and is followed by many language identification researchers. Following the work of King and Abney (2013), we select character n-grams (n = 1 to 5) and the word itself as the features in our experiments.
2. Presence in Dictionaries (D): We use presence in a dictionary as a feature, for all dictionaries available in the previous experiments.
3. Length of Words (L): Instead of using the raw length value as a feature, we follow our previous work (Rubino et al., 2013; Wagner et al., 2014) and create multiple features for length using a decision tree (J48). We use length as the only feature to train a decision tree for each fold and use the nodes obtained from the tree to create boolean features.
4. Capitalization (C): We use 3 boolean features to encode capitalization information: whether any letter in the word is capitalized, whether all letters in the word are capitalized and whether the first letter is capitalized.

We perform experiments with an SVM classifier (linear kernel) for different combinations of these features. [3] Parameter optimization (C in the range 2^-15 to 2^10) for the SVM is performed for each feature set, and the best cross-validation accuracy is found for the GDLC-based run (94.75%) at C=1 (see Table 5 and Fig. 1).

[3] According to Hsu et al. (2010), an SVM linear kernel with parameter C optimization is good enough when dealing with a large number of features. Though an RBF kernel can be more effective than a linear one, this is possible only after proper optimization of the C and γ parameters, which is computationally expensive for such a large feature set.

Features  Accuracy    Features  Accuracy
G         94.62       GD        94.67
GL        94.62       GDL       94.73
GC        94.64       GDC       94.72
GLC       94.64       GDLC      94.75

Table 5: Average cross-validation accuracy for SVM word-level classification (without context), G = char-n-gram, L = binary length features, D = presence in dictionaries and C = capitalization features

[Figure 1: Average cross-validation accuracy for SVM word-level classification (without context), G = char-n-gram, L = binary length features, D = presence in dictionaries and C = capitalization features: cube visualization]

We also investigate the use of a dictionary-to-char-n-gram back-off model – the idea is to apply the char-n-gram model SVM-GDLC to those words for which a majority-based decision is taken during dictionary-based detection. However, it does not outperform the SVM. Hence, we select SVM-GDLC for the next steps of our experiments as the best exemplar of an individual word-level classifier (without contextual clues).
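As a concrete, deliberately reduced illustration of this context-free classifier, the sketch below combines character 1-5-grams and the full word form and trains a linear SVM. scikit-learn's LinearSVC, which wraps the same Liblinear library used in our experiments, stands in here; the dictionary, length and capitalization features are omitted and the training words are invented toy data.

    # Reduced sketch of the SVM-G setup: character 1-5-grams plus the
    # word itself, classified with a linear SVM (LinearSVC wraps
    # Liblinear). The D, L and C features are omitted for brevity.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline, make_union
    from sklearn.svm import LinearSVC

    words = ["ami", "khub", "seriously", "movie", "korchis", "the"]
    labels = ["bn", "bn", "en", "en", "bn", "en"]

    features = make_union(
        CountVectorizer(analyzer="char", ngram_range=(1, 5)),  # char n-grams
        CountVectorizer(analyzer=lambda w: [w]),               # the word itself
    )
    clf = make_pipeline(features, LinearSVC(C=1.0))  # C=1 was best for GDLC
    clf.fit(words, labels)
    print(clf.predict(["kharap", "which"]))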
5.3 Language Identification with Contextual Clues

Contextual clues can play a very important role in word-level language identification. As an example, part of a comment from cross-validation fold 1 is presented below; it contains the word die, which is wrongly classified by the SVM classifier. The frequency of die in the training set of fold 1 is 6 for English, 31 for Bengali and 0 for Hindi.

Gold Data: ..../univ the/en movie/en for/en which/en i/en can/en die/en for/en ...../univ
SVM Output: ..../univ the/en movie/en for/en which/en i/en can/en die/bn for/en ...../univ

We now investigate whether contextual information can correct such mis-classified tags.
Although named entities and word-level code mixing are excluded from evaluation, when dealing with context it is important to consider named entities and word-level code mixing during training because these may contain some important information. We include these tokens in the training data for our context-based experiments, labelling them as other. The presence of this new label may affect the prediction for a language token during classification and sequence labelling. To avoid this situation, a 4-way (bn, hi, en, univ) backoff classifier is trained separately on the English, Hindi, Bengali and universal tokens. During evaluation of any context-based system we discard named entities and word-level code mixing from the predictions of that system. If any of the remaining tokens is predicted as other, we back off to the decision of the 4-way classifier for that token. For the CRF experiments (Section 5.3.1), the backoff classifier is a CRF system, and, for the SVM experiments (Section 5.3.2), the backoff classifier is an SVM system.

5.3.1 Conditional Random Fields (CRF)

As our goal is to apply contextual clues, we first employ Conditional Random Fields (CRF), an approach which takes history into account in predicting the optimal sequence of labels. We employ a linear-chain CRF with increasing order (Order-0, Order-1 and Order-2), trained for 200 iterations, for the different feature combinations used in the SVM-based runs. However, we observe that the accuracy of the CRF-based runs decreases when the binarized length features (see Section 5.2) and the dictionary features (a feature for each dictionary) are involved. Hence, we use the dictionary-based predictions of the baseline system to generate a single dictionary feature for each token, and the raw length value of a token instead of the binarized length features. The results are presented in Table 6 and the second-order results are visualized in Fig. 2.

Features  Order-0  Order-1  Order-2
G         92.80    95.16    95.36
GD        93.42    95.59    95.98
GL        92.82    95.14    95.41
GDL       93.47    95.60    95.94
GC        92.07    94.60    95.05
GDC       93.47    95.62    95.98
GLC       92.36    94.53    95.02
GDLC      93.47    95.58    95.98

Table 6: Average cross-validation accuracy of CRF-based predictions, where G = char-n-gram, L = length feature, D = single dictionary-based label (baseline system) and C = capitalization features

[Figure 2: CRF Order-2 results: cube visualisation. G = char-n-gram, L = binary length features, D = presence in dictionaries and C = capitalization features]

As expected, the performance increases as the order increases from zero to one and two. The use of a single dictionary feature is also helpful. The results for the GDC, GDLC and GD based runs are almost identical (95.98%). However, we choose the GDC system because it performed slightly better (95.989%) than the GDLC (95.983%) and GD (95.983%) systems.
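The following sketch shows the shape of this sequence-labelling setup. Our experiments use MALLET; here the sklearn-crfsuite package (a wrapper around CRFsuite, which supports first-order chains only) stands in as a swapped-in toolkit, and the per-token features are a heavily reduced stand-in for the G, D and raw-length features described above.

    # Sketch of the CRF setup with sklearn-crfsuite standing in for
    # MALLET (first-order only). Each token gets a reduced feature set:
    # affixes as a crude char-n-gram stand-in, the raw length, and the
    # single dictionary-based label from the baseline system.
    import sklearn_crfsuite

    def token_features(sent, i, dict_labels):
        word = sent[i]
        return {
            "word": word.lower(),
            "prefix3": word[:3],
            "suffix3": word[-3:],
            "length": len(word),          # raw length, not binarized
            "dict": dict_labels[i],       # baseline dictionary prediction
        }

    # toy data: one comment with its baseline dictionary predictions
    sents = [["the", "movie", "for", "which", "i", "can", "die", "for"]]
    dicts = [["en", "en", "en", "en", "en", "en", "bn", "en"]]
    y = [["en"] * 8]

    X = [[token_features(s, i, d) for i in range(len(s))]
         for s, d in zip(sents, dicts)]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=200)
    crf.fit(X, y)
    print(crf.predict(X))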
5.3.2 SVM with Context

We also add contextual clues to our SVM classifier. To obtain contextual information we include the previous and next two words as features in the SVM-GDLC-based run. [4] All possible combinations are considered during the experiments (Table 7). After C parameter optimization, the best cross-validation accuracy is found for the P1 N1 run (one word previous and one word next) with C=0.125 (95.14%).

[4] We also experimented with extracting all GDLC features for the context words, but this did not help.

Context        Accuracy (%)
GDLC + P1      94.66
GDLC + P2      94.55
GDLC + N1      94.53
GDLC + N2      94.37
GDLC + P1 N1   95.14
GDLC + P2 N2   94.55

Table 7: Average cross-validation accuracy of SVM (GDLC) context-based runs, where P-i = previous i word(s), N-i = next i word(s)
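Concretely, the P1 N1 context features amount to adding the neighbouring surface forms to each word's own feature vector, along the lines of the sketch below; the feature dictionary shown is a stand-in for the full GDLC representation.

    # Sketch of P1 N1 context features: each word's own features (here a
    # stand-in for the GDLC set) are extended with the surface forms of
    # the previous and next word, with sentence-boundary padding.
    def context_features(tokens, i):
        feats = {"word": tokens[i].lower()}   # stand-in for GDLC features
        feats["prev"] = tokens[i - 1].lower() if i > 0 else "<s>"
        feats["next"] = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
        return feats

    tokens = ["i", "can", "die", "for"]
    print(context_features(tokens, 2))
    # {'word': 'die', 'prev': 'can', 'next': 'for'}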
5.4 Test Set Results

We apply our best dictionary-based system, our best SVM systems (with and without context) and our best CRF system to the held-out test set. The results are shown in Table 8. Our best result is achieved using the CRF model (95.76%).

System                 Precision (%)                Recall (%)                   Accuracy (%)
                       EN     BN     HI     UNIV    EN     BN     HI     UNIV
Baseline (Dictionary)  92.67  90.73  80.64  99.67   92.28  94.63  43.47  94.99  93.64
SVM-GDLC               92.49  94.89  80.31  99.34   96.23  94.28  44.92  97.07  95.21
SVM-P1 N1              93.51  95.56  83.18  99.42   96.63  95.23  55.94  96.95  95.52
CRF-GDC                94.77  94.88  91.86  99.34   95.65  96.22  55.65  97.73  95.76

Table 8: Test set results for Baseline (Dictionary), SVM-GDLC, SVM-P1N1 and CRF-GDC

5.5 Error Analysis

Manual error analysis shows the limitations of these systems. The word-level classifier without contextual clues does not perform well on Hindi data. The number of Hindi tokens is quite low: only 2.4% (4,658) of the tokens in the training data are Hindi, out of which 55.36% are bilingually ambiguous and 29.51% are tri-lingually ambiguous. Individual word-level systems often fail to assign proper labels to ambiguous words, but adding context information helps to overcome this problem. Considering the previous example of die, both the context-based SVM and the CRF system classify it properly. Though the final system, CRF-GDC, performs well, it also has some limitations, failing to identify the language of tokens which appear very frequently in all three languages (e.g. are, na, pic).

6 Conclusion

We have presented an initial study on automatic language identification for Indian-language code mixing in social media communication. We described our dataset of Bengali-Hindi-English Facebook comments and we presented the results of our word-level classification experiments on this dataset. Our experimental results lead us to conclude that character n-gram features are useful for this task, that contextual information is also important and that information from dictionaries can be effectively incorporated as features.

In the future we plan to apply the techniques and feature sets that we used in these experiments to other datasets. We have already started this by applying variants of the systems presented here to the Nepali-English and Spanish-English datasets which were introduced as part of the 2014 code mixing shared task (Solorio et al., 2014; Barman et al., 2014).

We did not include word-level code mixing in our experiments – in future experiments we will explore ways to identify and segment this type of code mixing. It will also be important to find the best way to handle inclusions, since there is a fine line between word borrowing and code mixing.

Acknowledgements

This research is supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of CNGL (www.cngl.ie) at Dublin City University. The authors wish to acknowledge the DJEI/DES/SFI/HEA for the provision of computational facilities and support. Our special thanks to Soumik Mandal from Jadavpur University, India for coordinating the annotation task. We also thank the administrator of JUMatrimonial and the 11 Facebook users who agreed that we can use their posts for their support and permission.

References

Beatrice Alex. 2008. Automatic detection of English inclusions in mixed-lingual data with an application to parsing. Ph.D. thesis, School of Informatics, The University of Edinburgh, Edinburgh, UK.

Guy Aston and Lou Burnard. 1998. The BNC Handbook: Exploring the British National Corpus with SARA. Capstone.

Peter Auer. 2013. Code-Switching in Conversation: Language, Interaction and Identity. Routledge.

Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 229–237. Association for Computational Linguistics.

Utsab Barman, Joachim Wagner, Grzegorz Chrupała, and Jennifer Foster. 2014. DCU-UVT: Word-level language classification with code-mixed data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar. Association for Computational Linguistics.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pages 65–74. Association for Computational Linguistics.
MS Cárdenas-Claros and N Isharyanti. 2009. Code-switching and code-mixing in internet chatting: between 'yes', 'ya', and 'si' – a case study. The JALT CALL Journal, 5(3):67–78.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Theo Pavlidis, editor, Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pages 161–175.

Jean-Marc Dewaele. 2010. Emotions in Multiple Languages. Palgrave Macmillan.

Anik Dey and Pascale Fung. 2014. A Hindi-English code-switching corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2410–2413, Reykjavik, Iceland. European Language Resources Association (ELRA).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Paulseph-John Farrugia. 2004. TTS pre-processing issues for mixed language support. In Proceedings of CSAW'04, the second Computer Science Annual Workshop, pages 36–41. Department of Computer Science & A.I., University of Malta.

E Mark Gold. 1967. Language identification in the limit. Information and Control, 10(5):447–474.

Thomas Gottron and Nedim Lipka. 2010. A comparison of language identification approaches on short, query-style texts. In Advances in Information Retrieval, pages 611–614. Springer.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10–18.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432. Association for Computational Linguistics.

Taofik Hidayat. 2012. An analysis of code switching used by facebookers: a case study in a social network site. Student essay for the study programme "Pendidikan Bahasa Inggris" (English Education), STKIP Siliwangi Bandung, Indonesia. http://publikasi.stkipsiliwangi.ac.id/files/2012/10/08220227-taofik-hidayat.pdf.

Lichan Hong, Gregorio Convertino, and Ed H. Chi. 2011. Language matters in Twitter: a large scale study. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM-11), pages 518–521, Barcelona, Spain. Association for the Advancement of Artificial Intelligence.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2010. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, Taiwan. https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy.

Aravind K. Joshi. 1982. Processing of sentences with intra-sentential code-switching. In J. Horecký, editor, Proceedings of the 9th Conference on Computational Linguistics - Volume 1 (COLING'82), pages 145–150. Academia Praha, North-Holland Publishing Company.

Ben King and Steven Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1110–1119, Atlanta, Georgia. Association for Computational Linguistics.

David C. S. Li. 2000. Cantonese-English code-switching research in Hong Kong: a Y2K review. World Englishes, 19(3):305–322.

Gideon S. Mann and Andrew McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL-08: HLT, pages 870–878, Columbus, Ohio. Association for Computational Linguistics.

Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. The Journal of Machine Learning Research, 11:955–984.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Lesley Milroy and Pieter Muysken, editors. 1995. One Speaker, Two Languages: Cross-disciplinary Perspectives on Code-switching. Cambridge University Press.

Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 task 2: Sentiment analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 312–320, Atlanta, Georgia, USA. Association for Computational Linguistics.

Dong Nguyen and A. Seza Doğruöz. 2013. Word level language identification in online multilingual communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 857–862, Seattle, Washington, USA. Association for Computational Linguistics.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 380–390, Atlanta, Georgia. Association for Computational Linguistics.

Mike Rosner and Paulseph-John Farrugia. 2007. A tagging algorithm for mixed language identification in a noisy domain. In INTERSPEECH-2007, 8th Annual Conference of the International Speech Communication Association, pages 190–193. ISCA Archive.

Raphael Rubino, Joachim Wagner, Jennifer Foster, Johann Roturier, Rasoul Samad Zadeh Kaljahi, and Fred Hollowood. 2013. DCU-Symantec at the WMT 2013 quality estimation shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 392–397, Sofia, Bulgaria. Association for Computational Linguistics.

Hong Ka San. 2009. Chinese-English code-switching in blogs by Macao young people. Master's thesis, The University of Edinburgh, Edinburgh, UK. http://hdl.handle.net/1842/3626.

Latisha Asmaak Shafie and Surina Nayan. 2013. Languages, code-switching practice and primary functions of Facebook among university students. Study in English Language Teaching, 1(1):187–199. http://www.scholink.org/ojs/index.php/selt.

Thamar Solorio and Yang Liu. 2008a. Learning to predict code-switching points. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 973–981. Association for Computational Linguistics.

Thamar Solorio and Yang Liu. 2008b. Part-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1051–1060. Association for Computational Linguistics.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steve Bethard, Mona Diab, Mahmoud Gonheim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirshberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar. Association for Computational Linguistics.

Joachim Wagner, Piyush Arora, Santiago Cortes, Utsab Barman, Dasha Bogdanova, Jennifer Foster, and Lamia Tounsi. 2014. DCU: Aspect-based polarity classification for SemEval task 4. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2014), pages 392–397, Dublin, Ireland. Association for Computational Linguistics.

Jochen Weiner, Ngoc Thang Vu, Dominic Telaar, Florian Metze, Tanja Schultz, Dau-Cheng Lyu, Eng-Siong Chng, and Haizhou Li. 2012. Integration of language identification into a recognition system for spoken conversations containing code-switches. In Proceedings of the 3rd Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'12), Cape Town, South Africa. International Research Center MICA.

Hiroshi Yamaguchi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 969–978. Association for Computational Linguistics.