
TRAINING CODE-SWITCHING LANGUAGE MODEL WITH MONOLINGUAL DATA

Shun-Po Chuang, Tzu-Wei Sung, Hung-yi Lee

National Taiwan University

ABSTRACT

Lack of code-switching data is an issue in training code-switching language models. In this paper, we propose an approach to train code-switching language models with monolingual data only. By constraining and normalizing the output projection matrix of an RNN-based language model, we make the embeddings of different languages close to each other. With numerical and visualized results, we show that the proposed approaches remarkably improve code-switching language modeling trained from monolingual data. The proposed approaches are comparable to or even better than training a code-switching language model with artificially generated code-switching data. Furthermore, we use unsupervised bilingual word translation to analyze whether semantically equivalent words in different languages are mapped together.

Index Terms— Code-Switching, Language Model
1. INTRODUCTION

Code-switching is the practice of using two or more languages within a document or a sentence, and it is widely observed in multicultural areas. Related research suffers from a data insufficiency issue, and applying prior knowledge [1, 2] or designing constraints [3, 4] can alleviate it. Because it is easier to collect monolingual data than code-switching data, efficiently utilizing a large amount of monolingual data is a possible solution to the lack of code-switching data [5]. Recent work [6] trained a code-switching language model with a fine-tuning technique. Similar work [7] integrated two monolingual language models by introducing a special "switch" token in both languages during language model training, and further incorporated the model into automatic speech recognition (ASR). Other works have synthesized more code-switching text by modeling the switching distribution in the data [8, 9]. Generative adversarial networks [10, 11] have also been proposed to learn the code-switching point distribution from code-switching text [12].

In this paper, we propose to utilize constraints that make the word embeddings of different languages close in the same latent space, and to normalize each word vector, in order to generally improve code-switching language modeling. Similar constraints were used in end-to-end ASR [13] before, but have not yet been reported for code-switching language modeling. Related prior works [14, 15] attempted to initialize the word embedding with unit-normalized vectors in ASR, but without keeping the unit norm during training. Initial experiments on code-switching data showed that constraining and normalizing the output projection matrix helps a language model trained with monolingual data handle code-switching data better.
2. CODE-SWITCHING LANGUAGE MODELING

In our approach, we use only monolingual data for training; code-switching data is used only for validation and testing.

2.1. RNN based Language Model

In this work, we adopt the recurrent neural network (RNN) based language model [16]. Given a sequence of words [w_1, w_2, ..., w_T], we obtain predictions y_i by applying the transformation W to the RNN hidden state h_i followed by a softmax:

y_i = softmax(W h_i)    (1)

where i = 1, 2, ..., T and h_0 is a zero vector. Specifically, the output projection matrix is denoted by W ∈ R^{V×z}, where V is the vocabulary size and z is the hidden layer size of the RNN. The parameters are updated by gradient descent with cross entropy as the loss function.

Considering two languages, Lang 1 and Lang 2, in code-switching language modeling, the output projection matrix W can be partitioned into two parts, W1 and W2, with each row being the latent representation of a word in Lang 1 and Lang 2 respectively. The output projection matrix, or embeddings, can be written as W = [W1; W2], where the square brackets indicate concatenation.
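A minimal PyTorch sketch of the model in Eq. (1), with the output projection matrix kept as two per-language blocks W1 and W2 so that W = [W1; W2], is shown below. This is our own illustration rather than the authors' released code; the class name, vocabulary sizes, and toy batch are placeholders.

```python
# Minimal sketch of the RNN LM in Eq. (1) with a partitioned output matrix.
# Not the authors' code; vocabulary sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

class CodeSwitchLM(nn.Module):
    def __init__(self, vocab_size_l1, vocab_size_l2, hidden_size=300):
        super().__init__()
        vocab_size = vocab_size_l1 + vocab_size_l2
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # Rows 0..V1-1 form W1 (Lang 1), the remaining rows form W2 (Lang 2).
        self.W1 = nn.Parameter(torch.randn(vocab_size_l1, hidden_size) * 0.01)
        self.W2 = nn.Parameter(torch.randn(vocab_size_l2, hidden_size) * 0.01)

    def output_matrix(self):
        # W = [W1; W2], shape (V, z)
        return torch.cat([self.W1, self.W2], dim=0)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))        # (B, T, z)
        return h @ self.output_matrix().t()        # logits, (B, T, V)

# Cross-entropy training step on a (toy) monolingual batch.
model = CodeSwitchLM(vocab_size_l1=8000, vocab_size_l2=8000)
tokens = torch.randint(0, 16000, (4, 10))
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
```

Keeping W1 and W2 as separate parameter blocks makes it straightforward to add the constraints of Section 2.2 directly on the per-language embeddings.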



2.2. Constraints on Output Projection Matrix

When optimizing the language model with Lang 1 and Lang 2 monolingual data, it is possible to improve the perplexity on both sides. The word embedding distributions would then be arbitrarily shaped according to the characteristics of each language. Without ever seeing bilingual word pairs, however, the two distributions may converge into their own shapes without correlating with each other, and it becomes hard for the language model to learn how to switch between languages. To train a language model with only monolingual data, we assume that overlapping embedding spaces would benefit code-switching language modeling. To this end, we try to make the word embeddings of Lang 1 and Lang 2, that is W1 and W2, close to each other. We constrain W1 and W2 in the following two ways, and Fig. 1 shows the overview of our proposed approach.

[Fig. 1: Overview of the proposed approach. Square brackets indicate the concatenation of W1 and W2.]

2.2.1. Symmetric Kullback–Leibler Divergence

Kullback–Leibler divergence (KLD) is a well-known measure of the distance between distributions. By minimizing the KLD between the two language distributions, the embedding spaces are encouraged to overlap semantically. We assume that both W1 and W2 follow z-dimensional multivariate Gaussian distributions, that is,

W1 ∼ N(μ_1, Σ_1),  W2 ∼ N(μ_2, Σ_2)

where μ_1, μ_2 ∈ R^z and Σ_1, Σ_2 ∈ R^{z×z} are the mean vectors and covariance matrices of W1 and W2 respectively. Under the Gaussian assumption, the KLD between W1 and W2 can be computed in closed form. Because KLD is asymmetric, we adopt the symmetric form of KLD (SKLD), i.e., the sum of the KLD from W1 to W2 and the KLD from W2 to W1, yielding

L_SKLD = 1/2 [ tr(Σ_1^{-1} Σ_2 + Σ_2^{-1} Σ_1) + (μ_1 − μ_2)^T (Σ_1^{-1} + Σ_2^{-1}) (μ_1 − μ_2) − 2z ].
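A sketch of the SKLD term is given below: fit a z-dimensional Gaussian to each language's block of the output projection matrix and penalize the symmetric KL divergence between the two Gaussians. This is our own illustration; the small ridge term `eps` and the weighting factor used when adding the term to the cross-entropy objective are assumptions, not values stated in the paper.

```python
# Sketch of the SKLD constraint of Sec. 2.2.1 on the two embedding blocks.
import torch

def skld_loss(W1, W2, eps=1e-4):
    z = W1.size(1)
    mu1, mu2 = W1.mean(dim=0), W2.mean(dim=0)
    # Empirical covariances with a small ridge for numerical stability.
    c1 = torch.cov(W1.t()) + eps * torch.eye(z, device=W1.device)
    c2 = torch.cov(W2.t()) + eps * torch.eye(z, device=W2.device)
    c1_inv, c2_inv = torch.linalg.inv(c1), torch.linalg.inv(c2)
    diff = (mu1 - mu2).unsqueeze(1)                       # (z, 1)
    trace_term = torch.trace(c1_inv @ c2 + c2_inv @ c1)
    quad_term = (diff.t() @ (c1_inv + c2_inv) @ diff).squeeze()
    return 0.5 * (trace_term + quad_term - 2 * z)

# Used as an auxiliary term, e.g.:
# loss = ce_loss + lam * skld_loss(model.W1, model.W2)   # lam: tuned weight
```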
2.2.2. Cosine Distance

Cosine distance (CD) is a common measure for semantic evaluation. By minimizing CD, we expect the semantic latent spaces of the two languages to become closer. Similar to SKLD, we compute the mean vectors μ_1 and μ_2 of W1 and W2 respectively, and the CD between the two mean vectors is

L_CD = 1 − (μ_1 · μ_2) / (‖μ_1‖ ‖μ_2‖),

where ‖·‖ denotes the ℓ2 norm. We hypothesize that, by minimizing SKLD or CD, the latent representations of the words in Lang 1 and Lang 2 will distribute in the same semantic space and overlap.

2.3. Output Projection Matrix Normalization

Apart from the constraints stated in Subsection 2.2, we propose to normalize the output projection matrix, that is, to divide each word representation by its ℓ2 norm so that it has unit norm. Note that normalization is independent of the constraints, and the two can be applied together.

The rationale behind normalization is as follows. Consider two words w_j and w_k that are semantically equivalent; the cosine similarity between their latent representations v_j and v_k should be 1, implying the angle between them is 0, i.e., they have the same orientation. From Eq. (1), the probabilities y_{i,j} = exp(v_j · h_i) / Σ_{m=1}^{V} exp(v_m · h_i) and y_{i,k} = exp(v_k · h_i) / Σ_{m=1}^{V} exp(v_m · h_i) are not necessarily equal, because the magnitudes of v_j and v_k may differ. With unit vectors, however, normalization guarantees that, given the same history, the probabilities of two semantically equivalent words generated by the language model will be equal. This means normalization helps cluster semantically equivalent words in the embedding space, which in general improves language modeling.
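The following short sketch illustrates both the cosine-distance constraint of Section 2.2.2 and the row normalization of Section 2.3. It is our own illustration; how often the normalization is applied during training (e.g., after every update) is an implementation choice the paper does not spell out.

```python
# Sketch of the CD constraint (Sec. 2.2.2) and row normalization (Sec. 2.3).
import torch
import torch.nn.functional as F

def cd_loss(W1, W2):
    # 1 - cosine similarity between the two language mean vectors.
    mu1, mu2 = W1.mean(dim=0), W2.mean(dim=0)
    return 1.0 - F.cosine_similarity(mu1, mu2, dim=0)

def normalize_rows(W):
    # Divide every word representation by its L2 norm so it has unit norm;
    # softmax scores then depend only on the direction of each embedding.
    return W / W.norm(dim=1, keepdim=True).clamp_min(1e-8)
```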
3. EXPERIMENTAL SETUP

3.1. Corpus

The South East Asia Mandarin-English (SEAME) corpus [17] is used in the following experiments. It can be separated into two parts by language. The first part is monolingual: it contains pure Mandarin and pure English transcriptions, the two main languages of this corpus. The second part consists of code-switching sentences, whose syntactic structure follows one specific language (Mandarin or English) while the transcription mixes words from both languages.

The original data consists of the train, dev man and dev sgn¹ splits. Each split contains monolingual and code-switching sentences, but dev man and dev sgn are dominated by Mandarin and English respectively. We held out 1000 Mandarin sentences, 1000 English sentences and all code-switching sentences from train as the validation set (because only monolingual data is needed to train the language model); the remaining monolingual sentences form the training set. Similar to prior work [13], we used dev man and dev sgn for testing, but to balance the ratio of Mandarin and English we combined them into a single testing set.

¹ https://github.com/zengzp0912/SEAME-dev-set

3.2. Pseudo Code-switching Training Data

In order to compare the constraints and normalization with a language model trained on code-switching data, we also introduce pseudo code-switching training data, in which monolingual data is used to generate artificial code-switching sentences. Two approaches were used to generate pseudo code-switching data (a short sketch of both follows the list):

• Word substitution: Given only monolingual data, we randomly replace words in monolingual sentences with their corresponding words in the other language, according to a substitution probability, to produce code-switching data. This requires a vocabulary mapping between the two languages; to this end, we used the bilingual translation pair mapping provided by MUSE [18].² Note that not all translated words are in our vocabulary set.

• Sentence concatenation: We randomly sample sentences in different languages from the original corpus, concatenate them into a pseudo code-switching sentence, and add it to the original monolingual corpus.

² https://github.com/facebookresearch/MUSE
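A minimal sketch of the two generation procedures is given below. It is our own illustration; `bilingual_map` stands for the MUSE translation-pair dictionary, and the 0.2 substitution probability is the value the paper's grid search (footnote 3) found best.

```python
# Sketch of the two pseudo code-switching data generators in Sec. 3.2.
import random

def word_substitution(sentence, bilingual_map, p_sub=0.2):
    # Replace each word that has a translation with probability p_sub.
    return [bilingual_map[w] if w in bilingual_map and random.random() < p_sub
            else w
            for w in sentence]

def sentence_concatenation(mandarin_sents, english_sents):
    # Concatenate one randomly chosen sentence from each language.
    return random.choice(mandarin_sents) + random.choice(english_sents)

zh2en = {"知道": "know"}                      # toy bilingual pair
print(word_substitution(["你", "知道", "吗"], zh2en))
```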
3.3. Evaluation Metrics

Perplexity (PPL) is a common measurement for language modeling; lower perplexity indicates higher confidence in the predicted targets. To better observe the effects of the techniques described in the previous sections, we compute five kinds of perplexity on the corpus: (i) ZH: PPL of monolingual Mandarin sentences; (ii) EN: PPL of monolingual English sentences; (iii) CS: PPL of code-switching sentences; (iv) CSP: PPL at code-switching points, which occur when the language of the next word differs from that of the present word; (v) Overall: PPL of the whole corpus, including monolingual and code-switching sentences. Because CS and CSP have different meanings, these perplexities are measured separately. An improvement in CS does not ensure an improvement in CSP: a code-switching sentence usually contains many more non-switching points, so CS may benefit from improved monolingual perplexity rather than from better modeling of the switching points themselves.
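Below is a sketch of how the CSP metric can be computed: perplexity restricted to positions where the language of the next word differs from that of the current word. It is our own illustration; `lang_of` and the per-token log-probabilities are assumed to come from the vocabulary and the trained model.

```python
# Sketch of code-switching-point (CSP) perplexity from Sec. 3.3.
import math

def csp_perplexity(sentences, token_logprobs, lang_of):
    total_logprob, n_points = 0.0, 0
    for sent, logprobs in zip(sentences, token_logprobs):
        for t in range(1, len(sent)):
            if lang_of(sent[t]) != lang_of(sent[t - 1]):  # a switching point
                total_logprob += logprobs[t]              # log P(w_t | history)
                n_points += 1
    return math.exp(-total_logprob / max(n_points, 1))
```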

3.4. Implementation

Because of the small amount of training data, we adopt only a single recurrent layer with Long Short-Term Memory (LSTM) cells for language modeling [19]. The hidden size of both the input projection and the LSTM cell is set to 300. We use a dropout of 0.3 for better generalization and train the models using Adam with an initial learning rate of 0.001. Training is stopped when the overall perplexity on the validation set has not decreased for 10 epochs. All reported results are the average of 3 runs.
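The training setup above can be summarized by the following sketch (single LSTM layer, hidden size 300, dropout 0.3, Adam with lr = 0.001, early stopping after 10 stale epochs). It is our own illustration; `CodeSwitchLM` refers to the earlier sketch, and `train_batches` and `valid_ppl` are placeholders for the data iterator and a function computing overall validation perplexity.

```python
# Sketch of the training configuration in Sec. 3.4.
import torch

def train(model, train_batches, valid_ppl, max_epochs=200, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for tokens in train_batches:                  # (B, T) LongTensors
            optimizer.zero_grad()
            logits = model(tokens[:, :-1])
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
        ppl = valid_ppl(model)                        # overall validation PPL
        if ppl < best:
            best, wait = ppl, 0
        else:
            wait += 1
            if wait >= patience:                      # stop after 10 stale epochs
                break
    return best
```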
4. RESULTS
4.1. Language Modeling

The results are shown in Table 1, with three sub-tables under each normalization setting, indexed by capital letters: (A) the language model trained with monolingual data only, (B) word substitution with a substitution probability,³ and (C) sentence concatenation, as described in Subsection 3.2; (D), (E) and (F) are the results after applying the normalization of Section 2.3 to (A), (B) and (C) respectively. The baselines in rows (a)(d)(g) are language models trained without constraints and without normalization.⁴

Comparing rows (a)(d)(g), we find that learning with pseudo code-switching sentences indeed helps code-switching perplexity considerably, which is reasonable because the language model has seen some code-switching cases during training even though the training data is synthetic. However, comparing rows (b)(c) with (d) and (g), we see that with the constraints applied, a language model trained with only monolingual data can be comparable to or even better than one trained with pseudo code-switching data, on both monolingual (columns ZH and EN) and code-switching (columns CS and CSP) perplexity. Whether trained on monolingual or pseudo code-switching data, normalizing the output projection matrix generally improves language modeling. Even when trained with monolingual data only, normalization also improves CSP, as shown in rows (a) and (j). We conclude that the monolingual data in our corpus has similar sentence structure, and normalization results in a similar latent space, which aids switching between languages. Applying SKLD and normalization together further improves CSP and achieves the best results in the monolingual training case. The perplexity at code-switching points is reduced significantly when the constraints are applied to the output projection matrix, by minimizing either SKLD or CD, without degrading performance on monolingual data. Rows (k)(n)(q) also show that combining the SKLD constraint with normalization gives the best performance on every kind of perplexity, over both monolingual-only and pseudo code-switching training data.

³ We performed a grid search over the substitution probability; 0.2 achieved the lowest perplexity.
⁴ 5-gram models with smoothing techniques were also tried, but performed worse than our baseline; due to limited space, we do not show them.

                            CS       CSP       ZH       EN       Overall
Without normalization
(A) Monolingual only
(a) Baseline                424.80   1118.88   160.40   125.41   289.20
(b) SKLD                    319.71   752.03    152.66   115.50   228.79
(c) CD                      328.04   778.55    150.78   112.11   231.83
(B) Pseudo training data - Word substitution
(d) Baseline                348.88   884.74    156.90   119.98   246.41
(e) SKLD                    298.24   671.38    157.53   120.36   219.62
(f) CD                      296.84   680.19    156.09   117.10   217.56
(C) Pseudo training data - Sentence concatenation
(g) Baseline                340.34   831.19    160.21   138.89   248.83
(h) SKLD                    289.64   628.09    152.27   126.06   216.39
(i) CD                      293.98   652.35    150.76   124.05   217.83
With normalization
(D) Monolingual only
(j) Baseline                311.77   754.21    123.28   90.71    212.44
(k) SKLD                    277.94   601.58    130.11   96.27    197.15
(l) CD                      282.24   602.35    132.94   97.86    200.33
(E) Pseudo training data - Word substitution
(m) Baseline                264.93   583.65    131.31   97.50    190.79
(n) SKLD                    248.87   512.27    136.85   101.12   184.14
(o) CD                      251.60   517.85    138.48   101.27   185.84
(F) Pseudo training data - Sentence concatenation
(p) Baseline                266.11   586.83    123.31   95.82    189.88
(q) SKLD                    241.73   490.00    128.75   102.44   179.83
(r) CD                      247.60   499.41    128.91   103.90   183.49

Table 1: CS, CSP, ZH, EN and Overall perplexity on the testing set.
[Fig. 2: PCA visualization with different training strategies, plotted from 1 out of 3 runs. Panels: (a) Baseline, monolingual (CSP: 1102.31); (b) Baseline, word substitution (CSP: 877.27); (c) SKLD, monolingual (CSP: 750.64).]

4.2. Visualization

Besides the numerical analysis, we want to know whether the degree of overlap of the embedding spaces is aligned with the perplexity results. We applied Principal Component Analysis (PCA) to the output projection matrix and visualized the results on a 2-D plane. Fig. 2 shows the visualized results of the different approaches. Fig. 2a shows that the embeddings of the two languages are linearly separable when training with monolingual data only and without any proposed approach. With synthesized pseudo code-switching data for training, as shown in Fig. 2b, the embeddings of the two languages are closer than in Fig. 2a but still do not overlap much. In Fig. 2c, they overlap completely. This matches our numerical results in Table 1: the closer the embeddings are, the lower the perplexity.⁵

⁵ Due to limited space, we do not show the visualization results for sentence concatenation / CD, which are quite similar to Fig. 2b / 2c.
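A sketch of this visualization is given below: project the rows of the output projection matrix to 2-D with PCA and color them by language. It is our own illustration using scikit-learn and matplotlib; `W1` and `W2` are assumed to be NumPy arrays holding the per-language blocks of the trained matrix.

```python
# Sketch of the PCA visualization in Sec. 4.2.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_embeddings(W1, W2):
    W = np.concatenate([W1, W2], axis=0)             # (V1 + V2, z)
    points = PCA(n_components=2).fit_transform(W)
    plt.scatter(points[:len(W1), 0], points[:len(W1), 1], s=4, label="Mandarin")
    plt.scatter(points[len(W1):, 0], points[len(W1):, 1], s=4, label="English")
    plt.legend()
    plt.show()
```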
4.3. Unsupervised Bilingual Word Translation

To analyze whether words with equivalent semantics in different languages are mapped together by the proposed approaches, we experimented with unsupervised bilingual word translation.

Given a word w that appears in the bilingual pair mapping mentioned in Subsection 3.2, every word in the other language is ranked according to the cosine similarity of their embeddings. If the translation of w is ranked as the r-th candidate, its reciprocal rank is 1/r. The mean reciprocal rank (MRR), the average of the reciprocal ranks, is used as the evaluation metric; MRR is at most 1.0, and the closer to 1.0 the better. The proportion of correct translations that appear in the top-10 candidate list (r ≤ 10) is also reported, as "P@10" [20]. To mitigate the degradation caused by low-frequency words, we selected only words with frequency larger than 80, ending up with about 200 vocabulary words in Mandarin and in English respectively, and 55 bilingual pairs for unsupervised bilingual word translation.
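The evaluation can be sketched as follows: rank the other language's unit-normalized embeddings by cosine similarity and score the gold translation's rank. This is our own illustration; `pairs` is assumed to map source word indices to their gold target indices.

```python
# Sketch of the MRR and P@10 evaluation in Sec. 4.3.
import numpy as np

def mrr_and_p_at_10(src_emb, tgt_emb, pairs):
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                               # cosine similarities
    rr, hits = [], 0
    for s, gold in pairs.items():
        order = np.argsort(-sims[s])                 # best candidates first
        rank = int(np.where(order == gold)[0][0]) + 1
        rr.append(1.0 / rank)
        hits += rank <= 10
    return float(np.mean(rr)), hits / len(pairs)
```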
Approach                    (A) Mandarin → English    (B) English → Mandarin
                            MRR      P@10             MRR      P@10
(i) Baseline                0.0274   5.4%             0.0718   20.0%
(ii) + normalization        0.0554   14.5%            0.0885   23.6%
(iii) SKLD + normalization  0.1024   21.8%            0.1496   30.9%

Table 2: Results for unsupervised bilingual word translation using different approaches, all with monolingual training: translation (A) from Mandarin to English and (B) from English to Mandarin.

The results of bilingual word translation are in Table 2. We can see that translation from Mandarin to English (column (A)) performed worse, on both MRR and P@10, than translation in the reverse direction (column (B)).

Row (i) shows that the baseline without any constraint did not perform well, while rows (ii) and (iii), which additionally apply normalization and the SKLD constraint, show significantly improved MRR and P@10 compared with row (i). This indicates that applying the constraints and normalization for code-switching language modeling truly enhances the semantic mapping.

4.4. Sentence Generation

We further test the sentence generation capacity of the language models trained only with monolingual data. Given part of a sentence, we used the language models to complete it. Two generated sentences and their given inputs are shown in Table 3. Our best approach, with the SKLD constraint and normalization (column (C)), can switch languages either from English to Mandarin (row (i)) or from Mandarin to English (row (ii)), whereas the baseline model (column (B)) fails to switch code in either case.

(i)  (A) Input: 你 知道 maybe (you know maybe)
     (B) Baseline: 你 知道 maybe i think (you know maybe i think)
     (C) SKLD + normalization: 你 知道 maybe 你 要 去 那边 的 时候 就 会 (you know maybe when you go there you will)
(ii) (A) Input: they think 这里 (they think here)
     (B) Baseline: they think 这里 的 时候 我 就 会 去 了 (when they think here i will go)
     (C) SKLD + normalization: they think 这里 is like a lot of people (they think here is like a lot of people)

Table 3: Example sentences generated by the different approaches, all with monolingual training: the code-switching point is (i) from English to Mandarin and (ii) from Mandarin to English. English translations are in parentheses.
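A sketch of the completion procedure is given below: feed a partial sentence to the trained language model and extend it word by word. It is our own illustration; greedy decoding is an assumption (the paper does not state its decoding strategy), and the model and vocabulary objects refer to the earlier sketches.

```python
# Sketch of the sentence-completion experiment in Sec. 4.4 (greedy decoding).
import torch

def complete(model, prefix_ids, eos_id, max_len=20):
    ids = list(prefix_ids)
    for _ in range(max_len):
        inp = torch.tensor([ids])
        next_id = int(model(inp)[0, -1].argmax())   # greedy next-word choice
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids
```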
5. CONCLUSIONS

In this work, we trained code-switching language models with monolingual data by constraining and normalizing the output projection matrix. Improved performance was obtained, and analyses of the results are reported in this paper.

6. REFERENCES

[1] Zhiping Zeng, Haihua Xu, Tze Yuang Chong, Eng-Siong Chng, and Haizhou Li, "Improving n-gram language modeling for code-switching speech recognition," in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2017, pp. 1596–1601.

[2] H. Adel, N. T. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, "Recurrent neural network language modeling for code switching conversational speech," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8411–8415.

[3] Ying Li and Pascale Fung, "Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[4] Ying Li and Pascale Fung, "Language modeling with functional head constraint for code switching speech recognition," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[5] Injy Hamed, Mohamed Elmahdy, and Slim Abdennadher, "Building a first language model for code-switch Arabic-English," Procedia Computer Science, vol. 117, pp. 208–216, 2017.

[6] Hila Gonen and Yoav Goldberg, "Language modeling for code-switching: Evaluation, integration of monolingual data, and discriminative training," arXiv preprint arXiv:1810.11895, 2018.

[7] Saurabh Garg, Tanmay Parekh, and Preethi Jyothi, "Dual language models for code switched speech recognition," arXiv preprint arXiv:1711.01048, 2017.

[8] Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung, "Learn to code-switch: Data augmentation using copy mechanism on language modeling," arXiv preprint arXiv:1810.10254, 2018.

[9] Emre Yilmaz, Henk van den Heuvel, and David A. van Leeuwen, "Acoustic and textual data augmentation for improved ASR of code-switching speech," in the Conference of the International Speech Communication Association (INTERSPEECH), 2018.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014.

[11] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning (ICML), 2017.

[12] Ching-Ting Chang, Shun-Po Chuang, and Hung-Yi Lee, "Code-switching sentence generation by generative adversarial networks and its application to data augmentation," arXiv preprint arXiv:1811.02356, 2018.

[13] Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, and Bin Ma, "Constrained output embeddings for end-to-end code-switching speech recognition with only monolingual data," arXiv preprint arXiv:1904.03802, 2019.

[14] Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo, "Direct acoustics-to-word models for English conversational speech recognition," arXiv preprint arXiv:1703.07754, 2017.

[15] Shane Settle, Kartik Audhkhasi, Karen Livescu, and Michael Picheny, "Acoustically grounded word embeddings for improved acoustics-to-word speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[16] Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in the Conference of the International Speech Communication Association (INTERSPEECH), 2010.

[17] Dau-Cheng Lyu, Tien Ping Tan, Chng Eng Siong, and Haizhou Li, "SEAME: a Mandarin-English code-switching speech corpus in south-east Asia," in the Conference of the International Speech Communication Association (INTERSPEECH), 2010.

[18] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou, "Word translation without parallel data," in International Conference on Learning Representations (ICLR), 2018.

[19] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in the Conference of the International Speech Communication Association (INTERSPEECH), 2012.

[20] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin, "Normalized word embedding and orthogonal transform for bilingual word translation," in the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2015.