
Adversarial Training for Unsupervised Bilingual Lexicon Induction

Meng Zhang†‡ Yang Liu†‡∗ Huanbo Luan† Maosong Sun†‡



State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
zmlarry@foxmail.com, liuyang2011@tsinghua.edu.cn
luanhuanbo@gmail.com, sms@tsinghua.edu.cn

∗ Corresponding author.

Abstract

Word embeddings are well known to capture linguistic regularities of the language on which they are trained. Researchers also observe that these regularities can transfer across languages. However, previous endeavors to connect separate monolingual word embeddings typically require cross-lingual signals as supervision, either in the form of parallel corpus or seed lexicon. In this work, we show that such cross-lingual connection can actually be established without any form of supervision. We achieve this end by formulating the problem as a natural adversarial game, and investigating techniques that are crucial to successful training. We carry out evaluation on the unsupervised bilingual lexicon induction task. Even though this task appears intrinsically cross-lingual, we are able to demonstrate encouraging performance without any cross-lingual clues.

1 Introduction

As word is the basic unit of a language, the betterment of its representation has significant impact on various natural language processing tasks. Continuous word representations, commonly known as word embeddings, have formed the basis for numerous neural network models since their advent. Their popularity results from the performance boost they bring, which should in turn be attributed to the linguistic regularities they capture (Mikolov et al., 2013b).

Soon following the success on monolingual tasks, the potential of word embeddings for cross-lingual natural language processing has attracted much attention. In their pioneering work, Mikolov et al. (2013a) observe that word embeddings trained separately on monolingual corpora exhibit isomorphic structure across languages, as illustrated in Figure 1. This interesting finding is in line with research on human cognition (Youn et al., 2016). It also means a linear transformation may be established to connect word embedding spaces, allowing word feature transfer. This has far-reaching implications on low-resource scenarios (Daumé III and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013), because word embeddings only require plain text to train, which is the most abundant form of linguistic resource.

Figure 1: Illustrative monolingual word embeddings of Spanish and English (caballo/horse, cerdo/pig, gato/cat), adapted from (Mikolov et al., 2013a). Although trained independently, the two sets of embeddings exhibit approximate isomorphism.

However, connecting separate word embedding spaces typically requires supervision from cross-lingual signals. For example, Mikolov et al. (2013a) use five thousand seed word translation pairs to train the linear transformation. In a recent study, Vulić and Korhonen (2016) show that at least hundreds of seed word translation pairs are needed for the model to generalize. This is unfortunate for low-resource languages and domains, because data encoding cross-lingual equivalence is often expensive to obtain.
Figure 2: (a) The unidirectional transformation model directly inspired by the adversarial game: The generator G tries to transform source word embeddings (squares) to make them seem like target ones (dots), while the discriminator D tries to classify whether the input embeddings are generated by G or real samples from the target embedding distribution. (b) The bidirectional transformation model. Two generators with tied weights perform transformation between languages. Two separate discriminators are responsible for each language. (c) The adversarial autoencoder model. The generator aims to make the transformed embeddings not only indistinguishable by the discriminator, but also recoverable as measured by the reconstruction loss L_R.

In this work, we aim to entirely eliminate the need for cross-lingual supervision. Our approach draws inspiration from recent advances in generative adversarial networks (Goodfellow et al., 2014). We first formulate our task in a fashion that naturally admits an adversarial game. Then we propose three models that implement the game, and explore techniques to ensure the success of training. Finally, our evaluation on the bilingual lexicon induction task reveals encouraging performance, even though this task appears formidable without any cross-lingual supervision.

2 Models

In order to induce a bilingual lexicon, we start from two sets of monolingual word embeddings with dimensionality d. They are trained separately on two languages. Our goal is to learn a mapping function f : R^d → R^d so that for a source word embedding x, f(x) lies close to the embedding of its target language translation y. The learned mapping function can then be used to translate each source word x by finding the nearest target embedding to f(x).

We consider x to be drawn from a distribution p_x, and similarly y ∼ p_y. The key intuition here is to find the mapping function to make f(x) seem to follow the distribution p_y, for all x ∼ p_x. From this point of view, we design an adversarial game as illustrated in Figure 2(a): The generator G implements the mapping function f, trying to make f(x) passable as target word embeddings, while the discriminator D is a binary classifier striving to distinguish between fake target word embeddings f(x) ∼ p_f(x) and real ones y ∼ p_y. This intuition can be formalized as the minimax game min_G max_D V(D, G) with value function

V(D, G) = E_{y∼p_y}[log D(y)] + E_{x∼p_x}[log(1 − D(G(x)))].  (1)

Theoretical analysis reveals that adversarial training tries to minimize the Jensen-Shannon divergence JSD(p_y || p_f(x)) (Goodfellow et al., 2014). Importantly, the minimization happens at the distribution level, without requiring word translation pairs to supervise training.
2.1 Model 1: Unidirectional Transformation

The first model directly implements the adversarial game, as shown in Figure 2(a). As hinted by the isomorphism shown in Figure 1, previous works typically choose the mapping function f to be a linear map (Mikolov et al., 2013a; Dinu et al., 2015; Lazaridou et al., 2015). We therefore parametrize the generator as a transformation matrix G ∈ R^{d×d}. We also tried non-linear maps parametrized by neural networks, without success. In fact, if the generator is given sufficient capacity, it can in principle learn a constant mapping function to a target word embedding, which makes it impossible for the discriminator to distinguish, much like the "mode collapse" problem widely observed in the image domain (Radford et al., 2015; Salimans et al., 2016). We therefore believe it is crucial to grant the generator with suitable capacity.

As a generic binary classifier, a standard feed-forward neural network with one hidden layer is used to parametrize the discriminator D, and its loss function is the usual cross-entropy loss, as in the value function (1):

L_D = −log D(y) − log(1 − D(Gx)).  (2)

For simplicity, here we write the loss with a minibatch size of 1; in our experiments we use 128. The generator loss is given by

L_G = −log D(Gx).  (3)

In line with previous work (Goodfellow et al., 2014), we find this loss easier to minimize than the original form log(1 − D(Gx)).
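To make the losses in Equations (2) and (3) concrete, the following is a minimal PyTorch sketch; the paper does not commit to a particular framework, and the ReLU nonlinearity and all names below are illustrative assumptions (d = 50 and the hidden size 500 follow the settings reported later).

```python
import torch
import torch.nn as nn

d, hidden = 50, 500  # embedding dimension and discriminator hidden layer size used in the paper

# Generator: a single linear map G in R^{d x d} with no bias (Model 1).
G = nn.Linear(d, d, bias=False)

# Discriminator: a feed-forward binary classifier with one hidden layer (ReLU assumed).
D = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

bce = nn.BCELoss()

def discriminator_loss(x, y):
    """L_D = -log D(y) - log(1 - D(Gx)), averaged over the minibatch."""
    real = D(y).squeeze(1)
    fake = D(G(x).detach()).squeeze(1)   # detach so only D is updated by this loss
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def generator_loss(x):
    """L_G = -log D(Gx), the non-saturating alternative to log(1 - D(Gx))."""
    fake = D(G(x)).squeeze(1)
    return bce(fake, torch.ones_like(fake))
```

In training, x and y would be minibatches of 128 source and target embeddings drawn independently from the two monolingual spaces.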
Orthogonal Constraint

The above model is very difficult to train. One possible reason is that the parameter search space R^{d×d} for the generator may still be too large. Previous works have attempted to constrain the transformation matrix to be orthogonal (Xing et al., 2015; Zhang et al., 2016b; Artetxe et al., 2016). An orthogonal transformation is also theoretically appealing for its self-consistency (Smith et al., 2017) and numerical stability. However, using constrained optimization for our purpose is cumbersome, so we opt for an orthogonal parametrization (Mhammedi et al., 2016) of the generator instead.

2.2 Model 2: Bidirectional Transformation

The orthogonal parametrization is still quite slow. We can relax the orthogonal constraint and only require the transformation to be self-consistent (Smith et al., 2017): If G transforms the source word embedding space into the target language space, its transpose G^T should transform the target language space back to the source. This can be implemented by two unidirectional models with a tied generator, as illustrated in Figure 2(b). Two separate discriminators are used, with the same cross-entropy loss as Equation (2) used by Model 1. The generator loss is given by

L_G = −log D_1(Gx) − log D_2(G^T y).  (4)
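Reusing G, the discriminator architecture, and bce from the sketch above, the tied-generator loss of Equation (4) could look like the following; treating embeddings as row vectors, Gx corresponds to x @ G.weight.t() and G^T y to y @ G.weight.

```python
# Two separate discriminators, one per language (same assumed architecture as before).
D1 = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())
D2 = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

def bidirectional_generator_loss(x, y):
    """L_G = -log D1(Gx) - log D2(G^T y), with a single weight matrix shared by both directions."""
    to_target = D1(x @ G.weight.t()).squeeze(1)   # source mapped into the target space by G
    to_source = D2(y @ G.weight).squeeze(1)       # target mapped back to the source space by G^T
    return bce(to_target, torch.ones_like(to_target)) + bce(to_source, torch.ones_like(to_source))
```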
2.3 Model 3: Adversarial Autoencoder

As another way to relax the orthogonal constraint, we introduce the adversarial autoencoder (Makhzani et al., 2015), depicted in Figure 2(c). After the generator G transforms a source word embedding x into a target language representation Gx, we should be able to reconstruct the source word embedding x by mapping back with G^T. We therefore introduce the reconstruction loss measured by cosine similarity:

L_R = −cos(x, G^T Gx).  (5)

Note that this loss will be minimized if G is orthogonal. With this term included, the loss function for the generator becomes

L_G = −log D(Gx) − λ cos(x, G^T Gx),  (6)

where λ is a hyperparameter that balances the two terms. λ = 0 recovers the unidirectional transformation model, while larger λ should enforce a stricter orthogonal constraint.
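Continuing the same sketch, the generator loss of the adversarial autoencoder in Equation (6) adds the cosine reconstruction term of Equation (5); λ = 1 follows the setting reported later in the paper, and the function name is an illustrative assumption.

```python
import torch.nn.functional as F

lam = 1.0  # the hyperparameter λ

def autoencoder_generator_loss(x):
    """L_G = -log D(Gx) - λ cos(x, G^T Gx): fool the discriminator while keeping G^T Gx close to x."""
    mapped = G(x)                        # Gx
    fake = D(mapped).squeeze(1)
    reconstruction = mapped @ G.weight   # G^T Gx in the row-vector convention
    rec_term = F.cosine_similarity(x, reconstruction, dim=1).mean()
    return bce(fake, torch.ones_like(fake)) - lam * rec_term
```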
3 Training Techniques

Generative adversarial networks are notoriously difficult to train, and investigation into stabler training remains a research frontier (Radford et al., 2015; Salimans et al., 2016; Arjovsky and Bottou, 2017). We contribute in this aspect by reporting techniques that are crucial to successful training for our task.

3.1 Regularizing the Discriminator
Recently, it has been suggested to inject noise into the input to the discriminator (Sønderby et al., 2016; Arjovsky and Bottou, 2017). The noise is typically additive Gaussian. Here we explore more possibilities, with the following types of noise, injected into the input and hidden layer:

• Multiplicative Bernoulli noise (dropout) (Srivastava et al., 2014): ε ∼ Bernoulli(p).

• Additive Gaussian noise: ε ∼ N(0, σ²).

• Multiplicative Gaussian noise: ε ∼ N(1, σ²).

As noise injection is a form of regularization (Bishop, 1995; Van der Maaten et al., 2013; Wager et al., 2013), we also try l2 regularization, and directly restricting the hidden layer size to combat overfitting. Our findings include:

• Without regularization, it is not impossible for the optimizer to find a satisfactory parameter configuration, but the hidden layer size has to be tuned carefully. This indicates that a balance of capacity between the generator and discriminator is needed.

• All forms of regularization help training by allowing us to liberally set the hidden layer size to a relatively large value.

• Among the types of regularization, multiplicative Gaussian injected into the input is the most effective, and additive Gaussian is similar. On top of input noise, hidden layer noise helps slightly.

In the following experiments, we inject multiplicative Gaussian into the input and hidden layer of the discriminator with σ = 0.5.
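As an illustration of this choice, a discriminator with multiplicative Gaussian noise on its input and hidden layer might be written as follows; the ReLU nonlinearity and the class name are assumptions, while d = 50, the hidden size 500, and σ = 0.5 follow the paper.

```python
import torch
import torch.nn as nn

class NoisyDiscriminator(nn.Module):
    """Feed-forward discriminator regularized with multiplicative Gaussian noise."""
    def __init__(self, d=50, hidden=500, sigma=0.5):
        super().__init__()
        self.sigma = sigma
        self.fc1 = nn.Linear(d, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def noisy(self, h):
        # Multiply by samples from N(1, sigma^2) during training only.
        if self.training:
            return h * (1.0 + self.sigma * torch.randn_like(h))
        return h

    def forward(self, e):
        h = torch.relu(self.fc1(self.noisy(e)))        # noise injected into the input
        return torch.sigmoid(self.fc2(self.noisy(h)))  # noise injected into the hidden layer
```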
3.2 Model Selection

From a typical training trajectory shown in Figure 3, we observe that training is not convergent. In fact, simply using the model saved at the end of training gives poor performance. Therefore we need a mechanism to select a good model. We observe there are sharp drops of the generator loss L_G, and find they correspond to good models, as the discriminator gets confused at these points with its classification accuracy (D accuracy) dropping simultaneously. Interestingly, the reconstruction loss L_R and the value of ||G^T G − I||_F exhibit synchronous drops, even if we use the unidirectional transformation model (λ = 0). This means a good transformation matrix is indeed nearly orthogonal, and justifies our encouragement of G towards orthogonality. With this finding, we can train for sufficient steps and save the model with the lowest generator loss.

Figure 3: A typical training trajectory of the adversarial autoencoder model with λ = 1, plotting L_G, D accuracy, L_R, and ||G^T G − I||_F against the number of minibatches. The values are averages within each minibatch.

As we aim to find the cross-lingual transformation without supervision, it would be ideal to determine hyperparameters without a validation set. The sharp drops can also be indicative in this case. If a hyperparameter configuration is poor, those values will oscillate without a clear drop. Although this criterion is somewhat subjective, we find it to be quite feasible in practice.
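A sketch of this selection rule, assuming hypothetical helper functions (sample_minibatch, update_discriminator, update_generator) that wrap the losses above and the generator G from the earlier sketches: train for a fixed number of minibatches and keep the generator weights observed at the lowest minibatch-averaged generator loss.

```python
import copy

best_loss, best_G_state = float("inf"), None
for step in range(500_000):
    x, y = sample_minibatch()        # hypothetical: 128 source and 128 target embeddings
    update_discriminator(x, y)       # hypothetical: one gradient step on L_D
    loss_g = update_generator(x)     # hypothetical: one gradient step on L_G, returning its value
    if loss_g < best_loss:           # sharp drops of L_G mark the good transformations
        best_loss = loss_g
        best_G_state = copy.deepcopy(G.state_dict())
```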
3.3 Other Training Details

Our approach takes monolingual word embeddings as input. We train the CBOW model (Mikolov et al., 2013b) with default hyperparameters in word2vec.1 The embedding dimension d is 50 unless stated otherwise. Before feeding them into our system, we normalize the word embeddings to unit length. When sampling words for adversarial training, we penalize frequent words in a way similar to (Mikolov et al., 2013b).
G is initialized with a random orthogonal matrix. The hidden layer size of D is 500. Adversarial training involves alternate gradient update of the generator and discriminator, which we implement with a simpler variant algorithm described in (Nowozin et al., 2016). Adam (Kingma and Ba, 2014) is used as the optimizer, with default hyperparameters. For the adversarial autoencoder model, λ = 1 generally works well, but λ = 10 appears stabler for the low-resource Turkish-English setting.

1 https://code.google.com/archive/p/word2vec
4 Experiments

We evaluate the quality of the cross-lingual embedding transformation on the bilingual lexicon induction task. After a source word embedding is transformed into the target space, its M nearest target embeddings (in terms of cosine similarity) are retrieved, and compared against the entry in a ground truth bilingual lexicon. Performance is measured by top-M accuracy (Vulić and Moens, 2013): If any of the M translations is found in the ground truth bilingual lexicon, the source word is considered to be handled correctly, and the accuracy is calculated as the percentage of correctly translated source words. We generally report the harshest top-1 accuracy, except when comparing with published figures in Section 4.4.
with published figures in Section 4.4. implementation.3
Baselines
Almost all approaches to bilingual lexicon induc- • Isometric alignment (IA) (Zhang et al.,
tion from non-parallel data depend on seed lexica. 2016b): an extension of TM by augmenting
An exception is decipherment (Dou and Knight, its learning objective with the isometric (or-
2012; Dou et al., 2015), and we use it as our thogonal) constraint. Although Zhang et al.
baseline. The decipherment approach is not based (2016b) had subsequent steps for their POS
on distributional semantics, but rather views the tagging task, it could be used for bilingual
source language as a cipher for the target lan- lexicon induction as well.
guage, and attempts to learn a statistical model to
decipher the source language. We run the Mono- We ensure the same input embeddings for these
Giza system as recommended by the toolkit.2 It methods and ours.
can also utilize monolingual embeddings (Dou The seed word translation pairs are obtained as
et al., 2015); in this case, we use the same em- follows. First, we ask Google Translate4 to trans-
beddings as the input to our approach. late the source language vocabulary. Then the tar-
Sharing the underlying spirit with our approach, get translations are queried again and translated
related methods also build upon monolingual word back to the source language, and those that do
embeddings and find transformation to link dif- not match the original source words are discarded.
ferent languages. Although they need seed word This helps to ensure the translation quality. Fi-
translation pairs to train and thus not directly com- nally, the translations are discarded if they fall out
parable, we report their performance with 50 and of our target language vocabulary.
100 seeds for reference. These methods are:
2 3
http://www.isi.edu/natural- http://clic.cimec.unitn.it/˜georgiana.dinu/down
4
language/software/monogiza release v1.0.tar.gz https://translate.google.com
method               # seeds    accuracy (%)
MonoGiza w/o emb.    0          0.05
MonoGiza w/ emb.     0          0.09
TM                   50         0.29
TM                   100        21.79
IA                   50         18.71
IA                   100        32.29
Model 1              0          39.25
Model 1 + ortho.     0          28.62
Model 2              0          40.28
Model 3              0          43.31

Table 2: Chinese-English top-1 accuracies of the MonoGiza baseline and our models, along with the translation matrix (TM) and isometric alignment (IA) methods that utilize 50 and 100 seeds.

4.1 Experiments on Chinese-English

Data

For this set of experiments, the data for training word embeddings comes from Wikipedia comparable corpora.5 Following (Vulić and Moens, 2013), we retain only nouns with at least 1,000 occurrences. For the Chinese side, we first use OpenCC6 to normalize characters to be simplified, and then perform Chinese word segmentation and POS tagging with THULAC.7 The preprocessing of the English side involves tokenization, POS tagging, lemmatization, and lowercasing, which we carry out with the NLTK toolkit.8 The statistics of the final training data are given in Table 1, along with the other experimental settings.

As the ground truth bilingual lexicon for evaluation, we use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27).

5 http://linguatools.org/tools/corpora/wikipedia-comparable-corpora
6 https://github.com/BYVoid/OpenCC
7 http://thulac.thunlp.org
8 http://www.nltk.org

Overall Performance

Table 2 lists the performance of the MonoGiza baseline and our four variants of adversarial training. MonoGiza obtains low performance, likely due to the harsh evaluation protocol (cf. Section 4.4). Providing it with syntactic information can help (Dou and Knight, 2013), but in a low-resource scenario with zero cross-lingual information, parsers are likely to be inaccurate or even unavailable.

The unidirectional transformation model attains reasonable accuracy if trained successfully, but it is rather sensitive to hyperparameters and initialization. This training difficulty motivates our orthogonal constraint. But imposing a strict orthogonal constraint hurts performance. It is also about 20 times slower even though we utilize orthogonal parametrization instead of constrained optimization. The last two models represent different relaxations of the orthogonal constraint, and the adversarial autoencoder model achieves the best performance. We therefore use it in our following experiments. Table 3 lists some word translation examples given by the adversarial autoencoder model.

城市 (chengshi)    小行星 (xiaoxingxing)    文学 (wenxue)
city               asteroid                 poetry
town               astronomer               literature
suburb             comet                    prose
area               constellation            poet
proximity          orbit                    writing

Table 3: Top-5 English translation candidates proposed by our approach for some Chinese words. The ground truth is marked in bold.

Comparison With Seed-Based Methods

In this section, we investigate how many seeds TM and IA require to attain the performance level of our approach. There are a total of 1,280 seed translation pairs for Chinese-English, which are removed from the test set during the evaluation for this experiment. We use the most frequent S pairs for TM and IA.

Figure 4: Top-1 accuracies of our approach, isometric alignment (IA), and translation matrix (TM), with the number of seeds varying in {50, 100, 200, 500, 1000, 1280}.

Figure 4 shows the accuracies with respect to S.
When the seeds are few, the seed-based methods exhibit clear performance degradation. In this case, we also observe the importance of the orthogonal constraint from the superiority of IA to TM, which supports our introduction of this constraint as we attempt zero supervision. Finally, in line with the finding in (Vulić and Korhonen, 2016), hundreds of seeds are needed for TM to generalize. Only then do seed-based methods catch up with our approach, and the performance difference is marginal even when more seeds are provided.

Effect of Embedding Dimension

As our approach takes monolingual word embeddings as input, it is conceivable that their quality significantly affects how well the two spaces can be connected by a linear map. We look into this aspect by varying the embedding dimension d in Figure 5. As the dimension increases, the accuracy improves and gradually levels off. This indicates that too low a dimension hampers the encoding of linguistic information drawn from the corpus, and it is advisable to use a sufficiently large dimension.

Figure 5: Top-1 accuracies of our approach with respect to the input embedding dimensions in {20, 50, 100, 200}.

4.2 Experiments on Other Language Pairs

Data

We also induce bilingual lexica from Wikipedia comparable corpora for the following language pairs: Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. For Spanish-English and Italian-English, we choose to use TreeTagger9 for preprocessing, as in (Vulić and Moens, 2013). For the Japanese corpus, we use MeCab10 for word segmentation and POS tagging. For Turkish, we utilize the preprocessing tools (tokenization and POS tagging) provided in LORELEI Language Packs (Strassel and Tracey, 2016), and its English side is preprocessed by NLTK. Unlike the other language pairs, the frequency cutoff threshold for Turkish-English is 100, as the amount of data is relatively small.

The ground truth bilingual lexica for Spanish-English and Italian-English are obtained from Open Multilingual WordNet11 through NLTK. For Japanese-Chinese, we use an in-house lexicon. For Turkish-English, we build a set of ground truth translation pairs in the same way as we obtain seed word translation pairs from Google Translate, described above.

9 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
10 http://taku910.github.io/mecab
11 http://compling.hss.ntu.edu.sg/omw

method                     # seeds    es-en    it-en    ja-zh    tr-en
MonoGiza w/o embeddings    0          0.35     0.30     0.04     0.00
MonoGiza w/ embeddings     0          1.19     0.27     0.23     0.09
TM                         50         1.24     0.76     0.35     0.09
TM                         100        48.61    37.95    26.67    11.15
IA                         50         39.89    27.03    19.04    7.58
IA                         100        60.44    46.52    36.35    17.11
Ours                       0          71.97    58.60    43.02    17.18

Table 4: Top-1 accuracies (%) of the MonoGiza baseline and our approach on Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds are also listed.

Results

As shown in Table 4, the MonoGiza baseline still does not work well on these language pairs, while our approach achieves much better performance. The accuracies are particularly high for Spanish-English and Italian-English, likely because they are closely related languages, and their embedding spaces may exhibit stronger isomorphism.
The performance on Japanese-Chinese is lower, on a comparable level with Chinese-English (cf. Table 2), and these languages are relatively distantly related. Turkish-English represents a low-resource scenario, and therefore the lexical semantic structure may be insufficiently captured by the embeddings. The agglutinative nature of Turkish can also add to the challenge.

4.3 Large-Scale Settings

We experiment with large-scale Chinese-English data from two sources: the whole Wikipedia dump and Gigaword (LDC2011T13 and LDC2011T07). We also simplify preprocessing by removing the noun restriction and the lemmatization step (cf. preprocessing decisions for the above experiments).

Although large-scale data may benefit the training of embeddings, it poses a greater challenge to bilingual lexicon induction. First, the degree of non-parallelism tends to increase. Second, with cruder preprocessing, the noise in the corpora may take its toll. Finally, but probably most importantly, the vocabularies expand dramatically compared to previous settings (see Table 1). This means a word translation has to be retrieved from a much larger pool of candidates.

method    # seeds    Wikipedia    Gigaword
TM        50         0.00         0.01
TM        100        4.79         2.07
IA        50         3.25         1.68
IA        100        7.08         4.18
Ours      0          7.92         2.53

Table 5: Top-1 accuracies (%) of our approach to inducing bilingual lexica for Chinese-English from Wikipedia and Gigaword. Also listed are results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds.

For these reasons, we consider the performance of our approach presented in Table 5 to be encouraging. The imbalanced sizes of the Chinese and English Wikipedia do not seem to cause a problem for the structural isomorphism needed by our method. MonoGiza does not scale to such large vocabularies, as it already takes days to train in our Italian-English setting. In contrast, our approach is immune to scalability issues by working with embeddings provided by word2vec, which is well known for its fast speed. With the network configuration used in our experiments, the adversarial autoencoder model takes about two hours to train for 500k minibatches on a single CPU.

4.4 Comparison With (Cao et al., 2016)

In order to compare with the recent method by Cao et al. (2016), which also uses zero cross-lingual signal to connect monolingual embeddings, we replicate their French-English experiment to test our approach.12 This experimental setting has important differences from the above ones, mostly in the evaluation protocol. Apart from using top-5 accuracy as the evaluation metric, the ground truth bilingual lexicon is obtained by performing word alignment on a parallel corpus. We find this automatically constructed bilingual lexicon to be noisier than the ones we use for the other language pairs; it often lists tens of translations for a source word. This lenient evaluation protocol should explain MonoGiza's higher numbers in Table 6 than what we report in the other experiments. In this setting, our approach is able to considerably outperform both MonoGiza and the method by Cao et al. (2016).

12 As a confirmation, we ran MonoGiza in this setting and obtained comparable performance as reported.

method                     5k       10k
MonoGiza w/o embeddings    13.74    7.80
MonoGiza w/ embeddings     17.98    10.56
(Cao et al., 2016)         23.54    17.82
Ours                       68.59    51.86

Table 6: Top-5 accuracies (%) of 5k and 10k most frequent words in the French-English setting. The figures for the baselines are taken from (Cao et al., 2016).

5 Related Work

5.1 Cross-Lingual Word Embeddings for Bilingual Lexicon Induction

Inducing bilingual lexica from non-parallel data is a long-standing cross-lingual task. Except for the decipherment approach, traditional statistical methods all require cross-lingual signals (Rapp, 1999; Koehn and Knight, 2002; Fung and Cheung, 2004; Gaussier et al., 2004; Haghighi et al., 2008; Vulić et al., 2011; Vulić and Moens, 2013).
Recent advances in cross-lingual word embeddings (Vulić and Korhonen, 2016; Upadhyay et al., 2016) have rekindled interest in bilingual lexicon induction. Like their traditional counterparts, these embedding-based methods require cross-lingual signals encoded in parallel data, aligned at document level (Vulić and Moens, 2015), sentence level (Zou et al., 2013; Chandar A P et al., 2014; Hermann and Blunsom, 2014; Kočiský et al., 2014; Gouws et al., 2015; Luong et al., 2015; Coulmance et al., 2015; Oshikiri et al., 2016), or word level (i.e. seed lexicon) (Gouws and Søgaard, 2015; Wick et al., 2016; Duong et al., 2016; Shi et al., 2015; Mikolov et al., 2013a; Dinu et al., 2015; Lazaridou et al., 2015; Faruqui and Dyer, 2014; Lu et al., 2015; Ammar et al., 2016; Zhang et al., 2016a, 2017; Smith et al., 2017). In contrast, our work completely removes the need for cross-lingual signals to connect monolingual word embeddings, trained on non-parallel text corpora.

As one of our baselines, the method by Cao et al. (2016) also does not require cross-lingual signals to train bilingual word embeddings. It modifies the objective for training embeddings, whereas our approach uses monolingual embeddings trained beforehand and held fixed. More importantly, its learning mechanism is substantially different from ours. It encourages word embeddings from different languages to lie in the shared semantic space by matching the mean and variance of the hidden states, assumed to follow a Gaussian distribution, which is hard to justify. Our approach does not make any assumptions and directly matches the mapped source embedding distribution with the target distribution by adversarial training.

A recent work also attempts adversarial training for cross-lingual embedding transformation (Barone, 2016). The model architectures are similar to ours, but the reported results are not positive. We tried the publicly available code on our data, but the results were not positive, either. Therefore, we attribute the outcome to the difference in the loss and training techniques, but not the model architectures or data.

5.2 Adversarial Training

Generative adversarial networks are originally proposed for generating realistic images as an implicit generative model, but the adversarial training technique for matching distributions is generalizable to many more tasks, including natural language processing. For example, Ganin et al. (2016) address domain adaptation by adversarially training features to be domain invariant, and test on sentiment classification. Chen et al. (2016) extend this idea to cross-lingual sentiment classification. Our research deals with unsupervised bilingual lexicon induction based on word embeddings, and therefore works with word embedding distributions, which are more interpretable than the neural feature space of classifiers in the above works.

In the field of neural machine translation, a recent work (He et al., 2016) proposes dual learning, which also involves a two-agent game and therefore bears conceptual resemblance to the adversarial training idea. The framework is carried out with reinforcement learning, and thus differs greatly in implementation from adversarial training.

6 Conclusion

In this work, we demonstrate the feasibility of connecting word embeddings of different languages without any cross-lingual signal. This is achieved by matching the distributions of the transformed source language embeddings and target ones via adversarial training. The success of our approach signifies the existence of universal lexical semantic structure across languages. Our work also opens up opportunities for the processing of extremely low-resource languages and domains that lack parallel data completely.

Our work is likely to benefit from advances in techniques that further stabilize adversarial training. Future work also includes investigating other divergences that adversarial training can minimize (Nowozin et al., 2016), and broader mathematical tools that match distributions (Mohamed and Lakshminarayanan, 2016).

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (No. 61522204), the 973 Program (2014CB340501), and the National Natural Science Foundation of China (No. 61331013). This research is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative and administered by the IDM Programme.
References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively Multilingual Word Embeddings. arXiv:1602.01925 [cs] http://arxiv.org/abs/1602.01925.

Martin Arjovsky and Léon Bottou. 2017. Towards Principled Methods For Training Generative Adversarial Networks. In ICLR. http://arxiv.org/abs/1701.04862.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In EMNLP. http://aclanthology.info/papers/learning-principled-bilingual-mappings-of-word-embeddings-while-preserving-monolingual-invariance.

Antonio Valerio Miceli Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP. https://doi.org/10.18653/v1/W16-1614.

Chris M. Bishop. 1995. Training with Noise is Equivalent to Tikhonov Regularization. Neural Comput. https://doi.org/10.1162/neco.1995.7.1.108.

Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A Distribution-based Model to Learn Bilingual Word Embeddings. In COLING. http://aclanthology.info/papers/a-distribution-based-model-to-learn-bilingual-word-embeddings.

Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. In NIPS. http://papers.nips.cc/paper/5270-an-autoencoder-approach-to-learning-bilingual-word-representations.pdf.

Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2016. Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification. arXiv:1606.01614 [cs] http://arxiv.org/abs/1606.01614.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, Fast Cross-lingual Word-embeddings. In EMNLP. http://aclanthology.info/papers/trans-gram-fast-cross-lingual-word-embeddings.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In ACL-HLT. http://aclweb.org/anthology/P11-2071.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving Zero-Shot Learning by Mitigating the Hubness Problem. In ICLR Workshop. http://arxiv.org/abs/1412.6568.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In EMNLP-CoNLL. http://aclweb.org/anthology/D12-1025.

Qing Dou and Kevin Knight. 2013. Dependency-Based Decipherment for Resource-Limited Machine Translation. In EMNLP. http://aclanthology.info/papers/dependency-based-decipherment-for-resource-limited-machine-translation.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian Inference and Vector Space Models for Improved Decipherment. In ACL-IJCNLP. http://www.aclweb.org/anthology/P15-1081.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In EMNLP. http://aclanthology.info/papers/learning-crosslingual-word-embeddings-without-bilingual-corpora.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In EACL. http://aclanthology.info/papers/improving-vector-space-word-representations-using-multilingual-correlation.

Pascale Fung and Percy Cheung. 2004. Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM. In EMNLP. http://aclweb.org/anthology/W04-3208.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research. http://jmlr.org/papers/v17/15-239.html.

Eric Gaussier, J.M. Renders, I. Matveeva, C. Goutte, and H. Dejean. 2004. A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora. In ACL. https://doi.org/10.3115/1218955.1219022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In ICML. http://jmlr.org/proceedings/papers/v37/gouws15.html.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In NAACL-HLT. http://www.aclweb.org/anthology/N15-1157.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. In ACL-HLT. http://aclanthology.info/papers/learning-bilingual-lexicons-from-monolingual-corpora.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual Learning for Machine Translation. In NIPS. http://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual Distributed Representations without Word Alignment. In ICLR. http://arxiv.org/abs/1312.6173.

Ann Irvine and Chris Callison-Burch. 2013. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation. http://aclweb.org/anthology/W13-2233.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] http://arxiv.org/abs/1412.6980.

Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In ACL Workshop on Unsupervised Lexical Acquisition. https://doi.org/10.3115/1118627.1118629.

Tomáš Kočiský, Karl Moritz Hermann, and Phil Blunsom. 2014. Learning Bilingual Word Representations by Marginalizing Alignments. In ACL. http://aclanthology.info/papers/learning-bilingual-word-representations-by-marginalizing-alignments.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning. In ACL-IJCNLP. https://doi.org/10.3115/v1/P15-1027.

Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep Multilingual Correlation for Improved Word Embeddings. In NAACL-HLT. http://aclanthology.info/papers/deep-multilingual-correlation-for-improved-word-embeddings.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. http://aclanthology.info/papers/bilingual-word-representations-with-monolingual-quality-in-mind.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial Autoencoders. arXiv:1511.05644 [cs] http://arxiv.org/abs/1511.05644.

Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2016. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. arXiv:1612.00188 [cs] http://arxiv.org/abs/1612.00188.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs] http://arxiv.org/abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Shakir Mohamed and Balaji Lakshminarayanan. 2016. Learning in Implicit Generative Models. arXiv:1610.03483 [cs, stat] http://arxiv.org/abs/1610.03483.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat] http://arxiv.org/abs/1606.00709.

Takamasa Oshikiri, Kazuki Fukui, and Hidetoshi Shimodaira. 2016. Cross-Lingual Word Representations via Spectral Graph Embeddings. In ACL. https://doi.org/10.18653/v1/P16-2080.

Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs] http://arxiv.org/abs/1511.06434.

Reinhard Rapp. 1999. Automatic Identification of Word Translations from Unrelated English and German Corpora. In ACL. https://doi.org/10.3115/1034678.1034756.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In NIPS. http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf.

Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. 2015. Learning Cross-lingual Word Embeddings via Matrix Co-factorization. In ACL-IJCNLP. http://aclanthology.info/papers/learning-cross-lingual-word-embeddings-via-matrix-co-factorization.

Samuel Smith, David Turban, Steven Hamblin, and Nils Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR. http://arxiv.org/abs/1702.03859.

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. 2016. Amortised MAP Inference for Image Super-resolution. arXiv:1610.04490 [cs, stat] http://arxiv.org/abs/1610.04490.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research. http://www.jmlr.org/papers/v15/srivastava14a.html.

Stephanie Strassel and Jennifer Tracey. 2016. LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages. In LREC. http://www.lrec-conf.org/proceedings/lrec2016/pdf/1138_Paper.pdf.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual Models of Word Embeddings: An Empirical Comparison. In ACL. http://aclanthology.info/papers/cross-lingual-models-of-word-embeddings-an-empirical-comparison.

Laurens Van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Weinberger. 2013. Learning with Marginalized Corrupted Features. In ICML. http://www.jmlr.org/proceedings/papers/v28/vandermaaten13.html.

Ivan Vulić and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. In ACL. http://aclanthology.info/papers/on-the-role-of-seed-lexicons-in-learning-bilingual-word-embeddings.

Ivan Vulić and Marie-Francine Moens. 2013. Cross-Lingual Semantic Similarity of Words as the Similarity of Their Semantic Word Responses. In NAACL-HLT. http://aclanthology.info/papers/cross-lingual-semantic-similarity-of-words-as-the-similarity-of-their-semantic-word-responses.

Ivan Vulić and Marie-Francine Moens. 2015. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction. In ACL-IJCNLP. http://aclanthology.info/papers/bilingual-word-embeddings-from-non-parallel-document-aligned-data-applied-to-bilingual-lexicon-induction.

Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2011. Identifying Word Translations from Comparable Corpora Using Latent Topic Models. In ACL-HLT. http://aclanthology.info/papers/identifying-word-translations-from-comparable-corpora-using-latent-topic-models.

Stefan Wager, Sida Wang, and Percy S Liang. 2013. Dropout Training as Adaptive Regularization. In NIPS. http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf.

Michael Wick, Pallika Kanani, and Adam Pocock. 2016. Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching. In AAAI. http://www.aaai.org/Conferences/AAAI/2016/Papers/15Wick12464.pdf.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In NAACL-HLT. http://aclanthology.info/papers/normalized-word-embedding-and-orthogonal-transform-for-bilingual-word-translation.

Hyejin Youn, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft, and Tanmoy Bhattacharya. 2016. On the universal structure of human lexical semantics. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1520752113.

Meng Zhang, Yang Liu, Huanbo Luan, Yiqun Liu, and Maosong Sun. 2016a. Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization. In COLING. http://aclanthology.info/papers/inducing-bilingual-lexica-from-non-parallel-data-with-earth-mover-s-distance-regularization.

Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In AAAI. http://thunlp.org/~zm/publications/aaai2017.pdf.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016b. Ten Pairs to Tag – Multilingual POS Tagging via Coarse Mapping between Embeddings. In NAACL-HLT. http://aclanthology.info/papers/ten-pairs-to-tag-multilingual-pos-tagging-via-coarse-mapping-between-embeddings.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In EMNLP. http://aclanthology.info/papers/bilingual-word-embeddings-for-phrase-based-machine-translation.
