
Scalable Decipherment for Machine Translation via Hash Sampling

Sujith Ravi
Google
Mountain View, CA 94043
sravi@google.com

Abstract

In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only non-parallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

1 Introduction

Statistical machine translation (SMT) systems these days are built using large amounts of bilingual parallel corpora. The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc. But obtaining parallel data is an expensive process and not available for all language pairs or domains. On the other hand, monolingual data (in written form) exists and is easier to obtain for many languages. Learning translation models from monolingual corpora could help address the challenges faced by modern-day MT systems, especially for low-resource language pairs. Recently, this topic has been receiving increasing attention from researchers, and new methods have been proposed to train statistical machine translation models using only monolingual data in the source and target language. The underlying motivation behind most of these methods is that statistical properties of linguistic elements are shared across different languages, and some of these similarities (mappings) could be automatically identified from large amounts of monolingual data.

The MT literature does cover some prior work on extracting or augmenting partial lexicons using non-parallel corpora (Rapp, 1995; Fung and McKeown, 1997; Koehn and Knight, 2000; Haghighi et al., 2008). However, none of these methods attempts to train end-to-end MT models; instead, they focus on mining bilingual lexicons from monolingual corpora, and they often require parallel seed lexicons as a starting point. Some of them (Haghighi et al., 2008) also rely on additional linguistic knowledge such as orthography to mine word translation pairs across related languages (e.g., Spanish/English). Unsupervised training methods have also been proposed in the past for related problems in decipherment (Knight and Yamada, 1999; Snyder et al., 2010; Ravi and Knight, 2011a), where the goal is to decode unknown scripts or ciphers.

The body of work that is most closely related to ours is that of Ravi and Knight (2011b), who introduced a decipherment approach for training translation models using only monolingual corpora. Their best-performing method uses an EM algorithm to train a word translation model, and they show results on a Spanish/English task. Nuhn et al. (2012) extend the former approach and improve training efficiency by pruning translation candidates prior to EM training with the help of context similarities computed from monolingual corpora.

In this work we propose a new Bayesian inference method for estimating translation models from scratch using only monolingual corpora. Secondly, we introduce a new feature-based representation for sampling translation candidates that allows one to incorporate any amount of additional features (beyond simple bag-of-words) as side-information during decipherment training. Finally, we also derive a new accelerated sampling mechanism using locality sensitive hashing, inspired by recent work on fast, probabilistic inference for unsupervised clustering (Ahmed et al., 2012). The new sampler allows us to perform fast, efficient inference with more complex translation models (than previously used) and scale better to large vocabulary and corpora sizes compared to existing methods, as evidenced by our experimental results on two different corpora.

2 Decipherment Model for Machine Translation

We now describe the decipherment problem formulation for machine translation.

Problem Formulation: Given a source text f (i.e., source word sequences f1...fm) and a monolingual target language corpus, our goal is to decipher the source text and produce a target translation.

Contrary to standard machine translation training scenarios, here we have to estimate the translation model Pθ(f|e) parameters using only monolingual data. During decipherment training, our objective is to estimate the model parameters in order to maximize the probability of the source text f, as suggested by Ravi and Knight (2011b):

    arg max_θ ∏_f ∑_e P(e) · Pθ(f|e)    (1)

For P(e), we use a word n-gram language model (LM) trained on monolingual target text. We then estimate the parameters of the translation model Pθ(f|e) during training.

Translation Model: Machine translation is a much more complex task than solving other decipherment tasks such as word substitution ciphers (Ravi and Knight, 2011b; Dou and Knight, 2012). The mappings between languages involve non-determinism (i.e., words can have multiple translations), re-ordering of words can occur as grammar and syntax varies with language, and in addition word insertion and deletion operations are also involved.

Ideally, for the translation model P(f|e) we would like to use well-known statistical models such as IBM Model 3 and estimate its parameters θ using the EM algorithm (Dempster et al., 1977). But training becomes intractable with complex translation models, and scalability is also an issue when large corpora are involved and the translation tables become too huge to fit in memory. So, instead, we use a simplified generative process for the translation model, as proposed by Ravi and Knight (2011b) and used by others (Nuhn et al., 2012) for this task:

1. Generate a target (e.g., English) string e = e1...el, with probability P(e) according to an n-gram language model.

2. Insert a NULL word at any position in the English string, with uniform probability.

3. For each target word token ei (including NULLs), choose a source word translation fi, with probability Pθ(fi|ei). The source word may be NULL.

4. Swap any pair of adjacent source words fi−1, fi, with probability P(swap); set to 0.1.

5. Output the foreign string f = f1...fm, skipping over NULLs.

Previous approaches (Ravi and Knight, 2011b; Nuhn et al., 2012) use the EM algorithm to estimate all the parameters θ in order to maximize the likelihood of the foreign corpus. Instead, we propose a new Bayesian inference framework to estimate the translation model parameters. In spite of using Bayesian inference, which is typically slow in practice (with standard Gibbs sampling), we show later that our method is scalable and permits decipherment training using more complex translation models (with several additional parameters).
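As a concrete illustration of the five-step generative process described in Section 2, here is a minimal Python sketch that samples a source string from a toy target sentence. The vocabulary and the probability values in the translation table are invented for illustration only; they are not taken from the paper.

```python
import random

def generate_source(e, t_table, p_swap=0.1):
    """Sample a source string from a target sentence e following the
    simplified generative story: NULL insertion, word-for-word
    translation, adjacent swaps, and skipping NULLs on output."""
    e = list(e)
    # Step 2: insert a NULL word at a uniformly chosen position.
    e.insert(random.randrange(len(e) + 1), "NULL")
    # Step 3: choose a source translation f_i for each target token e_i
    # according to P(f_i | e_i); the source word may itself be NULL.
    f = [random.choices(list(t_table[ei]), weights=list(t_table[ei].values()))[0]
         for ei in e]
    # Step 4: swap adjacent source words with probability P(swap) = 0.1.
    for i in range(1, len(f)):
        if random.random() < p_swap:
            f[i - 1], f[i] = f[i], f[i - 1]
    # Step 5: output the foreign string, skipping over NULLs.
    return [fi for fi in f if fi != "NULL"]

# Toy translation table P(f|e); the entries are made up for illustration.
t_table = {
    "the":   {"la": 0.6, "el": 0.4},
    "house": {"casa": 0.9, "NULL": 0.1},
    "NULL":  {"de": 0.5, "NULL": 0.5},
}

random.seed(0)
print(generate_source(["the", "house"], t_table))
```

Step 1 (generating e from the language model) is omitted here; a fixed target sentence stands in for a draw from P(e).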
2.1 Adding Phrases, Flexible Reordering and Fertility to Translation Model

We now extend the generative process (described earlier) to more complex translation models.

Non-local Re-ordering: The generative process described earlier limits re-ordering to local or adjacent word pairs in a source sentence. We extend this to allow re-ordering between any pair of words in the sentence.

Fertility: We also add a fertility model Pθfert to the translation model using the formula:

    Pθfert = ∏_i nθ(φi|ei) · p1^φ0    (2)

    nθ(φi|ei) = (αfert · P0(φi|ei) + C^−i(ei, φi)) / (αfert + C^−i(ei))    (3)

where P0 represents the base distribution (which is set to uniform) in a Chinese Restaurant Process (CRP) for the fertility model, and C^−i represents the count of events occurring in the history excluding the observation at position i. (Each component in the translation model, i.e., the word/phrase translations Pθ(fi|ei), the fertility Pθfert, etc., is modeled using a CRP formulation.) φi is the number of source words aligned to (i.e., generated by) the target word ei. We use sparse Dirichlet priors for all the translation model components; that is, all the concentration parameters are set to low values (αf|e = αfert = 0.01). φ0 represents the target NULL word fertility, and p1 is the insertion probability, which is fixed to 0.1. In addition, we set a maximum threshold for fertility values, φi ≤ γ·m, where m is the length of the source sentence. This discourages a particular target word (e.g., the NULL word) from generating too many source words in the same sentence. In our experiments, we set γ = 0.3. We enforce this constraint during sampling, but only when training on source text/corpora made of long sentences (>10 words), where the sampler might otherwise converge very slowly; for short sentences, a sparse prior on fertility αfert typically discourages a target word from being aligned to too many different source words.

Modeling Phrases: Finally, we extend the translation candidate set in Pθ(fi|ei) to model phrases in addition to words on the target side (i.e., ei can now be a word or a phrase, limited to two words in our experiments, previously seen in the monolingual target corpus). This greatly increases the training time, since in each sampling step we now have many more ei candidates to choose from. In Section 4, we describe how we deal with this problem by using a fast, efficient sampler based on hashing that allows us to speed up the Bayesian inference significantly, whereas standard Gibbs sampling would be extremely slow.

3 Feature-based representation for Source and Target

The model described in the previous section, while flexible in describing the translation process, poses several challenges for training. As the source and target vocabulary sizes increase, the size of the translation table (|Vf| · |Ve|) increases significantly and often becomes too huge to fit in memory. Additionally, performing Bayesian inference with such a complex model using standard Gibbs sampling can be very slow in practice. Here, we describe a new method for doing Bayesian inference by first introducing a feature-based representation for the source and target words (or phrases), from which we then derive a novel proposal distribution for sampling translation candidates.

We represent both source and target words in a vector space, similar to how documents are represented in typical information retrieval settings. But unlike documents, here each word w is associated with a feature vector w1...wd (where wi represents the weight for the feature indexed by i), which is constructed from monolingual corpora. For instance, context features for word w may include other words (or phrases) that appear in the immediate context (n-gram window) surrounding w in the monolingual corpus. Similarly, we can add other features based on topic models, orthography (Haghighi et al., 2008), temporal information (Klementiev et al., 2012), etc. to our representation, all of which can be extracted from monolingual corpora. Next, given two high-dimensional vectors u and v, it is possible to calculate the similarity between the two words, denoted by s(u, v). The feature construction process is described in more detail below:

Target Language: We represent each word (or phrase) ei with the following contextual features along with their counts: (a) f−context: every (word n-gram, position) pair immediately preceding ei in the monolingual corpus (n=1, position=−1); (b) similar features f+context to model the context following ei; and (c) we also throw in generic context features fscontext without position information: every word that co-occurs with ei in the same sentence. While the two position features provide specific context information (and may be sparse for large monolingual corpora), this last feature is more generic and captures long-distance co-occurrence statistics.

Source Language: Words appearing in a source sentence f are represented using the corresponding target translation e = e1...em generated for f in the current sample during training. For each source word fj ∈ f, we look at the corresponding word ej in the target translation. We then extract all the context features of ej in the target translation sample sentence e and add these features (f−context, f+context, fscontext) with weights to the feature representation for fj.

Unlike the target word feature vectors (which can be pre-computed from the monolingual target corpus), the feature vector for every source word fj is dynamically constructed from the target translation sampled in each training iteration. This is a key distinction of our framework compared to previous approaches that use contextual similarity (or any other) features constructed from static monolingual corpora (Rapp, 1995; Koehn and Knight, 2000; Nuhn et al., 2012).

Note that as we add more and more features for a particular word (by training on larger monolingual corpora or adding new types of features, etc.), the feature representation becomes more sparse (especially for source feature vectors), which can cause problems in efficiency as well as robustness when computing similarity against other vectors. In the next section, we describe how we mitigate this problem by projecting into a low-dimensional space by computing hash signatures.

In all our experiments, we only use the features described above for representing source and target words. We note that the new sampling framework is easily extensible to many additional feature types (for example, monolingual topic model features, etc.) which can be efficiently handled by our inference algorithm and could further improve translation performance, but we leave this for future work.

4 Bayesian MT Decipherment via Hash Sampling

The next step is to use the feature representations described earlier and iteratively sample a target word (or phrase) translation candidate ei for every word fi in the source text f. This involves choosing from |Ve| possible target candidates in every step, which can be highly inefficient (and infeasible for large vocabulary sizes). One possible strategy is to compute similarity scores s(wfi, we′) between the current source word feature vector wfi and the feature vectors we′ ∈ Ve for all possible candidates in the target vocabulary. Following this, we can prune the translation candidate set by keeping only the top candidates e∗ according to the similarity scores. Nuhn et al. (2012) use a similar strategy to obtain a more compact translation table that improves runtime efficiency for EM training. Their approach requires calculating and sorting all |Ve| · |Vf| distances in time O(V² · log(V)), where V = max(|Ve|, |Vf|).

Challenges: Unfortunately, there are several additional challenges which make inference very hard in our case. Firstly, we would like to include as many features as possible to represent the source/target words in our framework besides simple bag-of-words context similarity (for example, left-context, right-context, and other general-purpose features based on topic models, etc.). This makes the complexity far worse (in practice), since the dimensionality of the feature vectors d is much higher than |Ve|. Computing similarity scores alone (naïvely) would incur O(|Ve| · d) time, which is prohibitively huge since we have to do this for every token in the source language corpus. Secondly, for Bayesian inference we need to sample from a distribution that involves computing probabilities for all the components (language model, translation model, fertility, etc.) described in Equation 1. This distribution needs to be computed for every source word token fi in the corpus, for all possible candidates ei ∈ Ve, and the process has to be repeated for multiple sampling iterations (typically more than 1000). Doing standard collapsed Gibbs sampling in this scenario would be very slow and intractable.

We now present an alternative fast, efficient inference strategy that overcomes many of the challenges described above and helps accelerate the sampling process significantly. First, we set our translation models within the context of a more generic and widely known family of distributions—mixtures of exponential families. Then we derive a novel proposal distribution for sampling translation candidates and introduce a new sampler for decipherment training that
is based on locality sensitive hashing (LSH).

Hashing methods such as LSH have been widely used in the past in several scenarios, including NLP applications (Ravichandran et al., 2005). Most of these approaches employ LSH within heuristic methods for speeding up nearest-neighbor lookup and similarity computation techniques. However, we use LSH within a probabilistic framework, which is very different from the typical use of LSH.

Our work is inspired by recent work by Ahmed et al. (2012) on speeding up Bayesian inference for unsupervised clustering. We use a similar technique to theirs but a different approximate distribution for the proposal, one that is better suited to machine translation models and without some of the additional overhead required for computing certain terms in the original formulation.

Mixtures of Exponential Families: The translation models described earlier (Section 2) can be represented as mixtures of exponential families, specifically mixtures of multinomials. In exponential families, distributions over random variables are given by:

    p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ))    (4)

where φ: X → F is a map from x to the space of sufficient statistics and θ ∈ F. The term g(θ) ensures that p(x; θ) is properly normalized. X is the domain of observations X = x1, ..., xm drawn from some distribution p. Our goal is to estimate p; in our case, this refers to the translation model from Equation 1.

We also choose corresponding conjugate Dirichlet distributions for priors, which have the property that the posterior distribution p(θ|X) over θ remains in the same family as p(θ).

Note that the (translation) model in our case consists of multiple exponential family components—a multinomial pertaining to the language model (which remains fixed; a high value for the LM concentration parameter α ensures that the LM probabilities do not deviate too far from the original fixed base distribution during sampling), and other components pertaining to the translation probabilities Pθ(fi|ei), fertility Pθfert, etc. To do collapsed Gibbs sampling under this model, we would perform the following steps during sampling:

1. For a given source word token fi, draw a target translation

    ei ∼ p(ei|F, E^−i) ∝ p(e) · p(fi|ei, F^−i, E^−i) · pfert(·|ei, F^−i, E^−i) · ...    (5)

where F is the full source text and E is the full target translation generated during sampling.

2. Update the sufficient statistics for the changed target translation assignments.

For large target vocabularies, computing p(fi|ei, F^−i, E^−i) dominates the inference procedure. We can accelerate this step significantly using a good proposal distribution via hashing.

Locality Sensitive Hash Sampling: For general exponential families, here is a Taylor approximation for the data likelihood term (Ahmed et al., 2012):

    p(x|·) ≈ exp(⟨φ(x), θ∗⟩ − g(θ∗))    (6)

where θ∗ is the expected parameter (sufficient statistics).

For sampling the translation model, this involves computing an expensive inner product ⟨φ(fi), θe′∗⟩ for each source word fi, which has to be repeated for every translation candidate e′, including candidates that have very low probabilities and are unlikely to be chosen as the translation for fi.

So, during decipherment training a standard collapsed Gibbs sampler will waste most of its time on expensive computations that will be discarded in the end anyway. Also, unlike some standard generative models used in other unsupervised learning scenarios (e.g., clustering) that model only observed features (namely, words appearing in the document), here we would like to enrich the translation model with many more features (side-information).

Instead, we can accelerate the computation of the inner product ⟨φ(fi), θe′∗⟩ using a hash sampling strategy similar to that of Ahmed et al. (2012). The underlying idea here is to use binary hashing (Charikar, 2002) to explore only those candidates e′ that are sufficiently close to the best matching translation, via a proposal distribution. Next, we briefly introduce some notation and existing theoretical results related to binary hashing before describing the hash sampling procedure.

For any two vectors u, v ∈ Rⁿ,

    ⟨u, v⟩ = ‖u‖ · ‖v‖ · cos ∠(u, v)    (7)
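Equation (7) reduces the inner product to norms and an angle; with the random-hyperplane binary hashing of Charikar (2002), the angle itself can be estimated from Hamming distance alone, anticipating the signature construction described next. A minimal self-contained sketch of this idea follows; the dimensionality, signature length, and vectors are toy values chosen for illustration, not the paper's settings.

```python
import math
import random

def signature(vec, planes):
    """l-bit binary hash of vec: bit i is sgn(<vec, w_i>) for a random
    Gaussian hyperplane w_i (random-hyperplane LSH)."""
    return [int(sum(a * b for a, b in zip(vec, w)) >= 0) for w in planes]

def estimated_angle(h_u, h_v):
    """Estimate the angle between u and v from the fraction of
    differing signature bits: angle(u, v) ~ pi * (Hamming distance / l)."""
    diff = sum(bu != bv for bu, bv in zip(h_u, h_v))
    return math.pi * diff / len(h_u)

random.seed(0)
l, d = 256, 50                 # signature bits and feature dimension (toy values)
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(l)]

u = [random.gauss(0, 1) for _ in range(d)]
v = [x + 0.2 * random.gauss(0, 1) for x in u]    # a vector close to u

dot = sum(a * b for a, b in zip(u, v))
norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
true_angle = math.acos(dot / norm)
est = estimated_angle(signature(u, planes), signature(v, planes))
print(true_angle, est)   # the two values should be close
```

Comparing signatures costs l bit comparisons regardless of the feature dimensionality d, which is the source of the speedup exploited by the hash sampler.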
    ∠(u, v) = π · Pr{sgn[⟨u, w⟩] ≠ sgn[⟨v, w⟩]}    (8)

where w is a random vector drawn from a symmetric spherical distribution, and the term inside Pr{·} represents the relation between the signs of the two inner products.

Let h^l(v) ∈ {0, 1}^l be an l-bit binary hash of v, where [h^l(v)]i := sgn[⟨v, wi⟩]; wi ∼ Um. Then the probability of matching signs is given by:

    z^l(u, v) := (1/l) · ‖h(u) − h(v)‖1    (9)

So z^l(u, v) measures how many bits differ between the hash vectors h(u) and h(v) associated with u, v. Combining this with Equations 6 and 7, we can estimate the unnormalized log-likelihood of a source word fi being translated as target e′ via:

    s^l(fi, e′) ∝ ‖θe′‖ · ‖φ(fi)‖ · cos(π · z^l(φ(fi), θe′))    (10)

For each source word fi, we now sample from this new distribution (after normalization) instead of the original one. The binary hash representation for the two vectors yields significant speedups during sampling, since Hamming distance computation between h(u) and h(v) is highly optimized on modern CPUs. Hence, we can compute an estimate for the inner product quite efficiently (we set l = 32 bits in our experiments).

Updating the hash signatures: During training, we compute the target candidate projection h(θe′) and the corresponding norm only once, which is different from the setup of Ahmed et al. (2012). (In practice, we can ignore the norm terms to further speed up sampling, since this is only an estimate for the proposal distribution and we follow it with the Metropolis Hastings step.) The source word projection φ(fi) is dynamically updated in every sampling step. Note that doing this naïvely would scale slowly as O(D·l), where D is the total number of features; instead, we can update the hash signatures in a more efficient manner that scales as O(D_{i>0}·l), where D_{i>0} is the number of non-zero entries in the feature representation for the source word φ(fi). Also, we do not need to store the random vectors w in practice, since these can be computed on the fly using hash functions. The inner product approximation also yields some theoretical guarantees for the hash sampler; for further details, please refer to (Ahmed et al., 2012).

4.1 Metropolis Hastings

In each sampling step, we use the distribution from Equation 10 as a proposal distribution in a Metropolis Hastings scheme to sample target translations for each source word. Once a new target translation e′ is sampled for source word fi from the proposal distribution q(·) ∝ exp(s^l(fi, e′)), we accept the proposal (and update the corresponding hash signatures) according to the probability r:

    r = [q(ei^old) · pnew(·)] / [q(ei^new) · pold(·)]    (11)

where pold(·) and pnew(·) are the true conditional likelihood probabilities according to our model (including the language model component) for the old and new sample, respectively.

5 Training Algorithm

Putting together all the pieces described in the previous section, we perform the following steps:

1. Initialization: We initialize the starting sample as follows: for each source word token, randomly sample a target word. If the source word also exists in the target vocabulary, then choose the identity translation instead of the random one. (Initializing with the identity translation rather than a random choice helps in some cases, especially for unknown words that involve named entities, etc.)

2. Hash Sampling Steps: For each source word token fi, run the hash sampler:
(a) Generate a proposal distribution by computing the Hamming distance between the feature vectors for the source word and each target translation candidate. Sample a new target translation ei for fi from this distribution.
(b) Compute the acceptance probability for the chosen translation using a Metropolis Hastings scheme and accept (or reject) the sample. In practice, the acceptance probability only needs to be computed every r iterations (where r can be anywhere from 5 to 100).
Iterate through steps (2a) and (2b) for every word in the source text and then repeat this process for multiple iterations (usually 1000).

3. Other Sampling Operators: After every k iterations (we set k = 3 in our experiments), perform the following sampling operations:
(a) Re-ordering: For each source word token fi at position i, randomly choose another position j
in the source sentence and swap the translations ei with ej. During the sampling process, we compute the probabilities for the two samples—the original and the swapped versions—and then sample an alignment from this distribution.
(b) Deletion: For each source word token, delete the current target translation (i.e., align it with the target NULL token). As with the re-ordering operation, we sample from a distribution consisting of the original and the deleted versions.

4. Decoding the foreign sentence: Finally, once the training is done (i.e., after all sampling iterations), we choose the final sample as our target translation output for the source text.

    Corpus  Language   Sent.     Words      Vocab.
    OPUS    Spanish    13,181    39,185     562
            English    19,770    61,835     411
    EMEA    French     550,000   8,566,321  41,733
            Spanish    550,000   7,245,672  67,446

    Table 1: Statistics of non-parallel corpora used here.

6 Experiments and Results

We test our method on two different corpora. To evaluate translation quality, we use BLEU score (Papineni et al., 2002), a standard evaluation measure used in machine translation.

First, we present MT results on non-parallel Spanish/English data from the OPUS corpus (Tiedemann, 2009), which was used by Ravi and Knight (2011b) and Nuhn et al. (2012). We show that our method achieves the best performance (BLEU scores) on this task while being significantly faster than both the previous approaches. We then apply our method to a much larger non-parallel French/Spanish corpus constructed from the EMEA corpus (Tiedemann, 2009). Here the vocabulary sizes are much larger, and we show how our new Bayesian decipherment method scales well to this task in spite of using complex translation models. We also report the first BLEU results on such a large-scale MT task under truly non-parallel settings (without using any parallel data or seed lexicon).

For both MT tasks, we also report BLEU scores for a baseline system using identity translations for common words (words appearing in both source/target vocabularies) and random translations for other words.

6.1 MT Task and Data

OPUS movie subtitle corpus (Tiedemann, 2009): This is a large open source collection of parallel corpora available for multiple language pairs. We use the same non-parallel Spanish/English corpus used in previous works (Ravi and Knight, 2011b; Nuhn et al., 2012). The details of the corpus are listed in Table 1. We use the entire Spanish source text for decipherment training and evaluate the final English output to report BLEU scores.

EMEA corpus (Tiedemann, 2009): This is a parallel corpus made out of PDF documents (articles from the medical domain) from the European Medicines Agency. We reserve the first 1k sentences in French as our source text (also used in decipherment training). To construct a non-parallel corpus, we split the remaining 1.1M lines as follows: the first 550k sentences in French, the last 550k sentences in Spanish. The latter is used to construct a target language model used for decipherment training. The corpus statistics are shown in Table 1.

6.2 Results

OPUS: We compare the MT results (BLEU scores) from different systems on the OPUS corpus in Table 2. The first row displays baseline performance. The next three rows 1a–1c display performance achieved by methods from Ravi and Knight (2011b). Rows 2a, 2b show results from the method of Nuhn et al. (2012). The last two rows display results for the new method using Bayesian hash sampling. Overall, using a 3-gram language model (instead of 2-gram) for decipherment training improves the performance for all methods. We observe that our method produces much better results than the others even with a 2-gram LM. With a 3-gram LM, the new method achieves the best performance: the highest BLEU score reported on this task. It is also interesting to note that the hash sampling method yields much better results than the Bayesian inference method presented in (Ravi and Knight, 2011b). This is due to the accelerated sampling scheme introduced earlier, which helps it converge to better solutions faster.

Table 2 (last column) also compares the efficiency of different methods in terms of CPU time required for training. Both our 2-gram and 3-gram based methods are significantly faster than those previously reported for EM based training methods presented in (Ravi and Knight, 2011b; Nuhn
    Method                                                              BLEU   Time (hours)
    Baseline system (identity translations)                             6.9
    1a. EM with 2-gram LM (Ravi and Knight, 2011b)                      15.3   ∼850h
    1b. EM with whole-segment LM (Ravi and Knight, 2011b)               19.3
    1c. Bayesian IBM Model 3 with 2-gram LM (Ravi and Knight, 2011b)    15.1
    2a. EM+Context with 2-gram LM (Nuhn et al., 2012)                   15.2   50h
    2b. EM+Context with 3-gram LM (Nuhn et al., 2012)                   20.9   200h
    3. Bayesian (standard) Gibbs sampling with 2-gram LM                       222h
    4a. Bayesian Hash Sampling* with 2-gram LM (this work)              20.3   2.6h
    4b. Bayesian Hash Sampling* with 3-gram LM (this work)              21.2   2.7h
    (* sampler was run for 1000 iterations)

Table 2: Comparison of MT performance (BLEU scores) and efficiency (running time in CPU hours) on the Spanish/English OPUS corpus using only non-parallel corpora for training. For the Bayesian methods 4a and 4b, the samplers were run for 1000 iterations each on a single machine (1.8GHz Intel processor). For 1a, 2a, 2b, we list the training times as reported by Nuhn et al. (2012) based on their EM implementation for different settings.

Method BLEU Spanish (e) French (f)


Baseline system (identity translations) 3.0 el → les
Bayesian Hash Sampling with 2-gram LM la → la
vocab=full (Ve ), add fertility=no 4.2
por → des
vocab=pruned∗ , add fertility=yes 5.3
sección → rubrique
Table 3: MT results on the French/Spanish EMEA administración → administration
corpus using the new hash sampling method. ∗ The
Table 4: Sample (1-best) Spanish/French transla-
last row displays results when we sample target
tions produced by the new method on the EMEA
translations from a pruned candidate set (most fre-
corpus using word translation models trained with
quent 1k Spanish words + identity translation can-
non-parallel corpora.
didates) which enables the sampler to run much
faster when using more complex models.
EMEA Results Table 3 shows the results achieved
et al., 2012). This is very encouraging since Nuhn by our method on the larger task involving EMEA
et al. (2012) reported obtaining a speedup by prun- corpus. Here, the target vocabulary Ve is much
ing translation candidates (to ∼1/8th the original higher (67k). In spite of this challenge and the
size) prior to EM training. On the other hand, we model complexity, we can still perform decipher-
sample from the full set of translation candidates ment training using Bayesian inference. We report
including additional target phrase (of size 2) can- the first BLEU score results on such a large-scale
didates which results in a much larger vocabulary task using a 2-gram LM. This is achieved without
consisting of 1600 candidates (∼4 times the orig- using any seed lexicon or parallel corpora. The re-
inal size), yet our method runs much faster and sults are encouraging and demonstrates the ability
yields better results. The table also demonstrates of the method to scale to large-scale settings while
the siginificant speedup achieved by the hash sam- performing efficient inference with complex mod-
pler over a standard Gibbs sampler for the same els, which we believe will be especially useful for
model (∼85 times faster when using a 2-gram future MT application in scenarios where parallel
LM). data is hard to obtain. Table 4 displays some sam-
We also compare the results against MT per- ple 1-best translations learned using this method.
formance from parallel training—MOSES sys- For comparison purposes, we also evaluate MT
tem (Koehn et al., 2007) trained on 20k sentence performance on this task using parallel training
pairs. The comparable number for Table 2 is 63.6 (MOSES trained with hundred sentence pairs) and
BLEU. observe a BLEU score of 11.7.
7 Discussion and Future Work

There exists some work (Dou and Knight, 2012; Klementiev et al., 2012) that uses monolingual corpora to induce phrase tables, etc. These, when combined with standard MT systems such as Moses (Koehn et al., 2007) trained on parallel corpora, have been shown to yield some BLEU score improvements. Nuhn et al. (2012) show some sample English/French lexicon entries learnt using the EM algorithm with a pruned translation candidate set on a portion of the Gigaword corpus11 but do not report any actual MT results. In addition, as we showed earlier, our method can use Bayesian inference (which has several desirable properties compared to EM for unsupervised natural language tasks (Johnson, 2007; Goldwater and Griffiths, 2007)) and still scale easily to large vocabulary and data sizes while allowing the models to grow in complexity. Most importantly, our method produces better translation results (as demonstrated on the OPUS MT task). And to our knowledge, this is the first time that anyone has reported MT results under truly non-parallel settings on such a large-scale task (EMEA).

11 http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2003T05

Our method is also easily extensible to out-of-domain translation scenarios similar to (Dou and Knight, 2012). While their work also uses Bayesian inference with a slice sampling scheme, our new approach uses a novel hash sampling scheme for decipherment that can easily scale to more complex models. The new decipherment framework also allows one to easily incorporate additional information (besides standard word translations) as features (e.g., context features, topic features, etc.) for unsupervised machine translation, which can help further improve performance in addition to accelerating the sampling process. We already demonstrated the utility of this system by going beyond words and incorporating phrase translations in a decipherment model for the first time.

In the future, we can obtain further speedups (especially for large-scale tasks) by parallelizing the sampling scheme seamlessly across multiple machines and CPU cores. The new framework can also be stacked with complementary techniques such as slice sampling and blocked (and type) sampling to further improve inference efficiency.

8 Conclusion

To summarize, our method is significantly faster than previous methods based on EM or on Bayesian inference with standard Gibbs sampling, and it obtains better results than any previously published methods for the same task. The new framework also allows performing Bayesian inference for decipherment applications with more complex models than previously shown. We believe this framework will be useful for further extending MT models in the future to improve translation performance, and for many other unsupervised decipherment application scenarios.

References

Amr Ahmed, Sujith Ravi, Shravan Narayanamurthy, and Alex Smola. 2012. FastEx: Hash clustering with exponential families. In Proceedings of the 26th Conference on Neural Information Processing Systems (NIPS).

Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pages 380–388.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 266–275.

Pascale Fung and Kathleen McKeown. 1997. Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora, pages 192–202.

Sharon Goldwater and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL: HLT, pages 771–779.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305.
Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.

Kevin Knight and Kenji Yamada. 1999. A computational approach to deciphering unknown scripts. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 37–44.

Philipp Koehn and Kevin Knight. 2000. Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 711–715.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.

Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering foreign language by combining language models and context vectors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 156–164.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 320–322.

Sujith Ravi and Kevin Knight. 2011a. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 239–247.

Sujith Ravi and Kevin Knight. 2011b. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 12–21.

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 622–629.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1048–1057.

Jörg Tiedemann. 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248.
