Machine translation has a mixed reputation when it comes to translating languages, but in recent
years, and even before recent improvements, it has served valid purposes and improved
the efficiency of human translation. How
machine translation works behind the scenes, however, is often left as a black box (a
program with unknown inner workings). Coming to a basic understanding of machine
translation can help translators properly use, and evaluate the trustworthiness of,
machine translation. This paper is intended for the curious translator as well as the
beginning computational linguist or researcher of statistical machine translation,
which is the most widely used form of machine translation today. The concepts
outlined in this paper are based on a highly cited paper from 1993, “The
mathematics of statistical machine translation: Parameter estimation,” as well as
Knight’s “A statistical MT tutorial workbook” and Knight and Koehn’s “What's New in
Statistical Machine Translation,” both of which build on the paper previously
mentioned, which I refer to as “Brown’s paper.” For the average reader, I’d
recommend reading at least the next four paragraphs, or even just the next one.
Introduction
The general idea behind statistical machine translation is the following:
There’s a sentence we want to translate from French to English. Since we
have a large number of parallel texts of English and French translations, we’ll use
these to determine a statistical probability that a given English sentence corresponds
to the French sentence, do this for many English sentences, and then pick the
English sentence with the highest probability. The same principle applies to a word,
a paragraph, or a whole text.
The probability of the English translation is determined using a
commonly used equation called Bayes’ rule, which is the following:
P(s|o) = P(o|s)P(s)/P(o), which means the probability of the state given the
observation, P(s|o), equals the probability of the observation given the state, P(o|s),
times the probability of the state happening in general, P(s), divided by the
probability of the observation happening in general, P(o). In this case the state is the
English translation and the observation is the original French sentence. Since P(o),
the probability of the French sentence, is the same for every English translation (the
state), and we’re only concerned with comparing the probabilities of different English
translations, we need only consider P(o|s)P(s).
Now we can replace the variables of the equation with those used in Brown’s
paper: the state s becomes e, the English sentence, and the observation o becomes f,
the French sentence. And thus we arrive at the fundamental theorem of machine
translation (Brown 1993): ê = argmax_e P(e)P(f|e). This means that the best English
translation is the English sentence that maximizes the product P(f|e)P(e).
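As a minimal sketch of this decision rule (the two scoring functions below are invented toy numbers standing in for real trained models), the following Python code picks the candidate English sentence that maximizes P(f|e)P(e):

    # Toy sketch of the noisy-channel decision rule: choose the English sentence e
    # that maximizes P(f|e) * P(e). Both probability functions are placeholders;
    # a real system would use a trained translation model and language model.

    def translation_prob(f, e):
        # P(f|e): invented scores for illustration only
        scores = {("le chien noir", "the black dog"): 0.7,
                  ("le chien noir", "the dog black"): 0.7,
                  ("le chien noir", "black the dog"): 0.6}
        return scores.get((f, e), 0.001)

    def language_model_prob(e):
        # P(e): invented fluency scores for illustration only
        scores = {"the black dog": 0.05, "the dog black": 0.001, "black the dog": 0.0005}
        return scores.get(e, 1e-6)

    def best_translation(f, candidates):
        return max(candidates, key=lambda e: translation_prob(f, e) * language_model_prob(e))

    print(best_translation("le chien noir",
                           ["the black dog", "the dog black", "black the dog"]))
    # -> "the black dog": the language model breaks the tie left by the translation model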
To calculate this, we must understand the meaning behind P(f|e) and P(e).
P(f|e) in terms of Bayes’ rule represents the “likelihood”: how likely is it that the
French sentence would be a translation of (or would occur given) the English
sentence? P(e) represents the “prior” (our prior knowledge): how likely is it that the
English sentence would ever be used?
As stated in Brown’s paper, there are three computational challenges in
relation to the fundamental theorem of statistical machine translation: “estimating the
language model probability, Pr(e); estimating the translation model probability,
Pr(f|e); and devising an effective and efficient suboptimal search for the English
string that maximizes their product” (emphasis added). We will discuss these
challenges one at a time.
Language Model
The language model, P(e), is what verifies that the English sentence is correct
and fluent English. As a translator, one of my most useful tools is Google search,
where how often a particular word or phrase appears in the results serves as a
general guide to how often it is used and therefore whether it sounds fluent. If a phrase of a
few words returns no results, then it is not very likely, though certainly not impossible,
that it is fluent English. This is how a prior is often calculated, as a relative
frequency over all possible options (for example, if the word “dog” appears 50,000
times in 1,000,000 sentences, then it has a prior of 50,000/1,000,000). While this
describes the general idea of a prior, machine translation requires a language model
that effectively looks at multiple words at a time instead of only looking at each word
independently.
A solution to this is to use n-grams, which are contiguous sequences of n
words. Consequently, n-grams account for the ordering of words in a sentence. This
is how statistical computer programs can check for correct grammar or language
without knowing grammar rules.
We will consider n-grams where n = 3, also known as trigrams. Given the sentence
“I have a black dog,” we can break this down into three-word parts. As explained in
the previous paragraph, the program can look at the trigram “I have a” and see how
many times this phrase appears in the database of example sentences. Next it can
look at “have a black,” and so on. The phrase “I have a” will appear many times more
often than “a have I.” Trigrams are of sufficient length to account for many issues of
grammar and fluency, but calculating the probabilities for larger n-grams as well will
give better results, at the cost of more computation (calculating the probabilities of
only larger n-grams wouldn’t account well for phrases of n words that are less
common but still legitimate). A model like this doesn’t need to be perfect or work well
in every example. It only requires that correct English has higher probabilities than
incorrect English, at least most of the time.
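A minimal sketch of this counting idea in Python (the tiny corpus below is invented for illustration): collect trigram counts from example sentences, then look up how often a candidate phrase was seen.

    from collections import Counter

    def trigrams(sentence):
        # Return the list of three-word sequences in a tokenized sentence.
        words = sentence.split()
        return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

    # A toy "database" of example sentences; a real system would use millions.
    corpus = [
        "i have a black dog",
        "i have a small cat",
        "she has a black cat",
    ]

    counts = Counter(t for sentence in corpus for t in trigrams(sentence))

    print(counts[("i", "have", "a")])   # 2: seen twice, so it looks like fluent English
    print(counts[("a", "have", "i")])   # 0: never seen, so it gets little or no support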
Now we define the term model: a set of numbers, or parameters, that are used to
calculate the probabilities needed for the fundamental theorem of machine
translation, along with equations for doing so. We determine these numbers with the
help of statistics, and they may change as the program learns from previous results.
In order for our n-gram model to work, we need weights (coefficients) to combine
n-grams with different values of n (number of words), reflecting how much each
part of the n-gram contributes to the overall estimate for a particular phrase.
Knight’s Workbook describes this:
The only way you'll get a zero probability is if the sentence contains a
previously unseen bigram or trigram. That can happen. In that case, we can
do smoothing. If “z” never followed “xy” in our text, we might further wonder
whether “z” at least followed “y”. If it did, then maybe “xyz” isn't so bad. If it
didn't, we might further wonder whether “z” is even a common word or not. If
it's not even a common word, then “xyz” should probably get a low probability.
Instead of
b(z | x y) = number-of-occurrences(“xyz”) / number-of-occurrences(“xy”)
we can use
b(z | x y) = 0.95 * number-of-occurrences(“xyz”) / number-of-occurrences(“xy”) +
0.04 * number-of-occurrences(“yz”) / number-of-occurrences(“y”) +
0.008 * number-of-occurrences(“z”) / total-words-seen +
0.002
Different smoothing coefficients may work better for different situations, and you can
determine the statistically best smoothing coefficients by training the language model
(Knight 1999). Note that the 0.002 added at the end means that a given n-gram or
phrase will never have a probability of 0, even if it has never been seen before in the
training data.
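The interpolation in the quote can be written directly as code. This sketch uses the coefficients from Knight's example; the tiny corpus is invented, and a real system would compute the counts over a large training collection.

    from collections import Counter

    # Toy counts; in practice these come from a large training corpus.
    corpus = "i have a black dog . i have a small cat . she has a black cat .".split()

    unigrams = Counter(corpus)
    bigrams  = Counter(zip(corpus, corpus[1:]))
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    total_words = len(corpus)

    def smoothed_trigram_prob(x, y, z):
        # Interpolated estimate of P(z | x y) with the coefficients from Knight's example.
        p3 = trigrams[(x, y, z)] / bigrams[(x, y)] if bigrams[(x, y)] else 0.0
        p2 = bigrams[(y, z)] / unigrams[y] if unigrams[y] else 0.0
        p1 = unigrams[z] / total_words
        return 0.95 * p3 + 0.04 * p2 + 0.008 * p1 + 0.002

    print(smoothed_trigram_prob("i", "have", "a"))   # seen trigram: high probability
    print(smoothed_trigram_prob("a", "have", "i"))   # unseen trigram: small but nonzero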
A language model can be trained, and compared to other language models,
using the same Bayes’ rule. Borrowing again from Knight: P(model | test-data) =
P(model) * P(test-data | model) / P(test-data). We assume that P(model) and P(test-data) are
the same for each model. We can then choose the model that maximizes P(test-data
| model), the probability that the model assigns to the test data (a good model gives
correct English a high probability). It’s worth mentioning the machine learning principle that, to
be confident of our result, the data we use to train the model must be different from the data
we use to test the model at the end.
A common measure of a translation model is perplexity, which uses
P(test-data | model), which is also the same as P(e), or in other words, the
probability of the English test data or alignment occurring. The perplexity of a model
is 2^(-log2(P(e)) / N), where N is the number of words in the test data. This equation
scales well for large test sets, which would otherwise produce extremely small
products of probabilities, and it allows us to compare datasets by normalizing over N,
the number of words in the test data.
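As a small sketch (the log-probabilities and word counts below are invented), per-word perplexity can be computed directly from a model's base-2 log probability of the test data. Working in log space also avoids the underflow that the tiny products of probabilities would otherwise cause.

    def perplexity(log2_prob_of_test_data, num_words):
        # Per-word perplexity: 2 ** (-log2 P(test-data | model) / N).
        return 2 ** (-log2_prob_of_test_data / num_words)

    # Suppose a model assigns the test data a total probability of 2 ** -350
    # over a 100-word test set (made-up numbers for illustration):
    print(perplexity(-350.0, 100))   # 2 ** 3.5, roughly 11.3

    # A better model assigns higher probability to the same data, so perplexity drops:
    print(perplexity(-300.0, 100))   # 2 ** 3.0 = 8.0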
There are other language models that can be used besides n-grams,
including grammar trees that look at sentence parts such as nouns, direct objects,
etc. Linguistic models can be used in conjunction with statistical models, which is a
promising path for the future of machine translation. Brown stated it this way:
But it is not our intention to ignore linguistics, neither to replace it. Rather, we
hope to enfold it in the embrace of a secure probabilistic framework so that
the two together may draw strength from one another and guide us to better
natural language processing systems in general and to better machine
translation systems in particular.
Language is too complex to program every possible scenario or to account for
change. We can use human-defined linguistic models to prefer language that follows
certain rules and to know how best to break apart and evaluate language, but only
the language itself can create a model that encompasses its complexity, and this is
what a statistical language model does.
Translation Model
A translation model is used to calculate P(f|e), the probability that the original
French sentence corresponds to a given English sentence (which is a potential
translation). The translation model isn’t concerned with producing well-formed
English strings; that is what the language model does. The translation model uses
many training examples to determine the probability of certain words and phrases in
French being the translation for a word in English (Brown’s model doesn’t account
for multiple English words corresponding to a word or words in French, but it could
be adapted for that purpose). Note that the translation model in this case is
translating from English to French. Then, the program uses the translation model to
go through a large number of English sentences, checks how likely it is that the
French original is a translation, and then picks the English sentence with the highest
probability.
The translation model needs to determine which words in English correspond
to certain words in French. This is accomplished by going through many parallel
texts of English and French and keeping track of which words in a French sentence
are most often found in a translation of an English sentence containing a particular
word, and this is done for every English word. In this way, the statistical system
determines how words are translated by probabilities, and not by a user-defined
glossary.
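A very naive sketch of this counting step (the three sentence pairs are invented): for each English word, count which French words appear in translations of sentences containing it, then turn the counts into relative frequencies. Brown's models refine this with expected (fractional) counts rather than raw co-occurrence.

    from collections import Counter, defaultdict

    # Tiny invented parallel corpus: (English sentence, French translation) pairs.
    parallel = [
        ("the dog", "le chien"),
        ("the black dog", "le chien noir"),
        ("the cat", "le chat"),
    ]

    cooccur = defaultdict(Counter)
    for english, french in parallel:
        for e_word in english.split():
            for f_word in french.split():
                cooccur[e_word][f_word] += 1

    def translation_probs(e_word):
        # Relative co-occurrence frequency of each French word with e_word.
        counts = cooccur[e_word]
        total = sum(counts.values())
        return {f: c / total for f, c in counts.items()}

    print(translation_probs("dog"))
    # "le" and "chien" tie here; EM training (discussed later) shifts the weight onto
    # "chien", since "le" is already explained by "the".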
The words in an English sentence must be aligned with the words of a French
translation in order to know which words correspond to which, especially when
certain words frequently appear in the same sentence. In other words, we want to
account for word order, and not just the fact that a word is commonly found in the
translation of a particular word. An alignment is a set of numbers for each word in
English that corresponds to a position in the sentence of a French translation. Even if
the correct set of words is found for a translation, the language model cannot
determine the ordering on its own, because there may be different legitimate English
sentences with the same set of words but different meanings. We also
want to narrow down the possibilities to sentences that are more likely to be correct
English.
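For example, an alignment can be stored simply as a list of positions, one per English word (a hypothetical illustration following the definition above):

    # Hypothetical alignment for "the black dog" -> "le chien noir":
    # English position 0 ("the")   -> French position 0 ("le")
    # English position 1 ("black") -> French position 2 ("noir")
    # English position 2 ("dog")   -> French position 1 ("chien")
    english = ["the", "black", "dog"]
    french  = ["le", "chien", "noir"]
    alignment = [0, 2, 1]   # one French position per English word

    for e_pos, f_pos in enumerate(alignment):
        print(english[e_pos], "->", french[f_pos])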
Since determining the probabilities of corresponding positions in the
translation involves multiplying probabilities, the probability of changing the position
(called distortion) of multiple words corresponding to a single English word would be
low. Brown’s paper suggests solving this problem by considering the first
corresponding French word, the “head,” independently, and then using separate
parameters for the remaining corresponding words.
Another issue is determining how many words in French correspond to a word
in an English sentence, which is referred to as fertility. The probable fertilities of a
word are determined by examining parallel texts for correlations such as sentences
containing a particular word in English having longer translations in French.
Now we can discuss how these ideas of a translation model are put together.
Brown’s paper describes five translation models, all of which are used in the final
machine translation system. Model 1 finds French words that commonly appear in
the translations of sentences containing an English word, for all words. Model 2
accounts for the position of the words in the sentence. Model 3 has parameters for “a
set of fertility probabilities, a set of translation probabilities, and a set of distortion
probabilities” (Brown). Model 4 accounts for phrases that change position in the
target text as units. However, Models 3 and 4 are deficient, meaning that they don’t
“[concentrate] all of [their] probability on events of interest” (Brown). This is because
the sets of words for which these models create probabilities are not all valid
sentences, as there can be multiple words with the same position in the sentence as
well as positions in the sentence that have no word. Model 5 resolves this problem
by enforcing that each word in the French string has a position, and there can only
be one word at each position. The disadvantage of Model 5 is that it is not as flexible
(as a direct result of requiring strict positioning of words), or able to calculate
probabilities for similar sentences, because doing so would require recomputing the
likelihood of the alignment. Model 5 therefore uses information from the previous
models to then produce likely valid sentences.
Each of the five models has advantages and disadvantages, as they account
for different aspects of the sentences and have different levels of flexibility. Brown’s
paper provides an example of iteratively combining these models. For calculating the
probabilities of sentence alignments, they run Model 1 once, Model 2 six times, Model
3 once, and Model 5 four times. At each iteration (run of a model) they also run
either the same or the next model to calculate the new counts, which are used for
the parameters of the model in the next iteration. This way, the translation model
improves incrementally, and by using a threshold to keep only the best sentence
alignments, or translations, fewer and fewer possible sentences remain after each
iteration. The incremental use of these models is called an expectation-maximization
(EM) algorithm. Brown’s paper describes this iterative improvement:
We cannot create a good model or find good parameter values at a stroke. Rather
we employ a process of iterative improvement. For a given model we use current
parameter values to find better ones, and in this way, from initial values we find
locally optimal ones. Then, given good parameter values for one model, we use them
to find initial parameter values for another model. By alternating between these two
steps we proceed through a sequence of gradually more sophisticated models.
This is how the translation model is trained. The training described in Brown’s paper
resulted in improved (decreased) perplexity after each iteration. A good machine
translation model has a high P(e) and a low perplexity.
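As a compact sketch of this expectation-maximization idea, the code below trains only the simplest case, a Model 1-style table of word-translation probabilities t(f|e), on an invented three-sentence corpus. It is an illustration, not Brown's full training pipeline, which layers distortion, fertility, and the later models on top of counts like these.

    from collections import defaultdict

    # Toy parallel corpus (English words, French words); Model 1 ignores word order.
    pairs = [(("the", "dog"), ("le", "chien")),
             (("the", "cat"), ("le", "chat")),
             (("a", "dog"),   ("un", "chien"))]

    english_vocab = {e for es, _ in pairs for e in es}
    french_vocab  = {f for _, fs in pairs for f in fs}

    # Start from uniform translation probabilities t(f | e).
    t = {e: {f: 1.0 / len(french_vocab) for f in french_vocab} for e in english_vocab}

    for _ in range(10):
        # E-step: expected (fractional) counts of each (English, French) word pair.
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[e][f] for e in es)
                for e in es:
                    frac = t[e][f] / norm
                    count[e][f] += frac
                    total[e] += frac
        # M-step: re-normalize the expected counts into new probabilities.
        for e in english_vocab:
            for f in french_vocab:
                if total[e] > 0:
                    t[e][f] = count[e][f] / total[e]

    print(round(t["dog"]["chien"], 3))  # rises toward 1.0 over the iterations
    print(round(t["dog"]["le"], 3))     # falls, since "le" is better explained by "the"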
The EM algorithm performs what is called a local (as opposed to global)
hill-climbing search, which means that it will improve the parameters of the
translation model starting from the original parameters, and it may not find the best
possible set of parameters. Conceptually, after reaching the high point of a
suboptimal hill, it won’t go down or move elsewhere in order to then find the highest
hill. Consequently, the choice of original parameters to initially train the model could
affect the resulting accuracy of the translation model. This issue of being stuck at a
local maximum can be resolved to some degree (but not completely) by using
common machine learning techniques, such as considering other similar alignments
and adding some random change.
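One common technique of this kind is random restarts: run the training several times from different random starting parameters and keep the run that ends with the highest likelihood. The sketch below illustrates only that idea; train_em and likelihood are invented stand-ins for a real EM trainer and scoring function.

    import random

    def train_em(initial_params):
        # Stand-in for an EM trainer; the "training" just climbs to whichever of two
        # local optima is nearest, to illustrate getting stuck on one hill.
        return min(1.0, 4.0, key=lambda peak: abs(peak - initial_params))

    def likelihood(params):
        # Stand-in scoring with a poor local optimum at 1.0 and a better one at 4.0.
        return {1.0: 0.6, 4.0: 0.9}[params]

    random.seed(0)
    best = max((train_em(random.uniform(0.0, 5.0)) for _ in range(5)), key=likelihood)
    print(best)   # with several restarts, at least one start usually lands near the better hill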
Decoding
With a trained language model as the prior and a trained translation model as
the likelihood for English sentences, these two models can be combined in the
fundamental theorem of machine translation to find the English sentence with the
maximum probability. The translation model helps provide likely candidates for the
English sentence given the original French sentence. There are ways to efficiently
search through possible translations, though the search through possibilities is of
necessity suboptimal (not guaranteed to return the best answer) because there are
too many possibilities and complications of language to compute a guaranteed
optimal translation. Finding the English sentence that maximizes P(f|e)P(e) is called
decoding.
There are many ways of implementing a decoder, or search for the best
translation among the possible options (using the language and translation models to
determine the probabilities of each option) (Knight 2003):
A greedy decoder looks at nearby (involving little change) possible variations
of a translation and then uses the new version if it results in a higher probability. By
keeping track of at least the last couple of iterations of this, it can avoid some local
optima and improve with relatively little computation.
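A hedged sketch of the greedy idea (the scoring function and the candidate edits are placeholders): start from some translation, try small local changes such as swapping adjacent words, and keep a change whenever it raises the combined score.

    def local_variations(words):
        # Small "nearby" changes: here, just swaps of adjacent words.
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            yield swapped

    def greedy_decode(initial_words, score):
        # Climb to a locally best translation under the given scoring function.
        current = list(initial_words)
        improved = True
        while improved:
            improved = False
            for candidate in local_variations(current):
                if score(candidate) > score(current):
                    current, improved = candidate, True
                    break
        return current

    # Placeholder score standing in for P(f|e) * P(e); a real decoder would
    # consult the trained translation and language models here.
    def toy_score(words):
        return 1.0 if words == ["the", "black", "dog"] else 0.5 if "dog" in words else 0.1

    print(greedy_decode(["black", "the", "dog"], toy_score))  # -> ['the', 'black', 'dog']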
A beam search looks at a certain number of different options and then
explores nearby options in different directions based on each of the original
options. As the search expands, only the best options are explored further.
This method can become very computationally expensive with long sentences.
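A rough sketch of the beam idea under similarly invented toy scoring and expansion functions: keep only the k best partial hypotheses at each step instead of exploring every option.

    import heapq
    import itertools

    def beam_search(start, expand, score, beam_width=3, steps=5):
        # Keep only the beam_width best hypotheses after each round of expansion.
        beam = [start]
        for _ in range(steps):
            candidates = list(itertools.chain.from_iterable(expand(h) for h in beam))
            if not candidates:
                break
            beam = heapq.nlargest(beam_width, candidates, key=score)
        return max(beam, key=score)

    # Toy problem: build a three-word sentence one word at a time from a tiny vocabulary.
    vocabulary = ["the", "black", "dog"]
    target = ["the", "black", "dog"]

    def expand(hypothesis):
        return [] if len(hypothesis) >= 3 else [hypothesis + [w] for w in vocabulary]

    def toy_score(hypothesis):
        # Placeholder standing in for the combined translation/language model score:
        # reward words that match the target in the right position.
        return sum(h == t for h, t in zip(hypothesis, target)) - 0.01 * len(hypothesis)

    print(beam_search([], expand, toy_score))   # -> ['the', 'black', 'dog']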
Other methods from Knight’s presentation include finite state transducers,
integer programming, and using a string-to-tree model with dynamic programming.
Additional methods
Statistical machine translation as discussed in this paper goes directly from
words in one language to words in the other. It doesn’t use a linguistic model for
syntax or semantics. However, such a model can be combined with statistical
machine translation to achieve better or more confident results. One example of this
is to use statistics to create an English syntax tree parser: statistics gathered from many
example sentences are used to create the syntax rules used to evaluate the English.
Adding an additional model such as this can be especially successful with languages
such as English that have very large corpora, or training data, available.
An English sentence would then be parsed (evaluated) with the use of the syntax
tree sentence parts such as noun phrases and verb phrases (which in turn contain a
subject, verb, another verb phrase, etc.).
Another method of machine translation, with varying success, is the use of an
interlingua: syntax and semantics are used to represent a given text by
standard meanings that can be shared between all languages, and the meaning of
the interlingua is then re-created with the syntax and semantics of the target language.
This can be done using a statistical model to find clusterings of words and potentially
related meanings using large amounts of text. One hope for interlingua is that it
could be used to translate between many languages with the same model. However,
it is difficult and perhaps impossible to define all the meanings of a language through
a computer system, and there are meanings in one language that don’t exist in the
same way in other languages.
Conclusion
Statistical machine translation overcomes many difficulties of language by
allowing the computer, and statistics, to determine how language should best be
translated. While machine translation usually does not produce a professional-level
translation without post-editing, machine translation is a powerful tool, filled with a
database of language and statistical probabilities that no human can match on their
own. This tool, when properly understood and used, can enable translators to
improve both the efficiency and the quality of their translations, because it is based
on a great amount of previous human translations. While I still often search for words
or phrases in Google when translating to evaluate what is most commonly used,
statistical machine translation essentially does this for us, and much more.
As machine translation moves forward, I expect it will continue to involve
statistical evaluation combined with linguistic models created with the aid of human
knowledge and statistics. There are many ways of implementing machine translation,
there has been great success so far, and there is more to come.
Bibliography
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., & Mercer, R. L. (1993). The
mathematics of statistical machine translation: Parameter estimation. Computational
Linguistics, 19(2), 263-311.
Knight, K. (1999). A statistical MT tutorial workbook.
Knight, K., & Koehn, P. (2003, May). What's new in statistical machine translation.
In HLT-NAACL (pp. 5-5).