
Statistical Machine Translation
Pasindu Nivanthaka Tennage
Computer Science and Engineering
University of Moratuwa
Content
1. Overview
2. Language Model
3. Translation Model - Word Based translation
4. Translation Model - Phrase Based translation
5. Decoder
6. Evaluation
Overview
[Noisy-channel pipeline diagram]

German --(Translation Model, p(f|e), trained on a German-English parallel corpus)--> Broken English

Broken English --(Language Model, p(e), trained on English text)--> English

A decoding algorithm combines the two models to search for the best English output.
Language Model
1. Word Order

Broken: "Machine translation the most fun research topics one is"

Fluent: "Machine translation is one of the most fun research areas"


Language Model
2. Word Choice

"She is in the room" or "she is on the room"?

Using knowledge of the target language, the language model can rearrange and correct the broken output.
Language Modelling
Modeling the fluency of the target language.

Language model (LM):

Good English string -> high p(e)

The LM is not concerned with translation at all.


Example LM method
p(John loves Mary) = count(John loves Mary) / count(all sentences)

WHAT IF "JOHN LOVES MARY" IS NOT IN THE CORPUS???

Commonly used methods

1. Parsing
2. Sequence Models
Parsing

[Parse tree of "John loves Mary": S -> NP VP; NP -> N -> John; VP -> V N; V -> loves; N -> Mary]

p(John loves Mary) =
p(S -> NP VP) * p(NP -> N) * p(VP -> V N) * p(N -> John) * p(V -> loves) * p(N -> Mary)
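As a rough illustration, the Python sketch below scores this fixed parse tree by multiplying rule probabilities; the grammar, the rule probabilities, and all names are made-up assumptions, not values from the slides:

```python
# Minimal sketch: scoring a fixed parse tree with a PCFG.
# All rule probabilities below are made-up illustrative values.
from functools import reduce

rule_prob = {
    ("S", ("NP", "VP")): 0.9,
    ("NP", ("N",)): 0.5,
    ("VP", ("V", "N")): 0.4,
    ("N", ("John",)): 0.01,
    ("V", ("loves",)): 0.05,
    ("N", ("Mary",)): 0.01,
}

# The parse tree of "John loves Mary", written as the list of rules it uses.
tree_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("VP", ("V", "N")),
    ("N", ("John",)),
    ("V", ("loves",)),
    ("N", ("Mary",)),
]

# p(sentence) = product of the probabilities of all rules in the tree.
p = reduce(lambda acc, rule: acc * rule_prob[rule], tree_rules, 1.0)
print(p)  # ~9e-07 with the toy numbers above
```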


Sequence Modeling
Chain rule:

p(w1, w2, ..., wn) = p(w1) * p(w2|w1) * p(w3|w1, w2) * ... * p(wn|w1, ..., wn-1)

Example:

p(John loves Mary) = p(John) * p(loves|John) * p(Mary|John loves)

IS THIS CORRECT???
p(John loves Mary .) =
p(John|<s>) * p(loves|<s> John) * p(Mary|<s> John loves)
* p(.|<s> John loves Mary) * p(</s>|<s> John loves Mary .)

<s> : sentence start symbol

</s>: sentence end symbol

p(John|<s>): probability of John being the first word


ARE THERE ANY PROBLEMS WITH THE ABOVE APPROACH???
Markov Assumption

p(wn|w1, ..., wn-1) ≈ p(wn|wn-j, ..., wn-1) for n > j

p(John loves Mary .) =

p(John|<s>) * p(loves|John) * p(Mary|loves) * p(.|Mary) * p(</s>|.)
N-gram
An n-gram is a substring of n words:

-> Unigram (n = 1)

-> Bigram (n = 2)

-> Trigram (n = 3)
Example Bigram p(wi|wi-1)

p(John loves Mary .) =
p(John|<s>) * p(loves|John) * p(Mary|loves) * p(.|Mary) * p(</s>|.)

Example Trigram p(wi|wi-2 wi-1)

p(John loves Mary .) =
p(John|<s> <s>) * p(loves|<s> John) * p(Mary|John loves) * p(.|loves Mary) * p(</s>|Mary .)
Parameter Estimation
Maximum likelihood estimation -> a counting problem

p(wi|wi-1) = count(wi-1 wi) / count(wi-1)

p(wi|wi-2 wi-1) = count(wi-2 wi-1 wi) / count(wi-2 wi-1)
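A minimal Python sketch of these counts; the toy corpus and all names are illustrative assumptions:

```python
# Minimal sketch: maximum likelihood bigram estimates from a toy corpus.
from collections import Counter

corpus = [
    ["<s>", "John", "loves", "Mary", ".", "</s>"],
    ["<s>", "Mary", "loves", "John", ".", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:])
)

def p_bigram(w, prev):
    # p(w | prev) = count(prev w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("loves", "John"))  # count(John loves) / count(John) = 1/2
```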
Estimating from text
PROBLEM: What if an n-gram is missing from the corpus? Its estimated probability is zero.

SOLUTION: Take some probability mass away from the seen n-grams and distribute it
among the unseen ones - SMOOTHING

Example:

Without smoothing: p(Japan | she works in) > 0 but p(Malta | she works in) = 0

With smoothing: p(Malta | she works in) = 1/10000


Smoothing
1. Add-one - add one to every count (not a good option)

Not fair: all unseen events get the same boost ("work in India" vs "work in Kenya")

2. Backoff
3. Stupid backoff - used at GOOGLE
4. Linear interpolation
Backoff
Example:

p(Malta | work in) -> back off to p(Malta | in), or further to p(Malta)

Stupid Backoff

s(wi | wi-2 wi-1) = count(wi-2 wi-1 wi) / count(wi-2 wi-1)   if count(wi-2 wi-1 wi) > 0

s(wi | wi-2 wi-1) = λ * s(wi | wi-1)   otherwise, with λ = 0.4

(returns scores, not probabilities; only works well for a large corpus)
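A minimal Python sketch of stupid backoff; the count-dictionary layout, the toy corpus, and the function names are illustrative assumptions:

```python
# Minimal sketch: stupid backoff scoring over n-gram counts of order 1-3.
from collections import Counter

LAMBDA = 0.4

corpus = [["<s>", "she", "works", "in", "Japan", "</s>"]]
ngram_counts = Counter()
for sent in corpus:
    for n in (1, 2, 3):
        for i in range(len(sent) - n + 1):
            ngram_counts[tuple(sent[i:i + n])] += 1
total_words = sum(len(sent) for sent in corpus)

def stupid_backoff(w, context):
    """Score s(w | context); context is a tuple of up to two previous words."""
    if context:
        full = ngram_counts[context + (w,)]
        hist = ngram_counts[context]
        if full > 0 and hist > 0:
            return full / hist
        # Unseen: drop the earliest context word and discount by LAMBDA.
        return LAMBDA * stupid_backoff(w, context[1:])
    return ngram_counts[(w,)] / total_words  # unigram relative frequency

print(stupid_backoff("Japan", ("works", "in")))  # seen trigram -> 1.0
print(stupid_backoff("Malta", ("works", "in")))  # backs off all the way -> 0.0
```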
Linear Interpolation
Example:

p_interp(Malta | work in) = λ1 * p(Malta | work in) + λ2 * p(Malta | in) + λ3 * p(Malta)

λ1 + λ2 + λ3 = 1
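A minimal Python sketch of linear interpolation, reusing the ngram_counts and total_words built in the previous sketch; the weight values are illustrative assumptions:

```python
# Minimal sketch: interpolated trigram probability from raw MLE estimates.
def mle(w, context):
    """Plain relative-frequency estimate; 0.0 if the history is unseen."""
    if context:
        hist = ngram_counts[context]
        return ngram_counts[context + (w,)] / hist if hist else 0.0
    return ngram_counts[(w,)] / total_words

def p_interp(w, w1, w2, lambdas=(0.6, 0.3, 0.1)):  # weights must sum to 1
    l1, l2, l3 = lambdas
    return l1 * mle(w, (w1, w2)) + l2 * mle(w, (w2,)) + l3 * mle(w, ())

print(p_interp("Japan", "works", "in"))  # 0.6*1.0 + 0.3*1.0 + 0.1*(1/6)
```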
Translation Model: Word-Based Models

Word translation, aka lexical translation

Requires a dictionary

Example: Haus (German) in English = house, home, building, household, shell

count(Haus) = 10,000

count(Haus translated to house) = 8,000

p(house | Haus) = 8,000 / 10,000 = 0.8
Lexical translation probability distribution
p_f : e -> p_f(e)

Given a foreign word f, the distribution returns, for each choice of English translation e,
a probability that indicates how likely that translation is.

Estimated by maximum likelihood estimation (counting, as in the Haus example).


Alignment
t(e|f): translation probability

t-tables:

Example: das (German)

e      t(e|f)
the    0.7
that   0.2
this   0.1
Alignment function
Maps each English output word at position j to a German input word at position i.

a: j -> i

Problem: Some English output words may have no counterpart among the
input words of the source sentence.

Solution: add a special NULL input word.

This makes the alignment function fully defined.


IBM Model 1

Translation probability for a foreign sentence f = (f1, ..., f_lf) of length lf to an
English sentence e = (e1, ..., e_le) of length le, with an alignment of each English
word ej to a foreign word fi according to the alignment function a: j -> i:

p(e, a | f) = epsilon / (lf + 1)^le * PRODUCT_{j=1..le} t(ej | f_a(j))

- epsilon: normalization constant, so that p(e, a | f) is a proper probability distribution (sums to 1)
- lf + 1: the lf input words plus the NULL word
- (lf + 1)^le: the number of possible alignments

Example:

das haus ist klein
the house is small

p(e, a | f) = epsilon / 5^4 * t(the | das) * t(house | haus) * t(is | ist) * t(small | klein)
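A minimal Python sketch of this computation; the t-table values and epsilon are made-up illustrative numbers, and the NULL word is omitted for simplicity:

```python
# Minimal sketch: IBM Model 1 probability p(e, a | f) for a fixed alignment.
EPSILON = 1.0  # normalization constant (illustrative value)

# Made-up lexical translation table t(e | f).
t = {
    ("the", "das"): 0.7, ("house", "haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def model1_prob(e, f, align):
    """p(e, a | f) = EPSILON / (lf + 1)^le * prod_j t(e[j] | f[align[j]])."""
    lf, le = len(f), len(e)
    p = EPSILON / (lf + 1) ** le
    for j, ej in enumerate(e):
        p *= t[(ej, f[align[j]])]
    return p

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
align = [0, 1, 2, 3]  # English word j aligns to foreign word align[j]
print(model1_prob(e, f, align))  # (1/625) * 0.7 * 0.8 * 0.8 * 0.4
```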
Translation Model: Phrase-Based Translation

NOTE: a phrase here is not a linguistic phrase; it can be any combination of words

Reasons for using phrase-based models:

1. Words are not the best atomic unit for translation (due to frequent one-to-many
mappings)
2. Translating word groups instead of single words helps to resolve translation
ambiguities.

Mathematical definition:

e_best = argmax_e p(e | f) = argmax_e p(f | e) * p_LM(e)


Phrase translation table

Phrase translation table for the German word natuerlich:

Translation   p(e | f)
of course     0.4
naturally     0.3
Learning a phrase translation table

1. Create a word alignment between each sentence pair of the parallel corpus.

2. Extract phrase pairs that are consistent with this word alignment (a sketch of the check follows below).

Consistency

A phrase pair (e, f) is consistent with an alignment A if all words of f that have
alignment points in A are aligned to words of e, and vice versa:

(e, f) is consistent with A <=>

for all ei in e: (ei, fj) in A => fj in f, AND

for all fj in f: (ei, fj) in A => ei in e, AND

there exists ei in e and fj in f such that (ei, fj) in A
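A minimal Python sketch of this consistency check; representing phrases as half-open index ranges is an assumption for illustration:

```python
# Minimal sketch: is the phrase pair (e_range, f_range) consistent with A?
def consistent(e_range, f_range, alignment):
    """e_range/f_range are (start, end) index pairs, end exclusive.
    alignment is a set of (i, j) pairs: English word i <-> foreign word j."""
    e_start, e_end = e_range
    f_start, f_end = f_range
    has_inside_point = False
    for (i, j) in alignment:
        e_in = e_start <= i < e_end
        f_in = f_start <= j < f_end
        if e_in != f_in:       # alignment point crosses the phrase boundary
            return False
        if e_in and f_in:      # at least one point must lie inside the pair
            has_inside_point = True
    return has_inside_point

# English 0 <-> foreign 0; English 1 <-> foreign 1 and foreign 2.
A = {(0, 0), (1, 1), (1, 2)}
print(consistent((0, 2), (0, 3), A))  # True: every point stays inside
print(consistent((0, 2), (0, 2), A))  # False: (1, 2) crosses the boundary
```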


Estimate Phrase translation probability
For each sentence pair, extract all phrase pairs. Then count in how many sentence
pairs a particular phrase pair (e, f) was extracted, and store it in count(e, f).

Translation probability: phi(f | e) = count(e, f) / SIGMA_fi count(e, fi)
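A minimal Python sketch of this relative-frequency estimate; the phrase-pair counts below are made-up numbers:

```python
# Minimal sketch: phrase translation probability from phrase-pair counts.
from collections import Counter

# Made-up counts: (english_phrase, foreign_phrase) -> number of extractions.
pair_counts = Counter({
    ("of course", "natuerlich"): 40,
    ("of course", "selbstverstaendlich"): 10,
    ("naturally", "natuerlich"): 30,
})

def phrase_prob(f_phrase, e_phrase):
    """phi(f | e) = count(e, f) / sum over f' of count(e, f')."""
    total = sum(c for (e, _), c in pair_counts.items() if e == e_phrase)
    return pair_counts[(e_phrase, f_phrase)] / total if total else 0.0

print(phrase_prob("natuerlich", "of course"))  # 40 / (40 + 10) = 0.8
```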


Decoding
The two models above are mathematical formulae that assign a probabilistic score to
any candidate translation.

The task of the decoder is to find the best-scoring translation according to these formulae.

Heuristic search methods are used. Ex: beam search

2 types of errors during decoding:

1. Search error: failure to find the best-scoring translation; a consequence of the
heuristic search, which is unable to explore the entire search space.
2. Model error: the highest-scoring translation might not be the best one (not a problem
of decoding itself)
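A heavily simplified, monotone (no reordering) beam-search sketch; the phrase table, the scores, and the beam size are made-up assumptions, and real decoders such as Moses are far more involved:

```python
# Minimal sketch: monotone beam search over phrase translation options.
import math

# Made-up phrase table: foreign phrase -> [(english phrase, log score)].
phrase_table = {
    ("das",): [("the", math.log(0.7))],
    ("das", "haus"): [("the house", math.log(0.6))],
    ("haus",): [("house", math.log(0.8)), ("home", math.log(0.1))],
    ("ist", "klein"): [("is small", math.log(0.5))],
}

def beam_search(f, beam_size=3, max_phrase_len=2):
    """Translate f left to right; hypotheses are (log score, output string)."""
    beams = [[] for _ in range(len(f) + 1)]  # indexed by words covered so far
    beams[0] = [(0.0, "")]
    for covered in range(len(f)):
        # Prune: keep only the best beam_size hypotheses at this coverage.
        beams[covered] = sorted(beams[covered], reverse=True)[:beam_size]
        for score, out in beams[covered]:
            for plen in range(1, max_phrase_len + 1):
                if covered + plen > len(f):
                    break
                src = tuple(f[covered:covered + plen])
                for e, s in phrase_table.get(src, []):
                    hyp = (score + s, (out + " " + e).strip())
                    beams[covered + plen].append(hyp)
    return max(beams[len(f)], default=None)  # best full translation

print(beam_search(["das", "haus", "ist", "klein"]))
# -> (log(0.6 * 0.5), 'the house is small')
```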
Evaluation
1. Manual evaluation
a. Correctness is too broad to measure directly
2 criteria:
I. Fluency: grammatical correctness and idiomatic word choices
II. Adequacy: does the output convey the same message as the input?

2. Automatic evaluation
a. Speed
b. Size (can be used on mobile phones)
c. Integration and usability
