
Statistical Machine Translation
Pasindu Nivanthaka Tennage
Computer Science and Engineering
University of Moratuwa
Content
1. Overview
2. Language Model
3. Translation Model - Word Based translation
4. Translation Model - Phrase Based translation
5. Decoder
6. Evaluation
Overview
[Noisy-channel pipeline diagram]

German --(Translation Model, p(f|e), trained on a German-English parallel corpus)--> Broken English

Broken English --(Language Model, p(e), trained on English text)--> English

A decoding algorithm combines the two models to search for the best English output.
Language Model
1. Word Order

Broken: "Machine translation the most fun research topics one is"

Fluent: "Machine translation is one of the most fun research areas"


Language Model
2. Word Choice

"She is in the room" or "she is on the room"?

Using knowledge of the target language, the language model can rearrange and correct the broken output.
Language Modelling
Modeling the fluency of the target language.

Language model (LM):

Good English string -> high p(e)

The LM is not concerned with translation at all.


Example LM method
p(John loves Mary) = count(John loves Mary) / count(all sentences)

WHAT IF "JOHN LOVES MARY" IS NOT IN THE CORPUS???

Commonly used methods

1. Parsing
2. Sequence Models
Parsing

[Parse tree of "John loves Mary": S -> NP VP; NP -> N -> John; VP -> V N; V -> loves; N -> Mary]

p(John loves Mary) =
p(S -> NP VP) * p(NP -> N) * p(VP -> V N) * p(N -> John) * p(V -> loves) * p(N -> Mary)
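As a rough illustration, the Python sketch below scores this fixed parse tree by multiplying rule probabilities; the grammar, the rule probabilities, and all names are made-up assumptions, not values from the slides:

```python
# Minimal sketch: scoring a fixed parse tree with a PCFG.
# All rule probabilities below are made-up illustrative values.
from functools import reduce

rule_prob = {
    ("S", ("NP", "VP")): 0.9,
    ("NP", ("N",)): 0.5,
    ("VP", ("V", "N")): 0.4,
    ("N", ("John",)): 0.01,
    ("V", ("loves",)): 0.05,
    ("N", ("Mary",)): 0.01,
}

# The parse tree of "John loves Mary", written as the list of rules it uses.
tree_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("VP", ("V", "N")),
    ("N", ("John",)),
    ("V", ("loves",)),
    ("N", ("Mary",)),
]

# p(sentence) = product of the probabilities of all rules in the tree.
p = reduce(lambda acc, rule: acc * rule_prob[rule], tree_rules, 1.0)
print(p)  # ~9e-07 with the toy numbers above
```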


Sequence Modeling
Chain rule:

p(w1, w2, ..., wn) = p(w1) * p(w2|w1) * p(w3|w1, w2) * ... * p(wn|w1, ..., wn-1)

Example:

p(John loves Mary) = p(John) * p(loves|John) * p(Mary|John loves)

IS THIS CORRECT???
p(John loves Mary .) =
p(John|<s>) * p(loves|<s> John) * p(Mary|<s> John loves)
* p(.|<s> John loves Mary) * p(</s>|<s> John loves Mary .)

<s> : sentence start symbol

</s>: sentence end symbol

p(John|<s>): probability of John being the first word


ARE THERE ANY PROBLEMS WITH THE ABOVE APPROACH???
Markov Assumption

p(wn|w1, ..., wn-1) ≈ p(wn|wn-j, ..., wn-1) for n > j

p(John loves Mary .) =

p(John|<s>) * p(loves|John) * p(Mary|loves) * p(.|Mary) * p(</s>|.)
N-gram
An n-gram is a substring of n words:

-> Unigram (n = 1)

-> Bigram (n = 2)

-> Trigram (n = 3)
Example Bigram p(wi|wi-1)

p(John loves Mary .) =
p(John|<s>) * p(loves|John) * p(Mary|loves) * p(.|Mary) * p(</s>|.)

Example Trigram p(wi|wi-2 wi-1)

p(John loves Mary .) =
p(John|<s> <s>) * p(loves|<s> John) * p(Mary|John loves) * p(.|loves Mary) * p(</s>|Mary .)
Parameter Estimation
Maximum likelihood estimation -> a counting problem

p(wi|wi-1) = count(wi-1 wi) / count(wi-1)

p(wi|wi-2 wi-1) = count(wi-2 wi-1 wi) / count(wi-2 wi-1)
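A minimal Python sketch of these counts; the toy corpus and all names are illustrative assumptions:

```python
# Minimal sketch: maximum likelihood bigram estimates from a toy corpus.
from collections import Counter

corpus = [
    ["<s>", "John", "loves", "Mary", ".", "</s>"],
    ["<s>", "Mary", "loves", "John", ".", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:])
)

def p_bigram(w, prev):
    # p(w | prev) = count(prev w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("loves", "John"))  # count(John loves) / count(John) = 1/2
```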
Estimating from text
PROBLEM: What if an n-gram is missing from the corpus? Its estimated probability is zero.

SOLUTION: Take some probability mass away from the seen n-grams and distribute it
among the unseen ones - SMOOTHING

Example:

Without smoothing: p(Japan | she works in) > 0 but p(Malta | she works in) = 0

With smoothing: p(Malta | she works in) = 1/10000


Smoothing
1. Add-one - add one to every count (not a good option)

Not fair: all unseen events get the same boost ("work in India" vs "work in Kenya")

2. Backoff
3. Stupid backoff - used at GOOGLE
4. Linear interpolation
Backoff
Example:

p(Malta | work in) -> back off to p(Malta | in), or further to p(Malta)

Stupid Backoff

s(wi | wi-2 wi-1) = count(wi-2 wi-1 wi) / count(wi-2 wi-1)   if count(wi-2 wi-1 wi) > 0

s(wi | wi-2 wi-1) = λ * s(wi | wi-1)   otherwise, with λ = 0.4

(returns scores, not probabilities; only works well for a large corpus)
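A minimal Python sketch of stupid backoff; the count-dictionary layout, the toy corpus, and the function names are illustrative assumptions:

```python
# Minimal sketch: stupid backoff scoring over n-gram counts of order 1-3.
from collections import Counter

LAMBDA = 0.4

corpus = [["<s>", "she", "works", "in", "Japan", "</s>"]]
ngram_counts = Counter()
for sent in corpus:
    for n in (1, 2, 3):
        for i in range(len(sent) - n + 1):
            ngram_counts[tuple(sent[i:i + n])] += 1
total_words = sum(len(sent) for sent in corpus)

def stupid_backoff(w, context):
    """Score s(w | context); context is a tuple of up to two previous words."""
    if context:
        full = ngram_counts[context + (w,)]
        hist = ngram_counts[context]
        if full > 0 and hist > 0:
            return full / hist
        # Unseen: drop the earliest context word and discount by LAMBDA.
        return LAMBDA * stupid_backoff(w, context[1:])
    return ngram_counts[(w,)] / total_words  # unigram relative frequency

print(stupid_backoff("Japan", ("works", "in")))  # seen trigram -> 1.0
print(stupid_backoff("Malta", ("works", "in")))  # backs off all the way -> 0.0
```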
Linear Interpolation
Example:

p_interp(Malta | work in) = λ1 * p(Malta | work in) + λ2 * p(Malta | in) + λ3 * p(Malta)

λ1 + λ2 + λ3 = 1
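A minimal Python sketch of linear interpolation, reusing the ngram_counts and total_words built in the previous sketch; the weight values are illustrative assumptions:

```python
# Minimal sketch: interpolated trigram probability from raw MLE estimates.
def mle(w, context):
    """Plain relative-frequency estimate; 0.0 if the history is unseen."""
    if context:
        hist = ngram_counts[context]
        return ngram_counts[context + (w,)] / hist if hist else 0.0
    return ngram_counts[(w,)] / total_words

def p_interp(w, w1, w2, lambdas=(0.6, 0.3, 0.1)):  # weights must sum to 1
    l1, l2, l3 = lambdas
    return l1 * mle(w, (w1, w2)) + l2 * mle(w, (w2,)) + l3 * mle(w, ())

print(p_interp("Japan", "works", "in"))  # 0.6*1.0 + 0.3*1.0 + 0.1*(1/6)
```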
Translation Model: Word-Based Models

Word translation, aka lexical translation

Requires a dictionary

Example: Haus (German) in English = house, home, building, household, shell

count(Haus) = 10,000

count(Haus translated to house) = 8,000

p(house | Haus) = 8,000 / 10,000 = 0.8
Lexical translation probability distribution
p_f : e -> p_f(e)

Given a foreign word f, the distribution returns, for each choice of English translation e,
a probability that indicates how likely that translation is.

Estimated by maximum likelihood estimation (counting, as in the Haus example).


Alignment
t(e|f): translation probability

t-tables:

Example: das (German)

e      t(e|f)
the    0.7
that   0.2
this   0.1
Alignment function
Maps each English output word at position j to a German input word at position i.

a: j -> i

Problem: Some English output words may have no counterpart among the
input words of the source sentence.

Solution: add a special NULL input word.

This makes the alignment function fully defined.


IBM Model 1

Translation probability for a foreign sentence f = (f1, ..., f_lf) of length lf to an
English sentence e = (e1, ..., e_le) of length le, with an alignment of each English
word ej to a foreign word fi according to the alignment function a: j -> i:

p(e, a | f) = epsilon / (lf + 1)^le * PRODUCT_{j=1..le} t(ej | f_a(j))

- epsilon: normalization constant, so that p(e, a | f) is a proper probability distribution (sums to 1)
- lf + 1: the lf input words plus the NULL word
- (lf + 1)^le: the number of possible alignments

Example:

das haus ist klein
the house is small

p(e, a | f) = epsilon / 5^4 * t(the | das) * t(house | haus) * t(is | ist) * t(small | klein)
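A minimal Python sketch of this computation; the t-table values and epsilon are made-up illustrative numbers, and the NULL word is omitted for simplicity:

```python
# Minimal sketch: IBM Model 1 probability p(e, a | f) for a fixed alignment.
EPSILON = 1.0  # normalization constant (illustrative value)

# Made-up lexical translation table t(e | f).
t = {
    ("the", "das"): 0.7, ("house", "haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def model1_prob(e, f, align):
    """p(e, a | f) = EPSILON / (lf + 1)^le * prod_j t(e[j] | f[align[j]])."""
    lf, le = len(f), len(e)
    p = EPSILON / (lf + 1) ** le
    for j, ej in enumerate(e):
        p *= t[(ej, f[align[j]])]
    return p

f = ["das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
align = [0, 1, 2, 3]  # English word j aligns to foreign word align[j]
print(model1_prob(e, f, align))  # (1/625) * 0.7 * 0.8 * 0.8 * 0.4
```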
Translation Model: Phrase-Based Translation

NOTE: a phrase here is not a linguistic phrase; it can be any combination of words

Reasons for using phrase-based models:

1. Words are not the best atomic unit for translation (due to frequent one-to-many
mappings)
2. Translating word groups instead of single words helps to resolve translation
ambiguities.

Mathematical definition:

e_best = argmax_e p(e | f) = argmax_e p(f | e) * p_LM(e)


Phrase translation table

Phrase translation table for the German word natuerlich:

Translation   p(e | f)
of course     0.4
naturally     0.3
Learning a phrase translation table

1. Create a word alignment between each sentence pair of the parallel corpus.

2. Extract phrase pairs that are consistent with this word alignment (a sketch of the check follows below).

Consistency

A phrase pair (e, f) is consistent with an alignment A if all words of f that have
alignment points in A are aligned to words of e, and vice versa:

(e, f) is consistent with A <=>

for all ei in e: (ei, fj) in A => fj in f, AND

for all fj in f: (ei, fj) in A => ei in e, AND

there exists ei in e and fj in f such that (ei, fj) in A
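A minimal Python sketch of this consistency check; representing phrases as half-open index ranges is an assumption for illustration:

```python
# Minimal sketch: is the phrase pair (e_range, f_range) consistent with A?
def consistent(e_range, f_range, alignment):
    """e_range/f_range are (start, end) index pairs, end exclusive.
    alignment is a set of (i, j) pairs: English word i <-> foreign word j."""
    e_start, e_end = e_range
    f_start, f_end = f_range
    has_inside_point = False
    for (i, j) in alignment:
        e_in = e_start <= i < e_end
        f_in = f_start <= j < f_end
        if e_in != f_in:       # alignment point crosses the phrase boundary
            return False
        if e_in and f_in:      # at least one point must lie inside the pair
            has_inside_point = True
    return has_inside_point

# English 0 <-> foreign 0; English 1 <-> foreign 1 and foreign 2.
A = {(0, 0), (1, 1), (1, 2)}
print(consistent((0, 2), (0, 3), A))  # True: every point stays inside
print(consistent((0, 2), (0, 2), A))  # False: (1, 2) crosses the boundary
```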


Estimate Phrase translation probability
For each sentence pair, extract all phrase pairs. Then count in how many sentence
pairs a particular phrase pair (e, f) was extracted, and store it in count(e, f).

Translation probability: phi(f | e) = count(e, f) / SIGMA_fi count(e, fi)
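A minimal Python sketch of this relative-frequency estimate; the phrase-pair counts below are made-up numbers:

```python
# Minimal sketch: phrase translation probability from phrase-pair counts.
from collections import Counter

# Made-up counts: (english_phrase, foreign_phrase) -> number of extractions.
pair_counts = Counter({
    ("of course", "natuerlich"): 40,
    ("of course", "selbstverstaendlich"): 10,
    ("naturally", "natuerlich"): 30,
})

def phrase_prob(f_phrase, e_phrase):
    """phi(f | e) = count(e, f) / sum over f' of count(e, f')."""
    total = sum(c for (e, _), c in pair_counts.items() if e == e_phrase)
    return pair_counts[(e_phrase, f_phrase)] / total if total else 0.0

print(phrase_prob("natuerlich", "of course"))  # 40 / (40 + 10) = 0.8
```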


Decoding
The two models above are mathematical formulae that assign a probabilistic score to
any candidate translation.

The task of the decoder is to find the best-scoring translation according to these formulae.

Heuristic search methods are used. Ex: beam search

2 types of errors during decoding:

1. Search error: failure to find the best-scoring translation; a consequence of the
heuristic search, which is unable to explore the entire search space.
2. Model error: the highest-scoring translation might not be the best one (not a problem
of decoding itself)
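A heavily simplified, monotone (no reordering) beam-search sketch; the phrase table, the scores, and the beam size are made-up assumptions, and real decoders such as Moses are far more involved:

```python
# Minimal sketch: monotone beam search over phrase translation options.
import math

# Made-up phrase table: foreign phrase -> [(english phrase, log score)].
phrase_table = {
    ("das",): [("the", math.log(0.7))],
    ("das", "haus"): [("the house", math.log(0.6))],
    ("haus",): [("house", math.log(0.8)), ("home", math.log(0.1))],
    ("ist", "klein"): [("is small", math.log(0.5))],
}

def beam_search(f, beam_size=3, max_phrase_len=2):
    """Translate f left to right; hypotheses are (log score, output string)."""
    beams = [[] for _ in range(len(f) + 1)]  # indexed by words covered so far
    beams[0] = [(0.0, "")]
    for covered in range(len(f)):
        # Prune: keep only the best beam_size hypotheses at this coverage.
        beams[covered] = sorted(beams[covered], reverse=True)[:beam_size]
        for score, out in beams[covered]:
            for plen in range(1, max_phrase_len + 1):
                if covered + plen > len(f):
                    break
                src = tuple(f[covered:covered + plen])
                for e, s in phrase_table.get(src, []):
                    hyp = (score + s, (out + " " + e).strip())
                    beams[covered + plen].append(hyp)
    return max(beams[len(f)], default=None)  # best full translation

print(beam_search(["das", "haus", "ist", "klein"]))
# -> (log(0.6 * 0.5), 'the house is small')
```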
Evaluation
1. Manual evaluation
a. Correctness is too broad to measure directly
2 criteria:
I. Fluency: grammatical correctness and idiomatic word choices
II. Adequacy: does the output convey the same message as the input?

2. Automatic evaluation
a. Speed
b. Size (can be used on mobile phones)
c. Integration and usability
