Professional Documents
Culture Documents
– p.1
What’s New in Statistical Machine Translation p
Outline p
Data
Evaluation
Translation Model
Language Model
Decoding Algorithm
Available Resources
– p.2
Statistical MT Systems p
Spanish/English English
Bilingual Text Text
– p.3
Translation Language
Model Model
Decoding Algorithm
argmax P(e)*p(s|e)
– p.4
– given an English string e, assigns by formula
– good English string high
– bad English string low
Translation Model
– given a pair of strings , assigns by formula
Decoding Algorithm
– given a language model, a translation model and a new sentence ,
find translation maximizing
– p.5
Outline p
Data
Evaluation
Translation Model
Language Model
Decoding Algorithm
Available Resources
– p.6
Translation Model p
Goal of the Translation Model:
Match foreign input to English output
– p.7
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.8
english foreign
semantics semantics
english foreign
syntax syntax
english foreign
words words
– p.9
english foreign
semantics semantics
english foreign
syntax syntax
english foreign
words words
however, the currently best performing statistical machine translation systems are
still crawling at the bottom.
– p.10
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.11
Statistical Modeling p
Mary did not slap the green witch
Not Sufficient Data to Estimate
Directly
– p.12
– p.13
– p.14
Foreign String
– Choices in Story are Decided by Reference to Parameters
– e.g., bruja witch
Formula for in Terms of Parameters
– usually long and hairy, but mechanical to extract from the story
– p.15
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.16
Parallel Corpora p
... la maison ... la maison blue ... la fleur ...
... the house ... the blue house ... the flower ...
Incomplete Data
– English and foreign words, but no connections between them
– p.17
EM Algorithm p
Incomplete Data
– if we had complete data, would could estimate model
EM in a Nutshell
– initialize model parameters (e.g. uniform)
– iterate
– p.18
EM Algorithm (2) p
... la maison ... la maison blue ... la fleur ...
... the house ... the blue house ... the flower ...
– p.19
EM Algorithm (3) p
... la maison ... la maison blue ... la fleur ...
... the house ... the blue house ... the flower ...
– p.20
EM Algorithm (4) p
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
– p.21
EM Algorithm (5) p
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
Convergence
– p.22
EM Algorithm (6) p
... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...
p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...
– p.23
– p.24
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.25
Word Alignment p
Notion of Word Alignments Valuable
Mary
did
not
slap
the
green
witch
– p.26
– bidirectionally aligned corpora ,
– take intersection of alignment points
(high precision, low recall)
– p.27
Mary Mary
did did
not not
slap slap
the the
green green
witch witch
intersection
bofetada bruja
Maria no daba una a la verde
Mary
did
not
slap
the
green
witch
Mary
did
not
slap
the
green
witch
– p.29
– also to non-neighboring
– ...
– p.30
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.31
Flaws of Word-Based MT p
Multiple English Words for one German Word
German: Zeitmangel erschwert das Problem .
Gloss: LACK OF TIME MAKES MORE DIFFICULT THE PROBLEM .
Phrasal Translation
German: Eine Diskussion erübrigt sich demnach
Gloss: A DISCUSSION IS MADE UNNECESSARY ITSELF THEREFORE
– p.32
– p.33
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.34
Phrase-Based Translation p
Morgen fliege ich nach Kanada zur Konferenz
– p.35
– p.36
– p.37
Mary
did
not
slap
the
green
witch
Collect All Phrase Pairs that are Consistent with the Word
Alignment
– a phrase alignment has to contain all alignment points for all words it
covers
– p.38
Mary
did
not
slap
the
green
witch
(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch),
(verde, green)
– p.39
Mary
did
not
slap
the
green
witch
(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch),
(verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch)
– p.40
Mary
did
not
slap
the
green
witch
(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch),
(verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch),
(Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch)
– p.41
Mary
did
not
slap
the
green
witch
(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch),
(verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch),
(Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the),
(daba una bofetada a la bruja verde, slap the green witch)
– p.42
Mary
did
not
slap
the
green
witch
(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch),
(verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch),
(Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the),
(daba una bofetada a la bruja verde, slap the green witch),
(no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)
– p.43
count
count
No Smoothing is Performed
– p.44
the
green
witch
Lexical Weighting:
a la bruja verde the green witch
a the la the
verde green
bruja witch
– p.45
witch, princess
– p.46
1 2 3 4 5
BLEU
.27
.26
.25
.24
.23
.22
.21
WAIPh
.20
Joint
.19 M4
Syn
.18
Syntactic Transformations
– German: Den Antrag verabschiedet das Parlament
– case marking that indicates that “the draft” is object is lost during
translation
– p.50
EM algorithm
Word alignment
Phrase-based translation
Syntax-based translation
– p.51
Syntax-Based Translation p
interlingua
english foreign
semantics semantics
english foreign
syntax syntax
english foreign
words words
– p.52
– p.53
english foreign
semantics semantics
english foreign
syntax syntax
english
words
foreign
words Wu [1997], Alshawi et al. [1998]
interlingua
english foreign
semantics semantics
english foreign
syntax syntax
english
words
foreign
words Yamada and Knight [2001]
– p.54
–
–
–
–
–
– p.55
Syntax Trees p
– p.56
– p.57
– p.58
3. pick one of the dependents as new head word for extension (step 2);
or terminate
– p.59
Dependency Structure p
gekauft
bought
das
the
– p.62
english foreign
semantics semantics
english foreign
syntax syntax
english foreign
words words
– p.63
he adores VB TO he TO VB adores
listening TO MN MN TO listening
to music music to
VB VB
MN TO listening no MN TO kiku no
translate
music to ongaku wo
take leaves
Kare ha ongaku wo kiku no ga daisuki desu
– p.64
Crossings p
Do English Trees Match Foreign Strings?
Crossings between French-English [Fox, 2002]
– 0.29-6.27 per sentence, depending on how it is measured
Can be Reduced by
– flattening tree, as done by [Yamada and Knight, 2001]
– p.65
– p.66
Outline p
Data
Evaluation
Translation Model
Language Model
Decoding Algorithm
Available Resources
– p.67
Language Model p
Goal of the Language Model:
Detect good English
– p.68
Language Model p
What is Good English?
Standard Technique: Trigram Model
– multiplication of trigram probabilities
– p(witch the green) p(green the witch)
S ?
NP S
NP PP VP NP VP VP
the house of the man is good the house is the man is good
– p.70
– p.72
Outline p
Data
Evaluation
Translation Model
Language Model
Decoding Algorithm
Available Resources
– p.73
Decoding Algorithm p
Goal of the decoding algorithm:
Put models to work, perform the actual translation
– p.74
Greedy Decoder p
Maria no daba una bofetada a la bruja verde
GLOSS
Mary no give a slap to the witch green
SWAP
Mary no give a slap to the green witch
ERASE
Mary no give a slap the green witch
CHANGE
Mary not give a slap the green witch
INSERT
Mary did not give a slap the green witch
JOIN
Mary did not slap the green witch
– p.75
e: witch
e: f: -------*-
f: ---------- p: .182
p: 1
– p.76
– p.77
green
Word Graphs
– search graph from beam search can be easily converted
– p.78
– p.79
Outline p
Data
Evaluation
Translation Model
Language Model
Decoding Algorithm
Available Resources
– p.80
New Directions p
How can we add more knowledge to the process?
– Define subtasks
– p.81
– numbers
– dates
– quantities
Noun Phrases
– p.82
Or by Special Handling?
– XML markup of MT input [Germann et al., 2003]
– the revenue for
number translate-as=’’2003’’ 2003 number
is higher than ...
Names p
Often not in Training Corpus
– p.84
Noun Phrases p
Noun Phrases can be Translated in Separation
[Koehn and Knight, 2003]
– German-English: 75% are, 98% can be
Definition of NP/PP
– (informally): maximal phrases that contain at least one noun and no verb
– cover about half of the words, all nouns (largest open word class)
– p.85
Model
Reranker
translation
– see also direct maximum entropy models [Och and Ney, 2002]
– p.86
100%
correct
90%
80%
70%
60%
1 2 4 8 16 32 64
size of n-best list
– p.87
– p.88
In-domain (Politics)
The United States and India May Will Be Held in the Past 40 Years the First Joint Military Exercises
(Afp report from new Delhi) India and U. S. will be held in the past 39 years the first joint military exercises in the
world’s two biggest democracies the cooperative relationship between making milestone.
The Defense Ministry said in a class Indian paratrooper Brigade mid-May and the US Pacific Command of the
special units in the well-known far and near the Thai women Maha tomb near joint military exercises.
The two countries will provide air support.
– p.89