
Discriminative Training of Synchronous Parsers

for Machine Translation

A Thesis presented

Alexander Rush

The Department of Computer Science
in partial fulfillment of the requirements
for the degree of
Bachelor of Arts
in the subject of

Computer Science

Harvard University
Cambridge, Massachusetts
April 3rd 2007
Thesis advisor Author

Stuart Shieber Alexander Rush

Discriminative Training of Synchronous Parsers for Machine Translation

The task of discriminative machine translation by synchronous parsing poses two major

difficulties - the hidden structures in the training set and the inefficiency of parsing syn-

chronous grammars. We propose a general method for syntax-aware translation by handling

these two problems. We approach the hidden structure problem by adapting the online

learning update rules presented by Liang et al. (2006) for discriminative phrase-based

translation. We tackle the training efficiency issues by using an A* search algorithm to

find the best parses. The discriminative training method permits an expanded feature set

compared to generative models. The A* search speeds up the parser when used with an

admissible heuristic. We test this framework on Inversion Transduction Grammar, a simple

synchronous grammar formalism.


Title Page
Abstract
Table of Contents

1 Introduction
1.1 Background
1.2 Goal
1.3 Overview

2 Learning with Hidden Structure
2.1 General Linear Models
2.2 Training
2.3 Hidden Structure
2.4 Perceptrons and Hidden Structure

3 Hypergraph Parsing
3.1 Parsing
3.2 Hypergraphs and Parsing
3.3 Hypergraph Traversal
3.4 Inversion Transduction Grammar

4 Translation by ITG Parsing
4.1 Translation by Parsing
4.2 Local Updating
4.3 Bilingual Parsing
4.4 Features
4.5 Heuristic

5 Results
5.1 Efficiency Issues
5.2 Experiments
5.3 Conclusion

Chapter 1

Introduction


1.1 Background

Translation seems so simple. Take in a sentence in a source language and produce

the corresponding sentence in a target language. We know that this problem can be solved,

since we can watch people translate every day. And yet the demand for translation remains

high. These two factors make translation an appealing problem. In addition, computers

should be a perfect match for translation. At heart, computers are designed to manipulate

symbols, so why should language symbols be so different?

Despite the surface simplicity of translation, it has proved to be a tremendously

difficult area. Researchers have been building computational machine translation (MT)

systems since the 1940’s with only limited success. Until the 1990’s, these systems were

major engineering projects. MT researchers built rule-based systems that used hand-crafted

grammars to convert a sentence in one language to another. This form of system treats

human language like a very complicated programming language and translation is like the

conversion to assembly code. Unlike a programming language, however, we do not know the form

of this grammar.

Chapter 1: Introduction 2

Figure 1.1: Several different models of translation, each illustrated on the sentence pair “Una cierta oración española.” / “Some spanish sentence.”: (a) IBM Model, (b) Phrase-based, (c) Synchronous Grammar, (d) Hiero.

Brown et al. (1990) caused a major shake-up in MT research when they imple-

mented the first successful statistical machine translation (SMT) system. Instead of using

handcrafted rules, SMT systems attempt to learn a probabilistic map between the source

and target languages. The most startling thing about these systems is that they are com-

pletely oblivious to the languages they are translating. Unlike the previous generation of

systems, where the rules were proposed by language experts, SMT systems require no human

guidance. The statistical system learns its map from a corpus of human-translated

sentence pairs from the two languages. The first popular corpus was the proceedings of the

Canadian parliament, which is conducted both in French and English, but as Knight (1997)

jokes, it might as well be the conversations of aliens. In spite of this property, SMT systems

perform better than hand-crafted systems.


This original work, which is known as the IBM system, is a generative, alignment-based

system. At its base, it has two statistical models. The first is a translation model that gives

the likelihood of source language words translating to destination language words. This

model tries to create adequate translations. Figure 1.1(a) gives a basic sketch of the IBM

translation model. It also has a language model that computes the likelihood of the target

sentence based on the target language alone. This model aims to produce fluent sentences.

The IBM system is still the major component of most SMT systems, and we will discuss

these two models in more depth later.

The current state-of-the-art SMT systems are variants of the phrase-based system

of Koehn, Och, and Marcu (2003). Phrase-based systems improve upon the IBM system

by increasing the granularity of the translation model. Instead of producing one word from

one word, phrase-based systems produce one phrase from one phrase. In order to find phrases,

these systems use heuristics that infer phrase level information from the output of the IBM

systems. Figure ?? shows the distinction between word and phrasal alignments.

The major criticism of IBM-style systems and their phrase-based successors is

that they trade syntactic transformations of rule-based systems for word or phrase level

mappings. The old systems had the expressivity of grammars, while phrase-based systems

have the expressivity of finite state automata. In fact, Kumar, Deng, and Byrne (2005)

have implemented a phrase-based system as a weighted finite state transducer. While the

statistical aspect is a major improvement, it seems unlikely that all the complexities of

translation can be expressed at the level of finite state automata.

This criticism has led to an alternative line of research in syntax-aware SMT sys-

tems. Syntax-aware systems attempt to lift the translation problem to the syntax level, and

learn a map from source to destination syntax. We can think of these syntax transformations

as grammars over both sentences simultaneously. Figure 1.1(c) gives an example of this

transformation. These two-sentence grammars are known as synchronous grammars, and

we will discuss them in greater detail in Chapter 3.

Recently, Chiang (2006) has shown that a statistical machine translation system

incorporating syntactic information can improve upon a standard phrase-based model. His

system, known as Hiero, uses a hybrid approach called hierarchical phrase-based trans-

lation. The system learns phrase level rules that can be combined using a synchronous

grammar. In this way, he is able to get some of the benefits of both phrase level granularity

and syntax. Figure 1.1(d) shows a simple sketch of the Hiero concept.

1.2 Goal

The Hiero system is a major advance for syntax-aware translation. It can use

information from phrase-based systems and incorporate syntactic information to produce

better translations. However, Hiero is still attached to the phrase-based paradigm. In this

work, we attempt to build a discriminative syntax-aware system that is not restricted to a

phrasal model.

While this seems like a modest goal, it has a major complication. The issue is that

the standard corpus for machine translation consists of unmarked sentence pairs. These

sentence pairs provide a reasonable training set for learning word level mappings, but they

provide no information about the syntax. In a perfect world, we would have a corpus marked

with consistent bilingual syntactic structures. Instead, we need to infer these structures from

the corpus.

There are several ways to address this lack of direct supervision. In Hiero,

Chiang circumvents the supervision problem by creating his own training data. His system

first runs a phrase-based algorithm to create a table of possible phrases for each

sentence. He then uses a set of heuristics to predict a possible parse from this table. He can

then train his model from these parses. This method avoids the hidden structure problem

at the cost of relying heavily on a phrase-based backbone.

Alternatively, Melamed (2004) suggests treating this as an unsupervised learning

problem. He starts with a base grammar and uses expectation maximization (EM) to

learn reasonable parameters. The standard EM technique for learning grammars is the

inside-outside algorithm of Lari and Young (1991). Inside-outside works by repeatedly

parsing the corpus and updating the parameters of the model to maximize the expected

likelihood of the observed sentences.

The EM method would be ideal for our goal, except that it learns a generative,

joint probability model over the synchronous parse instead of a discriminative, conditional

model. There are two issues with using a generative model for translation. The first is

that a generative model predicts the likelihood of a parse over both sentences. This makes

sense for training, but since in practice, we are only given the source sentence, we are left

with an indirect model for predicting the best target sentence. The second issue is that

discriminative models allow us to add in arbitrary features that may help translation. In

generative models, we can add extra features, but we have to worry about their correlation.

1.3 Overview

In this paper, we present a discriminative method for training unconstrained syn-

chronous grammars. Our aim is to maintain the freedom from phrase table constraints

of the inside-outside approach without sacrificing the flexibility of a discriminative model.

The method consists of two major pieces - a learning algorithm for partially supervised

training and a search algorithm for finding translations with a synchronous grammar. These

two pieces are sufficient for describing a complete system, and the structure of this paper

mirrors the modularity of this approach.

Chapter 2 gives background for the learning algorithms we use in our system. We

first present a general framework for thinking about structured learning. We then focus on

basic update techniques in this framework and extensions for problems like translation that

lack full supervision.

In Chapter 3, we explore how to solve the subproblem at the heart of the learning

algorithms from Chapter 2. We describe a parsing framework with several variants that

solve this problem for different learning approaches. We then show how to adapt this

method to a simple synchronous grammar formalism.

Chapter 4 shows how to apply these methods to the translation task. We describe

techniques for language model integration, tricks for reducing the parser’s complexity,

admissible heuristics to maintain optimality, and the features that we used for tests.

Finally in Chapter 5, we present results of this system on language data and discuss

future improvements to help scale this system.

Chapter 2

Learning with Hidden Structure

In this work, we view machine translation as a structure prediction problem. The

structure prediction framework is a very general way of thinking about discriminative learning.

It takes an input, X, and produces the best scoring output, Y. The framework is

agnostic to the form of X and Y. These could be sentences, trees, pictures, etc. In the case

of translation, the input X is a sentence in the source language and the output is a sentence

Y in the target language.

This chapter introduces structure prediction and some variants for applying it to

translation. We focus on general linear models trained with online algorithms. In Sections

2.1 and 2.2, we introduce the general linear model framework for structure prediction,

and present an online Perceptron-like update rule for training these models. In Section 2.3,

we describe a variant of general linear models for problems like translation that have hid-

den structure. In Section 2.4, we revisit the training question and survey variants of the

Perceptron algorithm that support hidden structure.

Chapter 2: Learning with Hidden Structure 8

2.1 General Linear Models

The key to structure prediction is projecting elements with complex structure onto

representative vectors. Working with vectors allows us to abstract the learning problem away

from interpreting structures by reducing them to their essential components. We can

then leverage training methods designed for vector spaces.

Collins (2002) introduces a formal framework for this process known as a general

linear model. A general linear model consists of three functions: an enumeration function

GEN, a feature function Φ, and a scoring function RANK.

We illustrate these functions using a part-of-speech tagging example. In tagging

problems, the goal is to produce the best part-of-speech label for each word of a sentence.

The tagger takes a sentence, X, as input and produces predicted tags, Y , as output. We

use the notation $\mathcal{X}$ and $\mathcal{Y}$ for the sets of all sentences and all tag sequences, respectively.

The three functions are:

• The enumeration function $\mathrm{GEN} : \mathcal{X} \to 2^{\mathcal{Y}}$ produces the set of all legal outputs for a

given input. GEN has a trivial specification for most problems. For instance in

tagging, GEN might produce every possible sequence of tags for a sentence.

Figure 2.1(a) illustrates this GEN.

• The projection function $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^n$ converts an input-output pair into a representative

vector, known as the feature vector. This function determines what elements

of structures are relevant, and everything else about the structure will be ignored.

Figure 2.1(b) shows the application of Φ to one of the outputs of GEN.

In practice, for efficiency we choose a Φ that factors into a sum of smaller local functions,

$$\Phi(X, Y) = \sum_{y \in Y} \Phi(X, y).$$

In turn, Φ is made up of n feature functions $\phi_i$, each of which determines one

dimension of the feature vector.


Figure 2.1: The three general linear model functions applied to a part-of-speech tagging
task. Figure 2.1(a), GEN, shows the enumeration function generating all possible tagging sequences
for the simple sentence “Dogs can not fly.” Figure 2.1(b), Φ, shows a possible feature vector projection for a single
sentence-tag pair. Figure 2.1(c), RANK, shows this vector being scored with the weight vector.

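For instance, in the tagging problem we might have a single feature,

$$\phi_1(w, t) = \begin{cases} 1, & \text{if } w = \text{Dogs and } t = \text{V} \\ 0, & \text{otherwise} \end{cases}$$

where w and t are a word-tag pair. We read this feature as: let the first dimension be

1 if the current word is “Dogs” and its tag is “V”. If we want the feature vector for

“Dogs/V bark/Adj,” we compute

$$\begin{aligned} \Phi(\text{Dogs bark}, \text{V Adj}) &= \Phi(\text{Dogs bark}, \text{V}) + \Phi(\text{Dogs bark}, \text{Adj}) \\ &= \langle \phi_1(\text{Dogs}, \text{V}) \rangle + \langle \phi_1(\text{bark}, \text{Adj}) \rangle \\ &= \langle 1 \rangle + \langle 0 \rangle = \langle 1 \rangle \end{aligned}$$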

• The scoring function $\mathrm{RANK} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ ranks the output of Φ using a parameter

vector w. We use a linear model for our scoring function. A linear model makes the

assumption that each dimension of the feature vector contributes independently to

the final score. These models are popular for machine learning because the size of w

scales linearly with the size of the feature vector. Under the assumption of linearity,

the scoring function is just the dot product of the weight vector and the feature vector,

RANK(Φ(X, Y ), w) = Φ(X, Y ) · w

We use these functions to formally define the best output as

$$\operatorname*{argmax}_{Y \in \mathrm{GEN}(X)} w \cdot \Phi(X, Y) \qquad (2.1)$$
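The three functions and equation 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag set, the word/tag indicator features, and the weight values are invented for the example, and GEN is enumerated exhaustively, which is only feasible for very short sentences.

```python
from collections import Counter
from itertools import product

TAGS = ["N", "V", "Adv", "Adj"]  # assumed toy tag set


def gen(sentence):
    """GEN: enumerate every tag sequence for the sentence."""
    return [list(tags) for tags in product(TAGS, repeat=len(sentence))]


def phi(sentence, tags):
    """Phi: sum of local word/tag indicator features, stored sparsely."""
    features = Counter()
    for word, tag in zip(sentence, tags):
        features[(word, tag)] += 1
    return features


def rank(features, w):
    """RANK: dot product of the feature vector with the weight vector."""
    return sum(value * w.get(name, 0.0) for name, value in features.items())


def best_output(sentence, w):
    """Equation 2.1: the highest-scoring element of GEN(X)."""
    return max(gen(sentence), key=lambda tags: rank(phi(sentence, tags), w))


sentence = ["Dogs", "can", "not", "fly"]
# Hypothetical weights that reward one tag per word.
w = {("Dogs", "N"): 1.0, ("can", "V"): 1.0, ("not", "Adv"): 1.0, ("fly", "V"): 1.0}
print(len(gen(sentence)))        # number of candidate taggings
print(best_output(sentence, w))
```

Storing the feature vector as a sparse mapping rather than a dense array is a design choice of this sketch; it keeps the dot product proportional to the number of active features.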


Finding this structure can be challenging in practice. For instance in our simple

tagging problem, GEN(X) produces a set with size $|T|^n$, where n is the length of the

Figure 2.2: Geometric representation of the linear separator. Squares represent correct
outputs and circles are incorrect outputs. The solid line represents a possible separator w,
and the dashed lines show the margin δ for that separator.

sentence and |T | is the size of the tag set. In addition, computing feature vectors for each

of these outputs can be costly.

Later in this work, we will follow Daumé III and Marcu (2005) and

focus on solving the optimization problem given in equation 2.1 by using search. Search

lets us avoid the size of GEN by lazily expanding values in order of rank. In addition, if we

use a decomposable Φ, we can reduce the cost of Φ by incrementally building our feature

vectors as we search over structure. In Chapter 3, we will discuss in detail how to find this

structure for arbitrary parsing problems.

2.2 Training

In the last section, we assumed that we had a parameter vector w that would rank

good outputs better than bad outputs. In this section, we examine what this condition

formally means, and how to find a parameter vector that satisfies this property.

In structure prediction, we assume that we can partition the output space into

correct and incorrect outputs. All outputs for a given input X are either correct, $Y^* \in \mathcal{Y}^*$,

Algorithm 1 A Perceptron-style update rule for general linear models.

Require: Y* correct output

Ensure: w_{t+1} updated weights

best ⇐ argmax_{Y ∈ GEN(X)} w_t · Φ(X, Y)

correct ⇐ Y*

if best ≠ correct then

    return w_t + Φ(X, correct) − Φ(X, best)

else

    return w_t

end if

or incorrect, $Y' \in \bar{\mathcal{Y}}^*$. Our goal is to find parameters that “separate” these two sets:

$$w \cdot \Phi(X, Y^*) > w \cdot \Phi(X, Y') + \delta.$$

A parameter vector w that satisfies this inequality is known as a linear separator, and the

value δ is known as its margin. The geometric intuition behind this condition is given in

Figure 2.2.

During training, our goal is to learn this linear separator from data. Collins (2002)

presents a simple method for learning this separator using a Perceptron-style learning

algorithm. This algorithm learns from data instances of the form $[X_i, Y_i^*]_{i=1}^{n}$, where each $X_i$

is an input structure and $Y_i^* \in \mathcal{Y}^*$ is a correct output structure. We start with an arbitrary

parameter vector $w_0$ and run Algorithm 1 on each data instance. This algorithm computes

the current best structure and compares it to the correct output structure. If they are

different, we increase the parameters of the features from the correct structure and decrease

those from the current best, but incorrect, structure. We repeat this process until we find a

separator.


Figure 2.3: The GEN function for the garbled tagging problem. For the garbled input “Dougs can nt fly.”, the candidate outputs shown include “Dogs/V can/V not/Adv fly/V”, “Cats/N like/V no/Adv dog/N”, and “Horse/Adj Cow/Adj Pig/Adj Mouse/Adj”.

In the tagging example, the algorithm starts by finding the best output for “Dogs

can not fly.” for the given weights. If the weights are incorrect, the best output could

be anything. For instance it might produce, “Dogs/V can/V not/Adv fly/V.” Since this

output is different than the correct output, “Dogs/N can/V not/Adv fly/V.”, we perform

an update. The update only changes weights related to the mistake that was made. In

this case, it will increase the weights of all features relating to “Dogs/N” and decrease the

weights for “Dogs/V.” The next time we see this input, we hope that the best output will

be the correct output.
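The update of Algorithm 1 on this example can be sketched as follows. The sketch assumes word/tag indicator features, a three-tag toy tag set, and exhaustive enumeration of GEN; the starting weights are hypothetical and chosen so that the first prediction wrongly tags “Dogs” as a verb.

```python
from collections import Counter
from itertools import product

TAGS = ["N", "V", "Adv"]  # assumed toy tag set


def phi(sentence, tags):
    """Word/tag indicator features (an assumption of this sketch)."""
    return Counter(zip(sentence, tags))


def score(w, features):
    return sum(v * w.get(k, 0.0) for k, v in features.items())


def argmax_tags(w, sentence):
    """Exhaustive search over GEN(X); fine for a four-word sentence."""
    return max(product(TAGS, repeat=len(sentence)),
               key=lambda tags: score(w, phi(sentence, tags)))


def perceptron_update(w, sentence, correct):
    """One step of Algorithm 1: reward the correct features, penalize the best."""
    best = argmax_tags(w, sentence)
    if tuple(best) == tuple(correct):
        return w
    new_w = Counter(w)
    new_w.update(phi(sentence, correct))   # w + Phi(X, correct)
    new_w.subtract(phi(sentence, best))    # ... - Phi(X, best)
    return new_w


sentence = ("Dogs", "can", "not", "fly")
correct = ("N", "V", "Adv", "V")
w0 = {("Dogs", "V"): 1.0}                  # hypothetical bad starting weights
w1 = perceptron_update(w0, sentence, correct)
print(argmax_tags(w1, sentence))
```

Features shared by the best and correct outputs cancel in the update, so only weights tied to the mistake move, which matches the behavior described above.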

2.3 Hidden Structure

The form of general linear model presented in Section 2.1 assumes that we can

directly predict our output Y from the given sentence X. For many problems, it makes

more sense to first predict some hidden structure H that is not part of the output Y .

In this section we follow the work of Koo and Collins (2005) and extend the general linear

model framework to problems with hidden structure.

To demonstrate hidden structure, consider the “garbled tagging” task. In this

problem, we want to tag input sentences that have some words misspelled. Our input is a

sentence, X, with misspelled words, and our output as before is its tags, Y . To account

for the misspelled words, we introduce hidden structure H that first predicts the correct

spelling of each word. We notate the garbled tagging problem in an x/h/y format, so for the

input “Larg dg” one possible structure would be “Larg/Large/Adj dg/dodge/V”.

The hidden structure model requires slight changes to GEN and Φ.

• $\mathrm{GEN} : \mathcal{X} \to 2^{\mathcal{H} \times \mathcal{Y}}$ now enumerates every possible pair of hidden and output

structure. For garbled tagging, this means that it produces every possible corrected

word and every tag for each of these words. Figure 2.3 shows some possible outputs.

As with the earlier GEN, neither the proposed hidden nor output structures

have to bear any resemblance to the correct output. We let ranking sort out the good

from bad.

• $\Phi : \mathcal{X} \times \mathcal{H} \times \mathcal{Y} \to \mathbb{R}^n$ can now include features that observe the hidden structure

in addition to the input or output structure. This flexibility is the justification for

including hidden structure. For instance, say we wanted to include a feature from the

previous model,

$$\phi_i(w, t) = \begin{cases} 1, & \text{if } w = \text{Dogs and } t = \text{Adv} \\ 0, & \text{otherwise} \end{cases}$$

This feature would work fine for “Dogs,” but what about “Dgs” or other misspellings?

We could include another feature,

$$\phi_j(w, t) = \begin{cases} 1, & \text{if } w = \text{Dgs and } t = \text{Adv} \\ 0, & \text{otherwise} \end{cases}$$

However, because we are using a linear model, there is no way to relate these two

features. Instead, we include a feature that observes the hidden structure:

$$\phi_t(w, h, t) = \begin{cases} 1, & \text{if } h = \text{Dogs and } t = \text{Adv} \\ 0, & \text{otherwise} \end{cases}$$


To ensure we get the right hidden word, we add features that observe the hidden

structure. For instance,

φt (w, h, t) = inDictionary(h)

would check if a proposed hidden word is in the dictionary. We can also include

features that relate hidden and observed structure like

φt (w, h, t) = editDistance(w,h)

which measures the edit distance between a proposed hidden word and its garbled form.


As above, our goal remains finding the best output structure. We now extend

this search to include all possible hidden structures:

$$\operatorname*{argmax}_{(H, Y) \in \mathrm{GEN}(X)} w \cdot \Phi(X, H, Y) \qquad (2.2)$$
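A small sketch of equation 2.2 for garbled tagging is given below. The vocabulary, tag set, and weights are assumptions made for illustration. The feature names `inDictionary` and `editDistance` follow the text, and because every feature here is local to one word, the joint argmax over (H, Y) can be taken one word at a time.

```python
DICTIONARY = {"Dogs", "Dodge", "can", "not", "fly"}  # assumed toy vocabulary
TAGS = ["N", "V", "Adv"]                              # assumed toy tag set


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def gen(word):
    """GEN for one word: every (hidden word, tag) pair, including no correction."""
    return [(h, t) for h in sorted(DICTIONARY | {word}) for t in TAGS]


def phi(word, hidden, tag):
    """Features over input, hidden, and output structure."""
    return {
        ("hidden/tag", hidden, tag): 1.0,
        "inDictionary": 1.0 if hidden in DICTIONARY else 0.0,
        "editDistance": float(edit_distance(word, hidden)),
    }


def score(w, features):
    return sum(v * w.get(k, 0.0) for k, v in features.items())


def best_structure(sentence, w):
    """Equation 2.2, decomposed per word because all features are local."""
    return [max(gen(x), key=lambda hy: score(w, phi(x, *hy))) for x in sentence]


# Hypothetical weights: penalize edits, reward dictionary words and good tags.
w = {"editDistance": -1.0, "inDictionary": 2.0,
     ("hidden/tag", "Dogs", "N"): 1.0, ("hidden/tag", "can", "V"): 1.0,
     ("hidden/tag", "not", "Adv"): 1.0, ("hidden/tag", "fly", "V"): 1.0}
print(best_structure(["Dougs", "can", "nt", "fly"], w))
```

A single tag feature on the hidden word now covers every misspelling of “Dogs”, which is exactly what the separate per-spelling features could not do.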


2.4 Perceptrons and Hidden Structure

The introduction of hidden structure complicates the Perceptron update rule. If

we had supervision that gave us the correct output and its correct hidden structure, Y ∗ and

H ∗ respectively, we could just change Algorithm 1 so that correct = (Y ∗ , H ∗ ). Unfortunately,

we do not have H ∗ . Even worse, it is often impossible to find the correct hidden structure.

In the garbled tagging problem, if we found some way to find the correct hidden structure,

we would have found an errorless spelling corrector!

Alternatively, we could try to find something close to H ∗ . We might try to average

over all possible hidden structures that produce the correct output. Averaging gives the

formula

$$\sum_{H \in \mathcal{H}} \frac{\Phi(X, H, Y^*)}{|\mathcal{H}|}$$

Figure 2.4: A separator in the hidden structure variant. As in Figure 2.2, the squares have
the correct output, and the circles do not. In this diagram, the light square also has the
correct hidden structure, and the dark squares do not. Note that unlike Figure 2.2, there
are squares on the opposite side of the separator. Even more problematically, the star
represents the average of the squares, and there is no guarantee that it will be on the correct
side of the separator.

Unfortunately, the majority of these structures will be entirely incorrect. For instance in

the garbled tagging example, “Dougs/Dogs/N can/can/V nt/not/Adv fly/fly/V.” will be

correct, but so will “Dougs/Cats/N can/like/V nt/no/Adv fly/dog/V” and many other

absurd sentences.

The reason for this failure is that we no longer have the same separability condition

that we had in Section 2.2. The supervised value Y ∗ tells us only that there is some pair

(H ∗ , Y ∗ ) that is correct. We know nothing about the correctness of (H, Y ∗ ) for any other

H. The new separator is,

w · Φ(X, H ∗ , Y ∗ ) > w · Φ(X, H, Y ′ ) + δ

Figure 2.4 gives the geometric intuition for this formula and demonstrates the difficulty of

learning with hidden structure.

Despite the difficulty of this problem, if we start with some information, we can

proceed in the right direction. Liang et al. (2006) present two online learning algorithms

designed to try to learn in this context, bold updating and local updating.

Algorithm    Best                                      Correct

Perceptron   argmax_{Y ∈ GEN(X)} w · Φ(X, Y)           Y*
Average      argmax_Y w · Φ(X, Y)                      Σ_{H ∈ H} Φ(X, H, Y*) / |H|
Bold         argmax_{Y,H ∈ GEN(X)} w · Φ(X, Y, H)      argmax_H w · Φ(X, Y*, H)
Local        argmax_{Y,H ∈ GEN(X)} w · Φ(X, Y, H)      argmin_{(Y,H) ∈ nbest} m(Y*, Y)

Table 2.1: Perceptron extensions for problems with hidden structure. Best and Correct
refer to the variables used in Algorithm 1.

Bold updating attempts to fix average updating. Instead of averaging out over all

possible hidden structures with the correct output, we use the best scoring hidden structure

under the current model.

$$\operatorname*{argmax}_{H} w \cdot \Phi(X, H, Y^*)$$

We hope that if our current weights are feasible that this hidden structure will be close to

H ∗ , although unfortunately we have no guarantee that this property will hold. Table 2.1

shows the adjustments to the Perceptron algorithm required by bold updating. In the garbled

tagging task, bold updating requires that we find the best hidden words to produce the given

correct part-of-speech tags.
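A minimal sketch of one bold update follows, written against the Best and Correct columns of Table 2.1. The one-word problem, the `gen` and `phi` helpers, and the weights are hypothetical and exist only to make the example runnable; candidates are enumerated exhaustively.

```python
from collections import Counter


def score(w, features):
    return sum(v * w.get(k, 0.0) for k, v in features.items())


def bold_update(w, x, y_star, gen, phi):
    """One bold update: Correct is the best-scoring hidden structure for y_star."""
    candidates = gen(x)                                   # list of (h, y) pairs

    def by_score(hy):
        return score(w, phi(x, *hy))

    best = max(candidates, key=by_score)
    # Raises ValueError if no candidate produces y_star at all.
    correct = max((hy for hy in candidates if hy[1] == y_star), key=by_score)
    if best == correct:
        return w
    new_w = Counter(w)
    new_w.update(phi(x, *correct))
    new_w.subtract(phi(x, *best))
    return new_w


# Hypothetical one-word garbled tagging problem.
def gen(x):
    return [(h, y) for h in ("dog", "dodge") for y in ("N", "V")]


def phi(x, h, y):
    return Counter({("hidden/tag", h, y): 1, ("spelling", x, h): 1})


w0 = {("hidden/tag", "dodge", "V"): 1.0, ("spelling", "dg", "dog"): 0.5}
w1 = bold_update(w0, "dg", "N", gen, phi)
prediction = max(gen("dg"), key=lambda hy: score(w1, phi("dg", *hy)))
print(prediction)
```

Note that the hidden word chosen as Correct depends entirely on the current weights, which is the source of the missing guarantee mentioned above.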

Local updating takes a more drastic approach to the hidden structure problem. In

local updating, we assume that if there is a lot of hidden information, even the supervised

outputs may not be exactly correct. Instead of trying to find an output that matches the

given correct output exactly, we generate a list of n highest scoring outputs and choose the

one that is closest to Y ∗ to be our correct output by a problem specific metric m. This

update strategy is more conservative than bold updating because we perform the update

with two outputs that already have high ranks.

In our garbled tagging case, local updating would be useful only if some training

examples are misspelled even beyond human readability and so the labels are not entirely

reliable. For garbled tagging, a reasonable metric m might be to count how many tags are

different between each Y and Y ∗ .
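Local updating can be sketched in the same style. Here the metric m counts differing tags, as suggested above, and the n-best list is taken from an exhaustively enumerated GEN. The two-word problem, the candidate corrections, and the weights are all assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import product


def score(w, features):
    return sum(v * w.get(k, 0.0) for k, v in features.items())


def tag_mismatches(y_star, y):
    """Metric m: the number of positions whose tags differ."""
    return sum(a != b for a, b in zip(y_star, y))


def local_update(w, x, y_star, gen, phi, n=3, m=tag_mismatches):
    """One local update: Correct is the n-best candidate closest to y_star."""
    nbest = heapq.nlargest(n, gen(x), key=lambda hy: score(w, phi(x, *hy)))
    best = nbest[0]
    correct = min(nbest, key=lambda hy: m(y_star, hy[1]))
    if best == correct:
        return w
    new_w = Counter(w)
    new_w.update(phi(x, *correct))
    new_w.subtract(phi(x, *best))
    return new_w


# Hypothetical two-word problem with two candidate corrections per word.
HIDDEN = {"dg": ("dog", "dig"), "brks": ("barks", "bricks")}


def gen(x):
    taggings = list(product(("N", "V"), repeat=len(x)))
    return [(h, y) for h in product(*(HIDDEN[word] for word in x)) for y in taggings]


def phi(x, h, y):
    return Counter(zip(h, y))


x, y_star = ("dg", "brks"), ("N", "V")
w0 = {("dog", "V"): 2.0, ("barks", "V"): 1.5, ("dog", "N"): 1.0,
      ("dig", "V"): 0.5, ("bricks", "N"): 0.25}
w1 = local_update(w0, x, y_star, gen, phi)
prediction = max(gen(x), key=lambda hy: score(w1, phi(x, *hy)))
print(prediction)
```

Because both Best and Correct come from the top of the current ranking, the weight change is small, which reflects the conservative character of this update.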

Chapter 3

Hypergraph Parsing

In the last chapter, we introduced the structure prediction framework and applied

it to a simple tagging example. In this chapter, we show how to apply structure prediction

to machine translation.

Translation has a form similar to our tagging example. We are given a sentence, X,

in the source language and asked to produce the best sentence, Y , in the target language.

Despite the surface similarity, translation is a much harder problem than tagging. No

reasonable translator would translate a sentence word by word, and so machine translation

requires a sentence level transformation. In order to model this transformation, we need

to first predict some hidden structure before translating. Unfortunately, unlike in the

“garbled tagging” problem, we do not know the true form of this hidden structure.

The hidden structure underlying translation is a source of much debate. In the

introduction, we described several different SMT systems. One of the major distinctions

between the systems is in how they model hidden structure. The IBM and phrase-based

systems use alignments as hidden structure. Alignments are maps from the words in the

source sentence to words they generate in the target sentence. Figure 3.1(a) shows an

Chapter 3: Hypergraph Parsing 20

Figure 3.1: Two possible hidden structures for translation, shown for the sentence pair “Una cierta oración española.” / “Some spanish sentence.”: (a) Alignment model, (b) Synchronous grammar model.

alignment in the context of structure prediction. There are many variations on the alignment

concept, with models that have one-to-many or even many-to-many mappings. We use the

term here in the general sense to distinguish word level hidden structure from other forms.

In this work we are interested in syntax-level mappings. Instead of pairing

words together, we use a synchronous grammar to model this hidden similarity. Synchronous

grammars are bilingual extensions of standard grammars. Instead of producing a parse tree

over a single sentence, they produce a joint parse over a pair of sentences. Figure 3.1(b)

shows a simple synchronous parse tree. The two parts of the tree are almost identical,

except that we need to introduce the word “Una” on the Spanish side and invert the phrase

“spanish sentence.” We do this by pairing “Una” with an empty symbol and rotating the

“spanish sentence” branch, notated by the arc in the tree. Under the synchronous grammar

model, small operations, like this rotation, can cause large effects in the final sentence.

In order to apply the structure prediction framework to a translation system with

a synchronous grammar hidden structure, we need to be able to find the best scoring output

for a given input. Once we have this method we can use the learning algorithms from the

last chapter to train our system. This chapter is devoted to algorithms for finding this

best scoring output. In the next chapter, we combine these parsing ideas with our online

learning techniques to train a translation system.

3.1 Parsing

In this section, we introduce deductive parsing (Shieber, Schabes, and Pereira,

1995), a framework for determining if a sentence is valid under a given grammar. Deductive

parsing provides a compact, logical form for specifying parsers.

The term parsing can refer to several algorithms for processing a sentence with

a grammar. A parser may check if a sentence is valid under the grammar, produce the

correct parse tree for the sentence, or even find all possible parse trees for that sentence.

At heart, these seemingly different applications share a common framework. We focus

first on determining the validity of a sentence and then show how to extend this method to

other applications.

A deductive parser reduces the problem of determining whether a sentence is valid

to that of proving the existence of a parse. Deductive parsers use the original sentence as a

set of axioms and try to prove a goal. They move towards that goal by applying inference

rules that manipulate facts known as items. Inference rules take the form

\[ \frac{A_1 \quad A_2 \quad A_3 \quad \cdots \quad A_n}{B} \;\; \langle \text{side conditions} \rangle \]

which tells us that if the side conditions hold, and we have produced items that satisfy the

preconditions A1 . . . An , then we can infer the item B. The axioms can be thought of as

inference rules with no pre-conditions that introduce the first items into the system, and

the goal is the pre-condition for success. Together the items, axioms, goal, and inference

rules form the deductive system.


Grammar : S→ X Y
S→ X X
X→ x
X→ y
Y → y
Sentence: xy
xxy y

Figure 3.2: A sample CFG grammar and valid sentences it could produce.

To demonstrate deductive parsing, we give a simple parser for a Context-Free

Grammar (CFG) in Chomsky normal form. This grammar formalism has two rule types:

• A→ a

• A→ B C

The first rule states that a non-terminal A produces a terminal symbol a. The second rule

states that a non-terminal symbol A produces two non-terminal symbols B and C. This

grammar parses sentences of the form $s = w_1, w_2, \ldots, w_n$. Sentences are broken up into spans that represent a subsection of the words. For instance, the span $[i, j]$ covers the words $w_{i+1}, \ldots, w_j$. Figure 3.2 shows a trivial CFG and sample sentences it could produce.


We can construct a deductive parser for CFGs by specifying the items,

axioms, goal, and inference rules. In Figure 3.3, we first give a simple deductive form

for a CKY parser. Each item contains a grammar non-terminal and a span. When we

introduce an item into the system, we have proven that we can cover that span with that

non-terminal. Our goal is to have the final non-terminal symbol cover the entire span of the

sentence. The single axiom observes the sentence and introduces non-terminals in place of

their corresponding terminal symbol. The inference rule combines adjacent non-terminals

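Item: $[A, i, j]$

Axioms: \[ \frac{}{[A, i, i+1]} \quad A \to w_{i+1} \]

Goals: $[S, 0, n]$

Inference rules: \[ \frac{[B, i, j] \quad [C, j, k]}{[A, i, k]} \quad A \to B\ C \]

(a) CKY Deductive Parser

\[ \frac{\frac{}{[X, 0, 1]}\; X \to x \qquad \frac{}{[Y, 1, 2]}\; Y \to y}{[S, 0, 2]} \]

(b) CKY Inference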

Figure 3.3: CKY parser specified through deductive rules, and an example inference

according to a rule in the grammar. Figure 3.3(b) shows the full proof of validity of the

example sentence “x y”.
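The deductive system in Figure 3.3 can be run directly as an agenda-driven recognizer. The Python sketch below is illustrative only: the name `cky_recognize`, the `LEXICAL` and `BINARY` rule sets, and the chart and agenda bookkeeping are assumptions of this example, with the grammar taken from Figure 3.2.

```python
from collections import deque

# Grammar of Figure 3.2, split into the two Chomsky-normal-form rule types.
LEXICAL = {("X", "x"), ("X", "y"), ("Y", "y")}   # A -> a
BINARY = {("S", "X", "Y"), ("S", "X", "X")}      # A -> B C


def cky_recognize(words, start="S"):
    """Return True if the goal item [start, 0, n] can be deduced."""
    n = len(words)
    chart = set()       # items that have been proven
    agenda = deque()    # items waiting to be processed

    # Axioms: [A, i, i+1] whenever A -> w_{i+1} is a grammar rule.
    for i, word in enumerate(words):
        for a, terminal in LEXICAL:
            if terminal == word:
                agenda.append((a, i, i + 1))

    while agenda:
        item = agenda.popleft()
        if item in chart:
            continue
        chart.add(item)
        sym, i, j = item
        # Inference rule: [B, i, j] [C, j, k] => [A, i, k], given A -> B C.
        for other, p, q in list(chart):
            for a, b, c in BINARY:
                if b == sym and c == other and p == j:   # item is the left child
                    agenda.append((a, i, q))
                if b == other and c == sym and q == i:   # item is the right child
                    agenda.append((a, p, j))
    return (start, 0, n) in chart
```

For the sentence "x y" the recognizer proves [X, 0, 1] and [Y, 1, 2] from the axioms and then the goal [S, 0, 2], mirroring the inference in Figure 3.3(b).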

In this work we use an Earley style parser (Earley, 1970), which uses a different

inference strategy than CKY. An example Earley parser for CFGs is given in Figure 3.4. The major difference is that Earley-style parsers use partially completed rules, signaled by

the dot notation (•). For instance, A → B • C indicates that the rule has already processed

a B but not yet a C. The Earley axioms introduce all rules with the dot to the far left.

As we parse, the inference rules move the dot further towards the right until the rule is

completed. Figure 3.4(b) shows a full inference of this example under the Earley rules.
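A dotted-rule recognizer in the style of Figure 3.4 can be sketched in a few lines of Python. The representation of an item as `(lhs, rhs, dot, i, j)` and the name `earley_recognize` are assumptions of this example. The sketch also assumes that the completion rule may advance the dot over any non-terminal in the rule, which is how the final step of the example inference in Figure 3.4(b) proceeds.

```python
from collections import deque

# Grammar of Figure 3.2 as (left-hand side, right-hand side) pairs.
RULES = [("S", ("X", "Y")), ("S", ("X", "X")),
         ("X", ("x",)), ("X", ("y",)), ("Y", ("y",))]
NONTERMINALS = {lhs for lhs, _ in RULES}


def earley_recognize(words, start="S"):
    """Return True if a completed start rule spanning [0, n] can be deduced."""
    n = len(words)
    chart, agenda = set(), deque()

    # Axioms: every rule with the dot at the far left, at every position.
    for i in range(n + 1):
        for lhs, rhs in RULES:
            agenda.append((lhs, rhs, 0, i, i))

    while agenda:
        item = agenda.popleft()
        if item in chart:
            continue
        chart.add(item)
        lhs, rhs, dot, i, j = item
        if dot < len(rhs):
            nxt = rhs[dot]
            if nxt not in NONTERMINALS:
                # Scan: move the dot over a matching word.
                if j < n and words[j] == nxt:
                    agenda.append((lhs, rhs, dot + 1, i, j + 1))
            else:
                # Complete: this item waits for a finished `nxt` starting at j.
                for l2, r2, d2, p, q in list(chart):
                    if l2 == nxt and d2 == len(r2) and p == j:
                        agenda.append((lhs, rhs, dot + 1, i, q))
        else:
            # Complete: this finished item advances rules waiting for `lhs` at i.
            for l2, r2, d2, p, q in list(chart):
                if d2 < len(r2) and r2[d2] == lhs and q == i:
                    agenda.append((l2, r2, d2 + 1, p, j))

    return any(l == start and d == len(r) and i == 0 and j == n
               for l, r, d, i, j in chart)
```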

3.2 Hypergraphs and Parsing

Deductive parsing provides a framework for recognizing valid sentences, but we need to find the best scoring parse for a given sentence. Luckily, the two problems are related.

To rank the quality of different parses, we assign a weight to each inference rule in the

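Item: $[A \to \gamma, i, j]$

Axioms: \[ \frac{}{[A \to \bullet\gamma, i, i]} \quad A \to \gamma \]

Goals: $[S\bullet, 0, n]$

Inference rules:

\[ \frac{[A \to \bullet w_{i+1}, i, i]}{[A \to w_{i+1}\bullet, i, i+1]} \]

\[ \frac{[A \to \bullet B\ C, i, j] \quad [B \to \gamma\bullet, j, k]}{[A \to B \bullet C, i, k]} \]

(a) Earley Deductive Parser

\[ \frac{\frac{[S \to \bullet X\ Y, 0, 0] \quad \frac{[X \to \bullet x, 0, 0]}{[X\bullet, 0, 1]}}{[S \to X \bullet Y, 0, 1]} \quad \frac{[Y \to \bullet y, 1, 1]}{[Y\bullet, 1, 2]}}{[S\bullet, 0, 2]} \]

(b) Inference Example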

Figure 3.4: Earley parser specified through deductive rules, and an example inference.

Figure 3.5: A sample directed hypergraph. This hypergraph has both multi-tailed and
multi-headed edges. We use an arrowhead to indicate the source node and a double circle
to indicate the destination.

deductive system. The score of an item is the sum of the weights of the inference rules we

used to discover that item, and the best parse is the best scoring series of inferences that

lead to the goal. For now we assume that these weights are given, and in the next chapter we go into detail about where these weights come from.

Klein and Manning (2001) describe how a weighted deductive parser can be ex-

pressed graphically as a weighted, directed hypergraph. A directed hypergraph is a gen-

eralization of a directed graph. It is a tuple (N, E), where N is a set of nodes and E is

a set of hyperedges. Each hyperedge consists of two non-empty sets of nodes (T, H). The

hyperedge connects all the nodes in tail T to all the nodes in the head H. Figure 3.5

shows a sample hypergraph that demonstrates different edge forms. A weighted, directed

hypergraph includes a weighting function w : E → R for scoring each hyperedge.

Weighted deductive parsers are equivalent to a special case of hypergraph where

each edge has a singleton head node h. The intuition behind this equivalence is that all of

the hypergraph edges have the form (t1 , . . . , tn ) → h while in deductive parsing all inference

rules have the form

\[ \frac{A_1 \quad A_2 \quad A_3 \quad \cdots \quad A_n}{B} \]

If we think of items as nodes and inference rules as edges, we can apply the conversion in

Figure 3.6: Mapping between deductive parsing and hypergraph elements.

[Figure panels (a) and (b) each show a hypergraph over the nodes [X, 0, 1], [X, 1, 2], [Y, 1, 2], and [S, 0, 2].]

(a) CKY Hypergraph for x y (b)

Figure 3.7: Hypergraph representations of a CKY search space for the sentence and gram-
mar pair given in Figure 3.2.

Figure 3.6 to build the hypergraph. The deductive form provides a general methodology for

parsing, and the hypergraph form gives a representation for the search space over a specific

parse. Figure 3.7(a) applies this conversion for the sentence “x y” under the CKY parser

we gave above, and Figure 3.7(b) shows a path through this graph that is equivalent to the

inference given in the last section.
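The conversion from items and inference rules to nodes and hyperedges can be sketched for the CKY parser as follows. This is a minimal illustration, not the thesis implementation: the `Hyperedge` tuple, the distinguished `START` node used as the tail of axiom edges, and the rule weights are all assumptions of the example.

```python
from collections import namedtuple

# A hyperedge with a tuple of tail nodes, a single head node, and a weight.
Hyperedge = namedtuple("Hyperedge", "tails head weight rule")

START = "START"   # assumed source node so that axiom edges have a non-empty tail

# Grammar of Figure 3.2 with hypothetical weights.
LEXICAL = {("X", "x"): 1.0, ("X", "y"): 2.0, ("Y", "y"): 1.0}
BINARY = {("S", "X", "Y"): 1.0, ("S", "X", "X"): 3.0}


def build_cky_hypergraph(words):
    """Items become nodes; each rule application becomes one hyperedge."""
    n = len(words)
    nodes, edges = {START}, []
    # Axioms introduce the first items.
    for i, word in enumerate(words):
        for (a, terminal), weight in LEXICAL.items():
            if terminal == word:
                head = (a, i, i + 1)
                nodes.add(head)
                edges.append(Hyperedge((START,), head, weight, f"{a} -> {terminal}"))
    # Binary inference rules, visiting shorter spans before longer ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for (a, b, c), weight in BINARY.items():
                    if (b, i, j) in nodes and (c, j, k) in nodes:
                        head = (a, i, k)
                        nodes.add(head)
                        edges.append(Hyperedge(((b, i, j), (c, j, k)), head,
                                               weight, f"{a} -> {b} {c}"))
    return nodes, edges


nodes, edges = build_cky_hypergraph(["x", "y"])
```

For "x y" this yields the four item nodes of Figure 3.7 plus the assumed start node, with two competing hyperedges into [S, 0, 2].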

3.3 Hypergraph Traversal

The conversion to hypergraph representation does not give us any new information

about the parsing problem, but it motivates thinking about parsing in terms of graph

traversal. We argued in the last section that if we can find a path through a sentence’s

hypergraph, then we can convert this path into a proof of the sentence's validity. More importantly, we can reduce the problem of finding the best scoring weighted deduction to the problem of finding the shortest path from the start to the goal node. This problem

is known as the single-source shortest path problem and has been well-studied for both

graphs and hypergraphs.

As in standard graph traversal, there are two styles of hypergraph traversal algo-

rithm, Viterbi and Dijkstra. Viterbi-style algorithms provide a fast way for exhaustively

exploring a search space without exploring any edge more than once. The Viterbi algorithm

avoids repeating edges by imposing a topological order on the graph and then examining

each layer in order. Since each node is only in one layer, once we have completed the layer,

we do not need to explore the node again. Viterbi algorithms compactly traverse every

possible path to the goal, so they can be used for finding the shortest path or for computing

aggregate metrics, like the sum of all paths to the goal. These algorithms are particularly

suited for domains with a fixed total ordering. The most common example is the state

0 : 2

0 : 1 1 : 2

0 : 0 1 : 1 2 : 2

Figure 3.8: A topological partial order over a two word sentence.

space of HMMs where the linear order of the observations provides a total ordering of states.

In a hypergraph, the topological partial order of items comes from the spans of the

items. Figure 3.8 shows this partial order for a two word sentence. We can traverse this

partial order in various ways. If we proceed from smaller to larger spans then the Viterbi

traversal is known as bottom-up parsing. If we proceed from left to right then we call it left-

corner parsing. The hypergraph in Figure 3.7(a) is aligned to show a bottom-up topological

ordering on our simple CKY hypergraph. We can use Viterbi traversal of hypergraphs

to find the best parse or to compute metrics like the inside score for the inside-outside algorithm.
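A bottom-up Viterbi traversal of the CKY search space can be sketched as below. The example treats weights as costs to be minimized, in line with the shortest-path view of this chapter. The function name `viterbi_cky` and the weights are assumptions made for illustration.

```python
import math

# Grammar of Figure 3.2 with hypothetical costs (lower is better).
LEXICAL = {("X", "x"): 1.0, ("X", "y"): 2.0, ("Y", "y"): 1.0}
BINARY = {("S", "X", "Y"): 1.0, ("S", "X", "X"): 3.0}


def viterbi_cky(words, start="S"):
    """Visit spans from small to large and keep the lowest cost per item."""
    n = len(words)
    best = {}
    for i, word in enumerate(words):
        for (a, terminal), cost in LEXICAL.items():
            if terminal == word:
                item = (a, i, i + 1)
                best[item] = min(best.get(item, math.inf), cost)
    # Every item of a given span length is final before longer spans are built.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for (a, b, c), cost in BINARY.items():
                    left, right = best.get((b, i, j)), best.get((c, j, k))
                    if left is None or right is None:
                        continue
                    total = left + right + cost
                    if total < best.get((a, i, k), math.inf):
                        best[(a, i, k)] = total
    return best.get((start, 0, n), math.inf)
```

With these weights, "x y" is parsed most cheaply through S → X Y (cost 1 + 1 + 1) rather than S → X X (cost 1 + 2 + 3).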


Using the Viterbi algorithm is a nice way of thinking about parsing, but it does not

give us new information. The hypergraph representation becomes more useful for Dijkstra-

style algorithms. These algorithms get around the problem of exploring every possible path

by repeatedly exploring the current shortest path until a goal node is found. They give

up the ability to compute aggregate metrics and require the extra overhead of a queue to

prioritize future explorations, but can provide a substantial speed-up in practice. Since

they do not need to stay fixed to a topological order, they can explore the most promising

paths, and if at any time they reach a goal node, they can stop knowing they have found

the shortest path.

Knuth (1977) proposed the first Dijkstra-style algorithm for hypergraphs. Algorithm 2 gives

the pseudo-code for a simplified version of this algorithm with binary hyperedges. The

algorithm is very similar to the standard Dijkstra algorithm. It keeps a priority queue of

possible nodes to explore. Each iteration, it explores the best scoring node and checks if it

has already explored a node that shares a hyperedge with that node. If it has, it queues up

the head of that hyperedge. Graehl and Knight (2004) provide an efficient implementation

of this algorithm that runs in time O(V log V + E).

The major advantage of the Knuth algorithm over standard Viterbi parsing is that

it uses ordering information from the problem as opposed to just the topology of the graph.

If there is one very short path, the Knuth algorithm will find it quickly, while the Viterbi

algorithm will still need to traverse the entire graph. In addition, we can speed up the

Knuth algorithm by using A∗ search. If we have additional information about the problem,

we can use an admissible heuristic that underestimates the cost from any node to the goal.

A good heuristic will encourage the algorithm to explore deeper nodes without sacrificing

optimality. Klein and Manning (2003) introduce admissible heuristics for parsing and report huge speed-ups.
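The effect of a heuristic on the queue order can be sketched as follows. This is a hedged illustration rather than the heuristics of Klein and Manning (2003): the function `astar_hypergraph`, the toy hypergraph, and the heuristic `at_least_one_rule` are assumptions of the example, and the sketch assumes a heuristic under which a node's cost is final the first time it is popped. With a heuristic of zero the procedure reduces to the Knuth algorithm.

```python
import heapq
import itertools
import math


def astar_hypergraph(edges, start, goal, heuristic):
    """edges: list of (tails, head, weight). Returns the best cost to goal."""
    remaining = [len(set(tails)) for tails, _, _ in edges]
    out_edges = {}
    for idx, (tails, _, _) in enumerate(edges):
        for t in set(tails):
            out_edges.setdefault(t, []).append(idx)
    dist = {start: 0.0}
    tie = itertools.count()          # tie-breaker so nodes are never compared
    heap = [(heuristic(start), next(tie), start)]
    finished = set()
    while heap:
        _, _, v = heapq.heappop(heap)
        if v in finished:
            continue
        finished.add(v)
        if v == goal:
            return dist[v]
        for idx in out_edges.get(v, ()):
            remaining[idx] -= 1
            if remaining[idx] == 0:          # all tails have been explored
                tails, head, w = edges[idx]
                cand = sum(dist[t] for t in tails) + w
                if head not in finished and cand < dist.get(head, math.inf):
                    dist[head] = cand
                    # Priority is cost so far plus the estimated remaining cost.
                    heapq.heappush(heap, (cand + heuristic(head), next(tie), head))
    return math.inf


GOAL = ("S", 0, 2)
EDGES = [
    (("START",), ("X", 0, 1), 1.0),
    (("START",), ("X", 1, 2), 2.0),
    (("START",), ("Y", 1, 2), 1.0),
    ((("X", 0, 1), ("Y", 1, 2)), GOAL, 1.0),
    ((("X", 0, 1), ("X", 1, 2)), GOAL, 3.0),
]


def at_least_one_rule(node):
    # Every non-goal item still needs one binary rule, which costs at least 1.0.
    return 0.0 if node == GOAL else 1.0


cost_plain = astar_hypergraph(EDGES, "START", GOAL, lambda node: 0.0)
cost_astar = astar_hypergraph(EDGES, "START", GOAL, at_least_one_rule)
```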

3.4 Inversion Transduction Grammar

We can use the Knuth algorithm on any grammar defined by weighted deductive

parse rules. In this work, we are interested in finding the best synchronous parse of a

sentence pair, so in order to use the Knuth algorithm, we need to define a synchronous

deductive parser. We use the Inversion Transduction Grammar (ITG) defined by Wu (1997).

Algorithm 2 Knuth algorithm for shortest paths on hypergraphs.

for all hyperedge e do
    r[e] ⇐ 2 {Number of tails not yet discovered}
end for
for all node u do
    d[u] ⇐ ∞
end for
d[s] ⇐ 0 {s is the start node}
Q ⇐ all nodes, keyed by d
while Q ≠ ∅ do
    v ⇐ Extract-Min(Q)
    for all hyperedge e out of v do
        e is (t1, t2, h, w) {Edge has two tails, a head, and a weight}
        r[e] ⇐ r[e] − 1 {Discovered a tail}
        if r[e] = 0 then
            d[h] ⇐ min(d[t1] + d[t2] + w, d[h]) {Update the best distance to the node}
            Decrease-Key(Q, h)
        end if
    end for
end while
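Algorithm 2 can be made concrete with a binary heap. The sketch below is one possible rendering, not the thesis code: it assumes non-negative weights, allows any number of tails per hyperedge, replaces Decrease-Key with re-insertion and lazy deletion (a common substitute when using Python's `heapq`), and records a back-pointer per node so the best derivation can be recovered. The name `knuth_shortest_path` and the toy hypergraph are hypothetical.

```python
import heapq
import itertools
import math


def knuth_shortest_path(edges, start, goal):
    """edges: list of (tails, head, weight). Returns (cost, back-pointers)."""
    remaining = [len(set(tails)) for tails, _, _ in edges]   # r[e]
    out_edges = {}
    for idx, (tails, _, _) in enumerate(edges):
        for t in set(tails):
            out_edges.setdefault(t, []).append(idx)
    dist = {start: 0.0}                                      # d[u], absent = infinity
    back = {}                                                # best incoming edge index
    tie = itertools.count()
    heap = [(0.0, next(tie), start)]
    done = set()
    while heap:
        d, _, v = heapq.heappop(heap)
        if v in done:
            continue                  # stale queue entry (lazy deletion)
        done.add(v)
        if v == goal:
            return d, back
        for idx in out_edges.get(v, ()):
            remaining[idx] -= 1       # discovered a tail
            if remaining[idx] == 0:
                tails, head, w = edges[idx]
                cand = sum(dist[t] for t in tails) + w
                if head not in done and cand < dist.get(head, math.inf):
                    dist[head] = cand
                    back[head] = idx
                    heapq.heappush(heap, (cand, next(tie), head))
    return math.inf, back


EDGES = [
    (("START",), ("X", 0, 1), 1.0),
    (("START",), ("X", 1, 2), 2.0),
    (("START",), ("Y", 1, 2), 1.0),
    ((("X", 0, 1), ("Y", 1, 2)), ("S", 0, 2), 1.0),
    ((("X", 0, 1), ("X", 1, 2)), ("S", 0, 2), 3.0),
]
cost, back = knuth_shortest_path(EDGES, "START", ("S", 0, 2))
```

On this toy hypergraph the goal is reached through the edge with tails [X, 0, 1] and [Y, 1, 2], and the search stops as soon as the goal is extracted from the queue.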

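[Figure panels: two grids with source positions 0, i, j, k along the horizontal axis and destination positions l through n along the vertical axis.]

(a) Straight Rule (b) Rotation Rule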

Figure 3.9: Grids demonstrating the application of ITG rules. The source sentence is along the horizontal axis and the destination sentence is along the vertical axis. Figure 3.9(a) shows the application of the straight rule. The smaller box represents [C, j, k, m, n] and the larger box represents [A → [B • C], i, j, l, m]. Together they form [A•, i, k, l, n]. Figure 3.9(b) shows the rotation rule. The smaller box is now [C, j, k, l, m] and the larger box represents [A → ⟨B • C⟩, i, j, m, n].

In future work, we hope to explore the use of hypergraph parsing with richer synchronous

grammar frameworks.

ITG is the natural bilingual extension of CFG. This formalism has three types of rules:

• A → sw/tw

• A → [B C]

• A → ⟨B C⟩

The first rule is a lexical alignment rule. It says that a non-terminal A produces

a terminal word sw in the source language and tw in the target language. There are two

important special cases of the first rule, A → sw/ε and A → ε/tw, that align words with the

empty symbol. The second rule states that A produces two non-terminal symbols B and

C ordered left to right in both languages. The final rule is the rotation rule. It says that

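Item: $[A, i, j, l, m]$

Axioms:

\[ \frac{}{[A, i, i+1, l, l+1]} \quad A \to \mathit{sw}_{i+1}/\mathit{tw}_{l+1} \]

\[ \frac{}{[A, i, i+1, l, l]} \quad A \to \mathit{sw}_{i+1}/\epsilon \]

\[ \frac{}{[A, i, i, l, l+1]} \quad A \to \epsilon/\mathit{tw}_{l+1} \]

Goals: $[S, 0, n, 0, m]$

Inference rules:

\[ \frac{[B, i, j, l, m] \quad [C, j, k, m, n]}{[A, i, k, l, n]} \quad A \to [B\ C] \]

\[ \frac{[B, i, j, m, n] \quad [C, j, k, l, m]}{[A, i, k, l, n]} \quad A \to \langle B\ C \rangle \]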

Figure 3.10: Deductive parsing rules for CKY ITG.

A produces two non-terminal symbols ordered B C in the source language and C B in the

destination language. Synchronous grammars parse pairs of sentences known as bitexts. We notate this pair $(S, T)$, where $S = \mathit{sw}_1, \ldots, \mathit{sw}_n$ and $T = \mathit{tw}_1, \ldots, \mathit{tw}_m$. The notion of a

span also generalizes to synchronous parsing. A span now covers [i, j, l, m] where [i, j] is the

span in the source language, and [l, m] is the span in the destination language. Figure 3.9

shows the alignment structures produced by the different rules.

Just as with CFG, we can parse ITG with a CKY or Earley parse style. Figure 3.10

gives the CKY parse rules which are a simple extension of the CFG case. In our system,

we use the Earley-style algorithm in Figure 3.11 because it interacts better with some of the extensions we present in the next chapter. The Earley ITG parser has many more inference rules than our previous parsers, but it is not much more complicated than the previous Earley parser. The three word introduction rules handle the introduction of a lexical pair in

addition to the special cases noted earlier when we align a word with the empty symbol.

The scanning rules simply implement the functionality shown in Figure 3.9.
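The CKY rules of Figure 3.10 translate into a small dynamic program over bilingual spans. The following sketch is illustrative only: `itg_best_cost`, the single non-terminal X, the rule costs, and the Spanish and English word pairs (modelled loosely on the example of Figure 3.1) are assumptions. Items are visited in order of total span size so that both children of a combination are final before their parent is built.

```python
import math


def itg_best_cost(src, tgt, lexical, straight, inverted, start="X"):
    """Lowest-cost ITG parse of a sentence pair; items are (A, i, j, l, m)."""
    n, m = len(src), len(tgt)
    best = {}

    def relax(item, cost):
        if cost < best.get(item, math.inf):
            best[item] = cost

    # Axioms: A -> sw/tw, A -> sw/eps, A -> eps/tw (None stands for epsilon).
    for (a, sw, tw), cost in lexical.items():
        if sw is None and tw is None:
            continue
        for i in range(n + 1):
            for l in range(m + 1):
                if sw is not None and (i >= n or src[i] != sw):
                    continue
                if tw is not None and (l >= m or tgt[l] != tw):
                    continue
                relax((a, i, i + (sw is not None), l, l + (tw is not None)), cost)

    # Combination rules, in order of source length plus target length.
    for size in range(2, n + m + 1):
        for slen in range(0, min(n, size) + 1):
            tlen = size - slen
            if tlen > m:
                continue
            for i in range(n - slen + 1):
                k = i + slen
                for lo in range(m - tlen + 1):
                    hi = lo + tlen
                    for j in range(i, k + 1):
                        for mid in range(lo, hi + 1):
                            for (a, b, c), cost in straight.items():
                                left = best.get((b, i, j, lo, mid))
                                right = best.get((c, j, k, mid, hi))
                                if left is not None and right is not None:
                                    relax((a, i, k, lo, hi), left + right + cost)
                            for (a, b, c), cost in inverted.items():
                                left = best.get((b, i, j, mid, hi))
                                right = best.get((c, j, k, lo, mid))
                                if left is not None and right is not None:
                                    relax((a, i, k, lo, hi), left + right + cost)
    return best.get((start, 0, n, 0, m), math.inf)


LEXICAL = {("X", "una", None): 1.0, ("X", "cierta", "some"): 1.0,
           ("X", "oración", "sentence"): 1.0, ("X", "española", "spanish"): 1.0}
STRAIGHT = {("X", "X", "X"): 0.0}
INVERTED = {("X", "X", "X"): 1.0}

source = ["una", "cierta", "oración", "española"]
target = ["some", "spanish", "sentence"]
pair_cost = itg_best_cost(source, target, LEXICAL, STRAIGHT, INVERTED)
```

The best parse here uses four lexical rules and one inverted rule, the latter swapping "oración española" into "spanish sentence". The nested loops make the polynomial cost of synchronous CKY parsing visible, which is one motivation for the best-first search described earlier.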

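Item: $[A \to \gamma, i, j, l, m]$

Axioms: \[ \frac{}{[A \to \bullet\gamma, i, i, l, l]} \quad A \to \gamma \]

Goals: $[S, 0, n, 0, m]$

Inference rules:

Word introduction:

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\mathit{tw}_{l+1}, i, i, l, l]}{[A \to \mathit{sw}_{i+1}/\mathit{tw}_{l+1}\,\bullet, i, i+1, l, l+1]} \]

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\epsilon, i, i, l, l]}{[A \to \mathit{sw}_{i+1}/\epsilon\,\bullet, i, i+1, l, l]} \]

\[ \frac{[A \to \bullet\,\epsilon/\mathit{tw}_{l+1}, i, i, l, l]}{[A \to \epsilon/\mathit{tw}_{l+1}\,\bullet, i, i, l, l+1]} \]

Combination:

\[ \frac{[A \to \bullet[B\ C], i, i, l, l] \quad [B\bullet, i, j, l, m]}{[A \to [B \bullet C], i, j, l, m]} \]

\[ \frac{[A \to [B \bullet C], i, j, l, m] \quad [C\bullet, j, k, m, n]}{[A\bullet, i, k, l, n]} \]

\[ \frac{[A \to \bullet\langle B\ C \rangle, i, i, l, l] \quad [B\bullet, i, j, l, m]}{[A \to \langle B \bullet C \rangle, i, j, l, m]} \]

\[ \frac{[A \to \langle B \bullet C \rangle, i, j, m, n] \quad [C\bullet, j, k, l, m]}{[A\bullet, i, k, l, n]} \]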

Figure 3.11: Deductive parsing rules for Earley ITG.

Chapter 4

Translation by ITG Parsing

In this chapter, we combine ideas from the previous two chapters to build a full

translation system. We begin in Section 4.1 by presenting a naïve base system that uses

basic learning and search approaches. While this system is powerful enough to perform

translation, it has some major deficiencies. In the next few sections, we describe some of

these issues and approaches for overcoming them. In Section 4.2, we describe an extension

to our search algorithm which lets us use a relaxed learning approach that is more appropriate for the translation domain. In Section 4.3, we introduce an extension to our parser which

allows us to incorporate richer features in the model. These two extensions complete our

translation system, and in Section 4.4, we go into more detail about the specific features

used in the final model.

4.1 Translation by Parsing

We described two methods for learning with hidden structure, bold and local

updating. In this section, we begin by describing a simple bold updating translation system

that provides the groundwork for our final system.

Chapter 4: Translation by ITG Parsing 35

Bold updates require that we find two values for each training example: the best and the correct structure. To find the correct structure, we fix the output and find the highest scoring hidden structure,

\[ \arg\max_{H} \; w \cdot \Phi(X, H, Y^{*}) \]

To find the best structure, we fix the input and find the highest scoring hidden and output structures,

\[ \arg\max_{H, Y} \; w \cdot \Phi(X, H, Y) \]
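Once both structures are found, an online learner moves the weights toward the features of the correct structure and away from those of the best structure. The update rule itself is defined in the previous chapter and is not reproduced here, so the sketch below assumes a plain perceptron-style additive update. The function name, the sparse dictionary representation, and the `rate` parameter are hypothetical.

```python
def perceptron_update(weights, correct_features, best_features, rate=1.0):
    """Return new weights: w + rate * (Phi(correct) - Phi(best)).

    All arguments are sparse vectors stored as {feature name: value} dicts.
    """
    updated = dict(weights)
    for name in set(correct_features) | set(best_features):
        delta = correct_features.get(name, 0) - best_features.get(name, 0)
        if delta:
            updated[name] = updated.get(name, 0.0) + rate * delta
    return updated


w = perceptron_update({"a": 1.0},
                      correct_features={"a": 1, "b": 2},
                      best_features={"a": 2, "c": 1})
```

When the best structure already equals the correct one, the two feature vectors cancel and the weights are left unchanged.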

In translation by parsing, the hidden structure is a synchronous parse tree over

the two sentences. Therefore, finding the correct hidden structure is equivalent to finding

the best scoring synchronous parse, a problem we discussed at length in Chapter 3. For our

base system, we use the Earley-style ITG parser with the Knuth algorithm to find the

correct hidden structure.

For the best structure, we need to modify the parser so that it does not require

a fixed target sentence. Moving from a parser that generates trees to one that generates

trees and sentences seems like a major change; however, the deductive parser makes this

very easy. To generate parses over all possible output sentences, we can simply remove

any constraints on the target sentence. Figure 4.1 shows a relaxed version of the Earley

parser without target constraints. Given a source sentence, this parser will generate all

possible parses over all possible target sentences. In practice, there would be no way to enumerate all of these outputs, but since we are only looking for the best scoring pair, we can avoid exploring the vast majority of states. Notice also that this parser does not tell us

what target sentence is generated, but from the sequence of inference steps we can retrieve

the target sentence.
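Reading the target sentence off a finished derivation can be sketched as a recursive walk over the sequence of inference steps, viewed as a tree. The tuple encoding of derivations used below (`"lex"`, `"straight"`, `"inverted"`) and the function `read_off` are assumptions of this example.

```python
def read_off(tree):
    """Return (source words, target words) generated by a derivation tree.

    A tree is ("lex", sw, tw), ("straight", left, right) or
    ("inverted", left, right); None stands for the empty symbol.
    """
    kind = tree[0]
    if kind == "lex":
        _, sw, tw = tree
        return ([sw] if sw is not None else []), ([tw] if tw is not None else [])
    _, left, right = tree
    left_src, left_tgt = read_off(left)
    right_src, right_tgt = read_off(right)
    if kind == "straight":
        return left_src + right_src, left_tgt + right_tgt
    if kind == "inverted":
        # Same order in the source language, swapped order in the target.
        return left_src + right_src, right_tgt + left_tgt
    raise ValueError(f"unknown node type: {kind}")


derivation = ("straight",
              ("lex", "una", None),
              ("straight",
               ("lex", "cierta", "some"),
               ("inverted",
                ("lex", "oración", "sentence"),
                ("lex", "española", "spanish"))))
source_words, target_words = read_off(derivation)
```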


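Item: $[A \to \gamma, i, j]$

Axioms: \[ \frac{}{[A \to \bullet\gamma, i, i]} \quad A \to \gamma \]

Goals: $[S, 0, n]$

Inference rules:

Word introduction:

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\mathit{tw}, i, i]}{[A \to \mathit{sw}_{i+1}/\mathit{tw}\,\bullet, i, i+1]} \]

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\epsilon, i, i]}{[A \to \mathit{sw}_{i+1}/\epsilon\,\bullet, i, i+1]} \]

\[ \frac{[A \to \bullet\,\epsilon/\mathit{tw}, i, i]}{[A \to \epsilon/\mathit{tw}\,\bullet, i, i]} \]

Combination:

\[ \frac{[A \to \bullet[B\ C], i, i] \quad [B\bullet, i, j]}{[A \to [B \bullet C], i, j]} \]

\[ \frac{[A \to [B \bullet C], i, j] \quad [C\bullet, j, k]}{[A\bullet, i, k]} \]

\[ \frac{[A \to \bullet\langle B\ C \rangle, i, i] \quad [B\bullet, i, j]}{[A \to \langle B \bullet C \rangle, i, j]} \]

\[ \frac{[A \to \langle B \bullet C \rangle, i, j] \quad [C\bullet, j, k]}{[A\bullet, i, k]} \]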

Figure 4.1: Deductive parsing rules for Earley ITG with no explicit target sentence.

4.2 Local Updating

While the bold update approach has a convenient form, it has two major problems

when used for translation. Both issues stem from the fact that even for people, translation is a poorly defined problem.

The first issue is that in any reasonable corpus, there will be many translations that

are non-literal. For instance, in the Europarl corpus (Koehn, 2002), a standard training set

for machine translation, the fragment

Zweite bemerkung zu der mitteilung.

is translated as

My second comment is about the notice.

This fragment has the literal translation of “Second comment about the notice.” Adding the

word “is” can be justified as making the English sentence more fluent. On the other hand, the word “my” seems like an embellishment by the translator. In this context, we would not consider the translations “The second comment is about the notice.” or even “Second comment about the notice.” to be wrong, but that is just what bold updating does. It would move the parameters away from these reasonable translations to satisfy the

non-literal translation.

The second problem is a subtle variant of the first. The issue is that all models

of hidden structure make assumptions about the underlying nature of translation that are

often incorrect. Example translations that are correct and literal may violate the assumptions of the underlying model. For instance, the example sentence,

Einige sagen jetzt noch ; der dialog ist notwendig.

is translated as

Some people are still saying that dialogue is needed.

This translation is mostly literal except that the German sentence uses a semi-colon, while

the English sentence uses a subordinate clause. In an alignment-based system, this translation would probably not be as much of a problem as the first example. The translation of “;” to “that” is not too far of a stretch at the word level. On the other hand, for a syntax-based system, this translation could imply an entirely different internal structure. We could force the system to find some way to arrive at this exact translation, but it is unlikely it would find it without producing a wrong hidden structure that may make the system


These are difficult issues that we cannot hope to fully solve. We cannot know

if our system is producing the wrong output because of poor parameters or because of a

non-literal translation. Local updates try to avoid this determination by making smaller

updates. Instead of updating towards Y ∗ , we use Y ∗ as a reference to judge the best

scoring outputs of the current system. If the reference is reasonable, it is likely to be one of the best outputs produced, and this update is identical to a bold update. If it is non-literal or

violates our hidden structure assumptions then we hope to find a similar, but less drastic

translation as one of the top outputs.

Local updating also requires computing the best and correct structure. In local

updates, best has the same definition as in bold updating, but the correct structure is now

the output sentence within the best n outputs that is closest to the supervised output.

\[ \arg\min_{Y \in n\text{-best}} \; m(Y^{*}, Y) \]

To measure closeness, we use Bleu score (Papineni et al., 2001), a standard test

metric for translation precision. Bleu score computes the precision of a test sentence by

Figure 4.2: Graph where the simple n-best Dijkstra algorithm is inefficient.

counting the number of n-grams the sentence has in common with a reference translation

with a penalty for short translations.
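Selecting the local-update target from an n-best list can be sketched with a simplified sentence-level score in the spirit of Bleu. This is not the official metric: the add-one smoothing, the default maximum n-gram length of 2, and the function names are assumptions made so that the example stays short and well defined on single sentences. Since the score measures similarity, the closest output is the one with the highest score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=2):
    """Smoothed n-gram precision with a penalty for short candidates."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        log_precision += math.log((overlap + 1) / (total + 1))   # add-one smoothing
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision / max_n)


def closest_to_reference(nbest, reference):
    """Pick the n-best output that scores highest against the reference."""
    return max(nbest, key=lambda cand: sentence_bleu(cand, reference))


reference = "my second comment is about the notice".split()
nbest = ["second comment about the notice".split(),
         "the second comment is about the notice".split(),
         "notice the about".split()]
chosen = closest_to_reference(nbest, reference)
```

For the Europarl fragment discussed above, the sketch selects "the second comment is about the notice" as the update target, since the shorter literal translation is penalized for brevity.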

The more difficult problem presented by local updating is calculating the n-best

parses for a given sentence. The Knuth algorithm provides a method for finding the single

best parse, and Huang and Chiang (2005) describe a fast Viterbi-style algorithm for finding

the n-best parses for a given sentence. We want to keep the speed advantages of the Dijkstra

algorithm, so we need an n-best Knuth algorithm. There is a simple n-best extension to the Dijkstra algorithm. Instead of memoizing the shortest path between the start node and

each other node, n-best algorithms remember the top n shortest paths.
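The simple n-best extension can be sketched on an ordinary graph by allowing every node to be extracted from the queue up to n times. The function name, the adjacency-list format, and the example graph are assumptions of this illustration, and edge weights are assumed to be non-negative.

```python
import heapq


def n_shortest_costs(graph, source, goal, n):
    """Costs of the n cheapest paths from source to goal, in ascending order.

    graph maps a node to a list of (neighbor, weight) pairs.
    """
    pops = {}            # how many paths have been finalized per node
    results = []
    heap = [(0.0, source)]
    while heap and len(results) < n:
        cost, node = heapq.heappop(heap)
        if pops.get(node, 0) >= n:
            continue     # the top n paths to this node are already known
        pops[node] = pops.get(node, 0) + 1
        if node == goal:
            results.append(cost)
        for neighbor, weight in graph.get(node, ()):
            heapq.heappush(heap, (cost + weight, neighbor))
    return results


GRAPH = {"s": [("a", 1.0), ("b", 2.0)],
         "a": [("g", 1.0), ("b", 0.5)],
         "b": [("g", 1.0)]}
three_best = n_shortest_costs(GRAPH, "s", "g", 3)
```

Note that the sketch stores up to n paths for every intermediate node, which is the source of the wasted work discussed next.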

Unfortunately, this simple n-best extension keeps around too much information.

For the graph in Figure 4.2, the algorithm would store the n possible ways of getting to

the two non-goal nodes, but we only are interested in n ways to get to the goal. In the

worst case, half of these paths are wasted. To fix this issue, Huang (2005) gives a lazy frontier version of a k-best Knuth algorithm.

The basic idea behind the algorithm is to wait until we need the next best value

for a node before computing that value for previous nodes. Figure 4.3 sketches the control

behind the lazy n-best computation. In Figure 4.3(a), we have just found the third best

path to the node on the far right. It asks the edge where this path came from to look for

Figure 4.3: Diagram showing the lazy n-best Knuth algorithm in progress.

any successors. Figure 4.3(b) shows the state of the parent edge. We have found two of

the best paths for the top node and three of the best paths for the bottom node. We now

look to see if we can find any successors for the path we just found by looking at possible

neighboring paths up and to the right. We can explore the up path immediately since it is

a combination of two paths we already have. We cannot do anything yet with the right

path because we have not found the fourth best path for the bottom node. Once this node

arrives we can pass it along.
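The frontier bookkeeping at a single binary hyperedge can be sketched as a lazy walk over a grid of path combinations. The sketch simplifies the full algorithm in one important way: the sorted cost lists of the two tail nodes are given up front, whereas the lazy algorithm would request each further entry only when the walk reaches it. The function name and the example costs are hypothetical.

```python
import heapq


def lazy_k_best_sums(top, bottom, weight, k):
    """k cheapest combinations of one path from each tail of a hyperedge.

    top and bottom are ascending lists of path costs. Returns a list of
    (cost, i, j) where i and j index the combined paths.
    """
    if not top or not bottom:
        return []
    heap = [(top[0] + bottom[0] + weight, 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        cost, i, j = heapq.heappop(heap)
        out.append((cost, i, j))
        # Successors: the neighboring cells "up" and "to the right".
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(top) and nj < len(bottom) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (top[ni] + bottom[nj] + weight, ni, nj))
    return out


best_three = [cost for cost, _, _ in lazy_k_best_sums([1, 4], [2, 3, 7], 1, 3)]
```

Only the cells adjacent to combinations already extracted are ever scored, so most of the grid is never touched when k is small.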

Our final system uses an approach that Liang et al. (2006) refer to as hybrid

updating. We first try to do bold updating by searching for a reasonable correct output. If

we cannot find one, we switch to local updating.

4.3 Bilingual Parsing

In the original IBM SMT system, Brown et al. (1990) divide the translation

problem into two parts, a translation model and a language model. The translation model

manages the conversion of words from source to target language, and the language model

makes sure that the words produce a coherent target sentence. The structure prediction

framework frees us from making this distinction. We can model arbitrary properties by

including them as features. For instance, the IBM model uses trigram probabilities to

model the target language. If we want to include a trigram language model in the scoring,

we would just include a feature to count trigrams.

\[ \phi_t(t_i, t_{i-1}, t_{i-2}) = \begin{cases} 1, & \text{if } t_i = a,\ t_{i-1} = b,\ t_{i-2} = c \\ 0, & \text{otherwise} \end{cases} \]
Unfortunately, adding this feature does not help if it does not efficiently decom-

pose over our search space. Since our search space is a hypergraph, the features need to

decompose over inference rules that produce the hyperedges. For instance, we can easily

include a feature that counts up the occurrences of a grammar rule, like

\[ \phi_{rule}(r) = \begin{cases} 1, & \text{if } r = rule \\ 0, & \text{otherwise} \end{cases} \]

because each time we apply an inference rule like

\[ \frac{[A \to [B \bullet C], i, j] \quad [C\bullet, j, k]}{[A\bullet, i, k]} \]

we know that the grammar rule $A \to [B\ C]$ has been applied, and so the feature should fire. On the other hand, when we apply this inference rule, it is also likely that we have created a new

n-gram in the target sentence, but the inference rule does not have any knowledge about

the target words, so it cannot fire the feature.
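The requirement that features decompose over inference rules can be illustrated with sparse feature vectors attached to hyperedges. In this sketch the dictionary encoding of an edge, the feature names, and the weights are all hypothetical. The point is only that the score of a whole derivation equals the sum of the scores of its edges, which is what lets a shortest-path search use the model directly.

```python
from collections import Counter


def edge_features(edge):
    """Features that fire on one hyperedge: here a single rule-count feature."""
    return Counter({"RULE:" + edge["rule"]: 1})


def derivation_features(edges):
    """Feature vector of a derivation: the sum over the edges it uses."""
    total = Counter()
    for edge in edges:
        total.update(edge_features(edge))
    return total


def dot(weights, features):
    """Inner product of a weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


WEIGHTS = {"RULE:S -> [X X]": -1.0, "RULE:X -> [X X]": 0.5}
derivation = [{"rule": "S -> [X X]"}, {"rule": "X -> [X X]"}, {"rule": "X -> [X X]"}]

whole_score = dot(WEIGHTS, derivation_features(derivation))
edgewise_score = sum(dot(WEIGHTS, edge_features(e)) for e in derivation)
```

An n-gram feature does not fit this pattern as long as the edge carries no information about target words, which is the problem addressed in the remainder of this section.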

There are two ways of dealing with this issue. We could first generate all possible

outputs and then apply the n-gram features. This would still ensure that we find the

optimal output, but the huge number of possible outputs makes this infeasible. Collins and Koo (2005) propose instead generating an n-best list of high scoring outputs and then ranking them with a richer set of features. This technique, known as discriminative reranking, is

non-optimal, but much more efficient. ?) apply this technique to machine translation.

The other option is to inject the extra information into the parser. This is known

as the n-gram intersection trick and was first proposed for ITGs by Wu (1997). The idea

is to simultaneously build up the information needed for language model features while we

parse. We augment the items of our parser to include both the current symbol and the

outer words of the implied target sentence. Figure 4.4 shows the new parse rules with the

integrated bigram language model. Now when we apply the rule

\[ \frac{[A \to [B \bullet C], i, j, s, e'] \quad [C\bullet, j, k, s', e]}{[A\bullet, i, k, s, e]} \]

we know that we have implicitly added the bigram $(e', s')$ to our destination sentence.

Figure 4.5 sketches this rule.

This transformation alone is enough for an integrated language model, but it vastly

increases the size of our hypergraph. Previously, our items included a non-terminal and two indices, so there were $O(|N| n^2)$ nodes, where $|N|$ is the number of nonterminals and $n$ is the length of the sentence. In addition, the combination rules have three possible non-terminals and three free indices, so there were $O(|N|^3 n^3)$ edges. Now by storing the corner words, we have $O(|N| n^2 w^2)$ nodes and $O(|N|^3 n^3 w^4)$ edges,¹ where $w$ is the number of words in the target language.

We can reduce the number of edges by a factor of w in this graph by applying the hook trick (Huang, Zhang, and Gildea, 2005). The hook trick notes that there is no

dependence between the bigram combination and the nonterminal combination, so we can

do them in two separate steps. We introduce a unary hook rule that first applies the bigram

and adjust the combination rules accordingly. Items that are “hooked” can only combine

with other items that share the hook word. Figure 4.6 shows how requiring the hook reduces

the number of rules applied. The final deductive parser with the hook trick is shown in
¹ This is actually a bit untrue. Since we can insert pairs that are empty on the source side, we have cycles in the hypergraph. To eliminate this in practice, we add a field h to the item which counts up the number of these edges. We add a side condition to the inference rules that prevents this field from reaching a maximum value m. This eliminates cycles with a cost of m to the size of the hypergraph.

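Item: $[A \to \gamma, i, j, s, e]$

Axioms: \[ \frac{}{[A \to \bullet\gamma, i, i, \epsilon, \epsilon]} \quad A \to \gamma \]

Goals: $[S, 0, n, \langle s \rangle, \langle /s \rangle]$

Inference rules:

Word introduction:

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\mathit{tw}, i, i, \epsilon, \epsilon]}{[A \to \mathit{sw}_{i+1}/\mathit{tw}\,\bullet, i, i+1, \mathit{tw}, \mathit{tw}]} \]

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\epsilon, i, i, \epsilon, \epsilon]}{[A \to \mathit{sw}_{i+1}/\epsilon\,\bullet, i, i+1, \epsilon, \epsilon]} \]

\[ \frac{[A \to \bullet\,\epsilon/\mathit{tw}, i, i, \epsilon, \epsilon]}{[A \to \epsilon/\mathit{tw}\,\bullet, i, i, \mathit{tw}, \mathit{tw}]} \]

Combination:

\[ \frac{[A \to \bullet[B\ C], i, i, \epsilon, \epsilon] \quad [B\bullet, i, j, s, e]}{[A \to [B \bullet C], i, j, s, e]} \]

\[ \frac{[A \to [B \bullet C], i, j, s, e'] \quad [C\bullet, j, k, s', e]}{[A\bullet, i, k, s, e]} \]

\[ \frac{[A \to \bullet\langle B\ C \rangle, i, i, \epsilon, \epsilon] \quad [B\bullet, i, j, s, e]}{[A \to \langle B \bullet C \rangle, i, j, s, e]} \]

\[ \frac{[A \to \langle B \bullet C \rangle, i, j, s', e] \quad [C\bullet, j, k, s, e']}{[A\bullet, i, k, s, e]} \]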

Figure 4.4: Deductive parsing rules for Earley ITG with bigram intersection.


[Figure: two adjacent items with corner words (s, e′) and (s′, e); joining them forms the new bigram e′ s′.]

Figure 4.5: Representation of the new bigram ITG items.








(a) Without the hook trick (b) Hook trick

Figure 4.6: The figure shows how the addition of unary hook rules can reduce the total number of edges. In the first example, we need $O(n^2)$ rule applications to connect two pairs of $n$ items. If we first group all items with a common right element and then do this combination, we reduce the cost by a factor of $n$.

Figure 4.7. With the hook trick, each combination rule has only three free target words. This reduces the number of edges to $O(|N|^3 n^3 w^3)$.
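The saving from the hook trick can be checked on a toy combination step. In the sketch below, items are reduced to their corner words, costs are integers, and the choice of hooking the right-hand item is an assumption of the example, as are the function names and the toy bigram costs. The direct combination ranges over four free words, while the hooked version runs two passes over three free words each and produces the same best costs.

```python
import math
from itertools import product


def combine_direct(left, right, bigram_cost, vocab):
    """Join (s, e1) with (s1, e) and score the bigram (e1, s1) in one step."""
    best, ops = {}, 0
    for s, e1, s1, e in product(vocab, repeat=4):
        ops += 1
        if (s, e1) in left and (s1, e) in right:
            cost = left[(s, e1)] + bigram_cost(e1, s1) + right[(s1, e)]
            if cost < best.get((s, e), math.inf):
                best[(s, e)] = cost
    return best, ops


def combine_with_hook(left, right, bigram_cost, vocab):
    """First hook each right item onto a preceding word, then join."""
    ops, hooked = 0, {}
    # Unary hook step: replace the first word s1 by the word e1 that precedes it.
    for e1, s1, e in product(vocab, repeat=3):
        ops += 1
        if (s1, e) in right:
            cost = bigram_cost(e1, s1) + right[(s1, e)]
            if cost < hooked.get((e1, e), math.inf):
                hooked[(e1, e)] = cost
    best = {}
    # Combination step: items must share the hook word e1.
    for s, e1, e in product(vocab, repeat=3):
        ops += 1
        if (s, e1) in left and (e1, e) in hooked:
            cost = left[(s, e1)] + hooked[(e1, e)]
            if cost < best.get((s, e), math.inf):
                best[(s, e)] = cost
    return best, ops


VOCAB = ["a", "b", "c"]
LEFT = {("a", "b"): 2, ("a", "c"): 1, ("b", "b"): 4}
RIGHT = {("a", "c"): 3, ("b", "a"): 1, ("c", "c"): 2}


def toy_bigram_cost(u, v):
    # Hypothetical language model: two cheap bigrams, everything else costs 5.
    return 0 if (u, v) in {("b", "a"), ("c", "c")} else 5


direct, direct_ops = combine_direct(LEFT, RIGHT, toy_bigram_cost, VOCAB)
hooked, hooked_ops = combine_with_hook(LEFT, RIGHT, toy_bigram_cost, VOCAB)
```

With a vocabulary of w words the direct version performs w^4 checks and the hooked version 2w^3, which matches the factor of w quoted above up to a constant.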

4.4 Features

This section surveys the features we use to assess the quality of a translation. One

of the advantages of a discriminative model is the ability to add features without worrying

about correlation effects, so we can add any features we like as long as they decompose over

the inference rules. As we noted in the last section, the decomposition produces functions

over inference rules.

In practice, we distinguish four types of rules: axioms, combination rules, hook

rules, and goal edges. Axiom edges are the edges that first introduce rules and words,

combination edges apply a grammar rule to combine two rules, hook rules introduce a new

bigram, and goal edges lead to a completed sentence.

In the current implementation, we include the following features:

• RULE For each rule in the grammar, we have a rule count feature that fires on a

combination edge,

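Item: $[A \to \gamma, i, j, s, e, h]$

Axioms: \[ \frac{}{[A \to \bullet\gamma, i, i, \epsilon, \epsilon, 0]} \quad A \to \gamma \]

Goals: $[S, 0, n, \langle s \rangle, \langle /s \rangle]$

Inference rules:

Word introduction:

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\mathit{tw}, i, i, \epsilon, \epsilon]}{[A \to \mathit{sw}_{i+1}/\mathit{tw}\,\bullet, i, i+1, \mathit{tw}, \mathit{tw}]} \]

\[ \frac{[A \to \bullet\,\mathit{sw}_{i+1}/\epsilon, i, i, \epsilon, \epsilon]}{[A \to \mathit{sw}_{i+1}/\epsilon\,\bullet, i, i+1, \epsilon, \epsilon]} \]

\[ \frac{[A \to \bullet\,\epsilon/\mathit{tw}, i, i, \epsilon, \epsilon]}{[A \to \epsilon/\mathit{tw}\,\bullet, i, i, \mathit{tw}, \mathit{tw}]} \]

Hook rule:

\[ \frac{[A\bullet, i, j, s, e, 0]}{[A\bullet, i, j, s, e', 1]} \]

Combination:

\[ \frac{[A \to \bullet[B\ C], i, i, \epsilon, \epsilon] \quad [B\bullet, i, j, s, e, 1]}{[A \to [B \bullet C], i, j, s, e]} \]

\[ \frac{[A \to [B \bullet C], i, j, s, e'] \quad [C\bullet, j, k, e', e, 0]}{[A\bullet, i, k, s, e, 0]} \]

\[ \frac{[A \to \bullet\langle B\ C \rangle, i, i, \epsilon, \epsilon] \quad [B\bullet, i, j, s, e, 0]}{[A \to \langle B \bullet C \rangle, i, j, s, e]} \]

\[ \frac{[A \to \langle B \bullet C \rangle, i, j, s', e] \quad [C\bullet, j, k, s, s', 1]}{[A\bullet, i, k, s, e, 0]} \]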

Figure 4.7: Deductive parsing rules for Earley ITG with bigram intersection and the hook
rule trick.

 1,

if r = rule
φrule (r) =
 0,


In aggregate, these features counts up the number of times each grammar rule is used

in the parse. These features mimic the role of a standard probabilistic parser.

• LEXICAL TRANSLATION For each possible source-target word pair, we have a
lexical translation feature that fires at word axioms,

\[
\phi_{\text{src}/\text{targ}}(s, t) =
\begin{cases}
1, & \text{if } \text{src} = s \text{ and } \text{targ} = t \\
0, & \text{otherwise}
\end{cases}
\]

This feature allows the system to learn a one-to-one mapping between the source and
target languages.

• CLASS TRANSLATIONS We also have class translation features that are identical
to the lexical translation features, but for word classes. These features allow us
to generalize the system to word combinations that we have not seen before. We use
mkcls (?) to generate the word classes.

• IBM LEXICAL TRANSLATION We have two IBM translation features that fire
at the word axioms.

\[
\phi_{s \to d}(s, t) = -\ln(ST(s, t))
\]

\[
\phi_{d \to s}(s, t) = -\ln(TS(t, s))
\]

where ST and TS return the probability of a given terminal under IBM Model 3
run from the source to the destination language and from the destination to the
source language, respectively. These features are extremely useful for pruning the
state space in early iterations. We use GIZA++ (Och and Ney, 2003) to generate the
lexical probabilities.


• BIGRAMS We have a language model feature that scores the destination sentence
by its bigrams. The feature fires whenever a hook rule forms a new bigram.

\[
\Phi_{\text{lm}}(x) = -\ln(LM(x.\text{bigram}))
\]

where LM is a pre-trained bigram language model. This feature allows us to incorporate
a rich language model trained on outside monolingual texts into the system. We
use the SRI language modeling toolkit (Stolcke, 2002) to generate these bigrams.

• EMPTIES We have two features that fire for every terminal pair of the form
ε/w or w/ε. The first is called hallucinate and the second ignore. These features
regulate the relative lengths of the sentences.

• RELATIVE SPANS These features track the relative spans of a newly created output.
We have 11 of these features, one for each integer n in [-5, 5]. They fire when we
complete a parse.

\[
\phi_{\text{span},n}(x) =
\begin{cases}
1, & \text{if } \text{size}(x.\text{srcSpan}) - \text{size}(x.\text{destSpan}) = n \\
0, & \text{otherwise}
\end{cases}
\]

When designing our features, we came across an unexpected issue. Dijkstra-style algorithms
only work on graphs with no negative-weight edges. This restriction is harmless when
edge weights are derived from probabilities, since the resulting costs are never negative,
but in a linear model features can have negative weights. Negative weights also conflict
with the BIGRAMS feature: if it had a negative weight, adding more random words would
always improve the score. Neither of these problems is addressed in the literature, and
we were unable to find an elegant solution. We handle this issue by imposing a lower
bound of zero on feature weights.
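
A lower bound of this kind can be applied as a projection after each weight update. The sketch below assumes the update is available as a sparse dictionary delta of per-feature changes; how delta is computed depends on the update rule and is not shown. The function name apply_update and its parameters are hypothetical.

```python
def apply_update(weights, delta, floor=0.0):
    """Add a sparse update to the weights, then clamp each touched weight at floor."""
    updated = dict(weights)  # leave the caller's weights untouched
    for name, change in delta.items():
        updated[name] = max(floor, updated.get(name, 0.0) + change)
    return updated
```

Clamping keeps every edge cost non-negative whenever the feature values themselves are non-negative, which is the condition a Dijkstra-style search needs.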


[Diagram labels: Hallo, Hello, sir]

Figure 4.8: A simple admissible heuristic for ITG. For each source word, we find the best
translation, surrounding bigram, terminal rule, and single combination.

4.5 Heuristic

We argued that with the bigram intersection trick and the hook rule, the hypergraph
for ITG parsing of a sentence has $O(|N|^3 n^3 w^3)$ edges. In the worst case, the Knuth
algorithm takes $O(V \log V + E)$ time, which reduces to $O(|N|^3 n^3 w^3)$. This is a
steep cost for parsing a single sentence.

We can combat some of this complexity by including an admissible heuristic.
An admissible heuristic is an underestimate of the total remaining cost to the goal from an
intermediary node. A good heuristic should be as close as possible to the true cost. We start
with the ITG heuristic given in Zhang and Gildea (2006) and tailor it to our feature set.
The basic idea is to relax the parsing problem to the problem of finding the best translation
of each source word individually. This is an underestimate because in the process of parsing
we will have to translate each word, and we cannot score better than this best translation.
We can extend this idea further. Each word we translate will be part of a bigram, so we
can find the best translation plus bigram for each word. Our final heuristic is shown in
Figure 4.8.
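
The relaxation can be sketched as follows. This is a simplified illustration that covers only the translation and bigram terms, not the terminal rule or combination terms, and it assumes all costs are non-negative. The tables translations and bigram_cost and both function names are hypothetical.

```python
def best_word_costs(source, translations, bigram_cost):
    """Cheapest translation-plus-bigram cost for each source word, independently.

    translations[src] maps candidate target words to lexical costs.
    bigram_cost[(prev, word)] is the cost of the bigram "prev word".
    """
    costs = []
    for src in source:
        options = []
        for targ, lex_cost in translations[src].items():
            # Cheapest bigram ending in this target word; 0.0 if none is known,
            # which is still an underestimate when costs are non-negative.
            cheapest = min(
                (c for (_, word), c in bigram_cost.items() if word == targ),
                default=0.0,
            )
            options.append(lex_cost + cheapest)
        costs.append(min(options))
    return costs

def outside_estimate(costs, i, j):
    """Lower bound on the cost of translating every word outside span [i, j)."""
    return sum(costs[:i]) + sum(costs[j:])
```

For an item covering source positions i to j, outside_estimate gives the heuristic value added to the item's cost when ordering the search agenda.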

Chapter 5

Results

This chapter summarizes the results for this system. We first present some of the
engineering challenges we faced during learning. We then present scores and analysis of the
system on two small training sets.

5.1 Efficiency Issues

In initial experiments, we ran the system with pure A* search and the admissible
heuristic given at the end of the last chapter. The system quickly ran out of memory due
to the enormous number of states. We considered two methods for cutting down the
number of states: pruning states out of the grammar before parsing, or using beam
search during parsing. The first option is less elegant and can make it impossible
to find a correct translation, but it ensures that we never give good scores to states that
were pruned. The second option leaves pruning decisions up to the current weights of the
translation. Since we are already relying on the assumption that we begin in a reasonable
state, we use beam search for all our pruning. We used a fixed-width beam that prunes all
states whose score is worse than bestScore + beamWidth.
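
A fixed-width beam of this kind amounts to a one-line filter. The sketch assumes scores are costs, so that lower is better and the best score is the minimum; the function name and the dictionary representation of the agenda are hypothetical.

```python
def beam_prune(states, beam_width):
    """Keep only states whose cost is within beam_width of the best cost.

    states maps each state to its cost (lower is better).
    """
    if not states:
        return {}
    best_score = min(states.values())
    return {s: c for s, c in states.items() if c <= best_score + beam_width}
```

A wider beam keeps more states and is less likely to discard the correct translation, at the price of memory and time.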

Chapter 5: Results 50

Complicating this problem is the fact that our admissible heuristic is only moderately
useful in practice. It is very important for filtering out poor lexical-level decisions.
For instance, it steers the system away from translations that look good for a single word
but do not fit with the other possible words in the sentence. It does not, however, provide
any help with grammar-level decisions.

When the difference between the heuristic and the correct value is very large,
even beam pruning fails to find a parse in reasonable time. In these cases, we switched to
an inadmissible heuristic: the score of the admissible heuristic multiplied by a constant
factor. In future work, we could find a tighter inadmissible heuristic by finding a greedy
approximation of the translation.
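
Scaling an admissible heuristic by a factor greater than one is usually called weighted A*. The sketch below shows the idea on an ordinary graph rather than on the parsing hypergraph, which is a deliberate simplification; the graph encoding, the heuristic table, and the function name are hypothetical.

```python
import heapq

def weighted_a_star(graph, start, goal, heuristic, multiplier=1.0):
    """Best-first search with the heuristic scaled by multiplier.

    graph[node] is a list of (neighbor, step_cost) pairs with step_cost >= 0.
    Returns (path_cost, nodes_expanded), or None if the goal is unreachable.
    """
    frontier = [(multiplier * heuristic[start], 0.0, start)]
    best_cost = {start: 0.0}
    expansions = 0
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper path to this node was found later
        if node == goal:
            return cost, expansions
        expansions += 1
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                priority = new_cost + multiplier * heuristic[neighbor]
                heapq.heappush(frontier, (priority, new_cost, neighbor))
    return None
```

With multiplier 1.0 and an admissible heuristic the first goal popped is optimal. A larger multiplier pushes the search toward nodes that look close to the goal, which can reduce the number of expansions but may return a more expensive path.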

5.2 Experiments

Because of the inefficiency of training, we were unable to train the system on a large
natural language corpus. Instead, we settled for two smaller synthetic corpora: arithmetic
expression data and the METEO weather system data. These data sets do not allow us
to make any claims about the system's performance on real data, but they do allow us to
examine some positive and negative properties of synchronous grammars trained in this
way.


The other issue is what model to use as a comparison. Our system has two distinctive
properties: it is based around a synchronous grammar, and it is discriminatively
trained. Ideally, we would compare its output to a discriminative alignment-based system
and a generative synchronous grammar system. Unfortunately, there are no available
systems of this form. We settle for comparing it to the IBM model as implemented in
GIZA++ by Och (2000). We use the IBM model instead of more advanced phrase-based
systems because it shares the word-level granularity of our system. We use the CMU–
Cambridge toolkit (Clarkson and Rosenfeld, 1997) for the language model and the ISI
rewrite decoder for translations.

[Plots: Parse Speed and Average Parse Score, each against Heuristic Multiplier (1.0 to 1.8)]

Figure 5.1: Example pairs from the arithmetic expressions corpus. The second pair has
necessary parentheses, the third has unnecessary parentheses, and the fourth has both.

5.2.1 Arithmetic Expressions

The first test we ran was with the arithmetic expressions data set, a synthetic
corpus proposed by Nesson (2005) to test synchronous grammars. The data set consists
of 1000 sentence pairs where the source language is postfix notation arithmetic expressions
and the target language is noisily parenthesized infix expressions. Figure 5.1 shows some
example sentences.
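
To show the kind of reordering this task requires, the sketch below converts a postfix expression into infix with only the necessary parentheses. It is an illustration, not the corpus generator: the operator set (+ and *), the whitespace tokenization, and the function name are assumptions, and the corpus itself adds parentheses noisily rather than minimally.

```python
PRECEDENCE = {"+": 1, "*": 2}
ATOM = 3  # operands bind tighter than any operator

def postfix_to_infix(tokens):
    """Convert postfix tokens to infix, adding only the necessary parentheses."""
    stack = []  # each entry is (text, precedence of its top-level operator)
    for tok in tokens:
        if tok in PRECEDENCE:
            right_text, right_prec = stack.pop()
            left_text, left_prec = stack.pop()
            prec = PRECEDENCE[tok]
            # + and * are associative, so only lower precedence needs brackets
            if left_prec < prec:
                left_text = f"({left_text})"
            if right_prec < prec:
                right_text = f"({right_text})"
            stack.append((left_text + tok + right_text, prec))
        else:
            stack.append((tok, ATOM))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0][0]
```

For example, the postfix input A A + A + B * becomes (A+A+A)*B: the operators move from the end of each subexpression to the middle, a global reordering that word-level models find difficult.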

We consider this an interesting test for MT systems because it has very simple
lexical translations and language modeling, yet drastic word movement. In our initial test,
we used a fully connected ITG grammar with 5 nonterminals. We ran tests with both
bold and local updating.

The bold updating test revealed a major failure of the A* search strategy. When
the system begins, the only meaningful features are the translation and language model
features inherited from the IBM system. These features always prefer translations without
parentheses to those with them. For this reason, if the best translation is,


our system will produce,


This happens repeatedly whether the parentheses are necessary or not. We would
hope that the system would learn where to include parentheses; instead, it continually
penalizes these close translations. Eventually, the cost of this simple translation becomes so
high that anything else is better. When there is no clear good path, A* does not help, the
beam can no longer prune anything, and translating trivial sentences becomes prohibitively
costly. This example demonstrates the disadvantage of relying on a Dijkstra-style algorithm,
which ties performance to model accuracy.

The local update example had more success. We ran tests with n = 100. Since
parentheses are expensive in the starting language model, only a small percentage of the
possible outputs have any parentheses. Since these parentheses are scattered throughout
the sentence, they almost never match up with the parentheses in the reference sentence.
The result is that in the test translations, the system almost never adds parentheses. This
demonstrates the big difference between local and bold updating: local updating can avoid
some noise in the reference sentence, at the cost of ignoring some important distinctions.

Even without any parentheses, local updating performs significantly better than the
IBM model. The synchronous grammar allows our system to do global reorderings that the
IBM model cannot do. Local updating has a BLEU score of 0.5875, while the IBM model scores
0.5087. Figure 5.2 shows some example sentences and the results from the two systems.
Both systems seem unwilling to add any extra words. Our system compensates by correctly
reordering the words and eliminating parentheses. The IBM system translates chunks of
words well, but loses much of the original sentences.


Correct B+A+A+B*A
Correct (B+(B+A*B)*B)*A
Correct (A+A+A)*B

Figure 5.2: Example results from our system and IBM

jeudi quelques averses cessant en fin de matinèe puis nuageux

thursday a few showers ending late in the morning then cloudy
NUM pour cent de probabilitè d’averses de pluie ou de neige en après-midi et en soirèe
NUM percent chance of rain showers or flurries in the afternoon and evening
mardi nuageux avec NUM pour cent de probabilitè d’averses tôt en matinèe
tuesday cloudy with NUM percent chance of showers early in the morning
vents du sud-est de NUM à NUM km h diminuant à NUM ce soir
wind southeast NUM to NUM km h diminishing to NUM this evening

Figure 5.3: Example pairs from the METEO weather data corpus. The corpus features
rather literal translations with some added and deleted words and reordering. For instance,
pour cent de probabilitè d’averses is translated as percent chance of rain, which requires
adding the word of and dropping the word de.

5.2.2 METEO Weather Data

Our second experiment was on another synthetic corpus of a much different sort.
The METEO weather data corpus is a collection of natural language statements generated
in both English and French (Kittredge, Polguère, and Goldberg, 1986). It has significantly
different properties from the arithmetic expressions corpus. It has difficult lexical
translation and language fluency issues, but very little global word reordering. Figure 5.3
shows several examples from this corpus.

We trained our system on 1000 unique sentences from this corpus with lengths
between 5 and 15. For this corpus, we use hybrid updating with n = 10. The system took an
average of 20 seconds per sentence, and we trained it for 10 iterations. Training took about
2 full days.

fog and patchy drizzle along the coast in the morning and early in the afternoon
fog patches in the coast in the morning and early along the afternoon and patchy drizzle
fog and patchy drizzle along the coast in the morning and early in the afternoon

tonight showers changing to drizzle and a few showers this evening
dissipating this evening and tonight showers changing drizzle in tonight a few this evening
this evening and tonight periods changing drizzle and a few showers this evening

monday light snow mixed with rain changing to rain late in the morning
monday rain late in the morning changing rain changing light snow mixed
monday light becoming mixed rain moving changing rain late in morning

Table 5.1: Example translations for the METEO Corpus.

The two systems performed similarly on the corpus. Our system had a BLEU score
of 0.7044 and the IBM system scored 0.7287.

While both systems translate most sentences correctly, there are subtle differences
that highlight the different underlying models. Table 5.1 shows example references and
results. The first example is particularly striking. The IBM model translates the sentence
exactly, while the ITG translation is very incorrect. However, the ITG output does
translate all the words correctly and puts them in a grammatical, if nonsensical, order. This
example demonstrates that sometimes the extra expressivity of the ITG system can be a
disadvantage.


5.3 Conclusion

In this work, we have presented a discriminative method for training a synchronous
grammar-based machine translation system. At heart, this system is a combination of two
ideas. The first is the learning idea that we can abstract away from a problem's
complex structure and decompose it into a feature vector. The second is that we
can approach the particular problem of synchronous grammar parsing as a graph traversal
problem. Together these ideas present a view of synchronous grammar translation: we
train by learning the weights of the graph, and translate by searching over these weights.

This system has several issues. The most prominent is the inefficiency of parsing
under this framework. In the short term, we hope that a better, possibly inadmissible,
heuristic will improve parsing speed. Alternatively, we could include smarter pruning both
before and during parsing. We had hoped that the discriminative framework would increase
efficiency because of the Knuth algorithm. In retrospect, $O(n^6)$ generative inside-outside
training with a fixed output sentence may be faster in practice, since it does not have the
extra bigram costs.

If this efficiency issue can be resolved, we think this could be a promising framework.
While the results we produced were on small synthetic corpora, they show that the
system can learn the global reorderings of the arithmetic expressions corpus without
sacrificing the translation and language modeling necessary for the literal translations of the
METEO corpus.

Finally, this method would be even stronger with a better synchronous grammar.
As Chiang has shown, there is no reason not to combine phrasal and syntax systems. The
synchronous CFGs that Chiang uses are not too different from ITG, and would allow us to
incorporate a wider range of features.

References


Brown, P.F., S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer, and

P.S. Roossin. 1990. A statistical approach to machine translation. Computational

Linguistics, 16(2):79–85.

Chiang, David. 2006. Hierarchical Phrase-Based Translation. Computational Linguistics.

Clarkson, Philip and Ronald Rosenfeld. 1997. Statistical language modeling using the

CMU–cambridge toolkit. In Proc. Eurospeech ’97, pages 2707–2710, Rhodes,


Collins, M. 2002. Discriminative training methods for hidden Markov models: theory and

experiments with perceptron algorithms. Proceedings of the ACL-02 conference

on Empirical methods in natural language processing-Volume 10, pages 1–8.

Collins, M. and T. Koo. 2005. Discriminative reranking for natural language parsing.

Computational Linguistics, 31(1):25–69.

Daume III, H. and D. Marcu. 2005. Learning as search optimization: Approximate large

margin methods for structured prediction. Proceedings of the International Con-

ference on Machine Learning (ICML).

Earley, J. 1970. An efficient context-free parsing algorithm. Communications of the ACM,


Graehl, J. and K. Knight. 2004. Training tree transducers. Proc. of HLT/NAACL-04,

pages 105–112.

Huang, L. 2005. k-best Knuth Algorithm.

Huang, L. and D. Chiang. 2005. Better k-best parsing. Proceedings of the 9th International

Workshop on Parsing Technologies (IWPT), pages 53–64.


Huang, L., H. Zhang, and D. Gildea. 2005. Machine translation as lexicalized parsing with

hooks. Proceedings of IWPT, 5.

Kittredge, R., A. Polguère, and E. Goldberg. 1986. Synthesizing weather forecasts from

formatted data. Proceedings of the 11th Conference on Computational Linguistics,

pages 563–565.

Klein, D. and C.D. Manning. 2001. Parsing and hypergraphs. Proceedings of the 7th

International Workshop on Parsing Technologies (IWPT-2001).

Klein, D. and C.D. Manning. 2003. A* parsing: fast exact Viterbi parse selection. Pro-

ceedings of the 2003 Conference of the North American Chapter of the Association

for Computational Linguistics on Human Language Technology-Volume 1, pages


Knight, K. 1997. Automating knowledge acquisition for machine translation. AI Magazine,


Knuth, D.E. 1977. A Generalization of Dijkstra’s Algorithm. Information Processing

Letters, 6(1):1–5.

Koehn, P. 2002. Europarl: A multilingual corpus for evaluation of machine translation.

Unpublished, http://www. isi. edu/koehn/publications/europarl.

Koehn, P., F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. Proceedings

of the 2003 Conference of the North American Chapter of the Association for

Computational Linguistics on Human Language Technology-Volume 1, pages 48–


Koo, T. and M. Collins. 2005. Hidden-variable models for discriminative reranking. Pro-

ceedings of the Joint Conference on Human Language Technology Conference and

Empirical Methods in Natural Language Processing (HLT/EMNLP).


Kumar, S., Y. Deng, and W. Byrne. 2005. A weighted finite state transducer translation

template model for statistical machine translation. Natural Language Engineering,


Lari, K. and SJ Young. 1991. Applications of stochastic context-free grammars using the

inside-outside algorithm. Computer speech & language, 5(3):237–257.

Liang, P., A. Bouchard-Cote, D. Klein, and B. Taskar. 2006. An End-to-End Discriminative

Approach to Machine Translation. Proceedings of the Association for Computa-

tional Linguistics.

Melamed, I.D. 2004. Statistical machine translation by parsing. Proceedings of the 42nd

Annual Meeting of the Association for Computational Linguistics (ACL).

Nesson, R. 2005. Induction of probabilistic synchronous tree-insertion grammars. Tech-

nical Report TR-20-05, Division of Engineering and Applied Sciences, Harvard

University, Cambridge, MA.

Och, F.J. 2000. Giza++: Training of statistical translation models.

Och, F.J. and H. Ney. 2003. A systematic comparison of various statistical alignment

models. Computational Linguistics, 29(1):19–51.

Papineni, K., S. Roukos, T. Ward, and W.J. Zhu. 2001. BLEU: a method for automatic

evaluation of machine translation. Proceedings of the 40th Annual Meeting on

Association for Computational Linguistics, pages 311–318.

Shieber, Stuart M., Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and

implementation of deductive parsing. Journal of Logic Programming, 24:3–36.

Stolcke, A. 2002. SRILM-an extensible language modeling toolkit. Proc. ICSLP, 2:901–904.

Wu, D. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel

corpora. Computational Linguistics, 23(3):377–403.

Zhang, H. and D. Gildea. 2006. Efficient Search for Inversion Transduction Grammar.