
A Thesis presented

by

Alexander Rush

to

The Department of Computer Science

in partial fulfillment of the requirements

for the degree of

Bachelor of Arts

in the subject of

Computer Science

Harvard University

Cambridge, Massachusetts

April 3rd 2007


Abstract

The task of discriminative machine translation by synchronous parsing poses two major difficulties: the hidden structures in the training set and the inefficiency of parsing synchronous grammars. This work addresses these two problems. We approach the hidden structure problem by adapting the online learning update rules presented by Liang et al. (2006) for discriminative phrase-based translation, and we use an A* search to find the best parses. The discriminative training method permits an expanded feature set compared to generative models. The A* search speeds up the parser when used with an admissible heuristic.


Contents

Title Page . . . . . . . . i
Abstract . . . . . . . . ii
Table of Contents . . . . . . . . iii

1 Introduction 1
1.1 Background . . . . . . . . 1
1.2 Goal . . . . . . . . 4
1.3 Overview . . . . . . . . 5

2 Learning with Hidden Structure 7
2.1 General Linear Models . . . . . . . . 8
2.2 Training . . . . . . . . 11
2.3 Hidden Structure . . . . . . . . 13
2.4 Perceptrons and Hidden Structure . . . . . . . . 15

3 Hypergraph Parsing 19
3.1 Parsing . . . . . . . . 21
3.2 Hypergraphs and Parsing . . . . . . . . 23
3.3 Hypergraph Traversal . . . . . . . . 27
3.4 Inversion Transduction Grammar . . . . . . . . 29

4.1 Translation by Parsing . . . . . . . . 34
4.2 Local Updating . . . . . . . . 37
4.3 Bilingual Parsing . . . . . . . . 40
4.4 Features . . . . . . . . 44
4.5 Heuristic . . . . . . . . 48

5 Results 49
5.1 Efficiency Issues . . . . . . . . 49
5.2 Experiments . . . . . . . . 50
5.3 Conclusion . . . . . . . . 55


Chapter 1

Introduction

1.1 Background

Machine translation is the task of converting a sentence in a source language into the corresponding sentence in a target language. We know that this problem can be solved, since we can watch people translate every day. And yet the demand for translation remains high. These two factors make translation an appealing problem. In addition, computers should be a perfect match for translation. At heart, computers are designed to manipulate symbols, and yet translation has proven a difficult area. Researchers have been building computational machine translation (MT)

systems since the 1940’s with only limited success. Until the 1990’s, these systems were

major engineering projects. MT researchers built rule-based systems that used hand-crafted

grammars to convert a sentence in one language to another. This form of system treats

human language like a very complicated programming language and translation is like the

conversion to assembly code. Except that, unlike a programming language, we do not know the form of this grammar.



[Figure 1.1: Sketches of four translation models for the sentence pair "Una cierta oración española." / "Some spanish sentence.": (b) the IBM word alignment, (c) a syntax transformation, (d) Hiero.]

Brown et al. (1990) caused a major shake-up in MT research when they implemented the first successful statistical machine translation (SMT) system. Instead of using handcrafted rules, SMT systems attempt to learn a probabilistic map between the source and target languages. The most startling thing about these systems is that they are completely oblivious to the languages they are translating. Unlike the previous generation of systems, where the rules were proposed by language experts, SMT systems require no human guidance. The statistical system learns its map from a corpus of human-translated sentence pairs from the two languages. The first popular corpus was the proceedings of the Canadian parliament, which is conducted in both French and English, but as Knight (1997) jokes, it might as well be the conversations of aliens. In spite of this property, SMT systems perform remarkably well.


This original work, which is known as the IBM system, is a generative, alignment-based system. At its base, it has two statistical models. The first is a translation model that gives the likelihood of source language words translating to destination language words. This model tries to create adequate translations. Figure 1.1(b) gives a basic sketch of the IBM translation model. The second is a language model that computes the likelihood of the target sentence based on the target language alone. This model aims to produce fluent sentences. The IBM system is still the major component of most SMT systems, and we will discuss it further below.

The current state-of-the-art SMT systems are variants of the phrase-based system of Koehn, Och, and Marcu (2003). Phrase-based systems improve upon the IBM system by increasing the granularity of the translation model. Instead of producing one word from one word, phrase-based systems produce one phrase from one phrase. In order to find phrases, these systems use heuristics that infer phrase-level information from the output of the IBM systems. Figure ?? shows the distinction between word and phrasal alignments.

The major criticism of IBM-style systems and their phrase-based successors is that they trade the syntactic transformations of rule-based systems for word or phrase level mappings. The old systems had the expressivity of grammars, while phrase-based systems have the expressivity of finite state automata. In fact, Kumar, Deng, and Byrne (2005) have implemented a phrase-based system as a weighted finite state transducer. While the statistical aspect is a major improvement, it seems unlikely that all the complexities of language can be captured by word or phrase level mappings.

This criticism has led to an alternative line of research in syntax-aware SMT systems. Syntax-aware systems attempt to lift the translation problem to the syntax level, and learn a map from source to destination syntax. We can think of these syntax transformations as grammars over both sentences simultaneously. Figure 1.1(c) gives an example of this transformation. These two-sentence grammars are known as synchronous grammars.

Recently, Chiang (2006) has shown that a statistical machine translation system

incorporating syntactic information can improve upon a standard phrase-based model. His

system, known as Hiero, uses a hybrid approach called hierarchical phrase-based translation. The system learns phrase level rules that can be combined using a synchronous

grammar. In this way, he is able to get some of the benefits of both phrase level granularity

and syntax. Figure 1.1(d) shows a simple sketch of the Hiero concept.

1.2 Goal

The Hiero system is a major advance for syntax-aware translation. It can use syntactic structure to produce better translations. However, Hiero is still attached to the phrase-based paradigm. In this work, we aim to keep the benefits of synchronous grammars without depending on a phrasal model.

While this seems like a modest goal, it has a major complication. The issue is that

the standard corpus for machine translation consists of unmarked sentence pairs. These

sentence pairs provide a reasonable training set for learning word level mappings, but they

provide no information about the syntax. In a perfect world, we would have a corpus marked

with consistent bilingual syntactic structures. Instead, we need to infer these structures from

the corpus.

There are several ways to address this lack of direct supervision. In Hiero, Chiang circumvents the supervision problem by creating his own training data. His system first runs a phrase-based algorithm to create a table of possible phrases for each


sentence. He then uses a set of heuristics to predict a possible parse from this table. He can then train his model from these parses. This method avoids the hidden structure problem. Another approach tackles the problem head on: start with a base grammar, and use expectation maximization (EM) to learn reasonable parameters. The standard EM technique for learning grammars is the inside-outside algorithm, which works by parsing the corpus and updating the parameters of the model to maximize the expected likelihood of the training data.

The EM method would be ideal for our goal, except that it learns a generative,

joint probability model over the synchronous parse instead of a discriminative, conditional

model. There are two issues with using a generative model for translation. The first is

that a generative model predicts the likelihood of a parse over both sentences. This makes

sense for training, but since in practice, we are only given the source sentence, we are left

with an indirect model for predicting the best target sentence. The second issue is that

discriminative models allow us to add in arbitrary features that may help translation. In

generative models, we can add extra features, but we have to worry about their correlation.

1.3 Overview

This thesis presents a discriminative approach to translation with synchronous grammars. Our aim is to maintain the freedom from phrase table constraints while keeping the benefits of syntactic structure. The method consists of two major pieces: a learning algorithm for partially supervised training and a search algorithm for finding translations with a synchronous grammar. These two pieces are sufficient for describing a complete system, and the structure of this paper follows them.

Chapter 2 gives background for the learning algorithms we use in our system. We

first present a general framework for thinking about structured learning. We then focus on

basic update techniques in this framework and extensions for problems like translation that have hidden structure.

In Chapter 3, we explore how to solve the subproblem at the heart of the learning

algorithms from Chapter 2. We describe a parsing framework with several variants that

solve this problem for different learning approaches. We then show how to adapt this framework to synchronous grammars.

Chapter 4 shows how to apply these methods to the translation task. We describe

techniques for language model integration, tricks for reducing the parser's complexity, admissible heuristics to maintain optimality, and the features that we used for tests.

Finally, in Chapter 5, we present results of this system on language data and discuss our conclusions.

Chapter 2

Learning with Hidden Structure

The structure prediction framework is a very general way of thinking about discriminative learning. It takes an input, X, and produces the best scoring output, Y. The framework is

agnostic to the form of X and Y . These could be sentences, trees, pictures, etc. In the case

of translation, the input X is a sentence in the source language and the output Y is a sentence in the target language.

This chapter introduces structure prediction and some variants for applying it to

translation. We focus on general linear models trained with online algorithms. In Sections 2.1 and 2.2, we introduce the general linear model framework for structure prediction,

and present an online Perceptron-like update rule for training these models. In Section 2.3,

we describe a variant of general linear models for problems like translation that have hidden structure. In Section 2.4, we revisit the training question and survey variants of the Perceptron update designed for problems with hidden structure.

2.1 General Linear Models

The key to structure prediction is projecting elements with complex structure onto

representative vectors. Working with vectors allows us to abstract the learning problem away from interpreting structures by reducing them to their essential components.

Collins (2002) introduces a formal framework for this process known as a general

linear model. A general linear model consists of three functions: an enumeration function, GEN, a projection function, Φ, and a scoring function, RANK. As a running example, consider part-of-speech tagging. In tagging problems, the goal is to produce the best part-of-speech label for each word of a sentence. The tagger takes a sentence, X, as input and produces predicted tags, Y, as output. We use the notation X and Y for the sets of all sentences and tags respectively.

• The enumeration function GEN : X → 2^Y produces all the legal outputs for a given input. GEN has a trivial specification for most problems. For instance in tagging, GEN might produce all the possible combinations of tags for a sentence.

• The projection function Φ maps each input-output pair onto a representative vector, known as the feature vector. This function determines what elements of structures are relevant, and everything else about the structure will be ignored. In practice, for efficiency we choose a Φ that factors into smaller functions, Φ(X, Y) = Σ_{y∈Y} Φ(X, y). In turn, Φ is made up of n feature functions φ_i that each determine one dimension of the vector.



Figure 2.1: The three general linear model functions applied to a part-of-speech tagging

task. Figure 2.1(a) shows the enumeration function generating all possible tagging sequences

for a simple sentence. Figure 2.1(b) shows a possible feature vector projection for a single

sentence-tag pair. Figure 2.1(c) shows this vector being scored with the weight vector.


For instance, in the tagging problem we might define the feature

    φ1(w, t) = 1 if w = "Dogs" and t = "V"; 0 otherwise

where w and t are a word-tag pair. We read this feature as: let the first dimension be 1 if the current word is "Dogs" and its tag is "V".

• The scoring function RANK scores feature vectors using a parameter (weight) vector w. We use a linear model for our scoring function. A linear model makes the score a weighted sum of the individual features. These models are popular for machine learning because the size of w scales linearly with the size of the feature vector. Under the assumption of linearity, the scoring function is just the dot product of the weight vector and the feature vector,

    RANK(Φ(X, Y), w) = Φ(X, Y) · w

To make a prediction, we select the best scoring output structure,

    argmax_{Y ∈ GEN(X)} RANK(Φ(X, Y), w)        (2.1)
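The three functions can be made concrete on the toy tagging task. The following is an illustrative Python sketch, not the thesis implementation: the tag set, indicator features, and weights are assumptions chosen to mirror the running example, and GEN is enumerated exhaustively rather than searched.

```python
from itertools import product

TAGS = ["N", "V", "Adv"]

def gen(sentence):
    """GEN: enumerate every possible tag sequence for the sentence."""
    return [list(tags) for tags in product(TAGS, repeat=len(sentence))]

def phi(sentence, tags):
    """Phi: project a sentence/tag pair onto a sparse feature vector,
    here one indicator feature per word-tag pair."""
    feats = {}
    for w, t in zip(sentence, tags):
        feats[(w, t)] = feats.get((w, t), 0) + 1
    return feats

def rank(feats, weights):
    """RANK: linear scoring, the dot product of features and weights."""
    return sum(v * weights.get(f, 0.0) for f, v in feats.items())

def predict(sentence, weights):
    """Equation 2.1: the best scoring output in GEN."""
    return max(gen(sentence), key=lambda tags: rank(phi(sentence, tags), weights))

sentence = ["Dogs", "can", "not", "fly"]
weights = {("Dogs", "N"): 1.0, ("can", "V"): 1.0,
           ("not", "Adv"): 1.0, ("fly", "V"): 1.0}
print(predict(sentence, weights))  # ['N', 'V', 'Adv', 'V']
```

Note that this enumerates all |T|^n tag sequences, which is exactly the blow-up the search methods discussed next are designed to avoid.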

Finding this structure can be challenging in practice. For instance in our simple

tagging problem, GEN(X) produces a set of size |T|^n, where n is the length of the


Figure 2.2: Geometric representation of the linear separator. Squares represent correct

outputs and circles are incorrect outputs. The solid line represents a possible separator w,

and the dashed lines show the margin δ for that separator.

sentence and |T| is the size of the tag set. In addition, computing feature vectors for each candidate is expensive.

Later in this work, we will follow the work of Daume III and Marcu (2005) and

focus on solving the optimization problem given in equation 2.1 by using search. Search

lets us avoid the size of GEN by lazily expanding values in order of rank. In addition, if we use a decomposable Φ, we can reduce the cost of Φ by incrementally building our feature vectors as we search over structures. In Chapter 3, we will discuss in detail how to find this best scoring structure.

2.2 Training

In the last section, we assumed that we had a parameter vector w that would rank

good outputs better than bad outputs. In this section, we examine what this condition

formally means, and how to find a parameter vector that satisfies this property.

In structure prediction, we assume that we can partition the output space into

correct and incorrect outputs. All outputs for a given input X are either correct Y ∗ ∈ Y ∗


Algorithm 2.2 (Perceptron update)
Require: Y* correct output
  best ⇐ argmax_{Y ∈ GEN(X)} w_t · Φ(X, Y)
  correct ⇐ Y*
  if best ≠ correct then
    w_{t+1} ⇐ w_t + Φ(X, correct) − Φ(X, best)
    return w_{t+1}
  else
    return w_t
  end if

or incorrect, Y′ ∈ Ȳ*. Our goal is to find parameters that "separate" these two sets:

    Φ(X, Y*) · w ≥ Φ(X, Y′) · w + δ   for all Y* ∈ Y*, Y′ ∈ Ȳ*

A parameter vector w that satisfies this inequality is known as a linear separator, and the value δ is known as its margin. The geometric intuition behind this condition is given in

Figure 2.2.

During training, our goal is to learn this linear separator from data. Collins (2002)

presents a simple method for learning this separator using the Perceptron-style learning

algorithm. This algorithm learns from data instances of the form [X_i, Y_i*], i = 1, . . . , n, where each X_i

is an input structure and Yi∗ ∈ Y ∗ is a correct output structure. We start with an arbitrary

parameter vector w0 and run Algorithm 2.2 on each data instance. This algorithm computes

the current best structure and compares it to the correct output structure. If they are

different, we increase the parameters of the features from the correct structure and decrease

those from the current best, but incorrect structure. We repeat this process until we find a

separator.
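The update just described can be sketched in code. This is an illustrative Python sketch of a Perceptron-style step on a tiny tagging instance; the two-tag grammar and indicator features are invented for the example, not taken from the thesis.

```python
from itertools import product

TAGS = ["N", "V"]

def phi(x, y):
    """Sparse indicator features, one per word-tag pair."""
    feats = {}
    for word, tag in zip(x, y):
        feats[(word, tag)] = feats.get((word, tag), 0) + 1
    return feats

def gen(x):
    """All possible tag sequences."""
    return [list(t) for t in product(TAGS, repeat=len(x))]

def perceptron_step(w, x, y_star):
    """Find the best output under the current weights; on a mistake,
    add the features of the correct output and subtract those of the
    incorrect prediction."""
    y_hat = max(gen(x), key=lambda y: sum(v * w.get(f, 0.0)
                                          for f, v in phi(x, y).items()))
    if y_hat != y_star:
        for f, v in phi(x, y_star).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, y_hat).items():
            w[f] = w.get(f, 0.0) - v
    return w

w = {}
x, y_star = ["Dogs", "fly"], ["N", "V"]
for _ in range(3):   # repeat passes until the instance is separated
    w = perceptron_step(w, x, y_star)
```

After the first mistaken prediction the weights shift toward the correct word-tag pairs, and subsequent passes leave them unchanged once the instance is correctly ranked.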


[Figure: GEN applied to the garbled input "Dougs cannt fly.", producing candidates such as "Dogs/V can/V not/Adv fly/V", "Cats/N like/V no/Adv dog/N", and "Horse/Adj Cow/Adj Pig/Adj Mouse/Adj".]

Figure 2.3: The GEN function for the garbled tagging problem.

In the tagging example, the algorithm starts by finding the best output for “Dogs

can not fly.” for the given weights. If the weights are incorrect, the best output could

be anything. For instance it might produce, “Dogs/V can/V not/Adv fly/V.” Since this

output is different than the correct output, “Dogs/N can/V not/Adv fly/V.”, we perform

an update. The update only changes weights related to the mistake that was made. In

this case, it will increase the weights of all features relating to “Dogs/N” and decrease the

weights for “Dogs/V.” The next time we see this input, we hope that the best output will

The form of general linear model presented in Section 2.1 assumes that we can

directly predict our output Y from the given sentence X. For many problems, it makes

more sense to first predict some hidden structure H that is not part of the output Y.

In this section we follow the work of Koo and Collins (2005) and extend the general linear model to include hidden structure. As an example, consider a "garbled tagging" problem: we want to tag input sentences that have some words misspelled. Our input is a sentence, X, with misspelled words, and our output, as before, is its tags, Y. To account for the misspelled words, we introduce hidden structure H that first predicts the correct


spelling of each word. We notate the garbled tagging problem in an x/h/y format, so the garbled word "Dougs" corrected to "Dogs" and tagged N is written "Dougs/Dogs/N".

• GEN : X → 2^(H×Y) now enumerates every possible pair of hidden and output

structure. For garbled tagging, this means that it produces every possible corrected

word and every tag for each of these words. Figure 2.3 shows some possible outputs.

As with the earlier GEN, neither the proposed hidden structure nor the output structure has to bear any resemblance to the correct output. We let ranking sort out the good from the bad.

• The projection function Φ can now include features that observe the hidden structure in addition to the input or output structure. This flexibility is the main justification for including hidden structure. For instance, say we wanted to include a feature from the previous model,

    φ_i(w, t) = 1 if w = "Dogs" and t = "Adv"; 0 otherwise

This feature would work fine for “Dogs,” but what about “Dgs” or other misspellings?

We could add a separate feature for each misspelling,

    φ_i(w, t) = 1 if w = "Dgs" and t = "Adv"; 0 otherwise

However, because we are using a linear model, there is no way to relate these two features. With hidden structure, a single feature can instead observe the proposed correct spelling,

    φ_i(w, h, t) = 1 if h = "Dogs" and t = "Adv"; 0 otherwise


To ensure we get the right hidden word, we add features that observe the hidden structure directly. For example,

    φ_j(w, h, t) = inDictionary(h)

would check if a proposed hidden word is in the dictionary. We can also include

    φ_k(w, h, t) = editDistance(w, h)

which measures the edit distance between a proposed hidden word and its garbled form.
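The two hidden-structure features above can be sketched as follows. This is an illustrative Python sketch; the tiny dictionary and function names are assumptions for the example, and the edit distance is the standard Levenshtein dynamic program.

```python
DICTIONARY = {"Dogs", "can", "not", "fly"}

def in_dictionary(w, h, t):
    """inDictionary(h): 1 if the proposed hidden word is a real word."""
    return 1.0 if h in DICTIONARY else 0.0

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over prefix lengths."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def edit_feature(w, h, t):
    """editDistance(w, h): distance between the observed garbled word
    and its proposed correct spelling."""
    return float(edit_distance(w, h))
```

For the garbled word "Dgs" with proposed spelling "Dogs", the dictionary feature fires and the edit distance is 1 (a single inserted letter).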

As above, our goal remains finding the best output structure. We now search over pairs of hidden and output structures,

    argmax_{(H,Y) ∈ GEN(X)} w · Φ(X, H, Y)

Training, however, is now harder. It would be straightforward if we had supervision that gave us the correct output and its correct hidden structure, Y* and H*, but we do not have H*. Even worse, it is often impossible to find the correct hidden structure.

In the garbled tagging problem, there may be no way to identify the correct hidden structure at all. One option is instead to average over all possible hidden structures that produce the correct output. Averaging gives the formula

    Σ_{H ∈ H} Φ(X, H, Y*) / |H|
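The averaging computation can be sketched over sparse feature vectors. This is an illustrative Python sketch; the dict-based feature representation and the example features are assumptions, not the thesis code.

```python
def average_features(feature_vectors):
    """Mean of a list of sparse feature vectors: sum each coordinate
    over all hidden structures, then divide by the count |H|."""
    total = {}
    for fv in feature_vectors:
        for f, v in fv.items():
            total[f] = total.get(f, 0.0) + v
    n = len(feature_vectors)
    return {f: v / n for f, v in total.items()}

# Two hidden structures yielding the correct output, with made-up features.
fvs = [{("Dogs", "N"): 1.0},
       {("Dougs", "N"): 1.0, ("Dogs", "N"): 1.0}]
avg = average_features(fvs)   # {("Dogs","N"): 1.0, ("Dougs","N"): 0.5}
```

Features shared by every hidden structure keep their full weight, while features appearing in only some structures are diluted, which is exactly why the average can drift toward absurd candidates when most of H is wrong.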


Figure 2.4: A separator in the hidden structure variant. As in Figure 2.2, the squares have

the correct output, and the circles do not. In this diagram, the light square also has the

correct hidden structure, and the dark squares do not. Note that unlike Figure 2.2, there are squares on the opposite side of the separator. Even more problematically, the star

represents the average of the squares, and there is no guarantee that it will be on the correct

side of the separator.

Unfortunately, the majority of these structures will be entirely incorrect. For instance in garbled tagging, the intended correction will be correct, but so will "Dougs/Cats/N can/like/V nt/no/Adv fly/dog/V" and many other absurd sentences.

The reason for this failure is that we no longer have the same separability condition that we had in Section 2.2. The supervised value Y* tells us only that there is some pair (H*, Y*) that is correct. We know nothing about the correctness of (H, Y*) for any other hidden structure H. Figure 2.4 gives the geometric intuition for this formula and demonstrates the difficulty of averaging.

Despite the difficulty of this problem, if we start with some information, we can

proceed in the right direction. Liang et al. (2006) present two online learning algorithms designed to learn in this context: bold updating and local updating.


    Perceptron:  Best = argmax_{Y ∈ GEN(X)} w · Φ(X, Y)         Correct = Y*
    Average:     Best = argmax_{Y,H ∈ GEN(X)} w · Φ(X, Y, H)    Correct = Σ_{H ∈ H} Φ(X, H, Y*) / |H|
    Bold:        Best = argmax_{Y,H ∈ GEN(X)} w · Φ(X, Y, H)    Correct = argmax_H w · Φ(X, Y*, H)
    Local:       Best = argmax_{Y,H ∈ GEN(X)} w · Φ(X, Y, H)    Correct = argmin_{(Y,H) ∈ nbest} m(Y*, Y)

Table 2.1: Perceptron extensions for problems with hidden structure. Best and Correct refer to the variables used in Algorithm 2.2.

Bold updating attempts to fix average updating. Instead of averaging out over all

possible hidden structures with the correct output, we use the best scoring hidden structure,

    argmax_H w · Φ(X, H, Y*)

We hope that, if our current weights are feasible, this hidden structure will be close to H*, although unfortunately we have no guarantee that this property will hold. Table 2.1

shows the adjustments to the Perceptron algorithm for bold updating. In the garbled tagging task, bold updating requires that we find the best hidden words to produce the given correct tags.

Local updating takes a more drastic approach to the hidden structure problem. In

local updating, we assume that if there is a lot of hidden information, even the supervised

outputs may not be exactly correct. Instead of trying to find an output that matches the

given correct output exactly, we generate a list of n highest scoring outputs and choose the

one that is closest to Y*, under a problem specific metric m, to be our correct output. This update strategy is more conservative than bold updating because we perform the update toward an output the model already ranks highly.

In our garbled tagging case, local updating would be useful only if some training examples are misspelled beyond human readability, so that the labels are not entirely reliable. For garbled tagging, a reasonable metric m might count how many tags differ between two outputs.
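The two update targets can be sketched side by side. This is an illustrative Python sketch of how bold and local updating choose the "Correct" structure from a scored candidate list; the (output, hidden, score) triples and the mismatch metric are invented for the example.

```python
def bold_target(candidates, y_star):
    """Bold updating: the best-scoring hidden structure among
    candidates whose output equals the correct output y_star."""
    with_correct = [c for c in candidates if c[0] == y_star]
    return max(with_correct, key=lambda c: c[2])

def local_target(candidates, y_star, m, n=10):
    """Local updating: among the n best-scoring candidates, the one
    whose output is closest to y_star under the metric m."""
    nbest = sorted(candidates, key=lambda c: c[2], reverse=True)[:n]
    return min(nbest, key=lambda c: m(y_star, c[0]))

# Candidates are (output tags, hidden structure, model score) triples.
cands = [(("N", "V"), "h1", 2.0),
         (("N", "N"), "h2", 3.0),
         (("V", "V"), "h3", 1.0)]

def tag_mismatches(a, b):
    """A simple metric m: number of differing tag positions."""
    return sum(u != v for u, v in zip(a, b))

bold = bold_target(cands, ("N", "V"))        # the only exact match, h1
local = local_target(cands, ("N", "V"), tag_mismatches, n=2)
```

Here bold updating insists on the exactly correct output, while local updating searches only the 2-best list and settles for the closest output within it; in this toy case both land on the same candidate.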


Chapter 3

Hypergraph Parsing

In the last chapter, we introduced the structure prediction framework and applied

it to a simple tagging example. In this chapter, we show how to apply structure prediction

to machine translation.

Translation has a form similar to our tagging example. We are given a sentence, X,

in the source language and asked to produce the best sentence, Y , in the target language.

Despite the surface similarity, translation is a much harder problem than tagging. No

reasonable translator would translate a sentence word by word, and so machine translation systems need to first predict some hidden structure before translating. Unfortunately, unlike in the garbled tagging problem, we do not know the true form of this hidden structure.

In the introduction, we described several different SMT systems. One of the major distinctions

between the systems is in how they model hidden structure. The IBM and phrase-based

systems use alignments as hidden structure. Alignments are maps from the words in the

source sentence to the words they generate in the target sentence. Figure 3.1(a) shows an alignment in the context of structure prediction.

[Figure 3.1: Hidden structures for the pair "Una cierta oración española." / "Some spanish sentence.": (a) a word alignment, (b) a synchronous parse.]

There are many variations on the alignment concept, with models that have one-to-many or even many-to-many mappings. We use the term here in the general sense, to distinguish word level hidden structure from other forms.

Instead of pairing together words, syntax-aware systems use a synchronous grammar to model this hidden similarity. Synchronous

grammars are bilingual extensions of standard grammars. Instead of producing a parse tree

over a single sentence, they produce a joint parse over a pair of sentences. Figure 3.1(b)

shows a simple synchronous parse tree. The two parts of the tree are almost identical,

except that we need to introduce the word “Una” on the Spanish side and invert the phrase

“spanish sentence.” We do this by pairing “Una” with an empty symbol and rotating the

“spanish sentence” branch, notated by the arc in the tree. Under the synchronous grammar

model, small operations, like this rotation, can cause large effects in the final sentence.
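The straight-versus-inverted distinction at the heart of these grammars can be sketched concretely. This is an illustrative Python sketch of the two binary combination operations used by inversion transduction grammars; the sentence pair mirrors the example in the text, and the simplified terminal pairings (e.g. pairing "Una" with nothing) are assumptions for the example.

```python
def straight(left, right):
    """[A B]: concatenate children in the same order in both languages."""
    return (left[0] + right[0], left[1] + right[1])

def inverted(left, right):
    """<A B>: concatenate in order on the source side, but swap the
    order on the target side (the "rotation" in the tree)."""
    return (left[0] + right[0], right[1] + left[1])

# Terminal pairs: (source words, target words); "Una" pairs with nothing.
una = (["Una"], [])
oracion = (["oración"], ["sentence"])
espanola = (["española"], ["spanish"])

# Invert "oración española" so the target side reads "spanish sentence".
np = inverted(oracion, espanola)
pair = straight(una, np)
print(pair)  # (['Una', 'oración', 'española'], ['spanish', 'sentence'])
```

A single inverted rule low in the tree reorders only two words, but the same operation applied near the root would swap entire clauses, which is the "small operations, large effects" property noted above.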

To train a system with a synchronous grammar as hidden structure, we need to be able to find the best scoring output for a given input. Once we have this method, we can use the learning algorithms from the last chapter to train our system. This chapter is devoted to algorithms for finding this best scoring output. In the next chapter, we combine these parsing ideas with our online learning methods.

3.1 Parsing

The term parsing can refer to several algorithms for processing a sentence with a grammar. A parser may check if a sentence is valid under the grammar, produce the correct parse tree for the sentence, or even find all possible parse trees for that sentence. We focus first on determining the validity of a sentence and then show how to extend this method to other applications.

For validity checking, we use deductive parsing (Shieber, Schabes, and Pereira, 1995), a framework for determining if a sentence is valid under a given grammar. Deductive parsing reduces the problem of parsing to that of proving the existence of a parse. Deductive parsers use the original sentence as a

set of axioms and try to prove a goal. They move towards that goal by applying inference

rules that manipulate facts known as items. Inference rules take the form

    A1  A2  ...  An
    ─────────────────  ⟨side conditions⟩
           B

which tells us that if the side conditions hold, and we have produced items that satisfy the

preconditions A1 . . . An , then we can infer the item B. The axioms can be thought of as

inference rules with no pre-conditions that introduce the first items into the system, and

the goal is the pre-condition for success. Together, the items, axioms, goal, and inference rules fully specify a deductive parser.

Grammar:   S → X Y
           S → X X
           X → x
           X → y
           Y → y

Sentences: x y,  x x,  y y

Figure 3.2: A sample CFG grammar and valid sentences it could produce.

In this work, our grammar formalism of interest is a Context Free Grammar (CFG) in Chomsky normal form. This grammar formalism has two rule types:

• A→ a

• A→ B C

The first rule states that a non-terminal A produces a terminal symbol a. The second rule

states that a non-terminal symbol A produces two non-terminal symbols B and C. When parsing, we work with spans that represent a sub-section of the words. For instance, the span [i, j] covers the words w_{i+1}, . . . , w_j. Figure 3.2 shows a trivial CFG grammar and sample sentences it can produce.

We can construct a deductive parser for CFG grammars by specifying the items,

axioms, goal, and inference rules. In Figure 3.3, we first give a simple deductive form

for a CKY parser. Each item contains a grammar non-terminal and a span. When we

introduce an item into the system, we have proven that we can cover that span with that

non-terminal. Our goal is to have the final non-terminal symbol cover the entire span of the

sentence. The single axiom observes the sentence and introduces non-terminals in place of

their corresponding terminal symbol. The inference rule combines adjacent non-terminals


Item:  [A, i, j]

Axiom:                      A → w_{i+1}
        ──────────────
         [A, i, i+1]

Goal:  [S, 0, n]

Inference rule:  [B, i, j]   [C, j, k]
                 ─────────────────────  A → B C
                        [A, i, k]

(a) CKY deduction

Example:  [X, 0, 1] (X → x)   [Y, 1, 2] (Y → y)
          ───────────────────────────────────── S → X Y
                        [S, 0, 2]

(b) CKY Inference

Figure 3.3: CKY parser specified through deductive rules, and an example inference

according to the rules in the grammar. Figure 3.3(b) shows the full proof of validity of the sentence "x y".
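The CKY deduction in Figure 3.3 can be sketched as a chart recognizer over the toy grammar of Figure 3.2. This is an illustrative Python sketch: items are (A, i, j) triples, and the naive fixpoint loop stands in for a proper chart-parsing schedule.

```python
UNARY = {("X", "x"), ("X", "y"), ("Y", "y")}    # rules A -> a
BINARY = {("S", "X", "Y"), ("S", "X", "X")}     # rules A -> B C

def cky_recognize(words):
    """Return True if the sentence is derivable from S."""
    chart = set()
    # Axioms: rule A -> w_{i+1} introduces the item [A, i, i+1].
    for i, w in enumerate(words):
        for a, term in UNARY:
            if term == w:
                chart.add((a, i, i + 1))
    # Inference: [B, i, j] and [C, j, k] with A -> B C give [A, i, k].
    changed = True
    while changed:
        changed = False
        for (b, i, j) in list(chart):
            for (c, j2, k) in list(chart):
                if j2 != j:
                    continue
                for (a, b2, c2) in BINARY:
                    if b2 == b and c2 == c and (a, i, k) not in chart:
                        chart.add((a, i, k))
                        changed = True
    # Goal: the item [S, 0, n] covering the whole sentence.
    return ("S", 0, len(words)) in chart

print(cky_recognize(["x", "y"]))  # True
print(cky_recognize(["x"]))       # False
```

Each chart entry corresponds to a proven item, and membership of the goal item [S, 0, n] is exactly the validity proof of Figure 3.3(b).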

In this work we use an Earley-style parser (Earley, 1970), which uses a different inference strategy than CKY. An example Earley parser for CFGs is given in Figure 3.4. The

major difference is the Earley-style parser's use of partially completed rules, signaled by

the dot notation (•). For instance, A → B • C indicates that the rule has already processed

a B but not yet a C. The Earley axioms introduce all rules with the dot to the far left.

As we parse, the inference rules move the dot further towards the right until the rule is

completed. Figure 3.4(b) shows a full inference of this example under the Earley rules.

For translation, we need to find the best scoring parse for a given sentence, not just check validity. Luckily, the two problems are related.

To rank the quality of different parses, we assign a weight to each inference rule in the


Item:  [A → γ, i, j], where γ is a dotted rule

Axioms:  ────────────────  A → γ
          [A → •γ, i, i]

Goal:  [S•, 0, n]

Inference rules:

    [A → •w_i, i, i]
    ─────────────────────
    [A → w_i •, i, i+1]

    [A → •B C, i, j]   [B → γ•, j, k]
    ──────────────────────────────────
    [A → B • C, i, k]

(a) Earley deduction

Example:  [X → •x, 0, 0]      [S → •X Y, 0, 0]
          [X•, 0, 1]          [Y → •y, 1, 1]
          [S → X • Y, 0, 1]   [Y•, 1, 2]
          [S•, 0, 2]

(b) Inference Example

Figure 3.4: Earley parser specified through deductive rules, and an example inference.


Figure 3.5: A sample directed hypergraph. This hypergraph has both multi-tailed and

multi-headed edges. We use an arrowhead to indicate the source node and a double circle

to indicate the destination.

deductive system. The score of an item is the sum of the weights of the inference rules we used to discover that item, and the best parse is the best scoring series of inferences that lead to the goal. For now we assume that these weights are given; in the next chapter, we describe how they are learned.

Klein and Manning (2001) describe how a weighted deductive parser can be expressed as a search over a directed hypergraph. A hypergraph is a generalization of a directed graph: it is a tuple (N, E), where N is a set of nodes and E is

a set of hyperedges. Each hyperedge consists of two non-empty sets of nodes (T, H). The

hyperedge connects all the nodes in the tail T to all the nodes in the head H. Figure 3.5 shows a sample hypergraph that demonstrates different edge forms. A weighted, directed hypergraph is equivalent to a weighted deductive system when each edge has a singleton head node h. The intuition behind this equivalence is that all of the hypergraph edges have the form (t1, . . . , tn) → h, while in deductive parsing all inference rules have the form A1, A2, . . . , An ⊢ B.

If we think of items as nodes and inference rules as edges, we can apply the conversion in



Figure 3.7: Hypergraph representations of a CKY search space for the sentence and gram-

mar pair given in Figure 3.2.


Figure 3.6 to build the hypergraph. The deductive form provides a general methodology for

parsing, and the hypergraph form gives a representation for the search space over a specific

parse. Figure 3.7(a) applies this conversion for the sentence “x y” under the CKY parser

we gave above, and Figure 3.7(b) shows a path through this graph that is equivalent to the proof of validity in Figure 3.3(b).

The conversion to hypergraph representation does not give us any new information

about the parsing problem, but it motivates thinking about parsing in terms of graph

traversal. We argued in the last section that if we can find a path through a sentence’s

hypergraph, then we can convert this path into a proof of the sentence's validity. More importantly, we can then reduce the problem of finding the best scoring weighted deduction to the problem of finding the shortest path from the start to the goal node. This problem

is known as the single-source shortest path problem and has been well-studied for both graphs and hypergraphs.

As in standard graph traversal, there are two styles of hypergraph traversal algo-

rithm, Viterbi and Dijkstra. Viterbi-style algorithms provide a fast way for exhaustively

exploring a search space without exploring any edge more than once. The Viterbi algorithm

avoids repeating edges by imposing a topological order on the graph and then examining

each layer in order. Since each node is only in one layer, once we have completed the layer,

we do not need to explore the node again. Viterbi algorithms compactly traverse every

possible path to the goal, so they can be used for finding the shortest path or for computing

aggregate metrics, like the sum of all paths to the goal. These algorithms are particularly

suited for domains with a fixed total ordering. The most common example is the state


Figure 3.8: The topological partial order of spans for a two-word sentence. Spans 0:0, 1:1, and 2:2 sit below 0:1 and 1:2, which sit below 0:2.

space of HMMs where the linear order of the observations provides total ordering of states.

In hypergraph parsing, the topological partial order of items comes from the spans of the

items. Figure 3.8 shows this partial order for a two word sentence. We can traverse this

partial order in various ways. If we proceed from smaller to larger spans then the Viterbi

traversal is known as bottom-up parsing. If we proceed from left to right then we call it left-corner parsing. The hypergraph in Figure 3.7(a) is aligned to show a bottom-up topological

ordering on our simple CKY hypergraph. We can use Viterbi traversal of hypergraphs

to find the best parse or to compute metrics like the inside score for the inside-outside

algorithm.
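As a concrete illustration, here is a minimal bottom-up Viterbi traversal of a parsing hypergraph in Python. The edge encoding and the example weights are our own assumptions, not from the thesis; the point is only that processing items in order of span size guarantees every tail is finished before any edge out of it is used.

```python
def viterbi_best(n, axioms, edges, goal):
    """Bottom-up Viterbi over a parsing hypergraph.

    axioms: dict mapping item (symbol, i, j) -> axiom weight.
    edges:  list of ((tail1, tail2), head, weight) binary hyperedges.
    Items are processed by increasing span size (the topological
    order of Figure 3.8), so tails are always scored before heads.
    """
    INF = float("inf")
    best = dict(axioms)
    for span in range(2, n + 1):
        for (t1, t2), head, w in edges:
            if head[2] - head[1] != span:
                continue          # only relax heads at the current layer
            if t1 in best and t2 in best:
                cand = best[t1] + best[t2] + w
                if cand < best.get(head, INF):
                    best[head] = cand
    return best.get(goal)

# The CKY hypergraph for "x y": X spans 0-1, Y spans 1-2, S spans 0-2.
axioms = {("X", 0, 1): 1.0, ("Y", 1, 2): 2.0}
edges = [((("X", 0, 1), ("Y", 1, 2)), ("S", 0, 2), 0.5)]
print(viterbi_best(2, axioms, edges, ("S", 0, 2)))  # 3.5
```

Replacing `min` and `+` with a sum over paths would compute the inside score mentioned in the text instead of the best parse.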

Using the Viterbi algorithm is a nice way of thinking about parsing, but it does not

give us new information. The hypergraph representation becomes more useful for Dijkstra-

style algorithms. These algorithms get around the problem of exploring every possible path

by repeatedly exploring the current shortest path until a goal node is found. They give

up the ability to compute aggregate metrics and require the extra overhead of a queue to

prioritize future explorations, but can provide a substantial speed-up in practice. Since

they do not need to stay fixed to a topological order, they can explore the most promising


paths, and if at any time they reach a goal node, they can stop knowing they have found the shortest path.

Knuth (1977) proposed the first Dijkstra-style algorithm for hypergraphs. Algorithm 2 gives

the pseudo-code for a simplified version of this algorithm with binary hyperedges. The

algorithm is very similar to the standard Dijkstra algorithm. It keeps a priority queue of

possible nodes to explore. Each iteration, it explores the best scoring node and checks if it

has already explored a node that shares a hyperedge with that node. If it has, it queues up

the head of that hyperedge. Graehl and Knight (2004) provide an efficient implementation of this algorithm.

The major advantage of the Knuth algorithm over standard Viterbi parsing is that

it uses ordering information from the problem as opposed to just the topology of the graph.

If there is one very short path, the Knuth algorithm will find it quickly, while the Viterbi

algorithm will still need to traverse the entire graph. In addition, we can speed up the

Knuth algorithm by using A∗ search. If we have additional information about the problem,

we can use an admissible heuristic that underestimates the cost from any node to the goal.

A good heuristic will encourage the algorithm to explore deeper nodes without sacrificing

optimality. Klein and Manning (2003) introduce admissible heuristics for parsing.

We can use the Knuth algorithm on any grammar defined by weighted deductive

parse rules. In this work, we are interested in finding the best synchronous parse of a

sentence pair, so in order to use the Knuth algorithm, we need to define a synchronous

deductive parser. We use the Inversion Transduction Grammar (ITG) defined by Wu (1997).


Algorithm 2 Knuth's algorithm for hypergraphs with binary hyperedges

for all hyperedges e do
  r[e] ⇐ 2 {number of unexplored tail nodes of e}
end for
for all nodes u do
  d[u] ⇐ ∞
end for
d[u] ⇐ axiom weight of u for all axiom nodes u; initialize Q with all nodes keyed by d
while Q ≠ ∅ do
  v ⇐ Extract-Min(Q)
  for all hyperedges e = (u1 , u2 ) → h with weight w containing v do
    r[e] ⇐ r[e] − 1 {mark v as explored for this edge}
    if r[e] = 0 then
      d[h] ⇐ min(d(u1 ) + d(u2 ) + w, d[h]) {Update the best distance to the node}
      Decrease-Key(Q, h)
    end if
  end for
end while
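The pseudo-code above can also be sketched concretely. Below is a simplified Python version of the Knuth algorithm for binary hyperedges; the data encoding is an illustrative assumption, and real implementations (e.g. Graehl and Knight, 2004) are considerably more careful about efficiency.

```python
import heapq

def knuth_search(axioms, edges, goals):
    """Dijkstra-style search over a hypergraph with binary hyperedges.

    axioms: dict node -> weight of its axiom edge.
    edges:  list of ((u1, u2), head, weight).
    Nodes are popped in order of best known score; once both tails of
    an edge are finalized, the head is relaxed. The first goal popped
    is optimal, so the search can stop there.
    """
    by_tail = {}
    for e in edges:
        (u1, u2), _, _ = e
        by_tail.setdefault(u1, []).append(e)
        by_tail.setdefault(u2, []).append(e)
    dist, done = dict(axioms), set()
    heap = [(d, v) for v, d in axioms.items()]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if v in done or d > dist.get(v, float("inf")):
            continue                          # stale queue entry
        done.add(v)
        if v in goals:
            return v, d
        for (u1, u2), head, w in by_tail.get(v, []):
            if u1 in done and u2 in done:     # r[e] has reached 0
                cand = dist[u1] + dist[u2] + w
                if cand < dist.get(head, float("inf")):
                    dist[head] = cand
                    heapq.heappush(heap, (cand, head))
    return None

axioms = {("X", 0, 1): 1.0, ("Y", 1, 2): 2.0}
edges = [((("X", 0, 1), ("Y", 1, 2)), ("S", 0, 2), 0.5)]
print(knuth_search(axioms, edges, {("S", 0, 2)}))
```

Lazy re-insertion into the heap replaces the Decrease-Key operation of the pseudo-code; both give the same result.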



Figure 3.9: Grids demonstrating the application of ITG rules. The source sentence is along

the horizontal axis and the destination sentence is along the vertical axis. Figure 3.9(a)

shows the application of the straight rule. The smaller box represents [C, j, k, m, n] and the

larger box represents [A → [B•C], i, j, l, m]. Together they form [A•, i, k, l, n]. Figure 3.9(b)

shows a flip rule. The smaller box is now [C, j, k, l, m] and the large box represents [A →

hB • Ci, i, j, m, n].

In future work, we hope to explore the use of hypergraph parsing with richer synchronous

grammar frameworks.

ITG is the natural bilingual extension of CFG. This formalism has three types of

rules.

• A → sw/tw

• A → [B C]

• A → hB Ci

The first rule is a lexical alignment rule. It says that a non-terminal A produces

a terminal word sw in the source language and tw in the target language. There are two

important special cases of the first rule A → sw/ǫ and A → ǫ/tw that align words with the

empty symbol. The second rule states that A produces two non-terminal symbols B and

C ordered left to right in both languages. The final rule is the rotation rule. It says that


Item:             [A, i, j, l, m]

Axioms:           [A, i, i+1, l, l+1]  for A → sw_{i+1}/tw_{l+1}
                  [A, i, i+1, l, l]    for A → sw_{i+1}/ǫ
                  [A, i, i, l, l+1]    for A → ǫ/tw_{l+1}

Goals:            [S, 0, n, 0, m]

Inference rules:  [B, i, j, l, m]   [C, j, k, m, n]  ⊢  [A, i, k, l, n]  for A → [B C]
                  [B, i, j, m, n]   [C, j, k, l, m]  ⊢  [A, i, k, l, n]  for A → ⟨B C⟩

Figure 3.10: CKY parser rules for ITG.

A produces two non-terminal symbols ordered B C in the source language and C B in the target language.

Synchronous parsing operates over a sentence pair. We notate this pair (S, T) where S = sw1 , . . . , swm and T = tw1 , . . . , twn . The notion of a span also generalizes to synchronous parsing. A span now covers [i, j, l, m] where [i, j] is the span in the source language, and [l, m] is the span in the destination language. Figure 3.9 illustrates how these spans combine under the straight and rotation rules.

Just as with CFG, we can parse ITG with a CKY or Earley parse style. Figure 3.10

gives the CKY parse rules which are a simple extension of the CFG case. In our system,

we use the Earley-style algorithm in Figure 3.11 because it interacts better with some of the extensions we present in the next chapter. The Earley parser has many more inference rules than our previous parsers, but it is not much more complicated than the previous Earley parser. The three word introduction rules handle the introduction of a lexical pair, in addition to the special cases noted earlier where we align a word with the empty symbol. The scanning rules simply implement the functionality shown in Figure 3.9.
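To make the straight and rotation rules concrete, the following sketch reads the target-side string off an ITG derivation tree. The tree encoding is an illustrative assumption, not the thesis's internal representation.

```python
def target_yield(node):
    """Read the target-side string off an ITG derivation tree.

    A node is either a lexical pair (src, tgt) or a triple
    (orientation, left, right) with orientation "[]" (straight:
    children in the same order in both languages) or "<>" (inverted:
    children swapped on the target side).
    """
    if isinstance(node, tuple) and len(node) == 2 and all(
            isinstance(x, str) for x in node):
        src, tgt = node
        return [tgt] if tgt else []    # tgt == "" plays the role of epsilon
    orient, left, right = node
    l, r = target_yield(left), target_yield(right)
    return l + r if orient == "[]" else r + l

# Source "A B +" in postfix, target "A + B" in infix via one inverted rule:
tree = ("[]", ("A", "A"), ("<>", ("B", "B"), ("+", "+")))
print(" ".join(target_yield(tree)))  # A + B
```

Reading the source side instead (always l + r) yields "A B +", showing how a single inverted node captures the postfix-to-infix operator movement used in the arithmetic corpus of Chapter 5.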


Item:             [A → γ, i, j, l, m]

Axioms:           [A → •γ, i, i, l, l]  for A → γ

Goals:            [S•, 0, n, 0, m]

Inference rules:

Word introduction:
    WordPair:     [A → •sw_{i+1}/tw_{l+1}, i, i, l, l]  ⊢  [A → sw_{i+1}/tw_{l+1}•, i, i+1, l, l+1]
    EmptyTarget:  [A → •sw_{i+1}/ǫ, i, i, l, l]  ⊢  [A → sw_{i+1}/ǫ•, i, i+1, l, l]
    EmptySource:  [A → •ǫ/tw_{l+1}, i, i, l, l]  ⊢  [A → ǫ/tw_{l+1}•, i, i, l, l+1]

Scanning:
    Straight:  [A → •[B C], i, i, l, l]   [B•, i, j, l, m]   ⊢  [A → [B • C], i, j, l, m]
               [A → [B • C], i, j, l, m]  [C•, j, k, m, n]   ⊢  [A•, i, k, l, n]
    Rotation:  [A → •⟨B C⟩, i, i, m, m]   [B•, i, j, m, n]   ⊢  [A → ⟨B • C⟩, i, j, m, n]
               [A → ⟨B • C⟩, i, j, m, n]  [C•, j, k, l, m]   ⊢  [A•, i, k, l, n]

Figure 3.11: Earley-style parser for ITG, following the span combinations of Figure 3.9.

Chapter 4

Translation by ITG Parsing

In this chapter, we combine ideas from the previous two chapters to build a full

translation system. We begin in Section 4.1 by presenting a naïve base system that uses

basic learning and search approaches. While this system is powerful enough to perform

translation, it has some major deficiencies. In the next few sections, we describe some of

these issues and approaches for overcoming them. In Section 4.2, we describe an extension to our search algorithm which lets us use a relaxed learning approach that is more appropriate for the translation domain. In Section 4.3, we introduce an extension to our parser which

allows us to incorporate richer features in the model. These two extensions complete our

translation system, and in Section 4.4, we go into more detail about the specific features we use.

We described two methods for learning with hidden structure, bold and local updating. In this section, we begin by describing a simple bold updating translation system.


Chapter 4: Translation by ITG Parsing 35

Bold updates require that we find two values for each training example: the best and the correct structure. To find the correct structure, we fix the output and find the

and the correct structure. To find the correct structure, we fix the output and find the

    argmax_H  w · Φ(X, H, Y*)

To find the best structure, we fix the input and find highest scoring hidden and

output structures.

    argmax_{(Y,H) ∈ GEN(X)}  w · Φ(X, H, Y)
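Given these two argmax problems, a perceptron-style bold update can be sketched as follows. The sparse-dictionary feature representation and the toy feature extractor are illustrative assumptions; in the real system both argmax computations are the synchronous parsing problems discussed in the text.

```python
def bold_update(w, features, best_parse, correct_parse, rate=1.0):
    """One bold perceptron update for hidden-structure learning.

    best_parse    stands for argmax over (Y, H) of w . Phi(X, H, Y).
    correct_parse stands for argmax over H of w . Phi(X, H, Y*), Y* fixed.
    Both are assumed to come from the parser and are passed in here
    as (hidden, output) pairs.
    """
    if best_parse == correct_parse:
        return w                      # prediction already correct
    good = features(*correct_parse)   # sparse dict: feature name -> count
    bad = features(*best_parse)
    for f, v in good.items():         # move toward the correct structure
        w[f] = w.get(f, 0.0) + rate * v
    for f, v in bad.items():          # move away from the current best
        w[f] = w.get(f, 0.0) - rate * v
    return w

# Toy example with a hypothetical feature extractor:
def features(hidden, output):
    return {("rule", hidden): 1.0, ("out", output): 1.0}

w = bold_update({}, features, ("straight", "A*B"), ("inverted", "A+B"))
print(w)  # boosts the correct parse's features, penalizes the best parse's
```

Local updating, described later in this chapter, reuses the same update but substitutes the closest of the n-best outputs for the correct structure.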

In our model, the hidden structure is a synchronous parse of the two sentences. Therefore, finding the correct hidden structure is equivalent to finding the best scoring synchronous parse, a problem we discussed at length in Chapter 3. For our base system, we use the Earley-style ITG parser with the Knuth algorithm to find the correct structure.

For the best structure, we need to modify the parser so that it does not require

a fixed target sentence. Moving from a parser that generates trees to one that generates

trees and sentences seems like a major change; however, the deductive parser makes this

very easy. To generate parses over all possible output sentences, we can simply remove

any constraints on the target sentence. Figure 4.1 shows a relaxed version of the Earley

parser without target constraints. Given a source sentence, this parser will generate all

possible parses over all possible target sentences. In practice, there is no way to enumerate all of these outputs, but since we are only looking for the best scoring pair, we can avoid exploring the vast majority of states. Notice also that this parser does not tell us what target sentence is generated, but from the sequence of inference steps we can retrieve it.


Item:             [A → γ, i, j]

Axioms:           [A → •γ, i, i]  for A → γ

Goals:            [S, 0, n]

Inference rules:

Word introduction:
    WordPair:          [A → •sw_{i+1}/tw, i, i]  ⊢  [A → sw_{i+1}/tw•, i, i+1]
    EmptyDestination:  [A → •sw_{i+1}/ǫ, i, i]  ⊢  [A → sw_{i+1}/ǫ•, i, i+1]
    EmptySource:       [A → •ǫ/tw, i, i]  ⊢  [A → ǫ/tw•, i, i]

Scanning:
    Straight:  [A → •[B C], i, i]   [B•, i, j]  ⊢  [A → [B • C], i, j]
               [A → [B • C], i, j]  [C•, j, k]  ⊢  [A•, i, k]
    Reverse:   [A → •⟨B C⟩, i, i]   [B•, i, j]  ⊢  [A → ⟨B • C⟩, i, j]
               [A → ⟨B • C⟩, i, j]  [C•, j, k]  ⊢  [A•, i, k]

Figure 4.1: Deductive parsing rules for Earley ITG with no explicit target sentence.


While the bold update approach has a convenient form, it has two major problems when used for translation. Both issues stem from the fact that even for people, translation is not an exact process.

The first issue is that in any reasonable corpus, there will be many translations that are non-literal. For instance, consider how one fragment from the Europarl corpus (Koehn, 2002), a standard training set, is translated.

This fragment has the literal translation of “Second comment about the notice.” Adding the word “is” can be justified as making the English sentence more fluent. On the other hand, the word “my” seems like an embellishment by the translator. In this context, we would not consider the translations “The second comment is about the notice.” or even “Second comment about the notice.” to be wrong, but that is just how bold updating treats them. It would move the parameters away from these reasonable translations to satisfy the non-literal translation.

The second problem is a subtle variant of the first. The issue is that all models of hidden structure make assumptions about the underlying nature of translation that are often incorrect. Example translations that are correct and literal may still violate these assumptions. Consider how another sentence from the same corpus is translated.


This translation is mostly literal except that the German sentence uses a semi-colon, while the English sentence uses a subordinate clause. In an alignment-based system, this translation would probably not be as much of a problem as the first example. The translation of “;” to “that” is not too far of a stretch at the word level. On the other hand, for a syntax-based system, this translation could imply an entirely different internal structure. We could force the system to find some way to arrive at this exact translation, but it is unlikely it would find it without producing wrong hidden structure that may make the system worse.

These are difficult issues that we can not hope to fully solve. We can not know if our system is producing the wrong output because of poor parameters or because of a non-literal translation. Local updates try to avoid this determination by making smaller updates toward the best scoring outputs of the current system. If the reference is reasonable, it is likely to be one of the n best outputs produced, and this update is identical to a bold update. If it is non-literal or violates our hidden structure assumptions, then we hope to find a similar, but less drastic, update.

Local updating also requires computing the best and correct structure. In local

updates, best has the same definition as in bold updating, but the correct structure is now

the output sentence within the best n outputs that is closest to the supervised output.

    argmin_{(Y,H) ∈ nbest(X,w)}  m(Y*, Y)

To measure closeness, we use the Bleu score (Papineni et al., 2001), a standard metric for translation precision. Bleu score computes the precision of a test sentence by


Figure 4.2: Graph where the simple n-best Dijkstra algorithm is inefficient.

counting the number of n-grams the sentence has in common with a reference translation.
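The core Bleu quantity, clipped n-gram precision, can be sketched as below. This is a simplification: real Bleu combines orders n = 1..4 geometrically, supports multiple references, and adds a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity in Bleu.

    Counts how many of the candidate's n-grams also appear in the
    reference, clipping each n-gram's credit at its reference count
    so repeating a matched n-gram cannot inflate the score.
    """
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

ref = "the second comment is about the notice".split()
hyp = "second comment is about the notice".split()
print(ngram_precision(hyp, ref, 2))  # 1.0: every bigram appears in the reference
```

For the local update, a candidate from the n-best list would be scored this way against the reference and the highest scorer used as the "correct" output.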

The more difficult problem presented by local updating is calculating the n-best parses for a given sentence. The Knuth algorithm provides a method for finding the single best parse, and Huang and Chiang (2005) describe a fast Viterbi-style algorithm for finding the n-best parses for a given sentence. We want to keep the speed advantages of the Dijkstra approach, so we first consider a simple n-best extension of the Dijkstra algorithm. Instead of memoizing the shortest path between the start node and each other node, n-best algorithms remember the top n shortest paths.
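This simple extension can be sketched on an ordinary graph as follows; the adjacency-list encoding is an illustrative assumption.

```python
import heapq

def n_best_paths(adj, start, goal, n):
    """Naive n-best Dijkstra on an ordinary graph.

    Instead of finalizing each node once, every node may be popped up
    to n times, yielding up to n shortest path costs to the goal.
    This is the simple extension the text describes: it also computes
    up to n paths to every non-goal node, which is the wasted work
    the lazy algorithm avoids.
    """
    pops = {}
    heap = [(0.0, start)]
    results = []
    while heap and len(results) < n:
        d, v = heapq.heappop(heap)
        pops[v] = pops.get(v, 0) + 1
        if pops[v] > n:
            continue              # already have n paths to this node
        if v == goal:
            results.append(d)
            continue
        for u, w in adj.get(v, []):
            heapq.heappush(heap, (d + w, u))
    return results

adj = {"s": [("a", 1.0), ("a", 2.0)], "a": [("g", 1.0)]}
print(n_best_paths(adj, "s", "g", 2))  # [2.0, 3.0]
```

On a hypergraph the same idea applies per item, which is exactly the bookkeeping that makes the naive version wasteful.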

Unfortunately, this simple n-best extension keeps around too much information. For the graph in Figure 4.2, the algorithm would store the n possible ways of getting to the two non-goal nodes, but we are only interested in n ways to get to the goal. In the worst case, half these paths are wasted. To fix this issue, Huang (2005) gives a lazy frontier algorithm.

The basic idea behind the algorithm is to wait until we need the next best value for a node before computing that value for previous nodes. Figure 4.3 sketches the control behind the lazy n-best computation. In Figure 4.3(a), we have just found the third best path to the node on the far right. It asks the edge where this path came from to look for



Figure 4.3: Diagram showing the lazy n-best Knuth algorithm in progress.

any successors. Figure 4.3(b) shows the state of the parent edge. We have found two of

the best paths for the top node and three of the best paths for the bottom node. We now

look to see if we can find any successors for the path we just found by looking at possible

neighboring paths up and to the right. We can explore the up path immediately since it is

a combination of two paths we already have. We can not do anything yet with the right path because we have not found the fourth best path for the bottom node. Once this path is found, the right successor can be queued.

Our final system uses an approach that Liang et al. (2006) refer to as hybrid updating. We first try to do bold updating by searching for a reasonable correct output. If we can not find one, we fall back to local updating.

In the original IBM SMT system, Brown et al. (1990) divide the translation

problem into two parts, a translation model and a language model. The translation model

manages the conversion of words from source to target language, and the language model

makes sure that the words produce a coherent target sentence. The structure prediction

framework frees us from making this distinction. We can model arbitrary properties by

including them as features. For instance, the IBM model uses trigram probabilities to


model the target language. If we want to include a trigram language model in the scoring, we can add indicator features of the form

    φt(ti , ti−1 , ti−2 ) = 1 if ti = a, ti−1 = b, ti−2 = c; 0 otherwise.
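This kind of indicator feature is trivial to express in code. The sketch below is a hypothetical rendering of the definition above, one closure per (a, b, c) triple.

```python
def make_trigram_feature(a, b, c):
    """Indicator feature for one target-language trigram.

    Fires (returns 1.0) when word t_i is a, t_{i-1} is b, and
    t_{i-2} is c, mirroring the phi_t definition in the text."""
    def phi(ti, ti1, ti2):
        return 1.0 if (ti, ti1, ti2) == (a, b, c) else 0.0
    return phi

phi = make_trigram_feature("notice", "the", "about")
print(phi("notice", "the", "about"))  # 1.0
print(phi("notice", "a", "about"))    # 0.0
```

The difficulty discussed next is not defining such features but making them decompose over hyperedges so the parser can score them incrementally.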

Unfortunately, adding a feature does not help if it does not efficiently decompose over our search space. Since our search space is a hypergraph, the features need to decompose over the inference rules that produce the hyperedges. For instance, we can easily include a rule indicator feature

    φrule(r) = 1 if r = rule; 0 otherwise.

When we apply the inference rule

    [A → [B • C], i, j]   [C•, j, k]  ⊢  [A•, i, k]

we know that the grammar rule A → [B C] has been applied, and so the feature should fire. On the other hand, when we apply this inference rule, it is also likely that we have created a new n-gram in the target sentence, but the inference rule does not have any knowledge about the surrounding target words.

There are two ways of dealing with this issue. We could first generate all possible outputs and then apply the n-gram features. This would still ensure that we find the optimal output, but the huge number of possible outputs makes this infeasible. Collins and Koo (2005) propose instead generating an n-best list of high scoring outputs and then ranking them with a richer set of features. This technique, known as discriminative reranking, is non-optimal, but much more efficient, and it has also been applied to machine translation.

The other option is to inject the extra information into the parser. This is known

as the n-gram intersection trick and was first proposed for ITGs by Wu (1997). The idea


is to simultaneously build up the information needed for language model features while we parse. We augment the items of our parser to include both the current symbol and the outer words of the implied target sentence. Figure 4.4 shows the new parse rules with the outer words included. When we apply the rule

    [A → [B • C], i, j, s, e′]   [C•, j, k, s′, e]  ⊢  [A•, i, k, s, e]

we know that we have implicitly added the bigram (e′, s′) to our destination sentence.

This transformation alone is enough for an integrated language model, but it vastly increases the size of our hypergraph. Previously, our items included a non-terminal and two indices, so there were O(|N|n²) nodes, where |N| is the number of nonterminals and n is the length of the sentence. In addition, the combination rules have three possible non-terminals and three free indices, so there were O(|N|³n³) edges. Now, by storing the corner words, we have O(|N|n²w²) nodes and O(|N|³n³w⁴) edges,¹ where w is the number of words in the target language.

To reduce this blow-up, we use the hook trick (Huang, Zhang, and Gildea, 2005). The hook trick notes that there is no

dependence between the bigram combination and the nonterminal combination, so we can

do them in two separate steps. We introduce a unary hook rule that first applies the bigram feature, and we adjust the combination rules accordingly. Items that are “hooked” can only combine

with other items that share the hook word. Figure 4.6 shows how requiring the hook reduces

the number of rules applied. The final deductive parser with the hook trick is shown in

¹ This is actually a bit untrue. Since we can insert pairs that are empty on the source side, we have cycles in the hypergraph. To eliminate this in practice, we add a field h to the item which counts the number of these edges. We add a side condition to the inference rules that prevents this field from exceeding a maximum value m. This eliminates cycles at a cost of a factor of m in the size of the hypergraph.


Item:             [A → γ, i, j, s, e]

Axioms:           [A → •γ, i, i, ǫ, ǫ]  for A → γ

Inference rules:

Word introduction:
    WordPair:          [A → •sw_{i+1}/dw, i, i, ǫ, ǫ]  ⊢  [A → sw_{i+1}/dw•, i, i+1, dw, dw]
    EmptyDestination:  [A → •sw_{i+1}/ǫ, i, i, ǫ, ǫ]  ⊢  [A → sw_{i+1}/ǫ•, i, i+1, ǫ, ǫ]
    EmptySource:       [A → •ǫ/dw, i, i, ǫ, ǫ]  ⊢  [A → ǫ/dw•, i, i, dw, dw]

Scanning:
    Straight:  [A → •[B C], i, i, ǫ, ǫ]    [B•, i, j, s, e]   ⊢  [A → [B • C], i, j, s, e]
               [A → [B • C], i, j, s, e′]  [C•, j, k, s′, e]  ⊢  [A•, i, k, s, e]
    Reverse:   [A → •⟨B C⟩, i, i, ǫ, ǫ]    [B•, i, j, s, e]   ⊢  [A → ⟨B • C⟩, i, j, s, e]
               [A → ⟨B • C⟩, i, j, s′, e]  [C•, j, k, s, e′]  ⊢  [A•, i, k, s, e]

Figure 4.4: Deductive parsing rules for Earley ITG with bigram intersection.

Figure 4.5: Combining items [B, s, e′] and [C, s′, e]; the new bigram (e′, s′) forms where the two target spans meet.



Figure 4.6: The figure shows how the addition of unary hook rules can reduce the total number of edges. In the first example, we need O(n²) rule applications to connect two pairs of n items. If we first group all items with a common right element and then do this combination, we reduce the cost by a factor of n.
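The counting argument in the caption can be checked with a small sketch. The item encoding here is a deliberately simplified stand-in for the hooked parser items: each item carries only the word it exposes for combination.

```python
from collections import defaultdict

def direct_combinations(lefts, rights):
    """Combine every left item with every right item directly:
    O(n^2) rule applications, even when few pairs are compatible."""
    apps = 0
    for l in lefts:
        for r in rights:
            apps += 1              # one rule application per (l, r) pair
    return apps

def hooked_combinations(lefts, rights):
    """First group right items by the word they expose (the 'hook'),
    then each left item only meets its matching group."""
    by_hook = defaultdict(list)
    for r in rights:
        by_hook[r[0]].append(r)    # r = (hook_word, item_id)
    apps = len(rights)             # one unary hook rule per right item
    for l in lefts:
        apps += len(by_hook.get(l[0], []))  # l = (end_word, item_id)
    return apps

lefts = [("w", i) for i in range(10)]
rights = [("w", i) for i in range(10)] + [("v", i) for i in range(10)]
print(direct_combinations(lefts, rights))   # 200 pairwise attempts
print(hooked_combinations(lefts, rights))   # 20 hook rules + 100 matches = 120
```

The saving grows with the number of incompatible items: the grouping pass is linear, while the direct product pays for every incompatible pair.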

Figure 4.7. With the hook trick, each combination rule has only three free target words.

4.4 Features

This section surveys the features we use to assess the quality of a translation. One of the advantages of a discriminative model is the ability to add features without worrying about correlation effects, so we can add any features we like as long as they decompose over the inference rules. As we noted in the last section, the decomposition produces functions over individual hyperedges: axiom edges, combination edges, hook rules, and goal edges. Axiom edges are the edges that first introduce rules and words, combination edges apply a grammar rule to combine two rules, and hook rules introduce a new bigram.

• RULE For each rule in the grammar, we have a rule count feature that fires on a combination edge:


Item:             [A → γ, i, j, s, e, h]

Axioms:           [A → •γ, i, i, ǫ, ǫ, 0]  for A → γ

Inference rules:

Word introduction:
    WordPair:          [A → •sw_{i+1}/dw, i, i, ǫ, ǫ]  ⊢  [A → sw_{i+1}/dw•, i, i+1, dw, dw]
    EmptyDestination:  [A → •sw_{i+1}/ǫ, i, i, ǫ, ǫ]  ⊢  [A → sw_{i+1}/ǫ•, i, i+1, ǫ, ǫ]
    EmptySource:       [A → •ǫ/dw, i, i, ǫ, ǫ]  ⊢  [A → ǫ/dw•, i, i, dw, dw]

Hook rule:  [A•, i, j, s, e, 0]  ⊢  [A•, i, j, s, e′, 1]

Scanning:
    Straight:  [A → •[B C], i, i, ǫ, ǫ]    [B•, i, j, s, e, 1]    ⊢  [A → [B • C], i, j, s, e]
               [A → [B • C], i, j, s, e′]  [C•, j, k, e′, e, 0]   ⊢  [A•, i, k, s, e, 0]
    Reverse:   [A → •⟨B C⟩, i, i, ǫ, ǫ]    [B•, i, j, s, e, 0]    ⊢  [A → ⟨B • C⟩, i, j, s, e]
               [A → ⟨B • C⟩, i, j, s′, e]  [C•, j, k, s, s′, 1]   ⊢  [A•, i, k, s, e, 0]

Figure 4.7: Deductive parsing rules for Earley ITG with bigram intersection and the hook rule trick.


    φrule(r) = 1 if r = rule; 0 otherwise.

In aggregate, these features count up the number of times each grammar rule is used in the parse. These features mimic the role of a standard probabilistic parser.

• LEXICAL TRANSLATION For each source and target word pair, we have a feature

    φsrc/targ(s, t) = 1 if src = s and targ = t; 0 otherwise.

This feature allows the system to learn a one-to-one mapping between the source and target languages.

• CLASS TRANSLATIONS We also have class translation features that are identical to the lexical translation features, but for word classes. These features allow us to generalize the system to word combinations that we have not seen before. We use the word classes produced during IBM model training.

• IBM LEXICAL TRANSLATION We have two IBM translation features that fire on word introduction edges, where ST and TS return the probability of a given terminal pair from IBM Model 3 training. These features are extremely useful for pruning the state space in early iterations.


• BIGRAMS We have a language model feature that scores the destination sentence by its bigrams. The feature fires whenever a hook rule forms a new bigram. This feature lets us incorporate a rich language model trained on outside monolingual texts into the system. We use the SRI language modeling toolkit (Stolcke, 2002) to generate these bigrams.

• EMPTIES We have two features that fire for every non-terminal rule of the form ǫ/w or w/ǫ. The first is called hallucinate and the second ignore. These rules regulate the number of unaligned words.

• RELATIVE SPANS These features track the relative spans of a newly created output. We have 11 of these features ranging from [-5, 5]. This rule fires when we complete a parse:

    φspan,n(x) = 1 if size(x.srcSpan) − size(x.destSpan) = n; 0 otherwise.

When designing our features, we came across a strange issue. Dijkstra-style algorithms only work on graphs with no negative weight edges. This restriction works fine when edge weights are probabilities, but in a linear model features can have negative weights. The negative weight issue also conflicts with the BIGRAMS feature. If the BIGRAMS feature had a negative weight, it would always be better to add more random words. Neither of these problems is addressed in the literature, and we were unable to find an elegant solution. We handle this issue by imposing a lower



Figure 4.8: A simple admissible heuristic for ITG. For each source word, we find the best translation, surrounding bigram, terminal rule, and single combination.

4.5 Heuristic

We argued that with the bigram intersection trick and the hook rule, the hypergraph for ITG parsing of a sentence has O(|N|³n³w³) edges. In the worst case, the Knuth algorithm takes O(V log V + E) time, which reduces to O(|N|³n³w³). This is a steep cost, so we use A∗ search with an admissible heuristic to avoid exploring the full graph. An admissible heuristic is an underestimate of the total remaining cost to the goal from an intermediary node. A good heuristic should be as close as possible to the true cost. We start by using the ITG heuristic given in Zhang and Gildea (2006) and tailor it to our feature set. The basic idea is to relax the parsing problem to the problem of finding the best translation of each source word individually. This is an underestimate because in the process of parsing we will have to translate each word, and we can not score better than this best translation. We can extend this idea further. Each word we translate will be part of a bigram, so we can find the best translation plus bigram for each word. Our final heuristic is shown in Figure 4.8.
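A minimal sketch of the word-level part of this heuristic follows. The per-word costs are assumed to be precomputed; the full heuristic of Figure 4.8 also folds in the best surrounding bigram, terminal rule, and single combination for each word.

```python
def word_heuristic(costs, covered):
    """Admissible outside estimate in the style of Zhang and Gildea (2006).

    costs[i] is the cost of the cheapest translation of source word i.
    Every word still to be covered must eventually pay at least its
    cheapest translation, so summing these minima can never
    overestimate the true remaining cost; the estimate is admissible.
    """
    return sum(c for i, c in enumerate(costs) if i not in covered)

costs = [0.5, 1.2, 0.3]            # cheapest translation cost per source word
print(word_heuristic(costs, covered={0}))  # words 1 and 2 remain
```

In A* search this value is added to an item's inside score when it is queued, steering the Knuth algorithm toward items whose uncovered words are cheap to finish.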

Chapter 5

Results

This chapter summarizes the results for this system. We first present some of the engineering challenges we faced during learning. We then present scores and analysis of the experiments.

In initial experiments, we ran the system with pure A∗ search and the admissible heuristic given at the end of the last chapter. The system quickly runs out of memory due to the enormous number of states. We considered two methods for cutting down on the number of states. We could prune states out of the grammar before parsing or use beam search during parsing. The first option is less elegant and can make it impossible to find a correct translation, but it ensures that we won't ever give good scores to states that should have been pruned. The second option leaves pruning decisions up to the current weights of the translation model. Since we are already relying on the assumption that we begin in a reasonable state, we use beam search for all our pruning. We used a fixed width beam that prunes all but the best scoring items for each span.


Chapter 5: Results 50

Complicating this problem is the fact that our admissible heuristic is only moderately useful in practice. It is very important for filtering out poor lexical level decisions. For instance, it will prevent the system from giving good scores to translations of words that do not fit with other possible words in a sentence. It does not provide any help with grammar level decisions.

When the difference between the heuristic and the correct value is very large, even beam pruning fails to find a parse in reasonable time. In these cases, we switched to an inadmissible heuristic: the score of the admissible heuristic times a constant factor. In future work, we could find a tighter inadmissible heuristic by finding a greedy approximation of the translation.

5.2 Experiments

Because of the inefficiency of training, we were unable to train the system on a large natural language corpus. Instead, we settled for two smaller synthetic corpora: arithmetic expression data and the METEO weather system data. These data sets do not allow us to make any claims about the system's performance on real data, but they do allow us to examine some positive and negative properties of synchronous grammars trained in this way.

The other issue is what model to use as a comparison. Our system has two distinguishing properties: it uses a synchronous grammar and it is discriminatively trained. Ideally, we would compare its output to a discriminative alignment based system and a generative synchronous grammar system. Unfortunately, there are no available systems of this form. We settle for comparing it to the IBM model as implemented in

GIZA++ by Och (2000). We use the IBM model instead of more advanced phrase-based


[Figure: average parse score as a function of the heuristic multiplier, plotted for multipliers from 1.0 to 1.8.]


AAB++

A+A+B

AB+B*

(A+B)*B

BA+

B+(A)

BBBBB*+++A*

(B+B+B+B*B)*(A)

Figure 5.1: Example pairs from the arithmetic expressions corpus. The second pair has necessary parentheses, the third has unnecessary parentheses, and the fourth has both.

systems because it shares the word level granularity of our system. We use the CMU–Cambridge toolkit (Clarkson and Rosenfeld, 1997) for the language model and the ISI

The first test we ran was with the arithmetic expressions data set, a synthetic

corpus proposed by Nesson (2005) to test synchronous grammars. The data set consists

of 1000 sentence pairs where the source language is postfix notation arithmetic expressions

and the target language is noisily parenthesized infix expressions. Figure 5.1 shows some example pairs.
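For concreteness, pairs of this form can be generated by sampling a random expression tree and reading it out in both notations. This is only a sketch: the operator inventory, tree shapes, and the probability of extra parentheses are assumptions, not the exact procedure of Nesson (2005).

```python
import random

OPS = ["+", "*"]
VARS = ["A", "B"]

def random_tree(depth):
    """Sample a random expression tree: a variable leaf or a binary operator node."""
    if depth == 0 or random.random() < 0.4:
        return random.choice(VARS)
    return (random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1))

def postfix(t):
    """Source side: postfix notation, never parenthesized."""
    if isinstance(t, str):
        return t
    op, left, right = t
    return postfix(left) + postfix(right) + op

def infix(t, parent="+", p_extra=0.3):
    """Target side: infix with necessary parentheses (a '+' under a '*'),
    plus noisy, unnecessary ones added with probability p_extra."""
    if isinstance(t, str):
        return t
    op, left, right = t
    s = infix(left, op, p_extra) + op + infix(right, op, p_extra)
    needed = (op == "+" and parent == "*")
    if needed or random.random() < p_extra:
        s = "(" + s + ")"
    return s
```

For example, the tree `("*", ("+", "A", "B"), "B")` yields the source `AB+B*` and, with `p_extra = 0`, the target `(A+B)*B`, matching the second pair in Figure 5.1.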

We consider this an interesting test for MT systems because it has very simple lexical translations and language modeling, yet drastic word movement. In our initial test, we used a fully connected ITG grammar with 5 nonterminals. We ran tests with both bold and local updating.

The bold updating test revealed a major failure of the A* search strategy. When the system begins, the only meaningful features are the translation and language model features inherited from the IBM system. These features always weight translations without parentheses higher than those with them. For this reason, if the correct translation is

(A*B)*A

the system instead proposes

A*B*A

This happens repeatedly whether the parentheses are necessary or not. We would hope that the system would learn where to include parentheses; instead it continually penalizes these close translations. Eventually, the cost of this simple translation becomes so high that anything else is better. When there is no clear good path, A* does not help, the beam can no longer prune anything, and translating even trivial sentences becomes prohibitively costly. This example demonstrates the disadvantage of relying on a Dijkstra-style search whose efficiency is tied to the quality of the learned weights.

The local updating experiment had more success. We ran tests with n = 100, and since

parentheses are expensive in the starting language model, only a small percentage of the

possible outputs have any parentheses. Since these parentheses are scattered throughout

the sentence, they almost never match up with the parentheses in the reference sentence.

The result is that in the test translations, the system almost never adds parentheses. This

demonstrates the key difference between local and bold updating: local updating can avoid some noise in the reference sentence, at the cost of ignoring some important distinctions.
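The distinction can be made concrete with a perceptron-style sketch in the spirit of Liang et al. (2006). Here the n-best list stands in for the reachable outputs; the function names and feature interface are hypothetical, not the exact update used in the thesis.

```python
def local_update(weights, features, nbest, reference, similarity, lr=1.0):
    """Local update: move the weights toward the n-best candidate most
    similar to the reference (the reachable 'oracle') and away from the
    current model-best candidate.  Bold updating would instead force an
    update toward the reference itself, even when it is unreachable."""
    model_best = nbest[0]  # assume nbest is sorted by model score
    oracle = max(nbest, key=lambda cand: similarity(cand, reference))
    if oracle != model_best:
        for feat, val in features(oracle).items():
            weights[feat] = weights.get(feat, 0.0) + lr * val
        for feat, val in features(model_best).items():
            weights[feat] = weights.get(feat, 0.0) - lr * val
    return weights
```

Because the oracle is chosen from the n-best list, reference noise that no candidate reproduces (such as scattered parentheses) never enters the update, which is exactly the behavior observed here.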

Even without producing any parentheses, local updating performs significantly better than the IBM model. The synchronous grammar allows our system to do global reorderings that the IBM model cannot. Local updating achieves a 0.5875 Bleu score, while the IBM model scores 0.5087. Table 5.3 shows some example sentences and the results from the two systems. Both systems seem unwilling to add any extra words. Our system compensates by correctly reordering the words and eliminating parentheses. The IBM system translates chunks of the sentence locally but fails to reorder them globally.


Table 5.3: Example translations from the arithmetic expressions corpus.

Correct  B+A+A+B*A
ITG      B+A+A+B*A
IBM      B*A)

Correct  (B+(B+A*B)*B)*A
ITG      B+B+A*B*B*A
IBM      B*B)*B)*A

Correct  (A+A+A)*B
ITG      A+A+A*B
IBM      (A+A)*B
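The Bleu scores reported in this chapter come from the metric of Papineni et al. (2001). The exact scoring script used is not specified; the following is a simplified, sentence-level sketch (real evaluations are corpus-level and typically smoothed):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times a brevity
    penalty.  Returns 0.0 if any n-gram order has no overlap."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(count, r_ngrams[g]) for g, count in c_ngrams.items())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / sum(c_ngrams.values())))
    brevity = min(1.0, math.exp(1.0 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

Since the score is driven by n-gram overlap, correct global reorderings are rewarded even when short tokens such as parentheses are dropped.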

thursday a few showers ending late in the morning then cloudy

NUM pour cent de probabilité d'averses de pluie ou de neige en après-midi et en soirée
NUM percent chance of rain showers or flurries in the afternoon and evening

mardi nuageux avec NUM pour cent de probabilité d'averses tôt en matinée
tuesday cloudy with NUM percent chance of showers early in the morning

vents du sud-est de NUM à NUM km h diminuant à NUM ce soir
wind southeast NUM to NUM km h diminishing to NUM this evening

Figure 5.3: Example pairs from the METEO weather data corpus. The corpus features rather literal translations with some added and deleted words and reordering. For instance, "pour cent de probabilité d'averses" is translated as "percent chance of rain", which requires adding the word "of" and dropping the word "de".

Our second experiment was on another synthetic corpus of a much different sort. The METEO weather data corpus is a collection of natural language statements generated automatically from weather data (Kittredge et al., 1986). It has much different properties from the arithmetic data corpus: difficult lexical translation and language fluency issues, but very little global word reordering. Figure 5.3 shows several example pairs.

We trained our system on 1000 unique sentences from this corpus with lengths of 5-15. For this corpus, we used hybrid updating with n = 10. The system took an average of 20 seconds per sentence, and we trained it for 10 iterations. Training took about 2 full days.

Table 5.1: Example references and translations from the METEO corpus.

Correct  fog and patchy drizzle along the coast in the morning and early in the afternoon
ITG      fog patches in the coast in the morning and early along the afternoon and patchy drizzle
IBM      fog and patchy drizzle along the coast in the morning and early in the afternoon

Correct  tonight showers changing to drizzle and a few showers this evening
ITG      dissipating this evening and tonight showers changing drizzle in tonight a few this evening
IBM      this evening and tonight periods changing drizzle and a few showers this evening

Correct  monday light snow mixed with rain changing to rain late in the morning
ITG      monday rain late in the morning changing rain changing light snow mixed
IBM      monday light becoming mixed rain moving changing rain late in morning

The two systems performed similarly on the corpus, with comparable Bleu scores. While both systems translate most sentences correctly, there are subtle differences

that highlight the different underlying models. Table 5.1 shows example references and

results. The first example is particularly striking. The IBM model translates the sentence

exactly correctly, while the ITG translation is very incorrect. However, the ITG output does translate all the words correctly and puts them in a grammatical, if nonsensical, order. This example demonstrates that sometimes the extra expressivity of the ITG system can be a

disadvantage.

5.3 Conclusion

In this thesis, we have presented a discriminatively trained synchronous grammar based machine translation system. At heart, this system is a combination of two ideas. The first is the learning idea that we can abstract away from a problem's complex structure and decompose it into a feature vector. The second is that we can approach the particular problem of synchronous grammar parsing as a graph traversal: we train by learning the weights of the graph, and translate by searching over these weights.

This system has several issues. The most prominent is the inefficiency of parsing under this framework. In the short term, we hope that a better, possibly inadmissible, heuristic will improve parsing speed. Alternatively, we could include smarter pruning both before and during parsing. We had hoped that the discriminative framework would speed up training; in practice, generative-style training with a fixed output sentence may be faster, since it does not have to search over the full space of output translations.

If this efficiency issue can be resolved, we think this could be a promising frame-

work. While the results we produced were on small synthetic corpora, they show that the

system can learn the global reorderings of the arithmetic expressions corpus without sacrificing the translation and language modeling necessary for the literal translations of the

METEO corpus.

Finally, this method would be even stronger with a better synchronous grammar. As Chiang (2005) has shown, there is no reason not to combine phrasal and syntax systems. The synchronous CFGs that Chiang uses are not too different from ITG, and would allow us to incorporate phrase-level translations into this framework.

References

Brown, P.F., S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer, and P.S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.

Chiang, D. 2005. A hierarchical phrase-based model for statistical machine translation. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Clarkson, Philip and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU–Cambridge toolkit. Proceedings of Eurospeech, Rhodes, Greece.

Collins, M. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. Proceedings of EMNLP.

Collins, M. and T. Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1).

Daume III, H. and D. Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. Proceedings of ICML.

Huang, L. and D. Chiang. 2005. Better k-best parsing. Proceedings of the 9th International Workshop on Parsing Technology.

Huang, L., H. Zhang, and D. Gildea. 2005. Machine translation as lexicalized parsing with hooks. Proceedings of the 9th International Workshop on Parsing Technology.

Kittredge, R., A. Polguère, and E. Goldberg. 1986. Synthesizing weather forecasts from formatted data. Proceedings of COLING, pages 563–565.

Klein, D. and C.D. Manning. 2001. Parsing and hypergraphs. Proceedings of the 7th International Workshop on Parsing Technologies.

Klein, D. and C.D. Manning. 2003. A* parsing: fast exact Viterbi parse selection. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics, pages 40–47.

Koehn, P., F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics, pages 48–54.

Koo, T. and M. Collins. 2005. Hidden-variable models for discriminative reranking. Proceedings of EMNLP.

Kumar, S., Y. Deng, and W. Byrne. 2005. A weighted finite state transducer translation template model for statistical machine translation. Natural Language Engineering, 12(01):35–75.

Lari, K. and S.J. Young. 1991. Applications of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 5:237–257.

Liang, P., A. Bouchard-Côté, D. Klein, and B. Taskar. 2006. An end-to-end discriminative approach to machine translation. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics.

Melamed, I.D. 2004. Statistical machine translation by parsing. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Och, F.J. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Papineni, K., S. Roukos, T. Ward, and W.J. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176, IBM Research.

Shieber, Stuart M., Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1–2):3–36.

Stolcke, A. 2002. SRILM: an extensible language modeling toolkit. Proc. ICSLP, 2:901–904.

Wu, D. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Zhang, H. and D. Gildea. 2006. Efficient Search for Inversion Transduction Grammar. Proceedings of EMNLP.
