© Copyright 2010
Jeremy G. Kahn

Parse decoration of the word sequence in the speech-to-text machine-translation pipeline

Jeremy G. Kahn

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2010

Program Authorized to Offer Degree: Linguistics

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Jeremy G. Kahn and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Chair of the Supervisory Committee:

Mari Ostendorf

Reading Committee:

Mari Ostendorf
Paul Aoki
Emily M. Bender
Fei Xia

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date

University of Washington

Abstract

Parse decoration of the word sequence in the speech-to-text machine-translation pipeline
Jeremy G. Kahn

Chair of the Supervisory Committee:
Professor Mari Ostendorf
Electrical Engineering & Linguistics

Parsing, or the extraction of syntactic structure from text, is appealing to natural language processing (NLP) engineers and researchers. Parsing provides an opportunity to consider information about word sequence and relatedness beyond simple adjacency. This dissertation uses automatically-derived syntactic structure (parse decoration) to improve the performance and evaluation of large-scale NLP systems that have (in general) used only word-sequence level measures to quantify success. In particular, this work focuses on parse structure in the context of large-vocabulary automatic speech recognition (ASR) and statistical machine translation (SMT) in English and (in translation) Mandarin Chinese. The research here explores three characteristics of statistical syntactic parsing: dependency structure, constituent structure, and parse uncertainty, making use of the parser's ability to generate an M-best list of parse hypotheses.

Parse structure predictions are applied to ASR to improve word-error rate over a baseline non-syntactic (sequence-only) language model (achieving 6–13% of possible error reduction). Critical to this success is the joint reranking of an N×M-best list of N ASR hypothesis transcripts and M-best parse hypotheses (for each transcript). Jointly reranking the N×M lists is also demonstrated to be useful in choosing a high-quality parse from these transcriptions. In SMT, this work demonstrates expected dependency pair match (EDPM), a new mechanism for evaluating the quality of SMT translation hypotheses by comparing them to reference translations. EDPM, which makes direct use of parse dependency structure in its measurement, correlates better with human measurements of translation quality than the widely-used competitor evaluation metrics BLEU4 and translation edit rate. Finally, this work explores how syntactic constituents may predict or improve the behavior of unsupervised word-aligners, a core component of SMT systems, over a collection of Chinese-English parallel text with reference alignment labels. Statistical word-alignment is improved over several machine-generated alignments by exploiting the coherence of certain parse constituent structures to identify source-language regions where a high-recall aligner may be trusted. These diverse results across ASR and SMT point together to the utility of including parse information in large-scale (and generally word-sequence oriented) NLP systems and demonstrate several approaches for doing so.

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
1.1 Evaluating the word sequence
1.2 Using parse information within automatic language processing
1.3 Overview of this work

Chapter 2: Background
2.1 Statistical parsing
2.2 Reranking n-best lists
2.3 Automatic speech recognition
2.4 Statistical machine translation
2.5 Summary

Chapter 3: Parsing Speech
3.1 Background
3.2 Architecture
3.3 Corpus and experimental setup
3.4 Results
3.5 Discussion

Chapter 4: Using grammatical structure to evaluate machine translation
4.1 Background
4.2 Approach: the DPM family of metrics
4.3 Implementation of the DPM family
4.4 Selecting EDPM with human judgements of fluency & adequacy
4.5 Correlating EDPM with HTER
4.6 Combining syntax with edit and semantic knowledge sources
4.7 Discussion

Chapter 5: Measuring coherence in word alignments for automatic statistical machine translation
5.1 Background
5.2 Coherence on bitext spans
5.3 Corpus
5.4 Analyzing span coherence among automatic word alignments
5.5 Selecting whole candidates with a reranker
5.6 Creating hybrid candidates by merging alignments
5.7 Discussion

Chapter 6: Conclusion
6.1 Summary of key contributions
6.2 Future directions for these applications
6.3 Future challenges for parsing as a decoration on the word sequence

Bibliography


LIST OF FIGURES

2.1 A lexicalized phrase structure and the corresponding constituent and dependency trees
2.2 The models that contribute to ASR
2.3 Word alignment between e and f
2.4 The models that make up statistical machine translation systems
3.1 A SParseval example
3.2 System architecture at test time
3.3 n-best resegmentation using confusion networks
3.4 Oracle parse performance contours for different numbers of parses M and recognition hypotheses N on reference segmentations
3.5 SParseval performance for different feature and optimization conditions as a function of the size of the N-best list
4.1 Example dependency trees and their dlh decompositions
4.2 The dl and lh decompositions of the hypothesis tree in figure 4.1
4.3 An example headed constituent tree and the labeled dependency tree derived from it
4.4 Pearson's r for various feature tunings, with 95% confidence intervals. EDPM, BLEU and TER correlations are provided for comparison
5.1 A Chinese sentence and its translation, with reference alignments and alignments generated by unioned GIZA++
5.2 Examples of the four coherence classes
5.3 Decision trees for VP and IP spans
5.4 An example incoherent CP-over-IP
5.5 An example of clause-modifying adverb appearing inside a verb chain
5.6 An example of English ellipsis where Chinese repeats a word
5.7 Example of an NP-guided union



LIST OF TABLES

1.1 Two ASR hypotheses with the same WER
1.2 Word-sequences not considered to match by naïve word-sequence evaluation
3.1 Reranker feature descriptions
3.2 Switchboard data partitions
3.3 Segmentation conditions
3.4 Baseline and oracle WER reranking performance from N = 50 word sequence hypotheses and 1-best parse
3.5 Oracle SParseval (WER) reranking performance from N = 50 word sequence hypotheses and M = 1, 10, or 50 parses
3.6 Reranker feature combinations
3.7 WER on the evaluation set for different sentence segmentations and feature sets
3.8 Word error rate results comparing γ
3.9 Results under different segmentation conditions when optimizing for SParseval objective
4.1 Per-segment correlation with human fluency/adequacy judgements of different combination methods and decompositions
4.2 Per-segment correlation with human fluency/adequacy judgements of baselines and different decompositions. N = 1 parses used
4.3 Considering γ and N
4.4 Corpus statistics for the GALE 2.5 translation corpus
4.5 Per-document correlations of EDPM and others to HTER
4.6 Per-sentence, length-weighted correlations of EDPM and others to HTER, by genre and by source language
5.1 Four mutually exclusive coherence classes for a span s and its projected range
5.2 GALE Mandarin-English manually-aligned parallel corpora
5.3 The Mandarin-English parallel corpora used for alignment training
5.4 Alignment error rate, precision, and recall for automatic aligners
5.5 Coherence statistics over the spans delimited by comma classes
5.6 Coherence statistics over the spans delimited by certain syntactic non-terminals
5.7 Some reasons for IP incoherence
5.8 Reranking the candidates produced by a committee of aligners
5.9 Reranking the candidates produced by giza.union.NBEST
5.10 AER, precision and recall for the bg-precise alignment
5.11 AER, precision and recall over the entire test corpus, using various XP strategies to determine trusted spans


ACKNOWLEDGMENTS

My advisor, Mari Ostendorf, has been a reliable source of support, encouragement, and ideas through the process of this work. An amazingly busy and productive engineering professor, she welcomed me into the Signal Speech and Language Interpretation (SSLI) laboratory when I was looking only for summer employment — on the condition that I remain with her for at least another year. It was a good bargain: Mari's empirical, skeptical, practical approach to research has served as a model and inspiration, and I am proud every time I notice myself saying something Mari would have suggested. SSLI's home in Electrical Engineering (in a different college, let alone department, from Linguistics) has been a valuable source of perspective: working in the lab (and with the electrical engineers and computer scientists there) gives me the unusual privilege of being the "language guy" among the engineers and the "engineering guy" among the linguists.

My committee of readers was delightfully representative of the intersection between linguistics and computers. Paul Aoki represented practical translation and the use of computers for language teaching — and provided unstinting positive regard for me and my work. Emily Bender opened doors for me by opening a master's program in computational linguistics at the University of Washington just as I began, creating entire cohorts of professional NLP people just across Stevens Way. Fei Xia's perspectives on Chinese parsing and on statistical machine translation were welcome on every single revision.

Among my colleagues at SSLI, I would like to acknowledge Becky Bates, who adopted me as a "big sister" from my first day there, for her clear-eyed, mindful approach to engineering education and her grounded, open approach to the full experience of the world, even for those of us who — through practice or predisposition — spend a lot of time in our head and in the world of words. Dustin Hillard and Kevin Duh shared their enthusiasm and excitement for engineering and machine learning in application to language problems. Lee Damon kept the entire lab infrastructure running in the face of thousands of submitted jobs, many of which were mine. Bin and Wei tolerated both my questions about Chinese and my eagerly verbose explanations of some of the crookeder corners of the English language. Alex, Brian, Julie and Amittai were always game for engaging in a discussion about tactics and strategies for natural-language engineering graduate students, and I am pleased to leave my role as SSLI morale officer in their hands.

Across the road in Padelford, my colleagues and teachers in the Linguistics department have also been a pleasure. Beyond my committee members named already, I had the pleasure of guidance and welcome from Julia Herschensohn, the departmental chair, whose enthusiasm for an interdisciplinary computational linguist like me spared me a number of administrative ordeals, some of which I'll probably never know about (and I am grateful to Julia for that). Richard Wright and Alicia Beckford-Wassink were happy to let me be an "engineering guy" in a room full of empirical linguists. Fellow students Bill, David, and Scott reminded me from the very beginning that having spent time in industry does not disqualify one from still studying linguistics. Lesley, Darren, Julia, Amy, and Laurie remind me whenever I see them (which is often online rather than in person!) that linguistics can be fun, whichever corner of it you live in.

Over the last two years, I have had the privilege of being hosted at the Speech Technology and Research (STAR) laboratory at SRI International in Menlo Park, California. I began my study there as part of the DARPA GALE project, on which SSLI and the STAR lab collaborated. STAR director Kristin Precoda graciously allowed me to use office space and come to lab meetings, even after that project ended, while I finished my dissertation. Dimitra, Wen, Fazil, Jing, Murat, Luciana, Martin, Colleen and Harry, support staff Allan and Debra, and fellow SSLI alumni Arindam Mandal and Xin Lei also hosted and oriented me during my time at SRI. All of them have been pleasant hosts and supportive colleagues. I am doubly grateful that they tolerated my poor attempts at playing Colleen's guitar in the break room.

I have had fruitful and enjoyable collaborations with students and faculty beyond UW and SRI in my time in the UW graduate program: I am pleased to have explored interesting computational linguistics research with Matt Lease, Brian Roark, Mark Johnson (who was also my first syntax professor!), Mary Harper, and Matt Snover, among many others. I received support and software guidance from John DeNero, Chris Dyer and Eugene Charniak, again, among others. I am indebted to them all.

About a year before completing this dissertation, I began part-time work at Wordnik. I am grateful to Erin McKean for offering me employment thinking about words and language even while I finished this dissertation, and for allowing me to work less than half-time while I finished up the thesis. This was offered with far less grumbling than I deserved. I am lucky, too, to have intelligent, funny, talented co-workers there: Tony, John, Robert, Angela, Russ, Kumanan, Krishna and Mark continue to be a pleasure to work with and work for.

Of course, I had little chance to complete this work without support from an amazing troupe of supportive friends in many locations. Matt, Shannon, Kristina, Maryam, Lauren Neil, Ben, Trey, Rosie, and others have held out from the wild world of the Internet. In San Francisco, I am happy to have found community with Nancy, Heather, Jen, Susanna and Derek, all holding on for Wisdom and for my success. Jim and Fiona, William and Jo, Eldan and Melinda, Chris and Miriam, Alex and Kirk, Johns L and A, and many others support me with love and wisdom from Seattle. Finally, I am lucky to have been supported all along the way by my parents, Mickey and Henry; by my brother Daniel, and, most of all, by my wife Dorothy Lemoult, whom I met in Seattle in my second year of the program. Since the day we met, Dorothy has seen me as a better person than even I believed myself to be; to be the object of that kind of fierce love is the best way to be alive.

I have received funding for my work from the University of Washington, the National Science Foundation, SRI International, and the Defense Advanced Research Projects Agency.

Finally, a framing comment: I was supported in the process of creating this dissertation by a community that will undoubtedly be under-represented by any attempt to list everyone, especially this one. To all of you I've overlooked or omitted, please forgive me.



DEDICATION

For the pursuit of a life of love, play, and inquiry;
For my partner, my ally, my friend, my lover;
For what we have already and for what we make together;
For Dorothy.




Chapter 1 INTRODUCTION

Parsing, or extracting syntactic structure from text, is an appealing process to linguists studying the grammatical properties of natural language: parsing is an application of syntactic theory. For non-linguists, including many natural-language engineers, it is not necessarily of immediate practical use. Engineers and other users of language technology have generally found word sequences (as in writing) to be a more tractable input and output, and traditional evaluation measures for their tasks have not considered any linguistic structure beyond the word sequence in their design.

While some natural language applications have embraced parsing at their core (e.g. information extraction, which generally begins from parsed sentence structures), this dissertation applies parsers to two other domains: automatic speech recognition (ASR) and statistical machine translation (SMT). In evaluation, both of these natural-language processing tasks traditionally use measurements that evaluate using only matches of words or adjacent sequences of words (N-grams) against a reference (human-generated) output. In ASR, parsing features and scores have been explored for improved modeling of word sequences, but these approaches have not been widely adopted. Similarly, although a few SMT systems use a parse tree in parts of decoding, parse structures are also not widely adopted in SMT. For example, statistical word-alignment, a core internal technology for SMT, generally uses no parse information to hypothesize links between source- and target-language words.

This dissertation explores the incorporation of parsing into representations of language for natural language processing, particularly for components that have traditionally considered only the word sequence as input and output. This work takes two related approaches: exploring new opportunities to bring the information provided by a parser to bear within the traditional (syntactically-uninformed) approaches to these natural-language tasks, and exploring the construction of new, parser-informed automatic evaluation measures to guide the behavior of these systems in directions that lead to qualitative improvements in results, as judged by human assessors.

1.1 Evaluating the word sequence

This work focuses on two natural-language processing applications: speech recognition and machine translation. The output of speech recognition is a word sequence transcription hypothesis; the output of a machine translation system is a word sequence translation hypothesis. In each case, the usual approach to evaluation is to compare the transcript (or translation) hypothesis to a reference transcript (or translation). Using the undecorated word-sequence as an interface among natural language systems may sometimes introduce surprising behaviors in evaluation.

A word sequence is a very shallow representation of the linguistic structure of language. This representation is almost[1] completely devoid of theoretical baggage: no theoretical training is required for language users (or machines) to count over words and compare them for identity. Speech transcript quality, for example, is ordinarily measured by word error rate (WER), which is defined over hypothesis transcript h and reference transcript r as:

    WER(h; r) = (insertions(h; r) + deletions(h; r) + substitutions(h; r)) / length(r)        (1.1)

where insertion, deletion and substitution error counts are calculated through a Levenshtein alignment between reference and hypothesis that minimizes the total number of errors. Automated methods like WER facilitate the optimization and evaluation of natural language processing technology, because they can report the quality of a hypothesis without human intervention, given only a previously-generated reference. For these optimization and evaluation processes, though, the automatic measures should ideally be consistent with human judgements of quality.

Word-sequence evaluation measures, however, do not always match human judgements about quality. For example, they rarely have any notion of centrality: no aspect of WER captures the intuition that some words are more important to the sentence than others.
[1] Chinese and several other written languages do not separate words in text, but there is still high agreement among literate speakers about the character sequence. Character sequences, rather than word sequences, are thus usually used for evaluation in Chinese speech recognition.
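To make equation 1.1 concrete, here is a minimal sketch of WER computed with a word-level Levenshtein alignment. It is not part of the original dissertation; the function name, whitespace tokenization, and case normalization are illustrative assumptions rather than the scoring tool actually used in this work.

```python
def wer(hyp: str, ref: str) -> float:
    """Word error rate (equation 1.1): (insertions + deletions + substitutions) / length(ref)."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = minimum edit cost aligning the first i reference words to the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                              # unmatched reference words are deletions
    for j in range(len(h) + 1):
        d[0][j] = j                              # unmatched hypothesis words are insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# Hypotheses (a) and (b) from table 1.1 (case-normalized) both score 2/9 ≈ 0.22:
ref = "people used to arrange their whole schedules around those"
print(wer("people easter arrange their whole schedules around those", ref))
print(wer("people used to arrange their whole schedule and those", ref))
```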


Table 1.1: Two ASR hypotheses with the same WER.

                 Hypothesis                                                    WER
Reference        People used to arrange their whole schedules around those    —
(a)              people easter arrange their whole schedules around those     0.22
(b)              people used to arrange their whole schedule and those        0.22

Table 1.2: Word-sequences not considered to match by naïve word-sequence evaluation

The man saw the cat.             The cat was seen by the man.
The diplomat left the room.      The diplomat went out of the room.
He quickly left.                 He left quickly.
He warmed the soup.              He heated the soup.
optimize                         optimise
don't                            do not
because                          cuz

Table 1.1 considers two hypotheses that are projected to the same distance (WER = 0.22) by the WER metric. In table 1.1, hypothesis (a) and hypothesis (b) have equal WER, but (a)'s substitution is on a more central sequence (the main verb used to), while (b)'s word errors are on a grammatical affix (schedule instead of schedules) and an adjunct adverbial (around those). One indicator of the centrality of used to is that (b)'s substitution causes little adjustment to the overall structure of the sentence, whereas (a)'s substitution leaves (a) with no workable parse structure other than a fragment.

Conversely, table 1.2 presents some example word sequences that a human evaluator might reasonably consider equivalent (for some evaluation tasks), and which a naïve word-sequence evaluation would score as different. To capture any of these matches, the evaluation measure must be able to find a projection of the word sequences under which they may be found equivalent. The last two pairs in table 1.2 are usually handled by normalization tools, but the others are usually ignored: with the exception of contractions, case normalization and sometimes spelling normalization, most evaluations consider only exact matches over sections of the word-sequence, and treat all words as equally important.

Evaluation measures like WER (or extensions using N-grams) use only surface word identity and word adjacency in their measurements. These measures incorporate neither a notion of centrality nor one of argument structure, but individual words' roles in the meaning of a sentence are determined by their relationship to other, not necessarily adjacent, words. It is the central contention of this work that extending our measurements and evaluations of the word sequence to include a deeper representation of linguistic structure provides benefits to both linguistic and engineering approaches to natural language.

1.2 Using parse information within automatic language processing

The core theme of this work is the use of automatically-derived parse structure to improve the performance and evaluation of language-processing systems that have generally used only word-sequence level measures. Parse decorations on the word sequence can provide benefits to these systems in these two ways:

• parse decoration offers a new source of structural information within the models that go into these systems, providing features from which the models may derive more powerful hypothesis-choice criteria, and

• parse decoration enables new target measures, for use in system tuning and/or evaluation of the overall performance of a system.

Both of these techniques are used in this dissertation in ASR and SMT applications. For ASR systems, this work explores using parse structure for optimization towards both WER and SParseval (an evaluation measure for parses of speech transcription hypotheses). For SMT systems, this work explores using parse structure towards providing an evaluation measure that correlates better with human judgement and towards the optimization of an internal target (word-alignment).


Parse structure is not observable in transcripts or other easily-derived training data (outside of the relatively small domain of treebanks), which is one reason that parse information has not been widely adopted in some of these systems. Parser accuracy, especially on genres that do not match the parser's training data, may not be very good. This work adopts the approach that a parser's own confidence estimates may be used to avoid egregious blunders, by using expectations (confidence-weighted averages) over parser predictions. A common thread among the research directions presented here is thus the use of more than one parse-decoration hypothesis to provide structural information about the word sequence.

Previous work on applying grammatical structure to ASR systems has focused either on parsing a single hypothesis transcript (the parsing task) or on using a single hypothesis parse to select a transcript (the language-modeling task). By exploring the joint optimization of parse and transcript hypotheses (chapter 3), this work demonstrates the utility of each to the other. It frames the parse decoration as a source of structural features of the hypotheses, to be used in reranking hypotheses. In this approach, WER optimization is improved by including information from multiple parse hypotheses, and parse-metric optimization is improved by comparing multiple parse hypotheses over multiple transcript hypotheses. Because many NLP tasks explicitly use parsing or chunking, or have verb-dependent processing, the parse metric is often a better choice for word transcription associated with NLP tasks.

After considering parsing as an ASR objective, we turn to the incorporation of parse decoration in SMT tasks, beginning with SMT evaluation (chapter 4). SMT evaluation measures have traditionally used only word-sequence information (e.g., measuring the precision of n-grams against a reference translation). This work explores the use of parsing dependency structure to provide a syntactically-sensitive evaluation measure of the translation hypotheses. Parse structure, here, is represented as an expectation over dependency structure (using the multiple-parse-hypotheses approach suggested above), and this work demonstrates that evaluations informed by parse structure correlate more closely with human judgements of translation quality than the traditional (word-sequence based) metrics.


Previous work on applying parsers to SMT has focused mostly on parsing for reordering source-language text or within decoders. A third limb of the work presented here (chapter 5) explores the use of parsers in improving translation word-alignment (an internal component of SMT). In this approach, parse decoration is treated as labels on source-language spans, and this information is applied to selecting better machine translation word-alignments, an SMT task that generally uses only word-sequence information. In this work, we explore the coherence properties of the parse-annotated spans, finding some span classes that tend to be coherent, in the sense that a contiguous sequence of source-language words is not broken up in translation. This syntactic coherence is used to guide the combination of a precision-oriented and a recall-oriented automatic alignment.

By exploring the application of parse decoration to word sequences, this work offers several pieces of evidence for new directions in language-processing work. Word sequences are not always the best way to evaluate the performance of natural language processing systems; grammatical structure (from parsing) is in fact a useful source of information to these other natural-language processing systems, even when used as a component in evaluation (in machine translation). As part of those results, this work offers new reasons to use and improve work on syntactic parsers.

1.3 Overview of this work

The dissertation's structure is as follows: Chapter 2 covers the shared background material: statistical parsing, and schematic overviews of the operation of ASR and SMT systems. To accommodate the diversity of corpora and applications, some discussion of background material and related work is deferred to the appropriate chapter, rather than covering all background materials in chapter 2. Chapters 3–5 present the prior work, new methods and experimental results of each of the three applications explored in this thesis.

Chapter 3 applies parsing to automatic speech recognition on English conversational speech, and shows that information derived from parse structure offers improvements in WER. In addition, when the ASR/parsing pipeline is directed to target a parse-quality measure designed for speech transcripts, not only does the pipeline perform better on that measure but it selects qualitatively different word sequences, reflecting the effect of parse structure (and its evaluation) on speech recognition.


Chapter 4 proposes a new evaluation measure, Expected Dependency Pair Match (EDPM), for machine translation evaluation. EDPM is a measure of parse-structure similarity between hypothesis and reference translations. Experiments in this chapter correlating EDPM with human and human-derived judgments of translation quality show that EDPM surpasses popular word-sequence-based evaluation measures and is competitive with other newly-proposed metrics that rely on external knowledge sources.

Chapter 5 focuses on Chinese-English parallel-text word alignment, an internal component of machine translation that also traditionally ignores structural information. This chapter applies parsing to the Chinese side of the parallel text, and introduces translation coherence, which is a property of a source span and an alignment. The work in this chapter explores the utility of coherence in selecting good alignments, examines where those coherence measures break down, and shows that parse structure information is useful in selecting regions where two alignment candidates may be combined to improve alignment recall without hurting alignment precision.

Chapter 6 concludes with a summary of the key contributions of this thesis, which include both application advances and new understanding of general methods for leveraging parse decorations. It further suggests future directions of research, in which parse decoration may be applied in new ways to machine translation, speech recognition, and evaluation methods.



Chapter 2 BACKGROUND

This chapter provides an overview of the natural-language processing technologies that this dissertation rests upon. The next section (2.1) provides some background on statistical syntactic parsing and describes the statistical syntactic parsers in use in this work. The subsequent section (2.2) explains the framework for n-best list reranking used in several parts of this work. The following sections (2.3 and 2.4) describe the general framework of the two applications (speech recognition and statistical machine translation, respectively) to which this work applies those rerankers and parsers.

2.1 Statistical parsing

Statistical parsing serves as the method of word sequence decoration for all of the research proposed in this work. This section reviews the key decorations available from a statistical parser, considers the strengths and weaknesses of the probabilistic context-free grammar (PCFG) paradigm, and discusses the training and evaluation of such parsers.

2.1.1 Constituent and dependency parse trees

The parse decorations on word sequences used here include both dependency structure and hierarchical spans over word sequences. Hierarchical spans are known as constituent structures; in these trees, span labels nest to form a hierarchy (a tree) of constituent spans, and these spans are labeled with the phrase class (e.g. np or vp) that describes their content. The entire segment is labeled with a root span, which is usually coterminous with a single s spanning the sentence. A dependency structure, by contrast, labels each word with a single dependency link to its "head", with a label representing the nature of the dependency. One word (usually the main verb of the sentence) is dependent on a notional root node; all the other words in the sentence depend on other words in the sentence.


These two representations of grammatical structure may be reconciled in a lexicalized phrase representation, which marks one child subspan as the head child of each span. If head information is ignored, this representation is equivalent to the span label representation. The head word of each phrase-constituent φ is recursively defined as the head word of φ's head child or, if φ contains only one word, that word. A constituent structure is lexicalized when each constituent is additionally annotated with its head word; one may read either constituent spans or dependency structures off of these lexicalized constituent structures. Figure 2.1 shows a lexicalized constituent structure and the dependency tree and constituent tree that may be derived from it. The arc labels on the dependency structure shown in figure 2.1 are derived by extraction from the headed phrase structure by concatenating two labels A/B: A is the lowest constituent dominating both the dependent and the headword, and B is the highest constituent dominating the dependent. This approach for arc-labeling works well for a language like English (or Chinese) with relatively fixed word order.

2.1.2 Generating parse hypotheses

To provide parse decoration, we desire a parser/decorator which generates n-best lists of parse hypotheses over input sentences.[1] Such an n-best list may be useful in reranking the parse hypotheses (see section 2.2 below) or in other applications which benefit from access to the confidence of the parser. We require that the parsers used to generate these n-best lists are adaptable to new domains, robust, and probabilistic. Retrainable parsers are desired because the domain over which this work predicts parses varies widely with the task: parse structures over speech, as in chapter 3, are qualitatively different from parse structures over edited text (e.g. the news-text translation in chapter 5). Robustness, the reliable generation of predictions for any input word sequence, is desirable because the parser must distinguish among machine-generated word sequences (the output of ASR and SMT), which are not always well-formed, due either to recognizer errors or to speaker differences.
[1] Packed parse forests, the combined representation of the parser search space used by e.g. Huang [2008], represent a speedy and sometimes elegant alternative to n-best lists, but are constrained by the forest-packing to use only those features that may be computed locally in the tree. This work uses n-best lists instead for their easy combination and for freedom from the tree-locality constraint.


[Figure 2.1 here: the lexicalized constituent tree, constituent tree, and labeled dependency tree for the example sentence "I was personally acquainted with the people".]

Figure 2.1: A lexicalized phrase structure and the corresponding constituent and dependency trees. Dashed arrows indicate the upward propagation of head words to head phrases. The lexicalized constituent tree encodes both the constituency tree and the dependency relations. The dependency tree may be understood as the link to the headword of the governing constituent.


Since we use the parser to predict fine-grained information to make decisions about the word sequences, the ability to generate parse structure over all (or nearly all) the candidate inputs is important. Probabilistic scoring is required not only to predict the order of the n-best list, but to compute the relative contribution of each parse hypothesis to the n-best list. All else being equal, preferred parsers are also fast.

While unification grammars, e.g. head-driven phrase structure grammar [HPSG, e.g. Pollard and Sag, 1994] and lexical functional grammar [LFG, e.g. Bresnan, 2001], produce complex and linguistically-informed parse structures that also may be interpreted as headed phrase grammars, existing grammars in these formalisms do not reflect a match to a training set, nor do they have complete coverage (for out-of-domain or ill-formed word sequences, they often produce no structure at all). Most problematic for the research explored here is that state-of-the-art unification grammars [e.g., Flickinger, 2002, Cahill et al., 2004] do not provide parse N-best lists with the probability of each parse in the list, which is used in some of our work for taking expectations over parse alternatives.

Instead of a unification grammar like the ones above, this work uses statistical probabilistic context-free grammar (PCFG) parsers. These sorts of parsers (e.g. Collins [2003], Charniak [2000], and Petrov and Klein [2007]) use lexical and span-label information from a training set of hand-labeled trees known as a treebank, e.g. the Penn Treebank of English [Marcus et al., 1993], and construct syntactic structures on new sentences (in the same language) consistent with the grammar inferred from these training sentences. Because they are probabilistic, these parsers may return not only a "best" parse analysis according to their model, but also a list of n analyses reflecting the n-best parse structures that the parser (and its grammar) assigns to the input sentence. Each carries a probabilistic weight p(t, w) of the likelihood of a tree t with leaves w. The PCFG estimation makes the context-free assumption: that the probability of generating the tree is composed of a combination of probability estimates from tree-local decisions. By constraining the model to use only tree-local decisions, PCFG models may use dynamic-programming techniques to efficiently search a very large space of possible tree structures.
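Because every analysis on such an n-best list carries a weight p(t, w), the list can be renormalized into posterior weights, and downstream quantities can then be computed as expectations over the list rather than from the single best tree (the strategy relied on in later chapters). The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and per-parse log probabilities are assumed as input.

```python
import math

def parse_posteriors(log_probs):
    """Turn per-parse log scores from an n-best list into normalized posterior weights."""
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]     # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def expected_count(nbest, feature_fn):
    """Confidence-weighted (expected) feature count over an n-best list of
    (log_prob, tree) pairs; feature_fn(tree) returns the count for one tree."""
    weights = parse_posteriors([lp for lp, _ in nbest])
    return sum(w * feature_fn(tree) for w, (_, tree) in zip(weights, nbest))
```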


2.1.3 Treebanks for the PCFG-derived parser

Parsers of this nature are constrained by the availability and structure of treebanks (from which to learn a grammar). The Penn treebank [Marcus et al., 1993], for example, encodes span labels over a collection of edited English text (mostly the Wall Street Journal); the availability of this labeled set has enabled the development of the statistically-trained parsers for English mentioned above. Recent work to construct treebanks in languages other than English, e.g. in Chinese [Xue et al., 2002], and in domains other than edited English text [e.g., Switchboard telephone speech: Godfrey et al., 1992] has made these parsers much more broadly accessible for use in applications with broader focus than parsing itself. In particular, Huang and Harper [2009] have built a parser tuned for certain genres of Mandarin Chinese. Certain aspects of this research depend on the power of this parser to handle Mandarin news text, despite the relative lack of data (compared to English).

Though these parsers do not explicitly include head structure in their output (to match the treebanks on which they were trained), all of the state-of-the-art PCFG parsers infer head structure internally, most using Magerman [1995]-style context-free head-finding rules. Recovering the head structure from their output (also using Magerman [1995]-style head-finding) is fast and deterministic, and allows for an easy conversion, when dependency structure is called for, from treebank-style span trees to headed span trees and thence to dependency structure.
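As a concrete (if toy) illustration of that conversion, the sketch below applies a tiny fragment of Magerman-style head rules to a bracketed tree and reads off word-to-word dependencies. The rule table, tree encoding, and function names are assumptions for exposition only, not the rules or code used in this work.

```python
# Trees are (label, [children...]) for phrases and (tag, word) for preterminals.
HEAD_RULES = {"S": ["VP", "S"], "VP": ["VBD", "VBN", "VB", "VP"],
              "NP": ["NN", "NNS", "NP"], "PP": ["IN"]}   # toy fragment only

def head_child(tree):
    label, children = tree
    for cat in HEAD_RULES.get(label, []):     # first matching category wins
        for child in children:
            if child[0] == cat:
                return child
    return children[-1]                       # fallback: rightmost child

def head_word(tree):
    label, children = tree
    if isinstance(children, str):             # preterminal: (tag, word)
        return children
    return head_word(head_child(tree))

def dependencies(tree, deps=None):
    """(dependent head word, governing head word) pairs: one per non-head child."""
    deps = [] if deps is None else deps
    label, children = tree
    if isinstance(children, str):
        return deps
    hc = head_child(tree)
    for child in children:
        if child is not hc:
            deps.append((head_word(child), head_word(tree)))
        dependencies(child, deps)
    return deps

tree = ("S", [("NP", [("PRP", "I")]),
              ("VP", [("VBD", "was"), ("ADJP", [("VBN", "acquainted")])])])
print(dependencies(tree))                     # [('I', 'was'), ('acquainted', 'was')]
```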

2.1.4 Intrinsic evaluation of statistical parsing

Statistical parsers are usually evaluated by comparing the hypothesized parse t_hyp to a reference parse t_ref. The standard test uses parseval [Black et al., 1991], an F-measure over span-precision and span-recall, which was developed for comparison to the Wall Street Journal treebank [Marcus et al., 1993]. The parseval technique assumes that the hypothesis proposed shares the same word sequence; that is, parseval is only well-defined when w_hyp = w_ref and the basic (sentence) segmentation agrees. If the division of those word sequences into segments differs, parseval is not well-defined.
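For reference, the bracket-matching arithmetic behind parseval can be sketched as follows, assuming the labeled constituent spans have already been read off each tree as (label, start, end) triples. This is an illustration only, not the evaluation software used in this work.

```python
from collections import Counter

def parseval_f1(hyp_spans, ref_spans):
    """F-measure over labeled constituent spans, parseval-style.
    Spans are (label, start, end) triples; duplicates are counted as a multiset."""
    hyp, ref = Counter(hyp_spans), Counter(ref_spans)
    matched = sum((hyp & ref).values())          # multiset intersection of matching spans
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```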


Kahn et al. [2004] addressed this on reference transcripts of conversational speech by concatenating all reference segment transcriptions from a single conversation side and computing an F-measure based on error counts at the level of the conversation side. In speech applications, however, it is not reasonable to assume that the reference transcript is available to the parser, so scoring must compare (t_hyp, w_hyp) to (t_ref, w_ref) instead. In this situation, when comparing parses over hypothesized speech sequences, parse quality may instead be measured using SParseval [Roark et al., 2006], which computes an F-measure over syntactic dependency pairs derived from the trees.

Research on parse quality over transcription hypotheses, however, has been very limited. It has largely been restricted to parsing only the ASR engine's best hypothesis, e.g., Harper et al. [2005], which sought to improve the automatic segmentation of ASR transcripts into utterance-level segments. Approaches like this one that use only one ASR hypothesis ignore the potential of more parse-compatible alternative transcription hypotheses available from the ASR engine. Further discussion of the SParseval measure over speech is included in the background section of chapter 3, which explores parsing speech and speech transcripts.

2.2 Reranking n-best lists

As discussed above, PCFG parsers make strong assumptions about locality in order to efficiently explore the very large space of possible trees. However, these independence assumptions also prevent the use of feature extraction that crosses that locality boundary. For example, the relative proportion of noun phrases to verb phrases may be a useful discriminator among good and bad trees, but this statistic is not computable within the context-free locality assumptions that go into the parser itself.

An approach to dealing with this challenge is to first generate an n-best list of top-ranking candidate hypotheses, and then apply discriminative reranking [Collins, 2000] to re-score the set of candidates (incorporating the original scores as one of the features). The features available to n-best reranking need not obey the locality assumptions that were used in generating the candidate list in the first place: rather, the features may be holistic because they are computed exhaustively (against every member of the n-best list) since n is much smaller than the original search space.


Collins and Koo [2005] and Charniak and Johnson [2005] use this approach to achieve roughly 13% improvements in parseval F performance on parsing Wall Street Journal text.

2.2.1 Reranking as a general tool

Reranking is of general use, and has been applied elsewhere before being applied to parsing. In ASR, for example, it was applied to transcription n-best lists to lower word error rate long before its use in parsing [e.g., Kannan et al., 1992], and Shen et al. [2004] introduce the use of discriminative reranking in SMT work.

Discriminative reranking is a form of discriminative learning, which seeks to minimize the cost function of the top hypotheses. Unlike generative models, which learn their parameters from counting occurrences in training data, n-best rerankers must be trained on hypotheses with explicit evaluation metrics attached. Reranking has one important extension from the general case of discriminative learning: in reranking, the ranker must learn which features separate the optimal candidates from the suboptimal ones by comparing elements only within an n-best list, rather than pooling all positive and negative examples to seek a margin. One way to do this (discussed below in section 2.2.2) is to divide a candidate pool into ranks and attempt to separate each rank from the other ranks. In parsing, for example, it is the relative difference in (e.g.) prepositional phrase count among candidate parses that is used in reranking, not the absolute count; candidate parse trees must be compared to other candidates derived from the same n-best list.

The generative component produces overly optimistic n-best lists over its training data, so in order to provide reranker training with realistic N-best lists from which to learn weights, the reranker needs to be trained using candidate parses from a data set that is independent of both the generative component's training and the evaluation test set. Because of the limited amount of hand-annotated training and evaluation data, it is not always preferable to sequester a separate training partition just for this model. Instead, one may adopt the round-robin procedure described in Collins and Koo [2005]: build N leave-n-out generative models, each trained on (N − 1)/N of the partitioned training set, and run each on the subset that it has not been trained on. The resulting candidate sets are passed to the feature-extraction component and the resulting vectors (and their objective function values) are used to train the reranker models.
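A minimal sketch of that round-robin procedure follows. The functions `train_fn` and `decode_fn` are hypothetical stand-ins for whatever generative component (parser or recognizer) is being cross-decoded, and the striped fold split is an arbitrary illustrative choice, not the partitioning actually used in this work.

```python
def round_robin_candidates(segments, n_folds, train_fn, decode_fn):
    """Generate realistic n-best candidate lists for reranker training:
    each fold is decoded by a model that was never trained on it
    (the round-robin procedure described by Collins and Koo [2005])."""
    folds = [segments[i::n_folds] for i in range(n_folds)]
    candidates = []
    for i, held_out in enumerate(folds):
        training = [seg for j, fold in enumerate(folds) if j != i for seg in fold]
        model = train_fn(training)                     # generative model trained on (N-1)/N of the data
        candidates += [(seg, decode_fn(model, seg)) for seg in held_out]
    return candidates
```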

2.2.2 Reranker strategy used in this work

Within this work, n-best list reranking is treated as a rank-learning margin problem: within each segment, the task is to separate the best candidate from the other candidate hypotheses. We adopt the svm-rank toolkit [Joachims, 2006] as our reranking tool. To prepare data for training this toolkit, the approach adopted here selects the oracle best score on the objective function, φ*, from the n-best list and converts the objective function into an objective loss with regard to the oracle for all hypotheses ti, e.g., φl(ti) = φ* − φp(ti) for the parseval objective φp. To interpret φl as a rank function, we assign ranks to training candidates that focus on those distinctions near the optimal candidate, as follows:

    rank(ti) = 1  if φl(ti) ≤ ε
               2  if ε < φl(ti) ≤ 2ε        (2.1)
               3  if 2ε < φl(ti)

where ε is a small value tuned empirically so that ranks 1 and 2 have a small proportion of the total number of members in the candidate set. Since svm-rank uses all pairwise comparisons between candidates of different rank, and ranks 1 and 2 have very few members, this approach reduces the number of comparisons from quadratic in |C| to linear in |C| (where |C| represents the number of candidates in the set) while still focusing the margin costs towards the best candidates.
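Spelled out in code, equation 2.1 is a two-threshold bucketing of the per-candidate losses. The sketch below is illustrative only: the function name and the plain-list output are assumptions, ε is simply passed in, and writing the actual svm-rank input format is omitted.

```python
def assign_ranks(losses, eps):
    """Assign training ranks per equation 2.1, given per-candidate objective
    losses phi_l(t_i) = phi* - phi_p(t_i) computed against the oracle."""
    ranks = []
    for loss in losses:
        if loss <= eps:
            ranks.append(1)        # at (or essentially at) the oracle candidate
        elif loss <= 2 * eps:
            ranks.append(2)        # nearly optimal
        else:
            ranks.append(3)        # the bulk of the n-best list
    return ranks

# e.g. with eps = 0.5 points of the objective measure:
print(assign_ranks([0.0, 0.3, 0.8, 4.2, 9.1], eps=0.5))   # [1, 1, 2, 3, 3]
```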


Figure 2.2: The models that contribute to ASR.

2.3 Automatic speech recognition

Automatic speech recognition (ASR) is the process of automatically creating a transcript word sequence w1 . . . wn from recorded speech waveform α. The literature in this discipline is enormous, and the survey here skims only the surface, to orient the reader to the basic models in play in state-of-the-art systems in ASR and to provide context for the contributions of this work.

2.3.1 A schematic summary of ASR

Speech recognition systems are constructed from multiple models. As illustrated in figure 2.2, the usual expression of these models (e.g. in the SRI large-vocabulary speech recognizer [Stolcke et al., 2006]) is as a combination of multiple generative models, which operate together to score possible hypotheses that are pruned down to a list of the top n word sequence hypotheses. In large-vocabulary systems, the resulting list is typically re-scored by discriminative components that reorder that list. Among the generative models, acoustic models p_am(α|φ) provide a score of acoustic features of speech α (typically cepstral vectors) given pronunciation φ;


pronunciation models provide a score p_pm(φ|w) of pronunciation-representation φ given word w; and language models (LMs, e.g. Stolcke [2002]; see Goodman [2001]) give a score p_lm(w_1, ..., w_n) of the word sequence w_1, ..., w_n. In decoding, all three of the models described above operate on a relatively small local window: p_am(·) uses phone-level contexts, p_pm(·) uses the word in isolation or with its immediate neighbors, and p_lm(·) most often uses n-gram Markov assumptions, computing word sequence likelihoods from only the most-recent n − 1 words. The most typical value for n is three, also known as a "trigram" model, and n rarely exceeds four or five, due to the computational explosion in storage costs required.

The rescoring component F(α, φ, w_1, ..., w_n), by contrast, may use all of the above scores and also extracts additional features of an utterance- or sentence-length hypothesis from any of the values mentioned above for use in re-ordering the n-best list. Even without the feature-extraction F(·), the rescoring component may change the relative weight of the contribution of the upstream models, but F(·) is often used to extract long-distance (non-local) features that would be expensive or impossible to extract in the local-context decoding that the other models provide.

An exhaustive survey of prior work using reranking to capture non-local information in ASR is impractical, but the sorts of long-distance information exploited include topic information, as in Iyer et al. [1994] or more recently Naptali et al. [2010], or trigger information [Singh-Miller and Collins, 2007]. These model long-distance effects from as far away as other sentences (or speakers!) in the same discourse, not with a syntactic model but with various approaches that cue the activation of a different vocabulary subset. Another application of reranking operates by adjusting the output of the generative model to focus on the specific error measure, as in e.g. Roark et al. [2007]. Further discussion of the use of syntactic information in language-model rescoring may be found in section 2.3.3.
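As a schematic of how such a rescoring stage can combine these scores, the sketch below reorders an n-best list with a weighted combination of per-hypothesis log-scores. The dictionary layout, score names, and weight values are invented for illustration and do not correspond to the SRI recognizer's actual interfaces.

```python
def rescore_nbest(nbest, weights):
    """Reorder an n-best list by a weighted combination of per-hypothesis log-scores
    (acoustic, language model, and any sentence-level features extracted by F(.))."""
    def combined(hyp):
        return sum(weights[name] * score for name, score in hyp["scores"].items())
    return sorted(nbest, key=combined, reverse=True)

nbest = [
    {"words": "people used to arrange their whole schedules around those",
     "scores": {"am": -120.3, "lm": -35.1, "parse": -4.2}},
    {"words": "people easter arrange their whole schedules around those",
     "scores": {"am": -119.8, "lm": -37.9, "parse": -9.6}},
]
# The LM and parse-based scores outweigh the slightly better acoustic score of the second hypothesis:
best = rescore_nbest(nbest, {"am": 1.0, "lm": 0.9, "parse": 0.5})[0]["words"]
```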

2.3.2 Evaluation of ASR

Evaluation — and optimization — of speech recognition and its components are carried out with word error rate (WER), a measure that treats words (or characters) equally, regardless of their potential impact on a downstream application, as discussed in section 1.1; for example, function words are given equal weight with content words.


One exception is that filled pauses are, in some evaluations, e.g. GALE [DARPA, 2008], optionally inserted or deleted without cost when evaluating speech.

A few larger projects that include ASR as a component have suggested extrinsic evaluation methods: in dialog systems, for example, ASR performance is evaluated along with the other components with a measure of action accuracy (e.g. in Walker et al. [1997] and Lamel et al. [2000]). In the 2005 Summer Workshop on Parsing Speech [Harper et al., 2005], speech recognition was evaluated in the extrinsic context of a downstream parser, but only a single transcription hypothesis was used. Al-Onaizan and Mangu [2007] explored adjustments to ASR hypothesis selection in an ASR-to-MT pipeline to allow relatively more insertions (keeping the WER constant), but found that this made little difference in automatically-evaluated MT performance.

As an alternative to evaluating ASR with WER or evaluating it directly in the context of a downstream task, one may instead choose to optimize the ASR towards an improved form of some intermediate representation (neither the immediate word sequence nor a fully-extrinsic representation). Hillard et al. [2008], for example, experimented with selecting for high-SParseval Chinese character sequences for a downstream Chinese-to-English SMT system (instead of selecting low character error rate (CER) hypotheses). In follow-up work, Hillard [2008] found improvement on the automatic SMT measures for unstructured (broadcast conversation) genres of speech, though not for structured speech (broadcast news). Additionally, they found that SParseval measurements of source-language transcription were better correlated with human assessment of MT performance in the target language than CER measurements. Intrinsic measures for ASR, however, are almost entirely limited to WER or its simpler alternative for Chinese, CER.

Chapter 3, which uses parse decoration to rerank ASR transcription hypotheses, evaluates ASR with WER and also with the SParseval parse-quality measure.


2.3.3 Parsing in ASR

Efforts to include parsing information in ASR systems have used the parser as an extra information source for selecting word sequences in speech recognition. This section highlights a few parser-based language models and reranking models that have been used in ASR, demonstrating improvements in both perplexity and WER over n-gram LM baselines.

The structured language model [Chelba and Jelinek, 2000] is a shift-reduce parser that conditions probabilities on preceding headwords. When interpolated with an n-gram model, it achieved small improvements in WER on read speech from the Wall Street Journal corpus and on conversational telephone speech from the Switchboard [Godfrey et al., 1992] corpus. The top-down PCFG parser used by Roark [2001] achieved somewhat larger improvements over a trigram on the same set (though the baseline it was compared to was worse than the baseline in Chelba and Jelinek [2000]). Charniak [2001] implemented a top-down PCFG parser that conditions probabilities on the labels and lexical heads of a constituent and its parent. In this model, the probability of a parse is modeled as the product of the conditional distributions of various structural factors. In contrast to both the models in Chelba and Jelinek [2000] and Roark [2001], most of these factors are conditioned on the identity of at most one other lexical item in the tree. This relative reliance on structure (over lexical identity) makes this model distinctly un-trigram-like. This model achieves lower perplexity than both the structured language model and Roark’s model on Wall Street Journal treebank text.

While the details of the parsing algorithms and probability models of the above models vary, all are fundamentally some kind of PCFG. A non-CFG syntactic language model that has been used for speech recognition is the SuperARV model, or “almost parsing language model” [Wang and Harper, 2002], which calculates the joint probability of a string of words and their corresponding super abstract role values. These values are tags containing part-of-speech, semantic and syntactic information. The SuperARV achieved better perplexity and WER results than both a baseline trigram and the Chelba and Jelinek [2000] and Roark [2001] language models, for a variety of read Wall Street Journal corpora. It also out-performed a state-of-the-art 4-gram interpolated word- and class-based language model on the DARPA RT-02 conversational telephone speech evaluation data [Wang et al., 2004]. Filimonov and Harper [2009] introduce a generalization and extension of the SuperARV tagging model in a joint language modeling framework that uses very large sets of “tags”; when automatically-induced syntactic information is included in the tag set, it is competitive with SuperARV performance on both perplexity and WER measures but requires less complex linguistic knowledge.

One challenge for combining parsing with ASR is that parsing is ordinarily performed over well-formed, complete sentences, while automatic segmentation of ASR output is difficult, especially in conversational speech (where even a correct segmentation may not be a syntactically well-formed sentence). Parse models of language do not perform as well on poorly-segmented text [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005]. In chapter 3, this work goes into more depth regarding the impact of different methods of automatic segmentation on the utility of parse decoration and the success of parsing.

2.4 Statistical machine translation

Statistical machine translation (SMT) is the process of automatically creating a target-language word sequence e1 . . . eE from a source-language word sequence f1 . . . fF . There are non-statistical approaches to this task, e.g. the LOGON Norwegian-English MT project [Lønning et al., 2004], but these are not the subject of this research. This section offers an overview of the state of the art in statistical machine translation, identifying the core models and techniques that are used, the mechanisms for automatic evaluation, and where syntactic structures are already in use.

2.4.1 A schematic summary of SMT

In SMT based on the IBM models [Brown et al., 1990] and their successors, candidate translations are understood to be made up of the source words f, the target words e, and also the alignment a between source and target words. The contributing components are broken down in a noisy-channel model: a language model plm (e) scores the quality of the target word sequence; a reordering model pr (a|e) assigns a penalty for the “reordering” performed by the alignment; and the translation model ptm (f |e, a) provides a score for pairing source-language words (or word-groups) f with target-language words (or word-groups) e according to alignment a. This approach formulates the translation decoding process as a search over words (e) and alignments (a), which is typically approximated as:

$$\operatorname*{argmax}_{e} p(e \mid a, f) \;\sim\; \operatorname*{argmax}_{e} p_{tm}(f \mid e, a)\, p_{rm}(a \mid e)\, p_{lm}(e) \qquad (2.2)$$

Figure 2.3: Word alignment between e and f . Each alignment link in a represents a correspondence between one word in e and one word in f . There is no guarantee that e and f are the same length (E = F ).
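To make the noisy-channel factorization in equation 2.2 concrete, the sketch below scores candidate (e, a) pairs by summing the three model scores in the log domain and returns the highest-scoring target sentence; the model callables and the toy scores are stand-ins for trained translation, reordering, and language models, not part of any system described here.

```python
def decode(candidates, log_p_tm, log_p_rm, log_p_lm):
    """Pick the best translation according to equation 2.2.

    candidates: iterable of (e, a) pairs, where e is a target word
                sequence and a is an alignment to the source sentence f.
    The three model arguments are callables returning log-probabilities.
    """
    def score(pair):
        e, a = pair
        return log_p_tm(e, a) + log_p_rm(e, a) + log_p_lm(e)

    best_e, _best_a = max(candidates, key=score)
    return best_e

# Toy usage: two candidate translations of a fixed source sentence,
# with made-up log-scores standing in for real models.
cands = [(["i", "do", "not", "like", "blue", "cheese"], [(0, 0)]),
         (["i", "not", "like", "cheese", "blue"], [(0, 0)])]
scores = {tuple(e): s for (e, _a), s in zip(cands, (-12.0, -19.5))}
print(decode(cands,
             log_p_tm=lambda e, a: scores[tuple(e)],
             log_p_rm=lambda e, a: 0.0,
             log_p_lm=lambda e: 0.0))
```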

Most current approaches to decoding do not actually use this generative model, but instead a weighted combination of multiple ptm (·) translation models including both ptm (f |e, a) and ptm−1 (e|f, a), which lack the well-formed noisy-channel generative structure of equation 2.2 above but seem to work better in practice [Och, 2003].

Training the ptm (·) and prm (·) models requires many parallel sentences with alignments between source and target words, of the form suggested in figure 2.3. Alignments, like parse structure, are rarely annotated over large amounts of parallel text. The approach offered by the IBM models and their descendants is to bootstrap alignment and translation models from bitexts (corpora of parallel sentences). In general, the training of these alignment and translation models is iterated in a bootstrap process. This bootstrap process, as implemented in popular word-alignment tools [e.g. GIZA++: Och and Ney, 2003], begins with simple, tractable models for prm (·) and ptm (·) and, as the models improve, trains more sophisticated reordering and translation models. Later models are initialized from the alignments hypothesized by earlier iterations. The language model plm (e) does not participate in this phase of the training: in a bitext, predicting plm (e) is not helpful; language models are usually trained separately, using monolingual text. As a byproduct of the parameter-search to improve these models, the GIZA++ toolkit produces a best alignment linking each word in e to words in f . Other tools exist for generating alignments (such as the Berkeley aligner [DeNero and Klein, 2007]) and there is substantial discussion over how to evaluate and improve the quality of these alignments. Review of this discussion is passed over here; we will return to this literature in chapter 5.

Typical independence assumptions in the word-alignment models constrain them to word sequence and adjacency, applying a penalty for moving words into a different order in translation. These models for reordering penalties are usually very simple, and do not incorporate any notion of parse decoration — instead, they assign monotonically-increasing penalties for moving words in translation. For example, Vogel et al. [1996] use a hidden Markov model (HMM, derived from only sequence information) to assign a prm (·) reordering model. Language models in translation are also generally sequence-driven: ASR’s basic n-gram language-modeling approach serves as an excellent baseline to model plm (e) in MT work. Early stages of the training bootstrap sometimes ignore even word sequence information: GIZA++’s “Model 1” treats prm (·) as uniform and ptm (·) as independent of adjacency information (dependent only on the alignment links themselves).

For language-pairs like French-English, where word-order is largely similar, the local-movement penalties of these simple prm (·) models usefully constrain the search space of possible translations to those without large re-ordering: the language- and translation-model scores will correctly handle any necessary small, local reorderings. For other language-pairs (e.g., Chinese-English or Arabic-English), though, long-distance re-orderings are necessary, and these models must assign only a small penalty to long-distance movement, which leads to an explosion in the search space (and a corresponding loss in translation quality).

Having bootstrapped from bitext to word-based alignments, many SMT systems (e.g. Pharaoh [Koehn et al., 2003] and its open-source successor Moses [Koehn et al., 2007]) take the bootstrapping farther by automatically extracting a “phrase table” from the aligned text. These “phrase-based”2 systems treat aligned chunks of adjacent words as a sort of translation memory (the “phrase table”) which incorporates local reordering and context-aware translations into the translation model. Entries in the phrase table are discovered from the aligned bitext by heuristic selection of observed aligned spans. For some “phrase-based” systems, such as Hiero [Chiang, 2005], the span-discovery (and decoding) may even allow nested “phrases” with discontinuities.

Statistical machine translation systems thus, like ASR, use multiple models which contribute together to generate (or “decode”) a scored list of possible hypotheses, as suggested in the top half of figure 2.4.

Figure 2.4: The models that make up statistical machine translation systems

The “phrase table” incorporates some aspects of alignment and translation models, but even when phrase tables are quite sophisticated, choosing and assembling these phrases at run-time usually requires additional translation and alignment models, even if only to assign appropriate penalties to the assembly of phrase-table entries. The n-best list generated by the decoder is typically re-scored using a discriminative re-ranking component, as outlined in figure 2.4, that takes into account the language-model, translation-model, alignment-model and phrase-table scores already mentioned, and may also incorporate additional features F (a, e, f ) that are difficult to include in the decoding process that generates the original n translation hypotheses. The re-ranking component relies on an automatic measure of translation quality which is computable without human intervention for a given hypothesis translation and one or more reference translations.

2 “Phrase-based” SMT systems use the term “phrase” to refer to a sequence of adjacent words; these do not have any guarantee of relating to a syntactic or otherwise linguistically-recognizable phrase. Xia and McCord [2004] use “chunk-based” to refer to these systems, but the expression has not been widely adopted. This work uses “phrase-based” for consistency with the literature (which describes “phrase-based” SMT and “phrase tables”), despite the infelicity of the expression.

2.4.2 Evaluation measures for MT

The development of reliable automatic measures for optimization has changed the field of statistical machine translation, by allowing the discriminative training of rescoring and reweighting models, such as minimum error rate training [MERT: Och, 2003], and by providing a shared measure of success.

In MT, evaluation is a complex process, in large part because two (human) translators asked to perform the same translation task may quite ordinarily produce very different resulting strings. The challenge of accounting for allowable variability is not shared with ASR; in ASR, two human transcribers will usually agree on most of the transcription.

Instead of a string match to a reference translation, human-assessed measures of translation quality are traditionally broken into separate scales of fluency and adequacy to assess system quality (whether translations are performed by human or machine) [King, 1996, LDC, 2005]. Of course, fluency and adequacy judgements cannot be performed without a human evaluator.3 Comparing system translations to reference translations allows monolingual assessors, which reduces the cost by increasing the available pool of assessors. In many evaluations, automatic measures compare automatic translations to these reference translations; these automatic measures have the virtue of removing annotator variability from the evaluation and further reducing the labor costs of assessing the system translations. For optimization purposes (such as the MERT models and discriminative re-ranking described above), a measure that operates without human intervention is required, because the rescoring models operate over hundreds (or thousands!) of sample translations of the same sentence.

3 One might think that fluency and adequacy judgements require a bilingual evaluator as well, but for evaluating MT quality, a monolingual (in the target language) evaluator can compare machine and reference translations of the same text to report these judgements.

The two most popular automatic metrics are BLEU [Papineni et al., 2002], a measure of n-gram precision, and the TER [Snover et al., 2006] edit distance. BLEU remains the most popular and widely-reported measure of translation quality against a reference translation (or set of reference translations); it is a geometric mean of precisions over varying n-gram lengths:

$$\mathrm{BLEU}_n(h; r) = \sqrt[n]{\prod_{i=1}^{n} \pi_i(h; r)} \;\cdot\; \mathrm{BP}(h, r) \qquad (2.3)$$

where πi (h; r) reflects the precision of the i-grams in hypothesis h with respect to reference r, and the term BP(h, r) is a “brevity penalty” to discourage the production of extremely short (low-recall, high-precision) translations:

$$\mathrm{BP}(h, r) = \begin{cases} \exp\left(1 - \dfrac{|r|}{|h|}\right) & \text{if } |h| < |r| \\ 1 & \text{if } |h| \ge |r| \end{cases}$$

Most results are reported with BLEU4 . Translation Edit Rate (TER) is an error measure like WER, which measures the operations required to transform hypothesis h into reference r:

$$\mathrm{TER}(h; r) = \frac{\mathrm{insertions}(h; r) + \mathrm{deletions}(h; r) + \mathrm{substitutions}(h; r) + \mathrm{shifts}(h; r)}{\mathrm{length}(r)} \qquad (2.4)$$

where insertions, deletions and substitutions count one per word, while shift operations move any adjacent sequence of words from one position in h to another. Insertion, deletion, substitution and shift error counts are calculated through an alignment between reference and hypothesis that heuristically minimizes the total number of operations needed. When working with multiple references, BLEU4 is defined so that its n-grams may match those in any of the references, allowing translation variability across the multiple references, but TER’s approach to multiple references is simply to return the minimum edit ratio over the set of references, which is less forgiving to the candidate translation.

Like word error rate for ASR, the BLEU and TER metrics use no syntactic or argument-structure modeling to determine which words matter more: all words are treated equally. In TER, substituting or shifting a single word incurs the same cost regardless of where the substitution or shift happens; in BLEU, all hypothesis n-grams contribute equally to the score of the sentence. Because of the emphasis on these automatic measures, innovations in MT have often been evaluated by their effect on these measures directly, sometimes to the point of reporting only one of these entirely automatic measures. Some have raised skepticism towards the focus on the BLEU and TER automatic measures on theoretical [Callison-Burch, 2006] and empirical [Charniak et al., 2003] grounds, in that they do not always accurately track translation quality as judged by a human annotator, and they may not even reliably separate professional from machine translations [Culy and Riehemann, 2003]. Other automatic MT measures have been proposed, some of which use parse decorations. Chapter 4 describes some of these alternatives in more detail.

An ideal automatic measure would correlate well with human judgements of translation quality. However, judgements of fluency and adequacy are themselves highly variable across annotators. Rather than correlate with these measurements, one may instead examine the correlation with a different human-derived measure of translation quality: Snover et al. [2006] propose Human-targeted Translation Edit Rate (HTER), a measurement of the work performed by a human editor to correct the translation until it is equivalent to the reference translation. They show that a single HTER score is very well correlated with fluency/adequacy judgements, and has lower variance: a single HTER score is more predictive of a held-out fluency/adequacy judgement than a single fluency/adequacy judgement is. HTER still requires human intervention but, probably because of its consistency in evaluation, it has been adopted as the evaluation standard for the DARPA GALE project [DARPA, 2008].
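To make the BLEU definition in equation 2.3 concrete, the following is a minimal sketch of a simplified, single-reference, sentence-level BLEU with clipped n-gram precisions and the brevity penalty; it is an illustration, not the official multi-reference scoring tool.

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Simplified sentence-level BLEU_n: geometric mean of clipped
    n-gram precisions times the brevity penalty (single reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        h_counts, r_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h_counts.values())
        if total == 0:
            return 0.0
        clipped = sum(min(c, r_counts[g]) for g, c in h_counts.items())
        if clipped == 0:
            return 0.0   # any zero precision drives the geometric mean to zero
        precisions.append(clipped / total)
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the cat sat on the rug".split()
ref = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 3))  # ~0.76
```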

2.4.3 Parsing in MT

Syntactic structure was first applied to SMT as an alternative to the phrase-table approach. Yamada and Knight [2001] and Gildea [2003] incorporate operations on a treebank-trained target-language parse tree to represent p(f |a, e) and p(a|e), but have no “phrase” component; Charniak et al. [2003] apply grammatical structure to the p(e) language-model component. These approaches met with only moderate success.

Rather than building a syntactic model into the decoder or language model, others proposed automatically [Xia and McCord, 2004, Costa-jussà and Fonollosa, 2006] and manually [Collins et al., 2005a, Popović and Ney, 2006] coded transformations on source-language trees, to reorder source sentences from f to f′ before training or decoding (translation models are trained on bitexts with f′ and e). Zhang et al. [2007] extend this approach by inserting an explicit source-to-source “pre-reordering” model pr0 (f′ |f ) to provide lattice input alternatives to the main translation.

The phrase-table models described in section 2.4.1 capture some local syntactic structure — even when the phrases are simply reliably-adjacent word-sequences — by virtue of recording actually-observed n-grams in the source- and target-language sequences, but these models offer additional power when they are made syntactically aware. Syntactically-aware decoders are united with the phrase-table approach in such approaches as the ISI systems [Galley et al., 2004, 2006, Marcu et al., 2006], the systems built by Zollmann et al. [2007], and more recently the Joshua open-source project [Li et al., 2009]. Each of these builds syntactic trees over the target side of the bitext in training and learns phrase-table entries with syntactically-labeled spans. Conversely, Quirk et al. [2005] and Xiong et al. [2007] construct phrase-table entries using source-language dependency structure, while Liu et al. [2006a] apply a similar technique using constituent structure instead of dependency structure.

Rather than pursue these phrase-table-based decoder models directly, chapter 5 of this work explores mechanisms that use parsers to improve the word-to-word alignments that are the material from which the phrases are learned.

2.5 Summary

This chapter has provided an overview of four key technologies for the remainder of this work: statistical parsing, n-best list reranking, automatic speech recognition, and statistical machine translation. Special attention is paid to the interaction of parsers with speech recognition, the evaluation of speech recognition and machine translation, and the existing roles of syntactic structure in statistical machine translation. The next three chapters use parsers (and rerankers) in various combinations: on conversational speech recognition (chapter 3), on machine translation evaluation (chapter 4), and on improving word-alignment quality for machine translation (chapter 5). Further details on the prior work most directly related to this thesis are provided in each chapter.


Chapter 3 PARSING SPEECH

Parse-decoration on the word sequence has a strong potential for application in the domain of automatic speech recognition (ASR). Extracting syntactic structure from speech is more challenging than ASR or parsing alone, because the combination of these two stages introduces the potential for cascading error, and most parsing systems assume that the leaves (words) of the syntactic tree are fixed. This chapter1 applies parse structure as an additional knowledge source, even when the evaluation targets do not include parse structure explicitly. It also considers the benefits to parsing of using alternative speech transcripts (when the evaluation targets are parse measures themselves). We thus consider recognition and parsing as a joint reranking problem, with uncertainty (in the form of multiple hypotheses) in both the recognizer and parser components. In this joint problem, there are two possible targets: word sequence quality, measured by word error rate (WER), and parse quality, measured over speech transcripts by SParseval. For both these targets, sentence boundary concerns have largely been ignored in prior work: speech recognition research has generally assumed that sentence boundaries do not have a major impact, since the placement of segment boundaries in a string does not affect WER on that string. Parsing research, on the other hand, has generally assumed that sentence boundaries are given (usually by punctuation), since most parsing research has been on text. Spoken language, unlike written language, does not have explicit markers for sentence and paragraph breaks; i.e., punctuation is not verbalized. Sentence boundaries in spoken corpora must therefore be automatically recognized, introducing another source of difficulty for the joint recognition-and-parsing problem, regardless of the target: sentence segmentation.

1 The work presented in this chapter is included in a paper that has been accepted to Computer Speech and Language.


Although there has been a substantial amount of research on speech recognition, segmentation of spoken language, and parsing (as described in the next section), there has been little work exploring automation of all three together. Most research has incorporated only one or two of these areas, typically treating recognition and parsing as separable processes. In this chapter, we combine recognition and parsing using discriminative reranking: selecting optimal word sequences from the N -best word sequences generated from a speech recognizer given cues from M parses for each, and selecting optimal parse structure from the N × M -best parse structures associated with these word sequences. At the same time, we explore the impact of automatic segmentation. We ask the following inter-related questions:

• In the task of extracting parse structure from conversational speech, how much can we improve performance by exploiting the uncertainty of the speech recognizer?

• In the word recognition task, does a discriminative syntactic language model benefit from incorporating parse uncertainty in parse feature extraction?

• How does segmentation affect the usefulness of parse information for improving speech recognition, and what is its impact on parsing accuracy, given alternative word sequences and alternative parse hypotheses?

Section 3.1 discusses the relevant background for this research integrating speech segmentation, parsing, and speech recognition. Section 3.2 outlines the experimental framework in which this chapter explores those questions, while section 3.3 describes the corpus and the configuration of the various components of this system. Section 3.4 describes the results of those experiments, and section 3.5 discusses these results in the context of the dissertation as a whole.

3.1 Background

Our approach to parsing conversational speech builds on several active research areas in speech and natural language processing. This section extends the review from chapter 2 to highlight the prior work most related to the work in this chapter.


3.1.1 Parsing on speech and its evaluation

As discussed in section 2.1.4, most parsing research has been developed with the parseval metric [Black et al., 1991], which was initially developed for parse measurement on text. It was used in initial studies of speech based on reference transcripts (without considering speech recognizer errors). The grammatical structures of speech are different from those of text: for example, Charniak and Johnson [2001] demonstrated the usefulness (as measured by parseval) of explicit modeling of edit regions in parsing transcripts of conversational speech. Unfortunately, parseval is not well-suited to evaluating parses of automatically-recognized speech. In particular, when the words (leaves) are different between reference and hypothesized trees (as will be the case when there are recognition errors), it is difficult to say whether a particular span is included in both, and the parseval measure is not well defined.

Roark et al. [2006] introduce alternative scoring methods to address this problem with SParseval, a parse evaluation toolkit. The SParseval method used here takes into account dependency relationships among words instead of spans. Specifically, CFG trees are converted into dependency trees using a head-finding algorithm and head percolation of the words at the leaves. Each dependency tree is treated as a bag of triples ⟨d, r, h⟩ where d is the dependent word, r is a symbol describing the relation, and h is the dominating lexical headword (the central content word in the phrase). Arc-labels r are determined from the highest constituent label in the dependent and the lowest constituent label dominating the dependent and the head. SParseval describes the overlap between the “gold” and hypothesized bags-of-triples in terms of precision, recall and F measure. Overall, SParseval allows a principled incorporation of both word accuracy and accuracy of parse relationships. Since every triple (the dependency-pair and its link label, as in figure 3.1) involves two words, this measure depends heavily on word accuracy, but in a more complex way than word error rate, the standard speech recognition evaluation metric.

Figure 3.1: A SParseval example that includes a reference tree (a) and two hypothesized trees (b,c) with alternative word sequences. Each tree lists the dependency triples that it contains; bold triples in the hypothesized trees indicate triples that overlap with the reference tree. Although all have the same parse structure, tree (c) is penalized more heavily (no triples right) because it gets the head word think wrong.

Figure 3.1 demonstrates a number of properties of the SParseval measure. Although both (b) and (c) have the same word error (one substitution each), they have very different precision and recall behavior. As the figure suggests, the SParseval measure over-weights “key” words, making SParseval a joint measure of word sequence and parse quality. All words appear exactly once on the left (dependent) side of the triple, but only the heads of phrases appear on the right. Thus, those words that are the lexical heads of many other words (such as think in the figure) are multiply-weighted by this measure. Head words are multiply weighted because getting a head word wrong impacts not only the triples where that head word is dependent on some other token, but also the triples where some other word depends on that head word. Non-head words are not involved in so many triples.

In this work, we use SParseval as our measure of parse quality for parses produced over speech recognition transcription hypotheses.
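A minimal sketch of this bag-of-triples scoring (assuming the trees have already been converted to dependency triples by a head-finding step, which SParseval performs internally): it computes precision, recall, and F over multisets of ⟨dependent, relation, head⟩ triples, reproducing the behavior of hypothesis (c) in figure 3.1.

```python
from collections import Counter

def triple_f(gold_triples, hyp_triples):
    """Precision/recall/F over bags (multisets) of dependency triples.

    Each triple is a (dependent, relation, head) tuple, as produced by
    converting a constituent tree with a head-finding algorithm.
    """
    gold, hyp = Counter(gold_triples), Counter(hyp_triples)
    overlap = sum((gold & hyp).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# The reference tree and hypothesis (c) from figure 3.1: substituting
# "sink" for "think" wrecks every triple because "think" heads all of them.
gold = [("I", "S/NP", "think"), ("really", "VP/AdvP", "think"),
        ("think", "<s>/S", "<s>"), ("so", "VP/AdvP", "think")]
hyp_c = [("I", "S/NP", "sink"), ("really", "VP/AdvP", "sink"),
         ("sink", "<s>/S", "<s>"), ("so", "VP/AdvP", "sink")]
print(triple_f(gold, hyp_c))  # (0.0, 0.0, 0.0)
```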

3.1.2 Speech segmentation

Speech transcripts offer another challenge for parsing, whether used as an evaluation or a knowledge source. We showed that parser performance (as measured by an adapted parseval) degrades significantly when using automatically-detected (rather than reference) sentence boundaries [Kahn et al., 2004, Kahn, 2005], even when the speech transcripts are entirely accurate. Extending the same lines of research, Harper et al. [2005] used SParseval to assess the impact of automatic segmentation on parse quality, but using automatic word transcriptions as well. Their work focuses on selecting segmentations from a fixed word sequence and providing a top-choice parse for each of those segments. As we [Kahn et al., 2004] previously found on reference transcripts, they show a negative impact of segmentation error on ASR hypothesis transcripts, and further show that optimizing for minimum segmentation error does not lead to the best parsing performance. Parser performance, rather, benefits more from higher segment-boundary recall (i.e., shorter segments). They do not, however, consider alternative speech recognition hypotheses, which is an important focus of the work in this chapter.

Though choosing a different segmentation does not affect the WER measure (as it would affect SParseval), choosing alternate segmentations affects ASR language modeling, even in the absence of a parsing language model, because even n-gram language models assume that segment-boundary conditions are matched between LM training and test: Stolcke [1997] demonstrated that adjusting pause-based ASR n-best lists to take into account segment boundaries matched to language model training data gave reductions in word error rate.

3.1.3 Parse features in reranking

Section 2.3.3 discussed general approaches to using parsing as a language model, including parsing language models like Chelba and Jelinek [2000] and Roark [2001]. Reranking, as discussed in section 2.2, is applied to parsers [Collins and Koo, 2005] but also to language modeling for ASR, with [e.g., Collins et al., 2005b] and without [Roark et al., 2007] parse features. Collins et al. [2005b] perform discriminative reranking using features of the parse structure extracted from a single-best parse of the English ASR hypothesis. Arisoy et al. [2010] used a similar strategy for Turkish language modeling. In both cases, the objective was the minimization of WER. Harper et al. [2005] and others, as mentioned above, use reranking with the parsing objective over automatic speech transcripts. However, neither the language-modeling work using syntax nor the parsing work over automatic speech transcripts considers the variable hypotheses of both the speech recognizer and the parser in a reranking context. Using both variables together is the approach pursued in this chapter.

3.2 Architecture

The system for handling conversational speech presented in this chapter is illustrated schematically in figure 3.2 and involves the following steps:

1. a speech recognizer, which generates speech recognition lattices with associated probabilities from an audio segment (here, a conversation side);

2. a segmenter which detects sentence-like segment boundaries E, given the top word hypothesis from the recognizer and prosodic features from the audio;


Figure 3.2: System architecture at test time.

3. a resegmenter which applies the segment boundaries E to confusion networks derived from the lattices and generates an N -best word hypothesis cohort W s for each segment s, made up of word sequences wi with associated recognizer posteriors pw (wi ) for each of the N sequences wi ∈ W s ;

4. a parser component which generates an M -best list of parses ti,j , j = 1, . . . , M , for each wi ∈ W s , along with confidences pp (ti,j , wi ) for each parse over each word sequence (all the ti,j for a given segment s make up the parse cohort T s )

5. a feature extractor which extracts a vector of descriptive features fi,j over each member of the parse structure cohort which together make up the feature cohort F s ; and

6. a reranker component which selects an optimal vector of features (and thus a preferred candidate) from the cohort and effectively chooses an optimal ⟨w, t⟩, which maximizes performance with respect to some objective function on the selected candidate and the reference word transcripts and parse-tree.

Figure 3.3: n-best resegmentation using confusion networks

In the remainder of this section we describe the components created for this joint-problem architecture: the resegmenter (step 3), the features chosen in the feature extractor (step 5), and the re-ranker itself (step 6). We describe the details of each component’s configuration in section 3.3.2.

3.2.1 Resegmentation

This chapter compares multiple segmentations of the word stream, including the ASR-standard pause-based segmentation, reference sentence boundaries, and two cases of automatically detected sentence-like units. Since the recognizer output is based on pause-based segmentation, a resegmenter (step 3) is needed to generate N-best hypotheses for the alternative segmentations, taking recognizer word lattices and a hypothesized segmentation as input.

The resegmentation strategy is depicted in Figure 3.3. First, the lattices from step 1 are converted into confusion networks, a compact version of lattices which consist of a sequence of word slots where each slot contains a list of word hypotheses with associated posterior probabilities [Mangu et al., 2000]. Because the slots are linearly ordered, they can be cut and rejoined at any inter-slot boundary. All the confusion networks for a single conversation side are concatenated. Speaker diarization (the relationship between this conversation side and the transcription of the interlocutor) is not varied. The concatenated confusion network is then cut at locations corresponding to the hypothesized segment boundaries, producing a segmented confusion network. Each candidate segmentation produces a different re-cut confusion network. These re-cut confusion networks are used to generate W s , an N -best list of transcription hypotheses, for each hypothesized segment s from the target segmentation. Each transcription wi of W s has a recognizer confidence pr (wi ), calculated as

$$p_r(w_i) = \prod_{k=1}^{\mathrm{len}(w_i)} p_r(w_i^k) \qquad (3.1)$$

where pr (wik ) is the confusion network confidence of the word selected for wi from the k-th slot in the confusion network. This posterior probability pr (wik ) is derived from the recognizer’s forward-backward decoding, where the acoustic model, language model, and posterior scaling weights are tuned to minimize WER on a development set.
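A small sketch of equation 3.1, accumulating the per-slot posteriors in the log domain to avoid underflow; the slot posteriors below are invented for illustration.

```python
import math

def recognizer_confidence(slot_posteriors):
    """p_r(w_i) from equation 3.1: the product over slots of the posterior
    of the word chosen from each confusion-network slot, computed as a
    sum of logs and exponentiated at the end."""
    return math.exp(sum(math.log(p) for p in slot_posteriors))

# Posteriors of the words selected from four consecutive slots.
print(recognizer_confidence([0.9, 0.75, 0.6, 0.95]))  # ~0.385
```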

3.2.2 Feature extraction

After creating the parse cohort T s from the word-sequence cohort W s , each member of the parse cohort is a word-sequence hypothesis wi with a parse tree ti,j projected over it, along with two confidences: the ASR system posterior pr (wi ) and the parse posterior pp (ti,j , wi ). The feature-extraction step (step 5) extracts additional features and organizes all of these features into a vector fi,j to pass to the reranker. The feature extraction is organized to allow us to vary fi,j to include different subsets of those extracted.

In this subsection, we present three classes of features extracted from our joint recognizer-parser architecture: per-word-sequence features, generated directly from the output of the recognizer and resegmenter and shared by all parse candidates associated with a transcription hypothesis wi ; per-parse features, generated from the output of the parser, which are different for each parse hypothesis ti,j ; and aggregated-parse features, constructed from the parse candidates but which aggregate across all ti,j that belong to the same wi . The features are listed in Table 3.1. All of the probability features p(·) are presented to the reranker in logarithmic form (values −∞ to 0).

Table 3.1: Reranker feature descriptions for parse ti,j of word sequence wi

  Feature           Description                                 Feature class
  pr (wi )          Recognizer probability                      per-word-sequence
  Ci                Word count of wi                            per-word-sequence
  Bi                True if wi is empty                         per-word-sequence
  pp (ti,j , wi )   Parse probability                           per-parse
  ψ(ti,j )          Non-local syntactic features                per-parse
  pplm (wi )        Parser language model                       aggregated-parse
  E[ψi ]            Non-local syntactic feature expectations    aggregated-parse

Per-word-sequence features

Two recognizer outputs are read directly from the N -best lists produced in step 3 and reflect non-parse information. The first is the recognizer score pr (wi ), which is calculated from the resegmenter’s confusion networks as described in equation 3.1. A second recognizer feature is the number of words Ci in the word hypothesis, which allows the reranker to explicitly model sequence length. Lastly, an empty-hypothesis indicator Bi (where Bi = 1 when Ci = 0) allows the reranker to learn a score to counterbalance the lack of a useful parse score. (It is possible that a segment will have some hypothesized word sequences wi that have valid words and some that contain only noise, silence or laughter, i.e., an empty hypothesis, which would have no meaningful parse.)


Per-parse features

Each parse ti,j has an associated lexicalized-PCFG probability pp (ti,j , wi ) returned by the parser. For the parse quality objective, our system needs to compare parses generated from different word hypotheses. The joint probability p(t, w) contains information about the word sequence (the marginal parsing language model probability p(w) = Σt p(t, w)) and the parse for that word sequence p(t|w). For the two objectives of parsing and word transcription, it is useful to factor these. Parse-specific features are described here, and in the next section we consider features that are aggregated over the M-best parses. For parsing, we compute the probabilities

$$p_p(t_{i,j} \mid w_i) = \frac{p_p(t_{i,j}, w_i)}{\sum_{k=1}^{M} p_p(t_{i,k}, w_i)} \qquad (3.2)$$

that represent the proportion of the M -best parser probability mass for sequence wi assigned to tree ti,j .

The score pp (·) described above models the parser’s confidence in the quality of the entire parse. Following the parse-reranking schemes sketched in section 3.1.3, we also extract non-local parse features: a vector of integer counts ψ(ti,j ) extracted from parse ti,j and reflecting various aspects of the parse topology, using the feature extractor from Charniak and Johnson [2005]. These features are non-local in the sense that they make reference to topology outside the usual context-free condition. For example, one element of this vector might count, in a given parse, the number of VPs of length 5 headed by the word think. Further examples of the sorts of components in ψ(·) may be found in Charniak and Johnson [2005]. Because these features are often counts of the configurations of specific words or nonterminal labels, ψ(ti,j ) is a very high-dimensional vector, which is pruned at training time for the sake of computational tractability. For each segmentation condition, we construct a different definition of ψ, keeping only those features whose values vary between candidates for more than k segments of the training corpus.

When Ci = 0, we assign exactly one dummy tree [S [-NONE- null]] to the empty word sequence, set pp (ti,1 , wi ) to a value very close to zero, and derive ψ(ti,1 ) from the dummy tree using the same feature extractor. pp (ti,1 |wi ) is set to unity since there is only one (dummy) parse available.


Aggregated-parse features

For the WER objective, the details of specific parses are not of interest, but rather their expected behavior given the distribution over possible trees {p(t|w)}. We calculate the “parser language model” feature pplm (wi ) by summing the probabilities of all parse candidates for wi :

$$p_{plm}(w_i) = \sum_{k=1}^{M} p_p(t_{i,k}, w_i) \qquad (3.3)$$

We also aggregate our non-local syntactic feature vectors ψ(ti,j ) across the multiple parses ti,j associated with a single word sequence wi by taking the (possibly flattened) expectation over the conditional parse probabilities:

$$E[\psi_i] = \sum_{j=1}^{M} p_p(t_{i,j} \mid w_i)\, \psi(t_{i,j}) \qquad (3.4)$$

We further investigated flattening the parse probabilities (i.e., replacing pp (ti,k , wi ) with pp (ti,k , wi )γ for 0 < γ ≤ 1) under the hypothesis that they were “over-confident”; this flattening proves useful in chapter 4 (also published as Kahn et al. [2009]).
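The following sketch pulls equations 3.2–3.4 together: given the M-best joint parse probabilities and their non-local feature counts for one word sequence, it computes the (optionally flattened) conditional parse probabilities, the parser language-model feature, and the expected feature vector. The inputs are toy values, not output from the actual parser or feature extractor.

```python
from collections import Counter

def aggregate_parse_features(joint_probs, feature_vectors, gamma=1.0):
    """joint_probs: list of p_p(t_{i,j}, w_i) for the M-best parses of w_i
    feature_vectors: list of Counters of non-local feature counts psi(t_{i,j})
    gamma: flattening exponent (0 < gamma <= 1); gamma = 1 gives equation 3.2
    Returns (parser LM feature, conditional probs, expected feature vector)."""
    pplm = sum(joint_probs)                                   # equation 3.3
    flattened = [p ** gamma for p in joint_probs]
    z = sum(flattened)
    conditionals = [p / z for p in flattened]                 # equation 3.2
    expectation = Counter()                                   # equation 3.4
    for p_cond, psi in zip(conditionals, feature_vectors):
        for name, count in psi.items():
            expectation[name] += p_cond * count
    return pplm, conditionals, expectation

# Toy M = 3 parse list for one hypothesis, with one invented feature name.
probs = [6e-7, 3e-7, 1e-7]
psis = [Counter({"VP_len5_think": 1}), Counter({"VP_len5_think": 1}), Counter()]
print(aggregate_parse_features(probs, psis, gamma=0.5))
```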

3.2.3 Reranker

The reranker (step 6) takes as input the feature vector fi,j for each candidate and applies a discriminative model θ to sort the cohort candidates. θ is learned by pairing feature vectors fi,j with a value assigned by an external objective function φ, and finding a θ that optimizes the cumulative objective function of the top-ranked hypotheses over all the training segments s. In this work, we consider two alternative objective functions: word error rate (WER), for targeting the word sequence (φw (wi )), and SParseval for evaluating parse structure (φp (ti,j )). For the SParseval objective, the optimization problem is given by:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{s} \phi_p\!\left( \operatorname*{argmin}_{t^{s}_{i,j}} \; \theta \cdot f^{s}_{i,j} \right) \qquad (3.5)$$

A similar equation results for the word error rate objective, but the minimization is only over word hypotheses wi .
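To illustrate the selection step inside equation 3.5 (a hedged sketch of candidate selection only, not of svm-rank training), the snippet below ranks each segment's candidates by θ · f and accumulates the objective values of the selected candidates; following the sign convention written in equation 3.5, the candidate with the smallest score is taken as top-ranked. The feature names and values are invented.

```python
def evaluate_theta(theta, segments):
    """segments: list of cohorts; each cohort is a list of
    (feature_vector, objective_value) pairs, where feature_vector is a
    dict and objective_value is e.g. a per-segment SParseval F score.
    Returns the cumulative objective of the candidates selected by theta."""
    def score(feats):
        return sum(theta.get(name, 0.0) * value for name, value in feats.items())

    total = 0.0
    for cohort in segments:
        # Per equation 3.5, the top-ranked candidate minimizes theta . f
        _feats, objective = min(cohort, key=lambda cand: score(cand[0]))
        total += objective
    return total

# Toy cohort with two candidates in one segment.
segments = [[({"log_pr": -4.1, "word_count": 4}, 0.75),
             ({"log_pr": -4.6, "word_count": 3}, 0.50)]]
print(evaluate_theta({"log_pr": 1.0, "word_count": -0.1}, segments))
```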


For training the re-ranker component of our system, we need segment-level scores, since we apply the re-ranking per-segment. Ideally, scoring operates on the concatenated result for a whole conversation side, to avoid artifacts of mapping word sequences associated with hypothesized segments that differ from the reference segments. However, in training, where scoring is needed for all M × N hypotheses, it is prohibitively complex to score all combinations at the conversation-side level. We therefore approximate per-segment SParseval scores at training time by computing precision and recall against all those reference dependency pairs whose dependent aligns within the segment boundaries available to the parses being ranked. We treat reranker training as a rank-learning margin problem, using the svm-rank toolkit [Joachims, 2006] as described in section 2.2.2.

3.3 Corpus and experimental setup

These experiments used the Switchboard corpus [Godfrey et al., 1992], a collection of English conversational speech. The Switchboard corpus consists of five-minute telephone conversations between strangers on a randomly-assigned topic. The audio is recorded on different channels for each speaker. The data from a single channel is referred to as a “conversation side.” All the data used in this experiment was taken from a subset of Switchboard conversations which the Linguistic Data Consortium (LDC) has annotated with parse trees. These experiments use the original LDC transcriptions of the Switchboard audio rather than the subsequent Mississippi State transcriptions [ISIP, 1997], because hand-annotated reference parses only exist for the former. The data also has manual annotations of disfluencies and sentence-like units [Meteer et al., 1995], labeled with reference to the audio. Because the treebanking effort used only transcripts (no audio), there are occasional differences in the definition of a “sentence”; because the audio-based annotation was likely to be more faithful to the speaker’s intent, and because the automatic segmenter was trained from data annotated with a related LDC convention [Strassel, 2003], we used the audio-based definition of sentence-like units (referred to henceforth as SUs).

The Switchboard parses were preprocessed for use in this system following methods described in Kahn [2005], which are summarized here. Various aspects of the syntactic annotation beyond the scope of this task—for example, empty categories—were removed. The parses were also resegmented to match the SU segments, with some additional rule-based changes performed to make these annotations more closely match the LDC SU conventions. In the resegmented trees, constituents spanning manually-annotated segment boundaries were discarded, and multiple trees within a single manually annotated segment were subsumed beneath a top-level SUGROUP constituent. To match the speech recognizer output, punctuation is removed, and contractions are retokenized (e.g., can + n’t ⇒ can’t).

The corpus was partitioned into training, development and evaluation sets whose sizes are shown in Table 3.2. Results are reported on the evaluation set; the development set was used during debugging and for exploring new feature-sets for f , but no results from it are reported here.

Table 3.2: Switchboard data partitions

  Partition   Sides   Words
  Train       1042    654271
  Dev          116     76189
  Eval         128     58494

3.3.1 Evaluation measures

Word recognition performance is evaluated using word-error rate measurements generated by the NIST sclite scoring tool [NIST, 2005] with the words in the reference parses taken as the reference transcription. Because we want to compare performance across different segmentations, WER is calculated on a per-conversation-side basis, concatenating all the top-ranked word sequence hypotheses in a given conversation side together. When comparing the statistical significance of different results between configurations, the Wilcoxon Signed Rank test provided by sclite is used.

For parse-quality evaluation, we use the SParseval toolkit [Roark et al., 2006], again calculated on a per-conversation-side basis, concatenating all the top-ranked parse hypotheses in a given conversation. We use the setting that invokes Charniak’s implementation of the head-finding algorithm and consider performance over both closed- and open-class words. When comparing the statistical significance of SParseval results, we use a per-segment randomization [Yeh, 2000].

3.3.2 Component configurations

Speech recognizer

The recognizer is the SRI Decipher conversational speech recognition system [Stolcke et al., 2006], a state-of-the-art large-vocabulary speech recognizer that uses various acoustic and language models to perform multiple recognition and adaptation passes. The full system has multiple front-ends, each of which produces n-best lists containing up to 2000 word sequence hypotheses per audio segment, which are then combined into a single set of word sequence hypotheses using a confusion network. This system has a WER of 18.6% on the standard NIST RT-04 evaluation test set.

Human-annotated reference parses are required for all the data involved in these experiments. Unfortunately, because they are difficult to create, reference parses are in short supply, and all the Switchboard conversations used in the evaluation of this system are already part of the training data for the SRI recognizer. Although it represents only a very small part of the training data (Switchboard is only a small part of the corpus, and the data here are restricted to the hand-parsed fraction of Switchboard), there is the danger that this will lead to unrealistically good recognizer performance. This work compensates for this potential danger by using a less powerful version of the full recognizer, which has fewer stages of rescoring and adaptation than the full system and a WER of 20.2% on the RT-04 test set. On our evaluation set from Switchboard, this system has a 22.9% WER.

Segmenter

Our automatic segmenter [Liu et al., 2006b] frames the sentence-segmentation problem as a binary classification problem in which each boundary between words can be labeled as either a sentence boundary or not. Given a word sequence and prosodic features, it estimates the posterior probability of a boundary after each word. The particular version of the system used here is based on the hidden-event model (HEM) from Stolcke and Shriberg [1996], with features that include n-gram probabilities, part of speech, and automatically-induced semantic classes, and combines the lexical and prosodic information sources. The HEM is an HMM with a higher-order Markov process on the state sequence (the word/boundary-label pair) and observation probabilities given by the prosodic information using bagged decision trees. Segment boundaries were hypothesized for all word boundaries where the posterior probability of a sentence boundary was above a certain threshold.

We explore four segmentation conditions in our experiments:

Pause-based segmentation uses the recognizer’s “automatic” segments, which are determined based on speech/non-speech detection by the recognizer (i.e., pause detection); it serves as a baseline.

Min-SER segmentation is based on the automatic system using a posterior threshold of 0.5, which minimizes the word-level slot error rate (SER).

Over-segmented segmentation is based on the automatic system using a posterior threshold of 0.35, which is that suggested by Harper et al. [2005] for obtaining better parse quality.

Reference segmentation is mapped to the hypothesized word sequence by performing a dynamic-programming alignment between the confusion networks and the reference; it provides an oracle upper bound.

Table 3.3 summarizes the segmentation conditions, including the performance (measured as SU boundary F and SER), the number of segments and the average segment length in words for each segmentation condition on the evaluation set. Note that the automatic segmentation with the lower threshold results in more boundaries, so that the average “sentence” length is shorter and recall is favored over precision.


Table 3.3: Segmentation conditions. F and SER report the SU boundary performance over the evaluation section of the corpus.

  Segmentation condition   threshold   F       SER     # Segments (Train)   # Segments (Eval)   Average length
  Pause-based              NA          0.62    0.61    54943                5693                10.3
  Min-SER                  0.5         0.77    0.45    86681                8417                 6.9
  Over-segmented           0.35        0.78    0.46    96627                9369                 6.2
  Reference                NA          (1.00)  (0.00)  91254                8779                 6.7
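The boundary-thresholding scheme described above can be sketched as follows; this is a schematic illustration only, with invented words and posteriors, whereas the real segmenter combines lexical and prosodic models to produce the boundary posteriors.

```python
def segment(words, boundary_posteriors, threshold):
    """Cut a word stream into sentence-like units (SUs).

    boundary_posteriors[k] is the segmenter's posterior probability of an
    SU boundary after words[k]; a boundary is hypothesized wherever that
    posterior reaches the threshold.
    """
    segments, current = [], []
    for word, posterior in zip(words, boundary_posteriors):
        current.append(word)
        if posterior >= threshold:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

words = "yeah i really think so i do".split()
posteriors = [0.40, 0.05, 0.02, 0.10, 0.80, 0.05, 0.95]
print(segment(words, posteriors, threshold=0.5))   # Min-SER style: two segments
print(segment(words, posteriors, threshold=0.35))  # over-segmented: three segments
```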

Resegmenter

Given the confusion network representation of the speech recognition output, the main task of resegmentation is generating N -best lists given a new segmentation condition for the confusion networks. For a given segment, the lattice-tool program from the SRI Language Modeling Toolkit [Stolcke, 2002] is used to find paths through the confusion network ranked in order of probability, so the N most probable paths are emitted as an N -best list w1 . . . wN , where each wi is a sequence of words. For these experiments, the N -best lists are limited to at most N = 50 word sequence hypotheses.

Parser

Our system uses an updated release of the Charniak generative parser [Charniak, 2001] (the first stage of the November 2009 updated release of [Charniak and Johnson, 2005], without the discriminative second-stage component) to do the M -best parse-list (and parse-score) generation. As in Kahn [2005], we do not implement a separate “edit detection” stage but treat edits as part of the syntactic structure. The parser is trained on the entire training set’s reference parses; no parse trees from other sources are included in the training set. We generate M = 10 parses for each word sequence hypothesis, based on analyses (presented later) that showed little benefit from additional parses and much more benefit from increasing the number of sentence hypotheses. If the parser generates fewer than M hypotheses, we take as many as are available. For the full system, we train a single parser on the entire training set; for providing training cohorts to the reranker, the parser is trained on round-robin subsets of the training set, as discussed in section 3.3.2.

Feature extractor

The extraction of the non-local syntactic features ψ(ti,j ) uses the software and feature definitions from Charniak and Johnson [2005]. For tractability, we prune the set of features to those with non-zero (and non-uniform) values within a single segment’s hypothesis set for more than 2000 segments, which is approximately 2% of the total number of training segments (as in the parse-reranking experiments in Kahn et al. [2005]). Pruning is done separately for each segmentation of the training set, yielding about 40,000 non-local syntactic features under most segmentation conditions.2 The aggregate parse features pplm (wi ) and E[ψi ] are calculated by sums across the M parses generated for each wi . We assume that this approximation (instructing the parser to return no parses after the M -th) has no important impact on the value of these features.

2 Our non-local syntactic feature set is thus slightly different for each segmentation, since the number and content of the set of segments vary among segmentations. The pause-based segmentation, with substantially longer segments, selects about 28,000 features under this pruning condition; others have about 40,000.
Our non-local syntactic feature set is thus slightly different for each segmentation, since the number and content of the set of segments vary among segmentations. The pause-based segmentation, with substantially longer segments, selects about 28,000 features under this pruning condition; others have about 40,000.
2

49

To avoid memory constraints, we assign each segment to one of 10 separate bins and train 10 svm-rank models.3 For each experimental combination of segmentation and features in fi,j , we re-train all 10 rerankers. At evaluation time, the cohort candidates are ranked by all 10 models and their scores are averaged. The parse (or word-sequence) of the topranked candidate is taken to be the system’s hypothesis for a given segment, and evaluated according to either the WER or SParseval objective. 3.4 Results

This section describes the results of experiments designed to assess the potential for performance improvement associated with increasing the number of word-sequence vs. parse candidates, as well as the actual gains achieved by reranking under both WER and SParseval objectives and different segmentation conditions. We also include a qualitative analysis of improvements.

3.4.1 Baseline and Oracle Results

To provide a baseline, we sequentially apply the recognizer, segmenter, and parser, choosing the top-scoring word sequence and then the top parse choice. We establish upper bounds for each objective by selecting the candidate from the M × N parse-and-word-sequence cohort that scores the best on each objective function. The results of these experiments are reported in tables 3.4 (optimizing for WER with M = 1 and different N ) and 3.5 (optimizing for SParseval with N = 50 and different M ). The number in parentheses corresponds to the mismatched condition — picking a candidate based on one criterion and scoring it with another. Both sets of results show that improving one objective leads to improvements in the other, since word errors are incorporated into the SParseval score.

Table 3.4 shows that the N -best cohorts contain a potential WER error reduction of 32%. Larger gains are possible for the shorter-segment segmentation conditions, due to the increase in the number of available alternatives when generating N -best lists from more (and shorter) confusion networks.

Table 3.4: Baseline (1-best serial processing) and oracle WER reranking performance from N = 50 word sequence hypotheses and 1-best parse. Parenthesized values indicate (unoptimized) SParseval scores of the selected hypothesis.

                Serial Baseline          1×N WER Oracle
  Segmenter     WER     (SParseval)      WER     (SParseval)
  Pause         23.7    (68.2)           17.6    (70.7)
  Min-SER       23.7    (70.7)           16.7    (73.7)
  Over-seg      23.7    (70.9)           16.2    (74.1)
  Reference     23.7    (72.5)           16.2    (77.0)

3 Each candidate set is generated by a single leave-n-out parser (populated by conversation-side), but each svm-rank bin (populated by segments, not by conversation sides) includes some cohorts from each of the leave-n-out tenths.

Table 3.5 shows that there is a potential 39% reduction in parse error (1 − F) between the serial baseline (F = 72.5) and the joint M × N optimization (F = 83.2) with the oracle segmentation. The potential benefit is smaller for the pause-based segmentation (F = 68.2 vs. 75.2), both in terms of the relative improvement (22%) and the absolute F score. The possible benefit of automatic segmentation falls between these ranges, with slightly better results for the over-segmented case. We observe smaller gains in going from M = 10 to M = 50 parses (and no gains in the automatic segmentation cases), so only M = 10 parses are used in subsequent experiments, to reduce memory requirements in training.

We can also compare the benefit of increasing N vs. M. Figure 3.4 illustrates the trade-off for reference segmentations, showing that there is a bigger benefit from increasing N than M. However, a comparison of the results in the two tables shows that there is a significant gain in SParseval parse performance from increasing both N and M: if only M is increased (from 1 × 1 to 1 × 50), the potential benefit is 25% error reduction. If increased to 10 × 50, the possible reduction is 36% (39% for 50 × 50).


Table 3.5: Oracle SParseval (WER) reranking performance from N = 50 word sequence hypotheses and M = 1, 10, or 50 parses. Parenthesized values indicate (unoptimized) WER of the selected hypotheses.

Parse Oracle (N = 50)
Segments     M = 1             M = 10            M = 50
             SParseval (WER)   SParseval (WER)   SParseval (WER)
Pause        72.7 (20.8)       74.4 (20.3)       75.2 (20.0)
Min-SER      75.8 (20.3)       78.0 (19.7)       78.0 (19.7)
Over-seg     76.2 (20.0)       78.5 (19.3)       78.5 (19.3)
Reference    79.4 (19.1)       82.3 (18.3)       83.2 (18.1)


Figure 3.4: Oracle parse performance contours for different numbers of parses M and recognition hypotheses N on reference segmentations.


Table 3.6: Reranker feature combinations. Additionally, all feature sets also contain the per-word-sequence features pr(wi), Ci and Bi.

Feature Set        Additional features           Per
ASR                (No additional features)      word sequence
ParseP             pp(ti,j, wi)                  parse
ParseLM            pplm(wi)                      word sequence
ParseP+NLSF        pp(ti,j, wi), ψ(ti,j)         parse
ParseLM+E[NLSF]    pplm(wi), E[ψi]               word sequence

3.4.2 Optimizing for WER

We also investigate whether providing multiple M-best parses to the reranker augments the parsing knowledge source when optimizing for WER (compared to using only one parse annotation, or to using no parse annotation at all). To examine this, we explore different alternatives for creating the feature-vector representation fi,j of a word-sequence candidate, as summarized in Table 3.6. All experiments include recognizer confidences pw(wi), word count Ci, empty-hypothesis flag Bi, and parser posteriors pp(ti,j, wi) in the feature vector. Table 3.6 shows all the feature combinations investigated with the feature names used here. Table 3.7 shows the WER results for all the segmentation conditions and feature sets, which can be compared to the baseline serial result of 23.7%. Reranking with the ASR features alone does not improve performance, since there is little that the reranker can learn (acoustic and language model scores are combined in the process of generating N-best lists from confusion networks). The WER performance is worse than baseline on the Min-SER and Ref segmentations, possibly because these segments are relatively longer than in the Over-seg condition, making word length differences a less useful feature. Other results in table 3.7 confirm that non-local syntactic features ψ(ti,j) (NLSF here) are useful for word recognition, confirming the results from Collins et al. [2005b]. In addition, there are some new findings. First, SU segmentation impacts the utility of the parser for word transcription (as well as for parsing). There is no benefit to using the parse probabilities alone except


Table 3.7: WER on the evaluation set for different sentence segmentations and feature sets. Baseline WER for all segmentations is 23.7%.

Features              Pause-based   Min-SER   Over-seg   Ref seg
ASR (M = 1)           23.6          24.2      23.7       24.1
ParseP (M = 1)        23.6          23.7      23.7       23.1
ParseLM               23.7          23.7      23.7       23.1
ParseP+NLSF (M = 1)   23.3          23.4      23.4       22.8
ParseLM+E[NLSF]       23.3          23.3      23.4       22.7
Oracle-WER            17.6          16.7      16.2       16.2

in the case of reference segmentation,4 and the benefit of parse features is greater with the reference segmentation than with the automatic segmentations (22.7% vs. 23.3% WER). Second, the use of more than one parse with parse posteriors does not lead to significant performance gains for any feature set. While there is not a significant benefit from using M = 10 with the parse probability features, it does give the best result, and we use it in comparisons to the M = 10 SParseval optimization. For all segmentations, the ParseLM+E[NLSF] features provide a significant reduction (p < 0.001 using the Wilcoxon test) in WER from the baseline, but only 4–6% of the possible improvement within the N-best cohort is obtained with the automatic segmentations. When using reference segmentation, reranking with any of the feature sets provides significant (p < 0.001) WER reductions compared to baseline. Table 3.8 explores the effect of lowering the parse-flattening γ below 1.0 for those WER-optimized models that use more than one parse candidate (γ has no effect on expectation weighting when there is only one parse candidate). The differences introduced by γ = 0.5 or γ = 0.1 are not significantly different from γ = 1.0, and systems trained with γ < 1 are in general slightly worse than those with γ = 1.0. In all further experiments, γ is set to the
Since the n-gram language model is trained on much more data than the parser, it may be difficult for the parsing language model to provide added benefit.

Table 3.8: Word error rate results for different sentence segmentations and feature sets, comparing γ parse-flattening for WER optimization when M = 10. The baseline WER for all segmentations is 23.7%.

Features           γ     Pause-based   Min-SER   Over-seg   Ref seg
ParseLM            0.1   23.7          23.7      23.7       23.4
ParseLM            0.5   23.6          23.7      23.7       23.2
ParseLM            1.0   23.6          23.7      23.7       23.1
ParseLM+E[NLSF]    0.1   23.3          23.4      23.4       22.9
ParseLM+E[NLSF]    0.5   23.3          23.4      23.4       22.7
ParseLM+E[NLSF]    1.0   23.3          23.3      23.4       22.7

default (1.0).
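One plausible reading of the per-segment significance test reported above is a Wilcoxon signed-rank test over paired per-segment error counts; the sketch below uses scipy's defaults for ties and zero differences, which may not match the dissertation's exact setup.

```python
from scipy.stats import wilcoxon

def significant_improvement(baseline_errors, reranked_errors, alpha=0.001):
    """Both arguments are per-segment word-error counts for the same
    segments, in the same order."""
    statistic, p_value = wilcoxon(baseline_errors, reranked_errors)
    return p_value < alpha
```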

3.4.3 Optimizing for SParseval

When optimizing for SParseval, we train and evaluate with the feature set that includes parse-specific features: pr (wi ), Ci , Bi , pplm (wi ), pp (ti,j , wi ), and ψ(ti,j ). Table 3.9 summarizes the results for the different segmentation conditions in comparison to the serial baseline and M × N -best oracle result. WER numbers, reported in parentheses, are the WER of the leaves of the selected parse. As expected from prior work [Kahn et al., 2004, Harper et al., 2005], we find an impact on parsing from the segmentation. The best results for all feature sets are obtained with reference segmentations, and the over-segmented threshold in automatic segmentation is slightly better than the min-SER case. For reference segmentations, higher parse scores correspond to lower WER, but for other segmentations this is not always the case for the automatic systems. For all segmentations, optimizing with the ParseP feature set is better than the baseline (p < 0.01 using per-segment randomization [Yeh, 2000]). The non-local syntactic features did not lead to improved parse performance over the


Table 3.9: Results under different segmentation conditions when optimizing for the SParseval objective; the associated WER results are reported in parentheses.

Features       Pause-based    Min-SER        Over-seg       Ref seg
Baseline       68.2 (23.7)    70.7 (23.7)    70.9 (23.7)    72.5 (23.7)
ParseP         68.8 (24.1)    71.1 (24.0)    71.3 (24.0)    73.4 (23.2)
ParseP+NLSF    69.1 (24.3)    70.4 (25.5)    70.4 (25.8)    73.1 (23.5)
Oracle         74.4 (20.3)    78.0 (19.7)    78.5 (19.3)    82.3 (18.3)

parse probability alone, and in some cases hurt performance, which seems to contradict prior results in parse reranking. However, as shown in Figure 3.5, there is an improvement due to the use of these features for the case where there is only N = 1 recognition hypothesis, but that improvement is small compared to the gains from increasing N. Figure 3.5 also shows that optimizing for WER with non-local syntactic features actually leads to better parsing performance than optimizing directly for parse performance. We conjecture that this result is due to overtraining the reranker when the feature dimensionality is high and the training samples are biased to have many poorly-scoring candidates. The parsing problem involves many more candidates to rank than WER (300 vs. 30 on average) because parse-reranking has M × N candidates while transcription-reranking has at most N candidates. Since the pool of M × N candidates is much larger, it contains more poorly-ranking candidates, and thus the learning may be dominated by the many pairwise cases involving poor-quality candidates.

3.4.4 Qualitative observations

We examined the recognizer outputs for the WER optimization with the ParseLM and expected NLSF features to understand the types of improvements resulting from using a parse-based language-model for re-ranking. Under this WER optimization on reference segmentation, of the 8,726 segments in the test set, 985 had WER improvements and 462


Figure 3.5: SParseval performance for different feature and optimization conditions as a function of the size of the N-best list.

had WER degradations. We examined a sample (about 100 each) of these improvements and degradations. Some improvements (about 15%) are simple determiner recoveries, e.g. "a" in "have a happy thanksgiving." Other examples involve short main verbs (also a bit more than 15%; above 20% if contractions are included), as in:

    some evenings are [or] worse than others
    that is [as] a pretty big change
    used [nice] to live in colorado
    well they don't [—] always

where the corrected word is in boldface and the incorrect word (substitution) output by the baseline recognizer is italicized and in brackets. More significant from a language processing perspective are the corrections involving pronouns (about 5%), which would impact coreference and entity analysis. The parsing LM recovers lost pronouns and eliminates incorrectly recognized ones, particularly in contractions with short main verbs, as in the following examples:


she was there [they’re] like all winter semester they’re [there] going to school we’re [where] the old folks now (Contraction corrections like these are not included in the count for short main verb corrections.) Further improvements are found in complementizers and prepositions (about 5% each), while only about 10% of the improvements changed content words. The remaining 45% of improvements are miscellaneous. Another pronoun example illustrates how the parse features can overcome the bias of frequent n-grams in conversational speech: Improved: they *** really get uh into it Baseline: yeah yeah really get uh into it really get uh into it

Reference: they uh

with substitution errors in italics and deletions indicated by "***." (The bigram "yeah yeah" is very frequent in the Switchboard corpus.) Of the segments that suffered WER degradation under ParseLM+E[NLSF] WER optimization, a little more than 15% were errors on a word involved in a repetition or self-correction, e.g. the omission of the boldface "the" in:

    . . . that's not the not the way that the society is going

Another 7–10% of the candidates that had WER degradation were more grammatically plausible than the reference transcription, e.g. the substitution of a determiner "a" for an unusually-placed pronoun (probably a correction):

    Reference: but i lot of times i don't remember the names
    Optimized: but a lot of times i do not remember the names

Most importantly, these last two classes of WER degradation do not have an impact on the meaning of the sentence. The remaining roughly 75% of the WER-degraded segments are difficult to characterize, but a large majority of them also involve function words.


Most of these types of corrections are observed whether the optimization is for WER or SParseval. Many cases where they give different results have a higher or equal WER for the SParseval-optimized case, but the result is arguably better, as in:
    WER obj.:       i know that people *** to arrange their whole schedules . . .    (1 error)
    SParseval obj.: i know that people used to arrange their whole schedule . . .    (1 error)
    Baseline:       i know that people easter arrange their whole schedules . . .    (2 errors)
    Reference:      i know that people used to arrange their whole schedules . . .

We compared the WER-optimized segments to the SParseval-optimized segments, and found that about 100 segments had better SParseval and worse WER in the SParseval-optimized segment, and better WER and worse SParseval in the WER-optimized segment. Of these cases, about 15% seem to be cases where the SParseval-optimization is more grammatically plausible than the reference, e.g.:

    Reference:      i've i've probably talked maybe to five people
    SParseval opt.: i've i've probably talked to maybe just five people
    WER opt.:       i've i've probably talked maybe just five people

    Reference:      now it's like you know tough and dirty team
    SParseval opt.: now it's like you know a tough and dirty team
    WER opt.:       now it's like you know tough and dirty team

Note that it is important that the parser is trained on conversational speech in order to make useful predictions on conversational phenomena such as the hedging "like, you know" and the prescriptively proscribed double-adverb "maybe just". The remaining improvements in this analysis may be categorized as a variety of other cases.

3.5 Discussion

In this chapter, we have presented a discriminative framework for jointly modeling speech recognition and parsing, with which we improve both word sequence quality (as measured


by WER) and parse quality (as measured by SParseval). We confirm and extend previous work in using parse structure for language-modeling [Collins et al., 2005b] and in parsing conversational speech [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005].

Experiments using this framework provide some answers to the questions posed at the beginning of the chapter. First, we find that parsing performance can be improved substantially by incorporating parser uncertainty via N-best list rescoring, particularly with high quality sentence segmentation, although the automatic reranking systems achieve only a small fraction of the potential gain. Further, allowing for word uncertainty is much more important than considering parse alternatives. In optimizing for WER, however, no significant gains are obtained from modeling parse uncertainty in a statistical parser, either in a language model or in non-local syntactic features. Of course, these findings may depend on the particular parser used. Finally, we find that sentence segmentation quality is important for parse information to have a significant impact on speech recognition WER, and that a good segmentation can increase the potential gains in parsing from considering multiple word-sequence hypotheses. A conclusion of these findings is that improvements to automatic segmentation algorithms would substantially extend the utility of parsers in speech processing.

One surprising result was that non-local syntactic features in reranking were of more benefit to speech recognition than to parsing and, in fact, sometimes hurt parsing performance. We conjecture that this result is due to the fact that the joint parsing problem involves many more poor candidate pairs among reranker training samples, which seems to be problematic for the learner when the features are high-dimensional. It may be that other types of rerankers are better suited to handling such problems.


Chapter 4 USING GRAMMATICAL STRUCTURE TO EVALUATE MACHINE TRANSLATION
This chapter explores a different use of grammatical structure prediction: its use in predicting the quality of machine translation. As suggested in chapters 1 and 2, a key challenge in automatic machine translation evaluation is to account for allowable variability, since two equally good translations may be quite different in surface form. This is especially challenging when the evaluation measures used consider only the word-sequence. We motivate the use of dependencies for SMT evaluation with two example machine translations (and a human-translated reference):
    Ref: Authorities have also closed southern Basra's airport and seaport.
    S1:  The authorities also closed the airport and seaport in the southern port of Basra.
    S2:  Authorities closed the airport and the port of.                                    (4.1)

A human evaluator judged the system 1 result (S1) as equivalent to the reference, but indicated that the system 2 (S2) result had problematic errors. BLEU4 (a popular automatic metric for SMT) gives S1 and S2 similar scores (0.199 vs. 0.203). TER (another popular metric) prefers S2 (with an error of 0.7 vs. 0.9 for S1), since a deletion requires fewer edits than rephrasing. EDPM (the new metric described later in this chapter) provides a score for S1 (0.414) that is preferred to S2 (0.356), reflecting EDPM’s ability to match dependency structure. The two phrases “southern Basra’s airport and seaport” and “the airport and seaport in the southern port of Basra” have more similar dependency structure than word order. The next section (4.1) reviews some relevant research in the evaluation of machine translation. In section 4.2, this chapter describes a family of dependency pair match (DPM) automatic machine-translation metrics, and section 4.3 describes the infrastructure and tools
Matthew Snover provided invaluable assistance in a version of this work, which has been published as Kahn et al. [2009].

used to implement that family. Sections 4.4 and 4.5 explore two ways to compare members of this family with human judgements. Section 4.6 explores the potential to adapt the EDPM component measures by combining them with another state-of-the-art MT metric’s use of synonym tables and other word-sequence and sub-word features. Section 4.7, finally, discusses the broader implications and future directions for these findings. 4.1 Background

Currently, the most popular approaches for automatic MT evaluation are BLEU [Papineni et al., 2002], based on n-gram precision, and Translation Edit Rate (TER), an edit distance [Snover et al., 2006]. These measures can only account for variability when given multiple translations, and studies have shown that they may not accurately track translation quality [Charniak et al., 2003, Callison-Burch, 2006]. Both BLEU and TER are word-sequence measures: they use exclusively features of the word-sequence and no knowledge of language similarity or structure beyond that sequence. Some alternative measures have proposed using external knowledge sources to explore mappings within the words themselves, such as synonym tables and morphological stemming, e.g. METEOR [Banerjee and Lavie, 2005] and the ATEC measure [Wong and Kit, 2009]. TER Plus (TERp) [Snover et al., 2009], which is an extension of the previously-mentioned TER, also incorporates synonym sets and stemming, along with automatically-derived paraphrase tables. Still other systems attempt to map language similarity measures into a high-level semantic entailment abstraction, e.g. [Padó et al., 2009]. By contrast, this chapter's research proposes a technique for comparing syntactic decompositions of the reference and hypothesis translations. Other metrics modeling syntactically-local (rather than string-local) word-sequences include tree-local n-gram precision in various configurations of constituency and dependency trees [Liu and Gildea, 2005] and the d and d_var measures proposed by Owczarzak et al. [2007a,b], which compare relational tuples derived from a lexical functional grammar (LFG) over reference and hypothesis translations.2
Owczarzak et al. [2007a] extend their previous line of research [Owczarzak et al., 2007b] by variably weighting dependencies and by including synonym matching, two directions not pursued here. Hence,

Any syntactic-dependency-oriented measure requires a system for proposing dependency structure over the reference and hypothesis translations. Liu and Gildea [2005] use a PCFG parser with deterministic head-finding, while Owczarzak et al. [2007a] extract the semantic dependency relations from an LFG parser [Cahill et al., 2004]. This chapter's work extends the dependency-scoring strategies of Owczarzak et al. [2007a], which reported substantial improvement in correlation with human judgement relative to BLEU and TER, by using a publicly-available probabilistic context-free grammar (PCFG) parser and deterministic head-finding rules, rather than an LFG parser. In addition, this chapter considers alternative syntactic decompositions and alternative mechanisms for computing score combinations. Finally, the work presented here explores the combination of syntax with synonym- and paraphrase-matching scoring metrics. Evaluation of automatic MT measures requires correlation with MT evaluation measures performed by human beings. Some [Banerjee and Lavie, 2005, Liu and Gildea, 2005, Owczarzak et al., 2007a] compare the measure to human judgements of fluency and adequacy. Other work [e.g. Snover et al., 2006] compares measures' correlation with human-targeted TER (HTER), an edit-distance to a human-revised reference. The metrics developed here are evaluated in terms of their correlation with both fluency/adequacy judgements and HTER scores.

4.2 Approach: the DPM family of metrics

The specific family of dependency pair match (DPM) measures described here combines precision and recall scores of various decompositions of a syntactic dependency tree. Rather than comparing string sequences, as BLEU does with its n-gram precision, this approach defers to a parser for an indication of the relevant word tuples associated with meaning — in these implementations, the head on which that word depends. Each sentence (both reference and hypothesis) is converted to a labeled syntactic dependency tree and then relations from each tree are extracted and compared. These measures may be seen as generalizations of
the earlier paper is cited in comparisons. Section 4.6 includes synonym matching, but over data which are not directly comparable with either Owczarzak paper and using an entirely different mechanism for combination.


Reference: "The red cat ate"
  dlh list: ⟨the, det, cat⟩  ⟨red, mod, cat⟩  ⟨cat, subj, ate⟩  ⟨ate, root, <root>⟩

Hypothesis: "The cat stumbled"
  dlh list: ⟨the, det, cat⟩  ⟨cat, subj, stumbled⟩  ⟨stumbled, root, <root>⟩

Figure 4.1: Example dependency trees and their dlh decompositions.

  dl: ⟨the, det⟩  ⟨cat, subj⟩  ⟨stumbled, root⟩
  lh: ⟨det, cat⟩  ⟨subj, stumbled⟩  ⟨root, <root>⟩

Figure 4.2: The dl and lh decompositions of the hypothesis tree in figure 4.1.

the dependency-pair F measures found in Owczarzak et al. [2007b]. The particular relations that are extracted from the dependency tree are referred to here as decompositions. Figure 4.1 illustrates the dependency-link-head decomposition of a toy dependency tree into a list of ⟨d, l, h⟩ tuples. Some members of the DPM family may apply more than one decomposition; other good examples are the dl decomposition, which generates a bag of dependent words with outbound links, and the lh decomposition, which generates a bag of inbound link labels, with the head word for each included. Figure 4.2 shows the dl and lh decompositions for the same hypothesis tree. The decompositions explored in various configurations in this chapter include:

dlh     Dependent, arc Label, Head – full triple
dl      Dependent, arc Label – marks how the word fits into its syntactic context
lh      arc Label, Head – implicitly marks how key the word is to the sentence


dh      Dependent, Head – drops syntactic-role information
1g, 2g  simple measures of unigram (bigram) counts

Various members of the family may choose to include more than one of these decompositions.3 It is worth noting here that the dlh and lh decompositions (but not the dl decomposition) "overweight" the headwords, in that there are n elements in the resulting bag, but if a word has no dependents it is found in the resulting bag exactly one time (in the dlh case) or not at all (in the lh case). Conversely, syntactically "key" words, those on which many other words in the tree depend, are included multiple times in the decomposition (once for each inbound link). This "overweighting" effectively allows the grammatical structure of the sentence to indicate which words are more important to translate correctly, e.g. "Basra" in example (4.1), or head verbs (which participate in multiple dependencies).

A statistical parser provides confidences associated with parses in a probabilistically-weighted N-best list, which we use to compute expected (probability-weighted) counts for each decomposition in both reference and hypothesized translations. By using expected counts, we may count partial matches in computing precision and recall. This approach addresses both the potential for parser error and for syntactic ambiguity in the translations (both reference and hypothesis).

When multiple decomposition types are used together, we may combine these subscores in a variety of ways. Here, we experiment with using two variations of a harmonic mean: computing precision and recall over all decompositions as a group (giving a single precision and recall number) vs. computing precision and recall separately for each decomposition. We distinguish between these using the notation in (4.2) and (4.3):

    F [dl, lh]   = µh(Prec(dl ∪ lh), Recall(dl ∪ lh))                       (4.2)
    µPR [dl, lh] = µh(Prec(dl), Recall(dl), Prec(lh), Recall(lh))           (4.3)

where µh represents a harmonic mean. (Note that when there is only one decomposition,
No d decomposition is included: this would be equivalent to a 1g decomposition. h decomposition might capture the syntactic weighting without the syntactic role that lh captures, but we find that lh has the same effect.

as in F [dlh], F [·] ≡ µPR [·].) Dependency-based SParseval [Roark et al., 2006] and the d approach from Owczarzak et al. [2007a] may each be understood as F [dlh] (although SParseval focuses on the accuracy of the parse, and Owczarzak et al. use a different mechanism for generating trees for decomposition). The latter's d_var method may be understood as something close to F [dl, lh]. BLEU4 is effectively µP(1g . . . 4g) with the addition of a brevity penalty. Both the combination methods F and µPR are "naive" in that they treat each component score as equivalent. When we introduce syntactic/paraphrasing features in section 4.6, we will consider a weighted combination.
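A minimal sketch of the bag decompositions and the two combination methods in equations (4.2) and (4.3), not the dissertation's implementation: a dependency tree is assumed to be a list of (dependent, label, head) triples as in figure 4.1.

```python
from collections import Counter

def decompose(tree, kind):
    """tree: list of (dependent, label, head) triples for one sentence."""
    if kind == "dlh":
        items = tree
    elif kind == "dl":
        items = [(d, l) for d, l, h in tree]
    elif kind == "lh":
        items = [(l, h) for d, l, h in tree]
    else:
        raise ValueError(kind)
    # tag each tuple with its decomposition type so bags can be pooled safely
    return Counter((kind, tuple(item)) for item in items)

def prec_recall(hyp_bag, ref_bag):
    overlap = sum((hyp_bag & ref_bag).values())        # multiset intersection
    return (overlap / max(sum(hyp_bag.values()), 1),
            overlap / max(sum(ref_bag.values()), 1))

def harmonic_mean(*xs):
    return len(xs) / sum(1.0 / x for x in xs) if all(xs) else 0.0

def F(hyp, ref, kinds):                                # equation (4.2)
    pool = lambda t: sum((decompose(t, k) for k in kinds), Counter())
    return harmonic_mean(*prec_recall(pool(hyp), pool(ref)))

def mu_PR(hyp, ref, kinds):                            # equation (4.3)
    scores = [x for k in kinds
              for x in prec_recall(decompose(hyp, k), decompose(ref, k))]
    return harmonic_mean(*scores)
```

For example, `F(hyp_tree, ref_tree, ["dl", "lh"])` computes F [dl, lh], while `mu_PR` with the same arguments computes µPR [dl, lh].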

4.3 Implementation of the DPM family

The entire family of DPM measures may be implemented with any parser that generates a dependency graph (a single labeled arc for each word, pointing to its head-word). Prior work [Owczarzak et al., 2007a] on related measures has used an LFG parser [Cahill et al., 2004] or an unlabelled dependency tree [Liu and Gildea, 2005]. In this work, we use a state-of-the-art PCFG (the first stage of Charniak and Johnson [2005]) and context-free head-finding rules [Magerman, 1995] to generate an N-best list of dependency trees for each hypothesis and reference translation. We use the parser's default (English) Wall Street Journal training parameters. Head-finding uses the Charniak parser's rules, with three modifications to make the semantic (rather than syntactic) relations more dominant in the dependency tree: prepositional and complementizer phrases choose nominal and verbal heads respectively (rather than functional heads) and auxiliary verbs are dependents of main verbs (rather than the converse). These changes capture the idea that main verbs are more important for adequacy in translation, as illustrated by the functional equivalence of "have also closed" vs. "also closed" in the introductory example. Having constructed the dependency tree, we label the arc between dependent d and its head h as A/B when A is the lowest constituent label headed by h and dominating d and B is the highest constituent label headed by d. For illustration, in figure 4.3, the s node is the lowest node headed by stumbled that dominates cat, and the np node is the highest constituent label headed by cat, so the arc linking cat to stumbled is labelled s/np. This strategy is very similar to one adopted in the reference implementation of


(Constituent tree for "the cat stumbled", with heads dt/the, nn/cat, vbd/stumbled under np/cat, vp/stumbled, s/stumbled and root/stumbled; the derived labeled dependency arcs are the →np/dt→ cat →s/np→ stumbled →root/s→ <root>.)

Figure 4.3: An example headed constituent tree and the labeled dependency tree derived from it.

labelled-dependency SParseval [Roark et al., 2006], and may be considered as a shallow approximation of the rich semantics generated by LFG parsers [Cahill et al., 2004]. The A/B labels are not as descriptive as the LFG semantics, but they have a similar resolution in English (with its relatively fixed word order), e.g. the s/np arc label usually represents a subject dependent of a sentential verb. For the cases where we have N-best parse hypotheses, we use the associated parse probabilities (or confidences) to compute expected counts. The sentence will then be represented with more tuples, corresponding to alternative analyses. For example, if the N-best parses include two different roles for dependent "Basra", then two different dl tuples are included, each with the weighted count that is the sum of the confidences of all parses having the respective role.4 The parse confidence p is normalized so that the N-best confidences sum to one. Because the parser is overconfident, we explore a flattened estimate:

    p̃(k) = p(k)^γ / Σi p(i)^γ

where k and i index the parses and γ is a free parameter.
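A sketch of the expected (probability-weighted) decomposition counts with the flattened confidences p̃(k), assuming the `decompose` bag extractor sketched in section 4.2 and an N-best list given as (confidence, tree) pairs:

```python
from collections import Counter

def expected_counts(nbest, kinds, gamma=0.25):
    """nbest: list of (confidence, dependency_tree) pairs from the parser.
    Returns a bag of fractional (expected) decomposition counts."""
    weights = [p ** gamma for p, _ in nbest]           # flattened confidences
    z = sum(weights) or 1.0
    expected = Counter()
    for w, (_, tree) in zip(weights, nbest):
        bag = sum((decompose(tree, k) for k in kinds), Counter())
        for item, count in bag.items():
            expected[item] += (w / z) * count
    return expected
```

The fractional bags feed the same precision/recall computation, since the multiset intersection takes element-wise minima, which is how partial matches are counted.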

The use of expectations with N -best parses is different from d 50 and d 50 pm in Owczarzak et al. [2007a], in that the latter uses the best-matching pair of trees rather than an aggregate over the tree sets and they do not use parse confidences.


4.4 Selecting EDPM with human judgements of fluency & adequacy

We explore various configurations of the DPM by assessing the results against a corpus of human judgements of fluency and adequacy, specifically the LDC Multiple Translation Chinese corpus parts 2 [LDC, 2003] and 4 [LDC, 2006], which are composed of English translations (by machine and human translators) of written (and edited) Chinese newswire articles. For each article in these corpora, multiple human evaluators provided judgements of fluency and adequacy for each sentence (assigned on a five-point scale), with each judgement using a different human judge and a different reference translation. For a rough comparison with Owczarzak et al. [2007a], we treat each judgement as a separate segment, which yields 16,815 tuples of ⟨hypothesis, reference, fluency, adequacy⟩. We compute per-segment correlations.6 The baselines for comparison are case-sensitive BLEU (4-grams, with add-one smoothing) and TER. The specific dimensions of DPM explored include:

Decompositions. We compute precision and recall of several different decompositions:
  d, dl, dlh   increasing n-grams, directed up through the tree, as inspired by BLEU4 and Liu and Gildea [2005]
  dl, lh       partial decomposition, to match d_var
  dlh          all labeled dependency link pairs, as suggested by SParseval and d
  1g, 2g       surface unigrams and bigrams only

Parser variations. When using more than one parse, we explore:
  Size of N-best list. 1 (adopting only the best parse) or 50 (as in Owczarzak et al. [2007a])
Our segment count differs slightly from Owczarzak et al. [2007a] for the same corpus: 16,807 vs. 16,815. As a result, the baseline per-segment correlations differ slightly (BLEU4 is higher here, while TER here is lower), but the trends in gains over those baselines are very similar. The use of the same hypothesis translations in multiple comparisons in the Multiple Translation Corpus means that scored segments are not strictly independent, but for methodological comparison with prior work, this strategy is preserved.

Table 4.1: Per-segment correlation with human fluency/adequacy judgements of different combination methods and decompositions.

metric                  r
BLEU4                   0.218
F [1g, 2g, dl, lh]      0.237
µPR [1g, 2g, dl, lh]    0.217
F [1g, 2g]              0.227
µPR [1g, 2g]            0.215
F [1g, dl, dlh]         0.227
F [dl, lh]              0.226
µPR [dl, lh]            0.208

Parse confidence. The distribution flattening parameter is varied from γ = 0 (uniform distribution) to γ = 1 (no flattening).

Score combination. Global F vs. component harmonic mean µPR.

4.4.1 Choosing a combination method: F vs. µPR

In table 4.1, we compare combination methods for a variety of decompositions. These results demonstrate that F consistently outperforms µPR as well as the BLEU4 baseline (see table 4.2). µPR measures are never better than BLEU; µPR combinations are thus not considered further in this work.

4.4.2 Choosing a set of decompositions

Considering only the 1-best parse, we compare DPM with different decompositions to the baseline measures. Table 4.2 shows that all decompositions except [dlh] have a better per-segment correlation with the fluency/adequacy scores than TER or BLEU4 . Including progressively larger chunks of the dependency graph with F [1g, dl, dlh], inspired by the


Table 4.2: Per-segment correlation with human fluency/adequacy judgements of baselines and different decompositions. N = 1 parses used.

metric                |r|
F [1g, 2g, dl, lh]    0.237
F [1g, 2g]            0.227
F [dl, lh]            0.226
BLEU4                 0.218
F [dlh]               0.185
TER                   0.173

BLEUk idea of progressively larger n-grams, did not give an improvement over [dl, lh]. Dependencies [dl, lh] and string-local n-grams [1g, 2g] give similar results, but the combination of all four decompositions [1g, 2g, dl, lh] gives further improvement in correlation over their use in isolation. The results also confirm, with a PCFG, what Owczarzak et al. [2007a] found with an LFG parser: that partial-dependency matches are better correlated with human judgements than full-dependency links. We speculate that this improvement is because partial-dependency matches are more forgiving: they allow the system to detect that a word is used in the proper context without requiring its syntactic neighbors to also be translated in the same way.

4.4.3 Choosing a parse-flattening γ

Since the parser in our implementation provides a confidence in each parse, we explore the use of that confidence with the γ free parameter and N = 50 parses. Table 4.3 explores various “flattenings” (values of γ) of the parse confidence in the F [·] measure. γ = 1 is not always the best, suggesting that the parse probabilities p(tree|words) are overconfident. The differences are small, but the trends are consistent across all the decompositions tested here. We find that γ = 0.25 is generally the best flattening of the parse confidence for the variants of this measure that we have tested: it is nearest the maximum r for both


Table 4.3: Considering values of γ, N = 50 (and one N = 1 case) for two different sub-graph lists (dl, lh and 1g, 2g, dl, lh).

γ          F [1g, 2g, dl, lh]   F [dl, lh]
1          0.239                0.232
0.75       0.240                0.233
0.5        0.240                0.234
0.25       0.240                0.234
0          0.239                0.234
[N = 1]    0.237                0.226

decompositions in table 4.3, though rounding hides the exact maxima. Table 4.3 also shows the effect of using N-best parses for different decompositions. The N = 50 cases are uniformly better than N = 1. While not all of these differences are significant, there is a consistent trend of correlation r improving with 50 vs. 1 parse.

In summary, exploring a number of variants of the DPM metric against an average fluency/adequacy judgement leads to a best-case of:

    EDPM = F [1g, 2g, dl, lh], N = 50, γ = 0.25

We use this configuration in experiments assessing correlations with HTER.

4.5 Correlating EDPM with HTER

In this section, we compare the EDPM metric selected in the previous section to baseline metrics in terms of document- and segment-level correlation with HTER scores using the GALE 2.5 translation corpus [LDC, 2008]. The corpus includes system translations into English from three SMT research sites, all of which use system combination to integrate results from several systems, some phrase-based and some that use syntax on either the source or target side. No system provided system-generated parses; the EDPM measure’s parse structures are generated entirely at evaluation time. The source data includes Arabic and Chinese in four genres: bc (broadcast conversation), bn (broadcast news), nw (newswire),


Table 4.4: Corpus statistics for the GALE 2.5 translation corpus.

           Arabic            Chinese           Total
genre      doc     sent      doc     sent      doc     sent
bc         59      750       56      1061      115     1811
bn         63      666       63      620       126     1286
nw         68      494       70      440       138     934
wb         69      683       68      588       137     1271
Total      259     2593      257     2709      516     5302

and wb (web text), with corpus sizes shown in table 4.4. This data may thus be broken down in several ways — as one large corpus, by language into two corpora (one derived from Arabic and one from Chinese), by genre (into four), or by language×genre (eight subcorpora). The corpus includes one English reference translation [LDC, 2008] for each sentence and a system translation for each of the three systems. Additionally, each of the system translations has a corresponding "human-targeted" reference aligned at the sentence level, so we may compute the HTER score at both the sentence and document level. HTER and automatic scores all degrade, on average, for more difficult sentences. Since there are multiple system translations in this corpus, it is possible to roughly factor out this source of variability by correlating mean-normalized scores,7

    m̄(ti) = m(ti) − (1/I) Σ_{j=1}^{I} m(tj)

where m can be HTER, TER, BLEU4 or EDPM, and ti represents the i-th translation of segment t. Mean-removal ensures that the reported correlations are among differences in the translations rather than among differences in the underlying segments.
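A sketch of this mean-removal step, assuming the per-segment scores are indexed by (segment, system) pairs:

```python
from collections import defaultdict

def mean_remove(scores):
    """scores: {(segment_id, system_id): metric value}; returns the same
    mapping with the per-segment mean (over the I systems) subtracted."""
    per_segment = defaultdict(list)
    for (segment, _), value in scores.items():
        per_segment[segment].append(value)
    means = {seg: sum(vals) / len(vals) for seg, vals in per_segment.items()}
    return {key: value - means[key[0]] for key, value in scores.items()}
```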

Previous work [Kahn et al., 2008] reported HTER correlations against pairwise differences among translations derived from the same source to factor out sentence difficulty, but this violates independence assumptions used in the Pearson's r tests.


Table 4.5: Per-document correlations of EDPM and others to HTER, by genre and by source language. Bold numbers are within 95% significance of the best per column; italics indicate that the sign of the r value has less than 95% confidence (that is, the value r = 0 falls within the 95% confidence interval).

r vs. HTER    bc      bn      nw      wb      all Arabic   all Chinese   all
TER           0.59    0.35    0.47    0.17    0.54         0.32          0.44
−BLEU4        0.42    0.32    0.46    0.27    0.42         0.33          0.37
−EDPM         0.69    0.39    0.47    0.27    0.60         0.39          0.50

Table 4.6: Per-sentence, length-weighted correlations of EDPM and others to HTER, by genre and by source language. Bold numbers indicate significance as above.

r vs. HTER    bc      bn      nw      wb      all Arabic   all Chinese   all
TER           0.44    0.29    0.33    0.25    0.44         0.25          0.36
−BLEU4        0.31    0.24    0.29    0.25    0.31         0.24          0.28
−EDPM         0.46    0.31    0.34    0.30    0.44         0.30          0.37

4.5.1 Per-document correlation with HTER

Table 4.5 shows per-document Pearson’s r between −EDPM and HTER, as well as the TER and −BLEU4 baselines’ Pearson’s r with HTER. (We correlate with negative BLEU4 and EDPM to keep the sign of a good correlation positive.) EDPM has the best correlation overall, as well as in each of the subcorpora created by dividing by genre or by source language. In structured data (bn and nw), these differences are not significant, but in the unstructured domains (wb and bc), EDPM is always significantly better than at least one of the comparison baselines.


4.5.2 Per-sentence correlation with HTER

Table 4.6 presents per-sentence (rather than per-document) correlations based on scores weighted by sentence length, in order to get a per-word measure of correlation which reduces variance across sentences. (Even with length weighting, the r values have smaller magnitude due to the higher variability at the sentence level.) EDPM again has the largest correlation in each category, but TER has r values within 95% confidence of the EDPM scores on nearly every breakdown.
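A sketch of one reasonable reading of "length-weighted" correlation: a weighted Pearson's r in which each sentence pair contributes in proportion to its reference length (the weighting scheme is an assumption, not a statement of the exact procedure used here).

```python
import numpy as np

def weighted_pearson(x, y, weights):
    """Pearson's r with per-item weights, e.g. reference sentence lengths."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, weights))
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    var_x = np.average((x - mx) ** 2, weights=w)
    var_y = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(var_x * var_y)
```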

4.6 Combining syntax with edit and semantic knowledge sources

While the results in the previous section show that EDPM is as good as or better than the baseline measures TER and BLEU4, the correlation is still low. This result is consistent with intuitions derived from the example in section 4.2, where the EDPM score is much less than 1 for the good translation. For that reason, we investigated combining the alternative-wording features (synonymy and paraphrase) of TERp [Snover et al., 2009] with the EDPM syntactic features. The TERp tools take an entirely different approach from EDPM. Rather than introduce grammatical structure, the TERp ("TER plus") model extracts counts of multiple classes of edit operations and linearly combines the costs of those operations. These operations extend the TER operations (insert, delete, substitute, and shift) to also include "substitute-stem", "substitute-synonym" and "substitute-paraphrase" operations that rely on external knowledge sources (stemmers, synonym tables, and paraphrase tables respectively). TERp's approach thus exploits a knowledge source that is relatively well-separated from the grammatical-structure information provided by EDPM. To determine the relative cost of each class of edit operation, TERp provides an optimizer for weighting multiple simple subscores. The TERp optimizer performs a hill-climbing search, with randomized restarts, to maximize the correlation of a linear combination of the subscores with a set of human judgements. Within the TERp framework, the subscores are the counts of the various edit types, normalized for the length of the reference, where the counts are determined after aligning the MT output to the reference using default (uniform)


edit costs. The experiments here use the TERp optimizer but extend the set of subscores by including the syntactic and n-gram overlap features (modified to reflect false and missed detection rates for the TERp format rather than precision and recall). The subscores explored include:

E : the 8 fully syntactic subscores from the DPM family, including false/miss error rates for the expected values of the dl, lh, dlh, and dh decompositions.

N : the 4 n-gram subscores from the DPM family; specifically, error rates for the 1g and 2g decompositions.

T : the 11 subscores from TERp, which include matches, insertions, deletions, substitutions, shifts, synonym and stem matches, and four paraphrase edit scores.

For these experiments, we again use the GALE 2.5 data, but with 2-fold cross-validation in order to have independent tuning and test data. Documents are partitioned randomly, such that each subset has the same document distribution across source-language and genre. As in section 4.5.2, the objective is length-normalized per-sentence correlation with HTER, using mean-removed scores as before. In figure 4.4, we plot the Pearson's r (with 95% confidence interval) for the results on the two test sets combined, after linearly normalizing the predicted scores to account for magnitude differences in the learned weight vectors. The baseline scores, which involve no tuning, are not normalized.

The left side of figure 4.4 shows that TER and EDPM are significantly more correlated with HTER than BLEU when measured in this dataset, which is consistent with the overall results of the previous section. It is also worth noting that the N+E combination is not equivalent to EDPM (though it has the same decompositions of the syntactic tree), but EDPM's combination strategy yields a more robust r correlation with HTER. The N+E combination outperforms E alone (i.e. it is helpful to use both n-gram and dependency overlap) but gives lower performance than EDPM because of the particular combination technique. Both findings are consistent with the fluency/adequacy experiments in section 4.4. The TERp features (T in figure 4.4), which account for synonym/paraphrase


Figure 4.4: Pearson’s r for various feature tunings, with 95% confidence intervals. EDPM, BLEU and TER correlations are provided for comparison.

differences, have much higher correlation with HTER than the syntactic E+N subscores. However, a significant additional improvement is obtained by adding syntactic features to TERp (T+E). Adding the n-gram features to TERp (T+N) gives almost as much improvement, probably because most dependencies are local. There is no further gain from using all three subscore types.

4.7 Discussion

In summary, this chapter introduces the DPM family of dependency pair match measures. Through a corpus of human fluency and adequacy judgements, we select EDPM, a member of that family with promising predictive power. We find that EDPM is superior to BLEU4 and TER in terms of correlation with human fluency/adequacy judgements and as a per-document and per-sentence predictor of mean-normalized HTER. We also experiment with including syntactic (EDPM-style) features and synonym/paraphrase features in a TERp-style linear combination, and find that the combination improves correlation with HTER


over either method alone. EDPM’s approach is shown to be useful even beyond TERp’s own state-of-the-art use of external knowledge sources. One difference with respect to the work of Owczarzak et al. [2007a] is the use of a PCFG vs. an LFG parser. The PCFG has the advantage that it is publicly available and easily adaptable to new domains. However, the performance varies depending on the amount of labeled data for the domain, which raises the question of how sensitive EDPM and related measures are to parser quality. A limitation of this method for MT system tuning is the computational cost of parsing compared to word-based measures such as BLEU or TER. Parsing every sentence with the full-blown PCFG parser, as done here, is hundreds of times slower than these simple n-gram methods. Two alternative low-cost use scenarios include late-pass evaluation, for choosing between different system architectures, or system diagnostics, looking at relative quality of these component scores compared to those of an alternative configuration.


Chapter 5 MEASURING COHERENCE IN WORD ALIGNMENTS FOR AUTOMATIC STATISTICAL MACHINE TRANSLATION

Syntactic trees (of the type described in section 2.1) fundamentally capture two kinds of information: dependency and span. Chapters 3 and 4 primarily use dependency links in their evaluation (from word to word within the same sentence). This chapter, by contrast, explores the utility of span information in natural language processing, specifically in the analysis of automatically-generated word-alignments in statistical machine translation bitexts. Statistical machine translation (introduced and briefly sketched in section 2.4) uses word-to-word alignment as a core component in its model training, perhaps most critically as a source of aligned bitexts for the construction of the phrase table. For the creation of the phrase tables, a key concern is that bitext alignments of low quality will induce poor phrase tables. For example, a single stray alignment link can greatly reduce the number of useful phrases that may be extracted, as in figure 5.1. In hierarchical or syntactic statistical MT systems, too, incorrect alignments may lead to lower-quality phrasal structure; higher-quality alignments offer more opportunities for any of these systems to learn correct translations by example. The machine alignments in figure 5.1, for example, prevent the alignment of the noun phrase "唯一 遗憾 的" to "the only regret". It does still allow larger clusters to be mutually aligned (e.g. "唯一 遗憾 的 是" with "the only regret was in the") and a few of the smaller alignments are still possible (e.g., "唯一" may still be aligned straightforwardly to "only") but the extra alignment links in the lower alignment force the Chinese span NP1 to be incoherent: its projection in the English side of the lower alignment surrounds the projection of words (e.g. 是) that do not belong to NP1. This chapter makes explicit this mechanism for describing the coherence of a monolingual


(Figure content: the Chinese sentence "唯一 遗憾 的 是 单杠 。", with NP1 spanning "唯一 遗憾 的", aligned to the English sentence "The only regret was in the horizontal bar .")

Figure 5.1: A Chinese sentence (about the 2008 Olympic Games) and its translation, with reference alignments (above) and alignments generated by unioned GIZA++ (below). Bold dashed links in the lower link-set indicate alignments that force NP1 to be incoherent.

span in an aligned bitext, and explores the coherence of syntactically-motivated spans over alignments generated by human and machine. Further exploration uses this measure of coherence to choose among alignment candidates derived from multiple machine alignments, and a following approach uses coherent regions to assemble a new, improved alignment from two automatic alignments. Section 5.1 describes the relevant background for this chapter. Section 5.2 outlines the notion of coherence used here, and describes how it is computed on a given span. Section 5.3 outlines the preparation of data for the explorations performed here: the corpora of Chinese-English bitexts and manual alignments, and the construction of several automatic alignments for comparison with these coherence metrics. Section 5.4 examines the performance of the various alignment systems in terms of alignment quality (against manual alignments) and the coherence of certain linguistically-motivated categories, and demonstrates that the coherence measures correspond to the alignment quality of those systems. In section 5.5, we explore using the coherence measures to select a better alignment from


a pool of alignment candidates, and section 5.6 explores the creation of hybrid alignments by combining members from the varied system-alignments assembled here. Section 5.7 discusses the implications (linguistic and practical) of these findings.

5.1 Background

Word alignments, as discussed in section 2.4.1, are an important part of the preparation of a parallel corpus for the training of statistical machine translation engines. A wide variety of statistical systems build their models off of aligned parallel corpora – whether to extract word-by-word translation parameters, as in the IBM models [Brown et al., 1990], "phrase" tables as in Moses [Koehn et al., 2007], or more syntactically-involved systems such as the Galley et al. [2006] syntactic translation models. As a tool for building and evaluating these aligned parallel corpora, Och and Ney [2003] proposed an alignment evaluation scheme, "alignment error rate" (AER), in the hope that an intrinsic measure of evaluating alignments could shorten the development cycle for new statistical machine translation systems (eliminating the need to try the entire pipeline). AER is based on an F-measure over reference alignments ("sure", S) and proposed (A) alignment links:

    AER(S, A) = 1 − 2 × |S ∩ A| / (|S| + |A|)                              (5.1)

This formulation measures individual links rather than groups of links. A variety of other systems have explored using supervised learning over manually-aligned corpora to improve AER, with some success, including Ayan et al. [2005a,b], Lacoste-Julien et al. [2006] and Moore et al. [2006], who mostly focused on improving AER over English-French parallel corpora. Other metrics exist, e.g. CPER [Ayan and Dorr, 2006], which measures the F of possible aligned phrases for inclusion in the phrase table, but in pilot experiments we found that per-sentence oracle CPER was not consistent with an improvement in global CPER performance. Fraser and Marcu [2006, 2007] find that optimizing alignments towards an AER variant
The original formulation of AER was defined with both “sure” and “possible” reference alignment links. No reference data available for this task uses “possible” alignment links, so only the simplified version is presented here.

that weights recall more heavily than precision improves BLEU performance on the language pairs they explored (English-Romanian, English-Arabic, and English-French). However, Fossum et al. [2008] find that they can improve a syntactically-oriented statistical machine translation engine by improving precision; their work focuses on deleting individual links from existing GIZA++ alignments, using syntactic features derived directly from the syntactic translation engine for Chinese-English and Arabic-English translation pairs. Another approach to improving alignments with grammatical structure is to do simultaneous parsing and use the parse information to (tightly or loosely) constrain the alignment, as in Lin and Cherry [2003], Cherry [2008], Haghighi et al. [2009] and Burkett et al. [2010], who constrain parsers of one (or both) languages to engage in the parallel alignment process. Rather than combine parse or span constraint information into a machine translation or alignment decoder, this chapter explores span coherence measures (with spans derived from a syntactic parser of Chinese) to select from multiple machine translation alignment candidates over a corpus of manually-labeled Chinese-English alignments. Since evidence for preferring an alignment error measure that over-weights precision or recall seems to be ambiguous (and possibly dependent on the choice of translation engine), we retain AER as the measure of alignment quality, and we explore the coherence measures' ability to help select alignments to reduce AER.
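A minimal sketch of equation (5.1), treating each alignment link as an (English index, Chinese index) pair:

```python
def aer(sure_links, proposed_links):
    """sure_links (S) and proposed_links (A): iterables of (e, f) index pairs."""
    s, a = set(sure_links), set(proposed_links)
    if not s and not a:
        return 0.0
    return 1.0 - 2.0 * len(s & a) / (len(s) + len(a))
```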

5.2 Coherence on bitext spans

We define a span s to be any region of adjacent words fi · · · fk on one side (here the source language) of a bitext. Given a set of links a of the form ⟨em, fn⟩, we define the projection of a span to be all nodes e such that a link exists between e and some element within s. We further define the projected range s̄ of the span s to be:

    s̄ = e_{min{i : ei ∈ proj(s)}} · · · e_{max{i : ei ∈ proj(s)}}

and we define the reprojection of the span s to be the projected range of s̄ (identifying a range of nodes in the same sequence as s). We may thus describe a span s as coherent when the reprojection of s is entirely within s. However, we find it useful to categorize spans into four categories, characterized


Table 5.1: Four mutually exclusive coherence classes for a span s and its projected range s̄.

coherent       The reprojection of s is entirely within s
null           No link includes any term in fi . . . fk
subcoherent    s is not coherent, but s̄ is coherent
incoherent     Neither s nor s̄ is coherent


Figure 5.2: Examples of the four coherence classes. s1 is coherent (because it is its own reprojection); s0 is null; s2 is incoherent (because its reprojection is s1 rather than a subset of s2 ); and s3 is subcoherent (because its projection span s1 is coherent).

in table 5.1. Figure 5.2 also includes examples of each of the coherence classes. While the coherent, incoherent, and null coherence classes are fairly easily explained, subcoherent spans are worth a brief digression: these spans often appear in alignments of two corresponding phrases with non-compositional meanings. Such phrases often form a complete bipartite subgraph, in that every source word in the phrase is linked to every target word in the phrase. Any span that includes less than the entire phrase (on one side or the other) will be subcoherent. Unlike AER, coherence is not a measure against the reference alignment; it is instead a measure of a particular span's behavior in an alignment. It is not necessarily a sign of a high-quality alignment, but section 5.4 explores how coherence corresponds with AER over a pool of automatic alignment candidates.
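A minimal sketch of the four-way classification of table 5.1, under the assumption that links are (e_index, f_index) pairs and a span is an inclusive (lo, hi) range of source-side positions:

```python
def project(links, span, from_source=True):
    """Positions on the other side linked to any position inside `span`."""
    lo, hi = span
    if from_source:                                   # span is on the f side
        return {e for e, f in links if lo <= f <= hi}
    return {f for e, f in links if lo <= e <= hi}

def span_range(positions):
    return (min(positions), max(positions))

def _is_coherent(links, span, from_source):
    proj = project(links, span, from_source)
    bar = span_range(proj)                            # projected range
    reproj = span_range(project(links, bar, not from_source))   # reprojection
    return span[0] <= reproj[0] and reproj[1] <= span[1]

def coherence_class(links, span):
    """Classify a source-side span into the classes of table 5.1."""
    if not project(links, span, True):
        return "null"
    if _is_coherent(links, span, True):
        return "coherent"
    bar = span_range(project(links, span, True))
    return "subcoherent" if _is_coherent(links, bar, False) else "incoherent"
```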


Table 5.2: GALE Mandarin-English manually-aligned parallel corpora used for alignment evaluation and learning. Numbers here reflect the size of the newswire data available from each corpus. Note that Phase 5 parts 1 and 2 (LDC2010E05 and LDC2010E13) had no newswire data included.

LDC-ID    Name                                                          sentences   English words   Chinese words
2009E54   Chinese Word Alignment Pilot                                  290         8,818           6,329
2009E83   Phase 4 Chinese Alignment Tagging Part 1                      2,092       76,487          55,145
2009E89   Phase 4 DevTest Chinese Word Alignment Tagging                2,829       101,484         73,794
2010E37   Phase 5 Chinese Parallel Word Alignment and Tagging Part 3    962         33,537          22,018
Total                                                                   6,173       220,327         157,286

5.3 Corpus

These experiments focus on the alignment of Chinese-English parallel corpora. They make use of both unaligned (sentence-aligned but not word-aligned) corpora and manually-aligned corpora (aligned by both sentence and word). The key corpora for the experiments in this chapter are the manual alignments generated by the GALE [DARPA, 2008] project. These alignments take text and transcripts of spoken Mandarin Chinese and translations of both into English and provide manual annotation of alignment links between the English and Chinese words. Since Chinese word segmentation is not given by the text, the manual alignments link English words to Chinese characters, even when more than one Chinese character is required to form a word. Table 5.2 lists the sets of manual corpora used to evaluate the aligners (and to train the rerankers). Chinese word counts are listed using the number of words provided by automatic segmentation, and the alignment links (which were manually aligned to individual characters) are collapsed to link the English words to segmented Chinese words (rather than characters). The experiments in this chapter use only the newswire segments of these corpora, so (although other genres of text and transcript are available) only those numbers and sizes are reported here.
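A sketch of the link-collapsing step just described, assuming the manual links are given as (English word index, Chinese character index) pairs and the Chinese side has already been automatically segmented:

```python
def collapse_links(char_links, chinese_words):
    """Map each Chinese character index to its containing segmented word
    and merge duplicate links."""
    char_to_word = {}
    char_pos = 0
    for word_idx, word in enumerate(chinese_words):
        for _ in word:                  # one entry per character of the word
            char_to_word[char_pos] = word_idx
            char_pos += 1
    return {(e, char_to_word[c]) for e, c in char_links}
```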


5.3.1 Corpus preparation

The analyses in this chapter are based on a comparison among these manual alignments and those generated by automatic systems. The most popular automatic aligners, GIZA++ [Och and Ney, 2003] and the Berkeley Aligner [DeNero and Klein, 2007], are unsupervised, but require training on very large bodies of parallel text; here we do a similar training to avoid overly pessimistic automatic alignment results. Table 5.3 lists the component corpora used to train the unsupervised aligners. As in table 5.2, the Chinese word count reflects the number of word tokens returned by automatic segmentation. State-of-the-art SMT systems for Chinese-to-English translation do word segmentation and text normalization (the replacement, for example, of numbers and dates by $number and $date tokens) before providing parallel text to the unsupervised aligner. In order to provide automatic alignments for the corpora in table 5.2, all the corpora (both aligned and unaligned, though alignments were discarded at this stage) were passed through the Stanford word segmenter [Chang et al., 2008] and the SRI/UW GALE text normalization system (on the Chinese side) and the RWTH text normalization system (on the English side). Three aligners were trained on the resulting segmented and normalized parallel text:

• the Berkeley aligner [DeNero and Klein, 2007], which uses a symmetric alignment strategy and which we refer to hereafter as berkeley;

• the GIZA++ aligner [Och and Ney, 2003], projecting from source-to-target (f -e), which we refer to as giza.f-e; and

• the GIZA++ aligner, projecting from target-to-source (e-f), referred to as giza.e-f.

For each of the giza trainings, we further generate multiple additional alignment candidates: the giza.e-f.NBEST and giza.f-e.NBEST lists retrieve the N = 10 best alignments from each of the GIZA++ trainings. The berkeley system does not support N-best generation. The parallel corpora from table 5.3 are then discarded: their only role is to train the unsupervised aligners described above. Over the parallel text corpora in table 5.2, all of


Table 5.3: The Mandarin-English parallel corpora used for alignment training
Name                                                              Sentences   English words   Chinese words
ADSO Translation Lexicon                                            179,284         265,705         267,466
Chinese English News Magazine Parallel Text                         269,479       9,233,773       8,826,377
Chinese English Parallel Text Project Syndicate                      45,767       1,069,021       1,129,198
Chinese English Translation Lexicon (v3.0)                           81,521         135,261          93,073
Chinese News Translation Text Part 1                                 10,264         314,377         279,512
Chinese Treebank English Parallel Corpus                              4,064         123,825          92,996
CU Web Data (Oct 07)                                                 34,811         883,886         894,809
FBIS Multilingual Texts                                             123,950       4,037,811       3,011,172
Found Parallel Text                                                 180,222       5,345,040       4,713,169
GALE Phase 1 Chinese Blog Parallel Text                               8,620         185,637         166,508
GALE Phase 2r1 Translations                                          14,768         347,480         286,093
GALE Phase 2r2 Translations                                           4,794         128,111         104,330
GALE Phase 2r3 Translations                                          21,360         387,458         322,524
GALE Phase 3 OSC Alignment (v1.0.FOUO)                                4,915         183,812         134,975
GALE Phase 3r1 Translations                                          40,503         643,745         595,411
GALE Phase 3r2 Translations                                           5,786         177,357         149,250
GALE Y1 Interim Release Translations                                 20,926         446,367         398,043
GALE Y1Q1 Translations                                                6,618         147,574         128,740
GALE Y1Q2 FBIS NVTC Parallel Text (v2.0)                            404,368      14,729,700      12,070,648
GALE Y1 Q2 Translations (v2.0)                                        9,382         194,171         172,106
GALE Y1 Q3 Translations                                              11,879         283,354         247,961
GALE Y1 Q4 Translations                                              30,496         572,210         506,563
Hong Kong Parallel Text                                             699,665      16,154,447      14,650,516
MITRE 1997 Mandarin Broadcast News Speech Translations (HUB4NE)      19,672         414,762         365,157
UMD CMU Wikipedia translation                                        77,162         181,592         145,069
Xinhua Chinese English Parallel News Text (v1β)                     103,415       3,455,994       3,411,085
Total                                                             2,413,691      60,042,470      53,162,751


the resulting alignments (including the reference alignments) were reconciled with the pre-normalized text (using dynamic programming to synchronize the English side and retrieving the original text from the SRI/UW GALE text-normalization system), but the Chinese word segmentation was retained (for compatibility with later parsing). Finally, we generate still more alignment candidates by performing union and intersection on the giza candidates and their corresponding N-best lists:

• The giza.union alignment is the union of giza.e-f and giza.f-e.

• Correspondingly, the giza.intersect alignment is the intersection of giza.e-f and giza.f-e.

• The giza.union.NBEST and giza.intersect.NBEST alignments choose alignments from the giza.e-f.NBEST and giza.f-e.NBEST lists and union (or intersect) them. In these experiments, the ranks of the e-f and f-e elements in these unions or intersections are constrained to sum to no more than N + 1 = 11 (e-f_2 union f-e_9 is acceptable because 2 + 9 = 11 ≤ 11, but e-f_5 union f-e_7 is not included because 5 + 7 = 12 > 11); a sketch of this candidate generation follows below.
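The rank-constrained candidate generation referenced in the last bullet can be sketched as follows; the alignment and list representations (sets of links, 1-based ranks) are illustrative assumptions.

```python
# Illustrative sketch of giza.union.NBEST / giza.intersect.NBEST generation:
# combine the rank-r candidate of giza.e-f with the rank-s candidate of
# giza.f-e whenever r + s <= N + 1. Alignments are sets of (f, e) links.

N = 10

def nbest_combinations(ef_nbest, fe_nbest, combine):
    """Yield ((r, s), combined alignment) for every rank pair with r + s <= N + 1.
    Ranks are 1-based: e-f_2 with f-e_9 is allowed, e-f_5 with f-e_7 is not."""
    for r, ef in enumerate(ef_nbest, start=1):
        for s, fe in enumerate(fe_nbest, start=1):
            if r + s <= N + 1:
                yield (r, s), combine(ef, fe)

def union(a, b):
    return a | b

def intersect(a, b):
    return a & b

# giza_union_nbest = list(nbest_combinations(ef_nbest, fe_nbest, union))
# giza_intersect_nbest = list(nbest_combinations(ef_nbest, fe_nbest, intersect))
```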

5.3.2 Alignment performance of automatic aligners

Table 5.4 reports the AER, as computed against the manual labeling of the corpus, for the five automatic alignments (excepting the N-best lists). It also includes the precision, recall, and link density (in proportion to reference ldc link density). The Berkeley aligner has the best AER, and its nearest competitor (the giza.union alignment) has the best alignment recall. As one might expect, the giza.intersect alignments have the highest precision, but this high precision comes at a substantial cost in recall (and AER). This table also includes the per-sentence AER oracle alignment, which reflects the AER (as well as precision and recall) of choosing, for each segment, the single alignment from the pool of five with the best AER. (The per-sentence oracle is not necessarily the best possible overall AER from this candidate pool: under some circumstances, especially when the precision/recall proportions are very imbalanced, minimizing sentence-level AER may increase global AER.)


Table 5.4: Alignment error rate, precision, and recall for automatic aligners. Link density is in percentage of ldc links.

System                   AER     Precision   Recall   Link density (%)
berkeley                 32.87   84.21       55.81     68.52
giza.e-f                 36.46   70.14       58.08     88.22
giza.f-e                 40.31   76.78       48.82     67.06
giza.intersect           42.37   96.78       41.04     32.90
giza.union               35.42   63.34       65.87    122.38
(per-sentence) oracle    30.12   80.01       62.02     75.78

The Berkeley system, we may note, seems to be strongly precision-heavy, while its competitor giza.union is more balanced, but stronger on recall. As we might expect, the giza.intersect and giza.union alignments have the lowest and highest link density respectively. Precision seems to rise as link density drops (again, not unexpectedly), but berkeley is more precise than either of the directional giza systems while having a higher link density. Even the oracle selection has a lower link density than 100%, because most of the candidates from which the per-sentence oracle selects are lower-density than the reference ldc alignments.
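As a reminder of how the numbers in table 5.4 are computed, the sketch below scores a hypothesized link set against the reference; it assumes the standard Och and Ney [2003] formulation in the simplified case where every reference link counts as both sure and possible, so that AER reduces to one minus the F-measure, and it reports link density as a percentage of the reference (ldc) links.

```python
# Illustrative scoring sketch. hyp and ref are sets of (f, e) links; assuming
# every reference link is both "sure" and "possible", AER = 1 - F.

def alignment_scores(hyp, ref):
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    aer = 1.0 - (2.0 * correct) / (len(hyp) + len(ref))
    link_density = 100.0 * len(hyp) / len(ref)   # percentage of ldc links
    return {"AER": aer, "Precision": precision,
            "Recall": recall, "Link density (%)": link_density}
```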

5.4 Analyzing span coherence among automatic word alignments

This section poses two questions:

• what kinds of spans are reliably coherent in reference alignments?

• what varieties of coherent spans are not captured well by current alignment algorithms?

We examine the coherence of reference alignments to answer the first question, and compare those coherence statistics to those of the unsupervised automatic alignment systems. We explore both syntactic and orthographic (not explicitly syntactic) techniques for identifying spans over the Chinese source sentences.


Table 5.5: Coherence statistics over the spans delimited by comma classes

                                        Alignment (% of spans)
span (count)        coherence   ldc (ref)   giza.union   berkeley   giza.intersect
comma (16,932)      yes              77.9         23.4       58.9             83.1
                    no               19.2         60.6       35.3             13.7
                    sub               2.8         16.0        5.3              0.0
                    null              0.1          0.0        0.5              3.2
tadpole (14,682)    yes              83.5         22.8       62.1             88.0
                    no               14.0         59.5       32.6             10.0
                    sub               2.3         17.7        5.0              0.0
                    null              0.1          0.0        0.3              2.0

5.4.1 Orthographic spans

The first class of spans we consider are those that may be extracted from orthographic "segment" choices, namely, spans that are delimited by commas on the Chinese side of the bitext. This delimitation is made more complicated by a property of Chinese orthography: in many Chinese texts, a special "enumeration comma" (Unicode uses code U+3001 for this symbol, 、, and unfortunately dubs it IDEOGRAPHIC COMMA, which is a misnomer: its Chinese name is 顿号, which may be glossed as pause symbol) is used to delimit items in a list. If this standard were used uniformly, it would actually be a useful distinction, but the enumeration comma is only used sometimes: on many occasions, when an enumeration comma would be correct, Chinese writers or typesetters will use U+002C COMMA, which we dub a "tadpole" comma to distinguish it from the combined class that includes the enumeration comma. Nevertheless, the converse error (using enumeration commas when tadpole commas are appropriate) does not seem to occur in the corpus, so there is still information available in using only the tadpole commas as delimiters. We explore dividing sentences into orthographic regions using the orthographic dividers of comma-delimited spans and "tadpole"-delimited spans.


These delimiters, when present, divide the sentence into non-overlapping regions. Table 5.5 shows the distribution of coherence values for comma-delimited spans and for tadpole-delimited spans over the manually- and automatically-generated alignments of the corpora in table 5.2. Sentences without a comma delimiter are omitted from the counts here; otherwise the proportion of coherence would be (trivially) higher in all alignments, since a span covering the entire sentence is always coherent. From the differences in the reference (ldc) alignments, we can see that using tadpole spans instead of comma spans improves the proportion of coherent spans to 83.5% and reduces the proportion of incoherent spans; this result speaks to the utility of excluding the enumeration comma from use as a delimiter. Also, we may observe that the berkeley alignments, which have the best AER of the three automatic systems compared here, also consistently have intermediate values (between the low-recall giza.intersect system and the low-precision giza.union) in all four of the coherence classes. Among these orthographic spans, the high-precision, low-recall giza.intersect system performs the closest to the ldc manually-annotated alignments, but overpredicts both coherence and null links, probably because its link density is too low overall. By comparing the coherence measures over these orthographic spans, we find supporting evidence that the berkeley alignments are the best (because they are the most similar to the reference). It is difficult to say whether the orthographic (comma- and tadpole-delimited) spans are useful constraints on alignment regions for evaluating alignment accuracy, however: the giza.union and giza.intersect results confirm that their over-linking and under-linking (respectively) cross even these comma delimiters. Moreover, orthographic delimitation represents a mixture of different linguistic phenomena, so we turn instead to grammatical span exploration.
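The two orthographic span inventories can be extracted with a few lines; treating the full-width comma U+FF0C alongside U+002C as a tadpole delimiter is an assumption of this sketch, since the text above names only U+002C explicitly.

```python
# Illustrative sketch of the two orthographic span inventories. "Comma" spans
# are delimited by any comma, including the enumeration comma U+3001 (、);
# "tadpole" spans ignore U+3001.

TADPOLE = {",", "\uFF0C"}          # U+002C and (assumed) full-width U+FF0C
ENUMERATION = {"\u3001"}           # 、 the "enumeration comma"

def delimited_spans(words, delimiters):
    """Return (lo, hi) index ranges of the maximal regions between delimiters."""
    spans, start = [], 0
    for i, w in enumerate(words):
        if w in delimiters:
            if i > start:
                spans.append((start, i - 1))
            start = i + 1
    if start < len(words):
        spans.append((start, len(words) - 1))
    return spans

words = "苹果 、 香蕉 ， 很 好吃".split()
print(delimited_spans(words, TADPOLE | ENUMERATION))  # comma spans: [(0, 0), (2, 2), (4, 5)]
print(delimited_spans(words, TADPOLE))                # tadpole spans: [(0, 2), (4, 5)]
```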

5.4.2 Syntactic spans

To explore the information available from the parser, we parse each of the source sentences in the aligned corpus with a parser [Harper and Huang, 2009] tuned to produce Penn Chinese


Table 5.6: Coherence statistics over the spans delimited by certain syntactic non-terminals

                                Alignment (% of spans)
span (count)     coherence   ldc    giza.union   berkeley   giza.intersect
NP (59,635)      yes         72.4         44.5       74.9             81.0
                 no          16.4         44.6       17.0              4.1
                 sub         10.0         10.9        3.9              0.0
                 null         1.2          0.0        4.2             14.9
VP (37,167)      yes         64.4         26.3       64.0             78.5
                 no          20.7         58.1       28.1              9.6
                 sub         14.4         15.7        5.3              0.0
                 null         0.6          0.0        2.6             11.9
IP (14,738)      yes         65.7         22.7       59.8             84.2
                 no          17.9         60.6       33.9             10.4
                 sub         16.2         16.7        5.5              0.0
                 null         0.2          0.0        0.7              5.4

Treebank [Xue et al., 2002] parse trees. The Chinese word segmentation from the alignment steps in section 5.3 is retained. Table 5.6 shows the same systems as the previous section, but using instead the spans labeled by the parser as NP, VP, or IP (noun, verb, or inflectional phrase; IP is the Chinese Treebank equivalent of a sentence or small clause). As before, spans that cover the entire sentence are not included in these counts; by definition such spans are always coherent, so counting them is not informative. We choose these categories because they are three core non-terminal categories of the treebank, each with a strong and relatively theory-agnostic linguistic basis. Furthermore, these three categories together make up more than 70% of the non-terminals in the parse trees produced by the automatic parsers used in these experiments. It is reasonable to expect that most of these phrases are coherent in the reference alignment, and indeed they are (72.4% coherent NPs, 64.4% coherent VPs,


and 65.7% coherent IPs). Again, we may observe in table 5.6 that the berkeley system's coherence is intermediate between the giza.union and giza.intersect systems' coherence values. For these syntactic spans, the berkeley alignments are much closer to the human labels than the giza.intersect alignments, which substantially overpredict coherence of these smaller units. However, the berkeley alignments overpredict incoherent spans on VPs and IPs, and giza.union also overpredicts incoherence on NPs. Together, these results suggest that the union alignments are too link-dense, the intersect alignments too sparse, and the berkeley alignments just about right, although berkeley still seems to make syntactically-unaware errors, inducing incoherent spans. It is interesting to note that the giza.intersect results are actually over-coherent, due to their low link density, but that alignment also has a worse AER (due to its low recall). Accordingly, high coherence is not necessarily neatly correlated with improvements to AER.

5.4.3 Syntactic cues to coherence

We find in the previous two subsections that (for example) though roughly 83.5% of tadpole spans are coherent under the ldc (reference) alignments, only about 65% of IP and VP spans are coherent in the reference alignments. These proportions are low enough, for the syntactic classes, to suggest inquiry into what characteristics indicate that a span with a given syntactic label XP is likely to be coherent. To explore this question, we build a binary decision tree using the WEKA toolkit [Hall et al., 2009] over each collection of XP spans (where XP ∈ {NP, VP, IP}), where the decision tree is binary over the following syntactic features from that XP's structure (a sketch of the feature extraction appears after the list):

• whether that span is also a tadpole-delimited span,

• the syntactic tags of that XP's syntactic children,

• the syntactic tags of that XP's syntactic parents, and

• the length of the XP in question.
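The per-span feature rows behind these decision trees can be sketched as below; the tree-node interface (label, start, end, children, parent) is an assumed representation, the coherence_class helper is the one from the earlier coherence sketch, and the rows could be exported (e.g. as an ARFF file) for the WEKA learner actually used here.

```python
# Illustrative sketch: one feature row per XP span, used to train a binary
# decision tree predicting span coherence in the reference alignment.
# Node objects with .label, .start, .end, .children, .parent are assumed.

def ancestor_labels(node):
    """Syntactic labels of the node's parents, nearest first."""
    labels = []
    while node.parent is not None:
        node = node.parent
        labels.append(node.label)
    return labels

def span_feature_row(xp_node, tadpole_spans, reference_links):
    lo, hi = xp_node.start, xp_node.end
    return {
        "is_tadpole_span": (lo, hi) in tadpole_spans,
        "child_tags": "+".join(c.label for c in xp_node.children),
        "parent_tags": "+".join(ancestor_labels(xp_node)),
        "length": hi - lo + 1,
        # class attribute: is this span coherent in the reference alignment?
        "coherent": coherence_class(reference_links, lo, hi) == "coherent",
    }
```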


[Figure 5.3 shows the top fork of each decision tree. (a) VP tree: of the VP spans (64% coherent overall), those with a non-terminal parent of CP (2,291 spans) are 29.7% coherent, while the remaining VP spans (34,876) are 66.6% coherent. (b) IP tree: of the IP spans (65.7% coherent overall), those with a non-terminal parent of CP (3,362 spans) are 29.9% coherent, while the remaining IP spans (11,376) are 75.6% coherent.]

Figure 5.3: Decision trees for VP and IP spans. The decision tree did not find error-reducing distinctions among the NP spans.

These features were chosen as a reasonable characterization of the syntactically-local information (roughly parallel to the information provided on arc-links in the SParseval measure in chapter 3). Although some IP spans (unsurprisingly) cover the entire sentence, spans over the entire sentence are not included in this analysis (whole-sentence spans are always coherent, by definition). Figure 5.3 shows the top forks of the VP and IP decision trees (the single decision offering the greatest error reduction). We may observe an interesting commonality: in both IPs and VPs, the majority of spans with a parent label of CP ("complementizer phrase") are not coherent in the reference alignment. Anecdotally, the CP-over-IP construction seems to occur incoherently in the bitext when the CP-over-IP marks a construction that is divided in English, e.g. the example in figure 5.4. This kind of construction, which may be expressed in English as a pre-and-post-modified NP ("[np [np the largest X] in the world]") or a left-modified NP ("[np the world's largest X]"), is likely to be incoherent, despite having a uniform analysis as CP-over-IP in Chinese ("[cp [ip 世界 最大 X]]"). It is also worth observing that the IP node in this example has a unary expansion to a VP predicate (with two parts), and so accounts for some of the same


[Figure 5.4 shows the nodes cp1 and ip1 in the parse of 泰国 是 世界 最 大 的 稻米 出口国 。 (gloss: Thailand be world most big COMP rice export+nation), aligned to the English translation "Thailand is the largest rice exporter in the world."]

Figure 5.4: An example incoherent CP-over-IP. Note that ip1's reprojection to English is actually larger than ip1 itself (since it includes "rice exporter" within its English projection span) and larger than cp1 in which it is embedded. Had the English translation chosen the phrasing "the world's largest rice exporter", ip1 would be coherent.

incoherent CP-over-VP spans in figure 5.3(a) as well. It is also of note (though not visible in figure 5.3) that the tadpole feature was available to the decision tree but was never selected, even when the tree was allowed to ramify further, suggesting that this orthographic information is not useful in determining the coherence of these syntactic spans. From table 5.6 and figure 5.3, we may see that the base rate of coherence for NPs, VPs and IPs (at least, those not immediate children of CPs) is at about 70% for each, with IPs being particularly promising — at 76% coherence — but relatively rare.

5.4.4 A qualitative survey of incoherent IP spans

Inspired by the example in figure 5.4, we extracted 50 spans labeled IP by the parser that were incoherent according to the ldc reference alignment. The categories suggested here were selected to attempt to characterize the reasons that these parser-indicated spans are incoherent. The most common category of differences (about 30%) was alignments where a clause-modifying adverb was used in English somewhere other than the left or right edge of the clause (and where, in Chinese, the clause-modifier lives outside the IP). One common scenario among those examined here was a clause-external Chinese adverb that is aligned


Table 5.7: Some reasons for IP incoherence

Reason                                                                   n
Sentential adverb between subject and main verb in English             14
IPs in conjunction: English-language ellipsis; Chinese repeated word     8
Two-part predicates in Chinese pre- and post-modify noun in English      6
Punctuation differences (periods inside quotes)                          3
Other translation divergences                                           10
Parsing attachment errors introducing incoherence                        9

with an English adverb after the first (finite) verb in an English clause, as in figure 5.5. Nearly 20% of the incoherent spans were incoherent because of parsing attachment errors, usually because a Chinese adverb was attached low within an adjacent small IP when it should have been treated as a sentential modifier. Improving parser performance on the correct attachment of clausal adverbs would be valuable here. Another key challenge is found when Chinese and English disagree on whether all components of a conjunct need to be repeated. Although Chinese omits pronouns in many circumstances, it was the English conjunction of subject-less VPs that introduced incoherence: about 16% of the incoherent IP spans are attributable to ellipsis on the English side (alternatively, a choice to repeat a term in Chinese which is left out in English), e.g. as in figure 5.6. The remaining categories of incoherent IP include Chinese two-part CP/IP predicates that pre- and post-modify a noun in English (about 12%, as in figure 5.4) and a small variety of others.

5.5 Selecting whole candidates with a reranker

In previous sections, we see evidence that the coherence of certain categories of span corresponds to alignment quality, and that it is, at least in principle, possible to select an alignment with better AER from the pool of candidates: the oracle scores in table 5.4 demonstrate


[Figure 5.5 shows the Chinese parse of 该指数 也有 被 投资者 用作投资指南 (glossed in the figure as: this index, sometimes, by, investors, use as investment-guide), translated as "This index has sometimes been used by investors as an investment guide."]

Figure 5.5: An example of a clause-modifying adverb (也有) appearing inside a verb chain. Note that ldc alignment links link the lowest VP to English has, been, and used, so that the projection of the lower IP contains the projection of the upper ADVP and LB spans.

that a better AER is possible. As in chapter 3, we establish a reranking paradigm, where alignment candidates are converted into feature vectors by a feature extraction step, and (in training) each candidate's optimization target (in this case, AER) is converted to a rank for training an svmrank learner over the training candidates. The learner ranks the candidates in the pool, and we report AER over the learner's choice of top-ranked candidates. Because of the relatively small amount of labeled data, we report results here over ten-fold cross-validation. This arrangement allows variation in two key experimental variables:

Candidate pool: We may include candidates from all of the aligners available, or only certain subsets. We define two pools of interest:

• experts is the "committee of experts" made up of all of the direct outputs of the automatic aligners (berkeley, giza.e-f, giza.f-e, giza.intersect, and giza.union)

[Figure 5.6 shows the Chinese parse and reference alignment for a sentence about Russia (俄罗斯) in which 增长 ("growth") appears twice, once in the 1997年估计 ("1997-estimated") clause and once in the 1998年预计 ("1998-estimated") clause, translated as "Russia's estimated growth is 0.5% for 1997, and 1.5% for 1998."]

Figure 5.6: An example of English ellipsis where Chinese repeats a word (增长, "growth"). The English translation has only one "growth". To link both 增长 nodes to the English "growth" requires at least one of the lower IP nodes to be incoherent.



• giza.union.NBEST is the pool generated by performing the union operation on the members of the giza.e-f.NBEST and giza.f-e.NBEST lists.

Feature selection: We may use any of a variety of features to rerank members of the candidate pool. We define the following features (a sketch of the feature extraction appears after the list):

• voting features include a binary feature for each expert system (e.g., berkeley or giza.union); if a candidate is generated by that system, this feature is true, and otherwise false.

• span-X features represent four features: the proportion of spans of type X that are coherent (span-X-yes), subcoherent (span-X-sub), null-coherent (span-X-null), or incoherent (span-X-no). For example, we may use span-NP features, which describe the coherence (or non-coherence) of the noun phrases in the sentence.
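The following sketch assembles one such feature vector per alignment candidate; the dictionary representation, the SHORT name mapping, and the reuse of coherence_class from the earlier coherence sketch are illustrative assumptions (the actual experiments feed numbered features to svmrank).

```python
# Illustrative sketch of reranker features: voting indicators plus span-X
# coherence proportions for each candidate alignment.

SYSTEMS = ["berkeley", "giza.e-f", "giza.f-e", "giza.intersect", "giza.union"]
SHORT = {"coherent": "yes", "incoherent": "no", "subcoherent": "sub", "null": "null"}

def candidate_features(candidate_links, produced_by, spans_by_type):
    """spans_by_type maps a span type (e.g. 'NP' or 'tadpole') to its (lo, hi) spans;
    produced_by is the set of expert systems that output this candidate."""
    feats = {"voter:" + name: float(name in produced_by) for name in SYSTEMS}
    for span_type, spans in spans_by_type.items():
        if not spans:
            continue
        counts = {"yes": 0, "no": 0, "sub": 0, "null": 0}
        for lo, hi in spans:
            counts[SHORT[coherence_class(candidate_links, lo, hi)]] += 1
        for cls, n in counts.items():
            feats["span-%s-%s" % (span_type, cls)] = n / len(spans)
    return feats
```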

5.5.1 Selecting candidates from expert committee

We conjecture that a reranking approach would help to select better alignments from a pool of alignment candidates generated by a diverse set of aligners. In this experiment, the alignment candidates berkeley, giza.e-f, giza.f-e, giza.intersect, and giza.union are included in the pool to be reranked. For reranker features, we consider the span-comma, span-tadpole, and span-IP features. Because of the IP coherence patterns gleaned from figure 5.3(b), we further include span-nonCP-IP, which includes features of those IP spans that are not the direct children of CP constituents. Finally, we include an experiment using span-NT features, which are the set of features including span-X for all non-terminal symbols used in the Chinese treebank (span-IP, span-DNP, span-QP, etc.). For all learners, we include the voter feature to allow the reranker to include a learned estimate of the quality of each committee member. As a baseline, we include a voter-only feature set, which learns, as one might expect, to always select the committee member with the best overall AER. Table 5.8 shows the results of reranking the members of the committee of quality experts.


Table 5.8: Reranking the candidates produced by a committee of aligners.

Identity                           AER     Precision   Recall
isolated:
  berkeley                         32.87   84.21       55.81
  giza.e-f                         36.46   70.14       58.08
  giza.f-e                         40.31   76.78       48.82
  giza.union                       35.42   63.34       65.87
  giza.intersect                   42.37   96.78       41.04
voter only (~berkeley)             32.87   84.21       55.81
voter & span-tadpole               32.95   83.49       56.02
voter & span-comma                 33.13   82.00       56.46
voter & span-IP                    33.10   82.69       56.18
voter & span-nonCP-IP              33.02   82.81       56.23
voter & span-NT                    34.09   76.65       57.80
(per-sentence) oracle              30.12   80.01       62.02

Relative to the span-IP features, the span-nonCP-IP features have a larger improvement in recall with a smaller loss to precision, which suggests that using spans which are generally expected to be coherent may be helpful in this kind of reranking. However, we also observe that none of the rerankers (in the lower half of the table) actually reduce AER from the baseline (voter-only) system: instead, they boost recall, at varying costs to precision. We also observe that the more features are involved, the larger the effect on recall, with span-NT having the largest impact. For those cases with the same number of features (e.g. span-IP and span-tadpole), the features with more spans in the data generally have a larger impact. In retrospect, this is unsurprising: the next best systems, beyond berkeley, are giza.union and giza.e-f, which each have lower precision and higher recall: thus, when the reranker chooses an alternative, it is usually choosing one of those two, improving recall (and hurting precision). We even see this in the oracle: though its AER is superior to the berkeley system's, its precision is lower.


Table 5.9: Reranking the candidates produced by giza.union.NBEST. Identity nbranks only (∼giza.union) span-tadpole span-comma nbranks & span-IP span-N T giza.union.NBEST (per-sentence) oracle 35.42 35.38 32.44 63.34 63.39 66.58 65.87 65.89 68.56 AER 35.42 35.41 35.41 Precision 63.34 63.35 63.36 Recall 65.87 65.87 65.88

5.5.2 Selecting candidates from N-best lists

To avoid the problems with very-different second-best candidates (with very different precision and recall) suggested by the previous experiments, we construct a separate experiment with only the alignments generated by giza.union.NBEST, which (as described in section 5.3) include two new rank features (dubbed nbranks): the rank of the giza.e-f member of the union and the rank of the giza.f-e member of the union. Table 5.9 shows none of the precision-recall imbalance present in the experiments in table 5.8. However, the coherence features do not seem to make much difference, only nudging both precision and recall (non-significantly) higher.

5.5.3 Reranker analysis

In sections 5.5.1 and 5.5.2, we explored using a reranker to select candidates from a pool of generated candidates. System combination at the level of whole candidate selection, as in these experiments, works best when the systems under combination have similar operating performance (similar quality) while also being diverse (making different kinds of errors). In this perspective, the analyses here suggest that the committee of experts (section 5.5.1) performed well in generating diversity, but the single best member of the committee (the berkeley system) so outperformed its fellows that alternates only rarely improved the overall AER. Conversely, the experiments in section 5.5.2 selected from the giza.union N -best


Table 5.10: AER, precision and recall for the bg-precise alignment

System                                      AER     Precision   Recall
berkeley                                    32.87   84.21       55.81
giza.intersect                              42.37   96.78       41.04
bg-precise (berkeley ∪ giza.intersect)      32.38   83.91       56.62

lists, where the criterion of similar quality was met, but the candidates were insufficiently diverse. We conjecture that the lack of improvement in AER from reranking is due to these problems. However, it may also be that the coherence features are not sufficiently powerful to distinguish the candidates without incorporating lexical or other cues, since the berkeley aligner's coherence percentages for the different phrase types are not so different from the ldc percentages.

5.6 Creating hybrid candidates by merging alignments

Whole-candidate selection from the previous section suggests that the available candidates are insufficiently diverse (when chosen from the N-best lists) and too dissimilar in performance (when chosen from the committee of expert systems). As an alternative strategy, we may perform partial-candidate selection, by constructing hybrid candidates guided by the syntactic strategies suggested here. The analysis in section 5.3.2 shows that the berkeley and giza.intersect systems are very high precision, but both have relatively low recall. By contrast, giza.union has the best recall, but its precision suffers. We cast the problem of sub-sentence alignment combination as a problem of improving the recall of a high-precision alignment. As a first baseline, we may combine (union) berkeley and giza.intersect, the two high-precision alignments from table 5.4, into a new precision alignment bg-precise, shown in table 5.10. The bg-precise alignment has a lower precision than either of its component high-precision alignments, but yields the best AER thus far, because of improvements to recall. This simple combination, in fact, yields an AER better (although not significantly better) than the per-sentence oracle AER from the giza.union.NBEST selection, providing further evidence that


the N -best lists are insufficiently diverse for reranking as they are.

5.6.1 Using "trusted spans" to merge high-precision and high-recall alignments

Although bg-precise improves recall to some degree, we would like to improve recall further. The giza.union alignments have substantially better recall than any of the precision alignments, so we adopt the strategy of merging only certain alignment links from the giza.union alignment into the bg-precise alignments. We introduce the notion of a "trusted span" on the source text, and define the guided union over a high-precision alignment, a high-recall alignment, and a set of trusted spans: all links from the recall alignment that originate within one of the trusted spans, unioned with all links from the precision alignment (sketched below). This combination heuristic changes the problem of combining alignments into the problem of identifying "trusted spans" in the source text.
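A minimal sketch of the guided union, under the same link-set representation as the earlier sketches (names are illustrative):

```python
# Illustrative sketch of the guided union: keep every precision link, and add
# the recall links whose source word lies inside some trusted span.

def guided_union(precision_links, recall_links, trusted_spans):
    def in_trusted(f):
        return any(lo <= f <= hi for lo, hi in trusted_spans)
    return precision_links | {(f, e) for f, e in recall_links if in_trusted(f)}

# The bg-precise baseline of this section is simply a plain union of two
# precision-oriented alignments:
# bg_precise = berkeley_links | giza_intersect_links
```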

5.6.2 Defining the syntactic "trusted span"

We may thus use the syntactic coherence analysis from section 5.4 to describe "trusted spans" to be used in the guided union operation, and evaluate the resulting guided-union alignment according to the same AER metrics we have used throughout this chapter. We extract syntactic trusted spans in a bottom-up recursion over the syntactic tree, defining trusted XP spans with the following heuristic: an XP span s is trusted for the process of the guided union between a precision-oriented alignment P and a recall-oriented alignment R when

• s is coherent in P,

• s is coherent in R,

• all XP spans contained within s are also trusted.

These spans define a guided union P ∪_XP R between a precision-oriented alignment P and a recall-oriented alignment R (a sketch of this recursion follows below). Thus we may define, for example, the NP-trusted guided union (P ∪_NP R): the maximal NPs that are coherent in both P and R such that all descendant NPs are also coherent.
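The bottom-up recursion can be sketched as follows, reusing is_coherent from the coherence sketch and the assumed tree-node interface from earlier; collecting all trusted spans rather than only the maximal ones gives the same guided union, since nested trusted spans contribute no extra links.

```python
# Illustrative sketch: collect the trusted XP spans of a source parse, for use
# with guided_union above. A node is trusted when its span is coherent in both
# P and R and every XP span below it is also trusted.

def trusted_xp_spans(root, precision_links, recall_links, xp_labels=("NP",)):
    trusted = []

    def visit(node):
        """Return True iff every XP span at or below this node is trusted."""
        results = [visit(child) for child in node.children]
        children_ok = all(results)
        if node.label not in xp_labels:
            return children_ok
        ok = (children_ok
              and is_coherent(precision_links, node.start, node.end)
              and is_coherent(recall_links, node.start, node.end))
        if ok:
            trusted.append((node.start, node.end))
        return ok

    visit(root)
    return trusted

# For example, one configuration from this section corresponds to:
# spans = trusted_xp_spans(source_parse, bg_precise, giza_union_links,
#                          xp_labels=("NP", "VP"))
# hybrid = guided_union(bg_precise, giza_union_links, spans)
```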


[Figure 5.7 shows (a) a precision alignment and (b) the resulting alignment (dashed links are new) between the Chinese sentence 中国 高新 技术 开发区 酝酿 于 八十年代 初 。 and its English translation, with nested np spans and a covering np-max span marked on the Chinese side.]

Figure 5.7: Example of an np-guided union. The precision alignment (a) and the recall alignment (b) both agree that each np span is coherent (and all np sub-spans are coherent). np-max may thus be used as a trusted span, allowing us to copy the heavy dashed links from the recall alignment into the union.


Table 5.11: AER, precision and recall over the entire test corpus, using various XP strategies to determine trusted spans

System                                        AER     Precision   Recall
Recall system R: giza.union                   35.42   63.34       65.87
Precision system P1: giza.intersect           42.37   96.78       41.04
P1 ∪_XP R, XP = IP                            41.23   92.71       43.02
P1 ∪_XP R, XP = VP                            41.16   93.07       43.01
P1 ∪_XP R, XP = IP or VP                      41.05   92.91       43.17
P1 ∪_XP R, XP = NP                            39.56   93.82       44.57
P1 ∪_XP R, XP = NP or VP                      39.23   93.00       45.13
Precision system P2: bg-precise               32.38   83.91       56.62
P2 ∪_XP R, XP = IP                            32.29   82.61       57.36
P2 ∪_XP R, XP = VP                            32.17   82.77       57.47
P2 ∪_XP R, XP = IP or VP                      32.15   82.73       57.50
P2 ∪_XP R, XP = NP                            31.61   83.16       58.07
P2 ∪_XP R, XP = NP or VP                      31.51   82.90       58.35

Figure 5.7 illustrates an NP-guided union P ∪_NP R, in which we can see, at least anecdotally, that it is reasonable to expect this syntactic mechanism for selecting trustworthy links to be helpful in extending a high-precision alignment: it improves recall without hurting precision much. Table 5.11 shows the results of using this guided-union heuristic to generate new alignment candidates, using two different alignments (giza.intersect and bg-precise) in the P role and giza.union in the R role. We see the same trends for each choice of P alignment: using P ∪_NP R has the smallest reduction in precision. It also has the second-largest improvement in recall, with the best performance going to the union guide that uses both


VP and NP spans to form the trusted spans. By contrast, while using P ∪_VP R or P ∪_IP R for guided unions both generate a reduction in AER, this reduction is small (and using both VP and IP together seems to make little improvement, probably because VP and IP spans are often nested). However, guided-union approaches are not sufficiently powerful to overcome the extremely low link density of the giza.intersect alignment: precision and recall trade off nearly four percentage points, but the corresponding improvement (due to moving the F-measure towards balance) is not sufficient to bring AER below the giza.union AER. When the precision-oriented alignment begins more balanced (as in bg-precise), the guided-union effects can drive AER to new minima: using NP and VP spans to guide the trusted-span selection produces the best overall AER of 31.51%.

5.7 Discussion

In this chapter, we have presented a new formalism for quantifying the degree to which a bitext alignment retains the coherence of certain spans on the source language. We evaluate the coherence behavior of several orthographically- and syntactically-derived classes of Chinese spans on a manually-aligned corpus of Chinese-English parallel text, and we identify certain classes of Chinese span (motivated by orthography and syntax) that have consistent coherence behavior. We argue that this coherence behavior may be useful in training improved statistical machine translation systems by improving the word alignments on which that training depends. To improve alignments, this chapter explored the potential for alignment system combination, following at first the approach of choosing candidates from a committee of experts or from the N-best lists generated by the GIZA++ toolkit (using a reranking toolkit). These initial experiments found that the needs of system combination (systems of rough parity, with usefully different kinds of errors) were not met, and we turned to sub-segment system combination. In this approach, we define a syntax-guided alignment hybridization between a high-precision and a high-recall alignment, and show that the resulting alignments, hybridized with guidance from syntactic structure, achieve better AER than


the best alignments produced by the component expert systems. These results, taken together, suggest that source-side grammatical structure and coherence can be a useful cue to identifying quality alignment links when producing alignments for the training of statistical machine translation engines.


Chapter 6 CONCLUSION

The work in this dissertation is motivated by the hypothesis that linguistic dependency and span parse structure is an informative and powerful representation of human language, so much so that accounting for parse structure will be useful even to those applications where only a word sequence is produced, e.g. a speech transcript or a text translation. In support of this hypothesis, this work has presented three ways that parse structure (as provided by statistical syntactic parsers) may be engaged with these large sequence-oriented natural language processing systems. This chapter summarizes the work presented here (section 6.1) and suggests directions for further study of these research areas individually (section 6.2) and for parsing as a general-purpose parse decoration tool for these and related applications (section 6.3).

6.1 Summary of key contributions

Chapter 3 demonstrated that it is possible to improve the performance of a speech recognizer and a parser by rescoring the two systems jointly: the speech recognizer's output is improved (in terms of WER) by exploiting information from parse structure, and the parse structure resulting from parsing the speech recognizer's output may be improved by considering alternative transcript hypotheses while evaluating the resulting parses. This research also found that the utility of the parse structure was strongly dependent on the quality of the speech segmentation: parse structure was much more valuable in the context of high-quality segment boundaries than in the context of using default speech recognizer segment boundaries. In addition, we present a qualitative analysis of the use of parse structure in selecting transcription hypotheses, finding improvement, for example, in the prediction of pronouns and the main sentential verb, which would be critical for use in subsequent linguistic applications.


Chapter 4 applies parse structure to a rather different domain: the evaluation of statistical machine translation. Like speech recognition, SMT evaluation is dominated by sequence-focused models, but this work introduces an application of syntax to SMT evaluation: a parse-decomposition measure for comparing translation hypotheses to reference translations. This measure correlates better with human judgement of translation quality than BLEU4 and TER, two popular sequence-based metrics. We further explore combining the new technique with other cutting-edge SMT evaluation techniques like synonym and paraphrase tables and show that the combined techniques (syntax and synonymy) perform better than either alone, although the gains are not strictly additive.

Chapters 3 and 4 both explore the utility of considering (probability-weighted) alternative parse hypotheses when using parse structure in their tasks. In the parsing-speech tasks of chapter 3, using this extra information in joint reranking with recognizer N-best transcription hypotheses shows trends in the direction of improvement (but not to significance, possibly because the number of parameters exceeded the reranker's ability to exploit them), but chapter 4's machine translation evaluation showed that including additional parse hypotheses clearly improved EDPM's ability to predict the quality of machine translations.

While the parsing-speech work (chapter 3) uses both span information and dependency information from the parses in comparing parse information for reranking, the SMT evaluation work in chapter 4 focuses on dependency (even to the point of putting it in the name of the Expected Dependency Pair Match metric). By contrast, the work in chapter 5 focuses on a use of constituent structure for an internal component of SMT: word alignment. Chapter 5's research conducts an analysis which demonstrates the tendencies of a particular class of spans (e.g., those motivated by syntactic constituency) to hold together in a given alignment. It explores the use of this constituent measure to select an alignment hypothesis from a pool of alignment candidates. Although the reranking approach has limited success (because the available candidates are too dissimilar in overall quality), the coherence measure illuminates some characteristics of quality alignments that further work on word alignment might pursue. Chapter 5 also developed a technique for using these characteristically-coherent spans as a guidance framework for alignment combination, through a guided union of a precision-oriented alignment and a recall-oriented


alignment. Syntactic coherence, used in this guided-union approach, was demonstrated to improve the AER of the resulting alignments. The effects of syntactic constituent coherence are probably even stronger than indicated by these results, since a qualitative analysis in that chapter identified that a sizable minority of incoherent IP spans were incoherent due to parse-decoration error.

6.2 Future directions for these applications

Chapters 3, 4, and 5 offer three different approaches to using parse structure to improve natural language processing. The results presented here suggest future work in applying parsers to each of these areas of research.

6.2.1 Adaptations for speech processing with parsers

In the domain of parsing speech (chapter 3), it would be valuable to explore the impact of additional parse-reranking features, especially those more directly focused on speech. The features extracted in this work were a re-purposing of the feature extraction used for reranking parses over text; it might be valuable to include features that more directly target the challenges of speech transcription. For example, explicit edit or disfluency modeling, as in Johnson and Charniak [2004], or prosodic features, as in Kahn et al. [2005], might be useful in further improving the reranking used here. Alternatively, including parse structure from parsers using other structural paradigms (e.g. the English Resource Grammar [Flickinger, 2002]) would be a valuable alternative knowledge source (further discussed in section 6.3). Along similar lines, expanding the joint modeling of speech transcription and parsing to include sentence segmentation (as in Harper et al. [2005]) might be valuable, especially because the evidence presented here points so strongly towards the need for improved segmentation.

6.2.2 Extension of EDPM similarities to other tasks

In extending EDPM, it would be interesting to consider whether these techniques could be shared with other tasks that require a sentence similarity measure. EDPM substantially


outperformed BLEU4, an n-gram precision measure, on correlation with human evaluation of machine-translation quality. In the summarization domain, ROUGE [Lin, 2004] uses n-grams to serve as an automatic evaluation of summary quality; EDPM's generalization of this approach to use expected parse structure is worth exploring in summarization as well. Even within machine translation, EDPM's notion of sentence similarity may be useful in other ways, for example, in computing distances between training and test data in graph-based learning approaches for MT (e.g. Alexandrescu and Kirchhoff [2009]).

6.2.3 Extending syntactic coherence

The coherence measures reported in chapter 5 suggest that one may be able to parse the source text alone to identify regions that are translated in relative isolation from one another. However, coherence of those spans by itself does not indicate that the alignment quality is good: a key factor is the relative link density (the proportion of links to words), since high-density alignments seem to under-predict coherence and low-density alignments to over-predict it. We suggest exploring a revised reranking, including link density as a feature alongside (possibly weighting) coherence. Furthermore, the guided-union work showed a welcome success in improving recall without greatly damaging precision by using source-language (Chinese) parsing. As an extension, parse structure from the target language (English) could also be used to identify regions where alignment unions are worth including. Parsing the target side would require a different parser, trained on the target language, which would identify target spans (rather than source-side spans) to trust in guiding alignment union. Since the analysis in section 5.4.4 indicated that some of the incoherent regions could be explained by English constructions, this approach might be particularly fruitful. Further work to integrate the notion of span coherence into machine translation alignments would be valuable: identifying that a span is likely to be coherent in translation should offer a criterion for augmenting the search-space pruning strategy for good translations. However, it would be wise to do further analysis of what regions are coherent before undertaking the substantial effort of incorporating coherence into a translation or alignment


decoder. Such an analysis might incorporate a lexicalized study of coherence extending the syntactic-span study done in section 5.4.2. Beyond improving AER, both the reranking-alignments and the guided-union techniques may show further improvement in alignment quality (or demonstrate a need for adjustment) when evaluated with alternative measures of alignment quality. One computationally-expensive technique would be direct translation evaluation: evaluate the alignment selection by training the entire translation model from the generated alignment and measuring a translation quality metric on a held-out set of texts.

6.3 Future challenges for parsing as a decoration on the word sequence

Each of the applications described here was developed with a PCFG-based parser which produces a simple span-labeling output. The parsers used here were all trained on in-domain data, with state-of-the-art PCFG engines. As a direction of future work, it is worthwhile to explore which of these constraints is necessary and which may be improved by trying alternatives.

6.3.1 Sensitivity to parser

On any of the three applications presented here, one could explore varying the parser. Alternative PCFG-based systems may present a different variation (their N-best lists, for example, may be richer than the high-precision systems used here). However, one could go farther and explore non-PCFG parsers. Any parser that can generate a dependency tree could be used for EDPM, and any parser that can generate spans with labels could be used in the coherence experiments. The reranking experiments over speech require that the parse trees generated be compatible with the feature extraction, but if one is willing to adjust the feature extraction as well, any parser could be used there too. One direction of approach might be to generate dependency structures directly for EDPM, e.g. by using dependency parser strategies like the ones described in Kübler et al. [2009]. Alternatively, the English Resource Grammar [Flickinger, 2002] produces a rich parse structure that may be projected to span or dependency structure; recent work (e.g. Miyao and Tsujii [2008]) has suggested that it may even generate probabilistically weighted tree


structures. Integrating this knowledge-driven parser into these experimental frameworks (as a replacement or supplemental parse-structure knowledge source) would be a valuable exploration of the relative merits of these parsers.

6.3.2 Sensitivity of these measures to parser domain

We expect that the training domain is relevant to a parser’s utility for these applications in ASR and SMT. In the limit, if the parser is trained on the wrong language, most of the information it offers to these measurement and reranking techniques will be lost. However, it is not clear how closely dialect, genre, and register must be matched: is it workable to use a newswire-trained parser in EDPM when comparing translations of a more informal genre (e.g. weblogs or conversational speech)? For some applications, genre may not have an impact on the useful performance of the parser, and for others it may have a substantial one: it would be a useful contribution to explore whether the benefits are retained when parser domains mismatch.

6.3.3 Sensitivity of these measures to parser speed and quality

Parse structure decoration is shown here to be a valuable supplement to large word-sequence-based NLP projects: this work offers a variety of opportunities for further work exploring new ways in which parse structure decoration may benefit large NLP projects. In evaluation, an obstacle to the wider adoption of dependency-based scoring functions such as EDPM (for MT) and SParseval (for ASR) is a concern for scoring speed. Systems that use error-driven training or tuning require fast scoring in order to do multiple runs of parameter estimation for different system configurations. Using a parser is much slower than scoring based on word and word n-gram matches alone. This objection invites exploration regarding the robustness of dependency-based scoring algorithms when a faster (though presumably lower-quality) parser is used rather than the Charniak and Johnson [2005] system; perhaps (rather than the PCFG-inspired system used here) a direct-to-dependency parser (e.g. the Kübler et al. [2009] parser) would capture enough similar information at high enough quality to offer the same performance in SMT evaluation.


Speed, of course, would be of benefit to any application of parsing: the use of syntactic coherence as a feature of word alignment in machine translation would also be more appealing if the benefits were present with a much faster parser. Parser error can be a serious problem, as the qualitative study of the Chinese coherence analyses indicated. A different way of approaching sensitivity to parser quality would be to create an array of parsers of known variation in quality (perhaps by using reduced training sets) and to explore the relative merit of each in the tasks presented here. In general, the experiments presented in this work suggest that parsers provide a useful knowledge source for natural language processing tasks in several areas. Improving the parser would, one expects, make that knowledge source more valuable, although it may be that the environments (e.g., the candidates to be reranked) are not sufficiently diverse for that additional knowledge to be valuable. In either case, this work stands as a call to continue exploring both parsers and the natural language processing tasks in which to apply them.


BIBLIOGRAPHY

Y. Al-Onaizan and L. Mangu. Arabic ASR and MT integration for GALE. In Proc. ICASSP, volume 4, pages 1285–1288, Apr. 2007. A. Alexandrescu and K. Kirchhoff. Graph-based learning for statistical machine translation. In Proc. HLT/NAACL, pages 119–127, 2009. E. Arisoy, M. Saraclar, B. Roark, and I. Shafran. Syntactic and sub-lexical features for Turkish discriminative language models. In Proc. ICASSP, pages 5538–5541, Mar. 2010. N. F. Ayan and B. J. Dorr. Going beyond AER: An extensive analysis of word alignments and their impact on MT. In Proc. ACL, pages 9–16, July 2006. N. F. Ayan, B. J. Dorr, and C. Monz. NeurAlign: Combining word alignments using neural networks. In Proc. HLT/EMNLP, pages 65–72, Oct. 2005a. N. F. Ayan, B. J. Dorr, and C. Monz. Alignment link projection using transformation-based learning. In Proc. HLT/EMNLP, pages 185–192, Oct. 2005b. S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 65–72, 2005. E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. A procedure for quantitatively comparing syntactic coverage of English grammars. In Proc. 4th DARPA Speech & Natural Lang. Workshop, pages 306–311, 1991. J. Bresnan. Lexical-functional syntax. Number 16 in Blackwell textbooks in linguistics. Blackwell, Malden, Mass., 2001.


P. F. Brown, J. Cocke, S. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990. D. Burkett, J. Blitzer, and D. Klein. Joint parsing and alignment with weakly synchronized grammars. In Proc. HLT, pages 127–135, June 2010. A. Cahill, M. Burke, R. O’Donovan, J. van Genabith, and A. Way. Long-distance dependency resolution in automatically acquired wide-coverage PCFG-based LFG approximations. In Proc. ACL, pages 319–326, 2004. C. Callison-Burch. Re-evaluating the role of BLEU in machine translation research. In Proc. EACL, pages 249–256, 2006. P.-C. Chang, M. Galley, and C. D. Manning. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 224–232, June 2008. E. Charniak. A maximum-entropy-inspired parser. In Proc. NAACL, pages 132–139, 2000. E. Charniak. Immediate-head parsing for language models. In Proc. ACL, pages 116–123, 2001. E. Charniak and M. Johnson. Edit detection and parsing for transcribed speech. In Proc. NAACL, pages 118–126, 2001. E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proc. ACL, pages 173–180, June 2005. A revised version was downloaded November 2009 from ftp://ftp.cs.brown.edu/pub/nlparser/. E. Charniak, K. Knight, and K. Yamada. Syntax-based language models for statistical machine translation. In MT Summit IX. Intl. Assoc. for Machine Translation., 2003. C. Chelba and F. Jelinek. Structured language modeling. Computer Speech and Language, 14(4):283–332, October 2000.


C. Cherry. Cohesive phrase-based decoding for statistical machine translation. In Proc. ACL, pages 72–80, June 2008.
D. Chiang. A hierarchical phrase-based model for statistical machine translation. In Proc. ACL, pages 263–270, June 2005.
M. Collins. Discriminative reranking for natural language parsing. In Proc. ICML, pages 175–182, 2000.
M. Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–638, 2003.
M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–69, 2005.
M. Collins, P. Koehn, and I. Kučerová. Clause restructuring for statistical machine translation. In Proc. ACL, pages 531–540, June 2005a.
M. Collins, B. Roark, and M. Saraclar. Discriminative syntactic language modeling for speech recognition. In Proc. ACL, pages 507–514, June 2005b.
M. R. Costa-jussà and J. A. R. Fonollosa. Statistical machine reordering. In Proc. EMNLP, pages 70–76, July 2006.
C. Culy and S. Z. Riehemann. The limits of n-gram translation evaluation metrics. In Proceedings of MT Summit IX, 2003.
DARPA. Global Autonomous Language Exploitation (GALE). Mission, http://www.darpa.mil/ipto/programs/gale/gale.asp, 2008.
J. DeNero and D. Klein. Tailoring word alignments to syntactic machine translation. In Proc. ACL, pages 17–24, June 2007.
D. Filimonov and M. Harper. A joint language model with fine-grain syntactic tags. In Proc. EMNLP, pages 1114–1123, Aug. 2009.


D. Flickinger. On building a more efficient grammar by exploiting types. In S. Oepen, D. Flickinger, J. Tsujii, and H. Uszkoreit, editors, Collaborative Language Engineering, chapter 1. CSLI Publications, 2002. V. Fossum, K. Knight, and S. Abney. Using syntax to improve word alignment precision for syntax-based machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 44–52, June 2008. A. Fraser and D. Marcu. Semi-supervised training for statistical word alignment. In Proc. ACL, pages 769–776, July 2006. A. Fraser and D. Marcu. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303, Sept. 2007. M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In D. M. Susan Dumais and S. Roukos, editors, Proc. HLT/NAACL, pages 273–280, May 2004. M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. Scalable inference and training of context-rich syntactic translation models. In Proc. COLING/ACL, pages 961–968, July 2006. D. Gildea. Loosely tree-based alignment for machine translation. In Proc. ACL, pages 80–87, July 2003. J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proc. ICASSP, volume I, pages 517–520, 1992. J. T. Goodman. A bit of progress in language modeling. Computer Speech and Language, 15:403–434(32), Oct. 2001. A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. In Proc. ACL, pages 923–931, Aug. 2009. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter., 11:10–18, Nov. 2009.


M. Harper and Z. Huang. Chinese Statistical Parsing, chapter in press. DARPA, 2009.
M. Harper, B. Dorr, J. Hale, B. Roark, I. Shafran, M. Lease, Y. Liu, M. Snover, L. Yung, A. Krasnyanskaya, and R. Stewart. Parsing and spoken structural event detection. Technical report, Johns Hopkins Summer Workshop Final Report, 2005.
D. Hillard. Automatic Sentence Structure Annotation for Spoken Language Processing. PhD thesis, University of Washington, 2008.
D. Hillard, M.-Y. Hwang, M. Harper, and M. Ostendorf. Parsing-based objective functions for speech recognition in translation applications. In Proc. ICASSP, 2008.
L. Huang. Forest reranking: Discriminative parsing with non-local features. In Proc. HLT, pages 586–594, June 2008.
Z. Huang and M. Harper. Self-training PCFG grammars with latent annotations across languages. In Proc. EMNLP, pages 832–841, Aug. 2009.
ISIP. Mississippi State transcriptions of SWITCHBOARD, 1997. URL http://www.isip.msstate.edu/projects/switchboard/.
R. Iyer, M. Ostendorf, and J. R. Rohlicek. Language modeling with sentence-level mixtures. In Proc. HLT, pages 82–87, 1994.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.
M. Johnson and E. Charniak. A TAG-based noisy-channel model of speech repairs. In Proc. ACL, pages 33–39, 2004.
J. G. Kahn. Moving beyond the lexical layer in parsing conversational speech. Master's thesis, University of Washington, 2005.
J. G. Kahn, M. Ostendorf, and C. Chelba. Parsing conversational speech using enhanced segmentation. In Proc. HLT/NAACL, pages 125–128, 2004.


J. G. Kahn, M. Lease, E. Charniak, M. Johnson, and M. Ostendorf. Effective use of prosody in parsing conversational speech. In Proc. HLT/EMNLP, pages 233–240, 2005.
J. G. Kahn, B. Roark, and M. Ostendorf. Automatic syntactic MT evaluation with expected dependency pair match. In MetricsMATR: NIST Metrics for Machine Translation Challenge. NIST, 2008.
J. G. Kahn, M. Snover, and M. Ostendorf. Expected dependency pair match: predicting translation quality with expected syntactic structure. Machine Translation, 23(2–3):169–179, 2009.
A. Kannan, M. Ostendorf, and J. R. Rohlicek. Weight estimation for n-best rescoring. In Proc. of the DARPA Workshop on Speech and Natural Language, pages 455–456, Feb. 1992.
M. King. Evaluating natural language processing systems. Communications of the ACM, 39(1):73–79, 1996.
P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. HLT/NAACL, pages 48–54, May–June 2003.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, pages 177–180, June 2007.
S. Kübler, R. McDonald, and J. Nivre. Dependency parsing. Synthesis Lectures on Human Language Technologies, 2(1):1–127, 2009.
S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic assignment. In Proc. HLT/NAACL, pages 112–119, June 2006.
L. Lamel, W. Minker, and P. Paroubek. Towards best practice in the development and evaluation of speech recognition components of a spoken language dialog system. Natural Language Engineering, 6(3&4):305–322, 2000.
LDC. Multiple translation Chinese corpus, part 2, 2003. Catalog number LDC2003T17.


LDC. Linguistic data annotation specification: Assessment of fluency and adequacy in translations. http://projects.ldc.upenn.edu/TIDES/Translation/TransAssess04.pdf, Jan. 2005.
LDC. Multiple translation Chinese corpus, part 4, 2006. Catalog number LDC2006T04.
LDC. GALE phase 2 + retest evaluation references, 2008. Catalog number LDC2008E11.
Z. Li, C. Callison-Burch, C. Dyer, S. Khudanpur, L. Schwartz, W. Thornton, J. Weese, and O. Zaidan. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139, Mar. 2009.
C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proc. ACL-04 Workshop, pages 74–81, July 2004.
D. Lin and C. Cherry. Word alignment with cohesion constraint. In Proc. NAACL, pages 49–51, 2003.
D. Liu and D. Gildea. Syntactic features for evaluation of machine translation. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 25–32, June 2005.
Y. Liu, Q. Liu, and S. Lin. Tree-to-string alignment template for statistical machine translation. In Proc. COLING/ACL, pages 609–616, July 2006a.
Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540, 2006b.
J. T. Lønning, S. Oepen, D. Beermann, L. Hellan, J. Carroll, H. Dyvik, D. Flickinger, J. B. Johannessen, P. Meurer, T. Nordgård, V. Rosén, and E. Velldal. LOGON: A Norwegian MT effort. In Proc. Recent Advances in Scandinavian Machine Translation, 2004.
D. M. Magerman. Statistical decision-tree models for parsing. In Proc. ACL, pages 276–283, 1995.


L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400, 2000.
D. Marcu, W. Wang, A. Echihabi, and K. Knight. SPMT: Statistical machine translation with syntactified target language phrases. In Proc. EMNLP, pages 44–52, July 2006.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(1):313–330, Mar. 1993.
M. Meteer, A. Taylor, R. MacIntyre, and R. Iyer. Dysfluency annotation stylebook for the Switchboard corpus. Technical report, Linguistic Data Consortium (LDC), 1995.
Y. Miyao and J.-i. Tsujii. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1):35–80, 2008.
R. C. Moore, W.-t. Yih, and A. Bode. Improved discriminative bilingual word alignment. In Proc. ACL, pages 513–520, July 2006.
W. Naptali, M. Tsuchiya, and S. Nakagawa. Topic-dependent language model with voting on noun history. ACM Transactions on Asian Language Information Processing (TALIP), 9(2):1–31, 2010.
NIST. NIST speech recognition scoring toolkit (SCTK). Technical report, NIST, 2005. URL http://www.nist.gov/speech/tools/.
F. J. Och. Minimum error rate training in statistical machine translation. In Proc. ACL, pages 160–167, July 2003.
F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
K. Owczarzak, J. van Genabith, and A. Way. Evaluating machine translation with LFG dependencies. Machine Translation, 21(2):95–119, June 2007a.


K. Owczarzak, J. van Genabith, and A. Way. Labelled dependencies in machine translation evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104–111, June 2007b.
S. Padó, D. Cer, M. Galley, D. Jurafsky, and C. Manning. Measuring machine translation quality as semantic equivalence: A metric based on entailment features. Machine Translation, 23:181–193, 2009.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pages 311–318, 2002.
S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In Proc. HLT, pages 404–411, Apr. 2007.
C. J. Pollard and I. A. Sag. Head-driven phrase structure grammar. Studies in contemporary linguistics. Stanford: CSLI, 1994.
M. Popović and H. Ney. POS-based word reorderings for statistical machine translation. In Proc. LREC, pages 1278–1283, May 2006.
C. Quirk, A. Menezes, and C. Cherry. Dependency treelet translation: syntactically informed phrasal SMT. In Proc. ACL, pages 271–279, 2005.
B. Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276, June 2001.
B. Roark, M. Harper, E. Charniak, B. Dorr, M. Johnson, J. G. Kahn, Y. Liu, M. Ostendorf, J. Hale, A. Krasnyanskaya, M. Lease, I. Shafran, M. Snover, R. Stewart, and L. Yung. SParseval: Evaluation metrics for parsing speech. In Proc. LREC, 2006.
B. Roark, M. Saraclar, and M. Collins. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392, Apr. 2007.
L. Shen, A. Sarkar, and F. J. Och. Discriminative reranking for machine translation. In Proc. HLT/NAACL, pages 177–184, May 2004.


N. Singh-Miller and M. Collins. Trigger-based language modeling using a loss-sensitive perceptron algorithm. In Proc. ICASSP, volume 4, pages 25–28, 2007.
M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. A study of translation edit rate with targeted human annotation. In Proc. AMTA, 2006.
M. Snover, N. Madnani, B. Dorr, and R. Schwartz. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Workshop on Statistical Machine Translation at EACL, Mar. 2009.
A. Stolcke. Modeling linguistic segment and turn-boundaries for n-best rescoring of spontaneous speech. In Proc. Eurospeech, volume 5, pages 2779–2782, 1997.
A. Stolcke. SRILM – an extensible language modeling toolkit. In Proc. ICSLP, pages 901–904, 2002.
A. Stolcke and E. Shriberg. Automatic linguistic segmentation of conversational speech. In Proc. ICSLP, pages 1005–1008, 1996.
A. Stolcke, B. Chen, H. Franco, V. R. R. Gadde, M. Graciarena, M.-Y. Hwang, K. Kirchhoff, A. Mandal, N. Morgan, X. Lei, T. Ng, M. Ostendorf, K. Sonmez, A. Venkataraman, D. Vergyri, W. Wang, J. Zheng, and Q. Zhu. Recent innovations in speech-to-text transcription at SRI-ICSI-UW. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1729–1744, Sept. 2006.
S. Strassel. Simple Metadata Annotation Specification V5.0. Linguistic Data Consortium, 2003. URL http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/SimpleMDE_V5.0.pdf.
S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proc. COLING, pages 836–841, Copenhagen, Denmark, 1996.
M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella. Evaluating interactive dialogue systems: extending component evaluation to integrated system evaluation. In Interactive Spoken Dialog Systems on Bringing Speech and NLP Together in Real Applications, pages 1–8, 1997.
W. Wang and M. P. Harper. The SuperARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In Proc. EMNLP, pages 238–247, July 2002.
W. Wang, A. Stolcke, and M. P. Harper. The use of a linguistically motivated language model in conversational speech recognition. In Proc. ICASSP, volume 1, pages 261–264, 2004.
B. Wong and C. Kit. ATEC: automatic evaluation of machine translation via word choice and word order. Machine Translation, 23:141–155, 2009.
F. Xia and M. McCord. Improving a statistical MT system with automatically learned rewrite patterns. In Proc. COLING, pages 508–514, 2004.
D. Xiong, Q. Liu, and S. Lin. A dependency treelet string correspondence model for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 40–47, June 2007.
N. Xue, F.-D. Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In Proc. COLING, 2002.
K. Yamada and K. Knight. A syntax-based statistical translation model. In Proc. ACL, pages 523–530, July 2001.
A. Yeh. More accurate tests for the statistical significance of result differences. In Proc. COLING, volume 2, pages 947–953, 2000.
Y. Zhang, R. Zens, and H. Ney. Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation. In Proc. NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 1–8, Apr. 2007.


A. Zollmann, A. Venugopal, M. Paulik, and S. Vogel. The syntax augmented MT (SAMT) system at the shared task for the 2007 ACL workshop on statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 216–219, June 2007.


VITA

Jeremy Gillmor Kahn was born in Atlanta, Georgia and has proceeded widdershins around the continental United States: Providence, Rhode Island, where he received his AB in Linguistics from Brown University; Ithaca, New York, where he discovered a career in speech synthesis; Redmond and Seattle, Washington, where that career extended to include speech recognition. He entered the University of Washington in Linguistics in 2003, receiving an MA and (now) a Ph.D. His counter-clockwise trajectory continues; Jeremy is employed by Wordnik, a Bay Area computational lexicography company. He has a job where they pay him to think about words and numbers and how they fit together. Jeremy lives in San Francisco, California with his wife Dorothy, a dramatherapist. The two of them spend a lot of time talking about what it means to say what you mean and what it says to mean what you say.
