
Towards Creating Precision Grammars from Interlinear Glossed Text:

Inferring Large-Scale Typological Properties


Emily M. Bender Michael Wayne Goodman Joshua Crowgey Fei Xia
Department of Linguistics
University of Washington
Seattle WA 98195-4340
{ebender,goodmami,jcrowgey,fxia}@uw.edu

Abstract

We propose to bring together two kinds of linguistic resources—interlinear glossed text (IGT) and a language-independent precision grammar resource—to automatically create precision grammars in the context of language documentation. This paper takes the first steps in that direction by extracting major-constituent word order and case system properties from IGT for a diverse sample of languages.

1 Introduction

Hale et al. (1992) predicted that more than 90% of the world's approximately 7,000 languages will become extinct by the year 2100. This is a crisis not only for the field of linguistics—on track to lose the majority of its primary data—but also a crisis for the social sciences more broadly as languages are a key piece of cultural heritage. The field of linguistics has responded with increased efforts to document endangered languages. Language documentation not only captures key linguistic data (both primary data and analytical facts) but also supports language revitalization efforts. It must include both primary data collection (as in Abney and Bird's (2010) universal corpus) and analytical work elucidating the linguistic structures of each language. As such, the outputs of documentary linguistics are dictionaries, descriptive (prose) grammars as well as transcribed and translated texts (Woodbury, 2003).

Traditionally, these outputs were printed artifacts, but the field of documentary linguistics has increasingly realized the benefits of producing digital artifacts as well (Nordhoff and Poggeman, 2012). Bender et al. (2012a) argue that the documentary value of electronic descriptive grammars can be significantly enhanced by pairing them with implemented (machine-readable) precision grammars and grammar-derived treebanks. However, the creation of such precision grammars is time consuming, and the cost of developing them must be brought down if they are to be effectively integrated into language documentation projects.

In this work, we are interested in leveraging existing linguistic resources of two distinct types in order to facilitate the development of precision grammars for language documentation. The first type of linguistic resource is collections of interlinear glossed text (IGT), a typical format for displaying linguistic examples. A sample of IGT from Shona is shown in (1).

(1) Ndakanga ndakatenga muchero
    ndi-aka-nga     ndi-aka-teng-a      mu-chero
    SBJ.1SG-RP-AUX  SBJ.1SG-RP-buy-FV   CL3-fruit
    'I had bought fruit.'  [sna]  (Toews, 2009:34)

The annotations in IGT result from deep linguistic analysis and represent much effort on the part of field linguists. These rich annotations include the segmentation of the source line into morphemes, the glossing of those individual morphemes, and the translation into a language of broader communication. The IGT format was developed to compactly display this information to other linguists. Here, we propose to repurpose such data in the automatic development of further resources.

The second resource we will be working with is the LinGO Grammar Matrix (Bender et al., 2002; 2010), an open source repository of implemented linguistic analyses. The Grammar Matrix pairs a core grammar, shared across all grammars it creates, with a series of libraries of analyses of cross-linguistically variable phenomena. Users access the system through a web-based questionnaire which elicits linguistic descriptions of languages and then outputs working HPSG (Pollard and Sag, 1994) grammar fragments compatible with DELPH-IN (www.delph-in.net) tools based on those descriptions.
For present purposes, this system can be viewed as a function which maps simple descriptions of languages to precision grammar fragments. These fragments are relatively modest, yet they relate linguistic strings to semantic representations (and vice versa) and are ready to be built out to broad coverage.

Thus we ask whether the information encoded by documentary linguists in IGT can be leveraged to answer the Grammar Matrix's questionnaire and create a precision grammar fragment automatically. The information required by the Grammar Matrix questionnaire concerns five different aspects of linguistic systems: (i) constituent ordering (including the presence/absence of constituent types), (ii) morphosyntactic systems, (iii) morphosyntactic features, (iv) lexical types and their instances and (v) morphological rules. In this initial work, we target examples of types (i) and (ii): the major constituent word order and the general type of case system in a language. The Grammar Matrix and other related work are described further in §2. In §3 we present our test data and experimental set-up. §§4–5 describe our methodology and results for the two tasks, respectively, with further discussion and outlook in §§6–7.

2 Background and Related Work

2.1 The Grammar Matrix

The Grammar Matrix produces precision grammars on the basis of descriptions of languages that include both high-level typological information and more specific detail. Among the former are aspects (i)–(iii) listed in §1. The third of these (morphosyntactic features) concerns the type and range of grammaticized information that a language marks in its morphology and/or syntax. This includes person/number systems (e.g., is there an inclusive/exclusive distinction in non-singular first person forms?), the range of aspectual distinctions a language marks, and the range of cases (if any) in a language, inter alia. The answers to these questions in turn cause the system to provide relevant features that the user can reference in providing the more specific information elicited by the questionnaire ((iv) and (v) above), viz., the definition of both lexical types (e.g., first person dual exclusive pronouns) and morphological rules (e.g., nominative case marking on nouns).

The information input by the user to the Grammar Matrix questionnaire is stored in a file called a 'choices file'. The choices file is used both in the dynamic definition of the html pages (so that the features available for lexical definitions depend on earlier choices) and as the input to the customization script that actually produces the grammar fragments to spec. The customization system distinguishes between choices files which are complete and consistent (and can be used to create working grammar fragments) and those which do not yet have answers to required questions or give answers which are inconsistent according to the underlying grammatical theory. The ultimate goal of the present project is to be able to automatically create complete and consistent choices files on the basis of IGT, and in fact to create complete and consistent choices files which take maximal advantage of the analyses stored in the Grammar Matrix customization system, answering not only the minimal set of questions required but in fact all which are relevant and possible to answer based on the information in the IGT.

Creating such complete and consistent choices files is a long-term project, with different approaches required for the different types of questions outlined in §1. Bender et al. (2012b) take some initial steps towards answering the questions which define lexical rules. We envision answering the questions regarding morphosyntactic features through an analysis of the grams that appear on the gloss line, with reference to the GOLD ontology (Farrar and Langendoen, 2003). The implementation of such systems in such a way that they are robust to potentially noisy data will undoubtedly be non-trivial. The contribution of this paper is the development of systems to handle one example each of the questions of types (i) and (ii), namely detecting major constituent word order and the underlying case system. For the first, we build directly on the work of Lewis and Xia (2008) (see §2.2). Our experiment can be viewed as an attempt to reproduce their results in the context of the specific view of word order possibilities developed in the Grammar Matrix. The second question (that of case systems) is in some ways more subtle, requiring not only analysis of IGT instances in isolation and aggregation of the results, but also identification of particular kinds of IGT instances and comparison across them.

2.2 RiPLes

The RiPLes project has two intertwined goals.
The first goal is to create a framework that allows the rapid development of resources for resource-poor languages (RPLs), which is accomplished by bootstrapping NLP tools with initial seeds created by projecting syntactic information from resource-rich languages to RPLs through IGT. Projecting syntactic structures has two steps. First, the words in the language line and the translation line are aligned via the gloss line. Second, the translation line is parsed by a parser for the resource-rich language and the parse tree is then projected to the language line using word alignment and some heuristics as illustrated in Figure 1 (adapted from Xia and Lewis (2009)).¹ Previous work has applied these projected trees to enhance the performance of statistical parsers (Georgi et al., 2012). Though the projected trees are noisy, they contain enough information for those tasks.

¹ The details of the algorithm and experimental results were reported in (Xia and Lewis, 2007).

Figure 1: Welsh IGT with alignment and projected syntactic structure
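
To make the first step concrete, the sketch below shows one naive way the gloss-mediated alignment could be implemented. It is not the RiPLes code: the exact-match heuristic, the tokenization, and the function name are assumptions for illustration only.

```python
# Illustrative sketch of gloss-mediated word alignment (step 1 above).
# NOT the RiPLes implementation: it assumes a bare exact-match heuristic
# between pieces of the gloss and words of the English translation.

def align_via_gloss(gloss_words, trans_words):
    """Return (source_index, translation_index) pairs.

    gloss_words is parallel to the language line (one gloss per source word);
    trans_words is the tokenized English translation.
    """
    trans_lower = [w.lower().strip('.,?!') for w in trans_words]
    alignments = []
    for i, gloss in enumerate(gloss_words):
        # A gloss like 'SBJ.1SG-RP-buy-FV' mixes lexical material ('buy')
        # with grammatical grams; try each piece against the translation.
        for piece in gloss.replace('-', '.').split('.'):
            piece = piece.lower()
            if piece and piece in trans_lower:
                alignments.append((i, trans_lower.index(piece)))
                break
    return alignments

# With the Shona IGT in (1): 'buy' does not exact-match 'bought', so only
# 'fruit' aligns -- which is why stemming and further heuristics are needed.
gloss = ['SBJ.1SG-RP-AUX', 'SBJ.1SG-RP-buy-FV', 'CL3-fruit']
trans = ['I', 'had', 'bought', 'fruit', '.']
print(align_via_gloss(gloss, trans))   # [(2, 3)]
```
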
The second goal of RiPLes is to use the automatically created resources to perform cross-lingual study on a large number of languages to discover linguistic knowledge. For instance, Lewis and Xia (2008) showed that IGT data enriched with the projected syntactic structure could be used to determine the word order property of a language with a high accuracy (see §4). Naseem et al. (2012) use this type of information (in their case, drawn from the WALS database (Haspelmath et al., 2008)) to improve multilingual dependency parsing. Here, we build on this aspect of RiPLes and begin to extend it towards the wider range of linguistic phenomena and more detailed classification within phenomena required by the Grammar Matrix questionnaire.

2.3 Other Related Work

Our work is also situated with respect to attempts to automatically characterize typological properties of languages, including Daumé III and Campbell's (2007) Bayesian approach to discovering typological implications and Georgi et al.'s (2010) work on predicting (unknown) typological properties by clustering languages based on known properties. Both projects use the typological database WALS (Haspelmath et al., 2008), which has information about 192 different typological properties and about 2,678 different languages (though the matrix is very sparse). This approach is complementary to ours, and it remains an interesting question whether our results could be improved by bringing in information about other typological properties of the language (either extracted from the IGT or looked up in a typological database).

Another strand of related work concerns the collection and curation of IGT, including the ODIN project (Lewis, 2006; Xia and Lewis, 2008), which harvests IGT from linguistics publications available over the web and TypeCraft (Beermann and Mihaylov, 2009), which facilitates the collaborative development of IGT annotations. TerraLing/SSWL² (Syntactic Structures of the World's Languages) has begun a database which combines both typological properties and IGT illustrating those properties, contributed by linguists.

² http://sswl.railsplayground.net/, accessed 4/25/13

Finally, Beerman and Hellan (2011) represents another approach to inducing grammars from IGT, by bringing the hand-built linguistic knowledge sources closer together: On the one hand, their cross-linguistic grammar resource (TypeGram) includes a mechanism for mapping from strings specifying verb valence and valence-altering lexical rules to sets of grammar constraints. On the other hand, their IGT authoring environment (TypeCraft) provides support for annotating examples with those strings. The approach advocated here attempts to bridge the gap between IGT and grammar specification algorithmically, instead.

3 Development and Test Data

Our long-term goal is to produce working grammar fragments from IGT produced in documentary linguistics projects. However, in order to evaluate the performance of approaches to answering the high-level questions in the Grammar Matrix questionnaire, we need both IGT and gold-standard answers for a reasonably-sized sample of languages.

DEV 1 (n=10): testsuite sizes 16–359, median 91; language families: Indo-European (4), Niger-Congo (2), Afro-Asiatic, Japanese, Nadahup, Sino-Tibetan
DEV 2 (n=10): testsuite sizes 11–229, median 87; language families: Indo-European (3), Dravidian (2), Algic, Creole, Niger-Congo, Quechuan, Salishan
TEST (n=11): testsuite sizes 48–216, median 76; language families: Indo-European (2), Afro-Asiatic, Austro-Asiatic, Austronesian, Arauan, Carib, Kartvelian, N. Caucasian, Tai-Kadai, Isolate

Table 1: Language families and testsuite sizes (in number of grammatical examples)

We have constructed development and test data for this purpose on the basis of work done by students in a class that uses the Grammar Matrix (Bender, 2007). In this class, students work with descriptive resources for languages they are typically not familiar with to create testsuites (curated collections of grammatical and ungrammatical examples) and Grammar Matrix choices files. Later on in the class, the students extend the grammar fragments output by the customization system to handle a broader fragment of the language. Accordingly, the testsuites cover phenomena which go beyond the customization system.

Testsuites for grammars, especially in their early stages of development, require examples that are simple (isolating the phenomena illustrated by the examples to the extent possible), built out of a small vocabulary, and include both grammatical and ungrammatical examples (Lehmann et al., 1996). The examples included in descriptive resources often don't fit these requirements exactly. As a result, the data we are working with include examples invented by the students on the basis of the descriptive statements in their resources.³

³ Such examples are flagged in the testsuites' meta-data.

In total, we have testsuites and associated choices files for 31 languages, spanning 17 language families (plus one creole and one language isolate). The most well-represented family is Indo-European, with nine languages. We used 20 languages, in two dev sets, for algorithm development (including manual error analysis), and saved 11 languages as a held-out test set to verify the generalizability of our approach. Table 1 lists the language families and the range of testsuite sizes for each of these sets of languages.

4 Inferring Word Order

Lewis and Xia (2008) show how IGT from ODIN (Lewis, 2006) can be used to determine, with high accuracy, the word order properties of a language. They identify 14 typological parameters related to word order for which WALS (Haspelmath et al., 2008) or other typological resources provide information. The parameter most closely relevant to the present work is Order of Words in a Sentence (Dryer, 2011). For this parameter, Lewis and Xia tested their method on 97 languages and found that their system had 99% accuracy provided the IGT collections had at least 40 instances per language.

The Grammar Matrix's word order questions differ somewhat from the typological classification that Lewis and Xia (2008) were using. Answering the Grammar Matrix questionnaire amounts to more than making a descriptive statement about a language. The Grammar Matrix customization system translates collections of such descriptive statements into working grammar fragments. In the case of word order, this most directly affects the number and nature of phrase structure rules included in the output grammar, but can also interact with other aspects of the grammar (e.g., the treatment of argument optionality). More broadly, specifying the word order system of a grammar determines both grammaticality (accepting some strings, ruling out others) and, for the fixed word orders at least, aspects of the mapping of syntactic to semantic arguments.

Lewis and Xia (2008), like Dryer (2011), gave six fixed orders of S, O and V plus "no dominant order". In contrast, the Grammar Matrix distinguishes Free (pragmatically constrained), V-final, V-initial, and V2 orders, in addition to the six fixed orders. It is important to note that the relationship between the word order type of a language and the actual orders attested in sentences can be somewhat indirect. For a fixed word order language, we would expect the order declared as its type to be the most common in running text, but not the only type available. English, for example, is an SVO language, but several constructions allow for other orders, including subject-auxiliary inversion, so-called topicalization, and others:

(2) Did Kim leave?
(3) The book, Kim forgot.

In a language with more word order flexibility in general, there may still be a preferred word order which is the most common due to pragmatic or other constraints. Users of the Grammar Matrix are advised to choose one of the fixed word orders if the deviations from that order can generally be accounted for by specific syntactic constructions, and a freer word order otherwise.

The relationship between the correct word order choice for the Grammar Matrix customization system and the distribution of actual token word orders in our development and test data is affected by another factor, related to Lewis and Xia's 'IGT bias', which we dub 'testsuite bias'. The collections of IGT we are using were constructed as testsuites for grammar engineering projects and thus comprise examples selected or constructed to illustrate specific grammatical properties in a testing regime where one example is enough to represent each sentence type of interest. Therefore, they do not represent a natural distribution of word order types. For example, the testsuite authors may show the full range of possible word orders in the word order section of the testsuite and then default to one particular choice for other portions (those illustrating e.g., case systems or negation).

4.1 Methodology

Our first steps mirror the RiPLes approach, parsing the English translation of each sentence and projecting the parsed structure onto the source language line. Functional tags, such as SBJ and OBJ, are added to the NP nodes on the English side based on our knowledge of English word order and then carried over to the source language side during the projection of parse trees. The trees are then searched for any of ten patterns: SOV, SVO, OSV, OVS, VSO, VOS, SV, VS, OV, and VO. The six ternary patterns match when both verbal arguments are present in the same clause. The four binary patterns are for intransitive sentences or those with dropped arguments. These ten patterns make up the observed word orders.
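
To make the pattern matching concrete, the following minimal sketch (our own illustrative simplification, not the actual pipeline code) reads an observed pattern off the linear order of the projected S, O and V labels in a single clause:

```python
# Minimal sketch: read an observed word-order pattern off one projected clause.
# `tagged` is a list of (token, label) pairs, where label is 'S', 'O', 'V', or
# None for words that received no functional tag during projection. This is an
# illustrative simplification, not the code used in the experiments.

def observed_pattern(tagged):
    order = [label for _, label in tagged if label in ('S', 'O', 'V')]
    if order.count('V') != 1:
        return None                  # skip verbless or multi-verb clauses
    if 'S' in order or 'O' in order:
        return ''.join(order)        # one of the six ternary or four binary patterns
    return None                      # a bare verb tells us nothing about order

# A clause whose projected labels come out as S ... O V yields 'SOV':
clause = [('w1', 'S'), ('w2', None), ('w3', 'O'), ('w4', 'V')]
print(observed_pattern(clause))      # SOV
```
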
Given our relatively limited data set (each language is one data point), we present an initial approach to determining underlying word order based on heuristics informed by general linguistic knowledge. We compare the distribution of observed word orders to distributions we expect to see for canonical examples of underlying word orders. We accomplish this by first deconstructing the ternary observed word orders into binary patterns (the four above plus SO and OS). This gives us three axes: one for the tendency to exhibit VS or SV order, another for VO or OV order, and another for OS or SO order. By counting the observed word orders in the IGT examples, we can place the language in this three-dimensional space. Figure 2 depicts this space with the positions of canonical word orders.⁴ The canonical word order positions are those found under homogeneous observations. For example, the canonical position for SOV order is when 100% of the sentences exhibit SO, OV, and SV orders; and the canonical position for Free word order is when each observed order occurs with equal frequency to its opposite order (on the same axis; e.g. VO and OV). We select the underlying word order by finding which canonical word order position has the shortest Euclidean distance to the observed word order position.

⁴ Of the eight vertices of this cube, six represent canonical word orders and the other two impossible combinations: the vertex for (SV, VO, OS), for example, has S both before and after O.

When a language is selected as Free word order, we employ a secondary heuristic to decide if it is actually V2 word order. The V2 order cannot be easily recognized only with the binary word orders, so it is not given a unique point in the three-dimensional space. Rather, we try to recognize it by comparing the ternary orders. A Free-order language is reclassified as V2 if SVO and OVS occur more frequently than SOV and OSV.⁵

⁵ The VOS and VSO patterns are excluded from this comparison, since they can go either way—there may be unaligned constituents (i.e. not a S, O, or V) before the verb which are ignored by our system.

Figure 2: Three axes of basic word order and the positions of canonical word orders.
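
A sketch of this classification step is given below. It is our own rendering of the procedure just described: the canonical coordinates, the per-axis normalization, and the function names are assumptions, not the published implementation.

```python
import math
from collections import Counter

# Decomposition of each ternary pattern into its three binary sub-orders.
TERNARY = {'SOV': ('SV', 'OV', 'SO'), 'SVO': ('SV', 'VO', 'SO'),
           'OSV': ('SV', 'OV', 'OS'), 'OVS': ('VS', 'OV', 'OS'),
           'VSO': ('VS', 'VO', 'SO'), 'VOS': ('VS', 'VO', 'OS')}

# Assumed canonical positions on the (SV, OV, SO) axes: 1.0 means the
# first-named order always wins on that axis, 0.5 means no preference.
CANONICAL = {'sov': (1, 1, 1), 'svo': (1, 0, 1), 'osv': (1, 1, 0),
             'ovs': (0, 1, 0), 'vso': (0, 0, 1), 'vos': (0, 0, 0),
             'v-final': (1, 1, 0.5), 'v-initial': (0, 0, 0.5),
             'free': (0.5, 0.5, 0.5)}

def classify_word_order(patterns):
    """patterns: all observed patterns for one language, e.g. ['SVO', 'SV', ...]."""
    binary = Counter()
    for p in patterns:
        for b in TERNARY.get(p, (p,)):      # binary patterns count as themselves
            binary[b] += 1

    def axis(pos, neg):
        total = binary[pos] + binary[neg]
        return binary[pos] / total if total else 0.5   # no evidence: stay neutral

    point = (axis('SV', 'VS'), axis('OV', 'VO'), axis('SO', 'OS'))
    best = min(CANONICAL, key=lambda order: math.dist(point, CANONICAL[order]))

    # Secondary heuristic: revise a 'free' prediction to V2 when the
    # V2-compatible ternary patterns (SVO, OVS) outnumber SOV and OSV.
    ternary = Counter(p for p in patterns if p in TERNARY)
    if best == 'free' and ternary['SVO'] + ternary['OVS'] > ternary['SOV'] + ternary['OSV']:
        best = 'v2'
    return best

print(classify_word_order(['SVO', 'SVO', 'SVO', 'SV', 'VO']))   # svo
```
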
4.2 Results

Table 2 shows the results we obtained for our dev and test sets.
For comparison, we use a most-frequent-type baseline, selecting SOV for all languages, based on Dryer's (2011) survey. We get high accuracy for DEV 1, low accuracy for DEV 2, and moderate accuracy for TEST, but all are significantly higher than the baseline.

Dataset   Inferred WO   Baseline
DEV 1     0.900         0.200
DEV 2     0.500         0.100
TEST      0.727         0.091

Table 2: Accuracy of word-order inference

Hand analysis of the errors in the dev sets shows that some languages fall victim to the testsuite bias, such as Russian, Quechua, and Tamil. All of these languages have Free word order, but our system infers SVO for Russian and SOV for Quechua and Tamil, because the authors of the testsuites used one order significantly more than the others. Similarly, the Free word order language Nishnaabemwin is inferred as V2 because there are more SVO and OVS patterns given than others. We also see errors due to misalignment from RiPLes' syntactic projection. The VSO language Welsh is inferred as SVO because the near-ubiquitous sentence-initial auxiliary doesn't align to the main verb of the English translation.

5 Inferring Case Systems

Case refers to linguistic phenomena in which the form of a noun phrase (NP) varies depending on the function of the NP in a sentence (Blake, 2001). The Grammar Matrix's case library (Drellishak, 2009) focuses on case marking of core arguments of verbs. Specifying a grammar for case involves both choosing the high-level case system to be modeled as well as associating verb types with case frames and defining the lexical items or lexical rules which mark the case on the NPs. Here, we focus on the high-level case system question as it is logically prior, and in some ways more interesting than the lexical details: Answering this question requires identifying case frames of verbs in particular examples and then comparing across those examples, as described below.

The high-level case system of a language concerns the alignment of case marking between transitive and intransitive clauses. The three elements in question are the subjects of intransitives (dubbed S), the subjects (or agent-like arguments) of transitives (dubbed A) and the objects (or patient-like arguments) of transitives (O). Among languages which make use of case, the most common alignment type is a nominative-accusative system (Comrie 2011a,b). In this type, S takes the same kind of marking as A.⁶ The Grammar Matrix case library provides nine options, including none, nominative-accusative, ergative-absolutive (S marked like O), tripartite (S, A and O all distinct) and several more intricate types. For example, in a language with one type of split case system the alignment is nominative-accusative in non-past tense clauses, but ergative-absolutive in past tense ones.

⁶ English's residual case system is of this type.

As with major constituent word order, the constraints implementing a case system in a grammar serve to model both grammaticality and the mapping between syntactic and semantic arguments. Here too, the distribution of tokens may be something other than a pure expression of the case alignment type. Sources of noise in the distribution include: argument optionality (e.g., transitives with one or more covert arguments), argument frames other than simple intransitives or transitives, and quirky case (verbs that use a non-standard case frame for their arguments, such as the German verb helfen which selects a dative argument, though the language's general system is nominative-accusative (Drellishak, 2009)).

5.1 Methodology

We explore two possible methodologies for inferring case systems, one relatively naïve and one more elaborate, and compare them to a most-frequent-type baseline. Method 1, called GRAM, considers only the gloss line of the IGT and assumes that it complies with the Leipzig Glossing Rules (Bickel et al., 2008). These rules not only prescribe formatting aspects of IGT but also provide a set of licensed 'grams', or tags for grammatical properties that appear in the gloss line. GRAM scans for the grams associated with case, and assigns case systems according to Table 3.

Case system                      NOM ∨ ACC   ERG ∨ ABS
none
nom-acc                          X
erg-abs                                      X
split-erg (conditioned on V)     X           X

Table 3: GRAM case system assignment rules
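
As a concrete illustration, a minimal version of this scan might look as follows; the gram inventory, the gloss-line tokenization, and the function names are our simplifying assumptions and not part of the system described here:

```python
import re

# Minimal sketch of the GRAM heuristic: scan gloss lines for case grams and
# apply the assignment rules of Table 3. The gram inventory and tokenization
# are simplifying assumptions, not an exhaustive treatment of the Leipzig
# Glossing Rules.
NOM_ACC = {'NOM', 'ACC'}
ERG_ABS = {'ERG', 'ABS'}

def gloss_grams(gloss_line):
    """Split a gloss line such as 'dog-NOM cat-ACC see-PST' into its grams."""
    return set(re.split(r'[-=.\s]+', gloss_line.upper()))

def gram_case_system(gloss_lines):
    grams = set()
    for line in gloss_lines:
        grams |= gloss_grams(line)
    has_nom_acc = bool(grams & NOM_ACC)
    has_erg_abs = bool(grams & ERG_ABS)
    if has_nom_acc and has_erg_abs:
        return 'split-erg'          # both families of case grams attested
    if has_nom_acc:
        return 'nom-acc'
    if has_erg_abs:
        return 'erg-abs'
    return 'none'

igt_glosses = ['boy-NOM fruit-ACC buy-PST', '3SG.NOM sleep-PST']
print(gram_case_system(igt_glosses))   # nom-acc
```
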

This methodology is simple to implement and expected to work well given Leipzig-compliant IGT. However, since it does not model the function of case, it is dependent on the IGT authors' choice of gram symbols, and may be confused by either alternative case names (e.g., SBJ and OBJ for nominative and accusative or LOC for ergative in languages where it is homophonous with the locative case) or by other grams which collide with the case name-space (such as NOM for nominalizer). It also only handles four of the nine case systems (albeit the most frequent ones).

Method 2, called SAO, is more theoretically motivated, builds on the RiPLes approach used in inferring word order, and is designed to be robust to idiosyncratic glossing conventions. In this methodology, we first identify the S, A and O arguments by projecting the information from the parse of the English translation (including the function tags) to the source sentence (and its glosses). We discard all items which do not appear to be simple transitive or intransitive clauses with all arguments overt, and then collect all grams for each argument type (from all words within the NP, including head nouns as well as determiners and adjectives). While there are many grammatical features that can be marked on NPs (such as number, definiteness, honorifics, etc.), the only ones that should correlate strongly with grammatical function are case-marking grams. Furthermore, in any given NP, while case may be multiply marked, we only expect one type of case gram to appear. We thus assume that the most frequent gram for each argument type is a case marker (if there are any) and assign the case system according to the following rules, where Sg, Og and Ag denote the most frequent grams associated with these argument positions, respectively:

• Nominative-accusative: Sg = Ag, Sg ≠ Og
• Ergative-absolutive: Sg = Og, Sg ≠ Ag
• No case: Sg = Ag = Og, or Sg ≠ Ag ≠ Og and Sg, Ag, Og also present on each of the other argument types
• Tripartite: Sg ≠ Ag ≠ Og, and Sg, Ag, Og (virtually) absent from the other argument types
• Split-S: Sg ≠ Ag ≠ Og, and Ag and Og are both present in the list for the S argument type

Here, we're using Split-S to stand in for both Split-S and Fluid-S. These are both systems where some S arguments are marked like A, and some like O. In Split-S, which marking is taken depends on the verb. In Fluid-S, it depends on the interpretation of the verb. These could be distinguished by looking for intransitive verbs that appear more than once in the data and checking whether their S arguments all have consistently A or O marking.

This system is agnostic as to the spelling of the case grams. By relying on more analysis of the IGT than GRAM, it also introduces new kinds of brittleness. Recognizing the difference between grams being present and (virtually) absent makes the system susceptible to noise.
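
The decision logic over Sg, Ag and Og can be sketched as below. This is an illustrative simplification with assumed helper names, not the code used in the experiments; in particular, the "(virtually) absent" tests of the prose are collapsed into strict membership checks:

```python
from collections import Counter

def sao_case_system(s_grams, a_grams, o_grams):
    """Apply the SAO rules to the grams collected for S, A and O arguments.

    Each argument is a list of grams (with repetitions) gathered from the NPs
    filling that role. Illustrative sketch only.
    """
    if not (s_grams and a_grams and o_grams):
        return 'none'                     # not enough evidence to decide

    def top(grams):
        return Counter(grams).most_common(1)[0][0]

    sg, ag, og = top(s_grams), top(a_grams), top(o_grams)

    if sg == ag and sg != og:
        return 'nom-acc'
    if sg == og and sg != ag:
        return 'erg-abs'
    if sg == ag == og:
        return 'none'
    # Remaining cases: sg differs from both ag and og.
    if (sg in a_grams and sg in o_grams and
            ag in s_grams and ag in o_grams and
            og in s_grams and og in a_grams):
        return 'none'                     # the frequent grams are not case markers
    if ag in s_grams and og in s_grams:
        return 'split-s'                  # S sometimes marked like A, sometimes like O
    return 'tripartite'                   # distinct grams, absent from the other roles

# S and A share NOM while O carries ACC:
print(sao_case_system(['NOM', 'NOM', 'SG'], ['NOM', 'PL'], ['ACC', 'ACC']))  # nom-acc
```
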
5.2 Results

Table 4 shows the results for the inference of case-marking systems. Currently GRAM performs best, but both methods generally perform better than the baseline. The better performance of GRAM is expected, given the small size and generally Leipzig-compliant glossing of our data sets. In future work, we plan to incorporate data from ODIN, which is likely less consistently annotated but more voluminous, and we expect SAO to be more robust than GRAM to this kind of data.

Dataset   GRAM    SAO     Baseline
DEV 1     0.900   0.700   0.400
DEV 2     0.900   0.500   0.500
TEST      0.545   0.545   0.455

Table 4: Accuracy of case-marking inference

We find that GRAM is sometimes able to do well when RiPLes gives alignment errors. For example, Old Japanese is a NOM-ACC language, but the case-marking grams (associated with postpositions) are not aligned to the NP arguments, so SAO is not able to judge their distribution. On the other hand, SAO prevails when non-standard grams are used, such as the NOM-ACC language Hupdeh, which is annotated with SUBJ and OBJ grams. This complementarity suggests scope for system combination, which we leave to future work.

6 Discussion and Future Work

Our initial results are promising, but also show remaining room for improvement. Error analysis suggests two main directions to pursue:

Overcoming testsuite bias  In both the word order and case system tasks, we see the effect of testsuite bias on our system results. The testsuites for freer word order languages can be artificially dominated by a particular word order that the testsuite author found convenient.
Further, the restricted vocabulary used in testsuites, combined with a general preference for animates as subjects, leads to stems and certain grams potentially being misidentified as case markers.

We believe that these aspects of testsuite bias are not typical of our true target input data, viz., the larger collections of IGT created by field projects. On the other hand, there may be other aspects of testsuites which are simplifying the problem and to which our current methods are overfitted. To address these issues, we intend to look to larger datasets in future work, both IGT collections from field projects and IGT from ODIN. For the field projects, we will need to construct choices files. For ODIN, we can search for data from the languages we already have choices files for.

As we move from testsuites to test corpora (e.g., narratives collected in documentary linguistics projects), we expect to find different distributions of word order types. Our current methodology for extracting word order is based on idealized locations in our word order space for each strict word order type. Working with naturally occurring corpora, it should be possible to gain a more empirically based understanding of the relationship between underlying word order and sentence type distributions. It will be particularly interesting to see how stable these relationships are across languages with the same underlying word order type but from different language families and/or with differences in other typological characteristics.

Better handling of unaligned words  The other main source of error is words that remain unaligned in the projected syntactic structure and thus only loosely incorporated into the syntax trees. This includes items like case marking adpositions in Japanese, which are unaligned because there is no corresponding word in English, and auxiliaries in Welsh, which are unaligned when the English translation doesn't happen to use an auxiliary. In the former case, our SAO method for case system extraction doesn't include the case grams in the set of grams for each NP. In the latter, the word order inference system is unable to pick up on the VSO order represented as Aux+S+[VP]. Simply fixing the attachment of the auxiliaries will not be enough in this case, as the word order inference algorithm will need to be extended to handle auxiliaries, but fixing the alignment is the first step. Alignment problems are also the main reason our initial attempts to extract information about the order of determiners and nouns haven't yet been able to beat the most-frequent-type baseline.

Better handling of these unaligned words is a non-trivial task, and will require bringing in sources of knowledge other than the structure of the English translation. The information we have to leverage in this regard comes mainly from the gloss line and from general linguistic/typological knowledge which can be added to the algorithm. That is, there are types of grams which are canonically associated with verbal projections and types of grams canonically associated with nominal projections. When these grams occur on unaligned elements, we can hypothesize that the elements are auxiliaries and case-marking adpositions respectively. Further typological considerations will motivate heuristics for modifying tree structures based on these classifications.
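
One way such a heuristic could be prototyped is sketched below; this is purely illustrative and not an implemented component, and the gram inventories are assumptions rather than curated lists.

```python
# Illustrative sketch only: classify an unaligned token by the grams it carries.
# The gram inventories are assumptions for illustration, not curated lists.
VERBAL_GRAMS = {'AUX', 'PST', 'PRS', 'FUT', 'ASP', 'PROG', 'PFV'}
NOMINAL_GRAMS = {'NOM', 'ACC', 'ERG', 'ABS', 'DAT', 'GEN', 'LOC', 'INS'}

def classify_unaligned(grams):
    grams = {g.upper() for g in grams}
    if grams & VERBAL_GRAMS:
        return 'auxiliary'                    # attach into the verbal projection
    if grams & NOMINAL_GRAMS:
        return 'case-marking adposition'      # attach to the neighbouring NP
    return 'unknown'

print(classify_unaligned(['ACC']))            # case-marking adposition
```
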
Other directions for future work include extending this methodology to other aspects of grammatical description, including additional high-level systems (e.g., argument optionality), discovering the range of morphosyntactic features active in a language, and describing and populating lexical types (e.g., common nouns with a particular gender). Once we are able to answer enough of the questionnaire that the customization system is able to output a grammar, interesting options for detailed evaluation will become available. In particular, we will be able to parse the IGT (including held-out examples) with the resulting grammar, and then compare the resulting semantic representations to those produced by parsing the English translations with tools that produce comparable semantic representations for English (using the English Resource Grammar (Flickinger, 2000)).

7 Conclusions and Future Work

In this paper we have presented an approach to combining two types of linguistic resources—IGT, as produced by documentary linguists, and a cross-linguistic grammar resource supporting precision parsing and generation—to create language-specific resources which can help enrich language documentation and support language revitalization efforts. In addition to presenting the broad vision of the project, we have reported initial results in two case studies as a proof-of-concept. Though there is still a ways to go, we find these initial results a promising indication of the approach's ability to assist in the preservation of the key type of cultural heritage that is linguistic systems.

Acknowledgments

We are grateful to the students in Ling 567 at the University of Washington who created the testsuites and choices files used as development and test data in this work and to the three anonymous reviewers for helpful comments and discussion. This material is based upon work supported by the National Science Foundation under Grant No. BCS-1160274. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Steven Abney and Steven Bird. 2010. The human language project: building a universal corpus of the world's languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 88–97, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dorothee Beerman and Lars Hellan. 2011. Inducing grammar from IGT. In Proceedings of the 5th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2011).

Dorothee Beermann and Pavel Mihaylov. 2009. TypeCraft: Linguistic data and knowledge sharing, open access and linguistic methodology. Paper presented at the Workshop on Small Tools in Cross-linguistic Research, University of Utrecht, The Netherlands.

Emily M. Bender, Dan Flickinger, and Stephan Oepen. 2002. The grammar matrix: An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In John Carroll, Nelleke Oostdijk, and Richard Sutcliffe, editors, Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics, pages 8–14, Taipei, Taiwan.

Emily M. Bender, Scott Drellishak, Antske Fokkens, Laurie Poulson, and Safiyyah Saleem. 2010. Grammar customization. Research on Language & Computation, pages 1–50. 10.1007/s11168-010-9070-1.

Emily M. Bender, Sumukh Ghodke, Timothy Baldwin, and Rebecca Dridan. 2012a. From database to treebank: Enhancing hypertext grammars with grammar engineering and treebank search. In Sebastian Nordhoff and Karl-Ludwig G. Poggeman, editors, Electronic Grammaticography, pages 179–206. University of Hawaii Press, Honolulu.

Emily M. Bender, David Wax, and Michael Wayne Goodman. 2012b. From IGT to precision grammar: French verbal morphology. In LSA Annual Meeting Extended Abstracts 2012.

Emily M. Bender. 2007. Combining research and pedagogy in the development of a crosslinguistic grammar resource. In Tracy Holloway King and Emily M. Bender, editors, Proceedings of the GEAF 2007 Workshop, Stanford, CA. CSLI Publications.

Balthasar Bickel, Bernard Comrie, and Martin Haspelmath. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Max Planck Institute for Evolutionary Anthropology and Department of Linguistics, University of Leipzig.

Barry J. Blake. 2001. Case. Cambridge University Press, Cambridge, second edition.

Bernard Comrie. 2011a. Alignment of case marking of full noun phrases. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Digital Library, Munich.

Bernard Comrie. 2011b. Alignment of case marking of pronouns. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Digital Library, Munich.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, Prague, Czech Republic, June. Association for Computational Linguistics.

Scott Drellishak. 2009. Widespread But Not Universal: Improving the Typological Coverage of the Grammar Matrix. Ph.D. thesis, University of Washington.

Matthew S. Dryer. 2011. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Digital Library, Munich.

Scott Farrar and Terry Langendoen. 2003. A linguistic ontology for the semantic web. Glot International, 7:97–100.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6 (1) (Special Issue on Efficient Processing with HPSG):15–28.

Ryan Georgi, Fei Xia, and William Lewis. 2010. Comparing language similarity across genetic and typologically-based groupings. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 385–393, Beijing, China, August. Coling 2010 Organizing Committee.

Ryan Georgi, Fei Xia, and William Lewis. 2012. Improving dependency parsing with interlinear glossed text and syntactic projection. In Proceedings of COLING 2012: Posters, pages 371–380, Mumbai, India, December.

Ken Hale, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto, Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992. Endangered languages. Language, 68(1):1–42.

Martin Haspelmath, Matthew S. Dryer, David Gil, and Bernard Comrie, editors. 2008. The World Atlas of Language Structures Online. Max Planck Digital Library, Munich. http://wals.info.

Sabine Lehmann, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Hervè Compagnion, Judith Baur, Lorna Balkan, and Doug Arnold. 1996. TSNLP: Test suites for natural language processing. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING '96, pages 711–716, Stroudsburg, PA, USA. Association for Computational Linguistics.

William D. Lewis and Fei Xia. 2008. Automatically identifying computationally relevant typological features. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 685–690, Hyderabad, India.

William D. Lewis. 2006. ODIN: A model for adapting and enriching legacy infrastructure. In Proceedings of the e-Humanities Workshop, held in cooperation with e-Science, Amsterdam.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 629–637, Jeju Island, Korea, July. Association for Computational Linguistics.

Sebastian Nordhoff and Karl-Ludwig G. Poggeman, editors. 2012. Electronic Grammaticography. University of Hawaii Press, Honolulu.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. The University of Chicago Press and CSLI Publications, Chicago, IL and Stanford, CA.

Carmela Toews. 2009. The expression of tense and aspect in Shona. Selected Proceedings of the 39th Annual Conference on African Linguistics, pages 32–41.

Tony Woodbury. 2003. Defining documentary linguistics. Language Documentation and Description, 1(1):35.

Fei Xia and William D. Lewis. 2007. Multilingual structural projection across interlinear text. In Proceedings of the Conference on Human Language Technologies (HLT/NAACL 2007), pages 452–459, Rochester, New York.

Fei Xia and William D. Lewis. 2008. Repurposing theoretical linguistic data for tool development and search. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 529–536, Hyderabad, India.

Fei Xia and William Lewis. 2009. Applying NLP technologies to the collection and enrichment of language data on the web to aid linguistic research. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009), pages 51–59, Athens, Greece, March. Association for Computational Linguistics.

