Haig Et Al - 2011 - Documenting Endangered Languages

Documenting Endangered Languages
Unauthenticated
Download Date | 8/13/19 5:35 PM
Trends in Linguistics
Studies and Monographs 240
Editor
Volker Gast
Founding Editor
Werner Winter
Editorial Board
Walter Bisang
Hans Henrich Hock
Heiko Narrog
Matthias Schlesewsky
Niina Ning Zhang
Editor responsible for this volume
Volker Gast
De Gruyter Mouton
Documenting
Endangered Languages
Achievements and Perspectives
Edited by
Geoffrey L. J. Haig
Nicole Nau
Stefan Schnell
Claudia Wegener
De Gruyter Mouton
ISBN 978-3-11-026001-4
e-ISBN 978-3-11-026002-1
ISSN 1861-4302
Library of Congress Cataloging-in-Publication Data
Documenting endangered languages : achievements and perspectives /
edited by Geoffrey Haig ... [et al.].
p. cm. – (Trends in linguistics : studies and monographs; 240)
Includes bibliographical references and index.
ISBN 978-3-11-026001-4 (alk. paper)
1. Language obsolescence. 2. Language and languages. I. Haig,
Geoffrey.
P40.5.L33D63 2011
410–dc23
2011030746
Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.
© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston

Printing: Hubert & Co. GmbH & Co. KG, Göttingen
∞ Printed on acid-free paper
Printed in Germany.
www.degruyter.com
This volume is dedicated to Ulrike Mosel,
in recognition of her contribution towards documenting
the world’s endangered languages.
Contents
Preface
Ulrike Mosel’s contribution to documentary linguistics
Geoffrey Haig and Nicole Nau
Chapter 1
Introduction
Geoffrey Haig, Nicole Nau, Stefan Schnell and Claudia Wegener
Part I. Theoretical issues in language documentation
Chapter 2
Competing motivations for documenting endangered languages
Frank Seifart
Chapter 3
Evolving challenges in archiving and data infrastructures
Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck,
Menzo Windhouwer and Peter Wittenburg
Chapter 4
Comparing corpora from endangered language projects:
Explorations in language typology based on original texts
Geoffrey Haig, Stefan Schnell and Claudia Wegener
Part II. Documenting language structure
Chapter 5
“Words” in Kharia – Phonological, morpho-syntactic, and
“orthographical” aspects
John Peterson
Chapter 6
Aspect in Forest Enets and other Siberian indigenous languages –
when grammaticography and lexicography meet different
metalanguages
Florian Siegl
Chapter 7
Documentary linguistics and prosodic evidence for the syntax of
spoken language
Candide Simard and Eva Schultze-Berndt
Chapter 8
Diphthongology meets language documentation:
The Finnish experience
Klaus Geyer
Chapter 9
Retelling data: Working on transcription
Dagmar Jung and Nikolaus P. Himmelmann
Part III. Documenting the lexicon
Chapter 10
The making of a multimedia encyclopaedic lexicon for and in
endangered speech communities
Gaby Cablitz
Chapter 11
What does it take to make an ethnographic dictionary? On the
treatment of fish and tree names in dictionaries of Oceanic languages
Andrew Pawley
Part IV. Interaction with speech communities
Chapter 12
Language is power: The impact of fieldwork on community politics
Even Hovdhaugen and Åshild Næss
Chapter 13
Sustaining Vurës: Making products of language documentation
accessible to multiple audiences
Catriona Hyslop Malau
Chapter 14
Filming with native speaker commentary
Anna Margetts
Index
Preface
Ulrike Mosel’s contribution to documentary linguistics
Geoffrey Haig and Nicole Nau
In 2011, we look back on a decade of language documentation within the
Volkswagen Foundation’s DoBeS-programme. Likewise in 2011, Ulrike Mosel,
one of the founding scholars behind that programme, and a driving force
within it throughout, will reach official retirement age. We are all familiar
with Ulrike’s disdain for the external trappings of academia; for her, they
are simply diversions from the real business of documenting endangered lan-
guages. Nevertheless, we hope she will excuse us for taking a couple of pages
to briefly reflect on a very distinguished career, one that has been deeply inter-
twined with the burgeoning enterprise of language documentation, its theory,
its practice, but also with training and inspiring the minds of many linguists,
among them the editors of this volume.
It was her fascination with documents of languages that had lost their
speakers long ago that led Ulrike to the study of Semitic languages, Assyri-
ology and General Linguistics at the University of Munich in 1966. Her aca-
demic education was grounded in a tradition where the study of texts was of
central importance, and the principles of classical grammatical analysis were
common knowledge. Her linguistic background is steeped in the philological
tradition, resting on the three cornerstones of text, grammar and dictionary.
At the same time, she witnessed the spread of modern linguistics through
German universities, and absorbed those influences with the same insatiable
curiosity that characterizes all her activities. In her PhD thesis, written on the
great eighth-century Arabic grammarian Sibawayh, Ulrike identified precur-
sors of modern structuralist thinking in his work.
After her PhD, Ulrike’s research interests took a new direction: She turned
her attention away from the study of ancient texts and Semitic philology to
the under-described, and mostly unwritten, languages of the Pacific region,
beginning with Tolai in Papua New Guinea in the 1970’s. Working with the
Tolai was Ulrike’s first experience with linguistic fieldwork (bar a brief trip
to the Bedouins of Jordan, but that is another story), and as it is against her
nature to do something without method, it was probably inevitable that from
that time on she has contributed substantially to the methodology of fieldwork,
adding many aspects that had not been properly thought through at that
time; in part thanks to her efforts, the current generation of young linguists
leaves for the field much better prepared than she was. Ulrike was certainly
not the only scholar at the time who went into the field with some high-flown
research agenda and ended up collecting stories. But she was one of the few
wise enough to understand that collecting texts and actually learning the lan-
guage is not something that precludes serious linguistic research, but forms
an essential part of it. Thus, the publication of Tolai texts (1977) stands with
equal importance beside the two later monographs she devoted to the lan-
guage: Tolai and Tok Pisin (1980), where she proved the important role Tolai
had played in the development of Tok Pisin lexicon and grammar, and Tolai
syntax and its historical development (1984), which was her post-PhD thesis
(Habilitationsschrift).
Since then, Ulrike has continued to work on different languages of the Pa-
cific region, focussing on Samoan and Teop. She is not one of those linguists
who produce superficial, quick-fix analyses on isolated aspects of numerous
different languages. Ulrike’s work is based on years of first-hand experience,
and the study of actual usage in the speech community. It would be simply
against her principles to publish on a language she was not personally inti-
mately familiar with; one can only speculate on the beneficial effects such a
principle would have on the quality of many linguistic publications, if it were
more widely adhered to. In her thinking as a linguist, she was strongly influ-
enced by the major typological schools of the time, spending several years
in the inspiring research environment at the University of Cologne, where
Hansjakob Seiler and his associates had founded the major centre for lan-
guage typology and universals in Germany. Around 1990, Ulrike moved to
the Australian National University in Canberra, where she worked with lin-
guists like Avery Andrews, Bob Dixon, Andy Pawley, Malcolm Ross and
Anna Wierzbicka. This period, interspersed with lengthy field trips to Samoa,
is a time she looks back on with great fondness, both intellectually and per-
sonally. During the Canberra years she produced with Even Hovdhaugen the
monumental Samoan reference grammar, as well as numerous important con-
tributions to Samoan syntax and semantics.
Then in 1995 she was offered the chair at the Department of General and
Indo-European Linguistics at the University of Kiel, where she has remained
ever since. When Ulrike arrived in Germany, she brought with her the "can
do!" attitude that she herself would, perhaps, attribute to the influence of her
Australian years, but in fact is very much part of her own personality. She in-
herited a chair that had been vacant for some time, and a department that was
lacking in direction, and in students. She set about rebuilding the department,
introducing an emphasis on language documentation, but without neglect-
ing foundational theoretical aspects of linguistics. The BA programme she
established in collaboration with the phonetics department has been hugely
successful, with new enrolments of 60–100 students annually. The shake-up
involved some decisions that raised a few eyebrows in the then still very tra-
ditional Philosophische Fakultät at the University of Kiel (for example, her
decision to abolish Latin as a prerequisite for an undergraduate degree in lin-
guistics, or her regular conflicts with the faculty when it came to allowing
graduates to write their theses in English). In both decisions, as in many oth-
ers, she stuck to her own convictions against considerable resistance, and in
doing so proved herself once again ahead of her time.
Within a remarkably short time she had forged a thriving department,
hosted several research projects, supervised more than a dozen PhDs and
countless MA theses, and set up highly successful and innovative new degree
programmes. She never served as dean, vice-chancellor, or anything else in
university politics: she just concentrated on what she does best, namely re-
search and teaching in linguistics. International recognition of her work has
come in many forms, among them an Honorary Membership in the Linguistic
Society of America, awarded to her in 2007.
Through it all she has retained her signature mix of enthusiasm and rigour,
coupled with a very Prussian work ethic, but also a deep humanity that en-
ables her to genuinely engage with people of all walks of life – one of the
most important qualities for successful field work. Her research commit-
ments in these latter years have been dominated by her involvement in the
DoBeS-programme, as one of its founding scholars, multiple grant-recipient,
and long-standing chairperson of the Steering Committee. There is hardly an
aspect of the programme that Ulrike has not impacted on over the years. We
feel confident that a little thing like retirement will not significantly change
that.
Publications by Ulrike Mosel

Grammars and textbooks
1992 Samoan Reference Grammar (with Even Hovdhaugen). Oslo: Scan-
dinavian University Press.
1994 Saliba. München: Lincom.
1997 Say It In Samoan (with Ainslie So’o). (A text book for learners of
Samoan as a second language). Canberra: Australian National Uni-
versity Press.
2000 O le Kalama o le Gagana Samoa (with Fosa Siliko, Ainslie So’o
and Agafili Tuitolova’a). (A Samoan grammar for teachers). Apia,
Western Samoa: Curriculum Development Unit, Department of Ed-
ucation.
2007 The Teop Sketch Grammar (with Yvonne Thiesen). University
of Kiel. Version May 2007.
http://corpus1.mpi.nl/ds/imdi_browser/?openpath=MPI533750%23 and
http://www.linguistik.uni-kiel.de/Teop_Sketch_Grammar_May07.pdf
Dictionaries
1997 O le fale (with Mose Fulu). (A small monolingual Samoan dictionary
on housebuilding and furniture, published by the Ministry of Youth,
Sports and Culture). Apia, Western Samoa: Matagaluega Autalavou
Taaloga ma Aganuu.
2001 Utugaga (with Fosa Siliko, Ainslie So’o and Agafili Tuitolova’a).
(A monolingual dictionary of Samoan for primary school students
of year 5–8). Apia, Western Samoa: Curriculum Development Unit,
Department of Education.
2007 Teop Lexical Database (with Marcia L. Schwartz, Ruth Saovana
Spriggs, Ruth Siimaa Rigamu, Jeremiah Vaabero and Naphtaly
Maion). Kiel: Seminar für Allgemeine und Vergleichende Sprachwissenschaft,
Christian-Albrechts-Universität.
http://www.linguistik.uni-kiel.de/Teop_Lexical_Database_May07.pdf
2010 A inu. The Teop–English Dictionary of House Building (with Mark
Mahaka, Enoch Horai Magum, Joyce Maion, Naphtaly Maion, Ruth
Siimaa Rigamu, Ruth Saovana Spriggs, Jeremiah Vaabero, Marcia
Schwartz and Yvonne Thiesen). With drawings by Neville Vitahi
and photographs by Ulrike Mosel. SAVS Arbeitsberichte 6. Kiel:
Seminar für Allgemeine und Vergleichende Sprachwissenschaft,
Christian-Albrechts-Universität.
Text collections
1977 Tolai Texts. Kivung, Journal of the Linguistic Society of Papua New
Guinea. Volume 10, Port Moresby.
2007 Amaa vahutate vaa Teapu (with Enoch Horai Magum, Joyce Maion,
Jubilie Kamai, Ondria Tavagaga and Yvonne Thiesen). Illustrated
by Rodney Rasin. Kiel: Seminar für Allgemeine und Vergleichende
Sprachwissenschaft, Christian-Albrechts-Universität.
2009 Teop Language Corpus (with Enoch Horai Magum, Helen Magum,
Shalom Magum, Jubilie Kamai, Owen Kasinory, Mark Ma-
haka, Joyce Maion, Naphtaly Maion, Janeth Nasin, Rodney
Rasin, Ruth Siimaa Rigamu, Ruth Saovana Spriggs, Ondria Tavagaga,
Jeremiah Vaabero, Neville Vitahi, Jessika Reinig, Marcia
Schwartz, Yvonne Schuth (Thiesen)).
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI622803%23
On grammaticography
1975 Die syntaktische Terminologie bei Sibawaih. 2 volumes. Disserta-
tion. München: Uni Fotodruck Frank.
1987 Inhalt und Aufbau deskriptiver Grammatiken (How to write a gram-
mar). Arbeitspapier Nr. 4. Köln: Institut für Sprachwissenschaft,
Universität zu Köln.
1980 Syntactic categories in Sibawaih’s “Kitab”. Histoire Épistémologie
Langage 2(1): 27–37.
2002 Analytic and synthetic language description. In Linguistik jenseits
des Strukturalismus. Akten des II. Ost-West-Kolloquiums Berlin
1998, eds. Kennosuke Ezawa, Wilfried Kürschner, Karl H. Rensch
and Manfred Ringmacher, 199–208. Tübingen: Narr.
2006 Sketch grammars. In Essentials of Language Documentation, eds.
Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, 301–309.
Berlin: Mouton de Gruyter.
2006 Grammaticography, the art and craft of writing grammars. In Catch-
ing Language: the Standing Challenge of Grammar Writing, eds. Fe-
lix Ameka, Alan Dench and Nicholas Evans, 41–68. Berlin: Mouton
de Gruyter.
2007 Early grammars of Oceanic languages (with Even Hovdhaugen).
In Sprachtheorien der Neuzeit III/2: Sprachbeschreibung und
Sprachunterricht, ed. Peter Schmitter, 462–478. Tübingen: Narr.
On lexicography
2002 Dictionary making in endangered language communities. In Pro-
ceedings of the International Workshop on Resources and Tools in
Field Linguistics, Las Palmas 26–27 May 2002.
2004 Dictionary making in endangered speech communities. In Language
Documentation and Description, Volume 2, ed. Peter K. Austin,
39–54. London: School of Oriental and African Studies.
2011 Lexicography in endangered language communities. In The Cam-
bridge Handbook of Endangered Languages, eds. Peter K. Austin
and Julia Sallabank, 337–353. Cambridge: Cambridge University
Press.
On fieldwork and language documentation

2006 Essentials of Language Documentation (with Jost Gippert and Niko-
laus Himmelmann, eds.). Berlin: Mouton de Gruyter.
2001 Linguistic fieldwork. In International Encyclopedia of the Social and
Behavioral Sciences, Volume 13, eds. Neil J. Smelser and Paul B.
Baltes, 8906–8910.
2002 Methods of language documentation in the DOBES project (with Pe-
ter Wittenburg and Arienne Dwyer). In Proceedings of the Interna-
tional Workshop on Resources and Tools in Field Linguistics, Las
Palmas 26–27 May 2002.
2004 Inventing communicative events: Conflicts arising from the aims of
language documentation. Language Archive Newsletter 1(3): 3–4.
2006 Fieldwork and community language work. In Essentials of Language
Documentation, eds. Jost Gippert, Nikolaus Himmelmann and Ulrike
Mosel, 67–85. Berlin: Mouton de Gruyter.
In print Morphosyntactic analysis in the field, a guide to the guides. In The
Oxford Handbook of Linguistic Fieldwork, ed. Nicholas Thieberger.
Oxford: Oxford University Press.
On typology and grammar of Oceanic languages

1983 Adnominal and Predicative Possessive Constructions in Melanesian
Languages. Arbeiten des Kölner Universalienprojektes 50. Köln: In-
stitut für Sprachwissenschaft, Universität zu Köln.
1999 Negation in Oceanic Languages (with Even Hovdhaugen, eds.).
München: Lincom.
1982 Number, collection and mass in Tolai. In Apprehension. Das sprachliche
Erfassen von Gegenständen. Teil II. Die Techniken und ihr
Zusammenhang in Einzelsprachen, eds. Hansjakob Seiler and Franz-
Josef Stachowiak, 123–154. Tübingen: Narr.
1987 Subject in Samoan. In A World of Language: Papers Presented to
Professor S. A. Wurm on his 65th Birthday, eds. Donald C. Laycock
and Werner Winter, 455–479. Canberra: Australian National Univer-
sity Press.
1989 On the classification of verbs and verbal clauses in Samoan. In VI-
CAL 1: Oceanic Languages. Papers from the Fifth International
Conference on Austronesian Linguistics, eds. Ray Harlow and Robin
Hooper, 377–398. Auckland: Linguistic Society of New Zealand.
1991 Abstufungen der Transitivität im Palauischen. In Partizipation: Das
sprachliche Erfassen von Sachverhalten, eds. Hansjakob Seiler and
Waldfried Premper, 400–407. Tübingen: Narr.
1991 The continuum of verbal and nominal clauses in Samoan. In Par-
tizipation: Das sprachliche Erfassen von Sachverhalten, eds. Hans-
jakob Seiler and Waldfried Premper, 138–149. Tübingen: Narr.
1991 Towards a typology of valency. In Partizipation: Das sprachliche
Erfassen von Sachverhalten, eds. Hansjakob Seiler and Waldfried
Premper, 240–251. Tübingen: Narr.
1991 Transitivity and reflexivity in Samoan. Australian Journal of Linguistics
11: 175–194.
1992 On nominalisation in Samoan. In The Language Game: Papers in
Memory of Donald C. Laycock, eds. Tom Dutton, Malcolm Ross and
Darrell T. Tryon, 263–281. Canberra: Australian National University
Press.
1994 Samoan. In Comparative Austronesian Dictionary: An Introduction
to Austronesian Studies, ed. Darrell T. Tryon, 943–946. Berlin: Mou-
ton de Gruyter.
1994 Tolai. In Comparative Austronesian Dictionary: An Introduction to
Austronesian Studies, ed. Darrell T. Tryon, 727–730. Berlin: Mouton
de Gruyter.
1999 Towards a typology of negation in Oceanic Languages. In Negation
in Oceanic Languages, eds. Even Hovdhaugen and Ulrike Mosel,
1–19. München: Lincom.
1999 Negation in Teop (with Ruth Spriggs). In Negation in Oceanic Languages,
eds. Even Hovdhaugen and Ulrike Mosel, 45–56. München:
Lincom.
1999 Gender in Teop (with Ruth Spriggs). In Gender in Grammar and
Cognition, eds. Barbara Unterbeck and Matti Rissanen, 321–349.
Berlin: Mouton de Gruyter.
2000 Aspect in Samoan. In Probleme der Interaktion von Lexik und As-
pekt, ed. Walter Breu, 179–192. Tübingen: Niemeyer.
2000 Valence changing clitics and incorporated prepositions in Teop
(Oceanic, Bougainville) (with Jessika Reinig). In Proceedings of
AFLA VII, ed. Marian Klamer, 133–140. Amsterdam: Department
of Linguistics, Vrije Universiteit.
2004 Complex predicates and juxtapositional constructions in Samoan. In
Complex Predicates in Oceanic Languages. Studies in the Dynam-
ics of Binding and Boundedness, eds. Isabelle Bril and Françoise
Ozanne-Rivierre, 263–296. Berlin: Mouton de Gruyter.
2004 Demonstratives in Samoan. In Deixis in Oceanic Languages, ed.
Gunter Senft, 141–174. Canberra: Pacific Linguistics.
2007 Ditransitivity and valency change in Teop – a corpus based approach.
Tidsskrift for Sprogforskning, 5: 1–40.
2010 The fourth person in Teop. In A Journey through Austronesian and
Papuan Linguistic and Cultural Space: Papers in Honour of Andrew
K. Pawley, eds. John Bowden, Nikolaus Himmelmann and Malcolm
Ross, 391–404. Canberra: Pacific Linguistics.
On Semantics
1982 Local deixis in Tolai. In Here and There: Cross-linguistic Studies in
Deixis and Demonstration, eds. Jürgen Weissenborn and Wolfgang
Klein, 111–132. Amsterdam: Benjamins.
1991 Time metaphors in Samoan. In Festschrift für Meinrad Scheller,
ed. Walter Bisang, 149–165. Arbeiten des Seminars für Allgemeine
Sprachwissenschaft der Universität Zürich 11. Zürich: Seminar für
Allgemeine Sprachwissenschaft, Universität Zürich.
1991 The Samoan construction of reality. In The Currents in Pacific Lin-
guistics. Papers on Austronesian Languages and Ethnolinguistics in
Honor of George Grace, ed. Robert Blust, 293–303. Canberra: Aus-
tralian National University Press.
1994 Samoan. In Semantic and Lexical Universals: Theory and Empirical
Findings, eds. Cliff Goddard and Anna Wierzbicka, 331–360. Amsterdam:
Benjamins.
On language contact and language change

1980 Tolai and Tok Pisin. The Influence of the Substratum on the Develop-
ment of New Guinea Pidgin. Canberra: Australian National Univer-
sity Press.
1984 Tolai Syntax and its Historical Development. Canberra: Australian
National University Press.
1979 Early language contact between Tolai, Pidgin and English in view of
its sociolinguistic background (1875–1914). In Papers in Pidgin and
Creole Linguistics No. 2, 163–181. Canberra: Australian National
University Press.
1982 The influence of the Church Missions on the development of Tolai. In
Gava’. Studies in Austronesian Languages and Cultures, Dedicated
to Hans Kähler, eds. Rainer Carle, Martina Henschke, Peter W. Pink,
Christel Rost and Karen Stadtlender, 155–172. Berlin: Reimer.
1982 New evidence of Samoan origin of New Guinea Tok Pisin (New
Guinea Pidgin English) (with Peter Mühlhäusler). Journal of Pacific
History 17(3): 166–175.
2004 Borrowing in Samoan. In Borrowing: A Pacific Perspective, eds. Jan
Tent and Paul Geraghty, 215–232. Canberra: Australian National
University.
To app. Analogical levelling across constructions – incorporated preposi-
tions in Teop. In The Evolution of Syntactic Relations, eds. Christian
Lehmann and Stavros Skopeteas. Berlin: Mouton de Gruyter.
Chapter 1
Introduction: Documenting endangered languages
before, during, and after the DoBeS programme∗
Geoffrey Haig, Nicole Nau, Stefan Schnell
and Claudia Wegener
1. Background
Scholars have been engaged in language documentation for centuries. But it is
only in the past couple of decades that language documentation has emerged
as an academic discipline in its own right, associated with scholarly publica-
tions, university departments, funding initiatives, and a growing repertoire of
practices, theoretical diversification and sophistication, and specialist termi-
nology. The roots of the current upsurge in language documentation run deep,
and stem from several sources, too many to treat in any detail here. Today, lan-
guage documentation has come of age, a fertile domain for cross-disciplinary
exchange, involving linguists, anthropologists, ethno-musicologists, biolo-
gists, along with software developers and corpus linguists.
The development has been remarkably rapid; a glance at the seminal
article by Himmelmann (1998) suffices to confirm that much of what was
envisaged in that programmatic statement has since become part and par-
cel of language documentation practice. In the late 1990’s, a small group
of German-based linguists began developing an agenda for a programme
aimed at documenting endangered languages worldwide. That initiative came
to fruition in the Volkswagen Foundation’s DoBeS-programme (Dokumenta-
tion bedrohter Sprachen), which began with a pilot phase in 2000. DoBeS
went on to become the Foundation’s longest-running programme within the
humanities. From the outset, DoBeS established the Max Planck Institute for
Psycholinguistics in Nijmegen as the host for the central archive, also respon-
sible for developing and implementing technical standards for the project.
This generated an intensive exchange between linguists, software developers,
and archivists, yielding significant improvements in annotation and metadata
procedures as well as refinements of ethical and legal guidelines. It has been
an ongoing process, constantly informed by the practical experience gained
through some 50 documentation projects (cf. Broeder et al., this volume).
Moreover, regular workshops and summer schools have provided scores of
linguists, as well as many native speakers of endangered languages, with
training in documentary linguistics, ensuring the initiative’s continued in-
fluence through future generations of linguists. The DoBeS-programme has
undoubtedly been one of the most successful research initiatives in the lan-
guage sciences. Its impact on the emergent field of documentary linguistics
can hardly be exaggerated (see Harrison, Rood, and Dwyer 2008 for a similar
assessment), with long-term implications that go far beyond the field of lin-
guistics itself. With the programme entering its closing phases in late 2011,
it is fitting to take a look at some of its achievements, and some of the future
challenges, through the work of some of its practitioners.
∗ We are extremely grateful to the DoBeS-programme of the Volkswagenstiftung, who
funded most of the research reported in this book, and so much more. We would also
like to thank the authors for their contributions and constructive input during the entire
publication process. Finally, our thanks go to the production team at Mouton and the series
editor, Volker Gast, for their support and encouragement throughout.
As mentioned, the modern discipline of language documentation has mul-
tiple ancestors. Previous retrospectives, e.g. Woodbury (2003), identify Franz
Boas as the spiritus rector of documentary linguistics in the North American
context. However, with the rise of Chomskyan linguistics in North America,
the Boasian tradition of anthropologically informed documentary linguistics
had lost ground there. Both Grinevald (2003) and Woodbury (2003) identify
the LSA symposium in 1991, and the associated publication in Language
(Hale et al. 1992), as the turning points in triggering the rehabilitation of doc-
umentary linguistics. From a North American perspective, the developments
certainly appear to be quite radical; Grinevald (2003: 52) describes in vivid
terms the “incredible tension” she experienced prior to the LSA panel meet-
ing in 1991 when the blunt facts regarding the imminent loss of much of the
world’s linguistic diversity were to be presented before the assembled digni-
taries of the North American linguistic scene. The perception of a “paradigm
shift” can only really be appreciated when one considers the extent to which
American linguistics at the time was dominated by scholars working in theo-
ries focused on an idealized conception of “grammar”. The topic of language
endangerment, on the other hand, meant introducing social and political di-
mensions into a field that had effectively abstracted away from such consider-
ations. But since the early 1990’s, documentary linguistics in North America
has regained much of the lost ground, with numerous highly successful and
innovative programmes now well-established (cf. Woodbury 2003 for further
discussion).
But the North American perspective is only part of the story.¹ From a European
perspective, on the other hand, the paradigm shift appears somewhat
less fundamental. In Germany, now an important centre for language docu-
mentation, the impact of the much-vaunted Chomskyan Revolution was con-
siderably weaker than across the Atlantic. The reasons are partly to be sought
in the way linguistics is institutionalized at German universities. To this day,
dedicated departments of general linguistics are only sparsely scattered across
the university landscape. Linguistics has remained to a large extent the con-
cern of individual philologies (English, Romance, German etc.). Within such
departments, the text-based tradition of philology continued to be an impor-
tant pillar in the academic training. Linguists within these disciplines thus
never felt themselves on the defensive to the same extent that their American
colleagues did during Generative Grammar’s rise to dominance in
North America. In Europe, a deep-rooted tradition in the description of little-
studied languages and dialects could survive as a respectable, if marginal,
niche within the philologies. Particularly in departments such as African lan-
guages, Finno-Ugric, Semitic, Turkic, or Iranian, descriptive grammars, dic-
tionaries and text collections from under-studied languages and dialects have
constituted a regular part of the scientific output for close to two centuries.
Highly-respected German professors, such as the now retired Semitist Otto
Jastrow, devoted much of their professional careers to field work and docu-
menting endangered languages. Among other things, Jastrow’s work led to
the establishment of the Semarch,² a digital archive of (often highly endangered)
Semitic languages at the University of Heidelberg.
1. A comprehensive treatment of the roots of language documentation lies beyond the scope
of this introduction; see Himmelmann (2008) for a more balanced account. Two further an-
tecedents of modern documentary linguistics, outside of Western Europe and North Amer-
ica, are also definitely worthy of mention: the linguistic fieldwork paradigm initiated by the
Moscow-based linguist Alexander Kibrik and his associates in the 1970’s, which generated
an enormous amount of descriptive materials on indigenous languages of the former
Soviet Union, and the tradition of encouraging descriptive grammars of undescribed languages as
PhD theses, established at the Australian National University in the 1970’s mainly through
the efforts of Bob Dixon.
2. http://www.semarch.uni-hd.de/index.php43
Nevertheless, it would be a gross oversimplification to see today’s discipline
of language documentation as simply the digitalized continuation of
the philological tradition of the 19th century. For a start, today’s documentary
linguistics is primarily motivated by the desire to conserve (at least a record
of) the world’s rapidly diminishing linguistic diversity. Prior to the 1980’s, the
global dimensions of this trend had not been fully appreciated. It is this aware-
ness that unites modern documentary linguistics world-wide, and across the
boundaries of language families, creating a common platform that the indi-
vidual philologies never really developed. Himmelmann (2006: 15) identifies
five further innovations that distinguish contemporary documentary linguistics
from its philological predecessors. Here we take up what we consider to be
the three most salient, while adding a fourth that has very recently come to
the fore:
1. Focus on the full range of communicative practices, rather than on selected
text types with supposedly high cultural, religious or historical significance
(and, one might add, most commonly from exclusively male speakers).
This point is fundamental; it arises from the fact that documentary linguis-
tics is a branch of general linguistics, rather than any particular philology.
Linguistics is concerned with language, regardless of its prestige; spon-
taneous discourse is not merely “idle” chat, but offers important insights
into the language usage of the community and is thus equally worthy of the
documenter’s attention.3 Furthermore, additional research questions have
arisen over the last decades, for example discourse analysis, language and
gender, language acquisition, and language contact, which have led to a
growing interest in a broader range of communicative events. As an ex-
ample of a philologist’s approach, it is instructive to consider the work of
David MacKenzie, a specialist on Iranian languages who undertook documentation
of Kurdish dialects in Iraq in the 1950’s and 1960’s. In the
introduction to his two-volume work (MacKenzie 1961: xvii), MacKenzie
bemoans the dearth of “trustworthy informants”, i.e. speakers who use
dialectally “pure” forms in their speech, and are unaffected by dominant
standard languages. For MacKenzie, the lack of “trustworthy” speakers
constituted a serious obstacle to the work of documentation. In the modern
practice of documenting endangered languages, on the other hand, lack of
speakers of a “pure” dialect is probably the norm for many projects. And in
fact, interest in the dynamics of language shift and obsolescence is now a
major research focus in its own right within documentary linguistics (Seifart,
this volume).
3. The importance of documenting the full range of communicative practices in the community
has been stressed by many authors (e.g. Himmelmann 1998; Foley 2003). In practice,
however, it is notable that in most documentation projects, it is still traditional
monologues, rather than everyday conversational interactions, that tend to end up as
fully annotated records in the archive. Part of the reason lies in the speech communities’ own
assessment of what is worth preserving for posterity (cf. Mosel 2004, 2006, 2008). Partly,
it may be due to the practical difficulties of recording, annotating and analysing spontaneous
multi-participant discourse. In this sense, then, documentation practice lags behind
documentation theory, but this is a perfectly normal state of affairs in any discipline, and
with increased experience with, for example, video recording and annotations, we expect
that greater amounts of conversational data will be archived.
2. Concerns for long term storage and preservation of primary data.
This requires no special comment, and has been dealt with in many contributions
(e.g. Trilsbeek and Wittenburg 2006). However, there is a further consideration
generally wedded to the concern for long-term data preservation,
namely the issue of maximal web-based accessibility. As Harrison
et al. (2008: 3) put it: “Never before have extensive text collections been
directly accessible to researchers worldwide.” This ease of accessibility
distinguishes modern documentary linguistics from its predecessors rather
sharply. But like most technical innovations, it has its downside. Global ac-
cessibility of (often sensitive) data has spawned a host of highly complex
legal and ethical issues regarding access rights and protection of personal
and community privacy, most of which remain unresolved (see discussion
in Broeder et al., this volume).
3. Close cooperation with, and direct involvement of, the speech community.
The degree to which community involvement characterizes modern lan-
guage documentation, and the fact that it is now explicitly recognized and
encouraged in grant proposals and evaluations, is undoubtedly a major in-
novation, which few would seriously question. But it is worth recalling that
even in the 1990’s, the issue of community involvement was highly contro-
versial. At the 1993 Cologne summer school on documentary linguistics,
organized by Hans-Jürgen Sasse and Nikolaus Himmelmann, the keynote
speech by Colette Grinevald (then Colette Craig) was dominated by a
heated discussion of this issue. Those who advocated a purely “objective
scientific” approach to language documentation, eschewing any involvement
in community-based efforts at language revitalization, were pitted
against those who considered it part of the documenters’ responsibility to
become involved in such initiatives. Since then, however, documentary linguistics,
particularly in the DoBeS framework, has adopted a much more
pragmatic and realistic approach. It is now widely recognized that it is
simply counterproductive to insist on a dogmatic standpoint on this issue.
If the speech community wishes for the documenter to participate in the
speech community’s efforts to maintain its own language, this needs to be
taken seriously as part of the documentation. Quite apart from the moral
imperative, and despite the very real complications that may arise through
the community’s active involvement (see Hovdhaugen and Næss, this vol-
ume), experience from numerous documentation projects worldwide has
demonstrated that such participation is generally highly beneficial for all
actors in the documentary scenario, and enhances the quality of the data in
numerous ways.
4. The final point, not mentioned by Himmelmann, is the scientific potential
inherent in the rapidly growing amounts of digitally archived data from
typologically diverse languages.
This is a point that few proponents of language documentation could have
foreseen at the outset; it is a nice example of how data collection, if done
properly, can generate new research fields which were simply not pre-
dictable at the time the data was collected. A recent example stems from
research based on phoneme inventories of the world’s languages. When
Ian Maddieson began to collate phoneme inventories from a sample of the
world’s languages in the early 1980’s, his work was known primarily as a
resource for language typology. Later, Maddieson contributed his data to
the web-accessible World Atlas of Language Structures (WALS).4 Based
on the phonological data in WALS, Quentin Atkinson has since formulated
some far-reaching claims on the global settlement and migration patterns
of Homo sapiens (Atkinson 2011).5 Back in the 1980’s, it is highly unlikely
that Ian Maddieson could have predicted that his data on phoneme
inventories might one day feed into a hypothesis regarding global human
settlement patterns in prehistory. The point of this example, and numerous
comparable ones, is that data collection is never “mere data collection”.
4. http://wals.info/
5. Atkinson’s findings have been heavily criticized, and may thus seem an unfortunate example
for illustrating this point. However, the basic fact remains that data collectors simply
cannot predict future advances; for them, it is a sufficient goal to ensure maximal accessibility,
persistence, and good data organization so that later analyses can be carried out as
economically as possible.
The creation of structured, accessible, and rich databases raises conceptual
and technical challenges (e.g. appropriate metadata formats); their resolution
in itself represents progress, and may in fact yield unforeseen theoretical
advances. The DoBeS archive is thus far more than simply a “data
repository”; it is a dynamic research resource that is generating exciting
new perspectives for research. Language typologists and corpus linguists
are beginning to appreciate these opportunities. Language typology based
on language comparison through primary data – texts – rather than refer-
ence grammars (cf. Wälchli 2009) is already gaining ground, and it is only
a matter of time before the considerable resources residing in archives of
endangered languages will be harnessed for this kind of research (cf. Haig
et al., this volume, for some initial examples). Thus the emphasis in the
DoBeS-programme on “primary data”, rather than a “grammar+dictionary
format” (Himmelmann 2006) finds its natural counterpart in an increas-
ingly primary-data oriented, quantitative approach to language typology.
2. The organization of this volume

The contributions to this volume deal with aspects of language documenta-
tion, as they have emerged through the work of practitioners active in the
DoBeS-programme over the last decade. The chapters are divided into four
parts, which we briefly summarize below.
Part I. Theoretical issues in language documentation
The first part contains three contributions relating to theoretical issues in lan-
guage documentation as a discipline.
In the first one, “Competing motivations for documenting endangered
languages”, Frank Seifart discusses the different motivations and expecta-
tions that different researchers bring to the enterprise of language documenta-
tion, and how they can be resolved. He focuses on four different motivations:
preservation of human cultural heritage, extending the empirical basis of lin-
guistics, aiding a speech community in their efforts at language promotion,
and documenting the effects of language contact. Seifart distinguishes the de-
mands that each motivation places on different aspects of a documentation,
taking as his framework Himmelmann’s (1998, 2006) distinction between
primary data (content), and analytical apparatus of a language documenta-
tion. He notes how conflicts can be resolved by setting priorities either in the
content, or the apparatus component of a language documentation, or both.
Language documentation is by nature a multi-faceted enterprise, inevitably
leading to competing interests; Seifart provides a framework in which com-
peting motivations are contextualized as a normal state of affairs, whose res-
olution is part of the overall challenge of language documentation.
In their contribution “Evolving challenges in archiving and data infra-
structures”, Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uyt-
vanck, Menzo Windhouwer and Peter Wittenburg trace some of the develop-
ments in language documentation technology, in which the technical group
of the MPI has been centrally involved. The pace of progress has been ex-
traordinary: data storage capacity has vastly increased, both in archives and
in recording devices; portability and performance of recording devices have
improved beyond measure; annotation and metadata schemes have been progressively
refined. One of the most exciting prospects is the development of
automatic recognition software “that react to comparatively simple patterns
in media streams” as an additional layer to conventional annotations. By contrast,
developments in the less technical aspects of archiving have not always
matched the technical advances. In particular, the issue of access rights to
archived material remains, despite 10 years of intense negotiation, far from
resolved. The authors stress the dynamic nature of these developments, an
evolutionary process driven on the one hand by the technical advances, yet
constantly modified in accordance with the constraints of language documen-
tation, with its eminently human aspects.
Geoffrey Haig, Stefan Schnell and Claudia Wegener discuss the potential
relevance of archived data for language typology (“Comparing corpora from
endangered language projects: Explorations in language typology based on
original texts”). They show how archived data can be used in language com-
parison, enriching typological research not only by new data, but also by new
methods. They argue that quantitative cross-corpus analysis is possible and
yields interesting results in cases where texts of the same genre are com-
pared, irrespective of the content. In a case study they explore the marking of
core arguments in four genetically unrelated and geographically distant lan-
guages, using corpora of original spoken narrative texts from DoBeS-projects
and the annotation system GRAID developed by Haig and Schnell (2011).
Their investigation shows that some characteristics of discourse, such as the
proportion of transitive and intransitive clauses in texts, or the well-known
tendency to mark A by a pronoun or zero anaphora rather than a lexical ex-
pression, are remarkably similar across languages, while others, notably the
deployment of pronouns in different syntactic functions, are more
language-dependent.
Part II. Documenting language structure
This part contains case studies from individual language documentations showing
how language documentation practice can lead to the recognition of, and
possible solutions to, issues of structural analysis with considerable
theoretical relevance.
One of the earliest decisions that language documenters need to make
concerns the nature of the orthography used in various aspects of the docu-
mentation. In Chapter 4, “‘Words’ in Kharia – Phonological, morpho-syntac-
tic, and ‘orthographical’ aspects”, John Peterson discusses the different fac-
tors that need to be considered when fixing the boundaries of the basic or-
thographical unit, the word. The language concerned, Kharia, makes exten-
sive use of clitics in inflection, which is a contributing factor to conflicting
notions of “word” in the language. But the indigenous writing tradition in
the dominant languages, based on the Devanagari script, is a further factor,
as is the prosodic pattern typically associated with content words. Peterson
presents the results of an investigation into native speakers’ perception
of word boundaries, revealing that while there are areas of relatively stable
judgements across different speakers, certain types of word/phrase may be or-
thographically segmented quite differently by different native speakers. His
approach demonstrates how native speakers’ intuitions in orthography may
run counter to the linguist’s analysis; both need careful consideration before
decisions can be made.
It is an – often unspoken – assumption that documentary linguists should
conduct their grammatical analysis in a maximally accessible, “theory-neu-
tral” framework. In practice, however, no linguist is free of theoretical bias. In
Chapter 5, Florian Siegl shows how both the native language of the investiga-
tor, as well as a local linguistic tradition, may impact on the way grammatical
descriptions are formulated (“Aspect in Forest Enets and other Siberian indig-
enous languages – when grammaticography and lexicography meet different
metalanguages”). The case in point is the description of “aspect” in the en-
dangered Samoyedic language Forest Enets. Siegl demonstrates that the con-
ception of verbal aspect, as established within a long-standing grammatico-
graphic tradition of the Russian language, has seriously distorted the analysis
of a similar, though by no means analogous, category in the endangered language,
both in grammars and dictionaries of this language. This study illustrates
how linguistic analysis may be driven by the linguist’s pre-established
conception of linguistic phenomena, and underscores the necessity for a truly
data-driven approach.
Current documentations of endangered languages offer vastly improved
possibilities for the study of spoken language. In particular, the range of
languages documented makes it possible to extend linguistic research on
prosody, which so far has been based mainly on data from a few well de-
scribed languages. This point is made by Candide Simard and Eva Schultze-
Berndt (“Documentary linguistics and prosodic evidence for the syntax of
spoken language”), who further argue that grammatical units in spoken lan-
guage cannot be defined without reference to prosodic units. They demon-
strate their approach with an analysis of prosodic and syntactic units in the
Australian language Jaminjung, such as afterthoughts, non-finite construc-
tions, and discontinuous noun phrases. Prosodic evidence is sufficient to dis-
tinguish several syntactic constructions with otherwise identical surface forms.
They show further that different degrees of semantic integration are iconically
reflected by degrees of prosodic integration.
All languages have vowels; yet the analysis of vowel systems in docu-
mentary linguistics is seldom treated as a topic in its own right. In Chapter 7,
“Diphthongology meets language documentation: The Finnish experience”,
Klaus Geyer focuses on the analysis of diphthongs. Diphthongs appear to be
frequently neglected or only poorly treated in language descriptions. Drawing
on the well-documented case of diphthongs in Finnish, the author provides
the reader with a practicable methodological toolkit for analyzing diphthongs,
comprising the relevant diagnostics for their identification and classification
in any language.
In Chapter 8, “Retelling data: Working on transcription”, Dagmar Jung
and Nikolaus Himmelmann investigate methods in the rarely addressed,
though ubiquitous task of transcribing recorded texts. Emphasizing the im-
portance of close collaboration between linguists and speech community,
they discuss a number of ways in which native speakers change or elabo-
rate the recorded text when transferring it into a written form, for instance
by editing out false starts, or by adding information felt to be highly relevant
but missed by the speaker. The authors argue that a comprehensive documentation
of transcription practices, and of such changes and elaborations, represents
highly valuable data in terms of native speakers’ (meta)linguistic knowledge
and our understanding of the development of written varieties of hitherto
unwritten languages.
Part III. Documenting the lexicon
Dictionaries represent the most tangible product of a language documenta-
tion, the one most readily understood by laypersons, and most appreciated
by the speech community themselves. It is thus no surprise that dictionary-
making has probably the longest tradition of any activity in language docu-
mentation.
In her contribution “The making of a multimedia encyclopaedic lexicon
for and in endangered speech communities”, Gaby Cablitz explores some of
the ways that recent advances in digital technology can be implemented in
a documentation scenario. Specifically, Cablitz describes two major innova-
tions: First, the LEXUS-tool for creating web-based lexica, where entries are
linked to multimedia documents, permitting a much more “encyclopaedic”
feel than conventional dictionary software. LEXUS includes a relational link-
ing device, ViCoS (Visualization of Conceptual Spaces), which allows the
creation of dense networks of linkages between different data types, intended
to model knowledge representation more realistically. Second, she discusses
web-based interactive editing of dictionary entries, in which the speech com-
munity itself takes an active role. Cablitz presents a view from the field, draw-
ing on the experience gained during a four-year interdisciplinary project in the
Marquesan and Tuamotuan speech communities in French Polynesia. As in
other contributions, Cablitz’ work reveals that the innovative potential made
available through advances in hard- and software needs to be tempered and
adapted to the realities of the fieldwork situation, where very fundamental
considerations of user-friendliness and respect for community values and re-
lationships cannot be left out of consideration.
In Chapter 11, Andrew Pawley explores the challenges of making dictio-
naries of lesser described languages, in many instances the first dictionaries
of the respective languages (“What does it take to make an ethnographic dic-
tionary? On the treatment of fish and tree names in dictionaries of Oceanic
languages”). Based on experience gained through compiling dictionaries of
several Oceanic languages, Pawley first discusses the challenges of eliciting
and determining the reference of terms in domains that require specialist ex-
pertise that the lexicographer generally does not have. He focuses on fish and
tree names, two terminological domains that are large and important in most
traditional Oceanic speech communities. The second problem discussed is
that of defining terms from such fields, when they differ in content and struc-
ture from corresponding fields in the investigator’s language. A central ques-
tion is how much, and what kinds of cultural knowledge should be included
in a dictionary. Pawley concludes that general ethnographic dictionaries need
the cooperation of several well-informed specialists, and that the linguist’s
fluency in the target language, as well as the native-speaker collaborator’s
fluency in the defining language, are of considerable importance. Above all,
he stresses that compiling such a dictionary is inevitably a long term project
in its own right, which cannot readily be accommodated within the tightly-
constrained time frame of a typical documentation project.
Part IV. Interaction with speech communities
The fourth part of this book is devoted to the interaction of linguists with the
speech community where field work is carried out.
Even Hovdhaugen and Åshild Næss open the discussion by drawing at-
tention to problems that may arise when the field work becomes an issue in
local struggles of power (“Language is power: The impact of fieldwork on
community politics”). They describe a situation of conflict they witnessed
during their work in the Solomon Islands, which shows that even the best
preparation and long term experience in an area cannot always prevent the
field worker, however unwillingly, from becoming the cause of conflicts in
the community. Linguists have to be aware of the impact their work may have
on local politics, and questions of the linguist’s behaviour in possible situa-
tions of dissent within the community should be given consideration when
planning a language documentation project.
In Chapter 13, “Sustaining Vurës: Making products of language docu-
mentation accessible to multiple audiences”, Catriona Hyslop Malau presents
her experiences with the production and presentation of a DVD documenting
traditional plaiting and fishing techniques in the Vurës community, accom-
panied by screenings of Vurës dictionary entries of relevant specialized vo-
cabulary. She shows how such professionally processed video material can
be of significant value as an output of a language documentation project. The
very fact that video material is processed professionally signals the worthi-
ness of the community’s linguistic heritage and traditional knowledge, and
encourages the speech community to uphold them. But professionally pro-
duced DVDs also enhance the visibility of the linguistic and cultural diver-
sity of Vanuatu, and of the general enterprise of language documentation, by
making it available to a wider national and global audience.
The issue of which indigenous communicative practices are to be documented
has been a recurring one in the language documentation literature.
But Mosel (2004) points out that the language documentation process itself
may lead to the creation of previously non-existent text genres in the speech
community. In her contribution “Filming with native speaker commentary”,
Anna Margetts discusses one such genre, that of film commentaries supplied
by native speakers to video footage taken during documentation. Such commentaries
differ in discourse structure from other genres, and may include
syntactic constructions that are otherwise unusual. Furthermore, the
commentary may include samples of specialist vocabulary otherwise over-
looked. Finally, such commentaries provide a level of immediacy and natu-
ralness to film material, raising its acceptance within the speech community
itself. In the author’s own project, the value of such commentaries was only
discovered more or less in passing, and she notes in retrospect that this tech-
nique could have been fruitfully applied to several video recordings, adding
to the repertoire of communicative events that are documented, and at the
same time raising levels of community involvement and identification with
the project.
References
Atkinson, Quentin. 2011. Phonemic diversity supports a serial founder effect
model of language expansion from Africa. Science 332(6027):346–349.
Foley, William A. 2003. Genre, register and language documentation in liter-
ate and preliterate communities. In Language Documentation and Descrip-
tion, Volume 1, ed. Peter K. Austin, 85–98. London: School of Oriental and
African Studies.
Grinevald, Colette. 2003. Speakers and documentation of endangered lan-
guages. In Language Documentation and Description, Volume 1, ed. Pe-
ter K. Austin, 52–72. London: School of Oriental and African Studies.
Haig, Geoffrey, and Stefan Schnell. 2011. Annotations using GRAID (Gram-
matical Relations and Animacy in Discourse). Introduction and guidelines
for annotators. Version 6.0. Available at: http://vc.uni-bamberg.de/moodle/
course/view.php?id=9488.
Hale, Ken, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto,
Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992.
Endangered languages. Language 68:1–42.
Harrison, K. David, David Rood, and Arienne Dwyer. 2008. A world of
many voices: Editors’ introduction. In Lessons from Documented En-
dangered Languages, eds. K. David Harrison, David Rood, and Arienne
Dwyer, 1–13. Amsterdam, Philadelphia: John Benjamins.
Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics.
Linguistics 36:161–195.
Himmelmann, Nikolaus P. 2006. Language documentation: What is it, and
what is it good for. In Essentials of Language Documentation, eds. Jost
Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel, 1–30. Berlin, New
York: Mouton de Gruyter.
Himmelmann, Nikolaus P. 2008. Reproduction and preservation of linguis-
tic knowledge: Linguistics’ response to language endangerment. Annual
Review of Anthropology 37:337–350.
MacKenzie, David. 1961. Kurdish Dialect Studies, Volume 1. Oxford: Oxford
University Press.
Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising
from the aims of language documentation. Language Archives Newslet-
ter 1(3):3–4.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials
of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann,
and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from
a language documentation project in Bougainville, Papua New Guinea.
Presentation at the Simposio Internacional Contacto de lenguas y docu-
mentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral
and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/
mosel_publikationen.htm#download (accessed 2011/02/26).
Trilsbeek, Paul, and Peter Wittenburg. 2006. Archiving challenges. In Essen-
tials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmel-
mann, and Ulrike Mosel, 311–335. Berlin, New York: Mouton de Gruyter.
Wälchli, Bernhard. 2009. Data reduction typology and the bimodal distribu-
tion bias. Linguistic Typology 13:77–94.
Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language
Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51.
London: School of Oriental and African Studies.
Chapter 2
Competing motivations
for documenting endangered languages∗
Frank Seifart
1. Introduction
Many aspects of language documentations are crucially shaped by the di-
verse motivations of the documenters. This chapter addresses a number of
questions with the aim of clarifying the role of such motivations in language
documentation, including:
(i) What are the motivations for documenting endangered languages?
(ii) Which requirements for language documentations stem from them?
(iii) How can these requirements be accommodated in the format of language
documentations?
(iv) Where do these motivations give rise to competing documentation pri-
orities?
With respect to the first question, four possible motivations for documenting
endangered languages may be identified:1
– documentation to preserve human cultural heritage
– documentation to enhance the empirical basis of linguistics
– documentation by and for the speech community
– documentation to study language contact
∗ I am grateful to the editors of this volume for very useful comments that helped improve
this chapter.
1. The fourth motivation, the study of language contact, may be a less common one than the
first three. However, as argued in Section 3.4 below, there are a number of good reasons to
document endangered languages to study language contact, with interesting implications
for documentation priorities.
These four motivations are theoretical abstractions, which are mixed in practice.
This abstraction turns out to be useful, however, since making these motivations
explicit allows one to study their specific requirements for language
documentations. For instance, the empirical basis of linguistics may require
unedited data from spoken language, while reading material for use in
a speech community may require edited texts, cleared of repetitions, false
starts, etc. These requirements shape important aspects of language docu-
mentation in crucial ways. They affect the two basic components of language
documentation: the collection of primary data, on the one hand, and the ap-
paratus, i.e. annotations, descriptive grammatical sketches, dictionaries, etc.,
on the other. This chapter thus also contributes to clarifying these two cen-
tral concepts of documentary linguistics. These issues are discussed in this
chapter with examples from documentation practices from the past 10 years,
mostly within the DoBeS program of the Volkswagen Foundation.
Section 2 introduces the theoretical framework of language documenta-
tion. Section 3 discusses four motivations for language documentation and
their implications for the collection of primary data and the apparatus of lan-
guage documentation. Section 4 summarizes some of the potentially compet-
ing requirements and Section 5 concludes this chapter.
2. The format of language documentation

Within documentary linguistics, the format of language documentations is
typically conceived of as in Table 1 (Himmelmann 1998, 2006). A key feature
of this format is the clear separation of primary data, i.e. the contents, from
any descriptive or analytical statement about these data, which are placed in
the apparatus of the documentation.
The first major component, the primary data, is organized into sessions,
where one session is a recording of a communicative event. One such session
ideally displays a unity of participants, place, and time, and may correspond
to an observed event that may have naturally occurred, e.g. the performance
of a song, as well as to a staged event, e.g. an elicitation session documenting
metalinguistic knowledge. As discussed in the following section, the selection
of communicative events that make up the contents of a language documen-
tation is crucially shaped by the motivations of the documentation project.
The second major component of language documentation is an apparatus
that is divided into one section that contains documents related to individual
sessions and one for the documentation as a whole. The apparatus comprises
all information that is necessary to access the primary data, in particular infor-
mation that is necessary for someone who is not familiar with the language.
Competing motivations for documenting endangered languages 19
This includes annotations of primary data (per session), which in turn cru-
cially include translations. It also includes (for the documentation as a whole)
general access resources, descriptive analyses, and metadata. As we shall see
further below, a number of requirements stemming from different motivations
pertain to these aspects of language documentation.
Table 1. Format of language documentations (Himmelmann 1998, 2006)

Contents                      Apparatus
(Primary data)                Per session                  For documentation as a whole
-----------------------------------------------------------------------------------------
recordings/records            metadata                     metadata
(sessions) of observable
linguistic behaviour and      Annotations                  General access resources
metalinguistic knowledge        transcription                introduction
                                translation                  orthographic conventions
                                further linguistic and       glossing conventions
                                ethnographic glossing        indices
                                and commentary               links to other resources
                                                             .....

                                                           Descriptive analysis
                                                             ethnography
                                                             descriptive grammar
                                                             dictionary
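The format in Table 1 can also be read as a simple data structure. The following is a minimal sketch in Python, not a schema used by any archive: the class names, field names, file names, and the helper `untranslated` are assumptions introduced here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    """One recorded communicative event plus its per-session apparatus."""
    recording: str                                    # identifier of the primary data (hypothetical file name)
    metadata: Dict[str, str] = field(default_factory=dict)
    transcription: str = ""
    translation: str = ""
    commentary: List[str] = field(default_factory=list)


@dataclass
class Documentation:
    """The documentation as a whole: sessions plus the general apparatus."""
    metadata: Dict[str, str] = field(default_factory=dict)
    general_access: Dict[str, str] = field(default_factory=dict)        # introduction, conventions, indices
    descriptive_analysis: Dict[str, str] = field(default_factory=dict)  # ethnography, grammar, dictionary
    sessions: List[Session] = field(default_factory=list)

    def untranslated(self) -> List[Session]:
        """Return sessions without a translation, i.e. hard to access for outsiders."""
        return [s for s in self.sessions if not s.translation.strip()]


# Invented example data, for illustration only.
doc = Documentation(metadata={"language": "ExampleLanguage"})
doc.sessions.append(Session("song_01.wav", translation="A song about rain."))
doc.sessions.append(Session("elicitation_02.wav"))
```

Keeping the primary data (`recording`) apart from every annotation field mirrors the separation of contents and apparatus that the format insists on.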
3. Motivations for documenting endangered languages

The term ‘motivation’ of a language documentation is used here in a technical
sense as resulting from the intended aims of a documentation, in particular
the intended uses of the documentation, which in turn heavily depend on
the intended user groups (see also Austin 2003: 8; Woodbury 2003: 46). The
following subsections identify and discuss four such motivations.2 It should
be emphasized that I wish to refrain from any judgment with respect to the
(moral or other) validity of these motivations (see the useful discussion in
Fast 2007), but rather focus on clarifying their respective requirements for
the contents and apparatus of language documentations.
2. The idea of examining language documentation from the perspective of underlying motivations
stems from Seifart (2000: 25–48), where a similar set of motivations was discussed
with little reference to actual documentation projects. The current paper modifies the concept
of motivations and discusses them with respect to experience from over 10 years of
language documentation practice.
3.1. Documentation to preserve human cultural heritage
The general motivation to document endangered languages discussed in this
section is closely linked to a particular concept of language that may be called
the ‘ethnolinguistic’ view, often attributed to Wilhelm von Humboldt (see, for
instance, Zimmermann 1991: 297). This concept asserts that every language
embodies a particular worldview and that language is closely interrelated with
culture and culture-specific thinking, as emphasized also by Whorf (1956).
Based on such a perspective on language, linguistic diversity is seen as a
treasure that is of high value for humanity as a whole. This perspective on
endangered languages is present in many writings aimed at raising awareness
of the problem of language loss and endangered languages since the early
1990s (see, e.g. Mühlhäusler 1990; Thieberger 1990; Wurm 1991; Zimmer-
mann 1991; Hale et al. 1992; Hale 1995; Krauss 1995; Maffi 1998; Crystal
2000; Nettle and Romaine 2000; Grenoble and Whaley 2006; for critical dis-
cussions, see Hill 2002; Errington 2003; Himmelmann 2008: 343–345).
Within an approach to preserving linguistic diversity as human cultural
heritage, the notion of linguistic ecology emphasizes the importance of lin-
guistic diversity involving the active use of multiple and diverse languages on
a global level as a healthy and natural state, drawing heavily on the parallel
with biological species (see Romaine 2007: 127–130 for a recent overview).
To preserve this state, this approach advocates language maintenance and
revitalization efforts, rather than documentations, which are sometimes viewed
as “artificial environments” (Mühlhäusler 1996: 164; Romaine 2007: 127).
However, products derived from language documentations, such as educational
materials or movies in the endangered languages, have been used in
many cases for language maintenance and revitalization efforts (see Section
3.3). Also, the mere existence of any written, audio, or video document in the
language may contribute to language maintenance by positively influencing
the prestige of the language.
Language documentation in the sense of Section 2 may also contribute
to preserving aspects of human cultural heritage in a different sense by
preserving a long-term record of linguistic and cultural traditions and practices.
This motivation to document endangered languages probably underlies all
language documentation projects, although maybe to different degrees, and
it has implications for the contents and apparatus of the resulting language
documentation.
With respect to contents, i.e. the selection of events to be recorded and in-
cluded as primary data in a language documentation, this motivation implies a
priority for the documentation of the diversity of linguistically expressed cul-
tural production such as verbal art (see, for instance, Evans 2010: 182–204),
rather than, for example, elicitation of paradigms or acceptability judgments.
For instance, Morey and Schöpf’s (2009) documentation focuses on the traditional
songs and poetry of various languages of Upper Assam. In a similar
spirit, Jung et al.’s (2010) and Burenhult and Levinson’s (2010) documenta-
tions focus on culture-specific concepts embodied in place naming and land-
scape terminology. Other documentation projects include foci on aspects of
material culture, e.g. canoe building and fishing (Cablitz 2007, this volume),
house building (Mosel et al. 2007), or weaving (Malau, this volume). All of
these efforts clearly contribute to preserving aspects of human cultural heritage.
Modern lexicon software, such as that discussed in Cablitz (2007,
this volume), additionally allows the documentation of indigenous cultural
knowledge by creating linked networks of entries incorporating encyclope-
dic knowledge.
With respect to the apparatus, i.e. the part of the documentation that al-
lows access to the data, the motivation to preserve human cultural heritage im-
plies particular care for long-term archiving and global accessibility. Besides
purely technical aspects of archiving, this also involves transparent and con-
sistent metadata that adheres to internationally agreed-upon standards. A ma-
jor contribution of the DoBeS initiative in this context is the development and
establishment of the IMDI standard for metadata (http://www.mpi.nl/IMDI/),
a now widely accepted cataloguing scheme for linguistic resources, which fa-
cilitates global accessibility of metadata and associated linguistic resources.
For a documentation to be widely accessible, it is also crucial to use an in-
ternationally widely used language in the metadata as well as for translation
and other annotation.
3.2. Documentation to enhance the empirical basis of linguistics

At least some sub-disciplines of linguistics strive for an empirical basis that
contains data from as many human languages as possible. Most notable among
these are typological approaches, which this section focuses on. The need for
samples containing large numbers of languages stems from the aim to de-
scribe the constraints on the variability of human language in general. Such
claims about human language are obviously the stronger the more languages
are included in the sample, notwithstanding methods to minimize undesired
biasing of limited samples (see Rijkhoff and Bakker 1998; Bickel 2008).
Most typological surveys (e.g., Haspelmath et al. 2008) operate with de-
scriptive statements, e.g. about word order or argument alignment, rather than
primary data that make up the contents of a language documentation in the
sense of Section 2.3 Such descriptive statements may be placed in the appa-
ratus of a language documentation, more specifically in the descriptive gram-
mar, which serves in the context of the language documentation as a means
to make the primary data accessible (see Mosel 2006b). Enhancing the em-
pirical basis for linguistic typology within a language documentation thus
implies analyzing primary data and formulating descriptive statements, pos-
sibly beyond what is necessary to merely access the primary data.
Typology has developed certain traditions of annotating data, which may
be taken as specific requirements for another component of the apparatus of
language documentations. Among these are the now widely accepted Leipzig
Glossing Rules, a set of conventions for interlinear morpheme-by-morpheme
glosses.4 A further development in this spirit is the GRAID project described
in Haig, Schnell, and Wegener (this volume), which enhances the empirical
database for a number of core linguistic research questions by additional,
standardized annotations.
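One consequence of such glossing conventions is that annotations become mechanically checkable. The Leipzig Glossing Rules require word-by-word alignment and the same number of hyphens in an object-language word and in its gloss. The sketch below tests exactly that and nothing more; the function name and the example forms are invented for illustration and do not come from any documented language.

```python
from typing import List, Tuple


def misaligned_words(text_line: str, gloss_line: str) -> List[Tuple[str, str]]:
    """Return (word, gloss) pairs whose hyphen counts differ.

    Assumes whitespace-separated words and hyphen-separated morphemes.
    """
    words = text_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("text and gloss lines differ in number of words")
    return [(w, g) for w, g in zip(words, glosses) if w.count("-") != g.count("-")]


# Invented forms, for illustration only.
ok = misaligned_words("kat-i num tala-ri-s", "dog-PL two run-PST-3PL")
bad = misaligned_words("kat-i num tala-ri-s", "dogs two run-PST-3PL")
```

Here `ok` is an empty list, while `bad` flags the pair in which the gloss does not segment the plural marker.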
With respect to contents, Himmelmann (1998: 177–179) proposes selecting
communicative events to be included in a language documentation that
represent different points on the ‘spontaneity parameter’. This parameter
distinguishes communicative events that can be expected to display different
linguistic structures. For instance, the verbal system in Potawatomi conversation
(high on the spontaneity parameter) is drastically different from that of
Potawatomi narratives (low on the spontaneity parameter) (Buszard-Welcher 2010).
The application of the spontaneity parameter to ensure a certain completeness
of data from a linguistic point of view may thus be taken as a requirement for
language documentations under the motivation of enhancing the empirical
basis of linguistics more generally, beyond the specific requirements of, e.g.,
typological studies.
3. There are recent approaches to using textual data for typological studies (Cysouw and
Wälchli 2007). These approaches have so far used mostly parallel texts from translations
of the Bible or popular fiction. The approach can be extended to a kind of parallel text
contained in the apparatus of language documentations, namely transcriptions and their
aligned translations.
4. See URL http://www.eva.mpg.de/lingua/resources/glossing-rules.php.
3.3. Documentation by and for the speech community

There is by now a large literature on the involvement of the speech commu-
nity in language documentation projects, which discusses the complex issues
of field work ethics, cooperative research, and agency of native speaker col-
laborators (see, for instance, Franchetto 2007; Czaykowska-Higgins 2009;
Cablitz, this volume; Malau, this volume). This body of literature amply il-
lustrates that a language documentation project may be at least partially moti-
vated by expectations of the speech community whose language is the object
of documentation. These expectations may of course vary widely from one
community to another, but there are a number of general requirements that
can be considered here.
With respect to contents, speech communities at advanced stages of lan-
guage endangerment often express the wish to document culturally highly
valued texts that are not transmitted to younger generations by traditional
means anymore, for example, traditional songs. Documentation by and for
the speech community may further necessitate the production of versions of
these recordings that are accessible to the speech community, e.g. on CDs. For
instance, in the context of an Australian documentation project (Sasse et al.
2008), copies on CDs of recordings of traditional songs made as part of the
documentation became very popular among younger community members.
Copies on DVDs, CDs, or audio cassettes may also be used by a speech com-
munity to transmit traditional knowledge to geographically far away parts of
the community where this knowledge has already become obsolete. This was
successfully done with DVDs documenting Vurës cultural events involving
sea worm gathering in Vanuatu (Malau, this volume) and DVDs document-
ing traditional Bora songs in Peru and Colombia (Seifart et al. 2009).
There is also often an expectation to produce within a documentation
project written documents to be used as reading materials in the community,
e.g. in community schools. If the production of such documents is based on
recorded, spoken texts, these texts usually require editing. This process of
editing involves the elimination of speech errors, such as false starts, repe-
titions, and possibly code switching (Mosel 2006a: 80). Such edited written
texts may be considered part of the annotation, i.e. the apparatus. Alterna-
tively, they may also be considered separate records of the language, differ-
ent from the underlying recorded text (if it was recorded at all), i.e. as part of
the primary data and thus the contents. Another reason for considering edited
versions of recorded texts as part of the contents rather than the apparatus is
that the editing process reveals metalinguistic intuitions by native speakers
about, e.g., what constitutes a false start or an unnecessary repetition.
With respect to the apparatus, there is at least one indisputable requirement
that stems from the motivation to document for and by the speech community.
This is the requirement to use a local language, known to the speech
community, as the metalanguage for annotation, translation, metadata, and
any further text in the apparatus, and as target language in a language doc-
umentation dictionary (see Cablitz, this volume). This language is not nec-
essarily a widely used language on a global level (see Section 3.1), but a
language used locally by the speech community. Ironically, this is usually
the dominant language that is displacing the endangered language that is be-
ing documented. A related requirement for the apparatus is to represent data
with an orthographic system that is accepted by the speech community, rather
than (or in addition to) representations of the data that are ideally suited for
linguistic analyses, such as phonemic transcriptions.
3.4. Documentation to study language contact
This last motivation to be discussed here is seldom explicitly on the agenda of
language documentation projects, and it is probably never the primary motivation.
However, endangered languages are almost by definition in intensive
contact with at least one other language, the one that speakers are shifting
to, the only exception being the rare case of physical extinction of a mono-
lingual speech community. This means that the documentation of endan-
gered languages almost inevitably contains highly valuable data for studies
on language contact, including code switching and contact-induced language
change. One aspect of language contact studies that is particularly relevant
here is the study of the process of language shift, i.e. the process that leads to
language endangerment. This motivation goes beyond the motivation to enhance
the empirical basis of linguistics in that language contact, as one aspect
of cultural contact, is intrinsically an interdisciplinary field, which requires a
range of linguistic as well as non-linguistic data.
The linguistic data required by the study of language contact may shape
the contents of language documentations in a number of interesting and potentially
conflicting ways. Firstly, any data from the contact language(s) are
highly important, in addition to data from the endangered language that is the
focus of documentation. This includes not only loanwords and code switch-
ing within texts that are primarily in the endangered language, but also de-
tailed information on the local variety of the contact language(s). Secondly,
it is sometimes claimed that advanced stages of language endangerment lead
to structural-linguistic change such as exaggerated variation (Campbell and
Muntzel 1989; Sasse 1992; Tsitsipis 1998), but these claims are mostly based
on very few, often anecdotally reported, case studies. Any data documenting
such deviation from the norm – which are possibly disregarded under other
motivations as speech errors – are highly important to test such claims.
With respect to the annotation of texts, i.e. an aspect pertaining to the
apparatus, some documentation projects have already implemented a means
of enhancing the usability of language documentation for language contact
studies by introducing additional annotation tiers, which specify the source
language for each morpheme in the primary data (e.g. Bickel et al. 2009;
Güldemann et al. 2010).
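A source-language tier of this kind could be represented and queried roughly as follows. This is only a sketch under stated assumptions: the tuple layout, the labels "target" and "contact", the function `source_shares`, and the example forms are all invented here and do not reproduce the annotation schemes of the projects cited.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Morpheme(NamedTuple):
    form: str    # morpheme as transcribed
    gloss: str   # interlinear gloss
    source: str  # language the morpheme is attributed to (the additional tier)


def source_shares(morphemes: List[Morpheme]) -> Dict[str, float]:
    """Proportion of morphemes attributed to each source language."""
    counts = Counter(m.source for m in morphemes)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Invented utterance, for illustration only.
utterance = [
    Morpheme("ka", "1SG", "target"),
    Morpheme("compra", "buy", "contact"),
    Morpheme("ni", "PST", "target"),
    Morpheme("pan", "bread", "contact"),
]
shares = source_shares(utterance)
```

With the tier stored per morpheme, measures such as the share of contact-language material in a text can be computed without re-annotating the primary data.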
In addition to multilingual data and their annotation, most language con-
tact studies crucially also require sociolinguistic and other extralinguistic in-
formation, which would also be part of the apparatus. Studies of language en-
dangerment, for instance, operate with sociolinguistic descriptions of the dis-
tribution of the use of the endangered and dominant languages across differ-
ent domains and an evaluation of the quality of these domains (Himmelmann
2009: 46). The ‘ethnolinguistic vitality’ approach to language endangerment,
which was developed in social psychology, additionally takes into account linguistic
attitudes, which are assessed by standardized questionnaires, as well
as a range of other information, including demographic and geographic infor-
mation (Giles, Bourhis, and Taylor 1977; Landry and Allard 1994; Edwards
2010).5
4. Where motivations compete
In principle, the format of language documentations is flexible enough to
incorporate all requirements mentioned in the preceding sections. For instance,
edited versions of spoken discourse can be placed in the corpus of primary
data in addition to unedited versions, and a translation into a locally used
language can be inserted in addition to a translation into an internationally
widely used language. Thus potential conflicts between competing motivations
do not arise for principled reasons. However, each documentation project
will have to set its priorities, given intrinsically limited resources in terms of
time, labor, and finances, and these will be set according to the underlying
motivations. There seem to be three broad areas where potential conflicts arising
from competing motivations emerge particularly clearly and that are partic-
ularly relevant because they pertain to important and labor-intensive aspects
of documentation:
(i) The use of a local language vs. an international language as the met-
alanguage used in the apparatus for translation, commentary, etc. The acces-
sibility of the documentation crucially depends on the choice of its metalan-
guage. If the language that the speech community prefers as the metalanguage
of their documentation is not one of the few internationally widely used lan-
guages, then the choice for just one language means seriously restricting the
accessibility of the documentation for either the speech community or the in-
ternational community. For instance, the documentation of the languages of
the People of the Center (Seifart et al. 2009) in (Spanish-speaking) Peru opted
for allocating its limited resources to Spanish translations, metadata, etc., fa-
cilitating access for the speech community, but making access difficult for
non-Spanish speaking users. Other documentation projects (e.g. Haig et al.
2011) use only English, maximizing accessibility at an international level,
but making access for members of the speech community difficult.
5. It should be noted that the ethnolinguistic vitality framework was developed in the context
of migrant communities in industrialized countries, often with languages that are far from
being endangered as a whole. It would need to be extended to be applied to language
endangerment settings involving indigenous languages.
(ii) Naturally occurring spoken language vs. edited data. Linguistic data
that are produced as naturally as possible are heralded as the most valuable for
many linguistic, as well as other scientific studies, which make use of the full
range of information present in spoken language, including prosody, speech
rate, repetitions, etc. (see, e.g. Finnegan 2008). This applies similarly to lan-
guage contact studies that require data on code switching and other influences
from a contact language. On the other hand, data that are edited according to
normative intuitions of speakers, i.e. that are cleared of speech errors, repe-
titions, code switching, etc., may be appropriate or even required for reading
materials used in community schools, and to a lesser degree for some aspects
of documenting cultural heritage and structural-linguistic studies. Producing
edited versions of texts is a painstaking and time-consuming task, which few
documentation projects have undertaken for more than a small number of
texts. An exception is the Teop language documentation (Mosel et al. 2007),
which has produced a large corpus (probably the largest) of texts edited in
this way, which are archived alongside the original, unedited versions.
(iii) Specific descriptive information in the apparatus. Within a language
documentation, the apparatus serves to facilitate access to the primary data,
allowing for their further usability and analysis. However, a number of moti-
vations to document endangered languages require specific descriptive infor-
mation that may not be contained in such an apparatus, such as sociolinguistic
descriptions of domain-specific language use or specific linguistic-structural
information for typological surveys about, e.g., word order. The question is
to what extent a language documentation project can be expected to provide
such information rather than spending its limited resources on, for instance,
enhancing the collection of primary data. A guideline to mediate between
these competing motivations might be to afford higher priority to providing
information in the apparatus that cannot be deduced, however painstakingly,
from the corpus of primary data at a later stage. Thus, for instance, details of
word order can be determined from a large text collection after the conclu-
sion of a documentation project, while the distribution of languages across
domains cannot.
5. Summary and conclusion
This chapter discussed different intended aims and user groups of language
documentation and subsumed these under four motivations for language documentation.
Each motivation was analyzed in terms of its requirements for
language documentations, conceived as consisting of a collection of primary
data, the contents, and an apparatus. Some motivations include specific re-
quirements on contents or apparatus, such as standardized morpheme glosses
for typological studies. Other requirements are more general in nature, such
as the preference for textual primary data that closely conform to normative
intuitions, without speech errors or code switching. This perspective on
language documentation showed a complex interaction of different types of
requirements, which the format of language documentation can in principle
accommodate, but where in practice priorities have to be set.
References
Austin, Peter K. 2003. Introduction. In Language Documentation and De-
scription, Volume 1, ed. Peter K. Austin, 6–12. London: School of Oriental
and African Studies.
Bickel, Balthasar. 2008. A refined sampling procedure for genealogical con-
trol. Sprachtypologie und Universalienforschung 61:221–233.
Bickel, Balthasar, Goma Banjade, Toya N. Bhatta, Martin Gaenzle, Netra P.
Paudyal, Manoj Rai, Novel Kishore Rai, Ichchha Purna Rai, and Sabine
Stoll, eds. 2009. Audiovisual Chintang corpus (ca. 150,000 words tran-
scribed and translated, of which ca. 65,000 glossed and translated, plus
paradigm sets and grammar sketches, ethnographic descriptions, pho-
tographs). Nijmegen, Leipzig: DoBeS, Universität Leipzig. http://www.
uni-leipzig.de/~ff/cpdp/.
Burenhult, Niclas, and Stephen C. Levinson, eds. 2010. Tongues of the
Semang: documenting endangered languages and indigenous knowledge
among foragers of the Malay Peninsula. Nijmegen: DoBeS. http://mpi.nl/
DOBES/projects/semang.
Buszard-Welcher, Laura. 2010. Lessons from Potawatomi legacy documen-
tation. In Language Documentation: Practice and Values, eds. Lenore A.
Grenoble and Louanna Furbee, 67–74. Amsterdam, Philadelphia: John
Benjamins.
Cablitz, Gabriele, ed. 2007. Towards a multimedia dictionary of the Marque-
san and Tuamotuan languages of French Polynesia. Nijmegen: DoBeS.
http://mpi.nl/DOBES/projects/marquesan.
Campbell, Lyle, and Martha C. Muntzel. 1989. The structural consequences
of language death. In Investigating Obsolescence. Studies in Language
Contraction and Death, ed. Nancy C. Dorian, 181–196. Cambridge: Cam-
bridge University Press.
Crystal, David. 2000. Language Death. Cambridge: Cambridge University
Press.
Cysouw, Michael, and Bernhard Wälchli. 2007. Parallel texts: Using transla-
tional equivalents in linguistic typology. Language Typology and Univer-
sals 60:95–99. doi:10.1524/stuf.2007.60.2.95.
Czaykowska-Higgins, Ewa. 2009. Research models, community engage-
ment, and linguistic fieldwork: Reflections on working within Cana-
dian indigenous communities. Language Documentation & Conservation
3:15–50.
Edwards, John R. 2010. Minority Languages and Group Identity: Cases and
Categories. Amsterdam, Philadelphia: John Benjamins.
Errington, Joseph. 2003. Getting language rights: The rhetorics of lan-
guage endangerment and loss. American Anthropologist 105:723–732.
doi:10.1525/aa.2003.105.4.723.
Evans, Nicholas. 2010. Dying Words: Endangered Languages and What They
Have to Tell Us. Chichester, Malden, Oxford: Wiley-Blackwell.
Fast, Annicka. 2007. Moral incoherence in documentary linguistics: Theoriz-
ing the interventionist aspect of the field. In Proceedings of the Fifth Uni-
versity of Cambridge Postgraduate Conference in Language Research, eds.
Naomi Hilton, Rachel Arscott, Katherine Barden, Arti Krishna, Sheena
Shah, and Meg Zellers, 64–71. Cambridge: Cambridge Institute of Lan-
guage Research.
Finnegan, Ruth. 2008. Data – but data from what? In Language Documen-
tation and Description, Volume 5, ed. Peter K. Austin, 13–28. London:
School of Oriental and African Studies.
Franchetto, Bruna. 2007. A comunidade indígena como agente da documen-
tação lingüística. Revista de Estudos e Pesquisas 4:11–32.
Giles, Howard, Richard Y. Bourhis, and Donald M. Taylor. 1977. Toward
a theory of language in ethnic group relations. In Language, Ethnicity
and Intergroup Relations, ed. Howard Giles, 307–325. London: Academic
Press.
Grenoble, Lenore A., and Lindsay J. Whaley. 2006. Saving Languages: An
Introduction to Language Revitalization. Cambridge: Cambridge Univer-
sity Press.
Güldemann, Tom, Alena Witzlack-Makarevich, Martina Ernszt, and Sven
Siegmund, eds. 2010. A text documentation of N|uu. Leipzig, London:
MPI-EVA, ELDP. http://www.eva.mpg.de/lingua/research/nuu.php.
Haig, Geoffrey, Ludwig Paul, and Philip Kreyenbroek, eds. 2011. Documen-
tation of Gorani, an endangered language of West Iran. Nijmegen: DoBeS.
http://www.mpi.nl/DOBES/projects/gorani.
Hale, Ken. 1995. On the human value of local languages. In Proceedings
of the XVth International Congress of Linguistics, eds. André Crochetière,
Jean-Claude Boulanger, and Conrad Ouellon, 17–31. Quebec: Les Presses
de l’Université Laval.
Hale, Ken, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto,
Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992.
Endangered languages. Language 68:1–42.
Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie, eds.
2008. The World Atlas of Language Structures Online. Munich: Max
Planck Digital Library. http://wals.info/.
Hill, Jane H. 2002. “Expert rhetorics” in advocacy for endangered languages:
Who is listening and what do they hear? Journal of Linguistic Anthropol-
ogy 12:119–133. doi:10.1525/jlin.2002.12.2.119.
Himmelmann, Nikolaus P. 2008. Reproduction and preservation of linguis-
tic knowledge: Linguistics’ response to language endangerment. Annual
Review of Anthropology 37:337–350.
Himmelmann, Nikolaus P. 2009. Language endangerment scenarios: A case
study from northern Central Sulawesi. In Endangered Languages of Aus-
tronesia, ed. Margaret Florey, 45–72. Oxford: Oxford University Press.
Jung, Dagmar, Julia Colleen Miller, Pat Moore, Gabriele Schwiertz, Carolina
Pasamonik, Kate Hennessey, and Olga Lovick, eds. 2010. Beaver knowl-
edge systems: documentation of a Canadian First Nation language from
a placenames’ perspective. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/
projects/beaver.
Krauss, Michael. 1995. Language extinction catastrophe just ahead: Should
linguists care? In Proceedings of the XVth International Congress of
Linguistics, eds. André Crochetière, Jean-Claude Boulanger, and Conrad
Ouellon, 43–48. Quebec: Les Presses de l’Université Laval.
Landry, Rodrigue, and Réal Allard, eds. 1994. Ethnolinguistic Vitality.
Berlin, New York: Mouton de Gruyter.
Maffi, Luisa. 1998. Linguistic and biological diversity: The inextricable
link. Terralingua Discussion Paper 3. http://www.terralingua.org/activities/
DiscPapers/DiscPaper3.html.
Morey, Stephen, and Jürgen Schöpf, eds. 2009. The Traditional Songs and
Poetry of Upper Assam – A Multifaceted Linguistic and Ethnographic Doc-
umentation of the Tangsa, Tai and Singpho Communities in Margherita,
Northeast India. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/
singpho_tai_tangsa.
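Mosel, Ulrike. 2006a. Fieldwork and community language work. In Essentials
of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and
Ulrike Mosel. Berlin, New York: Mouton de Gruyter.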
Mosel, Ulrike. 2006b. Sketch grammars. In Essentials of Language Docu-
mentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel,
301–309. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike, Roslyn Purupuru, Jessika Reinig, Alexander Radke, Ruth Sao-
vana Spriggs, Marcia Schwartz, and Yvonne Thiesen, eds. 2007. Teop Doc-
umentation. Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/teop.
Mühlhäusler, Peter. 1990. Preserving languages or language ecolo-
gies? A top-down approach to language survival. Oceanic Linguistics
31(2):163–180.
Mühlhäusler, Peter. 1996. Linguistic Ecology. Language Change and Lin-
guistic Imperialism in the Pacific Region. London, New York: Routledge.
Nettle, Daniel, and Suzanne Romaine. 2000. Vanishing Voices: The Extinc-
tion of the World’s Languages. Oxford: Oxford University Press.
Rijkhoff, Jan, and Dik Bakker. 1998. Language sampling. Linguistic Typol-
ogy 2:263–314. doi:10.1515/lity.1998.2.3.263.
Romaine, Suzanne. 2007. Preserving endangered languages. Language and
Linguistics Compass 1:115–132. doi:10.1111/j.1749-818X.2007.00004.x.
Sasse, Hans-Jürgen. 1992. Language decay and contact-induced change:
Similarities and differences. In Language Death. Theoretical and Fac-
tual Explorations with Special Reference to East Africa, ed. Matthias Bren-
zinger, 59–80. Berlin, New York: Mouton de Gruyter.
Sasse, Hans-Jürgen, Nicholas D. Evans, Linda Barwick, Bruce Birch, Murray
Garde, Joy Williams, and Janet Fletcher, eds. 2008. Iwaidja Documentation.
Nijmegen: DoBeS. http://www.mpi.nl/DOBES/projects/iwaidja.
Seifart, Frank. 2000. Grundfragen bei der Dokumentation bedrohter
Sprachen. Köln: Institut für Sprachwissenschaft der Universität zu Köln.
Seifart, Frank, Doris Fagua, Jürg Gasché, and Juan Alvaro Echeverri, eds.
2009. A multimedia documentation of the languages of the People
of the Center. Online publication of transcribed and translated Bora,
Ocaina, Nonuya, Resígaro, and Witoto audio and video recordings
with linguistic and ethnographic annotations and descriptions. Nijme-
gen: DoBeS. http://corpus1.mpi.nl/qfs1/media-archive/dobes_data/Center/Info/
WelcomeToCenterPeople.html.
Thieberger, Nicholas. 1990. Language maintenance: Why bother? Multilin-
gua 9:333–358.
Tsitsipis, Lukas D. 1998. A Linguistic Anthropology of Praxis and Language
Shift: Arvanitika (Albanian) and Greek in Contact. Oxford: Clarendon
Press.
Whorf, Benjamin Lee. 1956. Language, Thought and Reality. Edited with an
introduction by John B. Carroll. New York: John Wiley and Sons.
Wurm, Stephen A. 1991. Language death and disappearance: Causes and
circumstances. In Endangered Languages, eds. Robert H. Robins and Eu-
genius M. Uhlenbeck, 1–18. Oxford: Berg.
Zimmermann, Klaus. 1991. ‘Babel wiederlesen’ und die Vielfalt der Spra-
chen fördern. Jahrbuch Preußischer Kulturbesitz 28:289–301.
Chapter 3
Evolving challenges
in archiving and data infrastructures
Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck,
Menzo Windhouwer and Peter Wittenburg
1. Introduction
Research in the humanities is increasingly based on data. This change in
attitude and research practice is driven to a large extent by the availability
of small and cheap yet high-quality recording equipment (video cameras, audio
recorders) as well as advances in information technology (faster networks,
larger data storage, more computational power, suitable software). In some
institutes, such as the Max Planck Institute for Psycholinguistics, a clear
trend towards an all-digital domain making use of state-of-the-art technology
for research purposes could already be identified in the 1990s. This change of
habits was one of the reasons for the Volkswagen Foundation to establish the
DoBeS program in 2000 with a clear focus on language documentation based on
recordings as primary material.
The fact that more and more data is being collected poses some challenges
for those who are dealing with this data in one way or another. The researcher
who collects the material will need to maintain a coherent administration of
all the relevant bits of contextual information surrounding the data. These
“metadata” descriptions (see Section 4.2) are not just for the researcher’s own
use but should also allow others to find the data once it has been stored in an
archive and should allow others to assess whether the data suits their needs.
Research data archives that are storing more and more large data collections
will have to provide proper facilities and guidance for potential users of the
data to find what they are looking for.
While technological advances have made it much easier to collect large
amounts of audiovisual recordings, the automatic extraction of the relevant
bits of information from these recordings is still very difficult and therefore
needs to be done manually to a large extent. This causes a discrepancy between
the amount of data that is being collected and the amount of data that
ends up being analyzed and used to support research hypotheses.
Data archiving and sharing is currently on the agenda in all areas of sci-
ence and the technical frameworks that are being developed are often based
on the OAIS reference model (CCSDS 2002) that was originally designed for
space data but can be applied more broadly. Different workflows and usage
scenarios and differences in the nature of the archived data often require de-
viations from this abstract model though, in particular in the case of an online
archive that gives users direct access to the archived material.
2. Issues and strategies in data handling
Once digital technology made it possible to create not only large amounts of
primary recordings but also associated resources such as transcriptions,
linguistic analyses and field notes, it became obvious that new challenges were
appearing on the horizon: we needed ways to take care of proper life-cycle
management of the archived data. In 2000 the MPI stored about one terabyte of
digitized recordings; currently the data in the online archive and the data
ready to be integrated take up about 74 terabytes. Due to technological
innovation we are now able to process and store lossless compressed JPEG2000
video streams, which result in files that are a factor of 20 larger than the
MPEG2 files that were our highest quality archival copies until recently. This
increase in file sizes currently results in an annual growth of the archive of
about 18 terabytes; with more and more researchers switching to
high-definition video cameras, however, we can expect another steep increase
in annual growth in the near future.
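The growth figures reported here lend themselves to a quick back-of-the-envelope
projection. The following Python sketch assumes, purely for illustration, that
annual growth stays constant at the reported 18 terabytes. Since a further steep
increase is expected, the result is best read as a lower bound. The function
name and its parameters are hypothetical.

```python
def project_archive_size(current_tb: float, annual_growth_tb: float, years: int) -> float:
    """Linear projection of the archive size in terabytes.

    Assumption (not from the chapter): the growth per year stays constant.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return current_tb + annual_growth_tb * years


# Figures reported in the text: about 74 TB stored, about 18 TB growth per year.
for years in (1, 5, 10):
    print(years, "year(s):", project_archive_size(74, 18, years), "TB")
```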
In the humanities, sheer data volume is not a good indicator of the data
management challenges to be solved. There are generally complex
relations between the archived objects that need to be maintained in order
to preserve all the knowledge about the objects. Each digitized recording is
for example part of a hierarchy of semantically related objects. Often such
objects are split into new objects for specific reasons such as presentations.
Different layers of annotation of the linguistic content are created, perhaps
even by different annotators at different times. Derived resources such as
lexica are created that relate to a collection of archived objects (see Cablitz,
this volume). Several versions and transformations of many objects might be
created in the course of time. It is important to store the relationships between
all these objects, since in most cases context and provenance are essential for
the interpretation of the objects’ content. Handling the complexity in such
collections is thus a true challenge.
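One way to picture the relationships described above is as a set of typed links
between archived objects. The Python sketch below is only an illustration: the
class names, the relation labels ("derivedFrom", "annotates") and the object
identifiers are invented for this example and do not reflect the data model of
any particular archive.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveObject:
    """One archived item, e.g. a recording, an annotation file or a lexicon."""
    object_id: str
    kind: str


@dataclass
class Archive:
    objects: dict = field(default_factory=dict)
    # Each relation is a (source_id, relation_type, target_id) triple.
    relations: list = field(default_factory=list)

    def add(self, obj: ArchiveObject) -> None:
        self.objects[obj.object_id] = obj

    def relate(self, source_id: str, relation: str, target_id: str) -> None:
        if source_id not in self.objects or target_id not in self.objects:
            raise KeyError("both objects must be archived before relating them")
        self.relations.append((source_id, relation, target_id))

    def provenance(self, object_id: str) -> list:
        """Follow 'derivedFrom' links back towards the original object(s)."""
        chain, todo, seen = [], [object_id], {object_id}
        while todo:
            current = todo.pop()
            for source, relation, target in self.relations:
                if source == current and relation == "derivedFrom" and target not in seen:
                    chain.append(target)
                    seen.add(target)
                    todo.append(target)
        return chain


archive = Archive()
archive.add(ArchiveObject("rec1", "video recording"))
archive.add(ArchiveObject("clip1", "video excerpt"))
archive.add(ArchiveObject("ann1", "annotation"))
archive.relate("clip1", "derivedFrom", "rec1")
archive.relate("ann1", "derivedFrom", "clip1")
archive.relate("ann1", "annotates", "clip1")
print(archive.provenance("ann1"))
```

Storing the links separately from the objects means that new relation types can
be added later without touching the archived files themselves.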
From a UNESCO study we know that even for tape media the preservation of the
stored information has turned out to be a huge and partly insoluble problem.
About 80% of the existing recordings of languages and cultures created by
ethnologists, linguists etc. are highly endangered because the physical
carriers are deteriorating rapidly and the material is not in the hands of
specialized archives (Schüller 2004). Digital technology moves on even faster,
so uncurated digital data is much more endangered than traditional analog
recordings. There is a great risk of losing parts of our cultural and
scientific memory if we do not ensure that data formats and encodings are kept
distinct from the software being used, if we do not use open standards such as
XML (eXtensible Markup Language) for specifying structure, and if we do not
use widely agreed and thoroughly documented encoding schemes such as Unicode,
MPEG etc.
Digital data needs to be continuously migrated, both at the carrier level and
at the structure/encoding level. How can we maintain integrity and
authenticity, both essential pillars for the preservation of our contents, in
such a dynamic world? Migration alone will not ensure data survival, since our
media are very vulnerable and our software is error-prone. Automatic copying to
distinct locations according to safe protocols, making use of different
software systems, is required as well to preserve our digital treasure. For
DoBeS data, six copies are created automatically at three locations, and in
addition selected data is returned to the locations where it was recorded.
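Keeping several copies only helps if damaged copies can be detected. A common
technique for this, not described in the chapter itself, is to compare
checksums across the copies. The Python sketch below assumes that the majority
of copies is intact and treats the most frequent SHA-256 checksum as the
reference value. The function names and the temporary files are illustrative.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_deviating_copies(copies: list) -> list:
    """Return the copies whose checksum differs from the most common one.

    Assumption: most copies are intact, so the most frequent checksum
    is treated as the reference value.
    """
    if not copies:
        return []
    checksums = {path: sha256_of(path) for path in copies}
    reference, _count = Counter(checksums.values()).most_common(1)[0]
    return [path for path, value in checksums.items() if value != reference]


# Demonstration with three temporary "copies", one of which is corrupted.
with tempfile.TemporaryDirectory() as tmp:
    demo_copies = []
    for index, content in enumerate([b"recording", b"recording", b"recordinX"]):
        copy = Path(tmp) / f"copy{index}.wav"
        copy.write_bytes(content)
        demo_copies.append(copy)
    print([path.name for path in find_deviating_copies(demo_copies)])
```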
For both aspects – migration and copying – there are no simple solutions
that are safe enough and all procedures involving too many manual operations
will not work in the end, since the costs would be much too high for the large
volumes of data that we are creating and maintaining.
2.1. The influence of the DoBeS programme

One of the great outcomes of the DoBeS program at an early stage was that a
few enthusiastic researchers and technologists sat together and contributed to
the specification of a flexible metadata schema and infrastructure: the ISLE
Metadata Initiative (IMDI¹). It was quickly understood that metadata is the
glue for maintaining the complex relationships that may exist between various
objects in an archive. With the IMDI metadata infrastructure we were able not
only to add descriptions to objects in order to make them retrievable, or to
group them based on categories, but also to organize them into various
collections. Each depositor constructs a hierarchical reference organization
for their corpus, which forms the basis for all management and access
permission operations, but alternative organizations are also possible. IMDI is
still the basis for one of the largest online archives these days: the MPI
Language Archive, of which the DoBeS archive is a well-organized part.
There is currently much discussion about proper data management and the
challenges posed by what is now also called the “Data Tsunami”. The European
Commission recently set up a high-level expert group to produce a report
entitled “Riding the wave – How Europe can gain from the rising tide of
scientific data” (High Level Expert Group on Scientific Data 2010) and to come
up with actions to address the volume and complexity aspects. In the US, the
final report of the Blue Ribbon Task Force on Sustainable Digital Preservation
and Access (Blue Ribbon Task Force on Sustainable Digital Preservation and
Access 2010) and the ASIS&T Summit on Research Data showed the relevance of the
data curation and preservation challenges.
Looking back a decade, we can state that the DoBeS program made great
contributions to addressing these questions at an early stage. Excellent
solutions were found given the early stages of the debates, and contributions
to the discussions about data management are still being made today.
Principles of data archiving were worked out, the need for standards was
articulated, a new organization framework based on metadata descriptions was
invented, the issue of appropriate creation, management, access and enrichment
tools was tackled, and concrete actions were started along all these
dimensions, resulting in solutions that meet most of the requirements being
discussed these days.
In the report of the EC high-level expert group, “trust” is indicated as one
of the most fundamental principles for success. Obviously trust has many
facets, but most essential is that (1) the depositors trust the archivists to
take care of proper preservation, curation, access and rights management; (2)
the archivists can rely on the quality of the data provided by the depositors;
and (3) the users can trust that they get exactly those objects they are
looking for in authentic quality. The last point has led to an important shift
in the MPI archivists’ view on researcher involvement in data management.
Given the utterly dynamic era in which the DoBeS program, and in particular
its archiving part, was set up, we can state that this trust has eventually
been established, even though it required some changes of attitude on both the
archivists’ and the researchers’ side. The archivists, for example, had to
become aware of the utmost importance that researchers attach to proper
protection and presentation of their data, and the researchers had to get used
to the idea of handing over their data to an online archive that has data
sharing as one of its goals. It is in the nature of innovation that trust has
to be continuously re-established.
1. http://www.mpi.nl/IMDI/
3. Archive stakeholders and their needs

As indicated in Trilsbeek and Wittenburg (2006), a number of different parties
typically interact with an archive, each from a different perspective and with
different – sometimes conflicting – needs. Depositors require an easy way to
deposit their material and to write metadata descriptions for their deposits,
archivists need means to ensure consistent archive organization and data in-
tegrity, and various groups of users of an archive need easy means to navigate
and access archival content. Particularly the latter group poses a challenge to
the developers of access tools for an archive since it is a rather heterogeneous
group ranging from interested members of the general public to journalists,
to people from the speech communities whose recordings are in the archive,
and to linguists who may or may not have specialized knowledge about the
archived material.
It is almost impossible for an archive to cater for the access needs of
every group of users, so it is important that it offers access to its resources
in an atomic way and ideally also offers access to some basic web-services
to explore the archived material. In this way, different web sites or portals
with different looks and levels of complexity can be developed on top of the
archiving infrastructure.
Offering access to archived material and services in such a way also becomes
essential if an archive wants to become part of the various “e-research
infrastructures” that are being developed at the moment in projects such as
CLARIN². Agreements about standards and interchange formats for data and
services are needed to ensure interoperability between various archives and
tool providers.
4. Long-term preservation requirements

4.1. File formats and encodings
Long-term preservation of digital data involves both the physical preservation
of the digital objects and keeping these objects interpretable in the long
run. File formats and encodings change over the years, up to the point where
old formats can no longer be read on the hardware and software commonly
available at a given time. We have seen various examples of this in the past
and we will no doubt see many more in the future. Keeping archived data
interpretable therefore means that an archive needs to migrate its stored files
to up-to-date formats before the old ones become obsolete. Converting from one
format to another, however, often involves loss of information or the
introduction of artifacts. In audiovisual formats, transcoding between two
lossy formats, or even re-encoding a file in the same lossy format, will
introduce artifacts and loss of information. To prevent loss of information or
loss of quality, the archive should use formats according to the following
principles:
– for audiovisual material, use uncompressed or lossless compressed formats
whenever possible
– for textual material, use Unicode character encoding and XML-based formats
whenever possible
– avoid closed, proprietary formats
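The second principle can be illustrated with a few lines of Python using the
standard library's XML support. The sketch writes a transcribed utterance as
UTF-8 encoded XML and reads it back unchanged. The element and attribute names
("transcription", "utterance", "speaker") are invented for this example and do
not belong to any of the formats discussed in this chapter.

```python
import xml.etree.ElementTree as ET


def to_xml_bytes(speaker: str, utterance: str) -> bytes:
    """Serialize a transcribed utterance as UTF-8 encoded XML.

    The element and attribute names are illustrative only.
    """
    root = ET.Element("transcription")
    item = ET.SubElement(root, "utterance", attrib={"speaker": speaker})
    item.text = utterance
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def from_xml_bytes(data: bytes) -> tuple:
    """Read speaker and utterance back from the serialized form."""
    item = ET.fromstring(data).find("utterance")
    return item.get("speaker"), item.text


# Characters outside ASCII (here some IPA symbols) survive the round trip.
serialized = to_xml_bytes("A", "ŋa ɓɛ tʃi")
print(from_xml_bytes(serialized))
```

Because the result is plain, self-describing text in a documented encoding, it
can be read by any XML parser and does not depend on the tool that created it.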
For textual material and audio material it is quite straightforward today to
follow these guidelines. Storing uncompressed or lossless compressed video,
however, still requires a lot of storage capacity by today’s standards, which
is problematic for many language archives. One hour of standard definition
MJPEG2000 lossless compressed video, for example, takes up about 70 GB of
storage; for High Definition video this number would be as much as 4 times as
high. The role of video in language documentation is growing since it provides
a way to contextualize the spoken language and to analyze other communication
channels such as gesticulation. However, the quality requirements for video
are much less straightforward than for audio. Many variables play a role in
the quality of the video signal, but their importance may vary depending on
the purpose of the recording. The recording equipment that is used to acquire
the video material also limits the quality that can be obtained to some
extent; the resulting video quality will depend on the available budget and on
the size and weight that is still practical in the recording situation. It is
probably safe to assume that the price of digital storage will continue to
drop at least at the same rate as it has during the past decade, so storing
uncompressed or lossless compressed video will become feasible for more
archives. The consumer camcorder market, on the other hand, is very hard to
predict and is not driven by the needs of linguistic researchers.
2. http://www.clarin.eu
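To make the storage figures concrete, the following Python sketch estimates the
space needed for a video collection. It uses the two figures given in the text
(about 70 GB per hour for standard definition, four times that for High
Definition). The function name, the parameters and the optional number of
copies are assumptions made for this illustration.

```python
SD_LOSSLESS_GB_PER_HOUR = 70  # approximate figure reported in the text
HD_FACTOR = 4                 # High Definition: about four times as much


def video_storage_gb(hours: float, high_definition: bool = False, copies: int = 1) -> float:
    """Estimate storage in GB for lossless compressed video.

    `copies` is a hypothetical parameter for archives that keep
    several copies of each file.
    """
    if hours < 0 or copies < 1:
        raise ValueError("hours must be >= 0 and copies >= 1")
    per_hour = SD_LOSSLESS_GB_PER_HOUR * (HD_FACTOR if high_definition else 1)
    return hours * per_hour * copies


print(video_storage_gb(100))                        # 100 hours, standard definition
print(video_storage_gb(100, high_definition=True))  # 100 hours, High Definition
```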
Two XML-based formats for linguistic data that were developed with the
help of the DoBeS program are the EAF format for linguistic annotations and
the Lexical Markup Framework (LMF) format for lexica (ISO 24613:2008).
EAF is the format that is used by the ELAN multimedia annotation tool for
storing multi-layered annotations that are time-aligned to the audio or video
files. The LMF format is a flexible format for creating structured lexica and is
being used by the LEXUS lexicon tool. Both formats were designed as XML
formats to allow for relatively easy conversions to other formats now and in
the future.
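The idea of multi-layered, time-aligned annotation can be sketched with a small
data structure. The Python example below does not reproduce the EAF schema. The
class names, tier names and annotation values are invented, and times are
assumed to be given in milliseconds from the start of the recording.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start_ms: int
    end_ms: int
    value: str


@dataclass
class Tier:
    """One annotation layer, e.g. a transcription or a translation."""
    name: str
    annotations: list = field(default_factory=list)


def annotations_at(tiers: list, time_ms: int) -> dict:
    """Return, per tier, the value of the annotation covering the given time."""
    result = {}
    for tier in tiers:
        for annotation in tier.annotations:
            if annotation.start_ms <= time_ms < annotation.end_ms:
                result[tier.name] = annotation.value
                break
    return result


tiers = [
    Tier("transcription", [Annotation(0, 1500, "first utterance"),
                           Annotation(1500, 3200, "second utterance")]),
    Tier("translation", [Annotation(0, 1500, "translation of first utterance")]),
    Tier("gesture", [Annotation(1000, 2000, "pointing")]),
]
print(annotations_at(tiers, 1200))
```

Because every layer refers to the same time axis, layers created by different
annotators at different times can be combined without modifying each other.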
4.2. Organization of data: metadata

When gathering and managing large amounts of data, be it in the form of
analogue or digital resources, an additional layer of meta-information is in-
dispensable. This might seem obvious for the classic case of a library full of
books, but it is even more true for a digital archive where language resources
are stored as digitized recordings and text files. Specific reasons for this are:
Digital resources are meaningless by themselves. On the lowest level they
consist of bits (0 and 1). While digital storage systems themselves already
provide for interpretation of the basic characteristics of the stored
bit-streams, many other layers of interpretation are necessary for keeping the
data useful and manageable, each layer requiring explicit, specific (metadata)
information.
There are very many ways to organize digital language resources; one or-
ganization might be more suitable for a specific archiving or research purpose
than another and fortunately the digital storage paradigm does not impose a
single organization. Therefore we need the ability to impose different flex-
ible organization models or views that match the interest of researchers or
archivists. The richer the metadata available, the more possibilities there are
for the end user to create these special views and explore the digital collec-
tion.
In the current landscape of digital repositories and archives, a number
of specific metadata standards are prominent for the description of linguistic
data. Such a standard usually specifies a set of metadata elements (sometimes
called attributes) together with prescriptions for the values of these elements
and also prescriptions on how the metadata elements and values should be
put into a text format (schema).
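The three ingredients named above (a set of elements, constraints on their use,
and a textual form) can be illustrated with the fifteen elements of the Dublin
Core element set. The Python sketch below is an assumption-laden illustration:
the wrapper element "record" and all field values are invented, and only the
element names are checked, not the values.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher", "contributor",
    "date", "type", "format", "identifier", "source", "language", "relation",
    "coverage", "rights",
}


def build_dc_record(fields: dict) -> ET.Element:
    """Build a flat metadata record from element-name/value pairs.

    The wrapper element name "record" is an arbitrary choice for this sketch.
    """
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NAMESPACE}}}{name}").text = value
    return record


ET.register_namespace("dc", DC_NAMESPACE)
record = build_dc_record({
    "title": "Sample recording session",  # invented values
    "language": "und",
    "format": "audio/x-wav",
    "date": "2005-07-14",
})
print(ET.tostring(record, encoding="unicode"))
```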
The first of these sets, and probably the most widely used one, is Dublin
Core,³ which stems from the electronic library world. Dublin Core was later
extended with some linguistic specializations into the OLAC standard,⁴ which
has become popular for exchanging language resource metadata between archives.
Around the same time the IMDI⁵ standard was introduced and adopted by the
DoBeS program. IMDI strives to allow detailed descriptions, and several
so-called specialized profiles were created for specific linguistic
subdomains. A suite of tools to edit and use IMDI metadata was partly
developed within the context of the DoBeS program.
At the time of writing (2011) a follow-up standard for IMDI, called CMDI⁶
(Component Metadata Infrastructure, cf. Broeder et al. 2010), is being worked
out within the CLARIN framework. Rather than offering one single metadata
schema, it tries to offer the user a set of loose components that can be
combined into a tailored metadata schema. This approach should allow for a
detailed description while keeping the focus only on those metadata elements
that are relevant. Apart from that, it also allows for partial re-use of
existing metadata schemas and provides better mechanisms for semantic
interoperability by requiring that the semantics of all metadata elements used
are explicitly defined in an accepted concept registry. Using CMDI will
hopefully increase metadata interoperability between linguistic research
communities having different needs and traditions.
3. http://dublincore.org
4. http://www.language-archives.org/OLAC/metadata.html
5. http://www.mpi.nl/IMDI/
6. http://www.clarin.eu/cmdi
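The component idea can be sketched as follows. This Python example is not the
CMDI schema or any CLARIN API. It only illustrates combining reusable
components into one tailored profile while insisting that every element points
into a concept registry. All class, component and element names are
hypothetical, and the example.org links are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    concept_link: str  # reference into a concept registry


@dataclass(frozen=True)
class Component:
    name: str
    elements: tuple


def build_profile(name: str, components: list) -> dict:
    """Combine reusable components into one tailored metadata profile.

    Elements without a concept link are rejected, mirroring the requirement
    that the semantics of every element be explicitly defined.
    """
    elements = {}
    for component in components:
        for element in component.elements:
            if not element.concept_link:
                raise ValueError(f"{component.name}.{element.name} lacks a concept link")
            elements[f"{component.name}.{element.name}"] = element.concept_link
    return {"profile": name, "elements": elements}


# Hypothetical components; the example.org links are placeholders.
media = Component("MediaFile", (
    Element("MimeType", "http://www.isocat.org/datcat/DC-2571"),
    Element("Duration", "http://example.org/concept/duration"),
))
speaker = Component("Speaker", (
    Element("Age", "http://example.org/concept/age"),
))
profile = build_profile("FieldRecording", [media, speaker])
print(sorted(profile["elements"]))
```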
4.3. Other standards in language archiving

Both the LMF lexicon standard and the CMDI metadata standard are prime
examples of a trend towards standards which can be easily adapted to the needs
of a specific resource type, as used by a specific (research) community, or
even of a single resource. LMF provides a core meta model, with some
extensions, which can be adorned with data categories taken from a data
category registry to form the actual data model for a specific LMF lexicon.
CMDI uses the same approach by storing pre-defined metadata components and
profiles in a component registry. These components are also annotated with
links into concept registries, of which the data category registry is one, to
make semantic descriptions available and to share them. Registries are thus
starting to play an increasingly prominent role in standards related to
archiving. The MPI develops and hosts the following registries:
– ISOcat⁷ (Kemps-Snijders et al. 2008) is the data category registry (ISO
12620:2009) for ISO TC 37, which is based on a grass-roots approach,
allowing any linguist to participate in the specification and standardization
of linguistic data categories.
– The CMDI component registry⁸ for CLARIN-NL.
– RELcat is a registry to store (user-specific) relationships between data
categories and possibly other concept registries.
A metadata example that is already in use illustrates how these strategies
support mapping from the IMDI to the Dublin Core metadata schema. The metadata
profile in ISOcat has been bootstrapped with the IMDI elements, which include
the /mimeType/⁹ data category. The specification of a data category can be
very elaborate, including translations into multiple languages, but at least
an English name and definition should be available. The /mimeType/ data
category is defined as the “specification of the mime-type of the resource
which is a formalized specifier for the format included or a mime-type that
the tool/service accepts”. In the CMDI component registry the cmdi-mimetype¹⁰
component links the MimeType element to this data category. The format¹¹
element in the Dublin Core metadata schema actually plays the same role and is
defined as “the file format, physical medium, or dimensions of the resource”.
In RELcat the equivalence relation between the ISOcat data category /mimeType/
and the Dublin Core format element can be specified using a simple RDF triple:
isocat:DC-2571 rel:sameAs dc:format
7. http://www.isocat.org/
8. http://www.clarin.eu/cmdi/
9. http://www.isocat.org/datcat/DC-2571
The metadata search that is currently under development in CLARIN can al-
ready exploit such semantic relationships to broaden the scope of a search.
While this example zoomed in on the metadata domain, ISOcat is currently
being populated with data categories for various other domains, e.g., mor-
phosyntax and terminology, and it is expected that this will provide the same
kind of flexibility to content search on resources created in these domains.
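How an equivalence relation can broaden a search may be sketched in a few lines
of Python. The example is not the CLARIN metadata search. It simply treats
"rel:sameAs" triples as symmetric links and looks for a value under every
equivalent field name. The record contents and function names are invented.

```python
from collections import defaultdict


def build_equivalences(triples: list) -> dict:
    """Index 'rel:sameAs' triples in both directions."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rel:sameAs":
            index[subject].add(obj)
            index[obj].add(subject)
    return index


def expand(category: str, index: dict) -> set:
    """Return the category plus everything reachable via sameAs links."""
    found, todo = {category}, [category]
    while todo:
        for neighbour in index.get(todo.pop(), ()):
            if neighbour not in found:
                found.add(neighbour)
                todo.append(neighbour)
    return found


def search(records: list, category: str, value: str, index: dict) -> list:
    """Find records holding the value under the category or an equivalent one."""
    fields = expand(category, index)
    return [r for r in records if any(r.get(f) == value for f in fields)]


index = build_equivalences([("isocat:DC-2571", "rel:sameAs", "dc:format")])
records = [
    {"id": "a", "isocat:DC-2571": "audio/x-wav"},  # described with /mimeType/
    {"id": "b", "dc:format": "audio/x-wav"},       # described with Dublin Core
    {"id": "c", "dc:format": "video/mpeg"},
]
print([r["id"] for r in search(records, "dc:format", "audio/x-wav", index)])
```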
As usability, accessibility and interoperability are long-term goals of the
archive, the persistency of the registries and the links to them is a major con-
cern. Most of these registries do provide Persistent IDentifiers (PIDs) backed
up by persistency strategies, which allow safe use of these identifiers in the
metadata of resources or even the resources themselves. There are various
PID frameworks available. To help an archive to choose among these frame-
works, ISO 24619 “Persistent identification and sustainable access” (ISO
24619:2010) gives specific requirements these frameworks should meet to
make them useful for archives of linguistic resources.
To promote the (re)use of the resources stored in the archive, standards for
harvesting metadata (e.g. OAI-PMH from the Open Archives Initiative) as well
as standards on, and agreements between archives about, Authentication and
Authorization Infrastructures are important. Large-scale infrastructure
initiatives like CLARIN help to build up the federations of all involved
organizations.
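Metadata harvesting with OAI-PMH works through plain HTTP requests whose
parameters are defined by the protocol. The Python sketch below only builds the
URL of a ListRecords request and sends nothing over the network. The base URL is
a placeholder and the function name is hypothetical; "oai_dc" is the
metadata prefix for unqualified Dublin Core that the protocol requires every
repository to support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # restrict harvesting to one set
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# The base URL is a placeholder, not the address of an actual archive.
print(list_records_url("https://archive.example.org/oai", from_date="2011-01-01"))
```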
4.4. Versioning
When storing and archiving digital resources, an important policy decision
concerns how to respond when a depositor offers a new “version” of a resource
that is already present in the archive’s holdings. There can be different
reasons for offering a new version: (a) The depositor has realized that the
first version is simply broken or unusable, for instance because files were
switched. (b) New insights make it necessary to change some annotations. (c)
The format of a resource may need to be upgraded. For instance, a codec used
to encode a media stream may become obsolete, requiring resource replacement.
10. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:c_1271859438106
11. http://purl.org/dc/elements/1.1/format
Depending on the archive organization and policies, it is possible to let the
new version take the place of the old one in the existing network of relations
with other resources and metadata. The old version may then be moved to
background storage or, depending on archive policy, even deleted. Of course
the relation between old and new versions needs to be stored and users should
be able to see that other versions exist.
We will not go into the question of what actually makes a resource a new
version of another resource. This should best be left to the judgment of the
depositor or caretaker of the original version.
It is however very important to realize that users may have created ref-
erences to a resource in the archive, for instance as a link in a publication.
Most users will expect that that reference will always link to the same ver-
sion, while others may want to refer to the latest version. It is important that
the archive is explicit about its versioning policy in this respect. The most
flexible system is one in which every reference keeps pointing to a specific
resource version, while referencing the latest version is provided as a
special service.
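Such a policy can be sketched as a small Python class. The reference syntax
used here ("identifier@number" for a fixed version, "identifier@latest" or the
bare identifier for the newest one) is invented for this illustration and is
not the scheme of any particular archive or persistent identifier system.

```python
class VersionedResource:
    """Keeps all versions of a resource and resolves references to them."""

    def __init__(self, resource_id: str):
        self.resource_id = resource_id
        self._versions = []  # storage locations, oldest first

    def add_version(self, location: str) -> str:
        """Store a new version and return a stable reference to it."""
        self._versions.append(location)
        return f"{self.resource_id}@{len(self._versions)}"

    def resolve(self, reference: str) -> str:
        """Resolve a version-specific reference or a 'latest' reference."""
        resource_id, _, version = reference.partition("@")
        if resource_id != self.resource_id:
            raise KeyError(f"unknown resource: {resource_id}")
        if not self._versions:
            raise LookupError("no version has been stored yet")
        if version in ("", "latest"):
            return self._versions[-1]
        number = int(version)
        if not 1 <= number <= len(self._versions):
            raise LookupError(f"no such version: {number}")
        return self._versions[number - 1]


resource = VersionedResource("session42")
first = resource.add_version("/store/session42_v1.eaf")
second = resource.add_version("/store/session42_v2.eaf")
print(resource.resolve(first))               # still the first version
print(resource.resolve("session42@latest"))  # whatever is newest
```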
However, it is known that some archives are unable to keep stable references
to resources or resource collections due to legal or organizational obstacles.
For instance, the legal owner of a resource might withdraw it from an
archive’s holdings. In such cases the archive can only be as explicit as
possible about such circumstances.
5. Open access vs. access restrictions

At the moment there is a strong push towards open access to research results,
not just the scientific publications but also the data that forms the basis of
these publications. The Berlin declaration on Open Access to Scientific
Knowledge¹² was first published and signed in 2003 by representatives of most
of the German research organizations, but has meanwhile been signed by 294
scientific organizations and universities worldwide. In general there is a lot
to be said in favor of making the outcome of research that has been funded
with public money available to the public and to other researchers in an
unrestricted manner. Giving access to the raw data on which publications are
based would in principle allow anyone to verify the claims that were made and
would allow the data to be reused for other analyses. In many fields of
research, however, data is collected by making use of human subjects, in which
case the privacy of these subjects needs to be taken into account. Informed
consent forms are often used to explicitly regulate the rights to publicize
data of human subjects. Anonymization is another method to ensure that the
privacy of the human subjects is respected, which in general means that all
information that could be used to identify individuals is removed from the
data. In the social sciences, for example, it is common practice that the
names and contact information of participants in a survey or an experiment are
removed before the data set is published. Both informed consent and
anonymization can be somewhat problematic in the field of documentary
linguistics, though. Informed consent about making the data public on the
world wide web would entail that the subject has a good understanding of what
this implies. Masking names in texts and in audio recordings is something that
can be done, but modifying audio and video recordings up to a point where the
individuals can no longer be recognized would render them useless for many
linguistic purposes. The fact that recordings are made within small
communities sometimes requires the researcher to protect the speakers in order
to avoid conflicts within the communities. It is up to the researchers working
in these communities to discuss these issues with the speakers and to make
careful decisions taking both the Open Access principles and the privacy of
the speakers into consideration. Some of these issues, and possible solutions,
are discussed in the following section.
12. http://oa.mpg.de/lang/en-uk/berlin-prozess/berliner-erklarung/
6. Legal and ethical issues
As indicated above, when working with data collected in small language com-
munities one has to carefully consider the rights and privacy of the inter-
viewed contributors. In the DoBeS program, legal and ethical considerations
were an important point of discussion from the very beginning. In its second
year a workshop was organized with leading European law experts to determine
a proper juridical basis for the DoBeS program and in particular the online
archiving ideas. The result, however, was disappointing from the practitioners’
point of view, since it was concluded that the legal situation is much
too complex to give clear juridical advice. The only advice the experts could
come up with was to lock the material in a safe in a cellar, which was exactly
the opposite of what was expected from the emerging archive – namely to be
a place where authorized persons from all around the world could access and
even enrich the stored data.
Intensive and serious discussions afterwards led to a number of conclu-
sions:
– It was understood that the DoBeS program should have a proper basis to
guide the behavior of all persons involved: collectors, archivists and users.
The result was an elaborate Code of Conduct, which was amended over the
years.
– The roles of all actors in the complex system were defined and the expec-
tations with respect to each actor were formulated. For the archivists it is
the principal researcher who is responsible for specifying for example the
access permissions etc. It is expected that the researcher responsible takes
care of proper relationships with the communities and the interviewees and
that all statements are based on informed consent. The archivist will adhere
to the statements of the researcher responsible and provide access mecha-
nisms that implement the requirements.
– The archivist declared that he does not claim copyright on the stored mate-
rial. However, he needs the right to archive in order to perform his task in
a responsible way. With respect to users the archivist will claim copyright
on behalf of the data producers.
– It was decided to not use visible logos in the video since they might ob-
struct the content.
– The researcher responsible always has access permissions to all material
and he can set access permissions for other persons. In particular mem-
bers of the speech community should be granted the rights and abilities to
access the content.
Handling legal and ethical issues at a responsible level is a serious
challenge, especially since communities may, for culture-specific reasons,
withdraw access permissions to certain material although they were granted at
an earlier moment. Other complicating issues may also play a role, requiring a
high degree of sensitivity from all actors involved. To cope with all kinds of
unexpected events, a Linguistic Advisory Board consisting of highly respected
field researchers was established that can be called upon by the archive to
help solve potentially difficult questions.
Over the years, as it became clear that more users would want to access
material in the online archive, four levels of access were agreed upon:
Level 1: Material under this level is directly accessible via the internet;
Level 2: Material at this level requires that users register and accept the Code
of Conduct;
Level 3: At this level, access is only granted to users who apply to the re-
searcher responsible (or persons specified by him or her) and spec-
ify their usage intentions;
Level 4: Finally, there will be material that will be completely closed, except
for the researcher and (some or all) members of the speech commu-
nities.
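The four levels can be read as a simple decision procedure. The following Python sketch is only an illustration of that reading, not the archive's actual access-control code: the names `AccessLevel`, `User` and `may_access` are invented here, and the sketch simplifies level 4 by admitting every community member, whereas the text allows for "some or all".

```python
from dataclasses import dataclass, field
from enum import IntEnum


class AccessLevel(IntEnum):
    OPEN = 1        # directly accessible via the internet
    REGISTERED = 2  # user has registered and accepted the Code of Conduct
    ON_REQUEST = 3  # user has applied to the researcher responsible
    CLOSED = 4      # researcher and speech community members only


@dataclass
class User:
    name: str
    accepted_code_of_conduct: bool = False
    # Resources for which a level-3 request has been approved (hypothetical field).
    approved_resources: set = field(default_factory=set)
    is_researcher_responsible: bool = False
    is_community_member: bool = False


def may_access(user: User, resource_id: str, level: AccessLevel) -> bool:
    """Decide whether `user` may open a resource stored at `level`."""
    if user.is_researcher_responsible:
        return True  # the researcher responsible always has access
    if level == AccessLevel.OPEN:
        return True
    if level == AccessLevel.REGISTERED:
        return user.accepted_code_of_conduct
    if level == AccessLevel.ON_REQUEST:
        return resource_id in user.approved_resources
    # Level 4: simplified here to "all community members".
    return user.is_community_member
```

Because the level is stored per resource rather than hard-coded, a change of level over time only requires updating that one value.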
Access level specifications for archived resources may change over time for
various reasons, e.g. resources could be opened up a certain number of years
after a speaker has passed away, or access restrictions might be loosened after
a PhD candidate in a documentation project has completed the thesis.
The number of external users who have requested access to “level 3” resources
in recent years has been fairly low. It remains to be seen whether the
regulations currently in place can and should be maintained as described
above. Access regulations remain a highly sensitive area where
the technical possibilities opened up by using web-based technologies need
to be carefully balanced against the ethical and legal responsibilities which
archivists and depositors have towards the speech communities. Despite al-
most 10 years of ongoing discussions and debate, no simple solution to this
problem has yet been found.
7. Data enrichment tools
Providing tools for tagging or annotating audio-visual media has been one of
the focal points of software development at the MPI right from the start, from
the Mac-only application MediaTagger, via a set of client-server based corpus
visualization programs, to their convergence into the stand-alone,
multi-platform annotation tool ELAN. This progression of tools was paralleled
by the switch from data in a proprietary format to data stored in the evolving
standard XML. After a thorough makeover in 2003, marking the transition to the
2.x versions, ELAN developed further into an application supporting multiple
videos in multiple formats, providing a growing number of import and export
options and increasing editing capabilities, and available both as a Java Web
Start application and as a downloadable installer.
ELAN allows enriching of audio and video recordings with multilayered,
structured annotations stored in EAF (ELAN Annotation Format) files, a file
format that can be uploaded into the archive as a constituent of a corpus. To
make inspection and exploration of data in the archive more convenient than
downloading a bundle of files and opening the software, the web application
ANNEX has been created. It streams (chunks of) media recordings from the
archive and visualizes associated annotations, not only those stored in EAF
but other formats as well, in an interface resembling that of ELAN. ANNEX
is closely connected to TROVA, a search engine for structured search in
annotation content. Queries can be executed in one or more corpora or parts
thereof, and from any search result (hit) the user can jump to ANNEX, which
then shows that particular annotation in that particular file.
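To make the idea of multilayered, time-aligned annotation more concrete, the following Python sketch reads a heavily simplified EAF-like fragment and lists the annotations of each tier with their start and end times in milliseconds. The element and attribute names follow the EAF format as it is commonly documented. The sample content, the tier names and the function `read_tiers` are invented for illustration, and a real EAF file contains further elements (header, linguistic types, reference annotations) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented, minimal fragment; not a complete or valid EAF document.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hypothetical utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>free translation</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_text):
    """Map each tier ID to a list of (start_ms, end_ms, value) tuples."""
    root = ET.fromstring(eaf_text)
    # Time slots are stored once and referenced by ID from the annotations.
    slots = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((slots[ann.get("TIME_SLOT_REF1")],
                         slots[ann.get("TIME_SLOT_REF2")],
                         ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers
```

Calling `read_tiers(EAF_SAMPLE)` yields one list per tier, which is the kind of structure a viewer or search tool could then operate on.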
Processing multiple files simultaneously has recently become an important
line of development for ELAN and is expected to remain one in the
years ahead. This type of operation improves productivity enormously and
stimulates consistency within a (local) corpus. More generally, reducing the
number of mouse clicks and keystrokes and steps that have to be performed
manually will be a future goal. Semi-automatic annotation by pattern-recog-
nition based software components is expected to become available for every-
day language research soon.
Another data enrichment tool developed by the MPI is a flexible online
lexicon tool called LEXUS¹³ (Ringersma and Kemps-Snijders 2007). The
LEXUS lexicon schema can be based on the meta-model of the LMF standard,
but users have extensive freedom to construct a rich lexicon schema
appropriate for the language to be described. Elements in this schema
can be linked to the data category registry, ISOcat, and can thus have
explicit and shareable semantics. Import tools allow existing lexica in
various formats, e.g., MDF, to be loaded into LEXUS. In principle, lexica
exist in a user-specific workspace. However, LEXUS allows these lexica to be
shared with other users, thus enabling collaboration on the development and
population of a lexicon. Cablitz (this volume) gives a detailed account of
the implementation of LEXUS in an actual documentation project.
13. http://www.lat-mpi.eu/tools/lexus
The LEXUS frontend has evolved over time using different browser-
based technologies into the current FLEX version, which due to its use of
the Adobe Flash plug-in provides a similar look and feel across a wide variety
of browsers and platforms. The rendering of lexical entries has always been
very flexible, allowing users to construct templates for both list and entry
views. A new version of the LEXUS backend is currently close to completion;
besides providing increased stability and performance, it will also make it
easier to add new output formats, e.g., a printable version of the lexicon.
Making different tools like LEXUS, ANNEX/TROVA and ELAN coop-
erate as seamlessly as possible is another important line of development. Sep-
aration of metadata and annotation content has its merits, but at some point
they will have to come together e.g. in a combined data-metadata search in
TROVA. Some annotation editing options, especially those that are executed
on multiple files (like find-and-replace in many files), make perfect sense in
the context of ANNEX. The combination of ELAN and LEXUS will on the
one hand allow lookup and retrieval of information from a lexicon while an-
notating, and on the other hand will enable the user to start building a lexicon
while annotating.
8. Accessing data
8.1. Metadata searching and browsing
Access to archived resources is generally offered by means of search and
browse functions for the metadata catalogue. Search functions can be imple-
mented in various ways, e.g. as free text Google-like search across the entire
metadata catalogue, as an advanced search for searching in specific metadata
fields, or as a “faceted search” that allows one to narrow down search re-
sults by selecting values of a number of pre-defined fields. Searching within
a metadata catalogue that makes use of a single metadata scheme is fairly
straightforward. The only problem here is that there is a certain degree of
variation of metadata values that actually refer to the same kind of data, if
the metadata field does not require the use of a controlled vocabulary. Some
kind of mapping would need to be performed in order to find all variants of
the same value. The situation becomes more complex if one needs to search
across different catalogues with different metadata schemes. It is hoped that
the use of links to the ISOcat data category registry in metadata schemas and
value sets will make cross-archive searches more manageable. As an exam-
ple, archive A may use a metadata schema that contains the element “gender”
for speakers for which the values can be “F” and “M”, archive B may use the
element “sex” for basically the same concept and uses the values “female”
and “male”. If both metadata schemas referred to the proposed ISOcat
term “HumanGender” and the values “feminine” and “masculine” (with the
definition stating that this relates to the gender of a person rather than
grammatical gender), it would be possible to search across both archives with
either terminology by means of an ISOcat-aware search tool.
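The gender/sex example can be sketched in a few lines of Python. This is an illustration of the mapping idea only and does not use any real ISOcat interface: the dictionaries `ARCHIVE_A` and `ARCHIVE_B`, the sample records and the functions `to_shared` and `search` are all invented, while the category name “HumanGender” and its values are taken from the example above.

```python
# Hypothetical schema descriptions: each archive links its own field and
# values to a shared data category and shared values.
ARCHIVE_A = {"field": "gender", "category": "HumanGender",
             "values": {"F": "feminine", "M": "masculine"}}
ARCHIVE_B = {"field": "sex", "category": "HumanGender",
             "values": {"female": "feminine", "male": "masculine"}}


def to_shared(record, schema):
    """Translate one archive-specific record into shared category/value terms."""
    local_value = record[schema["field"]]
    return {schema["category"]: schema["values"][local_value]}


def search(archives, category, value):
    """Return (archive name, record) pairs whose shared value matches."""
    hits = []
    for name, (schema, records) in archives.items():
        for record in records:
            if to_shared(record, schema).get(category) == value:
                hits.append((name, record))
    return hits


# Invented sample metadata records.
ARCHIVES = {
    "A": (ARCHIVE_A, [{"speaker": "S1", "gender": "F"},
                      {"speaker": "S2", "gender": "M"}]),
    "B": (ARCHIVE_B, [{"speaker": "S3", "sex": "female"}]),
}
```

A call such as `search(ARCHIVES, "HumanGender", "feminine")` then finds speakers in both archives, although neither archive uses that term locally.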
8.2. Content searching

A more elaborate search for the actual content of the resources is required if
one wants to find specific examples of language use that cannot be described
in the metadata. At the moment this content search is limited to textual
resources (annotations to audio/video), but in the future it could possibly
be extended to a limited set of features in the audio or video material itself.
Searching for annotations can also be done at varying levels of complexity.
The TROVA content search tool, for example, offers a simple search mode
for searching across the entire annotation file, a “single layer” mode for
searching for sequences within a single annotation layer, and a “multiple
layer” mode for searching for sequences both within and between annotation
layers. Content search tools can be used to find specific examples in a
language corpus, but can also be used to perform statistical analyses on a corpus
by finding all cases of a certain linguistic structure. Also in textual content
search tools, the variation in terminology that occurs within and between
archives can be an issue. Here also the ISOcat registry can play a role by
allowing search tools (and users) to create mappings between different terms
that actually have the same meaning.
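The difference between searching within one layer and searching across layers can be shown with a small Python sketch. It is not TROVA's implementation. It assumes, for simplicity, that all layers are aligned token by token, which real time-aligned annotations need not be, and the function names and sample data are invented.

```python
def find_sequence(layer, pattern):
    """Single-layer search: start positions where `pattern` occurs as a run."""
    n = len(pattern)
    return [i for i in range(len(layer) - n + 1) if layer[i:i + n] == pattern]


def find_across_layers(layers, constraints):
    """Multi-layer search: positions where each named layer has the given value."""
    length = min(len(layers[name]) for name in constraints)
    return [i for i in range(length)
            if all(layers[name][i] == value
                   for name, value in constraints.items())]


# Invented, token-aligned sample annotation with two layers.
SAMPLE = {
    "words": ["the", "dog", "saw", "the", "cat"],
    "pos":   ["DET", "N", "V", "DET", "N"],
}
```

Counting the hits returned by either function over a whole corpus gives the kind of frequency data mentioned above for statistical analyses.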
8.3. Portals
While metadata and content search tools are generally suitable for specialists
to find material that they are interested in, members of a speech community
or members of the general public have other requirements when accessing
archived content. If the search services and archive access framework are
set up in a rather generic way and can be called via standard web-service
interfaces, it is possible to create an additional “layer” on top of the archive
that serves a specific user group. This layer or “Portal” can have an appealing
graphical design and it can direct people to certain pre-defined searches that
have been set up or interesting resources that have been selected. Within the
European research infrastructure projects that are currently running such as
CLARIN, more and more tools are being made available as web services. To
what extent these web services will be of use for certain user-specific portals
remains to be seen, but at least they open up a wide range of possibilities to
combine resources and services together in a web interface.
9. New challenges
Life cycle management of data can be split into three major and related
phases: creation, curation/preservation and access/utilization. With respect to
all three phases we will see accelerated technological innovation, which on the
one hand has positive effects in that research can make use of the newest
inventions and products, and on the other hand has negative implications with
respect to the stability of the solutions found. The trick will be to define the
islands of stability in a very dynamic environment and to participate stepwise
in the innovation process. This holds for the archive as well as for all software
being written. In all phases of the data life cycle, the challenging ethical and
legal situation needs to be taken into account.
Creation Phase: The creation process will benefit from further sophisti-
cation in recording equipment, where three developments in particular will
have implications: (1) miniaturization of data storage leading to increased
capacity; (2) resolution; (3) connectivity. Miniaturization will lead
to continuously increasing storage capacities, allowing researchers to make
high-resolution recordings with portable devices. Miniaturization will also
simplify field work in that direct annotation will become easier with the help
of small, smart devices that demand less power. The resolution of recording
devices will increase, so that high-definition video cameras can soon be
expected in the low-cost sector. Connectivity will also improve, even in
remote places. This will mean that digital recordings (digital recording is now
the norm in the vast majority of fieldwork contexts) can be transmitted
earlier and faster. It will also mean that the possibility of downloading
or accessing archive material will be improved.
Preservation/Curation: A big step for video preservation has already
been taken with the recent introduction of lossless MJPEG2000. This indeed means
that we are able to store a master file from which other formats, e.g. for pre-
sentation purposes, can be generated without risking serious transformation
effects. Information technology (channel bandwidth, storage capacity, CPU
power) will allow us to deal with the increased data amounts.
Long-term preservation is very much dependent on “safe” replication,
where every operation on a data object automatically triggers a check of
whether the copied instance is indeed the same as the original one. It is widely
agreed now that the extensive use of externally registered persistent identi-
fiers associated with checksum information is the only way to ensure data
integrity and authenticity in distributed and thus more complex data manage-
ment scenarios. The DoBeS archive is prepared to participate in such state-of-
the-art archive federation scenarios, since it has for some years been based
on persistent identifiers and automatically generated checksum information.
Together with the computer center in Garching (RZG) it has been actively
testing a switch to a rule-based safe replication strategy based on the iRODS
software, and it seems that the system can be put into operation in 2011. This
will also be a major step towards supporting the open deposit service that the
MPI offers to all researchers with language resources.
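The principle of checking every copy against registered checksum information can be illustrated with a minimal Python sketch. It uses only the standard library and has nothing to do with iRODS or the archive's actual replication rules. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_replicate(source, target, registered_checksum):
    """Copy `source` to `target`, verifying both against the registered checksum."""
    if sha256_of(source) != registered_checksum:
        raise ValueError("source no longer matches its registered checksum")
    shutil.copyfile(source, target)
    if sha256_of(target) != registered_checksum:
        Path(target).unlink()  # discard the faulty replica
        raise ValueError("replica differs from the original")
    return target
```

In such a scheme the checksum would be registered once, together with the persistent identifier, so that any later copy can be verified without access to the original site.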
Also in 2011 the component-based metadata tools will come into place,
which offer much more flexibility for researchers to design a metadata
profile that is suitable for their resources. Interoperability will be guaranteed
by making use of categories defined in the ISOcat registry. The ARBIL edi-
tor, which has now replaced the IMDI metadata editor, already supports
this component structure and will hopefully motivate researchers to provide
better metadata descriptions, since these will be the key to the application of
advanced analysis tools and to generating portals designed with a specific
community in mind.
Utilization Phase: We expect many developments in the improved uti-
lization possibilities of the stored data as long as access is being granted and
as long as the quality of the metadata and the data is high – quality will ac-
tually be the crucial point for many advanced operations. One big concern is
that the amount of recorded media streams that is not being touched (anno-
tated in some form to make it ready for analysis) is increasing continuously
which means that much of the stored data will effectively not be of much
use to anyone other than the person who collected it. A new attempt to use
state-of-the-art speech and image processing technology is required that does
not build on holistic stochastic models but on detectors that react to compar-
atively simple patterns in media streams and create annotations. There will
be several such detectors, each with different characteristics, and some may
be specialized for recordings of a specific quality. The resulting lattice of
annotations could be the base for linguistic evidence and theorization if there
are smart tools allowing the researcher to look for specific patterns and to
easily navigate in it.
We can indicate a few other areas where we expect new opportunities in
the coming months and years:
– Semantically based weaving of content (creating relations and navigating
in the resulting conceptual spaces) is very attractive for finding linguistic
evidence. However, this work is hampered by the huge effort required to
create meaningful relations. Better usage of existing ontologies for auto-
matic support in creating the relations would make this work practically
feasible.
– Archive federations are being set up, metadata has been standardized, re-
source formats are being increasingly harmonized, and improved tools to
foster semantic gateways will make it easier to carry out cross-archive and
cross-corpus work.
– More and more tools are being turned into web services or at least support
web-based interactions. Since the programming interfaces are also cur-
rently being harmonized, there is great hope that in a few years researchers
will be able to combine useful algorithms into chains of operations on texts
(annotations, etc.), audio and video streams and even other types of data
to carry out work that is currently only possible when large-scale expert
knowledge is directly available. For these advanced operations, the quality
of metadata and data will be of crucial importance.
Much funding is currently invested in creating infrastructures that will in-
crease the integration and interoperability of resources and tools. CLARIN is
the initiative that aims to achieve these goals in the linguistic domain. Such
infrastructure work can only be achieved when we apply standardization and
harmonization where possible without hampering research progress. The
DoBeS community has been one of the driving forces in applying open standards
and fostering new ones. If this positive attitude continues, the work on
endangered languages will profit in many ways from new technological de-
velopments in the coming years.
References
Blue Ribbon Task Force on Sustainable Digital Preservation and Access.
2010. Sustainable Economics for a Digital Planet: Ensuring Long-Term
Access to Digital Information. San Diego: BRTF-SDPA. Online version:
http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf.
Broeder, Daan, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Wind-
houwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A data
category registry- and component-based metadata framework. In Proceed-
ings of the Seventh International Conference on Language Resources and
Evaluation (LREC 2010), Valletta, Malta, May 19–21, 2010, 43–47.
Consultative Committee for Space Data Systems. 2002. Reference Model for
an Open Archival Information System (OAIS). Washington DC: CCSDS
Secretariat, NASA.
High Level Expert Group on Scientific Data. 2010. Riding the Wave: How Eu-
rope Can Gain from the Rising Tide of Scientific Data. Brussels: European
Commission. Online version: http://cordis.europa.eu/fp7/ict/e-infrastructure/
docs/hlg-sdi-report.pdf.
ISO 12620:2009. 2009. Terminology and other language and content re-
sources – Specification of data categories and management of a Data Cat-
egory Registry for language resources.
ISO 24613:2008. 2008. Language resource management – Lexical markup
framework (LMF).
ISO 24619:2010. 2010. Language resource management – Persistent identi-
fication and sustainable access (PISA).
Kemps-Snijders, Marc, Menzo A. Windhouwer, Peter Wittenburg, and
Sue Ellen Wright. 2008. ISOcat: Corralling data categories in the wild. In
Proceedings of the Sixth International Conference on Language Resources
and Evaluation (LREC 2008), Marrakech, Morocco, May 28–30, 2008, ed.
European Language Resources Association (ELRA).
Ringersma, Jacquelijn, and Marc Kemps-Snijders. 2007. Creating multime-
dia dictionaries of endangered languages using LEXUS. In Proceedings of
Interspeech 2007, eds. Hugo van Hamme and Rob van Son, 65–68. Baixas,
France: International Speech Communication Association.
Schüller, Dietrich. 2004. Safeguarding the documentary heritage of cultural
and linguistic diversity. Language Archives Newsletter 1(3):9.
Trilsbeek, Paul, and Peter Wittenburg. 2006. Archiving challenges. In Essen-
Chapter 4
Comparing corpora from endangered language
projects: Explorations in language typology based on
original texts∗
Geoffrey Haig, Stefan Schnell and Claudia Wegener
1. Introduction
The current large-scale initiatives towards documenting endangered languages
have produced, among many other outcomes, an unprecedented amount of
digitally archived, natural spoken discourse in a range of typologically di-
verse languages. This chapter investigates some of the ways that language
typologists can tap into these growing resources. Our results suggest that the
cross-language comparison of original narrative texts, of the type now avail-
able in abundance in many documentation projects, is a viable and promising
avenue of typological investigation. Up until now, there has been some re-
luctance within typology with regard to direct comparisons of original texts,
for reasons that are certainly understandable: the lack of comparable con-
tent, the lack of cross-cultural comparability of genres, and perhaps most
tellingly, the lack of a standardized system of annotation. As a result, typolo-
gists interested in text-based (as opposed to grammar-based) typologies have
tended to shy away from large-scale comparison of original texts in favour
of more closely controlled data, for example translations of a single source
text (parallel text typology), or vernacular re-tellings of standardized narra-
tives (e.g. Pear Story narratives). While these methodologies have undeniable
advantages, they are also subject to certain limitations. The most obvious in
a documentation context is that they are not based on speech samples from
an indigenous genre, and that they are only rarely encountered in archives
of documentation projects. Original text typology, on the other hand, capital-
izes on resources that are generally available in documenters’ archives, and
can more directly preserve the specifics of indigenous speech genres. Given
the widely-accepted goals of language documentation as the preservation of
indigenous language practices, an approach to typology that dovetails with
these aims is surely a goal worth pursuing.
∗ The present study draws on the GRAID-system of morpho-syntactic annotation of
connected discourse (Haig and Schnell 2011). We are grateful to Sabine Reiter and
Florian Siegl for contributing data from Awetí and Forest Enets respectively. We
would like to thank the audiences at conferences in Nijmegen (October 2009;
October 2010) and SOAS, London (November 2009) for their feedback and comments
on earlier work on this topic. We are grateful for financial support from the
University of Bamberg, the University of Kiel, the Max Planck Research Group on
Comparative Population Linguistics at the MPI for Evolutionary Anthropology
Leipzig, and the Volkswagen Foundation’s DoBeS-programme, which financed the
four documentation projects from which the data for this chapter is taken. Many
thanks also to Nicole Nau, Bethwyn Evans, Anna Margetts and Brigitte Pakendorf
for their valuable feedback on earlier drafts of this paper. The responsibility
for all remaining errors rests with the authors.
In the remainder of the chapter we present the results of a quantitative
typological comparison of original spoken texts from four language docu-
mentation projects. The chapter is organized as follows: in Section 2, we
introduce the background assumptions to text-based typology, and discuss
the relative efficacy of different methodologies. In Section 3, we outline the
database for this study and the annotation system, GRAID, developed for
these purposes. We show that despite the variation in content, the texts from
the four languages display remarkable similarities in certain domains of dis-
course organization, evident even in corpora of the modest size we are dealing
with. Section 4 turns to an area highly sensitive to cross-linguistic variation,
that of pronoun deployment. Previous studies in text-based typology have
tended to ignore pronouns, either lumping them with full NPs, or with zero
anaphora. Our results indicate that in doing so, a potentially rich domain of
typological variation is overlooked, one that only a finer-grained system of
annotation will capture. The different parameters investigated here represent
but a fraction of the possibilities opened up by the methodology proposed.
However, we think they suffice to demonstrate the fundamental feasibility of
cross-linguistic comparison of original texts, and identify some of the typo-
logical parameters amenable to this kind of research.
2. Language documentation and language typology
Language typology is traditionally based on the comparison of grammars.
But grammars are not languages; they are the products of a more or less
idiosyncratic interpretation of a selection of the documented facts of a lan-
guage. How much, and what type of data is included in a grammar, how it
is collected, and how it is presented, are matters left to the discretion of the
author. Of course a broad consensus exists on what constitutes a good gram-
mar, but if a given language has only one grammar, as is often the case for
endangered languages, then that grammar will represent that language in ty-
pological databases, regardless of its quality. These problems are well-known,
but for lack of viable alternatives, comparison of reference grammars contin-
ues to be the most frequently used methodology in typology (cf. Haspelmath
et al. 2008).
Recently, however, an alternative methodology has been developed, some-
times referred to as “primary data typology” (Wälchli 2006, 2009). Rather
than compare reference grammars of individual languages, primary data ty-
pology attempts to compare actual instances of language usage – texts – in
different languages. Such an undertaking poses considerable challenges, and
is of course subject to a number of limitations. Perhaps the most obvious
question is: what kinds of text need to be included in a corpus to be repre-
sentative of a given language? The question of representativity has long been
discussed in traditional corpus linguistics (cf. Atkins, Clear, and Ostler 1992),
but also in the context of language documentation (cf. Himmelmann 1998;
Woodbury 2003; Seifart 2008). While there is widespread agreement on the
ideal of a maximum coverage of indigenous speech events (Geertz’ proverbial
‘thick description’ (Geertz 1973)), in practice, the selection of recorded text
types largely depends on preferences of the speech community and is con-
strained by the documenters’ limited knowledge about indigenous text genres
(Mosel 2006). As for corpus size, corpus building is first of all constrained
by limited resources. It is generally recognized that a corpus has to be of
considerable size in order to contain data on all relevant linguistic structures.
However, even the largest corpora may lack data on some structures in a
particular language. The adequacy of a corpus’ size partly depends on the
nature of the object of investigation: for an investigation of, for example, syl-
lable structure, or the use of determiners in NPs, quite short stretches of text
may be adequate to obtain a reliable picture. For an investigation of relative
clauses, on the other hand, much larger text samples across different genres
would be necessary. The issue of corpus size is ultimately an empirical one,
that can only be satisfactorily addressed when sufficient studies have been
carried out with text corpora of different sizes.
While the issues of text types and corpus size are regularly raised in con-
nection with primary-data typology, it is often forgotten that they are equally
relevant to reference grammar typology. Yet with regard to reference gram-
mars, these problems are seldom made explicit. In fact, reference grammars
of different languages are based on wildly varying corpus sizes. While some
are based on just a couple of hours of recorded speech, coupled with elic-
itation sessions with a single speaker, others are based on 20–30 hours of
recorded speech and participant observation in the speech community over
many years, or written by native speakers relying on their own intuitions.
Also, reference grammars are based on corpora of different language vari-
eties and text types: some take a form of literary standard (if available) as the
basis, while others document a spoken vernacular; some are based mainly on
narrative texts, while others are confined to conversational data; and so on.
Yet despite the gaping differences in the empirical basis of reference gram-
mars, in typology, different grammars are treated for all practical purposes as
equal.
The shift towards primary data typology paves the way for a further devel-
opment: features that are fed into a typology are no longer read off a check-list
in an either/or fashion (e.g. language X “has” OV word order etc.). Rather,
features can be assigned a quantitative value, determined by the frequency of
occurrence in the text under investigation. Thus a particular text type may, for
example, be found to exhibit OV-word order in 70% of the relevant construc-
tions, and elsewhere use other word-orders. In principle grammars could –
and probably should – make such variation explicit in a precisely quantified
manner; yet this is not common practice in most of descriptive linguistics to
date (but see Biber et al. 2004 for a notable exception). But precise state-
ments about constructional variation in specific types of text in individual
languages can lead to finer-grained typologies and to closer attention to the
factors determining such variation. They open up the possibility of incorpo-
rating language-internal variation in cross-language comparison. The quan-
titative nature of primary data typology is an aspect that will be of central
concern in the present chapter.
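As an illustration of what assigning a quantitative value to a feature can mean in practice, the following Python sketch computes the proportion of each word order per text type from a list of annotated clauses. The data are invented, the labels “OV” and “VO” stand in for whatever an annotation scheme actually records, and the function `order_proportions` is a hypothetical helper rather than part of any tool discussed here.

```python
from collections import Counter, defaultdict


def order_proportions(clauses):
    """Return, per text type, the share of clauses showing each word order.

    `clauses` is an iterable of (text_type, order) pairs.
    """
    counts = defaultdict(Counter)
    for text_type, order in clauses:
        counts[text_type][order] += 1
    return {
        text_type: {order: n / sum(by_order.values())
                    for order, n in by_order.items()}
        for text_type, by_order in counts.items()
    }


# Invented sample: 10 narrative clauses and 4 conversational clauses.
SAMPLE_CLAUSES = (
    [("narrative", "OV")] * 7 + [("narrative", "VO")] * 3
    + [("conversation", "OV")] * 2 + [("conversation", "VO")] * 2
)
```

With this sample, the narrative texts come out as 70% OV, the kind of graded statement that a checklist entry such as "language X has OV order" cannot express.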
Apart from the problem of representativity, primary data typology also
faces the problem of comparability of text genres and texts of different con-
tent. It would, for example, be misleading to compare a cooking recipe text in
one language with a traditional folk narrative text in another. Currently two
approaches to alleviate these differences are widely used. The first compares
the translations of a single source text in different languages, so-called
parallel-text typology. Typical source texts are the Universal Declaration of Human
Rights, the Harry Potter-series (Wälchli 2006; Stolz 2007), Le petit prince
(Stolz 2007), or the Bible (cf. Wälchli 2009). The advantages of this method
are clear: roughly the same denotational meaning is held constant across dif-
ferent languages, and the language-specific forms for expressing that meaning
can be compared. From a practical perspective, the texts are already available
in a fairly wide sample of languages, thus reducing the typologist’s workload
considerably. The main disadvantages are equally evident: the texts compared
are generally samples of written language, often a highly specialized and even
artificial register that may have been influenced by the source language of the
translation text (cf. Wälchli 2009: Ch. II for discussion of the advantages and
disadvantages of parallel text typology, and the contributions in Cysouw and
Wälchli 2007 for different applications of the method).
An approach that alleviates the latter problem to some extent is the use
of standardized stimuli for eliciting connected texts. Well-known examples
are the Pear Film developed by Wallace Chafe, or the Frog story, a picture
book (Mayer 1994[1969]). Speakers view the film, or the book, and are then
requested to recount the events they have seen in their own language. In this
manner, texts can be recorded in the spoken vernacular and with approxi-
mately comparable content, avoiding the pitfalls of translations. Nevertheless,
it should be borne in mind that Pear Film narratives or Frog story retellings
come with their own specific disadvantages: first of all, the cultural and mate-
rial content of such stimuli may be quite alien to the speakers; they may lack
vocabulary items for objects that occur in the stories, or even find them em-
barrassing or simply incomprehensible. Second, even within the same speech
community speakers differ in the extent to which they elaborate on the de-
tails of the events recounted, leading to texts of vastly different length and
content. Third, unless the speakers were asked to recount the content of the
film or picture booklet to another speaker who did not know the material,
retelling the content of a film or a picture booklet to the researcher (who is
obviously familiar with the material) is not part of the repertoire of linguistic
practices of speakers of many language communities. Finally, speakers may
produce quite different types of text, namely they either re-narrate the plot of
the film or they describe the film as such, or mix both of these perspectives
(cf. Himmelmann 1998: 187; Mosel, p. c.). So while the resultant texts may
be preferable to translations on some counts, they are also far from ideal as a
source of primary data.
From the point of view of documentary linguistics, both translations and story-retellings
are of secondary importance. As Himmelmann (1998) notes, the
primary aim of language documentation is to document, as comprehensively
as possible, the indigenous linguistic practices of a speech community. As
mentioned, translations, or retelling of films, are often not part of those prac-
tices, but represent novel genres. There is a very real possibility that speakers,
when faced with the challenge of novel speech events, may not be produc-
ing the kind of fluid routine sequences that characterize much of our actual
linguistic behaviour. Of course speakers’ responses to novel speech events
are of considerable interest in their own right (see Mosel 2004 and Margetts,
this volume, for discussion of novel communicative events introduced
by language documenters into speech communities). But in the context of a
typological investigation aiming at comparison of the profiles of languages
by studying actual usage, data of this nature are extremely problematic. One
of the few researchers to explicitly address this point is Foley (2003), who
compares the structure of a traditional indigenous narrative with the structure
of a Frog-story retelling in the Papuan language Watam. Foley notes some
striking structural differences, and concludes that a description based on a
Frog-story retelling could quite significantly distort the language’s discourse
profile. An obvious difference between such texts and many indigenous narratives,
in our experience at least, is that the former are generally characterized
by a paucity of clauses in the first and second person, and of direct and in-
direct speech. Both of these are extremely common features of traditional
narratives of some regions (e.g. the Middle Eastern cultural region), so texts
lacking them already differ markedly from the indigenous narrative tradition.
The obvious alternative method, which avoids the drawbacks of the two
just discussed, is typology based on the comparison of original texts, i.e. texts
produced by native speakers, recounting original content in an indigenous
text genre. Typically such texts are monologous narratives of past events, real
or mythical, often traditional tales that are part of the community’s cultural
heritage, but in some cases accounts of specific events that occurred in the
recent past. This method has the advantage of working with texts expressing
indigenous content and representing an original linguistic practice within the
speech community, hence not obliging speakers to engage in novel or un-
familiar communicative tasks. The vast majority of language documentation
projects yield some recordings of spoken narrative monologues. Once tran-
scribed and translated, these texts are a potential data source for primary data
typology. The amount of such text already available in digital form can only
be estimated, but the DoBeS-programme alone, with its emphasis on text-
based documentation and 51 large-scale documentation projects to date,1 has
produced a massive amount of digitally archived spoken language data in
various stages of annotation and analysis. It is therefore rather surprising that
to date, there has been little serious attempt on the part of typologists to mine
these resources. In the remainder of this chapter, we present the results of our
own efforts in this direction, which focus on studying the form and function
of referring expressions in original texts.
3. Quantitative approaches to typology: The GRAID initiative

3.1. Background
At least as far back as the work of Chafe, Givón and others in the 1970s, it
has been known that across natural spoken discourse, languages reveal statis-
tically stable patterns in the ways that referents are introduced into discourse,
how they are referred to in subsequent stretches of discourse, and with what
frequency per clause. The focus of Chafe’s research (Chafe 1979, 1980, 1987)
was on the cognitive and evolutionary grounding of narrative structure, but
later researchers have turned their attention to more explicitly grammatical
topics, re-forging the link to typology by systematically investigating a larger
cross-section of languages. One example is John Du Bois’ research on “Pre-
ferred Argument Structure” (PAS), which draws on a quantitative analysis of
discourse; we will take up some aspects of his work in 3.4 below. More re-
cently, Bickel (2003) presented a quantitative, cross-linguistic investigation
of “Referential Density” (RD), the degree to which a language uses overt
expressions for the referents in a given text.
Superficially, it might seem that Bickel’s research is similar to the cross-linguistic
investigation of “discourse pro-drop” in the generative tradition
(cf. Neeleman and Szendrői 2007, 2008). However, Bickel’s Referential
Density (RD) cannot be equated with “discourse pro-drop”, as the latter refers
solely to subject or object deletion, while RD includes the omission of oblique
arguments. More importantly, there is a fundamental difference in methodol-
ogy between the two approaches, reflecting the discussion in Section 2 above:
In their cross-linguistic investigation of argument-deletion, Neeleman and
Szendrői treat zero-expression of arguments as a feature of grammars, rather
than texts. They explicitly restrict themselves to the question of whether discourse
pro-drop is “available” or not (Neeleman and Szendrői 2008: 333) in a
given language, i.e. whether it can occur at all, and base their decision solely on the
investigation of reference grammars. Under what conditions, and with what
frequency, discourse pro-drop is actually manifested in texts lies outside the
purview of their investigations. Bickel and his associates, on the other hand,
approach the matter from a text-based perspective, which yields variable values
(between zero and one) for the RD of a given language. In fact, Bickel’s
measure of RD is not extractable from a reference grammar.
1. http://www.mpi.nl/DOBES/projects/
Cross-linguistic research within the text-based, quantitative approach pre-
supposes that the texts to be compared have been annotated in a manner that
is consistent across different languages. The working assumption is of course
that the annotation categories are sufficiently widespread across the world’s
languages and are definable in a cross-linguistically valid way. Examples of
such categories might be: “subject of transitive verb”, “first person personal
pronoun”, “direct object”. There is an ongoing debate about the status of such
typological categories (cf. Haspelmath 2007, 2010a,b; Newmeyer 2010), but
in practice, typologists have been conducting conventional cross-language re-
search for decades based on just such categories, with considerable success.
Therefore, for the time being, we continue to work on the assumption that
there are at least some grammatical categories that can meaningfully be ap-
plied cross-linguistically.
Despite the widespread acceptance of such categories, a standardized system
for annotating them has never been developed. Conventional morpheme-by-morpheme
glossing, perhaps the most widespread type of morpho-syntactic
annotation, is of only limited use for cross-linguistic comparison. First of all,
the inventory of symbols used varies from language to language, i.e. from
grammar to grammar. For example, some grammar writers will use “ACC”
to refer to a morphological marker found on direct objects, while others
gloss the approximately equivalent marker as “OBJ”, etc. Second, morphemic
glosses vary according to how, and if, they make reference to non-expressed
categories (zero exponence). Third, morpheme glosses may apply one and
the same label to a morpheme, although it occurs in different distributions,
for example, a person agreement marker on a possessor and a person agree-
ment marker on a predicative verb. Finally, and most seriously, morphemic
glosses render individual morphemic segments, but they provide no direct and
consistent means for identifying constituents larger than words, e.g. phrases.
Although there are guidelines for morphemic glossing available (e.g. the
‘Leipzig Glossing Rules’2), in practice morphemic glosses are a free-for-all
area, with individual practitioners generally developing their own, often
quite idiosyncratic solutions. Recently, Seifart et al. (2010) evaluated the possibilities
of cross-language comparison of texts glossed with Toolbox. They
found that the sole parameter that can be more or less reliably extracted from
these data is the noun/verb ratio of the respective texts. Thus, the possibilities
of direct comparison of existing morphologically glossed texts appear
to be extremely restricted at present. And even if morphemic glossing were
to be standardized to a level where cross-corpus comparison was possible,
the problem of lack of identification of syntactically relevant units remains
unresolved.
The obvious solution, to develop systems for annotating syntactic struc-
tures, was never seriously pursued on a large scale. Schultze-Berndt (2006)
briefly mentions this possibility in connection with the annotation of small
language corpora, only to dismiss it as too time-consuming and hence im-
practicable. However, over the last couple of years we have been developing
such a system for annotating texts in typologically diverse languages, based
on a small and standardized system of labels (approx. 30). The system, known
as GRAID (Grammatical Relations and Animacy in Discourse) is described
in detail in Haig and Schnell (2011), and we will only outline some of the
more important properties here.
GRAID is an annotation system for glossing connected narrative texts,
resulting in an additional rich layer of data on a text corpus that contains
information on the interface of discourse and syntax. It identifies the major
clause constituents (predicates, arguments), and links the syntactic functions
of arguments to their formal properties (e.g. whether they are pronominal, or
full NPs). Additionally, GRAID includes information on the distinction be-
tween human and non-human reference of arguments,3 as this is an important
parameter in shaping many aspects of grammar. GRAID has been developed
through glossing actual texts in five genetically and areally diverse languages,
and has proved sufficiently flexible to accommodate the attested problems to
date.
2. http://www.eva.mpg.de/lingua/resources/glossing-rules.php
3. Note that the category ‘human’ includes anthropomorphized referents.
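To make the structure of such an annotation layer concrete, the following minimal sketch models one annotated argument as a record combining syntactic function, form type and humanness, and tallies a toy sequence of clauses. The class name `Mention`, its field names and the sample data are assumptions made purely for illustration; they do not reproduce the GRAID symbol inventory, which is defined in Haig and Schnell (2011).

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record for one annotated argument; the field names are
# illustrative and are not the actual GRAID gloss symbols.
@dataclass(frozen=True)
class Mention:
    function: str  # "S", "A" or "P"
    form: str      # "np", "pro" or "zero"
    human: bool    # True for human (incl. anthropomorphized) referents


# Invented toy data: three clauses of a made-up narrative.
mentions = [
    Mention("S", "np", True),    # 'The woman left.'
    Mention("A", "zero", True),  # '(She) took the basket.'
    Mention("P", "np", False),
    Mention("A", "pro", True),   # 'She saw a man.'
    Mention("P", "np", True),
]


def tally(items, key):
    """Count mentions by an arbitrary property, e.g. function or form."""
    return Counter(key(m) for m in items)


by_function = tally(mentions, lambda m: m.function)
by_form_in_a = tally([m for m in mentions if m.function == "A"], lambda m: m.form)
human_share = sum(m.human for m in mentions) / len(mentions)

print(by_function)   # counts per syntactic function
print(by_form_in_a)  # form types found in A function
print(human_share)   # proportion of human referents
```

Counts of this kind, taken over a whole annotated text, are what the proportion tables in Sections 3.3 and 3.4 summarize.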
GRAID annotations cannot be derived automatically from morphemic
glosses as they target primarily clause-level constituents (e.g. subjects of in-
transitive clauses etc.), so they need to be undertaken manually. Although this
is a time-consuming process we suggest that, given the availability of text cor-
pora already entered into annotation tools (e.g. Toolbox), the additional work
necessary to add a GRAID-annotation to the corpus remains within reason-
able boundaries. In around 10–15 hours, a person with a sound background
in syntactic analysis, and knowledge of the language concerned, can annotate
a text of around 500 clause units that is then amenable to a variety of quan-
titative typological cross-corpus investigations, as we will demonstrate in the
following sections.
3.2. The GRAID database

Currently, GRAID annotations have been added to monologous narrative
texts from five existing language documentation projects; the data from four
of these were taken as the basis for this investigation.4 Table 1 gives an
overview of the raw data; sources for the languages concerned can be found
in Appendix 5.
Table 1. Overview of GRAID data used in this study

Language   Affiliation; location               Annotator         Words   Clause units
Awetí      Tupi-Guarani; Brazil                Sabine Reiter      2289            466
Gorani     Indo-European, Iranian; West Iran   Geoffrey Haig      1835            551
Savosavo   Papuan Isolate; Solomon Islands     Claudia Wegener    3140            659
Vera’a     Austronesian, Oceanic; Vanuatu      Stefan Schnell     3006            546
Totals:                                                          10270           2222
In addition, spoken German renderings of the Pear Film have also been an-
notated, based on the text versions in Himmelmann (1997: 234–243). As this
material belongs to a different genre, it is only occasionally mentioned for
comparison, but not included in the analysis.
4. Data from the fifth language, Forest Enets (Samoyedic, cf. Siegl, this volume) were not
fully available at the time of writing, hence were not included in the current analysis.
The texts from the documentation projects were selected according to the
following criteria: monologous narratives recounting traditional tales or his-
toric events that are part of the cultural heritage of the speakers. The texts had
to have been recorded, transcribed, translated and analysed prior to being an-
notated with GRAID. In all cases, the annotator was the main investigator of
the language concerned, thus ensuring the expertise was available to make the
analytical decisions necessary during annotation.5 For some languages, more
than one text was needed to reach the minimum amount of around 500 clause
units. Annotation took place over the course of approx. 18 months, with close
cooperation between the annotators. A workshop was held in 2010 in Bam-
berg where the annotators for Awetí, Gorani, Forest Enets and Vera’a con-
ducted intensive discussion and refined the annotation practice in response to
challenges raised by the individual languages. These efforts fed into the cur-
rent version of the GRAID Manual (Haig and Schnell 2011), which contains
the main principles for GRAID annotations, the inventory of symbols, and
illustrative examples and explanations.
An important feature of this research is that the data meet the general
requirement of maximal accountability, because they are drawn from archived and
freely accessible material (see Trilsbeek et al., this volume). This is one of the
key advantages of working with data from language documentation projects,
where data accessibility and durability are afforded maximum priority from
the outset. GRAID thus profits from requirements that are already in place,
by simply adding an extra tier of annotation to existing archived data. It also
distinguishes the present work from much comparable work in text-based
typology, for which the raw data are often not accessible.
Given the bottleneck of additional manual annotation, the issue of corpus
size is obviously highly relevant. Surprisingly, it is seldom explicitly discussed
in the literature on quantitative approaches to typology. An exception
is the following passage from Nichols (2008: 124), discussing the size of corpora
necessary to determine the types of argument structure typically associated
with certain verb meanings:
A text frequency survey probably gives the most sensitive indication of a
language’s overall preference, but it is labour-intensive and requires close
control of genres, stylistic levels, etc. in text corpora. I estimate ... that a
corpus of about 1000 clauses exclusive of those containing the verb ‘be’ that
swamps all frequencies in some Indo-European languages, is enough to yield
reliable information on natural frequencies of verbs ...
5. In subsequent research we will need to cross-code the data to ensure consistency and
objectivity of the coding; this has not been done yet because the GRAID coding system is
still being refined and improved, and the necessary guidelines for each language are still in
preparation.
A brief look at published research on text-based quantitative typologies is
instructive. Table 2 gives the size of corpora used in the respective studies (in
clause units) for a selection of publications investigating syntactic aspects of
connected discourse. With the exception of Clancy (2003), few of the studies
based on spoken language that we have seen to date have a database of more
than 1000 clause units (and the clause units of Clancy’s child speech may
have been considerably shorter than the clause units of the adult speech in the
other studies).
Table 2. Corpus size in selected text-based studies

Author                     Language(s)                 Text type                      Clause units
Du Bois (1987)             Sakapultek                  Pear Stories                            456
Bickel (2003)              Belhare, Maithili, Nepali   Pear Stories                            568
Genetti and Crain (2003)   Nepali                      Traditional narratives                  861
Clancy (2003)              Korean                      Children’s speech (1–2 yrs.)           4363
Payne (1992: 57)           Yagua                       Traditional narratives                 1156
Comparing the data in Table 1 and Table 2, it is evident that our own corpora
fall within the range of those used in previously published research. Taken
together, our corpus actually already represents one of the largest, fully ac-
cessible corpora of its kind currently available.
One potential disadvantage of working with indigenous narrative texts
is that the content of each text differs, thus presumably diminishing the de-
gree of comparability across languages. When we first set out to compare
original texts, this was obviously a matter of some concern, and it was pre-
cisely this concern that has led many researchers to adopt the parallel-text, or
story-retelling, methodologies outlined above. However, our initial analysis
of the indigenous texts has in fact revealed rather remarkable areas of global
stability across all four languages, which suggest that, as a
text type, such monologous narratives have enough commonalities to make
cross-language comparison meaningful. But we have also identified an area
of considerable cross-language variation, namely in the deployment of pronouns.
Here it seems that the discourse-typological profiles of the individual
languages do indeed exert an influence. In the following sections, we will
first outline the areas where the four languages exhibit considerable parallels,
before exploring one area of high cross-language variation.
3.3. Ratios of core arguments and human expressions
In this section we present the overall distribution of core arguments as well as
human expressions, as background figures for the analyses presented in Sections
3.4 and 4 below. Regarding the notion of ‘core arguments’, throughout
this chapter we restrict our analysis to the properties of S (intransitive sub-
ject), A (transitive subject) and P (transitive object). The GRAID annotation
system does in fact provide for additional arguments, but their inclusion will
be the subject of future analysis. The respective proportions of S, A and P in
the four text corpora are as follows:
Table 3. Proportion of S, A and P arguments

           S             A             P             total
Awetí      55.0% (325)   21.1% (125)   23.9% (141)   591
Gorani     46.2% (306)   27.4% (182)   26.4% (175)   663
Savosavo   51.2% (391)   24.9% (190)   24.0% (183)   764
Vera’a     52.6% (374)   23.2% (165)   24.2% (172)   711
Following the definition of A and P functions adopted in Haig and Schnell
(2011), these two arguments should always co-occur in one transitive clause
and hence show exactly the same proportions. The higher number of Ps in
some of the data (Awetí and Vera’a in Table 3) can be attributed to the presence
of non-finite constructions, which may contain a P, but systematically block
the expression of A. In Gorani, the slightly higher number of As results from
the occasional use of an adpositional object with an otherwise transitive verb,
as in English he drank of the medicine, which would be glossed with an A but
lacks a regular P. In Savosavo, it is mostly due to relative clauses where P was
the relativized constituent, and which therefore could not contain any
expression of P.
Behind these figures lies, of course, the proportion of transitive and
intransitive clauses. If we take the number of P arguments to represent the
number of transitive clauses, hence including infinitival constructions in this
category, the results obtained are as follows:
Table 4. Proportion of intransitive and transitive clauses

           intransitive   transitive    total
Awetí      69.7% (325)    30.3% (141)   466
Gorani     63.6% (306)    36.4% (175)   481
Savosavo   68.1% (391)    31.9% (183)   574
Vera’a     68.5% (374)    31.5% (172)   546
The similarities are quite remarkable: the mean proportion of intransitive
clauses across the four languages is 67.5%, with a range of just 6.1 percentage
points. While a corpus of four languages is obviously much too small for
drawing far-reaching conclusions, these results are certainly indicative that
monologous narrative texts reveal a fairly stable ratio of intransitive to tran-
sitive clauses of approximately two thirds to one third. Hence, these texts
are remarkably homogenous despite their very varied contents. Interestingly,
these proportions match those observed by Everett (2009: 9) for his corpus
of English and Portuguese data, in which 70% and 62% of clauses were in-
transitive. The fact that this corpus consisted of data from a very different
genre, i.e. conversational data, suggests a more fundamental relevance of this
proportion.
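The mean and range quoted for Table 4 can be recomputed from the raw counts given there. The short sketch below does so, taking the S count as the number of intransitive clauses and the P count as the number of transitive clauses, as in the text; the variable and function names are illustrative choices, and the mean is the unweighted mean over the four languages.

```python
# Counts from Table 4: (intransitive clauses, transitive clauses).
counts = {
    "Awetí": (325, 141),
    "Gorani": (306, 175),
    "Savosavo": (391, 183),
    "Vera’a": (374, 172),
}


def intransitive_share(intransitive, transitive):
    """Proportion of intransitive clauses among all counted clauses."""
    return intransitive / (intransitive + transitive)


shares = {lang: intransitive_share(*c) for lang, c in counts.items()}

# Unweighted mean over languages, and max minus min as the range.
mean_share = sum(shares.values()) / len(shares)
spread = max(shares.values()) - min(shares.values())

for lang, share in shares.items():
    print(f"{lang:<9} {share:6.1%}")
print(f"mean  {mean_share:.1%}")
print(f"range {spread * 100:.1f} percentage points")
```

Because each language contributes equally to the mean, larger corpora such as Savosavo do not outweigh smaller ones; a pooled proportion over all clauses would be a different, corpus-weighted figure.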
As mentioned above, GRAID annotations also note the semantic feature
of humanness as a sub-aspect of the more general dimension of animacy,
and the distribution of human referents in texts can therefore be investigated
quantitatively with our system. The traditional stories investigated here pri-
marily recount the actions of human beings, often of one central character.
But the total number of human actors that are introduced in the different sto-
ries varies from text to text, as does the number of non-human referents –
artefacts, places, objects – that occur. One might assume that the actual num-
ber of human participants that play a role in a given narrative would affect
the rate of occurrence of human referents in texts, and potentially influence
the distribution of referential expressions. The actual proportion of human vs.
non-human arguments in core functions is provided in Table 5.
Table 5. Proportion of human vs. non-human arguments in core functions

         Awetí   Gorani   Savosavo   Vera’a
+ HUM    68.4%   62.8%    74.6%      73.0%
- HUM    31.6%   37.2%    25.4%      27.0%
As can be seen from the table, texts in all four languages feature more human
than non-human discourse participants, with a mean of 69.7% of human discourse
referents in the overall corpus and a range of 11.8 percentage points.
Again it seems that we are dealing with a relatively stable ratio, which is quite
surprising given that these texts have not been controlled for content. A very
preliminary conclusion is that in narrative texts, somewhere in the region of
two-thirds of the core arguments refer to human participants. Whether the
proportion of two thirds is genuinely stable, or merely an artefact of the small
corpus remains to be seen. But it is once again indicative of the fact that tra-
ditional narratives, despite the differences in content, show similar profiles in
some aspects, and are therefore in principle comparable across languages.
Looking closer at the proportion of human referents in individual core
functions, the languages in our corpus show a clear, and again homogenous,
pattern (see Table 6).
Table 6. Proportion of human arguments among S, A and P arguments

S        Awetí   Gorani   Savosavo   Vera’a
+ HUM    70.2%   83.7%    81.1%      84.8%
- HUM    29.8%   16.3%    18.9%      15.2%

A        Awetí   Gorani   Savosavo   Vera’a
+ HUM    96.8%   92.7%    97.4%      93.3%
- HUM     3.2%    7.3%     2.6%       6.7%

P        Awetí   Gorani   Savosavo   Vera’a
+ HUM    39.0%   26.4%    37.2%      27.9%
- HUM    61.0%   73.6%    62.8%      72.1%
Not altogether surprisingly, referential expressions denoting humans are very
often found as S or A, and more rarely in P. The extremely high proportion
of human expressions in A function, and the relatively low one in P function,
reflect a well-known tendency: in typical transitive events, humans act upon
inanimate things (Comrie 1989: 128; Næss 2007: 18). We will return to this
tendency in Section 4 below.
3.4. Testing a well-known constraint: Avoid Lexical A

As a test case for our corpus, we investigated a well-known constraint on
the occurrence of noun phrases in A function, i.e. as transitive subjects. This
constraint was originally noted in Du Bois (1987), using data from Sakapultek.
Du Bois found that in connected discourse, S, A and P differ in their
respective preference or dispreference for being filled by lexical expressions
(NPs) as opposed to non-lexical expressions (zero or pronouns). The tendencies
concerned are collectively referred to as ‘Preferred Argument Structure’
and have since been confirmed in a number of other languages (see Du Bois,
Kumpf, and Ashby 2003). While Du Bois makes a two-way distinction between
lexical expressions (NPs) and non-lexical expressions (pronouns and
zero anaphora), in our data we maintain a three-way distinction in form-types:
Full NP (lexical), pronoun, or zero anaphora (cf. Section 4, where we discuss
the relevance of pronominal, as opposed to both lexical and zero, expressions
in greater detail).
Now in any given text one might expect a random distribution of lexical
NPs, such that, for example, in each of S, A and P functions we find approx-
imately one third of the total lexical NPs in a text. This turns out not to be
the case: For all languages investigated to date, the proportion of NPs in the
A function falls far below what could be expected from a random distribu-
tion. In the data cited in Du Bois, Kumpf, and Ashby (2003: 37), drawn from
eight languages, the proportion of lexical arguments in the A function does
not exceed ten percent in any of the languages, while the remaining 90% are
distributed approximately evenly across S and P. What this tendency indi-
cates is that lexical mentions are not randomly distributed among S, A and P,
but tend to be split across S and P, while avoiding A. Du Bois refers to this
constraint as “Avoid Lexical A”. Despite its cross-linguistic robustness, as yet
we lack a database large and varied enough to empirically back one or the
other explanation proposed in the literature (cf. Du Bois 1987; discussion in
Goldberg 2004 and more recently in Everett 2009).
The raw figures from our corpus are provided in Table 7, showing the
distribution of lexical mentions across core arguments. The data from our
corpus clearly confirm Avoid Lexical A: nowhere does the A function account
for more than 14.4% of the lexical mentions. In all cases, the value for
non-lexical A arguments is significant according to chi-square tests
(p<0.001). The mean of 10.2% yielded by our data lies slightly
higher than that of the figures provided in Du Bois, Kumpf, and Ashby (2003:
37), which is 7.1%, but it seems evident that Avoid Lexical A emerges in
a very similar fashion in the texts investigated here, suggesting again that
certain characteristics of discourse are remarkably insensitive to differences
in content. Furthermore, they are already surprisingly consistent in texts of
the size we are dealing with (around 500 clause units).
Table 7. Distribution of lexical mentions (i.e. lexical NPs) across core arguments

           S             A            P             total
Awetí      53.7% (87)    12.3% (20)   34.0% (55)    162
Gorani     41.2% (83)     7.9% (16)   49.9% (100)   199
Savosavo   47.0% (101)   14.4% (31)   38.6% (83)    215
Vera’a     34.0% (70)     6.3% (13)   59.7% (123)   206
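One way to make the "random distribution" baseline concrete is a chi-square goodness-of-fit test of the lexical NP counts in Table 7 against equal thirds for S, A and P. The text does not spell out how its chi-square tests were set up, so the design below is an assumption rather than a reproduction; the function names are likewise illustrative. The sketch uses only the standard library, relying on the fact that the upper-tail probability of a chi-square variable with two degrees of freedom has the closed form exp(-x/2).

```python
import math

# Lexical NP counts per function (S, A, P), taken from Table 7.
lexical = {
    "Awetí": (87, 20, 55),
    "Gorani": (83, 16, 100),
    "Savosavo": (101, 31, 83),
    "Vera’a": (70, 13, 123),
}


def chi_square_uniform(observed):
    """Goodness-of-fit statistic against equal expected counts (assumed null)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df2(chi2):
    """Upper-tail probability for 2 degrees of freedom: exp(-chi2 / 2)."""
    return math.exp(-chi2 / 2)


results = {}
for lang, obs in lexical.items():
    chi2 = chi_square_uniform(obs)
    results[lang] = (chi2, p_value_df2(chi2))
    print(f"{lang:<9} chi2 = {chi2:6.2f}   p = {results[lang][1]:.2e}")
```

Under this assumed design all four languages fall below the 0.001 threshold. Note that equal thirds is a deliberately naive null: S arguments are roughly twice as frequent as A or P overall (Table 3), so a baseline proportional to the number of S, A and P slots would be a stricter alternative.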
In this section we have looked at some global features of the texts in our
corpus, restricting ourselves to an investigation of the core arguments S, A
and P. This is obviously but a tiny sub-domain of the possibilities for text-
based typology based on GRAID annotations. Our initial findings are that
the texts are surprisingly homogeneous with respect to the ratio of transitive to
intransitive clauses, lexical mentions in the A-function, and overall proportion
of human referents among core syntactic functions. With the exception of
the second feature, that of proportion of lexical mentions in the A-function
(‘Avoid Lexical A’), where our findings essentially replicate those of previous
studies, we are unaware of any other cross-corpus studies that have looked at
these issues. Based on our findings, we can formulate an initial testable hypothesis:
monologous narrative texts tend to exhibit relatively stable rates of
transitive and intransitive clauses, and the proportion of human referents
in core argument functions is likewise fairly consistent, regardless of content.
4. Distribution of pronouns in texts
As discussed above, Bickel (2003) develops a quantitative approach to assessing
the degree to which languages make overt reference to arguments, using
the term ‘referential density’ (RD). Bickel’s measure of RD does not actually
register pronouns as an independent category. Instead, they are grouped together
with NPs, both counting as overt expressions of arguments, and standing
in opposition to zero arguments (those completely lacking overt expression).
Du Bois on the other hand, in his investigations of Preferred Argument
Structure, lumps pronouns together with zero, because they apparently share
the discourse feature of referring to given information, as opposed to NPs,
which are primarily used for new information. And in Sakapultek, the lan-
guage originally investigated by Du Bois, free pronouns are extremely scarce,
so that neglecting them had few major consequences. Thus in both major ex-
isting research paradigms on text-based typology, pronouns were not consid-
ered in their own right.
A practical problem apparently not encountered by these researchers, but
presenting itself very prominently during the development of the GRAID
coding scheme, was that it is often not a straightforward matter to distin-
guish “pronouns” from bound “agreement” markers. Indeed, many recent ap-
proaches to referential expressions assume a continuum from free pronoun to
agreement marker. For example, Corbett (2003) refers to “canonical agree-
ment” for morphologically bound, syntactically obligatory cross-reference
markers (e.g. third person singular verb agreement in English), as one pole
of a continuum, from which various kinds of pronominal reference diverge
to different degrees. However, no truly satisfactory solution has yet been pro-
posed in the literature, and it remains a matter of considerable debate. For the
moment, our strategy has been to identify operational criteria for distinguish-
ing canonical agreement from pronouns (based on criteria suggested in the
literature, cf. discussion in Haig and Schnell 2011, Section 4.2). Annotators
use these criteria to decide which (bound or free) morphemes are to be con-
sidered as pronouns, and which are to be classified and coded as agreement,
and include the justification for their decision in the accompanying documen-
tation. Only the formatives coded as pronouns were included in the following
study.
It is still an open question whether or not the presence of agreement mor-
phology in a language influences the distributional patterns observed for pro-
nouns, lexical NPs and zero, and if it does, in which way. Bickel (2003: 721,
729) reports that the different agreement patterns in the languages he inves-
tigated had no influence on the observed referential density values. As men-
tioned above, the GRAID coding scheme allows for the coding of agreement
morphology, and thus provides the means to study this question in more de-
tail. In the following study we will focus on pronominal forms, but refer to the
issue of agreement marking occasionally, to indicate where it can (or cannot)
provide a sufficient explanation for the distributions we observe.
4.1. Background
More recently, researchers such as Genetti and Crain (2003) have emphasized the
importance of investigating the distribution of pronouns specifically, as opposed
to both zero and lexical NPs. Stoll and Bickel (2009) have likewise
refined the original notion of RD by distinguishing lexical NPs from pro-
nouns in their investigation of Russian and Belhare texts. Our own data also
indicate that the deployment of pronouns is an area of considerable cross-
linguistic variation, suggesting the need for a more fine-grained distinction.
As a first indication, Table 8 gives a breakdown of the overall proportion of
different form-types (restricted to occurrences in S, A and P function) in our
corpus:
Table 8. Proportion of forms among S, A and P arguments

           NP            pro           zero          other       total
Awetí      27.4% (162)   47.5% (281)   25.0% (148)   0.0% (0)    591
Gorani     30.0% (199)   12.7% (84)    51.1% (339)   6.2% (41)   663
Savosavo   28.1% (215)   43.1% (329)   27.7% (212)   1.0% (8)    764
Vera’a     29.0% (206)   50.4% (358)   20.5% (146)   0.1% (1)    711
German     36.8% (183)   38.2% (190)   19.3% (96)    5.6% (28)   497
The data from German have been included in the bottom line as a compari-
son from a different text type (Pear story retellings), and more significantly,
from a language that requires overt subjects in most contexts, and generally
strongly disprefers zero-anaphora. There are two striking features of these
data. The first is that the proportion of full NPs in our four corpus languages
is surprisingly uniform, ranging from 27.4% to 30.0%, a spread of just 2.6
percentage points. German differs from the other four languages in this respect,
for reasons that remain uncertain. However, it should be recalled that the German
data consist of seven smaller texts (Pear-story retellings), potentially giving
rise to more extensive lexical reference at the beginning and end of each story.
Of greater interest are the areas of massive diversity, in particular with regard
to rates of pronoun use. Here the four corpus languages range from 12.7% (Gorani)
to 50.4% (Vera’a), a spread of 37.7 percentage points. In the rest of this section
we take a closer look at pronoun use in discourse, in particular as it relates to
syntactic function, and to animacy (human vs. non-human).
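The two spreads quoted above can be recomputed directly from the raw counts in Table 8. The following is a minimal sketch, not part of the original study: the function and variable names are illustrative assumptions, and German is left out of the spread because the text treats it as a comparison set from a different text type.

```python
# Counts of S, A and P arguments per form-type, copied from Table 8.
TABLE_8 = {
    "Awetí":    {"NP": 162, "pro": 281, "zero": 148, "other": 0},
    "Gorani":   {"NP": 199, "pro": 84,  "zero": 339, "other": 41},
    "Savosavo": {"NP": 215, "pro": 329, "zero": 212, "other": 8},
    "Vera’a":   {"NP": 206, "pro": 358, "zero": 146, "other": 1},
    "German":   {"NP": 183, "pro": 190, "zero": 96,  "other": 28},
}

# The four documentation corpora; German is the comparison data set.
CORPUS = ["Awetí", "Gorani", "Savosavo", "Vera’a"]


def share(counts, form):
    """Percentage of all core arguments realised by `form`, to one decimal."""
    return round(100 * counts[form] / sum(counts.values()), 1)


def spread(form, languages):
    """Highest minus lowest share of `form`, in percentage points."""
    shares = [share(TABLE_8[lang], form) for lang in languages]
    return round(max(shares) - min(shares), 1)


np_spread = spread("NP", CORPUS)    # lexical NPs: narrow band
pro_spread = spread("pro", CORPUS)  # pronouns: very wide band
print(np_spread, pro_spread)
```

Run on the Table 8 counts, this gives 2.6 percentage points for lexical NPs and 37.7 for pronouns, matching the figures in the text.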
4.2. Testing ‘Avoid Pronominal P’
As a point of departure we take the observations of Genetti and Crain (2003: 216)
in their study of pronoun use in spoken Nepali discourse. Genetti and
Crain also restrict their investigation to the distribution of pronouns across
the syntactic functions S, A and P. They observe that in Nepali discourse,
pronouns most frequently bear A (35% of all pronouns) and S (43% of all
pronouns) function, while only a small portion of pronouns (17% of all pro-
nouns) has P function (6% of all pronouns function as indirect object). What
this suggests is that pronouns display a roughly converse distribution to lex-
ical NPs, discussed above in connection with Avoid Lexical A: Lexical NPs
favour S and P, while avoiding A; pronouns on the other hand favour S and
A, while avoiding P. We may provisionally formulate this as the following
constraint:
(1) Avoid Pronominal P: In P function, pronouns are significantly less frequent
    than in the A and S functions.
A possible explanation for Avoid Pronominal P is that it results from the com-
bined effects of two further tendencies. The first is that pronominal reference
is largely restricted to animate, more specifically human, referents. In their
Nepali data, Genetti and Crain find that 92% of all pronominal mentions had
human referents (cf. Genetti and Crain 2003: 215, en. 13). Pronouns with non-
human referents are not completely disallowed, but the discourse tendency to
avoid such pronouns is obviously very strong. The second is the more general
observation that P arguments are more likely to be inanimate than animate.
We discussed this proposal above in Section 3.3, and found it to be confirmed
by our data (cf. Table 6). Taken together, the combined effects of these two
tendencies could provide an explanation for Avoid Pronominal P: because
P arguments are typically inanimate, and because pronouns are avoided for
inanimates, P arguments are unlikely to be realised as pronouns.
Whether Avoid Pronominal P holds in other languages, and whether the
two tendencies just outlined are valid explanations, remain open questions. In
fact, Genetti and Crain’s explanations are not without problems, because they
do not actually state how different forms are distributed within the P function,
but only how pronouns are distributed across all of S, A and P. They likewise
do not consider the exact distribution of zero forms, which turns out to be of
some importance for our data. We will first investigate our data to determine
whether there is evidence for Avoid Pronominal P, before taking a closer look
at possible explanations.
Table 9 provides an overview of the distribution of pronouns across syn-
tactic function in our data, with German also included for comparison.
Table 9. Distribution of pronominal forms across core arguments

           S             A             P            total
Awetí      59.1% (166)   22.8% (64)    18.1% (51)   281
Gorani     44.0% (37)    23.8% (20)    32.2% (27)   84
Savosavo   57.8% (190)   39.2% (129)   3.0% (10)    329
Vera’a     63.7% (228)   29.3% (105)   7.0% (25)    358
German     40.0% (76)    43.7% (83)    16.3% (31)   190
The tendency to avoid pronominal P is clearly valid in three languages, Savosavo,
Vera’a and German (all with p<0.001 in chi-square tests). For Awetí, however, the
difference between P and A with respect to pronouns is very slight, and it reaches
significance only at p<0.01. We take this as not being significant enough to
uphold Avoid Pronominal P in Awetí. With Gorani, the difference in numbers is
likewise not statistically significant. Surprisingly, however, it actually points
in the opposite direction: there is no evidence for Avoid Pronominal P at all. If
anything, this language exhibits a trend towards something like ‘Avoid Pronominal
A’!6
Obviously then, unlike Avoid Lexical A, which was demonstrated in 3.4
above, Avoid Pronominal P is by no means a universal constraint. Furthermore,
even for the three languages where Avoid Pronominal P does seem to be operative,
the figures obtained vary considerably (3.0% in Savosavo compared to 16.3% in
German), again suggestive of a greater degree of language-specific factors in
shaping this aspect of discourse.
6. Differences in agreement patterns cannot directly account for these distributions. The
hypothesis suggesting itself in this context would be that, in functions for which there is
obligatory agreement morphology, either NPs (adding additional information) or zero
(avoiding redundancy) should be the preferred means of reference, i.e. that pronouns would
be avoided. Savosavo is the only language that seems to support this hypothesis: Its
obligatory agreement morphology for P (marking person, number and gender) seems to align
with the significantly low proportion of pronouns in this function. But for the other
languages investigated, this correlation could not be confirmed: Vera’a has no obligatory
agreement morphology at all, but still shows a significantly low proportion of pronouns in
P. In German verbs obligatorily agree with S and A in person and number, but the lowest
proportion of pronouns is still found in P function. And finally, Gorani has obligatory
agreement morphology for S and A (marking person and number), but pronouns are distributed
fairly evenly over all three functions. We refrain from commenting on Awetí in this context
as the agreement pattern in this language is too complex to be summarized here.
Above it was suggested that Avoid Pronominal P could be ultimately mo-
tivated by the combination of two more general tendencies, the tendency to
avoid inanimate pronouns, and the tendency for P to be inanimate. Now, if
Avoid Pronominal P is not operative in all languages, the question imme-
diately arises as to whether these two more general tendencies are likewise
absent, or at least significantly weaker, in such languages. While considera-
tions of space preclude a full examination of this question, we can at least test
it for the one language that most clearly departs from the expected pattern,
namely Gorani.
That P is more frequently non-human has already been shown at the end
of Section 3.3. The figures presented in Table 6 showed that Gorani, like
the other languages in our corpus, clearly prefers non-humans in the P role
(73.6% of the total P’s are non-human compared to just 7.3% of the As and
16.3% of the Ss). There seems little doubt that this basic tendency is preserved
in Gorani, so the failure to comply with Avoid Pronominal P cannot be related
to animacy-related constraints on argument realization.
The second tendency that needs to be checked for Gorani is whether
pronouns generally have human referents. As it stands, this generalization
is independent of syntactic function. Genetti and Crain’s figures for pro-
nouns in Nepali are indicative of a genuine dispreference for non-human
pronouns; however, they are not tested against animacy features of the two
other form types. Thus we do not know for certain whether the low figure for
non-humans among the pronouns is simply reflecting a low figure for non-
humans generally. In order to know whether pronouns are truly dispreferred
with non-human referents, we need to check the proportions of non-human
referents among other form-types (NPs and zero), across syntactic functions.
The respective data for Gorani as well as the two languages in our corpus that
exhibited a statistically significant tendency to avoid pronominal P, Savosavo
and Vera’a, are provided in Table 10.
Table 10. Proportion of non-human referents in the three form types for Gorani,
Savosavo and Vera’a

        Gorani                Savosavo              Vera’a
        total   non-hum.      total   non-hum.      total   non-hum.
NP      199     66.8% (133)   215     45.1% (97)    206     62.1% (128)
pro     84      16.7% (14)    329     5.5% (18)     358     6.7% (24)
zero    339     6.2% (21)     212     33.5% (71)    146     26.7% (39)

It is clear that for all three languages the tendency to avoid non-human
referents is far stronger with pronouns than with NPs. But where the languages
differ is the relative percentages of pronouns with non-human referents
compared to zero-encoding of non-human referents. In Gorani, the proportion of
pronouns used for non-human reference is actually higher than that of zero,
while in Savosavo and Vera’a, this is not the case. In these two languages,
among all three form-types, it is indeed pronouns that show the lowest
percentage of non-human referents, hence they seem to confirm Genetti and
Crain’s hypothesis. It is evident, however, that for Gorani, we need to take
the zero form-type into consideration, which shows some remarkable
peculiarities: In comparison with pronouns it is striking that the total figure
for zero, 339, is about four times as high as that for pronouns, which is only 84.
And in comparison with the use of zero in the other two languages, it stands
out as having a conspicuously small proportion of non-human referents, only
6.2% compared to 33.5% in Savosavo and 26.7% in Vera’a. We can conclude
for the time being that pronouns in Gorani also exhibit a general dispreference
for inanimate reference, at least in comparison to NPs, but the overall
effects of this tendency are counteracted by a very different behaviour of the
zero form-type.
So what is the explanation for the odd behaviour of Gorani? We have seen
that Gorani, like all other languages we are aware of, avoids human referents
in the P function. It is thus quite ‘normal’ in this respect. It also complies with
the tendency to avoid non-human referents for pronouns, at least in compar-
ison to NPs. However, crucially, this tendency is reversed when we compare
pronouns to zero: The proportion of pronouns used for non-human referents
(16.7% of all pronouns) is greater than that of zero (6.2% of all instances of
zero encoding).
Note that this does not mean that, in Gorani, non-human referents are
encoded more frequently by means of pronouns than by zero. Table 11 shows
for all three languages the proportion of non-human referents encoded by
each of the different form-types.
Table 11. Proportion of non-human referents encoded by means of NPs, pronouns
and zero

           NP            pro          zero         total
Gorani     79.2% (133)   8.3% (14)    12.5% (21)   168
Savosavo   52.2% (97)    9.7% (18)    38.2% (71)   186
Vera’a     67.0% (128)   12.6% (24)   20.4% (39)   191

In all three languages, only a small proportion of non-human referents are
referred to by pronouns, with a range of only 4.3 percentage points. Not
surprisingly, the most commonly used form-type for the encoding of non-human
referents is NPs. In Gorani this preference is particularly strong, and the third
available form-type, zero, is almost as rarely used as are pronouns. In
Savosavo, non-human referents are frequently not overtly expressed at all. This is
reflected in a relatively high percentage for zero, which counter-balances the
lower percentage for NPs. So given that all three languages share a preference
for non-human referents in P function, as well as a dispreference of pronouns
to encode non-animates, why does Gorani not fit the pattern of Avoid
Pronominal P?
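Tables 10 and 11 are built from the same counts but answer different questions: Table 10 asks what share of each form-type is non-human, while Table 11 asks what share of all non-human mentions uses each form-type. The sketch below makes the two views explicit. The function names are illustrative assumptions, and only the three form-types NP, pro and zero are included, as in the tables.

```python
# (total mentions, non-human mentions) per form-type, copied from Table 10.
TABLE_10 = {
    "Gorani":   {"NP": (199, 133), "pro": (84, 14),  "zero": (339, 21)},
    "Savosavo": {"NP": (215, 97),  "pro": (329, 18), "zero": (212, 71)},
    "Vera’a":   {"NP": (206, 128), "pro": (358, 24), "zero": (146, 39)},
}


def nonhuman_rate_within_form(lang):
    """Table 10 view: of all tokens of a form-type, what percentage is non-human?"""
    return {form: round(100 * nonhum / total, 1)
            for form, (total, nonhum) in TABLE_10[lang].items()}


def form_share_among_nonhuman(lang):
    """Table 11 view: of all non-human mentions, what percentage uses each form-type?"""
    all_nonhum = sum(nonhum for _, nonhum in TABLE_10[lang].values())
    return {form: round(100 * nonhum / all_nonhum, 1)
            for form, (_, nonhum) in TABLE_10[lang].items()}


for lang in TABLE_10:
    print(lang)
    print("  non-human rate within each form:", nonhuman_rate_within_form(lang))
    print("  form share among non-humans:    ", form_share_among_nonhuman(lang))
```

For Gorani the two views point in different directions. Pronouns have a higher non-human rate than zero (16.7% against 6.2%), yet pronouns encode a smaller share of non-human mentions than zero does (8.3% against 12.5%). This is possible because zero tokens outnumber pronouns roughly four to one.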
What is striking in Gorani in comparison to Savosavo and Vera’a is the
extreme predominance of zero form-types in Gorani discourse (more than
half of the core arguments, 51.1%, are expressed through zero, cf. Table 8).
Ss and As in particular (which are frequently animate and definite) are com-
monly not expressed with pronouns, but through zero, see Table 12.
Table 12. Proportion of forms among S and A

S          NP            pro           zero          other      total
Gorani     27.1% (83)    12.1% (37)    59.2% (181)   1.6% (5)   306
Savosavo   25.8% (101)   48.6% (190)   24.0% (94)    1.5% (6)   391
Vera’a     18.7% (70)    61.0% (228)   20.1% (75)    0.3% (1)   374

A          NP            pro           zero          other      total
Gorani     8.8% (16)     11.0% (20)    77.5% (141)   2.7% (5)   182
Savosavo   16.3% (31)    67.9% (129)   15.8% (30)    0.0% (0)   190
Vera’a     7.9% (13)     63.6% (105)   28.5% (47)    0.0% (0)   165
This leads to a very large figure for zero, and simultaneously brings down
the overall numbers of pronouns in these functions, leading to a levelling
out of S, A and P functions with regard to pronoun use in Gorani (cf. Table
9). In this case, the agreement pattern in Gorani seems to at least contribute
to the observed proportions: Gorani verbs agree with S and A, so it could
be expected that for given referents, zero would be the preferred form type in
these functions, while pronouns would be dispreferred. This is indeed the case
in Gorani. In contrast, Savosavo and Vera’a lack obligatory agreement with
S and A, and here pronouns are by far the most frequent form type in these
functions. The concentration of pronouns in S and A functions pushes up the
overall figure for pronouns in these languages, leading to a corresponding
reduction in the relative numbers of pronouns in P function, an effect that is
missing in Gorani.
Thus, in Gorani, the disproportionately frequent use of the zero form-
type in general, and in A and S function in particular, is the reason why we
do not find the tendency to avoid pronominal Ps in the language, even though
the supposedly contributing factors are still operative. The full picture only
emerges when all form-types are taken into consideration, and tested across
all syntactic functions.
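The levelling effect described above can be illustrated by computing the share of pronouns within each function, rather than the distribution of pronouns across functions. The sketch below rests on several assumptions that are not in the original text. P counts are derived by subtracting the S and A rows of Table 12 from the totals in Table 8. The 2x2 comparison (pronoun vs. other form, in P vs. in S and A combined) is a hypothetical test design, since the chapter does not specify how its chi-square tests were set up. The cut-offs are the standard chi-square critical values for one degree of freedom. The statistics are therefore not expected to reproduce the p-values reported for Table 9.

```python
# Tokens per form-type. "all" rows are copied from Table 8, "S" and "A" rows
# from Table 12. P counts are not tabulated in the text; this sketch derives
# them as all - S - A (an assumption of the example).
FORMS = ("NP", "pro", "zero", "other")
COUNTS = {
    "Gorani":   {"all": (199, 84, 339, 41), "S": (83, 37, 181, 5),  "A": (16, 20, 141, 5)},
    "Savosavo": {"all": (215, 329, 212, 8), "S": (101, 190, 94, 6), "A": (31, 129, 30, 0)},
    "Vera’a":   {"all": (206, 358, 146, 1), "S": (70, 228, 75, 1),  "A": (13, 105, 47, 0)},
}


def by_function(lang):
    """Return {function: {form: count}} for S, A and the derived P row."""
    rows = {fn: dict(zip(FORMS, COUNTS[lang][fn])) for fn in ("all", "S", "A")}
    rows["P"] = {f: rows["all"][f] - rows["S"][f] - rows["A"][f] for f in FORMS}
    del rows["all"]
    return rows


def pronoun_share(row):
    """Percentage of tokens in one function that are pronouns."""
    return round(100 * row["pro"] / sum(row.values()), 1)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (df = 1, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def pronoun_p_statistic(lang):
    """Hypothetical 2x2 design: pronoun vs. other form, in P vs. in S and A."""
    rows = by_function(lang)
    pro_p = rows["P"]["pro"]
    rest_p = sum(rows["P"].values()) - pro_p
    pro_sa = rows["S"]["pro"] + rows["A"]["pro"]
    rest_sa = sum(rows["S"].values()) + sum(rows["A"].values()) - pro_sa
    return chi_square_2x2(pro_p, rest_p, pro_sa, rest_sa)


CRITICAL_05, CRITICAL_001 = 3.841, 10.828  # standard chi-square cut-offs, df = 1

for lang in COUNTS:
    shares = {fn: pronoun_share(row) for fn, row in by_function(lang).items()}
    print(lang, shares, round(pronoun_p_statistic(lang), 1))
```

Under these assumptions, Savosavo and Vera’a exceed the p = 0.001 cut-off, whereas Gorani stays below the p = 0.05 cut-off. In Gorani the pronoun share in P (15.4%) is in fact slightly higher than in A (11.0%), in line with the observation that the Gorani pattern points in the opposite direction.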
4.3. Summary of pronoun deployment in discourse
We began this section with the observation that pronoun use is evidently one
area of massive cross-linguistic variation. As noted, the two current major
paradigms in discourse-based syntactic typology, namely Preferred Argument
Structure and Referential Density (at least in the original investigation of
Bickel 2003), actually do not consider pronouns as a category in their own
right. As Genetti and Crain (2003) point out, however, this is undoubtedly an
oversimplification. They propose a constraint on pronoun distribution, based
on data from Nepali, namely Avoid Pronominal P, and suggest two factors
that contribute towards this constraint. In our data we found evidence for
Avoid Pronominal P in three languages, though to rather differing degrees,
while one language, Gorani, showed no evidence for this constraint. A closer
look at the Gorani data revealed that although the contributory factors are still
operative, language-specific characteristics in the deployment of pronouns
versus zero worked against producing the expected Avoid Pronominal P. This
result confirms the view that the deployment of pronouns is an area of high
cross-language variation, subject to a subtle interplay of language-specific
features, hence a very promising arena for text-based typology. It also high-
lights the necessity for a greater typological range of languages under investi-
gation: the conclusions based on Nepali cannot simply be transferred to other
languages, but these are facts that only emerge when sufficient, and diverse,
languages have been investigated, in particular languages which utilize both
zero and pronominal reference strategies. This case study also highlights the
importance of the animacy feature, which is recorded in GRAID-annotations.
Including an animacy parameter for pronouns yields a number of other ob-
servations, for example the almost complete absence of non-human pronouns
in A function.7
Finally, recall that Table 8 above revealed a surprisingly uniform distribution
of NPs in discourse: somewhere between 20 and 30% of core arguments appear to
be expressed by full NPs in spoken narrative discourse. Whether this figure can
be confirmed in further studies remains to be seen. But it raises the interesting
possibility that the space of cross-language differences in this regard is
essentially restricted to how the remaining 70–80% of positions are partitioned
in terms of pronominal versus zero reference. Whether this view of the matter is
realistic can only be determined in the light of more extensive studies of
narrative texts from different languages, but it is certainly a goal worth
pursuing.8
7. This very striking tendency merits more detailed treatment than can be accommodated in
this chapter. Here we simply note its existence, while deferring proper coverage to future
investigations.
8. Stoll and Bickel (2009) report significantly different ratios of lexical NPs in their
comparison of Russian and Belhare texts, which would appear to run counter to the tendency
we have identified in our own data. However, Stoll and Bickel’s findings are based on a count
of all argument positions, including goals, and at least some locatives, whereas the figures
for our data refer solely to S, A and P. Thus the figures from the two studies cannot be
directly compared.
5. Conclusion
In this chapter we have discussed the possibility of typological investigations
based on original data from language documentations. The data we used to
demonstrate the feasibility of such an undertaking were monologous texts
produced by speakers of the speech community expressing indigenous content,
i.e. texts that are not based on external stimuli, either translations of
source texts, or retellings of films or books. Precisely this type of text is a
widespread and highly-regarded output of language documentation projects,
as they reflect most closely indigenous narrative practices. We have developed
a system of annotation, GRAID, that can be practicably applied to available
archived texts. The resulting annotations can be made directly available for
quantitative comparative analysis, while the texts themselves, as archived
objects, remain fully accessible for purposes of data accountability.
Our investigation focussed on the way core arguments are realized in such
texts. Drawing on two related research traditions, Du Bois’ “Preferred Argu-
ment Structure” and Bickel’s “Referential Density”, we were able to demon-
strate that original texts from four different endangered languages are, across
certain dimensions, remarkably similar. The best-known feature of discourse
structure, that of Avoid Lexical A, is shown to be robustly present in all four
corpora, despite their modest size and idiosyncratic content. We also iden-
tified other possible candidates for cross-corpus commonalities, in particular
the ratio of transitive to intransitive clauses, and the distribution of human and
non-human referents across different syntactic functions. As a test case, we
investigated the deployment of pronouns across syntactic function, showing
that here, the language-specific differences are highly significant. Thus the
findings of Genetti and Crain (2003) for Nepali were shown not to hold for
one of the languages in our corpus, Gorani.
This chapter has touched on only a fraction of the possibilities made avail-
able by the application of standardized annotation practices to original texts.
But in taking up this challenge, we hope to have demonstrated that cross-
corpus comparison of original text material is both feasible, and yields in-
teresting results. Typology is not, and should not be, a discipline bound to
methodological dogmatism. The method of original text comparison that we
have trialed here is not intended to displace other methods, but to complement
them, to serve as an addition to what Wälchli (2006) refers to as the “typolo-
gist’s toolkit”. The method of comparison of original texts is a tool that was
designed to satisfy the requirements of documentary linguistics, and to real-
ize the typological potential inherent in the large-scale archiving of original
text material.
Appendix of data used in the GRAID corpus
Awetí is a Tupi language of the Upper Xingú region in West Brazil. Data on
the Awetí language has been collected during a DoBeS project by Sabine Reiter,
who is currently preparing a PhD thesis containing a grammar sketch of the
language. Two traditional Awetí stories have been used in our corpus, annotated
by Sabine Reiter using the software ELAN: kal_awytyza1_GRAID.eaf and
kal_makawaja_GRAID.eaf. From each text, approximately the first third was
annotated with GRAID (11 and 13 minutes respectively). The total number of words
in the first GRAID-annotated section is 1256, in the second 1033 (total of 2289).
Gorani (Indo-European, West Iranian, from the village of Gawraju in
West Iran). The two texts used in the corpus are available in Mahmudweyssi
et al. (in print), where they are published together with a sketch grammar of
the language, and a CD containing the recordings.
Savosavo (Papuan Isolate, Savo Island, Solomon Islands): The three texts
used in the corpus are available at
http://vc.uni-bamberg.de/moodle/course/view.php?id=9488. The complete data
collected during the DoBeS project “Documentation of Savosavo, a Papuan
language of the Solomon Islands” is stored in the DoBeS online archive at
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI1366371%23. Wegener (2008)
is the most recent, and most comprehensive, grammatical description of
Savosavo to date.
Vera’a is an Austronesian, Oceanic language spoken on the island of
Vanua Lava in the north of Vanuatu in the South Pacific. The two texts used
in the corpus are available at
http://vc.uni-bamberg.de/moodle/course/view.php?id=9488. The Vera’a language
has been extensively documented during the DoBeS project “Documentation of
Vera’a and Vurës, the two surviving endangered languages of Vanua Lava,
Vanuatu” and the data collected in the course of this project is stored at
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI649372%23. Schnell (2010)
contains the first description of the language. Some structural properties of
Vera’a are discussed in François (2005, 2007, 2009) in connection to the
historical development of the languages of North Vanuatu.
References
Atkins, Sue, Jeremy Clear, and Nicholas Ostler. 1992. Corpus design criteria.
Literary and Linguistic Computing 7(1):1–16.
Biber, Douglas, Geoffrey Leech, Susan Conrad, and Edward Finegan. 2004.
Longman Grammar of Spoken and Written English. London: Longman.
Bickel, Balthasar. 2003. Referential density in discourse and syntactic typol-
ogy. Language 79(4):708–736.
Chafe, Wallace L. 1979. The flow of thought and the flow of language. In
Discourse and Syntax (Syntax and Semantics Volume 12), ed. Talmy Givón,
159–181. New York: Academic Press.
Chafe, Wallace L. 1980. The deployment of consciousness in the production
of a narrative. In The Pear Stories: Cognitive, Cultural and Linguistic
Aspects of Narrative Production, ed. Wallace L. Chafe, 9–50. Norwood,
NJ: Ablex.
Chafe, Wallace L. 1987. Cognitive constraints on information flow. In Co-
herence and Grounding in Discourse, ed. Russell S. Tomlin, 21–51. Ams-
terdam, Philadelphia: John Benjamins.
Clancy, Patricia M. 2003. The lexicon in interaction: Developmental origins
of preferred argument structure in Korean. In Preferred Argument Struc-
ture: Grammar as Architecture for Function, eds. John W. Du Bois, Lor-
raine E. Kumpf, and William J. Ashby, 81–108. Amsterdam, Philadelphia:
John Benjamins.
Comrie, Bernard. 1989. Language Universals and Linguistic Typology. 2nd
edition. Chicago: The University of Chicago Press.
Corbett, Greville C. 2003. Agreement: The range of the phenomenon and
the principles of the Surrey Database of Agreement. In Agreement: A Ty-
pological Perspective, ed. Dunstan Brown, Greville C. Corbett and Ca-
role Tiberius. Special issue of Transactions of the Philological Society
101(2):155–202.
Cysouw, Michael, and Bernhard Wälchli. 2007. Parallel texts: Using transla-
tional equivalents in linguistic typology. Language Typology and Univer-
sals 60:95–99. doi:10.1524/stuf.2007.60.2.95.
Du Bois, John W. 1987. The discourse basis of ergativity. Language
63(4):805–855.
Du Bois, John W., Lorraine E. Kumpf, and William J. Ashby, eds. 2003. Preferred
Argument Structure: Grammar as Architecture for Function. Amsterdam,
Philadelphia: John Benjamins.
Everett, Caleb. 2009. A reconsideration of the motivations for preferred ar-
gument structure. Studies in Language 33(1):1–24.
African Studies.
François, Alexandre. 2005. Unraveling the history of the vowels of seventeen
northern Vanuatu languages. Oceanic Linguistics 44(2):443–504.
François, Alexandre. 2007. Noun articles in Torres and Banks languages:
Conservation and innovation. In Language Description, History and
Development: Linguistic Indulgence in Memory of Terry Crowley, eds.
Jeff Siegel, John Lynch, and Diana Eades, Creole Language Library 30,
313–326. Amsterdam, Philadelphia: John Benjamins.
François, Alexandre. 2009. Verbal aspect and personal pronouns: The history
of aorist markers in north Vanuatu. In Austronesian Historical Linguistics
and Culture History: A Festschrift for Bob Blust, eds. Andrew Pawley and
Alexander Adelaar, 179–195. Canberra: Pacific Linguistics.
Geertz, Clifford. 1973. The Interpretation of Cultures. New York: Basic
Books.
Genetti, Carol, and Laura D. Crain. 2003. Beyond preferred argument struc-
ture: Sentences, pronouns and given referents in Nepali. In Preferred Ar-
gument Structure: Grammar as Architecture for Function, eds. John W.
Du Bois, Lorraine E. Kumpf, and William J. Ashby, 197–223. Amsterdam,
Philadelphia: John Benjamins.
Goldberg, Adele E. 2004. Pragmatics and argument structure. In Handbook
of Pragmatics, eds. Laurence R. Horn and Gregory L. Ward, 427–441. Ox-
ford: Blackwell.
Haig, Geoffrey, and Stefan Schnell. 2011. Annotations using GRAID (Gram-
matical Relations and Animacy in Discourse). Introduction and guidelines
for annotators. Version 6.0. Available at: http://vc.uni-bamberg.de/moodle/
course/view.php?id=9488.
Haspelmath, Martin. 2007. Pre-established categories don’t exist: Con-
sequences for language description and typology. Linguistic Typology
11:119–132.
Haspelmath, Martin. 2010a. Comparative concepts and descriptive categories
in cross-linguistic studies. Language 86(3):663–687.
Haspelmath, Martin. 2010b. The interplay between comparative concepts and
descriptive categories (Reply to Newmeyer). Language 86(3):696–699.
Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie, eds.
2008. The World Atlas of Language Structures Online. Munich: Max
Planck Digital Library. http://wals.info/.
Himmelmann, Nikolaus P. 1997. Deiktikon, Artikel, Nominalphrase: Zur
Emergenz syntaktischer Struktur. Tübingen: Niemeyer.
Mahmudweyssi, Parwin, Denise Bailey, Ludwig Paul, and Geoffrey Haig. in
print. The Gorani Language of Gawraju (Gawrajuyi), a Village of West
Iran. Texts, Grammar and Lexicon. Wiesbaden: Reichert.
Mayer, Mercer. 1994[1969]. Frog, Where Are You? New York: Dial Books
for Young Readers.
ter 1(3):3–4.
Næss, Åshild. 2007. Prototypical Transitivity. Amsterdam, Philadelphia:
John Benjamins.
Neeleman, Ad, and Kriszta Szendrői. 2007. Radical pro drop and the mor-
phology of pronouns. Linguistic Inquiry 38(4):671–714.
Neeleman, Ad, and Kriszta Szendrői. 2008. Case morphology and radical
pro-drop. In The Limits of Syntactic Variation, ed. Theresa Biberauer,
331–348. Amsterdam, Philadelphia: John Benjamins.
Newmeyer, Frederick J. 2010. On comparative concepts and descriptive cat-
egories: A reply to Haspelmath. Language 86:688–695.
Nichols, Johanna. 2008. Why are stative-active languages rare in Eurasia?
A typological perspective. In The Typology of Semantic Alignment, eds.
Mark Donohue and Sören Wichmann, 121–139. Oxford: Oxford Univer-
sity Press.
Payne, Thomas E. 1992. The Twins Stories: Participant Coding in Yagua
Narrative. Berkeley: University of California Press.
Schnell, Stefan. 2010. Animacy and referentiality in Vera’a. Ph.D. Disserta-
tion, Kiel University.
Schultze-Berndt, Eva. 2006. Linguistic annotation. In Essentials of Language
Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike
Mosel, 213–251. Berlin, New York: Mouton de Gruyter.
Seifart, Frank. 2008. On the representativeness of language documenta-
tion. In Language Documentation and Description, Volume 5, ed. Peter K.
Austin, 60–76. London: School of Oriental and African Studies.
Seifart, Frank, Roland Meyer, Taras Zakharko, Balthasar Bickel, Swintha
Danielsen, and Alena Witzlack-Makarevich. 2010. Cross-linguistic variation
in the noun-to-verb ratio: Exploring automatic tagging and quantitative
corpus analysis. Presentation at the workshop Advances in Documentary
Linguistics, MPI Nijmegen, 14–15 October 2010.
Stoll, Sabine, and Balthasar Bickel. 2009. How deep are differences in ref-
erential density? In Crosslinguistic Approaches to the Psychology of Lan-
guage: Research in the Tradition of Dan Isaac Slobin, eds. Jiansheng Guo,
Elena Lieven, Nancy Budwig, Susan Ervin-Tripp, Keiko Nakamura, and
Şeyda Özçalişkan, 543–555. London: Psychology Press.
Stolz, Thomas. 2007. Harry Potter meets Le petit prince – on the usefulness
of parallel corpora in crosslinguistic investigation. Sprachtypologie und
Universalienforschung 60(2):100–117.
Wälchli, Bernhard. 2006. Descriptive typology, or, the typologist’s ex-
panded toolkit. Unpublished ms., available at: http://ling.uni-konstanz.de/
pages/home/a20_11/waelchli/waelchli-desctyp.pdf.
Wälchli, Bernhard. 2009. Motion events in parallel texts. A study in primary-
data typology. Unpublished Habilitationsschrift, University of Bern.
Wegener, Claudia. 2008. A grammar of Savosavo, a Papuan language of the
Solomon Islands. Ph.D. Dissertation, MPI for Psycholinguistics, Radboud
Universiteit Nijmegen.
Chapter 5
“Words” in Kharia – Phonological, morpho-syntactic
and “orthographical” aspects∗
John Peterson
1. Introduction
The present study deals with the issue of “words” in the South Munda lan-
guage Kharia, spoken in eastern-central India, from a phonological, morpho-
syntactic and “orthographical” perspective. As Dixon and Aikhenvald (2002:
2) note, not all languages have a concept with the same meaning as the En-
glish concept word; rather, this is an empirical issue which poses a challenge
to all documentary linguists, especially (but certainly not only) if the lan-
guage being documented is currently in the process of developing a standard
written form. The main issue here is that a number of typical characteristics of
“words” from different descriptive levels, such as phonology, morpho-syntax
but also orthography (to the extent that an orthography exists), may or may
not converge upon one single “middle-sized transcription unit, delimited by
empty spaces, which represents a basic unit in terms of meaning, grammatical
function, or sound structure” (Himmelmann 2006: 253). Hence, determining
just what is to appear between empty spaces is one of the most fundamental
issues facing any descriptive linguist, as this issue quite literally surfaces with
each and every “word” which is to be transcribed and can also have further
consequences with respect to a description of the language. It can also turn
out to be an issue where the intuition of the linguist differs considerably from
that (or those) of native speakers, who will typically apply different criteria
in determining what to place between empty spaces.
∗
The present study is based on the results of eight months of field work conducted during
five trips to Jharkhand, India. I would like to express my gratitude to the German Research
Council (Deutsche Forschungsgemeinschaft) for two generous grants which made two of
these trips possible (PE 872/1-1, 2). Furthermore, many thanks to Utz Maas for comments
on an earlier version of this study, although I alone am of course responsible for any errors
and oversights.
As will be shown in the present study, this is also largely true of Kharia:
While the concept sabda with (roughly) the same meaning as English word is
found in Kharia, this term has been borrowed from neighboring Indo-Aryan
languages, cf. Sadri sabad and Hindi śabda. The latter both mean approxi-
mately the same thing as English word, although even here there are many
unclear cases where the different criteria do not converge upon a single unit,
as we shall see in the following pages. On the other hand, the closest native
term in Kharia is kayom, which is perhaps best translated as ‘speech; speak;
spoken; matter’. Thus this element primarily refers to the act of speaking in
general and can refer to entire utterances, but it can also refer to ‘matters’ or
‘affairs’. It does not, however, refer to the same unit as word, regardless of
how this is defined, and to my knowledge it is never used in reference to a
written unit.
This study is structured as follows: Section 2 presents a general overview
of Kharia. This is followed by a discussion of the phonological word in Sec-
tion 3 and the morpho-syntactic word in Section 4. As these topics have been
dealt with elsewhere in greater detail (Peterson 2011; Peterson and Maas
2009), these discussions will be rather brief. Section 5 then presents the re-
sults of a preliminary investigation into the topic of the written word and
speakers’ / writers’ intuitions as to how to divide units into words when writ-
ing. Section 6 briefly summarizes these results and discusses perspectives for
future research.
2. An overview of Kharia
Kharia belongs to the southern branch of the Munda family and is primarily
spoken in eastern-central India. According to Lewis (2009), it was spoken in
1997 by 292,000 people in India and by 293,580 people in all countries.
Kharia is a largely agglutinating language in the sense that one morph
generally corresponds to one morpheme. However, it is not agglutinating in
the sense that these morphemes are affixes – in fact, Kharia only possesses
very few affixes, as virtually all grammatical markers are enclitic. This topic
will be dealt with in more detail in Sections 3 and 4 below.
The predicate is generally clause-final, although not rigidly so, and the
order of clause-level units (the syntagmas, see below) is entirely “free” in the
sense that any ordering of these elements to express a particular pragmatic
status is grammatical. On the other hand, the order within the syntagmas,
discussed in some detail in Sections 2.1 and 2.2, is fixed, and the syntagmas
always form contiguous units.
Structurally speaking, there is no compelling evidence for assuming the
presence of parts-of-speech such as nouns, adjectives and verbs in Kharia.
Not only may virtually any contentive morpheme be freely used in attribu-
tive, referential and predicative function, but even “phrasal” units, i.e., units
resembling NPs in English and other languages, can be used in all three of
these functions. As such, I will not refer to “nouns”, “verbs” or “adjectives”
in Kharia. Rather, there are two types of (non-endocentric) “phrases” which
may freely fulfill any one of these three functions, one ending with case marking,
the other with TAM/PERSON-marking; I will therefore refer to these as
the CASE- and TAM/PERSON-syntagmas, respectively. This is a purely structural
term and should not be taken as referring to any particular discourse
functions. Both syntagmas have the same underlying structure:
– A “contentive” or “semantic” component, which may consist of a single
morpheme but which may also be quite complex. It may also contain marking
for the genitive, for inalienable possession, and for number;
– A functional or grammatical component, which may either consist of case
marking or of TAM (tense-aspect-mood) marking and marking for per-
son/number/honorific status.
This is shown schematically in Figure 1.
        COMMON STRUCTURE OF CASE- AND TAM/PERSON-SYNTAGMAS
        [ SEMANTIC HEAD ]   [ FUNCTIONAL HEAD ]
Figure 1. Common structure of TAM/PERSON- and CASE-syntagmas in Kharia
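The two-part structure in Figure 1 can also be stated as a small data structure. The following Python sketch is only an illustration: the class name `Syntagma`, its fields and the `render` method are hypothetical, and the two sample forms are taken from examples (15) and (19) later in this chapter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syntagma:
    """Hypothetical model of Figure 1: a semantic head plus a functional head."""
    semantic_head: List[str]    # contentive material; may itself be complex
    functional_head: List[str]  # enclitic grammatical markers, written without "="

    def render(self) -> str:
        # The functional markers attach with "=" to the last element
        # of the semantic head, whatever that element is.
        return " ".join(self.semantic_head) + "".join("=" + m for m in self.functional_head)


# CASE-syntagma, cf. example (15): [iñ=aP boP]=te 'at my place'
case_syntagma = Syntagma(["iñ=aP", "boP"], ["te"])
# TAM/PERSON-syntagma, cf. example (19): kayom=ta=ñ 'I speak'
tam_person_syntagma = Syntagma(["kayom"], ["ta", "ñ"])
```

Both syntagma types share the same class; only the content of the functional head differs, which is the point made by Figure 1.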
2.1. The TAM/PERSON-syntagma

Figure 2 provides a simplified overview of the most common structure of the
non-negated TAM/PERSON-syntagma in Kharia. The functional head begins
with the “V2s”, which will be discussed further below. The form of the se-
mantic head given here (ending with the causative infix <CAUS>) is the most
common form, although not the only possible one, as we shall see in Sections
3 and 4. Note that the causative is realized either as a prefix (with mono-
syllabic roots), an infix (with polysyllabic roots) or both simultaneously (in
double causatives).
REC CAUS-Lexeme-<CAUS> (V2) (=PRF)=TAM/VOICE=PERS/NUM/HON
Figure 2. Simplified structure of the non-negated TAM/PERSON-syntagma in Kharia
The Kharia predicative system shows an active-middle voice distinction in
most TAM-categories (cf. Table 1). Note that, with the exception of the optative,
which is a phonological and morpho-syntactic word, all of these markers
are enclitic.
Table 1. Markers for TAM and basic voice
                              Middle     Active
Present (PRS)                 =ta        =te
Present Progressive (PROG)    =taPj      =tePj
Past (PST)                    =ki        =oP
Irrealis (IRR)                =na        =e
Perfect (PRF)                     =siP(ã)
Optative (OPT)                  guãuP / guóuP
Table 2 provides an overview of the enclitic subject markers (A / S). Note that
Kharia has three numbers, singular, dual and plural, in both syntagmas, and
an inclusive/exclusive distinction in the non-singular first persons. The dual is
also used as an honorific marker. With the exception of the 3rd person plural
form =may, which is found only in the TAM/PERSON-syntagma, the markers
of the 3rd persons, ø or “zero” ‘SG’, =kiyar ‘DU’ and =ki ‘PL’, are found in
both the TAM/PERSON- and CASE-syntagmas. These are, properly speaking,
number markers and will be glossed accordingly.
Table 2. Enclitic markers of person, number and honorific status
Person   Singular    Dual/HON                Plural
                     Inclusive  Exclusive    Inclusive  Exclusive
1        =iñ / =ñ    =naN       =jar         =niN       =le
2        =em / =m    =bar                    =pe
3        –           =kiyar                  =ki / =may
The enclitic markers for first and second persons may be considered phono-
logically reduced forms of the corresponding free-standing forms, given in
Table 3. The forms shown in Table 2 are the only enclitics in the language
which may be considered “reduced forms”.
Table 3. Free-standing proforms
Person   Singular    Dual/HON                Plural
                     Inclusive  Exclusive    Inclusive  Exclusive
1        iñ          anaN       iñjar        aniN       ele
2        am          ambar                   ampe
The “V2s” referred to in Figure 2 are a class of grammatical markers deriving
from contentive morphemes (with which many are still homophonous)
which express Aktionsart in its broadest sense, as well as the passive voice.1
Some of these, such as the “culminatory telic” marker goPã ‘C:TEL’ and the
“anticipatory telic” marker ãoPã ‘A:TEL’, combine Aktionsart in the usual
sense with narrative structure, with the former typically marking the end of a
narrative sequence and the latter denoting that another event is to follow, as
shown in (1).
(1) de    babu,  amoPã      gujuN=na           roP  ãaP=ki    uPã
    well  boy    wash.face  wash.feet=MID.IRR  FOC  water=PL  drink
    ãoó=e=m            tay   dukham sukham=na=pe.
    A:TEL=ACT.IRR=2SG  then  chat=MID.IRR=2PL
    ‘Well then, boy, wash your face and feet. [Then] you will drink water
    (etc., or whatever, = ‘PL’) [and] then you will all have a chat.’ (Kerkeúúā
    1990: 26)
2.2. The CASE-syntagma

Figure 3 provides a schematic overview of the most common structure of
the CASE-syntagma in Kharia. Note that with the exception of case marking,
all other categories are optional, as long as at least one of these categories is
present. Also note that markers for the genitive, inalienable possession (POSS)
and number/honorific status (NUM/HON) belong to the semantic base, not the
functional head.2
1. TAM/PERSON-syntagmas containing the passive V2 ãom are also always marked for the
middle voice.
(GEN-ATTR) (DEM) (QUANT(=CLASS)) (GEN-ATTR) (LEXEME(S))
(=POSS) (=NUM/HON) (GEN) =CASE
Figure 3. A schematic overview of the CASE-syntagma
As with the TAM/PERSON-syntagma, there are three numbers, singular (unmarked),
dual (=kiyar) and plural (=ki), and here as well the dual also serves
as an honorific marker. There are three cases:
– Direct (unmarked): for subjects (A / S), non-definite countable objects
and non-countable objects (O);
– Oblique (=te): for definite countable objects (O), “indirect objects” (recipients,
goals), and various adjuncts (especially locatives and temporals);
– Genitive (=(y)aP): while properly speaking not a case, as it does not
mark the relation to the predicate, the genitive is usually considered a case.
In Kharia it is not a “nominal” or “adnominal” marker, as it is found in the
semantic head in both the CASE- and TAM/PERSON-syntagmas. Rather, it
incorporates one potential semantic head into a larger semantic head.
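The distribution of the direct and oblique cases listed above can be restated as a simple decision procedure. The Python sketch below is an illustration only: the role labels, function names and the treatment of all locative and temporal adjuncts as oblique are simplifying assumptions made here, and the genitive is left out because it does not mark the relation to the predicate.

```python
def select_case(role: str, definite: bool = False, countable: bool = True) -> str:
    """Return 'direct' or 'oblique' for a hypothetical role label."""
    if role == "subject":                      # A / S
        return "direct"
    if role == "object":                       # O
        # Only definite countable objects take the oblique.
        return "oblique" if definite and countable else "direct"
    if role in ("recipient", "goal", "locative", "temporal"):
        return "oblique"
    raise ValueError(f"role not covered by this sketch: {role}")


# The direct case is unmarked; the oblique is the enclitic =te.
CASE_ENCLITIC = {"direct": "", "oblique": "=te"}


def mark_case(semantic_head: str, case: str) -> str:
    """Attach the case enclitic to the end of the semantic head."""
    return semantic_head + CASE_ENCLITIC[case]
```

For instance, `mark_case("oP", select_case("locative"))` yields `oP=te`, the form of ‘house’ found in examples (8)–(10) in Section 4.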
Having given a general overview of the major structures of the language, we
now turn to the issue of “words” in Kharia, beginning in the next section with
a phonological perspective.
3. Phonological words3
From a phonological perspective, the Kharia lexicon can be divided into two
main groups:
2. As this topic is not directly related to the issue of “words” in Kharia, we will not deal with
it in any detail here. The interested reader is referred to the discussion in Peterson (2011,
Chapter 4).
3. As the phonological word has been dealt with in detail elsewhere (especially in Peterson
2011, Section 2.5; Peterson and Maas 2009), we will only deal with this topic briefly here.
The interested reader is referred to these two works for further discussion.
The data in this section were analyzed with the freeware program “Praat”, developed by
Paul Boersma and David Weenink, available under http://www.praat.org/
– The first group, which makes up the large majority of all lexical entries,
corresponds quite closely to the class of contentive lexemes (or “lexical mor-
phemes”) such as man, woman, run, sit, etc. It also contains the much smaller,
closed class of proforms / deictics (I, you, today, tomorrow, etc.). A small
number of postpositions also belong to this class, presumably those which
have only recently derived from contentive morphemes.
Although there is no distinctive tone or accent in the language, these ele-
ments all show a “low → high” (LH) pitch pattern: The final syllable of the
morpheme has a high tone, while the first syllable either has a low tone or
immediately drops to a low tone before rising. This can be seen in Figure 4
below, which shows the pitch contour of bunui ‘pig’.
Figure 4. bunui ‘pig’
As these elements make up the vast majority of lexical entries and they can all
stand alone in the syntax, at least if they are bisyllabic (see further below), it
seems clear that our definition of the phonological word will have to include
this class of elements, and the LH pitch pattern will have to be considered
one of this unit’s defining characteristics.
– However, the LH-pattern does not hold for all units in the lexicon: A rather
large but limited number of elements do not show this pattern – they may
have a falling or rising pitch in a particular utterance, depending on a number
of contributing factors which have not yet all been identified, but they do not
have the inherent LH pitch pattern of the elements just mentioned. This is
illustrated in Figure 5 below for baje=ki [resound=MID.PST] ‘it resounded’,
where the marker for tense and middle voice, =ki, does not continue the LH-pattern
but also does not start a new LH-pattern.
[Pitch contour of baje=ki; axes: Pitch (Hz, 50–250), Time (s, 0–0.6986)]
Figure 5. baje=ki [resound=MID.PST] ‘it resounded’
This second group will be referred to here as the “phonological clitics”, as
their “phonological form is deficient in that it lacks prosodic structure at the
level of the (Prosodic) Word” (Anderson 2005: 23).
The LH pitch pattern discussed above fits in well with the fact that almost
all phonological words in Kharia are bisyllabic. As discussed at length in
Peterson and Maas (2009), this bisyllabic pattern derives from an earlier bi-
moraic constraint in many Austro-Asiatic languages (cf. Anderson and Zide
2002) requiring free-standing forms in referential function (“nouns” in their
terminology) to be bisyllabic, as opposed to their generally monosyllabic
bound form, found, e.g., in noun incorporation. Although the details need
not concern us here (cf. Peterson and Maas 2009, for details), this original
bimoraic constraint of earlier Austro-Asiatic has led to the emergence of a
new grammatical category in Kharia, the masdar. All stems in the semantic
head of the CASE-syntagma must appear in their masdar form, i.e., with very
few exceptions, all non-borrowed stems (= underlying lexeme + causative
marking, if present) appearing anywhere within the CASE-syntagma must be
at least bisyllabic; if they are monosyllabic, they must reduplicate.4 Table 4
provides a few examples:
Table 4. Examples for the formation of masdar forms
Underlying stem                      Masdar
juN ‘ask’                            juN-juN
ob-juN [CAUS-ask] ‘have s.o. ask’    ob-juN
kayom ‘tell, narrate’                kayom
yo ‘see’                             yo-yo
ob-yo [CAUS-see] ‘show’              ob-yo
4. With the exception of col ‘go; move’, borrowed from Kharia’s Indo-Aryan neighbor Sadri,
no borrowed lexemes reduplicate in this environment in Kharia; col has been strongly
affected by analogy with the native lexeme ãel ‘come’ in this and other phonological
environments, cf. Peterson and Maas (2009: 232–233).
This does not hold for TAM/PERSON-syntagmas; here, no form is required to
reduplicate, although the masdar can also be found here in the so-called “generic
middle voice”, which has a number of semantic functions, including habituality
or a remote past or future.
(2) iñ   ãaP    biúh=oPj.5                  Unmarked construction
    1SG  water  pour.out=ACT.PST.1SG
    ‘I poured the water out.’
(3) iñ   ãaP    biPã-biPã=ki=ñ.             Marked construction
    1SG  water  pour.out-RDP=MID.PST=1SG    (= generic middle)
    ‘I poured the water out over and over (e.g., that was my job, so I did it
    constantly).’
5. All consonants are devoiced and aspirated before the active, past tense marker =oP, so that
biPã ‘pour out’ is realized as biúh before =oP.
As the preferentially bisyllabic units of the first group above, with the inherent
LH pitch pattern and which may stand alone in the syntax, may be
followed by one or more members of the second group, the phonological
enclitics which typically express case or TAM-categories, we can define a
phonological word in Kharia as follows, from Peterson and Maas (2009: 221):
“Phonological word” – A preferentially bisyllabic unit beginning with a low
or rising tone and which continues either until a pause or until the next unit
showing this pattern
This also means that a phonological word typically ends in an enclitic expressing
grammatical information or, put slightly differently, that grammatical
information is typically found at the end of the phonological word. Thus,
the reciprocal marker kol (cf. Figure 2), which according to my data is a
phonological word with the characteristic LH pitch pattern, is unique among
the grammatical markers in that it is both a separate phonological word and
appears before a contentive morpheme instead of after this unit. This will
be of importance again in Section 5 when discussing the written word and
speakers’ intuitions.
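Returning briefly to the masdar requirement illustrated in Table 4, the rule can be stated as a short procedure. The following Python sketch is a rough approximation under stated assumptions: syllables are counted as groups of the vowel letters a, e, i, o, u in the transcription used in this chapter, whether a stem is borrowed must be supplied by the caller, and the function names are hypothetical. The single borrowed stem that does reduplicate, col, is listed separately, following footnote 4.

```python
import re

# Assumption: one syllable per maximal run of vowel letters in the transcription.
VOWEL_GROUP = re.compile(r"[aeiou]+")

# Footnote 4: col is the only borrowed lexeme that reduplicates.
REDUPLICATING_LOANS = {"col"}


def syllable_count(stem: str) -> int:
    """Approximate the number of syllables by counting vowel groups."""
    return len(VOWEL_GROUP.findall(stem))


def masdar(stem: str, borrowed: bool = False) -> str:
    """Return the masdar: monosyllabic (non-borrowed) stems reduplicate."""
    if borrowed and stem not in REDUPLICATING_LOANS:
        return stem
    if syllable_count(stem) >= 2:
        return stem                 # already at least bisyllabic
    return f"{stem}-{stem}"         # reduplicate the whole stem
```

Applied to the stems in Table 4, this returns juN-juN and yo-yo for the monosyllabic stems and leaves ob-juN, kayom and ob-yo unchanged.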
There are also a number of units whose status as phonological words
is either unclear or variable, depending on their environment in a particular
utterance. As space is limited, we will concentrate here only on those forms
which will be relevant in the discussion in Section 5.6
– Recall the V2s discussed in Section 2. Note that almost all of these V2s (18
out of the 21 that I am aware of) are monosyllabic and that Kharia strongly
disfavors monosyllabic words. As these units directly follow the semantic
base of a predicate, we generally have one of the following two situations:
• In the event of a monosyllabic V2 following a bisyllabic stem as in (4),
the V2 can be realized as a phonological word together with the remaining
TAM/PERSON-marking. However, although further research is necessary,
if the V2 follows a monosyllabic stem as in (5), the evidence suggests
that these two units (together with any following TAM/PERSON-marking)
can form a phonological word, as there is no (obligatory) low-tone
marking on the V2.
(4) ho=kaó       ãoko      goPã=ki.
    that=SG.HUM  sit.down  C:TEL=MID.PST
    ‘S/he sat down.’
(5) ho=kaó       lePj=bay=oP.
    that=SG.HUM  curse=EXCESS=ACT.PST
    ‘S/he gave [someone] a good scolding.’
• The V2 can also be separated from the preceding unit, whether a stem or another
V2, by one of the floating pragmatic clitics =ga ‘FOC’, =jo ‘ADD’
or =ko ‘CNTR’ (see Section 4), which always attach to the last element
of a syntagma, which also corresponds to the last unit in a phonological
word ((6), see also example (13)).
6. For further discussion, cf. Peterson and Maas (2009: 233–235).
(6) ho=kaó       lePj=ga    bay=oP.
    that=SG.HUM  curse=FOC  EXCESS=ACT.PST
    ‘S/he gave [someone] a good SCOLDING (i.e., not something else).’
Although much work is still necessary here, a comparison of (5) and
(6) suggests that the V2s, or at least the monosyllabic V2s, may or may
not be part of the same phonological word as the stem in a particular
predicate, depending on their environment.
– There is preliminary evidence to suggest that when two or more encli-
tics are present, these may – under conditions which are still not entirely
understood – combine to form phonological words in their own right so
that we may have phonological words consisting only of clitics, so-called
“clitic words” (cf. Aikhenvald 2002). Consider Figure 6, where the se-
quence (=ki)=te=ga shows the LH pattern typical of phonological words.
[Pitch contour of lebu=ki=te=ga; axes: Pitch (Hz, 50–250), Time (s, 0–0.9613)]
Figure 6. lebu=ki=te=ga [man=PL=OBL=FOC] ‘the men (FOCUSED OBJECT)’
Preliminary as this evidence may be, we will see in Section 5 that a se-
quence of more than one functional marker such as this can lead to uncer-
tainties when speakers are asked to write down an utterance.
– Finally, there are a number of other units, such as the demonstratives u
‘PROX’, ho ‘MED’ and hin ‘DIST’ or the coordinator ro ‘and’, whose status
as phonological words or proclitics (demonstratives) or enclitics (ro) seems
to depend on their environments in a particular utterance, although further
research is necessary to determine these exact environments; see Peterson
(2011, Section 2.5) for further discussion.
As we shall see in Section 5, the clitics and those units whose status as phono-
logical words is either indeterminate or dependent on the surrounding envi-
ronment – the V2s and unclear forms such as ro ‘and’ – are responsible for
the majority of uncertainties when speakers are asked to write something in
Kharia.
4. Morpho-syntactic words7
It was shown in Section 3 that a phonological word in Kharia typically con-
sists of a (generally bisyllabic) contentive morpheme and potentially one or
more enclitics. If there are two or more enclitics appearing together, there
is also preliminary evidence that these may combine to form a phonological
word.
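The working definition from Section 3 can be read procedurally. The following Python sketch is only an illustration under simplifying assumptions: each unit is tagged by hand for whether it carries the inherent LH pitch pattern, the input is taken to be one pause-free stretch, and the variable cases noted in Section 3 (V2s, clitic words, demonstratives, ro) are ignored. The function name and the tagged input format are hypothetical.

```python
from typing import List, Tuple


def phonological_words(units: List[Tuple[str, bool]]) -> List[str]:
    """Group (form, has_LH_pattern) pairs into phonological words.

    A unit with the LH pattern opens a new phonological word; units without
    it (phonological clitics) are attached to the word already in progress.
    """
    words: List[List[str]] = []
    for form, has_lh in units:
        if has_lh or not words:
            # An LH-bearing unit (or a clitic with no preceding host in
            # this stretch) starts a new group.
            words.append([form])
        else:
            words[-1].append(form)
    # Enclitics are written with "=" after their host, as in the examples.
    return ["=".join(word) for word in words]
```

Given the units of Figure 4 and Figure 5 placed side by side (an artificial sequence, not an attested utterance), `phonological_words([("bunui", True), ("baje", True), ("ki", False)])` groups them as bunui and baje=ki.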
The present section shows that those elements which behave as clitics
phonologically also behave morpho-syntactically as enclitics, as their “po-
sition with respect to the other elements of the phrase or clause follows a
distinct set of principles, separate from those of the independently motivated
syntax of free elements in the language” (Anderson 2005: 31). Although these
units all have somewhat differing distributions, they all have in common that
they are “phrasal affixes” in the sense that the elements to which they at-
tach are neither roots nor stems but rather possibly complex units in the syn-
tax. The following example demonstrates the type of “mismatch” between
phonology and morpho-syntax which is involved here. Although the exam-
ple is for the English enclitic copular form ’s, the principles involved are the
same as in Kharia ((7), from Sadock 1991: 50; “W” = word, “S” = sentence,
“Af” = affix).
(7) The man’s at the door.
    [W [N man] [Af ’s]]        [S [NP [Det the] [N man]] [VP [V ’s] [PP at the door]]]
7. For a more detailed discussion, cf. Peterson (2011, Chapter 3) and Peterson and Maas
(2009: 212–213).
In the left half of (7), we see that ’s in this example attaches to a phonological
word, as it requires a host. Syntactically, however, we see on the right-hand
side of (7) that ’s is a “word” in the syntax. As such, it – like its potential hosts
– is a “syntactic atom” (Di Sciullo and Williams 1987) or a “grammatical
word” (Dixon and Aikhenvald 2002).
As the following discussion will show, virtually all grammatical markers
in Kharia are morpho-syntactic clitics, attaching to other units at the level
of syntax. Thus, in this definition, phonological words in Kharia typically
consist of more than one morpho-syntactic word.
We begin with the three pragmatic markers discussed briefly in Section 3,
=jo ‘ADDitive focus’, =ko ‘CNTR’ or ‘contrastive focus’, and the general re-
strictive focus marker =ga ‘FOC’. These are referred to here as floating clitics
as they may freely appear anywhere in the clause, provided that they attach
as the last elements to what is otherwise already a syntactic unit (cf. (8)–(9)).
The only units they may never attach to are demonstratives when these are
followed by a contentive morpheme, as the demonstrative is proclitic in this
environment, cf. (10). The scope in (8) is ambiguous and can be interpreted
as referring either only to oP ‘house’ or to the entire CASE-syntagma.
(8)  ho    rusuN  oP=te=ga          ‘in that red house’;
     that  red    house=OBL=FOC     ‘that red house (OBJECT)’
(9)  ho    rusuN=ga  oP=te          ‘in that RED house’;
     that  red=FOC   house=OBL      ‘that RED house (OBJECT)’
(10) * ho=ga     rusuN  oP=te       ‘in THAT red house’;
       that=FOC  red    house=OBL   ‘THAT red house (OBJECT)’
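The pattern in (8)–(10) can be summarized as a small generator of permitted positions. This Python sketch is an illustration under assumptions: the syntagma is given as a list of already-marked units, the demonstratives are the three forms cited in Section 3 (u, ho, hin), and “followed by a contentive morpheme” is approximated as “followed by any further unit”. The function name is hypothetical.

```python
from typing import List

DEMONSTRATIVES = {"u", "ho", "hin"}


def floating_clitic_variants(units: List[str], clitic: str = "ga") -> List[str]:
    """List the strings obtained by attaching one floating clitic to each
    permitted host within the syntagma."""
    variants = []
    for i, unit in enumerate(units):
        # A demonstrative followed by further material is proclitic
        # and cannot host a floating clitic, cf. (10).
        if unit in DEMONSTRATIVES and i < len(units) - 1:
            continue
        marked = units[:i] + [f"{unit}={clitic}"] + units[i + 1:]
        variants.append(" ".join(marked))
    return variants
```

For `["ho", "rusuN", "oP=te"]` this produces the strings of (9) and (8), in that order, and omits the ungrammatical (10).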
These three markers may also be combined with one another in varying orders
to yield subtle pragmatic nuances which still require further study.
(11) am=te=jo=ko       /  am=te=ko=ga       /
     2SG=OBL=ADD=CNTR     2SG=OBL=CNTR=FOC
     am=te=ga=ko       /  am=te=ga=jo       yo=yoPj.
     2SG=OBL=FOC=CNTR     2SG=OBL=FOC=ADD   see=ACT.PST.1SG
     ‘I saw YOU (as well).’
Similar data obtain for the V2s, whose status was shown in Section 3 to be
somewhat ambiguous with respect to the phonological word. (12) shows that
the floating clitics may not only have scope over the entire TAM/PERSON-syntagma,
they may also intervene between the stem (or the last element
of the semantic head in general, regardless of its status) and the V2. (13)
shows that these floating clitics may also appear between two V2s. This fur-
ther demonstrates that the V2s, despite their ambiguous phonological status,
are “syntactic atoms”.
(12) karay  goPã=te=ga         /  karay=ga  goPã=te
     do     C:TEL=ACT.PRS=FOC     do=FOC    C:TEL=ACT.PRS
     ‘s/he does the job’
(13) kaPbúo  saNgoPã  ãom=ga       may=ki.
     door    close    V2:PASS=FOC  V2:TOTAL=MID.PST
     ‘The door was shut entirely.’
With the exception of the causative marker, which is clearly affixal in nature,
appearing as a prefix with monosyllabic roots but as an infix with polysyllabic
roots, virtually all other grammatical marking in Kharia is also enclitic and
not suffixal. For reasons of space, only a few of these can be illustrated here;
however, similar comments hold for all other grammatical markers as well (cf.
Peterson (2011, Section 3.1) for further discussion).
As (14) shows, the genitive marker is a morpho-syntactic enclitic, as it
refers to the entire semantic base. In this respect, it behaves very similarly to
the English ’s.
(14) laP   [u    sembho  ro   ãakay  rani=kiyar]=aP  nãw=jan
     then  this  Sembho  and  Dakay  queen=HON=GEN   nine=CLASS
     bePú=ãom=kiyar  aw=ki=kiyar.
     son=3POSS=HON   QUAL=MID.PST=HON
     ‘And this Sembho and queen Dakay had nine sons (HON)’ (= ‘[This
     Sembho and Queen Dakay]’s nine sons were’).
Similar data hold for the oblique marker =te (15) and number marking (16)–
(17); these examples show that, if the semantic head is omitted (e.g., as its
identity is already known or is considered unimportant), the respective marker
simply attaches to the last element of the semantic base, regardless of its
status.
(15) [iñ=aP    boP]=te     ‘at my place’, or simply
     1SG=GEN   place=OBL
     [iñ=aP]=te            ‘at my [place]’
     1SG=GEN=OBL
(16) [munuPsiN  rochob=aP  lebu]=ki=ko
     east       side=GEN   person=PL=CNTR
     ‘the people of the east; the easterners’
(17) [munuPsiN  rochob=aP]=ki=ko
     east       side=GEN=PL=CNTR
     ‘the easterners; the ones of the east’ [Pinnow 1965: 120, line 29]
The same holds true of other grammatical markers, such as inalienable pos-
session (in addition to number and the genitive, (18)).
(18) [ayo    aba     ro   boker           kulam]=ãom=ki=yaP     kaúa
     mother  father  and  brother.in.law  brother=3POSS=PL=GEN  foot
     ‘his mother, father, brothers-in-law and brother’s feet’ [RD, 2: 113]
Similarly, markers for person in the TAM/PERSON-syntagma do not behave
as affixes to a stem but rather mark the entire TAM/PERSON-syntagma. They
can also appear at different slots in this syntagma:
(19) kayom=ta=ñ          um=iñ    kayom=ta
     speak=MID.PRS=1SG   NEG=1SG  speak=MID.PRS
     ‘I speak’           ‘I do not speak’
(20)–(21) show that TAM/BASIC VOICE markers also attach to the last element
of the semantic head, regardless of its status and whether it is a simple
lexeme or a complex “phrase”. Finally, (22)–(24) show that in a sequence of
predicates marked for the same categories, the enclitic marking on the first
element may optionally be omitted (cf. the presence / absence of =e and =ki
on the first element in (23)–(24)), resulting in what is often referred to as
“nuclear cosubordination” (cf. Van Valin 2005).
(20) [ho    rochoPb]=ki=ñ.
     that   side=MID.PST=1SG
     ‘I moved to that side.’
(21) ho=kaó       [iñ=aP    natgot]=ki.
     that=SG.HUM  1SG=GEN   family.relationship=MID.PST
     ‘S/he became a member of my family.’
(22) ñog=e=ki        ro   uã=e=ki
     eat=ACT.IRR=PL  and  drink=ACT.IRR=PL
     ‘they will eat and they will drink’
(23) [ñog=e       (ro)  uã]=e=ki
     eat=ACT.IRR  and   drink=ACT.IRR=PL
     ‘they will eat and drink’
(24) [ñoP 8  (ro)  uã]=e=ki
     eat     and   drink=ACT.IRR=PL
     ‘they will eat and drink’
As the preceding discussion has shown, there is a rather large number of
morpho-syntactic clitics in Kharia. This group includes all of those elements
shown to be phonological clitics in Section 3 above, but also all of those
elements whose status as phonological clitics / words is either uncertain or
depends on the surrounding environment. We now turn our attention to the
written word and the intuition of native speakers.
5. Speakers’ intuitions – the written word

I conducted a brief experiment with six speakers, all of whom are literate,
to determine their views on the concept of the “word” in Kharia. For most
literate people the notion of “word” is essentially that of the written word
in a language where this aspect of the orthography has been standardized. It
therefore seems reasonable to assume that, for those who are literate in one
language and who wish to write something in another language, one in which
this aspect has not yet been standardized, this will be a unit which is felt
to be a “word” in an intuitive sense and which should therefore be written
separately. This purely heuristic experiment can thus give us our first insight
into speakers’ intuitions with respect to the “word”, as this aspect of Kharia
orthography has not yet been standardized.
8. Underlying /g/ in the coda is obligatorily realized as [P] in the native lexicon, hence /ñog/
‘eat’ is realized as [ñoP] when not followed by a vowel, and is written as such; speakers
/ writers of Kharia will not accept the use of the grapheme <g> in this environment and
insist on using a special symbol for the glottal stop here.
These six speakers were quite accustomed to reading and writing Hindi,
and were often familiar with English orthography as well, so that I assumed
they would already have some intuitive notion of the written word, at least
for Hindi and English. The situation in Kharia, however, differs considerably
from that of Hindi and English: Both Hindi and English have standardized
orthographies in which not only the individual orthographical characters of
the various morphemes have been standardized but also (at least to a large
extent) the notion of the orthographical word, but this is not true of Kharia.
Kharia is only rarely written: Although most Kharia in southwestern Jhark-
hand, where the speakers I worked with all came from, are literate, educa-
tion is generally in Hindi, and increasingly in English, and since virtually
all Kharia speak Hindi, Hindi is generally used when the need arises to put
something down in writing. When Kharia is written, the Devanagari script
is virtually always used, which is also used for Hindi and a large number of
other, mostly Indo-Aryan languages.9 While there are a number of issues of
dispute with respect to the “correct” symbols to denote certain sounds which
are not found in Hindi, above all the glottal stop and the pre-glottalized con-
sonants, these issues can be considered minor, as in all systems suggested
so far, there is a unique means of designating these segments (cf. Peterson
(2011, Section 1.5) for details).
In Devanagari as it is used to write Kharia and many other modern lan-
guages, vowels are written as independent characters only at the beginning
of a written word. Otherwise, the vowel is viewed as a diacritic to be added
to the consonant or consonant cluster which precedes it. [ʌ], [ə] or short [ɑ],
all of which are allophones of /a/, are considered to be an inherent part of the
consonant and are therefore not indicated (the so-called “inherent a”), while
other vowels must be added to this consonant. The following provides a few
simple examples of the principles involved.
(25) No preceding consonant:  <a> अ    <i> इ    <u> उ
     Preceding consonant:     <ta> त   <ti> ति   <tu> तु
9. There is a movement to introduce an orthography based on the Roman alphabet; however,
this movement is quite small and can be considered marginal.
There are also a number of so-called “conjuncts” in this system, i.e.,
combinations of consonants; as the basic symbols for consonants are in fact not
consonants but rather the combination of a consonant plus the “inherent a”, this
vowel must be removed in order to depict consonant clusters, often resulting
in symbols which differ considerably from the basic forms of the symbols
involved. To give one simple example, the consonant cluster transliterated here
as <dr> is written as द्र, which bears a strong resemblance to <d> (द), but
virtually none to <r> (र). This combination of one or more consonants plus
a vowel is referred to as an akṣar, which plays a central role in orthographic
systems based on Devanagari. We will return to the issue of the Devanagari
writing system and its possible influence on the data for the written word in
Section 6.
Kharia orthography has only been “normalized” to the extent that the in-
dividual morphemes have been given a more-or-less fixed form in writing,
whereas the issue of orthographical words has, to my knowledge, not been
dealt with at all. As such, every time a Kharia speaker wishes to write
something in Kharia, s/he must spontaneously decide when to separate these
elements and when to write them together. The result is that writers not only
vary greatly from one another in what they depict as a “word”; each writer
also varies considerably even within shorter texts.
To be sure, this experiment cannot be viewed as anything more than a
heuristic means of approaching the complex issue of the written “word”. To
begin with, the written language is not the spoken language: Here, factors
come into play which are irrelevant to the spoken language, such as whether
the written word appears “too short” to be a separate word, or for that matter
“too long”. Nevertheless, this method gives us a valuable first insight into the
intuition of native speakers with respect to this concept in Kharia on which
we can later build.
The method used in this experiment is the following: In each case I read
aloud in the group a particular unit of language in Kharia (syntagma, clause or
complex sentence) to check for its grammaticality. All examples were based
either on examples occurring in the texts I had collected from these same
speakers or topics which we had discussed on other occasions. The examples
were all chosen to test for “borderline” cases, involving enclitics and postpo-
sitions, repetition of what are already phonological words to denote intensity,
distributivity or plurality, and reduplication, which derives bisyllabic phono-
logical words from monosyllabic stems, i.e., the masdar.
Once the grammaticality of the unit had been assured, each speaker was
in turn asked to repeat it. Then, all were asked to write what had been said.
Afterwards, each of the speakers was asked to count the “words” (sabda)
s/he had written. This was occasionally followed by a brief discussion of
the example with the respective speaker alone at a later date. The following
briefly summarizes the results (the examples are given here in my system of
transliteration):
(26) suru   goʈh=oʔ.        ‘S/he began [doing something].’
     begin  C:TEL=ACT.PST

     Number of words     Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
       First response:     3      3      3      2      2      1
       Then:               2

Discussion: Speakers 1–3 first considered each of the three elements to be
a single unit, although Speaker 1 then changed his mind. Interestingly, for
those speakers who saw three words in this unit, the last “word” was not oʔ
but rather ʈhoʔ, i.e., they divided the two units into separate words at the
syllable boundary, not at the morpheme boundary, a strategy which we will
encounter again below. Speakers 4 and 5 considered suru and goʈhoʔ each to
be a word, while Speaker 6 considered the entire unit to be a single word.
(27) suru=ga    goʈh=oʔ.        ‘S/he BEGAN [i.e., s/he didn’t stop].’
     begin=FOC  C:TEL=ACT.PST

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       2      4      2      2      2      2

Discussion: All speakers except Speaker 2 considered suru=ga and goʈh=oʔ
to be words, while Speaker 2 considered each individual morpheme to be a
word, divided up by syllable boundaries (i.e., suru, ga, go, ʈhoʔ), thereby
following the same strategy as in (26).
Note that Speakers 1 and 3, who considered ʈhoʔ to be the last word
in (26) (at least initially), here both consider it a part of the word goʈhoʔ.
This is undoubtedly due at least in part to the fact that suru is marked as a
separate “word” by the floating clitic =ga, as there is otherwise no difference
between this and the previous example. Recall that =ga, together with the
other two floating clitics =jo ‘ADD’ and =ko ‘CNTR’, always attaches to the
last element of a phrase and with that is a natural candidate for the end of
a “word”. Speakers 1 and 3 have apparently written goʈhoʔ together here
as a result of this, perhaps because words such as go and ʈhoʔ, which these
speakers at least initially preferred in (26), appeared “too short” to be written
separately after suruga. But whatever the exact reason, the status of a unit
as a “word” clearly depends to some extent on the environment in which it is
found in a particular utterance, as there is otherwise no way to account for the
different (initial) analyses of goʈhoʔ for Speakers 1 and 3 in (26) and (27).
Similarly, Speaker 6 also analyzes this unit as two words, unlike in (26),
where she saw only one word. Again, the presence of =ga in suruga would
seem to account for this, as the presence of =ga makes this a likely word
boundary, so that the remainder (goʈhoʔ) is also a word.
(28) lebu=ki=te=ga     ‘the men’ (PLURAL, OBJECT, FOCUSED)
     man=PL=OBL=FOC

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       2      4      2      2      1      1

Discussion: Here, Speaker 2 has once again considered each unit a word,
while the other speakers have either analyzed it as lebu and ki=te=ga
(Speakers 1, 3 and 4), i.e., with a clitic word (cf. Figure 6), or the entire unit as a
single word (Speakers 5 and 6). Again, Speaker 2 is consistent in his approach
with the first two examples.
(29) ho=kaɽ=jo        ‘him/her, too’
     that=SG.HUM=ADD

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       1      2      2      2      2      1

Discussion: While some speakers (2, 3, 4 and 5) considered this to consist of
the words ho=kaɽ and jo, for the other two (1 and 6) the entire unit consists
of a single word.
(30) i=ye=niŋ?              ‘What shall we do?’
     what=ACT.IRR=1PL.INCL

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       2      3      2      2      2      1

Discussion: For those speakers who considered this example to consist of two
words, these were i and ye=niŋ. Again, Speaker 2 has analyzed each component
as a separate word while, again, Speaker 6 views the entire
TAM/PERSON-syntagma as a single word.
(31) kahani  lebu=ki    khoɽi=ki=te             kayom=ta=ki.
     story   person=PL  village.section=PL=OBL  speak=MID.PRS=PL
     ‘The people tell [this] story in the villages.’

     Number of words     Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
       First response:     5      4      5      5      5      4
       Then:                                    4

Analysis as four words: kahani, lebu=ki, khoɽi=ki=te, kayom=ta=ki
Analysis as five words: kahani, lebu=ki, khoɽi, ki=te, kayom=ta=ki
Discussion: Interestingly, Speaker 2, who otherwise tends to view each
morpheme as a separate word, is among those who see in this clause only four
words. It would seem that with increasing length of the unit to be written, the
tendency to write clitics as separate words decreases, a tendency which we
will also meet with again in the following examples; for example, none of the
speakers analyzed lebuki or kayomtaki as consisting of more than one word.
(32) khaɽiya  buŋ   “khajar”  gam=te=ki.
     Kharia   INST  deer      say=ACT.PRS=PL
     ‘In Kharia they say khajar [for ‘deer’].’

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       4      4      4      4      3      3

Discussion: Two of the speakers considered buŋ, which I consider a
phonological and morpho-syntactic word, to be part of the same word as khaɽiya.
Otherwise, the four remaining speakers analyzed the clause as I do. Note
however that none view the grammatical marking on the TAM/PERSON-syntagma
in this example as a separate word from the semantic base, as was also
the case in (31).
We now turn to a slightly longer clause. Here juda juda is the repetition of
a phonological and morpho-syntactic word to highlight the fact that all went
out separately. Thus, from both a phonological and a syntactic perspective,
these are two separate words.
(33) oʔ=te      gomke   ro   dhaŋgar=ki=yaʔ  ghaɖ  juda      juda  peʔ
     house=OBL  master  and  servant=PL=GEN  for   separate  REP   rice
     goŋ=na    laʔ=ki=may.
     cook=INF  IPFV=MID.PST=3PL
     ‘At home, they cooked rice separately for the master and the servants.’

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words      10     10     10      9      7      8

Discussion: Speakers 1 and 3 saw in this sentence ten words each, both as I
have analyzed the sentence. Speaker 2, on the other hand, analyzed dhaŋgar
‘servant’ as one word but considered ki=yaʔ=ghaɖ a single word.
Speaker 4 similarly analyzed this as dhaŋgar and ki=yaʔ=ghaɖ, but also
considered juda juda a single unit. In a similar vein, Speaker 6 analyzed the
sentence exactly the same as Speaker 4, except that for this speaker, the
second word was gomke=ro instead of gomke and ro as separate words. Finally,
Speaker 5 analyzed the sentence the same way as Speaker 6, but for her the
third word was dhaŋgar=ki=yaʔ=ghaɖ ‘for the servants’.
Note that once again the speakers / writers clearly tend to write more units
together as the length of the unit to be written increases, although to differing
extents for each individual speaker / writer.
Whereas juda juda in (33) is a case of repetition and in my analysis
consists of two phonological words, ol-ol in (34) is grammatically conditioned –
this is a case of reduplication to derive a phonological word from a
monosyllabic stem, i.e., the masdar.
(34) ol-ol      benel  bel=oʔ.
     bring-RDP  sheet  spread=ACT.PST
     ‘S/he spread out the [bed]sheet which s/he had brought.’

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       4      3      3      3      3      3

Discussion: All of the speakers except Speaker 1 analyzed this sentence into
words as I have done here. Interestingly, however, Speaker 1 not only
analyzed ol-ol as two words, he did so by analyzing this unit as the two “words”
o and lol, thus equating the word-boundary with the syllable boundary,
although the form is transparent to modern speakers as the repetition of the
lexeme ol. This must also be true of Speaker 1, since otherwise he would not have
chosen to write this unit as two words. This is reminiscent of examples (26)
and (27), where those speakers who chose to write the clitic =oʔ separately
did so at the syllable boundary, not at the morpheme boundary.
The following example tests not only for repetition of the type juda juda
in (33), in this case moʔʈho moʔʈho ‘very fat’; it is also intended to see to what
extent speakers of Kharia analyze units such as kuda koloŋ ‘millet bread’ as
a compound, as is the traditional view of units of this type (cf., e.g., Malhotra
1982), or as two separate words, as I view this.10 The example has been
taken directly from one of the stories I collected, a children’s story.
(35) iɖib=te    moʔʈho  moʔʈho  kuda    koloŋ  ter=na
     night=OBL  fat     REP     millet  bread  give=INF
     laʔ=ki=may.
     IPFV=MID.PST=3PL
     ‘At night they used to give big fat millet breads [to the servants for
     them to eat].’

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       7      7      7      5      6      6

Discussion: Speakers 1, 2 and 3 analyzed the sentence as I have done, whereas
Speakers 5 and 6 viewed moʔʈho moʔʈho as a single word. On the other hand,
only Speaker 4 viewed kuda koloŋ as a single word, in addition to viewing
moʔʈho moʔʈho as a single unit. Also, once again the clitics in longer
utterances / texts such as this tend to be considered part of the same word as their
hosts more often than in shorter utterances / texts.
10. For a discussion of “compounds” in Kharia, cf. Peterson (2011, Section 4.6).
The following example was mainly intended to test the intuitive status of
the reciprocal marker kol as a separate word. According to my data, kol is a
phonological word which shows the typical LH pattern, at least when elicited
in isolation. However, the results in this respect clearly point in a different
direction, with 5 out of 6 speakers considering it part of the same word as the
lexical base.
(36) ho=kiyar  kol  bheʔʈo=ki=kiyar.   ‘They (DU) met each other.’
     that=DU   REC  meet=MID.PST=DU

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       4      4      3      3      3      3

The individual analyses:
Speaker 1: ho=kiyar, kol, bheʔʈo, ki=kiyar
Speaker 2: ho=kiyar, kol=bheʔʈo, ki, kiyar
Speakers 3, 4, 5 and 6: ho=kiyar, kol=bheʔʈo, ki=kiyar
Discussion: It is interesting that kol, which according to my data is a
phonological word, was generally written together with the following unit. In
Section 3 it was briefly noted that kol is unique in that it is the only
grammatical marker in the language which appears before the contentive or
semantic head, whereas all other grammatical markers (other than the affixal
causative marker, which is either a prefix or an infix) follow the contentive
morphemes. As such, the presence of grammatical marking is usually
considered the end of a “word”, and with that the right boundary of either a
CASE- or TAM/PERSON-syntagma. It is likely that most speakers / writers
prefer to integrate kol into the written word for this reason.11
Interestingly, in this example most speakers also consider the enclitics on
the predicate to form their own word, independently of the semantic base.
Perhaps this is motivated by the fact that the semantic base is polysyllabic
and that the clitics, with three syllables altogether, are capable of forming a
typical, polysyllabic phonological word, although this is rather speculative at
the moment. It could also result at least partially from the relative shortness
of the utterance.
Finally, the following example was intended to determine the extent to
which the V2s, in this case the excessive marker bay, are considered words
when the semantic base is also monosyllabic, as opposed to (26)–(27), where
the semantic base is bisyllabic. The results were unanimous.
11. I owe this insight to Utz Maas (p. c.).
(37) ho=kaɽ=te        lej    bay=oʔ.
     that=SG.HUM=OBL  scold  EXCESS=ACT.PST
     ‘[She] gave him a good scolding.’

                         Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
     Number of words       3      4      3      3      3      3

Discussion: Here, all speakers considered the excessive marker bay –
together with the following functional morpheme – to be a separate word from
lej, which is considered a separate word despite its monosyllabicity. In fact,
all analyzed this short sentence as in my analysis, with the exception of
Speaker 2 who, as so often, considered the enclitic marker =te ‘OBL’ a
separate word although, interestingly, not the enclitic marker =oʔ on the
predicate.
As was noted at the beginning of this section, this brief experiment is
merely a heuristic means of approaching the issue of what units correspond
to “words” – in a purely intuitive sense – for literate native speakers. Never-
theless, it also shows a number of clear tendencies:
– The clitics clearly cause the most uncertainty among speakers / writers.
Some tend to view them as separate units, most consistently Speaker 2,
while others tend to view them as part of the same written
word as their host.
– The shorter the unit that speakers / writers were asked to judge, the more
likely they were to write enclitics as separate words.12 As noted by Utz
Maas (p. c.), this may be due to the fact that the default expectation of
what is perceived to be a sentence is a unit which consists of more than
one word. If the sentence is short, the tendency to write identifiable units
separately, if possible, is greater than when the sentence is longer.
– Clitics which begin with a consonant tend to be written as separate words
more often than clitics which begin with a vowel.
12. This is especially true for Speaker 2, cf. examples (26)–(28), (30), (36) and (37).
– When clitics which begin with a vowel are viewed as separate units, they
are written together with the final consonant of the preceding morpheme,
e.g., goʈh ‘C:TEL’ and =oʔ ‘ACT.PST’ in (26)–(27) are reanalyzed and
written by some as go and ʈhoʔ, although neither of these two units is a
morpheme. Thus, the separation of these units into written words follows
the syllable boundary, not the morpheme boundary.
– This same tendency is also apparent in example (34), where one speaker
analyzed the derived form ol-ol ‘bring-RDP’ as two units but did so by
ignoring the morpheme boundary. Instead, he separated the two units at the
syllable boundary. This is noteworthy since he must have recognized this
reduplication as a process affecting the morpheme ol, otherwise he would
not have seen two words in this constituent. Nevertheless, when it came
to writing this unit and a decision had to be made, he chose the syllable
boundary over the morpheme boundary. This may be at least partially due
to the fact that the Devanagari writing system tends to write vowels as
diacritics to consonants or consonant clusters in the akṣar. However, as there
are also special vowel characters which are used to depict vowels at the
beginning of a written word, this requires further study.
– When multiple clitics occur, there is a tendency among some speakers to
write these units together as a separate clitic word, e.g., in examples (28)
and (31).
– The analysis of V2s as separate words from the semantic base is strongly
preferred, even when both the stem and the V2 are monosyllabic (cf. (26),
(27) and (37)).
– Finally, postpositions such as buŋ ‘INST’ and ghaɖ ‘for; PURP’ are
sometimes written together with the preceding unit, e.g., (32). Alternatively, if
the preceding unit contains one or more enclitics, these may form a written
word together with the following postposition, as in (33).
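The difference between splitting at the morpheme boundary and splitting at the syllable boundary, as described in the list above, can be illustrated with a minimal Python sketch. The function names, the five-letter vowel set, and the rule that all host-final consonant letters move to the following unit are assumptions made purely for illustration; the data only attest the two cases goʈh=oʔ (26)–(27) and ol-ol (34), here written in IPA.

```python
from typing import List

# Assumption: only these five letters count as vowels in the transliteration.
VOWELS = set("aeiou")


def split_at_morpheme_boundary(host: str, clitic: str) -> List[str]:
    """Write host and clitic as two units, cut exactly between the morphemes."""
    return [host, clitic]


def split_at_syllable_boundary(host: str, clitic: str) -> List[str]:
    """Write two units, but cut at the syllable boundary.

    If the second unit begins with a vowel, the consonant letters at the end
    of the host are written with the second unit (hypothetical rule that
    covers the attested cases only).
    """
    if not clitic or clitic[0] not in VOWELS:
        # Consonant-initial units: syllable and morpheme boundary coincide.
        return [host, clitic]
    cut = len(host)
    while cut > 0 and host[cut - 1] not in VOWELS:
        cut -= 1
    if cut == 0:
        # Host contains no vowel; leave the morpheme boundary untouched.
        return [host, clitic]
    return [host[:cut], host[cut:] + clitic]


if __name__ == "__main__":
    for host, clitic in [("goʈh", "oʔ"), ("ol", "ol"), ("suru", "ga")]:
        print(host, clitic,
              split_at_morpheme_boundary(host, clitic),
              split_at_syllable_boundary(host, clitic))
```

For goʈh plus =oʔ the syllable-based split yields the written units go and ʈhoʔ, and for ol-ol it yields o and lol, i.e., the forms some speakers actually wrote; a consonant-initial clitic such as =ga is unaffected.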
Despite the number of uncertainties involved in dividing these various units
into smaller units in writing, there are nevertheless clear tendencies and only
certain types of units cause the speakers / writers problems, above all the
clitics. Thus, the speakers are clearly trying to cope with the indeterminate
status of these units: Phonologically, they are part of the same “word” as the
unit to which they attach. As such, they should be written as part of this larger
unit. On the other hand, they are morpho-syntactic words, which should be
written separately. In the end, there seems to be a general preference to write
phonological words as single units, although this is at best no more than a
tendency, and one which some speakers apply more consistently than others.
Summarizing these results, the strategies used by these speakers / writ-
ers can be classified as “semi-conjunctivism” in the terminology of van Wyk
(1967): The strategies they employ are disjunctive in that each phonological
word tends to be written as a separate unit, and the writing of enclitics
as separate words is also clearly disjunctive. Nevertheless, enclitics are often
written together either with their host, with a following postposition, or with
each other, and postpositions are occasionally written together with their
preceding dependent element; these are clearly conjunctive strategies. Thus the
writers here “steer a middle course” in dividing these units into written words,
to borrow an expression from van Wyk (1967: 230), combining conjunctive
and disjunctive strategies, although often differing from one another greatly
with respect to their individual preferences, as well as from one utterance to
the next for the same speaker / writer.
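The individual differences in conjunctive and disjunctive preferences can be summarized with a small tally. The following Python sketch simply adds up the first-response word counts reported for examples (26)–(37), in the order Speaker 1 to Speaker 6; the variable and function names are hypothetical, and the totals are only a rough indication, since they ignore the later revisions in (26) and (31).

```python
from typing import Dict, List

# First-response word counts per example, Speakers 1-6, as given in the
# tables for examples (26)-(37).
FIRST_RESPONSE_COUNTS: Dict[int, List[int]] = {
    26: [3, 3, 3, 2, 2, 1],
    27: [2, 4, 2, 2, 2, 2],
    28: [2, 4, 2, 2, 1, 1],
    29: [1, 2, 2, 2, 2, 1],
    30: [2, 3, 2, 2, 2, 1],
    31: [5, 4, 5, 5, 5, 4],
    32: [4, 4, 4, 4, 3, 3],
    33: [10, 10, 10, 9, 7, 8],
    34: [4, 3, 3, 3, 3, 3],
    35: [7, 7, 7, 5, 6, 6],
    36: [4, 4, 3, 3, 3, 3],
    37: [3, 4, 3, 3, 3, 3],
}


def totals_per_speaker(counts: Dict[int, List[int]]) -> List[int]:
    """Sum the number of written words each speaker produced over all examples."""
    n_speakers = len(next(iter(counts.values())))
    totals = [0] * n_speakers
    for row in counts.values():
        for i, n_words in enumerate(row):
            totals[i] += n_words
    return totals


def spread_per_example(counts: Dict[int, List[int]]) -> Dict[int, int]:
    """Difference between the highest and lowest count for each example."""
    return {example: max(row) - min(row) for example, row in counts.items()}


if __name__ == "__main__":
    for speaker, total in enumerate(totals_per_speaker(FIRST_RESPONSE_COUNTS), 1):
        print(f"Speaker {speaker}: {total} written words in total")
    print(spread_per_example(FIRST_RESPONSE_COUNTS))
```

On these figures Speaker 2 has the highest total and Speaker 6 the lowest, which agrees with the qualitative picture above of Speaker 2 as the most disjunctive writer.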
6. Summary and outlook

As the preceding pages have shown, there is no one unit in Kharia which cor-
responds to the English term word. Although there are still a number of open
questions above all with respect to the phonological analysis, the following
facts nevertheless seem clear:
– At the phonological level, there is a preferentially polysyllabic unit with
an LH pitch pattern which we can consider the phonological word. While
most lexical items in the language fulfill these two criteria, a number do
not but require a host in order to become part of a phonological word.
– At the morpho-syntactic level, virtually all lexical entries are “words” in
the sense that they are syntactically relevant units. These include the entire
class of units which qualify as phonological words, but also the class of
(morpho-syntactic) clitics, which attach to their hosts in the syntax. While
these hosts may be simple lexical items, they may also be highly complex
syntactic units. As this class of morpho-syntactic clitics is quite large and
as most phonological words are marked for at least one element from this
group, this means that a phonological word in Kharia typically consists of
more than one morpho-syntactic word. We also have preliminary evidence
for the presence of clitic words, i.e., phonological words consisting entirely
of clitics.
– The orthographical word has not yet been standardized in Kharia, so that
speakers are left to their own judgement as to when to write units separately
or together. As all of the speakers I worked with are literate in Hindi (and
to some extent also in English), I conducted a brief experiment with these
speakers / writers to gain some insight into their intuitive notions of the
“word” in Kharia, and their decisions to write two phonological / morpho-syntactic
units together or separately can give us at least a first indication
of the criteria they make use of.
The process of deciding when to write units together or separately in Kharia
is perhaps more demanding than one might assume if one is used to writ-
ing a language such as English, in which this issue has been standardized
to a large extent. To be sure, there are cases where ambiguity can arise in
English as well, but in case of doubt in English we can always consult a
dictionary or some other authoritative source. Furthermore, the mismatches
between morpho-syntax and phonology in English are not especially numer-
ous, and most of the clitic forms here are analyzable as “reduced free forms”
(e.g., ’s ← is, ’ve ← have). Only very few, such as the “genitive” ’s, cannot
be analyzed as “reduced forms”.
In a language such as Kharia, where this issue has not been standardized
and hence where no normative authorities can be consulted, this can be a
daunting task, especially since in Kharia, unlike English, this issue is
relevant in each and every sentence, as the TAM/PERSON-syntagma always has
enclitic marking and the CASE-syntagma often does as well. In addition, only
the markers for person in the TAM/PERSON-syntagma can be considered
“reduced forms” – no other enclitics have a corresponding free-standing form.
The brief experiment discussed in the preceding pages shows that each
“author” has developed individual strategies to deal with these problems, with
some preferring conjunctive strategies and others favoring disjunctive strate-
gies, although all employ some combination of the two. The only principle
which seems to hold for all is that speakers / writers tend to give priority
to phonology over morpho-syntax when these do not coincide, especially in
longer sequences. In fact, the preference for phonological criteria can even be
so strong that single morphemes can be divided up into two different written
words if an enclitic which begins with a vowel is analyzed as the beginning
of a new written word – here the final consonant of the preceding morpheme
is re-analyzed as the initial consonant of this new word, i.e., the new written
word begins at the syllable boundary, not the morpheme boundary.
This last point immediately raises the question as to what role Devana-
gari, the writing system used for Kharia, plays in this process. As we noted
briefly in the last section, this writing system only views vowels as indepen-
dent units when these appear at the beginning of a written word. Otherwise,
they are written as diacritics to a consonant or consonant cluster in the akṣar.
As such, we speculated in Section 5 that this may at least partially be
responsible for the fact that some speakers / writers occasionally divide morphemes
up into different words, as just noted. However, as there are separate symbols
for vowels when these appear at the beginning of a word, and as I am not
familiar with this phenomenon in other languages written with Devanagari,
this is purely speculative at present.
There are also other aspects which require further investigation, such as
the orthographical rules of Hindi, in which all of the speakers I worked with
were schooled: It is, e.g., noteworthy that a number of morphemes which are
phonologically and morpho-syntactically enclitic in Kharia (and to some ex-
tent also in Hindi) are written as separate units in Hindi, such as the genitive
marker =kā / =ke / =kī or the “dative-accusative” marker =ko. Hindi also has
a small class of V2s which are always written separately from the semantic
base of the predicate, as is also the case for most speakers / writers of Kharia.
While all of this suggests that Hindi orthographical rules may be influenc-
ing the intuitions of native speakers of Kharia with respect to the “word” in
Kharia as well, at least to some extent, a much more detailed study of written
Kharia will be necessary before this can be confirmed.
Future work will also have to address an issue which we have excluded
entirely from the present study but which is perhaps just as central as the de-
termination of the “word”, namely how speakers / writers divide their texts
up into larger units, corresponding (more or less) to the English terms sen-
tence and paragraph and the criteria native speakers use in punctuating their
written language (cf., e.g., the discussion in Himmelmann 2006). Here again
we can assume that Hindi will at least have some influence on this aspect
of written Kharia, although at the moment this is purely speculative. Clearly,
much additional descriptive work is necessary.
In sum, this brief experiment provides us with our first insight into native
speakers’ intuitive notions of the “word” in Kharia, notions which occasion-
ally differ considerably from what might be expected from a purely linguistic
perspective. Despite the considerable variation from one speaker / writer to
another or even from one short text to another for the same speaker / writer,
clear tendencies have also emerged which provide us with a basis on which
future studies can build.
Abbreviations and symbols

–         marks an affix                IRR    irrealis
=         marks a clitic                LH     low-high pitch
<>        marks an infix                MED    medial
1, 2, 3   first, second, third person   MID    middle
ACT       active                        NEG    negation
ADD       additive focus                NUM    number
A:TEL     anticipatory telic            OBL    oblique
CAUS      causative                     PASS   passive
CLASS     numeral classifier            PL     plural
CNTR      contrastive focus             POSS   inalienable possession
C:TEL     culminatory telic             PRF    perfect
DU        dual                          PROG   progressive
DEM       demonstrative                 PROX   proximal
DIST      distal demonstrative          PRS    present tense
EXCESS    excessive                     PST    past tense
FOC       (general) focus               PURP   purposive
GEN       genitive                      QUAL   qualitative predication
GEN-ATTR  genitive attribute            QUANT  quantifier
HON       honorific                     RDP    reduplication (of a stem to derive
HUM       human                                a phonological word)
INCL      inclusive                     REC    reciprocal
INF       infinitive                    REP    repetition (of a phonological word)
INST      instrumental                  SG     singular
IPFV      imperfective                  TAM    tense, aspect and mood
                                        TOTAL  totality
                                        V2     Aktionsart marker
References
Aikhenvald, Alexandra Y. 2002. Typological parameters for the study of
clitics, with special reference to Tariana. In Word: A Cross-linguistic Ty-
pology, eds. Robert M. W. Dixon and Alexandra Y. Aikhenvald, 42–78.
Cambridge: Cambridge University Press.
Anderson, Gregory D. S., and Norman H. Zide. 2002. Issues in Proto-Munda
and Proto-Austroasiatic nominal derivation: The bimoraic constraint. In
Papers from the 10th Annual Meeting of the Southeast Asian Linguistics
Society, ed. Marlys A. Macken, 55–74. Tempe, AZ: Arizona State Univer-
sity, South East Asian Studies Program (Monograph Series Press).
Anderson, Stephen R. 2005. Aspects of the Theory of Clitics. Oxford: Oxford
University Press.
Di Sciullo, Anna, and Edwin Williams. 1987. On the Definition of Word.
Cambridge (MA), London: MIT Press.
Dixon, Robert M. W., and Alexandra Y. Aikhenvald. 2002. Word: A typolog-
ical framework. In Word: A Cross-linguistic Typology, eds. Robert M. W.
Dixon and Alexandra Y. Aikhenvald, 1–41. Cambridge: Cambridge Uni-
versity Press.
Himmelmann, Nikolaus P. 2006. The challenges of segmenting spoken lan-
guage. In Essentials of Language Documentation, eds. Jost Gippert, Niko-
laus P. Himmelmann, and Ulrike Mosel, 253–274. Berlin, New York: Mou-
ton de Gruyter.
Kerkeʈʈā, Khrist Pyārī. 1990. Jujhair ɖā̃ɽ (Khaɽiyā nāʈak) [The Battle Field
(A Kharia Drama)]. Ranchi: Tribal Language Academy, Government of
Bihar.
Lewis, M. Paul, ed. 2009. Ethnologue: Languages of the World. Dallas,
Texas: SIL International, sixteenth edition. Online version: http://www.
ethnologue.com/.
Malhotra, Veena. 1982. The structure of Kharia: A study of linguistic ty-
pology and language change. Unpublished Ph.D. dissertation, Jawaharlal
Nehru University, New Delhi.
Peterson, John. 2011. A Grammar of Kharia: A South Munda Language.
Leiden: Brill.
Peterson, John, and Utz Maas. 2009. Reduplication in Kharia. In Reduplica-
tion: Diachrony and Productivity, ed. Bernard Hurch and Veronika Mattes.
Special issue of Morphology 19(2):207–237.
Pinnow, Heinz-Jürgen, ed. 1965. Kharia-Texte (Prosa und Poesie). Wies-
baden: Otto Harrassowitz.
Sadock, Jerrold M. 1991. Autolexical Syntax: A Theory of Parallel Grammat-
ical Representations. Chicago, London: University of Chicago Press.
Van Valin, Robert D., Jr. 2005. Exploring the Syntax-Semantics Interface.
Cambridge: Cambridge University Press.
van Wyk, E. B. 1967. Northern Sotho. Lingua 17(2):230–261.
Chapter 6
Aspect in Forest Enets and other Siberian indigenous
languages – when grammaticography and
lexicography meet different metalanguages
Florian Siegl
1. Introduction
This contribution explores the mutual dependency of grammaticography and
lexicography, recurring topics in Ulrike Mosel’s writings (e.g. Mosel 2004,
2006a,b), by investigating a controversial feature in the description of sev-
eral indigenous languages of Siberia – the representation of aspect. The main
focus is on Forest Enets, a moribund Northern Samoyedic language spoken
and remembered by fewer than 40 individuals between 50 and 65 years of age, on
which the author has conducted substantial fieldwork in recent years.1 Ad-
ditional observations deriving from recent descriptions of other indigenous
languages of Siberia will be touched upon en passant, as the description of
aspect is part of a more general problem which is in no way restricted to
Forest Enets.
Although not always clearly stated as such, the description of aspect has
been heavily dependent on dominant linguistic models developed on, and
applied to, dominant majority languages. Such frameworks are all too
frequently transferred to the analysis of a given minority language. As aspect
is a category whose description may, or even must, include reference to both
grammar and lexicon, a set of general questions inevitably arises. Before em-
barking on the analysis of aspect in a given language, the grammarian is faced
with a number of questions, including the following: is aspect derivational
or inflectional, does aspect interact with tense or mood, and if so, how; are
there instances of nominal aspect? Other questions that may arise are, for
example, whether and how aspect choice interacts with negation, and whether
aspect distinctions are available to nominalized verbs or reserved for finite
verbs only. From the perspective of the lexicographer,2 on the other hand,
the following questions need to be addressed: should each aspect form be
represented as a single entry, even if the derivational process is productive
and regular? Should aspect-marked verbs be listed under the headword as
subentries or should only irregular or lexicalized forms be listed? Whereas
the representation of aspect in dictionaries may not be decided by either the
grammarian or the lexicographer alone, the motivation underlying a certain
preference of representation should be reasonably well explained. However,
precisely this is not always obvious for the languages under investigation.
A third, complicating factor is that there have been, and continue to be,
attempts to force the Forest Enets aspect system into the aspectual framework
often applied to the Russian language.3 The outcome is descriptions of
aspectual systems which, although satisfying Russian principles of
grammaticography and lexicography, fail to capture crucial features of the
Forest Enets aspect system.

1. Fieldwork was conducted as part of the Tartu-Göttingen project “Documentation of
Enets and Forest Nenets”, funded by the Volkswagen Foundation as a DoBeS Project
(2005–2009). A preliminary survey including sociolinguistic data was published as Siegl
(2010).

2. The Russian aspect

In classical descriptions of indigenous languages of Siberia conducted by So-
viet and Russian linguists, Russian continues to serve as the predominant
medium of description. This of course also means that virtually any descrip-
tion of aspect in classical descriptions was conducted from the perspective of
Russian aspect.4 Alternative approaches to aspect, such as those applied to
Romance or Germanic (see e.g. Sasse (2002) for a comprehensive summary)
have played at best a very marginal role, if any.
In order to appreciate this impact, it is necessary to briefly sketch some
of the main features of Russian aspect. The following short account largely
follows that of Timberlake (2004). Russian verbs are generally considered
to fall into two aspect groups, perfective and imperfective. This means that
the only difference between e.g. čitat’ ‘read<IPF>’ and pročitat’ ‘read<PFT>’ is
aspect. As much as aspect plays an important role, it is generally seen “more
as a partition of the lexicon than an inflectional operation” as Russian does
not have a “single morphological device that marks the opposition of aspect.”
(Timberlake 2004: 93)

2. When talking about lexicography, bilingual lexicography is meant.
3. A recent Russian textbook on morphology suggests breaking with this tradition, at least
from a terminological perspective. Plungian proposes that Russian vid (Ru: вид), the
general terminus technicus for aspect, should be used only for the description of Slavonic
languages. For other languages aspekt (Ru: аспект) should be used instead (Plungian 2003).
4. This does not necessarily imply that such descriptions were meant as contrastive
descriptions.

Russian aspect tightly interacts with tense. Only imperfective verbs al-
low a periphrastic future with the auxiliary byt’ ‘be’, inflected for person,
followed by the infinitive of the imperfective verb e.g. ja budu čitat’ <1SG
be.1SG read.INF IPF> ‘I will read/I will be reading.’ Apart from future tense,
imperfective verbs also form both past and present tense. Whereas in the
present tense, the verb is finite, and specialized personal endings for each
person are used, the past tense consists of a special past tense marker -l,
to which markers signaling gender/number agreement (masculine singular ø,
feminine singular -a, neuter -o, plural -i [no gender distinction]) are added.
Perfective verbs, relying on the same endings, only show present tense and
past tense. When perfective verbs are inflected in the same way as imperfective
verbs which express present tense, a future time reading arises instead.
Although other tests for telling perfective from imperfective verbs have been
suggested, the possibility of a periphrastic future provides the single most
reliable criterion for partitioning the verb lexicon into two aspects (see also
Timberlake 2004: 401).
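The partition just described can be made concrete with a small sketch. It is an illustration only, under the assumption that aspect is stored as a lexical property of each verb (in line with the "partition of the lexicon" view quoted above); the `Verb` class, the `LEXICON` table and both function names are hypothetical and belong to no existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Verb:
    infinitive: str
    aspect: str  # "IPF" (imperfective) or "PFT" (perfective)


# Toy lexicon restricted to two verbs cited in the text. Storing aspect as a
# lexical property is an assumption of this sketch.
LEXICON = {
    v.infinitive: v
    for v in (Verb("čitat’", "IPF"), Verb("pročitat’", "PFT"))
}


def periphrastic_future(verb: Verb) -> Optional[str]:
    """Return the 1SG periphrastic future, or None for perfective verbs."""
    if verb.aspect != "IPF":
        return None
    return f"ja budu {verb.infinitive}"


def tense_readings(verb: Verb) -> List[str]:
    """List the tense readings the text attributes to each aspect group."""
    if verb.aspect == "IPF":
        return ["past", "present", "periphrastic future"]
    # Perfective verbs with present-tense endings receive a future reading.
    return ["past", "future"]
```

In this toy model the periphrastic-future test simply reads off the stored aspect value; with real data the direction would be reversed, and the availability of the budu construction would be the evidence for assigning a verb to one group or the other.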
As already mentioned above, Russian has no specialized morphological
aspect marker. Instead, a restricted variety of morphological strategies are
used to create and maintain aspectual pairs. The cornerstones of the Russian
aspect system are the so-called simplex verbs, which do not have prefixes
and are imperfective. Such simplex verbs report continuous situations which
are static and/or unchanging, e.g. grustit’ ‘be sad’, videt’ ‘see’. Others
may involve “some degree of gradual change and responsibility” e.g. sidet’
‘sit’, rabotat’ ‘work’, motat’ ‘wind’, l’stit’ ‘flatter’, krutit’ ‘twist, twirl’
(Timberlake 2004: 402, 411). Such simplex verbs are perfectivized by a number
of prefixes,5 e.g. pisat’ ‘write<IPF>’ → na-pisat’ ‘write<PFT>’. Furthermore,
prefixed perfective verbs, e.g. podpisat’ ‘sign<PFT>’, can and do form
corresponding imperfective verbs by suffixation: podpisyvat’ ‘sign<IPF>’ or
perepisyvat’ ‘rewrite<IPF>’ from perepisat’ ‘rewrite<PFT>’. What is important here
is the fact that the relation between prefixed perfective verbs and
corresponding imperfective verbs is perceived differently from the one between
simplex verbs and their perfective counterpart (Timberlake 2004: 407):
Simplex imperfective verbs are prefixed and yield perfectives. Many of those
perfectives – those that report a continuous process leading to a limit – can
be suffixed and yield closely related secondary imperfectives that form
unambiguous aspectual pairs. Prefixed verbs that discuss discrete quanta of the
activity are less amenable to forming secondary imperfectives. Because
simplexes ordinarily are imperfective, one or another of the prefixed perfectives
will serve as the perfective counterpart to the simplex imperfective.

5. A characterization of these prefixes cannot be given here for reasons of space.

The fact that Russian allows the formation of secondary aspect pairs of the
type perepisyvat’ ‘rewrite<IPF>’ vs. perepisat’ ‘rewrite<PFT>’ makes the Russian
aspect system different from that of other languages. Whereas e.g. Estonian
and Hungarian have some limited means of perfectivizing verbs by a particle
(ära in Estonian) or a prefix (meg- in Hungarian), secondary aspect pairs of
the Russian perepisat’ / perepisyvat’ (from a simplex pisat’ ‘write<IPF>’) type
are missing. As will be shown later, Forest Enets does not allow this either,
and this peculiarity of Russian aspect will occupy us again below.
Although small in number, there are verbs which do not fit readily, or
do not fit at all into the perfective/imperfective system of Russian. The first,
rather small, group is generally known as bi-aspectual verbs.6 Such verbs lack
a distinguished aspect value and can be used to express both aspects without
any further derivation. The second group comprises verbs of movement where
the principles of aspect derivation intermingle with indeterminate vs. deter-
minate movement and function on a different basis. Third, a small class of
so-called simplexes is perfective. These problems of Russian aspectology are
however not of relevance here.
6. Although it appears that the terminus technicus bi-aspectual verb is fairly well established,
Timberlake proposes a different name, anaspectual verbs (Timberlake 2004: 407–408).

3. From Russian aspect in grammaticography and lexicography to Siberian lexicography
Although aspectual pairs are the cornerstone of verbal conjugation in Russian,
the predominant analysis of aspect classifies it as an organizing principle
of the lexicon, because there is no single dedicated morphological aspect
operation. In other words, although the role of morphology is generally
acknowledged, aspect in Russian resists clear-cut classification, and is
considered both part of the lexicon and part of the grammar. Of course, there
have been attempts to justify classifying aspect as an inflectional, derivational
or mixed category, and the discussion of the relationship between aspect and
aktionsart has added more problems, but these matters are of no direct im-
portance within the context of Siberian minority languages.7 What is really
relevant for this study is the concept ‘aspectual pairs’ and related problems,
as these continue to have their impact, especially for lexicography of minor-
ity languages. Whereas there is no doubt that aspectual pairs play a major
role in Russian, the assignment of verbs to such pairs is indeed far from be-
ing settled. This becomes evident in the varied descriptions of aspect in the
different editions of the Russian Academy grammar and major monolingual
dictionaries of Russian. It is worth mentioning that to date, no aspectual
verb dictionary of Russian has been produced.8 The situation becomes even
more complicated when addressing the representation of aspect in bilingual
dictionaries. Whereas there is a distinct tradition of bilingual lexicography in
Russia and Russian (e.g. Jachnow 1990; Eismann 1991), practical bilingual
Russian lexicography focuses on European and major non-European majority
languages and the Russian learner. In contrast and especially in the early 20th
century, practical lexicography of Siberian minority languages had a different
audience, namely speakers of indigenous languages learning Russian as a for-
eign language, a target audience excluded from traditional Russian bilingual
lexicography. Of course, this situation has changed profoundly and nowa-
days Russian-based lexicography would be equally necessary as many school
children from indigenous Siberian communities enter schools as bilinguals or
even Russian monolinguals – the original audience consisting of monolingual
school children is a minority today. Still, what has unfortunately not changed
is the fact that practical bilingual lexicography of Siberian minority languages
remains terra incognita and lives a life of its own, largely independent from
current Russian bilingual lexicography. With regard to aspect, the difference
is easily observable. According to V. Berkov, a verb lemma in a Russian-to-X
dictionary shows a perfective verb which is followed by a translation. If
followed by a corresponding imperfective verb in the same entry, no separate
translation for the imperfective verb is added. It is, however, equally possible
to have an unprefixed imperfective verb as a lemma followed by a translation,
especially if there is no corresponding perfective verb available (Berkov
1996: 115–116). The consequence of this approach is that the Russian aspectual
pairs are preserved wherever possible. In reality, bilingual dictionaries
of Siberian indigenous languages do not follow this convention. A random
look at a dozen dictionaries of different minority languages of the Russian
Federation, among them several school dictionaries of Uralic, Turkic, Tungusic
and Paleosiberian languages, reveals that different approaches diverging
from Berkov’s suggestion were chosen. Frequently, Russian aspect pairs do
not appear in the Russian section of these dictionaries, and both imperfective
and perfective verbs, even of the same aspect pair, are independent lemmata.

7. See e.g. Bondarko (2002) for a short survey on different approaches throughout the second
half of the 20th century as well as Zaliznjak and Šmelev (2000). In this respect, Russian
differs sharply from the languages of Siberia which will occupy us in the remainder of this
paper. In contrast to Russian, which is morphologically fusional, the languages which will
be investigated now are all agglutinative and morphemes are fairly well segmentable.
8. Some ideas for such a dictionary were offered by Zaliznjak and Šmelev (2000: 97–103).

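The two lemmatization practices contrasted in this section can be sketched as alternative entry layouts. This is a minimal illustration, not a description of any actual dictionary: the `Entry` class and both functions are hypothetical, the verbs are the Russian examples cited earlier in the chapter, and the English translations merely stand in for the target language "X".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (imperfective, perfective, translation). English stands in for language "X";
# treating each triple as one aspectual pair is an assumption of this sketch.
PAIRS: List[Tuple[str, str, str]] = [
    ("pisat’", "napisat’", "write"),
    ("podpisyvat’", "podpisat’", "sign"),
    ("perepisyvat’", "perepisat’", "rewrite"),
]


@dataclass(frozen=True)
class Entry:
    lemma: str
    translation: str
    imperfective: Optional[str] = None  # untranslated partner, if listed


def paired_entries(pairs: List[Tuple[str, str, str]]) -> List[Entry]:
    """One entry per pair: perfective lemma, translation, imperfective partner."""
    return [Entry(pfv, gloss, ipf) for ipf, pfv, gloss in pairs]


def independent_entries(pairs: List[Tuple[str, str, str]]) -> List[Entry]:
    """Two unrelated entries per pair, each carrying its own translation."""
    entries: List[Entry] = []
    for ipf, pfv, gloss in pairs:
        entries.append(Entry(ipf, gloss))
        entries.append(Entry(pfv, gloss))
    return entries
```

The first layout keeps the aspectual pair visible inside a single entry; the second loses the link between the two verbs, which is the situation reported above for many dictionaries of Siberian minority languages.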
4. Aspect in Forest Enets and other Siberian indigenous languages
Although a comprehensive grammatical description of endangered languages
should, in principle, be based on a theory-neutral model, this is rarely the case.
In the coverage of aspect in several Siberian indigenous languages compiled
by Soviet and Russian linguists, aspect ends up looking very similar to Russian
aspect, or is at least treated in a way suspiciously close to it. Before we
move to the description and preliminary analysis of aspect in Forest Enets,
several clear examples from the literature will be presented.
4.1. The representation of aspect in Central Siberian Yupik Eskimo and Ket
The following two rather clear instances retrieved from recent literature on
Central Siberian Yupik and Ket neatly exemplify the problems of aspect de-
scription. Whereas Russian sources try to describe the verbal paradigm fol-
lowing Russian principles of grammaticography and lexicography, descrip-
tions by non-Russian linguists have expressed serious doubts, or do not even
mention this possibility. For Central Siberian Yupik, Steven Jacobson wrote:
Concerning the issue of tense, the unmarked form of verbs (other than ad-
jectival or descriptive verbs) has a past-tense implication for CSY [Central
Siberian Yupik, FS] but a present-tense implication for CAY [Central Alaskan
Yupik, FS] unless context indicates recent past. Thus, CSY qavaxtuq means
‘he slept’, while CAY qavaxtuq means ‘he is sleeping’. To get the present
tense in CSY one must use the postbase -aq@-: qava7aquq ‘he is sleeping’.
This same postbase in CSY is also used for repeated actions, as in quunp@N
qava7aquq ‘he is always sleeping’. Consequently, the verb forms with this
postbase in CSY begin to resemble the “imperfective aspect” of Russian verbs
– this is probably the reason that Soviet CSY-to-Russian dictionaries include
this postbase in the verb forms that they use as their citation forms for verbs.
(Jacobson 1990: 275)9
A more interesting instance can be found in recent descriptions of the last
remaining Yeniseian language, Ket. The description offered by Heinrich Werner
(Werner 1997: 206–210) takes a Slavonic perspective on aspect and assumes
the existence of aspect pairs. “Es handelt sich um die folgenden Formen [=as-
pects, F.S], die einander gegenüberstehen 1) perfektive vs. imperfektive; 2)
permansive (bzw. progressive) vs. nicht-permansive” (Werner 1997: 206). In
contrast, the descriptions of Edward Vajda (Vajda 2004) and a slightly more
explicit description published as Vajda and Zinn (2004) neither operate with
a clear imperfective vs. perfective distinction as assumed by Werner nor ex-
plicitly state the existence of aspectual pairs.
4.2. Aspect in descriptions of individual Samoyedic languages

As far as Samoyedic languages are concerned, no book-length treatment of
aspect in any of the individual languages is available.10 Nevertheless, what
can be seen is a gradual change in the treatment of aspect in the literature
over time:11 earlier descriptions of Selkup and Tundra Nenets
(Prokof’ev 1935, 1937; Tereščenko 1947, 1959) recognized the importance
of aspect but explicitly saw it as different from Russian aspect. In contrast,
subsequent research on other Samoyedic languages, e.g. Taz Selkup and Forest
Enets, claimed the existence of aspect pairs similar to Russian. As already
said, Prokof’ev (1935) did not propose any aspect pairs for Taz Selkup12 but
the grammar compiled by Kuznetsova, Xelimskij, and Gruškina (1980)
introduced aspect pairs, though neither motivating their decision nor offering
any clear examples. Recently, this assumption was opposed for the first time
in print by Kazakevič (2008) but the original description is still defended by
one of the authors (Kuznetsova 2008). A similar observation can be made for
Forest Enets, which will occupy us for the rest of this chapter.

9. An earlier review of the major Soviet publications on Siberian Yupik Eskimo by Ulving
(1971) was equally critical of the interpretation of several aspect suffixes, although the
review concentrated more on problems within the description of phonology and
morphology (Ulving 1971: 102–109).
10. Samoyedic languages form the second branch of the Uralic language family together with
the other and generally better known Finno-Ugric branch. Concerning its internal
structure, Samoyedic is generally subdivided into two major branches, Northern Samoyedic
and Southern Samoyedic. The only Southern Samoyedic language still alive is Selkup,
which due to its internal dialectal stratification could easily be subdivided into several
independent languages. Both Nenets and Enets languages (Tundra Nenets and Forest Nenets
and respectively Tundra Enets and Forest Enets) and Nganasan belong to the Northern
Samoyedic subfamily. See also Janhunen (1998) for further background information.
11. Nganasan is excluded from this discussion as the Nganasan aspectual system is known
to differ quite extensively from the Nenets, Enets and Selkup system (Wagner-Nagy
2001: 52–76). As data for the extinct Southern Samoyedic language Kamas are scarce and
not systematized, it has to be excluded too.
12. Interestingly, his grammar of Taz Selkup treated aspect before tense.

4.3. Aspect in Forest Enets
The description of aspect in previous research can be summarized rather
quickly. In the existing three grammatical descriptions of Forest Enets
(Castrén 1854; Prokof’ev 1937; Tereščenko 1966)13 only the two Russian sketch
grammars address aspect, offering partly different categories. The only
detailed study of the aspect system in Forest Enets is Sorokina (1975),14 which
produced a very accurate inventory of aspect-marking suffixes. During my
own fieldwork, no new suffixes turned up, nor was it necessary to revise
Sorokina’s analysis. As several labels used by Sorokina are slightly confusing
or misleading, the comparative presentation in Table 1 is added to exemplify
terminological choices.

13. Although Castrén’s grammar contains a description of Enets, most of the data actually
derive from Tundra Enets. The only grammar of (Forest) Enets written in English, Künnap
(1999), is nothing more than an enhanced translation of Tereščenko (1966) and can therefore
be excluded.
14. Sorokina’s unpublished candidate dissertation also deals partly with this topic, but all
necessary information is readily available in this article.

4.3.1. The basics of tense and aspect in Forest Enets
Forest Enets expresses tense morphologically and distinguishes overtly
between past tenses, aorist and future tense. As the morphological realization of
tense is slightly unusual due to a complex interplay between non-continuous
morphology, inflection and derivation, the following condensed survey tries
to sketch its minimal principles for the purposes of this chapter.15

Table 1. Aspect suffixes and their function
Suffix     Sorokina (1975)                        Suffix                 Siegl (forthc.)
-gu-/-ku-  длительность действия                  -gu/-ku                durative
           (durative)
-ra-/-la-  начинательность действия               -ra/-la                inchoative
           (inchoative)
-P/-ra     многократность действия                -P/-r; -rV/-lV         frequentative
           (plurality, frequentative)
-da16      постепенность действия                 -do-                   cumulative
           (graduality)
-jta       немногократность действия / неполнота  -ita                   delimitative
           (non-plurality, incompleteness)
-ubi       постоянность действия                  -ubi/-mubi/-umbi/-mbi  habitual
           (continuative)
-m         становление действия                   -ma                    resultative
           (formative, resultative)
-ga/-gi    дисконтинуатив                         -ga/-ka                discontinuative
           (discontinuative)

The most problematic tense category in Forest Enets is the morphologi-
cally unmarked aorist. The reason for calling this category aorist rather than
present tense is its incompatibility with a present tense interpretation. The
Forest Enets aorist does not refer exclusively to a currently ongoing action
but refers also to actions which have their implications for the current sit-
uation but precede the moment of speech. This largely coincides with the
inherent lexical aspect of a verb; for example, the aorist of the atelic verb
d’iriš ‘live’ has a stable present tense reference. With action verbs like kaąaš
‘kill’, the aorist forms allow two interpretations; they can either refer to an
ongoing action (killing) or the outcome of an action (just killed). Telic verbs
such as koš ‘find’ always refer to an action just completed, hence requiring
a past tense verb form in the English translation.

15. The categories tense, aspect, modality and evidentiality are described in Chapter 7 of Siegl
(forthcoming).
16. Although Sorokina (1975) mentions this suffix, she gives no further examples. The label
постепенность действия comes from Tereščenko (1966), though slightly different
suffixes -do-/-to- were given (Tereščenko 1966: 453). The label cumulative was introduced
by Künnap (1999: 28).

For past tense reference,
three different tenses can be found. The general past tense marked by -š is
morphologically unusual, as it follows verbal suffixes in word-final position,
e.g. mosra-d-uš <work-2SG-PST> ‘you worked’. The perfect is marked by
-bi following verbal suffixes e.g. mosra-bi-d <work-PERF-2SG> ‘you have
worked’. Finally, a distant past is found, for which both the perfect and the
general past markers are combined, e.g. mosra-bi-d-uš <work-PERF-2SG-
PST> ‘you had worked’. The future tense is expressed morphologically reg-
ularly, in that its marker -ąa is followed by verbal endings, e.g. mosra-ąa-d
<work-FUT-2SG> ‘you will work’.
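The ordering of tense and person markers in the four marked tense forms can be summarized as slot templates. The following is a minimal sketch: the morph shapes are copied from the 2SG forms of mosra- ‘work’ cited above, no morphophonology is modelled, and the names `MORPHS`, `TEMPLATES` and `build` are hypothetical.

```python
from typing import Tuple

# Morph shapes exactly as they appear in the segmented 2SG examples of
# mosra- 'work'; other persons and stems may show different allomorphs.
MORPHS = {"2SG": "d", "PST": "uš", "PERF": "bi", "FUT": "ąa"}

# Order of suffix slots after the stem for each tense, read off the glosses.
TEMPLATES = {
    "general past": ["2SG", "PST"],
    "perfect": ["PERF", "2SG"],
    "distant past": ["PERF", "2SG", "PST"],
    "future": ["FUT", "2SG"],
}


def build(stem: str, stem_gloss: str, tense: str) -> Tuple[str, str]:
    """Return the segmented form and its gloss for a 2SG subject."""
    slots = TEMPLATES[tense]
    form = "-".join([stem] + [MORPHS[slot] for slot in slots])
    gloss = "-".join([stem_gloss] + slots)
    return form, gloss


for tense in TEMPLATES:
    print(tense, *build("mosra", "work", tense))
```

The templates make the unusual position of the general past marker visible: it is the only tense marker that follows the person ending, while the perfect and future markers precede it.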
As can be seen from Table 1, Forest Enets has eight derivational suffixes
expressing aspect which can be combined with both aorist and past tenses.17
Future tense and aspect are mutually incompatible.
When contrasting tense and aspect markers, a major morphological dif-
ference is readily observable in negation, for which a negative auxiliary con-
struction is used. The negative auxiliary, either ńi- or i- carries inflectional
morphology such as tense, person and mood (the latter not demonstrated)
whereas the main verb is realized as an infinite form, traditionally called the
connegative. In the following, the negation of the aforementioned tense forms
is shown:
(1) a. uu   ńi-d         mosra-P
       2SG  NEG.AUX-2SG  work-CN
       ‘You are not working, you do not work.’
    b. uu   ńi-d-uš          mosra-P
       2SG  NEG.AUX-2SG-PST  work-CN
       ‘You did not work.’
    c. uu   i-bi-d            mosra-P
       2SG  NEG.AUX-PERF-2SG  work-CN
       ‘You have not worked.’
    d. uu   i-bi-d-uš             mosra-P
       2SG  NEG.AUX-PERF-2SG-PST  work-CN
       ‘You had not worked.’

17. In several instances, an interpretation based on aktionsart would be equally attractive. In
this paper and in Siegl (forthc.) I have excluded aktionsart from the description as this
category is burdened with diverging interpretations, especially from a Slavonic
perspective. Timberlake (2007) likewise excluded aktionsart from his survey article and did
not even mention it.

However, both the future tense marker and all aspect markers remain on the
negated lexical verb as the following examples show:
(2) a. uu   ńi-d         mosra-ąa-P
       2SG  NEG.AUX-2SG  work-FUT-CN
       ‘You will not work.’
    b. d’exa  kariP         mud’  ni-ąP        o-mubi-P [...]
       perch  fish(ACC.PL)  1SG   NEG.AUX-1SG  eat-HAB-CN
       ‘I usually do not eat perch...’ [NKB Mouse and Fishermen]
    c. d’oxora          äki   po-xon       ńi-ą-uč               tota-gu-P
       not.know.SG.1SG  this  year-LOC.SG  NEG.AUX-3PL.REFL-PST  count-DUR-CN
       ‘I don’t know, this year they were not counted.’ [LDB Yamal]
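The division of labour between the negative auxiliary and the connegative verb in examples (1) and (2) can be stated as a small rule set over gloss labels. This is a sketch under stated assumptions: it operates on glosses only, produces no Forest Enets forms, and the function `negate` with its tense labels ("AOR", "PST", "PERF", "DIST", "FUT") is hypothetical.

```python
from typing import Optional, Tuple


def negate(stem_gloss: str, person: str, tense: str = "AOR",
           aspect: Optional[str] = None) -> Tuple[str, str]:
    """Return (auxiliary gloss, lexical verb gloss) for a negated clause."""
    if tense == "FUT" and aspect is not None:
        # The chapter reports future tense and aspect as mutually incompatible.
        raise ValueError("future tense does not combine with aspect")

    # The auxiliary hosts perfect, person and general past marking.
    aux = ["NEG.AUX"]
    if tense in ("PERF", "DIST"):
        aux.append("PERF")
    aux.append(person)
    if tense in ("PST", "DIST"):
        aux.append("PST")

    # The lexical verb keeps future or aspect marking and ends in the connegative.
    lexical = [stem_gloss]
    if tense == "FUT":
        lexical.append("FUT")
    if aspect is not None:
        lexical.append(aspect)
    lexical.append("CN")

    return "-".join(aux), "-".join(lexical)
```

Called as `negate("count", "3PL.REFL", "PST", "DUR")`, the function returns the two glosses of example (2c), with tense on the auxiliary and aspect on the connegative verb.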
The emerging picture shows a certain mismatch between tense and aspect.
First, tense-marked verbs always need a verbal ending, but do not need an
aspect marker. This observation is valid for affirmative as well as negated
clauses:
(3) a. mod’naP  peri    soiąan     tonin  d’ire-ba-č
       1PL      always  good-PROL  there  live-1PL-PST
       ‘We always lived well here.’ [ZNB Autobiographic]
    b. Nol’u  d’a-xan      ńi-P         Na-P
       one    land-LOC.SG  NEG.AUX-3PL  be(LOC)-CN
       ‘(Reindeer) do not stay at one place (lit.: are not at one place).’
       [ZNB Trip to Potapovo]
In contrast, tense and aspect can co-occur, but in negation, aspect remains on
the negated lexical verb which appears in the infinite connegative form:
(4) a. tiąa                    šimi-lta-gu-š
       reindeer.PX.ACC.PL.3SG  run-CAUS-DUR-3SG.PST
       ‘He made his many reindeer run away.’ [LDB Shaman]
    b. tonuju      tuxudačuP    oka-P    ä-ubi-P     ańP  ńe-ąP        mui-ubi-P
       summer.ADV  fly(NOM.PL)  many-PL  be-HAB-3PL  FOC  NEG.AUX-1SG  make-HAB-CN
       ‘In summer, there are usually many flies and so I don’t make it
       (=poxi).’18 [NKB Jukola]
As already seen above in (2c), aspect remains on the infinite negated lexical
verb, whereas tense is hosted by the negative auxiliary.
(2) c. d’oxora          äki   po-xon       ńi-ą-uč               tota-gu-P
       not.know.SG.1SG  this  year-LOC.SG  NEG.AUX-3PL.REFL-PST  count-DUR-CN
       ‘I don’t know, this year they were not counted.’ [LDB Yamal]
Finally, apart from the infinite connegative form, aspect is also compatible
with other infinite categories, e.g. the manner converb -š in which tense mark-
ers are not allowed:
(5) no  šeru-gu-š     pä-baP
    so  bury-DUR-CON  begin-1PL
    ‘...and so we began to bury (her).’ [EIB Clairvoyant]
We will return to the implications of this distribution after a description of the
aspect system.
4.3.2. Basic functions of Forest Enets aspect suffixes

We will start with the discussion of the discontinuative -ga/-ka, delimitative
-ita and cumulative -do, as they are very infrequent or absent from transcribed
narratives. In contrast, all other aspects appear quite frequently in
spontaneous speech.
The discontinuative aspect -ga/-ka19 expresses that an action takes place
irregularly:
(6) televizor  kutuixin   mosra-ga
    TV         sometimes  work-DISC.3SG
    ‘Once in a while, the TV works (but usually it does not).’ [ZNB I 66]
(7) bią         toąa-ga
    water(ACC)  bring-DISC.3SG
    ‘Once in a while, she brings water.’ [NKB II 88]

18. poxi (Ru: jukola) is a cover term for dried staples, usually fish or meat for the winter
which is prepared in summer.
19. The first form represents the basic form of a morpheme; other forms are the result of
morphophonological processes.

The delimitative -ita characterizes an event as taking place for a short period
of time, including a repetition if possible. How its meaning differs from that
of the discontinuative is currently not fully clear, and more research is needed:
(8) bu   ibl’eigu-n   adi-ita
    3SG  little-PROL  sit-DEL.3SG
    ‘He sits for a little while.’ [ZNB I 67]
The cumulative -do marks an event as continuing and increasing:
(9) d’iričima  souxu-do-š       tara-ńu
    life       improve-CUM-CON  must-ASS.3SG
    ‘Life must improve / get better and better.’ [LDB II 33]
The inchoative marker -ra/-la focuses on the beginning of an action/event.
Concerning its morphological behavior, the inchoative marker changes the
inflectional behavior of a verb by assigning it to conjugation III. This trigger-
ing of conjugation type change makes this marker unique within the aspect
system:20
(10) d’urak  baąa-an    soiąa-an   d’uri-l-iP
     Nenets  word-PROL  good-PROL  speak-INCH-1SG.REFL
     ‘I started to speak Nenets well.’ [NKB Childhood]
The translative-resultative aspect in -ma focuses on the result of a change of
state. In this respect, the translative-resultative is the functional opposite of
the inchoative which stresses the beginning of an action.
20. Forest Enets verbs show three conjugation types. The so-called first conjugation contains
almost all intransitive and all transitive verbs. The second conjugation is a pragmatically
orientated conjugation with implications for topic prominence and only transitive verbs
can be alternatively conjugated in this conjugation. The third conjugation contains intran-
sitive verbs which do not fall into the first conjugation.
(11) a. kaja  oąi-ma
        sun   appear-RES.3SG
        ‘The sun came out.’ [ZNB III 18]
     b. biP    laxu-ma
        water  cook-RES.3SG
        ‘The water boiled.’ [ZNB III 18]
From a morphological perspective, the translative-resultative aspectual marker
also shows further derivational properties, as it can derive
translative-resultative verbs from nouns:
(12) a. d’eri  d’uba
        day    warm.3SG
        ‘The day is warm.’ [ZNB III 18]
     b. Na-ąa       d’eri-ma
        sky-PX.3SG  day-RES.3SG
        ‘It had become day.’ [ZNB III 18]
The habitual -ubi/-mubi/-mbi/-umbi marks verbs for actions which take place
on either a regular basis or frequently enough to be considered a habit:
(13) m@dnaP  čas-xun      oo-mubi-ä    täąa  ańP  orčuą       oo-Na-ač
     1PL     hour-LOC.SG  eat-HAB-1PL  now   FOC  before.ABL  eat-FREQ-1PL
     ‘We usually eat at 1 o’clock, but today we ate earlier.’ [NKB II 20]
The durative suffix -gu/-ku and the frequentative suffix -P/-r with its allomorph
-Na pose considerable difficulties of analysis, as the frequentative -P/-r has
undergone reanalysis as a conjugation marker and is almost exclusively found
on verbs belonging to the inflection class IIa. With several verbs, some limited
productivity of a related frequentative -rV/-lV can be observed, but this is
beyond the scope of this paper and the reader is referred to Chapter 7 in
Siegl (forthcoming). For the current discussion, it is sufficient to characterize
their functions as follows: the durative suffix -gu/-ku marks an event (and
occasionally also a state) as ongoing:
(14) bu   kńiga      tota-go
     3SG  book(ACC)  read-DUR.3SG
     ‘He is reading a/the book.’ [IIS IV 148]
(15) tošnuju   kaa-gu-iąP            soši       ńi-ą
     down.ADV  descend-DUR-3SG.REFL  hill(GEN)  on-ABL
     ‘Down he came from the hill.’ [LDB Supernatural]
The frequentative suffix -P/-r and its allomorph -Na present an action as
either frequently happening or being an instance of verbal plurality. In example
(16), the frequentative suffix is lexicalized and functions as the conjugation
class marker for verbs of the IIa class, e.g. d’orid’ ‘speak’.
(16) mensi-ąa          ańP  d’urak  baąa-an    d’uri-Na-š
     old.woman-PX.3SG  FOC  Nenets  word-PROL  speak-FREQ-3SG.PST
     ‘His wife spoke Tundra Nenets.’ [LDB Shaman]
The following clause shows one of the few examples with the other (etymo-
logically) related frequentative aspect suffix -rV/-lV:
(17) mod’  kirba       koru-xun      motu-ra-a
     1SG   bread(ACC)  knife-LOC.SG  cut-FREQ-OSG.1SG
     ‘I have cut up one loaf of bread with a knife.’ [ZNB IV 12]
4.3.3. The interaction of tense and aspect in Forest Enets

Both tense and aspect overlap in Forest Enets, though on a different concep-
tual basis as in Russian. First, it appears that the future tense in Forest Enets
does not allow any further aspect modification as there are currently no ex-
amples in my collected data.21 This means that aspect is only found within
present tense and past tense. Based on a corpus search via ELAN22 , the fol-
lowing picture emerges:
21. As future tense is negated like an aspect, there should be a historical connection between
them and in fact, Prokof’ev (1937) classified the future as aspect. For a variety of reasons
which cannot be addressed here in detail, this classification needs to be rejected on the
synchronic level but a historical connection is very likely.
22. The underlying corpus consists of 45 fully annotated narratives (all monologues) equaling
roughly 3.5 hours of spoken Forest Enets.
Table 2. Text frequency

Aspect suffix           Label            Total in transcribed  ∑ aorist         ∑ general past   ∑ relative past
                                         speech23
-gu/-ku                 durative         58                    23               16               none
-ra/-la                 inchoative       35                    27               none             5
-P/-r; -rV/-lV          frequentative    115                   85 (both types)  28 (both types)  1 (-rV/-lV type)
-do                     cumulative       none                  none             none             none
-ita                    delimitative     1                     1                none             none
-ubi/-mubi/-umbi/-mbi   habitual         66                    55               3                none
-ma                     resultative      45                    38               2                none
-ga/-ka                 discontinuative  10                    2                none             none
This picture is further supported by data from elicitation with one necessary
correction. Although absent from transcribed speech, the habitual is compat-
ible with relative past tenses, both perfect and distant past:
(18) kudaxai äku-xun šiąi po d’iri-ubi-ubi-ą-ud’
long.ago here-LOC.SG two year live-HAB-PERF-1SG-PST24
‘A long time ago I had been living here for two years.’ [ZNB IV 70]
A short note on the ratio of finite verbs without overt aspect marking vs.
finite verbs marked for aspect is in order. The currently used ELAN annotated
corpus consisting of narratives (all monologues) contains more than 2000
verbs25 out of which 335 are morphosyntactically infinite or semi-finite. This
means that out of around 1700 finite verbs only 330 finite verbs show overt
marking for aspect. In the current state of documentation, the following tense-
aspect combinations are attested:
Table 3. TA combination in Forest Enets

Aorist                    General Past              Perfect      Distant Past
durative, frequentative,  durative, frequentative,  inchoative,  inchoative,
inchoative, cumulative,   habitual, resultative,    habitual     habitual
delimitative, habitual,   discontinuative
resultative,
discontinuative
23. This number contains aspect markers on both finite and infinite verbs. In the columns
further to the right, only aspect markers on finite verbs were counted.
24. The /u/ is apparently added for phonotactic reasons.
25. Due to some inconsistencies in annotation, the exact number is currently unknown.
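The counts in Table 2 and the proportion of aspect-marked verbs can be cross-checked with a few lines of arithmetic. The following Python sketch is an illustration only and not part of the original study: the variable names are hypothetical, "none" is entered as 0, and the finite-verb total uses the approximate figure of 1700, since the exact number is unknown (fn. 25).

```python
# Counts copied from Table 2; "none" is entered as 0.
# Tuple order: total (finite and infinite verbs, fn. 23),
#              aorist, general past, relative past (finite verbs only).
table2 = {
    "durative":        (58, 23, 16, 0),
    "inchoative":      (35, 27, 0, 5),
    "frequentative":   (115, 85, 28, 1),
    "cumulative":      (0, 0, 0, 0),
    "delimitative":    (1, 1, 0, 0),
    "habitual":        (66, 55, 3, 0),
    "resultative":     (45, 38, 2, 0),
    "discontinuative": (10, 2, 0, 0),
}

# Assumption: "around 1700" finite verbs, as stated in the running text.
FINITE_VERBS = 1700

# Sum of the "Total in transcribed speech" column.
total_marked = sum(row[0] for row in table2.values())

# Sum of the three tense columns (finite verbs only).
finite_marked = sum(sum(row[1:]) for row in table2.values())

print(total_marked)                            # 330
print(finite_marked)                           # 286
print(f"{total_marked / FINITE_VERBS:.1%}")    # 19.4%
print(f"{finite_marked / FINITE_VERBS:.1%}")   # 16.8%
```

The Total column, which according to fn. 23 also includes aspect markers on infinite verbs, sums to 330, the figure cited in the text; the three tense columns sum to 286. On either count, fewer than one in five finite verbs carries overt aspect marking.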
4.4. Intermediate summary
When combining the aforementioned frequency data with several observations
presented in Section 4.3.1, there is good evidence for the derivational status
of aspect. Whereas the current corpus contains roughly 1700 finite verbs,
only 330 finite verbs show overt marking for aspect. Consequently, aspect is
not obligatory and, further, not inflectional. Additional distributional evidence
comes from several examples from Section 4.3.1. Aspect is possible with
finite and infinite categories, as shown in (4) and (5):
(4) a. tiąa šimi-lta-gu-š
reindeer.PX.ACC.PL.3SG run-CAUS-DUR-3SG.PST
‘He made his many reindeer run away.’ [LDB Shaman]
(5) no šeru-gu-š pä-baP
so bury-DUR-CON begin-1PL
‘...and so we began to bury (her).’ [EIB Clairvoyant]
However, tense and person marking are generally restricted to finite verbs,
including the negative auxiliary:26
(4) b. tonuju tuxudačuP oka-P ä-ubi-P ańP
summer.ADV fly(NOM.PL) many-PL be-HAB-3PL FOC
ńe-ąP mui-ubiP
NEG.AUX-1SG make-HAB-CN
‘In summer, there are usually many flies and so I don’t make it
(=poxi).’ [NKB Jukola]
Finally, as already mentioned in examples (11)–(12), the translative-resultative
aspect in -ma strongly supports the derivational interpretation of aspect:
besides functioning as an aspect marker, it can derive resultative verbs
from nouns, e.g. bu busima <3SG old.man.3SG> ‘he turned into an old man’
(from busi ‘old man’).
26. A specialized infinite converb in -bu allows additional tense marking relying on a mor-
pheme which is absent from finite morphology, but this has no implication for the discus-
sion here (Siegl, forthcoming, Chapter 12.3).
4.5. Forest Enets aspect in contrast to Russian aspect – dissimilar similarities
As the preceding discussion has shown, several major differences between the
Forest Enets and Russian aspect systems are evident. Whereas Russian imperfective
simplex verbs form aspect pairs by prefixation, for which a small set
of specialized prefixes is in use, almost every Forest Enets aspect marker
has a specialized fixed meaning, thereby being closer to the one-form-one-function
principle generally associated with agglutinative morphology.27 The
obvious exception is the verb class IIa, which consists of verbs historically
deriving from frequentatives in -P/-r; here one has to assume lexicalization.
The second major difference is the absence of secondary aspect pairs of the
type perepisat’ <PFT> / perepisyvat’ <IPF> ‘rewrite’.
When we turn to Russian simplex verbs, another conceptual difference
between the Russian imperfective aspect and Forest Enets aspect is readily
observable. The imperfective aspect in Russian is used to express both on-
going and generic actions. This means that the Russian imperfective aspect
allows both a process and a habitual interpretation. For the latter, a suitable
adverb is needed:
(19) utrom on čita-l knigu
morning.INSTR.MASC.SG he read-PST.MASC.SG book.ACC.FEM.SG
‘In the morning, he was reading a book.’
→ imperfective and habitual reading possible
(20) utrom on obyčno čita-l
morning.INSTR.MASC.SG he generally read-PST.MASC.SG
knigu
book.ACC.FEM.SG
‘In the morning, he was usually reading in a book / he usually read in
a book.’
→ habitual reading of the imperfective
In contrast, Forest Enets uses two different aspect derivations, thereby unam-
biguously expressing one or the other situation:
27. Although not of direct relevance here, Russian prefixes are related to spatiality as a large
number of prefixes appear as freestanding prepositions. Although the grammaticalization
history of Forest Enets aspect suffixes is unknown, spatiality is clearly not underlying their
development.
(21) bu kńiga tota-go
3SG book(ACC) read-DUR.3SG
‘He is reading a book.’ [IIS IV 148]
(22) kiuąnuju mod’ peri kńiga tota-ubi-ąP
morning.ADV 1SG always book(ACC) read-HAB-1SG
‘In the morning I usually read a book. (lit.: I am usually reading...)’
[IIS IV 148]
Whereas the usage of aspect has not changed in the course of the last three
decades for which textual material is available, Sorokina’s description of aspect
in Forest Enets has changed profoundly. In her only article dedicated to
aspect in Forest Enets, Sorokina (1975) clearly stated that although -gu/-ku
indeed comes close to Russian imperfective verbs, this parallel is superficial.28
In 2009, Sorokina and Bolina published an extended Forest Enets –
Russian dictionary containing a short grammatical sketch of the language,
though without an inventory of aspect markers. This sketch departs markedly
from her 1975 position:
In Forest Enets the category aspect can be distinguished. For every verb, an-
other verb expressing an opposite aspect value does exist. The perfective verb
is formally unmarked; from these, imperfective verbs are generated, apart
from verbs which are semantically imperfective. The appearance of aspect
is morphologically very diverse – by alternating the stem sound, by change
of conjugation type, by suffix derivation and so on. Side by side with aspect
forms, there exists a large number of suffixes which express the process of
actions such as length, repetition, reiteration. (Sorokina and Bolina 2009: 27
– my translation, F.S.)
Change of conjugation type is indeed attested but marginal; it affects several
intransitive verbs from conjugation I which can be conjugated in conjugation
III, resulting in a fine-grained difference, most usually expressing semelfactivity:
(23) a. bu čii-š
3SG fly-3SG.PST
‘He flew.’ [ZNB III 17]
28. “With the help of the suffix -gu verbal bases are derived whose meaning is very near to
the Russian imperfective aspect.” (Sorokina 1975: 134 – my translation F.S.).
b. bu či-iąP
3SG fly-3SG.REFL
‘He started to fly / he flew away.’ [ZNB III 17]
However, this behavior is apparently lexically triggered and therefore not
productive. The other strategy, stem sound alternation, is unknown to me.
In the sketch grammar accompanying the dictionary, Sorokina and Bolina
(2009) neither present any examples nor do they provide the inventory of as-
pect markers. Hence an overall evaluation of this statement is impossible at
present.
The third major and ultimately crucial difference between aspect in Rus-
sian and Forest Enets concerns its status. Whereas the discussion of Russian
aspect is always cast in terms of a binary opposition perfective/imperfective,
there is indeed no evidence for such an opposition in Forest Enets. Of course,
aspectual derivations in Forest Enets can be assigned to either perfective or
imperfective aspect; durative, frequentative, cumulative, delimitative, habitual
and discontinuative do not express the outcome of change around a special
point of time but focus on the inherent process and could therefore be seen as
imperfective. On the other hand, both inchoative aspect and resultative aspect
focus on the result of change, albeit on two different points of change (begin-
ning vs. end) and should be seen as perfective. What is clearly absent from
Forest Enets are both a clear-cut binary perfective/imperfective dichotomy
characteristic of Russian, and, as already mentioned, the possibility to create
something comparable to secondary aspect pairs in Russian. When a given
Forest Enets verbal lexeme has undergone aspect derivation, further aspect
derivations are not possible.
5. Representation of aspect in Forest Enets–Russian–Forest Enets dictionaries
So far, we have approached the description of aspect from the perspective of
grammaticography. We will now turn to lexicography where the influence of
Russian is clearly visible as the prevailing form of representation of verbs is
in fact a distortion of the Forest Enets aspect system.
To date, two specialized Forest Enets-Russian dictionaries are available;
the first dictionary, Sorokina and Bolina (2001), henceforth referred to as
ERRE, is a small bilingual school-dictionary, comprising around 6000 words
including both a Forest Enets-Russian and a Russian-Forest Enets section.
Published by the most important pedagogic publishing house for indigenous
languages in the Russian Federation and intended as a school dictionary, this
dictionary is the first full-fledged lexicographic resource of the language and
the only bilingual dictionary available.29 The other dictionary, Sorokina and
Bolina (2009), henceforth referred to as ES, is a bilingual Forest Enets-Russian
dictionary intended for a scientific audience and published by the Russian
Academy of Sciences. In contrast to ERRE, ES also includes a small but
incomplete grammatical sketch. Otherwise both dictionaries are identical in the
way the entries were compiled in the Forest Enets-Russian section and there-
fore can be discussed as one. The Russian-Forest Enets part in ERRE will be
addressed later.
In general, the representation of verbs is uniform. Verbs which are not
aspect marked, e.g. pärąiš ‘help’ are represented as an independent entry, and
so are verbs marked with the durative aspect -gu/-ku, e.g. pärąiguš ‘help<DUR>’.
Thus verbs may be separated by other entries, but nevertheless form a kind of
pair:30
(24) pärąiš (Russian: pomoč’) ‘help<PFT>’
bu kasa-da pärąiä
3SG brother-PX.ACC.3SG help.3SG
‘He helped his brother.’ [ERRE 102]
(25) pärąiguš (Russian: pomogat’) ‘help<IPF>’
bu lata kolta-gu-š pärąi-gu
3SG floor(ACC) clean-DUR-CON help-DUR.3SG
‘(S)he was helping cleaning the floor.’ [ERRE 102]
Apart from the durative, only verbs belonging to inflection class IIa, which
are almost all lexicalized frequentatives, are represented throughout both dic-
tionaries:
29. In comparison to other dictionaries published by the same publishing company, ERRE
is indeed fairly well compiled as it is more than just a bilingual word list. Although in-
tended as a school dictionary, Forest Enets lacks both literacy and educational materials
and therefore the potential readership does not contain any schoolchildren or L2 learners,
but language activists and researchers.
30. The following examples which are originally given in a Cyrillic orthography are repre-
sented in the same practical orthography as used above and in Siegl (forthcoming).
(26) ped’ (Russian: iskat’, razyskivat’) ‘search’
obu äkun pe-Na-d
what here search-FREQ-2SG
‘What are you looking for?’
bu ńi pe-rP
3SG NEG.AUX.3SG search-FREQ.CN
‘He is not searching’ [ES 331]
Nevertheless, this form of representation (zero vs. -gu/-ku) is indeed arbi-
trary, as other aspect markers which can co-occur on infinite verbs (e.g. the
manner converb which serves as the citation form in both dictionaries) are
almost entirely excluded. Whereas forms such as the one presented above in
example (9), souxu-do-š <improve-CUM-CON>, or d’ori-mubi-š <speak-HAB-CON>
are perfectly possible, such forms are explicitly represented neither in
ERRE nor in ES; the only aspect which can be found occasionally in both
dictionaries is the translative-resultative.31 In fact, there is no obvious
language-internal explanation for the privileged role of the durative -gu/-ku;
the only apparent motivation is its functional equivalence to the Russian
imperfective aspect. In the current format of representation, the “potential
pair” pärąiš vs. pärąiguš is nothing more than a stylized imitation of
Russian aspect pairs such as pisat’ vs. napisat’, with the sole
difference that in Forest Enets, the durative derived forms are imperfective
whereas in Russian, the underived forms are usually imperfective.
When comparing this solution to the major Tundra Nenets Russian dic-
tionary (Tereščenko 1965), still a milestone in Samoyedic lexicography, the
choice of representation in both ERRE and ES is surprising; although Tere-
ščenko’s solution takes up somewhat more space, every Tundra Nenets verb,
regardless of whether it carries an aspect derivation or is underived, is repre-
sented as an independent entry in the dictionary. Derived verbs are cross-
referenced with underived ones, and in this manner, artificial creation of
quasi-aspect pairs is avoided.
Turning to the opposite direction, i.e. Russian to Forest Enets, only found
in ERRE, the situation is not too different from the Forest Enets to Russian
31. In the foreword of both dictionaries (ERRE 8; ES 34) the possibility of using the converb
with other aspect suffixes is at least mentioned. The chosen format of representation with
its heavy bias towards the durative, neglecting almost entirely other aspectual derivations,
is, however, not justified.
section of both dictionaries. As a necessary prerequisite it must be mentioned
that Russian verbs are not presented as aspect pairs but as two individual
entries, thereby not following the conventions as sketched by Berkov
(1996: 115–116). Among the Forest Enets aspect suffixes, only unmarked,
durative and class IIa lexicalized frequentatives can be found. This arbitrary
choice is again not adequate, as a Russian imperfective, as already shown,
may correspond to more than one Forest Enets aspect suffix. However, this
information cannot be retrieved from the dictionary. The same is in princi-
ple also valid for Russian perfective verbs and their Forest Enets equivalents.
Again, a more fitting solution can be found in an early Russian-Tundra Nenets
school dictionary (Pyrerka and Tereščenko 1948), which after more than half
a century still looks appealing. Also in this dictionary, Russian verbs were not
represented as aspect pairs, but if more than one Tundra Nenets verb served
as potential translation equivalent of a Russian verb, the appropriate aspect
derivations were listed. Comparing both approaches, one is obliged to con-
clude that the Tundra Nenets solutions are more appropriate for speakers of
both languages, and one cannot help but wonder why in the Forest Enets case,
a different solution was chosen.
5.1. The possible place of aspect in a hypothetical Forest Enets to English dictionary
The previous section has shown that the current format of representation is
unsuitable, as its only merit is to make Forest Enets aspect look closer to
Russian than it actually is. This obviously calls for a different solution and
in the remainder, some preliminary thoughts will be offered as to how such
a solution might look. First, as Forest Enets is on the verge of extinction
and language revitalization is hardly viable, neither monolingual dictionar-
ies nor bilingual Forest Enets Russian dictionaries targeted at native speakers
of Forest Enets are realistic undertakings. The possible user of such a paper
dictionary will most likely be a person not speaking Forest Enets, and most
probably a linguist, and any dictionary must inevitably include grammatical
information on verbs concerning inflection class, conjugation class, and as-
pect derivation. In cases where aspect morphology is lexicalized the answer
is relatively easy. The translation equivalent of ‘love’ komitaš is apparently
a delimitative lexicalized aspect form of komaš ‘want’, also treated as an in-
dependent entry in Sorokina and Bolina (2001: 188) and can no longer serve
as a subentry for ‘want’. But what should be done with e.g. pärąiš ‘help’?
As both durative and habitual aspect can be formed without problems, there
is intuitively no reason for several lemmata, as both meaning and formation
are regular. Initially, also the verb moktaš ‘put up a traditional tent’ does not
seem to be problematic, as again both durative and habitual aspect can be
formed without problems. However, moktaš cannot be used with the inchoat-
ive aspect marker -ra/-la; instead a periphrastic construction with päš ‘begin’
is necessary, e.g. uu mokta-š pä-d <2SG put.up-CON begin-2SG> ‘you be-
gan putting up (a traditional tent).’ In fact, other verbs also block inchoative
-ra/-la and require päš ‘begin’, which shows that the inchoative aspect is
more idiosyncratic and should be listed in a dictionary. Furthermore, as al-
ready mentioned above, class IIa verbs (frequentatives) block durative aspect
but may allow habitual and inchoative, although again with some restrictions.
And finally, although the vast majority of class IIa verbs are frequentative, e.g.
d’orid’ ‘speak’ → bu d’ori-Na <3SG speak-FREQ.3SG> ‘he is speaking’,
some IIa verbs can be either frequentative or translative-resultative e.g. ood’
→ oo-Na <eat-FREQ.3SG> ‘is eating’ or ood’ → oo-ma <eat-RES.3SG>
‘has eaten’. Although this needs more detailed study, there are good reasons
to doubt the productivity of some aspect markers. Their derivational nature
is still rather obvious, and it would therefore be desirable to include forma-
tions with these suffixes as distinct entries, even at the risk of including some
redundancy in the presentation. Such a path was also chosen in Tereščenko’s
large Tundra Nenets Russian dictionary (Tereščenko 1965).
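The considerations above can be made concrete with a small data-structure sketch. The following Python fragment is a hypothetical illustration, not a proposal from the sources discussed: the class name, field names and aspect labels are assumptions, and only verbs and restrictions mentioned in this section are entered. Each verb is an independent entry that records its inflection class where known, the aspect derivations it forms regularly, any derived forms worth listing, and any blocked suffix together with the construction used instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VerbEntry:
    """One lemma in a hypothetical Forest Enets to English dictionary."""
    lemma: str                                  # citation form (manner converb)
    gloss: str
    inflection_class: Optional[str] = None      # e.g. "IIa"; None if not recorded
    regular_aspects: List[str] = field(default_factory=list)    # formed regularly
    listed_forms: Dict[str, str] = field(default_factory=dict)  # aspect -> derived form
    blocked: Dict[str, str] = field(default_factory=dict)       # aspect -> alternative
    lexicalized_from: Optional[str] = None      # cross-reference to a base lemma

    def allows(self, aspect: str) -> bool:
        """True unless the aspect suffix is recorded as blocked for this verb."""
        return aspect not in self.blocked


# Entries restricted to what the text states; everything else is left empty.
entries = [
    VerbEntry("pärąiš", "help",
              regular_aspects=["DUR", "HAB"],
              listed_forms={"DUR": "pärąiguš"}),
    VerbEntry("moktaš", "put up a traditional tent",
              regular_aspects=["DUR", "HAB"],
              blocked={"INCH": "periphrasis: mokta-š + päš 'begin'"}),
    VerbEntry("komaš", "want"),
    # Lexicalized delimitative: an independent entry, cross-referenced to its base.
    VerbEntry("komitaš", "love", lexicalized_from="komaš"),
]

lexicon = {entry.lemma: entry for entry in entries}
```

With such records, `lexicon["moktaš"].allows("INCH")` is false, which points the user to the periphrastic construction with päš, while the lexicalized komitaš remains an entry of its own that is merely cross-referenced to komaš.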
6. Conclusion – aspect, grammaticography, lexicography and Forest Enets
Although it is quite obvious that aspect plays a significant role in the gram-
mar of Forest Enets, there is no evidence for a systematic binary imperfec-
tive/perfective opposition in the available language material. Furthermore, the
direct comparison with Russian has demonstrated that aspect in Forest Enets
operates on a different conceptual basis. Both Forest Enets dictionaries, per-
haps accidentally, presented Forest Enets verbs in aspect pairs by exagger-
ating the function of the durative marker, which led to the emergence of a
superficial parallel to Russian. Also in the accompanying sketch grammar to
Sorokina and Bolina (2009), the Forest Enets aspect system appeared more
Russian-like than in an earlier description (Sorokina 1975). What I have tried
to show is that these recent approaches are problematic and that there is no
immediate need to depart from the earlier position of N. Tereščenko and G.
Prokof’ev, which acknowledged the importance of aspect but did not try to
construct any parallel to the Russian system. Concerning the overall position
of aspect within the grammatical structure of Forest Enets, aspect shows a
relatively clear and stable form-meaning mapping, an argument which is not
valid for aspect in Russian. Nevertheless, Forest Enets aspect is derivational
and idiosyncratic to a certain degree and therefore should be listed in the
dictionary.
Although the description of aspect in Siberian indigenous languages is
most definitely not a problem restricted to Forest Enets, the problem itself
is symptomatic of, and lends support to, a central concern in Sasse (2002): local
traditions and metalanguages are still too influential in the study of aspect,
and this study adds further support from a language outside the scope of
Sasse’s paper. Apart from the possible theoretical implications which Forest
Enets could have for aspectology, the study of aspect in Siberian languages
again underscores one of the most urgent challenges of language documentation
and language description: grammatical categories should be described without
interference from grammatical descriptions and traditions of majority or
related languages.
Abbreviations and symbols

e.g. [ZNB III 18]          reference to data from elicitation
e.g. [LDB Supernatural]    reference to transcribed and annotated narrative

1, 2, 3    first, second, third person
ABL        ablative marker (postposition)
ACC        accusative case
ADV        adverbial
ASS        assertative mood
CAUS       causative
CN         connegative form of negated lexical verb
CON        converb
CUM        cumulative aspect
DEL        delimitative aspect
DISC       discontinuative aspect
DUR        durative aspect
FEM        feminine
FOC        focus particle
FREQ       frequentative aspect
FUT        future tense
GEN        genitive
HAB        habitual aspect
INCH       inchoative aspect
INF        infinitive
INSTR      instrumental case
IPF        imperfective
LOC        locative
MASC       masculine
NEG.AUX    auxiliary used for negation
NOM        nominative case
OSG        singular object
PERF       perfect tense
PFT        perfective
PL         plural
PROL       prolative case
PST        general past tense
PX         possessive declension of nouns
REFL       reflexive
RES        resultative aspect
SG         singular
References
Berkov, Valerij P. 1996. Dvujazyčnaja leksikografija [Bilingual lexicogra-
phy]. St. Peterburg: Izdatel’stvo Sankt-Peterburgskogo Universiteta.
Bondarko, Aleksandr V. 2002. Glagol’nyj vid v sisteme grammatičeskix
kategorii (na materiale russkogo jazyka) [Verbal aspect in the system of
grammatical categories (on the basis of Russian)]. In Osnovnye problemy
russkoj aspektologii, 30–43. St. Peterburg: Nauka.
Castrén, Mathias Alexander. 1854. Grammatik der samojedischen Sprachen
– herausgegeben von Anton Schiefner. St. Petersburg: Buchdruckerei der
Kaiserlichen Akademie der Wissenschaften.
Eismann, Wolfgang. 1991. Die zweisprachige Lexikographie mit Russisch.
In Wörterbücher. Ein internationales Handbuch zur Lexikographie. Dritter
Teilband, eds. Franz Josef Hausmann, Oskar Reichmann, and Herbert E.
Wiegand, 3068–3085. Berlin, New York: Walter de Gruyter.
Jachnow, Helmut. 1990. Russische Lexikographie. In Wörterbücher.
Ein internationales Handbuch zur Lexikographie. Zweiter Teilband, eds.
Franz Josef Hausmann, Oskar Reichmann, and Herbert E. Wiegand,
2309–2329. Berlin, New York: Walter de Gruyter.
Jacobson, Steven A. 1990. Comparison of Central Alaskan Yup’ik Eskimo
and Central Siberian Yupik Eskimo. International Journal of American
Janhunen, Juha. 1998. Samoyedic. In The Uralic Languages, ed. Daniel
Abondolo, 357–379. London, New York: Routledge.
Kazakevič, Olga A. 2008. K voprosu o modeljax opisanija sel’kupskoj
glagol’noj derivatsii [A question about the descriptive model of deriva-
tion in Selkup]. In Issledovanija po glagol’noj derivatsii – Sbornik statej,
114–126. Moskva: Jazyki slavjanskix kul’tur.
Künnap, Ago. 1999. Enets. München: Lincom Europa.

Kuznetsova, Ariadna I. 2008. Aspektual’naja derivatsija v sel’kupskom
jazyke: semantičeskaja modifikatsija glagola i sposoby ee vyraženija [As-
pectual derivation in the Selkup language: verbal semantic modification
and its means of expression]. In Issledovanija po glagol’noj derivatsii –
Sbornik statej, 102–113. Moskva: Jazyki slavjanskix kul’tur.
Kuznetsova, Ariadna I., Evgenij A. Xelimskij, and Elena V. Gruškina, eds.
1980. Očerk po sel’kupskomu jazyku. Tazovskij dialekt. Tom 1. [A gram-
matical sketch of Selkup – the Taz dialect. Volume 1]. Moskva: Izdatel’stvo
Moskovskogo Gosudarstvennogo Universiteta.
Mosel, Ulrike. 2004. Dictionary making in endangered speech communi-
ties. In Language Documentation and Description, Volume 2, ed. Peter K.
Mosel, Ulrike. 2006b. Grammaticography, the art and craft of writing gram-
mars. In Catching Language: The Standing Challenge of Grammar Writ-
ing, eds. Felix Ameka, Alan Dench, and Nicholas Evans, 41–68. Berlin,
New York: Mouton de Gruyter.
Plungian, Vladimir A., ed. 2003. Obščaja morfologija – vvedenie v prob-
lematiku [General morphology – an introduction to its problems]. Moskva:
URSS.
Prokof’ev, Georgij N. 1935. Sel’kupskaja grammatika [A grammar of
Selkup]. Leningrad: Izdatel’stvo Instituta Narodov Severa CIK SSR.
Prokof’ev, Georgij N., ed. 1937. Jazyki i pis’mennost’ narodov Severa čast
1 – Jazyki i pis’mennost’ samojedskix i finno-ugorskix narodov [The lan-
guages and literacy of the people of the North part 1 – The languages and
literacy of the Samoyedic and Finno-Ugric people]. Moskva, Leningrad:
Gosudarstvennoe učebno-pedagogičeskoe izdatel’stvo.
Pyrerka, Anton P., and Natalija M. Tereščenko. 1948. Russko-nenetskij slo-
var’ [Russian-Tundra Nenets dictionary]. Moskva: Ogiz-Gis.
Sasse, Hans-Jürgen. 2002. Recent activity in the theory of aspect: Accom-
plishments, achievements, or just non-progressive state? Linguistic Typol-
ogy 6:199–271.
Siegl, Florian. 2010. How to prepare for fieldwork – a Forest Enets based ret-
rospective. In Kenttäretkistä tutkimustiedoksi [From Fieldwork to Research
Results], eds. Paula Kokkonen and Anna Kurvinen, 213–240. Helsinki:
University of Helsinki and the Finno-Ugrian Society.
Siegl, Florian. Forthcoming. Materials on Forest Enets, an indigenous lan-
guage of Northern Siberia. Helsinki: Finno-Ugrian Society.
Sorokina, Irina P. 1975. Kategorija glagol’nogo vida v enetskom jazyke [The
category of verbal aspect in the Enets language]. In Jazyki i Toponimija
Sibiri VII, 131–138. Tomsk: Izdatel’stvo Tomskogo Universiteta.
Sorokina, Irina P., and Dar’ja S. Bolina. 2001. Slovar’ enetsko-russkij i
russko-enetskij [Enets-Russian Russian-Enets dictionary]. St. Peterburg:
Prosveščenie.
Sorokina, Irina P., and Dar’ja S. Bolina. 2009. Enetskij slovar’ s kratkim
grammatičeskim očerkom [Enets-Russian dictionary with a short grammat-
ical sketch]. St. Peterburg: Nauka.
Tereščenko, Natalija M. 1965. Nenetsko-russkij slovar’ [Nenets-Russian dic-
tionary]. Moskva: Izdatel’stvo Sovetskaja Entsiklopedija.
Tereščenko, Natalija M. 1947. Očerk grammatiki nenetskogo (jurako-samo-
jedskogo) jazyka [A grammatical sketch of the Nenets (Jurak-Samoyed)
language]. Leningrad: Učpedgiz.
Tereščenko, Natalija M. 1959. V pomošč’ samostojatel’no izučajuščim nenet-
skij jazyk: opyt sopostavitel’noj grammatiki nenetskogo i russkogo jazykov
[Help for the independent learner of the Nenets languages: an experimental
contrastive grammar of the Nenets and the Russian language]. Leningrad:
Učpedgiz.
Tereščenko, Natalija M. 1966. Enetskij jazyk [The Enets language]. In
Jazyki narodov SSSR III: Finno-ugorskie i samodijskie jazyki, ed. Vassilij I.
Lytkin, 438–457. Moskva: Nauka.
Timberlake, Alan. 2004. A Reference Grammar of Russian. Cambridge:
Cambridge University Press.
Timberlake, Alan. 2007. Aspect, tense, mood. In Language Typology and
Syntactic Description. Volume 3: Grammatical Categories and Lexicon,
ed. Timothy Shopen, 280–333. Cambridge: Cambridge University Press.
Ulving, Tor. 1971. Observations on the language of the Asiatic Eskimo as
presented in Soviet linguistic works. Linguistics 69:87–119.
Vajda, Edward J. 2004. Ket. München: Lincom Europa.
Vajda, Edward J., and Marina Zinn. 2004. Morphological Dictionary of the
Ket Verb: Southern Dialect. Tomsk: Tomsk Pedagogical University Press.
Wagner-Nagy, Beáta. 2001. Die Wortbildung im Nganasanischen. Szeged:
SzTE Finnugor Tanszék.
Werner, Heinrich. 1997. Die ketische Sprache. Wiesbaden: Harrassowitz
Verlag.
Zaliznjak, Andrej D., and Anna A. Šmelev. 2000. Vvedenie v russkuju aspek-
tologiju [Introduction to Russian Aspectology]. Moskva: Jazyki russkoj
kul’tury.
Chapter 7
Documentary linguistics and prosodic evidence
for the syntax of spoken language∗
Candide Simard and Eva Schultze-Berndt
1. Introduction
The documentary linguistics approach – with its emphasis on primary audio-
or video-recorded data supplemented by annotations – makes it possible to
address seriously the syntax of spontaneous spoken language even in lesser
known languages. We would like to argue here that the syntactic description
of spoken language crucially needs to take into account prosodic phenomena
such as prosodic breaks of different strengths, prosodic prominence, pitch
range, and intonation contours associated with particular constructions. Pro-
sodic analyses in documentary linguistics can contribute both to enriching the
description of a language itself and to extending our empirical coverage of the
diversity of prosodic phenomena in human languages, the inventory of which
has focused so far on an incomplete sample of Germanic, Romance and Asian
languages which are in all likelihood non-representative of the complexity in
this domain (Himmelmann 2006; Himmelmann and Ladd 2008).
∗ We wish to acknowledge the patience and knowledge of the many Jaminjung speakers,
some of whom are now deceased, who have worked with us over the years in the communities
in Timber Creek, Katherine and Kununurra.
We are also grateful for the funding received from the DoBeS programme of the Volkswagen
Foundation for the documentation of the linguistic and cultural knowledge of Jaminjung
and other languages of the Victoria River district, as well as previous research funding
received by the second author from the Max Planck Society and AIATSIS (Australian
Institute of Aboriginal and Torres Strait Islander Studies).
We would like to dedicate this paper to Ulrike Mosel who is a role model to us and so
many other linguists both in terms of her commitment as a fieldworker to the communities
she works with, and in terms of her linguistic work which succeeds in bringing to life the
language as it is used by speakers.
In this paper, we present a case study demonstrating that it is both feasible
(despite methodological challenges) to study the prosodic system of an
under-documented language, and necessary to incorporate prosodic phenomena
in the analysis of syntactic constructions if one aims at a comprehensive
grammatical description of a language. Our case study is partly based
on a quantitative study of intonation (Simard 2010) in Jaminjung, a severely
endangered language of Australia. Section 2 provides some background in-
formation on this language, and summarizes our methodology as well as the
problems we encountered regarding annotating and searching spoken lan-
guage corpora for grammatical analysis with prosodic phenomena in mind.
We will also introduce the prosodic model PENTA used for this analysis.
In Section 3, we discuss actual examples of the relevance of prosody for
syntactic phenomena in Jaminjung. We will concentrate on the analysis of
prosodically coherent units spanning more than one intonation unit, includ-
ing pragmatically (rather than syntactically) dependent non-finite construc-
tions, afterthoughts, and right-dislocated noun phrases. Section 4 concludes
the paper. In the remainder of this introductory section, we will outline our
theoretical background and assumptions.
We define prosody as encompassing the properties that create the melody
of speech, with the understanding that this melody is not random, but rather
that its properties form an organized, meaningful system that lends itself
to investigation. Its main purpose is to aid the transmission of information
in communication. Among other things, prosody serves to mark boundaries
and define transitions between words, phrases or sentences, and to accentu-
ate certain elements in an utterance. The relationship between prosody and
syntax continues to be the subject of much debate, mainly concerning, in
modular approaches, whether prosody is derived from syntax or whether
it constitutes an independent module. The approach taken here is a mono-
stratal, construction-based one in which prosodic characteristics are integral
components of syntactic constructions, similar to that taken e.g. by Diessel
(1997: 55–57) on polar questions, Michaelis and Lambrecht (1996) on nom-
inal extraposition, and by Lakoff (1987: 527–529) on the Paragon-Intonation
Construction (as in Now THERE ... is a real cup of coffee). In Construc-
tion Grammar, language is viewed as a repertoire of more or less complex
patterns (constructions) in which form and meaning are integrated in conven-
tionalized and, in some respects, non-compositional ways. “Form” can refer
to any syntactic, morphological, or prosodic pattern (or to combinations of
such patterns), and “meaning” is taken here in a broad sense that includes not
only truth-conditional semantics but also grammatical meaning related to the
packaging of information in discourse structure (cf. e.g. Goldberg 1995; Kay

and Fillmore 1999; Lambrecht 1994; Croft 2001).
One domain of grammar where the importance of prosody is particularly
apparent is that of information structure. Accent placement in a clause may,
for example, be solely responsible for the distinction, in a language, between
predicate or “broad” focus (as in English I’m PAYing), and argument or “nar-
row” focus (as in English I’M paying), while in other languages, word order,
morphological focus marking, or specific constructions like clefts have to be
employed to mark the distinction, often in addition to prosodic marking (see
e.g. Lambrecht 1994: 292; Drubig 2003). In Sections 3.3 to 3.5 we will show
how, in Jaminjung, prosodic contours serve to distinguish two subtypes of
right-dislocated elements with different information structure status.
One of the contentious issues in the debate about the prosody-syntax in-
terface concerns the relationship between prosodic units such as intonation
units or prosodic phrases, and grammatical units, and more specifically, the
question whether the “same” grammatical construction may be distributed
over more than one intonation unit. Opposing a view shared by many func-
tionally oriented linguists and typologists (e.g. Halliday 1985; Chafe 1994;
Croft 1995, 2007) as well as conversation analysts (Auer 1991, 1996; Selting
1995, 2000) that prosodic units (“intonation units”, “information units”) con-
stitute a separate level of analysis from grammatical units such as the clause,
we adopt the view that grammatical units in spoken language cannot, in fact,
be defined without reference to prosodic units. This view is supported by
case studies showing that, for instance, the presence of a phrasal or intona-
tion unit boundary can distinguish grammatical constructions such as voca-
tives vs. appositions, appositions vs. lists, restrictive vs. non-restrictive rel-
ative clauses, and other types of intonationally integrated vs. non-integrated
subordinate clauses (cf. e.g. Bing 1984; Couper-Kuhlen 1986: 141–151; Ver-
straete 1998). For Jaminjung, we will argue in Section 3 that phrasal con-
structions such as complex predicates need to be distinguished from construc-
tions where comparable constituents are separated by an intonation boundary
because they have different pragmatic functions. The same applies to noun
phrases which are prosodically integrated into a clause vs. those appearing
as right-dislocated topics and afterthoughts. In particular, we argue that even
discontinuous noun phrases contained within a single intonation unit need to
be distinguished from such prosodically detached noun phrases.
2. Fieldwork-based prosodic research

This paper is based on fieldwork conducted by both authors on Jaminjung
over a number of years. The language name Jaminjung is used here for two
named varieties, Jaminjung and Ngaliwurru, which are mutually intelligi-
ble and exhibit mainly lexical differences. They belong to the small Jam-
injungan (or Western Mirndi) subgroup of the geographically discontinuous
Mirndi family (Chadwick 1997; Harvey 2008). The traditional country of
Jaminjung and Ngaliwurru speakers is located north and south of the Victoria
River around the present-day township of Timber Creek in the Northern Ter-
ritory, shown in Figure 1. The remaining few dozen speakers of Jaminjung
and Ngaliwurru today are all elderly; the younger generations speak Kriol,
an English-lexified creole language which is now the lingua franca of a large
area in Northern Australia. The main grammatical characteristics of Jamin-
jung will be introduced in Section 3.
In this situation of language endangerment, the collection and preparation
of datasets for prosodic analysis posed methodological challenges which will
be discussed in Section 2.1. While we recognize that some of these challenges
may be specific to our research situation, we believe that those particularly
concerning the prosodic analysis would apply in many language documenta-
tion situations. We also introduce here a less widely known model of prosodic
analysis, the PENTA model (Section 2.2).
2.1. Fieldwork and datasets

Prosodic research aims at uncovering linguistically relevant prosodic events
in speech. It is widely recognized that working with a corpus of spontaneous
speech is difficult as prosodic patterns may be influenced by factors as di-
verse as syllable structure, word structure, tonal context, sentence type, lo-
cation in sentence, location in paragraph, and so on (Xu 2010: 335). In or-
der to limit these contextual influences, a methodology based on controlled
datasets is most commonly advocated, even in the context of documenting
the prosodic phenomena in endangered languages. Himmelmann (2006: 163)
recommends, for instance, having the same utterance produced by a number of
different speakers (or at least recording multiple versions of the same
utterance).
Figure 1. Map of the Northern Territory, showing the Victoria River District.
However, for Jaminjung, such a strategy proved very difficult to apply. Firstly,
the very concept of repeated sentences is an unlikely one in a language in
which word order is regulated by information structure (hence information
that is new the first time a sentence is uttered may not be judged to be so
when repeated, provoking a re-ordering of the constituents). Further, speakers
are mostly non-literate so written stimuli cannot be used; moreover, speakers

do not relish participating in artificial communicative activities, and although
they always tried to comply when presented with various elicitation tools,
more culturally relevant and significant situations (e.g. explaining plant uses)
proved better suited to data collection.
Our datasets were thus selected from a corpus of spontaneous or at least
unread speech. This corpus includes narratives consisting of personal anec-
dotes and mythological stories, picture-prompt narratives based on more
widely used materials, such as the Frog Story (Mayer 1994[1969]), and some
of the tasks from the Questionnaire for Information Structure (QUIS) materi-
als developed as part of the SFB 632 Information Structure research project
(Skopeteas et al. 2006). It also includes data recorded in the course of the
documentation of the ethnobiological knowledge of the speakers. Record-
ings usually involved more than one speaker. These recordings of various
discourse genres are not only rich in phonological details and provide many
examples of how language is used in context, but they also have a strong
appeal to the community because of what they preserve.
Understandably, the selection of representative tokens for prosodic analy-
sis out of such a large corpus is most time-consuming. And as truly compara-
ble contexts are infrequent, our datasets are limited in number. Nonetheless,
we contend that the analysis is still both fruitful and feasible (see Section
3.3): patterns must be identifiable, otherwise speakers would not use them in
their interactions.
The challenges are not limited to the actual collection of appropriate data
and the selection of datasets but continue with the processing of the
materials. In our research, we have had to contend with both archiving and
prosodic analysis requirements which could not be met with a single software
tool. Signal processing, speech analyses and some annotations for prosodic
analysis were carried out in Praat (Boersma and Weenink 2001), most other
annotations were undertaken in Toolbox,¹ and archivable files were created
in ELAN:² each tool has capabilities that are not available in the others.
1. http://www.sil.org/computing/catalog/show_software.asp?id=79
2. http://www.lat-mpi.eu/tools/ELAN
Praat is used to segment the utterances in the datasets into syllables, based
on an inspection of the pitch track and other acoustic cues; it also has a very
useful scripting facility which aids prosodic analyses greatly. Toolbox was
used for another type of annotation, the semi-automatic interlinear glossing,

a feature not provided by the two other tools. Finally, ELAN offers the pos-
sibility of creating archivable, and searchable, annotations at different levels.
In this research, file conversions had to be undertaken both between Toolbox
and ELAN and between the time-aligned transcriptions in ELAN and Praat
files. The latter proved particularly problematic. The conversion protocols are
easy enough to implement, but the alignment of marked segments is not pre-
served in the conversion, causing a considerable loss of time and doubling
of effort. Any additional annotations included in the Praat files but not the
ELAN files are not searchable.
Ideally, once a dataset has been annotated, a mechanism for searching
the annotations and their associated signal files should be available. In other
words, linguistic annotation tools should allow us to create, annotate, query
and analyze speech data in order to conduct a prosodic analysis, the results
of which should be included and accessible in an archived corpus.
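What such a query mechanism might look like can be sketched in a few lines. The data model below (named tiers holding labelled, time-stamped intervals) is an assumption made for illustration and does not reproduce the file formats or interfaces of Praat, Toolbox or ELAN; all names are hypothetical, and the time stamps in the toy annotation are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str


def within(tiers, outer_tier, outer_label, inner_tier):
    """For every interval on `outer_tier` carrying `outer_label`, collect
    the intervals on `inner_tier` that lie inside its time span."""
    hits = []
    for outer in tiers[outer_tier]:
        if outer.label != outer_label:
            continue
        inside = [iv for iv in tiers[inner_tier]
                  if iv.start >= outer.start and iv.end <= outer.end]
        hits.append((outer, inside))
    return hits


# Toy annotation (invented time stamps): two intonation units and
# the syllables aligned with them.
tiers = {
    "iu": [Interval(0.0, 1.2, "verbal_clause"),
           Interval(1.5, 2.0, "dependent_predicate")],
    "syllable": [Interval(0.0, 0.4, "ga"), Interval(0.4, 1.2, "yu"),
                 Interval(1.5, 1.8, "wa"), Interval(1.8, 2.0, "ga")],
}

# All syllables inside intonation units labelled as dependent predicates.
for iu, syllables in within(tiers, "iu", "dependent_predicate", "syllable"):
    print(iu.label, [s.label for s in syllables])
```

Keeping every annotation level on the same time line is what allows a query to relate syllable-level measurements to construction-level labels, which is the kind of search the conversion problems described above make difficult.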
2.2. Prosodic analysis with the PENTA model
As the theoretical framework for our prosodic analysis, we employed the Par-
allel Encoding and Target Approximation model (Xu 2005), PENTA here-
after. The main feature of this model is that key components of intonation
are defined in terms of function rather than form. The model has been de-
veloped in the last 10 years, and has been used mostly to analyze speech in
carefully prepared experiments. However, it has been used successfully to
analyze spontaneous speech in Jaminjung, with the understanding that pat-
terns must be detectable, otherwise they would be of no use to speakers and
hearers.
The PENTA model assumes that multiple communicative functions are
concurrently conveyed through speech, and, as they can be perceptually dis-
tinguished, they must be encoded separately. Thus, each individual function
has its own “scheme”, making use of one or more “prosodic primitives” such
as the implementation of a local pitch target, variation in pitch range, or artic-
ulatory strength. The model also assumes that speech melody is produced by
the articulatory system whose physical properties impose various constraints
on the way acoustic forms are generated. In this way, the PENTA model both
describes and explains F0 (fundamental frequency/pitch) patterns in utter-
ances. Some of the advantages of this framework are that, firstly, it recog-
nizes that the encoding of one function can overlay another, so surface F0
must be interpreted with caution. Secondly, the model takes into account pa-
rameters other than F0 (duration, pitch range, and rate of change, i.e. “slope”
of contour, etc); and finally, quantitative methods usually reserved for larger
corpora can be applied to relatively limited datasets, which makes patterns
more easily discernable and verifiable.
The particular encoding schemes for a language are not specified by the
PENTA model; they need to be discovered through empirical investigation
(Xu 2005: 246). This has informed our methodology. We choose a communicative
function (e.g. “syntactic phrasal grouping”, “given topic” or “argument
focus”) as a starting point for our investigation, then select clear
examples of its instantiation, and finally seek out the parameters used in
its encoding.
The selected tokens, all segmented into syllables, are then measured and
annotated. They are labeled according to their syntactic subtypes, the number
of words in each token and their positions in the intonation unit (IU), as well
as their number of syllables and the position of each syllable in a word. The
measurements on each syllable include its mean F0 and duration; a measure
termed the ‘excursion size’, defined as the difference between the maximum
and minimum F0 in the syllable (expressed in semitones); and finally the
‘final velocity’, a measure of the instantaneous rate of F0 change (in
semitones per second) taken at a point shortly before the syllable offset
(here 30 ms), which has proven to be a good indicator of the slope of the
underlying target of the syllable. Based on these measurements, more refined
quantitative analyses can be performed, consisting of:
– the mean pitch in the last syllable of the IU relative to the other syllables
in the IU;
– the variation in minimum and maximum pitch (excursion size) within each
syllable and between all the syllables in the IU;
– the final velocity of the syllables at the boundaries of the IUs (first, second,
penultimate and final) as an indicator of the underlying pitch target;
– the alignment of pitch targets from the most prominent syllable to the final
syllable of an IU;
– the duration of final syllables relative to the other syllables in the IU.
When the number of tokens in the sample is high enough (at least 30), the
results are validated with statistical analysis.
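The per-syllable measurements described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used in the study: the function names, the input format (parallel lists of time stamps in seconds and F0 values in Hz for one syllable) and the two-sample slope used to approximate the final velocity are all hypothetical. The conversion 12 · log2(f1/f2) is the standard definition of an interval in semitones.

```python
import math


def semitones(f_hz, ref_hz):
    """Interval between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def syllable_measures(times, f0, offset_margin=0.030):
    """Return mean F0 (Hz), duration (s), excursion size (semitones) and
    final velocity (semitones per second) for one syllable.

    times: sample times in seconds, ascending; f0: F0 values in Hz.
    offset_margin: distance before the syllable offset at which the
    velocity is read (30 ms in the text).
    """
    if len(times) != len(f0) or len(times) < 2:
        raise ValueError("need at least two aligned F0 samples")
    mean_f0 = sum(f0) / len(f0)
    duration = times[-1] - times[0]
    excursion = semitones(max(f0), min(f0))
    # Assumed approximation: slope between the last sample at or before
    # the target time and the sample that follows it.
    target = times[-1] - offset_margin
    i = 0
    for k, t in enumerate(times[:-1]):
        if t <= target:
            i = k
    velocity = semitones(f0[i + 1], f0[i]) / (times[i + 1] - times[i])
    return {"mean_f0": mean_f0, "duration": duration,
            "excursion": excursion, "final_velocity": velocity}
```

A negative final velocity indicates that F0 is still falling close to the syllable offset, a positive one that it is rising.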
These measurements have made it possible to bring out some subtle
distinctions in the encodings of different constructions (as we will show in
Section 3), which would not have been captured if other models had been used;
this underlines the descriptive strength of the PENTA model.
3. The interplay of prosody and syntax in Jaminjung
In this section, we discuss several types of expression in Jaminjung which

are distributed over two (or more) intonation units, but show semantic coher-
ence, and, as we argue, also prosodic coherence. We therefore assume that
they form coherent syntactic constructions which can be distinguished from
one another, and from superficially similar constructions contained within a
single intonation unit, on prosodic and functional grounds. As a background
to the argumentation, we will first present an overview of the grammatical
properties of Jaminjung.
Jaminjung shares many of its main characteristics with other Australian
languages (see Gaby 2008 for a brief overview). It is said to have “free word
order” in the sense that word order is not used to distinguish the grammatical
roles of arguments, but is rather conditioned by information structure at a dis-
course pragmatic level (Schultze-Berndt 2000: 108). Jaminjung also lacks a
category verb phrase, and lexical arguments can be freely omitted in a clause
(zero anaphora). Argument roles are indicated by bound pronominals which
attach to the verbs as prefixes (and for which alignment mainly follows a
nominative-accusative pattern), and by case markers suffixed to constituents
of noun phrases (for which alignment follows an ergative-absolutive pattern,
with optional ergativity). Like many other languages of the area (McGregor
2002; Dixon 2001; Schultze-Berndt 2003), Jaminjung also has two distinct
categories of “verbs”: inflecting verbs, which form a closed class of around
thirty members, and a non-inflecting category (clearly distinct from nomi-
nals), referred to as coverbs (Schultze-Berndt 2000, 2002). In finite clauses,
inflecting verbs may occur as simple predicates or form complex predicates
with a coverb; in some types of subordinate clause, coverbs function as the
main predicate.
The constructions to be discussed in the following sections are all char-
acterized by the presence (in the case of discontinuous noun phrases, the ab-
sence) of a prosodic boundary within a larger, prosodically and semantically
coherent unit, defined as “prosodic sentence” in Section 3.1. Secondary units
within these larger units can be characterised as pragmatically dependent non-

finite constructions (Section 3.2), afterthoughts, and right-dislocated reacti-
vated topics (Section 3.3), respectively. We will compare the prosodic corre-
lates of these three subtypes in Section 3.4. For discontinuous noun phrases
we will argue that they need to be distinguished from afterthoughts and right-
dislocated topics precisely by the absence of such a prosodic boundary, and
that they themselves fall into two functional subtypes which are likewise dis-
tinguishable on prosodic grounds (Section 3.5).
3.1. Intonation units and larger prosodic units

For the purposes of this paper, and following Chafe (1994: 58) and Cruttenden
(1997: 35–37), we define an intonation unit (IU) as a stretch of speech con-
taining one piece of information uttered under a single coherent intonation
contour containing at least one pitch accent, marked off by pauses, changes
in tempo, and other prosodic cues (see also McGregor 2004: 95).
Note that in Jaminjung, intonation units cannot be identified by means
of morpho-syntactic markers such as the occurrence of a given particle in
final position, as reported, for example, for Dolakha Newar (Genetti 2007).
Syntactically, they may correspond to clauses with inflecting verbs or coverbs
as predicates (e.g. in non-finite subordinate clauses), to noun phrases, or to
interjections.
We also recognize successions of intonation units which form a semantically
coherent larger unit, which will be termed “prosodic sentence” here,
following Chafe’s (1994) and Genetti’s (2007) terminology. This corresponds to
the term “paratone” used in the British tradition (see e.g. Crystal 2003: 336),
and to Ewing’s (2005: 171) term “prosodic cluster”. In Jaminjung, prosodic
sentences are bounded by pauses and set off by substantial pitch resets which
are significantly higher than those between other individual intonation units
(see Simard 2010: 160). At the right boundary, they are lengthened and sub-
ject to final lowering which may provoke creaky phonation (Simard 2010:
140–170).
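One way to quantify a pitch reset and the pause at a boundary is sketched below. The definition used here (the first F0 value of an intonation unit relative to the last F0 value of the preceding one, in semitones) is an assumption made for illustration; Simard (2010) may operationalize the reset differently, and the function names are hypothetical.

```python
import math


def pitch_reset(prev_iu_f0, next_iu_f0):
    """Reset in semitones: first F0 value (Hz) of the following IU
    relative to the last F0 value (Hz) of the preceding IU."""
    return 12.0 * math.log2(next_iu_f0[0] / prev_iu_f0[-1])


def pause(prev_iu_end, next_iu_start):
    """Silent interval in seconds between two consecutive IUs."""
    return next_iu_start - prev_iu_end
```

Under this definition, a boundary between prosodic sentences would be expected to show a larger reset value and a longer pause than a boundary between two intonation units within the same prosodic sentence.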
In grammatical terms, prosodic sentences consisting of two intonation
units – to which we will restrict ourselves for expository reasons – variously
correspond to a succession of two independent main verbal clauses, a main
clause and a subordinate clause or “pragmatically dependent predicate”, or a
main clause followed by a prosodically detached noun phrase as afterthought
or reactivated topic; these are the subtypes which we will discuss here.³
A prosodic sentence consisting of two verbal clauses in separate intona-
tion units is illustrated in (1), from an elicitation task in which the speaker
was asked to describe scenes from pictorial stimuli (here a drawing show-
ing a crocodile under a house). The two intonation units form a semantically
cohesive whole in so far as both clauses contribute to the description of the
same event.
(1) nginju=biyang barrajburru ga-ruma-ny \ mun
PROX=now saltwater.crocodile 3SG-come-PST belly.down
ga-yu jamurrugu \
3SG-be.PRS below
‘(In) this one a crocodile has come, he is lying face down below.’
[JM:CS07_62_01]
[Figure: F0 track, Pitch (Hz) against Time (s), with syllable, word, word-class and free-translation tiers]
Figure 2. A prosodic sentence containing two intonation units, forming a semantic

whole.
3. Prosodic sentences may also correspond to a combination of direct speech and a reporting
verb, or an interjection (forming its own intonation unit) and a main clause.
This example illustrates the baseline to which the other grammatical types
of prosodic sentence mentioned above can be compared. The first IU has a
falling contour resembling that found in simple declaratives: the left
boundary displays a pitch reset and the encodings associated with the marking
of focus on the first syllable of the focus domain (beginning with
barrajburru), i.e. falling pitch, and wider excursions and longer durations
than in other syllables (Simard 2010: 189–219). The syllables at the right
boundary are lengthened and end in low pitch. We treat this pattern as a
“default contour” in declarative sentences.⁴
The pattern in the second intonation unit is a repeat of that of the first
one. Its dependence is indicated by its continuing the declination line of the
first clause, i.e. the reset at the left edge of the second intonation unit is less
prominent than that of the first, and the fall at the right edge is more promi-
nent.
In the following subsections we will examine prosodic sentences where
the second intonation unit does not have clausal status, first presenting the
constructions and then comparing their prosodic correlates.
3.2. Coverbs as pragmatically dependent predicates

As mentioned above, in Jaminjung, coverbs are mainly found in complex
predicates together with an inflecting verb, but can also be used as the main
predicate in subordinate clauses, which are marked by a subset of case mark-
ers in subordinate function (Schultze-Berndt 2000: 111). However, in un-
planned discourse (not in elicited sentences) one also finds coverbs (with or
without an argument) as the main predicate in a separate intonation unit with-
out any formal marking of syntactic subordination. These constructions are
referred to as “semi-independent coverbs” by Schultze-Berndt (2000: 135;
2002; 2003: 154–156) and will be referred to as “pragmatically dependent
predicates” here. Prosodically, pragmatically dependent predicates are not
only separated from the preceding intonation unit, but they also receive a
falling pitch, indicative of focus, on their prominent syllable. They are also
uttered with a higher pitch register than verbal clauses as second intonation
units in a prosodic sentence, as illustrated in Figure 3.
Semantically, the coverb is often closely related to the inflecting verb that
appears in the preceding intonation unit. For example, in (2), from a re-telling
of the Frog Story, the coverb waga ‘sit’ in the second intonation unit can be
combined with the inflecting verb -yu ‘be’ in a complex predicate waga gayu
‘he is sitting’. It is therefore tempting to consider these expressions simply as
prosodically detached versions of the corresponding complex predicates; this
is the position taken by Croft (2007: 9) for Wardaman.
4. Other contours are also found, i.e. ending with a slightly rising or level pitch; however, the
falling contour is the predominant pattern observed in Jaminjung.
(2) nginju wagurra-ni ga-yu \... waga \
PROX rock-LOC 3SG-be.PRS sit
‘Here he is on the rock... sitting.’ [JM:CS06_35_01]
[Figure: F0 track, Pitch (Hz) against Time (s), with syllable, word, word-class and free-translation tiers]
Figure 3. A prosodic sentence consisting of a main verbal clause, followed by a

coverb waga ‘sit’ as a pragmatically dependent predicate.
On the other hand, (3) is a striking example of the semantic (but not prag-
matic) independence of a coverb in a separate intonation unit. This utterance
is taken from a personal account of the speaker and her sister’s efforts, as
children, to escape forced domestic labor and the threat of being removed
to a different state. After a successful escape, they are re-united with their
family members, who see them coming and announce their arrival. From the
context (and also because of prosodic marking of direct speech; see Simard
(2010: 384–392)) we know that it is the speakers of the announcement who
are beating their chests, not the children that are coming. In other words,
even if there was a complex predicate murlngub buntharam ‘they are coming
beating their chests’, this is not the interpretation intended here.
(3) “ngiya jarlig=jirram bunth-aram” \ murlngub \
PROX child=pair 3DU-come.PRS beat.chest
‘“here the two children are coming” (they cried) beating their chests’
[IP:ES08_A07_03]
The view taken here (as in Schultze-Berndt 2000: 139–141; 2002: 280–281),
therefore, is that a coverb which is prosodically detached with the contour
described above constitutes a construction in its own right, i.e. the pragmat-
ically dependent predicate construction, distinct from the complex predicate
construction. While the prosodic contour itself as well as the absence of an
inflecting verb signals the dependency of this unit on a preceding one, it does
not determine the precise semantic relationship of the coverb with any ele-
ment of the preceding unit. Rather, the interpretation of the coverb (e.g. as
encoding an event occurring simultaneously with that encoded by the previ-
ous intonation unit as in (2) and (3), or a resultative or sequential relationship)
is determined by the addressee on the basis of pragmatic principles and world
knowledge. The use of this construction is stylistically marked; its frequency
varies depending on the individual speaker, and it has a clear effect of en-
hancing the vividness of the narration or description.
3.3. Afterthoughts vs. reactivated topics

The second IU in a prosodic sentence can also be a noun phrase which may
serve different functions, the most frequent of which are that of an afterthought
and that of a reactivated topic.
Afterthoughts are defined as constituents which are added after a sentence
is completed in order to add a referent, to disambiguate a potentially unclear
referent, to elaborate on the description of a referent in the preceding intona-
tion unit, or to correct a previous description; see e.g. Chafe (1994: 142), Auer
(1996), Birner and Ward (1998), and Averintseva-Klisch (2008a,b). The inter-
pretation of noun phrases as afterthoughts is to a large extent dependent on the
linguistic and extra-linguistic context – afterthoughts are often treated as the
result of lapses in speech planning. Noun phrases interpreted as afterthoughts
in the prosodic sentences analyzed in this dataset are always preceded by a
pause and never topical; in fact they are often circumstantial elements.
Example (4) (Figure 4) is from a retelling of the Frog Story, showing
a sentence where the second intonation unit is an afterthought, used by the
speaker to add a participant not mentioned in the main clause.
(4) jarlig janju-ni=malang ganurru-ngawu warrb-bina \ ...
child DEM-ERG=GIVEN 3SG:3PL-see.PST be.together-ALL
wirib-ni-mij \
dog-ERG-COMIT
‘That child saw them sitting together, together with the dog.’
[CP:ES96_18_02]
[Figure: F0 track, Pitch (Hz) against Time (s), with syllable, word, word-class and free-translation tiers]
Figure 4. A prosodic sentence with a noun phrase functioning as an afterthought in

the second intonation unit.
Right-dislocated reactivated topics, in contrast, also help to solve potentially

unclear reference, but in this case the referent is topical. Restating the topic
in this way is a fairly common strategy used by Jaminjung speakers, to palli-
ate the lack of clarity brought about by the frequent topic elision in the main
clause. The term antitopic is found in the literature for similar constructions.
Chafe (1976) defines it as a pragmatic category which functions to confirm
established information, noting that it frequently occurs at the right periph-
ery. Lambrecht (1994: 202–205) defines antitopics in contradistinction with
afterthoughts. For him, the propositional information is put on hold temporar-
ily until the referent is fully named, in other words, the referent is accessible
although not yet an established topic: “the presuppositional structure of the
antitopic construction involves a signal that the not-yet-active topic referent
is going to be named at the end of the sentence” (Lambrecht 1994: 203). This
definition is based on the notion that the topic is not yet established, but that
this is somehow signaled in the first part of the sentence. This does not corre-
spond to the constructions found in Jaminjung, where the first intonation unit
is an independent, complete clause, after which the speaker decides to reit-

erate the topical referent. This motivates the choice of the term “reactivated
topic” over that of “antitopic”.
A right-dislocated reactivated topic is shown in example (5) (Figure 5).
The speaker has been talking about getting mussels, the topic of the preceding
prosodic sentence. He continues talking about them without mentioning them
explicitly in the first IU, and restates the topic at the end of this sentence with
naribu=marlang ‘mussels=GIVEN’.
(5) gurrany yanji-ngarna jarlig \ .. naribu=marlang \
NEG IRR:2SG:3SG-give child mussel=GIVEN
‘You should not give them to the children, those mussels.’
[BH:CS07_72_01]
[Figure: F0 track, Pitch (Hz) against Time (s), with syllable, word, word-class and free-translation tiers]
Figure 5. A prosodic sentence with a noun phrase functioning as a reactivated topic

in the second intonation unit.
The distinction between afterthoughts and reactivated topics is not just a func-
tional one, however. Prosodically, the two constructions are clearly distinct,
as the comparison of Figures 4 and 5 illustrates. First, the average pause pre-
ceding the second intonation unit is much shorter for reactivated topics than
for afterthoughts. Second, the prominent syllable in reactivated topics does
not receive falling pitch (indicative of focus). This is consistent with the find-
ings for the encoding of other types of topics in Jaminjung, which do not
have prominent first syllables, independently of where they occur (Simard
2010: 249–276).
We conclude that afterthoughts and right-dislocated reactivated topics are

distinct syntactic constructions, albeit formally distinct only in terms of their
prosodic encoding.
3.4. A comparison of the prosodic correlates of IUs in second position
In this section we will compare the prosodic correlates of the three subsets
of intonation unit in second position (as well as those of “normal verbal
clauses”) discussed in the preceding sections 3.1 to 3.3, thus matching the
quantitative evidence with the more descriptive account presented so far.
The measurements of the mean pitch show significant differences in over-
all pitch. When uttering pragmatically dependent predicates and other types
of non-finite clauses, or afterthoughts, speakers make use of a higher pitch
register, while verbal clauses and reactivated topics are uttered in a lower reg-
ister. The values for intonation units in second position consisting of just two
syllables, the most frequent in our datasets, are illustrated in the first panel of
Figure 6.
Figure 6. The mean F0 (fundamental frequency) and final velocity values of first and
last (= second) syllable in intonation units consisting of two syllables, in
second position in a prosodic sentence consisting of two intonation units.
The measurements of final velocity in the second panel of Figure 6 also highlight
the fact that pragmatically dependent predicates, afterthoughts and verbal
clauses⁵ all exhibit falling intonation on their first syllables, which is associated
with the expression of focus in Jaminjung, while reactivated topics do not.
Of particular interest is the apparent gradience of the measured correlates,
notably the mean F0, from the least marked reactivated topics to the most
marked coverbs as pragmatically dependent predicates. We interpret these
gradient prosodic patterns as reflecting both the information structure of the
elements and their structural relation with the main clause.
On the one hand, prosody serves to mark the contribution of the elements
in question to the sentence, either as new or given information. Pragmatically
dependent predicates and afterthoughts have the encoding characteristic of
focused constituents, namely an accented first syllable with a falling contour.
Indeed, all of them contribute new information by either adding to or expanding
on an argument, or the predication itself. Further, as these units are obligatorily
positioned at the right edge of the prosodic sentence, they can also be
interpreted as having a demarcative function in that they serve as cues to mark
its structural organization. The higher pitch and the steep contour,⁶ which may
be the consequence of the secondary prosodic contour being shorter, have a
stylistically marked effect in Jaminjung: they undoubtedly contribute to the
vividness of the narration or description.
The degree of syntactic integration with the main clause is also reflected
in the prosodic encoding of the second intonation units. It is clearly evidenced,
for example, in the much shorter pauses that precede reactivated
topics, with an average of 570.22 ms, compared to those preceding verbal clauses,
afterthoughts and pragmatically dependent predicates, which have values of
833.05 ms, 820.21 ms and 848.24 ms, respectively. Thus the findings
for Jaminjung support the notion of different degrees of boundary strength,
i.e. degrees of prosodic integration (Ladd 1996: 239–248; Auer 1996) and a
distinction between primary and secondary intonation units which has also
been postulated for other Australian languages (McGregor 1990: 364–365;
Merlan 1994: 226). This is captured in Table 1.
5. Verbal clauses consisting of two syllables are not shown in the second part of Figure 6, as
there were too few tokens in our dataset; however, examination of a range of verbal clauses
in second position in a prosodic sentence corroborates their analysis as exhibiting falling,
focal pitch.
6. And probably the greater intensity as well, although this correlate could not be taken into
account in this analysis.
Table 1. The parameters ruling the prosodic encodings of second intonation units in
complex sentences.
                        Structural integration      Information status
                        dependent ↔ independent     new ↔ given
verbal clauses          +                           +
non-finite predicates   +                           +
afterthoughts           +                           +
reactivated topics      +                           +
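The comparison of mean pause durations reported in this section can be restated as a small calculation. The following Python sketch is only an illustration, not part of the original analysis: it uses nothing but the four mean values given in the text, and the names MEAN_PAUSE_MS, rank_by_pause and relative_to are hypothetical.

```python
# Mean pause (ms) preceding the second intonation unit, as reported in the text.
MEAN_PAUSE_MS = {
    "reactivated topics": 570.22,
    "afterthoughts": 820.21,
    "verbal clauses": 833.05,
    "pragmatically dependent predicates": 848.24,
}


def rank_by_pause(pauses):
    """Return the construction types ordered from shortest to longest mean pause."""
    return sorted(pauses, key=pauses.get)


def relative_to(pauses, baseline):
    """Express every mean pause as a ratio of the baseline construction's mean."""
    base = pauses[baseline]
    return {name: round(value / base, 2) for name, value in pauses.items()}


if __name__ == "__main__":
    for name in rank_by_pause(MEAN_PAUSE_MS):
        print(f"{name}: {MEAN_PAUSE_MS[name]:.2f} ms")
    print(relative_to(MEAN_PAUSE_MS, "verbal clauses"))
```

Taking verbal clauses as the baseline makes the asymmetry easy to see: the three focus-bearing unit types cluster closely together, while reactivated topics are preceded by a clearly shorter pause. The sketch works on means only and says nothing about variance or statistical significance.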
So far, we have examined some prosodically coherent complex sentences
(“prosodic sentences”) in Jaminjung that contain a second element which en-
codes prosodically the degree of “newness” of the information it contains, as
well as its degree of dependence on the main clause. We will now turn our
attention to a different construction, the discontinuous noun phrase.
3.5. Prosodically detached vs. discontinuous noun phrases

A particularly striking example of the relevance of prosody for syntactic anal-
ysis is the debate surrounding the existence of discontinuous noun phrases,
which has led to far-reaching (though nowadays less widely accepted) conclu-
sions about non-configurationality, e.g. in the case of Australian languages.
However, discussions of this phenomenon often include examples of “discontinuous”
noun phrases which on closer inspection may well be considered
afterthoughts as defined in Section 3.3 above. Consider the Warlpiri example
in (6), presented as an example of double noun phrase discontinuity by
Laughren (2002: 84).
(6) yakajirri-rli yankirri maju-manu wita maju-ngku
berry-ERG emu bad-make:PST small bad-ERG
‘The bad berries hurt the little emu.’
It is only from an accompanying footnote that we learn that “[b]oth post-
verbal words, wita and maju-ngku, are typically separated from the verb and
from each other with a pause and a “new phrase” intonation, partially resem-
bling the English “afterthought” intonational pattern” (Laughren 2002: 123–
124). In many other cases of supposed discontinuity presented in the litera-
ture it is in fact impossible to tell whether they were uttered with a prosodic
break between any of the elements of a discontinuous phrase.
In Schultze-Berndt and Simard (to appear) we argue in more detail for a
distinction between afterthoughts and discontinuous noun phrases on prosodic
grounds (in line with authors such as Merlan 1994: 241–242 and Legate
2002: 114). We only regard as true discontinuous noun phrases elements that
not only allow a joint construal as a noun phrase identifying a single referent,
but also occur under a single intonation contour, as in example (7).
(7) mulanggirr ngantha-ma-ya wirib \
fierce 2SG:3SG-have-PRS dog
‘You have a DANGEROUS dog!’ [IP: ES97_A06_03]
[Pitch track of example (7): pitch (Hz) plotted against time (s), with syllable and word tiers.]
Figure 7. A discontinuous noun phrase of the contrastive subtype (example (7)); the
two nominals are found within a single intonation unit and the modifier
bears focus accent.
We further show that such “true” discontinuous noun phrases (which are in-
frequent in texts) have functions very different from those of afterthoughts.
Two main functions of discontinuous noun phrases can be identified and can
moreover also be formally distinguished on prosodic grounds. The first func-
tion is that of contrastive argument focus where the contrastive element is a
modifier, which is always prosodically salient. Example (7) above is an illus-
tration of this subtype; it is extracted from a fictitious conversation (triggered
by elicitation questions) between the speaker and a dog owner. The preceding
utterance is a request to the dog owner to tie up the dog since it is threaten-
ing to bite people. Thus, the dog, but not its property of fierceness, has been
previously mentioned. Since it is this property which is emphasized in this utterance,
as the reason for the request, it is the property expression mulanggirr
‘fierce’ which receives prosodic prominence, while the common noun wirib
‘dog’ remains unaccented.
The second function of discontinuous noun phrases in Jaminjung is that of
marking sentence focus or “all-new” statements, typically used to introduce
a new participant into the discourse universe. In this case, the preferred order
of elements is reversed: while for the contrastive argument type the modifier
is usually in preverbal position, it is in postverbal position for the sentence-
focus type. Examples like (8), where the discontinuous noun phrase is made
up of a generic and a specific noun, are also attested. This example comes
from a Frog Story; the speaker quotes the boy alerting his dog to the presence
of a group of frogs which includes their pet frog. As Figure 8 shows, no par-
ticular prosodic salience is associated with either of the subconstituents; in
fact, all constituents in the sentence receive a prominence, including the ver-
bal predicate. This prosodic pattern conforms to the general pattern described
for “all-new” sentences in Jaminjung by Simard (2010: 225–233).
(8) “girrb girrb” gani-yu=nu majani “ngayiny=gun
quiet quiet 3SG:3SG-say/do.PST=3SG.OBL maybe animal=CONTR
ngiya jalwany burru-yu malara!”
PROX talking 3PL-be.PRS frog
‘“ah, quiet, quiet!” he maybe said to him, “frog animals are talking
here!”’ [DBit: ES96_A07_01]
[Pitch track of example (8): pitch (Hz) plotted against time (s), with syllable and word tiers.]
Figure 8. A discontinuous noun phrase of the sentence focus subtype (example (8));
all constituents bear a prosodic prominence.
Thus, prosody in this case provides not only a means to distinguish between
afterthoughts and discontinuous noun phrases, but also helps to corroborate
with formal evidence a functional distinction between two subtypes of dis-
continuous noun phrase.
4. Conclusions
Prosody is recognized as one of the fundamental components of language.
While prosody may have received limited attention in the past outside specif-
ically prosodic research on well-described languages, theoretical and techno-
logical advances in recent years have spurred a renewed interest in intonation
studies yielding important insights into its workings as well as its interactions
with other linguistic components. Applying and refining these on the basis of
lesser documented languages still poses many challenges. For Jaminjung we
had to base our investigation solely on a corpus of spontaneous or at least
unread speech but we consider that the drawbacks of using such uncontrolled
data are counterbalanced by the benefits that our analysis brings to our un-
derstanding of the syntax and semantics of the language. We demonstrated
here that it is possible to distinguish, on the basis of prosodic evidence alone,
constructions such as reactivated topics vs. afterthoughts; afterthoughts vs.
discontinuous noun phrases, and two subtypes of discontinuous noun phrase.
We also argued that it is possible, based on quantitative evidence gained from
corpus data, to distinguish between different degrees of prosodic integration
iconically reflecting degrees of semantic integration between intonation units
forming a larger unit, that of the prosodic sentence.
Our case study shows how the investigation of prosody based on spontaneous
speech data collected during fieldwork not only enhances our understanding
of the language itself, but also contributes more widely to a typology
of prosodic phenomena and their grammatical functions in human languages.
Firstly, our findings contribute to our understanding of cross-linguistically
recurrent differences between afterthoughts and reactivated right-dislocated
topics. Secondly, a careful distinction between discontinuous noun phrases
and afterthoughts on prosodic and functional grounds reveals that the former
are infrequent in discourse and tied to very specific information structure con-
figurations, thus providing further evidence against the myth of unconstrained
discontinuity in Australian languages.
Abbreviations
1, 2, 3 first, second, third person
ABL ablative case
ALL allative case
COMIT comitative case
CONTR contrastive focus marker
DAT dative case
DEM demonstrative (distance-neutral/recognitional)
DU dual
ERG ergative case
IMPF (past) imperfective
IRR irrealis
LOC locative case
NEG negative particle
OBL oblique pronominal
PL plural
PROX proximal demonstrative
PRS present
PST past (perfective)
RESTR restrictive clitic (‘right there/then’)
SG singular
References
Auer, Peter. 1991. Vom Ende deutscher Sätze. Zeitschrift für Germanistische
Linguistik 19:139–157.
Auer, Peter. 1996. On the prosody and syntax of turn-continuations. In
Prosody in Conversation: Interactional Studies, eds. Elizabeth Couper-
Kuhlen and Margaret Selting, 57–100. Cambridge: Cambridge University
Press.
Averintseva-Klisch, Maria. 2008a. Rechte Satzperipherie im Diskurs: Die
NP-Rechtsversetzung im Deutschen. Ph.D. Dissertation, Universität
Tübingen.
Averintseva-Klisch, Maria. 2008b. To the right of the clause: Right disloca-
tion vs. afterthought. In ‘Subordination’ versus ‘Coordination’ in Sentence
and Text: A Cross-Linguistic Perspective, eds. Cathrine Fabricius-Hansen
and Wiebke Ramm, 217–239. Amsterdam, Philadelphia: John Benjamins.
Bing, Janet Mueller. 1984. A discourse domain identified by intonation.
In Intonation, Accent and Rhythm: Studies in Discourse Phonology, eds.
Dafydd Gibbon and Helmut Richter, 10–19. Berlin, New York: de Gruyter.
Birner, Betty J., and Gregory L. Ward. 1998. Information Status and Non-
Canonical Word Order in English. Amsterdam, Philadelphia: John Ben-
jamins.
Boersma, Paul, and David Weenink. 2001. Praat, a system for doing phonetics
by computer. Report 132, Institute of Phonetic Sciences of the University
of Amsterdam. http://www.praat.org.
Chadwick, Neil. 1997. The Barkly and Jaminjungan languages: A non-contiguous
genetic grouping in North Australia. In Boundary Rider: Essays
in Honour of Geoffrey O’Grady, eds. Darrell Tryon and Michael Walsh,
95–106. Canberra: Pacific Linguistics.
Chafe, Wallace L. 1976. Givenness, contrastiveness, definiteness, subjects,
topics and point of view. In Subject and Topic, ed. Charles N. Li, 25–55.
New York: Academic Press.
Chafe, Wallace L. 1994. Discourse, Consciousness, and Time. The Flow and
Displacement of Conscious Experience in Speaking and Writing. Chicago:
University of Chicago Press.
Couper-Kuhlen, Elizabeth. 1986. An Introduction to English Prosody. Lon-
don: Arnold.
Croft, William. 1995. Intonation units and grammatical structure. Linguistics
33:839–880.
Croft, William. 2001. Radical Construction Grammar: Syntactic Theory in
Typological Perspective. Oxford: Oxford University Press.
Croft, William. 2007. Intonation units and grammatical structure in War-
daman and in cross-linguistic perspective. Australian Journal of Linguis-
tics 27:1–40.
Cruttenden, Alan. 1997. Intonation. (2nd edition). Cambridge: Cambridge
University Press.
Crystal, David. 2003. A Dictionary of Linguistics and Phonetics. (5th edi-
tion). Oxford: Blackwell.
Diessel, Holger. 1997. Verb-first constructions in German. In Lexical and
Syntactical Constructions and the Construction of Meaning, eds. Marjolijn
Verspoor, Kee Dong Lee, and Eve Sweetser, 51–68. Amsterdam, Philadel-
phia: John Benjamins.
Dixon, Robert M. W. 2001. The Australian linguistic area. In Areal Diffu-
sion and Genetic Inheritance: Problems in Comparative Linguistics, eds.
Alexandra Y. Aikhenvald and Robert M. W. Dixon, 64–104. Oxford: Ox-
ford University Press.
Drubig, Hans Bernhard. 2003. Toward a typology of focus and focus con-
structions. Linguistics 41(1):1–50.
Ewing, Michael C. 2005. Hierarchical constituency in conversational lan-
guage: The case of Cirebon Javanese. Studies in Language 29(1):89–112.
Gaby, Alice. 2008. Rebuilding Australia’s linguistic profile: Recent devel-
opments in research on Australian aboriginal languages. Language and
Linguistics Compass 2(1):211–233.
Genetti, Carol. 2007. A Grammar of Dolakha Newar. Berlin, New York:
Mouton de Gruyter.
Goldberg, Adele E. 1995. Constructions: A Construction Grammar Approach
to Argument Structure. Chicago: University of Chicago Press.
Halliday, Michael. 1985. An Introduction to Functional Grammar. London:
Arnold.
Harvey, Mark. 2008. Proto Mirndi: A Discontinuous Language Family in
Northern Australia. Canberra: Pacific Linguistics.
Himmelmann, Nikolaus P. 2006. Prosody in language documentation. In
Essentials of Language Documentation, eds. Jost Gippert, Nikolaus P.
Himmelmann, and Ulrike Mosel, 163–181. Berlin, New York: Mouton de
Gruyter.
Himmelmann, Nikolaus P., and D. Robert Ladd. 2008. Prosodic fieldwork.
Language Documentation & Conservation 2:244–274.
Kay, Paul, and Charles Fillmore. 1999. Grammatical constructions and lin-
guistic generalizations: The What’s X Doing Y construction. Language
75(1):1–33.
Ladd, D. Robert. 1996. Intonational Phonology. Cambridge: Cambridge
University Press.
Lakoff, George. 1987. Women, Fire and Dangerous Things: What Categories
Reveal About the Mind. Chicago: University of Chicago Press.
Lambrecht, Knud. 1994. Information Structure and Sentence Form: Topic,
Focus and the Mental Representation of Discourse Referents. Cambridge:
Laughren, Mary. 2002. Syntactic constraints in a “free word order” language.
In Language Universals and Variation, eds. Mengistu Amberber and Peter
Collins, 83–130. Westport, CT: Praeger Publishers.
Legate, Julie Anne. 2002. Warlpiri: Theoretical implications. Ph.D. dissertation,
Massachusetts Institute of Technology, Boston, MA.
for Young Readers.
McGregor, William B. 1990. A Functional Grammar of Gooniyandi. Ams-
McGregor, William B. 2002. Verb Classification in Australian Languages.
Berlin, New York: Mouton de Gruyter.
McGregor, William B. 2004. The Languages of the Kimberley, Western Aus-
tralia. London, New York: Taylor and Francis.
Merlan, Francesca C. 1994. A Grammar of Wardaman: A Language of the
Northern Territory of Australia. Berlin, New York: Mouton de Gruyter.
Michaelis, Laura, and Knud Lambrecht. 1996. Toward a construction-based
theory of language function: The case of nominal extraposition. Language
72(2):215–247.
Schultze-Berndt, Eva. 2000. Simple and complex verbs in Jaminjung. Ph.D.
dissertation, University of Nijmegen, Nijmegen, NL.
Schultze-Berndt, Eva. 2002. Constructions in language description. Func-
tions of Language 9:267–308.
Schultze-Berndt, Eva. 2003. Preverbs as an open word class in Northern
Australian languages: Synchronic and diachronic correlates. In Yearbook
of Morphology 2003, eds. Geert Booij and Jaap van Marle, 145–177. Dor-
drecht: Kluwer.
Schultze-Berndt, Eva, and Candide Simard. To appear. Constraints on noun
phrase discontinuity in an Australian language: The role of prosody and
information structure. To appear in Linguistics.
Selting, Margaret. 1995. Prosodie im Gespräch: Aspekte einer interak-
tionalen Phonologie der Konversation. Tübingen: Niemeyer.
Selting, Margaret. 2000. The construction of units in conversational talk.
Language in Society 29(4):477–517.
Simard, Candide. 2010. The prosodic contour of Jaminjung, a language of
Northern Australia. Ph.D. dissertation, University of Manchester, Manchester.
Skopeteas, Stavros, Ines Fiedler, Samantha Hellmuth, Anna Schwarz, Ruben
Stoel, Gisbert Fanselow, Caroline Féry, and Manfred Krifka. 2006. Ques-
tionnaire on Information Structure (QUIS). Potsdam: Universitätsverlag
Potsdam.
Verstraete, Jean-Christophe. 1998. A semiotic model for the description of
levels in conjunction: External, internal-modal and internal-speech func-
tional. Functions of Language 5:179–211.
Xu, Yi. 2005. Speech melody as articulatorily implemented communicative
functions. Speech Communication 46:220–251.
Xu, Yi. 2010. In defense of lab speech. Journal of Phonetics 38:329–336.
Chapter 8
Diphthongology meets language documentation: The
Finnish experience∗
Klaus Geyer
1. Introduction
Guides on language documentation and linguistic field work in general agree
on the fact that one of the first steps to be undertaken when the preparatory
work is done and it comes to data collection, analysis and description is to
carry out a first examination of the functional phonetics¹ of the language
in question, establishing a preliminary sketch of the segmental sound
system, the syllable structure, and, when relevant, the tonal patterns; cf. e.g.
Mosel (2006a: 75) and Bowern (2008: Ch. 5). While the preliminary analysis
of a phonological system is a subject in its own right in language documen-
tation and description, it is also (at least on the word level) a prerequisite
for developing a graphemic system that is equally preliminary at that stage (often
termed working orthography), allowing one to record the data in written form.
Functional phonetics as a separate level of language analysis and description
precedes the level of morpho-syntax not only in the research process, but also
in one of the essential outcomes of a language documentation endeavor (cf.
Mosel 2006b), viz. the grammar book, be it a short sketch grammar as e.g.
Mosel (1994) or an extensive reference grammar as e.g. Mosel and Hovd-
haugen (1992).
∗ I would like to thank the members of the Linguistisches Kolloquium at the Linguistics
department, University of Erfurt, the participants of the workshop on diphthongs at the
43rd Annual Meeting of the SLE 2010 in Vilnius, and the editors (especially Nicole Nau
and Claudia Wegener) for discussing my ideas on diphthongs with me and for commenting
on an earlier version of this paper. In particular, I want to express my thanks to Lena
de Mol for her help with the English language. Needless to say all remaining errors and
shortcomings are mine.
1. The term functional phonetics is used here for a data-driven, inductive reconstruction of
the phonological system(s), as opposed to deductive accounts, which aim to fit the sounds
of a language into a pre-determined, allegedly universal structure model.
The preliminary analysis of a phonological system seems to be one of the
simpler tasks. As Schultze-Berndt (2006: 222) points out, “[p]rocedures for
working out the distinctive sound features of a language (for example mini-
mal pairs) are stated in all good introductory textbooks on phonology”. In my
view, as far as vowel systems are concerned, this claim is true for monoph-
thongs (regardless of how difficult the analyses might be for an individual
language). However, when diphthongs come into play, the textbooks tend to
remain somewhat fuzzy with respect to procedures for working out a poten-
tial diphthong inventory and for further analysis and systematization, not to
mention for the representation of diphthongs in the syllable structure. Gram-
mars, even those providing detailed phonological systems of consonants and
vowels (monophthongs), seem content to limit themselves to a bare list when
it comes to diphthongs.
According to an estimate of Lindau, Norlin, and Svantesson (1990), which
is based on the data in Maddieson (1984), at least around one third of the
world’s languages comprise diphthongs as an integral part of their sound sys-
tems, where diphthong is understood broadly as “any vowel sequence within
a syllable” (Lindau, Norlin, and Svantesson 1990: 10). By far the most fre-
quent diphthongs found by the authors are of type /ai/ in 75% and of type /au/
in 65% of those languages comprising diphthongs. As for quantity, usually
only a couple of diphthongs occur in a language. Of course, large diphthong
inventories also exist. The reason why diphthongs are apparently a somewhat
disregarded and unloved subject might be that there is no sufficiently
fine-grained, widely accepted common means available for their description
and analysis. Whereas the articulatory characteristics given by the IPA, if
complemented by the dimension of active articulator, work well for static
sounds (i.e. monophthongs where no considerable change in quality is per-
ceived) in terms of documentation and description, it is far from clear that the
same criteria suffice for diphthongs as dynamic sounds.
The aim of this chapter is to tackle this gap in the literature by developing
a (preliminary) means for thorough analysis and description of diphthongs
and by providing a “diphthong analysis and description tool”, comprised of
well-differentiated criteria that may be helpful in sorting out laborious diph-
thong issues. This will be achieved by discussing basic questions of diph-
thongology and by examining the general notion of diphthong with respect
to minimum requirements and maximal extension (Section 2). In Section 3,
the diphthongs of Finnish will be discussed for a twofold purpose: on the one
hand as an instructive and illustrative case study to show how one can come to
an adequate description and systematization of diphthongs in spite of a rather
complex data set; and on the other hand as an almost perfect example of a
rather complex data set based on which a lot of potentially relevant criteria
for the description and analysis of diphthongs in the world’s languages can be
extracted. The final section (Section 4) brings together the results and find-
ings, presenting an outline of the “diphthong analysis and description tool”.
2. What is a diphthong?
2.1. Definitions
By and large, two main perspectives in the understanding of diphthongs can
be discerned. The first one defines diphthongs as tautosyllabic combinations
of two vowels, cf. e.g. Catford (1977: 215): “A diphthong may be defined as
a sequence of two perceptually different vowel sounds within one and the
same syllable.” In the second view, diphthongs are understood as vowels that
undergo a change in quality; Pompino-Marschall (1995: 218), amongst oth-
ers, represents this perspective: “Als Diphthonge werden die vokoiden Sil-
benkerne bezeichnet, die auditiv nicht durch eine gleichbleibende, sondern
eben gerade durch eine sich ändernde Vokalqualität gekennzeichnet sind.”
[Those vocoid syllable nuclei that are characterized not by an auditorily sta-
ble, but rather by a changing vowel quality are called diphthongs. (Translation
K.G.)]
Obviously, these two notions are rather different. The first one focuses
either on a connection of two vocalic segments within one complex sound
under the condition of tautosyllabicity, or possibly on the initial and the ter-
minal (target) quality of the diphthong as a complex sound. The second one
highlights the change in quality and the transitional nature of diphthongs.
These two perspectives are reminiscent of the question whether diphthongs
should be ascribed mono-phonemic or bi-phonemic status in phonological
analysis – a classic topic in Prague structuralist phonology, discussed in de-
tail by Trubetzkoy (1977[1939]: 50–59), and one which can, of course, only
be answered individually for each given language. When investigating the
sound shapes of a previously undescribed language as part of a documenta-
tion project, however, the initial analysis will be carried out on the phonetic
level. Assuming that the language in question possesses diphthongs, there is
no need to immediately decide for or against one or the other of the two no-
tions of diphthongs mentioned above. One should rather follow an account
that does not exclude either of the notions from the beginning, but rather
leaves the question open by combining the two. This was done by, e.g., Roca
and Johnson (1999: 688), who define a diphthong as “a complex vowel of
non-steady quality, made up of two phases.”
It is worth mentioning that Schubiger, as early as 1977, combines
the two main perspectives on the notion of diphthongs and suggests the following,
rather differentiated definition:
Diphthonge. Es sind dies lange Vokale mit gleitender Zungenstellung oft auch
mit sich verändernder Lippenstellung. Im Verlauf der Gleitbewegung ergibt
sich eine ganze Reihe von Vokalen, von denen aber wegen der raschen Ab-
folge nur der erste und der letzte wahrgenommen werden. Auf der Wahr-
nehmung beruht die übliche Definition des Diphthongs: eine der gleichen
Silbe angehörende Folge von zwei Vokalen; ebenso ihre Darstellung in der
phonetischen Schrift. (Schubiger 1977: 49)
[Diphthongs. These are long vowels produced with a gliding tongue position,
often also with a changing lip position. In the course of the gliding movement,
a whole series of vowels arise, but due to the quick succession, only the first
one and the last one are perceived. The common definition of diphthong relies
on perception: a sequence of two vowels, belonging to the same syllable; their
presentation in the phonetic script is made up in the same way. (Translation
K.G.)]
Aside from the fact that diphthongs do not necessarily need to be long in
duration as the famous example of diphthongs in Icelandic shows (cf. Guss-
mann 2002: Ch. 7), Schubiger (1977) not only mentions some important artic-
ulatory characteristics of diphthongs, she also gives an adequate explanation
for how the two notions interact. In any case, from a phonological point of
view, diphthongs of both the “two vowels within one syllable”-type and of
the “vowel changing its quality”-type may occur. That is to say, even within
one and the same language, both mono- and bi-phonemic diphthongs may
be observed. The standard analysis of Lithuanian diphthongs illustrates this
(cf. Ambrazas 1997: 28), where it is even the case that the so-called “gliding
diphthongs” /ie/ and /uo/ as mono-phonemic units are integrated in the sys-
tem of long vowel phonemes. All other diphthongs, however, are treated as
phoneme combinations (and are therefore termed “combined diphthongs”):
they can be split up into two syllables due to re-syllabification; cf. e.g. sai.tas
‘tie’ vs. są.sa.ja ‘linkage’, gau.ti ‘to receive’ vs. ga.vo ‘(s/he) received’ (cf.
Ambrazas 1997: 27). Importantly, these distinctions in phonological analysis
need not necessarily affect the phonetic shape of the complex vowel sounds
in question, e.g. in the sense of a rather smooth and even transition from initial
to target sound (“gliding diphthong”) vs. a sharper and more rapid change
(“sequential diphthong”), cf. Catford (1977: 215–216) for examples.
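The tautosyllabicity criterion of the first definition lends itself to a simple mechanical check on syllabified word forms. The following Python sketch is offered only as an illustration and is not a procedure proposed in this chapter. It assumes orthographic input with syllable boundaries marked by a full stop, as in sai.tas, and a small hand-picked vowel set; the names VOWELS and diphthong_candidates are hypothetical.

```python
# Assumption: a small orthographic vowel set, sufficient for the Lithuanian
# examples cited in the text; a real project would use its own inventory.
VOWELS = set("aeiouyąęėįųū")


def diphthong_candidates(syllabified_word):
    """Return adjacent pairs of different vowel letters found within one syllable.

    Syllable boundaries are marked with "." (as in sai.tas), so a vowel pair
    that straddles a boundary is never reported.
    """
    candidates = []
    for syllable in syllabified_word.lower().split("."):
        for first, second in zip(syllable, syllable[1:]):
            if first in VOWELS and second in VOWELS and first != second:
                candidates.append(first + second)
    return candidates


if __name__ == "__main__":
    for word in ["sai.tas", "są.sa.ja", "gau.ti", "ga.vo"]:
        print(word, diphthong_candidates(word))
```

Such a check only yields candidates on the level of the transcription. Whether a given candidate is best analyzed as mono-phonemic or bi-phonemic remains a question for the phonological analysis of the individual language.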
2.2. Phonetic and phonological diphthongs

With respect to functionality and distinctiveness, the crucial differentiation
of diphthongs into phonetic vs. phonological ones deserves attention. Consider
the standard variety of today’s Swedish. It is a generally accepted claim in
Scandinavian linguistics that Swedish is the only Germanic language without
genuine phonological diphthongs, due to the so-called monophthongization
of the Proto-Scandinavian (phonological) diphthongs (cf. e.g. Braunmüller
1999); compare Icelandic steinn, auga, heyra vs. Swedish sten, öga, höra
‘stone’, ‘eye’, ‘to hear’. But this fact does not imply that one cannot find
vowels whose quality changes remarkably in Standard Swedish. It is the long
closed vowels that are realized as phonetic diphthongs by most speakers, e.g.
bi ‘bee’, by ‘village’, bu ‘boo’, bo ‘to reside’. The representations of the
change in quality, however, differ remarkably, as a comparison of e.g. An-
dersson (1994: 272) [ij, yj, Ww, uw],2 Braunmüller (1999: 34) [i:j, y:4, W:B,
u:w], and (Elert 1995: 43–44) [ij, y4, WB, uB] shows. Furthermore, in regional
standard varieties of Swedish, “dirty”, i e. auditorily unstable long vowels,
can be observed much more frequently, e.g. the diphthongal realization of /e:/
in the regional standard variety of Scania in Southern Sweden as [EeI] and in
Stockholm as [e:@]. Since the changing quality in these vowels does not con-
stitute a relevant feature (e. g. in minimal pair formation) on the systematic
level, however, there is no need to analyze these sounds as phonological diph-
thongs in Swedish. They can simply be analyzed as long vowel phonemes.
In other cases, tautosyllabic vowel combinations on the phonetic level
might best be analyzed as sequences of a vowel plus a consonant with re-
spect to their functionality, as the following examples from Danish show. To
simplify somewhat, in modern Standard Danish, word-final /b, g, v, r/ are
vocalized after vowels, forming “derived” phonetic diphthongs, cf. løb [løwP]
‘run!’, steg [sdAjP] ‘fry!’, skriv [sg̊KiwP] ‘write!’; klor [khlo2P] ‘chlorine’. In
addition to spelling, there are two further arguments for the assumption of
“underlying” word-final consonants in these word-forms: firstly, they require
the allomorph {-e} of the infinitive suffix, which appears on stems ending in
a consonant, cf. løbe ‘to run’, stege ‘to fry’, and skrive ‘to write’, and not
the zero-allomorph used on stems ending in a vowel, e.g. gå ‘to go’, sy ‘to
sew’ etc. Secondly, there is an alternation of the phonetic diphthongs in
question with vowel-consonant-sequences: løbe løbsk [løbsg̊] ‘to bolt (of horses)’,
stegt [sdEg̊d] ‘fried’, skrift [sg̊Kæfd] ‘script, handwriting’; klorid [khloKiðfl P]
‘chloride’ (cf. Grønnum 1998: 251–257). It is obvious that it would become
very puzzling to account for those alternations without assuming a phonological
word-final consonant in these cases, realized as a phonetic vowel under
certain circumstances.
2. Note that [W] does not stand for IPA cardinal vowel 16, but for a fronted cardinal vowel
18 [0ff], rendering the symbol Ñ which is used in the Swedish landsmålsalfabet for dialect
transcriptions.
An interesting question is to what extent a change in nasality might play
a role in forming diphthongs, e.g. when in the German word Reallohn ‘real
wages’ nasality appears in the course of the long /o:/ in the final syllable,
resulting in a sequence like [oõ]. Since this is an effect of assimilation, it lacks
phonemic status but nevertheless constitutes a phonetic diphthong. The only
hint indicating that such a change in nasality could in fact claim phonemic
status as a diphthong is a note on a minimal pair that I found in Sievers
(1893: 153), the hook on the i indicating nasality: “Nasalirtes i neben reinem
i findet sich nach Böthlingk im Jakutischen, z. B. in Ai˛ï̄ Sünde neben Aiï̄
Schöpfung.” [Nasalized i in addition to pure i can, according to Böthlingk,
be found in Yakut, e.g. in Ai˛ï̄ sin compared with Aiï̄ creation. (Translation
K.G.)]. Unfortunately, it has turned out to be quite difficult to track down
the reference (Böthlingk) cited by Sievers to learn more about this fact.
Finally, it should be added that the perceivable change in vowels that is
described by the notion of contour tones is not subsumed under the topic of
diphthongs, since tone does not affect the quality of the diphthong in terms
of its formant structure but only its pitch.
As a last remark in this section, I want to emphasize the fact that it is
ultimately the auditory change, and not the acoustic one, that is crucial even
for a phonetic diphthong. Ramers and Vater (1992: 128) state correctly that
in natural speech no vowels are produced that are stable with respect to their
frequency spectra in an acoustic sense. Linguistically relevant, however, are
only those changes that are perceivable, be they functional or not.
2.3. Sonority and prominence in diphthongs

In the previous section, it was stated that the minimum requirement for a
diphthong is that the change in quality is auditorily perceivable. This section
raises the question of how far the notion of diphthong can and has to be
stretched to cover potentially all relevant linguistic data. One crucial issue is
to find a proper description for the part or phase3 of a diphthong that is the
prominent one and thus constitutes the syllable peak. The examples presented
and discussed hitherto seem to be unproblematic in this respect; sometimes,
as in the Swedish examples above, glide symbols or superscripts are used to
indicate the non-peak phases, whereas full vowel symbols apparently stand
for the peak phases. IPA and other phonetic alphabets use the diacritic [ ̯]
to indicate a non-peak phase in a diphthong, or even provide special symbols
for the most common vowels (in the sense of diphthong phases) in non-peak
positions, i.e. the closed vowels (cf. the symbols [j, w], which from a purely
phonetic point of view can be seen as equivalent to [i̯, u̯]).
In terms of sonority, which is, of course, a central concept for syllable
formation in general and for the issue of diphthongs in particular, the
(phonetic) Swedish diphthongs as well as the – according to Lindau, Nor-
lin, and Svantesson’s (1990) findings – most common diphthong types in the
world’s languages /ai/ and /au/ all show a decrease in sonority. Consequently,
they are called falling diphthongs, as opposed to rising diphthongs with in-
creasing sonority. Note that the feature [falling vs. rising] solely refers to the
sonority contour and should not be confused with the articulatory movements
termed “opening” and “closing”, implying that the tongue or the lower jaw
perform a “falling” or “rising” movement. In fact, the relation is precisely
the other way around. Closing diphthongs are characterized by falling sonor-
ity whereas opening diphthongs are characterized by rising sonority – simply
because open vowels are more sonorous than closed ones (cf. e.g. Vennemann
1988: 9).4
3. I am reluctant to use the expression “half (of a diphthong)” here, since the widely accepted
convention to represent diphthongs as two symbols (initial and target) does definitely not
imply that a diphthong consists of two (equal) “halves”.
4. Here, for convenience, a rough version of the sonority hierarchy is given, the sign > mean-
ing ‘more sonorous than’: open vowels > closed vowels > liquids > nasal consonants >
fricatives > plosives. It should be added that, as far as vowels are concerned, nasalization
increases sonority, and a voiced obstruent is more sonorous than its voiceless counterpart.
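The inverse relation between the articulatory labels and the sonority labels can be made concrete with a small sketch. It assumes a three-step openness scale for vowels, with mid vowels placed between close and open ones; the numeric encoding and the label "level" are illustrative assumptions, not part of the hierarchy given in note 4.

```python
# Assumed three-step openness scale (0 = close, 1 = mid, 2 = open).
# The numeric encoding is illustrative and not taken from the text.
OPENNESS = {"i": 0, "y": 0, "u": 0, "e": 1, "o": 1, "a": 2}


def vertical_step(initial: str, target: str) -> int:
    """Positive if the target is more open than the initial, negative if more close."""
    return OPENNESS[target] - OPENNESS[initial]


def articulatory_label(initial: str, target: str) -> str:
    """Opening vs. closing: the direction of the tongue/jaw movement."""
    step = vertical_step(initial, target)
    return "opening" if step > 0 else "closing" if step < 0 else "level"


def sonority_label(initial: str, target: str) -> str:
    """Rising vs. falling sonority, assuming that more open means more sonorous."""
    step = vertical_step(initial, target)
    return "rising" if step > 0 else "falling" if step < 0 else "level"


for first, second in ("ai", "au", "ie", "uo"):
    print(first + second, articulatory_label(first, second), sonority_label(first, second))
```

Under these assumptions every closing diphthong comes out as falling and every opening diphthong as rising, which is the relation described above.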
Even if the sonority maximum most often coincides with the syllable
peak, this is not necessarily the case. Regarding the parameter of sonor-
ity, which, according to Kohler (1995: 74), is somewhat “impressionistisch
gewonnen” [impressionistically obtained. (Translation K.G.)], one has to bear
in mind that the gradation of more or less sonorous sounds in the sonority
hierarchy requires the tacit prerequisite that the sounds in question are pro-
duced with the same articulatory effort. Differences in terms of loudness, du-
ration, and/or pitch can override sonority, or, as Schubiger (1977: 108) puts
it: “Die Sonoritätsskala behält ihre volle Gültigkeit nur bei gleichbleibendem
Atemdruck. Druckschwankungen können Verschiebungen bewirken.” [Only
if the expiratory pressure is constant, the sonority hierarchy remains fully
valid. Variations in pressure may cause a shift. (Translation K.G.)] Thus, we
have to distinguish between what could be called the sonority contour and the
prominence contour in a syllable.
Evidence for the validity of this differentiation can be found in Lithua-
nian, where diphthongs possessing roughly the same initial and target vowel
qualities and, consequently, an identical sonority contour, may form mini-
mal pairs with respect to a decreasing or an increasing prominence contour,
cf. áukštas ‘high’ vs. aũkštas ‘floor, storey’.5 In order to have a means for
differentiating sonority and prominence, I suggest terming diphthongs with
a decreasing prominence contour early-peak diphthongs and those with an
increasing prominence contour late-peak diphthongs.6
If we take the fact seriously that sonority, in marked cases, is not the
single controlling parameter for syllable peak formation in diphthongs, the
search for this kind of mismatching sequences can be further extended. Siev-
ers already did this in 1893, when he stated the following on the combinations
of vowels with liquids and nasals:
Auch hier haben wir es hauptsächlich nur mit den einsilbigen Verbindungen
zu thun. Diese sind den Verbindungen zweier Vocale vollkommen analog, nur
mit der Einschränkung, dass nach den Gesetzen der Abstufung der Schallfülle
... die Liquide und Nasale in fast allen Fällen die unsilbischen Glieder der
Verbindungen sind. Dass wir Gruppen wie ar, al, an, am, aŋ (genauer
geschrieben ar̯, al̯, an̯, am̯, aŋ̯, um die unsilbische Geltung des an zweiter
Stelle stehenden Sonorlauts zu bezeichnen) nicht auch als ‘Diphthonge’
auffassen, liegt grossentheils bloss an der Gewohnheit, l, r, m, n, ŋ als
‘Consonanten’ zu bezeichnen, die mit einem ‘Vocale’ nicht eine derartig
homogene Verbindung eingehen können wie zwei ‘Vocale’ unter einander.
(Sievers 1893: 154)
[Here, too, we are mostly dealing with monosyllabic combinations. These are
completely analogous to the combinations of two vowels, with the restriction
that according to the laws of gradation of sonority ... the liquids and nasals
are almost always the non-syllabic members in the combinations. The fact that
we do not also consider sequences like ar, al, an, am, aŋ (more exactly
ar̯, al̯, an̯, am̯, aŋ̯, to mark the non-syllabicity of the sonorant in second
place) ‘diphthongs’ is to a great extent merely due to the habit of calling
l, r, m, n, ŋ ‘consonants’, which cannot form as homogeneous a combination
with a ‘vowel’ as two ‘vowels’ can with each other. (Translation K.G.)]
5. The place of the syllable peak is marked by the tilde ˜ and acute ´ accent signs.
These are also used to indicate different stress types in Lithuanian.
6. A third “contour-type” of diphthongs occurring in some languages (though not in
Lithuanian) might be floating diphthongs without a clearly localizable peak, as Grønnum
(1998: 79) puts it (even if this is done in connection with sonority and not with
prominence).
The so-called semi-diphthongs in Lithuanian – the term Semi-Diphthong was
presumably coined by Kurschat (1876) – seem to constitute an example of
a rather remarkable mismatch of sonority and prominence contours. These
semi-diphthongs are defined as “tautosyllabic clusters ‘vowel + sonorant’”
(Ambrazas 1997: 25) and come, like the diphthongs /aũ/ and /áu/ mentioned
above, in two prominence contours, namely early-peak and late-peak. An
early-peak sequence of vowel plus sonorant as in pažínti ‘to know’ is not
at all a peculiarity and would normally not be dealt with under the heading
of diphthongs. But in the case of vowel plus sonorant combinations in
Lithuanian, a sonorant /l, m, n, r/ may also form the (late) syllable peak –
this in spite of the preceding, much more sonorous vowel in the syllable. To
make the sonorant the syllable peak, extra effort in articulation is required in
the form of higher expiratory pressure, resulting in a higher volume, and, in
addition, some lengthening of the sonorant; cf. the following examples for
illustration, the accent sign over the letter again indicating the place of the
syllable peak: añtrą kar̃tą ‘second.ACC time.ACC’, kal̃bate ‘speak.2PL’.
Although the matter of Lithuanian semi-diphthongs deserves much more
thorough inquiry than this somewhat cursory discussion can present here due
to limitations of space, it has become clear that there is much more variation
in diphthongs than one might expect. Therefore, one should be cautious with
claims like the following by Sánchez Miret (1998), who says that “diphthongs
like [e̯ɪ, o̯u, ə̯i, ə̯ɪ, a̯e, a̯ə] are actually unpronounceable, no matter how
much one can diminish the expiratory intensity on the first element” (Sánchez
Miret 1998: 44) – at least if it is not intended to be a statement about one’s
personal articulatory abilities. In any case, such mismatches of sonority
contour and prominence contour may be seen as marked structures.
It should be added that whether one wants to call sounds or sound se-
quences like the Lithuanian ones (semi-)diphthongs or not, is not the crucial
question. If a phenomenon like the Lithuanian semi-diphthongs with a sylla-
ble peak on the less sonorous sonorant phase does occur, any language de-
scription or documentation has to account for it. In my view, the best place
to do so would be in the section on diphthongs and syllable structure in the
chapter on phonology.
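For documentation purposes, the distinction between sonority contour and prominence contour can be recorded explicitly for each item. The following is a minimal sketch under stated assumptions: the integer sonority ranks merely encode the rough order open vowels > closed vowels > liquids > nasals, and the class name `TwoPhaseNucleus` and its fields are hypothetical.

```python
from dataclasses import dataclass

# Coarse sonority ranks (higher = more sonorous), following the rough order
# open vowels > closed vowels > liquids > nasals; the integers are assumed.
SONORITY = {"a": 3, "i": 2, "u": 2, "l": 1, "r": 1, "m": 0, "n": 0}


@dataclass(frozen=True)
class TwoPhaseNucleus:
    """A tautosyllabic sequence of two phases with a single syllable peak."""

    initial: str
    target: str
    peak: str  # "early" or "late"

    def sonority_contour(self) -> str:
        a, b = SONORITY[self.initial], SONORITY[self.target]
        return "falling" if a > b else "rising" if a < b else "level"

    def is_mismatch(self) -> bool:
        """True if the syllable peak lies on the less sonorous phase."""
        contour = self.sonority_contour()
        return (contour == "falling" and self.peak == "late") or (
            contour == "rising" and self.peak == "early"
        )


examples = {
    "áukštas": TwoPhaseNucleus("a", "u", "early"),
    "aũkštas": TwoPhaseNucleus("a", "u", "late"),
    "kal̃bate": TwoPhaseNucleus("a", "l", "late"),
}
for word, nucleus in examples.items():
    print(word, nucleus.sonority_contour(), nucleus.peak, nucleus.is_mismatch())
```

In this encoding, the late-peak items are flagged as mismatches, i.e. as the marked structures discussed above, while early-peak áukštas is not.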
2.4. Summary: What is a diphthong?

In this section, the basic notions of diphthongs were discussed. It was pointed
out that an articulatory movement in production leading to an auditorily
perceivable change of quality is the minimum requirement for a vocoid sound to
count as a diphthong; the representation of a diphthong by means of
two symbols has to be understood as an abstraction inasmuch as these
symbols only indicate the initial and the target positions of the diphthong and not
two distinct “halves” – even if the transition from initial to target position
can be a more gliding or a more sequential one. Besides reviewing the pre-
viously established distinctions of phonetic vs. phonological, mono- vs. bi-
phonemic, and, with respect to sonority contour, falling vs. rising diphthongs,
we introduced the criterion of prominence with the distinction of early-peak
vs. late-peak diphthongs. The evidence for the necessity to discern sonority
and prominence contours came from examples like Lithuanian, where less
sonorous phonetic vowels and even sonorants like /m, n, l, r/ can form sylla-
ble peaks.
Diphthongs may turn out to be best analyzed and described as derived
on the synchronic level in the sense that a phonetic diphthong appears as the
result of a phonological rule such as, e.g., vocalization of certain consonants
in certain positions affecting an underlying phonological non-diphthong (a
vowel + consonant sequence). Diphthongs may also be derived – or: have de-
veloped in a historical sense – by sound change processes; sometimes, those
diphthongs are termed primary, secondary, tertiary etc. Be the derivational re-
lation synchronic or historical, any description and documentation will bene-
fit from establishing and explaining systematic links like these.
3. Case study: Finnish

3.1. Why Finnish?
Finnish has a remarkably large diphthong inventory that constitutes a true
challenge for a proper description – a challenge that has not been responded
to in a completely satisfactory way to this day, at least as far as descriptive
grammars or grammatical sketches are concerned. Having a closer look at
the ways the diphthongs of this language are systematized and presented in
grammars is, however, not only helpful for the sake of developing a proper
description of Finnish diphthongs. It is also instructive for the purpose of
establishing criteria for potentially relevant diphthong features in general.
To simplify quite a bit (see e.g. Abondolo 1998 for a detailed overview),
Finnish phonology is, amongst other properties, characterized by rather
symmetrical and highly integrated systems of vowels and consonants, length
being distinctive in all vowels (monophthongs) and in almost all consonants.
Word stress is on the first syllable as a rule. One main property is the
presence of vowel harmony. This property is twofold: it concerns the dimensions
of place (front vs. back) and of lip rounding (rounded vs. unrounded). Of the eight
vowel phonemes /i, y, u, e, ö, o, ä, a/,7 /i/ and /e/ are neutral with respect to
both of the dimensions, whereas the open vowels /a/ and /ä/ do not participate
in the lip harmony. In other words, /y/ and /ö/ as well as /u/ and /o/ are the
only fully harmonic vowels (the same holds true for the long counterparts, of
course).
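The two harmony dimensions can be written down as a small lookup, which is useful later when diphthongs are classified by harmony. This is a sketch only: the names `PLACE_CLASS`, `LIP_HARMONIC`, `place_compatible` and `fully_harmonic` are hypothetical, and the compatibility check assumes that place harmony excludes mixing phonologically front /y, ö, ä/ with back /u, o, a/.

```python
# Harmony classes as described in the text: /i, e/ are neutral, /a, ä/ take
# part in place harmony only, /y, ö, u, o/ in both dimensions.
PLACE_CLASS = {
    "y": "front", "ö": "front", "ä": "front",
    "u": "back", "o": "back", "a": "back",
    "i": "neutral", "e": "neutral",
}
LIP_HARMONIC = frozenset("yöuo")


def place_compatible(*vowels: str) -> bool:
    """True if the vowels do not mix phonologically front and back members."""
    classes = {PLACE_CLASS[v] for v in vowels} - {"neutral"}
    return len(classes) <= 1


def fully_harmonic(vowel: str) -> bool:
    """True if the vowel takes part in both place and lip harmony."""
    return vowel in LIP_HARMONIC


print(place_compatible("ä", "y"))  # both phonologically front
print(place_compatible("a", "y"))  # back combined with front
print(place_compatible("e", "u"))  # neutral combined with back
print(sorted(v for v in PLACE_CLASS if fully_harmonic(v)))
```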
3.2. Item- and inventory-related questions

3.2.1. Finnish diphthong inventories
To start with, the bare number of diphthongs is already a matter of
disagreement: some representations count 16, others 17, still others 18 diphthongs.
The differences are, however, not as big as they might seem, since there is a
core set of 16 diphthongs that are unequivocally accepted as such by e.g.
Campbell (1995: 461), Karlsson (1992: 14; 1999: 24), Mitchell (2001: 215)
and Sulkala and Karjalainen (1997: 373). These diphthongs are: /ai, ei, oi,
ui, yi, äi, öi, au, ou, eu, iu, äy, öy, ie, uo, yö/. Karlsson (1992: 14), Mitchell
(2001: 215) and Sulkala and Karjalainen (1997: 373) add /ey/ as a 17th
diphthong, whereas the diphthong /iy/ is listed as no. 18 only by Karlsson (1992: 14)
and Mitchell (2001: 215).8
7. The monophthongs of Finnish are usually systematized as follows:

          front                   back
    unrounded   rounded
    i           y           u         close
    e           ö           o         mid       ± long
    ä                       a         open
Two main groups are generally distinguished in these representations:
firstly, diphthongs ending in one of the high vowels /i, y, u/, forming a group
of 13, 14, or 15 items respectively, depending on whether or not /ey/ and /iy/
are included; and secondly, a small but stable group of the three diphthongs
/ie, uo, yö/. These two groups are more or less explicitly defined either with
respect to the articulatory movement or with respect to the sonority contour.
In accordance with articulatory movement, the first, bigger group consists of
closing diphthongs, the second, smaller group of opening diphthongs. In ac-
cordance with sonority contour, the first group consists of falling, the second
group of rising diphthongs. Irrespective of the exact number of items in the
inventory, all of the 16, 17 or 18 diphthongs comprise in their first and
second phases (i.e. in their initial and target positions, respectively) only vowel
qualities that also occur in monophthongs.
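The competing inventory sizes and the two groups can be tallied mechanically. The sketch below assumes nothing beyond the lists given in this section; the names `CORE`, `PERIPHERAL` and `split_by_target` are hypothetical.

```python
# The 16 core diphthongs plus the two items of restricted occurrence.
CORE = ["ai", "ei", "oi", "ui", "yi", "äi", "öi", "au", "ou", "eu", "iu",
        "äy", "öy", "ie", "uo", "yö"]
PERIPHERAL = ["ey", "iy"]
HIGH_VOWELS = set("iyu")


def split_by_target(inventory):
    """Separate diphthongs ending in a high vowel from all others."""
    ending_high = [d for d in inventory if d[-1] in HIGH_VOWELS]
    others = [d for d in inventory if d[-1] not in HIGH_VOWELS]
    return ending_high, others


# Inventories with 16, 17 and 18 items respectively.
for extra in range(len(PERIPHERAL) + 1):
    inventory = CORE + PERIPHERAL[:extra]
    ending_high, others = split_by_target(inventory)
    print(len(inventory), len(ending_high), others)
```

Whichever inventory is chosen, the second group stays constant; only the group of diphthongs ending in a high vowel grows.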
3.2.2. Diphthongs with restricted occurrence

It is obvious that it is the two diphthongs /iy, ey/ that are causing problems
in identifying the exact size of the diphthong inventory. Even one and the
same author (Fred Karlsson, 1992 and 1999) makes different decisions and,
consequently, differing counts at different times in his analyses of the Finnish
sound structure. /iy, ey/ are indeed heavily restricted in their occurrence –
without, however, being suspected of being loans: /ey/ only shows up in
leyhyä ‘to wave’ and in a few more words derived from the same root, even
though the sequence /e/ plus /y/ itself is far from infrequent in Finnish words.
In most cases, however, the two sounds are divided up into two syllables by
a hiatus, e.g. in ter.ve.ys ‘health’. Regarding the second critical diphthong,
the only occurrence of (tautosyllabic) /iy/ is in the place name Kiysaari. A
decision has to be made here whether unique elements like the diphthongs /iy/
(restricted to one single word) and /ey/ (restricted to one single root) should
or should not be included in the overall diphthong inventory of the language.
In my view, they are definitely part of the sound system – but in terms of core
and periphery, they are clearly part of the periphery rather than of the core.
8. Actually, the diphthong chart Mitchell provides comprises 20 positions, since, for unknown
reasons, /ie/ and /ei/ are listed twice.
Another peculiarity must be mentioned here, namely that the diphthong
/öi/ never occurs in roots, but can only be the result of a morphological con-
struction involving stem-final /ö/ + (nominal) plural /i/ or (verbal) preterite
or conditional /i/. So, /öi/ is peculiar not with respect to frequency of oc-
currence in word forms, as /ey/ and /iy/ are, but rather with respect to place
of occurrence in a word form. The sequence is frequent since both nominal
plural and verbal preterite and conditional formation are very common mor-
phological constructions, and stem-final /ö/ is not really rare. Because /öi/ is
quite restricted with respect to possible places of occurrence in word forms,
it is, in my view, more peripheral than other diphthongs. An interesting fact,
also from the viewpoint of documentary linguistics, is that the most common
diphthongs in a sample of 6,700 diphthongs turned out to be /oi/ (21.4% of
all diphthongs), /ai/ (16.5%), and /ei/ (15.1%) (cf. Häkkinen 1977, cited in
Karlsson 1983: 90); note that these are all diphthongs that occur freely both
in roots and across morpheme boundaries of the type mentioned above.
Probably the most widely discussed issue in terms of core and periphery
of phonological (and other linguistic) elements is that of whether loans
– here: loan diphthongs – should be considered part of the inventory of a
given language or not. This is, actually, not a big issue in Finnish, but for the
closely related Estonian language, it is reported that out of the up to 36 diph-
thongs, as many as 10 solely occur in loan words (Viitso 2003: 22). And for
the Baltic language Lithuanian, recent presentations (Ambrazas 1997; Eckert,
Bukevičiūtė, and Hinze 1994) unanimously count, beyond the genuine items,
the diphthongs /eu, oi, ou/ “in words of foreign origin” (Mathiasen 1996: 30)
as part of the inventory – despite the fact that Lithuanian grammaticography
tends to be very conservative in an overall perspective.
3.2.3. Sonority and prominence in Finnish diphthongs
One feature all Finnish diphthongs have in common is that the first phase
of the diphthong is more prominent than the latter one, i.e. all Finnish diph-
thongs are early-peak diphthongs. For the first – and much bigger – group
mentioned above, namely the diphthongs ending in a closed vowel /i, y, u/,
the decrease of prominence is to be expected, since it is matched by the
decrease of sonority. But in group two, we find diphthongs that show a rising
sonority contour and decreasing prominence simultaneously. It is worth
mentioning that the three “mismatching” Finnish diphthongs /ie, yö, uo/ get
restructured as soon as they are followed by an /i/ – which is, as mentioned
above, very common in morphological constructions – in the sense that they
adapt to the unmarked, prevailing pattern, avoiding triphthong formation, cf.
e.g. tie ‘path’, *tie-i-tä > te-i-tä ‘path-PL-PART’; yö ‘night’, *yö-i-tä > ö-i-tä
‘night-PL-PART’. In this respect, these are unstable diphthongs.
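The restructuring can be stated as a simple rewrite rule. The sketch below is hedged in two ways: the mapping /uo/ > /o/ is filled in by analogy with the two examples given (the text states that all three diphthongs are affected but illustrates only /ie/ and /yö/), and the hypothetical function `add_i` models only this one alternation, not Finnish stem morphology in general.

```python
# Opening diphthongs and the simple vowel that replaces them before /i/.
# /ie/ > /e/ and /yö/ > /ö/ follow the examples in the text; /uo/ > /o/ is
# filled in by analogy and should be treated as an assumption.
RESTRUCTURE = {"ie": "e", "yö": "ö", "uo": "o"}


def add_i(stem: str) -> str:
    """Attach an /i/ suffix, simplifying a stem-final opening diphthong."""
    for diphthong, vowel in RESTRUCTURE.items():
        if stem.endswith(diphthong):
            return stem[: -len(diphthong)] + vowel + "i"
    return stem + "i"


# The two examples from the text, with the partitive ending added by hand.
for stem in ("tie", "yö"):
    print(stem, "->", add_i(stem) + "tä")
```

The output forms contain the closing diphthongs /ei/ and /öi/ instead of a triphthong, i.e. the unmarked pattern in which sonority and prominence both decrease.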
3.3. System-related issues

3.3.1. Finnish diphthong systems
Up to now, we have said very little about the distinctive features of Finnish
diphthongs and about how to fully systematize them. An examination of the
existing systems of the Finnish diphthong inventories proves to be insightful
and illustrative with respect to the difficulties caused in part by the size of
the inventory, in part by lack of a working analytic tool. In the following, I
will give five examples of diphthong systems in Finnish, as they can be found
in current grammars and grammatical sketches. Note that all systems given
below comprise only 16 items, except that of Hakulinen et al. (2004), which
includes the controversial /ey/ and /iy/.
Branch (1987: 596; also Lyovin 1997: 80) represents the Finnish diph-
thongs in the following way:
ei    äy    eu
äi
ui
ai          au
oi          ou
öi    öy
yi
ie
      yö    uo
            iu
Karlsson (1999: 24–25) lists four groups of diphthongs:

(a) diphthongs ending in /i/ (/ei, äi, ui, ai, oi, öi, yi/)
(b) diphthongs ending in /u/ (/au, ou, eu, iu/)
(c) diphthongs ending in /y/ (/äy, öy/)
(d) the three diphthongs /ie, yö, uo/
Hakulinen et al. (2004: §21, translation K.G.) operate with three main types
of diphthongs and some subdivisions:
closing diphthongs ei öi äi oi ai | ey öy äy | eu ou au
closed diphthongs yi ui | iy iu
opening diphthongs ie yö uo
Groenke (1998: 133, translation K.G.) provides these four groups of diph-
thongs:
ei öi oi öy eu ou
äi ai äy au
3 falling: uo, yö, ie
2 de-rounding diphthongs: yi, ui
1 rounding diphthong: iu
And finally, Fromm’s (1982: 31, translation K.G.) analysis comprises the fol-
lowing five groups:
I           II    III      IV          V
ei öi oi    öy    eu ou    ie yö uo    iu yi ui
äi ai       äy    au
I. consisting of an open or mid-open vowel phoneme and /i/,
II. and III. consisting of an open or mid-open vowel phoneme and a closed one
as demanded by vowel harmony,
IV. consisting of a closed phoneme and the corresponding mid-open one,
V. consisting of a closed phoneme and a phoneme of the same degree of
openness, but contrasting in the feature of rounding.
Let us briefly comment on the diphthong systems presented above. Branch
(1987) and, in an identical way, Lyovin (1997) systematize the diphthong
inventory to a certain extent, as their chart suggests. However, since the
criteria for systematization are not explicitly stated by the authors, one has to
deduce them from the positioning of the elements in the chart. The target
vowel quality seems to play a major role for the arrangement of the columns,
whereas the arrangement of the lines all in all remains cryptic. This applies in
particular to the lines at the bottom, where it remains an open question why
the three opening / rising diphthongs have been integrated in this position.
Clearly, such attempts do not meet the basic requirements of consistency and
explicitness for a phonological systematization.
Karlsson (1999) is, in some sense, not so different from Branch (1987)
and Lyovin (1997): he, too, uses the target vowel quality /i/, /u/ and /y/ as the
main classificatory criterion. In contrast to Branch (1987) and Lyovin (1997),
however, he at least explicitly mentions this criterion. However, the criteria
underlying the arrangement of items within the three groups again have to be
inferred somehow by the reader. Karlsson’s (1999) group (d), unlike groups (a)
to (c), is a bare enumeration, not characterized by any feature. As far as any
criteria are retrievable at all, these systematizations only seem to make use
of the static properties of diphthongs, i.e. the positions of either the initial or,
possibly, the target vowel quality.
Hakulinen et al. (2004) is the only systematization comprising 18 diph-
thongs. The vertical tongue movement functions as the main criterion for
classification here; the quality of the target vowel seems to be an additional
criterion for systematizing closing diphthongs, as the respective subgroups
are constituted by separating marks “|”. Unfortunately, this feature is not ex-
plicitly pointed out. Regarding the group of closed diphthongs, however, the
subgrouping is carried out according to the position of the /i/, i.e. whether it
is the initial or the target vowel quality in the diphthong. The order of diph-
thongs within the subgroups seems to follow, by and large, the principle of
the IPA chart to put front vowels to the left and back vowels to the right. The
opening diphthongs in the third group seem to be arranged in the same way.
Hakulinen et al. (2004) are using the dynamic criterion of vertical tongue
movement (or the absence of it, respectively) as a feature for forming the
three main groups, which apparently improves the analysis.
The systematization Groenke (1998) suggests in his sketch of Finnish is
somewhat suspect because the feature of vertical tongue position is described
using the same terms that are otherwise used for sonority contours (rising,
falling).9 The features related to changes in lip rounding that Groenke brings
into play are, however, useful ones since they address the dynamic nature of
diphthongs. But whilst /iy/ only displays lip movement (rounding) in terms
of articulation, both /iu/ and /ui/ imply, in addition to lip movement (rounding
in /iu/ and de-rounding in /ui/), horizontal tongue movement (backing in /iu/
and fronting in /ui/). Diphthongs where only one articulatory feature changes
from initial to target position like in /iy/, but also in /yi, uo/ etc., are termed
homogeneous (cf. Roca and Johnson 1999: 190–191), whereas in a
heterogeneous diphthong more than one feature changes, e.g. in /ui, äy/.
9. It may be somewhat unclear whether this is a question of terminological confusion
regarding vertical tongue movement and sonority contour, or whether it has to do with a
different terminological tradition in describing phonology in German. In German
phonology, the terms ‘rising’ and ‘falling’ are traditionally used in the same way Groenke
uses them here, for instance in several introductory textbooks on phonetics and phonology
– even in the most recent ones, cf. Pompino-Marschall (1995: 218–219), Graefen and Liedke
(2008: 220), Altmann and Ziegenhain (2010: 47). If the latter, however, should prove to be
true, this would result in an inconsistent terminology, where a diphthong, oddly enough,
can be rising and falling at the same time, depending on whether the feature under
discussion is articulatory movement or sonority. Therefore, I strongly prefer and recommend
strictly distinguishing between rising / falling on the one hand, and closing / opening on the
other hand.
The system proposed by Fromm (1982), finally, does not only make use
of dynamic as well as static features, it also labels them – at least in the
descriptions of the five (or possibly actually four, since II and III are treated
together) groups. The gap in the chart of group II on the top left is to be filled
with /ey/, which shows that /e/, in contradiction to the explanation given by
Fromm, does not participate in vowel harmony; /iy/ could easily be integrated
in group V.
In my view, none of these systematizations, not even the most recent one,
adequately captures the distinctiveness of relevant features in Finnish
diphthongs completely. Therefore, I will present a more consistent analysis in the
next section.
3.3.2. Taking the dynamic nature of diphthongs seriously in systematization

The following account for systematizing Finnish diphthongs makes use of
both dynamic and some static features as used in the aforementioned
inventories and aims to develop them further. Thus, the system employs the
dynamic feature of movement in vertical tongue position [± move]. Note that
no movement in vertical tongue position, thus [– move], always implies a
different type of movement in any diphthong, be it horizontal tongue position,
lip rounding, or both – these other types of movement are not distinctive in
this systematic account, but still relevant for the phonetic form of the
diphthong. If the item is [+ move], the extent of a possible movement in vertical
tongue position can be [+ wide] (between open and close vowel quality) or
narrow (between mid and close), thus [– wide].
The very basic distinction of falling vs. rising in sonority, ascribing the
feature [+ rising] to the three diphthongs /yö, ie, uo/, is also implemented, as
is the vowel harmony dimension of lip rounding, where only the non-open
round vowels /y, ö; u, o/ participate (see Section 3.1). Therefore, there are
only four diphthongs with the feature [+ lip harm]. Lip rounding in the target
vowel quality is made use of for the feature [+ round], which occurs only in
diphthongs ending in /y, ö, u, o/, all others being [– round].
The other dimension of Finnish vowel harmony, namely place harmony,
plays an even more important role for the system. The place harmony fea-
ture [+ front] vs. [– front] is to be understood not in an articulatory but in a
phonological sense. Recall that the open vowels /ä/ and /a/ participate in place
harmony (though not in lip rounding harmony) and that /i/ and /e/, though
phonetically front vowels, are neutral with respect to place harmony, i.e. they
can combine with both /y, ö, ä/ and /u, o, a/. In other words, a diphthong is
front if either the initial or the target vowel quality is phonologically front
(i.e. /y, ö, ä/), and it is back, if either the initial or the target vowel quality is
phonologically back (i.e. /u, o, a/); /ei/ and /ie/ are neither front nor back.
The matrix of distinctive features in Finnish diphthongs is as follows:
    rising  front  round  lip harm  move  wide
iy    –       +      +       –       –     –
ey    –       +      +       –       +     –
öy    –       +      +       +       +     –
äy    –       +      +       –       +     +
yi    –       +      –       –       –     –
öi    –       +      –       –       +     –
äi    –       +      –       –       +     +
ei    –       –      –       –       +     –
ui    –       –      –       –       –     –
oi    –       –      –       –       +     –
ai    –       –      –       –       +     +
iu    –       –      +       –       –     –
eu    –       –      +       –       +     –
ou    –       –      +       +       +     –
au    –       –      +       –       +     +
yö    +       +      +       +       +     –
uo    +       –      +       +       +     –
ie    +       –      –       –       +     –
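The feature values can also be derived mechanically from the properties of the initial and target vowels described in the preceding paragraphs. The following Python sketch illustrates this under stated assumptions: the three-step height scale, the function name `features` and all constants are introduced here for illustration only, and [+ rising] is approximated as "target vowel more open than initial vowel", which happens to hold for the 18 items at hand but is not a general definition of sonority contour. Under these rules, /yi/ comes out as [– move] and /öi/ as [+ move], in line with the arranged overview of the system.

```python
# Sketch: derive the six binary features of the Finnish diphthongs from
# properties of their initial and target vowels. The height classes and all
# names below are illustrative assumptions, not part of the original analysis.

HEIGHT = {"i": 0, "y": 0, "u": 0,      # close
          "e": 1, "ö": 1, "o": 1,      # mid
          "ä": 2, "a": 2}              # open
HARMONICALLY_FRONT = set("yöä")        # /i, e/ are neutral and hence excluded
ROUND_NON_OPEN = set("yöuo")           # vowels taking part in lip rounding harmony

DIPHTHONGS = ["iy", "ey", "öy", "äy", "yi", "öi", "äi", "ei", "ui",
              "oi", "ai", "iu", "eu", "ou", "au", "yö", "uo", "ie"]

FEATURES = ["rising", "front", "round", "lip harm", "move", "wide"]


def features(diphthong):
    """Return the feature values of a two-vowel diphthong as a dict of booleans."""
    initial, target = diphthong
    distance = abs(HEIGHT[initial] - HEIGHT[target])
    return {
        "rising": HEIGHT[target] > HEIGHT[initial],        # target more open
        "front": bool({initial, target} & HARMONICALLY_FRONT),
        "round": target in ROUND_NON_OPEN,                 # target quality only
        "lip harm": {initial, target} <= ROUND_NON_OPEN,   # both vowels rounded
        "move": distance > 0,                              # vertical movement
        "wide": distance == 2,                             # between open and close
    }


if __name__ == "__main__":
    print("    " + "  ".join(FEATURES))
    for d in DIPHTHONGS:
        values = features(d)
        print(d, "  ", "  ".join("+" if values[f] else "-" for f in FEATURES))
```

Running the script prints one row of plus and minus signs per diphthong, which can be compared line by line with the matrix.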
Arranged according to their binary feature values, the system displays the
diphthongs in Finnish in a quite compact and integrated way. I refrain from
marking the vast majority of diphthongs for [– lip harm] and [– rising] re-
spectively, since that would affect the clarity of the representation; the pos-
itive feature values are simply indicated in boldface or italics, respectively.
          [+ front]                      [– front]
   [+ round]         [– round]           [+ round]
     iy        yi              ui          iu        [– move]
     ey        öi              oi          eu        [+ move] [– wide]
    öy yö           ei ie                 ou uo      [+ move] [– wide]
     äy        äi              ai          au        [+ move] [+ wide]
[+ lip harm]: öy, yö, ou, uo     [+ rising]: yö, ie, uo

Translating the binary features to common category labels renders a system
that looks like this:

            front                          back
     round           not round             round
     iy        yi              ui          iu        no movement
     ey        öi              oi          eu        narrow movement
    öy yö           ei ie                 ou uo      narrow movement
     äy        äi              ai          au        wide movement
lip harmonic: öy, yö, ou, uo     rising: yö, ie, uo
Of course, the category labels – like all labels in this type of systematization
– require some explanation, e.g. that front and back are not articulatory
features but refer to place harmony; that round and not round refer only to
the target vowel quality; or that movement refers to vertical tongue
movement only.
The systematization I propose in this paper has several advantages: it is
effective, since its outcome is a highly integrated system which, in addition,
does not leave any unanalyzable remainder that merely gets enumerated; it
is economical, since it makes use of only a small feature set (obviously only
the distinctive ones); it is adequate for the language, since it relies on core
characteristics of Finnish phonology, such as the vowel harmonies; and it is
typologically adequate, since it clearly states that the rising early-peak diph-
thongs /ie, yö, uo/ are marked structures – simply because they need special
features to be singled out.
4. Conclusion: The “diphthong analysis and description tool”

In this last section, I will merge the results from the general discussion of
diphthongs and the findings from the specific examination of the Finnish
diphthongs, thus establishing the tool for diphthong analysis and description
this article aims at.
When analysing the functional phonetics of a language, one of the first
observations might be whether the language contains vocoids with a changing
sound quality or not – i.e. whether the language contains phonetic diphthongs.
In most languages, this will be the case. These diphthongs – even before
anything is known about their precise phonological status – can be
described in terms of the following articulatory criteria (ordered according to
decreasing probability of occurrence):
– opening vs. closing: vertical tongue movement
– fronting vs. backing: horizontal tongue movement
– (peripherising vs. centralising: combination of vertical and horizontal
tongue movement)
– rounding vs. de-rounding (spreading): change in lip rounding
– nasalizing vs. de-nasalizing: change in nasality
The following two additional criteria may be useful for a further differentia-
tion of the diphthongs in a language:
– homogeneous vs. heterogeneous: one vs. more than one parameter changing
– narrow vs. wide: articulatory / perceptive distance from initial to target
position / sound
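The articulatory criteria listed so far lend themselves to a simple mechanical check. The following Python sketch labels the movement from an initial to a target vocoid; the numeric coding of height and backness, the class name `Vocoid` and the function name `describe` are hypothetical simplifications introduced for illustration. The combined criterion (peripherising vs. centralising) and the narrow vs. wide distinction are left out, since they would require a language-specific notion of distance.

```python
# Sketch: label a phonetic diphthong by the articulatory criteria listed above.
# The vowel coding (numeric height/backness scales) is a hypothetical
# simplification introduced for illustration only.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Vocoid:
    height: int      # 0 = close ... 3 = open (assumed four-step scale)
    backness: int    # 0 = front, 1 = central, 2 = back
    rounded: bool
    nasal: bool = False


def describe(initial: Vocoid, target: Vocoid) -> list[str]:
    """Return the dynamic labels that apply to the movement initial -> target."""
    labels = []
    if target.height != initial.height:
        labels.append("opening" if target.height > initial.height else "closing")
    if target.backness != initial.backness:
        labels.append("backing" if target.backness > initial.backness else "fronting")
    if target.rounded != initial.rounded:
        labels.append("rounding" if target.rounded else "de-rounding")
    if target.nasal != initial.nasal:
        labels.append("nasalizing" if target.nasal else "de-nasalizing")
    # homogeneous: exactly one parameter changes; heterogeneous: more than one
    if labels:
        labels.append("homogeneous" if len(labels) == 1 else "heterogeneous")
    return labels


# Example: from an open front unrounded to a close back rounded vocoid, roughly [au]
a = Vocoid(height=3, backness=0, rounded=False)
u = Vocoid(height=0, backness=2, rounded=True)
print(describe(a, u))
```

A vocoid pair that differs in no parameter yields an empty list, i.e. it is not a phonetic diphthong in the sense used here.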
Of course, at this early stage the important “contour criteria” may already be
accounted for:
– falling vs. rising: sonority contour, direction of change
– early peak vs. late peak: prominence contour, position of peak
Note that falling, early-peak diphthongs are by far the most common diph-
thong type, but, as was shown, prominence caused by means of extra effort
in pitch and / or expiratory energy may override the sonority contour – even
to the extent that sonorants may form a syllable peak. In this (probably rare)
case, the next criterion has to be applied:
– purely vocoid vs. blended: type of sounds constituting the diphthong (only
vocoids vs. vocoids in combination with sonorants)
The following criterion can provide a hint for a later analysis of mono-phone-
mic vs. bi-phonemic diphthongs, but need not do so:
– sequential vs. gliding: impression of 2 distinct phases vs. 1 complex phase
All of these criteria, which directly address the dynamic character of diph-
thongs, may not only be relevant for a sound description of the phonetic diph-
thongs in a language, they may also constitute potentially distinctive features
if it turns out in the phonological analysis that at least some of the diph-
thongs have a functional, phonemic status (by forming minimal pairs) in the
language. It goes without saying that all of the static features that the
initial and/or the target sounds bear – e.g. front vs. back, close vs. mid vs.
open, round vs. spread, nasal vs. oral, etc. – might turn out to be system-
atically distinctive in phonological diphthongs. This also holds true for the
criterion of quantity, even if it is not very likely that quantity functions as a
distinctive feature in the diphthongs of a language.
If there are phonological diphthongs in a language, these criteria have to
be dealt with:
– mono-phonemic vs. bi-phonemic: segmental phonological analysis
– simple vs. complex: occurrence in morphologically simple vs. in complex
forms
– stable vs. unstable: changeability by morphophonological processes
– common vs. restricted: restrictions of occurrence, e.g. only in stressed syl-
lables, in loans, etc.
– frequent vs. infrequent: frequency in lexicon words or in text words
The criteria of sonority and prominence contour come up again in phono-
logical analysis, now with special reference to the general principles of syl-
lable structure in the language. And, of course, mismatches of sonority and
prominence contour, as they are marked structures, need a particularly careful
investigation and explanation.
A criterion that is equally relevant for the phonetic and the phonologi-
cal perspective is the question of underlying phonological non-diphthongal
sound sequences for diphthongs in the phonetic output, or the question of
stages in diachronic emergence by sound change processes, giving rise to the
following features:
– derived vs. basic: synchronic generatability by phonological rules
– primary vs. secondary vs. tertiary etc.: diachronic emergence
I hope that the “diphthong analysis and description tool” will prove useful for
a number of linguistic enterprises, be they within language documentation or
other linguistic areas.
References
Abondolo, Daniel. 1998. Finnish. In The Uralic Languages, ed. Daniel Abon-
dolo, 149–183. London, New York: Routledge.
Altmann, Hans, and Ute Ziegenhain. 2010. Prüfungswissen Phonetik, Pho-
nologie und Graphemik. (3rd edition). Göttingen: Vandenhoeck und Rup-
recht.
Ambrazas, Vytautas. 1997. Lithuanian Grammar. Vilnius: Baltos lankos.
Andersson, Erik. 1994. Swedish. In The Germanic Languages, eds. Ekkehard
König and Johan van der Auwera, 271–312. London, New York: Rout-
ledge.
Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke:
Palgrave Macmillan.
Branch, Michael. 1987. Finnish. In The World’s Major Languages, ed.
Bernard Comrie, 593–617. London: Croom Helm.
Braunmüller, Kurt. 1999. Die skandinavischen Sprachen im Überblick. 2nd
edition. Tübingen: Francke.
Campbell, George L. 1995. Concise Compendium of the World’s Languages.
London, New York: Routledge.
Catford, John C. 1977. Fundamental Problems in Phonetics. Edinburgh:
Edinburgh University Press.
Eckert, Rainer, Elvira-Julia Bukevičiūtė, and Friedhelm Hinze. 1994. Die
baltischen Sprachen: Eine Einführung. Leipzig: Langenscheidt Verlag En-
zyklopädie.
Elert, Claes-Christian. 1995. Allmän och svensk fonetik [General and
Swedish Phonetics]. 7th edition. Stockholm: Norstedts.
Fromm, Hans. 1982. Finnische Grammatik. Heidelberg: Winter.
Graefen, Gabriele, and Martina Liedke. 2008. Germanistische Sprach-
wissenschaft: Deutsch als Erst-, Zweit- oder Fremdsprache. Tübingen:
Francke.
Groenke, Ulrich. 1998. Die Sprachenlandschaft Skandinaviens. Berlin:
Weidler.
Grønnum, Nina. 1998. Fonetik og fonologi: almen og dansk [Phonetics and
Phonology: General and Danish]. Copenhagen: Akademisk forlag.
Gussmann, Edmund. 2002. Phonology: Analysis and Theory. Cambridge:
Häkkinen, Kaisa. 1977. Tilastotietoja Suomen kielen äännerakenteesta [Sta-
tistical information about the phonemic structure of Finnish]. Turku: Turun
yliopisto.
Hakulinen, Auli, Maria Vilkuna, Riitta Korhonen, Vesa Koivisto, Tarja Ri-
itta Heinonen, and Irja Alho. 2004. Iso suomen kielioppi [Comprehensive
Grammar of Finnish]. Helsinki: Suomalaisen Kirjallisuuden Seura.
Karlsson, Fred. 1983. Suomen kielen äänne- ja muotorakenne [The Structure
of Finnish Phonology and Morphology]. Porvoo: Söderström.
Karlsson, Fred. 1992. Finnish. In International Encyclopedia of Linguistics,
2nd vol., ed. William Bright, 14–17. Oxford: Oxford University Press.
Karlsson, Fred. 1999. Finnish: An Essential Grammar. London, New York:
Routledge.
Kohler, Klaus. 1995. Einführung in die Phonetik des Deutschen. 2nd edition.
Berlin: Erich Schmidt.
Kurschat, Friedrich. 1876. Grammatik der litauischen Sprache. Halle: Verlag
der Buchhandlung des Waisenhauses.
Lindau, Mona, Kjell Norlin, and Jan-Olof Svantesson. 1990. Some cross-
linguistic differences in diphthongs. Journal of the International Phonetic
Association 20:10–14.
Lyovin, Anatole. 1997. An Introduction to the Languages of the World. Ox-
ford: Oxford University Press.
Maddieson, Ian. 1984. Patterns of Sound. Cambridge: Cambridge University
Press.
Mathiasen, Terje. 1996. A Short Grammar of Lithuanian. Columbus: Slavica.
Mitchell, Erika J. 2001. Finnish. In Facts about the World’s Languages: An
Encyclopedia of the World’s Major Languages, Past and Present, eds. Jane
Garry and Carl Rubino, 214–218. New York: Wilson.
Mosel, Ulrike. 1994. Saliba. München: Lincom Europa.
Mosel, Ulrike. 2006b. Sketch grammars. In Essentials of Language Docu-
mentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel,
Mosel, Ulrike, and Even Hovdhaugen. 1992. Samoan Reference Grammar.
Oslo: Scandinavian University Press.
Pompino-Marschall, Bernd. 1995. Einführung in die Phonetik. Berlin: de
Gruyter.
Ramers, Karl-Heinz, and Heinz Vater. 1992. Einführung in die Phonologie.
Hürth: Gabel.
Roca, Iggy, and Wyn Johnson. 1999. A Course in Phonology. Oxford: Black-
well.
Sánchez Miret, Fernando. 1998. Some reflections on the notion of diphthong.
Papers and Studies in Contrastive Linguistics 34:27–51.
Schubiger, Maria. 1977. Einführung in die Phonetik. 2nd edition. Berlin: de
Gruyter.
Schultze-Berndt, Eva. 2006. Linguistic annotation. In Essentials of Language
Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike
Mosel, 213–251. Berlin, New York: Mouton de Gruyter.
Sievers, Eduard. 1893. Grundzüge der Phonetik zur Einführung in das Stu-
dium der Lautlehre der indogermanischen Sprachen. Leipzig: Breitkopf
und Härtel.
Sulkala, Helena, and Merja Karjalainen. 1997. Finnish. London, New York:
Routledge.
Trubetzkoy, Nikolaj S. 1977[1939]. Grundzüge der Phonologie. 6th edition.
Göttingen: Vandenhoeck und Ruprecht.
Vennemann, Theo. 1988. Preference Laws for Syllable Structure and the Ex-
planation of Sound Change: With Special Reference to German, Germanic,
Italian, and Latin. Berlin: Mouton de Gruyter.
Viitso, Tiit-Rein. 2003. Structure of the Estonian language: Phonology, mor-
phology and word formation. In Estonian Language, ed. Mati Erelt, 9–92.
Tallinn: Estonian Academy Publishers.
Chapter 9
Retelling data: Working on transcription∗
Dagmar Jung and Nikolaus P. Himmelmann
1. Introduction
Transcribing narrative and conversational speech is a core activity of all lin-
guistic fieldwork, though one of the less attractive ones. Neither linguists nor
speakers are generally very keen to spend long hours on this task. Nevertheless,
it is without doubt one of the most important tasks to be carried out in
the field, requiring close cooperation between speaker(s) and researcher(s).
Given its significance, it is somewhat surprising how little attention this
task receives in the literature. When transcription is mentioned at all in
field methods books and articles, the focus is usually on phonetic aspects,
i.e. questions relating to the proper representation of sounds and the dis-
tinction between broad and narrow transcription. Occasionally, there are a
few additional notes on general procedure, as conveniently summarized in
Chelliah and de Reuse (2011: 434–435). A notable exception here is Crow-
ley (2007: 137–141) who discusses practicalities of transcribing a fairly large
amount of narrative and conversational speech that go beyond the problems of
basic procedure and properly capturing sound. Likewise, the current chapter
is exclusively concerned with the conceptual and interpersonal issues arising
when working on transcriptions of continuous text, for which usually some
type of practical orthography (broad transcription) will be used.
We will not repeat Crowley’s very useful observations and suggestions
here. But we want to emphasize the point that text transcription has to be car-
ried out in close cooperation with native speakers, and this usually means: in
the field. It may be possible for a researcher to transcribe some parts of a text
recording independently at a stage when one has achieved a certain mastery
of the language – enough, for example, to engage in simple conversational
exchanges. However, it will almost never be possible to fully transcribe a
recording, as there will always be sections where articulation is poor or very
fast, major noise interferences occur, or new words or constructions are being
used. Hence a serious attempt should be made to transcribe (and translate)
as many recordings in the field as possible, unless it is also possible to work
with native speakers at home.
∗ We are very grateful to all the speakers who generously shared their knowledge of the
Beaver language with us and were so patient and accommodating in dealing with our
obsessions with linguistic form rather than content. Many thanks also to Geoffrey Haig,
Carolina Pasamonik, Stefan Schnell and Gabriele Schwiertz for helpful comments and
suggestions, and to Jessica Di Napoli for thoroughly editing English grammar and style.
Successful transcription thus essentially depends on a productive collab-
oration and interaction between researcher and native speaker. The produc-
tivity of this interaction may be hampered by a number of pitfalls which are
the main concern of this chapter. In Section 2, we will first review some basic
conceptual and practical issues which need to be carefully considered to en-
sure a productive transcription collaboration. Section 3 will then survey and
systematize the different strategies native speakers may adopt when respond-
ing to the nontrivial challenges arising for them when engaging in transcrip-
tion, focusing on aspects of (lack of) cooperation and typical changes applied
to the recorded speech in the transfer to a written representation.
This chapter, thus, contributes to two interconnected topics in documen-
tary linguistics. The first topic pertains to the idea that fieldwork should be
conceived of as a cooperative learning enterprise between speaker(s) and re-
searcher(s), as argued in detail by Mosel (2006). Here the important point is
that researchers need to have a clear understanding of what kinds of demands
they are putting on their collaborators and to see where the interests and goals
of the two parties involved may diverge or even stand in opposition to each
other. The second topic pertains to the idea that working on transcription may
lead to the emergence of a new linguistic variety, as it involves the creation
of a new written language. This is particularly true in those instances where
recorded texts are carefully edited for publication in a local (e.g. educational)
context, a process documented, perhaps for the first time, in a rigorous way
in Mosel’s work on Teop (cp. Mosel 2004, 2008). But it actually also occurs
in similar, though less systematic ways in transcription, as we will document
here.
Both topics, incidentally, provide further arguments for the proposal that
the transcription process itself should be documented as fully as practically
feasible. That is, ideally it should also be recorded, as was the standard prac-
tice within the Beaver Athapaskan language documentation project in North-
ern Alberta, Canada, from which all data in this chapter are drawn (Jung et al.
2004–present).1 Without such a recording, the kinds of examples we discuss
here would not be available.
2. Basic conceptual and practical issues

To begin with, it will be useful to keep in mind that transcription is a metalin-
guistic activity generally not part of the speaker’s and the linguist’s normal
linguistic repertoire. That is, there is no natural linguistic behavior involving
an activity that can be said to be parallel to what is involved when transcribing
recordings of continuous speech. Stenography, which bears some resemblance
to transcription but differs considerably in purpose and focus, is a highly
specialized activity as well, requiring a great deal of training and practice.
A more widespread and, in literate societies, more natural activity which
bears some, albeit rather remote, resemblance to transcription is note-taking
in a meeting or a class, which may include the occasional literal transcription
of a short segment of the contribution by a current speaker. But taking notes
is primarily concerned with capturing the content of what is being said, not
with providing a precise transcript of it. The ‘natural’ focus on content seen
here also underlies a number of typical problems occurring in collaborative
work on transcription, as is further discussed in Section 3.
The relative unnaturalness of transcription is due to two facts. First, it
presupposes writing. Though literate practices are now widely found even in
societies without a literary tradition, the very fact that writing is not a uni-
versal modality of linguistic communication (unlike speaking and signing)
contributes to the special status of transcription. Second, and related to the
first fact, transcription involves a transfer from one modality into another one.
Such modality transfers are generally rare in linguistic behavior, the transla-
tion of spoken into signed language (or vice versa) probably being the only
type of transfer occurring in nonliterate societies. Literate societies add read-
ing aloud and dictation as widely used techniques in the acquisition of liter-
acy.
As there are no natural models for transcription, it follows that transcription
practices need to be established in accordance with the specific purposes
a transcript is intended to serve. Or, to put this slightly differently,
transcription is theory, as argued by Ochs (1979). In this classic paper, Ochs makes
a number of important observations. We repeat here only the most important
and relevant one for our purposes: transcription always involves reduction and
thus selection. It is practically impossible to represent a given speech event in
its entirety in writing. This would have to include, for example, minute details
of articulation, gestures, facial expression, etc. Apart from being unfeasible,
any kind of transcription trying to capture as many features of the speech
event as possible would actually defeat its purpose, as it would be very difficult
to parse. That is, the goal of transcription is reduction, i.e. making those
aspects of a complex speech event accessible that are of major relevance to
the user of the transcript.
1. For further information on this project and full acknowledgements see
www.mpi.nl/dobes/projects/beaver.
The transcripts typically produced in language documentation and de-
scription serve the purpose of making recordings of continuous speech acces-
sible for further analysis, including, for example, morphosyntactic analysis.
For this type of analysis, phonetic detail is usually not relevant, which is the
reason why a broad transcription, using a practical orthography, is generally
sufficient. What is relevant, however, is the segmentation of the continuous
stream of speech into chunks of different sizes (words, intonation units). This
segmentation has to make use of native speaker intuition (in particular with
regard to the word level, see Peterson, this volume) and involves preliminary
analytical hypotheses as to the relevant chunking units (see Himmelmann
(2006) for further discussion of this rarely explicitly discussed aspect of tran-
scription). This in turn means that transcripts have to be repeatedly revised
in line with the better understanding of the relevant units obtained through
ongoing analysis.
A further evident consequence of the fact that transcription is a
theory-dependent project not part of anyone’s basic linguistic repertoire is
that it has to be learned and practiced, both by the researcher and
the native speakers collaborating in the task. As a rule, researchers should
have transcribed a recording in their own language before setting out on the
task of transcribing a recording in the field. Among other things, this will help
them to have a clear understanding of the demands they are placing on the
native speakers they engage for collaboration in this task. Additionally, it will
help them comprehend the different kinds of reactions they will encounter
when searching for transcription collaborators, to which we now turn.
As with almost all other activities involved in field-based language doc-
umentation and description (cf. Mosel 2006, Seifart, this volume), native
speakers will differ quite significantly in their aptitude for, and interest in, the
task of transcription. Given that it is an unnatural task, it should not come as a
surprise that a very common, if not the most common, reaction is great
reluctance to collaborate in this task, even from speakers who are happy to engage
in other tasks such as providing example sentences for dictionary entries or
grammaticality judgments, or compiling complex paradigms or taxonomies.
The reasons for such reluctance include: the unwillingness or inability to
listen to chunks of the recording; declaring the recorded item impossible to
understand; unwillingness or inability to repeat segments of the recording in
such a way that they can be written down; lack of time. The reluctance to
engage in transcription may also be based on an evaluation of the recorded
discourse (or a part of it) in terms of correctness (not proper language), ap-
propriateness (that is not how one can say this in these circumstances) or
irrelevance (not worthy of being committed to writing). Such evaluations, in
turn, are part of, or result from, the set of basic attitudes speakers have with
regard to the variety being used and they display consciousness of linguistic
norms, which appear to exist in all societies, even in ones without formally
established standard varieties. The underlying system of attitudes is usually
very difficult to untangle and, in the case of the Beaver data discussed in this
chapter, we lack independent evidence for this system. We will therefore re-
frain from trying to link specific examples to basic attitudes, but for almost all
examples in Section 3 it will be clear that some notion of perceived linguistic
value is at play also in those instances where speakers engage in transcription
but apply changes to the original record.
Finding good collaborators for transcription will thus often be quite chal-
lenging. However, it is usually not the case that in a given field situation
there are very many options to choose from, and hence one has to find ways
to make the best of the options available. Perhaps the most important qual-
ification for becoming a collaborator in transcription – apart from being a
reasonably fluent speaker of the language – is the willingness and ability to
learn something new, i.e. to listen to small chunks of continuous speech and
repeat them so that they can be written down (or, write them down oneself).
Here transcription differs from many other activities required in language
documentation and description where experience and a high level of linguis-
tic skills and insight are essential. For transcription, it may often be useful to
collaborate with younger speakers able and willing to learn the task, even if
they no longer have a full command of the language. Difficult and unclear
segments must then be checked with more experienced speakers, who may
lack the patience or interest to collaborate on this task for more extensive
time periods.
It is not necessary, and in fact often not desirable, to work on transcription
with the speaker who appears in the recording. While the speaker probably
has a relatively clear idea of what she wanted to say, this does not mean that
she is particularly good at listening to and restating precisely what was actu-
ally said. In fact, speakers involved in the recorded speech event are some-
times more likely to engage in the correcting and extension activities dis-
cussed in Section 3 than non-participating speakers. Furthermore, listening to
one’s own voice in a recording can be disturbing as it sounds quite different
from what one hears when speaking and because one may feel embarrassed
about dysfluencies, speech errors and other kinds of imperfections in one’s
own speech.
As transcription is not only an unnatural but also a very time-consuming
and tedious task which requires practice and dedication, it is the task that is
perhaps most work-like of all the activities involved in field-based language
documentation and description. It is thus also the task that is best approached
in a work-like arrangement, characterized by regular working hours and a
salary/remuneration in line with local practices, if such arrangements are at
all possible in the community. Transcription is ideally done in an office-like
setting, with adequate furniture and a low noise level, so that one can fully
concentrate on the listening and interpretation task.
Work-like arrangements also provide for the option of independent tran-
scription, i.e. a native speaker works by her/himself on the transcription of
recordings. This, of course, presupposes that the speaker is able to handle
the technical aspects of playback (ideally using a laptop, otherwise an audio
player). It also involves some training and, crucially, regular supervision and
checking. The latter are important for two reasons: first, independent tran-
scribers, like researchers, may tend to develop ‘bad habits’ such as regularly
misspelling a class of high frequency items, leaving out segments at times
when they interrupt their work, etc. Regular checks may detect and arrest
such developments early on. Second, and of equal, if not greater importance,
independent transcribers need regular feedback and appreciation in order to
keep up the motivation for good and productive work. If all these conditions
are met, independent transcription, perhaps even involving two or three tran-
scribers, is probably the most efficient and productive method for tackling
this task.
The widely practiced alternative to independent transcription is collaborative
transcription, where researcher and speaker together transcribe a recording,
preferably with both listening to it through headsets. The discussion in
the following section is based on data generated in this way. The set-up
generally involved a single native speaker and a single linguist. The native
speakers were all elderly people as there are no younger speakers left, some-
times working on recordings of themselves, sometimes on recordings of other
speakers. Transcription was not separated from translation, so that upon hearing
a short segment played by the linguist, the native speaker would start
either by explaining what was being said, or with a free translation, or by
repeating the segment for the linguist to transcribe. All speakers involved are
bilingual in English and Beaver, and most of them are also literate in English.
Two speakers also have some familiarity with the orthography used for
representing Beaver and are hence able to follow what the linguist was writing.
While this is just one type of set-up for transcription, many of the phe-
nomena discussed below will also be found in transcriptions produced by
independently working native speaker transcribers and also whenever native
speakers are involved in editing precise transcripts for publications to be used
in the community (cp. Mosel 2004, 2008).
3. Retelling data: change and elaboration in transcription

As already mentioned in the previous section, in field-based language de-
scription the main goal of the linguistic researcher in the transcription pro-
cess is typically to obtain a written version of what is being said in a record-
ing which can then be used for further grammatical analysis. Given this goal,
the focus is on precision: the transcript should contain all of, and only, the
linguistic elements that have actually been uttered, including not only ‘small
words’ of ambiguous function such as discourse particles which are easily
missed, but also linguistic manifestations of the production process such
as false starts, markers of hesitation, etc. The latter often have little direct
relevance for propositional content and grammatical structure, but ignoring
them may lead to incorrect interpretations of content or structure (these may
be edited out carefully later on as part of the overall analysis process). This
approach to transcription is very technical and its overall goal may not be
easily understood by non-specialists.
Native speaker collaborators in the transcription process tend to have dif-
ferent goals and priorities which can generally be classified as attempting to
improve the recording in a number of ways: to make the content clearer, to
make it more appropriate for a general audience, to make it adhere to what
they perceive as the proper norm or more authentic speech, and so on. Goals
and priorities here depend in part on who is helping with the transcription:
a transcriber who is also the recorded speaker may decide more freely on
what should be edited in and out for semantic reasons or perceived mistakes
in clause structure. He/she may also focus on rephrasing, expanding or re-
peating the text to guarantee its comprehension by the intended audience. A
transcriber who is not among the speakers who appear in the recording may
comment on specific lexical items and idiomatic expressions that should be
changed.
In the following, examples from actual transcription sessions in a field-
work setting illustrate these processes. They are organized into three major
types: a) avoidance strategies where loose paraphrases and translations are
provided instead of a precise rendering of the recording; b) editing-out strate-
gies which lead to the removal of words and phrases; c) editing-in strategies
changing elements appearing in the recording or adding completely new ma-
terial. The latter two types belong more closely together in that they both
relate primarily to linguistic form, while the first type is most closely related
to the content being transmitted. Specific examples often involve a mixture
of the three types, which cannot always be neatly separated.
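For readers who keep both the recorded wording and the dictated wording of a segment, the two form-related types (editing-out and editing-in) can be flagged semi-automatically with a token-level comparison. The following Python sketch uses the standard library's `difflib`; the tokens are placeholders rather than Beaver data, the function name `compare` is hypothetical, and the mapping of diff operations to strategy types is an illustrative assumption. Avoidance strategies such as loose paraphrase cannot be detected this way and still require listening to the session recording.

```python
# Sketch: flag differences between what was recorded and what was dictated.
# Tokens are placeholders, not Beaver data; the mapping of diff operations to
# the strategy types is an illustrative assumption.
from difflib import SequenceMatcher


def compare(recorded, dictated):
    """Return (strategy, removed_tokens, added_tokens) for each difference."""
    findings = []
    matcher = SequenceMatcher(a=recorded, b=dictated, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # pure deletions count as editing-out; insertions and replacements as editing-in
        strategy = "editing-out" if tag == "delete" else "editing-in"
        findings.append((strategy, recorded[i1:i2], dictated[j1:j2]))
    return findings


recorded = ["w1", "uh", "w2", "w3"]      # includes a hesitation marker
dictated = ["w1", "w2", "w3", "w4"]      # hesitation dropped, one word added
for finding in compare(recorded, dictated):
    print(finding)
```

Each reported difference is only a pointer to a place worth re-checking against the recording, not a classification in its own right.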
3.1. Paraphrasing: avoiding word-by-word renditions of recordings

There are basically two reasons for avoiding a word-by-word repetition of
segments of a recording during the transcription process (leaving aside boredom or
impatience): either the desire to tell more, or the desire to tell less than what
is actually in the recording.
From a native speaker’s perspective, the important point in working on a
recording is to understand what is being said, i.e. the core concern is with the
message and not with its form, which often results in the desire to tell more
or to tell it differently.2 This becomes especially relevant when the speaker
in a recording works on the same recording together with a researcher from a
different cultural background. In this case, the speaker may be aware of
the researcher’s need to obtain additional information in order to
fully understand the sense of what is being said. A typical example resulting
from this state of affairs is the following: when played the segment given in
(a) and asked to repeat it so that it may be written down by the researcher, the
speaker volunteers the explanation given in (b). Upon the insistence of the
researcher, the two words of (a) are dictated for transcription (and translated)
but then immediately expanded again by a further elaboration of the story in
(c):3
(1) a. dáwó˛t’yedye aadi
what.kind.of.place 3.said
‘She said what kind of place.’
b. “(There’s) willow in there, they make arrows with; but there’s a
snake in there, a big one, it will kill you.”
c. “There’s a creek in there, there’s a saskatoon willow in there, but
you got to fight the snake!” (yaamaadzuyaaze_transcr001)
The main concern here is clearly that there (finally) is understanding on the
part of the researcher. And while this may first be perceived by the latter as
not being very helpful with regard to the primary goal of getting a useful (i.e.
reasonably precise) transcription, it is evident that such information will later
be of great value for interpreting the narrative. Depending on speakers and
communities, work on a transcript may involve several retellings of the same
narrative (in the contact language) which usually include important informa-
tion for its interpretation. In some way then, the transcription setup should
allow for making use of such elaborations.
2. As just mentioned and further illustrated in the following two subsections, form is of course
also a major concern, in particular in those cases where the recorded form is judged to be
inappropriate or incorrect. However, here the main concern is with cases of content-based
avoidance.
3. The practical orthography of Northern Alberta Beaver is used. Dentals are underlined (s,
z), the acute accent marks high tone.
The opposite motivation for avoiding the word-by-word repetition needed
for precise transcription, i.e. the wish to transcribe less than what has been
recorded, may be due to gender-specific speech, the avoidance of sexual
explicitness, and other kinds of taboos, including respecting the linguistic
“ownership” of certain topics by other persons. In a traditional story that was told
by a male speaker about the wolf meeting his brother-in-law, the beaver, in the
intense cold of winter, it is mentioned that the wolf’s frozen testicles can be
heard from far away. This situation is described with the use of an ideophone:
tł’aa tł’aa aadyi ‘clack clack it sounds’. The female transcriber rendered this
by saying ts’áa láádi “something about ‘a young beaver’”, without mentioning
the actual meaning.
Sometimes it is simply disbelief in the content of a recording which may
lead to a refusal or avoidance of transcribing a segment. In still other cases,
the speaker may in fact be unable to help as the segment in question may
involve a different dialect or even another language. In the latter cases, the
reluctance to help with transcription will usually be made explicit, often with
an accompanying explanation. In some instances of taboo, however, it may
not be obvious to the researcher that the native speaker is not willing or able
to engage in precise transcription. The researcher may only find out later on,
once he/she has a better understanding of the language and the general cul-
tural context, that there is a discrepancy between the original wording and the
material given in transcription (and translation). Rephrasings and other kinds
of modification caused by observing taboos may be kept up rather convinc-
ingly and consistently for several units.
3.2. Editing-out
There are different types of elements that tend to be edited out (or ‘overlooked’)
by native speakers in the transcription process, regardless of whether
transcription is carried out independently or collaboratively. Perhaps the
most common type concerns hesitations or false starts, i.e. verbal elements
that are not actually part of the linguistic construction, do not directly con-
tribute to its meaning, and, most importantly perhaps in the current context,
are generally absent in written formats other than transcripts. One very simple
reason for leaving them out in dictation is, of course, the fact that hesitations
and in particular false starts are difficult to reproduce. After some practice,
the researcher her/himself will usually be able to identify such hesitations
and false starts and can then add them to the transcription without having to
bother the native speaker collaborator with this. However, some care has to
be taken in this regard as misinterpreted false starts may lead to a chain of
errors, as illustrated with the following example.
(2) a. ts’ídoaa ch’u-... ch’uunézis t’aaguyihtyi˛
child.DIM ch’u-... wolf.hide inside.3PL.3O.put.ANIMO
‘They put the baby into a wolfhide.’ (moose001:56)
In this example, the false start highlighted here was not initially identified by
the transcribers and thus was not included in the first transcript. At the time
when the transcript was being translated and the recording was listened to
again for verification, the linguist insisted that there was a word missing in
the translation (not having realized that it was a hesitation). Consequently,
the following ‘emendation’ was carried out (which strictly speaking is a case
of editing-in, but note that unlike the examples discussed in the following
section, here the editing-in is triggered by the researcher):
b. ts’ídoaa zo˛ ch’uunézis t’aaguyihtyi˛
child.DIM only wolf.hide inside.3PL.3O.put.ANIMO
‘They put their only baby into a wolfhide’ (moose001transc003)
The particle zo˛ ‘only’ was inserted because it is phonetically close to the
false start, syntactically well-formed (following an NP), semantically
appropriate (the couple in the story had only one child), and frequent as a
discourse particle. That the segment does not actually contain this particle,
even though it would fit very well, was caught only later, when the transcript
was checked again with close attention to the sounds and overall features of
articulation.
The second type of frequently omitted elements encompasses repetitions
of words or phrases characteristic of emphatic oral speech, but which are
deemed redundant or superfluous for the written version, a phenomenon that
is also quite widespread and has occasionally been remarked upon by other
authors (including Mosel 2008).
The third type concerns elements functioning as connectors or floor-keep-
ing particles in discourse. In the second line of the following example, the
initial ˛ih was omitted in the first transcription:4
4. In the remainder of this section, segments that have been left out by native speakers in the
transcription process are put in parentheses (and in bold).
(3) ní˛itye, k’emoi hakui ní˛itye,
3.lived bush cows 3.lived
‘They used to live here, buffalo used to live here,
(i˛h...) wuts’e˛ alééske wúye, łéés aghwi˛lé-’éh

(and...) from.here Eleske 3.is.called, dirt 3.made-because.of
then it is called Eleske (=‘on the dirt’), because they make the area
dusty’. (Eleske)
As indicated here, there is usually a pause following this frequent marker of
hesitation. It is often not included in the string dictated for transcription
and typically is not even commented on; for the native speaker it is as if it
were not really there. Similarly, the clause-final clitic =ú, which generally
marks combined clauses, is very often ignored or not even noticed in
transcription:
(4) tsé’e wutsé madzéé’ xáátsad(=ú), datés’oé’ ní˛dyí˛to˛,
dad very heart 3.falls.out(=PRT) 3.gun.POSS 3.took.ELOO
ts’e˛hdzé’ dyééza
outside 3.went
‘My dad became very mad, he took his gun, and went outside.’
(naabane003)
(5) tłi˛˛idyéédayi-’éh go˛o˛ łó˛ó˛dó˛ dyídyeel(=ú)
horse.team-with over.there all 1PL.go.PL(=PRT)
‘We all go over there with horse teams.’ (bullrush_lake001)
This clause-final clitic can indicate a loose temporal embedding in the dis-
course and generally does not play a role in signifying a more specific inter-
clausal construction.
Another class of words that are often edited out are evidentials that mark
a narration as known to the speaker only via the word of others. Since there
is no good translation for such particles, they are left out as ‘not important’.
In the case of Beaver, these words are also considered inappropriate for the
written style, partly because the European language used as target language
in the translation lacks the expressed category.
(6) Yéhnuujéle naawoghanePo˛ laa (sô˛) yéhjii.
3SG-with.go.back 3PL-plan FOC (I.think) 3SG.tell.3SG
‘That woman was to return back with that guy, they had planned that,
he was told.’ (Sweeney_Creek, Doig River Beaver)
Omissions may also pertain to parts of words, as in the following example:
(7) hade tyéége ii sadaa, uzéhtso˛(-áa), madzagée ˛ihdadze
just quiet DEM 3.sit 3.listen(-DIM) 3.ear.DIM.POSS both
náághada
3.move
‘It just sat there quietly, listening a bit, its little ears were both moving.’
(day_ferry001)
This example involves the (crosslinguistically rather rare) instance of a dimin-
utive element occurring on a verb. Here, it is not clear what exactly triggers
the omission or oversight. As the other examples in this section illustrate, the
most fundamental reason for editing out elements from transcription appears
to be the assessment that they are irrelevant for the propositional content, as is
most clearly the case for hesitations, false starts and – with some exceptions
perhaps – repetitions. The difficulty of rendering the pragmatic or semantic
nuances of particles and clitics in the contact language used for communicat-
ing with the researcher may also be relevant.
3.3. Changing and editing-in

Changing the wording, which often includes the insertion of further mate-
rial, usually has the goal of achieving a clearer and more precise linguistic
expression of the narrated event or situation. Typical examples include the
use of overt nominals (in addition to pronominal markers on the verb) and
additional indications of location, and the change of lexical stems (in Beaver
frequently verb stems). Change of word order may also occur. Another type
of modification pertains to the substitution of lexemes for stylistic or ideolog-
ical reasons, a prime example being the replacement of English words with
terms from the native language. The addition of a locational specification can
be seen in the next two examples:5
5. This section makes use of example pairs, where the first example provides the precise
transcription while the second shows the modified version.
(8) a. ashídle=yuu łenaxi-
y.brother=and 1PL.together
‘My younger brother and me (were tied) together.’
b. ashídle=yuu łehjíítł’o˛ mak’edeisdii-k’e
y.brother=and 1PL.tied.together saddle-on
‘My younger brother and me were tied together on a saddle.’
(life_without)
In (8b) the location ‘on a saddle’ and a verb are added to the clause. Instead
of the addition of whole phrases, postpositions may be changed to provide a
more accurate depiction of the scene:
(9) a. matyééle-moi ˛iláádzis xi˛st’aas
3.sheet-at.edge gloves 1SG.cut.out
‘I cut out the gloves at the edge of her (bed) cover.’
b. matyééle-k’e ˛iláádzis xi˛st’aas
3.sheet-on gloves 1SG.cut.out
‘I cut out the gloves on her (bed) cover.’ (first_gloves001)
Here, the second version is more appropriate in that it prepares the recipient
better for the punch line of the episode in which the cover resulted in being
accidentally cut up.
A further area for emendation pertains to fine semantic distinctions that
may be conveyed with the help of grammatical markings. In Beaver, one im-
portant case here is the expression of distinctions in the number of partici-
pants through suppletive stems within the realm of motion verbs.
(10) a. aláá no˛o˛naatł’is-de łííníí-n-í-dyil
boat 3.crosses-LOC stop-PFV-1PL-go.PL
b. aláá no˛o˛naatł’is-de łííníí-n-í-t’aats
boat 3.crosses-LOC stop-PFV-1PL-go.DU
‘We (pl/dl) stopped at the ferry.’ (day_ferry001)
In the recording, the plural motion stem is used, which, strictly speaking,
is not correct, since only two people were in the car. Accordingly, the dual
motion stem is edited in.
A similar example is seen below in (11), where again a plural stem is
replaced by a dual stem. In addition, several other typical changes may be
seen in comparing the two versions. One consists of a change in agreement,
from first plural subject to third plural subject marking:
(11) a. hó˛ty’e laa tsáá-gha k’éémoi-dze k’éé-ghí-dish-ełé
like.that FOC beaver-for bush-to ASP-CNJ.1PL-go.PL-HAB
‘Like that, we used to hunt for beaver in the bush.’
b. hó˛ty’e tsáá-ka k’éémoi-dze k’éé-gha-t’aash-ełé
like.that beaver-for bush-to ASP-CNJ.3PL-go.DU-HAB
‘Like that, they used to hunt for beaver in the bush.’ (life_without)
Further changes include the editing out of the emphatic/focus marker laa fol-
lowing the first phrase, and the replacement of the purposive postposition
-gha with -ka. Even though both are grammatical in this context, the latter
appears in the common expression ‘to go out in order to hunt an animal’.
Classificatory verbs are also a locus for semantic differentiation and spec-
ification. They are frequently replaced in transcription:
(12) a. náádaagheilé’-éh k’éémoi-dze naxa-gha-dyeh-tyé
wagon-with bush-to 1PLO-3PL-ASP.V-bring.ANIMO
‘They brought us with the wagon to the bush.’
b. náádaagheilé’-éh k’éémoi-dze naxa-gha-da-lé
wagon-with bush-to 1PLO-3PL-ASP.V-bring.PLO
‘They brought us with the wagon to the bush.’ (life_without)
In this example, the speaker replaced the stem for animate objects, which
implicitly refers to either a single or a dual object, with the stem for handling
plural objects.
An even more elaborate case of semantic specification is seen in the fol-
lowing example. A very general verb found in the original recording (13a) is
first replaced by a variant including a classificatory verb (13b) that pertains to
stick-like objects (in this case the leg of a frog). The third variant in (13c) was
suggested to be the one that best expresses the event depicted in the picture
(from the Frog Story, Mayer 1994[1969]). The verb here specifically refers
to the movement of legs:
(13) a. tyehkazi tyéék’e łige mats’ané’ ó˛ty’e
frog in.water one 3.leg 3.be
‘The frog has one leg in the water.’
b. tyehkazi tyéék’e mats’ané’ łige sahto˛
frog in.water 3.leg one 3.put.ELOO
‘The frog has one leg in the water.’
c. tyehkazi tyéék’e dyéh’ééts
frog in.water 3.move.leg
‘The frog moves a leg in the water.’ (frog_story001transc)
A different sort of editing is involved when phrases are replaced with phrases
of an entirely different grammatical type. In the modified version, the
situation is often described more explicitly than in the actual recording (note
that in (14) the focus marker laa has again been omitted, along with the
original overt subject noun, so strictly speaking this is a case of
replacement and omission):
(14) a. méhz˛i laa łééyetł’o˛
owl FOC 3.tied.3.together
‘The owl tied them together.’
b. ˛ihtyedi łééguyaghétł’o˛
naked 3PL.tied.3.together
‘They tied them together naked.’ (buffaloman001)
Other possible kinds of changes include the paradigm of the verb (aspectual
variation) or the choice of person markers. In example (15a), the speaker uses
the areal pronominal marker ghu- as a possessive prefix to refer to the story
in a general sense. The transcriber, who in this case is not identical to the
speaker, chooses the third person marker ma- for this construction:
(15) a. gaa ghu-lo˛o˛
now 3ARE-end
‘That’s the end (of the story)!’
b. gaa ma-lo˛o˛
now 3-end
‘That’s the end (of the story)!’ (ghutsahgeeze_woodpecker)
The transcription of the next example resulted in a change in word order
within the first noun phrase: the modifier ‘all’ is shifted from its usual place
following the noun to the front.6
(16) a. dane łó˛ó˛dó˛ ada’íídyii hesi˛ gutł’e ó˛li˛
people all 1PL.know must.be 3PL.passed.away
‘All the people we know, they are all gone.’
b. łó˛ó˛dó˛ dane ada’íídyii hesi˛ gaa matł’e ó˛li˛
all people 1PL.know must.be now 3.passed.away
‘All the people we know, they are all gone now.’ (St.Charles_001)
Possibly, word order here is influenced by interference from English. Recall
in this regard that for all the Beaver data discussed in this chapter the work
was set up in such a way that transcription was performed simultaneously
with translation. That is, each unit was transcribed and translated before
proceeding to the next unit, and there was no fixed order between the two
steps (the speaker would sometimes first volunteer a translation and then
repeat the segment for transcription, or vice versa). In such a setup, interference
from the target language is more likely than in a setup where all attention is
concentrated on transcription. But, of course, there are many reasons why the
latter procedure will often not be possible, quite apart from the fact that it
helps to have at least a rough idea of what one is transcribing.
Turning briefly now to the second major class of ‘editing-in’ modifica-
tions, code-switches or the use of non-aboriginal words often trigger signifi-
cant modifications to the original recording.7 Here is a simple example:
(17) a. marten gulae, nóódyehk’azhi gulae
marten maybe fisher maybe
b. uust’yá˛a˛ gulae, nóódyehk’azhi gulae
marten maybe fisher maybe
‘Maybe a marten, maybe a fisher.’ (day_ferry001)
6. Other changes include the editing-in of the adverb gaa ‘now’ to emphasize the difference
between the narrated past and the present situation, as well as a pronominal change (marked
3pl to general 3).
7. The tendency to replace English words or phrases in the annotation with Beaver terms is
more pronounced in the Northern Alberta varieties.
It may also happen that longer segments in English are translated by the
transcriber. Thus, for example, the clause ‘he hears something’ was rendered as
wó˛o˛li dííts’ak in one transcription session.
More interestingly perhaps, the transcriber might speak a different vari-
ety of the language than the recorded speaker. This may result in changes in
the transcript, both with regard to morphological form and in the naming of
traditional characters:
(18) a. go˛o˛ dyéézhe gaa éhdyi: aséi dasbát-éh
over.there 3.went now 3.said grandfather 1SG.hungry-with
ni˛ka dyée-ya
2SG.for ASP-come.SG
‘He went over there and said: grandfather, I came to you because
I’m hungry!’
b. go˛o˛ dyéézhe gaa éhdyi: aséi Daskutł’e, ni˛ka
over.there 3.went now 3.said grandfather Daskutł’e 2SG.for
dyée-zha
ASP-come.SG
‘He went over there and said: grandfather Daskutł’e, I came for
you.’ (aghat’usdane002)
In the recording the singular motion stem is -ya, while the transcriber repeats
it as -zha. Both belong to the paradigm of the stem ‘sg.moves’, but they are
used differently in various dialects of Beaver.8 The second modification, the
renaming of a character within a well-known traditional story, is explained
by family tradition: “My grandfather told me about old man Daskutł’e.” The
original inflected verb form in (18a) that is not a personal name is thereby
replaced by a personal name that seems to fit the particular story.
8. Historically, a so-called voice marker has been absorbed into the -zha form.
4. Conclusion
Transcription of recorded data plays a central role in field-based language
documentation and description. The final product, i.e. the transcript, forms the
stepping stone for a variety of further activities, including not only grammatical
analysis but also the preparation of educational materials or other written
resources to support language maintenance or development efforts. But, as
argued here, the transcription process itself, while often tedious and disliked
by all parties involved, provides valuable insights into the linguistic knowl-
edge of speakers: insertion, omission, or change of items show the range of
the linguistic repertoire and (un)acceptable variation and thus complement
the evidence otherwise gathered in elicitation tasks. It may also provide im-
portant clues for our understanding of the creation of new linguistic varieties,
as many of the phenomena reviewed here also occur when basic transcripts
are edited for publication as written resources, resulting in the creation of
a written language variety where none existed beforehand (cp. Mosel 2004,
2008).
From a scientific point of view, it is thus important to document, as much
as practically feasible, the kinds of changes and elaborations discussed in the
preceding section, i.e. to provide both as precise a transcript of the actual
recording as possible as well as a record of the changes applied by native
speakers in the transcription process and the motivations given for them (if
any). The best way to do this is to record the transcription process as well,
which, however, will often not be possible for various reasons.
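Where the transcription session cannot be recorded, part of this record can at least be derived by comparing the two versions of a unit. The following Python sketch is an illustration only, not a tool used in the project: it aligns the tokens heard in the recording with the tokens dictated by the transcriber and labels every difference as editing-out, editing-in or change. The function name classify_edits and the label strings are assumptions introduced here; the token lists are taken from examples (2a) and (17), with the first transcript of (2a) assumed to be the same string without the false start.

```python
from difflib import SequenceMatcher

# Hypothetical labels, named after the strategies described in Section 3.
LABELS = {"delete": "editing-out", "insert": "editing-in", "replace": "change"}


def classify_edits(recorded, dictated):
    """Align two token lists and describe every point where they differ."""
    matcher = SequenceMatcher(a=recorded, b=dictated, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretches need no documentation
        edits.append({
            "type": LABELS[tag],
            "recorded": recorded[i1:i2],   # what is audible in the recording
            "dictated": dictated[j1:j2],   # what the transcriber offered instead
        })
    return edits


# Example (2a): the false start is absent from the first transcript.
false_start = classify_edits(
    ["ts’ídoaa", "ch’u-...", "ch’uunézis", "t’aaguyihtyi˛"],
    ["ts’ídoaa", "ch’uunézis", "t’aaguyihtyi˛"],
)

# Example (17): an English word is replaced with a Beaver term.
code_switch = classify_edits(
    ["marten", "gulae", "nóódyehk’azhi", "gulae"],
    ["uust’yá˛a˛", "gulae", "nóódyehk’azhi", "gulae"],
)

for edit in false_start + code_switch:
    print(edit["type"], edit["recorded"], "->", edit["dictated"])
```

A token-level comparison of this kind only captures differences in form. The paraphrases and retellings discussed in Section 3.1, and the motivations speakers give for a change, still have to be noted by hand.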
There is, of course, a potential conflict here between scientific and com-
munity/speaker interests, which was already hinted at in the introduction to
Section 3: What if a speaker or the community at large actually rejects (parts
of) an utterance as incorrect or inappropriate, for whatever reason? Under
such circumstances, which version should be made available to whom and in
what form? There is no straightforward and easy answer to this question, as
in all cases where conflicts arise about control and ownership in a documen-
tation project, but layered access levels in a digital archive usually make it
possible to accommodate the interests of all parties concerned.
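How layered access might separate the different versions of a transcript can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a description of how any existing archive implements access control: the group names, the version labels and the function visible_versions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptVersion:
    label: str
    allowed_groups: frozenset  # user groups that may open this version


def visible_versions(versions, user_groups):
    """Return the labels of all versions open to at least one of the user's groups."""
    groups = frozenset(user_groups)
    return [v.label for v in versions if v.allowed_groups & groups]


# Hypothetical access policy for one recording session.
versions = [
    TranscriptVersion("as_recorded", frozenset({"speaker", "researcher"})),
    TranscriptVersion(
        "edited_by_transcriber",
        frozenset({"speaker", "researcher", "community", "public"}),
    ),
    TranscriptVersion("transcription_session_notes", frozenset({"speaker", "researcher"})),
]

print(visible_versions(versions, {"public"}))
print(visible_versions(versions, {"researcher"}))
```

In this sketch access is granted per version and per group rather than through a single ranked scale, since the interests of speakers, community members and researchers need not be ordered relative to one another.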
Abbreviations
1, 2, 3   first, second, third person (usually indexing the subject argument if
          not otherwise specified)
ANIMO     animate object
ARE       areal
ASP       aspectual
CNJ       conjugation
DEM       demonstrative
DIM       diminutive
DU        dual
ELOO      elongated object
FOC       focus
HAB       habitual
LOC       locative
O         object
PFV       perfective
PL        plural
PLO       plural objects
POSS      possessive
PRT       particle
SG        singular
V         valency
References
Chelliah, Shobhana L., and Willem de Reuse. 2011. Handbook of Descriptive
Linguistics. Dordrecht: Springer.
Crowley, Terry. 2007. Field Linguistics: A Beginner’s Guide. Oxford: Oxford
University Press.
ton de Gruyter.
Jung, Dagmar, Julia Colleen Miller, Patrick Moore, Gabriele Müller
(now Schwiertz), Olga Müller (now Lovick), and Carolina Pasamonik.
2004–present. DoBeS Beaver Documentation. DoBeS Archive MPI Nij-
megen, http://www.mpi.nl/DOBES/.
for Young Readers.
ter 1(3):3–4.
Ochs, Elinor. 1979. Transcription as theory. In Developmental Pragmatics,
eds. Elinor Ochs and Bambi B. Schieffelin, 43–72. New York: Academic
Press.
Chapter 10
The making of a multimedia encyclopaedic lexicon
for and in endangered speech communities∗
Gabriele Cablitz
1. Introduction
In the past ten years a growing amount of lexicon software1 has become avail-
able to create multimedia lexica. A key area of application is the compilation
of online dictionaries for speech communities of underdescribed and endan-
gered languages. For a number of such projects (Manning, Jansz, and In-
durkhya 2001; Kroskrity 2002; Albright and Hatton 2008; De Korne et al.
2009; Yang et al. 2008; Rau et al. 2009; Cablitz, Chong, and Tetahiotupa
2009) the development of a multimedia lexicon is an important step towards
language documentation as a means of language maintenance and revival and
the preservation of endangered linguistic, lexical and cultural knowledge.
∗ This research has been generously supported by two DoBeS grants of the Volkswagen
foundation. I would like to thank the Marquesan and Tuamotuan speech communities for
their warm welcome, support and inspiration for the project. I have in particular
benefited from discussions with Tehoatahiiani Bruneau, Fasan Chong (Jean Kape), Marc
Kemps-Snijders, Lucien Mataiki, Upu Mataiki, Lucien (Mimio) Puhetini (†), Jacquelijn
Ringersma, Tahia Tuohe (†), Edgar Tetahiotupa, Mathias (Teaiki) Tohetiaatua, Peter
Wittenburg and Claus Zinn. I would like to thank Ken Dicks for proof-reading my paper and
Geoff Haig, Nicole Nau and Claudia Wegener for their helpful comments on earlier
versions of this paper; any errors and inconsistencies are of course my own. Last, but not
least, I would like to thank Ulrike Mosel for her great inspiration of my own work, and all
her enthusiasm and support over the years.
1. E.g. the Kirrkirr software (Manning, Jansz, and Indurkhya 2001), Lexique Pro and
WeSay (SIL), LEXUS (Max Planck Institute for Psycholinguistics), IDD (Indiana Dictionary
Database, cf. De Korne et al. 2009) among others.
2. The project was part of the DoBeS programme and was generously supported by the
Volkswagen foundation between 2006 and 2010 (http://www.mpi.nl/dobes).
In this chapter we report on an interdisciplinary project2 in which digital
multimedia encyclopaedic lexica are created for the endangered Marquesan
and Tuamotuan languages of French Polynesia with the help of the lexicon
tool LEXUS.3 LEXUS is a web-based tool which has a flexible scheme
of linking multimedia documents – including annotated sessions from the
archive – to lexical entries, as well as the possibility of creating relational
links using its integrated tool ViCoS.4 The relational linking device in ViCoS
is a new form of knowledge representation with which the user can create a
dense, thematically organised network of lexical and cultural data in ways
which are meaningful to different kinds of user groups.5 This has important
implications not only for the speech communities whose languages
are documented, but also for documentary linguistics. In this chapter we will
discuss why this form of language documentation, i.e. the creation of
multimedia lexica with LEXUS, is beneficial to both the scientific and the
speech communities.
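The general idea of relational linking, i.e. lexical entries connected by named relations so that thematic networks can be traversed, can be illustrated with a small Python sketch. It makes no claim about how ViCoS stores or displays its data: the class LexiconNetwork, the relation names is_a and used_for, and the English placeholder labels standing in for lexical entries are all assumptions made for illustration.

```python
from collections import defaultdict


class LexiconNetwork:
    """A toy network of lexical entries joined by named relations."""

    def __init__(self):
        self._links = defaultdict(set)  # entry -> {(relation, target), ...}

    def link(self, source, relation, target):
        """Add a directed, named link from one entry to another."""
        self._links[source].add((relation, target))

    def related(self, source, relation=None):
        """Entries linked from `source`, optionally limited to one relation."""
        return sorted(
            target
            for rel, target in self._links.get(source, ())
            if relation is None or rel == relation
        )

    def ancestors(self, entry, relation):
        """Follow one relation transitively, e.g. up a folk taxonomy."""
        seen, stack = [], [entry]
        while stack:
            current = stack.pop()
            for target in self.related(current, relation):
                if target not in seen:
                    seen.append(target)
                    stack.append(target)
        return seen


# Placeholder labels, not entries from the Marquesan or Tuamotuan lexica.
net = LexiconNetwork()
net.link("breadfruit variety A", "is_a", "breadfruit")
net.link("breadfruit", "is_a", "plant")
net.link("breadfruit", "used_for", "food preparation")

print(net.ancestors("breadfruit variety A", "is_a"))
print(net.related("breadfruit"))
```

Keeping the relation name on every link is what allows the same set of entries to be viewed as a taxonomy by one user group and as a network of uses by another.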
In the first part of this chapter (§2) we will briefly discuss the background
and objectives of this multimedia encyclopaedic lexicon project (henceforth:
MEL-project). In §3 we proceed to outline the type of lexicographic work we
have undertaken in the MEL-project, discussing new aspects of lexicographic
work in our lexica in more detail. This section also provides insights into
the creation and representation of thematically organised networks of lexical
and cultural data in ViCoS (§3.4), in particular the creation of ethnobotanical
ontologies and a folk taxonomy on marine life.
One major objective of the MEL-project is to motivate the speech com-
munities to actively participate in the process of creating these multimedia
lexica. The web-based editing possibilities of LEXUS and internet facilities
in French Polynesia – in principle – make it possible to allow an online par-
ticipation by the speech community. In §4 we will address the implications
a web-based lexicon tool has for the process of lexicon creation as well as
the problems which are involved with such an approach of egalitarian lexicon
creation: a model of collaborative workspaces and the basic challenges of a
web-based collaboration in endangered speech communities are discussed in
detail. The last section (§5) discusses why the making of dictionaries should
be a key activity in a DoBeS language documentation project, or indeed for
any documentation project, and why multimedia dictionaries created with
LEXUS can be particularly appealing and useful for endangered speech
communities as well as the wider scientific community. We will discuss the
somewhat controversial role of lexicography in documentary linguistics as well as
the contributions that multimedia dictionaries – in particular LEXUS – can
make to documentary linguistics, showing that DoBeS-style language
documentation and lexicography can complement each other. Finally, we also
address the question of how far multimedia lexica – as created in LEXUS –
are useful tools for language revitalization.
3. LEXUS is currently being developed by the technical team of the Max Planck Institute for
Psycholinguistics in Nijmegen (Netherlands).
4. ViCoS=Visualization of Conceptual Spaces, cf. http://www.mpi.nl/dobes/tools and cf.
Zinn (2008: 890–894); the first multimedia lexicon software using the relational linking
device was Kirrkirr in the late 1990s (McElvenny 2008: 160).
5. Cf. §3.4 for further details.
2. The MEL-project
2.1. Background
The MEL-project evolved from a previous DoBeS language documentation
project6 on the Marquesan languages in French Polynesia. Between 2003 and
2006 a large corpus of primary and lexical data from five different Marquesan
island vernaculars was compiled, analysed and annotated in close
cooperation with the Marquesan speech community; it is now stored in a
digital multimedia language archive housed by the Max Planck Institute for
Psycholinguistics (cf. Cablitz 2010: 40–43 for details). In the course of this
DoBeS documentation project we also began to build up a general lexical
database and glossaries of topics which have become important foci of our
documentation (breadfruit varieties, food preparation, plants, plant medicine,
fish and fishing, etc.). All lexical databases are trilingual (Marquesan, French
and English); the lexical entries have been extracted from recordings on the
respective topics, with added information from field notes and wordlists. Due
to the well-known constraints on time and money in short-term documentation
projects, we mainly applied Mosel’s thematic approach to dictionary making
(2004b; 2011). We created so-called mini-dictionaries (Mosel 2011: 348) by
focussing on one sub-domain of a culture (e.g. fish/fishing) at a time. In
this way a mini-dictionary can be completed in a relatively short period of
time, which often has motivating effects on the speech community (Mosel
2004b: 45–47). Idioms, collocations and lexicalised phrases were also
collected in a separate database because they constitute an important part of the
linguistic competence of speakers (Pawley 1993), and they provide deeper
insights into a culture than other linguistic units (Mosel 2004b: 50).
6. Documentation of the Marquesan languages and culture in French Polynesia (2003–2006),
cf. http://www.mpi.nl/DOBES/projects/marquesan.
The lexical entries in our databases typically contain information about
parts of speech, phonological variants,7 plural forms, glosses, French and En-
glish reversal glosses for index lists, definitions, dialectal usage and register,
example sentences, etymology of borrowed lexemes, synonyms, antonyms,
cross-references and scientific names of things from the natural environment
(e.g. fish, shellfish, birds and plants). The primary and lexical data form the
basis for our on-going work on the multimedia lexicon.
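The information types listed above can be thought of as the fields of a structured record. The following Python sketch shows one possible layout; it is an assumption made for illustration and does not reproduce the schema of the project's databases or of LEXUS. The field names and the helper empty_fields are hypothetical. The sample entry uses only what footnote 7 provides (the doublet ko’aka – ’o’aka ‘find’), and everything else is left empty rather than guessed.

```python
from dataclasses import dataclass, field, fields


@dataclass
class LexicalEntry:
    headword: str
    part_of_speech: str = ""
    phonological_variants: list = field(default_factory=list)
    plural_forms: list = field(default_factory=list)
    glosses: dict = field(default_factory=dict)           # language code -> gloss
    reversal_glosses: dict = field(default_factory=dict)  # French/English index forms
    definition: str = ""
    dialect_and_register: str = ""
    example_sentences: list = field(default_factory=list)
    etymology: str = ""                                    # for borrowed lexemes
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    cross_references: list = field(default_factory=list)
    scientific_name: str = ""                              # natural-environment terms


def empty_fields(entry):
    """List the fields of an entry that still have to be filled in."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


entry = LexicalEntry(
    headword="ko’aka",
    phonological_variants=["’o’aka"],
    glosses={"en": "find"},
)
print(empty_fields(entry))
```

A helper such as empty_fields gives a quick overview of how complete a mini-dictionary is, which can be useful when several contributors work on the same entries.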
2.2. Objectives
The main objective was to create multimedia lexica for the Marquesan and
Tuamotuan speech communities by building up a new form of multimedia
language archive in a structural frame of a lexicon with the help of LEXUS
and ViCoS (thereby also providing the LEXUS/ViCoS developers with a doc-
umentation setting to further improve and refine the software). This involved
the enriching of lexical entries with indigenous knowledge, multimedia ex-
tensions (images, video and audio clips) as well as annotated sessions from
the archive, in order to move from a conventional dictionary with simple
glosses towards an encyclopaedia or ethnographic type of lexicon (cf. Paw-
ley 2001: 236–237). In order to do this effectively, we needed to motivate the
speech communities to actively participate in the creation of these multime-
dia lexica, thus becoming more involved in the process of documenting their
own languages. The entire process had an important beneficial side-effect:
the involvement of the speech community in the creation of the multimedia
lexica is a way of contributing to language maintenance and revival.8
In order to facilitate the more active participation of speech community
members, they had to learn a) about the basics of lexicography and the rele-
vant linguistic software used, and b) how to write monolingual definitions and
encyclopaedic articles of vernacular words, a documentation method which is
7. Phonological variants are allolexemes, also called triplets and doublets (Elbert 1982). Dou-
blets and triplets are different forms of the same lexeme which are used by one and the
same speaker (e.g. ko’aka – ’o’aka ‘find’), i.e. one cannot observe any complementary
distributional rules nor a regional demarcation (Cablitz 2006: 26).
8. We documented e.g. very specialised knowledge about plants which some of our language
consultants have not talked about for many years. In many instances, the documentation
helped them remember and revive traditional knowledge.
particularly useful for language maintenance. Moreover, we intended to cre-

ate lexical and cultural networks in ViCoS which are based on the indigenous
categorisation and organisation of relations between words and/or any ele-
ment in the DoBeS-archive (e.g. sessions depicting particular cultural uses).
We were hoping that an indigenous organisation of lexical and cultural data
would ensure a better access of the speech communities to the documenta-
tion of their languages and cultures, with the aim of contributing to future
language revitalization efforts by the speech communities.
Due to the web-based properties of LEXUS, online cooperation with
the speech communities outside fieldwork periods was in theory possible (cf.
above). However, while the idea was initially appealing to the speech communities,
it turned out to be a complex enterprise. There are many problematic
aspects connected with a web-based approach to lexicon creation in endangered
speech communities, some of which are discussed in further detail below
(§4).
3. Lexicographic work in the MEL-project

The lexical data collected during the course of the DoBeS documentation
project on Marquesan had been prepared with the MDF 4.0 database type
in Toolbox. This was now imported into LEXUS by representing the same
hierarchical structures as those in the corresponding Toolbox MDF database
type.9 Toolbox can also convert the lexical database into an electronic version
with a conventional dictionary layout of alphabetically ordered headwords
which includes the two gloss languages French and English. Apart from the
electronic and digital versions of the lexical databases, print-out versions are
an important product of our documentation project (Mosel 2011: 340) be-
cause speech community members sometimes have limited access to elec-
tronic devices.
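Toolbox stores MDF records as plain text in which each field begins with a backslash marker (e.g. \lx for the lexeme, \ps for part of speech, \sn for a sense number, \ge for the English gloss). A minimal sketch of how such a record might be read into a nested structure is given below. It is not the import routine used by LEXUS: the function name parse_mdf and the decision to store fields either at entry level or under a sense are assumptions made for illustration, and the part-of-speech value in the sample record is assumed.

```python
def parse_mdf(text):
    """Parse backslash-coded records into a list of nested dicts.

    A new entry starts at each lx marker; fields after an sn marker are
    stored under that sense, all earlier fields under the entry itself.
    """
    entries = []
    entry = None
    target = None  # dict currently receiving fields
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # this sketch ignores blank and continuation lines
        marker, _, value = line[1:].partition(" ")
        value = value.strip()
        if marker == "lx":
            entry = {"lx": value, "fields": {}, "senses": []}
            entries.append(entry)
            target = entry["fields"]
        elif entry is None:
            continue  # skip header lines before the first record
        elif marker == "sn":
            sense = {"sn": value, "fields": {}}
            entry["senses"].append(sense)
            target = sense["fields"]
        else:
            target.setdefault(marker, []).append(value)
    return entries


# Illustrative record; the part-of-speech value is an assumption.
sample = r"""
\lx manoni
\ps n
\sn 1
\ge wooden ring
"""
entries = parse_mdf(sample)
print(entries[0]["senses"][0]["fields"]["ge"])
```

Each field value is kept in a list because a marker may legitimately occur more than once within an entry or sense.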
Four aspects distinguish the lexicographic work in the MEL-project from
our previous approach to dictionary making:
1) the inclusion of multimedia and annotated archive files to visualise and
contextualise word meaning;
2) the inclusion of encyclopaedic information in lexical entries;
9. Cf. Coward and Grimes (2000) for details about the MDF (=Multi-Dictionary Formatter)
database type in Toolbox.
3) the use of vernacular language (i.e. Marquesan or Tuamotuan) in docu-

menting word meaning and encyclopaedic knowledge, and
4) indigenous classification of vernacular words (i.e. folk taxonomies), and
the creation of ethno-ontologies or informal ontologies (Zinn 2008: 890),
i.e. knowledge spaces which are solely based on an indigenous under-
standing of cultural connections between elements of the lexicon.
In the following sections, the motivation behind the inclusion of these new
aspects will be explained and further detailed.
3.1. Visualisation and contextualisation of word meaning
The contextualisation and visualisation of word meaning with multimedia

and annotated archive files provides the user with information on the prag-
matics of lexical units as well as on cultural knowledge related to the mean-
ing and use of the lexical units in question. Moreover, multimedia extensions
can illustrate non-verbal aspects of cultural activities that are relevant for the
understanding of the concepts encoded by the lexical units in question.
In our lexica in LEXUS, lexical units are enriched with and linked to
example sentences of annotated sessions in the archive (via time codes). The
user can navigate to the exact passage where the example sentence occurs
in the annotated session. From this navigation point the user can then browse
through the whole session back and forth to explore the exact context of usage
(see Figure 1).
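The principle of linking an example sentence to a passage via time codes can be sketched as follows. This is a schematic illustration and not the LEXUS or ANNEX implementation: the names Annotation, ArchiveLink and context, the session identifier and the placeholder annotation texts are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start_ms: int
    end_ms: int
    text: str


@dataclass(frozen=True)
class ArchiveLink:
    session_id: str  # hypothetical identifier of an archive session
    start_ms: int    # time code at which the example sentence begins


def context(annotations, link, window=1):
    """Return the linked annotation plus `window` neighbours on either side.

    `annotations` must be sorted by start time.
    """
    starts = [a.start_ms for a in annotations]
    i = max(bisect_right(starts, link.start_ms) - 1, 0)
    lo = max(i - window, 0)
    return annotations[lo:i + window + 1]


session = [
    Annotation(0, 1500, "utterance 1"),
    Annotation(1500, 4000, "utterance 2"),
    Annotation(4000, 6200, "utterance 3"),
    Annotation(6200, 9000, "utterance 4"),
]
link = ArchiveLink("hypothetical-session", 4000)
print([a.text for a in context(session, link)])
```

Widening the window parameter corresponds to browsing further back and forth from the entry point.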
The linking of lexical units to annotated multimedia archive files goes be-
yond conventional lexicography, which focuses on the word or lexeme level.
Lexical units are embedded in discourse, revealing patterns and habits as-
sociated with their usage. According to Pawley (2001: 238), “part of native
command of a language involves knowing what things to say in discourse,
when and why to say them and how to say them in conventional ways”.
Furthermore, every language has a large repertoire of so-called speech
formulas (or lexicalised phrases), i.e. “set phrases for saying things in certain
contexts” (Pawley 2001: 238). As Pawley (2001: 239) notes, these tend to
be neglected by grammarians and lexicographers alike. They “do not really
fit the structuralist idea of a clean divide between grammar and dictionary”
(Himmelmann 2006: 19). Multimedia lexicon tools such as LEXUS can go
some way towards closing this gap in linguistic descriptions.
Figure 1. Archive-linking: from lexical entry in LEXUS to archive session (viewed in ANNEX)
3.2. The inclusion of encyclopaedic information in lexical entries
When doing lexicography in and for endangered speech communities, the lin-
guistic fieldworker will often find that lexicographic practice does not easily
combine with linguistic theories of lexical semantics (Haiman 1980; Herbst
and Klotz 2003; Pawley 2001; Mosel 2004b). Apart from the difficulty in
finding adequate translations in the target languages for vernacular words
of the source language – which is a general problem of bi- or multilingual
lexicography (Herbst and Klotz 2003: 109) – there is also the problem that
many headwords of lexical entries denote complex phenomena, procedures
and concepts which are specific to the source language and culture and there-
fore often do not have a translation equivalent in the target language at all
(Mosel 2004b: 48; Franchetto 2006: 203–206; Haviland 2006: 136–139). In
order to avoid an inadequate documentation of word meaning it is there-
fore often necessary to provide more encyclopaedic information in definitions
than a conventional bi- or multilingual dictionary would normally include.
For instance, Marquesan (MQR) heikai vaihopu and mākiko are both
kinds of breadfruit pudding made out of the same ingredients (very ripe
breadfruit pulp and coconut milk), but the final product and the technique of
preparing them differ to a great extent. In our lexicographic work, the En-
glish equivalent of ‘breadfruit pudding’ is not a satisfactory translation and

we have added more information, for example, on the preparation technique
to make the distinction between the two more salient. But is this kind of addi-
tional encyclopaedic information part of the word meaning of heikai vaihopu
and mākiko?
Pawley (2001: 231) questions whether the sparse representations of word
meaning in conventional dictionaries only reflect the “arbitrary limitations
in the practices of lexicographers” which arise not “from theoretical princi-
ples but from practical considerations of time and money”. Some researchers
(Haiman 1980; Herbst and Klotz 2003) believe that one cannot really distin-
guish between dictionaries and encyclopaedias because the semantic knowl-
edge of a word often derives from cultural knowledge, and that a meaning
of a word is often a subset of cultural (or encyclopaedic) knowledge. Conse-
quently the distinction between linguistic and cultural knowledge often has
no clear boundary. Whatever the theoretical stance might be, in a context of
language endangerment and the documentation of an endangered language
and culture it is surely self-evident that one should integrate as much cultural
knowledge connected with a certain word as possible. The question is to what
extent the dictionary is the right medium of documenting this knowledge.
Pawley (2001: 236–237) observes that all large general dictionaries of
under-described languages are to some degree ethnographic: definitions of vernacular
terms “include information about their significance in the culture of the
speech community”. Also, practical handbooks of conventional lexicography
advise lexicographers to include detailed encyclopaedic and cultural knowl-
edge in definitions of culture-specific terms (Svensén 1993: 164; Atkins and
Rundell 2008: 126–127). However, the definitions in conventional dictionar-
ies are confined to verbal descriptions. The advantage of a lexicon created
with LEXUS is that multimedia extensions as well as archive links can provide
the user with a wider and fuller linguistic as well as non-linguistic context
of cultural activities, which contributes to a more thorough understanding
of the culture-specific concepts in question.
3.3. The use of vernacular language in lexical entries
In §2.2 it has already been mentioned that monolingual (or vernacular) def-
initions of the different senses of a headword and vernacular encyclopaedic
articles are an important part of our MEL-project. This kind of work encourages
native speakers to express meanings of words in their own language and
thereby further stimulates the expressive power of their language. Vernacular
definitions of words are also an important step towards understanding
indigenous word meaning because they document word meaning from the
native speaker’s understanding of a word (Mosel 2004b: 48). Such definitions
form a repository of authentic indigenous word meaning for the scientific
community because they can clarify possible misunderstandings and
misinterpretations between the researcher and the local field assistants and
language consultants. Misinterpretations are not uncommon in linguistic
fieldwork because the researcher and the local consultants often communicate
via a contact language which is the native language of neither (Mosel 2004b: 48;
Haviland 2006: 144).
Even if the fieldworker communicates via the field language, there are usu-
ally very different levels of linguistic competence between the fieldworker
and the language consultants which could lead to misinterpretations.
In many endangered speech communities which lack a tradition of lit-
eracy, the formulation of monolingual definitions and encyclopaedic articles
poses a considerable challenge and leads to the emergence of new, written
speech genres. While linguistic ecologists such as Mühlhäusler (1990, 1996)
have cautioned against the imposition of literacy because it might dimin-
ish a rich oral heritage,10 such newly emerging written genres can provide
fascinating insights into the expressive potential of speakers of a language
(Mosel 2004a: 263; Mosel 2004c: 4). New constructions may be deployed which are
rarely used in everyday conversations, undoubtedly of interest to linguists and
possibly to language educators. The Marquesan field assistants indeed devel-
oped a style containing new and rarely used constructions. Verb serialisation
(1, 2) and clause chaining (2) occurred much more frequently than in other
(speech) genres. For example, a wooden ring (manoni) is defined as follows:
(1) ’Akau humu=tīa    ha’a=kapoipoi, to’o=tīa  no  te  hana
    wood  attach=PASS CAUS=round     take=PASS for ART fabricate
    pafi’o      ako’e’a ’ofi’o
    landing.net or      catcher
    ‘Attached rounded wood (lit. ‘wood which has been attached and rounded’)
    used to fabricate landing nets (fruit harvest) and catchers (fishing)’
10. Crowley (2001: 3) has noted that literacy has been introduced for a long time in many
Pacific minority languages, but it has not at all resulted in the reduction of the rich oral
heritage. This can also be stated for the Marquesas.
A preparation of plant medicine is described as follows:
(2) ...e katotīa   t=o      īa   tau manamana,  pī      te  ’au  i
    TAM  break.off ART=POSS 3.SG PL  RED-branch be.full ART leaf LD
    ’una, tukituki  ha’a=pe’ehu, kohi    te  tau manamana...
    top   RED-pound CAUS=soft    collect ART PL  RED-branch
    ‘...the branches, which are full of leaves, are broken off, pounded and
    made soft, then collected...’
TAM particles and articles are often omitted:
(3) ...ka-kaiu tumu, kofā hunahuna, ’au  firifiri ’ina   kapoipoi
    RED-small  trunk rib  tiny      leaf curly    almost round
    ‘...small trunk, tiny stalk, curly leaves almost round’
There is generally a higher number of juxtapositional constructions:
(4) ...kakano ke’eke’e kaikai na  te  kūkū
    seed      black    food   for ART fruit-dove
    ‘...black seeds which are food for the fruit doves’
Some NPs can contain up to four postnuclear modifiers:
(5) ...t=o   īa   tau pupu  pua    ’iki  metīe ku’uhua
    ART=POSS 3.SG PL  bunch flower small green yellow
    ‘...it has small, yellow-green bunches of flowers (lit. ‘its bunches flower
    small green yellow’).’
Most of the vernacular definitions in our MEL-project denote concrete entities
of the natural world (plants, fish, shellfish, birds, body-parts, etc.), cultural
products (ornaments, artefacts, containers, clothing, etc.), and the preparation
techniques and technologies used to produce these cultural products (fishing
techniques, food preparation, plant medicine, tools, instruments, etc.). These
definitions mainly describe form, appearance, habitat, texture, colour, material
or course of action, etc., but they also contain encyclopaedic information and
functional aspects of usage if consultants felt them to be necessary. For example,
with respect to plants, a detailed description of the appearance (form,
height, leaves, flowers, fruits, seeds, habitat, etc.) and the plant's main uses were given
because this facilitates the identification of the plant by other native speakers.
The vernacular definition is followed by a translation of the definition in the

target languages French and English. If an equivalent French and English
term (e.g. a standard common botanical name) existed, it was added on to the
translated definition as in this entry for the headword tumu ı̄naı̄na:
tumu ı̄naı̄na n., Xylosma suaveolens, Usage: HO, TH11
Marquesan definition:
tumu ’akau e to’u meta te hohonu, mea fatea te manamana i vaho,
e to’u e fā meta, ’e’eva, mea titatita te ’au fi’i o’o, pua ’iki ma’ita,
kakano ke’eke’e me he kakano o te va’ova’o, e tupu i no he tau
ı̄vi, taha mo’o, na te kūkū me te koma’o e kai te kakano (HO).
French translation with common botanical name:
Xylosma suaveolens; arbre de trois mètres de hauteur, les branches
s’étendent vers les côtés de 3 à 4 mètres, feuilles touffues, petites
fleurs blanches, graines noires comme les graines de l’arbre de Premna
(Premna serratifolia), il se trouve au plateau et sur des collines dans
les endroits secs, le ptilope dupetit-thouars (Ptilinopus dupetithouar-
sii) et la rousserolle des Marquises (Acrocephalus percernis) mangent
ses graines.
English translation with common botanical name:
East Polynesian Xylosma; tree up to 3 meters in height, long branches
to its side (up to 3 to 4 meters), dense, bushy leaves, small white flowers,
black seeds similar to the Premna tree (Premna serratifolia),
found on the high plateaus and on the hill tops in dry locations, the
white-capped fruit-dove (Ptilinopus dupetithouarsii) and the Marque-
san reed-warbler (Acrocephalus percernis) eat its seeds.
Other, more detailed encyclopaedic information was transformed into ency-
clopaedic articles describing cultural uses by using keywords in capital letters
(e.g. ’APAU ‘medicine, medicinal treatment’, PENI ‘paint’, HA’INA ‘hand-
icraft’, ’AVAĪ’A ‘fishing’, etc.). The lexical entry ’anetāi ‘coral tree’ (Ery-
thrina variegata (or E. corallodendrum)), for example, contains the following
encyclopaedic descriptions of cultural uses:
HA’INA: To’oa te ’akau no te hana te hei ku’uhē no te ’ana’ana o te
’akau. To’otı̄a te kakano no te tui hei. ’AVAĪ’A: Ia pua te ’anetāi ua kai
11. HO and TH are abbreviations for different island vernaculars: HO = Hiva ’Oa dialect, TH
= Tahuata dialect.
te te’ape, momona pao te ı̄’a, to ı̄a kerehi me he pua o te ’anetāi, mo’ai

oko te ı̄’a.
ARTISANAT: On prend le bois pour fabriquer le “hei ku’uhe” (sorte de
couronne) à cause de sa qualité légère. Les graines sont prises pour confectionner
des couronnes. PÊCHE: Quand l’arbre est en fleurs, cela annonce une
période très favorable pour pêcher les perches (Lutjanus kasmira); les graisses
autour du foie de ce poisson deviennent très grasses et ont une couleur rouge
semblable aux fleurs d’erythrine.
HANDICRAFT: The wood is taken to make the breast ornament "hei ku’uhe"
due to its light quality. The seeds are used to make necklaces. FISHING:
When the coral tree is in flower, an abundance of bluestripe snappers (Lut-
janus kasmira) can be caught; during the flowering period of the coral tree,
the liver of this fish is extremely fatty and has a colouring similar to that of
the red flowers of the coral tree.
The encyclopaedic articles can vary greatly in length. In general, details of

preparation forms (e.g. plant medicine, dance costumes, medical treatment,
etc.) are described briefly in an instructive style. The first encyclopaedic ar-
ticles had structural similarities to spoken instructive texts with a high fre-
quency of discourse particles indicating the sequence of events (e.g. of a
preparation form). Most of these first encyclopaedic articles were re-written,
often in a briefer and denser style avoiding repetitiveness.
In glossaries which focus on one tree or plant with all its botanical varieties
(e.g. the breadfruit tree and its varieties), the descriptions of cultural uses
in the encyclopaedic article section of the lexical entry were organised with
respect to the different plant parts (e.g. fruit, stem, bark, leaves, sap, roots,
etc.). More will be said about this type of ethnobotanical approach in the next
section.
3.4. Cultural knowledge spaces and folk taxonomies
As already mentioned above, the new technology of relational linking in

LEXUS/ViCoS not only opens up the possibility of viewing, for example,
typical semantic relations such as synonymy, antonymy, meronymy, etc. of a
lexical entry in one and the same space, but also allows the integration of all
kinds of information in whatever way defined by the user. Unlike the com-
monly used one-to-one hyperlinking mechanism, relational links – created in
ViCoS knowledge spaces12 – visualise multiple links in the same space. The
users can view several nodes simultaneously on the computer screen, which
allows them to navigate according to their interest. All nodes can contain further
links to other lexical, multimedia or archive data (containing sessions of
specific cultural uses) which are opened up once the user clicks on the nodes.
The screenshot in Figure 2 shows how parts of the breadfruit ontology are
realised in ViCoS.
[Figure 2: screenshot of the ViCoS Editor and Navigator. Node labels visible include tumu mei, mei, mei ka'aku pukupuku, mei hōi, mei autēa, mei aravei, mei hinu, mei maō'i, mei 'ape, mei kopumoko, mā, popoi, haīka mā, mā tehītō and galerie de photos. Relation types in the legend: preparation_shown_in, is_a, is_product_of, is_material_for, is_part_of, is_variety_of, is_image_of, is_medicine_for, is_topic_in.]
Figure 2. Screenshot of parts of the breadfruit ontology in ViCoS
In the upper left part of the screenshot only those relations are visualised
which are related to the node mei ‘breadfruit’. When navigating e.g. to the
node mā ‘fermented breadfruit’ it becomes the focus in the ViCoS Editor and
Navigator showing all relations connected to mā. The navigation can continue
12. Note that also in the Kirrkirr software multiple relational links are viewed in the same
space.
as long as relations have been created between elements of the LEXUS lexical
database.
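The difference between one-to-one hyperlinks and typed relational links can be made concrete with a small labelled graph. The sketch below is not the ViCoS data model: the class KnowledgeSpace and its methods are hypothetical, and the three relations entered are illustrative guesses using relation types from the legend of Figure 2, not relations read off the screenshot.

```python
from collections import defaultdict


class KnowledgeSpace:
    """Hypothetical store of typed links between lexicon elements."""

    def __init__(self):
        self._out = defaultdict(list)  # source -> [(relation, target)]
        self._in = defaultdict(list)   # target -> [(relation, source)]

    def relate(self, source, relation, target):
        self._out[source].append((relation, target))
        self._in[target].append((relation, source))

    def focus(self, node):
        """All relations touching `node`, as (source, relation, target) triples."""
        outgoing = [(node, r, t) for r, t in self._out.get(node, [])]
        incoming = [(s, r, node) for r, s in self._in.get(node, [])]
        return outgoing + incoming


ks = KnowledgeSpace()
# Illustrative triples only; the actual relations in the lexicon may differ.
ks.relate("mā", "is_product_of", "mei")
ks.relate("popoi", "is_product_of", "mā")
ks.relate("mei", "is_part_of", "tumu mei")

for triple in ks.focus("mā"):
    print(triple)
```

Moving the focus from one node to another, as described above, then amounts to calling focus on the newly selected node.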
The advantage such a tool has for the speech community as well as the scientific
community is obvious. Speech community members can define and organise how
words and other (cultural) data are grouped together and what relations they
hold to each other, thus creating knowledge spaces which are meaningful to
speech community members.
Due to the flexible scheme of linking, all kinds of data13 can be com-
bined in whatever way possible according to the particular interest of the user.
Linguistic researchers could create knowledge spaces to view a particular se-
mantic field or domain of investigation (e.g. CUT and BREAK verbals). Re-
searchers interested in oral literature, for example, can include proper names
of protagonists in the lexical database, establish the relationships they hold to
each other (e.g. is_father_of, is_younger_sister_of, is_sister-in-law_of, etc.)
in ViCoS knowledge spaces and make links to the archive where the respec-
tive narratives are stored (cf. Figure 3).
ViCoS not only visualises various relations; it can also be used to record
important information (e.g. complex family relationships) which is usually
included neither in the lexical entry nor in the metadata of the archive
sessions.
The indigenous representation and organisation of data in ViCoS have
been collected and prepared in two different ways. Our first approach was to
create knowledge spaces from an ethnobotanical perspective as the traditional
material culture of the Marquesas and Tuamotu islands is – for the most part –
based on plants. A plant was taken as a point of departure (or anchor) and then
all the cultural uses which are connected with that plant were established.
The cultural uses of the most important trees in the Tuamotuan (coconut
tree) and Marquesan cultures (breadfruit, coconut and banyan tree) were
elicited in detail by creating a kind of informal ontology for each tree: all the
different parts of the plants were named and related to their uses in the
traditional culture (food preparation, plant medicine, handicrafts,
canoe-building, house-building, etc.). The intention was to establish a kind
13. The linking of multimedia files in ViCoS is still very restricted, but there are workaround
solutions for creating links to the archive and multimedia files. One can create a database
in LEXUS which contains the relevant multimedia files, archive links or photo galleries
without further lexical data. Within that database one creates a data category (e.g. ‘link to
photo gallery’) which can be used as a link in the ViCoS knowledge space to access the
database which contains the multimedia data or archive links in LEXUS (cf. Figure 2).
[Figure 3: screenshot of the ViCoS Editor and Navigator. Labels visible (as extracted): Teupokootiū, Teupokootekahi, Paevao, Pepei’u Tohi’akau, Moni, Keikahanui, Pa‘ehitu. Relation types in the legend: is_father_of, is_ancestral_spirit_of, is_son_of, is_adoptive_sister_of, is_younger_sister_of, is_sister-in-law_of.]
Figure 3. An example for kinship relations represented in ViCoS
of mini-encyclopaedia about one specific plant which would have links to

all the indigenous and encyclopaedic knowledge available in our lexicon and
archive in a structured way (cf. Figure 2 above). We refer to them as cultural
knowledge spaces, ethno-ontologies or informal ontologies (Zinn et al. 2008;
Zinn 2008: 890).
The cultural knowledge spaces of these trees were first created with so-
called meta-cards which are paper cards with different shapes and colours.
Consultants wrote keywords (e.g. usage of a plant part) on the meta-cards and
arranged them according to their own cultural associations. Using the cards,
consultants could add to, change and rearrange their initial organisation.
On the basis of these meta-card ontologies, consultants
were further interviewed and a wealth of indigenous encyclopaedic infor-
mation was elicited and video-taped about the different preparation forms,
techniques, function and usage, customs and rituals connected with the trees
or artefacts fabricated from these trees, etc. The elicitation of indigenous en-
cyclopaedic knowledge was undertaken together with the indigenous anthro-
pologist of the team, Edgar Tetahiotupa.
Cultural knowledge spaces are informal ontologies which are solely based
on native speakers’ associations between elements of the lexicon in LEXUS
(Zinn 2008: 890–894). They are not formal ontologies like SUMO or
WordNet, which are built for machine reasoning rather than for human use
in educational settings of language revival (Zinn et al. 2008). Ontology
building in endangered speech communities has so far been attempted
by Rau et al. (2009: 199–209) for the Austronesian language Yami, but the
researchers “have come to realize that it is not possible to describe an on-
tology based on sophisticated machine reasoning. Any ontology description
... requires triangulation of various resources of human interpretation” (Rau
et al. 2009: 208). Whereas Rau et al. (2009: 192) pursue a ‘formalized model
of existing indigenous knowledge’, our objective is more guided by practi-
cal aspects of usability for the speech community and better visualisation of
cultural connections in a language documentation archive.
Our second approach was to create so-called folk taxonomies (Conklin
1962; Bulmer 1970; Coward and Grimes 2000: 138–142) for which speech
community members had to classify aspects of the natural environment by
creating classes of one specific cultural domain and establish hierarchical
structures based on cultural uses and associations.
Conklin (1962: 50) observes that folk taxa “belong simultaneously to sev-
eral distinct hierarchical structures” of which some are based on form and ap-
pearance whereas others are based on culture-specific associations and uses.
In other words: “functional categories” of plants and animals, e.g. culture-
specific products such as containers, clothing, ornaments, medicine, food
dishes, etc. form part of folk taxonomies as much as the categories and classes
formed on the basis of form and appearance of things from the natural envi-
ronment.
For our work on folk taxonomies we decided to focus on the domain of
marine life because this domain – next to plants – plays an important role in
the two Polynesian cultures. Thus far, the work has only been accomplished
in the Marquesan speech community.
The first step in establishing a folk taxonomy is to find out under which
generic term a lexeme is classed in the vernacular, revealing the higher category
of a lexeme (Coward and Grimes 2000: 138). However, between the
highest level of the taxonomy (i.e. the generic term), called the life form, and the
specific name of a thing, called the terminal taxon (which relates to the western
notion of species), there can be many more intermediate levels called
intermediate taxa (Coward and Grimes 2000: 138). It is particularly difficult
to find out about these intermediate taxa, if they exist at all; they often
deviate greatly from scientific sub-classes (e.g. families, phyla, varieties,
subspecies, etc.). Within taxonomies which are hierarchically structured with
respect to classes, the field researcher working on folk taxonomies often has
the problem of integrating different ontogenetic stages (different names for
juvenile vs. adult animals) and part-of relationships (e.g. tree, stem, sap), which can
be important folk categories (Conklin 1962: 50).
For our folk taxonomy we classified 203 different species of marine life
by building up hierarchical structures and classes which are meaningful to the
speech community members (e.g. culture-specific uses such as fishing tech-
niques, food preparation, etc.). The classification of a particular domain is
of course best done with real objects or the entities in question. However, it
is very difficult to have samples of a large number of marine animals at the
same time and for the duration of the classification process of the folk taxon-
omy which can take two weeks or longer. For this reason we prepared photos
of marine species and attached them to blank file cards on which sufficient
space was left to code the data and make additional notes. The Marquesan
consultants then sorted the cards according to their understanding of classes
and sub-classes (intermediate taxa) and found generic names for these classes
if they existed. Differently coloured post-it notes were used for generic terms
and intermediate (vernacular) taxa.
Altogether, we made 22 indigenous classifications (incl. marine animal/
fish families, fishing techniques, fishing times, fishing tools, fishing locations,
fishing baits, food dishes, lunar fishing calendar, etc.) containing sub-classes
with intermediate taxa. Not all classes have generic terms. If there were no
generic terms, consultants were encouraged to give descriptive labels. These
non-lexicalised descriptions express the most important characteristics of a
particular class which are meaningful to them (e.g. niho ta’ata’a ‘sharp teeth’,
īpu pe’ehu ‘soft shell’, īpu fe’o ‘hard shell’). Some of the classes correspond
more or less to general scientific groupings such as īpu pe’ehu (crustaceans)
and īpu fe’o (shellfish), whereas other classes such as ki’i fe’o, unahi ko’e
‘thick skin, no scaling’ comprise six different fish families: Ostraciidae (boxfish,
cofferfish, cowfish), Tetraodontinae (puffer, toby), Diodontidae (porcupine
fish), Balistidae (triggerfish), Labridae (wrasse) and Pteroinae (lionfish).
These six fish families are very different in form and appearance, but
they are grouped together due to a quality they share (i.e. thick skin) as well
as a functional aspect, namely that they do not need to be scaled after fishing.
Furthermore, it is interesting to note that the classes are not always formed
with regard to the linguistic classification. For example, some fish names
which are formed with the generic terms of particular classes such as ūme
‘unicornfish’ and tamano ‘jackfish, chubs’, are not necessarily classified with
fish which bear the same generic term. For example, ūme tiaporo ‘longhorn
cowfish’ belongs to a different class than ūme tatihi, ūme kuripo, ūme mei,
etc. which all belong to the unicornfish family. Furthermore, the term ’eitano
‘seafood’ comprises crustaceans as well as shellfish, but in the classifications
the consultants clearly separated crustaceans (īpu pe’ehu
‘soft shell’) and shellfish (īpu fe’o ‘hard shell’) into two groups (cf. above).
In ViCoS these indigenous classifications can be visualised coherently and
contrasted with scientific classifications via scientific names of fish families
(e.g. Balistidae (triggerfish) vs. Labridae (wrasse), etc.).
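The contrast between an indigenous class and the scientific families it spans can be sketched as a simple mapping. The following Python fragment is only an illustration and does not reflect how ViCoS stores its data: the dictionary layout and the helper names `families_for_class` and `classes_for_family` are assumptions, while the class label and the family names are those given in the text above.

```python
# Hypothetical sketch: one indigenous class mapped to the scientific families
# it comprises. The data layout is an assumption made for illustration only.
FOLK_CLASSES = {
    "ki'i fe'o, unahi ko'e": {  # 'thick skin, no scaling'
        "Ostraciidae", "Tetraodontinae", "Diodontidae",
        "Balistidae", "Labridae", "Pterorinae",
    },
}


def families_for_class(folk_class):
    """Return the scientific families grouped under an indigenous class."""
    return sorted(FOLK_CLASSES.get(folk_class, set()))


def classes_for_family(family):
    """Return the indigenous classes that contain a given scientific family."""
    return sorted(name for name, fams in FOLK_CLASSES.items() if family in fams)
```

Looking up a class shows which scientific groupings it cuts across, and looking up a family shows where the indigenous classification places it.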
Altogether only three consultants participated in the task, so our folk tax-
onomy does not yet represent a general indigenous classification of marine
life for the Marquesan speech community. More data will have to be col-
lected with a larger number of consultants to see whether there are regular
patterns of classification across the speech community. However, the con-
sultants, of whom two also worked as field assistants on the dictionary, felt
that this task greatly helped them in structuring and eliciting encyclopaedic
knowledge which is now being transformed into encyclopaedic articles for
the fish lexicon. Again, this task shows that the documentation of lexical and
cultural knowledge is best achieved when it is embedded in a very neatly
defined thematic domain. Moreover, cultural connections can be much more
easily established when the whole domain of investigation is highly contex-
tualised.
4. Web-based collaboration with the speech community
Internet facilities are becoming more and more accessible in the remotest ar-
eas of the world, which makes it technically possible to continue the coopera-
tion between researchers and speech communities outside fieldwork periods.
The web-based editing possibilities of LEXUS were designed specifically to
motivate speech community members to participate actively in the doc-
umentation of their languages by adding and editing dictionary entries and
enriching them with encyclopaedic information via the web (Ringersma and
Rybka 2009), potentially even in the absence of the researcher.
However, there are many problems connected with such a collaboration
framework, which basically has a wiki-like set-up (cf. also Rau et al. 2009:
194–195). This section discusses why web-based collaboration
with endangered speech communities is difficult to achieve, drawing on our
own experience and the numerous discussions we had with the Marquesan
and Tuamotuan language activists and field assistants.
At the beginning of the project the idea of online cooperation seemed
very appealing both to the linguistic team and the speech communities. The
language activists in the Tuamotuan speech community, for example, hoped
that the web-based possibilities of LEXUS would promote community-
wide participation in documenting endangered cultural knowledge on a larger
scale. Community members could be involved who would otherwise not be
able to contribute due to difficult inter-island communications in the Tuamo-
tuan archipelago:14 getting to remote parts of the Tuamotuan islands is ex-
pensive and time-consuming. For the linguistic team a web-based collabo-
ration initially seemed appealing because the collaboration with the speech
communities could continue outside fieldwork periods which would ensure a
continuous growth of linguistic material in a short period of time.
However, it is not sufficient simply to make a web-based tool available in
order to ensure online cooperation, nor can one assume that an encyclopaedic
lexicon will be easily created in a wiki-like manner by the speech commu-
nity. For a successful online cooperation with LEXUS, there needs to be a
substantial amount of capacity building in the speech community and a num-
ber of community-internal obstacles have to be overcome as well, which will
be described in the following sub-sections. Some of the obstacles are culture-
specific, but a number of them can be generalised for endangered speech
communities.
If the lexicon creation is open to the whole speech community, there
needs to be a system established in LEXUS to ensure that the contribu-
tions are coordinated and merged into the lexicon. In LEXUS this concept
of collaborative lexicon creation will be realized via so-called collaborative
workspaces.15 The whole set-up of collaborative workspaces is a crucial pre-
requisite in collaborative lexicon creation and the manner of the design of
collaborative workspaces is closely connected to many of the community-
internal obstacles which will be described below. We will therefore begin
with the basic concept of collaborative workspaces.
14. Note that the sparsely-scattered Tuamotuan archipelago covers a geographic area approx-
imately corresponding to a triangle between northern Germany, Bucharest and London.
15. LEXUS does not yet have the full set-up of collaborative workspaces, but so far can only
distinguish between readers and writers. The concept described below was the proposal
4.1. Collaborative workspaces

At the beginning of the MEL-project the developers of LEXUS proposed a
basic model of collaborative workspaces. The idea of a collaborative work-
space works as follows. For example, a linguistic team which has already cre-
ated a lexical database in Toolbox imports it into LEXUS and publishes it in
a central storage place. The researcher then defines access rights for the user
groups. Authorised users then copy the lexicon version into their personal
workspace and make changes to the lexicon. After editing the lexicon the
authorised user submits the new lexicon version which is then merged with
the lexicon version in the central storage place. An administrator is solely in
control of the merging: during synchronisation the administrator can accept
or refuse changes (Cablitz, Ringersma, and Kemps-Snijders
2007: 414).
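The workflow just described (publish centrally, copy into a personal workspace, edit, submit, merge under the administrator's control) can be sketched as follows. This is a minimal Python sketch under stated assumptions and not the LEXUS implementation: the class `CentralStore`, its methods, and the representation of a lexicon as a headword-to-definition dictionary are all hypothetical, and deletion of entries is not handled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CentralStore:
    """Hypothetical central storage place: headword -> definition."""
    lexicon: dict = field(default_factory=dict)

    def checkout(self):
        # An authorised user copies the published version into a personal workspace.
        return copy.deepcopy(self.lexicon)

    def merge(self, workspace, accept):
        """Merge a submitted workspace into the published lexicon.

        `accept` stands in for the administrator's decision and is called once
        per new or changed entry. Returns the headwords that were refused.
        """
        refused = []
        for headword, definition in workspace.items():
            if self.lexicon.get(headword) == definition:
                continue  # entry unchanged, nothing to decide
            if accept(headword, definition):
                self.lexicon[headword] = definition
            else:
                refused.append(headword)
        return refused


# Usage: one change is accepted, one empty new entry is refused.
store = CentralStore({"kofā": "stalk"})
workspace = store.checkout()
workspace["kofā"] = "central rib or stalk of a leaf"
workspace["pūhu'u"] = ""
refused = store.merge(workspace, accept=lambda headword, definition: bool(definition))
```

The sketch makes the bottleneck visible: every changed entry passes through a single decision function, which is the point taken up in the next subsection.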
4.1.1. Basic challenges of the model of collaborative workspaces

Although there is some control over the merging process, the crucial ques-
tion is who would be a suitable administrator. If it were the field researcher
it would usually mean that she or he is a non-native speaker who cannot ac-
curately judge whether or not vernacular definitions are well-formed. If the
administrator is a native speaker, he or she would be able to make a judge-
ment about the well-formedness of vernacular definitions, but might have too
little knowledge about lexicography (e.g. how to analyse the different senses
of a headword, verbals with differing degrees of transitivity, etc.). Either way,
the administrator will not be able to judge fully the validity of a changed
or new lexical entry.
Many concerns were raised in discussions held between members of the
speech communities and the linguistic team during 2007. They revealed that
the idea of collaborative workspaces working on a wiki-type basis leads to
serious problems which could have counterproductive outcomes (cf. §4.2.1
and 4.2.2). It was therefore decided to adapt the model more closely to the
needs of the speech community, as outlined in the next section.
15. (continued) set forth by the software developers to the linguistic team and the speech community. It
gave rise to many discussions which are relevant for the discussion of web-based tools for
language documentation.
4.1.2. Design of collaborative workspaces by the speech community
The speech community suggested a quality-controlled system of collabora-
tive workspaces. In this model, the contributions of the speech community
are evaluated by a panel of moderators which should ideally consist of na-
tive speakers of all the different island vernaculars of the archipelago. There
should be a flexible system of hierarchical rights such as “reader”, “writer”,
“moderator” and “administrator”. Readers would only have access to the pub-
lished version of the lexicon in the central storage place, whereas writers could
download the lexicon from the central storage place into their own personal
workspace and change and add content to the lexical entries. The moderators
have the most important function in this model of collaborative workspaces:
they review the contributions and decide whether they should be published
or not. Each member of the panel of moderators specialises in a particular
cultural domain (e.g. song and dance, story-telling, food preparation, plant
medicine, fishing, etc.). Each moderator reviews only those entries which fall
into their area of cultural expertise. The final decision of accepting, reject-
ing or modifying submitted contributions is a joint decision of the panel of
moderators which is regarded as a more “democratic” approach than having
one administrator who makes all the decisions. The administrator, who dele-
gates the new and changed entries to the respective moderators, manages all
users, gives permissions, merges the data into the lexical database
and publishes it in the central storage place, has to follow the decisions made
by the panel of moderators. The administrator could be the primary editor of
the lexicon – in our case the linguistic fieldworker – who could have a say
in spelling matters, the presentation of linguistic information and semantic
analysis of a headword.
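The division of labour in this model can be sketched as a small permission table plus a routing step that hands each submitted entry to the moderator for its cultural domain. The Python fragment below is a hedged illustration only: the action names, the `PERMISSIONS` table and the function `delegate` are assumptions and are not part of LEXUS. How the panel reaches its joint decision is not specified in the proposal and is therefore left out.

```python
from enum import Enum


class Role(Enum):
    READER = "reader"
    WRITER = "writer"
    MODERATOR = "moderator"
    ADMINISTRATOR = "administrator"


# Hypothetical action names, following the description of each role above.
PERMISSIONS = {
    Role.READER: {"read_published"},
    Role.WRITER: {"read_published", "edit_own_workspace", "submit"},
    Role.MODERATOR: {"read_published", "review"},
    Role.ADMINISTRATOR: {"read_published", "delegate", "manage_users",
                          "merge", "publish"},
}


def can(role, action):
    """Check whether a role is allowed to perform an action."""
    return action in PERMISSIONS[role]


def delegate(entries, moderators):
    """Route submitted entries to moderators by cultural domain.

    entries: list of (headword, domain) pairs.
    moderators: mapping from domain to moderator name.
    Entries whose domain has no moderator are collected under the key None.
    """
    queue = {}
    for headword, domain in entries:
        queue.setdefault(moderators.get(domain), []).append(headword)
    return queue
```

In this sketch the administrator can merge and publish but cannot review, which mirrors the requirement that the administrator follows the panel's decisions.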
The system of collaborative workspaces as proposed by the speech com-
munities is not without some problematic aspects. The first of these is the ques-
tion of how such a panel is selected. For the Tuamotuan speech community
the function of the panel is ideally taken over by a language academy which
is elected by an external committee. Each member of the speech community
can apply to become an academy member. Experience with the Marquesan
academy, however, has shown that the choice of academy members is of-
ten politically motivated rather than a selection by competence and knowl-
edge. The replacement of old with new academy members has been largely
the choice and personal preference of the academy’s director rather than an
election by the whole academy body or an external committee. This has led
to academy members being held in low esteem and an unwillingness in the
Marquesan speech community to cooperate with them (Cablitz 2010: 39–40).
Tuamotuan speech community members have made similar observations dur-
ing the election of their academy in 2010.
The whole process of moderating contributions is very time-consuming
and work-intensive. The administrator who finalises the work would con-
stantly have to communicate and negotiate between the contributors (i.e. the
writers) and the panel of moderators. For example, if several entry writers
submit different definitions of the same headword (i.e. same sense of a head-
word), it would be difficult to review these multiple drafts efficiently at one
time. Another problem is that changes cannot be easily traced in a changed
entry, which further complicates the final editing process for the administra-
tor. One way of facilitating the whole editing and merging process
as well as encouraging community-wide participation would be the intro-
duction of so-called whiteboarding tools. Whiteboarding tools are web-based
tools which allow informal collaborative editing of documents and files.
Ideally such a tool should be made available with the published ‘read only’
version of the lexicon. In this way speech community members could add
comments, make suggestions, etc. on specific parts of a lexical entry without
making irretrievable changes to the lexical database, as illustrated in Figure 4.
Apart from difficulties in administering the workflow, there is an obvious
desire by both speech communities16 to have some control over the quality
and the content of the contributions. Similar negative reactions about unmon-
itored dictionary contributors have been reported by Rau et al. (2009: 194) for
the Yami speech community. Contributors to their wiki dictionary feared that
‘painstakingly entered lexical entries would disappear’ because they would
be re-edited by other contributors as soon as the data was published (Rau
et al. 2009: 195). These concerns were also articulated by the Tuamotuan
and Marquesan community members. The speech community members who
have participated in the MEL-project were mainly concerned that a wiki-like
lexicon creation would result in an “editing war” where speech community
members would constantly re-write lexical entries of other speech community
members because they believe that they have a more profound linguistic and
cultural knowledge. To them, collaborative creation of the lexicon seemed
too idealistic an approach, and uncontrolled contributions in a wiki-like set-
up would not yield the desired result.
16. The set-up of collaborative workspaces and the workflow was also intensely discussed
with Marquesan speech community members, whose concerns were very similar to those
of the Tuamotuan community.
[Figure 4: screenshot of a LEXUS lexical entry opened in a whiteboarding tool, with reviewer comments attached to parts of the entry: “Vahaka is not a good word, po'o is better”; “patahi is written pātāhi!”; “Also used on Nuku Hiva, add NH”.]
Figure 4. Example of a whiteboarding tool: a dictionary entry with comments
While some of these difficulties are probably of a universal nature, there
are also culture-specific factors at work. One serious problem is that lan-
guage consultants do not necessarily share the same metalinguistic and cul-
tural knowledge about words. The indigenous Polynesian languages are cur-
rently undergoing rapid linguistic change caused not only by the dominant
contact language French, but also by Tahitian (Cablitz 2010: 49). Depending
on the age of the consultants and their up-bringing, the metalinguistic knowl-
edge is in general very heterogeneous.
The problem is also rooted in their traditional society: the transmission of
cultural knowledge was – and still is – by no means a public affair and it was
only transmitted to selected persons in the community. Consequently some
speakers possess detailed cultural knowledge, whereas others – belonging to
the same generation – only have rudimentary knowledge.
Moreover, the continuous loss of their linguistic and cultural heritage also
feeds into many insecurities of the speakers and is often grounds for conflicts
between speech community members over what is authentic and inauthentic
knowledge. Community members often accuse each other of (re-)inventing
and transforming the indigenous language and culture. Knowledgeable – of-
ten older – speakers are frequently insulted and stigmatised as liars because
their knowledge is not commonly shared with other community members. As
a consequence, knowledgeable speakers withdraw from participating in the
documentation of their languages.
Another problem of community-wide lexicon creation is that both Poly-
nesian communities are still very much anchored in their oral traditions. De-
spite rapid westernisation and schooling in the last 40 years, the Tuamotuan
and Marquesan communities have not really developed a writing tradition.
The most knowledgeable speech community members, whom one ideally
wants to engage in the creation of an encyclopaedic lexicon, often cannot read
or write, not to mention their lack of IT skills. Even those speech community
members who are literate are often reluctant to express their knowledge in
writing. They mostly prefer to “chat” about their knowledge in informal in-
terviews, and thus many consultants still feel most at ease when their knowl-
edge is simply recorded, thus remaining, to some extent, bound to the oral
tradition of knowledge transmission.
Some of the problems discussed in this section are specific to Polynesian
communities, but the tensions and insecurities which exist due to the loss of
the language and culture can probably be generalised for a number of endan-
gered speech communities. Rau et al. (2009: 194) report that Yami commu-
nity members were “suspicious and critical of the collaborative efforts” of
the research team and showed negative attitudes towards researchers who
were not part of the speech community. The greatest fear of Yami community
members is the potential abuse of the materials which are put online and the
disrespect of intellectual property rights, which meant that the wiki dictio-
nary developed for the Yami dictionary project did not get sufficient support
(Rau et al. 2009: 195). Similar fears of abuse have also been expressed by
Marquesan and Tuamotuan speech community members.
4.2. Capacity building in the speech community
Web-based lexicon creation would not be possible without a substantial
amount of capacity building in the speech community. For a successful online
cooperation with LEXUS outside fieldwork periods, the participating speech
communities need to be prepared and trained in the basic aspects of lexicog-
raphy and linguistic concepts, and need to learn to use the linguistic – in
particular lexicographic – software used in the project.
In the previous documentation project on the Marquesas Islands (cf. §2.1
and §3), members of the speech community had already been trained in the
use of linguistic and lexicographic software such as Toolbox and some basic
linguistic concepts. As we used the MDF 4.0 database type for the build-up
of our lexical databases, it was essential that the local field assistants under-
stood the structure because the Toolbox databases were imported into LEXUS
where the same lexicon structure would be reproduced. Understanding lex-
icon structures such as the MDF Toolbox database type requires intensive
training, repetition of usage and continuous familiarisation with the structure
as well as the software. The Marquesan field assistants who had some basic
IT skills were trained to use the Toolbox software on a daily basis over the
protracted period of several field trips by writing lexical entries together with
the fieldworking linguist.
Capacity building also played an important role during fieldwork with the
Tuamotuan speech community, as all the linguistic software used for our lan-
guage documentation was new to them at the beginning of the MEL-project.
Several training sessions with up to six participants were held two to three
times a week over the full length of the field trip; basic linguistic concepts
which are relevant for Polynesian linguistics such as word classification and
polysemy were taught as well as lexicographic aspects of how to write ex-
ample sentences, monolingual definitions, encyclopaedic articles, integrate
subentries/run-ons, make dictionaries by applying Mosel’s (2004b) thematic
approach, etc.
4.2.1. The need for user-friendly tools
Toolbox is a complex tool with many editing possibilities. The members
of the Marquesan speech community were able to independently compile
a bilingual (Marquesan-French) breadfruit mini-dictionary of 34 breadfruit
varieties, which involved writing vernacular definitions and encyclopaedic
articles of the ethnobotanical uses of the various plant varieties. However, a
year later, on the next field trip the community members had to be (re-)trained
because they had not used the software on a daily basis and confused many of
the field markers in the MDF database type. Albright and Hatton (2008: 192)
also report that speech community members who were trained to work with
Toolbox appeared to do well during the training, but a month later – when
left on their own to complete lexical entries – they often stopped work. They
thought that they had lost all their work when closing a single window, not
knowing how to open it again in their project. These kinds of complaints also
frequently came from the Marquesan field assistants, who thought that they
had lost their lexicon files through a computer virus.
Albright and Hatton (2008: 192) point out that lexicon tools such as Tool-
box are not even intuitive for trained linguists, who can get confused in partic-
ular with file management tasks such as finding, opening, saving and backing
up files. According to Albright and Hatton (2008: 189), speech community
members who want to get involved in the documentation of their language
need a tool with a simplified user interface that removes this kind of barrier.
Albright and Hatton (2008) make very valuable points about the need for
a simplified user interface for speech community members. Complex lexi-
con structures can be overwhelming, lexicographic concepts can get confused
and consequently speech community members often feel lost when having to
edit lexical entries on their own. In our experience of working with Tool-
box, speech community members often accidentally inserted field markers,
leading to inconsistent lexicon structures – which is a problem for the data
import into LEXUS – and entered data into the wrong fields. Another major
problem is inconsistent spelling due to the fact that special characters such as
macrons cannot simply be typed on the keyboard. This creates an enor-
mous amount of editing for the fieldworker, who has to integrate and merge
the data produced by the speech community into the lexical databases.
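Both kinds of error mentioned here, stray field markers and spellings that differ only in macrons, can be detected automatically before the data are imported. The following Python sketch shows one possible check and was not part of the project workflow: the marker set `KNOWN_MARKERS` is an assumed small subset of MDF markers, the function names are hypothetical, and the example spellings patahi and pātāhi are taken from the comment shown in Figure 4.

```python
import unicodedata

# Assumed subset of MDF field markers; a real project would list the markers
# of its own database type here.
KNOWN_MARKERS = {"lx", "ps", "sn", "ge", "de", "dv", "xv", "xe"}


def unknown_markers(record):
    """Return (line number, marker) pairs for markers not in KNOWN_MARKERS."""
    problems = []
    for number, line in enumerate(record.splitlines(), start=1):
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            marker = parts[0] if parts else ""
            if marker not in KNOWN_MARKERS:
                problems.append((number, marker))
    return problems


def strip_diacritics(word):
    """Remove combining marks such as macrons, so that variants compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def spelling_variants(headwords):
    """Group headwords that differ only in their diacritics."""
    groups = {}
    for word in headwords:
        groups.setdefault(strip_diacritics(word), set()).add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

A check of this kind only flags candidates. Whether two spellings are variants of one headword or two distinct words with contrasting vowel length still has to be decided by a speaker.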
In the light of these problems the development of a simple user interface
for the speech community was discussed between the LEXUS develop-
ers and the speech community. However, it will be very difficult to develop
one simplified user interface for all speech communities because the concept
of simplicity is very much dependent on a personal selection of function-
ality and personal preferences. Each speech community has differing needs
and demands and it is therefore difficult to develop a unique implementa-
tion which takes account of the wishes of particular user groups. As long as
there are no simple standards established by broad training during capacity
building it will be very difficult to realize a general simplified user interface.
4.2.2. Collaboration within the speech community

Even if LEXUS had a simplified user interface and training was provided,
it is unlikely that many interested community members would have all the
skills necessary to produce an online encyclopaedia. The differing skills for
a successful lexicon creation such as the ability to use software and com-
puters with ease as well as a profound knowledge about the culture and lan-
guage are unevenly distributed across the two endangered speech commu-
nities. Thus, cooperation between community members is essential for such
a project: younger community members with good IT skills,17 but a lack of
knowledge about the indigenous language and culture, would have to learn to
work cooperatively together with older community members who know the
language and culture, but cannot operate a computer on their own. Currently
there are many efforts in this direction in the Tuamotuan speech community,
but the young Tuamotuans contribute on a voluntary basis during their holi-
days and weekends and progress is therefore slow.
Furthermore, even if speech community members with IT skills and lin-
guistic and cultural knowledge work hand in hand, the participants would
still need to learn important aspects of lexicographic work as well. There is,
however, the danger that the pressure to conform to formal requirements of
writing monolingual definitions and encyclopaedic articles in a certain format
or style would cause participants to abandon their participation altogether.
17. Young Polynesians tend to be quite IT literate because computer skills have been part of
the school curriculum since IT became available in French Polynesia, i.e. since
the late 1990s.
4.2.3. Can web-based tools replace fieldwork?

Last but not least we want to discuss whether or not indigenous word mean-
ing and cultural knowledge can be documented in a wiki-like set-up outside
fieldwork periods. In general, the field assistants felt that the mutual dialogue
with the researcher and other consultants was necessary for a detailed inves-
tigation and documentation of the language and culture. In their view, the
writing of monolingual definitions has so far been the most difficult task of
our documentation project because the different senses of a headword had
to be precisely analysed in order to write the definitions. It was not a par-
ticular problem for the field assistants that monolingual definitions were a
newly emerging speech genre, as they quickly developed a distinctive style
(cf. above, §3.3). However, they did find it more difficult to analyse word
meanings and express them adequately in their own language. According to
them, their language “lacks” adequate terms to describe colour and appear-
ance. They often combined several colour terms and used French loan words
(e.g. sokora < Fr. ‘chocolat’) and/or made comparisons to other plants (e.g.
’imu tai sokora mea āveāve me he veivei ’au toa ‘brown seaweed with long
strings like needles of the ironwood tree’). Occasionally it was the process of
defining the meaning of a word that made them aware of its distinct senses.
For example, when we first defined the term kofā they explained that it is
a kind of plant stalk which bears a semantic relationship to four other stalk
terms (kata, fā, veivei and kokau). Trying to explain the differences between
kofā and the other stalk terms, they gave examples of the most prominent uses
of kofā. Leaf stalks of coconut, papaya, banana and some giant fern trees are
called kofā. The definition for kofā was then formulated as being e memau
kokōa mei no he tumu mai i ’una o te ’au no te tumu ’a’e he mea to īa man-
amana ‘central rib or stalk of tree leaves which have otherwise no branches’.
At a later stage they used the term kofā again, but this time referring to the rib
of a pinnate leaf of trees with branches. The previous definition of kofā had
to be revised. For the field assistants kofā did not seem to denote two differ-
ent things at first because coconut and fern leaves are fronds which resemble
pinnate leaves. However, when they were asked to make one definition – en-
compassing the characteristics of pinnate leaves and fronds – they came to
the conclusion that kofā has two distinct senses: due to their size, coconut
and fern fronds form one class with papaya and banana leaf stalks rather than
with pinnate leaves and thus the first definition was kept and a second defini-
tion was added for ribs of pinnate leaves.
The definition of plant parts, body-parts or any parts of things from the
natural environment is a difficult task because speakers of non-Indo-European
languages can partition aspects of the natural world (including fauna, flora,
landscape, the human body, etc.) differently than speakers of English or
French do (e.g. pipi ūma ‘middle part of lobster between abdomen and head’).
Some observable phenomena or objects do not have names at all in the target
language (e.g. pūhu’u ‘brown, cloth-like substance around trunk of banana
plant’). The documentation of word meaning and indigenous knowledge can
only be efficiently established in a face-to-face communication because it re-
quires subtle questioning and explanations which go back and forth between
researcher and consultant or field assistant. In a fieldwork situation, one gen-
erally has the possibility to pick up on interesting comments or follow up on
interesting leads. Important details and distinctions can be made by demon-
strating an action, showing the specimen (and its parts) in question in its
natural cultural context, etc. The fieldworker can constantly challenge and
question the definitions with regard to a word’s usage and ask for clarifica-
tions and more precise meanings. In particular the non-native perspective of
the fieldworker and his or her lexical knowledge of the field language as a
second language learner can be fruitful in analysing and refining word mean-
ing.
From a more general perspective, any documentation of indigenous knowl-
edge is a piecemeal process which requires time and patience because many
of the cultural activities have not been practised for a long time. Even the most
knowledgeable consultants will not remember aspects of their traditional cul-
ture instantly. After a session with the fieldworker, the consultant will often
come back to add and revise the documentation of cultural processes and
practices or vocabulary. At a later stage consultants often want to replace the
modern term with the “real” – often obsolete – term for things or activities.
The documentation of indigenous word meaning and knowledge is a close
collaborative effort between the fieldworker and the field assistants and field-
work cannot be replaced by simply making a web-based tool available to
the speech community. For the reasons listed above, the enrichment of the
lexicon with linguistic and cultural knowledge is still best achieved during
fieldwork periods.
5. Lexicography in documentary linguistics
This section discusses the somewhat controversial role of lexicography in
documentary linguistics, as well as the contribution that multimedia dictio-
naries – and in particular a tool such as LEXUS – can make to documentary
linguistics and to language maintenance and revitalization.
Many language documentation projects of endangered languages com-
pile a dictionary or lexical database as a by-product of their documentation
project. It is thought of as an additional documentation device which helps to
structure and annotate primary data (Himmelmann 2006: 10). In the follow-
ing subsection we will argue that a lexical database should not just be a
documentation aid, but an essential part of a language documentation itself.
5.1. The importance of structured lexical data in a language documentation context
The underlying belief in most language documentation initiatives is that good
language documentation should be a “lasting, multipurpose record of a lan-
guage” which contains a large set of samples of observable linguistic be-
haviour, “i.e. examples of how people actually communicate with each other”
(Himmelmann 2006: 7). Dictionaries and grammars do not provide the full
linguistic context on how words and constructions are actually used in ev-
eryday communicative events (e.g. how people actually converse, how they
verbally interact when making, selling or negotiating something, etc.). If a
documentation consists solely of a conventional dictionary plus a grammar,
then the communicative practices of the speech community are not recon-
structable, and the dictionary and grammar by themselves are of restricted
value for researchers and practitioners outside of linguistics, for example ed-
ucators, speech community members, and researchers from other disciplines
(cf. Himmelmann 2006: 18–19).
Despite the generally acknowledged primacy of primary data in docu-
mentary linguistics, there are several good reasons to make dictionaries an
integral part of a language documentation. One reason for this is that the
annotation of words and phrases in multimedia documents (primary data) –
as required for language documentation projects in the DoBeS-programme –
will normally only document one sense of a word in a specific context. It will
neither document the full range of meanings of a word, nor will the annotation
show in which way a word is related to other words of a language in the same
way as dictionaries do. In other words: although the primary data documents
how a language is actually used, the semantic complexity of words and their
semantic networks are neither evident nor easily deducible. In view of this
it is clear that directly accessible structured lexical data adds significantly to
the practical usability of primary data to speech communities, educators and
researchers from scientific disciplines other than linguistics.
Furthermore, a language documentation should not only be data-oriented
and multifunctional, it should ideally provide a potential basis for developing
pedagogical material, thus contributing to language maintenance and revival
as well. A number of other field linguists working with endangered speech
communities regard dictionaries not just as a mere documentation device and
a compilation of structured analysed abstractions of a language, but also as
a key activity for language maintenance and revitalization when prepared
and presented in an accessible format18 for the speech community (Corris
et al. 2000; Kroskrity 2002; Hinton and Weigel 2002: 155; De Korne et al.
2009: 141; Rau et al. 2009; among others); for Hinton and Weigel (2002: 156)
a dictionary is “a repository of tribal identity that can be used for a variety of
purposes even after the language ceases to be spoken”. In many speech com-
munities the compilation of a dictionary is still regarded as one of the most
important products of a language documentation project (Mosel 2006: 68).
The language also acquires a higher status or the status of a “real” language
which approaches that of the dominant local language(s).
5.2. Why multimedia dictionaries and why LEXUS
For endangered speech communities in particular, multimedia dictionaries
are an increasingly popular medium to satisfy both documentation and
educational needs (De Korne et al. 2009: 141). “The rich audio-visual-interactive
input” is “far better than simple text, tape, or audio” and it attempts to approx-
imate immersion education (De Korne et al. 2009: 143; Kroskrity 2002: 190;
Hinton 2001) which is considered to be the best approach to language revi-
talization (Grenoble and Whaley 2006).
18. Cf. below and Corris et al. (2000) for a discussion of the tension between “documentation
dictionaries” and “maintenance dictionaries” or lexical database vs. community dictionary
(Mosel 2011: 340).
Apart from contributing to language maintenance and revival (cf. §5.3 for
more details), multimedia lexicon tools such as LEXUS have the potential
to be excellent resource and research tools for the scientific community. The
archive-linking capacity in LEXUS creates enriched multimedia lexica which
go beyond conventional dictionary making. Word meanings in the dictionary
can be fully contextualised and presented in their socio-cultural contexts by
linking corpus-based examples in the dictionary to the respective annotated
sessions in the archive. Words in annotated sessions, on the other hand, are
embedded in and linked to their dictionary entry which contains the full range
of meanings of a word which is not documented in the annotated session as
such (cf. above, §5.1).
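The two-way linking described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the LEXUS data model: the class names, field names and session identifiers are all hypothetical. It shows how an entry that stores every recorded sense can point to annotated sessions through its examples, and how a reverse index leads from a session back to the full entries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Example:
    """A corpus-based example tied to one annotated archive session."""
    text: str
    session_id: str  # hypothetical archive identifier


@dataclass
class Entry:
    """A lexicon entry holding every recorded sense of a headword."""
    headword: str
    senses: list[str]
    examples: list[Example] = field(default_factory=list)


def sessions_for(entry: Entry) -> set[str]:
    """Dictionary -> archive: sessions in which the headword is attested."""
    return {ex.session_id for ex in entry.examples}


def build_reverse_index(lexicon: list[Entry]) -> dict[str, list[Entry]]:
    """Archive -> dictionary: entries reachable from each annotated session."""
    index: dict[str, list[Entry]] = {}
    for entry in lexicon:
        for session_id in sorted(sessions_for(entry)):
            index.setdefault(session_id, []).append(entry)
    return index


# Invented placeholder data; no real language material is implied.
lexicon = [
    Entry("word_a", ["sense 1", "sense 2", "sense 3"],
          [Example("example sentence one", "session_001")]),
    Entry("word_b", ["sense 1"],
          [Example("example sentence two", "session_001"),
           Example("example sentence three", "session_002")]),
]
reverse_index = build_reverse_index(lexicon)
```

In this sketch a word met in session_001 in only one of its senses still leads, via the reverse index, to an entry listing all three senses, which is the point made in the paragraph above.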
A multimedia lexicon created in LEXUS, with its new relational linking
device in ViCoS, can consist of a dense network of lexical and cultural data
with various media files which are part of a language archive. This has impor-
tant implications not only for speech communities, but also for documentary
linguistics. Metadata and archive structures used for example in the DoBeS
archive can only establish limited connections between sessions, which can
leave many cultural connections between sessions of an archive unexplored.
As ethnographers such as Franchetto (2006: 183) point out, “ethnographical
documentation is a crucial component in any language documentation”, and
this “involves designing digital architectures with multiple and multidirec-
tional links between different sessions and qualitatively different kinds of
information such as lexica, analytical papers, photos, and so on” (Franchetto
2006: 206). LEXUS and ViCoS are tools which would allow the documenta-
tion team to come closer to this ideal. The new relational linking device in
ViCoS, in particular, opens up the possibility to create a dense network of
cultural connections between sessions as well as structured lexical data in a
thematically organised way and therefore provides better and more
user-friendly access to a culture’s network of meanings for both the
scientific community and the speech community.
5.3. Multimedia lexica as a tool for language maintenance and revitalization
The creation of dictionaries of endangered languages is often regarded as
a major activity in language maintenance and revitalization (Corris et al.
2000: 1). Dictionaries – in particular bilingual dictionaries – are important
reference documents for endangered speech communities in revitalization
classes: they are important for developing vocabulary, as a spelling reference
and for the language teachers themselves because they are often not fluent
speakers of the endangered language which is being revitalized (Hinton and
Weigel 2002: 155–156).
Multimedia dictionaries are undoubtedly even more useful tools for revi-
talization than conventional dictionaries due to their multimedia extensions,
provided that they are not overloaded with illustrations and graphics (Frawley,
Hill, and Munro 2002: 10–12). Although multimedia tools are no substitute
for a real interaction with a native speaker, they “can provide an approxi-
mation of the model for speaking that is produced by native speakers in a
face-to-face interaction” and thus are important tools which support language
teaching and language learning at the same time (Kroskrity 2002: 190).
Many projects which report on the use of multimedia tools in language
revitalization (Hinton 2001; Manning, Jansz, and Indurkhya 2001; Kroskrity
2002; Canger 2002; De Korne et al. 2009; Rau et al. 2009; among others)
emphasize their beneficial use. However, little is said about the usability and
type of the multimedia data integrated into the various tools that are
currently available. The multimedia documents we have integrated
into our multimedia lexica are based on recordings of spoken language which
commonly contain speech phenomena such as hesitations, repetitions, speech
errors, break-ups, etc. The transcription and translation of these recordings
have at times posed a challenge. Speech community members have often
commented that these recordings are a good repertoire of their language and
culture, but that they need to be edited before they can be published or used as
pedagogical material in schools. Although these kinds of multimedia docu-
ments are a reflection of how the language is actually used, it is therefore
questionable if they meet language revitalization objectives effectively. A
learner with little knowledge of the language will not learn the language
simply by listening to recordings in an immersive way. At present, it remains
to be investigated how useful unedited, annotated multimedia documents
of naturally spoken language are in revitalizing endangered languages.
However, the learning needs of language learners in endangered speech
communities differ from those of ordinary second language learners because
they mostly have experience in some form with the indigenous language they
want to learn (e.g. passive knowledge by listening to their grandparents, etc.).
Although a tool like LEXUS has the potential to be used for language
maintenance and revitalization, the multimedia lexica we have created in
LEXUS more closely resemble a ‘documentation dictionary’ (Corris et al.
2000: 1), ‘ethnographic dictionary’ (Pawley 2001: 237) or ‘lexical database’
(Mosel 2011: 349) than a ‘maintenance (or learner) dictionary’ (Corris
et al. 2000: 1). According to Corris et al. (2000: 7) documentation dictionar-
ies contain as much information as possible about a lexeme, so lexical entries
can be long and detailed and overcrowded with information (e.g. conven-
tions, symbols, abbreviations) which is often confusing and intimidating for
speech community members who are not familiar with dictionaries. There-
fore documentation dictionaries are less likely to be useful as a tool for revi-
talization. Corris et al. report further that speech community members often
prefer simple word lists, inflected verb forms over citation forms as head-
words, short definitions no longer than two lines and no subentries (2000: 8).
However, as most dictionaries are created with computational tools it is fairly
easy to transform a documentation dictionary into a maintenance dictionary
for the speech community (cf. Mosel’s approach of transforming a documen-
tation dictionary into thematic mini-dictionaries for the speech community
(2011: 348–349)).
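A minimal sketch of such a transformation follows. It assumes a hypothetical entry structure (the field names are invented and do not reflect LEXUS or any particular tool) and applies the community preferences reported by Corris et al.: an inflected form as headword, a short definition, and no subentries. The length limit of 120 characters is an arbitrary stand-in for "no longer than two lines".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DocEntry:
    """A detailed 'documentation dictionary' entry (hypothetical fields)."""
    citation_form: str
    common_inflected_form: str
    definition: str
    semantic_field: str
    subentries: list[str] = field(default_factory=list)
    notes: str = ""


def to_maintenance_entry(entry: DocEntry, max_chars: int = 120) -> dict[str, str]:
    """Reduce a detailed entry to a headword plus a short definition."""
    # Keep only the first sentence of the definition and cap its length.
    short = entry.definition.split(". ")[0].rstrip(".")
    if len(short) > max_chars:
        short = short[:max_chars].rstrip() + "..."
    # Subentries and notes are deliberately dropped.
    return {"headword": entry.common_inflected_form, "definition": short}


def thematic_mini_dictionaries(entries: list[DocEntry]) -> dict[str, list[dict[str, str]]]:
    """Group simplified entries by semantic field, sorted by headword."""
    grouped: dict[str, list[dict[str, str]]] = {}
    for entry in entries:
        grouped.setdefault(entry.semantic_field, []).append(to_maintenance_entry(entry))
    for simplified in grouped.values():
        simplified.sort(key=lambda item: item["headword"])
    return grouped


# Invented placeholder data for illustration only.
entries = [
    DocEntry("form_b", "infl_b",
             "A kind of reef fish. Caught with nets at night.",
             "fish", ["sub 1"], "etymological note"),
    DocEntry("form_a", "infl_a", "A tall forest tree", "trees"),
    DocEntry("form_c", "infl_a2", "A small reef fish", "fish"),
]
minis = thematic_mini_dictionaries(entries)
```

The detailed entries remain untouched, so the documentation dictionary and the thematic mini-dictionaries can be regenerated from the same source whenever the database changes.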
6. Conclusion
Multimedia lexicon tools such as LEXUS are excellent resource and research
tools for the scientific as well as endangered speech communities interested
in the lexical, cultural and ethnographic documentation of endangered and
underdescribed languages. The LEXUS tool, with its new feature of knowl-
edge representation in ViCoS, can create a dense network of lexical as well as
cultural data in ways which are meaningful to different kinds of user groups,
allowing a multilayered organisation of lexical and cultural knowledge. For
members of the speech community the creation of knowledge spaces in Vi-
CoS can be a more natural entry point into a lexicon than a conventional
paper dictionary because data of the lexical database in LEXUS can be se-
lected according to the needs of the users, giving it the potential to become
an important tool in language maintenance and revitalization.
Apart from the possibility of individually organising the documented
knowledge, the contextualisation of word meaning via archive linking is a
major new approach in lexicography.
For documentary linguistics, multimedia lexicon creation, as envisaged
in our MEL-project with LEXUS, is in fact a new form of language archiving
within the structural frame of a lexicon, as it combines data from a language
documentation archive with a structured lexical database. It also goes beyond
language archiving with metadata as practised so far, as well as conventional
dictionary making, as our multimedia lexicon will consist of a dense network
of lexical entries with all sorts of media files, thus presenting the meaning
of words and cultural connections between archive sessions in a new way.
Although the web-based possibilities of LEXUS can actively involve peo-
ple from the speech community in the creation of their lexicon, a community-
wide online participation in a wiki-like framework is difficult to achieve in
highly endangered speech communities and can create many irresolvable
conflicts. Even if speech community members are trained adequately in the
use of the relevant linguistic software and in the basics of lexicography, and
a quality-controlled system of collaborative workspaces and a simplified user
interface are implemented, it is still unlikely that online collaboration can re-
ally replace intensive fieldwork to document endangered languages and cul-
tures.
A more appropriate and informal way of online participation by endan-
gered speech communities is the use of whiteboarding tools which would
resolve many problems of online participation discussed above.
Abbreviations
1, 2, 3    first, second, third person
ART        article
CAUS       causative
LD         locational-directional preposition
PASS       passive
PL         plural
POSS       possessive
RED        reduplication
SG         singular
TAM        tense, aspect, modus
References
Albright, Eric, and John Hatton. 2008. WeSay: A tool for engaging native
speakers in dictionary building. In Documenting and Revitalizing Aus-
tronesian Languages, eds. D. Victoria Rau and Margaret Florey, Language
Documentation & Conservation, Special Publication No. 1, 189–201. Hon-
olulu: University of Hawai’i. http://hdl.handle.net/10125/1368.
Atkins, Beryl T. Sue, and Michael Rundell. 2008. The Oxford Guide to Prac-
tical Lexicography. Oxford: Oxford University Press.
Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head?
In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss
à l’Occasion de Son 60ième Anniversaire, eds. Jean Pouillon and Pierre
Maranda, 1069–1091. The Hague: Mouton.
Cablitz, Gabriele. 2006. Marquesan - A Grammar of Space. Berlin, New
Cablitz, Gabriele. 2010. A field report on a language documentation project
on the Marquesas Islands in French Polynesia. In Endangered Austrone-
sian, Papuan and Australian Aboriginal Languages: Essays on Language
Documentation, Archiving and Revitalization, ed. Gunter Senft, 31–47.
Canberra: Pacific Linguistics.
Cablitz, Gabriele, Fasan Chong, and Edgar Tetahiotupa. 2009. The docu-
mentation of endangered linguistic, lexical and cultural knowledge of the
Marquesan and Tuamotuan languages of French Polynesia. In Proceed-
ings of the 11th Pacific Science Inter-Congress and 2nd Symposium on
French Research in the Pacific. Tahiti: Pacific Science Association.
http://intellagence.eu.com/psi2009/output_directory/cd1/Data/articles/000323.pdf.
Cablitz, Gaby, Jacquelijn Ringersma, and Marc Kemps-Snijders. 2007. Vi-
sualizing endangered indigenous languages of French Polynesia with
LEXUS. In 11th International Conference Information Visualization (IV
’07), 409–414. IEEE Computer Society.
Canger, Una. 2002. An interactive dictionary and text corpus for sixteenth-
and seventeenth-century Nahuatl. In Making Dictionaries – Preserving
Indigenous Languages of the Americas, eds. William Frawley, Kenneth C.
Hill, and Pamela Munro, 195–218. Berkeley, CA: University of California
Press.
Conklin, Harold C. 1962. Lexicographic treatment of folk taxonomies. In
Problems in Lexicography, eds. Fred W. Householder and Sol Saporta,
41–59. Bloomington: Indiana University Research Center in Anthropol-
ogy, Folklore, and Linguistics.
Corris, Miriam, Christopher Manning, Susan Poetsch, and Jane Simpson.
2000. Dictionaries and endangered languages.
http://nlp.stanford.edu/pubs/eldic.ps.
Coward, David F., and Charles E. Grimes. 2000. Making dictionaries. A
guide to lexicography and the multi-dictionary formatter.
http://www.sil.org/computing/shoebox/MDF_2000.pdf.
Crowley, Terry. 2001. The indigenous linguistic response to missionary
authority in the Pacific. Australian Journal of Linguistics 21(2):239–260.
De Korne, Haley, and The Burt Lake Band of Ottawa and Chippewa Indians.
2009. The pedagogical potential of multimedia dictionaries – lessons from
a community dictionary project. In Indigenous Language Revitalization:
Encouragement, Guidance and Lessons Learned, eds. Jon Reyhner and
Louise Lockard, 141–153. Flagstaff, AZ: Northern Arizona University.
Elbert, Samuel H. 1982. Lexical diffusion in Polynesia and the Marquesan-
Hawaiian relationship. Journal of the Polynesian Society 91:499–517.
Franchetto, Bruna. 2006. Ethnography in language documentation. In Essen-
Frawley, William, Kenneth C. Hill, and Pamela Munro. 2002. Making a dic-
tionary. Ten issues. In Making Dictionaries – Preserving Indigenous Lan-
guages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela
Munro, 1–22. Berkeley, CA: University of California Press.
Grenoble, Lenore A., and Lindsay J. Whaley. 2006. Saving Languages: An
Introduction to Language Revitalization. Cambridge: Cambridge Univer-
sity Press.
Haiman, John. 1980. Dictionaries and encyclopedias. Lingua 50:329–357.
Haviland, John. 2006. Documenting lexical knowledge. In Essentials of
Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann,
Herbst, Thomas, and Michael Klotz. 2003. Lexikografie. Paderborn: Schön-
ingh UTB.
Hinton, Leanne. 2001. Language revitalization: An overview. In The Green
Book of Language Revitalization in Practice, eds. Leanne Hinton and
Ken L. Hale, 3–18. San Diego, CA: Academic Press.
Hinton, Leanne, and William F. Weigel. 2002. A dictionary for whom? Ten-
sions between academic and nonacademic functions of bilingual dictio-
naries. In Making Dictionaries – Preserving Indigenous Languages of
the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela Munro,
155–170. Berkeley, CA: University of California Press.
Kroskrity, Paul V. 2002. Language renewal and the technologies of literacy
and postliteracy. In Making Dictionaries – Preserving Indigenous
Languages of the Americas, eds. William Frawley, Kenneth C. Hill, and Pamela
Munro, 171–192. Berkeley, CA: University of California Press.
Manning, Christopher, Kevin Jansz, and Nitin Indurkhya. 2001. Kirrkirr:
Software for browsing and visual exploration of a structured Warlpiri dic-
tionary. Literary and Linguistic Computing 16(1):123–139.
McElvenny, James. 2008. Review of Kirrkirr. Language Documentation &
Conservation 2(1):160–165. http://hdl.handle.net/10125/1796.
Mosel, Ulrike. 2004a. Complex predicates and juxtapositional constructions
in Samoan. In Complex Predicates in Oceanic Languages, eds. Isabelle
Bril and Françoise Ozanne-Rivierre, 263–296. Berlin, New York: Mouton
de Gruyter.
Mosel, Ulrike. 2004b. Dictionary making in endangered speech communi-
ties. In Language Documentation and Description, Volume 2, ed. Peter K.
Mosel, Ulrike. 2004c. Inventing communicative events: Conflicts arising
ter 1(3):3–4.
Mosel, Ulrike. 2011. Lexicography in endangered language communities. In
The Cambridge Handbook of Endangered Languages, eds. Peter K. Austin
and Julia Sallabank, 337–353. Cambridge: Cambridge University Press.
Mühlhäusler, Peter. 1990. ‘Reducing’ Pacific languages to writing. In Ide-
ologies of Language, eds. John E. Joseph and Talbot J. Taylor, 189–205.
London: Routledge.
Mühlhäusler, Peter. 1996. Linguistic Ecology. Language Change and Lin-
guistic Imperialism in the Pacific Region. London, New York: Routledge.
Pawley, Andrew. 1993. A language which defies description by ordinary
means. In The Role of Theory in Language Description, ed. William A.
Foley, 87–129. Berlin, New York: Mouton de Gruyter.
Pawley, Andrew. 2001. Some problems of describing linguistic and ecolog-
ical knowledge. In On Biocultural Diversity: Linking Language, Knowl-
edge, and the Environment, ed. Luisa Maffi, 228–247. Washington, DC:
Smithsonian Institution Press.
Rau, D. Victoria, Meng-Chien Yang, Hui-Huan Ann Chang, and Maa-Neu
Dong. 2009. Online dictionary and ontology building for Austronesian
languages in Taiwan. Language Documentation & Conservation
3(2):192–212. http://hdl.handle.net/10125/4439.
Ringersma, Jacquelijn, and Konrad Rybka. 2009. Lexus manual. Published
first version. http://www.mpi.nl/corpus/manuals/manual-lexus.pdf,
downloaded June 2009.
Svensén, Bo, ed. 1993. Practical Lexicography: Principles and Methods of
Dictionary-making. Oxford: Oxford University Press.
Yang, Meng-Chien, Hsin-Ta Chou, Hui-Shiuan Guo, and Gia-Ping Chen.
2008. On designing the Formosan multimedia word dictionaries by a par-
ticipatory process. In Documenting and Revitalizing Austronesian Lan-
guages, eds. D. Victoria Rau and Margaret Florey, Language Documenta-
tion & Conservation, Special Publication No. 1, 111–133. Honolulu: Uni-
versity of Hawai’i. http://hdl.handle.net/10125/1359.
Zinn, Claus. 2008. Conceptual spaces in ViCoS. In The Semantic Web:
Research and Applications, eds. Sean Bechhofer, Manfred Hauswirth, Jörg
Hoffmann, and Manolis Koubarakis, 890–894. Berlin: Springer.
Zinn, Claus, Gabriele Cablitz, Jacquelijn Ringersma, Marc Kemps-Snijders,
and Peter Wittenburg. 2008. Constructing knowledge spaces from linguis-
tic resources. Presentation at the 18th International Congress of Linguis-
tics, Workshop 12 on Linguistic Studies of Ontology: From Lexical Se-
mantics to Formal Ontologies and Back, July 21–26, 2008, Seoul, Republic
of Korea.
Chapter 11
What does it take to make an ethnographic dictionary?
On the treatment of fish and tree names in dictionaries
of Oceanic languages∗
Andrew Pawley
1. Introduction
Some lexicographers hold that it is impossible to make a good first general
dictionary of any language. In his highly regarded textbook Dictionaries: The
Art and Craft of Lexicography, Sidney Landau writes that
A really new dictionary would be a dreadful piece of work, missing innu-
merable basic words and senses, replete with absurdities and unspeakable er-
rors, studded with biases and interlarded with irrelevant provincialisms. Noah
Webster’s American Dictionary of the English Language of 1828, though
far from being entirely new, was new enough to subscribe to many of these
defects... Fortunately, very few dictionaries are really new, and none of the
general, staff-written, commercial dictionaries published by major dictionary
houses are. (Landau 1984: 35–36)
Landau was clearly thinking of dictionaries of metropolitan languages, where
the lexicographers are native speakers of the target language and have the
advantage of access to a large written corpus, including materials by specialists
in scores of different fields. No doubt he would be even more pessimistic
about the chances of making good first dictionaries of languages spoken by
what I will refer to as ‘traditional’ societies (roughly, those with subsistence
economies). Such languages usually have no written tradition or only a lim-
ited one. The dictionaries are generally written by one or a few scholars who
are not native speakers of the target language, assisted by native-speaking
consultants who have no prior knowledge of lexicography. Sometimes the
compilers themselves have no previous experience of making a dictionary.
For the many languages of traditional societies that are in danger of
disappearing the first dictionary is likely also to be the last.
∗ It is a pleasure to contribute to a volume in honour of Ulrike Mosel, whose contributions to
descriptive and documentary linguistics I greatly admire and with whom I have had many
stimulating discussions about dictionary-making. Thanks are due to Geoffrey Haig, Frank
Lichtenberk and Claudia Wegener for their valuable comments on a draft of this paper and
to Claudia for her eagle-eyed copy-editing.
This paper identifies some of the challenges faced in making a first gen-
eral dictionary of a language of a traditional society and considers how to
minimize the kinds of defects mentioned by Landau. I will begin with some
very general questions:
1. What kind of dictionary is to be the goal? In particular, what kinds of se-
mantic information are to be given about headwords? Is there a principled
basis for drawing a line between definitions and cultural or encyclopaedic
information?
2. What are the best ways of discovering the words of a language? Are major
gaps in the lexical record inevitable?
In discussing these matters I will draw mainly on examples from just one
major domain of the lexicon, animal and plant names, and their treatment
in a sample of some 30 bilingual dictionaries of languages belonging to the
Oceanic branch of Austronesian.1 I choose to focus on this domain for a
number of reasons.
(i) Animals and plants are important, economically and socially, in the lives
of Oceanic communities and names for animal and plant taxa make up
a significant proportion of the lexicon. World-wide comparative studies
show that farming communities living in environments with a rich diver-
sity of plant species generally distinguish between 500 and 1500 plant
taxa (Brown 1985).2 I am not aware of a comparable world-wide survey
concerning animal names but it is probable that most coastal communities
on high islands in the Pacific tropical zone distinguish between 600
and 1000 animal taxa, with fish names being the largest component.3
(ii) Terminologies for fauna and flora highlight important issues in semantic
description. They are typically associated with complex cultural knowledge
and practices and are organised into quite complex taxonomies that
are not isomorphic with those of the biological sciences or with the defining
vernacular language. The task of describing this knowledge underlines
the ethnographic and interdisciplinary nature of dictionary-making.
(iii) In many Oceanic societies traditional knowledge of plants and animals is
diminishing as these societies are drawn into the modern industrial world.
1. Oceanic contains more than 400 languages of Melanesia together with the languages of
the Polynesian Triangle and most of the languages of Micronesia. Most Oceanic languages
have fewer than 10,000 speakers. Their speech communities were, traditionally,
subsistence farmers and, in many cases, also fishers.
2. The largest reliably recorded inventory of vernacular plant names from one traditional
community appears to be about 1,800–2,000, for Hanunóo, of Mindoro, Philippines, who
speak an Austronesian language (Conklin 1954). Puku’i and Elbert (1971: ix) mention that
Mary Neale and Edward Handy list over 2,300 plant names for Hawaiian but a number “are
dubious”. Henderson and Hancock (1988) list more than 800 plant names for Kwara’ae, of
Malaita. Fox’s (1978) dictionary of Arosi (of Makira), gives about 770 names for plants, of
which 194 denote varieties of cultivated plants. For Cèmuhî, a language of northern New
Caledonia (a region with a relatively small flora), Rivierre (1994) lists 557 taxa, of which
178 represent cultivated plants.
2. What kind of general dictionary

A pivotal decision concerns what kind of general dictionary is to be the goal. I
refer specifically to the nature of what may be called the ‘explicatory’ parts of
an entry, those that give semantic and pragmatic information about the head-
word. These parts consist chiefly of (i) the definition or gloss, (ii) additional
information about cultural context and reference, (iii) information about se-
mantic relations to other lexical units (synonyms, antonyms, superordinates,
hyponyms, etc.) and (iv) illustrative examples showing the lexical unit in typ-
ical contexts of use.
In a typology of general dictionaries based on the semantic information
they provide one can distinguish between two idealized extremes: dictionaries
designed as translation aids and dictionaries that we might call ethnographic
(or descriptive). Translation aid dictionaries give headwords in L1 and pro-
vide words or phrases in L2 that are translation equivalents for these.
Ethnographic dictionaries provide explications intended to reflect native
speakers’ (sometimes variable) understandings and use of lexical units. They
seek to give definitions that define complex lexical concepts in terms of
simpler ones, using vernacular vocabulary, and also provide additional
information of types (ii)–(iv) above. (I am not so naïve as to think that any dictionary
can fully achieve these objectives; I am speaking of an ideal.)
3. Speakers of Wayan (a dialect of Western Fijian spoken on Waya Island, Yasawa group, Fiji)
distinguish about 800 names for kinds of animals. These include about 450 fish taxa and
230 names for marine invertebrates. Land invertebrates are of little economic importance
on Waya but upwards of 70 taxa are named. As one moves eastwards across the central
Pacific the number of land bird, mammal and reptile species drops off sharply and Wayan
terminologies for indigenous birds (about 35 names) and reptiles (about 20) are small.
Most bilingual dictionaries are primarily intended to be translation aids.
The best monolingual dictionaries, by contrast, are closer to the ethnographic
type.
Dictionaries of languages of traditional societies are usually bilingual,
with a main part containing headwords in the target language (L1) and glosses
in a major language (L2), in combination with a reverse finder list that allows
the user to look up words in L2 and find relevant entries in L1. The glosses
mainly consist of words or phrases intended to be approximate translation
equivalents. Analytic definitions are given only where no translation equiva-
lent is available.
While conceding the importance of providing translation equivalents
where possible, I believe that scholars compiling first bilingual dictionaries
of languages of traditional societies should aim at rich semantic descriptions,
providing analytic definitions wherever these will give a more precise and
usefully informative account of the meaning of the lexical unit than approx-
imate translation equivalents, and also including supplementary information
of types (ii)–(iv) where this serves the same purpose. See Cablitz (this
volume, Section 3.2) for discussion of the extent to which it is
appropriate to record cultural information in a dictionary.
By giving rich semantic descriptions an ethnographic dictionary on the
one hand provides linguistic and cultural information likely to be valued by
members of the speech community and on the other hand stands as a ref-
erence work for scientific purposes. However, the creation of a good ethno-
graphic dictionary presents huge challenges. Some of these challenges will
be considered in the sections that follow.
3. What are the best ways of discovering the words of a language

A lexicographer making a first dictionary has to discover the words of the
language and their meanings (the basic lexical unit being the sense unit, be-
cause each sense requires its own definition). What are the best ways of going
about this task, so as to avoid “missing innumerable basic words” and making
“unspeakable errors” in definitions or glosses?
A corpus of natural connected discourse is a valuable tool in the search
for words and senses. But for an Oceanic language the lexicographer(s) will
generally have available a very modest-sized corpus. In many documentation
projects the number of words in the annotated corpus is less than 30,000. It
is unrealistic to imagine that one can obtain anything like a near exhaustive
list of lexical units from such a small sample. Most words and senses are
low frequency items and even a ten million word corpus made up of diverse
speech genres is likely to contain only a small proportion of such items. For
example, many names for kinds of animals and plants will only be obtainable
by elicitation.
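One rough way to see how thinly a given corpus samples the lexicon is to count the word types that occur only once (hapax legomena). The sketch below is merely illustrative: the function name is hypothetical, the toy corpus is an invented English sentence rather than data from any Oceanic language, and whitespace tokenisation is an assumption that real annotated corpora would replace with their own word segmentation.

```python
from collections import Counter


def frequency_profile(tokens):
    """Summarise how many word types a corpus contains and how many occur once."""
    counts = Counter(tokens)
    n_types = len(counts)
    # Word types attested exactly once in the corpus.
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    return {
        "tokens": len(tokens),
        "types": n_types,
        "hapaxes": hapaxes,
        "hapax_share": len(hapaxes) / n_types if n_types else 0.0,
    }


# Invented toy corpus, split on whitespace for simplicity.
toy_corpus = "the man saw the fish and the man took the fish to the house".split()
profile = frequency_profile(toy_corpus)
```

A high share of types attested only once suggests that many further words and senses lie just beyond the sample, which is why elicitation has to supplement the corpus.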
An effective elicitation strategy is to pursue words by semantic fields. A
lexicon can be viewed as falling into many different domains of meaning:
kinship, body parts, house parts, canoe parts, bodily conditions and health,
emotions, landscape, spatial relations, weather, seasons, months, numbers,
colours, temperature, texture, sounds, speaking, cutting, breaking, locomo-
tion, transport and so on. Terms belonging to the same semantic field display
more or less systematic semantic relations, i.e. they form a structured termi-
nology. This fact allows a structured approach to the search for words and
the formulation of definitions, one that has been pursued fruitfully by Ulrike
Mosel (Mosel 2011).
Using a dictionary or wordlist from another language to elicit words in
the target language (either for a particular semantic field or more generally)
can yield quick and copious returns. I refer to the practice of asking bilingual
informants to provide translation equivalents in the target language. How-
ever, my experience is that this method will lead to lots of errors unless it
is combined with a good knowledge of the target language, cross-checking
with a range of informants, and observation or eliciting of contextual use. Let
me give an example: In 1967 I undertook a first spell of 11 weeks of field-
work on Waya Island, in the Yasawa group, Fiji, and this yielded an ‘instant’
dictionary of some 5000 roots together with thousands of derived words and
compounds. These were obtained largely by eliciting equivalents of Bauan
(Standard Fijian) items, using the Bauan dictionary, and proceeding alpha-
betically. The process can be compared to using a dictionary of Dutch to
elicit words in English from English-Dutch bilinguals.
Some years later, after I had resumed intensive fieldwork on Waya and
become fluent in the language, I realized that the instant dictionary was rid-
dled with errors – cases where Bauan words had a different meaning or range
of senses than their cognate in Wayan, or where I had included Bauan words
not current or at best rarely used in Wayan, but most often, cases of wrong
definitions resulting from misunderstandings between linguist and informant.
I then spent part of the next several field seasons checking and correcting the
errors. It is hard to say whether, on balance, using the Bauan elicitation list
wasted more time than it saved.
What I learned the hard way is that there is no substitute for being able to
speak the language. Once you are reasonably fluent you can more easily talk
to a wide range of people in the community, argue the point, eavesdrop, ask
complicated questions and elicit suitable illustrative examples, and you are
much better placed to critically evaluate the information that comes in and so
to reduce errors.
While fluency in the field language is the ideal, it may be difficult to
achieve in the space of a three-year project with limited in situ field work. In
some contexts it is possible for the linguist to partially compensate for lack of
fluency in the vernacular by close collaboration with bilingual native speakers.
(See Cablitz (this volume) and Jung and Himmelmann (this volume) for
discussion of the crucial role of native-speaking consultants in linguistic
data-gathering and analysis.)
When eliciting terms for certain fields, such as plants and animals, lex-
icographers need help, in the form of specialist researchers and/or works of
reference. In recent decades some excellent handbooks for the flora and fauna
of certain regions of Oceania have appeared. However, there are traps in using
illustrations in handbooks to elicit terms. In the case of animals and plants, for
example, informants are unable to judge size from illustrations, and pictures
alone usually do not provide information on behaviour and habitat. In some
cases even very expert informants looking at illustrations make wrong equa-
tions. Checking and rechecking with a variety of knowledgeable people and,
if possible, with actual specimens and a variety of illustrations, will reduce
the error rate. But in the case of terms for fauna and flora, my experience is
that unless the lexicographer enlists the aid of specialists, many terms will be
missed and many others will be misidentified or will remain unidentified.[4]
[4] For a fuller account of methodological issues in ethnobiology see Bulmer (1992).
4. Finding a method to assess coverage of fish and tree names in Oceanic dictionaries
Most of the Oceanic dictionaries cited in this paper are the first and only
substantial dictionaries of these languages. Contrary to Landau, it is not my
impression that they “are missing innumerable basic words and senses”, if
“basic” refers to high frequency items. However, it is certainly the case that
many of the dictionaries are missing many words of middle to low frequency.
This becomes obvious if one looks closely at the treatment of particular se-
mantic fields. I turn now to a brief survey of the coverage of fish and tree
names in about 30 Oceanic dictionaries. I will discuss two methods for deter-
mining whether coverage in a particular dictionary is close to exhaustive, and
will touch briefly on a third.
4.1. Fish names

Comparative evidence, drawing on the most thorough collections of terms
for those Oceanic fishing communities where the widest range of fish species
and genera are present, indicates that such communities generally distinguish
between 300 and 500 fish taxa.[5] The number tends to fall as one moves further
east in the Pacific, where the number of fish species and genera diminishes.
[5] Sources (dictionaries and other works) for contemporary languages that figure in the
discussion are listed below. Works that are not general dictionaries are marked with a star
before the name of the author(s); these are mainly survey reports focusing on names of
marine fauna.
New Guinea and Bismarck Archipelago: Kiriwina (Trobriand Islands, PNG): Lawton
1998; Kuanua (New Britain): Lanyon-Orgill 1960; Titan (Admiralty group): *Akimichi
and Sakiyama 1991.
Western Solomons: Cheke Holo: White 1988; Marovo: Hviding 1990, *Hviding 2005;
Roviana: Waterhouse 1949; Takū (Polynesian Outlier, north of Bougainville): Moyle in
press; Teop: Shoffner 1976.
Eastern Solomons: Arosi: Fox 1978; Gela: Fox 1955; *Foale 1998; Lau: Fox 1974; Owa:
Mellow 2009; Toqabaqita: *Henderson and Hancock 1988; Lichtenberk 2008.
Vanuatu and Tikopia: Paamese: Crowley 1992; Kwamera: Lindstrom 1986; Tikopia (Te
Motu Province, Solomon Islands): Firth 1985; Lenakel: Lynch 1977.
New Caledonia: Cèmuhî: Rivierre 1994; Nyelâyu: Ozanne-Rivierre 1998; Paicî: Rivierre
1983; Xârâcùù: Moyse-Faurie and Néchéro-Jorédie 1989.
Fiji: Bauan (Standard Fijian): Capell 1941; Wayan (dialect of Western Fijian): Pawley
and Sayaba 2003; Rotuman: Churchward 1940, Inia et al. 1998.
Micronesia: Carolinean (Saipan, Marianas): Jackson and Marck 1991; Kapingamarangi
(Polynesian Outlier, central Carolines): Lieber and Dikepa 1974; Marshallese: Abo et al.
1976; Palauan: Helfman and Randall 1973, Johannes 1981; Ponapean: Rehg and Sohl
1979; Puluwat: Elbert 1972; Satawalese: *Akimichi 1980; Kiribati: Thaman and Tebano,
n.d.
Polynesian Triangle: Marquesan: *Lavondès 1977; Niuatoputapu: *Dye 1983; Niuean:
Sperlich 1997; Rarotongan: Buse and Taringa 1996; Tongan: Churchward 1959; Uvean:
*Rensch 1983.
The crudest way to assess how thoroughly a dictionary of a fishing
community’s language covers fish names is to compare the dictionary’s
tally with tallies obtained for languages where a thorough survey of
fish names has been carried out. (Of course allowance must be made for re-
gional variations in the diversity of species). Table 1 compares data for nine
languages for which fairly thorough surveys of fish names have been done:
five are from western Melanesia and western Micronesia and four from Fiji
and Polynesia. Here and in later tables a star before a language name indicates
that the source is a study that focuses on fish names, not a general dictionary.
Table 1. Total recorded fish names in some Pacific Islands languages
Western Melanesia and western Micronesia
  *Satawalese   *Gela   *Marovo   *Palauan   *Titan
  400+          368     350+      336        289
Fiji and Polynesia
  Wayan   *Uvean   *Marquesan   *Niuatoputapu
  458     284      260          210
Table 2 gives the tallies obtained by going through a sample of substantial
Oceanic dictionaries. In some cases their finder lists treat ‘sharks’, ‘rays’ and
‘eels’ separately from ‘fish’, i.e. by ‘fish’ they mean only ‘typical fish’. In
such cases I have combined the lists so as to include ‘sharks’, etc. under
‘fish’.
Table 2. Total fish taxa in some substantial Oceanic dictionaries
Western Melanesia
  Kuanua   Arosi   Owa   Lau   Toqabaqita   Gela   Cheke Holo   Roviana
  301      214     203   149   143          136    133          80
Fiji, New Caledonia and Vanuatu
  Bauan   Cèmuhî   Paicî   Nyelâyu   Xârâcùù   Lenakel   Kwamera
  200     163      147     123       121       41        15
Micronesia
  Marshallese   Ponapean   Carolinean   Kiribati
  231           157        154          151
Polynesia, including the Outliers Takū and Kapingamarangi
  Takū   Kapingamarangi   Tikopia   Niuean   Tongan   Rarotongan
  280    262              220       163      158      124
It can be seen that the totals vary greatly. Comparison with Table 1 suggests
that most of these inventories are missing between 100 and 250 of the taxa
distinguished by the speech community. This is demonstrably the case for
Gela: Fox’s (1955) very substantial dictionary of Gela contains only 136 fish
names, more than 200 fewer than the 368 reported in the survey of Gela
fishing done by a marine biologist (Foale 1998). Similarly, the large Bauan
(Standard Fijian) dictionary contains about 200 fish names, less than half the
number recorded for Wayan, another Fijian language.
For fishing communities in Melanesia and western Micronesia, where the
fish fauna are relatively rich, lists numbering below 300 are probably far from
complete. In the case of communities of the Polynesian Triangle and eastern
Micronesia, we can expect somewhat lower totals.
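The tally comparison just described is simple arithmetic, and it can be sketched in a few lines of Python. The sketch below is an illustration only and is not part of the original study: the function name `shortfall` is hypothetical, and the only figures used are those quoted in the text above.

```python
# Illustrative sketch: compare a dictionary's fish-name tally with a
# reference tally obtained from a thorough survey.

def shortfall(dictionary_tally: int, reference_tally: int) -> int:
    """Names in the reference tally that the dictionary lacks.

    Simplifying assumption: the dictionary's names are a subset of the
    names in the reference list, so the gap is a plain difference.
    """
    return max(reference_tally - dictionary_tally, 0)

# Gela: Fox (1955) dictionary tally vs. Foale (1998) survey tally.
gela_gap = shortfall(136, 368)

# Bauan dictionary tally as a fraction of the Wayan tally. Wayan serves
# only as a rough yardstick here, since it is a different Fijian language.
bauan_ratio = 200 / 458

print(gela_gap)               # 232
print(round(bauan_ratio, 2))  # 0.44
```

A gap computed this way is only a lower bound on what is missing if the dictionary also contains names that the reference survey lacks.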
A second method for detecting gaps is to compare the breakdown of
names for taxa into uninomials and binomials. Uninomials typically apply
to taxa belonging to the level of folk generic, binomials to folk specifics (see
Section 6). In the best documented Oceanic fish taxonomies uninomials usu-
ally amount to between 70 and 80 percent of total taxa, and binomials to be-
tween 20 and 30 percent. Percentages that deviate markedly from this range
stand out as suspiciously anomalous. Table 3 gives the percentages for 14
Oceanic languages recorded in dictionaries (unmarked) and surveys of fish
names (marked *).
Table 3. Uninomial and binomial fish names in some Oceanic languages
              Wayan   *Satawalese   *Titan   *Gela   *Marovo
uninomials    352     278           279      252     194
binomials     106     122           8        100     122
% binomials   23      30            3        28      38

              Uvean   *Teop   *Kapingamarangi   Tikopia   Niuatoputapu
uninomials    180     168     148               215       138
binomials     104     29      114               5         71
% binomials   36      15      43                2         34

              Cèmuhî   Kiribati   Cheke Holo   Xârâcùù
uninomials    132      132        123          116
binomials     31       19         10           5
% binomials   24       14         7            4
At one extreme, only 5 out of 220 recorded Tikopia fish names (Firth 1985)
are binomials. Similarly, the survey of Titan (Akimichi and Sakiyama 1991)
yielded only eight binomials among 287 listed fish names, whereas the survey
of Satawalese (Akimichi 1980) yielded 122 binomials out of 400 names. We
conclude that, in these cases, the Tikopia and Titan lists are probably missing
between 60 and 120 binomials. At the other extreme, in the case of Kapinga-
marangi, there are 43% binomials (114 out of 262) and we can conclude that
the list of binomials probably includes some ad hoc descriptive forms.
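The second method can likewise be expressed as a short calculation. The Python sketch below is an illustration added here, not the author's procedure: the function names and the wording of the verdicts are assumptions, the 20 to 30 percent band is the one stated above, and the counts are those given in Table 3 and in the discussion.

```python
# Illustrative sketch: flag fish-name lists whose share of binomials falls
# outside the 20-30 percent band reported for well-documented taxonomies.

EXPECTED_LOW, EXPECTED_HIGH = 20.0, 30.0  # percent binomials, from the text

def percent_binomials(uninomials: int, binomials: int) -> float:
    """Binomials as a percentage of all recorded names."""
    return 100.0 * binomials / (uninomials + binomials)

def assess(uninomials: int, binomials: int) -> str:
    """Rough verdict; the verdict wording is an assumption of this sketch."""
    pct = percent_binomials(uninomials, binomials)
    if pct < EXPECTED_LOW:
        return "binomials probably under-recorded"
    if pct > EXPECTED_HIGH:
        return "may include ad hoc descriptive forms"
    return "within expected range"

# (uninomials, binomials) as given in Table 3 and the discussion above.
counts = {
    "Wayan": (352, 106),
    "Tikopia": (215, 5),
    "Titan": (279, 8),
    "Kapingamarangi": (148, 114),
}

for language, (uni, bi) in counts.items():
    pct = percent_binomials(uni, bi)
    print(f"{language}: {pct:.1f}% binomials, {assess(uni, bi)}")
```

The band is best read as a rough guide rather than a sharp cut-off: a list only slightly outside it would be flagged by this sketch but need not be seriously deficient.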
Most of the larger dictionaries do poorly in their coverage of
binomials. Their coverage of uninomials is much better, but even there it is
clear that, in many cases, coverage is incomplete.
The third method is more fine-grained and requires painstaking family-
by-family comparison of fish names. The first steps are to note how many
names representing each of the major families of fish are present in the best
documented languages, and then to obtain an average and a range of variation
for this sample. The next step is to see how dictionary tallies compare with
these averages and ranges, looking for cases where there is a striking shortfall
in the number of taxa recorded for particular families. This method makes it
possible to locate quite precisely some of the gaps in coverage but for reasons
of space I will not tabulate results here (see Pawley 2011a for details).
It has been suggested to me that two Oceanic fishing communities, ex-
ploiting a similar marine environment, may differ greatly in the extent to
which they have elaborated their lexicon of uninomial and/or binomial fish
names. That is to say, large differences in the numbers of fish taxa between
two communities with similar means of subsistence may not always be due
to oversights on the part of the lexicographer, but may instead reflect genuine
differences in the vernacular lexicon.
This possibility cannot be ruled out – our data include only a few well-
controlled case studies (such as the Gela one) where the dictionary coverage
can be compared with that of a thorough independent survey. However, we
are looking for general trends and the general trends are clear. It is strik-
ing that in every case where a careful survey has been done of an Oceanic
fishing community’s lexicon of fish names, the number of fish names distin-
guished has been above 200 and generally in the 300-450 range. Similar find-
ings have been reported for fishing communities in Indonesia (Quick 2010;
Taylor 1990). The fact that most dictionaries of Oceanic languages spoken by
fishing communities report much lower figures cannot plausibly be attributed
to random variation in the vernacular lexicons.
A recurrent problem in eliciting is that of distinguishing between names
that are genuine binomials and ad hoc descriptions. When asked to discrim-
inate between specimens informants may offer descriptive names, such as
‘spotted X’, ‘spiky X’, ‘red X’, which may be descriptions rather than con-
ventional names. For this reason it is important to make independent checks
with a range of informants. I had this problem in eliciting Wayan fish names
using illustrations. An overly helpful expert informant provided names for
about 30 kinds of butterflyfish. Cross-checking indicated that only a very few
are conventional names. Another potential source of confusion concerns bi-
nomials which are conventional but denote functional categories, like ‘fish of
the reef’ vs ‘fish of the deep sea’, which cross-cut the core categories in the
taxonomic hierarchy.
4.2. Tree names

The first two of these methods can readily be applied to plant names, again
making allowances for regional variations in diversity of flora. Table 4 com-
pares total names of plants that English speakers classify as ‘trees’ (including
palms and pandans) in dictionaries of 15 Oceanic languages. Eight of these
are spoken on large high islands of western Melanesia (six of them in the
Solomons), where the flora are relatively diverse, and seven are spoken in
southern Melanesia (New Caledonia and Vanuatu) and the central Pacific,
where the diversity is less.
Table 4. Total names of trees in some Oceanic dictionaries
Western Melanesia
  Arosi   Kuanua   Lau   Cheke Holo   Roviana   Gela   Kiriwina   Toqabaqita
  339     295      204   188          180       179    170        157
Southern Melanesia and the central Pacific
  Wayan   Niuean   Cèmuhî   Paamese   Kwamera   Lenakel   Rotuman
  222     160      156      84        61        58        71
A comparison of totals for the six Solomon Islands languages is revealing.
The tally of 339 tree names in Fox’s dictionary of Arosi (spoken on Makira)
is the highest for this region. By this measure, the totals of 204, 179 and
157 in the dictionaries of Lau (Malaita), Gela (Florida group) and Toqabaqita
(Malaita), are below expectations, as are the tallies of 188 for Cheke Holo
(Santa Isabel) and Roviana (180). Kuanua, which scores almost as high as
Arosi, is spoken much further west, in New Britain.
The tree flora of southern Melanesia and the central Pacific is a good
deal less diverse than that of western Melanesia. The total of 222 tree taxa
for Wayan Fijian, based on systematic collecting by a botanist (Gardner and
Pawley 2006), is probably close to exhaustive for this language and provides
a rough yardstick. It suggests that the dictionaries of Paamese, Kwamera,
Lenakel (all Vanuatu) and Rotuman, which record between 58 and 84 tree
names, are probably missing upwards of 100 taxa.
4.3. Conclusion
Most Oceanic dictionaries show major shortfalls in their coverage of fish and
tree names. The broad conclusion we can draw is that getting all the names
for indigenous animal and plant taxa distinguished by a typical Oceanic com-
munity is a formidable task and that without the help of specialists, lexicog-
raphers are likely to miss a high proportion of them.
5. Problems in defining the meanings of lexical units

Getting names is only the first step in the lexicographical treatment of animal
and plant taxa. It remains to provide accurate identifications and definitions
of each taxon. Each vernacular name should be matched with the class of ref-
erents it is applied to – whether this be a particular species, a genus, a growth
stage, male vs female animal, a grouping based on habitat, edibility or size,
and so on. In the case of folk specifics and folk generics (see below), ideally
this should include accurate scientific identifications with Latin names.
The pioneering lexicographers in the Pacific Islands seldom had access to
specialists or to natural history handbooks on the flora and fauna of their
regions to help them make scientific identifications. Dictionary searches for
cognate sets for animal and plant names often yield results like the following.
Proto Oceanic *rakumu, which referred to a kind of crab, is quite widely re-
flected in contemporary languages. However, the sources for these languages
give such imprecise identifications that it is impossible to determine its range
of reference in Proto Oceanic.
Proto Oceanic *rakumu ‘kind of large crab, probably a land crab’
Admiralties:      Lou          roum          ‘kind of crab’
Papuan Tip:       Kiriwina     lakum         ‘small mud crab’
Papuan Tip:       Muyuw        lakum         ‘crab’
Papuan Tip:       Gapapaiwa    rakum         ‘kind of crab’
North N. Guinea:  Manam        rakum         ‘k. big, red crab’
Meso-Melanesian:  Tabar        raku          ‘crab’
Meso-Melanesian:  Roviana      garumu        ‘Cardisoma carnifex, white land crab’ (metathesis)
SE Solomonic:     Bugotu       ragomu        ‘a crab’
Micronesian:      Ponapean     rokumw        ‘species of small land crab’
Micronesian:      Puluwat      róókum        ‘large land crab’
                               róókum(ihát)  ‘common black sea crabs’
Micronesian:      Woleaian     ragumw u      ‘crab, generic’
N. C. Vanuatu:    S.E. Ambrym  oum           ‘land crab’
S. Vanuatu:       Sye          (n)roGum      ‘kind of crab’
It should be noted that this cognate set stands in sharp contrast to a number
of other cases where a Proto Oceanic term for an animal or plant category
can be given a highly specific identification (many such examples can be
found in Ross, Pawley, and Osmond 2008, 2011). The latter cases are those
where dictionary sources for a subset of contemporary languages belonging
to diverse subgroups yield cognates that agree on the identification.
Today’s lexicographers can often call on a range of quite good handbooks
for domains of flora and fauna to take to the field. And they are usually better
placed than the pioneering generations were to have specimens identified
by specialists. For instance, exemplary work has been done on the lexicons
of New Caledonian languages by scholars from CNRS/LACITO (Centre
national de la recherche scientifique/Langues et civilisations à tradition
orale), who have been able to draw on a reservoir of specialist help and
a good range of handbooks. A number of such dictionaries, e.g. Ozanne-
Rivierre (1984, 1998), Rivierre (1983, 1994), Moyse-Faurie and Néchéro-
Jorédie (1989), provide scientific determinations to family, genus or species
level for around 90 percent of recorded plant and animal names.
Of course, it should not be assumed that there will be a one-to-one corre-
spondence between folk taxa and a particular species, genus or family recog-
nised by biologists. Sometimes several species will be subsumed in a folk
taxon. At other times two or more folk taxa will correspond to a single
species, with different taxa corresponding, e.g. to different stages of matu-
ration or to adult males vs females and juvenile males. This is commonly the
case, for instance, with certain fish species, where up to five different taxa
corresponding to different maturational stages may be distinguished.
Making scientific IDs is the primary but not the only reason dictionary-
makers need the help of specialists in natural history. A second reason is that
specialists are better qualified to describe the key characteristics of species
and their relationships to other species and to investigate their practical uses.
A third, already noted above, is that without specialists it is likely that names
for many of the less important taxa will be missed. Conversely, a survey con-
ducted solely by biologists with almost no knowledge of the target language
is likely to yield an inaccurate account of vernacular terms and taxonomies.
The ideal is a close collaboration of linguists and biologists.
6. Drawing a line between definitions and cultural or encyclopaedic information
In explicating the meaning of a lexical unit, are there principled grounds for
drawing a line between linguistic and encyclopaedic knowledge, or between
definitions and cultural information? If so, what are the distinguishing crite-
ria? If not, what practical criteria should be invoked? These questions haunt
every writer of descriptive or ethnographic dictionaries (see Cablitz, this vol-
ume).
What is the meaning of sheep to the community of English speakers? One
can say that sheep are four-legged animals reared for their wool and meat.
One can add that they eat grass and other leafy plants, that they come in many
breeds, noted for different characteristics, that their skin is used in making
leather, that they belong to the genus Ovis and are closely related to goats,
that they are found wild in the mountains of Asia, Africa, Europe and North
America, that they live in herds (called ‘flocks’), that they are proverbially
easily led, that they make bleating sounds, that they normally bear a single
young, that traditionally they were tended by shepherds, assisted by specially
bred sheepdogs, and so on. The list of characteristics is indefinitely large and
there is no obvious boundary between characteristics central and peripheral
to the meaning of sheep.
I agree with those who argue that we lack a principled basis for drawing
a line between lexical and encyclopaedic knowledge.[6] The question ‘What is
the meaning of sheep?’ is probably wrong-headed. When formulating a dic-
tionary explication of sheep, it makes more sense to ask ‘Of the many charac-
teristics of sheep known to English speakers, which are the most salient?’ and
‘For the various users of the dictionary, what is likely to be the most useful
information to include?’
The following plant name entries from the Wayan dictionary are instructive
in two respects: (1) they provide fairly rich descriptions, mainly due
to the work of a professional botanist; (2) they show the difficulty of draw-
ing the line between information that belongs in a dictionary and information
better left to an encyclopaedia or to technical botanical works.
DALI₁, n. Tree (kai) taxon: Cajanus cajan (Leguminosae), pigeon pea.
(Name from Hindi dal.) Introduced and sporadically cultivated in Waya
for its pea-like seeds. Bushy shrub, occasionally cultivated for its edible
seeds. Stems ribbed, leaves 3-foliolate, narrowly oval, pointed, pea-flowers
red-brown outside, yellow and crimson-striped within, pods straight, about
6cm long. Ei kani vinā na dali magā ei vākari. Cajan seeds are good eating
if they are curried.
DAMANU, n. Tree (kai) taxon: Calophyllum vitiense (Guttiferae). Straight-
trunked forest tree of higher altitude, leaves narrow-oval, shiny grey and
brown, flowers white. Timber valued for house posts, rafters, furniture,
boats.
DIGI₂, n. Fern (kai) taxon: generic, includes at least the following two
medium-sized terrestrial ferns: (1) Nephrolepis biserrata (Davalliaceae).
Ladder fern. Locally common in open mid-altitude forest. (2) Sphaeroste-
phanos invisus (Thelypteridaceae). Common on dry grassy hillsides. The
two vascular strands in the leaf-stalk were once used in necklace-making.
Subtaxa include borete, mata, vativati.
DILO, n. Tree (kai) taxon: Calophyllum inophyllum (Guttiferae). Common
shore tree, sticky cream-coloured sap, spikes of white day-fragrant flowers,
round green-purple spongy fruit with a hard stone. Timber used especially
for boat ribs. An infusion of the leaves is applied to sore eyes. Oil of the
seed used in massage (and formerly to treat leprosy).
[6] Haiman (1980) offers detailed arguments in support of this position.
The descriptions provided for these taxa by the botanist working on Wayan
were in most cases a good deal more detailed than those shown in the dictionary
entries. As the lexicographer in this case, I adopted the rule of thumb of
omitting details of plant morphology likely to be of interest only to a botanist
while retaining information relating to general appearance, habitat and uses.
But it can be seen that the entries are not entirely consistent in this respect.
In the entry for DALI, for instance, more details of leaf forms and flowers
(“Stems ribbed, leaves 3-foliolate, narrowly oval, pointed, pea-flowers red-
brown outside, yellow and crimson-striped within, pods straight, about 6cm
long”) are retained than in the entry for DAMANU.
7. Problems in defining the semantic range and semantic relations of major generics: the case of life forms for fish and trees
To analyse and describe the semantic structure of the lexicon, lexicographers
need to have a theory of how lexical systems are structured and a metalan-
guage for describing the relevant categories and relations.
Cross-linguistic comparisons show that plant and animal terms typically
participate in several systems of semantic classification (Berlin 1992; Con-
klin 1962). Two such systems (not always completely distinct) are kinds of
taxonomies, in which categories are placed in contrastive and hierarchically
organised relations. These are (1) taxonomies based on natural kinds,
distinguished by morphological and behavioural characteristics, and (2) functional
taxonomies, where categories are based on use and context. Other systems of
classification include (3) partonomies, i.e. part-whole relations, such as parts
of a plant or animal, and (4) maturational or growth stages.
Let us consider just folk taxonomies of type (1). Over the years various
frameworks have been proposed for describing such systems. For example,
for a four-level taxonomy, as, say, the series animal > dog > terrier > fox
terrier, one can simply refer to the most inclusive level as primary, the next
as secondary, then tertiary and quaternary, respectively. A more ambitious
framework is that advocated by Brent Berlin and his associates. The follow-
ing is a summary of the system of rank distinctions in folk taxonomies and
generalisations about nomenclature given in Berlin (1992).
Life-form. A life-form is a high-order generic taxon, one that (i) distin-
guishes a highly distinctive morphotype which (ii) is not included in any other
taxon other than kingdom, (iii) includes many (sometimes hundreds) of lower
order taxa which share the characteristic morphology of the type, and (iv) is
named by a uninomial. Examples of English life-form taxa are tree, vine, fish,
bird and insect.
Folk generic. A folk generic (or folk genus) is a ‘natural’ category in two
senses, one perceptual, the other linguistic. First, the members of this cat-
egory are usually marked off from non-members by multiple characters of
morphology and behaviour or ecological adaptation that will be evident to
any close observer. English examples are mullet, trout, oak, pine, spider, ant,
centipede. Second, the folk generic (rather than the life form) is the usual way
of referring to particular plants or animals if their identity is known. Depend-
ing on various factors, a folk genus may correspond to a single species in the
biologist’s taxonomy, to a number of species or a genus, or to a number of
genera or families. In addition, the category is named by a uninomial rather than a
binomial.
Folk specific. Some folk generics divide into a number of folk specifics (or
folk species), usually just a few taxa that contrast in a limited number of features
with other members of the generic. Except for domesticated animals and
plants, folk specifics are usually the lowest-level taxa distinguished. Berlin
(1992) says that folk specific names are usually compounds, consisting of the
generic name plus a modifier, e.g. mako shark vs hammerhead shark, trap-
door spider vs huntsman spider. However, Bulmer (1970) finds that a fair
number of folk specific names for animals, among the Kalam, are primary
lexemes.
Berlin names three other ranks that are sometimes distinguished in folk tax-
onomies, but these need not concern us here.
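The three ranks just summarised can be modelled as a small tree of named taxa. The Python sketch below is an illustration only, not Berlin's formalism: the class `Taxon` and its methods are hypothetical, and treating shark as a folk generic under the life-form fish is an assumption suggested by, but not stated in, the examples above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Taxon:
    name: str
    rank: str                         # "life-form", "folk generic" or "folk specific"
    parent: Optional["Taxon"] = None  # next more inclusive taxon, if any

    def lineage(self) -> List[str]:
        """Names from the most inclusive taxon down to this one."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]

    def is_compound_of_parent(self) -> bool:
        """True if the name is the parent's name preceded by a modifier."""
        return self.parent is not None and self.name.endswith(" " + self.parent.name)

# Assumed placement: fish (life-form) > shark (folk generic) > folk specifics.
fish = Taxon("fish", "life-form")
shark = Taxon("shark", "folk generic", fish)
mako = Taxon("mako shark", "folk specific", shark)
hammerhead = Taxon("hammerhead shark", "folk specific", shark)

print(mako.lineage())                  # ['fish', 'shark', 'mako shark']
print(mako.is_compound_of_parent())    # True
print(shark.is_compound_of_parent())   # False
```

The compound test captures only the naming pattern that Berlin describes as usual. Folk specifics named by primary lexemes, of the kind Bulmer reports for Kalam, would not be detected by it, so rank has to be recorded independently of the form of the name.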
Determining the scope of life form taxa is often tricky. While informants
agree on the focal membership, they may disagree about peripheral cases.
For example, among Oceanic languages there is considerable variation in the
boundaries of the general term for fish. Generally the term can be applied to
various aquatic creatures other than fish: typically it is extended to whales
and dolphins; in some languages, to turtles and crocodiles; in some, also to
octopus and squid; and in some cases most or all water-dwelling animals
are included (Pawley 2011b). As Table 5 indicates, few Oceanic dictionaries
make much of an effort to specify the range of reference of the generic for
fish and fish-like animals. Definitions can be roughly scaled according to their
level of informativeness.
Table 5. Definitions of the generic for fish and fish-like animals in some Oceanic
dictionaries
Group A. Definitions that give ‘fish’ or ‘fish, sea creature’ without further
definition
Arosi: i’a, a fish.
Cheke Holo: sasa, fish (generic).
Marshallese: ek, fish.
Mota: iga, a fish.
Owa: aiga, fish, sea creature.
Rarotongan: ika, fish.
Roviana: igana, the generic name for fish.
Group B. Definitions that give a partial but rather imprecise listing of members
Puluwat: yiik, fish (including porpoises and whales but not squid).
Samoan: i’a, the general name for fishes, except the bonito (Thymnus) and
shellfish (Mollusca and Crustacea). On Tutuila the bonito is called
i’a. (Pratt 1911)
Takū: ika, generic term for fish, including marine animals, turtles and two
species of clam.
Tikopia: ika, generic category with primary reference to fish, but including
allied creatures, e.g. turtle, cetaceans. [Examples also refer to crabs.]
Tongan: ika, fish. Also turtles (fonu) and whales (tofua’a) but not eels,
cuttle-fish, or jelly-fish.
Group C. Definitions that try to be comprehensive
Gela: iga, a creature of the sea, fish, mollusc, crayfish, whale, squid,
sea anemone, etc.
Paamese: mesau, 1. fish. 2 any sea dweller (including also turtles,
dolphins, shellfish, etc.).
Toqabaqita: iqa, 1. fish (generic term). 2. Also denotes a superordinate
category that includes fish, whales, dolphins, turtles, dugongs.
Wayan: ika, 1. Typical fish, true fish, syn. ika dū. This category includes
all gill-breathing fish with fins, including sharks, rays and eels.
2. Fish and certain fish-like creatures. A generic which includes
all true fish (see sense 1) and dolphins. Most informants also
regard turtles (ikabula) as ika. Some also include octopus
(sulua) and squid (suluanū). Universally excluded are
crustaceans (crabs, lobsters, etc.), molluscs with shells, sea
cucumbers, sea urchins and jellyfish. (There follows a full list of
names of ika.)
Group A definitions show no awareness of the fact that ‘fish’ is a highly prob-
lematic definition. Group B definitions make an effort to give an exhaustive
account of the membership but the definitions tend to be vague or incomplete.
Group C definitions come closest to being exhaustive, but largely avoid the
problem of disagreement among informants and fuzziness at the boundaries
of generic categories.
The treatment of ‘tree’ terms shows a similar variation in quality. All
Oceanic languages have a ‘tree’ life form but they vary as to which plants are
included (Evans 2008). For instance, some include palms and pandans, others do
not; some include tree-ferns and cycads, others do not. In some languages
the term for ‘tree’ is used, in certain specialized contexts, as a generic for all
plants.
Table 6. Definitions of generic term for ‘tree’ in a sample of Oceanic languages
Group A. Definitions that give ‘tree’ or ‘tree, plant’, without further
definition
Puluwat: yirå, tree.
Owa: apenagai, tree.
Cheke Holo: gaiiju, tree (generic).
Roviana: huda, tree.
Mota: tangae, a tree.
Rarotongan: rākau, tree, bush, plant.
Samoan: lā’au, a tree, a plant. (Pratt 1911)
Tongan: ’akau, tree, plant.
Paamese: āi, tree.
Marshallese: kaan, tree. [Examples refer to coconut and pandanus as well as
branching trees.]
Group B. Definitions that give a partial but generally incomplete listing of tree-
like members
Takū: lākau, generic which includes all trees, shrubs, plants, creepers and
sticks.
Tikopia: rakau, generic term for member of vegetable kingdom, usually
woody plant, including shrub, herb, but not vegetable or grass.
Niuean: akau, tree (refers to large or substantial trees; shrubs or small trees
are generally known as lākau).
Arosi: ’ai, a tree or plant having stem and branches; not used of a fern,
cycad, sago palm, coconut, etc. but used of small plants, e.g. balsam.
It is not therefore the true equivalent of English “tree”.
Gela: gai, a branching plant, shrub or tree, i.e. balsam, croton, and
banyan are all gai, but not a palm or coconut (gaimane).
Toqabaqita: ai, tree (does not include palm trees, cycads, tree-ferns, bamboos,
banana trees).
Wayan: kai, generic for trees and shrubs, and occasionally low bushy plants.
Includes palms and pandans. Used in certain compounds as a
generic for all plants.
Group A definitions show little or no awareness that the English gloss ‘tree’ is
not a sufficiently precise definition of the vernacular generic. Group B defini-
tions recognize that the range of reference of the generic is not satisfactorily
captured by such general terms as ‘tree’ and ‘plant’. However, their attempts
to define by specifying diagnostic characters and/or by listing members are
of varying adequacy.
8. Concluding remarks
Finally, let us return to the question: Is it possible to make a first general
dictionary of a language that is not ‘dreadful’? Our examination of some 30
Oceanic dictionaries has been largely confined to aspects of their treatment
of terms for fish and trees – too narrow a basis for a general assessment.
It must be conceded that only a few of the dictionaries score well in their
coverage of the fish and tree lexicons. Most do poorly, both in terms of ex-
haustiveness of the wordlists and quality of the expository information. This
does not necessarily make them dreadful works overall. It is my impression
that many of the major Oceanic dictionaries do passably well in their treat-
ment of a number of major lexical domains. However, my purpose in this
paper has been to use the example of fish and tree lexicons to highlight some
of the challenges of method and scale that face anyone intending to do a first
general dictionary of a language.
Trial and error in my own attempts at dictionary-making has taught me
that such a project requires considerable expertise, or the help of experts, in
various specialised fields of knowledge in addition to botany and zoology.
Experience has also taught me that the more fluent the lexicographer is in
the target language, and the more fluent native speaker assistants are in the
defining language, the easier it is to achieve accuracy, especially in respect of
definitions and sense discriminations. Except that it is never easy. The hard
truth is that making an accurate and close-to-comprehensive general dictio-
nary of the language of a traditional society needs many thousands of hours
of labour, i.e. the equivalent of several full-time years on the job.7 In most
documentation contexts, which operate within a 3–4 year framework, with
many competing objectives, it would be unrealistic to try to compile a large
general dictionary. These practical considerations, among others, have led Ul-
rike Mosel (2011) to advocate doing ‘thematic dictionaries’, mini-dictionaries
each of which treats a single culturally important domain, as an alternative to
compiling general dictionaries. This has the virtue of allowing the researcher
to produce in quite a short time one or more small reference works of interest
both to members of the speech community and to academics.
7. Anyone thinking of embarking on an ethnographic dictionary should note these words of
warning offered by a French colleague and accomplished lexicographer: “La confection
d’un dictionnaire encyclopédique est un exercice dangereux, qui peut induire une
régression à un stade libidinal archaïque.” (J-C. Rivierre, p.c.) [The making of an
encyclopaedic dictionary is a dangerous exercise, which can induce a regression to a
primal libidinal stage.]
References
Abo, Takaji, Byron Bender, Alfred Capelle, and Tony Debrum. 1976.
Marshallese–English Dictionary. Honolulu: University of Hawai’i Press.
Akimichi, Tomoya. 1980. Bad fish or good fish: The ethnoichthyology of the
Satawalese (Central Carolines, Micronesia). Museum of Ethnology, Osaka.
Bulletin of the National Museum of Ethnology 6(1):66–133.
Akimichi, Tomoya, and Osamu Sakiyama. 1991. Manus fish names. Bulletin
of the National Museum of Ethnology, Tokyo 16(1):1–29.
Berlin, Brent. 1992. Ethnobiological Classification: Principles of
Categorization of Plants and Animals in Traditional Societies. Princeton, NJ:
Princeton University Press.
Brown, Cecil H. 1985. Mode of subsistence and folk biological taxonomy.
Current Anthropology 26:43–62.
Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head?
In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss
à l’Occasion de Son 60ième Anniversaire, eds. Jean Pouillon and Pierre
Maranda, 1069–1091. The Hague: Mouton.
Bulmer, Ralph N. H. 1992. Field methods in ethno-zoology with special
reference to the New Guinea Highlands. In Studying and Describing
Unwritten Languages, Questionnaire 12, Ethnozoology, eds. Luc Bouquiaux
and Jacqueline M. C. Thomas, 526–556. Dallas: Summer Institute of
Linguistics.
Buse, Jasper, and Raututi Taringa. 1996. Cook Islands Maori Dictionary.
(Eds. Bruce Biggs and Rangi Moeka’a). Canberra: Cook Islands Ministry
of Education.
Capell, Arthur C. 1941. A New Fijian Dictionary. Suva: Government Printer.
Churchward, Clerk Maxwell. 1940. Rotuman Grammar and Dictionary. Syd-
ney: Australasian Medical Publishing Co.
Churchward, Clerk Maxwell. 1959. Tongan Dictionary. Oxford: Oxford
University Press.
Conklin, Harold C. 1954. The Relation of Hanunóo Culture to the Plant
World. New Haven, CT: Yale University.
Conklin, Harold C. 1962. Lexicographic treatment of folk taxonomies. In
Problems in Lexicography, eds. Fred W. Householder and Sol Saporta,
41–59. Bloomington: Indiana University Research Center in Anthropol-
ogy, Folklore, and Linguistics.
Crowley, Terry. 1992. A Dictionary of Paamese. Canberra: Pacific Linguis-
tics.
Dye, Tom. 1983. Fish and fishing on Niuatoputapu. Oceania 53(3):247–271.
Elbert, Samuel H. 1972. Puluwat Dictionary. Canberra: Pacific Linguistics.
Evans, Bethwyn. 2008. Ethnobotanical classification. In The Lexicon of
Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society.
Vol. 3. Plants, eds. Malcolm Ross, Andrew Pawley, and Meredith Osmond,
53–84. Canberra: Pacific Linguistics.
Firth, Raymond. 1985. Tikopia–English Dictionary. Oxford, Auckland: Ox-
ford University Press, Auckland University Press.
Foale, Simon J. 1998. The role of customary marine tenure and local
knowledge in fishery management at West Nggela, Solomon Islands. PhD thesis,
University of Melbourne.
Fox, Charles E. 1955. Nggela Dictionary. Auckland: The Unity Press.
Fox, Charles E. 1974. Lau Dictionary. Canberra: Pacific Linguistics.
Fox, Charles E. 1978. Arosi Dictionary. Canberra: Pacific Linguistics.
Gardner, Rhys O., and Andrew Pawley. 2006. Annotated list of local
plant names from Waya Island, Fiji. Records of the Auckland Museum
43:97–108.
Haiman, John. 1980. Dictionaries and encyclopedias. Lingua 50:329–357.
Helfman, Gene S., and John E. Randall. 1973. Palauan fish names. Pacific
Science 27:136–153.
Henderson, C. P., and I. R. Hancock. 1988. A Guide to the Useful Plants of
Solomon Islands. Honiara: Research Department, Ministry of Agriculture
and Lands.
Hviding, Edvard. 1990. Draft Environmental Dictionary, Marovo Language.
Centre for Development Studies, University of Bergen, Norway. In coop-
eration with the Western Province Division of Culture, Gizo.
Hviding, Edvard. 2005. Reef and Rainforest. An Environmental Encyclopae-
dia of Marovo Lagoon, Solomon Islands. Paris: UNESCO.
Inia, Elizabeth, Sofie Arntsen, Hans Schmidt, Jan Rensel, and Alan Howard.
1998. A New Rotuman Dictionary. An English–Rotuman Wordlist and C.
Maxwell Churchward’s Rotuman-English Dictionary. Suva: Institute of
Pacific Studies, University of the South Pacific.
Jackson, Frederick H., and Jeffrey C. Marck. 1991. Carolinean–English Dic-
tionary. Honolulu: University of Hawai’i Press.
Johannes, Robert E. 1981. Words of the Lagoon. Fishing and Marine Lore in
the Palau District of Micronesia. Berkeley: University of California Press.
Landau, Sidney. 1984. Dictionaries: The Art and Craft of Lexicography.
Lanyon-Orgill, Peter A. 1960. A Dictionary of the Raluana Language. Van-
couver: The author.
Lavondès, Henri. 1977. Les noms du poissons marquisien. Journal de la
Société des Océanistes 33:79–112.
Lawton, Ralph. 1998. Kiriwina word list: English to Kiriwina. Printout.
Canberra.
Lichtenberk, Frantisek. 2008. Toqabaqita Dictionary. Canberra: Pacific Lin-
guistics.
Lieber, Michael D., and Halio H. Dikepa. 1974. Kapingamarangi Lexicon.
Honolulu: University of Hawai’i Press.
Lindstrom, Lamont. 1986. Kwamera Dictionary. Nikukua sai Nagkiariien
Ninife. Canberra: Pacific Linguistics.
Lynch, John. 1977. Lenakel Dictionary. Canberra: Pacific Linguistics.
Mellow, Greg. 2009. Owa dictionary. Electronic file, submitted to Pacific
Linguistics.
Mosel, Ulrike. 2011. Lexicography in endangered language communities. In
The Cambridge Handbook of Endangered Languages, eds. Peter K. Austin
and Julia Sallabank, 337–353. Cambridge: Cambridge University Press.
Moyle, Richard. In press. Takū Dictionary. Canberra: Pacific Linguistics.
Moyse-Faurie, Claire, and Marie-Adèle Néchéro-Jorédie. 1989. Dictionnaire
xârâcùù–français suivi d’un lexique français–xârâcùù
(Nouvelle-Calédonie). Noumea: Ed. Populaires.
Ozanne-Rivierre, Françoise. 1984. Dictionnaire iaai–français (Ouvéa,
Nouvelle-Calédonie), suivi d’un lexique français–iaai. Paris: Société
d’Études Linguistiques et Anthropologiques de France.
Ozanne-Rivierre, Françoise. 1998. Le Nyelâyu de Balade (Nouvelle-
Calédonie). Paris: Peters.
Pawley, Andrew. 2011a. Stability and change in Oceanic fish names. In
The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral
Oceanic Society. Vol. 4. Animals, eds. Malcolm D. Ross, Andrew Pawley,
and Meredith Osmond, 137–160. Canberra: Pacific Linguistics.
Pawley, Andrew. 2011b. Were turtles fish in Proto Oceanic? Semantic re-
construction and change in some terms for animal categories in Oceanic
languages. In The Lexicon of Proto Oceanic: The Culture and Environ-
ment of Ancestral Oceanic Society. Vol. 4. Animals, eds. Malcolm D. Ross,
Andrew Pawley, and Meredith Osmond, 421–452. Canberra: Pacific Lin-
guistics.
Pawley, Andrew, and Timoci Sayaba. 2003. Words of Waya: A dictionary
of the Wayan dialect of the Western Fijian language. Printout, pp.1200.
Department of Linguistics, Research School of Pacific and Asian Studies.
Pratt, George. 1911. A Grammar and Dictionary of the Samoan Language.
4th edition. Apia: Malua Printing Press.
Puku’i, Mary Kawena, and Samuel Elbert. 1971. Hawaiian Dictionary. Hon-
olulu: University of Hawai’i Press.
Quick, Phil. 2010. Assaying the scope and number of fish names in Pendau
based on Pawley’s Oceanic benchmarks. In A Journey Through Austrone-
sian and Papuan Linguistic and Cultural Space. Papers in Honour of An-
drew Pawley, eds. John Bowden, Nikolaus P. Himmelmann, and Malcolm
Ross, 621–631. Canberra: Pacific Linguistics.
Rehg, Kenneth L., and Damian G. Sohl. 1979. Ponapean–English Dictionary.
Honolulu: University of Hawai’i Press.
Rensch, Karl H. 1983. Fish names of Wallis Island (Uvea). Pacific Studies
7(1):59–90.
Rivierre, Jean-Claude. 1983. Dictionnaire paicî–français suivi d’un lexique
français–paicî. Paris: Société d’Études Linguistiques et Anthropologiques
de France.
Rivierre, Jean-Claude. 1994. Dictionnaire cèmuhî–français suivi d’un
lexique français–cèmuhî. Paris: Société d’Études Linguistiques et
Anthropologiques de France, vol. no. 345.
Ross, Malcolm, Andrew Pawley, and Meredith Osmond, eds. 2008. The Lex-
icon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic
Society. Vol. 3. Plants. Canberra: Pacific Linguistics.
Ross, Malcolm, Andrew Pawley, and Meredith Osmond, eds. 2011. The Lex-
icon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic
Society. Vol. 4. Animals. Canberra: Pacific Linguistics.
Shoffner, Robert Kirk. 1976. The economy and cultural ecology of Teop: An
analysis of the fishing, gardening, and cash cropping systems in a
Melanesian society. PhD thesis, University of Hawai’i, Honolulu.
Sperlich, Wolfgang, ed. 1997. Tohi vagahau Niue. Niue Language Dictio-
nary. Honolulu: Government of Niue and Department of Linguistics, Uni-
versity of Hawai’i.
Taylor, Paul M. 1990. The Folk Biology of the Tobelo People. A Study in Folk
Classification. Washington, DC: Smithsonian Institution Press.
Thaman, Randy, and Temakai Tebano. N. d. Kiribati plant and fish names.
A preliminary listing. Atoll Research Programme, University of the South
Pacific.
Waterhouse, J. H. Lawry. 1949. A Roviana and English Dictionary with En-
glish–Roviana Index and List of Natural History Objects. Revised and en-
larged by L. M. Jones. Sydney: Epworth Printing and Publishing House.
White, Geoffrey M. 1988. Cheke Holo (Maringe/Hograno) Dictionary. Can-
berra: Pacific Linguistics.
Chapter 12
Language is power: The impact of fieldwork on
community politics∗
Even Hovdhaugen and Åshild Næss
1. Introduction
The ethical and political aspects of language documentation work have been
brought increasingly to the forefront in recent literature. This is a vast and
complex field, where no universally valid solutions can be offered, since
both the nature of the issues and the appropriate way of resolving them will
vary widely between communities and situations. In the words of Grinevald
(2006: 351), “the different agents involved create a maze of commitments of
often conflicting nature, and ... one of the major challenges of fieldwork is to
juggle all these constraints, requirements, and commitments”.
This paper presents a case study from our own fieldwork experience as a
basis for addressing some of the complex issues raised by the presence of a
fieldworker in a small local community: what happens when there is disagree-
ment within a community as to whether, how, and where the documentation
work should be carried out, and how the burdens and benefits associated with
such work are to be distributed? We will discuss how such issues may affect
not only the relationship between researcher and language community, but
also politics and power relations within the community itself.
While it is generally taken for granted that a visiting researcher should
as far as possible avoid getting involved in issues of local politics, this is in
practice impossible, because the presence of a visiting researcher in a small
community is local politics. Indeed, Rice (2006), while stating that a
fieldworker should avoid getting caught up in internal political issues, goes on to
list a number of ways in which the presence of a researcher affects a
community, including the distribution of money as payment for consultants’ work,
choosing who gets to do such work, and deciding where the linguist stays and
works; these are all, ultimately, political issues in the sense that they concern
the distribution of potentially scarce material resources, and of personal and
communal prestige associated with the presence of and interaction with a
researcher from outside. While not being able to offer any general solutions,
we will point out some potential loci of conflict which are probably to some
extent present in most fieldwork settings.
∗ The authors would like to thank Benedicte H. Frostad, Anders Vaa, and the editors of this
volume for helpful comments on earlier versions, while stressing that the views expressed
in this paper, as well as any factual errors, are entirely our own responsibility. We would
also like to emphasize that during our many years of work in Temotu Province, we have
been met in the overwhelming majority of cases with helpfulness and generosity; our
discussion of the conflict described below is meant as an example of issues which it may
be important for linguistic fieldworkers to take into consideration, and not in any way as
criticism of the people of the area, for whom we hold the deepest respect and gratitude.
Language documentation work is done in a variety of settings. Our ex-
periences come from “classical” fieldwork, where researchers from outside
spend time in a location where a language is spoken and work within the
language community for purposes of language documentation and descrip-
tion. However, in many cases work is done in towns or cities with isolated
members or small groups of a scattered or decimated language community.
In such cases, the issues to be raised here – of who speaks for the community,
of whose permission is required and of who decides how the benefits
and privileges attached to such work are to be distributed – may be even more
acutely relevant and even more difficult to resolve. While we will not specif-
ically discuss such situations here, we note that they are likely to become
increasingly common as a result of language shift and the loss of traditional
ways of life, and that as a result, the kinds of issues discussed in this paper
are likely to require increased attention in language documentation work.
2. The issues
Three main issues arise from the case study to be presented below. All relate
to some extent to local politics and to the distribution of power and
privilege within a community, and all are interrelated.
Specifically, they concern the question of the relevance of existing legal and
administrative structures for the particular issues associated with language
documentation projects, and of how to handle cases where these structures
turn out to be inadequate or disputable.
2.1. Permits vs. permissions
There are normally procedures to be followed at the level of national and/or
local administration in order to obtain legal permission to carry out research
in a country other than one’s own. It is taken for granted that any responsible
research project should follow such procedures and obtain all necessary legal
documents required to carry out a project in compliance with the require-
ments set by central authorities.
However, such legal permission from a central body of authority does not
necessarily translate into ready permission from the local community where
a language is actually spoken. In many of the countries where endangered
languages are spoken and documentation work is most urgently needed, the
central administration is relatively weak, and the communities where one may
wish to carry out the work are far removed both geographically and in terms
of everyday goals and values from those granting the research permits. Com-
ing to a village and presenting a permit from a central authority may trigger
goodwill towards a project; it may also, however, cause resentment and a
sense of having the community’s own authority over its linguistic and cul-
tural heritage weakened through the imposition of decrees from above.
For this reason, funding bodies for language documentation work often
require a researcher to document explicit permission from the language com-
munity in which the work will actually be undertaken. This, however, po-
tentially leads to a second problem, that of exactly whose permission can or
should be sought.
2.2. Community and authority
The question of how to obtain consent for a documentation project, and from
whom, arises not only in the tension between central and local authorities,
but also within a language community. There is considerable discussion of
this in the literature, for example in Thieberger and Musgrave (2007), who
address the issue of potential conflict between individuals, who may give
permission for material collected from them to be used in certain ways, and
the community as a whole, which may wish to place restrictions on the
use of such material. In general, guidelines for ethics in documentation work
stress the importance of obtaining explicit consent both from the individuals
participating in the project, and from the community whose language is to be
documented (see e.g. the ethics guidelines of the Linguistics Department of
the Max Planck Institute for Evolutionary Anthropology,1 or of the DoBeS
programme2).
Crucially, however, a “language community” is rarely a homogeneous
entity; rather it consists of a complex web of subcommunities such as towns,
villages, clans, or families, and each of these subcommunities comes with its
own power structures. Even those subcommunities not directly involved in a
documentation project may feel entitled to have a say in how the documen-
tation work is carried out, and this may lead to conflict not just between the
community and the researcher, but also between different parts of the lan-
guage community itself. The fieldwork literature typically emphasizes that a
fieldworker should only go where her presence is accepted, if not actively de-
sired – “don’t go where you are not wanted” (e.g. Bowern 2008). It is not nec-
essarily realistic, however, to expect approval from all parts of a community,
or for the same terms to be acceptable to all, and differences in expectations
and desires may trigger tensions within the community which the fieldworker
may not be able to defuse or resolve.
Again, the complexity of negotiating consent from different bodies and
levels of authority within a community is well known from the literature;
for example, Grinevald (2006: 354) acknowledges that “[c]learly one must
deal with a variety of constituencies from whom to get consent”, while Bow-
ern (2008: 153) exhorts the would-be fieldworker to “make sure that you are
seeking permission from the right people”. It is often assumed, however, that
the main challenge facing the fieldworker is that of finding out whose per-
mission is needed; that is, that there is a pre-existing and well-defined set of
“right people” who must be identified and consulted. A quite different
problem arises, though, when there is no clear consensus within the community on
who the relevant authorities are; the question of whose permission is needed
may simply not have a definite answer, and attempts to claim authority over
a documentation effort may have far-reaching political effects within a com-
munity.
1. http://www.eva.mpg.de/lingua/resources/ethics.php
2. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf
2.3. Copyrights and profits

The question of who owns the copyrights to materials produced through
language documentation has also been addressed in previous literature (e.g.
Dwyer 2006: 46–48, Thieberger and Musgrave 2007: 33–34). A related prob-
lem, however, is reaching agreement on exactly when and how the question
of copyrights is relevant. In the DoBeS “Code of Conduct”,3 for example, it
is stated that “No one is allowed to use the recorded and analyzed data for
commercial purposes without permission from the speech community” and
that “It is the speech community that deals with the issue of commercial uses.
The speech community not only gives permission, but also decides all aspects
involved in the commercial use (copyrights, sharing profits, etc).”
The problem here is deciding what exactly counts as “commercial use”.
A product which is often agreed on as an outcome of documentation valu-
able to both researcher and speech community is a collection of texts in the
documented language reproduced in digital or printed form. Such a product
has value both to the community, as it preserves traditional oral literature
and may be used as a basis for literacy training, and to the research commu-
nity as primary material for linguistic analysis. Naturally, the publication of
a text collection requires the explicit permission of those who contribute to
it. But does it constitute “commercial use” of the material? The researcher is
likely not to see it that way, since she knows that the commercial market for
such products is practically non-existent, and since finding a commercial pub-
lisher for such materials is generally difficult; typically, such materials will
be produced using the project’s own money, i.e. the researcher spends money
producing the materials rather than making money from them. But it cannot
be taken for granted that the community will share this view. It is not uncom-
mon to hear stories of researchers “stealing” traditional cultural materials and
making large profits from them (cf. e.g. McLaughlin and Sall 2001: 196), and
unlikely as such stories may sound to the researcher’s ears, they may never-
theless inform the community’s understanding of the situation. Indeed, given
many third-world communities’ experience with Westerners as people who
visit their areas solely for the purpose of making money – through planta-
tions, logging, mining or other commercial ventures – the idea of a non-profit
research project, and of Western institutions paying for such projects, may be
extremely difficult to explain. This may be a particular problem for
linguistics, which, as Bowern (2008: 161–162) points out, has often been carried
out as a means to other ends, such as anthropology or missionary efforts; the
idea that someone might be paid to study a remote minority language with no
ulterior motives may be difficult to accept.
3. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf
3. The setting
Our experiences come from fieldwork in several different language communities
around the world, but primarily in the Pacific. The experiences to be
discussed here stem from the documentation and description work carried
out in Temotu Province in the Solomon Islands in 2002–2007, as part of the
multidisciplinary research project Identity Matters, funded by the Norwegian
Research Council. The linguistic part of this project aimed at studying the
two languages of the Reef Islands, Vaeakau-Taumako (Pileni) and Äiwoo,
and the contact situation between them, and it based itself on fieldwork in
both language communities.
Some of the specific issues associated with doing fieldwork as an outsider
in Melanesia have been insightfully discussed by Dobrin (2005, 2008), who
points out that the ability to engage the interest of outsiders and to enter into
relationships of mutual exchange and commitment is a fundamental cultural
value of the area. From a more general perspective, the questions surrounding
the allocation of resources connected with a documentation project, discussed
in the introduction, are highly pertinent in the area where we have worked:
firstly, because it consists of small islands poor in material resources, where
possibilities for earning a cash income are extremely limited; secondly, be-
cause the “bigman” system of social organization, where political influence
is based on the resources and community support that an individual is able to
command, means that the personal prestige that may derive from close asso-
ciation with visiting researchers and their work is a valuable commodity in
itself.
On the face of it, we were reasonably well prepared to tackle such issues
when the Identity Matters project started. It built on previous work in the
area, with Hovdhaugen having made two previous visits to Pileni island in
order to make a start on studying the language there – the first time accom-
panied by anthropologist Ingjerd Hoëm, the second time by Næss. During
these visits, experience and personal contacts had been built up which proved
extremely valuable when Hovdhaugen returned in 2003 to present the Identity
Matters project to the local chiefs and communities in different islands
in the area, and to make arrangements concerning working conditions for re-
searchers in the new project: where they would stay, who would act as main
consultants, how much they would be paid per hour of work, how much the
community would receive in return for hosting the researchers, and not least,
who would be responsible for receiving and allocating this money, as there
were several candidates for this role – paramount chiefs, village councils,
pastors, churches, etc. The fact that Hovdhaugen was able to communicate in
the Vaeakau-Taumako language made these meetings a lot easier than would
probably otherwise have been the case.
We were well aware of the potential impact of our presence in these
small village communities, and made every effort to negotiate clear agree-
ments concerning the practical and financial arrangements for our fieldwork.
These arrangements usually involved both a contribution to village funds and
a modest hourly compensation for those speakers taking time out from their
daily activities to work with the linguists. While these terms were, in prin-
ciple, generally accepted, the more detailed arrangements concerning which
villages would receive the benefits of having a visiting researcher stay, and
which individuals would get the opportunity to earn some much-needed cash
income through consultant work, proved more difficult to resolve to
everyone’s satisfaction.
Mosel (2006: 71–72) addresses the near-impossibility of a fieldworker se-
lecting consultants directly; as a guest in the community, she has to work with
those individuals whom the community considers suitable and who can spare
the time from their daily duties. Our principle has been, once a village has
been selected for a research stay, to ask local authorities – chiefs and elders
– for help in selecting suitable consultants, according to some basic criteria
specified by the researcher. This ensures that decisions are seen to be made
by those whose decision-making authority is unquestionable, and makes it
possible to take into account local notions of fairness and entitlement which
the researcher, at least initially, will know little about. Of course individuals
may still feel slighted and conflict may ensue, but such conflict will to a lesser
extent be perceived as a direct result of the researcher’s personal conduct.
More complicated, at least in our case, is the question of where – in which
village and under whose direct responsibility – a researcher is to be located
during her stay. The ideal solution is of course for the researcher to divide
her time about equally between the different communities, not only for
political but also for scientific reasons, as a language documentation project
should strive to include as much of the existing dialectal variation as
possible. In practice, however, this may not always be possible, for several reasons.
Firstly, time limitations and logistical challenges may mean that only very lit-
tle time may be spent in each location, drastically reducing the efficiency of
the work to be done. Secondly, other practical issues may mean that some
locations are more suitable than others.
In our case, a general, though unwritten, agreement existed concerning
where the researchers would stay and how they would circulate between vil-
lages. In the spring of 2005 Hovdhaugen came to the area with two MA stu-
dents, who were to spend time in two different villages. This was one point
that necessarily influenced Hovdhaugen’s choice of location. Should the stu-
dents experience practical problems or need help in their work, they had to
be able to contact Hovdhaugen, meaning that both he and the students had
to have access to the local radio network. But several islands and villages
did not have a working radio. Usually the battery needed charging, and the
villages lacked the equipment for this.
Furthermore, Hovdhaugen had a few years previously been diagnosed
with Parkinson’s disease, and this affliction influenced his accommodation
needs. Whereas previously he had stayed and worked in the holau or men’s
house, which also served as a meeting house and general social gathering
place for the men of the community, he now needed quieter surroundings in
order to be able to work efficiently. This meant that accommodation plans
had to be rethought, and this was felt by some as a breach of an unwritten
contract. As a result, some communities and individuals felt slighted, and
this may have contributed to the open conflict that followed.
4. The conflict
The project plan, including the budget, had been approved by the Ministry of
Education in Honiara, the national capital; by the Premier of Temotu Province;
and by the council of chiefs in the villages where members of the research
team stayed. We felt fairly confident that in obtaining permission from these
bodies, we had followed correct procedures and consulted all relevant author-
ities. However, in 2005, a group of chiefs on one of the islands questioned
our right to collect traditional stories, on the basis that the project had not
been approved at the level of the island. This was initially communicated by
a representative of the group approaching the student working on the island
in question and demanding that he stop his work immediately. This demand
caused considerable consternation in the village where he was staying, where
everyone, at least on the face of it, was supportive of the project, and where
it had been explicitly approved by the village authorities.
An individual island, however, is not a recognized administrative entity
either locally or by national legislation. The island in question consists of four
larger villages and a number of smaller villages, and our project had been
approved by the villages where the work was being carried out. The whole
island is part of a larger administrative district, which is the unit recognized
at the national level. The attempt at setting up the island as a distinct level of
authority therefore amounted to a re-casting of the political structure of the
area, and caused some confusion locally.
The protestors called a public meeting to discuss the project, its legiti-
macy and the terms for its continued operation in the islands. In the week
before this meeting Hovdhaugen had long discussions via radio almost every
morning with one of the central actors behind the protests and the proposed
meeting. Since the radio network is public, a number of people, both on the
island concerned and in the neighbouring area, could listen in on these dis-
cussions, and they were often continued with locals in the area where Hovd-
haugen was staying, many of whom expressed their support.
The main objections to our project were that we had no right to record and
print the traditional stories of the language community, and that no central
or provincial authority, nor the chiefs in other villages, had any authority
to give us permits to do so. More specifically, our motivation for wishing
to collect such stories was questioned. Explaining the link between “studying
the language”, which was what our paperwork stated that we were there to do,
and collecting traditional texts, proved difficult. A complicating factor here
may be that most people in the region encounter outside interest in studying
their language only in the context of Bible translation, which of course has a
rather different goal and methodology.
As discussions proceeded, it became clear that the main motivation for
the objections was of a financial nature. As noted above, explaining the sci-
entific motivation behind a linguistic documentation project is often difficult,
and the protestors reasoned that there must be a profit motive behind the work
that we wished to do. Why, after all, would a faraway country like Norway
spend so much money to send us to a remote island group to gather language
data? If traditional stories were to be published, there should be financial
profit involved, and the question of copyrights was perceived as being poorly
handled. Hovdhaugen conceded the latter point, but also stressed that there
was very little chance of any profits being made from the planned text col-
lections; on the contrary, our publications were printed in Norway using the
project’s money, and a significant number of copies were given to the local
communities free of charge. Indeed, one major motivation for our wishing
to produce printed collections of texts was for the communities to receive
written materials to use in literacy training, a goal supported by many; these
publications, which comprised both text collections and a short dictionary as
well as a small collection of hymns (Hovdhaugen, Hoëm, and Næss 2002;
Hovdhaugen 2006; Hovdhaugen and Næss 2006; Hovdhaugen and Tekila-
mata 2006) were, for the most part, received with enthusiasm.
The conflict came to a head when it transpired that the protesting chiefs
had asked the local police officer in the islands to come to the meeting with
the police boat and arrest Hovdhaugen if he was found to have done some-
thing illegal. But after hours of discussion, including thorough scrutiny of the
research permit from the department of education in Honiara and the letter
from the province Premier where he asked every village to do their best to
support us in our work, the meeting was resolved without any important de-
cisions taken. However, a general wish was expressed that the question of
copyright and ownership of linguistic materials should be addressed more
carefully, a wish which Hovdhaugen accepted as legitimate.
5. The aftermath
Triggered by the presence of outside researchers, this political conflict ultimately
involved a struggle for authority within the local community: given
that there was no formal level of authority corresponding to the language
community as a whole, who had the right to make decisions on the commu-
nity’s behalf? While the conflict petered out without any dramatic short-term
results, a pertinent question is whether it may have had any lasting effects on
local politics, or on the relationship between locals and research teams from
outside.
We have not been able to confidently identify any such effects. The claim
of authority based on a non-existent administrative level was recognized as
illegitimate, and the process triggered, if anything, explicit statements of sup-
port during later research visits; on one occasion a group of village chiefs
provided a written statement of support for our project, asserting explicitly
that authority in this matter rested with them and so reaffirming the illegiti-
macy of the attempt at redefining local political structure.
Nevertheless, the very fact that such explicit statements of authority were
deemed to be necessary may be indicative of subtle shifts in power relations
or status within the local communities; such changes are of course difficult
for us to discover. Furthermore, the conflict may to some extent have changed
patterns of interaction between different groups in the area.
Traditionally, the two language communities in the Reef Islands – the
Polynesian-speaking group in the Outer Reefs and the Äiwoo speakers in the
Main Reefs – have had little day-to-day contact. There is some intermarriage
between the two groups, and people from the Polynesian villages travel to the
trade store in the Main Reefs for supplies, but people identify strongly with
their village and language community, and there is little contact or collabora-
tion across the linguistic border.
As our project spanned both communities, however, both had a vested
interest in the situation, and the conflict brought together supporters of the
project from both sides. On the evening before the meeting, several of Hovd-
haugen’s friends both from the Main and Outer Reefs came to his house to
discuss strategies for the meeting. We are not aware that meetings of political
leaders from the two language communities have been a practice in the past,
and in a longer-term context it may well be that an increase in such contact
will turn out to be the main political impact of our project in the islands.
For small island communities having to deal increasingly with the effects of
modernization and globalization, increased collaboration and the ability to
put traditional borders aside for the sake of promoting common interests is
clearly an asset, and our presence may, in a small way, have contributed to
this.
As a final point, it should be noted that conflicts of authority of the kind
we have described may have their source in preexisting conflicts between in-
dividuals or groups that the fieldworker cannot be expected to be aware of –
they may go back years or even decades, and may be dragged to the surface
through the new situation created by the arrival of the researcher. In our case,
it was clear that the protests were motivated as much by individuals’ sense
of entitlement and injustice as by legitimate formal claims. One key actor in
particular was known to have been very hostile to previous linguistic efforts
in the area, including ones largely carried out by locals; acceding to this person's
wish to play a central role in our work might have prevented the
particular situation which arose, but would have been practically and scientifically
infeasible, and would have created a potential source of resentment from
other prominent personalities in the area.
6. Conclusions
A familiarity with the structures of power and authority in the region where
one is working is essential for any documentation enterprise. Such familiar-
ity is only built up by experience, which means that the chances of making
mistakes in the early phases of a project are great, as is well documented in
the literature on language documentation.
However, the existing power structures may not always be adequate for
the handling of the particular issues raised by a language documentation
project, and this may lead to conflict not only between project participants
and community members, but also within different parts of a language com-
munity. If there is no body of authority associated with the language commu-
nity as a whole (as distinct from, e.g., an administrative district which may
subsume several language communities), the question of whose permission
is required may simply not have an unequivocal answer. This raises a number
of questions:
1. Under such circumstances, how far can a researcher be expected to go in
obtaining permission to carry out documentation work? Is it necessary –
and realistic – to acquire permission from all parties who may consider
themselves entitled to an opinion, even if they are not directly involved in
the project itself?
2. If this is deemed to be the case, how does one identify and approach all
interested parties? In our case, the objections came from a self-styled body
of authority which was not recognized locally as legitimate. In other words,
there simply was no way to approach this body beforehand, as technically it
did not exist until it was set up as a means of protesting against the project.
3. When there is no consensus within a community as to who has the authority
to grant permission, how does the researcher establish whose claims are
legitimate? It is tempting to assume that those who desire the project to
go ahead are those whose opinions should be listened to, but can their
dismissal of the objections be accepted without question?
4. In cases where a project creates divisions and conflict within a community,
is it ethical to carry on with the project at all? Do the benefits for those who
wish for documentation to be carried out – benefits which are typically
assumed to extend to all of the language community, even those who may
have objections – outweigh the damage that may be done to interpersonal
relations and political equilibrium within the community? This question
is of central importance not only to language documentation as such, but
to the entire discipline of linguistics, since it plays a significant part in
determining what language data is made available to linguistic science.
As is often the case, the answers to these questions will differ from case to
case and from area to area. They should, however, be reflected upon as part of
the planning of a documentation project, since our experience shows that it is
in practice impossible for a visiting researcher to predict the impact which a
research project may have on local politics. One may strive to treat everyone
fairly and to make the proper arrangements with the appropriate local author-
ities; but local power struggles and individuals’ quest for personal prestige
may nevertheless give rise to conflicts which may not only be uncomfortable
for the researchers, but may shake up a whole community and create new
divisions and alliances which cross-cut traditional structures. When the dust
settles and the researcher leaves, local power relations may well have changed
as a result.
References
Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke:
Palgrave Macmillan.
Dobrin, Lise. 2005. When our values conflict with theirs: Linguistics and
community empowerment in Melanesia. In Language Documentation and
Description, Volume 3, ed. Peter K. Austin, 45–52. London: School of Ori-
ental and African Studies.
Dobrin, Lise. 2008. From linguistic elicitation to eliciting the lin-
guist: Lessons in community empowerment from Melanesia. Language
84(2):300–324.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork
and analysis. In Essentials of Language Documentation, eds. Jost Gippert,
Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York:
Mouton de Gruyter.
Grinevald, Colette. 2006. Worrying about ethics and wondering about “in-
formed consent”: Fieldwork from an Americanist perspective. In Lesser-
Known Languages of South Asia: Status and Policies, Case Studies and
Applications of Information Technology, eds. Anju Saxena and Lars Borin,
Hovdhaugen, Even. 2006. A Short Dictionary of the Vaeakau-Taumako Lan-
guage. Oslo: The Kon-Tiki Museum.
Hovdhaugen, Even, Ingjerd Hoëm, and Åshild Næss. 2002. Pileni Texts with
a Pileni-English Vocabulary and an English-Pileni Finderlist. Oslo: The
Kon-Tiki Museum.
Hovdhaugen, Even, and Åshild Næss. 2006. Stories from Vaeakau and Tau-
mako / A lalakhai ma talanga o Vaeakau ma Taumako. Oslo: The Kon-Tiki
Museum.
Hovdhaugen, Even, and Christian Tekilamata. 2006. Christmas Carols from
Vaeakau. University of Oslo: Department of Linguistics and Scandinavian
Studies.
McLaughlin, Fiona, and Thierno Seydou Sall. 2001. The give and take of
fieldwork: Noun classes and other concerns in Fatick, Senegal. In Linguis-
tic Fieldwork, eds. Paul Newman and Martha Ratliff, 189–210. Cambridge:
Rice, Keren. 2006. Ethical issues in linguistic fieldwork: An overview. Jour-
nal of Academic Ethics 4:123–155.
Thieberger, Nick, and Simon Musgrave. 2007. Documentary linguistics and
ethical issues. In Language Documentation and Description, Volume 4, ed.
Peter K. Austin, 26–37. London: School of Oriental and African Studies.
Chapter 13
Sustaining Vurës: Making products of language
documentation accessible to multiple audiences
Catriona Hyslop Malau
1. Introduction
With the emergence of language documentation as a distinct sub-discipline of
linguistics, and recent advances in technology, there is now a strong emphasis
within field-based linguistics on the importance of producing high-quality
video and audio language data which is annotated and stored in a digital
archive (e.g. Gippert, Himmelmann, and Mosel 2006; Austin and Greno-
ble 2007). Once archived, the unedited data can be accessed (depending on
imposed restrictions) not only by linguists and members of the speech com-
munity, but also by any other interested parties. However, if the documen-
tation is only stored as unedited recordings in a digital archive, apart from
linguists and members of the language community, most likely it will only
be accessed by researchers in disciplines closely linked to linguistics, such as
anthropology, ethnomusicology and archaeology, in particular those with a
focus on the immediate language area or family. A number of linguists work-
ing on language documentation have therefore emphasised the importance of
considering from the outset the broad variety of users who could potentially
utilise and benefit from a language documentation corpus, and planning and
carrying out the data collection accordingly (e.g. Dwyer 2006; Austin and
Grenoble 2007).
With this issue in mind, as part of our DoBeS language documentation
project focussing on the Vurës and Vera’a languages of Vanuatu, we made
a decision to produce edited versions of some of our video recordings as a
distinct output of our project. We considered that if we took advantage of the
opportunity to use all available resources to create high-quality documentary
films, we could reach a significantly wider range of users. This could then
have a substantial impact on the value of our project, with its primary aim of
documenting endangered languages and through documentation promoting
language maintenance and linguistic diversity. Thus while Seifart (this vol-
ume) considers the different motivations for documenting endangered lan-
guages, I consider a related issue, that of attempting to target a wide range
of users of language documentation. Cablitz (this volume) has also consid-
ered the issue of making language documentation work accessible to a wide
audience, with her work on a multimedia encyclopaedic lexicon that is user-
friendly for the language community and others. We took a similar approach
and this paper presents, as a case study, the films produced by our project. Le
Kal Vurës ‘Sustaining Vurës’ is a two-DVD set of documentaries presenting
aspects of Vurës life and language. Through discussion of the diverse range
of audiences who have accessed the films, the paper shows how these and
other films of this nature can be important tools for supporting language and
cultural maintenance, both within and beyond the language community.
1.1. Project background

Our project focuses on the documentation of two closely related languages
spoken on the island of Vanua Lava, one of the Banks Islands situated in
North Vanuatu: Vera’a, the more endangered language of the two with ap-
proximately 300 speakers, and Vurës, the dominant indigenous language of
the island, which is nevertheless endangered with only around 1,200 speak-
ers. There are approximately 100 languages spoken in Vanuatu (Lynch and
Crowley 2001) for a population which in 2009 was 234,023 (Vanuatu Census,
Office 2009). Thus, with an average of little more than 2,000 speakers
per language, this linguistic situation has earned Vanuatu the status of the
country with the greatest density of languages per capita in the world. The
low number of speakers of each language, combined with the fact that the
lingua franca and national language, Bislama, and official languages, French
and English, impact on the status and transmission of the indigenous lan-
guages, means that all of the languages of Vanuatu are endangered to some
extent, although the threat is not high in comparison with languages in other
hotspots across the globe (Harrison 2007). The majority of the Vanuatu languages
are under-documented (Lynch and Crowley 2001), particularly with
regard to archived language corpora, and language documentation projects
such as ours are essential for increasing our understanding of the languages
and producing documentations which can assist with language maintenance.
The primary aim of our research is to produce a comprehensive record of the
language within its cultural context. In an effort to document the most signifi-
cant aspects of the speakers’ culture and environment, the project has brought
together a team which includes, apart from linguists, specialists from varied
disciplines: a marine biologist (Katherine Holmes), an anthropologist (Sabine
Hess), an ethnomusicologist (Raymond Ammann), and a botanist (Philemon
Ala).
1.2. The documentaries
We have produced two documentaries in DVD format which centre on the
Vurës language and culture, presented under the single title Le Kal Vurës
‘Sustaining Vurës’. The Vurës title translates more literally as ‘lift up Vurës’,
indicating our objective for the documentaries to be used both to promote
the maintenance of the language within the language community, and also to
celebrate linguistic diversity and raise the status of minority languages such
as Vurës outside the community. To the best of my knowledge, these are the
first films produced entirely in an indigenous language of Vanuatu.
Each of the two documentaries has a common theme and includes a num-
ber of short films focussing on a specific topic. The first DVD, O tere tarn̄i
vovon̄on ‘Water harvests’ comprises three short films, each demonstrating a
fishing activity: the traditional technique of harvesting the marine worm un
‘palolo’; the process of catching fish employing the leaves of a poisonous
vine; and the weaving and setting of a freshwater prawn trap. The second
DVD, O tere tarn̄i vetvet ‘Ways of weaving’ includes twelve films, each
demonstrating the process and techniques for weaving or plaiting a different
basket or mat.
The fishing and weaving practices presented in the films are quite unique
to the region, thus making the documentaries an important ethnographic con-
tribution in terms of documentation of the cultural activities. The activities
presented in the fishing film are also noteworthy for the opportunities they
provide for discussion of conservation in traditional fishing practices, which
became an important secondary focus of the films due to our collaboration
with conservation marine biologist, Katherine Holmes (see §3.3). The topic
of weaving was chosen for various reasons, most significant of which was our
desire to produce a comparative record demonstrating the variation in basketry
styles relative to other areas in Vanuatu. In the Banks Islands a greater
diversity of styles has been preserved compared to other islands in Vanuatu.
There are a variety of fibres which are used and basket styles which are pro-
duced today on Vanua Lava which, in areas south of the Banks group, are
either not used at all or about which only a small number of basket producers
retain knowledge.
1.3. Production of the films

All video and audio footage for the films was recorded by me or Kather-
ine Holmes, not by professional film makers. However, we engaged a small
film production company, GoodEyeDear, to edit and post-produce the films.
GoodEyeDear director, Gavin Banks, also created all the graphics and DVD
interface. To realise this aspect of the project we depended on the support of
our project host, Prof. Ulrike Mosel, and our project funders, the Volkswagen
Foundation, who recognised the films as worthwhile outputs of a language
documentation project and agreed to the use of project funds to pay for production
costs. The documentaries are still low budget productions, a point which is, of
course, evident in the final product, but we feel (and the reaction to the films
would confirm) that they are sufficiently professional to justify targeting them
at an audience beyond the language community. I will return to the point of
funding for film projects in §5, in considering the positive and negative as-
pects for linguists investing time in producing outputs such as documentaries.
2. The films as language maintenance tools

The primary audience for these documentaries was always intended to be the
Vurës people, for the speakers and their descendants to be able to use them
to assist in maintaining and passing on aspects of both their language and
their culture. True to this purpose, the audio is entirely in the Vurës language.
When editing the films we always needed to keep our primary target audi-
ence in mind, while also considering the fact that we wanted the films to be
accessible and of interest to people outside Vanua Lava, both within and out-
side of Vanuatu. Every aspect of the films is thus trilingual in Vurës, English
and Bislama, the national language of Vanuatu. The title and information on
the cover and DVD faces, plus the menus within the DVDs are in all three
languages. The Vurës version plays with no subtitles, while the English and
Bislama versions present subtitles in the chosen language.
Aside from the fact that the Vurës language is the medium used to present
the communication in the films, these films are not about the language and do
not present any overt discussion of their purpose as language awareness and
maintenance tools. However, there are three ways in which this purpose is
made evident, two of which are minor yet nevertheless significant points, the
other being a more explicit technique employed for highlighting the language
used as one of the themes of the films.
The first point is simply the fact that these documentaries are entirely in
the Vurës language. It is a highly significant matter for a film to be made –
and to be commercially available – in a minority language spoken by a little
over a thousand people. Within the sociolinguistic context of Vanuatu with
its 100 languages, the existence of the films and knowledge of their existence
within and outside the community serves immediately to elevate the status of
the language. In §4 I discuss further the significance of films being produced
in minority languages.
The second point is that while the language maintenance issue is not
discussed within the films, there is an introductory written text, accessible
through the main DVD menu, which provides background on the language
and language endangerment issue. The text discusses the fact that some of
the languages spoken on Vanua Lava have already been lost, and that while
Vurës is being passed on to children today, the community recognise that it
is under threat from English, French and Bislama. It is stated that this is the
reason why the language documentation project was supported by the com-
munity and why the DVDs were produced.
The third way in which these films can be identified as having the lan-
guage used as a focus is considerably more explicit, with a clear pedagogical
intent. Each of the documentaries contains a number of dictionary entries –
presented on the screen as an excerpt of a page from a dictionary (as exempli-
fied by Figure 1) – which highlight and provide translations and definitions
for certain key terms used in the films. The aim was to choose the most sig-
nificant technical terms that were related to the topic of each film, particularly
those which were more likely to represent restricted knowledge, thus enabling
the dictionary entries to serve as an important record of the meanings of the
technical terms. For example, in both documentaries a number of plants are
referred to which are used for producing woven artefacts or in fishing ac-
tivities. The dictionary entries for these words are presented complete with
the scientific identification for the plant. Thus, the film serves as a record of
particular cultural activities, including the vernacular names for species that
have specific cultural uses, and the linked dictionary entry serves as an accu-
rate scientific record linking the identification to the use as it is represented
in the film.
Figure 1. Screenshot of a dictionary excerpt
The idea behind including the dictionary entries within the films themselves
was to highlight the fact that the language being used is one of the themes of
the films, and to compel the viewers, whether they be members of the Vurës
community or not, to consider its high significance. For the Vurës speakers,
now and in the future, the entries for each headword can be used as a teach-
ing tool, particularly in relation to those words which are used in the domain
of cultural practices that are not now widely observed. Linking dictionary
entries to video in which the use and denotatum of the words is clearly il-
lustrated has a much greater value than a dictionary entry presented as text
alone. This is particularly true in the case of scientific identifications and
technical terms, where the translation or definition may not, in reality, be suf-
ficient to unambiguously retrieve the correct meaning and range of use of the
word or expression. I will give two different types of examples. Firstly, in-
cluding the language term and its scientific name alongside contextual video
footage which illustrates clearly the species and the natural environment in
which it is found, is a comprehensive record which can be used to assist the
language community in identifying the precise class of referents and preserving
the word with its associated meaning and cultural knowledge. Mere
inclusion of a universally accepted scientific identification, on the other hand,
means that the information regarding the exact species will not be easily ac-
cessed by members of the indigenous community who do not possess the
background scientific knowledge and ready access to relevant sources, such
as are available on the internet. A different type of example is that of terms
used to explain the process of producing woven or plaited basketry. In pro-
ducing a bilingual/multilingual dictionary, it can be very difficult to produce
a translation, particularly for words referring to actions and abstract concepts,
which is sufficiently precise to enable someone to access the accurate mean-
ing, so that they could reproduce the correct associated action. For example,
the definition I have provided for qeleg, ‘plait together two strips of coconut
leaflets or pandanus leaves as first stage in making basket or mat’, is not suffi-
ciently precise to enable someone to reproduce the action which qeleg refers
to. Incorporating the use of the word in context, with the translation (in both
English and Bislama), alongside demonstration of the action as a stage of the
plaiting procedure which it is a part of, is a practical method for enabling
someone to reproduce the correct action.
The dictionary entries presented within the films are taken from a trilin-
gual Vurës-English-Bislama dictionary that we are currently compiling as
part of our documentation project, and their format and the information in-
cluded is typical of a bilingual/multilingual dictionary. The headword, given
in the accepted Vurës orthography, is followed by pronunciation presented in
IPA, then the word class. The English translation is given first, then Bislama,
followed by a scientific definition where relevant. Where it was considered
necessary, additional encyclopaedic information was also included. Where
the relevant sense is a secondary sense of the word, the primary sense is in-
cluded as a further record linking the senses.
There are a total of 77 definitions of terms in the two documentaries. All
key weaving terms are presented, including both those terms whose mean-
ing and use is completely restricted to the domain of weaving, and those
which also have other more general senses outside weaving. For example, the
primary sense of the word sal is ‘road, path’. The same term used in weav-
ing means ‘working strip of material being used for weaving, plaiting’. Both
senses are given, and thus the connection between the primary sense and the
secondary sense is made clear through the presentation of the definition.
Terms specific to fishing and other featured activities are also included, both
those that are more technical and also more general terms that play an impor-
tant role in discussion of the featured activities. Definitions for 15 different
plant species are included: nine in the weaving documentary and six in the
fishing documentary.
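The entry format described above, and the linking of a secondary sense to its primary sense as illustrated with sal, can be sketched as a small data structure. This is only an illustrative sketch, not the format of the actual Vurës dictionary database: the class and field names are assumptions, and the strings in angle brackets are placeholders for information (IPA, word class, Bislama translations) that is not given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sense:
    english: str
    bislama: str
    scientific: Optional[str] = None     # scientific definition, where relevant
    encyclopaedic: Optional[str] = None  # additional encyclopaedic note


@dataclass
class Entry:
    headword: str                          # accepted Vurës orthography
    ipa: str                               # pronunciation in IPA
    word_class: str
    sense: Sense                           # sense relevant to the film
    primary_sense: Optional[Sense] = None  # set only if `sense` is secondary

    def render(self) -> str:
        """Lay the entry out in the order described in the text."""
        senses = [s for s in (self.primary_sense, self.sense) if s is not None]
        lines = [f"{self.headword} [{self.ipa}] {self.word_class}"]
        for number, s in enumerate(senses, start=1):
            # Number the senses only when a primary sense is also shown.
            label = f"{number}. " if len(senses) > 1 else ""
            lines.append(f"{label}{s.english}; {s.bislama}")
            for extra in (s.scientific, s.encyclopaedic):
                if extra:
                    lines.append(f"   {extra}")
        return "\n".join(lines)


# Placeholders in angle brackets stand in for data not given in the text.
sal = Entry(
    headword="sal",
    ipa="<IPA>",
    word_class="<word class>",
    sense=Sense("working strip of material being used for weaving, plaiting",
                "<Bislama>"),
    primary_sense=Sense("road, path", "<Bislama>"),
)
print(sal.render())
```

Keeping the primary sense as an optional field on the entry mirrors the practice described above, where the primary sense is shown only when the sense relevant to the film is a secondary one.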
Each dictionary entry appears on screen directly after the utterance unit
the defined word occurs in, thus linking the word and its translation/definition
directly to the use of the word in context. Most of the content of the films is
of a procedural genre, and the entries are linked in such a way that as the
speaker(s) demonstrate the process, explaining their actions, when they use a
key term the film then cuts to a screen where first the headword alone, then
the full dictionary excerpt appears. The dictionary page remains on screen for
eight seconds and then cuts back to the film. By presenting the dictionary entries
in this way, we aimed to make them a feature of the film without considerably
disrupting the flow of the depicted procedures.
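The sequencing just described (utterance unit, then headword alone, then the full entry, then back to the film) can be sketched as a simple cut list. This is a hypothetical illustration rather than the editing procedure actually used for the films: the names are invented, the eight-second hold is taken from the text, and the duration of the headword-only screen is an assumption, since the text does not state it or say whether it counts towards the eight seconds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ENTRY_HOLD_SECONDS = 8.0     # stated in the text
HEADWORD_ONLY_SECONDS = 2.0  # assumption: no duration is given in the text


@dataclass
class Utterance:
    start: float                    # seconds into the source footage
    end: float
    key_term: Optional[str] = None  # headword to define after this unit


def build_cut_list(utterances: List[Utterance]) -> List[Tuple[str, str, float]]:
    """Return (kind, label, duration) segments in playback order."""
    segments = []
    for u in utterances:
        segments.append(("film", f"{u.start:.1f}-{u.end:.1f}", u.end - u.start))
        if u.key_term:
            # The entry follows directly after the utterance unit in which
            # the defined word occurs: headword alone, then the full entry.
            segments.append(("headword", u.key_term, HEADWORD_ONLY_SECONDS))
            segments.append(("entry", u.key_term, ENTRY_HOLD_SECONDS))
    return segments


if __name__ == "__main__":
    for segment in build_cut_list([Utterance(0.0, 4.5, "qeleg"),
                                   Utterance(4.5, 9.0)]):
        print(segment)
```

Because each entry is attached to the end of a complete utterance unit, the speaker is never cut off mid-explanation, which is the property the design above aims for.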
3. The film audiences

At the time of writing of this paper it is less than a year since our docu-
mentaries were completed and first presented to their primary audience, the
Vurës language community. After first being shown to the community, the
films were launched at the Vanuatu Museum in February 2010. In the short
time since, the films have already been viewed by a number of different audi-
ence groups, demonstrating the validity of producing an output of a language
documentation project that targets a variety of users. Below I discuss the dif-
ferent audience groups and describe some of the ways in which a film that is
in an indigenous language and centres on cultural themes, can have different
impacts and functions for different groups of users.
3.1. The Vurës language community

I have already stated clearly that the primary audience for these films is in-
tended to be the language community. But do the speakers of the language
recognise the films as being significant and offering any worthwhile contribu-
tion in terms of supporting linguistic and cultural identity and maintenance?
Apart from a general sense that those who have watched the films feel pride
in seeing themselves and their culture represented on a ‘real’ DVD, the films
appear to have already had an impact on viewers in terms of issues relating
to transmission of cultural knowledge.
One significant example is the response to the film Nēn a so o un ‘We
gather the palolo worms’. ‘Palolo’ are segmented sea worms belonging to
various species of the genus Palola or Eunice. They are gathered for con-
sumption when they rise to the surface of the sea close to shore after spawn-
ing. The opportunity to collect the palolo only arises on a few nights each
year, and thus the activities associated with the gathering are part of a rare
yet significant cultural event. (See Mondragón 2004 for further explanation,
discussion and references relating to palolo harvesting in the Torres islands,
directly to the north of Vanua Lava.)
Whilst all of the Vurës speaking villages are located in a very small area
of southwest Vanua Lava, there is considerable variation in exploitation of
marine resources, depending on the proximity of the village to the coast and
reef. This is true to the extent that many people have remarked, after watch-
ing the palolo film, that they have never been involved in or witnessed the
gathering of the palolo. For these viewers, watching the film has been their
first exposure to a natural phenomenon and cultural practice that is an annual
event for members of the language community living less than an hour’s walk
from their village. Production of the film has thus presented members of the
community with access to cultural knowledge regarding an activity that they
were previously aware of only vaguely, without being privy to specific details
regarding the cultural practice, and the language associated with the activity.
This provides opportunities for discussion and transmission of knowledge
possessed by select groups within the language community, rather than the
language community as a whole.
A second more general point is connected to the widespread mispercep-
tion that once a record is made and stored in an archive, then there is no
further effort required to ensure that the knowledge is retained. This issue is
reflected in the reaction of Eli Malau, a Vurës village chief and active mem-
ber of the Vanuatu Cultural Centre’s fieldworker program, who is intensively
involved in our research and has assisted in all aspects of documentation: af-
ter watching the films he commented with pride on how he perceives their
potential usefulness as a cultural maintenance tool within his language com-
munity. He applauds the production of the films as an accessible means of
feeding research outcomes back into the community and sees that there is
even more motivation to work on cultural maintenance when material out-
puts are produced and returned to the community. He plans to use the films,
particularly the documentation of weaving, as a starting point for workshops
on transmitting cultural knowledge.
3.2. The wider Vanuatu community

There is great potential for these documentaries to be used by different groups
outside the Vurës community, within Vanuatu. The films have been broadcast
by the national television station, Television Blong Vanuatu, which is proac-
tive in screening locally made productions, as a means of promoting cultural
maintenance. The films have relevance throughout the nation as they depict
familiar activities and can be used to discuss the local diversity of cultural
practices. People in Vanuatu are very interested in slight variations in cultural
practices between islands and this is a common conversation topic. Yet, few
people have the opportunity to travel widely to other islands within Vanuatu.
Producing documentaries like ours offers groups within Vanuatu the oppor-
tunity to view and gain insight into other Vanuatu cultures.
Although the films are in Vurës, a language spoken by less than 1% of the
population of Vanuatu, these documentaries could be used in schools through-
out Vanuatu for their content, for cultural studies. We have not yet explored
this avenue for dissemination of the films, but intend to do so.
3.3. Conservation groups and other researchers outside Linguistics

Our main aim, as linguists and ethnographers, is to document and to support
the maintenance of traditional knowledge, and to enable language communi-
ties to pass on this knowledge through our documentation. However, when
it comes to such cultural practices as fishing, there can often be a certain
amount of conflict between traditional practices and environmentally sustain-
able practices, and linguists or ethnographers do not want to and should not
be engaged in the promotion of unsustainable practices. The three films on
the DVD O tere tarn̄i vovon̄on ‘Water harvests’ each describe a traditional
fishing activity. Although these films do document environmentally unsus-
tainable fishing practices, namely the mass stunning of fish, they can never-
theless be used to promote discussion of issues surrounding the sustainability
of traditional practices.
Katherine Holmes, conservation marine biologist on our project team, is cur-
rently director of the Papua New Guinea Marine Program of the Wildlife
Conservation Society. In this role she has already been making use of, and will
continue to use, our documentary focussing on fishing practices
on Vanua Lava. In conservation workshops throughout Papua New Guinea
she has been showing both the film demonstrating the use of poison to catch
fish and that centring on gathering palolo worms, with two distinct purposes
in mind.
The technique for stunning fish which is demonstrated in our film Nēn a
vun o mes ‘We poison the fish’, using leaves of the vine Derris elegans, is
highly effective, to the extent that its use is actively discouraged by conserva-
tionists as an unsustainable fishing practice (Katherine Holmes, p.c.). Vurës
speaker Eli Malau presents that message at the end of the film, commenting
that in the past this technique was only employed when a large number of
fish were required for a significant cultural event and people possessed the
cultural knowledge that the practice was not sustainable for regular use. De-
spite its unsustainability, it is a significant cultural practice and it is important
that the knowledge be maintained. In different areas of Papua New Guinea,
both with communities and with conservation groups, Holmes is able to use
the film with a dual purpose: to ascertain where the technique is practised
and the details associated with it, and as a basis for discussion of the unsus-
tainability of the practice. For this audience, only the content of the films is
relevant, not the language used.
3.4. A general audience

There is a dearth of films available depicting Vanuatu culture, and fewer still
including commentary in indigenous languages. There are a number of lo-
cally produced DVDs available, showcasing Vanuatu music, both traditional
music and modern stringband. However, few films highlighting aspects of
Vanuatu culture have been produced1. Certainly our films are the first which
are solely in a Vanuatu language, with a purely ethnographic focus. Although
our film is a low budget production and would only be of interest to a select
audience, the DVDs are being sold through the Vanuatu Museum. Making
films in minority indigenous languages available to a general audience is an
important way of promoting language and cultural maintenance. This is true
not only in terms of raising public awareness about the issue of language
endangerment, but also considering the fact that the status of the language
within the language community is elevated when the people are aware that
a film depicting their language and culture is being disseminated outside the
community.
1. Notable exceptions are the recent award-winning French language films, Sevrapek City
(Broto and Tzerikiantz 2009) and Le Salaire du Poète (Wittersheim 2008).
4. Films in and on minority languages

The issue of language endangerment has been receiving quite some attention
in the mainstream media, and a number of films have been produced on the
topic which have reached a popular audience and served to increase general
awareness about language endangerment and the need for language documen-
tation. Prominent examples are the 2008 film The Linguists, featuring David
Harrison and Greg Anderson and their work on documentation of endangered
languages, and Janus Billeskov Jansen’s 2005 film In Languages We Live –
voices of the world. Although I do not intend to claim that we are the first
or the only researchers to produce films of this nature, few entire films
have been produced to date which are not just about, but
which are also in, a minority indigenous language. I believe that researchers
who are engaged in language documentation should be taking more opportu-
nities to promote linguistic diversity by using language documentation data
to produce films telling real stories, aimed at a range of audiences. Such films
represent an important middle ground in contrast to other representations of
minority languages that are currently being presented on screen.
The success of Rolf de Heer’s film Ten Canoes is an excellent example of
how production of a film in a minority language can have a significant impact
on increasing awareness of language endangerment and contributing to lan-
guage maintenance. Apart from the impact of having an award-winning film
in a minority language, spin-offs from Ten Canoes have actively contributed
to language and cultural maintenance.2 I do not suggest that linguists work-
ing on language documentation projects are in a position to produce feature
films, yet, as mentioned, films such as those discussed in this paper represent
an important middle ground. We should take advantage of the opportunity
presented through building language documentation projects to incorporate
the production of purposeful non-academic outputs.
2. http://www.vertigoproductions.com.au/information.php?film_id=11&display=extras
But what is the reality of such an expectation, that linguists, employed
as academic researchers, and in most cases also engaged in tertiary teaching,
will be in a position to dedicate the time required for production of such non-
academic outputs? The tasks required in language documentation are already
many and varied: record texts and elicit other language data in the field, in-
cluding metadata; transcribe and translate the recordings; further annotate the
recordings using specialised software tools; archive the recordings along with
metadata; write a grammatical description and analytical papers; produce a
dictionary and literacy materials for the community; and more. These are in-
credibly time-consuming activities (cf. Wittenburg 2007). Given also
the lack of importance that is generally placed on outputs such as these within
the context of academia, it is an even greater expectation to propose that more
projects should be giving emphasis to these non-academic, popular outputs.
Yet projects such as The Sorosoro Program3 and National Geographic’s En-
during Voices Project4, which present excerpts of films in indigenous lan-
guages, show how collaboration between linguists and filmmakers can of-
fer opportunities to increase the potential audience of language documenta-
tion. Currently the video excerpts presented on the Enduring Voices YouTube
channel5 do not provide sufficient background information and translations
of content to capture a wide audience, but this initiative is a step in the right
direction in terms of presenting the work of language documentation. The
Sorosoro Program in particular offers a model of how language documenta-
tion and maintenance work can be done, engaging professional filmmakers to
work with linguists to produce video footage that is both aesthetically pleas-
ing (cf. Dimmendaal 2010) and provides linguists with data that enables
further analysis and preservation in language archives.
3. http://www.sorosoro.org/en/
4. http://travel.nationalgeographic.com/travel/enduring-voices/
5. http://www.youtube.com/enduringvoices
5. Conclusion
In conclusion, this paper has demonstrated the diverse range of audiences that
can benefit from outputs of a language documentation project, despite the fact
that the different audiences have varied needs and interests. Placing emphasis
on the production of non-academic outputs such as these reminds us of the
primary goal of language documentation: to support language maintenance
and linguistic diversity. Proposing that all language documentation projects
should produce such outputs is not unproblematic. As discussed, the work-
load involved in language documentation is already beyond most academic
linguists and many language documentation projects have neither the time nor
the funding to dedicate to producing these outputs. However, we can address
this issue both in the long term and in the short term. For the long term we
can work towards promoting the worth of these outputs, both to funders and
to the academic community. And in the short term, those language docu-
mentation projects that do not have the resources available to dedicate to such
outputs can nevertheless focus on recording high quality audio, video
and image data that could potentially be used for producing such outputs in
the future; data that includes a wealth of information on focussed topics and
goes beyond being basic language data, by depicting aspects of the lives of
the speakers in an aesthetically pleasing way.
References
Austin, Peter K., and Lenore A. Grenoble. 2007. Current trends in language
documentation. In Language Documentation and Description, Volume 4,
ed. Peter K. Austin, 12–25. London: School of Oriental and African Stud-
ies.
Broto, Emmanuel, and Fabienne Tzerikiantz (Producer & Director). 2009.
Sevrapek City. [Motion picture].
Dimmendaal, Gerrit J. 2010. Language description and the “new paradigm”:
What linguists may learn from ethnocinematographers. Language Docu-
mentation & Conservation 4:152–158.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork
and analysis. In Essentials of Language Documentation, eds. Jost Gippert,
Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York:
Mouton de Gruyter.
Gippert, Jost, Nikolaus P. Himmelmann, and Ulrike Mosel, eds. 2006. Essen-
tials of Language Documentation. Berlin, New York: Mouton de Gruyter.
Harrison, K. David, ed. 2007. When Languages Die: The Extinction of the
World’s Languages and the Erosion of Human Knowledge. Oxford: Oxford
University Press.
Lynch, John, and Terry Crowley, eds. 2001. Languages of Vanuatu: A New
Survey and Bibliography. Canberra: Pacific Linguistics.
Mondragón, Carlos. 2004. Of winds, worms and Mana: The traditional cal-
endar of the Torres Islands, Vanuatu. Oceania 74:289–308.
Vanuatu National Statistics Office. 2009. National Census of Housing and
Population. Port Vila, Vanuatu: Ministry of Finance and Economic Man-
agement.
Wittenburg, Peter. 2007. DoBeS/MPI Archive Issues. Presentation at the
National Science Foundation Workshop on Documenting Endangered Lan-
guages, Durham, New Hampshire (October 2007), available online at
http://www.lat-mpi.eu/papers/papers-2007/Presentations/newhampshire-talk.pdf
(accessed 2011/03/19).
Wittersheim, Eric (Producer & Director). 2008. Le Salaire du Poète. [Motion
picture].
Chapter 14
Filming with native speaker commentary∗
Anna Margetts
1. Introduction
As discussed by Seifart (this volume) there tend to be competing motivations
for documenting endangered languages. Documentation projects within the
DoBeS program aim at building corpora of authentic data of spoken language
that (a) are not only interesting for linguistics, but also for other disciplines;
(b) are annotated with transcriptions, translations and comments so that they
can be understood without prior knowledge of the language; and (c) can be
used by the speech community for language maintenance and revitalisation.
An example of a conflict arising from such competing motivations, and its
resolution, is discussed in Mosel (2004, 2009) for the Teop language doc-
umentation project. Mosel (2004: 3) notes that language documenters need
“to find a balance between what is interesting for them and their special field
of expertise, what is relevant for other non-linguistic disciplines and what
meets the expectations of the speech community.” In response to the Teop
speech community’s reluctance to having linguistically accurate transcrip-
tions of recordings made publicly available, the documentation project cre-
ated written native-speaker-edited versions of oral texts which were consid-
ered more acceptable and desirable by the community. The raw transcriptions
are still archived, but are less publicly available.
∗ I would like to thank Birgit Hellwig and the editors of this volume for valuable feedback
on an earlier draft. I also thank Andrew Margetts who is behind many of the technical
details reported here. As always, I cordially thank the communities and individuals on
Saliba and Logea Island who have supported our work. More speakers have been involved
with the Saliba-Logea project than can be listed here. Community members who have
worked with us on transcribing, annotating, translating, and editing texts include Nebo
Joseph, Rose Meina, Meggie Alaluku, Matthew Hawele, Penesia Eric, Mila Kelwau, and
Morris Alaluku. For their commentaries on the recordings discussed here I thank Balosi
Leman, Alaluku Leman and Mr January. DoBeS project members in Australia include
Anna Margetts, Carmen Dawuda, Andrew Margetts, and John Hajek; Ulrike Mosel was
the German host, and Kipiro Damas from the PNG National Herbarium in Lae joined as a
botanical consultant.
This chapter is concerned with commentaries accompanying video record-
ings as a research methodology and means of data collection, again as a re-
sponse to the tension arising from the different aims within a documentation
project. It discusses the benefits of this technique and addresses the nature of
the data that can be collected by this method.
Commentaries have in the past received relatively little attention as a
genre or as a data collection method. However, more recently there has been
some discussion of this topic in the field of documentary linguistics. Cablitz
(2008) describes several methods and recording setups for the creation of
procedural documents. In one of the techniques, one person performs the ac-
tivity (e.g. traditional food or medicine preparation) while another speaker
provides a commentary on what is happening. As Cablitz observes, such
commentaries do not constitute “examples of how people actually commu-
nicate with each other” as called for by Himmelmann (2006: 7), but create a
new type of communicative event along the lines described by Mosel (2004,
2009). Like Mosel she notes that new types of communicative events, while
not traditional, are interesting in themselves as they allow one to observe the
process of using language in a new situation and provide new insights into a
language’s expressive potential (Mosel 2004: 4; Cablitz 2008).
Another recent use of commentaries is the technique of orally providing
annotations to primary data thereby shortcutting the transcription and written
annotation process. In the context of documentation work with the Cup’ik
community Woodbury (2003) argues for the recording of such oral commen-
taries instead of focusing exclusively on written annotations.
... we will use the time of the few elder Cup’ik translators with wide En-
glish and Cup’ik vocabularies to produce running UN [United Nations] style
translations of many more materials, and then have younger speakers flag the
obscure words or usages for special attention. We are also considering not
transcribing everything – instead starting with hard-to-hear tapes and asking
elders to ‘respeak’ them to a second tape slowly so that anyone with training
in hearing the language can make the transcription if they wish. (2003:45)
In a similar vein, Simons (2008), Bird (2010), and Reiman (2010) discuss the
method of Basic Oral Language Documentation (BOLD) where oral annota-
tion replaces transcription and written annotation. In place of the traditional
corpus which consists of data that has been transcribed and marked up with
written annotations, the BOLD method builds an oral documentation cor-
pus where the compiled data is annotated orally and then archived (Simons
2008: 23). In this method, commentaries are tagged onto media files con-
taining the primary data; it has been applied to languages in Guinea-Bissau
(Reiman 2010) and in Papua New Guinea as part of the BOLD PNG project
(Bird 2010).1
This chapter introduces a methodology that is distinct from the BOLD ap-
proach, but bears some similarities to that of Cablitz (2008). I will be primar-
ily concerned with commentaries to video recorded events, where the com-
mentary itself constitutes the primary linguistic data (sections 2.1 to 2.3).
However, I also address scenarios where the commentary provides annota-
tions to linguistic events or other types of performances (section 2.4).
The chapter is based on research within the DoBeS Saliba-Logea doc-
umentation project in Papua New Guinea which started in 2004 but which
continues earlier research with the Saliba community since 1995. There are
about 2500 speakers of Saliba-Logea who live predominantly as subsistence
farmers on Saliba and Logea Island and the surrounding area in Milne Bay
Province of Papua New Guinea. The two main dialects, Saliba and Logea, are
mutually intelligible and differ mainly in their lexicon.
During our fieldwork we experimented with different methodologies of
data gathering and of collaborating with the community. Some of the tech-
niques had been conceived of before the beginning of the project, informed
by previous fieldwork, such as interviews of older speakers conducted by
younger family members. Other techniques developed in the course of the
project as new situations arose. Most of our texts are transcribed recordings
of oral language but the database also includes a small number of written
texts.
So far the project has created a corpus of about 37 hours of transcriptions.
Some recordings are of natural conversations, others are based on elicitation
with non-verbal stimuli, including the frog story, the pear story and a number
of the stimuli developed at the MPI for Psycholinguistics.2 However, the ma-
jority of sessions in the database follow the common format of monologuous
narratives where one speaker tells a story or provides a description of certain
activities.
1. http://www.boldpng.info/
2. http://fieldmanuals.mpi.nl/
The dominance of narratives is due to a number of factors including pref-
erence by the community for texts such as traditional narratives as record-
worthy, as described by Mosel (2009), but no doubt also through a bias on
the part of the fieldworkers, as described by Foley (2003). Overall, conver-
sational data is more difficult to record because speakers do not necessarily
recognise conversation as an event worthy of documentation and the topics
of conversation are often more private and less suitable to be shared publicly
(e.g. gossip). It is also easier to prompt speakers to tell a story than to simply
have a conversation. In addition, conversational data is generally more com-
plex due to the presence of multiple speakers in one session and overlapping
speech. It is therefore more difficult and time-consuming to process, annotate,
and analyse.
The commentaries discussed in this chapter extend the project’s database
in two ways. One of the commentaries we recorded provided a new text type
to the corpus – a sports commentary which consists of a stream of sponta-
neous language. Apart from specific linguistic features which are of interest
in this text, the addition of the commentary broadened the range of commu-
nicative events represented in the corpus. This is of value in itself as a broad
spectrum of text types is essential both from the point of view of documenta-
tion and for the purpose of linguistic description (see Foley 2003, Mosel 2009
among others). The second commentary provided essentially conversational
data which, as mentioned, is still underrepresented in the Saliba-Logea cor-
pus and is one of the genres more difficult to record. The project’s experience
in two different recording situations and with different speakers shows that
commentaries do not constitute a single text type or genre but depend on the
speaker’s interpretation of the task.
Some recordings in our database capture community events and cultural
practices. They are classified as event documentation. If they contain linguis-
tic data of sufficient quality, they are transcribed and fed into the project’s nor-
mal data workflow. Otherwise they are deposited to the archive with metadata
but without further annotations. These sessions include recordings of canoe
building and repair, house building, parties, sports matches, drumming, and
other types of performances. The remainder of this chapter is concerned with
commentaries as a method of enhancing the value of such primarily non-
linguistic recordings. Audio commentaries about the filmed events make the
recordings more useful and interesting to the community but they also pro-
vide valuable language data for the project’s corpus.
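The handling of event documentation described above amounts to a simple decision rule, sketched here for illustration. The class, the field names and the step labels are assumptions rather than the project's actual workflow or tooling; SoccerMatch_01EZ is a session name from the project, whereas HouseBuilding_example is a made-up label.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    name: str
    usable_linguistic_data: bool  # the team's judgement of recording quality
    has_commentary: bool = False  # a native speaker commentary was recorded


def plan(session: Session) -> List[str]:
    """Return the processing steps for an event-documentation session."""
    steps = ["archive with metadata"]
    if session.usable_linguistic_data or session.has_commentary:
        # Illustrative stand-ins for the project's normal data workflow.
        steps = ["transcribe", "translate", "annotate"] + steps
    return steps


if __name__ == "__main__":
    print(plan(Session("SoccerMatch_01EZ", usable_linguistic_data=True)))
    print(plan(Session("HouseBuilding_example", usable_linguistic_data=False)))
```

Under this rule, a recorded commentary is what moves an otherwise primarily non-linguistic recording from the archive-only path into the annotated corpus.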
2. Filming with running commentary

On past occasions the team had filmed some special events for the commu-
nity. Ad hoc attempts to add linguistic value to such recordings by asking
from behind the camera what people were doing tend to create only minimal
interactions, like the exchange in (1).
(1) Q: Saha kwa gina-ginauli?
what 2PL RED-do
‘What are you doing?’
A: Ya nekwa-nekwali.
1SG RED-peel.tubers
‘I’m peeling tubers!’
The answer in (1) was given by a woman at a wedding who was peeling an
enormous taro tuber which her family had brought for the feast to support
the relatives who were hosting the party. It was a special type of taro and a
special type of gift, and there could have been much more said about it, but
the question was clearly not suitable to elicit a more detailed response.
When the documentation team was asked to film a soccer match and the
surrounding community activities we were facing a day of filming with little
chance to record any usable linguistic data. In this context we decided to in-
vite a speaker to provide a running commentary of the match. This strategy
was successful and elicited a stream of spontaneous spoken language. We
applied this technique in another context with similar success and, in hind-
sight, realised that we could have enhanced earlier recordings by employing
this method. In the following I discuss the recordings for which we invited
commentaries and one example of an occasion where we did not and how
a commentary would have improved the value of the recording both for the
community and for the database.
2.1. The soccer match

The soccer match was the season final of the Sawasawaga area and a big
community event.3 We invited a speaker who had previously worked with
the project on recordings to provide the commentary. We set him up with a
radio microphone which allowed him freedom of movement independent of
the camera. He gave it a very good go but was somewhat shy in his reporting
and when there was not much happening on the soccer field there were many
long pauses in the commentary.
3. The final was between “Cycas” from Bwasitau and the “Station Warriors” from Sawa-
sawaga. The match had been postponed several times because of a death in a neighbouring
community and of the uncertainty when the funeral would be held. The teams’ preparation
for the match included all-night prayer meetings which continued each time the match was
postponed. Players must have been very tired by the time the day came and it was a rel-
atively slow match, ending in a 0:0 draw. In the ensuing penalty shoot-out, Sawasawaga
won 1:0.
At some point, without warning, the microphone was handed to another
speaker and we did not know for sure where or who the new commentator was
(this is an interesting aspect of radio microphones: one can lose the speaker!).
The new commentator was an older speaker with whom we had not worked
before. He had lived in the capital Port Moresby for many years, was a sports
fan, and had seen many matches on TV with commentaries in English. He
was clearly familiar with sports commentaries as a genre and his commentary
was engaging and professional. It went on virtually without pauses for the
remainder of the two hour match, covering lulls in the game with comments
on the teams, their supporters, the referee, and the weather (perfect: no rain
and not too hot). He also included short interviews with the referees – clearly
in order to make the commentary livelier and more engaging.4
4. The commentaries were provided by Balosi Leman and Mr January. The recordings of the
match were fed into the data workflow as four sessions: SoccerMatch_01EZ (commentary
Balosi Leman), SoccerMatch_02FA, SoccerMatch_03FA, SoccerMatch_04FA (commen-
tary by Mr January across three video tapes).
There are a number of interesting aspects to the recorded data. The com-
mentary constitutes a novel type of communicative event in the Saliba-Logea
context and a new text type in the project’s database and so adds to the vari-
ety of texts represented in the corpus. As discussed, a wide spectrum of text
genres is crucial both from the point of view of documentation and for the
purpose of linguistic description. Foley (2003) and Mosel (2009) among oth-
ers discuss the fact that different text types are rich in different constructions
and grammatical descriptions can as a result differ considerably if they are
primarily based on different text types. Mosel (2009: 17) demonstrates this
with topic constructions in Teop.
A broad spectrum of text types is also important for other reasons. Sei-
fart (2008) discusses the representativeness of documentation corpora, and
Foley (2003: 86) warns that “the effect of our native language ideology on
our products of description can be heavily disguised ... The only corrective I
can suggest for our possibly misleading descriptive flights of fancy is ... stay
close to the full range of data, all register and genre types”.
While the commentary is monologuous it is very different from other
monologues in the corpus, such as more planned narratives, but also from
spontaneously produced monologues. Running sports commentaries are a
unique text type in that the commentator is not aware of the outcome of the
match while reporting and therefore cannot structure the text according to a
certain result.5 The speech in the soccer commentary is spontaneous and off
the cuff in this sense as the commentator invents techniques of filling the gaps
and producing a stream of continuous fluent language. It is not an established
genre in Saliba-Logea and the commentator clearly produces an imitation of
a western-style sports commentary. In this sense the commentary data can be
considered artificial while at the same time clearly highly spontaneous.
Beyond constituting a new text type, the commentary contains some inter-
esting linguistic features including several instances of the reciprocal prefix
which is extremely rare in the Saliba-Logea text data:
(2) Se  hai-bayobayoa molosi.
    3PL RECP-struggle real
    ‘They are really struggling with each other.’
The text also contains the only examples of code switching between Saliba
and Tok Pisin in the corpus.6 The commentator started out in Saliba but then
switched between Tok Pisin and Saliba several times. Two examples are given
in (3) and (4).7
5. Presumably the genre only emerged in response to advances in media technology and
did not exist before the advent of radio. However, commentary-style reporting could in
principle also take place for the benefit of a physically present but non-seeing audience.
6. English is the areal lingua franca in Milne Bay Province and typically only people who
lived in other parts of PNG know Tok Pisin. This has been changing somewhat in recent
years, at least in the provincial capital Alotau, with the influx of people from other parts of
PNG.
7. Items like winim ‘win’ and tim ‘team’ are considered loan words here rather than instances
of code switching.
(3) Saliba     ta    kaikewa kabo kaiteya tim  ye  winim,
               1INCL watch   TAM  which   team 3SG win
               ‘We watch which team will win,’
               nige kabina     ta    kata.
               NEG  its.nature 1INCL know
               ‘we don’t know’
    Tok Pisin  Aah!  tufella tim,  tufella tim.
               INTRJ two     team  two     team
               ‘Ah, two teams, two teams.’
(4) Tok Pisin  i   kisim lau i   pasim finish,
               3SG get   go  3SG pass  finished
               ‘he goes and gets it he passes it,’
    Saliba     kabo kaiheya ye  lau-lau sola
               TAM  game    3SG RED-go  still
               ‘the game is still going’
               sola  nige hada hesauna ye  tole.
               still NEG  goal other   3SG put
               ‘nobody put in a goal yet.’
Another characteristic of the commentary concerns the structuring of the text
through parallel clauses and repetitions. Parallel constructions describing the
teams and their actions seem to be designed to highlight the teams’ equality
and possibly also the commentator’s impartiality. Examples are given in (5)
and (6).
(5) Cycas     ye  bayao  kalili na
    team.name 3SG strong very   CONJ
    ‘The Cycas team is really strong and’
    Station warriors hinage ye  bayao  kalili na.
    team.name        also   3SG strong very   CONJ
    ‘the Station warriors team is also really strong.’
(6) Cycas     yo-di   tau support meta  ye  bado kalili
    team.name CLF-3PL man support TOPIC 3PL many very
    ‘There are very many Cycas supporters,’
    na   Station warriors hinage yo-di   tau support ye  bado kalili
    CONJ team.name        also   CLF-3PL man support 3PL many very
    ‘there are also many Station warriors supporters.’
In terms of repetitions, the commentary shows a different pattern from other
text types. It features continuing repetition of phrases while the action they
describe is ongoing, as is common in western-style sports commentaries. (See
Müller 2007 and Lavric et al. 2008 for discussion of the linguistics of football.)
(7) Cycas     ye  hai ye  dobi-dobima,
    team.name 3SG get 3SG RED-come.down
    ‘The Cycas team got it, it’s coming down,’
    ye  dobi-dobima   ye  dobi-dobima   ye  dobi-dobima.
    3SG RED-come.down 3SG RED-come.down 3SG RED-come.down
    ‘it’s coming down, it’s coming down, it’s coming down.’
Besides providing linguistic material, the commentary also enhanced the value
of the recording for the community by adding some of the excitement of the
match to the recording. It also provides information which cannot necessarily
be gleaned from video by naming players, stating which team is on the ball,
and who has a chance of scoring, etc.
2.2. The canoe race

On another occasion the documentation team was preparing to film a chil-
dren’s canoe race. We again invited one of the organisers to provide a running
commentary.8
The style of this commentary is very different from that accompanying
the soccer match. The commentator is basically not speaking for the micro-
phone but to the other race organiser, to by-standing spectators, and to the
racing participants out on the bay. Overall the data is conversational rather
than monologuous and it constitutes some of the clearest audio recordings of
casual conversations in the database. From the point of view of conversational
data, the drawback of the recordings is of course that none of the speakers is
in the picture as the camera is trained on the racing canoes and that when in-
terlocutors are not close to the commentator only his side of the conversation
is recorded through the lapel microphone. However, the audio quality is still
better than some of our other audio-only conversational data.
8. The race was initiated by Balosi Leman, the sailing canoe expert and soccer commentator
with whom we had worked before. He organised it with the help of his brother Alaluku
Leman who also provided the commentary.
Apart from the conversational nature of the commentary, interesting as-
pects of the data include examples of boating and sailing terminology in their
natural context of use, as in (8) to (10). Such terminology is otherwise mainly
represented through elicitation and is rare in the corpus.
(8) Se  kuke.
    3PL set.sail
    ‘They are setting sail.’
(9) Se  giyuli.
    3PL go.around.point
    ‘They are sailing around the point.’
(10) Se  yatowa.
     3PL tack
     ‘They are tacking.’
The recordings also include examples of English vocabulary and of code-
switching between Saliba and English. Speakers tend to consciously avoid
this during recordings and so, compared to everyday language, code-switch-
ing is underrepresented in the database. Some examples are given in (11) to
(13).
(11) Nige ko  wose!  Ko  wose   ko  disqualified.
     NEG  2SG paddle 2SG paddle 2SG disqualified
     ‘Don’t paddle. If you paddle you will be disqualified.’
(12) Next time, huya hesau namwa-namwa-na    kabo ku  kita-gai.
     next time  time other RED-good-3SG.POSS TAM  2SG see-1EXCL
     ‘You’ll see us next time.’
(13) You are safe to walk, kwa lau namwa-namwa, you act properly.
     you are safe to walk  2PL go  RED-good     you act properly
     ‘You are safe to walk, walk safely, act properly.’
Again, the commentary enhanced the video footage both in terms of adding
linguistic data for documentation and analysis and in terms of benefit for the
community. Without the explanations of what is happening the video images
basically just show a few sailing canoes and are otherwise uninterpretable.
2.3. The gwalisaekeno feast

There are other events that were filmed as part of the project, such as wed-
dings, feasts, or the gathering and cooking of malamala (sea worms), which
would have benefited from a running commentary. In this section I describe,
based on the example of a traditional feast, how a commentary would have
enhanced the documentation.
A gwalisaekeno feast is a significant community event with many aspects
and sub-events that are worth documenting. The feast is part of the traditional
arrangements of celebrating and cementing a marriage and the ensuing bond
between families. The feast is held by the woman’s or the man’s side after
the couple has lived together for a few years. One family calls the other for
the gwalisaekeno at a set date. Both sides collect pigs from people on whom
they can call and some will be slaughtered for the feast. The extended family
will come to support either side by bringing contributions of foodstuffs. The
hosts prepare a feast for their in-laws but will not partake themselves. The
visiting party brings pigs and food gifts and there is competition to match
or outdo the other side’s efforts. The visitors also bring gifts of special fire-
wood as a present for the mother-in-law to compensate her for the loss of help
by her daughter or son who is now married. If the visitors come from further
away they will typically arrive by boat, blowing a cone shell trumpet or a
modern substitute which indicates that the boat is carrying a gift of pigs. As
they arrive, the visitors run into the village carrying the pigs, yams and other
goods on decorated stretchers. The mother of the visiting party or another
female relative will launch into a tirade about the bad and lazy behaviour
of the son or daughter-in-law. She will complain about the good-for-nothing
youngster who doesn’t help, doesn’t know how to work hard, and doesn’t
fetch water – often cracking into laughter while performing this traditional
task. At the end of the party the visitors return to their village with their
gifts and the host family distributes the received goods to the members of
their own extended family who supported them with their contributions. The
visiting side typically reciprocates after a few years with a corresponding
feast. Following this counter exchange the marriage is settled and divorce
would be quite involved for both sides, because the exchanged gifts would
have to be returned or compensated for.
We filmed the preparations of a gwalisaekeno feast on the side of the
visiting party, their arrival by boat, and the party itself.9 The recordings are
more or less free of usable linguistic data (apart from some speeches) but they
provide an interesting documentation of one of the last traditional feasts still
practiced in the area. However, the video documentation is basically only as
good as the background knowledge of the viewer because many aspects of
the recordings are not self-explanatory. As described, there are many aspects
of this traditional feast which follow regular patterns and which are part of
the cultural knowledge of the community, but their meaning and significance
or even the fact that they are taking place cannot necessarily be gleaned from
video footage of the events. A running commentary by a community member
could have bridged this gap and made the documentation more meaningful
for outside viewers by explaining what was happening.
In the course of the project we also recorded procedural texts and inter-
views with several speakers about gwalisaekeno feasts.10 These sessions are
located in a different branch of our data tree but they are linked to the video
recording of the feast through metadata and the project’s file naming practice.
A running commentary during filming would have strengthened the link be-
tween the two types of recordings (the event itself and the meta-texts about
the event) by naming sub-events as they occurred, such as the traditional com-
plaints by the mother, the carrying of the firewood, and the pig slaughter,
which are described in the meta-texts but filmed without commentary in the
documentation.
Of course it is possible to add a running commentary after the event by
reviewing the recording with speakers and this would provide a useful anno-
tation to the gwalisaekeno videos. However, such commentaries constitute a
separate step in the data gathering and processing (and therefore may or may
not in fact happen). One also needs to be aware that adding a commentary
later is a different technique which will result in different kinds of linguis-
tic data. For example deictic terms employed by a commentator on site may
be quite different from those used by a commentator who is reviewing the
recordings on a screen and commenting on them.11
9. The relevant sessions are Gwalisaekeno_03AN, Gwalisaekeno_04AN, and Gwalisaekeno_05AN.
10. See e.g. Gwalisaekeno_01BF, Gwalisaekeno_02BF, and Giyahi_01AA.
2.4. Commentary as annotation

So far we have been concerned with commentaries as a means of adding lin-
guistic data to recordings of primarily non-linguistic events, similar to the
methodology discussed by Cablitz (2008). However, the technique can also
be applied to record annotations to linguistic events or other types of perfor-
mances as in the case of the BOLD methodology (Simons 2008, Bird 2010,
and Reiman 2010) and to provide metadata. The oral recording of basic meta-
data like the date, location and the names of the participants at the beginning
of a session is already a standard in language documentation, but commen-
taries can be used much more extensively along these lines.
The previous section discussed the value that a commentary to the gwal-
isaekeno recordings could have added by providing explanations on the activ-
ities being filmed. But besides commenting on the activities it could also have
provided extended metadata-type information such as the key performers’
names and their position in the respective families (e.g. the mother’s eldest
brother), or annotations such as which aspects of the party are traditional and
which are modern takes on traditional themes.
Depending on the type of performance being filmed, the audio recording
of the event itself can be very important. This would be the case with the
speeches that were held as part of the gwalisaekeno feast, or when record-
ing musical performances or festivals. Ideally, these aspects of the signal
would not be inseparably overlaid with the commentary and so a record-
ing of the event without the commentary should be created in parallel. This
can be achieved by using separate microphones, one for each channel. With
this setup one microphone is dedicated to the commentary, the other micro-
phone is recording the event. This allows for annotations and comments to be
recorded simultaneously with the actual event without interfering with it. See
Margetts and Margetts (in print) for technical details on such recordings.
11. In principle, such different types of commentaries could be used to create parallel corpora
in order to explicitly investigate the differences between them. A similar methodology is
described by Mosel (2009) to compare a narrative about killing a chicken and a procedural
text on the same topic, both elicited with the same visual stimuli. See also Mosel (2008,
2009) on comparing parallel corpora of raw oral versions and edited versions of the same
texts.
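The two-channel setup can be illustrated with a minimal sketch of how such a recording might be separated afterwards. It assumes an uncompressed stereo WAV file in which the left channel carries the commentary microphone and the right channel the event microphone; this channel assignment, the function names and the file paths are illustrative assumptions and are not taken from the project's actual workflow.

```python
import wave


def split_stereo(frames: bytes, sample_width: int) -> tuple[bytes, bytes]:
    """De-interleave raw stereo PCM data into (left, right) byte strings."""
    frame_size = 2 * sample_width  # one sample per channel in each frame
    if len(frames) % frame_size != 0:
        raise ValueError("data is not a whole number of stereo frames")
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + sample_width]
        right += frames[i + sample_width:i + frame_size]
    return bytes(left), bytes(right)


def split_wav(stereo_path: str, commentary_path: str, event_path: str) -> None:
    """Write the two channels of a stereo WAV file to two mono WAV files.

    Assumption: left channel = commentary, right channel = event.
    """
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel recording")
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    left, right = split_stereo(frames, width)
    for path, data in ((commentary_path, left), (event_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(data)
```

Keeping the channels in separate files means that the event recording (for example the speeches at a feast) can be archived and transcribed without the commentary, while both files remain aligned in time because they derive from the same recording.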
3. Conclusion
Language documentation projects aim to meet several sometimes seemingly
conflicting goals and are subject to different expectations by different par-
ties. Making materials produced for the community valuable for linguistic
research and vice versa can reduce the tension between conflicting demands
and make for more productive fieldwork.
Commentaries have so far been little discussed as a method of data gath-
ering. In particular, running commentaries on recordings of primarily non-
linguistic events prove to be a good source of high-quality data and a good
way of recording annotations. Apart from making filming more worthwhile
from the point of view of the documentation project, running commentaries
also enhance the quality of the documentation for the community. The prob-
lem with many non-linguistic recordings is that they are not necessarily self-
explanatory and often require additional information to be interpreted, even
for people who may have been present during recording. Commentaries can
therefore both make the recordings more interesting to watch and create a
more detailed and comprehensive documentation.
The running commentaries we recorded contain interesting linguistic data
including some features otherwise rare in the database and they contributed a
new text type to the Saliba-Logea corpus, thus adding to the variety of speech
types within the database.
Questions to address when preparing to film with a commentary include
which events are appropriate to be filmed and who is a good commentator.
Filming on invitation by the community is a good place to start. But in our
experience, some events for which we were initially reluctant to ask permis-
sion to film, such as mourning rituals or funerals, might in hindsight have de-
lighted the community as a memento and a documentation of the important
event. It can be hard to find the right people to ask for permission to film be-
cause they may be occupied with scripted roles in the event, e.g. as mourners.
Inquiring ahead of time about the appropriateness of filming certain events is
therefore indispensable. Another point to be taken into account is the person-
ality of the commentator. It helps if they like to talk and are engaged with the
event. Another aspect is whether the commentator is perceived as appropriate
to comment on the event, and what their relation is to the people and events
being filmed. This includes aspects like their seniority, their knowledge of the
event, (perceived) partiality, family affiliations, etc.
In sum, inviting a running commentary can help in creating a richer doc-
umentation for the project and for the community and should perhaps be
the norm rather than the exception for video recordings of primarily non-
linguistic events. Running commentaries are also a productive technique for
annotating linguistic events and performances and for recording metadata.
Abbreviations
1, 2, 3  first, second, third person
CLF      classifier
CONJ     conjunction
EXCL     exclusive
INCL     inclusive
INTRJ    interjection
NEG      negation
PL       plural
POSS     possessive
RECP     reciprocal
RED      reduplication
SG       singular
TAM      tense, aspect, mood
TOPIC    topic marker
References
Bird, Steven. 2010. A scalable method for preserving oral literature from
small languages. Proceedings of the 12th International Conference on
Asia-Pacific Digital Libraries, Gold Coast, Australia, June 2010.
http://www.boldpng.info/.
Cablitz, Gabriele. 2008. The making of procedural documents on the Mar-
quesas and Tuamotu islands (French Polynesia). Presentation at Language
Documentation Methods in Focus (DoBeS meeting, June 2008).
African Studies.
ton de Gruyter.
Lavric, Eva, Gerhard Pisek, Andrew Skinner, and Wolfgang Stadler, eds.
2008. The Linguistics of Football. Tübingen: Gunter Narr.
Margetts, Anna, and Andrew Margetts. To print. Audio and video recording
techniques for linguistic research. In The Oxford Handbook of Linguistic
Fieldwork, ed. Nicholas Thieberger. Oxford: Oxford University Press.
ter 1(3):3–4.
Mosel, Ulrike. 2009. Collecting data for grammars of previously un-
researched languages. Unpublished manuscript, available online at
http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed
2011/02/26).
Müller, Torsten. 2007. Football, Language and Linguistics. Tübingen: Gunter
Narr.
Reiman, D. Will. 2010. Basic oral language documentation. Language Doc-
umentation & Conservation 4:254–268.
Seifart, Frank. 2008. On the representativeness of language documenta-
tion. In Language Documentation and Description, Volume 5, ed. Peter K.
Simons, Gary F. 2008. The rise of documentary linguistics and a new kind
of corpus. Presentation at the 5th National Natural Language Research
Symposium, De La Salle University, Manila (November 2008), available
online at http://www.sil.org/~simonsg/presentation/doc%20ling.pdf (accessed
2011/02/26).
Index
access regulations, 46 collaborative
Admiralty Islands, 269, 275 fieldwork, 23, 251
Äiwoo, 296, 301 lexicon creation, 48, 241, 242, 245,
afterthought, 10, 152, 153, 160, 161, 268
164–170, 172 transcription, 10, 202–208, 210
Ambrym (South East), see South East workspace, 241–247
Ambrym commentary, 13, 315, 321–335
animacy, 68, 73, 76, 80 community (DoBeS), see DoBeS
ANNEX, see tools community (language), see language
ARBIL, see tools community involvement, 5–6, 10, 11,
archive (DoBeS), see DoBeS 13, 23–24, 57, 224–226, 240,
archiving, 8, 21, 33–53, 81, 256 241, 243–249, 251, 252, 256,
Arosi, 264, 269, 270, 273, 274, 280, 257, 291, 297, 323
281 complex sentences (prosody of), 160–
aspect system, 121–145 169
Athapaskan, 202 conjunctivism, 114, 116
Austronesian, 64, 82, 238, 264 consent, 44, 45, 293, 294
authority, 293–294, 297, 299–302 contact (language), see language
Awetí, 55, 64, 65, 67–69, 71, 73, 75, contrastive (focus), 101, 170, 171
81, 82 copyrights, 45, 295, 300
corpus linguistics, 57
Basic Oral Language Documentation cultural heritage, see heritage
(BOLD), 322, 323, 333 cultural knowledge, 12, 21, 151, 223,
Bauan, 267–271 228, 230, 237, 240, 241, 245,
Beaver, 201–219 246, 249–251, 256, 265, 311,
Bislama, 306, 308, 309, 311 313–315, 332
Bismarck Archipelago, 269
boundary (intonation), 153, 159–162, Danish, 181
168 data
Brazil, 64, 81 collection, 6, 33, 156, 177, 305,
Bugotu, 275 322
infrastructures, 8, 33–53
capacity building, 241, 247–249 utilization, 50–52, 305
Carolinean, 269, 270 data (lexical), see lexical
Cèmuhî, 264, 269–271, 273 description (language), see language
Cheke Holo, 269–271, 273, 280, 281 dictionaries
clitic word, see word bilingual, 125, 229, 248, 254, 264,
Code of Conduct (DoBeS), see DoBeS 266, 311
339
Unauthenticated
340 Index
descriptive, 265 archive, 7, 36, 51, 82, 227, 254

documentation, 24, 253, 256 Code of Conduct, 295
encyclopaedic, 11, 223–257, 306 community, 53
ethnographic, 11, 226, 263–283 data, 35
general, 230, 263–266, 282, 283 initiative, 21
maintenance, 253, 256 programme, xi, xiii, 1, 2, 6, 7,
mini-dictionaries, 225, 248, 283 18, 33, 35–37, 39, 40, 44,
monolingual, 125, 266 45, 55, 61, 151, 252, 294,
multilingual, 229, 311 321
multimedia, 11, 223–257, 306 projects, 8, 55, 81, 82, 121, 223–
online, 223 225, 227, 305, 323
thematic, 225, 283 documentary film, 305–318
translation aid, 265, 266 documentary linguistics, xi–xiii, 2–6,
dictionary definitions, 226, 229, 230, 9, 10, 18, 44, 60, 81, 89, 151,
232, 233, 244, 256, 264–268, 189, 202, 224, 225, 252–256,
274, 276–282, 309–312 322
monolingual, 226, 230, 231, 247,
249–251 Eastern Solomons, see Solomon Is-
vernacular, 230–233, 242, 248 lands
dictionary-making, see lexicography editing
diphthong feature, 183, 187, 189–198 annotations, 48
diphthongs, 10, 177–198 dictionary entries, 11, 224, 240–
analysis and description tool, 10, 251
196–198 during transcription, 207–219
dynamic vs. static, 192, 193, 197 film, 308
falling vs. rising, 183, 186, 188, text, 10, 24
191–194, 196, 197 web-based, 11, 224, 240–251
inventory, 178 ELAN, see tools
inventory (Finnish), 187–196 elicitation
opening vs. closing, 183, 188, 191– strategies, 267
193, 196 tasks, 161, 219, 323
phonetic vs. phonological, 179, encyclopaedia, 226, 230, 237, 249, 277
181–182, 186, 194, 196–198 encyclopaedic information, 226, 227,
prominence in, 183–186, 189–190, 229–234, 237, 240, 241, 247–
196, 197 249, 264, 276–278, 311
semi-diphthongs (Lithuanian), 185– endangered languages, xi, 1–3, 5, 7,
186 10, 17–28, 53, 55–82, 126,
system (Finnish), 187, 190–196 154, 223–257, 293, 305, 306,
discontinuity, see noun phrase 316, 321
disjunctivism, 115, 116 Estonian, 124, 189
dislocation, see noun phrase ethics, 2, 5, 23, 44–46, 50, 291, 293,
DoBeS 294, 303
Unauthenticated
Index 341
ethnobotany, 224, 234, 236, 248 indigenous

classification, 227, 228, 236, 239,
fieldwork, xi, xii, xvi, 3, 11, 12, 51, 240
121, 154, 172, 201, 202, 208, genre, 56, 57, 60
227, 231, 240, 241, 247, 250, knowledge, 21, 226, 228, 237, 238,
251, 257, 267, 291–303, 323, 251
334 linguistic practices, 13, 56, 60
fieldwork (collaborative), see collab- word meaning, 231, 250, 251
orative Indo-European, 64, 66, 82
Fiji, 267, 269, 270 infrastructures (data), see data
Fijian intonation, see prosody
Standard, see Bauan Iran, 64, 82
Western, see Wayan Iranian, 64, 82
Finnish, 10, 177–198
fish names, 11, 240, 263–273, 283 Jaminjung, 10, 151–172
focus, 99, 101, 108, 153, 158, 162,
166, 168, 170, 171, 215, 216 Kapingamarangi, 269–272
folk Kharia, 9, 89–117
generic, 271, 274, 279 Kiriwina, 269, 273, 275
specific, 271, 274, 279 knowledge (indigenous), see indige-
taxa, 275, 276 nous
taxonomies, 224, 228, 234–240, knowledge (lexical), see lexical
278, 279 Kuanua, 269, 270, 273, 274
Forest Enets, 9, 55, 64, 65, 121–145 Kwamera, 269, 270, 273, 274
Gapapaiwa, 275 Kwara’ae, 264
Gela, 269–273, 280, 281
genre (indigenous), see indigenous language
German, 64, 73, 75, 76, 182, 193 community, 291–294, 299–303,
glossing, 19, 22, 28, 62, 63, 157, 226, 305–308, 311–313, 316
265, 266 contact, xix, 4, 7, 17, 24–27
Gorani, 64–82 description, xv, 10, 186, 207
grammatical description, 9, 126, 128, maintenance, 20, 218, 223, 226,
145, 152, 326 227, 252–256, 306, 308–312,
grammaticography, 9, 121–145, 189 316, 318, 321
revitalization, 5, 20, 143, 225, 227,
Hanunóo, 264 252–256
Hawaiian, 264 language typology, see typology
heritage Lau, 269, 270, 273
cultural, 7, 17, 20, 21, 27, 60, 65, Lenakel, 269, 270, 273, 274
246, 293 lexica, see dictionaries
linguistic, 12, 246 lexical, 159, 243
oral, 231 concept, 265
Unauthenticated
342 Index
data, 224–227, 235, 236, 252–254, morpheme glossing, see glossing

256 morpho-syntactic word, see word
database, 225, 227, 236, 242, 244, Mota, 280, 281
247, 248, 252, 253, 256, 257 multimedia lexicon, see dictionaries
entries, 48, 95, 115, 224–227, 229– multimedia lexicon tool, see tools
234, 236, 242–245, 247, 248, Munda (South), see South Munda
256, 257 Muyuw, 275
expressions, 8, 70
knowledge, 223, 240, 251, 256, native speaker, 2, 9, 10, 13, 23, 24, 58,
277 60, 89, 104, 106, 113, 117,
networks, 227 143, 201–219, 231–255, 265,
unit, 228, 265–267, 274–276 268, 282, 321–335
verb, 131, 132 New Caledonia, 264, 269, 270, 273,
lexicography, xvi, 9, 121–145, 223– 275
257, 263–283 Ngaliwurru, 154
LEXUS, see tools Niuatoputapu, 269–271
life form, 238, 278–282 Niuean, 269, 270, 273, 281
linguistic heritage, see heritage Northern Central Vanuatu, 275
linguistic practices (indigenous), see Northern New Guinea, 275
indigenous Northern Samoyedic, 121
literacy, 203, 231, 295, 300, 317 noun phrase
Lithuanian, 180, 184–186, 189 discontinuous, 10, 153, 159, 160,
long-term preservation, 5, 21, 38–43 169–172
Lou, 275 right-dislocated, 152, 153, 160,
165–167, 172
Nyelâyu, 269, 270
maintenance (language), see language
major generic, 278–282 Oceanic, 11, 64, 82, 263–283
Manam, 275 ontologies, 52
Marovo, 269–271 ethno-/informal, 224, 228, 236–
Marquesan, 11, 223–248, 269, 270 238
Marshallese, 269, 270, 280, 281 formal, 238
Melanesia, 264, 270, 271, 273, 274, oral heritage, see heritage
296 orthographical word, see word
Meso-Melanesian, 275 orthography, 9, 89, 104
metadata, 2, 7, 8, 19, 21, 24, 26, 33, practical, 201, 204
35–37, 39–43, 48–52, 236, working, 177
254, 257, 317, 324, 332, 333, Owa, 269, 270, 280, 281
335
metalanguage, 9, 24, 26, 121–145, 278 Paamese, 269, 273, 274, 280, 281
Micronesia, 264, 269–271 Paicî, 269, 270
Micronesian, 275 Palauan, 269, 270
Mirndi, 154 Papua New Guinea, 269, 315, 323
Unauthenticated
Index 343
Papuan, 60 Rotuman, 269, 273, 274

Papuan Isolate, 64, 82 Roviana, 269, 270, 273–275, 280, 281
Papuan Tip, 275 Russian, 9, 122–126, 129, 135, 138–
parallel text typology, 56, 59 144
payment, 292
PENTA model of intonation, 152, 154, Saliba-Logea, xiv, 321–335
157–159 Samoan, xii, xiv, xvii–xix, 280, 281
permission, 36, 45, 243, 292–295, 298, Samoyedic, 9, 64, 121, 127, 128, 142
302, 334 Satawalese, 269–272
phonological word, see word Savosavo, 64–82
Pileni, see Vaeakau-Taumako sense unit, 266
pitch, 95–98, 115, 151, 156–158, 160– Solomon Islands, 12, 64, 82, 269, 273,
162, 166–168, 182, 184, 197 296
politics (community), 12, 291–303 Eastern, 269
Polynesia, 223–225, 249, 264, 269– Western, 269
271 sonority, 177–198
Polynesian, 245, 269, 301 South East Ambrym, 275
Polynesian communities, 246, 301 South East Solomonic, 275
Polynesian Outlier, 269, 270 South Munda, 89, 90
Polynesian Triangle, 264, 269, 271 South Vanuatu, 275
Ponapean, 269, 270, 275 speakers’ intuitions, 9, 98, 104–115
Preferred Argument Structure, 61, 70, speech genre, 4–5, 8, 13, 55, 57, 58,
72, 79, 81 60–61, 64, 66, 156, 231, 250,
projects (DoBeS), see DoBeS 267, 312, 322, 326, 327
prosody, 151–172 sports commentary, 324, 327
Proto Oceanic, 274, 275 Standard Fijian, see Bauan
Puluwat, 269, 275, 280, 281 standards, 2, 21, 33–53, 249
Swedish, 181–183
quantitative analysis, 8, 56, 64, 68, 71,
syllable structure, 154, 177, 178, 197
81, 152, 158, 172
Rarotongan, 269, 270, 280, 281 Tabar, 275

recording techniques, 8, 13, 33, 38 Takū, 269, 270, 280, 281
reduplication, 106, 110, 114 Teop, xii, xiv, xv, xviii, xix, 27, 202,
reference (pronominal), 72, 74, 80 269, 271, 321, 326
Referential Density, 61, 71, 72, 79, 81 terminologies for fauna and flora, 265,
relational linking, 11, 224, 234, 236, 268
254 text types, 4, 57, 58, 324, 326, 329
repetition, 18, 24, 27, 106, 109–111, Tikopia, 269–272, 280, 281
209, 211, 213, 255, 328, 329 Titan, 269–272
revitalization (language), see language Tok Pisin, xii, xix, 327, 328
right-dislocated noun phrase, see noun Tolai, xi, xii, xv, xvii–xix
phrase Tongan, 269, 270, 280, 281
Unauthenticated
344 Index
Toolbox, see tools
tools, 52, 248
    access, 37
    ANNEX, 47, 48, 229
    annotation, 39, 46–47
    ARBIL, 51
    diphthong analysis and description tool, see diphthong
    ELAN, 39, 47, 48, 81, 135, 136, 156, 157
    LEXUS, 11, 39, 47, 48, 223–230, 234, 236, 238, 240–242, 247–249, 252–257
    metadata, 40, 51
    multimedia lexicon, 11, 39, 47–48, 224, 225, 228, 252, 254–256
    relational linking, 224, 236
    search, 49, 52
    tagging, 46–47
    Toolbox, 63, 64, 156, 157, 227, 242, 247, 248
    TROVA, 47–49
    ViCoS, 11, 224, 226, 227, 234–237, 240, 254, 256
    web-based lexicon, 224, 241, 250–251
    whiteboarding, 244, 245, 257
topic, 153, 158, 160, 161, 164–169, 172
Toqabaqita, 269, 270, 273, 280, 282
transcription, 10, 19, 89, 201–219, 255, 322
transcription (collaborative), see collaborative
translation equivalent, 143, 229, 265–267
tree names, 11, 234, 263–283
TROVA, see tools
Tuamotuan, 11, 223–249
Tupi-Guarani, 64, 81
typology
    linguistic, 6–8, 22, 55–81
    original text, 55–81
    quantitative approaches to, 7, 8, 61–71

utilization (data), see data
Uvean, 269–271

Vaeakau-Taumako, 296, 297
Vanuatu, 12, 23, 64, 82, 269, 270, 273, 274, 305–309, 314, 315
Vera’a, 64–82, 305
vernacular language, 228, 230, 265
ViCoS, see tools
Vurës, 12, 23, 82, 305–318

Wayan, 265, 267, 269–271, 273, 274, 277, 278, 280, 282
web-based collaboration, 224, 240–251
West Iran, see Iran
Western Fijian, see Wayan
Western Mirndi, 154
Western Solomons, see Solomon Islands
whiteboarding tools, see tools
Woleiaian, 275
word, 9, 89
    clitic, 99, 108, 114, 115
    morpho-syntactic, 89, 90, 92, 100–104, 109, 114, 115
    orthographical, 89, 105, 106, 115
    phonological, 89, 90, 92, 94–101, 104, 106, 109–112, 114, 115, 117
    written, 90, 98, 104–116
word meaning, 228–230, 251, 254
    contextualization of, 227–228, 256
    indigenous, see indigenous
    visualization of, 227–228
workspace (collaborative), see collaborative
written word, see word

Xârâcùù, 269–271