Unauthenticated
Download Date | 8/13/19 5:35 PM
Trends in Linguistics
Studies and Monographs 240
Editor
Volker Gast
Founding Editor
Werner Winter
Editorial Board
Walter Bisang
Hans Henrich Hock
Heiko Narrog
Matthias Schlesewsky
Niina Ning Zhang
De Gruyter Mouton
Documenting
Endangered Languages
Achievements and Perspectives
Edited by
Geoffrey L. J. Haig
Nicole Nau
Stefan Schnell
Claudia Wegener
De Gruyter Mouton
ISBN 978-3-11-026001-4
e-ISBN 978-3-11-026002-1
ISSN 1861-4302
This volume is dedicated to Ulrike Mosel,
in recognition of her contribution towards documenting
the world’s endangered languages.
Contents
Preface
Ulrike Mosel’s contribution to documentary linguistics
Geoffrey Haig and Nicole Nau
Chapter 1
Introduction
Geoffrey Haig, Nicole Nau, Stefan Schnell and Claudia Wegener
Chapter 2
Competing motivations for documenting endangered languages
Frank Seifart
Chapter 3
Evolving challenges in archiving and data infrastructures
Daan Broeder, Han Sloetjes, Paul Trilsbeek, Dieter van Uytvanck,
Menzo Windhouwer and Peter Wittenburg
Chapter 4
Comparing corpora from endangered language projects:
Explorations in language typology based on original texts
Geoffrey Haig, Stefan Schnell and Claudia Wegener
Chapter 5
“Words” in Kharia – Phonological, morpho-syntactic, and
“orthographical” aspects
John Peterson
Chapter 6
Aspect in Forest Enets and other Siberian indigenous languages –
when grammaticography and lexicography meet different
metalanguages
Florian Siegl
Chapter 7
Documentary linguistics and prosodic evidence for the syntax of
spoken language
Candide Simard and Eva Schultze-Berndt
Chapter 8
Diphthongology meets language documentation:
The Finnish experience
Klaus Geyer
Chapter 9
Retelling data: Working on transcription
Dagmar Jung and Nikolaus P. Himmelmann
Chapter 10
The making of a multimedia encyclopaedic lexicon for and in
endangered speech communities
Gaby Cablitz
Chapter 11
What does it take to make an ethnographic dictionary? On the
treatment of fish and tree names in dictionaries of Oceanic languages
Andrew Pawley
Chapter 12
Language is power: The impact of fieldwork on community politics
Even Hovdhaugen and Åshild Næss
Chapter 13
Sustaining Vurës: Making products of language documentation
accessible to multiple audiences
Catriona Hyslop Malau
Chapter 14
Filming with native speaker commentary
Anna Margetts
Index
Preface
Ulrike Mosel’s contribution to documentary linguistics
do!" attitude that she herself would, perhaps, attribute to the influence of her
Australian years, but in fact is very much part of her own personality. She in-
herited a chair that had been vacant for some time, and a department that was
lacking in direction, and in students. She set about rebuilding the department,
introducing an emphasis on language documentation, but without neglect-
ing foundational theoretical aspects of linguistics. The BA programme she
established in collaboration with the phonetics department has been hugely
successful, with new enrolments of 60–100 students annually. The shake-up
involved some decisions that raised a few eyebrows in the then still very tra-
ditional Philosophische Fakultät at the University of Kiel (for example, her
decision to abolish Latin as a prerequisite for an undergraduate degree in lin-
guistics, or her regular conflicts with the faculty when it came to allowing
graduates to write their theses in English). In both decisions, as in many oth-
ers, she stuck to her own convictions against considerable resistance, and in
doing so proved herself once again ahead of her time.
Within a remarkably short time she had forged a thriving department,
hosted several research projects, supervised more than a dozen PhDs and
countless MA theses, and set up highly successful and innovative new degree
programmes. She never served as dean, vice-chancellor, or anything else in
university politics: she just concentrated on what she does best, namely re-
search and teaching in linguistics. International recognition of her work has
come in many forms, among them an Honorary Membership in the Linguistic
Society of America, awarded to her in 2007.
Through it all she has retained her signature mix of enthusiasm and rigour,
coupled with a very Prussian work ethic, but also a deep humanity that enables
her to genuinely engage with people from all walks of life – one of the
most important qualities for successful field work. Her research commit-
ments in these latter years have been dominated by her involvement in the
DoBeS-programme, as one of its founding scholars, multiple grant-recipient,
and long-standing chairperson of the Steering Committee. There is hardly an
aspect of the programme that Ulrike has not impacted on over the years. We
feel confident that a little thing like retirement will not significantly change
that.
Dictionaries
1997 O le fale (with Mose Fulu). (A small monolingual Samoan dictionary
on housebuilding and furniture, published by the Ministry of Youth,
Sports and Culture). Apia, Western Samoa: Matagaluega Autalavou
Taaloga ma Aganuu.
2001 Utugaga (with Fosa Siliko, Ainslie So’o and Agafili Tuitolova’a).
(A monolingual dictionary of Samoan for primary school students
of year 5–8). Apia, Western Samoa: Curriculum Development Unit,
Department of Education.
2007 Teop Lexical Database (with Marcia L. Schwartz, Ruth Saovana
Spriggs, Ruth Siimaa Rigamu, Jeremiah Vaabero and Naphtaly
Maion). Kiel: Seminar für Allgemeine und Vergleichende Sprachwissenschaft,
Christian-Albrechts-Universität.
http://www.linguistik.uni-kiel.de/Teop_Lexical_Database_May07.pdf
2010 A inu. The Teop–English Dictionary of House Building (with Mark
Mahaka, Enoch Horai Magum, Joyce Maion, Naphtaly Maion, Ruth
Siimaa Rigamu, Ruth Saovana Spriggs, Jeremiah Vaabero, Marcia
Schwartz and Yvonne Thiesen). With drawings by Neville Vitahi
and photographs by Ulrike Mosel. SAVS Arbeitsberichte 6. Kiel:
Seminar für Allgemeine und Vergleichende Sprachwissenschaft,
Christian-Albrechts-Universität.
Text collections
1977 Tolai Texts. Kivung, Journal of the Linguistic Society of Papua New
Guinea. Volume 10, Port Moresby.
2007 Amaa vahutate vaa Teapu (with Enoch Horai Magum, Joyce Maion,
Jubilie Kamai, Ondria Tavagaga and Yvonne Thiesen). Illustrated
by Rodney Rasin. Kiel: Seminar für Allgemeine und Vergleichende
Sprachwissenschaft, Christian-Albrechts-Universität.
2009 Teop Language Corpus (with Enoch Horai Magum, Helen Magum,
Shalom Magum, Jubilie Kamai, Owen Kasinory, Mark Ma-
haka, Joyce Maion, Naphtaly Maion, Janeth Nasin, Rodney
Rasin, Ruth Siimaa Rigamu, Ruth Saovana Spriggs, Ondria Tavagaga,
Jeremiah Vaabero, Neville Vitahi, Jessika Reinig, Marcia
Schwartz, Yvonne Schuth (Thiesen)).
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI622803%23
On grammaticography
1975 Die syntaktische Terminologie bei Sibawaih. 2 volumes. Disserta-
tion. München: Uni Fotodruck Frank.
1980 Syntactic categories in Sibawaih’s “Kitab”. Histoire Épistémologie
Langage 2(1): 27–37.
1987 Inhalt und Aufbau deskriptiver Grammatiken (How to write a grammar).
Arbeitspapier Nr. 4. Köln: Institut für Sprachwissenschaft,
Universität zu Köln.
2002 Analytic and synthetic language description. In Linguistik jenseits
des Strukturalismus. Akten des II. Ost-West-Kolloquiums Berlin
1998, eds. Kennosuke Ezawa, Wilfried Kürschner, Karl H. Rensch
and Manfred Ringmacher, 199–208. Tübingen: Narr.
2006 Sketch grammars. In Essentials of Language Documentation, eds.
Jost Gippert, Nikolaus Himmelmann and Ulrike Mosel, 301–309.
Berlin: Mouton de Gruyter.
2006 Grammaticography, the art and craft of writing grammars. In Catch-
ing Language: the Standing Challenge of Grammar Writing, eds. Fe-
lix Ameka, Alan Dench and Nicholas Evans, 41–68. Berlin: Mouton
de Gruyter.
2007 Early grammars of Oceanic languages (with Even Hovdhaugen).
In Sprachtheorien der Neuzeit III/2: Sprachbeschreibung und
Sprachunterricht, ed. Peter Schmitter, 462–478. Tübingen: Narr.
On lexicography
2002 Dictionary making in endangered language communities. In Pro-
ceedings of the International Workshop on Resources and Tools in
Field Linguistics, Las Palmas 26–27 May 2002.
2004 Dictionary making in endangered speech communities. In Language
Documentation and Description, Volume 2, ed. Peter K. Austin,
39–54. London: School of Oriental and African Studies.
2011 Lexicography in endangered language communities. In The Cam-
bridge Handbook of Endangered Languages, eds. Peter K. Austin
and Julia Sallabank, 337–353. Cambridge: Cambridge University
Press.
On semantics
1982 Local deixis in Tolai. In Here and There: Cross-linguistic Studies in
Deixis and Demonstration, eds. Jürgen Weissenborn and Wolfgang
Klein, 111–132. Amsterdam: Benjamins.
1991 Time metaphors in Samoan. In Festschrift für Meinrad Scheller,
ed. Walter Bisang, 149–165. Arbeiten des Seminars für Allgemeine
Sprachwissenschaft der Universität Zürich 11. Zürich: Seminar für
Allgemeine Sprachwissenschaft, Universität Zürich.
1991 The Samoan construction of reality. In The Currents in Pacific Lin-
guistics. Papers on Austronesian Languages and Ethnolinguistics in
Honor of George Grace, ed. Robert Blust, 293–303. Canberra: Aus-
tralian National University Press.
Chapter 1
Introduction: Documenting endangered languages
before, during, and after the DoBeS programme∗
1. Background
Scholars have been engaged in language documentation for centuries. But it is
only in the past couple of decades that language documentation has emerged
as an academic discipline in its own right, associated with scholarly publica-
tions, university departments, funding initiatives, and a growing repertoire of
practices, theoretical diversification and sophistication, and specialist termi-
nology. The roots of the current upsurge in language documentation run deep,
and stem from several sources, too many to treat in any detail here. Today, lan-
guage documentation has come of age, a fertile domain for cross-disciplinary
exchange, involving linguists, anthropologists, ethno-musicologists, biolo-
gists, along with software developers and corpus linguists.
The development has been remarkably rapid; a glance at the seminal
article by Himmelmann (1998) suffices to confirm that much of what was
envisaged in that programmatic statement has since become part and par-
cel of language documentation practice. In the late 1990’s, a small group
of German-based linguists began developing an agenda for a programme
aimed at documenting endangered languages worldwide. That initiative came
to fruition in the Volkswagen Foundation’s DoBeS-programme (Dokumenta-
tion bedrohter Sprachen), which began with a pilot phase in 2000. DoBeS
went on to become the Foundation’s longest-running programme within the
humanities. From the outset, DoBeS established the Max Planck Institute for
∗ We are extremely grateful to the DoBeS-programme of the Volkswagenstiftung, who
funded most of the research reported in this book, and so much more. We would also
like to thank the authors for their contributions and constructive input during the entire
publication process. Finally, our thanks go to the production team at Mouton and the series
editor, Volker Gast, for their support and encouragement throughout.
Psycholinguistics in Nijmegen as the host for the central archive, also respon-
sible for developing and implementing technical standards for the project.
This generated an intensive exchange between linguists, software developers,
and archivists, yielding significant improvements in annotation and metadata
procedures as well as refinements of ethical and legal guidelines. It has been
an ongoing process, constantly informed by the practical experience gained
through some 50 documentation projects (cf. Broeder et al., this volume).
Moreover, regular workshops and summer schools have provided scores of
linguists, as well as many native speakers of endangered languages, with
training in documentary linguistics, ensuring the initiative’s continued in-
fluence through future generations of linguists. The DoBeS-programme has
undoubtedly been one of the most successful research initiatives in the lan-
guage sciences. Its impact on the emergent field of documentary linguistics
can hardly be exaggerated (see Harrison, Rood, and Dwyer 2008 for a similar
assessment), with long-term implications that go far beyond the field of lin-
guistics itself. With the programme entering its closing phases in late 2011,
it is fitting to take a look at some of its achievements, and some of the future
challenges, through the work of some of its practitioners.
As mentioned, the modern discipline of language documentation has mul-
tiple ancestors. Previous retrospectives, e.g. Woodbury (2003), identify Franz
Boas as the spiritus rector of documentary linguistics in the North American
context. However, with the rise of Chomskyan linguistics in North America,
the Boasian tradition of anthropologically informed documentary linguistics
had lost ground there. Both Grinevald (2003) and Woodbury (2003) identify
the LSA symposium in 1991, and the associated publication in Language
(Hale et al. 1992), as the turning points in triggering the rehabilitation of doc-
umentary linguistics. From a North American perspective, the developments
certainly appear to be quite radical; Grinevald (2003: 52) describes in vivid
terms the “incredible tension” she experienced prior to the LSA panel meet-
ing in 1991 when the blunt facts regarding the imminent loss of much of the
world’s linguistic diversity were to be presented before the assembled digni-
taries of the North American linguistic scene. The perception of a “paradigm
shift” can only really be appreciated when one considers the extent to which
American linguistics at the time was dominated by scholars working in theo-
ries focused on an idealized conception of “grammar”. The topic of language
endangerment, on the other hand, meant introducing social and political di-
mensions into a field that had effectively abstracted away from such consider-
ations. But since the early 1990’s, documentary linguistics in North America
has regained much of the lost ground, with numerous highly successful and
innovative programmes now well-established (cf. Woodbury 2003 for further
discussion).
But the North American perspective is only part of the story.1 From a Eu-
ropean perspective, on the other hand, the paradigm shift appears somewhat
less fundamental. In Germany, now an important centre for language docu-
mentation, the impact of the much-vaunted Chomskyan Revolution was con-
siderably weaker than across the Atlantic. The reasons are partly to be sought
in the way linguistics is institutionalized at German universities. To this day,
dedicated departments of general linguistics are only sparsely scattered across
the university landscape. Linguistics has remained to a large extent the con-
cern of individual philologies (English, Romance, German etc.). Within such
departments, the text-based tradition of philology continued to be an important
pillar of academic training. Linguists within these disciplines thus
never felt themselves on the defensive to the same extent as their American
colleagues did during Generative Grammar’s rise to dominance in
North America. In Europe, a deep-rooted tradition in the description of little-studied
languages and dialects could survive as a respectable, if marginal,
niche within the philologies. Particularly in departments such as African lan-
guages, Finno-Ugric, Semitic, Turkic, or Iranian, descriptive grammars, dic-
tionaries and text collections from under-studied languages and dialects have
constituted a regular part of the scientific output for close to two centuries.
Highly-respected German professors, such as the now retired Semitist Otto
Jastrow, devoted much of their professional careers to field work and docu-
menting endangered languages. Among other things, Jastrow’s work led to
the establishment of the Semarch,2 a digital archive of (often highly endan-
gered) Semitic languages at the University of Heidelberg.
1. A comprehensive treatment of the roots of language documentation lies beyond the scope
of this introduction; see Himmelmann (2008) for a more balanced account. Two further an-
tecedents of modern documentary linguistics, outside of Western Europe and North Amer-
ica, are also definitely worthy of mention: the linguistic fieldwork paradigm initiated by the
Moscow-based linguist Alexander Kibrik and his associates in the 1970’s, which generated
an enormous amount of descriptive material on indigenous languages of the former Soviet
Union, and the tradition of encouraging descriptive grammars of undescribed languages as
PhD theses, established at the Australian National University in the 1970’s mainly through
the efforts of Bob Dixon.
2. http://www.semarch.uni-hd.de/index.php43
3. The importance of documenting the full range of communicative practices in the commu-
nity has been stressed by many authors (e.g. Himmelmann 1998; Foley 2003). In practice,
however, it is notable that in most documentation projects, it still tends to be more tradi-
tional monologues than everyday conversational interactions that find themselves as fully-
annotated records in the archive. Part of the reason lies in the speech communities’ own
assessment of what is worth preserving for posterity (cf. Mosel 2004, 2006, 2008). Partly,
it may be due to the practical difficulties of recording, annotating and analysing sponta-
neous multi-participant discourse. In this sense, then, documentation practice lags behind
documentation theory, but this is a perfectly normal state of affairs in any discipline, and
with increased experience with, for example, video recording and annotations, we expect
that greater amounts of conversational data will be archived.
4. http://wals.info/
5. Atkinson’s findings have been heavily criticized, and may thus seem an unfortunate exam-
ple for illustrating this point. However, the basic fact remains that data collectors simply
cannot predict future advances; for them, it is a sufficient goal to ensure maximal accessibility,
long-term preservation, and good data organization so that later applications can be implemented as
economically as possible.
that of defining terms from such fields, when they differ in content and struc-
ture from corresponding fields in the investigator’s language. A central ques-
tion is how much, and what kinds of cultural knowledge should be included
in a dictionary. Pawley concludes that general ethnographic dictionaries need
the cooperation of several well-informed specialists, and that the linguist’s
fluency in the target language, as well as the native-speaker collaborator’s
fluency in the defining language, are of considerable importance. Above all,
he stresses that compiling such a dictionary is inevitably a long-term project
in its own right, which cannot readily be accommodated within the tightly-
constrained time frame of a typical documentation project.
Part IV. Interaction with speech communities
The fourth part of this book is devoted to the interaction of linguists with the
speech community where field work is carried out.
Even Hovdhaugen and Åshild Næss open the discussion by drawing attention
to problems that may arise when field work becomes an issue in
local power struggles (“Language is power: The impact of fieldwork on
community politics”). They describe a situation of conflict they witnessed
during their work in the Solomon Islands, which shows that even the best
preparation and long-term experience in an area cannot always prevent the
field worker, however unwillingly, from becoming the cause of conflicts in
the community. Linguists have to be aware of the impact their work may have
on local politics, and questions of the linguist’s behaviour in possible situa-
tions of dissent within the community should be given consideration when
planning a language documentation project.
In Chapter 13, “Sustaining Vurës: Making products of language docu-
mentation accessible to multiple audiences”, Catriona Hyslop Malau presents
her experiences with the production and presentation of a DVD documenting
traditional plaiting and fishing techniques in the Vurës community, accom-
panied by screenings of Vurës dictionary entries of relevant specialized vo-
cabulary. She shows how such professionally processed video material can
be of significant value as an output of a language documentation project. The
very fact that video material is processed professionally signals the worthi-
ness of the community’s linguistic heritage and traditional knowledge, and
encourages the speech community to uphold them. But professionally pro-
duced DVDs also enhance the visibility of the linguistic and cultural diver-
sity of Vanuatu, and of the general enterprise of language documentation, by
making it available to a wider national and global audience.
References
Atkinson, Quentin. 2011. Phonemic diversity supports a serial founder effect
model of language expansion from Africa. Science 332(6027):346–349.
Foley, William A. 2003. Genre, register and language documentation in liter-
ate and preliterate communities. In Language Documentation and Descrip-
tion, Volume 1, ed. Peter K. Austin, 85–98. London: School of Oriental and
African Studies.
Grinevald, Colette. 2003. Speakers and documentation of endangered lan-
guages. In Language Documentation and Description, Volume 1, ed. Pe-
ter K. Austin, 52–72. London: School of Oriental and African Studies.
Haig, Geoffrey, and Stefan Schnell. 2011. Annotations using GRAID (Gram-
matical Relations and Animacy in Discourse). Introduction and guidelines
for annotators. Version 6.0. Available at:
http://vc.uni-bamberg.de/moodle/course/view.php?id=9488.
Hale, Ken, Michael Krauss, Lucille J. Watahomigie, Akira Y. Yamamoto,
Colette Craig, LaVerne Masayesva Jeanne, and Nora C. England. 1992.
Endangered languages. Language 68:1–42.
Chapter 2
Competing motivations
for documenting endangered languages∗
Frank Seifart
1. Introduction
Many aspects of language documentations are crucially shaped by the di-
verse motivations of the documenters. This chapter addresses a number of
questions with the aim of clarifying the role of such motivations in language
documentation, including:
(i) What are the motivations for documenting endangered languages?
(ii) Which requirements for language documentations stem from them?
(iii) How can these requirements be accommodated in the format of language
documentations?
(iv) Where do these motivations give rise to competing documentation pri-
orities?
With respect to the first question, four possible motivations for documenting
endangered languages may be identified:1
– documentation to preserve human cultural heritage
– documentation to enhance the empirical basis of linguistics
– documentation by and for the speech community
– documentation to study language contact
These four motivations are theoretical abstractions, which are mixed in prac-
tice. This abstraction turns out to be useful, however, since making these mo-
tivations explicit allows one to study their specific requirements for language
∗ I am grateful to the editors of this volume for very useful comments that helped improve
this chapter.
1. The fourth motivation, the study of language contact, may be a less common one than the
first three. However, as argued in Section 3.4 below, there are a number of good reasons to
document endangered languages to study language contact, with interesting implications
for documentation priorities.
This includes annotations of primary data (per session), which in turn cru-
cially include translations. It also includes (for the documentation as a whole)
general access resources, descriptive analyses, and metadata. As we shall see
further below, a number of requirements stemming from different motivations
pertain to these aspects of language documentation.
Table 1. Format of language documentations (Himmelmann 1998, 2006)
Contents          Apparatus
(Primary data)    Per session    For documentation as a whole
                  metadata       metadata
                                 Descriptive analysis:
                                 ethnography, descriptive grammar, dictionary
2. The idea of examining language documentation from the perspective of underlying motivations
stems from Seifart (2000: 25–48), where a similar set of motivations was discussed
with little reference to actual documentation projects. The current paper modifies the con-
cept of motivations and discusses them with respect to experience from over 10 years of
language documentation practice.
be emphasized that I wish to refrain from any judgment with respect to the
(moral or other) validity of these motivations (see the useful discussion in
Fast 2007), but rather focus on clarifying their respective requirements for
the contents and apparatus of language documentations.
3. There are recent approaches to using textual data for typological studies (Cysouw and
Wälchli 2007). These approaches have so far used mostly parallel texts from translations
of the Bible or popular fiction. The approach can be extended to a kind of parallel text
contained in the apparatus of language documentations, namely transcriptions and their
aligned translations.
4. See URL http://www.eva.mpg.de/lingua/resources/glossing-rules.php.
mation (Giles, Bourhis, and Taylor 1977; Landry and Allard 1994; Edwards
2010).5
5. It should be noted that the ethnolinguistic vitality framework was developed in the context
of migrant communities in industrialized countries, often with languages that are far from
being endangered as a whole. It would need to be extended to be applied to language
endangerment settings involving indigenous languages.
(ii) Naturally occurring spoken language vs. edited data. Linguistic data
that are produced as naturally as possible are heralded as the most valuable for
many linguistic, as well as other scientific studies, which make use of the full
range of information present in spoken language, including prosody, speech
rate, repetitions, etc. (see, e.g. Finnegan 2008). This applies similarly to lan-
guage contact studies that require data on code switching and other influences
from a contact language. On the other hand, data that are edited according to
normative intuitions of speakers, i.e. that are cleared of speech errors, repe-
titions, code switching, etc., may be appropriate or even required for reading
materials used in community schools, and to a lesser degree for some aspects
of documenting cultural heritage and structural-linguistic studies. Producing
edited versions of texts is a painstaking and time-consuming task, which few
documentation projects have undertaken for more than a small number of
texts. An exception is the Teop language documentation (Mosel et al. 2007),
which has produced a large corpus (probably the largest) of texts edited in
this way, which are archived alongside the original, unedited versions.
(iii) Specific descriptive information in the apparatus. Within a language
documentation, the apparatus serves to facilitate access to the primary data,
allowing for their further usability and analysis. However, a number of moti-
vations to document endangered languages require specific descriptive infor-
mation that may not be contained in such an apparatus, such as sociolinguistic
descriptions of domain-specific language use or specific linguistic-structural
information for typological surveys about, e.g., word order. The question is
to what extent a language documentation project can be expected to provide
such information rather than spending its limited resources on, for instance,
enhancing the collection of primary data. A guideline to mediate between
these competing motivations might be to afford higher priority to providing
information in the apparatus that cannot be deduced, however painstakingly,
from the corpus of primary data at a later stage. Thus, for instance, details of
word order can be determined from a large text collection after the conclu-
sion of a documentation project, while the distribution of languages across
domains cannot.
This chapter discussed different intended aims and user groups of language
documentation and subsumed these under four motivations for language doc-
References
Austin, Peter K. 2003. Introduction. In Language Documentation and De-
scription, Volume 1, ed. Peter K. Austin, 6–12. London: School of Oriental
and African Studies.
Bickel, Balthasar. 2008. A refined sampling procedure for genealogical con-
trol. Sprachtypologie und Universalienforschung 61:221–233.
Bickel, Balthasar, Goma Banjade, Toya N. Bhatta, Martin Gaenzle, Netra P.
Paudyal, Manoj Rai, Novel Kishore Rai, Ichchha Purna Rai, and Sabine
Stoll, eds. 2009. Audiovisual Chintang corpus (ca. 150,000 words tran-
scribed and translated, of which ca. 65,000 glossed and translated, plus
paradigm sets and grammar sketches, ethnographic descriptions, pho-
tographs). Nijmegen, Leipzig: DoBeS, Universität Leipzig. http://www.
uni-leipzig.de/~ff/cpdp/.
Burenhult, Niclas, and Stephen C. Levinson, eds. 2010. Tongues of the
Semang: documenting endangered languages and indigenous knowledge
among foragers of the Malay Peninsula. Nijmegen: DoBeS. http://mpi.nl/
DOBES/projects/semang.
Buszard-Welcher, Laura. 2010. Lessons from Potawatomi legacy documen-
tation. In Language Documentation: Practice and Values, eds. Lenore A.
Grenoble and Louanna Furbee, 67–74. Amsterdam, Philadelphia: John
Benjamins.
Cablitz, Gabriele, ed. 2007. Towards a multimedia dictionary of the Marque-
san and Tuamotuan languages of French Polynesia. Nijmegen: DoBeS.
http://mpi.nl/DOBES/projects/marquesan.
Chapter 3
Evolving challenges
in archiving and data infrastructures
1. Introduction
Increasingly often research in the humanities is based on data. This change in
attitude and research practice is driven to a large extent by the availability of
small and cheap yet high-quality recording equipment (video cameras, audio
recorders) as well as advances in information technology (faster networks,
larger data storage, greater computing power, suitable software). In some
institutes, such as the Max Planck Institute for Psycholinguistics, a clear
trend towards an all-digital domain could already be identified in the 1990s,
making use of state-of-the-art technology for research purposes. This change of habits
was one of the reasons for the Volkswagen Foundation to establish the DoBeS
program in 2000 with a clear focus on language documentation based on
recordings as primary material.
The fact that more and more data is being collected poses some challenges
for those who are dealing with this data in one way or another. The researcher
who collects the material will need to maintain a coherent administration of
all the relevant bits of contextual information surrounding the data. These
“metadata” descriptions (see Section 4.2) are not just for the researcher’s own
use but should also allow others to find the data once it has been stored in an
archive and should allow others to assess whether the data suits their needs.
Research data archives that are storing more and more large data collections
will have to provide proper facilities and guidance for potential users of the
data to find what they are looking for.
While technological advances have made it much easier to collect large
amounts of audiovisual recordings, the automatic extraction of the relevant
bits of information from these recordings is still very difficult and therefore
needs to be done manually to a large extent. This causes a discrepancy be-
tween the amount of data that is being collected and the amount of data that
ends up being analyzed and used to support research hypotheses.
Data archiving and sharing is currently on the agenda in all areas of sci-
ence and the technical frameworks that are being developed are often based
on the OAIS reference model (CCSDS 2002) that was originally designed for
space data but can be applied more broadly. Different workflows and usage
scenarios and differences in the nature of the archived data often require de-
viations from this abstract model though, in particular in the case of an online
archive that gives users direct access to the archived material.
Since digital technology quickly offered ways to create not only large amounts
of primary recordings but also associated resources such as transcriptions,
linguistic analyses, field notes, etc., it became obvious that new challenges
were appearing on the horizon: we needed ways to take care of proper life-cycle
management of the archived data. In 2000 the MPI stored about one
terabyte of digitized recordings; currently the data in the online archive and
the data ready to be integrated take up about 74 terabytes. Due to technological
innovation we are now able to process and store losslessly compressed
JPEG2000 video streams, which result in files that are a factor of 20 larger than
the MPEG2 files that were our highest-quality archival copies until recently.
This increase in file sizes currently results in an annual growth of the archive
of about 18 terabytes. However, with more and more researchers switching to
high-definition video cameras, we can expect another steep increase in annual
growth in the near future.
In the humanities, sheer data volume specifications are not a good indica-
tor for the data management challenges to solve. There are generally complex
relations between the archived objects that need to be maintained in order
to preserve all the knowledge about the objects. Each digitized recording is
for example part of a hierarchy of semantically related objects. Often such
objects are split into new objects for specific reasons such as presentations.
Different layers of annotation of the linguistic content are created, perhaps
even by different annotators at different times. Derived resources such as
lexica are created that relate to a collection of archived objects (see Cablitz,
this volume). Several versions and transformations of many objects might be
created in the course of time. It is important to store the relationships between
all these objects, since in most cases context and provenance are essential for
the interpretation of the objects’ content. Handling the complexity in such
collections is thus a true challenge.
We know from a UNESCO study that even for tape media the preservation
of the stored information has become a huge and partly
insoluble problem. About 80% of the existing recordings of languages and
cultures created by ethnologists, linguists etc. are highly endangered because
the physical carriers are deteriorating rapidly and the material is not in the
hands of specialized archives (Schüller 2004). Digital technology moves on
even faster, so uncurated digital data is much more endangered than traditional
analog recordings. There is a great risk of losing parts of our cultural and
scientific memory if we do not ensure that data formats and encodings are kept
distinct from the software being used, if we do not use open standards such
as XML (eXtensible Markup Language) for specifying structure and if we do
not use widely agreed and thoroughly documented encoding schemes such as
Unicode, MPEG etc.
Digital data needs to be continuously migrated, both at the carrier level and
at the structure/encoding level. How can we maintain integrity and
authenticity – both essential pillars for the preservation of our contents – in such
a dynamic world? Migration alone will not ensure data survival, since our
media are very vulnerable and our software is error-prone. Automatic copying to
distinct locations according to safe protocols making use of different software
systems is required as well to preserve our digital treasure. For DoBeS data,
six copies are created automatically at three locations and in addition selected
data is being returned to the locations where they were recorded.
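One common way to check that distributed copies are still intact is to compare cryptographic checksums. The text does not say which mechanism the DoBeS archive uses, so the following is only a minimal sketch under that assumption; the function names and file layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_copies(reference_digest: str, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or whose checksum differs.

    reference_digest is the checksum recorded when the object was archived
    (an assumption of this sketch, not a documented DoBeS procedure).
    """
    damaged = []
    for copy in copies:
        if not copy.is_file() or sha256_of(copy) != reference_digest:
            damaged.append(copy)
    return damaged
```

A check of this kind can run unattended after every copy or migration step, which fits the requirement that procedures should not depend on many manual operations.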
For both aspects – migration and copying – there are no simple solutions
that are safe enough and all procedures involving too many manual operations
will not work in the end, since the costs would be much too high for the large
volumes of data that we are creating and maintaining.
1. http://www.mpi.nl/IMDI/
and (3) that the users can trust getting exactly those objects they are looking
for in authentic quality. The last point has led to an important shift in the MPI
archivists’ view on researcher involvement in data management. Given the highly
dynamic era in which the DoBeS program, and in particular its archiving part,
was set up, we can state that this trust has eventually been established,
even though it required some changes of attitude on both the archivists’ and
the researchers’ side. The archivists, for example, had to become
aware of the utmost importance that researchers attach to proper protection
and presentation of their data, and the researchers had to get used to the idea
of handing over their data to an online archive that has data sharing as one
of its goals. It is in the nature of innovation that trust has to be continuously
re-established.
CLARIN.² Agreements about standards and interchange formats for data and
services are needed to ensure interoperability between various archives and
tool providers.
2. http://www.clarin.eu
There are very many ways to organize digital language resources; one or-
ganization might be more suitable for a specific archiving or research purpose
than another and fortunately the digital storage paradigm does not impose a
single organization. Therefore we need the ability to impose different flex-
ible organization models or views that match the interest of researchers or
archivists. The richer the metadata available, the more possibilities there are
for the end user to create these special views and explore the digital collec-
tion.
In the current landscape of digital repositories and archives, a number
of specific metadata standards are prominent for the description of linguistic
data. Such a standard usually specifies a set of metadata elements (sometimes
called attributes) together with prescriptions for the values of these elements
and also prescriptions on how the metadata elements and values should be
put into a text format (schema).
The first of these sets, and probably the most widely used one, is
Dublin Core,³ which stems from the electronic library world. Dublin Core
was later extended with some linguistic specializations into the OLAC standard,⁴
which has become popular for exchanging language resource metadata
between archives. Around the same time the IMDI⁵ standard was introduced
and adopted by the DoBeS program. IMDI strives to allow detailed
descriptions and several so-called specialized profiles were created for spe-
cific linguistic subdomains. A suite of tools to edit and use IMDI metadata
was partly developed within the context of the DoBeS program.
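To make the three ingredients of a metadata standard concrete (elements, values and a serialization format), the following sketch builds a small Dublin Core description with Python's standard library. The element names `title`, `language` and `format` and the namespace URI belong to Dublin Core; the wrapping `record` element and all values are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields: dict[str, str]) -> ET.Element:
    """Wrap Dublin Core elements in a hypothetical <record> container."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only; they do not describe a real archived resource.
record = build_record({
    "title": "Example recording (hypothetical)",
    "language": "und",
    "format": "audio/x-wav",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Richer standards such as IMDI follow the same pattern but define many more elements and more tightly constrained value sets.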
At the time of writing (2011) a follow-up standard for IMDI, called CMDI⁶
(Component MetaData Infrastructure, cf. Broeder et al. 2010), is being worked
out within the CLARIN framework. Rather than offering one single meta-
data schema it tries to offer the user a set of loose components that can be
combined into a tailored metadata schema. This approach should allow for
a detailed description while keeping the focus only on those metadata ele-
ments that are relevant. Apart from that, it also allows for partial re-use of
existing metadata schemas and provides better mechanisms of semantic in-
teroperability by requiring that the semantics of all used metadata elements
are explicitly defined in an accepted concept registry. Using CMDI will hope-
3. http://dublincore.org
4. http://www.language-archives.org/OLAC/metadata.html
5. http://www.mpi.nl/IMDI/
6. http://www.clarin.eu/cmdi
A metadata example that is already in use illustrates the support for mapping
from the IMDI to the Dublin Core metadata schemas by using these strate-
gies. The metadata profile in ISOcat has been bootstrapped with the IMDI
elements, which include the /mimeType/⁹ data category. The specification of
a data category can be very elaborate including translations in multiple lan-
guages, but at least an English name and definition should be available. The
/mimeType/ data category is defined as the “specification of the mime-type
of the resource which is a formalized specifier for the format included or a
7. http://www.isocat.org/
8. http://www.clarin.eu/cmdi/
9. http://www.isocat.org/datcat/DC-2571
4.4. Versioning
When storing and archiving digital resources, an important policy decision
concerns how to respond when a depositor offers a new “version” of a re-
source that is already present in the archive’s holdings. There can be differ-
10. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:c_
1271859438106
11. http://purl.org/dc/elements/1.1/format
ent reasons for offering a new version: (a) The depositor has realized that
the first version is simply broken or unusable, for instance in the case that
the files were switched. (b) New insights make it necessary to change some
annotations. (c) The format of a resource may need to be upgraded. For in-
stance a codec used to encode a media-stream may become obsolete requiring
resource replacement.
Depending on the archive organization and policies, it is possible to let the
new version take the place of the old one in the existing network of relations
with other resources and metadata. The old version may then be moved to
background storage or, depending on archive policy, even deleted. Of course
the relation between old and new versions needs to be stored and users should
be able to see that other versions exist.
We will not go into the question of what actually makes a resource a new
version of another resource. This should best be left to the judgment of the
depositor or caretaker of the original version.
It is however very important to realize that users may have created ref-
erences to a resource in the archive, for instance as a link in a publication.
Most users will expect such a reference always to link to the same version,
while others may want to refer to the latest version. It is important that
the archive is explicit about its versioning policy in this respect. The most
flexible system keeps every reference bound to a specific resource version
and offers resolution to the latest version as a separate service.
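The policy described above, with stable references to specific versions plus a separate "latest" service, can be sketched as a small resolver. The `id@N` and `id@latest` reference syntax, the class name and the storage locations are hypothetical; they are not the identifier scheme of any particular archive.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceHistory:
    """All archived versions of one resource, oldest first."""
    resource_id: str
    versions: list[str] = field(default_factory=list)  # storage locations

    def add_version(self, location: str) -> str:
        """Store a new version and return a reference pinned to it."""
        self.versions.append(location)
        return f"{self.resource_id}@{len(self.versions)}"

    def resolve(self, reference: str) -> str:
        """Resolve 'id@N' to version N and 'id@latest' to the newest one."""
        resource_id, _, selector = reference.partition("@")
        if resource_id != self.resource_id or not self.versions:
            raise KeyError(reference)
        if selector == "latest":
            return self.versions[-1]
        index = int(selector) - 1
        if not 0 <= index < len(self.versions):
            raise KeyError(reference)
        return self.versions[index]
```

A reference cited in a publication keeps resolving to the version the author saw, while a reader who wants the current state of the resource asks for `@latest` explicitly.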
However it is known that some archives are unable to keep stable ref-
erences to resources or resource collections due to legal or organizational
obstacles. For instance, its legal owner might withdraw a resource from an
archive’s holding. In such cases the archive can only be as explicit as possi-
ble about such circumstances.
12. http://oa.mpg.de/lang/en-uk/berlin-prozess/berliner-erklarung/
As indicated above, when working with data collected in small language com-
munities one has to carefully consider the rights and privacy of the inter-
viewed contributors. In the DoBeS program, legal and ethical considerations
were an important point of discussion from the very beginning. In its second
year a workshop was organized with leading European law experts to deter-
mine a proper juridical basis for the DoBeS program and in particular the
online archiving ideas. The result however was disappointing from the prac-
titioners’ point of view, since it was concluded that the legal situation is much
too complex to give clear juridical advice. The only advice the experts could
come up with was to lock the material in a safe in a cellar, which was exactly
the opposite of what was expected from the emerging archive – namely to be
a place where authorized persons from all around the world could access and
even enrich the stored data.
Intensive and serious discussions afterwards led to a number of conclu-
sions:
– It was understood that the DoBeS program should have a proper basis to
guide the behavior of all persons involved: collectors, archivists and users.
The result was an elaborate Code of Conduct, which was amended over the
years.
– The roles of all actors in the complex system were defined and the expec-
tations with respect to each actor were formulated. For the archivists it is
the principal researcher who is responsible for specifying for example the
access permissions etc. It is expected that the researcher responsible takes
care of proper relationships with the communities and the interviewees and
that all statements are based on informed consent. The archivist will adhere
to the statements of the researcher responsible and provide access mecha-
nisms that implement the requirements.
– The archivist declared that he does not claim copyright on the stored mate-
rial. However, he needs the right to archive in order to perform his task in
a responsible way. With respect to users the archivist will claim copyright
on behalf of the data producers.
– It was decided to not use visible logos in the video since they might ob-
struct the content.
– The researcher responsible always has access permissions to all material
and he can set access permissions for other persons. In particular mem-
bers of the speech community should be granted the rights and abilities to
access the content.
degree of sensitivity of all actors involved. To cope with all kinds of un-
expected events a Linguistic Advisory Board consisting of highly respected
field researchers was established that can be called upon by the archive to
help solve potentially difficult questions.
Over the years, as it became more obvious that more users might want
to access material in the online archive, four levels of access granting were
agreed upon:
Level 1: Material under this level is directly accessible via the internet;
Level 2: Material at this level requires that users register and accept the Code
of Conduct;
Level 3: At this level, access is only granted to users who apply to the re-
searcher responsible (or persons specified by him or her) and spec-
ify their usage intentions;
Level 4: Finally, there will be material that will be completely closed, except
for the researcher and (some or all) members of the speech commu-
nities.
Access level specifications for archived resources may change over time for
various reasons, e.g. resources could be opened up a certain number of years
after a speaker has passed away, or access restrictions might be loosened after
a PhD candidate in a documentation project has finished writing the thesis.
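The four access levels can be read as a simple decision procedure. The sketch below is an illustration only, not the archive's actual access-control implementation: the class and field names are hypothetical, and "some or all" community members at level 4 is simplified to a single flag.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class AccessLevel(IntEnum):
    OPEN = 1        # directly accessible via the internet
    REGISTERED = 2  # user has registered and accepted the Code of Conduct
    ON_REQUEST = 3  # granted by the researcher responsible on application
    CLOSED = 4      # researcher and speech community members only


@dataclass
class User:
    name: str
    accepted_code_of_conduct: bool = False
    is_responsible_researcher: bool = False
    is_community_member: bool = False  # simplification of "some or all"
    granted_resources: set[str] = field(default_factory=set)


def may_access(user: User, resource_id: str, level: AccessLevel) -> bool:
    """Decide access for one resource under the four-level scheme."""
    if user.is_responsible_researcher:
        return True  # the researcher responsible always has access
    if level is AccessLevel.OPEN:
        return True
    if level is AccessLevel.REGISTERED:
        return user.accepted_code_of_conduct
    if level is AccessLevel.ON_REQUEST:
        return resource_id in user.granted_resources
    return user.is_community_member  # AccessLevel.CLOSED
```

Because the level is stored per resource rather than hard-coded, lowering it later, for example once a thesis is finished, only requires changing that one value.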
The number of external people who have requested access to “level 3” resources
over recent years has not been high. It remains to be seen
whether the regulations that are currently in place can and should be main-
tained as explained. Access regulations remain a highly sensitive area where
the technical possibilities opened up by using web-based technologies need
to be carefully balanced against the ethical and legal responsibilities which
archivists and depositors have towards the speech communities. Despite al-
most 10 years of ongoing discussions and debate, no simple solution to this
problem has yet been found.
Providing tools for tagging or annotating audio-visual media has been one of
the focal points of software development at the MPI right from the start, from
the Mac-only application MediaTagger, via a set of client-server based cor-
13. http://www.lat-mpi.eu/tools/lexus
various formats, e.g., MDF, into LEXUS. In principle lexica exist in a user-
specific workspace. However, LEXUS allows sharing these lexica with other
users thus enabling collaboration on the development and population of a lex-
icon. Cablitz (this volume) gives a detailed account of the implementation of
LEXUS in an actual documentation project.
The LEXUS frontend has evolved over time using different browser-
based technologies into the current FLEX version, which due to its use of
the Adobe Flash plug-in provides a similar look and feel across a wide variety
of browsers and platforms. The rendering of lexical entries has always been
very flexible, allowing users to construct templates for both list and entry
views. A new version of the LEXUS backend is currently close to completion;
in addition to providing increased stability and performance, it will also make
it easier to add new output formats, e.g., a printable version of the lexicon.
Making different tools like LEXUS, ANNEX/TROVA and ELAN coop-
erate as seamlessly as possible is another important line of development. Sep-
aration of metadata and annotation content has its merits, but at some point
they will have to come together e.g. in a combined data-metadata search in
TROVA. Some annotation editing options, especially those that are executed
on multiple files (like find-and-replace in many files), make perfect sense in
the context of ANNEX. The combination of ELAN and LEXUS will on the
one hand allow lookup and retrieval of information from a lexicon while an-
notating, and on the other hand will enable the user to start building a lexicon
while annotating.
8. Accessing data
the metadata field does not require the use of a controlled vocabulary. Some
kind of mapping would need to be performed in order to find all variants of
the same value. The situation becomes more complex if one needs to search
across different catalogues with different metadata schemes. It is hoped that
the use of links to the ISOcat data category registry in metadata schemas and
value sets will make cross-archive searches more manageable. As an exam-
ple, archive A may use a metadata schema that contains the element “gender”
for speakers for which the values can be “F” and “M”, archive B may use the
element “sex” for basically the same concept and uses the values “female”
and “male”. If both metadata schemas referred to the proposed ISOcat
term “HumanGender” and the values “feminine” and “masculine” (with the
definition that this relates to the gender of a person rather than grammatical
gender), an ISOcat-aware search tool could search across both archives using
either terminology.
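The gender/sex example can be sketched as a lookup through a shared category. In this illustration plain strings stand in for ISOcat data categories (the real registry uses persistent identifiers, which are not reproduced here), and the record layout and function name are hypothetical.

```python
# Strings standing in for shared data categories (an assumption of this sketch).
SHARED_ELEMENT = "HumanGender"

# (archive element, archive value) -> (shared element, shared value)
ARCHIVE_A = {("gender", "F"): (SHARED_ELEMENT, "feminine"),
             ("gender", "M"): (SHARED_ELEMENT, "masculine")}
ARCHIVE_B = {("sex", "female"): (SHARED_ELEMENT, "feminine"),
             ("sex", "male"): (SHARED_ELEMENT, "masculine")}


def cross_archive_search(archives, query_mapping, element, value):
    """Return ids of records matching the query in any archive.

    archives: list of (mapping, records) pairs, one per archive.
    query_mapping: the mapping of the archive whose terminology the
    query uses, so a user can search with either vocabulary.
    """
    target = query_mapping.get((element, value))
    if target is None:
        return []
    hits = []
    for mapping, records in archives:
        for record in records:
            pairs = record["metadata"].items()
            if any(mapping.get(pair) == target for pair in pairs):
                hits.append(record["id"])
    return hits


records_a = [{"id": "a1", "metadata": {"gender": "F"}},
             {"id": "a2", "metadata": {"gender": "M"}}]
records_b = [{"id": "b1", "metadata": {"sex": "female"}}]
archives = [(ARCHIVE_A, records_a), (ARCHIVE_B, records_b)]
print(cross_archive_search(archives, ARCHIVE_A, "gender", "F"))
```

The point of the design is that neither archive has to change its own schema; each only declares how its elements and values relate to the shared categories.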
8.3. Portals
While metadata and content search tools are generally suitable for specialists
to find material that they are interested in, members of a speech community
or members of the general public have other requirements when accessing
archived content. If the search services and archive access framework are
set up in a rather generic way and can be called via standard web-service
interfaces, it is possible to create an additional “layer” on top of the archive
that serves a specific user group. This layer or “Portal” can have an appealing
graphical design and it can direct people to certain pre-defined searches that
have been set up or interesting resources that have been selected. Within the
European research infrastructure projects that are currently running such as
CLARIN, more and more tools are being made available as web services. To
what extent these web services will be of use for certain user-specific portals
remains to be seen, but at least they open up a wide range of possibilities to
combine resources and services together in a web interface.
9. New challenges
Life cycle management of data can be split into three major and related
phases: creation, curation/preservation and access/utilization. With respect to
all three phases we will see accelerated technological innovation, which on the
one hand has positive effects in so far as research can make use of the newest
inventions and products, and on the other hand has negative implications with
respect to the stability of the solutions found. The trick will be to define the
islands of stability in a very dynamic environment and to participate stepwise
in the innovation process. This holds for the archive as well as for all software
being written. In all phases of the data life cycle, the challenging ethical and
legal situation needs to be taken into account.
Creation Phase: The creation process will benefit from further sophisti-
cation in recording equipment, where in particular three developments will
have their implications: (1) miniaturization of data storage leading to in-
creased capacity; (2) resolution; (3) connectivity. Miniaturization will lead
to continuously increasing storage capacities allowing researchers to make
high-resolution recordings with portable devices. Miniaturization will also
simplify field work in so far as direct annotation will be easier with the help of
smart and small devices demanding less power. The resolution of recording
as long as the quality of the metadata and the data is high – quality will ac-
tually be the crucial point for many advanced operations. One big concern is
that the amount of recorded media streams that is not being touched (anno-
tated in some form to make it ready for analysis) is increasing continuously
which means that much of the stored data will effectively not be of much
use to anyone other than the person who collected it. A new attempt to use
state-of-the-art speech and image processing technology is required that does
not build on holistic stochastic models but on detectors that react to compar-
atively simple patterns in media streams and create annotations. There will
be several of these detectors all with different characteristics that may also
be specialized on specific quality types of recordings. The resulting lattice of
annotations could be the base for linguistic evidence and theorization if there
are smart tools allowing the researcher to look for specific patterns and to
easily navigate in it.
We can indicate a few other areas where we expect new opportunities in
the coming months and years:
the initiative that aims to achieve these goals in the linguistic domain. Such
infrastructure work can only be achieved when we apply standardization and
harmonization where possible without hampering the research progress. The
DoBeS community was one of the driving forces to apply open standards
and foster new standards. If this positive attitude is continued, the work on
endangered languages will profit in many ways from new technological de-
velopments in the coming years.
References
Blue Ribbon Task Force on Sustainable Digital Preservation and Access.
2010. Sustainable Economics for a Digital Planet: Ensuring Long-Term
Access to Digital Information. San Diego: BRTF-SDPA. Online version:
http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf.
Broeder, Daan, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Wind-
houwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A data
category registry- and component-based metadata framework. In Proceed-
ings of the Seventh International Conference on Language Resources and
Evaluation (LREC 2010), Valletta, Malta, May 19–21, 2010, 43–47.
Consultative Committee for Space Data Systems. 2002. Reference Model for
an Open Archival Information System (OAIS). Washington DC: CCSDS
Secretariat, NASA.
High Level Expert Group on Scientific Data. 2010. Riding the Wave: How Eu-
rope Can Gain from the Rising Tide of Scientific Data. Brussels: European
Commission. Online version: http://cordis.europa.eu/fp7/ict/e-infrastructure/
docs/hlg-sdi-report.pdf.
ISO 12620:2009. 2009. Terminology and other language and content re-
sources – Specification of data categories and management of a Data Cat-
egory Registry for language resources.
ISO 24613:2008. 2008. Language resource management – Lexical markup
framework (LMF).
ISO 24619:2010. 2010. Language resource management – Persistent identi-
fication and sustainable access (PISA).
Kemps-Snijders, Marc, Menzo A. Windhouwer, Peter Wittenburg, and
Sue Ellen Wright. 2008. ISOcat: Corralling data categories in the wild. In
Proceedings of the Sixth International Conference on Language Resources
and Evaluation (LREC 2008), Marrakech, Morocco, May 28–30, 2008, ed.
European Language Resources Association (ELRA).
Ringersma, Jacquelijn, and Marc Kemps-Snijders. 2007. Creating multime-
dia dictionaries of endangered languages using LEXUS. In Proceedings of
Interspeech 2007, eds. Hugo van Hamme and Rob van Son, 65–68. Baixas,
France: International Speech Communication Association.
Schüller, Dietrich. 2004. Safeguarding the documentary heritage of cultural
and linguistic diversity. Language Archives Newsletter 1(3):9.
Trilsbeek, Paul, and Peter Wittenburg. 2006. Archiving challenges. In Essen-
tials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmel-
mann, and Ulrike Mosel, 311–335. Berlin, New York: Mouton de Gruyter.
Chapter 4
Comparing corpora from endangered language
projects: Explorations in language typology based on
original texts∗
1. Introduction
The current large-scale initiatives towards documenting endangered languages
have produced, among many other outcomes, an unprecedented amount of
digitally archived, natural spoken discourse in a range of typologically di-
verse languages. This chapter investigates some of the ways that language
typologists can tap into these growing resources. Our results suggest that the
cross-language comparison of original narrative texts, of the type now avail-
able in abundance in many documentation projects, is a viable and promising
avenue of typological investigation. Up until now, there has been some re-
luctance within typology with regard to direct comparisons of original texts,
for reasons that are certainly understandable: the lack of comparable con-
tent, the lack of cross-cultural comparability of genres, and perhaps most
tellingly, the lack of a standardized system of annotation. As a result, typolo-
gists interested in text-based (as opposed to grammar-based) typologies have
∗ The present study draws on the GRAID-system of morpho-syntactic annotation of connected
discourse (Haig and Schnell 2011). We are grateful to Sabine Reiter and Florian
Siegl for contributing data from Awetí and Forest Enets respectively. We would like to
thank the audiences at conferences in Nijmegen (October 2009; October 2010) and SOAS,
London (November 2009) for their feedback and comments on earlier work on this topic.
We are grateful for financial support from the University of Bamberg, the University
of Kiel, the Max Planck Research Group on Comparative Population Linguistics at the
MPI for Evolutionary Anthropology Leipzig, and the Volkswagen Foundation’s DoBeS-
programme, which financed the four documentation projects from which the data for this
chapter is taken. Many thanks also to Nicole Nau, Bethwyn Evans, Anna Margetts and
Brigitte Pakendorf for their valuable feedback on earlier drafts of this paper. The responsi-
bility for all remaining errors rests with the authors.
guage. How much, and what type of data is included in a grammar, how it
is collected, and how it is presented, are matters left to the discretion of the
author. Of course a broad consensus exists on what constitutes a good gram-
mar, but if a given language has only one grammar, as is often the case for
endangered languages, then that grammar will represent that language in ty-
pological databases, regardless of its quality. These problems are well-known,
but for lack of viable alternatives, comparison of reference grammars contin-
ues to be the most frequently used methodology in typology (cf. Haspelmath
et al. 2008).
Recently, however, an alternative methodology has been developed, some-
times referred to as “primary data typology” (Wälchli 2006, 2009). Rather
than compare reference grammars of individual languages, primary data ty-
pology attempts to compare actual instances of language usage – texts – in
different languages. Such an undertaking poses considerable challenges, and
is of course subject to a number of limitations. Perhaps the most obvious
question is: what kinds of text need to be included in a corpus to be repre-
sentative of a given language? The question of representativity has long been
discussed in traditional corpus linguistics (cf. Atkins, Clear, and Ostler 1992),
but also in the context of language documentation (cf. Himmelmann 1998;
Woodbury 2003; Seifart 2008). While there is widespread agreement on the
ideal of a maximum coverage of indigenous speech events (Geertz’ proverbial
‘thick description’ (Geertz 1973)), in practice, the selection of recorded text
types largely depends on preferences of the speech community and is con-
strained by the documenters’ limited knowledge about indigenous text genres
(Mosel 2006). As for corpus size, corpus building is first of all constrained
by limited resources. It is generally recognized that a corpus has to be of
considerable size in order to contain data on all relevant linguistic structures.
However, even the largest corpora may lack data on some structures in a
particular language. The adequacy of a corpus’s size partly depends on the
nature of the object of investigation: for an investigation of, for example, syl-
lable structure, or the use of determiners in NPs, quite short stretches of text
may be adequate to obtain a reliable picture. For an investigation of relative
clauses, on the other hand, much larger text samples across different genres
would be necessary. The issue of corpus size is ultimately an empirical one
that can only be satisfactorily addressed when sufficient studies have been
carried out with text corpora of different sizes.
While the issues of text types and corpus size are regularly raised in con-
nection with primary-data typology, it is often forgotten that they are equally
relevant to reference grammar typology. Yet with regard to reference gram-
mars, these problems are seldom made explicit. In fact, reference grammars
of different languages are based on wildly varying corpus sizes. While some
are based on just a couple of hours of recorded speech, coupled with elic-
itation sessions with a single speaker, others are based on 20–30 hours of
recorded speech and participant observation in the speech community over
many years, or written by native speakers relying on their own intuitions.
Also, reference grammars are based on corpora of different language vari-
eties and text types: some take a form of literary standard (if available) as the
basis, while others document a spoken vernacular; some are based mainly on
narrative texts, while others are confined to conversational data; and so on.
Yet despite the gaping differences in the empirical basis of reference gram-
mars, in typology, different grammars are treated for all practical purposes as
equal.
The shift towards primary data typology paves the way for a further devel-
opment: features that are fed into a typology are no longer read off a check-list
in an either/or fashion (e.g. language X “has” OV word order etc.). Rather,
features can be assigned a quantitative value, determined by the frequency of
occurrence in the text under investigation. Thus a particular text type may, for
example, be found to exhibit OV-word order in 70% of the relevant construc-
tions, and elsewhere use other word-orders. In principle grammars could –
and probably should – make such variation explicit in a precisely quantified
manner; yet this is not common practice in most of descriptive linguistics to
date (but see Biber et al. 2004 for a notable exception). But precise state-
ments about constructional variation in specific types of text in individual
languages can lead to finer-grained typologies and to closer attention to the
factors determining such variation. They open up the possibility of incorpo-
rating language-internal variation in cross-language comparison. The quan-
titative nature of primary data typology is an aspect that will be of central
concern in the present chapter.
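The step from an either/or feature value to a text frequency can be sketched in a few lines of Python. This is a minimal illustration only, not part of the study: the clause list and the labels "OV" and "VO" are invented toy data, chosen to mirror the 70% example mentioned above.

```python
from collections import Counter

def order_profile(clause_orders):
    """Return the relative frequency of each word-order label in a text."""
    counts = Counter(clause_orders)
    total = sum(counts.values())
    return {order: count / total for order, count in counts.items()}

# Hypothetical toy text: 7 clauses with OV order, 3 with VO order.
toy_text = ["OV"] * 7 + ["VO"] * 3

profile = order_profile(toy_text)
for order, share in sorted(profile.items()):
    print(f"{order}: {share:.0%}")
```

On this view a text is characterized by a frequency profile rather than by a single categorical value, and profiles from different texts or languages can then be compared directly.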
Apart from the problem of representativity, primary data typology also
faces the problem of comparability of text genres and texts of different con-
tent. It would, for example, be misleading to compare a cooking recipe text in
one language with a traditional folk narrative text in another. Currently two
approaches to alleviate these differences are widely used. The first compares
typology. The amount of such text already available in digital form can only
be estimated, but the DoBeS-programme alone, with its emphasis on text-
based documentation and 51 large-scale documentation projects to date,1 has
produced a massive amount of digitally archived spoken language data in
various stages of annotation and analysis. It is therefore rather surprising that
to date, there has been little serious attempt on the part of typologists to mine
these resources. In the remainder of this chapter, we present the results of our
own efforts in this direction, which focus on studying the form and function
of referring expressions in original texts.
1. http://www.mpi.nl/DOBES/projects/
consistent means for identifying constituents larger than words, e.g. phrases.
Although guidelines for morphemic glossing are available (e.g. the
‘Leipzig Glossing Rules’2), in practice morphemic glossing is a free-for-all
area, with individual practitioners generally developing their own, often
quite idiosyncratic, solutions. Recently, Seifart et al. (2010) evaluated the
possibilities of cross-language comparison of texts glossed with Toolbox. They
found that the sole parameter that can be more or less reliably extracted from
these data is the noun/verb ratio of the texts. Thus, the possibilities
for direct comparison of existing morphologically glossed texts appear
to be extremely restricted at present. And even if morphemic glossing were
to be standardized to a level where cross-corpus comparison was possible,
the problem of lack of identification of syntactically relevant units remains
unresolved.
The obvious solution, to develop systems for annotating syntactic struc-
tures, was never seriously pursued on a large scale. Schultze-Berndt (2006)
briefly mentions this possibility in connection with the annotation of small
language corpora, only to dismiss it as too time-consuming and hence im-
practicable. However, over the last couple of years we have been developing
such a system for annotating texts in typologically diverse languages, based
on a small and standardized system of labels (approx. 30). The system, known
as GRAID (Grammatical Relations and Animacy in Discourse) is described
in detail in Haig and Schnell (2011), and we will only outline some of the
more important properties here.
GRAID is an annotation system for glossing connected narrative texts,
resulting in an additional rich layer of data on a text corpus that contains
information on the interface of discourse and syntax. It identifies the major
clause constituents (predicates, arguments), and links the syntactic functions
of arguments to their formal properties (e.g. whether they are pronominal, or
full NPs). Additionally, GRAID includes information on the distinction be-
tween human and non-human reference of arguments,3 as this is an important
parameter in shaping many aspects of grammar. GRAID has been developed
through glossing actual texts in five genetically and areally diverse languages,
and has proved sufficiently flexible to accommodate the attested problems to
date.
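The kind of information carried by such an annotation layer can be pictured as one record per argument. The sketch below is a simplified illustration and does not reproduce the actual GRAID label syntax, for which see Haig and Schnell (2011); the class name, the field names and the three toy records are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Argument:
    form: str       # "np", "pro" or "zero" (hypothetical labels)
    function: str   # "S", "A" or "P"
    human: bool     # True if the referent is human or anthropomorphized

# Hypothetical annotations for three arguments in a toy text.
annotations = [
    Argument(form="np", function="S", human=True),
    Argument(form="zero", function="A", human=True),
    Argument(form="np", function="P", human=False),
]

# Cross-tabulate form type against syntactic function.
table = Counter((arg.form, arg.function) for arg in annotations)

# Share of arguments with human reference.
human_share = sum(arg.human for arg in annotations) / len(annotations)

print(table[("np", "P")])
print(round(human_share, 2))
```

Once every argument is stored in this way, counts by form, function and animacy reduce to simple tabulations over the records.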
2. http://www.eva.mpg.de/lingua/resources/glossing-rules.php
3. Note that the category ‘human’ includes anthropomorphized referents.
In addition, spoken German renderings of the Pear Film have also been an-
notated, based on the text versions in Himmelmann (1997: 234–243). As this
material belongs to a different genre, it is only occasionally mentioned for
comparison, but not included in the analysis.
4. Data from the fifth language, Forest Enets (Samoyedic, cf. Siegl, this volume) were not
fully available at the time of writing, hence were not included in the current analysis.
The texts from the documentation projects were selected according to the
following criteria: monologous narratives recounting traditional tales or his-
toric events that are part of the cultural heritage of the speakers. The texts had
to have been recorded, transcribed, translated and analysed prior to being an-
notated with GRAID. In all cases, the annotator was the main investigator of
the language concerned, thus ensuring the expertise was available to make the
analytical decisions necessary during annotation.5 For some languages, more
than one text was needed to reach the minimum amount of around 500 clause
units. Annotation took place over the course of approx. 18 months, with close
cooperation between the annotators. A workshop was held in 2010 in Bam-
berg where the annotators for Awetí, Gorani, Forest Enets and Vera’a con-
ducted intensive discussion and refined the annotation practice in response to
challenges raised by the individual languages. These efforts fed into the cur-
rent version of the GRAID Manual (Haig and Schnell 2011), which contains
the main principles for GRAID annotations, the inventory of symbols, and
illustrative examples and explanations.
An important feature of this research is that the data meet the general
requirement of maximal accountability, because they are based on archived and
freely accessible material (see Broeder et al., this volume). This is one of the
key advantages of working with data from language documentation projects,
where data accessibility and durability are afforded maximum priority from
the outset. GRAID thus profits from requirements that are already in place,
by simply adding an extra tier of annotation to existing archived data. It also
distinguishes the present work from much comparable work in text-based
typology, for which the raw data is often not accessible.
Given the bottleneck of additional manual annotation, the issue of corpus
size is obviously highly relevant. Surprisingly, it is seldom explicitly
discussed in the literature on quantitative approaches to typology. An exception
is the following passage from Nichols (2008: 124), discussing the size of
corpora necessary to determine the types of argument structure typically
associated with certain verb meanings:
A text frequency survey probably gives the most sensitive indication of a
language’s overall preference, but it is labour-intensive and requires close
control of genres, stylistic levels, etc. in text corpora. I estimate ... that a
corpus of about 1000 clauses exclusive of those containing the verb ‘be’ that
swamps all frequencies in some Indo-European languages, is enough to yield
reliable information on natural frequencies of verbs ...
5. In subsequent research we will need to cross-code the data to ensure consistency and
objectivity of the coding; this has not been done yet because the GRAID coding system is
still being refined and improved, and the necessary guidelines for each language are still in
preparation.
Comparing the data in Table 1 and Table 2, it is evident that our own corpora
fall within the range of those used in previously published research. Taken
together, our corpus actually already represents one of the largest, fully ac-
cessible corpora of its kind currently available.
One potential disadvantage of working with indigenous narrative texts
is that the content of each text differs, thus presumably diminishing the de-
gree of comparability across languages. When we first set out to compare
original texts, this was obviously a matter of some concern, and it was pre-
cisely this concern that has led many researchers to adopt the parallel-text, or
story-retelling, methodologies outlined above. However, our initial analysis
of the indigenous texts has in fact revealed rather remarkable areas of global
stability across all four languages, which suggests that, as a
text type, such monologous narratives have enough commonalities to make
cross-language comparison meaningful. But we have also identified an area
surprising given that these texts have not been controlled for content. A very
preliminary conclusion is that in narrative texts, somewhere in the region of
two-thirds of the core arguments refer to human participants. Whether the
proportion of two-thirds is genuinely stable, or merely an artefact of the small
corpus, remains to be seen. But it once again indicates that traditional
narratives, despite the differences in content, show similar profiles in
some respects, and are therefore in principle comparable across languages.
If we look more closely at the proportion of human referents in individual core
functions, the languages in our corpus show a clear, and again homogeneous,
pattern (see Table 6).
Table 7. Distribution of lexical mentions (i.e. lexical NPs) across core arguments

            S              A              P              total
Awetí       53.7% (87)     12.3% (20)     34.0% (55)     162
Gorani      41.2% (83)     7.9% (16)      49.9% (100)    199
Savosavo    47.0% (101)    14.4% (31)     38.6% (83)     215
Vera’a      34.0% (70)     6.3% (13)      59.7% (123)    206

higher than that of the figures provided in Du Bois, Kumpf, and Ashby (2003:
37), which is 7.1%, but it seems evident that Avoid Lexical A emerges in
a very similar fashion in the texts investigated here, suggesting again that
certain characteristics of discourse are remarkably insensitive to differences
in content. Furthermore, they are already surprisingly consistent in texts of
the size we are dealing with (around 500 clause units).
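The percentages in Table 7 can be recomputed from the raw counts, which makes it easy to check the share of lexical mentions falling in A function. The sketch below takes the counts as printed in the table; the variable and function names are our own. Recomputed from these counts, the Gorani percentages come out at roughly 41.7%, 8.0% and 50.3% rather than the printed 41.2%, 7.9% and 49.9%; the sketch does not attempt to resolve this difference.

```python
# Counts of lexical NPs in S, A and P function, as printed in Table 7.
lexical_counts = {
    "Awetí":    {"S": 87,  "A": 20, "P": 55},
    "Gorani":   {"S": 83,  "A": 16, "P": 100},
    "Savosavo": {"S": 101, "A": 31, "P": 83},
    "Vera'a":   {"S": 70,  "A": 13, "P": 123},
}

def share(counts, function):
    """Percentage of all lexical mentions that occur in the given function."""
    return 100 * counts[function] / sum(counts.values())

a_shares = {lang: round(share(c, "A"), 1) for lang, c in lexical_counts.items()}

# In which languages is A the function with the fewest lexical mentions?
a_is_lowest = {lang: min(c, key=c.get) == "A" for lang, c in lexical_counts.items()}

for lang, pct in a_shares.items():
    print(f"{lang}: {pct}% of lexical NPs are in A function")
```

In all four languages the A function hosts the smallest share of lexical NPs, which is the pattern referred to as Avoid Lexical A.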
In this section we have looked at some global features of the texts in our
corpus, restricting ourselves to an investigation of the core arguments S, A
and P. This is obviously but a tiny sub-domain of the possibilities for
text-based typology using GRAID annotations. Our initial findings are that
the texts are surprisingly homogeneous with respect to the ratio of transitive to
intransitive clauses, lexical mentions in the A function, and the overall proportion
of human referents among core syntactic functions. With the exception of
the second feature, the proportion of lexical mentions in the A function
(‘Avoid Lexical A’), where our findings essentially replicate those of previous
studies, we are unaware of any other cross-corpus studies that have looked at
these issues. Based on our findings, we can formulate an initial testable
hypothesis: monologous narrative texts tend to exhibit relatively stable rates of
transitive and intransitive clauses, and the proportion of human referents
in core argument functions is likewise fairly consistent, regardless of content.
4.1. Background
More recently, researchers such as Genetti and Crain (2003) have emphasized the
importance of specific investigations into the distribution of pronouns, as
opposed to both zero and lexical NPs. Stoll and Bickel (2009) have likewise
refined the original notion of RD by distinguishing lexical NPs from pro-
nouns in their investigation of Russian and Belhare texts. Our own data also
indicate that the deployment of pronouns is an area of considerable cross-
linguistic variation, suggesting the need for a more fine-grained distinction.
As a first indication, Table 8 gives a breakdown of the overall proportion of
different form-types (restricted to occurrences in S, A and P function) in our
corpus:
The data from German have been included in the bottom line as a compari-
son from a different text type (Pear story retellings), and more significantly,
from a language that requires overt subjects in most contexts, and generally
strongly disprefers zero-anaphora. There are two striking features of these
data. The first is that the proportion of full NPs in our corpus data is
surprisingly uniform, showing a range of just 2.6 percentage points. German differs
from the other four languages in this respect, for reasons that remain
uncertain. However, it should be recalled that the German data consist of seven
smaller texts (Pear-story retellings), potentially giving rise to more
extensive lexical reference at the beginning and end of each story. Of greater
interest are the areas of massive diversity, in particular with regard to rates
of pronoun use. Here the languages considered show a massive range of 37.7
percentage points. In the rest of this section we take a closer look at pro-
noun use in discourse, in particular as it relates to syntactic function, and to
animacy (human vs. non-human).
A possible explanation for Avoid Pronominal P is that it results from the com-
bined effects of two further tendencies. The first is that pronominal reference
is largely restricted to animate, more specifically human, referents. In their
Nepali data, Genetti and Crain find that 92% of all pronominal mentions had
human referents (cf. Genetti and Crain 2003: 215, en. 13). Pronouns with non-
human referents are not completely disallowed, but the discourse tendency to
avoid such pronouns is obviously very strong. The second is the more general
observation that P arguments are more likely to be inanimate than animate.
We discussed this proposal above in Section 3.3, and found it to be confirmed
by our data (cf. Table 6). Taken together, the combined effects of these two
tendencies could provide an explanation for Avoid Pronominal P: because
P arguments are typically inanimate, and because pronouns are avoided for
inanimates, P arguments are unlikely to be realised as pronouns.
Whether Avoid Pronominal P holds in other languages, and whether the
two tendencies just outlined are valid explanations, remain open questions. In
fact, Genetti and Crain’s explanations are not without problems, because they
do not actually state how different forms are distributed within the P function,
but only how pronouns are distributed across all of S, A and P. They likewise
do not consider the exact distribution of zero forms, which turns out to be of
some importance for our data. We will first investigate our data to determine
whether there is evidence for Avoid Pronominal P, before taking a closer look
at possible explanations.
Table 9 provides an overview of the distribution of pronouns across syn-
tactic function in our data, with German also included for comparison.
Table 9. Distribution of pronominal forms across core arguments

            S              A              P              total
Awetí       59.1% (166)    22.8% (64)     18.1% (51)     281
Gorani      44.0% (37)     23.8% (20)     32.2% (27)     84
Savosavo    57.8% (190)    39.2% (129)    3.0% (10)      329
Vera’a      63.7% (228)    29.3% (105)    7.0% (25)      358
German      40.0% (76)     43.7% (83)     16.3% (31)     190

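A rough first check for Avoid Pronominal P can be read off the counts in Table 9 by asking, for each language, whether P is the function with the fewest pronouns. The sketch below uses the counts as printed; the function name and the simple "fewest pronouns" criterion are our own simplification, and the comparison of raw counts is not a substitute for a proper statistical evaluation.

```python
# Counts of pronouns in S, A and P function, as printed in Table 9.
pronoun_counts = {
    "Awetí":    {"S": 166, "A": 64,  "P": 51},
    "Gorani":   {"S": 37,  "A": 20,  "P": 27},
    "Savosavo": {"S": 190, "A": 129, "P": 10},
    "Vera'a":   {"S": 228, "A": 105, "P": 25},
    "German":   {"S": 76,  "A": 83,  "P": 31},
}

def p_is_least_pronominal(counts):
    """True if P hosts fewer pronouns than both S and A."""
    return counts["P"] < min(counts["S"], counts["A"])

# Languages in which P is not the function with the fewest pronouns.
exceptions = [lang for lang, c in pronoun_counts.items()
              if not p_is_least_pronominal(c)]

for lang, counts in pronoun_counts.items():
    p_share = 100 * counts["P"] / sum(counts.values())
    print(f"{lang}: P share {p_share:.1f}%, P lowest: {p_is_least_pronominal(counts)}")
```

On this simple criterion, Gorani is the only language in the table in which P is not the function with the fewest pronouns.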
6. Differences in agreement patterns cannot directly account for these distributions. The hy-
pothesis suggesting itself in this context would be that, in functions for which there is oblig-
atory agreement morphology, either NPs (adding additional information) or zero (avoid-
ing redundancy) should be the preferred means of reference, i.e. that pronouns would be
avoided. Savosavo is the only language that seems to support this hypothesis: its obligatory
agreement morphology for P (marking person, number and gender) seems to align with the
significantly low proportion of pronouns in this function. But for the other languages
investigated, this correlation could not be confirmed: Vera’a has no obligatory agreement
morphology at all, but still shows a significantly low proportion of pronouns in P. In
German, verbs obligatorily agree with S and A in person and number, but the lowest proportion
of pronouns is still found in P function. And finally, Gorani has obligatory agreement
morphology for S and A (marking person and number), but pronouns are distributed fairly
evenly over all three functions. We refrain from commenting on Awetí in this context as
the agreement pattern in this language is too complex to be summarized here.
Table 10. Proportion of non-human referents in the three form types for Gorani, Savosavo and Vera’a

        Gorani                  Savosavo                Vera’a
        total   non-hum.        total   non-hum.        total   non-hum.
NP      199     66.8% (133)     215     45.1% (97)      206     62.1% (128)
pro     84      16.7% (14)      329     5.5% (18)       358     6.7% (24)
zero    339     6.2% (21)       212     33.5% (71)      146     26.7% (39)

of zero, while in Savosavo and Vera’a, this is not the case. In these two lan-
guages, among all three form-types, it is indeed pronouns that show the low-
est percentage of non-human referents, hence they seem to confirm Genetti
and Crain’s hypothesis. It is evident, however, that for Gorani, we need to take
the zero form-type into consideration, which shows some remarkable pecu-
liarities: In comparison with pronouns it is striking that the total figure for
zero, 339, is about four times as high as that for pronouns, which is only 84.
And in comparison with the use of zero in the other two languages, it stands
out as having a conspicuously small proportion of non-human referents, only
6.2% compared to 33.5% in Savosavo and 26.7% in Vera’a. We can conclude
for the time being that pronouns in Gorani also exhibit a general disprefer-
ence for inanimate reference, at least in comparison to NPs, but the overall
effects of this tendency are counteracted by a very different behaviour of the
zero form-type.
So what is the explanation for the odd behaviour of Gorani? We have seen
that Gorani, like all other languages we are aware of, avoids human referents
in the P function. It is thus quite ‘normal’ in this respect. It also complies with
the tendency to avoid non-human referents for pronouns, at least in compar-
ison to NPs. However, crucially, this tendency is reversed when we compare
pronouns to zero: The proportion of pronouns used for non-human referents
(16.7% of all pronouns) is greater than that of zero (6.2% of all instances of
zero encoding).
Note that this does not mean that, in Gorani, non-human referents are
encoded more frequently by means of pronouns than by zero. Table 11 shows
for all three languages the proportion of non-human referents encoded by
each of the different form-types.
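The difference between the two kinds of proportion can be made concrete with the counts from Table 10. The following sketch assumes that the proportion of non-human referents per form type is obtained by dividing each non-human count by the total number of non-human referents in that language; the dictionary layout and function names are our own.

```python
# (total, non-human) counts per form type, as printed in Table 10.
table10 = {
    "Gorani":   {"NP": (199, 133), "pro": (84, 14),  "zero": (339, 21)},
    "Savosavo": {"NP": (215, 97),  "pro": (329, 18), "zero": (212, 71)},
    "Vera'a":   {"NP": (206, 128), "pro": (358, 24), "zero": (146, 39)},
}

def within_form(lang, form):
    """Share of a form type's tokens that are non-human (the Table 10 view)."""
    total, non_human = table10[lang][form]
    return 100 * non_human / total

def within_non_human(lang, form):
    """Share of all non-human referents that are encoded by a form type."""
    all_non_human = sum(nh for _, nh in table10[lang].values())
    return 100 * table10[lang][form][1] / all_non_human

pro_shares = {lang: round(within_non_human(lang, "pro"), 1) for lang in table10}
pro_range = round(max(pro_shares.values()) - min(pro_shares.values()), 1)

print(pro_shares)
print(pro_range)
```

For Gorani, 16.7% of pronouns but only 6.2% of zeros are non-human, yet in absolute terms fewer non-human referents are encoded by pronouns (14) than by zero (21). With the Table 10 counts, the share of non-human referents encoded by pronouns differs by 4.3 percentage points across the three languages.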
In all three languages, only a small proportion of non-human referents are
referred to by pronouns, with a range of only 4.3 percentage points. Not sur-
prisingly, the most commonly used form-type for the encoding of non-human
referents is NPs. In Gorani this preference is particularly strong, and the third
available form-type, zero, is almost as rarely used as are pronouns. In Savo-
savo, non-human referents are frequently not overtly expressed at all. This is
reflected in a relatively high percentage for zero, which counter-balances the
lower percentage for NPs. So given that all three languages share a preference
for non-human referents in P function, as well as a dispreference for using
pronouns to encode inanimates, why does Gorani not fit the pattern of Avoid
Pronominal P?
What is striking in Gorani in comparison to Savosavo and Vera’a is the
extreme predominance of zero form-types in Gorani discourse (more than
half of the core arguments, 51.1%, are expressed through zero, cf. Table 8).
Ss and As in particular (which are frequently animate and definite) are com-
monly not expressed with pronouns, but through zero, see Table 12.
This leads to a very large figure for zero, and simultaneously brings down
the overall numbers of pronouns in these functions, leading to a levelling
out of S, A and P functions with regard to pronoun use in Gorani (cf. Table
9). In this case, the agreement pattern in Gorani seems to at least contribute
to the observed proportions: Gorani verbs agree with S and A, so it could
be expected that for given referents, zero would be the preferred form type in
these functions, while pronouns would be dispreferred. This is indeed the case
in Gorani. In contrast, Savosavo and Vera’a lack obligatory agreement with
S and A, and here pronouns are by far the most frequent form type in these
functions. The concentration of pronouns in S and A functions pushes up the
overall figure for pronouns in these languages, leading to a corresponding
reduction in the relative numbers of pronouns in P function, an effect that is
missing in Gorani.
Thus, in Gorani, the disproportionately frequent use of the zero form-
type in general, and in A and S function in particular, is the reason why we
do not find the tendency to avoid pronominal Ps in the language, even though
the supposedly contributing factors are still operative. The full picture only
emerges when all form-types are taken into consideration, and tested across
all syntactic functions.
We began this section with the observation that pronoun use is evidently one
area of massive cross-linguistic variation. As noted, the two current major
paradigms in discourse-based syntactic typology, namely Preferred Argument
Structure and Referential Density (at least in the original investigation of
Bickel 2003), do not actually consider pronouns as a category in their own
right. As Genetti and Crain (2003) point out, however, this is undoubtedly an
oversimplification. They propose a constraint on pronoun distribution, based
on data from Nepali, namely Avoid Pronominal P, and suggest two factors
that contribute towards this constraint. In our data we found evidence for
Avoid Pronominal P in three languages, though to rather differing degrees,
while one language, Gorani, showed no evidence for this constraint. A closer
look at the Gorani data revealed that although the contributory factors are still
operative, language-specific characteristics in the deployment of pronouns
versus zero worked against producing the expected Avoid Pronominal P. This
result confirms the view that the deployment of pronouns is an area of high
cross-language variation, subject to a subtle interplay of language-specific
features, hence a very promising arena for text-based typology. It also
highlights the need for a greater typological range of languages under
investigation: the conclusions based on Nepali cannot simply be transferred to other
languages, but this only emerges when a sufficient number of diverse
languages have been investigated, in particular languages which utilize both
zero and pronominal reference strategies. This case study also highlights the
importance of the animacy feature, which is recorded in GRAID-annotations.
Including an animacy parameter for pronouns yields a number of other ob-
servations, for example the almost complete absence of non-human pronouns
in A function.7
Finally, recall that Table 8 above revealed a surprisingly uniform distri-
bution of NPs in discourse: somewhere between 20–30% of core arguments
appear to be expressed by full NPs in spoken narrative discourse. Whether
this figure can be confirmed in further studies remains to be seen. But it raises
the interesting possibility that the space of cross-language differences in this
regard is essentially restricted to how the remaining 70–80% of positions are
partitioned in terms of pronominal versus zero reference. Whether this view
of the matter is realistic can only be determined in the light of more extensive
studies of narrative texts from different languages, but it is certainly a goal
worth pursuing.8
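The partition suggested here can be made concrete with a small calculation. The following is only a sketch: the function name is our own, and the counts are invented for illustration and are not figures from the corpus. It shows two imaginary texts with the same share of full NPs but a very different split of the remaining core arguments between pronouns and zero.

```python
def referential_shares(np_count, pronoun_count, zero_count):
    """Return the share of each referential form among core arguments (S, A, P)."""
    total = np_count + pronoun_count + zero_count
    if total == 0:
        raise ValueError("no core arguments counted")
    return {
        "np": np_count / total,
        "pronoun": pronoun_count / total,
        "zero": zero_count / total,
    }


# Hypothetical counts for two imaginary texts, NOT data from the corpus.
text_a = referential_shares(np_count=250, pronoun_count=150, zero_count=600)
text_b = referential_shares(np_count=250, pronoun_count=550, zero_count=200)

for name, shares in (("A", text_a), ("B", text_b)):
    non_lexical = shares["pronoun"] + shares["zero"]
    print(name, f"NP {shares['np']:.0%}", f"pronoun+zero {non_lexical:.0%}")
```

On this view, cross-language comparison would focus on the last two values of each dictionary, since the first is expected to stay within a narrow band.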
5. Conclusion
7. This very striking tendency merits more detailed treatment than can be accommodated in
this chapter. Here we simply note its existence, while deferring proper coverage to future
investigations.
8. Stoll and Bickel (2009) report significantly different ratios of lexical NPs in their compar-
ison of Russian and Belhare texts, which would appear to run counter to the tendency we
have identified in our own data. However, Stoll and Bickel’s findings are based on a count
of all argument positions, including goals, and at least some locatives, whereas the figures
for our data refer solely to S, A and P. Thus the figures from the two studies cannot be
directly compared.
Awetí is a Tupi language of the Upper Xingú region in West Brazil. Data on
the Awetí language has been collected during a DoBeS project by Sabine Re-
iter who is currently preparing a PhD thesis containing a grammar sketch of
the language. Two traditional Awetí stories have been used in our corpus, annotated
by Sabine Reiter using the software ELAN: kal_awytyza1_GRAID.eaf
and kal_makawaja_GRAID.eaf. From each text, approximately the first
third was annotated with GRAID (11 and 13 minutes respectively). The total
number of words in the first GRAID-annotated section is 1256, in the second
1033 (total of 2289).
Gorani (Indo-European, West Iranian, from the village of Gawraju in
West Iran). The two texts used in the corpus are available in Mahmudweyssi
et al. (in print), where they are published together with a sketch grammar of
the language, and a CD containing the recordings.
Savosavo (Papuan Isolate, Savo Island, Solomon Islands): The three texts
used in the corpus are available at http://vc.uni-bamberg.de/moodle/course/view.php?id=9488.
The complete data collected during the DoBeS project “Documentation
of Savosavo, a Papuan language of the Solomon Islands” is stored
in the DoBeS online archive at http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI1366371%23.
Wegener (2008) is the most recent, and most comprehensive,
grammatical description of Savosavo to date.
Vera’a is an Austronesian, Oceanic language spoken on the island of
Vanua Lava in the north of Vanuatu in the South Pacific. The two texts used
in the corpus are available at http://vc.uni-bamberg.de/moodle/course/view.php?id=9488.
The Vera’a language has been extensively documented during the
DoBeS project “Documentation of Vera’a and Vurës, the two surviving endangered
languages of Vanua Lava, Vanuatu” and the data collected in the
course of this project is stored at http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI649372%23.
Schnell (2010) contains the first description of the language.
Some structural properties of Vera’a are discussed in François (2005,
2007, 2009) in connection to the historical development of the languages of
North Vanuatu.
References
Atkins, Sue, Jeremy Clear, and Nicholas Ostler. 1992. Corpus design criteria.
Literary and Linguistic Computing 7(1):1–16.
Biber, Douglas, Geoffrey Leech, Susan Conrad, and Edward Finegan. 2004.
Longman Grammar of Spoken and Written English. London: Longman.
Bickel, Balthasar. 2003. Referential density in discourse and syntactic typol-
ogy. Language 79(4):708–736.
Chafe, Wallace L. 1979. The flow of thought and the flow of language. In
Discourse and Syntax (Syntax and Semantics Volume 12), ed. Talmy Givón,
159–181. New York: Academic Press.
Mayer, Mercer. 1994[1969]. Frog, Where Are You?. New York: Dial Books
for Young Readers.
Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising
from the aims of language documentation. Language Archives Newslet-
ter 1(3):3–4.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials
of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann,
and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Næss, Åshild. 2007. Prototypical Transitivity. Amsterdam, Philadelphia:
John Benjamins.
Neeleman, Ad, and Kriszta Szendrői. 2007. Radical pro drop and the mor-
phology of pronouns. Linguistic Inquiry 38(4):671–714.
Neeleman, Ad, and Kriszta Szendrői. 2008. Case morphology and radical
pro-drop. In The Limits of Syntactic Variation, ed. Theresa Biberauer,
331–348. Amsterdam, Philadelphia: John Benjamins.
Newmeyer, Frederick J. 2010. On comparative concepts and descriptive cat-
egories: A reply to Haspelmath. Language 86:688–695.
Nichols, Johanna. 2008. Why are stative-active languages rare in Eurasia?
A typological perspective. In The Typology of Semantic Alignment, eds.
Mark Donohue and Sören Wichmann, 121–139. Oxford: Oxford Univer-
sity Press.
Payne, Thomas E. 1992. The Twins Stories: Participant Coding in Yagua
Narrative. Berkeley: University of California Press.
Schnell, Stefan. 2010. Animacy and referentiality in Vera’a. Ph.D. Disserta-
tion, Kiel University.
Schultze-Berndt, Eva. 2006. Linguistic annotation. In Essentials of Language
Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike
Mosel, 213–251. Berlin, New York: Mouton de Gruyter.
Seifart, Frank. 2008. On the representativeness of language documenta-
tion. In Language Documentation and Description, Volume 5, ed. Peter K.
Austin, 60–76. London: School of Oriental and African Studies.
Seifart, Frank, Roland Meyer, Taras Zakharko, Balthasar Bickel, Swintha
Danielsen, and Alena Witzlack-Makarevich. 2010. Cross-linguistic variation
in the noun-to-verb ratio: Exploring automatic tagging and quantitative
corpus analysis. Presentation at the workshop Advances in Documentary
Linguistics, MPI Nijmegen, 14–15 October 2010.
Stoll, Sabine, and Balthasar Bickel. 2009. How deep are differences in ref-
erential density? In Crosslinguistic Approaches to the Psychology of Lan-
guage: Research in the Tradition of Dan Isaac Slobin, eds. Jiansheng Guo,
Elena Lieven, Nancy Budwig, Susan Ervin-Tripp, Keiko Nakamura, and
Şeyda Özçalişkan, 543–555. London: Psychology Press.
Stolz, Thomas. 2007. Harry Potter meets Le petit prince – on the usefulness
of parallel corpora in crosslinguistic investigation. Sprachtypologie und
Universalienforschung 60(2):100–117.
Wälchli, Bernhard. 2006. Descriptive typology, or, the typologist’s ex-
panded toolkit. Unpublished ms., available at: http://ling.uni-konstanz.de/
pages/home/a20_11/waelchli/waelchli-desctyp.pdf.
Wälchli, Bernhard. 2009. Motion events in parallel texts. A study in primary-
data typology. Unpublished Habilitationsschrift, University of Bern.
Wegener, Claudia. 2008. A grammar of Savosavo, a Papuan language of the
Solomon Islands. Ph.D. Dissertation, MPI for Psycholinguistics, Radboud
Universiteit Nijmegen.
Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language
Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51.
London: School of Oriental and African Studies.
Chapter 5
“Words” in Kharia – Phonological, morpho-syntactic
and “orthographical” aspects∗
John Peterson
1. Introduction
The present study deals with the issue of “words” in the South Munda lan-
guage Kharia, spoken in eastern-central India, from a phonological, morpho-
syntactic and “orthographical” perspective. As Dixon and Aikhenvald (2002:
2) note, not all languages have a concept with the same meaning as the En-
glish concept word; rather, this is an empirical issue which poses a challenge
to all documentary linguists, especially (but certainly not only) if the lan-
guage being documented is currently in the process of developing a standard
written form. The main issue here is that a number of typical characteristics of
“words” from different descriptive levels, such as phonology, morpho-syntax
but also orthography (to the extent that an orthography exists), may or may
not converge upon one single “middle-sized transcription unit, delimited by
empty spaces, which represents a basic unit in terms of meaning, grammatical
function, or sound structure” (Himmelmann 2006: 253). Hence, determining
just what is to appear between empty spaces is one of the most fundamental
issues facing any descriptive linguist, as this issue quite literally surfaces with
each and every “word” which is to be transcribed and can also have further
consequences with respect to a description of the language. It can also turn
out to be an issue where the intuition of the linguist differs considerably from
that (or those) of native speakers, who will typically apply different criteria
in determining what to place between empty spaces.
∗
The present study is based on the results of eight months of field work conducted during
five trips to Jharkhand, India. I would like to express my gratitude to the German Research
Council (Deutsche Forschungsgemeinschaft) for two generous grants which made two of
these trips possible (PE 872/1-1, 2). Furthermore, many thanks to Utz Maas for comments
on an earlier version of this study, although I alone am of course responsible for any errors
and oversights.
As will be shown in the present study, this is also largely true of Kharia:
While the concept sabda with (roughly) the same meaning as English word is
found in Kharia, this term has been borrowed from neighboring Indo-Aryan
languages, cf. Sadri sabad and Hindi śabda. The latter both mean approxi-
mately the same thing as English word, although even here there are many
unclear cases where the different criteria do not converge upon a single unit,
as we shall see in the following pages. On the other hand, the closest native
term in Kharia is kayom, which is perhaps best translated as ‘speech; speak;
spoken; matter’. Thus this element primarily refers to the act of speaking in
general and can refer to entire utterances, but it can also refer to ‘matters’ or
‘affairs’. It does not, however, refer to the same unit as word, regardless of
how this is defined, and to my knowledge it is never used in reference to a
written unit.
This study is structured as follows: Section 2 presents a general overview
of Kharia. This is followed by a discussion of the phonological word in Sec-
tion 3 and the morpho-syntactic word in Section 4. As these topics have been
dealt with elsewhere in greater detail (Peterson 2011; Peterson and Maas
2009), these discussions will be rather brief. Section 5 then presents the re-
sults of a preliminary investigation into the topic of the written word and
speakers’ / writers’ intuitions as to how to divide units into words when writ-
ing. Section 6 briefly summarizes these results and discusses perspectives for
future research.
2. An overview of Kharia
Kharia belongs to the southern branch of the Munda family and is primarily
spoken in eastern-central India. According to Lewis (2009), it was spoken in
1997 by 292,000 people in India and by 293,580 people in all countries.
Kharia is a largely agglutinating language in the sense that one morph
generally corresponds to one morpheme. However, it is not agglutinating in
the sense that these morphemes are affixes – in fact, Kharia only possesses
very few affixes, as virtually all grammatical markers are enclitic. This topic
will be dealt with in more detail in Sections 3 and 4 below.
The predicate is generally clause-final, although not rigidly so, and the
order of clause-level units (the syntagmas, see below) is entirely “free” in the
sense that any ordering of these elements to express a particular pragmatic
status is grammatical. On the other hand, the order within the syntagmas,
discussed in some detail in Sections 2.1 and 2.2, is fixed, and the syntagmas
always form contiguous units.
Structurally speaking, there is no compelling evidence for assuming the
presence of parts-of-speech such as nouns, adjectives and verbs in Kharia.
Not only may virtually any contentive morpheme be freely used in attribu-
tive, referential and predicative function, but even “phrasal” units, i.e., units
resembling NPs in English and other languages, can be used in all three of
these functions. As such, I will not refer to “nouns”, “verbs” or “adjectives”
in Kharia. Rather, there are two types of (non-endocentric) “phrases” which
may freely fulfill any one of these three functions, one ending with case marking,
the other with TAM/PERSON-marking; I will therefore refer to these as
the CASE- and TAM/PERSON-syntagmas, respectively. These are purely structural
terms and should not be taken as referring to any particular discourse
functions. Both syntagmas have the same underlying structure:
3 and 4. Note that the causative is realized either as a prefix (with mono-
syllabic roots), an infix (with polysyllabic roots) or both simultaneously (in
double causatives).
REC CAUS-Lexeme-<CAUS> (V2) (=PRF)=TAM/VOICE=PERS/NUM/HON
Table 2 provides an overview of the enclitic subject markers (A / S). Note that
Kharia has three numbers, singular, dual and plural, in both syntagmas, and
an inclusive/exclusive distinction in the non-singular first persons. The dual is
also used as an honorific marker. With the exception of the 3rd person plural
form =may, which is found only in the TAM/PERSON-syntagma, the markers
of the 3rd persons, ø or “zero” ‘SG’, =kiyar ‘DU’ and =ki ‘PL’, are found in
both the TAM/PERSON- and CASE-syntagmas. These are, properly speaking,
number markers and will be glossed accordingly.
The enclitic markers for first and second persons may be considered phono-
logically reduced forms of the corresponding free-standing forms, given in
Table 3. The forms shown in Table 2 are the only enclitics in the language
which may be considered “reduced forms”.
1. TAM/PERSON-syntagmas containing the passive V2 ɖom are also always marked for the
middle voice.
and number/honorific status (NUM / HON) belong to the semantic base, not the
functional head.2
As with the TAM/PERSON-syntagma, there are three numbers, singular (unmarked),
dual (=kiyar) and plural (=ki), and here as well the dual also serves
as an honorific marker. There are three cases:
3. Phonological words3
From a phonological perspective, the Kharia lexicon can be divided into two
main groups:
2. As this topic is not directly related to the issue of “words” in Kharia, we will not deal with
it in any detail here. The interested reader is referred to the discussion in Peterson (2011,
Chapter 4).
3. As the phonological word has been dealt with in detail elsewhere (especially in Peterson
2011, Section 2.5; Peterson and Maas 2009), we will only deal with this topic briefly here.
The interested reader is referred to these two works for further discussion.
The data in this section were analyzed with the freeware program “Praat”, developed by
Paul Boersma and David Weenink, available under http://www.praat.org/
– The first group, which makes up the large majority of all lexical entries,
corresponds quite closely to the class of contentive lexemes (or “lexical mor-
phemes”) such as man, woman, run, sit, etc. It also contains the much smaller,
closed class of proforms / deictics (I, you, today, tomorrow, etc.). A small
number of postpositions also belong to this class, presumably those which
have only recently derived from contentive morphemes.
Although there is no distinctive tone or accent in the language, these ele-
ments all show a “low → high” (LH) pitch pattern: The final syllable of the
morpheme has a high tone, while the first syllable either has a low tone or
immediately drops to a low tone before rising. This can be seen in Figure 4
below, which shows the pitch contour of bunui ‘pig’.
As these elements make up the vast majority of lexical entries and they can all
stand alone in the syntax, at least if they are bisyllabic (see further below), it
seems clear that our definition of the phonological word will have to include
this class of elements, and the LH pitch pattern will have to be considered
one of this unit’s defining characteristics.
– However, the LH-pattern does not hold for all units in the lexicon: A rather
large but limited number of elements do not show this pattern – they may
have a falling or rising pitch in a particular utterance, depending on a number
of contributing factors which have not yet all been identified, but they do not
have the inherent LH pitch pattern of the elements just mentioned. This is
illustrated in Figure 5 below for baje=ki [resound=MID.PST] ‘it resounded’,
where the marker for tense and middle voice, =ki, does not continue the LH-
pattern but also does not start a new LH-pattern.
[Figure 5: pitch contour of baje=ki ‘it resounded’; pitch (Hz) from 50 to 250 against time (s) from 0 to 0.6986]
4. With the exception of col ‘go; move’, borrowed from Kharia’s Indo-Aryan neighbor Sadri,
no borrowed lexemes reduplicate in this environment in Kharia; col has been strongly
This does not hold for TAM/PERSON-syntagmas; here, no form is required to reduplicate,
although the masdar can also be found here in the so-called “generic
middle voice”, which has a number of semantic functions, including habituality
or a remote past or future.
(2) iñ ɖaʔ biʈh=oʔj.5 Unmarked construction
1SG water pour.out=ACT.PST.1SG
‘I poured the water out.’
(3) iñ ɖaʔ biʔɖ-biʔɖ=ki=ñ. Marked construction (=generic middle)
1SG water pour.out-RDP=MID.PST=1SG
‘I poured the water out over and over (e.g., that was my job, so I did it
constantly).’
As the preferentially bisyllabic units of the first group above, with the in-
herent LH pitch pattern and which may stand alone in the syntax, may be
followed by one or more members of the second group, the phonological
enclitics which typically express case or TAM-categories, we can define a
phonological word in Kharia as follows, from Peterson and Maas (2009: 221):
“Phonological word” – A preferentially bisyllabic unit beginning with a low
or rising tone and which continues either until a pause or until the next unit
showing this pattern
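The definition can be read as a simple grouping procedure. The sketch below is an illustration only, not part of the analysis in Peterson and Maas (2009): the class and function names are hypothetical, the pitch information is supplied by hand rather than measured, and the two-unit sample merely juxtaposes forms cited above (bunui, baje=ki) without claiming that this is an attested utterance.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    form: str                  # transliterated morph, e.g. "baje" or "=ki"
    starts_lh: bool            # True if the unit begins with a low or rising tone
    pause_after: bool = False  # True if a pause follows the unit


def phonological_words(units):
    """Group units into phonological words following the definition above."""
    words, current = [], []
    for unit in units:
        # A unit with its own LH pattern opens a new phonological word.
        if unit.starts_lh and current:
            words.append(current)
            current = []
        current.append(unit.form)
        # A pause closes the current phonological word.
        if unit.pause_after:
            words.append(current)
            current = []
    if current:
        words.append(current)
    return ["".join(word) for word in words]


# Hand-annotated toy input (assumption): only the contentive units start LH.
utterance = [
    Unit("bunui", starts_lh=True),
    Unit("baje", starts_lh=True),
    Unit("=ki", starts_lh=False, pause_after=True),
]
print(phonological_words(utterance))
```

Because enclitics carry no LH pattern of their own, they are simply absorbed into the word opened by the preceding contentive unit, which is why grammatical information ends up at the right edge.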
This also means that a phonological word typically ends in an enclitic ex-
pressing grammatical information or, put slightly differently, that grammati-
cal information is typically found at the end of the phonological word. Thus,
affected by analogy with the native lexeme ɖel ‘come’ in this and other phonological environments,
cf. Peterson and Maas (2009: 232–233).
5. All consonants are devoiced and aspirated before the active, past tense marker =oʔ, so that
biʔɖ ‘pour out’ is realized as biʈh before =oʔ.
the reciprocal marker kol (cf. Figure 2), which according to my data is a
phonological word with the characteristic LH pitch pattern, is unique among
the grammatical markers in that it is both a separate phonological word and
appears before a contentive morpheme instead of after this unit. This will
be of importance again in Section 5 when discussing the written word and
speakers’ intuitions.
There are also a number of units whose status as phonological words
is either unclear or variable, depending on their environment in a particular
utterance. As space is limited, we will concentrate here only on those forms
which will be relevant in the discussion in Section 5.6
– Recall the V2s discussed in Section 2. Note that almost all of these V2s (18
out of the 21 that I am aware of) are monosyllabic and that Kharia strongly
disfavors monosyllabic words. As these units directly follow the semantic
base of a predicate, we generally have one of the following two situations:
• In the event of a monosyllabic V2 following a bisyllabic stem as in (4),
the V2 can be realized as a phonological word together with the remaining
TAM/PERSON-marking. However, although further research is necessary,
if the V2 follows a monosyllabic stem as in (5), the evidence suggests
that these two units (together with any following TAM/PERSON-marking)
can form a phonological word, as there is no (obligatory) low-tone
marking on the V2.
(4) ho=kaɽ ɖoko goʔɖ=ki.
that=SG.HUM sit.down C:TEL=MID.PST
‘S/he sat down.’
(5) ho=kaɽ leʔj=bay=oʔ.
that=SG.HUM curse=EXCESS=ACT.PST
‘S/he gave [someone] a good scolding.’
• The V2 can also be separated from the preceding unit, whether a stem or another
V2, by one of the floating pragmatic clitics =ga ‘FOC’, =jo ‘ADD’
or =ko ‘CNTR’ (see Section 4), which always attach to the last element
of a syntagma, which also corresponds to the last unit in a phonological
word ((6), see also example (13)).
[Figure: pitch contour; pitch (Hz) from 50 to 250 against time (s) from 0 to 0.9613]
Preliminary as this evidence may be, we will see in Section 5 that a se-
quence of more than one functional marker such as this can lead to uncer-
tainties when speakers are asked to write down an utterance.
– Finally, there are a number of other units, such as the demonstratives u
‘PROX’, ho ‘MED’ and hin ‘DIST’ or the coordinator ro ‘and’, whose status
as phonological words or proclitics (demonstratives) or enclitics (ro) seems
to depend on their environments in a particular utterance, although further
research is necessary to determine these exact environments; see Peterson
(2011, Section 2.5) for further discussion.
As we shall see in Section 5, the clitics and those units whose status as phono-
logical words is either indeterminate or dependent on the surrounding envi-
ronment – the V2s and unclear forms such as ro ‘and’ – are responsible for
the majority of uncertainties when speakers are asked to write something in
Kharia.
4. Morpho-syntactic words7
It was shown in Section 3 that a phonological word in Kharia typically con-
sists of a (generally bisyllabic) contentive morpheme and potentially one or
more enclitics. If there are two or more enclitics appearing together, there
is also preliminary evidence that these may combine to form a phonological
word.
The present section shows that those elements which behave as clitics
phonologically also behave morpho-syntactically as enclitics, as their “po-
sition with respect to the other elements of the phrase or clause follows a
distinct set of principles, separate from those of the independently motivated
syntax of free elements in the language” (Anderson 2005: 31). Although these
units all have somewhat differing distributions, they all have in common that
they are “phrasal affixes” in the sense that the elements to which they at-
tach are neither roots nor stems but rather possibly complex units in the syn-
tax. The following example demonstrates the type of “mismatch” between
phonology and morpho-syntax which is involved here. Although the exam-
ple is for the English enclitic copular form ’s, the principles involved are the
same as in Kharia ((7), from Sadock 1991: 50; “W” = word, “S” = sentence,
“Af” = affix).
(7) The man’s at the door.
Left: [W [N man] [Af ’s]]
Right: [S [NP Det N] [VP V PP]]
7. For a more detailed discussion, cf. Peterson (2011, Chapter 3) and Peterson and Maas
(2009: 212–213).
In the left half of (7), we see that ’s in this example attaches to a phonological
word, as it requires a host. Syntactically, however, we see on the right-hand
side of (7) that ’s is a “word” in the syntax. As such, it – like its potential hosts
– is a “syntactic atom” (Di Sciullo and Williams 1987) or a “grammatical
word” (Dixon and Aikhenvald 2002).
As the following discussion will show, virtually all grammatical markers
in Kharia are morpho-syntactic clitics, attaching to other units at the level
of syntax. Thus, in this definition, phonological words in Kharia typically
consist of more than one morpho-syntactic word.
We begin with the three pragmatic markers discussed briefly in Section 3,
=jo ‘ADDitive focus’, =ko ‘CNTR’ or ‘contrastive focus’, and the general re-
strictive focus marker =ga ‘FOC’. These are referred to here as floating clitics
as they may freely appear anywhere in the clause, provided that they attach
as the last elements to what is otherwise already a syntactic unit (cf. (8)–(9)).
The only units they may never attach to are demonstratives when these are
followed by a contentive morpheme, as the demonstrative is proclitic in this
environment, cf. (10). The scope in (8) is ambiguous and can be interpreted
as referring either only to oʔ ‘house’ or to the entire CASE-syntagma.
These three markers may also be combined with one another in varying orders
to yield subtle pragmatic nuances which still require further study.
Similar data obtain for the V2s, whose status was shown in Section 3 to be
somewhat ambiguous with respect to the phonological word. (12) shows that
the floating clitics may not only have scope over the entire TAM/PERSON-
syntagma, they may also intervene between the stem (or the last element
of the semantic head in general, regardless of its status) and the V2. (13)
shows that these floating clitics may also appear between two V2s. This fur-
ther demonstrates that the V2s, despite their ambiguous phonological status,
are “syntactic atoms”.
(12) karay goʔɖ=te=ga / karay=ga goʔɖ=te
do C:TEL=ACT.PRS=FOC do=FOC C:TEL=ACT.PRS
‘s/he does the job’
(13) kaʔbʈo saŋgoʔɖ ɖom=ga may=ki.
door close V2:PASS=FOC V2:TOTAL=MID.PST
‘The door was shut entirely.’
With the exception of the causative marker, which is clearly affixal in nature,
appearing as a prefix with monosyllabic roots but as an infix with polysyllabic
roots, virtually all other grammatical marking in Kharia is also enclitic and
not suffixal. For reasons of space, only a few of these can be illustrated here,
however similar comments hold for all other grammatical markers as well (cf.
Peterson (2011, Section 3.1) for further discussion).
As (14) shows, the genitive marker is a morpho-syntactic enclitic, as it
refers to the entire semantic base. In this respect, it behaves very similarly to
the English ’s.
(14) laʔ [u sembho ro ɖakay rani=kiyar]=aʔ nãw=jan
then this Sembho and Dakay queen=HON=GEN nine=CLASS
beʔʈ=ɖom=kiyar aw=ki=kiyar.
son=3POSS=HON QUAL=MID.PST=HON
‘And this Sembho and queen Dakay had nine sons (HON) (= ‘[This
Sembho and Queen Dakay]’s nine sons were).’
Similar data hold for the oblique marker =te (15) and number marking (16)–
(17); these examples show that, if the semantic head is omitted (e.g., as its
identity is already known or is considered unimportant), the respective marker
simply attaches to the last element of the semantic base, regardless of its
status.
(15) [iñ=aʔ boʔ]=te ‘at my place’, or simply
1SG=GEN place=OBL
8. Underlying /g/ in the coda is obligatorily realized as [ʔ] in the native lexicon, hence /ñog/
‘eat’ is realized as [ñoʔ] when not followed by a vowel, and is written as such; speakers
/ writers of Kharia will not accept the use of the grapheme <g> in this environment and
insist on using a special symbol for the glottal stop here.
they would already have some intuitive notion of the written word, at least
for Hindi and English. The situation in Kharia, however, differs considerably
from that of Hindi and English: Both Hindi and English have standardized
orthographies in which not only the individual orthographical characters of
the various morphemes have been standardized but also (at least to a large
extent) the notion of the orthographical word, but this is not true of Kharia.
Kharia is only rarely written: Although most Kharia in southwestern Jhark-
hand, where the speakers I worked with all came from, are literate, educa-
tion is generally in Hindi, and increasingly in English, and since virtually
all Kharia speak Hindi, Hindi is generally used when the need arises to put
something down in writing. When Kharia is written, the Devanagari script
is virtually always used, which is also used for Hindi and a large number of
other, mostly Indo-Aryan languages.9 While there are a number of issues of
dispute with respect to the “correct” symbols to denote certain sounds which
are not found in Hindi, above all the glottal stop and the pre-glottalized con-
sonants, these issues can be considered minor, as in all systems suggested
so far, there is a unique means of designating these segments (cf. Peterson
(2011, Section 1.5) for details).
In Devanagari as it is used to write Kharia and many other modern lan-
guages, vowels are written as independent characters only at the beginning
of a written word. Otherwise, the vowel is viewed as a diacritic to be added
to the consonant or consonant cluster which precedes it. [ʌ], [ə] or short [ɑ],
all of which are allophones of /a/, are considered to be an inherent part of the
consonant and are therefore not indicated (the so-called “inherent a”), while
other vowels must be added to this consonant. The following provides a few
simple examples of the principles involved.
(25) No preceding consonant: <a> अ <i> इ <u> उ
Preceding consonant: <ta> त <ti> ति <tu> तु
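The principle shown in (25) can also be stated in terms of Unicode code points, which is how Devanagari text is stored digitally. The following is a minimal sketch covering only the three vowels and the one consonant of (25); the dictionary and function names are our own, and the Roman keys are an ad hoc transliteration, not an established standard.

```python
# Independent vowel letters, used when no consonant precedes.
INDEPENDENT = {"a": "\u0905", "i": "\u0907", "u": "\u0909"}  # अ इ उ

# Dependent vowel signs, added to a preceding consonant.
# The "inherent a" has no sign of its own, hence the empty string.
DEPENDENT = {"a": "", "i": "\u093F", "u": "\u0941"}

# Consonant letters already include the inherent a.
CONSONANTS = {"t": "\u0924"}  # त


def write_syllable(consonant, vowel):
    """Spell an optional consonant plus a vowel in Devanagari."""
    if consonant is None:
        return INDEPENDENT[vowel]
    return CONSONANTS[consonant] + DEPENDENT[vowel]


for consonant in (None, "t"):
    print([write_syllable(consonant, vowel) for vowel in "aiu"])
```

Note that <ta> consists of a single code point, whereas <ti> and <tu> consist of two, although all three are perceived as one written unit.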
There are also a number of so-called “conjuncts” in this system, i.e., combinations
of consonants; as the basic symbols for consonants are in fact not consonants
but rather the combination of a consonant plus the “inherent a”, this
vowel must be removed in order to depict consonant clusters, often resulting
in symbols which differ considerably from the basic forms of the symbols in-
volved. To give one simple example, the consonant cluster transliterated here
as <dr> is written as द्र, which bears a strong resemblance to <d> (द), but
virtually none to <r> (र). This combination of one or more consonants plus
a vowel is referred to as an akṣar, which plays a central role in orthographic
systems based on Devanagari. We will return to the issue of the Devanagari
writing system and its possible influence on the data for the written word in
Section 6.
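The removal of the “inherent a” in a conjunct has a direct counterpart in the digital encoding of Devanagari, where a dedicated sign (the virama) is placed between the consonant letters. The sketch below illustrates this for the <dr> cluster mentioned above; the function name is hypothetical, and how the cluster is actually drawn depends on the font in use.

```python
import unicodedata

VIRAMA = "\u094D"            # sign that suppresses the inherent a
DA, RA = "\u0926", "\u0930"  # the basic letters for <d> and <r>


def conjunct(*consonants):
    """Join consonant letters into a cluster by inserting the virama between them."""
    return VIRAMA.join(consonants)


dr = conjunct(DA, RA)  # stored as three code points, typically drawn as one glyph
print(dr, [unicodedata.name(ch) for ch in dr])
```

The stored sequence is thus fully regular, while the written shape is not, which is the point made above about the conjunct resembling <d> but not <r>.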
Kharia orthography has only been “normalized” to the extent that the in-
dividual morphemes have been given a more-or-less fixed form in writing,
whereas the issue of orthographical words has, to my knowledge, not been
dealt with at all. As such, every time a Kharia speaker wishes to write some-
thing in Kharia, s/he must spontaneously decide when to separate these el-
ements and when to write them together. The result is that writers not only
vary greatly with respect to each other as to what to depict as a “word”, there
is also considerable variation for each writer even within shorter texts.
To be sure, this experiment cannot be viewed as anything more than a
heuristic means of approaching the complex issue of the written “word”. To
begin with, the written language is not the spoken language: Here, factors
come into play which are irrelevant to the spoken language, such as whether
the written word appears “too short” to be a separate word, or for that matter
“too long”. Nevertheless, this method gives us a valuable first insight into the
intuition of native speakers with respect to this concept in Kharia on which
we can later build.
The method used in this experiment is the following: In each case I read
aloud in the group a particular unit of language in Kharia (syntagma, clause or
complex sentence) to check for its grammaticality. All examples were based
either on examples occurring in the texts I had collected from these same
speakers or topics which we had discussed on other occasions. The examples
were all chosen to test for “borderline” cases, involving enclitics and postpo-
sitions, repetition of what are already phonological words to denote intensity,
distributivity or plurality, and reduplication, which derives bisyllabic phono-
logical words from monosyllabic stems, i.e., the masdar.
Once the grammaticality of the unit had been assured, each speaker was
in turn asked to repeat it. Then, all were asked to write what had been said.
Afterwards, each of the speakers was asked to count the “words” (sabda)
s/he had written. This was occasionally followed by a brief discussion of
the example with the respective speaker alone at a later date. The following
briefly summarizes the results (the examples are given here in my system of
transliteration):
Discussion: Here, Speaker 2 has once again considered each unit a word,
while the other speakers have either analyzed it as lebu and ki=te=ga (Speak-
ers 1, 3 and 4), i.e., with a clitic word (cf. Figure 6), or the entire unit as a
single word (Speakers 5 and 6). Again, Speaker 2 is consistent in his approach
with the first two examples.
Discussion: For those speakers who considered this example to consist of two
words, these were i and ye=niŋ. Again, Speaker 2 has analyzed each compo-
nent as a separate word while, again, Speaker 6 views the entire TAM/PERSON-
syntagma as a single word.
Discussion: Speakers 1 and 3 saw in this sentence ten words each, both as I
have analyzed the sentence. Speaker 2, on the other hand, analyzed dhaŋgar
‘servant’ as one word but considered ki=yaʔ=ghaɖ a single word.
Speaker 4 similarly analyzed this as dhaŋgar and ki=yaʔ=ghaɖ, but also
considered juda juda a single unit. In a similar vein, Speaker 6 analyzed the
sentence exactly the same as Speaker 4, except that for this speaker, the sec-
ond word was gomke=ro instead of gomke and ro as separate words. Finally,
Speaker 5 analyzed the sentence the same way as Speaker 6, but for her the
third word was dhaŋgar=ki=yaʔ=ghaɖ ‘for the servants’.
Note that once again the speakers / writers clearly tend to write more units
together as the length of the unit to be written increases, although to differing
extents for each individual speaker / writer.
Whereas juda juda in (33) is a case of repetition and in my analysis con-
sists of two phonological words, ol-ol in (34) is grammatically conditioned –
this is a case of reduplication to derive a phonological word from a monosyl-
labic stem, i.e., the masdar.
Discussion: All of the speakers except Speaker 1 analyzed this sentence into
words as I have done here. Interestingly, however, Speaker 1 not only ana-
lyzed ol-ol as two words, he did so by analyzing this unit as the two “words”
o and lol, thus equating the word-boundary with the syllable boundary, al-
though the form is transparent to modern speakers as the repetition of the
lexeme ol. This must also be true of Speaker 1, otherwise he would not have
chosen to write this unit as two words. This is reminiscent of examples (26)
and (27), where those speakers who chose to write the clitic =oʔ separately
did so at the syllable boundary, not at the morpheme boundary.
The following example tests not only for repetition of the type juda juda
in (33), in this case moʔʈho moʔʈho ‘very fat’, it is also intended to see to what
extent speakers of Kharia analyze units such as kuda koloŋ ‘millet bread’ as
a compound, as is the traditional view of units of this type (cf. e.g. Malhotra
1982), or as two separate words, as I view this.10 The example has been
taken directly from one of the stories I collected, a children’s story.
(35) iɖib=te moʔʈho moʔʈho kuda koloŋ ter=na
night=OBL fat REP millet bread give=INF
laʔ=ki=may.
IPFV=MID.PST=3PL
‘At night they used to give big fat millet breads [to the servants for
them to eat].’
                 Spk 1  Spk 2  Spk 3  Spk 4  Spk 5  Spk 6
Number of words  7      7      7      5      6      6
10. For a discussion of “compounds” in Kharia, cf. Peterson (2011, Section 4.6).
direction, with 5 out of 6 speakers considering it part of the same word as the
lexical base.
Discussion: Here, all speakers considered the excessive marker bay – to-
gether with the following functional morpheme – to be a separate word from
lej, which is considered a separate word despite its monosyllabicity. In fact,
all analyzed this short sentence as in my analysis, with the exception of
Speaker 2 who, as so often, considered the enclitic marker =te ‘OBL’ a sep-
arate word although, interestingly, not the enclitic marker =oʔ on the predi-
cate.
As was noted at the beginning of this section, this brief experiment is
merely a heuristic means of approaching the issue of what units correspond
to “words” – in a purely intuitive sense – for literate native speakers. Never-
theless, it also shows a number of clear tendencies:
– The clitics clearly cause the most uncertainty among speakers / writers.
Some tend to view these as separate units, most consistently
Speaker 2, while others tend to view them as part of the same written
word as their host.
– The shorter the unit that speakers / writers were asked to judge, the more
likely they were to write enclitics as separate words.12 As noted by Utz
Maas (p. c.), this may be due to the fact that the default expectation of
what is perceived to be a sentence is a unit which consists of more than
one word. If the sentence is short, the tendency to write identifiable units
separately, if possible, is greater than when the sentence is longer.
– Clitics which begin with a consonant tend to be written as separate words
more often than clitics which begin with a vowel.
– When clitics which begin with a vowel are viewed as separate units, they
are written together with the final consonant of the preceding morpheme,
e.g., goʈh ‘C:TEL’ and =oʔ ‘ACT.PST’ in (26)–(27) are reanalyzed and
written by some as go and ʈhoʔ, although neither of these two units is a
12. This is especially true for Speaker 2, cf. examples (26)–(28), (30), (36) and (37).
morpheme. Thus, the separation of these units into written words follows
the syllable boundary, not the morpheme boundary.
– This same tendency is also apparent in example (34), where one speaker
analyzed the derived form ol-ol ‘take-RDP’ as two units but did so by ig-
noring the morpheme boundary. Instead, he separated the two units at the
syllable boundary. This is noteworthy since he must have recognized this
reduplication as a process affecting the morpheme ol, otherwise he would
not have seen two words in this constituent. Nevertheless, when it came
to writing this unit and a decision had to be made, he chose the syllable
boundary over the morpheme boundary. This may at least be partially due
to the fact that the Devanagari writing system tends to write vowels as dia-
critics to consonants or consonant clusters in the akṣar. However, as there
are also special vowel characters which are used to depict vowels at the
beginning of a written word, this requires further study.
– When multiple clitics occur, there is a tendency among some speakers to
write these units together as a separate clitic word, e.g., in examples (28)
and (31).
– The analysis of V2s as separate words from the semantic base is strongly
preferred, even when both the stem and the V2 are monosyllabic (cf. (26),
(27) and (37)).
– Finally, postpositions such as buŋ ‘INST’ and ghaɖ ‘for; PURP’ are some-
times written together with the preceding unit, e.g., (32). Alternatively, if
the preceding unit contains one or more enclitics, these may form a written
word together with the following postposition, as in (33).
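The contrast between writing at the morpheme boundary and writing at the syllable boundary, noted in the tendencies above, can be made concrete with a minimal sketch in Python. It is an illustration only and not part of the experiment: the function names are hypothetical, the vowel inventory is reduced to a, e, i, o, u, and it is assumed that all host-final consonants shift into the onset of a following vowel-initial element. The second test form uses IPA values assumed for the transliteration.

```python
VOWELS = set("aeiou")  # simplified vowel inventory (assumption)


def split_at_morpheme_boundary(host, following):
    """Write each morpheme as its own unit."""
    return [host, following]


def split_at_syllable_boundary(host, following):
    """Write two units, but cut at the syllable boundary.

    If the second element begins with a vowel, the final consonant(s) of the
    host are written together with it (simplifying assumption: all of them).
    """
    if not following or following[0] not in VOWELS:
        return [host, following]
    cut = len(host)
    while cut > 0 and host[cut - 1] not in VOWELS:
        cut -= 1
    if cut == 0:  # host contains no vowel; leave it untouched
        return [host, following]
    return [host[:cut], host[cut:] + following]


# Reduplicated ol-ol written as two units
print(split_at_morpheme_boundary("ol", "ol"))   # ['ol', 'ol']
print(split_at_syllable_boundary("ol", "ol"))   # ['o', 'lol']

# Vowel-initial enclitic after a consonant-final V2
print(split_at_syllable_boundary("goʈh", "oʔ"))  # ['go', 'ʈhoʔ']
```

Neither output unit of the syllable-based split corresponds to a morpheme, which is the point made in the discussion of these examples.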
Despite the number of uncertainties involved in dividing these various units
into smaller units in writing, there are nevertheless clear tendencies and only
certain types of units cause the speakers / writers problems, above all the
clitics. Thus, the speakers are clearly trying to cope with the indeterminate
status of these units: Phonologically, they are part of the same “word” as the
unit to which they attach. As such, they should be written as part of this larger
unit. On the other hand, they are morpho-syntactic words, which should be
written separately. In the end, there seems to be a general preference to write
phonological words as single units, although this is at best no more than a
tendency, and one which some speakers apply more consistently than others.
Summarizing these results, the strategies used by these speakers / writ-
ers can be classified as “semi-conjunctivism” in the terminology of van Wyk
(1967): The strategies they employ are disjunctive in that each phonologi-
cal word tends to be written as a separate unit, and the writing of enclitics
as separate words is also clearly disjunctive. Nevertheless, enclitics are often
written together either with their host, with a following postposition, or with
each other, and postpositions are occasionally written together with their pre-
ceding dependent element; these are clearly conjunctive strategies. Thus the
writers here “steer a middle course” in dividing these units into written words,
to borrow an expression from van Wyk (1967: 230), combining conjunctive
and disjunctive strategies, although often differing from one another greatly
with respect to their individual preferences, as well as from one utterance to
the next for the same speaker / writer.
References
Aikhenvald, Alexandra Y. 2002. Typological parameters for the study of
clitics, with special reference to Tariana. In Word: A Cross-linguistic Ty-
pology, eds. Robert M. W. Dixon and Alexandra Y. Aikhenvald, 42–78.
Cambridge: Cambridge University Press.
Anderson, Gregory D. S., and Norman H. Zide. 2002. Issues in Proto-Munda
and Proto-Austroasiatic nominal derivation: The bimoraic constraint. In
Papers from the 10th Annual Meeting of the Southeast Asian Linguistics
Society, ed. Marlys A. Macken, 55–74. Tempe, AZ: Arizona State Univer-
sity, South East Asian Studies Program (Monograph Series Press).
Chapter 6
Aspect in Forest Enets and other Siberian indigenous
languages – when grammaticography and
lexicography meet different metalanguages
Florian Siegl
1. Introduction
This contribution explores the mutual dependency of grammaticography and
lexicography, recurring topics in Ulrike Mosel’s writings (e.g. Mosel 2004,
2006a,b), by investigating a controversial feature in the description of sev-
eral indigenous languages of Siberia – the representation of aspect. The main
focus is on Forest Enets, a moribund Northern Samoyedic language spoken
and remembered by fewer than 40 individuals between 50 and 65 years of age, on
which the author has conducted substantial fieldwork in recent years.1 Ad-
ditional observations deriving from recent descriptions of other indigenous
languages of Siberia will be touched upon en passant, as the description of
aspect is part of a more general problem which is in no way restricted to
Forest Enets.
Although not always clearly stated as such, the description of aspect has
been heavily dependent on dominant linguistic models, developed on, and
applied to, dominant majority languages. Such frameworks are all too fre-
quently transferred to the analysis of a given minority language. As aspect
is a category whose description may, or even must, include reference to both
grammar and lexicon, a set of general questions inevitably arises. Before em-
barking on the analysis of aspect in a given language, the grammarian is faced
with a number of questions, including the following: is aspect derivational
or inflectional, does aspect interact with tense or mood, and if so, how; are
there instances of nominal aspect? Other questions that may arise are, for ex-
ample, whether and how aspect choice interacts with negation, and whether
aspect distinctions are available to nominalized verbs or reserved for finite
verbs only. From the perspective of the lexicographer,2 on the other hand,
the following questions need to be addressed: should each aspect form be
represented as a single entry, even if the derivational process is productive
and regular? Should aspect-marked verbs be listed under the headword as
subentries or should only irregular or lexicalized forms be listed? Whereas
the representation of aspect in dictionaries may not be decided by either the
grammarian or the lexicographer alone, the motivation underlying a certain
preference of representation should be reasonably well explained. However,
precisely this is not always obvious concerning the languages under investi-
gation. A third, complicating factor is that there have been, and continue to
be, attempts to force the Forest Enets aspect system into the aspectual frame-
work often applied to the Russian language.3 The outcome is descriptions
of aspectual systems which, although satisfying Russian principles of gram-
maticography and lexicography, fail to capture crucial features of the Forest
Enets aspect system.
the only difference between e.g. čitat’ ‘read<IPF>’ vs. pročitat’ ‘read<PFT>’ is
aspect. As much as aspect plays an important role, it is generally seen “more
as a partition of the lexicon than an inflectional operation” as Russian does
not have a “single morphological device that marks the opposition of aspect.”
(Timberlake 2004: 93)
Russian aspect tightly interacts with tense. Only imperfective verbs al-
low a periphrastic future with the auxiliary byt’ ‘be’, inflected for person,
followed by the infinitive of the imperfective verb e.g. ja budu čitat’ <1SG
be.1SG read.INF IPF> ‘I will read/I will be reading.’ Apart from future tense,
imperfective verbs also form both past and present tense. Whereas in the
present tense, the verb is finite, and specialized personal endings for each
person are used, the past tense consists of a special past tense marker in -l
to which markers signaling gender/number agreement (masculine singular
ø, feminine singular -a, neuter -o, plural -i [no gender distinction]) are added.
Perfective verbs, relying on the same endings, only show present tense and
past tense. When perfective verbs are inflected in the same way as imperfec-
tive verbs which express present tense, a future time reading evolves instead.
Although other tests for telling perfective from imperfective verbs have been
suggested, the possibility of a periphrastic future provides the single most
reliable criterion for partitioning the verb lexicon into two aspects (see also
Timberlake 2004: 401).
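The periphrastic-future criterion described above can be stated as a small decision procedure. The following Python sketch is an illustration under stated assumptions, not an implementation of any existing tool: the dictionary and function names are hypothetical, the acceptability values simply restate the aspect values given for these verbs in this section, and only the 1SG form ja budu cited in the text is generated. Bi-aspectual verbs are not handled.

```python
# Whether "ja budu" + infinitive is possible for a verb. The values restate
# the aspect values given in the text (imperfective: possible; perfective: not).
ALLOWS_PERIPHRASTIC_FUTURE = {
    "čitat’": True,
    "pročitat’": False,
    "podpisat’": False,      # cited later in this section
    "podpisyvat’": True,     # cited later in this section
}


def classify(infinitive):
    """Assign IPF or PFT by the periphrastic-future criterion."""
    return "IPF" if ALLOWS_PERIPHRASTIC_FUTURE[infinitive] else "PFT"


def future_1sg(infinitive):
    """Return the 1SG periphrastic future, or None if the verb forms none."""
    if ALLOWS_PERIPHRASTIC_FUTURE[infinitive]:
        return f"ja budu {infinitive}"
    return None


def partition(verbs):
    """Split a list of infinitives into the two aspect classes."""
    groups = {"IPF": [], "PFT": []}
    for verb in verbs:
        groups[classify(verb)].append(verb)
    return groups


print(future_1sg("čitat’"))      # ja budu čitat’
print(future_1sg("pročitat’"))   # None
print(partition(list(ALLOWS_PERIPHRASTIC_FUTURE)))
```

The sketch treats the test as a single binary property of each infinitive, which is what makes it usable as a criterion for partitioning the verb lexicon.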
As already mentioned above, Russian has no specialized morphological
aspect marker. Instead, a restricted variety of morphological strategies are
used to create and maintain aspectual pairs. The cornerstones of the Russian
aspect system are the so-called simplex verbs, which do not have prefixes
and are imperfective. Such simplex verbs report continuous situations which
are either static and/or unchanging e.g. grustit’ ‘be sad’, videt’ ‘see’. Others
may involve “some degree of gradual change and responsibility” e.g. sidet’
‘sit’, rabotat’ ‘work’, motat’ ‘wind’, l’stit’ ‘flatter’, krutit’ ‘twist, twirl’ (Tim-
berlake 2004: 402, 411). Such simplex verbs are perfectivized by a number
of prefixes5, e.g. pisat’ ‘write<IPF>’ → na-pisat’ ‘write<PFT>’. Furthermore, pre-
fixed perfective verbs, e.g. podpisat’ ‘sign<PFT>’, can and do form correspond-
ing imperfective verbs by suffixation: podpisyvat’ ‘sign<IPF>’ or perepisyvat’
‘rewrite<IPF>’ from perepisat’ ‘rewrite<PFT>’. What is important here is the fact
that the relation between prefixed perfective verbs and corresponding imper-
fective verbs is perceived differently to the one between simplex verbs and
their perfective counterpart (Timberlake 2004: 407):
Simplex imperfective verbs are prefixed and yield perfectives. Many of those
perfectives – those that report a continuous process leading to a limit – can
be suffixed and yield closely related secondary imperfectives that form un-
ambiguous aspectual pairs. Prefixed verbs that discuss discrete quanta of the
activity are less amenable to forming secondary imperfectives. Because sim-
plexes ordinarily are imperfective, one or another of the prefixed perfectives
will serve as the perfective counterpart to the simplex imperfective.
The fact that Russian allows the formation of secondary aspect pairs of the
type perepisyvat’ ‘rewrite<IPF>’ vs. perepisat’ ‘rewrite<PFT>’ makes the Russian
aspect system different from that of other languages. Whereas e.g. Estonian
and Hungarian have some limited means of perfectivizing verbs by a particle
(ära in Estonian) or a prefix (meg- in Hungarian), secondary aspect pairs of
the Russian perepisat’ / perepisyvat’ (from a simplex pisat’ ‘write<IPF>’) type
are missing. As will be shown later, Forest Enets does not allow this either, and
this peculiarity of Russian aspect will occupy us again below.
Although small in number, there are verbs which do not fit readily, or
do not fit at all into the perfective/imperfective system of Russian. The first,
rather small, group is generally known as bi-aspectual verbs.6 Such verbs lack
a distinguished aspect value and can be used to express both aspects without
any further derivation. The second group comprises verbs of movement where
the principles of aspect derivation intermingle with indeterminate vs. deter-
minate movement and function on a different basis. Third, a small class of
so-called simplexes is perfective. These problems of Russian aspectology are,
however, not relevant here.
6. Although it appears that the terminus technicus bi-aspectual verb is fairly well established,
Timberlake proposes a different name, anaspectual verbs (Timberlake 2004: 407–408).
7. See e.g. Bondarko (2002) for a short survey on different approaches throughout the second
half of the 20th century as well as Zaliznjak and Šmelev (2000). In this respect, Russian
differs sharply from the languages of Siberia which will occupy us in the remainder of this
paper. In contrast to Russian, which is morphologically fusional, the languages which will
be investigated now are all agglutinative and morphemes are fairly well segmentable.
8. Some ideas for such a dictionary were offered by Zaliznjak and Šmelev (2000: 97–103).
4.1. The representation of aspect in Central Siberian Yupik Eskimo and Ket
The following two rather clear instances retrieved from recent literature on
Central Siberian Yupik and Ket neatly exemplify the problems of aspect de-
scription. Whereas Russian sources try to describe the verbal paradigm fol-
lowing Russian principles of grammaticography and lexicography, descrip-
tions by non-Russian linguists have expressed serious doubts, or do not even
mention this possibility. For Central Siberian Yupik, Steven Jacobson wrote:
Concerning the issue of tense, the unmarked form of verbs (other than ad-
jectival or descriptive verbs) has a past-tense implication for CSY [Central
Siberian Yupik, FS] but a present-tense implication for CAY [Central Alaskan
Yupik, FS] unless context indicates recent past. Thus, CSY qavaxtuq means
‘he slept’, while CAY qavaxtuq means ‘he is sleeping’. To get the present
tense in CSY one must use the postbase -aq@-: qava7aquq ‘he is sleeping’.
This same postbase in CSY is also used for repeated actions, as in quunp@N
qava7aquq ‘he is always sleeping’. Consequently, the verb forms with this
postbase in CSY begin to resemble the “imperfective aspect” of Russian verbs
– this is probably the reason that Soviet CSY-to-Russian dictionaries include
this postbase in the verb forms that they use as their citation forms for verbs.
(Jacobson 1990: 275)9
A more interesting instance can be found in recent descriptions of the last re-
maining Yeniseian language Ket. The description offered by Heinrich Werner
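(Werner 1997: 206–210) takes a Slavonic perspective on aspect and assumes
the existence of aspect pairs. “Es handelt sich um die folgenden Formen [=as-
pects, F.S], die einander gegenüberstehen 1) perfektive vs. imperfektive; 2)
permansive (bzw. progressive) vs. nicht-permansive” [‘The forms in question,
which stand in opposition to one another, are the following: 1) perfective vs.
imperfective; 2) permansive (or progressive) vs. non-permansive’] (Werner 1997: 206). In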
contrast, the descriptions of Edward Vajda (Vajda 2004) and a slightly more
explicit description published as Vajda and Zinn (2004) neither operate with
a clear imperfective vs. perfective distinction as assumed by Werner nor ex-
plicitly state the existence of aspectual pairs.
9. An earlier review of the major Soviet publications on Siberian Yupik Eskimo by Ulving
(1971) was equally critical of the interpretation of several aspect suffixes although the
review concentrated more on problems within the description of phonology and morphol-
ogy (Ulving 1971: 102–109).
10. Samoyedic languages form the second branch of the Uralic language family together with
the other and generally better known Finno-Ugric branch. Concerning its internal struc-
ture, Samoyedic is generally subdivided into two major branches, Northern Samoyedic
and Southern Samoyedic. The only Southern Samoyedic language still alive is Selkup,
which due to its internal dialectal stratification could easily be subdivided into several in-
dependent languages. Both Nenets and Enets languages (Tundra Nenets and Forest Nenets
and respectively Tundra Enets and Forest Enets) and Nganasan belong to the Northern
Samoyedic subfamily. See also Janhunen (1998) for further background information.
11. Nganasan is excluded from this discussion as the Nganasan aspectual system is known
to differ quite extensively from the Nenets, Enets and Selkup system (Wagner-Nagy
2001: 52–76). As data for the extinct Southern Samoyedic language Kamas is scarce and
not systematized, it has to be excluded too.
12. Interestingly, his grammar of Taz Selkup treated aspect before tense.
13. Although Castrén’s grammar contains a description of Enets, most of the data derives ac-
tually from Tundra Enets. The only English-written grammar of (Forest) Enets by Künnap
(1999) is nothing else than an enhanced translation of Tereščenko (1966) and can therefore
be excluded.
14. Sorokina’s unpublished candidate dissertation also deals partly with this topic, but all nec-
essary information is readily available in this article.
15. The categories tense, aspect, modality and evidentiality are described in Chapter 7 of Siegl
(forthcoming).
16. Although mentioned by Sorokina (1975), no further examples were given. The label
постепенность действия comes from Tereščenko (1966), though slightly different
suffixes -do-/-to- were given by Tereščenko (1966: 453). The label cumulative was introduced
by Künnap (1999: 28).
such as koš ‘find’ always refer to an action just completed, hence requiring
a past tense verb form in the English translation. For past tense reference,
three different tenses can be found. The general past tense marked by -š is
morphologically unusual, as it follows verbal suffixes in word-final position,
e.g. mosra-d-uš <work-2SG-PST> ‘you worked’. The perfect is marked by
-bi following verbal suffixes e.g. mosra-bi-d <work-PERF-2SG> ‘you have
worked’. Finally, a distant past is found, for which both the perfect and the
general past markers are combined, e.g. mosra-bi-d-uš <work-PERF-2SG-
PST> ‘you had worked’. The future tense is expressed morphologically reg-
ularly, in that its marker -ąa is followed by verbal endings, e.g. mosra-ąa-d
<work-FUT-2SG> ‘you will work’.
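The slot order of the four tense forms just cited can be summarized in a short Python sketch. It is a restatement of the 2SG examples of mosra- ‘work’ and nothing more: the names STEM, SUFFIX, TEMPLATES and build are hypothetical, the suffix shapes are copied as printed in the examples, and treating the u before the past marker as a linking vowel is an assumption made here for illustration.

```python
STEM = "mosra"  # 'work'

# Suffix shapes as printed in the examples above.
SUFFIX = {
    "2SG": "d",
    "PERF": "bi",
    "FUT": "ąa",
    "PST": "uš",  # -š, with a linking u after the person ending (assumption)
}

# Order of slots after the stem, following the segmentations in the text.
TEMPLATES = {
    "general past": ["2SG", "PST"],          # past marker is word-final
    "perfect": ["PERF", "2SG"],
    "distant past": ["PERF", "2SG", "PST"],  # perfect and past combined
    "future": ["FUT", "2SG"],
}


def build(tense):
    """Return the segmented form and its gloss for one tense label."""
    slots = TEMPLATES[tense]
    form = "-".join([STEM] + [SUFFIX[slot] for slot in slots])
    gloss = "-".join(["work"] + slots)
    return form, gloss


for tense in TEMPLATES:
    print(tense, *build(tense))
# general past mosra-d-uš work-2SG-PST
# perfect mosra-bi-d work-PERF-2SG
# distant past mosra-bi-d-uš work-PERF-2SG-PST
# future mosra-ąa-d work-FUT-2SG
```

Laid out this way, the unusual position of the general past marker is easy to see: it is the only tense suffix that follows the person ending.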
As can be seen from Table 1, Forest Enets has eight derivational suffixes
expressing aspect which can be combined with both aorist and past tenses.17
Future tense and aspect are mutually incompatible.
When contrasting tense and aspect markers, a major morphological dif-
ference is readily observable in negation, for which a negative auxiliary con-
struction is used. The negative auxiliary, either ńi- or i- carries inflectional
morphology such as tense, person and mood (the latter not demonstrated)
whereas the main verb is realized as an infinite form, traditionally called the
connegative. In the following, the negation of the aforementioned tense forms
is shown:
d. uu i-bi-d-uš mosra-ʔ
2SG NEG.AUX-PERF-2SG-PST work-CN
‘You had not worked.’
However, both the future tense marker and all aspect markers remain on the
negated lexical verb as the following examples show:
The evolving picture shows a certain mismatch between tense and aspect.
First, tense-marked verbs always need a verbal ending, but do not need an
aspect marker. This observation holds for affirmative and negated clauses
alike:
In contrast, tense and aspect can co-occur, but in negation, aspect remains on
the negated lexical verb which appears in the infinite connegative form:
18. poxi (Ru: jukola) is a cover term for dried staples, usually fish or meat for the winter
which is prepared in summer.
19. The first form represents the basic form of a morpheme; other forms are the result of
morphophonological processes.
20. Forest Enets verbs show three conjugation types. The so-called first conjugation contains
almost all intransitive and all transitive verbs. The second conjugation is a pragmatically
orientated conjugation with implications for topic prominence and only transitive verbs
can be alternatively conjugated in this conjugation. The third conjugation contains intran-
sitive verbs which do not fall into the first conjugation.
The habitual -ubi/-mubi/-mbi/-umbi marks verbs for actions which take place
on either a regular basis or frequently enough to be considered a habit:
The durative suffix -gu/-ku and the frequentative suffix -ʔ/-r and its allomorph
-ŋa pose considerable difficulties of analysis, as the frequentative -ʔ/-r has
undergone reanalysis as conjugation marker and is almost exclusively found
on verbs belonging to the inflection class IIa. With several verbs, some limited
productivity of a related frequentative -rV/-lV can be observed, but this is
beyond the scope of this paper and the reader is referred to Chapter 7 in
Siegl (forthcoming). For the current discussion, it is sufficient to characterize
their functions as follows; the durative suffix -gu/-ku marks an event (and
occasionally also a state) as ongoing:
The frequentative suffix -ʔ/-r and its allomorph -ŋa expresses an action as ei-
ther frequently happening or being an instance of verbal plurality. In example
(16), the frequentative suffix is lexicalized and functions as the conjugation
class marker for verbs of the IIa class, e.g. d’orid’ ‘speak’.
The following clause shows one of the few examples with the other (etymo-
logically) related frequentative aspect suffix -rV/-lV:
21. As future tense is negated like an aspect, there should be a historical connection between
them and in fact, Prokof’ev (1937) classified the future as aspect. For a variety of reasons
which cannot be addressed here in detail, this classification needs to be rejected on the
synchronic level but a historical connection is very likely.
22. The underlying corpus consists of 45 fully annotated narratives (all monologues) equaling
roughly 3.5 hours of spoken Forest Enets.
This picture is further supported by data from elicitation with one necessary
correction. Although absent from transcribed speech, the habitual is compat-
ible with relative past tenses, both perfect and distant past:
(18) kudaxai äku-xun šiąi po d’iri-ubi-ubi-ą-ud’
long.ago here-LOC.SG two year live-HAB-PERF-1SG-PST24
‘A long time ago I had been living here for two years.’ [ZNB IV 70]
A short note on the ratio of finite verbs without overt aspect marking vs.
finite verbs marked for aspect is in order. The currently used ELAN annotated
corpus consisting of narratives (all monologues) contains more than 2000
verbs25 out of which 335 are morphosyntactically infinite or semi-finite. This
means that out of around 1700 finite verbs only 330 finite verbs show overt
marking for aspect. In the current state of documentation, the following tense-
aspect combinations are attested:
Table 3. TA combinations in Forest Enets
Aorist: durative, frequentative, inchoative, cumulative, deliminative, habitual, resultative, discontinuative
General Past: durative, frequentative, habitual, resultative, discontinuative
Perfect: inchoative, habitual
Distant Past: inchoative, habitual
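The corpus figures cited before Table 3 imply that only about one finite verb in five carries overt aspect marking. The arithmetic can be sketched in Python as follows; the variable and function names are hypothetical, and because the text gives the totals only approximately (“more than 2000”, “around 1700”), the result is indicative rather than exact.

```python
# Figures as reported in the text; the first and third are approximate.
total_verbs = 2000       # "more than 2000" verbs in the annotated corpus
non_finite = 335         # morphosyntactically infinite or semi-finite
finite_reported = 1700   # "around 1700" finite verbs
aspect_marked = 330      # finite verbs with overt aspect marking


def share(part, whole):
    """Proportion of `part` in `whole`, as a percentage with one decimal."""
    return round(100 * part / whole, 1)


# Lower bound on finite verbs obtained by subtraction from the round total
finite_by_subtraction = total_verbs - non_finite

print(finite_by_subtraction)                         # 1665
print(share(aspect_marked, finite_reported))         # 19.4
print(share(aspect_marked, finite_by_subtraction))   # 19.8
```

On either denominator, roughly four out of five finite verbs in the narratives occur without an overt aspect marker.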
23. This number contains aspect markers on both finite and infinite verbs. In the columns
further to the right, only aspect markers on finite verbs were counted.
24. The /u/ is apparently added for phonotactic reasons.
25. Due to some inconsistencies in annotation, the exact number is currently unknown.
However, tense and person marking are generally restricted to finite verbs,
including the negative auxiliary:26
26. A specialized infinite converb in -bu allows additional tense marking relying on a mor-
pheme which is absent from finite morphology, but this has no implication for the discus-
sion here (Siegl, forthcoming, Chapter 12.3).
27. Although not of direct relevance here, Russian prefixes are related to spatiality as a large
number of prefixes appear as freestanding prepositions. Although the grammaticalization
history of Forest Enets aspect suffixes is unknown, spatiality is clearly not underlying their
development.
Whereas the usage of aspect has not changed in the course of the last three
decades for which textual material is available, Sorokina’s description of as-
pect in Forest Enets has changed profoundly. In her only article dedicated to
aspect in Forest Enets Sorokina (1975) clearly stated that although -gu/-ku
comes indeed close to Russian imperfective verbs, this parallel is superfi-
cial.28 In 2009, Sorokina and Bolina published an extended Forest Enets –
Russian dictionary containing a short grammatical sketch of the language,
though without an inventory of aspect markers. In comparison to her 1975
position, Sorokina has changed her interpretation profoundly:
In Forest Enets the category aspect can be distinguished. For every verb, an-
other verb expressing an opposite aspect value does exist. The perfective verb
is formally unmarked; from these, imperfective verbs are generated, apart
from verbs which are semantically imperfective. The appearance of aspect
is morphologically very diverse – by alternating the stem sound, by change
of conjugation type, by suffix derivation and so on. Side by side with aspect
forms, there exists a large number of suffixes which express the process of
actions such as length, repetition, reiteration. (Sorokina and Bolina 2009: 27
– my translation, F.S.)
(23) a. bu čii-š
3SG fly-3SG.PST
‘He flew.’ [ZNB III 17]
28. “With the help of the suffix -gu verbal bases are derived whose meaning is very near to
the Russian imperfective aspect.” (Sorokina 1975: 134 – my translation F.S.).
b. bu či-iąP
3SG fly-3SG.REFL
‘He started to fly / he flew away.’ [ZNB III 17]
29. In comparison to other dictionaries published by the same publishing company, ERRE
is indeed fairly well compiled as it is more than just a bilingual word list. Although in-
tended as a school dictionary, Forest Enets lacks both literacy and educational materials
and therefore the potential readership does not contain any schoolchildren or L2 learners,
but language activists and researchers.
30. The following examples which are originally given in a Cyrillic orthography are repre-
sented in the same practical orthography as used above and in Siegl (forthcoming).
31. In the foreword of both dictionaries (ERRE 8; ES 34) the possibility of using the converb
with other aspect suffixes is at least mentioned. The chosen format of representation with
its heavy bias towards the durative, neglecting almost entirely other aspectual derivations,
is, however, not justified.
The previous section has shown that the current format of representation is
unsuitable, as its only merit is to make Forest Enets aspect look closer to
Russian than it actually is. This obviously calls for a different solution and
in the remainder, some preliminary thoughts will be offered as to how such
a solution might look. First, as Forest Enets is on the verge of extinction
and language revitalization is hardly viable, neither monolingual dictionar-
ies nor bilingual Forest Enets Russian dictionaries targeted at native speakers
of Forest Enets are realistic undertakings. The possible user of such a paper
dictionary will most likely be a person not speaking Forest Enets, and most
probably a linguist, and any dictionary must inevitably include grammatical
information on verbs concerning inflection class, conjugation class, and as-
pect derivation. In cases where aspect morphology is lexicalized the answer
is relatively easy. The translation equivalent of ‘love’ komitaš is apparently
a delimitative lexicalized aspect form of komaš ‘want’, also treated as an in-
dependent entry in Sorokina and Bolina (2001: 188) and can no longer serve
as a subentry for ‘want’. But what should be done with e.g. pärąiš ‘help’?
As both durative and habitual aspect can be formed without problems, there
is intuitively no reason for several lemmata, as both meaning and formation
are regular. At first sight, the verb moktaš ‘put up a traditional tent’ does not
seem to be problematic either, as again both durative and habitual aspect can be
formed without problems. However, moktaš cannot be used with the inchoat-
ive aspect marker -ra/-la; instead a periphrastic construction with päš ‘begin’
is necessary, e.g. uu mokta-š pä-d <2SG put.up-CON begin-2SG> ‘you be-
gan putting up (a traditional tent).’ In fact, other verbs also block inchoative
-ra/-la and require päš ‘begin’, which shows that the inchoative aspect is
more idiosyncratic and should be listed in a dictionary. Furthermore, as al-
ready mentioned above, class IIa verbs (frequentatives) block durative aspect
but may allow habitual and inchoative, although again with some restrictions.
And finally, although the vast majority of class IIa verbs are frequentative, e.g.
d’orid’ ‘speak’ → bu d’ori-Na <3SG speak-FREQ.3SG> ‘he is speaking’,
some IIa verbs can be either frequentative or translative-resultative e.g. ood’
→ oo-Na <eat-FREQ.3SG> ‘is eating’ or ood’ → oo-ma <eat-RES.3SG>
‘has eaten’. Although this needs more detailed study, there are good reasons
to doubt the productivity of some aspect markers. Their derivational nature
is still rather obvious, and it would therefore be desirable to include forma-
tions with these suffixes as distinct entries, even at the risk of including some
redundancy in the presentation. Such a path was also chosen in Tereščenko’s
large Tundra Nenets Russian dictionary (Tereščenko 1965).
Although it is quite obvious that aspect plays a significant role in the gram-
mar of Forest Enets, there is no evidence for a systematic binary imperfec-
tive/perfective opposition in the available language material. Furthermore, the
direct comparison with Russian has demonstrated that aspect in Forest Enets
operates on a different conceptual basis. Both Forest Enets dictionaries, per-
haps accidentally, presented Forest Enets verbs in aspect pairs by exagger-
ating the function of the durative marker, which led to the emergence of a
superficial parallel to Russian. Also in the accompanying sketch grammar to
Sorokina and Bolina (2001), the Forest Enets aspect system appeared more
Russian-like than in an earlier description (Sorokina 1975). What I have tried
to show is that these recent approaches are problematic and that there is no
immediate need to depart from the earlier position of N. Tereščenko and G.
Prokof’ev, which acknowledged the importance of aspect but did not try to
construct any parallel to the Russian system. Concerning the overall position
of aspect within the grammatical structure of Forest Enets, aspect shows a
relatively clear and stable form-meaning mapping, an argument which is not
valid for aspect in Russian. Nevertheless, Forest Enets aspect is derivational
and idiosyncratic to a certain degree and therefore should be listed in the
dictionary.
The description of aspect in Siberian indigenous languages is most
definitely not a problem restricted to Forest Enets; rather, the problem is
symptomatic of a central concern in Sasse (2002): local traditions and
metalanguages are still too influential in the study of aspect. This study
adds further support from a language outside the scope of Sasse’s paper.
Apart from the possible theoretical implications which Forest Enets could
have for aspectology, the study of aspect in Siberian languages once again
underlines one of the most urgent challenges of language documentation and
language description: grammatical categories should be described without
interference from the grammatical descriptions and traditions of majority or
related languages.
References
Berkov, Valerij P. 1996. Dvujazyčnaja leksikografija [Bilingual lexicogra-
phy]. St. Peterburg: Izdatel’stvo Sankt-Peterburgskogo Universiteta.
Bondarko, Aleksandr V. 2002. Glagol’nyj vid v sisteme grammatičeskix
kategorii (na materiale russkogo jazyka) [Verbal aspect in the system of
grammatical categories (on the basis of Russian)]. In Osnovnye problemy
russkoj aspektologii, 30–43. St. Peterburg: Nauka.
Castrén, Mathias Alexander. 1854. Grammatik der samojedischen Sprachen
– herausgegeben von Anton Schiefner. St. Petersburg: Buchdruckerei der
Kaiserlichen Akademie der Wissenschaften.
Eismann, Wolfgang. 1991. Die zweisprachige Lexikographie mit Russisch.
In Wörterbücher. Ein internationales Handbuch zur Lexikographie. Dritter
Teilband, eds. Franz Josef Hausmann, Oskar Reichmann, and Herbert E.
Wiegand, 3068–3085. Berlin, New York: Walter de Gruyter.
Jachnow, Helmut. 1990. Russische Lexikographie. In Wörterbücher.
Ein internationales Handbuch zur Lexikographie. Zweiter Teilband, eds.
Franz Josef Hausmann, Oskar Reichmann, and Herbert E. Wiegand,
2309–2329. Berlin, New York: Walter de Gruyter.
Jacobson, Steven A. 1990. Comparison of Central Alaskan Yup’ik Eskimo
and Central Siberian Yupik Eskimo. International Journal of American
Linguistics 56:264–286.
Janhunen, Juha. 1998. Samoyedic. In The Uralic Languages, ed. Daniel
Abondolo, 357–379. London, New York: Routledge.
Kazakevič, Olga A. 2008. K voprosu o modeljax opisanija sel’kupskoj
glagol’noj derivatsii [A question about the descriptive model of deriva-
tion in Selkup]. In Issledovanija po glagol’noj derivatsii – Sbornik statej,
114–126. Moskva: Jazyki slavjanskix kul’tur.
Chapter 7
Documentary linguistics and prosodic evidence
for the syntax of spoken language∗
1. Introduction
The documentary linguistics approach – with its emphasis on primary audio-
or video-recorded data supplemented by annotations – makes it possible to
address seriously the syntax of spontaneous spoken language even in lesser
known languages. We would like to argue here that the syntactic description
of spoken language crucially needs to take into account prosodic phenomena
such as prosodic breaks of different strengths, prosodic prominence, pitch
range, and intonation contours associated with particular constructions. Pro-
sodic analyses in documentary linguistics can contribute both to enriching the
description of a language itself and to extending our empirical coverage of the
diversity of prosodic phenomena in human languages, the inventory of which
has focused so far on an incomplete sample of Germanic, Romance and Asian
languages which are in all likelihood non-representative of the complexity in
this domain (Himmelmann 2006; Himmelmann and Ladd 2008).
In this paper, we present a case study demonstrating that it is both fea-
sible (despite methodological challenges) to study the prosodic system of an
under-documented language, and necessary to incorporate prosodic phenom-
∗
We wish to acknowledge the patience and knowledge of the many Jaminjung speakers,
some of whom are now deceased, who have worked with us over the years in the communities
in Timber Creek, Katherine and Kununurra.
We are also grateful for the funding received from the DoBeS programme of the Volkswa-
gen Foundation for the documentation of the linguistic and cultural knowledge of Jamin-
jung and other languages of the Victoria River district, as well as previous research funding
received by the second author from the Max Planck Society and AIATSIS (Australian In-
stitute of Aboriginal and Torres Strait Islander Studies).
We would like to dedicate this paper to Ulrike Mosel who is a role model to us and so
many other linguists both in terms of her commitment as a fieldworker to the communities
she works with, and in terms of her linguistic work which succeeds in bringing to life the
language as it is used by speakers.
Candide Simard and Eva Schultze-Berndt
Figure 1. Map of the Northern Territory, showing the Victoria River District.
However, for Jaminjung, such a strategy proved very difficult to apply. Firstly,
the very concept of repeated sentences is an unlikely one in a language in
which word order is regulated by information structure (hence information
that is new the first time a sentence is uttered may not be judged to be so
when repeated, provoking a re-ordering of the constituents). Further, speakers
1. http://www.sil.org/computing/catalog/show_software.asp?id=79
2. http://www.lat-mpi.eu/tools/ELAN
As the theoretical framework for our prosodic analysis, we employed the Par-
allel Encoding and Target Approximation model (Xu 2005), PENTA here-
after. The main feature of this model is that key components of intonation
are defined in terms of function rather than form. The model has been de-
veloped in the last 10 years, and has been used mostly to analyze speech in
carefully prepared experiments. However, it has been used successfully to
analyze spontaneous speech in Jaminjung, with the understanding that pat-
terns must be detectable, otherwise they would be of no use to speakers and
hearers.
The PENTA model assumes that multiple communicative functions are
concurrently conveyed through speech, and, as they can be perceptually dis-
tinguished, they must be encoded separately. Thus, each individual function
has its own “scheme”, making use of one or more “prosodic primitives” such
as the implementation of a local pitch target, variation in pitch range, or artic-
ulatory strength. The model also assumes that speech melody is produced by
the articulatory system whose physical properties impose various constraints
on the way acoustic forms are generated. In this way, the PENTA model both
describes and explains F0 (fundamental frequency/pitch) patterns in utter-
ances. Some of the advantages of this framework are that, firstly, it recog-
nizes that the encoding of one function can overlay another, so surface F0
must be interpreted with caution. Secondly, the model takes into account
parameters other than F0 (duration, pitch range, and rate of change, i.e.
“slope” of contour, etc.); and finally, quantitative methods usually reserved
for larger corpora can be applied to relatively limited datasets, which makes
patterns more easily discernible and verifiable.
The particular encoding schemes for a language are not specified by the
PENTA model; they need to be discovered through empirical investigation
(Xu 2005: 246). This has informed our methodology. We choose a commu-
nicative function (e.g. “syntactic phrasal grouping”, “given topic” or “argu-
ment focus”) as a starting point for our investigation, then select clear ex-
amples of its instantiation, and finally seek out the parameters used in its
encoding.
The selected tokens, all segmented into syllables, are then measured and
annotated. They are labeled according to their syntactic subtypes, the number
of words in each token and their positions in the intonation unit (IU), as well
as their number of syllables and the position of each syllable in a word. The
measurements on each syllable include their mean F0 and duration; a measure
termed the ‘excursion size’, defined as the difference between the maximum
and minimum F0 in the syllable (expressed in semitones); and finally the ‘fi-
nal velocity’, a measure of the instantaneous rates of F0 change (sm/s) taken
at a point earlier than the syllable offset (here 30ms) which has proven to be
a good indicator of the slope of the underlying target of the syllable. Based
on these measurements, more refined quantitative analyses can be performed,
consisting of:
– the mean pitch in the last syllable of the IU relative to the other syllables
in the IU;
– the variation in minimum and maximum pitch (excursion size) within each
syllable and between all the syllables in the IU;
– the final velocity of the syllables at the boundaries of the IUs (first, second,
penultimate and final) as an indicator of the underlying pitch target;
– the alignment of pitch targets from the most prominent syllable to the final
syllable of an IU;
– the duration of final syllables relative to the other syllables in the IU.
When the number of tokens in the sample is high enough (at least 30), the
results are validated with statistical analysis.
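The per-syllable measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the Jaminjung data: each syllable is assumed to be a list of (time in seconds, F0 in Hz) samples, all function names are hypothetical, and the final velocity is approximated by a central difference around the sample closest to 30 ms before the syllable offset. The conversion to semitones uses the standard relation of 12 semitones per octave.

```python
import math


def semitones(f_hz, ref_hz=1.0):
    """F0 in semitones relative to ref_hz (12 semitones per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def mean_f0(syllable):
    """Mean F0 in Hz over the (time_s, f0_hz) samples of one syllable."""
    return sum(f0 for _, f0 in syllable) / len(syllable)


def duration(syllable):
    """Syllable duration in seconds, from first to last sample."""
    return syllable[-1][0] - syllable[0][0]


def excursion_size(syllable):
    """Difference between maximum and minimum F0, in semitones."""
    values = [f0 for _, f0 in syllable]
    return semitones(max(values)) - semitones(min(values))


def final_velocity(syllable, before_offset_s=0.030):
    """Approximate rate of F0 change (semitones per second) near the
    point before_offset_s ahead of the syllable offset.

    Assumption: the slope is estimated by a central difference over the
    two samples neighbouring the sample closest to that point.
    """
    if len(syllable) < 3:
        raise ValueError("need at least three F0 samples")
    target = syllable[-1][0] - before_offset_s
    i = min(range(len(syllable)), key=lambda k: abs(syllable[k][0] - target))
    i = max(1, min(i, len(syllable) - 2))  # keep both neighbours in range
    (t_prev, f_prev), (t_next, f_next) = syllable[i - 1], syllable[i + 1]
    return (semitones(f_next) - semitones(f_prev)) / (t_next - t_prev)


def final_lengthening(durations):
    """Duration of the final syllable of an intonation unit relative to
    the mean duration of the other syllables in the same unit."""
    *others, last = durations
    return last / (sum(others) / len(others))
```

A negative final velocity corresponds to a falling underlying target and a positive one to a rising target; a final-lengthening ratio above 1 means that the last syllable is longer than the average of the preceding ones. In practice the F0 samples would be exported from a pitch-tracking tool, and values from unvoiced stretches would have to be excluded before these functions are applied.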
[Figure: F0 contour; y-axis Pitch (Hz), x-axis Time (s)]
syllables: ngin ju bi ya ba rraj burru ga ru many mun ga yu ja mu ru gu
nginju=biyang barrajburru garumany mun gayu jamurrugu
dem=clitic n v cov v n
‘this one now a crocodile has come, he is lying face down below.’
This example illustrates the baseline to which the other grammatical types
of prosodic sentence mentioned above can be compared. The first IU has a
falling contour resembling that found in simple declaratives: the left bound-
ary displays a pitch reset and the encodings associated with the marking of
3. Prosodic sentences may also correspond to a combination of direct speech and a reporting
verb, or an interjection (forming its own intonation unit) and a main clause.
focus on the first syllable of the focus domain (beginning with barrajburru),
i.e. falling pitch, and wider excursions and longer durations than in other
syllables (Simard 2010: 189–219). The syllables at the right boundary are
lengthened and end in low pitch. We treat this pattern as a “default contour”
in declarative sentences.4
The pattern in the second intonation unit is a repeat of that of the first
one. Its dependence is indicated by its continuing the declination line of the
first clause, i.e. the reset at the left edge of the second intonation unit is less
prominent than that of the first, and the fall at the right edge is more promi-
nent.
In the following subsections we will examine prosodic sentences where
the second intonation unit does not have clausal status, first presenting the
constructions and then comparing their prosodic correlates.
4. Other contours are also found, i.e. ending with a slightly rising or level pitch; however, the
falling contour is the predominant pattern observed in Jaminjung.
[Figure: F0 contour; y-axis Pitch (Hz), x-axis Time (s)]
syllables: ngin ju wa gu rra ni ga yu .. wa ga
nginju wagurra-ni gayu waga
dem n-case v cov
‘here he is on the rock.. sitting’
On the other hand, (3) is a striking example of the semantic (but not prag-
matic) independence of a coverb in a separate intonation unit. This utterance
is taken from a personal account of the speaker and her sister’s efforts, as
children, to escape forced domestic labor and the threat of being removed
to a different state. After a successful escape, they are re-united with their
family members, who see them coming and announce their arrival. From the
context (and also because of prosodic marking of direct speech; see Simard
(2010: 384–392)) we know that it is the speakers of the announcement who
are beating their chests, not the children that are coming. In other words,
even if there was a complex predicate murlngub buntharam ‘they are coming
beating their chests’, this is not the interpretation intended here.
The view taken here (as in Schultze-Berndt 2000: 139–141; 2002: 280–281),
therefore, is that a coverb which is prosodically detached with the contour
described above constitutes a construction in its own right, i.e. the pragmat-
ically dependent predicate construction, distinct from the complex predicate
construction. While the prosodic contour itself as well as the absence of an
inflecting verb signals the dependency of this unit on a preceding one, it does
not determine the precise semantic relationship of the coverb with any ele-
ment of the preceding unit. Rather, the interpretation of the coverb (e.g. as
encoding an event occurring simultaneously with that encoded by the previ-
ous intonation unit as in (2) and (3), or a resultative or sequential relationship)
is determined by the addressee on the basis of pragmatic principles and world
knowledge. The use of this construction is stylistically marked; its frequency
varies depending on the individual speaker, and it has a clear effect of en-
hancing the vividness of the narration or description.
[Figure: F0 contour; y-axis Pitch (Hz), x-axis Time (s)]
syllables: ja lig jan ju ni malangga nu rrungawu warrb bi na .. wi rib ni mij
‘that child saw them sitting together together with the dog’
[Figure: F0 contour; y-axis Pitch (Hz), x-axis Time (s)]
syllables: gurrany ya nji nga rna ja rlig .. na ri bu ma rlang
gurrany yanjingarna jarlig naribu=marlang
part v n n=clitic
‘you should not give them to the children (to eat)... those mussels’
The distinction between afterthoughts and reactivated topics is not just a func-
tional one, however. Prosodically, the two constructions are clearly distinct,
as the comparison of Figures 4 and 5 illustrates. First, the average pause pre-
ceding the second intonation unit is much shorter for reactivated topics than
for afterthoughts. Second, the prominent syllable in reactivated topics does
not receive falling pitch (indicative of focus). This is consistent with the find-
ings for the encoding of other types of topics in Jaminjung, which do not
have prominent first syllables, independently of where they occur (Simard
2010: 249–276).
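The first of these prosodic differences, the length of the pause before the second intonation unit, could be computed from time-aligned annotations roughly as follows. This is a sketch under assumptions: each token is taken to be a triple of a construction label and the (start, end) times in seconds of the two intonation units, the function names are hypothetical, and the time stamps in the example are invented purely for illustration and are not measurements from the Jaminjung corpus.

```python
from statistics import mean


def pause_before_second_iu(first_iu, second_iu):
    """Silent interval in seconds between the end of the first
    intonation unit and the start of the second; each unit is a
    (start_s, end_s) pair. Overlaps are treated as a pause of zero."""
    return max(0.0, second_iu[0] - first_iu[1])


def mean_pause_by_type(tokens):
    """Average pause per construction type.

    tokens: iterable of (label, first_iu, second_iu) triples.
    Returns a dict mapping each label to its mean pause in seconds.
    """
    pauses = {}
    for label, first_iu, second_iu in tokens:
        pauses.setdefault(label, []).append(
            pause_before_second_iu(first_iu, second_iu)
        )
    return {label: mean(values) for label, values in pauses.items()}


# Invented time stamps for illustration only (not corpus data).
example_tokens = [
    ("afterthought", (0.0, 1.0), (1.5, 2.0)),
    ("afterthought", (0.0, 1.0), (1.25, 2.0)),
    ("reactivated_topic", (0.0, 1.0), (1.125, 1.5)),
]
print(mean_pause_by_type(example_tokens))
```

Grouping by construction label keeps the comparison explicit; whether a difference between the resulting means is reliable would still have to be checked statistically on a sufficiently large sample.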
In this section we will compare the prosodic correlates of the three subsets
of intonation unit in second position (as well as those of “normal verbal
clauses”) discussed in the preceding sections 3.1 to 3.3, thus matching the
quantitative evidence with the more descriptive account presented so far.
The measurements of the mean pitch show significant differences in over-
all pitch. When uttering pragmatically dependent predicates and other types
of non-finite clauses, or afterthoughts, speakers make use of a higher pitch
register, while verbal clauses and reactivated topics are uttered in a lower reg-
ister. The values for intonation units in second position consisting of just two
syllables, the most frequent in our datasets, are illustrated in the first panel of
Figure 6.
Figure 6. The mean F0 (fundamental frequency) and final velocity values of first and
last (= second) syllable in intonation units consisting of two syllables, in
second position in a prosodic sentence consisting of two intonation units.
The measurements of final velocity in the second panel of Figure 6 also high-
light the fact that pragmatically dependent predicates, afterthoughts and ver-
bal clauses5 all exhibit falling intonation on their first syllables, which is as-
5. Verbal clauses consisting of two syllables are not shown in the second part of Figure 6, as
there were too few tokens in our dataset; however, examination of a range of verbal clauses
Table 1. The parameters ruling the prosodic encodings of second intonation units in
complex sentences.
Structural integration Information status
dependent ↔ independent new ↔ given
verbal clauses + +
non-finite predicates + +
afterthoughts + +
reactivated topics + +
[Pitch track: y-axis Pitch (Hz), x-axis Time (s); syllable tier: ... mu lang girr ngan than ma ya wi rib ..]
Figure 7. A discontinuous noun phrase of the contrastive subtype (example (7)); the
two nominals are found within a single intonation unit and the modifier
bears focus accent.
We further show that such “true” discontinuous noun phrases (which are in-
frequent in texts) have functions very different from those of afterthoughts.
Two main functions of discontinuous noun phrases can be identified and can
moreover also be formally distinguished on prosodic grounds. The first func-
tion is that of contrastive argument focus where the contrastive element is a
modifier, which is always prosodically salient. Example (7) above is an illus-
tration of this subtype; it is extracted from a fictitious conversation (triggered
by elicitation questions) between the speaker and a dog owner. The preceding
utterance is a request to the dog owner to tie up the dog since it is threaten-
ing to bite people. Thus, the dog, but not its property of fierceness, has been
previously mentioned. Since it is this property which is emphasized in this ut-
terance, as the reason for the request, it is the property expression mulanggirr
‘fierce’ which receives prosodic prominence, while the common noun wirib
‘dog’ remains unaccented.
The second function of discontinuous noun phrases in Jaminjung is that of
marking sentence focus or “all-new” statements, typically used to introduce
a new participant into the discourse universe. In this case, the preferred order
of elements is reversed: while for the contrastive argument type the modifier
is usually in preverbal position, it is in postverbal position for the sentence-
focus type. Examples like (8), where the discontinuous noun phrase is made
up of a generic and a specific noun, are also attested. This example comes
from a Frog Story; the speaker quotes the boy alerting his dog to the presence
of a group of frogs which includes their pet frog. As Figure 8 shows, no par-
ticular prosodic salience is associated with either of the subconstituents; in
fact, all constituents in the sentence receive a prominence, including the ver-
bal predicate. This prosodic pattern conforms to the general pattern described
for “all-new” sentences in Jaminjung by Simard (2010: 225–233).
[Pitch track: y-axis Pitch (Hz), x-axis Time (s); syllable tier: .. nga yin gun ngi ya jal wany bu ru yu ma la ra ..]
Figure 8. A discontinuous noun phrase of the sentence focus subtype (example (8));
all constituents bear a prosodic prominence.
Thus, prosody in this case provides not only a means to distinguish between
afterthoughts and discontinuous noun phrases, but also helps to corroborate
with formal evidence a functional distinction between two subtypes of dis-
continuous noun phrase.
4. Conclusions
Prosody is recognized as one of the fundamental components of language.
While prosody may have received limited attention in the past outside specif-
ically prosodic research on well-described languages, theoretical and techno-
logical advances in recent years have spurred a renewed interest in intonation
studies yielding important insights into its workings as well as its interactions
with other linguistic components. Applying and refining these on the basis of
lesser documented languages still poses many challenges. For Jaminjung we
had to base our investigation solely on a corpus of spontaneous or at least
unread speech but we consider that the drawbacks of using such uncontrolled
data are counterbalanced by the benefits that our analysis brings to our un-
derstanding of the syntax and semantics of the language. We demonstrated
here that it is possible to distinguish, on the basis of prosodic evidence alone,
constructions such as reactivated topics vs. afterthoughts; afterthoughts vs.
discontinuous noun phrases, and two subtypes of discontinuous noun phrase.
We also argued that it is possible, based on quantitative evidence gained from
corpus data, to distinguish between different degrees of prosodic integration
iconically reflecting degrees of semantic integration between intonation units
forming a larger unit, that of prosodic sentence.
Our case study shows how the investigation of prosody based on spontaneous
speech data collected during fieldwork not only enhances our understanding
of the language itself, but also contributes more widely to a typology of
prosodic phenomena and their grammatical functions in human languages.
Firstly, our findings contribute to our understanding of cross-linguistically
recurrent differences between afterthoughts and reactivated right-dislocated
topics. Secondly, a careful distinction between discontinuous noun phrases
and afterthoughts on prosodic and functional grounds reveals that the former
are infrequent in discourse and tied to very specific information structure
configurations, thus providing further evidence against the myth of
unconstrained discontinuity in Australian languages.
Abbreviations
1, 2, 3 first, second, third person
ABL ablative case
ALL allative case
COMIT comitative case
CONTR contrastive focus marker
DAT dative case
DEM demonstrative (distance-neutral/recognitional)
DU dual
ERG ergative case
IMPF (past) imperfective
IRR irrealis
LOC locative case
NEG negative particle
OBL oblique pronominal
PL plural
PROX proximal demonstrative
PRS present
PST past (perfective)
RESTR restrictive clitic (‘right there/then’)
SG singular
References
Auer, Peter. 1991. Vom Ende deutscher Sätze. Zeitschrift für Germanistische
Linguistik 19:139–157.
Auer, Peter. 1996. On the prosody and syntax of turn-continuations. In
Prosody in Conversation: Interactional Studies, eds. Elizabeth Couper-
Kuhlen and Margaret Selting, 57–100. Cambridge: Cambridge University
Press.
Averintseva-Klisch, Maria. 2008a. Rechte Satzperipherie im Diskurs: Die
NP-Rechtsversetzung im Deutschen. Ph.D. Dissertation, Universität
Tübingen.
Averintseva-Klisch, Maria. 2008b. To the right of the clause: Right disloca-
tion vs. afterthought. In ‘Subordination’ versus ‘Coordination’ in Sentence
and Text: A Cross-Linguistic Perspective, eds. Cathrine Fabricius-Hansen
and Wiebke Ramm, 217–239. Amsterdam, Philadelphia: John Benjamins.
Bing, Janet Mueller. 1984. A discourse domain identified by intonation.
In Intonation, Accent and Rhythm: Studies in Discourse Phonology, eds.
Dafydd Gibbon and Helmut Richter, 10–19. Berlin, New York: de Gruyter.
Birner, Betty J., and Gregory L. Ward. 1998. Information Status and Non-
Canonical Word Order in English. Amsterdam, Philadelphia: John Ben-
jamins.
Boersma, Paul, and David Weenink. 2001. Praat, a system for doing phonetics
by computer. Report 132, Institute of Phonetic Sciences of the University
of Amsterdam. http://www.praat.org.
Chapter 8
Diphthongology meets language documentation: The
Finnish experience∗
Klaus Geyer
1. Introduction
Guides on language documentation and linguistic fieldwork generally agree
that, once the preparatory work is done and data collection, analysis and
description begin, one of the first steps is to carry out a first examination
of the functional phonetics1 of the language in question, establishing a
preliminary sketch of the segmental sound system, the syllable structure,
and, when relevant, the tonal patterns; cf. e.g. Mosel (2006a: 75) and
Bowern (2008: Ch. 5). While the preliminary analysis of a phonological
system is a subject in its own right in language documentation and
description, it is also (at least on the word level) a prerequisite for
developing a graphemic system (often termed working orthography), equally
preliminary at that stage, which allows one to record the data in written form.
Functional phonetics as a separate level of language analysis and description
precedes the level of morpho-syntax not only in the research process, but also
in one of the essential outcomes of a language documentation endeavor (cf.
Mosel 2006b), viz. the grammar book, be it a short sketch grammar as e.g.
Mosel (1994) or an extensive reference grammar as e.g. Mosel and Hovd-
haugen (1992).
∗
I would like to thank the members of the Linguistisches Kolloquium at the Linguistics
department, University of Erfurt, the participants of the workshop on diphthongs at the
43rd Annual Meeting of the SLE 2010 in Vilnius, and the editors (especially Nicole Nau
and Claudia Wegener) for discussing my ideas on diphthongs with me and for commenting
on an earlier version of this paper. In particular, I want to express my thanks to Lena
de Mol for her help with the English language. Needless to say all remaining errors and
shortcomings are mine.
1. The term functional phonetics is used here for a data-driven, inductive reconstruction of
the phonological system(s), as opposed to deductive accounts, which aim to fit the sounds
of a language into a pre-determined, allegedly universal structure model.
hand as an instructive and illustrative case study to show how one can come to
an adequate description and systematization of diphthongs in spite of a rather
complex data set; and on the other hand as an almost perfect example of a
rather complex data set based on which a lot of potentially relevant criteria
for the description and analysis of diphthongs in the world’s languages can be
extracted. The final section (Section 4) brings together the results and find-
ings, presenting an outline of the “diphthong analysis and description tool”.
2. What is a diphthong?
2.1. Definitions
no need to immediately decide for or against one or the other of the two no-
tions of diphthongs mentioned above. One should rather follow an account
that does not exclude either of the notions from the beginning, but rather
leaves the question open by combining the two. This was done by, e.g., Roca
and Johnson (1999: 688), who define a diphthong as “a complex vowel of
non-steady quality, made up of two phases.”
It is worth mentioning that Schubiger, already as early as 1977, combines
the two main perspectives on the notion of diphthongs and suggests the fol-
lowing, rather differentiated definition:
Diphthonge. Es sind dies lange Vokale mit gleitender Zungenstellung oft auch
mit sich verändernder Lippenstellung. Im Verlauf der Gleitbewegung ergibt
sich eine ganze Reihe von Vokalen, von denen aber wegen der raschen Ab-
folge nur der erste und der letzte wahrgenommen werden. Auf der Wahr-
nehmung beruht die übliche Definition des Diphthongs: eine der gleichen
Silbe angehörende Folge von zwei Vokalen; ebenso ihre Darstellung in der
phonetischen Schrift. (Schubiger 1977: 49)
[Diphthongs. These are long vowels produced with a gliding tongue position,
often also with a changing lip position. In the course of the gliding movement,
a whole series of vowels arise, but due to the quick succession, only the first
one and the last one are perceived. The common definition of diphthong relies
on perception: a sequence of two vowels, belonging to the same syllable; their
presentation in the phonetic script is made up in the same way. (Translation
K.G.)]
Aside from the fact that diphthongs do not necessarily need to be long in
duration as the famous example of diphthongs in Icelandic shows (cf. Guss-
mann 2002: Ch. 7), Schubiger (1977) not only mentions some important artic-
ulatory characteristics of diphthongs, she also gives an adequate explanation
for how the two notions interact. In any case, from a phonological point of
view, diphthongs of both the “two vowels within one syllable”-type and of
the “vowel changing its quality”-type may occur. That is to say, even within
one and the same language, both mono- and bi-phonemic diphthongs may
be observed. The standard analysis of Lithuanian diphthongs illustrates this
(cf. Ambrazas 1997: 28), where it is even the case that the so-called “gliding
diphthongs” /ie/ and /uo/ as mono-phonemic units are integrated in the sys-
tem of long vowel phonemes. All other diphthongs, however, are treated as
phoneme combinations (and are therefore termed “combined diphthongs”):
they can be split up into two syllables due to re-syllabification; cf. e.g. sai.tas
‘tie’ vs. są.sa.ja ‘linkage’, gau.ti ‘to receive’ vs. ga.vo ‘(s/he) received’ (cf.
Ambrazas 1997: 27). Importantly, these distinctions in phonological analysis
need not necessarily affect the phonetic shape of the complex vowel sounds
in question, e.g. in the sense of a rather smooth and even transition from ini-
tial to target sound (“gliding diphthong”) vs. a more sharp and rapid change
(“sequential diphthong”), cf. Catford (1977: 215–216) for examples.
2. Note that [ɯ] does not stand for IPA cardinal vowel 16, but for a fronted cardinal vowel
18 [ʉ̟], rendering the symbol Ñ which is used in the Swedish landsmålsalfabet for dialect
transcriptions.
calized after vowels, forming “derived” phonetic diphthongs, cf. løb [løwP]
‘run!’, steg [sdAjP] ‘fry!’, skriv [sg̊KiwP] ‘write!’; klor [khlo2P] ‘chlorine’. In
addition to spelling, there are two further arguments for the assumption of
“underlying” word-final consonants in these word-forms: firstly, they require
the allomorph {-e} of the infinitive suffix, which appears on stems ending in
a consonant, cf. løbe ‘to run’, stege ‘to fry’, and skrive ‘to write’, and not
the zero-allomorph used on stems ending in a vowel, e.g. gå ‘to go’, sy ‘to
sew’ etc. Secondly, there is an alternation of the phonetic diphthongs in
question with vowel-consonant-sequences: løbe løbsk [løbsg̊] ‘to bolt (of horses)’,
stegt [sdEg̊d] ‘fried’, skrift [sg̊Kæfd] ‘script, handwriting’; klorid [khloKiðfl P]
‘chloride’ (cf. Grønnum 1998: 251–257). It is obvious that it would become
very puzzling to account for those alternations without assuming a phonolog-
ical word-final consonant in these cases, realized as a phonetic vowel under
certain circumstances.
An interesting question is to what extent a change in nasality might play
a role in forming diphthongs, e.g. when in the German word Reallohn ‘real
wages’ nasality appears in the course of the long /o:/ in the final syllable,
resulting in a sequence like [oõ]. Since this is an effect of assimilation, it lacks
phonemic status but nevertheless constitutes a phonetic diphthong. The only
hint that such a change in nasality could in fact claim phonemic
status as a diphthong is a note on a minimal pair that I found in Sievers
(1893: 153), the hook on the i indicating nasality: “Nasalirtes i neben reinem
i findet sich nach Böthlingk im Jakutischen, z. B. in Ai˛ï̄ Sünde neben Aiï̄
Schöpfung.” [Nasalized i in addition to pure i can, according to Böthlingk,
be found in Yakut, e.g. in Ai˛ï̄ sin compared with Aiï̄ creation. (Translation
K.G.)]. Unfortunately, it has turned out to be quite difficult to track down
the reference (Böthlingk) cited by Sievers to learn more about this fact.
Finally, it should be added that the perceivable change in vowels that is
described by the notion of contour tones is not subsumed under the topic of
diphthongs, since tone does not affect the quality of the diphthong in terms
of its formant structure but only its pitch.
As a last remark in this section, I want to emphasize the fact that it is
ultimately the auditory change, and not the acoustic one, that is crucial even
for a phonetic diphthong. Ramers and Vater (1992: 128) state correctly that
in natural speech no vowels are produced that are stable with respect to their
frequency spectra in an acoustic sense. Linguistically relevant, however, are
only those changes that are perceivable, be they functional or not.
3. I am reluctant to use the expression “half (of a diphthong)” here, since the widely accepted
convention to represent diphthongs as two symbols (initial and target) does definitely not
imply that a diphthong consists of two (equal) “halves”.
4. Here, for convenience, a rough version of the sonority hierarchy is given, the sign > mean-
ing ‘more sonorous than’: open vowels > closed vowels > liquids > nasal consonants >
fricatives > plosives. It should be added that, as far as vowels are concerned, nasalization
increases sonority, and a voiced obstruent is more sonorous than its voiceless counterpart.
Even if the sonority maximum most often coincides with the syllable
peak, this is not necessarily the case. Regarding the parameter of sonor-
ity, which, according to Kohler (1995: 74), is somewhat “impressionistisch
gewonnen” [impressionistically obtained. (Translation K.G.)], one has to bear
in mind that the gradation of more or less sonorous sounds in the sonority
hierarchy requires the tacit prerequisite that the sounds in question are pro-
duced with the same articulatory effort. Differences in terms of loudness, du-
ration, and/or pitch can override sonority, or, as Schubiger (1977: 108) puts
it: “Die Sonoritätsskala behält ihre volle Gültigkeit nur bei gleichbleibendem
Atemdruck. Druckschwankungen können Verschiebungen bewirken.” [The
sonority hierarchy remains fully valid only if the expiratory pressure is
constant. Variations in pressure may cause a shift. (Translation K.G.)] Thus, we
have to distinguish between what could be called the sonority contour and the
prominence contour in a syllable.
Evidence for the validity of this differentiation can be found in Lithua-
nian, where diphthongs possessing roughly the same initial and target vowel
qualities and, consequently, an identical sonority contour, may form mini-
mal pairs with respect to a decreasing or an increasing prominence contour,
cf. áukštas ‘high’ vs. aũkštas ‘floor, storey’5. In order to have a means for
differentiating sonority and prominence, I suggest terming diphthongs with
a decreasing prominence contour early-peak diphthongs and those with an
increasing prominence contour late-peak diphthongs.6
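The distinction between sonority contour and prominence contour can be made concrete with a small sketch. The Python code below is only an illustration, not part of the analysis proposed here: the numeric ranks merely encode the rough sonority hierarchy given in footnote 4, and the function names, the two-entry sound table and the idea of reading the peak position off the Lithuanian accent marks (cf. footnote 5) are assumptions made for the example.

```python
import unicodedata

# Rough sonority hierarchy from footnote 4; the numeric ranks are an
# assumption of this sketch (higher = more sonorous).
SONORITY = {
    "open vowel": 6, "closed vowel": 5, "liquid": 4,
    "nasal": 3, "fricative": 2, "plosive": 1,
}

# Hypothetical mini-table, sufficient for the Lithuanian example only.
SOUND_CLASS = {"a": "open vowel", "u": "closed vowel"}


def strip_marks(written):
    """Remove accent marks from a written diphthong, e.g. 'áu' -> 'au'."""
    decomposed = unicodedata.normalize("NFD", written)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sonority_contour(written):
    """Compare the sonority of the initial and the target sound."""
    initial, target = strip_marks(written)
    a = SONORITY[SOUND_CLASS[initial]]
    b = SONORITY[SOUND_CLASS[target]]
    if a == b:
        return "level"
    return "falling" if a > b else "rising"


def prominence_contour(written):
    """Read the peak position off the accent mark (first vs. second letter)."""
    letters = 0
    for ch in unicodedata.normalize("NFD", written):
        if unicodedata.combining(ch):
            return "early-peak" if letters == 1 else "late-peak"
        letters += 1
    return None  # no accent mark: peak position not recoverable


for form in ("áu", "aũ"):  # from áukštas 'high' and aũkštas 'floor, storey'
    print(form, sonority_contour(form), prominence_contour(form))
```

Both forms receive the same sonority contour (falling) but different prominence contours, which is the kind of mismatch that the terms early-peak and late-peak are meant to capture.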
If we take the fact seriously that sonority, in marked cases, is not the
single controlling parameter for syllable peak formation in diphthongs, the
search for this kind of mismatching sequences can be further extended. Siev-
ers already did this in 1893, when he stated the following on the combinations
of vowels with liquids and nasals:
Auch hier haben wir es hauptsächlich nur mit den einsilbigen Verbindun-
gen zu thun. Diese sind den Verbindungen zweier Vocale vollkommen ana-
log, nur mit der Einschränkung, dass nach den Gesetzen der Abstufung der
Schallfülle ... die Liquide und Nasale in fast allen Fällen die unsilbischen
Glieder der Verbindungen sind. Dass wir Gruppen wie ar, al, an, am, aN
5. The place of the syllable peak is marked by the accent signs tilde ˜ and acute ´ accent.
These are also used to indicate different stress types in Lithuanian.
6. A third “contour-type” of diphthongs occurring in some languages (though not in Lithua-
nian) might be floating diphthongs without a clearly localizable peak, as Grønnum
(1998: 79) puts it (even if this is done in connection with sonority and not with promi-
nence).
(genauer geschrieben ar̯, al̯, an̯, am̯, aN̯, um die unsilbische Geltung des an
zweiter Stelle stehenden Sonorlauts zu bezeichnen) nicht auch als
‘Diphthonge’ auffassen, liegt grossentheils bloss an der Gewohnheit, l, r, m, n, N
als ‘Consonanten’ zu bezeichnen, die mit einem ‘Vocale’ nicht eine derartig
homogene Verbindung eingehen können wie zwei ‘Vocale’ unter einander.
(Sievers 1893: 154)
[Also here, we are mostly dealing with monosyllabic combinations. These are
completely analogous to the combinations of two vowels, with the restriction
that according to the laws of gradation of sonority ... the liquids and nasals
are almost always the non-syllabic members in the combinations. The reason
why we do not also consider sequences like ar, al, an, am, aN (more exactly
ar̯, al̯, an̯, am̯, aN̯, to mark the non-syllabicity of the sonorous sound in the
second place) diphthongs is to a great extent due to the habit of calling l, r, m,
n, N ‘consonants’, which cannot form as homogeneous a combination with a
‘vowel’ as two ‘vowels’ can do with each other. (Translation K.G.)]
8. Actually, the diphthong chart Mitchell provides comprises 20 positions, since, for unknown
reasons, /ie/ and /ei/ are listed twice.
(restricted to one single word) and /ey/ (restricted to one single root) should
or should not be included in the overall diphthong inventory of the language.
In my view, they are definitely part of the sound system – but in terms of core
and periphery, they are clearly part of the periphery rather than of the core.
Another peculiarity must be mentioned here, namely that the diphthong
/öi/ never occurs in roots, but can only be the result of a morphological con-
struction involving stem-final /ö/ + (nominal) plural /i/ or (verbal) preterite
or conditional /i/. So, /öi/ is peculiar not with respect to frequency of oc-
currence in word forms, as /ey/ and /iy/ are, but rather with respect to place
of occurrence in a word form. The sequence is frequent since both nominal
plural and verbal preterite and conditional formation are very common mor-
phological constructions, and stem-final /ö/ is not really rare. Because /öi/ is
quite restricted with respect to possible places of occurrence in word forms,
it is, in my view, more peripheral than other diphthongs. An interesting fact,
also from the viewpoint of documentary linguistics, is that the most common
diphthongs in a sample of 6,700 diphthongs turned out to be /oi/ (21.4% of
all diphthongs), /ai/ (16.5%), and /ei/ (15.1%) (cf. Häkkinen 1977, cited in
Karlsson 1983: 90); note that these are all diphthongs that occur freely both
in roots and across morpheme boundaries of the type mentioned above.
Probably the most widely discussed issue in terms of core and periphery
of phonological (and other linguistic) elements is whether loans
– here: loan diphthongs – should be considered part of the inventory of a
given language. This is, actually, not a big issue in Finnish, but for the
closely related Estonian language, it is reported that out of the up to 36 diph-
thongs, as many as 10 solely occur in loan words (Viitso 2003: 22). And for
the Baltic language Lithuanian, recent presentations (Ambrazas 1997; Eckert,
Bukevičiūtė, and Hinze 1994) unanimously count, beyond the genuine items,
the diphthongs /eu, oi, ou/ “in words of foreign origin” (Mathiasen 1996: 30)
as part of the inventory – despite the fact that Lithuanian grammaticography
tends to be very conservative in an overall perspective.
One feature all Finnish diphthongs have in common is that the first phase
of the diphthong is more prominent than the latter one, i.e. all Finnish diph-
thongs are early-peak diphthongs. For the first – and much bigger – group
mentioned above, namely the diphthongs ending in a closed vowel /i, y, u/,
ei äy eu
äi
ui
ai au
oi ou
öi öy
yi
ie
yö uo
iu
Hakulinen et al. (2004: §21, translation K.G.) operate with three main types
of diphthongs and some subdivisions:
closing diphthongs ei öi äi oi ai | ey öy äy | eu ou au
closed diphthongs yi ui | iy iu
opening diphthongs ie yö uo
Groenke (1998: 133, translation K.G.) provides these four groups of diph-
thongs:
ei öi oi öy eu ou
äi ai äy au
3 falling: uo, yö, ie
2 de-rounding diphthongs: yi, ui
1 rounding diphthong: iu
And finally, Fromm’s (1982: 31, translation K.G.) analysis comprises the fol-
lowing five groups:
I II III IV V
ei öi oi öy eu ou ie yö uo iu yi ui
äi ai äy au
deduce them from the positioning of the elements in the chart. The target
vowel quality seems to play a major role for the arrangement of the columns,
whereas the arrangement of the lines all in all remains cryptic. This applies in
particular to the lines at the bottom, where it remains an open question why
the three opening / rising diphthongs have been integrated in this position.
Clearly, such attempts do not meet the basic requirements of consistency and
explicitness for a phonological systematization.
Karlsson (1999) is, in some sense, not so different from Branch (1987)
and Lyovin (1997): he, too, uses the target vowel quality /i/, /u/ and /y/ as the
main classificatory criterion. In contrast to Branch (1987) and Lyovin (1997),
however, he at least explicitly mentions this criterion. However, the criteria
underlying the arrangement of items within the three groups again have to be
inferred somehow by the reader. Karlsson’s (1999) group 4, unlike groups 1
to 3, is a bare enumeration, not characterized by any feature. As far as any
criteria are retrievable at all, these systematizations only seem to make use
of the static properties of diphthongs, i.e. the positions of either the initial or,
possibly, the target vowel quality.
Hakulinen et al. (2004) is the only systematization comprising 18 diph-
thongs. The vertical tongue movement functions as the main criterion for
classification here; the quality of the target vowel seems to be an additional
criterion for systematizing closing diphthongs, as the respective subgroups
are constituted by separating marks “|”. Unfortunately, this feature is not ex-
plicitly pointed out. Regarding the group of closed diphthongs, however, the
subgrouping is carried out according to the position of the /i/, i.e. whether it
is the initial or the target vowel quality in the diphthong. The order of diph-
thongs within the subgroups seems to follow, by and large, the principle of
the IPA chart to put front vowels to the left and back vowels to the right. The
opening diphthongs in the third group seem to be arranged in the same way.
Hakulinen et al. (2004) use the dynamic criterion of vertical tongue
movement (or the absence of it, respectively) as a feature for forming the
three main groups, which apparently improves the analysis.
The systematization Groenke (1998) suggests in his sketch of Finnish is
somewhat suspect because the feature of vertical tongue position is described
using the same terms that are otherwise used for sonority contours (rising,
falling).9 The features related to changes in lip rounding that Groenke brings
into play are, however, useful ones since they address the dynamic nature of
diphthongs. But whilst /iy/ only displays lip movement (rounding) in terms
of articulation, both /iu/ and /ui/ imply, in addition to lip movement (rounding
in /iu/ and de-rounding in /ui/), horizontal tongue movement (backing in /iu/
and fronting in /ui/). Diphthongs where only one articulatory feature changes
from initial to target position like in /iy/, but also in /yi, uo/ etc., are termed
homogeneous (cf. Roca and Johnson 1999: 190–191), whereas in a heteroge-
neous diphthong more than one feature changes, e.g. in /ui, äy/.
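The difference between homogeneous and heterogeneous diphthongs can be checked mechanically by counting how many articulatory features differ between the initial and the target vowel quality. The following Python sketch does this for the examples just mentioned; the three-way coding of each vowel (height, backness, rounding) and the function names are assumptions made for the illustration and are not taken from Roca and Johnson (1999).

```python
# Articulatory values for the eight Finnish vowel qualities, coded as
# (height, backness, rounding); this coding is an assumption of the sketch.
VOWELS = {
    "i": ("close", "front", "unrounded"),
    "y": ("close", "front", "rounded"),
    "u": ("close", "back", "rounded"),
    "e": ("mid", "front", "unrounded"),
    "ö": ("mid", "front", "rounded"),
    "o": ("mid", "back", "rounded"),
    "ä": ("open", "front", "unrounded"),
    "a": ("open", "back", "unrounded"),
}
DIMENSIONS = ("height", "backness", "rounding")


def changing_features(diphthong):
    """List the articulatory dimensions that differ between the two phases."""
    initial, target = diphthong
    pairs = zip(DIMENSIONS, VOWELS[initial], VOWELS[target])
    return [dim for dim, a, b in pairs if a != b]


def homogeneity(diphthong):
    """'homogeneous' if exactly one feature changes, else 'heterogeneous'."""
    changed = changing_features(diphthong)
    return "homogeneous" if len(changed) == 1 else "heterogeneous"


for d in ("iy", "yi", "uo", "ui", "iu", "äy"):
    print(d, changing_features(d), homogeneity(d))
```

Under this coding, /iy, yi, uo/ each change in a single dimension, whereas /ui/ and /iu/ change in backness and rounding and /äy/ in height and rounding.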
The system proposed by Fromm (1982), finally, does not only make use
of dynamic as well as static features, it also labels them – at least in the
descriptions of the five (or possibly actually four, since II and III are treated
together) groups. The gap in the chart of group II on the top left is to be filled
with /ey/, which shows that /e/, in contradiction to the explanation given by
Fromm, does not participate in vowel harmony; /iy/ could easily be integrated
in group V.
In my view, none of these systematizations, not even the most recent one,
adequately captures the distinctiveness of relevant features in Finnish diph-
thongs completely. Therefore, I will present a more consistent analysis in the
next section.
lip rounding, or both – these other types of movement are not distinctive in
this systematic account, but still relevant for the phonetic form of the diph-
thong. If the item is [+ move], the extent of a possible movement in vertical
tongue position can be [+ wide] (between open and close vowel quality) or
narrow (between mid and close), thus [– wide].
The very basic distinction of falling vs. rising in sonority, ascribing the
feature [+ rising] to the three diphthongs /yö, ie, uo/, is also implemented, as
is the vowel harmony dimension of lip rounding, where only the non-open
round vowels /y, ö; u, o/ participate (see Section 3.1). Therefore, there are
only four diphthongs with the feature [+ lip harm]. Lip rounding in the target
vowel quality is made use of for the feature [+ round], which occurs only in
diphthongs ending in /y, ö, u, o/, all others being [– round].
The other dimension of Finnish vowel harmony, namely place harmony,
plays an even more important role for the system. The place harmony fea-
ture [+ front] vs. [– front] is to be understood not in an articulatory but in a
phonological sense. Recall that the open vowels /ä/ and /a/ participate in place
harmony (though not in lip rounding harmony) and that /i/ and /e/, though
phonetically front vowels, are neutral with respect to place harmony, i.e. they
can combine with both /y, ö, ä/ and /u, o, a/. In other words, a diphthong is
front if either the initial or the target vowel quality is phonologically front
(i.e. /y, ö, ä/), and it is back, if either the initial or the target vowel quality is
phonologically back (i.e. /u, o, a/); /ei/ and /ie/ are neither front nor back.
The matrix of distinctive features in Finnish diphthongs is as follows:
Arranged according to their binary feature values, the system displays the
diphthongs in Finnish in a quite compact and integrated way. I refrain from
marking the vast majority of diphthongs for [– lip harm] and [– rising] re-
spectively, since that would affect the clarity of the representation; the pos-
itive feature values are simply indicated in boldface or italics, respectively.
        [+ front]                         [– front]
[+ round]          [– round]              [+ round]
iy            yi             ui           iu           [– move]
ey            öi             oi           eu           [+ move] [– wide]
öy yö               ei ie                 ou uo        [+ move] [– wide]
äy            äi             ai           au           [+ move] [+ wide]
[+ lip harm] (boldface): öy, yö, ou, uo
[+ rising] (italics): yö, ie, uo
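As a cross-check, the feature values can be derived from the component vowels alone. The Python sketch below is a hedged illustration rather than the procedure actually used in compiling the chart: the three-step height scale, the dictionary keys and the function name are assumptions, while the vowel classes follow the definitions given above (phonologically front /y, ö, ä/, phonologically back /u, o, a/, non-open round /y, ö, u, o/). The inventory is the set of 18 diphthongs listed by Hakulinen et al. (2004).

```python
FRONT = set("yöä")   # phonologically front vowels (place harmony)
BACK = set("uoa")    # phonologically back vowels; /i, e/ are neutral
ROUND = set("yöuo")  # non-open round vowels (lip rounding harmony)

# Assumed height scale: 3 = close, 2 = mid, 1 = open.
HEIGHT = {"i": 3, "y": 3, "u": 3, "e": 2, "ö": 2, "o": 2, "ä": 1, "a": 1}

DIPHTHONGS = "ei öi äi oi ai ey öy äy eu ou au yi ui iy iu ie yö uo".split()


def features(d):
    """Derive the feature values of a diphthong from its two vowel qualities."""
    initial, target = d
    step = HEIGHT[target] - HEIGHT[initial]
    if initial in FRONT or target in FRONT:
        front = True
    elif initial in BACK or target in BACK:
        front = False
    else:
        front = None  # /ei/ and /ie/: neither front nor back
    return {
        "move": step != 0,                                # vertical tongue movement
        "wide": (abs(step) == 2) if step != 0 else None,  # open <-> close
        "rising": step < 0,                               # target more open, more sonorous
        "lip harm": initial in ROUND and target in ROUND,
        "round": target in ROUND,
        "front": front,
    }


for d in DIPHTHONGS:
    print(d, features(d))
```

Under these assumptions the sketch yields exactly three [+ rising] and four [+ lip harm] diphthongs, and leaves only /ei/ and /ie/ unspecified for [front], in line with the description above.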
is economical, since it makes use of only a small feature set (obviously only
the distinctive ones); it is adequate for the language, since it relies on core
characteristics of Finnish phonology, such as the vowel harmonies; and it is
typologically adequate, since it clearly states that the rising early-peak diph-
thongs /ie, yö, uo/ are marked structures – simply because they need special
features to be singled out.
Of course, already in this early stage, the important “contour criteria” may be
accounted for:
– falling vs. rising: sonority contour, direction of change
– early peak vs. late peak: prominence contour, position of peak
Note that falling, early-peak diphthongs are by far the most common diph-
thong type, but, as was shown, prominence caused by means of extra effort
in pitch and / or expiratory energy may override the sonority contour – even
to the extent that sonorants may form a syllable peak. In this (probably rare)
case, the next criterion has to be applied:
– purely vocoid vs. blended: type of sounds constituting the diphthong (only
vocoids vs. vocoids in combination with sonorants)
The following criterion can provide a hint for a later analysis of mono-phone-
mic vs. bi-phonemic diphthongs, but need not do so:
– sequential vs. gliding: impression of 2 distinct phases vs. 1 complex phase
All of these criteria, which directly address the dynamic character of diph-
thongs, may not only be relevant for a sound description of the phonetic diph-
thongs in a language, they may also constitute potentially distinctive features
if it turns out in the phonological analysis that at least some of the diph-
thongs have a functional, phonemic status (by forming minimal pairs) in the
language. It goes without saying that all of the static features that either the
initial and/or the target sounds bear – e.g. front vs. back, closed vs. mid vs.
open, round vs. spread, nasal vs. oral, etc. – might turn out to be system-
atically distinctive in phonological diphthongs. This also holds true for the
criterion of quantity, even if it is not very likely that quantity functions as a
distinctive feature in the diphthongs of a language.
If there are phonological diphthongs in a language, these criteria have to
be dealt with:
– mono-phonemic vs. bi-phonemic: segmental phonological analysis
– simple vs. complex: occurrence in morphologically simple vs. in complex
forms
– stable vs. unstable: changeability by morphophonological processes
– common vs. restricted: restrictions of occurrence, e.g. only in stressed syl-
lables, in loans, etc.
– frequent vs. infrequent: frequency in lexicon words or in text words
The criteria of sonority and prominence contour come up again in phono-
logical analysis, now with special reference to the general principles of syl-
lable structure in the language. And, of course, mismatches of sonority and
prominence contour, as they are marked structures, need a particularly careful
investigation and explanation.
A criterion that is equally relevant for the phonetic and the phonologi-
cal perspective is the question of underlying phonological non-diphthongal
sound sequences for diphthongs in the phonetic output, or the question of
stages in diachronic emergence by sound change processes, giving rise to the
following features:
I hope that the “diphthong analysis and description tool” will prove useful for
a number of linguistic enterprises, be they within language documentation or
other linguistic areas.
References
Abondolo, Daniel. 1998. Finnish. In The Uralic Languages, ed. Daniel Abon-
dolo, 149–183. London, New York: Routledge.
Altmann, Hans, and Ute Ziegenhain. 2010. Prüfungswissen Phonetik, Pho-
nologie und Graphemik. (3rd edition). Göttingen: Vandenhoeck und Rup-
recht.
Ambrazas, Vytautas. 1997. Lithuanian Grammar. Vilnius: Baltos lankos.
Andersson, Erik. 1994. Swedish. In The Germanic Languages, eds. Ekkehard
König and Johan van der Auwera, 271–312. London, New York: Rout-
ledge.
Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke:
Palgrave Macmillan.
Branch, Michael. 1987. Finnish. In The World’s Major Languages, ed.
Bernard Comrie, 593–617. London: Croom Helm.
Braunmüller, Kurt. 1999. Die skandinavischen Sprachen im Überblick. 2nd
edition. Tübingen: Francke.
Campbell, George L. 1995. Concise Compendium of the World’s Languages.
London, New York: Routledge.
Catford, John C. 1977. Fundamental Problems in Phonetics. Edinburgh:
Edinburgh University Press.
Eckert, Rainer, Elvira-Julia Bukevičiūtė, and Friedhelm Hinze. 1994. Die
baltischen Sprachen: Eine Einführung. Leipzig: Langenscheidt Verlag En-
zyklopädie.
Encyclopedia of the World’s Major Languages, Past and Present, eds. Jane
Garry and Carl Rubino, 214–218. New York: Wilson.
Mosel, Ulrike. 1994. Saliba. München: Lincom Europa.
Mosel, Ulrike. 2006a. Fieldwork and community language work. In Essen-
tials of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmel-
mann, and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2006b. Sketch grammars. In Essentials of Language Docu-
mentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike Mosel,
301–309. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike, and Even Hovdhaugen. 1992. Samoan Reference Grammar.
Oslo: Scandinavian University Press.
Pompino-Marschall, Bernd. 1995. Einführung in die Phonetik. Berlin: de
Gruyter.
Ramers, Karl-Heinz, and Heinz Vater. 1992. Einführung in die Phonologie.
Hürth: Gabel.
Roca, Iggy, and Wyn Johnson. 1999. A Course in Phonology. Oxford: Black-
well.
Sánchez Miret, Fernando. 1998. Some reflections on the notion of diphthong.
Papers and Studies in Contrastive Linguistics 34:27–51.
Schubiger, Maria. 1977. Einführung in die Phonetik. 2nd edition. Berlin: de
Gruyter.
Schultze-Berndt, Eva. 2006. Linguistic annotation. In Essentials of Language
Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann, and Ulrike
Mosel, 213–251. Berlin, New York: Mouton de Gruyter.
Sievers, Eduard. 1893. Grundzüge der Phonetik zur Einführung in das Stu-
dium der Lautlehre der indogermanischen Sprachen. Leipzig: Breitkopf
und Härtel.
Sulkala, Helena, and Merja Karjalainen. 1997. Finnish. London, New York:
Routledge.
Trubetzkoy, Nikolaj S. 1977[1939]. Grundzüge der Phonologie. 6th edition.
Göttingen: Vandenhoeck und Ruprecht.
Vennemann, Theo. 1988. Preference Laws for Syllable Structure and the Ex-
planation of Sound Change: With Special Reference to German, Germanic,
Italian, and Latin. Berlin: Mouton de Gruyter.
Viitso, Tiit-Rein. 2003. Structure of the Estonian language: Phonology, mor-
phology and word formation. In Estonian Language, ed. Mati Erelt, 9–92.
Tallinn: Estonian Academy Publishers.
Chapter 9
Retelling data: Working on transcription∗
1. Introduction
Transcribing narrative and conversational speech is a core activity of all lin-
guistic fieldwork, though one of the less attractive ones. Neither linguists nor
speakers are generally very keen to spend long hours on this task. Nevertheless,
it is without doubt one of the most important tasks to be carried out in
the field, and one that requires close cooperation between speaker(s) and researcher(s).
Given its significance, it is somewhat surprising how little attention this
task receives in the literature. When transcription is mentioned, if at all, in
field method books and articles, the focus is usually on phonetic aspects,
i.e. questions relating to the proper representation of sounds and the dis-
tinction between broad and narrow transcription. Occasionally, there are a
few additional notes on general procedure, as conveniently summarized in
Chelliah and de Reuse (2011: 434–435). A notable exception here is Crowley
(2007: 137–141), who discusses practicalities of transcribing a fairly large
amount of narrative and conversational speech that go beyond the problems of
basic procedure and properly capturing sound. Likewise, the current chapter
is exclusively concerned with the conceptual and interpersonal issues arising
when working on transcriptions of continuous text, for which usually some
type of practical orthography (broad transcription) will be used.
We will not repeat Crowley’s very useful observations and suggestions
here. But we want to emphasize the point that text transcription has to be car-
ried out in close cooperation with native speakers, and this usually means: in
the field. It may be possible for a researcher to transcribe some parts of a text
recording independently at a stage when one has achieved a certain mastery
∗ We are very grateful to all the speakers who generously shared their knowledge of the
Beaver language with us and were so patient and accommodating in dealing with our obses-
sions with linguistic form rather than content. Many thanks also to Geoffrey Haig, Carolina
Pasamonik, Stefan Schnell and Gabriele Schwiertz for helpful comments and suggestions,
and to Jessica Di Napoli for thoroughly editing English grammar and style.
202 Dagmar Jung and Nikolaus P. Himmelmann
1. For further information on this project and full acknowledgements see
www.mpi.nl/dobes/projects/beaver.
segments must then be checked with more experienced speakers, who may
lack the patience or interest to collaborate on this task for more extensive
time periods.
It is not necessary, and in fact often not desirable, to work on transcription
with the speaker who appears in the recording. While the speaker probably
has a relatively clear idea of what she wanted to say, this does not mean that
she is particularly good at listening to and restating precisely what was actu-
ally said. In fact, speakers involved in the recorded speech event are some-
times more likely to engage in the correcting and extension activities dis-
cussed in Section 3 than non-participating speakers. Furthermore, listening to
one’s own voice in a recording can be disturbing as it sounds quite different
from what one hears when speaking and because one may feel embarrassed
about dysfluencies, speech errors and other kinds of imperfections in one’s
own speech.
As transcription is not only an unnatural but also a very time-consuming
and tedious task which requires practice and dedication, it is the task that is
perhaps most work-like of all the activities involved in field-based language
documentation and description. It is thus also the task that is best approached
in a work-like arrangement, characterized by regular working hours and a
salary/remuneration in line with local practices, if such arrangements are at
all possible in the community. Transcription is ideally done in an office-like
setting, with adequate furniture and a low noise level, so that one can fully
concentrate on the listening and interpretation task.
Work-like arrangements also provide for the option of independent tran-
scription, i.e. a native speaker works by her/himself on the transcription of
recordings. This, of course, presupposes that the speaker is able to handle
the technical aspects of playback (ideally using a laptop, otherwise an audio
player). It also involves some training and, crucially, regular supervision and
checking. The latter are important for two reasons: first, independent tran-
scribers, like researchers, may tend to develop ‘bad habits’ such as regularly
misspelling a class of high frequency items, leaving out segments at times
when they interrupt their work, etc. Regular checks may detect and arrest
such developments early on. Second, and of equal, if not greater importance,
independent transcribers need regular feedback and appreciation in order to
keep up the motivation for good and productive work. If all these conditions
are met, independent transcription, perhaps even involving two or three tran-
scribers, is probably the most efficient and productive method for tackling
this task.
The widely practiced alternative to independent transcription is collaborative
transcription, where researcher and speaker together transcribe a recording,
preferably while both listen to it through headsets. The discussion in
the following section is based on data generated in this way. The set-up
generally involved a single native speaker and a single linguist. The native
speakers were all elderly people as there are no younger speakers left, some-
times working on recordings of themselves, sometimes on recordings of other
speakers. Transcription was not separated from translation, so that upon
hearing a short segment played by the linguist, the native speaker would start
either by explaining what was being said, or with a free translation, or by
repeating the segment for the linguist to transcribe. All speakers involved are
bilingual in English and Beaver, and most of them are also literate in English.
Two speakers also have some familiarity with the orthography used for
representing Beaver and were hence able to follow what the linguist was writing.
While this is just one type of set-up for transcription, many of the phe-
nomena discussed below will also be found in transcriptions produced by
independently working native speaker transcribers and also whenever native
speakers are involved in editing precise transcripts for publications to be used
in the community (cp. Mosel 2004, 2008).
approach to transcription is very technical and its overall goal may not be
easily understood by non-specialists.
Native speaker collaborators in the transcription process tend to have dif-
ferent goals and priorities which can generally be classified as attempting to
improve the recording in a number of ways: to make the content clearer, to
make it more appropriate for a general audience, to make it adhere to what
they perceive as the proper norm or more authentic speech, and so on. Goals
and priorities here depend in part on who is helping with the transcription:
a transcriber who is also the recorded speaker may decide more freely on
what should be edited in and out for semantic reasons or perceived mistakes
in clause structure. He/she may also focus on rephrasing, expanding or re-
peating the text to guarantee its comprehension by the intended audience. A
transcriber who is not among the speakers who appear in the recording may
comment on specific lexical items and idiomatic expressions that should be
changed.
In the following, examples from actual transcription sessions in a field-
work setting illustrate these processes. They are organized into three major
types: a) avoidance strategies where loose paraphrases and translations are
provided instead of a precise rendering of the recording; b) editing-out strate-
gies which lead to the removal of words and phrases; c) editing-in strategies
changing elements appearing in the recording or adding completely new ma-
terial. The latter two types belong more closely together in that they both
relate primarily to linguistic form, while the first type is most closely related
to the content being transmitted. Specific examples often involve a mixture
of the three types, which cannot always be neatly separated.
The main concern here is clearly that there (finally) is understanding on the
part of the researcher. And while this may first be perceived by the latter as
not being very helpful with regard to the primary goal of getting a useful (i.e.
reasonably precise) transcription, it is evident that such information will later
be of great value for interpreting the narrative. Depending on speakers and
communities, work on a transcript may involve several retellings of the same
narrative (in the contact language) which usually include important informa-
tion for its interpretation. In some way then, the transcription setup should
allow for making use of such elaborations.
The opposite motivation for avoiding the word-by-word repetition needed
for precise transcription, i.e. the wish to transcribe less than what has been
2. As just mentioned and further illustrated in the following two subsections, form is of course
also a major concern, in particular in those cases where the recorded form is judged to be
inappropriate or incorrect. However, here the main concern is with cases of content-based
avoidance.
3. The practical orthography of Northern Alberta Beaver is used. Dentals are underlined (s̲,
z̲); the acute accent marks high tone.
3.2. Editing-out
There are different types of elements that tend to be edited out (or ‘over-
looked’) by native speakers in the transcription process regardless of the set-
up of the transcription process (independent or collaborative). Perhaps the
most common type concerns hesitations or false starts, i.e. verbal elements
that are not actually part of the linguistic construction, do not directly con-
tribute to its meaning, and, most importantly perhaps in the current context,
are generally absent in written formats other than transcripts. One very simple
reason for leaving them out in dictation is, of course, the fact that hesitations
and in particular false starts are difficult to reproduce. After some practice,
the researcher her/himself will usually be able to identify such hesitations
and false starts and can then add them to the transcription without having to
bother the native speaker collaborator with this. However, some care has to
In this example, the false start highlighted here was not initially identified by
the transcribers and thus was not included in the first transcript. At the time
when the transcript was being translated and the recording was listened to
again for verification, the linguist insisted that there was a word missing in
the translation (not having realized that it was a hesitation). Consequently,
the following ‘emendation’ was carried out (which strictly speaking is a case
of editing-in, but note that unlike the examples discussed in the following
section, here the editing-in is triggered by the researcher):
4. In the remainder of this section, segments that have been left out by native speakers in the
transcription process are put in parentheses (and in bold).
This clause-final clitic can indicate a loose temporal embedding in the dis-
course and generally does not play a role in signifying a more specific inter-
clausal construction.
Another class of words that are often edited out are evidentials that mark
a narration as known to the speaker only via the word of others. Since there
is no good translation for such particles, they are left out as ‘not important’.
In the case of Beaver, these words are also considered inappropriate for the
written style, partly because the European language used as target language
in the translation lacks the expressed category.
5. This section makes use of example pairs, where the first example provides the precise
transcription while the second shows the modified version.
Further changes include the editing out of the emphatic/focus marker laa fol-
lowing the first phrase, and the replacement of the purposive postposition
-gha with -ka. Even though both are grammatical in this context, the latter
appears in the common expression ‘to go out in order to hunt an animal’.
Classificatory verbs are also a locus for semantic differentiation and spec-
ification. They are frequently replaced in transcription:
In this example, the speaker substituted the stem for animate objects, which
refers implicitly to either a single or a dual object, with the stem for handling
plural objects.
An even more elaborate case of semantic specification is seen in the fol-
lowing example. A very general verb found in the original recording (13a) is
first replaced by a variant including a classificatory verb (13b) that pertains to
stick-like objects (in this case the leg of a frog). The third variant in (13c) was
suggested to be the one that best expresses the event depicted in the picture
(from the Frog Story, Mayer 1994[1969]). The verb here specifically refers
to the movement of legs:
A different sort of editing is involved when phrases are replaced with phrases
of an entirely different grammatical type. In the modified version, the situa-
tion is often described more explicitly than in the actual recording (note that
in (14) the focus marker laa has again been omitted with the original overt
subject noun in the modified version, so strictly speaking this is a case of
replacement and omission):
Other possible kinds of changes include the paradigm of the verb (aspectual
variation) or the choice of person markers. In example (15a), the speaker uses
the areal pronominal marker ghu- as a possessive prefix to refer to the story
in a general sense. The transcriber, who in this case is not identical to the
speaker, chooses the third person marker ma- for this construction:
6. Other changes include the editing-in of the adverb gaa ‘now’ to emphasize the difference
between the narrated past and the present situation, as well as a pronominal change (marked
3pl to general 3).
7. The tendency to replace English words or phrases in the annotation with Beaver terms is
more pronounced in the Northern Alberta varieties.
It may also happen that longer segments in English are translated by the tran-
scriber. Thus, for example, the clause he hears something was rendered as
wó˛ o˛ li dííts’ak in one transcription session.
More interestingly perhaps, the transcriber might speak a different vari-
ety of the language than the recorded speaker. This may result in changes in
the transcript, both with regard to morphological form and in the naming of
traditional characters:
(18) a. gǫǫ dyéézhe gaa éhdyi: aséi dasbát-éh nįka dyée-ya
        over.there 3.went now 3.said grandfather 1SG.hungry-with 2SG.for ASP-come.SG
        ‘He went over there and said: grandfather, I came to you because I’m hungry!’
     b. gǫǫ dyéézhe gaa éhdyi: aséi Daskutł’e, nįka dyée-zha
        over.there 3.went now 3.said grandfather Daskutł’e 2SG.for ASP-come.SG
        ‘He went over there and said: grandfather Daskutł’e, I came for you.’ (aghat’usdane002)
In the recording the singular motion stem is -ya, while the transcriber repeats
it as -zha. Both belong to the paradigm of the stem ‘sg.moves’, but they are
used differently in various dialects of Beaver.8 The second modification, the
renaming of a character within a well-known traditional story, is explained
by family tradition: “My grandfather told me about old man Daskutł’e.” The
original inflected verb form in (18a) that is not a personal name is thereby
replaced by a personal name that seems to fit the particular story.
4. Conclusion
Transcription of recorded data plays a central role in field-based language
documentation and description. The final product, i.e. the transcript, forms the
stepping stone for a variety of further activities, including not only grammat-
ical analysis but also the preparation of educational materials or other written
resources to support language maintenance or development efforts. But, as
8. Historically, a so-called voice marker has been absorbed into the -zha form.
argued here, the transcription process itself, while often tedious and disliked
by all parties involved, provides valuable insights into the linguistic knowl-
edge of speakers: insertion, omission, or change of items show the range of
the linguistic repertoire and (un)acceptable variation and thus complement
the evidence otherwise gathered in elicitation tasks. It may also provide im-
portant clues for our understanding of the creation of new linguistic varieties,
as many of the phenomena reviewed here also occur when basic transcripts
are edited for publication as written resources, resulting in the creation of
a written language variety where none existed beforehand (cp. Mosel 2004,
2008).
From a scientific point of view, it is thus important to document, as much
as practically feasible, the kinds of changes and elaborations discussed in the
preceding section, i.e. to provide both as precise a transcript of the actual
recording as possible as well as a record of the changes applied by native
speakers in the transcription process and the motivations given for them (if
any). The best way to do this is to record the transcription process as well,
which, however, will often not be possible for various reasons.
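Where recording the session is not possible, a lightweight record of changes can still be derived by comparing the precise transcript with the version dictated by the transcriber. The following is a minimal sketch under our own assumptions, not a procedure used in the Beaver project: it aligns two token lists with Python's standard difflib module and labels each difference with the categories used in Section 3. The function name, the label strings and the mapping from diff operations to labels are hypothetical; the tokens come from the final clause of example (18).

```python
from difflib import SequenceMatcher

# Labels follow the chapter's typology; the mapping from diff operations to
# labels is an assumption made for this sketch.
LABELS = {
    "delete": "editing-out",
    "insert": "editing-in (addition)",
    "replace": "editing-in (replacement)",
}


def classify_changes(recorded, dictated):
    """Align two token lists; return (label, recorded_tokens, dictated_tokens) triples."""
    matcher = SequenceMatcher(a=recorded, b=dictated, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretches need no record
        changes.append((LABELS[tag], recorded[i1:i2], dictated[j1:j2]))
    return changes


# Final clause of example (18): (a) as recorded, (b) as repeated by the transcriber.
recorded = ["aséi", "dasbát-éh", "nįka", "dyée-ya"]
dictated = ["aséi", "Daskutł’e", "nįka", "dyée-zha"]

changes = classify_changes(recorded, dictated)
for label, old, new in changes:
    print(label, old, "->", new)
```

Such an automatic comparison only registers that a form was changed. The motivation a speaker gives for the change (dialect, family tradition, perceived norm) still has to be noted by hand, and content-based avoidance strategies such as paraphrase cannot be captured in this way at all.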
There is, of course, a potential conflict here between scientific and com-
munity/speaker interests, which was already hinted at in the introduction to
Section 3: What if a speaker or the community at large actually rejects (parts
of) an utterance as incorrect or inappropriate, for whatever reason? Under
such circumstances, which version should be made available to whom and in
what form? There is no straightforward and easy answer to this question, as
in all cases where conflicts arise about control and ownership in a documen-
tation project, but layered access levels in a digital archive usually make it
possible to accommodate the interests of all parties concerned.
Abbreviations
1, 2, 3   first, second, third person (usually indexing the subject argument if not otherwise specified)
ANIM O    animate object
ARE       areal
ASP       aspectual
CNJ       conjugation
DEM       demonstrative
DIM       diminutive
DU        dual
ELO O     elongated object
FOC       focus
HAB       habitual
LOC       locative
O         object
PFV       perfective
PL        plural
PL O      plural objects
POSS      possessive
PRT       particle
SG        singular
V         valency
References
Chelliah, Shobhana L., and Willem de Reuse. 2011. Handbook of Descriptive
Linguistics. Dordrecht: Springer.
Crowley, Terry. 2007. Field Linguistics: A Beginner’s Guide. Oxford: Oxford
University Press.
Himmelmann, Nikolaus P. 2006. The challenges of segmenting spoken lan-
guage. In Essentials of Language Documentation, eds. Jost Gippert, Niko-
laus P. Himmelmann, and Ulrike Mosel, 253–274. Berlin, New York: Mou-
ton de Gruyter.
Jung, Dagmar, Julia Colleen Miller, Patrick Moore, Gabriele Müller
(now Schwiertz), Olga Müller (now Lovick), and Carolina Pasamonik.
2004–present. DoBeS Beaver Documentation. DoBeS Archive MPI Nij-
megen, http://www.mpi.nl/DOBES/.
Mayer, Mercer. 1994[1969]. Frog, Where Are You? New York: Dial Books
for Young Readers.
Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising
from the aims of language documentation. Language Archives Newslet-
ter 1(3):3–4.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials
of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann,
and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from
a language documentation project in Bougainville, Papua New Guinea.
Presentation at the Simposio Internacional Contacto de lenguas y docu-
mentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral
and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/
mosel_publikationen.htm#download (accessed 2011/02/26).
Ochs, Elinor. 1979. Transcription as theory. In Developmental Pragmatics,
eds. Elinor Ochs and Bambi B. Schieffelin, 43–72. New York: Academic
Press.
Chapter 10
The making of a multimedia encyclopaedic lexicon
for and in endangered speech communities∗
Gabriele Cablitz
1. Introduction
In the past ten years a growing amount of lexicon software1 has become avail-
able to create multimedia lexica. A key area of application is the compilation
of online dictionaries for speech communities of underdescribed and endan-
gered languages. For a number of such projects (Manning, Jansz, and In-
durkhya 2001; Kroskrity 2002; Albright and Hatton 2008; De Korne et al.
2009; Yang et al. 2008; Rau et al. 2009; Cablitz, Chong, and Tetahiotupa
2009) the development of a multimedia lexicon is an important step towards
language documentation as a means of language maintenance and revival and
the preservation of endangered linguistic, lexical and cultural knowledge.
In this chapter we report on an interdisciplinary project2 in which digi-
tal multimedia encyclopaedic lexica are created for the endangered Marque-
san and Tuamotuan languages of French Polynesia with the help of the lexi-
∗ This research has been generously supported by two DoBeS grants of the Volkswagen
foundation. I would like to thank the Marquesan and Tuamotuan speech communities for
their warm welcome, support and inspiration for the project. I have in particular ben-
efited from discussions with Tehoatahiiani Bruneau, Fasan Chong (Jean Kape), Marc
Kemps-Snijders, Lucien Mataiki, Upu Mataiki, Lucien (Mimio) Puhetini (†), Jacquelijn
Ringersma, Tahia Tuohe (†), Edgar Tetahiotupa, Mathias (Teaiki) Tohetiaatua, Peter Wit-
tenburg and Claus Zinn. I would like to thank Ken Dicks for proof-reading my paper and
Geoff Haig, Nicole Nau and Claudia Wegener for their helpful comments on earlier ver-
sions of this paper; any errors and inconsistencies are of course my own. Last, but not least
I would like to thank Ulrike Mosel for her great inspiration of my own work, and all her
enthusiasm and support over the years.
1. E.g. the Kirrkirr software (Manning, Jansz, and Indurkhya 2001), Lexique Pro and We-
Say (SIL), LEXUS (Max Planck Institute for Psycholinguistics), IDD (Indiana Dictionary
Database, cf. De Korne et al. 2009) among others.
2. The project was part of the DoBeS programme and was generously supported by the
Volkswagen Foundation between 2006 and 2010 (http://www.mpi.nl/dobes).
con tool LEXUS.3 LEXUS is a web-based tool which has a flexible scheme
of linking multimedia documents – including annotated sessions from the
archive – to lexical entries as well as the possibility of creating relational
links using its integrated tool ViCoS.4 The relational linking device in ViCoS
is a new form of knowledge representation with which the user can create a
dense network of lexical and cultural data in ways which are meaningful to
different kinds of user groups in a thematically organised way.5 This has im-
portant implications not only for the speech communities, whose languages
are documented, but also for documentary linguistics. In this chapter we will
discuss why this form of language documentation, i.e. the creation of multi-
media lexica with LEXUS, is beneficial to both the scientific as well as the
speech communities.
In the first part of this chapter (§2) we will briefly discuss the background
and objectives of this multimedia encyclopaedic lexicon project (henceforth:
MEL-project). In §3 we proceed to outline the type of lexicographic work we
have undertaken in the MEL-project, discussing new aspects of lexicographic
work in our lexica in more detail. This section also provides insights into
the creation and representation of thematically organised networks of lexical
and cultural data in ViCoS (§3.4), in particular the creation of ethnobotanical
ontologies and a folk taxonomy on marine life.
One major objective of the MEL-project is to motivate the speech com-
munities to actively participate in the process of creating these multimedia
lexica. The web-based editing possibilities of LEXUS and internet facilities
in French Polynesia – in principle – make it possible to allow an online par-
ticipation by the speech community. In §4 we will address the implications
a web-based lexicon tool has for the process of lexicon creation as well as
the problems which are involved with such an approach of egalitarian lexicon
creation: a model of collaborative workspaces and the basic challenges of a
web-based collaboration in endangered speech communities are discussed in
detail. The last section (§5) discusses why the making of dictionaries should
be a key activity in a DoBeS language documentation project, or indeed for
3. LEXUS is currently being developed by the technical team of the Max Planck Institute for
Psycholinguistics in Nijmegen (Netherlands).
4. ViCoS=Visualization of Conceptual Spaces, cf. http://www.mpi.nl/dobes/tools and cf.
Zinn (2008: 890–894); the first multimedia lexicon software using the relational linking
device was Kirrkirr in the late 1990s (McElvenny 2008: 160).
5. Cf. §3.4 for further details.
2. The MEL-project
2.1. Background
The MEL-project evolved from a previous DoBeS language documentation
project6 of the Marquesan languages in French Polynesia. Between 2003 and
2006 a large corpus of primary and lexical data from five different Marquesan
island vernaculars was compiled, analysed and annotated in close
cooperation with the Marquesan speech community, and is now stored in a
digital multimedia language archive housed by the Max Planck Institute for
Psycholinguistics (cf. Cablitz 2010: 40–43 for details). In the course of this
DoBeS documentation project we also began to build up a general lexical
database and glossaries of topics which have become important foci of our
documentation (breadfruit varieties, food preparation, plants, plant medicine,
fish and fishing, etc.). All lexical databases are trilingual (Marquesan, French
and English); the lexical entries have been extracted from recordings on the
respective topics, with added information from field notes and wordlists. Due
to the well-known constraints on time and money in short term documentation
projects, we mainly applied Mosel’s thematic approach of dictionary making
(2004b; 2011). We created so-called mini-dictionaries (Mosel 2011: 348) by
focussing on one sub-domain of a culture (e.g. fish/fishing) at one time. In
this way a mini-dictionary can be completed in a relatively short period of
time which often has motivating effects on the speech community (Mosel
2004b: 45–47). Idioms, collocations and lexicalised phrases were also col-
lected in a separate database because they constitute an important part of the
2.2. Objectives
The main objective was to create multimedia lexica for the Marquesan and
Tuamotuan speech communities by building up a new form of multimedia
language archive in a structural frame of a lexicon with the help of LEXUS
and ViCoS (thereby also providing the LEXUS/ViCoS developers with a doc-
umentation setting to further improve and refine the software). This involved
the enriching of lexical entries with indigenous knowledge, multimedia ex-
tensions (images, video and audio clips) as well as annotated sessions from
the archive, in order to move from a conventional dictionary with simple
glosses towards an encyclopaedia or ethnographic type of lexicon (cf. Paw-
ley 2001: 236–237). In order to do this effectively, we needed to motivate the
speech communities to actively participate in the creation of these multime-
dia lexica, thus becoming more involved in the process of documenting their
own languages. The entire process had an important beneficial side-effect:
the involvement of the speech community in the creation of the multimedia
lexica is a way of contributing to language maintenance and revival.8
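To make the idea of an enriched entry more concrete, the following sketch shows one possible in-memory representation of a lexical entry that carries glosses, an encyclopaedic note, multimedia links and relational links. It is an illustration under our own assumptions and does not reproduce the data model or any interface of LEXUS or ViCoS: the class names, field names, relation label and media location are all hypothetical. The sample content uses the breadfruit pudding terms mākiko and heikai vaihopu discussed in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MediaLink:
    kind: str      # e.g. "image", "audio", "video", "annotated_session"
    location: str  # placeholder identifier; not a real archive path


@dataclass
class LexicalEntry:
    headword: str
    glosses: Dict[str, str] = field(default_factory=dict)  # language code -> gloss
    encyclopaedic_note: str = ""
    media: List[MediaLink] = field(default_factory=list)
    # Relational links as (relation label, target headword) pairs.
    relations: List[Tuple[str, str]] = field(default_factory=list)

    def related(self, label: str) -> List[str]:
        """Return the headwords linked to this entry by the given relation label."""
        return [target for rel, target in self.relations if rel == label]


entry = LexicalEntry(
    headword="mākiko",
    glosses={"en": "a kind of breadfruit pudding"},
    encyclopaedic_note="Made from very ripe breadfruit pulp and coconut milk.",
)
entry.media.append(MediaLink(kind="video", location="PLACEHOLDER/preparation-clip"))
entry.relations.append(("same_ingredients_as", "heikai vaihopu"))

print(entry.related("same_ingredients_as"))
```

Keeping relations as labelled pairs rather than fixed fields leaves room for the thematically organised networks described for ViCoS, since new relation types can be added without changing the entry structure.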
In order to facilitate the more active participation of speech community
members, they had to learn a) about the basics of lexicography and the rele-
vant linguistic software used, and b) how to write monolingual definitions and
encyclopaedic articles of vernacular words, a documentation method which is
7. Phonological variants are allolexemes, also called triplets and doublets (Elbert 1982). Dou-
blets and triplets are different forms of the same lexeme which are used by one and the
same speaker (e.g. ko’aka – ’o’aka ‘find’), i.e. one cannot observe any complementary
distributional rules nor a regional demarcation (Cablitz 2006: 26).
8. We documented e.g. very specialised knowledge about plants which some of our language
consultants have not talked about for many years. In many instances, the documentation
helped them remember and revive traditional knowledge.
9. Cf. Coward and Grimes (2000) for details about the MDF (=Multi-Dictionary Formatter)
database type in Toolbox.
In the following sections, the motivation behind the inclusion of these new
aspects will be explained and further detailed.
When doing lexicography in and for endangered speech communities, the lin-
guistic fieldworker will often find that lexicographic practice does not easily
combine with linguistic theories of lexical semantics (Haiman 1980; Herbst
and Klotz 2003; Pawley 2001; Mosel 2004b). Apart from the difficulty in
finding adequate translations in the target languages for vernacular words
of the source language – which is a general problem of bi- or multilingual
lexicography (Herbst and Klotz 2003: 109) – there is also the problem that
many headwords of lexical entries denote complex phenomena, procedures
and concepts which are specific to the source language and culture and there-
fore often do not have a translation equivalent in the target language at all
(Mosel 2004b: 48; Franchetto 2006: 203–206; Haviland 2006: 136–139). In
order to avoid an inadequate documentation of word meaning it is there-
fore often necessary to provide more encyclopaedic information in definitions
than a conventional bi- or multilingual dictionary would normally include.
For instance, Marquesan (MQR) heikai vaihopu and mākiko are both
a kind of breadfruit pudding made out of the same ingredients (very ripe
breadfruit pulp and coconut milk), but the final product and the technique of
preparing them differ to a great extent. In our lexicographic work, the En-
In §2.2 it has already been mentioned that monolingual (or vernacular) def-
initions of the different senses of a headword and vernacular encyclopaedic
articles are an important part of our MEL-project. This kind of work encour-
ages native speakers to express meanings of words in their own language and
therefore to further stimulate the expressive power of their language. Vernac-
ular definitions of words are also an important step towards understanding
indigenous word meaning because they document word meaning from the
native speaker’s understanding of a word (Mosel 2004b: 48). They form a repository
of authentic indigenous word meaning for the scientific community because they
can clarify possible misunderstandings and misinterpretations between the
researcher and the local field assistants and language consultants.
Misinterpretations are not uncommon in linguistic fieldwork because the researcher
and the local consultants often communicate via a contact language which
is the native language of neither (Mosel 2004b: 48; Haviland 2006: 144).
Even if the fieldworker communicates via the field language, there are usu-
ally very different levels of linguistic competence between the fieldworker
and the language consultants which could lead to misinterpretations.
In many endangered speech communities which lack a tradition of lit-
eracy, the formulation of monolingual definitions and encyclopaedic articles
poses a considerable challenge and leads to the emergence of new, written
speech genres. While linguistic ecologists such as Mühlhäusler (1990, 1996)
have cautioned against the imposition of literacy because it might dimin-
ish a rich oral heritage,10 such newly emerging written genres can provide
fascinating insights into the expressive potential of speakers of a language
(Mosel 2004a: 263, 2004c: 4). New constructions may be deployed which are
rarely used in everyday conversations, undoubtedly of interest to linguists and
possibly to language educators. The Marquesan field assistants indeed devel-
oped a style containing new and rarely used constructions. Verb serialisation
(1, 2) and clause chaining (2) occurred much more frequently than in other
(speech) genres. For example, a wooden ring (manoni) is defined as follows:
(1) ’Akau humu=tīa    ha’a=kapoipoi, to’o=tīa  no  te  hana
    wood  attach=PASS CAUS=round     take=PASS for ART fabricate
    pafi’o      ako’e’a ’ofi’o
    landing.net or      catcher
    ‘Attached rounded wood (lit. ‘wood which has been attached and rounded’)
    used to fabricate landing nets (fruit harvest) and catchers (fishing)’
10. Crowley (2001: 3) has noted that literacy was introduced long ago in many
Pacific minority languages, but that it has not at all resulted in the reduction of the rich oral
heritage. The same holds for the Marquesas.
11. HO and TH are abbreviations for different island vernaculars: HO = Hiva ’Oa dialect, TH
= Tahuata dialect.
ViCoS knowledge spaces12 – visualise multiple links in the same space. The
users can view several nodes simultaneously on the computer screen, which
allows them to navigate according to their interests. All nodes can contain further
links to other lexical, multimedia or archive data (containing sessions of
specific cultural uses), which are opened once the user clicks on the nodes.
The screenshot in Figure 2 shows how parts of the breadfruit ontology are
realised in ViCoS.
[Figure 2. Screenshot of the ViCoS Editor and Navigator showing part of the
breadfruit ontology. Visible nodes include tumu mei, mei, mei autēa, mei aravei,
mei hinu, mei 'ape, mei kopumoko, mā, mā tehītō and popoi; the legend lists the
relation types preparation_shown_in, is_a, is_product_of, is_material_for and
is_part_of.]
In the upper left part of the screenshot only those relations are visualised
which are related to the node mei ‘breadfruit’. When navigating e.g. to the
node mā ‘fermented breadfruit’ it becomes the focus in the ViCoS Editor and
Navigator showing all relations connected to mā. The navigation can continue
12. Note that the Kirrkirr software also displays multiple relational links in the same
space.
as long as relations have been created between elements of the LEXUS lexical
database.
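The navigation described above can be pictured as a walk over a graph of typed relations. The following Python sketch is only an illustration and not the ViCoS implementation: the class name KnowledgeSpace, its method names and the two sample edges are assumptions made for this example; only the relation label is_product_of and the node names are taken from the text and from the Figure 2 legend.

```python
from collections import defaultdict


class KnowledgeSpace:
    """Toy graph of lexicon nodes joined by typed relations (hypothetical)."""

    def __init__(self):
        # node -> list of (relation, other node, direction)
        self._links = defaultdict(list)

    def relate(self, source, relation, target):
        """Record 'source relation target' and index it under both nodes."""
        self._links[source].append((relation, target, "out"))
        self._links[target].append((relation, source, "in"))

    def focus(self, node):
        """Return every relation that touches the focused node."""
        return sorted(self._links.get(node, []))


space = KnowledgeSpace()
# Illustrative edges only; the actual breadfruit ontology is not reproduced here.
space.relate("mā", "is_product_of", "mei")
space.relate("popoi", "is_product_of", "mā")

# Moving the focus from mei to mā reveals the relations attached to mā.
print(space.focus("mei"))
print(space.focus("mā"))
```

Indexing each relation under both of its nodes is what lets navigation continue in either direction for as long as relations exist.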
The advantage such a tool has for both the speech community and the scientific
community is obvious. Speech community members can define and organise how
words and other (cultural) data are grouped together and what relations they
hold to each other, thus creating knowledge spaces which are meaningful to
speech community members.
Due to the flexible scheme of linking, all kinds of data13 can be com-
bined in whatever way possible according to the particular interest of the user.
Linguistic researchers could create knowledge spaces to view a particular se-
mantic field or domain of investigation (e.g. CUT and BREAK verbals). Re-
searchers interested in oral literature, for example, can include proper names
of protagonists in the lexical database, establish the relationships they hold to
each other (e.g. is_father_of, is_younger_sister_of, is_sister-in-law_of, etc.)
in ViCoS knowledge spaces and make links to the archive where the respec-
tive narratives are stored (cf. Figure 3).
ViCoS not only visualises various relations; it can also be used to record important
information (e.g. complex family relationships) which is usually included neither
in the lexical entry nor in the metadata of the archive sessions.
The indigenous representation and organisation of data in ViCoS have
been collected and prepared in two different ways. Our first approach was to
create knowledge spaces from an ethnobotanical perspective as the traditional
material culture of the Marquesas and Tuamotu islands is – for the most part –
based on plants. A plant was taken as a point of departure (or anchor) and then
all the cultural uses which are connected with that plant were established.
The cultural uses of the most important trees in the Tuamotuan (coconut
tree) and Marquesan cultures (breadfruit, coconut and banyan tree) were
elicited in detail by creating a kind of informal ontology for each tree: all the
different parts of the plants were named and related to the ways in which they are
used in the traditional culture (food, preparation, plant medicine, handicrafts,
canoe-building, house-building, etc.). It was the intention to establish a kind
13. The linking of multimedia files in ViCoS is still very restricted, but there are workarounds
for creating links to the archive and multimedia files. One can create a database
in LEXUS which contains the relevant multimedia files, archive links or photo galleries
without further lexical data. Within that database one creates a data category (e.g. ‘link to
photo gallery’) which can be used as a link in the ViCoS knowledge space to access the
database which contains the multimedia data or archive links in LEXUS (cf. Figure 2).
[Figure 3. ViCoS knowledge space linking protagonists of oral narratives
(Teupokootiū, Teupokootekahi, Paevao, Moni, Keikahanui, Pa‘ehitu). Relation
types in the legend: is_father_of, is_ancestral_spirit_of, is_son_of,
is_adoptive_sister_of, is_younger_sister_of, is_sister-in-law_of.]
Cultural knowledge spaces are informal ontologies which are solely based
on native speakers’ associations between elements of the lexicon in LEXUS
(Zinn 2008: 890–894). They are not formal ontologies like SUMO or
WordNet, which are built for machine reasoning rather than for human use
in educational settings of language revival (Zinn et al. 2008). Ontology
building in endangered speech communities has so far been attempted
by Rau et al. (2009: 199–209) for the Austronesian language Yami, but the
researchers “have come to realize that it is not possible to describe an on-
tology based on sophisticated machine reasoning. Any ontology description
... requires triangulation of various resources of human interpretation” (Rau
et al. 2009: 208). Whereas Rau et al. (2009: 192) pursue a ‘formalized model
of existing indigenous knowledge’, our objective is more guided by practi-
cal aspects of usability for the speech community and better visualisation of
cultural connections in a language documentation archive.
Our second approach was to create so-called folk taxonomies (Conklin
1962; Bulmer 1970; Coward and Grimes 2000: 138–142) for which speech
community members had to classify aspects of the natural environment by
creating classes of one specific cultural domain and establish hierarchical
structures based on cultural uses and associations.
Conklin (1962: 50) observes that folk taxa “belong simultaneously to sev-
eral distinct hierarchical structures” of which some are based on form and ap-
pearance whereas others are based on culture-specific associations and uses.
In other words: “functional categories” of plants and animals, e.g. culture-
specific products such as containers, clothing, ornaments, medicine, food
dishes, etc. form part of folk taxonomies as much as the categories and classes
formed on the basis of form and appearance of things from the natural envi-
ronment.
For our work on folk taxonomies we decided to focus on the domain of
marine life because this domain – next to plants – plays an important role in
the two Polynesian cultures. Thus far, the work has only been accomplished
in the Marquesan speech community.
The first step in establishing a folk taxonomy is to find out under which
generic term a lexeme is classed in the vernacular, revealing the higher category
of a lexeme (Coward and Grimes 2000: 138). However, between the
highest level of the taxonomy (i.e. the generic term), called the life form, and the
specific name of a thing, called the terminal taxon – which corresponds to the western
notion of species – there can be many more intermediate levels called
as a functional aspect, namely that they do not need to be scaled after fishing.
Furthermore, it is interesting to note that the classes are not always formed
with regard to the linguistic classification. For example, some fish names
which are formed with the generic terms of particular classes such as ūme
‘unicornfish’ and tamano ‘jackfish, chubs’, are not necessarily classified with
fish which bear the same generic term. For example, ūme tiaporo ‘longhorn
cowfish’ belongs to a different class than ūme tatihi, ūme kuripo, ūme mei,
etc. which all belong to the unicornfish family. Furthermore, the term ’eitano
‘seafood’ comprises crustaceans as well as shellfish, but in the classifications
the consultants clearly separated the group of crustaceans (īpu pe’ehu
‘hard shell’) and shellfish (īpu fe’o ‘soft shell’) into two groups (cf. above).
In ViCoS these indigenous classifications can be visualised coherently and
contrasted with scientific classifications via scientific names of fish families
(e.g. Balistidae (triggerfish) vs. Labridae (wrasse), etc.).
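The mismatch between the generic term in a fish name and the class assigned by the consultants can be checked mechanically once the classifications are tabulated. The Python sketch below is a minimal illustration under stated assumptions: the function names are invented for this example, and the class labels unicornfish_class and other_class are placeholders, since the consultants' own labels are not given here. Only the four fish names and the fact that ūme tiaporo falls outside the class of the other three come from the text.

```python
from collections import defaultdict

# Each record: (fish name, class assigned by the consultants).
# The class labels are placeholders, not the consultants' own terms.
CLASSIFIED = [
    ("ūme tatihi", "unicornfish_class"),
    ("ūme kuripo", "unicornfish_class"),
    ("ūme mei", "unicornfish_class"),
    ("ūme tiaporo", "other_class"),
]


def generic_term(name):
    """Take the first word of a compound fish name as its generic term."""
    return name.split()[0]


def split_generics(records):
    """Map each generic term to its folk classes when it has more than one."""
    classes = defaultdict(set)
    for name, folk_class in records:
        classes[generic_term(name)].add(folk_class)
    return {term: sorted(found) for term, found in classes.items() if len(found) > 1}


# Generic terms whose members were sorted into different classes.
print(split_generics(CLASSIFIED))
```

A generic term that appears in the result is one where the linguistic form of the name and the consultants' classification diverge.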
Altogether only three consultants participated in the task, so our folk tax-
onomy does not yet represent a general indigenous classification of marine
life for the Marquesan speech community. More data will have to be col-
lected with a larger number of consultants to see whether there are regular
patterns of classification across the speech community. However, the con-
sultants, of whom two also worked as field assistants on the dictionary, felt
that this task greatly helped them in structuring and eliciting encyclopaedic
knowledge which is now being transformed into encyclopaedic articles for
the fish lexicon. Again, this task shows that the documentation of lexical and
cultural knowledge is best achieved when it is embedded in a very neatly
defined thematic domain. Moreover, cultural connections can be much more
easily established when the whole domain of investigation is highly contex-
tualised.
Internet facilities are becoming more and more accessible in the remotest ar-
eas of the world which makes it technically possible to continue the coopera-
tion between researchers and speech communities outside fieldwork periods.
The web-based editing possibilities of LEXUS were particularly designed to
motivate the speech community members to actively participate in the doc-
umentation of their languages by adding and editing dictionary entries and
enriching them with encyclopaedic information via the web (Ringersma and
Rybka 2009), and potentially, in the absence of the researcher.
However, there are many problems connected with such a collaboration
framework which basically has a wiki-like set-up (cf. also Rau et al. 2009:
194–195). In this section it will be discussed why web-based collaboration
with endangered speech communities is difficult to achieve, drawing on our
own experience and the numerous discussions we had with the Marquesan
and Tuamotuan language activists and field assistants.
At the beginning of the project the idea of online cooperation seemed
very appealing both to the linguistic team and the speech communities. The
language activists in the Tuamotuan speech community, for example, hoped
that the web-based possibilities of LEXUS would promote community-wide
participation in documenting endangered cultural knowledge on a larger
scale. Community members could be involved who would otherwise not be
able to contribute due to difficult inter-island communications in the Tuamo-
tuan archipelago:14 getting to remote parts of the Tuamotuan islands is ex-
pensive and time-consuming. For the linguistic team a web-based collabo-
ration initially seemed appealing because the collaboration with the speech
communities could continue outside fieldwork periods which would ensure a
continuous growth of linguistic material in a short period of time.
However, it is not sufficient simply to make a web-based tool available in
order to ensure online cooperation, nor can one assume that an encyclopaedic
lexicon will be easily created in a wiki-like manner by the speech commu-
nity. For a successful online cooperation with LEXUS, there needs to be a
substantial amount of capacity building in the speech community and a num-
ber of community-internal obstacles have to be overcome as well, which will
be described in the following sub-sections. Some of the obstacles are culture-
specific, but a number of them can be generalised for endangered speech
communities.
If the lexicon creation is open to the whole speech community, there
needs to be a system established in LEXUS to ensure that the contribu-
tions are coordinated and merged into the lexicon. In LEXUS this concept
of collaborative lexicon creation will be realized via so-called collaborative
workspaces.15 The whole set-up of collaborative workspaces is a crucial pre-
14. Note that the sparsely scattered Tuamotuan archipelago covers a geographic area approximately
corresponding to a triangle between northern Germany, Bucharest and London.
15. LEXUS does not yet have the full set-up of collaborative workspaces, but so far can only
distinguish between readers and writers. The concept described below was the proposal
set forth by the software developers to the linguistic team and the speech community. It
gave rise to many discussions which are relevant for the discussion of web-based tools for
language documentation.
16. The set-up of collaborative workspaces and the workflow were also intensely discussed
with Marquesan speech community members, whose concerns were very similar to those
of the Tuamotuan community.
[Figure residue: annotation example with the labels “Emoticon”, “Selection” and
“Drawing” and the note “patahi is written pātāhi!”]
on the age of the consultants and their up-bringing, the metalinguistic knowl-
edge is in general very heterogeneous.
The problem is also rooted in their traditional society: the transmission of
cultural knowledge was – and still is – by no means a public affair and it was
only transmitted to selected persons in the community. Consequently some
speakers possess detailed cultural knowledge, whereas others – belonging to
the same generation – only have rudimentary knowledge.
Moreover, the continuous loss of their linguistic and cultural heritage also
feeds many insecurities among the speakers and often gives rise to conflicts
between speech community members over what is authentic and inauthentic
knowledge. Community members often accuse each other of (re-)inventing
and transforming the indigenous language and culture. Knowledgeable – of-
ten older – speakers are frequently insulted and stigmatised as liars because
their knowledge is not commonly shared with other community members. As
a consequence, knowledgeable speakers withdraw from participating in the
documentation of their languages.
Another problem of community-wide lexicon creation is that both Poly-
nesian communities are still very much anchored in their oral traditions. De-
spite rapid westernisation and schooling in the last 40 years, the Tuamotuan
and Marquesan communities have not really developed a writing tradition.
The most knowledgeable speech community members, whom one ideally
wants to engage in the creation of an encyclopaedic lexicon, often cannot read
or write, not to mention their lack of IT skills. Even those speech community
members who are literate are often reluctant to express their knowledge in
writing. They mostly prefer to “chat” about their knowledge in informal in-
terviews, and thus many consultants still feel most at ease when their knowl-
edge is simply recorded, thus remaining, to some extent, bound to the oral
tradition of knowledge transmission.
Some of the problems discussed in this section are specific to Polynesian
communities, but the tensions and insecurities which exist due to the loss of
the language and culture can probably be generalised for a number of endan-
gered speech communities. Rau et al. (2009: 194) report that Yami commu-
nity members were “suspicious and critical of the collaborative efforts” of
the research team and showed negative attitudes towards researchers who
were not part of the speech community. The greatest fear of Yami community
members is the potential abuse of the materials which are put online and the
disrespect of intellectual property rights, which meant that the wiki dictio-
nary developed for the Yami dictionary project did not get sufficient support
(Rau et al. 2009: 195). Similar fears of abuse have also been expressed by
Marquesan and Tuamotuan speech community members.
ers and the speech community. However, it will be very difficult to develop
one simplified user interface for all speech communities because the concept
of simplicity depends very much on a personal selection of functionality
and on personal preferences. Each speech community has differing needs
and demands, and it is therefore difficult to develop a single implementation
which takes account of the wishes of particular user groups. As long as
no simple standards have been established through broad training during capacity
building, it will be very difficult to realize a general simplified user interface.
17. Young Polynesians tend to be quite IT literate because computer skills have been part of
the school curriculum since information technology became available in French Polynesia,
i.e. since the late 1990s.
The definition of plant parts, body-parts or any parts of things from the
natural environment is a difficult task because speakers of non-Indo-European
languages can partition aspects of the natural world (including fauna, flora,
landscape, the human body, etc.) differently than speakers of English or
French do (e.g. pipi ūma ‘middle part of lobster between abdomen and head’).
Some observable phenomena or objects do not have names at all in the target
language (e.g. pūhu’u ‘brown, cloth-like substance around trunk of banana
plant’). The documentation of word meaning and indigenous knowledge can
only be efficiently established in face-to-face communication because it requires
subtle questioning and explanations which go back and forth between
researcher and consultant or field assistant. In a fieldwork situation, one gen-
erally has the possibility to pick up on interesting comments or follow up on
interesting leads. Important details and distinctions can be made by demon-
strating an action, showing the specimen (and its parts) in question in its
natural cultural context, etc. The fieldworker can constantly challenge and
question the definitions with regard to a word’s usage and ask for clarifica-
tions and more precise meanings. In particular the non-native perspective of
the fieldworker and his or her lexical knowledge of the field language as a
second language learner can be fruitful in analysing and refining word mean-
ing.
From a more general perspective, any documentation of indigenous knowl-
edge is a piecemeal process which requires time and patience because many
of the cultural activities have not been practised for a long time. Even the most
knowledgeable consultants will not remember aspects of their traditional cul-
ture instantly. After a session with the fieldworker, the consultant will often
come back to add and revise the documentation of cultural processes and
practices or vocabulary. At a later stage consultants often want to replace the
modern term with the “real” – often obsolete – term for things or activities.
The documentation of indigenous word meaning and knowledge is a close
collaborative effort between the fieldworker and the field assistants and field-
work cannot be replaced by simply making a web-based tool available to
the speech community. For the reasons listed above, the enrichment of the
lexicon with linguistic and cultural knowledge is still best achieved during
fieldwork periods.
show in which way a word is related to other words of a language in the same
way as dictionaries do. In other words: although the primary data documents
how a language is actually used, the semantic complexity of words and their
semantic networks are neither evident nor easily deducible. In view of this
it is clear that directly accessible structured lexical data adds significantly to
the practical usability of primary data to speech communities, educators and
researchers from scientific disciplines other than linguistics.
Furthermore, a language documentation should not only be data-oriented
and multifunctional, it should ideally provide a potential basis for developing
pedagogical material, thus contributing to language maintenance and revival
as well. A number of other field linguists working with endangered speech
communities regard dictionaries not just as a mere documentation device and
a compilation of structured analysed abstractions of a language, but also as
a key activity for language maintenance and revitalization when prepared
and presented in an accessible format18 for the speech community (Corris
et al. 2000; Kroskrity 2002; Hinton and Weigel 2002: 155; De Korne et al.
2009: 141; Rau et al. 2009; among others); for Hinton and Weigel (2002: 156)
a dictionary is “a repository of tribal identity that can be used for a variety of
purposes even after the language ceases to be spoken”. In many speech com-
munities the compilation of a dictionary is still regarded as one of the most
important products of a language documentation project (Mosel 2006: 68).
The language also acquires a higher status or the status of a “real” language
which approaches that of the dominant local language(s).
18. Cf. below and Corris et al. (2000) for a discussion of the tension between “documentation
dictionaries” and “maintenance dictionaries” or lexical database vs. community dictionary
(Mosel 2011: 340).
Apart from contributing to language maintenance and revival (cf. §5.3 for
more details), multimedia lexicon tools such as LEXUS have the potential
to be excellent resource and research tools for the scientific community. The
archive-linking capacity in LEXUS creates enriched multimedia lexica which
go beyond conventional dictionary making. Word meanings in the dictionary
can be fully contextualised and presented in their socio-cultural contexts by
linking corpus-based examples in the dictionary to the respective annotated
sessions in the archive. Words in annotated sessions, on the other hand, are
embedded in and linked to their dictionary entry which contains the full range
of meanings of a word which is not documented in the annotated session as
such (cf. above, §5.1).
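The two-way linking between dictionary entries and annotated sessions can be thought of as a forward link stored with each example plus a reverse index derived from it. The Python sketch below illustrates that idea only; it does not describe how LEXUS or the archive store their data. The class names, field names, session identifiers and example texts are hypothetical placeholders; only the headwords mei and mā and their glosses come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Example:
    text: str
    session_id: str  # identifier of the annotated archive session (placeholder)


@dataclass
class Entry:
    headword: str
    senses: list
    examples: list = field(default_factory=list)


def index_by_session(entries):
    """Build the reverse link: session id -> headwords exemplified in it."""
    index = defaultdict(set)
    for entry in entries:
        for example in entry.examples:
            index[example.session_id].add(entry.headword)
    return {session: sorted(words) for session, words in index.items()}


# Hypothetical data: session ids and example texts are placeholders.
lexicon = [
    Entry("mei", ["breadfruit"], [Example("<example sentence>", "session-001")]),
    Entry("mā", ["fermented breadfruit"],
          [Example("<example sentence>", "session-001"),
           Example("<example sentence>", "session-002")]),
]

# From a session one can reach every entry whose examples were drawn from it.
print(index_by_session(lexicon))
```

Deriving the reverse index from the examples, rather than maintaining it by hand, keeps the two directions of the link consistent.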
A multimedia lexicon created in LEXUS, with its new relational linking
device in ViCoS, can consist of a dense network of lexical and cultural data
with various media files which are part of a language archive. This has impor-
tant implications not only for speech communities, but also for documentary
linguistics. Metadata and archive structures used for example in the DoBeS
archive can only establish limited connections between sessions, which can
leave many cultural connections between sessions of an archive unexplored.
As ethnographers such as Franchetto (2006: 183) point out, “ethnographical
documentation is a crucial component in any language documentation”, and
this “involves designing digital architectures with multiple and multidirec-
tional links between different sessions and qualitatively different kinds of
information such as lexica, analytical papers, photos, and so on” (Franchetto
2006: 206). LEXUS and ViCoS are tools which would allow the documenta-
tion team to come closer to this ideal. The new relational linking device in
ViCoS, in particular, opens up the possibility to create a dense network of
cultural connections between sessions as well as structured lexical data in a
thematically organised way and therefore achieves better and more user-friendly
access to a culture’s network of meanings for both the scientific and
the speech community.
6. Conclusion
Multimedia lexicon tools such as LEXUS are excellent resource and research
tools for the scientific as well as endangered speech communities interested
in the lexical, cultural and ethnographic documentation of endangered and
underdescribed languages. The LEXUS tool, with its new feature of knowl-
edge representation in ViCoS, can create a dense network of lexical as well as
cultural data in ways which are meaningful to different kinds of user groups,
allowing a multilayered organisation of lexical and cultural knowledge. For
members of the speech community the creation of knowledge spaces in Vi-
CoS can be a more natural entry point into a lexicon than a conventional
paper dictionary because data of the lexical database in LEXUS can be se-
lected according to the needs of the users, giving it the potential to become
an important tool in language maintenance and revitalization.
Apart from the possibility of individually organising the documented
knowledge, the contextualisation of word meaning via archive linking is a
major new approach in lexicography.
For documentary linguistics, multimedia lexicon creation, as envisaged
in our MEL-project with LEXUS, is in fact a new form of language archiving
Abbreviations
1, 2, 3   first, second, third person
ART       article
CAUS      causative
LD        locational-directional preposition
PASS      passive
PL        plural
POSS      possessive
RED       reduplication
SG        singular
TAM       tense, aspect, mood
References
Albright, Eric, and John Hatton. 2008. WeSay: A tool for engaging native
speakers in dictionary building. In Documenting and Revitalizing Aus-
tronesian Languages, eds. D. Victoria Rau and Margaret Florey, Language
Documentation & Conservation, Special Publication No. 1, 189–201. Hon-
olulu: University of Hawai’i. http://hdl.handle.net/10125/1368.
Atkins, Beryl T. Sue, and Michael Rundell. 2008. The Oxford Guide to Prac-
tical Lexicography. Oxford: Oxford University Press.
Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head?
In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss
à l’Occasion de Son 60ème Anniversaire, eds. Jean Pouillon and Pierre
Maranda, 1069–1091. The Hague: Mouton.
Cablitz, Gabriele. 2006. Marquesan - A Grammar of Space. Berlin, New
York: Mouton de Gruyter.
Cablitz, Gabriele. 2010. A field report on a language documentation project
on the Marquesas Islands in French Polynesia. In Endangered Austrone-
sian, Papuan and Australian Aboriginal Languages: Essays on Language
Documentation, Archiving and Revitalization, ed. Gunter Senft, 31–47.
Canberra: Pacific Linguistics.
Cablitz, Gabriele, Fasan Chong, and Edgar Tetahiotupa. 2009. The docu-
mentation of endangered linguistic, lexical and cultural knowledge of the
Marquesan and Tuamotuan languages of French Polynesia. In Proceed-
ings of the 11th Pacific Science Inter-Congress and 2nd Symposium on
French Research in the Pacific. Tahiti: Pacific Science Association. http:
//intellagence.eu.com/psi2009/output_directory/cd1/Data/articles/000323.pdf.
Cablitz, Gaby, Jacquelijn Ringersma, and Marc Kemps-Snijders. 2007. Vi-
sualizing endangered indigenous languages of French Polynesia with
LEXUS. In 11th International Conference Information Visualization (IV
’07), 409–414. IEEE Computer Society.
Canger, Una. 2002. An interactive dictionary and text corpus for sixteenth-
and seventeenth-century Nahuatl. In Making Dictionaries – Preserving
Indigenous Languages of the Americas, eds. William Frawley, Kenneth C.
Hill, and Pamela Munro, 195–218. Berkeley, CA: University of California
Press.
Conklin, Harold C. 1962. Lexicographic treatment of folk taxonomies. In
Problems in Lexicography, eds. Fred W. Householder and Sol Saporta,
41–59. Bloomington: Indiana University Research Center in Anthropol-
ogy, Folklore, and Linguistics.
Corris, Miriam, Christopher Manning, Susan Poetsch, and Jane Simpson.
2000. Dictionaries and endangered languages. http://nlp.stanford.edu/pubs/
eldic.ps.
Coward, David F., and Charles E. Grimes. 2000. Making dictionaries. A
guide to lexicography and the multi-dictionary formatter. http://www.sil.
org/computing/shoebox/MDF_2000.pdf.
Chapter 11
What does it take to make an ethnographic dictionary?
On the treatment of fish and tree names in dictionaries
of Oceanic languages∗
Andrew Pawley
1. Introduction
Some lexicographers hold that it is impossible to make a good first general
dictionary of any language. In his highly regarded textbook Dictionaries: The
Art and Craft of Lexicography, Sidney Landau writes that
A really new dictionary would be a dreadful piece of work, missing innu-
merable basic words and senses, replete with absurdities and unspeakable er-
rors, studded with biases and interlarded with irrelevant provincialisms. Noah
Webster’s American Dictionary of the English Language of 1828, though
far from being entirely new, was new enough to subscribe to many of these
defects... Fortunately, very few dictionaries are really new, and none of the
general, staff-written, commercial dictionaries published by major dictionary
houses are. (Landau 1984: 35–36)
∗
It is a pleasure to contribute to a volume in honour of Ulrike Mosel, whose contributions to
descriptive and documentary linguistics I greatly admire and with whom I have had many
stimulating discussions about dictionary-making. Thanks are due to Geoffrey Haig, Frank
Lichtenberk and Claudia Wegener for their valuable comments on a draft of this paper and
to Claudia for her eagle-eyed copy-editing.
1. Oceanic contains more than 400 languages of Melanesia together with the languages of
the Polynesian Triangle and most of the languages of Micronesia. Most Oceanic languages
have fewer than 10,000 speakers. Their speech communities were, traditionally, subsis-
tence farmers and, in many cases, also fishers.
2. The largest reliably recorded inventory of vernacular plant names from one traditional
community appears to be about 1,800–2,000, for Hanunóo, of Mindoro, Philippines, who
speak an Austronesian language (Conklin 1954). Puku’i and Elbert (1971: ix) mention that
Mary Neale and Edward Handy list over 2,300 plant names for Hawaiian but a number “are
dubious”. Henderson and Hancock (1988) list more than 800 plant names for Kwara’ae, of
Malaita. Fox’s (1978) dictionary of Arosi (of Makira), gives about 770 names for plants, of
which 194 denote varieties of cultivated plants. For Cèmuhî, a language of northern New
Caledonia (a region with a relatively small flora), Rivierre (1994) lists 557 taxa, of which
178 represent cultivated plants.
3. Speakers of Wayan (a dialect of Western Fijian spoken on Waya Island, Yasawa group, Fiji)
distinguish about 800 names for kinds of animals. These include about 450 fish taxa and
230 names for marine invertebrates. Land invertebrates are of little economic importance
on Waya but upwards of 70 taxa are named. As one moves eastwards across the central
Pacific the number of land bird, mammal and reptile species drops off sharply and Wayan
terminologies for indigenous birds (about 35 names) and reptiles (about 20) are small.
tion of types (ii)–(iv) above. (I am not so naïve as to think that any dictionary
can fully achieve these objectives; I am speaking of an ideal.)
Most bilingual dictionaries are primarily intended to be translation aids.
The best monolingual dictionaries, by contrast, are closer to the ethnographic
type.
Dictionaries of languages of traditional societies are usually bilingual,
with a main part containing headwords in the target language (L1) and glosses
in a major language (L2), in combination with a reverse finder list that allows
the user to look up words in L2 and find relevant entries in L1. The glosses
mainly consist of words or phrases intended to be approximate translation
equivalents. Analytic definitions are given only where no translation equiva-
lent is available.
While conceding the importance of providing translation equivalents
where possible, I believe that scholars compiling first bilingual dictionaries
of languages of traditional societies should aim at rich semantic descriptions,
providing analytic definitions wherever these will give a more precise and
usefully informative account of the meaning of the lexical unit than approx-
imate translation equivalents, and also including supplementary information
of types (ii)–(iv) where this serves the same purpose. See Cablitz (this volume,
Section 3.2) for discussion of the extent to which it is appropriate
to record cultural information in a dictionary.
By giving rich semantic descriptions, an ethnographic dictionary on the
one hand provides linguistic and cultural information likely to be valued by
members of the speech community and on the other hand stands as a ref-
erence work for scientific purposes. However, the creation of a good ethno-
graphic dictionary presents huge challenges. Some of these challenges will
be considered in the sections that follow.
impression that they “are missing innumerable basic words and senses”, if
“basic” refers to high frequency items. However, it is certainly the case that
many of the dictionaries are missing many words of middle to low frequency.
This becomes obvious if one looks closely at the treatment of particular se-
mantic fields. I turn now to a brief survey of the coverage of fish and tree
names in about 30 Oceanic dictionaries. I will discuss two methods for deter-
mining whether coverage in a particular dictionary is close to exhaustive, and
will touch briefly on a third.
5. Sources (dictionaries and other works) for contemporary languages that figure in the dis-
cussion are listed below. Works that are not general dictionaries are marked with a star
before the name of the author(s); these are mainly survey reports focusing on names of
marine fauna.
New Guinea and Bismarck Archipelago: Kiriwina (Trobriand Islands, PNG): Lawton
1998; Kuanua (New Britain): Lanyon-Orgill 1960; Titan (Admiralty group): *Akimichi
and Sakiyama 1991.
Western Solomons: Cheke Holo: White 1988; Marovo: Hviding 1990, *Hviding 2005;
Roviana: Waterhouse 1949; Takū (Polynesian Outlier, north of Bougainville): Moyle in
press; Teop: Shoffner 1976.
Eastern Solomons: Arosi: Fox 1978; Gela: Fox 1955; *Foale 1998; Lau: Fox 1974; Owa:
Mellow 2009; Toqabaqita: *Henderson and Hancock 1988; Lichtenberk 2008.
Vanuatu and Tikopia: Paamese: Crowley 1992; Kwamera: Lindstrom 1986; Tikopia (Te
Motu Province, Solomon Islands): Firth 1985; Lenakel: Lynch 1977.
New Caledonia: Cèmuhî: Rivierre 1994; Nyelâyu: Ozanne-Rivierre 1998; Paicî: Rivierre
1983; Xârâcùù: Moyse-Faurie and Néchéro-Jorédie 1989.
Fiji: Bauan (Standard Fijian): Capell 1941; Wayan (dialect of Western Fijian): Pawley
and Sayaba 2003; Rotuman: Churchward 1940, Inia et al. 1998.
Micronesia: Carolinian (Saipan, Marianas): Jackson and Marck 1991; Kapingamarangi
(Polynesian Outlier, central Carolines): Lieber and Dikepa 1974; Marshallese: Abo et al.
1976; Palauan: Helfman and Randall 1973, Johannes 1981; Ponapean: Rehg and Sohl
1979; Puluwat: Elbert 1972; Satawalese: *Akimichi 1980; Kiribati: Thaman and Tebano,
n.d.
Polynesian Triangle: Marquesan: *Lavondès 1977; Niuatoputapu: *Dye 1983; Niuean:
Sperlich 1997; Rarotongan: Buse and Taringa 1996; Tongan: Churchward 1959; Uvean:
*Rensch 1983.
It can be seen that the totals vary greatly. Comparison with Table 1 suggests
that most of these inventories are missing between 100 and 250 of the taxa
distinguished by the speech community. This is demonstrably the case for
Gela: Fox’s (1955) very substantial dictionary of Gela contains only 136 fish
names, more than 200 fewer than the 368 reported in the survey of Gela
fishing done by a marine biologist (Foale 1998). Similarly, the large Bauan
(Standard Fijian) dictionary contains about 200 fish names, less than half the
number recorded for Wayan, another Fijian language.
For fishing communities in Melanesia and western Micronesia, where the
fish fauna are relatively rich, lists numbering below 300 are probably far from
complete. In the case of communities of the Polynesian Triangle and eastern
Micronesia, we can expect somewhat lower totals.
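The comparison of a dictionary total with an independent survey total is simple arithmetic, and a minimal sketch in Python may make it explicit. The function names and the use of 300 as a default cut-off are illustrative assumptions based on the figures quoted above (Fox 1955 and Foale 1998 for Gela); they do not reproduce any published procedure.

```python
# Illustrative sketch only: the function names and the default threshold
# are assumptions made for this example, based on figures quoted in the text.

def coverage_shortfall(dictionary_count, benchmark_count):
    """Return (names missing, share of the benchmark covered).

    benchmark_count is the total from an independent survey of the same
    speech community, used here as the best available yardstick.
    """
    if benchmark_count <= 0:
        raise ValueError("benchmark_count must be positive")
    missing = max(benchmark_count - dictionary_count, 0)
    return missing, dictionary_count / benchmark_count


def probably_incomplete(fish_name_count, threshold=300):
    """Flag a list that falls below the rough threshold suggested for
    fishing communities in Melanesia and western Micronesia."""
    return fish_name_count < threshold


# Gela: 136 fish names in the dictionary (Fox 1955) against 368 in the
# marine biologist's survey (Foale 1998).
missing, coverage = coverage_shortfall(136, 368)
print(f"Gela: {missing} names missing; {coverage:.0%} of the survey total covered")
print("Flagged as probably incomplete:", probably_incomplete(136))
```

For Gela the shortfall comes to 232 names, which matches the "more than 200 fewer" noted above. For communities of the Polynesian Triangle and eastern Micronesia a lower threshold would have to be chosen.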
A second method for detecting gaps is to compare the breakdown of
names for taxa into uninomials and binomials. Uninomials typically apply
to taxa belonging to the level of folk generic, binomials to folk specifics (see
Section 6). In the best documented Oceanic fish taxonomies uninomials usu-
ally amount to between 70 and 80 percent of total taxa, and binomials to be-
tween 20 and 30 percent. Percentages that deviate markedly from this range
stand out as suspiciously anomalous. Table 3 gives the percentages for 14
Oceanic languages recorded in dictionaries (unmarked) and surveys of fish
names (marked *).
At one extreme, only 5 out of 220 recorded Tikopia fish names (Firth 1985)
are binomials. Similarly, the survey of Titan (Akimichi and Sakiyama 1991)
yielded only eight binomials among 287 listed fish names, whereas the survey
of Satawalese (Akimichi 1980) yielded 122 binomials out of 400 names. We
conclude that, in these cases, the Tikopia and Titan lists are probably missing
between 60 and 120 binomials. At the other extreme, in the case of Kapinga-
marangi, there are 43% binomials (114 out of 262) and we can conclude that
the list of binomials probably includes some ad hoc descriptive forms.
Most of the larger dictionaries do poorly in their coverage of
binomials. Their coverage of uninomials is much better, but even there it is
clear that, in many cases, coverage is incomplete.
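The reasoning behind the second method can be sketched as a short calculation. The sketch below is a back-of-envelope reconstruction, not necessarily the way the estimates above were reached: it assumes that the recorded uninomials are complete (a simplification, since uninomial coverage is itself often incomplete) and that binomials should make up 20 to 30 percent of the full inventory. The function names and the 5-point tolerance used to decide what counts as a "marked" deviation are hypothetical.

```python
# Illustrative sketch only. Assumptions: uninomials are fully recorded,
# binomials should form 20-30% of all taxa, and a deviation of more than
# 5 percentage points outside that band counts as "marked".

def binomial_share(binomials, total):
    """Share of binomials among all recorded names."""
    return binomials / total


def is_anomalous(share, low=0.20, high=0.30, tolerance=0.05):
    """True if the share lies markedly outside the expected band."""
    return share < low - tolerance or share > high + tolerance


def estimated_missing_binomials(binomials, total, low=0.20, high=0.30):
    """Estimate how many binomials are missing, as a (low, high) range.

    If binomials make up a share s of the full inventory, the recorded
    uninomials make up 1 - s, so the expected number of binomials is
    uninomials * s / (1 - s).
    """
    uninomials = total - binomials
    expected_low = uninomials * low / (1 - low)
    expected_high = uninomials * high / (1 - high)
    return (max(expected_low - binomials, 0.0),
            max(expected_high - binomials, 0.0))


# Figures quoted in the text: (binomials, total names)
lists = {
    "Tikopia": (5, 220),
    "Titan": (8, 287),
    "Satawalese": (122, 400),
    "Kapingamarangi": (114, 262),
}

for language, (binomials, total) in lists.items():
    share = binomial_share(binomials, total)
    flag = "anomalous" if is_anomalous(share) else "within range"
    print(f"{language}: {share:.1%} binomials ({flag})")
```

Under these assumptions the Titan list comes out roughly 62 to 112 binomials short and the Tikopia list roughly 49 to 87, which is broadly in line with the range given above; the Kapingamarangi share is flagged as too high rather than too low.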
The third method is more fine-grained and requires painstaking family-
by-family comparison of fish names. The first steps are to note how many
names representing each of the major families of fish are present in the best
documented languages, and then to obtain an average and a range of variation
for this sample. The next step is to see how dictionary tallies compare with
these averages and ranges, looking for cases where there is a striking shortfall
in the number of taxa recorded for particular families. This method makes it
possible to locate quite precisely some of the gaps in coverage but for reasons
of space I will not tabulate results here (see Pawley 2011a for details).
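The steps of this third method can be sketched in code. The sketch below is a hypothetical illustration only: the language labels and all counts are invented placeholders, not data from any dictionary or survey, and treating a tally below the lowest value in the well-documented sample as a "striking shortfall" is an assumption made for the example.

```python
from statistics import mean

# Illustrative sketch only. All counts are invented placeholders.


def family_baseline(sample):
    """Compute mean, minimum and maximum name counts per fish family
    across a sample of well-documented languages."""
    families = set().union(*(tallies.keys() for tallies in sample.values()))
    baseline = {}
    for family in sorted(families):
        counts = [tallies.get(family, 0) for tallies in sample.values()]
        baseline[family] = {
            "mean": mean(counts),
            "min": min(counts),
            "max": max(counts),
        }
    return baseline


def striking_shortfalls(dictionary_tallies, baseline):
    """Return families whose tally falls below the sample range,
    together with the gap between the sample mean and the tally."""
    gaps = {}
    for family, stats in baseline.items():
        recorded = dictionary_tallies.get(family, 0)
        if recorded < stats["min"]:
            gaps[family] = stats["mean"] - recorded
    return gaps


# Hypothetical well-documented sample (placeholder figures).
sample = {
    "language_A": {"Serranidae": 20, "Scaridae": 15, "Acanthuridae": 12},
    "language_B": {"Serranidae": 24, "Scaridae": 11, "Acanthuridae": 16},
}

# Hypothetical dictionary under evaluation (placeholder figures).
dictionary = {"Serranidae": 21, "Scaridae": 4, "Acanthuridae": 11}

for family, gap in striking_shortfalls(dictionary, family_baseline(sample)).items():
    print(f"{family}: about {gap} names below the sample mean")
```

A comparison of this kind points to the families where names are most likely missing, so that follow-up elicitation can be targeted rather than general.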
It has been suggested to me that two Oceanic fishing communities, ex-
ploiting a similar marine environment, may differ greatly in the extent to
which they have elaborated their lexicon of uninomial and/or binomial fish
names. That is to say, large differences in the numbers of fish taxa between
two communities with similar means of subsistence may not always be due
to oversights on the part of the lexicographer, but may instead reflect genuine
differences in the vernacular lexicon.
This possibility cannot be ruled out – our data include only a few well-
controlled case studies (such as the Gela one) where the dictionary coverage
can be compared with that of a thorough independent survey. However, we
are looking for general trends and the general trends are clear. It is strik-
ing that in every case where a careful survey has been done of an Oceanic
fishing community’s lexicon of fish names, the number of fish names distinguished
has been above 200 and generally in the 300–450 range. Similar findings
have been reported for fishing communities in Indonesia (Quick 2010;
Taylor 1990). The fact that most dictionaries of Oceanic languages spoken by
fishing communities report much lower figures cannot plausibly be attributed
to random variation in the vernacular lexicons.
(Santa Isabel) and Roviana (180). Kuanua, which scores almost as high as
Arosi, is spoken much further west, in New Britain.
The tree flora of southern Melanesia and the central Pacific is a good
deal less diverse than that of western Melanesia. The total of 222 tree taxa
for Wayan Fijian, based on systematic collecting by a botanist (Gardner and
Pawley 2006), is probably close to exhaustive for this language and provides
a rough yardstick. It suggests that the dictionaries of Paamese, Kwamera,
Lenakel (all Vanuatu) and Rotuman, which record between 58 and 84 tree
names, are probably missing upwards of 100 taxa.
4.3. Conclusion
Most Oceanic dictionaries show major shortfalls in their coverage of fish and
tree names. The broad conclusion we can draw is that getting all the names
for indigenous animal and plant taxa distinguished by a typical Oceanic com-
munity is a formidable task and that without the help of specialists, lexicog-
raphers are likely to miss a high proportion of them.
taxon. At other times two or more folk taxa will correspond to a single
species, with different taxa corresponding, e.g. to different stages of matu-
ration or to adult males vs females and juvenile males. This is commonly the
case, for instance, with certain fish species, where up to five different taxa
corresponding to different maturational stages may be distinguished.
Making scientific IDs is the primary but not the only reason dictionary-
makers need the help of specialists in natural history. A second reason is that
specialists are better qualified to describe the key characteristics of species
and their relationships to other species and to investigate their practical uses.
A third, already noted above, is that without specialists it is likely that names
for many of the less important taxa will be missed. Conversely, a survey con-
ducted solely by biologists with almost no knowledge of the target language
is likely to yield an inaccurate account of vernacular terms and taxonomies.
The ideal is a close collaboration of linguists and biologists.
I agree with those who argue that we lack a principled basis for drawing
a line between lexical and encyclopaedic knowledge.6 The question ‘What is
the meaning of sheep?’ is probably wrong-headed. When formulating a dic-
tionary explication of sheep, it makes more sense to ask ‘Of the many charac-
teristics of sheep known to English speakers, which are the most salient?’ and
‘For the various users of the dictionary, what is likely to be the most useful
information to include?’
A few plant name entries from the Wayan dictionary follow. They are instructive
in two respects: (1) they provide fairly rich descriptions, mainly due
to the work of a professional botanist; (2) they show the difficulty of drawing
the line between information that belongs in a dictionary and information
better left to an encyclopaedia or to technical botanical works.
The descriptions provided for these taxa by the botanist working on Wayan
were in most cases a good deal more detailed than are shown in the dictionary
entries. As the lexicographer in this case, I adopted the rule of thumb of
omitting details of plant morphology likely to be of interest only to a botanist
while retaining information relating to general appearance, habitat and uses.
But it can be seen that the entries are not entirely consistent in this respect.
In the entry for DALI, for instance, more details of leaf forms and flowers
(“Stems ribbed, leaves 3-foliolate, narrowly oval, pointed, pea-flowers red-
brown outside, yellow and crimson-striped within, pods straight, about 6cm
long”) are retained than in the entry for DAMANU.
order taxa which share the characteristic morphology of the type, and (iv) is
named by a uninomial. Examples of English life-form taxa are tree, vine, fish,
bird and insect.
Folk generic. A folk generic (or folk genus) is a ‘natural’ category in two
senses, one perceptual, the other linguistic. First, the members of this cat-
egory are usually marked off from non-members by multiple characters of
morphology and behaviour or ecological adaptation that will be evident to
any close observer. English examples are mullet, trout, oak, pine, spider, ant,
centipede. Second, the folk generic (rather than the life form) is the usual way
of referring to particular plants or animals if their identity is known. Depend-
ing on various factors, a folk genus may correspond to a single species in the
biologist’s taxonomy, to a number of species or a genus, or to a number of
genera or families. In addition, the category is named by a uninomial rather than a
binomial.
Folk specific. Some folk generics divide into a number of folk specifics (or
folk species), usually just a few taxa that contrast in a limited number of features
with other members of the generic. Except for domesticated animals and
plants, folk specifics are usually the lowest-level taxa distinguished. Berlin
(1992) says that folk specific names are usually compounds, consisting of the
generic name plus a modifier, e.g. mako shark vs hammerhead shark, trap-
door spider vs huntsman spider. However, Bulmer (1970) finds that a fair
number of folk specific names for animals, among the Kalam, are primary
lexemes.
Berlin names three other ranks that are sometimes distinguished in folk tax-
onomies, but these need not concern us here.
Determining the scope of life form taxa is often tricky. While informants
agree on the focal membership, they may disagree about peripheral cases.
For example, among Oceanic languages there is considerable variation in the
boundaries of the general term for fish. Generally the term can be applied to
various aquatic creatures other than fish: typically it is extended to whales
and dolphins, in some languages also to turtles and crocodiles, in some also to
octopus and squid, and in some cases to most or all water-dwelling animals
(Pawley 2011b). As Table 5 indicates, few Oceanic dictionaries
make much of an effort to specify the range of reference of the generic for
fish and fish-like animals. Definitions can be roughly scaled according to their
level of informativeness.
Table 5. Definitions of the generic for fish and fish-like animals in some Oceanic
dictionaries
Group A. Definitions that give ‘fish’ or ‘fish, sea creature’ without further
definition
Arosi: i’a, a fish.
Cheke Holo: sasa, fish (generic).
Marshallese: ek, fish.
Mota: iga, a fish.
Owa: aiga, fish, sea creature.
Rarotongan: ika, fish.
Roviana: igana, the generic name for fish.
Group B. Definitions that give a partial but rather imprecise listing of members
Puluwat: yiik, fish (including porpoises and whales but not squid).
Samoan: i’a, the general name for fishes, except the bonito (Thymnus) and
shellfish (Mollusca and Crustacea). On Tutuila the bonito is called
i’a. (Pratt 1911)
Takū: ika, generic term for fish, including marine animals, turtles and two
species of clam.
Tikopia: ika, generic category with primary reference to fish, but including
allied creatures, e.g. turtle, cetaceans. [Examples also refer to crabs.]
Tongan: ika, fish. Also turtles (fonu) and whales (tofua’a) but not eels,
cuttle-fish, or jelly-fish.
Group C. Definitions that try to be comprehensive
Gela: iga, a creature of the sea, fish, mollusc, crayfish, whale, squid,
sea anemone, etc.
Paamese: mesau, 1. fish. 2. any sea dweller (including also turtles,
dolphins, shellfish, etc.).
Toqabaqita: iqa, 1. fish (generic term). 2. Also denotes a superordinate
category that includes fish, whales, dolphins, turtles, dugongs.
Wayan: ika, 1. Typical fish, true fish, syn. ika dū. This category includes
all gill-breathing fish with fins, including sharks, rays and eels.
2. Fish and certain fish-like creatures. A generic which includes
all true fish (see sense 1) and dolphins. Most informants also
regard turtles (ikabula) as ika. Some also include octopus
(sulua) and squid (suluanū). Universally excluded are
crustaceans (crabs, lobsters, etc.), molluscs with shells, sea
cucumbers, sea urchins and jellyfish. (There follows a full list of
names of ika.)
Group A definitions show no awareness of the fact that ‘fish’ is a highly prob-
lematic definition. Group B definitions make an effort to give an exhaustive
Toqabaqita: ai, tree (does not include palm trees, cycads, tree-ferns, bamboos,
banana trees).
Wayan: kai, generic for trees and shrubs, and occasionally low bushy plants.
Includes palms and pandans. Used in certain compounds as a
generic for all plants.
Group A definitions show little or no awareness that the English gloss ‘tree’ is
not a sufficiently precise definition of the vernacular generic. Group B defini-
tions recognize that the range of reference of the generic is not satisfactorily
captured by such general terms as ‘tree’ and ‘plant’. However, their attempts
to define by specifying diagnostic characters and/or by listing members are
of varying adequacy.
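One way to make the range of reference of a generic explicit is to record, for each entry, which kinds are included by all informants, which are disputed, and which are excluded. The following is a minimal sketch of such a record in Python. The class and field names are hypothetical and do not correspond to any existing dictionary tool or to the structure of the dictionaries discussed; the sample data are taken from the Wayan and Arosi definitions of the generic for fish quoted in Table 5.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: a hypothetical record layout, not the format
# of any existing lexicographic software.


@dataclass
class GenericRange:
    """Range of reference of a folk generic term."""
    language: str
    headword: str
    gloss: str
    included: List[str] = field(default_factory=list)  # accepted by all informants
    disputed: List[str] = field(default_factory=list)  # accepted by most or some
    excluded: List[str] = field(default_factory=list)  # rejected by all informants

    def status(self, kind):
        """Report how the entry treats one kind of creature or plant."""
        if kind in self.included:
            return "included"
        if kind in self.disputed:
            return "disputed"
        if kind in self.excluded:
            return "excluded"
        return "unspecified"

    def is_bare_gloss(self):
        """True when no members are listed at all (cf. Group A)."""
        return not (self.included or self.disputed or self.excluded)


# Wayan ika, sense 2, as quoted in Table 5.
wayan_ika = GenericRange(
    language="Wayan",
    headword="ika",
    gloss="fish and certain fish-like creatures",
    included=["true fish", "dolphins"],
    disputed=["turtles", "octopus", "squid"],
    excluded=["crustaceans", "molluscs with shells", "sea cucumbers",
              "sea urchins", "jellyfish"],
)

# Arosi i'a, glossed only as 'a fish' in Table 5.
arosi_ia = GenericRange(language="Arosi", headword="i'a", gloss="a fish")

print(wayan_ika.status("turtles"))
print(arosi_ia.is_bare_gloss())
```

Keeping disputed members separate from included ones preserves the observation that informants agree on focal membership but may disagree about peripheral cases.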
8. Concluding remarks
Finally, let us return to the question: Is it possible to make a first general
dictionary of a language that is not ‘dreadful’? Our examination of some 30
Oceanic dictionaries has been largely confined to aspects of their treatment
of terms for fish and trees – too narrow a basis for a general assessment.
It must be conceded that only a few of the dictionaries score well in their
coverage of the fish and tree lexicons. Most do poorly, both in terms of ex-
haustiveness of the wordlists and quality of the expository information. This
does not necessarily make them dreadful works overall. It is my impression
that many of the major Oceanic dictionaries do passably well in their treat-
ment of a number of major lexical domains. However, my purpose in this
paper has been to use the example of fish and tree lexicons to highlight some
of the challenges of method and scale that face anyone intending to do a first
general dictionary of a language.
Trial and error in my own attempts at dictionary-making has taught me
that such a project requires considerable expertise, or the help of experts, in
various specialised fields of knowledge in addition to botany and zoology.
Experience has also taught me that the more fluent the lexicographer is in
the target language, and the more fluent native speaker assistants are in the
defining language, the easier it is to achieve accuracy, especially in respect of
definitions and sense discriminations. Except that it is never easy. The hard
truth is that making an accurate and close-to-comprehensive general dictio-
nary of the language of a traditional society needs many thousands of hours
of labour, i.e. the equivalent of several full-time years on the job.7 In most
documentation contexts, which operate within a 3–4 year framework, with
many competing objectives, it would be unrealistic to try to compile a large
general dictionary. These practical considerations, among others, have led Ul-
rike Mosel (2011) to advocate doing ‘thematic dictionaries’, mini-dictionaries
each of which treats a single culturally important domain, as an alternative to
compiling general dictionaries. This has the virtue of allowing the researcher
to produce in quite a short time one or more small reference works of interest
both to members of the speech community and to academics.
References
Abo, Takaji, Byron Bender, Alfred Capelle, and Tony DeBrum. 1976. Marshallese–English
Dictionary. Honolulu: University of Hawai’i Press.
Akimichi, Tomoya. 1980. Bad fish or good fish: The ethnoichthyology of the
Satawalese (Central Carolines, Micronesia). Museum of Ethnology, Osaka.
Bulletin of the National Museum of Ethnology 6(1):66–133.
Akimichi, Tomoya, and Osamu Sakiyama. 1991. Manus fish names. Bulletin
of the National Museum of Ethnology, Tokyo 16(1):1–29.
Berlin, Brent. 1992. Ethnobiological Classification: Principles of Catego-
rization of Plants and Animals in Traditional Societies. Princeton, NJ:
Princeton University Press.
Brown, Cecil H. 1985. Mode of subsistence and folk biological taxonomy.
Current Anthropology 26:43–62.
Bulmer, Ralph N. H. 1970. Which came first, the chicken or the egg-head?
In Échanges et Communications: Mélanges Offerts à Claude Lévi-Strauss
à l’Occasion de Son 60ième Anniversaire, eds. Jean Pouillon and Pierre
Maranda, 1069–1091. The Hague: Mouton.
Bulmer, Ralph N. H. 1992. Field methods in ethno-zoology with special
reference to the New Guinea Highlands. In Studying and Describing Un-
written Languages, Questionnaire 12, Ethnozoology, eds. Luc Bouquiaux
Chapter 12
Language is power: The impact of fieldwork on
community politics∗
Even Hovdhaugen and Åshild Næss
1. Introduction
The ethical and political aspects of language documentation work have been
brought increasingly to the forefront in recent literature. This is a vast and
complex field, where no universally valid solutions can be offered, since
both the nature of the issues and the appropriate way of resolving them will
vary widely between communities and situations. In the words of Grinevald
(2006: 351), “the different agents involved create a maze of commitments of
often conflicting nature, and ... one of the major challenges of fieldwork is to
juggle all these constraints, requirements, and commitments”.
This paper presents a case study from our own fieldwork experience as a
basis for addressing some of the complex issues raised by the presence of a
fieldworker in a small local community: what happens when there is disagree-
ment within a community as to whether, how, and where the documentation
work should be carried out, and how the burdens and benefits associated with
such work are to be distributed? We will discuss how such issues may affect
not only the relationship between researcher and language community, but
also politics and power relations within the community itself.
While it is generally taken for granted that a visiting researcher should
as far as possible avoid getting involved in issues of local politics, this is in
practice impossible, because the presence of a visiting researcher in a small
∗
The authors would like to thank Benedicte H. Frostad, Anders Vaa, and the editors of this
volume for helpful comments on earlier versions, while stressing that the views expressed
in this paper, as well as any factual errors, are entirely our own responsibility. We would
also like to emphasize that during our many years of work in Temotu Province, we have
been met in the overwhelming majority of cases with helpfulness and generosity; our dis-
cussion of the conflict described below is meant as an example of issues which it may
be important for linguistic fieldworkers to take into consideration, and not in any way as
criticism of the people of the area, for whom we hold the deepest respect and gratitude.
community is local politics. Indeed, Rice (2006), while stating that a field-
worker should avoid getting caught up in internal political issues, goes on to
list a number of ways in which the presence of a researcher affects a commu-
nity, including the distribution of money as payment for consultants’ work,
choosing who gets to do such work, and deciding where the linguist stays and
works; these are all, ultimately, political issues in the sense that they concern
the distribution of potentially scarce material resources, and of personal and
communal prestige associated with the presence of and interaction with a re-
searcher from outside. While not being able to offer any general solutions,
we will point out some potential loci of conflict which are probably to some
extent present in most fieldwork settings.
Language documentation work is done in a variety of settings. Our ex-
periences come from “classical” fieldwork, where researchers from outside
spend time in a location where a language is spoken and work within the
language community for purposes of language documentation and descrip-
tion. However, in many cases work is done in towns or cities with isolated
members or small groups of a scattered or decimated language community.
In such cases, the issues to be raised here – of who speaks for the community,
of whose permission is required and of who decides how the benefits
and privileges attached to such work are to be distributed – may be even more
acutely relevant and even more difficult to resolve. While we will not specif-
ically discuss such situations here, we note that they are likely to become
increasingly common as a result of language shift and the loss of traditional
ways of life, and that as a result, the kinds of issues discussed in this paper
are likely to require increased attention in language documentation work.
2. The issues
There are three main issues which arise from the case study to be presented
below. All have to do, to some extent, with local politics and with the distribution
of power and privilege within a community, and all are interrelated.
Specifically, they concern the question of the relevance of existing legal and
administrative structures for the particular issues associated with language
documentation projects, and of how to handle cases where these structures
turn out to be inadequate or disputable.
The question of how to obtain consent for a documentation project, and from
whom, arises not only in the tension between central and local authorities,
but also within a language community. There is considerable discussion of
this in the literature, for example in Thieberger and Musgrave (2007), who
address the issue of potential conflict between individuals, who may give
permission for material collected from them to be used in certain ways, and
the community as a whole, which may wish to place restrictions on the
use of such material. In general, guidelines for ethics in documentation work
stress the importance of obtaining explicit consent both from the individuals
participating in the project, and from the community whose language is to be
1. http://www.eva.mpg.de/lingua/resources/ethics.php
2. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf
3. http://www.mpi.nl/DOBES/ethical_legal_aspects/DOBES-coc-v2.pdf
tics, which, as Bowern (2008: 161–162) points out, has often been carried
out as a means to other ends, such as anthropology or missionary efforts; the
idea that someone might be paid to study a remote minority language with no
ulterior motives may be difficult to accept.
3. The setting
tity Matters project to the local chiefs and communities in different islands
in the area, and to make arrangements concerning working conditions for re-
searchers in the new project: where they would stay, who would act as main
consultants, how much they would be paid per hour of work, how much the
community would receive in return for hosting the researchers, and not least,
who would be responsible for receiving and allocating this money, as there
were several candidates for this role – paramount chiefs, village councils,
pastors, churches, etc. The fact that Hovdhaugen was able to communicate in
the Vaeakau-Taumako language made these meetings a lot easier than would
probably otherwise have been the case.
We were well aware of the potential impact of our presence in these
small village communities, and made every effort to negotiate clear agree-
ments concerning the practical and financial arrangements for our fieldwork.
These arrangements usually involved both a contribution to village funds and
a modest hourly compensation for those speakers taking time out from their
daily activities to work with the linguists. While these terms were, in prin-
ciple, generally accepted, the more detailed arrangements concerning which
villages would receive the benefits of having a visiting researcher stay, and
which individuals would get the opportunity to earn some much-needed cash
income through consultant work, proved more difficult to resolve to
everyone’s satisfaction.
Mosel (2006: 71–72) addresses the near-impossibility of a fieldworker se-
lecting consultants directly; as a guest in the community, she has to work with
those individuals whom the community considers suitable and who can spare
the time from their daily duties. Our principle has been, once a village has
been selected for a research stay, to ask local authorities – chiefs and elders
– for help in selecting suitable consultants, according to some basic criteria
specified by the researcher. This ensures that decisions are seen to be made
by those whose decision-making authority is unquestionable, and makes it
possible to take into account local notions of fairness and entitlement which
the researcher, at least initially, will know little about. Of course individuals
may still feel slighted and conflict may ensue, but such conflict will to a lesser
extent be perceived as a direct result of the researcher’s personal conduct.
More complicated, at least in our case, is the question of where – in which
village and under whose direct responsibility – a researcher is to be located
during her stay. The ideal solution is of course for the researcher to divide
her time about equally between the different communities, not only for po-
4. The conflict
The project plan, including the budget, had been approved by the Ministry of
Education in Honiara, the national capital; by the Premier of Temotu Province;
and by the council of chiefs in the villages where members of the research
team stayed. We felt fairly confident that in obtaining permission from these
bodies, we had followed correct procedures and consulted all relevant author-
ities. However, in 2005, a group of chiefs on one of the islands questioned
our right to collect traditional stories, on the basis that the project had not
been approved at the level of the island. This was initially communicated by
5. The aftermath
port during later research visits; on one occasion a group of village chiefs
provided a written statement of support for our project, asserting explicitly
that authority in this matter rested with them and so reaffirming the illegiti-
macy of the attempt at redefining local political structure.
Nevertheless, the very fact that such explicit statements of authority were
deemed to be necessary may be indicative of subtle shifts in power relations
or status within the local communities; such changes are of course difficult
for us to discover. Furthermore, the conflict may to some extent have changed
patterns of interaction between different groups in the area.
Traditionally, the two language communities in the Reef Islands – the
Polynesian-speaking group in the Outer Reefs and the Äiwoo speakers in the
Main Reefs – have had little day-to-day contact. There is some intermarriage
between the two groups, and people from the Polynesian villages travel to the
trade store in the Main Reefs for supplies, but people identify strongly with
their village and language community, and there is little contact or collabora-
tion across the linguistic border.
As our project spanned both communities, however, both had a vested
interest in the situation, and the conflict brought together supporters of the
project from both sides. On the evening before the meeting, several of
Hovdhaugen’s friends from both the Main and Outer Reefs came to his house to
discuss strategies for the meeting. We are not aware that meetings of political
leaders from the two language communities have been a practice in the past,
and in a longer-term context it may well be that an increase in such contact
will turn out to be the main political impact of our project in the islands.
For small island communities having to deal increasingly with the effects of
modernization and globalization, increased collaboration and the ability to
put traditional borders aside for the sake of promoting common interests are
clearly assets, and our presence may, in a small way, have contributed to
this.
As a final point, it should be noted that conflicts of authority of the kind
we have described may have their source in preexisting conflicts between in-
dividuals or groups that the fieldworker cannot be expected to be aware of –
they may go back years or even decades, and may be dragged to the surface
through the new situation created by the arrival of the researcher. In our case,
it was clear that the protests were motivated as much by individuals’ sense
of entitlement and injustice as by legitimate formal claims. One key actor in
particular was known to have been very hostile to previous linguistic efforts
in the area, including ones largely carried out by locals; acceding to this
person’s wish to play a central role in our work might have prevented the
particular situation which arose, but would have been practically and
scientifically infeasible, and would have created a potential source of
resentment among other prominent personalities in the area.
6. Conclusions
A familiarity with the structures of power and authority in the region where
one is working is essential for any documentation enterprise. Such familiar-
ity is only built up by experience, which means that the chances of making
mistakes in the early phases of a project are great, as is well documented in
the literature on language documentation.
However, the existing power structures may not always be adequate for
the handling of the particular issues raised by a language documentation
project, and this may lead to conflict not only between project participants
and community members, but also between different parts of a language
community. If there is no body of authority associated with the language
community as a whole (as distinct from, e.g., an administrative district which may
subsume several language communities), the question of whose permission
is required may simply not have an unequivocal answer. This raises a number
of questions:
1. Under such circumstances, how far can a researcher be expected to go in
obtaining permission to carry out documentation work? Is it necessary –
and realistic – to acquire permission from all parties who may consider
themselves entitled to an opinion, even if they are not directly involved in
the project itself?
2. If this is deemed to be the case, how does one identify and approach all
interested parties? In our case, the objections came from a self-styled body
of authority which was not recognized locally as legitimate. In other words,
there simply was no way to approach this body beforehand, as technically it
did not exist until it was set up as a means of protesting against the project.
3. When there is no consensus within a community as to who has the authority
to grant permission, how does the researcher establish whose claims are
legitimate? It is tempting to assume that those who desire the project to
go ahead are those whose opinions should be listened to, but can their
dismissal of the objections be accepted without question?
References
Bowern, Claire. 2008. Linguistic Fieldwork: A Practical Guide. Basingstoke:
Palgrave Macmillan.
Dobrin, Lise. 2005. When our values conflict with theirs: Linguistics and
community empowerment in Melanesia. In Language Documentation and
Description, Volume 3, ed. Peter K. Austin, 45–52. London: School of Ori-
ental and African Studies.
Dobrin, Lise. 2008. From linguistic elicitation to eliciting the lin-
guist: Lessons in community empowerment from Melanesia. Language
84(2):300–324.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork
and analysis. In Essentials of Language Documentation, eds. Jost Gippert,
Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York:
Mouton de Gruyter.
Grinevald, Colette. 2006. Worrying about ethics and wondering about “in-
formed consent”: Fieldwork from an Americanist perspective. In Lesser-
Known Languages of South Asia: Status and Policies, Case Studies and
Applications of Information Technology, eds. Anju Saxena and Lars Borin,
339–370. Berlin, New York: Mouton de Gruyter.
Hovdhaugen, Even. 2006. A Short Dictionary of the Vaeakau-Taumako Lan-
guage. Oslo: The Kon-Tiki Museum.
Hovdhaugen, Even, Ingjerd Hoëm, and Åshild Næss. 2002. Pileni Texts with
a Pileni-English Vocabulary and an English-Pileni Finderlist. Oslo: The
Kon-Tiki Museum.
Hovdhaugen, Even, and Åshild Næss. 2006. Stories from Vaeakau and Tau-
mako / A lalakhai ma talanga o Vaeakau ma Taumako. Oslo: The Kon-Tiki
Museum.
Hovdhaugen, Even, and Christian Tekilamata. 2006. Christmas Carols from
Vaeakau. University of Oslo: Department of Linguistics and Scandinavian
Studies.
McLaughlin, Fiona, and Thierno Seydou Sall. 2001. The give and take of
fieldwork: Noun classes and other concerns in Fatick, Senegal. In Linguis-
tic Fieldwork, eds. Paul Newman and Martha Ratliff, 189–210. Cambridge:
Cambridge University Press.
Mosel, Ulrike. 2006. Fieldwork and community language work. In Essentials
of Language Documentation, eds. Jost Gippert, Nikolaus P. Himmelmann,
and Ulrike Mosel, 67–85. Berlin, New York: Mouton de Gruyter.
Rice, Keren. 2006. Ethical issues in linguistic fieldwork: An overview. Jour-
nal of Academic Ethics 4:123–155.
Thieberger, Nick, and Simon Musgrave. 2007. Documentary linguistics and
ethical issues. In Language Documentation and Description, Volume 4, ed.
Peter K. Austin, 26–37. London: School of Oriental and African Studies.
Chapter 13
Sustaining Vurës: Making products of language
documentation accessible to multiple audiences
Catriona Hyslop Malau
1. Introduction
language maintenance and linguistic diversity. Thus while Seifart (this vol-
ume) considers the different motivations for documenting endangered lan-
guages, I consider a related issue, that of attempting to target a wide range
of users of language documentation. Cablitz (this volume) has also consid-
ered the issue of making language documentation work accessible to a wide
audience, with her work on a multimedia encyclopaedic lexicon that is user-
friendly for the language community and others. We took a similar approach
and this paper presents, as a case study, the films produced by our project. Le
Kal Vurës ‘Sustaining Vurës’ is a two-DVD set of documentaries presenting
aspects of Vurës life and language. Through discussion of the diverse range
of audiences who have accessed the films, the paper shows how these and
other films of this nature can be important tools for supporting language and
cultural maintenance, both within and beyond the language community.
Aside from the fact that the Vurës language is the medium used to present
the communication in the films, these films are not about the language and do
not present any overt discussion of their purpose as language awareness and
maintenance tools. However, there are three ways in which this purpose is
made evident, two of which are minor yet nevertheless significant points, the
other being a more explicit technique employed for highlighting the language
used as one of the themes of the films.
The first point is simply the fact that these documentaries are entirely in
the Vurës language. It is a highly significant matter for a film to be made –
and to be commercially available – in a minority language spoken by a little
over a thousand people. Within the sociolinguistic context of Vanuatu with
its 100 languages, the existence of the films and knowledge of their existence
within and outside the community serves immediately to elevate the status of
the language. In §4 I discuss further the significance of films being produced
in minority languages.
The second point is that while the language maintenance issue is not
discussed within the films, there is an introductory written text, accessible
through the main DVD menu, which provides background on the language
and language endangerment issue. The text discusses the fact that some of
the languages spoken on Vanua Lava have already been lost, and that while
Vurës is being passed on to children today, the community recognise that it
is under threat from English, French and Bislama. It is stated that this is the
reason why the language documentation project was supported by the com-
munity and why the DVDs were produced.
The third way in which these films can be identified as having the lan-
guage used as a focus is considerably more explicit, with a clear pedagogical
intent. Each of the documentaries contains a number of dictionary entries –
presented on the screen as an excerpt of a page from a dictionary (as exempli-
fied by Figure 1) – which highlight and provide translations and definitions
for certain key terms used in the films. The aim was to choose the most sig-
nificant technical terms that were related to the topic of each film, particularly
those which were more likely to represent restricted knowledge, thus enabling
the dictionary entries to serve as an important record of the meanings of the
technical terms. For example, in both documentaries a number of plants are
referred to which are used for producing woven artefacts or in fishing ac-
tivities. The dictionary entries for these words are presented complete with
the scientific identification for the plant. Thus, the film serves as a record of
particular cultural activities, including the vernacular names for species that
have specific cultural uses, and the linked dictionary entry serves as an accu-
rate scientific record linking the identification to the use as it is represented
in the film.
The idea behind including the dictionary entries within the films themselves
was to highlight the fact that the language being used is one of the themes of
the films, and to compel the viewers, whether they be members of the Vurës
community or not, to consider its high significance. For the Vurës speakers,
now and in the future, the entries for each headword can be used as a teach-
ing tool, particularly in relation to those words which are used in the domain
of cultural practices that are not now widely observed. Linking dictionary
entries to video in which the use and denotatum of the words are clearly
illustrated has a much greater value than a dictionary entry presented as text
alone. This is particularly true in the case of scientific identifications and
technical terms, where the translation or definition may not, in reality, be suf-
ficient to unambiguously retrieve the correct meaning and range of use of the
word or expression. I will give two different types of examples. Firstly, in-
cluding the language term and its scientific name alongside contextual video
footage which illustrates clearly the species and the natural environment in
which it is found, is a comprehensive record which can be used to assist the
Terms specific to fishing and other featured activities are also included, both
those that are more technical and also more general terms that play an impor-
tant role in discussion of the featured activities. Definitions for 15 different
plant species are included: nine in the weaving documentary and six in the
fishing documentary.
Each dictionary entry appears on screen directly after the utterance unit
the defined word occurs in, thus linking the word and its translation/definition
directly to the use of the word in context. Most of the content of the films is
of a procedural genre, and the entries are linked in such a way that as the
speaker(s) demonstrate the process, explaining their actions, when they use a
key term the film then cuts to a screen where first the headword alone, then
the full dictionary excerpt appears. The dictionary page remains on screen for
eight seconds and then cuts back to the film. Presenting the dictionary entries
in this way, the aim was to make them a feature of the film without
considerably disrupting the flow of the depicted procedures.
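The timing described above can be sketched as a small scheduling routine. This is a minimal illustration in Python, not the workflow actually used to edit the DVDs: the names `KeyTerm`, `Cut` and `schedule_entries`, the placeholder headwords and the example timestamps are all hypothetical. The sketch also assumes that the eight seconds cover the whole entry screen (headword alone, then the full excerpt), a detail the text leaves open.

```python
from dataclasses import dataclass

# From the text: the dictionary page stays on screen for eight seconds.
ENTRY_HOLD_SECONDS = 8.0


@dataclass
class KeyTerm:
    headword: str          # placeholder label, not a real Vurës word
    utterance_end: float   # seconds into the source footage where the utterance unit ends


@dataclass
class Cut:
    headword: str
    source_cut_at: float   # point in the source footage where the film cuts away
    output_start: float    # point in the edited film where the entry screen begins
    output_end: float      # point in the edited film where the footage resumes


def schedule_entries(terms, hold=ENTRY_HOLD_SECONDS):
    """Place each entry screen directly after the utterance containing its key term."""
    cuts = []
    inserted = 0.0  # total entry-screen time already added before the current term
    for term in sorted(terms, key=lambda t: t.utterance_end):
        start = term.utterance_end + inserted
        cuts.append(Cut(term.headword, term.utterance_end, start, start + hold))
        inserted += hold
    return cuts


# Hypothetical example: two key terms in one procedural sequence.
example_cuts = schedule_entries([KeyTerm("term_b", 30.5), KeyTerm("term_a", 12.0)])
for cut in example_cuts:
    print(cut)
```

In this hypothetical example the second entry screen begins at 38.5 seconds of the edited film rather than at 30.5, because the first entry has already added eight seconds to the running time.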
puts are produced and returned to the community. He plans to use the films,
particularly the documentation of weaving, as a starting point for workshops
on transmitting cultural knowledge.
1. Notable exceptions are the recent award-winning French language films, Sevrapek City
(Broto and Tzerikiantz 2009) and Le Salaire du Poète (Wittersheim 2008).
2. http://www.vertigoproductions.com.au/information.php?film_id=11&display=extras
5. Conclusion
In conclusion, this paper has demonstrated the diverse range of audiences that
can benefit from outputs of a language documentation project, despite the fact
that the different audiences have varied needs and interests. Placing emphasis
3. http://www.sorosoro.org/en/
4. http://travel.nationalgeographic.com/travel/enduring-voices/
5. http://www.youtube.com/enduringvoices
References
Austin, Peter K., and Lenore A. Grenoble. 2007. Current trends in language
documentation. In Language Documentation and Description, Volume 4,
ed. Peter K. Austin, 12–25. London: School of Oriental and African Stud-
ies.
Broto, Emmanuel, and Fabienne Tzerikiantz (Producer & Director). 2009.
Sevrapek City. [Motion picture].
Dimmendaal, Gerrit J. 2010. Language description and the “new paradigm”:
What linguists may learn from ethnocinematographers. Language Docu-
mentation & Conservation 4:152–158.
Dwyer, Arienne M. 2006. Ethics and practicalities of cooperative fieldwork
and analysis. In Essentials of Language Documentation, eds. Jost Gippert,
Nikolaus P. Himmelmann, and Ulrike Mosel, 31–66. Berlin, New York:
Mouton de Gruyter.
Gippert, Jost, Nikolaus P. Himmelmann, and Ulrike Mosel, eds. 2006. Essen-
tials of Language Documentation. Berlin, New York: Mouton de Gruyter.
Harrison, K. David, ed. 2007. When Languages Die: The Extinction of the
World’s Languages and the Erosion of Human Knowledge. Oxford: Oxford
University Press.
Lynch, John, and Terry Crowley, eds. 2001. Languages of Vanuatu: A New
Survey and Bibliography. Canberra: Pacific Linguistics.
Mondragón, Carlos. 2004. Of winds, worms and Mana: The traditional cal-
endar of the Torres Islands, Vanuatu. Oceania 74:289–308.
Vanuatu National Statistics Office. 2009. National Census of Housing and
Population. Port Vila, Vanuatu: Ministry of Finance and Economic Man-
agement.
Wittenburg, Peter. 2007. DoBeS/MPI Archive Issues. Presentation at the
National Science Foundation Workshop on Documenting Endangered Languages,
Durham, New Hampshire (October 2007), available online at
http://www.lat-mpi.eu/papers/papers-2007/Presentations/newhampshire-talk.pdf
(accessed 2011/03/19).
Wittersheim, Eric (Producer & Director). 2008. Le Salaire du Poète. [Motion
picture].
Chapter 14
Filming with native speaker commentary∗
Anna Margetts
1. Introduction
∗ I would like to thank Birgit Hellwig and the editors of this volume for valuable feedback
on an earlier draft. I also thank Andrew Margetts, who is behind many of the technical
details reported here. As always, I cordially thank the communities and individuals on
Saliba and Logea Island who have supported our work. More speakers have been involved
with the Saliba-Logea project than can be listed here. Community members who have
worked with us on transcribing, annotating, translating, and editing texts include Nebo
Joseph, Rose Meina, Meggie Alaluku, Matthew Hawele, Penesia Eric, Mila Kelwau, and
Morris Alaluku. For their commentaries on the recordings discussed here I thank Balosi
Leman, Alaluku Leman and Mr January. DoBeS project members in Australia include
Anna Margetts, Carmen Dawuda, Andrew Margetts, and John Hajek; Ulrike Mosel was
the German host, and Kipiro Damas from the PNG National Herbarium in Lae joined as a
botanical consultant.
ered more acceptable and desirable by the community. The raw transcriptions
are still archived, but are less publicly available.
This chapter is concerned with commentaries accompanying video record-
ings as a research methodology and means of data collection, again as a re-
sponse to the tension arising from the different aims within a documentation
project. It discusses the benefits of this technique and addresses the nature of
the data that can be collected by this method.
Commentaries have in the past received relatively little attention as a
genre or as a data collection method. However, more recently there has been
some discussion of this topic in the field of documentary linguistics. Cablitz
(2008) describes several methods and recording setups for the creation of
procedural documents. In one of the techniques, one person performs the ac-
tivity (e.g. traditional food or medicine preparation) while another speaker
provides a commentary on what is happening. As Cablitz observes, such
commentaries do not constitute “examples of how people actually commu-
nicate with each other” as called for by Himmelmann (2006: 7), but create a
new type of communicative event along the lines described by Mosel (2004,
2009). Like Mosel she notes that new types of communicative events, while
not traditional, are interesting in themselves as they allow one to observe the
process of using language in a new situation and provide new insights into a
language’s expressive potential (Mosel 2004: 4; Cablitz 2008).
Another recent use of commentaries is the technique of orally providing
annotations to primary data thereby shortcutting the transcription and written
annotation process. In the context of documentation work with the Cup’ik
community Woodbury (2003) argues for the recording of such oral commen-
taries instead of focusing exclusively on written annotations.
... we will use the time of the few elder Cup’ik translators with wide En-
glish and Cup’ik vocabularies to produce running UN [United Nations] style
translations of many more materials, and then have younger speakers flag the
obscure words or usages for special attention. We are also considering not
transcribing everything – instead starting with hard-to-hear tapes and asking
elders to ‘respeak’ them to a second tape slowly so that anyone with training
in hearing the language can make the transcription if they wish. (2003:45)
In a similar vein, Simons (2008), Bird (2010), and Reiman (2010) discuss the
method of Basic Oral Language Documentation (BOLD) where oral annota-
tion replaces transcription and written annotation. In place of the traditional
corpus which consists of data that has been transcribed and marked up with
1. http://www.boldpng.info/
2. http://fieldmanuals.mpi.nl/
3. The final was between “Cycas” from Bwasitau and the “Station Warriors” from Sawa-
sawaga. The match had been postponed several times because of a death in a neighbouring
community and of the uncertainty about when the funeral would be held. The teams’ preparation
for the match included all-night prayer meetings which continued each time the match was
postponed. Players must have been very tired by the time the day came and it was a rel-
atively slow match, ending in a 0:0 draw. In the ensuing penalty shoot-out, Sawasawaga
won 1:0.
4. The commentaries were provided by Balosi Leman and Mr January. The recordings of the
match were fed into the data workflow as four sessions: SoccerMatch_01EZ (commentary
Balosi Leman), SoccerMatch_02FA, SoccerMatch_03FA, SoccerMatch_04FA (commen-
tary by Mr January across three video tapes).
Foley (2003: 86) warns that “the effect of our native language ideology on
our products of description can be heavily disguised ... The only corrective I
can suggest for our possibly misleading descriptive flights of fancy is ... stay
close to the full range of data, all register and genre types”.
While the commentary is monologuous it is very different from other
monologues in the corpus, such as more planned narratives, but also from
spontaneously produced monologues. Running sports commentaries are a
unique text type in that the commentator is not aware of the outcome of the
match while reporting and therefore cannot structure the text according to a
certain result.5 The speech in the soccer commentary is spontaneous and off
the cuff in this sense as the commentator invents techniques of filling the gaps
and producing a stream of continuous fluent language. It is not an established
genre in Saliba-Logea and the commentator clearly produces an imitation of
a western-style sports commentary. In this sense the commentary data can be
considered artificial while at the same time clearly highly spontaneous.
Beyond constituting a new text type, the commentary contains some inter-
esting linguistic features including several instances of the reciprocal prefix
which is extremely rare in the Saliba-Logea text data:
The text also contains the only examples of code switching between Saliba
and Tok Pisin in the corpus.6 The commentator started out in Saliba but then
switched between Tok Pisin and Saliba several times. Two examples are given
in (3) and (4).7
5. Presumably the genre only emerged in response to advances in media technology and
did not exist before the advent of radio. However, commentary-style reporting could in
principle also take place for the benefit of a physically present but non-seeing audience.
6. English is the areal lingua franca in Milne Bay Province and typically only people who
have lived in other parts of PNG know Tok Pisin. This has been changing somewhat in recent
years, at least in the provincial capital Alotau, with the influx of people from other parts of
PNG.
7. Items like winim ‘win’ and tim ‘team’ are considered loan words here rather than instances
of code switching.
8. The race was initiated by Balosi Leman, the sailing canoe expert and soccer commentator
with whom we had worked before. He organised it with the help of his brother Alaluku
Leman who also provided the commentary.
racing participants out on the bay. Overall the data is conversational rather
than monologuous and it constitutes some of the clearest audio recordings of
casual conversations in the database. From the point of view of conversational
data, the drawback of the recordings is of course that none of the speakers is
in the picture as the camera is trained on the racing canoes and that when in-
terlocutors are not close to the commentator only his side of the conversation
is recorded through the lapel microphone. However, the audio quality is still
better than some of our other audio-only conversational data.
Apart from the conversational nature of the commentary, interesting as-
pects of the data include examples of boating and sailing terminology in their
natural context of use, as in (8) to (10). Such terminology is otherwise mainly
represented through elicitation and is rare in the corpus.
(8) Se kuke.
3PL set.sail
‘They are setting sail.’
(9) Se giyuli.
3PL go.around.point
‘They are sailing around the point.’
(10) Se yatowa.
3PL tack
‘They are tacking.’
(13) You are safe to walk, kwa lau namwa-namwa, you act properly.
you are safe to walk 2PL go RED-good you act properly
‘You are safe to walk, walk safely, act properly.’
Again, the commentary enhanced the video footage both in terms of adding
linguistic data for documentation and analysis and in terms of benefit for the
community. Without the explanations of what is happening, the video images
basically just show a few sailing canoes and are otherwise uninterpretable.
task. At the end of the party the visitors return to their village with their
gifts and the host family distributes the received goods to the members of
their own extended family who supported them with their contributions. The
visiting side typically reciprocates after a few years with a corresponding
feast. Following this counter exchange the marriage is settled and divorce
would be quite involved for both sides, because the exchanged gifts would
have to be returned or compensated for.
We filmed the preparations of a gwalisaekeno feast on the side of the
visiting party, their arrival by boat, and the party itself.9 The recordings are
more or less free of usable linguistic data (apart from some speeches) but they
provide an interesting documentation of one of the last traditional feasts still
practiced in the area. However, the video documentation is basically only as
good as the background knowledge of the viewer because many aspects of
the recordings are not self-explanatory. As described, there are many aspects
of this traditional feast which follow regular patterns and which are part of
the cultural knowledge of the community, but their meaning and significance
or even the fact that they are taking place cannot necessarily be gleaned from
video footage of the events. A running commentary by a community member
could have bridged this gap and made the documentation more meaningful
for outside viewers by explaining what was happening.
In the course of the project we also recorded procedural texts and
interviews with several speakers about gwalisaekeno feasts.10 These sessions are
located in a different branch of our data tree but they are linked to the video
recording of the feast through metadata and the project’s file naming practice.
A running commentary during filming would have strengthened the link
between the two types of recordings (the event itself and the meta-texts about
the event) by naming sub-events as they occurred, such as the traditional
complaints by the mother, the carrying of the firewood, and the pig slaughter,
which are described in the meta-texts but filmed without commentary in the
documentation.
Of course it is possible to add a running commentary after the event by
reviewing the recording with speakers and this would provide a useful
annotation to the gwalisaekeno videos. However, such commentaries constitute a
separate step in the data gathering and processing (and therefore may or may
not in fact happen). One also needs to be aware that adding a commentary
later is a different technique which will result in different kinds of
linguistic data. For example, deictic terms employed by a commentator on site may
be quite different from those used by a commentator who is reviewing the
recordings on a screen and commenting on them.11
11. In principle, such different types of commentaries could be used to create parallel corpora
in order to explicitly investigate the differences between them. A similar methodology is
described by Mosel (2009) to compare a narrative about killing a chicken and a procedural
text on the same topic, both elicited with the same visual stimuli. See also Mosel (2008,
2009) on comparing parallel corpora of raw oral versions and edited versions of the same
texts.
this setup one microphone is dedicated to the commentary, the other
microphone is recording the event. This allows for annotations and comments to be
recorded simultaneously with the actual event without interfering with it. See
Margetts and Margetts (in print) for technical details on such recordings.
3. Conclusion
ality of the commentator. It helps if they like to talk and are engaged with the
event. Another aspect is whether the commentator is perceived as appropriate
to comment on the event, and what their relation is to the people and events
being filmed. This includes aspects like their seniority, their knowledge of the
event, (perceived) partiality, family affiliations, etc.
In sum, inviting a running commentary can help in creating a richer
documentation for the project and for the community and should perhaps be
the norm rather than the exception for video recordings of primarily
non-linguistic events. Running commentaries are also a productive technique for
annotating linguistic events and performances and for recording metadata.
Abbreviations
1, 2, 3   first, second, third person
CLF       classifier
CONJ      conjunction
EXCL      exclusive
INCL      inclusive
INTRJ     interjection
NEG       negation
PL        plural
POSS      possessive
RECP      reciprocal
RED       reduplication
SG        singular
TAM       tense, aspect, mood
TOPIC     topic marker
References
Bird, Steven. 2010. A scalable method for preserving oral literature from
small languages. Proceedings of the 12th International Conference on
Asia-Pacific Digital Libraries, Gold Coast, Australia, June 2010.
http://www.boldpng.info/.
Cablitz, Gabriele. 2008. The making of procedural documents on the
Marquesas and Tuamotu islands (French Polynesia). Presentation at Language
Documentation Methods in Focus (DoBeS meeting, June 2008).
Foley, William A. 2003. Genre, register and language documentation in
literate and preliterate communities. In Language Documentation and
Description, Volume 1, ed. Peter K. Austin, 85–98. London: School of Oriental and
African Studies.
Himmelmann, Nikolaus P. 2006. The challenges of segmenting spoken
language. In Essentials of Language Documentation, eds. Jost Gippert,
Nikolaus P. Himmelmann, and Ulrike Mosel, 253–274. Berlin, New York:
Mouton de Gruyter.
Lavric, Eva, Gerhard Pisek, Andrew Skinner, and Wolfgang Stadler, eds.
2008. The Linguistics of Football. Tübingen: Gunter Narr.
Margetts, Anna, and Andrew Margetts. In print. Audio and video recording
techniques for linguistic research. In The Oxford Handbook of Linguistic
Fieldwork, ed. Nicholas Thieberger. Oxford: Oxford University Press.
Mosel, Ulrike. 2004. Inventing communicative events: Conflicts arising
from the aims of language documentation. Language Archives
Newsletter 1(3):3–4.
Mosel, Ulrike. 2008. Putting oral narratives into writing – experiences from
a language documentation project in Bougainville, Papua New Guinea.
Presentation at the Simposio Internacional Contacto de lenguas y
documentación, Buenos Aires, CAIYT (August 2008), available online as ‘Oral
and written versions of Teop legends’ at http://www.linguistik.uni-kiel.de/
mosel_publikationen.htm#download (accessed 2011/02/26).
Mosel, Ulrike. 2009. Collecting data for grammars of previously
unresearched languages. Unpublished manuscript, available online at
http://www.linguistik.uni-kiel.de/mosel_publikationen.htm#download (accessed
2011/02/26).
Müller, Torsten. 2007. Football, Language and Linguistics. Tübingen: Gunter
Narr.
Reiman, D. Will. 2010. Basic oral language documentation. Language
Documentation & Conservation 4:254–268.
Seifart, Frank. 2008. On the representativeness of language
documentation. In Language Documentation and Description, Volume 5, ed. Peter K.
Austin, 60–76. London: School of Oriental and African Studies.
Simons, Gary F. 2008. The rise of documentary linguistics and a new kind
of corpus. Presentation at the 5th National Natural Language Research
Symposium, De La Salle University, Manila (November 2008), available
online at http://www.sil.org/~simonsg/presentation/doc%20ling.pdf (accessed
2011/02/26).
Woodbury, Anthony C. 2003. Defining documentary linguistics. In Language
Documentation and Description, Volume 1, ed. Peter K. Austin, 35–51.
London: School of Oriental and African Studies.
Index
access regulations, 46
Admiralty Islands, 269, 275
Äiwoo, 296, 301
afterthought, 10, 152, 153, 160, 161, 164–170, 172
Ambrym (South East), see South East Ambrym
animacy, 68, 73, 76, 80
ANNEX, see tools
ARBIL, see tools
archive (DoBeS), see DoBeS
archiving, 8, 21, 33–53, 81, 256
Arosi, 264, 269, 270, 273, 274, 280, 281
aspect system, 121–145
Athapaskan, 202
Austronesian, 64, 82, 238, 264
authority, 293–294, 297, 299–302
Awetí, 55, 64, 65, 67–69, 71, 73, 75, 81, 82

Basic Oral Language Documentation (BOLD), 322, 323, 333
Bauan, 267–271
Beaver, 201–219
Bislama, 306, 308, 309, 311
Bismarck Archipelago, 269
boundary (intonation), 153, 159–162, 168
Brazil, 64, 81
Bugotu, 275

capacity building, 241, 247–249
Carolinean, 269, 270
Cèmuhî, 264, 269–271, 273
Cheke Holo, 269–271, 273, 280, 281
clitic word, see word
Code of Conduct (DoBeS), see DoBeS
collaborative
  fieldwork, 23, 251
  lexicon creation, 48, 241, 242, 245, 268
  transcription, 10, 202–208, 210
  workspace, 241–247
commentary, 13, 315, 321–335
community (DoBeS), see DoBeS
community (language), see language
community involvement, 5–6, 10, 11, 13, 23–24, 57, 224–226, 240, 241,
  243–249, 251, 252, 256, 257, 291, 297, 323
complex sentences (prosody of), 160–169
conjunctivism, 114, 116
consent, 44, 45, 293, 294
contact (language), see language
contrastive (focus), 101, 170, 171
copyrights, 45, 295, 300
corpus linguistics, 57
cultural heritage, see heritage
cultural knowledge, 12, 21, 151, 223, 228, 230, 237, 240, 241, 245, 246,
  249–251, 256, 265, 311, 313–315, 332

Danish, 181
data
  collection, 6, 33, 156, 177, 305, 322
  infrastructures, 8, 33–53
  utilization, 50–52, 305
data (lexical), see lexical
description (language), see language
dictionaries
  bilingual, 125, 229, 248, 254, 264, 266, 311