the Representation
of Etymological Data
Fahad Khan
Istituto di Linguistica Computazionale «A. Zampolli», Consiglio Nazionale delle Ricerche, Pisa
Jack Bowers
Austrian Center for Digital Humanities, Vienna; Inria - Team ALMAnAC, Paris
Introduction
- In this talk we introduce a new ISO standard, LMF-diachrony-etymology (ISO
24613-3), or LMF-Ety, which is currently in an advanced stage of development.
- LMF-Ety is meant for the modelling and publication of etymological data in
computational lexical resources.
- It meets the need for a lexical standard that facilitates the interoperability and the
queryability of etymological information.
- LMF-Ety is intended both for modelling etymologies extracted from legacy
dictionaries (or other kinds of texts) and for creating born-digital etymological
resources, or any hybrid of the two.
- It is one part of a multipart lexical ISO standard, the Lexical Markup Framework
(LMF)
Introduction
- LMF-Ety follows on from similar previous work in LMF and other data
standards/frameworks such as the Text Encoding Initiative (TEI) and Linked Data.
- It also feeds into, and is informed by, ongoing work by the authors in TEI, more
recently TEI-Lex0, and Linked Data.
- General Idea: represent etymologies as abstract labelled graphs using a meta-model
that can also be implemented in other formalisms (especially linked data/RDF).
- One of the two official serialisations of LMF-Ety (and LMF as a whole) is TEI (this
is part 4 of the new LMF standard).
Introduction
The modelling of etymologies as computational resources lies very much at the
intersection of the digital humanities and computational linguistics:
- We are often dealing with retrodigitized dictionaries which have considerable historical significance
in their own right.
- On the other hand, etymologies can be embedded in texts of many different genres (e.g.,
philosophical treatises, religious works, etc.).
- We are very often dealing with historical languages such as Latin, Greek and Sanskrit.
- This brings up the importance of explicitly representing scholarly hypotheses, evidence from the literature,
interpretations of texts, etc., and of modelling uncertainty.
- It shows the limits of what we can represent efficiently using current standards and technologies.
Introduction
In what follows we will:
- Give some brief background on the modelling of etymologies and the challenges (and
opportunities) of representing them as richly structured data.
- Discuss the Lexical Markup Framework (LMF) both in its previous incarnation as a
single-part standard and in its current status as a multi-part standard.
- Describe the LMF-Ety model itself using an example taken from a Latin
etymological dictionary.
Etymologies
- The word etymology comes from the Greek ἐτυμολογία (etumología) -- from ἔτυμον
(étumon, “true sense”) and -λογία (-logía, “study of”) -- and has at least two different
(related) senses. It can refer to:
- A sub-discipline of historical linguistics that is concerned with the development of individual words
(and other lexemes) over time and which attempts to trace their origins as far back as the record
supports
- A single such history of a word (or other lexical phenomenon).
- We will mostly focus on the second sense in what follows.
- Etymologies as word histories potentially lend themselves quite well to being
represented as graphs.
- However, the need to represent different etymological hypotheses, and time,
explicitly can make this quite tricky.
Modelling Etymologies
- In the next few slides we will look at some of the main concepts which we have singled
out as primary to the study of etymology.
- These concepts constitute the core of a meta-model which is implemented in the
LMF-Ety model and in previous and ongoing work in TEI and RDF.
- In the interests of interoperability our model therefore follows some of the formal
constraints of these data models/frameworks (especially with respect to time).
- We will also look at some example etymological entries in order to
motivate our modelling choices.
Example Entry
GIRL, a female child, young woman. (E.) ME. gerle, girle, gyrle, formerly used of
either sex, and signifying either a boy or girl. In Chaucer, C.T. 3767 (A 3769) gerl
is a young woman; but in C.T. 666 (A 664), the pl. girles means young people of
both sexes. In Will. of Palerne, 816, and King Alisander, 2802, it means ‘young
women;’ in P. Plowman, B.i.33, it means ‘boys;’ cf. B. x. 175. Answering to an AS.
form *gyr-el-, Teut. *gur-wil-, a dimin. form from Teut. base *gur-. Cf. NFries. gor,
a girl; Pomeran. goer, a child; O. Low G. gor, a child; see Bremen Wörterbuch, ii.
528. Cf. Swiss gurre, gurrli, a depreciatory term for a girl; Sanders, G. Dict. i. 609,
641; also Norw. gorre, a small child (Aasen); Swed. dial. garra, guerre (the same).
Root uncertain. Der. girl-ish, girlish-ly, girl-ish-ness, girl-hood.
Example Entry
The same GIRL entry, annotated with two labels:
- Description of the history and evolution of the word
- Information about previous forms and meanings
Example Entry
The same GIRL entry, annotated with the label:
- Information on Cognates
Another Example Entry
girl, whence girlish, derives from ME girle, varr gerle, gurle: o.o.o.: perh of C origin: cf
Ga and Ir caile, EIr cale, a girl; with Anglo-Ir girleen (dim -een), a (young) girl, cf
Ga-Ir cailin (dim -in), a girl. But far more prob, girl is of Gmc origin: Whitehall
postulates the OE etymon *gyrela or *gyrele and adduces Southern E dial girls,
primrose blossoms, and grlopp, a lout, and tentatively LG goere, a young person
(either sex). Ult, perh, related to L puer, puella, with basic idea '(young) growing
thing'.
Another Example Entry
The same girl entry, annotated with the label:
- Three different hypotheses for the origin of the same word
Another Example Etymology
The English word friar (‘member of certain religious orders’) has an interesting history…
modern English friar < Middle English frere, friar < Old French frere ‘brother’, also ‘member of a
religious order of "brothers"’ < Latin frāter ‘brother’
We have two kinds of links between the items in this etymology. The salmon-pink links
mark words inherited from an earlier stage of a language or from a parent language (Latin to
Old French, Middle English to modern English), and the blue link marks a borrowing from one
language into another (Old French to Middle English). These links can be further subtyped.
NB. The ‘<’ symbol is commonly used in etymological sources to mean ‘is derived from’.
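The friar example can be sketched as a small graph of typed links. This is only an illustration, not the LMF-Ety serialisation: the class names `Form` and `Link` and the link type strings are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    language: str
    written: str


@dataclass(frozen=True)
class Link:
    source: Form  # the earlier form (the etymon)
    target: Form  # the later form derived from it
    kind: str     # hypothetical type label: "inheritance" or "borrowing"


latin = Form("Latin", "frāter")
old_french = Form("Old French", "frere")
middle_english = Form("Middle English", "frere")
modern_english = Form("English", "friar")

# The chain from oldest to newest form, one typed link per step.
links = [
    Link(latin, old_french, "inheritance"),
    Link(old_french, middle_english, "borrowing"),
    Link(middle_english, modern_english, "inheritance"),
]


def render(chain):
    """Write a chain in the 'X < Y' notation, latest form first."""
    newest = chain[-1].target
    parts = [f"{newest.language} {newest.written}"]
    for link in reversed(chain):
        parts.append(f"{link.source.language} {link.source.written}")
    return " < ".join(parts)


def borrowings(chain):
    """Return only the links that cross from one language into another."""
    return [link for link in chain if link.kind == "borrowing"]


print(render(links))
print([f"{b.source.language} -> {b.target.language}" for b in borrowings(links)])
```

Keeping the link type on the edge rather than on the word means the same form can take part in both inherited and borrowed relations.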
Etymons and Cognates
At the core of our model are two concepts from the field of etymology, etymons and cognates,
which we use to distinguish *regular* lexical entries from those used in etymologies.
An etymon is one of the sources of another word. That is, it is a direct ancestor of a given
word, whereas a cognate has an ancestor in common with that word. For instance, the English
word obligation has amongst its etymons the Latin words obligare and ligare, and the
reconstructed Proto-Indo-European root *leig-. Its cognates include ligament and league in
English, obbligare in Italian and obrigado in Portuguese.
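The difference between the two concepts can be sketched over a toy ancestry table. The table below is deliberately simplified (intermediate stages are omitted) and the function names are assumptions made for this sketch; it is not part of LMF-Ety.

```python
# Toy data: each word maps to its direct sources. Simplified for illustration;
# intermediate historical stages are left out.
SOURCES = {
    "obligation (en)": ["obligare (la)"],
    "obligare (la)": ["ligare (la)"],
    "ligare (la)": ["*leig- (PIE)"],
    "ligament (en)": ["ligare (la)"],
    "obbligare (it)": ["obligare (la)"],
}


def etymons(word, sources=SOURCES):
    """Collect every ancestor of a word by following source links upwards."""
    seen = set()
    stack = list(sources.get(word, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(sources.get(current, []))
    return seen


def are_cognates(a, b, sources=SOURCES):
    """Two words are cognates if they share an ancestor and neither descends from the other."""
    ancestors_a, ancestors_b = etymons(a, sources), etymons(b, sources)
    if a == b or a in ancestors_b or b in ancestors_a:
        return False
    return bool(ancestors_a & ancestors_b)


print(sorted(etymons("obligation (en)")))
print(are_cognates("obligation (en)", "ligament (en)"))
print(are_cognates("obligation (en)", "obligare (la)"))
```

In this sketch cognacy is derived from the etymon links rather than stored separately, which is one possible design; a dictionary entry may instead assert a cognate directly without naming the shared ancestor.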
Etymologies and Etymological Links
- It would be simpler to model etymologies as simple chains of etymons (adding links
to cognates to build simple trees).
- However, if we want to predicate information about different etymologies/hypotheses
(different chains of etymons), and even potentially reason about them, it is better to
make etymologies objects in their own right.
- This makes it easy to attach bibliographic information and references to the
secondary literature to individual etymologies.
- Similarly, reifying etymological links (making them a type of
individual) makes it much easier to describe their provenance.
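A minimal sketch of what reification buys us, using the three hypotheses from the second girl entry. The class and field names (`Etymology`, `EtyLink`, `certainty`, `attributed_to`) are assumptions made for this illustration, and the certainty labels simply paraphrase the wording of the entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EtyLink:
    source: str                 # the etymon
    target: str                 # the form derived from it
    kind: str                   # hypothetical type label
    note: Optional[str] = None  # provenance or other remarks on this link


@dataclass
class Etymology:
    headword: str
    summary: str
    certainty: str                       # wording taken from the entry
    attributed_to: Optional[str] = None  # scholar named in the entry, if any
    links: List[EtyLink] = field(default_factory=list)


hypotheses = [
    Etymology("girl", "of Celtic origin", "perhaps"),
    Etymology(
        "girl",
        "of Germanic origin",
        "far more probable",
        attributed_to="Whitehall",
        links=[
            EtyLink(
                "OE *gyrela or *gyrele",
                "ME girle",
                "inheritance",
                note="postulated etymon, not attested",
            )
        ],
    ),
    Etymology("girl", "related to Latin puer, puella", "ultimately, perhaps"),
]


def attributed(etymologies):
    """Return the hypotheses that the entry credits to a named scholar."""
    return [e for e in etymologies if e.attributed_to is not None]


for e in hypotheses:
    print(f"{e.headword}: {e.summary} ({e.certainty})")
print([e.summary for e in attributed(hypotheses)])
```

Because each hypothesis and each link is an object, competing accounts of the same word can sit side by side, each with its own attribution and notes.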
Adding Time...
- Since we would like elements of this meta-model to be implemented in RDF as well, we
needed to respect the constraints of that framework with respect to adding temporal
parameters to elements and to the relations between them.
- E.g., in order to express that a lexical entry l had the sense s1 during the temporal
interval t1 and the sense s2 during the temporal interval t2, we cannot use a simple
three-place relation, hasSense(l, s1, t1) and hasSense(l, s2, t2), because RDF statements
are binary (subject, predicate, object).
- For this reason we decided to use a well-known approach in ontology modelling: the
perdurantist approach.
- According to this approach we can associate a 'lifespan' with individual lexical
entries, senses, etymologies, etc. Individuals can also have temporal parts during
which different properties hold.
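The temporal-parts idea can be sketched as follows. Everything here is an assumption made for illustration: the class names, the predicate names in `to_triples` and the years chosen for t1 and t2 are placeholders, not values from the standard or from any dictionary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Interval:
    start: int  # year, inclusive
    end: int    # year, exclusive

    def contains(self, year):
        return self.start <= year < self.end


@dataclass
class TemporalPart:
    interval: Interval
    senses: List[str]


@dataclass
class LexicalEntry:
    lemma: str
    temporal_parts: List[TemporalPart] = field(default_factory=list)

    def senses_at(self, year):
        """Senses holding in the temporal part(s) that cover the given year."""
        found = []
        for part in self.temporal_parts:
            if part.interval.contains(year):
                found.extend(part.senses)
        return found

    def lifespan(self):
        """The interval spanned by all temporal parts of the entry."""
        return Interval(
            min(p.interval.start for p in self.temporal_parts),
            max(p.interval.end for p in self.temporal_parts),
        )


def to_triples(entry):
    """Flatten the entry into binary statements; predicate names are hypothetical."""
    triples = []
    for i, part in enumerate(entry.temporal_parts, start=1):
        part_id = f"{entry.lemma}@t{i}"
        triples.append((entry.lemma, "hasTemporalPart", part_id))
        triples.append((part_id, "holdsDuring", f"{part.interval.start}/{part.interval.end}"))
        for sense in part.senses:
            triples.append((part_id, "hasSense", sense))
    return triples


# Placeholder years standing in for t1 and t2.
entry = LexicalEntry(
    "l",
    [
        TemporalPart(Interval(1300, 1500), ["s1"]),
        TemporalPart(Interval(1500, 2000), ["s2"]),
    ],
)

print(entry.senses_at(1400), entry.senses_at(1600))
for triple in to_triples(entry):
    print(triple)
```

The three-place hasSense(l, s1, t1) becomes three binary statements about the temporal part l@t1, which is what allows the same information to be expressed as RDF triples.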
The Lexical Markup Framework
- The Lexical Markup Framework (ISO 24613:2008) was first published as a
standard in 2008 by the International Organization for Standardization (ISO) as a “standard
framework for the construction of computer lexicons”.
- It is used by several organizations and in a number of national and international
projects.
- It has been very influential on other, more recent standards such as (Ontolex-)Lemon,
TEI-Lex0, and the forthcoming ELEXIS standard.
- As a result of a review it was decided in 2016 to revise LMF as a multi-part
standard (improved modularity).
- The first part of the new version has already been published, and four of the other
parts are at an advanced stage of completion.
The Lexical Markup Framework
Comes in several parts: