
Towards a Lexical Standard for the Representation of Etymological Data

Fahad Khan
Istituto di Linguistica Computazionale «A. Zampolli», Consiglio Nazionale delle Ricerche, Pisa

Jack Bowers
Austrian Center for Digital Humanities, Vienna
Inria - Team ALMAnAC, Paris
Introduction
- In this talk we introduce a new ISO standard, LMF-diachrony-etymology (ISO
24613-3), or LMF-Ety, which is currently in an advanced stage of development.
- LMF-Ety is meant for the modelling and publication of etymological data in
computational lexical resources.
- It meets the need for a lexical standard that will facilitate the interoperability and the
queryability of etymological information.
- LMF-Ety is intended both for modelling etymologies extracted from legacy
dictionaries (or other kinds of texts) and for creating born-digital etymological
resources, or any hybrid of the two.
- It is one part of a multi-part lexical ISO standard, the Lexical Markup Framework
(LMF).
Introduction
- LMF-Ety follows on from similar previous work in LMF and other data
standards/frameworks such as the Text Encoding Initiative (TEI) and Linked Data.
- It also feeds into and is informed by ongoing work by the authors in TEI and, more
recently, in TEI-Lex0 and Linked Data.
- General Idea: represent etymologies as abstract labelled graphs using a meta-model
that can also be implemented in other formalisms (especially linked data/RDF).
- One of the two official serialisations of LMF-Ety (and LMF as a whole) is TEI (this
is part 4 of the new LMF standard).
Introduction
The modelling of etymologies as computational resources lies very much at the
intersection of the digital humanities and computational linguistics:

- We are often dealing with retrodigitized dictionaries which have considerable historical significance
in their own right.
- On the other hand, etymologies can be embedded in texts of all genres and kinds (e.g.,
philosophical treatises, religious works, etc.).
- We are very often dealing with historical languages such as Latin, Greek, and Sanskrit.
- This brings up the importance of representing explicit scholarly hypotheses, evidence from the literature,
interpretations of texts, etc., and of modelling uncertainty.
- It shows the limits of what we can represent efficiently using current standards and technologies.
Introduction
In what follows we will:

- Give some brief background on the modelling of etymologies and the challenges (and
opportunities) of representing them as richly structured data.
- Discuss the Lexical Markup Framework (LMF), both in its previous incarnation as a
single-part standard and in its current status as a multi-part standard.
- Describe the LMF-Ety model itself using an example taken from a Latin
etymological dictionary.
Etymologies
- The word etymology comes from the Greek ἐτυμολογία (etumología), from ἔτυμον
(étumon, “true sense”) and -λογία (-logía, “study of”), and has at least two different
(related) senses. It can refer to:
  - A sub-discipline of historical linguistics that is concerned with the development of individual words
  (and other lexemes) over time and which attempts to trace their origins as far back as the record
  supports.
  - A single such history of a word (or other lexical phenomenon).
- We will mostly focus on the second sense in what follows.
- Etymologies as word histories potentially lend themselves quite well to being
represented as graphs.
- However, the necessity of explicitly representing different etymological hypotheses
and time can render this quite tricky.
Modelling Etymologies
- In the next few slides we will look at some of the main concepts which we have singled
out as primary to the study of etymology.
- These concepts constitute the core of a meta-model which is implemented in the
LMF-Ety model and in previous and ongoing work in TEI and RDF.
- In the interests of interoperability, our model therefore follows some of the formal
constraints of these data models/frameworks (esp. w/r/t time).
- We will also look at some example etymological entries in order to
motivate our modelling choices.
Example Entry
GIRL, a female child, young woman. (E.) ME. gerle, girle, gyrle, formerly used of
either sex, and signifying either a boy or girl. In Chaucer, C.T. 3767 (A 3769) gerl
is a young woman; but in C.T. 666 (A 664), the pl. girles means young people of
both sexes. In Will. of Palerne, 816, and King Alisander, 2802, it means ‘young
women;’ in P. Plowman, B.i.33, it means ‘boys;’ cf. B. x. 175. Answering to an AS.
form *gyr-el-, Teut. *gur-wil-, a dimin. form from Teut. base *gur-. Cf. NFries. gor,
a girl; Pomeran. goer, a child; O. Low G. gor, a child; see Bremen Wörterbuch, ii.
528. Cf. Swiss gurre, gurrli, a depreciatory term for a girl; Sanders, G. Dict. i. 609,
641; also Norw. gorre, a small child (Aasen); Swed. dial. garra, guerre (the same).
Root uncertain. Der. girl-ish, girlish-ly, girl-ish-ness, girl-hood.
Example Entry
[The GIRL entry above is repeated on this slide with two annotations: "Description of the
history and evolution of the word" and "Information about previous forms and meanings".]
Example Entry
[The GIRL entry is repeated once more on this slide, with part of it annotated
"Information on Cognates".]
Another Example Entry
girl, whence girlish, derives from ME girle, varr gerle, gurle: o.o.o.: perh of C origin: cf
Ga and Ir caile, EIr cale, a girl; with Anglo-Ir girleen (dim -een), a (young) girl, cf
Ga-Ir cailin (dim -in), a girl. But far more prob, girl is of Gmc origin: Whitehall
postulates the OE etymon *gyrela or *gyrele and adduces Southern E dial girls,
primrose blossoms, and grlopp, a lout, and tentatively LG goere, a young person
(either sex). Ult, perh, related to L puer, puella, with basic idea '(young) growing
thing'.
Another Example Entry
[The same entry for girl is repeated on this slide with the annotation "Three different
hypotheses for the origin of the same word".]
Another Example Etymology
The English word friar (‘member of certain religious orders’) has an interesting history…

modern English friar < Middle English frere, friar < Old French frere ‘brother’, also
‘member of a religious order of brothers’ < Latin frāter ‘brother’

We have two kinds of links between the items in the etymology. The links coloured salmon
pink on the slide mark words that are inherited from an earlier stage of a language or from
a parent language, and the blue link marks a borrowing from one language into another
(here, from Old French into Middle English). These links can be further subtyped.

NB. The ‘<’ symbol is commonly used in etymological sources to mean ‘is derived from’
Etymons and Cognates
At the core of our model are two concepts from the field of etymology, etymons and cognates,
which distinguish the lexical entries used in etymologies from *regular* lexical entries.

An etymon is one of the sources for another word. That is, it is a direct ancestor of a given
word, whereas a cognate has an ancestor in common with a word. For instance, the English
word obligation has amongst its etymons the Latin words obligare and ligare and the
reconstructed Proto-Indo-European root *leig-. Its cognates include ligament and league in
English, obbligare in Italian, and obrigado in Portuguese.
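
The distinction between etymons and cognates can be sketched as a small graph computation. This is only an illustration, not part of LMF-Ety: the `SOURCES` table and the function names are assumptions, and the edges are simplified (intermediate forms between, say, ligare and ligament are left out).

```python
# Simplified "is derived from" edges: word -> its immediate sources.
# Intermediate forms are omitted, so this is illustrative data only.
SOURCES = {
    "obligation (en)": ["obligare (la)"],
    "obbligare (it)": ["obligare (la)"],
    "obrigado (pt)": ["obligare (la)"],
    "obligare (la)": ["ligare (la)"],
    "ligament (en)": ["ligare (la)"],
    "league (en)": ["ligare (la)"],
    "ligare (la)": ["*leig- (pie)"],
}


def etymons(word):
    """Return every ancestor of `word`, nearest first."""
    found = []
    stack = list(SOURCES.get(word, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(SOURCES.get(current, []))
    return found


def cognates(word):
    """Return words that share an ancestor with `word` but are not its ancestors."""
    own = set(etymons(word))
    return sorted(
        other
        for other in SOURCES
        if other != word and other not in own and own & set(etymons(other))
    )


print(etymons("obligation (en)"))
print(cognates("obligation (en)"))
```

With this toy data, the etymons of obligation are obligare, ligare and *leig-, and its cognates are league, ligament, obbligare and obrigado.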
Etymologies and Etymological Links
- It would be simpler to model etymologies as simple chains of etymons (adding links
to cognates to build simple trees).
- However, if we want to predicate information about different etymologies/hypotheses
(different chains of etymons), and even potentially reason about them, it is better to
make etymologies objects in their own right.
- This makes it easy to attach bibliographic information and references to the
secondary literature to individual etymologies.
- Similarly, reifying etymological links (making them a type of individual) makes it
much easier to describe their provenance.
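
A minimal sketch of what reification buys us, using the competing hypotheses for girl from the second example entry. The class and field names (`provenance`, `certainty`, `shared_links`) are hypothetical and are not taken from the LMF-Ety specification; treating Old English to Middle English as an inheritance link is our own reading.

```python
from dataclasses import dataclass, field


@dataclass
class EtyLink:
    """A reified link stating that `target` is derived from `source`."""
    source: str
    target: str
    link_type: str        # e.g. "inheritance" or "borrowing"
    provenance: str = ""  # who proposed the link (hypothetical field)


@dataclass
class Etymology:
    """One hypothesis about a word's history, as an object in its own right."""
    word: str
    label: str
    certainty: str        # hypothetical free-text field
    links: list = field(default_factory=list)


germanic = Etymology(
    word="girl",
    label="Germanic origin",
    certainty="far more probable",
    links=[
        EtyLink("OE *gyrela", "ME girle", "inheritance", provenance="Whitehall"),
        EtyLink("ME girle", "girl", "inheritance"),
    ],
)
celtic = Etymology(
    word="girl",
    label="Celtic origin",
    certainty="perhaps",
    links=[EtyLink("ME girle", "girl", "inheritance")],
)


def shared_links(a, b):
    """Return the links that two hypotheses have in common."""
    return [link for link in a.links if link in b.links]


print(shared_links(germanic, celtic))
```

Because each hypothesis and each link is an individual, the source of a claim (here Whitehall) attaches to exactly the link it supports, and two hypotheses can be compared link by link.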
Adding Time...
- Since we would like elements of this meta-model to be implemented in RDF as well, we
needed to respect the constraints of that framework w/r/t adding temporal parameters
to elements and to the relations between them.
- E.g., in order to express that a lexical entry l had the sense s1 during the temporal
interval t1 and the sense s2 during the temporal interval t2, we cannot use a simple
three-place relation, as in hasSense(l, s1, t1) and hasSense(l, s2, t2), because RDF
properties are binary.
- For this reason we decided to use a well-known approach in ontology modelling: the
perdurantist approach.
- According to this approach we can associate a 'lifespan' with individual lexical
entries, senses, etymologies, etc. Individuals can also have temporal parts during
which different properties hold.
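
The idea of temporal parts can be sketched with plain subject-predicate-object tuples. This is an illustration under assumptions: the property names (`temporalPartOf`, `hasStart`, `hasEnd`, `hasSense`) and the way the time-slice identifier is built are made up for the example and do not come from any particular vocabulary.

```python
def temporal_part_triples(entry, sense, start, end):
    """Say 'entry had sense during [start, end]' using only binary relations."""
    part = f"{entry}@{start}-{end}"  # a time-slice (temporal part) of the entry
    return [
        (part, "temporalPartOf", entry),
        (part, "hasStart", start),
        (part, "hasEnd", end),
        (part, "hasSense", sense),
    ]


def senses_by_interval(triples, entry):
    """Map each (start, end) interval of `entry` to the sense held during it."""
    parts = [s for s, p, o in triples if p == "temporalPartOf" and o == entry]
    result = {}
    for part in parts:
        facts = {p: o for s, p, o in triples if s == part}
        result[(facts["hasStart"], facts["hasEnd"])] = facts["hasSense"]
    return result


# Entry l has sense s1 during interval t1 and sense s2 during interval t2.
triples = (
    temporal_part_triples("l", "s1", "t1_start", "t1_end")
    + temporal_part_triples("l", "s2", "t2_start", "t2_end")
)
print(senses_by_interval(triples, "l"))
```

The three-place statement is replaced by a new individual (the temporal part) that carries the interval and the sense through ordinary two-place properties.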
The Lexical Markup Framework
- The Lexical Markup Framework (ISO 24613:2008) was first published as a
standard in 2008 by the International Organization for Standardization (ISO) as a “standard
framework for the construction of computer lexicons”.
- Used by several organizations and in a number of national and international
projects.
- Very influential on other more recent standards such as (Ontolex-)Lemon,
TEI-Lex0, and the forthcoming ELEXIS standard.
- As a result of a review it was decided in 2016 to revise LMF as a multi-part
standard (improved modularity).
- The first part of the new version has already been published, and four of the other
parts are at an advanced stage of completion.
The Lexical Markup Framework
Comes in several parts:

- Core Model (ISO 24613-1) - already published
- Machine Readable Dictionaries (ISO 24613-2) - due to be published this year
- Diachrony-Etymology (ISO 24613-3) - CD stage
- TEI serialisation (ISO 24613-4) - serialisation of the other parts in TEI, also closely
aligned with the new TEI-Lex0 standard
- LBX serialisation (ISO 24613-5)
- Syntax and Semantics (ISO 24613-6)
- Inflectional Morphology (ISO 24613-7)
LMF - Core Model
- First part of the standard, and the only part that is mandatory.
- Consists of classes and attributes which are fundamental for the creation of
computational lexica.
- These include the class Lexical Resource which serves to group together different
individuals of the class Lexicon. The latter class is a container for a collection of
individuals of the class Lexical Entry.
- Other fundamental Core classes include Form, Sense, and Definition.
- One important new class (not in the previous LMF) is CrossRef which provides a
way of modelling generic links and makes LMF more interoperable with linked data.
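
The containment relations between the core classes can be pictured roughly as follows. This is a sketch only: the class names follow the slide, but every attribute name (`written_form`, `target`, `ref_type`, and so on) is an assumption made for illustration, not the attribute set defined in ISO 24613-1.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    text: str


@dataclass
class Sense:
    definitions: list = field(default_factory=list)


@dataclass
class Form:
    written_form: str


@dataclass
class CrossRef:
    target: str         # identifier or URI of the linked item (assumed attribute)
    ref_type: str = ""  # assumed attribute


@dataclass
class LexicalEntry:
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    cross_refs: list = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)


@dataclass
class LexicalResource:
    lexicons: list = field(default_factory=list)


girl = LexicalEntry(
    forms=[Form("girl")],
    senses=[Sense([Definition("a female child"), Definition("young woman")])],
)
resource = LexicalResource(lexicons=[Lexicon(language="en", entries=[girl])])
print(resource.lexicons[0].entries[0].forms[0].written_form)
```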
LMF-Ety
- LMF-Ety builds on top of the concepts in the first two parts of LMF, and
especially the core.
- Etymon and Cognate: Two elements modelled as subtypes of Lexical Entry.
Etymons are lexical entries from which a given lexical entry is derived via
some historical process; Cognates are lexical entries which share a common
ancestor with a given lexical entry.
- Additionally, Cognate Set represents the reification of a set of cognates.
LMF-Ety
- Etymology: A (container) element that represents a single history of a lexical
entry or other element.
- We associate Etymology individuals with an ordered series of EtyLink
instances; this allows us to define different etymologies featuring shared
elements. In addition, Etymology instances can be recursive. They can also be
typed to define the changes undergone according to any number of linguistic
processes.
- These elements (and others) are modelled as perdurants and can have
temporal intervals associated with them representing their lifespans.
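
The elements just described can be sketched using the friar example from earlier. This is a rough illustration, not the normative model: the attribute names, the `sub_etymologies` field used to show recursion, and the `all_links` helper are assumptions, and Cognate Set and lifespans are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    lemma: str
    language: str


class Etymon(LexicalEntry):
    """A lexical entry from which another entry is derived."""


class Cognate(LexicalEntry):
    """A lexical entry sharing a common ancestor with another entry."""


@dataclass
class EtyLink:
    source: LexicalEntry
    target: LexicalEntry
    link_type: str


@dataclass
class Etymology:
    links: list = field(default_factory=list)            # ordered EtyLink instances
    sub_etymologies: list = field(default_factory=list)  # nested Etymology instances

    def all_links(self):
        """Links of nested etymologies first, then this etymology's own links."""
        collected = []
        for sub in self.sub_etymologies:
            collected.extend(sub.all_links())
        collected.extend(self.links)
        return collected


latin = Etymon("frāter", "Latin")
old_french = Etymon("frere", "Old French")
middle_english = Etymon("frere", "Middle English")
friar = LexicalEntry("friar", "English")

french_history = Etymology(links=[EtyLink(latin, old_french, "inheritance")])
friar_history = Etymology(
    links=[
        EtyLink(old_french, middle_english, "borrowing"),
        EtyLink(middle_english, friar, "inheritance"),
    ],
    sub_etymologies=[french_history],
)

for link in friar_history.all_links():
    print(f"{link.source.lemma} ({link.source.language}) -> "
          f"{link.target.lemma} ({link.target.language}): {link.link_type}")
```

Keeping the Old French history as its own Etymology means it could be shared by the etymology of any other word borrowed from Old French frere.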
Example
Example in LMF-Ety
Conclusions
- Until recently there was little choice with respect to standards for
representing etymologies as structured digital data.
- Thanks to projects like ELEXIS and working groups like Ontolex-lemon and
TEI-Lex0, however, there is now a wider movement towards creating expressive
(and computationally efficient) standards and best practices for modelling and
publishing lexical data, and towards making existing standards more interoperable.
- Standards are important for facilitating interoperability and reusability of datasets and
especially for the creation of networks of etymologies.
- The work which we have presented here is part of this trend but it also seeks to link
up with wider trends in knowledge representation in DH.
Acknowledgements
Thanks to Francesca Frontini for her help in preparing these slides. Thanks also to
Laurent Romary for his guidance throughout the whole process of preparing the
LMF-Ety drafts. The first author was supported by the EU H2020 programme
under grant agreement 731015 (ELEXIS - European Lexical Infrastructure).
