
LinkSyr: Linking Syriac Data

Report of the CLARIAH Research Pilot (2017–2018).

1. General outline
1.1 Abstract
How do the Biblical heritage and Hellenistic culture interact in the oldest documents of Syriac Chris-
tianity? The Eep Talstra Centre for Bible and Computer (ETCBC) at the Vrije Universiteit (VU), Am-
sterdam, investigates this question in the CLARIAH pilot project LinkSyr (2017–2018), using linguistic
data processing, especially topic modelling. The Syriac Book of the Laws of the Countries (BLC), written
by the 2nd/3rd-cent. Syriac philosopher Bardaisan, is compared with the ancient Syriac translation
of the Bible (“Peshitta”, 2nd cent.), other sources from the ancient Mesopotamian and Hellenistic
world, and later authors (especially the 4th-cent. author Ephrem the Syrian) who react to Bardaisan’s
teachings. The analysed texts are exposed as Linked Open Data and related to the lexicographical
and encyclopaedic resources of SEDRA (Beth Mardutho) and Syriaca. Syriaca provides
URIs for a large number of place names and person names for the Syriac heritage, whereas SEDRA
contains dictionary information for a list of more than 50,000 lexemes. LinkSyr has been extended
thanks to grants from Pelagios (§2.2.1), DANS (§2.2.2) and the Van Moorsel and Rijnierse Foundation
(§2.2.3) and cooperation projects with Brill publishers (§2.2.4) and the GREgORI project (§2.2.5).

1.2 Earlier work on Syriac at the ETCBC and the Peshitta Institute
In various places of the world, pioneering work on Syriac computing has taken place over the last
decades. 1 Here we will restrict ourselves to the predecessors of the current projects on Syriac at the
ETCBC.

1.2.1 1980s: Concordance


In the 1980s Pier Giorgio Borbone developed a pioneering basic parsing program which provided input
for the Concordance to the Pentateuch. These were rule-based parsers with human correction.
The output of the program consisted of lists of lexeme identifications, which had to be corrected
manually. 2 So far only one volume of the Concordance (the Pentateuch volume) has appeared. 3 The
delay of the other volumes was due to the challenges involved in organizing and directing the scholars
who were assigned the manual correction of the output of these programs and the addition of
Hebrew equivalents. The latter was also done manually and is complicated by the fact that more
than once the Peshitta does not provide a word-by-word translation. 4

1
See George Anton Kiraz, ''Forty Years of Syriac Computing'', Hugoye: Journal of Syriac Studies 10/1 (2007).
2
For a description of Borbone’s programs, called Obélix (analysis program) and Idéfix (printing program) see
Pier Giorgio Borbone and Francesco Mandracci, ''Another Way to Analyze Syriac Texts: A Simple Powerful Tool
to Draw up Syriac Computer Aided Concordances'', Bible et informatique: methodes, outils, resultats. Actes du
Second Colloque International, Jerusalem, (Israel), 9-13 juin 1988 (Travaux de linguistique quantitative 43, Col-
lection DEBORA 5. Paris: Champion, 1989), 135–145.
3
P.G. Borbone, K.D. Jenner et al., The Old Testament in Syriac according to the Peshiṭta Version V. Concord-
ance 1. The Pentateuch (Leiden: Brill, 1997).
4
See, e.g., Moshe A. Zipor, review of Concordance Vol. I, in Leshonenu 65 (2002–03) 77–89.
1.2.2 1999–2010: CALAP and Turgama
In 1999 the ETCBC (at that time called “WIVU”) started cooperation with the Peshitta Institute Lei-
den (which has now moved to VU Amsterdam as well) in the NWO projects CALAP (Computer-As-
sisted Analysis of the Peshitta, 1999–2005) and Turgama (2005–2010). 5 The linguistic annotations at
word level were created in a process of computer-human interactive encoding of distributional mor-
phological phenomena based on pattern recognition, from which functional deductions were made. 6
The analysis also involved linguistic analysis at the level of phrases, clauses, sentences and text-hier-
archical structures.

1.2.3 From 2014: Polemics Visualized


In 2012 Van Peursen, one of the two General Editors of the Peshitta Project, moved from Leiden University
to the VU in Amsterdam and became director of the ETCBC. Some years later, Bas Ter Haar
Romeny, the other General Editor, also moved to the VU. With them, the Peshitta Institute moved
from Leiden to Amsterdam. The combination of Peshitta studies at the Peshitta Institute and
the application of DH to the Bible at the ETCBC provided a fertile ground for the computational anal-
ysis of the Peshitta.

In 2014 the application of computational techniques to Syriac studies received an incentive from the
projects Polemics Visualized, an Academy Assistant project of the Network Institute (2014–2015;
see description on CLTL website) and Do you see what I am talking about? Towards a Topic Visual-
izer for Syriac Texts, Research Voucher of Network Institute (2015; see the Project report on
Academia and the topic visualizer on the Frontwise website). These projects were our first experi-
ments with Syriac topic modeling and visualization. Related to this pilot project was an MA intern-
ship of Mignon van Bokhoven (2015), who encoded a section of Ephrem’s Prose Refutations. In the
meantime, one of the programmers in the LinkSyr project, Mathias Coeckelbergs, worked on Topic
Modeling on ancient Hebrew data. 7

At this stage we were investigating how we could apply new developments in the rapidly evolving
field of DH to Syriac studies. The ideas about the current CLARIAH project were first expressed in the
conference paper ‘Linking Syriac Data’ by Wido van Peursen at the Perspectives on Linguistics and
Ancient Languages International Conference, St Petersburg, 30 June–4 July 2014. The experiments
we had done since then and the CLARIAH funding provided the opportunity to take further steps in
the realization of these ideas.

5
See the project page in DANS EASY.
6
Cf. Peursen, W.Th. van, ‘Progress Report: Three Leiden Projects on the Syriac Text of Ben Sira’, in R. Egger-
Wenzel (ed.), Ben Sira’s God. Proceedings of the Second International Ben Sira Conference, Durham, Ushaw
College, 2001 (Beihefte zur Zeitschrift für die Alttestamentliche Wissenschaft 321; Berlin: De Gruyter, 2002)
361–370, esp. 366–369; idem, ‘How to Establish a Verbal Paradigm on the Basis of Ancient Syriac Manuscripts’,
in M. Rosner and S. Wintner (eds.), Proceedings of the EACL 2009 Workshop on Computational Approaches to
Semitic Languages, 31 March 2009, Megaron Athens International Conference Centre, Athens, Greece
(http://staff.um.edu.mt/mros1/casl09) 1–9. [Academia].
7
See M. Coeckelbergs and S. Van Hooland, ‘Modeling the Hebrew Bible. Potential of Topic Modeling Tech-
niques for Semantic Annotation and Historical Analysis’, Proceedings of the FCAIR 2013 Formal Concept Analy-
sis meets Information Retrieval workshop, co-located with the 35th European Conference on Information Re-
trieval (ECIR 2013) vol. 2 (2016), 47–52.
2. Project Organization
2.1 Project members and partners
The following people were involved in the LinkSyr project:

• Wido van Peursen (ETCBC, VU, Principal Investigator)


• Dirk Roorda (researcher at DANS)
• Femmy Admiraal (linguist, DANS)
• Mathias Coeckelbergs (scientific programmer, Université libre de Bruxelles & KU Leuven)
• Srecko Koralija (Peshitta scholar, University of Cambridge)
• Reinier de Valk (Linked Data, DANS; till 1 February 2018)
• Geert Jan Veldman (Syriac scholar, ETCBC and Peshitta Institute)
• Hannes Vlaardingerbroek (scientific programmer, ETCBC)

The LinkSyr project started with the CLARIAH research pilot LinkSyr: Linking Syriac Data and this was
the primary context of the project. The project members attended CLARIAH meetings and met CLARIAH
people on other occasions (e.g. DHBenelux). CLARIAH was also helpful for building their network:
the contact with Dr Cornelis van Lit, with whom they will organize a DH2019 workshop
(below, §2.3.1), was established through CLARIAH (Arjan van Hessen). They had promising discussions
with the project leader of CLARIAH-PLUS WP6 (Karina van Dalen-Oskam) about the further
development of their work.

The project collaborated with Beth Mardutho and Syriaca (above, §1.1). The collaboration was inten-
sified through the workshop in March 2018 and the bootcamp in January 2019 (below, §2.3.1) in
which people from both partners, as well as from other projects, participated.

2.2 Related projects


During the project, the ETCBC received some other grants for activities related to the LinkSyr pro-
ject. 8

2.2.1 Pelagios Resource Development Grant: Linking Syriac Geographic Data


Pelagios is a community and infrastructure for Linked Open Geodata in the Humanities. In 2018 they
awarded us a Resource Development Grant for a project on Syriac data. This project brought two new
resources to the Pelagios community: a dataset of over 2500 place entities from The Syriac Gazetteer
and a dataset derived from a linguistically annotated Syriac text, the Book of the Laws of the Coun-
tries (ca. 200 AD), which contains many geographic references. Using a fuzzy term matching tech-
nique, we developed an algorithm which looks for matches in any given text with the Syriaca named
entities. In general this methodology provides salient results, but further work was necessary on the
attested variations, which are often not present in the Syriaca database. Based on our findings we
produced a workflow for using an increasing amount of novel textual data to provide matches possi-
bly indicating variants of already known named entities. After manually investigating these findings,
we could improve the current named entities database of Syriaca, leading to an ever more correct
tool for future use.
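The fuzzy term matching described above can be illustrated with a minimal sketch. It is not the project's own program: the similarity measure is not specified in this report, so the standard-library `difflib.SequenceMatcher` ratio is used here as a stand-in, and the transliterated names and `example.org` URIs are hypothetical placeholders. The default threshold of 0.8 follows the value mentioned in footnote 9.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two strings, from 0.0 to 1.0 (assumed measure, not the project's)."""
    return SequenceMatcher(None, a, b).ratio()


def match_places(tokens, gazetteer, threshold=0.8):
    """For each token, keep the best-scoring gazetteer name if it reaches the threshold.

    gazetteer maps a place name to its URI (hypothetical data in this sketch).
    Returns a list of (token, place_name, uri, score) tuples.
    """
    matches = []
    for token in tokens:
        best = max(
            ((similarity(token, name), name, uri) for name, uri in gazetteer.items()),
            default=None,
        )
        if best is not None and best[0] >= threshold:
            score, name, uri = best
            matches.append((token, name, uri, round(score, 2)))
    return matches


if __name__ == "__main__":
    # Hypothetical transliterated names and placeholder URIs.
    gazetteer = {
        "nisibin": "https://example.org/place/nisibin",
        "urhay": "https://example.org/place/urhay",
    }
    for match in match_places(["nsibin", "malka"], gazetteer):
        print(match)
```

Matches that pass the threshold are only candidates for variants of known names; as the text notes, they still need manual inspection before the named entity database is updated.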

8
To integrate the deliverables of these projects in the LinkSyr project, the end of the LinkSyr project was extended
till 1 January 2019 with permission from the CLARIAH board.
In this context the cooperation with the LinkSyr partner Beth Mardutho, the developers of the
SEDRA database (§1.1), proved to be fruitful as well. SEDRA provides a morphological analysis for the
most common surface forms. Accordingly, words that do not have a match in SEDRA are good candidates
for being named entities. This was used in the matching algorithm to recognize place names in
new texts. 9
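The filtering idea in this paragraph can be sketched as follows. This is an illustration only: `known_forms` stands for the set of surface forms that have an analysis in SEDRA, and both that set and the transliterated tokens below are hypothetical placeholders, not SEDRA data.

```python
def named_entity_candidates(tokens, known_forms):
    """Return the tokens that have no match among the known surface forms.

    Order of first occurrence is kept and duplicates are dropped, so the
    result can be passed on to a later matching step against place names.
    """
    seen = set()
    candidates = []
    for token in tokens:
        if token not in known_forms and token not in seen:
            seen.add(token)
            candidates.append(token)
    return candidates


if __name__ == "__main__":
    # Hypothetical transliterated tokens and lexicon.
    known_forms = {"mlk'", "w", "mn"}
    tokens = ["mlk'", "mn", "nsybyn", "w", "nsybyn"]
    print(named_entity_candidates(tokens, known_forms))
```

A token that survives this filter is not necessarily a named entity; it is only a candidate that a later similarity check, as described in footnote 9, can confirm or reject.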

2.2.2 ‘Klein Data Project’ DANS: Linking Syriac Liturgies


This project (2017–2018) concerned the digitization of Syriac liturgical traditions based on:

a) notes and card catalogues concerning the liturgical readings of the lectionaries as they were
written out by Wim Baars in the 1970s. This material comprises 43 lectionary manuscripts
(39 lectionaries in total) dating from the 9th–16th centuries;
b) the inventory of the liturgical readings in 39 Syriac biblical manuscripts from the 5th–9th centuries.
These lists were included as appendices in the unpublished Dutch PhD dissertation
(1993) of Konrad Jenner.

The now digitized materials are preserved at the Peshitta Institute Amsterdam. They were compiled
as part of the preparation of the critical edition of Vetus Testamentum Syriace, so the first
focus was the correct description of the biblical pericopes themselves. In the present dataset, other
data have been added, for instance extra-biblical or liturgical words and phrases, codicological remarks,
links to the bibliographical description of the manuscripts, links to relevant publications, and links to
already digitized photographs of the manuscripts. With regard to these extra data, there are
still instances where the data need to be enhanced and enriched, and some questions
remain unanswered. These will be dealt with in the near future.

2.2.3 Van Moorsel and Rijnierse: Network and Workshop: Linked Data and Syriac Sources
This project provided the funds to organize the meetings related to the LinkSyr project: a workshop
(March 2018), a bootcamp (January 2019) and a public event (February 2019). For details see §2.3.1.

2.2.4 Brill Publishers for Peshitta Online


This project concerns the production of an electronic version of the complete Peshitta of the Old
Testament as published in the Brill series Vetus Testamentum Syriace. For the current project it is
not only interesting that in this way a large Syriac corpus is created, 10 but also that as part of this
project Hannes Vlaardingerbroek invested in the development and optimization of an OCR tool. This
tool works very well with printed texts in the Estrangelo script.

2.2.5 GREgORI
We have initiated cooperation with the GREgORI project. This concerned the use of materials devel-
oped at the ETCBC for the GREgORI concordances project (see the pilot with the Prayer of Manas-
seh).

9
If the similarity between such a word and a Syriac place name that we have in other sources
(Syriaca and the sources mentioned in §5.1) is above the threshold of 0.8, there is a high probability
that we are dealing with an as yet unattested (orthographical) variant of that place name.
10
We agreed with Brill that the ETCBC can use the running text (the main text of the edition) for its own re-
search purposes. Brill will have the copyright of the critical apparatus.
2.3 Deliverables
2.3.1 Events
Within the framework of the LinkSyr project and its satellite projects, the following events have been
organized:

• Workshop Linked Data and Syriac Sources, Amsterdam, 12–13 March 2018.
• Bootcamp: NLP tools for Syriac, 17–18 January 2019.
• Public event Digitaal Onderzoek naar Syrische Bronnen, Glane, Syriac Orthodox Monas-
tery, 20 Feb 2019; presentations by Van Peursen [slideshare], Veldman, Coeckelbergs.
• Workshop From Manuscript to Database, DH2019, Utrecht, 9–12 July 2019. (Accepted)

2.3.2 Papers and presentations


The project has been presented at the following occasions:

• CLARIAH toogdag, The Hague, 10 March 2017: See Powerpoint on Academia.


• International Workshop on Lemmatization and Morphological Analysis of Syriac Texts,
Manchester, 31 May 2017.
• ETCBC≥40, Celebrating 40 Years of Bible and Computer, Amsterdam, 31 Oct–1 Nov 2017.
• Participation in CLARIAH Tech-dag, 29 March 2018.
• DH Benelux 2018, Amsterdam, 6–8 June 2018. Abstract at Congres website.
• CLARIAH toogdag, Utrecht, 19 October 2018.
• Linked Pasts IV: Views From Inside The LOD-cloud, Mainz, 11–13 December 2018. Power-
point on Slideshare.
• Synergy2019, Bussum, 7 February 2019. See Powerpoint on Slideshare.
• Workshop of the International Syriac Language Project at the 23rd Congress of the Inter-
national Organization for the Study of the Old Testament (IOSOT), 4–9 August 2019 (Ac-
cepted).
• HUNAYNNET workshop: Coding and Encoding: New Approaches to the Study of Syriac
and Arabic Translations of Greek Scientific and Philosophical Texts, 10–12 October 2019
(invited presentation).

2.3.3 Website, blogs and announcements


• Description of the LinkSyr project at CLARIAH website.
• Announcements
o ‘Network and Workshop’ grant, NWO website.
o KDP ‘Linking Syriac Liturgies’, DANS website.
o Workshop ‘Linked Data and Syriac Sources’, ETCBC website.
o Bootcamp ‘NLP tools for Syriac’, ETCBC website.
o Bootcamp ‘NLP tools for Syriac’, DANS website.
o Pelagios Development grant, Pelagios website (2 May 2018).
• Reports
o Workshop ‘Linked Data and Syriac Sources’ by Srecko Koralija, ETCBC website.
o Workshop ‘Linked Data and Syriac Sources’ by Rachel Dryden in Hugoye.
o Bootcamp ‘NLP tools for Syriac’ by Constantijn Sikkel, ETCBC website.
• Blogposts on the Pelagios website:
o ‘Fate, free will, and ancient geography’ (5 June 2018).
o ‘Linking Syriac Geographic Data Working Group: Fuzzy Matching and Data Rec-
onciliation’ (1 October 2018).
o ‘Linking Syriac Geographic Data Working Group: Place Name Detection and
Comparison with Hebrew (Final Report)’ (21 February 2019).
• ‘Linked Data en Syrische Bronnen’, E-Data&Research 12/1 (2017), p. 5.
• ‘LinkSyr: Linking Syriac Data’, CLARIAH A Digital Research Infrastructure for the Humani-
ties Researchers (CLARIAH publication 2019), pp. 36–37.

3. Data and tools


3.1 Data
3.1.1 Texts
The LinkSyr project repository on github includes the annotated texts from the CALAP and Turgama
projects (above, §1.2.2): Book of the Laws of the Countries; Peshitta Judges (only morphology),
Kings, Ben Sira, Psalms 1–30, the Epistle of Baruch and the Prayer of Manasseh; and further an annotated
sermon by Ephrem on the Ninevites. The level of annotation of these texts differs, from only
running text to rich linguistic annotations on the levels of word, phrase, clause and text. This is described
in the repository’s Read Me. Another corpus, not yet on github, concerns the remaining
parts of the Ephrem corpus (ca. 462,400 words).

The Peshitta Old Testament and the Peshitta New Testament have been stored in their own
repositories. For the New Testament the annotations from SEDRA are included (because of
differences in format and encoding, it was better to give both the OT and the NT their own
repository). Both are available in Text-Fabric. The corresponding TF-apps refer to these repositories
for the feature documentation. The apps themselves are in yet other github repos for the OT and the
NT. 11

3.1.2 Liturgical data


The KDP on liturgical data (§2.2.2) has yielded data in the form of tables, rather than running text,
including information on textual readings, the liturgical calendars, the various liturgical traditions,
lectionary and other manuscripts etc. These documents have been submitted as the deliverable of
the KDP project and are available at DANS. Discussions with Daniel Stökl are ongoing to link these
data to the ThALES lectionaries database.

3.1.3 Concordance indices


From the Concordance project, we have lists of lemmata and their occurrences in many biblical
books. These lists are based on the output of Borbone’s programs (see above, §1.2.1). For the Pentateuch,
these have been corrected manually in the preparation of Volume 1 of the Concordance. The
other books of the Bible have not been corrected, or only partially. The appendices of the Concordance
contain a list of proper names.

11
There are still some loose ends in the Peshitta data, related to the headings. When that has been settled, we
will consider hosting the Peshitta OT and NT on ancient-data.org.
3.2 Tools
The following programs have been developed in the LinkSyr project (see also the LinkSyr programs
folder on github):

• syrocr, which “provides an interface to several Python modules aimed at optical recogni-
tion of Syriac text, in a manually supervised automated workflow.”
• Parsing programs: MorphAn; SedraIII (a Python parser for the SEDRA III database); SyrNT
(a Python parser for SyroMorph, which included an NT text somewhat different from the
SEDRA III text). 12
• Linked Data programs, including Recogito_reconciler.py, Syriaca_expansion.py, Term-
matching
• Topic modeling programs

4. Workflow
The various tools comprise a pipeline from OCR to Linked Data.

4.1 OCR
The first step is to create a Syriac document as electronic
text. This may involve the digitization of a non-digital
source (printed book, manuscript) or of a digital source that
doesn’t have a text-format but can only be accessed as an
image. For various corpora electronic texts are already
available in the ETCBC database (see §3.1.1) or in other re-
sources such as the Digital Syriac Corpus Project or the
Comprehensive Aramaic Lexicon.

4.2 Corpora preparation


The electronic corpora need to be prepared for auto-
matic analysis. This consists mostly of converting digital
texts to a uniform data format, proofreading the texts
for obvious errors and structuring the texts into sen-
tences. Converting the texts to a uniform data format
can mostly be done automatically, but structuring and
especially proofreading also require more labour-inten-
sive manual inspection.
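The automatic part of this preparation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's conversion code: it assumes Unicode input, uses NFC as the uniform normalization form, and treats the ASCII full stop and the Syriac marks U+0700 to U+0702 as sentence-final punctuation. Real corpora may follow other conventions, and the function name `prepare` is made up for this example.

```python
import re
import unicodedata

# Assumed sentence-final marks: "." plus SYRIAC END OF PARAGRAPH (U+0700),
# SUPRALINEAR FULL STOP (U+0701) and SUBLINEAR FULL STOP (U+0702).
SENTENCE = re.compile(r"[^.\u0700-\u0702]+[.\u0700-\u0702]?")


def prepare(raw):
    """Normalize a raw text and split it into a list of sentences."""
    text = unicodedata.normalize("NFC", raw)   # one uniform Unicode form
    text = re.sub(r"\s+", " ", text).strip()   # collapse line breaks and runs of spaces
    sentences = (chunk.strip() for chunk in SENTENCE.findall(text))
    return [sentence for sentence in sentences if sentence]
```

Proofreading cannot be automated in this way; the sketch only covers format conversion and a first, rough sentence segmentation that still needs manual inspection.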

12
The output of the parsers SedraIII and SyrNT was used as input for MorphAn. Both are based on the
same data; SedraIII contains more details (about lexicon and etymology), but SyrNT was easier to parse.
4.3 POS parsing

4.3.1 Language model training sets


For the analysis of the electronic corpora, we use the tagged corpora that are already available.
From these corpora language model training sets are prepared. For this we have conducted experi-
ments with OpenNLP, and with the Python NLTK library combined with a self-developed morphologi-
cal analyzer for Syriac, MorphAn (see §3.2). Tests with MorphAn models trained and tested on
tagged corpora indicate an accuracy level of 82%, for combined segmentation and PoS-tagging. Our
current aim is to improve that result with a Brill-tagger, which uses transformation rules to optimize
the PoS-analysis. We hope that a transformation rule-based approach will give better results than
the more common approach using Hidden Markov Models, given the small size of our tagged cor-
pora (ca. 150,000 tagged words).
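To make the idea of transformation rules concrete, the following toy sketch learns rules of a single template ("change tag A to tag B when the previous tag is C") on top of a most-frequent-tag baseline. It is a simplified illustration of transformation-based tagging in general, not the NLTK Brill tagger and not the MorphAn pipeline; the function names, the default tag `"N"` and the single rule template are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_lexicon(sents):
    """Baseline: map every word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a list of tags."""
    frm, to, prev = rule
    return tags[:1] + [
        to if tags[i] == frm and tags[i - 1] == prev else tags[i]
        for i in range(1, len(tags))
    ]


def learn_rules(sents, lexicon, default="N", max_rules=10):
    """Greedily pick the rules that remove the most baseline errors."""
    gold = [[tag for _, tag in sent] for sent in sents]
    current = [[lexicon.get(word, default) for word, _ in sent] for sent in sents]
    rules = []
    for _ in range(max_rules):
        fixes, breaks = Counter(), Counter()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    fixes[(cur[i], ref[i], cur[i - 1])] += 1  # rule would repair this error
                else:
                    breaks[(cur[i], cur[i - 1])] += 1         # rule would damage a correct tag
        if not fixes:
            break
        best = max(fixes, key=lambda r: fixes[r] - breaks[(r[0], r[2])])
        if fixes[best] - breaks[(best[0], best[2])] <= 0:
            break
        rules.append(best)
        current = [apply_rule(tags, best) for tags in current]
    return rules


def tag(words, lexicon, rules, default="N"):
    """Tag a sentence: baseline lookup first, then the learned rules in order."""
    tags = [lexicon.get(word, default) for word in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags
```

Because each rule is learned from counts of corrected and damaged tags, the approach can work with relatively little training data, which is the motivation given above for preferring it with a small tagged corpus.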

Most of the tagged corpora already available have been
prepared by the interactive encoding programs of the
ETCBC that have been developed for Hebrew and
adapted for Syriac. See Cody Kingham’s description of
the ETCBC data creation pipeline. (The tagged corpora in
the ETCBC database and other tagged corpora such as
the NT in SEDRA use different parsing conventions. But
we will postpone efforts to harmonize them.)

4.3.2 Parsing and disambiguation


With the tools thus established, large portions of text can be parsed.

4.3.3 Named Entity Recognition


The final step is word-sense disambiguation and entity recognition, in order to map the analyzed
forms to the SEDRA and Syriaca databases. This flows from the POS tagging, but the identification of
Named Entities requires further tooling. In the Pelagios project (§2.2.1), special attention was paid to
the identification and disambiguation of Named Entities. Specific challenges involved orthographic
and phonological variation. [See our second Pelagios blogpost.]

4.5 Linked Data


The first completed task on the linked data aspect of the
project involves collecting the data stored in the Syriaca
and SEDRA databases. Once these data were brought to-
gether and converted into standard formats developed
by the W3C (RDF and JSON-LD), we performed term
matching experiments on the BLC text, thus taking the
first steps towards developing a platform where we can input a text, after which its words can be reconciled
with the information in Syriaca and SEDRA. Currently, we are exploring to what extent events
from a new text can be encoded using the URIs we already have at our disposal, so that a database
of facts can be created semi-automatically. The main difficulty lies in identifying salient relationships
among the URIs, which afterwards can be queried. Our ongoing research focuses on this identifica-
tion of relationships, as well as on the question of incorporating additional knowledge, such as lin-
guistic information, which then in turn can form the basis of an ever-widening set of questions ex-
ploring the interconnections of texts.
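As an illustration of what a record in one of these W3C formats can look like, the sketch below builds a small JSON-LD document for a place entity with Python's standard `json` module. The choice of the SKOS vocabulary, the `example.org` URI and the transliterated variant spellings are assumptions made for this example; they do not reproduce the actual data model of Syriaca, SEDRA or the LinkSyr repository.

```python
import json

# SKOS namespace (a W3C vocabulary); using it here is an assumption of this sketch.
SKOS = "http://www.w3.org/2004/02/skos/core#"


def place_to_jsonld(uri, pref_label, variants):
    """Build a JSON-LD description of a place with its variant spellings."""
    return {
        "@context": {"skos": SKOS},
        "@id": uri,
        "@type": "skos:Concept",
        "skos:prefLabel": {"@value": pref_label, "@language": "en"},
        "skos:altLabel": [{"@value": v, "@language": "syr"} for v in variants],
    }


if __name__ == "__main__":
    # Placeholder URI and hypothetical transliterated variants.
    doc = place_to_jsonld("https://example.org/place/1", "Nisibis", ["nsybyn", "nsybn"])
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Keeping attested variants as alternative labels on the same URI is one simple way to let term matching resolve different spellings to a single entity.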

4.6 Topic modelling


The searchability of named entities through linked data has created new avenues of research. Most
important for our purposes is the application of these named entities in text mining tasks. Within
the context of the increasing digitization of ancient Syriac texts, topic modelling techniques are useful
in searching corpora for topical relevance to named entities. For example, we can discover which topics
are most relevant to the city of Jerusalem. Due to its data-driven approach, topic modelling allows
for fine-grained contextual modelling. Traditional search methods, by contrast, rely on keyword selection
by researchers and are heavily dependent on textual annotations. In addition to data exploration, topic
modelling points to textual relationships between named entities, and allows us to investigate
specific lexical choices in the context of these entities. Our first experiments have highlighted lexical
differences between the Hebrew TaNaKh and Syriac Peshitta, modelled specifically for their most
important locations.
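One simple way to ask which topics are most relevant to a named entity is sketched below. It assumes that a topic model has already been fitted and has produced one topic distribution per document; the ranking by mean topic weight over the documents that mention the entity is an assumption of this example, not necessarily the measure used in the project's experiments.

```python
def topics_for_entity(doc_topics, doc_tokens, entity):
    """Rank topics by their mean weight in the documents that mention the entity.

    doc_topics: one list of topic weights per document (output of a topic model).
    doc_tokens: one list of tokens per document, in the same order.
    Returns (topic_index, mean_weight) pairs, most relevant topic first.
    """
    hits = [theta for theta, tokens in zip(doc_topics, doc_tokens) if entity in tokens]
    if not hits:
        return []
    n_topics = len(hits[0])
    means = [sum(theta[j] for theta in hits) / len(hits) for j in range(n_topics)]
    return sorted(enumerate(means), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical topic weights for three short documents and two topics.
    doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
    doc_tokens = [["jerusalem", "king"], ["river"], ["jerusalem"]]
    print(topics_for_entity(doc_topics, doc_tokens, "jerusalem"))
```

The top words of the highest-ranked topics can then be inspected to study lexical choices in the context of that entity.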

5. Integration: Peshitta Pentateuch Geography


The data from the Concordance project mentioned in §3.1.3 provided some interesting material for
a pilot. Initial questions were: can we connect the geographical names from the Peshitta Pentateuch
with the Pelagios project and the geographical data in Syriaca? Is it possible to link place names in
the Pentateuch to the Pelagios Gazetteer? After the initial examination, it became clear that there
are only a few matches between the Peshitta and Syriaca. Consequently, Mathias Coeckelbergs and
Srecko Koralija decided to make independent datasets to be used within Pelagios/LinkSyr, working
towards indices of place names that are not in Syriaca.

The chosen method was to follow, in general, the same two paths as for the Book of the Laws
of the Countries: (a) work with documentation of geographical names (for BLC: Dirk Bakker’s PhD
dissertation; for the Pentateuch: the Concordance); (b) import our texts into the Recogito tool of Pelagios.
Since Recogito only recognizes English names, Mathias Coeckelbergs wrote a program that does in
effect the same as Recogito. The connections that are established by this program can be added
manually to Pelagios. Additionally, Srecko Koralija provided a manually established list of place names
of the Pentateuch, detailing useful information concerning their location in the text and disambiguating
their meaning, location and references in the TaNaKh. He also included a bilingual reference
guide in Hebrew and Syriac. This in turn allowed Mathias Coeckelbergs to train a named entity recognition
tool. Within the context of other digitization efforts for Syriac works, this software is useful to
increase the detection of place names and their variants in new literature.

As the final step, we proceed to the question: is it possible to connect the Syriac data with the Hebrew
geographical data in the Pelagios Research pilot KIMA? Sinai Rusinek informed Van Peursen
that they have not yet included biblical names in the KIMA project but that they will try to obtain the
data from hatanakh. The way in which proper nouns and geographical locations have been updated
in the Ancient Versions is an interesting window on the work of the translators. 13 Even if
the integration of the hatanakh materials into KIMA is not possible, it is interesting to compare the
geographies from the Peshitta and the Syriac gazetteer with those of hatanakh, based on the Hebrew
text.

The findings of the pilot project for geographical names of the Peshitta that Srecko Koralija worked
on will be included in his doctoral dissertation (University of Cambridge) that will be submitted at
the end of 2019. His dissertation will also include a study of personal names in the Peshitta.

13
See, e.g., Ze'ev Safrai, Seeking out the Land: Land of Israel Traditions in Ancient Jewish, Christian and Samari-
tan Literature (200 BCE - 400 CE) (Leiden: Brill, 2018), 321ff.
