Clariah Research Pilot Linksyr Linking S
1. General outline
1.1 Abstract
How do the Biblical heritage and Hellenistic culture interact in the oldest documents of Syriac
Christianity? The Eep Talstra Centre for Bible and Computer (ETCBC) at the Vrije Universiteit (VU),
Amsterdam, investigates this question in the CLARIAH pilot project LinkSyr (2017–2018), using linguistic
data processing, especially topic modelling. The Syriac Book of the Laws of the Countries (BLC),
written by the 2nd/3rd-cent. Syriac philosopher Bardaisan, is compared with the ancient Syriac
translation of the Bible (“Peshitta”, 2nd cent.), other sources from the ancient Mesopotamian and
Hellenistic world, and later authors (especially the 4th-cent. author Ephrem the Syrian) who react to
Bardaisan’s teachings. The analysed texts are exposed as Linked Open Data and related to the
lexicographical and encyclopaedic resources of SEDRA (Beth Mardutho) and Syriaca. Syriaca provides
URIs for a large number of place names and person names for the Syriac heritage, whereas SEDRA
contains dictionary information for more than 50,000 lexemes. LinkSyr has been extended
thanks to grants from Pelagios (§2.2.1), DANS (§2.2.2) and the Van Moorsel and Rijnierse Foundation
(§2.2.3) and cooperation projects with Brill publishers (§2.2.4) and the GREgORI project (§2.2.5).
1.2 Earlier work on Syriac at the ETCBC and the Peshitta Institute
In various parts of the world, pioneering work on Syriac computing has taken place over recent
decades. 1 Here we restrict ourselves to the predecessors of the current projects on Syriac at the
ETCBC.
1
See George Anton Kiraz, ‘Forty Years of Syriac Computing’, Hugoye: Journal of Syriac Studies 10/1 (2007).
2
For a description of Borbone’s programs, called Obélix (analysis program) and Idéfix (printing program), see
Pier Giorgio Borbone and Francesco Mandracci, ‘Another Way to Analyze Syriac Texts: A Simple Powerful Tool
to Draw up Syriac Computer Aided Concordances’, Bible et informatique: methodes, outils, resultats. Actes du
Second Colloque International, Jerusalem (Israel), 9-13 juin 1988 (Travaux de linguistique quantitative 43,
Collection DEBORA 5. Paris: Champion, 1989), 135–145.
3
P.G. Borbone, K.D. Jenner et al., The Old Testament in Syriac according to the Peshiṭta Version V. Concord-
ance 1. The Pentateuch (Leiden: Brill, 1997).
4
See, e.g., Moshe A. Zipor, review of Concordance Vol. I, in Leshonenu 65 (2002–03) 77–89.
1.2.2 1999–2010: CALAP and Turgama
In 1999 the ETCBC (at that time called “WIVU”) started cooperation with the Peshitta Institute Lei-
den (which has now moved to VU Amsterdam as well) in the NWO projects CALAP (Computer-As-
sisted Analysis of the Peshitta, 1999–2005) and Turgama (2005–2010). 5 The linguistic annotations at
word level were created in a process of computer-human interactive encoding of distributional mor-
phological phenomena based on pattern recognition, from which functional deductions were made. 6
The analysis also involved linguistic analysis at the level of phrases, clauses, sentences and text-hier-
archical structures.
In 2014 the application of computational techniques to Syriac studies received a new impetus from two
projects: ‘Polemics Visualized’, an Academy Assistant project of the Network Institute (2014–2015;
see description on CLTL website), and ‘Do you see what I am talking about? Towards a Topic
Visualizer for Syriac Texts’, a Research Voucher of the Network Institute (2015; see the Project report on
Academia and the topic visualizer on the Frontwise website). These projects were our first
experiments with Syriac topic modeling and visualization. Related to this pilot project was the MA
internship of Mignon van Bokhoven (2015), who encoded a section of Ephrem’s Prose Refutations.
Meanwhile, one of the programmers in the LinkSyr project, Mathias Coeckelbergs, worked on topic
modeling of ancient Hebrew data. 7
At this stage we were investigating how we could apply new developments in the rapidly evolving
field of DH to Syriac studies. The ideas about the current CLARIAH project were first expressed in the
conference paper ‘Linking Syriac Data’ by Wido van Peursen at the Perspectives on Linguistics and
Ancient Languages International Conference, St Petersburg, 30 June–4 July 2014. The experiments
we had done since then and the CLARIAH funding provided the opportunity to take further steps in
the realization of these ideas.
5
See the project page in DANS EASY.
6
Cf. Peursen, W.Th. van, ‘Progress Report: Three Leiden Projects on the Syriac Text of Ben Sira’, in R. Egger-
Wenzel (ed.), Ben Sira’s God. Proceedings of the Second International Ben Sira Conference, Durham, Ushaw
College, 2001 (Beihefte zur Zeitschrift für die Alttestamentliche Wissenschaft 321; Berlin: De Gruyter, 2002)
361–370, esp. 366–369; idem, ‘How to Establish a Verbal Paradigm on the Basis of Ancient Syriac Manuscripts’,
in M. Rosner and S. Wintner (eds.), Proceedings of the EACL 2009 Workshop on Computational Approaches to
Semitic Languages, 31 March 2009, Megaron Athens International Conference Centre, Athens, Greece
(http://staff.um.edu.mt/mros1/casl09) 1–9. [Academia].
7
See M. Coeckelbergs and S. Van Hooland, ‘Modeling the Hebrew Bible. Potential of Topic Modeling Tech-
niques for Semantic Annotation and Historical Analysis’, Proceedings of the FCAIR 2013 Formal Concept Analy-
sis meets Information Retrieval workshop, co-located with the 35th European Conference on Information Re-
trieval (ECIR 2013) vol. 2 (2016), 47–52.
2. Project Organization
2.1 Project members and partners
The following people were involved in the LinkSyr project:
The LinkSyr project started with the CLARIAH research pilot LinkSyr: Linking Syriac Data, which was
the primary context of the project. The project members attended CLARIAH meetings and met
CLARIAH people on other occasions (e.g. DHBenelux). CLARIAH was also helpful for building their
network: for example, the contact with Dr Cornelis van Lit, with whom they will organize a DH2019
workshop (below, §2.3.1), was established through CLARIAH (Arjan van Hessen). They had promising
discussions with the project leader of CLARIAH-PLUS WP6 (Karina van Dalen-Oskam) about the further
development of their work.
The project collaborated with Beth Mardutho and Syriaca (above, §1.1). The collaboration was inten-
sified through the workshop in March 2018 and the bootcamp in January 2019 (below, §2.3.1) in
which people from both partners, as well as from other projects, participated.
8
To integrate the deliverables of these projects in the LinkSyr project, the end date of the LinkSyr project
was extended until 1 January 2019 with permission from the CLARIAH board.
In this context the cooperation with the LinkSyr partner Beth Mardutho, the developers of the
SEDRA database (§1.1), proved to be fruitful as well. SEDRA provides a morphological analysis for the
most common surface forms. Accordingly, words that do not have a match in SEDRA are good
candidates for being named entities. This was used in the matching algorithm to recognize place names in
new texts. 9
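The matching idea described above can be sketched as follows. This is a minimal illustration and not the project's actual program. It assumes that similarity is measured with the ratio from Python's difflib (the report does not say which measure was used), and the transliterated word forms, the function names and the toy data are hypothetical. Only the 0.8 threshold comes from footnote 9.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # threshold stated in footnote 9


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1 (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def find_place_name_candidates(tokens, sedra_forms, known_place_names,
                               threshold=SIMILARITY_THRESHOLD):
    """Map tokens that are unknown to the lexicon onto their closest place name."""
    candidates = {}
    for token in tokens:
        if token in sedra_forms:
            continue  # analysed as an ordinary word, so not a candidate
        scored = [(similarity(token, name), name) for name in known_place_names]
        if not scored:
            continue  # no gazetteer entries to compare with
        best_score, best_name = max(scored)
        if best_score >= threshold:
            candidates[token] = (best_name, round(best_score, 2))
    return candidates


# Toy data in an ad hoc Latin transliteration (hypothetical, illustration only).
sedra_forms = {"MLK>", "KTB>", "MDYNT>"}
known_place_names = ["NSYBYN", ">WRHY"]
tokens = ["MLK>", "NSYBN", "QLM"]

result = find_place_name_candidates(tokens, sedra_forms, known_place_names)
print(result)
```

In this sketch a token is reported only when it is both absent from the lexicon and close to a known place name, so that unknown words with no resemblance to any gazetteer entry are left for manual inspection.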
a) notes and card catalogues concerning the liturgical readings of the lectionaries as they were
written out by Wim Baars in the 1970s. This material comprises 43 lectionary manuscripts
(39 lectionaries in total) dating from the 9th–16th centuries;
b) the inventory of the liturgical readings in 39 Syriac biblical manuscripts from the 5th–9th
centuries. These lists were included as appendices in the unpublished Dutch PhD dissertation
(1993) of Konrad Jenner.
The now digitized materials are preserved at the Peshitta Institute Amsterdam and were compiled
as part of the preparation of the critical edition of Vetus Testamentum Syriace. The first
focus was therefore the correct description of the biblical pericopes themselves. In the present dataset,
other data have been added, such as extra-biblical or liturgical words and phrases, codicological remarks,
links to the bibliographical description of the manuscripts, links to relevant publications, and links to
already digitized photographs of the manuscripts. With regard to these extra data, the dataset
still contains instances where the data need to be enhanced and enriched, and some questions
remain unanswered; these will be dealt with in the near future.
2.2.3 Van Moorsel and Rijnierse: Network and Workshop: Linked Data and Syriac Sources
This project provided the funds to organize the meetings related to the LinkSyr project: a workshop
(March 2018), a bootcamp (January 2019) and a public event (February 2019). For details see §2.3.1.
2.2.5 GREgORI
We have initiated cooperation with the GREgORI project. This concerned the use of materials devel-
oped at the ETCBC for the GREgORI concordances project (see the pilot with the Prayer of Manas-
seh).
9
If those words are above the threshold of 0.8 similarity with Syriac place names that we have in other sources
(Syriaca and the sources mentioned in §5.1), there is a high probability that we are dealing with an as yet
unattested (orthographic) variant of that place name.
10
We agreed with Brill that the ETCBC can use the running text (the main text of the edition) for its own re-
search purposes. Brill will have the copyright of the critical apparatus.
2.3 Deliverables
2.3.1 Events
Within the framework of the LinkSyr project and its satellite projects, the following events have been
organized:
• Workshop Linked Data and Syriac Sources, Amsterdam, 12–13 March 2018.
• Bootcamp: NLP tools for Syriac, 17–18 January 2019.
• Public event Digitaal Onderzoek naar Syrische Bronnen, Glane, Syriac Orthodox Monas-
tery, 20 Feb 2019; presentations by Van Peursen [slideshare], Veldman, Coeckelbergs.
• Workshop From Manuscript to Database, DH2019, Utrecht, 9–12 July 2019. (Accepted)
The Peshitta Old Testament and the Peshitta New Testament have been stored in their own
repositories. For the New Testament the annotations from SEDRA are included (because of
differences in format and encoding, it was better to give the OT and the NT each their own
repository). Both are available in Text-Fabric. The corresponding TF-apps refer to these repositories
for the feature documentation. The apps themselves are in separate GitHub repositories for the OT and the
NT. 11
11
There are still some loose ends in the Peshitta data, related to the headings. When that has been settled, we
will consider hosting the Peshitta OT and NT on ancient-data.org.
3.2 Tools
The following programs have been developed in the LinkSyr project (see also the LinkSyr programs
folder on GitHub):
• syrocr, which “provides an interface to several Python modules aimed at optical recogni-
tion of Syriac text, in a manually supervised automated workflow.”
• Parsing programs: MorphAn; SedraIII (a Python parser for the SEDRA III database); SyrNT
(a Python parser for SyroMorph, which included an NT text somewhat different from the
SEDRA III text). 12
• Linked Data programs, including Recogito_reconciler.py, Syriaca_expansion.py, Term-
matching
• Topic modeling programs
4. Workflow
The various tools comprise a pipeline from OCR to Linked Data.
4.1 OCR
The first step is to create a Syriac document as electronic text. This may involve the digitization
of a non-digital source (printed book, manuscript) or of a digital source that does not have a text
format but can only be accessed as an image. For various corpora electronic texts are already
available in the ETCBC database (see §3.1.1) or in other resources such as the Digital Syriac Corpus
Project or the Comprehensive Aramaic Lexicon.
12
The output of the parsers SedraIII and SyrNT was used as input for MorphAn. Both are based on the
same data; SedraIII contains more detail (about lexicon and etymology), whereas SyrNT was easier to parse.
4.3 POS parsing
The chosen method was to follow, in general, the same two paths as for the Book of the Laws
of the Countries: (a) work with documentation of geographical names (for BLC: Dirk Bakker’s PhD
dissertation; for the Pentateuch: the Concordance); (b) import our texts into the Recogito tool of Pelagios.
Since Recogito only recognizes English names, Mathias Coeckelbergs wrote a program that does
essentially the same as Recogito; the connections established by this program can be added
manually to Pelagios. Additionally, Srecko Koralija provided a manually established list of place
names of the Pentateuch, detailing useful information concerning their location in the text and
disambiguating their meaning, location and references in the TaNaKh. He also included a bilingual reference
guide in Hebrew and Syriac. This in turn allowed Mathias Coeckelbergs to train a named entity
recognition tool. Within the context of other digitization efforts for Syriac works, this software is useful
for improving the detection of place names and their variants in new literature.
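To illustrate what a Recogito-like step could look like for Syriac, the sketch below tags tokens in a tokenized text against a manually established gazetteer and records where each place name occurs. It is an assumption-laden illustration and not the program written in the project: the gazetteer entries, the example.org URIs, the transliterated verse texts and all function and class names are hypothetical, and tokens are matched by exact string comparison only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceAnnotation:
    """One recognized place name and its location in the text."""
    verse: str      # verse reference, e.g. "Gen 2:14"
    position: int   # zero-based token index within the verse
    surface: str    # word form as it appears in the text
    uri: str        # identifier of the place in the gazetteer


def annotate_places(verses, gazetteer):
    """Return annotations for every token that exactly matches a gazetteer form."""
    annotations = []
    for verse_ref, text in verses.items():
        for position, token in enumerate(text.split()):
            uri = gazetteer.get(token)
            if uri is not None:
                annotations.append(
                    PlaceAnnotation(verse_ref, position, token, uri))
    return annotations


# Hypothetical gazetteer: surface form -> URI (placeholder URIs, not real ones).
gazetteer = {
    "PRT": "https://example.org/place/euphrates",
    "MSRYN": "https://example.org/place/egypt",
}

# Hypothetical tokenized verses in an ad hoc Latin transliteration.
verses = {
    "Gen 2:14": "NHR> RBY<Y> PRT",
    "Gen 12:10": ">BRM NXT MSRYN",
}

annotations = annotate_places(verses, gazetteer)
for annotation in annotations:
    print(annotation)
```

Exact matching is a deliberate simplification. In running Syriac text, place names often carry prefixed particles and occur in variant spellings, so a usable tool would need morphological segmentation or approximate matching before the gazetteer lookup.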
As the final step, we proceed to the question: is it possible to connect the Syriac data with the
Hebrew geographical data in the Pelagios Research pilot KIMA? Sinai Rusinek informed Van Peursen
that they have not yet included biblical names in the KIMA project but that they will try to obtain the
data from hatanakh. The way in which proper nouns and geographical locations have been updated
in the Ancient Versions is an interesting window onto the work of the translators. 13 Even if
the integration of the hatanakh materials into KIMA is not possible, it is interesting to compare the
geographies from the Peshitta and the Syriac gazetteer with those of hatanakh, based on the
Hebrew text.
The findings of the pilot project for geographical names of the Peshitta that Srecko Koralija worked
on will be included in his doctoral dissertation (University of Cambridge) that will be submitted at
the end of 2019. His dissertation will also include a study of personal names in the Peshitta.
13
See, e.g., Ze'ev Safrai, Seeking out the Land: Land of Israel Traditions in Ancient Jewish, Christian and Samari-
tan Literature (200 BCE - 400 CE) (Leiden: Brill, 2018), 321ff.