
Aligning extant transcriptions of documentary and literary papyri with their glyphs

B. Kiessling2, D. Stökl Ben Ezra1,2, Rodney Ast3, Holger Essler4


École Pratique des Hautes Études1, Université Paris Sciences et Lettres (PSL)2,
Universität Heidelberg3, Universität Würzburg4

The task
As a collaboration between Scripta PSL in Paris, the Sonderforschungsbereich Manuskriptkulturen in Heidelberg, and the Anagnosis Projekt in
Würzburg, our team is working on automatically aligning all available papyrus transcriptions with the glyphs in the images.

Preliminary Work
We originally started with a pixel segmenter that classifies image pixels as background (red), object (green), script (blue), or paraphernalia (ruler,
cardboard, plate name, etc.) (yellow). However, creating the training data was very time-consuming, and we found a more direct way to segment
lines without going through pixel segmentation.

P. Cair Zen II 59140r

Step 1
Train a line segmenter based on convolutional neural networks [1] to automatically detect the baselines of the papyri. One of the bigger challenges is achieving good results on
fragmentary texts with difficult color or grayscale images. All semantically connected glyphs should be recognized as a single line: lacunae should be bridged, but different columns
and marginal or interlinear annotations should remain separate. See below for some good results of automatic line segmentation and some that still need improvement. Automatically recognized
baselines are marked in green.
We are currently accumulating the necessary ground truth to optimize the results.

PSI VIII 977 v (Firenze, Biblioteca Medicea Laurenziana)
PSI II 116 (Firenze, Biblioteca Medicea Laurenziana)

Step 2 (present)
Extracting the polygonal contours of the lines from the detected baselines serves as a preliminary stage.

P. Oxy. LXVII 4577
Papyrology Rooms, Sackler Library, Oxford
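
The relationship between a baseline and its line polygon can be illustrated with a purely geometric sketch. This is not the contour extraction used in our pipeline, which is not specified here; it only assumes roughly horizontal lines, image coordinates with y growing downward, and fixed ascent and descent values. The function name `baseline_to_polygon` and its parameters are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def baseline_to_polygon(baseline: List[Point], ascent: float, descent: float) -> List[Point]:
    """Build a simple polygon around a baseline polyline.

    Assumption: the line is roughly horizontal, so shifting each point
    vertically is an acceptable stand-in for a true perpendicular offset.
    """
    if len(baseline) < 2:
        raise ValueError("a baseline needs at least two points")
    # Upper edge, left to right (y decreases upward in image coordinates).
    upper = [(x, y - ascent) for x, y in baseline]
    # Lower edge, right to left, so the vertices form a closed ring.
    lower = [(x, y + descent) for x, y in reversed(baseline)]
    return upper + lower


# Hypothetical baseline with three points, as a segmenter might return it.
baseline = [(10.0, 100.0), (200.0, 104.0), (400.0, 102.0)]
polygon = baseline_to_polygon(baseline, ascent=30.0, descent=10.0)
```

A fixed offset like this will cut into neighbouring lines on densely written or curved papyri, which is one reason a content-aware contour is preferable in practice.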

Step 3 (future)
Transform the textual data and train a recognizer. We have widely applied our pipeline to medieval manuscripts in Hebrew and
Latin with excellent results.

Step 4 (future)
Align the resulting image segmentation with existing transcriptions. We have widely applied our pipeline [2,3] to medieval manuscripts in Hebrew and
Latin with excellent results.
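
One way to picture this alignment step is as a sequence alignment between the recognizer's output for a line and the edited transcription of the same line. The sketch below is an illustration only: the alignment method is not fixed here, and Python's standard `difflib.SequenceMatcher` is used as a stand-in. The function name `align_transcription`, the per-character x-positions, and the example strings are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def align_transcription(recognized: str, positions: List[int],
                        transcription: str) -> List[Optional[int]]:
    """Map each transcription character to the x-position of a recognized glyph.

    Characters with no counterpart in the recognizer output (for example
    letters restored by the editor in a lacuna) are mapped to None.
    """
    if len(recognized) != len(positions):
        raise ValueError("one position per recognized character is required")
    matcher = SequenceMatcher(None, recognized, transcription, autojunk=False)
    result: List[Optional[int]] = [None] * len(transcription)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Carry positions over for matches and for one-to-one substitutions.
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(i2 - i1):
                result[j1 + k] = positions[i1 + k]
    return result


# Hypothetical line: the recognizer missed one letter present in the edition.
recognized = "αβδε"
positions = [0, 10, 20, 30]
transcription = "αβγδε"
aligned = align_transcription(recognized, positions, transcription)
# aligned == [0, 10, None, 20, 30]
```

Real transcriptions would first need normalization (editorial brackets, diacritics, abbreviations) before such a comparison is meaningful.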

Bibliography
[1] Benjamin Kiessling, Daniel Stökl Ben Ezra, Matthew T. Miller, “BADAM: A Public Dataset for Baseline Detection in Arabic-script Manuscripts,” HIP@ICDAR 2019 (forthcoming)
[2] Benjamin Kiessling, Robin Tissot, Peter Stokes, Daniel Stökl Ben Ezra, “eScriptorium: An Open Source Platform for Historical Document Analysis,” OST@ICDAR 2019 (forthcoming)
[3] Benjamin Kiessling, Robin Tissot, Daniel Stökl Ben Ezra, Peter Stokes, “eScripta: A New Digital Platform for the Study of Historical Texts and Writing,” DH2019
