The task
In a collaboration between Scripta PSL in Paris, the Sonderforschungsbereich Manuskriptkulturen in Heidelberg, and the Anagnosis Projekt in
Würzburg, our team is working on the automatic alignment of all available papyrus transcriptions with the glyphs on the images.
Preliminary Work
We originally started with a pixel segmenter that divides images into background (red), object (green), script (blue) and paraphernalia (ruler,
cardboard, plate name, etc.) (yellow). However, creating the training data was very time-consuming, and we found a direct way to segment
lines without going through pixel segmentation.
Step 1
Train a line segmenter based on convolutional neural networks [1] to automatically detect the baselines of the papyri. One of the bigger challenges is to achieve good results on
fragmentary texts with difficult color or grayscale images. All semantically connected glyphs should be recognized as a single line. Lacunae should be bridged, but different columns
and marginal or interlinear annotations should remain separate. See below for some good results of automatic line segmentation and some that still need improvement. Automatically recognized
baselines are marked in green.
We are currently accumulating the necessary ground truth to optimize the results.
Step 2 (present)
Extracting the polygonal contours of the lines from the baselines serves as a preliminary
stage.
P. Oxy. LXVII 4577
Papyrology Rooms, Sackler Library, Oxford
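The poster does not say how the line contours are derived from the baselines. As a minimal sketch of the idea only, the Python below builds a polygon by offsetting each baseline point along its local normal by a fixed distance above and below. The function name `baseline_to_polygon` and the parameters `above` and `below` are assumptions made for illustration, and a real extractor would adapt the contour to the ink rather than use fixed extents.

```python
import math


def baseline_to_polygon(baseline, above=30.0, below=10.0):
    """Return a polygon (list of (x, y) points) enclosing a baseline.

    baseline: list of (x, y) points in image coordinates (y grows downward),
              ordered left to right.
    above, below: hypothetical fixed extents in pixels on either side of
                  the baseline; not taken from the poster.
    """
    if len(baseline) < 2:
        raise ValueError("a baseline needs at least two points")

    upper, lower = [], []
    last = len(baseline) - 1
    for i, (x, y) in enumerate(baseline):
        # Local direction: from the previous to the next vertex,
        # clamped at both ends of the polyline.
        x0, y0 = baseline[max(i - 1, 0)]
        x1, y1 = baseline[min(i + 1, last)]
        dx, dy = x1 - x0, y1 - y0
        norm = math.hypot(dx, dy)
        if norm == 0:
            raise ValueError("degenerate baseline segment")
        # Unit normal pointing towards smaller y (visually "up")
        # for a baseline running left to right.
        nx, ny = dy / norm, -dx / norm
        upper.append((x + nx * above, y + ny * above))
        lower.append((x - nx * below, y - ny * below))

    # Walk along the upper edge, then back along the lower edge.
    return upper + lower[::-1]


if __name__ == "__main__":
    for point in baseline_to_polygon([(0, 100), (50, 100), (100, 100)]):
        print(point)
```

Offsetting along the normal rather than straight up keeps the polygon roughly perpendicular to slanted or curved baselines, which are common on warped papyrus fragments.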
Step 3 (future)
Transform the textual data and train a recognizer. We have widely applied our pipeline to medieval manuscripts in Hebrew and
Latin with excellent results.
Step 4 (future)
Align the resulting image segmentation with existing transcriptions. We have widely applied our pipeline [2,3] to medieval manuscripts in Hebrew and
Latin with excellent results.
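The alignment method itself is not described here. One way to picture the step, purely as an illustrative sketch, is to align the noisy recognizer output for a line with the existing transcription and carry the glyph positions over to the matching transcription characters. The Python below does this with the standard library's `difflib.SequenceMatcher`. The function name `align_transcription` and the assumption that the recognizer yields one x-coordinate per recognized character are hypothetical.

```python
from difflib import SequenceMatcher


def align_transcription(recognized, positions, transcription):
    """Map transcription character indices to image x-positions.

    recognized:    recognizer output for one line (possibly noisy)
    positions:     one x-coordinate per recognized character
                   (a hypothetical input format)
    transcription: existing transcription of the same line

    Returns {index in transcription: x-position} for every character
    that falls inside a matching block; unmatched characters are omitted.
    """
    if len(recognized) != len(positions):
        raise ValueError("need exactly one position per recognized character")

    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long sequences.
    matcher = SequenceMatcher(None, recognized, transcription, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.b + k] = positions[block.a + k]
    return mapping


if __name__ == "__main__":
    # The recognizer misread the fourth character.
    result = align_transcription("abcXef", [10, 20, 30, 40, 50, 60], "abcdef")
    print(result)
```

Characters of the transcription that the recognizer misread or missed simply get no position in this sketch. A fuller approach could interpolate their positions from the matched neighbours.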
Bibliography
[1] Benjamin Kiessling, Daniel Stökl Ben Ezra, Matthew T. Miller, “BADAM: A Public Dataset for Baseline Detection in Arabic-script
Manuscripts,” HIP@ICDAR 2019 (forthcoming).
[2] Benjamin Kiessling, Robin Tissot, Peter Stokes, Daniel Stökl Ben Ezra, “eScriptorium: An Open Source Platform for Historical Document
Analysis,” OST@ICDAR 2019 (forthcoming).
[3] Benjamin Kiessling, Robin Tissot, Daniel Stökl Ben Ezra, Peter Stokes, “eScripta: A New Digital Platform for the Study of Historical Texts
and Writing,” DH2019.