
Using Corpora for Language

COGS 523-Lecture 5
METU Turkish Corpus and METU-Sabancı Turkish Treebank: A Developer’s Perspective

10.11.23 COGS 523 - Bilge Say 1

Related Readings
• Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. Development of a Corpus and a Treebank for Present-day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
• Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeillé (ed.), Kluwer Academic Publishers, 2003.
• Nart B. Atalay, Kemal Oflazer, Bilge Say. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC), April 13-14, 2003, Budapest, Hungary.


• METU-Sabancı Treebank: joint work with Prof. Kemal Oflazer
• Main contributors: Umut Özge and Nart Bedin Atalay, METU; around 5 research assistants and 13 student annotators and trainees at various phases of the project. Various faculty members contributed ideas, especially at the initial stages.
• Agreements with 14 publishers (incl. 3 newspapers and 4 magazines)


Requirements for Corpora for Turkish?
• Incorporating many registers representatively
• Diachronic and synchronic
• Electronic
• Annotated with standard practices (typographically, morphosyntactically, semantically, prosodically ...)
• Respecting copyright laws
• Accessible (free availability, support, etc.)
• Searchable


What is METU Turkish Corpus?
• A synchronic (1990+) corpus of written Turkish
• 2,000,000 words from 201 books, 87 journal issues and issues of 3 daily newspapers, totaling 999 samples
• Various kinds of annotation (creation of a treebank as a separate subproject)
• Project: 1999-2003


Other Features of METU Turkish Corpus
• Permissions for each sample obtained from the publishers
• Opportunistic representativeness!!
• Platform-independent; XML- and TEI-compliant annotation
• Accompanying query software
• Free for academic research purposes on signature of a user agreement


Building the Corpus
• Text Compilation (permissions, scanning if necessary, control)
• Computer-aided annotation (TEI-XCES for general typographic annotation; XML-compliant in-house scheme for the treebank)
• Control
• Query Workbench Development


Distribution of Text Types
[Pie chart: Travel 2%, Interview 1%, Res. Mon. 5%, Other 3%, Essay 42%, Column 8%, Story 11%, Novel 13%]


Annotation of the Corpus
• Text Encoding Initiative (TEI)
• XCES: XML-based Corpus Encoding Standards compliant, a TEI
• Compliant with major current corpora such as the British National Corpus

The TEI Structure - 1

[Tree diagram: teiHeader and TEI.2 at the top level; TEI.2 contains teiHeader and text; text contains front, body and back]


The TEI Structure - 2

[Tree diagram: front, body and back contain divisions, e.g. <div1>; divisions contain components, e.g. <p>, <list>…; these in turn contain elements, e.g. <w>, <corr>…]

(Burnard, 2001)

A Typical Header



A Typical Header (cont.)
<h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan
<h.author>Nalân MAHSERECİ</h.author>
<publisher>Bilim ve Ütopya</publisher>
<pubDate>Mart 2000</pubDate>
<idno>1301 - 6717</idno>

A Typical Header (cont.)
<h.item>The header part was changed.</h.item>


A Typical Body
<p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Sitem'in,
kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü
çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin
yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp
gidivermemden korkan Oktay beni <hi>oyalamak</hi> için geçen yaz Giray
Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>

<p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o
kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...</p>
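
As a rough illustration of how markup like this can be consumed, the Python sketch below parses a shortened version of the first sample paragraph with the standard library and pulls out the quoted speech and the highlighted word. The `<body>` wrapper and the shortened text are assumptions made so that the fragment is well-formed; this is not the project's own tooling.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample paragraph, wrapped in a <body> element
# (the wrapper is an assumption so that the fragment is well-formed XML).
SAMPLE = (
    "<body>"
    "<p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. "
    "Oktay beni <hi>oyalamak</hi> için bir gezintiyi anlatmaya başladı.</p>"
    "</body>"
)


def extract(xml_text):
    """Return (quotes, highlights, plain-text paragraphs) from a body fragment."""
    root = ET.fromstring(xml_text)
    quotes = [q.text for q in root.iter("q")]
    highlights = [h.text for h in root.iter("hi")]
    # itertext() joins a paragraph's own text with the text of its inline elements.
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    return quotes, highlights, paragraphs


quotes, highlights, paragraphs = extract(SAMPLE)
print(quotes)
print(highlights)
print(paragraphs[0])
```

Because the annotation is plain XML, any standard XML parser can recover both the running text and the typographic markup.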



Entering XCES Annotations


Entering XCES Annotations


METU-Sabancı treebank
• Annotation of morphological and (surface) syntactic features in a dependency-inspired manner
• A subcorpus containing 7,300 annotated sentences and 65,000 words: initially whole samples selected from the main corpus. (Another version containing 5600
• Genre distribution is proportional to that of the METU Corpus


Building the Treebank
• Morphological Analysis of Selected Samples from the Corpus
• Preprocessing of the Collocations
• (Manual) Disambiguation of the Morphological Parses
• Annotating with the Dependency
• Control


Annotation - Lexical Level
• A word can be seen as a sequence of inflectional groups (IGs) of the

• evinizdekilerden (from the ones at your house)
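
One way to picture inflectional groups in code is to split a morphological analysis at its derivation boundaries. The Python sketch below assumes an analysis string in which `^DB` marks such a boundary; the string given for evinizdekilerden and its tag names are illustrative assumptions, not the treebank's actual analysis.

```python
# Illustrative analysis string only: the tag names and the exact segmentation
# are assumptions, not copied from the treebank. "^DB" marks a derivation boundary.
ANALYSIS = "ev+Noun+A3sg+P2pl+Loc^DB+Adj+Rel^DB+Noun+Zero+A3pl+Pnon+Abl"


def split_igs(analysis):
    """Split a morphological analysis into its root and inflectional groups."""
    parts = analysis.split("^DB")
    root, *first_tags = parts[0].split("+")
    igs = [first_tags]
    for part in parts[1:]:
        # Each later part starts with "+", so drop the empty leading field.
        igs.append([tag for tag in part.split("+") if tag])
    return root, igs


root, igs = split_igs(ANALYSIS)
print(root)
for number, ig in enumerate(igs, start=1):
    print(number, "+".join(ig))
```

Under these assumptions the word consists of one root followed by three inflectional groups, each of which could serve as an attachment point for syntactic relations.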


Inflectional Group

Annotation - Syntactic Level

Bu çocuk okuldan erken geldi.

This child school+Abl early come+Past+3sg
This child came from school early.

[Dependency diagram: Bu → çocuk (Determiner); çocuk → geldi (Subject); okuldan → geldi (Abl. adjunct); erken → geldi (Modifier)]
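
The dependency links for the example sentence can be written down as a small data structure. The Python sketch below is a word-level simplification: the `Dependency` class and the list layout are assumptions for illustration, and the link from the verb to the sentence-final punctuation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    dependent: str
    head: str
    relation: str


# Links for "Bu çocuk okuldan erken geldi."; the data layout is an assumption
# for illustration and is not the treebank's XML format.
LINKS = [
    Dependency("Bu", "çocuk", "Determiner"),
    Dependency("çocuk", "geldi", "Subject"),
    Dependency("okuldan", "geldi", "Ablative Adjunct"),
    Dependency("erken", "geldi", "Modifier"),
]


def dependents_of(head, links):
    """Return (dependent, relation) pairs attached to the given head word."""
    return [(d.dependent, d.relation) for d in links if d.head == head]


print(dependents_of("geldi", LINKS))
print(dependents_of("çocuk", LINKS))
```

Every dependent points to a head on its right here, which reflects the head-final tendency visible in the example sentence.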


Annotation - Syntactic Level
• Sentence
• Object
• Subject
• Intensifier
• Modifier
• Determiner
• Question-Particle
• Relativizer
• Coordination
• Possessor
• Classifier
• Ablative Adjunct
• Dative Adjunct
• Locative Adjunct
• Instrumental Adjunct...
• Total of 20 syntactic tags

Morphosyntactic processing
• Tokenized text is annotated (ambiguously) with all possible morphological analyses for each token.
• Also involves unknown word processing
• A constraint-based disambiguation module performs limited morphological disambiguation
• Recognition and morphological annotation of collocations


Automatic Dependency
• Try to get most of the “easy” relations right automatically to help and speed up the human annotator
• Human annotator can override if the selected dependency relation is not
• Pilot work was done, but this was not put into practice in the METU-Sabancı treebank

Automatic Dependency
• A set of heuristic rules tentatively attaches some of the relations
  ◦ Appropriately case-marked nouns to the immediately following unambiguous postposition as objects
  ◦ Indefinite nominative nouns to the first verb to the right as objects
  ◦ Adverbs and adjuncts attach to the first verb to the right as modifiers and adjuncts
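
A minimal Python sketch of two of the heuristics above (indefinite nominative nouns as objects, adverbs as modifiers) might look as follows. The `Token` fields, the coarse tag names and the hand-tagged toy sentence are all assumptions for illustration; the postposition rule is omitted, and this is not the pilot implementation itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str
    pos: str                        # hypothetical coarse tags: "Noun", "Verb", "Adv"
    case: str = ""                  # e.g. "Nom"; empty when not applicable
    definite: bool = False
    head: Optional[int] = None      # index of the head token, set by the rules
    relation: Optional[str] = None


def first_verb_to_the_right(tokens, start):
    """Return the index of the first verb after position start, or None."""
    for i in range(start + 1, len(tokens)):
        if tokens[i].pos == "Verb":
            return i
    return None


def attach_easy_relations(tokens):
    """Tentatively attach the 'easy' relations; everything else stays unset."""
    for i, tok in enumerate(tokens):
        verb = first_verb_to_the_right(tokens, i)
        if verb is None:
            continue
        if tok.pos == "Noun" and tok.case == "Nom" and not tok.definite:
            tok.head, tok.relation = verb, "Object"
        elif tok.pos == "Adv":
            tok.head, tok.relation = verb, "Modifier"
    return tokens


# Toy input "Çocuk hızla kitap okudu" with hand-assigned, assumed tags.
sentence = [
    Token("Çocuk", "Noun", "Nom", definite=True),
    Token("hızla", "Adv"),
    Token("kitap", "Noun", "Nom"),
    Token("okudu", "Verb"),
]
for tok in attach_easy_relations(sentence):
    print(tok.form, tok.head, tok.relation)
```

Tokens that no rule covers keep `head=None`, which mirrors the idea that the human annotator fills in or overrides the remaining relations.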

The Annotation Tool
• The text thus processed can now be further annotated with an annotation tool
  ◦ Visualization
  ◦ Review selections (morph/dependency) and override (for morphology) or annotate (for dependency)
• The output of the program is morphologically disambiguated and annotated text, encoded according to the XML document and Turkish Treebank formats.


Annotating the Treebank - 1

Annotating the Treebank - 2


Corpus Query Workbench
• A user-friendly query engine for linguists
• Organization through sessions
  ◦ Boolean or regular expression queries
  ◦ Filtering queries through bibliographic constraints such as author, genre, year
• Treebank entries viewed through a graphical
• Printing and saving options for outputs and session queries
• Implemented in Java SE 1.4.1, compatible with Windows XP/Linux
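
To make the idea of a regular expression query filtered by bibliographic constraints concrete, here is a small Python sketch. It is not the Workbench's implementation (which is in Java); the `Sample` fields, the `query` function and the toy samples with placeholder authors, years and texts are all assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    author: str
    genre: str
    year: int
    text: str


# Toy samples; authors, genres, years and texts are invented placeholders.
SAMPLES = [
    Sample("Author A", "Novel", 1995, "Bu çocuk okuldan erken geldi."),
    Sample("Author B", "Essay", 2000, "Çocuklar okula geç gitti."),
    Sample("Author A", "Essay", 1998, "Okuldan gelen çocuk uyudu."),
]


def query(samples, pattern, genre=None, author=None, years=None):
    """Return (author, year, match) for regex hits in samples passing the filters."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for s in samples:
        if genre is not None and s.genre != genre:
            continue
        if author is not None and s.author != author:
            continue
        if years is not None and not (years[0] <= s.year <= years[1]):
            continue
        for match in regex.finditer(s.text):
            hits.append((s.author, s.year, match.group(0)))
    return hits


print(query(SAMPLES, r"\bokul\w*", genre="Essay"))
print(query(SAMPLES, r"çocuk", years=(1990, 1999)))
```

Applying the metadata filters before the regular expression keeps the expensive text search restricted to the samples the user actually asked for.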


Post-project developments
• About 100 user forms received
• Some uses (from a recent survey)
  ◦ Word sense disambiguation
  ◦ Coherence in Turkish texts
  ◦ Subcategorization Frame Acquisition
  ◦ Teaching Turkish or NLP
• CoNLL Dependency task for METU-Sabancı Treebank (~5000 sentences)
• Frequency lists available (due to Umut Özge and Serge Sharoff)


What would we have done differently?
• More funding, more interdisciplinary organization, less turnover...
• Approaching a corpus development project like a software engineering project
  ◦ Doing a pilot project
  ◦ Better quality control, version control and documentation control processes
• More and better automatic text capture and annotation


Requests from Users
• Extend the size and variety of the corpus
• POS tag the whole corpus
• Enable users to load their own corpora into the query tool
• Add statistical features to the query tools
• Add semantic annotation
• Treebank-specific ones:
  ◦ 10,000; 7,000 or 5,000 sentences?
  ◦ Detailed stylebook
  ◦ LEM and MORPH fields
  ◦ Better versioning, some nonconformant entries with


Requirements for future generations of Turkish corpora
• Turkish National Corpus (like ANC, BNC, or CNC)
  ◦ Spoken Part
  ◦ Automatic Tools
  ◦ Diachronic Part
  ◦ Linguistically motivated morphological and syntactic annotation
• Some motivation for text providers
• Well-funded, well-organized project
• Comparable corpora of Turkic languages


Lecture 6
• Bernardini et al. A Wacky Introduction.
• April 14: your tool evaluation presentations and reports. Only two weeks left!

