
Using Corpora for Language Research
COGS 523 - Lecture 5
METU Turkish Corpus and METU-Sabancı Turkish Treebank: A Developer’s Perspective

10.11.23 COGS 523 - Bilge Say 1


Related Readings
 Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge, Development of a
Corpus and a Treebank for Present-day Written Turkish, in
Proceedings of the Eleventh International Conference of Turkish
Linguistics, August 2002.
 Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür,
Building a Turkish Treebank, invited chapter in Building and
Exploiting Syntactically-annotated Corpora, Anne Abeillé (ed.),
Kluwer Academic Publishers, 2003.
 Nart B. Atalay, Kemal Oflazer, Bilge Say,
The Annotation Process in the Turkish Treebank, in Proceedings of the
EACL Workshop on Linguistically Interpreted Corpora - LINC, April
13-14, 2003, Budapest, Hungary.



Acknowledgements
 Funding: METU-BAP, TÜBİTAK
 METU-Sabancı Treebank: Joint work with
Prof. Kemal Oflazer
 Main Contributors: Umut Özge and Nart
Bedin Atalay, METU; around 5 research
assistants and 13 student annotators and
trainees at various phases of the project.
Various faculty members contributed ideas,
especially at the initial stages.
 Agreements with 14 publishers (incl. 3
newspapers and 4 magazines)



Requirements for Corpora for Turkish?
 Incorporating many registers representatively 
 Diachronic and synchronic 
 Electronic 
 Annotated with standard practices
(typographically, morphosyntactically,
semantically, prosodically ...)
 Respecting copyright laws
 Accessible (free availability, support, etc.)
 Searchable 



What is the METU Turkish Corpus?
 A synchronic (1990+) corpus of written
Turkish
 2,000,000 words from 201 books, 87
journal issues, and issues of 3 daily
newspapers, totaling 999 samples
 Various kinds of annotation (creation of a
treebank as separate subproject)
 Project: 1999-2003



Other Features of METU
Turkish Corpus
 Permissions for each sample obtained
from the publishers
 Opportunistic representativeness !!
 Platform-independent; XML and TEI-
compliant annotation
 Accompanying query software
 Free for academic research purposes on
signature of a user agreement 
 http://www.ii.metu.edu.tr/~corpus/



Building the Corpus
 Text Compilation (permissions,
scanning if necessary, control)
 Computer-aided annotation
(TEI-XCES for general-typographic;
XML-compliant in-house scheme for
the treebank)
 Control
 Query Workbench Development



Distribution of Text Types
 News: 42%
 Novel: 13%
 Story: 11%
 Column: 8%
 Article: 8%
 Essay: 7%
 Res. Mon.: 5%
 Other: 3%
 Travel: 2%
 Interview: 1%



Annotation of the Corpus
 Text Encoding Initiative (TEI)
compliant
 Compliant with XCES (the XML-based Corpus
Encoding Standard), a TEI
application
 Compatible with major current
corpora such as the British National
Corpus
The TEI Structure - 1
teiCorpus
    teiHeader
    TEI.2
        teiHeader
        text
            front
            body
            back



The TEI Structure - 2

front, body, back
    divisions, e.g. <div1>
        components, e.g. <p>, <list>…
            phrase-level elements, e.g. <w>, <corr>…
(Burnard, 2001)

A Typical Header
<cesHeader>
<fileDesc>
<titleStmt>
<h.title>00017113</h.title>
</titleStmt>
<extent>
<wordCount>2008</wordCount>
<byteCount>17929</byteCount>
</extent>

...



A Typical Header (cont.)
<sourceDesc>
<biblStruct>
<analytic>
<h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan
BAYTOP</h.title>
<h.author>Nalân MAHSERECİ</h.author>
</analytic>
<imprint>
<publisher>Bilim ve Ütopya</publisher>
<pubDate>Mart 2000</pubDate>
<pubPlace>İstanbul</pubPlace>
</imprint>
<idno>1301 - 6717</idno>
</biblStruct>
</sourceDesc>
A Typical Header (cont.)
<profileDesc>
<textClass>
<catRef>Makale</catRef>
</textClass>
</profileDesc>
<revisionDesc>
<change>
<changeDate>12.10.2000</changeDate>
<respname>Sedef</respname>
<h.item>The header part was changed.</h.item>
</change>
</revisionDesc>
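
As a rough illustration of how such a header can be used, the sketch below reads a few bibliographic fields with Python's standard xml.etree.ElementTree module. The sample string is a shortened header assembled from the fragments above; the closing tags for <fileDesc> and <cesHeader> and the function name read_header are assumptions made for this example, since the slides elide part of the header.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical header built from the slide fragments.
# The closing </fileDesc> and </cesHeader> tags are assumed here.
SAMPLE_HEADER = """
<cesHeader>
  <fileDesc>
    <titleStmt><h.title>00017113</h.title></titleStmt>
    <extent><wordCount>2008</wordCount><byteCount>17929</byteCount></extent>
    <sourceDesc>
      <biblStruct>
        <analytic><h.author>Nalân MAHSERECİ</h.author></analytic>
        <imprint>
          <publisher>Bilim ve Ütopya</publisher>
          <pubDate>Mart 2000</pubDate>
        </imprint>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <profileDesc><textClass><catRef>Makale</catRef></textClass></profileDesc>
</cesHeader>
"""


def read_header(xml_text):
    """Return selected header fields as a plain dictionary."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        # Return the stripped text of the first matching element, or None.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    return {
        "sample_id": text_of("fileDesc/titleStmt/h.title"),
        # Assumes <wordCount> is present, as it is in the sample above.
        "word_count": int(text_of("fileDesc/extent/wordCount")),
        "author": text_of("fileDesc/sourceDesc/biblStruct/analytic/h.author"),
        "publisher": text_of("fileDesc/sourceDesc/biblStruct/imprint/publisher"),
        "genre": text_of("profileDesc/textClass/catRef"),
    }


if __name__ == "__main__":
    print(read_header(SAMPLE_HEADER))
```

Fields such as author and genre (catRef) are the kind of bibliographic constraints a query tool can filter on.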



A Typical Body
<text>
<body>
<p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Sitem'in,
kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü
çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin
yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp
gidivermemden korkan Oktay beni <hi>oyalamak</hi> için geçen yaz Giray
Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>

<p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o
kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...</p>

</body>
</text>



Entering XCES Annotations - 1


Entering XCES Annotations - 2


METU-Sabancı Treebank Project
 Annotation of morphological and (surface)
syntactic features in a dependency-
inspired manner
 A subcorpus containing 7,300 annotated
sentences and 65,000 words, initially
whole samples selected from the main
corpus (another version contains 5,600
sentences)
 Genre distribution is proportional to that of the
METU Corpus



Building the Treebank
 Morphological Analysis of Selected
Samples from the Corpus
 Preprocessing of the Collocations
 (Manual) Disambiguation of the
Morphological Parses
 Annotating with the Dependency
Structure
 Control



Annotation – Lexical Level
 A word can be seen as a sequence
of inflectional groups (IGs), separated by
derivation boundaries (^DB), of the form
Lemma+Infl_1^DB+Infl_2^DB+…^DB+Infl_n

 evinizdekilerden (from the ones at your house)

ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl

Inflectional groups: ev+Noun+A3sg+P2pl+Loc | +Adj | +Noun+A3pl+Pnon+Abl
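
The IG representation above can be split mechanically at the ^DB markers. The following is a minimal sketch, not the project's own tooling; the function name split_inflectional_groups is made up for this example, and it assumes the analysis string follows exactly the Lemma+...^DB+... pattern shown on the slide.

```python
def split_inflectional_groups(analysis):
    """Split a morphological analysis into its lemma and a list of IGs.

    Each IG is returned as a list of feature tags; the lemma is kept
    separately from the features of the first IG.
    """
    parts = analysis.split("^DB")
    first = parts[0].split("+")
    lemma = first[0]
    groups = [first[1:]]
    for part in parts[1:]:
        # Later IGs start with "+", e.g. "+Adj" or "+Noun+A3pl+Pnon+Abl".
        groups.append(part.lstrip("+").split("+"))
    return lemma, groups


if __name__ == "__main__":
    lemma, igs = split_inflectional_groups(
        "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl"
    )
    print(lemma)     # the lemma of the word
    for ig in igs:   # one line per inflectional group
        print(ig)
```

For evinizdekilerden this yields the lemma ev and three IGs, the last of which carries the ablative case that is relevant at the syntactic level.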
Annotation- Syntactic Level

Bu çocuk okuldan erken geldi.
This child school+Abl early come+Past+3sg
"This child came from school early."

Dependency links (dependent → head: relation):
Bu → çocuk: Determiner
çocuk → geldi: Subject
okuldan → geldi: Ablative Adjunct (Abl. adj)
erken → geldi: Modifier
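
One simple way to hold such an analysis in memory is a list of (dependent, head, relation) triples. The sketch below is only an illustration at the word level; the attachments are read off the relation labels on the slide, labeling the root verb "Sentence" is an assumption, and the helper names are hypothetical. The treebank itself links inflectional groups rather than whole words.

```python
# (dependent, head, relation); a head of None marks the sentence root.
# The attachments are an assumed reading of the slide's diagram.
DEPENDENCIES = [
    ("Bu", "çocuk", "Determiner"),
    ("çocuk", "geldi", "Subject"),
    ("okuldan", "geldi", "Ablative Adjunct"),
    ("erken", "geldi", "Modifier"),
    ("geldi", None, "Sentence"),
]


def heads_by_dependent(deps):
    """Map each dependent to its (head, relation); reject double heads."""
    heads = {}
    for dependent, head, relation in deps:
        if dependent in heads:
            raise ValueError(f"{dependent!r} has more than one head")
        heads[dependent] = (head, relation)
    return heads


def path_to_root(word, heads):
    """Follow head links from a word up to the root, detecting cycles."""
    path = [word]
    while heads[path[-1]][0] is not None:
        nxt = heads[path[-1]][0]
        if nxt in path:
            raise ValueError("cycle in dependency links")
        path.append(nxt)
    return path


if __name__ == "__main__":
    heads = heads_by_dependent(DEPENDENCIES)
    print(path_to_root("Bu", heads))
```

Checks of this kind (one head per word, no cycles, a single root) are typical of the control step in treebank building.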



Annotation- Syntactic Level
 Sentence  Relativizer
 Object  Coordination
 Subject  Possessor
 Intensifier  Classifier
 Modifier  Ablative Adjunct
 Determiner  Dative Adjunct
 Question-Particle  Locative Adjunct
 Total of 20  Instrumental
syntactic tags Adjunct...
Morphosyntactic processing
 Tokenized text is annotated
(ambiguously) by all possible
morphological analyses for each token.
 Also involves unknown-word processing
 A constraint-based disambiguation
module performs limited morphological
disambiguation.
 Recognition and morphological
annotation of collocations



Automatic Dependency
Annotation
 Try to get most of the “easy”
relations right automatically to help
and speed up the human annotator
 Human annotator can override if the
selected dependency relation is not
right.
 Pilot work was done, but the approach was not
used in the METU-Sabancı Treebank
Automatic Dependency
Annotation
 A set of heuristic rules tentatively
attaches some of the relations
automatically
 Appropriately case-marked nouns to the
immediately following unambiguous
postposition as objects
 Indefinite nominative nouns to the first verb
to the right as objects
 Adverbs and adjuncts to the first verb
to the right as modifiers and adjuncts
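
The last two heuristics can be sketched as follows. This is a toy illustration under stated assumptions, not the pilot system: the token dictionaries, the "definite" flag, the case codes, and the function names are all hypothetical, and the postposition rule is left out. Anything the rules do not cover is simply left for the human annotator.

```python
# Hypothetical mapping from case codes to adjunct relation labels.
ADJUNCT_LABELS = {
    "Abl": "Ablative Adjunct",
    "Dat": "Dative Adjunct",
    "Loc": "Locative Adjunct",
    "Ins": "Instrumental Adjunct",
}


def first_verb_to_right(tokens, i):
    """Return the index of the first verb after position i, or None."""
    for j in range(i + 1, len(tokens)):
        if tokens[j]["pos"] == "Verb":
            return j
    return None


def attach_heuristically(tokens):
    """Propose {dependent_index: (head_index, relation)} for easy cases."""
    links = {}
    for i, tok in enumerate(tokens):
        head = first_verb_to_right(tokens, i)
        if head is None:
            continue
        if tok["pos"] == "Adverb":
            links[i] = (head, "Modifier")
        elif tok["pos"] == "Noun" and tok.get("case") in ADJUNCT_LABELS:
            links[i] = (head, ADJUNCT_LABELS[tok["case"]])
        elif (tok["pos"] == "Noun" and tok.get("case") == "Nom"
              and not tok.get("definite", False)):
            links[i] = (head, "Object")
    return links


if __name__ == "__main__":
    sentence = [
        {"form": "Bu", "pos": "Det"},
        {"form": "çocuk", "pos": "Noun", "case": "Nom", "definite": True},
        {"form": "okuldan", "pos": "Noun", "case": "Abl"},
        {"form": "erken", "pos": "Adverb"},
        {"form": "geldi", "pos": "Verb"},
    ]
    for dep, (head, rel) in attach_heuristically(sentence).items():
        print(sentence[dep]["form"], "->", sentence[head]["form"], rel)
```

On this sentence only okuldan and erken are attached; the determiner and the subject are left for manual annotation, which matches the idea of handling only the "easy" relations automatically.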
The Annotation Tool
 The text thus processed can now be
further annotated with an annotation tool
 Visualization
 Review selections (morph/dependency)
and override (for morphology) or
annotate (for dependency)
 The output of the program is
morphologically disambiguated and
annotated text, encoded
according to the XML document and Turkish
Treebank formats.



Annotating the Treebank - 1


Annotating the Treebank - 2


Corpus Query Workbench
 A user-friendly query engine for linguists
 Organization through sessions
 Boolean or regular expression queries
 Filtering queries through bibliographic
constraints such as author, genre, year
 Treebank entries viewed through a graphical
interface
 Printing and saving options of outputs and
session queries available
 Implemented in Java SE 1.4.1, compatible with
Windows XP/Linux
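
The combination of regular expression queries and bibliographic filtering can be illustrated with a few lines of Python. This is a sketch of the idea only, not the workbench (which is a Java application); the sample records, their ids, genres, and years, and the function name query are all invented for the example.

```python
import re

# Hypothetical samples; ids, genres, years and texts are made up.
SAMPLES = [
    {"id": "s1", "genre": "Novel", "year": 1995,
     "text": "Bu çocuk okuldan erken geldi."},
    {"id": "s2", "genre": "News", "year": 2000,
     "text": "Çocuklar okula gitti."},
]


def query(samples, pattern, **constraints):
    """Return (sample id, matched string) pairs.

    Samples are first filtered by exact-match bibliographic constraints
    (e.g. genre="Novel"), then searched with the regular expression.
    """
    regex = re.compile(pattern)
    hits = []
    for sample in samples:
        if any(sample.get(key) != value for key, value in constraints.items()):
            continue
        for match in regex.finditer(sample["text"]):
            hits.append((sample["id"], match.group(0)))
    return hits


if __name__ == "__main__":
    # All word forms starting with "okul" (school), in any sample.
    print(query(SAMPLES, r"\bokul\w*"))
    # The same query restricted to novels.
    print(query(SAMPLES, r"\bokul\w*", genre="Novel"))
```

Prefix patterns like this one are a common workaround when a corpus is not lemmatized, since Turkish word forms carry many suffixes.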

Post-project developments
 About 100 user forms received
 Some uses (from a recent survey)
 Word sense disambiguation
 Coherence in Turkish texts
 Subcategorization Frame Acquisition
 Teaching Turkish or NLP
 CoNLL Dependency task for METU-
Sabancı Treebank (~5000 sentences)
 Frequency lists available (due to Umut
Özge and Serge Sharoff)



What would we have done
differently?
 More funding, more interdisciplinary
organization, less turnover...
 Approaching a corpus development
project like a software engineering
project...
 Doing a pilot project
 Better quality control processes, version
control and documentation control processes.
 More and better automatic text capture
and annotation



Requests from Users
 Extend the size and variety of the corpus
 POS tag the whole corpus
 Enable users to load their own corpora into the
query tool
 Add statistical features to the query tool
 Add semantic annotation
 Treebank specific ones:
 10,000; 7,000 or 5,000 sentences?
 Detailed stylebook
 LEM and MORPH fields
 Better versioning; fixing some entries that do not
conform to XML



Requirements for future generations of
Turkish corpora
 Turkish National Corpus (like ANC, BNC,
or CNC)
 Spoken Part
 Automatic Tools
 Diachronic Part
 Linguistically motivated morphological and
syntactic annotation
 Some motivation for text providers
 Well-funded, well-organized project
 Comparable corpora of Turkic languages



Lecture 6
 Bernardini et al. A Wacky Introduction.

 April 14: your tool evaluation
presentations and reports (only two
weeks left!)
