
Using Corpora for Language Research
COGS 523 - Lecture 3: Corpus Annotation


Related Readings
 Course Pack: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside et al. (1997) Chs 4, 5, 16
 Optional: McEnery et al. (2006): A3, A4, A8, A9
 For your reference: the rest of Garside et al. (1997) is relatively old but still useful.

Slides with tagged text are adapted from McEnery and Wilson (2001) or McEnery et al. (2006), except the TEI encodings (see http://www.tei-c.org/Support/Learn/).


Mark-up and Annotation
 Corpus mark-up: a system of codes inserted into a document stored in electronic form to provide information about the text itself and to govern formatting (e.g. the Text Encoding Initiative, TEI)
 Corpus annotation: the addition of interpretive, linguistic information to an electronic corpus of spoken and/or written data
 The two terms are sometimes used interchangeably

Conflict: utility of annotations vs ease of annotation


Other Issues
 Standards vs Guidelines
 Manual vs Automatic Annotation
 Documentation
 Evaluation of Annotation Schemes
 See LREC conferences....



Maxims in the Annotation of Text Corpora (Leech, 1993)
 Removable/revertible
 Extractable
 End-user guidelines available
 Annotation mode and annotator info made clear
 Reliability made available
 Annotation schemes: theory-neutral or widely agreed upon?
 No a priori standard


Cross-Linguistic Annotation Standards
 Reusability and shareability
 Ease and efficiency in building a corpus
 Cross-linguistic comparability

Examples: TEI, CES; EAGLES (Expert Advisory Group on Language Engineering Standards)


Problems with Standardization
 Applicability of standards to existing or ongoing corpus research
 Acceptability of standards to the general linguistics community
 Task dependency of corpora
 Applicability to a wide range of languages


Documentation of Markup/Annotation Guidelines
 What should be specified in an annotation guidelines document?
 Levels and layers of annotation
 Set of annotation devices used and their meanings
 Conventions for applying these devices, supplemented with examples or a reference corpus
 Granularity of annotation
 Disambiguation process applied (if any)
 Measurable quality of annotation (accuracy rate, consistency rate, extent of manual checking)
 Any incompleteness, known errors, etc.


Markup
 A.k.a. structural annotation
 Different conventions exist for line breaks, sections, lists, etc. What does that imply?
 Character sets (e.g. Unicode) and language codes (ISO 639-3)
 Textual information
 COCOA references: <A Charles Dickens>
 Standard Generalized Markup Language (SGML)
 Hypertext Markup Language (HTML)
 Extensible Markup Language (XML); compare the sketch below
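For illustration, the same authorship information in COCOA-style markup and in XML; a minimal sketch (the XML element names are invented for this example, not taken from any standard):

COCOA reference (code A = author):
<A CHARLES DICKENS>

A roughly equivalent XML encoding, with explicit start and end tags:
<text>
  <author>Charles Dickens</author>
  ...
</text>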



XML
 Three characteristics of XML distinguish it from other markup languages:
 its emphasis on descriptive rather than procedural markup (illustrated below);
 its notion of documents as instances of a document type;
 its independence of any one hardware or software system.
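A quick sketch of the first point: descriptive markup says what the text is, while procedural markup (here rendered in troff-style commands, purely for contrast) says how to display it:

<quote xml:lang="la">Carpe diem</quote>    descriptive: "this is a quotation, in Latin"

.ft I
Carpe diem
.ft R                                      procedural: "switch to italics, then back to roman"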



Text Encoding Initiative (TEI)
 Objective: the development of an interchange language for textual data
 Started in 1987
 Original P3 documentation: 1400 pages
 Currently in P5, with extensive web support (see Links)
 TEI Lite: simplified by a factor of 3
 Moved from SGML to XML
 Flexible tagset
 Document Type Definitions (DTDs: the rules for a particular markup language, i.e. its elements, attributes and entities) made more flexible and optional; see the sketch below
 XSL: Extensible Stylesheet Language
 Simpler and better syntax
 Corpus Encoding Standard (CES) and XCES
 an attempt to specialize XML for corpora (not currently fully compliant with TEI P5, but with many commonalities) (see Links)
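A minimal DTD sketch for an invented poem markup language (purely illustrative; this is not a TEI DTD):

<!ELEMENT poem    (title, stanza+)>
<!ATTLIST poem    type CDATA #IMPLIED>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT stanza  (line+)>
<!ELEMENT line    (#PCDATA)>
<!ENTITY  auml    "&#228;">  <!-- entity declaration: &auml; expands to "a" with umlaut -->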



TEI
 Alternative customizations:
 tei_bare: TEI Absolutely Bare
 tei_lite: TEI Lite
 tei_corpus: TEI for Linguistic Corpora
 tei_ms: TEI for Manuscript Description
 tei_drama: TEI with Drama
 tei_speech: TEI for Speech Representation


An example of a feature system declaration (FSD) for the tag VVD (past tense of a lexical verb):

<fs id=vvd type=word-form>
 <f name=class><sym value=verb>
 <f name=verb-class><sym value=lexical>
 <f name=verb-form><sym value=past>
</fs>


Examples of SGML tags

<Q></Q>      encloses a question
<EX></EX>    encloses an expansion of an abbreviation in the original manuscript
<LB>         indicates a line break
<FRN></FRN>  encloses words in another language; Lang="LA" indicates Latin


Example of XML breakfast food menu
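The slide reproduces the well-known XML "breakfast food menu" teaching example (after W3Schools); a minimal sketch of that kind of document, with element names assumed:

<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>Thick slices made from our homemade sourdough bread</description>
    <calories>600</calories>
  </food>
</breakfast_menu>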



TEI P5 Structure
 "The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes." (from the TEI Guidelines)
 Modules: core, header, textstructure, corpus, ...
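A skeletal TEI document, with the module supplying each element noted in comments (the namespace is the standard TEI one; the content is elided):

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>            <!-- header module -->
    ...
  </teiHeader>
  <text>                 <!-- textstructure module -->
    <body>
      <p>...</p>         <!-- core module -->
    </body>
  </text>
</TEI>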



TEI for Language Corpora – Text Descriptions
 channel (primary channel): describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc. mode specifies the mode of this channel with respect to speech and writing.
 constitution: describes the internal composition of a text or text sample, for example as fragmentary, complete, etc. type specifies how the text was constituted.
 derivation: describes the nature and extent of the originality of this text. type categorizes the derivation of the text.
 domain (domain of use): describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc. type categorizes the domain of use.
 factuality: describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. type categorizes the factuality of the text. ...
(from the Guidelines)



TEI Elements
 Elements:
 Major structuring elements: text, body, front, back, ...
 Paragraph-level elements: citation, speaker, ...
 Lists, tables, figures
 Phrase-level elements: date, emph, foreign
 Bibliographical elements: author, publisher
 Others: file description, revision description
 Attributes:
 <div type="chapter" n="1"> ... </div>
 Entities:
 &auml; for ä


A Text Description Example

<textDesc n="Informal domestic conversation">
 <channel mode="s">informal face-to-face conversation</channel>
 <constitution type="single">each text represents a continuously recorded interaction among the specified participants</constitution>
 <derivation type="original"/>
 <domain type="domestic">plans for coming week, local affairs</domain>
 <factuality type="mixed">mostly factual, some jokes</factuality>
 <interaction type="complete" active="plural" passive="many"/>
 <preparedness type="spontaneous"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>


A Sample Participant Description

<person sex="2" age="mid">
 <birth when="1950-01-12">
  <date>12 Jan 1950</date>
  <name type="place">Shropshire, UK</name>
 </birth>
 <langKnowledge tags="en fr">
  <langKnown level="first" tag="en">English</langKnown>
  <langKnown tag="fr">French</langKnown>
 </langKnowledge>
 <residence>Long-term resident of Hull</residence>
 <education>University postgraduate</education>
 <occupation>Unknown</occupation>
 <socecStatus scheme="#pep" code="#b2"/>
</person>


Example of TEI Header from University of Michigan Library
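The original slide reproduced a header image; in its place, a minimal sketch of the general shape of a TEI header (all bibliographic content here is invented for illustration):

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A sample electronic edition (hypothetical)</title>
      <author>Anonymous</author>
    </titleStmt>
    <publicationStmt>
      <p>Hypothetical publisher and availability statement</p>
    </publicationStmt>
    <sourceDesc>
      <p>Hypothetical description of the printed source</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>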



Adopting XML-Based Linguistic Annotation
 Technical difficulties; human perceptual difficulties
 Not conformant to how linguistic knowledge is expressed in many layers of linguistic annotation...


Types of Annotation
 Morphosyntactic
 Part-of-speech tagging; partial or full parse
 Semantic
 Word senses, thematic roles
 Discourse
 Information structure, anaphoric relations, discourse relations
 Prosodic (e.g. intonation)
 Pragmatic (e.g. speech acts)
 Problem understanding (see the Message Understanding Conferences (MUC) or Document Understanding Conferences (DUC))


POS Tagging
 Obligatory attributes or values: major word categories (see the example below)
 Recommended attributes or values: type, gender, case
 Optional: semantic classes, language-specific information, derivational morphology
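As an example of how a single tag bundles such attributes, the C5/C7 tag NN2 (plural common noun, listed in the tagset table a few slides below) can be read as:

NN2  =  major category: noun   (obligatory)
        type: common           (recommended)
        number: plural         (recommended)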



Tagsets
 Issues and trade-offs: conciseness, ease of interpretation, analysability, disambiguatability, linguistic quality vs computational tractability...
 Size of tagsets: English 30-200; Spanish 475; Turkish 6000 distinct morphological feature combinations for 250,000 words
 What to do with multiwords: in spite of (ditto tags; see the example below), mergers (clitics, e.g. hasn't), compounds (eye strain vs eyestrain)
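CLAWS-style ditto tags mark each word of a multiword unit with the shared tag plus its position; an illustrative rendering of the convention for the three-word preposition "in spite of":

in_II31 spite_II32 of_II33

(II = preposition; the first digit gives the number of words in the unit, the second the word's position within it)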



Tagging Accuracy
 Amount of training data available
 The size of the tagset
 Differences between the training data/dictionary and the real corpus
 Unknown words
 Recall and precision (defined below)
 2-6% error rate for English
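As a reminder, the standard definitions, with TP/FP/FN the true positives, false positives and false negatives for a given tag:

Accuracy  = correctly tagged tokens / all tokens
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)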



Examples of codes of tag sets

LOB Corpus (C1 tagset)           SEC (C7 tagset)                 BNC (C5 tagset)
IN   preposition                 IF   for                        DPS  possessive determiner
JJ   adjective                   IO   of                         NN1  singular common noun
NN   singular common noun        NN1  singular common noun       NN2  plural common noun
NNS  plural common noun          NN2  plural common noun         NP0  proper noun
NP   singular proper noun        NNJ  singular group noun        PNP  personal pronoun
NP$  genitive proper noun        RL   locative adverb            POS  genitive marker ('s)
PP$  possessive pronoun          RR   general adverb             PRF  of
RP   adverbial particle          RT   temporal adverb            PRP  preposition


Example of part-of-speech tagging from the LOB corpus (C1 tagset)

P05 32 ^ Joanna_NP stubbed_VBD out_RP her_PP$ cigarette_NN with_IN
P05 32 unnecessary_JJ fierceness_NN ._.
P05 33 ^ her_PP$ lovely_JJ eyes_NNS were_BED defiant_JJ above_IN
P05 33 cheeks_NNS whose_WP$ colour_NN had_HVD deepened_VBN
P05 34 at_IN Noreen's_NP$ remark_NN ._.


Example of part-of-speech tagging from the Spoken English Corpus (C7 tagset)

^ For_IF the_AT members_NN2 of_IO this_DD1 university_NN1 this_DD1 character_NN1
enshrines_VVZ a_AT1 victorious_JJ principle_NN1 ;_; and_CC the_AT fruits_NN2 of_IO
that_DD1 victory_NN1 can_VM immediately_RR be_VBI seen_VVN in_II the_AT
international_JJ community_NNJ of_IO scholars_NN2 that_CST has_VHZ graduated_VVN
here_RL today_RT ._.


Example of part-of-speech tagging from the British National Corpus (C5 tagset in TEI-conformant layout)

Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF;
the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI;

&bquo;&PUQ; I&PNP; 'll&VM0; polish&VVI; your&DPS;
boots&NN2; ,&PUN; &equo;&PUQ; he&PNP; offered&VVD; .&PUN;


Example from the CLAWS system

0000117 040 I     03 PPIS1
0000117 050 do    03 VD0
0000117 051 n't   03 XX
0000117 060 think 99 VVI


Syntactic Annotation
 More problematic than POS tagging. Can you guess why?
 Proposed levels:
 Bracketing of segments
 Labeling of segments
 Marking of dependency relations, e.g. complements
 Indicating functional labels, e.g. subject, object
 Extra: ellipsis, traces...


Treebanks
 Penn Treebank: the initiator
 Treebanks for Swedish, Danish, German, Dutch, French, Turkish, Czech, Spanish, Basque, Russian, Chinese, Portuguese, Italian...
 Sizes: 700 to 90,000 sentences
 Automated and manual annotation
 Grammar formalisms: context-free grammar trees, dependency, LFG, HPSG, CCG


Example of full parsing from the Lancaster-Leeds treebank

[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb
is_BEZ Vzb] [Ns the_ATI [NN/JJ& wine-glass_NN [JJ+ or_CC
flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq]
Fr]Ns] ._. S]


Example of skeleton parsing from the Spoken English Corpus

[S&[P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1
university_NN1 N]P]N]P] [N this_DD1 character_NN1 N] [V
enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1
N]V]S&] ._.


From the Penn Treebank

((S (NP-SBJ-1
      (NP Yields)
      (PP on
        (NP money-market mutual funds)))
    (VP continued
      (S (NP-SBJ *-1)
        (VP to
          (VP slide)))
      (PP-LOC amid
        (NP signs
          (SBAR that
            (S (NP-SBJ portfolio managers)
              (VP expect
                (NP (NP further declines)
                  (PP-LOC in
                    (NP interest rates)))))))))))
Tiger Treebank – a German treebank

<n id="n1_500" cat="S">
 <edge href="#id(w1)"/>
 <edge href="#id(w2)"/>
</n>

<w id="w1" word="the"/>
<w id="w2" word="boy"/>


Semantic Annotation
 Makes sense in linguistic or psycholinguistic terms
 Applicable to the whole corpus
 Flexible and at the right level of granularity
 Hierarchical structure (?)
 Conforming to standards

(Schmidt, 1988)


Other Issues
 Harder to annotate
 Can be computer-assisted if appropriate interfaces to lexical resources are developed
 General frequency information can help in disambiguation


Example of semantic text analysis, based upon Wilson (1996)

And       00000000
the       00000000
soldiers  23241000
platted   21072000
a         00000000
crown     21110400
of        00000000
thorns    13010000
and       00000000

Key:
00000000  Low content word
13010000  Plant life in general
21030000  Body and body parts
21072000  Object-oriented physical activity
21110321  Men's clothing: outer clothing
21110400  Headgear
23241000  War and conflict: general
31241100  Color
Example of anaphoric annotation from the Lancaster Anaphoric Treebank

A039 1 v
(1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9
Charlotte_NP1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO
get_VV0 rid_VVN of_IO [N (3 <REF=2 its_APP$ chaplain_NN1 3)
,_, [N {{3 the_AT Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3} ,_,
38_MC N]N]Ti]V] ._.


Example of prosodic annotation from the London-Lund corpus

1 8 14 1470 1 1A 11 ^ what a_bout a cigar\ette# . /
1 8 14 1480 1 1A 20 *((4 sylls))* /
1 8 14 1490 1 1B 11 *I ^w\on't have one th/anks#* - - - /
1 8 14 1500 1 1A 11 ^ aren't you 'going to sit d/own# - /
1 8 14 1510 1 1B 11 ^ [/\m] # - /
1 8 14 1520 1 1A 11 ^have my _coffee in p=eace# - - - /
Example of codes of prosodic annotation

#    end of tone group
^    onset
/    rising nuclear tone
\    falling nuclear tone
/\   rise-fall nuclear tone
_    level nuclear tone
[]   enclose partial words and phonetic symbols

Further conventions exist for representing unintelligible speech, background noise and overlapping speech, and for changing names to preserve privacy.
Lecture 4
Using corpora with other resources and corpus query tools (general); corpus/treebank quality control.

Readings: Buchholz and Green (2006); Miller and Fellbaum (2007); Sampson and McCarthy Ch 29.

Due date: Project Proposals
