Corpus Linguistics (L615): Linguistic Annotation

Markus Dickinson
Department of Linguistics, Indiana University
Spring 2009

Outline:
- Corpus markup
- Mark-up schemes
- Character encodings
- Corpus annotation
- Motivation
- How to annotate
- Levels of Annotation
- Text Processing
- Tokenization
- Lemmatization
- Standalone annotation
- More on XML

Corpus mark-up

Corpus mark-up is information about the corpus included in the corpus itself
- some of it governs formatting, printing, other processing

Mark-up is important to:
- help recover the original context
- allow for a wider range of questions to be explored (e.g., sociolinguistic variables)
- encode extralinguistic details (e.g., laughs, formulas)

Annotation

We will use the term annotation to refer specifically to linguistic information encoded in the text
- Generally speaking, annotation is a form of markup
- Non-linguistic information is also encoded in markup (meta-data)
  - i.e., information about the text: author, version, date, information about author, encoding

Corpus mark-up schemes

Meta-data needs to be kept systematically separate from the corpus data

A general idea is to use attributes and corresponding values
- e.g., attribute author has a value William Shakespeare
- We’ll see this encoded in XML in ways such as:
  - <... author="William Shakespeare" ...>
  - <author name="William Shakespeare">

Text Encoding Standards

- TEI: Text Encoding Initiative
- CES: Corpus Encoding Standard
- XCES: XML version of CES

If you annotate a corpus, make it conform to XCES as much as possible

Text Encoding Initiative (TEI)

TEI-compliant documents are broken into 2 parts:
1. Header = information about the document
   - File description (<fileDesc>): bibliographic information
   - Encoding description (<encodingDesc>): relationship to the source
   - Text profile (<profileDesc>): non-bibliographic information (e.g., setting of the text, languages used, etc.)
   - Revision history (<revisionDesc>): changes made
2. Body = the original text itself

TEI documents can be encoded in different mark-up languages, e.g., SGML and XML

Mark-up languages

Mark-up languages generally have the following aspects
- Tags, which come in both start (e.g., <s>) and end (e.g., </s>) forms
- Attributes and values, encoded within the start tag (e.g., <s num="1">)

<w POS=AT0>the</w>

XML

XML stands for: eXtensible Markup Language
- in contrast to HTML, XML does not have built-in “meaning” for labels: you must define your own tags!

An XML document consists of two parts: DTD (or XML Schema) and the real document
- DTD (Document Type Definition) defines structure of document
- XML Schema does the same thing, but in a different format
- XML documents can be validated, i.e., checked against rules in DTD

We’ll save a more thorough discussion of XML for the end of this unit

CES: Corpus Encoding Standard

TEI is for any text document and, as such, defines approximately 450 elements
- CES simplifies this system in order to encode language corpora

CES covers:
- document-wide mark-up
- gross structural mark-up, for structural units of text
- mark-up for sub-paragraph structures

CES documents can be validated at metalanguage, syntactic, or semantic levels

Character encodings

A corpus has to specify the character encoding of the text, as different encodings exist (ASCII, ISO-8859-1, etc.)

Unicode has a single representation (code point) for every character it covers
- Version 3.2 has codes for 95,221 characters from alphabets, syllabaries and logographic systems
- Unicode has three encoding forms
  - UTF-32 (32-bit code units): direct representation of each code point
  - UTF-16 (16-bit code units): 2^16 = 65536 values per unit; characters outside that range take two units
  - UTF-8 (8-bit code units): 2^8 = 256 values per unit; a character takes one to four units

Unicode (especially UTF-8) is the standard for XML documents

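
The difference between the three encoding forms can be seen by counting bytes per character. The following is a minimal Python sketch, not part of the original slides; the sample characters and the helper name encoded_lengths are chosen only for illustration.

```python
# Compare how many bytes each Unicode encoding form needs per character.
# The sample characters are arbitrary: Latin a, a-umlaut, a CJK character, an emoji.
samples = ["a", "\u00e4", "\u4e2d", "\U0001F600"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no byte-order mark)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    # ascii() keeps the printout safe on terminals without Unicode support
    print(ascii(ch), encoded_lengths(ch))
```

UTF-32 always uses four bytes, while UTF-8 and UTF-16 grow with the code point, which is why a corpus has to state which form it uses.
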
Corpus annotation

Corpus annotation = “interpretative, linguistic information” added to a corpus
- Corpus mark-up is relatively objective factual information
- Corpus annotation is more subjective, interpretative information

Motivating Annotation

Why would we want annotation?
- for training NLP tools
- for finding examples
  - what is the plural form of fish?
  - which nouns can occur as bare nouns, without a determiner?
  - are there subjectless sentences in German?
    – Yes, e.g. Mir ist kalt. (To me is cold.)
  - is it possible in English to have something between a noun and its modifying relative clause?

Annotation makes it possible to find phenomena that would otherwise disappear in masses of data

Potential benefits of annotation

Corpus annotation adds much value to a corpus
- Extracting information is easier:
  - Can easily isolate left in its adjective uses
  - Can find information on a language you don’t speak
- Corpus is more reusable
- Insights are more accessible to others
- Corpus is more multifunctional
- Annotation provides a clear record of analysis, open to future scrutiny
- Decisions are more objective, reproducible

Potential criticisms of annotation

- Annotation is cluttered
  - But: it can be easily ignored
- Annotation imposes one particular linguistic analysis
  - But: users can reject the analysis, and annotation at least makes the analysis process clearer
- Annotation makes a corpus less accessible and expandable
  - But: corpora can be extended without annotation, if need be
- Annotation cannot be done completely consistently
  - But: humans are fallible, and there are ways to ensure better consistency

Leech’s Seven Maxims of Annotation

1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1.
   - Taking points 1. and 2. together, the annotated corpus should allow the maximum flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.

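
Maxims 1 and 2 can be illustrated with a small Python sketch. It assumes a hypothetical inline format in which every token is written as word_TAG; the format, the example sentence, and the function name split_annotation are illustrative assumptions, not something the slides prescribe.

```python
# Hypothetical inline format (an assumption): each token is written as word_TAG.
annotated = "the_AT0 dog_NN1 barks_VVZ"

def split_annotation(text):
    """Separate an inline-annotated string into a list of words and a list of tags."""
    pairs = [token.rsplit("_", 1) for token in text.split()]
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return words, tags

words, tags = split_annotation(annotated)
raw_text = " ".join(words)      # maxim 1: revert to the raw corpus
tag_sequence = " ".join(tags)   # maxim 2: the annotation by itself
print(raw_text)
print(tag_sequence)
```

A format in which both directions are this mechanical leaves the user free to work with the text, the annotation, or both.
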
Leech’s Seven Maxims (2)

5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.

How corpus annotation is achieved

Some options:
- Fully manually
  - Pro: has the potential to be of highest quality
  - Con: time-consuming + humans are prone to errors
- Fully automatically (assuming appropriate technology)
  - Pro: quick and consistent
  - Con: will often be consistently wrong
- Semi-automatically, e.g., automatic analysis + manual post-editing
  - Pro: Can combine the best of both worlds
  - Con: Have to avoid the pitfalls of each

Levels of linguistic annotation

- morphological annotation (e.g. inflection, derivation, compounding)
- morpho-syntactic annotation: part-of-speech (POS) tagging
- syntactic annotation (e.g. named entities, phrasal chunking, full syntactic analysis)
- semantic annotation (e.g. word-sense disambiguation, anaphora and coreference resolution, information structure)
- discourse annotation (e.g. dialog turns, speech acts)

Step by Step Annotation

- tokenization
- lemmatization / morphological analysis
- part-of-speech tagging
- named-entity recognition
- partial parsing
- full syntactic parsing
- semantic and discourse processing

See section A4.4 of the book for a more thorough analysis of each of these

Annotation uses

- POS annotation: compare occurrences of word classes
- Lemmatization: study distribution of lexemes for lexicography
- Parsing: study clause types in a language or teach grammatical analysis
- Semantic annotation: analyze content

Other types of annotation can cover specific purposes: stylistic annotation, error tagging, problem-oriented annotation

We’ll examine POS and syntactic annotation in greater detail later in the semester

Preprocessing the Text: Tokenization

Tokenization refers to the annotation step of dividing the input text into units called tokens.

Each token consists of one of the following:
- a morpho-syntactic word
- a punctuation mark or a special character (e.g. &, @, %)
- a number

Tokenization – Example

before tokenization:
Milton wrote "Paradise Lost." Then his wife dies
and he wrote "Paradise Regained."

after tokenization:
Milton wrote " Paradise Lost . " Then his wife
dies and he wrote " Paradise Regained . "

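
A very simple tokenizer is enough to reproduce the Milton example. The Python sketch below uses a regular expression that separates word characters from punctuation; the pattern and the names TOKEN_PATTERN and tokenize are assumptions made for illustration, not the method used in any particular corpus.

```python
import re

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = 'Milton wrote "Paradise Lost." Then his wife dies and he wrote "Paradise Regained."'
tokens = tokenize(text)
print(" ".join(tokens))
```

This naive pattern would split don't into three tokens and detach the period from an abbreviation such as Nov., so it handles only the easy cases.
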
Why is Tokenization Non-Trivial?

- disambiguation of punctuation
  e.g. period can occur inside cardinal numbers, after ordinals, after abbreviations, at end of sentences
- recognition of complex words
  - compounds, e.g. bank transfer fee, US-company
  - mergers, e.g. don’t, England’s, French: t’aime
  - multiwords, e.g. complex prepositions provided that, in spite of

Languages such as Chinese present more challenging segmentation issues

Lemmatization

- refers to the process of relating individual word forms to their citation form (lemma) by means of morphological analysis
  e.g. stopped ⇒ stop
- provides a means to distinguish between the total number of word tokens and distinct lemmata that occur in a corpus
  e.g. helps to find all occurrences of buy
- is indispensable for highly inflectional languages which have a large number of distinct word forms for a given lemma

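
The distinction between word tokens and distinct lemmata can be made concrete with a toy example. The Python sketch below uses a hand-written lookup table (LEMMA_TABLE, an assumption for illustration only); a real system would rely on a morphological analyzer instead.

```python
from collections import Counter

# Toy lookup table (an assumption for illustration); a real lemmatizer
# would use morphological analysis rather than a hand-written dictionary.
LEMMA_TABLE = {"buys": "buy", "bought": "buy", "buying": "buy", "stopped": "stop"}

def lemmatize(token):
    """Map a word form to its lemma; unknown forms are returned lowercased."""
    form = token.lower()
    return LEMMA_TABLE.get(form, form)

tokens = ["She", "buys", "what", "he", "bought", "and", "stopped", "buying"]
lemma_counts = Counter(lemmatize(t) for t in tokens)

print(len(tokens))          # number of word tokens
print(len(lemma_counts))    # number of distinct lemmata
print(lemma_counts["buy"])  # occurrences of the lemma buy, across all its forms
```

Searching by lemma finds buys, bought and buying in one query, which a search on the surface form buy would miss.
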
Lemmatization – German Example

wie       wie         +Adv+Wh+#lex+COWIE
wie       wie         +Conj+Coord+#lex+COWIE
wie       wie         +Conj+Subord+#lex+COWIE
sie       sie         +Pron+Pers+3P+Pl+Fem+Nom+#lex+PERSPRO
sie       sie         +Pron+Pers+3P+Sg+Fem+Nom+#lex+PERSPRO
offenbar  offenbaren  +Verb+Imp+2P+Sg+#lex+VVFIN
offenbar  offenbar    +Adj+Pos+Pred+#lex+ADJD
gedacht   gedenken    +Verb+PPast+#lex+VVPP
gedacht   dachen      +Verb+PPast+#lex+VVPP
gedacht   denken      +Verb+PPast+#lex+VVPP
hat       haben       +Verb+Indc+3P+Sg+Pres+#lex+VAFIN

Tools for Lemmatization

- XEROX Morphological Analyzer: comprehensive morphological analyzers for many languages including English, French, Dutch, German, Hungarian, Italian, Portuguese, Czech, Danish, Finnish, Norwegian, Polish, Russian, Turkish.
  link: http://www.xrce.xerox.com/competencies/content-analysis/toolhome.en.html
- Lingsoft: morphological analyzers for English, Danish, German, Swedish, and Finnish
  link: http://www.lingsoft.fi/demos.html

Morphological Analysis

Xerox:
half half+Adj
half half+Adv
half half+Noun+Sg

Lingsoft:
"<half>"
"half" <Quant> DET PRE SG/PL @QN>
"half" <NonMod> <Quant> PRON SG/PL
"half" N NOM SG
"half" ADV

Standalone annotation

Instead of embedding annotation in the base document, one can use standalone, or standoff, annotation:
- The annotation is linked to particular points in the original document

Advantages:
- Allows base documents to be treated as read-only and distributed separately
- Allows for overlapping annotation and alternative annotation
- Allows for new annotation levels to be easily added

Disadvantages:
- Potentially difficult to allow for annotation to point to other annotation
- Fewer pre-built tools for standoff annotation

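
One common way to link standoff annotation to the base document is by character offsets. The Python sketch below assumes that convention; the layer names, the WORK_TITLE label, and the function resolve are hypothetical and serve only to show how separate, overlapping layers can point into an untouched text.

```python
# The base text is never modified; annotation lives in separate layers.
base_text = "Milton wrote Paradise Lost"

# Each layer is a list of (start, end, label) character offsets into base_text.
pos_layer = [(0, 6, "NNP"), (7, 12, "VBD"), (13, 21, "NNP"), (22, 26, "NNP")]
title_layer = [(13, 26, "WORK_TITLE")]  # hypothetical label; overlaps two POS spans

def resolve(layer, text):
    """Pair each label with the stretch of text it points to."""
    return [(text[start:end], label) for start, end, label in layer]

print(resolve(pos_layer, base_text))
print(resolve(title_layer, base_text))
```

Adding a new annotation level means adding another list, and the overlapping title span causes no conflict with the word-level layer.
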
Example XML document

We’ll look more closely at XML
- Some of this material is adapted from: http://www.w3schools.com/xml/

<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

XML tags

In the previous example, there are several things to note about the tags, i.e., the things in angled brackets (<...>):
- There has to be one root tag; in this case, <note> is the root tag
- Every (opening) tag needs a closing tag:
  <heading>Reminder</heading> is legitimate;
  <heading>Reminder is not
- Along with that, tags must be properly nested:
  <b><i>word</i></b> is legitimate;
  <b><i>word</b></i> is not

So, you’ll wind up with a structure like:
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

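
These well-formedness rules can be checked mechanically. The Python sketch below uses the standard library parser xml.etree.ElementTree; the helper name is_well_formed and the test strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<heading>Reminder</heading>"))      # closing tag present
print(is_well_formed("<heading>Reminder"))                 # closing tag missing
print(is_well_formed("<note><to>Tove</to></note>"))        # properly nested
print(is_well_formed("<note><to>Tove</note></to>"))        # badly nested
print(is_well_formed("<to>Tove</to><from>Jani</from>"))    # more than one root
```

Only the first and third strings are accepted; each of the others violates one of the three rules.
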
Attributes

Each tag (or element) can have a variety of attributes (provided such attributes have been declared; see below)
- Attributes are noted within an element tag; there can be multiple attributes
- Attribute values are put in quotes after the attribute
- Some examples:
  <note date="2/23/2006">
  <note date="2/23/2006" author="joe shmoe">
- This last one is equivalent to:
  <note author="joe shmoe" date="2/23/2006">

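
That attribute order carries no meaning can be confirmed with a parser. This is a minimal Python sketch using xml.etree.ElementTree; the self-closing form of the note element is used here only so that each string is a complete document.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<note date="2/23/2006" author="joe shmoe"/>')
second = ET.fromstring('<note author="joe shmoe" date="2/23/2006"/>')

# .attrib behaves like a dictionary, so the order of attributes does not matter.
same_attributes = first.attrib == second.attrib
print(same_attributes)
print(first.get("author"))
```
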
Comments and whitespace

Comments are added in between <!-- and -->

Does whitespace matter? ... Yes and no.
- Within the text of an element, each whitespace character is treated as significant (unlike, e.g., HTML)
- But it doesn’t matter how much whitespace you put between different tags, or if you indent a child tag
- The data is structured in and of itself, regardless of whitespace.

Attributes vs. Elements (1)

There is often a question of where to store information: in element text, in attributes, as children elements, ... Here are three different ways to say the “same” thing

<note date="12/11/2002">
  <to>Tove</to>
  <from>Jani</from>
</note>

<note>
  <date>12/11/2002</date>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
</note>

<note>
  <date>
    <day>12</day>
    <month>11</month>
    <year>2002</year>
  </date>
  <to>Tove</to>
  <from>Jani</from>
</note>

If you actually want to do something with the day, month, or year, this last example is the most effective (i.e., most structured)

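
The practical difference shows up as soon as a program needs one part of the date. The Python sketch below, using xml.etree.ElementTree, contrasts the attribute version with the fully structured one; the variable names are illustrative, and the day/month/year reading of the string follows the structured example.

```python
import xml.etree.ElementTree as ET

flat = ET.fromstring('<note date="12/11/2002"><to>Tove</to></note>')
nested = ET.fromstring(
    "<note><date><day>12</day><month>11</month><year>2002</year></date>"
    "<to>Tove</to></note>"
)

# Attribute version: the program must know the string layout in order to split it.
day_from_attribute = flat.get("date").split("/")[0]

# Element version: each part is addressed by name.
day_from_element = nested.findtext("date/day")
year_from_element = nested.findtext("date/year")

print(day_from_attribute, day_from_element, year_from_element)
```

With the attribute, the knowledge that the day comes first lives only in the code; with child elements it is recorded in the data itself.
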
In general, putting data in attributes needs to be done with care:
- attributes cannot contain multiple values (child elements can)
- attributes are not easily expandable (for future changes)
- attributes cannot describe structures (child elements can)
- attributes are more difficult to manipulate by program code
- attribute values are not as easy to test against a Document Type Definition (DTD), which is used to define the legal elements of an XML document

However, converting all attributes to elements can result in an inflated structure

Linguistic example

So, what do we do with corpora? Well, this won’t typically be your problem, as someone else will have decided, but here are some possibilities:

<terminal tag="NN">dog</terminal>

<terminal word="dog" tag="NN"></terminal>

<terminal>
  <word>dog</word>
  <tag>NN</tag>
</terminal>

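
Whichever layout a corpus uses, the reading code has to match it. The Python sketch below shows one hypothetical helper, word_and_tag, that recovers the same (word, tag) pair from all three layouts; it is an illustration of the trade-off, not how any specific corpus tool works.

```python
import xml.etree.ElementTree as ET

def word_and_tag(terminal):
    """Read (word, tag) from any of the three <terminal> layouts."""
    # Try the attribute first, then a child element, then the element's own text.
    word = terminal.get("word") or terminal.findtext("word") or (terminal.text or "").strip()
    tag = terminal.get("tag") or terminal.findtext("tag")
    return word, tag

variants = [
    '<terminal tag="NN">dog</terminal>',
    '<terminal word="dog" tag="NN"></terminal>',
    "<terminal><word>dog</word><tag>NN</tag></terminal>",
]
results = [word_and_tag(ET.fromstring(v)) for v in variants]
print(results)
```
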
Validation/DTDs

We want to know what makes up a valid XML file, i.e., we want to make sure our corpus is annotated in a consistent fashion; if not, the DTD allows us to automatically find out
- We can specify the Document Type Definition (DTD) in a separate file or at the top of the XML file
- It tells us what the valid elements are; what children those elements can have; what attributes; etc.

Document Type Definition (DTD) example

<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>

- Document is of type note (i.e. that’s the root element)
- The element note has four different children
- The other four elements simply have parsed character data (#PCDATA) as their content

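
To show what such a content model rules in and out, the Python sketch below hard-codes the note declaration and checks documents against it. This is a hand-written stand-in under stated assumptions, not a DTD validator: Python's standard library does not validate against DTDs, and the names NOTE_CHILDREN and follows_note_model are hypothetical.

```python
import xml.etree.ElementTree as ET

# Content model of <note> from the DTD, written out by hand.
# This is a stand-in check for illustration, not a real DTD validator.
NOTE_CHILDREN = ["to", "from", "heading", "body"]

def follows_note_model(xml_text):
    """True if the root is <note> with exactly the expected text-only children, in order."""
    root = ET.fromstring(xml_text)
    children = list(root)
    return (
        root.tag == "note"
        and [child.tag for child in children] == NOTE_CHILDREN
        and all(len(child) == 0 for child in children)  # #PCDATA only, no sub-elements
    )

good = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Hi</body></note>"
bad = "<note><from>Jani</from><to>Tove</to></note>"
print(follows_note_model(good), follows_note_model(bad))
```

The second document is well-formed XML but not valid with respect to the declaration, because children are missing and out of order.
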
More complex definitions

- Declaring one or more of the same element:
  <!ELEMENT nonterminal (terminal+)>
- Declaring zero or more of the same element:
  <!ELEMENT nonterminal (terminal*)>
- Declaring mixed content, in any order, any number of times (in an XML DTD, #PCDATA must come first):
  <!ELEMENT terminal (#PCDATA|word|tag)*>

Declaring attributes is slightly more complicated

DTD example:
<!ATTLIST payment type CDATA "check">

XML example:
<payment type="check" />

You can set attribute values to be #IMPLIED (not necessary), #REQUIRED, or #FIXED

XML Schema (XSD)

DTDs are in the process of being replaced by XML Schemas (XSDs)
- See http://www.w3schools.com/schema/

Why XSDs?
- More support for data types, including, e.g., being easier to convert between different data types
- XSDs use XML syntax, so you don’t have to learn a new syntax
- Easier to validate correctness

Our earlier DTD would have the equivalent XSD on the next page ...

XSD example

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://..." targetNamespace="http://..."
           xmlns="http://..." elementFormDefault="qualified">

<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

XML for linguistic purposes

- XML allows us to add structured information to a text
- By adding XML annotation, we are able to preserve the original document, i.e., we only add to it
- The complexity of the XML used depends on exactly what the task is.
- Using XML and a corresponding DTD also allows corpus designers or software designers to specify how the annotation information should be formatted

XML for “raw” text

New York Times data from the English Gigaword corpus:

<DOC id="NYT19940701.0001" type="story" >
<HEADLINE>
WITNESS SAYS O.J. SIMPSON BOUGHT KNIFE WEEKS BEFORE SLAYINGS
</HEADLINE>
<DATELINE>
LOS ANGELES (BC-SIMPSON-KILLINGS-1stLd-3Takes-Writethru-LADN)
</DATELINE>
<TEXT>
<P>
With the nation’s attention riveted again on a Los
Angeles courtroom, a knife dealer testified that O.J. Simpson
bought a 15-inch knife five weeks before the slashing deaths of his
ex-wife and her friend.
</P>
...
<P>
‘‘She frequented the restaurant quite often,’’ DeBello said.
</P>
<P>
(STORY CAN END HERE. OPTIONAL 2ND TAKE FOLLOWS.)
</P>
</TEXT>
</DOC>

Part of Gigaword DTD

<!ELEMENT DOC - - (HEADLINE*,
                   DATELINE*,
                   TEXT*) >
<!ELEMENT HEADLINE - - (#PCDATA) >
<!ELEMENT DATELINE - - (#PCDATA) >
<!ELEMENT TEXT - - (P* | #PCDATA)+ >

<!ELEMENT P - - (#PCDATA) >

<!ENTITY amp "&" >
<!ENTITY AMP "&" >
<!ENTITY #DEFAULT SYSTEM >
<!ATTLIST DOC id CDATA #REQUIRED
              type CDATA #REQUIRED >
]>

Adding POS information: BNC

</c></s><s n="45" p="n" teiform="s">
<w type="av0" teiform="w">Commonly </w>
<w type="vvn-vvd" teiform="w">held </w>
<w type="nn2" teiform="w">ideas </w>
<w type="vvb" teiform="w">restrict </w>
<w type="at0" teiform="w">the </w>
<w type="aj0" teiform="w">social </w>
<w type="nn1" teiform="w">role </w>
<w type="cjc" teiform="w">and </w>
<w type="nn1" teiform="w">status </w>
<w type="prf" teiform="w">of </w>
<w type="ajc" teiform="w">older </w>
<w type="nn0" teiform="w">people</w>
<c type="pun" teiform="c">, </c>
<w type="vvb-nn1" teiform="w">structure </w>
<w type="dps" teiform="w">their </w>
<w type="nn2" teiform="w">expectations </w>
<w type="prf" teiform="w">of </w>
<w type="pnx" teiform="w">themselves</w>
<c type="pun" teiform="c">, </c>
<w type="vvb" teiform="w">prevent </w>
<w type="pnp" teiform="w">them </w>
<w type="vvg" teiform="w">achieving </w>
<w type="dps" teiform="w">their </w>
...
<c type="pun" teiform="c">.</c></s>

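
Once POS tags are stored as attributes in this way, pulling out word/tag pairs takes only a few lines. The Python sketch below works on a shortened, self-contained version of the sentence (the teiform attributes are dropped and the sentence is cut after three words, both simplifications made here for illustration).

```python
import xml.etree.ElementTree as ET

# Shortened version of the BNC sentence; simplified for illustration.
snippet = """<s n="45">
<w type="av0">Commonly </w>
<w type="vvn-vvd">held </w>
<w type="nn2">ideas </w>
<c type="pun">.</c>
</s>"""

sentence = ET.fromstring(snippet)
# <w> elements are words, <c> elements are punctuation; both carry the tag in "type".
tagged = [(el.text.strip(), el.get("type")) for el in sentence if el.tag in ("w", "c")]
nouns = [word for word, tag in tagged if tag.startswith("nn")]
print(tagged)
print(nouns)
```

Selecting on the tag prefix nn is a simple way to find nouns; ambiguity tags such as vvb-nn1 would need a deliberate decision about whether to include them.
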
XML for syntactic trees: TigerXML (WSJ)

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<corpus xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="/home/compling/TIGERSearch/1.1rc2/schema/TigerXML.xsd" id="temp" >
<head external="file:temp_generated_header.xml"/>
<body>
<s id="s1" >
<graph root="s1_500" >
<terminals>
<t id="s1_1" word="Pierre" pos="NNP" />
<t id="s1_2" word="Vinken" pos="NNP" />
<t id="s1_3" word="," pos="," />
<t id="s1_4" word="61" pos="CD" />
<t id="s1_5" word="years" pos="NNS" />
<t id="s1_6" word="old" pos="JJ" />
<t id="s1_7" word="," pos="," />
<t id="s1_8" word="will" pos="MD" />
<t id="s1_9" word="join" pos="VB" />
<t id="s1_10" word="the" pos="DT" />
<t id="s1_11" word="board" pos="NN" />
<t id="s1_12" word="as" pos="IN" />
<t id="s1_13" word="a" pos="DT" />
<t id="s1_14" word="nonexecutive" pos="JJ" />
<t id="s1_15" word="director" pos="NN" />
<t id="s1_16" word="Nov." pos="NNP" />
<t id="s1_17" word="29" pos="CD" />
<t id="s1_18" word="." pos="." />
</terminals>

WSJ in TigerXML (cont.)

<nonterminals>
<nt id="s1_502" cat="NP" >
<edge idref="s1_1" label="--" />
<edge idref="s1_2" label="--" />
</nt>
<nt id="s1_504" cat="NP" >
<edge idref="s1_4" label="--" />
</nt>
...
<nt id="s1_501" cat="NP" >
<edge idref="s1_502" label="--" />
<edge idref="s1_503" label="--" />
</nt>
...
<nt id="s1_500" cat="S" >
<edge idref="s1_501" label="SBJ" />
</nt>
</nonterminals>
</graph>
</s>

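
In this format the tree is encoded indirectly: each nonterminal lists edges whose idref points to a terminal or to another nonterminal. The Python sketch below follows those references to recover the words a node dominates. It runs on a much reduced graph (two terminals, two nonterminals, with the S node linked directly to the NP), which is a simplification made for illustration, and the function name yield_of is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced graph in the same style as the excerpt; simplified for illustration.
graph = ET.fromstring("""<graph root="s1_500">
  <terminals>
    <t id="s1_1" word="Pierre" pos="NNP" />
    <t id="s1_2" word="Vinken" pos="NNP" />
  </terminals>
  <nonterminals>
    <nt id="s1_502" cat="NP">
      <edge idref="s1_1" label="--" />
      <edge idref="s1_2" label="--" />
    </nt>
    <nt id="s1_500" cat="S">
      <edge idref="s1_502" label="SBJ" />
    </nt>
  </nonterminals>
</graph>""")

words = {t.get("id"): t.get("word") for t in graph.iter("t")}
nodes = {nt.get("id"): nt for nt in graph.iter("nt")}

def yield_of(node_id):
    """Collect the words dominated by a node by following edge idrefs recursively."""
    if node_id in words:
        return [words[node_id]]
    result = []
    for edge in nodes[node_id].findall("edge"):
        result.extend(yield_of(edge.get("idref")))
    return result

print(yield_of("s1_502"))
print(yield_of(graph.get("root")))
```
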
Header information: define your classes

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<head>
<meta>
<name>/home/dickinso/temp.xml.gz</name>
<format>bracketing format</format>
</meta>
<annotation>
<feature name="word" domain="T" />
<feature name="pos" domain="T">
<value name="CC" />
...
</feature>
<feature name="cat" domain="NT">
<value name="ADJP" />
...
</feature>
<edgelabel>
<value name="--" />
<value name="ADV" />
...
</edgelabel>
<secedgelabel>
<value name="*" />
...
</secedgelabel>
</annotation>
</head>
