
Corpus Linguistics (L615)
Linguistic Annotation

Markus Dickinson
Department of Linguistics, Indiana University
Spring 2009

Outline:
- Corpus markup
  - Mark-up schemes
  - Character encodings
- Corpus annotation
  - Motivation
  - How to annotate
  - Levels of Annotation
- Text Processing
  - Tokenization
  - Lemmatization
- Standalone annotation
- More on XML

Corpus mark-up

Corpus mark-up is information about the corpus included in the corpus itself
- some of it governs formatting, printing, other processing

Mark-up is important to:
- help recover the original context
- allow for a wider range of questions to be explored (e.g., sociolinguistic variables)
- encode extralinguistic details (e.g., laughs, formulas)

Annotation

We will use the term annotation to refer specifically to linguistic information encoded in the text
- Generally speaking, annotation is a form of markup
- Non-linguistic information is also encoded in markup (meta-data)
  - i.e. information about the text: author, version, date, information about author, encoding

Corpus mark-up schemes

Meta-data systematically needs to be kept separate from the corpus data

A general idea is to use attributes and corresponding values
- e.g., attribute author has a value William Shakespeare
- We’ll see this encoded in XML in ways such as:
  - <... author="William Shakespeare" ...>
  - <author name="William Shakespeare">

Text Encoding Standards

- TEI: Text Encoding Initiative
- CES: Corpus Encoding Standard
- XCES: XML version of CES

If you annotate a corpus, make it conform to XCES as much as possible

Text Encoding Initiative (TEI)

TEI-compliant documents are broken into 2 parts:

1. Header = information about the document
   - File description (<fileDesc>): bibliographic information
   - Encoding description (<encodingDesc>): relationship to the source
   - Text profile (<profileDesc>): non-bibliographic information (e.g., setting of the text, languages used, etc.)
   - Revision history (<revisionDesc>): changes made
2. Body = the original text itself

TEI documents can be encoded in different mark-up languages, e.g., SGML and XML

Mark-up languages

Mark-up languages generally have the following aspects
- Tags, which come in both start (e.g., <s>) and end (e.g., </s>) forms
- Attributes and values, encoded within the start tag (e.g., <s num="1">)

    <w POS=AT0>the</w>

XML

XML stands for: eXtensible Markup Language
- in contrast to HTML, XML does not have built-in “meaning” for labels: you must define your own tags!

An XML document consists of two parts: DTD (or XML Schema) and the real document
- DTD (Document Type Definition) defines structure of document
- XML Schema does the same thing, but in a different format
- XML documents can be validated, i.e., checked against rules in DTD

We’ll save a more thorough discussion of XML for the end of this unit

CES: Corpus Encoding Standard

TEI is for any text document and, as such, defines approximately 450 elements
- CES simplifies this system in order to encode language corpora

CES covers:
- document-wide mark-up
- gross structural mark-up, for structural units of text
- mark-up for sub-paragraph structures

CES documents can be validated at metalanguage, syntactic, or semantic levels

Character encodings

A corpus has to specify the character encoding of its text, as different encodings exist (ASCII, ISO-8859-1, etc.)

Unicode aims to give every character a single representation (a code point)
- Version 3.2 has codes for 95,221 characters from alphabets, syllabaries and logographic systems
- Unicode has three encoding forms
  - UTF-32 (32-bit units): direct representation of each code point
  - UTF-16 (16-bit units): 2^16 = 65536 values per unit; characters outside that range take two units
  - UTF-8 (8-bit units): 2^8 = 256 values per unit; characters outside ASCII take two to four units

Unicode (especially UTF-8) is the standard for XML documents

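As a rough illustration of the three encoding forms, the following Python sketch prints how many bytes a character occupies in each. The sample characters and the function name encoded_lengths are chosen here for illustration and do not come from the slides.

```python
# Bytes needed per character in the three Unicode encoding forms.
# The sample characters below are illustrative assumptions.

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" codecs omit the byte-order mark, so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # ASCII letter, Latin-1 letter, CJK character, musical G clef (U+1D11E)
    for ch in ["a", "\u00e4", "\u4e2d", "\U0001D11E"]:
        print(repr(ch), encoded_lengths(ch))
```

UTF-32 always uses four bytes, while the UTF-8 and UTF-16 lengths vary with the character, which is why the bit widths on the slide are best read as unit sizes rather than as limits on the number of characters.
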
Corpus annotation

Corpus annotation = “interpretative, linguistic information” added to a corpus
- Corpus mark-up is relatively objective factual information
- Corpus annotation is more subjective, interpretative information

Motivating Annotation

Why would we want annotation?
- for training NLP tools
- for finding examples
  - what is the plural form of fish?
  - which nouns can occur as bare nouns, without a determiner?
  - are there subjectless sentences in German?
    – Yes, e.g. Mir ist kalt. (To me is cold.)
  - is it possible in English to have something between a noun and its modifying relative clause?

Annotation makes it possible to find phenomena that would otherwise disappear in masses of data

Potential benefits of annotation

Corpus annotation adds much value to a corpus
- Extracting information is easier:
  - Can easily isolate left in its adjective uses
  - Can find information on a language you don’t speak
- Corpus is more reusable
- Insights are more accessible to others
- Corpus is more multifunctional
- Annotation provides a clear record of analysis, open to future scrutiny
- Decisions are more objective, reproducible

Potential criticisms of annotation

- Annotation is cluttered
  - But: it can be easily ignored
- Annotation imposes one particular linguistic analysis
  - But: users can reject the analysis, and annotation at least makes the analysis process clearer
- Annotation makes a corpus less accessible and expandable
  - But: corpora can be extended without annotation, if need be
- Annotation cannot be done completely consistently
  - But: humans are fallible, and there are ways to ensure better consistency

Leech’s Seven Maxims of Annotation

1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1.
   - Taking points 1. and 2. together, the annotated corpus should allow the maximum flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.

Leech’s Seven Maxims (2)

5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.

How corpus annotation is achieved

Some options:
- Fully manually
  - Pro: has the potential to be of highest quality
  - Con: time-consuming + humans are prone to errors
- Fully automatically (assuming appropriate technology)
  - Pro: quick and consistent
  - Con: will often be consistently wrong
- Semi-automatically, e.g., automatic analysis + manual post-editing
  - Pro: Can combine the best of both worlds
  - Con: Have to avoid the pitfalls of each

Levels of linguistic annotation

- morphological annotation (e.g. inflection, derivation, compounding)
- morpho-syntactic annotation: part-of-speech (POS) tagging
- syntactic annotation (e.g. named entities, phrasal chunking, full syntactic analysis)
- semantic annotation (e.g. word-sense disambiguation, anaphora and coreference resolution, information structure)
- discourse annotation (e.g. dialog turns, speech acts)

Step by Step Annotation

- tokenization
- lemmatization / morphological analysis
- part-of-speech tagging
- named-entity recognition
- partial parsing
- full syntactic parsing
- semantic and discourse processing

See section A4.4 of the book for a more thorough analysis of each of these

Annotation uses

- POS annotation: compare occurrences of word classes
- Lemmatization: study distribution of lexemes for lexicography
- Parsing: study clause types in a language or teach grammatical analysis
- Semantic annotation: analyze content

Other types of annotation can cover specific purposes: stylistic annotation, error tagging, problem-oriented annotation

We’ll examine POS and syntactic annotation in greater detail later in the semester

Preprocessing the Text: Tokenization

Tokenization refers to the annotation step of dividing the input text into units called tokens.

Each token consists of one of the following:
- a morpho-syntactic word
- a punctuation mark or a special character (e.g. &, @, %)
- a number

Tokenization – Example

before tokenization:

    Milton wrote "Paradise Lost." Then his wife dies
    and he wrote "Paradise Regained."

after tokenization:

    Milton wrote " Paradise Lost . " Then his wife
    dies and he wrote " Paradise Regained . "

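The before/after pair above can be reproduced with a very small regular-expression tokenizer. This is only a sketch: the function name simple_tokenize and the pattern are assumptions made here, not the tokenizer used in the course, and the pattern is deliberately naive.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation characters."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one
    # punctuation mark or special character at a time.
    return re.findall(r"\w+|[^\w\s]", text)


before = ('Milton wrote "Paradise Lost." Then his wife dies '
          'and he wrote "Paradise Regained."')

if __name__ == "__main__":
    # Tokens are joined with single spaces, as in the slide's "after" text.
    print(" ".join(simple_tokenize(before)))
```

Treating every punctuation character as its own token works for this example, but it is exactly the kind of rule that breaks on abbreviations, numbers, and clitics.
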
Why is Tokenization Non-Trivial?

- disambiguation of punctuation
  e.g. period can occur inside cardinal numbers, after ordinals, after abbreviations, at end of sentences
- recognition of complex words
  - compounds, e.g. bank transfer fee, US-company
  - mergers, e.g. don’t, England’s, French: t’aime
  - multiwords, e.g. complex prepositions provided that, in spite of

Languages such as Chinese present more challenging segmentation issues

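To make the punctuation problem concrete, here is a hedged sketch of a tokenizer that consults a small abbreviation list and keeps decimal numbers together. The list ABBREVIATIONS and the function name tokenize are hypothetical; a real system would need a much larger lexicon or a trained model, and this sketch leaves mergers such as don't unresolved.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for illustration.
ABBREVIATIONS = {"Nov.", "e.g.", "U.S."}

# Numbers with internal periods first, then words, then single
# punctuation marks or special characters.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Tokenize on whitespace, then split off punctuation unless the
    whole chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # the period belongs to the token
        else:
            tokens.extend(TOKEN_PATTERN.findall(chunk))
    return tokens


if __name__ == "__main__":
    print(tokenize("He joins Nov. 29 for 3.5 years."))
    print(tokenize("don't"))  # still split into three pieces
```

The design choice is to decide per whitespace-delimited chunk whether a trailing period is part of the token; sentence-final abbreviations, which need both readings at once, are not handled.
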
Lemmatization

- refers to the process of relating individual word forms to their citation form (lemma) by means of morphological analysis
  e.g. stopped ⇒ stop
- provides a means to distinguish between the total number of word tokens and distinct lemmata that occur in a corpus
  e.g. helps to find all occurrences of buy
- is indispensable for highly inflectional languages which have a large number of distinct word forms for a given lemma

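The distinction between word tokens and distinct lemmata can be sketched with a toy lookup table. The table LEMMA_TABLE and the example sentence are invented for illustration; an actual lemmatizer would rely on morphological analysis rather than a fixed dictionary.

```python
from collections import Counter

# Hypothetical hand-written lookup table (not a real lexicon).
LEMMA_TABLE = {
    "buys": "buy", "bought": "buy", "buying": "buy",
    "stopped": "stop", "stops": "stop",
}


def lemmatize(token):
    """Map a word form to its lemma; unknown forms map to themselves."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)


def lemma_counts(tokens):
    """Count how often each lemma occurs in a list of tokens."""
    return Counter(lemmatize(t) for t in tokens)


tokens = "She buys what he bought and stopped buying".split()
counts = lemma_counts(tokens)

if __name__ == "__main__":
    print(len(tokens), "tokens,", len(counts), "distinct lemmata")
    print("occurrences of buy:", counts["buy"])
```

Searching for the lemma buy finds all three inflected forms at once, which is the use case the slide mentions.
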
Lemmatization – German Example

    wie       wie         +Adv+Wh+#lex+COWIE
    wie       wie         +Conj+Coord+#lex+COWIE
    wie       wie         +Conj+Subord+#lex+COWIE

    sie       sie         +Pron+Pers+3P+Pl+Fem+Nom+#lex+PERSPRO
    sie       sie         +Pron+Pers+3P+Sg+Fem+Nom+#lex+PERSPRO

    offenbar  offenbaren  +Verb+Imp+2P+Sg+#lex+VVFIN
    offenbar  offenbar    +Adj+Pos+Pred+#lex+ADJD

    gedacht   gedenken    +Verb+PPast+#lex+VVPP
    gedacht   dachen      +Verb+PPast+#lex+VVPP
    gedacht   denken      +Verb+PPast+#lex+VVPP

    hat       haben       +Verb+Indc+3P+Sg+Pres+#lex+VAFIN

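Analyzer output of this shape is easy to post-process. The sketch below assumes the three-column layout shown on the slide (word form, lemma, "+"-separated features); that layout is read off the slide and is not a documented file format, and the function names are hypothetical.

```python
from collections import defaultdict

# A few lines copied from the slide.
ANALYSES = """\
gedacht gedenken +Verb+PPast+#lex+VVPP
gedacht dachen +Verb+PPast+#lex+VVPP
gedacht denken +Verb+PPast+#lex+VVPP
hat haben +Verb+Indc+3P+Sg+Pres+#lex+VAFIN
"""


def parse_analysis(line):
    """Split one analysis line into (form, lemma, list of features)."""
    form, lemma, tags = line.split()
    features = [t for t in tags.split("+") if t]  # drop the leading empty item
    return form, lemma, features


def candidate_lemmas(text):
    """Collect, for each word form, the lemmas proposed by the analyzer."""
    lemmas = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            form, lemma, _ = parse_analysis(line)
            if lemma not in lemmas[form]:
                lemmas[form].append(lemma)
    return dict(lemmas)


if __name__ == "__main__":
    for form, lemmas in candidate_lemmas(ANALYSES).items():
        print(form, "->", lemmas)
```

Grouping by word form makes the ambiguity visible: gedacht receives three candidate lemmas, so a later disambiguation step is needed.
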
Tools for Lemmatization

- XEROX Morphological Analyzer: comprehensive morphological analyzers for many languages including English, French, Dutch, German, Hungarian, Italian, Portuguese, Czech, Danish, Finnish, Norwegian, Polish, Russian, Turkish.
  link: http://www.xrce.xerox.com/competencies/content-analysis/toolhome.en.html
- Lingsoft: morphological analyzers for English, Danish, German, Swedish, and Finnish
  link: http://www.lingsoft.fi/demos.html

Morphological Analysis

Xerox:

    half    half+Adj
    half    half+Adv
    half    half+Noun+Sg

Lingsoft:

    "<half>"
        "half" <Quant> DET PRE SG/PL @QN>
        "half" <NonMod> <Quant> PRON SG/PL
        "half" N NOM SG
        "half" ADV

Standalone annotation

Instead of embedding annotation in the base document, one can use standalone, or standoff, annotation:
- The annotation is linked to particular points in the original document

Advantages:
- Allows base documents to be treated as read-only and to be distributed separately
- Allows for overlapping annotation and alternative annotation
- Allows for new annotation levels to be easily added

Disadvantages:
- Potentially difficult to allow for annotation to point to other annotation
- Fewer pre-built tools for standoff annotation

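One common way to realize standoff annotation is to store character offsets into an unmodified base text. The sketch below assumes that scheme; the layer names, the tuple layout, and the PERSON label are hypothetical, and real standoff formats typically use XML pointers rather than Python tuples.

```python
# The base document is never modified.
BASE_TEXT = "Pierre Vinken will join the board."

# Each layer lives apart from the text and points into it with
# (start, end, label) character offsets. Layout is an assumption.
POS_LAYER = [
    (0, 6, "NNP"), (7, 13, "NNP"), (14, 18, "MD"),
    (19, 23, "VB"), (24, 27, "DT"), (28, 33, "NN"), (33, 34, "."),
]
# A second layer whose span overlaps two POS spans.
NE_LAYER = [(0, 13, "PERSON")]


def resolve(layer, text=BASE_TEXT):
    """Pair each annotated span with the text it points to."""
    return [(text[start:end], label) for start, end, label in layer]


if __name__ == "__main__":
    print(resolve(POS_LAYER))
    print(resolve(NE_LAYER))
```

Because each layer is a separate list, a new level can be added, or an alternative analysis swapped in, without touching BASE_TEXT or the other layers.
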
Example XML document

We’ll look more closely at XML
- Some of this material is adapted from: http://www.w3schools.com/xml/

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don’t forget me this weekend!</body>
    </note>

XML tags

In the previous example, there are several things to note about the tags, i.e., the things in angled brackets (<...>):
- There has to be one root tag; in this case, <note> is the root tag
- Every (opening) tag needs a closing tag:
  <heading>Reminder</heading> is legitimate;
  <heading>Reminder is not
- Along with that, tags must be properly nested:
  <b><i>word</i></b> is legitimate;
  <b><i>word</b></i> is not

So, you’ll wind up with a structure like:

    <root>
      <child>
        <subchild>.....</subchild>
      </child>
    </root>

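These well-formedness rules can be checked mechanically with any XML parser. A minimal sketch using Python's standard xml.etree.ElementTree module follows; the helper name is_well_formed is an assumption made here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    examples = [
        "<heading>Reminder</heading>",  # closed tag
        "<heading>Reminder",            # missing closing tag
        "<b><i>word</i></b>",           # properly nested
        "<b><i>word</b></i>",           # badly nested
        "<to>Tove</to><from>Jani</from>",  # two roots
    ]
    for xml_text in examples:
        print(is_well_formed(xml_text), xml_text)
```

This only tests well-formedness; whether the tags are the right ones for a given corpus is a separate question of validity against a DTD or schema.
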
Attributes

Each tag (or element) can have a variety of attributes (provided such attributes have been declared; see below)
- Attributes are noted within an element tag; there can be multiple attributes
- Attribute values are put in quotes after the attribute
- Some examples:
    <note date="2/23/2006">
    <note date="2/23/2006" author="joe shmoe">
- This last one is equivalent to:
    <note author="joe shmoe" date="2/23/2006">

Comments and whitespace

Comments are added in between <!-- and -->

Does whitespace matter? ... Yes and no.
- Within an element's text content, each whitespace character is treated as significant (unlike, e.g., HTML, which collapses runs of whitespace)
- But it doesn’t matter how much whitespace you put between different tags, or if you indent a child tag
- The data is structured in and of itself, regardless of whitespace.

Attributes vs. Elements (1)

There is often a question of where to store information: in element text, in attributes, as children elements, ... Here are three different ways to say the “same” thing

    <note date="12/11/2002">
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don’t forget me this weekend!</body>
    </note>

Attributes vs. Elements (2)

    <note>
      <date>12/11/2002</date>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don’t forget me this weekend!</body>
    </note>

Attributes vs. Elements (3)

    <note>
      <date>
        <day>12</day>
        <month>11</month>
        <year>2002</year>
      </date>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don’t forget me this weekend!</body>
    </note>

If you actually want to do something with the day, month, or year, this last example is the most effective (i.e., most structured)

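The practical difference shows up as soon as a program needs the date. In this hedged sketch (function names and the day_first parameter are assumptions introduced here), the structured version can be read directly, whereas the attribute version forces the program to guess which number is the day.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Abridged versions of the slide examples.
STRUCTURED = """<note>
  <date><day>12</day><month>11</month><year>2002</year></date>
  <to>Tove</to>
</note>"""

ATTRIBUTE = '<note date="12/11/2002"><to>Tove</to></note>'


def date_from_elements(note):
    """Read the date from explicit <day>, <month>, <year> children."""
    d = note.find("date")
    return date(int(d.findtext("year")),
                int(d.findtext("month")),
                int(d.findtext("day")))


def date_from_attribute(note, day_first=True):
    """Parse a slash-separated date string; the field order is a guess
    that the caller must supply."""
    first, second, year = (int(p) for p in note.get("date").split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)


if __name__ == "__main__":
    print(date_from_elements(ET.fromstring(STRUCTURED)))
    print(date_from_attribute(ET.fromstring(ATTRIBUTE)))
    print(date_from_attribute(ET.fromstring(ATTRIBUTE), day_first=False))
```

The string "12/11/2002" yields two different dates depending on the assumed convention, while the element version states day and month explicitly.
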
Attributes

In general, putting data in attributes needs to be done with care:
- attributes cannot contain multiple values (child elements can)
- attributes are not easily expandable (for future changes)
- attributes cannot describe structures (child elements can)
- attributes are more difficult to manipulate by program code
- attribute values are not as easy to test against a Document Type Definition (DTD), which is used to define the legal elements of an XML document

However, converting all attributes to elements can result in an inflated structure

Linguistic example

So, what do we do with corpora? Well, this won’t typically be your problem, as someone else will have decided, but here are some possibilities:

    <terminal tag="NN">dog</terminal>

    <terminal word="dog" tag="NN"></terminal>
    <!-- or: <terminal word="dog" tag="NN"/> -->

    <terminal>
      <word>dog</word>
      <tag>NN</tag>
    </terminal>

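Whichever layout a corpus uses, a reader usually wants the same (word, tag) pair out of it. The following sketch, with the hypothetical helper word_and_tag, reads all three layouts using Python's standard ElementTree module.

```python
import xml.etree.ElementTree as ET

# The three layouts from the slide.
LAYOUTS = [
    '<terminal tag="NN">dog</terminal>',
    '<terminal word="dog" tag="NN"/>',
    "<terminal>\n<word>dog</word>\n<tag>NN</tag>\n</terminal>",
]


def word_and_tag(terminal):
    """Return (word, tag) from any of the three <terminal> layouts."""
    # Try the attribute, then a child element, then the element's own text.
    word = (terminal.get("word")
            or terminal.findtext("word")
            or (terminal.text or "").strip())
    tag = terminal.get("tag") or terminal.findtext("tag")
    return word, tag


if __name__ == "__main__":
    for xml_text in LAYOUTS:
        print(word_and_tag(ET.fromstring(xml_text)))
```

The fallback chain is what makes the reader more complicated than any single layout would require, which is one reason a corpus should pick one representation and declare it.
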
Validation/DTDs

We want to know what makes up a valid XML file, i.e., we want to make sure our corpus is annotated in a consistent fashion; if not, the DTD allows us to automatically find out
- We can specify the Document Type Definition (DTD) in a separate file or at the top of the XML file
- It tells us what the valid elements are; what children those elements can have; what attributes; etc.

Document Type Definition (DTD) example

    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>

- Document is of type note (i.e. that’s the root element)
- The element note has four different children
- The other four elements simply have parsed character data (#PCDATA) as their content

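To show what this particular DTD demands, the sketch below hand-codes its content model as a Python check. It is not a DTD validator: Python's standard ElementTree parser is non-validating, so the names NOTE_MODEL and follows_note_model are stand-ins written here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Children required by <!ELEMENT note (to,from,heading,body)>, in order.
NOTE_MODEL = ["to", "from", "heading", "body"]


def follows_note_model(xml_text):
    """Check that the root is <note> with exactly the four children,
    in order, each containing only text (no sub-elements)."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    child_tags = [child.tag for child in root]
    text_only = all(len(child) == 0 for child in root)
    return child_tags == NOTE_MODEL and text_only


GOOD = ("<note><to>Tove</to><from>Jani</from>"
        "<heading>Reminder</heading><body>Hi</body></note>")
MISSING_BODY = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading></note>"

if __name__ == "__main__":
    print(follows_note_model(GOOD))
    print(follows_note_model(MISSING_BODY))
```

A real validator reads the DTD itself, so the rules live in one declarative place instead of being re-implemented in code for every corpus.
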
More complex definitions

- Declaring one or more of the same element:
    <!ELEMENT nonterminal (terminal+)>
- Declaring zero or more of the same element:
    <!ELEMENT nonterminal (terminal*)>
- Declaring mixed content, in any order, any number of times (in an XML DTD, #PCDATA must come first in the group):
    <!ELEMENT terminal (#PCDATA|word|tag)*>

Attributes

Declaring attributes is slightly more complicated

DTD example:

    <!ATTLIST payment type CDATA "check">
    <!-- element-name att-name valid-content default-value -->

XML example:

    <payment type="check" />

You can set attribute values to be #IMPLIED (not necessary), #REQUIRED, or #FIXED

XML Schema (XSD)

DTDs are in the process of being replaced by XML Schemas (XSDs)
- See http://www.w3schools.com/schema/

Why XSDs?
- More support for data types, including, e.g., being easier to convert between different data types
- XSDs use XML syntax, so you don’t have to relearn a new syntax
- Easier to validate correctness

Our earlier DTD would have the equivalent XSD on the next page ...

XSD example

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://..." targetNamespace="http://..."
               xmlns="http://..." elementFormDefault="qualified">

    <xs:element name="note">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="to" type="xs:string"/>
          <xs:element name="from" type="xs:string"/>
          <xs:element name="heading" type="xs:string"/>
          <xs:element name="body" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

    </xs:schema>

XML for linguistic purposes

- XML allows us to add structured information to a text
- By adding XML annotation, we are able to preserve the original document, i.e., we only add to it
- The complexity of the XML used depends on exactly what the task is.
- Using XML and a corresponding DTD also allows corpus designers or software designers to specify how the annotation information should be formatted

XML for “raw” text

New York Times data from the English Gigaword corpus:

    <DOC id="NYT19940701.0001" type="story" >
    <HEADLINE>
    WITNESS SAYS O.J. SIMPSON BOUGHT KNIFE WEEKS BEFORE SLAYINGS
    </HEADLINE>
    <DATELINE>
    LOS ANGELES (BC-SIMPSON-KILLINGS-1stLd-3Takes-Writethru-LADN)
    </DATELINE>
    <TEXT>
    <P>
    With the nation’s attention riveted again on a Los
    Angeles courtroom, a knife dealer testified that O.J. Simpson
    bought a 15-inch knife five weeks before the slashing deaths of his
    ex-wife and her friend.
    </P>
    ...
    <P>
    ‘‘She frequented the restaurant quite often,’’ DeBello said.
    </P>
    <P>
    (STORY CAN END HERE. OPTIONAL 2ND TAKE FOLLOWS.)
    </P>
    </TEXT>
    </DOC>

Part of Gigaword DTD

    <!ELEMENT DOC - - (HEADLINE*,
                       DATELINE*,
                       TEXT*) >
    <!-- fields of "DOC" -->
    <!ELEMENT HEADLINE - - (#PCDATA) >
    <!ELEMENT DATELINE - - (#PCDATA) >
    <!ELEMENT TEXT - - (P* | #PCDATA)+ >
    <!-- fields of "TEXT" -->
    <!ELEMENT P - - (#PCDATA) >

    <!--Entities -->
    <!ENTITY amp "&" >
    <!ENTITY AMP "&" >
    <!ENTITY #DEFAULT SYSTEM >

    <!ATTLIST DOC id CDATA #REQUIRED
                  type CDATA #REQUIRED >

    ]>

Adding POS information: BNC

    </c></s><s n="45" p="n" teiform="s">
    <w type="av0" teiform="w">Commonly </w>
    <w type="vvn-vvd" teiform="w">held </w>
    <w type="nn2" teiform="w">ideas </w>
    <w type="vvb" teiform="w">restrict </w>
    <w type="at0" teiform="w">the </w>
    <w type="aj0" teiform="w">social </w>
    <w type="nn1" teiform="w">role </w>
    <w type="cjc" teiform="w">and </w>
    <w type="nn1" teiform="w">status </w>
    <w type="prf" teiform="w">of </w>
    <w type="ajc" teiform="w">older </w>
    <w type="nn0" teiform="w">people</w>
    <c type="pun" teiform="c">, </c>
    <w type="vvb-nn1" teiform="w">structure </w>
    <w type="dps" teiform="w">their </w>
    <w type="nn2" teiform="w">expectations </w>
    <w type="prf" teiform="w">of </w>
    <w type="pnx" teiform="w">themselves</w>
    <c type="pun" teiform="c">, </c>
    <w type="vvb" teiform="w">prevent </w>
    <w type="pnp" teiform="w">them </w>
    <w type="vvg" teiform="w">achieving </w>
    <w type="dps" teiform="w">their </w>
    ...
    <c type="pun" teiform="c">.</c></s>

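Markup like this can be turned into (word, tag) pairs with a few lines of code. The sketch below works on an abridged copy of the sentence (the first three words plus the final punctuation, with the elided middle left out); the function name tagged_tokens is an assumption, and hyphenated values such as vvn-vvd are split into their candidate tags.

```python
import xml.etree.ElementTree as ET

# Abridged from the slide: first three words and the closing punctuation.
SENTENCE = """<s n="45" p="n" teiform="s">
<w type="av0" teiform="w">Commonly </w>
<w type="vvn-vvd" teiform="w">held </w>
<w type="nn2" teiform="w">ideas </w>
<c type="pun" teiform="c">.</c></s>"""


def tagged_tokens(s_element):
    """Return (token, [candidate tags]) for each <w> and <c> child."""
    pairs = []
    for el in s_element:
        if el.tag in ("w", "c"):
            token = (el.text or "").strip()  # drop the trailing space
            pairs.append((token, el.get("type").split("-")))
    return pairs


if __name__ == "__main__":
    for token, tags in tagged_tokens(ET.fromstring(SENTENCE)):
        print(token, tags)
```

Keeping the tags as a list preserves the tagger's uncertainty (held is either vvn or vvd) instead of silently picking one reading.
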
XML for syntactic trees: TigerXML (WSJ)

    <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
    <corpus xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="/home/compling/TIGERSearch/1.1rc2/schema/TigerXML.xsd" id="temp" >
    <head external="file:temp_generated_header.xml"/>
    <body>
    <s id="s1" >
    <graph root="s1_500" >
      <terminals>
        <t id="s1_1" word="Pierre" pos="NNP" />
        <t id="s1_2" word="Vinken" pos="NNP" />
        <t id="s1_3" word="," pos="," />
        <t id="s1_4" word="61" pos="CD" />
        <t id="s1_5" word="years" pos="NNS" />
        <t id="s1_6" word="old" pos="JJ" />
        <t id="s1_7" word="," pos="," />
        <t id="s1_8" word="will" pos="MD" />
        <t id="s1_9" word="join" pos="VB" />
        <t id="s1_10" word="the" pos="DT" />
        <t id="s1_11" word="board" pos="NN" />
        <t id="s1_12" word="as" pos="IN" />
        <t id="s1_13" word="a" pos="DT" />
        <t id="s1_14" word="nonexecutive" pos="JJ" />
        <t id="s1_15" word="director" pos="NN" />
        <t id="s1_16" word="Nov." pos="NNP" />
        <t id="s1_17" word="29" pos="CD" />
        <t id="s1_18" word="." pos="." />
      </terminals>

WSJ in TigerXML (cont.)

      <nonterminals>
        <nt id="s1_502" cat="NP" >
          <edge idref="s1_1" label="--" />
          <edge idref="s1_2" label="--" />
        </nt>
        <nt id="s1_504" cat="NP" >
          <edge idref="s1_4" label="--" />
          <edge idref="s1_5" label="--" />
        </nt>
        ...
        <nt id="s1_501" cat="NP" >
          <edge idref="s1_502" label="--" />
          <edge idref="s1_3" label="--" />
          <edge idref="s1_503" label="--" />
          <edge idref="s1_7" label="--" />
        </nt>
        ...
        <nt id="s1_500" cat="S" >
          <edge idref="s1_501" label="SBJ" />
          <edge idref="s1_505" label="--" />
          <edge idref="s1_18" label="--" />
        </nt>
      </nonterminals>
    </graph>
    </s>

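In this format the tree is held together only by id and idref values, so reading it means following edges from a nonterminal down to the terminals. The sketch below uses a deliberately reduced graph with the same element and attribute names as the slides; the tree itself is simplified for illustration and is not the full WSJ analysis, and yield_of is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified graph: same markup conventions, invented smaller tree.
GRAPH = """<graph root="s1_500">
  <terminals>
    <t id="s1_1" word="Pierre" pos="NNP" />
    <t id="s1_2" word="Vinken" pos="NNP" />
    <t id="s1_8" word="will" pos="MD" />
  </terminals>
  <nonterminals>
    <nt id="s1_502" cat="NP">
      <edge idref="s1_1" label="--" />
      <edge idref="s1_2" label="--" />
    </nt>
    <nt id="s1_500" cat="S">
      <edge idref="s1_502" label="SBJ" />
      <edge idref="s1_8" label="--" />
    </nt>
  </nonterminals>
</graph>"""


def yield_of(graph, node_id):
    """Return the words dominated by node_id, following edges downward."""
    words = {t.get("id"): t.get("word") for t in graph.iter("t")}
    children = {nt.get("id"): [e.get("idref") for e in nt.findall("edge")]
                for nt in graph.iter("nt")}

    def collect(nid):
        if nid in words:          # terminal: contributes its word
            return [words[nid]]
        result = []
        for child in children[nid]:  # nonterminal: recurse over edges
            result.extend(collect(child))
        return result

    return collect(node_id)


if __name__ == "__main__":
    graph = ET.fromstring(GRAPH)
    print(yield_of(graph, graph.get("root")))
    print(yield_of(graph, "s1_502"))
```

Separating terminals from nonterminals and linking them by reference is what lets the format express structures that simple nesting of XML elements cannot.
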
Header information: define your classes

    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
    <head>
      <meta>
        <name>/home/dickinso/temp.xml.gz</name>
        <format>bracketing format</format>
      </meta>
      <annotation>
        <feature name="word" domain="T" />
        <feature name="pos" domain="T">
          <value name="CC" />
          ...
        </feature>
        <feature name="cat" domain="NT">
          <value name="ADJP" />
          ...
        </feature>
        <edgelabel>
          <value name="--" />
          <value name="ADV" />
          ...
        </edgelabel>
        <secedgelabel>
          <value name="*" />
          ...
        </secedgelabel>
      </annotation>
    </head>
