Professional Documents
Culture Documents
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
More on XML
Markus Dickinson
1 / 73
Corpus Linguistics
Corpus mark-up
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Corpus mark-up is information about the corpus included in Motivation
How to annotate
the corpus itself Levels of Annotation
Text Processing
I some of it governs formatting, printing, other processing Tokenization
Lemmatization
2 / 73
Annotation Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
We will use the term annotation to refer specifically to How to annotate
Levels of Annotation
linguistic information encoded in the text
Text Processing
Tokenization
I Generally speaking, annotation is a form of markup Lemmatization
Standalone
I Non-linguistic information also encoded in markup annotation
(meta-data) More on XML
3 / 73
Corpus Linguistics
Corpus mark-up schemes
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Meta-data systematically needs to be kept separate from the Motivation
How to annotate
corpus data Levels of Annotation
Text Processing
A general idea is to use attributes and corresponding Tokenization
Lemmatization
values
Standalone
annotation
I e.g., attribute author has a value William
More on XML
Shakespeare
I We’ll see this encoded in XML in ways such as:
I <... author="William Shakespeare" ...>
5 / 73
Text Encoding Standards Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
How to annotate
Levels of Annotation
Standalone
I XCES: XML version of CES annotation
More on XML
If you annotate a corpus, make it conform to XCES as much
as possible
6 / 73
Corpus Linguistics
Text Encoding Initiative (TEI)
Linguistic
Annotation
Corpus markup
Mark-up schemes
Corpus annotation
1. Header = information about the document Motivation
How to annotate
I File description (<fileDesc>): bibliographic information Levels of Annotation
8 / 73
Mark-up languages Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
I Tags, which come in both start (e.g., <s>) and end (e.g., Text Processing
Tokenization
</s>) forms Lemmatization
Standalone
I Attributes and values, encoded within the start tag (e.g., annotation
<s num="1">) More on XML
<w POS=AT0>the</w>
10 / 73
XML Corpus Linguistics
Linguistic
Annotation
Corpus markup
XML stands for: eXtensible Markup Language Mark-up schemes
Character encodings
“meaning” for labels: you must define your own tags! How to annotate
Levels of Annotation
An XML document consists of two parts: DTD (or XML Text Processing
Tokenization
Standalone
I DTD (Document Type Definition) defines structure of
annotation
document More on XML
I XML Schema does the same thing, but in a different
format
I XML documents can be validated, i.e., checked against
rules in DTD
We’ll save a more thorough discussion of XML for the end of
this unit
11 / 73
Corpus Linguistics
CES: Corpus Encoding Standard
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
TEI is for any text document and, as such, defines Corpus annotation
Motivation
approximately 450 elements How to annotate
Levels of Annotation
I CES simplifies this system in order to encode language Text Processing
corpora Tokenization
Lemmatization
12 / 73
Corpus Linguistics
Character encodings
Linguistic
Annotation
Corpus markup
Mark-up schemes
Corpus has to specify character encoding of the text, as Character encodings
Text Processing
character Tokenization
Lemmatization
13 / 73
Corpus Linguistics
Corpus annotation
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
How to annotate
Text Processing
added to a corpus Tokenization
Lemmatization
I Corpus mark-up is relatively objective factual Standalone
information annotation
More on XML
I Corpus annotation is more subjective, interpretative
information
14 / 73
Motivating Annotation Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Corpus annotation
I for training NLP tools Motivation
How to annotate
Levels of Annotation
I for finding examples Text Processing
Tokenization
I what is the plural form of fish? Lemmatization
15 / 73
Corpus Linguistics
Potential benefits of annotation
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
16 / 73
Corpus Linguistics
Potential criticisms of annotation
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
I Annotation is cluttered Corpus annotation
I But: it can be easily ignored Motivation
How to annotate
Text Processing
I But: users can reject the analysis, and annotation at Tokenization
Standalone
I Annotation makes a corpus less accessible and annotation
expandable More on XML
I But: corpora can be extended without annotation, if
need be
I Annotation cannot be done completely consistently
I But: humans are falliable, and there are ways to ensure
better consistency
17 / 73
Corpus Linguistics
Leech’s Seven Maxims of Annotation
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
themselves from the text. This is the flip side of maxim Text Processing
Tokenization
1. Lemmatization
18 / 73
Corpus Linguistics
Leech’s Seven Maxims (2)
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
5. The end user should be made aware that the corpus How to annotate
Levels of Annotation
19 / 73
How corpus annotation is achieved Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
20 / 73
Corpus Linguistics
Levels of linguistic annotation
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Standalone
I syntactic annotation (e.g. named entities, phrasal annotation
chunking, full syntactic analysis) More on XML
21 / 73
Step by Step Annotation Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
I tokenization Motivation
How to annotate
Text Processing
I part-of-speech tagging Tokenization
Lemmatization
I named-entity recognition Standalone
annotation
I partial parsing
More on XML
I full syntactic parsing
I semantic and discourse processing
See section A4.4 of the book for a more thorough analysis of
each of these
22 / 73
Annotation uses Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
I POS annotation: compare occurrences of word classes
Corpus annotation
I Lemmatization: study distribution of lexemes for Motivation
How to annotate
Text Processing
I Parsing: study clauses types in a language or teach Tokenization
Lemmatization
grammatical analysis
Standalone
I Semantic annotation: analyze content annotation
More on XML
Other types of annotation can cover specific purposes:
stylistic annotation, error tagging, problem-oriented
annotation
23 / 73
Preprocessing the Text: Tokenization Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Tokenization refers to the annotation step of dividing the Motivation
How to annotate
Text Processing
Tokenization
Lemmatization
Standalone
Each tokens consists of one of the following: annotation
More on XML
I a morpho-syntactic word
I a punctuation mark or a special character (e.g. &, @, %)
I a number
24 / 73
Tokenization – Example Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
Milton wrote "Paradise Lost." Then his wife dies Text Processing
and he wrote "Paradise Regained." Tokenization
Lemmatization
Standalone
annotation
More on XML
after tokenization:
Milton wrote " Paradise Lost . " Then his wife
dies and he wrote " Paradise Regained . "
25 / 73
Corpus Linguistics
Why is Tokenization Non-Trivial?
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
I disambiguation of punctuation Motivation
How to annotate
e.g. period can occur inside cardinal numbers, after Levels of Annotation
Standalone
I compounds, e.g. bank transfer fee, US-company
annotation
I mergers, e.g. don’t, England’s, French: t’aime
More on XML
I multiwords, e.g. complex prepositions provided that,
in spite of
Languages such as Chinese present more challenging
segmentation issues
26 / 73
Lemmatization Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
More on XML
in a corpus
e.g. helps to find all occurrences of buy
I is indispensable for highly inflectional languages which
have a large number of distinct word forms for a given
lemma
27 / 73
Corpus Linguistics
Lemmatization – German Example
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
wie wie +Adv+Wh+#lex+COWIE
wie wie +Conj+Coord+#lex+COWIE Corpus annotation
Motivation
wie wie +Conj+Subord+#lex+COWIE How to annotate
Levels of Annotation
Standalone
offenbar offenbaren +Verb+Imp+2P+Sg+#lex+VVFIN annotation
29 / 73
Corpus Linguistics
Tools for Lemmatization
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
I XEROX Morphological Analyzer: comprehensive Motivation
30 / 73
Morphological Analysis Corpus Linguistics
Linguistic
Annotation
Corpus markup
Xerox: Mark-up schemes
Character encodings
Standalone
annotation
More on XML
Lingsoft:
"<half>"
"half" <Quant> DET PRE SG/PL @QN>
"half" <NonMod> <Quant> PRON SG/PL
"half" N NOM SG
"half" ADV
32 / 73
Standalone annotation Corpus Linguistics
Linguistic
Annotation
annotation
I Allows for new annotation levels to be easily added
Disadvantages:
I Potentially difficult to allow for annotation to point to
other annotation
I Less pre-built tools for standoff annotation
33 / 73
Example XML document Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Standalone
<note> annotation
<from>Jani</from>
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>
</note>
35 / 73
XML tags Corpus Linguistics
Linguistic
Annotation
In the previous example, there are several things to note Corpus markup
about the tags, i.e., the things in angled brackets (<...>): Mark-up schemes
Character encodings
I There has to be one root tag—in this case, <note> is Corpus annotation
Text Processing
<heading>Reminder</heading> is legitimate; Tokenization
Standalone
I Along with that, tags must be properly nested: annotation
<b><i>word</b></i> is not
So, you’ll wind up with a structure like:
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
37 / 73
Attributes Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Each tag (or element) can have a variety of attributes Corpus annotation
Motivation
(provided such attributes have been declared; see below) How to annotate
Levels of Annotation
I Attributes are noted within an element tag; there can be Text Processing
Tokenization
multiple attributes Lemmatization
39 / 73
Corpus Linguistics
Comments and whitespace
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Comments are added in between <!-- and --> Motivation
How to annotate
Levels of Annotation
Standalone
significant (unlike, e.g., HTML) annotation
I But it doesn’t matter how much whitespace you put More on XML
41 / 73
Corpus Linguistics
Attributes vs. Elements (1)
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
element text, in attributes, as children elements, ... Here are How to annotate
Levels of Annotation
More on XML
<from>Jani</from>
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>
</note>
43 / 73
Corpus Linguistics
Attributes vs. Elements (2)
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
How to annotate
Text Processing
<date>12/11/2002</date> Tokenization
Lemmatization
<to>Tove</to>
Standalone
<from>Jani</from> annotation
45 / 73
Corpus Linguistics
Attributes vs. Elements (3)
Linguistic
Annotation
Corpus markup
Mark-up schemes
<note> Character encodings
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>
</note>
Linguistic
Annotation
Corpus markup
In general, putting data in attributes needs to be done with Mark-up schemes
Character encodings
care:
Corpus annotation
Motivation
I attributes cannot contain multiple values (child elements How to annotate
Text Processing
I attributes are not easily expandable (for future changes) Tokenization
Lemmatization
I attributes cannot describe structures (child elements
Standalone
can) annotation
code
I attribute values are not as easy to test against a
Document Type Definition (DTD) - which is used to
define the legal elements of an XML document
However, converting all attributes to elements can result in
an inflated structure
48 / 73
Linguistic example Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
So, what do we do with corpora? Well, this won’t typically be Character encodings
your problem, as someone else will have decided, but here Corpus annotation
Motivation
are some possibilities: How to annotate
Levels of Annotation
Text Processing
<terminal tag="NN">dog</terminal> Tokenization
Lemmatization
Standalone
<terminal word="dog" tag="NN"></terminal> annotation
<terminal>
<word>dog</word>
<tag>NN</tag>
</terminal>
50 / 73
Corpus Linguistics
Validation/DTDs
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
We want to know what makes up a valid XML file, i.e., we How to annotate
Levels of Annotation
51 / 73
Corpus Linguistics
Document Type Definition (DTD) example
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
53 / 73
Corpus Linguistics
More complex definitions
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
How to annotate
I Declaring one or more of the same element: Levels of Annotation
Standalone
<!ELEMENT nonterminal (terminal*)> annotation
times:
<!ELEMENT terminal (word|tag|#PCDATA)*>
55 / 73
Attributes Corpus Linguistics
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
57 / 73
Corpus Linguistics
XML Schema (XSD)
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
DTDs are in the process of being replaced by XML Schemas
Corpus annotation
(XSDs) Motivation
How to annotate
Text Processing
Why XSDs? Tokenization
Lemmatization
58 / 73
XSD example Corpus Linguistics
Linguistic
Annotation
<xs:sequence> Standalone
annotation
<xs:element name="to" type="xs:string"/>
More on XML
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
60 / 73
Corpus Linguistics
XML for linguistic purposes
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
Corpus annotation
Motivation
Standalone
I The complexity of the XML used depends on exactly annotation
what the task is. More on XML
I Using XML and a corresponding DTD also allows
corpus designers or software designers to specify how
the annotation information should be formatted
61 / 73
Corpus Linguistics
XML for “raw” text
Linguistic
Annotation
Corpus markup
New York Times data from the English Gigaword corpus: Mark-up schemes
Character encodings
<TEXT> Lemmatization
<P>
With the nation’s attention riveted again on a Los Standalone
Angeles courtroom, a knife dealer testified that O.J. Simpson annotation
bought a 15-inch knife five weeks before the slashing deaths of his More on XML
ex-wife and her friend.
</P>
...
<P>
‘‘She frequented the restaurant quite often,’’ DeBello said.
</P>
<P>
(STORY CAN END HERE. OPTIONAL 2ND TAKE FOLLOWS.)
</P>
</TEXT>
</DOC>
63 / 73
Corpus Linguistics
Part of Gigaword DTD
Linguistic
Annotation
Corpus markup
Mark-up schemes
Character encodings
More on XML
<!--Entities -->
<!ENTITY amp "&" >
<!ENTITY AMP "&" >
<!ENTITY #DEFAULT SYSTEM >
]>
65 / 73
Corpus Linguistics
Adding POS information: BNC
Linguistic
Annotation
Corpus markup
Mark-up schemes
</c></s><s n="45" p="n" teiform="s">
Character encodings
<w type="av0" teiform="w">Commonly </w>
<w type="vvn-vvd" teiform="w">held </w> Corpus annotation
<w type="nn2" teiform="w">ideas </w> Motivation
<w type="vvb" teiform="w">restrict </w> How to annotate
<w type="at0" teiform="w">the </w> Levels of Annotation
<w type="aj0" teiform="w">social </w>
<w type="nn1" teiform="w">role </w> Text Processing
<w type="cjc" teiform="w">and </w> Tokenization
<w type="nn1" teiform="w">status </w> Lemmatization
<w type="prf" teiform="w">of </w>
<w type="ajc" teiform="w">older </w> Standalone
<w type="nn0" teiform="w">people</w> annotation
<c type="pun" teiform="c">, </c>
More on XML
<w type="vvb-nn1" teiform="w">structure </w>
<w type="dps" teiform="w">their </w>
<w type="nn2" teiform="w">expectations </w>
<w type="prf" teiform="w">of </w>
<w type="pnx" teiform="w">themselves</w>
<c type="pun" teiform="c">, </c>
<w type="vvb" teiform="w">prevent </w>
<w type="pnp" teiform="w">them </w>
<w type="vvg" teiform="w">achieving </w>
<w type="dps" teiform="w">their </w>
...
<c type="pun" teiform="c">.</c></s>
67 / 73
Corpus Linguistics
XML for syntactic trees: TigerXML (WSJ)
Linguistic
Annotation
Corpus markup
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?> Mark-up schemes
69 / 73
Corpus Linguistics
WSJ in TigerXML (cont.)
Linguistic
Annotation
Corpus markup
Mark-up schemes
<nonterminals> Character encodings
<nt id="s1_502" cat="NP" >
<edge idref="s1_1" label="--" /> Corpus annotation
<edge idref="s1_2" label="--" /> Motivation
71 / 73
Corpus Linguistics
Header information: define your classes
Linguistic
Annotation
Corpus markup
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?> Mark-up schemes
<meta>
Corpus annotation
<name>/home/dickinso/temp.xml.gz</name>
Motivation
<format>bracketing format</format>
How to annotate
</meta>
Levels of Annotation
<annotation>
<feature name="word" domain="T" /> Text Processing
<feature name="pos" domain="T">
Tokenization
<value name="CC" />
Lemmatization
...
</feature> Standalone
<feature name="cat" domain="NT"> annotation
<value name="ADJP" />
... More on XML
</feature>
<edgelabel>
<value name="--" />
<value name="ADV" />
...
</edgelabel>
<secedgelabel>
<value name="*" />
...
</secedgelabel>
</annotation>
</head>
73 / 73