
Corpus annotation

Corpus annotation is the practice of adding interpretative linguistic information to a
corpus. For example, one common type of annotation is the addition of tags, or labels,
indicating the word class to which words in a text belong. (Kinds of linguistic annotation
which involve the attachment of special codes to words in order to indicate particular
features are frequently known as 'tagging' rather than 'annotation', and the codes which
are assigned are known as 'tags'.) This is so-called part-of-speech tagging (or POS
tagging), and it can be useful, for example, in distinguishing
words which have the same spelling, but different meanings or pronunciation. If a word
in a text is spelt present, it may be a noun (= 'gift'), a verb (= 'give someone a present')
or an adjective (= 'not absent'). The meanings of these identically spelt words are very
different, and there is also a difference of pronunciation, since the verb present has stress
on the final syllable. Using one simple method of representing the POS tags — attaching
tags to words by an underscore symbol — these three words may be annotated as
follows:

present_NN1 (singular common noun)


present_VVB (base form of a lexical verb)
present_JJ (general adjective)
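The underscore convention lends itself to simple mechanical processing. As a minimal sketch (the tiny lexicon below is a hypothetical toy; a real tagger chooses tags from context, precisely because words like present are ambiguous):

```python
# A toy illustration of the underscore convention for POS tags.
# The lookup table is hypothetical: a real tagger must pick the tag
# from context, since "present" can be NN1, VVB or JJ.
TOY_LEXICON = {
    "present": "NN1",   # assuming the nominal ('gift') reading here
    "a": "AT1",
    "lovely": "JJ",
}

def tag_words(words, lexicon):
    """Attach each word's tag with an underscore, e.g. present_NN1."""
    return " ".join(f"{w}_{lexicon.get(w, 'UNK')}" for w in words)

tagged = tag_words(["a", "lovely", "present"], TOY_LEXICON)
```

Words missing from the lexicon are tagged UNK here; a real system would instead guess from morphology and context.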
Some people prefer not to engage in corpus annotation: for them, the unannotated corpus
is the 'pure' corpus they want to investigate — the corpus without adulteration with
information which is suspect, possibly reflecting the predilections, or even the errors, of
the annotator. For others, annotation is a means to make a corpus much more useful —
an enrichment of the original raw corpus. From this perspective, probably a majority
view, adding annotation to a corpus is giving 'added value', which can be used for
research by the individual or team that carried out the annotation, but which can also be
passed on to others who may find it useful for their own purposes. For example, POS-
tagged versions of major English language corpora such as the Brown Corpus, the LOB
Corpus and the British National Corpus have been distributed widely throughout the
world for those who would like to make use of the tagging, as well as of the original
'raw' corpus. In this chapter, I will assume that such annotation is a benefit, so long as it
is done well, with an eye to the standards that ought to apply to such work.

Different kinds of annotation
There are several different types of annotation, corresponding to different levels of
linguistic analysis of a corpus or text — for example:
i. Part-of-speech annotation
The most basic type of linguistic corpus annotation is part-of-speech tagging
(sometimes also known as grammatical tagging or morpho-syntactic annotation). The
aim of part-of-speech tagging is to assign to each lexical unit in the text a code indicating
its part of speech (e.g. singular common noun, comparative adjective, past participle).
Part-of-speech information is a fundamental basis for increasing the specificity of data
retrieval from corpora and also forms an essential foundation for further forms of
analysis such as syntactic parsing and semantic field annotation. For instance, part-of-
speech annotation forms a basic first step in the disambiguation of homographs. It
cannot distinguish between the verbal senses of boot meaning 'to kick' and 'to start up a
computer', but it can distinguish boot as a verb from boot as a noun, and hence separate
away the verb senses from the noun senses. This means that, in examining the way in
which boot as 'footwear' is used in a corpus, we might need to dredge through only 50
examples of boot as a noun instead of 100 examples of the ambiguous string boot.
ii. Lexical annotation
Adding the identity of the lemma of each word form in a text — i.e. the base form of
the word, such as would occur as its headword in a dictionary.
Lemmatization
One of the most basic types of annotation is lemmatization, the process of identifying
and marking each word in a corpus with its base (citation or dictionary) form. In an
English corpus this would involve, for example, stripping away inflectional morphology
on verbs so that all forms of the lemma FORGET – forget, forgets, forgetting, forgot,
and forgotten – would be marked as representing a form of FORGET, and could be
retrieved without the user having to enter all forms of FORGET individually.
Lemmatization can be performed on the basis of an existing form-lemma database, a
(semi-)automatic approach called stemming in which word forms are truncated by
cutting off characters to arrive at a more general representation of the lemma, or hybrid
approaches combining these two strategies, which may also involve morphological
and/or syntactic analysis to disambiguate ambiguous forms.
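A minimal sketch of the form-lemma database approach, using the FORGET example from above (the table is a toy; real databases cover the whole lexicon and may consult POS tags to disambiguate):

```python
# Sketch of lemmatization against a small form-lemma database.
# The table below is a toy covering only the lemma FORGET; a real
# database would list every inflected form in the lexicon.
FORM_LEMMA = {
    "forget": "FORGET", "forgets": "FORGET", "forgetting": "FORGET",
    "forgot": "FORGET", "forgotten": "FORGET",
}

def lemmatize(tokens, db):
    """Map each word form to its lemma; unknown forms fall back to uppercase."""
    return [db.get(t.lower(), t.upper()) for t in tokens]
```

With such a table in place, a user can retrieve every form of FORGET with a single query, as the text describes.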
iii. Syntactic annotation
It refers to adding information about how a given sentence is parsed, in terms of
syntactic analysis into units such as phrases and clauses.
Once basic morpho-syntactic categories have been identified in a text, it is then possible
to consider bringing these categories into higher-level syntactic relationships with one
another. The procedure of doing this is generally known as parsing. Parsing is probably
the most commonly encountered form of corpus annotation after part-of-speech tagging.
Corpora which have been parsed are sometimes known as ‘treebanks’. The term
‘treebank’ alludes to the tree diagrams or 'phrase markers' which will be familiar from most

introductory syntax textbooks. For example, the structure of the sentence “Claudia sat
on a stool” (BNC) might be represented by the following tree diagram.

[Tree diagram and equivalent labelled bracketings of the sentence, with a key to the
constituent labels used, not reproduced here.]
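In the absence of the diagram, the same analysis can be written as a labelled bracketing, here encoded as nested tuples; the labels (S, NP, VP, PP, etc.) are standard textbook categories and are not necessarily those of the BNC parse itself:

```python
# A labelled bracketing of "Claudia sat on a stool", written as nested
# tuples of (label, children...). The labels S, NP, VP, PP, P, N, V and
# DET are standard textbook categories, assumed for illustration.
parse = (
    "S",
    ("NP", ("N", "Claudia")),
    ("VP",
     ("V", "sat"),
     ("PP",
      ("P", "on"),
      ("NP", ("DET", "a"), ("N", "stool")))),
)

def leaves(tree):
    """Recover the sentence by collecting the leaf strings left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words
```

A treebank is, in essence, a large collection of such structures, one per sentence.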

iv. Semantic annotation


Adding information about the semantic category of words — the noun cricket as a term
for a sport and as a term for an insect belong to different semantic categories, although
there is no difference in spelling or pronunciation.
v. Pragmatic annotation
Adding information about the kinds of speech act (or dialogue act) that occur in a spoken
dialogue — thus the utterance okay on different occasions may be an acknowledgement,
a request for feedback, an acceptance, or a pragmatic marker initiating a new phase of
discussion.
vi. Discourse annotation
Adding information about anaphoric links in a text, for example connecting the pronoun
them and its antecedent the horses in: I'll saddle the horses and bring them round. [an
example from the Brown corpus]
vii. Stylistic annotation
Adding information about speech and thought presentation (direct speech, indirect
speech, free indirect thought, etc.)

viii. Phonetic annotation
Adding information about how a word in a spoken corpus was pronounced. Prosodic
annotation — again in a spoken corpus — adding information about prosodic features
such as stress, intonation and pauses.

In fact, it is possible to think up untold kinds of annotation that might be useful for
specific kinds of research. One example is dysfluency annotation: those working on
spoken data may wish to annotate a corpus of spontaneous speech for dysfluencies such
as false starts, repeats, hesitations, etc. Another illustration comes from an area of corpus
research which has flourished in the last ten years: the creation and study of learner
corpora. Such corpora, consisting of writing (or speech) produced by learners of a
second language, may be annotated with 'error tags' indicating where the learner has
produced errors, and what kinds of errors these are.
Non-linguistic and other phenomena
There is currently no widely agreed standard way of representing information in texts.
In the past, many different approaches have been adopted. But some approaches have
been more lasting than others, and work is now progressing towards the establishment
of truly international standards. One long-standing annotation practice has been that
known as COCOA references. COCOA was a very early computer program used for
extracting indexes of words in context from machine-readable texts. Its conventions
were carried forward into several other programs, notably to its immediate successor,
the widely-used Oxford Concordance Program (OCP). The conventions have also been
applied to the encoding of corpora themselves, such as the Longman-Lancaster corpus
and the Helsinki corpus, to indicate textual information (see below). Very simply, a
COCOA reference consists of a balanced set of angled brackets (< >) containing two
entities: a code standing for a particular variable name and a string or set of strings,
which are the instantiations of that variable. For example, the code letter 'A' could be
used to stand for the variable 'author' and the string or set of strings would stand for the
author's name. Thus COCOA references indicating the author of a passage or text
would look like the following:
<A CHARLES DICKENS>
<A WOLFGANG VON GOETHE>
<A HOMER>
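Because COCOA references have such a regular shape, they are easy to extract mechanically. A minimal sketch (assuming, as in the examples above, that variable codes are strings of capital letters):

```python
import re

# Sketch: extracting COCOA-style references such as <A CHARLES DICKENS>.
# Assumed convention: a variable code of one or more capital letters,
# then whitespace, then the value string, inside angled brackets.
COCOA = re.compile(r"<([A-Z]+)\s+([^>]+)>")

def cocoa_refs(text):
    """Return (variable, value) pairs for every COCOA reference in text."""
    return COCOA.findall(text)
```

Such a routine is all a concordancer needs to associate each stretch of text with its author, date or title.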

But COCOA references only represent an informal trend for encoding specific types of
textual information, for example, authors, dates and titles. Current moves are aiming
towards more formalised international standards for the encoding of any type of
information that one would conceivably want to encode in machine-readable texts. The
flagship of this current trend towards standards is the Text Encoding Initiative (TEI).
The TEI is sponsored by the three main scholarly associations concerned with
humanities computing - the Association for Computational Linguistics (ACL), the
Association for Literary and Linguistic Computing (ALLC) and the Association for
Computers and the Humanities (ACH). The aim of the TEI is to provide standardised
implementations for machine-readable text interchange. For this, the TEI employs an
already existing form of document markup known as SGML (Standard Generalised
Markup Language). SGML was adopted because it is simple, clear, formally rigorous
and already recognised as an international standard. The TEI's own original
contribution is a detailed set of guidelines as to how this standard is to be used in text
encoding.
In the TEI, each individual text (or ‘document’) is conceived of as consisting of two
parts - a header and the text itself. The header contains information about the text such
as: the author, title, date and so on; information about the source from which it was
encoded, for example the particular edition/publisher used in creating the machine-
readable text; and information about the encoding practices adopted, including any
feature system declarations. The actual TEI annotation of the header and the text is based
around two basic devices: tags and entity references. Texts are assumed to be made up
of elements. An element can be any unit of text - word, sentence, paragraph, chapter
and so on. Elements are marked in the TEI using SGML tags. SGML tags - which should
be distinguished from the code strings that are often known as 'tags' in linguistic
annotation (e.g. NN1 = singular common noun) - are indicated by balanced pairs of
angled brackets (i.e. < and >). A start tag at the beginning of an element is represented
by a pair of angled brackets containing annotation strings, thus: < >; an end tag at the
end of an element contains a slash character preceding the annotation strings, thus:
</ >. To give an example, a simple and frequently used TEI tag is that which
indicates the extent of a paragraph. This would be represented as follows:
<p>
The actual textual material goes here.
</p>
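Element markup of this kind lends itself to processing with quite simple tools. As an illustrative sketch, Python's standard html.parser is pressed into service here to pull the text out of <p> elements; real TEI processing would use a full SGML/XML toolchain rather than this toy:

```python
from html.parser import HTMLParser

# Sketch: pulling the text content of <p> elements out of a TEI-style
# fragment. This toy parser only handles simple, well-nested markup.
class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":               # a start tag opens a new paragraph
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":               # the matching end tag closes it
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:                # accumulate text between the tags
            self.paragraphs[-1] += data

extractor = ParagraphExtractor()
extractor.feed("<p>The actual textual material goes here.</p>")
```

The balanced start/end tags are what make this kind of mechanical extraction reliable.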
In contrast to tags, entity references are delimited by the characters & and ;. An entity
reference is essentially a shorthand way of encoding detailed information within a text.
The shorthand form which is contained in the text refers outwards to a feature system
declaration (FSD) in the document header which contains all the relevant information
in full TEI tag-based markup. For example, one shorthand code which is used in
annotating words for parts of speech is 'vvd', in which the first v signifies that the word
is a verb, the second v signifies that it is a lexical verb and the d signifies that it is a past
tense form. In the following example, this code is used in the form of an entity reference:
polished&vvd;
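Expanding such entity references into another notation (here, the underscore notation used earlier in this chapter) is a simple textual transformation. The code below is a toy sketch; in the TEI, the full definitions of codes like vvd live in the feature system declaration rather than in the expansion routine:

```python
import re

# Sketch: expanding TEI-style entity references such as "polished&vvd;"
# into the underscore notation (polished_VVD). The mapping of codes to
# full feature descriptions is assumed to live elsewhere (in the FSD).
ENTITY = re.compile(r"(\w+)&(\w+);")

def expand(text):
    """Rewrite word&code; as word_CODE, e.g. polished&vvd; -> polished_VVD."""
    return ENTITY.sub(lambda m: f"{m.group(1)}_{m.group(2).upper()}", text)
```

Text without entity references passes through unchanged, so the routine can be run safely over a whole document.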

Another, additional level of representation may be used for non-linguistic phenomena
which occur when people are speaking. This includes speaker noises such as coughing,
laughter, and lip smacking, as well as extraneous noises such as the barking of dogs and
the slamming of doors. In addition, this level can also be used to label information such
as dysfluencies and filled pauses. The type of representation used for such annotations
will depend on the purpose of the database. An annotation system such as that proposed
by the Text Encoding Initiative is very elaborate and makes heavy demands on a
transcriber, but also makes it possible to derive all relevant information from a
transcription. While the TEI system makes use of SGML, which guarantees that existing
software can be used, there is a large initial learning curve for the transcriber, which
increases the possibility of human error in the transcription. Other annotation systems
(such as those used in ATIS and Switchboard) are less elaborate, but also easier for
transcribers to learn. The conventions used in ATIS, Switchboard, POLYPHONE and
the GRONINGEN corpus consist of different types of brackets with possible additional
glosses. Retrieval software referring to these particular annotations is less convenient
than the TEI system, although it is possible to provide standard UNIX scripts for
searching such a speech corpus. It is important to find the correct balance between the sophistication of
the annotation system and the practicality of the system from the transcriber's point of
view.

The types of phenomena which could conceivably be annotated on this level of
representation are listed below.
Omissions in read text
Words from the recording script which were omitted by the speaker may be indicated.
In spontaneous speech, it is very difficult to know whether a speaker has omitted words
which he actually intended to say, and so omission is only relevant in the case of read
speech.
Verbal deletions or corrections, implicit or explicit
Words that are verbally deleted by the speaker may be indicated. Verbal deletions are
words that are actually uttered, then (according to the transcriber) superseded by
subsequent speech. This can be done explicitly, as in “Can you give me some
information about the price, I mean, the place where I can find.” Alternatively, it can be
done implicitly, as in “Can you give me some information about the price, place where
I can find…” Verbal deletions or self-repairs may be indicated in read as well as
spontaneous speech.
Unintelligible words
Sometimes only part of a word is unintelligible, in which case only the intelligible part
is transcribed orthographically. If a word is completely unintelligible, that fact will be
annotated on this level, for example by putting ``[unintelligible]'' in the text (ATIS),
or two stars ``**'' as in the SPEECHDAT corpora.

Hesitations and filled pauses
Filled pauses (such as uh and mm) may be indicated. Some annotation conventions (e.g.
POLYPHONE and Switchboard) annotate only one or two types of filled pause (uh and
mm, or only uh). Other systems (e.g. ATIS and Speech Styles) annotate more than two
types (e.g. uh, mm, um, er, ah). The types of filled pause vary across languages (for
example, the British English er is not used in Dutch). The recommendation is to use at
least two types: one vowel-like type uh, and one nasal type mm.
Non-speech acoustic events
These can be made either by the speaker or by outside sources. The first category
includes lip smacks, grunts, laughter, heavy breathing and coughing. The second
category includes the noise of doors slamming, phones ringing, dogs barking, and all
kinds of noises from other speakers. The Switchboard corpus uses a very extensive list
of non-speech acoustic events, ranging from bird squawk to groaning and yawning. The
recommendation is that these events are annotated at the correct location in the
utterance, by first transcribing the words and then indicating which words are
simultaneous with the acoustic events.
Simultaneous speech
For dialogues and interviews, words spoken simultaneously by two or more speakers
may be indicated.
Speaking turns
Discourse analysis makes use of indications of different speaking turns and initiatives.
While these are not generally used in speech technology, it would always be possible to
transcribe them.
