
CORPUS ANNOTATION
BBI 5411
AP. DR AFIDA MOHAMAD ALI
• 1960S: FIRST ATTEMPTS AT STANDARDISING MARKUP FOR
INFORMATION EXCHANGE BEGAN.
• NECESSARY IN ORDER TO BE ABLE TO EXCHANGE VARIOUS TYPES OF
DOCUMENTS EFFICIENTLY AND WITHOUT THE NEED FOR
EXTENSIVE REFORMATTING OF THE PARTLY
IDIOSYNCRATIC/PROPRIETARY CODING USED BY DIFFERENT
AUTHORS, MANUFACTURERS, OR PUBLISHING HOUSES.
• 1986: SGML WAS CREATED.
• WHEN THE WORLD WIDE WEB WAS CREATED BY TIM BERNERS-LEE,
HTML (HYPERTEXT MARKUP LANGUAGE) ARRIVED ON THE SCENE AND
BECAME POPULAR.
• THEN CAME XML.
(WEISSER, 2016)
CORPUS MARK-UP

A SYSTEM OF STANDARD CODES INSERTED
INTO A DOCUMENT STORED IN ELECTRONIC
FORM TO PROVIDE INFORMATION ABOUT
THE TEXT ITSELF AND GOVERN
FORMATTING, PRINTING
AND OTHER PROCESSES.

MOST WIDELY USED MARK-UP SCHEMES:


• TEI (TEXT ENCODING INITIATIVE)
• CES (CORPUS ENCODING STANDARD)
CORPUS MARK-UP
IT IS ESSENTIAL IN CORPUS-BUILDING
BECAUSE…
…SAMPLED TEXTS ARE OUT OF CONTEXT, AND MARK-UP MAKES IT
POSSIBLE TO RECOVER CONTEXTUAL INFORMATION

…IT PROVIDES MORE INFORMATION THAN THE FILE NAMES
ALONE (REGARDING TEXT TYPES, SOCIOLINGUISTIC VARIABLES
AND TEXTUAL INFORMATION SUCH AS STRUCTURE)

…IT ADDS VALUE TO THE CORPUS BECAUSE IT ALLOWS FOR A
BROADER RANGE OF QUESTIONS TO BE ADDRESSED

…IT ALLOWS EDITORIAL COMMENTS TO BE INSERTED DURING THE
CORPUS-BUILDING PROCESS.
CORPUS MARK-UP

EXTRA-TEXTUAL AND TEXTUAL
INFORMATION MUST BE KEPT SEPARATE
FROM THE CORPUS DATA.
EXAMPLES:
COCOA MARK-UP SCHEME
<A WILLIAM SHAKESPEARE>
A= AUTHOR, ATTRIBUTE NAME
WILLIAM SHAKESPEARE= ATTRIBUTE VALUE
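
As a minimal sketch of how a COCOA reference like the one above could be read by a program, the following splits the attribute name from its value. The pattern and the function name `parse_cocoa` are assumptions made for illustration, not part of any COCOA tool.

```python
import re

# Assumed shape of a COCOA reference: "<", a short attribute name,
# whitespace, the attribute value, ">".
COCOA_REF = re.compile(r"<(\w+)\s+([^>]+)>")


def parse_cocoa(line):
    """Return (attribute_name, attribute_value) pairs found in a line."""
    return [(name, value.strip()) for name, value in COCOA_REF.findall(line)]


refs = parse_cocoa("<A WILLIAM SHAKESPEARE>")
print(refs)  # [('A', 'WILLIAM SHAKESPEARE')]
```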
TEI MARK-UP SCHEME
EACH INDIVIDUAL TEXT IS A DOCUMENT
CONSISTING OF A HEADER AND A BODY, EACH IN TURN
COMPOSED OF DIFFERENT ELEMENTS.
EX. IN THE HEADER THERE ARE 4 MAIN
ELEMENTS:
• A FILE DESCRIPTION <FILEDESC>
• AN ENCODING DESCRIPTION <ENCODINGDESC>
• A TEXT PROFILE <PROFILEDESC>
• A REVISION HISTORY <REVISIONDESC>

TAGS CAN BE NESTED, I.E. THEY CAN APPEAR
INSIDE OTHER ELEMENTS.
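
The header structure and nesting described above can be sketched with Python's standard XML library. The element names follow TEI's own mixed-case spelling (`fileDesc` and so on; the slide shows them in capitals), and the title text is a made-up placeholder. This is a skeleton for illustration, not a complete or valid TEI header.

```python
import xml.etree.ElementTree as ET

# Build a header with the four main components named on the slide.
header = ET.Element("teiHeader")
for name in ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc"):
    ET.SubElement(header, name)

# Nesting: a title statement sits inside the file description,
# and a title sits inside the title statement.
file_desc = header.find("fileDesc")
title_stmt = ET.SubElement(file_desc, "titleStmt")
ET.SubElement(title_stmt, "title").text = "Sample text"  # hypothetical title

print(ET.tostring(header, encoding="unicode"))
```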
TEI MARK-UP SCHEME

IT CAN BE EXPRESSED USING A NUMBER
OF DIFFERENT FORMAL LANGUAGES:

• SGML (STANDARD GENERALIZED MARK-UP LANGUAGE), USED BY
THE BNC
• XML (EXTENSIBLE MARK-UP LANGUAGE)
CES MARK-UP SCHEME

DESIGNED SPECIFICALLY FOR THE
ENCODING OF LANGUAGE CORPORA.

• DOCUMENT-WIDE MARK-UP (BIBLIOGRAPHICAL
DESCRIPTION, ENCODING DESCRIPTION, ETC.)

• GROSS STRUCTURAL MARK-UP (VOLUME,
CHAPTER, PARAGRAPH, FOOTNOTES, ETC.; SPECIFIES
RECOMMENDED CHARACTER SETS)

• MARK-UP FOR SUBPARAGRAPH
STRUCTURES (SENTENCE, QUOTATIONS, WORDS,
ABBREVIATIONS, ETC.)
CES MARK-UP SCHEME
IT SPECIFIES A MINIMAL ENCODING LEVEL THAT CORPORA MUST
ACHIEVE TO BE CONSIDERED STANDARDIZED IN TERMS OF
DESCRIPTIVE REPRESENTATION AS WELL AS GENERAL
ARCHITECTURE.

3 LEVELS OF STANDARDIZATION DESIGNED
TO ACHIEVE THE GOAL OF UNIVERSAL
DOCUMENT INTERCHANGE:
• METALANGUAGE LEVEL
• SYNTACTIC LEVEL
• SEMANTIC LEVEL
CORPUS ANNOTATION

NECESSARY IN ORDER TO EXTRACT
RELEVANT INFORMATION FROM CORPORA.

“THE PROCESS OF ADDING […]
INTERPRETIVE, LINGUISTIC INFORMATION
TO AN ELECTRONIC CORPUS OF SPOKEN
AND/OR WRITTEN LANGUAGE DATA”
(LEECH 1997)
• CORPUS ANNOTATION IS LARGELY THE PROCESS OF PROVIDING – IN
A SYSTEMATIC AND ACCESSIBLE FORM – THOSE ANALYSES WHICH A
LINGUIST WOULD, IN ALL LIKELIHOOD, CARRY OUT ANYWAY ON
WHATEVER DATA THEY WORKED WITH. (MCENERY & HARDIE 2012)
• ANNOTATION REFERS EITHER TO A PARTICULAR FORMAT FOR
ENRICHING A TEXT, SO AS TO
  • CATEGORISE PARTS THEREOF, OR
  • MAKE THESE PARTS MORE SALIENT,
OR TO THE ACTUAL PROCESS OF APPLYING THIS FORM OF
ANNOTATION.
ANNOTATION VS. MARK-UP

CORPUS MARK-UP PROVIDES OBJECTIVE, VERIFIABLE
INFORMATION.

ANNOTATION IS CONCERNED WITH INTERPRETIVE
LINGUISTIC INFORMATION.
THE ADVANTAGES OF
ANNOTATION
1. IT MAKES EXTRACTING INFORMATION
EASIER AND FASTER, AND ENABLES HUMAN
ANALYSTS TO EXPLOIT AND RETRIEVE
ANALYSES OF WHICH THEY ARE NOT
THEMSELVES CAPABLE.
THE ADVANTAGES OF
ANNOTATION
2. ANNOTATED CORPORA ARE REUSABLE RESOURCES.

3. ANNOTATED CORPORA ARE MULTIFUNCTIONAL: THEY CAN
BE ANNOTATED FOR ONE PURPOSE AND REUSED FOR
ANOTHER.
THE ADVANTAGES OF
ANNOTATION
4. CORPUS ANNOTATION RECORDS A LINGUISTIC ANALYSIS
EXPLICITLY.
5. CORPUS ANNOTATION PROVIDES A STANDARD REFERENCE
RESOURCE, A STABLE BASE OF LINGUISTIC ANALYSES, SO
THAT SUCCESSIVE STUDIES CAN BE COMPARED AND
CONTRASTED ON A COMMON BASIS.
CRITICISMS OF CORPUS
ANNOTATION
1. ANNOTATION PRODUCES CLUTTERED CORPORA.
2. ANNOTATION IMPOSES AN ANALYSIS, LEADING TO IMPURITY.
ANNOTATED CORPORA ARE SEEN AS IMPURE NOT ONLY BECAUSE THEY
IMPOSE AN ANALYSIS ON THE USERS OF THE DATA, BUT ALSO
BECAUSE THE ANNOTATIONS THEMSELVES MAY BE
INACCURATE OR
INCONSISTENT (SINCLAIR 1992).
3. ANNOTATION OVERVALUES CORPORA, MAKING THEM LESS
ACCESSIBLE.
4. IS ANNOTATION ACCURATE AND CONSISTENT?
HOW ARE CORPORA ANNOTATED?

• AUTOMATIC ANNOTATION
• COMPUTER-ASSISTED ANNOTATION
• MANUAL ANNOTATION

SINCLAIR (1992): THE INTRODUCTION OF THE HUMAN
ELEMENT IN CORPUS ANNOTATION REDUCES CONSISTENCY.
TYPES OF ANNOTATION

DIFFERENT TYPES OF ANNOTATION CAN BE CARRIED OUT
BY DIFFERENT MEANS.
FOR SOME TYPES, AUTOMATIC ANNOTATION IS VERY
ACCURATE.
OTHER TYPES REQUIRE POST-EDITING, I.E. HUMAN
CORRECTION.

HTTP://UCREL.LANCS.AC.UK/ANNOTATION.HTML#POS
TYPES OF ANNOTATION

CORPORA CAN BE ANNOTATED AT DIFFERENT LEVELS OF
LINGUISTIC ANALYSIS.
• PHONOLOGICAL LEVEL
• SYLLABLE BOUNDARIES (PHONETIC/PHONEMIC ANNOTATION)
• PROSODIC FEATURES (PROSODIC ANNOTATION)
TYPES OF ANNOTATION

• MORPHOLOGICAL LEVEL
• PREFIXES
• SUFFIXES
• STEMS
(MORPHOLOGICAL ANNOTATION)
TYPES OF ANNOTATION

• LEXICAL LEVEL
• PART OF SPEECH (POS TAGGING)
• LEMMAS (LEMMATIZATION)
• SEMANTIC FIELDS (SEMANTIC
ANNOTATION)

• SYNTACTIC LEVEL
• PARSING
• TREEBANKING
• BRACKETING
TYPES OF ANNOTATION

• DISCOURSE LEVEL
• ANAPHORIC RELATIONS (COREFERENCE ANNOTATION)
• SPEECH ACTS (PRAGMATIC ANNOTATION)
• STYLISTIC FEATURES SUCH AS SPEECH AND THOUGHT
PRESENTATION (STYLISTIC ANNOTATION).
POS TAGGING
MOST COMMON TYPE OF ANNOTATION.

ALSO KNOWN AS GRAMMATICAL TAGGING OR
MORPHO-SYNTACTIC ANNOTATION.

IT PROVIDES THE BASIS FOR FURTHER FORMS
OF ANALYSIS SUCH AS PARSING AND
SEMANTIC ANNOTATION.

MANY LINGUISTIC ANALYSES, E.G. OF THE COLLOCATES
OF A WORD, DEPEND HEAVILY ON POS TAGGING.
POS TAGGING

IT CAN BE PERFORMED AUTOMATICALLY WITH TAGGERS LIKE
CLAWS
HTTP://WWW.COMP.LANCS.AC.UK/UCREL/CLAWS/

YOU CAN TRY IT FOR FREE ONLINE.

EXAMPLES OF TAGS: NN1 (SINGULAR COMMON NOUN), VVZ (VERB IN
THE THIRD PERSON SINGULAR OF THE SIMPLE PRESENT TENSE), VVD
(VERB IN THE SIMPLE PAST FORM), AJ0 (ADJECTIVE IN THE BASIC FORM),
ETC.
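
To show how tags of this kind are typically consumed, here is a small sketch that reads text in a `word_TAG` layout and counts the tags. The sample sentence is tagged by hand for illustration (it is not real CLAWS output), and the function name `split_tagged` is an assumption.

```python
from collections import Counter


def split_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST underscore, so a word that itself
        # contains an underscore still keeps its tag intact.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


# Hand-tagged sample sentence, for illustration only.
tagged = "The_AT0 boy_NN1 visits_VVZ Mary_NP0"
pairs = split_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[1])           # ('boy', 'NN1')
print(tag_counts["VVZ"])  # 1
```

Once a corpus is in this form, a query such as "all verbs followed by a noun" becomes a simple filter over the tag sequence.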
POS TAGGING

PROBLEMS:
• WORD SEGMENTATION (TOKENIZATION)
• MULTIWORDS (SO THAT, IN SPITE OF)
• MERGERS (CAN’T, GONNA)
• VARIABLY SPELLED COMPOUNDS (NOTICEBOARD, NOTICE-
BOARD, NOTICE BOARD)
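
The first three problems above can be made concrete with a toy tokenizer. The multiword list, the merger splits and the function name `tokenize` are all assumptions chosen for illustration; a real tagger relies on much larger resources and its own conventions.

```python
import re

# Illustrative resources only, not taken from any real tagger.
MULTIWORDS = {("so", "that"), ("in", "spite", "of")}
MERGERS = {"can't": ["ca", "n't"], "gonna": ["gon", "na"]}


def tokenize(text):
    """Lowercase, split into words, join multiwords, split mergers."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest multiword expression first.
        for mw in sorted(MULTIWORDS, key=len, reverse=True):
            if tuple(words[i:i + len(mw)]) == mw:
                tokens.append("_".join(mw))  # one token for the whole unit
                i += len(mw)
                break
        else:
            # No multiword starts here: split a merger, or keep the word.
            tokens.extend(MERGERS.get(words[i], [words[i]]))
            i += 1
    return tokens


print(tokenize("I can't go in spite of that"))
# ['i', 'ca', "n't", 'go', 'in_spite_of', 'that']
```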
LEMMATIZATION

A TYPE OF ANNOTATION THAT REDUCES THE
INFLECTIONAL VARIANTS OF WORDS TO THEIR
RESPECTIVE LEXEMES OR LEMMAS AS THEY
APPEAR IN DICTIONARY ENTRIES:

DO, DOES, DID, DONE, DOING = DO
CORPUS, CORPORA = CORPUS

SMALL CAPITAL LETTERS ARE THE CONVENTION.
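
A lookup table is the simplest way to sketch this reduction. The table below contains only the two lemmas from the examples above, upper case stands in for small capitals, and the fallback for unknown words (treating the word as its own lemma) is an assumption of this sketch, not how a real lemmatizer works.

```python
# Word form -> lemma, covering only the examples given above.
LEMMA_TABLE = {
    "do": "DO", "does": "DO", "did": "DO", "done": "DO", "doing": "DO",
    "corpus": "CORPUS", "corpora": "CORPUS",
}


def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens become their own lemma."""
    return [LEMMA_TABLE.get(t.lower(), t.upper()) for t in tokens]


print(lemmatize(["did", "corpora", "doing"]))  # ['DO', 'CORPUS', 'DO']
```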
LEMMATIZATION

IT IS IMPORTANT IN VOCABULARY STUDIES AND
LEXICOGRAPHY, E.G. IN STUDYING THE DISTRIBUTION
PATTERN OF
LEXEMES AND IMPROVING DICTIONARIES AND COMPUTER
LEXICONS.

IT CAN BE PERFORMED AUTOMATICALLY.


PARSING

ONCE A CORPUS IS POS TAGGED, IT IS POSSIBLE TO BRING
THESE MORPHO-SYNTACTIC CATEGORIES INTO HIGHER
LEVEL SYNTACTIC RELATIONSHIPS WITH ONE ANOTHER,
THAT IS, TO ANALYSE THE SENTENCES IN A CORPUS INTO
THEIR CONSTITUENTS.

PARSING CONSISTS IN BRACKETING.

IT CAN BE AUTOMATED BUT WITH A LOW PRECISION RATE.


PARSING

EXAMPLE:

(S (NP MARY)
(VP VISITED
(NP A
(ADJP VERY NICE)
BOY)))
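
A bracketed parse of this kind can be read into a nested structure with a short recursive routine, which also checks that the brackets balance. The sketch below assumes the object noun phrase is attached inside the verb phrase; the function names `parse_brackets` and `leaves` are illustrative.

```python
def parse_brackets(text):
    """Turn '(LABEL child ...)' notation into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1
        if pos >= len(tokens):
            raise ValueError("missing constituent label")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing closing bracket")
        pos += 1  # consume ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("unbalanced brackets")
    return tree


def leaves(tree):
    """Recover the words of the sentence in order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(tree[0])       # S
print(leaves(tree))  # ['Mary', 'visited', 'a', 'very', 'nice', 'boy']
```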
SEMANTIC ANNOTATION

IT ASSIGNS CODES INDICATING THE SEMANTIC FEATURES OR
THE SEMANTIC FIELDS OF THE WORDS IN A TEXT. IT IS
KNOWLEDGE-BASED, SO IT NEEDS TO BE MANUAL MOST OF
THE TIME.
(WORD SENSE TAGGING)
TWO TYPES:
• ONE MARKS THE SEMANTIC RELATIONSHIPS BETWEEN
THE CONSTITUENTS IN A SENTENCE
• ONE MARKS THE SEMANTIC FEATURES OF
WORDS IN A TEXT
COREFERENCE ANNOTATION
• THIS TYPE OF ANNOTATION MAKES IT POSSIBLE TO TRACK HOW
ELEMENTS OF A TEXT ARE PROGRESSIVELY INTERWOVEN SO
THAT COHESION IS ACHIEVED, TYPICALLY THROUGH THE USE OF

• PRONOUNS
• REPETITION
• SUBSTITUTION
• ELLIPSIS

COMPUTER-ASSISTED AT BEST.
A SIMPLE EXAMPLE OF ANAPHORIC ANNOTATION IS:
(6 THE MARRIED COUPLE 6) SAID THAT <REF=6 THEY WERE
HAPPY WITH <REF=6 THEIR LOT.

HERE THE NUMBER 6 IS AN INDEX NUMBER WHILE THE LESS THAN CHARACTER <
INDICATES THAT A BACKWARD REFERENTIAL (ANAPHORIC) LINK IS PRESENT, I.E.
THEY AND THEIR POINT BACKWARD TO THE MARRIED COUPLE
(CITED FROM GARSIDE, FLIGELSTONE AND BOTLEY 1997: 68).
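
The notation in this example can be unpacked mechanically. The sketch below pairs each anaphor with its antecedent using two regular expressions written for this one example; the patterns are assumptions and do not cover the full annotation scheme.

```python
import re

# "(6 the married couple 6)": index, phrase, same index again.
ANTECEDENT = re.compile(r"\((\d+) (.+?) \1\)")
# "<REF=6 they": backward link to index 6, followed by the anaphor.
ANAPHOR = re.compile(r"<REF=(\d+) (\w+)")

text = ("(6 the married couple 6) said that <REF=6 they were "
        "happy with <REF=6 their lot.")

antecedents = dict(ANTECEDENT.findall(text))
links = [(word, antecedents.get(index)) for index, word in ANAPHOR.findall(text)]

print(links)
# [('they', 'the married couple'), ('their', 'the married couple')]
```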
PRAGMATIC ANNOTATION

• THE FOCUS OF PRAGMATIC ANNOTATION APPEARS TO BE ON
SPEECH/DIALOGUE ACTS IN DOMAIN-SPECIFIC DIALOGUE
SUCH AS DOCTOR-PATIENT DISCOURSE AND TELEPHONE
CONVERSATIONS.

THE MOST COHERENT SYSTEM IS DRI (DISCOURSE
RESOURCE INITIATIVE).
3 LAYERS OF CODING:
• SEGMENTATION (DIVIDING DIALOGUE INTO
TEXTUAL UNITS, UTTERANCES)
• FUNCTIONAL ANNOTATION (DIALOGUE ACT ANNOTATION)
• UTTERANCE TAGS (APPLYING UTTERANCE TAGS THAT CHARACTERIZE THE ROLE OF THE
UTTERANCE AS A DIALOGUE ACT)
PRAGMATIC ANNOTATION
UTTERANCE TAGS:
• COMMUNICATIVE STATUS (INTELLIGIBLE, COMPLETE, ETC.)
• INFORMATION LEVEL AND STATUS (INDICATING THE
SEMANTIC CONTENT OF THE UTTERANCE AND HOW IT
RELATES TO THE TASK IN QUESTION)
• FORWARD-LOOKING COMMUNICATIVE FUNCTION
(UTTERANCES THAT MAY CONSTRAIN OR AFFECT THE
DISCOURSE, E.G. ASSERT, REQUEST, QUESTION AND OFFER)
• BACKWARD-LOOKING COMMUNICATIVE FUNCTION
(UTTERANCES THAT RELATE TO PREVIOUS PARTS OF THE
DISCOURSE, E.G. ACCEPT, BACKCHANNELLING, ANSWER)
STYLISTIC ANNOTATION

IT IS PARTICULARLY ASSOCIATED WITH STYLISTIC FEATURES
IN LITERARY TEXTS.

AN EXAMPLE: THE REPRESENTATION
OF PEOPLE’S SPEECH AND THOUGHTS, KNOWN AS SPEECH
AND THOUGHT PRESENTATION (S&TP)
OTHER TYPES OF TAGGING

• ERROR TAGGING
• PROBLEM-ORIENTED ANNOTATION

ERROR TAGGING INVOLVES ASSIGNING CODES INDICATING THE TYPES
OF ERRORS OCCURRING IN A LEARNER CORPUS.

CORPORA ANNOTATED FOR LEARNER ERRORS CAN HELP TO REVEAL THE
RELATIVE FREQUENCY OF ERROR TYPES PRODUCED BY LEARNERS OF DIFFERENT
L1 (FIRST LANGUAGE) BACKGROUNDS AND PROFICIENCY LEVELS.

THEY ARE ALSO USEFUL WHEN EXPLORING FEATURES OF NON-NATIVE LANGUAGE
BEHAVIOUR (E.G. OVERUSE OR UNDER-USE OF CERTAIN LINGUISTIC FEATURES)
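
The idea of comparing relative error frequencies across L1 backgrounds can be sketched as follows. The records, the error codes (`ART`, `TENSE`) and the L1 labels are invented purely to make the example run; they are not findings from any learner corpus.

```python
from collections import Counter, defaultdict

# Invented (L1, error code) records, for illustration only.
records = [
    ("L1-A", "ART"),
    ("L1-A", "TENSE"),
    ("L1-A", "ART"),
    ("L1-B", "ART"),
]


def relative_frequencies(records):
    """Return, per L1 group, each error code's share of that group's errors."""
    by_l1 = defaultdict(Counter)
    for l1, code in records:
        by_l1[l1][code] += 1
    return {
        l1: {code: n / sum(counts.values()) for code, n in counts.items()}
        for l1, counts in by_l1.items()
    }


result = relative_frequencies(records)
print(result["L1-B"]["ART"])  # 1.0
```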
• HTTP://UCREL.LANCS.AC.UK/CLAWS/
• HTTP://UCREL.LANCS.AC.UK/CLAWS1TAGS.HTML
• UCREL SEMANTIC ANALYSIS SYSTEM
HTTP://WWW.COMP.LANCS.AC.UK/UCREL/USAS/
HTTP://WWW.AHDS.AC.UK/GUIDES/LINGUISTIC-CORPORA/CHAPTER2.HTM
TAGS (ELEMENTS)

• OPENING TAGS (<ELEMENT_NAME>), E.G. <P> FOR THE START OF
A PARAGRAPH;
• CLOSING TAGS, WHERE THE NAME OF THE ELEMENT IS
PRECEDED BY A SLASH (</ELEMENT_NAME>), E.G. </P> FOR THE
END OF A PARAGRAPH;
• ‘EMPTY’ TAGS, WHERE THE CLOSING BRACKET IS PRECEDED BY
(A SPACE AND) A SLASH (<ELEMENT_NAME />), E.G. <PAUSE /> TO
REPRESENT A PAUSE OF UNDEFINED DURATION.
(WEISSER, 2016)
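
The three kinds of tag can be seen at work in a short sketch using Python's standard XML parser. The sample utterance is made up, and the element names are written in lower case here because XML names are case-sensitive.

```python
import xml.etree.ElementTree as ET

# <p> opens the paragraph, </p> closes it, <pause /> is an empty tag.
p = ET.fromstring("<p>well <pause /> I think so</p>")

pause = p.find("pause")

print(p.tag)        # p
print(p.text)       # text before the empty tag: 'well '
print(pause.text)   # None, because an empty tag has no content
print(pause.tail)   # text after the empty tag: ' I think so'
```

In this library, the text that follows an empty element is stored on that element as its `tail`, which is why the paragraph's own `text` stops at the pause.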
• LINGUISTIC ANNOTATION SCHEME - COMPREHENSIVE OPERATIONAL
DEFINITION FOR A PARTICULAR VARIABLE, WITH DETAILED
INSTRUCTIONS AS TO HOW THE VALUES OF THIS VARIABLE SHOULD
BE ASSIGNED TO LINGUISTIC DATA (IN OUR CASE, CORPUS DATA,
BUT ANNOTATION SCHEMES ARE ALSO NEEDED TO CATEGORIZE
EXPERIMENTALLY ELICITED LINGUISTIC DATA).
• THE ANNOTATION SCHEME WOULD TYPICALLY ALSO INCLUDE A
CODING SCHEME, SPECIFYING THE LABELS BY WHICH THESE
CATEGORIES ARE TO BE REPRESENTED.
(STEFANOWITSCH, 2020)
ACCOUNTABILITY

• A SIGNIFICANT ADVANTAGE OF USING CORPORA IS THAT CORPORA
ALLOW ANALYSTS TO APPROACH THE STUDY OF LANGUAGE
WITHIN THE CONTEXT OF THE SCIENTIFIC METHOD (LEECH 1992).
• IF YOU APPROACH A CORPUS WITH A SPECIFIC THEORY IN MIND, IT
CAN BE EASY TO UNINTENTIONALLY FOCUS ON AND PULL OUT
ONLY THE EXAMPLES FROM THE CORPUS THAT SUPPORT THE
THEORY (THIS IS TECHNICALLY CALLED CONFIRMATION BIAS).
• WE MUST NOT SELECT A FAVOURABLE SUBSET OF THE DATA THAT
CONTAINS ONLY THE VARIABLES WE WANT TO RESEARCH.
FALSIFIABILITY (POPPER, 2006)

• COUNTER TO ACCOUNTABILITY.
• USE THE ENTIRE CORPUS – AND ALL RELEVANT EVIDENCE
EMERGING FROM ANALYSIS OF THE CORPUS – TO TEST THE
HYPOTHESIS
• THERE SHOULD BE NO MOTIVATED SELECTION OF
EXAMPLES TO FAVOUR THOSE EXAMPLES THAT FIT THE
HYPOTHESIS,
• AND NO SCREENING OUT OF INCONVENIENT EXAMPLES.
REPLICABILITY

• ANY CLAIM OF TOTAL ACCOUNTABILITY IN CORPUS LINGUISTICS MUST BE MODERATED.


• WE CAN ONLY SEEK TOTAL ACCOUNTABILITY RELATIVE TO THE DATASET THAT WE ARE
USING, NOT TO THE ENTIRETY OF LANGUAGE ITSELF. (MCENERY & HARDIE 2012)
• MODERATING THE CLAIM OF TOTAL ACCOUNTABILITY IN THE LIGHT OF THE FINITE SIZE

OF THE CORPUS DOES RAISE ONE TROUBLING POSSIBILITY.


• AN ANALYST MAY, BY CHANCE OR DESIGN, CONSTRUCT A DATASET THAT MISREPRESENTS
THE LANGUAGE SUCH THAT THE ANALYSIS OF THIS DATASET SUPPORTS A FAULTY
THEORY.

(MCENERY & HARDIE 2012)


Hence, results are typically considered provisional until they are known to
be replicable – and in many cases, it is precisely through that process of
continuous checking of results as theories develop and expand that
falsifiability is achieved.
A SINGLE EXAMPLE MAY FALSIFY A HYPOTHESIS, LEADING
TO THE REVISION, OR ABANDONMENT, OF THAT SPECIFIC
HYPOTHESIS. IN THAT SENSE, APPROACHING A CORPUS TO
FIND A SINGLE EXAMPLE IS ENTIRELY CONSISTENT WITH BOTH
THE SCIENTIFIC METHOD AND THE PRINCIPLE OF TOTAL
ACCOUNTABILITY.
(MCENERY & HARDIE 2012)
