
Natural Language Processing

CSE4022

Lecture-10: Text Processing - Module 2


Corpus and Corpus Analysis

Dr. Durgesh Kumar


Assistant Professor, SCOPE, VIT Vellore
Table of contents

1 Summary of Module-2 : Text Processing

2 Corpus and Corpus Analysis

Dr. Durgesh Kumar Lecture-02 | NLP | CSE4022 August 23, 2022
Module 2 recap - Text processing

Module 2 Summary

Document Triage - Character Encoding, Language Identification, Text
Sectioning
Character Encoding - ASCII, Extended ASCII, 2-byte encodings for Japanese
and Chinese, Unicode: UTF-8
Challenges of Tokenization - Character Encoding dependence,
Language dependence, Corpus Dependence
Regular Expression, Min Edit (Levenshtein) distance between two
strings
Word Normalization: Lemmatization and Stemming
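
As a reminder of the minimum edit distance item above, the following is a minimal Python sketch. It assumes a cost of 1 for insertion and deletion and a configurable substitution cost (some textbooks use 2); the function name min_edit_distance and the parameter sub_cost are illustrative choices, not something defined in this module.

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] holds the distance between the source prefix processed
    # so far and target[:j]; the first row is the cost of j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" needs i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(min_edit_distance("kitten", "sitting"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

Keeping only two rows instead of the full table is a space optimisation; the full table is needed only if the alignment itself has to be recovered.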

Module 2 recap - Text processing

Module 2 Summary

Word Tokenization and Sentence Tokenization


Space-delimited languages and un-segmented Language
Greedy Max matching Algorithm for word Segmentation.
Decision tree based algorithm for Sentence segmentation
ML-based classifiers like SVM, Logistic Regression and Neural
network-based classifiers using different features such as the number of
words/characters after the period “.”, capitalization of the words
surrounding the period, and POS of the words surrounding the period.
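
The greedy max-matching idea from the recap can be sketched as follows. This is a minimal illustration, assuming a plain set of known words as the dictionary; the function name max_match and the toy dictionary are hypothetical, and an unsegmented English string stands in for a language written without spaces.

```python
from typing import List, Set


def max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-match-first segmentation of an unsegmented string."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink it until it is a known word.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            # No dictionary word starts here: emit a single character.
            end = start + 1
        tokens.append(text[start:end])
        start = end
    return tokens


toy_dictionary = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", toy_dictionary))
```

The example string is chosen to show the weakness of the greedy strategy: it commits to "theta" and never recovers "the table down there".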

Corpus Introduction

Corpus

Cambridge dictionary: a collection of written or spoken material
stored on a computer and used to find out how language is used.
MXT 2006: A corpus is a collection of (i) machine-readable,
(ii) authentic texts (including transcripts of spoken data) which is
(iii) sampled to be (iv) representative of a particular language or
language variety.
Plural of Corpus is corpora.

Example: Brown Corpus, British National Corpus (BNC)

What is not a corpus

A list of words is not a corpus.


A text archive is not a corpus - A random collection of texts
A collection of citations is not a corpus.
A collection of quotations is not a corpus - A short selection from a
text chosen on internal criteria by human beings
A text is not a corpus - Intended to be read in different ways
The Web is not a corpus - Its dimensions are unknown, it is constantly
changing, and it was not designed from a linguistic perspective

Need of Corpus

A corpus is made for the study of language in a broad sense


1 to test existing linguistic theory and hypotheses
2 to generate and verify new linguistic hypotheses
3 beyond linguistics, to provide textual evidence in text-based
humanities and social sciences subjects

The purpose is reflected in a well-designed corpus

Benefits of corpus data?
1 Corpus data is more reliable
A corpus pools together linguistic intuitions of a range of language
speakers, which offsets the potential biases in intuitions of individual
speakers.
2 Corpus data is more natural.
It is used in real communications instead of being invented
specifically for linguistic analysis.
3 Corpus data is contextualized
Attested language use which has already occurred in real linguistic
context
4 Corpus data is quantitative
Corpora can provide frequencies and statistics readily
5 Corpus data can find differences that intuitions alone cannot
perceive.
e.g. synonyms: totally, absolutely, utterly, completely, entirely
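
The claim that corpora provide frequencies readily can be illustrated with a short Python sketch. The sentences below are invented for illustration and are not real corpus data; the names toy_corpus and count_targets are hypothetical.

```python
import re
from collections import Counter

# Invented sentences standing in for a corpus (an assumption for this sketch).
toy_corpus = [
    "I totally agree with you.",
    "That is absolutely right.",
    "She was utterly exhausted.",
    "I completely forgot, and I totally missed the call.",
]
synonyms = {"totally", "absolutely", "utterly", "completely", "entirely"}


def count_targets(sentences, targets):
    """Count how often each target word occurs, ignoring case."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z']+", sentence.lower()):
            if token in targets:
                counts[token] += 1
    return counts


frequencies = count_targets(toy_corpus, synonyms)
for word in sorted(synonyms):
    print(word, frequencies[word])
```

On a real corpus the same counts, together with the words that occur next to each synonym, are what reveal usage differences that intuition alone tends to miss.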
Why use Corpora?

1 A corpus can be more comprehensive and balanced


Even expert speakers have only a partial knowledge of a language
2 A corpus can show us what is common and typical.
Even expert speakers tend to notice the unusual and think of what is
possible
3 A corpus can readily give us accurate statistics
Even expert speakers cannot quantify their knowledge of language

Why use Corpora? (Contd.)

4 A corpus can store and recall all the information that has been
stored in it.
Even expert speakers cannot remember everything they know.
5 A corpus can provide us with a vast number of examples in real
communication context.
Even expert speakers cannot make up a vast number of natural
examples.
6 A corpus can give you more objective evidence
Even expert speakers have prejudices and preferences and every
language has cultural connotations and underlying ideology

Why use Corpora? (Contd.)

7 A corpus can be made permanently accessible to all.


Even expert speakers are not always available to be consulted.
8 A constantly updated corpus can reflect even recent changes in the
language.
Even expert speakers cannot keep up with language change
9 A corpus can encompass the actual language use of many expert
speakers
Even expert speakers lack authority: they can be challenged by other
expert speakers

Example of Corpus

1 Brown corpus
It contains 500 samples of English-language text, totaling roughly
one million words, compiled from works published in the United
States in 1961.
The texts for the corpus were sampled from 15 different text
categories to make the corpus a good standard reference.
2 Lancaster-Oslo-Bergen (LOB) Corpus
It consists of one million words of British English texts from 1961.
The texts for the corpus were sampled from 15 different text
categories.
Each text is just over 2,000 words long and the number of texts in
each category varies.

3 British National Corpus (BNC)
100-million-word text corpus of samples of written and spoken
English from a wide range of sources.
The corpus covers British English of the late 20th century from a
wide variety of genres

Corpus Annotation and Markup

Markup - processing/formatting information, metadata/text
classifications, structural representation
Annotation - the practice of adding interpretative linguistic
information to a corpus (Leech 2005)
Benefits of Markup and Annotation: interpretive, linguistic, value added.
Examples
present_NN1 (singular common noun)
present_VVB (base form of a lexical verb)
present_JJ (general adjective)

Markup Schemes

1 COCOA – attributes and values in <angle brackets>

A – Author
Value – William Shakespeare
2 Markup Languages - XML, SGML

3 JSON format
{ "AUTHOR": "WILLIAM SHAKESPEARE" }
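
To compare the schemes side by side, the sketch below writes the same author metadata in COCOA-style, XML, and JSON form using only the Python standard library. The XML element name author is an assumption made for this illustration, and the COCOA line follows the attribute-then-value convention inside angle brackets described above.

```python
import json
import xml.etree.ElementTree as ET

metadata = {"AUTHOR": "WILLIAM SHAKESPEARE"}

# COCOA-style reference: attribute code, then value, inside angle brackets.
# "A" as the code for author follows the slide.
cocoa_text = "<A {}>".format(metadata["AUTHOR"])

# XML: the element name "author" is a hypothetical choice for this sketch.
element = ET.Element("author")
element.text = metadata["AUTHOR"]
xml_text = ET.tostring(element, encoding="unicode")

# JSON: keys and string values must use double quotes.
json_text = json.dumps(metadata)

print(cocoa_text)
print(xml_text)
print(json_text)
```

Unlike the COCOA form, the XML and JSON forms can be read back with standard parsers (ET.fromstring and json.loads), which is one practical reason they are preferred for new corpora.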

Why Annotate?

Manual examination of corpus


Automatic analysis of corpus
Re-usability of annotations
Multi-functionality
Objective record of analysis
Annotation process is corpus analysis

Types of Corpus Annotation

Part-of-speech (POS)
Named Entity Recognition
Syntactical (parsing)
Lexical Parsing
Semantic (Domain Classification)
Co-reference resolution
Event Detection
Sentiment Analysis
Language Identification
Fake News Detection

CLAWS part-of-speech tagger for English

CLAWS (the Constituent Likelihood Automatic Word-tagging
System) was developed by UCREL at Lancaster.
The latest version of the tagger, CLAWS4, was used to POS tag
c. 100 million words of the British National Corpus (BNC).
Documentation can be accessed at https://ucrel.lancs.ac.uk/claws/
POS tagging demo online at
http://ucrel-api.lancaster.ac.uk/claws/free.html

CLAWS - POS Tagging - TAG sets examples

1 Noun
NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)
2 Pronoun
PNI indefinite pronoun (e.g. NONE, EVERYTHING)
PNP personal pronoun (e.g. YOU, THEM, OURS)
PNQ wh-pronoun (e.g. WHO, WHOEVER)
PNX reflexive pronoun (e.g. ITSELF, OURSELVES)

CLAWS - POS Tagging - TAG sets examples

3 Verb
VBB the "base forms" of the verb "BE" (except the infinitive), i.e.
AM, ARE
VBD past form of the verb "BE", i.e. WAS, WERE
VBG -ing form of the verb "BE", i.e. BEING
VBI infinitive of the verb "BE"
VBN past participle of the verb "BE", i.e. BEEN
VBZ -s form of the verb "BE", i.e. IS, ’S
VDB base form of the verb "DO" (except the infinitive), i.e. DO
VDD past form of the verb "DO", i.e. DID
VDG -ing form of the verb "DO", i.e. DOING
VDI infinitive of the verb "DO"
VDN past participle of the verb "DO", i.e. DONE
VDZ -s form of the verb "DO", i.e. DOES

CLAWS - POS Tagging - TAG sets examples

3 Verb
VHB base form of the verb "HAVE" (except the infinitive), i.e. HAVE
VHD past tense form of the verb "HAVE", i.e. HAD, ’D
VHG -ing form of the verb "HAVE", i.e. HAVING
VHI infinitive of the verb "HAVE"
VHN past participle of the verb "HAVE", i.e. HAD
VHZ -s form of the verb "HAVE", i.e. HAS, ’S
VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, ’LL)
VVB base form of lexical verb (except the infinitive)(e.g. TAKE,
LIVE)
VVD past tense form of lexical verb (e.g. TOOK, LIVED)
VVG -ing form of lexical verb (e.g. TAKING, LIVING)
VVI infinitive of lexical verb
VVN past participle form of lex. verb (e.g. TAKEN, LIVED)
VVZ -s form of lexical verb (e.g. TAKES, LIVES)
CLAWS - POS Tagging - TAG sets examples
4 Adjective
AJ0 adjective (unmarked) (e.g. GOOD, OLD)
AJC comparative adjective (e.g. BETTER, OLDER)
AJS superlative adjective (e.g. BEST, OLDEST)
5 Adverb
AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER,
FURTHEST)
AVP adverb particle (e.g. UP, OFF, OUT)
AVQ wh-adverb (e.g. WHEN, HOW, WHY)
6 Article
AT0 article (e.g. THE, A, AN)
More detailed tag sets for POS tagging can be found in the
CLAWS5 and CLAWS7 tagset documentation.
CLAWS POS tagging software

Link : http://ucrel-api.lancaster.ac.uk/claws/free.html
Input: I am reading a book.
Output: I_PNP am_VBB reading_VVG a_AT0 book_SENT ._PUN
Input: Labeled dataset is required for supervised Machine Learning task.
Output: Labeled_AJ0 dataset_NN1 is_VBZ required_VVN for_PRP supervised_AJ0 Machine_NN1 Learning_NN1 task_SENT ._PUN
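
The tagger output above uses a word_TAG format with tokens separated by whitespace. A minimal Python sketch for turning such a line into (word, tag) pairs is shown below; the function name parse_tagged is hypothetical, and the sketch assumes the tag is whatever follows the last underscore in each token.

```python
from typing import List, Optional, Tuple


def parse_tagged(line: str) -> List[Tuple[str, Optional[str]]]:
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore so words containing "_" stay intact.
        word, separator, tag = token.rpartition("_")
        if separator:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))  # token without a tag
    return pairs


for word, tag in parse_tagged("I_PNP am_VBB reading_VVG a_AT0"):
    print(word, tag)
```

Once the output is in this form, counting tags or filtering by word class (for example, all tokens whose tag starts with "VV") takes one line each.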

Dependency Parsing

I prefer the morning flight through Denver.
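
A dependency parse of this sentence can be stored as a list of head-dependent arcs. The sketch below is hand-assigned and hedged: the relation labels follow Universal Dependencies conventions as an assumption, they are not the output of any parser, the final period is left out, and the names arcs and is_well_formed_tree are hypothetical.

```python
sentence = ["I", "prefer", "the", "morning", "flight", "through", "Denver"]

# (dependent, head, relation); words are numbered from 1, and 0 is ROOT.
# Labels are hand-assigned for illustration (assumed, UD-style).
arcs = [
    (1, 2, "nsubj"),     # I        <- prefer
    (2, 0, "root"),      # prefer   <- ROOT
    (3, 5, "det"),       # the      <- flight
    (4, 5, "compound"),  # morning  <- flight
    (5, 2, "obj"),       # flight   <- prefer
    (6, 7, "case"),      # through  <- Denver
    (7, 5, "nmod"),      # Denver   <- flight
]


def is_well_formed_tree(n_words, arcs):
    """Check: one head per word, exactly one root, and no cycles."""
    heads = {dependent: head for dependent, head, _ in arcs}
    if len(heads) != len(arcs) or set(heads) != set(range(1, n_words + 1)):
        return False
    if sum(1 for head in heads.values() if head == 0) != 1:
        return False
    for word in heads:
        seen = set()
        node = word
        while node != 0:  # follow heads upward until ROOT is reached
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


print(is_well_formed_tree(len(sentence), arcs))
```

Storing each word's single head is enough to reconstruct the whole tree, which is why dependency annotation is commonly written one token per line with a head column.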

Dependency Parsing - Tags

Dependency Parsing - More examples

Semantic Annotation

Each word is given a code from a thesaurus-style dictionary
Also called Word Sense Tagging
Example: WordNet http://wordnet.princeton.edu/

Semantic Annotation
The verb move has 16 senses (first 4 from tagged texts)
1 (130) travel, go, move, locomote – (change location; move, travel,
or proceed; "How fast does your new car go?"; "We travelled from
Rome to Naples by bus"; "The policemen went from door to door
looking for the suspect")
2 (60) move, displace – (cause to move, both in a concrete and in an
abstract sense; "Move those boxes into the corner, please"; "I’m
moving my money to another bank"; "The director moved more
responsibilities onto his new assistant")
3 (52) move – (move so as to change position, perform a
nontranslational motion; "He moved his hand slightly to the right")
4 (20) move – (change residence, affiliation, or place of employment;
"We moved from Idaho to Nebraska"; "The basketball player moved
from one team to another")
Semantic Annotation
The noun move has 5 senses (first 5 from tagged texts)
1 (377) move – (the act of deciding to do something; "he didn’t make
a move to help"; "his first move was to hire a lawyer")
2 (70) move, relocation – (the act of changing your residence or place
of business; "they say that three moves equal one fire")
3 (57) motion, movement, move, motility – (a change of position that
does not entail a change of location; "the reflex motion of his
eyebrows revealed his surprise"; "movement is a sign of life"; "an
impatient move of his hand"; "gastrointestinal motility")
4 (30) motion, movement, move – (the act of changing location from
one place to another; "police controlled the motion of the crowd";
"the movement of people from the farms to the cities")
5 (5) move – ((game) a player’s turn to take some action permitted by
the rules of the game)
Resources needed for Annotation

A collection of documents.
Annotation guidelines
Tool to perform annotation.

Next Class

Examples of a few annotations
Currently available tools/software for annotation.
