Lecture-10: Module 2
CSE4022
Dr. Durgesh Kumar | Lecture-02 | NLP | CSE4022 | August 23, 2022
Module 2 recap - Text processing
Module 2 Summary
Corpus Introduction
Corpus
What is not a corpus
Need of Corpus
Benefits of corpus data?
1 Corpus data is more reliable
A corpus pools together linguistic intuitions of a range of language
speakers, which offsets the potential biases in intuitions of individual
speakers.
2 Corpus data is more natural.
It is used in real communications instead of being invented
specifically for linguistic analysis.
3 Corpus data is contextualized
It is attested language use that has already occurred in a real
linguistic context.
4 Corpus data is quantitative
Corpora can provide frequencies and statistics readily
5 Corpus data can find differences that intuitions alone cannot
perceive.
e.g. synonyms: totally, absolutely, utterly, completely, entirely
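Points 4 and 5 can be illustrated with a minimal sketch. The mini-corpus below is invented purely for illustration (it is not real corpus data), and the names `toy_corpus`, `INTENSIFIERS` and `tokenize` are hypothetical. The code counts how often each near-synonym occurs and which word follows it, which is the kind of evidence intuition alone does not supply.

```python
from collections import Counter
import re

# Toy sentences invented for illustration; a real study would use a large corpus.
toy_corpus = [
    "I totally agree with you.",
    "That is absolutely fine.",
    "She was utterly exhausted.",
    "The room was completely empty.",
    "It is entirely possible.",
    "He was utterly miserable.",
    "I totally forgot about it.",
]

# The near-synonyms listed on the slide.
INTENSIFIERS = {"totally", "absolutely", "utterly", "completely", "entirely"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


frequency = Counter()                                   # occurrences of each intensifier
collocates = {word: Counter() for word in INTENSIFIERS}  # words that directly follow it

for sentence in toy_corpus:
    tokens = tokenize(sentence)
    for i, token in enumerate(tokens):
        if token in INTENSIFIERS:
            frequency[token] += 1
            if i + 1 < len(tokens):
                collocates[token][tokens[i + 1]] += 1

for word, count in frequency.most_common():
    following = ", ".join(sorted(collocates[word]))
    print(f"{word}: {count} occurrence(s), followed by: {following}")
```

On a real corpus, the same counts would show which words each intensifier tends to combine with.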
Why use Corpora?
Why use Corpora? (Contd.)
4 A corpus can store and recall all the information that has been
stored in it.
Even expert speakers cannot remember everything they know.
5 A corpus can provide us with a vast number of examples in real
communication context.
Even expert speakers cannot make up a vast number of natural
examples.
6 A corpus can give you more objective evidence.
Even expert speakers have prejudices and preferences, and every
language carries cultural connotations and an underlying ideology.
Why use Corpora? (Contd.)
Example of Corpus
1 Brown corpus
It contains 500 samples of English-language text, totaling roughly
one million words, compiled from works published in the United
States in 1961.
The texts for the corpus were sampled from 15 different text
categories to make the corpus a good standard reference.
2 Lancaster-Oslo-Bergen (LOB) Corpus
It consists of one million words of British English texts from 1961.
The texts for the corpus were sampled from 15 different text
categories.
Each text is just over 2,000 words long and the number of texts in
each category varies.
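As a rough consistency check on the LOB figures above, the corpus size and the text length together imply the approximate number of texts. This is a back-of-the-envelope sketch: "just over 2,000 words" is rounded down to 2,000, and the variable names are illustrative.

```python
# Figures taken from the slide; "just over 2,000" is approximated as 2,000.
TOTAL_WORDS = 1_000_000
WORDS_PER_TEXT = 2_000

approx_texts = TOTAL_WORDS // WORDS_PER_TEXT
print(f"Approximate number of texts: {approx_texts}")
```

The result is about 500 texts, the same order as the 500 samples reported for the Brown corpus.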
Example of Corpus (Contd.)
1 Brown corpus
2 Lancaster-Oslo-Bergen (LOB) Corpus
3 British National Corpus (BNC)
A 100-million-word text corpus of samples of written and spoken
English from a wide range of sources.
The corpus covers British English of the late 20th century from a
wide variety of genres.
Corpus Annotation and Markup
Markup Schemes
A – Author
Value – William Shakespeare
2 Markup Languages - XML, SGML
3 JSON format
{ "AUTHOR": "WILLIAM SHAKESPEARE" }
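The same author metadata can be written in more than one markup scheme. The sketch below uses only the Python standard library to serialize the attribute-value pair as JSON and as an XML attribute. The element name `text` and the attribute name `author` are illustrative assumptions, not part of any particular corpus standard.

```python
import json
import xml.etree.ElementTree as ET

# The attribute-value pair from the slide.
metadata = {"AUTHOR": "WILLIAM SHAKESPEARE"}

# JSON: the pair becomes a key and a string value.
json_text = json.dumps(metadata)

# XML: the pair becomes an attribute on an element.
# The element name "text" and attribute name "author" are hypothetical.
root = ET.Element("text", attrib={"author": metadata["AUTHOR"]})
xml_text = ET.tostring(root, encoding="unicode")

# Reading the value back from each representation.
author_from_json = json.loads(json_text)["AUTHOR"]
author_from_xml = ET.fromstring(xml_text).get("author")

print(json_text)
print(xml_text)
```

Both representations carry the same information; the choice mostly depends on the tools that will read the corpus.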
Why Annotate ?
Type of Corpus Annotation
Part-of-speech (POS)
Named Entity Recognition
Syntactical (parsing)
Lexical Parsing
Semantic (Domain Classification)
Co-reference resolution
Event Detection
Sentiment Analysis
Language Identification
Fake News Detection
CLAWS part-of-speech tagger for English
CLAWS - POS Tagging - TAG sets examples
1 Noun
NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)
2 Pronoun
PNI indefinite pronoun (e.g. NONE, EVERYTHING)
PNP personal pronoun (e.g. YOU, THEM, OURS)
PNQ wh-pronoun (e.g. WHO, WHOEVER)
PNX reflexive pronoun (e.g. ITSELF, OURSELVES)
CLAWS - POS Tagging - TAG sets examples
3 Verb
VBB the "base forms" of the verb "BE" (except the infinitive), i.e.
AM, ARE
VBD past form of the verb "BE", i.e. WAS, WERE
VBG -ing form of the verb "BE", i.e. BEING
VBI infinitive of the verb "BE"
VBN past participle of the verb "BE", i.e. BEEN
VBZ -s form of the verb "BE", i.e. IS, ’S
VDB base form of the verb "DO" (except the infinitive), i.e. DO
VDD past form of the verb "DO", i.e. DID
VDG -ing form of the verb "DO", i.e. DOING
VDI infinitive of the verb "DO"
VDN past participle of the verb "DO", i.e. DONE
VDZ -s form of the verb "DO", i.e. DOES
CLAWS - POS Tagging - TAG sets examples
3 Verb
VHB base form of the verb "HAVE" (except the infinitive), i.e. HAVE
VHD past tense form of the verb "HAVE", i.e. HAD, ’D
VHG -ing form of the verb "HAVE", i.e. HAVING
VHI infinitive of the verb "HAVE"
VHN past participle of the verb "HAVE", i.e. HAD
VHZ -s form of the verb "HAVE", i.e. HAS, ’S
VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, ’LL)
VVB base form of lexical verb (except the infinitive)(e.g. TAKE,
LIVE)
VVD past tense form of lexical verb (e.g. TOOK, LIVED)
VVG -ing form of lexical verb (e.g. TAKING, LIVING)
VVI infinitive of lexical verb
VVN past participle form of lex. verb (e.g. TAKEN, LIVED)
VVZ -s form of lexical verb (e.g. TAKES, LIVES)
CLAWS - POS Tagging - TAG sets examples
4 Adjective
AJ0 adjective (unmarked) (e.g. GOOD, OLD)
AJC comparative adjective (e.g. BETTER, OLDER)
AJS superlative adjective (e.g. BEST, OLDEST)
5 Adverb
AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER,
FURTHEST)
AVP adverb particle (e.g. UP, OFF, OUT)
AVQ wh-adverb (e.g. WHEN, HOW, WHY)
6 Article
AT0 article (e.g. THE, A, AN)
More detailed tag sets for POS tagging can be found in the
CLAWS5 and CLAWS7 tagset listings.
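The tag names above follow a visible pattern: the leading letters indicate the word class and the remaining characters refine it. A minimal sketch of a lookup table built from a few of the tags listed on these slides is shown below. The names `TAG_DESCRIPTIONS`, `describe` and `coarse_class` are hypothetical helpers, and the prefix rules cover only the classes shown here, not the full tag set.

```python
# A small subset of the tags listed on the slides, with their descriptions.
TAG_DESCRIPTIONS = {
    "NN0": "noun (neutral for number)",
    "NN1": "singular noun",
    "NN2": "plural noun",
    "NP0": "proper noun",
    "PNP": "personal pronoun",
    "VVG": "-ing form of lexical verb",
    "AJ0": "adjective (unmarked)",
    "AV0": "adverb (unmarked)",
    "AT0": "article",
}

# Prefix rules for the six classes on the slides; two-letter prefixes come first.
PREFIX_TO_CLASS = [
    ("PN", "pronoun"),
    ("AJ", "adjective"),
    ("AV", "adverb"),
    ("AT", "article"),
    ("N", "noun"),
    ("V", "verb"),
]


def describe(tag):
    """Return the description of a tag, or a placeholder if it is not in the table."""
    return TAG_DESCRIPTIONS.get(tag, "unknown tag")


def coarse_class(tag):
    """Map a fine-grained tag to one of the coarse classes via its prefix."""
    for prefix, word_class in PREFIX_TO_CLASS:
        if tag.startswith(prefix):
            return word_class
    return "other"


for tag in ["NN1", "VVG", "PNP", "AT0"]:
    print(tag, "->", coarse_class(tag), "|", describe(tag))
```

Tags outside the six classes shown on the slides fall through to `"other"` in this sketch.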
CLAWS POS tagging software
Link : http://ucrel-api.lancaster.ac.uk/claws/free.html
Input: I am reading a book.
Output: I_PNP am_VBB reading_VVG a_AT0 book_SENT
._PUN
Input: Labeled dataset is required for supervised Machine Learning
task.
Output: Labeled_AJ0 dataset_NN1 is_VBZ required_VVN
for_PRP supervised_AJ0 Machine_NN1 Learning_NN1 task_SENT
._PUN
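The tagger output above uses a `word_TAG` format with items separated by whitespace. A minimal sketch of turning such a line into (word, tag) pairs follows. The function name `parse_tagged` is hypothetical, and the input string is the first output line copied verbatim from the slide.

```python
def parse_tagged(line):
    """Split a 'word_TAG word_TAG ...' line into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so words that contain "_" stay intact.
        word, separator, tag = item.rpartition("_")
        if not separator:
            raise ValueError(f"item has no tag: {item!r}")
        pairs.append((word, tag))
    return pairs


# First output line from the slide, copied as-is.
tagged_line = "I_PNP am_VBB reading_VVG a_AT0 book_SENT ._PUN"
pairs = parse_tagged(tagged_line)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Splitting on the last underscore rather than the first is a design choice: it keeps the tag separable even when the word itself contains an underscore.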
Dependency Parsing
Dependency Parsing - Tags
Dependency Parsing - More examples
Semantic Annotation
Semantic Annotation
The verb move has 16 senses (first 4 from tagged texts)
1 (130) travel, go, move, locomote – (change location; move, travel,
or proceed; "How fast does your new car go?"; "We travelled from
Rome to Naples by bus"; "The policemen went from door to door
looking for the suspect")
2 (60) move, displace – (cause to move, both in a concrete and in an
abstract sense; "Move those boxes into the corner, please"; "I’m
moving my money to another bank"; "The director moved more
responsibilities onto his new assistant")
3 (52) move – (move so as to change position, perform a
nontranslational motion; "He moved his hand slightly to the right")
4 (20) move – (change residence, affiliation, or place of employment;
"We moved from Idaho to Nebraska"; "The basketball player moved
from one team to another")
Semantic Annotation
The noun move has 5 senses (first 5 from tagged texts)
1 (377) move – (the act of deciding to do something; "he didn’t make
a move to help"; "his first move was to hire a lawyer")
2 (70) move, relocation – (the act of changing your residence or place
of business; "they say that three moves equal one fire")
3 (57) motion, movement, move, motility – (a change of position that
does not entail a change of location; "the reflex motion of his
eyebrows revealed his surprise"; "movement is a sign of life"; "an
impatient move of his hand"; "gastrointestinal motility")
4 (30) motion, movement, move – (the act of changing location from
one place to another; "police controlled the motion of the crowd";
"the movement of people from the farms to the cities")
5 (5) move – ((game) a player’s turn to take some action permitted by
the rules of the game)
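The sense listings above can be stored as simple records. The sketch below assumes that the number in parentheses before each sense is its count of occurrences in sense-tagged texts. The `Sense` class and the `most_frequent_sense` function are hypothetical names, and picking the most frequent sense is a common baseline shown only for illustration, not a method taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    lemma: str
    pos: str
    synonyms: tuple
    tagged_count: int  # assumed to be the parenthesised count shown on the slides


# Senses of "move" as listed on the slides (glosses omitted for brevity).
SENSES = [
    Sense("move", "verb", ("travel", "go", "move", "locomote"), 130),
    Sense("move", "verb", ("move", "displace"), 60),
    Sense("move", "verb", ("move",), 52),
    Sense("move", "verb", ("move",), 20),
    Sense("move", "noun", ("move",), 377),
    Sense("move", "noun", ("move", "relocation"), 70),
    Sense("move", "noun", ("motion", "movement", "move", "motility"), 57),
    Sense("move", "noun", ("motion", "movement", "move"), 30),
    Sense("move", "noun", ("move",), 5),
]


def most_frequent_sense(lemma, pos):
    """Return the sense with the highest tagged count, or None if none match."""
    candidates = [s for s in SENSES if s.lemma == lemma and s.pos == pos]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.tagged_count)


for pos in ("verb", "noun"):
    best = most_frequent_sense("move", pos)
    print(pos, "->", ", ".join(best.synonyms), f"({best.tagged_count})")
```

A semantic annotator chooses among these senses using context; the frequency counts only give a default when context is not considered.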
Resources needed for Annotation
A collection of documents.
Annotation guidelines
Tool to perform annotation.
Next Class