
UNIT VI

DISCOURSE ANALYSIS AND LEXICAL RESOURCES

Discourse segmentation, Coherence – Reference Phenomena, Anaphora Resolution using Hobbs and Centering Algorithm – Coreference Resolution – Resources: Porter Stemmer, Lemmatizer, Penn Treebank, Brill's Tagger, WordNet, PropBank, FrameNet, Brown Corpus, British National Corpus (BNC)

Natural language processing is among the most difficult problems in artificial intelligence. One of its major sub-problems is discourse processing − building theories and models of how utterances stick together to form coherent discourse. Language does not normally consist of isolated, unrelated sentences; it consists of collocated, structured and coherent groups of sentences. These coherent groups of sentences are referred to as discourse.

Concept of Coherence:
Coherence and discourse structure are interconnected in many ways. Coherence, as a property of good text, is used to evaluate the output quality of natural language generation systems. What does it mean for a text to be coherent? Suppose we collected one sentence from every page of a newspaper; would the result be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties.

Coherence relation between utterances:

A discourse is coherent if it has meaningful connections between its utterances. This property is called a coherence relation. For example, some sort of explanation must exist to justify the connection between utterances.

Relationship between entities:

Another property that makes a discourse coherent is a certain kind of relationship between the entities it mentions. This kind of coherence is called entity-based coherence.

Discourse structure:
An important question regarding discourse is what kind of structure a discourse must have. The answer depends on the segmentation applied to the discourse. Discourse segmentation may be defined as determining the types of structures for a large discourse. Discourse segmentation is quite difficult to implement, but it is very important for applications such as information retrieval, text summarization and information extraction.

Algorithms for Discourse Segmentation:

In this section, we will learn about algorithms for discourse segmentation. They are described below.

Unsupervised Discourse Segmentation:

The class of unsupervised discourse segmentation is often framed as linear segmentation. The task of linear segmentation can be understood with the help of an example: segmenting a text into multi-paragraph units, where each unit represents a passage of the original text. These algorithms depend on cohesion, which may be defined as the use of certain linguistic devices to tie the textual units together. In particular, lexical cohesion is the cohesion indicated by relationships between words in two units, such as the use of synonyms. A minimal sketch using NLTK's TextTiling implementation follows.
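The sketch below shows unsupervised linear segmentation with NLTK's TextTilingTokenizer, an implementation of Hearst's TextTiling algorithm, which exploits lexical cohesion between adjacent blocks of text. The input file name is a hypothetical placeholder.

import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download('stopwords')  # TextTiling uses a stopword list internally

# TextTiling expects a fairly long text in which paragraphs are
# separated by blank lines; very short inputs can raise an error.
text = open('long_document.txt').read()  # hypothetical input file

tt = TextTilingTokenizer()
segments = tt.tokenize(text)  # list of multi-paragraph segments
for i, seg in enumerate(segments, start=1):
    print("--- Segment", i, "---")
    print(seg[:200])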

Supervised Discourse Segmentation:

The earlier method does not use any hand-labeled segment boundaries. Supervised discourse segmentation, on the other hand, needs boundary-labeled training data, which is not easy to acquire. In supervised discourse segmentation, discourse markers or cue words play an important role. A discourse marker or cue word is a word or phrase that functions to signal discourse structure. These discourse markers tend to be domain-specific. The toy example below illustrates the idea.
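This is only a toy illustration of how cue words can signal segment boundaries, not a trained system; in practice one would train a classifier over boundary-labeled data. The cue list here is invented for illustration.

CUE_PHRASES = ("however", "finally", "first", "next", "in conclusion",
               "on the other hand")  # illustrative; real cue lists are domain-specific

def starts_new_segment(sentence):
    """Flag a sentence as a likely segment boundary if it opens with a cue phrase."""
    s = sentence.lower().lstrip()
    return s.startswith(CUE_PHRASES)

sentences = [
    "The experiment measured reaction times.",
    "However, the second study used a different design.",
    "Participants were older on average.",
    "Finally, we summarize the findings.",
]
for sent in sentences:
    prefix = "<BOUNDARY> " if starts_new_segment(sent) else "           "
    print(prefix + sent)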

Text Coherence:
Lexical repetition is one way to find structure in a discourse, but it does not by itself satisfy the requirement of coherent discourse. To achieve coherent discourse, we must focus on coherence relations specifically. As noted above, a coherence relation defines the possible connection between utterances in a discourse. Hobbs proposed relations of the following kinds, where the terms S0 and S1 represent the meanings of the two related sentences −

Result:
It infers that the state asserted by S0 could cause the state asserted by S1. For example, two statements show the relationship Result: Ram was caught in the fire. His skin burned.

Explanation:
It infers that the state asserted by S1 could cause the state asserted by S0. For example, two statements show the relationship Explanation: Ram fought with Shyam's friend. He was drunk.

Parallel:
It infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1, where ai and bi are similar for all i. For example, two statements are parallel: Ram wanted a car. Shyam wanted money.

Elaboration:
It infers the same proposition P from both assertions S0 and S1. For example, two statements show the relation Elaboration: Ram was from Chandigarh. He came from the capital of Punjab and Haryana.

Occasion:
It holds when a change of state can be inferred from the assertion of S0, whose final state can be inferred from S1, and vice versa. For example, the two statements show the relation Occasion: Ram picked up the book. He gave it to Shyam.

Building Hierarchical Discourse Structure:

The coherence of an entire discourse can also be described by a hierarchical structure between coherence relations. For example, the following passage can be represented as a hierarchical structure −

S1 − Ram went to the bank to deposit money.
S2 − He then took a train to Shyam's cloth shop.
S3 − He wanted to buy some clothes.
S4 − He did not have new clothes for the party.
S5 − He also wanted to talk to Shyam regarding his health.

[Figure: hierarchical discourse structure over S1–S5]
Reference Resolution:
Interpreting the sentences of a discourse is another important task, and to achieve it we need to know who or what entity is being talked about. Here, reference is the key element. Reference may be defined as a linguistic expression used to denote an entity or individual. For example, in the passage "Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him," the linguistic expressions Ram, his and He are references.

On the same note, reference resolution may be defined as the task of determining which entities are referred to by which linguistic expressions.

Terminology Used in Reference Resolution:

We use the following terminology in reference resolution −

Referring expression − A natural language expression that is used to perform reference is called a referring expression. For example, in the passage above, Ram, his and He are referring expressions.

Referent − The entity that is referred to. For example, in the example above, Ram is a referent.

Corefer − When two expressions are used to refer to the same entity, they are said to corefer. For example, Ram and He are corefers.

Antecedent − The term that licenses the use of another term. For example, Ram is the antecedent of the reference He.

Anaphora & Anaphoric − Anaphora may be defined as reference to an entity that has been previously introduced into the discourse. The referring expression is called anaphoric.

Discourse model − The model that contains the representations of the entities that have been referred to in the discourse and the relationships in which they are engaged.

Types of Referring Expressions:

Let us now see the different types of referring expressions. The five types of referring expressions are described below −

Indefinite Noun Phrases:

This kind of reference introduces entities that are new to the hearer into the discourse context. For example, in the sentence Ram had gone around one day to bring him some food, the phrase some food is an indefinite reference.

Definite Noun Phrases:

In contrast, this kind of reference denotes entities that are not new, because they are identifiable to the hearer in the discourse context. For example, in the sentence I used to read The Times of India, the phrase The Times of India is a definite reference.

Pronouns:
A pronoun is a form of definite reference. For example, in Ram laughed as loud as he could, the word he is a pronominal referring expression.

Demonstratives:
Demonstratives point to entities and behave differently from simple definite pronouns. For example, this and that are demonstrative pronouns.

Names:
Names are the simplest type of referring expression. A name can denote a person, an organization or a location. For example, in the examples above, Ram is a name referring expression.

Reference Resolution Tasks:

The two reference resolution tasks are described below.

Coreference Resolution:
This is the task of finding referring expressions in a text that refer to the same entity − in simple words, the task of finding coreferring expressions. A set of coreferring expressions is called a coreference chain. For example, Ram, the manager of ABC bank, his and He form a coreference chain in the first passage given as an example.

Constraint on Coreference Resolution:

In English, a major problem for coreference resolution is the pronoun it, because it has many uses. It can refer to an entity much as he and she do, but it can also be pleonastic, referring to no specific thing at all. For example: It's raining. It is really good.

Pronominal Anaphora Resolution:

Unlike coreference resolution, pronominal anaphora resolution may be defined as the task of finding the antecedent for a single pronoun. For example, for the pronoun his in the passage above, the task of pronominal anaphora resolution is to identify the word Ram, because Ram is the antecedent. A heavily simplified sketch follows.
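The sketch below is only in the spirit of Hobbs' algorithm: it walks backwards through previously mentioned entities and returns the most recent one that agrees with the pronoun in gender and number. The real Hobbs algorithm searches parse trees, and Centering maintains additional data structures; the feature dictionaries here are invented for illustration.

# Entities mentioned before the pronoun "his" in the passage
# "Ram, the manager of ABC bank, saw his friend Shyam at a shop."
CANDIDATES = [
    ("Ram", {"gender": "m", "number": "sg"}),
]

PRONOUN_FEATURES = {
    "he":   {"gender": "m", "number": "sg"},
    "his":  {"gender": "m", "number": "sg"},
    "she":  {"gender": "f", "number": "sg"},
    "they": {"number": "pl"},
}

def resolve(pronoun):
    """Return the most recently mentioned entity that agrees with the pronoun."""
    feats = PRONOUN_FEATURES[pronoun.lower()]
    for name, efeats in reversed(CANDIDATES):
        if all(efeats.get(k) == v for k, v in feats.items()):
            return name
    return None

print(resolve("his"))  # -> 'Ram'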

Stemming:
Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words "chocolates", "chocolatey" and "choco" toward the root word "chocolate", and "retrieval", "retrieved" and "retrieves" to the stem "retrieve". Stemming is an important part of the pipeline in natural language processing. The input to the stemmer is tokenized words; tokenization breaks the document down into individual words. A short sketch of this pipeline follows.
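The sketch below runs NLTK's word tokenizer and then the Porter stemmer over the resulting tokens, as the pipeline above describes.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')  # tokenizer models (newer NLTK versions may need 'punkt_tab')

ps = PorterStemmer()
text = "The retrieved documents improved retrieval."
tokens = word_tokenize(text)
print([(t, ps.stem(t)) for t in tokens])
# 'retrieved' and 'retrieval' both reduce to the stem 'retriev'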

Stemming is a natural language processing technique used to reduce words to their base form, also known as the root form. Stemming is used to normalize text and make it easier to process. It is an important step in text pre-processing, and it is commonly used in information retrieval and text mining applications.

There are several different algorithms for stemming, including the Porter stemmer, the Snowball stemmer and the Lancaster stemmer. The Porter stemmer is the most widely used algorithm; it is based on a set of heuristics for removing common suffixes from words. The Snowball stemmer is a more advanced algorithm based on the Porter stemmer that also supports several languages other than English. The Lancaster stemmer is more aggressive and less accurate than the Porter and Snowball stemmers.

Stemming can be useful for several natural language processing tasks such as text classification, information retrieval and text summarization. However, it can also have negative effects: it may reduce the readability of the text, and it may not always produce the correct root form of a word. The sketch below compares the three stemmers.
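All three stemmers mentioned above are available in nltk.stem, so they can be compared directly on the same words; note how the Lancaster stemmer is the most aggressive.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball also supports other languages
lancaster = LancasterStemmer()

for word in ["running", "generously", "maximum", "presumably"]:
    print(word, "->", porter.stem(word), "|",
          snowball.stem(word), "|", lancaster.stem(word))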

It is important to note that stemming is different from lemmatization. Lemmatization also reduces a word to its base form, but unlike stemming it takes the context of the word into account and produces a valid dictionary word, whereas stemming may produce a non-word as the root form. The contrast is illustrated below.
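The sketch below contrasts the Porter stemmer with NLTK's WordNet lemmatizer: the lemmatizer returns a valid dictionary word, while the stemmer may return a truncated non-word.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lemmatizer data

ps = PorterStemmer()
lem = WordNetLemmatizer()

# The lemmatizer uses a POS hint ('v' = verb, 'a' = adjective) as context.
print(ps.stem("studies"), "vs", lem.lemmatize("studies", pos="v"))  # studi vs study
print(ps.stem("better"), "vs", lem.lemmatize("better", pos="a"))    # better vs good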

Over-stemming occurs when two words with different meanings are reduced to the same root; it can be interpreted as a false positive. Under-stemming occurs when two words that should be reduced to the same root are not; it can be interpreted as a false negative. Under-stemming is a problem that arises when a stemmer does not reduce a word far enough toward its base form. This can happen when the stemmer is not aggressive enough in removing suffixes or when it is not designed for the specific task or language.

Automatically interpreting and analysing the meaning of words and pre-processing textual input can be complex in natural language processing (NLP). We frequently employ lexicons to help with this. A lexicon, word-hoard, wordbook or word-stock is the vocabulary of a person, language or branch of knowledge. We frequently link the text in our data to a lexicon, which helps us understand the relationships between those terms.

WordNet:

WordNet is a massive lexical database of English. Nouns, verbs, adjectives and adverbs are arranged into 'synsets', collections of cognitive synonyms that each express a distinct concept. Synsets are connected by conceptual-semantic and lexical links such as hyponymy and antonymy. WordNet is similar to a thesaurus in that it groups words according to their meanings, though there are some key distinctions.

How to use WordNet:
The Natural Language Toolkit (NLTK) is a Python module for natural language processing. It includes many corpora, toy grammars, trained models and, most importantly here, WordNet. The English WordNet in the NLTK module has 155,287 words and 117,659 synonym sets.

A Synset is the basic interface in NLTK for looking up words in WordNet. Synset instances are collections of synonyms that express the same idea. Some words have only one Synset, while others have several.

from nltk.corpus import wordnet

syn = wordnet.synsets('hello')[0]

print ("Name of Synset : ", syn.name())

# Word Definition

print ("Meaning of Synset : ", syn.definition())

# a collection of phrases in which the word is used in context

print ("Synset's example : ", syn.examples())


Name of Synset : hello.n.01

Meaning of Synset : an expression of greeting

Synset's example : ['every morning they exchanged polite hellos']

Hypernyms and Hyponyms:

Hypernyms are more abstract, general terms; more specific terms are referred to as hyponyms. Synsets are organised in a structure similar to an inheritance tree, which can be traversed upward to hypernyms and downward to hyponyms. A root hypernym is found at the top of this tree. Hypernyms are a means of classifying and grouping words based on their similarity.

Code:
from nltk.corpus import wordnet

synset = wordnet.synsets('hello')[0]

print ("Synset's name : ", synset.name())

print ("Synset abstract term : ", synset.hypernyms())

print ("Specific term of Synset : ", synset.hypernyms()[0].hyponyms())

print ("Synset root hypernym : ", synset.root_hypernyms())

Output:
Synset's name : hello.n.01

Synset abstract term : [Synset('greeting.n.01')]

Specific term of Synset : [Synset('calling_card.n.02'), Synset('good_afternoon.n.01'), Synset('good_morning.n.01'), Synset('hail.n.03'), Synset('hello.n.01'), Synset('pax.n.01'), Synset('reception.n.01'), Synset('regard.n.03'), Synset('salute.n.02'), Synset('salute.n.03'), Synset('welcome.n.02'), Synset('well-wishing.n.01')]

Synset root hypernym : [Synset('entity.n.01')]

Part of Speech (POS) in Synset:

syn = wordnet.synsets('hello')[0]
print ("Syntax : ", syn.pos())
syn = wordnet.synsets('doing')[0]
print ("Syntax : ", syn.pos())
syn = wordnet.synsets('beautiful')[0]
print ("Syntax : ", syn.pos())
syn = wordnet.synsets('quickly')[0]
print ("Syntax : ", syn.pos())

Output:
Syntax : n
Syntax : v
Syntax : a
Syntax : r
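Beyond simple lookups, WordNet's hypernym hierarchy supports semantic similarity measures between synsets; the sketch below uses Wu-Palmer similarity, which is based on the depths of the two synsets and of their most specific common ancestor.

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
car = wordnet.synset('car.n.01')

print("dog-cat :", dog.wup_similarity(cat))  # high: nearby in the hierarchy
print("dog-car :", dog.wup_similarity(car))  # lower: distant concepts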

FrameNet:

The FrameNet corpus is a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: the meanings of most words can best be understood on the basis of a semantic frame − a description of a type of event, relation, or entity and the participants in it. For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food, Heating_instrument and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. The job of FrameNet is to define the frames and to annotate sentences to show how the FEs fit syntactically around the word that evokes the frame.

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the "Apply_heat" frame describes a common situation involving a Cook, some Food, and a Heating_instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame "frame elements" (FEs), and the frame-evoking words "lexical units" (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:
 Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the "Revenge" frame, which inherits from the "Rewards_and_punishments" frame.
 Using: The child frame presupposes the parent frame as background, e.g. the "Speed" frame "uses" (or presupposes) the "Motion" frame; however, not all parent FEs need to be bound to child FEs.
 Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the "Criminal_process" frame has subframes of "Arrest", "Arraignment", "Trial", and "Sentencing".
 Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the "Hiring" and "Get_a_job" frames, which perspectivize the "Employment_start" frame from the Employer's and the Employee's point of view, respectively.

To get a list of all of the Frames in FrameNet, you can use the frames() function. If you supply a regular expression pattern to the frames() function, you will get a list of all Frames whose names match that pattern:

>>> from pprint import pprint
>>> from operator import itemgetter
>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> x = fn.frames(r'(?i)crim')
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>, <frame ID=500 name=Criminal_investigation>, ...]
>>> PrettyList(sorted(x, key=itemgetter('ID')))
[<frame ID=200 name=Criminal_process>, <frame ID=500 name=Criminal_investigation>, ...]

To get the details of a particular Frame, you can use the frame() function, passing in the frame number:
>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(202)
>>> f.ID
202
>>> f.name
'Arrest'
>>> f.definition
"Authorities charge a Suspect, who is under suspicion of having committed a crime..."
>>> len(f.lexUnit)
11
>>> pprint(sorted([x for x in f.FE]))
['Authorities',
 'Charges',
 'Co-participant',
 'Manner',
 'Means',
 'Offense',
 'Place',
 'Purpose',
 'Source_of_legal_authority',
 'Suspect',
 'Time',
 'Type']
>>> pprint(f.frameRelations)
[<Parent=Intentionally_affect -- Inheritance -> Child=Arrest>, <Complex=Criminal_process -- Subframe -> Component=Arrest>, ...]

The frame() function shown above returns a dict object containing detailed information about the Frame; see the documentation on the frame() function for the specifics. You can also search for Frames by their Lexical Units (LUs). The frames_by_lemma() function returns a list of all frames that contain LUs in which the 'name' attribute of the LU matches the given regular expression. Note that LU names are composed of "lemma.POS", where the "lemma" part can be made up of either a single lexeme (e.g. 'run') or multiple lexemes (e.g. 'a little') (see below).

>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID')))


[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]

Lexical Units:

A lexical unit (LU) is a pairing of a word with a meaning. For example, the "Apply_heat" Frame describes a common situation involving a Cook, some Food, and a Heating_instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs of the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather
complex. When we say that the word “bake” is polysemous, we mean that the lemma
“bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is
linked to three different frames:

Apply_heat: “Michelle baked the potatoes for 45 minutes.”


Cooking_creation: “Michelle baked her mother a cake for her birthday.”
Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can
also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)”
are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”,
respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each
LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the
combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame.
This makes the FrameNet database similar to a thesaurus, grouping together
semantically similar words.
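Listing the LUs of a frame makes this thesaurus-like grouping visible. The sketch below assumes the NLTK FrameNet data has been installed (e.g. via nltk.download('framenet_v17')); the printed LU names are indicative.

from nltk.corpus import framenet as fn

f = fn.frame('Apply_heat')           # frame() accepts a name or a numeric ID
print(sorted(f.lexUnit.keys())[:8])  # e.g. ['bake.v', 'blanch.v', 'boil.v', ...]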

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes
“Cause_change_of_scalar_position” in:

“…the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep”
frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as
dependents rather than clearly evoking their own frames.

Details for a specific lexical unit can be obtained using this class's lus() function, which takes an optional regular expression pattern that will be matched against the name of the lexical unit:

>>> from pprint import pprint
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')))
[<lu ID=14733 name=a little.n>, <lu ID=14743 name=a little.adv>, ...]

You can obtain detailed information on a particular LU by calling the lu() function and passing in an LU's 'ID' number:

>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> fn.lu(256).lexemes[0].name
'foresee'

Note that LU names take the form of a dotted string (e.g. "run.v" or "a little.adv") in which a lemma precedes the "." and a part of speech (POS) follows it. The lemma may be composed of a single lexeme (e.g. "run") or of multiple lexemes (e.g. "a little"). The POS tags used in LU names are:

v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - number
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction

For more detail about the dict returned by the lu() function, see its documentation.

Annotated Documents:
The FrameNet corpus contains a small set of annotated documents. A list of these documents can be obtained by calling the docs() function:

>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> d = fn.docs('BellRinging')[0]
>>> d.corpname
'PropBank'
>>> d.sentence[49]
full-text sentence (...) in BellRinging:

[POS] 17 tags
[POS_tagset] PENN
[text] + [annotationSet]

[Pretty-printed output: the sentence "I live in hopes that the ringers themselves will be drawn into that fuller life." is displayed with its target words aligned under markers for the frames they evoke: Desir=Desiring, Cause_t=Cause_to_make_noise, Cause=Cause_motion, Comple=Completeness.]

>>> d.sentence[49].annotationSet[1]
annotation set (...):

[status] MANUAL
[LU] (6605) hope.n in Desiring
[frame] (366) Desiring
[GF] 2 relations
[PT] 2 phrases
[text] + [Target] + [FE] + [Noun]

[Pretty-printed output: the same sentence annotated for the Desiring frame, with "hopes" as the Target and its frame elements marked, e.g. E=Experiencer, Event, and su=supp for support words.]

Brown Corpus:

The corpus consists of one million words of American English texts printed in 1961. The texts for the corpus were sampled from 15 different text categories to make the corpus a good standard reference. Today this corpus is considered small and slightly dated, but it is still used. Much of its usefulness lies in the fact that the Brown corpus layout has been copied by other corpus compilers: the LOB corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown corpus. The availability of corpora which are so similar in structure is a valuable resource for researchers interested in comparing different language varieties, for example.

For a long time, the Brown and LOB corpora were almost the only easily available computer-readable corpora, so much research within the field of corpus linguistics has been made using these data. By studying the same data from different angles, in different kinds of studies, researchers can compare their findings without having to take into consideration possible variation caused by the use of different data.

At the University of Freiburg, Germany, researchers are compiling new versions of the LOB and Brown corpora with texts from 1991. This will undoubtedly be a valuable resource for studies of language change in a near-diachronic perspective. The Brown corpus consists of 500 texts, each of just over 2,000 words, sampled from the 15 text categories; the number of texts per category varies.

More comprehensive information about the Brown corpus can be found in the Brown Corpus Manual.
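The Brown corpus ships with NLTK, so its categories and word counts can be inspected directly, as sketched below.

import nltk
from nltk.corpus import brown

nltk.download('brown')

print(brown.categories())                   # the 15 text categories
print(len(brown.words()))                   # roughly one million word tokens
print(brown.words(categories='news')[:10])  # first tokens of the 'news' texts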

British National Corpus (BNC):

The BNC is distributed in a format which makes possible almost any kind of computer-based research on the nature of the language. Obvious application areas include lexicography, natural language processing (NLP) systems, and all branches of applied and theoretical linguistics.

Uses of the BNC:

Large language corpora can help provide answers to questions about how language is actually used, if only because they encourage linguists, lexicographers, and all who work with language to ask them. The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. Traditional grammars and dictionaries tell us what a word ought to mean, but only experience can tell us what a word is used to mean. This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience.
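NLTK provides a reader for a locally installed copy of the BNC; the corpus itself must be obtained separately, and the path below is a hypothetical location for the downloaded BNC XML edition.

from nltk.corpus.reader.bnc import BNCCorpusReader

# '/data/BNC/Texts' is a hypothetical path to a local copy of the BNC XML edition.
bnc = BNCCorpusReader(root='/data/BNC/Texts',
                      fileids=r'[A-K]/\w*/\w*\.xml')
print(bnc.words()[:20])        # plain word tokens
print(bnc.tagged_words()[:5])  # tokens with their CLAWS POS tags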
