NLP Vi6
Concept of Coherence;
Coherence and discourse structure are interconnected in many ways. Coherence, as a property of good text, is used to evaluate the output quality of natural language generation systems. The question that arises here is: what does it mean for a text to be coherent? Suppose we collected one sentence from every page of a newspaper; would the result be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties.
Discourse structure;
An important question regarding discourse is what kind of structure a discourse must have. The answer depends upon the segmentation applied to the discourse. Discourse segmentation may be defined as determining the types of structures for a large discourse. Discourse segmentation is quite difficult to implement, but it is very important for applications such as information retrieval, text summarization, and information extraction.
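Algorithms such as TextTiling approach discourse segmentation by looking for drops in lexical cohesion between adjacent stretches of text. The sketch below is a simplified, illustrative version of that idea; the function names and the threshold value are our own choices, not part of any library.

```python
# A minimal sketch of lexical-cohesion-based discourse segmentation,
# in the spirit of TextTiling. All names and parameters are illustrative.

def block_similarity(block_a, block_b):
    """Word-overlap (Jaccard) similarity between two blocks of sentences."""
    words_a = set(w.lower() for s in block_a for w in s.split())
    words_b = set(w.lower() for s in block_b for w in s.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def segment(sentences, window=2, threshold=0.1):
    """Place a boundary wherever adjacent windows share little vocabulary."""
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = sentences[i - window:i]
        right = sentences[i:i + window]
        if block_similarity(left, right) < threshold:
            boundaries.append(i)
    return boundaries

sentences = [
    "Ram went to the bank to deposit money.",
    "The bank manager greeted Ram warmly.",
    "Cricket is a popular sport in India.",
    "The cricket team won the match easily.",
]
# The topic shifts after sentence 2, so a boundary is reported there.
print(segment(sentences))
```

Real implementations (e.g. NLTK's TextTilingTokenizer) refine this with pseudosentences, smoothing, and depth scoring, but the core signal is the same vocabulary-overlap dip shown here.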
Text Coherence;
Lexical repetition is one way to find structure in a discourse, but it does not by itself satisfy the requirement of coherence. To achieve coherent discourse, we must focus specifically on coherence relations. A coherence relation defines a possible connection between utterances in a discourse. Hobbs proposed the following such relations −
We take two terms, S0 and S1, to represent the meanings of the two related sentences −
Result;
It is inferred that the state asserted by S0 could cause the state asserted by S1. For example, these two statements show the Result relation: Ram was caught in the fire. His skin burned.
Explanation;
It is inferred that the state asserted by S1 could cause the state asserted by S0. For example, these two statements show the Explanation relation: Ram fought with Shyam's friend. He was drunk.
Parallel;
It infers p(a1,a2,…) from the assertion of S0 and p(b1,b2,…) from the assertion of S1, where ai and bi are similar for all i. For example, these two statements are parallel: Ram wanted a car. Shyam wanted money.
Elaboration;
It infers the same proposition P from both assertions, S0 and S1. For example, these two statements show the Elaboration relation: Ram was from Chandigarh. Shyam was from Kerala.
Occasion;
It occurs when a change of state can be inferred from the assertion of S0, whose final state can be inferred from S1, and vice versa. For example, these two statements show the Occasion relation: Ram picked up the book. He gave it to Shyam.
Reference Resolution;
Interpreting the sentences of a discourse is another important task, and to achieve it we need to know who or what entity is being talked about. Here, interpreting references is the key element. A reference may be defined as a linguistic expression used to denote an entity or individual. For example, in the passage "Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him," the linguistic expressions Ram, his, and he are references.
On the same note, reference resolution may be defined as the task of determining which entities are referred to by which linguistic expressions.
Referent − The entity that is referred to. For example, in the passage above, Ram is a referent.
Corefer − When two expressions are used to refer to the same entity, they are called corefers. For example, Ram and he are corefers.
Antecedent − The term that licenses the use of another term. For example, Ram is the antecedent of the reference he.
Anaphora & Anaphoric − Anaphora is a reference to an entity that has been previously introduced into the discourse, and the referring expression is called anaphoric.
Discourse model − The model that contains representations of the entities that have been referred to in the discourse and the relationships they are engaged in.
Let us now see the different types of referring expressions. The five types of referring expressions are described below −
Pronouns;
A pronoun is a form of definite reference. For example: Ram laughed as loud as he could. The word he is a pronominal referring expression.
Demonstratives;
Demonstratives behave differently than simple definite pronouns. For example, this and that are demonstrative pronouns.
Names;
A name is the simplest type of referring expression. It can be the name of a person, an organization, or a location. For example, in the examples above, Ram is a name referring expression.
Stemming;
There are several different algorithms for stemming, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the most widely used algorithm; it is based on a set of heuristics for removing common suffixes from words. The Snowball stemmer is a more advanced algorithm based on the Porter stemmer, and it also supports several languages in addition to English. The Lancaster stemmer is more aggressive and less accurate than the Porter and Snowball stemmers.
Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and text summarization. However, stemming can also have negative effects: it can reduce the readability of the text, and it may not always produce the correct root form of a word.
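To make the comparison concrete, here is a short sketch that runs the three stemmers side by side using NLTK (no corpus download is needed for stemming). The example words are our own choices.

```python
# Compare the Porter, Snowball, and Lancaster stemmers from NLTK.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # Snowball also supports other languages
lancaster = LancasterStemmer()

words = ["running", "flies", "happily", "generously"]
for w in words:
    # The three stemmers often agree, but Lancaster tends to cut more
    # aggressively, which is why it can be less accurate.
    print(f"{w}: porter={porter.stem(w)} "
          f"snowball={snowball.stem(w)} lancaster={lancaster.stem(w)}")
```

Note that stems need not be dictionary words (e.g. "flies" becomes "fli" under the Porter stemmer), which is one source of the readability loss mentioned above.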
Wordnet;
Synset is the basic interface in NLTK for looking up words in WordNet. A Synset instance is a collection of synonyms that express the same idea. Some words have only one Synset, while others have multiple.
Code:
from nltk.corpus import wordnet
syn = wordnet.synsets('hello')[0]
# Word definition
print("Synset's name :", syn.name())
print("Synset's definition :", syn.definition())
Output:
Synset's name : hello.n.01
Synset's definition : an expression of greeting
Hypernyms and Hyponyms;
Code:
from nltk.corpus import wordnet
syn = wordnet.synsets('hello')[0]
print(syn.hypernyms()[0].hyponyms())
print("Synset root hypernym :", syn.root_hypernyms())
Output:
[Synset('pax.n.01'), Synset('reception.n.01'), Synset('regard.n.03'),
Synset('salute.n.02'), Synset('salute.n.03'), Synset('welcome.n.02'),
Synset('well-wishing.n.01'), ...]
Synset root hypernym : [Synset('entity.n.01')]
WordNet distinguishes four main parts of speech, returned by a Synset's pos() method: n (noun), v (verb), a (adjective), and r (adverb).
FrameNet;
The FrameNet corpus is a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: the meanings of most words can best be understood on the basis of a semantic frame, a description of a type of event, relation, or entity and the participants in it. For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container), and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food, Heating_instrument, and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. The job of FrameNet is to define the frames and to annotate sentences to show how the FEs fit syntactically around the word that evokes the frame.
For example, the "Apply_heat" frame describes a common situation involving a Cook, some Food, and a Heating_instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.
We call the roles of a frame "frame elements" (FEs), and the frame-evoking words are called "lexical units" (LUs).
FrameNet includes relations between frames. Several types of relations are defined, of which the most important are:
Inheritance: An IS-A relation. The child frame is a subtype of the parent frame,
and each FE in the parent is bound to a corresponding FE in the child. An example
is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
Using: The child frame presupposes the parent frame as background, e.g. the "Speed" frame "uses" (or presupposes) the "Motion" frame; however, not all parent FEs need to be bound to child FEs.
Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the "Criminal_process" frame has subframes of "Arrest", "Arraignment", "Trial", and "Sentencing".
To get a list of all of the Frames in FrameNet, you can use the frames() function.
If you supply a regular expression pattern to the frames() function, you will get a
list of all Frames whose names match that pattern:
>>> from pprint import pprint
>>> from operator import itemgetter
>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> x = fn.frames(r'(?i)crim')
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>, <frame ID=500 name=Criminal_investigation>, ...]
>>> PrettyList(sorted(x, key=itemgetter('ID')))
[<frame ID=200 name=Criminal_process>, <frame ID=500 name=Criminal_investigation>, ...]
To get the details of a particular Frame, you can use the frame() function, passing in the frame number:
>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(202)
>>> f.ID
202
>>> f.name
'Arrest'
>>> f.definition
"Authorities charge a Suspect, who is under suspicion of having committed a crime..."
>>> len(f.lexUnit)
11
>>> pprint(sorted([x for x in f.FE]))
['Authorities',
'Charges',
'Co-participant',
'Manner',
'Means',
'Offense',
'Place',
'Purpose', 'Source_of_legal_authority',
'Suspect',
'Time',
'Type']
>>> pprint(f.frameRelations)
[<Parent=Intentionally_affect -- Inheritance -> Child=Arrest>, <Complex=Criminal_process -- Subframe -> Component=Arrest>, ...]
The frame() function shown above returns a dict object containing detailed information about the Frame. See the documentation on the frame() function for the specifics.
You can also search for Frames by their Lexical Units (LUs). The frames_by_lemma() function returns a list of all frames that contain LUs in which the 'name' attribute of the LU matches the given regular expression. Note that LU names are composed of "lemma.POS", where the "lemma" part can be made up of either a single lexeme (e.g. 'run') or multiple lexemes (e.g. 'a little') (see below).
Lexical Units;
A lexical unit (LU) is a pairing of a word with a meaning. For example, the
“Apply_heat” Frame describes a common situation involving a Cook, some Food, and a
Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil,
brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat
frame. Each sense of a polysemous word is a different LU.
We have used the word "word" in talking about LUs. The reality is actually rather complex. When we say that the word "bake" is polysemous, we mean that the lemma "bake.v" (which has the word-forms "bake", "bakes", "baked", and "baking") is linked to three different frames.
Multiword expressions such as “given name” and hyphenated words like “shut-eye” can
also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)”
are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”,
respectively), and their internal structure is not analyzed.
FrameNet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame.
This makes the FrameNet database similar to a thesaurus, grouping together
semantically similar words.
In the simplest case, frame-evoking words are verbs, such as "fried".
Sometimes event nouns may evoke a frame; for example, "reduction" evokes "Cause_change_of_scalar_position".
Adjectives may also evoke a frame; for example, "asleep" may evoke the "Sleep" frame.
Many common nouns, such as artifacts like “hat” or “tower”, typically serve as
dependents rather than clearly evoking their own frames.
Details for a specific lexical unit can be obtained using this class's lus() function, which takes an optional regular expression pattern that will be matched against the name of the lexical unit:
>>> from pprint import pprint
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')))
[<lu ID=14733 name=a little.n>, <lu ID=14743 name=a little.adv>, ...]
You can obtain detailed information on a particular LU by calling the lu() function
and passing in an LU’s ‘ID’ number:
>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> fn.lu(256).lexemes[0].name
'foresee'
Note that LU names take the form of a dotted string (e.g. “run.v” or “a
little.adv”) in which a lemma precedes the “.” and a part of speech (POS) follows
the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple
lexemes (e.g. “a little”). The list of POSs used in the LUs is:
v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction
For more detailed information about what is contained in the dict returned by the lu() function, see the documentation on the lu() function.
Annotated Documents;
The FrameNet corpus contains a small set of annotated documents. A list of these documents can be obtained by calling the docs() function:
>>> from pprint import pprint
>>> from nltk.corpus import framenet as fn
>>> d = fn.docs('BellRinging')[0]
>>> d.corpname
'PropBank'
>>> d.sentence[49]
full-text sentence (...) in BellRinging:
[POS] 17 tags
[POS_tagset] PENN
[text] + [annotationSet]
Brown Corpus;
The corpus consists of one million words of American English texts printed in 1961. The texts for the corpus were sampled from 15 different text categories to make the corpus a good standard reference. Today, this corpus is considered small and slightly dated. The corpus is, however, still used. Much of its usefulness lies in the fact that the Brown corpus layout has been copied by other corpus compilers. The LOB corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown corpus. The availability of corpora which are so similar in structure is a valuable resource for researchers interested in comparing different language varieties, for example.
For a long time, the Brown and LOB corpora were almost the only easily available computer-readable corpora. Much research within the field of corpus linguistics has therefore been made using these data. By studying the same data from different angles, in different kinds of studies, researchers can compare their findings without having to take into consideration possible variation caused by the use of different data.
More comprehensive information about the Brown corpus can be found in the Brown
Corpus Manual (external link).
BNC;
The BNC is distributed in a format which makes possible almost any kind of computer-based research on the nature of the language. Obvious application areas include lexicography, natural language understanding systems, and all branches of applied and theoretical linguistics.
Large language corpora can help provide answers for these kinds of questions -- if
only because they encourage linguists, lexicographers, and all who work with
language to ask them. The purpose of a language corpus is to provide language
workers with evidence of how language is really used, evidence that can then be
used to inform and substantiate individual theories about what words might or
should mean. Traditional grammars and dictionaries tell us what a word ought to
mean, but only experience can tell us what a word is used to mean. This is why
dictionary publishers, grammar writers, language teachers, and developers of
natural language processing software alike have been turning to corpus evidence as
a means of extending and organizing that experience.