
Natural Language Processing

Some screenshots are taken from the NLP book by Jurafsky

— Used only for educational purposes

NLP © Sakthi Balan M


Lexical Semantics



“How many legs does a dog have if you call its tail a leg?

Four.

Calling a tail a leg doesn’t make it one.”


– Abraham Lincoln

Lexical Semantics
• We introduce a richer model of the semantics of words, drawing on the linguistic study of
word meaning, a field called lexical semantics

• Lexeme: the different wordforms of a lexeme share a common core meaning, even though they may be spelt (orthography) or pronounced (phonetics) differently

• Lexicon: a finite list of lexemes

• Lemma: A lemma or citation form is the grammatical form that is used to represent a
lexeme.

• For example: The lemma or citation form for sing, sang, sung is sing

• The specific forms sung or carpets or sing are also called wordforms



Lexical Semantics
• The process of mapping from a wordform to a lemma is called lemmatization

• Lemmatization is not always deterministic. It depends on the context

• For example, the wordform found can map to the lemma find (meaning ‘to
locate’) or the lemma found (‘to create an institution’), as illustrated in the
following WSJ examples:

• He has looked at 14 baseball and football stadiums and found that only one
– private Dodger Stadium – brought more money into a city than it took out

• Culturally speaking, this city has increasingly displayed its determination to found the sort of institutions that attract the esteem of Eastern urbanites



Lexical Semantics

• Note: Lemmas are part-of-speech specific; thus the wordform tables has two possible lemmas, the noun table and the verb table

• In general lemmas may be larger than morphological stems (e.g., New York or throw up)

• A lemma is not necessarily the same as the stem from the morphological
parse (Celebration is a lemma but celebrate is the stem for celebration)

• Here when we refer to the meaning (or meanings) of a ‘word’, we will generally be referring to a lemma



Word Sense
• Word Senses: the meaning of a lemma varies with respect to the context. For example:

• Instead, a bank can hold the investments in a custodial account in the client’s name — sense 1

• But as agriculture burgeons on the east bank, the river will shrink even more — sense 2

• While some banks give blood only to the needy as a service, others may do it as a business — sense 3

• The bank is on the corner of Nassau and Witherspoon — sense 4

• Sense 1 and sense 2: homonyms (homonymy)

• Sense 1 and sense 3: polysemy

• Sense 4: metonymy (compare: “I really love Jurafsky”, meaning the book rather than the person)



Word Sense

• More examples with the verb serve:

• They rarely serve red meat, preferring to prepare seafood, poultry or game birds.

• He served as U.S. ambassador to Norway in 1976 and 1977.

• He might have served his time, come out and led an upstanding
life.

• Three senses of Serve


Word Sense

• One way to determine whether two senses are distinct is to conjoin two uses of a word in a single sentence; this kind of conjunction of antagonistic readings is called zeugma. Consider the following examples:

• Which of those flights serve breakfast?

• Does Midwest Express serve Philadelphia?

• Does Midwest Express serve breakfast and Philadelphia? (odd: zeugma, so these are two distinct senses)

• Does Midwest Express serve breakfast and lunch? (fine: a single sense)


Word Sense

(Screenshot: American Heritage Dictionary definitions)

Definitions are circular in nature!



Word Sense

• Word sense relations are embodied in on-line databases like WordNet



Relations Between Senses

• Synonymy:

• When the meanings of two senses of two different words (lemmas) are identical or nearly identical, we call the words synonyms

• Example: couch/sofa, vomit/throw up, car/automobile

• More formal definition: two words are synonymous if they are substitutable one for the other in any sentence without changing the truth conditions of the sentence



Relations Between Senses

• Two words may be synonymous but still not have an identical meaning:

• I came home by automobile / car

• I am thirsty, give me H2O / water

• In practice the word synonym is therefore commonly used to describe a relationship of approximate or rough synonymy



Relations Between Senses
• Consider the following ATIS sentences, since we could swap big and
large in either sentence and retain the same meaning:

• How big is that plane?

• Would I be flying on a large or small plane?

• But consider the following WSJ sentence where we cannot substitute large for big:

• Miss Nelson, for instance, became a kind of big sister to Benjamin

• Miss Nelson, for instance, became a kind of large sister to Benjamin (odd: large cannot substitute here)


Relations Between Senses
• Antonyms are words with opposite meanings

• long / short, big / little, fast / slow, cold / hot, dark / light, rise / fall, up / down, in / out

• Two senses can be antonyms if they define a binary opposition, or are at opposite ends
of some scale. This is the case for long/short, fast/slow, or big/little, which are at
opposite ends of the length or size scale.

• Another group of antonyms is reversives, which describe some sort of change or movement in opposite directions, such as rise/fall or up/down.

• From one perspective, antonyms have very different meanings, since they are opposite.

• From another perspective, they have very similar meanings, since they share almost all
aspects of their meaning except their position on a scale, or their direction



Relations Between Senses

• Hyponym: one sense is a hyponym of another if the first sense is more specific than the second (denotes a subclass of it)

• For example: car is a hyponym of vehicle; dog is a hyponym of animal, and mango is
a hyponym of fruit

• Hypernym: We say that vehicle is a hypernym of car, and animal is a hypernym of dog.
The word superordinate is often used instead of hypernym

• The class denoted by the superordinate extensionally includes the class denoted by the hyponym

• Hypernymy can also be defined in terms of entailment. Under this definition, a sense A is
a hyponym of a sense B if everything that is A is also B
Relations Between Senses

• The term ontology usually refers to a set of distinct objects resulting from
an analysis of a domain, or microworld.

• A taxonomy is a particular arrangement of the elements of an ontology into a tree-like class inclusion structure.



Relations Between Senses
• Meronymy: the part-whole relation

• For Example: A leg is part of a chair; a wheel is part of a car

• We say that wheel is a meronym of car, and car is a holonym of wheel

• Semantic field: a model of a more integrated, or holistic, relationship among entire sets of words from a single domain

• Consider the following set of words: reservation, flight, travel, buy, price, cost, fare, rates, meal, plane

• Semantic fields of this kind are the basis of the FrameNet project (Baker et al., 1998)
WordNet
• The most commonly used resource for English sense relations is the WordNet lexical
database (Fellbaum, 1998)

• WordNet consists of three separate databases, one each for nouns and verbs, and a
third for adjectives and adverbs

• In WordNet closed class words are not included

• Each database consists of a set of lemmas, each one annotated with a set of senses

• The WordNet 3.0 release has 117,097 nouns, 11,488 verbs, 22,141 adjectives, and
4,601 adverbs

• The average noun has 1.23 senses, and the average verb has 2.16 senses

• WordNet can be accessed via the web or downloaded and accessed locally.
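Below is a minimal sketch of local access using NLTK's WordNet interface (an assumed tool choice: the slides do not prescribe one, and the nltk package plus its wordnet data must be installed):

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data once
from nltk.corpus import wordnet as wn

# List every sense (synset) of the lemma "bass" with its gloss
for synset in wn.synsets("bass"):
    print(synset.name(), ":", synset.definition())

# Synonyms for one sense come from the lemmas of its synset
print([lemma.name() for lemma in wn.synsets("bass")[0].lemmas()])
```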
WordNet

• There are eight senses for the noun and one for the adjective, each of which has a gloss (a dictionary-style definition), a list of synonyms for the sense (called a synset), and sometimes also usage examples (shown for the adjective sense)

• Unlike a dictionary, WordNet does not give the pronunciation of the word
WordNet

(Figure: noun relations in WordNet)



WordNet

(Figure: verb relations in WordNet)



Word Sense Disambiguation
• Two variants of the generic WSD task:

• In the lexical sample task: A small pre-selected set of target words is chosen, along with an
inventory of senses for each word from some lexicon. For each word, a number of corpus
instances (context sentences) can be selected and hand-labeled with the correct sense of
the target word in each. Classifier systems can then be trained using these labeled examples.
Unlabeled target words in context can then be labeled using such a trained classifier.

• Early work in word sense disambiguation focused solely on lexical sample tasks of this sort,
building word-specific algorithms for disambiguating single words like line, interest, or plant.

• In the all-words task: Systems are given entire texts and a lexicon with an inventory of senses
for each entry, and are required to disambiguate every content word in the text. The all-words
task is very similar to part-of-speech tagging, except with a much larger set of tags, since
each lemma has its own set.
Supervised Word Sense Disambiguation

• Lexical sample tasks: the line-hard-serve corpus contains 4,000 sense-tagged examples of line as a noun, hard as an adjective and serve as a verb (Leacock et al., 1993)

• The interest corpus contains 2,369 sense-tagged examples of interest as a noun (Bruce and Wiebe, 1994)

• The SENSEVAL project has also produced a number of such sense-labeled lexical sample corpora (SENSEVAL-1 with 34 words from the HECTOR lexicon and corpus (Kilgarriff and Rosenzweig, 2000; Atkins, 1993), SENSEVAL-2 and -3 with 73 and 57 target words, respectively (Palmer et al., 2001b; Kilgarriff, 2001))
• For all-word disambiguation one commonly used corpus is SemCor,
a subset of the Brown Corpus consisting of over 234,000 words
which were manually tagged with WordNet senses (Miller et al.,
1993; Landes et al., 1998).

• In addition, sense-tagged corpora have been built for the SENSEVAL all-word tasks. The SENSEVAL-3 English all-words test data consisted of 2,081 tagged content word tokens, from 5,000 total running words of English from the WSJ and Brown corpora (Palmer et al., 2001b)



Extracting Feature Vectors for Supervised Learning

• To extract useful features, the data is first pre-processed:

• Processing varies from approach to approach but typically includes part-of-speech tagging, lemmatization or stemming, and in some cases syntactic parsing to reveal information such as head words and dependency relations

• A feature vector consisting of numeric or nominal values is used to encode this linguistic information as an input to most machine learning algorithms



Extracting Feature Vectors for Supervised Learning

• Two classes of features are generally extracted (see the sketch below):

• Collocational Features

• Bag-of-words
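As a rough illustration (my own sketch, not from the slides): collocational features record the exact words at fixed offsets around the target word, while bag-of-words features record which context words occur, ignoring position. The function names and window size below are assumptions:

```python
def collocational_features(tokens, i, window=2):
    """Exact words at fixed offsets around the target word at index i."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        feats[f"w[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    return feats

def bag_of_words_features(tokens, i, vocab):
    """Counts of pre-chosen vocabulary words anywhere in the context."""
    context = tokens[:i] + tokens[i + 1:]
    return {w: context.count(w) for w in vocab}

tokens = "an electric guitar and bass player stand off to one side".split()
i = tokens.index("bass")
print(collocational_features(tokens, i))
# {'w[-2]': 'guitar', 'w[-1]': 'and', 'w[+1]': 'player', 'w[+2]': 'stand'}
print(bag_of_words_features(tokens, i, ["guitar", "fishing", "player"]))
# {'guitar': 1, 'fishing': 0, 'player': 1}
```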



Naive Bayes Classifier — WSD

• The naive Bayes classifier approach to WSD is based on the premise that choosing the best sense out of the set of possible senses S for a feature vector f⃗ amounts to choosing the most probable sense given that vector:

ŝ = argmax_{s ∈ S} P(s | f⃗)

• As is almost always the case, it would be difficult to collect reasonable statistics for this equation: for a vocabulary of 20 words with a binary bag-of-words vector we would have 2^20 possible vectors!



Naive Bayes Classifier — WSD

We assume that the features are conditionally independent given the word sense, which gives the naive Bayes decision rule:

ŝ = argmax_{s ∈ S} P(s) ∏_{j=1}^{n} P(f_j | s)
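A toy sketch of this decision rule (the training sentences and sense labels are invented for illustration; a real system would train on a sense-tagged corpus):

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: bag-of-words contexts labeled with senses
train = [
    ("bass_fish",  "caught a huge bass on the lake".split()),
    ("bass_fish",  "fried bass and chips for dinner".split()),
    ("bass_music", "played bass guitar in the band".split()),
    ("bass_music", "the bass line drives the song".split()),
]

sense_counts = Counter(sense for sense, _ in train)
word_counts = defaultdict(Counter)  # word_counts[s][w] = count of w with sense s
vocab = set()
for sense, words in train:
    word_counts[sense].update(words)
    vocab.update(words)

def classify(context):
    """Return argmax_s of log P(s) + sum_j log P(f_j | s), add-one smoothed."""
    best_sense, best_logprob = None, float("-inf")
    for sense, count in sense_counts.items():
        logprob = math.log(count / len(train))  # log P(s)
        total = sum(word_counts[sense].values())
        for w in context:
            logprob += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        if logprob > best_logprob:
            best_sense, best_logprob = sense, logprob
    return best_sense

print(classify("slapped the bass guitar on stage".split()))  # bass_music
```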



WSD Evaluation

• Extrinsic evaluation (also called task-based, end-to-end, or in vivo evaluation) is difficult, time consuming and dependent on the task

• Intrinsic (in vitro) evaluation is practically possible, easy to do, and gives a reasonable estimate of how good the WSD algorithm is

• Whichever WSD task we are performing, we ideally need two additional measures to assess how well we’re doing:

• a baseline measure to tell us how well we’re doing as compared to relatively simple approaches

• a ceiling to tell us how close we are to optimal performance

WSD Evaluation — Baseline and Optimality
• Baseline:

• Choose the most frequent sense for each word from the senses in a
labeled corpus

• For WordNet, this corresponds to the take the first sense heuristic,
since senses in WordNet are generally ordered from most-frequent
to least-frequent
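A minimal sketch of this baseline via NLTK (an assumed tool choice; wn.synsets lists senses in roughly most-frequent-first order):

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """First-listed WordNet synset: the 'take the first sense' baseline."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("bank", pos=wn.NOUN))  # e.g. Synset('bank.n.01')
```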

Alternate baseline: the Lesk algorithm (described below)



WSD Evaluation — Baseline and Optimality
• Human inter-annotator agreement

• Human agreement is measured by comparing the annotations of two human annotators on the same data, given the same tagging guidelines

• The ceiling (inter-annotator agreement) for many all-words corpora using WordNet-style sense inventories seems to range from about 75% to 80% (Palmer et al., 2006)

• Agreement on coarser-grained, often binary, sense inventories is closer to 90%
WSD — Dictionary and Thesaurus Method

• Supervised algorithms based on sense-labeled corpora are the best performing algorithms for sense disambiguation

• But such labeled training data is expensive and limited, and supervised approaches fail on words not in the training data

• We can get around this with indirect supervision from other sources: dictionaries and thesauruses



The Lesk Algorithm

• A well-studied dictionary-based algorithm for sense disambiguation is the Lesk algorithm

• It chooses the sense whose dictionary gloss or definition shares the most words with the target word’s neighborhood

• Context: the set of words that occur together with the target word

• Signature: the set of words in the gloss and examples of a target word sense

• Compute the overlap between context and signature for each sense, and choose the sense with the maximum overlap
The Lesk Algorithm
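The slide shows the textbook's Simplified Lesk figure; below is a rough Python rendering of that procedure (the stopword list, whitespace tokenization, and use of NLTK's WordNet glosses are my assumptions):

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "it",
             "will", "because", "can"}

def simplified_lesk(word, sentence):
    """Choose the sense whose gloss/examples overlap most with the context."""
    context = set(sentence.lower().split()) - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len((signature - STOPWORDS) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sent = ("The bank can guarantee deposits will eventually cover future "
        "tuition costs because it invests in adjustable-rate mortgage securities")
print(simplified_lesk("bank", sent))
```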



The Lesk Algorithm - Example

• The bank can guarantee deposits will eventually cover future tuition costs
because it invests in adjustable-rate mortgage securities

• Sense bank1 has two (non-stop) words overlapping with the context: deposits and mortgage

• Sense bank2 has zero

• So sense bank1 is chosen


Lesk Algorithm
• The original Lesk algorithm (Lesk, 1986) is slightly more indirect

• Instead of comparing a target word’s signature with the context words, the target signature is compared
with the signatures of each of the context words

• Other Version (Corpus Lesk): Instead of just counting up the overlapping words, the Corpus Lesk
algorithm also applies a weight to each overlapping word.

• The weight is the inverse document frequency or IDF

• IDF measures how many different ’documents’ (in this case glosses and examples) a word occurs
in

• Since function words like the, of, etc., occur in many documents, their IDF is very low, while the IDF of content words is high, so IDF weighting automatically discounts function words. Corpus Lesk thus uses IDF weights in place of raw overlap counts.

The weight of word i is idf_i = log(Ndoc / nd_i), where Ndoc is the total number of ‘documents’ (glosses and examples) and nd_i is the number of these documents containing word i.
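A small sketch of that weight, treating each gloss or example as one 'document' (the toy documents are invented):

```python
import math

def idf(word, documents):
    """idf_i = log(Ndoc / nd_i) over gloss/example 'documents'."""
    nd = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / nd) if nd else 0.0

docs = [{"a", "financial", "institution"},
        {"sloping", "land", "beside", "a", "river"},
        {"a", "deposit", "of", "money"}]
print(idf("a", docs))      # 0.0: occurs in every document
print(idf("river", docs))  # log(3/1): rare word, high weight
```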
Minimally Supervised WSD — Bootstrapping

• Supervised approaches require a large supervised training set

• Dictionary-based approaches require large dictionaries

• The Yarowsky algorithm (Yarowsky, 1995):

• a bootstrapping algorithm, also called semi-supervised learning or minimally supervised learning

• needs only a very small hand-labeled training set



Yarowsky Algorithm

• The goal of the Yarowsky algorithm is to learn a classifier for a target word

• The algorithm is given a small seed-set Λ0 of labeled instances of each sense, and a much larger unlabeled corpus V0

• The algorithm first trains a classifier on the seed-set Λ0 and it then uses this classifier to label the unlabeled corpus V0

• The algorithm then selects the examples in V0 that it is most confident about, removes them, and adds them to the
training set (call it now Λ1)

• The algorithm then trains a new classifier (a new set of rules) on Λ1 and iterates by applying the classifier to the now-smaller unlabeled set V1, extracting a new training set Λ2, and so on

• With each iteration of this process, the training corpus grows and the untagged corpus shrinks. The process is
repeated until some sufficiently low error-rate on the training set is reached, or until no further examples from the
untagged corpus are above threshold.

• The key to any bootstrapping approach lies in its ability to create a larger training set from a small set of seeds. This
requires an accurate initial set of seeds and a good confidence metric for picking good new examples to add to the
training set
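A schematic sketch of the loop described above (the train function, the classifier's (sense, confidence) interface, and the threshold are assumptions; Yarowsky's actual classifier is a decision list):

```python
def bootstrap(seed_labeled, unlabeled, train, threshold=0.95, max_iters=20):
    """Grow the labeled set from seeds by self-training on confident labels."""
    labeled = list(seed_labeled)  # Lambda_0: small hand-labeled seed set
    pool = list(unlabeled)        # V_0: large unlabeled corpus
    for _ in range(max_iters):
        classifier = train(labeled)  # train on the current labeled set
        scored = [(x, *classifier(x)) for x in pool]
        confident = [(x, sense) for x, sense, conf in scored
                     if conf >= threshold]
        if not confident:  # nothing above threshold: stop
            break
        labeled += confident  # Lambda_{i+1} = Lambda_i plus confident examples
        accepted = {x for x, _ in confident}
        pool = [x for x in pool if x not in accepted]
    return train(labeled)
```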



Yarowsky Algorithm
• Yarowsky (1995) used the One Sense per Collocation heuristic

• Yarowsky defines his seed set by choosing a single collocation for each sense. As an illustration of this technique, consider generating seed sentences for the fish and musical senses of bass:

• fish is a reasonable indicator of bass1, and

• play is a reasonable indicator of bass2



Yarowsky Algorithm

The figure shows a partial result of a search for the strings “fish” and “play” in a corpus of bass examples drawn from the WSJ



Yarowsky Algorithm

• The original Yarowsky algorithm’s second heuristic: One Sense Per Discourse (Gale et al. 1992)

• A particular word appearing multiple times in a text or discourse often appears with the same sense

• Yarowsky (1995) showed, in a corpus of 37,232 examples, that every time the word bass occurred more than once in a discourse, it occurred in only the fish or only the music coarse-grained sense throughout the discourse



Word Similarity — Thesaurus Methods
• Semantic relations hold between words

• We saw many relations, including synonymy, antonymy, hyponymy, hypernymy, and meronymy

• Synonymy has the greatest number of applications

• We can cast this as word similarity or semantic distance

• Two words are more similar if they share more features of meaning, or are near-synonyms

• Two words are less similar, or have greater semantic distance, if they have fewer common meaning elements

• Example: bank/fund and riparian/bank are similar pairs (one pair for each sense of bank)

• An on-line thesaurus like WordNet can be used for this



Word Similarity — Thesaurus Methods

• Word similarity mainly means synonymy relations, but it may include other relations also

• Word similarity covers (1) similar words and (2) related words

• Example: car and bicycle are similar, but car and gasoline are related

• In this unit we will not distinguish between similarity and relatedness

• Thesaurus-based word similarity algorithms use only the hypernym/hyponym (is-a or subsumption) hierarchy



Word Similarity — Thesaurus Methods

• Applications:

• Information retrieval

• Question answering system

• Summarisation

• Generation

• Machine translation

• Automatic essay grading


Word Similarity — Thesaurus Methods

• In WordNet, verbs and nouns are in separate hypernym hierarchies

• So verb-verb similarity and noun-noun similarity are possible



Word Similarity — Thesaurus Methods
• Simplest idea is based on the intuition that the shorter the path between two words or senses
in the graph defined by the thesaurus hierarchy, the more similar they are

• This tells us that a word/sense is very similar to its parents or its siblings, and less similar to words that are far away in the network

• This idea can be easily implemented if we define the shortest distance between the words/
senses as the similarity measure

• For example: dime is most similar to nickel and coin but it is less similar to money, and even
less similar to Richter scale.

• pathlen(c1,c2) = the number of edges in the shortest path in the thesaurus graph
between the sense nodes c1 and c2

• If pathlen is small the similarity is high, and if it is large the similarity is low
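A quick sketch with NLTK's built-in path measure (the sense numbers, e.g. nickel.n.02 for the coin, are assumptions; check wn.synsets('nickel') to locate the right sense). NLTK's path_similarity is 1 / (pathlen + 1), so larger values mean more similar:

```python
from nltk.corpus import wordnet as wn

dime = wn.synset("dime.n.01")      # the coin "dime"
nickel = wn.synset("nickel.n.02")  # assumed coin sense of "nickel"
money = wn.synset("money.n.01")

print(dime.path_similarity(nickel))  # higher: near siblings under "coin"
print(dime.path_similarity(money))   # lower: farther apart in the hierarchy
```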


Word Similarity — Thesaurus Methods

The above similarity measure between senses is generalised to two words w1 and w2 by taking the most similar pair of senses: wordsim(w1, w2) = max_{c1 ∈ senses(w1), c2 ∈ senses(w2)} sim(c1, c2)

This measure assumes a uniform distance between nodes (which is NOT correct):

• For example nickel & money, and nickel & standard are of equal distance but
intuitively they are not the same

• Also, the link between medium of exchange and standard seems wider than that
between, say, coin and coinage

• It is possible to refine path-based algorithms with normalizations based on depth in the hierarchy (Wu and Palmer, 1994), but in general it is good to associate a weight with each edge
Word Similarity Using Information Content

• Proposed by Resnik 1995

• P(c): the probability that a randomly selected word in a corpus is an instance of concept c

• P(c) = (Σ_{w ∈ words(c)} count(w)) / N, where words(c) is the set of words subsumed by concept c, and N is the total number of words in the corpus that are also present in the thesaurus

• Note that P(root) = 1, since the root concept subsumes every word in the thesaurus



Word Similarity Using Information Content

A fragment of the WordNet concept hierarchy augmented with the probabilities P(c) (from Lin 1998)



Word Similarity Using Information Content
• Two additional measures:

• Using basic information theory, we define the information content (IC) of a concept c as:

• IC(c) = − log P(c)

• Lowest common subsumer or LCS of two concepts:

• LCS(c1, c2) = the lowest common subsumer, i.e., the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
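With these two pieces, Resnik's (1995) similarity measure (not spelled out on the slide, but standard) scores two concepts by the information content of their lowest common subsumer: sim_Resnik(c1, c2) = IC(LCS(c1, c2)) = − log P(LCS(c1, c2)).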
Word Similarity Using Information Content

Lin (1998b) extended the Resnik intuition — a similarity metric between objects A
and B needs to do more than measure the amount of information in common
between A and B.

• commonality: the more information A and B have in common, the more similar they
are
• difference: the more differences between the information in A and B, the less
similar they are



Word Similarity Using Information Content

• Lin measures the commonality and difference between A and B as follows:

• the commonality is the information content of the proposition that states the commonality between A and B: IC(common(A, B))

• the difference between A and B is IC(description(A, B)) − IC(common(A, B))



Word Similarity Using Information Content

• Similarity Theorem: the similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

sim_Lin(A, B) = log P(common(A, B)) / log P(description(A, B))
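For WordNet concepts this works out (Lin, 1998) to sim_Lin(c1, c2) = 2 × log P(LCS(c1, c2)) / (log P(c1) + log P(c2)). A quick sketch with NLTK, which ships information-content tables estimated from the Brown corpus (synset names are assumptions, as before; requires nltk.download("wordnet_ic")):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # P(c) estimated from Brown
dime = wn.synset("dime.n.01")
nickel = wn.synset("nickel.n.02")         # assumed coin sense

# lin_similarity computes 2 * IC(LCS) / (IC(c1) + IC(c2))
print(dime.lin_similarity(nickel, brown_ic))
```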

