
Natural Language Processing

Some screenshots are taken from the NLP book by Jurafsky

— Used only for educational purposes

NLP © Sakthi Balan M


Lexical Semantics



“How many legs does a dog have if you call its tail a leg?

Four.

Calling a tail a leg doesn’t make it one.”


– Abraham Lincoln

Lexical Semantics
• We introduce a richer model of the semantics of words, drawing on the linguistic study of
word meaning, a field called lexical semantics

• Lexeme: the different wordforms of a lexeme share a common core meaning, even though they may be spelt (orthography) or pronounced (phonetics) differently

• Lexicon: a finite list of lexemes

• Lemma: A lemma or citation form is the grammatical form that is used to represent a
lexeme.

• For example: The lemma or citation form for sing, sang, sung is sing

• The specific forms sung or carpets or sing are also called wordforms



Lexical Semantics
• The process of mapping from a wordform to a lemma is called lemmatization

• Lemmatization is not always deterministic. It depends on the context

• For example, the wordform found can map to the lemma find (meaning ‘to
locate’) or the lemma found (‘to create an institution’), as illustrated in the
following WSJ examples:

• He has looked at 14 baseball and football stadiums and found that only one
– private Dodger Stadium – brought more money into a city than it took out

• Culturally speaking, this city has increasingly displayed its determination to found the sort of institutions that attract the esteem of Eastern urbanites



Lexical Semantics

• Note: Lemmas are part-of-speech specific; thus the wordform tables has two possible lemmas, the noun table and the verb table

• In general lemmas may be larger than morphological stems (e.g., New York or throw up)

• A lemma is not necessarily the same as the stem from the morphological
parse (Celebration is a lemma but celebrate is the stem for celebration)

• Here when we refer to the meaning (or meanings) of a ‘word’, we will generally be referring to a lemma



Word Sense
• Word Senses: the meaning of a lemma varies with respect to the context. For example:

• Instead, a bank can hold the investments in a custodial account in the client’s name — sense 1

• But as agriculture burgeons on the east bank, the river will shrink even more — sense 2

• While some banks give blood only to the needy as a service, others may do it as a business — sense 3

• The bank is on the corner of Nassau and Witherspoon — sense 4

• Sense 1 and sense 2: homonyms (homonymy)

• Sense 1 and sense 3: polysemy

• Sense 4: metonymy (compare: “I really love Jurafsky”, meaning the book rather than the person)



Word Sense

• More examples with the verb serve:

• They rarely serve red meat, preferring to prepare seafood, poultry or game birds.

• He served as U.S. ambassador to Norway in 1976 and 1977.

• He might have served his time, come out and led an upstanding
life.

• Three senses of Serve


Word Sense

• One way to determine whether two senses are distinct is to conjoin two uses of a word in a single sentence; this kind of conjunction of antagonistic readings is called zeugma. Consider the following examples:

• Which of those flights serve breakfast?

• Does Midwest Express serve Philadelphia?

• Does Midwest Express serve breakfast and Philadelphia? (odd: zeugma, so these are two distinct senses)

• Does Midwest Express serve breakfast and lunch? (fine: a single sense)


Word Sense

(Screenshot: American Heritage Dictionary definitions)

Definitions are circular in nature!



Word Sense

• Word sense relations are embodied in on-line databases like WordNet



Relations Between Senses

• Synonymy:

• When the meanings of two senses of two different words (lemmas) are identical or nearly identical, we call the words synonyms

• Example: couch/sofa, vomit/throw up, car/automobile

• More formal definition: two words are synonymous if they are substitutable one for the other in any sentence without changing the truth conditions of the sentence



Relations Between Senses

• Two words may be synonymous but still not have an identical meaning:

• I came home by automobile / car

• I am thirsty, give me H2O / water

• In practice the word synonym is therefore commonly used to describe a relationship of approximate or rough synonymy



Relations Between Senses
• Consider the following ATIS sentences, since we could swap big and
large in either sentence and retain the same meaning:

• How big is that plane?

• Would I be flying on a large or small plane?

• But consider the following WSJ sentence where we cannot substitute large for big:

• Miss Nelson, for instance, became a kind of big sister to Benjamin

• Miss Nelson, for instance, became a kind of large sister to Benjamin (odd: large cannot substitute here)


Relations Between Senses
• Antonyms are words with opposite meanings

• long / short, big / little, fast / slow, cold / hot, dark / light, rise / fall, up / down, in / out

• Two senses can be antonyms if they define a binary opposition, or are at opposite ends
of some scale. This is the case for long/short, fast/slow, or big/little, which are at
opposite ends of the length or size scale.

• Another group of antonyms is reversives, which describe some sort of change or movement in opposite directions, such as rise/fall or up/down.

• From one perspective, antonyms have very different meanings, since they are opposite.

• From another perspective, they have very similar meanings, since they share almost all
aspects of their meaning except their position on a scale, or their direction



Relations Between Senses

• Hyponym: one sense is a hyponym of another if the first sense is more specific than the second (denotes a subclass of it)

• For example: car is a hyponym of vehicle; dog is a hyponym of animal, and mango is
a hyponym of fruit

• Hypernym: We say that vehicle is a hypernym of car, and animal is a hypernym of dog.
The word superordinate is often used instead of hypernym

• The class denoted by the superordinate extensionally includes the class denoted by the hyponym

• Hypernymy can also be defined in terms of entailment. Under this definition, a sense A is
a hyponym of a sense B if everything that is A is also B
Relations Between Senses

• The term ontology usually refers to a set of distinct objects resulting from
an analysis of a domain, or microworld.

• A taxonomy is a particular arrangement of the elements of an ontology into a tree-like class inclusion structure.



Relations Between Senses
• Meronymy: the part-whole relation

• For Example: A leg is part of a chair; a wheel is part of a car

• We say that wheel is a meronym of car, and car is a holonym of wheel

• Semantic field: a model of a more integrated, or holistic, relationship among entire sets of words from a single domain

• Consider the following set of words: reservation, flight, travel, buy, price, cost, fare, rates, meal, plane

• Semantic fields of this kind are the basis of the FrameNet project (Baker et al., 1998)
WordNet
• The most commonly used resource for English sense relations is the WordNet lexical
database (Fellbaum, 1998)

• WordNet consists of three separate databases, one each for nouns and verbs, and a
third for adjectives and adverbs

• In WordNet closed class words are not included

• Each database consists of a set of lemmas, each one annotated with a set of senses

• The WordNet 3.0 release has 117,097 nouns, 11,488 verbs, 22,141 adjectives, and
4,601 adverbs

• The average noun has 1.23 senses, and the average verb has 2.16 senses

• WordNet can be accessed via the web or downloaded and accessed locally.
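Below is a minimal sketch of local access using NLTK's WordNet interface (an assumed tool choice: the slides do not prescribe one, and the nltk package plus its wordnet data must be installed):

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data once
from nltk.corpus import wordnet as wn

# List every sense (synset) of the lemma "bass" with its gloss
for synset in wn.synsets("bass"):
    print(synset.name(), ":", synset.definition())

# Synonyms for one sense come from the lemmas of its synset
print([lemma.name() for lemma in wn.synsets("bass")[0].lemmas()])
```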
WordNet

• There are eight senses for the noun and one for the adjective, each of which has a gloss (a dictionary-style definition), a list of synonyms for the sense (called a synset), and sometimes also usage examples (shown for the adjective sense)

• Unlike a dictionary, WordNet does not give the pronunciation of the word
WordNet

(Figure: noun relations in WordNet)



WordNet

(Figure: verb relations in WordNet)



Word Sense Disambiguation
• Two variants of the generic WSD task:

• In the lexical sample task: A small pre-selected set of target words is chosen, along with an
inventory of senses for each word from some lexicon. For each word, a number of corpus
instances (context sentences) can be selected and hand-labeled with the correct sense of
the target word in each. Classifier systems can then be trained using these labeled examples.
Unlabeled target words in context can then be labeled using such a trained classifier.

• Early work in word sense disambiguation focused solely on lexical sample tasks of this sort,
building word-specific algorithms for disambiguating single words like line, interest, or plant.

• In the all-words task: Systems are given entire texts and a lexicon with an inventory of senses
for each entry, and are required to disambiguate every content word in the text. The all-words
task is very similar to part-of-speech tagging, except with a much larger set of tags, since
each lemma has its own set.
Supervised Word Sense Disambiguation

• Lexical sample tasks: the line-hard-serve corpus contains 4,000 sense-tagged examples of line as a noun, hard as an adjective and serve as a verb (Leacock et al., 1993)

• The interest corpus contains 2,369 sense-tagged examples of interest as a noun (Bruce and Wiebe, 1994)

• The SENSEVAL project has also produced a number of such sense-labeled lexical sample corpora (SENSEVAL-1 with 34 words from the HECTOR lexicon and corpus (Kilgarriff and Rosenzweig, 2000; Atkins, 1993), SENSEVAL-2 and -3 with 73 and 57 target words, respectively (Palmer et al., 2001b; Kilgarriff, 2001))
• For all-word disambiguation one commonly used corpus is SemCor,
a subset of the Brown Corpus consisting of over 234,000 words
which were manually tagged with WordNet senses (Miller et al.,
1993; Landes et al., 1998).

• In addition, sense-tagged corpora have been built for the SENSEVAL all-word tasks. The SENSEVAL-3 English all-words test data consisted of 2,081 tagged content word tokens, from 5,000 total running words of English from the WSJ and Brown corpora (Palmer et al., 2001b)



Extracting Feature Vectors for Supervised Learning

• To extract useful features, the data is first pre-processed:

• Processing varies from approach to approach but typically includes part-of-speech tagging, lemmatization or stemming, and in some cases syntactic parsing to reveal information such as head words and dependency relations

• A feature vector consisting of numeric or nominal values is used to encode this linguistic information as an input to most machine learning algorithms



Extracting Feature Vectors for Supervised Learning

• Two classes of features are generally extracted (see the sketch below):

• Collocational Features

• Bag-of-words
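As a rough illustration (my own sketch, not from the slides): collocational features record the exact words at fixed offsets around the target word, while bag-of-words features record which context words occur, ignoring position. The function names and window size below are assumptions:

```python
def collocational_features(tokens, i, window=2):
    """Exact words at fixed offsets around the target word at index i."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        feats[f"w[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    return feats

def bag_of_words_features(tokens, i, vocab):
    """Counts of pre-chosen vocabulary words anywhere in the context."""
    context = tokens[:i] + tokens[i + 1:]
    return {w: context.count(w) for w in vocab}

tokens = "an electric guitar and bass player stand off to one side".split()
i = tokens.index("bass")
print(collocational_features(tokens, i))
# {'w[-2]': 'guitar', 'w[-1]': 'and', 'w[+1]': 'player', 'w[+2]': 'stand'}
print(bag_of_words_features(tokens, i, ["guitar", "fishing", "player"]))
# {'guitar': 1, 'fishing': 0, 'player': 1}
```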



Naive Bayes Classifier — WSD

• The naive Bayes classifier approach to WSD is based on the premise that choosing the best sense out of the set of possible senses S for a feature vector f⃗ amounts to choosing the most probable sense given that vector:

ŝ = argmax_{s ∈ S} P(s | f⃗)

• As is almost always the case, it would be difficult to collect reasonable statistics for this equation: for a vocabulary of 20 words with a binary bag-of-words vector we would have 2^20 possible vectors!



Naive Bayes Classifier — WSD

We assume that the features are conditionally independent given the word sense, which gives the naive Bayes decision rule:

ŝ = argmax_{s ∈ S} P(s) ∏_{j=1}^{n} P(f_j | s)
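A toy sketch of this decision rule (the training sentences and sense labels are invented for illustration; a real system would train on a sense-tagged corpus):

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: bag-of-words contexts labeled with senses
train = [
    ("bass_fish",  "caught a huge bass on the lake".split()),
    ("bass_fish",  "fried bass and chips for dinner".split()),
    ("bass_music", "played bass guitar in the band".split()),
    ("bass_music", "the bass line drives the song".split()),
]

sense_counts = Counter(sense for sense, _ in train)
word_counts = defaultdict(Counter)  # word_counts[s][w] = count of w with sense s
vocab = set()
for sense, words in train:
    word_counts[sense].update(words)
    vocab.update(words)

def classify(context):
    """Return argmax_s of log P(s) + sum_j log P(f_j | s), add-one smoothed."""
    best_sense, best_logprob = None, float("-inf")
    for sense, count in sense_counts.items():
        logprob = math.log(count / len(train))  # log P(s)
        total = sum(word_counts[sense].values())
        for w in context:
            logprob += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        if logprob > best_logprob:
            best_sense, best_logprob = sense, logprob
    return best_sense

print(classify("slapped the bass guitar on stage".split()))  # bass_music
```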



WSD Evaluation

• Extrinsic evaluation (also called task-based, end-to-end, or in vivo evaluation) is difficult, time consuming and dependent on the task

• Intrinsic (in vitro) evaluation is practically possible, easy to do, and gives a reasonable estimate of how good the WSD algorithm is

• Whichever WSD task we are performing, we ideally need two additional measures to assess how well we’re doing:

• a baseline measure to tell us how well we’re doing as compared to relatively simple approaches

• a ceiling to tell us how close we are to optimal performance

WSD Evaluation — Baseline and Optimality
• Baseline:

• Choose the most frequent sense for each word from the senses in a
labeled corpus

• For WordNet, this corresponds to the take the first sense heuristic,
since senses in WordNet are generally ordered from most-frequent
to least-frequent
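A minimal sketch of this baseline via NLTK (an assumed tool choice; wn.synsets lists senses in roughly most-frequent-first order):

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """First-listed WordNet synset: the 'take the first sense' baseline."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("bank", pos=wn.NOUN))  # e.g. Synset('bank.n.01')
```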

Alternate baseline: the Lesk algorithm (described below)



WSD Evaluation — Baseline and Optimality
• Human inter-annotator agreement

• Human agreement is measured by comparing the annotations of two human annotators on the same data, given the same tagging guidelines

• The ceiling (inter-annotator agreement) for many all-words corpora using WordNet-style sense inventories seems to range from about 75% to 80% (Palmer et al., 2006)

• Agreement on coarser-grained, often binary, sense inventories is closer to 90%
WSD — Dictionary and Thesaurus Method

• Supervised algorithms based on sense-labeled corpora are the best performing algorithms for sense disambiguation

• But such labeled training data is expensive and limited, and supervised approaches fail on words not in the training data

• We can get around this with indirect supervision from other sources: dictionaries and thesauruses



The Lesk Algorithm

• A well-studied dictionary-based algorithm for sense disambiguation is the Lesk algorithm

• It chooses the sense whose dictionary gloss or definition shares the most words with the target word’s neighborhood

• Context: the set of words that occur together with the target word

• Signature: the set of words in the gloss and examples of a target word sense

• Compute the overlap between context and signature for each sense, and choose the sense with the maximum overlap
The Lesk Algorithm
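The slide shows the textbook's Simplified Lesk figure; below is a rough Python rendering of that procedure (the stopword list, whitespace tokenization, and use of NLTK's WordNet glosses are my assumptions):

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "it",
             "will", "because", "can"}

def simplified_lesk(word, sentence):
    """Choose the sense whose gloss/examples overlap most with the context."""
    context = set(sentence.lower().split()) - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len((signature - STOPWORDS) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sent = ("The bank can guarantee deposits will eventually cover future "
        "tuition costs because it invests in adjustable-rate mortgage securities")
print(simplified_lesk("bank", sent))
```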



The Lesk Algorithm - Example

• The bank can guarantee deposits will eventually cover future tuition costs
because it invests in adjustable-rate mortgage securities

• Sense bank1 has two (non-stop) words overlapping with the context: deposits and mortgage

• Sense bank2 has zero

• So sense bank1 is chosen


Lesk Algorithm
• The original Lesk algorithm (Lesk, 1986) is slightly more indirect

• Instead of comparing a target word’s signature with the context words, the target signature is compared
with the signatures of each of the context words

• Other Version (Corpus Lesk): Instead of just counting up the overlapping words, the Corpus Lesk
algorithm also applies a weight to each overlapping word.

• The weight is the inverse document frequency or IDF

• IDF measures how many different ’documents’ (in this case glosses and examples) a word occurs
in

• Since function words like the, of, etc., occur in many documents, their IDF is very low, while the IDF of content words is high, so IDF weighting automatically discounts function words. Corpus Lesk thus uses IDF weights in place of raw overlap counts.

The weight of word i is idf_i = log(Ndoc / nd_i), where Ndoc is the total number of ‘documents’ (glosses and examples) and nd_i is the number of these documents containing word i.
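A small sketch of that weight, treating each gloss or example as one 'document' (the toy documents are invented):

```python
import math

def idf(word, documents):
    """idf_i = log(Ndoc / nd_i) over gloss/example 'documents'."""
    nd = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / nd) if nd else 0.0

docs = [{"a", "financial", "institution"},
        {"sloping", "land", "beside", "a", "river"},
        {"a", "deposit", "of", "money"}]
print(idf("a", docs))      # 0.0: occurs in every document
print(idf("river", docs))  # log(3/1): rare word, high weight
```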
Minimally Supervised WSD — Bootstrapping

• Supervised approaches require a large supervised training set

• Dictionary-based approaches require large dictionaries

• The Yarowsky algorithm (Yarowsky, 1995):

• a bootstrapping algorithm, also called semi-supervised learning or minimally supervised learning

• needs only a very small hand-labeled training set



Yarowsky Algorithm

• The goal of the Yarowsky algorithm is to learn a classifier for a target word

• The algorithm is given a small seed-set Λ0 of labeled instances of each sense, and a much larger unlabeled corpus V0

• The algorithm first trains a classifier on the seed-set Λ0 and it then uses this classifier to label the unlabeled corpus V0

• The algorithm then selects the examples in V0 that it is most confident about, removes them, and adds them to the
training set (call it now Λ1)

• The algorithm then trains a new classifier (a new set of rules) on Λ1 and iterates by applying the classifier to the now-smaller unlabeled set V1, extracting a new training set Λ2, and so on

• With each iteration of this process, the training corpus grows and the untagged corpus shrinks. The process is
repeated until some sufficiently low error-rate on the training set is reached, or until no further examples from the
untagged corpus are above threshold.

• The key to any bootstrapping approach lies in its ability to create a larger training set from a small set of seeds. This
requires an accurate initial set of seeds and a good confidence metric for picking good new examples to add to the
training set
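A schematic sketch of the loop described above (the train function, the classifier's (sense, confidence) interface, and the threshold are assumptions; Yarowsky's actual classifier is a decision list):

```python
def bootstrap(seed_labeled, unlabeled, train, threshold=0.95, max_iters=20):
    """Grow the labeled set from seeds by self-training on confident labels."""
    labeled = list(seed_labeled)  # Lambda_0: small hand-labeled seed set
    pool = list(unlabeled)        # V_0: large unlabeled corpus
    for _ in range(max_iters):
        classifier = train(labeled)  # train on the current labeled set
        scored = [(x, *classifier(x)) for x in pool]
        confident = [(x, sense) for x, sense, conf in scored
                     if conf >= threshold]
        if not confident:  # nothing above threshold: stop
            break
        labeled += confident  # Lambda_{i+1} = Lambda_i plus confident examples
        accepted = {x for x, _ in confident}
        pool = [x for x in pool if x not in accepted]
    return train(labeled)
```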



Yarowsky Algorithm
• Yarowsky (1995) used the One Sense per Collocation heuristic

• Yarowsky defines his seed set by choosing a single collocation for each sense. As an illustration of this technique, consider generating seed sentences for the fish and musical senses of bass:

• fish is a reasonable indicator of bass1, and

• play is a reasonable indicator of bass2



Yarowsky Algorithm

The figure shows a partial result of a search for the strings “fish” and “play” in a corpus of bass examples drawn from the WSJ



Yarowsky Algorithm

• The original Yarowsky algorithm’s second heuristic: One Sense Per Discourse (Gale et al. 1992)

• A particular word appearing multiple times in a text or discourse often appears with the same sense

• Yarowsky (1995) showed, in a corpus of 37,232 examples, that every time the word bass occurred more than once in a discourse, it occurred in only the fish or only the music coarse-grained sense throughout the discourse



Word Similarity — Thesaurus Methods
• Semantic relations hold between words

• We saw many relations, including synonymy, antonymy, hyponymy, hypernymy, and meronymy

• Synonymy has the greatest number of applications

• We can cast this as word similarity or semantic distance

• Two words are more similar if they share more features of meaning, or are near-synonyms

• Two words are less similar, or have greater semantic distance, if they have fewer common meaning elements

• Example: bank/fund and riparian/bank are similar pairs (one pair for each sense of bank)

• An on-line thesaurus like WordNet can be used for this



Word Similarity — Thesaurus Methods

• Word similarity mainly means synonymy relations, but it may include other relations also

• Word similarity covers (1) similar words and (2) related words

• Example: car and bicycle are similar, but car and gasoline are related

• In this unit we will not distinguish between similarity and relatedness

• Thesaurus-based word similarity algorithms use only the hypernym/hyponym (is-a or subsumption) hierarchy



Word Similarity — Thesaurus Methods

• Applications:

• Information retrieval

• Question answering system

• Summarisation

• Generation

• Machine translation

• Automatic essay grading


Word Similarity — Thesaurus Methods

• In WordNet, verbs and nouns are in separate hypernym hierarchies

• So verb-verb similarity and noun-noun similarity are possible



Word Similarity — Thesaurus Methods
• Simplest idea is based on the intuition that the shorter the path between two words or senses
in the graph defined by the thesaurus hierarchy, the more similar they are

• This tells us that a word/sense is very similar to its parents or its siblings, and less similar to words that are far away in the network

• This idea can be easily implemented if we define the shortest distance between the words/
senses as the similarity measure

• For example: dime is most similar to nickel and coin but it is less similar to money, and even
less similar to Richter scale.

• pathlen(c1,c2) = the number of edges in the shortest path in the thesaurus graph
between the sense nodes c1 and c2

• If pathlen is small the similarity is high, and if it is large the similarity is low
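A quick sketch with NLTK's built-in path measure (the sense numbers, e.g. nickel.n.02 for the coin, are assumptions; check wn.synsets('nickel') to locate the right sense). NLTK's path_similarity is 1 / (pathlen + 1), so larger values mean more similar:

```python
from nltk.corpus import wordnet as wn

dime = wn.synset("dime.n.01")      # the coin "dime"
nickel = wn.synset("nickel.n.02")  # assumed coin sense of "nickel"
money = wn.synset("money.n.01")

print(dime.path_similarity(nickel))  # higher: near siblings under "coin"
print(dime.path_similarity(money))   # lower: farther apart in the hierarchy
```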


Word Similarity — Thesaurus Methods

The above similarity measure between senses is generalised to two words w1 and w2 by taking the most similar pair of senses: wordsim(w1, w2) = max_{c1 ∈ senses(w1), c2 ∈ senses(w2)} sim(c1, c2)

This measure assumes a uniform distance between nodes (which is NOT correct):

• For example nickel & money, and nickel & standard are of equal distance but
intuitively they are not the same

• Also, the link between medium of exchange and standard seems wider than that
between, say, coin and coinage

• It is possible to refine path-based algorithms with normalizations based on depth in the hierarchy (Wu and Palmer, 1994), but in general it is good to associate a weight with each edge
Word Similarity Using Information Content

• Proposed by Resnik 1995

• P(c): the probability that a randomly selected word in a corpus is an instance of concept c

• P(c) = (Σ_{w ∈ words(c)} count(w)) / N, where words(c) is the set of words subsumed by concept c, and N is the total number of words in the corpus that are also present in the thesaurus

• Note that P(root) = 1, since the root concept subsumes every word in the thesaurus



Word Similarity Using Information Content

A fragment of the WordNet concept hierarchy augmented with the probabilities P(c) (from Lin 1998)



Word Similarity Using Information Content
• Two additional measures:

• Using basic information theory, we define the information content (IC) of a concept c as:

• IC(c) = − log P(c)

• Lowest common subsumer or LCS of two concepts:

• LCS(c1, c2) = the lowest common subsumer, i.e., the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
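With these two pieces, Resnik's (1995) similarity measure (not spelled out on the slide, but standard) scores two concepts by the information content of their lowest common subsumer: sim_Resnik(c1, c2) = IC(LCS(c1, c2)) = − log P(LCS(c1, c2)).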
Word Similarity Using Information Content

Lin (1998b) extended the Resnik intuition — a similarity metric between objects A
and B needs to do more than measure the amount of information in common
between A and B.

• commonality: the more information A and B have in common, the more similar they
are
• difference: the more differences between the information in A and B, the less
similar they are



Word Similarity Using Information Content

• Lin measures the commonality and difference between A and B as follows:

• the commonality is the information content of the proposition that states the commonality between A and B: IC(common(A, B))

• the difference between A and B is IC(description(A, B)) − IC(common(A, B))



Word Similarity Using Information Content

• Similarity Theorem: the similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

sim_Lin(A, B) = log P(common(A, B)) / log P(description(A, B))
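For WordNet concepts this works out (Lin, 1998) to sim_Lin(c1, c2) = 2 × log P(LCS(c1, c2)) / (log P(c1) + log P(c2)). A quick sketch with NLTK, which ships information-content tables estimated from the Brown corpus (synset names are assumptions, as before; requires nltk.download("wordnet_ic")):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # P(c) estimated from Brown
dime = wn.synset("dime.n.01")
nickel = wn.synset("nickel.n.02")         # assumed coin sense

# lin_similarity computes 2 * IC(LCS) / (IC(c1) + IC(c2))
print(dime.lin_similarity(nickel, brown_ic))
```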

