
I256

Applied Natural Language Processing


Fall 2009
Lecture 5: Word Sense Disambiguation (WSD), Intro to Probability Theory, Graphical Models, Naïve Bayes, Naïve Bayes for WSD

Barbara Rosario

Word Senses
Words have multiple distinct meanings, or senses:
Plant: living plant, manufacturing plant, ...
Title: name of a work, ownership document, form of address, material at the start of a film, ...

Many levels of sense distinctions


Homonymy: totally unrelated meanings (river bank, money bank)
Polysemy: related meanings (star in sky, star on TV, title)
Systematic polysemy: productive meaning extensions (metonymy, such as organizations for their buildings) or metaphor
Sense distinctions can be extremely subtle (or not)

Granularity of senses needed depends a lot on the task

Taken from Dan Klein's CS 288 slides

Word Sense Disambiguation


Determine which of the senses of an ambiguous word is invoked in a particular use of the word
Example: living plant vs. manufacturing plant
How do we tell these senses apart?
Context
The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.

Maybe it's just text categorization: each word sense represents a topic

Why is it important to model and disambiguate word senses?


Translation
Bank → banca or riva

Parsing
For PP attachment, for example

Information retrieval
To return documents with the right sense of bank
Adapted from Dan Klein's CS 288 slides

Resources
WordNet
Hand-built (but large) hierarchy of word senses; basically a hierarchical thesaurus (see the WordNet lookup sketch after this slide)

SensEval
A WSD competition
Training/test sets for a wide range of words, difficulties, and parts of speech
Bake-off where lots of labs tried lots of competing approaches

SemCor
A big chunk of the Brown corpus annotated with WordNet senses

Other Resources
The Open Mind Word Expert
Parallel texts
Taken from Dan Klein's CS 288 slides
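As a quick illustration of what WordNet provides, here is a minimal lookup sketch. It assumes NLTK and its WordNet corpus data are installed; it is not part of the original slides.

```python
# List the WordNet senses (synsets) of "plant" with their glosses.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("plant"):
    print(synset.name(), "-", synset.definition())
```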

Features
Bag-of-words (use the surrounding words, ignoring order; see the sketch after this slide)
The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.
Bag of words = {after, manufacturing, which, labor, ...}

Bag-of-words classification works OK for noun senses

90% on classic, shockingly easy examples (line, interest, star)
80% on SensEval-1 nouns
70% on SensEval-1 verbs
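A minimal sketch of the bag-of-words feature extraction described above. The helper name and the simple whitespace tokenization are assumptions for illustration, not from the slides.

```python
# Collect the words within a fixed window around each occurrence of the target word.
def bag_of_words(tokens, target, window=5):
    """Return the set of words within +/- `window` tokens of `target`."""
    features = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            features.update(t for t in tokens[lo:hi] if t != target)
    return features

sentence = ("the manufacturing plant which had previously sustained the "
            "town's economy shut down after an extended labor strike")
print(bag_of_words(sentence.split(), "plant"))
```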

Verb WSD
Why are verbs harder?
Verbal senses are less topical
More sensitive to structure and argument choice
Better disambiguated by their arguments (subject, object): importance of local information
For nouns, a wider context is likely to be useful

Verb Example: Serve


[function] The tree stump serves as a table
[enable] The scandal served to increase his popularity
[dish] We serve meals for the homeless
[enlist] She served her country
[jail] He served six years for embezzlement
[tennis] It was Agassi's turn to serve
[legal] He was served by the sheriff

Different types of information may be appropriate for different parts of speech


Adapted from Dan Klein's CS 288 slides

Better features
There are smarter features:
Argument selectional preference:
serve NP[meals] vs. serve NP[papers] vs. serve NP[country]

Subcategorization:
[function] serve PP[as]
[enable] serve VP[to]
[tennis] serve <intransitive>
[food] serve NP {PP[to]}
Can capture these poorly (but robustly) with local windows, but we can also use a parser and get these features explicitly

Other constraints (Yarowsky 95)


One-sense-per-discourse
One-sense-per-collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)
Taken from Dan Klein's CS 288 slides

Various Approaches to WSD


Unsupervised learning
We don't know/have the labels
Rather than disambiguation, the task is discrimination: cluster the instances into groups and discriminate between these groups without assigning labels (clustering)
Examples: EM (expectation-maximization), bootstrapping (seeded with some labeled data)

Indirect supervision (see Section 7.3 of the Stat NLP book)


From thesauri
From WordNet
From parallel corpora

Supervised learning

Adapted from Dan Klein's CS 288 slides

Supervised learning
Supervised learning
When we know the truth (the true senses) (not always true or easy)
Classification task
Most systems do some kind of supervised learning
Many competing classification technologies perform about the same (it's all about the knowledge sources you tap)
Problem: training data is available for only a few words
Example: Bayesian classification
Naïve Bayes (simplest example of graphical models)

(We'll talk more about supervised learning/classification during the course)


Adapted from Dan Klein's CS 288 slides

Today
Introduction to probability theory
Introduction to graphical models
Probability theory plus graph theory

Naïve Bayes (simple graphical model)


Naïve Bayes for WSD (classification task)


Why Probability?
Statistical NLP aims to do statistical inference for the field of NLP.
Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.


Why Probability?
Examples of statistical inference are WSD, language modeling (e.g., how to predict the next word given the previous words), topic classification, etc. In order to do this, we need a model of the language. Probability theory helps us find such a model.

Probability Theory
How likely it is that something will happen
The sample space Ω is the listing of all possible outcomes of an experiment
The sample space can be continuous or discrete; for language applications it is discrete (e.g., words)

An event A is a subset of Ω
A probability function (or distribution) is a function P: Ω → [0, 1]

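As a toy illustration of a discrete distribution over a tiny sample space of words (made-up numbers, not from the slides):

```python
# A toy distribution P over a three-word sample space; values must sum to 1.
P = {"plant": 0.4, "bank": 0.35, "star": 0.25}
assert abs(sum(P.values()) - 1.0) < 1e-9   # probabilities sum to 1
print(P["plant"])                          # P(plant) = 0.4
```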

[Several slides of probability review figures from http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf]

Prior Probability
Prior probability: the probability before we consider any additional knowledge

P(A)


Conditional probability
Sometimes we have partial knowledge about the outcome of an experiment
Conditional (or posterior) probability
Suppose we know that event B is true
The probability that A is true given the knowledge about B is expressed by

P(A | B) = P(A, B) / P(B)

[Figure from http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf]

Conditional probability (cont)

P(A, B) = P(A|B) P(B) = P(B|A) P(A)

Note: P(A, B) = P(A ∩ B)
Chain rule:
P(A, B) = P(A|B) P(B): the probability that A and B both happen is the probability that B happens times the probability that A happens, given that B has occurred
P(A, B) = P(B|A) P(A): the probability that A and B both happen is the probability that A happens times the probability that B happens, given that A has occurred
The joint distribution is a multi-dimensional table with a value in every cell giving the probability of that specific state occurring

Chain Rule
P(A,B) = P(A|B)P(B) = P(B|A)P(A)

P(A, B, C, D, ...) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) ...
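For reference, the general chain rule for n events, written out in LaTeX (a standard identity; this form is not on the slide):

```latex
% General chain rule for n events
\begin{align*}
P(A_1, A_2, \ldots, A_n)
  &= P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1}) \\
  &= \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})
\end{align*}
```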


Chain Rule → Bayes' rule


P(A,B) = P(A|B)P(B) = P(B|A)P(A)
P(A|B) = P(B|A) P(A) / P(B)

Bayes' rule

Useful when one quantity is easier to calculate than the other; a trivial consequence of the definitions we saw, but extremely useful

Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)

Bayes' rule translates causal knowledge into diagnostic knowledge.


For example, if A is the event that a patient has a disease, and B is the event that she displays a symptom, then P(B | A) describes a causal relationship, and P(A | B) describes a diagnostic one (that is usually hard to assess). If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free.

Example
S: stiff neck, M: meningitis
P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
I have a stiff neck; should I worry?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
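A quick check of the slide's arithmetic in plain Python (no assumptions beyond the numbers above):

```python
# Verify the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5
p_m = 1 / 50000
p_s = 1 / 20
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # 0.0002
```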



(Conditional) independence
Two events A and B are independent of each other if P(A) = P(A|B)
Two events A and B are conditionally independent of each other given C if P(A|C) = P(A|B,C)

Back to language
Statistical NLP aims to do statistical inference for the field of NLP
Topic classification
P( topic | document )

Language models
P (word | previous word(s) )

WSD
P( sense | word)

Two main problems


Estimation: P is unknown; estimate P
Inference: we estimated P; now we want to find (infer) the topic of a document, or the sense of a word

Language Models (Estimation)


In general, for language events, P is unknown. We need to estimate P (or a model M of the language). We'll do this by looking at evidence about what P must be, based on a sample of data.


Estimation of P
Frequentist statistics
Parametric
Non-parametric (distribution-free)

Bayesian statistics
Bayesian statistics measures degrees of belief
Degrees are calculated by starting with prior beliefs and updating them in the face of the evidence, using Bayes' theorem

2 different approaches, 2 different philosophies



Inference
The central problem of computational probability theory is the inference problem: given a set of random variables X1, …, Xk and their joint density P(X1, …, Xk), compute one or more conditional densities given observations.
Compute
P(X1 | X2, …, Xk)
P(X3 | X1)
P(X1, X2 | X3, X4)
etc.

Many problems can be formulated in these terms.



Bayes decision rule


w: ambiguous word
S = {s1, s2, …, sn}: senses of w
C = {c1, c2, …, cn}: contexts of w in a corpus
V = {v1, v2, …, vj}: words used as contextual features for disambiguation

Bayes decision rule


Decide sj if P(sj | c) > P(sk | c) for sj ≠ sk

We want to assign w to the sense s' where

s' = argmax_sk P(sk | c)

Bayes classification for WSD


We want to assign w to the sense s' where

s' = argmax_sk P(sk | c)

We usually do not know P(sk | c), but we can compute it using Bayes' rule:

P(sk | c) = P(c, sk) / P(c) = P(c | sk) P(sk) / P(c)

s' = argmax_sk P(sk | c) = argmax_sk P(c | sk) P(sk) / P(c) = argmax_sk P(c | sk) P(sk)

Naïve Bayes classifier

s' = argmax_sk P(c | sk) P(sk)

The Naïve Bayes classifier is widely used in machine learning
Estimate P(c | sk) and P(sk)


Naïve Bayes classifier

s' = argmax_sk P(c | sk) P(sk)

Estimate P(c | sk) and P(sk)

w: ambiguous word
S = {s1, s2, …, sn}: senses of w
C = {c1, c2, …, cn}: contexts of w in a corpus
V = {v1, v2, …, vj}: words used as contextual features for disambiguation

Naïve Bayes assumption:

P(c | sk) = P({vj | vj in c} | sk) = ∏_{vj in c} P(vj | sk)

Naïve Bayes classifier

P(c | sk) = P({vj | vj in c} | sk) = ∏_{vj in c} P(vj | sk)

The Naïve Bayes assumption has two consequences:
All the structure and linear ordering of words within the context is ignored: a bag-of-words model
The presence of one word in the context is independent of the others
Not true, but it makes the model easier and very efficient ("easier" and "efficient" mean something specific in the probabilistic framework; we'll see later, but: easier to estimate the parameters, and more efficient inference)
The Naïve Bayes assumption is inappropriate if there are strong dependencies, but it often does very well (partly because the decision may be optimal even if the assumption is not correct)

Naïve Bayes for WSD


Bayes decision rule:

s' = argmax_sk P(c | sk) P(sk)

Naïve Bayes assumption:

P(c | sk) = P({vj | vj in c} | sk) = ∏_{vj in c} P(vj | sk)

Therefore:

s' = argmax_sk P(sk) ∏_{vj in c} P(vj | sk)

Estimation:

P(vj | sk) = C(vj, sk) / C(sk)   (count of vj when the sense is sk)

P(sk) = C(sk) / C(w)   (prior probability of sk)

Naïve Bayes Algorithm for WSD


TRAINING (aka Estimation)

For all senses sk of w do
  For all words vj in the vocabulary, calculate
    P(vj | sk) = C(vj, sk) / C(sk)
  end
end

For all senses sk of w do
  P(sk) = C(sk) / C(w)
end
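A minimal Python sketch of the training step above. The function name `train_nb` and the data layout are hypothetical: `training_data` is assumed to be a list of (context_tokens, sense) pairs for the ambiguous word w, C(sk) is read as the total count of context words observed with sense sk (so the likelihoods form a proper distribution), and no smoothing is applied.

```python
# Estimate P(sk) and P(vj | sk) by counting, as in the training loop above.
from collections import Counter, defaultdict

def train_nb(training_data):
    context_counts = Counter()              # number of training contexts per sense
    word_counts = defaultdict(Counter)      # C(vj, sk)
    for context, sense in training_data:
        context_counts[sense] += 1
        word_counts[sense].update(context)
    total_contexts = sum(context_counts.values())        # C(w)
    priors = {sk: n / total_contexts                      # P(sk) = C(sk) / C(w)
              for sk, n in context_counts.items()}
    likelihoods = {}
    for sk, counts in word_counts.items():
        total = sum(counts.values())                      # C(sk), total context words
        likelihoods[sk] = {vj: c / total                  # P(vj | sk)
                           for vj, c in counts.items()}
    return priors, likelihoods
```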

Naïve Bayes Algorithm for WSD


TESTING (aka Inference or Disambiguation)

For all senses sk of w do
  For all words vj in the context window c, calculate
    score(sk) = P(sk | c) ∝ P(c | sk) P(sk)
  end
end

Choose s' = argmax_sk score(sk)
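A matching sketch of the disambiguation step, reusing the output of the hypothetical `train_nb` above. Log probabilities are used to avoid underflow; context words unseen with a sense are simply skipped here, whereas a real system would smooth instead.

```python
# Pick the sense maximizing log P(sk) + sum of log P(vj | sk) over context words.
import math

def disambiguate(context, priors, likelihoods):
    best_sense, best_score = None, float("-inf")
    for sk, prior in priors.items():
        score = math.log(prior)                  # log P(sk)
        for vj in context:
            p = likelihoods[sk].get(vj)
            if p:                                # add log P(vj | sk) if seen in training
                score += math.log(p)
        if score > best_score:
            best_sense, best_score = sk, score
    return best_sense                            # s' = argmax_sk score(sk)
```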

Next week
Introduction to Graphical Models
Part-of-speech tagging
Readings:
Chapter 5 of the NLTK book
Chapter 10 of Foundations of Statistical NLP

