Machine Learning in Natural Language Processing
Introduction
Fernando Pereira
University of Pennsylvania
NASSLLI, June 2002
Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee,
Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby
Why ML in NLP
- Document topic classification: AI, cognitive science, politics, business
- Moore's law: storage, lots of data

Classification
- Document topic: national, environment
- Word sense: treasury bonds vs. chemical bonds
Analysis
- Language modeling
- Tagging: causing/VBG symptoms/NNS ... that/WDT show/VBP up/RP decades/NNS later/JJ
- Parsing: lexicalized tree for "workers dumped sacks into a bin":
  S(dumped) → NP-C(workers) VP(dumped)
  VP(dumped) → V(dumped) NP-C(sacks) PP(into)
  PP(into) → P(into) NP-C(bin)
  NP-C(bin) → D(a) N(bin)

Inference
- Translation:
  treasury bonds → obrigações do tesouro
  covalent bonds → ligações covalentes
Information extraction
- Specify: form of output (programs), accuracy criterion
- Target slots: acquirer, acquired

Example text:
"Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at a cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars. The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted. The proposed agreement is subject to approval by the French government, it said."
Learning Tradeoffs
[Figure: error vs. amount of training data, showing overfitting and the error of the best program in the class]

Fundamental Questions
- Supervised learning: tagging, parsing, extraction
- Unsupervised learning: generalization, structure induction

Jargon
- Document classification
- Disambiguation
- Structured models

Classifiers
- Rote learning
Classification Ideas
- Content and context
- Interdependent decisions:
  - successive parts-of-speech
  - parsing/generation steps
  - lexical choice in parsing and translation
- Approaches: sequential decisions, generative models, constraint satisfaction
Methodological Detour
- Combining decisions:
  - Empiricist/information-theoretic view: words combine following their associations in previous material
  - Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars
- Latent variables: "I'm thinking of sports" → more sporty words
- Distributional regularities; data compression
- Infer dependencies among variables: structure learning

Methodological Detour (2)

Unseen Events
Global coherence

Questions
- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?

Classification

Generative or Discriminative?
- Generative models
- Discriminative models
Generative Claims
[Figure: generative graphical model over features F2, F3, F4]

Discriminative Models
- Define a functional form for p(y | x; θ)
- Logistic form
Discriminative Claims
- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Classification Tasks
- Document categorization: news categorization, message filtering, web page selection
- Tagging: named entity, part-of-speech, sense disambiguation
- Syntactic decisions: attachment

Document representation
- Binary vector: f_t(d) = 1 iff t ∈ d
- Raw frequency: r_t(d) = tf(d,t), the number of occurrences of term t in document d
- TF*IDF: x_t(d) = log(1 + tf(d,t)) · log(1 + idf(d,t)), where idf(d,t) = |D| / |{d ∈ D : t ∈ d}|
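As a minimal sketch of these weighting formulas (the helper names and toy documents are illustrative, not from the slides):

```python
import math
from collections import Counter

def tf(doc, term):
    # tf(d, t): number of occurrences of term t in document d (a list of tokens)
    return Counter(doc)[term]

def idf(docs, term):
    # idf(D, t) = |D| / |{d in D : t in d}|
    containing = sum(1 for d in docs if term in d)
    return len(docs) / containing if containing else 0.0

def tfidf_vector(doc, docs, vocab):
    # x_t(d) = log(1 + tf(d, t)) * log(1 + idf(D, t))
    return {t: math.log(1 + tf(doc, t)) * math.log(1 + idf(docs, t)) for t in vocab}

docs = [["treasury", "bonds", "rise"], ["chemical", "bonds", "break"]]
vocab = {t for d in docs for t in d}
print(tfidf_vector(docs[0], docs, vocab))
```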
Document Models
- Document categorization
- Unigram model:
  p(c | d) = p(c) Π_{i=1..|d|} p(d_i | c) / Σ_{c'} p(c') Π_{i=1..|d|} p(d_i | c')
- Vector (multinomial) model:
  p(r | c) = (L! / Π_t r_t!) Π_t p(t | c)^{r_t}, where L = Σ_t r_t
  p(c | r) ∝ p(c) p(r | c)

Linear Classifiers
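A minimal sketch of the unigram model used as a classifier, assuming tokenized documents; add-alpha smoothing is an added assumption, not part of the slides:

```python
import math
from collections import Counter, defaultdict

def train_unigram(labeled_docs, alpha=1.0):
    # Estimate p(c) and p(t | c) with add-alpha smoothing from (tokens, class) pairs.
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, c in labeled_docs:
        class_counts[c] += 1
        word_counts[c].update(tokens)
        vocab.update(tokens)
    n = sum(class_counts.values())
    priors = {c: class_counts[c] / n for c in class_counts}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {t: (word_counts[c][t] + alpha) / (total + alpha * len(vocab))
                          for t in vocab}
    return priors, likelihoods

def classify(tokens, priors, likelihoods):
    # argmax_c log p(c) + sum_i log p(d_i | c); unseen words are skipped
    def score(c):
        return math.log(priors[c]) + sum(math.log(likelihoods[c][t]) for t in tokens if t in likelihoods[c])
    return max(priors, key=score)

data = [(["treasury", "bonds", "rates"], "finance"), (["chemical", "bonds", "atoms"], "chemistry")]
priors, likelihoods = train_unigram(data)
print(classify(["bonds", "rates"], priors, likelihoods))
```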
Documents vs. Vectors (3)
- Rocchio: w_k = max(0, x̄_k^c − x̄_k^{D−c}) (difference of in-class and out-of-class centroids)
- Widrow-Hoff: w ← w − 2η(w·x_i − y_i) x_i
- Online methods:
  - Perceptron
  - Winnow
  - (Balanced) Winnow: y = sign(w⁺·x − w⁻·x − θ)

Linear Classification
- h(x) = w·x + b = Σ_k w_k x_k + b
- Instance margin: γ_i = y_i (w·x_i + b)
- Normalized (geometric) margin: γ_i = y_i (w/‖w‖ · x_i + b/‖w‖)
[Figure: linearly separable points (x, o) with separating hyperplane w, offset b, and margins γ_i, γ_j]
Duality
- The perceptron weight vector is a linear combination of training instances: w = Σ_i α_i y_i x_i, α_i ≥ 0

Perceptron Algorithm
- Given a training set, repeat: if y_i (w·x_i + b) ≤ 0, update w ← w + y_i x_i

Why Maximize the Margin?
- Generalization bound: with probability at least 1 − δ,
  err(h) ≤ (2/N) (c (R²/γ²) log² N + log(1/δ))
  for a margin-γ classifier on N examples contained in a ball of radius R

Canonical Hyperplanes
- γ = (1/2) (w/‖w‖ · x⁺ − w/‖w‖ · x⁻) = 1/‖w‖
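A minimal sketch of the perceptron update on a toy separable set (plain Python, illustrative data):

```python
def perceptron(data, epochs=100):
    # data: list of (x, y) with x a feature vector (list of floats), y in {-1, +1}
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wk * xk for wk, xk in zip(w, x)) + b) <= 0:
                # mistake-driven update: w <- w + y x, b <- b + y
                w = [wk + y * xk for wk, xk in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

data = [([1.0, 2.0], +1), ([2.0, 1.0], +1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
print(perceptron(data))
```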
Constrained Optimization
- Problem: min_{w∈Ω} f(w) subject to g_i(w) ≤ 0, h_j(w) = 0
- f convex; g_i, h_j affine (h(w) = Aw − b)
- Solution w*, α*, β* must satisfy the Kuhn-Tucker conditions:
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0 (complementarity: a parameter is non-zero iff its constraint is active)
  g_i(w*) ≤ 0
  α_i* ≥ 0

Maximum-Margin Dual
- Lagrangian: L(w, b, α) = (1/2) w·w − Σ_i α_i [y_i (w·x_i + b) − 1]
- ∂L(w,b,α)/∂w = w − Σ_i y_i α_i x_i = 0
  ∂L(w,b,α)/∂b = Σ_i y_i α_i = 0
- Dual problem: max_α W(α) = Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j x_i·x_j
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Solution: w* = Σ_i y_i α_i* x_i with margin γ = 1/‖w*‖ = (Σ_{i∈sv} α_i*)^{−1/2}
Consequences
- Complementarity: α_i [y_i (w*·x_i + b) − 1] = 0
- α_i > 0 ⟹ w*·x_i + b = y_i (support vectors)
- Margin: γ = 1/‖w*‖
- Decision function: h(x) = sgn(Σ_i y_i α_i* x_i·x + b*)

Kernels
- Replace inner products with a kernel K
- Dual problem: max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j K(x_i, x_j)
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Decision rule: h(x) = sgn(Σ_i y_i α_i* K(x_i, x) + b*)

Soft Margin
- Dual problem: max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j (x_i·x_j + (1/C) δ_ij)
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
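A minimal sketch of evaluating the kernel decision rule; the support vectors, α values and b below are made up for illustration rather than obtained by solving the dual:

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    # h(x) = sgn( sum_i y_i * alpha_i * K(x_i, x) + b )
    s = sum(y * a * kernel(xi, x) for xi, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

svs = [[1.0, 1.0], [-1.0, -1.0]]
print(svm_decision([0.8, 1.2], svs, [+1, -1], [0.5, 0.5], b=0.0))
```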
Conditional Maximum Entropy Model
- Model form: p(y | x; Λ) = exp(Σ_k λ_k f_k(x, y)) / Z(x; Λ)
- Maximum entropy subject to the constraints
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i) f_k(x_i, y)
  yields p̂(y | x) = p(y | x; Λ̂)
- Duality (maximum entropy ↔ maximum conditional likelihood)
- Useful properties:
  - multi-class
  - may use different features for different classes
  - training is convex optimization

Relationship to (Binary) Logistic Discrimination
- p(+1 | x) = exp(Σ_k λ_k f_k(x, +1)) / Z(x; λ) = 1 / (1 + exp(−Σ_k λ_k g_k(x)))
  with g_k(x) = f_k(x, +1) − f_k(x, −1)

Relationship to Linear Discrimination
- Decision rule: sign(log [p(+1 | x) / p(−1 | x)]) = sign(Σ_k λ_k g_k(x))
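A minimal sketch of this conditional log-linear form; the feature template and weights are invented for illustration:

```python
import math

def loglinear_prob(x, y, classes, features, weights):
    # p(y | x; Lambda) = exp(sum_k lambda_k f_k(x, y)) / Z(x; Lambda)
    # `features(x, y)` returns a dict of feature values; `weights` maps feature names to lambdas
    def score(y_):
        return sum(weights.get(k, 0.0) * v for k, v in features(x, y_).items())
    z = sum(math.exp(score(y_)) for y_ in classes)
    return math.exp(score(y)) / z

# hypothetical word-sense example: features pair the class with co-occurring words
def features(x, y):
    return {f"{y}&{w}": 1.0 for w in x}

weights = {"finance&treasury": 1.2, "chemistry&covalent": 1.5}
print(loglinear_prob(["treasury", "bonds"], "finance", ["finance", "chemistry"], features, weights))
```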
Parameter Updates (Iterative Scaling)
- Conditional log-likelihood: ℓ(Λ) = Σ_i log p(y_i | x_i; Λ)
- GIS update, when Σ_k f_k(x_i, y) = C for all i, y:
  λ_k ← λ_k + (1/C) log [ Σ_i f_k(x_i, y_i) / Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) ]
- IIS update λ_k ← λ_k + δ_k, with f#(x, y) = Σ_k f_k(x, y):
  ℓ(Λ + Δ) − ℓ(Λ) = Σ_i Δ·f(x_i, y_i) − Σ_i log [Z(x_i; Λ + Δ) / Z(x_i; Λ)]
                  = Σ_i Δ·f(x_i, y_i) − Σ_i log Σ_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}  ≡ A(Δ)
  By Jensen's inequality:
  A(Δ) ≥ Σ_i Δ·f(x_i, y_i) + N − Σ_i Σ_y p(y | x_i; Λ) Σ_k [f_k(x_i, y) / f#(x_i, y)] e^{δ_k f#(x_i, y)} ≡ B(Δ)
  ∂B(Δ)/∂δ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
  Setting this derivative to zero gives each δ_k.
Gaussian Prior
- Log-likelihood gradient:
  ∂ℓ(Λ)/∂λ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) − λ_k/σ_k²
- The expected-count constraints acquire a corresponding penalty term (λ_k + δ_k)/σ_k²
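A minimal sketch of this regularized gradient, assuming a `features(x, y)` helper that returns a dict of feature values (a hypothetical name); the result could be fed to any gradient-ascent optimizer:

```python
import math

def gradient(data, classes, features, weights, sigma2=10.0):
    # d l / d lambda_k = sum_i f_k(x_i, y_i)
    #                  - sum_i sum_y p(y | x_i; Lambda) f_k(x_i, y)
    #                  - lambda_k / sigma^2          (Gaussian prior term)
    grad = {k: -w / sigma2 for k, w in weights.items()}
    for x, y in data:
        # empirical feature counts
        for k, v in features(x, y).items():
            grad[k] = grad.get(k, 0.0) + v
        # expected feature counts under the model
        scores = {y_: sum(weights.get(k, 0.0) * v for k, v in features(x, y_).items()) for y_ in classes}
        z = sum(math.exp(s) for s in scores.values())
        for y_ in classes:
            p = math.exp(scores[y_]) / z
            for k, v in features(x, y_).items():
                grad[k] = grad.get(k, 0.0) - p * v
    return grad

feats = lambda x, y: {f"{y}&{w}": 1.0 for w in x}
data = [(["treasury", "bonds"], "finance"), (["covalent", "bonds"], "chemistry")]
print(gradient(data, ["finance", "chemistry"], feats, {}))
```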
Instance Representation
- Word identity
- Word class
- Word relative frequency in document

Enriching Features
- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown-word features: suffixes, capitalization
- Feature combinations (cf. n-grams)
Structured Models: Finite State

"I understood each and every word you said but not the order in which they appeared."

Structured Models
- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing
Constraint Satisfaction in Structured Models
- Fit parameters: θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Constraint satisfaction:
  - principled
  - probabilistic interpretation allows model composition
  - efficient optimal decoding
- Local classification

N-gram Language Models
- Markov approximation: P(w_k | w_1 ... w_{k−1}) ≈ P(w_k | w_{k−n+1} ... w_{k−1})
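A minimal sketch of this Markov approximation for n = 2; the add-alpha smoothing is an assumption, not from the slides:

```python
from collections import Counter, defaultdict

def train_bigram(sentences, alpha=0.1):
    # Markov approximation with n = 2: P(w_k | w_1 ... w_{k-1}) ~ P(w_k | w_{k-1})
    counts, vocab = defaultdict(Counter), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    def prob(cur, prev):
        # add-alpha smoothing so unseen bigrams get non-zero probability
        return (counts[prev][cur] + alpha) / (sum(counts[prev].values()) + alpha * len(vocab))
    return prob

prob = train_bigram([["we", "need", "to", "resolve", "the", "issues"],
                     ["we", "need", "the", "data"]])
print(prob("need", "we"), prob("issues", "the"))
```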
Markov's Unreasonable Effectiveness

Entropy estimates (bits/char):
  4.43
  1.75  word trigrams (Brown et al. 92)
  1.34  human prediction (Cover & King 78)

Local word relations dominate statistics (Jelinek):
[Figure: word lattice of alternative sentence continuations with corpus counts, e.g. "The (1) / This (2) / One (2) / Please (7) / We (9) ... need / know / have / understand / use / insert / resolve ... the issues / problems / data / information / people / tools ..."]
- No dependency structure
- Likelihoods based on sequencing, not dependency
Important Detour: Latent Variable Models
- Expectation-Maximization (EM)
- Examples:
  - mixture models
  - class-based models (hidden classes)
  - hidden Markov models

Maximizing Likelihood
- Data: D = {(x_1, y_1), ..., (x_N, y_N)}
- Log-likelihood: L(D | Λ) = Σ_i log p(x_i, y_i) = Σ_{x,y} p̃(x, y) log p(x, y | Λ),
  where p̃(x, y) = |{i : x_i = x, y_i = y}| / N
- Find parameters that maximize the log-likelihood:
  Λ̂ = argmax_Λ Σ_{x,y} p̃(x, y) log p(x, y | Λ)

Convexity
- Convex function f: f(αx_0 + (1 − α)x_1) ≤ αf(x_0) + (1 − α)f(x_1)
- Jensen's inequality: f(Σ_x p(x) x) ≤ Σ_x p(x) f(x)
Auxiliary Function
[Figure: EM as coordinate ascent on a lower bound of the log-likelihood (alternating E and M steps)]

Comments

Mixture Models
- Base distributions: p_i(y), 1 ≤ i ≤ m
- Mixture coefficients: λ_i ≥ 0, Σ_{i=1}^m λ_i = 1
- Mixture distribution: p(y | Λ) = Σ_i λ_i p_i(y)

Auxiliary Quantities and Solution
- Complete-data model: p(c, y | Λ) = λ_c p_c(y)
- Auxiliary function: Q(Λ′, Λ) = Σ_y p̃(y) Σ_c p(c | y, Λ) log [p(y, c | Λ′) / p(y, c | Λ)]
- E step: C_i = Σ_y p̃(y) λ_i p_i(y) / Σ_j λ_j p_j(y)
- M step: λ_i ← C_i / Σ_j C_j
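A minimal sketch of these E and M steps for re-estimating mixture coefficients with fixed base distributions (toy example):

```python
def em_mixture(samples, base_dists, iters=50):
    # EM for mixture coefficients lambda_i with fixed base distributions p_i(y).
    m = len(base_dists)
    lam = [1.0 / m] * m
    for _ in range(iters):
        # E step: C_i = sum_y lam_i p_i(y) / sum_j lam_j p_j(y)
        C = [0.0] * m
        for y in samples:
            denom = sum(lam[j] * base_dists[j](y) for j in range(m))
            for i in range(m):
                C[i] += lam[i] * base_dists[i](y) / denom
        # M step: lam_i = C_i / sum_j C_j
        total = sum(C)
        lam = [c / total for c in C]
    return lam

# toy example: two biased-coin components over {0, 1}
p1 = lambda y: 0.9 if y == 1 else 0.1
p2 = lambda y: 0.2 if y == 1 else 0.8
print(em_mixture([1, 1, 0, 1, 0, 0, 1, 1], [p1, p2]))
```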
IE Example
- Entity types: person, person-descriptor, organization
- Relations: employee
- Co-reference

IE Methods
- Partial matching:
  - hand-built patterns
  - automatically-trained hidden Markov models
  - cascaded finite-state transducers
- Parsing-based, relying on:
  - syntactic structure
  - phrase classification

Global Constraint vs. Local Classifier
- Global constraint:
  - principled
  - probabilistic interpretation allows model composition
  - efficient optimal decoding
- Local classifier:
  - wider range of models
  - more efficient training
  - heuristic decoding comparable to pruning in global models

Generative Models
- Process structure and process parameters
- (Hypothesize structure)
- Estimate parameters from training data

Model Structure
- Decompose the generation of instances into elementary steps
- Define dependencies between steps
- Parameterize the dependencies
- Useful descriptive language: graphical models
Independence or Intractability
- Features: word identity, word properties (e.g. spelling, capitalization)
[Figure: graphical models with hidden classes C0 ... C4, observations X0 ... X4, and per-position features F_{i,j}]
Score Card
- No independence assumptions
- Richer features: combinations of existing features
- Optimization problem for parameters
- Limited probabilistic interpretation
- Insensitive to input distribution

Example features:
- identity of word
- word is in all caps
- word ends in -tion
- word is part of a noun phrase
- word is in bold font
- word is on left hand side of page
- word is under node X in WordNet
- length of line
- line is centered
- percent of non-alphabetics
- total amount of white space
- line contains two verbs
- line begins with a number
- line is grammatically a question

Maximum Entropy Markov Model
- Next-state distribution P(s | o, s′) represented by an exponential model
HMM vs. MEMM
[Figure: HMM with transition P(s_t | s_{t−1}) and emission P(o_t | s_t); MEMM with a single conditional next-state distribution P(s_t | s_{t−1}, o_t)]

Transition Features
- P(s | o, s′)

Next-State Classifier
- P_{s′}(s | o) = (1/Z(o, s′)) exp Σ_{<b,q>} λ_{<b,q>} f_{<b,q>}(o, s)
Experimental Data
- FAQ documents with lines labeled <head>, <question>, <answer>

Features in Experiments
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

Models Tested

Results

Learner        Segmentation precision   Segmentation recall
ME-Stateless   0.038                    0.362
TokenHMM       0.276                    0.140
FeatureHMM     0.413                    0.529
MEMM           0.867                    0.681
Label Bias Problem
[Figure: finite-state model over "rib" vs. "rob" (states 0-5)]
- Per-state normalization does not allow the required score(1,2 | ro) << score(1,2 | ri)

MEMM vs. CRF
- State sequence s = s_1 ... s_n, observation sequence o = o_1 ... o_n
- MEMM: P(s | o) = Π_{t=1..n} P(s_t | s_{t−1}, o_t)
        = Π_{t=1..n} (1/Z(s_{t−1}, o_t)) exp [Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t)]
- CRF: P(s | o) = (1/Z(o)) exp Σ_{t=1..n} [Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t)]

Proposed Solutions
- Determinization:
  - not always possible
  - state-space explosion
- Fully-connected models: lack prior structural knowledge
- Conditional random fields (CRFs):
  - allow some transitions to vote more strongly than others
  - whole-sequence rather than per-state normalization
  - P(s | o) = (1/Z(o)) exp Σ_{t=1..n} Σ_f λ_f f(s_{t−1}, s_t, o, t)
Efficient Estimation: Forward-Backward Calculations
- Matrix notation: M_t(s′, s | o) = exp Λ_t(s′, s | o)
- P_Λ(s | o) = (1/Z_Λ(o)) Π_t M_t(s_{t−1}, s_t | o)
- Z_Λ(o) = (M_1(o) M_2(o) ... M_{n+1}(o))_{start,stop}
- Expectations: E_{P_Λ(s|o)}[G(s)] = Σ_{t,s′,s} α_t(s′ | o) g(s′, s) M_{t+1}(s′, s | o) β_{t+1}(s | o) / Z_Λ(o)

Training
- Gradient: ∂L(Λ)/∂λ_f = Σ_k [#f(s_k | o_k) − E_Λ[#f(S | o_k)]]
  where #f(s | o) = Σ_t f(s_{t−1}, s_t, o, t)
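A minimal sketch of computing Z(o) by the matrix-product (forward) recursion; `feats` is a hypothetical feature function, and summing over final states stands in for the stop state:

```python
import math

def crf_partition(obs, states, weights, feats):
    # Z(o) via forward recursion: alpha_t(s) = sum_{s'} alpha_{t-1}(s') * M_t(s', s | o)
    # with M_t(s', s | o) = exp( sum_f lambda_f f(s', s, o, t) ).
    def M(prev, cur, t):
        return math.exp(sum(weights.get(k, 0.0) * v for k, v in feats(prev, cur, obs, t).items()))
    alpha = {s: M("start", s, 0) for s in states}
    for t in range(1, len(obs)):
        alpha = {s: sum(alpha[p] * M(p, s, t) for p in states) for s in states}
    return sum(alpha.values())

feats = lambda prev, cur, obs, t: {f"{prev}->{cur}": 1.0, f"{cur}:{obs[t]}": 1.0}
print(crf_partition(["r", "i", "b"], ["0", "1", "2"], {}, feats))
```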
Mixed-Order Sources

Part-of-Speech Tagging

Model   error   oov error
HMM     5.69%   45.99%
MEMM    6.37%   54.61%
CRF     5.55%   48.05%
MEMM+   4.81%   26.99%
CRF+    4.27%   23.76%

Structured Models: Stochastic Grammars
SCFG Example
[Figure: parse tree for "revolutionary new ideas advance slowly", built from the rules below]
- Related sentences: "Sleep furiously is all that colorless green ideas do."

SCFG Derivations
- Rules with probabilities:
  S → NP VP        1.0
  NP → Det N'      0.2
  NP → N'          0.8
  N' → Adj N'      0.3
  N' → N           0.7
  VP → VP Adv      0.4
  VP → V           0.6
  N → ideas        0.1
  N → people       0.2
  V → sleep        0.3
- Context-freeness ⟹ independence of derivation steps:
  P(tree) = product of the probabilities of the rules used in its derivation
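A minimal sketch of scoring a derivation as a product of rule probabilities; the lexical rule probabilities marked below are invented to make the example runnable:

```python
# P(tree) = product over derivation steps of the rule probabilities (context-freeness
# makes the steps independent). Trees are (label, children...) tuples; leaves are strings.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("VP", "Adv")): 0.4,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("Adj", ("revolutionary",)): 0.1,   # invented lexical probabilities
    ("Adj", ("new",)): 0.1,
    ("V", ("advance",)): 0.1,
    ("Adv", ("slowly",)): 0.1,
}

def tree_prob(tree):
    if isinstance(tree, str):           # terminal symbol
        return 1.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

tree = ("S",
        ("NP", ("N'", ("Adj", "revolutionary"), ("N'", ("Adj", "new"), ("N'", ("N", "ideas"))))),
        ("VP", ("VP", ("V", "advance")), ("Adv", "slowly")))
print(tree_prob(tree))
```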
Inside-Outside Reestimation
- Ratios of expected rule frequencies to expected phrase frequencies:
  P_{n+1}(N' → Adj N') = E_n(#(N' → Adj N')) / E_n(#N')
- Computed from the probabilities of each phrase and phrase context in the training corpus
- Iterate
- Partial bracketing of the training corpus gives improved convergence and improved computational complexity

Predictive Power
[Figure: cross entropy vs. iteration (0-80) for raw and bracketed training, on training and test corpora]
Bracketing Accuracy
[Figure: bracketing accuracy vs. iteration (0-80) for raw-test and bracketed-test conditions]

Limitations of SCFGs
- Markov models: "We need to resolve the issue"
- Hierarchical models

Lexicalization
- Markov model
- Dependency model
Lexicalized Tree
[Figure: lexicalized parse of "workers dumped sacks into a bin": S(dumped) → NP-C(workers) VP(dumped); VP(dumped) → V(dumped) NP-C(sacks) PP(into); NP-C(sacks) → N(sacks); PP(into) → P(into) NP-C(bin); NP-C(bin) → D(a) N(bin)]

Dependencies:
  Dependency          Direction   Relation
  workers → dumped    L           NP, S, VP
  sacks → dumped      R           NP, VP, V
  into → dumped       R           PP, VP, V
  a → bin             L           D, NP, N
  bin → into          R           NP, PP, P
Unsupervised Learning

Information Bottleneck
- Find an efficient compression of some observables
- preserving the information about other observables

Generalization
- Dimensionality reduction:
  - smaller models
  - improved classification accuracy
Complex Events

Chomsky's Challenge to Empiricism
- (1) Colorless green ideas sleep furiously.
- (2) Furiously sleep ideas green colorless.
- Chomsky 57

Distributional Clustering: In Any Model?
- Class-based bigram: P(w_{i+1} | w_i) ≈ Σ_c P(w_{i+1} | c) P(c | w_i)
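A minimal sketch of the class-based bigram estimate; the class distributions are hand-set for illustration:

```python
def class_bigram_prob(w_next, w, p_class_given_word, p_word_given_class):
    # P(w_{i+1} | w_i) ~ sum_c P(w_{i+1} | c) P(c | w_i)
    return sum(p_word_given_class[c].get(w_next, 0.0) * pc
               for c, pc in p_class_given_word.get(w, {}).items())

# hypothetical distributions for two hidden classes
p_class_given_word = {"green": {"ADJ_CLASS": 1.0}, "ideas": {"NOUN_CLASS": 1.0}}
p_word_given_class = {"ADJ_CLASS": {"ideas": 0.3, "bonds": 0.2},
                      "NOUN_CLASS": {"sleep": 0.1, "advance": 0.05}}
print(class_bigram_prob("ideas", "green", p_class_given_word, p_word_given_class))
```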
Distributional Representation

Training Data
[Figure: relative frequencies of the verbs hold, buy, issue, purchase, trade, sell, tender, select, own occurring with the nouns "stock" and "bond"]

Distributional Clustering Objective
- Markov condition: p(ñ | v) = Σ_n p(ñ | n) p(n | v)
- Find p(ñ | n) to maximize the mutual information I(Ñ, V) for fixed I(Ñ, N), where
  I(N, V) = Σ_{n,v} p(n, v) log [p(n, v) / (p(n) p(v))]
- Solution: p(ñ | n) = (p(ñ) / Z_n) exp(−β D_KL(p(V | n) ‖ p(V | ñ)))
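A minimal sketch of this cluster-membership formula; the verb distributions and centroids are invented for illustration:

```python
import math

def kl(p, q):
    # D_KL(p || q) over dicts word -> prob; assumes q > 0 wherever p > 0
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)

def cluster_membership(p_v_given_n, p_v_given_clusters, p_cluster, beta=1.0):
    # p(c | n) = p(c)/Z_n * exp(-beta * D_KL(p(V|n) || p(V|c)))
    scores = {c: p_cluster[c] * math.exp(-beta * kl(p_v_given_n, q))
              for c, q in p_v_given_clusters.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

p_v_given_n = {"buy": 0.5, "sell": 0.5}
centroids = {"finance": {"buy": 0.4, "sell": 0.4, "break": 0.2},
             "chemistry": {"buy": 0.05, "sell": 0.05, "break": 0.9}}
print(cluster_membership(p_v_given_n, centroids, {"finance": 0.5, "chemistry": 0.5}))
```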
Small Example
- Cluster members with distances, e.g.:
  officer 0.484, aide 0.612, chief 0.649, manager 0.651
  gun 0.758, missile 0.786, weapon 0.862, rocket 0.875
  missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
  shot 0.858, bullet 0.925, rocket 0.930, missile 1.037
[Figure: I(Ñ,V)/I(Ñ,N) vs. I(Ñ,N)/H(N), increasing with β]
- Smoothed estimate used in experiments: p̂(v | n) = Σ_ñ p(v | ñ) p(ñ | n)
  (a variant normalized by Z_n, which depends on β, would be more appropriate)
Evaluation

Relative Entropy
[Figure: relative entropy vs. number of clusters (0-600)]

Decision Task
[Figure: decision error vs. number of clusters (0-500), with curves for "all" and "exceptional" cases]