Machine Learning in Natural Language Processing

Introduction

Fernando Pereira
University of Pennsylvania
NASSLLI, June 2002

Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee,
Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby
Why ML in NLP

- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors are involved in language interpretation
- People do it
  - AI
  - Cognitive science
- Let the computer do it
  - Moore's law
  - Cheap storage
  - Lots of data

Classification

- Document topic
  politics, business, national, environment
- Word sense
  treasury bonds vs. chemical bonds

Analysis

- Tagging:
  causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ
- Parsing (lexicalized tree for "workers dumped sacks into a bin"):

  (S(dumped) (NP-C(workers) (N(workers) workers))
             (VP(dumped) (V(dumped) dumped)
                         (NP-C(sacks) (N(sacks) sacks))
                         (PP(into) (P(into) into)
                                   (NP-C(bin) (D(a) a) (N(bin) bin)))))

Language Modeling

- Is this a likely English sentence?
  P(colorless green ideas sleep furiously) ≈ 2 × 10^5 × P(furiously sleep ideas green colorless)

Inference

- Translation:
  treasury bonds → obrigações do tesouro
  covalent bonds → ligações covalentes
- Disambiguate noisy transcription:
  It's easy to wreck a nice beach
  It's easy to recognize speech

Machine Learning Approach

- Algorithms that write programs
  - Specify:
    - the form of the output programs
    - an accuracy criterion
  - Input: a set of training examples
  - Output: a program that performs as accurately as possible on the training examples
- But will it work on new examples?

Information extraction

  Sara Lee to Buy 30% of DIM

  Chicago, March 3 - Sara Lee Corp [acquirer] said it agreed to buy a 30 percent
  interest in Paris-based DIM S.A. [acquired], a subsidiary of BIC S.A., at a cost
  of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of
  about 2 million dollars.
  The investment includes the purchase of 5 million newly issued DIM shares valued
  at about 5 million dollars, and a loan of about 15 million dollars, it said.
  The loan is convertible into an additional 16 million DIM shares, it noted.
  The proposed agreement is subject to approval by the French government, it said.

Fundamental Questions

- Generalization: is the learned program useful on new examples?
  - Statistical learning theory: quantifiable tradeoffs between the number of
    examples, the complexity of the program class, and generalization error
- Computational tractability: can we find a good program quickly?
  - If not, can we find a good approximation?
- Adaptation: can the program learn quickly from new evidence?
  - Information-theoretic analysis: relationship between adaptation and compression

Learning Tradeoffs

[Figure: error of the best program in the class on the training examples and on new
(test) examples, plotted against program class complexity; overfitting appears where
test error rises while training error keeps falling.]

Machine Learning Methods

- Rote learning
- Classifiers
  - Document classification
  - Disambiguation
- Structured models
  - Tagging
  - Parsing
  - Extraction
- Unsupervised learning
  - Generalization
  - Structure induction

Jargon

- Instance: event type of interest
  - Document and its class
  - Sentence and its analysis
- Supervised learning: learn a classification function from hand-labeled instances
- Unsupervised learning: exploit correlations to organize training instances
- Generalization: how well does it work on unseen data?
- Features: map an instance to a set of elementary events

Classification Ideas

- Represent instances by feature vectors
  - Content
  - Context
- Learn a function from feature vectors to
  - a class, or
  - a class-probability distribution
- Redundancy is our friend: many weak clues

Structured Model Ideas

- Interdependent decisions
  - Successive parts-of-speech
  - Parsing/generation steps
  - Lexical choice
  - Parsing
  - Translation
- Combining decisions
  - Sequential decisions
  - Generative models
  - Constraint satisfaction

Unsupervised Learning Ideas

- Clustering: class induction
  - Know words by the company they keep
- Latent variables
  - "I'm thinking of sports" → more sporty words
- Distributional regularities
- Data compression
- Infer dependencies among variables: structure learning

Methodological Detour

- Empiricist/information-theoretic view: words combine following their
  associations in previous material
- Rationalist/generative view: words combine according to a formal grammar in
  the class of possible natural-language grammars

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these
sentences) has ever occurred in an English discourse. Hence, in any statistical model
for grammaticalness, these sentences will be ruled out on identical grounds as equally
remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
(Chomsky 1957)

The Return of Empiricism

- Empiricist methods work:
  - Markov models can capture a surprising fraction of the unpredictability in language
  - Statistical information retrieval methods beat alternatives
  - Statistical parsers are more accurate than competitors based on rationalist methods
  - Machine-learning, statistical techniques are close to human performance in
    part-of-speech tagging and sense disambiguation
- Just engineering tricks?

Unseen Events

- Chomsky's implicit assumption: any model must assign zero probability to unseen events
  - naive estimation of Markov model probabilities from frequencies
  - no latent (hidden) events
- Any such model overfits the data: many events are likely to be missing in any finite sample
  ⇒ The learned model cannot generalize to unseen data
  ⇒ Support for "poverty of the stimulus" arguments

The Science of Modeling

- Probability estimates can be smoothed to accommodate unseen events
- Redundancy in language supports effective statistical inference procedures:
  the stimulus is richer than it might seem
- Statistical learning theory: the generalization ability of a model class can be
  measured independently of the model representation
- Beyond Markov models: the effects of latent conditioning variables can be
  estimated from data

Richness of the Stimulus

- Word statistics carry more information than it might seem
- Information (mutual information) about:
  - relations between linguistic and non-linguistic events
  - relations between parts of a linguistic event
- Global coherence:
  banks can now sell stocks and bonds
- Markov models in speech recognition
- Success of the bag-of-words model in information retrieval
- Statistical machine translation

Questions

- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?
- How far can these methods go?

Classification

Generative or Discriminative?

- Generative models
  - Estimate the instance-label distribution p(x, y)
- Discriminative models
  - Estimate the label-given-instance distribution p(y | x)
  - Or minimize an upper bound on the training error Σ_i [[ f(x_i) ≠ y_i ]]

Simple Generative Model

- Binary naive Bayes:
  - Represent instances by sets of binary features
    (does a word occur in the document?)
  - Finite predefined set of classes

  P(F_1, ..., F_n, C) = P(C) ∏_i P(F_i | C)
  P(c | F_1, ..., F_n) ∝ P(c) ∏_i P(F_i | c)

  [Graphical model: class C with arcs to features F_1, F_2, F_3, F_4]

Generative Claims

- Easy to train: just count
- Language modeling: probability of observed forms
- More robust
  - Small training sets
  - Label noise
- Full advantage of probabilistic methods

Discriminative Models

- Define a functional form for p(y | x; θ)
- Binary classification: define a discriminant function
  y = sign h(x; θ)
- Adjust the parameter(s) θ to maximize the probability of the training labels /
  minimize training error

Simple Discriminative Forms

- Linear discriminant function:
  h(x; θ_0, θ_1, ..., θ_n) = θ_0 + Σ_i θ_i f_i(x)
- Logistic form:
  P(+1 | x) = 1 / (1 + exp(−h(x; θ)))
- Multi-class exponential form (maxent):
  h(x, y; θ_0, θ_1, ..., θ_n) = θ_0 + Σ_i θ_i f_i(x, y)
  P(y | x; θ) = exp h(x, y; θ) / Σ_y' exp h(x, y'; θ)
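As an illustration of the generative side, here is a minimal sketch (not from the slides)
of training and applying a binary naive Bayes classifier by counting; the toy documents,
feature sets, and add-alpha smoothing constant are illustrative assumptions.

from collections import Counter, defaultdict
from math import log

def train_nb(docs, labels, alpha=1.0):
    """docs: list of sets of binary features (word types); labels: class per doc."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # feat_counts[c][f] = #docs of class c containing f
    vocab = set()
    for d, c in zip(docs, labels):
        vocab |= d
        for f in d:
            feat_counts[c][f] += 1
    return class_counts, feat_counts, vocab, alpha

def predict_nb(model, doc):
    class_counts, feat_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = log(nc / n)                      # log P(C)
        for f in vocab:                          # product over the binary features F_i
            p = (feat_counts[c][f] + alpha) / (nc + 2 * alpha)
            score += log(p if f in doc else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([{"bond", "treasury"}, {"bond", "chemical"}], ["finance", "science"])
print(predict_nb(model, {"treasury", "bond"}))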

Discriminative Claims

- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Classification Tasks

- Document categorization
  - News categorization
  - Message filtering
  - Web page selection
- Tagging
  - Named entity
  - Part-of-speech
  - Sense disambiguation
- Syntactic decisions
  - Attachment

Document Models

- Binary vector: f_t(d) = [t ∈ d]
- Frequency vector: r_t(d) = tf(d, t)
- N-gram language model:
  p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | d_1 ... d_{i−1}; c)
  p(d_i | d_1 ... d_{i−1}; c) ≈ p(d_i | d_{i−n} ... d_{i−1}; c)

Term Weighting and Feature Selection

- Select or weight the most informative features
- TF*IDF: adjust a term's weight by how document-specific the term is
  tf(d, t) = number of occurrences of t in d
  idf(t) = |D| / |{d ∈ D : t ∈ d}|
  raw frequency: r_t(d) = tf(d, t)
  TF*IDF: x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t))
- Feature selection: remove low, unreliable counts
  - Mutual information
  - Information gain
  - Other statistics
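A small sketch of the TF*IDF weighting above (my own illustration; the
log(1 + tf) · log(1 + idf) form follows the slide, the toy documents are assumptions):

from math import log

def tfidf_vectors(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    df = {}                                    # document frequency of each term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        tf = {}
        for t in d:
            tf[t] = tf.get(t, 0) + 1
        idf = {t: len(docs) / df[t] for t in tf}   # idf(t) = |D| / |{d : t in d}|
        weights.append({t: log(1 + tf[t]) * log(1 + idf[t]) for t in tf})
    return weights

print(tfidf_vectors([["treasury", "bond", "bond"], ["chemical", "bond"]]))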

Documents vs. Vectors (1)

- Many documents have the same binary or frequency vector
- Document multiplicity must be handled correctly in probability models
- Binary naive Bayes:
  p(f | c) = ∏_t [ f_t p(t | c) + (1 − f_t)(1 − p(t | c)) ]
- Multiplicity is not recoverable

Documents vs. Vectors (2)

- Unigram model:
  p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | c)
  p(c | d) = p(c) p(d | c) / Σ_{c'} p(c') p(d | c')
- Raw frequency vector probability, with r_t = tf(d, t) and L = Σ_t r_t:
  p(r | c) = p(L | c) L! ∏_t p(t | c)^{r_t} / r_t!

Documents vs. Vectors (3)

- Document probability (unigram language model):
  p(c | r) = p(c) p(L | c) ∏_t p(t | c)^{r_t} / Σ_{c'} p(c') p(L | c') ∏_t p(t | c')^{r_t}

Linear Classifiers

- Swiss Army knife
- Embedding into a high-dimensional vector space
  - Geometric intuitions and techniques
  - Easier separability
  - Increase dimension with interaction terms
  - Nonlinear embeddings (kernels)

Kinds of Linear Classifiers

- Naive Bayes
- Exponential models
- Large-margin classifiers
  - Support vector machines (SVM)
  - Boosting
- Online methods
  - Perceptron
  - Winnow

Learning Linear Classifiers

- Rocchio:
  w_k = max(0, Σ_{x ∈ c} x_k / |c| − Σ_{x ∈ D−c} x_k / |D − c|)
- Widrow-Hoff:
  w ← w − 2η (w · x_i − y_i) x_i
- (Balanced) Winnow: y = sign(w+ · x − w− · x − θ)
  - positive error: w+ ← α w+, w− ← β w−, with α > 1 > β > 0
  - negative error: w+ ← β w+, w− ← α w−

Linear Classification

- Linear discriminant function:
  h(x) = w · x + b = Σ_k w_k x_k + b
- Margin
  - Instance (functional) margin: γ_i = y_i (w · x_i + b)
  - Normalized (geometric) margin: γ_i = y_i ((w / ‖w‖) · x_i + b / ‖w‖)
  - Training set margin γ: the smallest geometric margin over the training set

[Figure: positive (x) and negative (o) points separated by a hyperplane with weight
vector w and offset b, with the geometric margins γ_i, γ_j of individual points marked.]

Perceptron Algorithm

- Given:
  - a linearly separable training set S
  - a learning rate η > 0

  w ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1 ... N
      if y_i (w · x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes

Duality

- The final hypothesis is a linear combination of the training points:
  w = Σ_i α_i y_i x_i,  with α_i ≥ 0
- Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1 ... N
      if y_i ( Σ_j α_j y_j x_j · x_i + b ) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes

Why Maximize the Margin?

- There is a constant c such that, for any data distribution D with support in a
  ball of radius R and any training sample S of size N drawn from D,

  err(h) ≤ (c / N) ( (R² / γ²) log² N + log(1/δ) )

  with probability at least 1 − δ, where γ is the margin of h on S.
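A minimal runnable sketch of the primal perceptron update above (the toy data and
labels in {−1, +1} are assumptions for illustration):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """X: (N, d) array, y: (N,) array of +/-1 labels. Returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = max(np.linalg.norm(x) for x in X)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:          # misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi * R ** 2
                mistakes += 1
        if mistakes == 0:                       # converged: a full pass with no mistakes
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))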

Canonical Hyperplanes

- Multiple representations for the same hyperplane: (λw, λb), λ > 0
- Canonical hyperplane: functional margin = 1
- Geometric margin for a canonical hyperplane:

  γ = (1/2) (w / ‖w‖) · (x+ − x−)
    = (1 / (2‖w‖)) (w · x+ − w · x−)
    = 1 / ‖w‖

Convex Optimization (1)

- Constrained optimization problem:
  min_{w ∈ Ω} f(w)
  subject to g_i(w) ≤ 0, h_j(w) = 0
- Lagrangian function:
  L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w)
- Dual problem:
  max_{α,β} inf_{w ∈ Ω} L(w, α, β)  subject to α_i ≥ 0

Convex Optimization (2)

- f convex; g_i, h_j affine (h(w) = Aw − b)
- Kuhn-Tucker conditions: the solution w*, α*, β* must satisfy
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0      (complementarity: a parameter is non-zero iff its constraint is active)
  g_i(w*) ≤ 0
  α_i* ≥ 0

Maximizing the Margin (1)

- Given a separable training sample:
  min_{w,b} ‖w‖² = w · w  subject to y_i (w · x_i + b) ≥ 1
- Lagrangian:
  L(w, b, α) = (1/2) w · w − Σ_i α_i [ y_i (w · x_i + b) − 1 ]
  ∂L(w, b, α)/∂w = w − Σ_i y_i α_i x_i = 0
  ∂L(w, b, α)/∂b = Σ_i y_i α_i = 0

Maximizing the Margin (2)

- Dual Lagrangian at the stationary point:
  W(α) = L(w*, b*, α) = Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j x_i · x_j
- Dual maximization problem:
  max_α W(α)  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Maximum margin weight vector:
  w* = Σ_i y_i α_i* x_i  with margin γ = 1/‖w*‖ = ( Σ_{i ∈ SV} α_i* )^{−1/2}

Building the Classifier

- Computing the offset (from the primal constraints):
  b* = −( max_{y_i = −1} w* · x_i + min_{y_i = +1} w* · x_i ) / 2
- Decision function:
  h(x) = sgn( Σ_i y_i α_i* x_i · x + b* )

Consequences

- The complementarity condition yields the support vectors:
  α_i [ y_i (w* · x_i + b) − 1 ] = 0,  so  α_i > 0 ⇒ w* · x_i + b = y_i
- A functional margin of 1 implies minimum geometric margin γ = 1/‖w*‖

General SVM Form

- Margin maximization for an arbitrary kernel K:
  max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j K(x_i, x_j)
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Decision rule:
  h(x) = sgn( Σ_i y_i α_i* K(x_i, x) + b* )

Soft Margin

- Handles the non-separable case
- Primal problem (2-norm):
  min_{w,b,ξ} w · w + C Σ_i ξ_i²
  subject to y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
- Dual problem:
  max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j ( x_i · x_j + (1/C) δ_ij )
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
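In practice the dual problems above are handed to an off-the-shelf QP/SMO solver; a
minimal sketch using scikit-learn's SVC (a tooling assumption, not part of the slides;
the toy data are also assumed):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])                   # XOR-like labels: not linearly separable

# Soft-margin SVM with an RBF kernel K(x, x'); C trades margin width against slack.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)
print(clf.support_)                            # indices of the support vectors (alpha_i > 0)
print(clf.predict(X))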

Conditional Maxent Model

- Model form:
  p(y | x; Λ) = exp( Σ_k λ_k f_k(x, y) ) / Z(x; Λ)
  Z(x; Λ) = Σ_y exp( Σ_k λ_k f_k(x, y) )
- Useful properties:
  - Multi-class
  - May use different features for different classes
  - Training is a convex optimization

Duality

- Maximize the conditional log-likelihood:
  Λ̂ = argmax_Λ Σ_i log p(y_i | x_i; Λ)
- Maximizing the conditional entropy
  p̂ = argmax_p − Σ_i Σ_y p(y | x_i) log p(y | x_i)
  subject to the constraints
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i) f_k(x_i, y)
  yields p̂(y | x) = p(y | x; Λ̂)

Relationship to (Binary) Logistic Discrimination

  p(+1 | x) = exp( Σ_k λ_k f_k(x, +1) ) / [ exp( Σ_k λ_k f_k(x, +1) ) + exp( Σ_k λ_k f_k(x, −1) ) ]
            = 1 / [ 1 + exp( − Σ_k λ_k ( f_k(x, +1) − f_k(x, −1) ) ) ]
            = 1 / [ 1 + exp( − Σ_k λ_k g_k(x) ) ]

Relationship to Linear Discrimination

- Decision rule:
  sign log [ p(+1 | x) / p(−1 | x) ] = sign Σ_k λ_k g_k(x)
- Bias term: a parameter for an "always on" feature
- Question: relationship to other trainers for linear discriminant functions
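A minimal sketch of conditional maxent training by plain gradient ascent on the
conditional log-likelihood (the slides use iterative scaling or conjugate gradient;
this is the same objective with a simpler optimizer, and the feature map f(x, y),
learning rate, and toy data are illustrative assumptions):

import numpy as np

def train_maxent(feats, labels, classes, lr=0.1, iters=200):
    """feats[i][y] is the feature vector f(x_i, y); labels[i] is the observed class."""
    dim = len(feats[0][classes[0]])
    lam = np.zeros(dim)
    for _ in range(iters):
        grad = np.zeros(dim)
        for fxy, yi in zip(feats, labels):
            scores = np.array([lam @ fxy[y] for y in classes])
            p = np.exp(scores - scores.max())
            p /= p.sum()                                  # p(y | x; lambda)
            grad += fxy[yi]                               # observed feature counts
            grad -= sum(p[y] * fxy[y] for y in classes)   # expected feature counts
        lam += lr * grad
    return lam

# Tiny example: two classes, f(x, y) conjoins a 2-d input with the class label.
def f(x, y):
    v = np.zeros(4)
    v[2 * y: 2 * y + 2] = x
    return v

X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
feats = [{0: f(x, 0), 1: f(x, 1)} for x in X]
print(train_maxent(feats, [0, 1], classes=[0, 1]))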

Solution Techniques (1)

- Generalized iterative scaling (GIS)
- Parameter updates:
  λ_k ← λ_k + (1/C) log [ Σ_i f_k(x_i, y_i) / Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) ]
- Requires that the features add up to a constant independent of instance and
  label (add a slack feature):
  Σ_k f_k(x_i, y) = C  for all i, y

Solution Techniques (2)

- Improved iterative scaling (IIS)
- Parameter updates λ_k ← λ_k + δ_k, where δ_k solves
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
  with f#(x, y) = Σ_k f_k(x, y)
- For binary features this reduces to solving a polynomial with positive coefficients
- Reduces to GIS if the feature sum is constant

Deriving IIS (1)

- Conditional log-likelihood:
  l(Λ) = Σ_i log p(y_i | x_i; Λ)
- Log-likelihood update:
  l(Λ + Δ) − l(Λ) = Σ_i Δ · f(x_i, y_i) − Σ_i log [ Z(x_i; Λ + Δ) / Z(x_i; Λ) ]
                  = Σ_i Δ · f(x_i, y_i) − Σ_i log Σ_y p(y | x_i; Λ) e^{Δ · f(x_i, y)}
  and, since log x ≤ x − 1,
                  ≥ Σ_i Δ · f(x_i, y_i) + N − Σ_i Σ_y p(y | x_i; Λ) e^{Δ · f(x_i, y)} = A(Δ)

Deriving IIS (2)

- By Jensen's inequality:
  A(Δ) ≥ Σ_i Δ · f(x_i, y_i) + N
         − Σ_i Σ_y p(y | x_i; Λ) Σ_k [ f_k(x_i, y) / f#(x_i, y) ] e^{δ_k f#(x_i, y)} = B(Δ)
- Maximize the lower bound on the update:
  ∂B(Δ)/∂δ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} = 0

Solution Techniques (3)

- GIS is very slow if the slack variable takes large values
- IIS is faster, but still problematic
- Recent suggestion: use standard convex optimization techniques,
  e.g. conjugate gradient
  - Some evidence of faster convergence

Gaussian Prior

- Log-likelihood gradient:
  ∂l(Λ)/∂λ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) − λ_k / σ_k²
- Modified IIS update λ_k ← λ_k + δ_k, where δ_k solves
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} + (λ_k + δ_k) / σ_k²

Instance Representation

- Fixed-size instance (PP attachment): binary features
  - Word identity
  - Word class
- Variable-size instance (document classification)
  - Word identity
  - Word relative frequency in the document

Enriching Features

- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown-word features: suffixes, capitalization
- Feature combinations (cf. n-grams)

Structured Models: Finite State

"I understood each and every word you said but not the order in which they appeared."

Structured Model Applications

- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing

Structured Models

- Assign a labeling to a sequence
  - Story segmentation
  - POS tagging
  - Named entity extraction
  - (Shallow) parsing

Constraint Satisfaction in Structured Models

- Train to minimize labeling loss:
  θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
  argmin_y Loss(x, y | θ)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - An efficient algorithm to combine the decisions

Local Classification Models

- Train to minimize the per-decision loss in context:
  θ̂ = argmin_θ Σ_i Σ_{0 ≤ j < |x_i|} loss(y_{i,j} | x_i, y_i(¬j); θ)
- Apply by guessing the context and finding each lowest-loss label:
  argmin_{y_j} loss(y_j | x, y(¬j); θ)

Structured Model Claims

- Constraint satisfaction
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classification
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Example: Language Modeling

- n-gram (order n−1 Markov) model:
  P(w_k | w_1 ... w_{k−1}) ≈ P(w_k | w_{k−n+1} ... w_{k−1})
- Example: character n-grams of increasing order (Shannon, Lucky):
  0: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
  1: OCRO HLI RGWR NMIELWIS EU LL
  2: ON IE ANTSOUTINYS ARE T INCTORE ST
  3: IN NO IST LAT WHEY CRATICT FROURE
  4: ABOVERY UPONDULTS WELL THE CODERST
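A minimal sketch of the n-gram factorization above, as a bigram (order-1 Markov) model
with add-one smoothing; the smoothing choice and toy training data are illustrative
assumptions, not the slides' estimator.

from collections import Counter, defaultdict
from math import log, exp

def train_bigram(sentences):
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    vocab = set(unigrams) | {"</s>"}
    return unigrams, bigrams, vocab

def logprob(model, sent):
    unigrams, bigrams, vocab = model
    words = ["<s>"] + sent + ["</s>"]
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        # add-one smoothing so unseen bigrams get nonzero probability
        lp += log((bigrams[prev][cur] + 1) / (unigrams[prev] + len(vocab)))
    return lp

model = train_bigram([["colorless", "green", "ideas", "sleep", "furiously"]])
print(exp(logprob(model, ["colorless", "green", "ideas", "sleep", "furiously"])))
print(exp(logprob(model, ["furiously", "sleep", "ideas", "green", "colorless"])))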

Markov's Unreasonable Effectiveness

- Entropy estimates for English:

  model                                bits/char
  compress                             4.43
  word trigrams (Brown et al. 92)      1.75
  human prediction (Cover & King 78)   1.34

- Local word relations dominate the statistics (Jelinek):

  [Figure: word lattice over sentence completions such as "We need to resolve all of
  the important issues ...", with the candidate words and their counts at each position.]

Limits of Markov Models

- What's the probability of unseen events?
  - Bias forces nonzero probabilities for some unseen events
  - Typical bias: tie the probabilities of related events
- No dependency: likelihoods are based on sequencing, not dependency
  [The same word lattice: sequencing alone accounts for the observed regularities.]

Unseen Events (1)

- Relate a specific unseen event to more general seen events:
  eat pineapple → eat _
- Event decomposition: event → event1 event2
- Factoring via latent variables:
  eat pineapple → eat _ , _ pineapple
  P(eat | pineapple) ≈ Σ_C P(eat | C) P(C | pineapple)

Unseen Events (2)

- Discount the estimates for seen events
- Use the leftover probability mass for unseen events
- How to allocate the leftover?
  - Back off from the unseen event to less specific seen events: n-gram to (n−1)-gram
  - Hypothesize a hidden cause for unseen events: latent variable model
  - Relate the unseen event to distributionally similar seen events

Important Detour: Latent Variable Models

Expectation-Maximization (EM)

- Latent (hidden) variable models:
  p(y, x | Λ) = Σ_z p(y, x, z | Λ)
- Examples:
  - Mixture models
  - Class-based models (hidden classes)
  - Hidden Markov models

Maximizing Likelihood

- Data log-likelihood for D = {(x_1, y_1), ..., (x_N, y_N)}:
  L(D | Λ) = Σ_i log p(x_i, y_i) = Σ_{x,y} p̃(x, y) log p(x, y | Λ)
  where p̃(x, y) = |{i : x_i = x, y_i = y}| / N
- Find the parameters that maximize the (log-)likelihood:
  Λ̂ = argmax_Λ Σ_{x,y} p̃(x, y) log p(x, y | Λ)

Convenient Lower Bounds (1)

- Convex function:
  α f(x_0) + (1 − α) f(x_1) ≥ f(α x_0 + (1 − α) x_1)
  [Figure: a convex function f with the chord between f(x_0) and f(x_1) lying above it.]
- Jensen's inequality:
  f( Σ_x p(x) x ) ≤ Σ_x p(x) f(x)
  if f is convex and p is a probability density

Convenient Lower Bounds (2)

  [Figure: the data log-likelihood L(D | λ) together with successive lower bounds;
  EM alternates building a bound at an E step (E_0, E_1, ...) with maximizing it at
  the following M step (M_0, ...).]

Auxiliary Function

- Find a convenient non-negative function that lower-bounds the likelihood increase:
  L(D | Λ) − L(D | Λ') ≥ Q(Λ, Λ') ≥ 0
- Maximize the lower bound:
  Λ^(i+1) = argmax_Λ Q(Λ, Λ^(i))

Comments

- The likelihood keeps increasing, but:
  - it can get stuck in a local maximum (or a saddle point!)
  - it can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some Λ that increases
  the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done carefully
  (and sometimes cannot be avoided)

Example: Mixture Model

- Base distributions p_i(y), 1 ≤ i ≤ m
- Mixture coefficients λ_i ≥ 0, Σ_{i=1}^m λ_i = 1
- Mixture distribution:
  p(y | Λ) = Σ_i λ_i p_i(y)

Auxiliary Quantities

- Mixture coefficient λ_i = prior probability of being in class i
- Joint probability:
  p(c, y | Λ) = λ_c p_c(y)
- Auxiliary function:
  Q(Λ', Λ) = Σ_y p̃(y) Σ_c p(c | y, Λ) log [ p(y, c | Λ') / p(y, c | Λ) ]

Solution

- E step:
  C_i = Σ_y p̃(y) λ_i p_i(y) / Σ_j λ_j p_j(y)
- M step:
  λ_i ← C_i / Σ_j C_j

More Finite-State Models

Example: Information Extraction

- Given: the types of entities and relationships we are interested in
  - People, places, organizations, dates, amounts, materials, processes, ...
  - Employed by, located in, used for, arrived when, ...
- Find all entities and relationships of the given types in the source material
- Collect them in a suitable database
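A minimal sketch of the EM updates above for a two-component mixture of 1-D Gaussians;
the fixed variance, random initialization, and synthetic data are assumptions made for
brevity, not part of the slides.

import numpy as np

def em_mixture(y, m=2, iters=50, sigma=1.0):
    """EM for a mixture of m Gaussians with fixed variance sigma^2."""
    rng = np.random.default_rng(0)
    lam = np.full(m, 1.0 / m)                      # mixture coefficients lambda_i
    mu = rng.choice(y, size=m, replace=False)      # component means
    for _ in range(iters):
        # E step: posterior responsibility p(c | y_j, Lambda)
        dens = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
        post = lam * dens
        post /= post.sum(axis=1, keepdims=True)
        # M step: C_i = sum_j post[j, i]; lambda_i = C_i / sum_j C_j
        C = post.sum(axis=0)
        lam = C / C.sum()
        mu = (post * y[:, None]).sum(axis=0) / C
    return lam, mu

y = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 100)])
print(em_mixture(y))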

IE Example

[Figure: "Nance, who is also a paid consultant to ABC News, said ...", annotated with a
person entity ("Nance"), a person-descriptor ("a paid consultant to ABC News"), an
organization ("ABC News"), the employee relation between them, and a co-reference link.]

IE Methods

- Partial matching:
  - Hand-built patterns
  - Automatically trained hidden Markov models
  - Cascaded finite-state transducers
- Parsing-based:
  - Shallow parser (chunking)
  - Automatically induced grammar
  - Rely on: syntactic structure, phrase classification
  - Parse the whole text
  - Classify phrases and phrase relations as the desired entities and relationships

Global Constraint Models

- Train to minimize labeling loss:
  θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
  argmin_y Loss(x, y | θ)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - A dynamic programming algorithm to combine the decisions

Local Classification Models

- Train to minimize the per-symbol loss in context:
  θ̂ = argmin_θ Σ_i Σ_{0 ≤ j < |x_i|} loss(y_{i,j} | x_i, y_i(¬j); θ)
- Apply by guessing the context and finding each lowest-loss label:
  argmin_{y_j} loss(y_j | x, y(¬j); θ)

Structured Model Claims

- Global constraint
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classifier
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Generative vs. Discriminative

- Hidden Markov models (HMMs): generative, global
- Conditional exponential models (MEMMs, CRFs): discriminative, global
- Boosting, winnow: discriminative, local

Generative Models

- A stochastic process that generates instance-label pairs
  - Process structure
  - Process parameters
- (Hypothesize the structure)
- Estimate the parameters from training data

Model Structure

- Decompose the generation of instances into elementary steps
- Define dependencies between steps
- Parameterize the dependencies
- Useful descriptive language: graphical models

Binary Naive Bayes

- Represent instances by sets of binary features
  (does a word occur in the document?)
- Finite predefined set of classes

  P(F_1, ..., F_n, C) = P(C) ∏_i P(F_i | C)
  P(c | F_1, ..., F_n) ∝ P(c) ∏_i P(F_i | c)

  [Graphical model: class C with arcs to features F_1, ..., F_4]

Discrete Hidden Markov Model

- Instances: symbol sequences
- Labels: class sequences

  P(X, C) = P(C_0) P(X_0 | C_0) ∏_i P(C_i | C_{i−1}) P(X_i | C_i)

  [Graphical model: chain C_0 → C_1 → ... → C_4, each state C_i emitting X_i]

Generating Multiple Features

- Instances: sequences of feature sets
  - Word identity
  - Word properties (e.g. spelling, capitalization)
- Labels: class sequences
- Limitation: conditionally independent features

  [Graphical model: chain C_0 → ... → C_4, each state C_i emitting features F_i,0, F_i,1, F_i,2]

Independence or Intractability

- Trees are good: each node has a single immediate ancestor, so the joint
  probability is computed in linear time
- But that forces features to be conditionally independent given the class
- Unrealistic:
  - "San" and "Francisco" in a document
  - Suffixes and capitalization
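A minimal sketch of the discrete HMM factorization above: the forward algorithm sums
P(X, C) over all state sequences in linear time. The transition, emission, and initial
tables here are toy assumptions.

import numpy as np

def forward(pi, A, B, obs):
    """pi[c]=P(C_0=c), A[c,c']=P(C_t=c'|C_{t-1}=c), B[c,x]=P(X_t=x|C_t=c)."""
    alpha = pi * B[:, obs[0]]                  # alpha_0(c) = P(C_0=c) P(x_0 | c)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]          # sum over the previous state, then emit
    return alpha.sum()                         # P(x_0 ... x_n)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],                      # state 0 emissions over symbols {0, 1}
              [0.1, 0.9]])                     # state 1 emissions over symbols {0, 1}
print(forward(pi, A, B, [0, 1, 1]))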

Information Extraction with HMMs

[Seymore & McCallum 99] [Freitag & McCallum 99]

- Parameters = P(s|s'), P(o|s) for all states in S = {s1, s2, ...}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the database field

Score Card

- No independence assumptions
- Richer features: combinations of existing features
- Optimization problem for parameters
- Limited probabilistic interpretation
- Insensitive to input distribution

Problems with HMMs

1. Would prefer a richer representation of text: multiple overlapping features,
   whole chunks of text
   - Example word features:
     - identity of word
     - word is in all caps
     - word ends in -tion
     - word is part of a noun phrase
     - word is in bold font
     - word is on left hand side of page
     - word is under node X in WordNet
   - Example line features:
     - length of line
     - line is centered
     - percent of non-alphabetics
     - total amount of white space
     - line contains two verbs
     - line begins with a number
     - line is grammatically a question
2. HMMs are generative models.
   - Generative models do not easily handle overlapping, non-independent features.
   - Would prefer a conditional model: P({s}|{o}).

Solution: Conditional Model

- Hidden Markov model: P(s|s'), P(o|s)
- Maximum entropy Markov model: P(s|o,s') (represented by an exponential model)
- For the time being, capture the dependency on s' with |S| independent
  functions, Ps'(s|o)
- Each state contains a "next-state classifier" black box that, given the next
  observation, produces a probability distribution over possible next states, Ps'(s|o)

Two Sequence Models

- HMM: transitions P(s|s') from s_{t−1} to s_t, emissions P(o|s) from s_t to o_t
- MEMM: next-state distribution P(s|o,s'), conditioned on both s_{t−1} and o_t
- Standard belief propagation: forward-backward procedure
- Viterbi and Baum-Welch follow naturally

Transition Features

- Model Ps'(s|o) in terms of multiple arbitrary overlapping (binary) features
- Example observation predicates:
  - o is the word "apple"
  - o is capitalized
  - o is on a left-justified line
- A feature f depends on both a predicate b and a destination state s:

  f_<b,s>(o, s') = 1 if b(o) is true and s' = s, 0 otherwise

Next-State Classifier

- Per-state conditional maxent model:

  Ps'(s | o) = (1 / Z(o, s')) exp( Σ_<b,q> λ_<b,q> f_<b,q>(o, s) )

- Training: each state model is trained independently from labeled sequences

Example: Q-A pairs from FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal
  programs to work properly. They are as far as I know the informal standard
  agreed upon by commercial comms software developers for the Arc.
  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This
  is to avoid the well known serial port chip bugs. The modem's DCD (Data
  Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator)
  most modems broadcast a software RING signal anyway, and even then it's
  really necessary to detect it for the model to answer the call.

  2.7) The sound from the speaker port seems quite muffled.
  How can I get unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove
  high frequency harmonics from the sound output. To bypass the filter, hook
  into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip
  39) and hook the capacitor like this:

Experimental Data

- 38 files belonging to 7 UseNet FAQs
- Lines are labeled <head>, <question>, or <answer>, e.g.:

  <head>     X-NNTP-Poster: NewsHound v1.33
  <head>     Archive-name: acorn/faq/part2
  <head>     Frequency: monthly
  <head>
  <question> 2.6) What configuration of serial cable should I use?
  <answer>   Here follows a diagram of the necessary connection
  <answer>   programs to work properly. They are as far as I know
  <answer>   agreed upon by commercial comms software developers fo
  <answer>   Pins 1, 4, and 8 must be connected together inside
  <answer>   is to avoid the well known serial port chip bugs. The

- Procedure: for each FAQ, train on one file, test on the others; average.

Features in Experiments

  begins-with-number          contains-question-mark
  begins-with-ordinal         contains-question-word
  begins-with-punctuation     ends-with-question-mark
  begins-with-question-word   first-alpha-is-capitalized
  begins-with-subject         indented
  blank                       indented-1-to-4
  contains-alphanum           indented-5-to-10
  contains-bracketed-number   more-than-one-third-space
  contains-http               only-punctuation
  contains-non-space          prev-is-blank
  contains-number             prev-begins-with-ordinal
  contains-pipe               shorter-than-30

Models Tested

- ME-Stateless: a single maximum entropy classifier applied to each line independently.
- TokenHMM: a fully-connected HMM with four states, one for each of the line
  categories, each of which generates individual tokens (groups of alphanumeric
  characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, except that the lines in a document are
  first converted to sequences of features.
- MEMM: maximum entropy Markov model.

Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681

Label Bias Problem

- Example (after Bottou 91): a finite-state model with two paths, 0→1→2→3
  spelling "rib" and 0→4→5→6 spelling "rob".
- With per-state normalization, state 1 has a single outgoing transition, so
  P(1, 2 | ro) = P(1 | r) P(2 | o, 1) = P(1 | r) · 1
  P(1, 2 | ri) = P(1 | r) P(2 | i, 1) = P(1 | r) · 1
  and the observation cannot change the preference between the paths.
- Bias toward states with fewer outgoing transitions.
- Per-state normalization does not allow the required score(1,2|ro) << score(1,2|ri).

Proposed Solutions

- Determinization:
  - not always possible
  - state-space explosion
- Fully-connected models:
  - lack prior structural knowledge
- Conditional random fields (CRFs):
  - Allow some transitions to "vote" more strongly than others
  - Whole-sequence rather than per-state normalization

From HMMs to CRFs

  s = s_1 ... s_n,  o = o_1 ... o_n

  HMM:   P(s | o) ∝ ∏_t P(s_t | s_{t−1}) P(o_t | s_t)

  MEMM:  P(s | o) = ∏_t P(s_t | s_{t−1}, o_t)
                  = ∏_t (1 / Z(s_{t−1}, o_t)) exp( Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t) )

  CRF:   P(s | o) = (1 / Z(o)) ∏_t exp( Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t) )

CRF General Form

- The state sequence is a Markov random field conditioned on the observation sequence.
- Model form:
  P(s | o) = (1 / Z(o)) ∏_t exp( Σ_f λ_f f(s_{t−1}, s_t, o, t) )
- Features represent the dependency between successive states, conditioned on
  the observations
- Dependence on the whole observation sequence o (not possible in HMMs)

Efficient Estimation

- Matrix notation:
  M_t(s', s | o) = exp Λ_t(s', s | o)
  Λ_t(s', s | o) = Σ_f λ_f f(s_{t−1} = s', s_t = s, o, t)
  P_Λ(s | o) = (1 / Z_Λ(o)) ∏_t M_t(s_{t−1}, s_t | o)
  Z_Λ(o) = ( M_1(o) M_2(o) ... M_{n+1}(o) )_{start,stop}
- Efficient normalization: forward-backward algorithm

Forward-Backward Calculations

  α_t(o) = α_{t−1}(o) M_t(o)
  β_t(o)ᵀ = M_{t+1}(o) β_{t+1}(o)ᵀ
  Z_Λ(o) = α_{n+1}(end | o) = β_0(start | o)

- For any path function G(s) = Σ_t g_t(s_{t−1}, s_t):

  E_Λ[G] = Σ_s P_Λ(s | o) G(s)
         = (1 / Z_Λ(o)) Σ_{t,s',s} α_t(s' | o) g_{t+1}(s', s) M_{t+1}(s', s | o) β_{t+1}(s | o)

Training

- Maximize L(Λ) = Σ_k log P_Λ(s^k | o^k)
- Log-likelihood gradient:
  ∂L(Λ)/∂λ_f = Σ_k #f(s^k | o^k) − Σ_k E_Λ[ #f(S | o^k) ]
  where #f(s | o) = Σ_t f(s_{t−1}, s_t, o, t)
- Methods: iterative scaling, conjugate gradient
- Cost comparable to standard Baum-Welch

Label Bias Experiment

- Data source: a noisy version of the rib/rob automaton, with
  P(intended symbol) = 29/30, P(other) = 1/30.
- Train both an MEMM and a CRF with identical topologies on data from the source.
- Decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test).
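A minimal sketch of the matrix form above: Z_Λ(o) computed as the (start, stop) entry of
the product of per-position transition matrices M_t, here via a forward recursion. The
weight function, state names, and toy observations are assumptions for illustration.

import numpy as np

def crf_partition(obs, states, weight):
    """weight(s_prev, s, obs, t) = sum_f lambda_f f(s_prev, s, obs, t); returns Z(o)."""
    n = len(obs)
    start, stop = "<start>", "<stop>"
    alpha = {start: 1.0}                        # alpha_0, before any label
    for t in range(1, n + 2):                   # positions 1..n carry labels, n+1 is <stop>
        cur = [stop] if t == n + 1 else states
        alpha = {s: sum(a * np.exp(weight(sp, s, obs, t)) for sp, a in alpha.items())
                 for s in cur}                  # alpha_t = alpha_{t-1} M_t
    return alpha[stop]                          # Z(o) = (M_1 ... M_{n+1})_{start,stop}

# Toy weight function: prefer staying in the same state, plus one observation feature.
def weight(s_prev, s, obs, t):
    w = 1.0 if s_prev == s else 0.0
    if t <= len(obs) and s == "Q" and obs[t - 1].endswith("?"):
        w += 2.0
    return w

print(crf_partition(["2.6)", "What cable should I use?", "Here follows a diagram."],
                    ["Q", "A"], weight))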

Mixed-Order Sources

- Data generated by mixing sparse first- and second-order HMMs with a varying
  mixing coefficient.
- Modeled by a first-order HMM, an MEMM, and a CRF (without contextual or
  overlapping features).

Part-of-Speech Tagging

- Trained on 50% of the 1.1 million words in the Penn treebank. In this set,
  5.45% of the words occur only once, and were mapped to "oov".
- Experiments with two different feature sets:
  - traditional: just the words
  - take advantage of the power of conditional models: use words plus
    overlapping features: capitalized, begins with #, contains hyphen, ends in
    -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

POS Tagging Results

  model    error    oov error
  HMM      5.69%    45.99%
  MEMM     6.37%    54.61%
  CRF      5.55%    48.05%
  MEMM+    4.81%    26.99%
  CRF+     4.27%    23.76%

Structured Models: Stochastic Grammars

Beyond Finite State

- Constituency and meaningful relations:
  What sleeps? How does it sleep? What kind of ideas?
  [Parse tree for "revolutionary new ideas slowly advance", with nested N' and VP structure.]
- Related sentences:
  Sleep furiously is all that colorless green ideas do.

Stochastic Context-Free Grammars

- Rules with probabilities (as listed on the slide):

  S  → NP VP    1.0        N' → Adj N'   0.3        N → ideas    0.1
  NP → Det N'   0.2        N' → N        0.7        N → people   0.2
  NP → N'       0.8        VP → VP Adv   0.4        V → sleep    0.3
                           VP → V        0.6

SCFG Derivations

- Context-freeness ⇒ independence of derivation steps: the probability of a
  derivation is the product of the probabilities of the rules used, e.g.
  P(tree) = P(S → NP VP) · P(NP → N') · P(N' → Adj N') · ...

Stochastic CFG Inference

- Inside-outside algorithm (Baker 79): find rule probabilities that locally
  maximize the likelihood of a training corpus (an instance of EM)
- Extended inside-outside algorithm: use information about the phrase structure
  of the training corpus to guide rule probability reestimation
  - Better modeling of phrase structure
  - Improved convergence
  - Improved computational complexity
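Because derivation steps are independent, a tree's probability is just the product of
its rule probabilities; a minimal sketch (the rule table mixes the values listed above
with an assumed probability for the illustrative Adj → colorless rule):

# P(rule) for a toy SCFG fragment.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("V", ("sleep",)): 0.3,
    ("Adj", ("colorless",)): 0.05,     # assumed value, not on the slide
}

def tree_prob(tree):
    """tree = (label, [children]); leaves are plain strings."""
    label, children = tree
    kids = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, kids)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)          # independence of derivation steps
    return p

t = ("S", [("NP", [("N'", [("Adj", ["colorless"]),
                           ("N'", [("N", ["ideas"])])])]),
           ("VP", [("V", ["sleep"])])])
print(tree_prob(t))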

Inside-Outside Reestimation

- Ratios of expected rule frequencies to expected phrase frequencies:

  P_{n+1}(N' → Adj N') = E_n(# N' → Adj N') / E_n(# N')

- Computed from the probabilities of each phrase and phrase context in the
  training corpus
  [Figure: a parse tree with one phrase highlighted and its surrounding phrase context.]
- Iterate

Problems with I-O Reestimation

- Hill-climbing procedure: sensitive to the initial rule probabilities
- Does not learn grammar structure directly: structure is only implicit in the
  rule probabilities
- Linguistically inadequate grammars: high mutual-information sequences are
  grouped into phrases
  ((What (((is the) cheapest) fare))((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?

(Partially) Bracketed Text

- Hand-annotated text with (some) phrase boundaries
- Training:
  (((List (the fares (for ((flight) (number 891)))))).)
  Use only derivations compatible with the training bracketing
- Test:
  [Plots: cross entropy vs. iteration for raw and bracketed training, and
  predictive power (cross entropy) on raw and bracketed test data.]

Bracketing Accuracy

- Accuracy criterion: proportion of phrases in the most likely analysis that are
  compatible with the treebank bracketing
  [Plot: bracketing accuracy vs. iteration for raw test and bracketed test data.]
- Conclusion: structure is not evident from distribution alone

Limitations of SCFGs

- Likelihoods are independent of particular words
- Markovian assumption on syntactic categories
- Markov models vs. hierarchical models over the same sentence:
  We need to resolve the issue

Lexicalization

- Markov model: We need to resolve the issue
- Dependency model: the same sentence with head-word dependency arcs

Best Current Models

- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy

Lexicalized Tree (Collins 98)

  (S(dumped) (NP-C(workers) (N(workers) workers))
             (VP(dumped) (V(dumped) dumped)
                         (NP-C(sacks) (N(sacks) sacks))
                         (PP(into) (P(into) into)
                                   (NP-C(bin) (D(a) a) (N(bin) bin)))))

  Dependency          Direction   Relation
  workers → dumped    L           NP, S, VP
  sacks → dumped      R           NP, VP, V
  into → dumped       R           PP, VP, V
  a → bin             L           D, NP, N
  bin → into          R           NP, PP, P

Inducing Representations

Unsupervised Learning

- Latent variable models
  - Model observables from latent variables
  - Search for a good set of latent variables
- Information bottleneck
  - Find an efficient compression of some observables
  - preserving the information about other observables

Do Induced Classes Help?

- Generalization
  - Better statistics for coarser events
- Dimensionality reduction
  - Smaller models
  - Improved classification accuracy

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these
sentences) has ever occurred in an English discourse. Hence, in any statistical model
for grammaticalness, these sentences will be ruled out on identical grounds as equally
remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
(Chomsky 1957)

Complex Events

- Representation of past experience:
  - uncertainty about the correct grammar
  - uncertainty about the correct interpretation of experience: ambiguity
- Probabilistic relationships involving hidden variables can be induced from
  observable data alone: the EM algorithm

In Any Model?

- What Chomsky was talking about: Markov models, whose state is just a record
  of observations
- But statistical models can have hidden state:
  P(colorless green ideas sleep furiously) ≈ 2 × 10^5 × P(furiously sleep ideas green colorless)
- Factored bigram model, trained for large-vocabulary speech recognition from
  newswire text by EM:

  P(w_{i+1} | w_i) ≈ Σ_{c=1}^{16} P(w_{i+1} | c) P(c | w_i)
  P(w_1 ... w_n) ≈ P(w_1) ∏_i P(w_{i+1} | w_i)

Distributional Clustering

- Automatic grouping of words according to the contexts in which they appear
- Approach to data sparseness: approximate the distribution of a relatively rare
  event (word) by the collective distribution of similar events (cluster)
- Sense ambiguity → membership in several "soft" clusters
- Case study: cluster nouns according to the verbs that take them as direct objects
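A minimal sketch of the factored bigram P(w'|w) ≈ Σ_c P(w'|c) P(c|w); the class
memberships and emission tables here are toy values assumed for illustration, not
estimates from data.

# Soft class memberships P(c | w) and class emission probabilities P(w' | c): toy values.
P_class_given_word = {
    "sell": {"finance": 0.9, "other": 0.1},
    "play": {"finance": 0.1, "other": 0.9},
}
P_word_given_class = {
    "finance": {"stocks": 0.4, "bonds": 0.4, "ball": 0.2},
    "other":   {"stocks": 0.1, "bonds": 0.1, "ball": 0.8},
}

def factored_bigram(next_word, word):
    """P(next_word | word) ~= sum_c P(next_word | c) P(c | word)."""
    return sum(P_word_given_class[c].get(next_word, 0.0) * pc
               for c, pc in P_class_given_word[word].items())

print(factored_bigram("bonds", "sell"))   # higher: 'sell' is mostly in the finance class
print(factored_bigram("bonds", "play"))   # lower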

Distributional Representation

- Describing n ∈ N: use the conditional distribution p(V | n)
  [Figure: relative frequencies of the verbs hold, buy, issue, purchase, own, select,
  tender, sell, trade as direct-object contexts for "stock" and "bond".]

Training Data

- Universe: two word classes V and N, and a single relation between them
  (e.g. main verb and head noun of the verb's direct object)
- Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or
  pattern matching

Reminder: Bottleneck Model

- Find soft clusters Ñ, described by p(ñ | n), that maximize the mutual
  information I(Ñ, V) for a fixed I(Ñ, N)
- Markov condition: p(ñ | v) = Σ_n p(ñ | n) p(n | v)
- I(Ñ, V) = Σ_{ñ,v} p(ñ, v) log [ p(ñ, v) / (p(ñ) p(v)) ]

Search for Cluster Solutions

- Solution:
  p(ñ | n) = (p(ñ) / Z_n) exp( −β D_KL( p(V | n) ‖ p(V | ñ) ) )
- The scale parameter β (inverse temperature) determines how much a noun
  contributes to nearby centroids
- β increases ⇒ clusters split ⇒ hierarchical clustering
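A minimal sketch of the soft membership formula above,
p(ñ|n) ∝ p(ñ) exp(−β D_KL(p(V|n) ‖ p(V|ñ))); the verb-context distributions, centroids,
prior, and β value are toy assumptions.

import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def memberships(p_v_given_n, centroids, prior, beta):
    """p(cluster | noun) for one noun, given cluster centroids p(V | cluster)."""
    w = np.array([prior[c] * np.exp(-beta * kl(p_v_given_n, centroids[c]))
                  for c in range(len(centroids))])
    return w / w.sum()                          # normalize by Z_n

# Verb-context distribution for one noun and two cluster centroids (toy numbers).
p_v_given_stock = [0.5, 0.4, 0.1]               # e.g. over the verbs (buy, sell, fire)
centroids = [[0.45, 0.45, 0.10],                # "security-like" cluster
             [0.10, 0.10, 0.80]]                # "weapon-like" cluster
print(memberships(p_v_given_stock, centroids, prior=[0.5, 0.5], beta=5.0))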

Small Example

- Cluster the 64 most common direct objects of "fire" in the 1988 Associated
  Press newswire. Sample clusters (nouns closest to four cluster centroids, with
  the scores shown on the slide):

  officer 0.484   missile 0.835   shot    0.858   gun     0.758
  aide    0.612   rocket  0.850   bullet  0.925   missile 0.786
  chief   0.649   bullet  0.917   rocket  0.930   weapon  0.862
  manager 0.651   gun     0.940   missile 1.037   rocket  0.875

Mutual Information Ratios

  [Plot: I(Ñ,V)/I(N,V) against I(Ñ,N)/H(N), for increasing β.]

Using Clusters for Prediction

- Model verb-object associations through object clusters:

  p̂(v | n) = Σ_ñ p(v | ñ) p(ñ | n)                 (used in the experiments)
  p̂(v | n) = (1 / Z_n) Σ_ñ p(v | ñ) p(ñ | n)       (more appropriate; depends on β)

- Intuition: the associations of a word are a mixture of the associations of the
  sense classes to which the word belongs

Evaluation

- Relative entropy of held-out data to the asymmetric model
- Decision task: which of two verbs is more likely to take a noun as direct
  object, estimated from the model for training data in which the pairs relating
  the noun to one of the verbs have been deleted

Relative Entropy

- Verb-object pairs from the 1988 AP Newswire
- Train: 756,721 verb-object pair training set
- Test: 81,240-pair held-out test set
- New: held-out data for 1000 nouns not in the training data
  [Plot: average relative entropy (bits) of train, test, and new data to the
  model, against the number of clusters.]

Decision Task

- Which of v1 and v2 is more likely to take object o?
- Held out: 10^4 (v2, o) pairs, which must be guessed from the cluster-based model
- Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
- All: all test data
- Exceptional: verb pairs whose frequency ratio with o is reversed from their
  overall frequency ratio
  [Plot: decision error against the number of clusters, for all and for
  exceptional test pairs.]