Machine Learning in Natural Language Processing

Introduction

Fernando Pereira
University of Pennsylvania
NASSLLI, June 2002

Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee,
Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby
Why ML in NLP

- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors are involved in language interpretation
- People do it
  - AI
  - Cognitive science
- Let the computer do it
  - Moore's law
  - Cheap storage
  - Lots of data

Classification

- Document topic
  politics, business, national, environment
- Word sense
  treasury bonds vs. chemical bonds

Analysis

- Tagging:
  causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ
- Parsing (lexicalized tree for "workers dumped sacks into a bin"):

  (S(dumped) (NP-C(workers) (N(workers) workers))
             (VP(dumped) (V(dumped) dumped)
                         (NP-C(sacks) (N(sacks) sacks))
                         (PP(into) (P(into) into)
                                   (NP-C(bin) (D(a) a) (N(bin) bin)))))

Language Modeling

- Is this a likely English sentence?
  P(colorless green ideas sleep furiously) ≈ 2 × 10^5 × P(furiously sleep ideas green colorless)

Inference

- Translation:
  treasury bonds → obrigações do tesouro
  covalent bonds → ligações covalentes
- Disambiguate noisy transcription:
  It's easy to wreck a nice beach
  It's easy to recognize speech

Machine Learning Approach

- Algorithms that write programs
  - Specify:
    - the form of the output programs
    - an accuracy criterion
  - Input: a set of training examples
  - Output: a program that performs as accurately as possible on the training examples
- But will it work on new examples?

Information extraction

  Sara Lee to Buy 30% of DIM

  Chicago, March 3 - Sara Lee Corp [acquirer] said it agreed to buy a 30 percent
  interest in Paris-based DIM S.A. [acquired], a subsidiary of BIC S.A., at a cost
  of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of
  about 2 million dollars.
  The investment includes the purchase of 5 million newly issued DIM shares valued
  at about 5 million dollars, and a loan of about 15 million dollars, it said.
  The loan is convertible into an additional 16 million DIM shares, it noted.
  The proposed agreement is subject to approval by the French government, it said.

Fundamental Questions

- Generalization: is the learned program useful on new examples?
  - Statistical learning theory: quantifiable tradeoffs between the number of
    examples, the complexity of the program class, and generalization error
- Computational tractability: can we find a good program quickly?
  - If not, can we find a good approximation?
- Adaptation: can the program learn quickly from new evidence?
  - Information-theoretic analysis: relationship between adaptation and compression

Learning Tradeoffs

[Figure: error of the best program in the class on the training examples and on new
(test) examples, plotted against program class complexity; overfitting appears where
test error rises while training error keeps falling.]

Machine Learning Methods

- Rote learning
- Classifiers
  - Document classification
  - Disambiguation
- Structured models
  - Tagging
  - Parsing
  - Extraction
- Unsupervised learning
  - Generalization
  - Structure induction

Jargon

- Instance: event type of interest
  - Document and its class
  - Sentence and its analysis
- Supervised learning: learn a classification function from hand-labeled instances
- Unsupervised learning: exploit correlations to organize training instances
- Generalization: how well does it work on unseen data?
- Features: map an instance to a set of elementary events

Classification Ideas

- Represent instances by feature vectors
  - Content
  - Context
- Learn a function from feature vectors to
  - a class, or
  - a class-probability distribution
- Redundancy is our friend: many weak clues

Structured Model Ideas

- Interdependent decisions
  - Successive parts-of-speech
  - Parsing/generation steps
  - Lexical choice
  - Parsing
  - Translation
- Combining decisions
  - Sequential decisions
  - Generative models
  - Constraint satisfaction

Unsupervised Learning Ideas

- Clustering: class induction
  - Know words by the company they keep
- Latent variables
  - "I'm thinking of sports" → more sporty words
- Distributional regularities
- Data compression
- Infer dependencies among variables: structure learning

Methodological Detour

- Empiricist/information-theoretic view: words combine following their
  associations in previous material
- Rationalist/generative view: words combine according to a formal grammar in
  the class of possible natural-language grammars

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these
sentences) has ever occurred in an English discourse. Hence, in any statistical model
for grammaticalness, these sentences will be ruled out on identical grounds as equally
remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
(Chomsky 1957)

The Return of Empiricism

- Empiricist methods work:
  - Markov models can capture a surprising fraction of the unpredictability in language
  - Statistical information retrieval methods beat alternatives
  - Statistical parsers are more accurate than competitors based on rationalist methods
  - Machine-learning, statistical techniques are close to human performance in
    part-of-speech tagging and sense disambiguation
- Just engineering tricks?

Unseen Events

- Chomsky's implicit assumption: any model must assign zero probability to unseen events
  - naive estimation of Markov model probabilities from frequencies
  - no latent (hidden) events
- Any such model overfits the data: many events are likely to be missing in any finite sample
  ⇒ The learned model cannot generalize to unseen data
  ⇒ Support for "poverty of the stimulus" arguments

The Science of Modeling

- Probability estimates can be smoothed to accommodate unseen events
- Redundancy in language supports effective statistical inference procedures:
  the stimulus is richer than it might seem
- Statistical learning theory: the generalization ability of a model class can be
  measured independently of the model representation
- Beyond Markov models: the effects of latent conditioning variables can be
  estimated from data

Richness of the Stimulus

- Word statistics carry more information than it might seem
- Information (mutual information) about:
  - relations between linguistic and non-linguistic events
  - relations between parts of a linguistic event
- Global coherence:
  banks can now sell stocks and bonds
- Markov models in speech recognition
- Success of the bag-of-words model in information retrieval
- Statistical machine translation

Questions

- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?
- How far can these methods go?

Classification

Generative or Discriminative?

- Generative models
  - Estimate the instance-label distribution p(x, y)
- Discriminative models
  - Estimate the label-given-instance distribution p(y | x)
  - Or minimize an upper bound on the training error Σ_i [[ f(x_i) ≠ y_i ]]

Simple Generative Model

- Binary naive Bayes:
  - Represent instances by sets of binary features
    (does a word occur in the document?)
  - Finite predefined set of classes

  P(F_1, ..., F_n, C) = P(C) ∏_i P(F_i | C)
  P(c | F_1, ..., F_n) ∝ P(c) ∏_i P(F_i | c)

  [Graphical model: class C with arcs to features F_1, F_2, F_3, F_4]

Generative Claims

- Easy to train: just count
- Language modeling: probability of observed forms
- More robust
  - Small training sets
  - Label noise
- Full advantage of probabilistic methods

Discriminative Models

- Define a functional form for p(y | x; θ)
- Binary classification: define a discriminant function
  y = sign h(x; θ)
- Adjust the parameter(s) θ to maximize the probability of the training labels /
  minimize training error

Simple Discriminative Forms

- Linear discriminant function:
  h(x; θ_0, θ_1, ..., θ_n) = θ_0 + Σ_i θ_i f_i(x)
- Logistic form:
  P(+1 | x) = 1 / (1 + exp(−h(x; θ)))
- Multi-class exponential form (maxent):
  h(x, y; θ_0, θ_1, ..., θ_n) = θ_0 + Σ_i θ_i f_i(x, y)
  P(y | x; θ) = exp h(x, y; θ) / Σ_y' exp h(x, y'; θ)
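As an illustration of the generative side, here is a minimal sketch (not from the slides)
of training and applying a binary naive Bayes classifier by counting; the toy documents,
feature sets, and add-alpha smoothing constant are illustrative assumptions.

from collections import Counter, defaultdict
from math import log

def train_nb(docs, labels, alpha=1.0):
    """docs: list of sets of binary features (word types); labels: class per doc."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # feat_counts[c][f] = #docs of class c containing f
    vocab = set()
    for d, c in zip(docs, labels):
        vocab |= d
        for f in d:
            feat_counts[c][f] += 1
    return class_counts, feat_counts, vocab, alpha

def predict_nb(model, doc):
    class_counts, feat_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = log(nc / n)                      # log P(C)
        for f in vocab:                          # product over the binary features F_i
            p = (feat_counts[c][f] + alpha) / (nc + 2 * alpha)
            score += log(p if f in doc else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([{"bond", "treasury"}, {"bond", "chemical"}], ["finance", "science"])
print(predict_nb(model, {"treasury", "bond"}))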

Discriminative Claims

- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Classification Tasks

- Document categorization
  - News categorization
  - Message filtering
  - Web page selection
- Tagging
  - Named entity
  - Part-of-speech
  - Sense disambiguation
- Syntactic decisions
  - Attachment

Document Models

- Binary vector: f_t(d) = [t ∈ d]
- Frequency vector: r_t(d) = tf(d, t)
- N-gram language model:
  p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | d_1 ... d_{i−1}; c)
  p(d_i | d_1 ... d_{i−1}; c) ≈ p(d_i | d_{i−n} ... d_{i−1}; c)

Term Weighting and Feature Selection

- Select or weight the most informative features
- TF*IDF: adjust a term's weight by how document-specific the term is
  tf(d, t) = number of occurrences of t in d
  idf(t) = |D| / |{d ∈ D : t ∈ d}|
  raw frequency: r_t(d) = tf(d, t)
  TF*IDF: x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t))
- Feature selection: remove low, unreliable counts
  - Mutual information
  - Information gain
  - Other statistics
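A small sketch of the TF*IDF weighting above (my own illustration; the
log(1 + tf) · log(1 + idf) form follows the slide, the toy documents are assumptions):

from math import log

def tfidf_vectors(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    df = {}                                    # document frequency of each term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        tf = {}
        for t in d:
            tf[t] = tf.get(t, 0) + 1
        idf = {t: len(docs) / df[t] for t in tf}   # idf(t) = |D| / |{d : t in d}|
        weights.append({t: log(1 + tf[t]) * log(1 + idf[t]) for t in tf})
    return weights

print(tfidf_vectors([["treasury", "bond", "bond"], ["chemical", "bond"]]))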

Documents vs. Vectors (1)

- Many documents have the same binary or frequency vector
- Document multiplicity must be handled correctly in probability models
- Binary naive Bayes:
  p(f | c) = ∏_t [ f_t p(t | c) + (1 − f_t)(1 − p(t | c)) ]
- Multiplicity is not recoverable

Documents vs. Vectors (2)

- Unigram model:
  p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | c)
  p(c | d) = p(c) p(d | c) / Σ_{c'} p(c') p(d | c')
- Raw frequency vector probability, with r_t = tf(d, t) and L = Σ_t r_t:
  p(r | c) = p(L | c) L! ∏_t p(t | c)^{r_t} / r_t!

Documents vs. Vectors (3)

- Document probability (unigram language model):
  p(c | r) = p(c) p(L | c) ∏_t p(t | c)^{r_t} / Σ_{c'} p(c') p(L | c') ∏_t p(t | c')^{r_t}

Linear Classifiers

- Swiss Army knife
- Embedding into a high-dimensional vector space
  - Geometric intuitions and techniques
  - Easier separability
  - Increase dimension with interaction terms
  - Nonlinear embeddings (kernels)

Kinds of Linear Classifiers

- Naive Bayes
- Exponential models
- Large-margin classifiers
  - Support vector machines (SVM)
  - Boosting
- Online methods
  - Perceptron
  - Winnow

Learning Linear Classifiers

- Rocchio:
  w_k = max(0, Σ_{x ∈ c} x_k / |c| − Σ_{x ∈ D−c} x_k / |D − c|)
- Widrow-Hoff:
  w ← w − 2η (w · x_i − y_i) x_i
- (Balanced) Winnow: y = sign(w+ · x − w− · x − θ)
  - positive error: w+ ← α w+, w− ← β w−, with α > 1 > β > 0
  - negative error: w+ ← β w+, w− ← α w−

Linear Classification

- Linear discriminant function:
  h(x) = w · x + b = Σ_k w_k x_k + b
- Margin
  - Instance (functional) margin: γ_i = y_i (w · x_i + b)
  - Normalized (geometric) margin: γ_i = y_i ((w / ‖w‖) · x_i + b / ‖w‖)
  - Training set margin γ: the smallest geometric margin over the training set

[Figure: positive (x) and negative (o) points separated by a hyperplane with weight
vector w and offset b, with the geometric margins γ_i, γ_j of individual points marked.]

Perceptron Algorithm

- Given:
  - a linearly separable training set S
  - a learning rate η > 0

  w ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1 ... N
      if y_i (w · x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes

Duality

- The final hypothesis is a linear combination of the training points:
  w = Σ_i α_i y_i x_i,  with α_i ≥ 0
- Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1 ... N
      if y_i ( Σ_j α_j y_j x_j · x_i + b ) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes

Why Maximize the Margin?

- There is a constant c such that, for any data distribution D with support in a
  ball of radius R and any training sample S of size N drawn from D,

  err(h) ≤ (c / N) ( (R² / γ²) log² N + log(1/δ) )

  with probability at least 1 − δ, where γ is the margin of h on S.
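A minimal runnable sketch of the primal perceptron update above (the toy data and
labels in {−1, +1} are assumptions for illustration):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """X: (N, d) array, y: (N,) array of +/-1 labels. Returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    R = max(np.linalg.norm(x) for x in X)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:          # misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi * R ** 2
                mistakes += 1
        if mistakes == 0:                       # converged: a full pass with no mistakes
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))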

Canonical Hyperplanes

- Multiple representations for the same hyperplane: (λw, λb), λ > 0
- Canonical hyperplane: functional margin = 1
- Geometric margin for a canonical hyperplane:

  γ = (1/2) (w / ‖w‖) · (x+ − x−)
    = (1 / (2‖w‖)) (w · x+ − w · x−)
    = 1 / ‖w‖

Convex Optimization (1)

- Constrained optimization problem:
  min_{w ∈ Ω} f(w)
  subject to g_i(w) ≤ 0, h_j(w) = 0
- Lagrangian function:
  L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w)
- Dual problem:
  max_{α,β} inf_{w ∈ Ω} L(w, α, β)  subject to α_i ≥ 0

Convex Optimization (2)

- f convex; g_i, h_j affine (h(w) = Aw − b)
- Kuhn-Tucker conditions: the solution w*, α*, β* must satisfy
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0      (complementarity: a parameter is non-zero iff its constraint is active)
  g_i(w*) ≤ 0
  α_i* ≥ 0

Maximizing the Margin (1)

- Given a separable training sample:
  min_{w,b} ‖w‖² = w · w  subject to y_i (w · x_i + b) ≥ 1
- Lagrangian:
  L(w, b, α) = (1/2) w · w − Σ_i α_i [ y_i (w · x_i + b) − 1 ]
  ∂L(w, b, α)/∂w = w − Σ_i y_i α_i x_i = 0
  ∂L(w, b, α)/∂b = Σ_i y_i α_i = 0

Maximizing the Margin (2)

- Dual Lagrangian at the stationary point:
  W(α) = L(w*, b*, α) = Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j x_i · x_j
- Dual maximization problem:
  max_α W(α)  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Maximum margin weight vector:
  w* = Σ_i y_i α_i* x_i  with margin γ = 1/‖w*‖ = ( Σ_{i ∈ SV} α_i* )^{−1/2}

Building the Classifier

- Computing the offset (from the primal constraints):
  b* = −( max_{y_i = −1} w* · x_i + min_{y_i = +1} w* · x_i ) / 2
- Decision function:
  h(x) = sgn( Σ_i y_i α_i* x_i · x + b* )

Consequences

- The complementarity condition yields the support vectors:
  α_i [ y_i (w* · x_i + b) − 1 ] = 0,  so  α_i > 0 ⇒ w* · x_i + b = y_i
- A functional margin of 1 implies minimum geometric margin γ = 1/‖w*‖

General SVM Form

- Margin maximization for an arbitrary kernel K:
  max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j K(x_i, x_j)
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Decision rule:
  h(x) = sgn( Σ_i y_i α_i* K(x_i, x) + b* )

Soft Margin

- Handles the non-separable case
- Primal problem (2-norm):
  min_{w,b,ξ} w · w + C Σ_i ξ_i²
  subject to y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
- Dual problem:
  max_α Σ_i α_i − (1/2) Σ_{i,j} y_i y_j α_i α_j ( x_i · x_j + (1/C) δ_ij )
  subject to α_i ≥ 0, Σ_i y_i α_i = 0
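In practice the dual problems above are handed to an off-the-shelf QP/SMO solver; a
minimal sketch using scikit-learn's SVC (a tooling assumption, not part of the slides;
the toy data are also assumed):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])                   # XOR-like labels: not linearly separable

# Soft-margin SVM with an RBF kernel K(x, x'); C trades margin width against slack.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)
print(clf.support_)                            # indices of the support vectors (alpha_i > 0)
print(clf.predict(X))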

Conditional Maxent Model

- Model form:
  p(y | x; Λ) = exp( Σ_k λ_k f_k(x, y) ) / Z(x; Λ)
  Z(x; Λ) = Σ_y exp( Σ_k λ_k f_k(x, y) )
- Useful properties:
  - Multi-class
  - May use different features for different classes
  - Training is a convex optimization

Duality

- Maximize the conditional log-likelihood:
  Λ̂ = argmax_Λ Σ_i log p(y_i | x_i; Λ)
- Maximizing the conditional entropy
  p̂ = argmax_p − Σ_i Σ_y p(y | x_i) log p(y | x_i)
  subject to the constraints
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i) f_k(x_i, y)
  yields p̂(y | x) = p(y | x; Λ̂)

Relationship to (Binary) Logistic Discrimination

  p(+1 | x) = exp( Σ_k λ_k f_k(x, +1) ) / [ exp( Σ_k λ_k f_k(x, +1) ) + exp( Σ_k λ_k f_k(x, −1) ) ]
            = 1 / [ 1 + exp( − Σ_k λ_k ( f_k(x, +1) − f_k(x, −1) ) ) ]
            = 1 / [ 1 + exp( − Σ_k λ_k g_k(x) ) ]

Relationship to Linear Discrimination

- Decision rule:
  sign log [ p(+1 | x) / p(−1 | x) ] = sign Σ_k λ_k g_k(x)
- Bias term: a parameter for an "always on" feature
- Question: relationship to other trainers for linear discriminant functions
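A minimal sketch of conditional maxent training by plain gradient ascent on the
conditional log-likelihood (the slides use iterative scaling or conjugate gradient;
this is the same objective with a simpler optimizer, and the feature map f(x, y),
learning rate, and toy data are illustrative assumptions):

import numpy as np

def train_maxent(feats, labels, classes, lr=0.1, iters=200):
    """feats[i][y] is the feature vector f(x_i, y); labels[i] is the observed class."""
    dim = len(feats[0][classes[0]])
    lam = np.zeros(dim)
    for _ in range(iters):
        grad = np.zeros(dim)
        for fxy, yi in zip(feats, labels):
            scores = np.array([lam @ fxy[y] for y in classes])
            p = np.exp(scores - scores.max())
            p /= p.sum()                                  # p(y | x; lambda)
            grad += fxy[yi]                               # observed feature counts
            grad -= sum(p[y] * fxy[y] for y in classes)   # expected feature counts
        lam += lr * grad
    return lam

# Tiny example: two classes, f(x, y) conjoins a 2-d input with the class label.
def f(x, y):
    v = np.zeros(4)
    v[2 * y: 2 * y + 2] = x
    return v

X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
feats = [{0: f(x, 0), 1: f(x, 1)} for x in X]
print(train_maxent(feats, [0, 1], classes=[0, 1]))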

Solution Techniques (1)

- Generalized iterative scaling (GIS)
- Parameter updates:
  λ_k ← λ_k + (1/C) log [ Σ_i f_k(x_i, y_i) / Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) ]
- Requires that the features add up to a constant independent of instance and
  label (add a slack feature):
  Σ_k f_k(x_i, y) = C  for all i, y

Solution Techniques (2)

- Improved iterative scaling (IIS)
- Parameter updates λ_k ← λ_k + δ_k, where δ_k solves
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
  with f#(x, y) = Σ_k f_k(x, y)
- For binary features this reduces to solving a polynomial with positive coefficients
- Reduces to GIS if the feature sum is constant

Deriving IIS (1)

- Conditional log-likelihood:
  l(Λ) = Σ_i log p(y_i | x_i; Λ)
- Log-likelihood update:
  l(Λ + Δ) − l(Λ) = Σ_i Δ · f(x_i, y_i) − Σ_i log [ Z(x_i; Λ + Δ) / Z(x_i; Λ) ]
                  = Σ_i Δ · f(x_i, y_i) − Σ_i log Σ_y p(y | x_i; Λ) e^{Δ · f(x_i, y)}
  and, since log x ≤ x − 1,
                  ≥ Σ_i Δ · f(x_i, y_i) + N − Σ_i Σ_y p(y | x_i; Λ) e^{Δ · f(x_i, y)} = A(Δ)

Deriving IIS (2)

- By Jensen's inequality:
  A(Δ) ≥ Σ_i Δ · f(x_i, y_i) + N
         − Σ_i Σ_y p(y | x_i; Λ) Σ_k [ f_k(x_i, y) / f#(x_i, y) ] e^{δ_k f#(x_i, y)} = B(Δ)
- Maximize the lower bound on the update:
  ∂B(Δ)/∂δ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} = 0

Solution Techniques (3)

- GIS is very slow if the slack variable takes large values
- IIS is faster, but still problematic
- Recent suggestion: use standard convex optimization techniques,
  e.g. conjugate gradient
  - Some evidence of faster convergence

Gaussian Prior

- Log-likelihood gradient:
  ∂l(Λ)/∂λ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) − λ_k / σ_k²
- Modified IIS update λ_k ← λ_k + δ_k, where δ_k solves
  Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} + (λ_k + δ_k) / σ_k²

Instance Representation

- Fixed-size instance (PP attachment): binary features
  - Word identity
  - Word class
- Variable-size instance (document classification)
  - Word identity
  - Word relative frequency in the document

Enriching Features

- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown-word features: suffixes, capitalization
- Feature combinations (cf. n-grams)

Structured Models: Finite State

"I understood each and every word you said but not the order in which they appeared."

Structured Model Applications

- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing

Structured Models

- Assign a labeling to a sequence
  - Story segmentation
  - POS tagging
  - Named entity extraction
  - (Shallow) parsing

Constraint Satisfaction in Structured Models

- Train to minimize labeling loss:
  θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
  argmin_y Loss(x, y | θ)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - An efficient algorithm to combine the decisions

Local Classification Models

- Train to minimize the per-decision loss in context:
  θ̂ = argmin_θ Σ_i Σ_{0 ≤ j < |x_i|} loss(y_{i,j} | x_i, y_i(¬j); θ)
- Apply by guessing the context and finding each lowest-loss label:
  argmin_{y_j} loss(y_j | x, y(¬j); θ)

Structured Model Claims

- Constraint satisfaction
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classification
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Example: Language Modeling

- n-gram (order n−1 Markov) model:
  P(w_k | w_1 ... w_{k−1}) ≈ P(w_k | w_{k−n+1} ... w_{k−1})
- Example: character n-grams of increasing order (Shannon, Lucky):
  0: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
  1: OCRO HLI RGWR NMIELWIS EU LL
  2: ON IE ANTSOUTINYS ARE T INCTORE ST
  3: IN NO IST LAT WHEY CRATICT FROURE
  4: ABOVERY UPONDULTS WELL THE CODERST
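A minimal sketch of the n-gram factorization above, as a bigram (order-1 Markov) model
with add-one smoothing; the smoothing choice and toy training data are illustrative
assumptions, not the slides' estimator.

from collections import Counter, defaultdict
from math import log, exp

def train_bigram(sentences):
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    vocab = set(unigrams) | {"</s>"}
    return unigrams, bigrams, vocab

def logprob(model, sent):
    unigrams, bigrams, vocab = model
    words = ["<s>"] + sent + ["</s>"]
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        # add-one smoothing so unseen bigrams get nonzero probability
        lp += log((bigrams[prev][cur] + 1) / (unigrams[prev] + len(vocab)))
    return lp

model = train_bigram([["colorless", "green", "ideas", "sleep", "furiously"]])
print(exp(logprob(model, ["colorless", "green", "ideas", "sleep", "furiously"])))
print(exp(logprob(model, ["furiously", "sleep", "ideas", "green", "colorless"])))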

Markov's Unreasonable Effectiveness

- Entropy estimates for English:

  model                                bits/char
  compress                             4.43
  word trigrams (Brown et al. 92)      1.75
  human prediction (Cover & King 78)   1.34

- Local word relations dominate the statistics (Jelinek):

  [Figure: word lattice over sentence completions such as "We need to resolve all of
  the important issues ...", with the candidate words and their counts at each position.]

Limits of Markov Models

- What's the probability of unseen events?
  - Bias forces nonzero probabilities for some unseen events
  - Typical bias: tie the probabilities of related events
- No dependency: likelihoods are based on sequencing, not dependency
  [The same word lattice: sequencing alone accounts for the observed regularities.]

Unseen Events (1)

- Relate a specific unseen event to more general seen events:
  eat pineapple → eat _
- Event decomposition: event → event1 event2
- Factoring via latent variables:
  eat pineapple → eat _ , _ pineapple
  P(eat | pineapple) ≈ Σ_C P(eat | C) P(C | pineapple)

Unseen Events (2)

- Discount the estimates for seen events
- Use the leftover probability mass for unseen events
- How to allocate the leftover?
  - Back off from the unseen event to less specific seen events: n-gram to (n−1)-gram
  - Hypothesize a hidden cause for unseen events: latent variable model
  - Relate the unseen event to distributionally similar seen events

Important Detour: Latent Variable Models

Expectation-Maximization (EM)

- Latent (hidden) variable models:
  p(y, x | Λ) = Σ_z p(y, x, z | Λ)
- Examples:
  - Mixture models
  - Class-based models (hidden classes)
  - Hidden Markov models

Maximizing Likelihood

- Data log-likelihood for D = {(x_1, y_1), ..., (x_N, y_N)}:
  L(D | Λ) = Σ_i log p(x_i, y_i) = Σ_{x,y} p̃(x, y) log p(x, y | Λ)
  where p̃(x, y) = |{i : x_i = x, y_i = y}| / N
- Find the parameters that maximize the (log-)likelihood:
  Λ̂ = argmax_Λ Σ_{x,y} p̃(x, y) log p(x, y | Λ)

Convenient Lower Bounds (1)

- Convex function:
  α f(x_0) + (1 − α) f(x_1) ≥ f(α x_0 + (1 − α) x_1)
  [Figure: a convex function f with the chord between f(x_0) and f(x_1) lying above it.]
- Jensen's inequality:
  f( Σ_x p(x) x ) ≤ Σ_x p(x) f(x)
  if f is convex and p is a probability density

Convenient Lower Bounds (2)

  [Figure: the data log-likelihood L(D | λ) together with successive lower bounds;
  EM alternates building a bound at an E step (E_0, E_1, ...) with maximizing it at
  the following M step (M_0, ...).]

Auxiliary Function

- Find a convenient non-negative function that lower-bounds the likelihood increase:
  L(D | Λ) − L(D | Λ') ≥ Q(Λ, Λ') ≥ 0
- Maximize the lower bound:
  Λ^(i+1) = argmax_Λ Q(Λ, Λ^(i))

Comments

- The likelihood keeps increasing, but:
  - it can get stuck in a local maximum (or a saddle point!)
  - it can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some Λ that increases
  the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done carefully
  (and sometimes cannot be avoided)

Example: Mixture Model

- Base distributions p_i(y), 1 ≤ i ≤ m
- Mixture coefficients λ_i ≥ 0, Σ_{i=1}^m λ_i = 1
- Mixture distribution:
  p(y | Λ) = Σ_i λ_i p_i(y)

Auxiliary Quantities

- Mixture coefficient λ_i = prior probability of being in class i
- Joint probability:
  p(c, y | Λ) = λ_c p_c(y)
- Auxiliary function:
  Q(Λ', Λ) = Σ_y p̃(y) Σ_c p(c | y, Λ) log [ p(y, c | Λ') / p(y, c | Λ) ]

Solution

- E step:
  C_i = Σ_y p̃(y) λ_i p_i(y) / Σ_j λ_j p_j(y)
- M step:
  λ_i ← C_i / Σ_j C_j

More Finite-State Models

Example: Information Extraction

- Given: the types of entities and relationships we are interested in
  - People, places, organizations, dates, amounts, materials, processes, ...
  - Employed by, located in, used for, arrived when, ...
- Find all entities and relationships of the given types in the source material
- Collect them in a suitable database
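A minimal sketch of the EM updates above for a two-component mixture of 1-D Gaussians;
the fixed variance, random initialization, and synthetic data are assumptions made for
brevity, not part of the slides.

import numpy as np

def em_mixture(y, m=2, iters=50, sigma=1.0):
    """EM for a mixture of m Gaussians with fixed variance sigma^2."""
    rng = np.random.default_rng(0)
    lam = np.full(m, 1.0 / m)                      # mixture coefficients lambda_i
    mu = rng.choice(y, size=m, replace=False)      # component means
    for _ in range(iters):
        # E step: posterior responsibility p(c | y_j, Lambda)
        dens = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
        post = lam * dens
        post /= post.sum(axis=1, keepdims=True)
        # M step: C_i = sum_j post[j, i]; lambda_i = C_i / sum_j C_j
        C = post.sum(axis=0)
        lam = C / C.sum()
        mu = (post * y[:, None]).sum(axis=0) / C
    return lam, mu

y = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 100)])
print(em_mixture(y))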

IE Example

[Figure: "Nance, who is also a paid consultant to ABC News, said ...", annotated with a
person entity ("Nance"), a person-descriptor ("a paid consultant to ABC News"), an
organization ("ABC News"), the employee relation between them, and a co-reference link.]

IE Methods

- Partial matching:
  - Hand-built patterns
  - Automatically trained hidden Markov models
  - Cascaded finite-state transducers
- Parsing-based:
  - Shallow parser (chunking)
  - Automatically induced grammar
  - Rely on: syntactic structure, phrase classification
  - Parse the whole text
  - Classify phrases and phrase relations as the desired entities and relationships

Global Constraint Models

- Train to minimize labeling loss:
  θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
  argmin_y Loss(x, y | θ)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - A dynamic programming algorithm to combine the decisions

Local Classification Models

- Train to minimize the per-symbol loss in context:
  θ̂ = argmin_θ Σ_i Σ_{0 ≤ j < |x_i|} loss(y_{i,j} | x_i, y_i(¬j); θ)
- Apply by guessing the context and finding each lowest-loss label:
  argmin_{y_j} loss(y_j | x, y(¬j); θ)

Structured Model Claims

- Global constraint
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classifier
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Generative vs. Discriminative

- Hidden Markov models (HMMs): generative, global
- Conditional exponential models (MEMMs, CRFs): discriminative, global
- Boosting, winnow: discriminative, local

Generative Models

- A stochastic process that generates instance-label pairs
  - Process structure
  - Process parameters
- (Hypothesize the structure)
- Estimate the parameters from training data

Model Structure

- Decompose the generation of instances into elementary steps
- Define dependencies between steps
- Parameterize the dependencies
- Useful descriptive language: graphical models

Binary Naive Bayes

- Represent instances by sets of binary features
  (does a word occur in the document?)
- Finite predefined set of classes

  P(F_1, ..., F_n, C) = P(C) ∏_i P(F_i | C)
  P(c | F_1, ..., F_n) ∝ P(c) ∏_i P(F_i | c)

  [Graphical model: class C with arcs to features F_1, ..., F_4]

Discrete Hidden Markov Model

- Instances: symbol sequences
- Labels: class sequences

  P(X, C) = P(C_0) P(X_0 | C_0) ∏_i P(C_i | C_{i−1}) P(X_i | C_i)

  [Graphical model: chain C_0 → C_1 → ... → C_4, each state C_i emitting X_i]

Generating Multiple Features

- Instances: sequences of feature sets
  - Word identity
  - Word properties (e.g. spelling, capitalization)
- Labels: class sequences
- Limitation: conditionally independent features

  [Graphical model: chain C_0 → ... → C_4, each state C_i emitting features F_i,0, F_i,1, F_i,2]

Independence or Intractability

- Trees are good: each node has a single immediate ancestor, so the joint
  probability is computed in linear time
- But that forces features to be conditionally independent given the class
- Unrealistic:
  - "San" and "Francisco" in a document
  - Suffixes and capitalization
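A minimal sketch of the discrete HMM factorization above: the forward algorithm sums
P(X, C) over all state sequences in linear time. The transition, emission, and initial
tables here are toy assumptions.

import numpy as np

def forward(pi, A, B, obs):
    """pi[c]=P(C_0=c), A[c,c']=P(C_t=c'|C_{t-1}=c), B[c,x]=P(X_t=x|C_t=c)."""
    alpha = pi * B[:, obs[0]]                  # alpha_0(c) = P(C_0=c) P(x_0 | c)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]          # sum over the previous state, then emit
    return alpha.sum()                         # P(x_0 ... x_n)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],                      # state 0 emissions over symbols {0, 1}
              [0.1, 0.9]])                     # state 1 emissions over symbols {0, 1}
print(forward(pi, A, B, [0, 1, 1]))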

Information Extraction with HMMs

[Seymore & McCallum 99] [Freitag & McCallum 99]

- Parameters = P(s|s'), P(o|s) for all states in S = {s1, s2, ...}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the database field

Score Card

- No independence assumptions
- Richer features: combinations of existing features
- Optimization problem for parameters
- Limited probabilistic interpretation
- Insensitive to input distribution

Problems with HMMs

1. Would prefer a richer representation of text: multiple overlapping features,
   whole chunks of text
   - Example word features:
     - identity of word
     - word is in all caps
     - word ends in -tion
     - word is part of a noun phrase
     - word is in bold font
     - word is on left hand side of page
     - word is under node X in WordNet
   - Example line features:
     - length of line
     - line is centered
     - percent of non-alphabetics
     - total amount of white space
     - line contains two verbs
     - line begins with a number
     - line is grammatically a question
2. HMMs are generative models.
   - Generative models do not easily handle overlapping, non-independent features.
   - Would prefer a conditional model: P({s}|{o}).

Solution: Conditional Model

- Hidden Markov model: P(s|s'), P(o|s)
- Maximum entropy Markov model: P(s|o,s') (represented by an exponential model)
- For the time being, capture the dependency on s' with |S| independent
  functions, Ps'(s|o)
- Each state contains a "next-state classifier" black box that, given the next
  observation, produces a probability distribution over possible next states, Ps'(s|o)

Two Sequence Models

- HMM: transitions P(s|s') from s_{t−1} to s_t, emissions P(o|s) from s_t to o_t
- MEMM: next-state distribution P(s|o,s'), conditioned on both s_{t−1} and o_t
- Standard belief propagation: forward-backward procedure
- Viterbi and Baum-Welch follow naturally

Transition Features

- Model Ps'(s|o) in terms of multiple arbitrary overlapping (binary) features
- Example observation predicates:
  - o is the word "apple"
  - o is capitalized
  - o is on a left-justified line
- A feature f depends on both a predicate b and a destination state s:

  f_<b,s>(o, s') = 1 if b(o) is true and s' = s, 0 otherwise

Next-State Classifier

- Per-state conditional maxent model:

  Ps'(s | o) = (1 / Z(o, s')) exp( Σ_<b,q> λ_<b,q> f_<b,q>(o, s) )

- Training: each state model is trained independently from labeled sequences

Example: Q-A pairs from FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal
  programs to work properly. They are as far as I know the informal standard
  agreed upon by commercial comms software developers for the Arc.
  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This
  is to avoid the well known serial port chip bugs. The modem's DCD (Data
  Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator)
  most modems broadcast a software RING signal anyway, and even then it's
  really necessary to detect it for the model to answer the call.

  2.7) The sound from the speaker port seems quite muffled.
  How can I get unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove
  high frequency harmonics from the sound output. To bypass the filter, hook
  into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip
  39) and hook the capacitor like this:

Experimental Data

- 38 files belonging to 7 UseNet FAQs
- Lines are labeled <head>, <question>, or <answer>, e.g.:

  <head>     X-NNTP-Poster: NewsHound v1.33
  <head>     Archive-name: acorn/faq/part2
  <head>     Frequency: monthly
  <head>
  <question> 2.6) What configuration of serial cable should I use?
  <answer>   Here follows a diagram of the necessary connection
  <answer>   programs to work properly. They are as far as I know
  <answer>   agreed upon by commercial comms software developers fo
  <answer>   Pins 1, 4, and 8 must be connected together inside
  <answer>   is to avoid the well known serial port chip bugs. The

- Procedure: for each FAQ, train on one file, test on the others; average.

Features in Experiments

  begins-with-number          contains-question-mark
  begins-with-ordinal         contains-question-word
  begins-with-punctuation     ends-with-question-mark
  begins-with-question-word   first-alpha-is-capitalized
  begins-with-subject         indented
  blank                       indented-1-to-4
  contains-alphanum           indented-5-to-10
  contains-bracketed-number   more-than-one-third-space
  contains-http               only-punctuation
  contains-non-space          prev-is-blank
  contains-number             prev-begins-with-ordinal
  contains-pipe               shorter-than-30

Models Tested

- ME-Stateless: a single maximum entropy classifier applied to each line independently.
- TokenHMM: a fully-connected HMM with four states, one for each of the line
  categories, each of which generates individual tokens (groups of alphanumeric
  characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, except that the lines in a document are
  first converted to sequences of features.
- MEMM: maximum entropy Markov model.

Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681

Label Bias Problem

- Example (after Bottou 91): a finite-state model with two paths, 0→1→2→3
  spelling "rib" and 0→4→5→6 spelling "rob".
- With per-state normalization, state 1 has a single outgoing transition, so
  P(1, 2 | ro) = P(1 | r) P(2 | o, 1) = P(1 | r) · 1
  P(1, 2 | ri) = P(1 | r) P(2 | i, 1) = P(1 | r) · 1
  and the observation cannot change the preference between the paths.
- Bias toward states with fewer outgoing transitions.
- Per-state normalization does not allow the required score(1,2|ro) << score(1,2|ri).

Proposed Solutions

- Determinization:
  - not always possible
  - state-space explosion
- Fully-connected models:
  - lack prior structural knowledge
- Conditional random fields (CRFs):
  - Allow some transitions to "vote" more strongly than others
  - Whole-sequence rather than per-state normalization

From HMMs to CRFs

  s = s_1 ... s_n,  o = o_1 ... o_n

  HMM:   P(s | o) ∝ ∏_t P(s_t | s_{t−1}) P(o_t | s_t)

  MEMM:  P(s | o) = ∏_t P(s_t | s_{t−1}, o_t)
                  = ∏_t (1 / Z(s_{t−1}, o_t)) exp( Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t) )

  CRF:   P(s | o) = (1 / Z(o)) ∏_t exp( Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t) )

CRF General Form

- The state sequence is a Markov random field conditioned on the observation sequence.
- Model form:
  P(s | o) = (1 / Z(o)) ∏_t exp( Σ_f λ_f f(s_{t−1}, s_t, o, t) )
- Features represent the dependency between successive states, conditioned on
  the observations
- Dependence on the whole observation sequence o (not possible in HMMs)

Efficient Estimation

- Matrix notation:
  M_t(s', s | o) = exp Λ_t(s', s | o)
  Λ_t(s', s | o) = Σ_f λ_f f(s_{t−1} = s', s_t = s, o, t)
  P_Λ(s | o) = (1 / Z_Λ(o)) ∏_t M_t(s_{t−1}, s_t | o)
  Z_Λ(o) = ( M_1(o) M_2(o) ... M_{n+1}(o) )_{start,stop}
- Efficient normalization: forward-backward algorithm

Forward-Backward Calculations

  α_t(o) = α_{t−1}(o) M_t(o)
  β_t(o)ᵀ = M_{t+1}(o) β_{t+1}(o)ᵀ
  Z_Λ(o) = α_{n+1}(end | o) = β_0(start | o)

- For any path function G(s) = Σ_t g_t(s_{t−1}, s_t):

  E_Λ[G] = Σ_s P_Λ(s | o) G(s)
         = (1 / Z_Λ(o)) Σ_{t,s',s} α_t(s' | o) g_{t+1}(s', s) M_{t+1}(s', s | o) β_{t+1}(s | o)

Training

- Maximize L(Λ) = Σ_k log P_Λ(s^k | o^k)
- Log-likelihood gradient:
  ∂L(Λ)/∂λ_f = Σ_k #f(s^k | o^k) − Σ_k E_Λ[ #f(S | o^k) ]
  where #f(s | o) = Σ_t f(s_{t−1}, s_t, o, t)
- Methods: iterative scaling, conjugate gradient
- Cost comparable to standard Baum-Welch

Label Bias Experiment

- Data source: a noisy version of the rib/rob automaton, with
  P(intended symbol) = 29/30, P(other) = 1/30.
- Train both an MEMM and a CRF with identical topologies on data from the source.
- Decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test).
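A minimal sketch of the matrix form above: Z_Λ(o) computed as the (start, stop) entry of
the product of per-position transition matrices M_t, here via a forward recursion. The
weight function, state names, and toy observations are assumptions for illustration.

import numpy as np

def crf_partition(obs, states, weight):
    """weight(s_prev, s, obs, t) = sum_f lambda_f f(s_prev, s, obs, t); returns Z(o)."""
    n = len(obs)
    start, stop = "<start>", "<stop>"
    alpha = {start: 1.0}                        # alpha_0, before any label
    for t in range(1, n + 2):                   # positions 1..n carry labels, n+1 is <stop>
        cur = [stop] if t == n + 1 else states
        alpha = {s: sum(a * np.exp(weight(sp, s, obs, t)) for sp, a in alpha.items())
                 for s in cur}                  # alpha_t = alpha_{t-1} M_t
    return alpha[stop]                          # Z(o) = (M_1 ... M_{n+1})_{start,stop}

# Toy weight function: prefer staying in the same state, plus one observation feature.
def weight(s_prev, s, obs, t):
    w = 1.0 if s_prev == s else 0.0
    if t <= len(obs) and s == "Q" and obs[t - 1].endswith("?"):
        w += 2.0
    return w

print(crf_partition(["2.6)", "What cable should I use?", "Here follows a diagram."],
                    ["Q", "A"], weight))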

Mixed-Order Sources

- Data generated by mixing sparse first- and second-order HMMs with a varying
  mixing coefficient.
- Modeled by a first-order HMM, an MEMM, and a CRF (without contextual or
  overlapping features).

Part-of-Speech Tagging

- Trained on 50% of the 1.1 million words in the Penn treebank. In this set,
  5.45% of the words occur only once, and were mapped to "oov".
- Experiments with two different feature sets:
  - traditional: just the words
  - take advantage of the power of conditional models: use words plus
    overlapping features: capitalized, begins with #, contains hyphen, ends in
    -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

POS Tagging Results

  model    error    oov error
  HMM      5.69%    45.99%
  MEMM     6.37%    54.61%
  CRF      5.55%    48.05%
  MEMM+    4.81%    26.99%
  CRF+     4.27%    23.76%

Structured Models: Stochastic Grammars

Beyond Finite State

- Constituency and meaningful relations:
  What sleeps? How does it sleep? What kind of ideas?
  [Parse tree for "revolutionary new ideas slowly advance", with nested N' and VP structure.]
- Related sentences:
  Sleep furiously is all that colorless green ideas do.

Stochastic Context-Free Grammars

- Rules with probabilities (as listed on the slide):

  S  → NP VP    1.0        N' → Adj N'   0.3        N → ideas    0.1
  NP → Det N'   0.2        N' → N        0.7        N → people   0.2
  NP → N'       0.8        VP → VP Adv   0.4        V → sleep    0.3
                           VP → V        0.6

SCFG Derivations

- Context-freeness ⇒ independence of derivation steps: the probability of a
  derivation is the product of the probabilities of the rules used, e.g.
  P(tree) = P(S → NP VP) · P(NP → N') · P(N' → Adj N') · ...

Stochastic CFG Inference

- Inside-outside algorithm (Baker 79): find rule probabilities that locally
  maximize the likelihood of a training corpus (an instance of EM)
- Extended inside-outside algorithm: use information about the phrase structure
  of the training corpus to guide rule probability reestimation
  - Better modeling of phrase structure
  - Improved convergence
  - Improved computational complexity
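Because derivation steps are independent, a tree's probability is just the product of
its rule probabilities; a minimal sketch (the rule table mixes the values listed above
with an assumed probability for the illustrative Adj → colorless rule):

# P(rule) for a toy SCFG fragment.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("V", ("sleep",)): 0.3,
    ("Adj", ("colorless",)): 0.05,     # assumed value, not on the slide
}

def tree_prob(tree):
    """tree = (label, [children]); leaves are plain strings."""
    label, children = tree
    kids = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, kids)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)          # independence of derivation steps
    return p

t = ("S", [("NP", [("N'", [("Adj", ["colorless"]),
                           ("N'", [("N", ["ideas"])])])]),
           ("VP", [("V", ["sleep"])])])
print(tree_prob(t))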

Inside-Outside Reestimation

- Ratios of expected rule frequencies to expected phrase frequencies:

  P_{n+1}(N' → Adj N') = E_n(# N' → Adj N') / E_n(# N')

- Computed from the probabilities of each phrase and phrase context in the
  training corpus
  [Figure: a parse tree with one phrase highlighted and its surrounding phrase context.]
- Iterate

Problems with I-O Reestimation

- Hill-climbing procedure: sensitive to the initial rule probabilities
- Does not learn grammar structure directly: structure is only implicit in the
  rule probabilities
- Linguistically inadequate grammars: high mutual-information sequences are
  grouped into phrases
  ((What (((is the) cheapest) fare))((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?

(Partially) Bracketed Text

- Hand-annotated text with (some) phrase boundaries
- Training:
  (((List (the fares (for ((flight) (number 891)))))).)
  Use only derivations compatible with the training bracketing
- Test:
  [Plots: cross entropy vs. iteration for raw and bracketed training, and
  predictive power (cross entropy) on raw and bracketed test data.]

Bracketing Accuracy

- Accuracy criterion: proportion of phrases in the most likely analysis that are
  compatible with the treebank bracketing
  [Plot: bracketing accuracy vs. iteration for raw test and bracketed test data.]
- Conclusion: structure is not evident from distribution alone

Limitations of SCFGs

- Likelihoods are independent of particular words
- Markovian assumption on syntactic categories
- Markov models vs. hierarchical models over the same sentence:
  We need to resolve the issue

Lexicalization

- Markov model: We need to resolve the issue
- Dependency model: the same sentence with head-word dependency arcs

Best Current Models

- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy

Lexicalized Tree (Collins 98)

  (S(dumped) (NP-C(workers) (N(workers) workers))
             (VP(dumped) (V(dumped) dumped)
                         (NP-C(sacks) (N(sacks) sacks))
                         (PP(into) (P(into) into)
                                   (NP-C(bin) (D(a) a) (N(bin) bin)))))

  Dependency          Direction   Relation
  workers → dumped    L           NP, S, VP
  sacks → dumped      R           NP, VP, V
  into → dumped       R           PP, VP, V
  a → bin             L           D, NP, N
  bin → into          R           NP, PP, P

Inducing Representations

Unsupervised Learning

- Latent variable models
  - Model observables from latent variables
  - Search for a good set of latent variables
- Information bottleneck
  - Find an efficient compression of some observables
  - preserving the information about other observables

Do Induced Classes Help?

- Generalization
  - Better statistics for coarser events
- Dimensionality reduction
  - Smaller models
  - Improved classification accuracy

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these
sentences) has ever occurred in an English discourse. Hence, in any statistical model
for grammaticalness, these sentences will be ruled out on identical grounds as equally
remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
(Chomsky 1957)

Complex Events

- Representation of past experience:
  - uncertainty about the correct grammar
  - uncertainty about the correct interpretation of experience: ambiguity
- Probabilistic relationships involving hidden variables can be induced from
  observable data alone: the EM algorithm

In Any Model?

- What Chomsky was talking about: Markov models, whose state is just a record
  of observations
- But statistical models can have hidden state:
  P(colorless green ideas sleep furiously) ≈ 2 × 10^5 × P(furiously sleep ideas green colorless)
- Factored bigram model, trained for large-vocabulary speech recognition from
  newswire text by EM:

  P(w_{i+1} | w_i) ≈ Σ_{c=1}^{16} P(w_{i+1} | c) P(c | w_i)
  P(w_1 ... w_n) ≈ P(w_1) ∏_i P(w_{i+1} | w_i)

Distributional Clustering

- Automatic grouping of words according to the contexts in which they appear
- Approach to data sparseness: approximate the distribution of a relatively rare
  event (word) by the collective distribution of similar events (cluster)
- Sense ambiguity → membership in several "soft" clusters
- Case study: cluster nouns according to the verbs that take them as direct objects
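A minimal sketch of the factored bigram P(w'|w) ≈ Σ_c P(w'|c) P(c|w); the class
memberships and emission tables here are toy values assumed for illustration, not
estimates from data.

# Soft class memberships P(c | w) and class emission probabilities P(w' | c): toy values.
P_class_given_word = {
    "sell": {"finance": 0.9, "other": 0.1},
    "play": {"finance": 0.1, "other": 0.9},
}
P_word_given_class = {
    "finance": {"stocks": 0.4, "bonds": 0.4, "ball": 0.2},
    "other":   {"stocks": 0.1, "bonds": 0.1, "ball": 0.8},
}

def factored_bigram(next_word, word):
    """P(next_word | word) ~= sum_c P(next_word | c) P(c | word)."""
    return sum(P_word_given_class[c].get(next_word, 0.0) * pc
               for c, pc in P_class_given_word[word].items())

print(factored_bigram("bonds", "sell"))   # higher: 'sell' is mostly in the finance class
print(factored_bigram("bonds", "play"))   # lower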

Distributional Representation

- Describing n ∈ N: use the conditional distribution p(V | n)
  [Figure: relative frequencies of the verbs hold, buy, issue, purchase, own, select,
  tender, sell, trade as direct-object contexts for "stock" and "bond".]

Training Data

- Universe: two word classes V and N, and a single relation between them
  (e.g. main verb and head noun of the verb's direct object)
- Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or
  pattern matching

Reminder: Bottleneck Model

- Find soft clusters Ñ, described by p(ñ | n), that maximize the mutual
  information I(Ñ, V) for a fixed I(Ñ, N)
- Markov condition: p(ñ | v) = Σ_n p(ñ | n) p(n | v)
- I(Ñ, V) = Σ_{ñ,v} p(ñ, v) log [ p(ñ, v) / (p(ñ) p(v)) ]

Search for Cluster Solutions

- Solution:
  p(ñ | n) = (p(ñ) / Z_n) exp( −β D_KL( p(V | n) ‖ p(V | ñ) ) )
- The scale parameter β (inverse temperature) determines how much a noun
  contributes to nearby centroids
- β increases ⇒ clusters split ⇒ hierarchical clustering
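A minimal sketch of the soft membership formula above,
p(ñ|n) ∝ p(ñ) exp(−β D_KL(p(V|n) ‖ p(V|ñ))); the verb-context distributions, centroids,
prior, and β value are toy assumptions.

import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def memberships(p_v_given_n, centroids, prior, beta):
    """p(cluster | noun) for one noun, given cluster centroids p(V | cluster)."""
    w = np.array([prior[c] * np.exp(-beta * kl(p_v_given_n, centroids[c]))
                  for c in range(len(centroids))])
    return w / w.sum()                          # normalize by Z_n

# Verb-context distribution for one noun and two cluster centroids (toy numbers).
p_v_given_stock = [0.5, 0.4, 0.1]               # e.g. over the verbs (buy, sell, fire)
centroids = [[0.45, 0.45, 0.10],                # "security-like" cluster
             [0.10, 0.10, 0.80]]                # "weapon-like" cluster
print(memberships(p_v_given_stock, centroids, prior=[0.5, 0.5], beta=5.0))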

Small Example

- Cluster the 64 most common direct objects of "fire" in the 1988 Associated
  Press newswire. Sample clusters (nouns closest to four cluster centroids, with
  the scores shown on the slide):

  officer 0.484   missile 0.835   shot    0.858   gun     0.758
  aide    0.612   rocket  0.850   bullet  0.925   missile 0.786
  chief   0.649   bullet  0.917   rocket  0.930   weapon  0.862
  manager 0.651   gun     0.940   missile 1.037   rocket  0.875

Mutual Information Ratios

  [Plot: I(Ñ,V)/I(N,V) against I(Ñ,N)/H(N), for increasing β.]

Using Clusters for Prediction

- Model verb-object associations through object clusters:

  p̂(v | n) = Σ_ñ p(v | ñ) p(ñ | n)                 (used in the experiments)
  p̂(v | n) = (1 / Z_n) Σ_ñ p(v | ñ) p(ñ | n)       (more appropriate; depends on β)

- Intuition: the associations of a word are a mixture of the associations of the
  sense classes to which the word belongs

Evaluation

- Relative entropy of held-out data to the asymmetric model
- Decision task: which of two verbs is more likely to take a noun as direct
  object, estimated from the model for training data in which the pairs relating
  the noun to one of the verbs have been deleted

Relative Entropy

- Verb-object pairs from the 1988 AP Newswire
- Train: 756,721 verb-object pair training set
- Test: 81,240-pair held-out test set
- New: held-out data for 1000 nouns not in the training data
  [Plot: average relative entropy (bits) of train, test, and new data to the
  model, against the number of clusters.]

Decision Task

- Which of v1 and v2 is more likely to take object o?
- Held out: 10^4 (v2, o) pairs, which must be guessed from the cluster-based model
- Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
- All: all test data
- Exceptional: verb pairs whose frequency ratio with o is reversed from their
  overall frequency ratio
  [Plot: decision error against the number of clusters, for all and for
  exceptional test pairs.]