Language Model Smoothing
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Zipf’s law
Dealing with unseen words/n-grams
Add-one smoothing
Linear smoothing
Good-Turing smoothing
Absolute discounting
Kneser-Ney smoothing
P(I | <s>) = 2/3        P(am | I) = 1
P(Sam | am) = 1/3       P(</s> | Sam) = 1/2
P(<s> I am Sam </s>) = 2/3 × 1 × 1/3 × 1/2 = 1/9
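As a minimal sketch of where these numbers come from, the snippet below estimates bigram MLE probabilities by counting; the corpus here is a hypothetical one constructed to reproduce the probabilities above (the slide's actual training corpus is not shown in this excerpt):

```python
from collections import Counter

# Hypothetical toy corpus chosen to reproduce the probabilities above;
# <s> and </s> mark sentence boundaries.
corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I am tired </s>"]

tokens = [s.split() for s in corpus]
bigrams = Counter(b for sent in tokens for b in zip(sent, sent[1:]))
histories = Counter(w for sent in tokens for w in sent[:-1])  # </s> never starts a bigram

def p_mle(w, prev):
    """MLE bigram estimate: P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / histories[prev]

# Sentence probability = product of its bigram probabilities.
sent = "<s> I am Sam </s>".split()
prob = 1.0
for prev, w in zip(sent, sent[1:]):
    prob *= p_mle(w, prev)
print(prob)  # 2/3 * 1 * 1/3 * 1/2 = 1/9 ≈ 0.111
```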
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Bigram probabilities: normalize each raw bigram count by the unigram count of its first word. [Raw-count and resulting probability tables from the original slides are not recoverable here.]
Smoothing intuition (credit: Dan Klein). We often need estimates from sparse counts, e.g. P(w | denied the):
3 allegations, 2 reports, 1 claims, 1 request (7 total)
Smoothing flattens this spiky distribution so it generalizes better, reserving mass for unseen continuations (outcome, attack, man, …):
2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
[Figure: bar charts of the unsmoothed and smoothed distributions.]
Add-one estimation (Laplace smoothing)
MLE estimate:
$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
Add-1 estimate ($V$ = vocabulary size):
$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
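A minimal sketch of both estimators side by side (the helper below and its names are illustrative, not from the slides):

```python
from collections import Counter

def bigram_estimators(sentences):
    """Build MLE and add-one bigram estimators from a list of token lists."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for toks in sentences:
        vocab.update(toks)
        unigrams.update(toks[:-1])              # history counts c(w_{i-1})
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)                              # vocabulary size

    def p_mle(w, prev):
        return bigrams[(prev, w)] / unigrams[prev]

    def p_add1(w, prev):
        # one pseudo-count per possible continuation -> +V in the denominator
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    return p_mle, p_add1

p_mle, p_add1 = bigram_estimators([["<s>", "I", "am", "Sam", "</s>"]])
print(p_mle("am", "I"), p_add1("am", "I"))      # 1.0 vs (1+1)/(1+5)
```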
Laplace-smoothed bigrams: the add-1 estimates for the Berkeley Restaurant corpus. [Smoothed bigram probability table omitted.]
Reconstituted counts: the effective counts implied by the smoothed probabilities, $c^*(w_{i-1}, w_i) = \frac{(c(w_{i-1}, w_i) + 1) \cdot c(w_{i-1})}{c(w_{i-1}) + V}$. [Table omitted.]
Compare with raw bigram counts: add-one smoothing shifts so much probability mass to unseen bigrams that even frequent counts shrink drastically. [Comparison table omitted.]
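As a worked instance of the reconstitution formula, using the Berkeley Restaurant figures from Jurafsky & Martin's textbook treatment (c(want to) = 608, c(want) = 927, V = 1446; these numbers come from that source, not from the surviving slide text):
$$c^*(\text{want to}) = \frac{(608 + 1) \times 927}{927 + 1446} = \frac{564{,}543}{2373} \approx 238$$
so add-one smoothing cuts the effective count of "want to" from 608 to roughly 238.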
Problem with Add-One Smoothing
We’ve been considering just 26 letter types …
[Diagram: data split into Training and Test portions.]
How to measure whether a particular λ gets good results?
Is it fair to measure that on test data (for setting λ)?
Moral: Selective reporting on test data can make a
method look artificially good. So it is unethical.
Rule: Test data cannot influence system
development. No peeking! Use it only to evaluate
the final system(s). Report all results on it.
[Diagram: the training data is split again into Dev (20%) and Training (80%); Test stays held out.]
Pick the λ that gets the best results on this 20% … when we collect counts from this 80% and smooth them using add-λ smoothing. Now use that λ to get smoothed counts from all 100% … and report results of that final smoothed model on test data.
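A minimal sketch of this recipe (corpus, split, and candidate λ values here are all hypothetical):

```python
import math, random
from collections import Counter

def train_add_lambda(sents, lam, V):
    """Collect bigram/unigram counts; return an add-lambda bigram estimator."""
    big, uni = Counter(), Counter()
    for s in sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks[:-1])
        big.update(zip(toks, toks[1:]))
    return lambda w, prev: (big[(prev, w)] + lam) / (uni[prev] + lam * V)

def avg_neg_logprob(p, sents):
    """Average negative log-probability per bigram (lower is better)."""
    lp, n = 0.0, 0
    for s in sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(toks, toks[1:]):
            lp -= math.log(p(w, prev))
            n += 1
    return lp / n

training = ["i want chinese food", "i want thai food",      # hypothetical data
            "tell me about thai food", "i want a cheap place"]
V = len({w for s in training for w in s.split()} | {"</s>"})
random.seed(0); random.shuffle(training)
train80, dev20 = training[:3], training[3:]                  # ~80% / 20% split

# Pick the lambda that does best on the held-out 20% ...
best = min([0.01, 0.1, 0.5, 1.0],
           key=lambda lam: avg_neg_logprob(train_add_lambda(train80, lam, V), dev20))
# ... then retrain with that lambda on all 100% of the training data;
# the test set is touched only once, to report the final result.
final_model = train_add_lambda(training, best, V)
```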
[Zipf's law figure: http://wugology.com/zipfs-law/]
Counts of counts: how many word types occur exactly r times in the corpus?

count r   N_r = # types with count r   example types
69836     1                            the
52108     1                            EOS
…
6         N_6                          abdomen, bachelor, Caesar …
5         N_5                          aberrant, backlog, cabinets …
4         N_4                          abdominal, Bach, cabana …
3         N_3                          Abbas, babel, Cabot …
2         N_2                          aback, Babbitt, cabanas …
1         N_1                          abaringe, Babatinde, cabaret …
0         N_0                          (all unseen types)

Unsmoothed, each type seen $r$ times gets probability
$$\frac{r}{N} = \frac{N_r \cdot r/N}{N_r},$$
the total mass of its count class divided by the $N_r$ types in it. Good-Turing instead hands each class the mass of the class above it:
$$p_{\text{GT}} = \frac{N_{r+1} \cdot (r+1)/N}{N_r}$$
[Figure: unsmoothed vs. smoothed probability per count class.]
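A minimal sketch of this reweighting on toy data (the tokens are made up):

```python
from collections import Counter

def good_turing(counts, N):
    """Raw Good-Turing: a type seen r times gets probability r*/N,
    where r* = (r + 1) * N_{r+1} / N_r and N_r = number of types seen r times."""
    Nr = Counter(counts.values())          # counts of counts
    def p(word):
        r = counts[word]
        if r == 0:
            return Nr[1] / N               # total mass reserved for ALL unseen types
        # Note: N_{r+1} can be 0 on sparse data -- exactly the problem
        # that Simple Good-Turing (below) fixes with a regression fit.
        return (r + 1) * Nr[r + 1] / (Nr[r] * N)
    return p

tokens = "the cat sat on the mat the end".split()
p = good_turing(Counter(tokens), len(tokens))
print(p("unicorn"), p("cat"))              # 5/8 unseen mass; 0.0 because N_2 = 0
```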
Witten-Bell vs. Good-Turing
Estimate p(z | xy) using just the tokens we’ve
seen in context xy. Might be a small set …
Witten-Bell intuition: If those tokens were
distributed over many different types, then novel
types are likely in future.
Good-Turing intuition: If many of those tokens came from singleton types, then novel types are likely in future.
Very nice idea (but a bit tricky in practice)
See the paper “Good-Turing smoothing without tears”
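To make the contrast concrete, here is a small sketch of how much probability mass each intuition reserves for novel types in a given context, using the standard Witten-Bell novel-event mass T/(N+T) (T = distinct types seen, N = tokens seen) against the Good-Turing mass N1/N:

```python
from collections import Counter

def novel_mass(tokens_in_context):
    """Probability mass reserved for never-seen continuations of one context."""
    counts = Counter(tokens_in_context)
    N, T = len(tokens_in_context), len(counts)        # tokens, distinct types
    n1 = sum(1 for c in counts.values() if c == 1)    # singleton types
    return {"witten_bell": T / (N + T), "good_turing": n1 / N}

print(novel_mass(["a", "b", "c", "d"]))   # many distinct types -> expect novelty
print(novel_mass(["a", "a", "a", "a"]))   # one repeated type -> expect little novelty
```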
Problem 2: we don't observe events for every k. As k grows, the count counts $N_k$ become sparse and jumpy (most are zero), so re-estimating each class from the one above it breaks down for large k.
[Figure: each class re-estimated from the next: N_1 ← N_2, N_2 ← N_3, …, N_3510 ← N_3511, N_4416 ← N_4417.]
Good-Turing Reweighting
Simple Good-Turing [Gale and Sampson]: replace the empirical $N_k$ with a best-fit regression (e.g., a power law) once the count counts get unreliable:
$$f(k) = a + b \log k, \qquad \text{choose } a, b \text{ such that } f(k) \approx N_k$$
[Figure: empirical $N_1, N_2, N_3, \dots$ kept where reliable, replaced by the fitted curve elsewhere.]
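A minimal sketch of that regression step; following Gale & Sampson, the line is fit in log-log space (log Nk = a + b log k, i.e. a power law), and the count counts below are invented for illustration:

```python
import math

def fit_count_counts(Nk):
    """Least-squares fit of log N_k = a + b log k (a power law, as in
    Gale & Sampson's Simple Good-Turing); returns a smoothed stand-in f(k)."""
    ks = sorted(Nk)
    xs = [math.log(k) for k in ks]
    ys = [math.log(Nk[k]) for k in ks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda k: math.exp(a + b * math.log(k))

Nk = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 7: 5}    # hypothetical count counts
f = fit_count_counts(Nk)
print(round(f(6), 2), f(4417))                      # defined even where N_k was unobserved
```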
Backoff and interpolation
Backoff smoothing:
Holds out the same amount of probability mass for novel events,
but divides it up unevenly, in proportion to the backoff probability.
$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1}) \underbrace{P(w)}_{\text{unigram}}$$
where $d$ is a fixed discount (e.g., $d = 0.75$) and $\lambda(w_{i-1})$ is chosen so the distribution still sums to one.
But should we really just use the regular
unigram P(w)?
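Before answering that, here is a minimal sketch of interpolated absolute discounting with the regular unigram P(w); the normalizer λ(w_{i-1}) = d · (distinct continuations of w_{i-1}) / c(w_{i-1}) is the standard choice that makes each conditional distribution sum to one (the data below is made up):

```python
from collections import Counter

def absolute_discounting(sentences, d=0.75):
    """Interpolated absolute discounting:
    P(w | prev) = max(c(prev, w) - d, 0) / c(prev) + lambda(prev) * P_unigram(w),
    with lambda(prev) = d * (# distinct continuations of prev) / c(prev),
    which hands back exactly the mass removed by the discount."""
    big, uni = Counter(), Counter()
    for toks in sentences:
        uni.update(toks[:-1])
        big.update(zip(toks, toks[1:]))
    N = sum(uni.values())
    followers = Counter(prev for (prev, _w) in big)   # distinct continuation types
    def p(w, prev):
        lam = d * followers[prev] / uni[prev]
        # for simplicity, the "regular unigram" P(w) is estimated from history counts
        return max(big[(prev, w)] - d, 0) / uni[prev] + lam * uni[w] / N
    return p

p = absolute_discounting([["<s>", "i", "want", "thai", "food", "</s>"],
                          ["<s>", "i", "want", "chinese", "food", "</s>"]])
print(p("thai", "want"), p("chinese", "want"), p("food", "want"))
```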