
Lecture 3: Language Model Smoothing

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16

This lecture

 Zipf’s law
 Dealing with unseen words/n-grams
 Add-one smoothing
 Linear smoothing
 Good-Turing smoothing
 Absolute discounting
 Kneser-Ney smoothing

Recap: Bigram language model

<S> I am Sam </S>


<S> I am legend </S>
<S> Sam I am </S>
Let P(<S>) = 1

P(I | <S>) = 2/3
P(am | I) = 1
P(Sam | am) = 1/3
P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 × 2/3 × 1 × 1/3 × 1/2 = 1/9
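A minimal Python sketch of this computation, assuming whitespace tokenization of the three toy sentences above (the helper names are mine, not from the course):

```python
from collections import Counter

# Toy corpus from the slide; <S> and </S> mark sentence boundaries.
corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks[:-1])            # counts of words used as contexts
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    """Relative-frequency estimate P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

sent = "<S> I am Sam </S>".split()
p = 1.0
for prev, w in zip(sent, sent[1:]):
    p *= p_mle(w, prev)
print(p)   # 2/3 * 1 * 1/3 * 1/2 = 1/9 ≈ 0.111
```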
More examples:
Berkeley Restaurant Project sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day

Raw bigram counts

 Out of 9222 sentences

Raw bigram probabilities

 Normalize by unigrams:

 Result:

Zeros
 Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

 Test set:
… denied the offer
… denied the loan

P("offer" | denied the) = 0

Smoothing

"This dark art is why NLP is taught in the engineering school."

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.

But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
Credit: the following slides are adapted from Jason Eisner’s NLP course
What is smoothing?

[Figure: estimated distributions from samples of size 20, 200, 2000, and 2,000,000]

ML 101: bias variance tradeoff
 Different samples of size 20 vary considerably
   though on average, they give the correct bell curve!

[Figure: four different samples of size 20]

Overfitting

The perils of overfitting
 N-grams only work well for word prediction if the test corpus looks like the training corpus
   In real life, it often doesn't
 We need to train robust models that generalize!
 One kind of generalization: Zeros!
   Things that don't ever occur in the training set
   But occur in the test set

The intuition of smoothing
 When we have sparse statistics:
   P(w | denied the):
     3 allegations
     2 reports
     1 claims
     1 request
     7 total
 Steal probability mass to generalize better:
   P(w | denied the):
     2.5 allegations
     1.5 reports
     0.5 claims
     0.5 request
     2 other (spread over unseen words such as attack, man, outcome, …)
     7 total

Credit: Dan Klein
CS6501 Natural Language Processing 13
CuuDuongThanCong.com https://fb.com/tailieudientucntt
Add-one estimation (Laplace smoothing)

 Pretend we saw each word one more time than we did
 Just add one to all the counts!

 MLE estimate:
   P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

 Add-1 estimate:
   P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V),   where V is the vocabulary size
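A hedged sketch of the two estimators above (function and variable names are mine; V defaults to the number of observed types, whereas the slides assume a fixed dictionary size):

```python
from collections import Counter

def bigram_estimators(tokens, vocab_size=None):
    """Return P_MLE and P_Add-1 for a token list, following the formulas above."""
    contexts = Counter(tokens[:-1])               # c(w_{i-1})
    bigrams = Counter(zip(tokens, tokens[1:]))    # c(w_{i-1}, w_i)
    V = vocab_size if vocab_size is not None else len(set(tokens))

    def p_mle(w, prev):
        return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0

    def p_add1(w, prev):
        return (bigrams[(prev, w)] + 1) / (contexts[prev] + V)

    return p_mle, p_add1
```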

Add-One Smoothing
(context "xy" followed by one of 26 letter types; 300 training tokens)

event      count   prob      add-1 count   add-1 prob
xya        100     100/300   101           101/326
xyb        0       0/300     1             1/326
xyc        0       0/300     1             1/326
xyd        200     200/300   201           201/326
xye        0       0/300     1             1/326
…
xyz        0       0/300     1             1/326
Total xy   300     300/300   326           326/326

Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Laplace-smoothed bigrams

V=1446 in the Berkeley Restaurant Project corpus

Reconstituted counts

Compare with raw bigram counts

Problem with Add-One Smoothing
We've been considering just 26 letter types …

event      count   prob   add-1 count   add-1 prob
xya        1       1/3    2             2/29
xyb        0       0/3    1             1/29
xyc        0       0/3    1             1/29
xyd        2       2/3    3             3/29
xye        0       0/3    1             1/29
…
xyz        0       0/3    1             1/29
Total xy   3       3/3    29            29/29
Problem with Add-One Smoothing
Suppose we're considering 20000 word types

event            count   prob   add-1 count   add-1 prob
see the abacus   1       1/3    2             2/20003
see the abbot    0       0/3    1             1/20003
see the abduct   0       0/3    1             1/20003
see the above    2       2/3    3             3/20003
see the Abram    0       0/3    1             1/20003
…
see the zygote   0       0/3    1             1/20003
Total            3       3/3    20003         20003/20003

Problem with Add-One Smoothing
Suppose we're considering 20000 word types (same table as above)

"Novel event" = event never happened in training data.
Here: 19998 novel events, with total estimated probability 19998/20003.
Add-one smoothing thinks we are extremely likely to see novel events, rather than words we've seen.
Infinite Dictionary?
In fact, aren't there infinitely many possible word types?

event           count   prob   add-1 count   add-1 prob
see the aaaaa   1       1/3    2             2/(∞+3)
see the aaaab   0       0/3    1             1/(∞+3)
see the aaaac   0       0/3    1             1/(∞+3)
see the aaaad   2       2/3    3             3/(∞+3)
see the aaaae   0       0/3    1             1/(∞+3)
…
see the zzzzz   0       0/3    1             1/(∞+3)
Total           3       3/3    ∞+3           (∞+3)/(∞+3)

Add-Lambda Smoothing
 A large dictionary makes novel events too probable.
 To fix: Instead of adding 1 to all counts, add λ = 0.01?
 This gives much less probability to novel events.
 But how do we pick the best value for λ?
 That is, how much should we smooth?
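A small sketch of the add-λ estimate (λ = 1 recovers add-one; the function name and defaults are illustrative):

```python
from collections import Counter

def make_add_lambda(tokens, lam=0.01, vocab_size=None):
    """Add-λ bigram estimate: P(w | prev) = (c(prev, w) + λ) / (c(prev) + λ·V)."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = vocab_size if vocab_size is not None else len(set(tokens))

    def p(w, prev):
        return (bigrams[(prev, w)] + lam) / (contexts[prev] + lam * V)
    return p
```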

Add-0.001 Smoothing
Doesn't smooth much (estimated distribution has high variance)

event      count   prob   add-0.001 count   add-0.001 prob
xya        1       1/3    1.001             0.331
xyb        0       0/3    0.001             0.0003
xyc        0       0/3    0.001             0.0003
xyd        2       2/3    2.001             0.661
xye        0       0/3    0.001             0.0003
…
xyz        0       0/3    0.001             0.0003
Total xy   3       3/3    3.026             1
Add-1000 Smoothing
Smooths too much (estimated distribution has high bias)

event      count   prob   add-1000 count   add-1000 prob
xya        1       1/3    1001              1/26
xyb        0       0/3    1000              1/26
xyc        0       0/3    1000              1/26
xyd        2       2/3    1002              1/26
xye        0       0/3    1000              1/26
…
xyz        0       0/3    1000              1/26
Total xy   3       3/3    26003             1
Add-Lambda Smoothing
 A large dictionary makes novel events too probable.
 To fix: Instead of adding 1 to all counts, add λ = 0.01?
 This gives much less probability to novel events.
 But how do we pick the best value for λ?
 That is, how much should we smooth?
   E.g., how much probability to "set aside" for novel events?
   Depends on how likely novel events really are!
   Which may depend on the type of text, size of training corpus, …
   Can we figure it out from the data?
   We'll look at a few methods for deciding how much to smooth.

Setting Smoothing Parameters
 How to pick the best value for λ? (in add-λ smoothing)
   Try many λ values & report the one that gets the best results?

   [Diagram: data split into Training and Test sets]

 How to measure whether a particular λ gets good results?
   Is it fair to measure that on test data (for setting λ)?
 Moral: Selective reporting on test data can make a method look artificially good. So it is unethical.
 Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.

Feynman's Advice: "The first principle is that you must not fool yourself, and you are the easiest person to fool."

Setting Smoothing Parameters
 How to pick the best value for λ?
   Try many λ values & report the one that gets the best results?

   [Diagram: Test set held out; the rest split into Training (80%, blue) and Dev. (20%, yellow)]

   Pick the λ that gets the best results on this 20% (Dev.) … when we collect counts from the other 80% (Training) and smooth them using add-λ smoothing. Now use that λ to get smoothed counts from all 100% … and report results of that final model on test data.
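A sketch of this procedure, assuming the make_add_lambda helper from earlier; the candidate grid and the use of per-token negative log-likelihood as the dev score are my assumptions, not the course's exact recipe:

```python
import math

def dev_nll(p, dev_tokens):
    """Per-token negative log-likelihood of a bigram model on dev data."""
    logps = [math.log(p(w, prev)) for prev, w in zip(dev_tokens, dev_tokens[1:])]
    return -sum(logps) / len(logps)

def pick_lambda(train_tokens, dev_tokens, candidates=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Grid-search λ: smooth counts from the training 80%, score on the dev 20%."""
    return min(candidates,
               key=lambda lam: dev_nll(make_add_lambda(train_tokens, lam), dev_tokens))
```

After picking λ this way, the slide's recipe re-smooths counts from all 100% of the training data with that λ before the final evaluation on test data.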

Large or small Dev set?

 Here we held out 20% of our training set (yellow) for development.
 Would like to use > 20% yellow:
   20% is not enough to reliably assess λ
 Would like to use > 80% blue:
   The best λ for smoothing 80% ≠ the best λ for smoothing 100%

Cross-Validation

 Try 5 training/dev splits as below
 Pick the λ that gets the best average performance

   [Diagram: 5 folds; each fold uses a different 20% slice as Dev. and the remaining 80% for training; Test is still held out]

 + Tests on all 100% as dev (yellow), so we can more reliably assess λ
 - Still picks a λ that's good at smoothing the 80% size, not 100%.
   But now we can grow that 80% without trouble

N-fold Cross-Validation (“Leave One Out”)

   [Diagram: each sentence in turn is held out (yellow); Test is still separate]

 Test each sentence with the smoothed model built from the other N−1 sentences
 + Still tests on all 100% as yellow, so we can reliably assess λ
 + Trains on nearly 100% blue data ((N−1)/N) to measure whether λ is good for smoothing that much data
N-fold Cross-Validation (“Leave One Out”)

 Surprisingly fast: why?
   Usually easy to retrain on blue by adding/subtracting 1 sentence's counts

More Ideas for Smoothing

 Remember, we're trying to decide how much to smooth.
   E.g., how much probability to "set aside" for novel events?
   Depends on how likely novel events really are
   Which may depend on the type of text, size of training corpus, …
   Can we figure this out from the data?
How likely are novel events?
(20000 word types; counts in two different 300-token samples)

word        sample 1 (300 tokens)   sample 2 (300 tokens)
a           150                     0
both        18                      0
candy       0                       1
donuts      0                       2
every       50                      0
???         0                       0
grapes      0                       1
his         38                      0
ice cream   0                       7

Which zero would you expect is really rare?
How likely are novel events?
(same table; the ??? row is "farina", with 0 in both samples)

determiners (a, both, every, his): a closed class
How likely are novel events?
(same table)

(food) nouns (candy, donuts, farina, grapes, ice cream): an open class
Zipf's law
The r-th most common word w_r has P(w_r) ∝ 1/r.
A few words are very frequent.

http://wugology.com/zipfs-law/
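A quick empirical check, as a sketch: under Zipf's law, rank × frequency should be roughly constant across the most frequent words (the function name is mine):

```python
from collections import Counter

def zipf_table(tokens, top_k=20):
    """Print rank, count, and rank*count for the top_k most frequent words."""
    for r, (word, c) in enumerate(Counter(tokens).most_common(top_k), start=1):
        print(f"{r:4d}  {word:15s}  {c:8d}  {r * c:10d}")
```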

How common are novel events?

[Bar chart: each bar has length Nk · k, the number of tokens contributed by word types seen k times]
1 * 69836   the
1 * 52108   EOS
N6 * 6   abdomen, bachelor, Caesar …
N5 * 5   aberrant, backlog, cabinets …
N4 * 4   abdominal, Bach, cabana …
N3 * 3   Abbas, babel, Cabot …
N2 * 2   aback, Babbitt, cabanas …
N1 * 1   abaringe, Babatinde, cabaret …
N0 * 0

How common are novel events?
Counts from Brown Corpus (N ≈ 1 million tokens)

[Bar chart of Nk · k for k = 0 … 6:]
N2 * 2   doubletons (occur twice)
N1 * 1   singletons (occur once)
N0 * 0   novel words (in dictionary but never occur)

N2 = # doubleton types        Σr Nr = total # types = T (purple bars)
N2 * 2 = # doubleton tokens   Σr (Nr · r) = total # tokens = N (all bars)
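The Nr quantities above are easy to compute; a brief sketch (variable names are mine):

```python
from collections import Counter

def counts_of_counts(tokens):
    """N_r = number of word types occurring exactly r times.
    sum_r N_r = # types T; sum_r r * N_r = # tokens N."""
    word_counts = Counter(tokens)
    N_r = Counter(word_counts.values())
    T, N = len(word_counts), sum(word_counts.values())
    return N_r, T, N
```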
Witten-Bell Smoothing Idea

If T/N is large, we've seen lots of novel types in the past, so we expect lots more.

               unsmoothed → smoothed
doubletons:    2/N → 2/(N+T)
singletons:    1/N → 1/(N+T)
novel words:   0/N → (T/(N+T)) / N0
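A rough sketch of the unigram version of this idea; the full Witten-Bell recipe conditions N, T, and N0 on the n-gram context, so treat this as illustration only:

```python
from collections import Counter

def witten_bell_unigram(tokens, dictionary):
    """Seen word w gets c(w)/(N+T); the reserved mass T/(N+T) is split
    evenly over the N0 dictionary words that never occurred."""
    counts = Counter(tokens)
    N, T = sum(counts.values()), len(counts)
    N0 = sum(1 for w in dictionary if w not in counts)

    def p(w):
        if w in counts:
            return counts[w] / (N + T)
        return (T / (N + T)) / N0
    return p
```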

Good-Turing Smoothing Idea

Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data.
Use the observed total probability of class r+1 to estimate the total probability of class r.
(In the Brown chart: obs. p(tripletons) ≈ 1.2%, obs. p(doubletons) ≈ 1.5%, obs. p(singletons) ≈ 2% of all tokens.)

                     unsmoothed → smoothed
est. p(doubleton):   2/N → (N3·3/N) / N2
est. p(singleton):   1/N → (N2·2/N) / N1
est. p(novel):       0/N → (N1·1/N) / N0

In general:   r/N = (Nr·r/N) / Nr  →  (Nr+1·(r+1)/N) / Nr
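A bare-bones sketch of this reweighting (not renormalized, and with no smoothing of the Nr themselves, so it breaks exactly where the next slides say it does):

```python
from collections import Counter

def good_turing_unigram(tokens):
    """A word seen r times gets adjusted count r* = (r+1) * N_{r+1} / N_r;
    all novel words together share the observed singleton mass N_1 / N."""
    counts = Counter(tokens)
    N = sum(counts.values())
    N_r = Counter(counts.values())

    def p(w):
        if w not in counts:
            return N_r.get(1, 0) / N          # total probability of novel events
        r = counts[w]
        r_star = (r + 1) * N_r.get(r + 1, 0) / N_r[r]
        return r_star / N
    return p
```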
Witten-Bell vs. Good-Turing
 Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
 Witten-Bell intuition: If those tokens were distributed over many different types, then novel types are likely in future.
 Good-Turing intuition: If many of those tokens came from singleton types, then novel types are likely in future.
   Very nice idea (but a bit tricky in practice)
   See the paper "Good-Turing smoothing without tears"

Good-Turing Reweighting
 Problem 1: what about "the"? (k = 4417)
   For small k, Nk > Nk+1
   For large k, too jumpy.
 Problem 2: we don't observe events for every k

[Diagram: each class's mass is re-estimated from the next class up: N1 → N0, N2 → N1, N3 → N2, …, N3511 → N3510, N4417 → N4416]
Good-Turing Reweighting
Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit regression (e.g., a power law) once the counts of counts get unreliable:
   f(k) = a + b log(k)
   Find a, b such that f(k) ∼ Nk
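A least-squares sketch of the power-law fit the slide mentions, given a counts-of-counts dictionary N_r (as computed earlier): fit log Nk ≈ a + b log k and read smoothed Nk values off the fitted line. Gale & Sampson's actual Simple Good-Turing recipe adds further details, so this is illustrative only:

```python
import math

def fit_power_law(N_r):
    """Fit N_k ≈ exp(a) * k^b by ordinary least squares in log-log space."""
    pts = [(math.log(k), math.log(n)) for k, n in N_r.items() if k > 0 and n > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return lambda k: math.exp(a + b * math.log(k))   # smoothed estimate of N_k
```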

Backoff and interpolation

 Why are we treating all novel events as the same?

Backoff and interpolation

 p(zombie | see the) vs. p(baby | see the)
   What if count(see the zygote) = count(see the baby) = 0?
   baby beats zygote as a unigram
   the baby beats the zygote as a bigram
 see the baby beats see the zygote? (even if both have the same count, such as 0)

Backoff and interpolation

 Condition on less context for contexts you haven't learned much about
   Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram
   Interpolation: mixture of unigram, bigram, trigram (etc.) models
Smoothing + backoff
Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell):
 Holds out some probability mass for novel events
 Divides it up evenly among the novel events

Backoff smoothing:
 Holds out the same amount of probability mass for novel events
 But divides it up unevenly, in proportion to the backoff prob.

Smoothing + backoff

 When defining p(z | xy), the backoff prob for novel z is p(z | y)
   Even if z was never observed after xy, it may have been observed after y (why?).
   Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
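A toy sketch of that backoff chain (trigram → bigram → unigram). The constant discount factor makes it "stupid backoff"-style rather than a properly normalized scheme such as Katz backoff, and all names here are mine:

```python
from collections import Counter

def make_backoff(tokens, alpha=0.4):
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p(z, x, y):
        if tri[(x, y, z)] > 0:                       # good trigram evidence
            return tri[(x, y, z)] / bi[(x, y)]
        if bi[(y, z)] > 0:                           # back off to p(z | y)
            return alpha * bi[(y, z)] / uni[y]
        return alpha * alpha * uni[z] / N            # back off further to p(z)
    return p
```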

Linear Interpolation
 Jelinek-Mercer smoothing
 Use a weighted average of backed-off naïve models:
   p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
   where λ3 + λ2 + λ1 = 1 and all λ ≥ 0
 The weights λ can depend on the context xy
   E.g., we can make λ3 large if count(xy) is high
   Learn the weights on held-out data w/ jackknifing
   Different λ3 when xy is observed 1 time, 2 times, 5 times, …
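A sketch of the mixture, assuming you already have trigram, bigram, and unigram estimators; the fixed weights are placeholders for values tuned on held-out data:

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """p_average(z | xy) = λ3 p(z|xy) + λ2 p(z|y) + λ1 p(z), weights summing to 1."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-12

    def p(z, x, y):
        return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
    return p
```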

Absolute Discounting Interpolation

 Save ourselves some time and just subtract 0.75 (or some d)!

   P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) − d) / c(w_{i-1}) + λ(w_{i-1}) P(w)
                                          [discounted bigram]                 [interpolation weight × unigram]

 But should we really just use the regular unigram P(w)?
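A sketch of the formula above, using the common choice λ(prev) = d · (number of distinct words seen after prev) / c(prev) so that exactly the discounted mass is handed to the unigram term; this particular λ is standard but is not spelled out on the slide:

```python
from collections import Counter

def make_absolute_discounting(tokens, d=0.75):
    uni = Counter(tokens)
    ctx = Counter(tokens[:-1])                      # c(prev) as a bigram context
    bi = Counter(zip(tokens, tokens[1:]))
    followers = Counter(prev for prev, _ in bi)     # distinct continuations of prev
    N = len(tokens)

    def p(w, prev):
        if ctx[prev] == 0:
            return uni[w] / N                       # unseen context: plain unigram
        lam = d * followers[prev] / ctx[prev]
        return max(bi[(prev, w)] - d, 0) / ctx[prev] + lam * uni[w] / N
    return p
```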

Absolute discounting: just subtract a little from each count
 How much to subtract?
 Church and Gale (1991)'s clever idea:
   Divide data into training and held-out sets
   For each bigram in the training set, see the actual count in the held-out set!

Bigram count in training   Bigram count in held-out set
0                          0.0000270
1                          0.448
2                          1.25
3                          2.24
4                          3.23
5                          4.21

 It sure looks like c* = (c − .75)