
Lecture 3: Language Model Smoothing

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16

This lecture

 Zipf’s law
 Dealing with unseen words/n-grams
 Add-one smoothing
 Linear smoothing
 Good-Turing smoothing
 Absolute discounting
 Kneser-Ney smoothing

Recap: Bigram language model

<S> I am Sam </S>


<S> I am legend </S>
<S> Sam I am </S>
Let P(<S>) = 1

P(I | <S>) = 2/3
P(am | I) = 1
P(Sam | am) = 1/3
P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 × 2/3 × 1 × 1/3 × 1/2 = 1/9
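A minimal Python sketch of this computation, assuming whitespace tokenization of the three toy sentences above (the helper names are mine, not from the course):

```python
from collections import Counter

# Toy corpus from the slide; <S> and </S> mark sentence boundaries.
corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks[:-1])            # counts of words used as contexts
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    """Relative-frequency estimate P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

sent = "<S> I am Sam </S>".split()
p = 1.0
for prev, w in zip(sent, sent[1:]):
    p *= p_mle(w, prev)
print(p)   # 2/3 * 1 * 1/3 * 1/2 = 1/9 ≈ 0.111
```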
More examples:
Berkeley Restaurant Project sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day

Raw bigram counts

 Out of 9222 sentences

Raw bigram probabilities

 Normalize by unigrams:

 Result:

Zeros
 Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

 Test set:
… denied the offer
… denied the loan

P("offer" | denied the) = 0

Smoothing

"This dark art is why NLP is taught in the engineering school."

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.

But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
Credit: the following slides are adapted from Jason Eisner’s NLP course
What is smoothing?

[Figure: estimated distributions from samples of size 20, 200, 2000, and 2,000,000]

ML 101: bias variance tradeoff
 Different samples of size 20 vary considerably
   though on average, they give the correct bell curve!

[Figure: four different samples of size 20]

Overfitting

The perils of overfitting
 N-grams only work well for word prediction if the test corpus looks like the training corpus
   In real life, it often doesn't
 We need to train robust models that generalize!
 One kind of generalization: Zeros!
   Things that don't ever occur in the training set
   But occur in the test set

The intuition of smoothing
 When we have sparse statistics:
   P(w | denied the):
     3 allegations
     2 reports
     1 claims
     1 request
     7 total
 Steal probability mass to generalize better:
   P(w | denied the):
     2.5 allegations
     1.5 reports
     0.5 claims
     0.5 request
     2 other (spread over unseen words such as attack, man, outcome, …)
     7 total

Credit: Dan Klein
CS6501 Natural Language Processing 13
CuuDuongThanCong.com https://fb.com/tailieudientucntt
Add-one estimation (Laplace smoothing)

 Pretend we saw each word one more time than we did
 Just add one to all the counts!

 MLE estimate:
   P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

 Add-1 estimate:
   P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V),   where V is the vocabulary size
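A hedged sketch of the two estimators above (function and variable names are mine; V defaults to the number of observed types, whereas the slides assume a fixed dictionary size):

```python
from collections import Counter

def bigram_estimators(tokens, vocab_size=None):
    """Return P_MLE and P_Add-1 for a token list, following the formulas above."""
    contexts = Counter(tokens[:-1])               # c(w_{i-1})
    bigrams = Counter(zip(tokens, tokens[1:]))    # c(w_{i-1}, w_i)
    V = vocab_size if vocab_size is not None else len(set(tokens))

    def p_mle(w, prev):
        return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0

    def p_add1(w, prev):
        return (bigrams[(prev, w)] + 1) / (contexts[prev] + V)

    return p_mle, p_add1
```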

Add-One Smoothing
(context "xy" followed by one of 26 letter types; 300 training tokens)

event      count   prob      add-1 count   add-1 prob
xya        100     100/300   101           101/326
xyb        0       0/300     1             1/326
xyc        0       0/300     1             1/326
xyd        200     200/300   201           201/326
xye        0       0/300     1             1/326
…
xyz        0       0/300     1             1/326
Total xy   300     300/300   326           326/326

Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Laplace-smoothed bigrams

V=1446 in the Berkeley Restaurant Project corpus

Reconstituted counts

Compare with raw bigram counts

Problem with Add-One Smoothing
We've been considering just 26 letter types …

event      count   prob   add-1 count   add-1 prob
xya        1       1/3    2             2/29
xyb        0       0/3    1             1/29
xyc        0       0/3    1             1/29
xyd        2       2/3    3             3/29
xye        0       0/3    1             1/29
…
xyz        0       0/3    1             1/29
Total xy   3       3/3    29            29/29
Problem with Add-One Smoothing
Suppose we're considering 20000 word types

event            count   prob   add-1 count   add-1 prob
see the abacus   1       1/3    2             2/20003
see the abbot    0       0/3    1             1/20003
see the abduct   0       0/3    1             1/20003
see the above    2       2/3    3             3/20003
see the Abram    0       0/3    1             1/20003
…
see the zygote   0       0/3    1             1/20003
Total            3       3/3    20003         20003/20003

Problem with Add-One Smoothing
Suppose we're considering 20000 word types (same table as above)

"Novel event" = event never happened in training data.
Here: 19998 novel events, with total estimated probability 19998/20003.
Add-one smoothing thinks we are extremely likely to see novel events, rather than words we've seen.
Infinite Dictionary?
In fact, aren't there infinitely many possible word types?

event           count   prob   add-1 count   add-1 prob
see the aaaaa   1       1/3    2             2/(∞+3)
see the aaaab   0       0/3    1             1/(∞+3)
see the aaaac   0       0/3    1             1/(∞+3)
see the aaaad   2       2/3    3             3/(∞+3)
see the aaaae   0       0/3    1             1/(∞+3)
…
see the zzzzz   0       0/3    1             1/(∞+3)
Total           3       3/3    ∞+3           (∞+3)/(∞+3)

Add-Lambda Smoothing
 A large dictionary makes novel events too probable.
 To fix: Instead of adding 1 to all counts, add λ = 0.01?
 This gives much less probability to novel events.
 But how do we pick the best value for λ?
 That is, how much should we smooth?
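A small sketch of the add-λ estimate (λ = 1 recovers add-one; the function name and defaults are illustrative):

```python
from collections import Counter

def make_add_lambda(tokens, lam=0.01, vocab_size=None):
    """Add-λ bigram estimate: P(w | prev) = (c(prev, w) + λ) / (c(prev) + λ·V)."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = vocab_size if vocab_size is not None else len(set(tokens))

    def p(w, prev):
        return (bigrams[(prev, w)] + lam) / (contexts[prev] + lam * V)
    return p
```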

Add-0.001 Smoothing
Doesn't smooth much (estimated distribution has high variance)

event      count   prob   add-0.001 count   add-0.001 prob
xya        1       1/3    1.001             0.331
xyb        0       0/3    0.001             0.0003
xyc        0       0/3    0.001             0.0003
xyd        2       2/3    2.001             0.661
xye        0       0/3    0.001             0.0003
…
xyz        0       0/3    0.001             0.0003
Total xy   3       3/3    3.026             1
Add-1000 Smoothing
Smooths too much (estimated distribution has high bias)

event      count   prob   add-1000 count   add-1000 prob
xya        1       1/3    1001              1/26
xyb        0       0/3    1000              1/26
xyc        0       0/3    1000              1/26
xyd        2       2/3    1002              1/26
xye        0       0/3    1000              1/26
…
xyz        0       0/3    1000              1/26
Total xy   3       3/3    26003             1
Add-Lambda Smoothing
 A large dictionary makes novel events too probable.
 To fix: Instead of adding 1 to all counts, add λ = 0.01?
 This gives much less probability to novel events.
 But how do we pick the best value for λ?
 That is, how much should we smooth?
   E.g., how much probability to "set aside" for novel events?
   Depends on how likely novel events really are!
   Which may depend on the type of text, size of training corpus, …
   Can we figure it out from the data?
   We'll look at a few methods for deciding how much to smooth.

Setting Smoothing Parameters
 How to pick the best value for λ? (in add-λ smoothing)
   Try many λ values & report the one that gets the best results?

   [Diagram: data split into Training and Test sets]

 How to measure whether a particular λ gets good results?
   Is it fair to measure that on test data (for setting λ)?
 Moral: Selective reporting on test data can make a method look artificially good. So it is unethical.
 Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.

Feynman's Advice: "The first principle is that you must not fool yourself, and you are the easiest person to fool."

Setting Smoothing Parameters
 How to pick the best value for λ?
   Try many λ values & report the one that gets the best results?

   [Diagram: Test set held out; the rest split into Training (80%, blue) and Dev. (20%, yellow)]

   Pick the λ that gets the best results on this 20% (Dev.) … when we collect counts from the other 80% (Training) and smooth them using add-λ smoothing. Now use that λ to get smoothed counts from all 100% … and report results of that final model on test data.
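A sketch of this procedure, assuming the make_add_lambda helper from earlier; the candidate grid and the use of per-token negative log-likelihood as the dev score are my assumptions, not the course's exact recipe:

```python
import math

def dev_nll(p, dev_tokens):
    """Per-token negative log-likelihood of a bigram model on dev data."""
    logps = [math.log(p(w, prev)) for prev, w in zip(dev_tokens, dev_tokens[1:])]
    return -sum(logps) / len(logps)

def pick_lambda(train_tokens, dev_tokens, candidates=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Grid-search λ: smooth counts from the training 80%, score on the dev 20%."""
    return min(candidates,
               key=lambda lam: dev_nll(make_add_lambda(train_tokens, lam), dev_tokens))
```

After picking λ this way, the slide's recipe re-smooths counts from all 100% of the training data with that λ before the final evaluation on test data.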

Large or small Dev set?

 Here we held out 20% of our training set (yellow) for development.
 Would like to use > 20% yellow:
   20% is not enough to reliably assess λ
 Would like to use > 80% blue:
   The best λ for smoothing 80% ≠ the best λ for smoothing 100%

Cross-Validation

 Try 5 training/dev splits as below
 Pick the λ that gets the best average performance

   [Diagram: 5 folds; each fold uses a different 20% slice as Dev. and the remaining 80% for training; Test is still held out]

 + Tests on all 100% as dev (yellow), so we can more reliably assess λ
 - Still picks a λ that's good at smoothing the 80% size, not 100%.
   But now we can grow that 80% without trouble

N-fold Cross-Validation (“Leave One Out”)

   [Diagram: each sentence in turn is held out (yellow); Test is still separate]

 Test each sentence with the smoothed model built from the other N−1 sentences
 + Still tests on all 100% as yellow, so we can reliably assess λ
 + Trains on nearly 100% blue data ((N−1)/N) to measure whether λ is good for smoothing that much data
N-fold Cross-Validation (“Leave One Out”)

 Surprisingly fast: why?
   Usually easy to retrain on blue by adding/subtracting 1 sentence's counts

More Ideas for Smoothing

 Remember, we're trying to decide how much to smooth.
   E.g., how much probability to "set aside" for novel events?
   Depends on how likely novel events really are
   Which may depend on the type of text, size of training corpus, …
   Can we figure this out from the data?
How likely are novel events?
(20000 word types; counts in two different 300-token samples)

word        sample 1 (300 tokens)   sample 2 (300 tokens)
a           150                     0
both        18                      0
candy       0                       1
donuts      0                       2
every       50                      0
???         0                       0
grapes      0                       1
his         38                      0
ice cream   0                       7

Which zero would you expect is really rare?
How likely are novel events?
(same table; the ??? row is "farina", with 0 in both samples)

determiners (a, both, every, his): a closed class
How likely are novel events?
(same table)

(food) nouns (candy, donuts, farina, grapes, ice cream): an open class
Zipf's law
The r-th most common word w_r has P(w_r) ∝ 1/r.
A few words are very frequent.

http://wugology.com/zipfs-law/
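A quick empirical check, as a sketch: under Zipf's law, rank × frequency should be roughly constant across the most frequent words (the function name is mine):

```python
from collections import Counter

def zipf_table(tokens, top_k=20):
    """Print rank, count, and rank*count for the top_k most frequent words."""
    for r, (word, c) in enumerate(Counter(tokens).most_common(top_k), start=1):
        print(f"{r:4d}  {word:15s}  {c:8d}  {r * c:10d}")
```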

How common are novel events?

[Bar chart: each bar has length Nk · k, the number of tokens contributed by word types seen k times]
1 * 69836   the
1 * 52108   EOS
N6 * 6   abdomen, bachelor, Caesar …
N5 * 5   aberrant, backlog, cabinets …
N4 * 4   abdominal, Bach, cabana …
N3 * 3   Abbas, babel, Cabot …
N2 * 2   aback, Babbitt, cabanas …
N1 * 1   abaringe, Babatinde, cabaret …
N0 * 0

How common are novel events?
Counts from Brown Corpus (N ≈ 1 million tokens)

[Bar chart of Nk · k for k = 0 … 6:]
N2 * 2   doubletons (occur twice)
N1 * 1   singletons (occur once)
N0 * 0   novel words (in dictionary but never occur)

N2 = # doubleton types        Σr Nr = total # types = T (purple bars)
N2 * 2 = # doubleton tokens   Σr (Nr · r) = total # tokens = N (all bars)
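The Nr quantities above are easy to compute; a brief sketch (variable names are mine):

```python
from collections import Counter

def counts_of_counts(tokens):
    """N_r = number of word types occurring exactly r times.
    sum_r N_r = # types T; sum_r r * N_r = # tokens N."""
    word_counts = Counter(tokens)
    N_r = Counter(word_counts.values())
    T, N = len(word_counts), sum(word_counts.values())
    return N_r, T, N
```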
Witten-Bell Smoothing Idea

If T/N is large, we've seen lots of novel types in the past, so we expect lots more.

               unsmoothed → smoothed
doubletons:    2/N → 2/(N+T)
singletons:    1/N → 1/(N+T)
novel words:   0/N → (T/(N+T)) / N0
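A rough sketch of the unigram version of this idea; the full Witten-Bell recipe conditions N, T, and N0 on the n-gram context, so treat this as illustration only:

```python
from collections import Counter

def witten_bell_unigram(tokens, dictionary):
    """Seen word w gets c(w)/(N+T); the reserved mass T/(N+T) is split
    evenly over the N0 dictionary words that never occurred."""
    counts = Counter(tokens)
    N, T = sum(counts.values()), len(counts)
    N0 = sum(1 for w in dictionary if w not in counts)

    def p(w):
        if w in counts:
            return counts[w] / (N + T)
        return (T / (N + T)) / N0
    return p
```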

Good-Turing Smoothing Idea

Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data.
Use the observed total probability of class r+1 to estimate the total probability of class r.
(In the Brown chart: obs. p(tripletons) ≈ 1.2%, obs. p(doubletons) ≈ 1.5%, obs. p(singletons) ≈ 2% of all tokens.)

                     unsmoothed → smoothed
est. p(doubleton):   2/N → (N3·3/N) / N2
est. p(singleton):   1/N → (N2·2/N) / N1
est. p(novel):       0/N → (N1·1/N) / N0

In general:   r/N = (Nr·r/N) / Nr  →  (Nr+1·(r+1)/N) / Nr
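A bare-bones sketch of this reweighting (not renormalized, and with no smoothing of the Nr themselves, so it breaks exactly where the next slides say it does):

```python
from collections import Counter

def good_turing_unigram(tokens):
    """A word seen r times gets adjusted count r* = (r+1) * N_{r+1} / N_r;
    all novel words together share the observed singleton mass N_1 / N."""
    counts = Counter(tokens)
    N = sum(counts.values())
    N_r = Counter(counts.values())

    def p(w):
        if w not in counts:
            return N_r.get(1, 0) / N          # total probability of novel events
        r = counts[w]
        r_star = (r + 1) * N_r.get(r + 1, 0) / N_r[r]
        return r_star / N
    return p
```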
Witten-Bell vs. Good-Turing
 Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
 Witten-Bell intuition: If those tokens were distributed over many different types, then novel types are likely in future.
 Good-Turing intuition: If many of those tokens came from singleton types, then novel types are likely in future.
   Very nice idea (but a bit tricky in practice)
   See the paper "Good-Turing smoothing without tears"

Good-Turing Reweighting
 Problem 1: what about "the"? (k = 4417)
   For small k, Nk > Nk+1
   For large k, too jumpy.
 Problem 2: we don't observe events for every k

[Diagram: each class's mass is re-estimated from the next class up: N1 → N0, N2 → N1, N3 → N2, …, N3511 → N3510, N4417 → N4416]
Good-Turing Reweighting
Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit regression (e.g., a power law) once the counts of counts get unreliable:
   f(k) = a + b log(k)
   Find a, b such that f(k) ∼ Nk
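A least-squares sketch of the power-law fit the slide mentions, given a counts-of-counts dictionary N_r (as computed earlier): fit log Nk ≈ a + b log k and read smoothed Nk values off the fitted line. Gale & Sampson's actual Simple Good-Turing recipe adds further details, so this is illustrative only:

```python
import math

def fit_power_law(N_r):
    """Fit N_k ≈ exp(a) * k^b by ordinary least squares in log-log space."""
    pts = [(math.log(k), math.log(n)) for k, n in N_r.items() if k > 0 and n > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return lambda k: math.exp(a + b * math.log(k))   # smoothed estimate of N_k
```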

Backoff and interpolation

 Why are we treating all novel events as the same?

Backoff and interpolation

 p(zombie | see the) vs. p(baby | see the)
   What if count(see the zygote) = count(see the baby) = 0?
   baby beats zygote as a unigram
   the baby beats the zygote as a bigram
 see the baby beats see the zygote? (even if both have the same count, such as 0)

Backoff and interpolation

 Condition on less context for contexts you haven't learned much about
   Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram
   Interpolation: mixture of unigram, bigram, trigram (etc.) models
Smoothing + backoff
Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell):
 Holds out some probability mass for novel events
 Divides it up evenly among the novel events

Backoff smoothing:
 Holds out the same amount of probability mass for novel events
 But divides it up unevenly, in proportion to the backoff prob.

Smoothing + backoff

 When defining p(z | xy), the backoff prob for novel z is p(z | y)
   Even if z was never observed after xy, it may have been observed after y (why?).
   Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
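A toy sketch of that backoff chain (trigram → bigram → unigram). The constant discount factor makes it "stupid backoff"-style rather than a properly normalized scheme such as Katz backoff, and all names here are mine:

```python
from collections import Counter

def make_backoff(tokens, alpha=0.4):
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p(z, x, y):
        if tri[(x, y, z)] > 0:                       # good trigram evidence
            return tri[(x, y, z)] / bi[(x, y)]
        if bi[(y, z)] > 0:                           # back off to p(z | y)
            return alpha * bi[(y, z)] / uni[y]
        return alpha * alpha * uni[z] / N            # back off further to p(z)
    return p
```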

Linear Interpolation
 Jelinek-Mercer smoothing
 Use a weighted average of backed-off naïve models:
   p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
   where λ3 + λ2 + λ1 = 1 and all λ ≥ 0
 The weights λ can depend on the context xy
   E.g., we can make λ3 large if count(xy) is high
   Learn the weights on held-out data w/ jackknifing
   Different λ3 when xy is observed 1 time, 2 times, 5 times, …
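A sketch of the mixture, assuming you already have trigram, bigram, and unigram estimators; the fixed weights are placeholders for values tuned on held-out data:

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """p_average(z | xy) = λ3 p(z|xy) + λ2 p(z|y) + λ1 p(z), weights summing to 1."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-12

    def p(z, x, y):
        return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
    return p
```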

Absolute Discounting Interpolation

 Save ourselves some time and just subtract 0.75 (or some d)!

   P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) − d) / c(w_{i-1}) + λ(w_{i-1}) P(w)
                                          [discounted bigram]                 [interpolation weight × unigram]

 But should we really just use the regular unigram P(w)?
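A sketch of the formula above, using the common choice λ(prev) = d · (number of distinct words seen after prev) / c(prev) so that exactly the discounted mass is handed to the unigram term; this particular λ is standard but is not spelled out on the slide:

```python
from collections import Counter

def make_absolute_discounting(tokens, d=0.75):
    uni = Counter(tokens)
    ctx = Counter(tokens[:-1])                      # c(prev) as a bigram context
    bi = Counter(zip(tokens, tokens[1:]))
    followers = Counter(prev for prev, _ in bi)     # distinct continuations of prev
    N = len(tokens)

    def p(w, prev):
        if ctx[prev] == 0:
            return uni[w] / N                       # unseen context: plain unigram
        lam = d * followers[prev] / ctx[prev]
        return max(bi[(prev, w)] - d, 0) / ctx[prev] + lam * uni[w] / N
    return p
```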

Absolute discounting: just subtract a little from each count
 How much to subtract?
 Church and Gale (1991)'s clever idea:
   Divide data into training and held-out sets
   For each bigram in the training set, see the actual count in the held-out set!

Bigram count in training   Bigram count in held-out set
0                          0.0000270
1                          0.448
2                          1.25
3                          2.24
4                          3.23
5                          4.21

 It sure looks like c* = (c − .75)