Lecture 2 - Language Modeling
Language Modeling
Introduction to N-grams

Dan Jurafsky
P(w_1 w_2 \dots w_n) = \prod_i P(w_i \mid w_1 w_2 \dots w_{i-1})
P(“its water is so transparent”) =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
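As a minimal sketch of this chain-rule decomposition in Python, the sentence probability is just the product of the conditional probabilities. The `cond_prob` table and its values below are hypothetical placeholders, not numbers from the slides:

```python
# Hypothetical conditional probabilities keyed by (history, word).
# In practice these would come from an estimated language model.
cond_prob = {
    ((), "its"): 0.01,
    (("its",), "water"): 0.05,
    (("its", "water"), "is"): 0.20,
    (("its", "water", "is"), "so"): 0.10,
    (("its", "water", "is", "so"), "transparent"): 0.02,
}

def sentence_probability(words):
    """Chain rule: P(w1..wn) = prod_i P(w_i | w_1..w_{i-1})."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob[(tuple(words[:i]), w)]
    return p

print(sentence_probability(["its", "water", "is", "so", "transparent"]))
# 0.01 * 0.05 * 0.20 * 0.10 * 0.02 = 2e-07
```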
Markov Assumption

• Simplifying assumption (Andrei Markov):

P(the | its water is so transparent that) ≈ P(the | that)

• Or maybe:

P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption

P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i \mid w_{i-k} \dots w_{i-1})

In other words, we approximate each component in the product:

P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-k} \dots w_{i-1})
Bigram model

• Condition on the previous word:

P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})
N-gram models

• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    “The computer which I had just put into the machine room on the fifth floor crashed.”
Estimating bigram probabilities

The maximum likelihood estimate:

P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
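A minimal sketch of this maximum likelihood estimate in Python, assuming a tiny pre-tokenized corpus with `<s>`/`</s>` boundary markers (the corpus below is modeled on the usual “I am Sam” toy example and is only illustrative):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers (illustrative only).
corpus = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

def p_mle(word, prev):
    """P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("i", "<s>"))   # 2/3
print(p_mle("sam", "am"))  # 1/2
```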
An example

More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Practical Issues

• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)

p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)
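A small sketch of why log space matters, using made-up per-word probabilities (the numbers are illustrative only):

```python
import math

# Illustrative per-word probabilities; a long sentence underflows quickly.
probs = [1e-5] * 80

product = 1.0
for p in probs:
    product *= p
print(product)             # 0.0 -- underflows in double precision

log_prob = sum(math.log(p) for p in probs)
print(log_prob)            # about -921.0, perfectly representable
print(math.exp(log_prob))  # 0.0 again if we insist on converting back
```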
Google N-Gram Release, August 2006:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Perplexity

The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}

For bigrams:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
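A minimal sketch of bigram perplexity computed in log space, assuming the bigram probabilities are already available (the `p_bigram` dictionary and its values below are illustrative placeholders, and every needed bigram is assumed to have nonzero probability):

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); illustrative only.
p_bigram = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33,
    ("want", "chinese"): 0.0065, ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

def perplexity(tokens):
    """PP(W) = (prod_i 1/P(w_i | w_{i-1}))^(1/N), computed via log probabilities."""
    bigrams = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(p_bigram[b]) for b in bigrams)
    n = len(bigrams)
    return math.exp(-log_sum / n)

print(perplexity(["<s>", "i", "want", "chinese", "food", "</s>"]))
```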
Approximating Shakespeare
Zeros

• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request

• Test set:
  … denied the offer
  … denied the loan

P(“offer” | denied the) = 0
P(w | denied the), maximum likelihood estimates from the training counts:
  3 allegations, 2 reports, 1 claims, 1 request (7 total)

• Steal probability mass to generalize better

P(w | denied the), after smoothing:
  2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
  (the “other” mass goes to unseen words such as outcome, attack, man)
Add-one estimation

• Add-1 estimate (V = vocabulary size):

P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
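A minimal sketch of the Add-1 estimate, using toy counts that mirror the “denied the” example (the counts and the vocabulary size V are illustrative, not from the slides):

```python
from collections import Counter

# Toy counts (illustrative).
unigram_counts = Counter({"denied": 7})
bigram_counts = Counter({
    ("denied", "allegations"): 3, ("denied", "reports"): 2,
    ("denied", "claims"): 1, ("denied", "request"): 1,
})
V = 10_000  # assumed vocabulary size

def p_add1(word, prev):
    """P_Add-1(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("allegations", "denied"))  # (3 + 1) / (7 + 10000)
print(p_add1("offer", "denied"))        # (0 + 1) / (7 + 10000) -- no longer zero
```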
Laplace-smoothed bigrams
Reconstituted counts
Linear Interpolation

• Simple interpolation:

\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \quad \sum_j \lambda_j = 1

• Lambdas conditional on context:

\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1(w_{i-2}^{i-1}) P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2(w_{i-2}^{i-1}) P(w_i \mid w_{i-1}) + \lambda_3(w_{i-2}^{i-1}) P(w_i)
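A minimal sketch of simple interpolation with fixed lambdas. The component probabilities and the weights below are hypothetical; in practice the lambdas are chosen to maximize the probability of held-out data:

```python
# Hypothetical component estimates for one prediction (illustrative).
p_trigram = 0.0    # P(w_i | w_{i-2}, w_{i-1}); unseen trigram
p_bigram  = 0.002  # P(w_i | w_{i-1})
p_unigram = 0.0001 # P(w_i)

# Fixed interpolation weights; must sum to 1 (assumed values).
lambdas = (0.6, 0.3, 0.1)

def interpolate(p3, p2, p1, lambdas):
    """P_hat = l1 * P(w|w-2,w-1) + l2 * P(w|w-1) + l3 * P(w)."""
    l1, l2, l3 = lambdas
    return l1 * p3 + l2 * p2 + l3 * p1

print(interpolate(p_trigram, p_bigram, p_unigram, lambdas))
# 0.6*0.0 + 0.3*0.002 + 0.1*0.0001 = 0.00061
```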
Unigram relative frequency:

S(w_i) = \frac{\text{count}(w_i)}{N}
Advanced: Good Turing Smoothing
Reminder: Add-1 (Laplace) smoothing:

P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
Add-k estimate, and an equivalent reparameterization with m = kV:

P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}

P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\tfrac{1}{V}\right)}{c(w_{i-1}) + m}
Unigram prior smoothing

P_{\text{UnigramPrior}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}
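A minimal sketch of unigram-prior smoothing, which replaces Add-k's uniform 1/V prior with the unigram distribution P(w). The toy counts and the prior strength m are illustrative assumptions:

```python
from collections import Counter

# Toy counts (illustrative).
unigram_counts = Counter({"denied": 7, "allegations": 5, "offer": 2})
bigram_counts = Counter({("denied", "allegations"): 3})
N = sum(unigram_counts.values())  # total tokens
m = 5.0                           # assumed prior strength

def p_unigram(word):
    return unigram_counts[word] / N

def p_unigram_prior(word, prev):
    """P(word | prev) = (c(prev, word) + m * P(word)) / (c(prev) + m)."""
    return (bigram_counts[(prev, word)] + m * p_unigram(word)) / (unigram_counts[prev] + m)

print(p_unigram_prior("allegations", "denied"))
print(p_unigram_prior("offer", "denied"))  # unseen bigram, nonzero and proportional to P(offer)
```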
Held-out words

(figure: words in the training set vs. the held-out set)
Ney et al. Good Turing Intuition
(slide from Dan Klein)

• Intuition from leave-one-out validation
  • Take each of the c training words out in turn
  • c training sets of size c–1, held-out of size 1
  • What fraction of held-out words are unseen in training?
    • N_1/c
  • What fraction of held-out words are seen k times in training?
    • (k+1)N_{k+1}/c
• So in the future we expect (k+1)N_{k+1}/c of the words to be those with training count k
• There are N_k words with training count k
• Each should occur with probability:
  • \frac{(k+1)N_{k+1}/c}{N_k}
• …or expected count:

k^* = \frac{(k+1)N_{k+1}}{N_k}
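A minimal sketch of the Good-Turing re-estimate c* = (c+1) N_{c+1} / N_c from a count-of-counts table. The N_c values and the total token count below are illustrative; real use also has to cope with gaps and noise in N_c, the complications discussed on the next slide:

```python
# Illustrative count-of-counts: N_c = number of word types seen c times.
count_of_counts = {1: 150, 2: 40, 3: 20, 4: 10, 5: 6}
total_tokens = 1000  # assumed training-set size

def good_turing_count(c):
    """c* = (c + 1) * N_{c+1} / N_c (undefined when N_{c+1} is missing)."""
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

# Probability mass reserved for unseen events: N_1 / total tokens.
print("P(unseen) =", count_of_counts[1] / total_tokens)  # 0.15
for c in range(1, 5):
    print(c, "->", round(good_turing_count(c), 3))
# 1 -> 2*40/150 = 0.533, 2 -> 3*20/40 = 1.5, 3 -> 4*10/20 = 2.0, 4 -> 5*6/10 = 3.0
```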
Good-Turing complications
(slide from Dan Klein)
Language Modeling
Advanced: Kneser-Ney Smoothing
Absolute Discounting Interpolation

P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w)

(the interpolation weight λ(w_{i-1}) multiplies the unigram P(w))

• (Maybe keeping a couple extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?
Kneser-Ney Smoothing I

• Better estimate for probabilities of lower-order unigrams!
  • Shannon game: I can’t see without my reading___________? (“Francisco”? “glasses”?)
  • “Francisco” is more common than “glasses”
  • … but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w): “How likely is w?”
• P_CONTINUATION(w): “How likely is w to appear as a novel continuation?”
  • For each word, count the number of bigram types it completes
  • Every bigram type was a novel continuation the first time it was seen

P_{\text{CONTINUATION}}(w) \propto \left| \{ w_{i-1} : c(w_{i-1}, w) > 0 \} \right|
Kneser-Ney Smoothing II

• How many times does w appear as a novel continuation?

P_{\text{CONTINUATION}}(w) \propto \left| \{ w_{i-1} : c(w_{i-1}, w) > 0 \} \right|

• Normalized by the total number of word bigram types:

P_{\text{CONTINUATION}}(w) = \frac{\left| \{ w_{i-1} : c(w_{i-1}, w) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \} \right|}

• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
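A minimal sketch of the continuation probability: for each word, count the distinct left contexts (bigram types) it completes, then normalize by the total number of distinct bigram types. The toy corpus below is illustrative and chosen so that "francisco" is frequent but only ever follows "san":

```python
from collections import defaultdict

# Toy corpus (illustrative).
corpus = [
    "i lost my reading glasses".split(),
    "she broke her glasses".split(),
    "he visited san francisco".split(),
    "we flew to san francisco".split(),
    "they moved to san francisco".split(),
]

left_contexts = defaultdict(set)  # word -> set of words that precede it
bigram_types = set()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        left_contexts[word].add(prev)
        bigram_types.add((prev, word))

def p_continuation(word):
    """|{w_prev : c(w_prev, word) > 0}| / |{all bigram types}|."""
    return len(left_contexts[word]) / len(bigram_types)

print(p_continuation("glasses"))    # completes 2 bigram types
print(p_continuation("francisco"))  # completes only 1 (always after "san")
```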
Kneser-Ney Smoothing IV

P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{CONTINUATION}}(w_i)

λ is a normalizing constant; the probability mass we’ve discounted:

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\left| \{ w : c(w_{i-1}, w) > 0 \} \right|

Here d / c(w_{i-1}) is the normalized discount, and |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1}, i.e. the number of word types we discounted and the number of times we applied the normalized discount.
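A minimal sketch of interpolated Kneser-Ney for bigrams, putting the pieces together: the absolute discount d, the λ normalizer, and the continuation probability. The toy corpus and the value d = 0.75 are illustrative assumptions:

```python
from collections import Counter, defaultdict

d = 0.75  # assumed discount

# Toy corpus (illustrative).
corpus = [
    "<s> i want chinese food </s>".split(),
    "<s> i want english food </s>".split(),
    "<s> i spend money on food </s>".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
followers = defaultdict(set)      # prev -> set of words seen after prev
left_contexts = defaultdict(set)  # word -> set of words seen before word
for sent in corpus:
    unigram_counts.update(sent)
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        followers[prev].add(word)
        left_contexts[word].add(prev)
num_bigram_types = len(bigram_counts)

def p_continuation(word):
    return len(left_contexts[word]) / num_bigram_types

def p_kn(word, prev):
    """P_KN(word|prev) = max(c(prev,word)-d, 0)/c(prev) + lambda(prev) * P_cont(word)."""
    discounted = max(bigram_counts[(prev, word)] - d, 0) / unigram_counts[prev]
    lam = (d / unigram_counts[prev]) * len(followers[prev])
    return discounted + lam * p_continuation(word)

print(p_kn("chinese", "want"))  # seen bigram: discounted count plus smoothed mass
print(p_kn("money", "want"))    # unseen bigram: gets only the continuation term
print(sum(p_kn(w, "want") for w in unigram_counts))  # ≈ 1.0 (sanity check)
```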
The same formulation generalizes recursively to higher-order n-grams:

P_{\text{KN}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c_{\text{KN}}(w_{i-n+1}^{i}) - d,\, 0)}{c_{\text{KN}}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\,P_{\text{KN}}(w_i \mid w_{i-n+2}^{i-1})

c_{\text{KN}}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation count}(\cdot) & \text{for lower orders} \end{cases}