Language Modeling: Part II

Pawan Goyal

CSE, IITKGP

August 8, 2016

More general formulations: Add-k

P_{Add-k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}

P_{Add-k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}

Unigram prior smoothing:

P_{UnigramPrior}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}

A good value of k or m?
Can be optimized on held-out set
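
A rough Python sketch of these estimates from raw counts (not from the slides; the function names, the toy corpus, and the default k and m are illustrative):

from collections import Counter

def p_add_k(w_prev, w, bigrams, unigrams, V, k=0.5):
    # P_Add-k(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + k*V)
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def p_unigram_prior(w_prev, w, bigrams, unigrams, p_uni, m=1.0):
    # P_UnigramPrior(w | w_prev) = (c(w_prev, w) + m*P(w)) / (c(w_prev) + m)
    return (bigrams[(w_prev, w)] + m * p_uni[w]) / (unigrams[w_prev] + m)

# Toy corpus; Counter returns 0 for unseen n-grams.
tokens = "i am here who am i i would like".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(unigrams.values())
p_uni = {w: c / total for w, c in unigrams.items()}
V = len(unigrams)

print(p_add_k("am", "here", bigrams, unigrams, V, k=0.5))        # seen bigram
print(p_unigram_prior("am", "would", bigrams, unigrams, p_uni))  # unseen bigram, pulled towards P(would)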

Advanced smoothing algorithms

Smoothing algorithms
Good-Turing
Kneser-Ney

Good-Turing: Basic Intuition


Use the count of things we have seen once
to help estimate the count of things we have never seen

Nc : Frequency of frequency c

Example Sentences
<s>I am here </s>
<s>who am I </s>
<s>I would like </s>

Computing Nc
Word counts: I 3, am 2, here 1, who 1, would 1, like 1
Counts of counts: N1 = 4, N2 = 1, N3 = 1
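
A small sketch of how the counts of counts can be computed in Python (variable names are illustrative):

from collections import Counter

sentences = ["I am here", "who am I", "I would like"]
tokens = [w for s in sentences for w in s.split()]

word_counts = Counter(tokens)                  # I: 3, am: 2, here/who/would/like: 1
freq_of_freq = Counter(word_counts.values())   # N_c: how many words occur exactly c times

print(freq_of_freq)                            # Counter({1: 4, 2: 1, 3: 1})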

Good Turing Estimation

Idea
Reallocate the probability mass of n−grams that occur r + 1 times in the
training data to the n−grams that occur r times
In particular, reallocate the probability mass of n−grams that were seen
once to the n−grams that were never seen

Adjusted count
For each count c, an adjusted count c∗ is computed as:

c^* = \frac{(c+1)\,N_{c+1}}{N_c}
where Nc is the number of n−grams seen exactly c times
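
For instance, with the counts of counts from the toy example above (N1 = 4, N2 = 1, N3 = 1), a word seen once gets the adjusted count c∗ = 2 · N2 / N1 = 2 · 1 / 4 = 0.5, and a word seen twice gets c∗ = 3 · N3 / N2 = 3.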

Good Turing Estimation

Good Turing Smoothing


P^*_{GT}(\text{things with frequency } c) = \frac{c^*}{N}

c^* = \frac{(c+1)\,N_{c+1}}{N_c}

What if c = 0?

P^*_{GT}(\text{things with frequency } c = 0) = \frac{N_1}{N}
where N denotes the total number of n-grams that actually occur in training
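
A minimal Python sketch of these estimates, reusing the toy counts of counts from above (illustrative; no smoothing of the Nc values is applied here):

def good_turing_prob(c, freq_of_freq, total_tokens):
    # P*_GT = N_1/N for unseen events (c = 0), otherwise c*/N with
    # c* = (c + 1) * N_{c+1} / N_c.
    if c == 0:
        return freq_of_freq.get(1, 0) / total_tokens
    c_star = (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
    return c_star / total_tokens

N_c = {1: 4, 2: 1, 3: 1}                  # from the toy example
N = sum(c * n for c, n in N_c.items())    # total tokens seen in training (= 9)
print(good_turing_prob(0, N_c, N))        # mass reserved for unseen words: 4/9
print(good_turing_prob(1, N_c, N))        # 0.5 / 9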

Complications

What about words with high frequency?


For small c, Nc > Nc+1
For large c, the Nc counts are too jumpy (often zero)

Simple Good-Turing
Replace the empirical Nc with a best-fit power law once counts get unreliable
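
One way to realize this in code, as a sketch rather than the exact Simple Good-Turing recipe: fit log Nc against log c by least squares and use the fitted power law where the raw counts are unreliable (the Nc values below are made up):

import math

def fit_power_law(freq_of_freq):
    # Least-squares fit of log N_c = a + b * log c.
    cs = sorted(freq_of_freq)
    xs = [math.log(c) for c in cs]
    ys = [math.log(freq_of_freq[c]) for c in cs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b                      # a, b

def smoothed_N(c, a, b):
    # Smoothed N_c = exp(a) * c**b, used in place of the jumpy empirical value.
    return math.exp(a) * c ** b

a, b = fit_power_law({1: 268, 2: 112, 3: 70, 4: 43, 5: 31})
print(smoothed_N(10, a, b))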

Good-Turing numbers: Example

22 million words of AP Newswire

c^* = \frac{(c+1)\,N_{c+1}}{N_c}
It looks like c∗ = c − 0.75

Absolute Discounting Interpolation

Why don’t we just subtract 0.75 (or some d)?

P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w_i)

We may keep separate values of d for counts 1 and 2


But can we do better than using the regular unigram probability P(wi )?
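
A sketch of this estimate over bigram counts (illustrative names and toy corpus; λ(wi−1) is set so that exactly the discounted mass gets redistributed, as on the Kneser-Ney slide below):

from collections import Counter

def p_abs_discount(w_prev, w, bigrams, unigrams, p_uni, d=0.75):
    c_prev = unigrams[w_prev]
    discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev    # max(...) keeps unseen bigrams at zero
    n_types = sum(1 for (p, _) in bigrams if p == w_prev)     # bigram types starting with w_prev
    lam = (d / c_prev) * n_types
    return discounted + lam * p_uni.get(w, 0.0)

tokens = "i am here who am i i would like".split()
unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
p_uni = {w: c / sum(unigrams.values()) for w, c in unigrams.items()}
print(p_abs_discount("am", "here", bigrams, unigrams, p_uni))    # seen bigram
print(p_abs_discount("am", "would", bigrams, unigrams, p_uni))   # unseen bigram, falls back to P(would)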

Kneser-Ney Smoothing

Intuition
Shannon game: I can’t see without my reading ...: glasses/Francisco?
“Francisco” more common than “glasses”
But “Francisco” mostly follows “San”

P(w): “How likely is w?”


Instead, Pcontinuation (w): “How likely is w to appear as a novel continuation?”
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen

Pcontinuation (w) ∝ |{wi−1 : c(wi−1 , w) > 0}|

Kneser-Ney Smoothing

How many times does w appear as a novel continuation?

Pcontinuation (w) ∝ |{wi−1 : c(wi−1 , w) > 0}|

Normalized by the total number of word bigram types

|{(wj−1 , wj ) : c(wj−1 , wj ) > 0}|

P_{continuation}(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}

A frequent word (Francisco) occurring in only one context (San) will have a low
continuation probability
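
A sketch of the continuation probability on a tiny, made-up set of bigram counts mirroring the Francisco/glasses example:

from collections import Counter

def continuation_prob(w, bigrams):
    # |{w_prev : c(w_prev, w) > 0}| / |{all bigram types}|
    completes = sum(1 for (_, cur) in bigrams if cur == w)
    return completes / len(bigrams)

bigrams = Counter({("san", "francisco"): 6,      # frequent, but only one context
                   ("reading", "glasses"): 2,
                   ("my", "glasses"): 1,
                   ("sun", "glasses"): 1})
print(continuation_prob("francisco", bigrams))   # 1/4
print(continuation_prob("glasses", bigrams))     # 3/4, despite the lower raw frequency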

Kneser-Ney Smoothing

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{continuation}(w_i)

λ is a normalizing constant

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|
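
Putting the pieces together, a sketch of the interpolated Kneser-Ney bigram estimate (toy counts as before; a full implementation would also handle higher orders and unseen histories):

from collections import Counter

def p_kneser_ney(w_prev, w, bigrams, unigrams, d=0.75):
    c_prev = unigrams[w_prev]
    discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev
    n_types_after = sum(1 for (p, _) in bigrams if p == w_prev)   # |{w : c(w_prev, w) > 0}|
    lam = (d / c_prev) * n_types_after
    p_cont = sum(1 for (_, cur) in bigrams if cur == w) / len(bigrams)
    return discounted + lam * p_cont

unigrams = Counter({"san": 6, "reading": 2, "my": 1, "sun": 1, "francisco": 6, "glasses": 4})
bigrams = Counter({("san", "francisco"): 6, ("reading", "glasses"): 2,
                   ("my", "glasses"): 1, ("sun", "glasses"): 1})
print(p_kneser_ney("reading", "glasses", bigrams, unigrams))      # high
print(p_kneser_ney("reading", "francisco", bigrams, unigrams))    # low, despite "francisco" being frequent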

Model Combination

As N increases
The power (expressiveness) of an N-gram model increases
But the ability to estimate accurate parameters from sparse data
decreases (i.e. the smoothing problem gets worse).

A general approach is to combine the results of multiple N-gram models.

Backoff and Interpolation

It might help to use less context


when you haven’t learned much about larger contexts

Backoff
use trigram if you have good evidence
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram

Backoff

Estimating P(wi |wi−2 wi−1 )


If we do not have counts to compute P(wi |wi−2 wi−1 ), estimate this using
the bigram probability P(wi |wi−1 )
If we do not have counts to compute P(wi |wi−1 ), estimate this using the
unigram probability P(wi )

P(w_i \mid w_{i-2} w_{i-1}) =
\begin{cases}
P(w_i \mid w_{i-2} w_{i-1}), & \text{if } c(w_{i-2} w_{i-1} w_i) > 0 \\
\lambda_1 P(w_i \mid w_{i-1}), & \text{if } c(w_{i-2} w_{i-1} w_i) = 0 \text{ and } c(w_{i-1} w_i) > 0 \\
\lambda_2 P(w_i), & \text{otherwise}
\end{cases}
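
A sketch of this piecewise scheme (illustrative; in a proper Katz backoff the λ weights depend on the context and come with discounting, so that the probabilities sum to one):

from collections import Counter

def p_backoff(w, w_prev, w_prev2, tri, bi, uni, total, lam1=0.4, lam2=0.4):
    # Try the trigram estimate, then back off to the bigram, then the unigram.
    if tri[(w_prev2, w_prev, w)] > 0:
        return tri[(w_prev2, w_prev, w)] / bi[(w_prev2, w_prev)]
    if bi[(w_prev, w)] > 0:
        return lam1 * bi[(w_prev, w)] / uni[w_prev]
    return lam2 * uni[w] / total

tokens = "i am here who am i i would like".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(uni.values())
print(p_backoff("here", "am", "i", tri, bi, uni, total))   # trigram "i am here" was seen
print(p_backoff("like", "am", "i", tri, bi, uni, total))   # falls back to the unigram P(like)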

Linear Interpolation

Simple Interpolation

P̃(wn |wn−1 wn−2 ) = λ1 P(wn |wn−1 wn−2 ) + λ2 P(wn |wn−1 ) + λ3 P(wn )

\sum_i \lambda_i = 1

Lambdas conditional on context


P̃(wn |wn−1 wn−2 ) = λ1 (wn−2 , wn−1 )P(wn |wn−1 wn−2 )
+λ2 (wn−2 , wn−1 )P(wn |wn−1 ) + λ3 (wn−2 , wn−1 )P(wn )
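
A sketch of simple interpolation with fixed, context-independent λs (names, weights, and toy counts are illustrative):

from collections import Counter

def p_interpolated(w, w_prev, w_prev2, tri, bi, uni, total, lambdas=(0.6, 0.3, 0.1)):
    # lambdas = (trigram, bigram, unigram) weights; they must sum to 1.
    l1, l2, l3 = lambdas
    p3 = tri[(w_prev2, w_prev, w)] / bi[(w_prev2, w_prev)] if bi[(w_prev2, w_prev)] else 0.0
    p2 = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    p1 = uni[w] / total
    return l1 * p3 + l2 * p2 + l3 * p1

# Same toy counts as in the backoff sketch above.
tokens = "i am here who am i i would like".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(uni.values())
print(p_interpolated("here", "am", "i", tri, bi, uni, total))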

Setting the lambda values

Use a held-out corpus


Choose λs to maximize the probability of held-out data:
Find the N-gram probabilities on the training data
Search for λs that give the largest probability to held-out data
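
A crude sketch of this search, reusing p_interpolated and the toy tri/bi/uni counts from the interpolation sketch above: score each λ triple on a held-out text by log-probability and keep the best one (the held-out sentence is made up; EM is what is typically used instead of a grid search):

import math

def heldout_log_prob(lambdas, heldout_tokens, tri, bi, uni, total):
    lp = 0.0
    for w2, w1, w in zip(heldout_tokens, heldout_tokens[1:], heldout_tokens[2:]):
        p = p_interpolated(w, w1, w2, tri, bi, uni, total, lambdas)
        lp += math.log(p) if p > 0 else float("-inf")
    return lp

# All (l1, l2, l3) on a 0.1 grid with l1 + l2 + l3 = 1.
candidates = [(a / 10, b / 10, (10 - a - b) / 10) for a in range(11) for b in range(11 - a)]
heldout = "i am here i would like".split()
best = max(candidates, key=lambda lam: heldout_log_prob(lam, heldout, tri, bi, uni, total))
print(best)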
