Language Modeling: Part II

Pawan Goyal

CSE, IITKGP

August 8, 2016

More general formulations: Add-k

P_{Add-k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}

P_{Add-k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}

Unigram prior smoothing:

P_{UnigramPrior}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}

A good value of k or m?
Can be optimized on held-out set
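
A rough Python sketch of these estimates from raw counts (not from the slides; the function names, the toy corpus, and the default k and m are illustrative):

from collections import Counter

def p_add_k(w_prev, w, bigrams, unigrams, V, k=0.5):
    # P_Add-k(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + k*V)
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def p_unigram_prior(w_prev, w, bigrams, unigrams, p_uni, m=1.0):
    # P_UnigramPrior(w | w_prev) = (c(w_prev, w) + m*P(w)) / (c(w_prev) + m)
    return (bigrams[(w_prev, w)] + m * p_uni[w]) / (unigrams[w_prev] + m)

# Toy corpus; Counter returns 0 for unseen n-grams.
tokens = "i am here who am i i would like".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(unigrams.values())
p_uni = {w: c / total for w, c in unigrams.items()}
V = len(unigrams)

print(p_add_k("am", "here", bigrams, unigrams, V, k=0.5))        # seen bigram
print(p_unigram_prior("am", "would", bigrams, unigrams, p_uni))  # unseen bigram, pulled towards P(would)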

Advanced smoothing algorithms

Smoothing algorithms
Good-Turing
Kneser-Ney

Good-Turing: Basic Intuition


Use the count of things we have seen once
to help estimate the count of things we have never seen

Nc : Frequency of frequency c

Example Sentences
<s>I am here </s>
<s>who am I </s>
<s>I would like </s>

Computing Nc
Word counts: I 3, am 2, here 1, who 1, would 1, like 1
Counts of counts: N1 = 4, N2 = 1, N3 = 1
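
A small sketch of how the counts of counts can be computed in Python (variable names are illustrative):

from collections import Counter

sentences = ["I am here", "who am I", "I would like"]
tokens = [w for s in sentences for w in s.split()]

word_counts = Counter(tokens)                  # I: 3, am: 2, here/who/would/like: 1
freq_of_freq = Counter(word_counts.values())   # N_c: how many words occur exactly c times

print(freq_of_freq)                            # Counter({1: 4, 2: 1, 3: 1})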

Good Turing Estimation

Idea
Reallocate the probability mass of n−grams that occur r + 1 times in the
training data to the n−grams that occur r times
In particular, reallocate the probability mass of n−grams that were seen
once to the n−grams that were never seen

Adjusted count
For each count c, an adjusted count c∗ is computed as:

c^* = \frac{(c+1)\,N_{c+1}}{N_c}
where Nc is the number of n−grams seen exactly c times
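
For instance, with the counts of counts from the toy example above (N1 = 4, N2 = 1, N3 = 1), a word seen once gets the adjusted count c∗ = 2 · N2 / N1 = 2 · 1 / 4 = 0.5, and a word seen twice gets c∗ = 3 · N3 / N2 = 3.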

Good Turing Estimation

Good Turing Smoothing


P^*_{GT}(\text{things with frequency } c) = \frac{c^*}{N}

c^* = \frac{(c+1)\,N_{c+1}}{N_c}

What if c = 0?

P^*_{GT}(\text{things with frequency } c = 0) = \frac{N_1}{N}
where N denotes the total number of n-grams that actually occur in training
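
A minimal Python sketch of these estimates, reusing the toy counts of counts from above (illustrative; no smoothing of the Nc values is applied here):

def good_turing_prob(c, freq_of_freq, total_tokens):
    # P*_GT = N_1/N for unseen events (c = 0), otherwise c*/N with
    # c* = (c + 1) * N_{c+1} / N_c.
    if c == 0:
        return freq_of_freq.get(1, 0) / total_tokens
    c_star = (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
    return c_star / total_tokens

N_c = {1: 4, 2: 1, 3: 1}                  # from the toy example
N = sum(c * n for c, n in N_c.items())    # total tokens seen in training (= 9)
print(good_turing_prob(0, N_c, N))        # mass reserved for unseen words: 4/9
print(good_turing_prob(1, N_c, N))        # 0.5 / 9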

Complications

What about words with high frequency?


For small c, Nc > Nc+1
For large c, the Nc counts are too jumpy (often zero)

Simple Good-Turing
Replace the empirical Nc with a best-fit power law once counts get unreliable
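
One way to realize this in code, as a sketch rather than the exact Simple Good-Turing recipe: fit log Nc against log c by least squares and use the fitted power law where the raw counts are unreliable (the Nc values below are made up):

import math

def fit_power_law(freq_of_freq):
    # Least-squares fit of log N_c = a + b * log c.
    cs = sorted(freq_of_freq)
    xs = [math.log(c) for c in cs]
    ys = [math.log(freq_of_freq[c]) for c in cs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b                      # a, b

def smoothed_N(c, a, b):
    # Smoothed N_c = exp(a) * c**b, used in place of the jumpy empirical value.
    return math.exp(a) * c ** b

a, b = fit_power_law({1: 268, 2: 112, 3: 70, 4: 43, 5: 31})
print(smoothed_N(10, a, b))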

Good-Turing numbers: Example

22 million words of AP Newswire

c^* = \frac{(c+1)\,N_{c+1}}{N_c}
It looks like c∗ = c − 0.75

Absolute Discounting Interpolation

Why don’t we just subtract 0.75 (or some d)?

P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w_i)

We may keep separate values of d for counts 1 and 2


But can we do better than using the regular unigram probability P(wi )?
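
A sketch of this estimate over bigram counts (illustrative names and toy corpus; λ(wi−1) is set so that exactly the discounted mass gets redistributed, as on the Kneser-Ney slide below):

from collections import Counter

def p_abs_discount(w_prev, w, bigrams, unigrams, p_uni, d=0.75):
    c_prev = unigrams[w_prev]
    discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev    # max(...) keeps unseen bigrams at zero
    n_types = sum(1 for (p, _) in bigrams if p == w_prev)     # bigram types starting with w_prev
    lam = (d / c_prev) * n_types
    return discounted + lam * p_uni.get(w, 0.0)

tokens = "i am here who am i i would like".split()
unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
p_uni = {w: c / sum(unigrams.values()) for w, c in unigrams.items()}
print(p_abs_discount("am", "here", bigrams, unigrams, p_uni))    # seen bigram
print(p_abs_discount("am", "would", bigrams, unigrams, p_uni))   # unseen bigram, falls back to P(would)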

Kneser-Ney Smoothing

Intuition
Shannon game: I can’t see without my reading ...: glasses/Francisco?
“Francisco” more common than “glasses”
But “Francisco” mostly follows “San”

P(w): “How likely is w?”


Instead, Pcontinuation (w): “How likely is w to appear as a novel continuation?”
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen

Pcontinuation (w) ∝ |{wi−1 : c(wi−1 , w) > 0}|

Kneser-Ney Smoothing

How many times does w appear as a novel continuation?

Pcontinuation (w) ∝ |{wi−1 : c(wi−1 , w) > 0}|

Normalized by the total number of word bigram types

|{(wj−1 , wj ) : c(wj−1 , wj ) > 0}|

P_{continuation}(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}

A frequent word (Francisco) occurring in only one context (San) will have a low
continuation probability
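
A sketch of the continuation probability on a tiny, made-up set of bigram counts mirroring the Francisco/glasses example:

from collections import Counter

def continuation_prob(w, bigrams):
    # |{w_prev : c(w_prev, w) > 0}| / |{all bigram types}|
    completes = sum(1 for (_, cur) in bigrams if cur == w)
    return completes / len(bigrams)

bigrams = Counter({("san", "francisco"): 6,      # frequent, but only one context
                   ("reading", "glasses"): 2,
                   ("my", "glasses"): 1,
                   ("sun", "glasses"): 1})
print(continuation_prob("francisco", bigrams))   # 1/4
print(continuation_prob("glasses", bigrams))     # 3/4, despite the lower raw frequency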

Kneser-Ney Smoothing

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{continuation}(w_i)

λ is a normalizing constant

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|
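
Putting the pieces together, a sketch of the interpolated Kneser-Ney bigram estimate (toy counts as before; a full implementation would also handle higher orders and unseen histories):

from collections import Counter

def p_kneser_ney(w_prev, w, bigrams, unigrams, d=0.75):
    c_prev = unigrams[w_prev]
    discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev
    n_types_after = sum(1 for (p, _) in bigrams if p == w_prev)   # |{w : c(w_prev, w) > 0}|
    lam = (d / c_prev) * n_types_after
    p_cont = sum(1 for (_, cur) in bigrams if cur == w) / len(bigrams)
    return discounted + lam * p_cont

unigrams = Counter({"san": 6, "reading": 2, "my": 1, "sun": 1, "francisco": 6, "glasses": 4})
bigrams = Counter({("san", "francisco"): 6, ("reading", "glasses"): 2,
                   ("my", "glasses"): 1, ("sun", "glasses"): 1})
print(p_kneser_ney("reading", "glasses", bigrams, unigrams))      # high
print(p_kneser_ney("reading", "francisco", bigrams, unigrams))    # low, despite "francisco" being frequent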

Model Combination

As N increases
The power (expressiveness) of an N-gram model increases
But the ability to estimate accurate parameters from sparse data
decreases (i.e. the smoothing problem gets worse).

A general approach is to combine the results of multiple N-gram models.

Backoff and Interpolation

It might help to use less context


when you haven’t learned much about larger contexts

Backoff
use trigram if you have good evidence
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram

Backoff

Estimating P(wi |wi−2 wi−1 )


If we do not have counts to compute P(wi |wi−2 wi−1 ), estimate this using
the bigram probability P(wi |wi−1 )
If we do not have counts to compute P(wi |wi−1 ), estimate this using the
unigram probability P(wi )

P(w_i \mid w_{i-2} w_{i-1}) =
\begin{cases}
P(w_i \mid w_{i-2} w_{i-1}), & \text{if } c(w_{i-2} w_{i-1} w_i) > 0 \\
\lambda_1 P(w_i \mid w_{i-1}), & \text{if } c(w_{i-2} w_{i-1} w_i) = 0 \text{ and } c(w_{i-1} w_i) > 0 \\
\lambda_2 P(w_i), & \text{otherwise}
\end{cases}
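
A sketch of this piecewise scheme (illustrative; in a proper Katz backoff the λ weights depend on the context and come with discounting, so that the probabilities sum to one):

from collections import Counter

def p_backoff(w, w_prev, w_prev2, tri, bi, uni, total, lam1=0.4, lam2=0.4):
    # Try the trigram estimate, then back off to the bigram, then the unigram.
    if tri[(w_prev2, w_prev, w)] > 0:
        return tri[(w_prev2, w_prev, w)] / bi[(w_prev2, w_prev)]
    if bi[(w_prev, w)] > 0:
        return lam1 * bi[(w_prev, w)] / uni[w_prev]
    return lam2 * uni[w] / total

tokens = "i am here who am i i would like".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(uni.values())
print(p_backoff("here", "am", "i", tri, bi, uni, total))   # trigram "i am here" was seen
print(p_backoff("like", "am", "i", tri, bi, uni, total))   # falls back to the unigram P(like)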

Linear Interpolation

Simple Interpolation

P̃(wn |wn−1 wn−2 ) = λ1 P(wn |wn−1 wn−2 ) + λ2 P(wn |wn−1 ) + λ3 P(wn )

\sum_i \lambda_i = 1

Lambdas conditional on context


P̃(wn |wn−1 wn−2 ) = λ1 (wn−2 , wn−1 )P(wn |wn−1 wn−2 )
+λ2 (wn−2 , wn−1 )P(wn |wn−1 ) + λ3 (wn−2 , wn−1 )P(wn )
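
A sketch of simple interpolation with fixed, context-independent λs (names, weights, and toy counts are illustrative):

from collections import Counter

def p_interpolated(w, w_prev, w_prev2, tri, bi, uni, total, lambdas=(0.6, 0.3, 0.1)):
    # lambdas = (trigram, bigram, unigram) weights; they must sum to 1.
    l1, l2, l3 = lambdas
    p3 = tri[(w_prev2, w_prev, w)] / bi[(w_prev2, w_prev)] if bi[(w_prev2, w_prev)] else 0.0
    p2 = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    p1 = uni[w] / total
    return l1 * p3 + l2 * p2 + l3 * p1

# Same toy counts as in the backoff sketch above.
tokens = "i am here who am i i would like".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(uni.values())
print(p_interpolated("here", "am", "i", tri, bi, uni, total))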

Setting the lambda values

Use a held-out corpus


Choose λs to maximize the probability of held-out data:
Find the N-gram probabilities on the training data
Search for λs that give the largest probability to held-out data
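
A crude sketch of this search, reusing p_interpolated and the toy tri/bi/uni counts from the interpolation sketch above: score each λ triple on a held-out text by log-probability and keep the best one (the held-out sentence is made up; EM is what is typically used instead of a grid search):

import math

def heldout_log_prob(lambdas, heldout_tokens, tri, bi, uni, total):
    lp = 0.0
    for w2, w1, w in zip(heldout_tokens, heldout_tokens[1:], heldout_tokens[2:]):
        p = p_interpolated(w, w1, w2, tri, bi, uni, total, lambdas)
        lp += math.log(p) if p > 0 else float("-inf")
    return lp

# All (l1, l2, l3) on a 0.1 grid with l1 + l2 + l3 = 1.
candidates = [(a / 10, b / 10, (10 - a - b) / 10) for a in range(11) for b in range(11 - a)]
heldout = "i am here i would like".split()
best = max(candidates, key=lambda lam: heldout_log_prob(lam, heldout, tri, bi, uni, total))
print(best)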
