Language Modeling: Introduction to N-Grams
Dan Jurafsky
• More variables:
P(A, B, C, D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn−1)
The Chain Rule applied to compute the joint probability of words in a sentence:
P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
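For example, applied to the sentence used below:
P(“its water is so transparent”) = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)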
Markov Assumption
[Photo: Andrei Markov]
• Simplifying assumption:
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
• The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
• Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
• The n-gram generalizes this idea, conditioning on the previous n−1 words: unigram, bigram, trigram, and so on (see the formula below).
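In equations, the Markov assumption approximates the full history by the last k words:
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)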
Markov Assumption
• For a complete word sequence, the simplest case is the unigram model:
P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model:
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
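A minimal sketch of where such output comes from: a unigram model generates each word independently, so repeatedly sampling from any unigram distribution (the toy probabilities below are invented for illustration) yields the same kind of word salad.

```python
import random

# A unigram model generates each word independently of its neighbors,
# which is why the output has no coherence. Toy probabilities only.
unigram = {"the": 0.3, "an": 0.15, "of": 0.15, "a": 0.1,
           "inflation": 0.1, "dollars": 0.1, "quarter": 0.1}

words = random.choices(list(unigram), weights=list(unigram.values()), k=12)
print(", ".join(words))
```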
Bigram model
• Condition on the previous word:
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
N-gram models
• We can extend to trigrams, 4-grams, 5-grams.
• In general this is an insufficient model of language, because language has long-distance dependencies:
“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”
• But we can often get away with n-gram models.
Estimating bigram probabilities (the maximum likelihood estimate):
P(wi | wi−1) = count(wi−1, wi) / count(wi−1) = c(wi−1, wi) / c(wi−1)
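A minimal sketch of this estimator in Python; the two-sentence corpus and the <s>/</s> sentence-boundary markers are illustrative.

```python
from collections import Counter

# Maximum likelihood estimate: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
corpus = [["<s>", "i", "want", "chinese", "food", "</s>"],
          ["<s>", "i", "want", "to", "eat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])           # count each word as a bigram context
    bigrams.update(zip(sent, sent[1:]))  # count adjacent word pairs

def p_mle(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("i", "want"))   # 2/2 = 1.0 on this toy corpus
print(p_mle("want", "to"))  # 1/2 = 0.5
```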
An example
• Result: [worked table of example bigram estimates not recovered]
Practical Issues
• We do everything in log space
• to avoid underflow (e.g., 0.002 × 0.002 = 0.000004, and real products involve many more terms)
• (also, adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
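A small sketch of the same point in Python:

```python
import math

# Summing log probabilities avoids underflow and replaces
# multiplication with (faster) addition.
probs = [0.002, 0.002, 0.002, 0.002]
log_p = sum(math.log(p) for p in probs)  # log p1 + log p2 + log p3 + log p4
print(log_p)            # about -24.86
print(math.exp(log_p))  # 1.6e-11; convert back only if ever needed
```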
…
Google N-Gram Release, August 2006:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
I always order pizza with cheese and ____ (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e−100)
The 33rd President of the US was ____
I saw a ____
• A better model of a text is one that assigns a higher probability to the word that actually occurs.
Perplexity
The best language model is one that best predicts an unseen test set (gives the highest P(sentence)).
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N)
      = (1 / P(w1 w2 … wN))^(1/N)
Chain rule:
PP(W) = (∏i 1 / P(wi | w1 … wi−1))^(1/N)
For bigrams:
PP(W) = (∏i 1 / P(wi | wi−1))^(1/N)
Minimizing perplexity is the same as maximizing probability.
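A minimal sketch of the computation, done in log space for stability; the per-word probabilities are invented for illustration.

```python
import math

def perplexity(word_probs):
    # PP(W) = P(w1 ... wN) ** (-1/N), computed via logs to avoid underflow.
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)

# e.g. the conditional probabilities a model assigned to a 4-word test string:
print(perplexity([0.1, 0.2, 0.05, 0.1]))  # 10.0
```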
Approximating Shakespeare
[Sample sentences generated by unigram through quadrigram models trained on Shakespeare; the unigram output has no coherence.]
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (they have zero entries in the table).
• Quadrigrams: what’s coming out looks like Shakespeare because it is Shakespeare.
N-Gram Precautions
• Training and test sets
• should be from the same genre, e.g., translating legal documents, building a question-answering system, etc.
• should be from the same dialect, e.g., social media posts, spoken transcripts, etc.
• Matching genre and dialect is still not sufficient: our models are subject to the problem of sparsity (zero probabilities).
Zeros
• Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request
• Test set:
… denied the offer
… denied the loan
Counts of words following “denied the” in the training set: allegations 3, reports 2, claims 1, request 1 (7 total); words such as offer, outcome, attack, and man were never seen.
So P(offer | denied the) = 0.
• Steal probability mass to generalize better:
P(w | denied the): allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2 (7 total)
Add-one estimation
• Pretend we saw each bigram one more time than we did (add one to all counts):
P_Add-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
Laplace-smoothed bigrams, e.g. P(i | i) with c(i, i) = 5, c(i) = 2533, V = 1446:
= (5 + 1) / (2533 + 1446)
= 0.0015
Reconstituted counts
• Converting the smoothed probabilities back into count space:
c*(wi−1, wi) = (c(wi−1, wi) + 1) × c(wi−1) / (c(wi−1) + V)
e.g. = {(5 + 1) × 2533} / (2533 + 1446) = 3.8
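A minimal sketch implementing both formulas, checked against the worked numbers above:

```python
# Add-one smoothing and reconstituted counts, using the worked numbers
# above: c(w_{i-1}, w_i) = 5, c(w_{i-1}) = 2533, vocabulary size V = 1446.
def add_one_prob(c_bigram, c_prev, V):
    # P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
    return (c_bigram + 1) / (c_prev + V)

def reconstituted(c_bigram, c_prev, V):
    # c* = (c(w_{i-1}, w_i) + 1) * c(w_{i-1}) / (c(w_{i-1}) + V)
    return (c_bigram + 1) * c_prev / (c_prev + V)

print(round(add_one_prob(5, 2533, 1446), 4))   # 0.0015
print(round(reconstituted(5, 2533, 1446), 1))  # 3.8
```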
• C(want to) went from 608 to 238, and P(to | want) from .66 to .26!
• The discount d = c*/c for “chinese food” is .10: a 10× reduction!
Linear Interpolation
• Simple interpolation: compute the different n-gram probabilities first, then mix them with λ weights that are normalized to sum to 1:
P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn), with Σi λi = 1
• The λ values are tuned on held-out data.
• [Worked example from the original: the interpolated estimate comes out to 0.571]
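A minimal sketch; the λ values below are arbitrary placeholders, and the component probabilities are invented.

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # Mix trigram, bigram, and unigram estimates; weights must sum to 1.
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(interpolate(0.8, 0.5, 0.1))  # 0.6*0.8 + 0.3*0.5 + 0.1*0.1 = 0.64
```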
Smoothing for web-scale n-grams: “stupid backoff” (Brants et al. 2007) backs off through lower-order scores and terminates in the unigram score
S(wi) = count(wi) / N
(S is a score, not a probability: it is not normalized.)
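A sketch of the bigram-to-unigram case, assuming count tables like those built earlier; the fixed backoff weight of 0.4 is the value reported by Brants et al.

```python
# "Stupid backoff": use the relative frequency when the bigram was seen,
# otherwise back off to the unigram score S(w) = count(w) / N with a
# fixed weight. Returns a score, not a normalized probability.
def stupid_backoff(prev, word, bigrams, unigrams, N, alpha=0.4):
    c = bigrams.get((prev, word), 0)
    if c > 0:
        return c / unigrams[prev]
    return alpha * unigrams.get(word, 0) / N
```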
Advanced: Kneser-Ney Smoothing
Absolute discounting: just subtract a little (a fixed discount d) from each count, and interpolate with the unigram distribution:
P_AbsoluteDiscounting(wi | wi−1) = (c(wi−1, wi) − d) / c(wi−1) + λ(wi−1) P(wi)
Kneser-Ney Smoothing
• Better estimate for the probabilities of lower-order unigrams!
• I can’t see without my reading ____ (glasses? Francisco?)
• “Francisco” is more common than “glasses”
• … but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w), “how likely is w?”,
• use Pcontinuation(w): “how likely is w to appear as a novel continuation?”
PCONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
Kneser-Ney Smoothing
• Count the number of unique words that come before w, normalized by the total number of unique bigram types:
PCONTINUATION(w) = |{wi−1 : c(wi−1, w) > 0}| / |{(wj−1, wj) : c(wj−1, wj) > 0}|
Kneser-Ney Smoothing
PKN(wi | wi−1) = max(c(wi−1, wi) − d, 0) / c(wi−1) + λ(wi−1) PCONTINUATION(wi)
λ is a normalizing constant: the probability mass we’ve discounted:
λ(wi−1) = (d / c(wi−1)) × |{w : c(wi−1, w) > 0}|
where d / c(wi−1) is the normalized discount, and |{w : c(wi−1, w) > 0}| is the number of unique words that can follow wi−1, i.e. the number of times we applied the normalized discount.
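A minimal sketch of interpolated Kneser-Ney for bigrams, following the formulas above; the discount d = 0.75 is a common default, and the toy corpus is invented.

```python
from collections import Counter

def train_kn(corpus, d=0.75):
    contexts, bigrams = Counter(), Counter()
    for sent in corpus:
        contexts.update(sent[:-1])           # c(w_{i-1}) as a bigram context
        bigrams.update(zip(sent, sent[1:]))

    preceders = Counter(w for (_, w) in bigrams)  # |{w' : c(w', w) > 0}|
    followers = Counter(w for (w, _) in bigrams)  # |{w' : c(w, w') > 0}|
    n_bigram_types = len(bigrams)

    def p_kn(prev, word):
        discounted = max(bigrams[(prev, word)] - d, 0) / contexts[prev]
        lam = (d / contexts[prev]) * followers[prev]  # discounted mass
        p_cont = preceders[word] / n_bigram_types     # P_CONTINUATION(word)
        return discounted + lam * p_cont

    return p_kn

corpus = [["<s>", "i", "want", "chinese", "food", "</s>"],
          ["<s>", "i", "want", "to", "eat", "</s>"]]
p = train_kn(corpus)
print(p("i", "want"))  # 0.625 + 0.375 * (1/8) = 0.671875
```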