Lecture - 3 - Statistical Language Models
Lin Li
CS@PVAMU
Probabilistic Language Modeling
Goal: compute the probability of a sentence or
sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model
2
Why is a language model useful?
Provide a principled way to quantify the uncertainties
associated with natural language
Allow us to answer questions like:
Given that we see “John” and “feels”, how likely are we to see
“happy”, as opposed to “habit”, as the next word?
(speech recognition)
Given that we observe “baseball” three times and “game” once in
a news article, how likely is the article to be about “sports” vs. “politics”?
(text categorization)
Given that a user is interested in sports news, how likely
would the user use “baseball” in a query?
(information retrieval)
3
Statistical Language Models
N-gram models
Case Study: Query likelihood Model in IR
4
Finite Automata and Language Models
5
Probability of Word Sequence
7
Chain Rule and Conditional Probability
Product rule: p(AB) = p(A|B) × p(B) = p(B|A) × p(A)
Chain rule:
p(ABCD) = p(A) × p(B|A) × p(C|AB) × p(D|ABC)
p(ABCD|E) = p(A|E) × p(B|AE) × p(C|ABE) × p(D|ABCE)
Chain rule in general:
P(x1, x2, …, xn) = P(x1) × P(x2|x1) × P(x3|x1x2) × ⋯ × P(xn|x1, …, xn−1)
A document “frog said that toad likes frog” is essentially a
term sequence (t1, t2, t3, t4, t5, t1)
If there is dependence between the terms
P(t1t2t3t4) = P(t1) × P(t2|t1) × P(t3|t1t2) × P(t4|t1t2t3)
8
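The chain-rule factorization above can be sketched in a few lines of Python. The function name and the conditional probabilities below are hypothetical, chosen only to illustrate the product of conditionals:

```python
def chain_rule_prob(tokens, cond_prob):
    """P(t1..tn) = product over i of P(t_i | t_1..t_{i-1})."""
    p = 1.0
    for i, tok in enumerate(tokens):
        history = tuple(tokens[:i])   # all preceding terms
        p *= cond_prob[(history, tok)]
    return p

# Hypothetical conditional probabilities for the sequence t1 t2 t3 t4:
cond_prob = {
    ((), "t1"): 0.5,
    (("t1",), "t2"): 0.4,
    (("t1", "t2"), "t3"): 0.25,
    (("t1", "t2", "t3"), "t4"): 0.1,
}
p = chain_rule_prob(["t1", "t2", "t3", "t4"], cond_prob)  # 0.5 × 0.4 × 0.25 × 0.1
```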
How to estimate these probabilities?
Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
No! Too many possible sentences! We’ll never see enough
data for estimating these
9
Markov Assumption
Simplify the condition
P(the | its water is so transparent that) ≈ P(the | that)
or
P(the | its water is so transparent that) ≈ P(the | transparent that)
Andrei Markov
10
Types of Language Models
If we assume there is NO dependence between terms, we get a
unigram language model:
Puni(t1t2t3t4) = P(t1) × P(t2) × P(t3) × P(t4)
If we assume there is dependence only between consecutive
terms, we get a bigram language model:
Pbi(t1t2t3t4) = P(t1) × P(t2|t1) × P(t3|t2) × P(t4|t3)
N-gram language models
Conditioned on the preceding n−1 words
e.g., trigram language model:
P(w1w2…wn) = P(w1) × P(w2|w1) × P(w3|w1w2) × … × P(wn|wn−2wn−1)
11
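The unigram and bigram factorizations can be sketched directly; the function names and probability values here are hypothetical, purely to show how the two models multiply different factors:

```python
def unigram_prob(tokens, p_uni):
    """P(t1..tn) = product of independent unigram probabilities."""
    prob = 1.0
    for t in tokens:
        prob *= p_uni[t]
    return prob

def bigram_prob(tokens, p_uni, p_bi):
    """P(t1..tn) = P(t1) * product of P(t_i | t_{i-1})."""
    prob = p_uni[tokens[0]]           # first term has no history
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_bi[(prev, cur)]     # condition only on the previous term
    return prob

# Hypothetical probabilities for a toy two-word vocabulary:
p_uni = {"a": 0.5, "b": 0.25}
p_bi = {("a", "b"): 0.8, ("b", "a"): 0.1}
```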
Types of Language Models (cont’d)
In general, an N-gram model is an insufficient model of language
because language has long-distance dependencies:
“The computer which I had just put into the machine room on
the fifth floor crashed.”
But we can often get away with N-gram models
Other sophisticated language models
Remote-dependence language models (e.g., maximum
entropy model)
Structured language models (e.g., probabilistic context-free
grammar)
12
Pseudo Code of Text Generation (Unigram)
Sample from a discrete distribution 𝑝(𝑋)
Assume 𝑛 outcomes in the event space 𝑋
13
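The pseudocode figure is not reproduced here, so the following is a minimal Python sketch of sampling from a discrete distribution (inverse-CDF sampling) and generating unigram text; the function names are my own:

```python
import random

def sample(outcomes, probs, rng=random):
    """Draw one outcome from a discrete distribution p(X) via inverse CDF."""
    r = rng.random()          # uniform draw in [0, 1)
    cum = 0.0
    for outcome, p in zip(outcomes, probs):
        cum += p              # walk the cumulative distribution
        if r < cum:
            return outcome
    return outcomes[-1]       # guard against floating-point rounding

def generate_unigram(vocab, probs, length, seed=0):
    """Generate `length` tokens, each drawn independently (unigram model)."""
    rng = random.Random(seed)
    return [sample(vocab, probs, rng) for _ in range(length)]
```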
Language Generated by N-gram
Unigram Generated from language models of New York Times
Months the my and issue of year foreign new exchange’s
september were recession exchange new endorsed a q
acquire to six executives.
Bigram
Last December through the way to preserve the Hudson
corporation N.B.E.C. Taylor would seem to complete the
major central planners one point five percent of U.S.E. has
already told M.X. corporation of living on information such
as more frequently fishing to keep her.
Trigram
They also point to ninety nine point six billion dollars from
two hundred four oh six three percent of the rates of
interest stores as Mexico and Brazil on market conditions.
14
Parameter Estimation
General setting:
Given a (hypothesized & probabilistic) model that governs
the random experiment
The model gives a probability of any data 𝑝(𝑋|𝜃) that
depends on the parameter 𝜃
Now, given actual sample data X={x1,…,xn}, what can we
say about the value of 𝜃?
Intuitively, take our best guess of 𝜃 -- “best” means
“best explaining/fitting the data”
Generally an optimization problem
15
Maximum Likelihood Estimates
The maximum likelihood estimate (MLE)
of some parameter of a model M from a training set T
maximizes the likelihood of the training set T given the
model M
Suppose the word “bagel” occurs 400 times in a corpus of a
million words. What is the probability that a random word
from some other text will be “bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
But it is the estimate that makes it most likely that “bagel”
will occur 400 times in this corpus.
16
Estimating N-gram Probabilities
The Maximum Likelihood Estimate of Bigram
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
Example:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
17
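As a sketch, the MLE bigram estimates can be computed directly from the three example sentences above; the `Counter`-based helper is illustrative, not from the slides:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)                 # count(w)
    bigrams.update(zip(toks, toks[1:]))   # count(w_{i-1}, w_i)

def p_mle(w, prev):
    """MLE bigram estimate: count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]
```

For example, `p_mle("I", "<s>")` is 2/3, since “I” follows `<s>` in two of the three sentences.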
Example: Berkeley Restaurant Project Sentences
Sample sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts out of 9222 sentences
18
Berkeley Restaurant Project Sentences (cont’d)
Normalize by unigrams:
Result:
19
Berkeley Restaurant Project Sentences (cont’d)
Bigram estimates of sentence probabilities
P(<s> I want english food </s>)
= P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
= 0.000031
Prior knowledge
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (I | <s>) = .25
20
N-gram Probability
Practical issues: do everything in log space
Avoid underflow
Also, adding is faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
21
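A quick sketch of why log space works: summing logs and exponentiating recovers the direct product (the per-step probabilities below are hypothetical):

```python
import math

# Hypothetical per-step probabilities of a short sentence:
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# Direct product: risks underflow as the sequence grows.
direct = 1.0
for p in probs:
    direct *= p

# Log space: add instead of multiply, then exponentiate once at the end.
log_sum = sum(math.log(p) for p in probs)
```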
Language Model Evaluation
Train the models on the same training set
Parameter tuning can be done by holding off some
training set for validation
Test the models on an unseen test set (i.e., check with what
probability the language model generates the test set)
This data set must be disjoint from training data
Language model A is better than model B
If A assigns higher probability to the test data than B
22
Perplexity
Standard evaluation metric for language models
A function of the probability that a language model assigns
to a data set
Rooted in the notion of cross-entropy in information theory
The inverse of the probability the language model assigns to
the test set, normalized by the number of words:
PP(w1, …, wN) = [ ∏i=1..N 1 / P(wi | wi−1, …, wi−n+1) ]^(1/N)
for an n-gram language model
Meaning of Perplexity: when trying to guess the next
word, the average # of different words to select from
23
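The perplexity formula above can be sketched in log space (equivalent to the N-th root of the inverse probability product, and numerically safer); the function name is my own:

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum_i log p_i), the geometric-mean inverse probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

A sanity check: if the model always picks uniformly among 10 words (each probability 0.1), the perplexity is 10, matching the “average number of words to select from” reading.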
Perplexity (cont’d)
The smaller the better!
A model is more certain on how to generate the text if its
perplexity is less (i.e., knows better which word to choose)
Word sequence is the entire sequence of words in
the test set. Since it will cross many sentence
boundaries, include the begin- and end-sentence
markers <s> and </s> in the probability computation.
Also, need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>)
in the total count of word tokens N
Why not include <s>?
24
An Experiment
Models
Unigram, Bigram, Trigram models (with proper smoothing)
Training data
38M words of WSJ text (vocabulary: 20K types)
Test data
1.5M words of WSJ text
Results
             Unigram   Bigram   Trigram
Perplexity     962       170      109
25
Model Generalization
N-gram model depends on the training corpus
Probabilities often encode specific facts about a given
training corpus
Will better model the training corpus as we increase the
value of N
What text can be generated by an N-gram model (e.g., a
bigram model)? The Shannon visualization method:
1. Choose a random bigram (<s>, w) according to its probability
2. Now choose a random bigram (w, x) according to its probability
3. And so on, until we choose </s>
4. Then string the words together
Example: (<s>, I) (I, want) (want, to) (to, eat) (eat, Chinese)
(Chinese, food) (food, </s>) gives “I want to eat Chinese food”
26
Why Generalization? – An Example
Approximating Shakespeare using different N-grams
27
Why Generalization? – An Example (cont’d)
Shakespeare as corpus (his oeuvre is not large as
corpora go)
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of V2=
844 million possible bigrams.
So 99.96% of the possible bigrams were never seen (i.e., have
zero entries in the table)
Quadrigrams: too much like Shakespeare!
E.g., once the generator has chosen the first 4-gram (It
cannot be but), there are only five possible continuations
(that, I, he, thou, and so); indeed, for many 4-grams, there
is only one continuation
What's coming out looks like Shakespeare because
it is Shakespeare
28
Why Generalization? – An Example (cont’d)
29
Problem with MLE
30
Problem with MLE: Zeroes
Zero probability trigram
Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request
Testing set:
… denied the offer
… denied the loan
31
Smoothing (Intuition)
When we have sparse statistics, e.g., P(w | denied the)
estimated from 7 observations:
3 allegations
2 reports
1 claims
1 request
(7 total; outcome, attack, man, … all get zero probability)
Smoothing steals probability mass to generalize better:
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other (outcome, attack, man, …)
(7 total)
32
Illustration of N-gram Model Smoothing
p(wi | wi−1, …, wi−n+1)
A smoothed LM discounts probability mass from the seen
words and assigns nonzero probabilities to the unseen words
33
Laplace Smoothing
Pretend we saw each word one more time than we
did, just add one to all the counts! (also called add-one
estimate)
MLE estimate:
PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
Add-1 estimate:
PAdd-1(wi | wi−1) = [c(wi−1, wi) + 1] / [c(wi−1) + V]
34
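The add-1 estimate is a one-liner; the function name and the toy counts in the test are hypothetical:

```python
def p_add1(w, prev, bigram_counts, unigram_counts, V):
    """Laplace (add-1) bigram estimate: (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts.get(prev, 0) + V)

# Toy counts: "a b" seen 3 times, "a" seen 4 times, vocabulary of 5 types.
bc = {("a", "b"): 3}
uc = {"a": 4}
```

Note that an unseen bigram such as `("a", "z")` now gets the nonzero probability 1 / (c(a) + V) instead of zero.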
Berkeley Restaurant Sentences (smoothed)
35
Berkeley Restaurant Sentences (smoothed)
Reconstituted counts
c*(wn−1wn) = [C(wn−1wn) + 1] × C(wn−1) / [C(wn−1) + V]
36
Smoothing
Add-1 estimation is a blunt instrument
It steals too much probability mass from seen n-grams for the zeros
Not used in practice for N-gram language models
But it can be used for other models or applications
Text classification
Domains where the number of zeros isn’t so huge
Alternative: the Add-k estimate, where k is a much smaller number
(e.g., 0.1, 0.05, etc.)
But how to choose k to optimize the result? Use a devset
(development dataset, also called a validation set) to tune it
37
Backoff and Interpolation
Sometimes it helps to use less context
Condition on less context for contexts you haven’t learned
much about
Backoff:
use trigram if you have good evidence
otherwise bigram, otherwise unigram
If we are trying to compute 𝑃(𝑤𝑛 |𝑤𝑛−2 𝑤𝑛−1 ), but we have no examples of a
particular trigram 𝑤𝑛−2 𝑤𝑛−1 𝑤𝑛 , we can instead estimate its probability by
using the bigram probability 𝑃 𝑤𝑛 𝑤𝑛−1 . Similarly, if we do not have counts
to compute 𝑃 𝑤𝑛 𝑤𝑛−1 , we can look to the unigram 𝑃(𝑤𝑛 ).
Interpolation:
mix unigram, bigram, trigram
Interpolation works better
38
Linear Interpolation
Simple interpolation
P̂(wn | wn−2wn−1) = λ1 P(wn | wn−2wn−1)
                 + λ2 P(wn | wn−1)
                 + λ3 P(wn)
such that the λs sum to 1: λ1 + λ2 + λ3 = 1
40
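Simple interpolation is a weighted sum of the three estimates; the default λ values below are hypothetical (in practice they are tuned on held-out data):

```python
def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    The lambdas must sum to 1 so the result remains a probability.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni
```

Even when the trigram estimate is zero (an unseen trigram), the bigram and unigram terms keep the interpolated probability nonzero.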
Kneser-Ney Smoothing
Suppose we wanted to subtract a little from a count of 4 to
save probability mass for the zeros, but how much should we
subtract?
Church and Gale (1991)’s clever idea:
Divide the words of AP Newswire: 22 million in a training set
and another 22 million in a held-out set
For each bigram in the training set, check its actual count in
the held-out set, and average the actual counts over all
bigrams having the same count in training

Bigram count      Average bigram count
in training set   in held-out set
0                 0.0000270
1                 0.448
2                 1.25
3                 2.24
4                 3.23
5                 4.21
6                 5.23
7                 6.21
8                 7.21
9                 8.26

Finding: except for 0 and 1, every bigram count in the
held-out set can be estimated by subtracting 0.75 from the
count in the training set!
41
Absolute Discount Interpolation
Save time and just subtract 0.75 (or some d)!
PAbsoluteDiscounting(wi | wi−1) = [c(wi−1, wi) − d] / c(wi−1) + λ(wi−1) P(wi)
The first term is the discounted bigram estimate, λ(wi−1) is the
interpolation weight, and P(wi) is the unigram probability
42
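A sketch of absolute discounting follows. The slides leave λ(wi−1) unspecified; one standard choice, assumed here, sets λ(wi−1) = d × (number of seen bigram types starting with wi−1) / c(wi−1), so the mass removed by discounting is exactly what gets redistributed:

```python
def p_abs_discount(w, prev, bigrams, unigrams, p_unigram, d=0.75):
    """Absolute discounting: subtract d from each seen bigram count,
    then interpolate with the unigram distribution p_unigram(w)."""
    c_bigram = bigrams.get((prev, w), 0)
    c_prev = unigrams[prev]
    discounted = max(c_bigram - d, 0) / c_prev
    # Mass freed by discounting, spread over the unigram distribution
    # (assumed normalization; see the lead-in above):
    n_types = sum(1 for (p, _) in bigrams if p == prev)
    lam = d * n_types / c_prev
    return discounted + lam * p_unigram(w)

# Toy counts: "a b" twice, "a c" once, "a" three times.
bigrams = {("a", "b"): 2, ("a", "c"): 1}
unigrams = {"a": 3}
```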
Kneser-Ney Smoothing I
Better estimate for probabilities of lower-order unigrams!
Shannon game: I can’t see without my reading ___________
(glasses? Francisco?)
“Francisco” is more common than “glasses”
… but “Francisco” always follows “San”
The unigram is useful exactly when we haven’t seen this bigram!
Instead of P(w), “how likely is w”, use
Pcontinuation(w): “how likely is w to appear as a novel continuation?”
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen
PCONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
43
Kneser-Ney Smoothing II
How many times does w appear as a novel continuation
(the number of word types seen to precede w):
PCONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
44
Kneser-Ney Smoothing III
Alternative metaphor: the number of word types seen to precede w:
|{wi−1 : c(wi−1, w) > 0}|
45
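The continuation probability can be sketched as follows. The slides give it only up to proportionality; dividing by the total number of bigram types, assumed here, is one standard normalization:

```python
def p_continuation(w, bigram_types):
    """Fraction of distinct bigram types that end in w
    (i.e., number of word types seen to precede w, normalized)."""
    num = sum(1 for (prev, cur) in bigram_types if cur == w)
    return num / len(bigram_types)

# Toy bigram types echoing the Francisco/glasses example:
bigram_types = {
    ("San", "Francisco"),
    ("reading", "glasses"),
    ("my", "glasses"),
    ("without", "glasses"),
}
```

Here “glasses” completes three bigram types and “Francisco” only one, so the continuation probability prefers “glasses” even if raw unigram counts favored “Francisco”.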
Engineering N-gram Models
N-gram models
Case Study: Query likelihood Model in IR
47
Multinomial Distribution Over Words
Probability of generating a specific document (unigram):
P(d) = [Ld! / (tf(t1,d)! tf(t2,d)! … tf(tM,d)!)] × P(t1)^tf(t1,d) × P(t2)^tf(t2,d) × … × P(tM)^tf(tM,d)
Ld = Σ1≤i≤M tf(ti,d) is the length of document d,
M is the size of the vocabulary, and tf(ti,d) is the frequency of
term ti in document d
The multinomial coefficient Ld! / (tf(t1,d)! tf(t2,d)! … tf(tM,d)!) is the
number of distinct orderings of a document with the same terms
and term frequencies
Positional difference is ignored here, i.e. p(“Hong Hong
Kong Kong”) = p(“Kong Kong Hong Hong”) = p(“Hong Kong
Hong Kong”)
48
Query Likelihood Model
Probability of generating a specific query from a document:
P(q|d); documents are then ranked by Bayes’ rule:
P(d|q) = P(q|d) P(d) / P(q)
49
Smoothing by Non-occurring Terms
Non-occurring terms should be given probabilities to
smooth calculation (generalize model and discount zero
probabilities)
A mixture between document-specific multinomial
distribution and entire-collection multinomial distribution
Linear interpolation language model:
P(t|d) = λ Pmle(t|Md) + (1 − λ) Pmle(t|Mc)
Bayesian smoothing:
P(t|d) = [tf(t,d) + α P(t|Mc)] / (Ld + α)
50
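A sketch of query-likelihood scoring with the linear interpolation smoothing above; the function name and the toy counts in the test are hypothetical:

```python
def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(q|d) under a unigram model, smoothing each term's document MLE
    with the collection MLE: lam * P(t|Md) + (1 - lam) * P(t|Mc)."""
    score = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len     # Pmle(t | Md)
        p_coll = coll_tf.get(t, 0) / coll_len  # Pmle(t | Mc)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score
```

A term absent from the document still contributes a nonzero factor through the collection model, so one missing query term no longer zeroes out the whole score.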
Query Likelihood Model (Example)
51
Language Model Performance
52
Extended Language Models
Document likelihood model: less appealing since there is
much less text available in query to estimate a LM; on the
other hand, easy to integrate relevance feedback.
Mixture model of both query likelihood and document
likelihood model
Kullback-Leibler divergence: an asymmetric divergence measure
of how badly the probability distribution Mq approximates Md:
R(d; q) = KL(Md ‖ Mq) = Σt∈V P(t|Mq) log [ P(t|Mq) / P(t|Md) ]
53
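The divergence sum above can be sketched directly, summing over terms the query model assigns nonzero probability (both models are assumed smoothed so P(t|Md) > 0 on those terms):

```python
import math

def kl_divergence(p_q, p_d):
    """Sum over t of P(t|Mq) * log(P(t|Mq) / P(t|Md)), as in the
    ranking formula; p_q and p_d map terms to probabilities."""
    return sum(pq * math.log(pq / p_d[t]) for t, pq in p_q.items() if pq > 0)
```

The divergence is 0 exactly when the two distributions agree, and grows as the document model drifts from the query model, so smaller values rank higher.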
What you should know
N-gram language models
How to generate text documents from a language model
How to estimate a language model
General idea and different ways of smoothing
Language model evaluation
54
Reading
IIR Chapter 12
SLP Chapter 3
55
Acknowledgement
The slides of this lecture were prepared based on the
course materials of the following scholars:
Textbook and slides of Dan Jurafsky, Stanford University, “Speech and
Language Processing”
Textbook and course slides of Christopher Manning, Stanford
University, “Information Retrieval and Web Search”
Course materials of many other educators (e.g., Hongning
Wang from the University of Virginia, Greg Durrett from UT Austin, etc.)
56