
Statistical Language Models

Lin Li
CS@PVAMU
Probabilistic Language Modeling
 Goal: compute the probability of a sentence or
sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
 Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
 A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model

2
Why is a language model useful?
 Provide a principled way to quantify the uncertainties
associated with natural language
 Allow us to answer questions like:
 Given that we see “John” and “feels”, how likely will we see
“happy” as opposed to “habit” as the next word?
(speech recognition)
 Given that we observe “baseball” three times and “game”
once in a news article, how likely is it to be about “sports” vs.
“politics”?
(text categorization)
 Given that a user is interested in sports news, how likely
would the user use “baseball” in a query?
(information retrieval)
3
Statistical Language Models

 N-gram models
 Case Study: Query likelihood Model in IR

4
Finite Automata and Language Models

5
Probability of Word Sequence
 Probability of the word sequence generated by the automaton
in Figure 12.2 by model M1:
P(frog said that toad likes frog)
= (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
≈ 0.000000000001573
6
Likelihood Ratio
 Which model is more likely to generate the sentence “frog
said that toad likes that dog”?

s     frog    said   that   toad    likes   that   dog
M1    0.01    0.03   0.04   0.01    0.02    0.04   0.005
M2    0.0002  0.03   0.04   0.0001  0.04    0.04   0.01

P(s|M1) = 0.00000000000048
P(s|M2) = 0.000000000000000384
Note: the probabilities of STOP and non-STOP are omitted

 But the models do not consider the interaction between the
terms; are these terms independent? (practically NOT)

7
Chain Rule and Conditional Probability
 Product rule: p(AB) = p(A|B) × p(B) = p(B|A) × p(A)
 Chain rule:
p(ABCD) = p(A) × p(B|A) × p(C|AB) × p(D|ABC)
p(ABCD|E) = p(A|E) × p(B|AE) × p(C|ABE) × p(D|ABCE)
Chain rule in general:
P(x1, x2, …, xn) = P(x1) P(x2|x1) P(x3|x1x2) ⋯ P(xn|x1, …, xn−1)
 A document “frog said that toad likes frog” is essentially a
term sequence (t1, t2, t3, t4, t5, t1)
 If there is dependence between the terms
P(t1t2t3t4) = P(t1) P(t2|t1) P(t3|t1t2) P(t4|t1t2t3)
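As a toy illustration, here is a minimal Python sketch of the chain-rule factorization; all conditional probability values below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical conditional probabilities for the term sequence t1 t2 t3 t4.
p_t1 = 0.01                # P(t1)
p_t2_given_t1 = 0.2        # P(t2 | t1)
p_t3_given_t1_t2 = 0.5     # P(t3 | t1 t2)
p_t4_given_t1_t2_t3 = 0.3  # P(t4 | t1 t2 t3)

# Chain rule: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)
p_sequence = p_t1 * p_t2_given_t1 * p_t3_given_t1_t2 * p_t4_given_t1_t2_t3
print(p_sequence)  # 0.0003
```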

8
How to estimate these probabilities?
 Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
 No! Too many possible sentences! We’ll never see enough
data for estimating these

9
Markov Assumption
 Simplify the condition
P(the | its water is so transparent that) ≈ P(the | that)
or
P(the | its water is so transparent that) ≈ P(the | transparent that)    (Andrei Markov)

 Assume each term depends only on its preceding k terms; then we have:
P(w1 w2 … wn) ≈ ∏_i P(wi | wi−k … wi−1)
In other words, approximate each component in the product by
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)

10
Type of Language Models
 If we assume there is NO dependence between terms —
unigram language model
P_uni(t1t2t3t4) = P(t1) P(t2) P(t3) P(t4)
 If we assume there is only dependence between
consecutive terms — bigram language model
P_bi(t1t2t3t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
 N-gram language models
 Conditioned on the preceding n-1 words
 e.g., trigram language model:
P(w1 w2 … wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|wn−2 wn−1)

11
Type of Language Models (cont’d)
 In general, N-gram is an insufficient model of language
 because language has long-distance dependencies:
“The computer which I had just put into the machine room on
the fifth floor crashed.”
 But we can often get away with N-gram models
 Other sophisticated language models
 Remote-dependence language models (e.g., maximum
entropy model)
 Structured language models (e.g., probabilistic context-free
grammar)

12
Pseudo Code of Text Generation (Unigram)
 Sample from a discrete distribution p(X)
 Assume n outcomes in the event space X
1. Divide the interval [0,1] into n intervals according to the
probabilities of the outcomes
2. Generate a random number r between 0 and 1
3. Return the outcome xi whose interval r falls into
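A minimal Python sketch of this sampling loop, using a small made-up unigram distribution (the vocabulary and probabilities are hypothetical; </s> plays the role of the STOP event):

```python
import random

def sample(dist):
    """Sample one outcome from a discrete distribution {outcome: probability}."""
    r = random.random()              # step 2: random number in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():  # step 1: the cumulative sums define the intervals
        cumulative += p
        if r < cumulative:
            return outcome           # step 3: r falls into this outcome's interval
    return outcome                   # guard against floating-point round-off

# Hypothetical unigram model (probabilities sum to 1)
unigram = {"the": 0.4, "frog": 0.2, "said": 0.2, "</s>": 0.2}

words = []
while True:
    w = sample(unigram)
    if w == "</s>":
        break
    words.append(w)
print(" ".join(words))
```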

13
Language Generated by N-gram
 Unigram (generated from a language model of New York Times)
 Months the my and issue of year foreign new exchange’s
september were recession exchange new endorsed a q
acquire to six executives.
 Bigram
 Last December through the way to preserve the Hudson
corporation N.B.E.C. Taylor would seem to complete the
major central planners one point five percent of U.S.E. has
already told M.X. corporation of living on information such
as more frequently fishing to keep her.
 Trigram
 They also point to ninety nine point six billion dollars from
two hundred four oh six three percent of the rates of
interest stores as Mexico and Brazil on market conditions.
14
Parameter Estimation
 General setting:
 Given a (hypothesized & probabilistic) model that governs
the random experiment
 The model gives a probability of any data 𝑝(𝑋|𝜃) that
depends on the parameter 𝜃
 Now, given actual sample data X={x1,…,xn}, what can we
say about the value of 𝜃?
 Intuitively, take our best guess of 𝜃 -- “best” means
“best explaining/fitting the data”
 Generally an optimization problem

15
Maximum Likelihood Estimates
 The maximum likelihood estimate (MLE)
 of some parameter of a model M from a training set T
 maximizes the likelihood of the training set T given the
model M
 Suppose the word “bagel” occurs 400 times in a
corpus of a million words, what is the probability that
a random word from some other text will be “bagel”?
 MLE estimate is 400/1,000,000 = .0004
 This may be a bad estimate for some other corpus
 But it is the estimate that makes it most likely that “bagel”
will occur 400 times in this corpus.

16
Estimating N-gram Probabilities
 The Maximum Likelihood Estimate of Bigram
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
 Example:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
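A short Python sketch that computes these MLE bigram estimates directly from the three example sentences (a toy illustration; tokenization is plain whitespace splitting):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram estimate: P(w | prev) = count(prev, w) / count(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: "<s> I" occurs twice, "<s>" occurs three times
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```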

17
Example: Berkeley Restaurant Project Sentences
 Sample sentences
 can you tell me about any good cantonese restaurants close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day
 Raw bigram counts out of 9222 sentences

18
Berkeley Restaurant Project Sentences (cont’d)
 Normalize by unigrams:

 Result:

19
Berkeley Restaurant Project Sentences (cont’d)
 Bigram estimates of sentence probabilities
P(<s> I want english food </s>)
= P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
= 0.000031
 Prior knowledge
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (I | <s>) = .25
20
N-gram Probability
 Practical issues: do everything in log space
 Avoid underflow
 Also, adding is faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
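For instance, a tiny sketch with made-up per-word probabilities:

```python
import math

probs = [0.25, 0.0011, 0.33, 0.0065]        # hypothetical component probabilities

log_prob = sum(math.log(p) for p in probs)  # add log-probabilities instead of multiplying
print(log_prob)                             # about -14.3, safe from underflow
print(math.exp(log_prob))                   # recovers the product if ever needed
```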

 Language model toolkits


 SRILM
 http://www.speech.sri.com/projects/srilm/
 KenLM
 https://kheafield.com/code/kenlm/

21
Language Model Evaluation
 Train the models on the same training set
 Parameter tuning can be done by holding off some
training set for validation
 Test the models on an unseen test set (i.e., check with
what probability the test set can be generated
by the language model)
 This data set must be disjoint from training data
 Language model A is better than model B
 If A assigns higher probability to the test data than B

22
Perplexity
 Standard evaluation metric for language models
 A function of the probability that a language model assigns
to a data set
 Rooted in the notion of cross-entropy in information theory
 The inverse of the likelihood of the test set as assigned
by the language model, normalized by the number of
words
PP(w1, …, wN) = [ ∏_{i=1}^{N} 1 / p(wi | wi−1, …, wi−n+1) ]^(1/N)
(where the conditional probability comes from the n-gram language model)
 Meaning of Perplexity: when trying to guess the next
word, the average # of different words to select from
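A minimal sketch of the computation for a bigram model, assuming the model is supplied as a function bigram_prob(w, prev); the uniform model at the end is a made-up sanity check:

```python
import math

def perplexity(test_tokens, bigram_prob):
    """Perplexity = exp of the average negative log-probability per word."""
    log_sum, n = 0.0, 0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log(bigram_prob(w, prev))
        n += 1
    return math.exp(-log_sum / n)

# Sanity check: a uniform model over a 1000-word vocabulary has perplexity 1000,
# i.e., on average 1000 equally likely words to choose from at each step.
uniform = lambda w, prev: 1.0 / 1000
print(perplexity("<s> I want english food </s>".split(), uniform))  # ≈ 1000
```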
23
Perplexity (cont’d)
 The smaller the better!
 A model is more certain on how to generate the text if its
perplexity is less (i.e., knows better which word to choose)
 Word sequence is the entire sequence of words in
the test set. Since it will cross many sentence
boundaries, include the begin- and end-sentence
markers <s> and </s> in the probability computation.
 Also, need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>)
in the total count of word tokens N
 Why not include <s>?

24
An Experiment
 Models
 Unigram, Bigram, Trigram models (with proper smoothing)
 Training data
 38M words of WSJ text (vocabulary: 20K types)
 Test data
 1.5M words of WSJ text
 Results
             Unigram   Bigram   Trigram
Perplexity   962       170      109

25
Model Generalization
 N-gram model depends on the training corpus
 Probabilities often encode specific facts about a given
training corpus
 Will better model the training corpus as we increase the
value of N
 What text can be generated by an N-gram model (e.g.,
bigram)? —Shannon visualization method
1. Choose a random bigram (<s>, w) according to its probability
2. Now choose a random bigram (w, x) according to its probability
3. And so on until we choose </s>
4. Then string the words together
Example:
<s> I → I want → want to → to eat → eat Chinese → Chinese food → food </s>
Result: I want to eat Chinese food
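A sketch of this generation procedure over a toy, hypothetical bigram table (each context below has a single continuation, so the output is always the example sentence; with real counts the choices would be random):

```python
import random

# Hypothetical bigram model: for each context word, a distribution over next words.
bigram = {
    "<s>": {"I": 1.0},        "I": {"want": 1.0},
    "want": {"to": 1.0},      "to": {"eat": 1.0},
    "eat": {"Chinese": 1.0},  "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}

def generate(model):
    """Shannon visualization: repeatedly sample the next word given the previous one."""
    word, out = "<s>", []
    while True:
        nexts, probs = zip(*model[word].items())
        word = random.choices(nexts, weights=probs)[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate(bigram))  # "I want to eat Chinese food"
```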

26
Why Generalization? – An Example
 Approximating Shakespeare using different N-grams

27
Why Generalization? – An Example (cont’d)
 Shakespeare as corpus (his oeuvre is not large as
corpora go)
 N=884,647 tokens, V=29,066
 Shakespeare produced 300,000 bigram types out of V2=
844 million possible bigrams.
 So 99.96% of the possible bigrams were never seen (i.e., have
zero entries in the table)
 Quadrigrams: too much like Shakespeare!
 E.g., once the generator has chosen the first 4-gram (It
cannot be but), there are only five possible continuations
(that, I, he, thou, and so); indeed, for many 4-grams, there
is only one continuation
What's coming out looks like Shakespeare because
it is Shakespeare

28
Why Generalization? – An Example (cont’d)

 The wall street journal is different from Shakespeare


(N-gram models trained on WSJ corpus)

29
Problem with MLE

 N-grams work fine for word prediction if the test


corpus looks like the training corpus
 In real life, it often doesn’t
 We need to train robust models that generalize!
 One kind of generalization: unseen events
 We estimated a model on 440K word tokens, but:
 Only 30,000 unique words occurred
 Only 0.04% of all possible bigrams occurred
 Things that don’t ever occur in the training set but do occur
in the test set will have zero probability (thus, the model implies
no future documents can contain those unseen words/N-grams)

30
Problem with MLE: Zeroes
 Zero probability trigram
 Training set:                   Testing set:
… denied the allegations        … denied the offer
… denied the reports            … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0

 Problems with zero probabilities:
 they mean that we will assign 0 probability to the test set!
 hence we cannot compute perplexity (can’t divide by 0)!

31
Smoothing (Intuition)
 When we have sparse statistics:
P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total);
other words such as outcome, attack, man get zero counts
 Steal probability mass to generalize better:
P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
32
Illustration of N-gram Model Smoothing

p(wi | wi−1, …, wi−n+1)

Max. Likelihood Estimate:
p(wi | wi−1, …, wi−n+1) = c(wi, wi−1, …, wi−n+1) / c(wi−1, …, wi−n+1)

Smoothed LM: discount from the seen words; assign nonzero
probabilities to the unseen words

33
Laplace Smoothing
 Pretend we saw each word one more time than we
did, just add one to all the counts! (also called add-one
estimate)

c( wi 1 , wi )
 MLE estimate: PMLE ( wi | wi 1 ) 
c( wi 1 )

c( wi 1 , wi )  1
 Add-1 estimate: PAdd 1 ( wi | wi 1 ) 
c( wi 1 )  V

34
Berkeley Restaurant Sentences (smoothed)

35
Berkeley Restaurant Sentences (smoothed)

 Reconstituted counts
c*(wn−1 wn) = [C(wn−1 wn) + 1] × C(wn−1) / [C(wn−1) + V]

36
Smoothing
 Add-1 estimation is a blunt instrument
 Steals too much probability mass from seen n-grams and
gives it to zero-count n-grams
 Not practically used for N-grams
 But can be used for other models or applications
 Text classification
 Domains where the number of zeros isn’t so huge
 Alternative: Add-k estimate, where k is a much smaller number
(e.g., 0.1, 0.05, etc.)
 But how to choose k to optimize the result? Use a devset
(development dataset, also called a validation dataset) to tune k

37
Backoff and Interpolation
 Sometimes it helps to use less context
 Condition on less context for contexts you haven’t learned
much about
 Backoff:
 use trigram if you have good evidence
 otherwise bigram, otherwise unigram
If we are trying to compute 𝑃(𝑤𝑛 |𝑤𝑛−2 𝑤𝑛−1 ), but we have no examples of a
particular trigram 𝑤𝑛−2 𝑤𝑛−1 𝑤𝑛 , we can instead estimate its probability by
using the bigram probability 𝑃 𝑤𝑛 𝑤𝑛−1 . Similarly, if we do not have counts
to compute 𝑃 𝑤𝑛 𝑤𝑛−1 , we can look to the unigram 𝑃(𝑤𝑛 ).

 Interpolation:
 mix unigram, bigram, trigram
 Interpolation works better
38
Linear Interpolation
 Simple interpolation
P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1)
                   + λ2 P(wn | wn−1)
                   + λ3 P(wn)
 Such that the λs sum to 1: ∑_i λi = 1
 Lambdas conditional on context (e.g., assign a higher
value to λ1 if the context bigram can be accurately counted)
P̂(wn | wn−2 wn−1) = λ1(wn−2 wn−1) P(wn | wn−2 wn−1)
                   + λ2(wn−2 wn−1) P(wn | wn−1)
                   + λ3(wn−2 wn−1) P(wn)
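A minimal sketch of simple linear interpolation, assuming the three component estimators are supplied as functions and using made-up λ weights (they must sum to 1):

```python
def interpolated_prob(w, prev1, prev2, p_uni, p_bi, p_tri,
                      lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | prev2 prev1) = l1*P(w|prev2 prev1) + l2*P(w|prev1) + l3*P(w).
    prev1 is the immediately preceding word, prev2 the one before it."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri(w, prev1, prev2)
            + l2 * p_bi(w, prev1)
            + l3 * p_uni(w))

# Made-up component estimates for P(food | want english):
p = interpolated_prob("food", "english", "want",
                      p_uni=lambda w: 0.001,
                      p_bi=lambda w, prev1: 0.02,
                      p_tri=lambda w, prev1, prev2: 0.0)  # unseen trigram contributes 0
print(p)  # 0.5*0.0 + 0.3*0.02 + 0.2*0.001 = 0.0062
```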
39
Linear Interpolation: how to choose s?
 Use a held-out corpus
Training Data Held-Out Data Test Data

 Choose λs to maximize the probability of held-out data:
 Fix the N-gram probabilities (on the training data)
 Then search for λs that give largest probability to held-out set
log P(w1…wn | M(λ1…λk)) = ∑_i log P_{M(λ1…λk)}(wi | wi−1)

 An evaluation metric is needed to define “optimality”;
will come back to this later

40
Kneser-Ney Smoothing
 Suppose we wanted to subtract a little from a count of 4 to
save probability mass for the zeros, but how much to subtract?
 Church and Gale (1991)’s clever idea:
 Divide the words of AP Newswire: 22 million in a training set
and another 22 million in a held-out set
 For each bigram in the training set, check its actual count in
the held-out set; average the actual counts over all bigrams
having the same count in training
 Finding: except for counts 0 and 1, all bigram counts in the
held-out set can be estimated by subtracting 0.75 from the
count in the training set!

Bigram count in training set    Average bigram count in held-out set
0                               0.0000270
1                               0.448
2                               1.25
3                               2.24
4                               3.23
5                               4.21
6                               5.23
7                               6.21
8                               7.21
9                               8.26
41
Absolute Discount Interpolation
 Save time and just subtract 0.75 (or some d)!
P_AbsoluteDiscounting(wi | wi−1) = [c(wi−1, wi) − d] / c(wi−1) + λ(wi−1) P(wi)
(first term: discounted bigram; λ(wi−1): interpolation weight; P(wi): unigram)

 (Maybe keeping a couple of extra values of d for counts 1 and 2)
 But should we just use the regular unigram P(wi)?

42
Kneser-Ney Smoothing I
 Better estimate for probabilities of lower-order unigrams!
 Shannon game: I can’t see without my reading ___________?
(glasses? Francisco?)
 “Francisco” is more common than “glasses”
 … but “Francisco” always follows “San”
 Unigram is useful exactly when we haven’t seen this
bigram!
 Instead of P(w): “How likely is w”
 Pcontinuation(w): “How likely is w to appear as a novel
continuation?
 For each word, count the number of bigram types it completes
 Every bigram type was a novel continuation the first time it
was seen
P_CONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
43
Kneser-Ney Smoothing II
 How many times does w appear as a novel continuation
(the number of word types seen to precede w):
P_CONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|

 Normalized by the total number of word bigram types:
P_CONTINUATION(w) = |{wi−1 : c(wi−1, w) > 0}| / |{(wj−1, wj) : c(wj−1, wj) > 0}|

44
Kneser-Ney Smoothing III
 Alternative metaphor: the number of word types
seen to precede w
| {wi 1 : c( wi 1 , w)  0} |

 Normalized by the # of word types preceding all words:
P_CONTINUATION(w) = |{wi−1 : c(wi−1, w) > 0}| / ∑_{w′} |{w′i−1 : c(w′i−1, w′) > 0}|

 A frequent word (Francisco) occurring in only one
context (San) will have a low continuation probability
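Putting absolute discounting and the continuation probability together, here is a minimal sketch of interpolated Kneser-Ney bigram estimation on a toy token stream (not an efficient or production implementation; d = 0.75 as suggested above):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Return p(w, prev), an interpolated Kneser-Ney bigram estimate."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often each word serves as a context
    left_contexts = defaultdict(set)       # word -> distinct words seen before it
    followers = defaultdict(set)           # word -> distinct words seen after it
    for prev, w in bigrams:
        left_contexts[w].add(prev)
        followers[prev].add(w)
    num_bigram_types = len(bigrams)

    def p_continuation(w):
        return len(left_contexts[w]) / num_bigram_types

    def p(w, prev):
        discounted = max(bigrams[(prev, w)] - d, 0) / context_counts[prev]
        lam = d * len(followers[prev]) / context_counts[prev]  # leftover probability mass
        return discounted + lam * p_continuation(w)

    return p

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
p = kneser_ney_bigram(tokens)
print(p("am", "I"))   # seen bigram: discounted count plus a little continuation mass
print(p("Sam", "I"))  # unseen bigram: backed off entirely to its continuation probability
```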

45
Engineering N-gram Models
 For 5+-gram models, need to store between 100M and 10B
(context, word, count) triples
 Make it fit in memory with a delta encoding scheme: store deltas
instead of values and use variable-length encoding
Pauls and Klein (2011), Heafield (2011) 46
Statistical Language Models

 N-gram models
 Case Study: Query likelihood Model in IR

47
Multinomial Distribution Over Words
 Probability of generating a specific document (Unigram)
P(d) = [Ld! / (tf_{t1,d}! tf_{t2,d}! … tf_{tM,d}!)] × P(t1)^{tf_{t1,d}} P(t2)^{tf_{t2,d}} … P(tM)^{tf_{tM,d}}
where Ld = ∑_{1≤i≤M} tf_{ti,d} is the length of document d
 M is the size of the vocabulary, tfi,d is the frequency of
term i in document d.
 The coefficient Ld! / (tf_{t1,d}! tf_{t2,d}! … tf_{tM,d}!) is the # of combinations of
all documents having the same terms and term frequencies
 Positional difference is ignored here, i.e. p(“Hong Hong
Kong Kong”) = p(“Kong Kong Hong Hong”) = p(“Hong Kong
Hong Kong”)
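A small sketch of this multinomial document probability; the term probabilities below are made up, and the two print statements illustrate that word order does not matter:

```python
from math import factorial
from collections import Counter

def multinomial_doc_prob(doc_tokens, term_probs):
    """P(d) under a unigram multinomial model with given term probabilities."""
    tf = Counter(doc_tokens)
    L_d = sum(tf.values())
    coeff = factorial(L_d)
    prob = 1.0
    for term, count in tf.items():
        coeff //= factorial(count)         # L_d! / (tf_1! tf_2! ... tf_M!)
        prob *= term_probs[term] ** count  # P(t_i)^tf_i
    return coeff * prob

probs = {"Hong": 0.01, "Kong": 0.01}       # hypothetical term probabilities
print(multinomial_doc_prob("Hong Hong Kong Kong".split(), probs))  # 6e-08
print(multinomial_doc_prob("Kong Kong Hong Hong".split(), probs))  # same value
```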
48
Query Likelihood Model
 Probability of generating a specific query from a
document: P(d|q) = P(q|d) P(d) / P(q)

 Assume P(d) follows a uniform distribution,
i.e., P(di) = P(dj) for any i and j; thus it can be ignored,
like P(q)
 Estimation of unigram language model:
P(q | Md) = ∏_{t∈q} P_mle(t | Md) = ∏_{t∈q} tf_{t,d} / Ld
where Md is the language model of document d, tft,d is the
(raw) term frequency of term t in document d, and Ld is
the number of tokens in document d.

49
Smoothing by Non-occurring Terms
 Non-occurring terms should be given probabilities to
smooth calculation (generalize model and discount zero
probabilities)
 A mixture between document-specific multinomial
distribution and entire-collection multinomial distribution
 Linear Interpolation language model
P(t|d) = λ P_mle(t|Md) + (1 − λ) P_mle(t|Mc)
 Bayesian Smoothing:
P(t|d) = [tf_{t,d} + α P(t|Mc)] / (Ld + α)
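A minimal sketch of query-likelihood scoring with linear interpolation smoothing (toy documents; λ = 0.5 is an arbitrary choice for illustration):

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by P(q|Md) with P(t|d) = lam*P_mle(t|Md) + (1-lam)*P_mle(t|Mc)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(t for d in collection for t in d)
    L_d, L_c = sum(doc_tf.values()), sum(coll_tf.values())
    score = 1.0
    for t in query:
        score *= lam * doc_tf[t] / L_d + (1 - lam) * coll_tf[t] / L_c
    return score

docs = [
    "frog said that toad likes frog".split(),
    "toad likes the pond".split(),
]
query = "frog likes".split()
for d in docs:
    print(query_likelihood(query, d, docs))  # the first document scores higher
```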

50
Query Likelihood Model (Example)

51
Language Model Performance

 Language models may yield better retrieval results than the
traditional tf-idf approach (Ponte & Croft, 1998)

52
Extended Language Models
 Document likelihood model: less appealing since there is
much less text available in query to estimate a LM; on the
other hand, easy to integrate relevance feedback.
 Mixture model of both query likelihood and document
likelihood model
 Kullback-Leibler divergence: an asymmetric divergence measure
of how bad the probability distribution Mq is at modeling Md:
R(d; q) = KL(Md ‖ Mq) = ∑_{t∈V} P(t|Mq) log [ P(t|Mq) / P(t|Md) ]

53
What you should know
 N-gram language models
 How to generate text documents from a language model
 How to estimate a language model
 General idea and different ways of smoothing
 Language model evaluation

54
Reading
 IIR Chapter 12
 SLP Chapter 3

Note: IIR: “Introduction to Information Retrieval”, SLP: “Speech and
Language Processing”; check slide 30 of lecture 1 for the acronyms of
the textbooks

55
Acknowledgement
 The slides of this lecture were prepared based on the
course materials of the following scholars:
 Textbook and slides of Dan Jurafsky, Stanford University, “Speech and
Language Processing”
 Textbook and course slides of Christopher Manning, Stanford
University, “Information Retrieval and Web Search”
 Course materials of many other higher educators (e.g., Hongning
Wang from University of Virginia, Greg Durrett from UT Austin, etc.)

56
