N-grams & Language Modeling
A bad language model
What’s a language model for?
Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation
Try to complete the following
Next Word Prediction
Stocks plunged this morning, despite a cut in interest rates by the
Federal Reserve, as Wall Street began trading for the first time
since last …
Stocks plunged this morning, despite a cut in interest rates by the
Federal Reserve, as Wall Street began trading for the first time
since last Tuesday's terrorist attacks.
Human Word Prediction
Claim
Lots of applications
Real Word Spelling Errors
Applications
Simple N-Grams
N-grams
Computing the Probability of a Word Sequence
P(w1, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn-1)
Bigram Model
Using N-Grams
For N-gram models:
P(wn | w1, ..., wn-1) ≈ P(wn | wn-N+1, ..., wn-1)
P(wn-1,wn) = P(wn | wn-1) P(wn-1)
By the Chain Rule we can decompose a joint probability, e.g. P(w1, w2, w3):
P(w1, w2, ..., wn) = P(w1 | w2, w3, ..., wn) P(w2 | w3, ..., wn) ... P(wn-1 | wn) P(wn)
For bigrams then, the probability of a sequence is just the product of the
conditional probabilities of its bigrams
P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)
P(w1, ..., wn) ≈ ∏ k=1..n P(wk | wk-1)
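To make the product-of-bigrams computation concrete, here is a minimal Python sketch (not part of the original slides); the bigram_prob table and the sequence_probability helper are hypothetical, with made-up probability values.

```python
# Minimal sketch: score a word sequence under a bigram model as a product
# of conditional probabilities P(wk | wk-1). The values below are invented
# for illustration only; a real model would estimate them from corpus counts.
bigram_prob = {
    ("<start>", "the"): 0.16,
    ("the", "mythical"): 0.0001,
    ("mythical", "unicorn"): 0.02,
}

def sequence_probability(words, probs):
    """Multiply P(wk | wk-1) over the sequence, starting from <start>."""
    p = 1.0
    prev = "<start>"
    for w in words:
        p *= probs.get((prev, w), 0.0)  # unseen bigrams contribute probability 0
        prev = w
    return p

print(sequence_probability(["the", "mythical", "unicorn"], bigram_prob))
```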
The n-gram Approximation
Assume each word depends only on the previous (n-1) words (n words
total)
n-grams, continued
Unigram probabilities (1-gram)
http://www.wordcount.org/main.php
Because a unigram model ignores the preceding context, the most likely next word is always the most frequent word (“the”), and the least likely is a rare word such as “conquistador”.
N-grams for Language Generation
C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October 1948.
Unigram:
5. …Here words are chosen independently but with their appropriate frequencies.
Bigram:
6. Second-order word approximation. The word transition probabilities are correct but no
further structure is included.
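As a rough illustration of this second-order (bigram) approximation, the sketch below generates text by sampling each next word from observed transition counts; the toy corpus and the generate helper are assumptions for illustration, not Shannon's actual setup.

```python
import random
from collections import Counter, defaultdict

# Minimal sketch of Shannon-style generation: estimate bigram transition
# counts from an assumed toy corpus, then sample each next word conditioned
# on the previous word.
corpus = "the cat sat on the mat the cat ate the fish".split()  # assumed toy data

transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length - 1):
        counts = transitions.get(word)
        if not counts:                      # dead end: no observed successor
            break
        words, weights = zip(*counts.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))
```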
N-Gram Models of Language
Counting Words in Corpora
What is a word?
e.g., are cat and cats the same word?
September and Sept?
zero and oh?
Is _ a word? What about *, or ( ?
How many words are there in don't? In gonna?
In Japanese and Chinese text -- how do we identify a word?
Terminology
Corpora
Simple N-Grams
Training and Testing
A Simple Example
A Bigram Grammar Fragment from BERP
<start> I .25 Want some .04
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 = .000080
vs. P(I want to eat Chinese food) = .00015
Probabilities seem to capture "syntactic" facts and "world knowledge":
eat is often followed by an NP
British food is not too popular
N-gram models can be trained by counting and normalization
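A minimal sketch of what training by counting and normalization can look like; the toy sentences and the helper p below are illustrative assumptions, not the BERP data.

```python
from collections import Counter, defaultdict

# Minimal sketch of "counting and normalization": maximum-likelihood bigram
# estimates P(w2 | w1) = count(w1, w2) / count(w1), from an assumed toy corpus.
sentences = [
    "<start> i want to eat british food".split(),
    "<start> i want to eat chinese food".split(),
]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()
for sent in sentences:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[w1][w2] += 1
        unigram_counts[w1] += 1

def p(w2, w1):
    """MLE bigram probability; 0.0 for unseen histories or continuations."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[w1][w2] / unigram_counts[w1]

print(p("want", "i"))       # 1.0 in this toy corpus
print(p("british", "eat"))  # 0.5
```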
BERP Bigram Counts
        I    Want  To   Eat  Chinese  Food  Lunch
I       8    1087  0    13   0        0     0
Want    3    0     786  0    6        8     6
To      3    0     10   860  3        0     12
Eat     0    0     2    0    19       2     52
Chinese 2    0     0    0    0        120   1
Food    19   0     17   0    0        0     0
Lunch   4    0     0    0    0        1     0
BERP Bigram Probabilities
Normalization: divide each row's counts by the appropriate unigram count for wn-1:
P(wn | wn-1) = freq(wn-1, wn) / freq(wn-1)
(columns as in the count table: I, Want, To, Eat, Chinese, Food, Lunch)
What do we learn about the language?
P(I | I) = .0023       "I I I I want"
P(I | want) = .0025    "I want I want"
P(I | food) = .013     "the kind of food I want is ..."
Approximating Shakespeare
Trigrams
Sweet prince, Falstaff shall die.
This shall forbid it should be branded, if renown made it
empty.
Quadrigrams
What! I will go seek the traitor Gloucester.
Will you not tell me who I am?
There are 884,647 tokens, with 29,066 word form types, in about a one million word Shakespeare corpus.
Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table).
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
N-Gram Training Sensitivity
Some Useful Empirical Observations
Unknown words
Smoothing Techniques
Smoothing: None
P(z | xy) = C(xyz) / Σw C(xyw) = C(xyz) / C(xy)
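A small sketch of this unsmoothed (maximum-likelihood) estimate with assumed toy counts, showing why it needs smoothing: any trigram unseen in training gets probability exactly zero.

```python
from collections import Counter

# Unsmoothed estimate P(z | x y) = C(x y z) / C(x y).
# The counts here are assumed toy values, not real corpus figures.
trigram_counts = Counter({("i", "want", "to"): 5, ("i", "want", "some"): 1})
bigram_counts = Counter({("i", "want"): 6})

def p_mle(z, x, y):
    if bigram_counts[(x, y)] == 0:
        return 0.0                      # history never seen: no data at all
    return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

print(p_mle("to", "i", "want"))     # 5/6
print(p_mle("food", "i", "want"))   # 0.0 -> the sparse-data problem
```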
Add-one Smoothing
For unigrams:
Add 1 to every word (type) count
Normalize by N (tokens) /(N (tokens) +V (types))
Smoothed count (adjusted for additions to N) is:
ci* = (ci + 1) N / (N + V)
For bigrams:
Add 1 to every bigram c(wn-1 wn) + 1
Increment each unigram count by the vocabulary size: c(wn-1) + V
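A minimal sketch of the resulting add-one bigram estimate P(wn | wn-1) = (c(wn-1 wn) + 1) / (c(wn-1) + V); only the 786 count comes from these slides, while the unigram count and vocabulary size V are assumed values for illustration.

```python
from collections import Counter

# Add-one (Laplace) smoothing for bigrams:
# P(wn | wn-1) = (c(wn-1 wn) + 1) / (c(wn-1) + V).
bigram_counts = Counter({("want", "to"): 786})   # count taken from the slides
unigram_counts = Counter({"want": 1215})         # assumed value for illustration
V = 1616                                         # assumed vocabulary size (types)

def p_add_one(w2, w1):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_add_one("to", "want"))     # smoothed probability of a seen bigram
print(p_add_one("food", "want"))   # an unseen bigram now gets a small non-zero mass
```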
Effect on BERP bigram counts
Add-one bigram probabilities
The problem
Discount: ratio of new counts to old (e.g. add-one smoothing changes the BERP bigram (to|want) from 786 to 331 (dc = .42; see the arithmetic check below) and p(to|want) from .65 to .28)
But this changes counts drastically:
– too much weight given to unseen n-grams
– in practice, unsmoothed bigrams often work better!
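A quick check on the discount arithmetic quoted above: dc = 331/786 ≈ .42, so after add-one smoothing each observed (want, to) count keeps well under half of its original mass; the rest of the probability is redistributed to unseen bigrams.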
Smoothing
Add-one smoothing: P(z | xy) = (C(xyz) + 1) / (C(xy) + V)
Works very badly.