You are on page 1of 57

# Language Modeling

## Michael Collins, Columbia University

Overview

The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques:
Linear interpolation Discounting methods

## The Language Modeling Problem

We have some (nite) vocabulary, say V = {the, a, man, telescope, Beckham, two, . . .} We have an (innite) set of strings, V the STOP a STOP the fan STOP the fan saw Beckham STOP the fan saw saw STOP the fan saw Beckham play for Real Madrid STOP

## The Language Modeling Problem (Continued)

We have a training sample of example sentences in English

## The Language Modeling Problem (Continued)

We have a training sample of example sentences in English We need to learn a probability distribution p i.e., p is a function that satises p(x) = 1,
xV

## The Language Modeling Problem (Continued)

We have a training sample of example sentences in English We need to learn a probability distribution p i.e., p is a function that satises p(x) = 1,
xV

## p(the p(the p(the p(the ... p(the ...

STOP) = 1012 fan STOP) = 108 fan saw Beckham STOP) = 2 108 fan saw saw STOP) = 1015 fan saw Beckham play for Real Madrid STOP) = 2 109

## Why on earth would we want to do this?!

Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)

## Why on earth would we want to do this?!

Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.) The estimation techniques developed for this problem will be VERY useful for other problems in NLP

A Naive Method

We have N training sentences For any sentence x1 . . . xn , c(x1 . . . xn ) is the number of times the sentence is seen in our training data A naive estimate: p(x1 . . . xn ) = c(x1 . . . xn ) N

Overview

The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques:
Linear interpolation Discounting methods

Markov Processes

Consider a sequence of random variables X1 , X2 , . . . Xn . Each random variable can take any value in a nite set V . For now we assume the length n is xed (e.g., n = 100). Our goal: model P (X1 = x1 , X2 = x2 , . . . , Xn = xn )

## First-Order Markov Processes

P (X1 = x1 , X2 = x2 , . . . Xn = xn )

## First-Order Markov Processes

P (X1 = x1 , X2 = x2 , . . . Xn = xn )
n

= P ( X 1 = x1 )
i=2

## First-Order Markov Processes

P (X1 = x1 , X2 = x2 , . . . Xn = xn )
n

= P ( X 1 = x1 )
i=2 n

i=2

= P ( X 1 = x1 )

## First-Order Markov Processes

P (X1 = x1 , X2 = x2 , . . . Xn = xn )
n

= P ( X 1 = x1 )
i=2 n

## P (Xi = xi |X1 = x1 , . . . , Xi1 = xi1 ) P (Xi = xi |Xi1 = xi1 )

i=2

= P ( X 1 = x1 )

The rst-order Markov assumption: For any i {2 . . . n}, for any x1 . . . xi , P (Xi = xi |X1 = x1 . . . Xi1 = xi1 ) = P (Xi = xi |Xi1 = xi1 )

## Second-Order Markov Processes

P (X1 = x1 , X2 = x2 , . . . Xn = xn )

n

i=3

## P (X1 = x1 , X2 = x2 , . . . Xn = xn ) = P (X1 = x1 ) P (X2 = x2 |X1 = x1 )

n

i=3 n

P (Xi = xi |Xi2 = xi2 , Xi1 = xi1 ) P (Xi = xi |Xi2 = xi2 , Xi1 = xi1 )

=
i=1

## Modeling Variable Length Sequences

We would like the length of the sequence, n, to also be a random variable A simple solution: always dene Xn = STOP where STOP is a special symbol

## Modeling Variable Length Sequences

We would like the length of the sequence, n, to also be a random variable A simple solution: always dene Xn = STOP where STOP is a special symbol Then use a Markov process as before:

P (X1 = x1 , X2 = x2 , . . . Xn = xn )
n

=
i=1

## Trigram Language Models

A trigram language model consists of:
1. A nite set V 2. A parameter q (w|u, v ) for each trigram u, v, w such that w V {STOP}, and u, v V {*}.

## Trigram Language Models

A trigram language model consists of:
1. A nite set V 2. A parameter q (w|u, v ) for each trigram u, v, w such that w V {STOP}, and u, v V {*}.

For any sentence x1 . . . xn where xi V for i = 1 . . . (n 1), and xn = STOP, the probability of the sentence under the trigram language model is
n

p(x1 . . . xn ) =
i=1

## q (xi |xi2 , xi1 )

where we dene x0 = x1 = *.

An Example

For the sentence the dog barks STOP we would have p(the dog barks STOP) = q (the|*, *) q (dog|*, the) q (barks|the, dog) q (STOP|dog, barks)

## The Trigram Estimation Problem

Remaining estimation problem: q (wi | wi2 , wi1 ) For example: q (laughs | the, dog)

## The Trigram Estimation Problem

Remaining estimation problem: q (wi | wi2 , wi1 ) For example: q (laughs | the, dog) A natural estimate (the maximum likelihood estimate): q (wi | wi2 , wi1 ) = Count(wi2 , wi1 , wi ) Count(wi2 , wi1 ) Count(the, dog, laughs) Count(the, dog)

## Sparse Data Problems

A natural estimate (the maximum likelihood estimate): q (wi | wi2 , wi1 ) = q (laughs | the, dog) = Count(wi2 , wi1 , wi ) Count(wi2 , wi1 ) Count(the, dog, laughs) Count(the, dog)

Say our vocabulary size is N = |V|, then there are N 3 parameters in the model. e.g., N = 20, 000 20, 0003 = 8 1012 parameters

Overview

The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques:
Linear interpolation Discounting methods

## Evaluating a Language Model: Perplexity

We have some test data, m sentences s1 , s2 , s3 , . . . , sm

## Evaluating a Language Model: Perplexity

We have some test data, m sentences s1 , s2 , s3 , . . . , sm We could look at the probability under our model m i=1 p(si ). Or more conveniently, the log probability
m m

log
i=1

p(si ) =
i=1

log p(si )

## Evaluating a Language Model: Perplexity

We have some test data, m sentences s1 , s2 , s3 , . . . , sm We could look at the probability under our model m i=1 p(si ). Or more conveniently, the log probability
m m

log
i=1

p(si ) =
i=1

log p(si )

l

where

1 l= M

log p(si )
i=1

## Some Intuition about Perplexity

Say we have a vocabulary V , and N = |V| + 1 and model that predicts q (w|u, v ) = 1 N

for all w V {STOP}, for all u, v V {*}. Easy to calculate the perplexity in this case: Perplexity = 2l Perplexity = N Perplexity is a measure of eective branching factor where l = log 1 N

## Typical Values of Perplexity

Results from Goodman (A bit of progress in language modeling), where |V| = 50, 000 A trigram model: p(x1 . . . xn ) = Perplexity = 74
n i=1

## Typical Values of Perplexity

Results from Goodman (A bit of progress in language modeling), where |V| = 50, 000 A trigram model: p(x1 . . . xn ) = Perplexity = 74 A bigram model: p(x1 . . . xn ) = Perplexity = 137
n i=1

n i=1

q (xi |xi1 ).

## Typical Values of Perplexity

Results from Goodman (A bit of progress in language modeling), where |V| = 50, 000 A trigram model: p(x1 . . . xn ) = Perplexity = 74 A bigram model: p(x1 . . . xn ) = Perplexity = 137 A unigram model: p(x1 . . . xn ) = Perplexity = 955
n i=1

## q (xi |xi2 , xi1 ).

n i=1

q (xi |xi1 ).

n i=1

q (xi ).

Some History

Shannon conducted experiments on entropy of English i.e., how good are people at the perplexity game? C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:5064, 1951.

Some History
Chomsky (in Syntactic Structures (1957)):
Second, the notion grammatical cannot be identied with meaningful or signicant in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical. (1) Colorless green ideas sleep furiously. (2) Furiously sleep ideas green colorless. ... . . . Third, the notion grammatical in English cannot be identied in any way with the notion high order of statistical approximation to English. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . .

Overview

The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques:
Linear interpolation Discounting methods

## Sparse Data Problems

A natural estimate (the maximum likelihood estimate): q (wi | wi2 , wi1 ) = q (laughs | the, dog) = Count(wi2 , wi1 , wi ) Count(wi2 , wi1 ) Count(the, dog, laughs) Count(the, dog)

Say our vocabulary size is N = |V|, then there are N 3 parameters in the model. e.g., N = 20, 000 20, 0003 = 8 1012 parameters

## The Bias-Variance Trade-O

Trigram maximum-likelihood estimate qML (wi | wi2 , wi1 ) = Count(wi2 , wi1 , wi ) Count(wi2 , wi1 )

## Unigram maximum-likelihood estimate qML (wi ) = Count(wi ) Count()

Linear Interpolation
Take our estimate q (wi | wi2 , wi1 ) to be q (wi | wi2 , wi1 ) = 1 qML (wi | wi2 , wi1 ) +2 qML (wi | wi1 ) +3 qML (wi )

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

q (w | u, v )

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

wV

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

w

= = 1

wV

qML (w | v ) + 3

qML (w)

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

w

= = 1

wV

qML (w | v ) + 3

qML (w)

= 1 + 2 + 3

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

w

= = 1

wV

qML (w | v ) + 3

qML (w)

= 1 + 2 + 3 =1

## Linear Interpolation (continued)

Our estimate correctly denes a distribution (dene V = V {STOP}):
wV

w

= = 1

wV

qML (w | v ) + 3

qML (w)

= 1 + 2 + 3 =1

## How to estimate the values?

Hold out part of training set as validation data

## How to estimate the values?

Hold out part of training set as validation data Dene c (w1 , w2 , w3 ) to be the number of times the trigram (w1 , w2 , w3 ) is seen in validation set

## How to estimate the values?

Hold out part of training set as validation data Dene c (w1 , w2 , w3 ) to be the number of times the trigram (w1 , w2 , w3 ) is seen in validation set Choose 1 , 2 , 3 to maximize: L(1 , 2 , 3 ) =
w1 ,w2 ,w3

## c (w1 , w2 , w3 ) log q (w3 | w1 , w2 )

such that 1 + 2 + 3 = 1, and i 0 for all i, and where q (wi | wi2 , wi1 ) = 1 qML (wi | wi2 , wi1 ) +2 qML (wi | wi1 ) +3 qML (wi )

## Allowing the s to vary

Take a function that e.g., 1 2 (wi2 , wi1 ) = 3 4 partitions histories If Count(wi1 , wi2 ) = 0 If 1 Count(wi1 , wi2 ) 2 If 3 Count(wi1 , wi2 ) 5 Otherwise

Introduce a dependence of the s on the partition: q (wi | wi2 , wi1 ) = 1 i2 i1 qML (wi | wi2 , wi1 ) (w ,w ) +2 i2 i1 qML (wi | wi1 ) (w ,w ) +3 i2 i1 qML (wi )
(w ,wi1 ) (w ,w )

## where 1 i2 i1 + 2 i2 (w ,w ) and i i2 i1 0 for all i.

(w

,w

+ 3

(wi2 ,wi1 )

= 1,

Overview

The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques:
Linear interpolation Discounting methods

Discounting Methods
Say weve seen the following counts: x Count(x) qML (wi | wi1 ) the 48 the, dog 15 the, woman 11 the, man 10 the, park 5 the, job 2 the, telescope 1 the, manual 1 the, afternoon 1 the, country 1 the, street 1 The maximum-likelihood estimates are (particularly for low count items) 15/48 11/48 10/48 5/48 2/48 1/48 1/48 1/48 1/48 1/48 high

Discounting Methods
Now dene discounted counts, Count (x) = Count(x) 0.5 New estimates:
x the the, the, the, the, the, the, the, the, the, the, dog woman man park job telescope manual afternoon country street Count(x) 48 15 11 10 5 2 1 1 1 1 1 14.5 10.5 9.5 4.5 1.5 0.5 0.5 0.5 0.5 0.5 14.5/48 10.5/48 9.5/48 4.5/48 1.5/48 0.5/48 0.5/48 0.5/48 0.5/48 0.5/48 Count (x) Count (x) Count(the)

## Discounting Methods (Continued)

We now have some missing probability mass: (wi1 ) = 1
w

## Katz Back-O Models (Bigrams)

For a bigram model, dene two sets A(wi1 ) = {w : Count(wi1 , w) > 0} B (wi1 ) = {w : Count(wi1 , w) = 0} A bigram model Count (wi1 ,wi ) Count(wi1 ) qBO (wi | wi1 ) = q ML (wi ) (wi1 ) q
wB(wi1 )

If wi A(wi1 ) If wi B (wi1 )

ML (w)

where (wi1 ) = 1
wA(wi1 )

## Katz Back-O Models (Trigrams)

For a trigram model, rst dene two sets A(wi2 , wi1 ) = {w : Count(wi2 , wi1 , w) > 0} B (wi2 , wi1 ) = {w : Count(wi2 , wi1 , w) = 0} A trigram model is dened in terms of the bigram model: Count (wi2 ,wi1 ,wi ) Count(wi2 ,wi1 ) If wi A(wi2 , wi1 ) qBO (wi | wi2 , wi1 ) = (wi2 ,wi1 )qBO (wi |wi1 ) wB(w ,w ) qBO (w|wi1 ) i2 i1 If wi B (wi2 , wi1 )

wA(wi2 ,wi1 )

## Count (wi2 , wi1 , w) Count(wi2 , wi1 )

Summary
Three steps in deriving the language model probabilities:
1. Expand p(w1 , w2 . . . wn ) using Chain rule. 2. Make Markov Independence Assumptions p(wi | w1 , w2 . . . wi2 , wi1 ) = p(wi | wi2 , wi1 ) 3. Smooth the estimates using low order counts.

## Other methods used to improve language models:

Topic or long-range features. Syntactic models.