
Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques: linear interpolation, discounting methods

The Language Modeling Problem

We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, ...}

We have an (infinite) set of strings, V† (the set of all sequences of words drawn from V, ending in STOP):

the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

We have a training sample of example sentences in English. We need to learn a probability distribution p, i.e., a function p that satisfies

Σ_{x ∈ V†} p(x) = 1,    p(x) ≥ 0 for all x ∈ V†

For example, we might hope to learn:

p(the STOP) = 10^{-12}
p(the fan STOP) = 10^{-8}
p(the fan saw Beckham STOP) = 2 × 10^{-8}
p(the fan saw saw STOP) = 10^{-15}
p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^{-9}

Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.) The estimation techniques developed for this problem will be VERY useful for other problems in NLP.

A Naive Method

We have N training sentences. For any sentence x1 ... xn, c(x1 ... xn) is the number of times the sentence is seen in our training data. A naive estimate:

p(x1 ... xn) = c(x1 ... xn) / N
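As a minimal sketch, the naive estimate can be implemented directly; the toy corpus below is made up for illustration:

```python
from collections import Counter

def naive_lm(training_sentences):
    """Naive language model: p(sentence) = c(sentence) / N, where
    c counts exact occurrences of the sentence in the training data."""
    counts = Counter(tuple(s) for s in training_sentences)
    n = len(training_sentences)
    return lambda sentence: counts[tuple(sentence)] / n

corpus = [["the", "dog", "barks", "STOP"],
          ["the", "dog", "barks", "STOP"],
          ["the", "cat", "sleeps", "STOP"],
          ["a", "man", "walks", "STOP"]]
p = naive_lm(corpus)
print(p(["the", "dog", "barks", "STOP"]))   # 0.5
print(p(["the", "dog", "sleeps", "STOP"]))  # 0.0
```

This exposes the obvious flaw that motivates the rest of the lecture: any sentence not seen verbatim in training data gets probability zero.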

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques: linear interpolation, discounting methods

Markov Processes

Consider a sequence of random variables X1, X2, ..., Xn. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100). Our goal: model

P(X1 = x1, X2 = x2, ..., Xn = xn)

First-order Markov processes:

P(X1 = x1, X2 = x2, ..., Xn = xn)
  = P(X1 = x1) ∏_{i=2}^{n} P(Xi = xi | X1 = x1, ..., X_{i-1} = x_{i-1})    (chain rule)
  = P(X1 = x1) ∏_{i=2}^{n} P(Xi = xi | X_{i-1} = x_{i-1})    (Markov assumption)

The first-order Markov assumption: for any i ∈ {2 ... n}, for any x1 ... xi,

P(Xi = xi | X1 = x1, ..., X_{i-1} = x_{i-1}) = P(Xi = xi | X_{i-1} = x_{i-1})

Second-order Markov processes:

P(X1 = x1, X2 = x2, ..., Xn = xn)
  = P(X1 = x1) P(X2 = x2 | X1 = x1) ∏_{i=3}^{n} P(Xi = xi | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
  = ∏_{i=1}^{n} P(Xi = xi | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

(For convenience we assume x0 = x_{-1} = *, where * is a special "start" symbol.)

We would like the length of the sequence, n, to also be a random variable. A simple solution: always define Xn = STOP, where STOP is a special symbol. Then use a Markov process as before:

P(X1 = x1, X2 = x2, ..., Xn = xn) = ∏_{i=1}^{n} P(Xi = xi | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

Trigram Language Models

A trigram language model consists of:

1. A finite set V
2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}.

For any sentence x1 ... xn where xi ∈ V for i = 1 ... (n − 1), and xn = STOP, the probability of the sentence under the trigram language model is

p(x1 ... xn) = ∏_{i=1}^{n} q(xi | x_{i-2}, x_{i-1})

where we define x0 = x_{-1} = *.

An Example

For the sentence "the dog barks STOP" we would have

p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
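The product above can be computed with a simple loop. The dictionary interface for q and the toy parameter values below are assumptions made for illustration:

```python
def sentence_prob(q, sentence):
    """p(x1..xn) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_{-1} = x_0 = *.
    `q` maps (u, v, w) trigrams to probabilities; unseen trigrams get 0."""
    prob = 1.0
    u, v = "*", "*"
    for w in sentence:  # sentence is a list of tokens ending in STOP
        prob *= q.get((u, v, w), 0.0)
        u, v = v, w
    return prob

# toy parameters for "the dog barks STOP" (made-up numbers)
q = {("*", "*", "the"): 0.5,
     ("*", "the", "dog"): 0.4,
     ("the", "dog", "barks"): 0.3,
     ("dog", "barks", "STOP"): 1.0}
print(sentence_prob(q, ["the", "dog", "barks", "STOP"]))  # ≈ 0.06 = 0.5 × 0.4 × 0.3 × 1.0
```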

Remaining estimation problem: q(wi | w_{i-2}, w_{i-1}). For example: q(laughs | the, dog).

A natural estimate (the maximum-likelihood estimate):

q(wi | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, wi) / Count(w_{i-2}, w_{i-1})

q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
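A minimal sketch of computing these maximum-likelihood estimates from a corpus, assuming sentences are lists of tokens ending in STOP:

```python
from collections import Counter

def ml_trigram_estimates(sentences):
    """Maximum-likelihood q(w | u, v) = Count(u, v, w) / Count(u, v),
    padding each sentence with x_{-1} = x_0 = *."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        padded = ["*", "*"] + list(s)  # s ends with STOP
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1   # count of the history bigram, as in the denominator
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

corpus = [["the", "dog", "barks", "STOP"],
          ["the", "dog", "sleeps", "STOP"]]
q = ml_trigram_estimates(corpus)
print(q("the", "dog", "barks"))  # Count(the,dog,barks)/Count(the,dog) = 1/2 = 0.5
```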

Say our vocabulary size is N = |V|; then there are N³ parameters in the model. E.g., N = 20,000 gives 20,000³ = 8 × 10^{12} parameters.

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques: linear interpolation, discounting methods

Evaluating a Language Model: Perplexity

We have some test data, m sentences: s1, s2, s3, ..., sm.

We could look at the probability under our model, ∏_{i=1}^{m} p(si), or more conveniently, the log probability:

log ∏_{i=1}^{m} p(si) = Σ_{i=1}^{m} log p(si)

In fact, the usual evaluation measure is perplexity:

Perplexity = 2^{-l}    where    l = (1/M) Σ_{i=1}^{m} log2 p(si)

and M is the total number of words in the test data.

Some intuition about perplexity: say we have a vocabulary V, let N = |V| + 1, and consider a model that predicts

q(w | u, v) = 1/N

for all w ∈ V ∪ {STOP}, and all u, v ∈ V ∪ {*}. It is easy to calculate the perplexity in this case:

Perplexity = 2^{-l}    where    l = log2 (1/N)

so Perplexity = N. Perplexity is a measure of the effective "branching factor" of the model.
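A small sketch of the perplexity computation, checked against the uniform-model case above (the value of N and the sentence lengths are made up):

```python
import math

def perplexity(log2_probs, total_words):
    """Perplexity = 2^{-l}, where l = (1/M) * sum of log2 p(s_i)
    over the test sentences and M is the total word count."""
    l = sum(log2_probs) / total_words
    return 2.0 ** (-l)

# Under the uniform model every word gets probability 1/N, so a sentence
# of k words has log2-probability k * log2(1/N).
N = 1000
sentence_lengths = [3, 5, 7]          # word counts, including STOP
log2_probs = [k * math.log2(1.0 / N) for k in sentence_lengths]
M = sum(sentence_lengths)
print(perplexity(log2_probs, M))      # N = 1000, as the slides predict
```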

Results from Goodman ("A bit of progress in language modeling"), where |V| = 50,000:

- A trigram model: p(x1 ... xn) = ∏_{i=1}^{n} q(xi | x_{i-2}, x_{i-1}). Perplexity = 74.
- A bigram model: p(x1 ... xn) = ∏_{i=1}^{n} q(xi | x_{i-1}). Perplexity = 137.
- A unigram model: p(x1 ... xn) = ∏_{i=1}^{n} q(xi). Perplexity = 955.

Some History

Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?

C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, 1951.

Some History

Chomsky (in Syntactic Structures (1957)):

Second, the notion "grammatical" cannot be identified with "meaningful" or "significant" in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

... Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English." It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is grammatical, while (2) is not. ...

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques: linear interpolation, discounting methods

A natural estimate (the maximum-likelihood estimate):

q(wi | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, wi) / Count(w_{i-2}, w_{i-1})

q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

Say our vocabulary size is N = |V|; then there are N³ parameters in the model. E.g., N = 20,000 gives 20,000³ = 8 × 10^{12} parameters: most trigrams will never be seen in training data, so the maximum-likelihood estimates need to be smoothed.

The trigram maximum-likelihood estimate:

qML(wi | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, wi) / Count(w_{i-2}, w_{i-1})

The bigram and unigram maximum-likelihood estimates are defined analogously:

qML(wi | w_{i-1}) = Count(w_{i-1}, wi) / Count(w_{i-1}),    qML(wi) = Count(wi) / (total number of words seen)

Linear Interpolation

Take our estimate q(wi | w_{i-2}, w_{i-1}) to be

q(wi | w_{i-2}, w_{i-1}) = λ1 qML(wi | w_{i-2}, w_{i-1}) + λ2 qML(wi | w_{i-1}) + λ3 qML(wi)

where λ1 + λ2 + λ3 = 1, and λi ≥ 0 for all i.
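A sketch of the interpolated estimate; the dictionary-based qML tables and the toy values below are assumptions for illustration:

```python
def interpolated_q(q_tri, q_bi, q_uni, lambdas):
    """q(w | u, v) = l1*qML(w|u,v) + l2*qML(w|v) + l3*qML(w),
    with l1 + l2 + l3 = 1 and each l_i >= 0."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return lambda u, v, w: (l1 * q_tri.get((u, v, w), 0.0)
                            + l2 * q_bi.get((v, w), 0.0)
                            + l3 * q_uni.get(w, 0.0))

# toy ML estimates (made-up numbers, not from real counts)
q_tri = {("the", "dog", "laughs"): 0.0}
q_bi = {("dog", "laughs"): 0.2}
q_uni = {"laughs": 0.05}
q = interpolated_q(q_tri, q_bi, q_uni, (1/3, 1/3, 1/3))
print(q("the", "dog", "laughs"))  # (0 + 0.2 + 0.05) / 3 ≈ 0.0833
```

Note that the unseen trigram still gets nonzero probability because the bigram and unigram estimates back it up.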

Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

Σ_{w ∈ V′} q(w | u, v)
  = Σ_{w ∈ V′} [λ1 qML(w | u, v) + λ2 qML(w | v) + λ3 qML(w)]
  = λ1 Σ_{w ∈ V′} qML(w | u, v) + λ2 Σ_{w ∈ V′} qML(w | v) + λ3 Σ_{w ∈ V′} qML(w)
  = λ1 + λ2 + λ3
  = 1

(We can also show that q(w | u, v) ≥ 0 for all w ∈ V′.)

How do we estimate the λ values? Hold out part of the training set as "validation" data. Define c′(w1, w2, w3) to be the number of times the trigram (w1, w2, w3) is seen in the validation set. Choose λ1, λ2, λ3 to maximize

L(λ1, λ2, λ3) = Σ_{w1,w2,w3} c′(w1, w2, w3) log q(w3 | w1, w2)

such that λ1 + λ2 + λ3 = 1, and λi ≥ 0 for all i, and where

q(wi | w_{i-2}, w_{i-1}) = λ1 qML(wi | w_{i-2}, w_{i-1}) + λ2 qML(wi | w_{i-1}) + λ3 qML(wi)
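The slides do not specify how to carry out this maximization; one crude but simple option is a grid search over the simplex. The helper below and its toy inputs are illustrative, not the course's method:

```python
import math

def choose_lambdas(validation_trigrams, q_tri, q_bi, q_uni, step=0.1):
    """Pick (l1, l2, l3) on a grid over the simplex to maximize
    L = sum_{(w1,w2,w3)} c'(w1,w2,w3) * log q(w3 | w1, w2).
    `validation_trigrams` maps (w1, w2, w3) -> validation count c'."""
    best, best_l = -math.inf, None
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            l1, l2 = i * step, j * step
            l3 = 1.0 - l1 - l2
            total = 0.0
            for (w1, w2, w3), c in validation_trigrams.items():
                q = (l1 * q_tri.get((w1, w2, w3), 0.0)
                     + l2 * q_bi.get((w2, w3), 0.0)
                     + l3 * q_uni.get(w3, 0.0))
                if q <= 0.0:          # log undefined; reject this grid point
                    total = -math.inf
                    break
                total += c * math.log(q)
            if total > best:
                best, best_l = total, (l1, l2, l3)
    return best_l

# If only the unigram estimate covers the validation trigram, the search
# should push all weight onto l3.
lams = choose_lambdas({("the", "dog", "laughs"): 3}, {}, {}, {"laughs": 0.05})
print(lams)  # (0.0, 0.0, 1.0)
```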

Allowing the λs to vary: take a function Π(w_{i-2}, w_{i-1}) that partitions histories, e.g.,

Π(w_{i-2}, w_{i-1}) = 1   if Count(w_{i-2}, w_{i-1}) = 0
                    = 2   if 1 ≤ Count(w_{i-2}, w_{i-1}) ≤ 2
                    = 3   if 3 ≤ Count(w_{i-2}, w_{i-1}) ≤ 5
                    = 4   otherwise

Introduce a dependence of the λs on the partition:

q(wi | w_{i-2}, w_{i-1}) = λ1^{Π(w_{i-2},w_{i-1})} qML(wi | w_{i-2}, w_{i-1}) + λ2^{Π(w_{i-2},w_{i-1})} qML(wi | w_{i-1}) + λ3^{Π(w_{i-2},w_{i-1})} qML(wi)

where λ1^{Π(w_{i-2},w_{i-1})} + λ2^{Π(w_{i-2},w_{i-1})} + λ3^{Π(w_{i-2},w_{i-1})} = 1, and λi^{Π(w_{i-2},w_{i-1})} ≥ 0 for all i.

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques: linear interpolation, discounting methods

Discounting Methods

Say we've seen the following counts:

x                Count(x)    qML(wi | w_{i-1})
the              48
the, dog         15          15/48
the, woman       11          11/48
the, man         10          10/48
the, park         5           5/48
the, job          2           2/48
the, telescope    1           1/48
the, manual       1           1/48
the, afternoon    1           1/48
the, country      1           1/48
the, street       1           1/48

The maximum-likelihood estimates are high (particularly for low-count items).

Discounting Methods

Now define "discounted" counts, Count*(x) = Count(x) − 0.5. New estimates:

x                Count(x)    Count*(x)    Count*(x)/Count(the)
the              48
the, dog         15          14.5         14.5/48
the, woman       11          10.5         10.5/48
the, man         10           9.5          9.5/48
the, park         5           4.5          4.5/48
the, job          2           1.5          1.5/48
the, telescope    1           0.5          0.5/48
the, manual       1           0.5          0.5/48
the, afternoon    1           0.5          0.5/48
the, country      1           0.5          0.5/48
the, street       1           0.5          0.5/48

We now have some "missing probability mass":

α(w_{i-1}) = 1 − Σ_w Count*(w_{i-1}, w) / Count(w_{i-1})

E.g., in our example, α(the) = 10 × 0.5 / 48.
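A sketch of the discounted-count bookkeeping, using the bigram counts from the table above:

```python
def missing_mass(bigram_counts, history, discount=0.5):
    """alpha(w_{i-1}) = 1 - sum_w Count*(w_{i-1}, w) / Count(w_{i-1}),
    with discounted counts Count*(x) = Count(x) - discount."""
    followers = {w: c for (u, w), c in bigram_counts.items() if u == history}
    total = sum(followers.values())            # Count(w_{i-1})
    discounted = sum(c - discount for c in followers.values())
    return 1.0 - discounted / total

# the slides' example: 10 distinct words follow "the", 48 tokens in total
counts = {("the", w): c for w, c in
          [("dog", 15), ("woman", 11), ("man", 10), ("park", 5), ("job", 2),
           ("telescope", 1), ("manual", 1), ("afternoon", 1), ("country", 1),
           ("street", 1)]}
print(missing_mass(counts, "the"))  # 10 * 0.5 / 48 = 5/48 ≈ 0.10417
```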

Katz Back-Off Models

For a bigram model, define two sets:

A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

A bigram model:

qBO(wi | w_{i-1}) = Count*(w_{i-1}, wi) / Count(w_{i-1})                    if wi ∈ A(w_{i-1})
                  = α(w_{i-1}) qML(wi) / Σ_{w ∈ B(w_{i-1})} qML(w)          if wi ∈ B(w_{i-1})

where

α(w_{i-1}) = 1 − Σ_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})

For a trigram model, first define two sets:

A(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) > 0}
B(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) = 0}

A trigram model is defined in terms of the bigram model:

qBO(wi | w_{i-2}, w_{i-1}) = Count*(w_{i-2}, w_{i-1}, wi) / Count(w_{i-2}, w_{i-1})                                       if wi ∈ A(w_{i-2}, w_{i-1})
                           = α(w_{i-2}, w_{i-1}) qBO(wi | w_{i-1}) / Σ_{w ∈ B(w_{i-2},w_{i-1})} qBO(w | w_{i-1})          if wi ∈ B(w_{i-2}, w_{i-1})

where

α(w_{i-2}, w_{i-1}) = 1 − Σ_{w ∈ A(w_{i-2},w_{i-1})} Count*(w_{i-2}, w_{i-1}, w) / Count(w_{i-2}, w_{i-1})
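A sketch of the back-off bigram model qBO, assuming plain dictionaries of bigram counts and unigram estimates (the numbers are toy values):

```python
def katz_bigram(bigram_counts, unigram_q, discount=0.5):
    """Katz back-off bigram model: seen bigrams get discounted ML mass;
    unseen ones split the missing mass alpha(w_{i-1}) in proportion
    to their unigram estimates qML(w)."""
    def q_bo(word, history):
        followers = {w: c for (u, w), c in bigram_counts.items() if u == history}
        total = sum(followers.values())              # Count(w_{i-1})
        if word in followers:                        # word in A(history)
            return (followers[word] - discount) / total
        alpha = len(followers) * discount / total    # missing probability mass
        unseen_mass = sum(q for w, q in unigram_q.items() if w not in followers)
        return alpha * unigram_q[word] / unseen_mass
    return q_bo

counts = {("the", "dog"): 15, ("the", "cat"): 5}
unigram_q = {"dog": 0.4, "cat": 0.3, "fish": 0.2, "bird": 0.1}
q = katz_bigram(counts, unigram_q)
print(q("dog", "the"))   # (15 - 0.5) / 20 = 0.725
print(q("fish", "the"))  # alpha(the) = 0.05, times 0.2 / (0.2 + 0.1) ≈ 0.0333
```

In this example the probabilities over the whole vocabulary sum to one: the discounted mass for seen words (0.95) plus α(the) = 0.05 redistributed over the unseen words.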

Summary

Three steps in deriving the language model probabilities:

1. Expand p(w1, w2, ..., wn) using the chain rule.
2. Make Markov independence assumptions: p(wi | w1, w2, ..., w_{i-2}, w_{i-1}) = p(wi | w_{i-2}, w_{i-1})
3. Smooth the estimates using low-order counts.

Further possibilities: topic or long-range features; syntactic models.
