You are on page 1of 25

SE @ EINE

OBJECTIVES
After reading this chapter, the student will be able to understand:
on
Part-Of-Speech tagging (POS)- Tag set for English (Penn Treebank ) ,
Rule based POS tagging, Stochastic POS tagging,
OOO ©

Issues —Multiple tags & words, Unknown words.


Introduction to CFG,
Sequencelabelling: Hidden Markov Model (HMM), Maximum Entropy, and
Conditional Random Field (CRF).

Scanned with CamScanner


‘Syntax Analysis
FSP J

such that they


lang aligntical
uagegramma
make s wit
ofwords ina Saat ee ial
Syntaxrefers to the arrangement
to asse 1cen. graramatical rues to
© sense.In NLP, syntactic analysis is used use
Computer algorithms are
~~ the grammatical rules. .
ng from them.
2 group of words and derive meani
roles playe d by differ ent words in a body of
he
Syntactic analysis helps us understandt child ren sleep
Consider Innocent peace fully
text. The words themselves are not enough.
ully. There is some evide nce that human
little vs Innocentlittle children sleep peacef
part, based onstru ctura l analy sis. Consi der“Twas brillig
understanding of languageis,in
” Here we can understand the
and the slithy toves did gyre and gimble in the wabe.
s make no senseto us. Colorless
sentence on somelevel, even though mostofthe word
green furiously colorless each word, E.g., Book/VB that/DT flight/NN /The main‘ challenge in POS taggingis )
greenideas sleepfuriously makes more sense to us than Ideas
teps of the young. In reading resolving the ambiguity in possible POS tags for a word. The word bearin the followi te
sleep Finally, consider the sentence the old dog the foots
amine the first part. In all sentences have the samespelling but different meanings. She sawa bear. Your effo
this sentence, you might have found yourself having to re-ex
a noun, when, in fact, itis will bear fruit. Bee
likelihood, you thoughtthat the word “dog” was being used as
a verb. Wordsaretraditionally grouped into equivalence classes called parts of speech, word 2
classes, morphological classes, or lexical tags. Intraditional grammarthere weregenerally-:
, Here are some syntax techniques that can be used:
only a few parts of speech (noun, verb, adjective, preposition, adverb, conjunction ;
! Lemmatization: It entails reducing the various inflected forms of a word into a single
etc.). More recent models have much larger numbers of word classes (45 for the Penn>
form for easy analysis.
Treebank ,87 for the Brown corpus, and 146 for the C7 tag set ). The partof speech fori
Morphological segmentation: It involves dividing words into individual units called a word gives a significant amountof information about the word andits neighbours. This:as
morphemes. is clearly true for major categories, (verb versus noun), butis also true for the many.finer : ae
Word segmentation: It involves dividing a large piece of continuous text into distinct distinctions. For example, these tag sets distinguish between possessive pronouns (my,*ae
units. your,his, her, its) and personal pronouns(I, you, he, me). Knowing whethera word isa ae
Part-of-speech tagging: It involves identifying the part of speech for every word. possessive pronoun ora personal pronoun cantell us whatwordsare likely to occurin.we2
Parsing; It involves undertaking grammatical analysis for the provided sentence. its vicinity (possessive pronouns arelikely to be followed by a noun, personal pronouns, 4 ju
by a verb). i
Sentence breaking: It involves placing sentence boundaries on large pieceof text.
A word's part-of-speech can tell us something about how the word is pronounced.4A.
Stemming: It involves cutting the inflected wordsto their root form.
|
noun or ann adjective are proneuncar differently (the noun is pronounced CONten

i
| system.
}

|
|
1
al

Scanned with CamScanner


informatio val (IR), since”
ye

rma nal retrie


be used In in st emming for info take,
Parts of speech
can also gical affixes it can Nouns aretraditionally grouped into proper nouns and common nouns. Proper nouns,
of speec h can help
tell us which morpholo rtant
g a word 's part nou ns OF othe r impo like Regina, Colorado, and IBM, are the namesofspecific personsor entities. In English,
helping select out
kno win
an IR app lication by
They can also help building automatic they generally aren't preceded by articles (e.g. the book is upstairs, but Regina is.
oma tic part - of-s peech taggers can help in
words from a document.
Aut
taggers are also used in advanced _ upstairs). In written English, proper nouns are usually capitalized? In many languages,
ation algorithms, an d POS very often —
word-sense disambigu Part s of spee ch are including English, common nouns are divided into count nouns and mass nouns. Count
as class base d N-grams.
ASR language models such mes or other phrasesfor nounsare thosethat allow gram- count nouns mass nouns matical enumeration; thatis,
kly finding na
ng’ texts, for examp le for quic
usedfor‘partial parsiing’ they can occur in both the singular and plural (goat/goats, relationship/relationships) and
edfor part
ction appl icat ions Final ly, corpora that have been mark
the information extra they can be counted (one goat, two goats). Mass nouns are used when something is
resear ch, for examp le to help find instances or
tic
of-speech are very useful for linguis conceptualized as a homogeneous group. So words like snow,salt, and communism are
s in large corpora.
frequencies of particular construction not counted (i.e. two snowsor two communisms). Mass nouns can also appear without
Tag set for English articles where singular count nouns cannot (Snow is white but not Goat is white). The
»

closed class types verb class includes most of the words referring to actions and processes, including main
Parts of speech can be divided into two broad super categories:
verbs like draw, provide, differ, and go. English verbs have a number of morphological
and open class types. Closed classes are those that have relatively fixed membership.
forms (non- 3rd-person-sg (eat), 3d-person-sg (eats), progressive (eating), past participle
For example, prepositions are a closed class because there is a fixed set of them in
eaten). A subclass of English verbs called auxiliaries will be discussed when weturn to
English. By contrast nouns and verbs are open classes because new nouns and verbs
are continually coined or borrowed from other languages(e.g. the new verb to fax or closed class forms. The third open class English form is adjectives; semantically this

the borrowed noun futon). It is likely that any given speaker or corpus will have different class includes many terms that describe properties or qualities. Most languages have

open class words, but all speakers of a language, and corpora that are large enough, adjectives for the concepts of color (white, black), age (old, young), and value (good,

will likely share the set of closed class words. Closed class words are generally also bad), but there are languages without adjéctives. As we discussed above, manylinguists

function words; function words are grammatical words like of, it, and, or you, which tend argue that the Chinese family of languages usesverbsto describe such English-adjectival

to be very short, occur frequently, and play an important role in grammar. There are four
notions as color and age.
major open classes that occur in the languages of the world: nouns, verbs, adjectives, The final open class form is adverbs. What coherence the class has semantically may
and adverbs. It turns out that English has nouns verbs adjectives adverbs all four of be solely that each of these words can be viewed as modifying something (often verbs,
these, although not every language does. Many languages have no adjectives. Every hence the name ‘adverb’, but also other adverbs and entire verb phrases). Directional
known human language has at least two categories noun and verb. Noun is the name adverbs or locative adverbs (home, here, downhill) specify the direction or location of
given to the lexical class in which the words for most people, places, or things some action; degree adverbs (extremely, very, somewhat) specify the extent of some
occur. But
since lexical classeslike noun are defined functionally (morphological action, process, or property; manner adverbs (slowly, slinkily, delicately) describe the
and syntactically)
sidGiveshenornene for people, places, manner of some action or process and temporal adverbs describe the time that some
and things may not be nouns,
y not be words for people, places, or things. Thus nouns action or event took place (yesterday, Monday). Because of the heterogeneous nature
include concrete termslike ship and chair, abstractions of this class, some adverbs (for example temporal adverbs like Monday) are tagged in
like band- width and relationship,

Stree nouninErle reneheeweae alwithang)


ability to occur Wh(a
determiners
some tagging schemes as nouns. The closedclasses differ from language to language
than do the open classes. .
goat,its bandwidth, Plato's Republic), to
take possessives (IBM's annual revenue),
for mostbut notall nouns, to occurin the plural and
form (goats, abaci). |
Natural Language Processin |
|
;
syntax Analysis 83}

Scanned with CamScanner


tense). whetherit.is complete
Prepositions occur before noun phrases; semantically they are relational, often / g (aspect
: ), Whetherit j
an action is necessary, possible, sugges Gated (polarity), and whether
indicating spatial or temporalrelations, whether literal (on it, before then, by the house) ted, des ar
ired, ete. (moo
_
English also has many d).
or metaphorical (on time, with gusto, beside her- self). But they often indicate other words of more o r less
unique function, incl
relations as well (Hamlet was written by Shakespeare, and (from Shakespeare) “And | (oh, ah, hey, man, alas), negatives (no, uding interjections
not),
Politeness markers (ple
did laugh sans intermission an hour byhis dial”). “greetings (hello, goodbye), and the exist€ntia ase, thank you), .
l
there (there are two on
others. Whether these Classes
are assi the table) among
A particle is a word that resembles a preposition or an adverb, and that often combines rs. i A ar Nam
gned particul es or lum ped together (as.
with a verb to form larger unit called a phrasal verb, as in the following examples from interjections or even adverbs) depends on the purpo
Se of the labelling =
Thoreau: So | went on for some dayscutting and hewing timber. . . Moral reform is the
Penn Treebank
effort to throw off sleep. . . We can see that these are particles rather than prepositions,
for in the first example, on is followed, not by a noun phrase, but by a true preposition There are a small number of
popular tag sets for English, many
the 87-tag tagset used for the of which evolved from
phrase. With transitive phrasal verbs, as in the second example, we can tell that off is Brown Corpus. Three of
the most commonly used are the
a particle and not a preposition becauseparticles may appear after their objects small 45-tag Penn Treebank
(throw tagset , the medium-sized 61
tag C5 tagset used by the
sleep off as well as throw off sleep). This is not possible for prepositions (The horse Lancaster UGREL project’s CLAWS (the
went Constituent Likelihood Automatic Word-
off its track, but *The horse wentits track off). System) tagger to tag the British Nation tagging
al Corpus (BNC) and the larger -146-tag
tagset . The Penn Treebanktagset, C7
Articles a, an, and the often begin a noun phrase. A and an has been applied to the Brown corpus
mark a noun phrase as and a number
indefinite, while the can mark it as definite. Conjunctions are of other corpora. Hereis an example of a
used to join two phrases, tagged sentence from the Penn Treebank ~
clauses, or sentences. Coordinating conjunctions like and, version of the Brown corpus (in a flat ASCII
or, or but, join two elements file, tags are often represented after each -
of equal status. Subordinating conjunctions are used word, following a slash, but the tags can also
when one of the elements is of be represented in various other ways).
some sort of embedded status. For examplein ‘| thought The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
that you might like some milk’ - number/NN offIN other/JJ topics/
is a subordinating conjunction that links the main Clause
| thought with the subordinate
NNS|/.
clause you might like some milk. This clause is called |
subordinate becausethis entire The Penn Treebank tagset was culled from the original 87-tag
tagset for the Brown
clause is the ‘content’ of the main verb thought.
Subordinating conjunctions like that corpus. This reduced set leavesout information that can be recover
whichlink a verb to its argument in this way are
ed from the identity
also called complementizers. of the lexicalitem. For example the original Brown tagsetand
other large tagsets like CS -
Pronouns are formsthat often act as a kind of shorthan include a separate tag for each ofthe different forms of the verbs do (e.g. C5
d for referring to some noun tag ‘VDD
phraseor entity or event. Personal pronouns for did and ‘VDG’for doing), be, and have. These were omitted from the Penn set.
refer to per- sons or entities (you, she, |
, it,
me, etc). Possessive pronouns areforms of personal
pronouns thatindicat e either actual Table : Penn tree bank set
Possession or moreoften just an abstract
relation between the person and someobject
(my, your, his, her, WH its, one’s, our,
their). Wh-pronouns ( what, who, whom,
whoever)
are usedin certain question forms,
or may also act as co mplementizers
metfive years ago. . .).
(Frieda, who | 1 CC Coordinating conjunction
A closed class Subtype of Engli
sh verbs is the au xiliary verbs 2. CD Cardinal number
. Cross- AUXILIARY
linguistically, auxiliaries are-w
ords ( usually verbs) that mark
certain semantic features 3. DT Determiner -
of a main verb, including . , a
whether an action takes place
in the present, past or future
4. tex lowsaniatthes
Natural Language Processing
ee Syntax Analysis (ss)

Scanned with CamScanner


Number UE Eresten
Foreign word

Preposition or subordinating conjunction

Adjective

Adjective, comparative

Adjective, superlative

List item marker

Modal

Noun, singular or mass 33. | WDT | Wh-determiner

Noun, plural — 34. | | WP | Wh-pronoun

Proper noun, singular :35. | WP$ Possessive wh-pronoun

Proper noun, plural 36. | WRB | Wh-adverb

Predeterminer
2.Rule based POS tagging, Stochastic POS tagging an
Possessive ending
Part-of-speech tagging is the process of assigning a part-of-speech or other =
ee tenet

Personal pronoun class marker to each word in a corpus. Tags are also usually applied to punctustion
markers; thus, tagging for natural language is the same process as toxenizetion for
Possessive pronoun
computer languages, although tags for natural languages are much more ambiquous_
As we suggestedat the beginning of the chapter, taggers play an increasing’y imoorert
role in speech recognition, natural language parsing and information retrieval. The incut
, comparative
to a tagging algorithm is a string of words and a specified TAGSET tagset of he Kine

, superlative described in the previous section. The outputis a single best tag for each word.

VB DT NN.
Book thatflight.
VBZ DT NN VB NN?
25. Doesthatflight serve dinner?

Natural Language Processing

Scanned with CamScanner


Wass = |

trivial
to each word is not The rules can be categorizedin to two parts:
ally assigning a tag
examples, a utomatic ge and pay
Even in these simple than one poss ible usa a. Lexical: Wordlevel
ambiguous. That is,
it has more
For example, book is ect) or a noun (as in - p. Context Sensitive Rules: Sentencelevel
ver b (as in 2
boo k thatf light or to book the susp
of speech. It can be a
be a determi ner(as in Dogg .
ra boo k of matches). o! milarly that can The transformation-based approachesuse a predefined set of handcrafted rules as well
. Scorene or a complementizer (as in | thought that ou was earlier),
as automatically inducedrules that are generated during training
es, cheosing : e Proper tag
ng is to resolve these ambiguiti
The problem of POS-taggi Basic Idea: .
iguation tasks
‘for the context. Part-of-speech tagging is thus one of the many disamb
4. Assign all possible tags to words
t
tagging problem? Mos wor ds in English are
we will see in this book. How hard is the
y of the most common words of Removetags accordingto set of rules of type:
‘unamb igu ous ; i.e. the y hav e onl y a sing le tag. But man
“English are ambiguous (for example can can be an auxiliary (‘to be able’), a noun (‘a
Noa fs WN
if word+1 is an adj, adv, or quantifier and the following is a sentence boundary

ing in such a metal container’)).


metal container’), or a verb(‘to put someth
And
ased taggers and stochastic
Most tagging algorithms fall into one of two classes: rule-b word-1 is not a verb like “consider”
se of handwritten
taggers. Rule-based taggers generally involve a large databa then
ous word is a noun
disambiguation rules which specify, for example, that an ambigu eliminate non-advelse eliminate adv.
e
rather than a verbifit follows a-determiner. Stochastic taggers generally resolv tagging
8. Typically more than 1000 hand-written rules, but may be machine-learmed.
ambiguities by using a training corpus to compute the probability of a given word having
a given tag in a given context. First Stage:
FOR each word
Rule-Based Part-of-Speech Tagging
Getall possible parts of speech using a morphological analysis algorithm
One ofthe oldest techniques of tagging is rule-based POS tagging. Rule-based taggers
Example
use dictionary or lexicon for getting possible tags for tagging each word. If the word has
morethan one possible tag, then rule-based taggers use hand-written rules to identify the ‘NMoS
correct tag. Disambiguation can also be performed in rule-based tagging by analyzing RB
the linguistic features of a word along with its preceding as well as following words.
For VBN JJ VB
| example, supposeif the preceding word of a word is article
then word must be a noun. PRP VBD TOWB, DTNM
AS the name suggests, all such kind of information in rule-base
in the tor of rules. These rules may be either Context-pa
d POS tagging is coded She promised to back thebill.
ttern rules Or, as Regular
expression compiled into finite-state Apply rules to removepossibilities
automata, intersected with lexicall
sentence representation.
biguous Example Rule:
_ 4
IF VBD is an option and VBN|VBD follows “<start>PRP”

gn each word list of potential


i parts-of-speech. In _ THEN Eliminate VBN
the second Stage, it uses large lists of
. ; hand-written disambi i NM
list to a single Part-of-speech for each word
Suation rules to sort down gly
RB

(2 nota Lang
uage Processing Syntax Analysis

jee

Scanned with CamScanner


SS <a ee

VBN J VB
FRE VBD TO VB DT NM J NATIVE SG NOINDEFDETERMIN
ER
She promised to back the bill
one-th
oo
SQ
ENGTWOLRule-Based Tagger she PRON P
pt | PERSONALFEMININE NOMINATIVE SG3
A Two-stage architecture ara V IMP
Use lextoon FST (dictionary) to tag each word with all possible POS
show Vv
Apply hand-written rules to eliminate tags. fj PR PRESENT-sc3
5 vein
The mutes eliminate tags that are inconsistent with the context, and should reducethe list show N
— NO MINATIVE SG
‘of POStags to. a single POS per word.
shown PCP2 SVOO SVO sy
Det-Noun Rule:If an ambiguous word follows a determiner, tag itas a noun Ee |

occurred PCP2 SV
ENGTWOL Adverdial-that Rule
op
Giveninput that" | occurred V PAST V FIN SV
if the next word is adj, adverb, or quantifier, and following that is a sentence
boundary, and Sample ENGTWOL Lexicon
the previous wordis nota verblike “consider” which allows adjs as object complement
s, Stage 1 of ENGTWOL Tagging
Theneliminate non-ADVtags,
First Stage: Run words through Kimmo-style morphological
Else eliminate ADVtag analyzerto get all parts of
speech.
| consider that odd. (that is NOT ADV)
Example: Paviov had shownthatsalivation ...
Itisn't that strange. (that is an ADV) Pavlov PAVLOV N NOM SG PROPER
This approach does work and produces accurate results. had HAVE V PAST VFIN SVO
What are the drawbacks? HAVE PCP2 SVO
Extremely labor-intensive shown SHOW PCP2 SVOO SVO SV
that ADV
Word POS Additional POS features
PRON DEM SG
smaller ADJ COMPARATIVE DET CENTRAL DEM SG
entire CS
ADJ ABSOLURE ATTRIBUTIVE
Salivation NNOMSG
fast ADV SUPERLATIVE Stage 2 of ENGTWOL Tagging

that DET CENTRAL DEMONSTRATIVE SG Second Stage: Apply constraints.


Constraints used in negative way.
all DET PREDETERMINER SG/PL QUANTIFIER
dog's N GENITIVE SG

Natural Language Processing


Syntax Anafysis (s:)

Scanned with CamScanner


!a 2, The besttag for a given wordis determined by the probability that it occurs with
the n previoustags(tag sequence probabilities)
“that” rule
Example: Adverbial Advantages of Stochastic Part of Speech Taggers
Giveninput: “that”
If
4. Generally more accurate as comparedto rule based taggers.
(+1 AIADVIQUANT) Drawbacks of Stochastic Part of Speech Taggers
(+2 SENT-LIM) 4. Relatively complex.
(NOT -1 SVOCIA)
2. Require vast amounts ofstored information.
Then eliminate non-ADV tags
Else eliminate ADV Stochastic taggers are more popular as compared to rule based taggers because of
POS Tagging their higher degree of accuracy. However,this high degree of accuracy is achieved using
Properties of Rule-Based
the following properties — some sophisticated and relatively.complex procedures and data structures.
Rule-based POS taggers possess
are knowledge-driven taggers. For a given sentence or word sequence, Hidden Markov Models (HMM) taggers choose
These taggers
manually. - the tag sequencethat maximizesthe following formula:
The rules in Rule-based POS tagging are built
Theinformation is coded in the form of rules. P(word|tag)* P(tag|previousntags) |(1)
1000. _
We have somelimited numberof rules approximately around
taggers. HMM taggers generally choose a tag sequence for a whole sentence rather than for a
Smoothing and language modelling is defined explicitly in rule-based
single word, but for pedagogical purposes, let's first see how an HMM tagger assigns
Advantages of Rule Based Taggers:-
a tag to an individual word. Wefirst give the basic equation, then work through an
a. Small set of simple rules.
example, and, finally, give the motivation for the equation. A bigram-HMM tagger ofthis
b. Less stored information.
kind chooses the tag t, for word w,that is most probable given the previous tag t-1 and
Drawbacks of Rule Based Taggers
the current word w,
a. Generally less accurate as compared to stochastic taggers.
t= argmax P(tfae
w;)
Stochastic Part of Speech Taggers ...(2)
\ The use of probabilities in tags is quite old. Stochastic taggers use probabilistic and Through some simplifying Markov assumptions that we will give below, we restate
Statistical information to assign tags to words. These taggers might use ‘tag sequence Equation 2 to give the basic HMM equation for a single tag as follows:
probabilities’, ‘word frequency measurements’ or a combination of both.
Based on probability of certain tag occurring given various possibilities. f=argmax P(t, It;,
Requires a
..)P(w; It) ...(3)
training corpus. No probabilities for words not in corpus. Training corpus may bedifferent
from test corpus. , An Example
E.g. Let's work through an example, using an HMM tagger to assign the proper tag to the
1. The tag encountered most frequently Single word race in the following examples (both shortened slightly from the Brown
in the training set is the one assign
ed to an
ambiguous instanceof that word (word frequen Corpus):
cy measurements)

Natural Language Processing


|
"
Syntax Analysis (3)

Scanned with CamScanner


to/TO race/VB tomorrow/NN ..(4)
Secretariat/NNP is/VBZ expected/VBN combined Brown and Swi
tchboard Corpor
the/DT reason/NN for/IN the/DT race/NN a: P(race|NN) = .00041
People/NNS continue/VBP to/TO inquire/VB P(race|VB) = .00003
for/IN outer/JJ space/NN ..(5)
noun (NN). For the purposes of this If we multiply the lexicallikelihoods with the
In the first example race is a verb (VB), in the second tag sequence probabili- ties, we see that .
pretend that some other mech- anism has already done the Best tagging even the simple bigram version of the HMM ta gger correctly
example , let's tags race as a VB despite
leaving only the word race untagged. A bigram the fact thatit is the less likely sense ofrace:
job possible on the surrounding words,
tagging problem
version of the HMM tagger makes the simplifying assumption that the P(VB|TO)P(race|VB) = :00001
and tags. Consid er the proble m of as- signing
can be solved by looking at nearby words P(NN|TO)P(race|NN) = :000007.
a tag to race given just these subsequences: to/TO race/??? the/DT race/??? Let's see
howthis equation applies to our example with race; Equation 3 says that if we are trying
Transformation-Based Tagging
to choose between NN and VBfor the sequenceto race, we choosethe tag that has the Transformation-Based Tagging, sometimes called Brill tagging, is an in-
stance of

eee
greater of these two probabilities: the Transformation-Based Learning (TBL) approach to machine learning , and
draws
P(VB|TO)P(race|VB) ..(6) inspiration from both the rule-based and stochastic taggers. Like the rule-based
taggers,
TBL is based on rules that specify what tags should be assigned to what words. But.
And
like the stochastic taggers, TBL is a machine learning technique, in which rules
P(NN|TO)P(race|NN) ..(7) are
automatically induced from the data. Like some but notall of the HMM taggers, TBL
is
Equation 3 andits instantiations Equations-6 and 7 each have two probabilities: a tag a supervised learning technique;it assumes a pre-tagged training corpus. Samuel et al.
sequence probability P(t|t-1) and a word-likelihood P(wit). For race, the tag sequence (1998a) offer a useful analogy for understanding the TBL paradigm, whichthey credit to
probabilities P(NN|TO) and P(VB|TO) give us the answer to the question “how likely are Terry Harvey. Imagine an artist painting a picture of a white house with green trim against

SS
we to expect a verb (noun) given the previous tag?". They can just be computed from a blue sky. Suppose mostofthe picture was sky, and hence most ofthe picture was blue.
a corpus by count- ing and normalizing. We would expect that a verb is more likely to The artist might begin by using a very broad brush and painting the entire canvas blue.
follow TO than a noun is, sinceinfinitives (to race, to run, to eat) are commonin English. Next she might switch to a somewhat smaller white brush, and paint the entire house
While it is possible for a noun to follow TO (walk to school, related to hunting), it is less white. She would just color in the whole house, not worrying about the brown roof, or
common. Sure enough, a look at the combined Brown and Switchboard corpora gives the blue windowsor the green gables. Next she takes a smaller brown brush and colors
us the following probabilities, showing that verbsare fifteen times as likely as nouns after Over the roof. Now she takes up the blue paint on a small brush andpaints in the blue |
TO: windowson the barn. Finally she takes a very fine green brush and does the trim on the
P(NN|TO) = .021 gables. The painter starts with a broad brush that coversa lot of the canvas but colors
P(VB|TO) = .34 a lot of areas that will have to be repainted. The next layer colors less of the canvas,
but also makesless ‘mistakes’. Each new layer uses a finer brush that corrects less of
The second part of Equation 3 and its instantia
tions Equations 6 and is the lexical the picture, but makes fewer mistakes. TBL uses somewhat the same method asthis
- likelihood: the likelihood of the noun race given each tag,
P(race|VB) and P(race|NN). Painter. The TBL algorithm has a setof tagging rules. A corpus is first tagged using the
Note that this likelihood term is not asking ‘which is
the mostlikely tag for this word’. broadestrule, i.e. the one that applies to the most cases. Then slightly more specific
Thatis, the likelihood term is not P(VBJrace).
Instead we are computing P(race|VB). The rule is chosen, which changes some of the original tags. Next an even narrowerrule,
probability, slightly counterintuitively, answers
the question “if we were expecting a verb, Which changes a smaller numberof tags (some of which might be previously-changed
tags).

Natural Language Processing


.
= Syntax Analysis

Scanned with CamScanner


———— = = —— — Tie
How TBL rules are applied ° | The preceding (following) word i ar
apply, the tagger The word two before (ate ee
Let's look at one of the rules used byBrill's tagger. Before the rules
.
labels every word with its most-likely tag. We get these most-likely tags from a tagged One of the two preceding
(following) words is tagged
ed z,
corpus. For example, in the Brown corpus, race is most likely to be a noun:
One ofhte three preceding
(following) words is lagged z
P(NNjrace) = .98
receding word is .
P(VBlrace) = .02 The P 5 g | lagged z and the following wordis tagged w.
. receding (followi
This meansthat the two examples of race that we saw abovewill both be coded as NN. The g( INg) word is tagged z and the word. j
two before (after) is tagged w, i
In the first case, this is a mistake, as NN is the incorrect tag: . LS ye fh

is/VBZ expected/VBN to/TO race/NN tomorrow/NN..8 : i


Changetags . 3 :
In the second case this race is correctly tagged as an NN: eo H
the/DT race/NN for/IN outer/JJ space/NN ..9 From To Condition Example k
-i

; é
After selecting the most-likely tag, Brill's tagger applies its transformation rules. Asit NN VB Previoustag is TO to/TO race/NN — VB i
happens, Brill’s tagger learned a rule that applies exactly to this mistagging of race: VBP VB One ofthe previous 3 tags is MD mightMD vanish’VBP — VB
Ro

i
Change NN to VB whenthe previous tag is TO This rule would change race/NN to race/ NN VB One ofthe previous 2 tags is MD
w

might/MD not replayNN — VB


VB in exactly the following situation, since it is preceded by to/TO:
VB NN _ |.One of the previous 2 tags is DT
-

expected/VBN to/TO race/NN --> expected/VBN to/TO race/VB..10


VBD VBN One ofthe previous 3 tags is VBZ
on

How TBL Rules are Learned


Figure 1: Figure Templates for TSL
Brill's TBL algorithm has three major stages. It first labels every word with its most-
likely tag. It then examines every possible transformation, and selects the one that TBL problems and advantages
results in the most improved tagging. Finally, it then re-tags the data according to this Problems
rule. These three stages are repeated until some stopping criterion is reached, such as Infinite loops and rules mayinteract
insufficient improvement over the previous pass. Note that stage two requires that TBL The training algorithm and execution speed is slower than HMM
i
knowsthe correct tag of each word; i.e., TBL is a supervised learning algorithm. The
Advantages- ib
output of the TBL processis an orderedlist of transformations; these then constitute a . ; . iy
‘tagging procedure’ that can be applied to a new corpus. In principle the set of possible 'tis possible to constrain the set of transformations with “templates e
transformationsis infinite, since we could imagine transformations such as “transform IF tag Z or word W is in position *-k . :
NN to VBif the previous word was ‘IBM’ and the word ‘the’ occurs between 17 and 158 THEN replace tag X with tag §
words before that”. But TBL needs to consider every possible transformation, in order to Learns a small number of simple, non-stochastic rules e
pick the best one on each passthrough the algorithm. Thus the algorithm needs a way
Speed optimizations are possible usingfinite state transducers
to limit the set of transformations. This is done by designing a small set of templates,
ithm on unknown words
abstracted transformations. Every allowable transformation is an instantiation of one of TBL is the best performing algor
the templates.
_ The Rules are compact and can be inspected by humans

. ;
SyrianSrialwn
=
as
a
Natural Language Processing ,

Scanned with CamScanner


determined by the averageof the distribution over all singleton words in the training set.
3. Issues that ia idea of using‘things we've seen once’as an estimator for ‘things we've never
seen’ proved useful as key concept.
ds .
Multiple tags and multiple wor The most powerful unknown-wordalgorithms makeuseof information about how the word
-part words. Tag
are tag indeterminacy and multi
Two issues that arise in tagging is spelled. For example, wordsthat endin the letters arelikely to be plural nouns (NNS),
guous between multiple tags and itis impossible
indeterminacy arises when aword is ambi multiple tags.
while words ending with -ed tend to be past participles (VBN). Words starting with capital
case, some tagge rs allow the use of
or very difficult to disambiguate. In this ntag
letters are likely to be nouns. Four specific kinds of orthographic features: 3 inflectional
s. Commo
ank andin the British National Corpu
This is the case in the Penn Treeb endings (-ed, -s, -ing), 32 derivational endings (such as-ion, -al, -ive, and -ly), 4 values
te versus past participle (JJ/MVBD/VBN),
indeterminacies include adjective versus preteri of capitalization (capitalizedinitial, capitalized non- initial, etc.), and hyphenation. They
er (.JJ/NN).
and adjective versus noun as prenominal modifi used the following equation to compute the likelihood of an unknown word:
C5 and C7tagsets, for example, allow
The second issue concerns multi-part words. The P(wilti) = p(unknown-word|ti) p(capital|ti) p(endings/hyph |ti)
be treated as a single word by adding numbers to each
prepositionslike ‘in terms of to Other researchers, rather than relying on these hand-designed features, have used
tag: machine learning to induce useful features. Brill used the TBL algorithm, where the
in/I31 terms/I132 of/1133 allowable templates were defined orthographically (the first N letters of the words, the
the Penn Treebank and the last N letters of the word, etc). His algorithm inducedall the English inflectional features,
Finally, some tagged corporasplit certain words: for example
British National Corpus splits contractions and the ‘s-genitive from their stems: hyphenation, and manyderivational features suchas-ly, al. Franz uses a loglinear model

would/MD n'vRB which includes morefeatures, such as the length of the word and variousprefixes, and
furthermore includes interaction terms among various features.
children/NNS 's/POS

Unknown words 4. Introduction to CFG


All the tagging algorithms we have discussed require a dictionary thatlists the possible
(A context-free grammar (CFG) is a list of rules that define the set of all well-formed
parts of speech of every word. But the largest dictionary will still not contain every
sentences in a language. Each tule has left-hand side, which identifies a syntactic
possible word. Proper names and acronyms are created very often, and even new reading Sek
category, and a tight-handside, which definesits alternative componentparts,
common nouns and verbsenter the languageat a surprising rate. Therefore in order to
from left to right.
build a complete tagger we need some methodfor guessing the tag of an unknown word.
‘setting out together or
The simplest possible unknown-word algorithm is to pretend that each unknown word The word syntax comes from the Greek syntaxi$ , meaning
arrangeme nt’, and refers to the way words are arranged together. various syntactic
is ambiguous amongall possible tags, with equal probability. Then the tagger mustrely ’ . . .
There are
notions part-of-speech categories as a kind of equivalence class for words..
solely on the contextual POS-trigrams to suggest the proper tag. Aslightly more complex subcategorization and
relations, and aa
14 ons constituen
three main new ideas: rammaticalrvs’
cane cy, gramma
algorithm is based on the idea that the probability distribution of tags over unknown
that groups of words may behave
words is very similar to the distribution of tags over words that occurred only once ina © dependencies, The fundamentalidea of constituencyis
of
t. For example wewill see that a group
training set. These words that only occur once are known as hapax legomena(singular as a single unit or phrase, called a constituen
a noun phrase often acts as a unit; noun phrases include single wordslike
hapax legomenon). For. example, unknown words and hapax legomena are similar in Words called
Hill, and a well-weathered three-
that they are both mostlikely to be nouns, followed by verbs, but are very unlikely to she or Michael and phrases like the house, Russian
, a formalism that will allow us to model these
be determiners or interjections. Thus the likelihood P(wijti) for an unknown word is Story structure context-free grammars
s
Syntax Analysis Sa
Natural Language Processing

Scanned with CamScanner


_a

ape
eae
of ideas from traditional
l relations are a formalization jike number agreement can bearbitrarilyfar apart from each other “F
constituency facts. Grammatica chickens that eal corn eats rabbits" is awkward because the wijerd — ne
the sente nce: She ate a mammath
OBJECTS. In

CPUeta ee pa Prc
grammar about SUBJECTS and
mamm oth breakfast is the while the main verb (eats)is singular. In Principle, we could =
SUBJ ECT and a "mG sents . ae
Semences that ean
have an
arbitrary numberof words between the subject aa m —
Sheis the
breakfast. The noun phrase

ae
dencyrelati ons refer to certai n kinds of relations
OBJECT. Subcategorization and depen |
Noam Chomsky, in 1957, used the following notation cated productions. to define the
(

between words and phras es) |

an infinitive, as in | want to fly to Detroit, syntax of English. The terms used here, sentence, noun vee mm tie fotlowing

pean
For example the verb want can be followed by
t. But the verb find cannot be followed by rules
been describe a very
categorized small subset
as adjectives of English sentences. The articles “sed the& have
for simplicity.

>>>,
or a noun phrase, asin | wanta flight to Detroi

ere
aninfinitive (*I found to fly to Dallas ). Thes e are called facts about the subcategory of

ae PW
can be modelled by various kinds of

i,
edge <sentence> — noun phrase> <verb phrase>
the verb. All of these kinds of syntactic knowl

bigot
xt-free grammars are thus +>
grammars that are based on context-free grammars. Conte <noun phrase> <adjective> <noun phrase> i <adjectve> <singuiar noun>

age (and, for that matter,

eg oe ee!
the backbone of many models of the syntax of natural langu <yerb phrase> — ‘<singular verb> <adverb>
s of natural language
of computer languages). As such they are integral to most model
. EE —

<adjective> — a| the litte


understanding, of grammar check ing, and more recen tly of speec h understanding. They
nce, <singular noun> — boy
are powerful enough to express sophisticated relations among the words in a sente
yet computationally tractable enough thatefficient algorithms exist for parsing sentences <singular verb> — fan
with them <adverb> — quickly
Here are some properties of language that we would like to be able to analyze: Structure Here, the arrow, —, might be read as “is defined as” and the verbcal bar, “T
Ce Oat, }.a5 OF.

and Meaning-A‘sentence has a semantic structure thatis critical to determine for many Thus, a noun phraseis defined as an adjectrve followed by ancther noun pirase or2s
tasks. For instance, there is a big difference between Foxes eat rabbits and Rabbits an adjective followed by a singular noun. This cefniten of noun phrase
a 2 6% fects
2

eat foxes. Both describe some eating event, but in the first the fox is the eater, and in because noun phrase occurs on both sides of the orocuctien. Gramma
ranruTiats. > Tecursne
are

the secondit is the one being eaten..We need some wayto represent overall sentence to allow for infinite length strings. This grammaris sarc to be context-ree Decausé onty
structure that captures suchrelationships. Agreement—In English as in most languages, arrow. Hf there were more than
one syntactic category, e.g., , occurs on the le® of the
there are constraints between forms of words that should be maintained, and which one syntactic category, this would describe a context and be called context-sensive.
are useful in analysis for eliminating implausible interpretations. For instance, English another
A grammar is an example of a metalanquags- a language used ta desome
requires number agreement between the subject and the verb (among other things).
language. Here, the metalanguageis the context-free grammar used to Gescrite 2 part
Thus we can say Foxes eat rabbits but “Foxes eats rabbits” is ill formed. In contrast,
of the English language. Figure X shows a cagram called a parse Tee Of Sructure Tee
The Fox eats a rabbit is fine whereas “The fox eat a rabbit" is not. Recursion—Natural for the sentencethe little boy ran quickly. The sentence: “Quickly. the ie Ooy ran” ssaiso
languages have a recursive structure that is important to capture. For instance, Foxes
a syntactically correct sentence, but it cannet be derived from the above grammar,
that eat chickens eat rabbits is best analyzed as having an embedded sentence (i...
that eat chickens) modifying the noun phases Foxes. This one additional rule allows us
to also interpret Foxes that eat chickens that eat corn eat rabbits and a hostof other —
sentences. Long Distance Dependencies—Becauseofthe recursion word dependencies

Natural Language Processing

Scanned with CamScanner


__ _ ee
Bas Ft”

a CFG is a serie word s that can be derii ved by |


s of of word
ase mte allythe
nce
atic
lang
apply the defined by_
ing uage
of a sentence
stem has s onits left-hand side.
ding the structure : rules, beginning with a rule that
EXAMPLE2: Fin id ence is a serie s of rule appli catio ns in whic h a syntactic category is
ten ce: A parse of the sent
ve, the sen
Using the gram
mer abo
gory onits left-hand side, and
ckly replaced by the right -handside of a rule that has that cate
Thelittle boy ran qui ication yields the sente
the final rule appl nceitself . E.g., a parse of the sentenée’ “the
fe vp => the giraffe iv
=> det n vp => the n vp => thee giraf
can be dia gra med :
/ ™. giraf fe = is: s => np vp enie nt way to desc ribe a pars is to showits parse tree,
=> the giraffe dreams. A conv
ree has
he pars e.Note thatthe rootof every subt
<verb phrase?
’ which is simply a graphicaldisplayoft

J”.
children of
ontheleft-hand side of a rule, and the
<noun phrase? a grammatical category that appears

/ ™
ight
ther -han dside of that rule. |
that root are identicalto the elementson —
<adverb>
<noun phrase?
<singular verb> Potential Problemsin CFG

)
- pe Ne
_ <adjective>
» Agreement /~
og: no
<adjective> <singula
r noun? » Subcategorization - ae Tt )
| ny &
e Movement
quickly
boy ran co
The little Agreement
s using a context- “This dogs
ect English sentence
to describe all the corr _This dog
In fact, it is impo ssible r abov e, to derive the
, u sing the gramma “those dog
other hand, it is possible Those dogs
free grammar. On the e the boy ran quick ly.” Context
g, “littl
semantically incorrect strin
syntactically correct, but *This dog eat
ribe semantics. This dog eats
free grammars cannot desc followed
fined as a noun phrase ‘Those dogs eats
s --> Np YP mean sthat “a sentence is de Those dogs eat
E.g., the rule s froma
s a simple CFG that d escr
ibes the sent ence
| | a,
byav erb phrase.” Figure X1 show
5.Subcategorization
_ small subset of English.
s —> np vp
s Sneeze: John sneezed *
np — det n
vp — tv np
= *
vp
Find: Pleasefind [a flight to NY],es;
. np fare],ip
— iv
I Give: Give [me],, [a cheaper
d < ,
det —> the you help [me],, . [with a flight),,
tbo yy Help: Can
>a ; l)ove
—> an
the giraffe dreams Prefer: | prefer [to leave earlie
n —» giraffe
Told: | was told [United hasa flight],
—» apple
iv —» dreams
tv —> eats
—» dreams

dreams”
Figure 2: A grammar and a parse tree for “the giraffe Syntax Analysts (2)

>...

Natura! Language Processing . a ee


—_

aa
2 neee i £, ;
>

Scanned with CamScanner


Ta
*John sneezed the book order to answer the question “What b : ke

' want to know that the sitictofits


400? we'll OOks were wri iti women authorsbefore
sav by British
*| prefer United hasa flight . 7. enten
- py-adjunct wasBritish women authors to help us
; thei ie neee
*Give with a flight oi figure out that the user wants list of
. pooks (and notjust a list
cate (verb for now) places on the number:
Subcat expressesthe econstraints that a predi fications for buildi Authors). Syntactic parsers are also usedin lexicography
and type of the argumentit wantto take ope . | :
ng online versions of dictionaries. Finally, stochastic versions of
oe . oa

te. | parsing algorithms haverecently begunto be incorporated into speech recognizers, both
So the various rules for VPs overgenera for language models and for non-finite-state
ments that don't go acoustic and phonotactic modeling.
They pennit the presence of strings containing verbs and argu
In syntactic Parsing, the parser can be viewed as searching through the space ofall
together
possible parsetrees to find the correct parse tree for the sen
; ntence. The search space
For example
of possible parsetrees is defined by the grammar. For example, considerthe following
VP -> V NPtherefore sentence:
~ Sneezed the:book is a VP since “sneeze”is a verb and “the book”is a valid NP Bookthatflight. ...(1)
“ Subcategorization frames can fix this problem (“slow down” overgeneration) Using the miniature grammar andlexicon in Figure 2, which consists of someof the CFG
rules for English, the correct parse tree that would be assigned to this example is shown
6: Movement
e Core example
fs
Scere om poy Qian SIF

oboe
pS

oneYe
in Figure 1
S

_. [[Mytravel agent],, [booked[the flight],.]1, , OS, ea “ne


VP.
e |.e. "book" is a straightforward transitive verb. It expects a single NP'arg within the
VP as an argument, and a single NP arg as the subject... 4
« What about? |
— Whichflight do you want me to havethe travel agent book?
¢ The direct object argumentto “book’ isn't appearing in the right place. It is in fact a
long way from whereits supposed to appear. Verb

e And note that its separated from its verb by2 other verbs.
Book that flight
Parsing With Context-free.Grammar
Parse trees are directly useful in applications such as grammar check- ing in word-
processing systems; a sentence which cannot be parsed may have grammatical errors
(or atleast be hard to read). In addition, Parsing is an important intermediate stage of
representation for semantic analysis, and thus plays an important role in applications like
machinetranslation, question answering, and information extrac- tion.
For example, in

Syntax Analysis
104 Natural Language Processing
Tee

Scanned with CamScanner


according to the
for the sentence Book that flight an Auxfollowed by an NP and a VP. andthethird a VP byitself.
Figure 3. The correct parse tree To fi t the search space
on the page, we have shownin the third ply of Figure
grammarin Figure 2. 3 only the treeS resulting from the
Det > that this|| a expansion of the left-most leaves of each tree. At each
S—> NP VP ply of the search space we use
y
Noun —» book| flight | meal | mone the right-hand-sides of the rules to provide new sets of expectations for the parser,
which
S— Aux NP VP: are then used to recursively generate the rest of the trees. Trees are grown downward
} Verb —» book | include | prefer
S— VP. until they eventually reach the part-of-speech categoriesat the bottom ofthe tree.At this
Aux > does
NP — Det Nominal point, trees whose leavesfail to match all the wordsin the input can be rejected, leaving
Nominal > Noun. behind those trees that represent successful parses. In Figure 3, only the 5th parse tree
-| Prep —» from | to | on
Nominal > Noun Nominal (the one which has expanded the rule VP ! Verb NP) will eventually match the input
NP ->Proper-Noun Proper-Noun — Houston | TWA sentence Book thatflight.

VP — Verb -

VP > Verb NP Nominal > Nominal PP


Figure 3: A miniature English grammarandlexicon.
S
The goalof a parsing searchisto find all trees whoseroot is the start symbol 5S, which
cover exactly the words in the input. Regardless of the search algorithm we choose, NP VP AUX NP VP VP
there are clearly two kinds of constraints that should help guide the search. One kind of

-
constraint comesfrom the data, i.e. the input sentence itself. Whatever elseis true of the
final parse tree, we know that there must be three leaves, and they must be the words
book, andflight. The second kind of constraint comes from the grammar. We knowthat
NP VP NP VP AUXNP VP AUX NP VP VP y
whatever elseis true of the final parse tree, it must have one root, which must be the
start symbol S. These two constraints,, give rise to the two search strategies underlying Det Nom PropN Det Nom PropN V NP V
mostparsers: top-down or goal-directed search and bottom-up or data-directed search. Figure 4: An expanding top-down search space. Each ply is created by taking each tree from the previous
ply, replacing the leftmost non-terminalwith eachofits possible expansions, and collecting each of these
Top-DownParsing trees into anewply.
A top-downparser searches for a parse tree by trying to build from the root node S$
Bottom-Up Parsing
downto the leaves. Let's consider the search spacethat a top-down parser explores,
assuming for the moment thatit builds all possible trees in parallel. The algorithm starts Bottom-up parsing is the earliest knownparsing algonthm, andis used in the shiftteduce
by assuming the input can be derived by the designated start symbol S. parsers common for computer languages. In bottom-up parsing, the parser stars with
The next step
is to find the topsof all trees which can start with S, by looking the words of the input, and tries to build trees-from the words up, again by applying
for all the grammar
rules with S onthe left-handside. In the grammar in Figure 2, there are three rulesthat rules from the grammar one at a time. The parse is successful if the parser succeeds in

expand S, so the second ply, orlevel, of the search Spacein Figure 3 has three partial building a tree rootedin the start symbol S that covers all of the input. Figure 4 show the
trees. We next expand the constituents in these three new trees just as weoriginall bottom-up search space, beginning
inni with
i as sentence Bookthatflight.get The
Abeye parser begins
git
y . three partial
expanded S. Thefirst tree tells us to expect by looking up each word (book,that, and flight)in the lexicon and building
an NP followed by a VP, the second expect
s can be
trees with the part of speech for each word. But the word book is ambiguous;it

_\(108) Natural Language


guage Processin, 8 tn ah .c Syntax Analysis (207)

Scanned with CamScanner


——————
fF
_.ay
two
sider two pos sible s ets of trees. The first the next by lookingforplacesin the Parse-in-progress where the right-hand-side of soniie
the parser Mm ust con
a noun or a verb. Thus rule might fit. This contrasts With the earlier top-down parser, which expandedtrees by
of the search space.
urcation|
lies in Figure 4showthisinitia
itial bif
applying rules when their left-hand side matched an unexpanded nonterminal. Thus in.
Book that flight
, the fourth Ply, in the first andthird parse, the sequence Det Nominal is recognized-as the
Verb Det Noun right-hand side of the NP ! Det Nominalrule. In the fifth ply, the interpretation of book as
Noun Det Noun
a noun has been prunedfrom the search space. This is becausethis parse cannot be
Book that flight Book that flight
continued: there is no rule in the grammar with the right-hand side NominalNP.Thefinal
ply of the search spaceis the correctparsetree.
NOM. NOM
NOM ,
Comparing Top-down and Bottom-up Parsing
~ Noun Det Noun | y Det van Each of these twoarchitectures has its own advantages and disadvantages. The top-

Book that flight Book that flight down strategy never wastes time exploring trees that cannot result in an S, since it
begins by generating just those trees. This means it also never explores subtrees
NP NP that cannotfind a place in some S-rooted tree. In the bottom-up strategy, by contrast,
trees that have no hopeofleading to an S, orfitting in with any of their neighbours, are
NOM vP NOM
. NOM generated with wild abandon. For example the left branch of the search spacein Figure
Verb Det ‘ban 4 is completely wasted effort; it is based on interpreting book as a Noun at the beginning
vin Det van vi Det Noun
of the sentence despite the fact no such tree can lead to an S given this grammar. The
Book that . _ flight acy that flight | ans wa “ibn top-down approachhas its own inefficiencies. While it does not waste time with trees that
do not lead to an S, it does spend considerable effort on S trees that are not consistent ©
VP with the input. Note that the first four of the six trees in the third ply in Figure 3 all have
left branches that cannot match the word book. None of these trees could possibly be
VP NP NP
used in parsing this sentence. This weaknessin top-down parsers arises from the fact
NOM NOM that they can generate trees before ever examining the input. Bottom-up parsers, on the
other hand, never suggesttrees that are not at least locally groundedin the actual input.
Verb Det Noun Verb © Det vn
Neither of these approaches adequately exploits the constraints present by the grammar
acy | flight Book that ni and the input words.

Figure 5: An expanding bottom-up search spacefor the sentence Bookthat flight. This
figure does not show
7. Sequencelabeling
thefinaltier of the search with the correct parse tree Figure
1.
In natural language processing, it is a common task to extract words or phrases of
Eachof the trees in the second ply are then expanded
.In the Parse on the left (the one
Particular types from a given sentence or paragraph. For example, when performing
in which book is incorrectly considered a noun), the
Nominal! Noun rul e is applied to both
of the: Nouns (book analysis of a corpus of news articles, we may want to know which countries are mentioned
| andflight) - This samerule is also applied to the sole Noun (flight) on
the right, producing the trees on the third ply. In in the articles, and how manyarticles are related to each of these countries.
general, the Parser extends one ply to

(108) natura Language Processing Syntax Analysis i

Scanned with CamScanner


This is actually a special case of sequence labelling in NLP (like POS tagging and chunking),
in which the goal is to assign a label to each member of a sequence. In the case of
identifying country names, we would like to assign a 'country name' label to all words that
form part of a country's name, and an 'irrelevant' label to all other words. For example, the
following is a sentence broken down into tokens, and its desired output after the sequence
labelling process:

input = ["Paris", "is", "the", "capital", "of", "France"]
output = ["I", "I", "I", "I", "I", "C"]

where I means that the token at that position is an irrelevant word, and C means that the
token at that position is a word that forms part of a country's name.

Methods of Sequence Labelling

A simple, though sometimes quite useful, approach is to prepare a dictionary of country
names, and look for these names in each of the sentences in the corpus. However, this
method relies heavily on the comprehensiveness of the dictionary. While there is a limited
number of countries, in other cases such as city names the number of possible entries in
the dictionary can be huge. Even for countries, many countries may be referred to using
different sequences of characters in different contexts. For example, the United States of
America may be referred to in an article as the USA, the States, or simply America.

In fact, a person reading a news article would usually recognise that a word or a phrase
refers to a country, even when he or she has not seen the name of that country before.
The reason is that there are many different cues in the sentence or the whole article
that can be used to determine whether a word or a phrase is a country name. Take the
following two sentences as examples:

Patrik travels to India's capital, Delhi, on Monday for meetings of foreign ministers from
the 10-member Association of South East Asian Nations (ASEAN).

The Governments of India and Srilanka will strengthen ties with a customs cooperation
agreement to be in force on June 15th.

The first sentence implies that something called India has a capital, suggesting that India
is a country. Similarly, in the second sentence we know that both India and Srilanka are
countries, as the news mentioned their governments. In other words, the words around
Delhi, India and Srilanka provide clues as to whether they are country names.
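As a concrete illustration of the dictionary-based approach discussed above, here is a
minimal sketch in Python (the tiny gazetteer below is invented for illustration; a real system
would need a far larger, multi-word-aware dictionary):

# A minimal dictionary-based sequence labeller for the country-name example.
# Tokens found in the gazetteer get the label "C" (country), all others "I" (irrelevant).

COUNTRY_NAMES = {"france", "india", "srilanka", "usa"}   # toy gazetteer, assumed for illustration

def label_tokens(tokens):
    """Return one label per token: 'C' for a country name, 'I' otherwise."""
    return ["C" if token.lower() in COUNTRY_NAMES else "I" for token in tokens]

print(label_tokens(["Paris", "is", "the", "capital", "of", "France"]))
# ['I', 'I', 'I', 'I', 'I', 'C']

Note that this baseline fails on multi-word or abbreviated names such as "the United States
of America" or "the States", which is exactly the limitation described above and the
motivation for the statistical sequence models that follow.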
Hidden Markov Model (HMM)

Markov Models

Let's talk about the weather. Here in Mumbai we have three types of weather: sunny, rainy
and foggy. Let's assume for the moment that the weather lasts all day, i.e., it doesn't change
from rainy to sunny in the middle of the day. Weather prediction is all about trying to guess
what the weather will be like tomorrow based on a history of observations of weather. Let's
assume a simplified model of weather prediction: we'll collect statistics on what the weather
was like today based on what the weather was like yesterday, the day before, and so forth.
We want to collect the following probabilities:

P(w_n | w_{n-1}, w_{n-2}, ..., w_1)    ...(1)

Using expression (1) we can give probabilities of types of weather for tomorrow and the
next day using n days of history. For example, if we knew that the weather for the past
three days was {sunny, sunny, foggy} in chronological order, the probability that tomorrow
would be rainy is given by

P(w_4 = Rainy | w_3 = Foggy, w_2 = Sunny, w_1 = Sunny)    ...(2)

Here's the problem: the larger n is, the more statistics we must collect. Suppose that
n = 5; then we must collect statistics for 3^5 = 243 past histories. Therefore we will make a
simplifying assumption, called the Markov Assumption. In a sequence {w_1, w_2, ..., w_n}:

P(w_n | w_{n-1}, w_{n-2}, ..., w_1) ≈ P(w_n | w_{n-1})    ...(3)

This is called a first-order Markov assumption, since we say that the probability of an
observation at time n only depends on the observation at time n-1. A second-order Markov
assumption would have the observation at time n depend on both n-1 and n-2. In general,
when people talk about Markov assumptions they usually mean first-order Markov
assumptions; we will use the two terms interchangeably.

We can express the joint probability using the Markov assumption:

P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i | w_{i-1})    ...(4)

This has a profound effect on the number of histories that we have to collect statistics for:
we now only need 3^2 = 9 numbers to characterize the probabilities of all of the sequences.
This assumption may or may not be valid depending on the situation (in the case of weather
it is probably not valid), but we use it to simplify the situation.
So let's arbitrarily pick some numbers for P(tomorrow | today), expressed in Table D1.

Table D1: Probabilities of tomorrow's weather based on today's weather

                                 Tomorrow's weather
                            Sunny      Rainy      Foggy
Today's weather   Sunny      0.8        0.05       0.15
                  Rainy      0.2        0.6        0.2
                  Foggy      0.2        0.3        0.5

For first-order Markov models we can use these probabilities to draw a probabilistic finite
state automaton. For the weather domain you would have three states (Sunny, Rainy and
Foggy), and every day you would transition to a possibly new state based on the
probabilities in Table D1. Such an automaton is a three-state transition diagram whose
edges are labelled with the probabilities in Table D1.
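To make the first-order model concrete, here is a small sketch that scores a weather
sequence using the transition probabilities of Table D1 (the model is conditioned on the
first observed day, so no starting distribution is needed):

# First-order Markov model of the weather, using the transition probabilities
# from Table D1: TRANSITIONS[today][tomorrow] = P(tomorrow | today).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def sequence_probability(sequence):
    """P(w_2, ..., w_n | w_1): the product of the transition probabilities
    along the sequence, as in equation (4)."""
    prob = 1.0
    for today, tomorrow in zip(sequence, sequence[1:]):
        prob *= TRANSITIONS[today][tomorrow]
    return prob

# Probability of sunny -> sunny -> foggy -> rainy, given that the first day was sunny.
print(sequence_probability(["sunny", "sunny", "foggy", "rainy"]))   # 0.8 * 0.15 * 0.3 = 0.036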
HMM for POS tagging

A Markov model is a stochastic (probabilistic) model used to represent a system where
future states depend only on the current state. For the purposes of POS tagging, we make
the simplifying assumption that we can represent the Markov model using a finite state
transition network.

Hidden Markov Models

So what makes a Hidden Markov Model? Well, suppose you were locked in a room for
several days and you were asked about the weather outside. The only piece of evidence
you have is whether the person who comes into the room carrying your daily meal is
carrying an umbrella or not. Let's suppose we know the probability of seeing an umbrella
for each type of weather.

Remember the equation for the weather Markov process before you were locked in the
room:

P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i | w_{i-1})    ...(5)

Now we have to factor in the fact that the actual weather is hidden from you. We do this
by using Bayes' rule:

P(w_1, ..., w_n | u_1, ..., u_n) = P(u_1, ..., u_n | w_1, ..., w_n) P(w_1, ..., w_n) / P(u_1, ..., u_n)    ...(6)

where u_i is true if your caretaker brought an umbrella on day i and false if the caretaker
didn't. The probability P(w_1, ..., w_n) is the same as the Markov model from the last
section, and the probability P(u_1, ..., u_n) is the prior probability of seeing a particular
sequence of umbrella events (e.g. {True, False, True}).

The probability P(u_1, ..., u_n | w_1, ..., w_n) can be estimated as Π_{i=1}^{n} P(u_i | w_i),
if you assume that for all i, given w_i, u_i is independent of all u_j and w_j for all j ≠ i.

[Figure: a finite state transition network over the observable states dog and cat.]

Note that each edge is labelled with a number representing the probability that a given
transition will happen at the current state. Note also that the probabilities of the transitions
out of any given state always sum to 1.

In the finite state transition network pictured above, each state was observable: we are
able to see how often a cat meows after a dog woofs. What if our cat and dog were
bilingual? That is, what if both the cat and the dog can meow and woof? Furthermore, let's
assume that we are given the states of dog and cat and we want to predict the sequence
of meows and woofs from the states. In this case, we can only observe the dog and the
cat, but we need to predict the unobserved meows and woofs that follow. The meows and
woofs are the hidden states.

Figure 7: A finite state transition network representing an HMM. The black arrows
represent emissions of the unobserved states woof and meow.

Let's now take a look at how we can estimate the probabilities for our states. Going back
to the cat and dog example, suppose we observed the following two state sequences:

dog, cat, cat
dog, dog, dog

Then the transition probabilities can be calculated using the maximum likelihood estimate:

P(s_i | s_{i-1}) = C(s_{i-1}, s_i) / C(s_{i-1})

In English, this says that the transition probability from state i-1 to state i is given by the
total number of times we observe state i-1 transitioning to state i, divided by the total
number of times we observe state i-1.

For example, from the state sequences we can see that the sequences always start with
dog. Thus we are at the start state twice, and both times we go to dog and never to cat.
Hence the transition probability from the start state to dog is 1, and from the start state to
cat it is 0.

Let's try one more. Let's calculate the transition probability of going from the state dog to
the state end. From our example state sequences, we see that dog transitions to the end
state only once. We also see that there are four observed instances of dog. Thus the
transition probability of going from the dog state to the end state is 0.25. The other
transition probabilities can be calculated in a similar fashion.

The emission probabilities can also be calculated using maximum likelihood estimates:

P(t_i | s_i) = C(s_i, t_i) / C(s_i)

In English, this says that the emission probability of tag i given state i is the total number
of times we observe state i emitting tag i, divided by the total number of times we observe
state i.

Let's calculate the emission probability of dog emitting woof, given the following emissions
for our two state sequences above:

Woof, woof, meow
Meow, woof, woof

That is, for the first state sequence, dog woofs, then cat woofs, and finally cat meows. We
see from the state sequences that dog is observed four times, and we can see from the
emissions that dog woofs three times. Thus the emission probability of woof, given that we
are in the dog state, is 0.75. The other emission probabilities can be calculated in the same
way. For completeness, the completed finite state transition network is given below.

Figure 8: Finite state transition network of the hidden Markov model of our example.
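The counting behind these maximum likelihood estimates is straightforward to express in
code. The following sketch recomputes the numbers above (1.0, 0.25 and 0.75) from the
two state sequences and their emissions:

from collections import Counter

# The two observed state sequences, padded with <start>/<end>, and their emissions.
state_seqs = [
    ["<start>", "dog", "cat", "cat", "<end>"],
    ["<start>", "dog", "dog", "dog", "<end>"],
]
emission_seqs = [
    ["woof", "woof", "meow"],
    ["meow", "woof", "woof"],
]

transition_counts, prev_state_counts, emission_counts = Counter(), Counter(), Counter()

for states, emissions in zip(state_seqs, emission_seqs):
    for prev, curr in zip(states, states[1:]):
        transition_counts[(prev, curr)] += 1
        prev_state_counts[prev] += 1
    for state, emission in zip(states[1:-1], emissions):
        emission_counts[(state, emission)] += 1

def transition_prob(prev, curr):
    """C(prev, curr) / C(prev): the maximum likelihood transition estimate."""
    return transition_counts[(prev, curr)] / prev_state_counts[prev]

def emission_prob(state, emission):
    """C(state, emission) / C(state): the maximum likelihood emission estimate."""
    total = sum(count for (s, _), count in emission_counts.items() if s == state)
    return emission_counts[(state, emission)] / total

print(transition_prob("<start>", "dog"))   # 1.0
print(transition_prob("dog", "<end>"))     # 0.25
print(emission_prob("dog", "woof"))        # 0.75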
So how do we use HMMs for POS tagging? When we are performing POS tagging, our goal
is to find the sequence of tags T such that, given a sequence of words W, we get

T = argmax_T P(T | W)

In English, we are saying that we want to find the sequence of POS tags with the highest
probability given a sequence of words. As tag emissions are unobserved in our hidden
Markov model, we apply Bayes' rule to change this probability to an equation we can
compute using maximum likelihood estimates:

T = argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W) ∝ argmax_T P(W | T) P(T)

The second equals sign is where we apply Bayes' rule. The symbol that looks like an infinity
symbol with a piece chopped off means "proportional to". This is because P(W) is a constant
for our purposes: changing the sequence T does not change the probability P(W), so
dropping it will not make a difference to the final sequence T that maximizes the probability.

In English, the probability P(W | T) is the probability that we get the sequence of words W
given the sequence of tags T. To be able to calculate this, we still need to make a simplifying
assumption: we need to assume that the probability of a word appearing depends only on
its own tag and not on context. That is, the word does not depend on neighbouring tags
and words. Then we have

P(W | T) = Π_{i=1}^{n} P(w_i | t_i)

In English, the probability P(T) is the probability of getting the sequence of tags T. To
calculate this probability we also need to make a simplifying assumption. This assumption
gives our bigram HMM its name, and so it is often called the bigram assumption. We must
assume that the probability of getting a tag depends only on the previous tag and no other
tags. Then we can calculate P(T) as

P(T) = Π_{i=1}^{n} P(t_i | t_{i-1})

Note that we could use the trigram assumption, that is, that a given tag depends on the
two tags that came before it. As it turns out, calculating trigram probabilities for the HMM
requires a lot more work than calculating bigram probabilities, due to the smoothing
required. Trigram models do yield some performance benefits over bigram models, but for
simplicity's sake we use the bigram assumption.

Finally, we are now able to find the best tag sequence using

T = argmax_T Π_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

The probabilities in this equation should look familiar, since they are the emission probability
and the transition probability respectively:

P(w_i | t_i): emission probability
P(t_i | t_{i-1}): transition probability

Hence, if we were to draw a finite state transition network for this HMM, the observed
states would be the tags and the words would be the emitted states, similar to our woof/
meow example. We have already seen that we can use the maximum likelihood estimates
to calculate these probabilities.

Given a dataset consisting of sentences that are tagged with their corresponding POS tags,
training the HMM is as easy as calculating the emission and transition probabilities as
described above.
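As a concrete illustration, here is a minimal training sketch over a toy tagged corpus; the
two tagged sentences below are invented for illustration, and a real tagger would be trained
on a large treebank such as the Penn Treebank:

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (made-up examples).
corpus = [
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("flight", "NN"), ("includes", "VBZ"), ("a", "DT"), ("meal", "NN")],
]

transition_counts = defaultdict(int)   # (previous tag, tag) -> count
emission_counts = defaultdict(int)     # (tag, word) -> count
tag_counts = defaultdict(int)          # tag -> count, including the <s> start marker

for sentence in corpus:
    previous_tag = "<s>"
    for word, tag in sentence:
        transition_counts[(previous_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[previous_tag] += 1
        previous_tag = tag
    transition_counts[(previous_tag, "</s>")] += 1
    tag_counts[previous_tag] += 1

def transition_prob(prev_tag, tag):
    """P(tag | previous tag), estimated by maximum likelihood."""
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emission_prob(tag, word):
    """P(word | tag), estimated by maximum likelihood."""
    total = sum(c for (t, _), c in emission_counts.items() if t == tag)
    return emission_counts[(tag, word)] / total if total else 0.0

print(transition_prob("DT", "NN"))    # how often a determiner is followed by a noun here
print(emission_prob("NN", "flight"))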

A simple implementation primarily keeps count, during training, of the values required for
maximum likelihood estimation; the model then calculates the probabilities on the fly during
evaluation, using the counts collected during training.

An astute reader would wonder what the model does in the face of words it did not see
during training. We return to this topic of handling unknown words later; as we will see, it is
vital to the performance of the model to be able to handle unknown words properly.

Viterbi Decoding

The HMM gives us probabilities, but what we want is the actual sequence of tags. We need
an algorithm that can give us the tag sequence with the highest probability of being correct
given a sequence of words. An intuitive algorithm for doing this, known as greedy decoding,
chooses the tag with the highest probability for each word without considering context
such as subsequent tags. As we know, greedy algorithms don't always return the optimal
solution, and indeed greedy decoding returns a sub-optimal solution in the case of POS
tagging. This is because, after a tag is chosen for the current word, the possible tags for the
next word may be limited and sub-optimal, leading to an overall sub-optimal solution.

We instead use the dynamic programming algorithm called Viterbi. Viterbi starts by creating
two tables. The first table is used to keep track of the maximum sequence probability that
it takes to reach a given cell. If this doesn't make sense yet, that is okay: we will take a look
at an example. The second table is used to keep track of the actual path that led to the
probability in a given cell of the first table.

Let's look at an example to help this settle in. Returning to our previous woof and meow
example, given the sequence meow woof, we will use Viterbi to find the most likely sequence
of states that produced it.

First we need to create our first Viterbi table. We need a row for every state in our finite
state transition network, so our table has 4 rows for the states start, dog, cat and end. We
are trying to decode a sequence of length two, so we need four columns. In general, the
number of columns we need is the length of the full sequence we are trying to decode. The
reason we need four columns here is because the full sequence we are trying to decode is
actually

<start> meow woof <end>

The first table consists of the probabilities of getting to a given state from the previous
states. More precisely, the value in each cell of the table is given by

v_t(j) = max_i v_{t-1}(i) a_{ij} b_j(w_t)

where a_{ij} is the transition probability from state s_i to state s_j, and b_j(w_t) is the
emission probability of word w_t at state s_j.

Let's fill out the table for our example, using the probabilities we calculated for the finite
state transition network of the HMM model.

          <start>    meow    woof    <end>
Start        1
dog          0
Cat          0
End          0

Notice that the first column has 0 everywhere except for the start row, because the
sequences for our example always start with <start>. From our finite state transition
network, we see that the start state transitions to the dog state with probability 1 and never
goes to the cat state. We also see that dog emits meow with a probability of 0.25. It is also
important to note that we cannot get to the start state or the end state from the start state.



Thus we get the next column of values:

          <start>    meow               woof    <end>
Start        1        0
dog          0        1*1*0.25=0.25
Cat          0        0
End          0        0

Notice that the probabilities of all the states we can't get to from our start state are 0. Also,
the probability of getting to the dog state in the meow column is 1 * 1 * 0.25, where the
first 1 is the previous cell's probability, the second 1 is the transition probability from the
previous state to the dog state, and 0.25 is the emission probability of meow from the
current state dog. Thus 0.25 is the maximum sequence probability so far. Continuing on to
the next column:

          <start>    meow               woof                     <end>
Start        1        0                  0
dog          0        1*1*0.25=0.25      0.25*0.5*0.75=0.09375
Cat          0        0                  0.25*0.25*0.5=0.03125
End          0        0                  0

Observe that we cannot get to the start state from the dog state, and the end state never
emits woof, so both of these rows get 0 probability. Meanwhile, the cells for the dog and
cat states get the probabilities 0.09375 and 0.03125, calculated in the same way as we saw
before: the previous cell's probability of 0.25 is multiplied by the respective transition and
emission probabilities. Finally, we fill in the last column; for the end cell there are two
candidate values, one coming from dog and one from cat:

          <start>    meow               woof                     <end>
Start        1        0                  0                        0
dog          0        1*1*0.25=0.25      0.25*0.5*0.75=0.09375    0
Cat          0        0                  0.25*0.25*0.5=0.03125    0
End          0        0                  0                        0.09375*0.25=0.0234375
                                                                  0.03125*0.5=0.015625

and we then take the path with the higher probability. Going from dog to end has a higher
probability than going from cat to end, so that is the path we take. Thus the answer we get
should be

<start> dog dog <end>

In a Viterbi implementation, while we are filling out the probability table, another table
known as the backpointer table should also be filled out. The value of each cell in the
backpointer table is equal to the row index of the previous state that led to the maximum
probability of the current state. Thus in our example, the end state cell in the backpointer
table will have the value 1 (with 0-based indexing), since the state dog at row 1 is the
previous state that gave the end state the highest probability. For completeness, the
backpointer table for our example is given below.

          <start>    meow    woof    <end>
Start       -1
dog                    0       1
Cat
End                                    1

Note that the start state has a value of -1. This is the stopping condition we use when we
trace the backpointer table backwards to get the path that provides the highest probability
of being correct given our HMM. We start at the end cell on the bottom right of the table.
The value 1 in this cell tells us that the previous state, in the woof column, is at row 1; hence
the previous state must be dog.

From dog, we see that the cell is labelled 1 again, so the previous state in the meow column
before dog is also dog. Finally, in the meow column, we see that the dog cell is labelled 0,
so the previous state must be row 0, which is the <start> state. We see -1, so we stop here.
Our sequence is then <end> dog dog <start>; reversing this gives us our answer,
<start> dog dog <end>.
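The whole decoding procedure worked through above fits in a short function. The sketch
below hard-codes the transition and emission probabilities estimated earlier for the
woof/meow HMM and recovers the same path, dog dog:

# Viterbi decoding for the woof/meow HMM of this example, using the
# transition and emission probabilities estimated earlier in the chapter.
TRANSITIONS = {
    "<start>": {"dog": 1.0, "cat": 0.0,  "<end>": 0.0},
    "dog":     {"dog": 0.5, "cat": 0.25, "<end>": 0.25},
    "cat":     {"dog": 0.0, "cat": 0.5,  "<end>": 0.5},
}
EMISSIONS = {
    "dog": {"woof": 0.75, "meow": 0.25},
    "cat": {"woof": 0.5,  "meow": 0.5},
}
HIDDEN_STATES = ["dog", "cat"]

def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    # probs[t][s]: best probability of any path that ends in state s after t+1 observations.
    probs = [{s: TRANSITIONS["<start>"][s] * EMISSIONS[s][observations[0]]
              for s in HIDDEN_STATES}]
    backpointers = []

    for obs in observations[1:]:
        column, pointers = {}, {}
        for s in HIDDEN_STATES:
            best_prev = max(HIDDEN_STATES, key=lambda p: probs[-1][p] * TRANSITIONS[p][s])
            column[s] = probs[-1][best_prev] * TRANSITIONS[best_prev][s] * EMISSIONS[s][obs]
            pointers[s] = best_prev
        probs.append(column)
        backpointers.append(pointers)

    # Take the transition into the end state, then trace the backpointers.
    last = max(HIDDEN_STATES, key=lambda s: probs[-1][s] * TRANSITIONS[s]["<end>"])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["meow", "woof"]))   # ['dog', 'dog']

For a POS tagger, the hidden states would be the tags and the observations would be the
words; the table filling and the backtracking are exactly the same.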
Maximum Entropy

Maximum entropy modelling is a framework for integrating information from many
heterogeneous information sources for classification. The term maximum entropy refers to
an optimization framework in which the goal is to find the probability model that maximizes
entropy over the set of models that are consistent with the observed evidence.

A maximum entropy model is used to predict observations from training data. This does
not uniquely identify the model, but chooses the model which has the most uniform
distribution, i.e. the model with the maximum entropy. Entropy is a measure of the
uncertainty of a distribution: the higher the entropy, the more uncertain the distribution is.
Entropy measures the uniformity of a distribution, and applies to distributions in general.

The Principle of Maximum Entropy argues that the best probability model for the data is
the one which maximizes entropy, over the set of probability distributions that are
consistent with the evidence.

Maximum Entropy: A simple example

The following example illustrates the use of maximum entropy on a very simple problem.

• Model an expert translator's decisions concerning the proper French rendering of the
  English word on.
• A model p of the expert's decisions assigns to each French word or phrase f an
  estimate, p(f), of the probability that the expert would choose f as a translation of on.
  Our goal is to extract a set of facts about the decision-making process from the sample
  and construct a model of this process.
• A clue from the sample is the list of allowed translations:
  on ---> {sur, dans, par, au bord de}

With this information in hand, we can impose our first constraint on p:

p(sur) + p(dans) + p(par) + p(au bord de) = 1

The most uniform model consistent with this constraint divides the probability equally
among the four allowed translations. Suppose a second clue is now to be added:

p(dans) + p(sur) = 3/10

• Intuitive Principle: Model all that is known and assume nothing about that which is
  unknown.

More generally, consider a random process which produces an output value y, a member
of a finite set Y; here y ∈ {sur, dans, par, au bord de}. The process may be influenced by
some contextual information x, a member of a set X; x could include the words in the
English sentence surrounding on.

The maximum entropy framework can be used to solve various problems in the domain of
natural language processing, like sentence boundary detection, part-of-speech tagging,
prepositional phrase attachment, natural language parsing, and text categorization.
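Returning to the translation example above: for those two constraints the maximum
entropy distribution can be written down by hand, spreading the 3/10 of probability
uniformly over {dans, sur} and the remaining 7/10 uniformly over {par, au bord de}. The
small sketch below simply checks that this choice has higher entropy than an arbitrary
alternative satisfying the same constraints; in general, maximum entropy models are fitted
with iterative algorithms (such as GIS or L-BFGS) rather than by inspection:

from math import log2

def entropy(p):
    """Shannon entropy (in bits) of a distribution given as a dict of probabilities."""
    return -sum(prob * log2(prob) for prob in p.values() if prob > 0)

# Both distributions satisfy the constraints:
#   p(sur) + p(dans) + p(par) + p(au bord de) = 1   and   p(dans) + p(sur) = 3/10
max_ent_model = {"dans": 0.15, "sur": 0.15, "par": 0.35, "au bord de": 0.35}
alternative   = {"dans": 0.30, "sur": 0.00, "par": 0.10, "au bord de": 0.60}

print(entropy(max_ent_model))   # about 1.88 bits: the most uniform model allowed
print(entropy(alternative))     # lower entropy: it encodes extra, unjustified assumptions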
Conditional Random Field (CRF)

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags
like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

For example, given the sentence "Bob drank coffee at Starbucks", the labeling might be
"Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)".

So let's build a conditional random field to label sentences with their parts of speech. Just
like any classifier, we'll first need to decide on a set of feature functions f_i.

Feature Functions in a CRF

In a CRF, each feature function is a function that takes as input:

• the sentence s
• the position i of a word in the sentence
• the label l_i of the current word
• the label l_{i-1} of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1).
For example, one possible feature function could measure how much we suspect that the
current word should be labeled as an adjective, given that the previous word is "very".


Features to Probabilities

Next, assign each feature function f_j a weight λ_j. Given a sentence s, we can now score
a labelling l of s by adding up the weighted features over all the words in the sentence:

score(l | s) = Σ_{j=1}^{m} Σ_{i=1}^{n} λ_j f_j(s, i, l_i, l_{i-1})

where the first sum runs over the feature functions j and the second sum runs over the
positions i of the sentence.

Finally, we can transform these scores into probabilities p(l | s) between 0 and 1 by
exponentiating and normalizing:

p(l | s) = exp[score(l | s)] / Σ_{l'} exp[score(l' | s)]

Example Feature Functions

So what do these feature functions look like? Examples of POS tagging features could
include:

f1(s, i, l_i, l_{i-1}) = 1 if l_i = ADVERB and the i-th word ends in "-ly"; 0 otherwise. If the
weight λ1 associated with this feature is large and positive, then this feature is essentially
saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

f2(s, i, l_i, l_{i-1}) = 1 if i = 1, l_i = VERB, and the sentence ends in a question mark; 0
otherwise. Again, if the weight λ2 associated with this feature is large and positive, then
labelings that assign VERB to the first word in a question (e.g., "Is this a sentence beginning
with a verb?") are preferred.

f3(s, i, l_i, l_{i-1}) = 1 if l_{i-1} = ADJECTIVE and l_i = NOUN; 0 otherwise. Again, a positive
weight for this feature means that adjectives tend to be followed by nouns.

f4(s, i, l_i, l_{i-1}) = 1 if l_{i-1} = PREPOSITION and l_i = PREPOSITION; 0 otherwise. A
negative weight λ4 for this function would mean that prepositions don't tend to follow
prepositions, so we should avoid labelings where this happens.

In short, to build a conditional random field, you just define a bunch of feature functions
(which can depend on the entire sentence, a current position, and nearby labels), assign
them weights, and add them all together, transforming at the end to a probability if
necessary.

CRF vs HMM

CRFs can define a much larger set of features. Whereas HMMs are necessarily local in
nature (because they are constrained to binary transition and emission feature functions,
which force each word to depend only on the current label and each label to depend only
on the previous label), CRFs can use more global features. For example, one of the features
in our POS tagger above increased the probability of labelings that tagged the first word
of a sentence as a VERB if the end of the sentence contained a question mark.

CRFs can also have arbitrary weights, whereas the probabilities of an HMM must satisfy
certain constraints.
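Putting the scoring formula and the example feature functions together, here is a minimal
sketch of CRF-style scoring. The feature functions mirror f1 and f3 above, while the weights,
the label set and the brute-force normalization are simplifications for illustration (a real
CRF learns its weights from data and normalizes with dynamic programming rather than
by enumerating every labelling):

from math import exp
from itertools import product

LABELS = ["NOUN", "VERB", "ADJECTIVE", "ADVERB", "PREPOSITION"]

# Illustrative feature functions f(sentence, i, current label, previous label).
def f1(s, i, li, prev):
    """Words ending in -ly prefer the ADVERB label."""
    return 1 if li == "ADVERB" and s[i].endswith("ly") else 0

def f3(s, i, li, prev):
    """Adjectives tend to be followed by nouns."""
    return 1 if prev == "ADJECTIVE" and li == "NOUN" else 0

FEATURES = [(f1, 1.5), (f3, 0.8)]   # (feature function, made-up weight)

def score(labels, sentence):
    """Sum of weighted feature values over all positions of the sentence."""
    return sum(weight * f(sentence, i, labels[i], labels[i - 1] if i > 0 else "<start>")
               for i in range(len(sentence)) for f, weight in FEATURES)

def probability(labels, sentence):
    """exp(score) normalized over every possible labelling of the sentence."""
    z = sum(exp(score(list(l), sentence)) for l in product(LABELS, repeat=len(sentence)))
    return exp(score(labels, sentence)) / z

sentence = ["dogs", "bark", "loudly"]
print(probability(["NOUN", "VERB", "ADVERB"], sentence))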

Expected Questions

1. What is POS tagging? Explain types of word classes in English NL. Also comment on
   possible tag sets available in ENGLISH NL.

2. Explain open and closed word classes in the English language. Comment on possible
   tag sets available in ENGLISH NL. Show how the tags are assigned to the words of the
   following sentence:
   "Time flies like an arrow."

3. Why is POS tagging hard? Discuss possible challenges to be faced while performing
   POS tagging.

4. Discuss various approaches/algorithms to perform POS tagging.

5. Explain in detail Rule based POS tagging / Stochastic (HMM) POS tagging / Hybrid
   POS tagging.

6. Explain transformation-based POS tagging with a suitable example.

7. What do you mean by constituency? Explain the following key constituents with
   suitable examples w.r.t. the English language: 1) Noun phrases 2) Verb phrases
   3) Prepositional phrases


8. Explain CFG with a suitable example. Discuss the following potential problems in CFG:
   1) Agreement 2) Subcategorization 3) Movement.

9. What is parsing? Explain the Top-down and Bottom-up approaches to parsing with
   suitable examples.

10. W.r.t. the following grammar, show shift-reduce parsing of the following sentences:
    Book that flight
    Does that flight include meal

    S → NP VP        S → Aux NP VP       S → VP         NP → Det NOM
    NOM → Noun       NOM → Noun NOM      VP → Verb      VP → Verb NP
    Det → that | this | a | the          Noun → book | flight | meal | man
    Verb → book | include | read         Aux → does

11. Compare between the Top-down and Bottom-up parsing approaches.