OBJECTIVES

After reading this chapter, the student will be able to understand:
Part-of-Speech tagging (POS): the tag set for English (Penn Treebank), rule-based POS tagging, and stochastic POS tagging.
Parts of speech can be divided into two broad supercategories: closed class types and open class types. Closed classes are those that have relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages (e.g. the new verb to fax or the borrowed noun futon). It is likely that any given speaker or corpus will have different open class words, but all speakers of a language, and corpora that are large enough, will likely share the set of closed class words. Closed class words are generally also function words; function words are grammatical words like of, it, and, or you, which tend to be very short, occur frequently, and play an important role in grammar.

There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. It turns out that English has all four of these, although not every language does; many languages have no adjectives. Every known human language has at least two categories: noun and verb.

Noun is the name given to the lexical class in which the words for most people, places, or things occur. But since lexical classes like noun are defined functionally (morphologically and syntactically) rather than semantically, some words for people, places, and things may not be nouns, and conversely some nouns may not be words for people, places, or things. Thus nouns include concrete terms like ship and chair, and abstractions like bandwidth and relationship.

The verb class includes most of the words referring to actions and processes, including main verbs like draw, provide, differ, and go. English verbs have a number of morphological forms: non-3rd-person-sg (eat), 3rd-person-sg (eats), progressive (eating), past participle (eaten). A subclass of English verbs called auxiliaries will be discussed when we turn to closed class forms.

The third open class English form is adjectives; semantically this class includes many terms that describe properties or qualities. Most languages have adjectives for the concepts of color (white, black), age (old, young), and value (good, bad), but there are languages without adjectives. As we discussed above, many linguists argue that the Chinese family of languages uses verbs to describe such English-adjectival notions as color and age.

The final open class form is adverbs. What coherence the class has semantically may be solely that each of these words can be viewed as modifying something (often verbs, hence the name 'adverb', but also other adverbs and entire verb phrases). Directional adverbs or locative adverbs (home, here, downhill) specify the direction or location of some action; degree adverbs (extremely, very, somewhat) specify the extent of some action, process, or property; manner adverbs (slowly, slinkily, delicately) describe the manner of some action or process; and temporal adverbs describe the time that some action or event took place (yesterday, Monday). Because of the heterogeneous nature of this class, some adverbs (for example temporal adverbs like Monday) are tagged in some tagsets as nouns.
Fragment of the Penn Treebank tagset:

Tag    Description
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
MD     Modal
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RBR    Adverb, comparative
RBS    Adverb, superlative

2. Rule-based POS tagging, Stochastic POS tagging

Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus. Tags are also usually applied to punctuation markers; thus, tagging for natural language is the same process as tokenization for computer languages, although tags for natural languages are much more ambiguous. As we suggested at the beginning of the chapter, taggers play an increasingly important role in speech recognition, natural language parsing and information retrieval. The input to a tagging algorithm is a string of words and a specified tagset of the kind described in the previous section. The output is a single best tag for each word. For example:
(1) Book/VB that/DT flight/NN .
(2) Does/VBZ that/DT flight/NN serve/VB dinner/NN ?
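Tagged examples like (1) and (2) can be reproduced, at least approximately, with an off-the-shelf tagger. A minimal sketch using NLTK, assuming the package and its pretrained tagger model are available (the model's download name and the exact tags returned can vary by NLTK version):

    import nltk
    nltk.download('averaged_perceptron_tagger', quiet=True)  # pretrained model

    print(nltk.pos_tag(['Book', 'that', 'flight']))
    print(nltk.pos_tag(['Does', 'that', 'flight', 'serve', 'dinner', '?']))
    # Output is a list of (word, tag) pairs using the Penn Treebank tagset,
    # e.g. [('Does', 'VBZ'), ('that', 'DT'), ('flight', 'NN'), ...]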
Even in these simple examples, automatically assigning a tag to each word is not trivial. For example, book is ambiguous. That is, it has more than one possible usage and part of speech. It can be a verb (as in book that flight or to book the suspect) or a noun (as in a book of matches). Similarly, that can be a determiner (as in Does that flight serve dinner), or a complementizer (as in I thought that your flight was earlier). The problem of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context. Part-of-speech tagging is thus one of the many disambiguation tasks we will see in this book. How hard is the tagging problem? Most words in English are unambiguous; i.e. they have only a single tag. But many of the most common words of English are ambiguous (for example can can be an auxiliary ('to be able'), a noun ('a can of beer'), or a verb ('to can tomatoes')).

In a rule-based tagger, the rules can be categorized into two parts:
a. Lexical rules: word level
b. Context-sensitive rules: sentence level
The transformation-based approaches use a predefined set of handcrafted rules as well as automatically induced rules that are generated during training.
Basic idea:
1. Assign all possible tags to words.
2. Remove tags according to a set of rules of the type: "if word+1 is an adj, adv, or quantifier and the following is a sentence boundary, eliminate non-ADV tags" (the full version of this rule is given in the ENGTWOL section below).
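A minimal sketch of this two-step idea in Python (the mini-lexicon and the single Det-Noun rule below are hypothetical stand-ins, not an actual tagger's rule set):

    # Step 1: assign every tag the lexicon allows; step 2: prune by context rules.
    LEXICON = {                        # hypothetical mini-lexicon
        'the': {'DT'},
        'book': {'NN', 'VB'},
        'that': {'DT', 'CS', 'ADV'},
        'flight': {'NN'},
    }

    def lexical_stage(words):
        # Unknown words default to NN here, purely for illustration.
        return [set(LEXICON.get(w.lower(), {'NN'})) for w in words]

    def det_noun_rule(tagsets):
        # Context rule: an ambiguous word right after a determiner is a noun.
        for i in range(1, len(tagsets)):
            if tagsets[i - 1] == {'DT'} and len(tagsets[i]) > 1 and 'NN' in tagsets[i]:
                tagsets[i] = {'NN'}
        return tagsets

    print(det_noun_rule(lexical_stage(['the', 'book'])))   # [{'DT'}, {'NN'}]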
For example, consider the sentence She promised to back the bill, correctly tagged She/PRP promised/VBD to/TO back/VB the/DT bill/NN: the word back is ambiguous (it could be tagged JJ or NN in other contexts), and the tagger must choose VB here.
ENGTWOL Rule-Based Tagger

A two-stage architecture:
1. Use a lexicon FST (dictionary) to tag each word with all possible POS.
2. Apply hand-written rules to eliminate tags.
The rules eliminate tags that are inconsistent with the context, and should reduce the list of POS tags to a single POS per word.

Det-Noun Rule: if an ambiguous word follows a determiner, tag it as a noun.

ENGTWOL Adverbial-that Rule
Given input "that":
if the next word is adj, adverb, or quantifier, and following that is a sentence boundary, and the previous word is not a verb like "consider" which allows adjs as object complements,
then eliminate non-ADV tags,
else eliminate ADV tag.

I consider that odd. (that is NOT ADV)
It isn't that strange. (that is an ADV)

This approach does work and produces accurate results. What are the drawbacks? It is extremely labor-intensive.

Sample ENGTWOL lexicon:

Word       POS   Additional POS features
smaller    ADJ   COMPARATIVE
entire     ADJ   ABSOLUTE ATTRIBUTIVE
fast       ADV   SUPERLATIVE
that       DET   CENTRAL DEM SG
that       PRON  DEM SG
that       CS    (complementizer)
that       ADV
she        PRON  PERSONAL FEMININE NOMINATIVE SG3
show       V     PRESENT SG3 VFIN
show       N     NOMINATIVE SG
shown      PCP2  SVOO SVO SV
occurred   PCP2  SV
occurred   V     PAST VFIN SV

Stage 1 of ENGTWOL Tagging
First stage: run words through a Kimmo-style morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation ...
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Stage 2 of ENGTWOL Tagging
Second stage: apply the hand-written constraints to eliminate the tags that are inconsistent with the context.
A stochastic tagger chooses the tag yielding the greater of these two probabilities:

P(VB|TO) P(race|VB)   ...(6)

and

P(NN|TO) P(race|NN)   ...(7)

Equation 3 and its instantiations, Equations 6 and 7, each have two probabilities: a tag sequence probability P(ti|ti-1) and a word-likelihood P(wi|ti). For race, the tag sequence probabilities P(NN|TO) and P(VB|TO) give us the answer to the question "how likely are we to expect a verb (noun) given the previous tag?". They can just be computed from a corpus by counting and normalizing. We would expect that a verb is more likely to follow TO than a noun is, since infinitives (to race, to run, to eat) are common in English. While it is possible for a noun to follow TO (walk to school, related to hunting), it is less common. Sure enough, a look at the combined Brown and Switchboard corpora gives us the following probabilities, showing that verbs are fifteen times as likely as nouns after TO:

P(NN|TO) = .021
P(VB|TO) = .34

The second part of Equation 3 and its instantiations, Equations 6 and 7, is the lexical likelihood: the likelihood of the noun race given each tag, P(race|VB) and P(race|NN). Note that this likelihood term is not asking "which is the most likely tag for this word". That is, the likelihood term is not P(VB|race). Instead we are computing P(race|VB). The probability, slightly counterintuitively, answers the question "if we were expecting a verb, how likely is it that this verb would be race?".
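As a quick worked check, the comparison between (6) and (7) can be computed directly. The transition probabilities below are the corpus-derived values just quoted; the two lexical likelihoods are hypothetical stand-ins, since the text does not list them:

    p_tag = {'NN': 0.021, 'VB': 0.34}           # P(tag | previous tag = TO)
    p_word = {'NN': 0.00041, 'VB': 0.00003}     # assumed P(race | tag)

    for tag in ('NN', 'VB'):
        print(tag, p_tag[tag] * p_word[tag])
    # NN 8.61e-06  vs  VB 1.02e-05 -> VB wins, so to/TO race/VB is preferred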
The Transformation-Based Learning (TBL) approach to machine learning draws inspiration from both the rule-based and stochastic taggers. Like the rule-based taggers, TBL is based on rules that specify what tags should be assigned to what words. But like the stochastic taggers, TBL is a machine learning technique, in which rules are automatically induced from the data. Like some but not all of the HMM taggers, TBL is a supervised learning technique; it assumes a pre-tagged training corpus. Samuel et al. (1998a) offer a useful analogy for understanding the TBL paradigm, which they credit to Terry Harvey. Imagine an artist painting a picture of a white house with green trim against a blue sky. Suppose most of the picture was sky, and hence most of the picture was blue. The artist might begin by using a very broad brush and painting the entire canvas blue. Next she might switch to a somewhat smaller white brush, and paint the entire house white. She would just color in the whole house, not worrying about the brown roof, or the blue windows or the green gables. Next she takes a smaller brown brush and colors over the roof. Now she takes up the blue paint on a small brush and paints in the blue windows on the barn. Finally she takes a very fine green brush and does the trim on the gables. The painter starts with a broad brush that covers a lot of the canvas but colors a lot of areas that will have to be repainted. The next layer colors less of the canvas, but also makes fewer 'mistakes'. Each new layer uses a finer brush that corrects less of the picture, but makes fewer mistakes. TBL uses somewhat the same method as this painter. The TBL algorithm has a set of tagging rules. A corpus is first tagged using the broadest rule, i.e. the one that applies to the most cases. Then a slightly more specific rule is chosen, which changes some of the original tags. Next an even narrower rule is applied, which changes a smaller number of tags (some of which might be previously-changed tags).
After selecting the most-likely tag, Brill's tagger applies its transformation rules. As it happens, Brill's tagger learned a rule that applies exactly to this mistagging of race: change NN to VB when the previous tag is TO. This rule would change race/NN to race/VB. Some of the learned transformations:

From   To   Condition                          Example
NN     VB   previous tag is TO                 to/TO race/NN → race/VB
VBP    VB   one of the previous 3 tags is MD   might/MD vanish/VBP → vanish/VB
NN     VB   one of the previous 2 tags is MD
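A minimal sketch of applying Brill-style transformations to an initially tagged sentence (the rule encoding here is invented for illustration; Brill's tagger uses a richer set of rule templates):

    # Each rule: change from_tag to to_tag if `trigger` occurs among the
    # previous `window` tags.
    def apply_transformations(tagged, rules):
        for from_tag, to_tag, window, trigger in rules:
            for i, (word, tag) in enumerate(tagged):
                prev_tags = [t for _, t in tagged[max(0, i - window):i]]
                if tag == from_tag and trigger in prev_tags:
                    tagged[i] = (word, to_tag)
        return tagged

    rules = [
        ('NN', 'VB', 1, 'TO'),    # change NN to VB when the previous tag is TO
        ('VBP', 'VB', 3, 'MD'),   # ... when one of the previous 3 tags is MD
    ]
    print(apply_transformations([('to', 'TO'), ('race', 'NN')], rules))
    # -> [('to', 'TO'), ('race', 'VB')]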
Syntax Analysis

Meaning: A sentence has a semantic structure that is critical to determine for many tasks. For instance, there is a big difference between Foxes eat rabbits and Rabbits eat foxes. Both describe some eating event, but in the first the fox is the eater, and in the second it is the one being eaten. We need some way to represent overall sentence structure that captures such relationships.

Agreement: In English as in most languages, there are constraints between forms of words that should be maintained, and which are useful in analysis for eliminating implausible interpretations. For instance, English requires number agreement between the subject and the verb (among other things). Thus we can say Foxes eat rabbits but *Foxes eats rabbits is ill formed. In contrast, The fox eats a rabbit is fine whereas *The fox eat a rabbit is not.

Recursion: Natural languages have a recursive structure that is important to capture. For instance, Foxes that eat chickens eat rabbits is best analyzed as having an embedded sentence (i.e. that eat chickens) modifying the noun phrase Foxes. This one additional rule allows us to also interpret Foxes that eat chickens that eat corn eat rabbits and a host of other sentences.

Long Distance Dependencies: Because of the recursion, word dependencies like number agreement can be arbitrarily far apart from each other. The sentence *Foxes that eat chickens that eat corn eats rabbits is awkward because the subject (Foxes) is plural while the main verb (eats) is singular. In principle, we could have an arbitrary number of words between the subject and the verb.

Grammatical relations are a formalization of ideas from traditional grammar about SUBJECTS and OBJECTS. In the sentence She ate a mammoth breakfast, the noun phrase She is the SUBJECT and a mammoth breakfast is the OBJECT. Subcategorization and dependency relations refer to certain kinds of relations between words and phrases. For example, the verb want can be followed by an infinitive, as in I want to fly to Detroit, or a noun phrase, as in I want a flight to Detroit. But the verb find cannot be followed by an infinitive (*I found to fly to Dallas). These are called facts about the subcategory of the verb. All of these kinds of syntactic knowledge can be modelled by various kinds of grammars that are based on context-free grammars. Context-free grammars are thus the backbone of many models of the syntax of natural language (and of computer languages). As such they are integral to most models of natural language processing.

Noam Chomsky, in 1957, used the following notation, called productions, to define the syntax of English. The terms used here (sentence, noun phrase, verb phrase, and so on) are syntactic categories. The following rules describe a very small subset of English sentences. The articles a, an, and the have been categorized as adjectives for simplicity.

<sentence> → <noun phrase> <verb phrase>
<noun phrase> → <adjective> <noun phrase> | <adjective> <singular noun>
<verb phrase> → <singular verb> <adverb>

Thus, a noun phrase is defined as an adjective followed by another noun phrase, or as an adjective followed by a singular noun. This definition of noun phrase is recursive because noun phrase occurs on both sides of the production. Grammars are recursive to allow for infinite-length strings. This grammar is said to be context-free because only one syntactic category, e.g., <noun phrase>, occurs on the left of the arrow. If there were more than one syntactic category, this would describe a context and be called context-sensitive. A grammar is an example of a metalanguage: a language used to describe another language. Here, the metalanguage is the context-free grammar used to describe a part of the English language. Figure X shows a diagram called a parse tree, or structure tree, for the sentence the little boy ran quickly. The sentence Quickly, the little boy ran is also a syntactically correct sentence, but it cannot be derived from the above grammar.
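The production rules above can be written down and tested directly; a sketch using NLTK's CFG tools (assuming the nltk package is installed):

    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> ADJ NP | ADJ N
        VP  -> V ADV
        ADJ -> 'the' | 'little'
        N   -> 'boy'
        V   -> 'ran'
        ADV -> 'quickly'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse('the little boy ran quickly'.split()):
        tree.pretty_print()        # prints the parse tree as text

    # 'quickly the little boy ran' is not derivable: the parser yields no trees.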
Each subtree of the parse tree has a root labeled with a grammatical category that appears on the left-hand side of a rule, and the children of that root are identical to the elements on the right-hand side of that rule.

Figure X: Parse tree for the sentence the little boy ran quickly: <sentence> expands into <noun phrase> (The/<adjective>, little/<adjective>, boy/<singular noun>) and <verb phrase> (ran/<singular verb>, quickly/<adverb>).

Potential Problems in CFG
• Agreement
• Subcategorization
• Movement
Agreement

This dog          *This dogs
Those dogs        *Those dog
This dog eats     *This dog eat
Those dogs eat    *Those dogs eats

In fact, it is impossible to describe all the correct English sentences using a context-free grammar. On the other hand, it is possible, using the grammar above, to derive the syntactically correct, but semantically incorrect, string little the boy ran quickly. Context-free grammars cannot describe semantics. E.g., the rule s → np vp means that "a sentence is defined as a noun phrase followed by a verb phrase." Figure X1 shows a simple CFG that describes the sentences from a small subset of English.
5. Subcategorization
s → np vp
np → det n
vp → tv np
vp → iv
det → the
det → a
det → an
n → giraffe
n → apple
iv → dreams
tv → eats
tv → dreams

Figure 2: A grammar and a parse tree for "the giraffe dreams"

Subcategorization frames:
Sneeze:  John sneezed  (*John sneezed the book)
Find:    Please find [a flight to NY]NP
Give:    Give [me]NP [a cheaper fare]NP
Help:    Can you help [me]NP [with a flight]PP
Prefer:  I prefer [to leave earlier]TO-VP
Told:    I was told [United has a flight]S
So the various rules for VPs overgenerate: they permit the presence of strings containing verbs and arguments that don't go together. For example, VP → V NP would admit Sneezed the book as a VP, since "sneeze" is a verb and "the book" is a valid NP. Subcategorization frames can fix this problem ("slow down" overgeneration); a sketch is given below.

6. Movement
Core example: a constituent can appear displaced far from the verb on which it depends; note that in such cases it can be separated from its verb by two other verbs.

In syntactic parsing, the parser can be viewed as searching through the space of all possible parse trees to find the correct parse tree for the sentence. The search space of possible parse trees is defined by the grammar. For example, consider the following sentence:

Book that flight.   ...(1)

Using the miniature grammar and lexicon in Figure 2, which consists of some of the CFG rules for English, the correct parse tree that would be assigned to this example is shown in Figure 1.

Figure 1: The parse tree for Book that flight.
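A sketch of how subcategorization frames block this kind of overgeneration, in the style of the Figure 2 grammar (the verb "sneezes" and its intransitive-only entry are added here for illustration):

    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> TV NP | IV
        Det -> 'the' | 'a' | 'an'
        N  -> 'giraffe' | 'apple'
        IV -> 'dreams' | 'sneezes'
        TV -> 'eats'
    """)
    parser = nltk.ChartParser(grammar)

    def n_parses(sent):
        return len(list(parser.parse(sent.split())))

    print(n_parses('the giraffe sneezes'))            # 1
    print(n_parses('the giraffe eats the apple'))     # 1
    print(n_parses('the giraffe sneezes the apple'))  # 0: no TV entry, blocked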
Parsing With Context-Free Grammar
Parse trees are directly useful in applications such as grammar checking in word-processing systems; a sentence which cannot be parsed may have grammatical errors (or at least be hard to read). In addition, parsing is an important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. Parsing algorithms have also recently begun to be incorporated into speech recognizers, both for language models and for acoustic and phonotactic modeling.
One kind of constraint comes from the data, i.e. the input sentence itself. Whatever else is true of the final parse tree, we know that there must be three leaves, and they must be the words book, that, and flight. The second kind of constraint comes from the grammar. We know that whatever else is true of the final parse tree, it must have one root, which must be the start symbol S. These two constraints give rise to the two search strategies underlying most parsers: top-down or goal-directed search and bottom-up or data-directed search.

Top-Down Parsing
A top-down parser searches for a parse tree by trying to build from the root node S down to the leaves. Let's consider the search space that a top-down parser explores, assuming for the moment that it builds all possible trees in parallel. The algorithm starts by assuming the input can be derived by the designated start symbol S. The next step is to find the tops of all trees which can start with S, by looking for all the grammar rules with S on the left-hand side. In the grammar in Figure 2, there are three rules that expand S, so the second ply, or level, of the search space in Figure 3 has three partial trees. We next expand the constituents in these three new trees just as we originally expanded S: the first tree tells us to expect an NP followed by a VP, the second an Aux followed by an NP and a VP, and the third a VP by itself.

Figure 3: An expanding top-down search space. Each ply is created by taking each tree from the previous ply, replacing the leftmost non-terminal with each of its possible expansions, and collecting each of these trees into a new ply.

Bottom-Up Parsing

Bottom-up parsing is the earliest known parsing algorithm, and is used in the shift-reduce parsers common for computer languages. In bottom-up parsing, the parser starts with the words of the input, and tries to build trees from the words up, again by applying rules from the grammar one at a time. The parse is successful if the parser succeeds in building a tree rooted in the start symbol S that covers all of the input. Figure 4 shows the bottom-up search space, beginning with the sentence Book that flight. The parser begins by looking up each word (book, that, and flight) in the lexicon and building three partial trees with the part of speech for each word. But the word book is ambiguous; it can be a noun or a verb. Each of the trees in the second ply is then expanded: in the parse on the left (the one in which book is incorrectly considered a noun), the Nominal → Noun rule is applied to both of the Nouns (book and flight). This same rule is also applied to the sole Noun (flight) on the right, producing the trees on the third ply. In general, the parser extends one ply to the next by looking for places where the right-hand side of some grammar rule can apply.

Figure 4: An expanding bottom-up search space for the sentence Book that flight. This figure does not show the final tier of the search with the correct parse tree of Figure 1.

The top-down strategy never wastes time exploring trees that cannot result in an S, since it begins by generating just those trees. This means it also never explores subtrees that cannot find a place in some S-rooted tree. In the bottom-up strategy, by contrast, trees that have no hope of leading to an S, or fitting in with any of their neighbours, are generated with wild abandon. For example, the left branch of the search space in Figure 4 is completely wasted effort; it is based on interpreting book as a Noun at the beginning of the sentence, despite the fact that no such tree can lead to an S given this grammar. The top-down approach has its own inefficiencies. While it does not waste time with trees that do not lead to an S, it does spend considerable effort on S trees that are not consistent with the input. Note that the first four of the six trees in the third ply in Figure 3 all have left branches that cannot match the word book. None of these trees could possibly be used in parsing this sentence. This weakness in top-down parsers arises from the fact that they can generate trees before ever examining the input. Bottom-up parsers, on the other hand, never suggest trees that are not at least locally grounded in the actual input. Neither of these approaches adequately exploits the constraints presented by the grammar and the input words.
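Both search strategies can be tried out on a small grammar with NLTK's textbook parsers; a sketch (RecursiveDescentParser is top-down, ShiftReduceParser is bottom-up; the latter has no backtracking and can fail on some inputs):

    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> TV NP | IV
        Det -> 'the'
        N  -> 'giraffe' | 'apple'
        IV -> 'dreams'
        TV -> 'eats'
    """)
    sentence = 'the giraffe eats the apple'.split()

    # Top-down, goal-directed: expands from the start symbol S.
    for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
        print(tree)

    # Bottom-up, data-directed: builds from the words up (shift-reduce).
    for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
        print(tree)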
7. Sequence labeling

In natural language processing, it is a common task to extract words or phrases of particular types from a given sentence or paragraph. For example, when performing analysis of a corpus of news articles, we may want to know which countries are mentioned in the articles, and how many articles are related to each of these countries.
Going back to the cat and dog example, suppose we observed the following two state sequences:

dog, cat, cat
dog, dog, dog

Then the transition probabilities can be calculated using the maximum likelihood estimate:

P(si | si-1) = C(si-1, si) / C(si-1)

For example, from the state sequences we can see that the sequences always start with dog. Thus we are at the start state twice, and both times we transition to dog and never to cat. Hence the transition probability from the start state to dog is 1, and from the start state to cat it is 0.

Let's try one more: let's calculate the transition probability of going from the state dog to the state end. From our example state sequences, we see that dog only transitions to the end state once. We also see that there are four observed instances of dog. Thus the transition probability of going from the dog state to the end state is 0.25. The other transition probabilities can be calculated in a similar fashion. Note that each edge of the network is labeled with a number representing the probability that a given transition will happen at the current state, and that the probabilities of the transitions out of any given state always sum to 1.

In the finite state transition network pictured above, each state was observable: we are able to see how often a cat meows after a dog woofs. What if our cat and dog were bilingual? That is, what if both the cat and the dog can meow and woof? Furthermore, let's assume that we are given the states of dog and cat and we want to predict the sequence of meows and woofs from the states. In this case, we can only observe the dog and the cat, but we need to predict the unobserved meows and woofs that follow. The meows and woofs are the hidden states.

The emission probabilities can also be calculated using maximum likelihood estimates. Let's calculate the emission probability of dog emitting woof, given the following emissions for our two state sequences above:

woof, woof, meow
meow, woof, woof

Dog emits woof in three of its four observed instances, so this emission probability is 0.75.

Figure 7: A finite state transition network representing an HMM. The figure is a finite state transition network that represents our HMM; the black arrows represent emissions of the unobserved states woof and meow.
s woof and meow.
| —
ae :
ae 2s) nara Language Processing
ne : ‘ a
symacanai as(a15)
Ir ‘ ‘ wedi tet letniRear.
<—:3e0
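The counting described above is easy to do in code. A minimal sketch that reproduces the numbers in this example (start→dog = 1.0, dog→end = 0.25, dog emits woof = 0.75):

    from collections import Counter

    state_seqs = [['dog', 'cat', 'cat'], ['dog', 'dog', 'dog']]
    emission_seqs = [['woof', 'woof', 'meow'], ['meow', 'woof', 'woof']]

    trans, from_count, emit, state_count = Counter(), Counter(), Counter(), Counter()
    for states, emissions in zip(state_seqs, emission_seqs):
        path = ['<start>'] + states + ['<end>']
        for prev, cur in zip(path, path[1:]):
            trans[(prev, cur)] += 1
            from_count[prev] += 1
        for s, o in zip(states, emissions):
            emit[(s, o)] += 1
            state_count[s] += 1

    def p_trans(prev, cur):     # maximum likelihood: C(prev, cur) / C(prev)
        return trans[(prev, cur)] / from_count[prev]

    def p_emit(state, obs):     # C(state emits obs) / C(state)
        return emit[(state, obs)] / state_count[state]

    print(p_trans('<start>', 'dog'))   # 1.0
    print(p_trans('dog', '<end>'))     # 0.25
    print(p_emit('dog', 'woof'))       # 0.75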
So how do we use HMMs for POS tagging? When we are performing POS tagging, our goal is to find the sequence of tags T such that, given a sequence of words W,

T = argmax_T P(T|W)

In English, we are saying that we want to find the sequence of POS tags with the highest probability given a sequence of words. As tags are unobserved in our hidden Markov model, we apply Bayes' rule to change this probability to an equation we can compute using maximum likelihood estimates:

T = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W) ∝ argmax_T P(W|T) P(T)

The second equals sign is where we apply Bayes' rule. The symbol that looks like an infinity symbol with a piece chopped off means "proportional to". This is because P(W) is a constant for our purposes: changing the sequence T does not change the probability P(W), so dropping it will not make a difference in the final sequence T that maximizes the probability.

In English, the probability P(T) is the probability of getting the sequence of tags T. To calculate this probability we also need to make a simplifying assumption. This assumption gives our bigram HMM its name, and so it is often called the bigram assumption. We must assume that the probability of getting a tag depends only on the previous tag and no other tags. Then we can calculate P(T) as

P(T) = ∏(i=1..n) P(ti | ti-1)

Note that we could use the trigram assumption, that is, that a given tag depends on the two tags that came before it. As it turns out, calculating trigram probabilities for the HMM requires a lot more work than calculating bigram probabilities due to the smoothing required. Trigram models do yield some performance benefits over bigram models, but for simplicity's sake we use the bigram assumption.

Finally, we are now able to find the best tag sequence using

T = argmax_T ∏(i=1..n) P(wi | ti) P(ti | ti-1)

The probabilities in this equation should look familiar, since they are the emission probability and the transition probability respectively:

P(wi | ti): emission probability
P(ti | ti-1): transition probability

Hence if we were to draw a finite state transition network for this HMM, the states would be the tags and the words would be the emitted observations, similar to our woof/meow example (Figure 8). We have already seen that we can use maximum likelihood estimates to calculate these probabilities.

Figure 8: Finite state transition network of the hidden Markov model of our example (transitions labeled 0.5; emissions woof and meow).

Given a dataset consisting of sentences that are tagged with their corresponding POS tags, training the HMM is as easy as calculating the emission and transition probabilities as described above (see the counting sketch earlier for an example implementation).
The HMM gives us probabilities, but what we want is the actual sequence of tags. We need an algorithm that can give us the tag sequence with the highest probability of being correct given a sequence of words. An intuitive algorithm for doing this, known as greedy decoding, chooses the tag with the highest probability for each word without considering context such as subsequent tags. As we know, greedy algorithms don't always return the optimal solution, and indeed greedy decoding returns a sub-optimal solution in the case of POS tagging. This is because after a tag is chosen for the current word, the possible tags for the next word may be limited, leading to an overall sub-optimal solution.

We instead use the dynamic programming algorithm called Viterbi. Viterbi starts by creating two tables. The first table is used to keep track of the maximum sequence probability that it takes to reach a given cell. If this doesn't make sense yet, that is okay; we will take a look at an example. The second table is used to keep track of the actual path that led to the probability in a given cell in the first table.

The first table consists of the probabilities of getting to a given state from previous states. More precisely, the value in each cell is the previous cell's probability, multiplied by the transition probability from the previous state to the current state and by the emission probability of the current observation, maximized over all previous states.

Let's look at an example to help this settle in. Returning to our previous woof and meow example, given the observation sequence meow, woof, let's fill out the table using the probabilities we calculated for the finite state transition network of the HMM model.

         <start>   meow           woof                     <end>
Start    1         0              0                        0
dog      0         1*0.25=0.25    0.25*0.5*0.75=0.09375    0
Cat      0         0              0.25*0.25*0.5=0.03125    0
End      0         0              0                        0.09375*0.25=0.0234375

Notice that the first column has 0 everywhere except for the start row. This is because the sequences for our example always start with <start>. From our finite state transition network, we see that the start state transitions to the dog state with probability 1 and that dog emits meow with a probability of 0.25; the start state never goes to the cat state. It is also important to note that we cannot get to the start state or the end state directly from the start state.

Observe that we cannot get back to the start state from the dog state, and the end state never emits woof, so both of these rows get 0 probability in the woof column. Meanwhile, the cells for the dog and cat states get the probabilities 0.09375 and 0.03125, calculated in the same way as we saw before: the previous cell's probability of 0.25 is multiplied by the respective transition and emission probabilities.

The corresponding cell of the back pointer table will have the value 1 (with a 0-based starting index), since the state dog at row 1 is the previous state that gave the end state the highest probability. For completeness, the back pointer table for our example is given below.

         <start>   meow   woof   <end>
Start    -1
dog                0      1
Cat                0      1
End                              1

Note that the start state has a value of -1; the start state has no previous state. We trace the back pointer table backwards to get the path that provides the sequence with the highest probability of being correct given our HMM, starting at the end cell on the bottom right of the table.
&
Py
Syntax Anaives { m= |
(120) natura Language Processing
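The table-filling procedure above is exactly the Viterbi algorithm. A minimal sketch on the woof/meow example, reproducing the cell values 0.25, 0.09375 and 0.03125 (the transition and emission probabilities are the maximum likelihood estimates from the two toy sequences):

    P_TRANS = {('<start>', 'dog'): 1.0,
               ('dog', 'dog'): 0.5, ('dog', 'cat'): 0.25, ('dog', '<end>'): 0.25,
               ('cat', 'cat'): 0.5, ('cat', '<end>'): 0.5}
    P_EMIT = {('dog', 'woof'): 0.75, ('dog', 'meow'): 0.25,
              ('cat', 'woof'): 0.5, ('cat', 'meow'): 0.5}
    STATES = ['dog', 'cat']

    def viterbi(observations):
        # prob[t][s]: best probability of any path ending in state s at time t;
        # back[t][s]: the predecessor state on that best path.
        prob = [{s: P_TRANS.get(('<start>', s), 0) * P_EMIT.get((s, observations[0]), 0)
                 for s in STATES}]
        back = [{s: '<start>' for s in STATES}]
        for obs in observations[1:]:
            col, ptr = {}, {}
            for s in STATES:
                best = max(STATES, key=lambda p: prob[-1][p] * P_TRANS.get((p, s), 0))
                col[s] = prob[-1][best] * P_TRANS.get((best, s), 0) * P_EMIT.get((s, obs), 0)
                ptr[s] = best
            prob.append(col)
            back.append(ptr)
        # Final transition into <end>, then follow the back pointers in reverse.
        last = max(STATES, key=lambda s: prob[-1][s] * P_TRANS.get((s, '<end>'), 0))
        best_p = prob[-1][last] * P_TRANS.get((last, '<end>'), 0)
        path = [last]
        for ptr in reversed(back[1:]):
            path.append(ptr[path[-1]])
        return best_p, list(reversed(path))

    print(viterbi(['meow', 'woof']))   # (0.0234375, ['dog', 'dog'])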
Maximum entropy models provide a way to combine heterogeneous information sources for classification. The term maximum entropy refers to an optimization framework in which the goal is to find the probability model that maximizes entropy over the set of models that are consistent with the observed evidence. The maximum entropy framework can be used to solve various problems in the domain of natural language processing, like sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization.

A maximum entropy model is used to predict observations from training data. This does not uniquely identify the model, but chooses the model which has the most uniform distribution, i.e. the model with the maximum entropy. Entropy is a measure of uncertainty of a distribution: the higher the entropy, the more uncertain a distribution is. Entropy measures uniformity of a distribution, but applies to distributions in general.

The Principle of Maximum Entropy argues that the best probability model for the data is the one which maximizes entropy over the set of probability distributions that are consistent with the evidence.

Maximum Entropy: A simple example

The following example illustrates the use of maximum entropy on a very simple problem.

• Model an expert translator's decisions concerning the proper French rendering of the English word on.
• A model (p) of the expert's decisions assigns to each French word or phrase (f) an estimate, p(f), of the probability that the expert would choose f as a translation of on. Our goal is to extract a set of facts about the decision-making process from the sample and construct a model of this process.
• A clue from the sample is the list of allowed translations: on → {sur, dans, par, au bord de}, i.e. y ∈ {sur, dans, par, au bord de}.
• The process may be influenced by some contextual information x, a member of a set X; x could include the words in the English sentence surrounding on.

With this information in hand, we can impose our first constraint on p:

p(sur) + p(dans) + p(par) + p(au bord de) = 1
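As a numerical sanity check: with only this normalization constraint, the distribution that maximizes entropy is the uniform one, p = 1/4 for each of the four translations. A sketch verifying this with scipy's general-purpose optimizer (assuming numpy and scipy are installed):

    import numpy as np
    from scipy.optimize import minimize

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)            # avoid log(0)
        return float(np.sum(p * np.log(p)))   # minimizing this maximizes entropy

    result = minimize(neg_entropy,
                      x0=[0.1, 0.2, 0.3, 0.4],   # arbitrary starting guess
                      bounds=[(0.0, 1.0)] * 4,
                      constraints=[{'type': 'eq', 'fun': lambda p: np.sum(p) - 1.0}])
    print(result.x.round(3))                  # -> [0.25 0.25 0.25 0.25]

Adding further constraints from the evidence (for example, a constraint on p(sur) + p(dans)) would pull the solution away from uniform, while keeping it as uniform as those constraints allow.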
Conditional Random Field (CRF)

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE. For example, given the sentence "Bob drank coffee at Starbucks", the labeling might be "Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)". So let's build a conditional random field to label sentences with their parts of speech. Just like any classifier, we'll first need to decide on a set of feature functions fi.

Feature Functions in a CRF

In a CRF, each feature function is a function that takes in as input:

• a sentence s
• the position i of a word in the sentence
• the label li of the current word
• the label li-1 of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1). For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is "very".

Each feature function fj is given a weight λj. Finally, we can transform the resulting scores into probabilities p(l|s) between 0 and 1 by exponentiating and normalizing:

p(l|s) = exp[score(l|s)] / Σ_l' exp[score(l'|s)],  where score(l|s) = Σ_j Σ_i λj fj(s, i, li, li-1)
Example Feature Functions
So what do these feature functions look like? Examples of POS tagging features could include:

f1(s, i, li, li-1) = 1 if li = ADVERB and the i-th word ends in "-ly"; 0 otherwise. If the weight λ1 associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

f2(s, i, li, li-1) = 1 if i = 1, li = VERB, and the sentence ends in a question mark; 0 otherwise. Again, if the weight λ2 associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., "Is this a sentence beginning with a verb?") are preferred.

f3(s, i, li, li-1) = 1 if li-1 = ADJECTIVE and li = NOUN; 0 otherwise. A positive weight for this feature means that adjectives tend to be followed by nouns.

f4(s, i, li, li-1) = 1 if li-1 = PREPOSITION and li = PREPOSITION; 0 otherwise. A negative weight λ4 for this function would mean that prepositions don't tend to follow prepositions, so we should avoid labelings where this happens.

Unlike an HMM, whose transition and emission probabilities force each word to depend only on the current label and each label to depend only on the previous label, CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark. CRFs can also have arbitrary weights: whereas the probabilities of an HMM must satisfy certain constraints, the weights of a CRF are unrestricted. A sketch of these feature functions in code follows.
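A sketch of the feature functions and the scoring/normalization above (the weights λ and the tiny label set are hypothetical; a real CRF learns its weights from data):

    import math
    from itertools import product

    def f1(s, i, li, lprev):     # words ending in -ly tend to be adverbs
        return 1 if li == 'ADVERB' and s[i].endswith('ly') else 0

    def f3(s, i, li, lprev):     # adjectives tend to be followed by nouns
        return 1 if lprev == 'ADJECTIVE' and li == 'NOUN' else 0

    FEATURES = [(f1, 1.5), (f3, 0.8)]          # (feature function, assumed weight)
    LABELS = ['NOUN', 'VERB', 'ADJECTIVE', 'ADVERB']

    def score(s, labeling):                    # score(l|s) = sum_j sum_i lambda_j * f_j
        return sum(w * f(s, i, li, labeling[i - 1] if i > 0 else None)
                   for f, w in FEATURES
                   for i, li in enumerate(labeling))

    def p(s, labeling):                        # exponentiate and normalize
        z = sum(math.exp(score(s, l)) for l in product(LABELS, repeat=len(s)))
        return math.exp(score(s, labeling)) / z

    s = ['ran', 'quickly']
    print(p(s, ('VERB', 'ADVERB')))    # boosted by f1 firing on 'quickly'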
positive weight
for this feature meansthat adjectives tend to befollowed by nouns. : 4. Discuss various approaches/algorithms to perform POS tagging.
f4(silili-1)=1 if li-1= PREPOSITION andli= PREPOSI Explain in detail Rule based POS tagging/ Stochastic (HMM) POS tagging/ Hybrid
TION. A negative weight A4 for
this function would mean that prepositions don’t tend POS tagging. .
to follow prepositions, so we should
avoid labelings where this happens. 6. Explain transformation-based POS tagging with suitable example.
7. Whatdo you mean by constituency? Explain following key constituents’ with suitable
example w.r.t English language: 1: Noun phrases 2) Verb phrases 3) Prepositional -
phrases .