
Unit-5: Part-Of-Speech Tagging

IS 7118: Natural Language Processing


1st Year, 2nd Semester, M.Sc (IS)
(Slides adapted from the textbook by Jurafsky & Martin)
Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based
tagging
• Other issues: tagging unknown words, evaluation
Part of Speech
• Part of speech (POS) is a way to categorize words based on a
particular syntactic (and often semantic) function they take in the
sentence
– Sometimes called syntactic or grammatical categories.
• The idea that there are 8 parts of speech comes from Dionysius Thrax of Alexandria (c. 100 B.C.) and is still with us
• Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
• School grammar: noun, verb, adjective, adverb, preposition,
conjunction, pronoun, interjection

• Parts-of-speech (POS) are also known as lexical categories, word classes, morphological classes, or lexical tags

Why Part of Speech ?
• Provides a significant amount of information about a word and its neighbors
• POS can tell us something about how a word is pronounced, as required in speech synthesis
– How to pronounce “lead”?
– INsult inSULT
– OBject obJECT
– OVERflow overFLOW
– DIScount disCOUNT
– CONtent conTENT
• Parsing
– Need to know if a word is an N or V before you can parse
• Information Retrieval
– POS can be used in stemming for IR
• Information extraction
– Finding names, relations, etc.
• Machine Translation
Part-Of-Speech
• Important POS:
– Nouns: typically refer to people, animals and “things”.
– Verbs: express the action in the sentence.
– Adjectives: describe properties of nouns.
• Children eat sweet candy
• Children: Noun - group of people.
• eat: Verb - describes what people do with candy.
• sweet: Adj.- a property of candy
• candy: Noun - a particular type of food
• Other basic Parts of Speech:
– adverb,
– article,
– pronoun,
– conjunction
– preposition …
Part of Speech : Open vs. Closed
• Useful sub-categorization of POS into two types:
• Open class words:
– A constantly changing set; new words are often introduced into the language.
– nouns, verbs, adjectives and adverbs
• Closed class words:
– A relatively stable set; new words are rarely introduced into the language.
– articles, pronouns, prepositions, conjunctions.
– Closed class words are also generally function words, like of, it, and, or, you, which tend to be very short and occur frequently.
• It is therefore easier to deal with closed class words.
• Articles: a, an, the
• Pronouns: I, you, me, we, he, she, him, her, it, them, they
• Prepositions: to, for, with, between, at, of
• Demonstratives: this, that, these, those
• Quantifiers: some, every, most, any, both
• Conjunctions: and, or, but
Open class (lexical) words
• Nouns
  – Proper: IBM, Italy
  – Common: cat / cats, snow
• Verbs (main): see, registered
• Adjectives: old, older, oldest
• Adverbs: slowly
• Numbers: 122,312; one
… more

Closed class (functional) words
• Determiners: the, some
• Modals: can, had
• Prepositions: to, with
• Conjunctions: and, or
• Particles: off, up
• Pronouns: he, its
• Interjections: Ow, Eh
… more
Nouns
• Nouns refer to entities in the world, which represent objects, places,
concepts, people, events
– dog, city, idea, marathon
• Count nouns allow grammatical enumeration: goat/goats
• Mass nouns describe composites or substances as a homogeneous group: dirt, water, garbage, salt, snow
• Proper nouns are names of specific persons or entities, like Colorado, IBM
• Pronouns are a special class of nouns that refer to a person or a “thing” that is salient in the context of use.
– After Mary had arrived in the village, she looked for a hotel.
• Relative Pronouns are pronouns like:
– who, which, that
– The man who saw Elvis.. The UFO that landed in Toledo ...
– The Rolling Stones concert, which I attended, ...
Verbs
• Verbs: Words that represent actions, commands or assertions.
• Main verbs: walk, eat, believe, claim, ask
• Auxiliary verbs: be, do, have
• Modal verbs: will, can, could

• Verbs can be
– transitive: they take a complement, as in:
eat an apple; read a book; sing a song
– intransitive: verbs that do not take complements, as in:
she laughed; he slept; I lied

Adjectives
• Adjectives include many terms that describe properties or qualities.
• Many languages have adjectives for the
  – concept of color: white, black
  – age: old, young
  – value: good, bad
• There are languages without adjectives, e.g., Chinese and Korean
  – Chinese uses verbs in their place
  – English adjectives act as a subclass of verbs in Korean: “beautiful” acts in Korean like a verb meaning “to be beautiful”
Adverbs
• An adverb is a part of speech. It is any word that modifies any
other part of language: verbs, adjectives (including numbers),
clauses, sentences and other adverbs, except for nouns;
modifiers of nouns are primarily determiners and adjectives.
• Unfortunately, John walked home extremely slowly yesterday
• Types of adverbs:
  – Directional or locative (specify a location or direction): home, here, downhill
  – Degree (describe the extent of some action): extremely, very, somewhat
  – Manner (describe the manner of some action): slowly, slinkily, delicately
  – Temporal (describe when some event or action took place): yesterday, Monday
Closed classes
• The important closed classes in English are:
– Prepositions: on, under, over, near, by, at, from, to, with
– Determiners: a, an, the
– Pronouns: she, who, I, others
– Conjunctions: and, but, or, as, if, when
– Auxiliary verbs: can, may, should, are
– Particles: up, down, on, off, in, out, at, by
– Numerals: one, two, three, first, second, third

Prepositions from CELEX
Prepositions occur before noun phrases; semantically they
indicate spatial or temporal relations whether literal (on it, before
then, by the house) or metaphorical (on time, with gusto, beside
herself).

Particle
• A particle is a word that resembles a preposition or an adverb and is used in combination with a verb to form a larger unit called a phrasal verb. A phrasal verb behaves as a single semantic unit, and its meaning is often not predictable; e.g., turn down, rule out
  – So I went on for some days cutting and hewing timber…
  – Moral reform is the effort to throw off sleep…
• Table: English single-word particles (from Quirk et al., 1985)

Articles
• A particularly small closed class is the articles: a, an,
and the.
• Articles often mark the beginning of a noun phrase and are a kind of determiner.
  – ‘a’ and ‘an’ mark a noun phrase as indefinite
  – ‘the’ can mark it as definite.
• Out of 16 million words, the COBUILD statistics for
articles are:
– the : 1,071,676
– a: 413,387
– an : 59,359
Conjunctions
Conjunctions are used to join two phrases, clauses, or sentences.
Coordinating conjunctions: and, or, but
Subordinating conjunctions: that, as in “I thought that you might like some milk.”

Pronouns
Pronouns are forms that often act as a kind of shorthand for referring to a noun phrase, entity, or event.
Personal pronouns: you, she, I, it, me etc.
Possessive pronouns: my, our, your, her, his, its, their.
Wh-pronouns : what, who, whom, whoever

Auxiliaries
Auxiliaries are words (usually verbs) that mark certain semantic features of a main
verb, including
• whether the action takes place in the present, past, or future (tense)
• whether it is completed (aspect)
• whether it is negated (polarity)
• whether an action is necessary, possible, suggested, desired, etc. (mood)
English auxiliaries include the copula verb be. It is called a copula because it connects subjects with certain kinds of predicate nominals and adjectives (He is a duck; I have gone; We were robbed)

Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based
tagging
• Other issues: tagging unknown words, evaluation
Definition
❖ ”Tagging is the process of assigning a part-of-speech or
other lexical class marker to each word in a corpus”
(Jurafsky and Martin)

WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET

More recent tagsets have more word classes:
Penn Treebank: 45; Brown corpus: 87; C7: 146; C5: 61.
The C5 tagset is used by the Lancaster UCREL project’s CLAWS (the Constituent Likelihood Automatic Word-tagging System)
NLP Task – Determining Part of Speech Tags
• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• Promised to back the bill =VB
• The POS tagging problem is to determine the
POS tag for a particular instance of word.

NLP Task – Determining Part of Speech Tags

• From Penn-Treebank POS tagset (45 tags)


• Input: Plays well with others
• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
• Output: Plays/VBZ well/RB with/IN others/NNS
• Applications:
  – Text to Speech (how do we pronounce “lead”?)
  – Can write regular expressions like (Det) Adj* N+ over the output to find phrases (see the sketch below)
  – As an input to, or to speed up, a full parser
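For instance, a minimal sketch of running such a pattern over a flattened tag string (the sentence, pattern, and flattening are all illustrative, not from the slides):

```python
import re

# Tags of a hypothetical output "the/DT old/JJ men/NNS saw/VBD it/PRP", flattened
tags = "DT JJ NNS VBD PRP"

# (Det) Adj* N+ over the tag string: optional determiner, adjectives, then nouns
np_pattern = re.compile(r"(DT )?(JJ )*NN[SP]*")

print([m.group().strip() for m in np_pattern.finditer(tags)])
# ['DT JJ NNS']
```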

History of POS Tagging

British National Corpus

What is it used for?


• Ultimately, its use is limited only by our imagination; if you have any need for up
to 100 million words of modern British English, you can make use of the British
National Corpus.
• The main uses of the corpus, are as follows:
– Reference Book Publishing
• Dictionaries, grammar books, teaching materials, usage guides, thesauri.
Increasingly, publishers are referring to the use they make of corpus facilities: it's
important to know how well their corpora are planned and constructed.
– Linguistic Research
• Raw data for studying lexis, syntax, morphology, semantics, discourse analysis,
stylistics, sociolinguistics...
– Artificial Intelligence
• Extensive data test bed for program development.
– Natural language processing
• Taggers, parsers, natural language understanding programs, spell checking word
lists...
– English Language Teaching
• Syllabus and materials design, classroom reference, independent learner research.
Penn TreeBank POS Tagset

Brown Corpus
A Simplified Tagset for English

• Tagsets for English grew progressively larger after the Brown Corpus; the Penn Treebank project reversed the trend with a smaller set:

Brown Corpus: 87 tags
LOB Corpus: 135 tags
Lancaster UCREL group: 166 tags
London-Lund Corpus: 197 tags
UPenn Treebank: 36 tags + punctuation

Reasons for a Smaller Tagset
• Many tags are unique to particular lexical items,
and can be recovered automatically if desired.
Brown Tags For Verbs
be/BE have/HV sing/VB
is/BEZ has/HVZ sings/VBZ
was/BED had/HVD sang/VBD
being/BEG having/HVG singing/VBG
been/BEN had/HVN sung/VBN

Penn Treebank Tags For Verbs


be/VB have/VB sing/VB
is/VBZ has/VBZ sings/VBZ
was/VBD had/VBD sang/VBD
being/VBG having/VBG singing/VBG
been/VBN had/VBN sung/VBN
Using the Penn Tagset
1) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
2) There/EX are/VBP 70/CD children/NNS there/RB
3) Although/IN preliminary/JJ findings/NNS were/VBD reported/VBN
more/RBR than/IN a/DT year/NN ago/IN ,/, the/DT …

• Some tagging distinctions are quite hard for both humans and machines.
• Preposition (IN), particle (RP), and adverb (RB) can have a large overlap, as with the word ‘around’ in the following cases:
– Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO
joining/VBG
– All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT
corner/NN
– Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
Issues in tagging
• Particles often can either precede or follow a noun phrase as in
the following examples:
– She told off/RP her friends
– She told her friends off/RP
• Prepositions on the other hand cannot follow their noun phrases
– She stepped off/IN the train
– *She stepped the train off/IN
• Another difficulty is labeling words that can modify nouns.
– Cotton/NN sweater/NN**
– Income-tax/JJ return/NN

Note 1: * marks an ungrammatical sentence
Note 2: ** the Penn Treebank specifies cotton here as an adjective

Issues in tagging
• Some words that can be adjectives, common nouns, or proper nouns are tagged in the Treebank as common nouns when acting as modifiers:
– Chinese/NN cooking/NN
– Pacific/NN waters/NNS
• Yet another difficulty is distinguishing past participles (VBN) from adjectives (JJ)
  – They were married/VBN by the Justice of the Peace yesterday at 5:00
  – At the time, she was already married/JJ
• Tagging manuals like Santorini (1990) give various helpful criteria for deciding how ‘verb-like’ or ‘eventive’ a particular word is in a specific context.

POS Tagging
• The POS tagging problem is to determine the POS tag for a
particular instance of a word.
• The input to tagging algorithm: a string of words and a
specified tagset
• Output: a single best tag for each word.
• Example:
  – Book that flight
    (VB DT NN)
  – Does that flight serve dinner?
    (VBZ DT NN VB NN)
Here book is ambiguous (verb or noun) and needs to be resolved.
Similarly, that can be a determiner or a complementizer.

How Hard is POS Tagging?
Measuring Ambiguity

Task I – Determining Part of Speech Tags

• The Problem:
Word   POS listings in Brown
heat   noun, verb
oil    noun
in     prep, noun, adv
a      det, noun, noun-proper
large  adj, noun, adv
pot    noun

• The Old Solution: combinatorial search.
  – If each of n words has k tags on average, try the k^n combinations until one works.
NLP Task I – Determining Part of Speech Tags

• Machine Learning Solutions: automatically learn Part of Speech (POS) assignment.
  – The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.

Three Methods for POS Tagging
1) Rule-based tagging
▪ (ENGTWOL tagger)
2) Stochastic
▪ Probabilistic sequence models
• HMM (Hidden Markov Model) tagging
• MEMMs (Maximum Entropy Markov Models)
3) Transformation-based tagging

Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based
tagging
• Other issues: tagging unknown words, evaluation
Rule-Based Tagging
❖ Rule-based taggers like EngCG are based on the Constraint Grammar architecture of Karlsson et al. (1995)
1) First stage: assign each word a list of potential parts of speech.
2) Second stage: use large lists of hand-written disambiguation rules (as many as 3,744 constraints) to winnow down this list to a single POS for each word.
❖ The EngCG ENGTWOL lexicon is based on two-level morphology and has about 56,000 entries for English word stems.
ENGTWOL Lexicon

SV, SVO, and SVOO specify the subcategorization or complementation pattern for the verb:
• SV means the verb appears solely with a subject (nothing occurred)
• SVO with a subject and an object (I showed the film)
• SVOO with a subject and two complements (she showed her the ball)
Nominative means non-genitive; PCP2 means past participle
Assign Every Possible Tag

Candidate tags for each word:
She: PRP
promised: VBN, VBD
to: TO
back: NN, RB, JJ, VB
the: DT
bill: NN, VB

Etc… for the ~100,000 words of English with more than 1 tag
Write Rules to Eliminate Tags
Example rule: eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
She: PRP
promised: VBD (VBN eliminated by the rule)
to: TO
back: NN, RB, JJ, VB
the: DT
bill: NN, VB
Stage 1 of ENGTWOL Tagging
• First Stage: Run words through FST morphological
analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …

Pavlov   PAVLOV N NOM SG PROPER
had      HAVE V PAST VFIN SVO
         HAVE PCP2 SVO
shown SHOW PCP2 SVOO SVO SV
that ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
salivation N NOM SG

Stage 2 of ENGTWOL Tagging
• Second Stage: apply a large set of NEGATIVE constraints.
• Example: the adverbial “that” rule keeps only the ADV reading of “that” in contexts like “It isn’t that odd”, but not in “I consider that odd”.

Given input: “that”
If
  (+1 A/ADV/QUANT)  ; the next word is an adjective, adverb, or quantifier
  (+2 SENT-LIM)     ; and the word after that is end-of-sentence
  (NOT -1 SVOC/A)   ; and the previous word is not a verb like “consider”
                    ; which allows adjective complements,
                    ; as in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV tag
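As a toy illustration (the word lists, function name, and simplified context checks below are all invented for this sketch; real ENGTWOL constraints operate over full morphological readings):

```python
# Toy rendering of the adverbial-"that" constraint above
ADJ_ADV_QUANT = {"odd", "very", "much"}   # stand-in for A/ADV/QUANT readings
SVOC_A_VERBS = {"consider", "believe"}    # verbs that allow adjective complements

def adverbial_that(words, i, tags):
    """Prune the tag list for words[i] == 'that' according to the rule."""
    next_is_a_adv_quant = i + 1 < len(words) and words[i + 1] in ADJ_ADV_QUANT
    then_sentence_limit = i + 2 >= len(words)
    prev_not_svoc_a = i == 0 or words[i - 1] not in SVOC_A_VERBS
    if next_is_a_adv_quant and then_sentence_limit and prev_not_svoc_a:
        return ["ADV"]                        # eliminate non-ADV tags
    return [t for t in tags if t != "ADV"]    # eliminate the ADV tag

print(adverbial_that(["it", "isn't", "that", "odd"], 2,
                     ["ADV", "DET", "PRON", "CS"]))   # ['ADV']
print(adverbial_that(["I", "consider", "that", "odd"], 2,
                     ["ADV", "DET", "PRON", "CS"]))   # ['DET', 'PRON', 'CS']
```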
Explanation of Rules
• The first two clauses in this rule check that the ‘that’ directly precedes a sentence-final adjective, adverb, or quantifier.
• In all other cases, the adverb reading is eliminated.
• The last clause eliminates the case where ‘that’ is preceded by a verb like ‘consider’ or ‘believe’ that can take a noun and an adjective (I consider that odd).
• Another rule expresses the constraint that the complementizer sense of that is most likely if the previous word is a verb that expects a complement (like believe, think, or show), and if that is followed by the beginning of a noun phrase and a finite verb.
Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based
tagging
• Other issues: tagging unknown words, evaluation
Hidden Markov Model Tagging
• Using an HMM to do POS tagging is a special case of
Bayesian inference.
• Bayesian inference successfully applied to language
problems:
– Bledsoe 1959: OCR
– Mosteller and Wallace 1964: authorship identification
– Sequence classification tasks: POS
• HMM models are related to the “noisy channel”
model that’s the basis for ASR, OCR and MT

POS tagging as a sequence classification task

• We are given a sentence (an “observation” or “sequence of


observations”)
– Secretariat is expected to race tomorrow.
– sequence of n words w1…wn.
• What is the best sequence of tags which corresponds to this
sequence of observations?
• Probabilistic/Bayesian view:
– Consider all possible sequences of tags
– Out of this universe of sequences, choose the tag sequence which
is most probable given the observation sequence of n words
w1…wn.

Getting to HMM
• Let T = t1, t2, …, tn be a tag sequence
• Let W = w1, w2, …, wn be the observed word sequence
• Goal: out of all sequences of tags t1…tn, find the single most probable tag sequence T underlying the observed sequence of words w1, w2, …, wn
• The hat ^ means “our estimate of the best (i.e., the most probable) tag sequence”
• argmax_x f(x) means “the x such that f(x) is maximized”
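Written out (a standard reconstruction of the formula the slide refers to, in the textbook’s notation):

$$\hat{T} = \underset{T}{\operatorname{argmax}}\ P(T \mid W)$$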
Bayes Rule

We can drop the denominator: it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, the same fixed set of words.
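In symbols, applying Bayes’ rule and dropping the denominator (reconstructed from the surrounding text):

$$\hat{T} = \underset{T}{\operatorname{argmax}}\ \frac{P(W \mid T)\,P(T)}{P(W)} = \underset{T}{\operatorname{argmax}}\ \underbrace{P(W \mid T)}_{\text{likelihood}}\ \underbrace{P(T)}_{\text{prior}}$$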

Likelihood and prior

Computing these probabilities directly is still too hard. We need to make certain assumptions:
1) The probability of a word appearing depends only on its own POS; i.e., it is independent of the other words and the other tags around it.
2) The probability of a tag appearing depends only on the previous tag, rather than on the entire tag sequence.

Likelihood and prior
Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag

The most probable tag sequence estimated by the bigram tagger is then as shown below.
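Under the two assumptions (reconstructed in the textbook’s notation):

$$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad\qquad P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

so the bigram tagger chooses

$$\hat{T} = \underset{T}{\operatorname{argmax}}\ \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$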

Probability estimates

• Tag transition probabilities p(ti|ti-1)
  – Determiners are likely to precede adjectives and nouns
    • That/DT flight/NN
    • The/DT yellow/JJ hat/NN
    • So we expect P(NN|DT) and P(JJ|DT) to be high
  – But adjectives don’t tend to precede determiners, so P(DT|JJ) ought to be low.

Estimating probability

• Tag transition probabilities p(ti|ti-1)
  – Compute P(NN|DT) by counting, in a labeled corpus, the number of times DT is followed by NN:

    P(NN|DT) = C(DT, NN) / C(DT) = 0.49

  – With counts taken from the 45-tag Treebank version of the Brown corpus, the probability of a common noun after a determiner is 0.49.
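A minimal sketch of this counting in Python (the toy corpus and function name are illustrative):

```python
from collections import Counter

def transition_probs(tag_sequences):
    bigrams, unigrams = Counter(), Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return {(prev, cur): n / unigrams[prev] for (prev, cur), n in bigrams.items()}

probs = transition_probs([["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]])
print(probs[("DT", "NN")])  # 0.5: one of the two DT tokens is followed by NN
```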
Two kinds of probabilities
• Word likelihood probabilities p(wi|ti)
  – If we were expecting a third person singular verb, how likely is it that this verb would be is?
  – P(is|VBZ) = the probability of a VBZ (3sg present verb) being “is”
  – Compute P(is|VBZ) by counting in a labeled corpus:

    P(is|VBZ) = C(VBZ, is) / C(VBZ)
An Example: the verb “race”
Two possible tags:
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NR

• People/NNS continue/VB to/TO inquire/VB the/DT


reason/NN for/IN the/DT race/NN for/IN outer/JJ
space/NN

How do we pick the right tag?


(These two examples are taken from the Brown and Switchboard corpora. Tagging is done using the 87-tag Brown corpus tagset.)

Disambiguating “race”
Example summary

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012

• Multiply the lexical likelihoods by the tag sequence probabilities: the verb wins
  • P(VB|TO) P(NR|VB) P(race|VB) = .00000027
  • P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
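A one-off sanity check of these products:

```python
# Multiply out the two readings from the slide
p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e} vs {p_nn:.2e}")  # ~2.69e-07 vs ~3.21e-10: the verb wins
```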

Formalizing Hidden Markov Model Taggers

• In order to define the HMM, we first introduce the Markov chain, or observable Markov model.
• A Markov chain is a special case of a weighted finite-state automaton (WFSA)
• A WFSA is an FSA augmented with a probability on each arc, indicating how likely that path is to be taken.
• A Markov chain is useful for assigning probabilities to unambiguous sequences.
• However, a Markov chain is only useful when we can see the actual conditioning events; it is not suitable for POS tagging, where the tags are not visible.
• An HMM allows both observed events (words) and hidden events (POS tags).
Hidden Markov Model
• An HMM is specified by the following components
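In the textbook’s notation (reconstructed here, since the original component list is a figure):
• Q = q1 q2 … qN: a set of N states
• A = a11 a12 … aNN: a transition probability matrix, where aij is the probability of moving from state i to state j
• O = o1 o2 … oT: a sequence of T observations
• B = bi(ot): a sequence of observation likelihoods, the probability of observation ot being generated from state i
• special start and end states q0, qF (or, equivalently, an initial probability distribution over states)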

HMM for Ice Cream Example
• You are a climatologist in the year 2799 studying global warming.
• You can’t find any records of the weather in Baltimore, MD for the summer of 2007
• But you find Jason Eisner’s diary
• Which lists how many ice creams Jason ate every day that summer.
• Our job: figure out how hot it was

HMM for Ice Cream Example
• Given
– Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
• Produce:
– Weather Sequence: H,C,H,H,H,C…

HMM for Ice Cream Example
• Let’s take 1 3 1 as the observation sequence
  – How many underlying state (hot/cold) sequences are there?
there? HHH
HHC
HCH
HCC
CCC
CCH
CHC
CHH
– How do you pick the right one?

Argmax P(state sequence | 1 3 1)


HMM for Ice Cream Example
Let’s score just one state sequence: CHC
  Cold as the initial state: P(Cold|Start) = .2
  Observing a 1 on a cold day: P(1|Cold) = .5
  Hot as the next state: P(Hot|Cold) = .4
  Observing a 3 on a hot day: P(3|Hot) = .4
  Cold as the next state: P(Cold|Hot) = .3
  Observing a 1 on a cold day: P(1|Cold) = .5

  P(CHC, 1 3 1) = .2 × .5 × .4 × .4 × .3 × .5 = .0024
Question
• To compute the most probable tag sequence, we could just enumerate all paths given the input and use the model to assign probabilities to each.
  – Not a good idea.
• If there are 30 or so tags in the Penn set, and the average sentence is around 20 words…
• How many tag sequences do we have to enumerate to argmax over? 30^20
  – Luckily, dynamic programming can help us here

Using the Viterbi Algorithm for HMM Tagging
• For models like HMMs that contain hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.
• The Viterbi algorithm is the most common decoding algorithm used for HMMs, both for POS tagging and for speech recognition.
  – The Viterbi algorithm is a standard application of classic dynamic programming.
• Given a sequence of observed words O = (o1 o2 o3 … oT), it returns the most probable state/tag sequence Q = (q1 q2 q3 … qT)
• The transition probability between hidden states (i.e., POS tags) is aij and the observation likelihood is bi(ot)

Observation Likelihood
The figure below illustrates the prior probabilities in an HMM POS tagger, showing three sample states and some of the A transition probabilities between them.

Figure 5.13: The Markov chain corresponding to the hidden states of the HMM.
The A transition probabilities are used to compute the prior probabilities

Observation Likelihoods
The figure below shows another view of an HMM POS tagger, focusing on the word likelihoods B. Each hidden state is associated with a vector of likelihoods for each observation word.

Transition and observation likelihood probabilities

Steps in Decoding
• Move column by column:
  – for every state in column 1, compute the probability of moving into each state in column 2, and so on.
  – For each state qj at time t, compute viterbi[s,t] by taking the maximum over the extensions of all the paths that lead to the current cell:

    vt(j) = max over i=1…N of vt-1(i) · aij · bj(ot)

• Where
  – vt-1(i): the Viterbi path probability from the previous time step
  – aij: the transition probability from previous state qi to current state qj
  – bj(ot): the state observation likelihood of observation symbol ot given the current state j

The Viterbi Algorithm
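The textbook presents the algorithm as pseudocode (the figure is not reproduced here). Below is a minimal Python sketch on the ice cream example; probabilities noted as such in the comments come from these slides, while the remaining numbers are illustrative assumptions chosen so each distribution sums to 1.

```python
# Minimal Viterbi decoder for the ice cream HMM
states = ["Hot", "Cold"]

start_p = {"Hot": 0.8, "Cold": 0.2}        # P(Cold|Start) = .2 from the slides
trans_p = {
    "Hot":  {"Hot": 0.7, "Cold": 0.3},     # P(Cold|Hot) = .3 from the slides
    "Cold": {"Hot": 0.4, "Cold": 0.6},     # P(Hot|Cold) = .4 from the slides
}
emit_p = {
    "Hot":  {1: 0.2, 2: 0.4, 3: 0.4},      # P(3|Hot) = .4 from the slides
    "Cold": {1: 0.5, 2: 0.4, 3: 0.1},      # P(1|Cold) = .5 from the slides
}

def viterbi(observations):
    # v[t][s]: probability of the best path ending in state s at time t
    v = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for s in states:
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
            best = max(states, key=lambda prev: v[t - 1][prev] * trans_p[prev][s])
            v[t][s] = v[t - 1][best] * trans_p[best][s] * emit_p[s][observations[t]]
            backpointer[t][s] = best
    # Follow backpointers from the best final state to recover the path
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, backpointer[t][path[0]])
    return path, v[-1][last]

print(viterbi([1, 3, 1]))  # (['Hot', 'Hot', 'Cold'], 0.00672) with these numbers
```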

Viterbi Example

Viterbi Summary
• Create an array
– With columns corresponding to inputs
– Rows corresponding to possible states
• Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs
• Dynamic programming key: we need only store the MAX prob path to each cell (not all paths).
Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-
based tagging
• Other issues: tagging unknown words,
evaluation
Transformation-based tagging
• Eric Brill (1993).
• Label the training set with the most frequent tag for each word:
  The/DT can/MD was/VBD rusted/VBD
• Add transformation rules which reduce training mistakes:
  – MD → NN : DT _   (retag MD as NN after a determiner: “can”)
  – VBD → VBN : VBD _   (retag VBD as VBN after a VBD: “rusted”)
• Stop when no transformation does sufficient good
• Probably the most widely used tagger (esp. outside NLP)… but not the most accurate: 96.6% / 82.0%
• Captures the tagging data in far fewer parameters than stochastic models.
• The transformations learned (often) have linguistic “reality”
Transformation-based tagging: examples
• if a word is currently tagged NN, and has a suffix of
length 1 which consists of the letter 's', change its tag
to NNS
• if a word has a suffix of length 2 consisting of the
letter sequence 'ly', change its tag to RB (regardless of
the initial tag)
• change VBN to VBD if previous word is tagged as
NN
• Change VBD to VBN if previous word is ‘by’

Transformation-based tagging
• Three stages:
– Lexical look-up
– Lexical rule application for unknown words
– Contextual rule application to correct mis-tags
• Painting analogy

TBL Rule Application
• Tagger labels every word with its most-likely tag
– For example: race has the following probabilities in the Brown corpus:
• P(NN|race) = .98
• P(VB|race)= .02
• Transformation rules make changes to tags
– “Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
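A minimal sketch of applying one such transformation (the function and data layout are illustrative):

```python
def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Retag from_tag as to_tag wherever the previous word's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
        ("race", "NN"), ("tomorrow", "NN")]
# "Change NN to VB when previous tag is TO"
print(apply_transformation(sent, "NN", "VB", "TO"))
# [('is', 'VBZ'), ('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```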

Templates for TBL

Brill TBL algorithm
• TBL algorithm:
1) GET_BEST_TRANSFORMATION
2) GET_BEST_INSTANCE

▪ The first function is called with a list of potential templates;
▪ For each template, the first function calls the second function.
▪ The second function iteratively tests every possible instantiation of each template by filling in specific values for the tag variables a, b, z, and w.

Outline

• Why Part of Speech?
• English Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based
tagging
• Other issues: tagging unknown words,
evaluation
Tagging Unknown Words
• New words are added to (newspaper) language at 20+ per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%

• Method 1: assume they are nouns
• Method 2: assume unknown words have a probability distribution similar to words occurring only once in the training set
• Method 3: use morphological information, e.g., words ending in –ed tend to be tagged VBN
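A toy sketch of Method 3 (the suffix list and fallback tag are illustrative assumptions):

```python
# Guess a tag for an unknown word from its suffix; fall back to noun (Method 1)
SUFFIX_TAGS = [("ed", "VBN"), ("ing", "VBG"), ("ly", "RB"), ("s", "NNS")]

def guess_tag(word, default="NN"):
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default

print([guess_tag(w) for w in ["uploaded", "doomscrolling", "blarg"]])
# ['VBN', 'VBG', 'NN']
```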

Evaluation
• The result is compared with a manually coded “Gold Standard”
– Typically accuracy reaches 96-97%
– This may be compared with the result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators.
• Human Ceiling: When using a human-Gold Standard to evaluate a
classification algorithm, check the agreement rate of humans on the
standard.
• Most Frequent Class Baseline: always compare a classifier against a baseline at least as good as the most frequent class baseline, which assigns each token the class it occurred with most often in the training set (see the sketch after this list)
• Factors that affect the performance
  – The amount of training data available
  – The tag set
  – The difference between training corpus and test corpus
  – Dictionary
  – Unknown words
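A minimal sketch of the most frequent class baseline (toy data; the NN fallback for unknown words is an assumption):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, keep the tag it occurred with most often in training
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, model, default="NN"):
    return [(w, model.get(w, default)) for w in words]

train = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "MD"), ("rusted", "VBD")]
model = train_baseline(train)
print(tag_baseline(["the", "can", "rusted"], model))
# [('the', 'DT'), ('can', 'MD'), ('rusted', 'VBD')]
```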
Evaluation
• So once you have your POS tagger running, how do you evaluate it?
– Overall error rate with respect to a gold-standard
test set.
– Error rates on particular tags
– Error rates on particular words
– Tag confusions...

Error Analysis
• Look at a confusion matrix

• See what errors are causing problems


– Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
– Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
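A tiny sketch of tallying such a confusion matrix (names and data are illustrative):

```python
from collections import Counter

def confusion(gold_tags, predicted_tags):
    # Count (gold, predicted) pairs; off-diagonal entries are the errors
    return Counter(zip(gold_tags, predicted_tags))

gold = ["NN", "VBD", "JJ", "NN"]
pred = ["NN", "VBN", "JJ", "JJ"]
for (g, p), n in sorted(confusion(gold, pred).items()):
    print(f"gold={g} predicted={p}: {n}" + ("" if g == p else "  <-- error"))
```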
End of Unit-5
