
Chapter 3: Part of Speech Tagging and Sequence Labeling


Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2022)
Outline

 Introduction
 Part of Speech
 Part of Speech Problem
 Lexical Syntax
 Part-of-Speech Tagging Approaches
 Hidden Markov Model
 Maximum Entropy Models
 Sequence Labeling

01/02/23 2
Introduction

 Computer processing of natural language normally follows a
sequence of steps, beginning with phoneme- and morpheme-based
analysis and stepping toward semantic and discourse analyses.
 Although some of the steps can be interwoven depending on the
requirements of an application (e.g., doing word segmentation
and part-of-speech tagging together in languages like Chinese),
dividing the analysis into distinct stages adds to the modularity
of the process and helps in identifying the problems peculiar to
each stage more clearly.
 Each step aims at solving the problems at that level of processing
and feeding the next level with an accurate stream of data.

Introduction …

 One of the earliest steps within this sequence is part-of-speech
(POS) tagging.
 It is normally a sentence-based approach: given a sentence
formed of a sequence of words, POS tagging tries to label (tag)
each word with its correct part of speech (also named word
category, word class, or lexical category).
 This process can be regarded as a simplified form (or a sub-process)
of morphological analysis.
 Whereas morphological analysis involves finding the internal
structure of a word (root form, affixes, etc.), POS tagging only
deals with assigning a POS tag to the given surface form word.
Introduction POS

 Let W = w1, w2, ..., wn be a sentence having n words.
 The task of POS tagging is finding the sequence of tags T = t1, t2, ...,
tn, where ti corresponds to the POS tag of wi, 1 ≤ i ≤ n, as
accurately as possible.
 In determining the correct tag sequence, it is possible to make
use of the morphological and syntactic (and maybe semantic)
relationships within the sentence (the context).
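The mapping from W to T can be sketched as a function returning one tag per word. The mini-lexicon and the "NN" fallback below are purely illustrative assumptions, not a real tagger (real taggers must handle ambiguity, as the following slides discuss):

```python
from typing import List

# Hypothetical mini-lexicon mapping each surface form to a single tag.
LEXICON = {"we": "PRP", "eat": "VB", "the": "DT", "apple": "NN"}

def pos_tag(words: List[str]) -> List[str]:
    """Map W = (w1, ..., wn) to T = (t1, ..., tn) of the same length n."""
    return [LEXICON.get(w.lower(), "NN") for w in words]  # unknown -> NN

print(pos_tag(["We", "eat", "the", "apple"]))  # ['PRP', 'VB', 'DT', 'NN']
```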

Introduction POS…

 The question is how a tagger encodes and uses the constraints
enforced by these relationships.
 The traditional answer to this question is:
 Simply limiting the context to a few words around the target
word (the word we are trying to disambiguate),
 Making use of the information supplied by these words and
their tags, and ignoring the rest.

Introduction POS…

 Although there is some debate on the topic (e.g., the claim that
the adjective-verb distinction is almost nonexistent in some
languages such as the East-Asian language Mandarin, or the
claim that not all the words in a particular category show the
same functional/semantic behavior), a minimal set of three
categories (noun, verb, and adjective) is considered universal.

 A natural question that may arise is:
 What are these parts of speech, or
 How do we specify a set of suitable parts of speech?

Introduction POS…

 From a linguistic point of view, linguists mostly agree that
there are three major (primary) parts of speech: noun, verb, and
adjective (although there is some debate).
 The usual solution to the arguable nature of this set is admitting
the inconsistencies within each group and saying that in each
group there are “typical members” as well as not-so-typical
members.
 For example, eat is a prototypical instance of the verb category because
it describes a “process” (a widely accepted definition for verbs),
whereas hunger is a less typical instance of a verb.
 This judgment is supported by the fact that hunger is also related to the
adjective category because of the more common adjective hungry, but
there is no such correspondence for eat.
Introduction POS…

 Leaving aside the linguistic considerations and the theoretical
implications, people in the realm of natural language processing
(NLP) approach the issue from a more practical point of view.

 Although the decision about the size and the contents of the
tagset (the set of POS tags) is still linguistically oriented, the
idea is to provide distinct parts of speech for all classes of words
having distinct grammatical behavior, rather than arriving at a
classification that is in support of a particular linguistic theory.

Introduction POS…

 Usually the size of the tagset is large and there is a rich
repertoire (catalog) of tags with high discriminative power.

 The most frequently used corpora (for English) in POS
tagging research and the corresponding tagsets are as follows:
 Brown corpus (87 basic tags and special indicator tags),
 Lancaster-Oslo/Bergen (LOB) corpus (135 tags of which 23 are base
tags),
 Penn Treebank and Wall Street Journal (WSJ) corpus (48 tags of which
12 are for punctuation and other symbols), and
 Susanne corpus (353 tags).

Introduction POS Problem

 Except for a few studies, nearly all POS tagging systems
presuppose a fixed tagset.
 Then the problem is, given a sentence, assigning a POS tag from
the tagset to each word of the sentence.
 There are basically two difficulties in POS tagging:
 1. Ambiguous words.
 In a sentence, there will obviously be some words for which more than
one POS tag is possible.
 In fact, this language property makes POS tagging a real problem;
otherwise the solution would be trivial.
 Consider the following sentence:
 We can can the can.
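With a hypothetical lexicon listing the possible tags of each surface form, the number of candidate tag sequences for even this short sentence multiplies quickly:

```python
# Hypothetical ambiguity classes: 'can' may be an auxiliary (MD),
# a verb (VB), or a noun (NN).
lexicon = {"we": {"PRP"}, "can": {"MD", "VB", "NN"}, "the": {"DT"}}

sentence = "We can can the can".split()

# The number of candidate tag sequences is the product of the
# ambiguity-class sizes of the words.
n_sequences = 1
for w in sentence:
    n_sequences *= len(lexicon[w.lower()])

print(n_sequences)  # 1 * 3 * 3 * 1 * 3 = 27 candidate tag sequences
```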

Introduction POS Problem

 There are basically two difficulties in POS tagging:


 1. Ambiguous words.
 The three occurrences of the word can correspond to auxiliary,
verb, and noun categories, respectively.
 When we take the whole sentence into account instead of the
individual words, it is easy to determine the correct role of each
word.

Introduction POS Problem

 There are basically two difficulties in POS tagging:


 1. Ambiguous words.
 It is easy at least for humans, but may not be so for automatic
taggers.
 While disambiguating a particular word, humans exploit several
mechanisms and information sources such as:
 The roles of other words in the sentence,
 The syntactic structure of the sentence,
 The domain of the text, and
 The commonsense knowledge.
 The problem for computers is finding out how to handle all this
information.

Introduction POS Problem

 There are basically two difficulties in POS tagging:


 2. Unknown words
 In the case of rule-based approaches to the POS tagging
problem that use a set of handcrafted rules, there will clearly
be some words in the input text that cannot be handled by the
rules.

 Likewise, in statistical systems, there will be words that do
not appear in the training corpus.

 Such words are called unknown words.

Introduction POS Problem

 Thus, having some special mechanisms for dealing with
unknown words is an important issue in the design of a
tagger.
 Another issue in POS tagging, which is not directly related to
language properties but poses a problem for taggers, is the
consistency of the tagset.

 Using a large tagset enables us to encode more knowledge
about the morphological and morphosyntactical structures of
the words, but at the same time makes it more difficult to
distinguish between similar tags.

Introduction POS Problem

 Tag distinctions in some cases are so subtle that even humans
may not agree on the tags of some words.
 For instance, an annotation experiment performed on the Penn
Treebank has shown that the annotators disagree on 7.2% of the
cases on average.

 Building a consistent tagset is a more delicate subject for
morphologically rich languages since the distinctions between
different affix combinations need to be handled carefully.
 Thus, it is possible to consider the inconsistencies in the tagsets
as a problem that degrades the performance of taggers.

Introduction POS Problem

 A number of studies allow some ambiguity in the output of the
tagger by labeling some of the words with a set of tags (usually 2-3
tags) instead of a single tag.
 The reason is that, since POS tagging is seen as a preprocessing step
for other higher-level processes such as named-entity recognition or
syntactic parsing, it may be wiser to output a few most probable tags
for some words for which we are not sure about the correct tag (e.g.,
both the preposition (IN) and adverb (RB) tags may have
similar chances of being selected for a particular word).
 This decision may be left to later processing, which is more likely to
decide on the correct tag by exploiting more relevant information
(which is not available to the POS tagger).

Introduction POS Problem

 The state-of-the-art in POS tagging accuracy (number of
correctly tagged word tokens over all word tokens) is about
96%-97% for most European languages (English, French, etc.).

 Similar accuracies are obtained for other types of languages,
provided that the characteristics different from European
languages are carefully handled by the taggers.

Introduction POS Problem

 It is important to note here that it is possible to obtain high
accuracies using very simple methods.
 For example, on the WSJ corpus, tagging each word in the test
data with the most likely tag for that word in the training data
gives rise to accuracies around 90%.
 So, the sophisticated methods used in the POS tagging domain are
for getting the last 10% of tagging accuracy.
 On the one hand, 96%-97% accuracy may be regarded as quite a
high success rate, when compared with other NLP tasks.
 Based on this figure (96%-97%), some researchers argue that we
can consider POS tagging an already-solved problem (at least for
European languages).
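A sketch of this most-likely-tag baseline, assuming a toy training corpus of (word, tag) pairs; the training data below is invented for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """For each word, remember its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(model, words, default="NN"):
    """Tag each word with its most frequent training tag; unknown
    words fall back to a default tag."""
    return [model.get(w, default) for w in words]

train = [
    [("we", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("we", "PRP"), ("can", "MD"), ("run", "VB")],
    [("the", "DT"), ("can", "NN")],
]
model = train_baseline(train)
print(tag_baseline(model, ["we", "can", "run"]))  # ['PRP', 'MD', 'VB']
```

Note that this baseline always tags "can" as MD (its majority tag here), which is exactly the kind of error the context-sensitive methods below are meant to fix.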
Introduction POS Problem

 Any performance improvement above these success rates will
be very small.
 However, on the other hand, the performances obtained with
current taggers may seem insufficient and even a small
improvement has the potential of significantly increasing the
quality of later processing.
 If we suppose that a sentence in a typical English text has 20-30
words on average, an accuracy rate of 96%-97% implies that
there will be about one word erroneously tagged per sentence.
 Even one such word will make the job of a syntax analyzer
much more difficult.
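The arithmetic behind the "one error per sentence" claim, taking 25 words as a mid-point of the 20-30 range:

```python
words_per_sentence = 25  # mid-point of the 20-30 range quoted above

# Expected number of mis-tagged words per sentence at token accuracy p.
for p in (0.96, 0.97):
    errors = round(words_per_sentence * (1 - p), 2)
    print(f"accuracy {p:.0%}: ~{errors} errors per sentence")
# accuracy 96%: ~1.0 errors per sentence
# accuracy 97%: ~0.75 errors per sentence
```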

Lexical Syntax

 Words are the building blocks of natural language texts.

 Since a proportion of the words in a text are morphologically
complex, it makes sense for text-oriented applications to register a
word's structure.

 Text analysis at the level of the word is called lexical analysis.

POS Tagging Approaches

 Knowing the part-of-speech can produce more natural
pronunciations in a speech synthesis system and more accuracy
in a speech recognition system.

 Parts-of-speech can also be used in stemming for information
retrieval (IR), since knowing a word's part-of-speech can help
to tell which morphological affixes it can take.

POS Tagging Approaches…

 There are computational methods for assigning parts-of-speech
to words (part-of-speech tagging).
 Many algorithms have also been applied to this problem,
including:
 Hand-written rules (rule-based tagging),
 Probabilistic methods (HMM tagging and maximum entropy
tagging),

 As well as other methods such as:
 Transformation-based tagging and
 Memory-based tagging

POS Tagging Approaches (Rule-based Approach)
 The earliest POS tagging systems are rule-based systems, in
which a set of rules is manually constructed and then applied to
a given text.
 The first rule-based tagging system is based on a large set of
handcrafted rules and a small lexicon to handle the exceptions.

 The main drawbacks of these early systems are the laborious
work of manually coding the rules and the requirement of
linguistic background.
 Instead of trying to acquire the linguistic rules manually, a later
system learns a set of correction rules by a methodology called
transformation-based learning (TBL).
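A minimal sketch of a TBL-style correction rule, in the spirit of Brill's tagger; the trigger template and the tags below are invented for illustration:

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag whenever the previous (already
    corrected) tag is prev_tag -- one example of a TBL trigger."""
    new_tags = list(tags)
    for i in range(1, len(new_tags)):
        if new_tags[i] == from_tag and new_tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags

# Initial (most-likely-tag) guess for "We can can the can":
initial = ["PRP", "MD", "MD", "DT", "NN"]
# Learned correction: MD -> VB when the previous tag is MD.
print(apply_rule(initial, "MD", "VB", "MD"))  # ['PRP', 'MD', 'VB', 'DT', 'NN']
```

TBL learns an ordered list of such rules by repeatedly picking the rule that most reduces the error on the training corpus.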
POS Tagging Approaches - HMM

 The rule-based methods used for the POS tagging problem
began to be replaced by statistical models in the early 1990s.

 The major drawback of the oldest rule-based systems was the
need to manually compile the rules, a process that requires
linguistic background.

 Moreover, these systems are not robust in the sense that they
must be partially or completely redesigned when a change in
the domain or in the language occurs.
 This led to the development of new statistical models.
POS Tagging Approaches – HMM…

 Later on, a new paradigm, statistical natural language
processing, emerged and offered solutions to these
problems.
 As the field became more mature, researchers began to abandon
the classical strategies and developed new statistical models.

 Use of a Hidden Markov Model to do part-of-speech tagging is
a special case of Bayesian inference, a paradigm that has been
known since the work of Bayes (1763).
 Bayesian inference or Bayesian classification was applied
successfully.

POS Tagging Approaches – HMM…

 A first-order Hidden Markov Model instantiates two simplifying
assumptions.
 First, as with a first-order Markov chain, the probability of a
particular state depends only on the previous state:
 Markov Assumption: P(qi | q1, ..., qi−1) = P(qi | qi−1)

 Second, the probability of an output observation oi depends
only on the state qi that produced the observation, and not on any
other states or any other observations:
 Output Independence: P(oi | q1, ..., qT, o1, ..., oT) = P(oi | qi)
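Under these two assumptions, the joint probability of a tag sequence and a word sequence factorizes as P(q, o) = Π P(qi | qi−1) · P(oi | qi); a sketch with made-up toy probabilities:

```python
import math

def sequence_logprob(tags, words, trans, emit, start="<s>"):
    """log P(tags, words) under the first-order HMM factorization:
    a transition term and an emission term per position."""
    lp = 0.0
    prev = start
    for t, w in zip(tags, words):
        lp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return lp

# Hypothetical transition and emission probabilities for illustration:
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.8}
emit = {("DT", "the"): 0.9, ("NN", "can"): 0.1}

lp = sequence_logprob(["DT", "NN"], ["the", "can"], trans, emit)
print(round(math.exp(lp), 6))  # 0.5 * 0.9 * 0.8 * 0.1 = 0.036
```

A real tagger would search over all candidate tag sequences for the one maximizing this quantity (e.g., with the Viterbi algorithm).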

POS Tagging Approaches (Maximum Entropy Model)
 The HMM framework has two important limitations for
classification tasks such as POS tagging:
 Strong independence assumptions and poor use of contextual
information.
 For HMM POS tagging, it is usually assumed that the tag of a word
does not depend on the previous and next words, i.e., that a word in the
context does not supply any information about the tag of the target
word.
 Furthermore, the context is usually limited to the previous one or
two words.
 Although there exist some attempts to overcome these limitations,
they do not allow the context to be used in arbitrary ways.

POS Tagging Approaches (Maximum Entropy Model)
 Maximum Entropy (MaxEnt) models provide more flexibility in
dealing with the context and are used as an alternative to
HMMs in the domain of POS tagging.
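That flexibility comes from arbitrary feature functions over the context; a sketch of the kind of binary features a MaxEnt tagger might combine (the feature names and templates are invented for illustration):

```python
def features(words, i, prev_tag):
    """Binary features of the target word, its context, and the
    previous tag -- the inputs a MaxEnt model weights and combines."""
    w = words[i]
    nxt = words[i + 1].lower() if i + 1 < len(words) else "</s>"
    return {
        f"word={w.lower()}": 1,
        f"prev_tag={prev_tag}": 1,
        f"suffix3={w[-3:].lower()}": 1,
        f"next_word={nxt}": 1,
        f"capitalized={w[0].isupper()}": 1,
    }

print(features(["We", "can", "can", "the", "can"], 1, "PRP"))
```

The model assigns a weight to each (feature, tag) pair and predicts the tag maximizing the resulting exponential score; unlike an HMM, nothing restricts which parts of the context the features may inspect.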

POS Tagging Approaches

 Hidden Markov Model (HMM) and Maximum Entropy model
(MaxEnt) are the two important classes of statistical models for
processing text and speech.

 Both are descendants of Markov models.

 The Markov-related variant of MaxEnt is called the Maximum
Entropy Markov Model (MEMM).

 All are machine learning models.

Sequence Labeling…

 Sequence labeling setup:
 Inputs: x = (x1, ..., xn)
 Labels: y = (y1, ..., yn)
 Goal: Given x, predict y
 Example sequence labeling tasks:
 Part-of-speech tagging
 Named-entity recognition (NER - Assignment)
 Label people, places, organizations
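A concrete instance of the (x, y) setup, using BIO-style NER labels (the example sentence is invented):

```python
# One sequence-labeling instance: exactly one label per input token.
x = ["John", "lives", "in", "New", "York"]
y = ["B-PER", "O", "O", "B-LOC", "I-LOC"]  # BIO labels: person, location

# The supervised goal: learn a model that predicts y from x.
assert len(x) == len(y)
for token, label in zip(x, y):
    print(f"{token}\t{label}")
```

The B-/I- prefixes mark the beginning and inside of a multi-token entity ("New York"), which is exactly the kind of label-to-label dependency discussed next.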

Sequence Labeling…

 Although this task may be regarded as a supervised learning
problem, it has the extra complexity that the data points (xi, yi) in the
sequence are dependent.
 For example, label yi may depend on the previous label yi−1.

 In the probabilistic modeling approach, one may construct a
probability model of the whole sequence {(xi, yi)}, and then
estimate the model parameters.
 Similar to the standard supervised learning setting with
independent observations, there are two types of models for
sequence prediction: generative and discriminative.
Sequence Labeling…

 For simplicity, it is possible to consider first-order dependency,
where yi only depends on yi−1.
 Higher-order dependency (e.g., yi may depend on yi−2, yi−3, and so
on) can be easily incorporated but requires more complicated
notation.

 Also for simplicity, it is better to ignore sentence boundaries, and
just assume that the training data contain n sequential
observations.

Question & Answer

Thank You !!!
