Approaches To NLP: Introduction To Natural Language Processing (CSE 5321)
Teshome M. Bekele
2020/21—Sem I
Introduction Representing Linguistic Knowledge
Models Rule-Based vs Statistical Approaches
Algorithms Mathematical Foundations
• In the preceding lectures, we examined the linguistic structures of English and Amharic at different levels.
• NLP requires a set of rules to represent knowledge about these linguistic structures.
Department of Computer Science and Engineering, ASTU Lecture 07: Approaches to NLP 2/35
Hybrid Approaches
• Both rule-based and statistical approaches have their own pros and cons.
• Thus, rule-based and statistical approaches are often combined to benefit from their synergy.
Mathematical Foundations
• There are well-established mathematical foundations for both rule-based and statistical
approaches to NLP.
• Models and theories applied in NLP are all drawn from the standard toolkit of computer
science, mathematics and linguistics.
• Logic-based models
• Probabilistic models
State Machines
Introduction Formal Rule Systems
Models Logic-Based Models
Algorithms Vector Space Models
Probabilistic Models
State Machines
• State machines are widely used in NLP for modeling phonology, morphology and syntax.
• State machines are formal models that consist of states, transitions among states, and
an input representation.
♦ States – represent the set of properties of an abstract machine
♦ Transitions – represent jumps from one state to another
♦ Inputs – sequences of symbols or letters that can be read by the machine
• A machine with a finite number of states is called a finite state machine (FSM).
• An FSM has two special kinds of states: a start state and one or more final states.
[Figure: an FSM with states S0, S1, S2 — S0 is the start state, S2 a final state; arcs labeled with the input symbols 0 and 1 are the transitions.]
• There are two types of FSMs: finite state automata and finite state transducers.
• A finite state automaton (FSA) is a finite state machine that accepts a given set of strings (a language).
• In a deterministic FSA, every state has exactly one transition for each possible input symbol.
[Figure: a deterministic FSA over the alphabet {0, 1} with states S0, S1, S2.]
♦ Strings accepted by this deterministic FSA are: ε, 1, 11, 111, 00, 010, 1010, 10110, etc.
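Running a DFA amounts to following one transition per input symbol and checking whether the machine ends in a final state. A minimal sketch in Python — the automaton below is illustrative (it is not the FSA in the figure) and accepts binary strings containing an even number of 1s:

```python
# Minimal DFA simulator. The example DFA (hypothetical, not the one in the
# figure) accepts binary strings with an even number of 1s.
def dfa_accepts(string, transitions, start, finals):
    """Run a deterministic FSA: exactly one transition per (state, symbol)."""
    state = start
    for symbol in string:
        if (state, symbol) not in transitions:
            return False  # no transition defined: reject
        state = transitions[(state, symbol)]
    return state in finals

# States: S0 (even number of 1s seen so far; start and final), S1 (odd)
EVEN_ONES = {
    ("S0", "0"): "S0", ("S0", "1"): "S1",
    ("S1", "0"): "S1", ("S1", "1"): "S0",
}

print(dfa_accepts("1010", EVEN_ONES, "S0", {"S0"}))  # True (two 1s)
print(dfa_accepts("111", EVEN_ONES, "S0", {"S0"}))   # False (three 1s)
```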
• In a non-deterministic FSA, an input symbol can lead to one, more than one, or no transition from a given state.
[Figure: a non-deterministic FSA with states S0–S4 and arcs labeled 0, 1 and ε.]
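Non-determinism can be simulated by tracking the set of currently reachable states, with ε-transitions handled by taking a closure. The example NFA below is illustrative (not the one in the figure); it accepts "01" and "10":

```python
# Sketch of non-deterministic FSA simulation: track the set of reachable
# states. Transitions map (state, symbol) -> set of states; "" marks ε-moves.
# The example NFA is hypothetical, not the one in the figure.
def epsilon_closure(states, transitions):
    """All states reachable from `states` via ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, ""), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_accepts(string, transitions, start, finals):
    current = epsilon_closure({start}, transitions)
    for symbol in string:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(moved, transitions)
    return bool(current & finals)

# ε-branching lets the machine accept either "01" or "10"
NFA = {
    ("S0", ""): {"S1", "S3"},
    ("S1", "0"): {"S2"}, ("S2", "1"): {"S5"},
    ("S3", "1"): {"S4"}, ("S4", "0"): {"S5"},
}
print(nfa_accepts("01", NFA, "S0", {"S5"}))  # True
print(nfa_accepts("10", NFA, "S0", {"S5"}))  # True
print(nfa_accepts("00", NFA, "S0", {"S5"}))  # False
```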
Word Recognition
• FSAs can be used to recognize words in a language.
• Examples:
[Figure: FSAs recognizing single words — "ሰበረ" letter by letter (S0 –ሰ→ S1 –በ→ S2 –ረ→ S3) and "walk" letter by letter (S0 –w→ S1 –a→ S2 –l→ S3 –k→ S4), or each word as a single transition from S0 to S1.]
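Letter-by-letter recognition of this kind can be sketched directly: each position in the word becomes a state, and the only accepted string is the word itself (integer state names are illustrative):

```python
# A word can be recognized letter by letter with a chain of states, e.g.
# S0 -w-> S1 -a-> S2 -l-> S3 -k-> S4 (final).
def chain_fsa(word):
    """Build a transition table whose only accepted string is `word`."""
    transitions = {(i, ch): i + 1 for i, ch in enumerate(word)}
    return transitions, 0, {len(word)}

def accepts(string, transitions, start, finals):
    state = start
    for ch in string:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in finals

fsa = chain_fsa("walk")
print(accepts("walk", *fsa))  # True
print(accepts("wall", *fsa))  # False
print(accepts("ሰበረ", *chain_fsa("ሰበረ")))  # works for Amharic words too: True
```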
Word Recognition
♦ Recognition of multiple words
[Figure: FSAs recognizing multiple words by sharing common substrings — an Amharic example, and an English FSA covering words such as intern, internal, ethiopia and ethanol.]
Word Recognition
♦ Recognition of multiple words (for instance, Amharic pronouns: Eኔ, Eኛ, Aንተ,
Aንቺ, Eናንተ, Eስዎ, Eርስዎ, Eሱ, Eርሱ, Eሷ, Eርሷ, Eሳቸው, Eርሳቸው, Eነሱ, Eነርሱ)
[Figure: a single FSA recognizing all of the pronouns above, sharing common prefixes (E-, Aን-, Eር-, Eነ-) and suffixes.]
Modeling Morphology
• One word and multiple inflections
[Figure: FSAs for one stem with multiple inflections — "walk" followed by one of -s, -ed, -ing, …; the Amharic stem "ሰበር" followed by one of -ኧን, -ኧህ, -ኣት, -ኧው, -ኣቸው, -ኧኝ, -ኧሽ, -ኣችሁ, -ኣችሁት, ….]
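A stem+suffix FSA of this kind amounts to concatenating a stem language with a suffix language. A sketch that both generates and recognizes inflected forms (ε is added here for the bare form; this is an illustration, not the slide's exact automaton):

```python
# The two-state morphology FSA is equivalent to concatenating a stem set
# (S0 -> S1 arcs) with a suffix set (S1 -> S2 arcs).
STEMS = {"walk"}
SUFFIXES = {"", "s", "ed", "ing"}  # "" (ε) for the bare form

def generate(stems, suffixes):
    """All words accepted by the stem+suffix FSA."""
    return {stem + suf for stem in stems for suf in suffixes}

def recognize(word, stems, suffixes):
    """Accept iff word = stem + suffix for some pair."""
    return any(word == s + x for s in stems for x in suffixes)

print(sorted(generate(STEMS, SUFFIXES)))
# ['walk', 'walked', 'walking', 'walks']
print(recognize("walked", STEMS, SUFFIXES))   # True
print(recognize("walkeds", STEMS, SUFFIXES))  # False
```

The same structure applies unchanged to Amharic stems and suffix sets.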
Modeling Morphology
• Multiple words and multiple inflections
[Figure: the same FSAs with multiple stems — jump, walk, help followed by -s, -ed, -ing, …; the Amharic stems ማረክ, ሰበር, ገደል followed by the suffixes above.]
Modeling Morphology
• One word and multiple inflections with affixes
[Figure: an FSA combining affixes with one stem — a prefix from Eንዲ-, Eንዳይ-, ከሚ-, ሊ-, የሚ-, …, the stem ሰብር, then a suffix from -ኧን, -ህ, -ኣት, -ኧው, -ኣቸው, -ብን, -በት, -ለት, -ባቸው, ….]
Modeling Morphology
• Multiple words and multiple inflections with affixes
[Figure: the same FSA with multiple stems — ማርክ, ሰብር, ገድል — combined with the prefixes and suffixes above.]
Modeling Morphology
• Marking part-of-speech
[Figure: an FSA attaching derivational suffixes such as -ion, -ism, -ist, -er and -y to a word.]
Modeling Morphology
• Marking part-of-speech
[Figure: the same FSA with its states labeled by the resulting part-of-speech categories (N, Adj, V).]
... walk walked walking walks wall walls want wanted wanting wants warn warned warning warns ...
[Figure: a letter trie over the word list above, sharing the prefix "wa-" and the suffixes -s, -ed and -ing.]
[Figure: the trie from the previous slide.]
Discovered Morphology
• Stems – share a common suffix tree:
♦ walk
♦ want
♦ warn
• Morphemes – appear in the frequent suffix tree:
♦ ε
♦ -ed
♦ -s
♦ -ing
[Figure: tries over Amharic verb forms built from the stems ሰብር and ገድል with the suffixes -ኧው, -Uበት, -Uባቸው, -Uት and the prefixes Eንደ-, ሚ-, ማይ-.]
Discovered Morphology
• Stems – share a common suffix tree:
♦ ሰብር
♦ ገድል
• Morphemes – appear in the frequent suffix tree:
♦ ε
♦ -ኧው
♦ -Uበት
♦ -Uባቸው
♦ -Uት
• Other affixes:
♦ Eንደ-
♦ -ሚ-
♦ -ማይ-
• Finite state transducers (FSTs) are extensions of finite state automata (FSA) that can
generate outputs.
• FSTs can be considered as:
♦ Recognizer: a machine that takes a pair of strings as input and outputs
“accept” if the string-pair is in the string-pair language, and
“reject” if it is not.
♦ Generator: a machine that outputs pairs of strings of the language, i.e. the
output is a “yes” or “no”, and a pair of output strings.
♦ Translator: a machine that reads a string and outputs another string.
♦ Set relater: a machine that computes relations between sets.
[Figure: three equivalent ways of drawing the same FST between states S0 and S1, with arcs such as a:ba, a:b and b:ε — different ways of representing input/output relations in FSTs.]
N.B. Identical input/output pairs can be written using a single symbol, e.g. "b:b" → "b". The symbol ε represents the empty string.
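As a translator, an FST consumes input symbols and emits output strings. A minimal sketch using the a:ba notation above — the arcs here are illustrative, not an exact copy of the figure:

```python
# A finite state transducer as a translator: each arc consumes one input
# symbol and emits an output string (possibly ε, the empty string).
def fst_translate(string, arcs, start, finals):
    """Deterministically map input symbols to output strings; None = reject."""
    state, output = start, []
    for symbol in string:
        if (state, symbol) not in arcs:
            return None
        state, out = arcs[(state, symbol)]
        output.append(out)
    return "".join(output) if state in finals else None

ARCS = {
    ("S0", "a"): ("S1", "ba"),  # a:ba
    ("S1", "a"): ("S1", "a"),   # a:a
    ("S1", "b"): ("S0", ""),    # b:ε
}
print(fst_translate("aab", ARCS, "S0", {"S0"}))  # 'baa'
```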
• Depending on the type of accepted input and produced output, FSTs can be:
♦ String-to-string transducers: produce strings as outputs.
♦ String-to-weight transducers: produce weights as outputs.
[Figure: a string-to-string transducer with arcs a:b, a:ba and b:ε, and a string-to-weight transducer for the input "aab" with initial weight 4 on state S0/4, arc weights a/2, b/3, b/5, and final weight 1 on state S2/1.]
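A string-to-weight transducer assigns each accepted string the combination of its initial weight, arc weights and final weight. The sketch below uses sum-of-weights semantics; the topology and weights are a partial reconstruction, not exactly the figure's machine:

```python
# String-to-weight transducer: accumulate initial + arc + final weights
# along the path taken by the input string.
def path_weight(string, arcs, start, initial, finals):
    """Sum-of-weights semantics; None if the string is not accepted."""
    state, total = start, initial
    for symbol in string:
        if (state, symbol) not in arcs:
            return None
        state, w = arcs[(state, symbol)]
        total += w
    return total + finals[state] if state in finals else None

ARCS = {
    ("S0", "a"): ("S0", 2),  # a/2 (self-loop)
    ("S0", "b"): ("S2", 3),  # b/3
}
# initial weight 4 on S0, final weight 1 on S2
print(path_weight("aab", ARCS, "S0", 4, {"S2": 1}))  # 4 + 2 + 2 + 3 + 1 = 12
```

In weighted-FST libraries the weights live in a semiring, so "sum" can equally be min (tropical) or product (probabilities).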
Two-Level Morphology
• In the finite-state morphology paradigm, a word is represented as a correspondence
between a lexical level and the surface level.
♦ Lexical level represents a concatenation of morphemes making up a word.
♦ Surface level represents the concatenation of letters which make up the actual
spelling of the word.
• Morphological parsing is the process of building a structured representation of a word by breaking it down into its component morphemes. For example:
♦ “bigger” is morphologically parsed as “big+ADJ+COMPARATIVE”.
♦ “lower” is morphologically parsed as “low+ADJ+COMPARATIVE”.
♦ “ተማሪዎች” is morphologically parsed as “ተማሪ+N+PLURAL”.
• Thus, a morphological parser is used to identify the correspondence between the lexical level and the surface level.
♦ For example, the lexical level representation for the surface level word “lower” is
“low+ADJ+COMPARATIVE”.
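A lookup-based sketch of this surface-to-lexical mapping, using the examples above (real systems compute the mapping with FSTs rather than a hand-written table):

```python
# Toy morphological parser: map a surface form to its lexical-level
# analysis via a lookup table built from the slide's examples.
LEXICON = {
    "bigger": "big+ADJ+COMPARATIVE",
    "lower": "low+ADJ+COMPARATIVE",
    "ተማሪዎች": "ተማሪ+N+PLURAL",
}

def parse(surface):
    """Return the lexical-level analysis, or None if unknown."""
    return LEXICON.get(surface)

print(parse("bigger"))  # big+ADJ+COMPARATIVE
print(parse("ተማሪዎች"))  # ተማሪ+N+PLURAL
```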
Two-Level Morphology
• Two-level morphology is an important application of FSTs to morphological
representation and parsing.
[Figure: an FST for two-level morphology — the lexical level b i g +ADJ +COMP corresponds to the surface level b i g g e r (with ε where one level has no symbol), and similarly low +ADJ +COMP to lower; states S0–S9.]
• FSTs can also be used to implement spelling rules applied during inflection of words.
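One such spelling rule is English gemination, where a final consonant is doubled before -er (big → bigger but low → lower). A sketch of the rule as a plain function — the vowel/consonant test is a simplification of the real phonological conditions:

```python
# Spelling rule applied during inflection: double a final consonant in a
# consonant-vowel-consonant pattern before "-er" (big -> bigger).
# w, x, y are excluded because English does not double them (low -> lower).
VOWELS = set("aeiou")

def comparative(adjective):
    """Attach -er, doubling a final consonant in a CVC pattern."""
    if (len(adjective) >= 3
            and adjective[-1] not in VOWELS and adjective[-1] not in "wxy"
            and adjective[-2] in VOWELS
            and adjective[-3] not in VOWELS):
        return adjective + adjective[-1] + "er"  # big -> bigger
    return adjective + "er"                      # low -> lower

print(comparative("big"))  # bigger
print(comparative("low"))  # lower
```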
• Formal Rule Systems are formalisms used to define languages using formal grammars.
♦ Formal grammar is a set of formation rules for strings in a formal language.
♦ Formal grammars and languages are studied under Formal Language Theory.
• There are two types of grammars:
♦ Generative grammar: gives a set of rules that will correctly predict which
combinations of words will form grammatical sentences.
♦ Analytic grammar: gives rules for analyzing a given string to determine whether and how it is generated by the grammar.
• Formal Rule Systems are widely used in NLP to model:
♦ Phonology
♦ Morphology
♦ Syntax
• Review [Lecture 03] for further details.
Logic-Based Models
• Logic-Based Models are formalisms used to define languages using mathematical logic.
• Logic models commonly used in NLP are:
♦ First-order logic/predicate calculus
♦ λ-calculus
♦ semantic primitives
• Logic-Based Models are widely used in NLP to model:
♦ Semantics
♦ Pragmatics
• Review [Lecture 04] for further details.
Probabilistic Models
Introduction
State Space Search Algorithms
Models
Machine Learning Algorithms
Algorithms
• The state space of a problem is the set of all states that can be reached by applying operators to states of the problem.
♦ Natural language problems are often modeled as a state space.
• A state space has some common properties:
♦ Complexity, where branching factor is important
♦ Structure of the space:
Directionality of arcs
Tree
Rooted graph
• The state space for a given problem is usually huge, and as a result, state space
searching requires efficient strategies.
• Dynamic programming is one of the most commonly used strategies.
♦ Dynamic programming is a method for solving complex state space search problems by breaking them down into simpler subproblems and reusing their solutions.
• In NLP problems, among the most important algorithms that employ dynamic
programming strategy are:
♦ Viterbi Algorithm
Used for finding the most likely sequence of hidden states in Hidden
Markov Models (HMMs).
♦ Chart Parsing Algorithms
Partial hypothesized results are stored in a structure called chart.
Used for parsing strings that belong to Context Free Grammars (CFGs).
• There are two types of chart parsers.
♦ Earley Parser: employs a top-down parsing approach.
♦ Cocke-Younger-Kasami (CYK): employs a bottom-up parsing approach.
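The bottom-up CYK approach can be sketched as a recognizer for a grammar in Chomsky Normal Form, filling a chart of partial results from shorter spans to longer ones (the toy grammar is illustrative):

```python
# Compact CYK recognizer (bottom-up chart parsing) for a CNF grammar.
# chart[i][j] holds the nonterminals that derive words[i..j].
def cyk(words, lexical, binary, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):  # length-1 spans from lexical rules
        chart[i][i] = {lhs for lhs, rhs in lexical if rhs == w}
    for span in range(2, n + 1):   # longer spans from binary rules
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point: words[i..k] + words[k+1..j]
                for lhs, (b, c) in binary:
                    if b in chart[i][k] and c in chart[k + 1][j]:
                        chart[i][j].add(lhs)
    return start in chart[0][n - 1]

LEXICAL = [("NP", "she"), ("V", "walks"), ("NP", "home")]
BINARY = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]
print(cyk(["she", "walks", "home"], LEXICAL, BINARY))  # True
```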
• In general, state space search algorithms are used in the following NLP applications:
♦ Speech Recognition
♦ Parsing
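The Viterbi algorithm mentioned above can be sketched as dynamic programming over the trellis of hidden states, keeping only the best path into each state at each step. The HMM probabilities below are illustrative:

```python
# Minimal Viterbi decoder for an HMM: at each observation, keep the highest-
# probability path ending in each state.
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence."""
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

# Toy POS-tagging HMM: hidden states N(oun) and V(erb)
STATES = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.7}}
print(viterbi(["dogs", "bark"], STATES, START, TRANS, EMIT))  # ['N', 'V']
```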
[Figure: a trained classifier takes a new example as input and outputs a predicted classification.]
• Among the most important machine learning algorithms used in NLP are:
♦ Classifiers
♦ Expectation-Maximization (EM) algorithms
• The goal of a classifier is to categorize a given input into one of a fixed set of categories, based on a model learned from training data.
♦ Classifiers commonly used in NLP are Decision Trees, Support Vector Machines,
Gaussian Mixture Models, etc.
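As a minimal illustration of training and prediction, a one-level decision tree ("decision stump") — the simplest member of the decision-tree family listed above. The single feature and the data are illustrative:

```python
# Decision stump: training picks the threshold on one feature that best
# separates two categories; prediction applies it to new examples.
def train_stump(examples):
    """examples: list of (feature_value, label). Returns (threshold, below, above)."""
    best = None
    for threshold in sorted(v for v, _ in examples):
        for below, above in (("A", "B"), ("B", "A")):
            correct = sum(
                1 for v, y in examples
                if (below if v < threshold else above) == y
            )
            if best is None or correct > best[0]:
                best = (correct, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above

def predict(stump, value):
    threshold, below, above = stump
    return below if value < threshold else above

# e.g. separating two classes of sentences by a length feature
DATA = [(3, "A"), (4, "A"), (5, "A"), (9, "B"), (11, "B"), (12, "B")]
stump = train_stump(DATA)
print(predict(stump, 4), predict(stump, 10))  # A B
```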
• The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood
(ML) estimate in the presence of missing or hidden data.
• In ML estimation, we wish to estimate the model parameter(s) for which the observed
data are the most likely.
• Each iteration of the EM algorithm consists of two processes:
♦ The expectation (E)-step: the missing data are estimated given the observed
data and current estimate of the model parameters.
♦ The maximization (M)-step: the likelihood function is maximized under the
assumption that the missing data are known.
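The two steps above can be sketched on a classic toy problem: two biased coins, where we observe only the head count of each batch of 10 flips, not which coin produced the batch. EM recovers estimates of both biases:

```python
# EM for two biased coins: the coin identity of each batch is the hidden
# data. E-step: soft-assign batches to coins; M-step: re-estimate biases.
def em_two_coins(head_counts, flips=10, steps=50, theta=(0.6, 0.5)):
    theta_a, theta_b = theta
    for _ in range(steps):
        stats = [0.0, 0.0, 0.0, 0.0]  # heads_a, total_a, heads_b, total_b
        for h in head_counts:
            # E-step: posterior probability that coin A produced this batch
            like_a = theta_a**h * (1 - theta_a) ** (flips - h)
            like_b = theta_b**h * (1 - theta_b) ** (flips - h)
            w = like_a / (like_a + like_b)
            stats[0] += w * h
            stats[1] += w * flips
            stats[2] += (1 - w) * h
            stats[3] += (1 - w) * flips
        # M-step: maximum-likelihood biases given the soft assignments
        theta_a = stats[0] / stats[1]
        theta_b = stats[2] / stats[3]
    return theta_a, theta_b

a, b = em_two_coins([9, 8, 9, 2, 1, 2])
print(round(a, 2), round(b, 2))  # roughly 0.87 and 0.17
```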
• The Baum-Welch algorithm is a special case of the EM algorithm and is the standard algorithm for training HMMs.
• In general, Machine Learning Algorithms are widely used in the development of NLP
applications such as Document Classification, Disambiguation, Speech Recognition,
Machine Translation, Optical Character Recognition, etc.