
@tomzeal

CPE 510: Natural Language Processing
Friday HSLTB 8 – 10am
Human Language Processing: this sphere of study belongs to
the field of artificial intelligence.
Artificial intelligence is defined by what it has and what it
does, not by what it is.
What is Artificial Intelligence?
Alan Turing: the Turing test proposes a way to decide
whether a machine is intelligent.
Searle: the Chinese Room experiment.
Features of Human Language
1. Human language has a sophisticated linguistic
system which allows the generation of infinite
expressions from a finite set of cues or words.
2. Human language is ambiguous. Ambiguity occurs at
different levels:
- Lexical ambiguity: written the same way and
pronounced the same way, yet meaning different
things (Ade sa aso & Ade sa ere).
- Orthographic ambiguity: written the same way,
pronounced differently (I read the book [present]
vs. I read the book [past]; It is a record vs.
Please record it).
- Sentential ambiguity: the expression as a whole is
ambiguous (I saw the boy with my glasses).
3. Language is domain-bound, i.e. it can belong to a
domain of belief (some words can be a taboo or
abomination) or a domain of world view (how you see
the world).

Wife – Iyawo
Husband – Oko
Family – Ebi
Lord – Oluwa

4. Language evolves with time and technology.
5. Language expresses three entities:
- What is done (Verb)
- Who did it (Subject)
- To whom it is done (Object)
6. It uses defaults to express multi-level conceptions of
complex ideas (e.g. "birds fly", yet the ostrich is
a bird and in the real sense does not fly).
What must be considered for Natural Language
Processing?
 Language has a structure (grammatical structure)
 Language has a semantic context
 Language has constraints on the mapping of
structure to context
What is Language?
Why is it difficult to express knowledge definitively?
1. Whole Part Dilemma: If a system comprises many
components, can we describe the behaviour of the
system as an aggregation of its component parts?




2. Signifier – Signified Dilemma: things mean
different things to different people. Most often the
signifier is spoken of, not the signified.

How does perception come to us? Through the senses (sight,
sound, taste, smell, touch), but other things also shape
what we perceive (emotion and intuition), and language
likewise feeds into this.
Formal Language vs. Human Language vs. Natural Language
1. Purpose: Formal – a means of communicating with
computing devices; Human – a means of communication
among humans; Natural – a means of communication among
natural entities.
2. Character: Formal – dry and sterile; Human – creative
and fertile; Natural – engrained in nature.
3. Symbols: Formal – uses discrete symbols; Human – uses
signifiers to represent natural or abstract entities or
phenomena; Natural – primarily uses natural signal
phenomena, e.g. waves (sound & water), sunrays.
4. Consistency: Formal – definitive and seeks consistency;
Human – ambiguous (because of creativity); Natural –
definitive and consistent.
Content of NLP
[Diagram: a Carpenter, a Botanist and a Sawmiller each view
a Tree differently; Computer Science likewise has its own
view of language.]



[Venn diagram: A – Formal Language, B – Psychology,
C – A.I., D – NLP]
Features of human Language in the context of
NLP
1. The construction of signifiers for ideas and concepts
is arbitrary.
2. Due to this arbitrariness, the signifiers or words that
constitute a language are difficult to count. There are
171,476 words in the Advanced Learner's Dictionary.
3. There is a systematic process for constructing
expressions from the basic signifiers of the language
(this is the grammar).
4. No two individuals have the same language behaviour.
5. It is possible to generate an infinite number of
expressions from a finite set of signifiers or words
using the same system.
Useful Terms in NLP
 Alphabet: a finite set of symbols defining a
language. The symbols satisfy:
 Each symbol in an alphabet must represent a
unique primitive, and each primitive represents
an indivisible or atomic entity in the domain of
interest.
 Each primitive on its own cannot register or
reckon a sense or concept.
 A string is a sequence of symbols over an
alphabet, i.e. a concatenation of symbols of
the alphabet.
 A word is a string that has a meaning, and the
meaning is what differentiates it from a mere
string, i.e. string + meaning assignment = word.
 Vocabulary is the set of words in a language.
 A syllable is a sound that can be produced in one
effort.
 A language is formally defined as a set of strings
over an alphabet, but in this class it is a set of
words and the rules that govern them.
 A meta-language is a language used to describe
another language; examples are markup in XML and
comments in programs.
 Syntax: the rules for constructing expressions.
 Semantics: what an expression means.
 Pragmatics: what people get from it.
Computational Levels
Level 0 – Register
Level 1 – Memory
Level 2 – Arithmetic / Logic / Relation
Level 3 – Counting and Ordering
Level 4 – Selection / Decision
Level 5 – Control and Parallel Operation
An operation at level i requires level (i - 1) details.
Noam Chomsky Hierarchy
Type 3 – Regular, e.g. (ab)^n – recognized by Finite
State Automata
Type 2 – Context Free, e.g. a^n b^n – recognized by
Push-Down Automata (stack)
Type 1 – Context Sensitive (includes options of
selection), e.g. a^n b^n c^n – recognized by Linearly
Bounded Automata
Type 0 – Unrestricted (Recursively Enumerable), e.g.
human language, a^(n!) – recognized by Turing Machines
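The gap between Type 3 and Type 2 in the hierarchy above can be sketched in Python (a toy illustration, not from the lecture; the function names are mine): a regular expression suffices for (ab)^n, but a^n b^n needs counting, which a push-down automaton's stack supplies.

```python
# A sketch contrasting Type 3 and Type 2 languages from the hierarchy above.
import re

def is_ab_n(s):
    """(ab)^n is regular, so a finite-state regex suffices."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_a_n_b_n(s):
    """a^n b^n is context free; simulate a stack by counting."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

print(is_ab_n("ababab"))   # True
print(is_ab_n("aabb"))     # False
print(is_a_n_b_n("aabb"))  # True
```

No regular expression can recognize a^n b^n, because a finite automaton cannot count arbitrarily many a's.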
NLP System Development Steps
Why do we have to develop?
Components: Vocabulary, Semantics, Synthesis, Grammar
Applications: Speech Recognition, Speech Synthesis,
Machine Translation, Text Summarization, Automatic
Dialogue Systems, Automatic Dictation, Language
Recognition

[Diagram: Computer Science and Engineering, Cognitive
Science, and Linguistics and Language together develop
Language Technology, which impacts Human Language.]



1. Understand the Problem:
2. State reasonable assumptions that are appropriate:
3. Identify the language structure:
4. Identify the System States
Speech and Signal Analysis
Praat.exe

Modelling Human Language
Subject – Ade
Verb – slapped
Object – Olu

Phrase Structure Grammar (Context Free Grammar
- Level 2)
G: <VT, VN, S, P>
VT = finite set of terminal symbols, i.e. the alphabet
VN = finite non-empty set of non-terminal symbols
S = a non-terminal symbol called the start symbol
P = production or re-write rules; <x> denotes a non-terminal
1. <sentence> ::= <noun phrase><verb phrase>
2. <noun phrase> ::= <noun> | <determinant><noun>
3. <verb phrase> ::= <verb> | <verb><noun phrase>
4. <noun> ::= {list of all nouns}
5. <determinant> ::= {list of all determinants}
Example: <sentence> ⇒ Kola eats the food

<number> ::= <digit> | <number><digit> (this allows
recursive definition)
<digit> ::= {0…9}
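The re-write rules above can be sketched as a small Python program that generates sentences by recursive expansion (a minimal sketch; the word lists for <noun>, <verb> and <determinant> are illustrative assumptions taken from the example sentence):

```python
import random

# The phrase structure grammar above, as Python data. Each non-terminal
# maps to its alternative productions.
GRAMMAR = {
    "<sentence>":    [["<noun phrase>", "<verb phrase>"]],
    "<noun phrase>": [["<noun>"], ["<determinant>", "<noun>"]],
    "<verb phrase>": [["<verb>"], ["<verb>", "<noun phrase>"]],
    "<noun>":        [["Kola"], ["food"]],
    "<verb>":        [["eats"]],
    "<determinant>": [["the"]],
}

def generate(symbol="<sentence>"):
    """Expand a symbol by recursively rewriting non-terminals."""
    if symbol not in GRAMMAR:        # terminal: emit as-is
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    out = []
    for sym in production:
        out.extend(generate(sym))
    return out

print(" ".join(generate()))  # e.g. "Kola eats the food"
```

Because <verb phrase> and <number> are defined recursively, a finite set of rules yields infinitely many expressions, as noted earlier.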
31 – 01 - 2014
Modelling a 4th-degree polynomial with a quadratic
function is like modelling a Type 0 language with a
Type 2 grammar.
Phrase Structure → Type 2
Human Language → Type 0
Dialogue System
Comprises two or more agents who interact
using a language in a well-defined environment towards
achieving a well-defined goal.

Robot Arm Environment
[Parse tree]
<S>
    NP: Kola
    VP
        <Verb>: eats
        <Noun Phrase>
            <determinant>: the
            <Noun>: food


W: the domain of the robot arm
B = {A, B, C, D, E, F}
The environment is monotonic, i.e. the environment won't
change as the problem of the domain evolves; the state
only changes by the effect of actions in the environment.
The domain is closed.
In order to abstract, a state space must be defined.
Whatever method has been selected, the following must
be carried out:
1. Identify and label each object in the domain
2. Identify and label the relationships between the
objects
3. Represent data on the spatial location of each
object
4. Express the world by formally representing facts
about the world
5. Devise a mechanism to determine whether or not a
formula, fact or expression is logically plausible
Methods
1. A semantic network can be used to represent this.
2. First order predicate calculus: the logic used here
is binary logic. The terms need to be defined so the
calculus can manipulate them.
Box = {A, B, C, D, E, F}
Table = {T}
Robot = {R}
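The predicate-calculus representation of the blocks world can be sketched in Python (a minimal sketch; the predicate names on/clear and the particular stacking facts are illustrative assumptions, not from the lecture):

```python
# The robot-arm blocks world as first-order-style ground facts.
boxes = {"A", "B", "C", "D", "E", "F"}
table = "T"

# Spatial-location facts, stored as on(x, y) tuples:
# A and C sit on the table, B sits on A.
facts = {("on", "A", "T"), ("on", "B", "A"), ("on", "C", "T")}

def holds(fact, kb):
    """Step 5: check whether a ground fact is entailed (here: stored)."""
    return fact in kb

def clear(box, kb):
    """A box is clear if no other box is stacked on it."""
    return not any(f[0] == "on" and f[2] == box for f in kb)

print(holds(("on", "B", "A"), facts))  # True
print(clear("A", facts))               # False: B is on A
print(clear("B", facts))               # True
```

This follows the five steps above: objects are labelled, relationships are represented as facts, and `holds` is the mechanism for deciding whether an expression is plausible in the current state.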


31 – 01 - 2014
Machine Translation
Machine Translation is the application of computers to
the task of translating text or speech from one human
language to another. The expression being translated can
be in different forms such as text, speech, image, sign,
etc. The goal of machine translation is to communicate
the content of an expression in one human language,
referred to as the source language (SL), in its
equivalent in another human language, referred to as the
target language (TL). Machine translation is a
multidisciplinary study which cuts across the arts and
sciences. The translation can be unidirectional,
bidirectional or multidirectional.
Representation and Processing
Human translators usually employ at least five distinct
kinds of knowledge:
a. Knowledge of the source language
b. Knowledge of the target language
This allows them to produce text that is acceptable in
the target language.
c. Knowledge of the various correspondences between the
source language and the target language – how
individual words can be translated
d. Knowledge of the subject matter, including ordinary
general knowledge and common sense
This, along with knowledge of the source language, allows
them to understand what the text to be translated means.
e. Knowledge of the culture, social conventions, customs,
expectations etc. of the speakers of the source and
target languages.
* Phonological knowledge, morphological knowledge
Phonological knowledge: knowledge about the sound system
of a language; knowledge which, for example, allows one
to work out the likely pronunciation of words. When
dealing with written text such knowledge is not directly
useful. However, there is related knowledge about
orthography which can be useful; this deals with writing
style.
Example: Aiye | Aye

Enia | Eniyan
Adie | Adiye
Morphological knowledge: this has to do with applying
knowledge from the study of the form and internal
structure of morphemes (words and their semantic
building blocks).
Morphemes are the smallest linguistic units in a word
that can carry meaning.
Example: print/er
(verb/noun)
un-, break, -able in "unbreakable"
It deals with how words can be constructed.
Syntactic knowledge: knowledge about how sentences and
other sorts of phrases can be made up out of words.
Semantic knowledge: knowledge about the meanings of
words, phrases and sentences, and about how the meaning
of a phrase is related to the meanings of its component
words, phrases or sentences.
Tolu builds a house (SVO)
Pragmatic knowledge: concerns the practical use of
knowledge; the use of spoken language in a social
context.
Representing Linguistic Knowledge
In general, syntax is concerned with two slightly different
analyses of sentences:
 The first is constituent or phrase structure
analysis: the division of sentences into their
constituent parts and the categorization of these
parts as nouns, verbs, etc.
 The second has to do with grammatical relations:
the assignment of grammatical relations such as
subject, object, head and so on to various parts
of the sentence.
Ade ate the food
N | V | Det | N
Subject | Predicate | Object
Grammar and Constituent Structure
Sentences are made up of words traditionally categorized
into parts of speech or categories, including nouns,
verbs, adjectives and prepositions. A grammar of a
language is a set of rules which states how these parts
of speech can be put together to make grammatical or
well-formed sentences.
E.g.
a) Put some papers in the printer – follows the grammar
rules
b) Print some put in papers – does not follow the grammar
rules
Here are some simple rules of English grammar, with
examples. A sentence consists of a noun phrase followed
by a verb phrase:
The user | should clean the printer
(noun phrase) (modal) (verb)
(…………verb phrase…………)
But within a verb phrase we can have a noun phrase:
VP → V NP
A noun phrase can contain a determinant or article such
as "the" or "a".
English to Yoruba: unidirectional bilingual
English to Yoruba and Yoruba to English: bidirectional
bilingual
English to (Hausa, Igbo, Yoruba).
11 – 03 - 2014
Difference between Clauses and Sentences
A verb phrase can consist of verbs and auxiliary verbs,
and it can also contain a noun phrase. It consists of a
verb alone only when it is a command, e.g. "Go!".
Otherwise it contains a noun phrase such as "the
printer".
We can have a structure like:
S → NP + VP
Parts of Speech
Subject            Verb   Object
N | NP | Pronoun   Verb   N | NP | Pronoun
Machine Translation Approaches
1. Statistical Approach (data-driven): translates
sentences using a set of stored words in a
database or repository.
2. Rule-Based Approach: understand the grammar
structure of the source and target languages,
describe it with a context free grammar, and
then apply it to the translation.
3. Hybrid Approach: combines the statistical and
rule-based approaches.
Part-of-speech tagging is the process of assigning a
part of speech label to each lexical item.
Lexicon → a collection of words
Lexical item === lexeme === word
Parts of speech give information about a word and its
neighbours. This is clearly true for the major
categories, e.g. verb and noun.
Book the flight
(book: V | N – ambiguous)
Ade bought the book
S → Ade<N> bought<V> the<Det> book<N>
N → Ade, book
V → bought
Det → the
Ade ra iwe naa
Tones: high (/), low (\), mid (unmarked)
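A dictionary-based tagger for the example above can be sketched in Python (a minimal sketch; the tag lexicon is an illustrative assumption, and no disambiguation of words like "book" is attempted here):

```python
# Assign a part-of-speech label to each lexical item from a small lexicon.
LEXICON = {
    "ade":    ["N"],
    "bought": ["V"],
    "the":    ["Det"],
    "book":   ["N", "V"],  # ambiguous: noun or verb
}

def tag(sentence):
    """Attach the first lexicon tag to each word (no disambiguation)."""
    return [(w, LEXICON.get(w.lower(), ["?"])[0]) for w in sentence.split()]

print(tag("Ade bought the book"))
# [('Ade', 'N'), ('bought', 'V'), ('the', 'Det'), ('book', 'N')]
```

Resolving the ambiguity of "book" (noun in "Ade bought the book", verb in "Book the flight") is exactly what the rule based and stochastic tagging methods below address.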
Types of Part of Speech Tagging
1. Rule Based: rule based tagging often has a large
database of hand-written disambiguation rules
which specify, for example, that an ambiguous
word is a noun rather than a verb if it follows
a determinant.
2. Stochastic Tagging: generally resolves tagging
ambiguities by using a training corpus to compute
the probability of a given word having a given
tag in a given context.
 Hidden Markov Model
Parsing
The essence is to determine whether the rules used are
correct before final translation.
1. Top-Down Parser
2. Bottom-Up Parser
Formal Grammar
This is generally described as a structure containing a
vocabulary and a set of rules for defining strings on the
basis of the vocabulary. The Chomsky hierarchy of formal
grammars:
Unrestricted Grammar (Type 0): the languages defined by
type 0 grammars are accepted by Turing machines.
Chomskian transformations are defined as type 0 grammars.
Type 0 grammars have rules of the form
α → β
where α and β are arbitrary strings over the vocabulary
V and α ≠ ε.
Both α and β may mix terminals and non-terminals of a
sentence, e.g.
α could be: S, NP, VP and PP (all non-terminals);
β could be: verb, noun, preposition, etc. (the words in
these categories are terminals).
Context Sensitive Grammar (Type 1): the languages defined
by type 1 grammars are accepted by linearly bounded
automata. The syntax of some natural languages is
generally held in computational linguistics to be
context-sensitive. They have rules of the form:
αAβ → αBβ
where A ∈ N, B ≠ ε, and α, β ∈ V*. The rule S → ε is
allowed, where S ∈ N is the initial symbol and ε is the
empty string.
14 – 03 – 2014
Example: Type 1

Rule 1: S → aSBC
Rule 2: S → aBC
Rule 3: CB → BC
Rule 4: aB → ab
Rule 5: bB → bb
Rule 6: bC → bc
Rule 7: cC → cc

This grammar generates all strings consisting of a
non-empty sequence of a's followed by the same number of
b's followed by the same number of c's:
a^n b^n c^n, n ≥ 1
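A derivation under these rules can be traced mechanically. The sketch below (mine, not from the lecture) applies a chosen sequence of the rules above, rewriting the leftmost occurrence of each left-hand side, and derives aabbcc (n = 2):

```python
# Derive "aabbcc" with the context-sensitive rules above.
rules = [
    ("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]

def derive(steps, start="S"):
    """Apply a sequence of rule indices, printing each sentential form."""
    form = start
    print(form)
    for i in steps:
        lhs, rhs = rules[i]
        assert lhs in form, f"rule {lhs} -> {rhs} not applicable to {form}"
        form = form.replace(lhs, rhs, 1)  # rewrite leftmost occurrence
        print(form)
    return form

# S => aSBC => aaBCBC => aaBBCC => aabBCC => aabbCC => aabbcC => aabbcc
assert derive([0, 1, 2, 3, 4, 5, 6]) == "aabbcc"
```

Note how rule 3 (CB → BC) reorders the non-terminals before rules 4-7 turn them into terminals; this context-dependence is what puts the grammar above Type 2.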

Context Free Grammar (Type 2): the languages defined by
type 2 grammars are accepted by push-down automata; the
syntax of the natural languages is definable almost
entirely in terms of context free grammars and the tree
structures generated by them. Type 2 grammars have rules
of the form:
A → β, where A ∈ N and β ∈ V*
There are special normal forms, e.g. Chomsky normal form
and Greibach normal form, into which any context free
grammar can be equivalently converted.
Regular Grammars (Type 3): languages defined by these are
accepted by finite state automata; morphological
structure and perhaps all the syntax of formal spoken
dialogue is describable by regular grammars.
Read up: there are two types, right-linear and
left-linear.
Scenario 1: Subject Verb Object
a. John ate the rice
(1) S → NP VP
(2) NP → N
(3) VP → V NP
(4) NP → Det N
(5) NP → Det Adj N
b. The tall boy ate the white rice
[subject & object modifiers]
Rule (5) above suffices.
Note: if we have a preposition, it can be rewritten as
PP → Prep NP
Adjectival phrase → (Det) Adj N
Adverbial phrase → Verb (Adv)
Re-ordering | Swapping
(1) He ate the rice
Ó je ìresì náà
PRN V N Det
So you need to change the order in the re-write rules,
i.e.
Det Noun → Noun Det
(2) The red car
Okò pupa náà
The adjectival phrase becomes → Noun Adjective Det
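The re-ordering rule can be sketched as a tiny rule-based translation step in Python (a toy sketch; the word-for-word dictionary and tag table are illustrative assumptions):

```python
# Rule-based re-ordering for an English -> Yoruba noun phrase, per the
# re-write rule Det Noun -> Noun Det shown above.
EN_YO = {"the": "náà", "red": "pupa", "car": "okò"}
TAGS  = {"the": "Det", "red": "Adj", "car": "N"}

def translate_np(words):
    """Translate word-for-word, then re-order Det Adj N -> N Adj Det."""
    tagged = [(EN_YO[w], TAGS[w]) for w in words]
    # English puts Det and Adj before the noun; Yoruba puts them after.
    order = {"N": 0, "Adj": 1, "Det": 2}
    tagged.sort(key=lambda pair: order[pair[1]])
    return " ".join(w for w, _ in tagged)

print(translate_np(["the", "red", "car"]))  # okò pupa náà
```

This is the rule-based approach from earlier in miniature: a dictionary supplies correspondences, and a structural rule swaps the order.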
Phrase Structure Grammar
This is the traditional method of representing the
structural constituents of a sentence in a phrase
structure tree.
The tall man | beat the boy
NP | VP
(Subject) | (Predicate)
S → NP VP
NP → Det AdjP
AdjP → Adj N
VP → V NP
NP → Det N


Inflected verbs: in translating from English to Yoruba,
there is no need to take care of verb inflections; the
inflected forms map to one Yoruba form.
E.g.: beat | beats → na
go | goes | went | gone → lọ
21 – 03 - 2014
TEXT PROCESSING
Spelling Correction: scanning a sentence for words that
do not appear in the dictionary.
Grammar Checker: flagging words put together in a
sentence when there is no correspondence between them,
hence making the sentence out of context.
Language Recognition:
Information Retrieval: allows one to locate relevant
documents that are related to a query, but does not
specify where the answers are. In information retrieval
the documents of interest are fetched by matching query
keywords to indexes of the document collection. The main
purpose of information retrieval is to prevent
information overload.
Information Extraction: this extracts the information of
interest for a well-defined extraction domain, and
relies on filling out predefined templates. Such
information consists of entities and the relationships
between them; thus information extraction generates
structured information. It is about getting specific
information. The difference: information retrieval
returns documents, while extraction retrieves
information particular to a domain and often requires
templates for extraction.
Handwriting Recognition: comes under character
recognition.
Essay Grading: searching an essay for key terms, as
well as checking that it is written in good grammar.
Recommender System: a system that goes through given
data and then, based on defined variables supplied to
it, suggests an idea to the user.
Text Categorization: a case study in which this has been
used is the "Federalist Papers"; this was accomplished
using writing style and the words used. Basically, text
categorization looks through documents from unknown
authors and makes an effort to categorize them.
Plagiarism detection is also an application of text
categorization.
Question Answering System: accepts questions in
natural language form, searches for answers over a
collection of documents and formulates a concise
answer. It involves question processing, document
retrieval, answer extraction and answer formulation.
Computational Semantics:
Tools applied for Text Processing
Pattern Matching: text written in any language is about
patterns; matching the defined pattern is what makes it
valid. Rules can be used to implement pattern matching.
- Regular Expressions: an example is detecting
whether a word is a proper noun.
- Statistics or probability:
- Dictionaries:
- Machine Learning: often relies on probability
theory and statistics alongside a machine
learning algorithm to be able to recognize
patterns.
- Language Model:
- Spell Check: the process includes scanning and
identifying.
- Edit Distance: insertion, transposition,
substitution, deletion.
- Contextual Spell Checking:
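Edit distance with the four operations listed above (insertion, deletion, substitution, transposition) can be sketched in Python; this is the Damerau-Levenshtein variant, shown here as a minimal illustration:

```python
# Minimum number of edits (insert, delete, substitute, transpose)
# turning string a into string b, via dynamic programming.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("teh", "the"))           # 1 (one transposition)
print(edit_distance("speling", "spelling"))  # 1 (one insertion)
```

A spell checker uses this to suggest the dictionary words closest to a misspelled word, i.e. those with the smallest edit distance.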


[Parse tree: The tall man beat the boy]
Sentence
    NP
        Det: The
        AdjP
            Adj: tall
            N: man
    VP
        V: beat
        NP
            Det: the
            N: boy