Section II
n Tokenization and Word Segmentation,
n Stemming,
n Lemmatization,
n Morphological Processing
¨ Types of Morpheme,
¨ Morphological Types,
¨ Morphological Rules,
¨ Morphemes and Words,
¨ Inflectional and Derivational Morphology
Language
n Language is a system of communication consisting of a set
of sounds, signs, and written symbols used by the people of a
particular country or region to communicate.
n There are approximately 7,111 languages in the world,
all organized in the same basic way.
¨ 2,143 belong to Africa (30.14%).
¨ More than 83 languages are spoken in Ethiopia.
Communication
n Verbal
¨ The message is transmitted through spoken words.
n Textual
¨ One of the most widely used categories.
n Emails, SMS text messages, social media interactions, and instant messaging
accelerate communication between parties.
n Gesture
¨ Movement of part of the body, especially a hand or the head, to express
an idea.
Sign Language (Gesture)
Spoken language
Written Language
Natural Languages
n Natural language refers to a human language, which differs from
an artificial command or programming language.
n One of the most challenging problems in the field of
computing is to develop a system that can understand
natural/human languages.
¨ So far, the complete solution to this problem has proved elusive,
although a great deal of progress has been made.
¨ Fourth-generation languages are the programming languages closest
to natural languages.
Text processing
n Text processing is the theory and practice of manipulating electronic text
automatically so that it can be used by a specific NLP application.
¨ Its scope may differ depending on the NLP application and
domain.
n Almost all NLP tasks need text processing:
¨ Sentence segmentation, word tokenization, format normalization, search and
replace, filtering files or reports, etc.
n Trivial string manipulation programs are not economical;
performing these tasks requires robust text processing.
n Most widely used Approach: RegEx
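As a rough illustration of the tasks listed above, the following sketch uses Python's re module for naive sentence segmentation, word tokenization, and whitespace normalization. The sample text and the patterns are assumptions chosen for the example; real segmenters need to handle abbreviations, numbers, and language-specific punctuation.

```python
import re

# Sample text (made up for this illustration).
text = "Addis Ababa is the capital of Ethiopia. It hosts the African Union!  Is it large?"

# Naive sentence segmentation: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokenization: every run of word characters is a token.
tokens = re.findall(r"\w+", sentences[0])

# Format normalization: collapse runs of whitespace into a single space.
normalized = re.sub(r"\s+", " ", text).strip()

print(sentences)
print(tokens)
print(normalized)
```

The lookbehind `(?<=[.!?])` keeps the punctuation attached to its sentence instead of consuming it in the split.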
Regular Expression
n RE is a formal language for specifying text strings.
¨ Python's re module offers a set of functions that allow us to
search a string for a match.
n Regular expressions are case sensitive.
n A string of characters inside square brackets specifies a disjunction
of characters to match.
¨ E.g. /[aA]ddis/ matches Addis or addis
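A minimal sketch of the bracket disjunction and of case sensitivity, using re.search from Python's re module. The sample sentences are made up for the illustration.

```python
import re

pattern = re.compile(r"[aA]ddis")

m1 = pattern.search("Welcome to Addis Ababa")   # matches "Addis"
m2 = pattern.search("welcome to addis ababa")   # matches "addis"
m3 = pattern.search("WELCOME TO ADDIS ABABA")   # None: matching is case sensitive

# The re.IGNORECASE flag switches case sensitivity off.
m4 = re.search(r"addis", "WELCOME TO ADDIS ABABA", re.IGNORECASE)

print(m1.group(), m2.group(), m3, m4.group())
```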
RegEx (2)
Character classes
.            any character except newline
\w \d \s     word, digit, whitespace
\W \D \S     not word, digit, whitespace
[abc]        any of a, b, or c
[^abc]       not a, b, or c
[a-g]        character between a & g
Quantifiers & Alternation
a* a+ a?     0 or more, 1 or more, 0 or 1
a{5} a{2,}   exactly five, two or more
a{1,3}       between one & three
a+? a{2,}?   match as few as possible
ab|cd        match ab or cd
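The character classes and quantifiers above can be tried out with re.findall and re.search. This is a small sketch; the sample strings are assumptions picked to keep the results easy to check by hand.

```python
import re

# Character classes
digits = re.findall(r"\d+", "Room 12, floor 3")       # runs of digits
a_to_g = re.findall(r"[a-g]", "bag of figs")          # letters from a to g
not_abc = re.findall(r"[^abc]", "cabbage")            # anything except a, b, c

# Quantifiers: greedy versus lazy
greedy = re.search(r"a{2,}", "aaaa").group()          # takes as many a's as possible
lazy = re.search(r"a{2,}?", "aaaa").group()           # takes as few a's as possible

# Alternation
either = re.findall(r"ab|cd", "abcdab")

print(digits, a_to_g, not_abc, greedy, lazy, either)
```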
RegEx (3)
Groups & Lookaround
(abc)        capture group
\1           backreference to group #1
(?:abc)      non-capturing group
(?=abc)      positive lookahead
(?!abc)      negative lookahead
Anchors
^abc$        start / end of the string
\b \B        word, not-word boundary
Escaped characters
\. \* \\     escaped special characters
\t \n \r     tab, linefeed, carriage return
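To make groups, backreferences, and lookaround more concrete, here is a short sketch with Python's re module. The sample strings (including the "birr" price example) are invented for the illustration.

```python
import re

# Capture group + backreference: find a word that is immediately repeated.
dup = re.search(r"\b(\w+) \1\b", "this is is a test")

# Non-capturing group: repeat "ab" as a unit without creating a group.
runs = re.findall(r"(?:ab)+", "ababab ab")

sample = "cost 50 birr, weight 20 kg"
# Positive lookahead: numbers that are followed by " birr".
prices = re.findall(r"\d+(?= birr)", sample)
# Negative lookahead: whole numbers that are NOT followed by " birr".
others = re.findall(r"\b\d+\b(?! birr)", sample)

# Escaped special character: a literal dot.
dots = re.findall(r"\.", "a.b.c")

print(dup.group(0), dup.group(1), runs, prices, others, dots)
```

The `\b` anchors in the negative lookahead keep the engine from backtracking to a partial number such as "5".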
Regex Example (1)
Exp    String    Matched?
       a         No match
       string    1 match
a$     a         1 match
       formula   1 match
       cab       No match
Regex Example (3)
Exp    String    Matched?
ma*n   mn        1 match
       man       1 match
       maaan     1 match
a|b    cde       No match
       ade       1 match (the a in ade)
       acdbea    3 matches (a, b, and a in acdbea)
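The rows for ma*n and a|b can be checked with re.findall, which returns every non-overlapping match. This is only a quick verification sketch of the table entries.

```python
import re

# ma*n: an m, then zero or more a's, then an n
star = {s: re.findall(r"ma*n", s) for s in ["mn", "man", "maaan"]}

# a|b: either an a or a b
alt = {s: re.findall(r"a|b", s) for s in ["cde", "ade", "acdbea"]}

for string, matches in {**star, **alt}.items():
    print(f"{string!r}: {len(matches)} match(es) {matches}")
```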
Regex Example (5)
Exp         String      Matched?
(a|b|c)xz   ab xz       No match
            abxz        1 match (bxz in abxz)
            axz cabxz   2 matches (axz and bxz)
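One detail worth knowing when checking this table in Python: if the pattern contains a capture group, re.findall returns the group contents rather than the whole match. A small sketch, using re.finditer to recover the full matched text:

```python
import re

pattern = re.compile(r"(a|b|c)xz")

# findall returns only what the capture group matched.
groups_only = pattern.findall("axz cabxz")

# finditer yields match objects; group(0) is the whole match.
full_matches = [m.group(0) for m in pattern.finditer("axz cabxz")]

no_match = pattern.findall("ab xz")
one_match = [m.group(0) for m in pattern.finditer("abxz")]

print(groups_only, full_matches, no_match, one_match)
```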
n \A - Matches if the specified characters are at the start of a string.
n \b - Matches if the specified characters are at the beginning or end
of a word.
n \B - Opposite of \b. Matches if the specified characters are not at
the beginning or end of a word.
n \d - Matches any decimal digit. Equivalent to [0-9]
n \D - Matches any character that is not a decimal digit. Equivalent to [^0-9]
n \s - Matches where a string contains any whitespace character.
Equivalent to [ \t\n\r\f\v].
n \S - Matches where a string contains any non-whitespace character.
Equivalent to [^ \t\n\r\f\v].
n \w - Matches any alphanumeric character (letters and digits).
Equivalent to [a-zA-Z0-9_].
¨ Underscore _ is also considered an alphanumeric character.
n \W - Matches any non-alphanumeric character. Equivalent to
[^a-zA-Z0-9_]
n \Z - Matches if the specified characters are at the end of a string.
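The special sequences can be explored with a few one-liners. Note that in Python 3 the bracket equivalences listed above hold for ASCII text; for str patterns, \w, \d, and \s also match Unicode characters unless the re.ASCII flag is given. The sample strings, including the Amharic word, are assumptions for the illustration.

```python
import re

digits = re.findall(r"\d", "Tel: 0911")                 # each decimal digit
non_word = re.findall(r"\W", "a_b-c d")                 # underscore counts as a word character
fields = re.split(r"\s+", "one  two\tthree\nfour")      # split on any whitespace run
chunks = re.findall(r"\S+", " hi there ")               # runs of non-whitespace

# Unicode versus ASCII-only matching of \w
unicode_words = re.findall(r"\w+", "ሰላም hello")
ascii_words = re.findall(r"\w+", "ሰላም hello", re.ASCII)

print(digits, non_word, fields, chunks, unicode_words, ascii_words)
```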
Exp     String       Matched?
\Athe   the sun      Match
        In the sun   No match
\bfoo   football     Match
        a football   Match
        afootball    No match
\Bfoo   football     No match
        a football   No match
        afootball    Match
        the foo      No match
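The anchor examples can be checked with re.search, which returns None when there is no match. The sketch below simply replays the \Athe, \bfoo, and \Bfoo cases.

```python
import re

cases = [
    (r"\Athe", "the sun"),
    (r"\Athe", "In the sun"),
    (r"\bfoo", "football"),
    (r"\bfoo", "a football"),
    (r"\bfoo", "afootball"),
    (r"\Bfoo", "football"),
    (r"\Bfoo", "a football"),
    (r"\Bfoo", "afootball"),
    (r"\Bfoo", "the foo"),
]

# True when the pattern is found anywhere in the string.
results = {(p, s): re.search(p, s) is not None for p, s in cases}

for (p, s), matched in results.items():
    print(f"{p:6} {s!r:14} {'Match' if matched else 'No match'}")
```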
n (*v*) ING → Ø
¨ Condition verified: motoring → motor
¨ Condition not verified: sing → sing (the stem s contains no vowel)
STEP 2B (Cleaning)
n (These rules are run if the second or third rule in 2a applies)
¨ AT → ATE
derivat(ed) → derivate
¨ BL → BLE
tumbl(ing) → tumble
n (*d & !(*L or *S or *Z)) → single letter
¨ Condition verified: hopp(ing) → hop, tann(ed) → tan
¨ Condition not verified: fall(ing) → fall
Notation
m     The measure of the stem: the number of VC patterns.
*S    The stem ends with S
*v*   The stem contains a vowel
*d    The stem ends with a double consonant
*o    The stem ends in CVC (second C not W, X, or Y)
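The suffix-stripping and cleaning rules shown above can be sketched in a few lines of Python. This is a partial illustration, not a full Porter stemmer: it covers only the (*v*) ING/ED removal and the AT, BL, and double-consonant cleanup rules listed here, and the function names (is_consonant, contains_vowel, ends_double_consonant, cleanup, strip_ing_or_ed) are hypothetical helpers introduced for the example.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a y that follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """Condition *d: the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def cleanup(stem):
    """Cleaning rules from the slide: AT -> ATE, BL -> BLE,
    and (*d and not (*L or *S or *Z)) -> single letter."""
    if stem.endswith(("at", "bl")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    return stem

def strip_ing_or_ed(word):
    """Remove -ing / -ed only if the remaining stem contains a vowel,
    then apply the cleaning rules."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word

for w in ["motoring", "sing", "derivated", "tumbling", "hopping", "tanned", "falling"]:
    print(w, "->", strip_ing_or_ed(w))
```

Running the loop reproduces the examples on the slide; rules that do not appear in this section (for instance those using the measure m or the *o condition) are deliberately left out.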