Morphology

Section II
n Tokenization and word Segmentation,
n Stemming,
n Lemmatization,
n Morphological Processing
¨ Types of Morpheme,
¨ Morphological Types,
¨ Morphological Rules,
¨ Morphemes and Words,
¨ Inflectional and Derivational Morphology
Language
n Language is a system of communication consisting of a set
of sounds, signs, and written symbols used by the people of a
particular country or region to communicate.
n There are approximately 7,111 languages in the world,
all organized in the same basic way.
¨ 2,143 belong to Africa (30.14%)
¨ More than 83 languages are spoken in Ethiopia.
Communication
n Verbal
¨ The message is transmitted through spoken words.
n Textual
¨ One of the most widely used categories.
n Emails, SMS text messages, social media interaction, and instant messaging
accelerate communication between parties.
n Gesture
¨ Movement of part of the body, especially a hand or the head, to express
an idea.
Sign Language (Gesture)
Spoken language
Written Language
Natural Languages
n Natural language refers to a human language, which differs from
an artificial command or programming language.
n One of the most challenging problems in the field of
computing is to develop a system that can understand
natural/human languages.
¨ So far, the complete solution to this problem has proved elusive,
although a great deal of progress has been made.
¨ Fourth-generation languages are the programming languages closest
to natural languages.
Text processing
n Text processing is the theory and practice of manipulating electronic text
automatically so that it can be used by a specific NLP application.
¨ The scope of text processing varies with the NLP application and
domain.
n Almost all NLP tasks need text processing:
¨ Sentence segmentation, word tokenization, format normalization, search and
replace, filtering a file or report, etc.
n Ad hoc string-manipulation programs are not economical;
performing these tasks requires robust text processing.
n Most widely used approach: regular expressions (RegEx)
Regular Expression
n A regular expression (RE) is a formal language for specifying text strings.
¨ Python's re module offers a set of functions that allow us to
search a string for a match.
n Regular expressions are case sensitive.
n A string of characters inside square brackets specifies a disjunction
of characters to match.
¨ E.g. /[aA]ddis/ matches Addis or addis
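A minimal sketch of the bracket disjunction above, using Python's re module. The sample strings are made up for illustration, and the re.IGNORECASE variant is shown only as one possible alternative to listing both cases by hand.

```python
import re

# [aA] is a disjunction: it matches exactly one character, either "a" or "A".
pattern = re.compile(r"[aA]ddis")

for text in ["Addis Ababa", "addis ababa", "ADDIS ABABA"]:
    found = pattern.search(text)
    # Matching is case sensitive, so the all-caps string yields None.
    print(text, "->", found.group() if found else None)

# An alternative (assumption: full case-insensitivity is acceptable).
ci = re.search(r"addis", "ADDIS ABABA", flags=re.IGNORECASE)
print(ci.group())
```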
RegEx (2)

Character classes
.            any character except newline
\w \d \s     word, digit, whitespace
\W \D \S     not word, digit, whitespace
[abc]        any of a, b, or c
[^abc]       not a, b, or c
[a-g]        character between a & g

Quantifiers & Alternation
a* a+ a?     0 or more, 1 or more, 0 or 1
a{5} a{2,}   exactly five, two or more
a{1,3}       between one & three
a+? a{2,}?   match as few as possible
ab|cd        match ab or cd
RegEx (3)
Groups & Lookaround
(abc)        capture group
\1           backreference to group #1
(?:abc)      non-capturing group
(?=abc)      positive lookahead
(?!abc)      negative lookahead

Anchors
^abc$        start / end of the string
\b \B        word, not-word boundary

Escaped characters
\. \* \\     escaped special characters
\t \n \r     tab, linefeed, carriage return
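The group and lookaround constructs are easier to read in action. The following is a small sketch with made-up sample strings; it is one way to exercise each construct, not the only one.

```python
import re

# Capture group + backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))

# Non-capturing group: groups "ab" for the + quantifier without capturing it,
# so findall returns the whole match rather than the group.
runs = re.findall(r"(?:ab)+", "ababab xx ab")
print(runs)

# Positive lookahead: a word only if ".com" follows (".com" is not consumed).
com_names = re.findall(r"\w+(?=\.com)", "gmail.com yahoo.org")
print(com_names)

# Negative lookahead: whole numbers that are NOT followed by a percent sign.
plain_numbers = re.findall(r"\b\d+\b(?!%)", "50% of 200 items")
print(plain_numbers)
```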
Regex Example (1)

Exp      String        Matched?
[abc]    a             1 match
         ababa         5 matches
         Hey student   No match
         abc de ca     5 matches

Sample representation
[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].
[^abc] matches any character except a, b, or c.
[^0-9] matches any non-digit character.
Regex Example (2)
Exp    String    Matched?
..     a         No match
       ac        1 match
       acd       1 match
       acde      2 matches (contains 4 characters)
^a     a         1 match
       abc       1 match
       bac       No match
^ab    abc       1 match
       acb       No match (starts with a, but not followed by b)
a$     string    No match
       formula   1 match
       cab       No match
Regex Example (3)

Exp     String   Matched?
ma*n    mn       1 match
        man      1 match
        maaan    1 match
        main     No match (a is not followed by n)
        woman    1 match
ma+n    mn       No match (no a character)
        man      1 match
        maaan    1 match
        main     No match (a is not followed by n)
        woman    1 match
Regex Example (3)

Exp       String        Matched?
ma?n      mn            1 match
          man           1 match
          maaan         No match (more than one a character)
          main          No match (a is not followed by n)
          woman         1 match
a{2,3}    abc dat       No match
          abc daat      1 match (at daat)
          aabc daaat    2 matches (at aabc and daaat)
          aabc daaaat   2 matches (at aabc and daaaat)
Regex Example (4)
Exp           String          Matched?
[0-9]{2,4}    ab123csde       1 match (at 123)
              12 and 345673   3 matches (at 12, 3456, and 73)
              1 and 2         No match
a|b           cde             No match
              ade             1 match (at a)
              acdbea          3 matches (at a, b, and a)
Regex Example (5)
Exp          String      Matched?
(a|b|c)xz    ab xz       No match
             abxz        1 match (at bxz)
             axz cabxz   2 matches (at axz and bxz)
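The match counts in the example tables can be checked directly. This is a small sketch; count_matches is a helper name introduced here for illustration and is not part of the re module.

```python
import re

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text, scanning left to right."""
    return sum(1 for _ in re.finditer(pattern, text))

# (pattern, string, expected number of matches) taken from the example tables.
examples = [
    (r"[abc]", "abc de ca", 5),
    (r"..", "acde", 2),
    (r"ma*n", "woman", 1),
    (r"ma+n", "mn", 0),
    (r"ma?n", "maaan", 0),
    (r"a{2,3}", "aabc daaaat", 2),
    (r"[0-9]{2,4}", "ab123csde", 1),
    (r"a|b", "acdbea", 3),
    (r"(a|b|c)xz", "axz cabxz", 2),
]

for pattern, text, expected in examples:
    print(f"{pattern!r:14} {text!r:16} found={count_matches(pattern, text)} expected={expected}")
```

re.finditer is used instead of re.findall because findall returns group contents when the pattern contains a capture group, which makes the results harder to compare across patterns.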
n \A - Matches if the specified characters are at the start of a string.
n \b - Matches if the specified characters are at the beginning or end
of a word.
n \B - Opposite of \b. Matches if the specified characters are not at
the beginning or end of a word.
n \d - Matches any decimal digit. Equivalent to [0-9].
n \D - Matches any character that is not a decimal digit. Equivalent to [^0-9].
n \s - Matches any whitespace character.
Equivalent to [ \t\n\r\f\v].
n \S - Matches any non-whitespace character.
Equivalent to [^ \t\n\r\f\v].
n \w - Matches any alphanumeric character (digits and letters).
Equivalent to [a-zA-Z0-9_].
¨ The underscore _ is also considered an alphanumeric character.
n \W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_].
¨ In Python 3 these equivalences hold exactly only with the re.ASCII flag; by
default \d, \s, and \w also match Unicode digits, whitespace, and letters.
n \Z - Matches if the specified characters are at the end of a string.
Exp      String          Matched?
\Athe    the sun         Match
         In the sun      No match
\bfoo    football        Match
         a football      Match
         afootball       No match
foo\b    the foo         Match
         the afoo test   Match
         the afootest    No match
\Bfoo    football        No match
         a football      No match
         afootball       Match
foo\B    the foo         No match
         the afoo test   No match
         the afootest    Match
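The anchor and boundary rows above can be reproduced with re.search. This is a sketch only; matches is a helper name introduced here for illustration.

```python
import re

def matches(pattern, text):
    """True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# (pattern, string, expected) for \A, \b, and \B.
cases = [
    (r"\Athe", "the sun", True),
    (r"\Athe", "In the sun", False),
    (r"\bfoo", "a football", True),
    (r"\bfoo", "afootball", False),
    (r"foo\b", "the afoo test", True),
    (r"foo\b", "the afootest", False),
    (r"\Bfoo", "afootball", True),
    (r"\Bfoo", "football", False),
    (r"foo\B", "the afootest", True),
    (r"foo\B", "the foo", False),
]

for pattern, text, expected in cases:
    print(f"{pattern:8} {text!r:18} {'Match' if matches(pattern, text) else 'No match'}")
```

The patterns are written as raw strings (r"...") because in an ordinary Python string "\b" is the backspace character rather than a word boundary.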


Exp         String            Matched?
\d          12abc3            3 matches (at 1, 2, and 3)
            Python            No match
\D          1ab34"50          3 matches (at a, b, and ")
            1345              No match
\s          Python RegEx      1 match
            PythonRegEx       No match
\S          a b               2 matches (at a and b)
            (spaces only)     No match
\w          12&": ;c          3 matches (at 1, 2, and c)
            %"> !             No match
\W          1a2%c             1 match (at %)
            Python            No match
Python\Z    I like Python     1 match
            Python is fun.    No match
Searching pattern (1)
import re

pattern = r"^a...s$"
test_data = "graduate program"
result = re.match(pattern, test_data)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output: Search unsuccessful.

• Here, we used the re.match() function to look for the pattern at the start of test_data.
• The function returns a match object if the search is successful. If not, it
returns None.
Searching pattern (2)
import re

data = "regex2019 is a fun"
reg_result = re.findall(r"^\w+", data)
print(reg_result)                                  # ['regex2019']
print(re.split(r"\s", "how to split words"))       # ['how', 'to', 'split', 'words']
print(re.split(r"s", "split the words"))           # ['', 'plit the word', '']
Pattern searching (3)
import re

pattern = r"\w+@(\w+\.)+(com|org|net|edu|et)"
result = re.match(pattern, "myemail@gmail.com")
print(result.group())
Output: myemail@gmail.com

• \w+ → the user name (one or more word characters)
• @ → email indicator
• (\w+\.)+ → one or more subdomain labels, each followed by a dot
• com|org|net|edu|et → root (top-level) domain
Exercise
n From the file, extract the following using regular expressions:
¨ Email addresses
¨ Website addresses
¨ Any word containing digits
¨ Dates
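One possible starting point for the exercise, as a sketch only. The sample text and all four patterns are assumptions made for illustration; they are deliberately simple and will miss many real-world formats (for example dates written as "October 13, 2019").

```python
import re

# Hypothetical sample; in the exercise this text would be read from the file.
sample = ("Contact abebe@example.com or visit https://www.example.org "
          "for the regex2019 course starting 10/13/2019.")

# Simplified patterns (assumptions, not complete validators).
EMAIL = r"[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}"
URL = r"(?:https?://|www\.)[^\s,;]+"
WORD_WITH_DIGIT = r"\b\w*\d\w*\b"          # also matches the numeric parts of dates
DATE = r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"  # 10/13/2019 or 10-13-2019 style only

emails = re.findall(EMAIL, sample)
urls = re.findall(URL, sample)
digit_words = re.findall(WORD_WITH_DIGIT, sample)
dates = re.findall(DATE, sample)

print(emails, urls, digit_words, dates, sep="\n")
```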
Tokenization
n Tokenization is the task of chopping a character sequence up into pieces, called
tokens, perhaps at the same time throwing away certain
characters, such as punctuation.
n Tokens are often loosely referred to as terms or words, but it is
sometimes important to make a type/token distinction.
¨ A token is an instance of a sequence of characters in some particular
document that are grouped together as a useful semantic unit.
¨ A type is the class of all tokens containing the same character sequence.
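The type/token distinction can be made concrete by counting both on a short text. This is a minimal sketch: the sentence is a made-up example, and lowercasing before counting is an assumption about normalization, not part of the definition.

```python
import re

text = "To be, or not to be: that is the question."

# Tokens: every occurrence of a word-character run; punctuation is thrown away.
tokens = re.findall(r"\w+", text.lower())

# Types: the distinct character sequences among those tokens.
types = set(tokens)

print(len(tokens), "tokens:", tokens)
print(len(types), "types:", sorted(types))
```

Here "to" and "be" each occur twice, so they contribute two tokens but only one type each.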
Sentence tokenization
n "!" and "?" are relatively unambiguous
n Period "." is quite ambiguous
¨ Sentence boundary
¨ Abbreviations like Inc. or Dr.
¨ Numbers like .02% or 4.3
n Build a binary classifier
¨ Looks at a "."
¨ Decides EndOfSentence/NotEndOfSentence
¨ Classifiers: hand-written rules, regular expressions, or machine learning
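A hand-written-rules version of the binary classifier might look like the sketch below. The abbreviation list, the function names, and the sample text are assumptions made for illustration; a real system would need a much larger abbreviation list or a trained model.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "inc", "etc"}

def is_end_of_sentence(text, i):
    """Classify the '.' at index i as EndOfSentence (True) or not (False)."""
    # Rule 1: a period directly followed by a digit is part of a number (4.3, .02%).
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False
    # Rule 2: a period right after a known abbreviation is not a boundary.
    # (Limitation: an abbreviation that really ends a sentence is missed.)
    before = re.search(r"(\w+)$", text[:i])
    if before and before.group(1).lower() in ABBREVIATIONS:
        return False
    # Rule 3: boundary if nothing follows, or whitespace plus a capital or quote.
    rest = text[i + 1:]
    return rest.strip() == "" or re.match(r"\s+[A-Z\"']", rest) is not None

def split_sentences(text):
    """Split text at '!', '?', and every '.' classified as EndOfSentence."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in "!?" or (ch == "." and is_end_of_sentence(text, i)):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "Dr. Abebe works at Acme Inc. in Addis. Sales rose 4.3% last year! Really?"
for sentence in split_sentences(sample):
    print(sentence)
```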
Tokenization: language issues
n Chinese and Japanese have no spaces between words:
¨ Example (English translation): Sharapova now lives in US southeastern Florida.
n Further complicated in Japanese, with multiple alphabets
intermingled
¨ Dates/amounts in multiple formats
n October 13, 2019 vs 10/13/2019 vs 13/10/2019 vs 10-13-2019 ……
Stemming (1)
n Stemming is the process of reducing inflected words to their root forms,
mapping a group of related words to the same stem even if the
stem itself is not a valid word in the language.
¨ Inflection is the modification of a word to express different grammatical
categories such as tense, case, voice, aspect, person, number, gender, and mood.
n Used to improve the performance of information retrieval (IR) systems.
Stemming (2)
n It is much easier to write a stemming algorithm for a language
when you are familiar with it.
¨ Stemming is language dependent
n The three major stemming algorithms in use today:
¨ Porter Stemmer
¨ Snowball Stemmer
¨ Lancaster Stemmer
Stemming (3)
Stem      Word          Suffix
Connect   Connection    -ion
          Connections   -ions
          Connective    -ive
          Connected     -ed
          Connecting    -ing
          Connector     -or

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter = PorterStemmer()

def stemSentence(sentence):
    # Tokenize, stem each token, and rejoin with single spaces.
    token_words = word_tokenize(sentence)
    stem_sentence = [porter.stem(word) for word in token_words]
    return " ".join(stem_sentence)

# Example input built from the words in the table above.
sentence = "Connection Connections Connective Connected Connecting"
x = stemSentence(sentence)
print(x)
Stemming a document
n You can write your own function to stem a given document.
1. Take a document as the input.
2. Read the document line by line.
3. Tokenize the line.
4. Stem the words.
5. Output the stemmed words (print on screen or write to a file).
6. Repeat steps 2 to 5 until the end of the document is reached.
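The six steps above can be sketched as a small generator. To keep the example runnable without extra packages, it uses a placeholder toy_stem function, which is NOT the Porter algorithm, and a regex tokenizer; both are assumptions made for illustration, as is the in-memory sample document.

```python
import io
import re

def toy_stem(word):
    """Placeholder stemmer (not Porter): strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_document(lines, stem=toy_stem):
    """Yield one stemmed line for each input line."""
    for line in lines:                                # steps 2 and 6: line by line
        tokens = re.findall(r"\w+", line.lower())     # step 3: tokenize
        yield " ".join(stem(t) for t in tokens)       # steps 4 and 5: stem, output

# Step 1: the document; any iterable of lines works, including an open file.
document = io.StringIO("Connected devices\nkeep connecting\n")
stemmed = list(stem_document(document))
print("\n".join(stemmed))
```

With NLTK installed, a real stemmer can be swapped in by passing its stem method, for example stem_document(f, stem=PorterStemmer().stem), where f is an open file.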
Stemming algorithm
n The Porter stemmer uses a simple algorithm to determine which
affixes to strip/remove, in which order, and when to apply repair
strategies (Porter 1980).
¨ Easy to understand, easy to implement.
¨ A simple rule-based algorithm for stemming.
n The algorithm consists of seven sets of rules, applied in order.
Algorithm (1)
n CONSONANT
¨ a letter other than A, E, I, O, U, and other than Y preceded by a consonant
n VOWEL
¨ any other letter
n With this definition, all words are of the form:
¨ [C](VC)^m[V]
n C = string of one or more consonants (consonant+)
n V = string of one or more vowels
n m = the measure: the number of VC repetitions (the leading C and trailing V are optional)
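The measure m can be computed by mapping each letter to C or V and counting the VC groups. This is a sketch under the definitions above; the function names are introduced here for illustration and lowercase input is assumed.

```python
import re

def is_consonant(word, i):
    """True if word[i] is a consonant under the definition above."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_shape(word):
    """Map each letter to 'C' or 'V', e.g. 'tree' -> 'CCVV'."""
    return "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))

def measure(word):
    """Number of VC groups in the form [C](VC)^m[V]."""
    return len(re.findall(r"V+C+", cv_shape(word)))

for w in ["tr", "tree", "by", "trouble", "oats", "ivy", "troubles", "private", "orrery"]:
    print(w, cv_shape(w), measure(w))
```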
STEP 1
1. SSES → SS
¨ excesses → excess
¨ stresses → stress
2. IES → I
¨ companies → compani
¨ difficulties → difficulti
3. SS → SS
¨ caress → caress
4. S → Ø
¨ breads → bread
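Step 1 is an ordered list of suffix rewrites in which the first matching suffix wins. A minimal sketch, assuming lowercase input (step1 is a name chosen here for illustration):

```python
def step1(word):
    """Apply the four plural rules in order; only the first match is used."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["excesses", "companies", "caress", "breads", "bread"]:
    print(w, "->", step1(w))
```

The SS → SS rule looks like a no-op, but it stops a word such as "caress" from reaching the S → Ø rule and losing its final s.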
STEP 2A (Past tense, progressive)
n (m>0) EED → EE
¨ Condition verified: agreed → agree
¨ Condition not verified: feed → feed
n (*v*) ED → Ø
¨ Condition verified: plastered → plaster
¨ Condition not verified: bled → bled
n (*v*) ING → Ø
¨ Condition verified: motoring → motor
¨ Condition not verified: sing → sing

Conditions (tested on the stem, i.e. the part before the suffix):
m     The measure of the stem: the number of VC patterns.
*S    The stem ends with S
*v*   The stem contains a vowel
*d    The stem ends with a double consonant
*o    The stem ends in CVC (second C not W, X, or Y)
STEP 2B (Cleaning)
n (These rules run only if the second or third rule in 2A applied)
¨ AT → ATE: derivat(ed) → derivate
¨ BL → BLE: tumbl(ing) → tumble
n (*d & !(*L or *S or *Z)) → single letter
¨ Condition verified: hopp(ing) → hop, tann(ed) → tan
¨ Condition not verified: fall(ing) → fall
n (m=1 & *o) → E
¨ Condition verified: fil(ing) → file
¨ Condition not verified: fail → fail
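Steps 2A and 2B can be sketched together, since 2B only runs after an ED or ING removal. The helper names below are introduced for illustration, lowercase input is assumed, the EED rule uses the (m>0) condition from Porter (1980), and only the cleanup rules listed above are implemented.

```python
import re

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # Y is a vowel only when preceded by a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_shape(word):
    return "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))

def measure(stem):                       # m: number of VC groups
    return len(re.findall(r"V+C+", cv_shape(stem)))

def has_vowel(stem):                     # *v*
    return "V" in cv_shape(stem)

def ends_double_consonant(stem):         # *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                      # *o
    return cv_shape(stem).endswith("CVC") and stem[-1] not in "wxy"

def step2b(stem):
    """Cleanup after ED/ING removal."""
    if stem.endswith(("at", "bl")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step2(word):
    """Step 2A, followed by 2B when ED or ING was removed."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return step2b(stem) if has_vowel(stem) else word
    return word

for w in ["agreed", "feed", "plastered", "bled", "hopping", "falling", "filing"]:
    print(w, "->", step2(w))
```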
STEP 3 (Y Elimination)
n (*v*) Y → I
¨ Condition verified: happy → happi
¨ Condition not verified: sky → sky
STEP 4 (Derivational Morphology I)
n (m>0) ATIONAL → ATE
¨ relational → relate
n (m>0) IZATION → IZE
¨ generalization → generalize
n (m>0) BILITI → BLE
¨ sensibiliti → sensible
STEP 5 (Derivational Morphology II)
n (m>0) ICATE → IC
¨ triplicate → triplic
n (m>0) FUL → Ø
¨ hopeful → hope
n (m>0) NESS → Ø
¨ goodness → good
STEP 6 (Derivational Morphology III)
n (m>1) ANCE → Ø
¨ allowance → allow
n (m>1) ENT → Ø
¨ dependent → depend
n (m>1) IVE → Ø
¨ effective → effect
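The derivational steps all have the same shape: find the matching suffix, test the measure of the remaining stem, then rewrite. A table-driven sketch for the Step 4 and Step 5 rules listed above follows; apply_rule_set and the rule tables are names introduced here for illustration, and lowercase input is assumed.

```python
import re

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # Y is a vowel only when preceded by a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: the number of VC groups in the stem."""
    shape = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", shape))

# (suffix, replacement) pairs copied from the rules above.
STEP4 = [("ational", "ate"), ("ization", "ize"), ("biliti", "ble")]
STEP5 = [("icate", "ic"), ("ful", ""), ("ness", "")]

def apply_rule_set(word, rules, min_measure=0):
    """Rewrite the first matching suffix if measure(stem) > min_measure."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched but the condition failed
    return word

for w in ["relational", "rational", "generalization", "sensibiliti"]:
    print(w, "->", apply_rule_set(w, STEP4))
for w in ["triplicate", "hopeful", "goodness"]:
    print(w, "->", apply_rule_set(w, STEP5))
```

Step 6 can reuse the same function with its own suffix list and measure threshold passed as min_measure.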
STEP 7a (Cleaning)
n (m>1) E → Ø
¨ probate → probat
n (m=1 & !*o) E → Ø
¨ cease → ceas
STEP 7b (Cleaning)
n (m>1 & *d & *L) → single letter
¨ Condition verified: controll → control
¨ Condition not verified: roll → roll
