
Module No. 2: Word Level Analysis
D. M. Bavkar
14.01.2022
• Morphological analysis: a survey of English morphology
• Inflectional morphology and derivational morphology
• Lemmatization
• Regular expressions
• Finite automata
• Finite state transducers (FSTs)
• Morphological parsing with FSTs
• Lexicon-free FSTs: the Porter stemmer
• N-grams: N-gram language models
• N-grams for spelling correction
Morphological Analysis: A Survey of English Morphology
Parts of Speech and Morphology
• Three important parts of speech are noun, verb, and adjective.
• Nouns typically refer to people, animals, concepts and things.
• The prototypical verb is used to express the action in a sentence.
• Adjectives describe properties of nouns.
• The most basic test for words belonging to the same class is the substitution test.
• Adjectives can be picked out as words that occur in the frame: "The ___ one is in the corner."


• Example: Children eat sweet candy.

The verb eat describes what children do with candy. The adjective sweet tells us about a
property of candy, namely that it is sweet. Many words have multiple parts of speech:
candy can also be a verb (as in Too much boiling will candy the molasses), and, at least in
British English, sweet can be a noun, meaning roughly the same as candy. Word classes are
normally divided into two. The open or lexical categories are ones like nouns, verbs and
adjectives which have a large number of members, and to which new words are commonly
added. The closed or functional categories are categories such as prepositions and
determiners (containing words like of, on, the, a) which have only a few members, and the
members of which normally have a clear grammatical use. Normally, the various parts of
speech for a word are listed in an online dictionary, otherwise known as a lexicon.
Tokenization
Tokenization is the process of splitting running text into tokens (words,
punctuation marks, numbers, and other meaningful units) so that each unit
can be handled separately in later stages of an NLP pipeline.
Tokenization
There are many libraries / frameworks for solving NLP problems, for example:

1. Natural Language Toolkit (NLTK)


2. TextBlob
3. CoreNLP
4. Gensim
5. spaCy
6. polyglot
7. scikit-learn
8. Pattern
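
As a quick illustration, here is a minimal tokenization sketch using NLTK (library 1 above); the example sentence is arbitrary, and word_tokenize needs the 'punkt' data package:

from nltk.tokenize import word_tokenize

# one-time setup: import nltk; nltk.download('punkt')
print(word_tokenize("I'm flying to New York!"))
# -> ['I', "'m", 'flying', 'to', 'New', 'York', '!']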
Stemming and Lemmatization
Before understanding stemming and lemmatization, let's first understand the following tokenization concepts:

• Prefix: character(s) at the beginning, e.g. $ ( " ¿
• Suffix: character(s) at the end, e.g. km ) , . ! "
• Infix: character(s) in between, e.g. - -- / ...
• Exception: a special-case rule to split a string into several tokens, or to prevent a token from being split, when punctuation rules are applied, e.g. St. U.S.
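
A minimal sketch of these concepts in practice, using spaCy's tokenizer (this assumes the en_core_web_sm model is installed; the sentence is an arbitrary example):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Dr. Smith paid $24.99 in the U.S. -- then left for St. Louis!')
print([t.text for t in doc])
# prefixes/suffixes such as $ and ! are split off, the infix "--" separates
# tokens, and exceptions keep "Dr.", "U.S." and "St." as single tokens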
Stemming
Stemming is the process of reducing an inflected word to its stem by
stripping affixes (prefixes and, more commonly, suffixes); the resulting
stem need not be a valid dictionary word. Stemming is important in natural
language understanding (NLU) and natural language processing (NLP).
Porter Stemming
One of the most common, and effective, stemming tools is Porter's
algorithm, developed by Martin Porter in 1980. The algorithm employs five
phases of word reduction, each with its own set of mapping rules.

e.g.: caresses reduces to caress, but not to cares


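A minimal sketch using NLTK's implementation of the Porter algorithm (the outputs shown are what the stemmer typically produces):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cares", "running", "fairly"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, cares -> care, running -> run, fairly -> fairli
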
Snowball Stemming
Snowball is a small string-processing language designed for creating
stemming algorithms for use in information retrieval. The Snowball project
describes the language and presents several useful stemmers that have been
implemented in it.
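
For comparison, a sketch using NLTK's Snowball ("Porter2") English stemmer, which often produces cleaner stems than the original Porter algorithm:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["fairly", "generously", "running"]:
    print(word, "->", stemmer.stem(word))
# fairly -> fair, generously -> generous, running -> run
# (the original Porter algorithm leaves "fairly" as "fairli")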
Lemmatization
Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a
word, which is known as the lemma.

The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the
lemma of 'meeting' might be 'meet' or 'meeting', depending on its use in a
sentence.
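
A minimal sketch with NLTK's WordNet-based lemmatizer; note how the part of speech passed in changes the lemma of 'meeting' (requires the 'wordnet' data package):

from nltk.stem import WordNetLemmatizer

# one-time setup: import nltk; nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))             # -> mouse
print(lemmatizer.lemmatize("was", pos="v"))     # -> be
print(lemmatizer.lemmatize("meeting"))          # -> meeting (noun, the default)
print(lemmatizer.lemmatize("meeting", pos="v")) # -> meet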
Example: The ELIZA Dialogue System
User: I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
User: My mother takes care of me.
ELIZA: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
User: My father.
ELIZA: YOUR FATHER
User: You are like my father in some ways.
Important tools for describing text patterns

• Regular expressions: regular expressions can be used to specify strings we might want to extract from a document, from transforming "I need X" in ELIZA above, to defining strings like $199 or $24.99 for extracting tables of prices from a document (a small sketch follows this list).
• Text normalization: Normalizing text means converting it to a more convenient, standard form. For example,
most of what we are going to do with language relies on first separating out or tokenizing words from running text,
the task of tokenization.
• English words are often separated from each other by whitespace, but whitespace is not always sufficient. New
York and rock ’n’ roll are sometimes treated as large words despite the fact that they contain spaces, while
sometimes we’ll need to separate I’m into the two words I and am. For processing tweets or texts we’ll need to
tokenize emoticons like :) or hashtags like #nlproc. Some languages, like Japanese, don’t have spaces between
words, so word tokenization becomes more difficult.
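
To make the regular-expression bullet concrete, a small sketch using Python's re module; the replacement template mimics ELIZA's style but is an illustrative assumption, not ELIZA's actual rule set:

import re

# ELIZA-style substitution: turn "I need X" into a question about X
print(re.sub(r"I need (.*)", r"WHAT WOULD IT MEAN TO YOU IF YOU GOT \1",
             "I need some help"))
# -> WHAT WOULD IT MEAN TO YOU IF YOU GOT some help

# extract price strings like $199 or $24.99
print(re.findall(r"\$\d+(?:\.\d{2})?", "List price $199, sale price $24.99"))
# -> ['$199', '$24.99']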
Important tools for describing text patterns
• Lemmatization: the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing. Lemmatization is essential for processing morphologically complex languages like Arabic.

• Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word. Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points.

• Edit distance: a measure of how similar two strings are, based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other. Edit distance has applications throughout language processing, from spelling correction to speech recognition to coreference resolution (a minimal implementation sketch follows).
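
A minimal sketch of minimum edit distance via dynamic programming, with unit costs for insertion, deletion, and substitution (some textbooks, e.g. Jurafsky and Martin, use cost 2 for substitution instead):

def edit_distance(source, target):
    # d[i][j] = edits needed to turn source[:i] into target[:j]
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete everything
    for j in range(n + 1):
        d[0][j] = j                     # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[m][n]

print(edit_distance("intention", "execution"))  # -> 5 with unit costs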
Regular Expressions
• One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.

• This practical language is used in every computer language, word processor, and text-processing tools like the Unix tool grep or Emacs.

• Formally, a regular expression is an algebraic notation for characterizing a set of strings.

• They are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

• A regular expression search function will search through the corpus, returning all texts that match the
pattern.

• The corpus can be a single document or a collection. For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression (a grep-like sketch in Python follows).
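
A grep-like sketch in Python (the command-line pattern and file name are hypothetical usage, not a real tool):

import re
import sys

# toy grep: print every line of the input file that matches the pattern
# hypothetical usage: python minigrep.py 'woodchucks?' document.txt
pattern = re.compile(sys.argv[1])
with open(sys.argv[2]) as f:
    for line in f:
        if pattern.search(line):
            print(line, end="")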
