
FOUNDATIONS OF NLP

B.Tech III Year, Semester 2, A19 Batch


COURSE OBJECTIVES:

• To explain text normalization techniques and the n-gram
  language model
• To discuss part-of-speech tagging methods and naïve Bayes
  classification techniques
• To understand word sense disambiguation techniques
  and the process of building a question answering system
• To introduce the concepts of chatbots, dialogue
  systems, speech recognition systems, and text-to-speech
  methods
COURSE OUTCOMES:

After completion of the course, the student should be able to

CO-1: Apply normalization techniques on a document and evaluate a language model.


CO-2: Implement parts-of-speech tagging and classification techniques on words.
CO-3: Establish relationships among the words of a sentence using WordNet and build a question answering system.
CO-4: Analyze chatbots, dialogue systems, and automatic speech recognition systems.
Syllabus
UNIT – I: Introduction, Regular Expressions, Text Normalization, Edit Distance: Words, Corpora, Text Normalization, Word
Normalization, Lemmatization and Stemming, Sentence Segmentation, The Minimum Edit Distance Algorithm.

UNIT – II: N-gram Language Models: N-Grams, Evaluating Language Model, Sampling sentences from a language model,
Sequence Labeling for Parts of Speech and Named Entities: Part-of-Speech Tagging, Named Entities and Named Entity
Tagging.

UNIT – III: Naive Bayes and Sentiment Classification: Naive Bayes Classifiers, Training the Naive Bayes Classifier,
Optimizing for Sentiment Analysis, Naive Bayes as a Language Model, Evaluation: Precision, Recall, F-measure, Test sets
and Cross-validation

UNIT – IV: Word Senses and WordNet: Word Senses, Relations Between Senses, WordNet: A Database of Lexical
Relations, Word Sense Disambiguation, WSD Algorithm: Contextual Embeddings

UNIT – V: Question Answering: Information Retrieval, IR-based Factoid Question Answering, IR-based QA: Datasets,
Entity Linking, Knowledge-based Question Answering, Using Language Models to do QA, Classic QA Models.

UNIT – VI: Chatbots & Dialogue Systems: Properties of Human Conversation, Chatbots, GUS: Simple Frame-based
Dialogue Systems, The Dialogue-State Architecture, Evaluating Dialogue Systems, Dialogue System Design, Automatic
Speech Recognition and Text-to-Speech: The Automatic Speech Recognition Task, Feature Extraction for ASR: Log Mel
Spectrum, Speech Recognition Architecture
NLP Lab Syllabus
● WEEK 9: Write a program to generate N-gram model
● WEEK 10: Write a program to perform sentiment classification
● WEEK 11: Write a program to perform text classification and evaluate the model.
● WEEK 12 & 13: Implementation of real-world case study - IR based Question
Answering system (or) chat bot application
● WEEK 14: Internal Lab exam
Text Books & References
TEXT BOOKS:

1. Speech and Language Processing, Dan Jurafsky and James H. Martin (Stanford.edu), 3rd Edition, Pearson
Publications

2. Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit, Steven Bird,
Ewan Klein, and Edward Loper

REFERENCES:

1. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems,
Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana

2. Foundations of Statistical Natural Language Processing, Christopher Manning and Hinrich Schütze

3. Natural Language Processing in Action, Understanding, Analysing, and Generating Text with Python, Hobson
Lane, Cole Howard, Hannes Max Hapke

4. The Handbook of Computational Linguistics and Natural Language Processing, (Blackwell Handbooks in
Linguistics) 1st Edition
Natural Language Processing (NLP)

Natural language processing (NLP) is an interdisciplinary subfield of
linguistics, computer science, and artificial intelligence concerned with the
interactions between computers and human language, in particular how to
program computers to process and analyze large amounts of natural
language data.
Goal of NLP

The main aim of NLP is to read, understand, and derive meaning from human
language in a useful way.

Note:
1. Most NLP techniques rely on machine learning to derive meaning from
human language.
Uses of NLP

1. NLP is used in language translation applications such as Google Translate

2. It is widely used in word processors like Microsoft Word and web apps
like Grammarly that use NLP to check the grammatical accuracy of texts
3. Interactive Voice Response (IVR) apps used in call centers to answer
specific users' queries
4. Personal assistant apps such as Alexa, Siri, OK Google, and Cortana
Applications of NLP
Regular Expressions

A regular expression (RE) is a language for specifying text search strings.


(or)
Formally, a regular expression is an algebraic notation for characterizing a
set of strings.

It is used in every computer language, word processor, and text processing


tools like the Unix tools grep or Emacs.

A corpus is a computer-readable collection of texts or speech.


A regular expression search function will search a pattern in the corpus, and
return all texts that match the pattern
Basic Regular Expression Patterns

To search for “Buttercup”, we type /Buttercup/. The expression /Buttercup/ matches any
string containing the substring Buttercup.

Note: Regular expressions are case sensitive, lower case /s/ is distinct from uppercase /S/
(/s/ matches a lower case s but not an uppercase S).

This means that the pattern /woodchucks/ will not match the string Woodchucks.
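The same behaviour can be checked with Python's re module (a minimal sketch; the example strings are invented for illustration):

import re

# Regular expressions are case sensitive: /woodchucks/ does not match "Woodchucks"
print(re.search(r'woodchucks', 'a family of Woodchucks'))   # None
print(re.search(r'woodchucks', 'a family of woodchucks'))   # a match object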
Basic Regular Expression Patterns

We can solve this problem with the use of the square braces [ and ]. The string of characters
inside the braces specifies a disjunction of characters to match.

For example, the pattern /[wW]oodchucks/ matches strings containing either woodchucks or Woodchucks.
Basic Regular Expression Patterns

In cases where there is a well-defined sequence associated with a set of characters, the
brackets can be used with the dash (-) to specify any one character in a range.
Basic Regular Expression Patterns

The square braces can also be used to specify what a single character cannot be, by use of
the caret ˆ.
If the caret ˆ is the first symbol after the open square brace [, the resulting pattern is negated.
Example: The pattern /[ˆa]/ matches any single character (including special characters)
except a. This is only true when the caret is the first symbol after the open square brace.
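A small Python sketch of character classes, ranges, and negation (the strings are invented for illustration):

import re

print(re.findall(r'[wW]oodchucks', 'Woodchucks and woodchucks'))   # ['Woodchucks', 'woodchucks']
print(re.findall(r'[0-9]', 'plenty of 7 to 5'))                    # ['7', '5']  (any single digit)
print(re.findall(r'[^A-Za-z ]', 'look up ^ now!'))                 # ['^', '!']  (anything but letters and space)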
Basic Regular Expression Patterns

The operator /?/ means "zero or one occurrence of the preceding character", i.e., the preceding character or nothing. For example, /woodchucks?/ matches woodchuck or woodchucks.


Basic Regular Expression Patterns

Kleene star: It means “ zero or more occurrence of the immediately previous


character”.

Example:
1. /a*/ means "any string of zero or more a's". This will match a or aaaaaa, and also the empty string.
2. /aa*/, meaning one a followed by zero or more as
3. /[ab]*/ means “zero or more a’s or b’s”. This will match strings like aaaa or ababab or
bbbb.
4. /[0-9][0-9]*/ - an integer
Basic Regular Expression Patterns

Kleene +: which means “one or more occurrences of the immediately preceding character”.

Example:

/[0-9]+/ is the normal way to specify “a sequence of digits”
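The optionality and Kleene operators, sketched with Python's re (illustrative strings only):

import re

print(re.search(r'colou?r', 'color'))                 # ? : zero or one of the preceding character
print(re.fullmatch(r'[ab]*', 'ababab') is not None)   # * : zero or more a's or b's -> True
print(re.findall(r'[0-9]+', 'room 101, floor 7'))     # + : one or more digits -> ['101', '7']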


Basic Regular Expression Patterns

Anchors: Anchors are special characters that anchor regular expressions to particular places in a
string. The most common anchors are shown below.

Example:
/\b99\b/ will match the string 99 in “There are 99 bottles on the wall”, but not 99 in "There are 299
bottles on the wall”

But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter)
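The word-boundary anchor behaves the same way in Python (a minimal sketch of the examples above):

import re

print(re.search(r'\b99\b', 'There are 99 bottles on the wall'))    # matches 99
print(re.search(r'\b99\b', 'There are 299 bottles on the wall'))   # None: 99 is inside 299
print(re.search(r'\b99\b', 'the price is $99'))                    # matches: $ is not a digit, underscore, or letter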
Disjunction, Grouping, and Precedence

In some cases we might want to search for either the string cat or the string dog. We
can't use the square brackets for this (/[catdog]/ would match any single one of the
characters c, a, t, d, o, g), so we need a new operator: the disjunction operator, also
called the pipe symbol |.

Example:
1. The pattern /cat|dog/ matches either the string cat or the string dog.
2. How can we specify both happy and happiest?
3. Regular expressions for Column 1 Column 2 Column 3
Disjunction, Grouping, and Precedence

In some cases we might want to search for either the string cat or the string dog. We
can't use the square brackets for this (/[catdog]/ would match any single one of the
characters c, a, t, d, o, g), so we need a new operator: the disjunction operator, also
called the pipe symbol |.

Example:
1. The pattern /cat|dog/ matches either the string cat or the string dog.
2. How can we specify both happy and happiest? /happ(y|iest)/
3. Regular expressions for Column 1 Column 2 Column 3 /(Column [0-9]+ *)*/
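Disjunction and grouping in Python's re (a sketch of the two answers above):

import re

print(re.findall(r'happ(?:y|iest)', 'happy, happiest'))   # ['happy', 'happiest']  ((?: ...) groups without capturing)
m = re.search(r'(Column [0-9]+ *)*', 'Column 1 Column 2 Column 3')
print(m.group())                                          # 'Column 1 Column 2 Column 3'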
Disjunction, Grouping, and Precedence
Operator precedence hierarchy

1. /the*/ matches theeeee but not thethe


2. /the|any/ matches the or any
3. /[a-z]*/ matches zero or more letters (regular expressions always match the largest
string they can; patterns are greedy).
More Operations
1. /{n,m}/ specifies from n to m occurrences of the previous char or expression.
Regular Expression Operator for Counting
1. Certain special characters are referred to by special notation based on the
backslash (\)
2. The most common of these are the newline character \n and the tab
character \t
3. To refer to characters that are special themselves (like ., *, [, and \),
precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).
Complex Examples

1. /$[0-9]+/ - refers to $199, $1, $200


2. /$[0-9]+\.[0-9][0-9]/ - $199.99
3. /(ˆ|\W)$[0-9]+(\.[0-9][0-9])?\b/ - $199.99, $199
4. /(ˆ|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/ - $0 to $999.99
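Note that in most programming-language regex engines the dollar sign itself must be escaped (\$), since a bare $ is the end-of-line anchor; a sketch of the price pattern in Python, with the strings invented for illustration:

import re

price = re.compile(r'(?:^|\W)\$[0-9]+(?:\.[0-9][0-9])?\b')
print(price.search('lunch costs $199.99 today').group())   # ' $199.99' (the leading space comes from \W)
print(price.search('lunch costs $199 today').group())      # ' $199'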
Substitution, Capture Groups

Substitution
1. s/regexp/pattern/ - replace strings that match regexp with pattern
Example:
2. s/colour/color/
Substitution, Capture Groups

This use of parentheses to store a pattern in memory is called a capture


group.
/the (.*)er they were, the \1er they will be/
- Here the \1 will be replaced by whatever string matched the first item in
parentheses.
So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
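The same substitution and back-reference behaviour can be sketched with Python's re.sub and re.search (illustrative strings):

import re

print(re.sub(r'colour', 'color', 'the colour of money'))   # s/colour/color/ -> 'the color of money'

pattern = r'the (.*)er they were, the \1er they will be'
print(re.search(pattern, 'the bigger they were, the bigger they will be') is not None)   # True
print(re.search(pattern, 'the bigger they were, the faster they will be') is not None)   # False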
Substitution, Capture Groups

/the (.*)er they (.*), the \1er we \2/ - will match the faster they ran, the
faster we ran but not the faster they ran, the faster we ate.
Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
non-capturing group: specified by putting ?: after the open paren, in the
form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/ - will match some cats like
some cats but not some cats like some a few.
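A short Python check of the non-capturing group example (\1 refers to (people|cats), since (?: ...) does not capture):

import re

pattern = r'(?:some|a few) (people|cats) like some \1'
print(re.search(pattern, 'some cats like some cats') is not None)    # True
print(re.search(pattern, 'some cats like some a few') is not None)   # False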
WORDS
We need to decide what counts as a word. Let’s start by looking at one particular
corpus, a computer-readable collection of text or speech.

Brown Corpus - a million-word collection of samples from 500 written English texts
from different genres (newspaper, fiction, non-fiction, academic, etc.), assembled at
Brown University in 1963-64.

Example: He stepped out into the hall, was delighted to encounter a water brother.

This sentence has 13 words if we don’t count punctuation marks as words,

15 if we count punctuation.

Whether we treat period (“.”), comma (“,”), and so on as words depends on the task.
WORDS
The Switchboard corpus of American English telephone conversations has 3
million words. Corpora of spoken language don't have punctuation, but they
introduce other complications with regard to defining words.

An utterance is the spoken correlate of a sentence.

Example: I do uh main- mainly business data processing

The utterance has two kinds of disfluencies: the broken-off word main- is
called a fragment, and words like uh and um are called fillers or filled pauses
(sounds or words that participants in a conversation use to signal that they
are pausing to think but are not finished speaking).
Should we consider these to be words?

These disfluencies may be useful for predicting the upcoming word.


WORDS
Are capitalized and uncapitalized tokens (They/they) considered the same word?

Are inflected forms like cats and cat the same word? These two words have the same lemma.

Lemma - A lemma is a set of lexical forms having the same stem.

The wordform is the full inflected or derived form of the word.

How many words are there in English? To answer this question we need to distinguish
two ways of talking about words.

Types are the number of distinct words in a corpus; if the set of words in the
vocabulary is V, the number of types is the vocabulary size |V|.

Tokens are the total number N of running words.

If we ignore punctuation, the following Brown sentence has 16 tokens and 14 types:
WORDS
Example: They picnicked by the pool, then lay back on the grass and looked at the
stars. This sentence has 16 tokens and 14 types (ignoring punctuation).

When we talk about the number of words in the language, we are generally referring
to word types.
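A quick Python check of these counts (a crude tokenization that strips the punctuation, for illustration only):

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = sentence.replace(",", " ").replace(".", " ").split()
print(len(tokens), "tokens,", len(set(tokens)), "types")   # 16 tokens, 14 types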


WORDS
Herdan's Law or Heaps' Law: The larger the corpora we look at, the more word
types we find. This relationship between the number of types |V| and the
number of tokens N is called Herdan's Law or Heaps' Law:

|V| = k N^β

where k and β are positive constants, and 0 < β < 1. The value of β depends on
the corpus size and the genre (it typically ranges from .67 to .75).
● Another measure of the number of words in the language is the
number of lemmas instead of wordform types.

● Dictionaries can help in giving lemma counts; dictionary entries or


boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms).

● The 1989 edition of the Oxford English Dictionary had 615,000 entries
CORPORA
Corpora - Corpus - a computer-readable collection of text or speech

NLP algorithms work well only if we have corpora to train and test them on, ideally in many languages.

It is important to test algorithms on more than one language, and particularly on
languages with different properties. In practice, however, large corpora are most
readily available for English.

Code switching: It’s also quite common for speakers or writers to use multiple
languages in a single communicative act, a phenomenon called code switching

Spanish and (transliterated) Hindi code switching with English

dost tha or ra- hega ... dont wory ... but dherya rakhe

[“he was and will remain a friend ... don’t worry ... but have faith”]
CORPORA
Any particular piece of text that we study is produced by one or more writers or
speakers, in a specific dialect of a specific language, at a specific time, in a specific
place.

The most important dimensions of variation are language and genre.


CORPORA
The most important dimension of variation is the language itself.

The world has roughly 7,000 living languages (recent counts give around 7,100).

NLP algorithms are most useful when they apply across many languages.

It is important to test algorithms on more than one language, particularly on languages
with different properties, but currently most algorithms are developed and tested on
English only.

Algorithms are increasingly developed for other widely used languages such as Chinese,
Spanish, Japanese, and German, but we should not limit our tools to these few
languages.
CORPORA
Most languages have multiple varieties often spoken in different regions or by
different social groups.

AAL (African American Language) - A Twitter post might use features characteristic of
AAL speakers, such as iont (I don't) and talmbout (talking about); these features affect
word segmentation.
CORPORA

The other dimension of variation is the genre. The text that our algorithms must
process might
● Come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts.
● Come from spoken genres like telephone conversations, business meetings,
medical interviews, or transcripts of television shows or movies.
● Come from work situations like doctors’ notes, legal text.
● Text also reflects the demographic characteristics of the writer (or speaker):
their age, gender, race, socioeconomic class can all influence the linguistic
properties of the text we are processing.
● The time matters too, language changes over time, and for some languages
we have good corpora of texts from different historical periods.
CORPORA
Because language is so situated, when developing computational models for
language processing from a corpus, it’s important to consider
● who produced the language
● in what context
● for what purpose

How can a user know all these details?

The best way is for the corpus creator to build a datasheet or data statement for
each corpus.
CORPORA
A datasheet specifies properties of a dataset like:

Motivation: Why was the corpus collected, by whom, and who funded it?

Situation: When and in what situation was the text written/spoken?

Language variety: What language (including dialect/region) was the corpus in?

Speaker demographics: What was the age or gender of the authors of the text?

Collection process: How big is the data? If it is a subsample how was it sampled?
Was the data collected with consent? How was the data pre-processed, and what
metadata is available?
CORPORA
Annotation process: What are the annotations, what are the demographics of the
annotators, how were they trained, how was the data annotated?

Distribution: Are there copyright or other intellectual property restrictions?


Text Normalization
Text normalization has at least the following steps:

1. Tokenizing (segmenting) words


2. Normalizing word formats
3. Segmenting sentences
UNIX Tools for Tokenization or Normalization
UNIX commands

1. tr - systematically change particular characters in the input
2. sort - sort input lines in alphabetical order
3. uniq - collapse and count adjacent identical lines
4. -c (option to tr) - complement: operate on every character NOT in the given set
5. 'A-Za-z' - the set of alphabetic characters
6. -s - squeeze repeated characters in the output into a single one
UNIX Tools for Tokenization or Normalization
Unix command: tr -sc 'A-Za-z' '\n' < sh.txt

Output:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
UNIX Tools for Tokenization or Normalization
we can sort the lines, and pass them to uniq -c which will collapse and count them

tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c

Output:

1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
UNIX Tools for Tokenization or Normalization
we can collapse all the upper case to lower case:

tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c

Output:
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
UNIX Tools for Tokenization or Normalization
Now we can sort again to find the frequent words. The -n option means to sort
numerically rather than alphabetically, and the -r option means to sort in reverse
order (highest-to-lowest):

tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r

Output:
27378 the
26084 and
22538 i
19771 to
17481 of
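A rough Python equivalent of this tr | sort | uniq -c | sort -n -r pipeline, using collections.Counter (the filename sh.txt is the one assumed in the commands above):

import re
from collections import Counter

with open("sh.txt", encoding="utf-8") as f:
    words = re.findall(r"[A-Za-z]+", f.read().lower())   # letters only, case-folded

for word, count in Counter(words).most_common(5):         # most frequent words first
    print(count, word)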
Word Tokenization
Tokenization - It is the task of segmenting running text into words.

The UNIX command sequence removed all the numbers and punctuation. But we will
need to keep these in our tokenization.
● Commas are a useful piece of information for parsers
● Periods help indicate sentence boundaries
● Keep the punctuation that occurs word internally (Ph.D, AT&T, m.p.h)
● Special characters and numbers will need to be kept in prices ($99.45) and dates
● URLs and Twitter hashtags, and Email addresses.
● Number expressions introduce other complications - commas are used inside
numbers in English, as in 555,500.50 - hence tokenization requirements differ for
these.
Word Tokenization
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes. ( converting what’re to the two tokens what are, we’re to we are).

Depending on the application, tokenization algorithms may also tokenize


multiword expressions like New York or rock ’n’ roll as a single token, which
requires a multiword expression dictionary.

Tokenization also interacts with named entity recognition (the task of detecting names,
dates, and organizations).

The most commonly used tokenization standard is the Penn Treebank

tokenization standard, used for the parsed corpora (treebanks) released by the
Linguistic Data Consortium.
Word Tokenization
PENN Treebank Standard:

This standard separates out clitics (doesn’t becomes does plus n’t), keeps
hyphenated words together, and separates out all punctuation.

Input: "The San Francisco-based restaurant," they said, "doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said , " does n’t charge $ 10
".
Word Tokenization
A Python trace of regular expression tokenization in the NLTK
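The NLTK trace itself is not reproduced here; the following is a minimal sketch of regular-expression tokenization with nltk.regexp_tokenize (assumes NLTK is installed; the text and pattern are illustrative):

import nltk

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)            # verbose flag: allow comments and whitespace in the pattern
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*            # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                  # ellipsis
    | [][.,;"'?():_`-]        # punctuation characters as separate tokens
'''
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']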
Word Tokenization
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries.

In Chinese, words are composed of characters (called hanzi in Chinese). Each


character generally represents a single unit of meaning (called a morpheme) and
is pronounceable as a single syllable.

A common approach for Chinese is therefore to ignore words altogether and use
characters as the basic elements, treating the sentence as a sequence of characters.
● Sentence tokenization and word tokenization are both processes
commonly used in natural language processing (NLP) for breaking
down text into smaller, more manageable units. Here's how they differ:
1. Sentence Tokenization:
1. Definition: Sentence tokenization involves splitting a text into individual sentences.
2. Purpose: It is used to divide text into meaningful segments, making it easier to process
sentences individually.
3. Example: Given the input "Hello, how are you? I'm doing well.", sentence tokenization
would produce two sentences: "Hello, how are you?" and "I'm doing well."
4. Method: Sentence tokenization typically relies on punctuation (e.g., periods, question
marks, exclamation marks) to identify sentence boundaries, although more advanced
techniques may also consider contextual information.
2. Word Tokenization:
1. Definition: Word tokenization involves splitting a text into individual words or tokens.
2. Purpose: It is used to break down text into its basic units for further analysis and processing.
3. Example: Given the input "The quick brown fox jumps over the lazy dog.", word tokenization
would produce ten tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog", "."].
4. Method: Word tokenization typically relies on whitespace and punctuation to identify word
boundaries. However, it may also consider language-specific rules and patterns, such as
contractions and hyphenated words.
● In summary, sentence tokenization divides text into sentences, while word
tokenization divides text into individual words or tokens.
● Both processes are fundamental in NLP and are often used as preprocessing
steps for various tasks such as text classification, sentiment analysis, and
machine translation
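A minimal NLTK sketch of both kinds of tokenization (assumes NLTK is installed and its punkt tokenizer data has been downloaded; the sentences are the examples above):

import nltk
# nltk.download('punkt')   # one-time download; newer NLTK versions may also need 'punkt_tab'
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello, how are you? I'm doing well."
print(sent_tokenize(text))   # ['Hello, how are you?', "I'm doing well."]
print(word_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']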
Byte Pair Encoding (BPE) Tokenization
This method is used for tokenization.

Instead of defining tokens as words or as characters, we can use our data to


automatically tell us what the tokens should be.

It is especially useful for unknown words.

In general NLP algorithms often learn some facts about language from one corpus (a
training corpus) and then use these facts to make decisions about a separate test
corpus and its language.

Suppose our training corpus contains words like low, new, and newer, but not lower.
Then if the word lower appears in our test corpus, our system will not know what to do
with it.
Byte Pair Encoding Tokenization
To deal with this unknown word problem, modern tokenizers often automatically
induce sets of tokens that include tokens smaller than word.

In modern tokenization schemes, most tokens are words, but some tokens are
frequently occurring morphemes or other subwords like -er. Any unseen word
like lower can thus be represented by some sequence of known subword units,
such as low and er, or even as a sequence of individual letters if necessary.
Byte Pair Encoding Tokenization
Most tokenization schemes have two parts: a token learner, and a token segmenter.

The token learner takes a raw training corpus and induces a vocabulary, a set of
tokens.

The token segmenter takes a raw test sentence and segments it into the tokens in the
vocabulary.

Three algorithms are widely used for tokenization

1. Byte-pair encoding
2. Unigram language modeling
3. WordPiece.
Byte Pair Encoding Tokenization
The BPE token learner begins with a vocabulary that is the set of all individual
characters.

It then examines the training corpus, chooses the two symbols that are most
frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the vocabulary,
and replaces every adjacent ’A’ ’B’ in the corpus with the new ‘AB’.

It continues to count and merge, creating new longer and longer character strings,
until k merges have been done creating k novel tokens ( K is the parameter of the
algorithm).

The resulting vocabulary consists of the original set of characters plus k new symbols.
Byte Pair Encoding Tokenization
The input corpus is first white-space-separated to give a set of strings, each
corresponding to the characters of a word plus a special end-of-word symbol _,
together with its count.
Byte Pair Encoding Tokenization
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent is
the pair e r, because it occurs in newer (frequency 6) and wider (frequency 3)
for a total of 9 occurrences.
BPE - Algorithm
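Since the algorithm figure is not reproduced here, the following is a minimal Python sketch of the BPE token learner on the toy corpus from the earlier slide (low, lowest, new, newer, wider with the end-of-word symbol _); it is an illustration, not a production tokenizer:

import re
import collections

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs in the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w _': 5, 'l o w e s t _': 2, 'n e w e r _': 6, 'w i d e r _': 3, 'n e w _': 2}
k = 8                                    # number of merges (the parameter of the algorithm)
for _ in range(k):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair, e.g. ('e', 'r') first
    vocab = merge_vocab(best, vocab)
    print('merged', best)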
Byte Pair Encoding Tokenization
Once we've learned our vocabulary, the token segmenter is used to tokenize a test
sentence.

The token segmenter simply runs, on the test data, the merges we learned from the
training data, greedily, in the order we learned them. (Thus the frequencies in the
test data don't play a role, just the frequencies in the training data.)
Word Normalization, Lemmatization and Stemming
Normalization is the task of putting words/tokens in a standard format, choosing a
single normal form for words with multiple forms like USA and US or uh-huh and
uhhuh.

Case folding is another kind of normalization. Mapping everything to lower case


means that Woodchuck and woodchuck are represented identically, which is
helpful in speech recognition.

For sentiment analysis and other text classification tasks, information extraction,
and machine translation, by contrast, case can be quite helpful and case folding is
generally not done.
Lemmatization
Lemmatization is the task of determining that two words have the same root.

Example:

1. The words am, are, and is have the shared lemma be


2. The words dinner and dinners both have the lemma dinner

The lemmatized form of a sentence like “He is reading detective stories” would
thus be “He be read detective story”.
How Lemmatization done?
The most sophisticated methods for lemmatization involve complete morphological
parsing of the word.

Morphology: the study of the way words are built up from smaller meaning-bearing
units called morphemes.

Two broad classes of morphemes can be distinguished:

Stems—The central morpheme of the word, supplying the main meaning.

Affixes—Adding “additional” meanings of various kinds.

1. The word fox consists of one morpheme (the morpheme fox)


2. The word cats consists of two (the morpheme cat and the morpheme -s )
How Lemmatization done?
A morphological parser takes a word like cats and parses it into the two
morphemes cat and s,

It parses a Spanish word like amaren (‘if in the future they would love’) into the
morpheme amar ‘to love’, and the morphological features 3PL and future
subjunctive.
Stemming:

● Stemming is the process of removing affixes from words to obtain their root forms, known
as stems.
● Stemming algorithms apply heuristic rules to chop off prefixes or suffixes from words.
● It is a simpler and faster process compared to lemmatization.
● Stemmed words may not always be actual words. For example, "studies" may be stemmed
to "studi", which is not a valid English word, while "running" is stemmed to "run", which is.
● Stemming is useful in tasks where speed and simplicity are prioritized over accuracy, such as
information retrieval or indexing in search engines.
● In linguistics, an affix is a morpheme that is attached to a word or stem to create a new word or modify its
meaning or grammatical function. Affixes can be prefixes, suffixes, infixes, or circumfixes:
1. Prefix: An affix that is attached to the beginning of a word. For example, in the word "unhappy", "un-" is a
prefix.
2. Suffix: An affix that is attached to the end of a word. For example, in the word "happily", "-ly" is a suffix.
3. Infix: An affix that is inserted into the middle of a word. This is less common in English but is found in some
languages. For example, in Tagalog, a language spoken in the Philippines, the infix "-um-" is inserted into verbs
to indicate past tense, as in "ganda" (beautiful) becomes "gumanda" (became beautiful).
4. Circumfix: An affix that consists of two parts, one attached to the beginning of a word and the other attached to
the end. The word is modified by both parts. This is also less common in English but can be found in some
languages. For example, in German, the verb "gehen" (to go) can be transformed into the past participle
"gegangen" by adding the circumfix "ge-" at the beginning and "-en" at the end.
● Affixes play a significant role in morphology, the study of the structure of words and how they are formed. They
can change the meaning, part of speech, or grammatical function of words.
● Morphological analysis is a linguistic process that involves breaking down words into
their smallest meaningful units, called morphemes, and studying how these
morphemes contribute to the overall structure and meaning of words. Morphemes are
the smallest units of meaning in language.
● Here's how morphological analysis of words works:
1. Identification of Morphemes: Linguists analyze words to identify the morphemes within them. Morphemes can be either free
morphemes, which can stand alone as words (e.g., "cat", "run"), or bound morphemes, which must be attached to other
morphemes to form words (e.g., "-s" for plural, "-ing" for progressive tense).
2. Classification of Morphemes: Once identified, morphemes are classified based on their grammatical and semantic functions.
For example, morphemes may indicate tense, plurality, possession, or serve as prefixes, suffixes, or roots.
3. Study of Morphological Processes: Morphological analysis also involves studying how morphemes combine to form words
through processes such as affixation (adding prefixes, suffixes, or infixes), compounding (combining two or more words),
derivation (forming new words from existing ones), or inflection (altering the form of a word to indicate grammatical features
like tense, number, or gender).
4. Analysis of Word Structure: Morphological analysis examines the internal structure of words, including the order and
arrangement of morphemes within them. This helps understand how words are formed and how their meanings are constructed.
● Morphological analysis is essential in understanding language structure, word formation, and the relationship between form and
meaning in words. It provides insights into the rules and patterns governing word formation in a language, contributing to
linguistic research, language teaching, and natural language processing tasks such as machine translation, text analysis, and
speech recognition.
Lemmatization:

● Lemmatization, on the other hand, involves reducing words to their base or dictionary form, known as
the lemma.
● Lemmatization uses a vocabulary and morphological analysis of words to accurately derive their
lemma.
● It considers the context of the word and applies linguistic rules to transform words to their dictionary forms.
● Lemmatization produces valid words, ensuring linguistic correctness.
● It's a more complex and computationally intensive process compared to stemming.
● Lemmatization is often preferred in tasks where accuracy is crucial, such as natural language understanding,
sentiment analysis, or machine translation.
● In summary, while stemming and lemmatization both aim to reduce words to their base forms, stemming is a
simpler and faster process that may result in non-words, while lemmatization is more accurate and considers the
linguistic context to produce valid dictionary forms of words.
● Example:
● Consider the word "running":
1. Stemming:
● When stemming "running", a stemming algorithm might simply remove the suffix "-ing" to
obtain the stem "run". This is a heuristic approach that doesn't consider the linguistic
context deeply.
● Stemming result: "running" → "run"
2. Lemmatization:
● When lemmatizing "running", the process understands that "running" is a form of the base
verb "run". It consults a dictionary or a set of linguistic rules to derive the lemma.
● Lemmatization result: "running" → "run"
● Another example with the word "mice":

1. Stemming:
● A stemming algorithm applies only surface rules and has no knowledge that "mice" is the plural of "mouse",
so it may leave the word unchanged or produce a non-word form.
● Stemming result: "mice" → "mic" (or "mice", depending on the stemmer)

2. Lemmatization:
● Lemmatization, however, recognizes that "mice" is the plural form of "mouse" and converts it to the singular form.
● Lemmatization result: "mice" → "mouse"
● These examples demonstrate how stemming and lemmatization can produce different results based on their respective
algorithms and linguistic analysis. Lemmatization aims for accuracy by considering the context and linguistic rules, while
stemming operates more crudely by chopping off affixes.
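These contrasts can be reproduced with NLTK's Porter stemmer and WordNet lemmatizer (a sketch; assumes NLTK and its wordnet data are installed, and exact stemmer output may vary by implementation):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')   # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"), lemmatizer.lemmatize("running", pos="v"))   # run  run
print(stemmer.stem("mice"), lemmatizer.lemmatize("mice", pos="n"))         # stemmer gives a surface form; lemmatizer gives mouse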

The Porter Stemmer
Lemmatization algorithms can be complex. For this reason we sometimes make
use of a simpler but cruder method, which mainly consists of chopping off word-final
affixes.

This naive version of morphological analysis is called stemming.

One of the most widely used stemming algorithms is the Porter stemmer.


The Porter Stemmer
Example:

This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single exception
of the red crosses and the written notes.

Stemmed output:

Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all
thing name and height and sound with the singl except of the red cross and the
written note
The Porter Stemmer
The algorithm is based on a series of rewrite rules run in series, as a cascade, in
which the output of each pass is fed as input to the next pass.

Some of the rules are:

1. ATIONAL → ATE (e.g., relational → relate)


2. ING → ε if stem contains a vowel (e.g., motoring → motor)
3. SSES → SS (e.g., grasses → grass)

Simple stemmers can be useful in cases where we need to collapse across


different variants of the same lemma. Nonetheless, they do tend to commit errors
of both over- and under-generalizing.
The Porter Stemmer
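A few of these cascaded rules in action, using NLTK's implementation of the Porter stemmer (a small sketch; the outputs shown in the comment are the typical ones):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "grasses", "motoring", "cats"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, grasses -> grass, motoring -> motor, cats -> cat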
Sentence Segmentation
It is another important step in text processing.

The most useful cues for segmenting a text into sentences are punctuation, like
periods, question marks, and exclamation points.

Periods are ambiguous between a sentence boundary marker and a marker of


abbreviations like Mr. or Inc.

Sentence tokenization methods work by first deciding (based on rules or machine


learning) whether a period is part of the word or is a sentence-boundary marker.

An abbreviation dictionary can help determine whether the period is part of a


commonly used abbreviation.
Sentence Segmentation
The dictionaries can be hand-built or machine learned, as can the final sentence
splitter.

Stanford CoreNLP toolkit - Sentence splitting is rule-based, a deterministic


consequence of tokenization; a sentence ends when a sentence-ending
punctuation (., !, or ?) is not already grouped with other characters into a token
(such as for an abbreviation or number), optionally followed by additional final
quotes or brackets.
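As a small illustration (using NLTK's pre-trained Punkt tokenizer rather than the CoreNLP splitter described above; assumes the punkt data is installed):

from nltk.tokenize import sent_tokenize

text = "Mr. Smith went to Washington. He arrived at 5 p.m. on Friday."
print(sent_tokenize(text))
# The periods in "Mr." and "p.m." are typically kept inside the sentences,
# giving two sentences rather than four.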
Minimum Edit Distance
NLP is concerned with measuring how similar two strings are.

For example: In spelling correction, the user typed some erroneous string—let’s
say graffe–and we want to know what the user meant. (The most similar word is
giraffe - Which differs by one letter).

Coreference - The task of deciding whether two strings refer to the same entity:

S1: Stanford President Marc Tessier-Lavigne

S2: Stanford University President Marc Tessier-Lavigne

(differs by one word)


Minimum Edit Distance
Edit distance gives us a way to quantify both of these intuitions about string
similarity.

Minimum Edit Distance: Minimum Edit Distance between two strings is defined
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.

Example: The minimum edit distance between “intention” and “execution” is 5.


(delete an i, substitute e for n, substitute x for t, insert c, substitute u for n).
Minimum Edit Distance
We can also assign a particular cost or weight to each of these operations.

The Levenshtein distance between two sequences is the simplest weighting


factor in which each of the three operations has a cost of 1.
Minimum Edit Distance Algorithm
Finding the minimum edit distance can be viewed as a search task, in which we are
searching for the shortest path (a sequence of edits) from one string to another.

The space of all possible edits is enormous, so we can’t search naively.

This algorithm uses dynamic programming for finding minimum edit distance.

Dynamic Programming - a table-driven approach to solving problems by
combining solutions to sub-problems.
Minimum Edit Distance Algorithm
We first define the minimum edit distance between two strings.

Given two strings, the source string X of length n, and target string Y of length m,
we’ll define D[i, j] as the edit distance between X[1..i] and Y[1.. j].

In the base case, with a source substring of length i but an empty target string,
going from i characters to 0 requires i deletes. With a target substring of length j
but an empty source going from 0 characters to j characters requires j inserts.
Minimum Edit Distance Algorithm
Insertion and deletion have a cost of 1 and substitution has a cost of 2. The recurrence is then:

D[i, j] = min( D[i-1, j] + 1,                                      (deletion)
               D[i, j-1] + 1,                                      (insertion)
               D[i-1, j-1] + (0 if X[i] = Y[j], else 2) )          (substitution)
Minimum Edit Distance Algorithm
● Minimum Edit Distance (Dynamic Programming) for converting one string to another string (YouTube)
● Edit Distance between 2 Strings | The Levenshtein Distance Algorithm + Code (YouTube)
Example: Minimum edit distance between “execution” and “intention”
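A minimal dynamic-programming implementation of the algorithm (a sketch using the cost model above: insertion and deletion cost 1, substitution cost 2):

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # base case: deleting i characters
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):                      # base case: inserting j characters
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,                                                 # deletion
                D[i][j - 1] + ins_cost,                                                 # insertion
                D[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost),  # substitution / copy
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))               # 8 with substitution cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))   # 5, the plain Levenshtein distance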
N-grams Language Model
Models that assign probabilities to sequences of words are called Language Models (LMs).

The simplest model that assigns probabilities to sentences and sequences of words is the n-gram model.

An n-gram is a sequence of n words.

n | Term
1 | unigram
2 | bigram
3 | trigram
n | n-gram

Sentence: "I reside in Bengaluru"

S.No | Type of n-gram | Generated n-grams
1    | Unigram        | ["I", "reside", "in", "Bengaluru"]
2    | Bigram         | ["I reside", "reside in", "in Bengaluru"]
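A short Python sketch that generates such n-grams from a tokenized sentence (the helper name ngrams is just for illustration):

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I reside in Bengaluru".split(), 2))
# [('I', 'reside'), ('reside', 'in'), ('in', 'Bengaluru')]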
N-Grams
P(w|h) - the probability of a word w given some history h.

For example, if the history h is "its water is so transparent that", we want the
probability that the next word is the:

P(the | its water is so transparent that)

One way to estimate this is from relative frequency counts: take a very large corpus,
count the number of times we see its water is so transparent that, and count the
number of times this is followed by the:

P(the | its water is so transparent that) =
    C(its water is so transparent that the) / C(its water is so transparent that)

This works fine in many cases. But it turns out that even the web isn't big enough to
give us good estimates in most cases, because language is creative: new sentences
are created all the time, and we won't always be able to count entire sentences.
N-Grams
Example: a search for "Walden Pond's water is so transparent that the" used to return
zero counts, even on the web.

Similarly, we run into the same problem if we want to know the joint probability of an
entire sequence of words like its water is so transparent.
N-Grams
The general equation for this n-gram approximation to the conditional probability of
the next word in a sequence is

P(w_n | w_1 ... w_(n-1)) ≈ P(w_n | w_(n-N+1) ... w_(n-1))

For a bigram model (N = 2), this becomes P(w_n | w_1 ... w_(n-1)) ≈ P(w_n | w_(n-1)).
N-Gram

Probability of a sentence: "i want english food"


N-Gram (Some Practical Issues)
Since probabilities are (by definition) less than or equal to 1, the more probabilities
we multiply together, the smaller the product becomes. Multiplying enough n-grams
together would result in numerical underflow. By using log probabilities instead of
raw probabilities, we get numbers that are not as small.
Adding in log space is equivalent to multiplying in linear space, so we combine log
probabilities by adding them
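A minimal sketch of a bigram language model with MLE estimates and log-probability scoring (the tiny corpus is invented for illustration and is not the restaurant corpus behind the earlier example; unseen bigrams would get zero probability and need smoothing):

import math
from collections import Counter

corpus = [["i", "want", "english", "food"],
          ["i", "want", "chinese", "food"],
          ["i", "like", "english", "food"]]

unigram, bigram = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigram.update(tokens)
    bigram.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigram[(prev, word)] / unigram[prev]

def sentence_logprob(sent):
    # sum log probabilities instead of multiplying raw probabilities
    tokens = ["<s>"] + sent + ["</s>"]
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))

print(sentence_logprob(["i", "want", "english", "food"]))   # about -1.10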
Evaluating Language Models
Extrinsic Evaluation: The best way to evaluate the performance of a language
model is to embed it in an application and measure how much the application
improves. Such end-to-end evaluation is called extrinsic evaluation.
Unfortunately, running big NLP systems end-to-end is often very expensive.
Instead, it would be nice to have a metric that can be used to quickly evaluate
potential improvements in a language model.
Intrinsic Evaluation: An intrinsic evaluation metric is one that measures the quality
of a model independent of any application
Evaluating Language Models
For an intrinsic evaluation of a language model we need a test set (unseen data).

Suppose we want to compare the performance of two models that are built on the
same training data. The better model is whichever one assigns a higher probability to
the test set, meaning it more accurately predicts the test set.

If a test sentence is part of the training corpus, we will mistakenly assign it an
artificially high probability when it occurs in the test set. We call this situation training
on the test set.
Evaluating Language Models
Training on the test set introduces a bias that makes the probabilities all look too
high, and causes huge inaccuracies in perplexity, the probability-based metric.
