MODULE 2
Natural Language Processing
● Introduction
● N-grams: simple unsmoothed n-grams; smoothing, backoff; spelling correction using n-grams; metrics to evaluate n-grams
● Parts-of-Speech tagging: word classes, POS tagging using Brill's tagger and HMMs
● Information Extraction: introduction to Named Entity Recognition, Relation Extraction
● WordNet and WordNet-based similarity measures
● Concept mining using Latent Semantic Analysis
Simple unsmoothed n-grams
Predicting

The next few words someone is going to say:

• "Please turn off your cell ..."
• "About fifteen minutes from ..."

❖ Assign a probability to each possible next word, or to the whole sentence.
Predicting

• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– "Their are problems wit this sentence."
Language Model
• A probabilistic (statistical) model that determines the probability of a given sequence of words occurring in a sentence based on the previous words.

• It helps to predict which word is more likely to appear next in the sentence.

• The input to a language model is usually a training set of example sentences.

• The output is a probability distribution over sequences of words.

• To predict the next word we can condition on no previous words (unigram), the last one word (bigram), the last two words (trigram), or more generally the last n-1 words (n-gram), as our requirements dictate.
Language Model (LM)

□ Language Model: models that assign probabilities to sequences of words.
Why Language Models?

• They are a way of transforming qualitative information about text into quantitative information that machines can understand.

• They have applications in a wide range of industries like tech, finance, healthcare, military etc.

• Hence they are widely used in predictive text input systems, speech recognition, machine translation, spelling correction etc.
Language Model (LM)
▪ An n-gram is a sequence of n words:
❖ a 2-gram (bigram) is a two-word sequence of words
• like "please turn", "turn your", or "your homework", and
❖ a 3-gram (trigram) is a three-word sequence of words
• like "please turn your", or "turn your homework".
Language Model (LM)

• The Markov assumption simplifies the modeling of sequences of words or tokens by assuming that the probability of a word depends only on a fixed number of preceding words.

• This number is called the "order" of the Markov model and is denoted as "n" in n-gram models.
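In symbols: instead of conditioning on the entire history, an n-gram model uses only the last n-1 words. For the bigram case (n = 2):

P(w_k | w_1, w_2, ..., w_(k-1)) ≈ P(w_k | w_(k-1))

so the probability of a whole sentence is approximated by the product of these conditional probabilities.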

Language Model (LM)
How do we estimate these bigram or n-gram probabilities?

To compute a particular bigram probability of a word y given a previous word x, compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x (which equals the unigram count of the previous word):

P(y | x) = C(xy) / C(x)

where x is the previous word and y is the current word.
Language Model (LM)
How do we estimate these bigram or n-gram probabilities?

Example: calculations for some of the bigram probabilities from a corpus, using its bigram and unigram counts.

Compute the probability of the sentence "I want English food" under a bigram model:

P(<s> I want English food </s>) = P(I | <s>) × P(want | I) × P(English | want) × P(food | English) × P(</s> | food)
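A minimal sketch of these counts in Python (the three-sentence corpus below is invented for illustration; <s> and </s> mark sentence boundaries):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
sentences = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> i like english tea </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "want"))        # 2/3
print(bigram_prob("want", "english"))  # 1/2
```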
Everything in log space?

□ Avoids underflow (multiplying many small probabilities quickly approaches zero).
□ Adding is faster than multiplying.

Since log(p1 × p2 × p3) = log p1 + log p2 + log p3, we can add log probabilities instead of multiplying raw probabilities.
Evaluating Language Models
□ Extrinsic evaluation:
embed the model in an application and measure how much the application improves.

□ Intrinsic evaluation:
measures the quality of a model independent of any application.
Evaluating Language Models
□ Extrinsic evaluation of an N-gram language model measures how much an application improves.

□ To compare two language models A and B:
– Use each language model in a task such as a spelling corrector or MT system.
– Get an accuracy for A and for B:
• How many misspelled words are corrected properly
• How many words are translated correctly
– Compare the accuracy for A and B:
• The model that produces the better accuracy is the better model.

□ Extrinsic evaluation can be time-consuming.
Evaluating Language Models
□ Intrinsic evaluation.
□ Perplexity

• Perplexity is a measurement used in natural language processing and information theory.
• It assesses how well a probabilistic model, such as a language model, predicts a given sequence of words or tokens.
• It is often used to evaluate the performance of language models, including n-gram models.
• Lower perplexity values indicate better performance.
• A perplexity of 1 would mean that the model perfectly predicts the test data.
Steps for Perplexity

• Language Model Training:
First, you train an n-gram language model on a corpus of text. An n-gram language model assigns probabilities to sequences of n words based on their occurrence in the training data.

• Test Data:
Next, you have a test set of sentences or text sequences that you want to evaluate. These sentences are typically not seen during training.

• Perplexity Calculation:
For each sentence in the test set, you calculate the perplexity using the following formula:

Perplexity(S) = P(w_1, w_2, ..., w_N)^(-1/N)

where S represents the sentence, N is the number of words in the sentence, and P(w_1, w_2, ..., w_N) is the probability assigned to the sentence by your n-gram model.
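A minimal sketch of this calculation, reusing the bigram_prob estimator from the earlier toy example (note that an unseen bigram would make math.log fail on zero, which is exactly the zeros problem addressed by smoothing below):

```python
import math

def perplexity(sentence, prob_fn):
    """Perplexity(S) = P(w_1..w_N)^(-1/N), computed in log space to avoid underflow."""
    words = sentence.split()
    log_p = 0.0
    for prev, word in zip(words, words[1:]):
        log_p += math.log(prob_fn(prev, word))
    n = len(words) - 1  # number of predicted tokens (bigram transitions)
    return math.exp(-log_p / n)

print(perplexity("<s> i want english food </s>", bigram_prob))  # ~1.43 on the toy corpus
```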
Generalization
• Generalization is the ability of a language model to make reasonable predictions on unseen or out-of-sample data.

● Strength: N-gram models can generalize to some extent because they capture statistical patterns in the training data.

● For example, if a model has seen the bigram "hot coffee" frequently, it can generalize to predict "iced tea" even if it hasn't seen that specific combination before.
Zeros
Limitation: N-gram models may struggle with longer-range dependencies or understanding the semantics of language.

If an n-gram model encounters an n-gram that it has never seen during training, it may assign a probability of zero to that n-gram, leading to issues with prediction accuracy.

For example, if the trigram "denied the offer" never appears in the training data, our model will incorrectly estimate that P(offer | denied the) = 0!

To address the issue of zeros and improve generalization, techniques like smoothing and backoff are commonly used in n-gram models.
Smoothing

• Smoothing is a technique used in n-gram language models to address the problem of zero probabilities for n-grams that were not observed in the training data.

• Since language is highly diverse, it's common for n-gram models to encounter word sequences that they haven't seen before.

• Smoothing helps assign non-zero probabilities to these unseen n-grams, making the model more robust and improving its ability to generalize.
Smoothing

❖ There are many ways to do smoothing, and some of them are:
▪ Add-1 smoothing (Laplace smoothing)
▪ Add-k smoothing
▪ Stupid backoff
▪ Kneser-Ney smoothing
Laplace Smoothing

• Laplace smoothing, also known as Add-One smoothing, is a straightforward technique used in n-gram language models to address the problem of zero probabilities for unseen n-grams.

• It's one of the simplest smoothing methods and is often used as a baseline for smoothing in n-gram models.

• In Laplace smoothing, a fixed constant (usually 1) is added to the count of each unique n-gram during training.

• This ensures that no n-gram has a probability of zero, even if it was not observed in the training data.
Let's consider a bigram (2-gram) model as an example:

• Count the N-grams:
Count(w_1, w_2) is the number of times the bigram (w_1, w_2) occurs in the training data.
Count(w_1) is the number of times the unigram w_1 occurs in the training data.

• Laplace-Smoothed Probability Calculation:
The Laplace-smoothed probability of a word w_2 given the preceding word w_1 is calculated as follows:

P(w_2 | w_1) = (Count(w_1, w_2) + 1) / (Count(w_1) + V)

Count(w_1, w_2) is the count of the bigram (w_1, w_2) in the training data.
Count(w_1) is the count of the unigram w_1 (the preceding word) in the training data.
V is the vocabulary size, which represents the total number of unique words in the training data.

The "+1" in the numerator is the Laplace smoothing constant, which is added to each count. The "+V" in the denominator accounts for the extra mass added across the vocabulary, so that the probabilities still sum to 1.
Example of Laplace Smoothing

Training text: "I like to eat pizza. I like to drink soda."

We want to calculate the Laplace-smoothed probability of the word "pizza" given the preceding word "eat".

Counting N-grams:
Count("eat pizza"): 1 time
Count("eat"): 1 time
Vocabulary size (V): 7 unique words (I, like, to, eat, pizza, drink, soda)

Laplace-Smoothed Probability Calculation:
We'll calculate P("pizza" | "eat") using the Laplace smoothing formula:
P("pizza" | "eat") = (Count("eat pizza") + 1) / (Count("eat") + V)
Count("eat pizza") = 1 (from training data)
Count("eat") = 1 (from training data)
V = 7 (vocabulary size)
P("pizza" | "eat") = (1 + 1) / (1 + 7) = 2 / 8 = 0.25

So, the Laplace-smoothed probability of the word "pizza" following the word "eat" is 0.25.
Add-k Smoothing
❖ One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events.
□ Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?).
□ This algorithm is called add-k smoothing:

P(w_2 | w_1) = (Count(w_1, w_2) + k) / (Count(w_1) + kV)

V is the total number of unique events (the vocabulary size).
k is the smoothing parameter.

The value of k is typically chosen based on some heuristic or through cross-validation.
Backoff

• Backoff is a simpler technique compared to interpolation.
• It's based on the idea that if a higher-order n-gram (a sequence of n words) has low or zero probability, you can "back off" to a lower-order n-gram to get a non-zero probability estimate.
Example of Backoff

• For example, suppose you're trying to estimate the probability of the sentence "I am going to the store".
• If you have no data for this specific sentence, you can "back off" and estimate the probability by looking at the trigram "to the store",
• or the bigram "the store", or even the unigram "store".
• The idea is to progressively simplify the context until you have enough data to make a reliable estimate.
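A minimal sketch in the spirit of "stupid backoff" (the corpus and the 0.4 discount factor are illustrative assumptions; note that stupid backoff produces scores rather than true probabilities):

```python
from collections import Counter

tokens = "i am going to the store i am going home".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def stupid_backoff(prev, word, alpha=0.4):
    """Score word given prev; back off to the unigram when the bigram is unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: fall back to a discounted unigram estimate.
    return alpha * unigrams[word] / N

print(stupid_backoff("the", "store"))  # seen bigram: uses the bigram estimate
print(stupid_backoff("the", "home"))   # unseen bigram: backs off to the unigram
```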
Interpolation

• Interpolation is a technique used in language modeling to estimate the probability of a word or sequence of words in a given context by mixing n-gram estimates of different orders.

• Weight Assignment: The weights assigned to the different n-grams are determined based on their relative importance.

• These can be heuristically chosen or tuned through cross-validation.
Interpolation
• For example, in a bigram/trigram interpolation, you might assign a weight of 0.7 to the bigram probability and 0.3 to the trigram probability.

● Then you calculate the final probability as:

● P("apple" | "I like to eat") = 0.7 × P("apple" | "eat") + 0.3 × P("apple" | "to eat")

(the bigram estimate conditions on "eat"; the trigram estimate conditions on "to eat")

• Adaptation: The choice of weights can be adapted based on the specific data and the performance of the model.

• It's common to experiment with different weight combinations to see which yields the best results.
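A sketch of this weighted mixing (the component estimators and their values below are placeholders for illustration, not real corpus estimates):

```python
def interpolated_prob(word, context, p_bigram, p_trigram, lambdas=(0.7, 0.3)):
    """Linear interpolation: mix a bigram and a trigram estimate.

    p_bigram(word, prev) and p_trigram(word, prev2, prev1) are placeholder
    estimators supplied by the caller; the lambdas should sum to 1.
    """
    l_bi, l_tri = lambdas
    return (l_bi * p_bigram(word, context[-1]) +
            l_tri * p_trigram(word, context[-2], context[-1]))

# Toy usage with made-up component probabilities:
p_bi = lambda w, prev: 0.2        # pretend P("apple" | "eat") = 0.2
p_tri = lambda w, p2, p1: 0.5     # pretend P("apple" | "to eat") = 0.5
context = ["i", "like", "to", "eat"]
print(interpolated_prob("apple", context, p_bi, p_tri))  # 0.7*0.2 + 0.3*0.5 = 0.29
```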
Part of Speech Tagging

● Part-of-Speech (POS) tagging is a process in natural language processing (NLP) that involves assigning a grammatical label (such as noun, verb, adjective, etc.) to each word in a sentence,
● based on its syntactic role within that sentence.
● This task is crucial for various NLP applications, such as information extraction, machine translation, and sentiment analysis.
● Purpose:
● POS tagging provides valuable linguistic information about a text. It helps
in understanding the structure and meaning of sentences.
● It's a fundamental step in many higher-level NLP tasks like parsing,
sentiment analysis, and named entity recognition.
● POS Categories:
● Common POS categories include nouns (N), verbs (V), adjectives (ADJ),
adverbs (ADV), pronouns (PRON), conjunctions (CONJ), prepositions
(PREP), determiners (DET), and interjections (INTJ), among others.
● Ambiguity:

● Many words can have multiple POS tags depending on the context. For
example, "lead" can be a verb ("He will lead the team") or a noun ("The
pencil has a lead tip").
● Tagging Methods:
● POS tagging can be performed using rule-based methods, statistical
models, or deep learning techniques. Statistical models like Hidden
Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs)
were traditionally used, while deep learning methods like Recurrent
Neural Networks (RNNs) and Transformer-based models (e.g., BERT)
have shown significant improvements.
● Corpora and Training:
● To train a POS tagger, a large annotated corpus (a collection of text with
labeled parts of speech) is needed. This corpus is used to learn the
relationships between words and their corresponding POS tags.
● Challenges:
● Ambiguity, rare words, and complex sentence structures can pose
challenges for POS tagging systems. Handling languages with free
word order, like Latin or Finnish, can be particularly challenging.
● Evaluation:
● POS taggers are evaluated using metrics like accuracy, precision,
recall, and F1-score. These metrics compare the predicted tags to
the gold-standard tags in the annotated corpus.
● Applications:
● POS tagging is a crucial component in various NLP applications, including:
○ Information extraction: Identifying key information from texts.
○ Parsing: Understanding the syntactic structure of sentences.
○ Sentiment analysis: Analyzing the sentiment expressed in a sentence or text.

○ Machine translation: Assisting in the translation of text from one language to another.
PART-of-SPEECH (POS) – How Many?
• Word Classes
In grammar, a part of speech or part-of-speech (POS) is also known as a word class or grammatical category: a category of words that have similar grammatical properties.

• The English language has four major word classes: nouns, verbs, adjectives, and adverbs.

• Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, numerals, articles, and determiners.

• These can be further categorized into open and closed classes.
Brill Tagger
● The BrillTagger class (e.g., in NLTK) is a transformation-based tagger.
● Rule-based taggers use a dictionary or lexicon to get the possible tags for each word.
● If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct tag.
● The Brill tagger additionally uses a series of learned rules to correct the results of an initial tagger.
● The rules it follows are score-based.
● A rule's score is equal to the number of errors it corrects minus the number of new errors it produces.
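As a concrete illustration, a minimal training sketch using NLTK's BrillTaggerTrainer (an assumption — the slides name the BrillTagger class but no full workflow; requires the treebank corpus to be downloaded):

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# nltk.download("treebank")  # uncomment on first run

train_sents = treebank.tagged_sents()[:3000]

# Initial tagger whose output the Brill rules will correct.
initial = UnigramTagger(train_sents)

# Learn transformation rules, scored by (errors fixed - errors introduced).
trainer = BrillTaggerTrainer(initial, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print(brill_tagger.tag("The cat jumps".split()))
```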
Hidden Markov Model
• A Hidden Markov Model (HMM) is a probabilistic graphical
model used for modeling sequences of observations.
• It's widely used in various fields, including speech
recognition, natural language processing, bioinformatics,
finance, and more.
● Model Components:
● States (Hidden States): Each state in the HMM corresponds to a POS tag
(e.g., noun, verb, adjective).
● Observations (Emissions): Each state emits an observation (word) from
the vocabulary. These emissions are the actual words in the sentence.
● Transition Probabilities: The probability of transitioning from one POS tag
to another. For example, the probability of transitioning from a noun to a
verb.
● Emission Probabilities: The probability of emitting a particular word from a
specific POS tag. For example, the probability of the word "run" given the
POS tag "verb".
● Training:
● The HMM is trained on a labeled corpus of text, where each word is
annotated with its corresponding POS tag.
● The transition probabilities and emission probabilities are estimated
from the training data. This involves counting the occurrences of
transitions and emissions and normalizing them to obtain
probabilities.
● Decoding (Inference):
● Given a new, unlabeled sentence, the goal is to find the most likely
sequence of POS tags for that sentence.
● This is done using the Viterbi algorithm, which efficiently computes
the most likely sequence of hidden states (POS tags) given the
observed words.
● Example:
● Let's say we have the sentence: "The cat jumps."
● The corresponding sequence of POS tags could be: "DET NOUN
VERB."
● The HMM would compute the probability of this sequence and
compare it with other possible sequences to find the most likely one.
● Handling Unknown Words:
● Since HMMs rely on emission probabilities, they may struggle with
out-of-vocabulary or rare words. Techniques like smoothing or using
a special token for unknown words can help mitigate this issue.
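To make the decoding step concrete, here is a self-contained Viterbi sketch for the "The cat jumps" example; all transition and emission probabilities below are invented toy numbers, not parameters estimated from a corpus:

```python
# Hypothetical HMM parameters (toy numbers for illustration only).
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.20, "VERB": 0.70},
    "VERB": {"DET": 0.40, "NOUN": 0.40, "VERB": 0.20},
}
emit_p = {
    "DET":  {"the": 0.9, "cat": 0.0, "jumps": 0.0},
    "NOUN": {"the": 0.0, "cat": 0.8, "jumps": 0.1},
    "VERB": {"the": 0.0, "cat": 0.0, "jumps": 0.9},
}

def viterbi(words):
    """Return the most likely state (tag) sequence for the observed words."""
    # V[t][s] = (best probability of reaching state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][ps][0] * trans_p[ps][s] * emit_p[s][words[t]], ps)
                for ps in states
            )
            V[t][s] = (prob, prev)
    # Trace back from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "cat", "jumps"]))  # -> ['DET', 'NOUN', 'VERB']
```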
Information Extraction (IE)

Information Extraction (IE) is the task of automatically extracting structured information, such as entities and relations, from unstructured text.
● Named Entity Recognition (NER) is a natural language processing
(NLP) task that involves identifying and classifying named entities in
text into predefined categories such as names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc.
NAMED ENTITIES
● Named Entities:
● Named entities are specific entities in text that refer to people,
places, organizations, dates, times, and various other types of
entities with specific names.
● Categories of Named Entities:
● Common categories include:
○ Person (e.g., "John Doe")
○ Organization (e.g., "Google")
○ Location (e.g., "New York City")
○ Date (e.g., "January 1, 2020")
○ Time (e.g., "3:30 PM")
○ Quantity (e.g., "10 kilograms")
○ Money (e.g., "$100")
● Applications of NER:
● NER is a critical component in various NLP applications such as:
○ Information extraction
○ Document summarization
○ Machine translation
○ Question answering systems
○ Sentiment analysis
○ Entity linking (connecting entities to external knowledge bases)

● Techniques for NER:
● NER can be approached using rule-based systems, statistical models, or deep learning methods. Popular techniques include:
○ Rule-Based Approaches: These use handcrafted rules to identify named entities based on
patterns, dictionaries, and linguistic heuristics.
○ Statistical Models: Models like Conditional Random Fields (CRFs) and Hidden Markov Models
(HMMs) were traditionally used for NER.
○ Machine Learning and Deep Learning: Sequence labeling models like Bidirectional LSTM-CRFs
and Transformer-based models like BERT have shown state-of-the-art performance in NER tasks.
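As a concrete illustration, a minimal NER sketch using the spaCy library (an assumption — the slides don't prescribe a toolkit; the en_core_web_sm model must be installed):

```python
import spacy

# python -m spacy download en_core_web_sm  # first-time setup
nlp = spacy.load("en_core_web_sm")

doc = nlp("John Doe joined Google in New York City on January 1, 2020 for $100.")
for ent in doc.ents:
    # e.g. "John Doe -> PERSON", "Google -> ORG", "New York City -> GPE"
    print(ent.text, "->", ent.label_)
```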
WORDNET
● WordNet is a lexical database of the English language that groups
words into sets of synonyms called synsets. However, it is not
directly related to relation extraction, which is a task in natural
language processing (NLP) that involves identifying and extracting
semantic relationships between entities in text.
● Relation extraction focuses on determining how entities mentioned in
a sentence are related to each other. For example, in the sentence
"John works at Microsoft," the relation extraction task would involve
identifying that "John" is an employee of "Microsoft."
● WordNet, on the other hand, provides information about the lexical
and semantic relationships between words. It includes definitions,
synonyms, antonyms, and hypernyms (more general terms) for a
wide range of English words.
● While WordNet can be used in conjunction with relation extraction
systems to provide additional lexical and semantic information, they
are distinct concepts in NLP. WordNet can potentially aid in feature
engineering or providing additional context for relation extraction
systems, but it is not a direct tool for performing relation extraction.
WordNet: A Database of Lexical Relations

A hypernym is a term whose meaning includes the meanings of other words; it is a broad superordinate label that applies to many other members of a set. It describes the broader, more abstract term. For example, "dog" is a hypernym of "labrador" and "german shepherd".

A hyponym is a more specialised and specific word; hyponymy is a hierarchical relationship which may consist of a number of levels. Hyponyms are the more specific terms. For example, "dog" is a hyponym of "animal".
WordNet-based similarity measures are techniques used to quantify the similarity or relatedness between words or concepts based on the information provided by WordNet. WordNet is a lexical database of the English language that organizes words into synsets (sets of synonyms) and provides relationships between them.

Here are some common WordNet-based similarity measures (a sketch of their use follows this list):

▪ Path Similarity
▪ Leacock-Chodorow Similarity
▪ Wu-Palmer Similarity
▪ Resnik Similarity
▪ etc.
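A minimal sketch of these measures using NLTK's WordNet interface (assumes the WordNet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

# nltk.download("wordnet")  # first-time setup
dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")

print(dog.path_similarity(cat))  # path-based similarity in [0, 1]
print(dog.lch_similarity(cat))   # Leacock-Chodorow (same-POS synsets only)
print(dog.wup_similarity(cat))   # Wu-Palmer, based on depth of the common ancestor
```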
Wu-Palmer Similarity

The Wu-Palmer Similarity is a similarity measure used with WordNet, a lexical database of the English language. It calculates the semantic relatedness between two synsets (sets of synonyms) based on the depth of their common ancestor in the WordNet hierarchy.

Find the Common Ancestor:
Given two synsets, identify their most specific common ancestor (the least common subsumer, LCS). This is the closest node in the WordNet hierarchy that both synsets share.

Calculate the Depth:
Measure the depth (number of edges from the root) of each synset and of the common ancestor. This provides an indication of how specific or general the concepts are.

Compute the Wu-Palmer Similarity:
The formula for Wu-Palmer Similarity (WPS) is:

WPS(s1, s2) = 2 × depth(LCS(s1, s2)) / (depth(s1) + depth(s2))

The similarity score ranges from 0 to 1, where 0 indicates no similarity and 1 indicates maximum similarity. A higher similarity score implies that the concepts represented by the synsets are more closely related.
Wu-Palmer Similarity

For example, if we consider the synsets for "cat" and "dog" in WordNet, the Wu-Palmer Similarity would be computed based on the depth of their common ancestor (e.g., "carnivore.n.01") and their individual depths in the WordNet hierarchy. The resulting similarity score provides a quantifiable measure of how closely related these concepts are in the context of WordNet.
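The same computation sketched with NLTK; note the manual formula below may differ slightly from the built-in wup_similarity, which also accounts for WordNet's virtual root node:

```python
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
dog = wn.synset("dog.n.01")

# Most specific common ancestor (least common subsumer).
lcs = cat.lowest_common_hypernyms(dog)[0]
print(lcs)  # e.g. Synset('carnivore.n.01')

# WPS = 2 * depth(LCS) / (depth(s1) + depth(s2))
wps = 2 * lcs.max_depth() / (cat.max_depth() + dog.max_depth())
print(wps)
print(cat.wup_similarity(dog))  # NLTK's built-in version, for comparison
```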
Concept Mining

Concept mining in natural language processing (NLP) involves the extraction of meaningful and relevant concepts from a collection of text data. It aims to identify key ideas, topics, or entities that are important in understanding the content of the text.

Common techniques include:

▪ Term Frequency-Inverse Document Frequency (TF-IDF)
▪ Latent Semantic Analysis (LSA)
▪ Topic Modeling
▪ Named Entity Recognition (NER)
Latent Semantic Analysis (LSA)

▪ Latent Semantic Analysis (LSA) involves creating structured data from a collection of unstructured texts.

▪ LSA uses a statistical approach to identify the association among the words in a document.

▪ LSA is an information retrieval technique that analyzes and identifies patterns in an unstructured collection of texts and the relationships between them.

▪ LSA itself is an unsupervised way of uncovering synonyms in a collection of documents.
Latent Semantic Analysis (LSA)

Assumptions of LSA:
▪ Words which are used in the same context are analogous to each other.
▪ The hidden semantic structure of the data is unclear due to the ambiguity of the words chosen.
Latent Semantic Analysis (LSA)

Singular Value Decomposition (SVD):

▪ SVD is the statistical method that is used to find the latent (hidden) semantic structure of words spread across the documents.

▪ In other words, word frequencies in different documents play a key role in extracting the latent topics.

▪ LSA extracts these latent dimensions using a matrix factorization algorithm called Singular Value Decomposition, or SVD.
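A minimal LSA sketch using scikit-learn's TruncatedSVD (an illustrative choice — the slides don't prescribe a library; the four tiny documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock markets fell sharply",
    "investors sold shares as markets dropped",
]

# Term-document weighting, then SVD to get 2 latent "concept" dimensions.
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)

print(doc_topics)  # each row: a document's coordinates in concept space
```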
Case Study

Latent Semantic Analysis & Sentiment Classification with Python
• Natural Language Processing, LSA, sentiment analysis
https://towardsdatascience.com/latent-semantic-analysis-sentiment-classification-with-python-5f657346f6a3