MODULE 2
Natural Language Processing
● Introduction
● N-grams: simple unsmoothed n-grams; smoothing, backoff; spelling correction using n-grams; metrics to evaluate n-grams
● Parts-of-Speech tagging: word classes, POS tagging using Brill's tagger and HMMs
● Information Extraction: introduction to Named Entity Recognition, Relation Extraction
● WordNet and WordNet-based similarity measures
● Concept mining using Latent Semantic Analysis
Simple unsmoothed n-grams
Predicting

The next few words someone is going to say:

• "Please turn off your cell ..."
• "About fifteen minutes from ..."

❖ Assign a probability to each possible next word, or to the whole sentence.
Predicting

• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– "Their are problems wit this sentence."
Language Model
• A probabilistic (statistical) model that determines the probability of a given sequence of words occurring in a sentence based on the previous words.

• It helps to predict which word is more likely to appear next in the sentence.

• The input to a language model is usually a training set of example sentences.

• The output is a probability distribution over sequences of words.

• To predict the next word we can condition on no previous words (unigram), the last one word (bigram), the last two words (trigram), or more generally the last n-1 words (n-gram), as our requirements dictate.
Language Model (LM)

□ Language Model: models that assign probabilities to sequences of words.
Why Language Models?

• They are a way of transforming qualitative information about text into quantitative information that machines can understand.

• They have applications in a wide range of industries like tech, finance, healthcare, military etc.

• Hence they are widely used in predictive text input systems, speech recognition, machine translation, spelling correction etc.
Language Model (LM)
▪ An n-gram is a sequence of n words:
❖ a 2-gram (bigram) is a two-word sequence of words
• like "please turn", "turn your", or "your homework", and
❖ a 3-gram (trigram) is a three-word sequence of words
• like "please turn your", or "turn your homework".
Language Model (LM)

• The Markov assumption simplifies the modeling of sequences of words or tokens by assuming that the probability of a word depends only on a fixed number of preceding words.

• This number is called the "order" of the Markov model and is denoted as "n" in n-gram models.
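In symbols: instead of conditioning on the entire history, an n-gram model uses only the last n-1 words. For the bigram case (n = 2):

P(w_k | w_1, w_2, ..., w_(k-1)) ≈ P(w_k | w_(k-1))

so the probability of a whole sentence is approximated by the product of these conditional probabilities.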

Language Model (LM)
How do we estimate these bigram or n-gram probabilities?

To compute a particular bigram probability of a word y given a previous word x, compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x (which equals the unigram count of the previous word):

P(y | x) = C(xy) / C(x)

where x is the previous word and y is the current word.
Language Model (LM)
How do we estimate these bigram or n-gram probabilities?

Example: calculations for some of the bigram probabilities from a corpus, using its bigram and unigram counts.

Compute the probability of the sentence "I want English food" under a bigram model:

P(<s> I want English food </s>) = P(I | <s>) × P(want | I) × P(English | want) × P(food | English) × P(</s> | food)
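A minimal sketch of these counts in Python (the three-sentence corpus below is invented for illustration; <s> and </s> mark sentence boundaries):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
sentences = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> i like english tea </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "want"))        # 2/3
print(bigram_prob("want", "english"))  # 1/2
```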
Everything in log space?

□ Avoids underflow (multiplying many small probabilities quickly approaches zero).
□ Adding is faster than multiplying.

Since log(p1 × p2 × p3) = log p1 + log p2 + log p3, we can add log probabilities instead of multiplying raw probabilities.
Evaluating Language Models
□ Extrinsic evaluation:
embed the model in an application and measure how much the application improves.

□ Intrinsic evaluation:
measures the quality of a model independent of any application.
Evaluating Language Models
□ Extrinsic evaluation of an N-gram language model measures how much an application improves.

□ To compare two language models A and B:
– Use each language model in a task such as a spelling corrector or MT system.
– Get an accuracy for A and for B:
• How many misspelled words are corrected properly
• How many words are translated correctly
– Compare the accuracy for A and B:
• The model that produces the better accuracy is the better model.

□ Extrinsic evaluation can be time-consuming.
Evaluating Language Models
□ Intrinsic evaluation.
□ Perplexity

• Perplexity is a measurement used in natural language processing and information theory.
• It assesses how well a probabilistic model, such as a language model, predicts a given sequence of words or tokens.
• It is often used to evaluate the performance of language models, including n-gram models.
• Lower perplexity values indicate better performance.
• A perplexity of 1 would mean that the model perfectly predicts the test data.
Steps for Perplexity

• Language Model Training:
First, you train an n-gram language model on a corpus of text. An n-gram language model assigns probabilities to sequences of n words based on their occurrence in the training data.

• Test Data:
Next, you have a test set of sentences or text sequences that you want to evaluate. These sentences are typically not seen during training.

• Perplexity Calculation:
For each sentence in the test set, you calculate the perplexity using the following formula:

Perplexity(S) = P(w_1, w_2, ..., w_N)^(-1/N)

where S represents the sentence, N is the number of words in the sentence, and P(w_1, w_2, ..., w_N) is the probability assigned to the sentence by your n-gram model.
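A minimal sketch of this calculation, reusing the bigram_prob estimator from the earlier toy example (note that an unseen bigram would make math.log fail on zero, which is exactly the zeros problem addressed by smoothing below):

```python
import math

def perplexity(sentence, prob_fn):
    """Perplexity(S) = P(w_1..w_N)^(-1/N), computed in log space to avoid underflow."""
    words = sentence.split()
    log_p = 0.0
    for prev, word in zip(words, words[1:]):
        log_p += math.log(prob_fn(prev, word))
    n = len(words) - 1  # number of predicted tokens (bigram transitions)
    return math.exp(-log_p / n)

print(perplexity("<s> i want english food </s>", bigram_prob))  # ~1.43 on the toy corpus
```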
Generalization
• Generalization is the ability of a language model to make reasonable predictions on unseen or out-of-sample data.

● Strength: N-gram models can generalize to some extent because they capture statistical patterns in the training data.

● For example, if a model has seen the bigram "hot coffee" frequently, it can generalize to predict "iced tea" even if it hasn't seen that specific combination before.
Zeros
Limitation: N-gram models may struggle with longer-range dependencies or understanding the semantics of language.

If an n-gram model encounters an n-gram that it has never seen during training, it may assign a probability of zero to that n-gram, leading to issues with prediction accuracy.

For example, if the trigram "denied the offer" never appears in the training data, our model will incorrectly estimate that P(offer | denied the) = 0!

To address the issue of zeros and improve generalization, techniques like smoothing and backoff are commonly used in n-gram models.
Smoothing

• Smoothing is a technique used in n-gram language models to address the problem of zero probabilities for n-grams that were not observed in the training data.

• Since language is highly diverse, it's common for n-gram models to encounter word sequences that they haven't seen before.

• Smoothing helps assign non-zero probabilities to these unseen n-grams, making the model more robust and improving its ability to generalize.
Smoothing

❖ There are many ways to do smoothing, and some of them are:
▪ Add-1 smoothing (Laplace smoothing)
▪ Add-k smoothing
▪ Stupid backoff
▪ Kneser-Ney smoothing
Laplace Smoothing

• Laplace smoothing, also known as Add-One smoothing, is a straightforward technique used in n-gram language models to address the problem of zero probabilities for unseen n-grams.

• It's one of the simplest smoothing methods and is often used as a baseline for smoothing in n-gram models.

• In Laplace smoothing, a fixed constant (usually 1) is added to the count of each unique n-gram during training.

• This ensures that no n-gram has a probability of zero, even if it was not observed in the training data.
Let's consider a bigram (2-gram) model as an example:

• Count the N-grams:
Count(w_1, w_2) is the number of times the bigram (w_1, w_2) occurs in the training data.
Count(w_1) is the number of times the unigram w_1 occurs in the training data.

• Laplace-Smoothed Probability Calculation:
The Laplace-smoothed probability of a word w_2 given the preceding word w_1 is calculated as follows:

P(w_2 | w_1) = (Count(w_1, w_2) + 1) / (Count(w_1) + V)

Count(w_1, w_2) is the count of the bigram (w_1, w_2) in the training data.
Count(w_1) is the count of the unigram w_1 (the preceding word) in the training data.
V is the vocabulary size, which represents the total number of unique words in the training data.

The "+1" in the numerator is the Laplace smoothing constant, which is added to each count. The "+V" in the denominator accounts for the extra mass added across the vocabulary, so that the probabilities still sum to 1.
Example of Laplace Smoothing

Training text: "I like to eat pizza. I like to drink soda."

We want to calculate the Laplace-smoothed probability of the word "pizza" given the preceding word "eat".

Counting N-grams:
Count("eat pizza"): 1 time
Count("eat"): 1 time
Vocabulary size (V): 7 unique words (I, like, to, eat, pizza, drink, soda)

Laplace-Smoothed Probability Calculation:
We'll calculate P("pizza" | "eat") using the Laplace smoothing formula:
P("pizza" | "eat") = (Count("eat pizza") + 1) / (Count("eat") + V)
Count("eat pizza") = 1 (from training data)
Count("eat") = 1 (from training data)
V = 7 (vocabulary size)
P("pizza" | "eat") = (1 + 1) / (1 + 7) = 2 / 8 = 0.25

So, the Laplace-smoothed probability of the word "pizza" following the word "eat" is 0.25.
Add-k Smoothing
❖ One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events.
□ Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?).
□ This algorithm is called add-k smoothing:

P(w_2 | w_1) = (Count(w_1, w_2) + k) / (Count(w_1) + kV)

V is the total number of unique events (the vocabulary size).
k is the smoothing parameter.

The value of k is typically chosen based on some heuristic or through cross-validation.
Backoff

• Backoff is a simpler technique compared to interpolation.
• It's based on the idea that if a higher-order n-gram (a sequence of n words) has low or zero probability, you can "back off" to a lower-order n-gram to get a non-zero probability estimate.
Example of Backoff

• For example, suppose you're trying to estimate the probability of the sentence "I am going to the store".
• If you have no data for this specific sentence, you can "back off" and estimate the probability by looking at the trigram "to the store",
• or the bigram "the store", or even the unigram "store".
• The idea is to progressively simplify the context until you have enough data to make a reliable estimate.
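A minimal sketch in the spirit of "stupid backoff" (the corpus and the 0.4 discount factor are illustrative assumptions; note that stupid backoff produces scores rather than true probabilities):

```python
from collections import Counter

tokens = "i am going to the store i am going home".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def stupid_backoff(prev, word, alpha=0.4):
    """Score word given prev; back off to the unigram when the bigram is unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: fall back to a discounted unigram estimate.
    return alpha * unigrams[word] / N

print(stupid_backoff("the", "store"))  # seen bigram: uses the bigram estimate
print(stupid_backoff("the", "home"))   # unseen bigram: backs off to the unigram
```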
Interpolation

• Interpolation is a technique used in language modeling to estimate the probability of a word or sequence of words in a given context by mixing n-gram estimates of different orders.

• Weight Assignment: The weights assigned to the different n-grams are determined based on their relative importance.

• These can be heuristically chosen or tuned through cross-validation.
Interpolation
• For example, in a bigram/trigram interpolation, you might assign a weight of 0.7 to the bigram probability and 0.3 to the trigram probability.

● Then you calculate the final probability as:

● P("apple" | "I like to eat") = 0.7 × P("apple" | "eat") + 0.3 × P("apple" | "to eat")

(the bigram estimate conditions on "eat"; the trigram estimate conditions on "to eat")

• Adaptation: The choice of weights can be adapted based on the specific data and the performance of the model.

• It's common to experiment with different weight combinations to see which yields the best results.
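A sketch of this weighted mixing (the component estimators and their values below are placeholders for illustration, not real corpus estimates):

```python
def interpolated_prob(word, context, p_bigram, p_trigram, lambdas=(0.7, 0.3)):
    """Linear interpolation: mix a bigram and a trigram estimate.

    p_bigram(word, prev) and p_trigram(word, prev2, prev1) are placeholder
    estimators supplied by the caller; the lambdas should sum to 1.
    """
    l_bi, l_tri = lambdas
    return (l_bi * p_bigram(word, context[-1]) +
            l_tri * p_trigram(word, context[-2], context[-1]))

# Toy usage with made-up component probabilities:
p_bi = lambda w, prev: 0.2        # pretend P("apple" | "eat") = 0.2
p_tri = lambda w, p2, p1: 0.5     # pretend P("apple" | "to eat") = 0.5
context = ["i", "like", "to", "eat"]
print(interpolated_prob("apple", context, p_bi, p_tri))  # 0.7*0.2 + 0.3*0.5 = 0.29
```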
Part of Speech Tagging

● Part-of-Speech (POS) tagging is a process in natural language processing (NLP) that involves assigning a grammatical label (such as noun, verb, adjective, etc.) to each word in a sentence,
● based on its syntactic role within that sentence.
● This task is crucial for various NLP applications, such as information extraction, machine translation, and sentiment analysis.
● Purpose:
● POS tagging provides valuable linguistic information about a text. It helps
in understanding the structure and meaning of sentences.
● It's a fundamental step in many higher-level NLP tasks like parsing,
sentiment analysis, and named entity recognition.
● POS Categories:
● Common POS categories include nouns (N), verbs (V), adjectives (ADJ),
adverbs (ADV), pronouns (PRON), conjunctions (CONJ), prepositions
(PREP), determiners (DET), and interjections (INTJ), among others.
● Ambiguity:

● Many words can have multiple POS tags depending on the context. For
example, "lead" can be a verb ("He will lead the team") or a noun ("The
pencil has a lead tip").
● Tagging Methods:
● POS tagging can be performed using rule-based methods, statistical
models, or deep learning techniques. Statistical models like Hidden
Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs)
were traditionally used, while deep learning methods like Recurrent
Neural Networks (RNNs) and Transformer-based models (e.g., BERT)
have shown significant improvements.
● Corpora and Training:
● To train a POS tagger, a large annotated corpus (a collection of text with
labeled parts of speech) is needed. This corpus is used to learn the
relationships between words and their corresponding POS tags.
● Challenges:
● Ambiguity, rare words, and complex sentence structures can pose
challenges for POS tagging systems. Handling languages with free
word order, like Latin or Finnish, can be particularly challenging.
● Evaluation:
● POS taggers are evaluated using metrics like accuracy, precision,
recall, and F1-score. These metrics compare the predicted tags to
the gold-standard tags in the annotated corpus.
● Applications:
● POS tagging is a crucial component in various NLP applications, including:
○ Information extraction: Identifying key information from texts.
○ Parsing: Understanding the syntactic structure of sentences.
○ Sentiment analysis: Analyzing the sentiment expressed in a sentence or text.

○ Machine translation: Assisting in the translation of text from one language to another.
PART-of-SPEECH (POS) – How Many?
• Word Classes
In grammar, a part of speech or part-of-speech (POS) is also known as a word class or grammatical category: a category of words that have similar grammatical properties.

• The English language has four major word classes: nouns, verbs, adjectives, and adverbs.

• Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, numerals, articles, and determiners.

• These can be further categorized into open and closed classes.
Brill Tagger
● The BrillTagger class (e.g., in NLTK) is a transformation-based tagger.
● Rule-based taggers use a dictionary or lexicon to get the possible tags for each word.
● If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct tag.
● The Brill tagger additionally uses a series of learned rules to correct the results of an initial tagger.
● The rules it follows are score-based.
● A rule's score is equal to the number of errors it corrects minus the number of new errors it produces.
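As a concrete illustration, a minimal training sketch using NLTK's BrillTaggerTrainer (an assumption — the slides name the BrillTagger class but no full workflow; requires the treebank corpus to be downloaded):

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# nltk.download("treebank")  # uncomment on first run

train_sents = treebank.tagged_sents()[:3000]

# Initial tagger whose output the Brill rules will correct.
initial = UnigramTagger(train_sents)

# Learn transformation rules, scored by (errors fixed - errors introduced).
trainer = BrillTaggerTrainer(initial, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print(brill_tagger.tag("The cat jumps".split()))
```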
Hidden Markov Model
• A Hidden Markov Model (HMM) is a probabilistic graphical
model used for modeling sequences of observations.
• It's widely used in various fields, including speech
recognition, natural language processing, bioinformatics,
finance, and more.
● Model Components:
● States (Hidden States): Each state in the HMM corresponds to a POS tag
(e.g., noun, verb, adjective).
● Observations (Emissions): Each state emits an observation (word) from
the vocabulary. These emissions are the actual words in the sentence.
● Transition Probabilities: The probability of transitioning from one POS tag
to another. For example, the probability of transitioning from a noun to a
verb.
● Emission Probabilities: The probability of emitting a particular word from a
specific POS tag. For example, the probability of the word "run" given the
POS tag "verb".
● Training:
● The HMM is trained on a labeled corpus of text, where each word is
annotated with its corresponding POS tag.
● The transition probabilities and emission probabilities are estimated
from the training data. This involves counting the occurrences of
transitions and emissions and normalizing them to obtain
probabilities.
● Decoding (Inference):
● Given a new, unlabeled sentence, the goal is to find the most likely
sequence of POS tags for that sentence.
● This is done using the Viterbi algorithm, which efficiently computes
the most likely sequence of hidden states (POS tags) given the
observed words.
● Example:
● Let's say we have the sentence: "The cat jumps."
● The corresponding sequence of POS tags could be: "DET NOUN
VERB."
● The HMM would compute the probability of this sequence and
compare it with other possible sequences to find the most likely one.
● Handling Unknown Words:
● Since HMMs rely on emission probabilities, they may struggle with
out-of-vocabulary or rare words. Techniques like smoothing or using
a special token for unknown words can help mitigate this issue.
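To make the decoding step concrete, here is a self-contained Viterbi sketch for the "The cat jumps" example; all transition and emission probabilities below are invented toy numbers, not parameters estimated from a corpus:

```python
# Hypothetical HMM parameters (toy numbers for illustration only).
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.20, "VERB": 0.70},
    "VERB": {"DET": 0.40, "NOUN": 0.40, "VERB": 0.20},
}
emit_p = {
    "DET":  {"the": 0.9, "cat": 0.0, "jumps": 0.0},
    "NOUN": {"the": 0.0, "cat": 0.8, "jumps": 0.1},
    "VERB": {"the": 0.0, "cat": 0.0, "jumps": 0.9},
}

def viterbi(words):
    """Return the most likely state (tag) sequence for the observed words."""
    # V[t][s] = (best probability of reaching state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][ps][0] * trans_p[ps][s] * emit_p[s][words[t]], ps)
                for ps in states
            )
            V[t][s] = (prob, prev)
    # Trace back from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "cat", "jumps"]))  # -> ['DET', 'NOUN', 'VERB']
```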
Information Extraction (IE)

Information Extraction (IE) is the task of automatically extracting structured information, such as entities and relations, from unstructured text.
● Named Entity Recognition (NER) is a natural language processing
(NLP) task that involves identifying and classifying named entities in
text into predefined categories such as names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc.
NAMED ENTITIES
● Named Entities:
● Named entities are specific entities in text that refer to people,
places, organizations, dates, times, and various other types of
entities with specific names.
● Categories of Named Entities:
● Common categories include:
○ Person (e.g., "John Doe")
○ Organization (e.g., "Google")
○ Location (e.g., "New York City")
○ Date (e.g., "January 1, 2020")
○ Time (e.g., "3:30 PM")
○ Quantity (e.g., "10 kilograms")
○ Money (e.g., "$100")
● Applications of NER:
● NER is a critical component in various NLP applications such as:
○ Information extraction
○ Document summarization
○ Machine translation
○ Question answering systems
○ Sentiment analysis
○ Entity linking (connecting entities to external knowledge bases)

● Techniques for NER:
● NER can be approached using rule-based systems, statistical models, or deep learning methods. Popular techniques include:
○ Rule-Based Approaches: These use handcrafted rules to identify named entities based on
patterns, dictionaries, and linguistic heuristics.
○ Statistical Models: Models like Conditional Random Fields (CRFs) and Hidden Markov Models
(HMMs) were traditionally used for NER.
○ Machine Learning and Deep Learning: Sequence labeling models like Bidirectional LSTM-CRFs
and Transformer-based models like BERT have shown state-of-the-art performance in NER tasks.
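As a concrete illustration, a minimal NER sketch using the spaCy library (an assumption — the slides don't prescribe a toolkit; the en_core_web_sm model must be installed):

```python
import spacy

# python -m spacy download en_core_web_sm  # first-time setup
nlp = spacy.load("en_core_web_sm")

doc = nlp("John Doe joined Google in New York City on January 1, 2020 for $100.")
for ent in doc.ents:
    # e.g. "John Doe -> PERSON", "Google -> ORG", "New York City -> GPE"
    print(ent.text, "->", ent.label_)
```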
WORDNET
● WordNet is a lexical database of the English language that groups
words into sets of synonyms called synsets. However, it is not
directly related to relation extraction, which is a task in natural
language processing (NLP) that involves identifying and extracting
semantic relationships between entities in text.
● Relation extraction focuses on determining how entities mentioned in
a sentence are related to each other. For example, in the sentence
"John works at Microsoft," the relation extraction task would involve
identifying that "John" is an employee of "Microsoft."
● WordNet, on the other hand, provides information about the lexical
and semantic relationships between words. It includes definitions,
synonyms, antonyms, and hypernyms (more general terms) for a
wide range of English words.
● While WordNet can be used in conjunction with relation extraction
systems to provide additional lexical and semantic information, they
are distinct concepts in NLP. WordNet can potentially aid in feature
engineering or providing additional context for relation extraction
systems, but it is not a direct tool for performing relation extraction.
WordNet: A Database of Lexical Relations

A hypernym is a term whose meaning includes the meanings of other words; it is a broad superordinate label that applies to many other members of a set. It describes the broader, more abstract term. For example, "dog" is a hypernym of "labrador" and "german shepherd".

A hyponym is a more specialised and specific word; hyponymy is a hierarchical relationship which may consist of a number of levels. Hyponyms are the more specific terms. For example, "dog" is a hyponym of "animal".
WordNet-based similarity measures are techniques used to quantify the similarity or relatedness between words or concepts based on the information provided by WordNet. WordNet is a lexical database of the English language that organizes words into synsets (sets of synonyms) and provides relationships between them.

Here are some common WordNet-based similarity measures (a sketch of their use follows this list):

▪ Path Similarity
▪ Leacock-Chodorow Similarity
▪ Wu-Palmer Similarity
▪ Resnik Similarity
▪ etc.
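A minimal sketch of these measures using NLTK's WordNet interface (assumes the WordNet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

# nltk.download("wordnet")  # first-time setup
dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")

print(dog.path_similarity(cat))  # path-based similarity in [0, 1]
print(dog.lch_similarity(cat))   # Leacock-Chodorow (same-POS synsets only)
print(dog.wup_similarity(cat))   # Wu-Palmer, based on depth of the common ancestor
```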
Wu-Palmer Similarity

The Wu-Palmer Similarity is a similarity measure used with WordNet, a lexical database of the English language. It calculates the semantic relatedness between two synsets (sets of synonyms) based on the depth of their common ancestor in the WordNet hierarchy.

Find the Common Ancestor:
Given two synsets, identify their most specific common ancestor (the least common subsumer, LCS). This is the closest node in the WordNet hierarchy that both synsets share.

Calculate the Depth:
Measure the depth (number of edges from the root) of each synset and of the common ancestor. This provides an indication of how specific or general the concepts are.

Compute the Wu-Palmer Similarity:
The formula for Wu-Palmer Similarity (WPS) is:

WPS(s1, s2) = 2 × depth(LCS(s1, s2)) / (depth(s1) + depth(s2))

The similarity score ranges from 0 to 1, where 0 indicates no similarity and 1 indicates maximum similarity. A higher similarity score implies that the concepts represented by the synsets are more closely related.
Wu-Palmer Similarity

For example, if we consider the synsets for "cat" and "dog" in WordNet, the Wu-Palmer Similarity would be computed based on the depth of their common ancestor (e.g., "carnivore.n.01") and their individual depths in the WordNet hierarchy. The resulting similarity score provides a quantifiable measure of how closely related these concepts are in the context of WordNet.
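The same computation sketched with NLTK; note the manual formula below may differ slightly from the built-in wup_similarity, which also accounts for WordNet's virtual root node:

```python
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")
dog = wn.synset("dog.n.01")

# Most specific common ancestor (least common subsumer).
lcs = cat.lowest_common_hypernyms(dog)[0]
print(lcs)  # e.g. Synset('carnivore.n.01')

# WPS = 2 * depth(LCS) / (depth(s1) + depth(s2))
wps = 2 * lcs.max_depth() / (cat.max_depth() + dog.max_depth())
print(wps)
print(cat.wup_similarity(dog))  # NLTK's built-in version, for comparison
```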
Concept Mining

Concept mining in natural language processing (NLP) involves the extraction of meaningful and relevant concepts from a collection of text data. It aims to identify key ideas, topics, or entities that are important in understanding the content of the text.

Common techniques include:

▪ Term Frequency-Inverse Document Frequency (TF-IDF)
▪ Latent Semantic Analysis (LSA)
▪ Topic Modeling
▪ Named Entity Recognition (NER)
Latent Semantic Analysis (LSA)

▪ Latent Semantic Analysis (LSA) involves creating structured data from a collection of unstructured texts.

▪ LSA uses a statistical approach to identify the association among the words in a document.

▪ LSA is an information retrieval technique that analyzes and identifies patterns in an unstructured collection of texts and the relationships between them.

▪ LSA itself is an unsupervised way of uncovering synonyms in a collection of documents.
Latent Semantic Analysis (LSA)

Assumptions of LSA:
▪ Words which are used in the same context are analogous to each other.
▪ The hidden semantic structure of the data is unclear due to the ambiguity of the words chosen.
Latent Semantic Analysis (LSA)

Singular Value Decomposition (SVD):

▪ SVD is the statistical method that is used to find the latent (hidden) semantic structure of words spread across the documents.

▪ In other words, word frequencies in different documents play a key role in extracting the latent topics.

▪ LSA extracts these latent dimensions using a matrix factorization algorithm called Singular Value Decomposition, or SVD.
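A minimal LSA sketch using scikit-learn's TruncatedSVD (an illustrative choice — the slides don't prescribe a library; the four tiny documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock markets fell sharply",
    "investors sold shares as markets dropped",
]

# Term-document weighting, then SVD to get 2 latent "concept" dimensions.
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)

print(doc_topics)  # each row: a document's coordinates in concept space
```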
Case Study

Latent Semantic Analysis & Sentiment Classification with Python
• Natural Language Processing, LSA, sentiment analysis
https://towardsdatascience.com/latent-semantic-analysis-sentiment-classification-with-python-5f657346f6a3