MODULE 2
Natural Language Processing
• Introduction
● N-grams: simple unsmoothed n-grams; smoothing, backoff, spelling correction using n-grams; metrics to evaluate n-grams; Parts-of-Speech tagging: word classes, POS tagging using Brill's tagger and HMMs
● Information Extraction: Introduction
Predicting
• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– "Their are problems wit this sentence."
Language Model
• A statistical model that assigns a probability to a sequence of words, estimating how likely each word is to occur given the words that precede it.
• It helps predict which word is most likely to appear next in the sentence.
• Depending on our requirements, we can condition on the previous word (bigram model), the previous two words (trigram model), or more generally the previous n-1 words (n-gram model); a unigram model uses no preceding context at all.
Language Model (LM)
□ A language model is a model that assigns probabilities to sequences of words.
Language Model (LM)
How do we estimate these bigram or n-gram probabilities (the probability of the current word given the previous word)?
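The standard answer is the maximum likelihood estimate: count and normalize, P(current | previous) = Count(previous, current) / Count(previous). Below is a minimal count-and-normalize sketch on a toy corpus (the corpus and helper names are illustrative, not from the slides):

```python
# Minimal sketch: maximum likelihood estimation of bigram probabilities.
# P(current | previous) = Count(previous, current) / Count(previous)
from collections import Counter

corpus = ["i like to eat pizza", "i like to drink soda"]  # toy data

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words[:-1])
    bigram_counts.update(zip(words[:-1], words[1:]))

def bigram_prob(prev, cur):
    # Count-and-normalize; returns 0.0 for unseen bigrams (hence smoothing).
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(bigram_prob("like", "to"))  # 1.0: "to" always follows "like" here
print(bigram_prob("to", "eat"))   # 0.5: "eat" follows "to" in 1 of 2 cases
```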
Why compute everything in log space?
□ Avoids numerical underflow when multiplying many small probabilities.
□ Adding is faster than multiplying.
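For instance (a toy illustration; the probability values are arbitrary):

```python
import math

probs = [1e-5] * 300  # many small per-word probabilities
# The direct product underflows to 0.0 in floating point:
product = math.prod(probs)                   # 0.0
# Summing logs keeps the value representable:
log_sum = sum(math.log(p) for p in probs)    # about -3453.9
print(product, log_sum)
```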
Evaluating Language Models
□ Extrinsic evaluation: embed the N-gram language model in an application and measure how much the application improves.
□ Intrinsic evaluation: measures the quality of a model independent of any application.
□ Perplexity
• Test data: you need a test set of sentences or text sequences on which to evaluate the model. These sentences are typically not seen during training.
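Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w_1 ... w_N)^(-1/N). A minimal sketch computing it from per-word log probabilities (the probability values are made up for illustration):

```python
import math

# Per-word probabilities assigned by some language model (illustrative values).
log_probs = [math.log(0.2), math.log(0.1), math.log(0.4)]

# PP(W) = P(w_1..w_N)^(-1/N) = exp(-(1/N) * sum of log P(w_i | context))
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(perplexity)  # lower perplexity = better model of the test data
```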
● Strength: n-gram models can generalize to some extent because they capture statistical patterns in the training data.
● For example, if a model has seen the bigram "hot coffee" frequently, it may generalize to score a combination like "iced tea" even if it hasn't seen that specific combination before.
Zeros
● Limitation: n-gram models may struggle with longer-range dependencies or with understanding the semantics of language.
● If an n-gram model encounters an n-gram that it has never seen during training, it assigns that n-gram a probability of zero, leading to issues with prediction accuracy.
● Common remedies:
▪ Laplace (add-one) smoothing
▪ Add-k smoothing
▪ Stupid backoff
▪ Kneser-Ney smoothing
Laplace Smoothing
• Laplace smoothing, also known as Add-One smoothing, is a straightforward
technique used in n-gram language models to address the problem of zero
probabilities for unseen n-grams.
• It's one of the simplest smoothing methods and is often used as a baseline for
smoothing in n-gram models.
• This ensures that no n-gram has a probability of zero, even if it was not observed in the training data.
Let's consider a bigram (2-gram) model as an example:
• Count the n-grams:
Count(w_1, w_2) is the number of times the bigram (w_1, w_2) occurs in the training data.
Count(w_1) is the number of times the unigram w_1 occurs in the training data.
• The Laplace-smoothed bigram probability is:
P(w_2 | w_1) = (Count(w_1, w_2) + 1) / (Count(w_1) + V)
• The "+1" in the numerator is the Laplace smoothing constant, added to each count. The "+V" in the denominator accounts for the vocabulary size V, so that the probabilities still sum to one.
Example of Laplace Smoothing
Counting n-grams:
Count("eat pizza"): 1 time
Count("eat"): 1 time
Vocabulary size (V): 7 unique words (I, like, to, eat, pizza, drink, soda)
P(pizza | eat) = (1 + 1) / (1 + 7) = 0.25
So the Laplace-smoothed probability of the word "pizza" following the word "eat" is 0.25.
Add-k Smoothing
❖ One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events.
□ Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?).
□ This algorithm is called add-k smoothing.
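Concretely (using the same notation as the Laplace example above, with smoothing constant k and vocabulary size V):
P(w_2 | w_1) = (Count(w_1, w_2) + k) / (Count(w_1) + k·V)
With k = 1 this reduces to add-one (Laplace) smoothing; in the laplace_prob sketch above, replacing the 1 with k and V with k * V gives an add-k version.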
Example of Backoff
• In a backoff model, we use the higher-order n-gram estimate when there is evidence for it, otherwise we back off to a lower-order estimate (trigram to bigram to unigram); a sketch of stupid backoff follows below.
• Adaptation: the choice of backoff weights can be adapted based on the specific data and the performance of the model.
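Below is a minimal sketch of stupid backoff (Brants et al., 2007), which scores a bigram by its relative frequency when it was seen and otherwise backs off to a weighted unigram frequency; the 0.4 weight is the commonly cited default, and the count tables are assumed to come from the estimation sketch earlier:

```python
# Stupid backoff: use the bigram relative frequency when the bigram was seen;
# otherwise back off to a discounted unigram relative frequency.
ALPHA = 0.4  # fixed backoff weight (common default)

def stupid_backoff(prev, cur, bigram_counts, unigram_counts, total_words):
    if bigram_counts[(prev, cur)] > 0:
        return bigram_counts[(prev, cur)] / unigram_counts[prev]
    # Note: these are scores, not true probabilities (they need not sum to 1).
    return ALPHA * unigram_counts[cur] / total_words
```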
Part-of-Speech (POS) Tagging
● Many words can have multiple POS tags depending on the context. For example, "lead" can be a verb ("He will lead the team") or a noun ("The pencil has a lead tip").
● Tagging Methods:
● POS tagging can be performed using rule-based methods, statistical
models, or deep learning techniques. Statistical models like Hidden
Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs)
were traditionally used, while deep learning methods like Recurrent
Neural Networks (RNNs) and Transformer-based models (e.g., BERT)
have shown significant improvements.
● Corpora and Training:
● To train a POS tagger, a large annotated corpus (a collection of text with
labeled parts of speech) is needed. This corpus is used to learn the
relationships between words and their corresponding POS tags.
● Challenges:
● Ambiguity, rare words, and complex sentence structures can pose
challenges for POS tagging systems. Handling languages with free
word order, like Latin or Finnish, can be particularly challenging.
● Evaluation:
● POS taggers are evaluated using metrics like accuracy, precision,
recall, and F1-score. These metrics compare the predicted tags to
the gold-standard tags in the annotated corpus.
● Applications:
○ Machine translation: Assisting in the translation of text from one language to another.
Part-of-Speech (POS) – How Many?
• Word Classes
In grammar, a part of speech or part-of-speech (POS) is also known as a word class or grammatical category: a category of words that have similar grammatical properties.
POS – Examples
• Common English word classes include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, determiners, and interjections.
Brill Tagger
● The BrillTagger class is a transformation-based tagger.
● Rule-based taggers use a dictionary or lexicon to get the possible tags for each word.
● If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct tag.
● The Brill tagger additionally uses a series of learned rules to correct the results of an initial tagger.
● The rules it follows are score-based: each rule's score equals the number of errors it corrects minus the number of new errors it produces.
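A minimal training sketch using NLTK's implementation (assumes NLTK and its treebank sample are installed; the split size and rule count are arbitrary):

```python
# Sketch: training a transformation-based (Brill) tagger with NLTK.
# Requires: import nltk; nltk.download('treebank')
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:3000]

# Initial tagger whose output the learned rules will correct.
initial = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Each candidate rule is scored by (errors corrected) - (new errors introduced).
trainer = BrillTaggerTrainer(initial, fntbl37(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print(brill_tagger.tag("He will lead the team".split()))
```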
Hidden Markov Model
• A Hidden Markov Model (HMM) is a probabilistic graphical
model used for modeling sequences of observations.
• It's widely used in various fields, including speech
recognition, natural language processing, bioinformatics,
finance, and more.
● Model Components:
● States (Hidden States): Each state in the HMM corresponds to a POS tag
(e.g., noun, verb, adjective).
● Observations (Emissions): Each state emits an observation (word) from
the vocabulary. These emissions are the actual words in the sentence.
● Transition Probabilities: The probability of transitioning from one POS tag
to another. For example, the probability of transitioning from a noun to a
verb.
● Emission Probabilities: The probability of emitting a particular word from a
specific POS tag. For example, the probability of the word "run" given the
POS tag "verb".
● Training:
● The HMM is trained on a labeled corpus of text, where each word is
annotated with its corresponding POS tag.
● The transition probabilities and emission probabilities are estimated
from the training data. This involves counting the occurrences of
transitions and emissions and normalizing them to obtain
probabilities.
● Decoding (Inference):
● Given a new, unlabeled sentence, the goal is to find the most likely
sequence of POS tags for that sentence.
● This is done using the Viterbi algorithm, which efficiently computes
the most likely sequence of hidden states (POS tags) given the
observed words.
● Example:
● Let's say we have the sentence: "The cat jumps."
● The corresponding sequence of POS tags could be: "DET NOUN
VERB."
● The HMM would compute the probability of this sequence and
compare it with other possible sequences to find the most likely one.
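A minimal Viterbi sketch for this example; all probabilities below are toy values chosen for illustration, not estimated from a corpus:

```python
# Viterbi decoding for a toy HMM over the sentence "the cat jumps".
words = ["the", "cat", "jumps"]
tags = ["DET", "NOUN", "VERB"]

start_p = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
trans_p = {
    "DET":  {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.4,  "NOUN": 0.3, "VERB": 0.3},
}
emit_p = {
    "DET":  {"the": 0.9},
    "NOUN": {"cat": 0.5},
    "VERB": {"jumps": 0.4},
}

# V[t][tag]: probability of the best tag sequence ending in `tag` at position t.
V = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
back = [{}]
for t in range(1, len(words)):
    V.append({})
    back.append({})
    for tag in tags:
        prev, p = max(((pv, V[t - 1][pv] * trans_p[pv][tag]) for pv in tags),
                      key=lambda x: x[1])
        V[t][tag] = p * emit_p[tag].get(words[t], 0.0)
        back[t][tag] = prev

# Trace back the most likely tag sequence.
seq = [max(V[-1], key=V[-1].get)]
for t in range(len(words) - 1, 0, -1):
    seq.append(back[t][seq[-1]])
print(list(reversed(seq)))  # ['DET', 'NOUN', 'VERB']
```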
● Handling Unknown Words:
● Since HMMs rely on emission probabilities, they may struggle with
out-of-vocabulary or rare words. Techniques like smoothing or using
a special token for unknown words can help mitigate this issue.
Information Extraction
● Named Entity Recognition (NER) is a natural language processing
(NLP) task that involves identifying and classifying named entities in
text into predefined categories such as names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc.
NAMED ENTITIES
● Named Entities:
● Named entities are specific entities in text that refer to people,
places, organizations, dates, times, and various other types of
entities with specific names.
● Categories of Named Entities:
● Common categories include:
○ Person (e.g., "John Doe")
○ Organization (e.g., "Google")
○ Location (e.g., "New York City")
○ Date (e.g., "January 1, 2020")
○ Time (e.g., "3:30 PM")
○ Quantity (e.g., "10 kilograms")
○ Money (e.g., "$100")
● Applications of NER:
● NER is a critical component in various NLP applications such as:
○ Information extraction
○ Document summarization
○ Machine translation
○ Question answering systems
○ Sentiment analysis
○ Entity linking (connecting entities to external knowledge bases)
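A minimal NER sketch with spaCy (assumes the en_core_web_sm model has been downloaded):

```python
# Minimal NER sketch; requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Doe joined Google in New York City on January 1, 2020.")

for ent in doc.ents:
    # Prints each entity with its label, e.g. ('Google', 'ORG')
    print(ent.text, ent.label_)
```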
● A hypernym is a term whose meaning includes the meanings of other words; it is a broad superordinate label that applies to many other members of a set. It describes the broader, more abstract term. For example, "dog" is a hypernym of "labrador" and "german shepherd".
● A hyponym is a more specialised and specific word; hyponymy is a hierarchical relationship which may consist of a number of levels. These are the more specific terms. For example, "dog" is a hyponym of "animal".
● WordNet-based similarity measures are techniques used to quantify the similarity or relatedness between words or concepts based on the information provided by WordNet. WordNet is a lexical database of the English language that organizes words into synsets (sets of synonyms) and provides relationships between them.
● Some common WordNet-based similarity measures:
○ Path Similarity
○ Leacock-Chodorow Similarity
○ Wu-Palmer Similarity
○ Resnik Similarity
○ etc.
Synset Hypernyms and Hyponyms
Wu-Palmer Similarity
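A minimal sketch of these ideas with NLTK's WordNet interface (assumes the wordnet corpus has been downloaded):

```python
# Synset hypernyms/hyponyms and WordNet-based similarity with NLTK.
# Requires: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")

print(dog.hypernyms())  # broader terms, e.g. canine.n.02
print(dog.hyponyms())   # more specific terms, e.g. puppy.n.01

print(dog.path_similarity(cat))  # shortest-path based, in (0, 1]
print(dog.lch_similarity(cat))   # Leacock-Chodorow
print(dog.wup_similarity(cat))   # Wu-Palmer: depth of the common ancestor
```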
Topic Modeling
Latent Semantic Analysis (LSA)
Assumptions of LSA:
▪ Words which are used in the same context are analogous to each other.
▪ The hidden semantic structure of the data is unclear due to the ambiguity of the words chosen.
Latent Semantic Analysis (LSA)
Singular Value Decomposition (SVD):
▪ LSA applies SVD to the term-document matrix X, factoring it as X ≈ U Σ Vᵀ and keeping only the top-k singular values to obtain a low-rank "topic" space.
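A minimal LSA sketch with scikit-learn (toy documents; the number of components is chosen arbitrarily):

```python
# LSA = TF-IDF term-document matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # docs x terms
lsa = TruncatedSVD(n_components=2, random_state=0)             # keep top-2 topics
doc_topics = lsa.fit_transform(X)                              # docs x topics

print(doc_topics.round(2))  # pet-related vs. market-related docs separate
```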
Case Study
https://towardsdatascience.com/latent-semantic-analysis-sentiment-classification-with-python-5f657346f6a3