Session 30: Final Exam

Type: Exam / Quiz / Project (live, in person)

Foundations of NLP and Historical overview


Natural Language: Refers to the human way of communicating, through text
and speech
Natural Language Processing (NLP): branch of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.
Semantic analysis: process of drawing meaning from text

Lexical Semantics: about individual words in context

Semantics: Study of meaning

General NLP Pipeline: Data Collection → Text Cleaning → Pre-processing → Feature Engineering → Modeling → Evaluation → Deployment → Monitoring and Model Updates

NLTK (Natural Language Toolkit): Python NLP library that contains scripts for statistical language processing.

Linguistic Structure of Speech

Speech or audio: Disturbance in the environment that can be represented as an acoustic signal

Written text: Categorical units (separated by whitespace)

Computational Linguistics: study of linguistics and the use of computer algorithms and models to process and analyze human speech

Generally concerned with analysis and processing (not conversations)

Speech Recognition

Machine translation

Sentiment analysis

Conversational AI: Developing and implementing AI models that can engage in natural language conversations with humans.

Uses NLP, speech recognition, and speech synthesis

Applications: Chatbots, virtual assistants, and voice-activated devices

Applies Computational Linguistics

Pipeline: Automatic speech recognition (speech to text) → NLP → Convert the output to speech (text to speech)

Speech AI: Automatic Speech Recognition (ASR) and Text-to-Speech conversion

NLP: Natural Language Understanding and Natural Language Generation
Preprocessing
Preprocessing pipeline:

Segmentation: Break down the entire document into its constituent sentences

Divides the text into its key components

Can be done using punctuation such as periods and/or commas

# Segmentation using NLTK
from nltk.tokenize import sent_tokenize

text = "Python is Great! Im using NLTK!"
segmented_text = sent_tokenize(text)
print(segmented_text)

# Output
['Python is Great!', 'Im using NLTK!']

Tokenization: break down sentence into its constituent words and store
them

The stars are twinkling at night → the, stars, are, twinkling, …

# Tokenization using NLTK
from nltk.tokenize import word_tokenize

text = "Python is Great! Im using NLTK!"


tokenized_words = word_tokenize(text)
print(tokenized_words)

# Output
['Python', 'is', 'Great', '!', 'Im', 'using', 'NLTK', '!']

Stop word removal: remove non-essential ‘filler’ words, which add little
meaning to our statement and are just there to make the statement sound
more cohesive

eg. the, are, at, of

# Remove stopwords using NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

sentence = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))  # a set makes membership checks fast
word_tokens = word_tokenize(sentence)         # we have to tokenize first

filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(filtered_sentence)

# Output
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

Stemming: obtaining the ‘stem’ of words. Word stems give new words when affixes are added to them

skip → skipping, skipped, skips

# Stemming in NLTK
from nltk.stem import PorterStemmer
ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning", "pythoned"]

for w in example_words:
    print(ps.stem(w))

# Output:
python
python
python
python

Lemmatization: Process of obtaining the root (lemma) of a word. The lemma is the base form of a word that is present in the dictionary and from which the word is derived.

Am, Are, is → Be (lemma)

# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = ["cats", "running", "better"]

lemma2 = [lemmatizer.lemmatize(word, pos="n") for word in words]
print(f"Lemmatized Words: {lemma2}")

Part of Speech Tagging: Tagging/attaching the concept of nouns, verbs, articles, and other parts of speech to the machine by adding these tags to our words.

Stars = noun, are = verb, etc.

Named Entity Tagging: introduce your machine to pop culture references
and everyday names by flagging names of movies, important personalities
or locations, etc.
Text Representation
Common words when representing text:

Corpus: All text data

Vocabulary: Set of unique tokens

Document: One single text record of the dataset (e.g., a sentence)

Word: Words present in vocabulary

Text Representation Techniques

One Hot Encoding: assigns 0 to all elements in a vector except for one,
which has a value of 1.

eg. “I love NLP course” → I → [1 0 0 0], love → [0 1 0 0], NLP → [0 0 1 0], course → [0 0 0 1]

Advantages: Easy to understand/interpret and implement

Disadvantages: Explosion in feature space, cannot determine word relationships, cannot measure word importance, memory and computationally expensive

# One-hot encoding
import numpy as np

# Define the sentences
sentences = [
    'The cat sat on the mat.',
    'The dog chased the cat.',
    'The mat was soft and fluffy.'
]

# Create a vocabulary set
vocab = set()
for sentence in sentences:
    words = sentence.lower().split()
    for word in words:
        vocab.add(word)

# Create a dictionary to map words to integers
word_to_int = {word: i for i, word in enumerate(vocab)}

# Create a binary vector for each word in each sentence
vectors = []
for sentence in sentences:
    words = sentence.lower().split()
    sentence_vectors = []
    for word in words:
        binary_vector = np.zeros(len(vocab))
        binary_vector[word_to_int[word]] = 1
        sentence_vectors.append(binary_vector)
    vectors.append(sentence_vectors)

# Print the one-hot encoded vectors for each word in each sentence
for i in range(len(sentences)):
    print(f"Sentence {i + 1}:")
    for j in range(len(vectors[i])):
        print(f"{sentences[i].split()[j]}: {vectors[i][j]}")

N-grams: Continuous sequence of words, symbols, or tokens in a document

eg. unigram: [‘I’, ‘live’, ‘in’], bigram: [’I live’, ‘live in’], trigram: [’I live in’]

Advantages: Better word relationship understanding, less memory, easier to compute, more flexible (number of grams)

Disadvantages: Cannot determine relationships between words that are far apart, no semantics (meaning), overfitting (bad on new data)

# n-grams using NLTK (here n = 4, a single 4-gram)
from nltk import ngrams

sentence = "I live in Madrid"
n = 4  # number of words in a sequence

grams = ngrams(sentence.split(), n)
for gram in grams:
    print(gram)

# Output
('I', 'live', 'in', 'Madrid')

Bag of Words: Text representation that describes the occurrence of words within a document (without order)

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist  # used to print the most common words
import string

corpus = [
    "The cat in the hat.",
    "The quick brown fox jumps over the lazy dog.",
    "A bird in hand is worth two in the bush."
]

def nltk_tokenize(text):
    stemmer = nltk.stem.SnowballStemmer('english')
    text = text.lower()
    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stemmer.stem(token)

tokens = [token for text in corpus for token in nltk_tokenize(text)]

fdict = FreqDist(tokens)

print("Most Common Words:")
for word, frequency in fdict.most_common():
    print(f"{word}: {frequency}")

# Output (first few lines):
Most Common Words:
the: 5
in: 3
cat: 1
hat: 1

Count Vectorizer: counts how many times each vocabulary token occurs in each document (see the sketch below)
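A minimal sketch of scikit-learn's CountVectorizer (the example documents are illustrative; get_feature_names_out assumes scikit-learn ≥ 1.0):

# Minimal CountVectorizer sketch: raw token counts per document
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)    # document-term matrix of raw counts
print(cv.get_feature_names_out())  # vocabulary (column order)
print(counts.toarray())            # one row of counts per document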

TF-IDF: Reflects the importance of a term within a document relative to its importance across all documents in the corpus. It helps in highlighting important terms while downplaying common terms.

Term Frequency: Measures how often a term occurs in a document.

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents.

Useful for information retrieval: search engines/media search (see the worked example below)
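Worked example (a sketch, not from the course notes): the classic formulation is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many documents contain term t; scikit-learn's TfidfVectorizer uses a smoothed variant of this.

# Hand-computed TF-IDF for a toy corpus (illustrative values)
import math

toy_docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(toy_docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)             # term frequency in this document
    df = sum(1 for d in toy_docs if term in d)  # number of documents containing the term
    return tf * math.log(N / df)

print(tf_idf("the", toy_docs[0]))  # 0.0 -> appears in every document, so no discriminative weight
print(tf_idf("cat", toy_docs[0]))  # ~0.135 -> appears in only some documents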

# Applying TF-IDF (reuses the `corpus` list defined above)
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)  # apply TF-IDF to the corpus
print(tf_idf_model.get_feature_names_out())         # learned vocabulary

Word Embeddings: Represent words as dense vectors, making sure similar words are closer to each other

Vector Space Models (VSM): represents documents and terms as
vectors in a multi-dimensional space. Each dimension corresponds to a
unique term in the entire corpus of documents

Document-Term Matrix: creates the vector representation of a collection of documents. Rows in this matrix represent documents, and columns represent terms (words or phrases).

Word2Vec: library for training word embedding models that represent words in a dense vector space based on their contextual relationships in a text corpus

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Sample dataset (text data)
text_data = [
    "natural language processing and machine learning are related fields",
    "word embeddings capture semantic meanings of words",
    "word2vec is a popular technique for generating word embeddings",
    "gensim is a Python library for topic modeling and word embeddings",
    "machine learning algorithms can be trained on word embeddings"
]

# Save the text data to a file (optional step)
with open('text_data.txt', 'w') as file:
    for line in text_data:
        file.write(line + '\n')

# Load text data from the file
sentences = LineSentence('text_data.txt')

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar_words = model.wv.most_similar('machine', topn=3)
print("Similar words to 'machine':", similar_words)

GloVe (Global Vectors for Word Representation): unsupervised learning algorithm for generating word embeddings by aggregating global word-word co-occurrence statistics from a corpus.

Dimensionality Reduction: PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are used to reduce the dimensionality of word embeddings, making them easier to visualize and interpret

PCA is used to reduce the number of dimensions, though usually not all the way down to 2

t-SNE is used to reduce the embeddings completely to 2 dimensions for visualization (see the sketch below)
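A minimal visualization sketch, assuming the gensim `model` trained in the Word2Vec example above (gensim 4.x API):

# Project 100-dimensional Word2Vec vectors down to 2D with PCA and plot them
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model.wv.index_to_key)[:20]             # a handful of vocabulary words
vectors = model.wv[words]                            # shape: (n_words, 100)

coords = PCA(n_components=2).fit_transform(vectors)  # 100 dims -> 2 dims

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()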

Document Similarity: Measures how much the meaning or content of two pieces of text are the same. Common methods to measure it (a cosine-similarity sketch follows this list):

Cosine similarity

Levenshtein distance

Jaccard index

Euclidean distance

Hamming distances

Word embeddings
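A minimal cosine-similarity sketch with scikit-learn (the documents are illustrative):

# Cosine similarity between TF-IDF vectors of documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply today."
]
tfidf = TfidfVectorizer().fit_transform(docs)

print(cosine_similarity(tfidf[0], tfidf[1]))  # related documents -> higher score
print(cosine_similarity(tfidf[0], tfidf[2]))  # unrelated documents -> score near 0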
Text Classification
Text Classification: involves categorizing and assigning predefined labels or
categories to text documents, sentences, or phrases based on their content.

Naive Bayes for classification: used for text categorization since the dimensionality of the data is frequently rather large

# Using Naive Bayes for classification
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)             # X_train/y_train: training features and labels
y_pred = classifier.predict(X_test)          # predict on test data
accuracy = classifier.score(X_test, y_test)  # check results

SVMs: find the hyperplane that maximally separates the data into different classes.

Decision Trees: work by recursively partitioning the feature space into smaller regions based on the values of the input features and assigning a label to each region based on the majority class of the training data in that region.

Neural Networks: can be used for various text classification tasks, including binary and multi-class classification

Logistic Regression: used for predicting binary outcomes, such as whether an email is spam or not spam

Random Forest: can be used for text classification by training numerous decision trees on different subsets of the feature space and then combining their predictions to make the final classification decision. (A small scikit-learn pipeline sketch follows.)
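A small scikit-learn pipeline sketch; `texts` and `labels` are hypothetical toy data, and any of the classifiers above could be swapped in for LogisticRegression:

# TF-IDF features + Logistic Regression for text classification
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["free money now", "meeting at 10am", "win a prize today", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split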
Sentiment Analysis
Sentiment Analysis: Form of Text classification - classifying a text into various
sentiments, such as positive or negative, Happy, Sad or Neutral, etc

Goal is to decipher the underlying mood, emotion, or sentiment of a text - also known as Opinion Mining

Sentiment Analysis process:

1. Text Preprocessing: Text data is cleaned by removing irrelevant information, such as special characters, punctuation, and stopwords

2. Tokenization: Text is divided into individual words or tokens to facilitate analysis

3. Feature extraction: Relevant features are extracted from the text, such as words, n-grams, or parts of speech

4. Sentiment Classification: Machine learning algorithms or pre-trained models are used to classify the sentiment of each passage.

   a. Generally using supervised learning

5. Post-processing: (not always done) - aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.

6. Evaluation: Assessed using metrics such as accuracy, precision, recall, or F1 score

Types of sentiment analysis:

Document-Level Sentiment Analysis: aims to classify the entire text as positive, negative, or neutral

Sentence-Level Sentiment Analysis: Provides a more granular understanding of sentiment expressed in different text parts

Aspect-based Sentiment Analysis: Focuses on identifying and extracting the sentiment associated with specific aspects or entities mentioned in the text (e.g., features of a product)

Entity-level Sentiment Analysis: identifies the sentiment expressed towards specific entities or targets mentioned in the text, such as people, companies, or products.

Comparative Sentiment Analysis: comparing the sentiment between different entities or aspects mentioned in the text.

Use Cases:

Monitoring for brand Management

Product/service analysis

Stock price prediction

Pre-trained Sentiment Analyzer:

Text Blob: Python library. Takes text input and can return polarity and
subjectivity as outputs

Polarity: Sentiment of the text, in [-1, 1]

Subjectivity: Determines whether the text is factual information or personal opinion, in [0, 1]: 0 = factual, 1 = opinion

# TextBlob
from textblob import TextBlob

text1 = "The movie was very awesome."
text2 = "The food here tastes terrible."

# Finding the polarity (sentiment)
p1 = TextBlob(text1).sentiment.polarity
p2 = TextBlob(text2).sentiment.polarity

# Finding the subjectivity
s1 = TextBlob(text1).sentiment.subjectivity
s2 = TextBlob(text2).sentiment.subjectivity

Vader (Valence Aware Dictionary and Sentiment Reasoner): NLTK pretrained analyzer - results come quicker than with other analyzers. Best suited for social media (short sentences) with slang or abbreviations. Less accurate with long text.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()

text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."

sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)

print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Bag of Words Vectorization-based Models:

Preprocess: Normalization, tokenization, stopword removal, stemming/lemmatization

Create a bag of words for the processed text using Count Vectorization or TF-IDF Vectorization

Train a classification model

import pandas as pd
data = pd.read_csv('Finance_data.csv')  # assumes the dataset has 'Sentence' and 'Sentiment' columns

# Pre-processing and bag-of-words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')  # to remove special characters

cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Sentence'])

# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['Sentiment'], test_size=0.25, random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
print("Accuracy Score: ", accuracy_score)

LSTM-based Models: feed sequences of word embeddings through recurrent LSTM layers before a classification layer (see the Deep Learning for NLP section below)

Transformer-based models: fine-tune pre-trained transformer models such as BERT for sentiment classification (see the Deep Learning for NLP section below)
Regular Expressions (Regex)
Regex: powerful tool for matching patterns in text, allowing for complex
search, replace, and parsing operations on strings.

Basic Regex Syntax:

Literals: Match the exact characters in the pattern.

Metacharacters: Symbols that have special meaning: . (any character), ^ (start of a string/line), $ (end of a string), ? (the previous character or nothing).

Character Classes: Defined by square brackets [] , match any one character within the brackets.

Predefined Character Classes: Such as \d for any digit, \w for any word character, and \s for any whitespace.

Quantifiers: Specify the number of occurrences, * (0 or more), + (1 or more), {n} (exactly n times).

Alternation: The pipe | allows for matching one pattern or another.

Groups: Parentheses () group patterns and capture their matches for later use.

Advanced Regex Features:

Non-Capturing Groups: (?:...) groups items without capturing.

Positive and Negative Lookahead: (?=...) and (?!...) assert what is or isn't following the current position, without including it in the match.

Backreferences: \1 , \2 , etc., refer back to captured groups.

Flags: Modify the behavior of the regex, such as i for case-insensitive matching.

Regex in NLP Tasks:

Tokenization: Splitting text into words or tokens.

Text Cleaning: Removing unwanted characters or formatting.

Information Extraction: Identifying and extracting entities, like dates or phone numbers.

Text Validation: Checking if strings conform to specific patterns.

Regex in Python with the re Library:

Basic Functions (a short sketch follows this list):

re.match() : Checks if the regex matches at the beginning of the string.

re.search() : Searches for a pattern anywhere in the string; returns the first occurrence.

re.findall() : Finds all non-overlapping matches of the pattern in the string.

re.finditer() : Similar to findall() , but returns an iterator over match objects.

re.sub() : Replaces occurrences of the regex pattern in the string with a replacement string.

re.split() : Splits the string by occurrences of the pattern.
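A short sketch of the basic re functions (the text and patterns are illustrative):

import re

text = "Order #123 shipped on 2024-05-01; order #456 pending."

print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())  # first date: '2024-05-01'
print(re.findall(r"#(\d+)", text))                    # all order numbers: ['123', '456']
print(re.sub(r"\d", "X", text))                       # mask every digit
print(re.split(r";\s*", text))                        # split on semicolons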

Compiling Expressions:

re.compile() : Compiles a regex pattern into a regex object for reuse.

Match Object Methods:

match.group() : Returns the string matched by the regex.

match.start() : Returns the starting position of the match.

match.end() : Returns the ending position of the match.

match.groups() : Returns a tuple containing all the subgroups of the match.

match.groupdict() : Returns a dictionary containing all named subgroups of the match, keyed by the subgroup name.

Pattern Syntax:

Raw strings: r"pattern" tells Python not to handle backslashes in any special way, which is helpful since regex uses a lot of backslashes.

Escape sequences:

\d : Match any digit, shorthand for [0-9] .

\D : Match any non-digit, shorthand for [^0-9] .

\s : Match any whitespace character (space, tab, newline).

\S : Match any non-whitespace character.

\w : Match any word character (letter, digit, underscore).

\W : Match any non-word character.

\b : Match a word boundary (the position between a word and a non-word character).

\B : Match a non-word boundary.

Quantifiers:

* : Match 0 or more repetitions of the preceding element.

+ : Match 1 or more repetitions of the preceding element.

? : Match 0 or 1 repetition of the preceding element (makes it optional).

{m} : Match exactly m occurrences of the preceding element.

{m,n} : Match between m and n occurrences of the preceding element.

Anchors:

^ : Match the start of the string.

$ : Match the end of the string.

\A : Match the beginning of the text, similar to ^ but unaffected by multiline mode.

\Z : Match the very end of the text, similar to $ but unaffected by multiline mode.

Groups:

() : Capture the matched text, accessible later.

(?:) : Group without capturing the matched text.

(?P<name>) : Create a named group that can be accessed by the given name (see the sketch below).
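A short sketch tying together re.compile(), named groups, and the match-object methods above (pattern and text are illustrative):

import re

pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{4})")  # named groups 'area' and 'number'

m = pattern.search("Call 555-1234 today.")
if m:
    print(m.group())           # '555-1234'
    print(m.group("area"))     # '555'
    print(m.groups())          # ('555', '1234')
    print(m.groupdict())       # {'area': '555', 'number': '1234'}
    print(m.start(), m.end())  # position of the match within the string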
Part of Speech Tagging (POS) + Named Entity Recognition (NER)
Part of speech tagging (POS): Giving each word in a text a grammatical
category (nouns, verbs, adjectives etc)

Words may have multiple possible POS tags based on context

Tag set: predefined collection of tags that represent the grammatical categories (e.g. the Penn Treebank tag set)

Methods of POS tagging:

Rule-based tagging: words are assigned parts of speech based on specific rules about their usage and context, rather than relying on statistical models or machine learning.

eg. If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.

Statistical Tagging: statistical models trained on large annotated corpora to predict POS tags

Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) are common statistical approaches.

Machine Learning-based tagging: ML algorithms such as decision trees, support vector machines, or neural networks learn patterns from data.

Deep learning-based tagging: Deep learning models, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), are employed for POS tagging, capturing complex contextual dependencies.

POS Challenges:

Ambiguity: Words often have multiple possible POS tags based on context

Context dependency: POS tags can depend on the surrounding words, making accurate tagging sensitive to context.

Out-of-vocabulary words: Handling words not seen during training is a challenge.

POS Applications:

Syntactic parsing: Understanding the grammatical structure of sentences

Named Entity Recognition: Identifying and classifying entities (e.g., persons, organizations, locations) in text.

Information retrieval: Improving search and retrieval of relevant documents or information.

Text summarization: Generating concise summaries of longer texts.

Machine Translation: Translating text from one language to another

# POS Tagging
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate over tokens in a Doc
for token in doc:
    print(token.text, token.pos_)

# Output:
I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

Context Free Grammar (CFG): list of rules that define the set of all well-formed
sentences in a language
Goals of Context Free Grammar:

Permit ambiguity: Ensure that a sentence has all its possible parses. eg.
Fruit flies like an apple

Limit ungrammaticality: require agreement in number, tense, gender, person

Ensure meaningfulness: disallow e.g. "the apple eats the giraffe"

# Context-free grammar in NLTK
import nltk

# Define the grammar first
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "Man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

# Parse a sentence and print its tree
sent = "the Man walked in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    tree.pretty_print()

Recursive grammar: type of grammar that allows rules to call themselves directly or indirectly, enabling the description of constructs like nested parentheses or hierarchical structures in language.

For every word in a text, we can look up in our grammar what category it belongs to.

Dependencies and Dependency Grammar: binary asymmetric relation that holds between a head and its dependents

Head: usually taken to be the tensed verb, and every other word is either dependent on the sentence head or connects to it through a path of dependencies.
Deep Learning for NLP
Retraining: training a model from scratch: the weights are randomly initialized,
and the training starts without any prior knowledge.
Fine-tuning/Transfer Learning: First acquire a pretrained language model,
then perform additional training with a dataset specific to your task.

Recurrent Neural Networks (RNNs): Form of NNs that are uniquely made to
handle text classification tasks efficiently.

Uniquely able to capture sequential dependencies in data - like language

Good at evaluating the contextual links between words in NLP text classification - helps them identify patterns and semantics

Great for creating complex models for tasks like Document classification,
spam detection, and sentiment analysis

Types of RNN architectures:

Many to One RNN: Many inputs (Tx) are used to give one output (Ty) → eg. classification task

One to Many RNN: generates a series of output values based on a single input value → eg. music generation

Many to Many RNN: Many inputs (Tx) are used to give many outputs (Ty) → eg. machine translation

RNN limitations

Only able to capture dependencies in one direction of language

Not good at capturing long term dependencies → short terms are easy

Vanishing gradients problem → updated weights don’t really change

Gated Recurrent Unit (GRU): Extension of the RNN - helps to capture long-range dependencies and helps a lot in fixing the vanishing gradient problem.

Apart from the usual neural unit with sigmoid function and softmax for output, it contains an additional unit with tanh as an activation function

Combines long and short term memory into its hidden state

Faster and more efficient than LSTM

Has two gates:

Update gate: Knows how much past memory to retain

Reset gate: Knows how much past memory to forget

Long Short-Term Memory (LSTM): Specialized type of RNN - designed to handle sequences of data with better control over the gradient flow and maintenance of state over long sequences.

Unique gating mechanisms make them more effective for tasks involving long or complex sequences where context from the distant past is important

LSTMs are typically better than GRUs at capturing long-term dependencies due to their more complex gating mechanisms. (A minimal Keras sketch follows.)
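A minimal Keras sketch of an LSTM text classifier; the vocabulary size, dimensions, and binary output are illustrative assumptions, not values from the course:

# Embedding -> LSTM -> sigmoid output for binary text classification
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000  # assumed vocabulary size after tokenization
embed_dim = 64      # assumed embedding dimension

model = keras.Sequential([
    keras.Input(shape=(None,)),             # variable-length sequences of token ids
    layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    layers.LSTM(64),                        # swap in layers.GRU(64) for the GRU variant
    layers.Dense(1, activation="sigmoid"),  # binary (e.g. positive/negative) output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()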

Transformers: Architecture with self-attention mechanisms. Can capture intricate patterns in data and are behind some of the most advanced language models available.

Unlike RNNs, Transformers process the entire input at once

The attention mechanism provides context for any position in the input sequence

Encoder: Receives input and builds a representation of it (features)

The model is optimized to acquire understanding from the input

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        else:
            padding_mask = None
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

Decoder: Uses the encoder's representation (features) along with other inputs to generate a target sequence.

The model is optimized for generating outputs

An extra attention block is inserted between the self-attention block applied to the target sequence and the dense layers of the exit block.

The added attention block saves the weights in the decoder, so you can switch between translation directions without retraining another model to translate the other way

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = None

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

Self-attention: differentially weighting the significance of each part of the input data.

A smart embedding space which provides a different vector representation for a word depending on the other words surrounding it.

Context-aware token representations: modulate the representation of a token by using the representations of related tokens in the sequence.

Take a query q and find the most similar key k by doing a dot product of q and k (a small NumPy sketch follows this list).

Query vector: the current word - a reference sequence that describes something you're looking for

Key vector: indexing mechanism for the value vector. Like a hash-map key-value pair - describes the value in a format that can be readily compared to a query.

Value vector: Information in the input vector - the body of knowledge that you're trying to extract information from
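A small NumPy sketch of scaled dot-product attention over toy Q/K/V matrices (shapes are illustrative; real models use learned projections):

# attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8                 # toy sequence length and key dimension
Q = np.random.rand(seq_len, d_k)    # queries (one per position)
K = np.random.rand(seq_len, d_k)    # keys
V = np.random.rand(seq_len, d_k)    # values

scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query with each key
weights = softmax(scores, axis=-1)  # attention weights per position (rows sum to 1)
output = weights @ V                # context-aware representation for each position
print(output.shape)                 # (4, 8)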

Multi-Head Attention: extension of self-attention. It splits the attention mechanism into multiple "heads," allowing the model to simultaneously attend to different parts of the sequence from different representational spaces.

The model can capture various types of relationships and nuances in the data at the same time, providing a richer understanding of the context.

num_heads = 4
embed_dim = 256
mha_layer = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
output = mha_layer(inputs, inputs, inputs)  # query, value, key: the same sequence for self-attention

Positional Encoding: Added to each word embedding to give the model information about word order

Positional vector: represents the position of the word in the current sentence

Adds a "position" axis to the input vector

Sentence length needs to be known in advance

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

Encoder-only models: For tasks that require understanding of the input. eg.
Classification and NER

eg. BERT is encoder only

Decoder-only models: For generative tasks. eg. Text generation


Encoder-decoder models (Sequence-to-sequence): For generative tasks that
require input. eg. Translation or Summarization.

BERT (Bidirectional Encoder Representations from Transformers): Transformer-based; uses only the encoder part of the Transformer.

It is fine-tuned with additional output layers to perform a wide range of specific language tasks like question answering, sentiment analysis, and entity recognition.

GPT (Generative Pretrained Transformer): Transformer based, primarily its
decoder component, to generate human-like text based on the input it
receives.

GPT is designed to generate text and is trained to predict the next word in a sentence given all the previous words (a short Hugging Face pipeline sketch follows).
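A short sketch using the Hugging Face transformers pipeline API to contrast the two; the model names are common public checkpoints, not ones specified in the course notes:

from transformers import pipeline

# Encoder-only (BERT-style): masked-word prediction, i.e. understanding context
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("NLP is a branch of artificial [MASK].")[0]["token_str"])

# Decoder-only (GPT-style): left-to-right text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20)[0]["generated_text"])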

Question Answering: Sub-field of NLP, developing models that can automatically answer questions in natural language.

Q-A Process:

1. Data collection and processing: Large corpus, clean and format the data

2. Information Retrieval: Keyword search, Text classification, NER

3. Question Analysis: POS tagging, Dependency parsing, NER

4. Answer Generation: Text generation, Summarization

5. Model Training: Supervised/unsupervised learning

6. Model Evaluation: Precision, Recall, F1

Types of QA systems

Information Retrieval-based QA: automatically answering questions by searching for relevant documents or passages that contain the answer.

Uses keyword or semantic search

Performance can be limited by the quality and relevance of the indexed text and the effectiveness of the retrieval and extraction methods

Can be used with other types of QA, like knowledge-based or generative QA

Knowledge-Based QA: answers questions using a knowledge base, such as a database or ontology, to retrieve the relevant information.

Focused on searching a structured knowledge base (eg. JSON)

Generally more accurate and reliable than other QA approaches, because it is based on structured and well-curated knowledge

Performance can be limited by how well the knowledge base covers the domain and how well the methods used to query and retrieve information from it work

Can also be used with other types of QA

Generative QA: automatically answers questions using a generative model, such as a neural network, to generate a natural language answer to a given question.

Based on the idea that a machine can be taught to understand and create text in natural language to provide an answer that is correct in terms of grammar and meaning

Limited by the training data's quality and diversity and the model's complexity.

Often used with other QA approaches, such as information retrieval-based or knowledge-based QA

Hybrid QA: automatically answers questions by combining multiple QA approaches, such as information retrieval-based, knowledge-based, and generative QA.

Considered more robust and accurate than a single QA approach

Can be built for a specific domain or as a general-purpose system

Rule-based QA: answers questions using a predefined set of rules based on keywords or patterns in the question.

Based on the idea that many questions can be answered by matching the question to a set of predefined rules or templates.

More prone to errors and can only handle questions covered by the predefined rules.

Build a QA system using Hugging Face:

1. Import Libraries: PyTorch, pre-trained transformers, BertForQuestionAnswering, BertTokenizer

2. Extract Pre-trained Model:

model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

3. Create Testing Sample: Use a string that holds the context or knowledge from which the model should return the required answer.

answer_text = "Obama's last name is Care"
question = "What is Obama's last name?"

4. Tokenize the answer_text and question: using BertTokenizer

input_ids = tokenizer.encode(question, answer_text)

5. Create an Attention Mask: a sequence of 1s and 0s indicating which tokens in the input_ids sequence should be attended to by the model.

attention_mask = [1] * len(input_ids)

6. Obtain the Model's output: the first part contains the logits for the start token index, and the second part contains the logits for the end token index

output = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))

7. Define the Start and End Indexes: determine the start and end indices of the answer in the input_ids sequence

start_index = torch.argmax(output[0])  # (simplified) index of the highest start-token logit
end_index = torch.argmax(output[1])    # (simplified) index of the highest end-token logit

8. Decode the Final Answer: revert the tokenization process for us to understand the value of the final answer

answer = tokenizer.decode(input_ids[start_index:end_index + 1])
print("Answer:", answer)

9. Output: “Care”

QA Task Variants:
Definitions

Closed book: must rely solely on what it has been pre-trained with

Open Book: has access to external information sources

Extractive QA: involves identifying the exact span of text within a given
document that answers a question.

Abstractive QA: Instead of extracting direct answers, this task involves generating a new text passage that answers the question, potentially synthesizing information from multiple sources or paraphrasing.

Applied Tasks

Retriever-Reader: For extractive QA, the 'retriever' component first selects the relevant documents or passages from a larger corpus, and then the 'reader' component reads those texts to find the exact answer.

Retriever-Generator: For abstractive QA, the 'retriever' again fetches relevant documents, but the 'generator' component synthesizes information from the retrieved texts to construct a coherent answer.

Generator: In the closed book setting for abstractive QA, the 'generator'
relies only on its internal knowledge (what it has learned during pre-
training) to generate an answer without looking up any external resources.

Common QA Models:

BERT (Bidirectional Encoder Representations from Transformers): transformer-based machine learning model for NLP pre-training developed by Google.

Bidirectional context understanding; pre-trained on a large corpus and fine-tuned for specific tasks.

Is not generative; it's designed for understanding context.

GPT (Generative Pre-Trained Transformer): generative model that uses the decoder of the Transformer to produce text.

Trained to predict the next word in a sentence; can generate coherent text passages.

GPT generates text based on previous context in a unidirectional manner (not bidirectional like BERT)

T5 (Text-to-Text Transfer Transformer): treats every language problem as a text-to-text problem.

Unified framework for many NLP tasks.

Good for translation, question answering, and summarization.

Built on the Transformer but reframes all NLP tasks as converting one type of text to another.

RoBERTa (Robustly Optimized BERT Approach): An optimized version of BERT designed to provide better performance.

Trained with more data, larger batches, and longer sequences.

Removes BERT's next-sentence-prediction objective.

A refined version of BERT, aimed at improving the pre-training process to enhance performance.

ALBERT (A Lite BERT): lighter version of BERT with fewer parameters, aimed at increasing training speed and reducing model size.

Can be applied to the same tasks as BERT but with greater efficiency.

More efficient and scalable version of BERT, sacrificing some model capacity for greater speed and lower memory usage.