Session 30: Final Exam

Type: Exam / Quiz / Project (live, in person)

Foundations of NLP and Historical overview


Natural Language: Refers to the human way of communicating, through text
and speech
Natural Language Processing (NLP): branch of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.
Semantic analysis: process of drawing meaning from text

Lexical Semantics: about individual words in context

Semantics: Study of meaning

General NLP Pipeline: Data Collection → Text Cleaning → Pre-processing → Feature Engineering → Modeling → Evaluation → Deployment → Monitoring and Model Updates

NLTK (Natural Language Toolkit): Python NLP library that contains scripts for statistical language processing.

Linguistic Structure of Speech

Speech or audio: Disturbance in the environment that can be represented as an acoustic signal

Written text: Categorical units (separated by whitespace)

Computational Linguistics: study of linguistics and the use of computer algorithms and models to process and analyze human speech

Generally concerned with analysis and processing (not conversations)

Speech Recognition

Machine translation

Sentiment analysis

Conversational AI: Developing and implementing AI models that can engage in natural language conversations with humans.

Uses NLP, speech recognition, and speech synthesis

Applications: Chatbots, virtual assistants, and voice-activated devices

Applies Computational Linguistics

Pipeline: Automatic speech recognition (speech to text) → NLP → Convert the output to speech (text to speech)

Speech AI: Automatic Speech Recognition (ASR) and Text-to-Speech conversion

NLP: Natural Language Understanding and Natural Language Generation
Preprocessing
Preprocessing pipeline:

Segmentation: Break down the entire document into its constituent sentences

Divides the text into its key components

Can be done using punctuation such as periods and/or commas

# Segmentation using NLTK
from nltk.tokenize import sent_tokenize

text = "Python is Great! Im using NLTK!"
segmented_text = sent_tokenize(text)
print(segmented_text)

# Output
['Python is Great!', 'Im using NLTK!']

Tokenization: break down sentence into its constituent words and store
them

The stars are twinkling at night → the, stars, are, twinkling, …

# Tokenization using NLTK
from nltk.tokenize import word_tokenize

text = "Python is Great! Im using NLTK!"


tokenized_words = word_tokenize(text)
print(tokenized_words)

# Output
['Python', 'is', 'Great', '!', 'Im', 'using', 'NLTK', '!']

Stop word removal: remove non-essential ‘filler’ words, which add little
meaning to our statement and are just there to make the statement sound
more cohesive

eg. the, are, at, of

# Remove stopwords using NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

sentence = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))  # a set makes membership checks fast
word_tokens = word_tokenize(sentence)         # we have to tokenize first

filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(filtered_sentence)

# Output
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

Stemming: obtaining the ‘stem’ of words. Word stems give new words when affixes are added to them

skip → skipping, skipped, skips

# Stemming in NLTK
from nltk.stem import PorterStemmer
ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning", "pythoned"]

for w in example_words:
    print(ps.stem(w))

# Output:
python
python
python
python

Lemmatization: Process of obtaining the root (lemma) of a word. The lemma is the base form of a word that is present in the dictionary and from which the word is derived.

Am, Are, is → Be (lemma)

# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = ["cats", "running", "better"]

lemma2 = [lemmatizer.lemmatize(word, pos="n") for word in words]
print(f"Lemmatized Words: {lemma2}")

Part of Speech Tagging: Tagging/attaching the concept of nouns, verbs, articles, and other parts of speech to the machine by adding these tags to our words.

Stars = noun, are = verb, etc.

Named Entity Tagging: introduce your machine to pop culture references
and everyday names by flagging names of movies, important personalities
or locations, etc.
Text Representation
Common words when representing text:

Corpus: All text data

Vocabulary: Set of unique tokens

Document: One single text record of the dataset (e.g., a sentence)

Word: Words present in vocabulary

Text Representation Techniques

One Hot Encoding: assigns 0 to all elements in a vector except for one,
which has a value of 1.

eg. “I love NLP course” → I → [1 0 0 0], love → [0 1 0 0], NLP → [0 0 1 0], course → [0 0 0 1]

Advantages: Easy to understand/interpret and implement

Disadvantages: Explosion in feature space, cannot determine word relationships, cannot measure word importance, memory and computationally expensive

# One-hot encoding
import numpy as np

# Define the sentences
sentences = [
    'The cat sat on the mat.',
    'The dog chased the cat.',
    'The mat was soft and fluffy.'
]

# Create a vocabulary set
vocab = set()
for sentence in sentences:
    words = sentence.lower().split()
    for word in words:
        vocab.add(word)

# Create a dictionary to map words to integers
word_to_int = {word: i for i, word in enumerate(vocab)}

# Create a binary vector for each word in each sentence
vectors = []
for sentence in sentences:
    words = sentence.lower().split()
    sentence_vectors = []
    for word in words:
        binary_vector = np.zeros(len(vocab))
        binary_vector[word_to_int[word]] = 1
        sentence_vectors.append(binary_vector)
    vectors.append(sentence_vectors)

# Print the one-hot encoded vectors for each word in each sentence
for i in range(len(sentences)):
    print(f"Sentence {i + 1}:")
    for j in range(len(vectors[i])):
        print(f"{sentences[i].split()[j]}: {vectors[i][j]}")

N-grams: Continuous sequence of words, symbols, or tokens in a document

eg. unigram: [‘I’, ‘live’, ‘in’], bigram: [’I live’, ‘live in’], trigram: [’I live in’]

Advantages: Better word relationship understanding, less memory, easier to compute, more flexible (number of grams)

Disadvantages: Cannot determine relationships between words that are far apart, no semantics (meaning), overfitting (bad on new data)

# n-grams using NLTK (here n = 4, a single 4-gram)
from nltk import ngrams

sentence = "I live in Madrid"
n = 4  # number of words in a sequence

grams = ngrams(sentence.split(), n)
for gram in grams:
    print(gram)

# Output
('I', 'live', 'in', 'Madrid')

Bag of Words: Text representation that describes the occurrence of words within a document (without order)

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist  # used to print the most common words
import string

corpus = [
    "The cat in the hat.",
    "The quick brown fox jumps over the lazy dog.",
    "A bird in hand is worth two in the bush."
]

def nltk_tokenize(text):
    stemmer = nltk.stem.SnowballStemmer('english')
    text = text.lower()
    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stemmer.stem(token)

tokens = [token for text in corpus for token in nltk_tokenize(text)]

fdict = FreqDist(tokens)

print("Most Common Words:")
for word, frequency in fdict.most_common():
    print(f"{word}: {frequency}")

# Output (first few lines):
Most Common Words:
the: 5
in: 3
cat: 1
hat: 1

Count Vectorizer: counts how many times each vocabulary token occurs in each document (see the sketch below)
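A minimal sketch of scikit-learn's CountVectorizer (the example documents are illustrative; get_feature_names_out assumes scikit-learn ≥ 1.0):

# Minimal CountVectorizer sketch: raw token counts per document
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)    # document-term matrix of raw counts
print(cv.get_feature_names_out())  # vocabulary (column order)
print(counts.toarray())            # one row of counts per document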

TF-IDF: Reflects the importance of a term within a document relative to its importance across all documents in the corpus. It helps in highlighting important terms while downplaying common terms.

Term Frequency: Measures how often a term occurs in a document.

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents.

Useful for information retrieval: search engines/media search (see the worked example below)
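Worked example (a sketch, not from the course notes): the classic formulation is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many documents contain term t; scikit-learn's TfidfVectorizer uses a smoothed variant of this.

# Hand-computed TF-IDF for a toy corpus (illustrative values)
import math

toy_docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(toy_docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)             # term frequency in this document
    df = sum(1 for d in toy_docs if term in d)  # number of documents containing the term
    return tf * math.log(N / df)

print(tf_idf("the", toy_docs[0]))  # 0.0 -> appears in every document, so no discriminative weight
print(tf_idf("cat", toy_docs[0]))  # ~0.135 -> appears in only some documents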

# Applying TF-IDF (reuses the `corpus` list defined above)
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)  # apply TF-IDF to the corpus
print(tf_idf_model.get_feature_names_out())         # learned vocabulary

Word Embeddings: Represent words as dense vectors, making sure similar words are closer to each other

Vector Space Models (VSM): represents documents and terms as
vectors in a multi-dimensional space. Each dimension corresponds to a
unique term in the entire corpus of documents

Document-Term Matrix: creates the vector representation of a collection of documents. Rows in this matrix represent documents, and columns represent terms (words or phrases).

Word2Vec: library for training word embedding models that represent words in a dense vector space based on their contextual relationships in a text corpus

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Sample dataset (text data)
text_data = [
    "natural language processing and machine learning are related fields",
    "word embeddings capture semantic meanings of words",
    "word2vec is a popular technique for generating word embeddings",
    "gensim is a Python library for topic modeling and word embeddings",
    "machine learning algorithms can be trained on word embeddings"
]

# Save the text data to a file (optional step)
with open('text_data.txt', 'w') as file:
    for line in text_data:
        file.write(line + '\n')

# Load text data from the file
sentences = LineSentence('text_data.txt')

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar_words = model.wv.most_similar('machine', topn=3)
print("Similar words to 'machine':", similar_words)

GloVe (Global Vectors for Word Representation): unsupervised learning algorithm for generating word embeddings by aggregating global word-word co-occurrence statistics from a corpus.

Dimensionality Reduction: PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are used to reduce the dimensionality of word embeddings, making them easier to visualize and interpret

PCA is used to reduce the number of dimensions, though usually not all the way down to 2

t-SNE is used to reduce the embeddings completely to 2 dimensions for visualization (see the sketch below)
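A minimal visualization sketch, assuming the gensim `model` trained in the Word2Vec example above (gensim 4.x API):

# Project 100-dimensional Word2Vec vectors down to 2D with PCA and plot them
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model.wv.index_to_key)[:20]             # a handful of vocabulary words
vectors = model.wv[words]                            # shape: (n_words, 100)

coords = PCA(n_components=2).fit_transform(vectors)  # 100 dims -> 2 dims

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()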

Document Similarity: Measures how much the meaning or content of two pieces of text are the same. Common methods to measure it (a cosine-similarity sketch follows this list):

Cosine similarity

Levenshtein distance

Jaccard index

Euclidean distance

Hamming distances

Word embeddings
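A minimal cosine-similarity sketch with scikit-learn (the documents are illustrative):

# Cosine similarity between TF-IDF vectors of documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply today."
]
tfidf = TfidfVectorizer().fit_transform(docs)

print(cosine_similarity(tfidf[0], tfidf[1]))  # related documents -> higher score
print(cosine_similarity(tfidf[0], tfidf[2]))  # unrelated documents -> score near 0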
Text Classification
Text Classification: involves categorizing and assigning predefined labels or
categories to text documents, sentences, or phrases based on their content.

Naive Bayes for classification: used for text categorization since the dimensionality of the data is frequently rather large

# Using Naive Bayes for classification
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)             # X_train/y_train: training features and labels
y_pred = classifier.predict(X_test)          # predict on test data
accuracy = classifier.score(X_test, y_test)  # check results

SVMs: find the hyperplane that maximally separates the data into different classes.

Decision Trees: work by recursively partitioning the feature space into smaller regions based on the values of the input features and assigning a label to each region based on the majority class of the training data in that region.

Neural Networks: can be used for various text classification tasks, including binary and multi-class classification

Logistic Regression: used for predicting binary outcomes, such as whether an email is spam or not spam

Random Forest: can be used for text classification by training numerous decision trees on different subsets of the feature space and then combining their predictions to make the final classification decision. (A small scikit-learn pipeline sketch follows.)
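A small scikit-learn pipeline sketch; `texts` and `labels` are hypothetical toy data, and any of the classifiers above could be swapped in for LogisticRegression:

# TF-IDF features + Logistic Regression for text classification
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["free money now", "meeting at 10am", "win a prize today", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split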
Sentiment Analysis
Sentiment Analysis: Form of Text classification - classifying a text into various
sentiments, such as positive or negative, Happy, Sad or Neutral, etc

Goal is to decipher the underlying mood, emotion, or sentiment of a text - also known as Opinion Mining

Sentiment Analysis process:

1. Text Preprocessing: Text data is cleaned by removing irrelevant information, such as special characters, punctuation, and stopwords

2. Tokenization: Text is divided into individual words or tokens to facilitate analysis

3. Feature extraction: Relevant features are extracted from the text, such as words, n-grams, or parts of speech

4. Sentiment Classification: Machine learning algorithms or pre-trained models are used to classify the sentiment of each passage.

   a. Generally using supervised learning

5. Post-processing: (not always done) - aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.

6. Evaluation: Assessed using metrics such as accuracy, precision, recall, or F1 score

Types of sentiment analysis:

Document-Level Sentiment Analysis: aims to classify the entire text as positive, negative, or neutral

Sentence-Level Sentiment Analysis: Provides a more granular understanding of sentiment expressed in different text parts

Aspect-based Sentiment Analysis: Focuses on identifying and extracting the sentiment associated with specific aspects or entities mentioned in the text (e.g., features of a product)

Entity-level Sentiment Analysis: identifies the sentiment expressed towards specific entities or targets mentioned in the text, such as people, companies, or products.

Comparative Sentiment Analysis: comparing the sentiment between different entities or aspects mentioned in the text.

Use Cases:

Monitoring for brand Management

Product/service analysis

Stock price prediction

Pre-trained Sentiment Analyzer:

Text Blob: Python library. Takes text input and can return polarity and
subjectivity as outputs

Polarity: Sentiment of the text, in [-1, 1]

Subjectivity: Determines whether the text is factual information or personal opinion, in [0, 1]: 0 = factual, 1 = opinion

# TextBlob
from textblob import TextBlob

text1 = "The movie was very awesome."
text2 = "The food here tastes terrible."

# Finding the polarity (sentiment)
p1 = TextBlob(text1).sentiment.polarity
p2 = TextBlob(text2).sentiment.polarity

# Finding the subjectivity
s1 = TextBlob(text1).sentiment.subjectivity
s2 = TextBlob(text2).sentiment.subjectivity

Vader (Valence Aware Dictionary and Sentiment Reasoner): NLTK pretrained analyzer - results come quicker than with other analyzers. Best suited for social media (short sentences) with slang or abbreviations. Less accurate with long text.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()

text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."

sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)

print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Bag of Words Vectorization-based Models:

Preprocess: Normalization, tokenization, stopword removal, stemming/lemmatization

Create a bag of words for the processed text using Count Vectorization or TF-IDF Vectorization

Train a classification model

import pandas as pd
data = pd.read_csv('Finance_data.csv')  # assumes the dataset has 'Sentence' and 'Sentiment' columns

# Pre-processing and bag-of-words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')  # to remove special characters

cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Sentence'])

# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['Sentiment'], test_size=0.25, random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)
print("Accuracy Score: ", accuracy_score)

LSTM-based Models: feed sequences of word embeddings through recurrent LSTM layers before a classification layer (see the Deep Learning for NLP section below)

Transformer-based models: fine-tune pre-trained transformer models such as BERT for sentiment classification (see the Deep Learning for NLP section below)
Regular Expressions (Regex)
Regex: powerful tool for matching patterns in text, allowing for complex
search, replace, and parsing operations on strings.

Basic Regex Syntax:

Literals: Match the exact characters in the pattern.

Metacharacters: Symbols that have special meaning: . (any character), ^ (start of a string/line), $ (end of a string), ? (the previous character or nothing).

Character Classes: Defined by square brackets [] , match any one character within the brackets.

Predefined Character Classes: Such as \d for any digit, \w for any word character, and \s for any whitespace.

Quantifiers: Specify the number of occurrences, * (0 or more), + (1 or more), {n} (exactly n times).

Alternation: The pipe | allows for matching one pattern or another.

Groups: Parentheses () group patterns and capture their matches for later use.

Advanced Regex Features:

Non-Capturing Groups: (?:...) groups items without capturing.

Positive and Negative Lookahead: (?=...) and (?!...) assert what is or isn't following the current position, without including it in the match.

Backreferences: \1 , \2 , etc., refer back to captured groups.

Flags: Modify the behavior of the regex, such as i for case-insensitive matching.

Regex in NLP Tasks:

Tokenization: Splitting text into words or tokens.

Text Cleaning: Removing unwanted characters or formatting.

Information Extraction: Identifying and extracting entities, like dates or phone numbers.

Text Validation: Checking if strings conform to specific patterns.

Regex in Python with the re Library:

Basic Functions (a short sketch follows this list):

re.match() : Checks if the regex matches at the beginning of the string.

re.search() : Searches for a pattern anywhere in the string; returns the first occurrence.

re.findall() : Finds all non-overlapping matches of the pattern in the string.

re.finditer() : Similar to findall() , but returns an iterator over match objects.

re.sub() : Replaces occurrences of the regex pattern in the string with a replacement string.

re.split() : Splits the string by occurrences of the pattern.
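A short sketch of the basic re functions (the text and patterns are illustrative):

import re

text = "Order #123 shipped on 2024-05-01; order #456 pending."

print(re.search(r"\d{4}-\d{2}-\d{2}", text).group())  # first date: '2024-05-01'
print(re.findall(r"#(\d+)", text))                    # all order numbers: ['123', '456']
print(re.sub(r"\d", "X", text))                       # mask every digit
print(re.split(r";\s*", text))                        # split on semicolons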

Compiling Expressions:

re.compile() : Compiles a regex pattern into a regex object for reuse.

Match Object Methods:

match.group() : Returns the string matched by the regex.

match.start() : Returns the starting position of the match.

match.end() : Returns the ending position of the match.

match.groups() : Returns a tuple containing all the subgroups of the match.

match.groupdict() : Returns a dictionary containing all named subgroups of the match, keyed by the subgroup name.

Pattern Syntax:

Raw strings: r"pattern" tells Python not to handle backslashes in any special way, which is helpful since regex uses a lot of backslashes.

Escape sequences:

\d : Match any digit, shorthand for [0-9] .

\D : Match any non-digit, shorthand for [^0-9] .

\s : Match any whitespace character (space, tab, newline).

\S : Match any non-whitespace character.

\w : Match any word character (letter, digit, underscore).

\W : Match any non-word character.

\b : Match a word boundary (the position between a word and a non-word character).

\B : Match a non-word boundary.

Quantifiers:

* : Match 0 or more repetitions of the preceding element.

+ : Match 1 or more repetitions of the preceding element.

? : Match 0 or 1 repetition of the preceding element (makes it optional).

{m} : Match exactly m occurrences of the preceding element.

{m,n} : Match between m and n occurrences of the preceding element.

Anchors:

^ : Match the start of the string.

$ : Match the end of the string.

\A : Match the beginning of the text, similar to ^ but unaffected by multiline mode.

\Z : Match the very end of the text, similar to $ but unaffected by multiline mode.

Groups:

() : Capture the matched text, accessible later.

(?:) : Group without capturing the matched text.

(?P<name>) : Create a named group that can be accessed by the given name (see the sketch below).
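A short sketch tying together re.compile(), named groups, and the match-object methods above (pattern and text are illustrative):

import re

pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{4})")  # named groups 'area' and 'number'

m = pattern.search("Call 555-1234 today.")
if m:
    print(m.group())           # '555-1234'
    print(m.group("area"))     # '555'
    print(m.groups())          # ('555', '1234')
    print(m.groupdict())       # {'area': '555', 'number': '1234'}
    print(m.start(), m.end())  # position of the match within the string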
Part of Speech Tagging (POS) + Named Entity Recognition (NER)
Part of speech tagging (POS): Giving each word in a text a grammatical
category (nouns, verbs, adjectives etc)

Words may have multiple possible POS tags based on context

Tag set: predefined collection of tags that represent the grammatical categories (e.g. the Penn Treebank tag set)

Methods of POS tagging:

Rule-based tagging: words are assigned parts of speech based on specific rules about their usage and context, rather than relying on statistical models or machine learning.

eg. If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.

Statistical Tagging: statistical models trained on large annotated corpora to predict POS tags

Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) are common statistical approaches.

Machine Learning-based tagging: ML algorithms such as decision trees, support vector machines, or neural networks learn patterns from data.

Deep learning-based tagging: Deep learning models, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), are employed for POS tagging, capturing complex contextual dependencies.

POS Challenges:

Ambiguity: Words often have multiple possible POS tags based on context

Context dependency: POS tags can depend on the surrounding words, making accurate tagging sensitive to context.

Out-of-vocabulary words: Handling words not seen during training is a challenge.

POS Applications:

Syntactic parsing: Understanding the grammatical structure of sentences

Named Entity Recognition: Identifying and classifying entities (e.g., persons, organizations, locations) in text.

Information retrieval: Improving search and retrieval of relevant documents or information.

Text summarization: Generating concise summaries of longer texts.

Machine Translation: Translating text from one language to another

# POS Tagging
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate over tokens in a Doc
for token in doc:
    print(token.text, token.pos_)

# Output:
I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

Context Free Grammar (CFG): list of rules that define the set of all well-formed
sentences in a language
Goals of Context Free Grammar:

Permit ambiguity: Ensure that a sentence has all its possible parses. eg.
Fruit flies like an apple

Limit ungrammaticality: require agreement in number, tense, gender, person

Ensure meaningfulness: disallow e.g. "the apple eats the giraffe"

# Context-free grammar in NLTK
import nltk

# Define the grammar first
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "Man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

# Parse a sentence and print its tree
sent = "the Man walked in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    tree.pretty_print()

Recursive grammar: type of grammar that allows rules to call themselves directly or indirectly, enabling the description of constructs like nested parentheses or hierarchical structures in language.

For every word in a text, we can look up in our grammar what category it belongs to.

Dependencies and Dependency Grammar: binary asymmetric relation that holds between a head and its dependents

Head: usually taken to be the tensed verb, and every other word is either dependent on the sentence head or connects to it through a path of dependencies.
Deep Learning for NLP
Retraining: training a model from scratch: the weights are randomly initialized,
and the training starts without any prior knowledge.
Fine-tuning/Transfer Learning: First acquire a pretrained language model,
then perform additional training with a dataset specific to your task.

Recurrent Neural Networks (RNNs): Form of NNs that are uniquely made to
handle text classification tasks efficiently.

Uniquely able to capture sequential dependencies in data - like language

Good at evaluating the contextual links between words in NLP text classification - helps them identify patterns and semantics

Great for creating complex models for tasks like Document classification,
spam detection, and sentiment analysis

Types of RNN architectures:

Many to One RNN: Many inputs (Tx) are used to give one output (Ty) → eg. classification task

One to Many RNN: generates a series of output values based on a single input value → eg. music generation

Many to Many RNN: Many inputs (Tx) are used to give many outputs (Ty) → eg. machine translation

RNN limitations

Only able to capture dependencies in one direction of language

Not good at capturing long term dependencies → short terms are easy

Vanishing gradients problem → updated weights don’t really change

Gated Recurrent Unit (GRU): Extension of the RNN - helps to capture long-range dependencies and helps a lot in fixing the vanishing gradient problem.

Apart from the usual neural unit with sigmoid function and softmax for output, it contains an additional unit with tanh as an activation function

Combines long and short term memory into its hidden state

Faster and more efficient than LSTM

Has two gates:

Update gate: Knows how much past memory to retain

Reset gate: Knows how much past memory to forget

Long Short-Term Memory (LSTM): Specialized type of RNN - designed to handle sequences of data with better control over the gradient flow and maintenance of state over long sequences.

Unique gating mechanisms make them more effective for tasks involving long or complex sequences where context from the distant past is important

LSTMs are typically better than GRUs at capturing long-term dependencies due to their more complex gating mechanisms. (A minimal Keras sketch follows.)
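A minimal Keras sketch of an LSTM text classifier; the vocabulary size, dimensions, and binary output are illustrative assumptions, not values from the course:

# Embedding -> LSTM -> sigmoid output for binary text classification
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000  # assumed vocabulary size after tokenization
embed_dim = 64      # assumed embedding dimension

model = keras.Sequential([
    keras.Input(shape=(None,)),             # variable-length sequences of token ids
    layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    layers.LSTM(64),                        # swap in layers.GRU(64) for the GRU variant
    layers.Dense(1, activation="sigmoid"),  # binary (e.g. positive/negative) output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()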

Transformers: Architecture with self-attention mechanisms. Can capture intricate patterns in data and are behind some of the most advanced language models available.

Unlike RNNs, Transformers process the entire input at once

The attention mechanism provides context for any position in the input sequence

Encoder: Receives input and builds a representation of it (features)

The model is optimized to acquire understanding from the input

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        else:
            padding_mask = None
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

Decoder: Uses the encoder's representation (features) along with other inputs to generate a target sequence.

The model is optimized for generating outputs

An extra attention block is inserted between the self-attention block applied to the target sequence and the dense layers of the exit block.

The added attention block saves the weights in the decoder, so you can switch between translation directions without retraining another model to translate the other way

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = None

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

Self-attention: differentially weighting the significance of each part of the input data.

A smart embedding space which provides a different vector representation for a word depending on the other words surrounding it.

Context-aware token representations: modulate the representation of a token by using the representations of related tokens in the sequence.

Take a query q and find the most similar key k by doing a dot product of q and k (a small NumPy sketch follows this list).

Query vector: the current word - a reference sequence that describes something you're looking for

Key vector: indexing mechanism for the value vector. Like a hash-map key-value pair - describes the value in a format that can be readily compared to a query.

Value vector: Information in the input vector - the body of knowledge that you're trying to extract information from
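A small NumPy sketch of scaled dot-product attention over toy Q/K/V matrices (shapes are illustrative; real models use learned projections):

# attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8                 # toy sequence length and key dimension
Q = np.random.rand(seq_len, d_k)    # queries (one per position)
K = np.random.rand(seq_len, d_k)    # keys
V = np.random.rand(seq_len, d_k)    # values

scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query with each key
weights = softmax(scores, axis=-1)  # attention weights per position (rows sum to 1)
output = weights @ V                # context-aware representation for each position
print(output.shape)                 # (4, 8)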

Multi-Head Attention: extension of self-attention. It splits the attention mechanism into multiple "heads," allowing the model to simultaneously attend to different parts of the sequence from different representational spaces.

The model can capture various types of relationships and nuances in the data at the same time, providing a richer understanding of the context.

num_heads = 4
embed_dim = 256
mha_layer = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
output = mha_layer(inputs, inputs, inputs)  # query, value, key: the same sequence for self-attention

Positional Encoding: Added to each word embedding to give the model information about word order

Positional vector: represents the position of the word in the current sentence

Adds a "position" axis to the input vector

Sentence length needs to be known in advance

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

Encoder-only models: For tasks that require understanding of the input. eg.
Classification and NER

eg. BERT is encoder only

Decoder-only models: For generative tasks. eg. Text generation


Encoder-decoder models (Sequence-to-sequence): For generative tasks that
require input. eg. Translation or Summarization.

BERT (Bidirectional Encoder Representations from Transformers): Transformer-based; uses only the encoder part of the Transformer.

It is fine-tuned with additional output layers to perform a wide range of specific language tasks like question answering, sentiment analysis, and entity recognition.

GPT (Generative Pretrained Transformer): Transformer based, primarily its
decoder component, to generate human-like text based on the input it
receives.

GPT is designed to generate text and is trained to predict the next word in a sentence given all the previous words (a short Hugging Face pipeline sketch follows).
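A short sketch using the Hugging Face transformers pipeline API to contrast the two; the model names are common public checkpoints, not ones specified in the course notes:

from transformers import pipeline

# Encoder-only (BERT-style): masked-word prediction, i.e. understanding context
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("NLP is a branch of artificial [MASK].")[0]["token_str"])

# Decoder-only (GPT-style): left-to-right text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20)[0]["generated_text"])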

Question Answering: Sub-field of NLP, developing models that can automatically answer questions in natural language.

Q-A Process:

1. Data collection and processing: Large corpus, clean and format the data

2. Information Retrieval: Keyword search, Text classification, NER

3. Question Analysis: POS tagging, Dependency parsing, NER

4. Answer Generation: Text generation, Summarization

5. Model Training: Supervised/unsupervised learning

6. Model Evaluation: Precision, Recall, F1

Types of QA systems

Information Retrieval-based QA: automatically answering questions by searching for relevant documents or passages that contain the answer.

Uses keyword or semantic search

Performance can be limited by the quality and relevance of the indexed text and the effectiveness of the retrieval and extraction methods

Can be used with other types of QA, like knowledge-based or generative QA

Knowledge-Based QA: answers questions using a knowledge base, such as a database or ontology, to retrieve the relevant information.

Focused on searching a structured knowledge base (eg. JSON)

Generally more accurate and reliable than other QA approaches, because it is based on structured and well-curated knowledge

Performance can be limited by how well the knowledge base covers the domain and how well the methods used to query and retrieve information from it work

Can also be used with other types of QA

Generative QA: automatically answers questions using a generative model, such as a neural network, to generate a natural language answer to a given question.

Based on the idea that a machine can be taught to understand and create text in natural language to provide an answer that is correct in terms of grammar and meaning

Limited by the training data's quality and diversity and the model's complexity.

Often used with other QA approaches, such as information retrieval-based or knowledge-based QA

Hybrid QA: automatically answers questions by combining multiple QA approaches, such as information retrieval-based, knowledge-based, and generative QA.

Considered more robust and accurate than a single QA approach

Can be built for a specific domain or as a general-purpose system

Rule-based QA: answers questions using a predefined set of rules based on keywords or patterns in the question.

Based on the idea that many questions can be answered by matching the question to a set of predefined rules or templates.

More prone to errors and can only handle questions covered by the predefined rules.

Build a QA system using Hugging Face:

1. Import Libraries: PyTorch, pre-trained transformers, BertForQuestionAnswering, BertTokenizer

2. Extract Pre-trained Model:

model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

3. Create Testing Sample: Use a string that holds the context or knowledge from which the model should return the required answer.

answer_text = "Obama's last name is Care"
question = "What is Obama's last name?"

4. Tokenize the answer_text and question: using BertTokenizer

input_ids = tokenizer.encode(question, answer_text)

5. Create an Attention Mask: a sequence of 1s and 0s indicating which tokens in the input_ids sequence should be attended to by the model.

attention_mask = [1] * len(input_ids)

6. Obtain the Model's output: the first part contains the logits for the start token index, and the second part contains the logits for the end token index

output = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))

7. Define the Start and End Indexes: determine the start and end indices of the answer in the input_ids sequence

start_index = torch.argmax(output[0])  # (simplified) index of the highest start-token logit
end_index = torch.argmax(output[1])    # (simplified) index of the highest end-token logit

8. Decode the Final Answer: revert the tokenization process for us to understand the value of the final answer

answer = tokenizer.decode(input_ids[start_index:end_index + 1])
print("Answer:", answer)

9. Output: “Care”

QA Task Variants:
Definitions

Closed book: must rely solely on what it has been pre-trained with

Open Book: has access to external information sources

Extractive QA: involves identifying the exact span of text within a given
document that answers a question.

Abstractive QA: Instead of extracting direct answers, this task involves generating a new text passage that answers the question, potentially synthesizing information from multiple sources or paraphrasing.

Applied Tasks

Retriever-Reader: For extractive QA, the 'retriever' component first selects the relevant documents or passages from a larger corpus, and then the 'reader' component reads those texts to find the exact answer.

Retriever-Generator: For abstractive QA, the 'retriever' again fetches relevant documents, but the 'generator' component synthesizes information from the retrieved texts to construct a coherent answer.

Generator: In the closed book setting for abstractive QA, the 'generator'
relies only on its internal knowledge (what it has learned during pre-
training) to generate an answer without looking up any external resources.

Common QA Models:

BERT (Bidirectional Encoder Representations from Transformers): transformer-based machine learning model for NLP pre-training developed by Google.

Bidirectional context understanding; pre-trained on a large corpus and fine-tuned for specific tasks.

Is not generative; it's designed for understanding context.

GPT (Generative Pre-Trained Transformer): generative model that uses the decoder of the Transformer to produce text.

Trained to predict the next word in a sentence; can generate coherent text passages.

GPT generates text based on previous context in a unidirectional manner (not bidirectional like BERT)

T5 (Text-to-Text Transfer Transformer): treats every language problem as a text-to-text problem.

Unified framework for many NLP tasks.

Good for translation, question answering, and summarization.

Built on the Transformer but reframes all NLP tasks as converting one type of text to another.

RoBERTa (Robustly Optimized BERT Approach): An optimized version of BERT designed to provide better performance.

Trained with more data, larger batches, and longer sequences.

Removes BERT's next-sentence-prediction objective.

A refined version of BERT, aimed at improving the pre-training process to enhance performance.

ALBERT (A Lite BERT): lighter version of BERT with fewer parameters, aimed at increasing training speed and reducing model size.

Can be applied to the same tasks as BERT but with greater efficiency.

More efficient and scalable version of BERT, sacrificing some model capacity for greater speed and lower memory usage.