NLP Final Review
Speech Recognition
Machine translation
# Output
['Python is Great!', 'Im using NLTK!']
Tokenization: break a sentence down into its constituent words (tokens) and store them
# Output
['Python', 'is', 'Great', '!', 'Im', 'using', 'NLTK', '!']
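A minimal sketch that reproduces the two outputs above (assumes NLTK's 'punkt' tokenizer data is available):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
text = "Python is Great! Im using NLTK!"
print(sent_tokenize(text))  # ['Python is Great!', 'Im using NLTK!']
print(word_tokenize(text))  # ['Python', 'is', 'Great', '!', 'Im', 'using', 'NLTK', '!']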
Stop word removal: remove non-essential ‘filler’ words, which add little
meaning to our statement and are just there to make the statement sound
more cohesive
# remove stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
text = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
filtered = [w for w in word_tokenize(text) if w not in stop_words]
print(filtered)
# output
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
Stemming: obtaining the 'stem' of words. Word stems give new words when affixes are added to them
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned"]
for w in example_words:
    print(ps.stem(w))
#output:
python
python
python
python
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Reduce words to their dictionary form (requires the WordNet corpus)
print(lemmatizer.lemmatize("cats"))          # cat
print(lemmatizer.lemmatize("running", "v"))  # run
One Hot Encoding: assigns 0 to all elements in a vector except for one,
which has a value of 1.
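A minimal sketch of one-hot encoding a small vocabulary (the vocabulary and word order are illustrative assumptions):
vocab = ['I', 'live', 'in', 'Madrid']
# each word maps to a vector that is all zeros except for a 1 at its own index
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))] for i, w in enumerate(vocab)}
print(one_hot['live'])  # [0, 1, 0, 0]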
eg. unigram: [‘I’, ‘live’, ‘in’], bigram: [’I live’, ‘live in’], trigram: [’I live in’]
from nltk.util import ngrams
sentence = "I live in Madrid"
n = 4  # with n = 4 the whole sentence forms a single 4-gram, matching the output below
n_grams = ngrams(sentence.split(), n)
for gram in n_grams:
    print(gram)
# Output
('I', 'live', 'in', 'Madrid')
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist  # used to print the most common words
import string
corpus = [
    "The cat in the hat.",
    "The quick brown fox jumps over the lazy dog.",
    "A bird in hand is worth two in the bush."
]
def nltk_tokenize(text):
    stemmer = nltk.stem.SnowballStemmer('english')
    text = text.lower()
    # tokenize, drop punctuation, and stem each token
    return [stemmer.stem(t) for t in word_tokenize(text) if t not in string.punctuation]
tokens = [t for doc in corpus for t in nltk_tokenize(doc)]
fdict = FreqDist(tokens)
print("Most Common Words:")
for word, count in fdict.most_common(4):
    print(f"{word}: {count}")
#Output:
Most Common Words:
the: 5
in: 3
cat: 1
hat: 1
Count Vectorizer: converts a collection of text documents into a matrix of token counts (a bag of words)
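A minimal sketch using scikit-learn's CountVectorizer on the corpus defined above:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
counts = cv.fit_transform(corpus)       # rows = documents, columns = vocabulary terms
print(cv.get_feature_names_out())       # vocabulary learned from the corpus
print(counts.toarray())                  # raw term counts per document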
# Applying TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tr_idf_model = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)  # apply TF-IDF to the corpus
Cosine similarity (see the sketch after this list)
Levenshtein distance
Jaccard index
Euclidean distance
Hamming distances
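A minimal sketch of cosine similarity between two of the TF-IDF vectors computed above (assumes scikit-learn; the document indices are arbitrary):
from sklearn.metrics.pairwise import cosine_similarity
# similarity between document 0 and document 1 of the corpus
print(cosine_similarity(tf_idf_vector[0], tf_idf_vector[1]))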
Word embeddings
Text Classification
Text Classification: involves categorizing and assigning predefined labels or
categories to text documents, sentences, or phrases based on their content.
3. Feature extraction: Relevant features are extracted from the text, such as
words, n-grams, or parts of speech
Use Cases:
Product/service analysis
Text Blob: Python library. Takes text input and can return polarity and
subjectivity as outputs
# Text blob
from textblob import TextBlob
sent_1 = TextBlob(text_1).sentiment  # returns (polarity, subjectivity)
sent_2 = TextBlob(text_2).sentiment
Create a bag of words for the processed text using CountVectorizer or TF-IDF vectorization
import pandas as pd
data = pd.read_csv('Finance_data.csv')
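A minimal sketch of the bag-of-words + classifier step (the column names 'text' and 'label' are assumptions, not taken from Finance_data.csv):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = TfidfVectorizer().fit_transform(data['text'])   # assumed text column
y = data['label']                                    # assumed label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))                     # accuracy on the held-out split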
LSTM-based Models:
Transformer-based models:
Regular Expressions (Regex)
Regex: powerful tool for matching patterns in text, allowing for complex
search, replace, and parsing operations on strings.
Predefined Character Classes: such as \d for any digit, \w for any word character, and \s for any whitespace.
Basic Functions:
re.match() : Checks if the regex matches at the beginning of the string.
re.search() : Scans the string and returns the first occurrence.
re.sub() : Replaces matches with a replacement string.
Compiling Expressions: re.compile() pre-compiles a pattern into a reusable pattern object.
Pattern Syntax:
Raw strings: r"pattern" tells Python not to handle backslashes in any
special way, which is helpful since regex uses a lot of backslashes.
Escape sequences: e.g. \b matches a word boundary (the position between a word character and a non-word character).
Quantifiers: * (zero or more), + (one or more), ? (zero or one), {m,n} (between m and n repetitions).
Anchors:
\Z : Match the very end of the text, similar to $ but unaffected by multiline mode.
Groups: parentheses (...) capture part of a match so it can be retrieved later with group().
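A minimal sketch of these regex features (the patterns and strings are illustrative assumptions):
import re

pattern = re.compile(r"^(\w+)\s+(\d{3})$")   # compiled pattern with two groups: a word and a 3-digit code
m = pattern.match("error 404")
if m:
    print(m.group(1), m.group(2))            # 'error' '404'
print(re.search(r"\d+", "room 42 is free"))  # first occurrence of one or more digits
print(re.sub(r"\s+", " ", "too   many   spaces"))  # replace runs of whitespace with a single space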
POS Challenges:
Ambiguity: Words often have multiple possible POS tags based on context
POS Applications:
# POS Tagging
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I am learning NLP in Python")
for token in doc:
    print(token.text, token.pos_)
# output:
I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN
Context Free Grammar (CFG): list of rules that define the set of all well-formed
sentences in a language
Goals of Context Free Grammar:
Permit ambiguity: Ensure that a sentence has all its possible parses. eg.
Fruit flies like an apple
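A minimal sketch (assuming NLTK) of a toy grammar that yields both parses of the ambiguous sentence above; the production rules are illustrative assumptions:
import nltk
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | N N | N
    VP -> V NP | V PP
    PP -> P NP
    Det -> 'an'
    N -> 'fruit' | 'flies' | 'apple'
    V -> 'flies' | 'like'
    P -> 'like'
""")
parser = nltk.ChartParser(grammar)
# two parse trees: "fruit flies" as a noun phrase, or "fruit" as the subject of the verb "flies"
for tree in parser.parse("fruit flies like an apple".split()):
    print(tree)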
Head: usually taken to be the tensed verb, and every other word is either dependent on the sentence head or connects to it through a path of dependencies.
Deep Learning for NLP
Retraining: training a model from scratch; the weights are randomly initialized, and training starts without any prior knowledge.
Fine-tuning/Transfer Learning: First acquire a pretrained language model,
then perform additional training with a dataset specific to your task.
Recurrent Neural Networks (RNNs): a form of neural network designed to process sequential data such as text, which makes them well suited to text classification tasks.
Great for building models for tasks like document classification, spam detection, and sentiment analysis
Many to One RNN: Many inputs (Tx) are used to give one output (Ty) → eg.
Classification task
RNN limitations
Not good at capturing long-term dependencies (short-term dependencies are easy)
LSTMs: apart from the usual neural unit with a sigmoid activation and softmax for the output, an LSTM contains an additional unit with tanh as its activation function
Combines long- and short-term memory into its hidden state
LSTMs use unique gating mechanisms, making them more effective for tasks involving long or complex sequences where context from the distant past is important (see the sketch below)
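A minimal sketch of an LSTM-based text classifier in Keras (vocabulary size, dimensions, and the binary label are illustrative assumptions):
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim = 20000, 128          # assumed sizes
model = keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.LSTM(64),                        # gated recurrent layer keeps longer-range context
    layers.Dense(1, activation="sigmoid"),  # e.g. spam vs. not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()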
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True
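The snippet above only shows the constructor; a sketch of the matching forward pass, along the lines of the standard Keras TransformerEncoder example (assumes import tensorflow as tf):
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]                # broadcast padding mask over query positions
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)   # residual connection + norm
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)          # second residual + norm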
The added attention block saves the weights in the decoder so you can switch between translation directions without retraining another model to translate the other way
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True
        # inside def call(self, inputs, encoder_outputs, mask=None):
        # (causal_mask and padding_mask are built from the target sequence; helper omitted here)
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)
Take a query q and find the most similar key k by doing a dot product of q and k.
Key vector: indexing mechanism for the value vector, like a hash map's key-value pair - it describes the value in a format that can be readily compared to a query.
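A minimal NumPy sketch of the dot-product attention idea described above (shapes and random data are illustrative):
import numpy as np

def dot_product_attention(q, k, v):
    scores = q @ k.T / np.sqrt(k.shape[-1])      # similarity of each query with every key, scaled
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                            # weighted sum of value vectors

q = np.random.rand(2, 4)   # 2 queries of dimension 4
k = np.random.rand(3, 4)   # 3 keys
v = np.random.rand(3, 4)   # 3 values
print(dot_product_attention(q, k, v).shape)       # (2, 4)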
num_heads = 4
embed_dim = 256
mha_layer = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
output = mha_layer(inputs, inputs, inputs)  # 3 inputs: query, value, key (self-attention uses the same sequence for all three)
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
Encoder-only models: For tasks that require understanding of the input. eg.
Classification and NER
GPT is designed to generate text and is trained to predict the next word in a
sentence given all the previous words.
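A minimal sketch of this next-word prediction in use via Hugging Face's text-generation pipeline (the 'gpt2' checkpoint is just an example choice):
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# the model repeatedly predicts the next token given everything generated so far
print(generator("Natural language processing is", max_length=20))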
1. Data collection and processing: Large corpus, clean and format the data
Types of QA systems
These systems are limited by the training data's quality and diversity and by the model's complexity.
from transformers import BertForQuestionAnswering, BertTokenizer
model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
3. Create Testing Sample: use a string that holds the context or knowledge from which the model will return the required answer.
6. Obtain the Model's output: the first part contains the logits for the start token index, and the second part contains the logits for the end token index.
7. Define the start and end indexes: determine the start and end indices of the answer in the input_ids sequence.
9. Output: “Care”
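A minimal sketch of steps 3-7 with the model and tokenizer loaded above (the question and context strings are illustrative assumptions):
import torch

context = "Patients with mild symptoms usually receive supportive care at home."  # assumed context
question = "What do patients with mild symptoms receive?"                          # assumed question
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)   # logits for the start token index
end_idx = torch.argmax(outputs.end_logits)       # logits for the end token index
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx + 1])
print(answer)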
QA Task Variants:
Definitions
Closed book: must rely solely on what it has been pre-trained with
Extractive QA: involves identifying the exact span of text within a given
document that answers a question.
Applied Tasks
Generator: In the closed book setting for abstractive QA, the 'generator'
relies only on its internal knowledge (what it has learned during pre-
training) to generate an answer without looking up any external resources.
Common QA Models:
T5: built on the Transformer but reframes all NLP tasks as converting one type of text to another (text-to-text).
ALBERT (A Lite BERT): a lighter version of BERT with fewer parameters, aimed at increasing training speed and reducing model size.
Can be applied to the same tasks as BERT but with greater efficiency.