Guru Gobind Singh Foundation’s
Guru Gobind Singh College of Engineering and Research Center,
Nashik
Experiment No: 01
Title of Experiment:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library.
Use porter stemmer and snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: any sample sentence
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of
Subject Teacher
Experiment No: 01
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library.
Use porter stemmer and snowball stemmer for stemming. Use any technique for lemmatization.
What is NLTK?
NLTK is a standard Python library that provides prebuilt functions and utilities for the easy
implementation of language-processing tasks.
It is one of the most used libraries for natural language processing and computational linguistics.
NLTK is a leading platform for building Python programs to work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Data pre-processing
Data pre-processing is the process of transforming raw text into a form that a machine can work with
more easily. Some standard practices for doing that are:
1. Tokenization
Tokenization is the process of breaking text into smaller pieces called tokens. These smaller pieces can be
sentences, words, or sub-words. For example, the sentence “I won” can be tokenized into two word-tokens
“I” and “won”.
i. Word Tokenization
This involves breaking down a text into individual words. Punctuation marks and spaces are usually
used as delimiters to separate words.
Example: "The quick brown fox jumps over the lazy dog" becomes ['The', 'quick', 'brown', 'fox',
'jumps', 'over', 'the', 'lazy', 'dog'].
ii. Sentence Tokenization
In sentence tokenization, a text is split into individual sentences. Sentence boundaries are
identified using punctuation marks like periods, exclamation marks, and question marks.
Example: "Hello, how are you? I hope you are doing well." becomes ['Hello, how are you?', 'I hope
you are doing well.'].
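A brief sketch of both tokenizers using NLTK (assuming the 'punkt' tokenizer models have been downloaded):

import nltk
nltk.download('punkt')  # models used by word_tokenize / sent_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello, how are you? I hope you are doing well."
print(word_tokenize(text))  # ['Hello', ',', 'how', 'are', 'you', '?', ...]
print(sent_tokenize(text))  # ['Hello, how are you?', 'I hope you are doing well.']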
2. Punctuation Removal
Punctuation marks usually carry little information for downstream NLP tasks, so they are often removed.
A. Whitespace tokenization
This method splits text based on whitespace characters such as space, tab, or newline.
Example: "I love\tNLP" becomes ['I', 'love', 'NLP'].
B. Punctuation-based tokenization
Punctuation-based tokenization is slightly more advanced than whitespace-based tokenization: it splits
on both whitespace and punctuation, and it retains the punctuation marks as tokens. Punctuation marks
such as periods, commas, question marks, and exclamation marks serve as delimiters that separate the
text into tokens.
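A short sketch using NLTK's wordpunct_tokenize, which splits on whitespace and punctuation while keeping the punctuation (the sample sentence is our own):

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("Don't stop, keep going!"))
# ['Don', "'", 't', 'stop', ',', 'keep', 'going', '!']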
C. Treebank tokenization
The default tokenization method in NLTK uses regular expressions as defined in the
Penn Treebank (based on English text). It assumes that the text has already been split into sentences.
The Treebank Word Tokenizer is one of the tokenizers provided by the NLTK (Natural Language Toolkit) library
in Python. It is based on the Penn Treebank corpus, which is a large corpus of English text that has been
annotated with part-of-speech tags and syntactic structures.
The Treebank Word Tokenizer is designed to tokenize text according to the conventions used in the Penn
Treebank corpus. It separates words and punctuation marks into individual tokens while handling various
cases such as contractions, hyphenated words, and abbreviations.
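A brief sketch (the sample sentence is our own):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("She can't pay the $5.50 fee."))
# ['She', 'ca', "n't", 'pay', 'the', '$', '5.50', 'fee', '.']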
D. Tweet Tokenizer
The Tweet Tokenizer is a specialized tokenizer provided by NLTK (Natural Language Toolkit) that is specifically
designed for tokenizing tweets. Tweets often contain unique characteristics such as hashtags, mentions, URLs,
and emoticons, which may require special handling during tokenization.
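A brief sketch; strip_handles removes mentions and reduce_len shortens long runs of repeated characters:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize("@user Loooove this!!! #NLP :-)"))
# ['Looove', 'this', '!', '!', '!', '#NLP', ':-)']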
E. MWE Tokenizer
The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has
been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.
MWE (Multi-Word Expression) Tokenizer is a tokenizer that is specialized in identifying and tokenizing multi-
word expressions or phrases as single tokens. Multi-word expressions are combinations of words that have a
collective meaning beyond the sum of their individual meanings. Examples include "New York City," "ice
cream," "kick the bucket," and so on.
NLTK provides this as MWETokenizer in the nltk.tokenize module: given a list of multi-word expressions
and text that has already been tokenized, it merges the tokens that make up each expression into a
single token, as sketched below.
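A brief sketch; the expressions and the separator are our own choices:

from nltk.tokenize import MWETokenizer, word_tokenize

tokenizer = MWETokenizer([('New', 'York', 'City'), ('ice', 'cream')], separator='_')
print(tokenizer.tokenize(word_tokenize("I ate ice cream in New York City")))
# ['I', 'ate', 'ice_cream', 'in', 'New_York_City']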
A. Stemming
Stemming is the process of reducing morphological variants of a word to a common root or base form
(the stem). Programs that do this are commonly referred to as stemming algorithms or stemmers. Often,
when searching text for a certain keyword, it helps if the search also returns variations of the word.
For instance, searching for “boat” might also return “boats” and “boating”; here, “boat” is the stem
for [boat, boater, boating, boats]. Stemming is a somewhat crude method for cataloguing related words:
it essentially chops letters off the end of a word until the stem is reached.
i) Porter Stemmer
The Porter Stemmer is a widely used stemming algorithm, particularly in English, developed by Martin Porter.
Stemming is the process of reducing words to their base or root form. This is particularly useful in natural
language processing tasks where variations of words need to be treated as the same word.
ii) Snowball Stemmer
The Snowball Stemmer, also known as the Porter2 Stemmer, is an improved version of the original Porter
Stemmer algorithm developed by Martin Porter. It is more aggressive and accurate in stemming compared to
the original Porter Stemmer.
NLTK (Natural Language Toolkit) provides an implementation of the Snowball Stemmer, which supports
stemming for multiple languages, not just English. The Snowball Stemmer is capable of handling various
language-specific stemming rules.
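A short sketch comparing the two stemmers on sample words:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
for word in ['boating', 'boats', 'generously', 'fairly']:
    print(word, porter.stem(word), snowball.stem(word))
# e.g. 'generously' -> 'gener' (Porter) vs 'generous' (Snowball),
# and 'fairly' -> 'fairli' (Porter) vs 'fair' (Snowball)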
B. Lemmatization
Lemmatization is a process in natural language processing (NLP) that involves reducing words to their base or
canonical form, known as the lemma. Unlike stemming, which chops off suffixes to derive the root form of
words, lemmatization considers the context and morphological analysis to return a dictionary form or lemma
of a word.
The main advantage of lemmatization over stemming is that it produces real words. Stemming may sometimes
result in non-words or words that are not valid in the language.
For nouns, lemmatization aims to return the singular form of the word. For example, "mice" would be
lemmatized to "mouse".
For verbs, lemmatization aims to return the base or infinitive form of the verb. For example, "running"
would be lemmatized to "run".
For adjectives and adverbs, lemmatization returns the base form. For example, "better" would be
lemmatized to "good".
NLTK (Natural Language Toolkit) provides lemmatization functionality through the WordNet Lemmatizer.
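A minimal sketch (the 'wordnet' corpus must be downloaded first):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('mice'))             # 'mouse' (noun is the default POS)
print(lemmatizer.lemmatize('running', pos='v')) # 'run'
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'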
Conclusion
Tokenization is a fundamental pre-processing step in natural language processing (NLP) pipelines because it
allows algorithms to work with text data in a structured and manageable format. Python libraries like NLTK,
spaCy, and scikit-learn offer various tokenization tools and methods to facilitate text processing tasks.
Practical No. 2: Perform bag-of-words approach
Experiment No: 02
Title of Experiment: Perform bag-of-words approach (count occurrence, normalized count occurrence),
TF-IDF on data. Create embeddings using Word2Vec.
Dataset: https://www.kaggle.com/datasets/CooperUnion/cardataset
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of
Subject Teacher
Experiment No: 02
Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data.
Create embeddings using Word2Vec.
Let us see an example of how the bag-of-words technique converts text into vectors.
Sentence 1: "Welcome to Great Learning , Now start learning"
Sentence 2: "Learning is a good practice"
The tokens of each sentence:
Sentence 1    Sentence 2
Welcome       Learning
to            is
Great         a
Learning      good
,             practice
Now
start
learning
Step 1: Go through all the words in the above text and make a list of all of the words in our model
vocabulary.
Welcome
To
Great
Learning
,
Now
start
learning
is
a
good
practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in
their cases and hence are repeated. Also, note that a comma ‘ , ’ is also taken in the list.
Because the vocabulary has 12 words, we can use a fixed-length document representation of 12,
with one position in the vector to score each word.
The scoring method used here is to mark 1 for the presence of each word and 0 for its absence.
This simple scoring method is the most widely used.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
The scoring of sentence 2 would look as follows:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in vector form:
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]
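A brief sketch of the whole experiment using scikit-learn and Gensim (the library choices are ours; note that CountVectorizer lowercases and drops punctuation and single-character tokens by default, so its vocabulary differs slightly from the hand-worked example above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

docs = ["Welcome to Great Learning, Now start learning",
        "Learning is a good practice"]

# Count occurrence (bag of words)
cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(cv.get_feature_names_out())
print(counts.toarray())

# Normalized count occurrence (each row sums to 1)
norm_counts = counts.toarray() / counts.toarray().sum(axis=1, keepdims=True)

# TF-IDF
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())

# Word2Vec embeddings trained on the tokenized sentences
tokenized = [doc.lower().split() for doc in docs]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)
print(w2v.wv['learning'])  # a 100-dimensional embedding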
Conclusion
The Bag of Words model is a Natural Language Processing technique for text modelling. Whenever we apply
any algorithm in NLP, it works on numbers, and we cannot feed raw text into such an algorithm directly.
Hence, the Bag of Words model is used to pre-process text by converting it into a bag of words, which
keeps a count of the occurrences of the words in each document.
Experiment No: 03
Title of Experiment: Perform text cleaning, perform lemmatization (any method), remove stop
words (any method), label encoding. Create representations using TF-IDF. Save outputs. Dataset:
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of Subject
Teacher
Practical No: 3
Title of the Assignment: Perform text cleaning, perform lemmatization (any method), remove stop words
(any method), label encoding. Create representations using TF-IDF. Save outputs.
Text Cleaning
Text cleaning is task-specific: one needs a clear idea of the desired end result, and should review
the data to see what exactly can be achieved.
WHAT IS LEMMATIZATION?
Lemmatization is a text pre-processing technique used in natural language processing (NLP) to reduce
a word to its root meaning so that similar forms can be identified. For example, a lemmatization
algorithm would reduce the word "better" to its root word, or lemma, "good". Common approaches include:
1. WordNet
2. WordNet (with POS tag)
3. TextBlob
4. TextBlob (with POS tag)
5. spaCy
6. TreeTagger
7. Pattern
8. Gensim
9. Stanford CoreNLP
1. Wordnet Lemmatizer
Wordnet is a publicly available lexical database that provides semantic relationships between its
words; wordnets are now available for many languages. It is one of the earliest and most commonly
used lemmatization techniques.
● It is available in the NLTK library in Python.
● Wordnet links words through semantic relations (e.g., synonyms).
2. Wordnet Lemmatizer (with POS tag)
Plain Wordnet results are not always up to the mark: words like ‘sitting’ and ‘flying’ remain unchanged
after lemmatization, because by default every word is treated as a noun rather than a verb. To overcome
this, we use POS (Part of Speech) tags: we pass a tag with each word defining its type (verb, noun,
adjective, etc.), as sketched below.
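A minimal sketch (the tag-mapping helper is our own, not part of NLTK):

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (VB*, JJ*, RB*, NN*) to Wordnet POS constants
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The birds were flying and sitting on the wires")
print([lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tag(tokens)])
# 'flying' -> 'fly', 'sitting' -> 'sit'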
3. TextBlob
TextBlob is a python library used for processing textual data. It provides a simple API to access its
methods and perform basic NLP tasks.
4. TextBlob (with POS tag)
Without appropriate POS tags, this approach has the same limitations as the plain Wordnet approach.
So we use one of the more powerful aspects of the TextBlob module, its part-of-speech tagging, to
overcome this problem.
5. spaCy
spaCy is an open-source python library that parses and “understands” large volumes of text.
Separate models are available that cater to specific languages (English, French, German, etc.).
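A brief sketch, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet")
print([token.lemma_ for token in doc])
# e.g. 'bats' -> 'bat', 'were' -> 'be', 'feet' -> 'foot'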
6. TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. The
TreeTagger has been successfully used to tag over 25 languages and is adaptable to other languages
if a manually tagged training corpus is available.
7. Pattern
Pattern is a Python package commonly used for web mining, natural language processing, machine
learning, and network analysis. Among its many useful NLP capabilities is a lemmatizer, which
Gensim builds on (see below).
8. Gensim
Gensim is designed to handle large text collections using data streaming. Its lemmatization
facility is based on the Pattern package described above.
The gensim.utils.lemmatize() function can be used for lemmatization (note that it was removed in
Gensim 4.x, so an older Gensim version is required). It uses Pattern's lemmatizer to extract
UTF8-encoded tokens in their base (lemma) form, and by default it only considers nouns, verbs,
adjectives, and adverbs (all other tokens are discarded).
9. Stanford CoreNLP
CoreNLP enables users to derive linguistic annotations for text, including token and sentence
boundaries, parts of speech, named entities, numeric and time values, dependency and constituency
parses, sentiment, quote attributions, and relations.
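Putting the assignment together, here is a minimal sketch of the full pipeline (cleaning, lemmatization, stop-word removal, label encoding, TF-IDF, saving outputs). It assumes the pickled News_dataset linked above contains 'Content' and 'Category' columns; the actual column names may differ:

import pickle
import re
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

with open('News_dataset.pickle', 'rb') as f:
    df = pickle.load(f)  # assumed to be a DataFrame with 'Content' and 'Category'

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean(text):
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())  # keep letters only
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split()
                    if w not in stop_words)           # stop-word removal + lemmas

df['clean'] = df['Content'].apply(clean)
df['label'] = LabelEncoder().fit_transform(df['Category'])  # label encoding

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['clean'])

with open('tfidf_features.pickle', 'wb') as f:   # save outputs for later use
    pickle.dump((X, df['label'].values), f)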
Conclusion
We saw the most common techniques for cleaning and processing text data. In each subsection we saw
how to remove particular kinds of content and when to remove it, with use cases, as well as the
situations to avoid when applying these techniques for text analysis.
Experiment No: 04
Title of Experiment: Create a transformer from scratch using the PyTorch library
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
Signature of
Subject
Teacher
Title of the Assignment: Create a transformer from scratch using the PyTorch library
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
Multi-Head Attention
The Multi-Head Attention mechanism computes the attention between each pair of
positions in a sequence. It consists of multiple “attention heads” that capture different
aspects of the input sequence.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projections for queries, keys, values and the final output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch_size, seq_length, _ = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        # Scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        context = torch.matmul(torch.softmax(scores, dim=-1), V)
        # Recombine the heads back to (batch, seq_len, d_model)
        batch_size, _, seq_length, _ = context.size()
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        return self.W_o(context)
The MultiHeadAttention code initializes the module with the input parameters and linear
transformation layers. It calculates attention scores, reshapes the input tensor into multiple
heads, and combines the attention outputs from all heads. The forward method computes the
multi-head self-attention, allowing the model to focus on different aspects of the input sequence.
Position-wise Feed-Forward Networks
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Two linear layers with a ReLU in between, applied at every position
        return self.fc2(self.relu(self.fc1(x)))
Positional Encoding
Positional Encoding is used to inject the position information of each token in the
input sequence.
It uses sine and cosine functions of different frequencies to generate the
positional encoding.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cosine
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add the (fixed) positional encodings to the token embeddings
        return x + self.pe[:, :x.size(1)]
Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Self-attention sublayer with residual connection and layer norm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward sublayer with residual connection and layer norm
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))
Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Masked self-attention over the target sequence
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention over the encoder output
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_output, enc_output, src_mask)))
        # Position-wise feed-forward sublayer
        return self.norm3(x + self.dropout(self.feed_forward(x)))
These operations enable the decoder to generate target sequences based on the
input and the encoder output.
Now, let’s combine the Encoder and Decoder layers to create the complete
Transformer model.
Transformer Model
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads,
                 num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout)
                                             for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout)
                                             for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        # Padding masks (token id 0 is padding) plus a causal mask for the decoder
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        return src_mask, tgt_mask & nopeak_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)
        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)
        output = self.fc(dec_output)
        return output
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1
Now we’ll instantiate the model and train it on random sample data. In practice, you would use a
larger real dataset and split it into training and validation sets.

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads,
                          num_layers, d_ff, max_seq_length, dropout)

# Random integer token sequences stand in for a real parallel corpus
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
transformer.train()
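A short training-loop sketch using teacher forcing (the decoder input is the target shifted right by one position):

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])   # predict the next tokens
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size),
                     tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {loss.item():.4f}")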
Conclusion
In this way we can build a simple Transformer from scratch in PyTorch. Large language models are
built from these same Transformer encoder or decoder blocks, so understanding the network that
started it all is extremely important.
Experiment No: 05
Title of Experiment: Morphology is the study of the way words are built up from smaller
meaning bearing units. Study and understand the concepts of morphology by the use of add
delete table
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
Signature of
Subject
Teacher
Title of the Assignment: Morphology is the study of the way words are built up from
smaller meaning bearing units. Study and understand the concepts of morphology by the use
of add delete table
Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit; for example, "played" is built
from the two morphemes "play" and "-ed".
Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis, as illustrated below.
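A small illustrative add-delete table for English roots (the feature labels are indicative):

Word      Root    Delete    Add     Features
plays     play    -         -s      third person singular, present
played    play    -         -ed     past tense
playing   play    -         -ing    progressive
studies   study   -y        -ies    third person singular, present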
Morph Analyser
Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can be either a
root word (play) or an affix (-ed); combining morphemes is called a morphological process. So the
word "played" is made of the two morphemes "play" and "-ed".
Finding all the parts of a word (its morphemes) and thereby describing the properties of the word
is called "Morphological Analysis". For example, "played" carries the information verb "play" and
"past tense", so the given word is the past tense form of the verb "play".
Analysis of a word:
बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix) (ओं = 3 plural oblique)
A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (e.g., number, case) and arranged
into tables.
Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same
paradigm class as बच्च, so लड़का behaves similarly to बच्चा, as they share the same
paradigm class.
Types of morphology
There are two types of morphological relations: inflectional and derivational. When an
inflectional affix is added to a stem word, a new form of the stem word is produced. When a
derivational affix is added to a stem word, a new word with new meaning is produced.
Affixes, such as prefixes and suffixes, are bound morphemes, and are different from free
morphemes. Free morphemes are lexical units, and when two free morphemes are put
together, a compound word is produced.
Morphology also deals with meaning to a degree, but only in as much as the smaller sub-
word units of language can carry meaning. To examine the meaning of anything larger than a
morpheme would fall under the domain of semantics.
Conclusion
Morphology in linguistics is the study of word structures and the relationship between these
structures. Morphology examines how words are formed and varied. We are able to
understand the morphology of a word by the use of the Add-Delete table.
Experiment No: 06
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
CO Mapped CO1,CO2,CO3,CO4
Signature of
Subject
Teacher
Title of the Assignment: Finetune a pretrained transformer for any of the
following tasks on any relevant dataset of your choice:
1) Neural Machine Translation
2) Classification
3) Summarization
Solution
1. Fine-Tuning a Transformer Model for Neural Machine Translation.
https://medium.com/@notsokarda/fine-tuning-a-transformer-model-for-neural-
machine-translation-c604a24d3376
https://github.com/rajesh-iiith/POS-Tagging-and-CYK-Parsing-for-Indian-
Languages
https://github.com/JayeshSuryavanshi/POS-Tagger-for-Hindi-Language
https://huggingface.co/swagat-panda/multilingual-pos-tagger-language-
detection-indian-context-muril
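Below is a minimal sketch of option 2 (classification) using the Hugging Face transformers and datasets libraries; the model (distilbert-base-uncased), the IMDB dataset, and the subset sizes are our own choices, not prescribed by the assignment:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # binary sentiment-classification dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="finetuned-imdb", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=encoded["test"].shuffle(seed=42).select(range(500)))
trainer.train()
print(trainer.evaluate())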