Guru Gobind Singh Foundation’s
Guru Gobind Singh College of Engineering and Research Center,
Nashik
Experiment No: 01
Title of Experiment:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library.
Use porter stemmer and snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: any sample sentence
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of
Subject Teacher
Experiment No: 01
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library.
Use porter stemmer and snowball stemmer for stemming. Use any technique for lemmatization.
What is NLTK?
NLTK is a standard Python library that provides prebuilt functions and utilities for the easy
implementation of language-processing tasks.
It is one of the most used libraries for natural language processing and computational linguistics.
NLTK is a leading platform for building Python programs to work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Data pre-processing
Data pre-processing is the process of transforming raw text into a form that a machine can work with
more easily. Some standard practices for doing that are:
1. Tokenization
Tokenization is the process of breaking text into smaller pieces called tokens. These smaller pieces can be
sentences, words, or sub-words. For example, the sentence “I won” can be tokenized into two word-tokens
“I” and “won”.
i. Word Tokenization
This involves breaking down a text into individual words. Punctuation marks and spaces are usually
used as delimiters to separate words.
Example: "The quick brown fox jumps over the lazy dog" becomes ['The', 'quick', 'brown', 'fox',
'jumps', 'over', 'the', 'lazy', 'dog'].
ii. Sentence Tokenization
In sentence tokenization, a text is split into individual sentences. Sentence boundaries are
identified using punctuation marks like periods, exclamation marks, and question marks.
Example: "Hello, how are you? I hope you are doing well." becomes ['Hello, how are you?', 'I hope
you are doing well.'].
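A brief sketch of both tokenizers using NLTK (assuming the 'punkt' tokenizer models have been downloaded):

import nltk
nltk.download('punkt')  # models used by word_tokenize / sent_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello, how are you? I hope you are doing well."
print(word_tokenize(text))  # ['Hello', ',', 'how', 'are', 'you', '?', ...]
print(sent_tokenize(text))  # ['Hello, how are you?', 'I hope you are doing well.']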
2. Punctuation Removal
Punctuation marks usually carry little information for downstream NLP tasks, so they are often removed.
A. Whitespace tokenization
This method splits text based on whitespace characters such as space, tab, or newline.
Example: "I love\tNLP" becomes ['I', 'love', 'NLP'].
B. Punctuation-based tokenization
Punctuation-based tokenization is slightly more advanced than whitespace-based tokenization: it splits
on both whitespace and punctuation, and it retains the punctuation marks as tokens. Punctuation marks
such as periods, commas, question marks, and exclamation marks serve as delimiters that separate the
text into tokens.
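A short sketch using NLTK's wordpunct_tokenize, which splits on whitespace and punctuation while keeping the punctuation (the sample sentence is our own):

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("Don't stop, keep going!"))
# ['Don', "'", 't', 'stop', ',', 'keep', 'going', '!']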
C. Treebank tokenization
The default tokenization method in NLTK uses regular expressions as defined in the
Penn Treebank (based on English text). It assumes that the text has already been split into sentences.
The Treebank Word Tokenizer is one of the tokenizers provided by the NLTK (Natural Language Toolkit) library
in Python. It is based on the Penn Treebank corpus, which is a large corpus of English text that has been
annotated with part-of-speech tags and syntactic structures.
The Treebank Word Tokenizer is designed to tokenize text according to the conventions used in the Penn
Treebank corpus. It separates words and punctuation marks into individual tokens while handling various
cases such as contractions, hyphenated words, and abbreviations.
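A brief sketch (the sample sentence is our own):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("She can't pay the $5.50 fee."))
# ['She', 'ca', "n't", 'pay', 'the', '$', '5.50', 'fee', '.']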
D. Tweet Tokenizer
The Tweet Tokenizer is a specialized tokenizer provided by NLTK (Natural Language Toolkit) that is specifically
designed for tokenizing tweets. Tweets often contain unique characteristics such as hashtags, mentions, URLs,
and emoticons, which may require special handling during tokenization.
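A brief sketch; strip_handles removes mentions and reduce_len shortens long runs of repeated characters:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize("@user Loooove this!!! #NLP :-)"))
# ['Looove', 'this', '!', '!', '!', '#NLP', ':-)']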
E. MWE Tokenizer
The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has
been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.
MWE (Multi-Word Expression) Tokenizer is a tokenizer that is specialized in identifying and tokenizing multi-
word expressions or phrases as single tokens. Multi-word expressions are combinations of words that have a
collective meaning beyond the sum of their individual meanings. Examples include "New York City," "ice
cream," "kick the bucket," and so on.
NLTK provides this as MWETokenizer in the nltk.tokenize module: given a list of multi-word expressions
and text that has already been tokenized, it merges the tokens that make up each expression into a
single token, as sketched below.
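A brief sketch; the expressions and the separator are our own choices:

from nltk.tokenize import MWETokenizer, word_tokenize

tokenizer = MWETokenizer([('New', 'York', 'City'), ('ice', 'cream')], separator='_')
print(tokenizer.tokenize(word_tokenize("I ate ice cream in New York City")))
# ['I', 'ate', 'ice_cream', 'in', 'New_York_City']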
A. Stemming
Stemming is the process of reducing morphological variants of a word to a common root or base form
(the stem). Programs that do this are commonly referred to as stemming algorithms or stemmers. Often,
when searching text for a certain keyword, it helps if the search also returns variations of the word.
For instance, searching for “boat” might also return “boats” and “boating”; here, “boat” is the stem
for [boat, boater, boating, boats]. Stemming is a somewhat crude method for cataloguing related words:
it essentially chops letters off the end of a word until the stem is reached.
i) Porter Stemmer
The Porter Stemmer is a widely used stemming algorithm, particularly in English, developed by Martin Porter.
Stemming is the process of reducing words to their base or root form. This is particularly useful in natural
language processing tasks where variations of words need to be treated as the same word.
ii) Snowball Stemmer
The Snowball Stemmer, also known as the Porter2 Stemmer, is an improved version of the original Porter
Stemmer algorithm developed by Martin Porter. It is more aggressive and accurate in stemming compared to
the original Porter Stemmer.
NLTK (Natural Language Toolkit) provides an implementation of the Snowball Stemmer, which supports
stemming for multiple languages, not just English. The Snowball Stemmer is capable of handling various
language-specific stemming rules.
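A short sketch comparing the two stemmers on sample words:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
for word in ['boating', 'boats', 'generously', 'fairly']:
    print(word, porter.stem(word), snowball.stem(word))
# e.g. 'generously' -> 'gener' (Porter) vs 'generous' (Snowball),
# and 'fairly' -> 'fairli' (Porter) vs 'fair' (Snowball)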
B. Lemmatization
Lemmatization is a process in natural language processing (NLP) that involves reducing words to their base or
canonical form, known as the lemma. Unlike stemming, which chops off suffixes to derive the root form of
words, lemmatization considers the context and morphological analysis to return a dictionary form or lemma
of a word.
The main advantage of lemmatization over stemming is that it produces real words. Stemming may sometimes
result in non-words or words that are not valid in the language.
For nouns, lemmatization aims to return the singular form of the word. For example, "mice" would be
lemmatized to "mouse".
For verbs, lemmatization aims to return the base or infinitive form of the verb. For example, "running"
would be lemmatized to "run".
For adjectives and adverbs, lemmatization returns the base form. For example, "better" would be
lemmatized to "good".
NLTK (Natural Language Toolkit) provides lemmatization functionality through the WordNet Lemmatizer.
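A minimal sketch (the 'wordnet' corpus must be downloaded first):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('mice'))             # 'mouse' (noun is the default POS)
print(lemmatizer.lemmatize('running', pos='v')) # 'run'
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'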
Conclusion
Tokenization is a fundamental pre-processing step in natural language processing (NLP) pipelines because it
allows algorithms to work with text data in a structured and manageable format. Python libraries like NLTK,
spaCy, and scikit-learn offer various tokenization tools and methods to facilitate text processing tasks.
Practical No. 2: Perform bag-of-words approach
Experiment No: 02
Title of Experiment: Perform bag-of-words approach (count occurrence, normalized count occurrence),
TF-IDF on data. Create embeddings using Word2Vec.
Dataset: https://www.kaggle.com/datasets/CooperUnion/cardataset
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of
Subject Teacher
Experiment No: 02
Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data.
Create embeddings using Word2Vec.
Let us see an example of how the bag-of-words technique converts text into vectors.
Sentence 1: "Welcome to Great Learning , Now start learning"
Sentence 2: "Learning is a good practice"
The tokens of each sentence:
Sentence 1    Sentence 2
Welcome       Learning
to            is
Great         a
Learning      good
,             practice
Now
start
learning
Step 1: Go through all the words in the above text and make a list of all of the words in our model
vocabulary.
Welcome
To
Great
Learning
,
Now
start
learning
is
a
good
practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in
their cases and hence are repeated. Also, note that a comma ‘ , ’ is also taken in the list.
Because the vocabulary has 12 words, we can use a fixed-length document representation of 12,
with one position in the vector to score each word.
The scoring method used here is to mark 1 for the presence of each word and 0 for its absence.
This simple scoring method is the most widely used.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
The scoring of sentence 2 would look as follows:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in vector form:
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]
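A brief sketch of the whole experiment using scikit-learn and Gensim (the library choices are ours; note that CountVectorizer lowercases and drops punctuation and single-character tokens by default, so its vocabulary differs slightly from the hand-worked example above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

docs = ["Welcome to Great Learning, Now start learning",
        "Learning is a good practice"]

# Count occurrence (bag of words)
cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(cv.get_feature_names_out())
print(counts.toarray())

# Normalized count occurrence (each row sums to 1)
norm_counts = counts.toarray() / counts.toarray().sum(axis=1, keepdims=True)

# TF-IDF
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())

# Word2Vec embeddings trained on the tokenized sentences
tokenized = [doc.lower().split() for doc in docs]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)
print(w2v.wv['learning'])  # a 100-dimensional embedding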
Conclusion
The Bag of Words model is a Natural Language Processing technique for text modelling. Whenever we apply
any algorithm in NLP, it works on numbers, and we cannot feed raw text into such an algorithm directly.
Hence, the Bag of Words model is used to pre-process text by converting it into a bag of words, which
keeps a count of the occurrences of the words in each document.
Experiment No: 03
Title of Experiment: Perform text cleaning, perform lemmatization (any method), remove stop
words (any method), label encoding. Create representations using TF-IDF. Save outputs. Dataset:
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of Attendance
(Performance):
Date of Evaluation:
Signature of Subject
Teacher
Practical No: 3
Title of the Assignment: Perform text cleaning, perform lemmatization (any method), remove stop words
(any method), label encoding. Create representations using TF-IDF. Save outputs.
Text Cleaning
Text cleaning is task-specific: one needs a clear idea of the desired end result, and should review
the data to see what exactly can be achieved.
WHAT IS LEMMATIZATION?
Lemmatization is a text pre-processing technique used in natural language processing (NLP) to reduce
a word to its root meaning so that similar forms can be identified. For example, a lemmatization
algorithm would reduce the word "better" to its root word, or lemma, "good". Common approaches include:
1. WordNet
2. WordNet (with POS tag)
3. TextBlob
4. TextBlob (with POS tag)
5. spaCy
6. TreeTagger
7. Pattern
8. Gensim
9. Stanford CoreNLP
1. Wordnet Lemmatizer
Wordnet is a publicly available lexical database that provides semantic relationships between its
words; wordnets are now available for many languages. It is one of the earliest and most commonly
used lemmatization techniques.
● It is available in the NLTK library in Python.
● Wordnet links words through semantic relations (e.g., synonyms).
2. Wordnet Lemmatizer (with POS tag)
Plain Wordnet results are not always up to the mark: words like ‘sitting’ and ‘flying’ remain unchanged
after lemmatization, because by default every word is treated as a noun rather than a verb. To overcome
this, we use POS (Part of Speech) tags: we pass a tag with each word defining its type (verb, noun,
adjective, etc.), as sketched below.
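A minimal sketch (the tag-mapping helper is our own, not part of NLTK):

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (VB*, JJ*, RB*, NN*) to Wordnet POS constants
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The birds were flying and sitting on the wires")
print([lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tag(tokens)])
# 'flying' -> 'fly', 'sitting' -> 'sit'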
3. TextBlob
TextBlob is a python library used for processing textual data. It provides a simple API to access its
methods and perform basic NLP tasks.
4. TextBlob (with POS tag)
Without appropriate POS tags, this approach has the same limitations as the plain Wordnet approach.
So we use one of the more powerful aspects of the TextBlob module, its part-of-speech tagging, to
overcome this problem.
5. spaCy
spaCy is an open-source python library that parses and “understands” large volumes of text.
Separate models are available that cater to specific languages (English, French, German, etc.).
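A brief sketch, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet")
print([token.lemma_ for token in doc])
# e.g. 'bats' -> 'bat', 'were' -> 'be', 'feet' -> 'foot'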
6. TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. The
TreeTagger has been successfully used to tag over 25 languages and is adaptable to other languages
if a manually tagged training corpus is available.
7. Pattern
Pattern is a Python package commonly used for web mining, natural language processing, machine
learning, and network analysis. Among its many useful NLP capabilities is a lemmatizer, which
Gensim builds on (see below).
8. Gensim
Gensim is designed to handle large text collections using data streaming. Its lemmatization
facility is based on the Pattern package described above.
The gensim.utils.lemmatize() function can be used for lemmatization (note that it was removed in
Gensim 4.x, so an older Gensim version is required). It uses Pattern's lemmatizer to extract
UTF8-encoded tokens in their base (lemma) form, and by default it only considers nouns, verbs,
adjectives, and adverbs (all other tokens are discarded).
9. Stanford CoreNLP
CoreNLP enables users to derive linguistic annotations for text, including token and sentence
boundaries, parts of speech, named entities, numeric and time values, dependency and constituency
parses, sentiment, quote attributions, and relations.
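Putting the assignment together, here is a minimal sketch of the full pipeline (cleaning, lemmatization, stop-word removal, label encoding, TF-IDF, saving outputs). It assumes the pickled News_dataset linked above contains 'Content' and 'Category' columns; the actual column names may differ:

import pickle
import re
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

with open('News_dataset.pickle', 'rb') as f:
    df = pickle.load(f)  # assumed to be a DataFrame with 'Content' and 'Category'

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean(text):
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())  # keep letters only
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split()
                    if w not in stop_words)           # stop-word removal + lemmas

df['clean'] = df['Content'].apply(clean)
df['label'] = LabelEncoder().fit_transform(df['Category'])  # label encoding

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['clean'])

with open('tfidf_features.pickle', 'wb') as f:   # save outputs for later use
    pickle.dump((X, df['label'].values), f)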
Conclusion
We saw the most common techniques for cleaning and processing text data. In each subsection we saw
how to remove particular kinds of content and when to remove it, with use cases, as well as the
situations to avoid when applying these techniques for text analysis.
Experiment No: 04
Title of Experiment: Create a transformer from scratch using the PyTorch library
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
Signature of
Subject
Teacher
Title of the Assignment: Create a transformer from scratch using the PyTorch library
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
Multi-Head Attention
The Multi-Head Attention mechanism computes the attention between each pair of
positions in a sequence. It consists of multiple “attention heads” that capture different
aspects of the input sequence.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projections for queries, keys, values and the final output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch_size, seq_length, _ = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        # Scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        context = torch.matmul(torch.softmax(scores, dim=-1), V)
        # Recombine the heads back to (batch, seq_len, d_model)
        batch_size, _, seq_length, _ = context.size()
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        return self.W_o(context)
The MultiHeadAttention code initializes the module with the input parameters and linear
transformation layers. It calculates attention scores, reshapes the input tensor into multiple
heads, and combines the attention outputs from all heads. The forward method computes the
multi-head self-attention, allowing the model to focus on different aspects of the input sequence.
Position-wise Feed-Forward Networks
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Two linear layers with a ReLU in between, applied at every position
        return self.fc2(self.relu(self.fc1(x)))
Positional Encoding
Positional Encoding is used to inject the position information of each token in the
input sequence.
It uses sine and cosine functions of different frequencies to generate the
positional encoding.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cosine
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add the (fixed) positional encodings to the token embeddings
        return x + self.pe[:, :x.size(1)]
Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Self-attention sublayer with residual connection and layer norm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward sublayer with residual connection and layer norm
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))
Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Masked self-attention over the target sequence
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention over the encoder output
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_output, enc_output, src_mask)))
        # Position-wise feed-forward sublayer
        return self.norm3(x + self.dropout(self.feed_forward(x)))
These operations enable the decoder to generate target sequences based on the
input and the encoder output.
Now, let’s combine the Encoder and Decoder layers to create the complete
Transformer model.
Transformer Model
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads,
                 num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout)
                                             for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout)
                                             for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        # Padding masks (token id 0 is padding) plus a causal mask for the decoder
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        return src_mask, tgt_mask & nopeak_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)
        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)
        output = self.fc(dec_output)
        return output
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1
Now we’ll instantiate the model and train it on random sample data. In practice, you would use a
larger real dataset and split it into training and validation sets.

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads,
                          num_layers, d_ff, max_seq_length, dropout)

# Random integer token sequences stand in for a real parallel corpus
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
transformer.train()
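A short training-loop sketch using teacher forcing (the decoder input is the target shifted right by one position):

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])   # predict the next tokens
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size),
                     tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {loss.item():.4f}")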
Conclusion
In this way we can build a simple Transformer from scratch in PyTorch. Large language models are
built from these same Transformer encoder or decoder blocks, so understanding the network that
started it all is extremely important.
Experiment No: 05
Title of Experiment: Morphology is the study of the way words are built up from smaller
meaning bearing units. Study and understand the concepts of morphology by the use of add
delete table
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
Signature of
Subject
Teacher
Title of the Assignment: Morphology is the study of the way words are built up from
smaller meaning bearing units. Study and understand the concepts of morphology by the use
of add delete table
Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit; for example, "played" is built
from the two morphemes "play" and "-ed".
Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis, as illustrated below.
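A small illustrative add-delete table for English roots (the feature labels are indicative):

Word      Root    Delete    Add     Features
plays     play    -         -s      third person singular, present
played    play    -         -ed     past tense
playing   play    -         -ing    progressive
studies   study   -y        -ies    third person singular, present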
Morph Analyser
Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can be either a
root word (play) or an affix (-ed); combining morphemes is called a morphological process. So the
word "played" is made of the two morphemes "play" and "-ed".
Finding all the parts of a word (its morphemes) and thereby describing the properties of the word
is called "Morphological Analysis". For example, "played" carries the information verb "play" and
"past tense", so the given word is the past tense form of the verb "play".
Analysis of a word:
बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix) (ओं = 3 plural oblique)
A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (e.g., number, case) and arranged
into tables.
Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same
paradigm class as बच्च, so लड़का behaves similarly to बच्चा, as they share the same
paradigm class.
Types of morphology
There are two types of morphological relations: inflectional and derivational. When an
inflectional affix is added to a stem word, a new form of the stem word is produced. When a
derivational affix is added to a stem word, a new word with new meaning is produced.
Affixes, such as prefixes and suffixes, are bound morphemes, and are different from free
morphemes. Free morphemes are lexical units, and when two free morphemes are put
together, a compound word is produced.
Morphology also deals with meaning to a degree, but only in as much as the smaller sub-
word units of language can carry meaning. To examine the meaning of anything larger than a
morpheme would fall under the domain of semantics.
Conclusion
Morphology in linguistics is the study of word structures and the relationship between these
structures. Morphology examines how words are formed and varied. We are able to
understand the morphology of a word by the use of the Add-Delete table.
Experiment No: 06
Student Name:
Class: BE (Computer)
Div: - Batch:
Roll No.:
Date of
Attendance
(Performance):
Date of Evaluation:
CO Mapped CO1,CO2,CO3,CO4
Signature of
Subject
Teacher
Title of the Assignment: Finetune a pretrained transformer for any of the
following tasks on any relevant dataset of your choice:
1) Neural Machine Translation
2) Classification
3) Summarization
Solution
1. Fine-Tuning a Transformer Model for Neural Machine Translation.
https://medium.com/@notsokarda/fine-tuning-a-transformer-model-for-neural-
machine-translation-c604a24d3376
https://github.com/rajesh-iiith/POS-Tagging-and-CYK-Parsing-for-Indian-
Languages
https://github.com/JayeshSuryavanshi/POS-Tagger-for-Hindi-Language
https://huggingface.co/swagat-panda/multilingual-pos-tagger-language-
detection-indian-context-muril
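Below is a minimal sketch of option 2 (classification) using the Hugging Face transformers and datasets libraries; the model (distilbert-base-uncased), the IMDB dataset, and the subset sizes are our own choices, not prescribed by the assignment:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # binary sentiment-classification dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="finetuned-imdb", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=encoded["test"].shuffle(seed=42).select(range(500)))
trainer.train()
print(trainer.evaluate())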