Practical 1 : C Program Code
Task:
Write a C Program Code to,
Read a sample .TXT file and print as it is on screen
Count the number of words / tokens in the file.
Count the number of unique words / tokens in the file.
Count the occurrence frequency of any specific word token (e.g. “AAB”)
Count the occurrence frequency of all the unique words / tokens in the file.
Submitted By: AI4116 Ayesha Shaikh
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BUFFER_SIZE 1000
int countOccurrences(FILE *fptr, const char *word);
int main()
FILE *fptr;
char path[1000];
char word[1000];
int wCount;
printf("Enter file path: ");
scanf("%s", path);
printf("Enter word to search in file: ");
scanf("%s", word);
fptr = fopen(path, "r");
if (fptr == NULL)
printf("Unable to open file.\n");
printf("Please check you have read/write previleges.\n");
exit(EXIT_FAILURE);
wCount = countOccurrences(fptr, word);
printf("'%s' is found %d times in file.", word, wCount);
fclose(fptr);
return 0;
int countOccurrences(FILE *fptr, const char *word)
char str[BUFFER_SIZE];
char *pos;
int index, count;
count = 0;
while ((fgets(str, BUFFER_SIZE, fptr)) != NULL)
index = 0;
while ((pos = strstr(str + index, word)) != NULL)
index = (pos - str) + 1;
count++;
return count;
#include <stdio.h>
#include <string.h>
int main()
{
FILE* filePointer;
char dataToBeRead[1000];
filePointer = fopen("[Link]", "r");
if (filePointer == NULL) {
printf("cyber file failed to open.");
else {
printf("The file is now opened.\n");
printf("---------------------------\n");
while (fgets(dataToBeRead, 1000, filePointer)
!= NULL) {
printf("%s", dataToBeRead);
fclose(filePointer);
return 0;
Conclusion: The program successfully displayed the content of the sample .TXT file, provided
the counts of total words and unique words, reported the frequency of the specified word
token ("AAB"), and presented the occurrence frequency of all unique words in the file.
Date of Performance Date of Assessment Remark and Sign
08/08/23 22/08/23
Practical 2 : Text Preprocessing
Task:
Take any arbitrary string and perform the following task on it:
Identify the language of it
Count the length of the string
Count the number of tokens in the string (using split function and word tokenizer)
Count the unique tokens in the string
Take any news corpus, Pre-process it (All functionality Needed capitalization, contraction expansion,
punctuation removal, stop words)
Calculate Term Frequency for each term in the news corpus. (Is it pointing to the Topic of the corpus?)
Show any one example illustrating lexical richness of the text.
Submitted By: AI4116 Ayesha Shaikh
Code
# pip install langdetect nltk
import nltk
[Link]('punkt')
[Link]('stopwords')
!pip install langdetect
from langdetect import detect
from [Link] import word_tokenize
from [Link] import FreqDist
from [Link] import stopwords
import string
import nltk
text = "Hi Darshan How are you ?"
text2 = "हाय आप कैसे हैं?"
# 1. Identify langauge of it.
def identify_language(text):
lang = detect(text)
return lang
print("Language:", identify_language(text))
print("Language:", identify_language(text2))
Language: en
Language: hi
# 2. Count the length of the string
def count_length(text):
return len(text)
print(count_length(text))
24
# 3. Count the number of tokens in the string(using split function and
word tokenizer).
def count_tokens(text):
tokens = word_tokenize(text)
return len(tokens)
print(count_tokens(text))
# 4. Count the unqiue tokens in the string.
def count_unique_tokens(text):
tokens = word_tokenize(text)
unique_tokens = set(tokens)
return len(unique_tokens)
print(count_unique_tokens(text))
def preprocess_corpus(corpus):
# Tokenize and remove stopwords and punctuation
tokens = word_tokenize(corpus)
tokens = [[Link]() for token in tokens if [Link]()]
tokens = [token for token in tokens if token not in
[Link]('english')]
return tokens
def calculate_term_frequency(tokens):
freq_dist = FreqDist(tokens)
term_freq = {word: freq for word, freq in freq_dist.items()}
return term_freq
# Step 5: Preprocess news corpus
news_corpus = "This is a sample news article. It contains some words that
we will preprocess."
preprocessed_tokens = preprocess_corpus(news_corpus)
# Step 6: Calculate term frequency
term_frequency = calculate_term_frequency(preprocessed_tokens)
print("Term Frequency:", term_frequency)
# Step 7: Lexical richness example
lexical_richness_example = "The quick brown fox jumps over the lazy dog.
This is a simple sentence."
tokens_lr = word_tokenize(lexical_richness_example)
lexical_richness = len(set(tokens_lr)) / len(tokens_lr)
print("Lexical Richness:", lexical_richness)
Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1,
'words': 1, 'preprocess': 1}
Lexical Richness: 0.9375
contractions = {
"don't": "do not",
"doesn't": "does not",
"can't": "cannot",
"won't": "will not",
"haven't": "have not",
"hasn't": "has not",
"couldn't": "could not",
"shouldn't": "should not",
"wouldn't": "would not",
"it's": "it is",
"I'm": "I am",
"you're": "you are",
"they're": "they are",
"we're": "we are"
}
def expand_contractions(text):
words = [Link]()
expanded_words = []
for word in words:
if [Link]() in contractions:
expanded_words.extend(contractions[[Link]()].split())
else:
expanded_words.append(word)
expanded_text = ' '.join(expanded_words)
return expanded_text
contraction_text = "I can't believe they're here. It's a nice day."
expanded_text = expand_contractions(contraction_text)
print("Expanded Text:", expanded_text)
Expanded Text: I cannot believe they are here. it is a nice day.
Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including
language identification, length and token counting, unique token identification, and term
frequency calculation in a news corpus, the practical demonstrates a systematic approach to
understanding and extracting valuable insights from textual data, showcasing the importance of
effective preprocessing for subsequent tasks such as topic identification and lexical richness
assessment.
Date of Performance Date of Assessment Remark and Sign
22/08/23 29/08/23
Practical 3 : WordNet for Synonym and Antonym Detection
Task:
Find the synonym /antonym of a word using WordNet.
Illustrate the difference between stemming and lemmatizing
Submitted By: AI4116 Ayesha Shaikh
Code
!pip install nltk spacy
!python -m spacy download en
[Link]('wordnet')
import nltk
from [Link] import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import spacy
words = ["running", "flies", "better","Unhappiness ","Teacher ",
"happier","slowness","friendly", "jumps","Cats","Swimming "]
# NLTK Stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
# NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()
# spaCy Lemmatizer
nlp = [Link]("en_core_web_sm")
print("Actual words:" , words)
# Stemming
print("Porter Stemmer:", [[Link](word) for word in words])
print("Snowball Stemmer:", [[Link](word) for word in words])
# Lemmatization
print("WordNet Lemmatizer:", [[Link](word) for word in
words])
# spaCy Lemmatization
lemmatized_words_spacy = [token.lemma_ for token in nlp(" ".join(words))]
print("spaCy Lemmatization:",lemmatized_words_spacy)
from [Link] import wordnet
word = "happy"
# word = "खुश"
# Get synsets (sets of synonyms) for the word
synsets = [Link](word)
if synsets:
print("Synonyms:")
for synset in synsets:
synonyms = [[Link]() for lemma in [Link]()]
print(", ".join(synonyms))
# Get antonyms from the first synset
antonyms = [[Link]()[0].name() for lemma in
synsets[0].lemmas() if [Link]()]
if antonyms:
print("Antonyms:", ", ".join(antonyms))
else:
print("No antonyms found.")
else:
print("No synsets found for the word.")
Synonyms:
happy
felicitous, happy
glad, happy
happy, well-chosen
Antonyms: unhappy
Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and
antonym detection, offering a rich lexical database, while the difference between stemming
and lemmatizing lies in their approaches to reducing words to their base or root forms, with
stemming being more aggressive in its truncation.
Date of Performance Date of Assessment Remark and Sign
29/08/23 26/09/23
Practical 4 : N-gram Model
Task:
N Gram Model: Identify probability of next word occurrence using Bi-Gram Model
Submitted By: AI4116 Ayesha Shaikh
Code
def readData():
data = ['This is a dog','This is a cat','I love my cat','This is my
name ']
dat=[]
for i in range(len(data)):
for word in data[i].split():
[Link](word)
print(dat)
return dat
def createBigram(data):
listOfBigrams = []
bigramCounts = {}
unigramCounts = {}
for i in range(len(data)-1):
if i < len(data) - 1 and data[i+1].islower():
[Link]((data[i], data[i + 1]))
if (data[i], data[i+1]) in bigramCounts:
bigramCounts[(data[i], data[i + 1])] += 1
else:
bigramCounts[(data[i], data[i + 1])] = 1
if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1
return listOfBigrams, unigramCounts, bigramCounts
def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):
listOfProb = {}
for bigram in listOfBigrams:
word1 = bigram[0]
word2 = bigram[1]
listOfProb[bigram] =
([Link](bigram))/([Link](word1))
return listOfProb
if __name__ == '__main__':
data = readData()
#data = ['this','is','my','cat']
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)
print("\n All the possible Bigrams are ")
print(listOfBigrams)
print("\n Bigrams along with their frequency ")
print(bigramCounts)
print("\n Unigrams along with their frequency ")
print(unigramCounts)
bigramProb = calcBigramProb(listOfBigrams, unigramCounts,
bigramCounts)
print("\n Bigrams along with their probability ")
print(bigramProb)
inputList="This is a cat"
splt=[Link]()
outputProb1 = 1
bilist=[]
bigrm=[]
for i in range(len(splt) - 1):
if i < len(splt) - 1:
[Link]((splt[i], splt[i + 1]))
print("\n The bigrams in given sentence are ")
print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:
outputProb1 *= bigramProb[bilist[i]]
else:
outputProb1 *= 0
print('\n' + 'Probablility of sentence = ' +inputList +
str(outputProb1))
def readData():
data = ['there is a big garden' ,'children play in a garden','they
play inside beautiful garden']
dat=[]
for i in range(len(data)):
for word in data[i].split():
[Link](word)
print(dat)
return dat
if __name__ == '__main__':
data = readData()
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)
print("\n All the possible Bigrams are ")
print(listOfBigrams)
print("\n Bigrams along with their frequency ")
print(bigramCounts)
print("\n Unigrams along with their frequency ")
print(unigramCounts)
bigramProb = calcBigramProb(listOfBigrams, unigramCounts,
bigramCounts)
print("\n Bigrams along with their probability ")
print(bigramProb)
inputList="they play in a big garden"
splt=[Link]()
outputProb1 = 1
bilist=[]
bigrm=[]
for i in range(len(splt) - 1):
if i < len(splt) - 1:
[Link]((splt[i], splt[i + 1]))
print("\n The bigrams in given sentence are ")
print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:
outputProb1 *= bigramProb[bilist[i]]
else:
outputProb1 *= 0
print('\n' + 'Probablility of sentence \"\" = ' + str(outputProb1))
Task - Create any simple application using n-gram
import nltk
[Link]('punkt')
def createTrigram(data):
listOfTrigrams = []
trigramCounts = {}
bigramCounts = {}
unigramCounts = {}
for i in range(len(data) - 2):
if i < len(data) - 2 and data[i + 1].islower():
[Link]((data[i], data[i + 1], data[i + 2]))
if (data[i], data[i + 1], data[i + 2]) in trigramCounts:
trigramCounts[(data[i], data[i + 1], data[i + 2])] += 1
else:
trigramCounts[(data[i], data[i + 1], data[i + 2])] = 1
bigram = (data[i], data[i + 1])
if bigram in bigramCounts:
bigramCounts[bigram] += 1
else:
bigramCounts[bigram] = 1
if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1
return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts
# Example data
data = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog"]
# Call the function
listOfTrigrams, unigramCounts, bigramCounts, trigramCounts =
createTrigram(data)
print("List of Trigrams:", listOfTrigrams)
print("Unigram Counts:", unigramCounts)
print("Bigram Counts:", bigramCounts)
print("Trigram Counts:", trigramCounts)
List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over',
'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Unigram Counts: {'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1,
'over': 1, 'the': 1}
Bigram Counts: {('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown',
'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1,
('the', 'lazy'): 1}
Trigram Counts: {('The', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'):
1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps',
'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1}
Conclusion: The Bi-Gram Model effectively estimates the probability of the next word
occurrence based on the preceding word, providing a simple yet practical approach to language
modeling.
Date of Performance Date of Assessment Remark and Sign
26/09/23 03/10/23
Practical 5 : Word Semantics: One hot encoding, TD Matrix, TF-IDF, Word2Vec,
PPMI
Task:
Write a python code for the following task:
Take any text corpus and calculate one hot encoded vector, calculate TD matrix, TF-IDF for some token
terms, PPMI for finding corresponding word of a given word, use Word2Vec for word embedding.
Submitted By: AI4116 Ayesha Shaikh
Code
from sklearn.feature_extraction.text import TfidfVectorizer
d1= "data science is one of the most important fields of science"
d2= "this is one of the best data science courses"
d3="data scientists analyze data"
doc_corpus=[d1,d2,d3]
print(doc_corpus)
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
print("Sparse Matrix n",[Link],"n",[Link]())
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science',
'this is one of the best data science courses',
'data scientists analyze data' ]
create a word set for corpus
words_set = set()
for doc in corpus:
words = [Link](' ')
words_set = words_set.union(set(words))
print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)
Computing Term Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
# assign documents
d0= "data science is one of the most important fields of science"
d1= "this is one of the best data science courses"
d2="data scientists analyze data"
# merge documents into a single corpus
string = [d0, d1, d2]
# create object
tfidf = TfidfVectorizer()
# get tf-df values
result = tfidf.fit_transform(string)
# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
print(ele1, ':', ele2)
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
# display tf-idf values
print('\ntf-idf value:')
print(result)
# in matrix form
print('\ntf-idf values in matrix form:')
print([Link]())
Conclusion: In conclusion, the implemented Python code successfully demonstrated various
word vectorization techniques, including one-hot encoding, term-document matrix (TD), term
frequency-inverse document frequency (TF-IDF), positive pointwise mutual information (PPMI),
and Word2Vec, showcasing the versatility of these methods in capturing semantic relationships
and contextual information within a given text corpus.
Date of Performance Date of Assessment Remark and Sign
3/10/23 17/10/23
Practical 6 : Sentiment Detection of Sentence
Task:
Task: Write a python code to identify sentiment of the sentence. Implement task using at least TWO
DIFFERNT APPROACHES Compre the performance of both.
Submitted By: AI4116 Ayesha Shaikh
Code
import nltk
from [Link] import SentimentIntensityAnalyzer
[Link]('vader_lexicon')
def classify_sentiment(sentence):
analyzer = SentimentIntensityAnalyzer()
sentiment_scores = analyzer.polarity_scores(sentence)
# Determine sentiment based on compound score
compound_score = sentiment_scores['compound']
if compound_score >= 0.05:
return "Positive"
elif compound_score <= -0.05:
return "Negative"
else:
return "Neutral"
# Example usage:
sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
sentence = "This is a terrible experience."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
sentence = "This is a neutral sentence."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
sentence = "This is not a good movie."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
import nltk
[Link]('opinion_lexicon')
# Define positive and negative word dictionaries from NLTK
from [Link] import opinion_lexicon
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())
def preprocess_sentence(sentence):
# Convert the sentence to lowercase and split it into words
words = [Link]().split()
return words
def classify_sentiment(sentence):
words = preprocess_sentence(sentence)
# Initialize sentiment score
sentiment_score = 0
for word in words:
if word in positive_words:
sentiment_score += 1
elif word in negative_words:
sentiment_score -= 1
print("Current word :: ", word , " Sentiment score :: ",
sentiment_score)
# Classify the sentiment based on the score
if sentiment_score > 0:
return "Positive"
elif sentiment_score < 0:
return "Negative"
else:
return "Neutral"
# Example usage:
'''sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
sentence = "This is a terrible experience."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
sentence = "This is a neutral sentence."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
'''
sentence = "This is not a good movie."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
Current word :: this Sentiment score :: 0
Current word :: is Sentiment score :: 0
Current word :: not Sentiment score :: 0
Current word :: a Sentiment score :: 0
Current word :: good Sentiment score :: 1
Current word :: movie. Sentiment score :: 1
Sentence sentiment: Positive
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data] Package opinion_lexicon is already up-to-date!
Conclusion: The sentiment detection task was implemented using two different
approaches, and their performance was compared, revealing insights into the
effectiveness of each method for accurately identifying the sentiment of a given
sentence.
Date of Performance Date of Assessment Remark and Sign
03/10/23 17/10/23
Practical 7 : POS Tagging (HMM) + (NLTK)
Task:
Task 1: Write a code in python (using a ready function) to input some text from user and identify POS tag
of each token in it.
Task 2: How HMM can be used for POS Tagging? Illustratepython code for Transition Probability and
Emmision Probability Calculation.
Submitted By: AI4116 Ayesha Shaikh
Code
# Use simple NLTK POS tagger(Any readymade function) for identifying POS
tag of the input sentence.
[Link]('stopwords')
[Link]('punkt')
[Link]('averaged_perceptron_tagger')
import nltk
from [Link] import stopwords
from [Link] import word_tokenize, sent_tokenize
stop_words = set([Link]('english'))
# Dummy text
txt1 = "She is writing code."
# sent_tokenize is one of instances of
# PunktSentenceTokenizer from the [Link] module
tokenized = sent_tokenize(txt1)
for i in tokenized:
# Word tokenizers is used to find the words
# and punctuation in a string
wordsList = nltk.word_tokenize(i)
# removing stop words from wordList
wordsList = [w for w in wordsList if not w in stop_words]
# Using a Tagger. Which is part-of-speech
# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)
print(tagged)
# Importing libraries
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
import pprint, time
#download the treebank corpus from nltk
[Link]('treebank')
#download the universal tagset from nltk
[Link]('universal_tagset')
# reading the Treebank tagged sentences
nltk_data = list([Link].tagged_sents(tagset='universal'))
#print the first two sentences along with tags
print(nltk_data[:2])
#print each word with its respective tag for first two sentences
for sent in nltk_data[:2]:
for tuple in sent:
print(tuple)
# split data into training and validation set in the ratio 80:20
train_set,test_set
=train_test_split(nltk_data,train_size=0.80,test_size=0.20,random_state =
101)
# create list of train and test tagged words
train_tagged_words = [ tup for sent in train_set for tup in sent ]
test_tagged_words = [ tup for sent in test_set for tup in sent ]
print(len(train_tagged_words))
print(len(test_tagged_words))
# check some of the tagged words.
train_tagged_words[:5]
#use set datatype to check how many unique tags are present in training
data
tags = {tag for word,tag in train_tagged_words}
print(len(tags))
print(tags)
# check total words in vocabulary
vocab = {word for word,tag in train_tagged_words}
# compute Emission Probability
def word_given_tag(word, tag, train_bag = train_tagged_words):
tag_list = [pair for pair in train_bag if pair[1]==tag]
count_tag = len(tag_list)#total number of times the passed tag
occurred in train_bag
w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
#now calculate the total number of times the passed word occurred as the
passed tag.
count_w_given_tag = len(w_given_tag_list)
return (count_w_given_tag, count_tag)
# compute Transition Probability
def t2_given_t1(t2, t1, train_bag = train_tagged_words):
tags = [pair[1] for pair in train_bag]
count_t1 = len([t for t in tags if t==t1])
count_t2_t1 = 0
for index in range(len(tags)-1):
if tags[index]==t1 and tags[index+1] == t2:
count_t2_t1 += 1
return (count_t2_t1, count_t1)
# creating t x t transition matrix of tags, t= no of tags
# Matrix(i, j) represents P(jth tag after the ith tag)
tags_matrix = [Link]((len(tags), len(tags)), dtype='float32')
for i, t1 in enumerate(list(tags)):
for j, t2 in enumerate(list(tags)):
tags_matrix[i, j] = t2_given_t1(t2, t1)[0]/t2_given_t1(t2, t1)[1]
print(tags_matrix)
# convert the matrix to a df for better readability
#the table is same as the transition table shown in section 3 of article
tags_df = [Link](tags_matrix, columns = list(tags),
index=list(tags))
display(tags_df)
AD CO NU PR NO VE AD
DE ADJ PRT . X
V NJ M ON UN RB P
T
DE 0.00 0.01 0.20 0.00 0.02 0.00 0.01 0.00 0.63 0.04 0.00 0.04
T 6037 2074 6411 0431 2855 0287 7393 3306 5906 0247 9918 5134
AD 0.07 0.08 0.13 0.00 0.02 0.01 0.13 0.01 0.03 0.33 0.11 0.02
V 1373 1458 0721 6982 9868 4740 9255 2025 2196 9022 9472 2886
AD 0.00 0.00 0.06 0.01 0.02 0.01 0.06 0.00 0.69 0.01 0.08 0.02
J 5243 5243 3301 6893 1748 1456 6019 0194 6893 1456 0583 0971
C
0.12 0.05 0.11 0.00 0.04 0.00 0.03 0.06 0.34 0.15 0.05 0.00
O
3491 7080 3611 0549 0615 4391 5126 0373 9067 0384 5982 9330
NJ
Conclusion: In conclusion, the provided Python code utilizing NLTK allows users to input text
and obtain the corresponding Part-of-Speech (POS) tags for each token, while Hidden Markov
Models (HMM) can be employed for POS tagging through the calculation of Transition
Probability and Emission Probability in a systematic and illustrative manner.
Date of Performance Date of Assessment Remark and Sign
17/10/23 21/11/23
Practical 8 : CRF POS Tagging + LSTM POS Tagging
Task:
Task: Write a python code to assign POS tag to the input stream of words using CRF as well as LSTM.
Compare the performance of both the taggers.
Submitted By: AI4116 Ayesha Shaikh
Code:
pip install python-crfsuite
import nltk
[Link]('treebank')
import pycrfsuite
from [Link] import treebank
from sklearn.model_selection import train_test_split
# Load the Penn Treebank dataset
tagged_sentences = treebank.tagged_sents()
# Feature extraction function
def word2features(sent, i):
word = sent[i][0]
features = {
'word': word,
'is_first': i == 0,
'is_last': i == len(sent) - 1,
'prev_word': '' if i == 0 else sent[i - 1][0],
'next_word': '' if i == len(sent) - 1 else sent[i + 1][0]
}
return features
def sent2features(sent):
return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent):
return [label for word, label in sent]
# Prepare the data
X = [sent2features(sent) for sent in tagged_sentences]
y = [sent2labels(sent) for sent in tagged_sentences]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Train the CRF model
trainer = [Link](verbose=False)
for xseq, yseq in zip(X_train, y_train):
[Link](xseq, yseq)
trainer.set_params({
'c1': 1.0,
'c2': 1e-3,
'max_iterations': 50,
'feature.possible_transitions': True
})
model_filename = 'pos_tagger_model.crfsuite'
[Link](model_filename)
# Test the model
tagger = [Link]()
[Link](model_filename)
# Tag a sentence
example_sentence = [Link]()[0]
features = sent2features(example_sentence)
tags = [Link](features)
for word, tag in zip(example_sentence, tags):
print(f'{word}/{tag}', end=' ')
# Evaluate the model
from [Link] import classification_report
y_pred = [[Link](xseq) for xseq in X_test]
y_test_flat = [label for label_seq in y_test for label in label_seq]
y_pred_flat = [label for label_seq in y_pred for label in label_seq]
print("\nClassification Report:")
print(classification_report(y_test_flat, y_pred_flat))
This/DT is/VBZ a/DT sample/NNP sentence/NNP for/IN POS/NNP tagging./NNP
import pycrfsuite
# Load the CRF model
model_filename = 'pos_tagger_model.crfsuite'
tagger = [Link]()
[Link](model_filename)
# Sample sentence
sample_sentence = "This is a sample sentence for POS tagging."
# Tokenize the sample sentence
sample_tokens = sample_sentence.split()
# Extract features for the sample sentence
sample_features = [word2features([(word, '')], 0) for word in
sample_tokens]
# Predict POS tags for the sample sentence
predicted_tags = [Link](sample_features)
# Combine words and predicted tags
tagged_sentence = ' '.join([f'{word}/{tag}' for word, tag in
zip(sample_tokens, predicted_tags)])
# Print the tagged sentence
print(tagged_sentence)
Write python code for POS tagging using LSTM
import numpy as np
import tensorflow as tf
from [Link] import pad_sequences
from [Link] import Sequential
from [Link] import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from [Link] import LabelEncoder
from [Link] import treebank
from [Link] import word_tokenize
# Load the Penn Treebank dataset
tagged_sentences = treebank.tagged_sents()
# Create the vocabulary and encode labels
words = set([Link]() for sentence in tagged_sentences for word, tag in
sentence)
words = list(words)
[Link]("ENDPAD")
n_words = len(words)
tags = set(tag for sentence in tagged_sentences for word, tag in sentence)
n_tags = len(tags)
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
# Convert sentences and labels to numerical format
X = [[word2idx[[Link]()] for word, tag in sent] for sent in
tagged_sentences]
y = [[tag2idx[tag] for word, tag in sent] for sent in tagged_sentences]
# Padding sequences to a fixed length
max_len = 100
X = pad_sequences(X, maxlen=max_len, padding="post")
y = pad_sequences(y, maxlen=max_len, padding="post")
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Create and compile the LSTM model
model = Sequential()
[Link](Embedding(input_dim=n_words, output_dim=50,
input_length=max_len))
[Link](LSTM(units=100, return_sequences=True))
[Link](Dense(n_tags, activation="softmax"))
[Link](optimizer="adam", loss="sparse_categorical_crossentropy",
metrics=["accuracy"])
# Train the model
[Link](X_train, y_train, batch_size=32, epochs=5, validation_split=0.1)
# Evaluate the model
loss, accuracy = [Link](X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")
sample_sentence = "This is a sample sentence for POS tagging.".split()
sample_sequence = [[Link]([Link](), word2idx["ENDPAD"]) for word
in sample_sentence]
# Pad the sequence to match the model's input shape
sample_sequence = pad_sequences([sample_sequence], maxlen=max_len,
padding="post")
predicted_tags = [Link](sample_sequence)
predicted_tags = [[Link](tag) for tag in predicted_tags[0]]
predicted_tags = [list([Link]())[list([Link]()).index(tag)]
for tag in predicted_tags]
for word, tag in zip(sample_sentence, predicted_tags):
print(f"{word}/{tag}", end=" ")
1/1 [==============================] - 0s 444ms/step
This/DT is/DT a/DT sample/DT sentence/NN for/IN POS/NN tagging./NN
Conclusion: The comparison of CRF and LSTM POS tagging reveals variations in performance,
with CRF demonstrating robust sequential labeling, while LSTM exhibits the ability to capture
intricate patterns in the input stream.
Date of Performance Date of Assessment Remark and Sign
17/10/23 21/11/23
Practical 9 : NER (Named Entity Recognition)
Task:
Task: Write a python code to identify the Named Entities from the input text.
Submitted By: AI4116 Ayesha Shaikh
Code:
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = [Link]('en_core_web_sm')
text =('''Python is an interpreted, high-level and general-purpose
programming language.
Pythons design philosophy emphasizes code readability with
its notable use of significant indentation.
Its language constructs and object-oriented approach aim to
help programmers write clear and
logical code for small and large-scale projects''')
# text2 = # copy the paragraphs from [Link]
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list([Link])
print(sentences)
# tokenization
for token in doc:
print([Link])
# print entities
ents = [([Link], e.start_char, e.end_char, e.label_) for e in [Link]]
print(ents)
# now we use displaycy function on doc2
[Link](doc, style='ent', jupyter=True)
# import modules and download packages
import nltk
[Link]('words')
[Link]('punkt')
[Link]('maxent_ne_chunker')
[Link]('averaged_perceptron_tagger')
[Link]('state_union')
from [Link] import state_union
from [Link] import PunktSentenceTokenizer
# process the text and print Named entities
# tokenization
train_text = state_union.raw()
sample_text = state_union.raw("/content/[Link]")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
# function
def get_named_entity():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged, binary=False)
[Link]()
except:
pass
get_named_entity()
text =('''HI I am Atharva, I am from Aurangabad, Maharashtra, India.
Currently I am persuing B-tech degree from Deogiri college''')
# text2 = # copy the paragraphs from [Link]
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list([Link])
print(sentences)
# tokenization
for token in doc:
print([Link])
# print entities
ents = [([Link], e.start_char, e.end_char, e.label_) for e in [Link]]
print(ents)
# now we use displaycy function on doc2
[Link](doc, style='ent', jupyter=True)
HI
I
am
Atharva
,
I
am
from
Aurangabad
,
Maharashtra
,
India
.
Currently
I
am
persuing
B
-
tech
degree
from
Deogiri
college
[('Atharva', 8, 15, 'PERSON'), ('Aurangabad', 27, 37, 'GPE'), ('Maharashtra',
39, 50, 'GPE'), ('India', 52, 57, 'GPE'), ('Deogiri', 102, 109, 'ORG')]
HI I am Atharva PERSON , I am from Aurangabad GPE , Maharashtra GPE , India GPE . Currently I am
persuing B-tech degree from Deogiri ORG college
Conclusion: In conclusion, the provided Python code successfully performs Named Entity
Recognition (NER) on input text, accurately identifying and extracting entities such as names,
locations, and organizations.
Date of Performance Date of Assessment Remark and Sign
21/11/23