0% found this document useful (0 votes)

52 views33 pages

All Practicals

Practicals of AIML branch

Uploaded by

Ayesha Shaikh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

52 views33 pages

All Practicals

Practicals of AIML branch

Uploaded by

Ayesha Shaikh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Practical 1 : C Program Code

Task:
Write a C Program Code to,

Read a sample .TXT file and print as it is on screen

Count the number of words / tokens in the file.

Count the number of unique words / tokens in the file.

Count the occurrence frequency of any specific word token (e.g. “AAB”)

Count the occurrence frequency of all the unique words / tokens in the file.

Submitted By: AI4116 Ayesha Shaikh

Code:
#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#define BUFFER_SIZE 1000

int countOccurrences(FILE fptr, const char word);

int main()

FILE *fptr;

char path[1000];

char word[1000];

int wCount;
printf("Enter file path: ");

scanf("%s", path);

printf("Enter word to search in file: ");

scanf("%s", word);

fptr = fopen(path, "r");

if (fptr == NULL)

printf("Unable to open file.\n");

printf("Please check you have read/write previleges.\n");

exit(EXIT_FAILURE);

wCount = countOccurrences(fptr, word);

printf("'%s' is found %d times in file.", word, wCount);

fclose(fptr);

return 0;

int countOccurrences(FILE fptr, const char word)

char str[BUFFER_SIZE];
char *pos;

int index, count;

count = 0;

while ((fgets(str, BUFFER_SIZE, fptr)) != NULL)

index = 0;

while ((pos = strstr(str + index, word)) != NULL)

index = (pos - str) + 1;

count++;

return count;

#include <stdio.h>

#include <string.h>

int main()

{
FILE* filePointer;

char dataToBeRead[1000];

filePointer = fopen("[Link]", "r");

if (filePointer == NULL) {

printf("cyber file failed to open.");

else {

printf("The file is now opened.\n");

printf("---------------------------\n");

while (fgets(dataToBeRead, 1000, filePointer)

!= NULL) {

printf("%s", dataToBeRead);

fclose(filePointer);

return 0;

Conclusion: The program successfully displayed the content of the sample .TXT file, provided
the counts of total words and unique words, reported the frequency of the specified word
token ("AAB"), and presented the occurrence frequency of all unique words in the file.

Date of Performance Date of Assessment Remark and Sign

08/08/23 22/08/23
Practical 2 : Text Preprocessing
Task:
Take any arbitrary string and perform the following task on it:

Identify the language of it

Count the length of the string

Count the number of tokens in the string (using split function and word tokenizer)

Count the unique tokens in the string

Take any news corpus, Pre-process it (All functionality Needed capitalization, contraction expansion,
punctuation removal, stop words)

Calculate Term Frequency for each term in the news corpus. (Is it pointing to the Topic of the corpus?)

Show any one example illustrating lexical richness of the text.

Submitted By: AI4116 Ayesha Shaikh

Code
# pip install langdetect nltk
import nltk

[Link]('punkt')
[Link]('stopwords')
!pip install langdetect

from langdetect import detect

from [Link] import word_tokenize
from [Link] import FreqDist
from [Link] import stopwords
import string
import nltk

text = "Hi Darshan How are you ?"

text2 = "हाय आप कैसे हैं?"

# 1. Identify langauge of it.

def identify_language(text):
lang = detect(text)
return lang

print("Language:", identify_language(text))
print("Language:", identify_language(text2))

Language: en
Language: hi

# 2. Count the length of the string

def count_length(text):
return len(text)

print(count_length(text))

# 3. Count the number of tokens in the string(using split function and

word tokenizer).

def count_tokens(text):
tokens = word_tokenize(text)
return len(tokens)
print(count_tokens(text))

# 4. Count the unqiue tokens in the string.

def count_unique_tokens(text):
tokens = word_tokenize(text)
unique_tokens = set(tokens)
return len(unique_tokens)

print(count_unique_tokens(text))

def preprocess_corpus(corpus):
# Tokenize and remove stopwords and punctuation
tokens = word_tokenize(corpus)
tokens = [[Link]() for token in tokens if [Link]()]
tokens = [token for token in tokens if token not in
[Link]('english')]
return tokens

def calculate_term_frequency(tokens):
freq_dist = FreqDist(tokens)
term_freq = {word: freq for word, freq in freq_dist.items()}
return term_freq

# Step 5: Preprocess news corpus

news_corpus = "This is a sample news article. It contains some words that
we will preprocess."
preprocessed_tokens = preprocess_corpus(news_corpus)

# Step 6: Calculate term frequency

term_frequency = calculate_term_frequency(preprocessed_tokens)
print("Term Frequency:", term_frequency)

# Step 7: Lexical richness example

lexical_richness_example = "The quick brown fox jumps over the lazy dog.
This is a simple sentence."
tokens_lr = word_tokenize(lexical_richness_example)
lexical_richness = len(set(tokens_lr)) / len(tokens_lr)
print("Lexical Richness:", lexical_richness)

Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1,

'words': 1, 'preprocess': 1}
Lexical Richness: 0.9375

contractions = {
"don't": "do not",
"doesn't": "does not",
"can't": "cannot",
"won't": "will not",
"haven't": "have not",
"hasn't": "has not",
"couldn't": "could not",
"shouldn't": "should not",
"wouldn't": "would not",
"it's": "it is",
"I'm": "I am",
"you're": "you are",
"they're": "they are",
"we're": "we are"
}

def expand_contractions(text):
words = [Link]()
expanded_words = []
for word in words:
if [Link]() in contractions:
expanded_words.extend(contractions[[Link]()].split())
else:
expanded_words.append(word)
expanded_text = ' '.join(expanded_words)
return expanded_text

contraction_text = "I can't believe they're here. It's a nice day."

expanded_text = expand_contractions(contraction_text)
print("Expanded Text:", expanded_text)

Expanded Text: I cannot believe they are here. it is a nice day.

Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including

language identification, length and token counting, unique token identification, and term
frequency calculation in a news corpus, the practical demonstrates a systematic approach to
understanding and extracting valuable insights from textual data, showcasing the importance of
effective preprocessing for subsequent tasks such as topic identification and lexical richness
assessment.

Date of Performance Date of Assessment Remark and Sign

22/08/23 29/08/23
Practical 3 : WordNet for Synonym and Antonym Detection
Task:
Find the synonym /antonym of a word using WordNet.

Illustrate the difference between stemming and lemmatizing

Submitted By: AI4116 Ayesha Shaikh

Code
!pip install nltk spacy
!python -m spacy download en
[Link]('wordnet')

import nltk
from [Link] import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import spacy

words = ["running", "flies", "better","Unhappiness ","Teacher ",

"happier","slowness","friendly", "jumps","Cats","Swimming "]

# NLTK Stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()

# spaCy Lemmatizer
nlp = [Link]("en_core_web_sm")

print("Actual words:" , words)

# Stemming
print("Porter Stemmer:", [[Link](word) for word in words])
print("Snowball Stemmer:", [[Link](word) for word in words])
# Lemmatization
print("WordNet Lemmatizer:", [[Link](word) for word in
words])
# spaCy Lemmatization
lemmatized_words_spacy = [token.lemma_ for token in nlp(" ".join(words))]
print("spaCy Lemmatization:",lemmatized_words_spacy)
from [Link] import wordnet
word = "happy"
# word = "खुश"
# Get synsets (sets of synonyms) for the word
synsets = [Link](word)

if synsets:
print("Synonyms:")
for synset in synsets:
synonyms = [[Link]() for lemma in [Link]()]
print(", ".join(synonyms))

# Get antonyms from the first synset

antonyms = [[Link]()[0].name() for lemma in
synsets[0].lemmas() if [Link]()]
if antonyms:
print("Antonyms:", ", ".join(antonyms))
else:
print("No antonyms found.")
else:
print("No synsets found for the word.")

Synonyms:
happy
felicitous, happy
glad, happy
happy, well-chosen
Antonyms: unhappy

Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and

antonym detection, offering a rich lexical database, while the difference between stemming
and lemmatizing lies in their approaches to reducing words to their base or root forms, with
stemming being more aggressive in its truncation.
Date of Performance Date of Assessment Remark and Sign

29/08/23 26/09/23

Practical 4 : N-gram Model

Task:
N Gram Model: Identify probability of next word occurrence using Bi-Gram Model

Submitted By: AI4116 Ayesha Shaikh

Code
def readData():
data = ['This is a dog','This is a cat','I love my cat','This is my
name ']
dat=[]
for i in range(len(data)):
for word in data[i].split():
[Link](word)
print(dat)
return dat

def createBigram(data):
listOfBigrams = []
bigramCounts = {}
unigramCounts = {}
for i in range(len(data)-1):
if i < len(data) - 1 and data[i+1].islower():

[Link]((data[i], data[i + 1]))

if (data[i], data[i+1]) in bigramCounts:

bigramCounts[(data[i], data[i + 1])] += 1
else:
bigramCounts[(data[i], data[i + 1])] = 1
if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1
return listOfBigrams, unigramCounts, bigramCounts

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):

listOfProb = {}
for bigram in listOfBigrams:
word1 = bigram[0]
word2 = bigram[1]
listOfProb[bigram] =
([Link](bigram))/([Link](word1))
return listOfProb

if __name__ == '__main__':
data = readData()
#data = ['this','is','my','cat']
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")

print(listOfBigrams)

print("\n Bigrams along with their frequency ")

print(bigramCounts)

print("\n Unigrams along with their frequency ")

print(unigramCounts)

bigramProb = calcBigramProb(listOfBigrams, unigramCounts,

bigramCounts)

print("\n Bigrams along with their probability ")

print(bigramProb)
inputList="This is a cat"
splt=[Link]()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):

if i < len(splt) - 1:

[Link]((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probablility of sentence = ' +inputList +
str(outputProb1))

def readData():
data = ['there is a big garden' ,'children play in a garden','they
play inside beautiful garden']
dat=[]
for i in range(len(data)):
for word in data[i].split():
[Link](word)
print(dat)
return dat

if __name__ == '__main__':
data = readData()
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")

print(listOfBigrams)

print("\n Bigrams along with their frequency ")

print(bigramCounts)

print("\n Unigrams along with their frequency ")

print(unigramCounts)
bigramProb = calcBigramProb(listOfBigrams, unigramCounts,
bigramCounts)

print("\n Bigrams along with their probability ")

print(bigramProb)
inputList="they play in a big garden"
splt=[Link]()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):

if i < len(splt) - 1:

[Link]((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probablility of sentence \"\" = ' + str(outputProb1))

Task - Create any simple application using n-gram

import nltk
[Link]('punkt')

def createTrigram(data):
listOfTrigrams = []
trigramCounts = {}
bigramCounts = {}
unigramCounts = {}

for i in range(len(data) - 2):

if i < len(data) - 2 and data[i + 1].islower():
[Link]((data[i], data[i + 1], data[i + 2]))
if (data[i], data[i + 1], data[i + 2]) in trigramCounts:
trigramCounts[(data[i], data[i + 1], data[i + 2])] += 1
else:
trigramCounts[(data[i], data[i + 1], data[i + 2])] = 1

bigram = (data[i], data[i + 1])

if bigram in bigramCounts:
bigramCounts[bigram] += 1
else:
bigramCounts[bigram] = 1

if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1

return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts

# Example data
data = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog"]

# Call the function

listOfTrigrams, unigramCounts, bigramCounts, trigramCounts =
createTrigram(data)

print("List of Trigrams:", listOfTrigrams)

print("Unigram Counts:", unigramCounts)
print("Bigram Counts:", bigramCounts)
print("Trigram Counts:", trigramCounts)

List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),

('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over',
'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Unigram Counts: {'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1,
'over': 1, 'the': 1}
Bigram Counts: {('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown',
'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1,
('the', 'lazy'): 1}
Trigram Counts: {('The', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'):
1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps',
'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1}
Conclusion: The Bi-Gram Model effectively estimates the probability of the next word
occurrence based on the preceding word, providing a simple yet practical approach to language
modeling.

Date of Performance Date of Assessment Remark and Sign

26/09/23 03/10/23

Practical 5 : Word Semantics: One hot encoding, TD Matrix, TF-IDF, Word2Vec,

PPMI
Task:
Write a python code for the following task:

Take any text corpus and calculate one hot encoded vector, calculate TD matrix, TF-IDF for some token
terms, PPMI for finding corresponding word of a given word, use Word2Vec for word embedding.

Submitted By: AI4116 Ayesha Shaikh

Code
from sklearn.feature_extraction.text import TfidfVectorizer

d1= "data science is one of the most important fields of science"

d2= "this is one of the best data science courses"
d3="data scientists analyze data"

doc_corpus=[d1,d2,d3]
print(doc_corpus)
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
print("Sparse Matrix n",[Link],"n",[Link]())
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science',
'this is one of the best data science courses',
'data scientists analyze data' ]
create a word set for corpus
words_set = set()
for doc in corpus:
words = [Link](' ')
words_set = words_set.union(set(words))

print('Number of words in the corpus:',len(words_set))

print('The words in the corpus: \n', words_set)

Computing Term Frequency

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0= "data science is one of the most important fields of science"
d1= "this is one of the best data science courses"
d2="data scientists analyze data"
# merge documents into a single corpus
string = [d0, d1, d2]
# create object
tfidf = TfidfVectorizer()
# get tf-df values
result = tfidf.fit_transform(string)

# get idf values

print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values

print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print([Link]())
Conclusion: In conclusion, the implemented Python code successfully demonstrated various
word vectorization techniques, including one-hot encoding, term-document matrix (TD), term
frequency-inverse document frequency (TF-IDF), positive pointwise mutual information (PPMI),
and Word2Vec, showcasing the versatility of these methods in capturing semantic relationships
and contextual information within a given text corpus.

Date of Performance Date of Assessment Remark and Sign

3/10/23 17/10/23

Practical 6 : Sentiment Detection of Sentence

Task:
Task: Write a python code to identify sentiment of the sentence. Implement task using at least TWO
DIFFERNT APPROACHES Compre the performance of both.

Submitted By: AI4116 Ayesha Shaikh

Code
import nltk
from [Link] import SentimentIntensityAnalyzer

[Link]('vader_lexicon')

def classify_sentiment(sentence):
analyzer = SentimentIntensityAnalyzer()
sentiment_scores = analyzer.polarity_scores(sentence)

# Determine sentiment based on compound score

compound_score = sentiment_scores['compound']

if compound_score >= 0.05:

return "Positive"
elif compound_score <= -0.05:
return "Negative"
else:
return "Neutral"
# Example usage:
sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is not a good movie."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

import nltk
[Link]('opinion_lexicon')

# Define positive and negative word dictionaries from NLTK

from [Link] import opinion_lexicon

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def preprocess_sentence(sentence):
# Convert the sentence to lowercase and split it into words
words = [Link]().split()
return words

def classify_sentiment(sentence):
words = preprocess_sentence(sentence)

# Initialize sentiment score

sentiment_score = 0
for word in words:
if word in positive_words:
sentiment_score += 1
elif word in negative_words:
sentiment_score -= 1
print("Current word :: ", word , " Sentiment score :: ",
sentiment_score)
# Classify the sentiment based on the score
if sentiment_score > 0:
return "Positive"
elif sentiment_score < 0:
return "Negative"
else:
return "Neutral"

# Example usage:
'''sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
'''
sentence = "This is not a good movie."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

Current word :: this Sentiment score :: 0

Current word :: is Sentiment score :: 0
Current word :: not Sentiment score :: 0
Current word :: a Sentiment score :: 0
Current word :: good Sentiment score :: 1
Current word :: movie. Sentiment score :: 1
Sentence sentiment: Positive
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data] Package opinion_lexicon is already up-to-date!

Conclusion: The sentiment detection task was implemented using two different
approaches, and their performance was compared, revealing insights into the
effectiveness of each method for accurately identifying the sentiment of a given
sentence.

Date of Performance Date of Assessment Remark and Sign

03/10/23 17/10/23

Practical 7 : POS Tagging (HMM) + (NLTK)

Task:
Task 1: Write a code in python (using a ready function) to input some text from user and identify POS tag
of each token in it.

Task 2: How HMM can be used for POS Tagging? Illustratepython code for Transition Probability and
Emmision Probability Calculation.

Submitted By: AI4116 Ayesha Shaikh

Code

# Use simple NLTK POS tagger(Any readymade function) for identifying POS
tag of the input sentence.

[Link]('stopwords')
[Link]('punkt')
[Link]('averaged_perceptron_tagger')

import nltk
from [Link] import stopwords
from [Link] import word_tokenize, sent_tokenize
stop_words = set([Link]('english'))
# Dummy text

txt1 = "She is writing code."

# sent_tokenize is one of instances of

# PunktSentenceTokenizer from the [Link] module

tokenized = sent_tokenize(txt1)
for i in tokenized:

# Word tokenizers is used to find the words

# and punctuation in a string
wordsList = nltk.word_tokenize(i)

# removing stop words from wordList

wordsList = [w for w in wordsList if not w in stop_words]

# Using a Tagger. Which is part-of-speech

# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)

print(tagged)

# Importing libraries
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
import pprint, time

#download the treebank corpus from nltk

[Link]('treebank')

#download the universal tagset from nltk

[Link]('universal_tagset')

# reading the Treebank tagged sentences

nltk_data = list([Link].tagged_sents(tagset='universal'))

#print the first two sentences along with tags

print(nltk_data[:2])
#print each word with its respective tag for first two sentences
for sent in nltk_data[:2]:
for tuple in sent:
print(tuple)

# split data into training and validation set in the ratio 80:20
train_set,test_set
=train_test_split(nltk_data,train_size=0.80,test_size=0.20,random_state =
101)

# create list of train and test tagged words

train_tagged_words = [ tup for sent in train_set for tup in sent ]
test_tagged_words = [ tup for sent in test_set for tup in sent ]
print(len(train_tagged_words))
print(len(test_tagged_words))

# check some of the tagged words.

train_tagged_words[:5]

#use set datatype to check how many unique tags are present in training
data
tags = {tag for word,tag in train_tagged_words}
print(len(tags))
print(tags)

# check total words in vocabulary

vocab = {word for word,tag in train_tagged_words}

# compute Emission Probability

def word_given_tag(word, tag, train_bag = train_tagged_words):
tag_list = [pair for pair in train_bag if pair[1]==tag]
count_tag = len(tag_list)#total number of times the passed tag
occurred in train_bag
w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
#now calculate the total number of times the passed word occurred as the
passed tag.
count_w_given_tag = len(w_given_tag_list)

return (count_w_given_tag, count_tag)

# compute Transition Probability

def t2_given_t1(t2, t1, train_bag = train_tagged_words):
tags = [pair[1] for pair in train_bag]
count_t1 = len([t for t in tags if t==t1])
count_t2_t1 = 0
for index in range(len(tags)-1):
if tags[index]==t1 and tags[index+1] == t2:
count_t2_t1 += 1
return (count_t2_t1, count_t1)

# creating t x t transition matrix of tags, t= no of tags

# Matrix(i, j) represents P(jth tag after the ith tag)

tags_matrix = [Link]((len(tags), len(tags)), dtype='float32')

for i, t1 in enumerate(list(tags)):
for j, t2 in enumerate(list(tags)):
tags_matrix[i, j] = t2_given_t1(t2, t1)[0]/t2_given_t1(t2, t1)[1]

print(tags_matrix)

# convert the matrix to a df for better readability

#the table is same as the transition table shown in section 3 of article
tags_df = [Link](tags_matrix, columns = list(tags),
index=list(tags))
display(tags_df)

AD CO NU PR NO VE AD
DE ADJ PRT . X
V NJ M ON UN RB P
T

DE 0.00 0.01 0.20 0.00 0.02 0.00 0.01 0.00 0.63 0.04 0.00 0.04
T 6037 2074 6411 0431 2855 0287 7393 3306 5906 0247 9918 5134

AD 0.07 0.08 0.13 0.00 0.02 0.01 0.13 0.01 0.03 0.33 0.11 0.02
V 1373 1458 0721 6982 9868 4740 9255 2025 2196 9022 9472 2886

AD 0.00 0.00 0.06 0.01 0.02 0.01 0.06 0.00 0.69 0.01 0.08 0.02
J 5243 5243 3301 6893 1748 1456 6019 0194 6893 1456 0583 0971

C
0.12 0.05 0.11 0.00 0.04 0.00 0.03 0.06 0.34 0.15 0.05 0.00
O
3491 7080 3611 0549 0615 4391 5126 0373 9067 0384 5982 9330
NJ
Conclusion: In conclusion, the provided Python code utilizing NLTK allows users to input text
and obtain the corresponding Part-of-Speech (POS) tags for each token, while Hidden Markov
Models (HMM) can be employed for POS tagging through the calculation of Transition
Probability and Emission Probability in a systematic and illustrative manner.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 8 : CRF POS Tagging + LSTM POS Tagging

Task:
Task: Write a python code to assign POS tag to the input stream of words using CRF as well as LSTM.
Compare the performance of both the taggers.

Submitted By: AI4116 Ayesha Shaikh

Code:
pip install python-crfsuite

import nltk
[Link]('treebank')

import pycrfsuite
from [Link] import treebank
from sklearn.model_selection import train_test_split

# Load the Penn Treebank dataset

tagged_sentences = treebank.tagged_sents()

# Feature extraction function

def word2features(sent, i):
word = sent[i][0]
features = {
'word': word,
'is_first': i == 0,
'is_last': i == len(sent) - 1,
'prev_word': '' if i == 0 else sent[i - 1][0],
'next_word': '' if i == len(sent) - 1 else sent[i + 1][0]
}
return features

def sent2features(sent):
return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
return [label for word, label in sent]

# Prepare the data

X = [sent2features(sent) for sent in tagged_sentences]
y = [sent2labels(sent) for sent in tagged_sentences]

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train the CRF model

trainer = [Link](verbose=False)
for xseq, yseq in zip(X_train, y_train):
[Link](xseq, yseq)

trainer.set_params({
'c1': 1.0,
'c2': 1e-3,
'max_iterations': 50,
'feature.possible_transitions': True
})

model_filename = 'pos_tagger_model.crfsuite'
[Link](model_filename)

# Test the model

tagger = [Link]()
[Link](model_filename)

# Tag a sentence
example_sentence = [Link]()[0]
features = sent2features(example_sentence)
tags = [Link](features)

for word, tag in zip(example_sentence, tags):

print(f'{word}/{tag}', end=' ')

# Evaluate the model

from [Link] import classification_report

y_pred = [[Link](xseq) for xseq in X_test]

y_test_flat = [label for label_seq in y_test for label in label_seq]
y_pred_flat = [label for label_seq in y_pred for label in label_seq]

print("\nClassification Report:")
print(classification_report(y_test_flat, y_pred_flat))
This/DT is/VBZ a/DT sample/NNP sentence/NNP for/IN POS/NNP tagging./NNP

import pycrfsuite

# Load the CRF model

model_filename = 'pos_tagger_model.crfsuite'
tagger = [Link]()
[Link](model_filename)

# Sample sentence
sample_sentence = "This is a sample sentence for POS tagging."

# Tokenize the sample sentence

sample_tokens = sample_sentence.split()

# Extract features for the sample sentence

sample_features = [word2features([(word, '')], 0) for word in
sample_tokens]

# Predict POS tags for the sample sentence

predicted_tags = [Link](sample_features)

# Combine words and predicted tags

tagged_sentence = ' '.join([f'{word}/{tag}' for word, tag in
zip(sample_tokens, predicted_tags)])

# Print the tagged sentence

print(tagged_sentence)
Write python code for POS tagging using LSTM

import numpy as np
import tensorflow as tf
from [Link] import pad_sequences
from [Link] import Sequential
from [Link] import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from [Link] import LabelEncoder
from [Link] import treebank
from [Link] import word_tokenize

# Load the Penn Treebank dataset

tagged_sentences = treebank.tagged_sents()

# Create the vocabulary and encode labels

words = set([Link]() for sentence in tagged_sentences for word, tag in
sentence)
words = list(words)
[Link]("ENDPAD")
n_words = len(words)

tags = set(tag for sentence in tagged_sentences for word, tag in sentence)

n_tags = len(tags)

word2idx = {w: i for i, w in enumerate(words)}

tag2idx = {t: i for i, t in enumerate(tags)}

# Convert sentences and labels to numerical format

X = [[word2idx[[Link]()] for word, tag in sent] for sent in
tagged_sentences]
y = [[tag2idx[tag] for word, tag in sent] for sent in tagged_sentences]

# Padding sequences to a fixed length

max_len = 100
X = pad_sequences(X, maxlen=max_len, padding="post")
y = pad_sequences(y, maxlen=max_len, padding="post")

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Create and compile the LSTM model

model = Sequential()
[Link](Embedding(input_dim=n_words, output_dim=50,
input_length=max_len))
[Link](LSTM(units=100, return_sequences=True))
[Link](Dense(n_tags, activation="softmax"))

[Link](optimizer="adam", loss="sparse_categorical_crossentropy",
metrics=["accuracy"])

# Train the model

[Link](X_train, y_train, batch_size=32, epochs=5, validation_split=0.1)

# Evaluate the model

loss, accuracy = [Link](X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

sample_sentence = "This is a sample sentence for POS tagging.".split()

sample_sequence = [[Link]([Link](), word2idx["ENDPAD"]) for word
in sample_sentence]

# Pad the sequence to match the model's input shape

sample_sequence = pad_sequences([sample_sequence], maxlen=max_len,
padding="post")

predicted_tags = [Link](sample_sequence)

predicted_tags = [[Link](tag) for tag in predicted_tags[0]]

predicted_tags = [list([Link]())[list([Link]()).index(tag)]
for tag in predicted_tags]

for word, tag in zip(sample_sentence, predicted_tags):

print(f"{word}/{tag}", end=" ")

1/1 [==============================] - 0s 444ms/step

This/DT is/DT a/DT sample/DT sentence/NN for/IN POS/NN tagging./NN
Conclusion: The comparison of CRF and LSTM POS tagging reveals variations in performance,
with CRF demonstrating robust sequential labeling, while LSTM exhibits the ability to capture
intricate patterns in the input stream.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 9 : NER (Named Entity Recognition)

Task:
Task: Write a python code to identify the Named Entities from the input text.

Submitted By: AI4116 Ayesha Shaikh

Code:
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = [Link]('en_core_web_sm')

text =('''Python is an interpreted, high-level and general-purpose

programming language.
Pythons design philosophy emphasizes code readability with
its notable use of significant indentation.
Its language constructs and object-oriented approach aim to
help programmers write clear and
logical code for small and large-scale projects''')
# text2 = # copy the paragraphs from [Link]
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list([Link])
print(sentences)

# tokenization
for token in doc:
print([Link])
# print entities
ents = [([Link], e.start_char, e.end_char, e.label_) for e in [Link]]
print(ents)
# now we use displaycy function on doc2
[Link](doc, style='ent', jupyter=True)

# import modules and download packages

import nltk
[Link]('words')
[Link]('punkt')
[Link]('maxent_ne_chunker')
[Link]('averaged_perceptron_tagger')
[Link]('state_union')
from [Link] import state_union
from [Link] import PunktSentenceTokenizer

# process the text and print Named entities

# tokenization
train_text = state_union.raw()
sample_text = state_union.raw("/content/[Link]")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# function
def get_named_entity():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged, binary=False)
[Link]()
except:
pass

get_named_entity()

text =('''HI I am Atharva, I am from Aurangabad, Maharashtra, India.

Currently I am persuing B-tech degree from Deogiri college''')
# text2 = # copy the paragraphs from [Link]
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list([Link])
print(sentences)

HI
I
am
Atharva
,
I
am
from
Aurangabad
,
Maharashtra
,
India
.
Currently
I
am
persuing
B
-
tech
degree
from
Deogiri
college
[('Atharva', 8, 15, 'PERSON'), ('Aurangabad', 27, 37, 'GPE'), ('Maharashtra',
39, 50, 'GPE'), ('India', 52, 57, 'GPE'), ('Deogiri', 102, 109, 'ORG')]
HI I am Atharva PERSON , I am from Aurangabad GPE , Maharashtra GPE , India GPE . Currently I am

persuing B-tech degree from Deogiri ORG college

Conclusion: In conclusion, the provided Python code successfully performs Named Entity
Recognition (NER) on input text, accurately identifying and extracting entities such as names,
locations, and organizations.

Date of Performance Date of Assessment Remark and Sign

21/11/23

Natural Language Processing in Python - Exploring Word Frequencies With NLTK
No ratings yet
Natural Language Processing in Python - Exploring Word Frequencies With NLTK
5 pages
NLP Practical Journal for M.Sc. IT
No ratings yet
NLP Practical Journal for M.Sc. IT
22 pages
NLTK Cheatsheet for Text Analysis
No ratings yet
NLTK Cheatsheet for Text Analysis
3 pages
NLTK Text Analysis Cheatsheet
No ratings yet
NLTK Text Analysis Cheatsheet
3 pages
NLTK Text Analysis Cheatsheet
No ratings yet
NLTK Text Analysis Cheatsheet
3 pages
NLP - Record (Weeks 1-12)
No ratings yet
NLP - Record (Weeks 1-12)
41 pages
TSA Lab Manual New
No ratings yet
TSA Lab Manual New
14 pages
NLP Lab Codes Till Mod3
No ratings yet
NLP Lab Codes Till Mod3
7 pages
SPR 05 NLTK
No ratings yet
SPR 05 NLTK
18 pages
NLP Record
No ratings yet
NLP Record
23 pages
NLP Lab Manual for CSE Students
No ratings yet
NLP Lab Manual for CSE Students
28 pages
CCS369-Text and Speech Analysis Lab (1-9)
No ratings yet
CCS369-Text and Speech Analysis Lab (1-9)
37 pages
Lab1 IR
No ratings yet
Lab1 IR
14 pages
Lesson 5 NLP Libraries
No ratings yet
Lesson 5 NLP Libraries
69 pages
NLP Lab Manual for CSE Students
No ratings yet
NLP Lab Manual for CSE Students
65 pages
CCS369 - Text and Speech Analysis
No ratings yet
CCS369 - Text and Speech Analysis
31 pages
Python Text Processing Techniques
No ratings yet
Python Text Processing Techniques
13 pages
NLP Core Using NLTK: Dr. Muhammad Nouman Durrani
No ratings yet
NLP Core Using NLTK: Dr. Muhammad Nouman Durrani
42 pages
NLP Python Code Examples and Techniques
No ratings yet
NLP Python Code Examples and Techniques
16 pages
7 Exp
No ratings yet
7 Exp
6 pages
Understanding Word Tokenization in NLP
No ratings yet
Understanding Word Tokenization in NLP
6 pages
NLP Day1
No ratings yet
NLP Day1
4 pages
4 Tokenization MED
No ratings yet
4 Tokenization MED
60 pages
Ccs369 - Text and Speech Analysis - Lab Manual
100% (1)
Ccs369 - Text and Speech Analysis - Lab Manual
23 pages
NLP Lab Manual (R20)
50% (2)
NLP Lab Manual (R20)
24 pages
InfoSec Lab Manual for Students
No ratings yet
InfoSec Lab Manual for Students
25 pages
Natural Language Processing Lab Manual
No ratings yet
Natural Language Processing Lab Manual
24 pages
Bigram Analysis with PPMI in Python
No ratings yet
Bigram Analysis with PPMI in Python
4 pages
Text Processing with NLTK in Python
No ratings yet
Text Processing with NLTK in Python
16 pages
Python Regex for Text Tokenization
No ratings yet
Python Regex for Text Tokenization
20 pages
Python Search Engine Project Guide
No ratings yet
Python Search Engine Project Guide
20 pages
NLP Practical Journal
No ratings yet
NLP Practical Journal
36 pages
Module1 NLP of Vtu Autonomous College of Mysore
No ratings yet
Module1 NLP of Vtu Autonomous College of Mysore
42 pages
Tokenization and Sentence Segmentation
No ratings yet
Tokenization and Sentence Segmentation
54 pages
NLP Final Exam Overview and Techniques
No ratings yet
NLP Final Exam Overview and Techniques
32 pages
Experiment No. 8 Lexical Diversity: 1 Objective
No ratings yet
Experiment No. 8 Lexical Diversity: 1 Objective
3 pages
Essential Unix Commands and Text Processing
No ratings yet
Essential Unix Commands and Text Processing
5 pages
Implementing Python Solutions for Synonyms
No ratings yet
Implementing Python Solutions for Synonyms
5 pages
NLP PRGRM-1
No ratings yet
NLP PRGRM-1
7 pages
Deep Learning: Text Feature Extraction
No ratings yet
Deep Learning: Text Feature Extraction
102 pages
Tsa Labmanual
No ratings yet
Tsa Labmanual
26 pages
How to Install and Use NLTK in Python
No ratings yet
How to Install and Use NLTK in Python
15 pages
NLP Tokenization and Techniques Guide
No ratings yet
NLP Tokenization and Techniques Guide
3 pages
NLP Techniques: Tokenization to LSTM
No ratings yet
NLP Techniques: Tokenization to LSTM
18 pages
NLP Practical Record for M.Sc. Students
No ratings yet
NLP Practical Record for M.Sc. Students
33 pages
Tsarecord
No ratings yet
Tsarecord
22 pages
NLP Lab Manual - Final
No ratings yet
NLP Lab Manual - Final
15 pages
NLP Tokenization Techniques Guide
No ratings yet
NLP Tokenization Techniques Guide
6 pages
NLP Applications and Text Preprocessing
No ratings yet
NLP Applications and Text Preprocessing
54 pages
Text and Speech Analysis in Python
No ratings yet
Text and Speech Analysis in Python
13 pages
NLP Lab Manual for CSE Students
No ratings yet
NLP Lab Manual for CSE Students
45 pages
Natural Language Processing Course Overview
No ratings yet
Natural Language Processing Course Overview
43 pages
7 TextAnalysis
No ratings yet
7 TextAnalysis
3 pages
Ccs339 Text and Speech Analysis Lab Manual
No ratings yet
Ccs339 Text and Speech Analysis Lab Manual
51 pages
3.word Level Analysis-Tokenization Stemming
No ratings yet
3.word Level Analysis-Tokenization Stemming
8 pages
NLP Practical Journal with Python Code
No ratings yet
NLP Practical Journal with Python Code
17 pages
Frequency Distribution and Semantic Analysis
No ratings yet
Frequency Distribution and Semantic Analysis
16 pages
DSBD 7 Ass
No ratings yet
DSBD 7 Ass
9 pages
Understanding Corpora and Tokenization
No ratings yet
Understanding Corpora and Tokenization
48 pages
Comprehensive Marketing Strategy Guide
No ratings yet
Comprehensive Marketing Strategy Guide
15 pages
Mall Customer Segmentation Analysis
No ratings yet
Mall Customer Segmentation Analysis
4 pages
Understanding Transport Layer Services
No ratings yet
Understanding Transport Layer Services
4 pages
HostelBuddy SRS: Web Portal Overview
No ratings yet
HostelBuddy SRS: Web Portal Overview
30 pages
SX-5 Starburst® Searchlight
100% (1)
SX-5 Starburst® Searchlight
62 pages
Utmost Living by Tim Storey - Excerpt
No ratings yet
Utmost Living by Tim Storey - Excerpt
25 pages
Class XI English Core Unit Test Paper
No ratings yet
Class XI English Core Unit Test Paper
3 pages
Transforming Health Education to Address Inequalities
No ratings yet
Transforming Health Education to Address Inequalities
11 pages
Akinsete Nwanshili Isehunwa
No ratings yet
Akinsete Nwanshili Isehunwa
13 pages
IGCSE Study Resources Channel
No ratings yet
IGCSE Study Resources Channel
122 pages
Risk Management
No ratings yet
Risk Management
20 pages
Peppy DPP - 2-1
No ratings yet
Peppy DPP - 2-1
9 pages
CM Plan - Draft
No ratings yet
CM Plan - Draft
8 pages
Soil Dynamics and Foundation Modeling: Offshore and Earthquake Engineering 1st Edition Junbo Jia (Auth.) eBook desktop pdf version
100% (2)
Soil Dynamics and Foundation Modeling: Offshore and Earthquake Engineering 1st Edition Junbo Jia (Auth.) eBook desktop pdf version
42 pages
The Analysis of Runge Phenomenon
No ratings yet
The Analysis of Runge Phenomenon
29 pages
History of The Gothic American Gothic American Gothic 1st Edition Charles L. Crow Ebook Formatted For Tablet
100% (4)
History of The Gothic American Gothic American Gothic 1st Edition Charles L. Crow Ebook Formatted For Tablet
119 pages
Rizal's Vision for the Philippines' Future
No ratings yet
Rizal's Vision for the Philippines' Future
2 pages
Math Quiz: Triangle and Geometry Questions
No ratings yet
Math Quiz: Triangle and Geometry Questions
8 pages
Chemical Engineering Resources Overview
No ratings yet
Chemical Engineering Resources Overview
2 pages
Year 8 English - Reading Comprehension & Annotation
No ratings yet
Year 8 English - Reading Comprehension & Annotation
4 pages
Water Quality Assessment in African Rivers
No ratings yet
Water Quality Assessment in African Rivers
10 pages
1st Year Syllabus (ME Stream) After BoS
No ratings yet
1st Year Syllabus (ME Stream) After BoS
24 pages
2024.1 PetaLinux OS Package List
100% (1)
2024.1 PetaLinux OS Package List
20 pages
Urban Soil Ecosystem Services Analysis
No ratings yet
Urban Soil Ecosystem Services Analysis
8 pages
Typography Learning Guide by Typodeep
No ratings yet
Typography Learning Guide by Typodeep
29 pages
Multicultural Education: Key Principles
No ratings yet
Multicultural Education: Key Principles
1 page
SIMIT enUS en-US PDF
No ratings yet
SIMIT enUS en-US PDF
876 pages
Commercial Culture
100% (1)
Commercial Culture
6 pages
Operations Research 2nd Edition Col. D.S. Cheema
No ratings yet
Operations Research 2nd Edition Col. D.S. Cheema
380 pages
African NGOs in Development Support
No ratings yet
African NGOs in Development Support
9 pages
AIIMS Bibinagar Senior Resident Jobs 2025
No ratings yet
AIIMS Bibinagar Senior Resident Jobs 2025
8 pages
ACMA Mckinsey Study 2017 Future Mobility
No ratings yet
ACMA Mckinsey Study 2017 Future Mobility
36 pages
Grammar 4.4.1-4.4.5
No ratings yet
Grammar 4.4.1-4.4.5
4 pages
DOXIADIS, Constantino Ecumenopolis Tommoro's City PDF
No ratings yet
DOXIADIS, Constantino Ecumenopolis Tommoro's City PDF
34 pages

All Practicals

Uploaded by

All Practicals

Uploaded by

Practical 1 : C Program Code

Read a sample .TXT file and print as it is on screen

Count the number of words / tokens in the file.

Count the number of unique words / tokens in the file.

Submitted By: AI4116 Ayesha Shaikh

#define BUFFER_SIZE 1000

int countOccurrences(FILE *fptr, const char *word);

printf("Enter word to search in file: ");

fptr = fopen(path, "r");

printf("Unable to open file.\n");

printf("Please check you have read/write previleges.\n");

wCount = countOccurrences(fptr, word);

printf("'%s' is found %d times in file.", word, wCount);

int countOccurrences(FILE *fptr, const char *word)

int index, count;

while ((fgets(str, BUFFER_SIZE, fptr)) != NULL)

while ((pos = strstr(str + index, word)) != NULL)

index = (pos - str) + 1;

filePointer = fopen("[Link]", "r");

printf("cyber file failed to open.");

printf("The file is now opened.\n");

while (fgets(dataToBeRead, 1000, filePointer)

Date of Performance Date of Assessment Remark and Sign

Identify the language of it

Count the length of the string

Count the unique tokens in the string

Show any one example illustrating lexical richness of the text.

Submitted By: AI4116 Ayesha Shaikh

from langdetect import detect

text = "Hi Darshan How are you ?"

# 1. Identify langauge of it.

# 2. Count the length of the string

# 3. Count the number of tokens in the string(using split function and

# 4. Count the unqiue tokens in the string.

# Step 5: Preprocess news corpus

# Step 6: Calculate term frequency

# Step 7: Lexical richness example

Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1,

contraction_text = "I can't believe they're here. It's a nice day."

Expanded Text: I cannot believe they are here. it is a nice day.

Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including

Date of Performance Date of Assessment Remark and Sign

Illustrate the difference between stemming and lemmatizing

Submitted By: AI4116 Ayesha Shaikh

words = ["running", "flies", "better","Unhappiness ","Teacher ",

print("Actual words:" , words)

# Get antonyms from the first synset

Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and

Practical 4 : N-gram Model

Submitted By: AI4116 Ayesha Shaikh

[Link]((data[i], data[i + 1]))

if (data[i], data[i+1]) in bigramCounts:

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):

print("\n All the possible Bigrams are ")

print("\n Bigrams along with their frequency ")

print("\n Unigrams along with their frequency ")

bigramProb = calcBigramProb(listOfBigrams, unigramCounts,

print("\n Bigrams along with their probability ")

for i in range(len(splt) - 1):

[Link]((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print("\n All the possible Bigrams are ")

print("\n Bigrams along with their frequency ")

print("\n Unigrams along with their frequency ")

print("\n Bigrams along with their probability ")

for i in range(len(splt) - 1):

[Link]((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

Task - Create any simple application using n-gram

for i in range(len(data) - 2):

bigram = (data[i], data[i + 1])

return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts

# Call the function

print("List of Trigrams:", listOfTrigrams)

List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),

Date of Performance Date of Assessment Remark and Sign

int countOccurrences(FILE fptr, const char word);

int countOccurrences(FILE fptr, const char word)