S.No Name of the Program
1. Demonstrate noise removal on textual data by removing a regular-expression
pattern, such as hashtags, from the text.

Source Code:
import re
# Sample text data
text = "I love #Python and #MachineLearning! #AI is fascinating. #datascience"

# Function to remove noise using a regular expression pattern
def remove_noise(text):
    # Remove hashtags
    cleaned_text = re.sub(r'#\w+', '', text)
    # Additional noise removal steps can be added here
    return cleaned_text

# Remove noise from the text
cleaned_text = remove_noise(text)

# Print the cleaned text
print("Cleaned Text:", cleaned_text)
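
The "additional noise removal steps" mentioned in the comment above could be sketched as follows. This is a minimal illustration, not part of the original program: the function name remove_noise_extended and the choice of patterns (URLs, @mentions, hashtags, repeated whitespace) are assumptions.

```python
import re

def remove_noise_extended(text):
    """Hypothetical extension of remove_noise: strips URLs, mentions and hashtags."""
    text = re.sub(r'https?://\S+', '', text)   # remove URLs
    text = re.sub(r'@\w+', '', text)           # remove @mentions
    text = re.sub(r'#\w+', '', text)           # remove hashtags
    text = re.sub(r'\s+', ' ', text)           # collapse runs of whitespace
    return text.strip()

sample = "Check https://example.com @user I love #Python and #AI !"
print(remove_noise_extended(sample))
```

Each pattern is applied in its own step so that individual rules can be added or dropped without touching the others.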


2. Perform lemmatization and stemming using the Python library NLTK.

Source Code:
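# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer and WordNet data (needed once per environment;
# newer NLTK releases may also ask for 'punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
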
# Sample text data
text = "I am running and eating. I ran and ate."

# Tokenize the text into words

tokens = word_tokenize(text)

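# Perform lemmatization
# (without a pos argument each token is treated as a noun, so verb forms
# such as "running" may be left unchanged)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]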

# Perform stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token) for token in tokens]

# Print the lemmatized and stemmed words

print("Lemmatized Words:", lemmatized_words)
print("Stemmed Words:", stemmed_words)

3. Demonstrate object standardization, such as replacing social media slang in a text.

Source Code:
# Predefined mapping of social media slangs to their replacements
slang_mapping = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "imho": "in my humble opinion",
    "fyi": "for your information"
}

# Sample text containing social media slangs

text = "omg, lol, idk what to say btw."

# Function to replace social media slang in text using the mapping.
# Matching is done per word with a regular expression so that attached
# punctuation ("omg," or "btw.") does not prevent a lookup.
import re

def replace_slangs(text, slang_mapping):
    return re.sub(r"\w+", lambda m: slang_mapping.get(m.group().lower(), m.group()), text)

# Perform object standardization by replacing social media slangs

standardized_text = replace_slangs(text, slang_mapping)

# Print the standardized text
print("Standardized Text:", standardized_text)


4. Perform part of speech tagging on any textual data.

Source Code:
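# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download the tokenizer and tagger data (needed once per environment;
# newer NLTK releases may ask for differently named resources)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')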

# Sample textual data

text = "I love to eat apples and bananas."

# Tokenize the text into words

tokens = word_tokenize(text)
# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Print the POS tagged tokens
print("POS Tags:", pos_tags)


5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

Source Code:
# Import necessary libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint

# Sample text data

texts = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Tokenize the text and create a dictionary

tokenized_texts = [text.lower().split() for text in texts]
dictionary = corpora.Dictionary(tokenized_texts)

# Convert tokenized texts into a document-term matrix

doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]
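
# Create an LDA model
# (the original arguments were lost; num_topics and passes below are assumed values)
lda_model = LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=2,
    passes=10
)

# Print the topics and associated words
pprint(lda_model.print_topics())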

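# Perform topic inference on a new document
# (words missing from the dictionary are ignored by doc2bow)
new_doc = 'I enjoy eating bananas'
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
print(lda_model.get_document_topics(new_doc_bow))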

6. Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using


Source Code:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus
corpus = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus

tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (terms)
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF values for each term in each document
for i in range(len(corpus)):
    print("Document", i+1)
    for j in range(len(feature_names)):
        term = feature_names[j]
        tfidf_value = tfidf_matrix[i, j]
        print(" ", term, ":", tfidf_value)
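
To make the weighting itself visible, the calculation can be sketched without any library. This is a minimal illustration under stated assumptions: the helper names tokenize and tfidf are hypothetical, the IDF uses a smoothed form, ln((1 + N) / (1 + df)) + 1, and each document vector is L2-normalised. It is not claimed to reproduce TfidfVectorizer exactly.

```python
import math
import re
from collections import Counter

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf(corpus):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    docs = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for counts in docs for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    result = []
    for counts in docs:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        result.append({t: w / norm for t, w in weights.items()})
    return result

corpus = ["I love to eat apples", "Oranges are juicy", "Apples are tasty"]
for i, weights in enumerate(tfidf(corpus), start=1):
    print("Document", i, {t: round(w, 3) for t, w in weights.items()})
```

In the first document, "love" appears in only one document and so receives a higher weight than "apples", which appears in two.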

7. Demonstrate word embeddings using word2vec.

Source Code:
from gensim.models import Word2Vec
# Sample sentences
sentences = [
    ['I', 'love', 'to', 'eat', 'apples'],
    ['Apples', 'are', 'tasty', 'and', 'healthy'],
    ['I', 'prefer', 'oranges', 'over', 'apples'],
    ['Oranges', 'are', 'juicy'],
    ['Bananas', 'are', 'my', 'favorite', 'fruit']
]

# Train the Word2Vec model

# (gensim 4.x names this parameter vector_size; it was size in gensim 3.x)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the word vector for a specific word

word_vector = model.wv['apples']

# Print the word vector

print("Word Vector for 'apples':")
print(word_vector)

8. Implement text classification using a Naive Bayes classifier and the TextBlob library.

Source Code:
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

# Sample training data

train_data = [
    ('I love this movie', 'positive'),
    ('This movie is great', 'positive'),
    ('The acting was terrible', 'negative'),
    ('I did not like the plot', 'negative')
]

# Sample test data

test_data = [
    ('I enjoyed the film', 'positive'),
    ('The movie was disappointing', 'negative')
]

# Create a Naive Bayes classifier

classifier = NaiveBayesClassifier(train_data)

# Perform text classification on the test data
for text, label in test_data:
    blob = TextBlob(text, classifier=classifier)
    predicted_label = blob.classify()

    print("Text:", text)
    print("Predicted Label:", predicted_label)
    print("True Label:", label)
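
The idea behind the classifier can be sketched in plain Python. This is a minimal multinomial Naive Bayes with add-one smoothing, written for illustration only. It is not TextBlob's implementation, which extracts features differently, and the function names train_nb and classify_nb are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify_nb(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

train = [
    ('I love this movie', 'positive'),
    ('This movie is great', 'positive'),
    ('The acting was terrible', 'negative'),
    ('I did not like the plot', 'negative'),
]
model = train_nb(train)
print(classify_nb('great movie', model))
print(classify_nb('terrible plot', model))
```

Working in log-probabilities avoids numerical underflow when many word likelihoods are multiplied together.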

9. Apply support vector machine for text classification.

Source Code:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Sample text data and corresponding labels

texts = ['I love this movie', 'This movie is great', 'I dislike this movie', 'This movie is terrible']
labels = ['positive', 'positive', 'negative', 'negative']

# Split the data into training and testing sets

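# (the call was truncated in the source; random_state=42 is assumed,
# matching the case studies later in this list)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)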

# Create TF-IDF vectorizer

vectorizer = TfidfVectorizer()
# Transform the text data into feature vectors
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an SVM classifier

svm_classifier = SVC()

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Predict the labels for test data

y_pred = svm_classifier.predict(X_test)

# Evaluate the model

print(classification_report(y_test, y_pred))

10. Convert text to vectors (using term frequency) and apply cosine similarity to measure
the closeness between two texts.

Source Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
text1 = "I love apples"
text2 = "Apples are tasty"
text3 = "Oranges are juicy"

# Create the CountVectorizer

vectorizer = CountVectorizer()
# Fit and transform the texts
vectorized_text = vectorizer.fit_transform([text1, text2, text3])

# Calculate the cosine similarity between text1 and text2

cosine_sim = cosine_similarity(vectorized_text[0], vectorized_text[1])

# Print the cosine similarity

print("Cosine Similarity between text1 and text2:", cosine_sim[0][0])
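
The same quantity can be computed by hand, which shows what the two library calls do. This is a minimal sketch under stated assumptions: the helper names term_frequency and cosine are hypothetical, and the tokenizer keeps only lowercase tokens of two or more characters (so "I" is dropped).

```python
import math
import re
from collections import Counter

def term_frequency(text):
    # Lowercase and count tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

tf1 = term_frequency("I love apples")
tf2 = term_frequency("Apples are tasty")
tf3 = term_frequency("Oranges are juicy")
print("text1 vs text2:", cosine(tf1, tf2))
print("text1 vs text3:", cosine(tf1, tf3))
```

Here text1 and text2 share only "apples", giving 1 / (sqrt(2) * sqrt(3)), while text1 and text3 share no terms and score 0.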

11. Case study 1: Identify the sentiment of tweets. In this problem, you are provided with
tweet data and must predict netizens' sentiment toward electronic products.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data

data = pd.read_csv('tweet_data.csv')

# Separate the features and target variable

X = data['text']
y = data['sentiment']

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the Naive Bayes classifier

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict the sentiment for test data

y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)

12. Case study 2: Detect hate speech in tweets. The objective of this task is to detect hate
speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has
a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist
tweets from other tweets.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data

data = pd.read_csv('tweet_data.csv')
# Separate the features and target variable
X = data['text']
y = data['label']

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier

classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict the labels for test data

y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)

