SCHOOL OF ENGINEERING
Course: ______________ Branch: ____________ Reg. No: ______________________
Certificate
This is to certify that Mr. / Ms.
Signature of HOD
Index
S.No  Name of the Program  Page No.  Date  Signature of the Faculty
1. Demonstrate noise removal for textual data by removing regular-expression patterns, such as hashtags, from the text.
Source Code:
import re
# Sample text data
text = "I love #Python and #MachineLearning! #AI is fascinating. #datascience"
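The removal step can be completed with `re.sub`; a minimal, self-contained sketch (the whitespace clean-up line is an addition to the original listing):

```python
import re

# Sample text containing hashtags
text = "I love #Python and #MachineLearning! #AI is fascinating. #datascience"

# Remove hashtag patterns: a '#' followed by word characters
clean_text = re.sub(r"#\w+", "", text)

# Collapse the extra whitespace left behind by the removal
clean_text = re.sub(r"\s+", " ", clean_text).strip()

print(clean_text)
```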
2. Perform lemmatization and stemming on textual data.
Source Code:
# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
# Download the required NLTK resources (first run only)
nltk.download('punkt')
nltk.download('wordnet')
# Sample text data
text = "I am running and eating. I ran and ate."
# Tokenize the text into words
tokens = word_tokenize(text)
# Perform lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]
# Perform stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token) for token in tokens]
print("Lemmatized:", lemmatized_words)
print("Stemmed:", stemmed_words)
3. Demonstrate object standardization, such as replacing social media slang in a text.
Source Code:
# Predefined mapping of social media slangs to their replacements
slang_mapping = {
"lol": "laughing out loud",
"omg": "oh my god",
"btw": "by the way",
"brb": "be right back",
"idk": "I don't know",
"imho": "in my humble opinion",
"fyi": "for your information"
}
# Function to replace social media slangs from text using the mapping
def replace_slangs(text, slang_mapping):
    words = text.split()
    replaced_words = [slang_mapping.get(word.lower(), word) for word in words]
    replaced_text = ' '.join(replaced_words)
    return replaced_text
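A short usage sketch of the replacement approach above, self-contained with a trimmed-down mapping for illustration:

```python
# Trimmed slang mapping for illustration
slang_mapping = {"lol": "laughing out loud", "idk": "I don't know"}

# Replace each word that appears in the mapping, keep the rest unchanged
def replace_slangs(text, slang_mapping):
    words = text.split()
    return ' '.join(slang_mapping.get(word.lower(), word) for word in words)

result = replace_slangs("lol idk what happened", slang_mapping)
print(result)
```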
4. Demonstrate part-of-speech (POS) tagging on textual data.
Source Code:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
5. Demonstrate topic modeling using Latent Dirichlet Allocation (LDA).
Source Code:
# Import necessary libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint
6. Convert text documents to a matrix of TF-IDF features using scikit-learn.
Source Code:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus
corpus = [
'I love to eat apples',
'Apples are tasty and healthy',
'I prefer oranges over apples',
'Oranges are juicy',
'Bananas are my favorite fruit'
]
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
7. Train word vectors (Word2Vec) on a small corpus using gensim.
Source Code:
from gensim.models import Word2Vec
# Sample sentences
sentences = [
['I', 'love', 'to', 'eat', 'apples'],
['Apples', 'are', 'tasty', 'and', 'healthy'],
['I', 'prefer', 'oranges', 'over', 'apples'],
['Oranges', 'are', 'juicy'],
['Bananas', 'are', 'my', 'favorite', 'fruit']
]
8. Implement text classification using a Naive Bayes classifier from the TextBlob library.
Source Code:
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
# Small labeled training set (illustrative sample data)
train = [('I love this phone', 'pos'), ('This product is amazing', 'pos'),
         ('I hate this battery', 'neg'), ('This is a terrible purchase', 'neg')]
cl = NaiveBayesClassifier(train)
# Classify a held-out example and compare with its true label
text, label = 'The battery is terrible', 'neg'
predicted_label = cl.classify(text)
print("Text:", text)
print("Predicted Label:", predicted_label)
print("True Label:", label)
print()
9. Implement text classification using a Support Vector Machine (SVM).
Source Code:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
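These imports suggest a TF-IDF + SVM classification pipeline; a self-contained sketch with an assumed toy dataset in place of real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Illustrative labeled data; replace with a real dataset
texts = [
    'I love this movie', 'What a great film', 'Fantastic acting and plot',
    'Absolutely wonderful experience',
    'I hated this movie', 'What a terrible film', 'Awful acting and plot',
    'Absolutely dreadful experience',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize the texts, split, train a linear SVM, and evaluate
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```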
10. Convert texts to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
Source Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample texts
text1 = "I love apples"
text2 = "Apples are tasty"
text3 = "Oranges are juicy"
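Vectorizing the three texts over a shared vocabulary and comparing them pairwise can be sketched as:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
text1 = "I love apples"
text2 = "Apples are tasty"
text3 = "Oranges are juicy"

# Build term-frequency vectors over a shared vocabulary
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([text1, text2, text3])

# Pairwise cosine similarity between the three texts
sim = cosine_similarity(vectors)
print("text1 vs text2:", sim[0, 1])  # share 'apples'
print("text1 vs text3:", sim[0, 2])  # no words in common
print("text2 vs text3:", sim[1, 2])  # share 'are'
```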
11. Case study 1: Identify the sentiment of tweets. In this problem, you are provided with tweet data and must predict netizens' sentiment toward electronic products.
Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the tweet dataset (assumed file name and columns: 'tweet', 'label')
df = pd.read_csv('tweets.csv')
X = TfidfVectorizer(stop_words='english').fit_transform(df['tweet'])
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)
model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
12. Case study 2: Detect hate speech in tweets. The objective of this task is to detect hate speech in tweets. For simplicity, a tweet is said to contain hate speech if it carries a racist or sexist sentiment, so the task is to separate racist or sexist tweets from other tweets.
Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the tweet dataset (assumed file name and columns: 'tweet', 'label')
df = pd.read_csv('hate_speech_tweets.csv')
X = TfidfVectorizer(stop_words='english').fit_transform(df['tweet'])
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)
model = LinearSVC().fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(report)