
ST. MARY’S GROUP OF INSTITUTIONS GUNTUR

(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

SCHOOL OF ENGINEERING

SKILL ORIENTED COURSE

Name of the student: __________________________________________________

Course:______________Branch:____________Reg.No:______________________

Year:_____________ Semester:_______________ Regulation:_________________

Name of Skill Oriented Course:__________________________________________


ST. MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

SCHOOL OF ENGINEERING

Certificate

This is to certify that Mr. / Ms. _______________________________ bearing roll no. _______________ of _______ B. Tech _______ semester, _______________ branch, has satisfactorily completed the Skill Oriented Course _______________________________ during the academic year _______________.

Place:                                                    Signature of Faculty

Date:

External Examination held on: ______________________

Signature of HOD

Signature of Internal Examiner                            Signature of External Examiner


ST. MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

Index

S.No    Name of the Program                                    Page No.    Date    Signature of the Faculty
1. Demonstrate noise removal for any textual data and remove patterns such as hashtags from the text using a regular expression.

Source Code:
import re

# Sample text data
text = "I love #Python and #MachineLearning! #AI is fascinating. #datascience"

# Function to remove noise using a regular expression pattern
def remove_noise(text):
    # Remove hashtags
    cleaned_text = re.sub(r'#\w+', '', text)
    # Additional noise removal steps can be added here
    return cleaned_text

# Remove noise from the text
cleaned_text = remove_noise(text)

# Print the cleaned text
print(cleaned_text)
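
Note: dropping the hashtag tokens leaves stray double spaces behind. A minimal follow-up sketch, continuing from the block above, that collapses the leftover whitespace:

# Collapse the extra spaces left after removing the hashtags
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
print(cleaned_text)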

2. Perform lemmatization and stemming using the Python library NLTK.

Source Code:
# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

# Download the required NLTK resources (needed on the first run)
nltk.download('punkt')
nltk.download('wordnet')

# Sample text data
text = "I am running and eating. I ran and ate."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]

# Perform stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token) for token in tokens]

# Print the lemmatized and stemmed words
print("Lemmatized Words:", lemmatized_words)
print("Stemmed Words:", stemmed_words)

3. Demonstrate object standardization, such as replacing social media slang in a text.

Source Code:
# Predefined mapping of social media slang to replacements
slang_mapping = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "imho": "in my humble opinion",
    "fyi": "for your information"
}

# Sample text containing social media slang
text = "omg, lol, idk what to say btw."

# Function to replace social media slang in text using the mapping
def replace_slangs(text, slang_mapping):
    words = text.split()
    replaced_words = [slang_mapping.get(word.lower(), word) for word in words]
    replaced_text = ' '.join(replaced_words)
    return replaced_text

# Perform object standardization by replacing social media slang
standardized_text = replace_slangs(text, slang_mapping)

# Print the standardized text
print(standardized_text)
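
Note: in the sample text the slang terms carry attached punctuation ("omg,", "btw."), so the exact dictionary lookup above misses them and only "idk" is replaced. A minimal regex-based variant, assuming the same slang_mapping, that matches slang as whole words and leaves punctuation intact:

import re

def replace_slangs_regex(text, slang_mapping):
    # Match any slang key as a whole word, ignoring case
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, slang_mapping)) + r')\b', re.IGNORECASE)
    return pattern.sub(lambda m: slang_mapping[m.group(0).lower()], text)

print(replace_slangs_regex(text, slang_mapping))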

4. Perform part-of-speech tagging on any textual data.

Source Code:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download the required NLTK resources (needed on the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample textual data
text = "I love to eat apples and bananas."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Print the POS-tagged tokens
print(pos_tags)
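
Note: the tags follow the Penn Treebank convention (e.g., PRP for a personal pronoun, VBP for a present-tense verb, NNS for a plural noun). A minimal sketch, continuing from the block above, that keeps only the noun tokens:

# Keep only the tokens tagged as nouns (tags starting with 'NN')
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print("Nouns:", nouns)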

5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

Source Code:
# Import necessary libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint

# Sample text data
texts = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Tokenize the text and create a dictionary
tokenized_texts = [text.lower().split() for text in texts]
dictionary = corpora.Dictionary(tokenized_texts)

# Convert tokenized texts into a document-term matrix
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

# Create an LDA model
lda_model = LdaModel(
    doc_term_matrix,
    num_topics=2,
    id2word=dictionary,
    passes=10,
    random_state=42
)

# Print the topics and associated words
pprint(lda_model.print_topics())

# Perform topic inference on a new document
new_doc = 'I enjoy eating bananas'
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
pprint(lda_model.get_document_topics(new_doc_bow))
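
Note: on such a tiny corpus, common words like "are" and "i" dominate the topics. A minimal sketch, assuming the same texts, that removes gensim's built-in stop words before building the dictionary:

from gensim.parsing.preprocessing import STOPWORDS

# Drop stop words before building the dictionary and the document-term matrix
filtered_texts = [[w for w in text.lower().split() if w not in STOPWORDS] for text in texts]
filtered_dictionary = corpora.Dictionary(filtered_texts)
filtered_matrix = [filtered_dictionary.doc2bow(tokens) for tokens in filtered_texts]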

6. Demonstrate Term Frequency-Inverse Document Frequency (TF-IDF) using Python.

Source Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (terms); get_feature_names() was removed in newer scikit-learn
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF values for each term in each document
for i in range(len(corpus)):
    print("Document", i + 1)
    for j in range(len(feature_names)):
        term = feature_names[j]
        tfidf_value = tfidf_matrix[i, j]
        print(" ", term, ":", tfidf_value)

7. Demonstrate word embeddings using word2vec.

Source Code:
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['I', 'love', 'to', 'eat', 'apples'],
    ['Apples', 'are', 'tasty', 'and', 'healthy'],
    ['I', 'prefer', 'oranges', 'over', 'apples'],
    ['Oranges', 'are', 'juicy'],
    ['Bananas', 'are', 'my', 'favorite', 'fruit']
]

# Train the Word2Vec model (the `size` parameter is named `vector_size` in gensim 4.x)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the word vector for a specific word
word_vector = model.wv['apples']

# Print the word vector
print("Word Vector for 'apples':")
print(word_vector)
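
Note: beyond looking up a single vector, the trained model can rank words by vector similarity. A minimal sketch, continuing from the block above (on such a tiny corpus the rankings are mostly noise):

# Find the words whose vectors are closest to 'apples'
print(model.wv.most_similar('apples', topn=3))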

8. Implement text classification using a Naïve Bayes classifier and the TextBlob library.

Source Code:
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

# TextBlob relies on NLTK corpora; if they are missing, run
# `python -m textblob.download_corpora` once before executing this program.

# Sample training data
train_data = [
    ('I love this movie', 'positive'),
    ('This movie is great', 'positive'),
    ('The acting was terrible', 'negative'),
    ('I did not like the plot', 'negative')
]

# Sample test data
test_data = [
    ('I enjoyed the film', 'positive'),
    ('The movie was disappointing', 'negative')
]

# Create a Naive Bayes classifier
classifier = NaiveBayesClassifier(train_data)

# Perform text classification on the test data
for text, label in test_data:
    blob = TextBlob(text, classifier=classifier)
    predicted_label = blob.classify()

    print("Text:", text)
    print("Predicted Label:", predicted_label)
    print("True Label:", label)
    print()
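
Note: TextBlob's classifier also exposes helpers for scoring a labelled test set and for inspecting which words drive the decisions. A minimal sketch, continuing from the block above:

# Overall accuracy on the labelled test data
print("Accuracy:", classifier.accuracy(test_data))

# Most informative features learned from the training data
classifier.show_informative_features(5)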

9. Apply a support vector machine (SVM) for text classification.

Source Code:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Sample text data and corresponding labels
texts = ['I love this movie', 'This movie is great', 'I dislike this movie', 'This movie is terrible']
labels = ['positive', 'positive', 'negative', 'negative']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform the text data into feature vectors
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an SVM classifier
svm_classifier = SVC()

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Predict the labels for test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
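
Note: the vectorizer and the SVM can also be chained so the same preprocessing is applied at both training and prediction time. A minimal sketch using scikit-learn's Pipeline, assuming the same texts and labels:

from sklearn.pipeline import make_pipeline

# Chain TF-IDF vectorization and the SVM into a single estimator
text_clf = make_pipeline(TfidfVectorizer(), SVC())
text_clf.fit(texts, labels)
print(text_clf.predict(['This movie is great fun']))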

10. Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.

Source Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
text1 = "I love apples"
text2 = "Apples are tasty"
text3 = "Oranges are juicy"

# Create the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
vectorized_text = vectorizer.fit_transform([text1, text2, text3])

# Calculate the cosine similarity between text1 and text2
cosine_sim = cosine_similarity(vectorized_text[0], vectorized_text[1])

# Print the cosine similarity
print("Cosine Similarity between text1 and text2:", cosine_sim[0][0])

11. Case study 1: Identify the sentiment of tweets. In this problem, you are provided with tweet data and must predict netizens' sentiment towards electronic products.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data
data = pd.read_csv('tweet_data.csv')

# Separate the features and target variable
X = data['text']
y = data['sentiment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict the sentiment for test data
y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
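
Note: once trained, the same fitted vectorizer and classifier can score a new, unseen tweet. A minimal sketch, continuing from the block above (the example tweet text is made up for illustration):

# Classify a new tweet; it must pass through the same fitted vectorizer
new_tweet = ["The battery life of this phone is amazing"]
new_features = vectorizer.transform(new_tweet)
print("Predicted sentiment:", classifier.predict(new_features)[0])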

12. Case study 2: Detect hate speech in tweets. The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. The task, therefore, is to separate racist or sexist tweets from the other tweets.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data
data = pd.read_csv('tweet_data.csv')

# Separate the features and target variable
X = data['text']
y = data['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict the labels for test data
y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
