
ST. MARY’S GROUP OF INSTITUTIONS GUNTUR

(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

SCHOOL OF ENGINEERING

SKILL ORIENTED COURSE

Name of the student: __________________________________________________

Course:______________Branch:____________Reg.No:______________________

Year:_____________ Semester:_______________ Regulation:_________________

Name of Skill Oriented Course:__________________________________________


ST. MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

SCHOOL OF ENGINEERING

Certificate

This is to certify that Mr. / Ms. _______________________________ bearing roll no. _______________ of _______ B. Tech _______ semester, _______________ branch, has satisfactorily completed the Skill Oriented Course _______________________________ during the academic year _______________.

Place:                                                    Signature of Faculty

Date:

External Examination held on: ______________________

Signature of HOD

Signature of Internal Examiner                            Signature of External Examiner


ST. MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212

Index

S.No    Name of the Program                                    Page No.    Date    Signature of the Faculty
1. Demonstrate noise removal for any textual data and remove patterns such as hashtags from the text using a regular expression.

Source Code:
import re

# Sample text data
text = "I love #Python and #MachineLearning! #AI is fascinating. #datascience"

# Function to remove noise using a regular expression pattern
def remove_noise(text):
    # Remove hashtags
    cleaned_text = re.sub(r'#\w+', '', text)
    # Additional noise removal steps can be added here
    return cleaned_text

# Remove noise from the text
cleaned_text = remove_noise(text)

# Print the cleaned text
print(cleaned_text)
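
Note: dropping the hashtag tokens leaves stray double spaces behind. A minimal follow-up sketch, continuing from the block above, that collapses the leftover whitespace:

# Collapse the extra spaces left after removing the hashtags
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
print(cleaned_text)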

2. Perform lemmatization and stemming using the Python library NLTK.

Source Code:
# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

# Download the required NLTK resources (needed on the first run)
nltk.download('punkt')
nltk.download('wordnet')

# Sample text data
text = "I am running and eating. I ran and ate."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]

# Perform stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token) for token in tokens]

# Print the lemmatized and stemmed words
print("Lemmatized Words:", lemmatized_words)
print("Stemmed Words:", stemmed_words)

3. Demonstrate object standardization, such as replacing social media slang in a text.

Source Code:
# Predefined mapping of social media slang to replacements
slang_mapping = {
    "lol": "laughing out loud",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "idk": "I don't know",
    "imho": "in my humble opinion",
    "fyi": "for your information"
}

# Sample text containing social media slang
text = "omg, lol, idk what to say btw."

# Function to replace social media slang in text using the mapping
def replace_slangs(text, slang_mapping):
    words = text.split()
    replaced_words = [slang_mapping.get(word.lower(), word) for word in words]
    replaced_text = ' '.join(replaced_words)
    return replaced_text

# Perform object standardization by replacing social media slang
standardized_text = replace_slangs(text, slang_mapping)

# Print the standardized text
print(standardized_text)
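
Note: in the sample text the slang terms carry attached punctuation ("omg,", "btw."), so the exact dictionary lookup above misses them and only "idk" is replaced. A minimal regex-based variant, assuming the same slang_mapping, that matches slang as whole words and leaves punctuation intact:

import re

def replace_slangs_regex(text, slang_mapping):
    # Match any slang key as a whole word, ignoring case
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, slang_mapping)) + r')\b', re.IGNORECASE)
    return pattern.sub(lambda m: slang_mapping[m.group(0).lower()], text)

print(replace_slangs_regex(text, slang_mapping))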

4. Perform part-of-speech tagging on any textual data.

Source Code:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download the required NLTK resources (needed on the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample textual data
text = "I love to eat apples and bananas."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Print the POS-tagged tokens
print(pos_tags)
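
Note: the tags follow the Penn Treebank convention (e.g., PRP for a personal pronoun, VBP for a present-tense verb, NNS for a plural noun). A minimal sketch, continuing from the block above, that keeps only the noun tokens:

# Keep only the tokens tagged as nouns (tags starting with 'NN')
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print("Nouns:", nouns)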

5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

Source Code:
# Import necessary libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint

# Sample text data
texts = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Tokenize the text and create a dictionary
tokenized_texts = [text.lower().split() for text in texts]
dictionary = corpora.Dictionary(tokenized_texts)

# Convert tokenized texts into a document-term matrix
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

# Create an LDA model
lda_model = LdaModel(
    doc_term_matrix,
    num_topics=2,
    id2word=dictionary,
    passes=10,
    random_state=42
)

# Print the topics and associated words
pprint(lda_model.print_topics())

# Perform topic inference on a new document
new_doc = 'I enjoy eating bananas'
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
pprint(lda_model.get_document_topics(new_doc_bow))
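
Note: on such a tiny corpus, common words like "are" and "i" dominate the topics. A minimal sketch, assuming the same texts, that removes gensim's built-in stop words before building the dictionary:

from gensim.parsing.preprocessing import STOPWORDS

# Drop stop words before building the dictionary and the document-term matrix
filtered_texts = [[w for w in text.lower().split() if w not in STOPWORDS] for text in texts]
filtered_dictionary = corpora.Dictionary(filtered_texts)
filtered_matrix = [filtered_dictionary.doc2bow(tokens) for tokens in filtered_texts]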

6. Demonstrate Term Frequency-Inverse Document Frequency (TF-IDF) using Python.

Source Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    'I love to eat apples',
    'Apples are tasty and healthy',
    'I prefer oranges over apples',
    'Oranges are juicy',
    'Bananas are my favorite fruit'
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (terms); get_feature_names() was removed in newer scikit-learn
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF values for each term in each document
for i in range(len(corpus)):
    print("Document", i + 1)
    for j in range(len(feature_names)):
        term = feature_names[j]
        tfidf_value = tfidf_matrix[i, j]
        print(" ", term, ":", tfidf_value)

7. Demonstrate word embeddings using word2vec.

Source Code:
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['I', 'love', 'to', 'eat', 'apples'],
    ['Apples', 'are', 'tasty', 'and', 'healthy'],
    ['I', 'prefer', 'oranges', 'over', 'apples'],
    ['Oranges', 'are', 'juicy'],
    ['Bananas', 'are', 'my', 'favorite', 'fruit']
]

# Train the Word2Vec model (the `size` parameter is named `vector_size` in gensim 4.x)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the word vector for a specific word
word_vector = model.wv['apples']

# Print the word vector
print("Word Vector for 'apples':")
print(word_vector)
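
Note: beyond looking up a single vector, the trained model can rank words by vector similarity. A minimal sketch, continuing from the block above (on such a tiny corpus the rankings are mostly noise):

# Find the words whose vectors are closest to 'apples'
print(model.wv.most_similar('apples', topn=3))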

8. Implement text classification using a Naïve Bayes classifier and the TextBlob library.

Source Code:
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

# TextBlob relies on NLTK corpora; if they are missing, run
# `python -m textblob.download_corpora` once before executing this program.

# Sample training data
train_data = [
    ('I love this movie', 'positive'),
    ('This movie is great', 'positive'),
    ('The acting was terrible', 'negative'),
    ('I did not like the plot', 'negative')
]

# Sample test data
test_data = [
    ('I enjoyed the film', 'positive'),
    ('The movie was disappointing', 'negative')
]

# Create a Naive Bayes classifier
classifier = NaiveBayesClassifier(train_data)

# Perform text classification on the test data
for text, label in test_data:
    blob = TextBlob(text, classifier=classifier)
    predicted_label = blob.classify()

    print("Text:", text)
    print("Predicted Label:", predicted_label)
    print("True Label:", label)
    print()
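
Note: TextBlob's classifier also exposes helpers for scoring a labelled test set and for inspecting which words drive the decisions. A minimal sketch, continuing from the block above:

# Overall accuracy on the labelled test data
print("Accuracy:", classifier.accuracy(test_data))

# Most informative features learned from the training data
classifier.show_informative_features(5)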

9. Apply a support vector machine (SVM) for text classification.

Source Code:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Sample text data and corresponding labels
texts = ['I love this movie', 'This movie is great', 'I dislike this movie', 'This movie is terrible']
labels = ['positive', 'positive', 'negative', 'negative']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform the text data into feature vectors
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an SVM classifier
svm_classifier = SVC()

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Predict the labels for test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
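
Note: the vectorizer and the SVM can also be chained so the same preprocessing is applied at both training and prediction time. A minimal sketch using scikit-learn's Pipeline, assuming the same texts and labels:

from sklearn.pipeline import make_pipeline

# Chain TF-IDF vectorization and the SVM into a single estimator
text_clf = make_pipeline(TfidfVectorizer(), SVC())
text_clf.fit(texts, labels)
print(text_clf.predict(['This movie is great fun']))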

10. Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.

Source Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
text1 = "I love apples"
text2 = "Apples are tasty"
text3 = "Oranges are juicy"

# Create the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
vectorized_text = vectorizer.fit_transform([text1, text2, text3])

# Calculate the cosine similarity between text1 and text2
cosine_sim = cosine_similarity(vectorized_text[0], vectorized_text[1])

# Print the cosine similarity
print("Cosine Similarity between text1 and text2:", cosine_sim[0][0])

11. Case study 1: Identify the sentiment of tweets. In this problem, you are provided with tweet data and must predict netizens' sentiment towards electronic products.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data
data = pd.read_csv('tweet_data.csv')

# Separate the features and target variable
X = data['text']
y = data['sentiment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict the sentiment for test data
y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
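
Note: once trained, the same fitted vectorizer and classifier can score a new, unseen tweet. A minimal sketch, continuing from the block above (the example tweet text is made up for illustration):

# Classify a new tweet; it must pass through the same fitted vectorizer
new_tweet = ["The battery life of this phone is amazing"]
new_features = vectorizer.transform(new_tweet)
print("Predicted sentiment:", classifier.predict(new_features)[0])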

12. Case study 2: Detect hate speech in tweets. The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. The task, therefore, is to separate racist or sexist tweets from the other tweets.

Source Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load the tweet data
data = pd.read_csv('tweet_data.csv')

# Separate the features and target variable
X = data['text']
y = data['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict the labels for test data
y_pred = classifier.predict(X_test)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
