Experiment No 4

60004190123
Vidhan Shah
CS-B/B3

Aim: To study and implement Latent Semantic Indexing (LSI).


Theory:
Latent Semantic Indexing (LSI) has long been a cause of debate amongst search marketers.
Google the term ‘latent semantic indexing’ and you will encounter advocates and sceptics in
equal measure.

LSI is a process found in Natural Language Processing (NLP). NLP is a subset of linguistics
and information engineering, with a focus on how machines interpret human language. A key
part of this study is distributional semantics. This model helps us understand and classify
words with similar contextual meanings within large data sets.

LSI uses a mathematical method that makes information retrieval more accurate. This method
works by identifying the hidden contextual relationships between words. It may help you to
break it down like this:

• Latent → Hidden
• Semantic → Relationships Between Words
• Indexing → Information Retrieval

LSI works using the partial application of Singular Value Decomposition (SVD). SVD is a
mathematical operation that reduces a matrix to its constituent parts for simple and efficient
calculations.
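
As a minimal sketch of this idea (the matrix values below are made up purely for illustration and are not part of the lab code), numpy's SVD factorises a small term-document matrix into its constituent parts, and truncating the factorisation gives the low-rank approximation that LSI works with:

import numpy as np

# Hypothetical 4-term x 3-document count matrix
A = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(S) @ Vt
# Keep only the k largest singular values to get the rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]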

When analysing a string of words, LSI removes conjunctions, pronouns, and common verbs,
also known as stop words. This isolates the words which comprise the main ‘content’ of a
phrase. A quick sketch of how this might look is given below.
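
As a rough illustration (the sample phrase and snippet are assumptions, not part of the original lab code), NLTK's English stop-word list can be used to strip the function words from a short phrase:

from nltk.corpus import stopwords

phrase = "the engine is searching for relevant documents in the index"
stops = set(stopwords.words("english"))
# Keep only the 'content' words of the phrase
content_words = [w for w in phrase.split() if w not in stops]
print(content_words)   # ['engine', 'searching', 'relevant', 'documents', 'index']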

These words are then placed in a Term Document Matrix (TDM). A TDM is a 2D grid that
lists the frequency that each specific word (or term) occurs in the documents within a data
set.
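
To make the idea concrete, scikit-learn's CountVectorizer can build such a grid from a handful of toy documents (the documents below are invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["dogs chase cats", "cats chase mice", "dogs and cats play"]
vec = CountVectorizer()
# Transpose so that rows are terms and columns are documents
tdm = pd.DataFrame(vec.fit_transform(docs).toarray().T,
                   index=vec.get_feature_names_out(),
                   columns=['doc1', 'doc2', 'doc3'])
print(tdm)
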
Code & Output:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
X_train, y_train = fetch_20newsgroups(subset='train', return_X_y=True)
X_test, y_test = fetch_20newsgroups(subset='test', return_X_y=True)

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Download the NLTK resources below if you haven't got them already
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Keep only tokens of 3 or more word characters
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
# stop_words += list(string.punctuation)
stop_words += ['__', '___']
def rmv_emails_websites(string):
    """Function removes emails, websites and numbers"""
    new_str = re.sub(r"\S+@\S+", '', string)       # email addresses
    new_str = re.sub(r"\S+\.co\S+", '', new_str)   # .com / .co.* domains
    new_str = re.sub(r"\S+\.ed\S+", '', new_str)   # .edu domains
    new_str = re.sub(r"[0-9]+", '', new_str)       # numbers
    return new_str

X_train = list(map(rmv_emails_websites, X_train))
X_test  = list(map(rmv_emails_websites, X_test))

tfidf = TfidfVectorizer(lowercase=True, 
                        stop_words=stop_words, 
                        tokenizer=tokenizer.tokenize, 
                        max_df=0.2,
                        min_df=0.02
                       )
tfidf_train_sparse = tfidf.fit_transform(X_train)
tfidf_train_df = pd.DataFrame(tfidf_train_sparse.toarray(), 
                        columns=tfidf.get_feature_names_out())
tfidf_train_df.head()

from sklearn.decomposition import TruncatedSVD
lsa_obj = TruncatedSVD(n_components=20, n_iter=100, random_state=42)
tfidf_lsa_data = lsa_obj.fit_transform(tfidf_train_df)
Sigma = lsa_obj.singular_values_
V_T = lsa_obj.components_.T

sns.barplot(x=list(range(len(Sigma))), y = Sigma)

# V_T holds the term-topic weights; index the rows by the TF-IDF vocabulary
term_topic_matrix = pd.DataFrame(data=V_T,
                                 index=tfidf_train_df.columns,
                                 columns=[f'Latent_concept_{r}' for r in range(V_T.shape[1])])

data = term_topic_matrix[f'Latent_concept_1']
data = data.sort_values(ascending=False)
top_10 = data[:10]
plt.title('Top terms along the axis of Latent concept 1')
fig = sns.barplot(x= top_10.values, y=top_10.index)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# liblinear supports both the 'l1' and 'l2' penalties used in the grid below
logreg_lsa = LogisticRegression(solver='liblinear')
logreg     = LogisticRegression(solver='liblinear')
logreg_param_grid = [{'penalty': ['l1', 'l2']},
                     {'tol': [0.0001, 0.0005, 0.001]}]
grid_lsa_log = GridSearchCV(estimator=logreg_lsa,
                            param_grid=logreg_param_grid,
                            scoring='accuracy', cv=5,
                            n_jobs=-1)
grid_log = GridSearchCV(estimator=logreg,
                        param_grid=logreg_param_grid,
                        scoring='accuracy', cv=5,
                        n_jobs=-1)
best_lsa_logreg = grid_lsa_log.fit(tfidf_lsa_data, y_train).best_estimator_
best_reg_logreg = grid_log.fit(tfidf_train_df, y_train).best_estimator_
print("Accuracy of Logistic Regression on LSA train data is :",
      best_lsa_logreg.score(tfidf_lsa_data, y_train))
print("Accuracy of Logistic Regression with standard train data is :",
      best_reg_logreg.score(tfidf_train_df, y_train))
Conclusion: Thus, in this experiment we understood the concept of latent semantic
indexing, which uses SVD to approximate the patterns of word usage across all documents.
We saw that LSI can use the relationships between words to better capture their sense, or
meaning, in a specific context, and we implemented it in Python.
