Professional Documents
Culture Documents
1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found t
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review
Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negativ
1. .csv file
2. SQLite Database
In order to load the data, We have used the SQLITE dataset as it is easier to query
data efficiently.
Here as we only want to get the global sentiment of the recommendations (positive
purposefully ignore all Scores equal to 3. If the score is above 3, then the recommen
"positive". Otherwise, it will be set to "negative".
In [1]:
from __future__ import division
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
In [2]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite')
# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a ne
def partition(x):
if x < 3:
return 0
return 1
Out[2]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom
Natalia Corres
2 3 B000LQOCH0 ABXLMWJIXXAIN "Natalia 1 1
Corres"
In [3]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplac
In [4]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"},
final.shape
Out[4]:
(364173, 10)
In [5]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100
Out[5]:
69.25890143662969
Observation:- It was also seen that in two rows given below the value of Helpfulness
HelpfulnessDenominator which is not practically possible hence these two rows too
calcualtions
In [6]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
In [7]:
#Before starting the next phase of preprocessing lets see the number of entries l
print(final.shape)
#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()
(364171, 10)
Out[7]:
1 307061
0 57110
Name: Score, dtype: int64
[3] Preprocessing
[3.1]. Preprocessing Review Text
Now that we have finished deduplication our data requires some preprocessing befo
analysis and making the prediction model.
Hence in the Preprocessing phase we do the following in the order below:-
In [8]:
sent_0 = final['Text'].values[0]
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove
from bs4 import BeautifulSoup
this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he
learned about whales, India, drooping roses: i love all the new words this book introduces and the si
ic book i am willing to bet my son will STILL be able to recite from memory when he is in college
==================================================
In [9]:
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
# specific
phrase = re.sub(r"won't", "will not", phrase)
phrase = re.sub(r"can\'t", "can not", phrase)
# general
phrase = re.sub(r"n\'t", " not", phrase)
phrase = re.sub(r"\'re", " are", phrase)
phrase = re.sub(r"\'s", " is", phrase)
phrase = re.sub(r"\'d", " would", phrase)
phrase = re.sub(r"\'ll", " will", phrase)
phrase = re.sub(r"\'t", " not", phrase)
phrase = re.sub(r"\'ve", " have", phrase)
phrase = re.sub(r"\'m", " am", phrase)
return phrase
In [10]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st st
stop_words= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ou
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', '
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itsel
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because'
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'th
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all'
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "di
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn'
'won', "won't", 'wouldn', "wouldn't"])
In [11]:
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
sentance = re.sub(r"http\S+", "", sentance)
sentance = BeautifulSoup(sentance, 'html.parser').get_text()
sentance = decontracted(sentance)
sentance = re.sub("\S*\d\S*", "", sentance).strip()
sentance = re.sub('[^A-Za-z]+', ' ', sentance)
# https://gist.github.com/sebleier/554280
sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in s
preprocessed_reviews.append(sentance.strip())
In [12]:
final['CleanedText']= preprocessed_reviews #adding a new column
print(final.shape)
(364171, 11)
In [13]:
final.drop(final[final['CleanedText']==''].index, inplace=True) #dropping the row
print(final.shape)
(364164, 11)
[4] Featurization
In [14]:
#Sample out 100k points
data= final.sample(100000)
In [15]:
print(data.shape)
(100000, 11)
In [16]:
#Sorting data w.r.t time in accending order
data.sort_values('Time', axis=0, ascending=True, inplace=True, kind='quicksort',
In [17]:
X_tr= data['CleanedText'].values
In [18]:
print(len(X_tr))
100000
[4.3] TF-IDF
In [84]:
#Set2
tf_idf_vect = TfidfVectorizer(min_df=10)
final_tf_idf_tr= tf_idf_vect.fit_transform(X_tr)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_
print('='*50)
some sample features(unique words in the corpus) ['aa', 'aback', 'abandoned', 'abc', 'abdominal', 'abil
d', 'absence']
==================================================
the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer (100000, 12650)
the number of unique words 12650
Procedure:
Take top 2000 or 3000 features from tf-idf vectorizers using idf_ score
You need to calculate the co-occurrence matrix with the selected featu
give the co-occurrence matrix, it returns the covariance matrix, check t
(https://medium.com/data-science-group-iitr/word-embedding-2d05d27
(https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-cou
information)
You should choose the n_components in truncated svd, with maximum
search on how to choose that and implement them. (hint: plot of cumu
ratio)
After you are done with the truncated svd, you can apply K-Means clu
number of clusters based on elbow method.
Print out wordclouds for each cluster, similar to that in previous assign
You need to write a function that takes a word and returns the most sim
similarity between the vectors(vector: a row in the matrix after truncate
Truncated-SVD
localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 10/22
4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook
c', 'emailing', 'elvis', 'acknowledge', 'enticed', 'therapist', 'entitled', 'karo', 'kiddies', 'entr',
'persnickety', 'cerelac', 'sierra', 'bonita', 'larvae', 'angels', 'nitrite', 'savvy', 'reliability', 'no
'comparision', 'distributing', 'shout', 'coloured', 'premise', 'whoppers', 'appliances', 'laces', 'down
elate', 'distressed', 'wheaten', 'communicate', 'rtd', 'manure', 'tisane', 'noh', 'marshmellow', 'intel
In [21]:
#Considering only reviews with top features
X_tr_new=[]
for i in tqdm(top_feat):
for j in X_tr:
if i in j.split():
X_tr_new.append(j)
#removing duplicates
X_tr_new=list(set(X_tr_new))
print("Number of reviews containing only top 2000 features:", len(X_tr_new))
In [22]:
#Creating a Bigram countvectorizer so that we can count the number of times, two
count_vect2 = CountVectorizer(ngram_range=(2,2),dtype= int)
final_bigram_counts_tr = count_vect2.fit_transform(X_tr_new)
In [23]:
#Extracting the feature names ino an array
feat_names= []
for i in tqdm(count_vect2.get_feature_names()):
w= i.split()
if w[0]!=w[1]:
if w[0] and w[1] in top_feat:
feat_names.append(i)
for i in tqdm(feat_names):
dict_feat[i]= np.sum(final_bigram_counts_tr[:, count_vect2.vocabulary_[i]].to
#code to check the features that has same words. Example: 'time run' and 'run tim
for i in tqdm(feat_names):
a= i.split()
for j in feat_names:
b= j.split()
if a[0]==b[1] and a[1]==b[0]:
dict_feat[i]=dict_feat[i]+dict_feat[j]
del dict_feat[j]
feat_names.remove(j)
In [24]:
#Code for co occurance matrix
matrix=[]
for i in top_feat:
for j in top_feat:
if i==j:
matrix.append(0)
elif i+" "+j in dict_feat:
matrix.append(dict_feat[i+" "+j])
elif j+" "+i in dict_feat:
matrix.append(dict_feat[j+" "+i])
else:
matrix.append(0)
matrix= np.reshape(matrix,(2000,2000))
co_occ_matrix= pd.DataFrame(matrix, columns= top_feat, index= top_feat)
co_occ_matrix
Out[24]:
medifast hating diary suppressant backwards spritzer heels heeler minsi
medifast 0 0 0 0 0 0 0 0 0
hating 0 0 0 0 0 0 0 0 0
diary 0 0 0 0 0 0 0 0 0
suppressant 0 0 0 0 0 0 0 0 0
backwards 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
peeps 0 0 0 0 0 0 0 0 0
fundraiser 0 0 0 0 0 0 0 0 0
marshmellows 0 0 0 0 0 0 0 0 0
girlfriends 0 0 0 0 0 0 0 0 0
masses 0 0 0 0 0 0 0 0 0
In [25]:
print("The numberof non zero elements in co occurance matrix:",len(np.nonzero(mat
In [95]:
from sklearn.decomposition import TruncatedSVD
n=500
svd= TruncatedSVD(n_components= n, random_state=0)
svd.fit(final_tf_idf_tr)
var= svd.explained_variance_ratio_
var[::-1].sort()
#Plot for number of features retained vs variance explained
n_plot= np.arange(n)
plt.plot(n_plot, var)
plt.xlabel("number of features retained")
plt.ylabel("variance explained")
plt.title("number of features retained vs variance explained")
plt.grid()
plt.show()
Observation: From the above plot, we see that the maximum varience is retained till
we take number of features to be retained as 30.
In [96]:
#Training with TruncatedSVD
svd = TruncatedSVD(n_components=30, random_state=0)
final_tf_idf_tr= svd.fit_transform(final_tf_idf_tr)
print("The diamension of new vectors after TruncatedSVD: ",len(final_tf_idf_tr[0]
In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import log10
inertia=[]
for i in tqdm(k):
clf = KMeans(random_state=0,n_clusters=i)
#Plot
In [29]:
from sklearn.cluster import KMeans
clf1 = KMeans(random_state=0,n_clusters= 40)
clf1.fit(final_tf_idf_tr)
Out[29]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=40, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
In [30]:
#Creating dataframe with feature names and cluster labels
feat_dict={'feature names':X_tr,'feature label':clf1.labels_}
data_feat=pd.DataFrame(feat_dict)
num_clusters=40
s=[]
for i in range(num_clusters):
q=feat_group.get_group(i)
a=[]
for j in range(len(q)):
a.append(q.iloc[j]["feature names"])
a=' '.join(a)
s.append(a)
import math
r=sqrt(num_clusters)
r=math.ceil(r)
plt.figure(1)
fig, axes = plt.subplots(nrows=r, ncols=r,figsize=(25,25))
for i in range(r*r):
if i<len(s):
wc= WordCloud(background_color="white", max_words=200, width=800, height=
plt.subplot(r,r,i+1)
plt.imshow(wc)
plt.tight_layout(pad=0)
plt.axis('off')
else:
plt.subplot(r,r,i+1)
plt.axis('off')
plt.show()
In [154]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim_matrix= cosine_similarity(np.transpose(final_tf_idf_tr))
def most_sim_word(w):
feat= top_feat[:len(final_tf_idf_tr[0])]
if w in feat:
ind= feat.index(w)
top_fea_ind= cos_sim_matrix[ind].argsort()[-10:][::-1]
print("Top 10 features similar to the given word:",w,"is",[feat[i] for i
else:
print("The given word is not a feature in TruncatedSVD matrix")
In [155]:
#example:
word= 'diary' #let this be the given word
most_sim_word(word)
Top 10 features similar to the given word: diary is ['diary', 'backwards', 'traveler', 'coons', 'transfo
ts', 'nd', 'medifast']
[6] Conclusions
observation: cluster1 mostly positive words describing good flavour/good taste, clus
cluster4 is about coffee/cups, cluster4 is about cat food mostly, cluster12 is mostly a
about snacks/bars, cluster14 is about suger products, cluster17 mostly positive word
product/ great price, cluster18 mostly about water/drinks, cluster29 mostly about cho
related to oil products