11 Amazon Fine Food Reviews Analysis - Truncated SVD - WIP - Jupyter Notebook

4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook
Amazon Fine Food Reviews Analysis

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews (https://www.
fine-food-reviews)
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualizatio
(https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/)
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Ama
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found t
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review
Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negativ
[Q] How to determine if a review is positive or negative?
[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive

be considered as negative one. A review of rating 3 is considered nuetral and such r
analysis. This is an approximate and proxy way of determining the polarity (positivity
[1]. Reading Data

[1.1] Loading the data
The dataset is available in two forms
localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 1/22

1. .csv file
2. SQLite Database
In order to load the data, We have used the SQLITE dataset as it is easier to query
data efficiently.
Here as we only want to get the global sentiment of the recommendations (positive
purposefully ignore all Scores equal to 3. If the score is above 3, then the recommen
"positive". Otherwise, it will be set to "negative".

In [1]:
from __future__ import division
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
import re # Tutorial about Python regular expressions: https://pymotw.com/2/re/

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec

from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm
import os

In [2]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite')
# filtering only positive and negative reviews i.e.

# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data
# you can change the number to any other number based on your computing power
# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LI

# for tsne assignment you can take 5k data points
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3""",
# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a ne
def partition(x):
if x < 3:
return 0
return 1
#changing reviews with score less than 3 to be positive and vice-versa

actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)
Number of data points in our data (525814, 10)
Out[2]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom
Natalia Corres
2 3 B000LQOCH0 ABXLMWJIXXAIN "Natalia 1 1
Corres"
[2] Exploratory Data Analysis

[2.1] Data Cleaning: Deduplication
It is observed (as shown in the table below) that the reviews data had many duplicat
necessary to remove duplicates in order to get unbiased results for the analysis of th
example:
In [3]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplac
In [4]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"},
final.shape
Out[4]:
(364173, 10)
In [5]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100
Out[5]:
69.25890143662969
Observation:- It was also seen that in two rows given below the value of Helpfulness
HelpfulnessDenominator which is not practically possible hence these two rows too
calcualtions
In [6]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [7]:
#Before starting the next phase of preprocessing lets see the number of entries l
print(final.shape)
#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()
(364171, 10)
Out[7]:
1 307061
0 57110
Name: Score, dtype: int64
[3] Preprocessing
[3.1]. Preprocessing Review Text
Now that we have finished deduplication our data requires some preprocessing befo
analysis and making the prediction model.
Hence in the Preprocessing phase we do the following in the order below:-
1. Begin by removing the html tags

2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched th
letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter
After which we collect the words used to describe positive and negative reviews

In [8]:
sent_0 = final['Text'].values[0]
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove
from bs4 import BeautifulSoup
soup = BeautifulSoup(sent_0, 'html.parser')

text = soup.get_text()
print(text)
print("="*50)
this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he
learned about whales, India, drooping roses: i love all the new words this book introduces and the si
ic book i am willing to bet my son will STILL be able to recite from memory when he is in college
==================================================
In [9]:
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
# specific
phrase = re.sub(r"won't", "will not", phrase)
phrase = re.sub(r"can\'t", "can not", phrase)
# general
phrase = re.sub(r"n\'t", " not", phrase)
phrase = re.sub(r"\'re", " are", phrase)
phrase = re.sub(r"\'s", " is", phrase)
phrase = re.sub(r"\'d", " would", phrase)
phrase = re.sub(r"\'ll", " will", phrase)
phrase = re.sub(r"\'t", " not", phrase)
phrase = re.sub(r"\'ve", " have", phrase)
phrase = re.sub(r"\'m", " am", phrase)
return phrase

In [10]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st st
stop_words= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ou
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', '
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itsel
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because'
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'th
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all'
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "di
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn'
'won', "won't", 'wouldn', "wouldn't"])
In [11]:
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
sentance = re.sub(r"http\S+", "", sentance)
sentance = BeautifulSoup(sentance, 'html.parser').get_text()
sentance = decontracted(sentance)
sentance = re.sub("\S*\d\S*", "", sentance).strip()
sentance = re.sub('[^A-Za-z]+', ' ', sentance)
# https://gist.github.com/sebleier/554280
sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in s
preprocessed_reviews.append(sentance.strip())
100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [01:20<00:0

In [12]:
final['CleanedText']= preprocessed_reviews #adding a new column
print(final.shape)
(364171, 11)
In [13]:
final.drop(final[final['CleanedText']==''].index, inplace=True) #dropping the row
print(final.shape)
(364164, 11)
[3.2] Preprocessing Review Summary

In [45]:
## Similartly you can do preprocessing for review summary also.
[4] Featurization
In [14]:
#Sample out 100k points
data= final.sample(100000)
In [15]:
print(data.shape)
(100000, 11)
In [16]:
#Sorting data w.r.t time in accending order
data.sort_values('Time', axis=0, ascending=True, inplace=True, kind='quicksort',
In [17]:
X_tr= data['CleanedText'].values
In [18]:
print(len(X_tr))
100000
[4.3] TF-IDF

In [84]:
#Set2
tf_idf_vect = TfidfVectorizer(min_df=10)
final_tf_idf_tr= tf_idf_vect.fit_transform(X_tr)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_
print('='*50)
print("the type of count vectorizer ",type(final_tf_idf_tr))

print("the shape of out text TFIDF vectorizer ",final_tf_idf_tr.get_shape())
print("the number of unique words ", final_tf_idf_tr.get_shape()[1])
some sample features(unique words in the corpus) ['aa', 'aback', 'abandoned', 'abc', 'abdominal', 'abil
d', 'absence']
==================================================
the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer (100000, 12650)
the number of unique words 12650
[5] Assignment 11: Truncated SVD

1. Apply Truncated-SVD on only this feature set:
SET 2:Review text, preprocessed one converted into vectors using (TFIDF
Procedure:
Take top 2000 or 3000 features from tf-idf vectorizers using idf_ score
You need to calculate the co-occurrence matrix with the selected featu
give the co-occurrence matrix, it returns the covariance matrix, check t
(https://medium.com/data-science-group-iitr/word-embedding-2d05d27
(https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-cou
information)
You should choose the n_components in truncated svd, with maximum
search on how to choose that and implement them. (hint: plot of cumu
ratio)
After you are done with the truncated svd, you can apply K-Means clu
number of clusters based on elbow method.
Print out wordclouds for each cluster, similar to that in previous assign
You need to write a function that takes a word and returns the most sim
similarity between the vectors(vector: a row in the matrix after truncate
Truncated-SVD
[5.1] Taking top features from TFIDF, SET 2

In [20]:
ind= np.argsort(tf_idf_vect.idf_)[:-2001:-1]
top_feat= [tf_idf_vect.get_feature_names()[i] for i in ind]
print("The top 2000 features are: ",top_feat)
c', 'emailing', 'elvis', 'acknowledge', 'enticed', 'therapist', 'entitled', 'karo', 'kiddies', 'entr',
'persnickety', 'cerelac', 'sierra', 'bonita', 'larvae', 'angels', 'nitrite', 'savvy', 'reliability', 'no
'comparision', 'distributing', 'shout', 'coloured', 'premise', 'whoppers', 'appliances', 'laces', 'down
elate', 'distressed', 'wheaten', 'communicate', 'rtd', 'manure', 'tisane', 'noh', 'marshmellow', 'intel
isker', 'horizontally', 'lactobacillus', 'metro', 'soothed', 'caseinate', 'exclaimed', 'nitrogen', 'anc

'inaccurate', 'appalling', 'methionine', 'tales', 'comedy', 'marker', 'marketspice', 'marathons', 'land
etals', 'wifes', 'wil', 'collects', 'wimp', 'purring', 'antiseptic', 'wilbur', 'sc', 'presentable', 'fr
mid', 'excite', 'fresca', 'midst', 'nonstop', 'documented', 'exaggerating', 'alphabet', 'implied', 'nit
g', 'winco', 'tingly', 'quebec', 'messages', 'mesa', 'drama', 'anne', 'nitrate', 'reggiano', 'fritos',
ice', 'catfish', 'omitted', 'chocalate', 'gogo', 'loudly', 'ganache', 'loyalty', 'tourist', 'frantic',
'suspects', 'clark', 'cleanest', 'looses', 'inthe', 'saf', 'turf', 'gorilla', 'mic', 'players', 'parfait
le', 'perfumes', 'insulation', 'tha', 'pcs', 'crusted', 'tension', 'panera', 'borrowed', 'gee', 'milkbo
s', 'circus', 'comprised', 'booth', 'swig', 'chummies', 'plunger', 'flan', 'boosted', 'peaberry', 'cinn
ted', 'seeping', 'bonnet', 'bonne', 'permission', 'monkeys', 'tabouli', 'sweep', 'canines', 'conservativ
t', 'compensation', 'boycotting', 'monitored', 'theoretically', 'secs', 'raccoon', 'tostitos', 'gadgets
e', 'completes', 'completly', 'riceselect', 'carring', 'crummy', 'bouncy', 'psyllium', 'permitted', 'co
rative', 'cremes', 'pates', 'scrambling', 'transfats', 'cellar', 'scorched', 'gland', 'correcting', 'ch
ium', 'sauted', 'tandoori', 'forbid', 'installing', 'charming', 'perugina', 'convience', 'surfing', 'su
'chased', 'trifle', 'tamped', 'brutal', 'mochi', 'plainly', 'clogs', 'supremo', 'tastless', 'glee', 'bu
oured', 'supervised', 'timeframe', 'forehead', 'intentional', 'tilth', 'treacle', 'treadmill', 'collard
rtation', 'misplaced', 'ceo', 'invariably', 'codes', 'pitiful', 'misrepresentation', 'targeted', 'cento
e', 'satified', 'chana', 'checks', 'columbus', 'moctails', 'gouda', 'percentages', 'chewie', 'cleveland
resee', 'caveats', 'parthenon', 'roy', 'chiclets', 'sealer', 'companions', 'brined', 'quirky', 'susie',
'catsup', 'perfer', 'credibility', 'creeping', 'parks', 'comparably', 'camel', 'grasp', 'cleverly', 'cr
' ' i ' 't ' ' ' 't d k' 't i l d' 't i' ' ' 'l t ' ' h ' ' h
[5.2] Calulation of Co-occurrence matrix

In [21]:
#Considering only reviews with top features
X_tr_new=[]
for i in tqdm(top_feat):
for j in X_tr:
if i in j.split():
X_tr_new.append(j)
#removing duplicates
X_tr_new=list(set(X_tr_new))
print("Number of reviews containing only top 2000 features:", len(X_tr_new))
100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:50<00
Number of reviews containing only top 2000 features: 17152
In [22]:
#Creating a Bigram countvectorizer so that we can count the number of times, two
count_vect2 = CountVectorizer(ngram_range=(2,2),dtype= int)
final_bigram_counts_tr = count_vect2.fit_transform(X_tr_new)
print("the type of count vectorizer ",type(final_bigram_counts_tr))

print("the shape of out text BOW vectorizer ",final_bigram_counts_tr.get_shape())
print("the number of unique words in bigrams ", final_bigram_counts_tr.get_shape(
the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>

the shape of out text BOW vectorizer (17152, 687027)
the number of unique words in bigrams 687027

In [23]:
#Extracting the feature names ino an array
feat_names= []
for i in tqdm(count_vect2.get_feature_names()):
w= i.split()
if w[0]!=w[1]:
if w[0] and w[1] in top_feat:
feat_names.append(i)
#Creating a dictionary with feature names and their counts

dict_feat={}
for i in tqdm(feat_names):
dict_feat[i]= np.sum(final_bigram_counts_tr[:, count_vect2.vocabulary_[i]].to
#code to check the features that has same words. Example: 'time run' and 'run tim
for i in tqdm(feat_names):
a= i.split()
for j in feat_names:
b= j.split()
if a[0]==b[1] and a[1]==b[0]:
dict_feat[i]=dict_feat[i]+dict_feat[j]
del dict_feat[j]
feat_names.remove(j)
100%|███████████████████████████████████████████████████████████████████████| 687027/687027 [00:18<00:00

100%|███████████████████████████████████████████████████████████████████████████| 21258/21258 [02:36<00
100%|███████████████████████████████████████████████████████████████████████████▉| 21253/21258 [03:55<0

In [24]:
#Code for co occurance matrix
matrix=[]
for i in top_feat:
for j in top_feat:
if i==j:
matrix.append(0)
elif i+" "+j in dict_feat:
matrix.append(dict_feat[i+" "+j])
elif j+" "+i in dict_feat:
matrix.append(dict_feat[j+" "+i])
else:
matrix.append(0)
matrix= np.reshape(matrix,(2000,2000))
co_occ_matrix= pd.DataFrame(matrix, columns= top_feat, index= top_feat)
co_occ_matrix
Out[24]:
medifast hating diary suppressant backwards spritzer heels heeler minsi
medifast 0 0 0 0 0 0 0 0 0
hating 0 0 0 0 0 0 0 0 0
diary 0 0 0 0 0 0 0 0 0
suppressant 0 0 0 0 0 0 0 0 0
backwards 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
peeps 0 0 0 0 0 0 0 0 0
fundraiser 0 0 0 0 0 0 0 0 0
marshmellows 0 0 0 0 0 0 0 0 0
girlfriends 0 0 0 0 0 0 0 0 0
masses 0 0 0 0 0 0 0 0 0
2000 rows × 2000 columns
In [25]:
print("The numberof non zero elements in co occurance matrix:",len(np.nonzero(mat
The numberof non zero elements in co occurance matrix: 562
[5.3] Finding optimal value for number of components (n) to

In [95]:
from sklearn.decomposition import TruncatedSVD
n=500
svd= TruncatedSVD(n_components= n, random_state=0)
svd.fit(final_tf_idf_tr)
var= svd.explained_variance_ratio_
var[::-1].sort()
#Plot for number of features retained vs variance explained
n_plot= np.arange(n)
plt.plot(n_plot, var)
plt.xlabel("number of features retained")
plt.ylabel("variance explained")
plt.title("number of features retained vs variance explained")
plt.grid()
plt.show()
Observation: From the above plot, we see that the maximum varience is retained till
we take number of features to be retained as 30.
In [96]:
#Training with TruncatedSVD
svd = TruncatedSVD(n_components=30, random_state=0)
final_tf_idf_tr= svd.fit_transform(final_tf_idf_tr)
print("The diamension of new vectors after TruncatedSVD: ",len(final_tf_idf_tr[0]
The diamension of new vectors after TruncatedSVD: 30

[5.4] Applying k-means clustering

In [28]:
import numpy as np
import pandas as pd
from math import log10
from sklearn.cluster import KMeans
inertia=[]
k=[i for i in range(1,101,2)]
for i in tqdm(k):
clf = KMeans(random_state=0,n_clusters=i)
# fitting the model

clf.fit(final_tf_idf_tr)
# predict the inertia on the train

inertia.append(clf.inertia_)
#Plot
plt.plot(k, inertia, label="k vs inertia")

plt.scatter(k, inertia, label="k vs inertia")
plt.xlabel("k")
plt.ylabel("inertia")
plt.title("k vs inertia PLOT")
plt.legend()
plt.grid()
plt.show()
100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [1:15:15<00

Observation: In the above plot, we have considered k value up to 100 as maximum

we take k= 40 approx.
In [29]:
from sklearn.cluster import KMeans
clf1 = KMeans(random_state=0,n_clusters= 40)
clf1.fit(final_tf_idf_tr)
Out[29]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=40, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
[5.5] Wordclouds of clusters obtained in the above section

In [30]:
#Creating dataframe with feature names and cluster labels
feat_dict={'feature names':X_tr,'feature label':clf1.labels_}
data_feat=pd.DataFrame(feat_dict)
feat_group = data_feat.groupby('feature label')
num_clusters=40
s=[]
for i in range(num_clusters):
q=feat_group.get_group(i)
a=[]
for j in range(len(q)):
a.append(q.iloc[j]["feature names"])
a=' '.join(a)
s.append(a)
from wordcloud import WordCloud

import numpy as np
import requests
from PIL import Image
from math import sqrt
ma = np.array(Image.open('ovalshape.png'))
import math
r=sqrt(num_clusters)
r=math.ceil(r)
plt.figure(1)
fig, axes = plt.subplots(nrows=r, ncols=r,figsize=(25,25))
for i in range(r*r):
if i<len(s):
wc= WordCloud(background_color="white", max_words=200, width=800, height=
plt.subplot(r,r,i+1)
plt.imshow(wc)
plt.tight_layout(pad=0)
plt.axis('off')

else:
plt.subplot(r,r,i+1)
plt.axis('off')
plt.show()
<Figure size 432x288 with 0 Axes>

[5.6] Function that returns most similar words for a given wo

In [154]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim_matrix= cosine_similarity(np.transpose(final_tf_idf_tr))
def most_sim_word(w):
feat= top_feat[:len(final_tf_idf_tr[0])]
if w in feat:
ind= feat.index(w)
top_fea_ind= cos_sim_matrix[ind].argsort()[-10:][::-1]
print("Top 10 features similar to the given word:",w,"is",[feat[i] for i
else:
print("The given word is not a feature in TruncatedSVD matrix")
In [155]:
#example:
word= 'diary' #let this be the given word
most_sim_word(word)
Top 10 features similar to the given word: diary is ['diary', 'backwards', 'traveler', 'coons', 'transfo
ts', 'nd', 'medifast']
[6] Conclusions
observation: cluster1 mostly positive words describing good flavour/good taste, clus
cluster4 is about coffee/cups, cluster4 is about cat food mostly, cluster12 is mostly a
about snacks/bars, cluster14 is about suger products, cluster17 mostly positive word
product/ great price, cluster18 mostly about water/drinks, cluster29 mostly about cho
related to oil products
The optimal number of clusters is 40.

The optimal number of components for TruncatedSVD is 30.

11 Amazon Fine Food Reviews Analysis - Truncated SVD - WIP - Jupyter Notebook

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

11 Amazon Fine Food Reviews Analysis - Truncated SVD - WIP - Jupyter Notebook

Uploaded by

Copyright:

Available Formats

4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

Amazon Fine Food Reviews Analysis

[Q] How to determine if a review is positive or negative?

[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive

[1]. Reading Data

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 1/22

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 2/22

from sklearn.feature_extraction.text import CountVectorizer

import re # Tutorial about Python regular expressions: https://pymotw.com/2/re/

from gensim.models import Word2Vec

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 3/22

# filtering only positive and negative reviews i.e.

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LI

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3""",

#changing reviews with score less than 3 to be positive and vice-versa

Number of data points in our data (525814, 10)

0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1

1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 4/22

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom

[2] Exploratory Data Analysis

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 5/22

1. Begin by removing the html tags

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 6/22

soup = BeautifulSoup(sent_0, 'html.parser')

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 7/22

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [01:20<00:0

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 8/22

[3.2] Preprocessing Review Summary

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 9/22

print("the type of count vectorizer ",type(final_tf_idf_tr))

[5] Assignment 11: Truncated SVD

[5.1] Taking top features from TFIDF, SET 2

print("The top 2000 features are: ",top_feat)

isker', 'horizontally', 'lactobacillus', 'metro', 'soothed', 'caseinate', 'exclaimed', 'nitrogen', 'anc

[5.2] Calulation of Co-occurrence matrix

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 11/22

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:50<00

Number of reviews containing only top 2000 features: 17152

print("the type of count vectorizer ",type(final_bigram_counts_tr))

the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 12/22

#Creating a dictionary with feature names and their counts

100%|███████████████████████████████████████████████████████████████████████| 687027/687027 [00:18<00:00

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 13/22

2000 rows × 2000 columns

The numberof non zero elements in co occurance matrix: 562

[5.3] Finding optimal value for number of components (n) to

The diamension of new vectors after TruncatedSVD: 30

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 15/22

[5.4] Applying k-means clustering

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 16/22

from sklearn.cluster import KMeans

k=[i for i in range(1,101,2)]

# fitting the model

# predict the inertia on the train

plt.plot(k, inertia, label="k vs inertia")

100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [1:15:15<00

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 17/22

Observation: In the above plot, we have considered k value up to 100 as maximum

[5.5] Wordclouds of clusters obtained in the above section

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 18/22

feat_group = data_feat.groupby('feature label')