You are on page 1of 22

4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews (https://www.
fine-food-reviews)
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualizatio
(https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/)
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Ama
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found t
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review
Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negativ

[Q] How to determine if a review is positive or negative?

[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive


be considered as negative one. A review of rating 3 is considered nuetral and such r
analysis. This is an approximate and proxy way of determining the polarity (positivity

[1]. Reading Data


[1.1] Loading the data
The dataset is available in two forms

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 1/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

1. .csv file
2. SQLite Database
In order to load the data, We have used the SQLITE dataset as it is easier to query
data efficiently.
Here as we only want to get the global sentiment of the recommendations (positive
purposefully ignore all Scores equal to 3. If the score is above 3, then the recommen
"positive". Otherwise, it will be set to "negative".

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 2/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [1]:
from __future__ import division
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re # Tutorial about Python regular expressions: https://pymotw.com/2/re/


import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec


from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm
import os

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 3/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [2]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite')

# filtering only positive and negative reviews i.e.


# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LI


# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3""",

# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a ne
def partition(x):
if x < 3:
return 0
return 1

#changing reviews with score less than 3 to be positive and vice-versa


actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

Number of data points in our data (525814, 10)

Out[2]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom

0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1

1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 4/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenom

Natalia Corres
2 3 B000LQOCH0 ABXLMWJIXXAIN "Natalia 1 1
Corres"

[2] Exploratory Data Analysis


[2.1] Data Cleaning: Deduplication
It is observed (as shown in the table below) that the reviews data had many duplicat
necessary to remove duplicates in order to get unbiased results for the analysis of th
example:

In [3]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplac

In [4]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"},
final.shape

Out[4]:
(364173, 10)

In [5]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

Out[5]:
69.25890143662969

Observation:- It was also seen that in two rows given below the value of Helpfulness
HelpfulnessDenominator which is not practically possible hence these two rows too
calcualtions

In [6]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 5/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [7]:
#Before starting the next phase of preprocessing lets see the number of entries l
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)

Out[7]:
1 307061
0 57110
Name: Score, dtype: int64

[3] Preprocessing
[3.1]. Preprocessing Review Text
Now that we have finished deduplication our data requires some preprocessing befo
analysis and making the prediction model.
Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the html tags


2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched th
letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter
After which we collect the words used to describe positive and negative reviews

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 6/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [8]:
sent_0 = final['Text'].values[0]

# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'html.parser')


text = soup.get_text()
print(text)
print("="*50)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he
learned about whales, India, drooping roses: i love all the new words this book introduces and the si
ic book i am willing to bet my son will STILL be able to recite from memory when he is in college
==================================================

In [9]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
# specific
phrase = re.sub(r"won't", "will not", phrase)
phrase = re.sub(r"can\'t", "can not", phrase)

# general
phrase = re.sub(r"n\'t", " not", phrase)
phrase = re.sub(r"\'re", " are", phrase)
phrase = re.sub(r"\'s", " is", phrase)
phrase = re.sub(r"\'d", " would", phrase)
phrase = re.sub(r"\'ll", " will", phrase)
phrase = re.sub(r"\'t", " not", phrase)
phrase = re.sub(r"\'ve", " have", phrase)
phrase = re.sub(r"\'m", " am", phrase)
return phrase

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 7/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [10]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st st

stop_words= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ou
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', '
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itsel
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because'
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'th
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all'
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "di
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn'
'won', "won't", 'wouldn', "wouldn't"])

In [11]:
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
sentance = re.sub(r"http\S+", "", sentance)
sentance = BeautifulSoup(sentance, 'html.parser').get_text()
sentance = decontracted(sentance)
sentance = re.sub("\S*\d\S*", "", sentance).strip()
sentance = re.sub('[^A-Za-z]+', ' ', sentance)
# https://gist.github.com/sebleier/554280
sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in s
preprocessed_reviews.append(sentance.strip())

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [01:20<00:0

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 8/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [12]:
final['CleanedText']= preprocessed_reviews #adding a new column
print(final.shape)

(364171, 11)

In [13]:
final.drop(final[final['CleanedText']==''].index, inplace=True) #dropping the row
print(final.shape)

(364164, 11)

[3.2] Preprocessing Review Summary


In [45]:
## Similartly you can do preprocessing for review summary also.

[4] Featurization
In [14]:
#Sample out 100k points
data= final.sample(100000)

In [15]:
print(data.shape)

(100000, 11)

In [16]:
#Sorting data w.r.t time in accending order
data.sort_values('Time', axis=0, ascending=True, inplace=True, kind='quicksort',

In [17]:
X_tr= data['CleanedText'].values

In [18]:
print(len(X_tr))

100000

[4.3] TF-IDF

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 9/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [84]:
#Set2
tf_idf_vect = TfidfVectorizer(min_df=10)
final_tf_idf_tr= tf_idf_vect.fit_transform(X_tr)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_
print('='*50)

print("the type of count vectorizer ",type(final_tf_idf_tr))


print("the shape of out text TFIDF vectorizer ",final_tf_idf_tr.get_shape())
print("the number of unique words ", final_tf_idf_tr.get_shape()[1])

some sample features(unique words in the corpus) ['aa', 'aback', 'abandoned', 'abc', 'abdominal', 'abil
d', 'absence']
==================================================
the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer (100000, 12650)
the number of unique words 12650

[5] Assignment 11: Truncated SVD


1. Apply Truncated-SVD on only this feature set:
SET 2:Review text, preprocessed one converted into vectors using (TFIDF

Procedure:
Take top 2000 or 3000 features from tf-idf vectorizers using idf_ score
You need to calculate the co-occurrence matrix with the selected featu
give the co-occurrence matrix, it returns the covariance matrix, check t
(https://medium.com/data-science-group-iitr/word-embedding-2d05d27
(https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-cou
information)
You should choose the n_components in truncated svd, with maximum
search on how to choose that and implement them. (hint: plot of cumu
ratio)
After you are done with the truncated svd, you can apply K-Means clu
number of clusters based on elbow method.
Print out wordclouds for each cluster, similar to that in previous assign
You need to write a function that takes a word and returns the most sim
similarity between the vectors(vector: a row in the matrix after truncate

Truncated-SVD
localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 10/22
4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

[5.1] Taking top features from TFIDF, SET 2


In [20]:
ind= np.argsort(tf_idf_vect.idf_)[:-2001:-1]
top_feat= [tf_idf_vect.get_feature_names()[i] for i in ind]

print("The top 2000 features are: ",top_feat)

c', 'emailing', 'elvis', 'acknowledge', 'enticed', 'therapist', 'entitled', 'karo', 'kiddies', 'entr',
'persnickety', 'cerelac', 'sierra', 'bonita', 'larvae', 'angels', 'nitrite', 'savvy', 'reliability', 'no
'comparision', 'distributing', 'shout', 'coloured', 'premise', 'whoppers', 'appliances', 'laces', 'down
elate', 'distressed', 'wheaten', 'communicate', 'rtd', 'manure', 'tisane', 'noh', 'marshmellow', 'intel

isker', 'horizontally', 'lactobacillus', 'metro', 'soothed', 'caseinate', 'exclaimed', 'nitrogen', 'anc


'inaccurate', 'appalling', 'methionine', 'tales', 'comedy', 'marker', 'marketspice', 'marathons', 'land
etals', 'wifes', 'wil', 'collects', 'wimp', 'purring', 'antiseptic', 'wilbur', 'sc', 'presentable', 'fr
mid', 'excite', 'fresca', 'midst', 'nonstop', 'documented', 'exaggerating', 'alphabet', 'implied', 'nit
g', 'winco', 'tingly', 'quebec', 'messages', 'mesa', 'drama', 'anne', 'nitrate', 'reggiano', 'fritos',
ice', 'catfish', 'omitted', 'chocalate', 'gogo', 'loudly', 'ganache', 'loyalty', 'tourist', 'frantic',
'suspects', 'clark', 'cleanest', 'looses', 'inthe', 'saf', 'turf', 'gorilla', 'mic', 'players', 'parfait
le', 'perfumes', 'insulation', 'tha', 'pcs', 'crusted', 'tension', 'panera', 'borrowed', 'gee', 'milkbo
s', 'circus', 'comprised', 'booth', 'swig', 'chummies', 'plunger', 'flan', 'boosted', 'peaberry', 'cinn
ted', 'seeping', 'bonnet', 'bonne', 'permission', 'monkeys', 'tabouli', 'sweep', 'canines', 'conservativ
t', 'compensation', 'boycotting', 'monitored', 'theoretically', 'secs', 'raccoon', 'tostitos', 'gadgets
e', 'completes', 'completly', 'riceselect', 'carring', 'crummy', 'bouncy', 'psyllium', 'permitted', 'co
rative', 'cremes', 'pates', 'scrambling', 'transfats', 'cellar', 'scorched', 'gland', 'correcting', 'ch
ium', 'sauted', 'tandoori', 'forbid', 'installing', 'charming', 'perugina', 'convience', 'surfing', 'su
'chased', 'trifle', 'tamped', 'brutal', 'mochi', 'plainly', 'clogs', 'supremo', 'tastless', 'glee', 'bu
oured', 'supervised', 'timeframe', 'forehead', 'intentional', 'tilth', 'treacle', 'treadmill', 'collard
rtation', 'misplaced', 'ceo', 'invariably', 'codes', 'pitiful', 'misrepresentation', 'targeted', 'cento
e', 'satified', 'chana', 'checks', 'columbus', 'moctails', 'gouda', 'percentages', 'chewie', 'cleveland
resee', 'caveats', 'parthenon', 'roy', 'chiclets', 'sealer', 'companions', 'brined', 'quirky', 'susie',
'catsup', 'perfer', 'credibility', 'creeping', 'parks', 'comparably', 'camel', 'grasp', 'cleverly', 'cr
' ' i ' 't ' ' ' 't d k' 't i l d' 't i' ' ' 'l t ' ' h ' ' h

[5.2] Calulation of Co-occurrence matrix

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 11/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [21]:
#Considering only reviews with top features
X_tr_new=[]

for i in tqdm(top_feat):
for j in X_tr:
if i in j.split():
X_tr_new.append(j)

#removing duplicates
X_tr_new=list(set(X_tr_new))
print("Number of reviews containing only top 2000 features:", len(X_tr_new))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:50<00

Number of reviews containing only top 2000 features: 17152

In [22]:
#Creating a Bigram countvectorizer so that we can count the number of times, two
count_vect2 = CountVectorizer(ngram_range=(2,2),dtype= int)
final_bigram_counts_tr = count_vect2.fit_transform(X_tr_new)

print("the type of count vectorizer ",type(final_bigram_counts_tr))


print("the shape of out text BOW vectorizer ",final_bigram_counts_tr.get_shape())
print("the number of unique words in bigrams ", final_bigram_counts_tr.get_shape(

the type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>


the shape of out text BOW vectorizer (17152, 687027)
the number of unique words in bigrams 687027

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 12/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [23]:
#Extracting the feature names ino an array
feat_names= []

for i in tqdm(count_vect2.get_feature_names()):
w= i.split()
if w[0]!=w[1]:
if w[0] and w[1] in top_feat:
feat_names.append(i)

#Creating a dictionary with feature names and their counts


dict_feat={}

for i in tqdm(feat_names):
dict_feat[i]= np.sum(final_bigram_counts_tr[:, count_vect2.vocabulary_[i]].to

#code to check the features that has same words. Example: 'time run' and 'run tim
for i in tqdm(feat_names):
a= i.split()
for j in feat_names:
b= j.split()
if a[0]==b[1] and a[1]==b[0]:
dict_feat[i]=dict_feat[i]+dict_feat[j]
del dict_feat[j]
feat_names.remove(j)

100%|███████████████████████████████████████████████████████████████████████| 687027/687027 [00:18<00:00


100%|███████████████████████████████████████████████████████████████████████████| 21258/21258 [02:36<00
100%|███████████████████████████████████████████████████████████████████████████▉| 21253/21258 [03:55<0

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 13/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [24]:
#Code for co occurance matrix
matrix=[]

for i in top_feat:
for j in top_feat:
if i==j:
matrix.append(0)
elif i+" "+j in dict_feat:
matrix.append(dict_feat[i+" "+j])
elif j+" "+i in dict_feat:
matrix.append(dict_feat[j+" "+i])
else:
matrix.append(0)

matrix= np.reshape(matrix,(2000,2000))
co_occ_matrix= pd.DataFrame(matrix, columns= top_feat, index= top_feat)
co_occ_matrix

Out[24]:
medifast hating diary suppressant backwards spritzer heels heeler minsi
medifast 0 0 0 0 0 0 0 0 0
hating 0 0 0 0 0 0 0 0 0
diary 0 0 0 0 0 0 0 0 0
suppressant 0 0 0 0 0 0 0 0 0
backwards 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
peeps 0 0 0 0 0 0 0 0 0
fundraiser 0 0 0 0 0 0 0 0 0
marshmellows 0 0 0 0 0 0 0 0 0
girlfriends 0 0 0 0 0 0 0 0 0
masses 0 0 0 0 0 0 0 0 0

2000 rows × 2000 columns

In [25]:
print("The numberof non zero elements in co occurance matrix:",len(np.nonzero(mat

The numberof non zero elements in co occurance matrix: 562

[5.3] Finding optimal value for number of components (n) to


localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 14/22
4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [95]:
from sklearn.decomposition import TruncatedSVD

n=500
svd= TruncatedSVD(n_components= n, random_state=0)
svd.fit(final_tf_idf_tr)
var= svd.explained_variance_ratio_
var[::-1].sort()
#Plot for number of features retained vs variance explained
n_plot= np.arange(n)
plt.plot(n_plot, var)
plt.xlabel("number of features retained")
plt.ylabel("variance explained")
plt.title("number of features retained vs variance explained")
plt.grid()
plt.show()

Observation: From the above plot, we see that the maximum varience is retained till
we take number of features to be retained as 30.

In [96]:
#Training with TruncatedSVD
svd = TruncatedSVD(n_components=30, random_state=0)
final_tf_idf_tr= svd.fit_transform(final_tf_idf_tr)
print("The diamension of new vectors after TruncatedSVD: ",len(final_tf_idf_tr[0]

The diamension of new vectors after TruncatedSVD: 30

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 15/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

[5.4] Applying k-means clustering

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 16/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import log10

from sklearn.cluster import KMeans

inertia=[]

k=[i for i in range(1,101,2)]

for i in tqdm(k):
clf = KMeans(random_state=0,n_clusters=i)

# fitting the model


clf.fit(final_tf_idf_tr)

# predict the inertia on the train


inertia.append(clf.inertia_)

#Plot

plt.plot(k, inertia, label="k vs inertia")


plt.scatter(k, inertia, label="k vs inertia")
plt.xlabel("k")
plt.ylabel("inertia")
plt.title("k vs inertia PLOT")
plt.legend()
plt.grid()
plt.show()

100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [1:15:15<00

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 17/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

Observation: In the above plot, we have considered k value up to 100 as maximum


we take k= 40 approx.

In [29]:
from sklearn.cluster import KMeans
clf1 = KMeans(random_state=0,n_clusters= 40)
clf1.fit(final_tf_idf_tr)

Out[29]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=40, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)

[5.5] Wordclouds of clusters obtained in the above section

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 18/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [30]:
#Creating dataframe with feature names and cluster labels
feat_dict={'feature names':X_tr,'feature label':clf1.labels_}
data_feat=pd.DataFrame(feat_dict)

feat_group = data_feat.groupby('feature label')

num_clusters=40
s=[]

for i in range(num_clusters):
q=feat_group.get_group(i)
a=[]
for j in range(len(q)):
a.append(q.iloc[j]["feature names"])
a=' '.join(a)
s.append(a)

from wordcloud import WordCloud


import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image
from math import sqrt
ma = np.array(Image.open('ovalshape.png'))

import math
r=sqrt(num_clusters)
r=math.ceil(r)

plt.figure(1)
fig, axes = plt.subplots(nrows=r, ncols=r,figsize=(25,25))

for i in range(r*r):
if i<len(s):
wc= WordCloud(background_color="white", max_words=200, width=800, height=
plt.subplot(r,r,i+1)
plt.imshow(wc)
plt.tight_layout(pad=0)
plt.axis('off')

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 19/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

else:
plt.subplot(r,r,i+1)
plt.axis('off')

plt.show()

<Figure size 432x288 with 0 Axes>

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 20/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

[5.6] Function that returns most similar words for a given wo

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 21/22


4/23/2020 11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP - Jupyter Notebook

In [154]:
from sklearn.metrics.pairwise import cosine_similarity

cos_sim_matrix= cosine_similarity(np.transpose(final_tf_idf_tr))

def most_sim_word(w):
feat= top_feat[:len(final_tf_idf_tr[0])]
if w in feat:
ind= feat.index(w)
top_fea_ind= cos_sim_matrix[ind].argsort()[-10:][::-1]
print("Top 10 features similar to the given word:",w,"is",[feat[i] for i
else:
print("The given word is not a feature in TruncatedSVD matrix")

In [155]:
#example:
word= 'diary' #let this be the given word
most_sim_word(word)

Top 10 features similar to the given word: diary is ['diary', 'backwards', 'traveler', 'coons', 'transfo
ts', 'nd', 'medifast']

[6] Conclusions
observation: cluster1 mostly positive words describing good flavour/good taste, clus
cluster4 is about coffee/cups, cluster4 is about cat food mostly, cluster12 is mostly a
about snacks/bars, cluster14 is about suger products, cluster17 mostly positive word
product/ great price, cluster18 mostly about water/drinks, cluster29 mostly about cho
related to oil products

The optimal number of clusters is 40.


The optimal number of components for TruncatedSVD is 30.

localhost:8888/notebooks/11 Amazon Fine Food Reviews Analysis_Truncated SVD_WIP.ipynb# 22/22

You might also like