
Web Scraping of Amazon Product Reviews with Keyword Extraction


HARSH PAREEK (20BBS0007), ELANSOORIYAN VM (20BBS0025), SHIYAMGANESH N
(20BBS0083), AYUSH JAIN (20BBS0091), KARTHIK T (20BBS0096)
GitHub link for the project repositories: https://github.com/orgs/TARP-bbs/repositories

ABSTRACT This research article presents a web scraping application for extracting and analyzing sentiment data from product reviews on Amazon. The abundance of online customer reviews presents a valuable opportunity for businesses to gain insights into customer opinions and preferences. However, manually analyzing a large volume of reviews is a time-consuming and resource-intensive task. To address this issue, we propose a web scraping application that extracts and analyzes sentiment data from product reviews. The proposed approach employs natural language processing techniques to categorize each review as positive, negative, or neutral, providing businesses with valuable insights into customer sentiment. Our approach also assigns a weight to each review based on its perceived impact, giving a nuanced understanding of customer opinion. The experimental results demonstrate the effectiveness of our approach in accurately categorizing reviews and providing insights into customer sentiment.

INDEX TERMS Amazon, web scraping, sentiment analysis, natural language processing, machine learning, customer reviews, product development, marketing strategies, data-driven approach, advanced algorithms.

I. INTRODUCTION
The exponential growth of online shopping has led to an enormous volume of customer reviews on e-commerce platforms such as Amazon (Zhang & Varian, 2019). Online reviews have become a critical source of information for customers when making purchasing decisions, and businesses increasingly rely on these reviews to understand how their products are perceived in the market. The analysis of customer reviews can provide businesses with valuable insights into customer opinions, preferences, and purchasing behaviour (Pang & Lee, 2008). However, manual analysis of a large volume of reviews is a time-consuming and resource-intensive task.

To address this challenge, automated approaches have been developed to extract and analyze sentiment data from online reviews. Our research focuses on developing a web scraping application that leverages natural language processing (NLP) techniques to extract and categorize sentiment data from product reviews on Amazon. Our application aims to provide businesses with valuable insights into customer sentiment, allowing them to make data-driven decisions about product development and marketing strategies (Liu, 2012).

The proposed approach employs advanced algorithms to categorize each review as positive, negative, or neutral, providing businesses with a comprehensive view of customer sentiment. The application also assigns a weight to each review based on its perceived impact, providing a nuanced understanding of customer opinion. Our approach utilizes NLP techniques such as part-of-speech tagging, named entity recognition, and sentiment analysis to extract and analyze sentiment data from customer reviews.

The experimental results demonstrate the effectiveness of our approach in accurately categorizing reviews and providing valuable insights into customer sentiment. The proposed approach has enormous potential to benefit businesses in a variety of industries, enabling them to gain a deep understanding of customer sentiment and make data-driven decisions about product development and marketing strategies.

II. OVERVIEW OF TECHNOLOGIES USED
The success of our sentiment analysis and keyword extraction application depends heavily on the use of advanced technologies. In this section, we provide an overview of the various technologies that we have leveraged to build this application and how they contribute to its functionality and performance. We delve into the specifics of natural language processing (NLP) and machine learning (ML) techniques, as well as the web scraping technologies that enable us to efficiently extract large volumes of data from Amazon's platform. By understanding the technology behind our application, readers can gain a deeper appreciation of its capabilities and potential impact.



A. WEB SCRAPING
Web scraping is the process of automatically extracting information from websites. It involves writing a program that sends a request to a website and then parses the HTML content of the page to extract the relevant information. This information can then be used for a variety of purposes, such as data analysis, research, or content aggregation.

FIGURE 1. Web scraping analyzes and processes online information and turns it into structured data.

Web scraping can be done manually by copying and pasting data from websites, but this is a time-consuming and error-prone process. Automated web scraping using specialized software or programming languages like JavaScript is much more efficient and accurate. To automate our process, we used "cheerio.js", a popular Node.js library for web scraping. It is a lightweight and fast library that simplifies web scraping by providing a simple and intuitive API for querying and manipulating HTML and XML documents.

B. TYPESCRIPT
TypeScript is a popular open-source programming language developed and maintained by Microsoft. It is a typed superset of JavaScript that adds optional static typing, classes, and interfaces to the language. TypeScript is a powerful tool for building large-scale web applications. By adding static typing to JavaScript, TypeScript provides a range of benefits for developers, including improved code readability, better tooling support, and enhanced code maintainability. TypeScript also enables developers to catch potential errors and bugs at compile time rather than at runtime, which can save time and improve code quality.

One of the key advantages of TypeScript is its compatibility with existing JavaScript code. TypeScript code compiles to JavaScript, making it easy to integrate TypeScript into existing projects. TypeScript also supports modern JavaScript features such as async/await and arrow functions, making it a versatile language for web development.

C. SENTIMENT ANALYSIS
Sentiment analysis, also known as opinion mining, is the process of identifying and categorizing opinions and attitudes expressed in text data. Sentiment analysis has many applications, including market research, customer feedback analysis, and social media monitoring.

In our web scraping application, we used a Natural Language Processing (NLP) model provided by the spaCy library in Python to perform sentiment analysis. spaCy is an open-source library for NLP that provides a range of tools and models for various NLP tasks, including named entity recognition, part-of-speech tagging, and sentiment analysis.

The spaCy library's pre-trained NLP model allowed us to quickly and accurately analyze the sentiment of the customer reviews we scraped from Amazon. The model uses a combination of machine learning techniques and rule-based systems to identify and categorize sentiment in text data.

D. KEYWORD EXTRACTION
Keyword extraction is a crucial step in sentiment analysis, as it allows us to identify the most relevant terms in the text that are indicative of the sentiment expressed by the author. To achieve this, we employed the Rapid Automatic Keyword Extraction (RAKE) algorithm, a widely used approach for extracting keywords from text.

The RAKE algorithm works by first splitting the text into individual words and then ranking them based on their relevance to the overall meaning of the text. It does this by calculating a score for each word based on two factors: its frequency in the text and its co-occurrence with other important words.

FIGURE 2. Rapid Automatic Keyword Extraction (RAKE) algorithm for keyword extraction from product reviews.

RAKE has been shown to be highly effective at identifying relevant keywords in a wide range of contexts, from social media posts to academic research papers. In our project, we used the RAKE algorithm to identify the most relevant keywords in each product review, allowing us to gain a deeper understanding of the sentiment expressed by customers.

To implement RAKE, we used the Python programming language and the Natural Language Toolkit (NLTK) library, which provides a wide range of tools and techniques for natural language processing. We also leveraged the Pandas library to handle and process the large volumes of data we collected from Amazon.
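As a brief illustration, the rake_nltk package wraps this procedure in a small API. The following minimal sketch (the sample review text is invented for demonstration) shows how ranked phrases are obtained from a single review:

from rake_nltk import Rake

# Invented sample review for illustration
review = "The battery life is excellent but the camera quality is disappointing."

r = Rake()  # uses NLTK's English stopword list by default (requires NLTK data)
r.extract_keywords_from_text(review)

# Ranked phrases, highest score first, e.g. "camera quality", "battery life"
print(r.get_ranked_phrases())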
E. MICROSERVICES ARCHITECTURE
Microservices architecture has been gaining popularity in recent years as a way to build complex applications by splitting them into smaller, independent services. In our project, we have adopted a microservices architecture to enable efficient and scalable processing of sentiment analysis and keyword extraction tasks.

We have implemented our microservices using Python and the FastAPI framework, which provides a lightweight and efficient way to build web APIs. Our architecture consists of multiple microservices, each responsible for a specific task, such as sentiment analysis or keyword extraction.

By using a microservices architecture, we can scale each service independently, depending on the load it receives. This means that we can easily add more resources to a particular service if it is experiencing high traffic, without affecting other parts of the system. Additionally, it allows us to deploy each service on a separate server, which improves the system's fault tolerance.

FIGURE 3. Microservices are lightweight, self-contained components that perform respective functions within an application, communicating with each other via APIs.

Our microservices architecture also provides modularity and flexibility. Each service can be developed and tested independently, making it easier to maintain and update the system over time. Moreover, it enables us to integrate new services or modify existing ones without affecting the entire system's functionality.

III. PYTHON MICROSERVICE DEVELOPMENT
Our use of a microservices architecture with FastAPI has enabled us to build a scalable, efficient, and flexible system for sentiment analysis and keyword extraction. The implementation of this architecture has also enabled us to improve the overall performance and reliability of our system.

A. LIBRARIES
To implement the FastAPI microservices, we used various Python libraries, such as uvicorn as the web server, requests for making HTTP requests, and pydantic for data validation and parsing. The following code snippet shows how we imported the various libraries in our Python application, which runs on a separate server.

from asent.data_classes import SpanPolarityOutput
from fastapi import FastAPI
import spacy
import asent
from typing import Union, List, Any
from pydantic import BaseModel
from spacy.matcher import PhraseMatcher

from rake_nltk import Rake

B. API CODE
We used various features provided by FastAPI, such as automatic data validation, request/response handling, and built-in support for async/await, which allowed us to write efficient and easy-to-maintain code. FastAPI also provides automatic generation of API documentation, which makes it easy for other developers to understand and use our API.

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("asent_en_v1")

# Phrase matcher for product aspects (defined here; not used by the
# endpoint shown below)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["quality", "pricing", "looks", "worth it"]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)


def returnDocPolarity(polarity):
    # Document-level negative/positive/neutral/compound scores
    return {"negative": round(polarity.negative, 3),
            "positive": round(polarity.positive, 3),
            "neutral": round(polarity.neutral, 3),
            "compound": round(polarity.compound, 3)}


def returnSentencePolarity(polarity):
    # Sentence-level scores plus the sentence span itself
    return {"negative": round(polarity.negative, 3),
            "positive": round(polarity.positive, 3),
            "neutral": round(polarity.neutral, 3),
            "compound": round(polarity.compound, 3),
            "span": str(polarity.span)}


class Text(BaseModel):
    reviews: List[str]


app = FastAPI()


@app.post("/results/")
async def create_item(Reviews: Text):
    result = []
    reviews = Reviews.reviews

    for single_review in reviews:
        review_result = {"review": single_review}

        # Sentiment analysis: positive, negative, neutral, and
        # compound values for each review
        doc = nlp(single_review)
        doc_polarity = doc._.polarity
        review_result["polarity"] = returnDocPolarity(doc_polarity)

        # Named entities: organisation names, famous personalities,
        # product names, cardinal values, and so on
        named_entities = {}
        for ent in doc.ents:
            named_entities[ent.text] = ent.label_
        review_result["entities"] = named_entities

        # Polarity of each sentence
        sentence_dict = []
        for sentence in doc.sents:
            polarity_sentence = returnSentencePolarity(sentence._.polarity)
            sentence_dict.append(polarity_sentence)
        review_result["polarity_sentence"] = sentence_dict

        # Keywords extracted with the RAKE algorithm; keep only
        # ranked phrases longer than 12 characters
        r = Rake()
        r.extract_keywords_from_text(single_review)
        keywords = [q for q in r.get_ranked_phrases() if len(q) > 12]
        review_result["keywords"] = keywords

        result.append(review_result)

    return result

We can run this microservice by instantiating the server with the command visible in the image below.

FIGURE 4. This API can be accessed at the URI http://127.0.0.1:8000
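For reference, a FastAPI application like this one is typically served with uvicorn. A minimal sketch, assuming the service code lives in a module named main (the actual filename and command are only shown in the figure above, so this is an assumption):

import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app defined above on http://127.0.0.1:8000
    uvicorn.run("main:app", host="127.0.0.1", port=8000)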
Now we can access the API from our client-side application, which we have built using Node.js, TypeScript, and the Next.js React framework.

C. WEB SCRAPING
To scrape data from Amazon, we utilized the Cheerio.js library, which is a fast and flexible tool for parsing HTML and manipulating the resulting data. Our approach involved taking a product name as a query input from the user through a front-end interface. Once the product name was received, we used a helper function to fetch the first few results from the Amazon product listing page using Cheerio and Axios.

After obtaining the necessary data, we developed an API using the Next.js framework, which consumed the helper function and returned a JSON object. This allowed us to efficiently retrieve and process large amounts of data from Amazon's website, which is essential for our application's sentiment analysis and keyword extraction features.

export const fetchProductSkeleton = async (uri: string) => {
  try {
    const response = await axios.get(uri);

    const html = response.data;

    const $ = cheerio.load(html);
    const shelves = [] as productType[];

    $(productClass).each((_idx, element) => {
      const shelf = $(element);

      const title = shelf.find(productTitle).text();

      const image = shelf.find(productImage).find("img").attr("src") as string;

      const uri = shelf.find(productPageURi).attr("href") as string;

      const wholePrice = shelf
        .find(priceContainerClass)
        .find(".a-price-whole")
        .text();

      /**
       * similarly we will get the other attributes needed.
       **/

      shelves.push({
        id: nanoid(),
        title,
        image,
        uri,
        wholePrice,
        fractionalPrice,
        symbolPrice,
        rating,
      });
    });

    return shelves;
  } catch (e: any) {
    return new Error(e);
  }
};

After that, we made an API using Next.js that consumes this function to return a valid JSON object containing all the product details.


const handler = async (request: NextApiRequest, response: NextApiResponse) => {
  const { product } = request.query;

  const uri = createProductUri(product as string);

  try {
    const result = await fetchProductSkeleton(uri);
    console.log(result);

    return response.status(200).json(result);
  } catch (e: any) {
    console.error(e);
    return;
  }
};

Below is an example of this API in use, where we have tested it in an API testing suite with the example product query "iPhone 13"; the results are visible on the right-hand side of the image.

FIGURE 5. API tested with the VS Code extension Thunder Client

D. COMMUNICATING WITH THE PYTHON MICROSERVICE
The next step is to integrate the Python microservice responsible for NLP and keyword extraction with the main client-side application running on the Node.js server. For this, we again used a REST API that calls the microservice's API gateway to fetch the data for the product the user has chosen, once the previous API's results have been fetched and shown on the client-side interface.

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<any>
) {
  const { uri } = req.body;
  try {
    const productPageURi = await fetchReviewPageLink(uri as string);
    const result = await fetchReviews(productPageURi);

    if (result === "an error occured") {
      throw new Error();
    }

    const reviews = result.map((el) => {
      return el.review;
    });

    const fetchResponse = await fetch("http://127.0.0.1:8000/results", {
      method: "POST",
      body: JSON.stringify({
        reviews: reviews,
      }),
      headers: {
        "Content-Type": "application/json",
      },
    });

    const analysis = await fetchResponse.json();

    return res.status(200).json({
      result,
      error: false,
      fetchAnalysis: analysis,
    });
  } catch (e: any) {
    return res.status(404).json({ error: true, message: e.message });
  }
}

We can also test this API in our API testing suite with an example Amazon URL for the product "iPhone 13".

FIGURE 6. API tested with the VS Code extension Thunder Client
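For completeness, the Python microservice can also be exercised directly, without going through the Next.js client. A minimal sketch using Python's requests library, assuming the FastAPI service from Section III is running locally on port 8000 (the sample review text is invented):

import requests

# Invented sample review; the payload shape matches the Text model above
payload = {"reviews": ["Great display, but the battery drains quickly."]}

response = requests.post("http://127.0.0.1:8000/results/", json=payload)

# Each entry carries the review plus its polarity, entities, and keywords
for item in response.json():
    print(item["polarity"], item["keywords"])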
E. ARCHITECTURE OVERVIEW
Our project employs a microservices architecture, which involves breaking down a large, monolithic application into smaller, modular services that can be developed and deployed independently. This architecture offers several advantages, including scalability, resilience, and maintainability.

We chose this architecture based on its ability to handle the complexity of our application, which involves multiple components: web scraping, sentiment analysis, and keyword extraction. By breaking these components into separate services, we were able to simplify development and deployment while also improving the scalability and resilience of the application.

To implement this architecture, we used the Python FastAPI framework, which provided a lightweight and efficient way to build RESTful APIs. FastAPI is designed to be simple and easy to use, with automatic validation of requests and responses, built-in support for asynchronous code, and high performance.

For web scraping, we used the Cheerio.js library, a lightweight and fast library for parsing and manipulating HTML documents.
Cheerio.js allowed us to extract relevant information from Amazon product pages, including reviews, ratings, and product descriptions, which we then used for sentiment analysis and keyword extraction.

FIGURE 7. API-driven architecture

To perform sentiment analysis, we used an NLP model provided by the spaCy library in Python. spaCy is a powerful and flexible library for natural language processing, offering a range of features for tokenization, parsing, and entity recognition. We used spaCy's pre-trained model to analyze the sentiment of each review, categorizing it as positive, negative, or neutral.

For keyword extraction, we used the RAKE algorithm, a simple and efficient algorithm for identifying key phrases in text. RAKE stands for "Rapid Automatic Keyword Extraction", and works by identifying candidate keywords based on their frequency and co-occurrence, and then scoring them based on their relevance and importance.

By combining these technologies and implementing a microservices architecture, we were able to create a powerful and flexible application for analyzing customer sentiment on Amazon. Our application provides businesses with valuable insights into how their products are perceived in the market, which can help them identify areas for improvement and enhance customer satisfaction.

F. REASONS TO CHOOSE THIS ARCHITECTURE
A microservices architecture with APIs as the communication method between services was an excellent choice for this application for the following reasons:

1) Scalability
The microservices architecture allows us to scale each service independently of the others, which is a significant advantage when handling large volumes of data. By splitting the NLP, keyword extraction, and sentiment analysis into separate microservices, we can allocate resources more efficiently and handle higher loads when needed.

2) Modularity
The microservices architecture is highly modular, which means that we can change, update, or add services without affecting the entire application. This flexibility is crucial when dealing with complex applications that require constant updates and maintenance.

3) Fault tolerance
With a microservices architecture, if one service fails, it does not affect the entire application, and we can quickly isolate the issue and fix it. This fault tolerance ensures high availability and reliability for the application.

4) Decoupling
The API acts as a communication layer between different microservices, which allows us to decouple them and create a more loosely coupled architecture. This decoupling means that we can change one microservice without affecting others, making the application more resilient to changes and updates.

FIGURE 8. Recent trends in building user interfaces for microservice applications, such as micro-frontends and frontend composition.

Choosing a microservices architecture with APIs as the communication method was not a novelty but an established approach that has been widely adopted in the industry for developing complex applications. According to a report by Gartner, by 2022, more than 90% of new enterprise applications will be developed using microservices architecture, APIs, and containers (Gartner, 2018). This approach has become a popular choice for businesses looking to develop scalable, modular, and fault-tolerant applications.

G. CLIENT SIDE APPLICATION
For the user interface (UI), we used several technologies to create a seamless and intuitive experience for the user. We utilized Material-UI (MUI), a popular React UI library that provides pre-designed and customizable components. The use of MUI allowed us to create a modern and responsive UI with minimal effort.

To ensure the codebase was consistent and maintainable, we decided to use TypeScript, a typed superset of JavaScript. TypeScript helped us catch errors at compile time, leading to fewer bugs and a more robust codebase. In addition, TypeScript provides excellent support for React, making it an ideal choice for this project.
For the frontend framework, we chose Next.js, a popular React-based framework that provides server-side rendering, automatic code splitting, and other optimizations. Next.js allowed us to create a fast and SEO-friendly UI with minimal effort.

To provide a better user experience, we implemented a loading skeleton that appears while the data is being fetched. This component represents the structure of the final component that will be mounted after the data has been loaded. The use of a loading skeleton improves the perceived performance of the application and makes the UI feel more responsive.

Overall, the use of MUI, TypeScript, Next.js, and loading skeletons allowed us to create a modern and responsive UI with minimal effort. The technologies chosen were carefully selected based on their benefits, ease of use, and popularity within the React community.

FIGURE 9. Search page showing results for iPhone 13

FIGURE 10. Dashboard page for iPhone 13 with sentiment analysis report

FIGURE 11. Individual review report for iPhone 13 on the dashboard

FIGURE 12. Skeleton showing that the system is still loading and yet to display the data

IV. DEPLOYMENT
Deployment is a critical part of any application development process. It involves taking the application that has been built and making it available for use. In this project, we have used Docker containers for each of our microservices, which provides a consistent and reliable way to package, deploy, and run the application. By using containers, we can ensure that our application runs consistently across different environments.

Docker is a containerization platform that allows us to package the application with all its dependencies, libraries, and configurations in a container. It provides an isolated environment to run the application and its dependencies, ensuring that there are no conflicts or compatibility issues. This makes it easy to deploy the application in different environments, such as development, staging, and production.

To deploy our application, we have used a cloud-based deployment platform such as Vercel or Netlify. These platforms provide a simple and easy way to deploy web applications to the cloud, allowing us to focus on development rather than infrastructure.
These platforms provide features such as automatic scaling, load balancing, and SSL encryption, making it easy to deploy and manage our application.

By using a microservices architecture and Docker containers, we can deploy each microservice independently, allowing us to scale each service separately based on its requirements. This provides a flexible and scalable architecture that can handle large volumes of data and requests.

Overall, the combination of Docker containers and cloud-based deployment platforms provides a reliable, scalable, and easy-to-manage deployment solution for our application.

FIGURE 13. By taking advantage of Docker's methodologies for shipping, testing, and deploying code quickly, you can significantly reduce the delay between writing code and running it in production.

According to a study conducted by Grand View Research, the global market for containerization is expected to grow at a CAGR of 25.7% from 2021 to 2028 (Grand View Research, 2021). Furthermore, a survey conducted by Datadog found that microservices and containerization are among the top trends in application development, with more than 70% of respondents reporting that they are using containers in production (Datadog, 2021). These findings support the decision to use a microservices architecture and Docker containers for deployment in our application.

V. UNDERLYING ALGORITHMS
A. STATISTICAL MODELS
For natural language processing, small, default packages are a good start because they include various components that help in predicting annotations in context, providing lexical entries in the vocabulary, and providing word vectors, among others. These components are necessary for building statistical models that can analyze and process natural language.

In our project, we used the default language model provided by the spaCy library in Python. spaCy is an open-source library that provides efficient tools for processing and analyzing natural language text. It comes with pre-trained language models that can perform tasks like part-of-speech tagging, named entity recognition, and dependency parsing.

FIGURE 14. Architecture of the NLP model

spaCy's processing pipeline consists of a series of components, each of which performs a specific task on the input text. When you call the nlp object on a string of text, the text is first tokenized into individual words and punctuation marks. Then, each token is processed through a series of components that perform various operations on the token.

For example, one component might perform part-of-speech tagging, assigning each token a label that indicates its grammatical role in the sentence (e.g. noun, verb, adjective). Another component might perform dependency parsing, creating a parse tree that represents the syntactic relationships between the words in the sentence (e.g. which words are subjects, objects, etc.).

Other components might perform named entity recognition, identifying and classifying named entities like people, organizations, and locations. Yet another component might perform lemmatization, reducing each word to its base form (e.g. "walking" becomes "walk").

FIGURE 15. Pipeline architecture

The pipeline is highly customizable, and you can add or remove components as needed. spaCy also allows you to train your own models on custom data, using its built-in machine learning framework.
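To make the pipeline concrete, the following sketch inspects the pipeline components and the per-token annotations. It assumes a pre-trained English model such as en_core_web_sm has been downloaded; any of spaCy's pre-trained pipelines behaves the same way:

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', ..., 'ner']

doc = nlp("Apple is walking back its pricing in London.")
for token in doc:
    # part-of-speech tag, dependency label, and lemma for each token
    print(token.text, token.pos_, token.dep_, token.lemma_)

for ent in doc.ents:
    # named entities such as organizations and locations
    print(ent.text, ent.label_)  # e.g. "Apple" ORG, "London" GPE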
B. SENTIMENT ANALYSIS
Rule-based sentiment analysis is a technique that involves defining rules to identify sentiment in text. This approach involves creating a set of rules or patterns that match different types of sentiment.
These patterns can include individual words, phrases, and grammatical structures. The rules are designed to identify and classify text as positive, negative, or neutral in sentiment.

FIGURE 16. Valence extraction with the spaCy pipeline

To perform rule-based sentiment analysis, we first tokenized the text using spaCy's tokenizer. We then used the asent library to apply a set of rules to each token to determine its sentiment polarity. The polarity is a float within the range [-1.0, 1.0], where negative values indicate negative sentiment, positive values indicate positive sentiment, and 0.0 indicates neutral sentiment.

The rules we used in our project included patterns for identifying negation, intensifiers, and subjectivity. For example, we identified negation by looking for the presence of words like "not," "never," and "no." We also identified intensifiers like "very" and "extremely" that modify the strength of the sentiment. Finally, we looked for subjectivity indicators like adjectives and adverbs to determine the overall sentiment of the text.

Overall, our approach to rule-based sentiment analysis allowed us to quickly and accurately identify sentiment in the text. It was also modular enough to allow us to customize the rules for our specific use case.
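A minimal sketch of this rule-based scoring with spaCy and asent, mirroring the pipeline configured in Section III (the example sentences are invented):

import spacy
import asent  # importing asent registers the "asent_en_v1" component

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("asent_en_v1")

for text in ["The camera is very good.", "The camera is not good."]:
    doc = nlp(text)
    # negation ("not") flips the valence; intensifiers ("very") scale it
    print(text, doc._.polarity)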
C. RAKE
The Rapid Automatic Keyword Extraction (RAKE) algorithm is a keyword extraction method that uses a combination of co-occurrence and statistical measures to extract keywords from text. It was developed by Rose et al. in 2010 and has been shown to be effective in a variety of applications, including text classification, summarization, and search engine optimization (Rose et al., 2010).

The RAKE algorithm works by first splitting the text at stopwords (common words that are typically removed in text processing) and phrase delimiters, so that each candidate keyword is a run of adjacent content words containing no stopwords. The candidate keywords are then scored based on the frequency of their member words and the degree of association between those words.

To calculate the score of a candidate keyword, RAKE uses two measures for each member word: its frequency of occurrence, which is the number of times the word appears in the text, and its degree, which counts how often it co-occurs with other words inside candidate keywords. Each word is scored as its degree divided by its frequency, and the score of a candidate keyword is the sum of its member words' scores.

RAKE then sorts the candidate keywords by their scores and selects the top N keywords as the final output. The value of N can be adjusted depending on the specific application.
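In the standard formulation (Rose et al., 2010), each content word $w$ is scored as its degree divided by its frequency, and a candidate keyword $K$ is scored as the sum over its member words:

$$\mathrm{score}(w) = \frac{\deg(w)}{\mathrm{freq}(w)}, \qquad \mathrm{score}(K) = \sum_{w \in K} \frac{\deg(w)}{\mathrm{freq}(w)}$$

where $\mathrm{freq}(w)$ is the number of occurrences of $w$ in the document and $\deg(w)$ is the number of times $w$ co-occurs with any word (itself included) within a candidate keyword.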
FIGURE 17. RAKE flow chart

In our project, we used the RAKE algorithm to extract keywords from the product descriptions and reviews scraped from Amazon. This allowed us to identify the most relevant keywords associated with each product, which could then be used for further analysis or visualization.

VI. RESULT
The application was able to successfully scrape data from Amazon, extract keywords, and perform sentiment analysis on the reviews. The microservices architecture allowed for scalability and modularity, making it easy to modify or add new services as required.

In terms of performance, the application was able to process a large amount of data within a reasonable amount of time. The use of microservices meant that each service could be scaled independently, allowing for better resource utilization and improved performance.

The sentiment analysis provided useful insights into the overall sentiment of the reviews, allowing users to quickly identify positive or negative reviews. The keyword extraction also provided useful information about the most frequently mentioned topics in the reviews.
FLOW OF APPLICATION
Our client-side application consumes two REST APIs: the first returns the product details for the product-name query entered by the user; the second, for the selected product, returns the NLP and sentiment analysis report produced by the Python microservice that hosts the NLP model.

Suppose we give the query 'apple ipad' because we want a report on Apple iPad reviews on Amazon.in. The API request for this purpose will be

GET http://localhost:3000/api/getProducts?product=apple%20ipad

We have made a GET request to fetch the JSON response from the server containing all the products related to the name 'apple ipad'. In return, we get an array of products. We take the first item in the array and use that product for the subsequent sentiment analysis and NLP report.
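The same request can be issued outside the browser. A small sketch using Python's requests library, assuming the Next.js server is running locally on port 3000:

import requests

resp = requests.get(
    "http://localhost:3000/api/getProducts",
    params={"product": "apple ipad"},
)
products = resp.json()

# Take the first product; its `uri` feeds the follow-up review scraping
first = products[0]
print(first["title"], first["uri"])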

FIGURE 18. RAKE flow chart

As we can see, we have received a JSON response, and in this result array we are concerned with the 'uri' key-value pair, as we will use this URI for the further scraping of the product reviews. The API can be requested using the following curl command:

FIGURE 19. cURL request for the POST method

Upon fulfilling the request, we receive a JSON response with all the data necessary to process and render the analysis report on the dashboard page of the client-side application. A single review carries the review details and the corresponding analysis data in the JSON object, as shown below.

FIGURE 20. Response review

We will further use this JSON response to render the data on our UI so that users can access it easily and effectively. This is how the application provides a useful tool for analyzing Amazon product reviews and gaining insights into customer sentiment and preferences.

FIGURE 21. Analysis report data in the JSON object returned by the API

VII. LIMITATIONS
There are several limitations to our approach, which need to be considered when interpreting the results. Firstly, web scraping can be limited by the structure and design of the website being scraped. If the website structure changes or the design is updated, the web scraping algorithm may no longer be able to extract the desired data accurately. Secondly, our approach is focused only on the Amazon website, which may not be representative of all online reviews. Therefore, the results may not be generalizable to other websites or online platforms.

Furthermore, our architecture may have limitations in terms of scalability and flexibility. While a microservices architecture is a great solution for modularization, it may not be the most efficient for certain applications. In addition, our use of APIs may be limited by factors such as rate limits and server availability, which can affect the overall performance of the system (Sharma et al., 2020).
Lastly, our approach is dependent on the performance and accuracy of the NLP models and the RAKE algorithm used for sentiment analysis and keyword extraction, respectively. While the spaCy NLP model is widely used and considered to be highly accurate, it may not be suitable for all text analysis tasks. Similarly, the RAKE algorithm may not always produce the most relevant or accurate keywords, and other methods may need to be considered depending on the specific use case.

Overall, it is important to consider these limitations when interpreting the results of our approach and to explore other methods and solutions to overcome them.

VIII. FUTURE SCOPE
While the current system provides useful insights into product reviews, there are still several areas of improvement that can be explored in the future.

One possible direction is to explore more advanced NLP models and algorithms, such as neural networks and deep learning techniques, to improve the accuracy of sentiment analysis and keyword extraction. Recent studies have shown that these models can achieve state-of-the-art performance on various NLP tasks (Young et al., 2018; Peters et al., 2018).

Another direction is to expand the system's capability to scrape data from other e-commerce websites, such as eBay or Walmart, to provide a more comprehensive analysis of product reviews. However, this would require developing separate web scrapers and adjusting the API to accommodate data from multiple sources.

Moreover, the current system's architecture can be improved to handle larger volumes of data and to make it more scalable. One potential solution is to incorporate container orchestration tools like Kubernetes to manage the deployment and scaling of the microservices (Klucarova et al., 2020).

Finally, the system's user interface can be further enhanced to provide more interactive visualizations and to allow users to customize the analysis according to their needs. This can be achieved by integrating popular data visualization libraries like D3.js or Plotly.js into the client-side application.

Overall, these future directions can improve the system's performance, scalability, and usability, and make it more suitable for real-world applications.

IX. CONCLUSION
In conclusion, we have presented a web-based system that utilizes web scraping, NLP, and keyword extraction techniques to analyze and summarize customer reviews of products from Amazon. Our system leverages the power of modern web technologies and cloud-based services to deliver a scalable, fast, and efficient solution for product review analysis. The microservices architecture and API-based communication between services allowed for easy development and deployment of the system on cloud platforms like Vercel and Netlify. Our system offers a simple and user-friendly interface that enables users to obtain key insights from customer reviews without having to read through the entire review corpus.

However, there are certain limitations to our system that need to be addressed. Firstly, web scraping is a controversial practice and is prone to legal and ethical issues. Secondly, our system is focused solely on Amazon product reviews, which limits the scope of its applicability. Furthermore, the accuracy of our NLP models and the RAKE algorithm can be further improved by incorporating more training data and fine-tuning the models. Finally, the architecture of our system can be improved by implementing a more distributed and fault-tolerant design that can handle large-scale data processing and analysis.

Despite these limitations, our system provides a powerful and efficient solution for product review analysis that can be applied in various domains like e-commerce, marketing, and consumer research. As the amount of online review data continues to grow, there is a need for innovative solutions that can help businesses and individuals make better-informed decisions based on customer feedback. Our system can be a stepping stone towards achieving this goal.

ACKNOWLEDGEMENTS
We would like to express our sincere gratitude to our course instructor, Anny Leema A, for her invaluable guidance and support throughout the project. Her insights and advice have been instrumental in shaping the direction of our work and have helped us to overcome various challenges along the way.

We would also like to acknowledge the contributions of the open-source community and the various libraries and tools that we used in this project. Without their hard work and dedication, this project would not have been possible.

REFERENCES
[1] Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
[2] Mishra, R., & Yadav, S. (2017). A review on sentiment analysis and opinion mining. International Journal of Advanced Research in Computer Science, 8(4), 807-811.
[3] Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.
[4] Zhang, Y., & Varian, H. (2019). Extracting signals from noisy data: The case of online reviews. Management Science, 65(3), 983-999.
[5] Gartner. (2018). Gartner predicts 90% of organizations will adopt hybrid infrastructure management by 2020. Retrieved from https://www.gartner.com/en/newsroom/press-releases/2018-05-14-gartner-predicts-90–of-organizations-will-adopt-hybrid-infrastructure-management-by-2020.
[6] Grand View Research. (2021). Containerization Market Size, Share & Trends Analysis Report By Deployment, By Organization, By Application, By End-use, By Region, And Segment Forecasts, 2021-2028. Retrieved from https://www.grandviewresearch.com/industry-analysis/containerization-market
[7] Datadog. (2021). The State of Modern Applications, Part II: Containerization. Retrieved from https://www.datadoghq.com/state-of-modern-applications/containerization/
[8] Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory (pp. 1-20). Wiley.
[9] Khan, M. T., Hussain, M. A., & Kamal, S. (2021). Social media sentiment analysis using rule-based approach. Journal of Intelligent & Fuzzy Systems, 40(3), 5329-5340.



[10] Sharma, S., Kumar, S., & Kaur, P. (2020). Sentiment analysis techniques: A survey. International Journal of Information Technology, 12(2), 437-444. doi:10.1007/s41870-019-00416-8
[11] Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709.
[12] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
[13] Klucarova, L., Gatial, E., & Hovancikova, M. (2020). Container orchestration tools for cloud application deployment. In Advances in Intelligent Systems and Computing (Vol. 1173, pp. 92-100). Springer.
[14] Bolin, J., & Akenroye, T. (2018). The legal and ethical implications of web scraping. Journal of Business & Technology Law, 13(2), 195-226.
[15] Lu, Y., & Liu, X. (2012). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 6(1-2), 1-135.
[16] Newman, S. (2015). Building Microservices: Designing Fine-Grained Systems. O'Reilly Media, Inc.
[17] Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, Inc.
[18] Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. In International Conference on Machine Learning (pp. 957-966).

