ABSTRACT This research article presents a web scraping application for extracting and analyzing sentiment
data from product reviews on Amazon. The abundance of online customer reviews gives businesses a valuable
opportunity to gain insights into customer opinions and preferences. However, manually analyzing a large
volume of reviews is time-consuming and resource-intensive. To address this issue, the proposed application
scrapes product reviews and employs natural language processing techniques to categorize each review as
positive, negative, or neutral. It also assigns a weight to each review based on its perceived impact,
providing a more nuanced understanding of customer opinion. The experimental results demonstrate the
effectiveness of our approach in accurately categorizing reviews and providing insights into customer sentiment.
INDEX TERMS Amazon, web scraping, sentiment analysis, natural language processing, machine learning,
customer reviews, product development, marketing strategies, data-driven approach, and advanced algorithms.
FIGURE 1. Web scraping analyzes and processes online information and turns it into structured data.

Web scraping can be done manually by copying and pasting data from websites, but this is a time-consuming
and error-prone process. Automated web scraping using specialized software or programming languages like
JavaScript is much more efficient and accurate. To automate our process, we used the JavaScript library
“cheerio.js”, a popular Node.js library for web scraping. It is a lightweight and fast library that
simplifies web scraping by providing a simple and intuitive API for querying and manipulating HTML and
XML documents.
B. TYPESCRIPT
TypeScript is a popular open-source programming language developed and maintained by Microsoft. It is a
typed superset of JavaScript that adds optional static typing, classes, and interfaces to the language.
TypeScript is a powerful tool for building large-scale web applications. By adding static typing to
JavaScript, TypeScript provides a range of benefits for developers, including improved code readability,
better tooling support, and enhanced code maintainability. TypeScript also enables developers to catch
potential errors and bugs at compile-time rather than at runtime, which can save time and improve code
quality.
One of the key advantages of TypeScript is its compatibility with existing JavaScript code. TypeScript
code can be compiled into JavaScript, making it easy to integrate TypeScript into existing projects.
TypeScript also supports modern JavaScript features such as async/await and arrow functions, making it a
versatile language for web development.
C. SENTIMENT ANALYSIS
Sentiment analysis, also known as opinion mining, is the process of identifying and categorizing opinions
and attitudes expressed in text data. Sentiment analysis has many applications, including market research,
customer feedback analysis, and social media monitoring.
D. KEYWORD EXTRACTION
Keyword extraction is a crucial step in sentiment analysis, as it allows us to identify the most relevant
terms in the text that are indicative of the sentiment expressed by the author. To achieve this, we
employed the Rapid Automatic Keyword Extraction (RAKE) algorithm, which is a widely used approach for
extracting keywords from text.
The RAKE algorithm works by first splitting the text into candidate phrases at stop words and punctuation,
and then ranking these phrases based on their relevance to the overall meaning of the text. It does this
by calculating a score for each word based on two factors: its frequency in the text and its co-occurrence
with other words inside the candidate phrases. The score of a phrase is the sum of the scores of its words.

FIGURE 2. Rapid Automatic Keyword Extraction (RAKE) algorithm for keyword extraction from product reviews

RAKE has been shown to be highly effective at identifying relevant keywords in a wide range of contexts,
from social media posts to academic research papers. In our project, we used the RAKE algorithm to identify
the most relevant keywords in each product review, allowing us to gain a deeper understanding of the
sentiment expressed by customers.
To implement RAKE in our project, we used the Python programming language and the Natural Language Toolkit
(NLTK) library, which provides a wide range of tools and techniques for natural language processing. We
also leveraged the Pandas library to handle and process the large volumes of data we collected from Amazon.
2 Technical Answers for Real World Problems, TARP (CBS1901)
E. MICROSERVICES ARCHITECTURE
Microservices architecture has been gaining popularity in recent years as a way to build complex
applications by splitting them into smaller, independent services. In our project, we have adopted a
Microservices architecture to enable efficient and scalable processing of sentiment analysis and keyword
extraction tasks.
We have implemented our Microservices using Python and the FastAPI framework, which provides a lightweight
and efficient way to build web APIs. Our architecture consists of multiple Microservices, each responsible
for a specific task, such as sentiment analysis or keyword extraction.
By using a Microservices architecture, we can scale each service independently, depending on the load it
receives. This means that we can easily add more resources to a particular service if it is experiencing
high traffic, without affecting other parts of the system. Additionally, it allows us to deploy each
service on a separate server, which improves the system’s fault tolerance.

FIGURE 3. Microservices are lightweight, self-contained components that perform their respective functions
within an application, communicating with each other via APIs.

[...]mentation of this architecture has also enabled us to improve the overall performance and reliability
of our system.
A. LIBRARIES
To implement the FastAPI Microservices, we used various Python libraries such as uvicorn as the web server,
requests for making HTTP requests, and pydantic for data validation and parsing. The following code snippet
shows the libraries imported in our Python application, which runs on a separate server.

```python
from asent.data_classes import SpanPolarityOutput
from fastapi import FastAPI
import spacy
import asent  # provides the "asent_en_v1" pipeline component used below
from typing import Union, List, Any
from pydantic import BaseModel
from spacy.matcher import PhraseMatcher

from rake_nltk import Rake
```

B. API CODE
We used various features provided by FastAPI, such as automatic data validation, request/response
handling, and built-in support for async/await, which allowed us to write efficient and easy-to-maintain
code. FastAPI also provides automatic generation of API documentation, which makes it easy for other
developers to understand and use our API.

```python
# Load the large English spaCy model and add the asent sentiment component.
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("asent_en_v1")

# Match the aspect terms case-insensitively (attr="LOWER" compares lowercased tokens).
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["quality", "pricing", "looks", "worth it"]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)


def returnDocPolarity(self):
    # Expects an object with negative, positive, neutral and compound attributes.
    return {
        "negative": round(self.negative, 3),
        "positive": round(self.positive, 3),
        "neutral": round(self.neutral, 3),
        "compound": round(self.compound, 3),
    }


def returnSentencePolarity(self):
    # Same scores as above, plus the text of the span that was scored.
    return {
        "negative": round(self.negative, 3),
        "positive": round(self.positive, 3),
        "neutral": round(self.neutral, 3),
        "compound": round(self.compound, 3),
        "span": str(self.span),
    }
```
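The RAKE scoring described in the keyword extraction section can be illustrated with a minimal, self-contained sketch. It is not the rake_nltk implementation imported above: the tiny stop-word list, the function names candidate_phrases and rake_keywords, and the sample review are assumptions made for illustration, and the degree-to-frequency ratio is only one of the word-scoring metrics described for RAKE.

```python
import re
from collections import defaultdict

# Deliberately tiny stop-word list (an assumption for this sketch);
# real implementations use a much larger list, such as NLTK's.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "but", "it",
              "of", "for", "to", "very", "this", "with", "was"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_keywords(text):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # co-occurrences within phrases, counting the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


review = "The battery life is great, but the camera quality is very poor."
for phrase, score in rake_keywords(review):
    print(f"{score:4.1f}  {phrase}")
```

In this sketch, multi-word phrases such as "battery life" rank above single words because every word in a longer phrase receives a higher degree.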
We can also test this API using our API testing suite with
an example Amazon URL for the product “iPhone 13”.
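A request of this kind can also be built programmatically. The sketch below only constructs the request and does not send it; the endpoint http://localhost:8000/analyze, the payload field url, and the example product URL are hypothetical placeholders, since the actual route and request schema are not given here.

```python
import json
import urllib.request

# Hypothetical endpoint; the real route of the service may differ.
API_ENDPOINT = "http://localhost:8000/analyze"


def build_analysis_request(product_url):
    """Build (but do not send) a JSON POST request for one product page."""
    body = json.dumps({"url": product_url}).encode("utf-8")
    return urllib.request.Request(
        API_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analysis_request("https://www.amazon.com/dp/EXAMPLE")
print(request.get_method(), request.full_url)

# Sending it would look like this (requires the service to be running):
# with urllib.request.urlopen(request, timeout=30) as response:
#     report = json.load(response)
```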
4) Decoupling
The API acts as a communication layer between different
Microservices, which allows us to decouple them and create
a more loosely coupled architecture. This decoupling means
that we can change one Microservice without affecting oth-
ers, making the application more resilient to changes and
updates.
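The decoupling described above can be illustrated with a small sketch in which the caller depends only on an interface and not on a concrete service. The names SentimentClient, FakeSentimentClient, and summarize are hypothetical; in a deployed system the implementation behind the interface would call the sentiment Microservice over HTTP.

```python
from typing import Dict, List, Protocol


class SentimentClient(Protocol):
    """Anything that can score a piece of text (hypothetical interface)."""

    def analyze(self, text: str) -> Dict[str, float]:
        ...


class FakeSentimentClient:
    """Stand-in implementation; a real one would call the service's API."""

    def analyze(self, text: str) -> Dict[str, float]:
        # Toy rule for this sketch only: "good" is positive, anything else neutral.
        return {"compound": 0.5 if "good" in text.lower() else 0.0}


def summarize(reviews: List[str], client: SentimentClient) -> float:
    """Average compound score; depends only on the interface above."""
    if not reviews:
        return 0.0
    return sum(client.analyze(r)["compound"] for r in reviews) / len(reviews)


print(summarize(["Good phone", "Arrived on time"], FakeSentimentClient()))
```

Because summarize only relies on the analyze method, the service behind it can be replaced or updated without changing the calling code.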
FIGURE 7. API driven architecture
FIGURE 10. Dashboard page for iPhone 13 with sentiment analysis report
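One way to turn compound scores into positive, negative, or neutral labels, and to combine them with per-review weights, is sketched below. The cut-off of 0.05 follows a convention commonly used with VADER-style compound scores and is an assumption here, as is the choice of weight; the weighting actually used by the system may differ.

```python
from typing import List, Tuple

# Assumed cut-off, borrowed from common practice with VADER-style scores.
THRESHOLD = 0.05


def label_from_compound(compound: float) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= THRESHOLD:
        return "positive"
    if compound <= -THRESHOLD:
        return "negative"
    return "neutral"


def weighted_compound(scored_reviews: List[Tuple[float, float]]) -> float:
    """Weighted mean of compound scores; each item is (compound, weight)."""
    total_weight = sum(weight for _, weight in scored_reviews)
    if total_weight == 0:
        return 0.0
    return sum(c * w for c, w in scored_reviews) / total_weight


# (compound score, weight); the weight could be, for example,
# 1 + helpful votes (a hypothetical choice for this sketch).
reviews = [(0.8, 5.0), (-0.6, 1.0), (0.0, 2.0)]
overall = weighted_compound(reviews)
print(label_from_compound(overall), round(overall, 3))
```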
FIGURE 12. Skeleton showing that the system is still loading and yet to display the data

IV. DEPLOYMENT
Deployment is a critical part of any application development process. It involves taking the application
that has been built and making it available for use. In this project, we have used Docker containers for
each of our Microservices, which provides a consistent and reliable way to package, deploy, and run the
application. By using containers, we can ensure that our application runs consistently across different
environments.
Docker is a containerization platform that allows us to package the application with all its dependencies,
libraries, and configurations in a container. It provides an isolated environment to run the application
and its dependencies, ensuring that there are no conflicts or compatibility issues. This makes it easy to
deploy the application on different environments, such as development, staging, and production.
To deploy our application, we have used a cloud-based deployment platform such as Vercel or Netlify. These
platforms provide a simple and easy way to deploy web applications to the cloud, allowing us to focus on
the development and not on the infrastructure. They provide features such as automatic scaling, load
balancing, and SSL encryption, making it easy to deploy and manage our application.
By using a Microservices architecture and Docker con-
tainers, we can deploy each Microservice independently,
allowing us to scale each service separately based on its re-
quirements. This provides a flexible and scalable architecture
that can handle large volumes of data and requests.
Overall, the combination of Docker containers and cloud-
based deployment platforms provides a reliable, scalable, and
easy-to-manage deployment solution for our application.
FIGURE 19. CURL request for POST method

Upon fulfilling the request, we receive a JSON response with all the data necessary to process and render
the analysis report on the dashboard page of the client-side application. A single review has its review
details and the corresponding analysis data in the JSON object as shown below.
We will further use this JSON response to render this data on our UI so that users can access it easily
and effectively. In this way, the application provides a useful tool for analyzing Amazon product reviews
and gaining insights into customer sentiment and preferences.
VII. LIMITATIONS
There are several limitations to our approach, which need to be considered when interpreting the results.
Firstly, web scraping can be limited by the structure and the design of the website being scraped. If the
website structure changes or the design is updated, the web scraping algorithm may no longer be able to
extract the desired data accurately. Secondly, our approach is only focused on the Amazon website, which
may not be representative of all online reviews. Therefore, the results may not be generalizable to other
websites or online platforms.
Furthermore, our architecture may have limitations in terms of scalability and flexibility. While
microservices architecture is a great solution for modularization, it may not be the most efficient for
certain applications. In addition, our use of APIs may be limited by factors such as rate limits and
server availability, which can affect the overall performance of the system (Sharma et al., 2020).
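Rate limits and temporary unavailability are commonly mitigated by retrying a call with exponential backoff. The sketch below is a generic illustration and not part of the described system; call_with_backoff, flaky_call, and the delay values are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


# Demo with a stand-in call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


delays = []
# Record the delays instead of actually sleeping, to keep the demo instant.
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```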
Lastly, our approach is dependent on the performance and accuracy of the NLP models and RAKE algorithm
used for sentiment analysis and keyword extraction, respectively. While the spaCy NLP model is widely used
and considered to be highly accurate, it may not be suitable for all text analysis tasks. Similarly, the
RAKE algorithm may not always produce the most relevant or accurate keywords, and other methods may need
to be considered depending on the specific use case.
Overall, it is important to consider these limitations when interpreting the results of our approach and
to explore other methods and solutions to overcome them.
VIII. FUTURE SCOPE
While the current system provides useful insights into product reviews, there are still several areas of
improvement that can be explored in the future.
One possible future direction is to explore more advanced NLP models and algorithms, such as neural
networks and deep learning techniques, to improve the accuracy of sentiment analysis and keyword
extraction. Recent studies have shown that these models can achieve state-of-the-art performance on
various NLP tasks (Young et al., 2018; Peters et al., 2018).
Another direction is to expand the system’s capability to scrape data from other e-commerce websites, such
as eBay or Walmart, to provide a more comprehensive analysis of product reviews. However, this would
require developing separate web scrapers and adjusting the API to accommodate data from multiple sources.
Moreover, the current system’s architecture can be improved to handle larger volumes of data and to make
it more scalable. One potential solution is to incorporate container orchestration tools like Kubernetes
to manage the deployment and scaling of the microservices (Klucarova et al., 2020).
Finally, the system’s user interface can be further enhanced to provide more interactive visualizations
and to allow users to customize the analysis according to their needs. This can be achieved by integrating
popular data visualization libraries like D3.js or Plotly.js into the client-side application.
Overall, these future directions can improve the system’s performance, scalability, and usability, and
make it more suitable for real-world applications.
IX. CONCLUSION
In conclusion, we have presented a web-based system that utilizes web scraping, NLP, and keyword
extraction techniques to analyze and summarize customer reviews of products from Amazon. Our system
leverages the power of modern web technologies and cloud-based services to deliver a scalable, fast, and
efficient solution for product review analysis. The microservices architecture and API-based communication
between services allowed for easy development and deployment of the system on cloud platforms like Vercel
and Netlify. Our system offers a simple and user-friendly interface that enables users to obtain key
insights from customer reviews without having to read through the entire review corpus.
However, there are certain limitations to our system that need to be addressed. Firstly, web scraping is a
controversial practice and is prone to legal and ethical issues. Secondly, our system is focused solely on
Amazon product reviews, which limits the scope of its applicability. Furthermore, the accuracy of our NLP
models and the RAKE algorithm can be further improved by incorporating more training data and fine-tuning
the models. Finally, the architecture of our system can be improved by implementing a more distributed and
fault-tolerant design that can handle large-scale data processing and analysis.
Despite these limitations, our system provides a powerful and efficient solution for product review
analysis that can be applied in various domains like e-commerce, marketing, and consumer research. As the
amount of online review data continues to grow, there is a need for innovative solutions that can help
businesses and individuals make better-informed decisions based on customer feedback. Our system can be a
stepping stone towards achieving this goal.
ACKNOWLEDGEMENTS
We would like to express our sincere gratitude to our course instructor, Anny Leema A, for her invaluable
guidance and support throughout the project. Her insights and advice have been instrumental in shaping the
direction of our work and have helped us to overcome various challenges along the way.
We would also like to acknowledge the contributions of the open source community and the various libraries
and tools that we used in this project. Without their hard work and dedication, this project would not
have been possible.