
ANALYSING SALES DATA USING HADOOP

A PROJECT REPORT
for
BIG DATA ANALYTICS (SWE2011)
in
M.Tech. (Integrated) Software Engineering
by
K.L. YAMINI (19MIS0104)
SANGADALA ESWARI (19MIS0126)
SHAKTHIVEL R.K. (19MIS0313)

Under the Guidance of


Dr. SENTHILKUMAR N C
Associate Professor, SITE

School of Information Technology and Engineering


December, 2021
DECLARATION BY THE CANDIDATE

We hereby declare that the project report entitled “ANALYSING SALES DATA
USING HADOOP” submitted by us to Vellore Institute of Technology, Vellore
in partial fulfillment of the requirement for the award of the course Big Data
Analytics (SWE2011) is a record of bonafide project work carried out by us
under the guidance of Dr. Senthilkumar N C. We further declare that the work
reported in this project has not been submitted and will not be submitted, either
in part or in full, for the award of any other course.

Place : Vellore Signature

Date : 08-12-2021
School of Information Technology & Engineering [SITE]

CERTIFICATE

This is to certify that the project report entitled “ANALYSING SALES DATA
USING HADOOP” submitted by Sangadala Eswari (19MIS0126) to Vellore
Institute of Technology, Vellore in partial fulfillment of the requirement for the
award of the course Big Data Analytics (SWE2011) is a record of bonafide work
carried out by them under my guidance.

Dr. Senthilkumar N C
GUIDE
Associate Professor, SITE
Analysing sales data using Hadoop
Abstract

The aim of this project is to analyse and determine the revenue earned across different stores
in a country/region. We use the core concepts of the Big Data Hadoop platform, HDFS and
MapReduce, to solve the problem: large volumes of data are processed in parallel by dividing
the work into a set of independent tasks. MapReduce breaks the input file into chunks and
processes them in parallel; the mapper function splits and maps the data, and the reducer
function shuffles and reduces it.

Keywords – Hadoop framework, MapReduce, mapper function, reducer function.

I. INTRODUCTION
Recently, companies have begun to place greater emphasis on data and data analysis for strategic
decision-making. Due to the exponential growth of data in real-world applications, it has
become essential to have a good ecosystem that can be used to store and manage big data
characterized by volume, velocity, and variety. The emergence of cloud computing has enabled
the processing of large data sets by providing a large pool of configurable, shared computing
resources. At the same time, the Hadoop distributed programming framework appeared. This
infrastructure supports the storage and processing of huge amounts of data and leverages
parallel processing across thousands of commodity computers. Because big data processing
requires both computing power and storage, cloud computing has become the natural way to
do it. With cloud computing, virtualization, big data, Hadoop, and other big data platforms, an
ecosystem is created in a distributed environment. MapReduce is the programming model used
for big data processing: a programming approach consisting of map tasks and reduce tasks,
with the work carried out in an environment distributed across many commodity computers.
Notably, Amazon provides its own MapReduce offering, called Elastic MapReduce, based on
Hadoop. Hadoop is therefore the framework used to handle big data by supporting MapReduce
programming. However, security issues arise when the mapper or reducer is compromised:
map and reduce jobs are vulnerable to various types of attacks. In this project, we used the
mapper and reducer to calculate the total sales of each store in the United States.

The objective is to analyse and determine the revenue earned across different stores in a region.
II. BACKGROUND

Hadoop Framework :

Hadoop is an open-source framework that can be used for processing huge volumes of data;
the size of the data may run to petabytes. It is a distributed computing framework. The
architecture of Hadoop is a master-slave architecture, where the bottom layer consists of many
machines working in parallel. The large volume of data is split into different data sets which
run on different machines, and each machine executes its task. Each worker machine runs a
task tracker, while the master runs the job tracker. The main components of the Hadoop
framework are HDFS (Hadoop Distributed File System) and the MapReduce programming model.

HDFS — Hadoop Distributed File System

Before processing, large files from the local disk or real-time data are stored in the distributed
file system of Hadoop, known as HDFS. It has nodes for storage and for computation. It is
possible to create, delete, and manage files within HDFS, and it also allows the creation of
directories. It has DataNodes and a NameNode for storing and accessing the files.
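
As a quick illustration of these file operations, the following shell commands (a minimal sketch; the directory name “salesdata” is a placeholder of our choosing) create an HDFS directory, upload a local file into it, list its contents, and delete it again:

“hadoop fs -mkdir salesdata”
“hadoop fs -put purchases.txt salesdata”
“hadoop fs -ls salesdata”
“hadoop fs -rm -r salesdata”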

MapReduce :

MapReduce is a programming model. It simplifies processing by splitting the large volume of
data and sending it in parallel to different machines. It supports both structured and
unstructured data formats. The main programming language used is Java, but other
programming languages such as Python, C, and C++ are also supported. Other data analytics
tools such as R can also be integrated with Hadoop.

MapReduce programming has two functions:

a. Mapper function

In this function, the large input is divided into sub-problem inputs or tasks. These tasks are
distributed to the nodes, which may further divide the sub-tasks into sub-sub-tasks. Once the
processing is done, the output is mapped into key-value pair format. Before the mapped output
is sent to the reducer function, it is shuffled and sorted based on the keys.

b. Reducer function

This function is used for aggregation, such as finding the total count for each key or any other
analysis that needs to be carried out. It produces a single result or a set of results.
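
To make the two functions concrete, the following minimal Python sketch (a single-machine simulation for illustration only, not distributed Hadoop code; the input lines are made up) walks through the classic word-count job: the map step emits (word, 1) key-value pairs, the pairs are shuffled and sorted by key, and the reduce step sums the counts of each key.

from itertools import groupby
from operator import itemgetter

# Made-up input lines standing in for file chunks
lines = ["big data hadoop", "hadoop map reduce", "big data"]

# Map phase: emit a (word, 1) key-value pair for every word
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: bring all pairs with the same key together
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the values of each key
for word, group in groupby(pairs, key=itemgetter(0)):
    print(word, sum(count for _, count in group))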
III. LITERATURE SURVEY
26 – OCT – 2010

Source : https://www.cl.ecei.tohoku.ac.jp/publications/2010/and2010-murakami.pdf

Title : Statement Map: Reducing Web Information Credibility Noise through Opinion
Classification

In this paper, the authors describe their strategy for generating Statement Maps in terms of
passage retrieval, linguistic analysis, structural alignment, and semantic relation classification.
They present a prototype system that identifies semantic relations in Japanese Web texts using
a combination of lexical, syntactic, and semantic information, and evaluate the system against
real-world data and queries. Preliminary evaluation showed that [Agreement], [Conflict],
[Confinement], and [Evidence] can be detected with moderate levels of confidence. The paper
also discusses some of the technical issues that need to be solved in order to generate better
Statement Maps.

OCT – 2011

Source : https://vivomente.com/wp-content/uploads/2016/04/big-data-analytics-white-
paper.pdf

Title : BIG DATA ANALYTICS

Big data used to be a technical problem. Now it’s a business opportunity. Big data is not just
big. It’s also diverse data types and streaming data. Big data analytics is the application of
advanced analytic techniques to very big data sets. There are many types of vendor products to
consider for big data analytics. This report discusses the types. Of course, businesspeople can
learn a lot about the business and their customers from BI programs and data warehouses. But
big data analytics explores granular details of business operations and customer interactions
that seldom find their way into a data warehouse or standard report. Some organizations are
already managing big data in their enterprise data warehouses, while others have designed their
DWs for the well-understood, auditable, and squeaky clean data that the average business
report demands. The former tend to manage big data in the EDW and execute most analytic
processing there, whereas the latter tend to distribute their efforts onto secondary analytic
platforms. There are also hybrid approaches. Regardless of approach, user organizations are
currently reevaluating their analytic portfolios. In response to the demand for platforms suited
to big data analytics, vendors have released a slew of new product types including analytic
databases, data warehouse appliances, columnar databases, NoSQL databases, distributed file
systems, and so on. There is also a new slew of analytic tools. This report drills into all the
aspects of big data analytics mentioned here to give users and their business sponsors a solid
background for big data analytics, including business and technology drivers, successful
business use cases, and common technology enablers. The report also uses survey data to
project the future of the most common tool types, features, and functions associated with big
data analytics, so users can apply this information to planning their own programs and
technology stacks for big data analytics.

NOV – 2011

Source : https://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf

Title : Challenges and Opportunities with Big Data

The analysis of Big Data involves multiple distinct phases as shown in the figure below, each
of which introduces challenges. Many people unfortunately focus just on the analysis/modeling
phase: while that phase is crucial, it is of little use without the other phases of the data analysis
pipeline. Even in the analysis phase, which has received much attention, there are poorly
understood complexities in the context of multi-tenanted clusters where several users’
programs run concurrently. Many significant challenges extend beyond the analysis phase. For
example, Big Data has to be managed in context, which may be noisy, heterogeneous and not
include an upfront model. Doing so raises the need to track provenance and to handle
uncertainty and error: topics that are crucial to success, and yet rarely mentioned in the same
breath as Big Data. Similarly, the questions to the data analysis pipeline will typically not all
be laid out in advance. We may need to figure out good questions based on the data. Doing this
will require smarter systems and also better support for user interaction with the analysis
pipeline. In fact, we currently have a major bottleneck in the number of people empowered to
ask questions of the data and analyze it.

We have entered an era of Big Data. Through better analysis of the large volumes of data that
are becoming available, there is the potential for making faster advances in many scientific
disciplines and improving the profitability and success of many enterprises. However, many
technical challenges described in this paper must be addressed before this potential can be
realized fully. The challenges include not just the obvious issues of scale, but also
heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and
visualization, at all stages of the analysis pipeline from data acquisition to result interpretation.
These technical challenges are common across a large variety of application domains, and
therefore not cost-effective to address in the context of one domain alone. Furthermore, these
challenges will require transformative solutions, and will not be addressed naturally by the next
generation of industrial products. We must support and encourage fundamental research
towards addressing these technical challenges if we are to achieve the promised benefits of Big
Data.

MAY – 2013

Source : https://www.iqpc.com/media/7863/11710.pdf

Title : Big Data in Big Companies

There have always been three types of analytics: descriptive, which report on the past;
predictive, which use models based on past data to predict the future; and prescriptive, which
use models to specify optimal behaviors and actions. Analytics 3.0 includes all types, but there
is an increased emphasis on prescriptive analytics. These models involve large-scale testing
and optimization. They are a means of embedding analytics into key processes and employee
behaviors. They provide a high level of operational benefits for organizations, but they place a
premium on high-quality planning and execution. Prescriptive analytics also can change the
roles of front-line employees and their relationships with supervisors, as at Schneider National.

Even though it hasn’t been long since the advent of big data, these attributes add up to a new
era. It is clear from our research that large organizations across industries are joining the data
economy. They are not keeping traditional analytics and big data separate, but are combining
them to form a new synthesis. Some aspects of Analytics 3.0 will no doubt continue to emerge,
but organizations need to begin transitioning now to the new model. It means change in skills,
leadership, organizational structures, technologies, and architectures. It is perhaps the most
sweeping change in what we do to get value from data since the 1980s. It’s important to
remember that the primary value from big data comes not from the data in its raw form, but
from the processing and analysis of it and the insights, products, and services that emerge from
analysis. The sweeping changes in big data technologies and management approaches need to
be accompanied by similarly dramatic shifts in how data supports decisions and product/service
innovation. There is little doubt that analytics can transform organizations, and the firms that
lead the 3.0 charge will seize the most value.

SEP – 2014

Source : https://www.intel.com/content/dam/www/public/us/en/documents/reports/big-data-
analytics-2013-peer-research-report.pdf

Title : Peer Research Big Data Analytics

Big data continues to be top of mind for IT. Organizations know the importance of being able
to gain deeper, richer insights that can speed decision making and identify new strategic
initiatives, and they are moving forward with big data analytics projects at a strong pace.
Important to the ability to process vast amounts of structured and unstructured data quickly and
efficiently, both commercial and open-source deployments of the Hadoop framework are
proliferating in larger companies. And as these companies activate big data, the cloud is a
prominent solution for them—mainly a private cloud, but that, too, is evolving.

Challenges and obstacles that IT professionals face are evolving as well. Where last year,
security and the rate of data growth topped the list, today, a shortage of skilled professionals
has emerged as a real concern for our survey group. Still, the value of big data is clear, and
organizations are moving forward in spite of these challenges. And they’re using big data in a
variety of ways—to evaluate staffing levels and productivity, generate competitive intelligence,
improve pricing and lower IT costs, and more. And those uses will expand over time as well,
to include understanding ways to improve operational efficiency, identify new revenue
sources, and enhance product development, to name a few.

29 – OCT – 2014

Source : https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-7-22

Title : Applications of the MapReduce programming framework to clinical big data


analysis: current landscape and future trends
An integrated solution eliminates the need to move data into and out of the storage system
while parallelizing the computation, a problem that is becoming more important due to
increasing numbers of sensors and resulting data. Thus, efficient processing of clinical
data is a vital step towards multivariate analysis of the data in order to develop a better
understanding of a patient's clinical status (i.e. descriptive and predictive analysis). The Hadoop
platform and the MapReduce programming framework already have a substantial base in the
bioinformatics community, especially in the field of next-generation sequencing analysis, and
such use is increasing. This is due to the cost-effectiveness of the Hadoop-based analysis on
commodity Linux clusters, and in the cloud via data upload to cloud vendors who have
implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce
method in parallelization of many data analysis algorithms. HDFS supports multiple reads and
one write of the data. The write process can therefore only append data (i.e. it cannot modify
existing data within the file). HDFS does not provide an index mechanism, which means that
it is best suited to read-only applications that need to scan and read the complete contents of a
file (i.e. MapReduce programs). The actual location of the data within an HDFS file is
transparent to applications and external software. Thus, software built on top of HDFS has
little control over data placement or knowledge of data location, which can make it difficult
to optimize performance.

JAN – 2015

Source :
https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/Marketing%20and%2
0Sales/Our%20Insights/EBook%20Big%20data%20analytics%20and%20the%20future%20o
f%20marketing%20sales/Big-Data-eBook.ashx

Title : Big Data Analytics and the Future of Marketing & Sales

This goldmine of data represents a pivot-point moment for marketing and sales leaders. The
authors show how capturing the potential of data analytics requires the building blocks of any
good strategic transformation: it starts with a plan, demands the creation of new senior-
management capacity to really focus on data, and, perhaps most important, addresses the
cultural and skill-building challenges needed for the front line to embrace the change. For
example, the authors helped one well-known insurer find a way to grow its sales without
increasing its marketing budget by answering two questions: first, how much should be
invested in marketing, and second, to which channels, vehicles, and messages should that
investment be allocated? These clear markers guided the company as it triangulated between
three sources of data, helping it develop a proprietary model to optimize spending across
channels at the zip code level. (For more on this, read “What you need to make big data work:
The pencil.”) While some companies rely on just one analytical technique, the greatest returns
come when marketing return on investment (MROI) tools are used in concert; sophisticated
econometrics, for instance, helps clients understand the impact of their marketing spend across
digital and non-digital channels. When companies were asked what degree of revenue or cost
improvement they had achieved through the use of these techniques, three-quarters said it was
less than 1 percent. Well-developed analytics models can take in POS transaction data, product
specs, payment details, aftersales service logs, customer data, social media data, etc., and
develop accurate “next product to buy” (NPTB) and optimum timing recommendations for
specific customer segments.

2015

Source :
https://reader.elsevier.com/reader/sd/pii/S1877050915019213?token=83B1ED63137BFE516
7DEF141CBA5C90D26C795F36A6893AB73985BD793130E852B8FB649CC193BB6CB87
BE1B2797AFCE&originRegion=eu-west-1&originCreation=20211021083224

Title : MapReduce: Simplified Data Analysis of Big Data

Big data and the technologies associated with it can bring significant benefits to the business,
but the tremendous scale of these technologies makes it difficult for an organization to control
such vast and heterogeneous collections of data for further analysis and investigation. Using
big data has several impacts: it offers huge potential to individual companies facing
competition and pursuing strong growth. Certain practices need to be followed to obtain timely
and productive results, because the precise use of big data can boost throughput,
modernization, and effectiveness for entire sectors and economies. To extract the benefits of
big data, it is crucial to know how to ensure intelligent use, management, and re-use of data
sources, including public government data, in and across countries to build useful applications
and services, and to evaluate the best approach for filtering and/or analyzing the data. For
optimized analytic processing, Hadoop with MapReduce can be used. In this paper, the authors
present the basics of MapReduce programming with the open-source Hadoop framework. This
framework speeds up the processing of large amounts of data through distributed processing
and thus provides responses very fast. It can be adopted and customized to meet various
development requirements and can be scaled by increasing the number of nodes available for
processing. The extensibility and simplicity of the framework are the key differentiators that
make it a promising tool for data processing.

JAN – 2016

Source : https://sciresol.s3.us-east-2.amazonaws.com/IJST/Articles/2016/Issue-3/Article9.pdf

Title : Artificial Bee Colony with Map Reducing Technique for Solving Resource
Problems in Clouds

In this paper, the authors propose an effective technique for reducing resource problems in
clouds using the map-reduce algorithm. The map-reduce algorithm generates a solution which
is further optimized with the help of an optimization algorithm, in this case the artificial bee
colony (ABC) algorithm. The proposed method of resource problem reduction proves to be
more effective, as it reduces the storage requirements to a large extent. The results show that
the execution time is reduced considerably compared to the existing method, so it proved to be
an efficient method for reducing the resource problems that occur in cloud computing.

4 – APRIL – 2016

Source : https://www.emerald.com/insight/content/doi/10.1108/IJOPM-03-2015-
0151/full/html

Title : Predicting online product sales via online reviews, sentiments, and promotion
strategies: A big data architecture and neural network approach
The purpose of this paper is to investigate if online reviews (e.g. valence and volume), online
promotional strategies (e.g. free delivery and discounts) and sentiments from user reviews can
help predict product sales. The authors designed a big data architecture and deployed Node.js
agents for scraping the Amazon.com pages using asynchronous input/output calls. The
completed web crawling and scraping data sets were then preprocessed for sentimental and
neural network analysis. The neural network was employed to examine which variables in the
study are important predictors of product sales. This study found that although online reviews,
online promotional strategies and online sentiments can all predict product sales, some
variables are more important predictors than others. The authors found that the interplay effects
of these variables become more important variables than the individual variables themselves.
For example, online volume interactions with sentiments and discounts are more important
than the individual predictors of discounts, sentiments or online volume. This study designed
big data architecture, in combination with sentimental and neural network analysis that can
facilitate future business research for predicting product sales in an online environment. This
study also employed a predictive analytic approach (e.g. neural network) to examine the
variables, and this approach is useful for future data analysis in a big data environment where
prediction can have more practical implications than significance testing. This study also
examined the interplay between online reviews, sentiments and promotional strategies, which
up to now have mostly been examined individually in previous studies.

April - 2016

Source :
https://www.researchgate.net/publication/301319823_MapReduce_Review_and_open_challe
nges

Title : MapReduce: Review and open challenges

MapReduce has proven to be a useful programming model framework for large scale data
processing. This is because of its remarkable flexibility, which allows automatic parallelization
and execution on a large-scale cluster with more than thousands of nodes. In this paper, a
dataset consisting of 6,481 publications fulfilled the selection criteria specified in data
collection, including journal types, conference proceedings, books, book sections, and patents.
To analyze the study on MapReduce research from a bibliometric perspective, they initially
conducted a basic bibliometric study using keywords, including titles of publication and
abstracts, through which to subsequently create the special publication. The analysis of
keywords uncovers trends and patterns about a particular domain by measuring the association
strengths of terms that are representative of relevant publications produced in this research.
The major feature of keyword analysis is its visualization of the intellectual structure of a
specific discipline into maps of the conceptual space of this field; a time-series of such maps
produces a trace of the changes in this conceptual space. The keywords are used by numerous
researchers to identify word and expression frequency that indicates the core content of the
literature. Second, the generation of maps and co-word clusters is required through the
representation of information and the intellectual structure of MapReduce research. As the last
link in the chain, words are used to clarify the cognitive structure of a field through semantic
maps. This process is often referred to as co-word analysis. For data visualization, VOSviewer
program is used to produce distance- and graph-based maps. The functionality of VOSviewer
is particularly useful for displaying large bibliometric maps in a manner that is easy to interpret.

24 – AUG – 2016

Source : https://link.springer.com/article/10.1007/s10479-016-2296-z

Title : Customer reviews for demand distribution and sales nowcasting: a big data
approach

Proliferation of online social media and the phenomenal growth of online commerce have
brought to us the era of big data. Before this availability of data, models of demand distribution
at the product level proved elusive due to the ever shorter product life cycle. Methods of sales
forecast are often conceived in terms of longer-run trends based on weekly, monthly or even
quarterly data, even in markets with rapidly changing customer demand such as the fast fashion
industry. The authors developed an efficient method to visualize the demand distributional
characteristics; found that big data streams of customer reviews contain useful information for
better sales nowcasting; and discussed the current influence pattern of sentiment on sales. The
results contribute to practical visualization of the demand structure at the product level, where
the number of products is high and the product life cycle is short; they reveal big data streams
as a source for better sales nowcasting at the corporate and product level; and they provide a
better understanding of the influence of online sentiment on sales.
OCT – 2016

Source:
https://www.researchgate.net/publication/309755898_A_Study_on_MapReduce_Challenges_
and_Trends

Title : A Study on MapReduce: Challenges and Trends

Nowadays we are all surrounded by huge data. People upload and download videos, audio, and
images from a variety of devices. Sending text messages and multimedia messages, updating
Facebook, WhatsApp, and Twitter statuses, comments, online shopping, online advertising, etc.
all generate huge data, and as a result machines have to generate and keep huge data too. Due
to this exponential growth of data, the analysis of that data becomes challenging and difficult.
The term ‘Big Data’ refers to huge volume, high velocity, variety, and veracity, i.e. uncertainty
of data. This big data is increasing tremendously day by day and may be structured, semi-
structured, or unstructured. Existing databases and tools are not good enough to process,
analyze, store, and manage such big data effectively and efficiently.

Big data is increasing tremendously day by day, which gives rise to new difficulties and
challenges, as such huge amounts of data have to be stored, processed, analyzed, and modified,
and existing databases and tools are not good enough to handle this. The paper provides an
overview of big data and its challenges with respect to MapReduce, and discusses many of the
efforts taken to address those challenges, enabling better planning of big data projects and
identifying opportunities for future research.

11 – NOV - 2016

Source: http://www.ijecs.in/index.php/ijecs/article/view/2841/2632

Title: Scalability Study of Hadoop MapReduce and Hive in Big Data Analytics
From the word-count program performance comparison between Hive and MapReduce, it can
be observed that Hive performance remained constant and better for all sizes of data, matching
and surpassing MapReduce performance, especially in the case of larger data sets; it can be
concluded that the scalability of Hive is very high.

NOV - 2016

Source : https://core.ac.uk/download/pdf/322469972.pdf

Title : Prediction of sales using Big data analytics

The marketing industry increases its sales turnover with the help of big data analysis,
concentrating on social media posts to learn about product reviews and thereby increase sales.
Raw data is extracted from Twitter using Apache Flume via the Twitter streaming API; once
extracted, the raw data is stored in HDFS, and data analysis is done using Apache Hive, with
text tweets filtered by the product name. A Flume agent is configured with a source and a sink:
Twitter posts act as the source and HDFS acts as the sink, and Apache Flume puts the tweets
into HDFS via a memory channel. Apache Flume is configured according to the product search;
the keyword to be searched, the HDFS path, the write format, the memory capacity, and the
transaction capacity can all be configured. Data processing is done with the help of Apache
Hive, which is configured on top of Hadoop. Its query language, HiveQL, helps in querying
and managing large data sets. Tweets arrive in JSON format; by default Hive processes data in
row format, and a Hive feature called partitions helps with the product search.

• A coordinator application puts the raw data into a Hive table every hour.
• Data processing is done with the help of Hive queries.
• Data is partitioned according to the hour of arrival.
• Tweets are filtered according to the product name.
• A sentiment score can be analyzed as positive, neutral, or negative.
• Tweets are analyzed according to filtered words, which are compared against an already
available data set.
• This helps the marketing industry to know the feedback about a product.
• If a tweet comes with a location, sales of the product can be reviewed according to the
location.
Using Hadoop ecosystem tools, raw data was extracted from the Twitter source, and the
HiveQL query language was used to analyze this data. Based on the analysis, the data is
segregated into positive, negative, and neutral, which supports effective opinion mining.

NOV - 2016

Source : https://www.sciencedirect.com/science/article/abs/pii/S0360835216302753

Title : Predicting online e-marketplace sales performances: A big data approach


To manage supply chain efficiently, e-business organizations need to understand their sales
effectively. Previous research has shown that product review plays an important role in
influencing sales performance, especially review volume and rating. Limited attention has been
paid to understand how other factors moderate the effect of product review on online sales.
This study aims to confirm the importance of review volume and rating on improving sales
performance, and further examine the moderating roles of product category, answered
questions, discount and review usefulness in such relationships. By analyzing 2939 records of
data extracted from Amazon.com using a big data architecture, it is found that review volume
and rating have a stronger influence on sales rank for search products than for experience products.
Review usefulness significantly moderates the effects of review volume and rating on product
sales rank. The relationship between review volume and sales rank is significantly moderated
by both answered questions and discount. Answered questions and discount do not have
significant moderation effect on the relationship between review rating and sales rank. The
findings expand previous literature by confirming important interactions between customer
review features and other factors, and the findings provide practical guidelines to manage e-
businesses. This study explains a big data architecture and illustrates the use of big data
technologies in testing theoretical framework.

March - 2017

Source : https://www.sciencedirect.com/science/article/pii/S187705091730412X

Title : Car Sales Analysis Based on the Application of Big Data


Whether for Industry 4.0 or the Internet industry, today's industrial manufacturing enterprises
should make full use of information and communication technology to deal with the arrival of
big data, combining products, machinery, and human resources, so that insight into how
products sell can drive process innovation and reform in manufacturing enterprises. This paper
takes the automobile manufacturing industry as an example: based on the analysis of large-
scale car sales data and using data mining technology, a web crawler program written in Java
is used for data collection. The aim is to give suggestions to the automobile manufacturing
industry for automobile production, reducing the inventory of automobile enterprises and the
waste of resources. Keywords: big data, car manufacturing, data mining technology, web crawler.

April – 2017

Source :
https://d1wqtxts1xzle7.cloudfront.net/52558492/17March24.pdf?1491751910=&response-
content-
disposition=inline%3B+filename%3DENHANCED_ESTIMATION_MODEL_FOR_HADO
OP_WOR.pdf&Expires=1634739413&Signature=McjC5e4uJ1P7KUOCQf1cBv0TdT2RxdR
G0GWFCj0ukNcS0kjlus-xAhW~WjCQo0SICoqdc-
f8LGr6DgalIPkWd14IBydrWYhOILkQhmNjAaXgzaayh~I0oaklCHliTkeHUazw4Kx275no
wAtWgDjM~xzDRAcjt-
GeBsam6E6WxKDDt~x7pd2t0beNrPc2IfwxZtI9Ik7ZvQjzTV13ZVI6IMpU7n1hGpBJVRr~
2CMQKgxX0pE5T4p5jdksuOGNDfk7pKsRfLtqc4XOD7QY-
1gY9wEDDViuuK0QGweemufSTsa4nE-
TCkWwEcHwbdF9Yoce6cIyHMa0DeEKtaNVtPariQ__&Key-Pair-
Id=APKAJLOHF5GGSLRBV4ZA

Title : ENHANCED ESTIMATION MODEL FOR HADOOP WORDCOUNT USING


MAP REDUCING METHOD

As a result of the rapid development of cloud computing, it is fundamental to investigate the
performance of different Hadoop MapReduce applications and to identify the performance
bottlenecks in a cloud cluster that lead to higher or lower performance. It is also important to
analyze the underlying hardware in cloud cluster servers to permit the optimization of software
and hardware to achieve the highest performance feasible. Hadoop is founded on MapReduce,
which is among the most popular programming models for big data analysis in a parallel
computing environment. In this paper, the authors present a detailed performance analysis,
characterization, and evaluation of the Hadoop MapReduce WordCount application.
Keywords: performance analysis, cloud computing, Hadoop WordCount. MapReduce has
become an important platform for a variety of data processing applications. Word-count
mechanisms in MapReduce frameworks such as Hadoop suffer from performance degradation
in the presence of faults. The word-count MapReduce approach proposed in this paper provides
an online, on-demand, closed-loop solution to managing these faults. The control loop
mitigates performance penalties through early detection of anomalous conditions on slave
nodes. Anomaly detection is performed through a novel sparse-coding based method that
achieves high true positive and true negative rates and can be trained using only normal-class
(anomaly-free) data. The local, decentralized nature of the sparse-coding models ensures
minimal computational overhead and enables usage in both homogeneous and heterogeneous
MapReduce environments.

MAY – 2017

Source : https://d1wqtxts1xzle7.cloudfront.net/53281575/IJCST-
V5I3P15.pdf?1495763616=&response-content-
disposition=inline%3B+filename%3DIJCST_V5I3P15_Priyanka_Paradkar_Rachana.pdf&Ex
pires=1634739614&Signature=WlWkJS2ypPd5Cj8CWTSvlwqonm5Qx~x4mo7WsnQ~yicT
K0y1~xdJksTkQm6mieGJNhKhoGxNgJbV9Zj1wTqwpVv5rgGXVLT~QNM8RTGVyFxgxS
pBZqD9s6qfhUVTF3lxqO2k0siTVPjo-
mp82EohBjWJo6nTkF0AXScaUg5QaKdZHYGveILCHU-
ziWLloaoRNbj~aV3bQmkrjEV6nxmhDZmxUUlIxYQ2xCK9QwBZiSwgVYATezyOBjTtY
KYMKgCugEpNRomIutr3sEcHVw0rE41F6BBhTVZJo8VGPPMcpGbNcBKS3a-
5xKK8SbAHtvJpXWET-yim7ps4rCtu26JQPQ__&Key-Pair-
Id=APKAJLOHF5GGSLRBV4ZA

Title : Document Clustering Through Map Reducing – A Hueristic Approach

Rapidly increasing digital data raises the alarm for better data handling and scrutinization
techniques. Numerous classification and clustering techniques exist which perform well on
numerical data but hold no good for semantic textual data, where files or documents are
correlated semantically. Certain techniques are available in industry for document clustering
based on the semantics of data, but most of them suffer from time complexity issues. As an
initial step to solve this, the proposed methodology puts forward an idea of document clustering
for different extensions such as doc, pdf, and txt by preprocessing the original data to perform
feature extraction. To enhance the performance of the process, the features of documents are
extracted using map-reduce streams. These features are then fed to a weighted matrix method
for clustering, a process catalyzed by fuzzy logic. A comparative evaluation is done against the
K-Means and K-Medoid algorithms, and the performance graphs show the method to be
effective and simple.

The proposed system successfully incorporates feature extraction using the map-reduce
technique. The system efficiently fetches features from the documents, such as the title
sentence, numerical data, proper nouns, and top words, and efficiently scrutinizes the semantics
of the data via the weighted matrix process, which yields fine clusters of the documents. To
classify the clustered documents in a more meaningful way, an abstract classifier based on
fuzzy logic is used to obtain accurate clusters based on semantics. The system thus contributes
to clustering documents semantically rather than clustering them randomly based on numeric
features alone.

19 – JUNE – 2017

Source : https://onlinelibrary.wiley.com/doi/abs/10.1111/poms.12737

Title : Parallel Aspect-Oriented Sentiment Analysis for Sales Forecasting with Big Data

While much research work has been devoted to supply chain management and demand
forecast, research on designing big data analytics methodologies to enhance sales forecasting
is seldom reported in existing literature. The big data of consumer-contributed product
comments on online social media provide management with unprecedented opportunities to
leverage collective consumer intelligence for enhancing supply chain management in general
and sales forecasting in particular. The main contributions of our work presented in this study
are as follows: the design of a novel big data analytics methodology that is underpinned by a
parallel aspect-oriented sentiment analysis algorithm for mining consumer intelligence from a
huge number of online product comments; the design and the large-scale empirical test of a
sentiment enhanced sales forecasting method that is empowered by a parallel co-evolutionary
extreme learning machine. Based on real-world big datasets, our experimental results confirm
that consumer sentiments mined from big data can improve the accuracy of sales forecasting
across predictive models and datasets. The managerial implication of our work is that firms
can apply the proposed big data analytics methodology to enhance sales forecasting
performance. Thereby, the problem of under/over-stocking is alleviated and customer
satisfaction is improved.

11 – NOV – 2017

Source :
https://www.researchgate.net/profile/Balajee_Jeyakumar/publication/321888673_Action_rec
ongnition_in_video_survillance_using_HIPI_and_map_reducing_model/links/5a4e08230f7e
9b8284c5a64f/Action-recongnition-in-video-survillance-using-HIPI-and-map-reducing-
model.pdf

Title : ACTION RECONGNITION IN VIDEO SURVILLANCE USING HIPI AND MAP


REDUCING MODEL

HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce
parallel programming framework. HIPI facilitates efficient and high-throughput image
processing with MapReduce style parallel programs typically executed on a cluster. It provides
a solution for how to store a large collection of images on the Hadoop Distributed File System
(HDFS) and make them available for efficient distributed processing. HIPI also provides
integration with OpenCV, a popular open-source library that contains many computer vision
algorithms.

In order to achieve efficient action recognition for large-scale video data, a MapReduce-based
parallel algorithm is proposed. Image analysis is the key concept of action recognition: video
is converted to images, the images are analysed, and the average pixel values and the edges of
the sample images are identified using HIPI. Further action recognition is performed using
edge detection. Extending the framework with other image analysis techniques is the future
work of this paper.

MAR - 2019
Source : https://www.sciencedirect.com/science/article/abs/pii/S0169207018301523

Title : Forecasting sales in the supply chain: Consumer analytics in the big data era

Forecasts have traditionally served as the basis for planning and executing supply chain
activities. The authors, whose research spans supply chain management, data analytics,
logistics strategy, and the use of forecasting in decision-making within the supply chain
context, examine how consumer analytics in the big data era is reshaping the way sales are
forecast in the supply chain.

08 – JUN – 2019

Source : https://link.springer.com/article/10.1007/s11227-019-02907-5

Title : MapReduce: an infrastructure review and research insights

In the current decade, searching massive data to find “hidden” and valuable information within
it is a growing need. Such searches can entail heavy processing over considerable data, which
has led to the development of solutions for processing huge information volumes based on
distributed and parallel processing. Among all the parallel programming models, one that has
gained a lot of popularity is MapReduce. The goal of this paper is to survey research conducted
on the MapReduce framework in the context of its open-source implementation, Hadoop, in
order to summarize and report on the wide topic area at the infrastructure level. The authors
conducted a systematic review of the prevalent topics dealing with MapReduce in seven areas:
performance; job/task scheduling; load balancing; resource provisioning; fault tolerance in
terms of availability and reliability; security; and energy efficiency. Since MapReduce is a
challenge-prone area for researchers who set out to work on and extend it, this work is a useful
guideline for getting feedback and starting research.

JUNE – 2019

Source : http://www.j-asc.com/gallery/428-june-3338.pdf

Title : A STUDY ON BIG DATA ANALYTICS & MAP-REDUCE PROGRAMMING

In the information era, enormous amounts of data have become available on hand to decision
makers. Big data refers to datasets that are not only big, but also high in variety and velocity,
which makes them difficult to handle using traditional tools and techniques. Due to the rapid
growth of such data, solutions need to be studied and provided in order to handle and extract
value and knowledge from these datasets. Furthermore, decision makers need to be able to gain
valuable insights from such varied and rapidly changing data, ranging from daily transactions
to customer interactions and social network data. Such value can be provided using big data
analytics, which is the application of advanced analytics techniques on big data. This paper
aims to analyze some of the different analytics methods and tools which can be applied to big
data, as well as the opportunities provided by the application of big data analytics in various
decision domains.

In this article, an overview of big data, Map-Reduce, their benefits and working nature, and
KDD in big data has been reviewed. The working nature of map-reduce on HDFS is also
explained in detail; overall, the basic details of big data and the Map-Reduce program are
presented in this paper. Although this paper clearly has not resolved the entire subject of this
substantial topic, hopefully it has provided some useful discussion and a framework for
researchers.

30 – NOV – 2019

Source : https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0269-1
Title : On using MapReduce to scale algorithms for Big Data analytics: a case study

This paper presents a study of the applicability of MapReduce for scaling data analytic and
machine learning algorithms to “Big algorithms” for Big Data. It introduces a small class of
MapReduce-based Apriori algorithms that compute frequent itemsets using a small number of
MapReduce phases and are free, or almost free, of level-wise processing. Specifically, it
compares the performance of AprioriPMR, an existing one-level MapReduce-based Apriori,
with the proposed AprioriS, a simple non-naive level-free MapReduce-based algorithm. By
mapping each of all possible subsets of items in all transactions (as a key) to its corresponding
frequency of occurrence (as a value), AprioriS requires only a single MapReduce phase and a
single database scan, while AprioriPMR requires two phases. The findings support the
conjecture that to fully exploit parallelism, a MapReduce-based algorithm should be designed
to be free of dependent iterations: AprioriS has no dependent iterations while AprioriPMR has
one, and AprioriS performs better than AprioriPMR using multiple machines in the
MapReduce framework. The experiments use a small number of machines, but the results
obtained so far are promising, and future work intends to increase the experiments to a larger
scale of clusters. The results confirm that an effective MapReduce implementation should
avoid dependent iterations, such as those of the original sequential Apriori. These findings
could lead to many more alternative non-naive MapReduce-based “Big algorithms”.

NOV – 2019

Source : https://www.koreascience.or.kr/article/JAKO201919163740728.page

Title : Analysis of Sales Volume by Products According to Temperature Change Using Big
Data Analysis

Since online shopping has become common, people can buy fashion goods anytime, anywhere.
Consumers quickly respond to various environmental variables such as weather and sales
prices. Utilizing big data for efficient inventory management has become very important in the
fashion industry. The change in sales volume of fashion goods due to changes in temperature
is analyzed via the proposed big data analysis algorithm by utilizing actual big data from the
Korean fashion company 'B'. According to the analytic results, the proposed big data analysis
algorithm found both expected and unexpected changes in sales volume depending on the
characteristics of the fashion goods.
2 – JAN – 2020

Source : https://www.forbes.com/sites/danielnewman/2020/01/02/why-the-future-of-data-
analytics-is-prescriptive-analytics/?sh=1ebeafec6598

Title : The Future Of Data Analytics Is Prescriptive Analytics

Prescriptive analytics are powerful, but they won’t be necessary for every company, or every
campaign you push out to customers. They also will require a lot of tweaking. No algorithm
was crafted perfectly the first time. It takes time, effort, and focus to make prescriptive analytics
work effectively. But if you are in a competitive marketplace—managing anything from
products to people—prescriptive analytics could mean a huge boost to profit, productivity, and
the bottom line.

APRIL – 2020

Source : https://www.sciencedirect.com/science/article/pii/S0019850118304656

Title : Fostering B2B sales with customer big data analytics

This study focuses on the use of big data analytics in managing B2B customer relationships
and examines the effects of big data analytics on customer relationship performance and sales
growth using a multi-industry dataset from 417 B2B firms. The study examines whether
analytics culture within a firm moderates these effects. The study finds that the use of customer
big data significantly fosters sales growth and enhances the customer relationship performance.
The latter effect is stronger for firms which have an analytics culture which supports marketing
analytics, whereas the former effect remains unchanged regardless of the analytics culture. The
study empirically confirms that customer big data analytics improves customer relationship
performance and sales growth in B2B firms.
05 – FEB - 2021

Source : https://www.infoq.com/articles/hive-performance-tuning-techniques/

Title : Performance Tuning Techniques of Hive Big Data Table

Without applying any tuning technique, the query time to read Hive table data can take
anywhere from 5 minutes to several hours depending on volume. After consolidation, the query
time is significantly reduced and results come back faster: the number of files is significantly
reduced, so the time to read the data decreases. Without consolidation, queries run over so
many small files spread across the cluster that response time increases.

• Developers working on big data applications experience challenges when reading data
from Hadoop file systems or Hive tables.

• Consolidation job, a technique used to merge smaller files to bigger files, can help with
the performance of reading Hadoop data.

• With consolidation, the number of files is significantly reduced and query time to read
the data will be faster.

• Hive tuning parameters can also help with performance when you read Hive table data
through a map-reduce job.
IV. DATASET DESCRIPTION & SAMPLE DATA

The data set is taken from Udacity, an online learning platform providing a wide range of
courses. It contains the sales data of each Walmart store in a particular year and includes both
categorical and numerical variables. There are 6 attributes and more than 1,000,000 instances.

The following are the necessary fields:

· Date
· Time
· Location
· Item
· Price
· Card

https://drive.google.com/file/d/1hiNiZLxcqoly5eCHkJvTxrr9lpdCUOrz/view?usp=sharing
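
For illustration, a single record in the tab-separated dataset has the following shape (the values shown here are hypothetical; the field order follows the six attributes listed above):

2019-06-14	11:42	NYC	Shoes	129.99	Visa

Each line can therefore be split on the tab character into the six fields date, time, store, item, cost, and payment, which is exactly what the mapper in the Appendix does.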
V. PROPOSED ALGORITHM WITH FLOWCHART

We need to find the total sales of each individual store in 2019. As the file contains millions
of records, it is not feasible to process it serially. MapReduce helps us divide the file into
smaller chunks, process those chunks on different machines in a cluster, and then combine the
results.

Instead of one machine doing the job, we have a set of machines called mappers and reducers
that run this process in parallel. What is the job of the mappers and the reducers? The file is
divided into smaller chunks and one chunk is given to each mapper. The job of the mapper is
to take its chunk and separate the sales data of each store. For example, if a mapper receives
records for the stores NYC, MIAMI, and LA, it will make three piles, one per store, and keep
the sales data of each store in the corresponding pile. Each reducer is assigned a group of
stores; it collects the data for its assigned stores from the mappers and sums up the sales values
of each store. For example, if the first reducer is assigned NYC, it collects the NYC sales data
from each mapper and sums the values to get the total sales for NYC. Each reducer goes
through its piles in alphabetical order, so a reducer assigned both LA and MIAMI processes
the sales of LA before MIAMI.

So, the mappers are just programs, each of which acts on a small chunk of the file. The mappers
produce intermediate key-value pairs; in our case, the store name and the item price. Once the
mappers have done their job, a phase known as shuffle and sort takes place. Shuffle is the
movement of records from the mappers to the reducers that have been assigned those records;
sort is the sorting of the data by each reducer. A reducer gets a key and a list of values; in our
case, a store name and the sales figures of that store. It iterates through all the values and
produces the total sales of each store in the end.
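
As a local, single-machine sketch of this pipeline (for illustration only; the real job runs distributed via Hadoop Streaming, and the sample records below are hypothetical), the map, shuffle-and-sort, and reduce phases can be simulated in Python as follows:

from itertools import groupby
from operator import itemgetter

# Hypothetical tab-separated records: date, time, store, item, cost, payment
chunk = [
    "2019-06-14\t11:42\tNYC\tShoes\t129.99\tVisa",
    "2019-06-14\t11:43\tLA\tBooks\t19.50\tCash",
    "2019-06-14\t11:44\tNYC\tToys\t45.00\tMasterCard",
]

# Map phase: emit the intermediate (store, cost) pair for every well-formed record
mapped = []
for line in chunk:
    fields = line.strip().split("\t")
    if len(fields) == 6:
        date, time, store, item, cost, payment = fields
        mapped.append((store, float(cost)))

# Shuffle and sort: group the intermediate pairs by store name
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the sales of each store
for store, group in groupby(mapped, key=itemgetter(0)):
    print(store, sum(cost for _, cost in group))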
VII. RESULTS AND DISCUSSION
We assume that there is only one reducer, which gets the sorted input from the mappers. The
reducer.py script takes in the sorted input and keeps checking whether the new key is equal to
the previous key; whenever the key changes, it prints the store name and the total sales for that
store. The output of our program is saved in the file joboutput/part-00000.

Once our task is done, we can delete the joboutput folder from our HDFS system, because
different job runs cannot generate output under the same output folder name. This can be
achieved by running the command hadoop fs -rm -r -f joboutput. Meanwhile, we can export
the data into a text file and save it in our local directory for future use.
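
Before submitting the job to the cluster, a quick way to sanity-check the two scripts locally (assuming purchases.txt is available on the local disk) is to chain them with a plain Unix sort, which mimics Hadoop Streaming’s shuffle-and-sort phase:

“cat purchases.txt | python mapper.py | sort | python reducer.py”

This should print to the terminal the same per-store totals that the cluster job writes to joboutput/part-00000.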
VIII. CONCLUSION

To analyse and determine the revenue earned across different stores in a country/region, we
used the core concepts of the Big Data Hadoop platform, HDFS and MapReduce, to solve this
problem. We processed large volumes of data in parallel by dividing the work into a set of
independent tasks: MapReduce breaks the data into chunks and processes them in parallel, and
we were able to find the revenue earned by each individual store in the US using the reducers.
IX. REFERENCES
1. Kshitij Jaju, Vishal Nehe and Abhishek Konduri, Commercial Product Analysis Using Hadoop
MapReduce, International Journal of Pure and Applied Mathematics, Vol. 3, No. 4, 2016.

2. Manpreet Singh and Bhawick Ghutla, Walmart’s Sales Data Analysis – A Big Data Analytics
Perspective, Asia-Pacific World Congress on Computer Science and Engineering, Vol. 59, No.
12, 2017.

Appendix
Codes and Commands
mapper.py
#!/usr/bin/python
import sys

# Read records from standard input; Hadoop Streaming feeds each mapper its chunk here
for line in sys.stdin:
    data = line.strip().split("\t")
    # Skip malformed records that do not have exactly six fields
    if len(data) == 6:
        date, time, store, item, cost, payment = data
        # Emit the intermediate key-value pair: store name and sale amount
        print("{0}\t{1}".format(store, cost))

reducer.py
#!/usr/bin/python
import sys

sales_total = 0
old_key = None

# The input arrives sorted by key, so all lines of one store are contiguous
for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    this_key, this_sale = data
    # A new key means the previous store is complete: print its total and reset
    if old_key and old_key != this_key:
        print("#{0}\t{1}".format(old_key, sales_total))
        sales_total = 0
    old_key = this_key
    sales_total += float(this_sale)

# Flush the total of the last store
if old_key is not None:
    print("#{0}\t{1}".format(old_key, sales_total))
hdfs commands:
“hadoop fs -put purchases.txt myinput”: uploads the dataset into a directory, i.e. “myinput”
“hadoop fs -ls”: lists the files in Hadoop DFS
“hadoop fs -ls myinput”: lists the files in “myinput”
“hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-
2.6.0-mr1-cdh5.13.0.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file
reducer.py -input myinput -output joboutput3”
: processes the data in the files uploaded to Hadoop DFS via the mapper and reducer programs,
where [/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-
mr1-cdh5.13.0.jar] is the path to the Hadoop Streaming jar file, mapper.py is the mapper Python
program, reducer.py is the reducer Python program, “myinput” is the input to mapper.py,
and “joboutput3” is the output folder in Hadoop DFS which stores the output from reducer.py.
“hadoop fs -ls joboutput3”: lists the content of “joboutput3”
Note that the actual output is stored in the file “part-00000” within the “joboutput3” folder. To
view this file, run the following command:
“hadoop fs -cat joboutput3/part-00000”
To download this file to the local machine, run the following command:
“hadoop fs -get joboutput3/part-00000 output3.txt”
This will save the output to the file “output3.txt”.

“hadoop fs -rm -r -f joboutput3”: removes/deletes the “joboutput3” folder.
