Submitted by
USN Name
1BI19CS078 Koushik GG
1BI19CS107 Prakrithi V
1BI19CS079 Lithish Kumar
1BI19CS133 Saaima Nishat
Certificate
USN NAME
1BI19CS078 Koushik GG
1BI19CS107 Prakrithi V
1BI19CS079 Lithish Kumar
1BI19CS133 Saaima Nishat
of VIII semester, Computer Science and Engineering branch, in partial fulfillment of the course
Big Data & Analytics (18CS72) prescribed by Visvesvaraya Technological University,
Belgaum, during the academic year 2021-22. It is certified that all corrections and suggestions
indicated for Internal Assessment have been incorporated in the report.
The Mini Project report has been approved as it satisfies the academic requirements
in respect of project work in Big Data & Analytics.
Dr. Suneetha K R
Associate Professor,
Department of CS&E,
BIT, Bengaluru-560004
CONTENTS
1. INTRODUCTION
   1.1 Introduction
   1.2 Motivation
2. PROBLEM STATEMENT
   2.1 Problem Statement
   2.2 Objectives
3. SYSTEM REQUIREMENTS
   3.1 Hardware Requirements
   3.2 Software Requirements
4. ARCHITECTURE
   4.1 Architecture
5. TOOLS USED
   5.1 Tools Description
   5.2 Dataset Description
6. IMPLEMENTATION DETAILS
   6.1 Data Analysis
7. RESULTS
   7.1 Snapshots
8. APPLICATIONS
9. CONCLUSION AND FUTURE WORK
   9.1 Conclusion
   9.2 Future Work
REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Introduction
Big data can be defined as a collection of heterogeneous data drawn from a variety of sources,
such as large business organizations and the web. Managing such a large volume of data with
traditional tools is a difficult task: it raises problems of integration, visualization, storage, and
searching, and performing analytics over it is correspondingly hard. Analysing this data is a demand
of current technology, since measures such as popularity can be derived from it, and the discovery
of hidden patterns is a major research need. Because data is available from so many different
sources, those sources must be merged so that the whole can be utilized effectively.
E-commerce is part of the day-to-day life of many individuals, and every e-commerce website
is trying to grow its business. However, these sites involve a wide variety of complexities: the data
they hold is so large that managing and manipulating it becomes very difficult. Most e-commerce
organizations do not carry their own inventory, so they tie up with merchants. When a customer
places an order, the customer can either select a merchant or leave the choice to the company.
Merchant ratings are therefore necessary, and orders should be assigned to merchants on the basis
of those ratings.
1.2 Motivation
Since most e-commerce organizations rely on third-party merchants rather than their own
inventory, the quality of a customer's experience depends largely on which merchant fulfils an
order. When the customer does not choose a merchant explicitly, the company must decide, and it
needs an objective basis for that decision. The volume of transaction data on shopping sites is far
too large to analyse with traditional tools, which motivates building a merchant rating system on a
Big Data platform.
CHAPTER 2
PROBLEM STATEMENT
2.1 Problem Statement
“To create a merchant rating system that improves customer experience by deciding which
merchant provides better services among multiple merchants selling the same types of products,
using Big Data Analytics.”
2.2 Objectives
• To analyse merchant transaction data using Hadoop MapReduce.
• To aggregate per-merchant metrics such as the number of orders handled.
• To rate merchants so that orders can be assigned to those providing better service.
CHAPTER 3
SYSTEM REQUIREMENTS
CHAPTER 4
ARCHITECTURE
4.1 Architecture
The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. HDFS employs a NameNode and DataNode architecture to implement
a distributed file system that provides high-performance access to data across highly scalable
Hadoop clusters. Hadoop itself is an open-source distributed processing framework that
manages data processing and storage for big data applications. HDFS is a key part of the
Hadoop ecosystem: it provides a reliable means of managing pools of big data and supporting
the related big data analytics applications.
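The NameNode/DataNode split described above can be illustrated with a small, self-contained sketch (plain Java, no Hadoop dependency; the block size, replication factor, and round-robin placement below are simplifications for illustration only — real HDFS placement is rack-aware): a file is stored as fixed-size blocks, and the NameNode records where each replica of each block lives.

```java
import java.util.ArrayList;
import java.util.List;

public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size (128 MB)
    static final int REPLICATION = 3;                  // default replication factor

    // Number of blocks the NameNode would record for a file of the given size.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Simplified round-robin placement of each block's replicas across DataNodes.
    static List<List<Integer>> placeBlocks(long fileSizeBytes, int dataNodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSizeBytes); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % dataNodes)); // a DataNode index per replica
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024;                  // a 300 MB file
        System.out.println(blockCount(fileSize));            // 3 blocks
        System.out.println(placeBlocks(fileSize, 5).get(0)); // DataNodes holding block 0
    }
}
```

The sketch shows why the NameNode stays lightweight: it only tracks block-to-DataNode mappings, while the DataNodes hold the actual bytes.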
CHAPTER 5
TOOLS USED
5.1 Tools Description
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system
written in Java for the Hadoop framework. Some consider it to be a data store rather than a file
system due to its lack of POSIX compliance, but it does provide shell commands and Java
application programming interface (API) methods that are similar to other file systems. A Hadoop
instance is divided into HDFS and MapReduce: HDFS is used for storing the data and MapReduce
is used for processing it.
Pros                  Cons
Scalable              Latency
Cost effective        Security
Compatible            Supports batch processing only
Easy to use           No real-time data processing
Varied data sources
The input split describes the unit of work that comprises a single map task in a MapReduce program.
The record reader loads the data and converts it into key-value pairs that the Mapper can read.
The Mapper performs the first phase of the MapReduce program: given a key and a value, it emits
intermediate key-value pairs and sends them on towards the reducers. The process of moving
mapped outputs to the reducers is known as shuffling. Partitions are the inputs to reduce tasks; the
partitioner determines which reduce task each key-value pair is sent to.
The set of intermediate keys is automatically sorted before being passed to the reduce
function. A reducer instance is created for each reduce task, and the output format governs the way
its results are written; the output format provided by Hadoop writes the files to HDFS.
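The pipeline above (record reader → mapper → shuffle/partition → reducer) can be sketched without a cluster. The following plain-Java sketch (no Hadoop dependency; the CSV field layout and merchant IDs are illustrative, not taken from the project dataset) maps each transaction line to a (merchantId, 1) pair, groups the pairs by key as the shuffle phase would, and reduces by summing, producing an order count per merchant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map phase: emit (merchantId, 1) for each CSV transaction line.
    // Assumed illustrative layout: transactionId,merchantId,date
    static List<Map.Entry<String, Long>> map(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            String merchantId = line.split(",")[1].trim();
            pairs.add(Map.entry(merchantId, 1L));
        }
        return pairs;
    }

    // Shuffle + reduce: group the intermediate pairs by key, then sum per key.
    static Map<String, Long> shuffleAndReduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys, as MapReduce guarantees
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "t1,M1,2021-05-01", "t2,M2,2021-05-01", "t3,M1,2021-05-02");
        System.out.println(shuffleAndReduce(map(lines))); // {M1=2, M2=1}
    }
}
```

In real Hadoop the grouping and sorting happen across machines; the sketch only shows the data flow a single map and reduce pair implements.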
The following feature comparison was performed in order to analyse which Hadoop technology
is suitable for the Merchant Data Analysis project.
1. If MapReduce is used for the Merchant Data Analysis project, we would need to write complex
business logic in order to execute the join queries successfully. We would have to think, from the
map-and-reduce point of view, about what is important and what is not, and decide which piece of
code goes into the map side and which into the reduce side. Programmatically, this becomes quite
challenging, since a lot of custom code is required to execute the business logic even for the
simplest tasks. It may also be difficult to map the data into a schema format, and a lot of
development effort may go into deciding how map-side and reduce-side joins can function
efficiently.
2. Hive provides a familiar programming model: it queries data with a SQL-based language and is
comparatively fast, with good interactive response times even over huge datasets. As data variety
and volume grow, more commodity machines can be added without reducing performance, so Hive
is scalable and extensible. Hive is also highly compatible and works with traditional data
integration and data analytics tools. If we apply Hive to analyse the merchant data, we can leverage
the SQL capabilities of Hive QL while managing the data in a defined schema; using Hive can also
reduce the development time significantly.
After weighing these pros and cons, Hive becomes the obvious choice for this Merchant Data
Analysis project.
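As a sketch of the Hive approach described above, a query like the following would compute a per-merchant order count directly in Hive QL, without any custom map/reduce code (the table and column names here are hypothetical illustrations, not taken from the project):

```sql
-- Hypothetical table layout; names are illustrative only.
SELECT merchant_id,
       COUNT(*) AS total_orders
FROM   transactions
GROUP  BY merchant_id
ORDER  BY total_orders DESC;
```

Hive compiles such a query into the equivalent MapReduce stages automatically, which is exactly the development-time saving the comparison points to.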
CHAPTER 6
IMPLEMENTATION DETAILS
6.1 Data Analysis
6.1.1 AggregateData.java
package org.paceWithMe.hadoop.helpers;

import java.io.Serializable;

// Plain data holder for the per-merchant aggregates. Only the field used
// elsewhere in the report is reconstructed here; the rest of the class
// body is not present in the source.
public class AggregateData implements Serializable {
    private long totalOrder;

    public long getTotalOrder() { return totalOrder; }
    public void setTotalOrder(long totalOrder) { this.totalOrder = totalOrder; }
}
6.1.2 AggregateWritable.java
package org.paceWithMe.hadoop.helpers;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import com.google.gson.Gson;

// Hadoop Writable wrapper around AggregateData. The class declaration and
// fields are reconstructed from usage; the write/readFields bodies are not
// present in the source.
public class AggregateWritable implements Writable {
    private AggregateData aggregateData;
    private static final Gson gson = new Gson();

    public AggregateWritable() {
    }

    public AggregateWritable(AggregateData aggregateData) {
        this.aggregateData = aggregateData;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialization logic elided in the source
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialization logic elided in the source
    }

    @Override
    public String toString() {
        return gson.toJson(aggregateData);
    }
}
6.1.3 MerchantAnalyticsJob.java
package org.paceWithMe.hadoop.helpers;

import java.io.*;
import java.net.*;
import java.text.*;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.util.*;
import org.slf4j.*;
import org.paceWithMe.hadoop.helpers.AggregateData;
import org.paceWithMe.hadoop.helpers.AggregateWritable;
import org.paceWithMe.hadoop.helpers.Transaction;
@Override
protected void setup(Mapper<LongWritable, Text, Text, AggregateWritable>.Context context)
        throws IOException, InterruptedException {
    // Load the merchant-id to merchant-name lookup from the distributed cache.
    URI[] paths = context.getCacheArchives();
    if (paths != null) {
        for (URI path : paths) {
            loadMerchantIdNameInCache(path.toString(), context.getConfiguration());
        }
    }
    super.setup(context);
}
// Tail of loadMerchantIdNameInCache(...): close the reader once the cache
// file has been read (the body of the method is not present in the source).
    try {
        if (br != null)
            br.close();
    } catch (Exception e) {
        LOGGER.error("exception occurred while closing the file reader = {}", e);
    }
}
}
@Override
protected void map(LongWritable key, Text value,
        Mapper<LongWritable, Text, Text, AggregateWritable>.Context context)
        throws IOException, InterruptedException {
    String line = value.toString().replace("\"", "");
    // skip the CSV header record
    if (line.indexOf("transaction") != -1) {
        return;
    }
    String[] split = line.split(",");
    // (construction of the Transaction object from split[] is elided in the source)
    AggregateData aggregateData = new AggregateData();
    aggregateData.setTotalOrder(1L);
    AggregateWritable aggregateWritable = new AggregateWritable(aggregateData);
    // key = "<merchant name>-<order date>", value = the per-record aggregate
    String outputKey = merchantIdNameMap.get(transaction.getMerchantId().toString()) + "-"
            + (split[3].trim().split(" ")[0].trim());
    context.write(new Text(outputKey), aggregateWritable);
}
public void reduce(Text key, Iterable<AggregateWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum the per-record aggregates for this merchant/date key
    // (the accumulation loop over `values` is elided in the source).
    AggregateData aggregateData = new AggregateData();
    AggregateWritable aggregateWritable = new AggregateWritable(aggregateData);
    context.write(key, aggregateWritable);
}
}
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(AggregateWritable.class);
job.setJarByClass(MerchantAnalyticsJob.class);
job.setReducerClass(MerchantOrderReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
        TransactionMapper.class);
FileSystem fileSystem = FileSystem.get(job.getConfiguration());
RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(new Path(args[1]), true);
@Override
public int run(String[] args) throws Exception{
return runMRJob(args);
}
}
6.1.4 Transaction.java
package org.paceWithMe.hadoop.helpers;

import java.io.Serializable;

// Bean for one transaction record. Only the field used elsewhere in the
// report is reconstructed here; the remaining fields are not present in
// the source.
public class Transaction implements Serializable {
    private Long merchantId;

    public Long getMerchantId() { return merchantId; }
    public void setMerchantId(Long merchantId) { this.merchantId = merchantId; }
}
CHAPTER 7
RESULTS
7.1 Snapshots
Snapshot 3: Hadoop installation (screenshots omitted in this text version)
CHAPTER 8
APPLICATIONS
• E-commerce websites
• Companies that provide merchant rating services
• Merchant review platforms
• Consumer comparison shopping
CHAPTER 9
CONCLUSION AND FUTURE WORK
9.1 Conclusion
The project demonstrates a merchant rating system that improves customer experience by deciding
which merchant provides better services among multiple merchants selling the same types of
products, using Big Data Analytics. We have shown how insightful information can be extracted
from a merchant dataset using Big Data Analytics; for this purpose we used Hadoop MapReduce.
9.2 Future Work
Future work would include extending the analysis of the merchant data using other Big Data
technologies. More information could be extracted, such as the ratings given by other customers,
the number of orders shipped, and the number of items returned due to a merchant's mistake,
making the merchant rating more accurate.
REFERENCES
[2] Allouche, G. (2015, July 01). Hadoop 101: An Explanation of the Hadoop Ecosystem - DZone Big
Data. Retrieved April 27, 2017, from https://dzone.com/articles/hadoop-101-explanation-hadoop
[3] Braselton, J. P. (2014). Hadoop: Integration in IBM, Microsoft and SAS. Place of publication not
identified: CreateSpace.
[4] Seshachala, S. (2015, June 01). Bigdata - Understanding Hadoop and Its Ecosystem. Retrieved April
27, 2017, from https://devops.com/bigdata-understanding-hadoop-ecosystem/
[5] Teplow, D. (2015, May 15). Hadoop. Retrieved April 27, 2017, from
http://www.hadoop360.com/blog/hadoop-whose-to-choose
[7] Kanoje, S. (2016, July 12). Hadoop Ecosystem Quick Start: 5 Key Components. Retrieved April 27,
2017, from https://www.ironsidegroup.com/2015/12/01/hadoop-ecosystkeycomponents/
[8] Karambelkar, H. (2015). Scaling Big Data with Hadoop and Solr: Understand, Design, Build, and
Optimize Your Big Data Search Engine with Hadoop and Apache Solr. Birmingham: Packt Publishing.
[9] Prajapati, V. (2013). Big Data Analytics with R and Hadoop. Olton: Packt Publishing. Retrieved from
http://ebookcentral.proquest.com.ezproxy.ferris.edu/lib/ferrisstate/detail.action?docID=1477486
[10] Trifu, M. R., & Ivan, M. (2016). Big data components for business process optimization. Informatica
Economica, 20(1), 72-78. doi:http://dx.doi.org.ezproxy.ferris.edu/10.12948/issn14531305/20.1.2016.07
[11] Ivan, M. L. (2016, January). Big Data Components for Business Process Optimization. Retrieved
April 27, 2017, from http://revistaie.ase.ro/content/77/07%20-%20Trifu,%20Ivan.pdf
[12] Prasad Padhy, R. (2013, February). Big Data Processing with Hadoop-MapReduce in Cloud Systems.
Retrieved April 27, 2017, from http://www.iaesjournal.com/online/index.php/IJ-CLOSER/article/view/1508/502