Submitted by
USN Name
1BI19CS078 Koushik GG
1BI19CS107 Prakrithi V
1BI19CS079 Lithish Kumar
1BI19CS133 Saaima Nishat
Certificate
USN NAME
1BI19CS078 Koushik GG
1BI19CS107 Prakrithi V
1BI19CS079 Lithish Kumar
1BI19CS133 Saaima Nishat
of VIII semester, Computer Science and Engineering branch, in partial fulfillment of the course
Big Data & Analytics (18CS72) prescribed by Visvesvaraya Technological University,
Belgaum, during the academic year 2021-22. It is certified that all corrections and suggestions
indicated for Internal Assessment have been incorporated in the report.
The Mini Project report has been approved as it satisfies the academic requirements
in respect of project work in Big Data & Analytics.
Dr. Suneetha K R
Associate Professor,
Department of CS&E,
BIT, Bengaluru-560004
CONTENTS
1. INTRODUCTION
   1.1 Introduction
   1.2 Motivation
2. PROBLEM STATEMENT
   2.1 Problem Statement
   2.2 Objectives
3. SYSTEM REQUIREMENTS
   3.1 Hardware Requirements
   3.2 Software Requirements
4. ARCHITECTURE
   4.1 Architecture
5. TOOLS USED
   5.1 Tools Description
   5.2 Dataset Description
6. IMPLEMENTATION DETAILS
   6.1 Data Analysis
7. RESULTS
   7.1 Snapshots
8. APPLICATIONS
9. CONCLUSION AND FUTURE WORK
   9.1 Conclusion
   9.2 Future Work
REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Introduction
Big data can be defined as a collection of heterogeneous data drawn from a variety of sources,
such as large business organizations and the web. Managing such a large volume of data with
traditional tools is a difficult task: it raises problems of integration, visualization, storage, and
searching, and performing analytics over it is correspondingly hard. Analysing this data is a demand
of current technology, since measures such as popularity can be derived from it, and the discovery
of hidden patterns is a major research need. Because data is available from so many different
sources, those sources must be merged so that the whole can be utilized effectively.
E-commerce is part of the day-to-day life of many individuals, and every e-commerce website
is trying to grow its business. However, these sites involve a wide variety of complexities: the data
they hold is so large that managing and manipulating it becomes very difficult. Most e-commerce
organizations do not carry their own inventory, so they tie up with merchants. When a customer
places an order, the customer can either select a merchant or leave the choice to the company.
Merchant ratings are therefore necessary, and orders should be assigned to merchants on the basis
of those ratings.
1.2 Motivation
Since most e-commerce organizations rely on third-party merchants rather than their own
inventory, the quality of a customer's experience depends largely on which merchant fulfils an
order. When the customer does not choose a merchant explicitly, the company must decide, and it
needs an objective basis for that decision. The volume of transaction data on shopping sites is far
too large to analyse with traditional tools, which motivates building a merchant rating system on a
Big Data platform.
CHAPTER 2
PROBLEM STATEMENT
2.1 Problem Statement
“To create a merchant rating system that improves customer experience by deciding which
merchant provides better services among multiple merchants selling the same types of products,
using Big Data Analytics.”
2.2 Objectives
• To analyse merchant transaction data using Hadoop MapReduce.
• To aggregate per-merchant metrics such as the number of orders handled.
• To rate merchants so that orders can be assigned to those providing better service.
CHAPTER 3
SYSTEM REQUIREMENTS
CHAPTER 4
ARCHITECTURE
4.1 Architecture
The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. HDFS employs a NameNode and DataNode architecture to implement
a distributed file system that provides high-performance access to data across highly scalable
Hadoop clusters. Hadoop itself is an open-source distributed processing framework that
manages data processing and storage for big data applications. HDFS is a key part of the
Hadoop ecosystem: it provides a reliable means of managing pools of big data and supporting
the related big data analytics applications.
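The NameNode/DataNode split described above can be illustrated with a small, self-contained sketch (plain Java, no Hadoop dependency; the block size, replication factor, and round-robin placement below are simplifications for illustration only — real HDFS placement is rack-aware): a file is stored as fixed-size blocks, and the NameNode records where each replica of each block lives.

```java
import java.util.ArrayList;
import java.util.List;

public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size (128 MB)
    static final int REPLICATION = 3;                  // default replication factor

    // Number of blocks the NameNode would record for a file of the given size.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Simplified round-robin placement of each block's replicas across DataNodes.
    static List<List<Integer>> placeBlocks(long fileSizeBytes, int dataNodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSizeBytes); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % dataNodes)); // a DataNode index per replica
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024;                  // a 300 MB file
        System.out.println(blockCount(fileSize));            // 3 blocks
        System.out.println(placeBlocks(fileSize, 5).get(0)); // DataNodes holding block 0
    }
}
```

The sketch shows why the NameNode stays lightweight: it only tracks block-to-DataNode mappings, while the DataNodes hold the actual bytes.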
CHAPTER 5
TOOLS USED
5.1 Tools Description
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system
written in Java for the Hadoop framework. Some consider it to be a data store rather than a file
system due to its lack of POSIX compliance, but it does provide shell commands and Java
application programming interface (API) methods that are similar to other file systems. A Hadoop
instance is divided into HDFS and MapReduce: HDFS is used for storing the data and MapReduce
is used for processing it.
Pros                  Cons
Scalable              Latency
Cost effective        Security
Compatible            Supports batch processing only
Easy to use           No real-time data processing
Varied data sources
The input split describes the unit of work that comprises a single map task in a MapReduce program.
The record reader loads the data and converts it into key-value pairs that the Mapper can read.
The Mapper performs the first phase of the MapReduce program: given a key and a value, it emits
intermediate key-value pairs and sends them on towards the reducers. The process of moving
mapped outputs to the reducers is known as shuffling. Partitions are the inputs to reduce tasks; the
partitioner determines which reduce task each key-value pair is sent to.
The set of intermediate keys is automatically sorted before being passed to the reduce
function. A reducer instance is created for each reduce task, and the output format governs the way
its results are written; the output format provided by Hadoop writes the files to HDFS.
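The pipeline above (record reader → mapper → shuffle/partition → reducer) can be sketched without a cluster. The following plain-Java sketch (no Hadoop dependency; the CSV field layout and merchant IDs are illustrative, not taken from the project dataset) maps each transaction line to a (merchantId, 1) pair, groups the pairs by key as the shuffle phase would, and reduces by summing, producing an order count per merchant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map phase: emit (merchantId, 1) for each CSV transaction line.
    // Assumed illustrative layout: transactionId,merchantId,date
    static List<Map.Entry<String, Long>> map(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            String merchantId = line.split(",")[1].trim();
            pairs.add(Map.entry(merchantId, 1L));
        }
        return pairs;
    }

    // Shuffle + reduce: group the intermediate pairs by key, then sum per key.
    static Map<String, Long> shuffleAndReduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys, as MapReduce guarantees
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "t1,M1,2021-05-01", "t2,M2,2021-05-01", "t3,M1,2021-05-02");
        System.out.println(shuffleAndReduce(map(lines))); // {M1=2, M2=1}
    }
}
```

In real Hadoop the grouping and sorting happen across machines; the sketch only shows the data flow a single map and reduce pair implements.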
The following feature comparison was performed in order to analyse which Hadoop technology
is suitable for the Merchant Data Analysis project.
1. If MapReduce is used for the Merchant Data Analysis project, we would need to write complex
business logic in order to execute the join queries successfully. We would have to think, from the
map-and-reduce point of view, about what is important and what is not, and decide which piece of
code goes into the map side and which into the reduce side. Programmatically, this becomes quite
challenging, since a lot of custom code is required to execute the business logic even for the
simplest tasks. It may also be difficult to map the data into a schema format, and a lot of
development effort may go into deciding how map-side and reduce-side joins can function
efficiently.
2. Hive provides a familiar programming model: it queries data with a SQL-based language and is
comparatively fast, with good interactive response times even over huge datasets. As data variety
and volume grow, more commodity machines can be added without reducing performance, so Hive
is scalable and extensible. Hive is also highly compatible and works with traditional data
integration and data analytics tools. If we apply Hive to analyse the merchant data, we can leverage
the SQL capabilities of Hive QL while managing the data in a defined schema; using Hive can also
reduce the development time significantly.
After weighing these pros and cons, Hive becomes the obvious choice for this Merchant Data
Analysis project.
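As a sketch of the Hive approach described above, a query like the following would compute a per-merchant order count directly in Hive QL, without any custom map/reduce code (the table and column names here are hypothetical illustrations, not taken from the project):

```sql
-- Hypothetical table layout; names are illustrative only.
SELECT merchant_id,
       COUNT(*) AS total_orders
FROM   transactions
GROUP  BY merchant_id
ORDER  BY total_orders DESC;
```

Hive compiles such a query into the equivalent MapReduce stages automatically, which is exactly the development-time saving the comparison points to.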
CHAPTER 6
IMPLEMENTATION DETAILS
6.1 Data Analysis
6.1.1 AggregateData.java
package org.paceWithMe.hadoop.helpers;

import java.io.Serializable;

// Plain data holder for the per-merchant aggregates. Only the field used
// elsewhere in the report is reconstructed here; the rest of the class
// body is not present in the source.
public class AggregateData implements Serializable {
    private long totalOrder;

    public long getTotalOrder() { return totalOrder; }
    public void setTotalOrder(long totalOrder) { this.totalOrder = totalOrder; }
}
6.1.2 AggregateWritable.java
package org.paceWithMe.hadoop.helpers;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import com.google.gson.Gson;

// Hadoop Writable wrapper around AggregateData. The class declaration and
// fields are reconstructed from usage; the write/readFields bodies are not
// present in the source.
public class AggregateWritable implements Writable {
    private AggregateData aggregateData;
    private static final Gson gson = new Gson();

    public AggregateWritable() {
    }

    public AggregateWritable(AggregateData aggregateData) {
        this.aggregateData = aggregateData;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialization logic elided in the source
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialization logic elided in the source
    }

    @Override
    public String toString() {
        return gson.toJson(aggregateData);
    }
}
6.1.3 MerchantAnalyticsJob.java
package org.paceWithMe.hadoop.helpers;

import java.io.*;
import java.net.*;
import java.text.*;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.util.*;
import org.slf4j.*;
import org.paceWithMe.hadoop.helpers.AggregateData;
import org.paceWithMe.hadoop.helpers.AggregateWritable;
import org.paceWithMe.hadoop.helpers.Transaction;
@Override
protected void setup(Mapper<LongWritable, Text, Text, AggregateWritable>.Context context)
        throws IOException, InterruptedException {
    // Load the merchant-id to merchant-name lookup from the distributed cache.
    URI[] paths = context.getCacheArchives();
    if (paths != null) {
        for (URI path : paths) {
            loadMerchantIdNameInCache(path.toString(), context.getConfiguration());
        }
    }
    super.setup(context);
}
// Tail of loadMerchantIdNameInCache(...): close the reader once the cache
// file has been read (the body of the method is not present in the source).
    try {
        if (br != null)
            br.close();
    } catch (Exception e) {
        LOGGER.error("exception occurred while closing the file reader = {}", e);
    }
}
}
@Override
protected void map(LongWritable key, Text value,
        Mapper<LongWritable, Text, Text, AggregateWritable>.Context context)
        throws IOException, InterruptedException {
    String line = value.toString().replace("\"", "");
    // skip the CSV header record
    if (line.indexOf("transaction") != -1) {
        return;
    }
    String[] split = line.split(",");
    // (construction of the Transaction object from split[] is elided in the source)
    AggregateData aggregateData = new AggregateData();
    aggregateData.setTotalOrder(1L);
    AggregateWritable aggregateWritable = new AggregateWritable(aggregateData);
    // key = "<merchant name>-<order date>", value = the per-record aggregate
    String outputKey = merchantIdNameMap.get(transaction.getMerchantId().toString()) + "-"
            + (split[3].trim().split(" ")[0].trim());
    context.write(new Text(outputKey), aggregateWritable);
}
public void reduce(Text key, Iterable<AggregateWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum the per-record aggregates for this merchant/date key
    // (the accumulation loop over `values` is elided in the source).
    AggregateData aggregateData = new AggregateData();
    AggregateWritable aggregateWritable = new AggregateWritable(aggregateData);
    context.write(key, aggregateWritable);
}
}
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(AggregateWritable.class);
job.setJarByClass(MerchantAnalyticsJob.class);
job.setReducerClass(MerchantOrderReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
        TransactionMapper.class);
FileSystem fileSystem = FileSystem.get(job.getConfiguration());
RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(new Path(args[1]), true);
@Override
public int run(String[] args) throws Exception{
return runMRJob(args);
}
}
6.1.4 Transaction.java
package org.paceWithMe.hadoop.helpers;

import java.io.Serializable;

// Bean for one transaction record. Only the field used elsewhere in the
// report is reconstructed here; the remaining fields are not present in
// the source.
public class Transaction implements Serializable {
    private Long merchantId;

    public Long getMerchantId() { return merchantId; }
    public void setMerchantId(Long merchantId) { this.merchantId = merchantId; }
}
CHAPTER 7
RESULTS
7.1 Snapshots
Snapshot 3: Hadoop installation (screenshots omitted in this text version)
CHAPTER 8
APPLICATIONS
• E-commerce websites
• Companies that provide merchant rating services
• Merchant review platforms
• Consumer comparison shopping
CHAPTER 9
CONCLUSION AND FUTURE WORK
9.1 Conclusion
The project demonstrates a merchant rating system that improves customer experience by deciding
which merchant provides better services among multiple merchants selling the same types of
products, using Big Data Analytics. We have shown how insightful information can be extracted
from a merchant dataset using Big Data Analytics; for this purpose we used Hadoop MapReduce.
9.2 Future Work
Future work would include extending the analysis of the merchant data using other Big Data
technologies. More information could be extracted, such as the ratings given by other customers,
the number of orders shipped, and the number of items returned due to a merchant's mistake,
making the merchant rating more accurate.
REFERENCES
[2] Allouche, G. (2015, July 01). Hadoop 101: An Explanation of the Hadoop Ecosystem - DZone Big
Data. Retrieved April 27, 2017, from https://dzone.com/articles/hadoop-101-explanation-hadoop
[3] Braselton, J. P. (2014). Hadoop: Integration in IBM, Microsoft and SAS. Place of publication not
identified: CreateSpace.
[4] Seshachala, S. (2015, June 01). Bigdata - Understanding Hadoop and Its Ecosystem. Retrieved April
27, 2017, from https://devops.com/bigdata-understanding-hadoop-ecosystem/
[5] Teplow, D. (2015, May 15). Hadoop. Retrieved April 27, 2017, from
http://www.hadoop360.com/blog/hadoop-whose-to-choose
[7] Kanoje, S. (2016, July 12). Hadoop Ecosystem Quick Start: 5 Key Components. Retrieved April 27,
2017, from https://www.ironsidegroup.com/2015/12/01/hadoop-ecosystkeycomponents/
[8] Karambelkar, H. (2015). Scaling Big Data with Hadoop and Solr: Understand, Design, Build, and
Optimize Your Big Data Search Engine with Hadoop and Apache Solr. Birmingham: Packt Publishing.
[9] Prajapati, V. (2013). Big Data Analytics with R and Hadoop. Olton: Packt Publishing. Retrieved from
http://ebookcentral.proquest.com.ezproxy.ferris.edu/lib/ferrisstate/detail.action?docID=1477486
[10] Trifu, M. R., & Ivan, M. (2016). Big data components for business process optimization. Informatica
Economica, 20(1), 72-78. doi:http://dx.doi.org.ezproxy.ferris.edu/10.12948/issn14531305/20.1.2016.07
[11] Ivan, M. L. (2016, January). Big Data Components for Business Process Optimization. Retrieved
April 27, 2017, from http://revistaie.ase.ro/content/77/07%20-%20Trifu,%20Ivan.pdf
[12] Prasad Padhy, R. (2013, February). Big Data Processing with Hadoop-MapReduce in Cloud Systems.
Retrieved April 27, 2017, from http://www.iaesjournal.com/online/index.php/IJ-CLOSER/article/view/1508/502