Running Head: ELASTIC MAPREDUCE (EMR) 1
EMR use cases
Student’s name
Institution
Assignment due date

Introduction
Amazon Elastic MapReduce (EMR) is a cloud-based big data platform offered by Amazon Web
Services (AWS). It processes and analyzes bulk data in a short period while reducing the cost
of doing so. The platform relies on open-source tools such as Apache Hive, Apache Hudi, Presto,
Apache HBase, Apache Spark, and Apache Flink (Soualhia, Khomh, & Tahar, 2015). Amazon
developed EMR to simplify data analysis and to reduce the time taken to process big data by
handling cluster setup and tuning on the user's behalf.
As an example of how EMR performs, it has been reported that a petabyte-scale analysis ran
about three times faster than standard Apache Spark at roughly half the cost. EMR can be used
by data scientists, data engineers, and data analysts.
EMR comes with several benefits. It is easy to operate because the user is left only with the
task of analyzing data. EMR handles the other tasks, such as cluster launching and tuning,
Hadoop configuration, node provisioning, and setting up the rest of the EMR architecture. It is
also less costly: it reportedly cuts the cost by half while completing the work faster than the
standard time (Soualhia, Khomh, & Tahar, 2015). The price of a job is determined by the number
of EC2 instances deployed, their type, and their location. The instances range from reserved
to spot instances.
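The pricing rule above (instance count, type, and running time) can be illustrated with a
minimal sketch. The group names and hourly rates below are hypothetical placeholders chosen for
illustration, not actual AWS prices, and the function is not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroup:
    """One group of identical instances in a cluster (illustrative only)."""
    name: str            # hypothetical label, e.g. purchase option and role
    count: int           # number of instances in the group
    hourly_rate: float   # hypothetical USD per instance-hour
    hours: float         # how long the group runs


def estimate_cluster_cost(groups):
    """Sum count * rate * hours over every instance group."""
    return sum(g.count * g.hourly_rate * g.hours for g in groups)


# Hypothetical cluster: two on-demand core nodes plus four cheaper spot task nodes.
groups = [
    InstanceGroup("core-on-demand", count=2, hourly_rate=0.20, hours=10),
    InstanceGroup("task-spot", count=4, hourly_rate=0.06, hours=10),
]

print(f"Estimated cost: ${estimate_cluster_cost(groups):.2f}")
```

With these made-up rates, the spot group doubles the number of nodes for a little over half of
what the on-demand group costs, which is the kind of trade-off the choice of instance type
controls.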
EMR is elastic: it handles different scales of data by adding or removing instances, either
manually or automatically. This also reduces cost, because you are charged only for what you
use and data is loaded and processed independently. EMR is reliable in that it recovers from
downtime automatically, applying fixes through periodic automatic updates. It can also replace
terminated instances when properly configured to do so. Finally, EMR adds safety and security
to the data and to the process itself. It controls network access through firewall settings and
encrypts data to secure both the clusters and the data.
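The elasticity described above can be sketched as a simple scaling rule. This is an
illustrative assumption about how such a rule might look, not EMR's actual scaling policy; the
function name, the tasks-per-instance figure, and the limits are all hypothetical.

```python
def desired_instances(pending_tasks, tasks_per_instance=10, minimum=1, maximum=20):
    """Return how many instances to run for the current backlog.

    Uses ceiling division so that a partial batch of tasks still gets an
    instance, then clamps the result to the configured limits.
    """
    needed = -(-pending_tasks // tasks_per_instance)  # ceiling division
    return max(minimum, min(maximum, needed))


for backlog in (0, 45, 1000):
    print(backlog, "pending tasks ->", desired_instances(backlog), "instances")
```

The clamp is the part that ties scaling to cost: the cluster shrinks to the minimum when idle
and never grows past the maximum the user is willing to pay for.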
EMR is mainly used for large-scale data processing, especially data analysis. Its use cases
include machine learning and deep learning, log analysis, genomics, fraud detection, data
transformations, bioinformatics, and financial and scientific simulation (Hamdi, Khemakhem, &
Zaidan, 2016). Below is a detailed discussion of some of these use cases.
Elastic MapReduce (EMR) use cases
1. Data analysis: machine learning
In today's world, artificial intelligence is growing rapidly and has diversified into fields
such as neural networks, robotics, machine learning, and deep learning. Machine learning is a
branch of artificial intelligence that deals with how computers and digital devices can gain
knowledge and learn, much as human beings do, when exposed to input data over a period of
time.

EMR supports machine learning by running frameworks such as TensorFlow and Apache projects.
The platform comes with preinstalled libraries intended mainly to aid machine learning (Chen
et al., 2017). By analyzing data with the machine learning tools built into EMR, a system can
make predictions and improve as it progresses from one task to another.
In terms of computational design, EMR eases the setup of the libraries a user needs for data
analysis through machine learning. These include Apache Spark MLlib, TensorFlow, and Apache
MXNet; bootstrap actions and custom AMIs can be used to add further tools.
2. Extract Transform Load (ETL)
ETL merges three data processes: data extraction, data transformation, and data loading. It
involves acquiring data from different sources, transforming it through operations such as
adding, removing, calculating, and appending, and finally loading it into a storage repository
such as a data warehouse (Bansal, 2014). The process is limited by data integration problems:
data may be named wrongly, different data may be referred to by the same name, data may be
found in only one location, and a single key may be referenced by different data.
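The three ETL stages, together with the naming problem mentioned above, can be sketched in a
few lines. The two inline sources, their column names, and the `sales` table are hypothetical;
an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that name the same fields differently.
SOURCE_A = "customer_id,amount\n1,10.50\n2,4.00\n"
SOURCE_B = "cust,total\n1,2.50\n3,7.25\n"


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, id_field, amount_field):
    """Transform: map source-specific column names onto one schema and cast types."""
    return [(int(r[id_field]), float(r[amount_field])) for r in rows]


def load(conn, records):
    """Load: append the records to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


def run_etl():
    """Run extract, transform, and load for both sources."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(SOURCE_A), "customer_id", "amount"))
    load(conn, transform(extract(SOURCE_B), "cust", "total"))
    return conn


def totals_by_customer(conn):
    """Query the loaded data: total amount per customer."""
    cursor = conn.execute(
        "SELECT customer_id, SUM(amount) FROM sales "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return cursor.fetchall()


print(totals_by_customer(run_etl()))
```

Passing the field names into `transform` is one simple way to reconcile sources that label the
same data differently before it reaches the shared table.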
ETL uses various tools to extract, transform, and load data correctly, quickly, and at low
cost. These tools include Oracle, MarkLogic, and Amazon Redshift. Amazon Redshift is an AWS
data warehouse service that is often used alongside EMR; it can sort, join, and compute totals
and averages over big data. It uses SQL and BI tools to analyze data at petabyte scale, which
would otherwise require complex queries against the data warehouse system. Its main benefits
are that it is simple to use and cheap in terms of analysis cost.
3. Genomics and bioinformatics
Genomics is the study of the genomes of organisms, drawing on recombinant DNA techniques, DNA
sequencing methods, and bioinformatics (Nellore et al., 2016). It assembles genomes and
analyzes them according to their functions. One of its main challenges is how to process the
bulk data generated and how to make genome sequencing, mapping, and bioinformatics analyses
fast and less costly. One solution is to use Amazon Web Services, mainly the EMR platform.
EMR enables genomics experts to process large amounts of data easily, quickly, and
efficiently, and at a reduced cost. Genomics deals with large amounts of data arising from
interactions between loci and alleles. Because the field is resource-intensive, researchers
benefit from being able to process genomic data easily and at low cost through Amazon Web
Services (AWS).
4. Log processing and fraud detection
Any information system or networked system should be able to record and store log files of all
activities carried out and the times at which they occurred. Log processing is the analysis of
a system's log files to produce information that can be used to monitor the system and to
detect anomalies that may indicate attacks such as fraud. Log analytics uses tools such as
Splunk, Retrace, Logentries, Logmatic, Sumo Logic, Graylog, and Logstash (Wadkar &
Siddalingaiah, 2014). The tools range from commercial to open source.
EMR provides a platform for log analysis without a physical server, through AWS cloud
computing. For log processing, and log analysis in particular, Amazon offers a tool called
Amazon Kinesis. Amazon Kinesis is easy to use and processes logs quickly and reliably. Its
main feature is that it analyzes log data as a stream, while the data is being collected,
without waiting for the whole dataset to arrive. Amazon Kinesis collects, processes, and
analyzes data in real time through the following built-in services:
• Amazon Kinesis Streams
• Amazon Kinesis Firehose
• Amazon Kinesis Analytics
Amazon EMR uses the Hadoop framework to process large data in a distributed computing
environment made up of virtual servers. It processes the log files of a system or network
piece by piece, without waiting for other logs, which makes it fast and cost-saving. Once log
files are processed, fraud can be detected through techniques such as anomaly-based detection
and signature-based detection. EMR can also analyze logs to find file usage, access patterns,
and the web browsing history of users. In business terms, this information can yield insights
relevant to your organization or firm.
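The two detection techniques named above can be sketched on a small stream of log lines. The
log format, the two signature patterns, and the failed-login threshold are assumptions made for
illustration; a real deployment would run comparable logic on a cluster rather than in one
process.

```python
import re
from collections import Counter

# Signature-based detection: known-bad patterns (hypothetical examples).
SIGNATURES = [
    re.compile(r"union\s+select", re.IGNORECASE),  # SQL injection attempt
    re.compile(r"\.\./"),                          # path traversal attempt
]
FAILED_LOGIN = re.compile(r"LOGIN_FAILED user=(\S+)")


def scan(lines, max_failures=3):
    """Yield (line_number, kind, detail) alerts while reading lines one by one."""
    failures = Counter()
    for number, line in enumerate(lines, start=1):
        if any(sig.search(line) for sig in SIGNATURES):
            yield (number, "signature", line.strip())
        match = FAILED_LOGIN.search(line)
        if match:
            user = match.group(1)
            failures[user] += 1
            # Anomaly-based detection: alert once when a user exceeds the threshold.
            if failures[user] == max_failures + 1:
                yield (number, "anomaly", user)


# Hypothetical log lines in an assumed format.
LOG = [
    "2024-01-01T10:00:00 LOGIN_FAILED user=alice",
    "2024-01-01T10:00:01 GET /index.html 200",
    "2024-01-01T10:00:02 LOGIN_FAILED user=alice",
    "2024-01-01T10:00:03 GET /search?q=1 UNION SELECT card FROM payments 200",
    "2024-01-01T10:00:04 LOGIN_FAILED user=alice",
]

alerts = list(scan(LOG, max_failures=2))
for alert in alerts:
    print(alert)
```

Because `scan` is a generator, each alert is produced as soon as its line is read, which
mirrors the piece-by-piece processing described in the text.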
5. Event/clickstream analysis
Event analysis, or clickstream analysis, is the process of collecting, analyzing, and
reporting information about web page visits: the time a page was visited, the frequency of
visits, and the resources accessed or activities carried out while a user was on the page. The
information obtained is used to understand customers' needs and to develop effective
responses, such as relevant advertisements.
Amazon EMR uses Apache Hive, Apache Spark, and StreamSets Transformer to carry out clickstream
analysis, with results served through Elasticsearch, Kibana, and Amazon Redshift (Chandra,
Varde, & Wang, 2019). Modern data integration is applied so that unstructured and
semi-structured data can be analyzed by Redshift more simply and quickly. Clickstream data is
analyzed through data visualization techniques. For example, the following are some of the
visualization steps for a clickstream analysis of a company's web page:
• Session-wise analysis: the number of people who visited the site.
• Client-wise analysis: the number of sessions created per person.
• Brand analysis: the number of events per product.
• HTTP response analysis: the number of events that received a success or an error response.
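The four analyses listed above can be sketched as simple aggregations over click events. The
event fields and sample records are hypothetical, each distinct session is counted as one
visit, and any HTTP status below 400 is assumed to mean success.

```python
from collections import Counter, defaultdict

# Hypothetical click events; the field names are assumptions for illustration.
EVENTS = [
    {"session": "s1", "client": "c1", "product": "shoes", "status": 200},
    {"session": "s1", "client": "c1", "product": "hats", "status": 404},
    {"session": "s2", "client": "c1", "product": "shoes", "status": 200},
    {"session": "s3", "client": "c2", "product": "shoes", "status": 500},
]


def summarize(events):
    """Compute session, client, brand, and HTTP response counts in one pass."""
    sessions_by_client = defaultdict(set)
    events_per_product = Counter()
    responses = Counter()
    for event in events:
        sessions_by_client[event["client"]].add(event["session"])
        events_per_product[event["product"]] += 1
        responses["ok" if event["status"] < 400 else "error"] += 1
    return {
        # Session-wise: distinct sessions, each treated as one visit.
        "sessions": len({event["session"] for event in events}),
        # Client-wise: number of sessions created per client.
        "sessions_per_client": {c: len(s) for c, s in sessions_by_client.items()},
        # Brand: number of events per product.
        "events_per_product": dict(events_per_product),
        # HTTP response: successful versus failed events.
        "responses": dict(responses),
    }


summary = summarize(EVENTS)
for name, value in summary.items():
    print(name, value)
```

Each of these dictionaries maps directly onto one chart in a dashboard, which is where the
visualization step comes in.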

Elastic MapReduce (EMR) use case instances
• When you have a long wait time for data processing
When systems take longer than expected, they become tiresome and unreliable. Queues form,
which would normally mean adding hardware or other resources to clear the delay. EMR
addresses this by processing data quickly and incrementally, according to the scale of the
collected information. EMR reportedly processes larger datasets in half the standard time.
• When you want to outsource physical server infrastructure
Cloud computing is on the rise and is solving problems encountered when setting up physical
data warehouses, such as disasters and the high cost of hardware installation and
maintenance. EMR offers an alternative to physical server infrastructure: Amazon Web Services
hosts the servers, which can be accessed remotely from any region.
• When managing Hadoop is becoming a hassle
Hadoop can be run in-house (on-premises) or in the cloud. In-house Hadoop has many
limitations because it is physical, such as node failures and high costs of installation,
operation, upgrading, and maintenance. EMR offers managed, cloud-hosted Hadoop that addresses
these problems by processing large datasets faster and more easily, without servers for the
user to maintain.
• When you have a rigid in-house cluster infrastructure
It is a waste of resources to involve every component of an infrastructure in an analysis
task that requires only a specific resource. EMR uses resources efficiently by loading only
the libraries needed for a specific action during data analysis.
References
Soualhia, M., Khomh, F., & Tahar, S. (2015, August). Predicting scheduling failures in the cloud: A case
study with Google clusters and Hadoop on Amazon EMR. In 2015 IEEE 17th International Conference on
High-Performance Computing and Communications, 2015 IEEE 7th International Symposium on
Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software
and Systems (pp. 58-65). IEEE.
Hamdi, H., Khemakhem, M., & Zaidan, A. (2016). Complementary Approaches Built as Web Service for
Arabic Handwriting OCR Systems via Amazon Elastic MapReduce (EMR) Model. Accepted paper in the
International Arab Journal of Information Technology IAJIT.
Chen, L., Li, R., Liu, Y., Zhang, R., & Woodbridge, D. M. K. (2017, August). Machine learning-based
product recommendation using Apache Spark. In 2017 IEEE SmartWorld, Ubiquitous Intelligence &
Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data
Computing, Internet of People, and Smart City Innovation
(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1-6). IEEE.
Bansal, S. K. (2014, June). Towards a semantic extract-transform-load (ETL) framework for big data
integration. In 2014 IEEE International Congress on Big Data (pp. 522-529). IEEE.
Nellore, A., Wilks, C., Hansen, K. D., Leek, J. T., & Langmead, B. (2016). Rail-dbGaP: analyzing dbGaP-
protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics, 32(16), 2551-2553.
Wadkar, S., & Siddalingaiah, M. (2014). Log Analysis Using Hadoop. In Pro Apache Hadoop (pp. 283-
291). Apress, Berkeley, CA.
Chandra, S., Varde, A. S., & Wang, J. (2019, October). A Hive and SQL case study in cloud data
analytics. In 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication
Conference (UEMCON) (pp. 0112-0118). IEEE.
