
Running Head: ELASTIC MAPREDUCE (EMR) 1

EMR use cases

Student’s name

Institution

Assignment due date



Introduction

Elastic MapReduce (EMR) is a cloud-based big data platform offered by Amazon Web Services (AWS). It processes and analyzes large volumes of data in a short time while reducing the cost of doing so. The platform relies on open-source tools such as Apache Hive, Apache Hudi, Presto, Apache HBase, Apache Spark, and Apache Flink (Soualhia, Khomh, & Tahar, 2015). Amazon developed EMR to simplify data analysis and to reduce the time needed to process big data, which would otherwise require extensive cluster setup and tuning.

As an example of how Elastic MapReduce (EMR) performs, it has been reported that a petabyte-scale analysis ran three times faster and cost the team half the normal price, with both figures measured against standard Apache Spark. EMR can be used by data scientists, data engineers, and data analysts.

Elastic MapReduce (EMR) offers several benefits. It is simple to operate because the user is left only with the task of analyzing data. EMR handles the remaining tasks, such as launching and tuning clusters, configuring Hadoop, provisioning nodes, and setting up the rest of the EMR architecture. It is also less costly: it is reported to cut costs by half while completing the intended process faster than the standard time (Soualhia, Khomh, & Tahar, 2015). The price of a job is determined by the EC2 instances deployed, their type, and their location. Instance purchasing options range from reserved to spot instances.
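The pricing criteria above can be illustrated with a minimal Python sketch. The hourly rates, the instance type, and the region in the lookup table are hypothetical values chosen for illustration only; they are not actual AWS prices.

```python
# Hypothetical hourly rates in USD, keyed by (instance type, region, purchasing option).
# These numbers are made up for illustration and are NOT real AWS prices.
HYPOTHETICAL_HOURLY_RATES = {
    ("m5.xlarge", "us-east-1", "on_demand"): 0.25,
    ("m5.xlarge", "us-east-1", "spot"): 0.10,
}


def estimate_cost(instances, hours):
    """Estimate job cost as rate * instance count * hours, summed over all groups."""
    total = 0.0
    for instance_type, region, market, count in instances:
        rate = HYPOTHETICAL_HOURLY_RATES[(instance_type, region, market)]
        total += rate * count * hours
    return total


# One on-demand node plus four spot nodes, running for two hours.
cluster = [
    ("m5.xlarge", "us-east-1", "on_demand", 1),
    ("m5.xlarge", "us-east-1", "spot", 4),
]
estimated = estimate_cost(cluster, hours=2.0)
print(f"Estimated cost: ${estimated:.2f}")
```

Under these assumed rates, moving worker nodes from on-demand to spot instances is what lowers the total, which is the kind of trade-off the purchasing options allow.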

Elastic MapReduce (EMR) is elastic: it handles different scales of data by adding or removing instances, either manually or automatically. This also reduces cost, because you are charged only for what you use, and data is stored and processed independently. EMR is reliable in that clusters recover from downtime automatically, and fixes for bugs and errors arrive through periodic automatic updates. It also replaces terminated instances when properly configured to do so. Finally, EMR adds safety and security to the data and to the processing itself. It controls access to networks through firewall settings and encrypts data, which protects both the clusters and the data.
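The elasticity described above can be sketched as a simple sizing rule. This is only an illustration of the idea of scaling instances to the workload; the function name, the capacity per instance, and the node limits are assumptions, not the scaling policy that EMR itself applies.

```python
import math


def desired_instance_count(pending_gb, gb_per_instance, min_nodes=1, max_nodes=10):
    """Return the number of instances needed for the pending data.

    All parameters are hypothetical: the workload is divided by an assumed
    per-instance capacity and the result is clamped to [min_nodes, max_nodes].
    """
    needed = math.ceil(pending_gb / gb_per_instance)
    return max(min_nodes, min(max_nodes, needed))


# Scale out for a moderate backlog, scale in when idle, cap at the maximum.
print(desired_instance_count(450, 100))
print(desired_instance_count(0, 100))
print(desired_instance_count(5000, 100))
```

Because the count falls back to the minimum when there is no pending data, the cluster only pays for extra instances while a backlog exists.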

Elastic MapReduce (EMR) is mainly used for secure, large-scale data processing, especially data analysis. Its use cases include machine learning and deep learning, log analysis, genomics, fraud detection, data transformation, bioinformatics, and financial and scientific simulation (Hamdi, Khemakhem, & Zaidan, 2016). Some of these use cases are discussed in detail below.

Elastic MapReduce (EMR) use cases

1. Data analysis: machine learning

Artificial intelligence is growing rapidly and has diversified into fields such as neural networks, robotics, machine learning, and deep learning. Machine learning is a branch of artificial intelligence concerned with how computers and digital devices can gain knowledge and learn, much as human beings do, when exposed to input data over a period of time.



Elastic MapReduce (EMR) supports machine learning by making it easier to run frameworks such as TensorFlow and Apache Spark MLlib. The platform comes with preinstalled algorithms intended to aid machine learning (Chen et al., 2017). By analyzing data with the machine learning tools built into EMR, a system can make predictions and improve as it progresses from one task to another.

In terms of computational design, Elastic MapReduce (EMR) eases the setup of the libraries that a user needs for data analysis through machine learning. These include Apache Spark MLlib, TensorFlow, and Apache MXNet. Further libraries can be added through bootstrap actions and custom AMIs.

2. Extract Transform Load (ETL)

ETL combines three data processes: extraction, transformation, and loading. It involves acquiring data from different sources, transforming it through operations such as adding, removing, calculating, and appending, and finally loading it into a storage repository such as a data warehouse (Bansal, 2014). The process is limited by data integration problems: data may be named wrongly, different data may be referred to by the same name, data may be found in only one location, and a single key may be referenced by different data.
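The three ETL stages can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, with an in-memory SQLite database standing in for the data warehouse; the sample rows, column names, and table name are invented for the example.

```python
import csv
import io
import sqlite3

# Invented sample source data; a real pipeline would read from files or services.
RAW_CSV = """order_id,cust_name,qty,unit_price
1,alice,2,9.50
2,bob,1,20.00
3,alice,3,4.00
"""


def extract(text):
    """Extract: read raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: rename a column, fix types, and calculate a total per order."""
    return [
        {
            "order_id": int(row["order_id"]),
            "customer": row["cust_name"].title(),
            "total": int(row["qty"]) * float(row["unit_price"]),
        }
        for row in rows
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :total)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = dict(conn.execute("SELECT customer, SUM(total) FROM orders GROUP BY customer"))
print(totals)
```

The renaming of `cust_name` to `customer` in the transform step is a small instance of the naming problems mentioned above: source systems rarely agree on column names, so the mapping has to be made explicit before loading.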

ETL relies on various tools to extract, transform, and load data correctly, quickly, and at low cost. These tools include Oracle, MarkLogic, and Amazon Redshift. Amazon Redshift is an AWS data warehouse service, separate from Elastic MapReduce (EMR) but often used alongside it, that can sort, join, and compute totals and averages over big data. It uses SQL and BI tools to analyze data at petabyte scale, which would otherwise require complex processes when querying a data warehouse system. Amazon Redshift's main benefits are its ease of use and its low analysis cost.

3. Genomics and bioinformatics

Genomics is the study of the genomes of organisms. It draws on recombinant DNA techniques, DNA sequencing methods, and sequence bioinformatics (Nellore et al., 2016). Genomics assembles various genomes and analyzes them according to their functions. One of its main challenges is how to process the bulk data generated and how to make genome sequencing, mapping, and bioinformatics analysis fast and less costly. One solution is to use Amazon Web Services (AWS), mainly the Elastic MapReduce (EMR) platform.

Elastic MapReduce (EMR) enables genomics researchers to process large amounts of data easily, quickly, and efficiently, and at a reduced cost. Genomics deals with large amounts of data, including data acquired through interactions between loci and alleles, and is therefore resource-intensive. Researchers also benefit because genomic data hosted on Amazon Web Services (AWS) can be accessed easily and at no cost.

4. Log processing and fraud detection

Any information system or networked system should be able to record and store log files of all activities carried out and the time at which they occurred. Log processing is the analysis of a system's log files to produce information that can be used to monitor the system and to detect anomalies that may indicate attacks such as fraud. Log analytics uses tools such as Splunk, Retrace, Logentries, Logmatic.io, Sumo Logic, Graylog, and Logstash (Wadkar & Siddalingaiah, 2014). The tools range from commercial to open source.

Elastic MapReduce (EMR) provides a platform for conducting log analysis more easily, without owning a physical server, through the cloud computing services of Amazon Web Services (AWS). For log processing, and analysis in particular, AWS offers a service called Amazon Kinesis, which can be used together with EMR. Amazon Kinesis is easy to use and provides fast, reliable log processing. Its main capability is analyzing log data as a stream, record by record as it is collected, without waiting for the whole dataset to arrive. Amazon Kinesis collects, processes, and analyzes data in real time through the following built-in services:

• Amazon Kinesis Streams
• Amazon Kinesis Firehose
• Amazon Kinesis Analytics
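The idea of processing records as they arrive, rather than waiting for the complete dataset, can be sketched with a Python generator. This is a conceptual illustration only and does not use the Amazon Kinesis API; the record fields `source` and `status` are hypothetical.

```python
from collections import Counter


def running_error_counts(records):
    """Yield the cumulative server-error count per source after each record.

    Each record is handled the moment it is read from the stream, so a result
    is available before the stream has finished.
    """
    counts = Counter()
    for record in records:
        if record["status"] >= 500:
            counts[record["source"]] += 1
        yield dict(counts)


# A small in-memory iterator stands in for a live stream of log records.
stream = iter([
    {"source": "web-1", "status": 200},
    {"source": "web-2", "status": 503},
    {"source": "web-2", "status": 500},
])
snapshots = list(running_error_counts(stream))
print(snapshots)
```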

Amazon Elastic MapReduce (EMR) uses the Hadoop framework to process large datasets in a distributed computing environment made up of virtual servers. It can process the log files of a system or network piece by piece, without waiting for other logs, which makes it fast and cost-saving. Once log files are processed, fraud can be detected through techniques such as anomaly-based detection and signature-based detection. EMR can also be used to analyze logs for file usage, access patterns, and the web browsing history of users. From a business perspective, this information can be used to identify insights relevant to an organization or firm.
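The two detection techniques named above can be sketched in plain Python using a map step and a reduce step in the style of MapReduce. This is a toy illustration that runs on a single machine, not Hadoop code; the log format, the sample lines, the threshold, and the signature patterns are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical attack signatures: SQL injection text and path traversal.
SIGNATURES = [re.compile(r"(?i)union\s+select"), re.compile(r"\.\./")]

# Invented sample log lines in the form: timestamp ip event details
LOG_LINES = [
    "2024-01-01T10:00:01 10.0.0.5 LOGIN_FAILED user=admin",
    "2024-01-01T10:00:02 10.0.0.5 LOGIN_FAILED user=admin",
    "2024-01-01T10:00:03 10.0.0.5 LOGIN_FAILED user=root",
    "2024-01-01T10:00:04 10.0.0.9 LOGIN_OK user=carol",
    "2024-01-01T10:00:05 10.0.0.7 GET path=/files/../../etc/passwd",
]


def map_failed_logins(line):
    """Map step: emit (ip, 1) for every failed login line."""
    _timestamp, ip, event, *_details = line.split()
    if event == "LOGIN_FAILED":
        yield ip, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def anomaly_based(totals, threshold):
    """Flag IPs whose failed-login count exceeds the assumed normal threshold."""
    return sorted(ip for ip, count in totals.items() if count > threshold)


def signature_based(lines):
    """Flag lines that match any known attack signature."""
    return [line for line in lines if any(sig.search(line) for sig in SIGNATURES)]


pairs = [pair for line in LOG_LINES for pair in map_failed_logins(line)]
totals = reduce_counts(pairs)
suspicious_ips = anomaly_based(totals, threshold=2)
suspicious_lines = signature_based(LOG_LINES)
print(suspicious_ips, len(suspicious_lines))
```

Because each line is mapped independently, the map step could be spread across many nodes and only the small (key, count) pairs need to be brought together in the reduce step, which is the property that lets a Hadoop cluster process logs piece by piece.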

5. Event/clickstream analysis

Event analysis, or clickstream analysis, is the process of collecting, analyzing, and reporting information about web page visits: when a page was visited, how often it was visited, and which resources were accessed or activities carried out while the user was on the page. The information obtained is used to understand customers' needs and to develop effective responses, such as relevant ads that reflect customer requirements.

Amazon Elastic MapReduce (EMR) uses Apache Hive, Apache Spark, and StreamSets Transformer to support clickstream analysis in Elasticsearch, Kibana, and Amazon Redshift (Chandra, Varde, & Wang, 2019). Modern data integration is applied so that unstructured and semi-structured data can be analyzed by Redshift more simply and at a faster rate. Clickstream data is analyzed through data visualization techniques. For example, the following are some of the visualization steps followed in a clickstream analysis of a company's web page:

• Session-wise analysis: the number of people who visited the site.
• Client-wise analysis: the number of sessions created per person.
• Brand analysis: the number of events per product.
• HTTP response analysis: the number of events that returned a success or an error response.
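The four analyses listed above can be sketched as simple aggregations over event records. This is a minimal single-machine illustration in Python; the event fields (`session`, `client`, `product`, `status`) and the sample events are assumptions made for the example, and a real pipeline would run equivalent queries in Hive or Spark.

```python
from collections import Counter

# Invented sample events; the field names are hypothetical.
EVENTS = [
    {"session": "s1", "client": "c1", "product": "shoes", "status": 200},
    {"session": "s1", "client": "c1", "product": "hats", "status": 404},
    {"session": "s2", "client": "c1", "product": "shoes", "status": 200},
    {"session": "s3", "client": "c2", "product": "shoes", "status": 500},
]


def clickstream_summary(events):
    """Compute visitor, per-client session, per-product, and response counts."""
    visitors = len({e["client"] for e in events})
    # Count each (client, session) pair once so repeated events do not inflate it.
    sessions_per_client = Counter(
        client for client, _session in {(e["client"], e["session"]) for e in events}
    )
    events_per_product = Counter(e["product"] for e in events)
    # Treat HTTP status codes below 400 as success and the rest as errors.
    responses = Counter("ok" if e["status"] < 400 else "error" for e in events)
    return {
        "visitors": visitors,
        "sessions_per_client": dict(sessions_per_client),
        "events_per_product": dict(events_per_product),
        "responses": dict(responses),
    }


summary = clickstream_summary(EVENTS)
print(summary)
```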



Elastic MapReduce (EMR) use case instances

• When you have a long wait time for data processing

When a system takes longer than expected, it becomes tiresome and unreliable. Queues form, which means that hardware or other resources must be added to clear the delay. Elastic MapReduce (EMR) solves this by analyzing data quickly and incrementally, according to the scale of the collected information. EMR processes large datasets considerably faster than the standard time.

• When you want to outsource physical server infrastructure

Cloud computing is on the rise and is solving problems encountered when setting up physical data warehouses, such as disasters and the high cost of hardware installation and maintenance. Elastic MapReduce (EMR) offers an alternative to physical server infrastructure through Amazon Web Services (AWS), which hosts servers that can be accessed remotely from any region.

• When managing Hadoop is becoming a hassle

Hadoop can be deployed in two forms: in-house (on-premises) Hadoop and cloud-hosted Hadoop. Because it runs on physical hardware, in-house Hadoop has many limitations, such as node failures and high costs of installation, operation, upgrading, and maintenance. Elastic MapReduce (EMR) provides a managed, cloud-hosted Hadoop that addresses these problems by processing large datasets faster and more easily, without physical servers to manage.

• When you have a rigid in-house cluster infrastructure

It is a waste of resources to involve every component of an infrastructure in an analysis task that requires only a specific resource. Elastic MapReduce (EMR) uses resources efficiently by loading only the libraries needed for a specific action during data analysis.

References

Soualhia, M., Khomh, F., & Tahar, S. (2015, August). Predicting scheduling failures in the cloud: A case study with Google clusters and Hadoop on Amazon EMR. In 2015 IEEE 17th International Conference on High-Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (pp. 58-65). IEEE.

Hamdi, H., Khemakhem, M., & Zaidan, A. (2016). Complementary Approaches Built as Web Service for Arabic Handwriting OCR Systems via Amazon Elastic MapReduce (EMR) Model. Accepted paper in the International Arab Journal of Information Technology IAJIT.

Chen, L., Li, R., Liu, Y., Zhang, R., & Woodbridge, D. M. K. (2017, August). Machine learning-based product recommendation using apache spark. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People, and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1-6). IEEE.

Bansal, S. K. (2014, June). Towards a semantic extract-transform-load (ETL) framework for big data integration. In 2014 IEEE International Congress on Big Data (pp. 522-529). IEEE.

Nellore, A., Wilks, C., Hansen, K. D., Leek, J. T., & Langmead, B. (2016). Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics, 32(16), 2551-2553.

Wadkar, S., & Siddalingaiah, M. (2014). Log Analysis Using Hadoop. In Pro Apache Hadoop (pp. 283-291). Apress, Berkeley, CA.

Chandra, S., Varde, A. S., & Wang, J. (2019, October). A Hive and SQL case study in cloud data analytics. In 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) (pp. 0112-0118). IEEE.
