Running Head: Elastic Mapreduce (Emr) 1
Student’s name
Institution
Introduction
group which mainly deals with the capability of processing and analyzing bulk data in a short
period and at the same time reduces the cost of the above operation. This big data platform uses
open-source tools to process data, for example Apache Hive, Apache Hudi, Presto, Apache HBase,
Apache Spark, and Apache Flink (Soualhia, Khomh, & Tahar, 2015). Amazon developed Elastic
MapReduce (EMR) to simplify data analysis and to reduce the time taken to process big data,
including the effort of tuning large clusters.
An evident example of how Elastic MapReduce (EMR) works is that when petabyte-scale data
was processed, the analysis was reported to be three times faster than the normal speed and to
cost the team half the normal price. The baseline for both price and speed was standard Apache
Spark. Elastic MapReduce (EMR) can be used by people like data scientists,
easy and simple to operate and use, as it leaves the user with only the task of analyzing
data. It handles other tasks such as launching and tuning clusters and managing Hadoop
architecture resources. Elastic MapReduce (EMR) is less costly to use: it cuts the cost by half
while completing the intended process faster than the standard time (Soualhia, Khomh, & Tahar,
2015). Elastic MapReduce (EMR) prices a process according to the EC2 instances deployed, their
type, and their location. The instances range from reserved to spot instances.
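The pricing criterion described above (number of EC2 instances, their type, their location, and whether they are reserved or spot) can be illustrated with a small calculation. The following is only a sketch: the hourly rates, the `InstanceGroup` class, and the `estimate_cost` function are hypothetical names and placeholder numbers chosen for illustration, not actual AWS prices or an AWS API.

```python
from dataclasses import dataclass

# Hypothetical hourly rates in USD, keyed by (region, instance type, purchase option).
# These numbers are placeholders for illustration only, not real AWS prices.
HOURLY_RATES = {
    ("us-east-1", "m5.xlarge", "reserved"): 0.12,
    ("us-east-1", "m5.xlarge", "spot"): 0.06,
}


@dataclass
class InstanceGroup:
    """A group of identical instances that run for the same number of hours."""
    region: str
    instance_type: str
    purchase_option: str
    count: int
    hours: float


def estimate_cost(groups):
    """Sum rate * instance count * hours over all groups in the cluster."""
    total = 0.0
    for group in groups:
        rate = HOURLY_RATES[(group.region, group.instance_type, group.purchase_option)]
        total += rate * group.count * group.hours
    return round(total, 2)


# A hypothetical cluster: 2 reserved instances plus 8 spot instances, 10 hours each.
cluster = [
    InstanceGroup("us-east-1", "m5.xlarge", "reserved", count=2, hours=10),
    InstanceGroup("us-east-1", "m5.xlarge", "spot", count=8, hours=10),
]
print(estimate_cost(cluster))
```

Under these assumed rates, moving instances from one purchase option to another changes the total directly, which is the sense in which the instance mix determines the price of a process.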
Elastic MapReduce (EMR) is elastic and can handle different scales of data by using a
mechanism that adds or removes instances manually or automatically as the amount of data to be
analyzed changes. This also helps to reduce the cost, since you are charged only for what you
have used, as data is loaded and processed independently. Elastic MapReduce (EMR) is reliable in
that the system recovers from downtime automatically by fixing bugs and errors through periodic
automatic downloads. It also recovers and retrieves terminated instances when properly
configured to do so. Elastic MapReduce (EMR) also adds safety and security to the data and the
process itself. It controls network access through the firewall and encrypts data to ensure the
security of both the clusters and the data itself.
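The idea of adding or removing instances as the data volume changes can be sketched as a simple sizing rule. This is a minimal illustration under stated assumptions, not the scaling policy Elastic MapReduce (EMR) actually applies: the function name `target_instance_count`, the per-instance capacity of 50 GB, and the minimum and maximum bounds are all hypothetical.

```python
import math


def target_instance_count(pending_gb, gb_per_instance=50, min_instances=1, max_instances=20):
    """Return how many instances a cluster should run for the pending data.

    All parameters are hypothetical: gb_per_instance is an assumed capacity,
    and the min/max bounds stand in for limits a user might configure.
    """
    needed = math.ceil(pending_gb / gb_per_instance)
    # Clamp to the configured bounds so the cluster never shrinks to zero
    # or grows past the allowed maximum.
    return max(min_instances, min(max_instances, needed))


for pending in (0, 420, 5000):
    print(pending, "GB ->", target_instance_count(pending), "instances")
```

Because the count follows the pending data, a small workload keeps the cluster at its minimum size and a large one is capped at the maximum, which is how paying only for what is used follows from elasticity.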
Elastic MapReduce (EMR) is mainly used for adding safety while carrying out data
processing, especially data analysis. Its use cases include machine learning and deep learning,
log analysis, genomics, fraud detection, data transformations, bioinformatics, and financial and
scientific simulation (Hamdi, Khemakhem, & Zaidan, 2016). Below, some of these use cases are
discussed.
In today's world, artificial intelligence is growing rapidly and has diversified into fields
like neural networks, robotics, machine learning, and deep learning. Machine learning is a
branch of artificial intelligence that deals with how computers and digital devices can gain
knowledge and learn much as human beings do when exposed to
Elastic MapReduce (EMR) is used in machine learning to support tools such as TensorFlow
and Apache tools. The Elastic MapReduce (EMR) platform comes with preinstalled algorithms,
mainly to aid machine learning (Chen et al., 2017). When data is analyzed with the help of the
machine learning built into an Elastic MapReduce (EMR) platform, the system can make
predictions and become intelligent as it progresses from
libraries that a user needs during data analysis through machine learning. They include Apache
Spark MLlib, TensorFlow, Apache MXNet, bootstrap actions, and custom AMIs.
Extract Transform Load (ETL) merges three data processes, namely data extraction, data
transformation, and data loading. It involves acquiring data from different data sources,
transforming it through processes like adding, removing, calculating, and appending different
data, and finally loading the data into a storage repository like a data warehouse (Bansal,
2014). This process is limited by data integration problems: data may be named wrongly,
different data may be referred to by the same name, data is only found in one
Extract Transform Load (ETL) uses various tools to ensure that it can extract, transform,
and load data correctly, faster, and at less cost. The tools include Oracle, MarkLogic, and
Amazon Redshift. Amazon Redshift is an AWS data warehouse service that can be used alongside
Elastic MapReduce (EMR) and has the functionality of sorting, joining, and finding totals and
averages for big data. It uses SQL and BI tools to analyze big data, including petabyte-scale
data that would otherwise require complex processes when querying the data warehouse system.
Amazon Redshift's main benefits are that it is simple to use and cheap in terms of analysis
cost.
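The three ETL steps, followed by the kind of SQL totals and averages mentioned above, can be sketched on a very small scale. In this illustration an in-memory SQLite database stands in for the data warehouse (it is not Amazon Redshift), and the two sources, their field names, and the `sales` table are hypothetical. The sources deliberately name the same fields differently, which is the naming problem the transform step has to resolve.

```python
import sqlite3

# Extract: two hypothetical sources that name the same fields differently.
source_a = [{"customer": "ann", "amount": "12.50"}, {"customer": "bob", "amount": "7.25"}]
source_b = [{"cust_name": "Ann ", "total": 30.0}]


def transform(rows_a, rows_b):
    """Normalize names and types so that both sources share one schema."""
    rows = []
    for row in rows_a:
        rows.append((row["customer"].strip().lower(), float(row["amount"])))
    for row in rows_b:
        rows.append((row["cust_name"].strip().lower(), float(row["total"])))
    return rows


def load(rows):
    """Load the cleaned rows into an in-memory SQLite table (warehouse stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load(transform(source_a, source_b))
# Query the loaded data with SQL: total and average amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount), AVG(amount) FROM sales "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The same extract, transform, load, then query shape applies at warehouse scale; only the storage engine and the data volume differ.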
Genomics refers to the study of the genomes of organisms through recombinant DNA
techniques, DNA sequencing methods, and the bioinformatics of sequences (Nellore et al., 2016).
Genomics assembles various genomes and analyzes them according to their functions. One of the
main limitations in genomics is how to process the bulk data generated and how to make genome
sequencing, mapping, and bioinformatics analyses fast and less costly. One solution to this
problem is the use of Amazon Web Services, mainly the Elastic MapReduce (EMR) platform.
Elastic MapReduce (EMR) enables genomic experts to process large amounts of data
easily, quickly, and efficiently, and at a reduced cost. It should be noted that genomics deals
with a large amount of data acquired through interactions between loci and alleles. Genomics is
a resource-intensive field of study, and Amazon Web Services (AWS) helps researchers by letting
them extract genomic data easily and at no cost.
It is mandatory for any information system or networked system to be able to record and
store log files of all the activities carried out and the time they were done. Log processing
refers to the process of analyzing the log files of a system and coming up with valuable
information that can be used to monitor the system and detect anomalies that may indicate
attacks such as fraud. Log analytics uses various tools like Splunk, Retrace, Logentries,
LogMatics, SumoLogic, Graylog, and Logstash (Wadkar & Siddalingaiah, 2014). The tools range from
commercial to open source.
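The kind of log processing described above, reading log entries and flagging anomalies that may point to an attack, can be sketched with the standard library alone. The log format, the sample entries, and the threshold of three failed logins are hypothetical choices for this illustration; none of the tools listed above is used here.

```python
from collections import Counter

# Hypothetical log lines in the form "<timestamp> <user> <action> <status>".
LOG_LINES = [
    "2024-01-01T10:00:00 alice login ok",
    "2024-01-01T10:00:05 mallory login failed",
    "2024-01-01T10:00:06 mallory login failed",
    "2024-01-01T10:00:07 mallory login failed",
    "2024-01-01T10:01:00 bob login failed",
    "2024-01-01T10:02:00 bob login ok",
]


def failed_logins(lines):
    """Count failed login attempts per user, one log line at a time."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines instead of failing
        _, user, action, status = parts
        if action == "login" and status == "failed":
            counts[user] += 1
    return counts


def flag_anomalies(counts, threshold=3):
    """Return users whose failed-login count reaches the assumed threshold."""
    return sorted(user for user, n in counts.items() if n >= threshold)


print(flag_anomalies(failed_logins(LOG_LINES)))
```

Because each line is handled independently, the same counting logic can be applied to a stream of entries as they arrive rather than to a complete file.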
Elastic MapReduce (EMR) provides a platform to conduct log analysis more easily and
without physically having a server, through the cloud computing services of Amazon Web Services
(AWS). Amazon uses a tool called Amazon Kinesis for log processing (analysis specifically).
Amazon Kinesis is easy to use and has a fast and reliable log processing capability. Its main
functionality is that it can analyze log file data stream by stream as it is being acquired or
collected, without having to wait for the whole data set to be processed. Amazon Kinesis
collects, processes, and analyzes data in real time through the
Amazon Elastic MapReduce (EMR) uses the Hadoop framework to process large data in
a distributed computing environment containing virtual servers. It helps to process the log
files of a given system or network bit by bit without having to wait for other logs. This makes
it fast and cost-saving. When log files are processed, fraud can be detected through techniques
such as anomaly-based detection and signature-based detection. Elastic MapReduce (EMR) can
be used to analyze logs to find the usage of files, access, and the web surfing history of users.
In business terms, this information would be used to identify business insights that are related to
Event analysis or clickstream analysis refers to the process of collecting, analyzing, and
reporting information concerning web pages, that is, the time a page was visited, the frequency
of visits, and the resources accessed or activities carried out while a user was on a particular
web page. The information obtained from clickstream analysis is used to understand customers'
needs and to come up with effective solutions like developing relevant ads keeping in mind you
Amazon Elastic MapReduce (EMR) uses Apache Hive, Apache Spark, and StreamSets
(Chandra, Varde, & Wang, 2019). Modern data integration is applied where unstructured and
semi-structured data is analyzed by Redshift in a simpler manner and at a faster rate.
Clickstream data is analyzed through data visualization techniques. For example, the following
are some of the data
Session-wise analysis: the number of people who visited a site.
Client-wise analysis: the number of sessions created per person.
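The two analyses listed above can be computed from raw click events in a few lines. This is a sketch under assumptions: the event layout `(client_id, session_id, page)`, the sample events, and the function names `session_wise` and `client_wise` are hypothetical, and plain Python stands in for the Hive or Spark queries that would be used at scale.

```python
from collections import defaultdict

# Hypothetical clickstream events: (client_id, session_id, page).
EVENTS = [
    ("c1", "s1", "/home"),
    ("c1", "s1", "/cart"),
    ("c1", "s2", "/home"),
    ("c2", "s3", "/home"),
]


def session_wise(events):
    """Session-wise analysis as defined above: number of people who visited the site."""
    return len({client for client, _, _ in events})


def client_wise(events):
    """Client-wise analysis: number of distinct sessions created per person."""
    sessions = defaultdict(set)
    for client, session, _ in events:
        sessions[client].add(session)
    return {client: len(ids) for client, ids in sessions.items()}


print(session_wise(EVENTS))
print(client_wise(EVENTS))
```

Counting distinct identifiers rather than raw events keeps repeated page views within one session from inflating either figure.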
When systems take longer than expected, they become tiresome and unreliable. This
results in the formation of queues, which would mean that some hardware or resources
must be added to solve the delay. Elastic MapReduce (EMR) solves this by ensuring that
data analysis is fast and is done bit by bit according to the scale of the collected
information. Elastic MapReduce (EMR) processes larger datasets in half the standard
time.
Currently, cloud computing is on the rise and is solving the problems encountered by
server infrastructures by offering cloud services through Amazon Web Services, which
can host servers that are easily accessed remotely from any region.
The Hadoop platform comes in two forms: in-house (on-premise) Hadoop and cloud-hosted
Hadoop. In-house Hadoop is associated with many limitations because it is physical, such
as node failures and high costs of installation, operation, upgrading, and maintenance.
Elastic MapReduce (EMR) provides a cloud-hosted Hadoop that solves the above problems by
certain analysis tasks that only require a specific resource. Elastic MapReduce (EMR)
utilizes resources efficiently by loading the libraries needed for a specific action only
References
Soualhia, M., Khomh, F., & Tahar, S. (2015, August). Predicting scheduling failures in the cloud: A case
study with google clusters and Hadoop on Amazon EMR. In 2015 IEEE 17th International Conference on
Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software
Hamdi, H., Khemakhem, M., & Zaidan, A. (2016). Complementary Approaches Built as Web Service for
Arabic Handwriting OCR Systems via Amazon Elastic MapReduce (EMR) Model. Accepted paper in the
Chen, L., Li, R., Liu, Y., Zhang, R., & Woodbridge, D. M. K. (2017, August). Machine learning-based
product recommendation using apache spark. In 2017 IEEE SmartWorld, Ubiquitous Intelligence &
Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data
Bansal, S. K. (2014, June). Towards a semantic extract-transform-load (ETL) framework for big data
Nellore, A., Wilks, C., Hansen, K. D., Leek, J. T., & Langmead, B. (2016). Rail-dbGaP: analyzing dbGaP-
Wadkar, S., & Siddalingaiah, M. (2014). Log Analysis Using Hadoop. In Pro Apache Hadoop (pp. 283-
Chandra, S., Varde, A. S., & Wang, J. (2019, October). A Hive and SQL case study in cloud data
analytics. In 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication