Delivering Value from Big Data with Revolution R Enterprise and Hadoop
October 2013
Abstract
Businesses are rapidly investing in Hadoop to manage analytic data stores
due to its flexibility, scalability and relatively low cost. However, Hadoop’s
native tooling for advanced analytics is immature; this makes it difficult for
analysts to use without significant additional training, and limits the ability of
businesses to drive value from Big Data.
For some types of data, Hadoop has emerged as the clear platform of choice:
• Clickstream data, including site-side clicks and web media tags
• Sentiment data, including news feeds, blog feeds, social media comments and Twitter streams
• Telematics, such as vehicle tracking data
• Sensor and machine-generated data
• Geo tracking and location data
• Server and network logs
• Document and text repositories
• Digitized images, voice, video and other media

Enterprises struggle, however, to analyze data held in Hadoop [1]. While Hadoop is an excellent platform for managing large and complex data sets, its tooling for advanced analytics is much less mature than the analytics tooling available to enterprises today on other platforms.

Hadoop’s “native” advanced analytics project, Apache Mahout, is an eclectic mix of algorithms. Mahout is immature (currently in Release 0.8), and it includes a number of partially complete and unintegrated algorithms; many of these, according to project leadership, lack adequate developer support [2]. Mahout is hard to use, with limited documentation; it is only suitable for highly trained developers with strong backgrounds in Java and MapReduce.

Hosting the analytic software outside of Hadoop also creates a deployment problem, since custom programming may be required to implement a production scoring process. Moreover, data replication and movement can create data security issues: for “curated” data (e.g., clinical trials), the organization must recertify the data after replication or movement. This is time-consuming and expensive.

Some vendors have recently introduced software that runs on freestanding “Analysis” servers within the Hadoop cluster or runs in memory on the Hadoop data nodes. While deployment within the Hadoop cluster reduces latency from data movement, this approach is still only practical with smaller data sets. Analytics that load the entire data set into memory generally cannot run on the commodity hardware typically used as Hadoop data nodes.

As a result of the limited prebuilt analytics available today in Hadoop, much of the analytic work performed in Hadoop is “homegrown” or built with custom programming in MapReduce, Java, or other languages. This is a serious constraint for organizations seeking to leverage Hadoop for analytics because data scientists, the individuals with the necessary skills, are scarce and in high demand. Moreover, custom “homegrown” applications are expensive to build, maintain and support.
for Big Data, Big Analytics

Open source R is widely used by statisticians, data miners, actuaries, researchers and other professionals who need an agile and capable analytics platform. There are approximately two million R users in the workforce today, and the user base is rapidly growing. Surveys show that working analytics professionals prefer R to all other analytic software options [3,4,5].

There are many reasons behind R’s growing popularity, including its flexibility, rich graphics, broad ecosystem and comprehensive library of more than five thousand packages. However, since open source R runs in memory and lacks a native distributed computing framework, it is ill suited to working with large data sets.

[…] grids, data warehouse appliances and Hadoop. Through Revolution R Enterprise’s DeployR tools for enterprise integration, users can publish web reports to a broader audience, and organizations can enable two-way integration between Revolution R Enterprise and popular tools like Excel, QlikView and Tableau.

For business analysts who prefer a visual interface, Revolution R Enterprise couples with Alteryx to provide easy-to-use, high-performance analytics.

In brief, Revolution R Enterprise offers a scalable, high-performance platform for the rich capabilities of R. It offers true cross-platform integration together with a wide choice of user interfaces and deployment options.
Local: Revolution R Enterprise users submit individual commands or scripts through the R command line interface.

Remote: Revolution R Enterprise users working on a separate platform run a script that includes a command to transfer execution to a remote Hadoop cluster.

In both cases, Revolution R Enterprise running on an edge node acts as the master process. It executes the Revolution R Enterprise script locally, using threads, cores and sockets within the edge node server.

When the master process encounters a call to an algorithm in the ScaleR library, it distributes subsets of the work to data nodes in the Hadoop cluster. Once the data nodes finish their tasks, a single Reducer consolidates intermediate results and returns them to the master node. If the master node determines that the algorithm has completed its work, it returns the consolidated results to the script or command.

For iterative algorithms (such as logistic regression or k-means clustering), Revolution R Enterprise performs a convergence test and checks the number of iterations. If the tests succeed, Revolution R Enterprise returns consolidated results to the script or command; otherwise, it repeats the process.

1. Revolution R Enterprise running on the edge node encounters a call to a ScaleR algorithm.

2. The master process serializes instructions, stores them in an object and saves the object to a GUID file in HDFS (where the instructions are accessible to Mappers and Reducers).

3. Next, the master process prepares a job request for JobTracker, passing the location of the input data file and the processing instructions for Mappers and Reducers.

4. Once invoked, JobTracker spawns a potentially large set of TaskTrackers on each node, passing to each the GUID of the instructions file and the HDFS location of a single split of data.

5. Each TaskTracker invokes a Mapper, delivers the input data split location and the instruction file location, and awaits completion of the Mapper or Reducer. The Mapper consists of a Java wrapper, housing a complete instance of Revolution R Enterprise. Data splits roughly equate to a single HDFS file segment.

6. Each Mapper retrieves its instructions from a GUID file, ingests the designated data split, and performs the requested operations. Mappers operate independently of one another.

7. When each Mapper finishes its work, it produces an Intermediate Results Object, or IRO, in a key/value format. The Mapper serializes the IRO and stores it in a GUID file at a location passed to it by TaskTracker.
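
The control flow described above (serialized instructions, independent Mappers that each emit an intermediate results object, a single Reducer, and a convergence test in the master process) can be illustrated with a small in-process sketch. This is not the Revolution R Enterprise or ScaleR API, and it does not run on Hadoop: the names `mapper`, `reducer`, `master` and `SHARED_KEY`, the use of `pickle` for serialization, and the choice of one-dimensional k-means are all assumptions made for illustration.

```python
import pickle
from typing import Dict, List, Tuple

SHARED_KEY = "iro"  # hypothetical: every intermediate result uses the same key


def mapper(split: List[float], instructions: bytes) -> Tuple[str, bytes]:
    """Process one data split and emit a serialized intermediate result."""
    centers = pickle.loads(instructions)["centers"]
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in split:
        # Assign the point to its nearest current center.
        nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        sums[nearest] += x
        counts[nearest] += 1
    return SHARED_KEY, pickle.dumps({"sums": sums, "counts": counts})


def reducer(iros: List[bytes]) -> Dict[str, list]:
    """Single reducer: merge all intermediate results into one."""
    parts = [pickle.loads(b) for b in iros]
    k = len(parts[0]["sums"])
    return {
        "sums": [sum(p["sums"][j] for p in parts) for j in range(k)],
        "counts": [sum(p["counts"][j] for p in parts) for j in range(k)],
    }


def master(splits: List[List[float]], centers: List[float],
           tol: float = 1e-9, max_iter: int = 50) -> Tuple[List[float], int]:
    """Repeat map/reduce passes until convergence or the iteration limit."""
    for iteration in range(1, max_iter + 1):
        instructions = pickle.dumps({"centers": centers})
        # A real cluster would run these mappers in parallel, one per split.
        emitted = [mapper(s, instructions) for s in splits]
        # Stand-in for shuffle-sort: group values by key.
        grouped: Dict[str, List[bytes]] = {}
        for key, value in emitted:
            grouped.setdefault(key, []).append(value)
        assert len(grouped) == 1  # one shared key, so one reducer call
        total = reducer(grouped[SHARED_KEY])
        new_centers = [
            total["sums"][j] / total["counts"][j] if total["counts"][j] else centers[j]
            for j in range(len(centers))
        ]
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= tol:  # convergence test performed by the master
            break
    return centers, iteration
```

For example, `master([[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]], [0.0, 5.0])` treats each inner list as one data split. Because each mapper returns only per-cluster sums and counts, the data handed to the reducer stays small regardless of split size.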
16. Processing ends when the convergence test succeeds or the number of iterations reaches a user-defined limit.

All IROs produced by a particular job use the same key, causing subsequent shuffle-sort operations in Hadoop to consolidate all IROs on a single node, using a single Reducer instance. It is worth noting that IROs are tiny by comparison to input data, so this step is a performance improvement over use of combiners or other multiple-Reducer schemes to consolidate results.

Revolution R Enterprise makes Hadoop easier to use. Users can focus on what they do best, drawing insight from data, and do not need to learn parallel programming, distributed data management techniques, Java, MapReduce or HDFS. By making Hadoop available to the R community of users, Revolution R Enterprise enables organizations to tap into a large and rapidly growing talent pool.

Revolution R Enterprise’s Write Once Deploy Anywhere architecture offers flexibility and portability while helping organizations avoid vendor and
• Consider scheduling an in-depth presentation and perhaps an evaluation by contacting a member of our sales team at +1 (650) 646-9545 or through one of our offices listed on our website: www.revolutionanalytics.com.
• Meet with us face to face at one of the upcoming events listed on our website’s events page.
• Learn more about Revolution R Enterprise on our website’s product page.

FAQs

Q. What input sources does Revolution Analytics support for use with Revolution R Enterprise?
A. Revolution R Enterprise can read data from many sources using various connectors. Revolution R Enterprise ingests input data in parallel from HDFS when running inside Hadoop.

Second, users can run prediction functions to score data and write it back to HDFS.

Q. What formats can we use to return results?
A. Users can write model scores directly to CSV or XDF, publish models as PMML documents or pass R objects to a Revolution R Enterprise DeployR server for broader integration.

Q. Are there intermediate formats used?
A. Between Mappers and Reducers, Revolution R Enterprise passes data in standard Hadoop key/value pairs. These are stored in RAM and swapped to local storage if necessary.

Q. How many MapReduce jobs run?
A. Typically, one MapReduce job runs per algorithm called. That job may be iterative in the case of multi-pass algorithms like logistic regression or clustering.
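
To make the answer about returning model scores as CSV more concrete, the following is a minimal sketch of scoring rows with an already fitted logistic regression and writing the results as CSV. It does not use Revolution R Enterprise's prediction functions or the XDF format; `score_rows`, `write_scores_csv`, the column names and the coefficient values are hypothetical and chosen only for illustration.

```python
import csv
import io
import math
from typing import Dict, Iterable, Iterator, TextIO


def score_rows(rows: Iterable[Dict[str, str]], intercept: float,
               coefs: Dict[str, float]) -> Iterator[Dict[str, object]]:
    """Yield each input row with a logistic-regression score appended."""
    for row in rows:
        linear = intercept + sum(coefs[name] * float(row[name]) for name in coefs)
        yield {**row, "score": 1.0 / (1.0 + math.exp(-linear))}


def write_scores_csv(scored: Iterable[Dict[str, object]], out: TextIO) -> int:
    """Write scored rows as CSV with a header; return the number of rows written."""
    scored = list(scored)
    if not scored:
        return 0
    writer = csv.DictWriter(out, fieldnames=list(scored[0].keys()))
    writer.writeheader()
    writer.writerows(scored)
    return len(scored)


# Usage: two input rows, a single predictor "x" with an assumed coefficient of 1.0.
rows = [{"id": "a", "x": "0"}, {"id": "b", "x": "2"}]
buffer = io.StringIO()
written = write_scores_csv(score_rows(rows, intercept=0.0, coefs={"x": 1.0}), buffer)
```

The sketch writes to an in-memory buffer; any writable text file object could be passed in its place.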
Endnotes:
1. May 10, 2013, Hadoop Adoption Accelerates, But Not For Data Analytics,
http://readwrite.com/2013/05/10/hadoop-adoption-accelerates-but-not-for-what-you-might-think#awesm=~onhD75n7uuzq23
2. What is Apache Mahout?, http://mahout.apache.org/
3. September 02, 2013, Poll: R top language for data science three years running,
http://blog.revolutionanalytics.com/2013/09/top-languages-for-data-science.html
4. October 15, 2013, R usage skyrocketing: Rexer poll,
http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
5. May 2013, The Popularity of Data Analysis Software, http://r4stats.com/articles/popularity/
© 2013 Revolution Analytics, Inc. All rights reserved. Printed in the United States of America. This white paper is for informational purposes
only. REVOLUTION ANALYTICS MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Revolution R Enterprise, RRE,
Revolution Analytics, the Revolution Analytics and Revolution R Enterprise logos, and “Big Data, Big Analytics Platform” are trademarks
or registered trademarks of Revolution Analytics, Inc., in the United States and/or other countries. Other product names mentioned herein
may be trademarks of their respective companies.