
Executive White Paper

Delivering Value
from Big Data with
Revolution R Enterprise
and Hadoop

Bill Jacobs, Director of Product Marketing


Thomas W. Dinsmore, Director of Product Management

October 2013
Abstract
Businesses are rapidly investing in Hadoop to manage analytic data stores
due to its flexibility, scalability and relatively low cost. However, Hadoop’s
native tooling for advanced analytics is immature; this makes it difficult for
analysts to use without significant additional training, and limits the ability of
businesses to drive value from Big Data.

Revolution R Enterprise leverages open source R, the standard platform for
modern analytics. R has a thriving community of more than two million users
who are trained and ready to deliver results.

Revolution R Enterprise runs in Hadoop. It enables users to perform
advanced analysis on Big Data while working exclusively in the familiar R
environment. The software transparently distributes work across the nodes
and distributed data of Hadoop without complex programming.

By leveraging Revolution R Enterprise in Hadoop, organizations tap a large
and growing community of analysts and all of the capabilities in R with a true
cross-platform and open standards architecture.

Hadoop for Advanced Analytics

Over the next three years, IT industry analyst IDC expects spending on Apache
Hadoop for data management to grow at an average annual rate of 60%. Hadoop is
an open source software framework for distributed data management; it supports
a programming model (MapReduce) and a file system (HDFS) that work together to
provide a data management platform that is flexible, highly scalable, and
relatively inexpensive to implement.

For some types of data, Hadoop has emerged as the clear platform of choice:
• Clickstream data, including site-side clicks and web media tags
• Sentiment data, including news feeds, blog feeds, social media comments and
  Twitter streams
• Telematics, such as vehicle tracking data
• Sensor and machine-generated data
• Geo tracking and location data
• Server and network logs
• Document and text repositories
• Digitized images, voice, video and other media

Enterprises struggle, however, to analyze data held in Hadoop [1]. While Hadoop
is an excellent platform for managing large and complex data sets, its tooling
for advanced analytics is much less mature than the analytics tooling available
to enterprises today on other platforms.

Hadoop’s “native” advanced analytics project, Apache Mahout, is an eclectic mix
of algorithms. Mahout is immature (currently in Release 0.8), and it includes a
number of partially complete and unintegrated algorithms; many of these,
according to project leadership, lack adequate developer support [2]. Mahout is
hard to use, with limited documentation; it is only suitable for highly trained
developers with strong backgrounds in Java and MapReduce.

Conventional server-based analytic software packages from vendors such as SAS
or SPSS typically offer the ability to connect with Hadoop and extract data
over a network, but they do not run inside Hadoop. Using server-based software
with Hadoop is suitable for users working with small amounts of data, or who
are willing to work with small samples of the data. However, this architecture
is impractical when source data approaches terabyte scale, and sampling is not
acceptable for some analytic applications.

Hosting the analytic software outside of Hadoop also creates a deployment
problem, since custom programming may be required to implement a production
scoring process. Moreover, data replication and movement can create data
security issues: for “curated” data (e.g., clinical trials), the organization
must recertify the data after replication or movement. This is time consuming
and expensive.

Some vendors have recently introduced software that runs on freestanding
“Analysis” servers within the Hadoop cluster or runs in memory on the Hadoop
data nodes. While deployment within the Hadoop cluster reduces latency from
data movement, this approach is still only practical with smaller data sets.
Analytics that load the entire data set into memory generally cannot run on the
commodity hardware typically used as Hadoop data nodes.

As a result of the limited prebuilt analytics available today in Hadoop, much
of the analytic work performed in Hadoop is “homegrown,” or built with custom
programming in MapReduce, Java, or other languages. This is a serious
constraint for organizations seeking to leverage Hadoop for analytics because
data scientists, the individuals with the necessary skills, are scarce and in
high demand. Moreover, custom “homegrown” applications are expensive to build,
maintain and support.

DELIVERING VALUE FROM BIG DATA © 2013 REVOLUTION ANALYTICS, INC. 3


There is a pressing need for an analytics platform that runs inside Hadoop,
handles data scale in the terabytes, offers comprehensive support for advanced
analytics, and is accessible to a broad user base without extensive retraining.

Revolution R Enterprise for Big Data, Big Analytics

Open source R is widely used by statisticians, data miners, actuaries,
researchers and other professionals who need an agile and capable analytics
platform. There are approximately two million R users in the workforce today,
and the user base is rapidly growing. Surveys [3,4,5] show that working
analytics professionals prefer R to all other analytic software options.

There are many reasons behind R’s growing popularity, including its
flexibility, rich graphics, broad ecosystem and comprehensive library of more
than five thousand packages. However, since open source R runs in memory and
lacks a native distributed computing framework, it is ill suited to working
with large data sets.

Revolution R Enterprise (RRE), a product from Revolution Analytics, is a Big
Data, Big Analytics Platform that supports open source R. Revolution R
Enterprise extends an enhanced and supported version of R with a distributed
computing framework, scalable algorithms, Big Data connectivity tools, an
integrated development environment and enterprise deployment tools.

[Figure: Revolution R Enterprise components: DevelopR, DeployR, R+CRAN,
ConnectR, RevoR, ScaleR, DistributedR]

Revolution R Enterprise’s ScaleR package of high-performance analytic
algorithms uses Parallel External Memory Algorithms (PEMAs) to make open source
R scalable. PEMAs support operations on data far larger than available memory
through a combination of streaming, parallel processing on multiple cores and
distributed processing across multiple nodes.

Revolution R Enterprise’s Write Once Deploy Anywhere (WODA) architecture
enables users to develop content on one platform and deploy it on any other
supported platform: workstations, servers, grids, data warehouse appliances and
Hadoop.

Through Revolution R Enterprise’s DeployR tools for enterprise integration,
users can publish web reports to a broader audience, and organizations can
enable two-way integration between Revolution R Enterprise and popular tools
like Excel, QlikView and Tableau.

For business analysts who prefer a visual interface, Revolution R Enterprise
couples with Alteryx to provide easy-to-use, high-performance analytics.

In brief, Revolution R Enterprise offers a scalable, high-performance platform
for the rich capabilities of R. It offers true cross-platform integration
together with a wide choice of user interfaces and deployment options.
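
The external-memory idea behind PEMAs can be illustrated with a small sketch.
The Python below is not ScaleR code and does not use any Revolution R
Enterprise API; it only assumes that a statistic can be computed from small
per-chunk summaries that are then merged. All names (Moments, read_chunks,
summarize, mean_and_variance) are hypothetical, and an in-memory sequence
stands in for data streamed from disk.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Moments:
    """Per-chunk summary whose size does not depend on the chunk size."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other: "Moments") -> "Moments":
        # Summaries combine by addition, so chunks may be processed in any
        # order and, in principle, on different cores or nodes.
        return Moments(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)


def read_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for streaming blocks of rows from storage."""
    chunk: List[float] = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def summarize(chunk: List[float]) -> Moments:
    """Reduce one chunk to count, sum and sum of squares."""
    return Moments(len(chunk), sum(chunk), sum(v * v for v in chunk))


def mean_and_variance(values: Iterable[float],
                      chunk_size: int = 1000) -> Tuple[float, float]:
    """Only one chunk plus one small summary is held in memory at a time."""
    acc = Moments()
    for chunk in read_chunks(values, chunk_size):
        acc = acc.merge(summarize(chunk))
    mean = acc.total / acc.n
    # Population variance from raw moments; simple, but less numerically
    # stable than a Welford-style update on badly scaled data.
    variance = acc.total_sq / acc.n - mean * mean
    return mean, variance


if __name__ == "__main__":
    print(mean_and_variance(range(1, 11), chunk_size=3))
```

Because the summaries merge by addition, the same structure extends from
streaming on one machine to parallel cores and distributed nodes, which is the
combination the PEMA description above refers to.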



Revolution R Enterprise in Hadoop

Revolution Analytics is a pioneer in analytics for Hadoop. In 2011, Revolution
Analytics introduced the RHadoop open source project, one of the most
frequently downloaded projects on GitHub. Revolution R Enterprise supports
interfaces to MapReduce, HDFS and HBase, powering production analytics for
enterprise customers:
• A life sciences company uses Revolution R Enterprise and Hadoop to deliver
  recommended treatments to its customers based on user-specified details of
  planned usage.
• A leading US bank has developed an innovative prediction engine based on
  Revolution R Enterprise and Hadoop.
• A marketing sciences company analyzes billions of cookies and session records
  with a cloud-enabled solution of Hadoop and Revolution R Enterprise.

Revolution R Enterprise does not simply connect to Hadoop; it runs inside
Hadoop, providing users with direct access to data stored in Hadoop from the
familiar R interface. To do this, Revolution R Enterprise uses MapReduce to
distribute computational workload across the nodes of a Hadoop cluster. This
section includes a detailed explanation of how this works.

Deployment Architecture

Figure 1 shows Revolution R Enterprise deployed for operations inside Hadoop.
The diagram shows Revolution R Enterprise deployed on each node in the Hadoop
cluster, with one node designated as the Revolution R Enterprise master node.
The diagram also shows Revolution R Enterprise deployed on one or more desktops
or servers outside of the Hadoop cluster. This is optional, but useful as a
development environment or as a platform for analytics with small data sets.

Revolution R Enterprise’s DeployR interface supports the web services
connection shown in Figure 1. The Corporate Applications shown in the diagram
can include interactive interfaces to popular BI platforms (such as Jaspersoft,
QlikView or Tableau), end-user tools like Microsoft Excel and Alteryx, custom
web-based reports, application servers and rules engines.

Figure 1: Leveraging inherent parallelism in Hadoop to accelerate Big Data analytics
(diagram regions: Hadoop Cluster, Desktops & Servers, Corporate Applications)



Execution in Hadoop: High-Level Flow

Users invoke Revolution R Enterprise in Hadoop in one of two ways: local or
remote.

Local: Revolution R Enterprise users submit individual commands or scripts
through the R command line interface.

Remote: Revolution R Enterprise users working on a separate platform run a
script that includes a command to transfer execution to a remote Hadoop
cluster.

In both cases, Revolution R Enterprise running on an edge node acts as the
master process. It executes the Revolution R Enterprise script locally, using
threads, cores and sockets within the edge node server.

When the master process encounters a call to an algorithm in the ScaleR
library, it distributes subsets of the work to data nodes in the Hadoop
cluster. Once the data nodes finish their tasks, a single Reducer consolidates
intermediate results and returns them to the master node. If the master node
determines that the algorithm has completed its work, it returns the
consolidated results to the script or command.

For iterative algorithms (such as logistic regression or k-means clustering),
Revolution R Enterprise performs a convergence test and checks the number of
iterations. If the tests succeed, Revolution R Enterprise returns consolidated
results to the script or command; otherwise, it repeats the process.

Execution in Hadoop: Detailed Flow

This section details execution steps for Revolution R Enterprise running inside
Hadoop.

1. Revolution R Enterprise running on the edge node encounters a call to a
   ScaleR algorithm.

2. The master process serializes instructions, stores them in an object and
   saves the object to a GUID file in HDFS (where the instructions are
   accessible to Mappers and Reducers).

3. Next, the master process prepares a job request for JobTracker, passing the
   location of the input data file and the processing instructions for Mappers
   and Reducers.

4. Once invoked, JobTracker spawns a potentially large set of TaskTrackers on
   each node, passing to each the GUID of the instructions file and the HDFS
   location of a single split of data.

5. Each TaskTracker invokes a Mapper, delivers input data split location and
   instruction file location, and awaits completion of the Mapper or Reducer.
   The Mapper consists of a Java wrapper, housing a complete instance of
   Revolution R Enterprise. Data splits roughly equate to a single HDFS file
   segment.

6. Each Mapper retrieves its instructions from a GUID file, ingests the
   designated data split, and performs the requested operations. Mappers
   operate independently of one another.

7. When each Mapper finishes its work, it produces an Intermediate Results
   Object, or IRO, in a key-value format. The Mapper serializes the IRO and
   stores it in a GUID file at a location passed to it by TaskTracker.



8. As Mappers begin to complete, Hadoop’s shuffle-sort sequentially moves all
   IROs to the Reducer’s host node.

9. As IROs begin to arrive, TaskTracker invokes the Reducer and passes it the
   GUID of the instructions file directing the job.

10. The Reducer reads the instructions file and proceeds to consolidate the
    inbound IROs into a single consolidated IRO.

11. When there are no remaining IROs, the Reducer saves the consolidated IRO to
    HDFS in a location directed by the master process.

12. The Reducer exits, its TaskTracker exits, and the master process assumes
    control.

13. Upon return of control, the master process retrieves and evaluates the
    consolidated IRO to see if the process has converged.

14. If the convergence test succeeds, the master process returns a final
    results object to the Revolution R Enterprise command or program.

15. If the convergence test does not succeed, the master process repeats steps
    2 through 13 with new instructions.

16. Processing ends when the convergence test succeeds or the number of
    iterations reaches a user-defined limit.

All IROs produced by a particular job use the same key, causing subsequent
shuffle-sort operations in Hadoop to consolidate all IROs on a single node,
using a single Reducer instance. It is worth noting that IROs are tiny by
comparison to input data, so this step is a performance improvement over use of
combiners or other multiple-Reducer schemes to consolidate results.

Revolution R Enterprise and Hadoop: The Benefits

Revolution R Enterprise brings tools to Hadoop that enable organizations to
deliver new value from Big Data:
• Understand site-side customer behavior.
• Optimize search term selection.
• Measure the impact of web media campaigns.
• Track brand perceptions among current and prospective customers.
• Track driving behavior of policyholders.
• Identify anomalies in fleet vehicle usage.
• Predict machine failures before they happen.
• Drive location-specific marketing offers.
• Detect unusual behavior in secure systems.
• Predict network outages.
• Identify plagiarized documents.

With Revolution R Enterprise, data remains in place in Hadoop; users perform
analysis without data movement. This dramatically reduces the total cycle time
to build and deploy predictive models, and eliminates potential security issues
from data movement or replication.

Revolution R Enterprise makes Hadoop easier to use. Users can focus on what
they do best, drawing insight from data, and do not need to learn parallel
programming, distributed data management techniques, Java, MapReduce or HDFS.
By making Hadoop available to the R community of users, Revolution R Enterprise
enables organizations to tap into a large and rapidly growing talent pool.

Revolution R Enterprise’s Write Once Deploy Anywhere architecture offers
flexibility and portability while helping organizations avoid vendor and
platform “lock-in.” Users can build and test projects on a workstation, then
deploy them on a production system essentially without change. Organizations
can easily re-host projects for any reason: better performance, better
economics, a move to the cloud, or simply convenience.

Revolution R Enterprise brings rich analytical and statistical computation at
scale to Hadoop and enables data analytics teams to deliver shorter, less
costly and lower-risk projects while achieving far higher performance, data
scale and portability than can be achieved by other methods.

Take the Next Step

If you are seeking easy-to-use solutions for Big Data, Big Analytics and are
considering Hadoop:
• Consider scheduling an in-depth presentation and perhaps an evaluation by
  contacting a member of our sales team at +1 (650) 646-9545 or through one of
  our offices listed on our website: www.revolutionanalytics.com.
• Meet with us face to face at one of the upcoming events listed on our
  website’s events page.
• Learn more about Revolution R Enterprise on our website’s product page.

FAQs

Q. What input sources does Revolution Analytics support for use with
   Revolution R Enterprise?
A. Revolution R Enterprise can read data from many sources using various
   connectors. Revolution R Enterprise ingests input data in parallel from HDFS
   when running inside Hadoop.

Q. What data input formats can we use?
A. For Hadoop users, Revolution R Enterprise supports text files in CSV or
   fixed width file formats. For data residing in other systems, Revolution R
   Enterprise 7 supports many other formats including both SAS and SPSS files
   plus a fast proprietary format called XDF. XDF is an efficient binary
   serialization format used by Revolution products to accelerate data access
   and manipulation.

Q. Where can data be output?
A. There are two options available to Hadoop users. First, Revolution R
   Enterprise returns data in R data objects to the calling program or
   application. These objects are available for further processing and export,
   or can be published through the DeployR interface in a variety of formats.
   Second, users can run prediction functions to score data and write it back
   to HDFS.

Q. What formats can we use to return results?
A. Users can write model scores directly to CSV or XDF, publish models as PMML
   documents or pass R objects to a Revolution R Enterprise DeployR server for
   broader integration.

Q. Are there intermediate formats used?
A. Between Mappers and Reducers, Revolution R Enterprise passes data in
   standard Hadoop key/value pairs. These are stored in RAM and swapped to
   local storage if necessary.

Q. How many MapReduce jobs run?
A. Typically, one MapReduce job runs per algorithm called. That job may be
   iterative in the case of multi-pass algorithms like logistic regression or
   clustering.
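
The iterative pattern in this answer, and in the Detailed Flow steps earlier in
this paper, can be sketched in a few lines. The Python below is only an
illustration under stated assumptions: it is not ScaleR code or any Revolution
R Enterprise API, it runs in a single process rather than on a cluster, and the
names mapper, reducer and master, along with the one-dimensional k-means
example, are hypothetical. It shows Mappers emitting small intermediate results
objects (IROs), a single Reducer consolidating them, and a master loop that
stops on convergence or at an iteration limit.

```python
from typing import Dict, List, Tuple

# Hypothetical IRO layout: cluster index -> (sum of points, count of points).
IRO = Dict[int, Tuple[float, int]]


def mapper(split: List[float], centers: List[float]) -> IRO:
    """Process one data split independently and emit a small summary."""
    iro: IRO = {}
    for x in split:
        nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        s, n = iro.get(nearest, (0.0, 0))
        iro[nearest] = (s + x, n + 1)
    return iro


def reducer(iros: List[IRO]) -> IRO:
    """Single Reducer: merge every Mapper's IRO into one consolidated IRO."""
    merged: IRO = {}
    for iro in iros:
        for k, (s, n) in iro.items():
            ms, mn = merged.get(k, (0.0, 0))
            merged[k] = (ms + s, mn + n)
    return merged


def master(splits: List[List[float]], centers: List[float],
           tol: float = 1e-6, max_iter: int = 50) -> Tuple[List[float], int]:
    """Master loop: run a pass, test convergence, repeat if needed."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        iros = [mapper(split, centers) for split in splits]  # map phase
        merged = reducer(iros)                               # reduce phase
        new_centers = [
            merged[k][0] / merged[k][1] if k in merged else centers[k]
            for k in range(len(centers))
        ]
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:  # convergence test succeeded
            break
    return centers, iteration


if __name__ == "__main__":
    data_splits = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0], [2.0, 11.0]]
    print(master(data_splits, centers=[0.0, 5.0]))
```

Each IRO here holds one pair of numbers per cluster regardless of how many
records a split contains, which is why consolidating them in one Reducer is
cheap relative to reading the input data.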



Q. How many actual Mapper and Reducer instances run?
A. Revolution R Enterprise schedules a variable number of Mappers depending on
   the size of the data set. For each split of input data, Revolution R
   Enterprise schedules one Mapper. Splits roughly approximate the block size
   setting of the HDFS file system instance, typically either 64 MB or 128 MB.
   The HDFS Administrator configures the HDFS block size.

Q. Why run only a single Reducer or no Reducer at all?
A. Unlike many MapReduce-based applications, ScaleR and Revolution R Enterprise
   typically use a single Reducer and, in some cases, such as scoring data, no
   Reducer at all. Reducers run only if needed to return data to the master
   process on the edge node; otherwise, Revolution R Enterprise skips this
   step.

Q. What happens if a Revolution R Enterprise job fails while running in Hadoop?
A. When running ScaleR algorithms, Revolution R Enterprise may invoke a large
   number of Mappers to accomplish required calculations. The results of each
   map instance are persisted to memory and files, allowing Hadoop to
   reschedule only failed Mapper instances in the event of incomplete
   execution.

Q. How do we run model scoring?
A. Revolution R Enterprise offers users two options for model scoring.
   First, users may run scoring natively using the rxPredict function. This
   function takes a model object produced by any of the ScaleR predictive
   modeling functions and applies it to new data. rxPredict computes the model
   score for each new record and writes results to HDFS.
   As an alternative, users may export model documents using Predictive Model
   Markup Language (PMML). Organizations can deploy PMML model documents into
   any PMML-compliant database or decision engine, or through a software
   abstraction layer such as Concurrent Inc.’s Cascading software.
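
Two points from the answers above lend themselves to a small sketch: the
Mapper count follows from the data size and the HDFS block size, and scoring
can run as a map-only job because each split is scored on its own. The Python
below is a hedged illustration, not rxPredict or any Revolution R Enterprise
API. The function names, the logistic model form and the assumption that splits
equal HDFS blocks exactly are all hypothetical simplifications.

```python
import math
from typing import Dict, List


def estimated_mappers(file_size_bytes: int, block_size_bytes: int) -> int:
    """One Mapper per split, assuming each split is exactly one HDFS block."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def score_record(coefs: Dict[str, float], intercept: float,
                 record: Dict[str, float]) -> float:
    """Logistic score for one record (model form assumed for illustration)."""
    z = intercept + sum(coefs[name] * record[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


def score_split(coefs: Dict[str, float], intercept: float,
                split: List[Dict[str, float]]) -> List[float]:
    """Map-only task: score one split. Each split's output stands alone,
    so nothing needs to be consolidated by a Reducer."""
    return [score_record(coefs, intercept, r) for r in split]


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    block = 128 * 1024 ** 2
    print("mappers:", estimated_mappers(ten_gib, block))
    splits = [[{"x": 0.5}], [{"x": 2.0}, {"x": -1.0}]]
    for split in splits:
        print(score_split({"x": 2.0}, -1.0, split))
```

In a real cluster the per-split outputs would be written back to HDFS by each
task; here they are simply returned as lists.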

Endnotes:
1. May 10, 2013, Hadoop Adoption Accelerates, But Not For Data Analytics,
http://readwrite.com/2013/05/10/hadoop-adoption-accelerates-but-not-for-what-you-might-think#awesm=~onhD75n7uuzq23
2. What is Apache Mahout?, http://mahout.apache.org/
3. September 02, 2013, Poll: R top language for data science three years running,
http://blog.revolutionanalytics.com/2013/09/top-languages-for-data-science.html
4. October 15, 2013, R usage skyrocketing: Rexer poll,
http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
5. May 2013, The Popularity of Data Analysis Software, http://r4stats.com/articles/popularity/



About Revolution Analytics, Inc.
As the leading commercial provider of software and services based on the open source R project for
statistical computing—and headquartered in Mountain View, California—Revolution Analytics brings Big
Data scalability, performance, and cross-platform enterprise readiness to R, the world’s most widely
used statistics software. The company’s flagship Revolution R Enterprise software is designed to
meet the production needs of leading organizations in data-driven industries including finance, retail,
manufacturing, and digital media. Used by over 2 million data scientists in academia and at cutting-
edge organizations such as Google, Lloyd’s of London and the US Food and Drug Administration,
R is the standard of innovation in statistical analysis. Revolution Analytics is committed to supporting
the continued growth of the R community by sponsoring R user groups and conferences worldwide,
and offers free licenses for Revolution R Enterprise to everyone in academia.

© 2013 Revolution Analytics, Inc. All rights reserved. Printed in the United States of America. This white paper is for informational purposes
only. REVOLUTION ANALYTICS MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Revolution R Enterprise, RRE,
Revolution Analytics, the Revolution Analytics and Revolution R Enterprise logos, and “Big Data, Big Analytics Platform” are trademarks
or registered trademarks of Revolution Analytics, Inc., in the United States and/or other countries. Other product names mentioned herein
may be trademarks of their respective companies.
