3 Reasons Enterprises Struggle with Storm & Spark Streaming
and Adopt DataTorrent RTS
Deliver fast, actionable business insights for data scientists, rapid
application creation for developers, and enterprise-grade
operational excellence for IT

3 Reasons enterprises struggle with Storm & Spark Streaming | whitepaper

Getting to fast actionable insights means
empowering analysts and data scientists to easily
work with data from many data sources (both in
motion and at rest), gain insights in seconds,
visualize the data insights and take action
automatically, without the need to involve the
entire IT department. At the same time, data
center operations teams need to ensure that the
solution is operational and meets business SLAs.

Given the buzz around Spark Streaming & Storm, they can seem like
obvious choices for supporting streaming analytics. However, most of
our customers have struggled to take either Spark Streaming or Storm
beyond the proof-of-concept stage, because both address enterprise
objectives too narrowly to offer a complete solution. Enterprises
require an easy-to-use, visual tools-based approach that works out of
the box. The platform needs to meet the needs of data scientists,
developers and the data center operations teams without requiring an
extensive and expensive patchwork of custom code & third-party
software that often fails.

DataTorrent RTS is the industry's first fully Hadoop-native streaming
analytics solution. It provides an enterprise-grade streaming analytics
platform, delivers tools and pre-built analytics modules, and offers
lights-out data center operational capabilities.

This paper explores the top 3 reasons enterprises pass on Spark
Streaming & Storm and deploy DataTorrent RTS.

www.datatorrent.com

1. Enterprise-grade streaming analytics platform


Your streaming analytics platform needs to meet the needs of your business. It's not
sufficient to take open source code that might work for some large web-scale
organizations with scores of platform-level developers and try to deploy it in an
enterprise data center. Most enterprises don't have or want developers who are
coding at the platform level. Imagine having your developers struggle with tuple-level
acking, configuration & distributed state management! Enterprises have strict business
requirements for SLAs (no data loss, performance/latency and availability) and they
want their developers to focus on solving their core business problem.

With this goal in mind, DataTorrent RTS was built from day one as a Hadoop 2.x native
application. DataTorrent RTS natively supports Hadoop YARN and HDFS on every
commercial Hadoop platform. It also runs seamlessly in public or private cloud
environments. IT organizations get the benefits of high performance, in-order
processing, auto-scaling, dynamic updates, automatic fault tolerance of application
state, engine state and raw data, and distributed in-memory analytics without
having to hand-code any of these capabilities.

An enormous amount of data is being generated each day, in different varieties, at
different sizes and at different rates. This fast big data is critical to an organization's
ability to gain competitive advantage and achieve operational efficiencies. It's
important that your streaming analytics solution not only handles the different data
types but also provides appropriate processing guarantees. DataTorrent RTS is the
only streaming analytics solution that can provide exactly-once, at-most-once and
at-least-once event processing guarantees while still achieving the low latency of
per-tuple processing and not resorting to micro-batching.
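To make the three processing guarantees concrete, here is a minimal plain-Python sketch. It is not DataTorrent RTS code and uses no platform API; the event format, the `FAULT_ID` constant and all function names are hypothetical, chosen only to show how a single fault affects a running total under each guarantee.

```python
# Hypothetical event stream of (event_id, amount) pairs.
EVENTS = [(1, 10), (2, 20), (3, 30)]
FAULT_ID = 2  # assumption: a fault occurs while this event is in flight


def at_most_once(events):
    """No retry after a fault: the in-flight event is lost."""
    return [e for e in events if e[0] != FAULT_ID]


def at_least_once(events):
    """Replay after a fault: the in-flight event is processed twice."""
    delivered = []
    for event in events:
        delivered.append(event)
        if event[0] == FAULT_ID:
            delivered.append(event)  # replayed copy
    return delivered


def exactly_once(events):
    """At-least-once delivery plus de-duplication on event id."""
    seen, result = set(), []
    for event in at_least_once(events):
        if event[0] not in seen:
            seen.add(event[0])
            result.append(event)
    return result


def total(delivered):
    """Sum the amounts of all delivered events."""
    return sum(amount for _, amount in delivered)
```

Under these assumptions the true total is 60. At-most-once undercounts (40), at-least-once overcounts (80), and only exactly-once matches the true value.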
The decisions that are being made based on insights gained from fast big data are
typically in an operational data path. Enterprise-grade fault tolerance is required for
fast big data insights to be operational. DataTorrent RTS provides fault tolerance for
raw input data (even when the input source is not stateful), engine state and
processed data (application state), all without human intervention in the event of an
outage. Also, only DataTorrent RTS supports incremental recovery, which allows a
failed node to recover its state and raw data stream from the previous node rather
than requiring replay from the first step. This significantly reduces recovery time and
ensures latency SLAs are maintained.
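The idea behind incremental recovery can be sketched in a few lines of plain Python. This is an illustration of the general technique only, under the assumption that the upstream stage buffers its output until the downstream stage has checkpointed it; it does not describe how DataTorrent RTS implements recovery, and the function and variable names are hypothetical.

```python
def run_with_failure(source, checkpoint_every):
    """Simulate a two-stage pipeline whose downstream stage fails at the end.

    The upstream stage doubles each value; the downstream stage keeps a
    running sum. Returns (recovered_sum, replayed_count).
    """
    upstream_buffer = []   # upstream outputs not yet covered by a checkpoint
    checkpoint = 0         # last saved copy of the downstream state
    downstream_sum = 0     # live downstream state (lost on failure)

    for count, value in enumerate(source, start=1):
        doubled = 2 * value              # upstream stage
        upstream_buffer.append(doubled)
        downstream_sum += doubled        # downstream stage
        if count % checkpoint_every == 0:
            checkpoint = downstream_sum  # save downstream state
            upstream_buffer.clear()      # buffered tuples are now covered

    # The downstream stage fails here and its live state is gone.
    # Recovery: restore the checkpoint, then replay only the upstream buffer.
    recovered = checkpoint
    for doubled in upstream_buffer:
        recovered += doubled
    return recovered, len(upstream_buffer)
```

With five input events and a checkpoint every three events, `run_with_failure([1, 2, 3, 4, 5], 3)` restores the full sum of 30 by replaying two buffered tuples from the previous stage instead of all five from the source.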

Where Storm & Spark Streaming fall short


Apache Storm & Spark Streaming's applicability is limited by their core architecture.
With Spark Streaming, the inherent RDD-based processing paradigm adds
overhead and latency to stream processing. The per-tuple acking in
Storm is notoriously problematic in production environments and creates severe
operational headaches when scaling a topology or troubleshooting bottlenecks &
failures. Both Storm & Spark Streaming force users to micro-batch input to provide
exactly-once processing guarantees. This introduces significant latency in processing.
Also, the ability to maintain event order and to provide application-state-level fault
tolerance is not part of the core platform for either Spark Streaming or Storm. These are
critical components of a stream processing platform and a must-have for most use
cases (e.g. imagine trying to do event-sequence-based pattern detection). Implementing
them requires non-trivial programming with an intricate understanding of the underlying
streaming platform & concepts, and requires constant maintenance and updates with
each release of the platform. Finally, all the workarounds you have to build into your
business logic create significant lock-in for your application.

What to ask
To ensure an enterprise-grade solution that meets your organization's SLA
requirements, ask the following questions of your proposed solution:

If Hadoop is your core big data platform, does your streaming platform
seamlessly use HDFS for raw data, application state checkpoints & engine
state management to reduce dependence on external datastores, such as
relational databases, that do not scale? Also, does your streaming platform
run natively on YARN for scheduling, so that you do not have to make the
streaming platform's own scheduler work well with YARN, which can
cause significant multi-tenancy & operational issues?
Can the streaming analytics solution auto-scale and process increased data
loads without manual programming and re-deployment?
Does the streaming analytics platform guarantee the processing order of
your events across all processing guarantees (at-most-once, at-least-once &
exactly-once) without having to micro-batch the input data?
Is the streaming analytics solution's fault tolerance complete (raw events, app
state & engine state), abstracted from the developer and done natively in
Hadoop using HDFS?
Streaming analytics applications need to be able to handle events non-stop.
Does your streaming analytics solution support dynamic updates to
application properties and business logic with no application downtime?

2. Data scientist and application developer friendly


The path to a production-ready streaming analytics solution entails a lot of
experimentation upfront. Data scientists and developers should be able to use
intuitive visual tools to quickly create streaming applications and iterate over their
hypotheses. These iterations should not always involve cumbersome coding by
developers. Developers should be able to simply create organization-specific business
logic (e.g. custom parsers) for any data source and make it available for data
scientists to visually assemble the streaming application.
The DataTorrent RTS streaming analytics solution enables rapid time to market/time to
value via pre-built modular analytics capabilities that are easily combined using a
visual interface.
Development is simple with a single-threaded, Java-based
development model that allows for arbitrary business logic (often re-using existing
code!). To get your developers productive quickly, DataTorrent RTS
provides over 450 pre-built Java operators that provide a raft of analytical capabilities.
More than 75 input and output operators allow for data ingestion and distribution from
sources such as Kafka, Flume, message buses (JMS, MQ, etc.), databases (SQL, NoSQL), web
sockets and more. All the platform processing guarantees, idempotency & state
management are automatically extended to the input & output connectors & all other
operators, so no additional platform-level development work from the application
developer is needed.

The Java operator programming model is simple yet powerful, as DataTorrent RTS
provides key capabilities that are left up to the developer in open source streaming
analytics platforms. Developers do not have to worry about multi-threading the code;
the application is automatically partitioned and distributed across the Hadoop cluster
for scalability. Another key capability is native support for application time-series
windows that are both aggregate (per minute, per hour) and rolling (last 5
minutes, last 3 hours). As mentioned earlier, fault tolerance is a platform capability
and abstracted from the developer.
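The difference between aggregate and rolling windows can be shown with a small plain-Python sketch. It is an illustration only and does not use any DataTorrent RTS operator; the event format (timestamp in seconds, value) and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical events of (timestamp_in_seconds, value) pairs.
EVENTS = [(5, 1), (30, 2), (65, 3), (130, 4), (170, 5)]


def per_minute_totals(events):
    """Aggregate windows: each event falls into exactly one minute bucket."""
    totals = defaultdict(int)
    for timestamp, value in events:
        totals[timestamp // 60] += value
    return dict(totals)


def rolling_total(events, now, span_seconds):
    """Rolling window: sum of events with now - span_seconds < timestamp <= now."""
    return sum(
        value
        for timestamp, value in events
        if now - span_seconds < timestamp <= now
    )
```

Here `per_minute_totals(EVENTS)` yields one total per minute (`{0: 3, 1: 3, 2: 9}`), while `rolling_total(EVENTS, now=180, span_seconds=120)` sums whatever falls in the most recent two minutes (12) and changes as `now` advances.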

Where Storm & Spark Streaming fall short


The Java APIs in Spark Streaming & Storm require a lot of hand coding, as there is no
library of pre-built code. Data input & output connectors are few. The Java interface in
Spark Streaming is notoriously hard to use, as there is a significant bias towards Scala.
With Storm, even though Java is supported, developers have to deal with tuple-level
acking in their application code. Besides the lack of a starting point, for both
Spark Streaming & Storm, programming is tedious, as the developer must manually
account for scalability, handle input data skews, hand-code fault tolerance for the
application data and attempt to force event ordering/re-ordering. Spark Streaming &
Storm do not have any visual development tools, so coding must be done by a
developer; a data scientist who is not familiar with streaming cannot
create simple applications to quickly iterate over their analysis.

What to ask
To ensure that data scientists and developers can rapidly assemble applications, ask
the following questions of your proposed solution:

Does the streaming analytics solution have connectors to support fault-tolerant
& auto-scaling data ingestion & distribution for all of your data
sources & analytics destinations out of the box?
Are common data analytics capabilities such as joins, aggregations, and
statistical analysis available out of the box? How about complex capabilities
such as dimensional cube creation and integration with machine learning
tools?
Does the solution aggregate data over varying windows, both static and
rolling, automatically, or does the developer have to implement this manually?
Is the solution data scientist and business analyst friendly, with visual
application creation and data visualization tools?

3. Robust management and operational deployment


Fast big data doesn't stop and neither can the insights and actions that your business
takes. As a result, streaming analytics applications are designed to run 24x7 with no
downtime. Data center operations teams need to ensure that the full lifecycle of
application deployment, monitoring, updating, and problem resolution meets the
organization's business commitments. Management requirements extend not only to
on-premises deployments, but also cloud and hybrid cloud/data center deployments.
Designed from day one with enterprise data center operations as a requirement,
DataTorrent RTS fully embraces the application lifecycle. The DataTorrent solution is
fully multi-tenant, allowing multiple applications to run on the same Hadoop cluster,
optimizing operations and maximizing data center resources.
DataTorrent RTS provides application-packaging technology that is simple to implement
and use, streamlining the handoff from dev to ops. Because the platform is designed for
zero downtime, data center ops teams can change business logic, modify application
window sizes (for example, from 1 hour to 30 minutes) and performance-tune a running
application without stopping the data processing.
The DataTorrent RTS UI console provides full visibility into the application at a Hadoop
container-level, including resource usage and performance/latency statistics in
addition to built-in monitoring alerts. Application issue resolution is simplified with
application counters, console event alerts and cluster-wide log collection and
consolidation.

Where Storm & Spark Streaming fall short


Spark Streaming & Storm provide rudimentary capabilities across the application
lifecycle. Their management & monitoring tools do not provide full visibility into all
metrics of the streaming application and the infrastructure. Neither architecture
makes provision for dynamic application updates.

What to ask

Does your organization require easy-to-use tools for the full application
deployment & management operations cycle?
Are visual, automated alerting and command line tools required for your
data center operations team?
Does the streaming analytics solution have built-in capabilities to make
application modifications dynamically?

Conclusion
Enterprises are seeing greater opportunity to better serve their customers, drive
greater revenues and reduce costs through operational efficiencies. In order to
capitalize on the opportunity, organizations are looking for solutions that enable rapid
insights and action to be taken on fast big data. An enterprise-grade solution is
required that meets the needs of data scientists, developers and data center
operations.
The top 3 reasons that enterprises are deploying DataTorrent RTS over Spark
Streaming & Storm are summarized below.

Enterprise-grade streaming analytics platform

Industry's first Hadoop-native, fully multi-tenant YARN and HDFS based architecture
No data loss with automatic fault tolerance for raw event data, application state & engine state
High-throughput, in-memory & low-latency event processing with no need to micro-batch
At-most-once, at-least-once and exactly-once processing guarantees while guaranteeing event order!
Auto-scaling & auto-partitioning of event streams for skew management

Data scientist & application developer friendly

Visual application creation tool that utilizes the 450+ open source Java operators
Ability to ingest data from and distribute to any source with more than 75 pre-built adaptors
Open source library of 450+ operators for a wide variety of real-time analytics & transformations

Robust operations & management platform

Simple application packaging and deployment
Intuitive UI for end-to-end management, monitoring, reporting & troubleshooting
Dynamic application updates with no application downtime
Light footprint (no need to deploy on every Hadoop node) for simple installation & upgrade
REST API for easy integration with enterprise tools

Additional Resources
DataTorrent RTS: Data sheet
DataTorrent RTS Whitepaper
DataTorrent download

DataTorrent Inc.,
3200 Patrick Henry Drive
2nd Floor
Santa Clara CA 95054
+(1) 408-331-5034, ext #101
www.datatorrent.com
