
Apache Storm

© Hortonworks Inc. 2013


What is Storm?
• Real-time stream processing framework
• Scalable
– Up to 1 million tuples per second per node
• Fault Tolerant
– Tasks are reassigned on failure
• Guaranteed Processing
– At-least-once processing out of the box
– Exactly-once processing with some more work (see Trident)
• Relatively language agnostic
– Primarily JVM-based
– Thrift API for defining and submitting topologies
– JSON-based protocol for defining components in other languages

Motivation
• Process large amounts of incoming data in real time
• Classic use case: processing streams of tweets
– Calculate trending users
– Calculate the reach of a tweet
• Data cleansing and normalization
• Personalization and recommendation
• Log processing

Lambda Architecture
• Most useful when the batch and speed layers do essentially the same computation
– Sample use case: KPI dashboard
• Less useful when the batch and speed layers do different computations
– Sample use case: real-time model scoring

Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/

Basic Concepts
Tuple: The most fundamental data structure; a named list of values, each of which can be of any datatype.

Stream: An unbounded sequence of tuples.

Spout: A source of streams; it generates the tuples that enter the topology.

Bolt: Contains data processing, persistence, and alerting logic; can also emit tuples for downstream bolts (see the sketch below).

Tuple tree: The first tuple plus all of the tuples that were emitted by the bolts that processed it.

Topology: A group of spouts and bolts wired together into a workflow.
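To make the bolt concept concrete, here is a minimal sketch. The class name and field names are illustrative, and the imports assume the pre-Apache backtype.storm namespace used in 2013:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative bolt: reads one field from the input tuple and emits a transformed tuple.
public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        collector.emit(new Values(word.toUpperCase())); // feeds downstream bolts
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word")); // names the fields of emitted tuples
    }
}

BaseBasicBolt acks each input tuple automatically; manual anchoring and acking are covered on the Guaranteed Processing slide.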

Architecture

Nimbus (management server):
• Similar to the JobTracker
• Distributes code around the cluster
• Assigns tasks
• Handles failures

Supervisor (worker nodes):
• Similar to the TaskTracker
• Runs bolts and spouts as ‘tasks’

ZooKeeper:
• Cluster co-ordination
• Nimbus HA
• Stores cluster metrics
• Consumption-related metadata for Trident topologies


Relationship Between Supervisors, Workers, Executors & Tasks

[Figure: a supervisor node hosting worker processes, each worker running executors, each executor running tasks]

Each supervisor machine in Storm has specific predefined ports, and a worker process is assigned to each port (configured as shown below).
Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
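The ports that define those worker slots are set per supervisor in storm.yaml. A minimal sketch; the values shown are the commonly cited defaults, adjust per machine:

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

Each listed port is one slot, so this supervisor can host up to four worker processes.
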
Tuple Routing
Stream groupings provide various ways to control how tuples are routed to bolts.

• Shuffle grouping – sends tuples to a bolt’s tasks in a random, round-robin sequence. When to use: atomic operations, e.g. math operations.
• Fields grouping – sends tuples to a bolt based on one or more fields in the tuple. When to use: segmenting the incoming stream, or counting tuples of a certain type.
• All grouping – sends a single copy of each tuple to all instances of a receiving bolt. When to use: signaling all bolts, e.g. to clear a cache or refresh state, or sending a tick tuple that tells bolts to save state.
• Custom grouping – implement your own grouping so tuples are routed based on custom logic. When to use: maximum flexibility to change the processing sequence or logic based on factors like data type, load, or seasonality (see the sketch after this list).
• Direct grouping – the source decides which bolt task will receive the tuple. When to use: depends on the application.
• Global grouping – sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID). When to use: global counts.
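For the custom grouping row above, a minimal sketch of the CustomStreamGrouping interface; the class name and modulo-hash routing logic are illustrative:

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;
import java.util.Arrays;
import java.util.List;

// Illustrative custom grouping: routes each tuple by the hash of its first field.
public class ModHashGrouping implements CustomStreamGrouping {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        this.targetTasks = targetTasks; // task ids of the receiving bolt's instances
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int idx = Math.abs(values.get(0).hashCode()) % targetTasks.size();
        return Arrays.asList(targetTasks.get(idx)); // route this tuple to a single task
    }
}

It would be wired in with builder.setBolt(...).customGrouping("source", new ModHashGrouping()).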

Topology creation example
A four-stage pipeline: Get Tweet → Find Hashtags → Count Hashtags → Report Findings

• "reader" (Kafka spout): reads tweets.
• "normalizer" (bolt): removes non-alphanumeric characters, extracts the hashtag values, and emits them.
• "enumerator" (bolt): keeps track of how many instances of each hashtag have occurred.
• "reporter" (bolt): regularly creates a report and uploads it to Amazon S3.

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("reader", kafkaSpout);
builder.setBolt("normalizer", new HashTagNormalizer(), 2).shuffleGrouping("reader");
builder.setBolt("enumerator", new HashTagEnumerator(), 2)
       .fieldsGrouping("normalizer", new Fields("hashtag"));
builder.setBolt("reporter", new ResultsReporter(), 1).globalGrouping("enumerator");

What happens on failure?
• Run everything under process supervision
– e.g. daemontools or monit
– Restarts Nimbus and Supervisors when they fail
• Nimbus
– Stateless (its state is kept in ZooKeeper or on disk)
– A single point of failure, sort of:
– Workers keep running, but tasks can’t be reassigned if a node fails
– Supervisors continue as normal
• Supervisor
– Stateless
• Entire node
– Nimbus reassigns the tasks on that machine after a timeout

Guaranteed Processing
• Tuples emitted by a spout are tagged with a message ID
• Each of these tuples can result in a tuple tree
• Once every tuple in the tuple tree is processed, the original tuple is considered processed
• Requires two pieces from the user (see the sketch below):
– Explicitly anchoring each emitted tuple to the input tuple(s)
– Acking or failing every tuple
• If a tuple tree isn’t fully processed within the timeout, the original tuple is treated as failed
• Spouts like the Kafka spout can replay tuples on failure, whether explicitly failed by bolts or timed out
– At-least-once processing!
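A minimal sketch of anchoring and acking; the class name and word-splitting logic are illustrative, and the imports again assume the 2013-era backtype.storm namespace:

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

// Illustrative reliable bolt: anchors every emit and acks or fails every input.
public class ReliableSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            for (String word : input.getString(0).split("\\s+")) {
                collector.emit(input, new Values(word)); // anchor: child of input in the tuple tree
            }
            collector.ack(input);                        // input is fully processed
        } catch (Exception e) {
            collector.fail(input);                       // ask the spout to replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}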

What is Trident?
• Provides exactly-once processing semantics in Storm
• Core concept: process a group of tuples as a batch, rather than one tuple at a time as core Storm does
• Higher-level API for defining topologies (see the sketch below)
• Under the covers, all Trident topologies are automatically converted into spouts and bolts
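For flavor, a sketch along the lines of the well-known Trident word-count example. The spout is assumed to emit a "sentence" field, and the imports use the 2013-era storm.trident namespace:

import backtype.storm.topology.IRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

// Illustrative Trident topology: counts words, updating state once per batch.
public class TridentWordCount {

    // Splits each incoming sentence into words.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split("\\s+")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology build(IRichSpout spout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)               // tuples flow through in batches
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), // state updated per batch
                                     new Count(), new Fields("count"));
        return topology;
    }
}

MemoryMapState is an in-memory testing state; true exactly-once semantics additionally require a transactional (or opaque transactional) spout and state backend.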

Parallelism
• Three basic variables: # slots, # workers, # tasks
– No general way to choose them beyond profiling and adjusting (see the sketch below)
• Can set the number of executors (threads) per component
• Can set the number of tasks
– Tasks are NOT parallel within an executor
– More than one task per executor is useful for rebalancing while the topology is running
• Number of workers
– Increase it when bottlenecked on CPU and each worker has many tuples to process
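A sketch of where each knob is set; the component names reuse the earlier hashtag example and the numbers are arbitrary:

Config conf = new Config();
conf.setNumWorkers(2);  // worker processes (JVMs) for the topology

builder.setBolt("normalizer", new HashTagNormalizer(), 4)  // parallelism hint: 4 executors
       .setNumTasks(8)                                     // 8 tasks spread across those executors
       .shuffleGrouping("reader");

Because there are more tasks than executors, the running topology can later be rebalanced (e.g. with the storm rebalance command) to spread those 8 tasks over up to 8 executors.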

Patterns – Streaming Joins
• Combine two or more data streams
• Unlike a database join, a streaming join has unbounded input and less clear-cut semantics
• Different types of joins suit different use cases
• Partition the input streams the same way, using fields grouping:

builder.setBolt("join", new MyJoiner(), parallelism)
       .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
       .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Patterns – Batching
• For efficiency
– e.g. the Elasticsearch bulk API
• Hold on to incoming tuples in an instance variable
• Process the buffered tuples as one batch
• Ack all of the buffered tuples
• When emitting, consider multi-anchoring to ensure reliability (see the sketch below)
– Anchor to all of the batched tuples so that every batched tuple is replayed on failure
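A minimal sketch of this pattern; BatchingBolt, BATCH_SIZE, and writeBulk() are illustrative rather than a real Storm or Elasticsearch API:

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative batching bolt: buffers tuples and flushes them in bulk.
public class BatchingBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 100;    // arbitrary flush threshold
    private OutputCollector collector;
    private final List<Tuple> buffer = new ArrayList<Tuple>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        buffer.add(input);                         // hold on to tuples in an instance variable
        if (buffer.size() >= BATCH_SIZE) {
            writeBulk(buffer);                     // hypothetical bulk write, e.g. one ES bulk request
            // Multi-anchor the summary emit to the whole batch: if it is lost,
            // every batched tuple is replayed.
            collector.emit(buffer, new Values(buffer.size()));
            for (Tuple t : buffer) {
                collector.ack(t);                  // ack only after the bulk write succeeds
            }
            buffer.clear();
        }
    }

    private void writeBulk(List<Tuple> batch) { /* hypothetical sink call */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("batchsize"));
    }
}

A production version would typically also flush on a tick tuple, so a partially filled buffer cannot sit (and eventually time out) indefinitely.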

Patterns – Streaming Top N
• Simplest way: one bolt that does a global grouping on the stream and maintains an in-memory list of the top N items
– Doesn’t scale, because the whole stream goes through one task
• Alternative: compute many top Ns across partitions of the stream
• Merge the per-partition top Ns to get the global top N
• Use fields grouping to get the partitioning

builder.setBolt("rank", new RankObjects(), parallelism)
       .fieldsGrouping("objects", new Fields("value"));
builder.setBolt("merge", new MergeObjects())
       .globalGrouping("rank");

