
Apache Spark 2 Course
Spark Streaming

Juan Carlos Ibáñez, PhD
2017
Spark Big Data Analytics Stack

[Diagram: the Spark stack. The libraries for structured (schema) data, real-time (pseudo, micro-batch) streaming, machine learning algorithms, and graph processing all run on Spark Core, the execution engine.]


Spark Streaming
• Classical analytics is performed on data at rest (databases, flat files, …)

• Today, many applications must process large streams of live data and provide results in real time: wireless sensor networks, traffic management applications, stock market monitoring, environmental monitoring applications, fraud detection tools, etc.

• Spark Streaming is built for this purpose; it allows you to:

  • Look at data as they are created or arrive from a source
  • Transform, summarize, and analyze incoming data in real time
  • Perform machine learning in real time
  • Predict in real time

• Streaming data sources: flat files (as they are created), TCP/IP sockets, Apache Flume, Apache Kafka, Amazon Kinesis, Twitter, Facebook, …



Stream data processing vs. at-rest database systems

• Stream processing systems: data-in-motion analytics

  • Process information as it flows, without storing it persistently
  • Work on transient data that is continuously updated
  • Execute standing queries, which run continuously and provide updated answers as new data arrives (see the sketch below)

• Database management systems: data-at-rest analytics

  • Store and index data before processing it
  • Process data only when explicitly asked by the users
  • Work on persistent data where updates are relatively infrequent
  • Run queries just once to return a complete answer
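To make the contrast concrete, here is a minimal Scala sketch, assuming an existing SparkContext named sc and a hypothetical input file and socket port: the batch query runs once over data at rest, while the streaming query is a standing query that re-emits an updated answer on every micro-batch.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Data at rest: a one-shot query over a static file (hypothetical path);
    // it runs once and returns a complete answer
    val staticCount = sc.textFile("hdfs:///data/events").count()
    println(s"Total events so far: $staticCount")

    // Data in motion: a standing query over a live socket (hypothetical port);
    // the count is re-evaluated on every 5-second micro-batch
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()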
Spark Streaming description


• Run a streaming computation as a series of very small, deterministic batch jobs:

  1. Chop up the live stream into batches of X seconds
  2. Treat each batch of data as an RDD and process it using RDD operations
  3. Return the processed results of the RDD operations in batches (see the sketch below)
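As a rough sketch of steps 1 to 3 in Scala (assuming an existing StreamingContext ssc built with the desired batch interval, and a hypothetical socket source), foreachRDD makes the micro-batch model explicit: each batch is handed to user code as an ordinary RDD.

    // Step 1 happens in the StreamingContext: the live stream is chopped
    // into batches of X seconds (the interval passed at construction)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Steps 2 and 3: each batch arrives as a regular RDD, is processed
    // with RDD operations, and the result is emitted once per batch
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time: ${rdd.count()} records")
    }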



Spark Streaming description

• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams

• Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets

• The data can then be processed using Spark's RDD functions such as map, reduce, join, and window

• Finally, the processed data can be pushed out to filesystems, databases, and live dashboards (see the sketch below)
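Putting ingestion, processing, and output together, here is a minimal word-count sketch in Scala (the host, port, and output path are hypothetical; any of the sources above could replace the socket):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Ingest: read lines from a TCP socket (Kafka, Flume, etc. work similarly)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Process: classic map/reduce over each micro-batch
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Push out: to a live console dashboard and to the filesystem
    counts.print()
    counts.saveAsTextFiles("hdfs:///streaming/wordcounts")

    ssc.start()
    ssc.awaitTermination()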



DStreams
• A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream

• Internally, a DStream is represented by a continuous series of RDDs

• First we create a StreamingContext from the SparkContext to enable streaming

• Streaming creates a DStream on which processing occurs

• A micro-batch window is set up for the DStream

• Data is received, accumulated as a micro-batch, and processed as a micro-batch

• Each micro-batch is an RDD

• Regular RDD operations can be applied to the DStream's RDDs (see the sketch below)
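A minimal sketch of this setup, assuming an existing SparkContext sc; the 2-second micro-batch window and the monitored input directory are arbitrary choices:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Create the StreamingContext from the SparkContext;
    // Seconds(2) is the micro-batch window
    val ssc = new StreamingContext(sc, Seconds(2))

    // The DStream: here, new files appearing in a monitored directory
    val stream = ssc.textFileStream("hdfs:///incoming")

    // Regular RDD-style operations apply directly to the DStream
    stream.map(_.toUpperCase).filter(_.nonEmpty).print()

    ssc.start()
    ssc.awaitTermination()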



DStream processing

1. Initial DStream: a sequence of RDDs representing a stream of data

2. DStream transformations

3. Window operations: group all the records from a sliding window of past time intervals into one RDD: window, reduceByWindow, reduceByKeyAndWindow, …

  • Window length: the duration of the window
  • Slide interval: the interval at which the operation is performed (see the sketch below)
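A hedged window example in Scala, assuming a DStream of (word, count) pairs named pairs and a 5-second batch interval: it maintains counts over the last 30 seconds, recomputed every 10 seconds.

    import org.apache.spark.streaming.Seconds

    // Window length = 30 s, slide interval = 10 s;
    // both must be multiples of the batch interval (here 5 s)
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // combine counts within the window
      Seconds(30),               // window length
      Seconds(10)                // slide interval
    )
    windowedCounts.print()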
DStream processing

• Spark collects incoming data during each interval (micro-batch)

• The data for that interval is collected as an RDD

• Spark then applies all transformations and operations registered for that DStream or its derived DStreams

• Global variables can be used to track data across micro-batches (see the sketch below)

• Windowing functions are available for computing across multiple micro-batches

  • The window length must be a multiple of the batch interval
  • The slide interval must be a multiple of the batch interval
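In practice, the "global variables" point is commonly realized with updateStateByKey, which carries per-key state across micro-batches. A minimal sketch, assuming an existing StreamingContext ssc, a DStream of (word, 1) pairs named pairs, and a hypothetical checkpoint directory (checkpointing is required for stateful operations):

    ssc.checkpoint("hdfs:///streaming/checkpoints") // required for stateful ops

    // Running count per word across all micro-batches seen so far
    def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
      Some(newValues.sum + running.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()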

