
Apache Spark 2 Course
Spark Streaming

Juan Carlos Ibáñez, PhD
2017
Spark Big Data Analytics Stack

[Diagram: the Spark stack. The libraries for structured (schema) data, real-time (pseudo, micro-batch) streaming, machine learning algorithms, and graph processing all run on Spark Core, the execution engine.]


Spark Streaming
• Classical analytics is performed on data at rest (databases, flat files, …)

• Today, many applications must process large streams of live data and provide results in real time: wireless sensor networks, traffic management applications, stock market monitoring, environmental monitoring applications, fraud detection tools, etc.

• Spark Streaming is built for this purpose; it allows you to:

  • Look at data as they are created or arrive from a source
  • Transform, summarize, and analyze incoming data in real time
  • Perform machine learning in real time
  • Predict in real time

• Streaming data sources: flat files (as they are created), TCP/IP sockets, Apache Flume, Apache Kafka, Amazon Kinesis, Twitter, Facebook, …



Stream data processing vs. at-rest database systems

• Stream processing systems: data-in-motion analytics

  • Process information as it flows, without storing it persistently
  • Work on transient data that is continuously updated
  • Execute standing queries, which run continuously and provide updated answers as new data arrives (see the sketch below)

• Database management systems: data-at-rest analytics

  • Store and index data before processing it
  • Process data only when explicitly asked by the users
  • Work on persistent data where updates are relatively infrequent
  • Run queries just once to return a complete answer
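To make the contrast concrete, here is a minimal Scala sketch, assuming an existing SparkContext named sc and a hypothetical input file and socket port: the batch query runs once over data at rest, while the streaming query is a standing query that re-emits an updated answer on every micro-batch.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Data at rest: a one-shot query over a static file (hypothetical path);
    // it runs once and returns a complete answer
    val staticCount = sc.textFile("hdfs:///data/events").count()
    println(s"Total events so far: $staticCount")

    // Data in motion: a standing query over a live socket (hypothetical port);
    // the count is re-evaluated on every 5-second micro-batch
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()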
Spark Streaming description


• Run a streaming computation as a series of very small, deterministic batch jobs:

  1. Chop up the live stream into batches of X seconds
  2. Treat each batch of data as an RDD and process it using RDD operations
  3. Return the processed results of the RDD operations in batches (see the sketch below)
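As a rough sketch of steps 1 to 3 in Scala (assuming an existing StreamingContext ssc built with the desired batch interval, and a hypothetical socket source), foreachRDD makes the micro-batch model explicit: each batch is handed to user code as an ordinary RDD.

    // Step 1 happens in the StreamingContext: the live stream is chopped
    // into batches of X seconds (the interval passed at construction)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Steps 2 and 3: each batch arrives as a regular RDD, is processed
    // with RDD operations, and the result is emitted once per batch
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time: ${rdd.count()} records")
    }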



Spark Streaming description

• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams

• Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets

• The data can then be processed using Spark's RDD functions such as map, reduce, join, and window

• Finally, the processed data can be pushed out to filesystems, databases, and live dashboards (see the sketch below)
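Putting ingestion, processing, and output together, here is a minimal word-count sketch in Scala (the host, port, and output path are hypothetical; any of the sources above could replace the socket):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Ingest: read lines from a TCP socket (Kafka, Flume, etc. work similarly)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Process: classic map/reduce over each micro-batch
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Push out: to a live console dashboard and to the filesystem
    counts.print()
    counts.saveAsTextFiles("hdfs:///streaming/wordcounts")

    ssc.start()
    ssc.awaitTermination()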



DStreams
• A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream

• Internally, a DStream is represented by a continuous series of RDDs

• First we create a StreamingContext from the SparkContext to enable streaming

• Streaming creates a DStream on which processing occurs

• A micro-batch window is set up for the DStream

• Data is received, accumulated as a micro-batch, and processed as a micro-batch

• Each micro-batch is an RDD

• Regular RDD operations can be applied to the DStream's RDDs (see the sketch below)
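A minimal sketch of this setup, assuming an existing SparkContext sc; the 2-second micro-batch window and the monitored input directory are arbitrary choices:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Create the StreamingContext from the SparkContext;
    // Seconds(2) is the micro-batch window
    val ssc = new StreamingContext(sc, Seconds(2))

    // The DStream: here, new files appearing in a monitored directory
    val stream = ssc.textFileStream("hdfs:///incoming")

    // Regular RDD-style operations apply directly to the DStream
    stream.map(_.toUpperCase).filter(_.nonEmpty).print()

    ssc.start()
    ssc.awaitTermination()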



DStream processing

1. Initial DStream: a sequence of RDDs representing a stream of data

2. DStream transformations

3. Window operations: group all the records from a sliding window of past time intervals into one RDD: window, reduceByWindow, reduceByKeyAndWindow, …

  • Window length: the duration of the window
  • Slide interval: the interval at which the operation is performed (see the sketch below)
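A hedged window example in Scala, assuming a DStream of (word, count) pairs named pairs and a 5-second batch interval: it maintains counts over the last 30 seconds, recomputed every 10 seconds.

    import org.apache.spark.streaming.Seconds

    // Window length = 30 s, slide interval = 10 s;
    // both must be multiples of the batch interval (here 5 s)
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // combine counts within the window
      Seconds(30),               // window length
      Seconds(10)                // slide interval
    )
    windowedCounts.print()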
DStream processing

• Spark collects incoming data during each interval (micro-batch)

• The data for that interval is collected as an RDD

• Spark then applies all transformations and operations registered for that DStream or its derived DStreams

• Global variables can be used to track data across micro-batches (see the sketch below)

• Windowing functions are available for computing across multiple micro-batches

  • The window length must be a multiple of the batch interval
  • The slide interval must be a multiple of the batch interval
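In practice, the "global variables" point is commonly realized with updateStateByKey, which carries per-key state across micro-batches. A minimal sketch, assuming an existing StreamingContext ssc, a DStream of (word, 1) pairs named pairs, and a hypothetical checkpoint directory (checkpointing is required for stateful operations):

    ssc.checkpoint("hdfs:///streaming/checkpoints") // required for stateful ops

    // Running count per word across all micro-batches seen so far
    def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
      Some(newValues.sum + running.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()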

