Data Streaming

1
Outline

◼ Definition of data streaming


◼ Why data streaming
◼ Streaming data architecture

2
Definition

◼ Data can be encountered in two forms:

◼ At rest, in the form of a file, the contents of a database, or some other kind of
record.

◼ In motion, as continuously generated sequences of signals, such as the
measurements of a sensor or GPS signals from moving vehicles.

◼ A batch-processing system is a program that processes data at rest.

◼ A stream-processing program is a program that processes data in motion as well as
data at rest.

3
Why Data Streaming

◼ The Big Data and analytics market has reached hundreds of billions of dollars.
◼ In the future, a large chunk of this will be attributed to IoT.
◼ According to IBM, 60% of all sensory information loses value within a few milliseconds.
Some data is like fresh fruit: it must be consumed quickly.
◼ The inability to process data in real time results in a loss of billions of dollars.
◼ Examples of such applications include:
◼ A telco working out how many of its users have used WhatsApp in the last 30
minutes.
◼ A retailer keeping track of the number of people who have said positive things
about its products today on social media.
◼ A law enforcement agency looking for a suspect using data from traffic CCTV.

4
Examples of Stream Processing Applications

◼ Device monitoring
◼ Fault detection
◼ Fleet management
◼ Online recommendations
◼ Faster loans

5
The Streaming Data Architectural Blueprint

6
The Streaming Data Architectural Blueprint: Example

◼ Collection tier—When a user posts a tweet, it is collected by the Twitter services.
◼ Message queuing tier—Twitter runs data centers in locations across the globe, and
conceivably the collection of a tweet doesn't happen in the same location as the
analysis of the tweet.
◼ Analysis tier—Processing is done on those 140 characters; suffice it to say, at a
minimum for our example, Twitter needs to identify the followers of the tweet's author.
◼ Long-term storage tier—Serving tweets going back in time implies that they're
stored in a persistent data store.
◼ In-memory data store tier—Tweets that are mere seconds old are most likely held
in an in-memory data store.
◼ Data access tier—All Twitter clients need to be connected to Twitter to access the
service.

7
Collection Tier

◼ The collection tier is the entry point for bringing data into the streaming system.
◼ Regardless of the protocol used by a client to send data to the collection tier—or,
in certain cases, the collection tier reaching out and pulling in the data—only a
limited number of interaction patterns are in use today:
◼ Request/response pattern
◼ Publish/subscribe pattern
◼ One-way pattern
◼ Request/acknowledge pattern
◼ Stream pattern

8
Message Queuing Tier

◼ The message queuing tier decouples the collection tier from the analysis tier.
◼ Decoupling allows tiers to work at a higher level of abstraction, passing messages
rather than making explicit calls to each other.
◼ The message queuing world has three main components, illustrated in the sketch
below:
◼ Producer
◼ Broker
◼ Consumer
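◼ As a minimal illustration of these three roles (plain Python, not any specific
broker's API), the sketch below uses the standard library's queue as a stand-in
for the broker:

import queue
import threading

broker = queue.Queue(maxsize=100)  # the broker buffers messages between tiers

def producer():
    # the collection tier acts as a producer, publishing events to the broker
    for i in range(5):
        broker.put({"event_id": i, "payload": f"tweet-{i}"})
    broker.put(None)  # sentinel marking the end of the stream

def consumer():
    # the analysis tier acts as a consumer, reading messages from the broker
    while True:
        msg = broker.get()
        if msg is None:
            break
        print("processing", msg)

threading.Thread(target=producer).start()
consumer()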

9
Message Queuing Tier

10
Message delivery semantics

◼ A producer sends messages to a broker, and the consumer reads messages from
a broker. That’s a pretty high-level description of how message delivery works.
◼ But if you look deeper, there are three common semantic guarantees you
will run into when looking at message queuing products:
◼ At most once—A message may get lost, but it will never be reread by a
consumer.
◼ At least once—A message will never be lost, but it may be reread by a
consumer.
◼ Exactly once—A message is never lost and is read by a consumer once and
only once.
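◼ A minimal sketch of where the first two guarantees come from (broker and
process are hypothetical stand-ins, not a real client library): the order of
acknowledging versus processing a message decides the semantics:

def consume_at_most_once(broker):
    msg = broker.next_message()
    broker.ack(msg)   # acknowledge first: if processing crashes below,
    process(msg)      # the message is lost but never redelivered

def consume_at_least_once(broker):
    msg = broker.next_message()
    process(msg)      # process first: if the crash happens before the ack,
    broker.ack(msg)   # the broker redelivers, so the message may be reread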

11
Event time vs. processing time

◼ Each piece of data generated by a source, such as an IoT device, is called an event.
◼ Within any data processing system, there are typically two domains of time we care
about:
◼ Event time: The time when the event was created. The time information is
provided by the local clock of the device generating the event.
◼ Processing time: The time when the event is handled by the stream-processing
system. This is the clock of the server running the processing logic. It’s usually
relevant for technical reasons like computing the processing lag or as criteria to
determine duplicated output.
◼ The differentiation between these two timelines (event time and processing time)
becomes very important when we need to correlate, order, or aggregate events
with respect to one another; the sketch below computes the lag between them.
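◼ As a small illustration (the event fields are made up for the example), the
processing lag is simply the difference between the two timestamps:

from datetime import datetime, timezone

# hypothetical event carrying the timestamp assigned by the device's local clock
event = {"device": "sensor-42",
         "event_time": datetime(2024, 1, 1, 12, 4, tzinfo=timezone.utc)}

# processing time comes from the clock of the server running the logic
processing_time = datetime.now(timezone.utc)

lag = processing_time - event["event_time"]
print(f"processing lag: {lag.total_seconds()} seconds")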

12
Windowing

◼ To cope with the infinite nature of unbounded data sets, streaming systems
typically provide some notion of windowing the incoming data.
◼ Windowing essentially means chopping up a data set into finite pieces along
temporal boundaries.
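◼ For intuition, assigning an event to a fixed-size (tumbling) window reduces to
integer arithmetic on its timestamp; a minimal sketch:

WINDOW_SIZE = 600  # 10-minute windows, in seconds

def window_for(event_time_epoch: int) -> tuple:
    # each event falls into exactly one window [start, start + WINDOW_SIZE)
    start = (event_time_epoch // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

print(window_for(1700000123))  # -> (1699999800, 1700000400)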

13
Stateful vs Stateless Stream Processing

◼ Stateful stream processing
◼ Refers to any stream processing that looks to past information to obtain its
result.
◼ It's necessary to maintain some state information in the process of computing
the next element of the stream.
◼ Requires more computing resources to produce a result and may need to
traverse a stream and keep intermediate values at each step.
◼ Stateless stream processing
◼ Doesn't require knowledge of past elements or any state of the stream.
◼ Requires fewer computing resources.
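◼ A toy contrast of the two styles over a stream of numbers (plain Python
generators, not a streaming framework):

def stateless_double(stream):
    # stateless: each element is handled in isolation, with no memory of the past
    for x in stream:
        yield x * 2

def stateful_running_sum(stream):
    # stateful: the running total must be kept between elements
    total = 0
    for x in stream:
        total += x
        yield total

print(list(stateless_double([1, 2, 3])))      # [2, 4, 6]
print(list(stateful_running_sum([1, 2, 3])))  # [1, 3, 6]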

14
Spark Streaming Architecture

15
Design Goals of Spark Streaming

◼ Low latency

◼ Exactly-once event processing

◼ Linear scalability

◼ Integration with the Spark core and DataFrame APIs

◼ Unified programming model for both stream and batch operations.

16
Spark Streaming Systems

◼ Spark has two streaming systems:
1. Spark Streaming (SS)
◼ Based on RDDs
2. Spark Structured Streaming (SSS)
◼ Based on DataFrames
◼ With APIs in the Scala, Java, Python, and R programming languages.

17
SS Architecture

◼ SS introduces the concept of discretized streams, or DStreams.
◼ DStreams are essentially batches of data stored in multiple RDDs.
◼ Each batch represents a time window, typically a few seconds long.
◼ The resultant RDDs can then be processed using the core Spark RDD API and all
the available transformations and actions discussed earlier; a minimal sketch follows.
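◼ A minimal sketch of the DStream API (a word count over a socket source, along
the lines of the Spark Streaming programming guide):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# each batch of lines arriving on the socket becomes an RDD inside the DStream
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()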

18
Limitations of SS

◼ Different API for batch and streaming data

◼ No direct support for event time processing


◼ SS is based on batch times
◼ Very difficult to process events by their event time
◼ Hard to deal with late data

◼ No built-in end-to-end guarantee


◼ Must be handled in code
◼ Exactly-once processing is very complex

19
Spark Structured Streaming (SSS)

20
Spark Structured Streaming (SSS) Architecture

◼ Stream processing in Spark is not limited to the RDD API.
◼ With Structured Streaming, stream processing is fully integrated with the Spark
DataFrame API as well.
◼ Using Structured Streaming, streaming data sources are treated as an unbounded
table that is continually appended to.
◼ SQL queries can be run against these tables much as they can against tables
representing static DataFrames.

21
SSS Programming Model

◼ The basic idea in SSS is to consider the input data stream as the "Input Table". Every
data item arriving on the stream is like a new row being appended to the
Input Table.

22
SSS Programming Model

◼ The key idea in SSS is to treat a live data stream as a table that is being continuously
appended to.
◼ This leads to a new stream processing model that is very similar to a batch
processing model.
◼ You express your streaming computation as a standard batch-like query, as if on a
static table, and Spark runs it as an incremental query on the unbounded input
table.

23
Spark Structured Streaming (SSS)

◼ The Spark Structured Streaming (SSS) processing engine is:
◼ scalable
◼ end-to-end exactly-once
◼ built on the Spark SQL engine
◼ fault-tolerant, with guarantees provided through checkpointing and write-ahead logs
◼ It is a stream processing engine that frees the user from having to reason about streaming.
◼ You can express your streaming computation the same way you would express a
batch computation on static data.
◼ The Spark SQL engine will take care of running it incrementally and continuously,
updating the final result as streaming data continues to arrive.
◼ You can use the DataFrame API in Python to express streaming aggregations, event-
time windows, stream-to-batch joins, etc.

24
SSS Input Sources

◼ File source:
◼ Reads files written in a directory as a stream of data.
◼ Files will be processed in the order of file modification time, unless specified.
◼ Supported file formats are text, CSV, JSON, ORC, Parquet.
◼ Kafka source: Reads records from Kafka topics.
◼ Socket source:
◼ Reads UTF8 text data from a socket connection.
◼ Used only for testing
◼ Rate source:
◼ Generates data at the specified number of rows per second
◼ Each output row contains a value and a timestamp.
◼ Used only for testing
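◼ A sketch of creating streaming DataFrames from two of these sources (assuming
an existing SparkSession named spark; the path and schema are illustrative):

# Rate source: synthetic rows with {timestamp, value}, useful for testing
rate_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 10) \
    .load()

# File source: picks up new CSV files dropped into a directory
# (file sources require an explicit schema; this one is made up)
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([StructField("device", StringType()),
                     StructField("signal", DoubleType())])
file_df = spark.readStream \
    .format("csv") \
    .schema(schema) \
    .load("/data/incoming/")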

25
SSS Windows …

◼ SSS supports three types of time windows: tumbling, sliding, and session
◼ Tumbling
◼ fixed-sized
◼ non-overlapping
◼ contiguous time intervals.
◼ An event can only be bound to a single window.
◼ Sliding
◼ fixed-sized
◼ windows can overlap
◼ An event can be bound to multiple windows.
◼ Session
◼ dynamic size that depends on a gap duration between input events
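◼ A sketch of declaring each window type with the DataFrame API (assuming a
streaming DataFrame events with a timestamp column ts; session_window requires
Spark 3.2 or later):

from pyspark.sql.functions import window, session_window

# Tumbling: 10-minute, non-overlapping windows
tumbling = events.groupBy(window(events.ts, "10 minutes")).count()

# Sliding: 10-minute windows starting every 5 minutes (they overlap)
sliding = events.groupBy(window(events.ts, "10 minutes", "5 minutes")).count()

# Session: a window closes after a 5-minute gap with no events
sessions = events.groupBy(session_window(events.ts, "5 minutes")).count()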

26
… SSS Windows

27
SSS Processing Modes

◼ SSS has two processing modes, namely:
◼ Micro-batch (default)
◼ Continuous
◼ Micro-batch mode
◼ Data streams are processed as a series of small batch jobs
◼ Has end-to-end latencies as low as 100 milliseconds
◼ Guarantees exactly-once processing and fault tolerance
◼ Continuous mode
◼ Has end-to-end latencies as low as 1 millisecond
◼ Guarantees at-least-once processing
◼ Without changing the DataFrame operations in your queries, you can choose the
mode based on your application's requirements, as sketched below.
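◼ A sketch of selecting the mode via the trigger on the writer (assuming a
streaming DataFrame df):

# Micro-batch mode (default): kick off a batch every second
micro = df.writeStream \
    .format("console") \
    .trigger(processingTime="1 second") \
    .start()

# Continuous mode: the interval is the checkpoint frequency, not a batch interval
cont = df.writeStream \
    .format("console") \
    .trigger(continuous="1 second") \
    .start()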

28
SSS Operations

◼ Because SSS is built on DataFrame API, most DataFrame operations are available,
including the following:

◼ Filtering records

◼ Projecting columns

◼ Performing column-level transformations using built-in or user-defined


functions

◼ Grouping records and aggregating columns

◼ Joining streaming DataFrames with static DataFrames (with some limitations)
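◼ A sketch of the last point, enriching a stream with a static lookup table (the
file path and column names are illustrative):

# static lookup table, loaded once at query start
devices = spark.read.csv("/data/devices.csv", header=True)  # device, region

# streaming_df is assumed to be a streaming DataFrame with a 'device' column
enriched = streaming_df.join(devices, on="device", how="inner")

query = enriched.writeStream.format("console").start()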

29
SSS Operations

◼ Some operations in the DataFrame API are not available with streaming
DataFrames, including the following:

◼ limit and take(n) operations

◼ distinct operations

◼ sort operations (supported only in complete output mode after an aggregation)

◼ Full outer join operations

◼ Any type of join between two streaming DataFrames

◼ Additional conditions on left and right outer join operations

30
Operations on Streaming DataFrames

◼ You can apply all kinds of SQL-like and RDD-like operations on streaming DataFrames.
◼ Example:
# Create a streaming DataFrame with schema {device, deviceType, signal, time}
df = ...

# Select the devices which have a signal greater than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()

# Register a streaming DataFrame as a temporary view and then apply SQL commands
df.createOrReplaceTempView("Tab1")
spark.sql("select count(*) from Tab1")

31
SSS Processing Model

32
Window Operations on Event Time

◼ Aggregations over a sliding event-time window are straightforward with Structured
Streaming and are very similar to grouped aggregations.
◼ In a grouped aggregation, aggregate values (e.g., counts) are maintained for each
unique value in the user-specified grouping column.
◼ In window-based aggregations, aggregate values are maintained for each
window that the event time of a row falls into.
◼ Example: frequency of words within 10-minute windows, updating every 5 minutes
# Create a streaming DataFrame, df1, with schema {timestamp, word}
df1 = ...

# Group the data by window and word and compute the count of each group
df2 = df1.groupBy(window(df1.timestamp, "10 minutes", "5 minutes"),
                  df1.word).count()

33
Window Operations on Event Time

34
Input Data and Intermediate Results

◼ SSS does not materialize the entire input table.
◼ It reads the latest available data from the streaming data source, processes it
incrementally to update the result, and then discards the source data.
◼ It only keeps around the minimal intermediate state required to update the
result.
◼ SSS is responsible for updating the Result Table when there is new data, thus
relieving users from reasoning about it.
◼ This makes SSS significantly different from many other stream processing engines,
which require the user to maintain running aggregations themselves and thus to
reason about fault tolerance and data consistency (at-least-once, at-most-once, or
exactly-once).

35
Handling Late Data …

◼ What happens if one of the events arrives late at the application?
◼ For example, say, a word generated at 12:04 (i.e. event time) could be received by
the application at 12:11.
◼ The application should use the time 12:04 instead of 12:11 to update the older
counts for the window 12:00 - 12:10.
◼ This occurs naturally in our window-based grouping – Structured Streaming can
maintain the intermediate state for partial aggregates for a long period of time such
that late data can update aggregates of old windows correctly.

36
Handling Late Data

37
Handling Late Data and Watermarking

◼ To run streaming queries for days, it’s necessary for the system to bound the
amount of intermediate in-memory state it accumulates.
◼ It does this using watermarking
◼ For example, if the following query is run in Update output mode, the engine will
keep updating counts of a window in the Result Table until the window is older than
the watermark, which lags behind the current event time in column “timestamp” by
10 minutes.
df2 = df1 \
.withWatermark("timestamp", "10 minutes") \
.groupBy(
window(df1.timestamp, "10 minutes", "5 minutes"),
df1.word) \
.count()

38
SSS Output Sinks

◼ File sink: Stores the output to a directory.

◼ Kafka sink: Stores the output to one or more topics in Kafka.

◼ Foreach sink: For running arbitrary computation on the records in the output.

◼ Console sink
◼ Prints the output to the console every time there is a trigger.
◼ Used for debugging

◼ Memory sink
◼ The output is stored in memory as an in-memory table.
◼ Used for debugging
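◼ A sketch of the foreach-style sink using foreachBatch, which hands every
micro-batch to a user function as a regular DataFrame (the output path is
illustrative; df is assumed to be a streaming DataFrame):

def write_batch(batch_df, batch_id):
    # runs once per micro-batch; batch_df is a normal, static DataFrame
    batch_df.write.mode("append").parquet(f"/data/out/batch={batch_id}")

query = df.writeStream \
    .foreachBatch(write_batch) \
    .option("checkpointLocation", "/tmp/chk") \
    .start()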

39
SSS Output Modes

◼ The SSS output is defined as what gets written out to the external storage.

◼ The output can be defined in one of the following three modes:

◼ Complete Mode - The entire updated Result Table will be written to the
external storage.

◼ Append Mode - Only the new rows appended in the Result Table since the last
trigger will be written to the external storage.

◼ Update Mode - Only the rows that were updated in the Result Table since the
last trigger will be written to the external storage.

40
Example of Complete Output Mode

41
Fault Tolerance Semantics

◼ Delivering end-to-end exactly-once semantics was one of the key goals behind the
design of SSS.
◼ To achieve that, the SSS sources, sinks, and execution engine are designed to
reliably track the exact progress of the processing, so that any kind of failure
can be handled by restarting and/or reprocessing.
◼ SSS sources are assumed to have offsets to track the read position in the
stream.
◼ The SSS execution engine uses checkpointing and write-ahead logs to record the
offset range of the data being processed in each trigger.
◼ SSS sinks are designed to be idempotent, so they can handle reprocessing.
◼ Using replayable sources and idempotent sinks, SSS can ensure end-to-end exactly-
once semantics under any failure.

42
Recovering from Failures with Checkpointing

◼ In case of a failure or intentional shutdown, you can recover the previous progress
and state of a query and continue where it left off.

◼ This is done using checkpointing and write-ahead logs.

◼ You can configure a query with a checkpoint location, and the query will save all the
progress information and the running aggregates to the checkpoint location.

◼ This checkpoint location has to be a path in an HDFS compatible file system and can
be set as an option in the DataStreamWriter when starting a query.
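◼ A sketch of setting the checkpoint location on the DataStreamWriter (the path is
illustrative; counts is assumed to be a streaming aggregation DataFrame):

query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/checkpoints/my-query") \
    .start()

# after a crash or shutdown, restarting the same query with the same checkpoint
# location resumes from the recorded offsets and running aggregates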

43
Security for streaming systems

44
Example: Word count …

◼ First, we have to import the necessary classes and create a local SparkSession, the
starting point of all functionalities related to Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

45
… Example: Word count …

◼ Next, create a streaming DataFrame that represents text data received from a
server listening on localhost:9999, and transform the DataFrame to calculate word
counts.
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(explode(split(lines.value, " ")).alias("word"))

# Generate running word count
wordCounts = words.groupBy("word").count()

46
… Example: Word count …

◼ This lines DataFrame is an unbounded table containing the streaming text data.
◼ This table contains one column of strings named “value”, and each line in the
streaming text data becomes a row in the table.
◼ Next, we use two built-in SQL functions, split and explode, to split each line
into multiple rows with one word each.
◼ In addition, we use the function alias to name the new column "word".
◼ Finally, we have defined the wordCounts streaming DataFrame by grouping by the
unique values in the Dataset and counting them.
◼ Note that this is a streaming DataFrame which represents the running word counts
of the stream.
◼ Note, currently we are just setting up the transformation, and have not yet started
receiving any data.

47
… Example: Word count

◼ We have now set up the query on the streaming data. All that is left is to actually
start receiving data and computing the counts.
◼ To do this, we set it up to print the complete set of counts (specified by
outputMode("complete")) to the console every time they are updated.
◼ Then start the streaming computation using start().

query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()

query.awaitTermination()

48
Acknowledgment

◼ Many of the figures in the slides are taken from Google Images, and many of them
belong to Simplilearn, Great Learning, Learning Journal, and Edureka.

49
Refs

◼ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
◼ https://www.youtube.com/watch?v=RLfTxtgeVhM
◼ https://www.youtube.com/watch?v=oMZwVRdDBJI

50
Classifications of Real-time Systems

51
END
