Outline
Definition
◼ Data at rest exists in the form of a file, the contents of a database, or some other kind of record; streaming data, by contrast, is data in motion that is processed as it arrives.
Why Data Streams
◼ The Big Data and analytics market has reached hundreds of billions of dollars.
◼ In the future, a large chunk of this will be attributed to IoT.
◼ According to IBM, 60% of all sensory information loses value within a few milliseconds. Some data is like fresh fruit: it must be consumed quickly.
◼ The inability to process data in real time can result in a loss of billions of dollars.
◼ Examples of such applications include:
◼ A telco working out how many of its users have used WhatsApp in the last 30 minutes.
◼ A retailer keeping track of the number of people who have said positive things about its products today on social media.
◼ A law enforcement agency looking for a suspect using data from traffic CCTV.
Examples of Stream Processing Applications
◼ Device monitoring
◼ Fault detection
◼ Fleet management
◼ Online recommendations
◼ Faster loans
The Streaming Data Architectural Blueprint
The Streaming Data Architectural Blueprint: Example
Collection Tier
◼ The collection tier is the entry point for bringing data into the streaming system.
◼ Regardless of the protocol a client uses to send data to the collection tier (or, in certain cases, whether the collection tier reaches out and pulls the data in), only a limited number of interaction patterns are in use today:
◼ Request/response pattern
◼ Publish/subscribe pattern
◼ One-way pattern
◼ Request/acknowledge pattern
◼ Stream pattern
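The publish/subscribe pattern in particular can be sketched in a few lines of plain Python (class and method names here are illustrative, not from any particular library):

```python
# Minimal publish/subscribe sketch: subscribers register callbacks per
# topic, and the broker fans each published message out to every
# subscriber of that topic. The publisher gets no response back.
class PubSubBroker:
    def __init__(self):
        self.subscribers = {}   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers.get(topic, []):
            callback(message)

broker = PubSubBroker()
received = []
broker.subscribe("sensors", received.append)
broker.publish("sensors", {"device": 7, "temp": 21.5})
```

A real collection tier would replace the in-process callback list with network clients, but the decoupling between publisher and subscriber is the same.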
Message Queuing Tier
◼ The message queuing tier decouples the collection tier from the analytics tier.
◼ Decoupling allows the tiers to work at a higher level of abstraction, passing messages rather than making explicit calls to each other.
◼ The message queuing world has three main components:
◼ Producer
◼ Broker
◼ Consumer
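The relationship between the three components can be sketched in plain Python using the standard library's Queue as a toy stand-in for a real broker such as Kafka (SimpleBroker is an illustrative name):

```python
# Tiny sketch of the three message queuing roles: the producer and the
# consumer never call each other directly; the broker's queue sits
# between them and decouples them.
from queue import Queue

class SimpleBroker:
    def __init__(self):
        self.queue = Queue()

    def send(self, message):          # called by the producer
        self.queue.put(message)

    def receive(self):                # called by the consumer
        return self.queue.get()

broker = SimpleBroker()
for event in ["login", "click", "logout"]:       # producer side
    broker.send(event)
consumed = [broker.receive() for _ in range(3)]  # consumer side
# consumed == ["login", "click", "logout"]
```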
Message Queuing Tier
Message delivery semantics
◼ A producer sends messages to a broker, and a consumer reads messages from a broker. That is a pretty high-level description of how message delivery works.
◼ Looking deeper, there are three common semantic guarantees you will run into when evaluating message queuing products:
◼ At most once: a message may get lost, but it will never be reread by a consumer.
◼ At least once: a message will never be lost, but it may be reread by a consumer.
◼ Exactly once: a message is never lost and is read by a consumer once and only once.
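The difference between the first two guarantees comes down to when the consumer acknowledges a message. A toy, deterministic simulation (not any real broker's API) of a consumer that crashes once and then restarts:

```python
# Simulate delivery to a consumer that crashes once while handling
# `crash_on`, then restarts and drains the queue.
#   ack_before_processing=True  -> at most once (crash loses the message)
#   ack_before_processing=False -> at least once (crash causes a redelivery)
def deliver(messages, crash_on, ack_before_processing):
    pending = list(messages)  # what the broker still considers undelivered
    processed = []
    crashed = False
    while pending:
        msg = pending[0]
        if ack_before_processing:
            pending.pop(0)            # ack first: broker forgets it now
            if not crashed and msg == crash_on:
                crashed = True        # crash before processing -> lost
                continue
            processed.append(msg)
        else:
            processed.append(msg)     # process first ...
            if not crashed and msg == crash_on:
                crashed = True        # ... crash before the ack -> redelivered
                continue
            pending.pop(0)            # ... then ack
    return processed

# At most once: the message being handled during the crash is lost.
assert deliver([1, 2, 3], crash_on=2, ack_before_processing=True) == [1, 3]
# At least once: the message is redelivered and processed twice.
assert deliver([1, 2, 3], crash_on=2, ack_before_processing=False) == [1, 2, 2, 3]
```

Exactly-once additionally requires the broker and consumer to coordinate (for example via idempotent processing or transactions) so the redelivered message is not counted twice.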
Event time vs. processing time
Windowing
◼ To cope with the infinite nature of unbounded data sets, streaming systems
typically provide some notion of windowing the incoming data.
◼ Windowing essentially means chopping up a data set into finite pieces along
temporal boundaries.
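For fixed-size windows, assigning an event to its window(s) is simple arithmetic. A plain-Python sketch (integer timestamps, e.g. seconds, are an assumption for illustration):

```python
def tumbling_window(ts, size):
    """The single [start, end) window of width `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """All [start, end) windows of width `size`, starting every `slide`
    time units, that contain ts. When slide divides size there are
    size // slide of them."""
    start = ts - ts % slide        # latest window start at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# An event at t=125 falls in exactly one 60-unit tumbling window ...
assert tumbling_window(125, 60) == (120, 180)
# ... but in two 600-unit windows that slide every 300 units.
assert sliding_windows(700, size=600, slide=300) == [(300, 900), (600, 1200)]
```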
Stateful vs Stateless Stream Processing
Spark Streaming Architecture
Design Goals of Spark Streaming
◼ Low latency
◼ Linear scalability
Spark Streaming Systems
1. Spark Streaming (SS)
◼ Based on RDDs
2. Spark Structured Streaming (SSS)
◼ Based on DataFrames
◼ With APIs in the Scala, Java, Python, and R programming languages.
SS Architecture
Limitations of SS
Spark Structured Streaming (SSS)
Spark Structured Streaming (SSS) Architecture
SSS Programming Model
◼ The basic idea in SSS is to consider the input data stream as an "Input Table". Every data item arriving on the stream is like a new row being appended to the Input Table.
SSS Programming Model
◼ The key idea in SSS is to treat a live data stream as a table that is being continuously appended to.
◼ This leads to a new stream processing model that is very similar to a batch processing model.
◼ You express your streaming computation as a standard batch-like query, as if on a static table, and Spark runs it as an incremental query on the unbounded input table.
Spark Structured Streaming (SSS)
SSS Input Sources
◼ File source:
◼ Reads files written to a directory as a stream of data.
◼ Files are processed in order of file modification time, unless specified otherwise.
◼ Supported file formats are text, CSV, JSON, ORC, and Parquet.
◼ Kafka source:
◼ Reads data from Kafka topics.
◼ Socket source:
◼ Reads UTF-8 text data from a socket connection.
◼ Used only for testing.
◼ Rate source:
◼ Generates data at a specified number of rows per second.
◼ Each output row contains a value and a timestamp.
◼ Used only for testing.
SSS Windows …
◼ SSS supports three types of time windows: tumbling, sliding, and session.
◼ Tumbling:
◼ fixed-sized
◼ non-overlapping
◼ contiguous time intervals
◼ An event can be bound to only a single window.
◼ Sliding:
◼ fixed-sized
◼ windows can overlap
◼ An event can be bound to multiple windows.
◼ Session:
◼ dynamic size, which depends on the input gap duration
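The dynamic sizing of session windows can be sketched in plain Python (an illustrative model, not Spark's implementation; it assumes event times arrive sorted):

```python
def session_windows(timestamps, gap):
    """Group sorted event times into sessions: an event within `gap` of
    the previous event extends the current session; a larger gap starts
    a new one. Each session window ends `gap` after its last event."""
    sessions = []
    current = [timestamps[0]]
    for t in timestamps[1:]:
        if t - current[-1] <= gap:
            current.append(t)                          # extend the session
        else:
            sessions.append((current[0], current[-1] + gap))
            current = [t]                              # start a new session
    sessions.append((current[0], current[-1] + gap))
    return sessions

# Events at 0, 2, 3 form one session; 50, 51 form another (gap = 10).
assert session_windows([0, 2, 3, 50, 51], gap=10) == [(0, 13), (50, 61)]
```

This is why session windows have no fixed size: the window grows as long as events keep arriving within the gap duration.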
… SSS Windows
SSS Processing Modes
SSS Operations
◼ Because SSS is built on the DataFrame API, most DataFrame operations are also available on streaming DataFrames, including the following:
◼ Filtering records
◼ Projecting columns
SSS Operations
◼ Some operations in the DataFrame API are not available with streaming
DataFrames, including the following:
◼ distinct operations
Operations on Streaming DataFrames
◼ You can apply all kinds of SQL and DataFrame operations on streaming DataFrames.
◼ Example:
# Create a streaming DataFrame with schema {device, deviceType, signal, time}
df = ...
# Register the streaming DataFrame as a temporary view, then apply SQL commands
df.createOrReplaceTempView("Tab1")
spark.sql("select count(*) from Tab1")
SSS Processing Model
Window Operations on Event Time
# Group the data by window and word and compute the count of each group
df2 = df1.groupBy(
    window(df1.timestamp, "10 minutes", "5 minutes"),
    df1.word
).count()
Window Operations on Event Time
Input Data and Intermediate Results
Handling Late Data …
Handling Late Data
Handling Late Data and Watermarking
◼ To run streaming queries for days, the system must bound the amount of intermediate in-memory state it accumulates.
◼ It does this using watermarking.
◼ For example, if the following query is run in Update output mode, the engine keeps updating the counts of a window in the Result Table until the window is older than the watermark, which lags behind the maximum event time seen in column "timestamp" by 10 minutes.
df2 = df1 \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(df1.timestamp, "10 minutes", "5 minutes"),
        df1.word) \
    .count()
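The eviction rule can be sketched in plain Python (a toy model of the state bookkeeping, not Spark's actual internals):

```python
# The watermark trails the maximum event time seen by the allowed delay;
# any window whose end falls at or before the watermark is finalized and
# dropped from the in-memory state.
def evict_old_windows(window_counts, max_event_time, delay):
    watermark = max_event_time - delay
    return {w: c for w, c in window_counts.items() if w[1] > watermark}

state = {(0, 600): 4, (600, 1200): 2}   # (start, end) -> count
# Max event time 1300 with a 600-unit delay puts the watermark at 700,
# so the (0, 600) window is dropped and (600, 1200) is kept.
assert evict_old_windows(state, max_event_time=1300, delay=600) == {(600, 1200): 2}
```

Late data falling into an evicted window is simply discarded, which is the trade-off that keeps state bounded.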
SSS Output Sinks
◼ Foreach sink:
◼ Runs arbitrary computation on the records in the output.
◼ Console sink:
◼ Prints the output to the console every time there is a trigger.
◼ Used for debugging.
◼ Memory sink:
◼ The output is stored in memory as an in-memory table.
◼ Used for debugging.
SSS Output Modes
◼ The SSS output mode defines what gets written out to the external storage.
◼ Complete Mode - The entire updated Result Table will be written to the
external storage.
◼ Append Mode - Only the new rows appended in the Result Table since the last
trigger will be written to the external storage.
◼ Update Mode - Only the rows that were updated in the Result Table since the
last trigger will be written to the external storage.
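The three modes can be sketched in plain Python as rules over the previous and current Result Table (an illustrative model; Spark applies these rules internally, and in Append mode only finalized rows are ever written):

```python
# Given the previous and current Result Table as {row_key: value} dicts,
# decide which rows each output mode would write at this trigger.
def rows_to_write(previous, current, mode):
    if mode == "complete":
        return dict(current)   # the entire updated Result Table
    if mode == "update":
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        return {k: v for k, v in current.items() if k not in previous}

prev = {"cat": 1, "dog": 2}
curr = {"cat": 1, "dog": 3, "owl": 1}
assert rows_to_write(prev, curr, "complete") == curr
assert rows_to_write(prev, curr, "update") == {"dog": 3, "owl": 1}
assert rows_to_write(prev, curr, "append") == {"owl": 1}
```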
Example of Complete Output Mode
Fault Tolerance Semantics
◼ Delivering end-to-end exactly-once semantics was one of the key goals behind the design of SSS.
◼ To achieve that, the SSS sources, sinks, and execution engine are designed to reliably track the exact progress of the processing, so that any kind of failure can be handled by restarting and/or reprocessing.
◼ SSS sources are assumed to have offsets to track the read position in the stream.
◼ The SSS execution engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
◼ SSS sinks are designed to handle reprocessing.
◼ Using replayable sources and idempotent sinks, SSS can ensure end-to-end exactly-once semantics under any failure.
Recovering from Failures with Checkpointing
◼ In case of a failure or intentional shutdown, you can recover the progress and state of a previous query and continue where it left off.
◼ You configure a query with a checkpoint location, and the query saves all the progress information and the running aggregates to that location.
◼ The checkpoint location has to be a path in an HDFS-compatible file system, and it can be set as an option on the DataStreamWriter when starting a query.
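Setting the checkpoint location looks like this in PySpark (a sketch: df stands for a previously defined streaming DataFrame, and the path is a placeholder):

```python
# 'df' is a previously defined streaming DataFrame; the path is a placeholder.
query = df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "hdfs:///tmp/checkpoints/my-query") \
    .start()
```

Restarting the same query with the same checkpoint location resumes from the recorded offsets and state.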
Security for streaming systems
Example: Word count …
◼ First, we import the necessary classes and create a local SparkSession, the entry point to all functionality related to Spark.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
… Example: Word count …
◼ Next, create a streaming DataFrame that represents text data received from a
server listening on localhost:9999, and transform the DataFrame to calculate word
counts.
lines = spark.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
… Example: Word count …
◼ This lines DataFrame is an unbounded table containing the streaming text data.
◼ This table contains one column of strings named “value”, and each line in the
streaming text data becomes a row in the table.
◼ Next, we use two built-in SQL functions, split and explode, to split each line into multiple rows with one word each.
◼ In addition, we use the function alias to name the new column "word".
◼ Finally, we define the wordCounts streaming DataFrame by grouping by the unique values in the Dataset and counting them.
◼ Note that this is a streaming DataFrame which represents the running word counts
of the stream.
◼ Note, currently we are just setting up the transformation, and have not yet started
receiving any data.
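In code, the transformations described above look like this (following the official programming guide's word count example; lines is the streaming DataFrame created earlier):

```python
from pyspark.sql.functions import explode, split

# Split each line into words (one output row per word), then count
# the occurrences of each distinct word.
words = lines.select(
    explode(split(lines.value, " ")).alias("word")
)
wordCounts = words.groupBy("word").count()
```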
… Example: Word count
◼ We have now set up the query on the streaming data. All that is left is to actually
start receiving data and computing the counts.
◼ To do this, we set it up to print the complete set of counts (specified by
outputMode("complete")) to the console every time they are updated.
◼ Then start the streaming computation using start().
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
Acknowledgment
◼ Many of the figures in these slides are taken from Google Images; many of them belong to Simplilearn, Great Learning, Learning Journal, and Edureka.
Refs
◼ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
◼ https://www.youtube.com/watch?v=RLfTxtgeVhM
◼ https://www.youtube.com/watch?v=oMZwVRdDBJI
Classifications of Real-time Systems
END