Professional Documents
Culture Documents
Spark/Structured
Streaming
16
17
1
3/24/2024
Motivation
How can we harness the data in real time?
Value can quickly degrade → capture value
Immediately
From reactive analysis to direct operational impact
Unlocks new competitive advantages
Requires a completely new approach...
A distributed stream processing framework is required to
Scale to large clusters (100s of machines)
Achieve low latency (few seconds)
18
Why Streaming?
“Without stream processing, there’s no big data and no Internet of Things” – Dana Sandu, SQLstream
Operational Efficiency - 1 extra mph for a locomotive on its daily route can lead to $200M in savings (Norfolk Southern)
Tracking Behavior – Mcdonald’s (Netherlands) realized a 700% increase in offer redemptions using personalized advertising
Predict machine failure - GE monitors over 5500 assets from 70+ customer sites globally. Can predict failure and
Improving Traffic Safety and Efficiency – According to the EU Commission, congestion in EU urban areas costs ~ €100
19
2
3/24/2024
20
21
3
3/24/2024
Spark + Streaming
22
Spark Streaming
23
4
3/24/2024
24
25
5
3/24/2024
26
mutable state
Traditional
streaming has a record-at-
a-time processing model input
• Each node has a mutable state records
• For each record, update the state
and send new records node 1
node 2
27
6
3/24/2024
Storm
• Replays record if not processed by a node
• Processes each record at least once
• May update mutable state twice!
• Mutable state can be lost due to failure!
Trident – Use transactions to update state
• Processes each record exactly once
• Per state transaction to external database is slow
28
29
7
3/24/2024
High-level
Architecture
30
Spark Streaming
31
8
3/24/2024
32
Spark
processed
results
33
9
3/24/2024
34
35
10