The Rise of Stream Processing
● Stream Processing
○ Information is always up to date
○ Computes something small and relatively simple
○ Computes in near real time, within seconds at most, so we go for real-time processing
Oops!
How can we collect events in real time?
Hands-On #01
● Real-time data collection with Logstash
● Kafka as a unified, high-throughput, low-latency platform for handling real-time data feeds
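To make the hands-on concrete, here is a minimal consumer sketch that reads back the events Logstash ships into Kafka. The broker address (localhost:9092) and the topic name (weblogs) are assumptions for illustration; adjust them to the lab environment.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object LogTail {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.put("group.id", "hands-on-01")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("weblogs"))  // assumed topic fed by Logstash

    // Print each event as Logstash publishes it to Kafka.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach(r => println(r.value()))
    }
  }
}
```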
Stream Processing
Kafka
Spark Streaming
We need a solution for
stream processing
Why Spark?
● Unified analytics engine for large-scale data processing
● Combines streaming data with static datasets and interactive queries
● Native integration with advanced processing libraries (SQL,
machine learning, graph processing)
● All-in-one framework
Spark Streaming
Spark Streaming Internals
Hands-On #02
● Number of requests to a URL every minute
● Modify the application to also count IP accesses made with the GET method (see the sketch after the host list below)
118.68.170.134 vm-01.cse87.higio.net
118.68.168.182 vm-02.cse87.higio.net
118.68.170.148 vm-03.cse87.higio.net
/etc/hosts
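A minimal Spark Streaming sketch for this exercise, assuming access-log lines arrive on a socket on vm-01 and follow the common log format; the port, field positions, and the 60-second batch interval are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UrlRequestCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UrlRequestCount").setMaster("local[2]")
    // One batch per minute -> "number of requests to a URL every minute".
    val ssc = new StreamingContext(conf, Seconds(60))

    // Assumed source: access-log lines pushed to a socket on vm-01.
    val lines = ssc.socketTextStream("vm-01.cse87.higio.net", 9999)

    // Assumed layout (common log format): ip - user [date tz] "METHOD url protocol" status size
    val fields = lines.map(_.split(" "))

    // Requests per URL in each one-minute batch.
    val urlCounts = fields.map(f => (f(6), 1)).reduceByKey(_ + _)
    urlCounts.print()

    // Extension for the second bullet: accesses per IP, GET requests only.
    val getIpCounts = fields
      .filter(f => f.length > 6 && f(5).contains("GET"))
      .map(f => (f(0), 1))
      .reduceByKey(_ + _)
    getIpCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```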
Discretized Streams (DStreams)
Discretized Streams (DStreams)
● Uses a micro-batching mechanism
● Represents a continuous stream of data as a sequence of
RDDs
● Each RDD contains data from a certain interval
● Underlying RDD transformations are computed by the
Spark engine
Discretized Streams (DStreams)
Receivers
● Every input DStream (except file stream) is associated with
a Receiver
○ receives the data from a source
○ stores it in Spark’s memory for processing
● We can have multiple input DStreams (multiple receivers)
● The number of cores allocated (or threads in local mode) must be greater than the number of receivers
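A sketch of the last two bullets, assuming three socket sources; with three receivers the local master must provide at least four threads so that one is left for processing. Hosts and ports are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiver {
  def main(args: Array[String]): Unit = {
    // Three receivers -> ask for at least 4 local threads (3 receivers + processing).
    val conf = new SparkConf().setAppName("MultiReceiver").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Each input DStream gets its own receiver.
    val streams = Seq(9997, 9998, 9999).map(port => ssc.socketTextStream("localhost", port))

    // Union the partial streams into one DStream before processing.
    ssc.union(streams).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```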
Receiver Reliability
● Reliable Receiver
○ Sends acknowledgment to a reliable source when the data
has been received and stored in Spark with replication
● Unreliable Receiver
○ Does not send acknowledgment to a source -> data may
be lost
Stateful Streaming
● Processing pipelines must maintain state across a period of
time
● One option is to store state in an external database such as Redis
Stateful Streaming
● Spark offers stateful streaming via the mapWithState transformation
● State can also be updated with the older updateStateByKey transformation
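A minimal sketch of both transformations, assuming a DStream of (userId, 1) pairs built from a local socket; the running count per user is kept by Spark between batches, and a checkpoint directory is required for stateful operations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("StatefulCounts").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/checkpoints")   // required for stateful transformations

    // Assumed input: one user id per line on a local socket.
    val events = ssc.socketTextStream("localhost", 9999).map(user => (user, 1))

    // mapWithState: updates state only for the keys seen in the current batch.
    val mappingFunc = (user: String, one: Option[Int], state: State[Int]) => {
      val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
      state.update(newCount)
      (user, newCount)
    }
    events.mapWithState(StateSpec.function(mappingFunc)).print()

    // updateStateByKey: older API that touches every key on every batch.
    val totals = events.updateStateByKey[Int] { (newValues: Seq[Int], running: Option[Int]) =>
      Some(running.getOrElse(0) + newValues.sum)
    }
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```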
Window Operations
Window Operations
● Apply transformations over a sliding window of data
● window length:
○ The duration of the window (e.g., 3 batch intervals)
● sliding interval:
○ The interval at which the window operation is performed (e.g., every 2 batch intervals)
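A sketch matching the numbers above, assuming a 10-second batch interval and that `pairs` is a DStream[(String, Int)] such as the (url, 1) pairs from the earlier hands-on: window length of 3 intervals (30 s), sliding interval of 2 intervals (20 s).

```scala
import org.apache.spark.streaming.Seconds

// Assuming `pairs` is a DStream[(String, Int)] built on a 10-second batch interval.
// Window of 3 batch intervals (30 s), recomputed every 2 intervals (20 s).
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // how to combine counts inside the window
  Seconds(30),                 // window length
  Seconds(20)                  // sliding interval
)
windowedCounts.print()
```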
Checkpointing
● A streaming application must operate 24/7 -> must be
resilient to failures unrelated to the application logic
● Needs to checkpoint enough information to recover from
failures
Checkpointing
● Metadata checkpointing: Saving of the information defining
the streaming computation to fault-tolerant storage
○ Configuration, DStream operations, incomplete batches
○ Needed for recovery from driver failures
Checkpointing
● Data checkpointing
○ Saving of the generated RDDs to reliable storage
○ Needed by stateful transformations that combine data across multiple batches
○ Necessary even for basic functioning when such transformations are used
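A sketch of driver-failure recovery with getOrCreate, assuming an HDFS checkpoint directory (the path is a placeholder); on a restart the context, its configuration, the DStream graph, and incomplete batches are rebuilt from the checkpoint instead of calling the factory function again.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Assumed checkpoint location; any fault-tolerant filesystem path works.
  val checkpointDir = "hdfs:///user/spark/checkpoints/url-counts"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)   // enables metadata and data checkpointing

    // Define the streaming computation here (sources, transformations, outputs).
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: createContext() is called; after a driver failure the
    // context is restored from the checkpoint data instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```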
Hands-on #03
● Number of active users in the last 2 minutes, updated every 5 seconds
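One way to sketch this with DStreams, assuming user ids arrive one per line on a local socket: a 2-minute window sliding every 5 seconds, counting distinct ids in each window.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object ActiveUsers {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("ActiveUsers").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/active-users")   // window operations need checkpointing

    // Assumed input: one user id per event on a local socket.
    val users = ssc.socketTextStream("localhost", 9999)

    // All events from the last 2 minutes, re-evaluated every 5 seconds.
    val activeUsers = users
      .window(Minutes(2), Seconds(5))
      .transform(rdd => rdd.distinct())
      .count()

    activeUsers.print()   // distinct active users in the last 2 minutes

    ssc.start()
    ssc.awaitTermination()
  }
}
```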
Monitoring
Monitoring
● Processing Time
○ The time to process each batch of data
● Scheduling Delay
○ The time a batch waits in a queue for the processing of
previous batches to finish
● Receiver status and processing times can be accessed through the StreamingListener interface
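A sketch of hooking into batch metrics through StreamingListener, assuming `ssc` is a StreamingContext as in the earlier examples; the two quantities above correspond to processingDelay and schedulingDelay in BatchInfo.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Assuming `ssc` is the StreamingContext from one of the earlier sketches.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Scheduling delay: time spent waiting for previous batches to finish.
    // Processing time: time spent actually processing this batch.
    println(s"batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)} ms")
  }
})
```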
Tuning
● Reducing the processing time of each batch of data by
efficiently using cluster resources
● Setting the right batch size such that the batches of data
can be processed as fast as they are received
Tuning
● Level of Parallelism in Data Receiving
● Level of Parallelism in Data Processing
● Data Serialization
● Task Launching Overheads
● Setting the Right Batch Interval
● Memory Tuning
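A few of the knobs above expressed as a SparkConf fragment to place in the application's setup; the values are illustrative, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TunedStreamingApp")
  // Data serialization: Kryo is usually faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Parallelism in data receiving: a smaller block interval -> more partitions per batch.
  .set("spark.streaming.blockInterval", "100ms")
  // Let Spark adapt the ingestion rate to the current processing speed.
  .set("spark.streaming.backpressure.enabled", "true")

// Setting the right batch interval: start conservatively and shrink it while the
// processing time stays below the batch interval.
val ssc = new StreamingContext(conf, Seconds(10))
```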
Fault-tolerance Semantics
● Unlike Spark's RDDs in a non-streaming application
● Two kinds of data in the system need to be recovered:
○ Data received and replicated (replication factor 2 by default) -> recover from the replica
○ Data received but still buffered for replication -> recover from the source
Message Delivery
Fault-tolerance Semantics
● How many times each record can be processed:
○ At most once: once or not at all -> data may be lost
○ At least once: one or more times -> no data lost, but duplicates are possible
○ Exactly once: exactly once -> no data lost and no duplicates -> the strongest guarantee
Spark Structured Streaming
● Also uses a micro-batching mechanism (since Spark 2.3, Continuous Processing offers ~1 millisecond latency with at-least-once guarantees)
● Built on the Spark SQL engine
● Offers exactly-once delivery with 100+ milliseconds latency
● Expresses the streaming computation as a standard batch-like query, as if it ran on a static table
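A minimal Structured Streaming sketch of the "batch-like query on a stream" idea, assuming a Kafka topic named weblogs and a local broker (both assumptions); the query body is the same DataFrame code one would write on a static table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredUrlCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredUrlCounts").master("local[2]").getOrCreate()
    import spark.implicits._

    // Unbounded input table: every new Kafka record becomes a new row.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
      .option("subscribe", "weblogs")                        // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // The same query we would write on a static table: requests per URL.
    val counts = lines
      .select(split($"line", " ").getItem(6).as("url"))
      .groupBy($"url")
      .count()

    // Complete mode: re-emit the full aggregated result after each micro-batch.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```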
Spark Structured Streaming
Triggers
● Micro-batch mode (by default)
● Fixed interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval (experimental)
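How the trigger modes above look in code, assuming `lines` is the non-aggregated streaming DataFrame from the previous sketch (in practice you would pick one of these); Trigger.Continuous is the experimental continuous-processing mode and supports map-like queries only.

```scala
import org.apache.spark.sql.streaming.Trigger

// Default micro-batch mode: a new batch starts as soon as the previous one finishes.
lines.writeStream.format("console").start()

// Fixed-interval micro-batches: one micro-batch every 10 seconds.
lines.writeStream.format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")).start()

// One-time micro-batch: process whatever is available, then stop.
lines.writeStream.format("console")
  .trigger(Trigger.Once()).start()

// Continuous processing with a 1-second checkpoint interval (experimental).
lines.writeStream.format("console")
  .trigger(Trigger.Continuous("1 second")).start()
```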
Spark Structured Streaming
● Output Mode
○ Complete Mode
○ Append Mode
○ Update Mode
Spark Structured Streaming
● Window Operations on Event Time
● Handling Late Data and Watermarking
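A sketch of both bullets, assuming a streaming DataFrame `events` with an event-time column `timestamp` and a `userId` column (both assumed names); events arriving more than 10 minutes late fall outside the watermark and their state can be dropped.

```scala
import org.apache.spark.sql.functions._

// Assuming `events` has columns (timestamp: Timestamp, userId: String).
val windowedUsers = events
  // Tolerate data arriving up to 10 minutes late; older state can be discarded.
  .withWatermark("timestamp", "10 minutes")
  // Event-time window: 2 minutes long, sliding every 1 minute.
  .groupBy(window(col("timestamp"), "2 minutes", "1 minute"), col("userId"))
  .count()

windowedUsers.writeStream
  .outputMode("update")   // emit only the windows changed by the latest micro-batch
  .format("console")
  .start()
```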
DStreams vs Structured Streaming
● DStreams:
○ Only guarantees at-least-once delivery, but can provide millisecond latencies
○ Supports more complicated topologies because of its flexibility
● Structured Streaming:
○ Offers exactly-once delivery with 100+ milliseconds latency
○ Aimed at simpler use cases
Hands-on #04
● Example of Spark Structured Streaming
We always have to make decisions.
Spark Streaming is not truly real-time.
It consumes unnecessary resources and requires maintenance effort.
We need a lightweight streaming solution
without dependencies on other systems.
Kafka Streams
Kafka Streams
● A client library for building applications and microservices
● No YARN anymore
● Event-at-a-time processing (not micro-batch) with millisecond latency
● Distributed processing and fault tolerance with fast failover
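A minimal Kafka Streams word-count sketch using the Scala DSL (assuming a recent kafka-streams-scala artifact); the topic names and the broker address are assumptions for illustration.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")     // consumer group / state prefix
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker

    val builder = new StreamsBuilder()

    // Event-at-a-time pipeline: each record is processed as it arrives.
    builder.stream[String, String]("text-lines")   // assumed input topic
      .flatMapValues(_.toLowerCase.split("\\W+"))
      .groupBy((_, word) => word)
      .count()
      .toStream
      .to("word-counts")                           // assumed output topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```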
Streaming
Probabilistic Data Structure Intro
● Does the dataset contain an element?
● Top N most viewed items?
● How many distinct customers in the last hour?
● ...
Probabilistic Data Structure Intro
● Possible solutions:
○ Use SQL COUNT on tables
○ Use a HashMap to check whether an element exists
● When dealing with big data (fast responses required, limited RAM and CPU) -> this becomes a big problem
Probabilistic Data Structure Intro
● Must be fast enough with limited resources
● Not all answers need to be 100% accurate; approximations with a controllable error rate are acceptable
● Trade off space and performance against accuracy
Most Useful PDS
● Membership Query: Bloom Filter
● Cardinality Estimation: HyperLogLog
● Quantiles Estimation: t-digest
● Frequent Items: Count-Min Sketch
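As a tiny illustration of the membership-query case, a Bloom filter sketch using Guava (one of many possible libraries); the expected element count and error rate are assumptions and set the space/accuracy trade-off.

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// ~1,000,000 expected elements with a 1% false-positive rate.
val seenUsers = BloomFilter.create[CharSequence](
  Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000L, 0.01)

seenUsers.put("user-42")

// "Definitely not present" is exact; "present" may be a false positive (~1%).
println(seenUsers.mightContain("user-42"))   // true
println(seenUsers.mightContain("user-999"))  // almost certainly false
```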
Some PDS libraries
● Apache DataSketches
○ Unique User (or Count Distinct) Queries
○ Quantile & Histogram Queries
○ Most Frequent Items Queries
● Redis Modules
○ RedisBloom
○ Redis-tdigest
Hands-On #05: Estimate Cardinality
● Using Redis HyperLogLog
● Count the number of active users:
○ Every minute
○ Every 5 minutes
○ Update every 5 seconds
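A sketch of the per-minute variant using Jedis; keying one HyperLogLog per minute and running PFCOUNT over several keys covers the 5-minute bullet, and re-running the query every 5 seconds gives the last bullet. The Redis host and the key names are assumptions.

```scala
import redis.clients.jedis.Jedis

object ActiveUsersHll {
  val redis = new Jedis("localhost", 6379)   // assumed Redis host/port

  // Record an event: add the user to the HyperLogLog bucket of the current minute.
  def recordEvent(userId: String, epochMillis: Long): Unit = {
    val minute = epochMillis / 60000
    redis.pfadd(s"active:minute:$minute", userId)
  }

  // Active users in a single minute.
  def activeInMinute(minute: Long): Long =
    redis.pfcount(s"active:minute:$minute")

  // Active users over the last 5 minutes: PFCOUNT merges the 5 per-minute HLLs.
  // Re-run this every 5 seconds for the "update every 5 seconds" behaviour.
  def activeInLast5Minutes(nowMillis: Long): Long = {
    val current = nowMillis / 60000
    val keys = (0 until 5).map(i => s"active:minute:${current - i}")
    redis.pfcount(keys: _*)
  }
}
```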
Lambda Architecture
Kappa Architecture
[Diagram: how the pipeline components map to BE/FE, DE, and DA & DS roles]
Remember?
Let’s
Refactor!
Why do we need streaming data?
Batch Job ELT
● Extract: Data is extracted from source systems, often in large batches or chunks.
● Load: Extracted data is loaded into a data storage system, such as a data lake or data warehouse.
● Transform: Data is transformed after it has been loaded into the storage system: processing, cleaning, and structuring it to make it suitable for analysis.
● Characteristics:
○ Typically suited for scenarios where the volume of data is relatively large and not time-sensitive.
○ Processing is done in bulk, which can lead to resource-intensive operations and longer processing times.
○ Commonly used where historical analysis, reporting, and business intelligence are the primary goals.
○ Easier to implement and manage than real-time processing.
Stream Job ETL
● Extract: Data is continuously extracted from source systems as it becomes available or changes.
● Transform: Data is transformed as it is extracted or immediately afterwards, often by enriching, aggregating, or filtering it.
● Load: Transformed data is loaded into a destination system or storage in near real time or real time.
● Characteristics:
○ Suited for scenarios where timely insights, immediate actions, or quick reactions to data are necessary.
○ Well-suited for use cases involving monitoring, alerting, fraud detection, and real-time analytics.
○ Requires a more complex architecture to handle continuous data streams and ensure low-latency processing.
○ Generally more resource-intensive due to the need for constant data processing.
Batch Job vs Stream Job
● Pros
○ Batch Job:
- Well-suited for complex transformations that require significant computational resources.
- Can handle large volumes of data efficiently.
- Easier to manage dependencies between different processing steps.
○ Stream Job:
- Enables real-time decision-making based on current data.
- Well-suited for applications requiring up-to-the-minute insights and responsiveness.
- Can handle time-sensitive data, such as sensor data, social media feeds, financial market data, etc.
● Cons
○ Batch Job:
- Not suitable for scenarios where near-real-time or real-time data processing is crucial.
- Might not be the best choice for situations requiring immediate data-driven decisions.
○ Stream Job:
- Complex to design and maintain due to the need for handling streaming data and ensuring fault tolerance.
- May not be as suitable for scenarios where historical analysis is the primary goal.
References
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://kafka.apache.org/10/documentation/streams/developer-guide/
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
https://datasketches.github.io/docs/TheChallenge.html
https://redis.io/modules
https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c