05 - Real-Time Data Processing


The Rise of Stream Processing

● Stream Processing
○ Information is always up to date
○ Computes something small and relatively simple
○ Computes in near-real-time, with at most seconds of delay; this is what we mean by real-time processing

Real-Time Data Processing 1


Oops!
How can we collect
events in real time?

Real-Time Data Processing 2


Hands-On #01
● Real-time data collection with Logstash
● A unified, high-throughput, low-latency platform for handling
real-time data feeds with Kafka

Real-Time Data Processing 3


Stream Processing

Kafka
Spark Streaming

Real-Time Data Processing 4


We need a solution for
stream processing

Real-Time Data Processing 5


Real-Time Data Processing 6
Real-Time Data Processing 7
Why Spark?
● Unified analytics engine for large-scale data processing
● Combines streaming data with static datasets and
interactive queries
● Native integration with advanced processing libraries (SQL,
machine learning, graph processing)
● All-in-one framework

Real-Time Data Processing 8


Spark Streaming

Real-Time Data Processing 9


Real-Time Data Processing 10
Spark Streaming Internals

Real-Time Data Processing 11


Hands-On #02
● Number of requests to a URL every minute
● Modify the application to also count the number of IP
accesses using the GET method
118.68.170.134 vm-01.cse87.higio.net
118.68.168.182 vm-02.cse87.higio.net
118.68.170.148 vm-03.cse87.higio.net
/etc/hosts
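The counting logic of this hands-on can be sketched in plain Python (the records below are hypothetical stand-ins for parsed access-log lines; the actual exercise runs this on the cluster):

```python
from collections import Counter

# Hypothetical parsed access-log records: (ip, method, url, "HH:MM:SS").
records = [
    ("118.68.170.134", "GET",  "/index.html", "10:01:05"),
    ("118.68.168.182", "GET",  "/index.html", "10:01:40"),
    ("118.68.170.148", "POST", "/login",      "10:01:50"),
    ("118.68.170.134", "GET",  "/login",      "10:02:10"),
]

# Requests per URL per minute: truncate the timestamp to HH:MM.
per_url_minute = Counter((url, ts[:5]) for _, _, url, ts in records)

# The modification: IP access counts restricted to the GET method.
get_by_ip = Counter(ip for ip, method, _, _ in records if method == "GET")

print(per_url_minute[("/index.html", "10:01")])  # 2
print(get_by_ip["118.68.170.134"])               # 2
```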

Real-Time Data Processing 12


Discretized Streams (DStreams)

Real-Time Data Processing 13


Discretized Streams (DStreams)
● Uses a micro-batching mechanism
● Represents a continuous stream of data as a sequence of
RDDs
● Each RDD contains data from a certain interval
● Underlying RDD transformations are computed by the
Spark engine
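The micro-batching idea can be sketched in plain Python (a toy model, not Spark code: plain lists stand in for RDDs):

```python
# Toy model of DStream micro-batching: a "stream" is a sequence of
# small batches, and each transformation is applied batch-by-batch,
# the way Spark runs one RDD job per batch interval.

def micro_batches(events, batch_size):
    """Split an event list into fixed-size batches (stand-ins for RDDs)."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def transform_stream(batches, fn):
    """Apply a per-batch transformation, like a DStream map/filter."""
    return [fn(batch) for batch in batches]

events = [1, 2, 3, 4, 5, 6, 7]
batches = list(micro_batches(events, 3))      # [[1, 2, 3], [4, 5, 6], [7]]
doubled = transform_stream(batches, lambda b: [x * 2 for x in b])
print(doubled)  # [[2, 4, 6], [8, 10, 12], [14]]
```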

Real-Time Data Processing 14


Discretized Streams (DStreams)

Real-Time Data Processing 15


Receivers
● Every input DStream (except file stream) is associated with
a Receiver
○ receives the data from a source
○ stores it in Spark’s memory for processing
● We can have multiple input DStreams (hence multiple receivers)
● The number of cores allocated (or threads in local mode) must
be greater than the number of receivers

Real-Time Data Processing 16


Receiver Reliability
● Reliable Receiver
○ Sends an acknowledgment to a reliable source once the data
has been received and stored in Spark with replication
● Unreliable Receiver
○ Does not send an acknowledgment to the source -> data may
be lost

Real-Time Data Processing 17


Stateful Streaming
● Processing pipelines must maintain state across a period of
time
● One possible option is to store the state in an external database such as Redis

Real-Time Data Processing 18


Stateful Streaming
● Spark offers stateful streaming via the mapWithState
transformation
● State can also be updated with updateStateByKey
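The semantics of updateStateByKey can be sketched in plain Python (a toy model: a dictionary stands in for the keyed state, and running_count plays the role of the user-supplied update function):

```python
# Toy model of updateStateByKey: for each key, fold the new values
# from the current batch into the running state.

def update_state_by_key(state, batch, update_fn):
    """state: {key: value}; batch: [(key, value)] for one interval."""
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    for key, values in new_values.items():
        state[key] = update_fn(values, state.get(key))
    return state

def running_count(new_values, old_count):
    """User-supplied update function: a running count per key."""
    return (old_count or 0) + sum(new_values)

state = {}
state = update_state_by_key(state, [("alice", 1), ("bob", 1)], running_count)
state = update_state_by_key(state, [("alice", 1)], running_count)
print(state)  # {'alice': 2, 'bob': 1}
```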

Real-Time Data Processing 19


Window Operations

Real-Time Data Processing 20


Window Operations
● Apply transformations over a sliding window of data
● window length:
○ The duration of the window (3 intervals in the example)
● sliding interval:
○ The interval at which the window operation is performed (2 intervals in the example)
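A toy plain-Python model of the sliding window, using the window length of 3 and sliding interval of 2 from the slide:

```python
# Toy sliding window over a sequence of batches: collect the last
# `window_length` batches every `sliding_interval` batches.
# (This toy emits only full windows; Spark also emits partial ones
# at the start of the stream.)

def windowed(batches, window_length, sliding_interval):
    windows = []
    for end in range(window_length - 1, len(batches), sliding_interval):
        window = batches[end - window_length + 1:end + 1]
        windows.append([x for batch in window for x in batch])
    return windows

batches = [[1], [2], [3], [4], [5]]
print(windowed(batches, 3, 2))  # [[1, 2, 3], [3, 4, 5]]
```

Note how interval 3 appears in both windows: with a sliding interval smaller than the window length, windows overlap.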

Real-Time Data Processing 21


Checkpointing
● A streaming application must operate 24/7 -> must be
resilient to failures unrelated to the application logic
● Needs to checkpoint enough information to recover from
failures

Real-Time Data Processing 22


Checkpointing
● Metadata checkpointing: Saving of the information defining
the streaming computation to fault-tolerant storage
○ Configuration, DStream operations, incomplete batches
○ Needed for recovery from driver failures

Real-Time Data Processing 23


Checkpointing
● Data checkpointing
○ Saving of the generated RDDs
○ Needed by stateful transformations that combine data across
multiple batches
○ Necessary even for basic functioning when such
transformations are used

Real-Time Data Processing 24


Hands-on #03
● Number of active users in the last 2 minutes, updated every 5
seconds

Real-Time Data Processing 25


Monitoring

Real-Time Data Processing 26


Monitoring
● Processing Time
○ The time to process each batch of data
● Scheduling Delay
○ The time a batch waits in a queue for the processing of
previous batches to finish
● Receiver status and processing times can be accessed
through the StreamingListener interface

Real-Time Data Processing 27


Tuning
● Reducing the processing time of each batch of data by
efficiently using cluster resources
● Setting the right batch size such that the batches of data
can be processed as fast as they are received

Real-Time Data Processing 28


Tuning
● Level of Parallelism in Data Receiving
● Level of Parallelism in Data Processing
● Data Serialization
● Task Launching Overheads
● Setting the Right Batch Interval
● Memory Tuning

Real-Time Data Processing 29


Fault-tolerance Semantics
● Unlike Spark’s RDDs in non-streaming applications
● Two kinds of data in the system need to be recovered
○ Data received and replicated (replication factor 2 by default) -> recover from
the replica
○ Data received but still buffered for replication -> recover from
the source

Real-Time Data Processing 30


Message Delivery

Real-Time Data Processing 31


Fault-tolerance Semantics
● How many times can each record be processed?
○ At most once: once or not at all -> data may be lost
○ At least once: one or more times -> no data loss, but there may be
duplicates
○ Exactly once: exactly once -> no data loss and no duplicates
-> the strongest guarantee
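The difference between the guarantees can be sketched in plain Python (a toy model: a record retried after a failure arrives twice, and deduplicating by record ID turns at-least-once delivery into an effectively exactly-once result):

```python
# Records as (record_id, value); id 2 was redelivered after a retry.
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

# At-least-once: every record is processed, duplicates included.
at_least_once = [value for _, value in delivered]

# Exactly-once effect: idempotent processing keyed by record ID.
seen, exactly_once = set(), []
for record_id, value in delivered:
    if record_id not in seen:
        seen.add(record_id)
        exactly_once.append(value)

print(at_least_once)  # ['a', 'b', 'b', 'c']
print(exactly_once)   # ['a', 'b', 'c']
```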

Real-Time Data Processing 32


Spark Structured Streaming
● Also uses a micro-batching mechanism (since Spark 2.3,
Continuous Processing offers ~1 millisecond latency with
at-least-once guarantees)
● Built on the Spark SQL engine
● Offers exactly-once delivery with 100+ milliseconds of latency
● Expresses a streaming computation as a standard batch-like
query, as if run on a static table

Real-Time Data Processing 33


Spark Structured Streaming

Real-Time Data Processing 34


Real-Time Data Processing 35
Triggers
● Micro-batch mode (by default)
● Fixed interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval (experimental)

Real-Time Data Processing 36


Spark Structured Streaming
● Output Mode
○ Complete Mode
○ Append Mode
○ Update Mode
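What each output mode would emit can be sketched in plain Python (a toy running word count, not Spark code):

```python
from collections import Counter

# Toy model of the three output modes for a running word count:
# after each micro-batch, what does the sink receive?

totals = Counter()
batches = [["spark", "kafka"], ["spark"]]

for batch in batches:
    updated = Counter(batch)
    totals.update(updated)
    complete = dict(totals)                   # Complete: the whole result table
    update = {w: totals[w] for w in updated}  # Update: only the changed rows
    # Append mode would emit only brand-new, never-changing rows, so it
    # cannot be used with aggregations whose rows keep being updated.

print(complete)  # {'spark': 2, 'kafka': 1}
print(update)    # {'spark': 2}
```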

Real-Time Data Processing 37


Real-Time Data Processing 38
Spark Structured Streaming
● Window Operations on Event Time
● Handling Late Data and Watermarking
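The watermarking idea can be sketched in plain Python (a toy model: delay_threshold plays the role of the threshold passed to Structured Streaming's watermark, and events falling behind it are dropped so their window state can be discarded):

```python
# Toy watermark: track the maximum event time seen so far; events
# older than (max_event_time - delay_threshold) count as too late.

def apply_watermark(events, delay_threshold):
    """events: [(event_time, value)] in arrival order."""
    max_event_time, kept, late = 0, [], []
    for event_time, value in events:
        max_event_time = max(max_event_time, event_time)
        if event_time >= max_event_time - delay_threshold:
            kept.append((event_time, value))
        else:
            late.append((event_time, value))
    return kept, late

events = [(10, "a"), (12, "b"), (5, "late"), (11, "c")]
kept, late = apply_watermark(events, delay_threshold=3)
print(kept)  # [(10, 'a'), (12, 'b'), (11, 'c')]
print(late)  # [(5, 'late')]
```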

Real-Time Data Processing 39


Real-Time Data Processing 40
Real-Time Data Processing 41
Real-Time Data Processing 42
Real-Time Data Processing 43
DStreams vs Structured Streaming
● DStreams
○ Only guarantees at-least-once delivery, but can provide millisecond latencies
○ Supports more complicated topologies because of its flexibility
● Structured Streaming
○ Offers exactly-once delivery with 100+ milliseconds of latency
○ Aimed at simpler use cases

Real-Time Data Processing 44


Hands-on #04
● Example of Spark Structured Streaming

Real-Time Data Processing 45


We always have to make decisions:
Spark Streaming is not truly real-time,
it consumes unnecessary resources, and it requires maintenance effort.

We need a lightweight streaming solution

without dependencies on other systems

Real-Time Data Processing 46


Kafka Streams

Real-Time Data Processing 47


Kafka Streams
● A client library for building streaming applications and microservices
● No need for YARN anymore
● Event-at-a-time processing (not micro-batching) with
millisecond latency
● Distributed processing and fault tolerance with fast failover

Real-Time Data Processing 48


Real-Time Data Processing 49
Streaming

Real-Time Data Processing 50


Probabilistic Data Structure Intro
● Does the dataset contain an element?
● Top N most viewed items?
● How many distinct customers in the last hour?
● ...

Real-Time Data Processing 51


Probabilistic Data Structure Intro
● Possible solutions:
○ Use SQL COUNT on tables
○ Use a HashMap to check whether an element exists
● When dealing with big data (fast responses required, limited
RAM and CPU), this becomes a big problem

Real-Time Data Processing 52


Probabilistic Data Structure Intro
● Needs to be fast enough with limited resources
● Not everything needs 100% accuracy; an approximation with a
controllable error rate is often enough
● Trade space and performance against accuracy

Real-Time Data Processing 53


Most Useful PDS
● Membership Query: Bloom Filter
● Cardinality Estimation: HyperLogLog
● Quantiles Estimation: t-digest
● Frequent Items: Count-Min Sketch
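A membership query with a Bloom filter can be sketched in plain Python (a toy implementation for illustration; the size and number of hashes are arbitrary choices, not tuned values):

```python
import hashlib

# Toy Bloom filter: k hash functions each set one bit per added item.
# Lookups may return false positives, but never false negatives.

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-42")
print(bf.might_contain("user-42"))  # True, always
print(bf.might_contain("user-99"))  # False (with high probability)
```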

Real-Time Data Processing 54


Some PDS libraries
● Apache DataSketches
○ Unique User (or Count Distinct) Queries
○ Quantile & Histogram Queries
○ Most Frequent Items Queries
● Redis Modules
○ RedisBloom
○ Redis-tdigest

Real-Time Data Processing 55


Hands-On #05: Estimate Cardinality
● Using Redis HyperLogLog
● Count the number of active users:
○ Every minute
○ Every 5 minutes
○ Updated every 5 seconds
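The idea behind Redis's HyperLogLog (PFADD / PFCOUNT) can be sketched with a toy plain-Python implementation (for illustration only; the hands-on uses Redis itself):

```python
import hashlib
import math

# Toy HyperLogLog: hash each item, use the low p bits to pick a
# register, and keep the maximum "rank" (position of the first 1-bit
# in the remaining bits) seen per register. A few KB of registers
# estimate huge cardinalities within a few percent.

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)        # low p bits pick a register
        w = h >> self.p
        rank = 1                      # 1 + number of trailing zero bits
        while w & 1 == 0 and rank < 150:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(5000):
    hll.add(f"user-{i}")
estimate = hll.count()  # close to 5000, typically within a few percent
```

With Redis the same count is simply `PFADD active_users <user-id>` per event and `PFCOUNT active_users` per query, with the windowing done by keeping one key per time bucket.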

Real-Time Data Processing 56


Lambda Architecture

Real-Time Data Processing 57


Kappa Architecture

Real-Time Data Processing 58


Real-Time Data Processing 59
Real-Time Data Processing 60
[Diagram: roles across the data platform — BE/FE, DE, DA & DS]

Real-Time Data Processing 61


Real-Time Data Processing 62
Remember?

Real-Time Data Processing 63


Let’s
Refactor!

Real-Time Data Processing 64


Why do we need streaming data?

Real-Time Data Processing 65


Batch Job ELT
● Extract: Data is extracted from source systems, often in large batches or chunks.
● Load: Extracted data is loaded into a data storage system, like a data lake or data warehouse.
● Transform: Transformation of the data occurs after loading into the storage system. This involves
processing, cleaning, and structuring the data to make it suitable for analysis.
● Characteristics:
○ Typically suited for scenarios where the volume of data is relatively large and not
time-sensitive.
○ Processing is done in bulk, which can lead to resource-intensive operations and longer
processing times.
○ Commonly used in scenarios where historical analysis, reporting, and business
intelligence are the primary goals.
○ Easier to implement and manage compared to real-time processing.

Real-Time Data Processing 66


Stream Job ETL
● Extract: Data is continuously extracted from source systems as it becomes available or
changes.
● Transform: Data transformation occurs as it is being extracted or immediately after
extraction. This often involves enriching, aggregating, or filtering the data.
● Load: Transformed data is loaded into a destination system or storage in near-real-time or
real-time.
● Characteristics:
● Suited for scenarios where timely insights, immediate actions, or quick reactions to data
are necessary.
● Well-suited for use cases involving monitoring, alerting, fraud detection, and real-time
analytics.
● Requires a more complex architecture to handle continuous data streams and ensure
low-latency processing.
● Generally more resource-intensive due to the need for constant data processing.

Real-Time Data Processing 67


Batch Job vs Stream Job

● Batch Job pros:
○ Well-suited for complex transformations that require significant computational resources.
○ Can handle large volumes of data efficiently.
○ Easier to manage dependencies between different processing steps.
● Batch Job cons:
○ Not suitable for scenarios where near-real-time or real-time data processing is crucial.
○ Might not be the best choice for situations requiring immediate data-driven decisions.
● Stream Job pros:
○ Enables real-time decision-making based on current data.
○ Well-suited for applications requiring up-to-the-minute insights and responsiveness.
○ Can handle time-sensitive data, such as sensor data, social media feeds, financial market data, etc.
● Stream Job cons:
○ Complex to design and maintain due to the need for handling streaming data and ensuring fault tolerance.
○ May not be as suitable for scenarios where historical analysis is the primary goal.
Real-Time Data Processing 68


References
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://kafka.apache.org/10/documentation/streams/developer-guide/
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
https://datasketches.github.io/docs/TheChallenge.html
https://redis.io/modules
https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c

Real-Time Data Processing 69
