The Rise of Stream Processing
● Stream Processing
○ Information is always up to date
○ Computes something small and relatively simple
○ Computes in near real time, within seconds at most, so we go for real-time processing
Oops!
How can we collect events in real time?
Hands-On #01
● Real-time data collection with Logstash
● Kafka as a unified, high-throughput, low-latency platform for handling real-time data feeds
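To make the hands-on concrete, here is a minimal consumer sketch that reads back the events Logstash ships into Kafka. The broker address (localhost:9092) and the topic name (weblogs) are assumptions for illustration; adjust them to the lab environment.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object LogTail {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    props.put("group.id", "hands-on-01")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("weblogs"))  // assumed topic fed by Logstash

    // Print each event as Logstash publishes it to Kafka.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach(r => println(r.value()))
    }
  }
}
```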
Stream Processing
Kafka
Spark Streaming
We need a solution for
stream processing
Why Spark?
● Unified analytics engine for large-scale data processing
● Combines streaming data with static datasets and interactive queries
● Native integration with advanced processing libraries (SQL,
machine learning, graph processing)
● All-in-one framework
Spark Streaming
Spark Streaming Internals
Hands-On #02
● Number of requests to a URL every minute
● Modify the application to also count IP accesses made with the GET method (see the sketch after the host list below)
118.68.170.134 vm-01.cse87.higio.net
118.68.168.182 vm-02.cse87.higio.net
118.68.170.148 vm-03.cse87.higio.net
/etc/hosts
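A minimal Spark Streaming sketch for this exercise, assuming access-log lines arrive on a socket on vm-01 and follow the common log format; the port, field positions, and the 60-second batch interval are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UrlRequestCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UrlRequestCount").setMaster("local[2]")
    // One batch per minute -> "number of requests to a URL every minute".
    val ssc = new StreamingContext(conf, Seconds(60))

    // Assumed source: access-log lines pushed to a socket on vm-01.
    val lines = ssc.socketTextStream("vm-01.cse87.higio.net", 9999)

    // Assumed layout (common log format): ip - user [date tz] "METHOD url protocol" status size
    val fields = lines.map(_.split(" "))

    // Requests per URL in each one-minute batch.
    val urlCounts = fields.map(f => (f(6), 1)).reduceByKey(_ + _)
    urlCounts.print()

    // Extension for the second bullet: accesses per IP, GET requests only.
    val getIpCounts = fields
      .filter(f => f.length > 6 && f(5).contains("GET"))
      .map(f => (f(0), 1))
      .reduceByKey(_ + _)
    getIpCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```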
Discretized Streams (DStreams)
Discretized Streams (DStreams)
● Uses a micro-batching mechanism
● Represents a continuous stream of data as a sequence of
RDDs
● Each RDD contains data from a certain interval
● Underlying RDD transformations are computed by the
Spark engine
Discretized Streams (DStreams)
Receivers
● Every input DStream (except file stream) is associated with
a Receiver
○ receives the data from a source
○ stores it in Spark’s memory for processing
● We can have multiple input DStreams (multiple receivers)
● The number of cores allocated (or threads in local mode) must be greater than the number of receivers
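A sketch of the last two bullets, assuming three socket sources; with three receivers the local master must provide at least four threads so that one is left for processing. Hosts and ports are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiver {
  def main(args: Array[String]): Unit = {
    // Three receivers -> ask for at least 4 local threads (3 receivers + processing).
    val conf = new SparkConf().setAppName("MultiReceiver").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Each input DStream gets its own receiver.
    val streams = Seq(9997, 9998, 9999).map(port => ssc.socketTextStream("localhost", port))

    // Union the partial streams into one DStream before processing.
    ssc.union(streams).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```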
Receiver Reliability
● Reliable Receiver
○ Sends acknowledgment to a reliable source when the data
has been received and stored in Spark with replication
● Unreliable Receiver
○ Does not send acknowledgment to a source -> data may
be lost
Stateful Streaming
● Processing pipelines must maintain state across a period of
time
● One option is to store state in an external database such as Redis
Stateful Streaming
● Spark offers stateful streaming via the mapWithState transformation
● State can also be updated with the older updateStateByKey transformation
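A minimal sketch of both transformations, assuming a DStream of (userId, 1) pairs built from a local socket; the running count per user is kept by Spark between batches, and a checkpoint directory is required for stateful operations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("StatefulCounts").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/checkpoints")   // required for stateful transformations

    // Assumed input: one user id per line on a local socket.
    val events = ssc.socketTextStream("localhost", 9999).map(user => (user, 1))

    // mapWithState: updates state only for the keys seen in the current batch.
    val mappingFunc = (user: String, one: Option[Int], state: State[Int]) => {
      val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
      state.update(newCount)
      (user, newCount)
    }
    events.mapWithState(StateSpec.function(mappingFunc)).print()

    // updateStateByKey: older API that touches every key on every batch.
    val totals = events.updateStateByKey[Int] { (newValues: Seq[Int], running: Option[Int]) =>
      Some(running.getOrElse(0) + newValues.sum)
    }
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```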
Window Operations
Window Operations
● Apply transformations over a sliding window of data
● window length:
○ The duration of the window (e.g., 3 batch intervals)
● sliding interval:
○ The interval at which the window operation is performed (e.g., every 2 batch intervals)
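A sketch matching the numbers above, assuming a 10-second batch interval and that `pairs` is a DStream[(String, Int)] such as the (url, 1) pairs from the earlier hands-on: window length of 3 intervals (30 s), sliding interval of 2 intervals (20 s).

```scala
import org.apache.spark.streaming.Seconds

// Assuming `pairs` is a DStream[(String, Int)] built on a 10-second batch interval.
// Window of 3 batch intervals (30 s), recomputed every 2 intervals (20 s).
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // how to combine counts inside the window
  Seconds(30),                 // window length
  Seconds(20)                  // sliding interval
)
windowedCounts.print()
```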
Checkpointing
● A streaming application must operate 24/7 -> must be
resilient to failures unrelated to the application logic
● Needs to checkpoint enough information to recover from
failures
Checkpointing
● Metadata checkpointing: Saving of the information defining
the streaming computation to fault-tolerant storage
○ Configuration, DStream operations, incomplete batches
○ Needed for recovery from driver failures
Checkpointing
● Data checkpointing
○ Saving of the generated RDDs to reliable storage
○ Needed by stateful transformations that combine data across multiple batches
○ Necessary even for basic functioning when such transformations are used
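A sketch of driver-failure recovery with getOrCreate, assuming an HDFS checkpoint directory (the path is a placeholder); on a restart the context, its configuration, the DStream graph, and incomplete batches are rebuilt from the checkpoint instead of calling the factory function again.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Assumed checkpoint location; any fault-tolerant filesystem path works.
  val checkpointDir = "hdfs:///user/spark/checkpoints/url-counts"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)   // enables metadata and data checkpointing

    // Define the streaming computation here (sources, transformations, outputs).
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: createContext() is called; after a driver failure the
    // context is restored from the checkpoint data instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```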
Hands-on #03
● Number of active users in the last 2 minutes, updated every 5 seconds
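One way to sketch this with DStreams, assuming user ids arrive one per line on a local socket: a 2-minute window sliding every 5 seconds, counting distinct ids in each window.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object ActiveUsers {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("ActiveUsers").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/active-users")   // window operations need checkpointing

    // Assumed input: one user id per event on a local socket.
    val users = ssc.socketTextStream("localhost", 9999)

    // All events from the last 2 minutes, re-evaluated every 5 seconds.
    val activeUsers = users
      .window(Minutes(2), Seconds(5))
      .transform(rdd => rdd.distinct())
      .count()

    activeUsers.print()   // distinct active users in the last 2 minutes

    ssc.start()
    ssc.awaitTermination()
  }
}
```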
Monitoring
Monitoring
● Processing Time
○ The time to process each batch of data
● Scheduling Delay
○ The time a batch waits in a queue for the processing of
previous batches to finish
● Receiver status and processing times can be accessed through the StreamingListener interface
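A sketch of hooking into batch metrics through StreamingListener, assuming `ssc` is a StreamingContext as in the earlier examples; the two quantities above correspond to processingDelay and schedulingDelay in BatchInfo.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Assuming `ssc` is the StreamingContext from one of the earlier sketches.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Scheduling delay: time spent waiting for previous batches to finish.
    // Processing time: time spent actually processing this batch.
    println(s"batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)} ms")
  }
})
```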
Tuning
● Reducing the processing time of each batch of data by
efficiently using cluster resources
● Setting the right batch size such that the batches of data
can be processed as fast as they are received
Tuning
● Level of Parallelism in Data Receiving
● Level of Parallelism in Data Processing
● Data Serialization
● Task Launching Overheads
● Setting the Right Batch Interval
● Memory Tuning
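A few of the knobs above expressed as a SparkConf fragment to place in the application's setup; the values are illustrative, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TunedStreamingApp")
  // Data serialization: Kryo is usually faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Parallelism in data receiving: a smaller block interval -> more partitions per batch.
  .set("spark.streaming.blockInterval", "100ms")
  // Let Spark adapt the ingestion rate to the current processing speed.
  .set("spark.streaming.backpressure.enabled", "true")

// Setting the right batch interval: start conservatively and shrink it while the
// processing time stays below the batch interval.
val ssc = new StreamingContext(conf, Seconds(10))
```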
Fault-tolerance Semantics
● Unlike Spark's RDDs in a non-streaming application
● Two kinds of data in the system need to be recovered:
○ Data received and replicated (replication factor 2 by default) -> recover from the replica
○ Data received but still buffered for replication -> recover from the source
Message Delivery
Fault-tolerance Semantics
● How many times each record can be processed:
○ At most once: once or not at all -> data may be lost
○ At least once: one or more times -> no data lost, but duplicates are possible
○ Exactly once: exactly once -> no data lost and no duplicates -> the strongest guarantee
Spark Structured Streaming
● Also uses a micro-batching mechanism (since Spark 2.3, Continuous Processing offers ~1 millisecond latency with at-least-once guarantees)
● Built on the Spark SQL engine
● Offers exactly-once delivery with 100+ milliseconds latency
● Expresses the streaming computation as a standard batch-like query, as if it ran on a static table
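A minimal Structured Streaming sketch of the "batch-like query on a stream" idea, assuming a Kafka topic named weblogs and a local broker (both assumptions); the query body is the same DataFrame code one would write on a static table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredUrlCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredUrlCounts").master("local[2]").getOrCreate()
    import spark.implicits._

    // Unbounded input table: every new Kafka record becomes a new row.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
      .option("subscribe", "weblogs")                        // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // The same query we would write on a static table: requests per URL.
    val counts = lines
      .select(split($"line", " ").getItem(6).as("url"))
      .groupBy($"url")
      .count()

    // Complete mode: re-emit the full aggregated result after each micro-batch.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```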
Spark Structured Streaming
Triggers
● Micro-batch mode (by default)
● Fixed interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval (experimental)
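How the trigger modes above look in code, assuming `lines` is the non-aggregated streaming DataFrame from the previous sketch (in practice you would pick one of these); Trigger.Continuous is the experimental continuous-processing mode and supports map-like queries only.

```scala
import org.apache.spark.sql.streaming.Trigger

// Default micro-batch mode: a new batch starts as soon as the previous one finishes.
lines.writeStream.format("console").start()

// Fixed-interval micro-batches: one micro-batch every 10 seconds.
lines.writeStream.format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")).start()

// One-time micro-batch: process whatever is available, then stop.
lines.writeStream.format("console")
  .trigger(Trigger.Once()).start()

// Continuous processing with a 1-second checkpoint interval (experimental).
lines.writeStream.format("console")
  .trigger(Trigger.Continuous("1 second")).start()
```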
Spark Structured Streaming
● Output Mode
○ Complete Mode
○ Append Mode
○ Update Mode
Spark Structured Streaming
● Window Operations on Event Time
● Handling Late Data and Watermarking
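A sketch of both bullets, assuming a streaming DataFrame `events` with an event-time column `timestamp` and a `userId` column (both assumed names); events arriving more than 10 minutes late fall outside the watermark and their state can be dropped.

```scala
import org.apache.spark.sql.functions._

// Assuming `events` has columns (timestamp: Timestamp, userId: String).
val windowedUsers = events
  // Tolerate data arriving up to 10 minutes late; older state can be discarded.
  .withWatermark("timestamp", "10 minutes")
  // Event-time window: 2 minutes long, sliding every 1 minute.
  .groupBy(window(col("timestamp"), "2 minutes", "1 minute"), col("userId"))
  .count()

windowedUsers.writeStream
  .outputMode("update")   // emit only the windows changed by the latest micro-batch
  .format("console")
  .start()
```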
DStreams vs Structured Streaming
● DStreams:
○ Only guarantees at-least-once delivery, but can provide millisecond latencies
○ Supports more complicated topologies because of its flexibility
● Structured Streaming:
○ Offers exactly-once delivery with 100+ milliseconds latency
○ Aimed at simpler use cases
Hands-on #04
● Example of Spark Structured Streaming
We always have to make decisions.
Spark Streaming is not truly real-time.
It consumes unnecessary resources and requires maintenance effort.
We need a lightweight streaming solution
without dependencies on other systems.
Kafka Streams
Kafka Streams
● A client library for building applications and microservices
● No YARN anymore
● Event-at-a-time processing (not micro-batch) with millisecond latency
● Distributed processing and fault tolerance with fast failover
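A minimal Kafka Streams word-count sketch using the Scala DSL (assuming a recent kafka-streams-scala artifact); the topic names and the broker address are assumptions for illustration.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")     // consumer group / state prefix
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker

    val builder = new StreamsBuilder()

    // Event-at-a-time pipeline: each record is processed as it arrives.
    builder.stream[String, String]("text-lines")   // assumed input topic
      .flatMapValues(_.toLowerCase.split("\\W+"))
      .groupBy((_, word) => word)
      .count()
      .toStream
      .to("word-counts")                           // assumed output topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```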
Streaming
Probabilistic Data Structure Intro
● Does the dataset contain an element?
● Top N most viewed items?
● How many distinct customers in the last hour?
● ...
Probabilistic Data Structure Intro
● Possible solutions:
○ Use SQL COUNT on tables
○ Use a HashMap to check whether an element exists
● When dealing with big data (fast responses required, limited RAM and CPU) -> this becomes a big problem
Probabilistic Data Structure Intro
● Must be fast enough with limited resources
● Not all answers need to be 100% accurate; approximations with a controllable error rate are acceptable
● Trade off space and performance against accuracy
Most Useful PDS
● Membership Query: Bloom Filter
● Cardinality Estimation: HyperLogLog
● Quantiles Estimation: t-digest
● Frequent Items: Count-Min Sketch
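As a tiny illustration of the membership-query case, a Bloom filter sketch using Guava (one of many possible libraries); the expected element count and error rate are assumptions and set the space/accuracy trade-off.

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// ~1,000,000 expected elements with a 1% false-positive rate.
val seenUsers = BloomFilter.create[CharSequence](
  Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000L, 0.01)

seenUsers.put("user-42")

// "Definitely not present" is exact; "present" may be a false positive (~1%).
println(seenUsers.mightContain("user-42"))   // true
println(seenUsers.mightContain("user-999"))  // almost certainly false
```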
Some PDS libraries
● Apache DataSketches
○ Unique User (or Count Distinct) Queries
○ Quantile & Histogram Queries
○ Most Frequent Items Queries
● Redis Modules
○ RedisBloom
○ Redis-tdigest
Hands-On #05: Estimate Cardinality
● Using Redis HyperLogLog
● Count the number of active users:
○ Every minute
○ Every 5 minutes
○ Update every 5 seconds
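A sketch of the per-minute variant using Jedis; keying one HyperLogLog per minute and running PFCOUNT over several keys covers the 5-minute bullet, and re-running the query every 5 seconds gives the last bullet. The Redis host and the key names are assumptions.

```scala
import redis.clients.jedis.Jedis

object ActiveUsersHll {
  val redis = new Jedis("localhost", 6379)   // assumed Redis host/port

  // Record an event: add the user to the HyperLogLog bucket of the current minute.
  def recordEvent(userId: String, epochMillis: Long): Unit = {
    val minute = epochMillis / 60000
    redis.pfadd(s"active:minute:$minute", userId)
  }

  // Active users in a single minute.
  def activeInMinute(minute: Long): Long =
    redis.pfcount(s"active:minute:$minute")

  // Active users over the last 5 minutes: PFCOUNT merges the 5 per-minute HLLs.
  // Re-run this every 5 seconds for the "update every 5 seconds" behaviour.
  def activeInLast5Minutes(nowMillis: Long): Long = {
    val current = nowMillis / 60000
    val keys = (0 until 5).map(i => s"active:minute:${current - i}")
    redis.pfcount(keys: _*)
  }
}
```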
Lambda Architecture
Kappa Architecture
[Diagram: how the pipeline components map to BE/FE, DE, and DA & DS roles]
Remember?
Let’s
Refactor!
Why do we need streaming data?
Batch Job ELT
● Extract: Data is extracted from source systems, often in large batches or chunks.
● Load: Extracted data is loaded into a data storage system, such as a data lake or data warehouse.
● Transform: Data is transformed after it has been loaded into the storage system: processing, cleaning, and structuring it to make it suitable for analysis.
● Characteristics:
○ Typically suited for scenarios where the volume of data is relatively large and not time-sensitive.
○ Processing is done in bulk, which can lead to resource-intensive operations and longer processing times.
○ Commonly used where historical analysis, reporting, and business intelligence are the primary goals.
○ Easier to implement and manage than real-time processing.
Stream Job ETL
● Extract: Data is continuously extracted from source systems as it becomes available or changes.
● Transform: Data is transformed as it is extracted or immediately afterwards, often by enriching, aggregating, or filtering it.
● Load: Transformed data is loaded into a destination system or storage in near real time or real time.
● Characteristics:
○ Suited for scenarios where timely insights, immediate actions, or quick reactions to data are necessary.
○ Well-suited for use cases involving monitoring, alerting, fraud detection, and real-time analytics.
○ Requires a more complex architecture to handle continuous data streams and ensure low-latency processing.
○ Generally more resource-intensive due to the need for constant data processing.
Batch Job vs Stream Job
● Pros
○ Batch Job:
- Well-suited for complex transformations that require significant computational resources.
- Can handle large volumes of data efficiently.
- Easier to manage dependencies between different processing steps.
○ Stream Job:
- Enables real-time decision-making based on current data.
- Well-suited for applications requiring up-to-the-minute insights and responsiveness.
- Can handle time-sensitive data, such as sensor data, social media feeds, financial market data, etc.
● Cons
○ Batch Job:
- Not suitable for scenarios where near-real-time or real-time data processing is crucial.
- Might not be the best choice for situations requiring immediate data-driven decisions.
○ Stream Job:
- Complex to design and maintain due to the need for handling streaming data and ensuring fault tolerance.
- May not be as suitable for scenarios where historical analysis is the primary goal.
References
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://kafka.apache.org/10/documentation/streams/developer-guide/
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
https://datasketches.github.io/docs/TheChallenge.html
https://redis.io/modules
https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c