Spark 2 course
Spark Streaming
Juan Carlos Ibáñez, PhD
2017
Spark Streaming

[Stack diagram: Spark SQL (structured / schema data), Spark Streaming (real-time, pseudo micro-batch), MLlib (machine learning algorithms), GraphX (graph processing), all running on Spark Core (execution engine)]
Spark Streaming
• Classical analytics is performed on data at rest (databases, flat files, …)
• Today, many applications must process large streams of live data and deliver
results in real time: wireless sensor networks, traffic management applications,
stock markets, environmental monitoring applications, fraud detection tools,
etc.
• Streaming data sources: flat files (as they are created), TCP/IP sockets, Apache
Flume, Apache Kafka, Amazon Kinesis, Twitter, Facebook, …
Spark Streaming:
1. Divides the live input data stream into micro-batches
2. Treats each batch of data as an RDD and processes it using RDD
operations
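The two steps above can be illustrated without Spark at all. The following is a minimal plain-Python sketch (not the Spark API) of the micro-batch model: a stream is cut into fixed-size batches, and each batch is then processed as an ordinary dataset with the same operation.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Step 1: divide the incoming stream into batches (here: fixed-size chunks)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    """Step 2: treat each batch as a dataset and apply a map/reduce-style operation."""
    return sum(x * x for x in batch)  # e.g. sum of squares per batch

stream = range(10)  # stand-in for a live data source
results = [process(b) for b in micro_batches(stream, batch_size=4)]
print(results)  # one result per micro-batch: [14, 126, 145]
```

Real Spark Streaming batches by time interval rather than by element count, but the principle is the same: the same batch-level operation runs repeatedly, once per micro-batch.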
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
fault-tolerant stream processing of live data streams
• Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis,
or TCP sockets
• The data can then be processed with Spark's RDD functions such as map, reduce, join and
window
• Finally, the processed data can be pushed out to filesystems, databases, and live
dashboards
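The ingest → transform → sink pipeline described by these bullets can be sketched in plain Python (again, not the Spark API; the batches and the sink list are stand-ins for a real source and output system):

```python
from collections import Counter

# Simulated micro-batches of text lines, as a socket or Kafka source would deliver them
batches = [
    ["spark streaming", "spark core"],
    ["streaming data", "spark streaming"],
]

sink = []  # stand-in for a filesystem, database, or live dashboard

for lines in batches:
    # Transform: split lines into words and count them (a map + reduce per batch)
    words = [w for line in lines for w in line.split()]
    counts = Counter(words)
    # Push the processed batch out to the sink
    sink.append(dict(counts))

print(sink)
```

In actual Spark Streaming the transform step would be expressed with `flatMap` and `reduceByKey` on a DStream, and the sink step with an output operation such as `saveAsTextFiles` or `foreachRDD`.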
DStream processing
1. Spark Streaming represents each live input stream as a DStream (a continuous
sequence of RDDs, one per micro-batch)
2. DStream transformations
• It then applies all transformations and output operations defined for that DStream or
derived DStreams
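The idea that transformations are re-applied to every batch, for the DStream and all streams derived from it, can be sketched with a toy class. This is a plain-Python illustration of the lineage mechanism, not Spark's implementation; the class and method names are invented for the example.

```python
class ToyDStream:
    """A toy (non-Spark) DStream: transformations register derived streams,
    and each incoming micro-batch is pushed through the whole lineage."""

    def __init__(self):
        self.children = []  # (function, derived stream) pairs
        self.outputs = []   # collected per-batch results

    def transform(self, fn):
        """Declare a transformation; returns the derived DStream."""
        child = ToyDStream()
        self.children.append((fn, child))
        return child

    def feed(self, batch):
        """Deliver one micro-batch, applying every registered
        transformation and recursing into derived streams."""
        self.outputs.append(batch)
        for fn, child in self.children:
            child.feed([fn(x) for x in batch])

lines = ToyDStream()
lengths = lines.transform(len)                 # derived DStream of line lengths
doubled = lengths.transform(lambda n: 2 * n)   # derived from the derived stream

lines.feed(["spark", "streaming"])  # one micro-batch arrives
print(lengths.outputs)  # [[5, 9]]
print(doubled.outputs)  # [[10, 18]]
```

As in Spark, declaring a transformation does nothing by itself; work happens only when a batch arrives, and it then flows through the original DStream and every derived one.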