Apache Flink: What, How, Why, Who, Where? by Slim Baltagi
New York City (NYC) Apache Flink Meetup
Civic Hall, NYC
February 2nd, 2016
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
1. What is Apache Flink?
1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework
Apache Flink, written in Java and Scala, consists of:
1. Big data processing engine: distributed and
scalable streaming dataflow engine
2. Several APIs in Java/Scala/Python:
• DataSet API – Batch processing
• DataStream API – Real-Time streaming analytics
3. Domain-Specific Libraries:
• FlinkML: Machine Learning Library for Flink
• Gelly: Graph Library for Flink
• Table: Relational Queries
• FlinkCEP: Complex Event Processing for Flink
What is the Apache Flink stack?
[Stack diagram]
APIs & LIBRARIES:
• Batch: DataSet API (Java/Scala/Python) with the FlinkML, Gelly and Table libraries, plus compatibility layers for Hadoop M/R, Cascading, MRQL and Zeppelin
• Streaming: DataStream API (Java/Scala) with the Table and Gelly-Stream libraries, plus compatibility layers for Storm, SAMOA and Google Dataflow (WiP)
SYSTEM:
• Runtime: distributed streaming dataflow, fed by the Batch Optimizer (DataSet side) and the Stream Builder (DataStream side)
1.3 Project with a unique vision and philosophy
1.5 Major contributor to the movement of unification of streaming and batch
Apache Flink includes DataFlow on Flink: http://data-artisans.com/dataflow-proposed-as-apache-incubator-project/
Apache Flink as the 4G of Big Data Analytics
2. What is the Flink Execution Engine?
The core of Flink is a distributed and scalable streaming dataflow engine with some unique features:
1. True streaming capabilities: executes everything as streams
2. Versatile: the engine can run existing MapReduce, Cascading, Storm and Google Dataflow applications
3. Native iterative execution: allows cyclic dataflows
4. Handling of mutable state
5. Custom memory manager: operates on managed memory
6. Cost-based optimizer: for both batch and stream processing
3. Flink APIs
3.1 DataSet API – Batch processing
case class Word(word: String, frequency: Int)
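The slide shows only the Word type; below is a minimal batch WordCount sketch built around it (the input data and object name are illustrative):

import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements("to be or not to be")
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))   // tokenize each line
      .map(Word(_, 1))                        // one Word per token
      .groupBy("word")                        // group by the case class field
      .sum("frequency")                       // aggregate the counts
    counts.print()                            // sink: print results and execute
  }
}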
3.2 DataStream API – Real-Time Streaming Analytics
Apache Flink: streaming done right. Till Rohrmann. January 31, 2016
https://fosdem.org/2016/schedule/event/hpc_bigdata_flink_streaming/
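As a minimal streaming counterpart to the batch example (a sketch; the socket source and window size are illustrative):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)   // feed with: nc -lk 9999
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .map((_, 1))
      .keyBy(0)                          // key by the word
      .timeWindow(Time.seconds(5))       // tumbling 5-second windows
      .sum(1)
    counts.print()
    env.execute("Streaming WordCount")   // streaming jobs run until cancelled
  }
}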
4. Flink Domain Specific Libraries
4.1 FlinkML – Machine Learning Library
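A minimal sketch of what FlinkML code looks like, using its MultipleLinearRegression learner (the training data and parameter values are illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

object FlinkMLSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val training = env.fromElements(
      LabeledVector(1.0, DenseVector(1.0, 2.0)),
      LabeledVector(2.0, DenseVector(2.0, 4.0)))

    val mlr = MultipleLinearRegression()
      .setIterations(10)   // gradient descent steps
      .setStepsize(0.5)    // learning rate

    mlr.fit(training)                                 // train on a DataSet
    val predictions = mlr.predict(training.map(_.vector))
    predictions.print()
  }
}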
4.2 Table API – Relational Queries
Table API (queries):
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = 'AUTOMOBILE'")
4.3 Gelly – Graph Library for Flink
• Slides: http://www.slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink
4.4 FlinkCEP: Complex Event Processing for Flink
FlinkCEP is the complex event processing library for Flink. It lets you easily detect complex event patterns in an endless stream of data. Complex events can then be constructed from matching sequences, which lets you quickly get hold of what's really important in your data.
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
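A rough sketch in the spirit of the linked CEP docs (the library was brand new at the time, so treat the exact API surface, the event type, and the thresholds as illustrative assumptions):

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class TempReading(rackId: Int, temperature: Double)

object CepSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val readings = env.fromElements(TempReading(1, 101.0), TempReading(1, 105.0))

    // Complex event: two consecutive over-temperature readings within 10 seconds
    val warningPattern = Pattern.begin[TempReading]("first")
      .where(_.temperature > 100.0)
      .next("second")
      .where(_.temperature > 100.0)
      .within(Time.seconds(10))

    // Match the pattern per rack and turn matching sequences into alerts
    val warnings = CEP.pattern(readings.keyBy(_.rackId), warningPattern)
      .select(m => s"Rack overheating: ${m("second")}")
    warnings.print()
    env.execute("CEP sketch")
  }
}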
5. What is the Flink Architecture?
Flink implements the Kappa Architecture: run batch programs on a streaming system.
References about the Kappa Architecture:
• Questioning the Lambda Architecture - Jay Kreps, July 2nd, 2014: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
5.1 Client
Type extraction
Optimize: in all APIs, not just SQL queries as in Spark
Construct job dataflow graph
Pass job dataflow graph to the Job Manager
Retrieve job results

case class Path(from: Long, to: Long)
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

[Figure: the client extracts types, runs the optimizer, and turns the program into a dataflow graph (DataSource orders.tbl / lineitem.tbl → Filter / Map → Hybrid Hash Join with build-HT and probe sides, hash-partitioned on [0] → GroupReduce with sort) that is passed to the Job Manager]
5.2 Job Manager (JM) with High Availability
Parallelization: creates the Execution Graph
Scheduling: assigns tasks to Task Managers
State tracking: supervises the execution
[Figure: the Job Manager expands the dataflow graph (DataSource orders.tbl / lineitem.tbl → Filter / Map → Hybrid Hash Join → GroupReduce with sort) into a parallel Execution Graph and distributes its tasks across the Task Managers]
5.3 Task Manager (TM)
Operations are split up into tasks depending on the specified parallelism
Each parallel instance of an operation runs in a separate task slot
The scheduler may run several tasks from different operators in one task slot
[Figure: three Task Managers, each offering several task slots]
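How the parallelism that drives this slot assignment is set, as a minimal sketch (the values and input path are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)              // default: four parallel instances per operator
val text = env.readTextFile("hdfs:///input")
val tokens = text
  .flatMap(_.split("\\W+"))
  .setParallelism(2)               // per-operator override: only two flatMap tasks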
6. What is the Flink Programming Model?
DataSet and DataStream, as programming abstractions, are the foundation for user programs and higher layers.
6.1 DataSet
DataSet: abstraction for distributed data and the central notion of the batch programming API
Files and other data sources are read into DataSets
• DataSet<String> text = env.readTextFile(…)
Transformations on DataSets produce DataSets
• DataSet<String> first = text.map(…)
DataSets are written to files or printed to stdout
• first.writeAsCsv(…)
Computation is specified as a sequence of lazily evaluated transformations
Execution is triggered with env.execute()
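A minimal end-to-end sketch of that lazy source → transformation → sink sequence (the paths are placeholders):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile("hdfs:///input/lines.txt")   // source: lazy, nothing runs yet
val first = text.map(_.trim)                             // transformation: also lazy
first.writeAsText("hdfs:///output/lines")                // sink
env.execute("Lazy pipeline")                             // only now does the dataflow run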
6.1 DataSet
[Figure: Data Source → DataSet → Operation → DataSet → … → Sink]
6.2 DataStream
[Figure: Data Source → DataStream → Operation → DataStream → Sink]
7. What are the Apache Flink tools?
7.1 Command-Line Interface (CLI)
Flink provides a CLI to run programs that are packaged as JAR files and to control their execution.
bin/flink has 4 major actions:
• run #runs a program
• info #displays information about a program
• list #lists scheduled and running jobs
• cancel #cancels a running job
Example: ./bin/flink info ./examples/KMeans.jar
7.2 Web Submission Client
Flink provides a web interface to:
• Upload programs
• Execute programs
• Inspect their execution plans
• Showcase programs
• Debug execution plans
• Demonstrate the system as a whole
The web interface runs on port 8080 by default. To specify a custom port, set the webclient.port property in the ./conf/flink-conf.yaml configuration file.
7.3 Job Manager Web Interface
The JobManager web frontend allows you to:
• Track the progress of a Flink program, as all status changes are also logged to the JobManager's log file
• Figure out why a program failed, as it displays the exceptions of failed tasks and helps identify which parallel task failed first and caused the other tasks to be cancelled
7.4 Interactive Scala Shell
bin/start-scala-shell.sh --host localhost --port 6123
7.4 Interactive Scala Shell
Flink comes with an interactive Scala shell, a REPL (Read-Evaluate-Print Loop):
./bin/start-scala-shell.sh
Interactive queries
Lets you explore data quickly
It can be used in a local setup as well as in a cluster setup
The Flink shell comes with command history and auto-completion
Complete Scala API available
So far only batch mode is supported; there are plans to add streaming in the future
https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html
7.5 Zeppelin Notebook
http://localhost:8080/
II. How does Apache Flink integrate with Hadoop and other open source tools?
[Table: integration points and the open source tools that serve them — Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management]
II. How does Apache Flink integrate with Hadoop and other open source tools?
Flink integrates well with other open source tools for data input and output as well as deployment.
Flink can run legacy Big Data applications: MapReduce, Cascading and Storm applications.
1. Data Input / Output
Crunching Parquet Files with Apache Flink: https://medium.com/@istanbul_techie/crunching-parquet-files-with-apache-flink-200bec90d8a7
HBase example: https://github.com/apache/flink/tree/master/flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example
2. Deployment
Deploy inside of Hadoop via YARN
• YARN Setup: http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html
• YARN Configuration: http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn
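As an illustration of what the linked YARN setup describes (a sketch; the flag values are arbitrary and the exact options depend on the Flink version):

./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096   # start a Flink YARN session with 4 Task Managers
./bin/flink run ./examples/WordCount.jar       # then submit jobs against the running session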
3. Legacy Big Data applications
Cascading on Flink makes it possible to port existing Cascading/MapReduce applications to Apache Flink with virtually no code changes.
http://www.cascading.org/cascading-flink/
2. Why is Flink an alternative to Hadoop MapReduce?
1. Flink offers cyclic dataflows, compared to the two-stage, disk-based MapReduce paradigm.
2. Flink's application programming interface (API) is easier to use than programming for Hadoop's MapReduce.
3. Flink is easier to test than MapReduce.
4. Flink can leverage in-memory processing, data streaming and iteration operators for faster data processing.
5. Flink can work on file systems other than Hadoop's.
2. Why is Flink an alternative to Hadoop MapReduce?
6. Flink lets users work in a unified framework, building a single data workflow that leverages streaming, batch, SQL and machine learning, for example.
7. Flink can analyze real-time streaming data.
8. Flink can process graphs using its own Gelly library.
9. Flink can use machine learning algorithms from its own FlinkML library.
10. Flink supports interactive queries and iterative algorithms, which are not well served by Hadoop MapReduce.
2. Why is Flink an alternative to Hadoop MapReduce?
11. Flink extends the MapReduce model with new operators: join, cross, union, iterate, iterate delta, coGroup, … (a sketch of two of them follows below)
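A minimal sketch of join and coGroup in the Scala DataSet API (the datasets are made up):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val users  = env.fromElements((1, "alice"), (2, "bob"))
val visits = env.fromElements((1, "/home"), (1, "/cart"), (2, "/home"))

// join: one first-class operator instead of a hand-written reduce-side join
val joined = users.join(visits).where(0).equalTo(0) {
  (user, visit) => (user._2, visit._2)
}

// coGroup: both inputs are grouped by key and handed to one function together
val visitCounts = users.coGroup(visits).where(0).equalTo(0) {
  (us, vs) => (us.toSeq.headOption.map(_._2).getOrElse("?"), vs.size)
}

joined.print()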
3. Why is Flink an alternative to Storm?
1. Higher-level and easier-to-use API
2. Lower latency
• Thanks to its pipelined engine
3. Exactly-once processing guarantees
• Variation of the Chandy-Lamport algorithm
4. Higher throughput
• Controllable checkpointing overhead
5. Flink separates application logic from recovery
• The checkpointing interval is just a configuration parameter (see the sketch below)
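A minimal sketch of that last point (the interval value is arbitrary):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Draw a consistent snapshot of the stream (Chandy-Lamport style) every 5 seconds;
// recovery is handled by the runtime, not woven through the application logic.
env.enableCheckpointing(5000)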
4.1 True low latency streaming engine
Some claim that 95% of streaming use cases can be handled with micro-batches!? Really!!!
Spark's micro-batching isn't good enough for many time-critical applications that need to process large streams of live data and provide results in real time.
Below are several use cases, taken from real industrial situations, where batch or micro-batch processing is not appropriate.
References:
• MapR Streams FAQ: https://www.mapr.com/mapr-streams-faq#question12
• Apache Spark vs. Apache Flink, January 13, 2015. Whiteboard walkthrough by Balaji Narasimhalu from MapR: https://www.youtube.com/watch?v=Dzx-iE6RN4w
4.1 True low latency streaming engine
Financial Services
– Real-time fraud detection.
– Real-time mobile notifications.
Healthcare
– Smart hospitals: collect data and readings from hospital devices (vitals, IVs, MRI, etc.) and analyze and alert in real time.
– Biometrics: collect and analyze data from patient devices that collect vitals while outside of care facilities.
Ad Tech
– Real-time user targeting based on segment and preferences.
Oil & Gas
– Real-time monitoring of pumps/rigs.
4.1 True low latency streaming engine
Retail
– Build an intelligent supply chain by placing sensors or RFID tags on items, to alert if items aren't in the right place or proactively order more if supply is low.
– Smart logistics with real-time end-to-end tracking of delivery trucks.
Telecommunications
– Real-time antenna optimization based on user location data.
– Real-time charging and billing based on customer usage, and the ability to populate up-to-date usage dashboards for users.
– Mobile offers.
– Optimized advertising for video/audio content based on what users are consuming.
4.1 True low latency streaming engine
“I would consider stream data analysis to be a major
unique selling proposition for Flink. Due to its
pipelined architecture Flink is a perfect match for big
data stream processing in the Apache stack.” – Volker
Markl
Ref.: On Apache Flink. Interview with Volker Markl, June 24th, 2015: http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
4.2 Unique windowing features not available in Spark Streaming
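For instance, Flink decouples window size from evaluation interval and supports count windows, which a fixed micro-batch interval cannot express. A minimal sketch (the source and sizes are illustrative):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val readings: DataStream[(String, Double)] = env
  .socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v.toDouble) }

// Sliding window: every 10 seconds, aggregate the last full minute per key
val slidingSums = readings
  .keyBy(0)
  .timeWindow(Time.minutes(1), Time.seconds(10))
  .sum(1)

// Count window: fire after every 100 elements per key, independent of time
val per100 = readings.keyBy(0).countWindow(100).sum(1)

slidingSums.print()
env.execute("Windowing sketch")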
4.3 Iteration Operators
Why iterations? Many machine learning and graph processing algorithms need iterations! For example:
Machine Learning Algorithms
• Clustering (K-Means, Canopy, …)
• Gradient descent (Logistic Regression, Matrix Factorization)
Graph Processing Algorithms
• PageRank, LineRank
• Path algorithms on graphs (shortest paths, centralities, …)
• Graph communities / dense sub-components
• Inference (belief propagation)
4.3 Iteration Operators
Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.
Flink executes programs with iterations as cyclic dataflows: a dataflow program (and all its operators) is scheduled just once.
In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set) and computes the next version of the partial solution. See the bulk-iteration sketch below.
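A minimal bulk-iteration sketch, along the lines of the pi-estimation example in the Flink docs:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial = env.fromElements(0)

// The step function runs 10,000 times inside ONE scheduled dataflow
val count = initial.iterate(10000) { iterationInput =>
  iterationInput.map { i =>
    val x = Math.random()
    val y = Math.random()
    i + (if (x * x + y * y < 1) 1 else 0)   // count darts inside the unit circle
  }
}

val pi = count.map(_ * 4.0 / 10000)
pi.print()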
4.3 Iteration Operators
Delta iterations run only on the parts of the data that are changing, and can significantly speed up many machine learning and graph algorithms because the work in each iteration decreases as the iterations go on. See the sketch below.
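A minimal delta-iteration skeleton (the data, key field and convergence rule are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[(Long, Double)] = env.fromElements((1L, 1.0), (2L, 2.0))

// Solution set and workset start out identical; field 0 is the key
val result = initial.iterateDelta(initial, 10, Array(0)) { (solution, workset) =>
  val deltas = workset
    .map { case (id, rank) => (id, rank * 0.85) }
    .filter(_._2 > 0.01)   // converged elements leave the workset
  (deltas, deltas)         // (updates to the solution set, next workset)
}
result.print()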
4.3 Iteration Operators
Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.
In Spark, looping is driver-based:
• The loop lives outside the system, in the driver program
• An iterative program looks like many independent jobs
In Flink, iterations are built in:
• A dataflow with feedback edges
• The system is iteration-aware and can optimize the job
Spinning Fast Iterative Data Flows - Ewen et al., 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf (the Apache Flink model for incremental iterative dataflow processing)
4.4 Custom Memory Manager
Features:
C++-style memory management inside the JVM
User data stored in serialized byte arrays in the JVM
Memory is allocated, de-allocated, and used strictly through an internal buffer pool implementation.
Advantages:
1. Flink will not throw an OOM exception at you
2. Reduction of garbage collection (GC) pressure
3. Very efficient disk spilling and network transfers
4. No need for runtime tuning
5. More reliable and stable performance
4.4 Custom Memory Manager
Flink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components.
[Figure: the JVM heap is split into network buffers, managed memory used by Flink for sorting, hashing and caching (with empty pages in reserve), and unmanaged memory holding user-code objects such as:]
public class WC {
  public String word;
  public int count;
}
4.5 Built-in Cost-Based Optimizer
Apache Flink comes with an optimizer that is independent of the actual programming interface.
It chooses a fitting execution strategy depending on the inputs and operations.
Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge join or a hybrid hash join algorithm.
This helps you focus on your application logic rather than on parallel execution. A sketch of how to inspect the chosen plan follows below.
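You can inspect what the optimizer chose; a minimal sketch (getExecutionPlan() returns the plan as JSON, the same information the web client visualizes; the program is a placeholder):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.fromElements(1, 2, 3).map(_ * 2).writeAsText("file:///tmp/out")  // define at least one sink
println(env.getExecutionPlan())   // dump the optimized plan instead of running the job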
Quick introduction to the optimizer: section 6 of the paper 'The Stratosphere platform for big data analytics': http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf
4.5 Built-in Cost-Based Optimizer
What is automatic optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.
Decisions the optimizer makes include:
• Hash vs. sort
• Partition vs. broadcast
• Caching
• Reusing partitioning/sorting
[Figure: the same program compiles to Execution Plan A when run locally on a data sample on the laptop, Execution Plan B when run on large files on the cluster, and Execution Plan C when run a month later, after the data has evolved]
4.5 Built-in Cost-Based Optimizer
In contrast to Flink's built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets, because you need to manually control partitioning and caching to get it right.
Spark SQL uses the Catalyst optimizer, which supports both rule-based and cost-based optimization.
References:
• Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Agenda
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
II. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
III. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
IV. Who is using Apache Flink?
You might like what you have seen so far about Apache Flink and still be reluctant to give it a try!
You might wonder: is there anybody using Flink in a pre-production or production environment?
I asked this question to our friend 'Google' and came up with the short list in the next slide!
I also heard more about who is using Flink in production at the Flink Forward conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/
IV. Who is using Apache Flink?
How companies are using Flink, as presented at Flink Forward 2015. Kostas Tzoumas and Stephan Ewen. http://www.slideshare.net/stephanewen1/flink-use-cases-bay-area-meetup-october-2015
IV. Who is using Apache Flink?
PROTEUS http://www.proteus-bigdata.com/
IV. Who is using Apache Flink?
has its hack week and the winner was
a Flink based streaming project! December 18, 2015
• Extending the Yahoo! Streaming Benchmark and Winning
Twitter Hack-Week with Apache Flink. Posted on
February 2, 2016 by Jamie Grier http://data-
artisans.com/extending-the-yahoo-streaming-benchmark/
108
1. What is the Flink 2016 roadmap?
SQL/StreamSQL and Table API
CEP library: Complex Event Processing library for the analysis of complex patterns such as correlations and sequence detection from multiple sources https://github.com/apache/flink/pull/1557 January 28, 2016
2. How to get started quickly with Apache Flink?
Step-By-Step Introduction to Apache Flink: http://www.slideshare.net/sbaltagi/stepbystep-introduction-to-apache-flink
Implementing BigPetStore with Apache Flink: http://www.slideshare.net/MrtonBalassi/implementing-bigpetstore-with-apache-flink
data-artisans.com
apache-flink.meetup.com
github.com/apache/flink
user@flink.apache.org dev@flink.apache.org
5. What are some key takeaways?
1. Although most of the current buzz is about Spark, Flink offers the only hybrid (real-time streaming + batch) open source distributed data processing engine that natively supports many use cases.
2. With the upcoming release of Apache Flink 1.0, I foresee more adoption, especially in use cases with real-time stream processing and also fast iterative machine learning or graph processing.
3. I foresee Flink embedded in major Hadoop distributions and supported!
4. Apache Spark and Apache Flink will both have their sweet spots despite their "Me Too Syndrome"!
Thanks!
• To all of you for attending!
• To Bloomberg for sponsoring this event.
• To data Artisans for allowing me to use some of their materials for my slide deck.
• To Capital One for giving me time to prepare and give this talk.
• Yes, we are hiring for our New York City offices and our other locations! http://jobs.capitalone.com
• Drop me a note at sbaltagi@gmail.com if you're interested.