Apache Flink: What, How, Why, Who, Where? by Slim Baltagi
New York City (NYC) Apache Flink Meetup
Civic Hall, NYC
February 2nd, 2016
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
1. What is Apache Flink?
1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework
Apache Flink, written in Java and Scala, consists of:
1. Big data processing engine: distributed and
scalable streaming dataflow engine
2. Several APIs in Java/Scala/Python:
• DataSet API – Batch processing
• DataStream API – Real-Time streaming analytics
3. Domain-Specific Libraries:
• FlinkML: Machine Learning Library for Flink
• Gelly: Graph Library for Flink
• Table: Relational Queries
• FlinkCEP: Complex Event Processing for Flink
What is the Apache Flink stack?
[Stack diagram]
APIs & LIBRARIES:
• Batch: DataSet API (Java/Scala/Python) with the FlinkML, Gelly and Table libraries, plus compatibility layers for Hadoop M/R, Cascading, MRQL and Zeppelin
• Streaming: DataStream API (Java/Scala) with the Table and Gelly-Stream libraries, plus compatibility layers for Storm, SAMOA and Google Dataflow (WiP)
SYSTEM:
• Runtime: distributed streaming dataflow, fed by the Batch Optimizer (DataSet side) and the Stream Builder (DataStream side)
1.3 Project with a unique vision and philosophy
1.5 Major contributor to the movement of unification of streaming and batch
Apache Flink includes DataFlow on Flink: http://data-artisans.com/dataflow-proposed-as-apache-incubator-project/
Apache Flink as the 4G of Big Data Analytics
2. What is the Flink Execution Engine?
The core of Flink is a distributed and scalable streaming dataflow engine with some unique features:
1. True streaming capabilities: executes everything as streams
2. Versatile: the engine can run existing MapReduce, Cascading, Storm and Google Dataflow applications
3. Native iterative execution: allows cyclic dataflows
4. Handling of mutable state
5. Custom memory manager: operates on managed memory
6. Cost-based optimizer: for both batch and stream processing
3. Flink APIs
3.1 DataSet API – Batch processing
case class Word(word: String, frequency: Int)
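The slide shows only the Word type; below is a minimal batch WordCount sketch built around it (the input data and object name are illustrative):

import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements("to be or not to be")
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))   // tokenize each line
      .map(Word(_, 1))                        // one Word per token
      .groupBy("word")                        // group by the case class field
      .sum("frequency")                       // aggregate the counts
    counts.print()                            // sink: print results and execute
  }
}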
3.2 DataStream API – Real-Time Streaming Analytics
Apache Flink: streaming done right. Till Rohrmann. January 31, 2016
https://fosdem.org/2016/schedule/event/hpc_bigdata_flink_streaming/
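As a minimal streaming counterpart to the batch example (a sketch; the socket source and window size are illustrative):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)   // feed with: nc -lk 9999
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .map((_, 1))
      .keyBy(0)                          // key by the word
      .timeWindow(Time.seconds(5))       // tumbling 5-second windows
      .sum(1)
    counts.print()
    env.execute("Streaming WordCount")   // streaming jobs run until cancelled
  }
}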
4. Flink Domain Specific Libraries
4.1 FlinkML – Machine Learning Library
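A minimal sketch of what FlinkML code looks like, using its MultipleLinearRegression learner (the training data and parameter values are illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

object FlinkMLSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val training = env.fromElements(
      LabeledVector(1.0, DenseVector(1.0, 2.0)),
      LabeledVector(2.0, DenseVector(2.0, 4.0)))

    val mlr = MultipleLinearRegression()
      .setIterations(10)   // gradient descent steps
      .setStepsize(0.5)    // learning rate

    mlr.fit(training)                                 // train on a DataSet
    val predictions = mlr.predict(training.map(_.vector))
    predictions.print()
  }
}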
4.2 Table API – Relational Queries
Table API (queries):
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = 'AUTOMOBILE'")
4.3 Gelly – Graph Library for Flink
• Slides: http://www.slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink
4.4 FlinkCEP: Complex Event Processing for Flink
FlinkCEP is the complex event processing library for Flink. It lets you easily detect complex event patterns in an endless stream of data. Complex events can then be constructed from matching sequences, which lets you quickly get hold of what's really important in your data.
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
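A rough sketch in the spirit of the linked CEP docs (the library was brand new at the time, so treat the exact API surface, the event type, and the thresholds as illustrative assumptions):

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class TempReading(rackId: Int, temperature: Double)

object CepSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val readings = env.fromElements(TempReading(1, 101.0), TempReading(1, 105.0))

    // Complex event: two consecutive over-temperature readings within 10 seconds
    val warningPattern = Pattern.begin[TempReading]("first")
      .where(_.temperature > 100.0)
      .next("second")
      .where(_.temperature > 100.0)
      .within(Time.seconds(10))

    // Match the pattern per rack and turn matching sequences into alerts
    val warnings = CEP.pattern(readings.keyBy(_.rackId), warningPattern)
      .select(m => s"Rack overheating: ${m("second")}")
    warnings.print()
    env.execute("CEP sketch")
  }
}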
5. What is the Flink Architecture?
Flink implements the Kappa Architecture: run batch programs on a streaming system.
References about the Kappa Architecture:
• Questioning the Lambda Architecture - Jay Kreps, July 2nd, 2014: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
5.1 Client
Type extraction
Optimize: in all APIs, not just SQL queries as in Spark
Construct job dataflow graph
Pass job dataflow graph to the Job Manager
Retrieve job results

case class Path(from: Long, to: Long)
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

[Figure: the client extracts types, runs the optimizer, and turns the program into a dataflow graph (DataSource orders.tbl / lineitem.tbl → Filter / Map → Hybrid Hash Join with build-HT and probe sides, hash-partitioned on [0] → GroupReduce with sort) that is passed to the Job Manager]
5.2 Job Manager (JM) with High Availability
Parallelization: creates the Execution Graph
Scheduling: assigns tasks to Task Managers
State tracking: supervises the execution
[Figure: the Job Manager expands the dataflow graph (DataSource orders.tbl / lineitem.tbl → Filter / Map → Hybrid Hash Join → GroupReduce with sort) into a parallel Execution Graph and distributes its tasks across the Task Managers]
5.3 Task Manager (TM)
Operations are split up into tasks depending on the specified parallelism
Each parallel instance of an operation runs in a separate task slot
The scheduler may run several tasks from different operators in one task slot
[Figure: three Task Managers, each offering several task slots]
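How the parallelism that drives this slot assignment is set, as a minimal sketch (the values and input path are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)              // default: four parallel instances per operator
val text = env.readTextFile("hdfs:///input")
val tokens = text
  .flatMap(_.split("\\W+"))
  .setParallelism(2)               // per-operator override: only two flatMap tasks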
6. What is the Flink Programming Model?
DataSet and DataStream, as programming abstractions, are the foundation for user programs and higher layers.
6.1 DataSet
DataSet: abstraction for distributed data and the central notion of the batch programming API
Files and other data sources are read into DataSets
• DataSet<String> text = env.readTextFile(…)
Transformations on DataSets produce DataSets
• DataSet<String> first = text.map(…)
DataSets are written to files or printed to stdout
• first.writeAsCsv(…)
Computation is specified as a sequence of lazily evaluated transformations
Execution is triggered with env.execute()
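A minimal end-to-end sketch of that lazy source → transformation → sink sequence (the paths are placeholders):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile("hdfs:///input/lines.txt")   // source: lazy, nothing runs yet
val first = text.map(_.trim)                             // transformation: also lazy
first.writeAsText("hdfs:///output/lines")                // sink
env.execute("Lazy pipeline")                             // only now does the dataflow run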
6.1 DataSet
[Figure: Data Source → DataSet → Operation → DataSet → … → Sink]
6.2 DataStream
[Figure: Data Source → DataStream → Operation → DataStream → Sink]
7. What are the Apache Flink tools?
7.1 Command-Line Interface (CLI)
Flink provides a CLI to run programs that are packaged as JAR files and to control their execution.
bin/flink has 4 major actions:
• run #runs a program
• info #displays information about a program
• list #lists scheduled and running jobs
• cancel #cancels a running job
Example: ./bin/flink info ./examples/KMeans.jar
7.2 Web Submission Client
Flink provides a web interface to:
• Upload programs
• Execute programs
• Inspect their execution plans
• Showcase programs
• Debug execution plans
• Demonstrate the system as a whole
The web interface runs on port 8080 by default. To specify a custom port, set the webclient.port property in the ./conf/flink-conf.yaml configuration file.
7.3 Job Manager Web Interface
The JobManager web frontend allows you to:
• Track the progress of a Flink program, as all status changes are also logged to the JobManager's log file
• Figure out why a program failed, as it displays the exceptions of failed tasks and helps identify which parallel task failed first and caused the other tasks to be cancelled
7.4 Interactive Scala Shell
bin/start-scala-shell.sh --host localhost --port 6123
7.4 Interactive Scala Shell
Flink comes with an interactive Scala shell, a REPL (Read-Evaluate-Print Loop):
./bin/start-scala-shell.sh
Interactive queries
Lets you explore data quickly
It can be used in a local setup as well as in a cluster setup
The Flink shell comes with command history and auto-completion
Complete Scala API available
So far only batch mode is supported; there are plans to add streaming in the future
https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html
7.5 Zeppelin Notebook
http://localhost:8080/
II. How does Apache Flink integrate with Hadoop and other open source tools?
[Table: integration points and the open source tools that serve them — Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management]
II. How does Apache Flink integrate with Hadoop and other open source tools?
Flink integrates well with other open source tools for data input and output as well as deployment.
Flink can run legacy Big Data applications: MapReduce, Cascading and Storm applications.
1. Data Input / Output
Crunching Parquet Files with Apache Flink: https://medium.com/@istanbul_techie/crunching-parquet-files-with-apache-flink-200bec90d8a7
HBase example: https://github.com/apache/flink/tree/master/flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example
2. Deployment
Deploy inside of Hadoop via YARN
• YARN Setup: http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html
• YARN Configuration: http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn
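As an illustration of what the linked YARN setup describes (a sketch; the flag values are arbitrary and the exact options depend on the Flink version):

./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096   # start a Flink YARN session with 4 Task Managers
./bin/flink run ./examples/WordCount.jar       # then submit jobs against the running session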
3. Legacy Big Data applications
Cascading on Flink makes it possible to port existing Cascading/MapReduce applications to Apache Flink with virtually no code changes.
http://www.cascading.org/cascading-flink/
2. Why is Flink an alternative to Hadoop MapReduce?
1. Flink offers cyclic dataflows, compared to the two-stage, disk-based MapReduce paradigm.
2. Flink's application programming interface (API) is easier to use than programming for Hadoop's MapReduce.
3. Flink is easier to test than MapReduce.
4. Flink can leverage in-memory processing, data streaming and iteration operators for faster data processing.
5. Flink can work on file systems other than Hadoop's.
2. Why is Flink an alternative to Hadoop MapReduce?
6. Flink lets users work in a unified framework, building a single data workflow that leverages streaming, batch, SQL and machine learning, for example.
7. Flink can analyze real-time streaming data.
8. Flink can process graphs using its own Gelly library.
9. Flink can use machine learning algorithms from its own FlinkML library.
10. Flink supports interactive queries and iterative algorithms, which are not well served by Hadoop MapReduce.
2. Why is Flink an alternative to Hadoop MapReduce?
11. Flink extends the MapReduce model with new operators: join, cross, union, iterate, iterate delta, coGroup, … (a sketch of two of them follows below)
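A minimal sketch of join and coGroup in the Scala DataSet API (the datasets are made up):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val users  = env.fromElements((1, "alice"), (2, "bob"))
val visits = env.fromElements((1, "/home"), (1, "/cart"), (2, "/home"))

// join: one first-class operator instead of a hand-written reduce-side join
val joined = users.join(visits).where(0).equalTo(0) {
  (user, visit) => (user._2, visit._2)
}

// coGroup: both inputs are grouped by key and handed to one function together
val visitCounts = users.coGroup(visits).where(0).equalTo(0) {
  (us, vs) => (us.toSeq.headOption.map(_._2).getOrElse("?"), vs.size)
}

joined.print()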
3. Why is Flink an alternative to Storm?
1. Higher-level and easier-to-use API
2. Lower latency
• Thanks to its pipelined engine
3. Exactly-once processing guarantees
• Variation of the Chandy-Lamport algorithm
4. Higher throughput
• Controllable checkpointing overhead
5. Flink separates application logic from recovery
• The checkpointing interval is just a configuration parameter (see the sketch below)
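A minimal sketch of that last point (the interval value is arbitrary):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Draw a consistent snapshot of the stream (Chandy-Lamport style) every 5 seconds;
// recovery is handled by the runtime, not woven through the application logic.
env.enableCheckpointing(5000)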
4.1 True low latency streaming engine
Some claim that 95% of streaming use cases can be handled with micro-batches!? Really!!!
Spark's micro-batching isn't good enough for many time-critical applications that need to process large streams of live data and provide results in real time.
Below are several use cases, taken from real industrial situations, where batch or micro-batch processing is not appropriate.
References:
• MapR Streams FAQ: https://www.mapr.com/mapr-streams-faq#question12
• Apache Spark vs. Apache Flink, January 13, 2015. Whiteboard walkthrough by Balaji Narasimhalu from MapR: https://www.youtube.com/watch?v=Dzx-iE6RN4w
4.1 True low latency streaming engine
Financial Services
– Real-time fraud detection.
– Real-time mobile notifications.
Healthcare
– Smart hospitals: collect data and readings from hospital devices (vitals, IVs, MRI, etc.) and analyze and alert in real time.
– Biometrics: collect and analyze data from patient devices that collect vitals while outside of care facilities.
Ad Tech
– Real-time user targeting based on segment and preferences.
Oil & Gas
– Real-time monitoring of pumps/rigs.
4.1 True low latency streaming engine
Retail
– Build an intelligent supply chain by placing sensors or RFID tags on items, to alert if items aren't in the right place or proactively order more if supply is low.
– Smart logistics with real-time end-to-end tracking of delivery trucks.
Telecommunications
– Real-time antenna optimization based on user location data.
– Real-time charging and billing based on customer usage, and the ability to populate up-to-date usage dashboards for users.
– Mobile offers.
– Optimized advertising for video/audio content based on what users are consuming.
4.1 True low latency streaming engine
“I would consider stream data analysis to be a major
unique selling proposition for Flink. Due to its
pipelined architecture Flink is a perfect match for big
data stream processing in the Apache stack.” – Volker
Markl
Ref.: On Apache Flink. Interview with Volker Markl, June 24th, 2015: http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
4.2 Unique windowing features not available in Spark Streaming
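For instance, Flink decouples window size from evaluation interval and supports count windows, which a fixed micro-batch interval cannot express. A minimal sketch (the source and sizes are illustrative):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val readings: DataStream[(String, Double)] = env
  .socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v.toDouble) }

// Sliding window: every 10 seconds, aggregate the last full minute per key
val slidingSums = readings
  .keyBy(0)
  .timeWindow(Time.minutes(1), Time.seconds(10))
  .sum(1)

// Count window: fire after every 100 elements per key, independent of time
val per100 = readings.keyBy(0).countWindow(100).sum(1)

slidingSums.print()
env.execute("Windowing sketch")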
4.3 Iteration Operators
Why iterations? Many machine learning and graph processing algorithms need iterations! For example:
Machine Learning Algorithms
• Clustering (K-Means, Canopy, …)
• Gradient descent (Logistic Regression, Matrix Factorization)
Graph Processing Algorithms
• PageRank, LineRank
• Path algorithms on graphs (shortest paths, centralities, …)
• Graph communities / dense sub-components
• Inference (belief propagation)
4.3 Iteration Operators
Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.
Flink executes programs with iterations as cyclic dataflows: a dataflow program (and all its operators) is scheduled just once.
In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set) and computes the next version of the partial solution. See the bulk-iteration sketch below.
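A minimal bulk-iteration sketch, along the lines of the pi-estimation example in the Flink docs:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial = env.fromElements(0)

// The step function runs 10,000 times inside ONE scheduled dataflow
val count = initial.iterate(10000) { iterationInput =>
  iterationInput.map { i =>
    val x = Math.random()
    val y = Math.random()
    i + (if (x * x + y * y < 1) 1 else 0)   // count darts inside the unit circle
  }
}

val pi = count.map(_ * 4.0 / 10000)
pi.print()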
4.3 Iteration Operators
Delta iterations run only on the parts of the data that are changing, and can significantly speed up many machine learning and graph algorithms because the work in each iteration decreases as the iterations go on. See the sketch below.
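A minimal delta-iteration skeleton (the data, key field and convergence rule are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[(Long, Double)] = env.fromElements((1L, 1.0), (2L, 2.0))

// Solution set and workset start out identical; field 0 is the key
val result = initial.iterateDelta(initial, 10, Array(0)) { (solution, workset) =>
  val deltas = workset
    .map { case (id, rank) => (id, rank * 0.85) }
    .filter(_._2 > 0.01)   // converged elements leave the workset
  (deltas, deltas)         // (updates to the solution set, next workset)
}
result.print()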
4.3 Iteration Operators
Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.
In Spark, looping is driver-based:
• The loop lives outside the system, in the driver program
• An iterative program looks like many independent jobs
In Flink, iterations are built in:
• A dataflow with feedback edges
• The system is iteration-aware and can optimize the job
Spinning Fast Iterative Data Flows - Ewen et al., 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf (the Apache Flink model for incremental iterative dataflow processing)
4.4 Custom Memory Manager
Features:
C++-style memory management inside the JVM
User data stored in serialized byte arrays in the JVM
Memory is allocated, de-allocated, and used strictly through an internal buffer pool implementation.
Advantages:
1. Flink will not throw an OOM exception at you
2. Reduction of garbage collection (GC) pressure
3. Very efficient disk spilling and network transfers
4. No need for runtime tuning
5. More reliable and stable performance
4.4 Custom Memory Manager
Flink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components.
[Figure: the JVM heap is split into network buffers, managed memory used by Flink for sorting, hashing and caching (with empty pages in reserve), and unmanaged memory holding user-code objects such as:]
public class WC {
  public String word;
  public int count;
}
4.5 Built-in Cost-Based Optimizer
Apache Flink comes with an optimizer that is independent of the actual programming interface.
It chooses a fitting execution strategy depending on the inputs and operations.
Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge join or a hybrid hash join algorithm.
This helps you focus on your application logic rather than on parallel execution. A sketch of how to inspect the chosen plan follows below.
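You can inspect what the optimizer chose; a minimal sketch (getExecutionPlan() returns the plan as JSON, the same information the web client visualizes; the program is a placeholder):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.fromElements(1, 2, 3).map(_ * 2).writeAsText("file:///tmp/out")  // define at least one sink
println(env.getExecutionPlan())   // dump the optimized plan instead of running the job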
Quick introduction to the optimizer: section 6 of the paper 'The Stratosphere platform for big data analytics': http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf
4.5 Built-in Cost-Based Optimizer
What is automatic optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.
Decisions the optimizer makes include:
• Hash vs. sort
• Partition vs. broadcast
• Caching
• Reusing partitioning/sorting
[Figure: the same program compiles to Execution Plan A when run locally on a data sample on the laptop, Execution Plan B when run on large files on the cluster, and Execution Plan C when run a month later, after the data has evolved]
4.5 Built-in Cost-Based Optimizer
In contrast to Flink's built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets, because you need to manually control partitioning and caching to get it right.
Spark SQL uses the Catalyst optimizer, which supports both rule-based and cost-based optimization.
References:
• Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Agenda
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
II. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
III. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
IV. Who is using Apache Flink?
You might like what you have seen so far about Apache Flink and still be reluctant to give it a try!
You might wonder: is there anybody using Flink in a pre-production or production environment?
I asked this question to our friend 'Google' and came up with the short list in the next slide!
I also heard more about who is using Flink in production at the Flink Forward conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/
IV. Who is using Apache Flink?
How companies are using Flink, as presented at Flink Forward 2015. Kostas Tzoumas and Stephan Ewen. http://www.slideshare.net/stephanewen1/flink-use-cases-bay-area-meetup-october-2015
IV. Who is using Apache Flink?
PROTEUS http://www.proteus-bigdata.com/
IV. Who is using Apache Flink?
has its hack week and the winner was
a Flink based streaming project! December 18, 2015
• Extending the Yahoo! Streaming Benchmark and Winning
Twitter Hack-Week with Apache Flink. Posted on
February 2, 2016 by Jamie Grier http://data-
artisans.com/extending-the-yahoo-streaming-benchmark/
108
1. What is the Flink 2016 roadmap?
SQL/StreamSQL and Table API
CEP library: Complex Event Processing library for the analysis of complex patterns such as correlations and sequence detection from multiple sources https://github.com/apache/flink/pull/1557 January 28, 2016
2. How to get started quickly with Apache Flink?
Step-By-Step Introduction to Apache Flink: http://www.slideshare.net/sbaltagi/stepbystep-introduction-to-apache-flink
Implementing BigPetStore with Apache Flink: http://www.slideshare.net/MrtonBalassi/implementing-bigpetstore-with-apache-flink
data-artisans.com
apache-flink.meetup.com
github.com/apache/flink
user@flink.apache.org dev@flink.apache.org
5. What are some key takeaways?
1. Although most of the current buzz is about Spark, Flink offers the only hybrid (real-time streaming + batch) open source distributed data processing engine that natively supports many use cases.
2. With the upcoming release of Apache Flink 1.0, I foresee more adoption, especially in use cases with real-time stream processing and also fast iterative machine learning or graph processing.
3. I foresee Flink embedded in major Hadoop distributions and supported!
4. Apache Spark and Apache Flink will both have their sweet spots despite their "Me Too Syndrome"!
Thanks!
• To all of you for attending!
• To Bloomberg for sponsoring this event.
• To data Artisans for allowing me to use some of their materials for my slide deck.
• To Capital One for giving me time to prepare and give this talk.
• Yes, we are hiring for our New York City offices and our other locations! http://jobs.capitalone.com
• Drop me a note at sbaltagi@gmail.com if you're interested.