
Streaming Data and Stream Processing with Apache Kafka™

David Tucker, Director of Partner Engineering, Confluent
Sid Goel, Partner and Solution Architect, KPI Partners

1
The opportunity: The shift to streams & digital transformation

"By 2020, 70% of organizations will adopt data streaming to enable real-time analytics."
- Gartner | Nov 2016

"Streaming ingestion and analytics will become a must-have for digital winners."
- Forrester | Nov 2015

3
More Facts & Figures

90% of CEOs believe the digital economy will have a major impact on their industry.
- MIT Sloan / Capgemini (2013)

Digital disruptors will displace 40% of incumbent companies over the next 5 years.
- Center for Digital Transformation (2015)

#1 most important capability executives hope to improve via digital transformation: the ability to support real-time transactions.
- The Economist (2015)

4
Vision of a Streaming Enterprise

[Diagram: a central Streaming Platform connecting RDBMS, NewSQL/NoSQL, legacy apps, mobile apps, Hadoop, search, monitoring, real-time analytics, document stores, and the data warehouse]

5
What Can You Do with a Streaming Platform?

• Publish and subscribe to streams of data
  • Analogous to traditional messaging systems
• Store streams of data
  • Consumers can look back in time
• Process streams of data
  • Analyze and correlate events in real time
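A minimal sketch of publish/subscribe with the standard Java clients; the broker address, topic name, and record contents are placeholders, and poll(Duration) assumes Kafka 2.0+ client libraries.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.*;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PubSubSketch {
    public static void main(String[] args) {
        // Publish: write an event to the "page-views" topic (placeholder name)
        Properties prodProps = new Properties();
        prodProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        prodProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        prodProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (Producer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "/pricing"));
        }

        // Subscribe: read events back; because Kafka stores the stream,
        // a new consumer group can start from the earliest retained offset
        Properties consProps = new Properties();
        consProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consProps.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-readers");
        consProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        try (Consumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```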

6
The typical architecture

[Diagram: point-to-point pipelines connecting user tracking, operational logs, and operational metrics to app monitoring, the data warehouse, search, security, fraud detection, databases, storage, and application interfaces]

7
Challenges abound
• Difficult to handle massive amounts of data
• Diverse data sets, arriving at an increasing rate
• Many complex data pipelines that are difficult and time-consuming to change
• Mission-critical availability requirements for the most recent/relevant data
• A separate cluster is required for real-time processing

[Diagram: the same point-to-point architecture, with each connection between sources (user tracking, operational logs, operational metrics) and destinations (app monitoring, Hadoop, data warehouse, search, security, fraud detection, databases, storage, application interfaces) forming its own pipeline]

8
Modernized architecture using Apache Kafka

[Diagram: user tracking, operational logs, and operational metrics flow into Kafka; app monitoring, the data warehouse, search, security, and a fraud-detection application consume from Kafka, with custom apps built on the Streams API]

9
Modernized architecture using Apache Kafka
• Handle any volume of data with ease
• Scale to meet the demands of diverse streams
• Pub/sub to data streams alleviates back pressure
• Lightweight, easy to modify with minimal disruption
• Decoupled from upstream apps, creating agility
• Real-time, context-specific data in the moment

[Diagram: the same Kafka-centric architecture as the previous slide, annotated with these benefits]

10
Our vision: from big data to stream data

• Big Data was: the more the better (value grows with the volume of data)
• Stream Data is: the faster the better (value decays with the age of data)
• Stream Data can be: big or fast (Lambda architecture, with separate speed and batch tables fed by streams and Hadoop jobs)
• Stream Data will be: big AND fast (Kappa architecture, with streams feeding every table)
Apache Kafka is the Enabling Technology of this Transition


11
Kafka Adoption in Large Enterprises Growing Rapidly

• Travel: 6 of the top 10
• Global Banks: 7 of the top 10
• Insurance: 8 of the top 10
• Telecom: 9 of the top 10

Over 35% of the Fortune 500 are using Apache Kafka™.

12
Industries & Use Cases
Universal Use Cases: IoT, Data Pipelines, Microservices, Monitoring

Industry Use Cases:
• Financial Services: Fraud Detection, Trade Data Capture, Customer 360
• Retail: Inventory Management, Product Catalog, A/B Testing, Proactive Alerts
• Automotive: Connected Car, Manufacturing Data Processing
• Enterprise Tech: Analytics, Security Operations, Collecting Performance Data
• Telecom: Personalized Ad Placement, Customer 360, Network Integrity Systems
• Entertainment/Media: Log Delivery, Ad Delivery Operations, Cross-Device Insights
• Travel/Leisure: Visitor Segmentation, Fraud Detection
• Consumer Tech: Streaming Video, Personalized Customer Experience, Device Telemetry and Analytics
• Healthcare: Patient Monitoring, Pharma Substance Control, Patient Relapse, Lab Results Alerts

13
Kafka Adoption Across Key Companies
[Logos of adopters across Financial Services, Enterprise Tech, Consumer Tech, Entertainment & Media, Telecom, Retail, and Travel & Leisure]

15
Confluent Enterprise
The only enterprise streaming platform
based entirely on Apache Kafka™

16
Confluent Platform: Enterprise Streaming based on Apache Kafka™

[Diagram: database changes, log events, IoT data, web events, and more flow through the Confluent Platform to Hadoop, databases, the data warehouse, CRM, and other systems, supporting data integration, transformations, custom apps, real-time applications, analytics, and monitoring. The platform layers monitoring & administration, operations, and data compatibility tooling on top of Apache Kafka™ with its clients and connectors.]

Apache Open Source | Confluent Open Source | Confluent Enterprise
Complete, Open, Trusted, Enterprise Grade

17
Confluent Completes Kafka
[Matrix: which of Apache Kafka, Confluent Open Source, and Confluent Enterprise includes each feature]

• Apache Kafka: high-throughput, low-latency, highly available, secure distributed streaming platform
• Kafka Connect API: advanced API for connecting external sources/destinations into Kafka
• Kafka Streams API: simple library that enables streaming application development within the Kafka framework
• Additional Clients: support for non-Java clients (C, C++, Python, etc.)
• REST Proxy: universal access to Kafka from any network-connected device via HTTP
• Schema Registry: central registry for the format of Kafka data; guarantees all data is always consumable
• Pre-Built Connectors: HDFS, JDBC, Elasticsearch, and other connectors fully certified and fully supported by Confluent
• Confluent Control Center: enables easy connector management and stream monitoring
• Auto Data Balancing: rebalances data across the cluster to remove bottlenecks
• Replication: multi-datacenter replication simplifies and automates MDC Kafka clusters
• Support: enterprise-class support to keep your Kafka environment running at top performance (community support for Apache Kafka and Confluent Open Source; 24x7x365 for Confluent Enterprise)

18
How do I get streams of data
into and out of my apps?

Connect Clients REST

19
Apache Kafka™ Connect – Streaming Data Capture

• Fault tolerant
• Manages hundreds of data sources and sinks
• Preserves data schema
• Part of the Apache Kafka project
• Integrated with Confluent Platform's Control Center

[Diagram: source connectors (JDBC, IRC/Twitter, CDC) feed the Kafka pipeline through the Kafka Connect API, and sink connectors (Elastic, NoSQL, HDFS) deliver data onward]
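As an illustration of configuring a source connector, the sketch below builds a configuration map that would be submitted to the Connect REST API or given to a standalone worker. The property names follow Confluent's JDBC source connector documentation, but treat the exact keys, connection URL, and column/topic names as assumptions to verify for your environment.

```java
import java.util.HashMap;
import java.util.Map;

public class JdbcSourceConnectorConfig {
    // Builds a connector configuration for streaming rows from a relational
    // database into Kafka topics; names and URLs are placeholders.
    public static Map<String, String> build() {
        Map<String, String> config = new HashMap<>();
        config.put("name", "orders-jdbc-source");
        config.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
        config.put("tasks.max", "2");
        config.put("connection.url", "jdbc:postgresql://db.example.com:5432/shop");
        config.put("mode", "incrementing");                 // capture new rows by a growing ID column
        config.put("incrementing.column.name", "order_id");
        config.put("topic.prefix", "jdbc-");                // rows from table "orders" land in topic "jdbc-orders"
        return config;
    }
}
```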

20
Kafka Connect API, Part of the Apache Kafka™ Project
Connect any source to any target system with Apache Kafka

Flexible
• 40+ open source connectors available
• Easy to develop additional connectors
• Flexible support for data types and formats

Reliable
• Automated failover
• At-least-once delivery guaranteed
• Balances workload between nodes

Integrated
• 100% compatible with Kafka v0.9 and higher
• Integrated with Confluent's Schema Registry
• Easy to manage with Confluent Control Center

Compatible
• Maintains critical metadata
• Preserves schema information
• Supports schema evolution
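To illustrate "easy to develop additional connectors", here is a minimal, hypothetical source connector skeleton against the public Connect API; the connector, task, and topic names are invented for the example, and a real connector would add configuration validation and proper offset handling.

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical connector that emits a heartbeat string once per poll. */
public class HeartbeatSourceConnector extends SourceConnector {
    private Map<String, String> config;

    @Override public void start(Map<String, String> props) { this.config = props; }
    @Override public Class<? extends Task> taskClass() { return HeartbeatSourceTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.nCopies(Math.max(1, maxTasks), config);
    }
    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public String version() { return "0.1.0"; }

    public static class HeartbeatSourceTask extends SourceTask {
        @Override public void start(Map<String, String> props) { }
        @Override public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(1000);  // pretend we are waiting on an external system
            SourceRecord record = new SourceRecord(
                    Collections.singletonMap("source", "heartbeat"),                      // source partition
                    Collections.singletonMap("position", System.currentTimeMillis()),     // source offset
                    "heartbeats",                                                         // target topic (placeholder)
                    Schema.STRING_SCHEMA, "alive");
            return Collections.singletonList(record);
        }
        @Override public void stop() { }
        @Override public String version() { return "0.1.0"; }
    }
}
```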
21
Kafka Connect API Library of Connectors
[Logos of connectors grouped by category: Databases, Datastore/File Store, Analytics, Applications/Other]

* Denotes connectors developed at Confluent and distributed by Confluent. Extensive validation and testing have been performed.

22
New in Kafka 0.10.2: Single Message Transforms for Kafka Connect

Modify events before storing in Kafka:
• Mask sensitive information
• Add identifiers
• Tag events
• Store lineage
• Remove unnecessary columns

Modify events going out of Kafka:
• Route high-priority events to faster data stores
• Direct events to different Elasticsearch indexes
• Cast data types to match the destination
• Remove unnecessary columns
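A hedged sketch of attaching a Single Message Transform to a connector configuration. MaskField is one of the transforms bundled with Apache Kafka, but the transform alias, field name, and surrounding connector config shown here are placeholders.

```java
import java.util.Map;

public class MaskingConnectorConfig {
    // Adds a transform chain that masks a sensitive field before records
    // are written to Kafka; "ssn" and the connector config are placeholders.
    public static void addMaskingTransform(Map<String, String> connectorConfig) {
        connectorConfig.put("transforms", "maskPii");
        connectorConfig.put("transforms.maskPii.type",
                "org.apache.kafka.connect.transforms.MaskField$Value");
        connectorConfig.put("transforms.maskPii.fields", "ssn");
    }
}
```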

23
Kafka Clients

[Logos of client libraries, grouped as: Apache Kafka native clients, Confluent native clients, community-supported clients (Ruby and others), the http/REST Proxy, and stdin/stdout console tools]

24
REST Proxy: Talking to Non-native Kafka Apps and Outside the Firewall

• Provides a RESTful interface to a Kafka cluster
• Simplifies message creation and consumption
• Simplifies administrative actions

[Diagram: non-Java applications talk REST/HTTP to the REST Proxy (backed by the Schema Registry), which sits in front of the cluster alongside native Kafka Java applications]
25
How do I maintain my data
formats and ensure compatibility?

26
The Challenge of Data Compatibility at Scale

• Many sources without a policy cause mayhem in a centralized data pipeline
• Ensuring downstream systems can use the data is key to an operational stream pipeline
• Example: date formats. Even within a single application, different formats can be presented

[Diagram: App 1, App 2, and App 3 all write to the pipeline; one of them emits an incompatibly formatted message]

27
Schema Registry
[Diagram: producing apps (App 1, App 2) serialize records against the Schema Registry before writing to a Kafka topic; consumers such as Elastic, Cassandra, and HDFS read with the same schemas, and incompatible writes are rejected]

• Define the expected fields for each Kafka topic
• Prevent backwards-incompatible changes
• Automatically handle schema changes (e.g. new fields)
• Supports multi-datacenter environments
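A minimal sketch of a producer that serializes Avro records through the Schema Registry using Confluent's serializer; the registry URL, topic, and schema fields are placeholders.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroProducerSketch {
    public static void main(String[] args) {
        // Avro schema describing the expected fields for the topic
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
                        + "{\"name\":\"user\",\"type\":\"string\"},"
                        + "{\"name\":\"page\",\"type\":\"string\"}]}");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", KafkaAvroSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        // The serializer registers/validates schemas against the registry
        props.put("schema.registry.url", "http://schema-registry.example.com:8081");

        GenericRecord record = new GenericData.Record(schema);
        record.put("user", "user-42");
        record.put("page", "/pricing");

        try (Producer<Object, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", record));
        }
    }
}
```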

28
How do I build stream
processing apps?

29
Kafka Streams API: the Easiest Way to Process Data in Apache Kafka™

Key benefits of Apache Kafka's Streams API:
• Build apps, not clusters: no additional cluster required
• Elastic, highly performant, distributed, fault-tolerant, secure
• Equally viable for small, medium, and large-scale use cases
• "Run everywhere": integrates with your existing deployment strategies, such as containers, automation, and cloud

Example use cases:
• Microservices
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• Customer 360-degree view, fraud detection, location-based marketing, smart electrical grids, fleet management, …
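A minimal sketch of an embedded Streams application that filters and transforms one topic into another; topic names are placeholders, and the StreamsBuilder API shown is from Kafka 1.0+, so it differs slightly from the 0.10.2 classes referenced elsewhere in this deck.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");

        // Continuously filter and transform events, then write results back to Kafka
        views.filter((user, page) -> page.startsWith("/pricing"))
             .mapValues(page -> page.toUpperCase())
             .to("pricing-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```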
30
Architecture Example
Before: Complexity for development and operations, heavy footprint

1. Capture business events in Kafka
2. Process events with separate, special-purpose clusters ("your processing job")
3. Write results back to Kafka

31
Architecture Example
With Kafka Streams: App-centric architecture that blends well into your existing infrastructure

1. Capture business events in Kafka
2. Process events fast, reliably, and securely with standard Java applications using the Kafka Streams API
3a. Write results back to Kafka
3b. Query the latest results directly from external apps

32
New in Kafka 0.10.2: Session windows in the Kafka Streams API

• Group events in a stream into session windows
• Sessions are periods of event activity terminated by a gap of inactivity
• Purely time-based windows are incorrect for session-based data analysis

[Diagram: input events from different users (Alice, Bob, Dave) arrive along processing time; session windowing produces per-user sessions grouped by event time]
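A hedged sketch of session windowing in the Streams DSL: count events per user session, where a session ends after 30 minutes of inactivity. It uses the newer windowedBy API (Kafka 2.x), so the method names differ in 0.10.2, and the topic name and gap duration are placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;

public class SessionWindowSketch {
    public static KTable<Windowed<String>, Long> buildSessionCounts(StreamsBuilder builder) {
        // Count events per user session; a session closes after 30 minutes
        // of inactivity for that key (the record key is the user id)
        return builder
                .stream("user-clicks", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
                .count();
    }
}
```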

33
How do I synchronize and migrate data
to and from the cloud?

35
Before: Hybrid Cloud Environments Today

[Diagram: an on-premises data center (DC1) full of databases, key-value stores, apps, and a data warehouse, with many point-to-point copies into their AWS counterparts]

Challenges:
• Each team/department must execute its own cloud migration
• The same data may be moved multiple times
• Each box represented here requires development, testing, deployment, monitoring, and maintenance
36
After: Cloud Synchronization and Migrations with Confluent Platform

[Diagram: Kafka clusters in DC1 and AWS continuously replicate data between the on-premises systems and their cloud counterparts]

Benefits:
• Continuous low-latency synchronization
• Centralized manageability and monitoring: track data produced in all data centers at the event level
• Security and governance: track and control where data comes from and who is accessing it
• Cost savings: move data once

37
How do I manage and monitor
my streaming platform at scale?

38
What Does End-to-End Mean?

"Clocks and cables" monitoring asks: How fast is the throughput? How many CPU cycles are we using?

End-to-end monitoring asks: Did you leave? Did you arrive?

39
Confluent Control Center: Cluster Health & Administration

Cluster health dashboard


• Monitor the health of your Kafka clusters
and get alerts if any problems occur
• Measure system load, performance,
and operations
• View aggregate statistics or drill down
by broker or topic

Cluster administration
• Monitor topic configurations

40
Confluent Control Center: End-to-end Monitoring
See exactly where your messages are going in your Kafka cluster

41
Confluent Control Center: Connector Management

42
Confluent Control Center: Alerting
Alerts
• Configure alerts on incomplete data delivery, high latency, Kafka connector status, and more
• Manage alerts for different users and applications from a web UI

User authentication
• Control access to Confluent Control Center
• Integrates with existing enterprise authentication systems

43
Auto Data Balancing

Dynamically move partitions to optimize resource utilization and reliability:
• Easily add and remove nodes from your Kafka cluster
• A rack-aware algorithm rebalances partitions across the cluster
• Traffic from the balancer is throttled while data transfer occurs

[Diagram: partition placement before and after a rebalance]

44
Multi-Datacenter Replication
An easy, reliable way to run Kafka across datacenters

Improve reliability
• Easily configure and maintain cross-cluster replication

Simplify management
• Centralized configuration and monitoring
• Replicate an entire cluster or a subset of topics
• Automatic replication of topic configuration
• Use Kafka's SASL support for Kerberos and Active Directory
• SSL encryption between datacenters

45
Get Started with Apache Kafka Today!

THE place to start with Apache Kafka!
• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise

https://www.confluent.io/downloads/

46
Thank You

47
