You are on page 1of 48

Putting Apache Kafka

Building a Real-time Data Platform for Event Streams!


to Use!
JAY KREPS, CONFLUENT!
A Couple of Themes!
Theme 1: Rise of Events!
Theme 2: Immutability Everywhere!
Level! Example! Immutable Alternative!

Mutable local state! Counter in a for loop! Functional Programming!

Mutable process-wide state! ConcurrentHashMap! Functional Programming!


!

Mutable on disk structures! B-Tree! LSM!

Distributed systems! Dynamo-like key-value store! State machine replication!

Mutability in databases! RDBMS! Event Sourcing!

Company-wide data flow! Double write! Kafka!


Theme 3: Datacenter-Level Thinking!
Experience at LinkedIn!
2009: We want all our data in Hadoop!!
What is all our data?!
Initial approach: “gut it out”!
Problems!
•  Data coverage!
•  Many source systems!
•  Relational DBs!
•  Log files!
•  Metrics!
•  Messaging systems!
•  Many data formats!
•  Constant change!
•  New schemas!
•  New data sources!
Needed: organizational scalability!

Θ(N) => Θ(1)!


How does everything else work?!

?!
Relational database changes!
Apps and Services

OLTP Queries

Relational
Databases

Data Guard CSV Dump


Cache

ODS Hadoop
Poll For Changes

App App App

Relational Transforms
Data
Caches & Warehouse
Derived Stores

Transforms
NoSQL!

App App App

Key-value
Store

ETL Load

Hadoop
User events!
Apps and Apps and Apps and
Services Services Services

HTTP

Log Aggregation

NFS

rsync

NFS

Load Transform & Load

Relational
Hadoop Data
Warehouse

Transform
Application Logs!

Apps and Apps and Apps and


Services Services Services

Splunk
Messaging!
App App App App App

Broker Broker

Processor Processor Processor Processor

App App App

Broker

Processor Processor Processor Processor


Metrics and operational data!

App App App

Monitoring
This is a giant mess!
Apps and Services Apps and Services Apps and Services

OLTP Queries
HTTP
ActiveMQ HTTP

Monitoring
Relational Apps Apps Log Aggregation
Databases
Splunk
Key-value
Store
Data Guard NFS
CSV Dump
ActiveMQ Cache
rsync

Poll For Changes ODS Hadoop Load NFS


Apps Apps
App App App

Relational Transforms
Data
Transform & Load
Caches & Warehouse
Derived Stores

Transforms
Impossible ideas!
•  Publish data from Hadoop to a search index!
•  Run a SQL query to find the biggest latency
bottleneck!
•  Run a SQL query to find common error patterns!
•  Low latency monitoring of database changes or user
activity!
•  Incorporate popularity in real-time display and
relevance algorithms!
•  Products that incorporate user activity!
An infrastructure solution?!
Idea: Stream Data Platform!

Search Impala

Apps Hive
Monitoring

Stream
Data HADOOP:
DWH
RDBMS Platform: Offline
? Data

Stream Map-
NoSQL Processing
Reduce
Real-time
Analytics Spark

Synchronous
Req/Response Near real time
Offline batch
0 - 100s ms > 100s ms > 1 hour
First Attempt: Messaging systems!!
Problems!
•  Throughput!
•  Batch systems!
•  Persistence!
•  Stream Processing!
•  Ordering
guarantees!
•  Partitioning!
Second Attempt: Build Kafka!!
What does it do?!

Producer Producer Producer Producer Producer

Kafka Cluster

Consumer Consumer Consumer Consumer Consumer


Commit Log Abstraction!
Reader 1 Reader 2

1 1 1 Writes
0 1 2 3 4 5 6 7 8 9
0 1 2

Old New
Logs & Publish-Subscribe Messaging!

Source
System

writes

1 1 1
Log 0 1 2 3 4 5 6 7 8 9 0 1 2

reads reads

Destination Destination
System A System B
A Kafka Topic!

Partition 1 1 1
0 0 1 2 3 4 5 6 7 8 9
0 1 2

Partition Writes
0 1 2 3 4 5 6 7 8 9
1

Partition 1 1 1
0 1 2 3 4 5 6 7 8 9
2 0 1 2

Old New
Replication!
Server 1 Server 2 Server 3

A:0 A:0 A:0

A:1 A:1 A:1

B:0 B:0 Controller


Scaling Consumers!
Kafka Cluster

Server 1 Server 2

P0 P3 P1 P2

C1 C2 C3 C4 C5 C6

Consumer Group A Consumer Group B


Kafka: A Modern Distributed System for Streams!

  Scalability of a filesystem!
◦ Hundreds of MB/sec/server throughput!
◦ Many TB per server!
  Guarantees of a database!
◦ Messages strictly ordered!
◦ All data persistent!
  Distributed by default!
◦ Replication!
◦ Partitioning model!
  Producers, Consumers, and Brokers all fault tolerant and horizontally
scalable!
Stream Data Platform!

Search Impala

Apps Hive
Monitoring

KAFKA:
Stream HADOOP:
DWH
RDBMS Data Offline
Platform Data

Stream Map-
NoSQL Processing
Reduce
Real-time
Analytics Spark

Synchronous
Req/Response Near real time
Offline batch
0 - 100s ms > 100s ms > 1 hour
Batch Data => Batch Processing!
Stream processing is a!
generalization!
of batch processing !
and request/response processing!
Request/Response processing: !
One input => One output!
Batch processing: !
All inputs => All outputs!
Stream Processing: !
Some inputs => some outputs!
(you choose how much “some” is)!
Stream Processing a la carte!
Input Kafka Topic

Transform Transform Transform

Intermediate Your code


Kafka Topic
cat input | grep “foo” | wc -l
Transform Transform Transform

Output Kafka
Topic

Hadoop Live
Data Store
Stream Processing with Frameworks!

+! =! Stream
Processing!
Unix Pipes, Modernized!

cat /usr/share/dict/words | wc -l
On Schemas!

Bad Schemas < No Schemas < Good Schemas!


Put it all together!
Apps Apps Apps Apps

Social Key-Value
Search Oracle Newsfeed OLAP
Graph Storage

Apps
Log
Search Apps

Monitoring
Kafka
Security &
Fraud Samza

Real-time
Analytics

Hadoop Teradata
At LinkedIn!
•  Everything in the company is a real-time stream!
•  > 800 billion messages written per day!
•  > 2.9 trillion messages read per day!
•  ~ 1 PB of stream data!
•  Tens of thousands of producer processes!
•  Backbone for data stores!
•  Search!
•  Social Graph!
•  Newsfeed!
•  Primary storage (in progress)!
•  Basis for stream processing!
Elsewhere!
Why this is the future!

1. System diversity is increasing!


2. Data diversity and volume is
increasing!
3. The world is getting faster!
4. The technology exists!
•  Mission: Make this a practical reality
everywhere!
•  Product!
•  Apache Kafka!
•  Schemas and metadata management!
•  Connectors for common systems!
•  Monitor data flow end-to-end!
•  Stream processing integration!
Questions?!
•  Confluent!
•  @confluentinc!
•  http://confluent.io !
•  http://blog.confluent.io/2015/02/25/
stream-data-platform-1 !
•  Apache Kafka!
•  @apachekafka!
•  http://kafka.apache.org!
•  http://linkd.in/199iMwY !
•  Me!
•  @jaykreps!

You might also like