F 1330 Narkhede Kafka

Apache Kafka
A distributed publish-subscribe messaging system

Neha Narkhede, LinkedIn
@nehanarkhede, 11/11/11
Outline
 Introduction to pub-sub
 Kafka at LinkedIn
 Hadoop and Kafka
 Design
 Performance
What is pub-sub?
[Diagram: producers call publish(topic, msg) on the publish-subscribe system, which holds topics 1, 2 and 3; consumers subscribe to a topic and receive msg.]
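A minimal sketch of the publish/subscribe contract in the diagram above, in Python. The class PubSub and its methods are hypothetical illustrations, not Kafka's API; this toy pushes messages to in-process callbacks, whereas Kafka consumers pull from brokers.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class PubSub:
    """Toy in-process publish-subscribe hub (hypothetical, not the Kafka API)."""

    def __init__(self) -> None:
        # topic name -> callbacks registered by consumers
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, msg: str) -> None:
        # Every subscriber of the topic receives the message.
        for callback in list(self._subscribers.get(topic, [])):
            callback(msg)


received_a: List[str] = []
received_b: List[str] = []
hub = PubSub()
hub.subscribe("topic1", received_a.append)
hub.subscribe("topic1", received_b.append)
hub.publish("topic1", "hello")
hub.publish("topic2", "nobody is listening")
```

Both subscribers of topic1 see the message; the message on topic2 reaches no one because nothing subscribed to it.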
Motivation
 Activity tracking
 Operational metrics
Kafka
 Distributed
 Persistent
 High throughput
Kafka at LinkedIn
[Diagram: frontends and services publish to a Kafka cluster of brokers; the consumers are real-time monitoring, security systems, news feed, Hadoop and the DWH.]
Hadoop Data Load for Kafka
[Diagram: a live data center (frontends, Kafka, real-time consumers) and an offline data center (Kafka, Dev Hadoop cluster, PROD Hadoop cluster).]
Multi DC data deployments
[Diagram: live data centers, each with Kafka and real-time consumers, and offline data centers with Kafka, Hadoop clusters and the DWH.]
Volume
 20B events/day
 3 terabytes/day
 150K events/sec
Message queues (ActiveMQ, TIBCO)
• Low throughput
• Secondary indexes
• Tuned for low latency
Log aggregators (Flume, Scribe)
• Focus on HDFS
• Push model
• No rewindable consumption
KAFKA (shown between the two)
What Kafka offers
 Very high performance
 Elastically scalable
 Low operational overhead
 Durable, highly available (coming soon)

Architecture
[Diagram: producers on top, a row of brokers with ZooKeeper (ZK) in the middle, consumers below.]
Efficiency #1: simple storage
 Each topic has an ever-growing log
 A log == a list of files
 A message is addressed by a log offset
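The storage model can be sketched as follows, assuming (for illustration only, not as the actual broker code) that an offset is a byte position in the log, that each segment is identified by the offset of its first byte, and that messages carry a 4-byte length prefix. SegmentedLog, segment_bytes and the in-memory bytearrays standing in for files are all hypothetical.

```python
import bisect
import struct
from typing import List, Tuple


class SegmentedLog:
    """Append-only log split into segments; a message's address is its byte offset."""

    def __init__(self, segment_bytes: int = 64) -> None:
        self.segment_bytes = segment_bytes           # hypothetical roll-over size
        self.base_offsets: List[int] = [0]           # first byte offset of each segment
        self.segments: List[bytearray] = [bytearray()]
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        record = struct.pack(">I", len(payload)) + payload    # 4-byte length prefix
        if self.segments[-1] and len(self.segments[-1]) + len(record) > self.segment_bytes:
            # Roll to a new segment that starts at the current end of the log.
            self.base_offsets.append(self.next_offset)
            self.segments.append(bytearray())
        offset = self.next_offset
        self.segments[-1] += record
        self.next_offset += len(record)
        return offset

    def read(self, offset: int) -> Tuple[bytes, int]:
        """Return (payload, offset of the next message)."""
        # Binary search for the last segment whose base offset is <= offset.
        idx = bisect.bisect_right(self.base_offsets, offset) - 1
        pos = offset - self.base_offsets[idx]
        segment = self.segments[idx]
        (length,) = struct.unpack_from(">I", segment, pos)
        payload = bytes(segment[pos + 4 : pos + 4 + length])
        return payload, offset + 4 + length


log = SegmentedLog(segment_bytes=32)
offsets = [log.append(f"event-{i}".encode()) for i in range(6)]
payload, next_offset = log.read(offsets[3])
```

Finding a message needs only a search over segment start offsets plus one positioned read, so no per-message index is required in this sketch.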
Efficiency #2: careful transfer
 Batch send and receive
 No message caching in JVM
 Rely on file system buffering
 Zero-copy transfer: file -> socket
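A rough sketch of the file -> socket transfer using Python's socket.sendfile, which delegates to the operating system's sendfile call where available. The helper name send_log_chunk, the sample file contents and the local socket pair are assumptions for the demo; Kafka itself runs on the JVM and this is not its code.

```python
import os
import socket
import tempfile


def send_log_chunk(sock: socket.socket, path: str, offset: int, count: int) -> int:
    """Send `count` bytes of the file at `path`, starting at `offset`, to `sock`."""
    with open(path, "rb") as f:
        # Where os.sendfile is available, the kernel moves the bytes from the
        # file system cache to the socket without a user-space copy.
        return sock.sendfile(f, offset=offset, count=count)


# Demo over a local socket pair; a real broker would serve a TCP connection.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"msg-0|msg-1|msg-2|")
    log_path = tmp.name

server, client = socket.socketpair()
sent = send_log_chunk(server, log_path, offset=6, count=12)

received = b""
while len(received) < sent:
    received += client.recv(64)

server.close()
client.close()
os.unlink(log_path)
```

The consumer's fetch request maps directly onto a byte range of the log file, which is why the broker in this sketch never needs to hold messages in its own memory.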

Multi subscribers
 1 file system operation per request
 Consumption is cheap
 SLA based message retention
 Rewindable consumption
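One way to picture retention and rewinding, assuming the broker keeps no per-consumer state and each consumer simply asks for messages from an offset of its choosing. RetainedLog, its methods and the use of message indices as offsets are hypothetical simplifications.

```python
from typing import List, Tuple


class RetainedLog:
    """Log that drops messages by age, not by whether they were consumed."""

    def __init__(self, retention_secs: float) -> None:
        self.retention_secs = retention_secs
        self.entries: List[Tuple[int, float, str]] = []   # (offset, timestamp, msg)
        self.next_offset = 0

    def append(self, msg: str, now: float) -> int:
        offset = self.next_offset
        self.entries.append((offset, now, msg))
        self.next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        # Retention depends only on message age.
        cutoff = now - self.retention_secs
        self.entries = [e for e in self.entries if e[1] >= cutoff]

    def fetch(self, offset: int, max_messages: int = 10) -> List[Tuple[int, str]]:
        hits = [(o, m) for (o, _, m) in self.entries if o >= offset]
        return hits[:max_messages]


log = RetainedLog(retention_secs=60)
for t, msg in enumerate(["a", "b", "c", "d"]):
    log.append(msg, now=float(t * 30))         # appended at t = 0, 30, 60, 90

first_pass = [m for _, m in log.fetch(0)]       # a consumer reads everything
replay = [m for _, m in log.fetch(2)]           # the same consumer rewinds to offset 2
log.expire(now=100.0)                           # messages older than t = 40 are dropped
after_expiry = [m for _, m in log.fetch(0)]
```

Because the consumer owns its position, replaying is just another fetch, and adding subscribers costs the log nothing extra to store.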
Guarantees
 Data integrity checks
 At least once delivery
 In order delivery, per partition
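At-least-once delivery can be illustrated with a consumer that records its offset only after processing a message. The consume function and its crash_before_commit_at parameter are hypothetical; this is a sketch of the general pattern for a single partition, not Kafka's consumer implementation.

```python
from typing import List, Tuple


def consume(log: List[str], committed: int,
            crash_before_commit_at: int = -1) -> Tuple[List[str], int]:
    """Process messages from `committed` onward; return (processed, committed offset)."""
    processed: List[str] = []
    for offset in range(committed, len(log)):
        processed.append(log[offset])           # 1. process the message
        if offset == crash_before_commit_at:
            return processed, committed         # simulated crash: progress not recorded
        committed = offset + 1                  # 2. only then record progress
    return processed, committed


partition = ["m0", "m1", "m2"]

# First run crashes after processing m1 but before committing it.
run1, committed = consume(partition, committed=0, crash_before_commit_at=1)
# The restart resumes from the last committed offset, so m1 is seen again.
run2, committed = consume(partition, committed=committed)
delivered = run1 + run2
```

Nothing is lost and order within the partition is preserved, but m1 is delivered twice, so downstream processing has to tolerate duplicates.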

Automatic load balancing
[Diagram: two producers, two brokers and two consumers.]
Auditing
# events published = # events consumed
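The audit check amounts to comparing counts. A minimal sketch, assuming counts are kept per (topic, time window) key on both the producing and the consuming side; the function name audit and the sample topic names and numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

Key = Tuple[str, int]   # (topic, time window index), a hypothetical audit key


def audit(published: Counter, consumed: Counter) -> Dict[Key, Tuple[int, int]]:
    """Return {key: (published, consumed)} for every key whose counts differ."""
    keys = set(published) | set(consumed)
    return {k: (published[k], consumed[k]) for k in keys if published[k] != consumed[k]}


published = Counter({("page_views", 0): 1000, ("page_views", 1): 1200, ("logins", 0): 50})
consumed = Counter({("page_views", 0): 1000, ("page_views", 1): 1195, ("logins", 0): 50})

mismatches = audit(published, consumed)
```

An empty result means the number of events published equals the number consumed for every key; here one window of page_views is five events short.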

Performance
 2 Linux boxes
• 16 × 2.0 GHz cores
• 6 × 7200 rpm SATA drives, RAID 10
• 24GB memory
• 1Gb network link
 200 byte messages
 Producer batch size 200 messages
Basic performance metrics
• Producer batch size = 40K
• Consumer batch size = 1MB
• 100 topics, broker flush interval = 100K
– Producer throughput = 90 MB/sec

– Consumer throughput = 60 MB/sec
– Consumer latency = 220 ms
Latency vs throughput
(100 topics, 1 producer, 1 broker)
[Chart: consumer latency in ms (y-axis, 0 to 250) against producer throughput in MB/sec (x-axis, 0 to 100).]

Scalability
(10 topics, broker flush interval 100K)
Throughput in MB/s
• 1 broker: 101
• 2 brokers: 190
• 3 brokers: 293
• 4 brokers: 381
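A small calculation on the scalability figures above, assuming the chart labels 101, 190, 293 and 381 MB/s belong to 1, 2, 3 and 4 brokers respectively. It compares each measurement with perfect linear scaling from the single-broker result.

```python
# Throughput in MB/s by broker count, as read from the chart (assumed mapping).
throughput_mb_s = {1: 101, 2: 190, 3: 293, 4: 381}

baseline = throughput_mb_s[1]
efficiency = {
    brokers: round(mb_s / (baseline * brokers), 2)
    for brokers, mb_s in throughput_mb_s.items()
}

for brokers, eff in efficiency.items():
    print(f"{brokers} broker(s): {throughput_mb_s[brokers]} MB/s, {eff:.0%} of linear")
```

Under that reading, each added broker contributes close to a full broker's worth of throughput.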
Throughput vs Unconsumed data
(1 topic, broker flush interval 10K)
[Chart: throughput in msg/s (y-axis, 0 to 200000) against unconsumed data in GB (x-axis, roughly 10 to 1008).]
State of the system
 4 clusters per colo, 4 servers each
 850 socket connections per server
 20 TB
 430 topics
 Batched frontend to offline datacenter latency => 6-10 secs
 Frontend to Hadoop latency => 5 min
State of the system
 Successfully deployed in production at LinkedIn and other startups
 Apache incubator inclusion
 0.7 Release
• Compression
• Cluster mirroring
Replication
Some project ideas
 Security
 Long poll
 More compression codecs
 Locality of consumption
Team
• Jay Kreps
• Jun Rao
• Neha Narkhede
• Joel Koshy
• Chris Burroughs
THANK YOU
http://incubator.apache.org/kafka/index.html
kafka-users@incubator.apache.org
http://www.linkedin.com/in/nehanarkhede
@nehanarkhede
#kafka
API
