
WHITE PAPER

KAFKA VS. KINESIS
Apache Kafka and Amazon Kinesis Comparison And Migration Guide
By: Parviz Deyhim

CONTENTS

Introduction
Apache Kafka Overview
Amazon Kinesis Overview
Core Concepts
Cost Comparison
Design and Architectural Decisions
Application Changes
Migrating Platforms
Conclusion
Appendix A
Appendix B
INTRODUCTION

Streaming data processing has become increasingly prevalent. As a result, different platforms and frameworks have been introduced to reduce the complexity of requirements such as durable and scalable high-throughput data ingest. While traditional pub-sub messaging frameworks such as RabbitMQ and ActiveMQ have been around to help with those challenges, one solution that has changed the landscape since its inception is Apache Kafka. Apache Kafka, an open-source framework developed at LinkedIn, has been a popular choice for a variety of use-cases such as stream processing and data transformation due to its well-engineered, scalable and durable design.

However, one of the shortcomings of Apache Kafka is the lack of cloud-native design in high availability and monitoring. As a result, we've found that running and operating Apache Kafka in a cloud environment requires a great deal of time and effort committed by the operations and engineering teams.

An alternative to Apache Kafka with a similar set of features is Amazon Kinesis. Amazon Kinesis is a data ingest service hosted and managed by Amazon Web Services (AWS). Similar to other platform-as-a-service offerings, Amazon Kinesis eliminates the need for developers to manage and operate their own infrastructure.

Since the inception of Amazon Kinesis, our clients have been asking the following questions:

What are the architectural differences between the two systems?
What are the application and API differences?
What are the cost differences between the two platforms?

In this document we will answer those questions by examining:

1. An introduction to the common and important concepts pertaining to Apache Kafka and Amazon Kinesis
2. The economic and technical considerations of using Apache Kafka and Amazon Kinesis
3. The application API differences

For those readers who are interested in migrating from Apache Kafka to Amazon Kinesis, the last section in this document provides sample code to help with the migration process.

APACHE KAFKA OVERVIEW

Apache Kafka is an open-source distributed pub-sub messaging solution that was initially developed at LinkedIn. Apache Kafka users are responsible for installing and managing their clusters, and also for accounting for requirements such as high availability, durability, and recovery.

Apache Kafka consists of multiple nodes referred to as Brokers. Brokers are responsible for accepting messages (leaders) and replicating the messages to the rest of the brokers in the cluster (followers). The distributed nature of Apache Kafka allows the system to scale out and provides high availability (HA) in case of node failure. The membership (leaders and followers) of Brokers in a cluster is tracked and administered via Apache Zookeeper, yet another open-source distributed membership framework.

[Figure: Producer applications write to Kafka Brokers, whose membership is coordinated by Zookeeper nodes, with Brokers deployed across availability zones / data centres; consumer applications read from the Brokers.]

For more details on how Apache Kafka works, please refer to the following guide: Apache Kafka

AMAZON KINESIS OVERVIEW

Amazon Kinesis, also a pub-sub messaging solution, is hosted by Amazon Web Services (AWS) and provides a similar set of capabilities as Apache Kafka. This section provides the high-level architectural differences between the two systems.

Amazon Kinesis is a fully managed service hosted within a given AWS region (i.e. us-east-1) and spans multiple Availability Zones (i.e. us-east-1a). Similar to Apache Kafka, Amazon Kinesis is responsible for accepting end-users' messages and replicating the messages to multiple availability zones for high availability and durability. The fully managed aspect of Amazon Kinesis eliminates the need for users to maintain infrastructure or be concerned about details surrounding features like replication or other system configurations.

[Figure: Producer applications in one availability zone / data centre write to Amazon Kinesis, and consumer applications in another availability zone / data centre read from it.]

CORE CONCEPTS

Throughout this document we'll be referring to platform-specific terms and concepts. The following section provides a summary and mapping of the important concepts in Apache Kafka and the corresponding concepts in Amazon Kinesis. A more detailed comparison is provided in the Application Changes section.

Kafka Concept              Kinesis Concept
Topic                      Stream
Partition                  Shard
Broker                     N/A
Apache Kafka Producer      Amazon Kinesis Producer
Apache Kafka Consumer      Amazon Kinesis Consumer
Offset number              Sequence Number
Replication                Not Required

APACHE KAFKA TOPIC VS. AMAZON KINESIS STREAM
An Apache Kafka Topic and an Amazon Kinesis Stream both represent an ordered, immutable and partitioned list of messages. New messages get appended to the end of this list and each message has a unique identifier.

APACHE KAFKA PARTITION VS. AMAZON KINESIS SHARD
In Apache Kafka, each topic consists of one or more partitions. A Shard is the similar concept in Amazon Kinesis. The intent of distributing each topic or stream over multiple partitions or Shards is to increase write/read throughput by distributing the load between multiple nodes.

APACHE KAFKA REPLICATION VS. N/A
Replication provides higher durability and availability in cases where the resource hosting the topic/stream experiences failures. In Apache Kafka, users have the ability to define the topic's replication factor. Amazon Kinesis automatically stores data across multiple Availability Zones synchronously and, as a result, users are not required to define a replication strategy.

APACHE KAFKA OFFSET NUMBER VS. AMAZON KINESIS SEQUENCE NUMBER
Each record within an Apache Kafka topic partition or Amazon Kinesis stream Shard is given a unique number. In Apache Kafka this number is referred to as the Offset, while in Amazon Kinesis it is referred to as the Sequence Number. Both platforms guarantee that the offsets or sequence numbers in a given partition or Shard are ordered and sequentially increasing.

APACHE KAFKA PRODUCER VS. AMAZON KINESIS PRODUCER
Producers are the application components that submit records to Apache Kafka or Amazon Kinesis. Producers handle sending multiple records to the platform, are able to distribute data across partitions or Shards, and perform tasks like compression and failure handling.

APACHE KAFKA CONSUMER VS. AMAZON KINESIS CONSUMER
Consumers are the application components that fetch records from Amazon Kinesis or Apache Kafka via the provided APIs. Similar to the producer applications, consumers deal with failures and with reading records from multiple partitions or Shards.

APACHE KAFKA BROKER VS. N/A
Brokers are the Apache Kafka nodes that host one or more Apache Kafka partitions. Amazon Kinesis is a hosted service and the nodes that host the Shards are abstracted from the users.

COST COMPARISONS

The following section provides an overview of the different cost factors involved in running Apache Kafka or using Amazon Kinesis.

APACHE KAFKA COST FACTORS
The cost of running and maintaining an Apache Kafka cluster involves a number of factors that users have to be aware of. It is common to calculate the cost simply by calculating the cost of the underlying hardware, but in order to accurately estimate the cost of hosting Apache Kafka, users have to include the cost of replication, the effort required to maintain (patch and upgrade) the cluster, monitoring, maintaining other dependent systems such as Apache Zookeeper, and maintaining brokers distributed between multiple datacenters or availability zones. The following section provides an overview of the factors that should be considered when comparing the cost of hosting Apache Kafka vs. Amazon Kinesis.

Before we dive into the costs of hosting Apache Kafka, it's important to note that running and maintaining an Apache Kafka cluster involves running and hosting a highly available and reliable Apache Zookeeper cluster. Apache Kafka relies on Apache Zookeeper for some of its important and vital functions; therefore, any cost calculation will be widely inaccurate if it does not include the cost of hosting and maintaining an Apache Zookeeper cluster.

Hosting Cost Factors

The cost of hosting Apache Kafka consists of the following factors:

Infrastructure costs + Data Durability + Maintenance Costs

Infrastructure costs

The cost of hosting Apache Kafka includes the cost of running an infrastructure capable of supporting the velocity of the incoming data (in terms of records/sec) and the cost of storing data according to the data retention requirements. Based on our experience, one cost usually outweighs the other: either the velocity of the incoming data requires deploying a cluster where the amount of CPU cores, memory and networking outweighs the data retention storage requirements, or the required storage footprint for data retention outweighs the required amount of CPU cores, memory or networking bandwidth. In some of the larger deployments with high traffic requirements, both factors can be equally important.

Durability costs

The other factors that should be considered are the costs involved in providing durability and high availability. The cost of durability is directly influenced by the replication factor of the Apache Kafka cluster, which in turn influences the cost of the required storage footprint. For example, with 1TB per day of incoming data and a replication factor of 3, the total size of the stored data on local disk is 3TB.

While Apache Kafka replication provides durability and protection in case of Apache Kafka Broker node failures, it does not protect against data-center or availability zone outages. In order to protect against data-center/availability-zone outages, multiple Apache Kafka Broker nodes have to be deployed in multiple datacenters or availability zones. Multi-datacenter/availability-zone deployment introduces additional costs such as datacenter or availability-zone bandwidth costs.

Maintenance Cost Factors

Apache Kafka is a well-engineered framework, and due to its complex engineering nature, there are various factors involved in maintaining a production-grade cluster.

The following tables provide the high-level tasks involved in maintaining Apache Zookeeper and Apache Kafka clusters. Apache Kafka relies on Apache Zookeeper for some of its internal functions and it's important to consider the effort required to host both frameworks.

Managing Apache Kafka Framework

Apache Kafka Specific Tasks
Monitoring and alerting on Apache Kafka Broker failures
Monitoring and alerting on Apache Kafka resource utilization (Disk, CPU and Memory)
Monitoring and alerting on Apache Kafka partition throughput
Migrating Apache Kafka partitions to new nodes to increase throughput
Tuning Apache Kafka JVM settings
Scaling Brokers to increase CPU, Memory and Disk resources
Upgrading Apache Kafka version
Recovering/Replacing failed Brokers
Failing over to a different cluster in a different data-center or availability-zone
Multi-AZ deployment

Managing Apache Zookeeper Framework

Apache Zookeeper Specific Tasks
Monitoring and alerting on Apache Zookeeper node failures
Monitoring and alerting on Apache Zookeeper resource utilization (Disk, CPU and Memory)
Apache Zookeeper JVM tuning
Scaling Zookeeper nodes to increase CPU, Memory and Disk resources
Upgrading Zookeeper version
Replacing failed Zookeeper nodes
Multi-AZ deployment

AMAZON KINESIS COST FACTORS
Given that Amazon Kinesis is a hosted service, it involves fewer cost factors as compared to Apache Kafka. The following section focuses on the cost factors involved in using Amazon Kinesis.

Hosting Cost Factors

One of the main benefits of using Amazon Kinesis is the fact that users are not responsible for hosting and maintaining a distributed cluster.

Infrastructure costs

Since Amazon Kinesis is a hosted service, beyond the cost of using the service itself, there are no additional infrastructure costs involved in using the platform. In terms of storage, the users' data is retained for 24 hours at no additional cost. If a longer data retention period is required, users have to pay additional charges.

Durability costs

Amazon Kinesis automatically replicates users' data to multiple availability zones for durability. Clients do not have to be concerned with the cost of replication or the additional storage cost due to the replication factor.
Maintenance Cost Factors

As compared to Apache Kafka, the maintenance tasks are limited to a few areas, as demonstrated in the table below.

Apache Kafka Specific Tasks                                                Amazon Kinesis Specific Tasks
Monitoring and alerting on Apache Kafka Broker failures                    N/A, handled by the Kinesis service
Monitoring and alerting on Apache Kafka resource
utilization (Disk, CPU and Memory)                                         N/A, handled by the Kinesis service
Monitoring and alerting on Apache Kafka partition throughput               Monitoring and alerting on CloudWatch Shard metrics
Migrating Apache Kafka partitions to new nodes to increase throughput      Amazon Kinesis API to add and remove Shards
Tuning Apache Kafka JVM settings                                           N/A, handled by the Kinesis service
Scaling Brokers to increase CPU, Memory and Disk resources                 N/A, handled by the Kinesis service
Upgrading Apache Kafka version                                             N/A, handled by the Kinesis service
Recovering/Replacing failed Brokers                                        N/A, handled by the Kinesis service
Failing over to a different cluster in a different
data-center or availability-zone                                           N/A, handled by the Kinesis service

COST COMPARISON EXAMPLE
We are generally not in favor of providing pricing information, since it's practically impossible to provide an accurate number that satisfies different use cases. However, we can provide a pricing example of running a hypothetical workload.

We'll use the following requirements to calculate the cost of Apache Kafka, and borrow the traffic estimate to calculate the cost of using Amazon Kinesis.

Requirement                                                    Value
Estimated Daily Traffic                                        1 TB
Data Retention Days                                            7
Data Payload Size                                              1 KB
Apache Kafka Replication Factor                                3
* Apache Kafka Monthly Required Storage with 30% headroom      31.5 TB
** Daily Records/Sec                                           11574

Table 1.1

* Total storage required after considering the daily total incoming traffic, retention days, the storage headroom and the replication factor.

** Records/Sec = (Daily traffic in KB / 86400) / Payload size.
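As a rough illustration of how the Table 1.1 figures are derived, the short sketch below recomputes the record rate and the raw retained data from the stated requirements. It assumes decimal units (1 TB = 10^9 KB); the paper then adds operating headroom on top of the raw retained data to arrive at the 31.5 TB figure.

// Rough reproduction of the Table 1.1 estimates (assumes decimal units: 1 TB = 10^9 KB)
val dailyTrafficKB    = 1e9   // 1 TB of incoming data per day
val payloadKB         = 1.0   // 1 KB per record
val retentionDays     = 7
val replicationFactor = 3

// Records/Sec = (Daily traffic in KB / 86400) / Payload size ~= 11,574
val recordsPerSec = (dailyTrafficKB / 86400) / payloadKB

// Raw retained data before headroom: 1 TB/day x 7 days x 3 replicas = 21 TB of local disk
val retainedTB = (dailyTrafficKB / 1e9) * retentionDays * replicationFactor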

Apache Kafka Costs

Given the requirements above and what we've discussed in the previous sections, we can estimate the Apache Kafka costs as follows:

Cost Factors                                             Value       Price
* Broker d2.xlarge EC2 Instances (Annual)                6           $10,390.00
** Zookeeper c3.xlarge EC2 Nodes (Annual)                3           $651
*** Between availability-zone bandwidth cost (Annual)    744 TB      $7,618.56
**** Maintenance Cost (Annual)                           0.3 FTE     $30,000.00
3-Yr total cost                                          -           $145,978.68

* We used Amazon d2.xlarge 3-yr upfront-reserved instances to calculate this cost. We believe this pricing model is a close estimation of hardware server/storage prices. $5,195 3-yr upfront payment / 3 years = $1,731.67 annual cost x 6 instances = $10,390.00.

** As discussed above, it is critical to include the cost of maintaining Apache Kafka and Zookeeper in our calculations. In this example, we're using c3.xlarge instances to host the Apache Zookeeper nodes.

*** The cost of sending traffic between multiple brokers hosted in 3 availability zones with a replication factor of 3. 1 TB per day of traffic replicated to 2 other availability zones = 2 TB per day of inter-AZ traffic: 2 TB x 31 days x 12 months. Refer to the Amazon monthly price calculator for more details.

**** Based on our experience, we believe 30% of a DevOps engineer's time is required to support the activities mentioned in the Apache Kafka Maintenance Cost Factors section. We've assumed a $100K/year salary to calculate the cost.
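For readers who want to check the arithmetic, the annual line items and the 3-year total can be reproduced with a few lines. The figures are the paper's own 2016-era estimates, not current AWS pricing, and the $0.01/GB inter-AZ rate is an assumption inferred from the table above.

// Sanity check of the 3-year Apache Kafka cost estimate using the figures above
val brokerAnnual      = 5195.0 / 3 * 6               // d2.xlarge 3-yr upfront / 3 years x 6 brokers ~= $10,390
val zookeeperAnnual   = 651.0                        // 3 x c3.xlarge Zookeeper nodes (annual)
val interAzAnnual     = 2.0 * 31 * 12 * 1024 * 0.01  // 2 TB/day x 31 days x 12 months, assuming $0.01/GB ~= $7,618.56
val maintenanceAnnual = 100000.0 * 0.3               // 30% of a $100K/year DevOps engineer
val threeYearTotal    = 3 * (brokerAnnual + zookeeperAnnual + interAzAnnual + maintenanceAnnual)  // ~= $145,978.68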

Amazon Kinesis Costs

Using the traffic numbers defined in Table 1.1, we can estimate the cost of using Amazon Kinesis:

Cost Factors                               Value       Price
* Number of Shards Required (Annual)       12          $7,356
** Maintenance Cost (Annual)               0.1 FTE     $10,000
3-Yr total cost                            -           $92,068

* The number of Shards required to support the incoming traffic of 11574 records/sec with a 1KB payload. Refer to Amazon Kinesis pricing for the Shard price calculation.

** Similar calculation method to the Apache Kafka maintenance costs. We're assuming 10% of a DevOps engineer's time should be dedicated to supporting Amazon Kinesis maintenance activities.

*** The costs associated with the producer/consumer application changes.

**** The cost of training engineers.

APACHE KAFKA COST BENEFITS
As demonstrated by the example above, we believe that for the majority of workloads, using Amazon Kinesis is financially beneficial. However, there are cases where Apache Kafka can be more cost effective. One example of such a scenario is when the incoming traffic consists of a small payload with a high number of records-per-sec and a short data retention period. Because the majority of the cost of hosting Apache Kafka is driven by the amount of storage required to host the retained data, in a scenario where the payload is small and the retention time is limited, the storage requirement is minimal. In such cases, Apache Kafka may prove to be more cost effective.
DESIGN AND ARCHITECTURAL DECISIONS

This section walks through the performance, scalability, durability and delivery semantics of both platforms. We also provide an example of creating an Amazon Kinesis Stream that provides similar characteristics to an existing Apache Kafka cluster.

SCALABILITY
Both Apache Kafka and Amazon Kinesis rely on the concept of replicated partitions to provide linear scalability. More specifically, both frameworks provide the ability for users to partition the data into multiple distinct groups, and the system handles the replication of the data to multiple nodes. In the case of Apache Kafka, the scalability of each partition depends on the number of CPU cores, the amount of memory and the performance of the local disks of the node hosting that partition.

In Apache Kafka, in order to increase the throughput of the system, users have to add more hardware capacity to the cluster and migrate the existing partitions to the newly added resources. This process assumes that there are more partitions than cluster nodes. Otherwise, to increase the throughput of the cluster, one has to add more resources to the existing nodes, also known as scaling up.

In addition to increasing capacity, given that Apache Kafka holds historical data, users may be required to increase the disk footprint of the cluster. The process of adding more disk capacity is achieved by adding more nodes to the cluster.

In the case of Amazon Kinesis, given the hosted nature of the service, the throughput of each Shard is pre-advertised by the Amazon Kinesis team. Currently each Amazon Kinesis Shard provides 1000 PUT records-per-sec or 1MBps of write traffic, and 2MBps or 5 transactions-per-sec of read traffic. An example of calculating the number of Amazon Kinesis Shards is provided in the following section.

Increasing the scalability of Amazon Kinesis is easier than doing so with Apache Kafka. In order to increase the throughput of a given Amazon Kinesis stream, more Shards can be added by splitting the existing Shards.

The benefit of the Amazon Kinesis throughput model is that the users have prior knowledge of the exact performance numbers to expect for every provisioned Shard. In contrast, the Apache Kafka throughput numbers depend on the type of the resources hosting the Apache Kafka nodes. In most cases the users have to perform a load test to find out the throughput numbers each node can sustain.

This process can be error prone and at times can cause overprovisioning of Apache Kafka clusters.

The disadvantage of the Amazon Kinesis throughput model is that the current read limit of 5 transactions-per-second limits how many applications can read from Amazon Kinesis at any given time. In other words, if you have multiple applications that each need to pull data once a second from all Amazon Kinesis Shards, the maximum number of applications that can be supported by a single Amazon Kinesis Stream is 5. In order to increase the number of concurrent applications consuming all Amazon Kinesis Shards, one has to limit how often each application reads from Amazon Kinesis to stay below the 5 reads-per-sec per Shard limit. Alternatively, users can increase the number of Shards to increase the overall number of fetches/sec allowed by the Amazon Kinesis stream. An example of calculating the required number of Shards to enable parallel reads is provided in the Architecting An Amazon Kinesis Stream section.

LATENCY
Latency is defined as the time from the moment a given record is accepted by the platform until the time the consumer is able to read that same record.

Apache Kafka can be configured to deliver < 1 second latency depending on the cluster and the producer/consumer configurations. Amazon Kinesis latency has been shown to be in the range of 1-5 seconds. Applications that require < 1 second latency are not an ideal use-case for Amazon Kinesis. Amazon Kinesis is a best fit for use-cases with higher throughput and larger payloads, as opposed to smaller payloads at a higher records-per-sec rate (see the cost factor section).

DURABILITY
As mentioned in the previous sections, Apache Kafka provides durability by replicating data to multiple broker nodes. Amazon Kinesis provides the same durability guarantees by replicating the data to multiple availability zones. The major difference between the two systems is the need for users to configure and control the Apache Kafka replication strategy, while Amazon Kinesis replication is handled by Amazon.

DELIVERY SEMANTICS
Both Apache Kafka and Amazon Kinesis provide at-least-once delivery semantics. More accurately, both systems may, at times, deliver duplicate records to the consumers. In most cases the reason for duplicated records is a retry at the producer or the consumer level. Enforcing idempotency within the consumer application can produce exactly-once semantics. The details of writing idempotent applications are beyond the scope of this document. What users should be aware of is that there is a potential for duplicate messages and the consumers have to tolerate such a scenario.
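A common way to tolerate duplicates is to make the consumer's writes idempotent by keying them on the record's partition/Shard identifier and offset/sequence number, so that a redelivered record simply overwrites its earlier copy. A minimal sketch of this pattern is shown below; the key-value store interface is an assumption for illustration, not part of either platform's API.

// Idempotent sink sketch: the partition/Shard id plus the offset or sequence number form a
// natural unique key, so re-processing the same record is a harmless overwrite.
trait KeyValueStore { def put(key: String, value: Array[Byte]): Unit }

def applyIdempotently(store: KeyValueStore,
                      shardOrPartitionId: String,
                      offsetOrSequenceNumber: String,
                      payload: Array[Byte]): Unit = {
  val dedupKey = s"$shardOrPartitionId-$offsetOrSequenceNumber"
  store.put(dedupKey, payload) // an upsert: duplicates are applied at most once
}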

Architecting An Amazon Kinesis Stream

The following example demonstrates how to design an Amazon Kinesis Stream given the cluster configuration provided below:

   Requirements                                                    Values
1  Daily Ingested Data Size (TB)                                   0.5
2  Retention days                                                  7
3  Replication Factor                                              2
4  Payload Size (KB)                                               2
5  Number of Apache Kafka Partitions                               5
6  Number of consumer applications                                 3
7  Fetches per sec per consumer application                        1
8  Incoming traffic rate (KBps)                                    5787
9  Incoming traffic rate (Rec/sec)                                 2894
10 Per Apache Kafka partition incoming traffic rate (KBps)         1157
11 Per Apache Kafka partition incoming traffic rate (Rec/sec)      579
12 Fetch per consumer application (KBps)                           5787
13 Number of Amazon Kinesis Shards required (Write)                6
14 Number of Amazon Kinesis Shards required (Read)                 9
15 Number of Amazon Kinesis Shards required                        9

In order to meet the performance of our existing Apache Kafka cluster, we need to ensure that our Amazon Kinesis stream can sustain the rate of the incoming and outgoing traffic, by calculating the required number of Amazon Kinesis Shards that provide the same level of performance/throughput.

Incoming traffic (Write)

A simple calculation tells us how many Shards are required to sustain our incoming traffic:

Number of Shards to support the incoming traffic = MAX (KBps/1000, Records/sec/1000)
Number of Shards to support the incoming traffic = MAX (5787/1000, 2894/1000)
Number of Shards to support the incoming traffic = MAX (~6, ~3)
Number of Shards to support the incoming traffic = 6

Using six Shards, our Amazon Kinesis stream can sustain up to 6000 KBps or 6000 records/sec, which matches our existing Apache Kafka cluster's incoming traffic.

Outgoing traffic (Read)

Our current Apache Kafka cluster is supporting three consumer applications, each fetching at a rate of once per second, or 5787 KBps. Given these numbers, we can calculate the number of Shards required to support our Amazon Kinesis consumers:

Number of Shards to support the outgoing traffic = MAX (KBps/2000, Consumer Fetches/sec/5)
Number of Shards to support the outgoing traffic = MAX (5787*3/2000, 1*3/5)
Number of Shards to support the outgoing traffic = MAX (17361/2000, 1*3/5)
Number of Shards to support the outgoing traffic = MAX (~9, ~1)
Number of Shards to support the outgoing traffic = 9

Using 9 Shards, our Amazon Kinesis stream can sustain up to 18000 KBps or 45 fetches/sec, which matches our existing Apache Kafka cluster's outgoing traffic.

Total Shards required

Now that we've calculated how many Shards are required to support both our incoming and outgoing traffic, we can calculate the total number of Shards:

Total Shards = MAX (# Shards to support incoming traffic, # Shards to support outgoing traffic)
Total Shards = MAX (6, 9)
Total Shards = 9
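The same sizing formulas can be packaged as a small helper, which makes it easy to re-run the calculation for different traffic profiles; the per-Shard limits below are the ones quoted earlier in this document.

// Shard-sizing sketch using the per-Shard limits above (1 MBps / 1000 rec/sec write, 2 MBps / 5 reads/sec read)
def writeShards(incomingKBps: Double, incomingRecPerSec: Double): Int =
  math.ceil(math.max(incomingKBps / 1000, incomingRecPerSec / 1000)).toInt

def readShards(totalReadKBps: Double, totalFetchesPerSec: Double): Int =
  math.ceil(math.max(totalReadKBps / 2000, totalFetchesPerSec / 5)).toInt

val write = writeShards(5787, 2894)      // = 6
val read  = readShards(5787 * 3, 1 * 3)  // = 9
val total = math.max(write, read)        // = 9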

APPLICATION CHANGES

Let's quickly walk through the differences between Producer and Consumer applications in Apache Kafka and Amazon Kinesis.

PRODUCER CHANGES
Both Apache Kafka and Amazon Kinesis producers perform a similar set of high-level tasks, such as:

1. Accept records from higher level applications
2. Perform batching and/or compression
3. Partition the records between partitions/Shards
4. Submit record(s) to Apache Kafka or Amazon Kinesis

However, despite the similarities, there are some important differences that users should be aware of. This section provides a comparison of the features and behaviors of Apache Kafka and Amazon Kinesis.

In order to better organize the producer behaviors and logic, we've divided the actions into the following areas:

Submitting Records
Compression and batching
Failure handling
Backpressure handling [TBD]

Note: Users can use the Amazon Kinesis API or the KPL (Kinesis Producer Library) to interact with Amazon Kinesis producer APIs. The comparison below assumes using the KPL instead of the direct API.

API Class
Apache Kafka: KafkaProducer
Amazon Kinesis: KinesisProducer

Submitting Records - General Behavior
Apache Kafka: KafkaProducer.send() has async behavior and returns immediately. It also provides a callback method to act on completed record events. The KafkaProducer sends records to each partition's broker leader.
Amazon Kinesis: KinesisProducer.addUserRecord has both async and sync behaviors: it returns a Future object as the result of the addUserRecord call. Users can block on the Future to get the status of the submission, or use the Future's async capabilities to check the status of the submission as events are provided to the user code. The KinesisProducer sends records to a single API endpoint regardless of the number of Shards.

Submitting Records - Record Completeness
Apache Kafka: The "acks" configuration setting controls whether a record is considered successfully submitted once one or all brokers have received the submitted record.
Amazon Kinesis: The Amazon Kinesis API synchronously replicates data across three facilities in an AWS Region and returns 200 OK to the users.

Submitting Records - Record Partitioning
Apache Kafka: Users can provide a partition number on a per-record basis to specify the exact partition records should be submitted to. The default partitioning logic is hash(key) % numPartitions. A custom partitioner can be provided to the client using "partitioner.class".
Amazon Kinesis: Partition keys are Unicode strings and associate data records to Shards using the hash key ranges of the Shards. An MD5 hash function is used to map partition keys to 128-bit integer values and to map the associated data records to Shards. Users can override hashing the partition key to determine the Shard by explicitly specifying a hash value using the ExplicitHashKey parameter.

Batching - General Behavior
Apache Kafka: KafkaProducer.send() adds records to an in-memory buffer.
Amazon Kinesis: KinesisProducer.addUserRecord buffers records in memory until it is ready to submit them.

Batching - Configuration
Apache Kafka: How many records to keep before submitting is controlled by "batch.size"; the "linger.ms" config parameter controls how often to send batches.
Amazon Kinesis: How many records to buffer and how often to submit the batched records is controlled by "RecordMaxBufferedTime".
Compression
Apache Kafka: The client can compress data before sending it to Kafka if "compression.type" is set.
Amazon Kinesis: The Kinesis client library (KPL) does not provide compression, but users can compress records in their application logic.

Failure Handling - General Behavior
Apache Kafka: The client will retry failed submissions if configured via the "retries" config parameter.
Amazon Kinesis: The client will retry failed submissions.

Failure Handling - Acting on Failures
Apache Kafka: KafkaProducer.send() provides a callback method to act on record failures.
Amazon Kinesis: KinesisProducer.addUserRecord returns Future objects that users can use to check the status of the record submissions.
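To make the comparison concrete, the sketch below shows roughly how a single record submission looks in each client. Class and method names follow the Kafka producer API and the KPL, but the broker address, topic, stream and key names are made-up placeholders and the error handling is simplified.

import java.nio.ByteBuffer
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}
import com.amazonaws.services.kinesis.producer.{KinesisProducer, UserRecordResult}
import com.google.common.util.concurrent.{FutureCallback, Futures}

// Apache Kafka: send() returns immediately; the callback fires once the partition leader
// (and, depending on "acks", the followers) acknowledge the record.
val kafkaProps = new Properties()
kafkaProps.put("bootstrap.servers", "hostname1:9092")
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
val kafka = new KafkaProducer[String, Array[Byte]](kafkaProps)
kafka.send(new ProducerRecord("myTopic", "myKey", "payload".getBytes("UTF-8")),
  new Callback {
    override def onCompletion(meta: RecordMetadata, e: Exception): Unit =
      if (e != null) println("Kafka send failed: " + e.getMessage)
  })

// Amazon Kinesis (KPL): addUserRecord() returns a Future that resolves once the service
// has replicated the record and returned 200 OK.
val kinesis = new KinesisProducer()
val futureRes = kinesis.addUserRecord("myStream", "myKey", ByteBuffer.wrap("payload".getBytes("UTF-8")))
Futures.addCallback(futureRes, new FutureCallback[UserRecordResult] {
  override def onSuccess(r: UserRecordResult): Unit = println("Sequence number: " + r.getSequenceNumber)
  override def onFailure(t: Throwable): Unit = println("Kinesis put failed: " + t.getMessage)
})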

CONSUMER CHANGES
This section provides an overview of the Consumer application differences in Apache Kafka and Amazon Kinesis.

Generally speaking, consumers perform the following tasks:

1. Consuming records from Apache Kafka or Amazon Kinesis.
2. Performing an action, such as writing to a database or transforming the record into some other format.
3. Performing load-balancing and failure handling.

While at a high level both Apache Kafka and Amazon Kinesis perform similar tasks, there are differences that users should be aware of. The rest of this section focuses on the differences both conceptually and at the API level.

In order to better organize the consumer behaviors and logic, we've divided the consumer actions into the following high-level areas:

Load balancing: How Kafka and Kinesis consumers distribute reading between multiple consumers
Offset Control: The logic that both Kafka and Kinesis follow to track which records have been processed throughout the life of the consumer
Failure handling: How Kafka and Kinesis handle various failure scenarios

Note: Users can use the Amazon Kinesis API or the KCL (Kinesis Client Library) to interact with Amazon Kinesis consumer APIs. The comparison below assumes using the KCL instead of the direct API.

API Class
Apache Kafka: KafkaConsumer
Amazon Kinesis: Implementing the KCL IRecordProcessor

Consuming Records
Apache Kafka: KafkaConsumer.poll() consumes records from Kafka.
Amazon Kinesis: The user-implemented IRecordProcessor.processRecords() is provided with a list of records to process.

Load-Balancing
Apache Kafka: Kafka load-balances records between the subscribers in a given Consumer Group. Each consumer can have one or more partitions assigned to it. Assigning partitions to consumers is automatic and happens at the Kafka Broker level.
Amazon Kinesis: Each instance of a KCL application uses a KCL worker, which in turn creates a KCL RecordProcessor per Kinesis Shard. A single Kinesis application instance creates multiple RecordProcessors to handle reading from multiple Shards. If multiple instances of the KCL application are deployed on multiple machines, the work of reading from the Shards is divided between them automatically. Assigning Shards to KCL workers happens at the client level and is coordinated with the help of a DynamoDB table that holds each worker's state.

State Control
Apache Kafka: Kafka allows consumers to persist their offset position outside of Kafka (i.e. in Zookeeper). Kafka Consumers can also automatically checkpoint their position in the stream to Kafka; in this case consumers can't handle failures at the record level.
Amazon Kinesis: The KCL library leverages DynamoDB to persist the sequence number of each KCL worker. Using external storage other than DynamoDB is not supported by the KCL; users have to use the Kinesis APIs to develop their own logic. The KCL library does not provide an automatic checkpoint capability; users can develop similar logic using the KCL's manual checkpoint feature.

Offset Control
Apache Kafka: Kafka Consumers can manually checkpoint their positions to Kafka using KafkaConsumer.commitSync or KafkaConsumer.commitAsync.
Amazon Kinesis: Users can use the KCL's IRecordProcessorCheckpointer.checkpoint feature to persist their position in the stream. This method allows for failure handling.

Consumer Position Control
Apache Kafka: Kafka allows consumers to manually control their offset position at any given time using KafkaConsumer.seek, KafkaConsumer.seekToBeginning and KafkaConsumer.seekToEnd. In other words, Kafka consumers can move forwards or backwards in the stream.
Amazon Kinesis: The KCL provides the ability to start from the beginning or the end of the stream. If the ability to seek to a specific sequence number is required, the KCL library cannot be used; users have to use the following Kinesis direct APIs:

GetShardIteratorRequest.setShardIteratorType (AT_SEQUENCE_NUMBER or AFTER_SEQUENCE_NUMBER)
GetShardIteratorRequest.setStartingSequenceNumber(sequenceNumber)

Failure Handling - Broker Failure
Apache Kafka: The Kafka client transparently handles Kafka broker failures and adapts as partitions migrate within the cluster.
Amazon Kinesis: Handled by the Kinesis service APIs; consumers don't have to handle this failure.

Failure Handling - Consumer Failure
Apache Kafka: Kafka Brokers perform health checks on consumers by tracking consumer heartbeats, and re-balance the partitions between the remaining consumers if there are failures.
Amazon Kinesis: In the case of KCL worker failures, the existing healthy KCL workers spawn new record processors to take over from the failed processors. The coordination and health checking happen using DynamoDB and the concept of a lease [TBD]. If consumers are not leveraging the KCL library, the failure handling mentioned here has to be implemented by the users.
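Since much of the comparison above revolves around the KCL's IRecordProcessor, a minimal sketch of a processor is shown below. It assumes KCL 1.x class names; the worker wiring, configuration and the retry/error handling around checkpoint() are omitted.

import java.util.{List => JList}
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
import com.amazonaws.services.kinesis.model.Record

class SampleRecordProcessor extends IRecordProcessor {
  override def initialize(shardId: String): Unit =
    println(s"Starting record processor for Shard $shardId")

  override def processRecords(records: JList[Record], checkpointer: IRecordProcessorCheckpointer): Unit = {
    records.asScala.foreach { record =>
      val bytes = new Array[Byte](record.getData.remaining())
      record.getData.get(bytes)
      // Process the payload here, ideally idempotently, keyed by record.getSequenceNumber
    }
    checkpointer.checkpoint() // manual checkpoint to DynamoDB once the batch has been processed
  }

  override def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: ShutdownReason): Unit =
    if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
}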

MIGRATING PLATFORMS

So far we've covered the differences between Apache Kafka and Amazon Kinesis in both technical and economic terms. While we think both systems have their own relevant use-cases, we believe that for the majority of workloads there is a justifiable case, in terms of cost reduction and eliminating maintenance effort, to migrate to Amazon Kinesis.

To migrate from Apache Kafka to Amazon Kinesis, first evaluate the previous sections to get a better understanding of the required Producer and Consumer application changes. The rest of this section provides high-level guidance on how to migrate existing data from Apache Kafka to Amazon Kinesis. We also provide sample code to demonstrate the data migration processes covered in this section.

Important concepts

The following concepts should be considered before architecting a solution to help with copying the data from Apache Kafka to Amazon Kinesis.

The copy process should reach equilibrium: The copy process should be architected such that at some point the system reaches equilibrium, meaning the rate of copying data to Amazon Kinesis reaches the same or a greater level than the rate at which data is being ingested by Apache Kafka. Otherwise, a copy process that does not reach equilibrium will never conclude. In order to ensure equilibrium is reached, it is important to know at what rate data is being ingested into Apache Kafka and to ensure the copy process can meet or exceed that rate. A convenient way to ensure equilibrium is to stop the Apache Kafka producers and let the copy process conclude. However, this may not be acceptable, as stopping producers creates delays in the data processing pipeline and also creates a situation where incoming data has to stop or be spooled at the source, which can potentially result in data loss.

The copy process should provide at a minimum an at-least-once delivery semantic: The process of copying data to Amazon Kinesis, including reading from Apache Kafka and writing to Amazon Kinesis, should support at-least-once delivery semantics. This ensures that in case of failure, whether on reading or writing data, the copy process does not lose data.

The consumer applications need to be idempotent: Due to the complexity of the copy process, where multiple systems are involved and the incoming and outgoing data are read and written over the network, it is possible for the copy process to introduce duplicate records.

In the majority of cases, duplicate records are due to the failure recovery and retry strategies that the copy process has in place. Because of the potential for duplicate records, the consumer applications should be idempotent (at-least-once semantics).

The copy process should handle backpressure: It's important for the copy process to be able to handle backpressure and adjust to Amazon Kinesis API throttling. Backpressure is the side effect of a network slow-down or a slow-down of the producer application that writes to Amazon Kinesis. During either of these scenarios, the rate of the incoming data exceeds the rate of the outgoing data and can potentially cause system instability.

The copy process should be able to recover from failures: It is important for the copy process to be able to recover from failures. There are three common failures that should be handled by the copy process:

1. The copy process itself dies
2. The resource(s) hosting the copy process dies
3. The network is unreachable

During any of the above scenarios, the copy process SHOULD NOT:

1. Experience data loss (discussed above)
2. Lose its position in the stream and be forced to start from the beginning of the stream *

* This creates duplicate records, and while the consumers are idempotent and can handle duplicate records, starting from the tip of the stream can cause data processing lag and other processing issues.

MIGRATING DATA WITH APACHE SPARK STREAMING
To support the process of copying data between Apache Kafka and Amazon Kinesis, we've decided to use Apache Spark Streaming. Apache Spark supports the important factors mentioned above. Let's quickly review how Apache Spark handles the cases mentioned in the previous sections (a configuration sketch follows this list).

The copy process should reach equilibrium: Apache Spark can guarantee equilibrium by providing the ability to distribute the copy process between multiple distinct nodes. The distributed nature of Apache Spark provides the performance and throughput required to create equilibrium between both systems.

The copy process should provide at a minimum an at-least-once delivery semantic: Spark Streaming provides at-least-once delivery semantics while reading from Apache Kafka, holding data in memory during the copy, and writing to Amazon Kinesis.

The copy process should handle backpressure: Spark Streaming has the ability to rate-limit reading from Apache Kafka in cases where writing to Amazon Kinesis has slowed down.

The copy process should be able to recover from failures: Spark Streaming can recover from node failures without data loss and can resume from the last record that was successfully copied to Amazon Kinesis.
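Spark's rate limiting and recovery behavior are driven largely by configuration rather than application code. The sketch below shows the settings most relevant to this copy job; the property names are standard Spark Streaming configuration keys, while the values are illustrative assumptions that would need tuning against the target stream's Shard count.

import org.apache.spark.SparkConf

// Backpressure and rate-limit settings for the Kafka-to-Kinesis copy job (values are illustrative)
val throttledConf = new SparkConf()
  .setAppName("SparkDataCopy")
  // Let Spark Streaming adapt the ingestion rate to how quickly batches are completing
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap on records read per Kafka partition per second when using the direct stream
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")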

PREPARATION
In preparation for migrating the data stored by Apache Kafka to Amazon Kinesis, we'll perform the following tasks:

1. Calculate the number of Amazon Kinesis Shards
2. Create the Amazon Kinesis Stream and Shards
3. Create an Amazon EMR Spark cluster

Calculate the number of Amazon Kinesis Shards to create

Refer to the previous section (Architecting An Amazon Kinesis Stream) for an example of calculating the number of Kinesis Shards. Based on the example provided, our Amazon Kinesis Stream requires 9 Shards.

Create Amazon Kinesis Stream and Shards

Refer to the following guide on how to create Amazon Kinesis streams: http://docs.aws.amazon.com/streams/latest/dev/learning-kinesis-module-one-create-stream.html
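The stream can also be created programmatically; a small sketch using the AWS SDK for Java is shown below. The region and stream name match the Spark sample later in this section, and the Shard count is the 9 Shards calculated above.

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder

// Create the target stream with the 9 Shards calculated in the previous section
val kinesisClient = AmazonKinesisClientBuilder.standard().withRegion("us-west-2").build()
kinesisClient.createStream("KafkaKinesisMigration", 9)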

Create EMR Spark cluster

To execute the Apache Spark Streaming code (below) we'll use Amazon EMR to create a Spark cluster. For demonstration purposes, a one- or two-node EMR cluster suffices.

Refer to the following guide to create an Amazon EMR cluster: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-launch.html

Spark Streaming Data Migration Code

The following sample code demonstrates how to take advantage of Apache Spark to copy data from Apache Kafka to Amazon Kinesis.

Note: it is important to reiterate that this is sample code and is not meant to be used in production without further modification and implementing some of the throttling logic.

We first configure the core Apache Spark component (SparkContext) and set the checkpoint HDFS directory. The checkpoint directory is where Spark Streaming persists important metadata to avoid data loss in case of any cluster interruption (The copy process should be able to recover from failures):

/*
Apache Spark Settings
*/
val conf = new SparkConf().setMaster("yarn").setAppName("SparkDataCopy")
val sc = new SparkContext(conf)
val checkPointDir = "/spark/checkpoint/sparkkafkacopy/"

Next we set up the Amazon Kinesis configuration settings. In this example we're using the KafkaKinesisMigration stream created in the us-west-2 AWS region:

/*
Amazon Kinesis Settings
*/
val region = "us-west-2"
val streamName = "KafkaKinesisMigration"

Similarly, we'll create the Apache Kafka specific configuration. Remember to replace the brokerList string with your Apache Kafka Brokers' hostnames:

/*
Apache Kafka Settings
*/
val topics = "KafkaKinesisMigration2"
val brokerList: String = "hostname1:9092,hostname2:9092"
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokerList,
  "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer")

As mentioned in the previous section, our data copy logic has to be able to handle backpressure (The copy process should handle backpressure). One of the scenarios where we can potentially experience backpressure is when Apache Spark is copying data to Amazon Kinesis at a higher rate than allowed by the Amazon Kinesis APIs. As mentioned in the previous sections, the total Amazon Kinesis Stream throughput depends on the total number of Shards created. The following code keeps track of the number of messages and bytes that have been successfully copied to Amazon Kinesis. Later on we'll use these metrics to slow down our copy process if we're close to the Amazon Kinesis Shard capacity:

/*
Tracking how many bytes and messages have been sent
*/
val bytesSent = sc.accumulator[Long](0, "BytesSent")
val messegesSent = sc.accumulator[Long](0, "MessagesSent")
val startTimeMs = sc.accumulator[Long](System.currentTimeMillis, "StartTimeMs")

The important data extract and copy logic happens in the following two sections.
We first read messages from Apache Kafka:

val messages = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, topicsSet)

And later copy the messages to Amazon Kinesis:

val producer = KinesisConnection.getConnection()

partitionOfRecords.foreach(msg => {
  val (key, value) = Helper.getKV(msg._1, msg._2)
  msgSize = value.size
  val futureRes = producer.addUserRecord(streamName, key.toString(), ByteBuffer.wrap(value.getBytes()))
  Futures.addCallback(futureRes, myCallback)

The main logic that handles backpressure is implemented here:

val (msgPerSec, bytePerSec) = Helper.performanceMetrics(startTimeMs.value, messegesSent.value, bytesSent.value)

logger.debug("MsgPerSec: " + msgPerSec)
logger.debug("BytesPerSec: " + bytePerSec)

while (msgPerSec > msgPerSecThrottle || bytePerSec > bytesPerSecThrottle) {
  Helper.throttle(500)
}

while (producer.getOutstandingRecordsCount >= 10000) {
  logger.info("BACKPRESSURE INVOKED: " + producer.getOutstandingRecordsCount)
  Helper.throttle(1000)
}

The full sample code is provided in Appendix B.

CONCLUSION

Both Apache Kafka and Amazon Kinesis are well-engineered solutions meant to help with stream processing requirements. Apache Kafka is an open-source solution where users have the flexibility to configure different aspects of the platform. However, users are also tasked with maintaining an Apache Kafka infrastructure. Amazon Kinesis, on the other hand, provides a similar set of capabilities, but since it's a managed offering, it provides less flexibility with the advantage of eliminating the need for users to maintain an infrastructure.

When it comes to deciding on the right solution, users have to balance flexibility, cost and API features. In this document we argued that Amazon Kinesis, given its hosted nature, provides a lower cost of maintenance. In comparison, Apache Kafka provides a richer API interface with higher flexibility, but also higher hosting and maintenance costs.

We hope that by reading this document one can evaluate the cost and flexibility factors provided by each solution and decide the best path forward for their specific workloads.

APPENDIX A

Apache Kafka and Amazon Kinesis Producer and Consumer Configuration Comparison

Producer Configuration Comparison

Apache Kafka                              Amazon Kinesis
bootstrap.servers                         Kinesis endpoint URL
key.serializer                            N/A
value.serializer                          N/A
acks                                      N/A
buffer.memory                             5MB
compression.type                          N/A
retries
batch.size                                CollectionMaxCount
client.id                                 N/A
connections.max.idle.ms                   AWS SDK Configuration
linger.ms                                 RecordMaxBufferedTime
max.block.ms                              N/A
max.request.size                          5MB
partitioner.class                         N/A
receive.buffer.bytes                      N/A
request.timeout.ms                        RequestTimeout
sasl.kerberos.service.name                N/A
timeout.ms                                AWS SDK Configuration
block.on.buffer.full                      N/A
max.in.flight.requests.per.connection     N/A
metadata.fetch.timeout.ms                 N/A
metadata.max.age.ms                       N/A
metric.reporters                          CloudWatch Metrics
metrics.num.samples                       CloudWatch Metrics
metrics.sample.window.ms                  CloudWatch Metrics
reconnect.backoff.ms                      AWS SDK Configuration
retry.backoff.ms                          RateLimit

Consumer Configuration Comparison

Apache Kafka                              Amazon Kinesis
bootstrap.servers                         Kinesis API Endpoint
key.deserializer                          N/A
value.deserializer                        N/A
auto.commit.interval.ms                   N/A
fetch.min.bytes                           N/A
check.crcs                                N/A
group.id                                  ApplicationName
client.id                                 workerIdentifier
heartbeat.interval.ms                     N/A
fetch.max.wait.ms                         N/A
max.partition.fetch.bytes                 N/A
metadata.max.age.ms                       N/A
session.timeout.ms                        failoverTimeMillis
metric.reporters                          N/A
auto.offset.reset                         N/A
metrics.num.samples                       metricsMaxQueueSize
metrics.sample.window.ms                  metricsBufferTimeMillis
connections.max.idle.ms                   ClientConfiguration.connectionMaxIdleMillis
enable.auto.commit                        N/A
partition.assignment.strategy             N/A
receive.buffer.bytes                      ClientConfiguration.socketReceiveBufferSizeHint
send.buffer.bytes                         ClientConfiguration.socketSendBufferSizeHint
request.timeout.ms                        ClientConfiguration.connectionTimeout
reconnect.backoff.ms                      ClientConfiguration.RetryPolicy
retry.backoff.ms                          ClientConfiguration.RetryPolicy

APPENDIX B

Apache Spark Streaming sample code

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.producer._
import com.google.common.util.concurrent.{Futures, FutureCallback}
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory
import scala.collection.JavaConverters._

object KafkaCopy extends App {

  import org.apache.spark.streaming.kafka._

  val logger = LoggerFactory.getLogger("SparkCopy")

  /*
  Spark Settings
  */
  val conf = new SparkConf().setMaster("yarn").setAppName("SparkDataCopy")
  val sc = new SparkContext(conf)
  val checkPointDir = "/spark/checkpoint/sparkkafkacopy"

  /*
  Amazon Kinesis Settings
  */
  val region = "us-west-2"
  val streamName = "KafkaKinesisMigration"

  /*
  Apache Kafka Settings
  */
  val topics = "KafkaKinesisMigration2"
  val brokerList: String = "hostname1:9092,hostname2:9092"
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> brokerList,
    "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer")

  /*
  Tracking how many bytes and messages have been sent
  */
  val bytesSent = sc.accumulator[Long](0, "BytesSent")
  val messegesSent = sc.accumulator[Long](0, "MessagesSent")
  val startTimeMs = sc.accumulator[Long](System.currentTimeMillis, "StartTimeMs")

  /*
  Spark Streaming logic starts here
  */
  def functionToCreateContext(): StreamingContext = {

    val ssc = new StreamingContext(sc, Seconds(1))

    val topicMap = topics.split(",").map((_, 2.toInt)).toMap
    val messages = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topicsSet)

    val numShards = Helper.getKinesisNumberOfShards(streamName, region)
    logger.info("Number of Shards: " + numShards)

    val (msgPerSecThrottle, bytesPerSecThrottle) = Helper.getWriteThrottleThresholds(numShards)
    logger.info("Throttle Msg/Sec Thresholds: " + msgPerSecThrottle)
    logger.info("Throttle Bytes/Sec Thresholds: " + bytesPerSecThrottle)

    messages.foreachRDD(msgRDD => {
      msgRDD.foreachPartition(partitionOfRecords => {
        var msgSize = 0
        val myCallback = new FutureCallback[UserRecordResult] {
          override def onFailure(throwable: Throwable): Unit = {
            throwable match {
              case e: UserRecordFailedException =>
                val result = e.getResult
                result.getAttempts.asScala.foreach(a =>
                  println("Error Details: " + a.getDelay + " " + a.getDuration + " " +
                    a.getErrorCode + " " + a.getErrorMessage))
              case _ =>
            }
          }
          override def onSuccess(v: UserRecordResult): Unit = {
            logger.debug("PutRecords SUCCESSFUL")
            messegesSent += 1
            bytesSent += msgSize
          }
        }

        val producer = KinesisConnection.getConnection()

        partitionOfRecords.foreach(msg => {
          val (key, value) = Helper.getKV(msg._1, msg._2)
          msgSize = value.size
          val futureRes = producer.addUserRecord(streamName, key.toString(), ByteBuffer.wrap(value.getBytes()))
          Futures.addCallback(futureRes, myCallback)

          val (msgPerSec, bytePerSec) = Helper.performanceMetrics(startTimeMs.value, messegesSent.value, bytesSent.value)

          logger.debug("MsgPerSec: " + msgPerSec)
          logger.debug("BytesPerSec: " + bytePerSec)

          while (msgPerSec > msgPerSecThrottle || bytePerSec > bytesPerSecThrottle) {
            Helper.throttle(500)
          }

          while (producer.getOutstandingRecordsCount >= 10000) {
            logger.info("BACKPRESSURE INVOKED: " + producer.getOutstandingRecordsCount)
            Helper.throttle(1000)
          }
        })
      })
    })
    ssc.checkpoint(checkPointDir)
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkPointDir, functionToCreateContext _)

  ssc.start()
  ssc.awaitTermination()
}
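The sample above references a Helper object and a KinesisConnection object that are not included in the paper. The following is a minimal sketch of what they might look like; the names, thresholds, fallback key and the choice of region are assumptions for illustration, not part of the original sample.

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.producer.{KinesisProducer, KinesisProducerConfiguration}

object KinesisConnection {
  // One KPL producer per JVM/executor; the region is assumed to be us-west-2
  lazy val producer = new KinesisProducer(new KinesisProducerConfiguration().setRegion("us-west-2"))
  def getConnection(): KinesisProducer = producer
}

object Helper {
  // Extract a (key, value) pair from the raw Kafka byte arrays; a null key falls back to a constant
  def getKV(key: Array[Byte], value: Array[Byte]): (String, String) =
    (Option(key).map(new String(_, "UTF-8")).getOrElse("defaultKey"), new String(value, "UTF-8"))

  // Number of Shards in the target stream, used to derive the throttling thresholds
  def getKinesisNumberOfShards(streamName: String, region: String): Int = {
    val client = AmazonKinesisClientBuilder.standard().withRegion(region).build()
    client.describeStream(streamName).getStreamDescription.getShards.size()
  }

  // Per-stream write limits: 1000 records/sec and 1 MB/sec per Shard (per the limits quoted earlier)
  def getWriteThrottleThresholds(numShards: Int): (Long, Long) =
    (numShards * 1000L, numShards * 1024L * 1024L)

  // Average records/sec and bytes/sec since the copy started
  def performanceMetrics(startTimeMs: Long, messages: Long, bytes: Long): (Double, Double) = {
    val elapsedSec = math.max((System.currentTimeMillis() - startTimeMs) / 1000.0, 1.0)
    (messages / elapsedSec, bytes / elapsedSec)
  }

  // Simple blocking throttle
  def throttle(ms: Long): Unit = Thread.sleep(ms)
}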

A next generation MSP, Datapipe is recognized as the
pioneer of managed services for public cloud platforms.
Datapipe has unique expertise in architecting, migrating,
managing and securing public cloud, private cloud, hybrid IT
and traditional IT around the globe. The world's most trusted
brands partner with Datapipe to optimize mission-critical
and day-to-day enterprise IT operations, enabling them
to transform, innovate, and scale. Backed by a global team
of experienced professionals and world-class interconnected
data centers, Datapipe provides comprehensive cloud,
compliance, security, governance, automation and DevOps
solutions. Gartner named Datapipe a leader in the Magic
Quadrant for Cloud-Enabled Managed Hosting.

DATAPIPE.COM US: +1 877 773 3306 UK: +44 800 634 3414 HK: +852 3521 0215

© 2016 Datapipe