
Confluent Platform Reference

Architecture for Kubernetes

Viktor Gamov
August 2018

www.confluent.io/contact ©2018 Confluent, Inc.


Table of Contents

Confluent Platform Architecture for Kubernetes
  Prerequisites
  General Architectural Considerations
    Typical questions and answers about Kubernetes
  Apache ZooKeeper Configuration
    CPU
    Manifests
    ZooKeeper StatefulSet
  Kafka Brokers
    Cluster Size
    CPU
    Disk
    Memory
    Readiness and Liveness
    Kafka Application Logging
    Manifests
  Kafka Streams API
    Stateless pod and data safety
    Stateless pod and application restoration/recovery time
  Conclusion
  References

NOTE: All recommendations are valid for Kubernetes 1.9.x+ and Confluent Platform 5.0.x.



Confluent Platform Architecture for Kubernetes
This document is inspired by the popular Confluent Enterprise Reference Architecture and is specifically tailored to the
Kubernetes use case.
Refer to the Reference Architecture for a more detailed guide on configuring each Confluent Platform component for on-premises
deployments.
This document uses the Helm Charts for Confluent Platform as a reference to illustrate configuration and deployment practices.
Apache Kafka® is an open source streaming platform designed to provide the basic components necessary for managing streaming
data storage (Kafka core), data integration (Kafka Connect API), and processing (Kafka Streams API). Apache Kafka is a proven
technology, deployed in countless production environments to power some of the world’s largest stream processing systems.
Confluent Platform complements Apache Kafka with selected open source projects, as well as enterprise-grade features, that make
Confluent Platform a one-stop shop for setting up a production-ready streaming platform. These open source projects include
clients for C, C++, Python, .NET, and Go programming languages; connectors for JDBC, JMS, Elasticsearch, S3, and HDFS; Confluent
Schema Registry for managing metadata for Kafka topics; KSQL for stream processing; and Confluent REST Proxy for integrating with
applications where a native Kafka client is not available.
Enterprise-level features include Confluent Control Center for end-to-end monitoring of data streams, Replicator for managing
multi-datacenter deployments, and Auto Data Balancer for optimizing resource utilization and easy scalability.
We’ll describe the architecture from the ground up and, for each component, explain when it is needed and how best to
deploy it.
In addition, you should refer to Confluent documentation for Helm Charts.

Prerequisites
This document doesn’t provide a comprehensive overview of all Kubernetes features. If you’re not familiar with Kubernetes, we
recommend starting with the official Kubernetes Getting started guide.
To get more context about running a Kafka-based streaming platform, we recommend reading Gwen Shapira’s Recommendations
for Deploying Apache Kafka on Kubernetes.
That document covers the Kubernetes concepts you need to be familiar with in order to follow this one:
• Pod
• ReplicationController
• StatefulSets
• Persistent Volumes
• Service, HeadlessService



General Architectural Considerations
Before we jump into discussing individual configuration options, let me set the stage and address common questions.
First and foremost, which version of Kubernetes should you use? As Gwen mentioned in her presentation, the Kubernetes community is
very productive, and development is extremely active. We recommend using Kubernetes 1.9.x or any newer release; the
recommendations in this paper assume such a release. It’s important to continually upgrade your clusters to keep up with the fast pace of
Kubernetes releases.
Next, how should you bring Confluent Platform to Kubernetes? Since Confluent Platform is a complex set of software
components, the deployment tool should provide an easy way to deploy the whole platform as well as individual components. Helm
charts are a well-established way to distribute packages on Kubernetes, and they are the recommended way of deploying Confluent
Platform. In this whitepaper, we discuss recommendations for deploying the components of Confluent Platform on Kubernetes.
Apache Kafka is the heart of Confluent Platform, and persistence is a crucial aspect of the platform. Kubernetes provides
primitives, PersistentVolumes and PersistentVolumeClaims, for managing storage for stateful systems like Apache Kafka.
Choose a storage class that matches the target deployment platform (AWS, GCE, etc.): the aws-ebs StorageClass provisioner is the
natural choice on AWS, and the gce-pd provisioner on GCE.
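For illustration, a StorageClass backed by AWS EBS gp2 volumes might look like the following sketch; the class name fast-ebs and the
gp2 volume type are example choices, not something prescribed by the Helm charts.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ebs               # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2                    # EBS volume type; st1 or io1 may fit other workloads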
It is also worth mentioning that, since Apache Kafka is memory intensive, we recommend R3 or R4 instance types on AWS. AWS
Enhanced Networking is supported on R3 and R4 instances, and we recommend enabling it.
The open Kubernetes model allows many different Container Network Interface (CNI) providers. The Canal CNI implementation is
used in many existing deployments in the Kubernetes community. For further details, we recommend reading the external documentation.

Apache ZooKeeper Configuration


This section explains the requirements and recommendations for Apache ZooKeeper™ on Kubernetes.
Apache ZooKeeper is a centralized service for managing distributed processes and is a mandatory component in every Kafka
cluster. While the Kafka community has been working to reduce the dependency of Kafka clients on ZooKeeper, Kafka brokers still
use ZooKeeper to manage cluster membership and elect a cluster controller. Kafka requires an installation of ZooKeeper for broker
configuration storage and coordination.
NOTE: It is recommended to deploy a separate, small ZooKeeper ensemble for each Kafka cluster instead of using a large
multi-tenant ensemble.
ZooKeeper uses the JVM heap, and a 4 GB heap is typically sufficient.
The StatefulSet’s spec.template.spec.containers[0].resources.requests.memory field controls the memory allocated to the container.
For production deployments, you should consider requesting the larger of 5-6 GB of RAM or 1.5 times the
size of the configured JVM heap.
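As a minimal sketch (the container name and values below are examples, not chart defaults), the memory request in the ZooKeeper
StatefulSet might look like this:
spec:
  template:
    spec:
      containers:
      - name: zookeeper
        resources:
          requests:
            memory: 6Gi        # example: roughly 1.5x a 4 GB JVM heap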

CPU
ZooKeeper is not a CPU-intensive application. For a production deployment, you should start with 2 CPUs and adjust as necessary.
For a demonstration deployment, you can set the CPU request as low as 0.5. The amount of CPU is configured by setting the StatefulSet’s
spec.template.spec.containers[0].resources.requests.cpu field.
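Continuing the sketch above, the CPU request sits in the same resources block (again, the values are illustrative):
        resources:
          requests:
            cpu: "2"           # production starting point; 0.5 is enough for a demo deployment
            memory: 6Gi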



Manifests
A ZooKeeper Kubernetes deployment consists of:
• ZooKeeper StatefulSet
• Service
• HeadlessService
ZooKeeper (and Kafka, for that matter) is essentially a database and needs to be treated accordingly. ZooKeeper pods should be
deployed as a StatefulSet with at least 3 replicas; for production configurations, we recommend a minimum of five ZooKeeper pods.
Five pods allow you to tolerate the failure of 2 nodes at the same time and still provide leader election services to Kafka. A larger
ensemble also makes rolling restarts more flexible, since ZooKeeper currently doesn’t support dynamic ensemble reconfiguration.
NOTE: The ensemble cannot be reconfigured dynamically in the latest stable ZooKeeper version, which means you cannot
dynamically scale up a ZooKeeper ensemble.
It makes sense to define a minimum number of available servers to ensure the availability of the ZooKeeper service.
An example of a PodDisruptionBudget for ZooKeeper:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  ...
spec:
  ...
  minAvailable: 2    # alternatively, set maxUnavailable: 1; the two fields are mutually exclusive

ZooKeeper StatefulSet
An example of a ZooKeeper StatefulSet configuration:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  ...
spec:
  ...
  podManagementPolicy: "OrderedReady" (1)
  template:
    metadata:
      ...
    spec:
      containers:
      - ...
        image: "confluentinc/cp-zookeeper"
        livenessProbe: (2)
          exec:
            command: ['/bin/bash', '-c', 'echo "ruok" | nc -w 2 -q 2 localhost 2181 | grep imok']
          initialDelaySeconds: 1
          timeoutSeconds: 3
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper/data
        - name: datalogdir
          mountPath: /var/lib/zookeeper/log
      volumes: (3)
      - name: datadir
        emptyDir: {}
      - name: datalogdir
        emptyDir: {}
1. podManagementPolicy: "OrderedReady". To reduce startup problem messages, we recommend starting and stopping pods
one by one, which is exactly what the OrderedReady policy does.
2. Kubernetes needs to know if ZooKeeper is up and running. Fortunately, ZooKeeper provides built-in commands, known as
"four letter words", that report the health of ZooKeeper nodes; the liveness probe above uses ruok.
3. Persistent Volumes must be used. emptyDir volumes (which only survive as long as the pod is running on the node) will likely result in
a loss of data. An emptyDir volume is first created when a Pod is assigned to a Node and exists as long as that Pod is running
on that node. It is initially empty. Containers in the Pod can all read and write the same files in the emptyDir volume, though that
volume can be mounted at the same or different paths in each Container. When a Pod is removed from a node for any reason,
the data in the emptyDir is deleted forever.
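As a sketch of what callout 3 recommends, the emptyDir volumes could be replaced with volumeClaimTemplates under the StatefulSet
spec, so each ZooKeeper pod gets its own PersistentVolume; the storage class name and sizes below are illustrative.
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: standard   # illustrative; pick a class for your platform
    resources:
      requests:
        storage: 10Gi            # example size
- metadata:
    name: datalogdir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: standard
    resources:
      requests:
        storage: 10Gi
With volumeClaimTemplates in place, the volumes: section shown above is no longer needed.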

Kafka Brokers
Kafka brokers are the main storage and messaging components of Apache Kafka. Kafka is a streaming platform that uses
messaging semantics. The Kafka cluster maintains streams of messages called Topics; the topics are sharded into Partitions
(ordered, immutable logs of messages) and the partitions are replicated and distributed for high availability. The servers that
run the Kafka cluster are called Brokers.

Cluster Size
The size of the Kafka cluster, the number of brokers, is controlled by the .spec.replicas field of the StatefulSet. You should ensure that the
size of the cluster supports your planned throughput and latency requirements for all topics. If the size of the cluster gets too large, you
should consider segregating it into multiple smaller clusters.
We recommend having at least 3 Kafka brokers in a cluster, each running on a separate server. This way you can replicate each Kafka
partition at least 3 times and have a cluster that survives a failure of 2 nodes without data loss.
With 3 Kafka brokers, if any broker is not available, you won’t be able to create new topics with 3 replicas until all brokers are available again.
For this reason, if you have use-cases that require creating new topics frequently, we recommend running at least 4 brokers in a cluster.
If the Kafka cluster is not going to be highly loaded, it is acceptable to run Kafka brokers on the same servers as the ZooKeeper nodes.
In this case, it is recommended to allocate separate disks for ZooKeeper (as we’ll specify in the hardware recommendations below). For
high-throughput use cases, we do recommend installing Kafka brokers on separate nodes.
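Putting this together, a minimal sketch of the cluster-size setting in the Kafka StatefulSet (the name kafka is illustrative):
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: kafka          # illustrative name
spec:
  replicas: 3          # number of Kafka brokers
  ...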



CPU
Most Confluent Platform components are not particularly CPU-bound.
The few exceptions are:
• Compression: Kafka Producers and Consumers will compress and decompress data if you configure them to do so. We
recommend using compression when you need to save bandwidth during transport or when storing on a filesystem.
• Encryption: Starting with version 0.9.0, Kafka clients can communicate with brokers using SSL. There is a small performance
overhead on both the client and the broker when using encryption, and a larger overhead when it is the consumer that
connects over SSL: because the broker needs to encrypt messages before sending them to the consumer, it cannot use the
normal zero-copy optimization and therefore uses significantly more CPU.
• A high rate of client requests: If you have a large number of clients, or if consumers are configured with
fetch.max.wait.ms=0, they can send very frequent requests and effectively saturate the broker. In those cases, configuring clients to batch
requests will improve performance. With client counts approaching 1,000, we recommend tuning consumers to fetch less frequently.
However, CPU is unlikely to be your bottleneck.
An 8-CPU deployment should be more than sufficient for good performance. You should start by simulating your workload with 2-4
CPUs and scale up from there.
The amount of CPU is controlled by the StatefulSet’s
spec.template.spec.containers[0].resources.requests.cpu field.
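As an illustrative starting point (the container name and values are assumptions, not chart defaults), the broker’s CPU request could
look like this:
spec:
  template:
    spec:
      containers:
      - name: kafka-broker
        resources:
          requests:
            cpu: "4"           # start at 2-4 CPUs and scale up after load testing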

Disk
Disk throughput is the most common bottleneck that users encounter with Kafka. Because network-attached storage backs
Persistent Volumes, throughput is, in most cases, capped on a per-node basis regardless of the number of Persistent Volumes
attached to the node. For instance, if you are deploying Kafka onto a GKE- or GCP-based Kubernetes cluster and you use the
standard PD type, your maximum sustained per-instance throughput is 120 MB/s (write) and 180 MB/s (read). If you have multiple
applications, each with a Persistent Volume mounted, these numbers represent the total achievable throughput.
You can control the amount of disk allocated by your provisioner using the StatefulSet’s
.spec.volumeClaimTemplates[0].spec.resources.requests.storage field.
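For example, a volumeClaimTemplate requesting 500Gi per broker might look like the sketch below; the claim name, storage class, and
size are illustrative.
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: standard   # choose a class appropriate for your platform
    resources:
      requests:
        storage: 500Gi           # per-broker disk allocation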

Memory
Kafka utilizes the OS page cache heavily to buffer data.
To understand the interaction of Kafka and Linux containers you should read the Kafka file system design and memory cgroups
documentation.
Keep in mind that the page cache is managed by the kernel and is shared by all pods on the node.
The JVM heap size of the Kafka brokers is controlled by the KAFKA_HEAP_OPTS environment variable. "-Xms2G -Xmx2G" is sufficient
for most deployments.
NOTE: Currently, the Confluent Platform Helm charts use a Zulu OpenJDK build with support for cgroup limits, so JVM heap settings are
honored inside the container. See https://blogs.oracle.com/java-platform-group/java-se-support-for-docker-cpu-and-memory-limits
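A sketch of setting the heap through the StatefulSet (the container name is illustrative; the heap values are the ones suggested above):
spec:
  template:
    spec:
      containers:
      - name: kafka-broker
        env:
        - name: KAFKA_HEAP_OPTS
          value: "-Xms2G -Xmx2G"   # JVM heap; remaining container memory is left to the OS page cache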



Readiness and Liveness
The liveness of a broker is determined by whether or not the JVM process running the broker is still alive. Readiness is determined by a
readiness check that verifies the application can accept requests.
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - "/opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server=localhost:9093"
NOTE: A simple ping of the listener port may not be enough to be certain that a Kafka broker is ready. Calling the admin API
(BrokerApiVersionCommand) ensures that the broker is up, initialized, and ready to accept requests.
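If relying on the JVM process alone is not enough for liveness, one hedged option is a basic TCP check against the broker listener; the
port and timings below are examples and must match your listener configuration.
livenessProbe:
  tcpSocket:
    port: 9093             # must match the configured listener port
  initialDelaySeconds: 30
  timeoutSeconds: 5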

Kafka Application Logging


Kafka’s application logs are written to standard out, so they are captured by the default logging infrastructure (which is considered a
best practice for containerized applications).

Manifests
A Kafka Kubernetes deployment consists of:
• Kafka StatefulSet
• Kafka Service
• Kafka Headless Service

Kafka Streams API


The Kafka Streams API, a component of open source Apache Kafka, is a powerful, easy-to-use library for building highly scalable,
fault-tolerant, stateful distributed stream processing applications on top of Apache Kafka. It builds upon important concepts for
stream processing such as properly distinguishing between event-time and processing-time, handling of late-arriving data, and efficient
management of application state.
Kafka Streams is a library that is embedded in the application code (just like Jetty, for instance), and as such, you don’t need to allocate
Kafka Streams servers, but you do need to allocate servers for the applications that use Kafka Streams library (or at least resources
for their containers). Kafka Streams will use parallel-running tasks for the different partitions and processing stages in the application,
and as such will benefit from higher core count. If you deploy multiple instances of the application on multiple servers (recommended!),
the Kafka Streams library will handle load balancing and failover automatically. To maintain its application state, Kafka Streams
uses an embedded RocksDB database. It is recommended to use persistent SSD disks for RocksDB storage. For example, in managed
Kubernetes services like GKE, you can provision SSD disks (pd-ssd) instead of standard disks (pd-standard):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
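A Kafka Streams application deployed as a StatefulSet could then reference this class for its RocksDB state directory; the claim name
and size below are illustrative.
volumeClaimTemplates:
- metadata:
    name: statedir               # mounted at the application's state.dir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: ssd        # the StorageClass defined above
    resources:
      requests:
        storage: 20Gi            # example size for RocksDB state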



Stateless pod and data safety
You can consider the application a stateless pod as far as data safety is concerned; i.e., regardless of what happens to the pod,
Kafka and Kafka Streams guarantee that you will not lose data (and, if you have enabled exactly-once processing, they will also
guarantee exactly-once semantics).
That’s because the state changes in your application are always continuously backed up to Kafka (brokers) via the changelog topics of the
respective state stores, unless you explicitly disable changelog functionality (it is enabled by default).
The above is true even when you are not using Kafka Streams’ default storage engine (RocksDB) but the alternative in-memory
storage engine. Many people don’t realize this because they read "in-memory" and falsely conclude that data will be lost when a
machine crashes, restarts, etc.

Stateless pod and application restoration/recovery time


That being said, you should understand how having versus not having local state available after a pod restart impacts the restoration/
recovery time of your application (or rather, application instance) until it is fully operational again.
For example, say one instance of your stateful application runs on a machine. It will store its local state under state.dir, and it will also
continuously back up any changes to its local state to the remote Kafka cluster (brokers).
• If the app instance is restarted and does not have access to its previous state.dir (probably because it was restarted on
a different machine), it will fully reconstruct its state by restoring from the associated changelog(s) in Kafka. Depending on
the size of your state, this may take milliseconds, seconds, minutes, or more. Only once its state is fully restored will it begin
processing new data.
• If the app instance is restarted and does have access to its previous state.dir (probably because it was restarted on the
same, original machine), it can recover much more quickly because it can re-use all or most of the existing local state, so
only a small delta needs to be restored from the associated changelog(s).
Only once its state is fully restored will it begin processing new data. In other words, if your application is able to re-use existing local
state, application recovery time is minimized.
Standby replicas to the rescue in stateless environments: even when running stateless pods, there are options to minimize application
recovery time by configuring your application to use standby replicas via the num.standby.replicas setting:
Example 1. The number of standby replicas
num.standby.replicas
Standby replicas are shadow copies of local state stores. Kafka Streams attempts to create the specified number of replicas and keep
them up to date as long as there are enough instances running. Standby replicas are used to minimize the latency of task failover. A task
that was previously running on a failed instance is preferred to restart on an instance that has standby replicas so that the local state
store restoration process from its changelog can be minimized. See also the documentation section State restoration during workload
rebalance.
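Kafka Streams configuration lives in the application itself, so how num.standby.replicas reaches the app depends on how the app is
built. One hedged pattern is to expose it as an environment variable in the pod spec and have the application copy it into its Streams
configuration; the variable and container names below are purely illustrative.
spec:
  template:
    spec:
      containers:
      - name: streams-app              # illustrative container name
        env:
        - name: NUM_STANDBY_REPLICAS   # the application maps this onto num.standby.replicas
          value: "1"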



Conclusion
This paper is intended to share some of our best practices around the deployment of Confluent Platform on Kubernetes. Of course, each
use case and workload is slightly different, and the best architectures are tailored to the specific requirements of the organization. When
designing an architecture, considerations such as workload characteristics, access patterns, and SLAs are very important, but they are too
specific to cover in a general paper.
To choose the right deployment strategy for specific cases, we recommend engaging with Confluent Professional Services for
architecture and operational review.

References
Kubernetes Storage Volumes
How to use SSD persistent disks on Google Kubernetes Engine

Apache, Apache Kafka, Kafka, and associated open source project names are trademarks of the Apache Software Foundation.
