

APACHE KAFKA ESSENTIALS

CONTENTS

∙ Introduction
∙ About Apache Kafka
∙ Quickstart for Apache Kafka
∙ Pub/Sub in Apache Kafka
∙ Kafka Connect
∙ Kafka Streams
∙ Extending Apache Kafka
∙ One Size Does Not Fit All
∙ Conclusion
∙ Additional Resources
ORIGINAL BY JUN RAO | UPDATE BY BILL MCLANE

INTRODUCTION

Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continues to grow. Second, there is a growing need for an enterprise to make decisions in real-time based on that collected data. For example, financial institutions want to not only detect fraud immediately, but also offer a better banking experience through features like real-time alerting, real-time product recommendations, and more effective customer service.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real-time. As illustrated in Figure 1, Kafka typically serves as a part of a central data hub in which data within an enterprise is collected. The data can then be used for continuous processing or fed into other systems and applications in real time. Kafka is used by more than 40% of Fortune 500 companies across all industries. Refer to Figure 1 below.

The main benefits of Kafka are:

1. High throughput: Each server is capable of handling hundreds of MB per second of data.
2. High availability: Data can be stored redundantly in multiple servers and can survive individual server failure.
3. High scalability: New servers can be added over time to scale out the system.
4. Easy integration with external data sources or data sinks.
5. Built-in real-time processing layer.

ABOUT APACHE KAFKA

Kafka was originally developed at LinkedIn in 2010, and it became a top-level Apache project in 2012. It has three main components: Pub/Sub, Kafka Connect, and Kafka Streams. The role of each component is summarized in the table below.

Pub/Sub: Storing and delivering data efficiently and reliably at scale.
Kafka Connect: Integrating Kafka with external data sources and data sinks.
Kafka Streams: Processing data in Kafka in real time.


Figure 1: Apache Kafka as a central real-time hub

QUICKSTART FOR APACHE KAFKA

It's easy to get started with Kafka. The following are the steps to get Kafka running in your environment:

1. Download the latest Apache Kafka binary distribution from http://kafka.apache.org/downloads and untar it.
2. Start the ZooKeeper server.
3. Start the Kafka broker.
4. Create a topic (a programmatic alternative is sketched after this list).
5. Produce and consume data.

For detailed Quickstart instructions, please see the Apache Kafka documentation in the Additional Resources section below.
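Step 4 can also be done programmatically. The following is a minimal Java sketch using the AdminClient API; it assumes the quickstart broker is listening on localhost:9092 and that a single-partition topic named "test" is wanted:

Properties props = new Properties();
// assumes the quickstart broker is reachable at localhost:9092
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // one partition and replication factor 1, matching the quickstart defaults
    NewTopic topic = new NewTopic("test", 1, (short) 1);
    admin.createTopics(Collections.singleton(topic)).all().get();
}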

PUB/SUB IN APACHE KAFKA

The first component in Kafka deals with the production and consumption of the data. The following table describes a few key concepts in Kafka:

Topic: Defines a logical name for producing and consuming records.
Partition: Defines a non-overlapping subset of records within a topic.
Offset: A unique sequential number assigned to each record within a topic partition.
Record: A record contains a key, a value, a timestamp, and a list of headers.
Broker: Server where records are stored. Multiple brokers can be used to form a cluster.

Figure 2: Partitions in a topic

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.

The following code snippet shows how to produce records to a topic "test" using the Java API:


Properties props = new Properties();
// connect to the local broker
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
// send one record with key "key" and value "value" to topic "test"
producer.send(
    new ProducerRecord<String, String>("test", "key", "value"));

In the above example, both the key and value are strings, so we are using a StringSerializer. It's possible to customize the serializer when types become more complex.
The following code snippet shows how to consume records with a string key and value in Java:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// consumers that subscribe to a topic need a group id; the name is arbitrary
props.put("group.id", "test-group");
// offsets are committed explicitly below, so disable auto-commit
props.put("enable.auto.commit", "false");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset=%d, key=%s, value=%s%n",
            record.offset(), record.key(), record.value());
    // save the position of the last consumed record in each partition
    consumer.commitSync();
}

Records within a partition are always delivered to the consumer in offset order. By saving the offset of the last consumed record from each partition, the consumer can resume from where it left off after a restart. In the example above, we use the commitSync() API to save the offsets explicitly after consuming a batch of records. One can also save the offsets automatically by setting the property enable.auto.commit to true.
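As a sketch, the auto-commit variant of the consumer above changes only the configuration; the commit interval shown is the client default and appears here purely for illustration:

// rely on periodic, automatic offset commits instead of commitSync()
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "5000");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset=%d, key=%s, value=%s%n",
            record.offset(), record.key(), record.value());
    // no explicit commit: offsets are committed in the background
}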
A record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The following table summarizes the two common policies:

log.retention.hours: The number of hours to keep a record on the broker.
log.retention.bytes: The maximum size of records retained in a partition.
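The settings above are broker-level properties in server.properties. Retention can also be overridden per topic; the sketch below is an assumption-heavy illustration using the AdminClient (the topic name and the 7-day value are arbitrary, and incrementalAlterConfigs requires brokers on version 2.3 or newer):

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(adminProps)) {
    ConfigResource topic =
        new ConfigResource(ConfigResource.Type.TOPIC, "test");
    // override retention for this topic only: 7 days in milliseconds
    AlterConfigOp setRetention = new AlterConfigOp(
        new ConfigEntry("retention.ms", "604800000"),
        AlterConfigOp.OpType.SET);
    admin.incrementalAlterConfigs(
        Collections.singletonMap(topic,
            Collections.singletonList(setRetention))).all().get();
}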
KAFKA CONNECT

The second component in Kafka is Kafka Connect, a framework that makes it easy to stream data between Kafka and other systems. As shown in Figure 3, one can deploy a Connect cluster and run various connectors to import data from sources like MySQL, TIBCO Messaging, or Splunk into Kafka (Source Connectors) and to export data from Kafka to sinks such as HDFS, S3, and Elasticsearch (Sink Connectors). See Figure 3 below.


Figure 3: Usage of Apache Kafka Connect

The benefits of using Kafka Connect are:

∙ Parallelism and fault tolerance
∙ Avoiding ad-hoc code by reusing existing connectors
∙ Built-in offset and configuration management

QUICKSTART FOR KAFKA CONNECT

The following steps show how to run the existing file connector in standalone mode to copy the content from a source file to a destination file via Kafka:

1. Prepare some data in a source file:

> echo -e "hello\nworld" > test.txt

2. Start a file source and a file sink connector:

> bin/connect-standalone.sh
config/connect-standalone.properties
config/connect-file-source.properties
config/connect-file-sink.properties

3. Verify the data in the destination file:

> more test.sink.txt
hello
world

4. Verify the data in Kafka:

> bin/kafka-console-consumer.sh
--bootstrap-server localhost:9092
--topic connect-test
--from-beginning

{"schema":{"type":"string","optional":false},"payload":"hello"}
{"schema":{"type":"string","optional":false},"payload":"world"}

In the example above, the data in the source file test.txt is first streamed into a Kafka topic connect-test through a file source connector. The records in connect-test are then streamed into the destination file test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve this by running two connectors without writing any custom code.

Connectors are powerful tools that allow for integration of Apache Kafka into many other systems. There are many open source and commercially supported options for integration of Apache Kafka, both at the connector layer as well as through an integration services layer that can provide much more flexibility in message transformation. In addition to open-source connectors, vendors like Confluent and TIBCO Software offer commercially supported connectors to hundreds of endpoints, simplifying integration with Apache Kafka.

TRANSFORMATIONS IN CONNECT

Connect is primarily designed to stream data between systems as is, whereas Kafka Streams is designed to perform complex transformations once the data is in Kafka. That said, Kafka Connect provides a mechanism to perform simple transformations per record. The following example shows how to enable a couple of transformations in the file source connector.

1. Add the following lines to connect-file-source.properties:

transforms=MakeMap, InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source

2. Start a file source connector:

> bin/connect-standalone.sh
config/connect-standalone.properties
config/connect-file-source.properties

3. Verify the data in Kafka:

> bin/kafka-console-consumer.sh
--bootstrap-server localhost:9092
--topic connect-test

{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}

In step 1 above, we add two transformations, MakeMap and InsertSource, which are implemented by the classes HoistField$Value and InsertField$Value, respectively. The first one adds a field named "line" to each input string. The second one adds an additional field "data_source" that indicates the name of the source file. After applying the transformation logic, the data in the input file is transformed into the output shown in step 3. Because the last transformation step is more complex, we implement it with the Streams API (covered in more detail below).

CONNECT REST API

In production, Kafka Connect typically runs in distributed mode and can be managed through REST APIs. The following table shows the common APIs. See the Kafka documentation for more information.


GET /connectors: Return a list of active connectors.
POST /connectors: Create a new connector.
GET /connectors/{name}: Get the information of a specific connector.
GET /connectors/{name}/config: Get the configuration parameters for a specific connector.
PUT /connectors/{name}/config: Update the configuration parameters for a specific connector.
GET /connectors/{name}/status: Get the current status of the connector.
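Any HTTP client can call these endpoints. Below is an illustrative Java 11 sketch that lists the active connectors; 8083 is the default REST port of a Connect worker, so adjust the URL for your deployment:

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://localhost:8083/connectors"))
    .GET()
    .build();
HttpResponse<String> response =
    client.send(request, HttpResponse.BodyHandlers.ofString());
// prints a JSON array with the names of the active connectors
System.out.println(response.body());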
KAFKA STREAMS

Kafka Streams is a client library for building real-time applications and microservices where the input and/or output data is stored in Kafka. The benefits of using Kafka Streams are:

∙ Less code in the application
∙ Built-in state management
∙ Lightweight
∙ Parallelism and fault tolerance

The most common way of using Kafka Streams is through the Streams DSL, which includes operations such as filtering, joining, grouping, and aggregation. The following code snippet shows the main logic of a Streams example called WordCountDemo:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

StreamsBuilder builder = new StreamsBuilder();

// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

// split each line into lowercase words and count each word
KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();

// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));

The code above first creates a stream from an input topic streams-plaintext-input. It then applies a transformation to split each input line into words. Next, it counts the number of occurrences of each unique word. Finally, the results are written to an output topic streams-wordcount-output.

The following are the steps to run the example code:

1. Create the input topic:

bin/kafka-topics.sh --create
--zookeeper localhost:2181
--replication-factor 1
--partitions 1
--topic streams-plaintext-input

2. Run the stream application:

bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

3. Produce some data in the input topic:

bin/kafka-console-producer.sh
--broker-list localhost:9092
--topic streams-plaintext-input
hello world


4. Verify the data in the output topic:

bin/kafka-console-consumer.sh
--bootstrap-server localhost:9092
--topic streams-wordcount-output
--from-beginning
--formatter kafka.tools.DefaultMessageFormatter
--property print.key=true
--property print.value=true
--property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

hello 1
world 1
KSTREAM VS. KTABLE

There are two key concepts in Kafka Streams: KStream and KTable. A topic can be viewed as either of the two. Their differences are summarized in the table below.

          KStream                                KTable
Concept   Each record is treated as an append    Each record is treated as an update to
          to the stream.                         an existing key.
Usage     Model append-only data such as         Model updatable reference data such as
          click streams.                         user profiles.

The following example illustrates the difference between the two:

(Key, Value) Records     Sum of the Values as KStream    Sum of the Values as KTable
("k1", 2), ("k1", 5)     7                               5

When a topic is viewed as a KStream, there are two independent records and thus the sum of the values is 7. On the other hand, if the topic is viewed as a KTable, the second record is treated as an update to the first record since they have the same key "k1". Therefore, only the second record is retained in the stream and the sum is 5 instead.
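In code, the difference is simply which builder method is used to read the topic. A minimal sketch; the topic name and serdes are illustrative, and a single topology would normally register a topic only once:

StreamsBuilder streamBuilder = new StreamsBuilder();
// KStream view: every record is an independent, append-only event
KStream<String, Integer> asStream = streamBuilder.stream(
    "user-clicks", Consumed.with(Serdes.String(), Serdes.Integer()));

StreamsBuilder tableBuilder = new StreamsBuilder();
// KTable view: a record replaces the previous value for the same key
KTable<String, Integer> asTable = tableBuilder.table(
    "user-clicks", Consumed.with(Serdes.String(), Serdes.Integer()));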
KSTREAMS DSL

The following tables show a list of common operations available in Kafka Streams:

COMMONLY USED OPERATIONS IN KSTREAM

filter(Predicate): Create a new KStream that consists of all records of this stream that satisfy the given predicate.
  Example: ks_out = ks_in.filter((key, value) -> value > 5);
  ks_in:  ("k1", 2), ("k2", 7)
  ks_out: ("k2", 7)

map(KeyValueMapper): Transform each record of the input stream into a new record in the output stream (both key and value type can be altered arbitrarily).
  Example: ks_out = ks_in.map((key, value) -> new KeyValue<>(key, key));
  ks_in:  ("k1", 2), ("k2", 7)
  ks_out: ("k1", "k1"), ("k2", "k2")

groupByKey(): Group the records by their current key into a KGroupedStream while preserving the original values.
  Example: ks_out = ks_in.groupByKey();
  ks_in:  ("k1", 1), ("k2", 2), ("k1", 3)
  ks_out: ("k1", (("k1", 1), ("k1", 3))), ("k2", (("k2", 2)))

join(KTable, ValueJoiner): Join records of the input stream with records from the KTable if the keys from the records match. Return a stream of the key and the combined value using ValueJoiner.
  Example: ks_out = ks_in.join(kt, (value1, value2) -> value1 + value2);
  ks_in:  ("k1", 1), ("k2", 2), ("k3", 3)
  kt:     ("k1", 11), ("k2", 12), ("k4", 13)
  ks_out: ("k1", 12), ("k2", 14)

join(KStream, ValueJoiner, JoinWindows): Join records of the two streams if the keys match and the timestamps from the records satisfy the time constraint specified by JoinWindows. Return a stream of the key and the combined value using ValueJoiner.
  Example: ks_out = ks1.join(ks2, (value1, value2) -> value1 + value2, JoinWindows.of(100));
  ks1:    ("k1", 1, 100t), ("k2", 2, 200t), ("k3", 3, 300t)
  ks2:    ("k1", 11, 150t), ("k2", 12, 350t), ("k4", 13, 380t)
  ks_out: ("k1", 12)
  (* t indicates a timestamp.)

COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count(): Count the number of records in this stream by the grouped key and return it as a KTable.
  Example: kt = kgs.count();
  kgs: ("k1", (("k1", 1), ("k1", 3))), ("k2", (("k2", 2)))
  kt:  ("k1", 2), ("k2", 1)

reduce(Reducer): Combine the values of records in this stream by the grouped key and return it as a KTable.
  Example: kt = kgs.reduce((aggValue, newValue) -> aggValue + newValue);
  kgs: ("k1", (("k1", 1), ("k1", 3))), ("k2", (("k2", 2)))
  kt:  ("k1", 4), ("k2", 2)

windowedBy(Windows): Further group the records by their timestamp and return the result as a TimeWindowedKStream.
  Example: twks = kgs.windowedBy(TimeWindows.of(100));
  kgs:  ("k1", (("k1", 1, 100t), ("k1", 3, 150t))), ("k2", (("k2", 2, 100t), ("k2", 4, 250t)))
  twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t))),
        ("k2", 100t -- 200t, (("k2", 2, 100t))),
        ("k2", 200t -- 300t, (("k2", 4, 250t)))
  (* t indicates a timestamp.)

A similar set of operations is available on KTable and KGroupedTable. You can check the Kafka documentation for more information.
QUERYING THE STATES IN KSTREAMS

While processing the data in real-time, a KStreams application locally maintains states such as the word counts in the previous example. Those states can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation. This avoids the need for an external data store for exporting and serving those states.
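For example, assuming the count() call in the word-count example is given an explicit store name with .count(Materialized.as("counts-store")) and that streams is the running KafkaStreams instance, the local state could be queried roughly as follows:

ReadOnlyKeyValueStore<String, Long> store =
    streams.store("counts-store", QueryableStoreTypes.keyValueStore());

// point lookup of the current count for one word
Long helloCount = store.get("hello");

// or iterate over all word counts held by this instance
try (KeyValueIterator<String, Long> it = store.all()) {
    while (it.hasNext()) {
        KeyValue<String, Long> entry = it.next();
        System.out.println(entry.key + " -> " + entry.value);
    }
}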
EXACTLY-ONCE PROCESSING IN KSTREAMS

Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once, even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams.
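A minimal configuration sketch; the application ID and bootstrap server are illustrative, and builder is the StreamsBuilder from the word-count example:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// enable exactly-once processing for this Streams application
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
    StreamsConfig.EXACTLY_ONCE);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();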
EXTENDING APACHE KAFKA

While processing data directly into an application via KStreams is functional, many client applications are looking for a lighter weight interface to Apache Kafka Streams via low-code environments or continuous queries using SQL-like commands. In many cases, developers looking to leverage continuous query functionality are looking for a low-code environment where stream processing can be dynamically accessed, modified, and scaled in real-time.

Accessing Apache Kafka data streams via SQL is just one approach to addressing this low-code stream processing, and many commercial vendors (TIBCO Software, Confluent) as well as open-source solutions (Apache Spark) offer solutions for providing SQL access to unlock Apache Kafka data streams for stream processing.

Apache Kafka is a data distribution platform; it's what you do with the data that is important. Once data is available via Kafka, it can be distributed to many different processing engines, from integration services, event streaming, and AI/ML functions to data analytics.

Figure 4: Leveraging Apache Kafka beyond data distribution

For more information on low-code stream processing options, including SQL access to Apache Kafka, please see the Additional Resources section below.

ONE SIZE DOES NOT FIT ALL

With the increasing popularity of real-time stream processing and the rise of event-driven architectures, a number of alternatives have started to gain traction for real-time data distribution. Apache Kafka is the flavor of choice for distributed, high-volume data streaming; however, many implementations have begun to struggle with building solutions at scale when the application's requirements go beyond a single data center or single location. So, while Apache Kafka is purpose-built for real-time data distribution and stream processing, it will not fit all the requirements of every enterprise application. Alternatives like Apache Pulsar, Eclipse Mosquitto, and many others may be worth investigating, especially if requirements prioritize large-scale global infrastructure where built-in replication is needed or if native IoT/MQTT support is needed.


For more information on comparisons between Apache Kafka and other data distribution solutions, please see the Additional Resources section below.

CONCLUSION

Apache Kafka has become the de-facto standard for high-performance, distributed data streaming. It has a large and growing community of developers, corporations, and applications that are supporting, maintaining, and leveraging it. If you are building an event-driven architecture or looking for a way to stream data in real-time, Apache Kafka is a clear leader in providing a proven, robust platform for enabling stream processing and enterprise communications.

ADDITIONAL RESOURCES

∙ Documentation of Apache Kafka
∙ Apache NiFi website
∙ Article on real-time stock processing with Apache NiFi and Apache Kafka
∙ Apache Kafka Summit website
∙ Apache Kafka Mirroring and Replication
∙ Apache Pulsar vs. Apache Kafka O'Reilly eBook
∙ Apache Pulsar website

Written by Bill McLane, TIBCO Messaging Evangelist

With over 20 years of experience building, architecting, and designing large-scale messaging infrastructure, William McLane is one of the thought leaders for global data distribution. William and TIBCO have history and experience building mission-critical, real-world data distribution architectures that power everything from some of the largest financial services institutions to the global scale of tracking transportation and logistics operations. From Pub/Sub, to point-to-point, to real-time data streaming, William has experience designing, building, and leveraging the right tools for building a nervous system that can connect, augment, and unify your enterprise.

DZone, a Devada Media Property, is the resource software developers, engineers, and architects turn to time and again to learn new skills, solve software development problems, and share their expertise. Every day, hundreds of thousands of developers come to DZone to read about the latest technologies, methodologies, and best practices. That makes DZone the ideal place for developer marketers to build product and brand awareness and drive sales. DZone clients include some of the most innovative technology and tech-enabled companies in the world, including Red Hat, Cloud Elements, Sensu, and Sauce Labs.

Devada, Inc.
600 Park Offices Drive
Suite 150
Research Triangle Park, NC 27709
888.678.0399  919.678.0300

Copyright © 2020 Devada, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
