
APACHE KAFKA

TABLE OF CONTENTS

Preface
Introduction
Installing and Running Apache Kafka
Apache Kafka Components
The Publish-Subscribe System
The Kafka Connect Framework
Kafka Streams
    KStream
    KGroupedStream
    KTable and KGroupedTable
Notable Apache Kafka Alternatives
Additional Resources


PREFACE

This cheatsheet will provide you with the essential knowledge, commands, and best practices to navigate the Kafka ecosystem with ease. This concise yet comprehensive resource is designed to be your go-to guide for all things Kafka.

INTRODUCTION

Apache Kafka is a robust, distributed streaming platform that has become an integral part of modern data processing and real-time event streaming architectures. It was originally developed by LinkedIn and open-sourced as an Apache Software Foundation project. Kafka provides a reliable, high-throughput, low-latency, and fault-tolerant way to publish, subscribe to, store, and process streams of data.

At its core, Kafka is a distributed event streaming system that is designed to handle a vast amount of data in a highly scalable, durable, and fault-tolerant manner. It is widely used for building real-time data pipelines, monitoring systems, log aggregation, and more. Kafka's architecture is built around a publish-subscribe model, where data producers (publishers) send records to topics, and data consumers (subscribers) read from those topics.

This cheatsheet serves as a quick reference guide to key concepts, components, and commands in Apache Kafka. Whether you're a developer, data engineer, or operations professional, it will help you navigate the Kafka ecosystem, set up your Kafka clusters, and perform common tasks with ease. From understanding Kafka's fundamental components to managing topics, producers, and consumers, you'll find the essential information you need to work effectively with Kafka in a compact and accessible format.

Let's dive into the world of Apache Kafka and unlock the power of real-time data streaming for your projects and applications.

INSTALLING AND RUNNING APACHE KAFKA

To install and start Apache Kafka on your local machine, follow these step-by-step instructions. We'll cover a basic setup for development and testing purposes.

• Prerequisites

  ◦ Before you begin, ensure you have Java installed on your system, as Kafka is a Java-based application.

• Download Apache Kafka

  ◦ Visit the Apache Kafka website's download page: Apache Kafka Downloads.

  ◦ Download the latest stable version of Kafka, typically a binary release package. Choose the binary release that matches your operating system (e.g., kafka_2.13-3.1.0.tgz for a Unix-like system).

• Extract the Kafka Archive

  ◦ Navigate to the directory where you downloaded Kafka.

  ◦ Use the following command to extract the Kafka archive (replace the filename with your downloaded version): tar -xzf kafka_2.13-3.1.0.tgz

• Start the Kafka Server (Broker)

  ◦ Change your working directory to the Kafka installation directory.

  ◦ Kafka depends on Apache ZooKeeper for distributed coordination, so you need to start ZooKeeper before starting Kafka. Using the ZooKeeper scripts provided with Kafka, run the following command: bin/zookeeper-server-start.sh config/zookeeper.properties (leave the ZooKeeper terminal running in the background)

  ◦ Open a new terminal window and start the Kafka server by running the following command: bin/kafka-server-start.sh config/server.properties (leave the Kafka server terminal running in the background)

• Create a Kafka Topic

  ◦ We will create a Kafka topic named "test-topic" to help us follow along the rest of this cheatsheet. In a new terminal, use the following command: bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1

That's it! You now have Apache Kafka up and running on your local machine. You can start producing and consuming messages, create more topics, and explore the rich set of features Kafka has to offer for event streaming and data processing.
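Topics can also be created programmatically. Below is a minimal sketch using Kafka's Java AdminClient; the class name CreateTopicExample is illustrative, while the bootstrap address and the partition/replication settings mirror the CLI command above.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        // Connect to the local broker started above
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic name, partition count, and replication factor match the CLI example:
            // test-topic, 1 partition, replication factor 1
            NewTopic topic = new NewTopic("test-topic", 1, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}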


APACHE KAFKA COMPONENTS

Apache Kafka consists of three major components, the role of each summarized below:

System: Description

Publish/Subscribe: The Publish/Subscribe messaging system enables distributed streaming of events. It is designed to handle a vast amount of data in a highly scalable, durable, and fault-tolerant manner.

Kafka Connect: The Kafka Connect framework is used to connect external data sources (event producers) and data sinks (event consumers) to the platform.

Kafka Streams: Client library that enables processing of events/data in real-time.

As stated above, Apache Kafka runs in distributed clusters. A cluster node is referred to as a Broker. Kafka Connect integrates Brokers with external clients that produce events or consume event data. Producers and consumers utilize Kafka's publish/subscribe messaging system to achieve instant exchange of event data between servers, processes, and applications. When data handling is required, the Kafka Streams component can be used for real-time event processing.

THE PUBLISH-SUBSCRIBE SYSTEM

The publish/subscribe system is responsible for the distributed streaming of a vast amount of data in a highly scalable, durable, and fault-tolerant manner. Below are its main key concepts:

Concept: Description

Broker: A Kafka server instance that stores and manages data. Producers publish messages to brokers, and consumers retrieve messages from brokers. Kafka clusters consist of multiple brokers to ensure fault tolerance and scalability.

Topic: A logical channel or category for data in Kafka. It is used to categorize messages, allowing producers to publish data to a specific topic and consumers to subscribe to the topics of their interest. Topics can have multiple partitions for parallelism.

Partition: A way to horizontally distribute data within a topic. Each partition is an ordered, immutable sequence of messages. Partitions allow Kafka to achieve high throughput by distributing data across multiple brokers and enabling parallelism for consumers.

Offset: A unique sequential number assigned to each message within a topic partition.

Message (aka Record): The means to transfer data between producers and consumers. It contains a key, value, timestamp, and a header.


Producer: A component that sends messages to Kafka topics. Producers can publish data to one or more topics, and they are responsible for specifying the topic and partition to which a message should be sent.

Consumer: A component that subscribes to one or more topics and reads messages from Kafka. Kafka supports different types of consumers, such as the high-level consumer API and the low-level consumer API. Consumers can work individually or in consumer groups for load balancing.

Consumer Group: A group of consumers that work together to consume data from a topic. Each partition within a topic is consumed by only one consumer within a group, ensuring parallel processing while maintaining order. Kafka automatically rebalances partitions among consumers in a group.

Zookeeper: Used for distributed coordination and management of Kafka clusters. While newer Kafka versions are designed to work without Zookeeper, older versions relied on it for maintaining cluster metadata and leader election. It plays a crucial role in cluster management and maintenance.

Producer API: A client library or interface for producing data to Kafka. Popular programming languages like Java, Python, and others have Kafka producer libraries that allow developers to create producers.

Consumer API: A client library or interface for consuming data from Kafka. Like producer libraries, Kafka provides consumer libraries in various programming languages for reading data from Kafka topics.

Below is a Java code snippet that demonstrates how to create a Kafka producer and send a "Hello, World!" message to the "test-topic" topic created earlier.

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class KafkaProducerExample {
    public static void main(String[] args) {
        // Set up Kafka producer properties
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Create a Kafka producer
        Producer<String, String> producer = new KafkaProducer<>(props);

        // Define the topic to send the message to
        String topic = "test-topic";

        // The key for the message
        String key = "myKey";

        // The message to send
        String message = "Hello, World!";

        // Create a ProducerRecord with the topic and message
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, message);

        // Send the message to the Kafka topic
        producer.send(record, new Callback() {
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception == null) {
                    System.out.println("Message sent successfully to topic " + metadata.topic());
                    System.out.println("Partition: " + metadata.partition());
                    System.out.println("Offset: " + metadata.offset());
                } else {
                    System.err.println("Error sending message: " + exception.getMessage());
                }
            }
        });

        // Close the producer when done
        producer.close();
    }
}

In this code, we configure the Kafka producer with the necessary properties, create a producer instance, and use it to send a "Hello, World!" message with key "myKey" to the "test-topic" topic. Both the key and value are Strings, so we use a StringSerializer to serialize the transferred data; always choose the serializer that matches the data types of your key and value. The callback function handles the acknowledgment of message delivery.

Below is a Java code snippet that demonstrates how to create a Kafka consumer to read messages from the "test-topic" topic.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaConsumerExample {
    public static void main(String[] args) {
        // Set up Kafka consumer properties
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-consumer-group"); // A consumer group ID
        props.put("enable.auto.commit", "false"); // Offsets are committed manually below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Create a Kafka consumer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Subscribe to the "test-topic" topic
        consumer.subscribe(Collections.singletonList("test-topic"));

        // Poll for new messages
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

            for (ConsumerRecord<String, String> record : records) {
                System.out.println("Received message:");
                System.out.println("Topic: " + record.topic());
                System.out.println("Partition: " + record.partition());
                System.out.println("Offset: " + record.offset());
                System.out.println("Key: " + record.key());
                System.out.println("Value: " + record.value());
            }
            consumer.commitSync();
        }
    }
}

In this code, we configure the Kafka consumer with the necessary properties, create a consumer instance, and subscribe to the "test-topic" topic. The consumer continuously polls for new messages and processes them as they arrive; records within a partition are always delivered to consumers in offset order.

By saving the offset of the last consumed message from each partition, the consumer can resume from where it left off in case of a restart. We can either configure automatic saving of message offsets, by setting the property enable.auto.commit to true when creating the consumer, or do it manually by issuing the commitSync() command after consuming a batch of messages, as depicted in the example above.
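As a brief sketch of the two approaches, the relevant settings look like this (assuming the props object from the consumer example above; both property names are standard Kafka consumer configurations):

// Option 1: automatic offset commits in the background
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000"); // commit roughly every second

// Option 2: manual offset commits (the approach used in the example above)
props.put("enable.auto.commit", "false");
// ... poll and process a batch of records, then:
// consumer.commitSync();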

Keep in mind that a message is not removed from the broker immediately after it's consumed. Instead, it is retained according to a configured retention policy. Below are two common retention policies:

• log.retention.bytes - sets the maximum size of records retained in a partition.

• log.retention.hours - sets the maximum number of hours to keep a record on the broker.

THE KAFKA CONNECT FRAMEWORK

The Kafka Connect framework is an integral part of the Apache Kafka ecosystem designed for building and running scalable and reliable data integration solutions. It serves as a powerful platform for connecting Kafka with various external data sources and sinks, enabling seamless data transfer between Kafka topics and other systems or storage mechanisms. Here are some key aspects of the Kafka Connect framework:

• Distributed, Scalable, and Fault-Tolerant: Kafka Connect is built to be distributed and scalable, allowing you to deploy connectors across multiple nodes to handle high data volumes. It offers fault tolerance by redistributing work among connectors if one or more instances fail. Furthermore connectors are designed to distribute data efficiently across Kafka topics and manage data flow from source to destination, ensuring robust and scalable data pipelines.

• Source and Sink Connectors: Kafka Connectors come in two primary flavors: source connectors and sink connectors. Source connectors pull data from external systems into Kafka, while sink connectors write data from Kafka topics to external systems.

• Pre-Built Connectors: Kafka Connect offers a wide range of pre-built connectors for popular systems and databases, such as databases (e.g., MySQL, PostgreSQL), cloud storage (e.g., Amazon S3), message queues (e.g., RabbitMQ), and more. These connectors simplify integration with these systems without the need for custom coding.

• Configuration-Driven: Connectors are configured using simple configuration files or via REST APIs. The configuration defines how data should be extracted, transformed, and loaded between Kafka and the external system.

• Schema Support: Kafka Connectors often support data serialization and deserialization using schemas. This allows connectors to handle structured data formats like Avro, JSON, or others, maintaining compatibility and data quality.

• Transformation and Enrichment: You can apply transformations and enrichments to data as it flows through Kafka Connect. These transformations can include filtering, mapping, and more, depending on your requirements.

• Real-time Data Processing: Kafka connectors facilitate real-time data streaming, enabling you to process and analyze data as it flows through your pipeline. This is crucial for applications that require timely insights and actions based on the data.

• RESTful APIs: Kafka Connect provides RESTful APIs for managing connectors, configurations, and monitoring the status of connectors. This makes it easier to interact with and manage the connectors programmatically. See the Apache Kafka documentation for more information.

• Pluggable Architecture: You can extend Kafka Connect by creating custom connectors for systems not covered by the pre-built connectors. This pluggable architecture encourages community contributions and supports a wide range of integration use cases.

• Rich Ecosystem: The Kafka Connect ecosystem has a growing collection of connectors available from various sources, including the Apache Kafka community, Confluent, and third-party developers. This ecosystem helps streamline integration with different technologies and data sources.

• Monitoring and Error Handling: Kafka Connect offers monitoring capabilities, allowing you to track the performance and status of connectors. Furthermore connectors often come with built-in error handling mechanisms, allowing them to handle various issues that may arise during data transfer, such as network interruptions, data format/schema mismatches, and more. This ensures data reliability and minimizes data loss.

Below is a step-by-step guide on how to stream data from a source file to a target file using the Kafka Connect framework with the Confluent File Source and File Sink connectors. The File Source connector reads data from the source file and publishes it to "test-topic." The File Sink connector subscribes to "test-topic" and writes the data to the target file as specified in its configuration.

Prerequisites

• Make sure you have Kafka and Kafka Connect set up and running.

• Download and install the Confluent File Source and File Sink connectors.

Create a Kafka topic (Optional)

First, create a Kafka topic to which you want to publish the data from the source file. You can use the Kafka command-line tools for this:

kafka-topics.sh --create --topic test-topic --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092

Configure and start the File Source connector

Create a configuration file for the Kafka File Source connector. Here's an example configuration file (file-source-config.properties):

name=file-source-connector
connector.class=io.confluent.connect.file.FileStreamSourceConnector
tasks.max=1
file=/path/to/source/source-file.txt
topic=test-topic

Update the file property with the actual path to your source file.


Start the File Source connector using the Kafka Connect CLI:

connect-standalone config/connect-standalone.properties file-source-config.properties

Configure and start the File Sink connector

Create a configuration file for the Kafka File Sink connector. Here's an example configuration file (file-sink-config.properties):

name=file-sink-connector
connector.class=io.confluent.connect.file.FileStreamSinkConnector
tasks.max=1
file=/path/to/destination/destination-file.txt
topics=test-topic
format=value

Update the file property with the actual path to your target file.

Start the File Sink connector using the Kafka Connect CLI:

connect-standalone config/connect-standalone.properties file-sink-config.properties

Kafka Connect is primarily designed to enable data streaming between systems, as-is. To perform complex transformations once the data is in Kafka, Kafka Streams comes into play. Nevertheless, Kafka Connect provides a mechanism to perform simple transformations per record. Let's see how to apply a custom transformation while streaming data between the files of the previous example.

Update the File Source connector configuration as shown below.

name=file-source-connector
connector.class=io.confluent.connect.file.FileStreamSourceConnector
tasks.max=1
file=/path/to/source/source-file.txt
topic=test-topic
transforms=uppercase
transforms.uppercase.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.uppercase.regex=^.*$
transforms.uppercase.replacement=Uppercase$0

In this example, the transforms property registers a single message transform (SMT) named "uppercase" that uses the RegexRouter transformation. RegexRouter rewrites the topic name of each record rather than its payload: the regex matches the entire original topic name and the replacement prepends "Uppercase" to it, so records read from the source file are routed to the renamed topic.

KAFKA STREAMS

Kafka Streams is a powerful and lightweight stream processing library that is part of the Apache Kafka ecosystem. It allows you to build real-time data processing applications that can ingest, process, and output data from Kafka topics. Kafka Streams is designed to be easy to use, scalable, and fault-tolerant, making it an excellent choice for a wide range of stream processing use cases including real-time analytics, monitoring and alerting, fraud detection, recommendation engines, ETL (Extract, Transform, Load) processes, and more.

Here are some key characteristics and features of Kafka Streams:

• Stream Processing Library: Kafka Streams is not a separate service or framework; it is a library that you can include in your Java or Scala applications. This makes it easy to integrate stream processing directly into your existing Kafka-based infrastructure.

• Real-time Data Processing: Kafka Streams is designed for real-time processing of data streams. It allows you to analyze and transform data as it flows through the Kafka topics.


• Exactly-Once Semantics: Kafka Streams provides strong processing guarantees, including exactly-once processing semantics. This ensures that each record is processed exactly once, even in the presence of failures.

• Stateful Processing: Kafka Streams supports stateful processing, allowing you to maintain and update local state stores as data is processed. This is useful for tasks like aggregations, joins, and windowed operations.

• Windowed Aggregations: Kafka Streams includes built-in support for windowed aggregations, enabling time-based operations on data, such as tumbling windows, hopping windows, and session windows.

• Join Operations: You can perform joins on Kafka Streams data, either by joining multiple Kafka topics or by joining a topic with a local state store.

• Scaling and Fault Tolerance: Kafka Streams applications can be easily scaled horizontally across multiple instances, and they are inherently fault-tolerant. In the event of an instance failure, processing can be seamlessly redistributed.

• Integration with Kafka: Kafka Streams is tightly integrated with Kafka, which means it benefits from Kafka's reliability and durability. It also allows for easy integration with the broader Kafka ecosystem.

• Interactive Querying: Kafka Streams enables interactive querying of state stores through an API. This means you can query the current state of your data at any point in time, which is valuable for building interactive applications. Details can be found in the Interactive Queries section of the Kafka documentation.

• Flexible Deployment Options: You can deploy Kafka Streams applications on a variety of platforms, including standalone applications, containers, cloud services, and Kubernetes.

Below is an example application that utilizes the Kafka Streams library to count the words from the "test-topic" created earlier.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Arrays;
import java.util.Properties;

public class WordCountStream {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-count-application");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2"); // Apply exactly-once processing semantics
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        StreamsBuilder builder = new StreamsBuilder();

        // Create a KStream from the "test-topic"
        // Consumed.with() defines the type of each record's key and value data for de-serialization - in this case both are translated to Strings
        KStream<String, String> textLines = builder.stream("test-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Split each line into words and convert to lowercase
        KStream<String, String> wordStream = textLines.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")));

        // Group by the word and count occurrences
        // Grouped.with() defines the serdes used when the re-keyed records are repartitioned
        KTable<String, Long> wordCounts = wordStream
            .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
            .count();

        // Write the word counts to a new Kafka topic "word-count-output"
        // Produced.with() defines the type of each record's key and value data for serialization - in this case keys are Strings and values are Longs
        wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Shutdown hook to gracefully close the application
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

In the example above, we first set up the Kafka Streams configuration, including the Kafka broker(s) address. Then the StreamsBuilder is used to build the stream processing topology. A KStream is created from the "test-topic." Each line is split into words and converted to lowercase. Words are grouped and counted to create a KTable with word counts. The word counts are written to a new Kafka topic named "word-count-output" and finally, the Kafka Streams application is started, and a shutdown hook is added to close the application gracefully.

Let us delve into the nitty-gritty of the Kafka Streams API. Below you can find details regarding the basic concepts, utilities and classes.

A topic can be viewed as either a KStream or a KTable. With a KStream each record is treated as an append to the stream, whereas with a KTable each record is treated as an update to the value of its key in a table. So if a topic has two or more records with the same key, viewing it as a KStream retains all records, whereas viewing it as a KTable retains only the last record for each key.
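As a minimal sketch of the difference (the topic name "user-updates" is illustrative), the same topic can be read either way with the StreamsBuilder used earlier:

// Read the topic as a stream: every record is kept, including older values for the same key
KStream<String, String> updatesStream = builder.stream("user-updates");

// Read the topic as a table: only the latest value per key is retained
KTable<String, String> updatesTable = builder.table("user-updates");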
KSTREAM

Below are some basic functions of the KStream and examples of each.

map: Transforms each record in the stream by applying a mapping function. Here for every input record we return a new record which has both its key and value equal to the key of the input. As you can see, records are represented by the KeyValue class.

KStream<String, String> sourceStream = builder.stream("input-topic");
KStream<String, String> mappedStream = sourceStream.map((key, value) -> new KeyValue<>(key, key));

filter: Filters records based on a provided predicate. Here for every input record the filter function is executed; if it evaluates to true the record is retained in the new stream.

KStream<String, String> sourceStream = builder.stream("input-topic");
KStream<String, String> filteredStream = sourceStream.filter((key, value) -> value.startsWith("filter"));

flatMap: Transforms each input record into zero or more output records. Here we split the value of each input record into individual words and return each word in a new record with the same key as the input.

KStream<String, String> sourceStream = builder.stream("input-topic");
KStream<String, String> flatMappedStream = sourceStream.flatMap(
    (key, value) -> Arrays.stream(value.split("\\W+"))
        .map(word -> new KeyValue<>(key, word))
        .collect(Collectors.toList()));


join: Joins records of the input stream with records from another KStream or KTable when the records' keys match (inner join). It returns a stream of joined records with values combined based on the function provided. Here a joined record's value is the concatenation of the two input records' values. Note that joining two KStreams additionally requires a JoinWindows argument; the two-argument form shown here applies when joining with a KTable.

KStream<String, String> sourceStream = builder.stream("input-topic");
KStream<String, String> joinStream = sourceStream.join(anotherStream, (value1, value2) -> value1 + " " + value2);

to: Writes the stream to a Kafka topic.

KStream<String, String> sourceStream = builder.stream("input-topic");
sourceStream.to("output-topic");

foreach: Performs a function on each record without producing a new output stream.

KStream<String, String> sourceStream = builder.stream("input-topic");
sourceStream.foreach((key, value) -> System.out.println("Key: " + key + ", Value: " + value));

branch: Splits the stream into multiple output streams based on a set of predicates. Here we split the input stream into three branches based on the value of each record.

KStream<String, String> sourceStream = builder.stream("input-topic");
KStream<String, String>[] branches = sourceStream.branch(
    (key, value) -> value.contains("A"),
    (key, value) -> value.contains("B"),
    (key, value) -> value.contains("C")
);

groupBy: Groups records by a new key (specified by the mapping function), allowing you to perform aggregation and windowed operations. The method returns a KGroupedStream to allow specialized data processing. Here we simply group records by their existing keys.

KStream<String, String> sourceStream = builder.stream("input-topic");
KGroupedStream<String, String> groupedStream = sourceStream.groupBy((key, value) -> key);

KGROUPEDSTREAM

A KGroupedStream is an intermediate representation that results from applying the groupBy operation to a KStream. It represents a stream of records that have been grouped based on a specific key, allowing for further aggregation and transformation of data. You can perform various operations on a KGroupedStream, such as aggregations and windowed operations, which make it a key component for analyzing and processing data in a Kafka Streams application. Let's see some examples.

aggregate: Aggregates records within a grouped stream using a provided aggregation function. The first argument is a function that returns the initial aggregate value and the second is the aggregation function. Here we aggregate the length of the value of each record in a group and return the results for all groups in a KTable.

KTable<String, Long> aggregatedTable = groupedStream.aggregate(
    () -> 0L,
    (key, value, aggregate) -> aggregate + value.length());


count: Counts the number of records in each group of the grouped stream. The results for all groups are returned in a KTable.

KTable<String, Long> countTable = groupedStream.count();

reduce: Combines the values of records in each group of the grouped stream. Here we combine the values of all records of a group by concatenating them into a comma-separated string. The results for all groups are returned in a KTable.

KTable<String, String> reducedTable = groupedStream.reduce((value1, value2) -> value1 + ", " + value2);

windowedBy: Groups records into windows for time-based operations. Here we further group the events in the stream into 5-minute time windows.

TimeWindowedKStream<String, String> windowedStream = groupedStream.windowedBy(TimeWindows.of(Duration.ofMinutes(5)));
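A windowed stream is typically completed with an aggregation. As a minimal sketch building on the windowedStream above, counting the records per key in each 5-minute window looks like this:

// The result key is wrapped in Windowed<>, which carries the window's start and end timestamps
KTable<Windowed<String>, Long> windowedCounts = windowedStream.count();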
KTABLE AND KGROUPEDTABLE

Details on the KTable and KGroupedTable operations can be found in the Kafka Documentation.
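As a minimal sketch of the table-side API (the topic name "user-regions" and the grouping logic are illustrative), a KTable can be filtered and re-grouped into a KGroupedTable for aggregation:

// View the topic as a changelog table: one current value per key
KTable<String, String> userRegions = builder.table("user-regions");

// filter() keeps only the rows whose current value matches the predicate
KTable<String, String> euUsers = userRegions.filter((userId, region) -> region.startsWith("eu-"));

// groupBy() re-keys the table (here by region) and returns a KGroupedTable
KGroupedTable<String, String> byRegion = euUsers.groupBy((userId, region) -> KeyValue.pair(region, userId));

// count() aggregates the grouped table, e.g. the number of users per region
KTable<String, Long> usersPerRegion = byRegion.count();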
NOTABLE APACHE KAFKA ALTERNATIVES

These alternatives cover a spectrum of features and use cases, making them strong contenders for various messaging and event streaming scenarios. The choice of the most suitable alternative depends on your specific requirements, including factors such as scalability, cloud compatibility, and lightweight design.

Project: Description

Apache Pulsar: Apache Pulsar is a high-performance distributed messaging and event streaming platform. It offers multi-tenancy, strong durability, and geo-replication capabilities. Pulsar is known for its native support for fine-grained authorization and authentication.

Amazon Kinesis: Amazon Kinesis is a fully managed, cloud-based data streaming and processing service offered by AWS. It's designed for real-time data ingestion, storage, and processing at scale, making it well-suited for cloud-centric applications.

RabbitMQ: RabbitMQ is a robust open-source message broker that supports various messaging patterns, including publish-subscribe and work queues. It's known for its flexibility, support for multiple protocols, and extensive plugin ecosystem.

ADDITIONAL RESOURCES

This cheatsheet has provided you with essential information to kickstart your journey into the world of Apache Kafka. If you are seeking to deepen your knowledge, the following resources cover a wide range of learning materials, from official documentation and online courses to articles and community resources. Dive into these Kafka-related materials to enhance your understanding and proficiency in this robust event streaming platform.


• Apache Kafka Documentation: The official documentation provides comprehensive information on Kafka's architecture, configuration, and usage.

• Confluent Platform Training: Confluent, the company founded by the creators of Kafka, offers free online courses and tutorials on Kafka.

• LinkedIn Engineering Blog: LinkedIn's engineering blog often features Kafka-related articles, including best practices and use cases.

• Kafka Users Mailing List: A place to ask questions, share knowledge, and get help from the Kafka community.

• Apache Kafka GitHub Repository: The official source code repository for Kafka.

• Confluent GitHub Repository: Confluent's GitHub repository includes various Kafka-related projects and connectors.

• Confluent Community Hub: A marketplace for Kafka connectors, Kafka Streams applications, and other Kafka-related components shared by the community.

JCG delivers over 1 million pages each month to more than 700K software developers, architects and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.

CHEATSHEET FEEDBACK WELCOME: support@javacodegeeks.com
SPONSORSHIP OPPORTUNITIES: sales@javacodegeeks.com

Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

