I. Goals
- Use the command line to interact with Kafka
- Understand Kafka's pub-sub architecture
- Understand how a Kafka consumer reads data by offset
- Understand how a Kafka producer writes data
II. Objectives
- Connect to the Kafka cluster through brokers.
- Perform basic operations on Kafka topics: create, delete, and update parameters.
- Push data from a Kafka producer to brokers.
- Query data from Kafka brokers with a consumer group.
- Update the offset of a consumer group to control the reading process.
- Program producers and consumers with Python.
- Manage Kafka with the Admin API.
- Define stream topologies and process data with Kafka Streams.
III. PREPARATION
1. Log in to the SuperNode-XP system with your username and password.
2. Check the Kafka CLI on the current node (for this lab, the broker port is 9091 and the ZooKeeper port is 2181).
⚠️ Note: When using “_” or “.” in a topic name, a warning will appear, because topic names that differ only in these two characters can collide in Kafka's metric names. We can ignore these warnings here, since such characters simply help us avoid conflicts with existing topic names.
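For reference, a topic can also be created programmatically with the kafka-python Admin API covered later in this lab. A minimal sketch, assuming kafka-python is installed and the placeholder broker address is replaced with your own:

from kafka.admin import KafkaAdminClient, NewTopic

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# One partition and one replica, matching the defaults discussed below
client.create_topics([NewTopic(name='<YOUR_NAME>', num_partitions=1, replication_factor=1)])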
$ kafka-topics.sh --bootstrap-server <brokerip>:<port> --describe --topic <YOUR_NAME>
Outputs:
- First line: information about the topic.
  o PartitionCount: the number of partitions of the topic. Currently the topic has only a single partition that stores all the data; this is the default number when creating a topic.
  o ReplicationFactor: the number of replicas of each partition on the brokers. Here each partition is kept as a single replica; this is the default value when initializing a topic.
  o Configs segment.bytes: the maximum segment size before a new segment is created.
⚠️ Further Reading: Data on each partition is stored in logs, and each log is divided into segments. Learn more about Kafka data storage and cleanup with the “Kafka Retention” keyword.
⚠️ Further Reading: Default values for the number of partitions and the replication factor can be set in Kafka's configuration file before starting the Kafka server.
- The following lines: parameters of the topic's partitions. Each line corresponds to 1 partition.
  o Topic: the topic name of the partition.
  o Partition: the Id of the current partition.
  o Leader: the Id of the broker that holds the leader partition.
  o Replicas: the Ids of the brokers that hold the partition's replicas. The first Id is the broker holding the leader partition.
  o Isr (In-sync replicas): the Ids of the brokers that hold in-sync replicas of the partition, including the leader.
⚠️ Further Reading: in-sync means that the replica has caught up with the leader within replica.lag.time.max.ms (default: 10s). Refer to the keyword Kafka In-Sync Replica (Kafka ISR).
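The same metadata can also be read programmatically. A minimal sketch with the kafka-python Admin API introduced later in this lab, assuming a recent 2.x release that provides describe_topics and the field names shown in the comments:

from kafka import KafkaAdminClient

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Each entry describes one topic; its 'partitions' list carries the
# per-partition leader, replicas and isr fields explained above
for topic in client.describe_topics(['<YOUR_NAME>']):
    for p in topic['partitions']:
        print(p['partition'], p['leader'], p['replicas'], p['isr'])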
⚠️ Note: The number of partitions can be set arbitrarily (>= 1). However, be aware that a larger number of partitions, although it increases system throughput, also increases replication latency, load-balancing time, and the number of files the Kafka system has to serve. Read more about this in the Further Readings section.
⚠️ Note: Kafka does not recommend increasing the number of partitions of a topic, because doing so can affect the system's operation and data allocation. Reducing the number of partitions is not possible at all.
⚠️ Further Reading: In some cases we can still change the number of partitions through the --alter command. Read more in the documentation section below.
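Programmatically, the partition count can also be increased with kafka-python; a minimal sketch, assuming the topic already exists and the new total is larger than the current count:

from kafka.admin import KafkaAdminClient, NewPartitions

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Raise the topic to 3 partitions in total; decreasing is not possible
client.create_partitions({'<YOUR_NAME>': NewPartitions(total_count=3)})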
f. Delete a topic
$ kafka-topics.sh --bootstrap-server <brokerip>:<port> --delete --topic <YOUR_NAME>_NEW
$ kafka-console-producer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>
> Write something here!
⚠️ Note: Press Ctrl+C to exit
b. Send data in key:value format
We can define key-value data to take advantage of key-based partition storage:
$ kafka-console-producer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>
--property parse.key=true --property key.separator=":"
> ID_1:A
> ID_2:B
⚠️ Note: Always input data in key:value format when using the parse.key=true option
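The same keyed messages can be produced from Python with kafka-python; a minimal sketch, with placeholder broker address and topic name:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Keys and values are raw bytes; the key determines the target partition
producer.send('<YOUR_NAME>', key=b'ID_1', value=b'A')
producer.send('<YOUR_NAME>', key=b'ID_2', value=b'B')
producer.flush()  # block until the buffered messages are delivered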
In the kafka-console-producer terminal, input messages and observe the data on the consumer side.
⚠️ Further Reading: By default, kafka-console-consumer only processes messages that arrive after its launch time.
First, input data in key:value format. Then observe the messages on all partitions:
$ kafka-console-consumer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>_NEW --property print.key=true --property key.separator=":" --from-beginning
A:1
B:2
C:3
A:4
B:5
C:6
D:7
⚠️ Note: If the specified partition Id does not exist in the topic, an error will be raised.
⚠️ Further Reading: Messages with key = null are distributed across all partitions in turn.
⚠️ Note: The offset is a non-negative integer: 0 means read from the first message, 1 means skip the first message, and so on.
⚠️ Further Reading: The default offset value is ‘latest’, i.e. the consumer only reads messages produced after it is initialized. We can set offset='earliest' to read from the beginning, similar to using the --from-beginning option.
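In kafka-python the same behavior is controlled by the auto_offset_reset parameter; a minimal sketch, with placeholder broker address and topic name:

from kafka import KafkaConsumer

# auto_offset_reset='earliest' plays the role of --from-beginning for a
# consumer that has no committed offset yet; the default is 'latest'
consumer = KafkaConsumer('<YOUR_NAME>_NEW',
                         bootstrap_servers=['<BrokerIP>:9091'],
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.offset, message.key, message.value)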
4. Working with Consumer Group
a. Group setting for consumer
To assign a consumer to a group, add the --group option when launching it:
$ kafka-console-consumer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME> --group GROUP_1
Then open 2 more terminals and run 2 more consumers within the same consumer group GROUP_1. Observe the data in the 3 terminals while a producer pushes data into Kafka.
⚠️ Further Reading: When a consumer declares a group that does not exist yet, Kafka creates the consumer group automatically. If the group already exists, the consumer automatically joins the declared group.
⚠️ Note: The offset of a consumer group is shared by all consumers of that group. Therefore, --from-beginning only works when creating a completely new consumer group. You can verify this by running the same option on a brand-new consumer group.
- Lag: the number of messages waiting to be processed (= Log-End-Offset - Current-Offset).
- Consumer-Id: the Id of the consumer that is processing data on the partition.
- Host: the IP address of the client running the consumer.
- Client-Id: the Id of the client running the consumer. Consumers running on the same client have the same Client-Id.
⚠️ Further Reading: When there is only one consumer, that consumer processes data on all partitions. Try increasing or decreasing the number of consumers in the group to see what happens.
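The same per-partition offsets can be fetched from Python; a minimal sketch with the kafka-python Admin API, using the GROUP_1 group from above:

from kafka import KafkaAdminClient

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Returns a {TopicPartition: OffsetAndMetadata} mapping of committed offsets
offsets = client.list_consumer_group_offsets('GROUP_1')
for tp, om in offsets.items():
    print(tp.topic, tp.partition, om.offset)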
In case we want the consumer group to reprocess data, we can reset its offset with the --reset-offsets option:
- Scope: which topics the reset applies to:
  o --all-topics: all topics read by the group.
  o --topic: certain topics; you can enter 1 or more topics.
  o --topic <name>:<partitions>: certain partitions of a specific topic (partitions given by their Ids).
- Execution option: how the reset command is performed:
  o --dry-run: the default method; check the new offsets before performing the actual update.
  o --execute: perform the offset reset.
- Reset specification: define the new offset value:
  o --to-datetime: reset the offset based on a time, in 'YYYY-MM-DDTHH:mm:SS.sss' format.
  o --to-earliest: reset the offset back to the very first position.
  o --to-latest: reset the offset to the last (latest) position.
  o --shift-by: reset the offset by shifting the current offset by the declared value. That value can be a positive number (shift towards the end) or a negative number (shift back towards the beginning).
The example below moves the group's offset back by 3 messages on all topics. We can see that the new offset equals the offset checked earlier with the --describe option minus 3. You can verify this by running --describe again with the --offsets option.
⚠️ Note: Turn off all consumers in the group before deleting it, otherwise the operation will be rejected.
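kafka-python has no dedicated reset-offsets helper, but a consumer of the group can achieve the same effect by committing an explicit offset; a minimal sketch, assuming all other consumers of the group are stopped and that names are placeholders:

from kafka import KafkaConsumer
from kafka.structs import TopicPartition, OffsetAndMetadata

consumer = KafkaConsumer(group_id='GROUP_1',
                         bootstrap_servers=['<BrokerIP>:9091'])
tp = TopicPartition('<YOUR_NAME>', 0)
consumer.assign([tp])
# Committing offset 0 for partition 0 resets the group to the beginning
consumer.commit({tp: OffsetAndMetadata(0, '')})
consumer.close()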
kafka-python is a package that helps us interact with Kafka from the Python language. You can inspect the installed package with the “pip show kafka-python” command:
Name: kafka-python
Version: 2.0.2
Summary: Pure Python client for Apache Kafka
Home-page: https://github.com/dpkp/kafka-python
Author: Dana Powers
Author-email: dana.powers@gmail.com
License: Apache License 2.0
Location: /usr/local/lib/python3.5/dist-packages
Requires:
Required-by:
from kafka import KafkaAdminClient
broker_uri = '<BrokerIP>:<port>'
client = KafkaAdminClient(bootstrap_servers=[broker_uri])
client.describe_cluster()
⚠️ Further Reading: To run the above program, create a Python file named client.py and run the “python client.py” command.
⚠️ Note: Remember to change the IP address or hostname of Kafka Broker.
To send a message to a Kafka topic, we use the KafkaProducer class, whose inputs include the Kafka cluster address. We then call the send function with the topic and message as parameters:
producer = KafkaProducer(bootstrap_servers=[broker_uri])
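Continuing from the producer created above, a complete send might look like the sketch below; send() is asynchronous and returns a future, to which we attach callbacks in the next step (the topic name is the usual placeholder):

future = producer.send('<YOUR_NAME>', b'Hello, Kafka!')
producer.flush()  # block until all buffered messages are delivered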
9
def on_send_success(record_metadata):
    print(record_metadata.topic)
    print(record_metadata.partition)
    print(record_metadata.offset)

def on_send_error(ex):
    print(ex)
After declaring the two functions above, attach them as callbacks when sending data. A callback is handled once the response to the send() call is returned.
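In kafka-python this is done on the future returned by send(); a minimal sketch, continuing with the producer and the two callbacks defined above:

producer.send('<YOUR_NAME>', b'Hello, Kafka!') \
    .add_callback(on_send_success) \
    .add_errback(on_send_error)
producer.flush()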
consumer = KafkaConsumer(topic,
                         group_id=consumer_group,
                         bootstrap_servers=[broker_uri])
for message in consumer:
    print("key=%s value=%s" % (message.key, message.value))
⚠️ Further Reading: Try extracting the partition, topic and offset of message yourself.
7. Kafka Streams
Kafka Streams is a client-side library, mainly used to build applications or microservices for real-time processing. Notably, the input and output data are stored on topics of the Kafka cluster, while the stream-processing application itself runs on the client side rather than on the brokers.
8. An example of Kafka Streams with the word count problem
Kafka Streams supports two programming languages, namely Java and Scala. We can try the WordCount example from the Kafka package:
kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo
- Start the console producer in a separate terminal to write some input data to this topic:
- Inspect the output of the WordCount demo application by reading from its output topic with the console consumer in a separate terminal:
Back in the producer terminal, type some text and check the result in the consumer terminal.
Next, in the following example, we will build a WordCount application with Java 8.
a. Build a Stream Topology
First, we define a stream topology for the WordCount problem with the following steps:
- Step 1: Initialize a record stream with KStream to read data from a topic
final KStream<String, String> textLines = builder.stream(inputTopic);
- Step 2: Split each sentence of the above KStream into tokens using a Java regex pattern:
final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
final KStream<String, String> words = textLines.flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())));
- Step 3: Group the tokens and count the occurrences of each, producing a KTable changelog:
final KTable<String, Long> wordCounts = words.groupBy((key, word) -> word).count();
c. Define Kafka Streams Application
final Properties streamsConfiguration = getStreamsConfiguration(bootstrapServers);
final KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
Based on the sections above, you should implement a complete Java program for the WordCount problem and compile it to generate a .jar file (name the jar file WordCountExample).
d. Deploy the WordCount application with Kafka Streams
Firstly, we need to initialize two topics to read and write data, namely
<YOUR_NAME>_STREAM and <YOUR_NAME>_STREAM_OUTPUT. You can use the
kafka-topics or KafkaAdminClient mentioned in the above sections.
$ kafka-topics.sh --bootstrap-server <broker-ip>:<port> --create --topic <YOUR_NAME>_STREAM --partitions 3 --replication-factor 3
Created topic <YOUR_NAME>_STREAM
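Alternatively, both topics can be created in one call with the KafkaAdminClient mentioned above; a minimal sketch with a placeholder broker address:

from kafka.admin import KafkaAdminClient, NewTopic

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])
# Input and output topics for the WordCount application
client.create_topics([
    NewTopic(name='<YOUR_NAME>_STREAM', num_partitions=3, replication_factor=3),
    NewTopic(name='<YOUR_NAME>_STREAM_OUTPUT', num_partitions=3, replication_factor=3),
])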
9. Homework
1. Developing the WordCount problem:
- A. In case a new consumer joins the group, is there any way for that new consumer to continue the word counts that the existing consumers have accumulated? Specifically:
  o Consumer A is running and has just finished counting words with the result “abc”: 1, “ab”: 2, “d”: 3.
  o Assume a consumer named B joins A's group; when it receives the message “abc d”, consumer B should give the result “abc”: 2, “ab”: 2, “d”: 4.
- B. Analyze the pros and cons of the proposed methods.
2. In this exercise you will create a consumer group and observe what happens when the number of Kafka consumers changes. Follow these steps:
a. Create a topic t1 with 3 partitions
kafka-topics.sh --bootstrap-server :<port> --create --topic t1 --partitions 3
b. Start a new consumer c1 in a consumer group CG1
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
c. Start a Kafka producer that targets partition 0. Send a couple of messages with the key 0 so you can easily identify which producer sends which messages.
kafka-console-producer.sh --bootstrap-server :<port> --topic t1 --property parse.key=true --property key.separator=":"
d. Start another Kafka producer to send messages to partition 2 (with key 2).
At this point you should have 3 partitions, 2 producers, and 1 consumer. Observe what messages are consumed and how. Simply send messages in a way that lets you identify which message used which partition.
a. Start a new consumer c2 in the CG1 consumer group.
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
b. Observe the logs in the Kafka broker.
c. Send a couple of messages to observe if and how messages are distributed across the two consumers.
d. Start a new consumer c3 in the CG1 consumer group.
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
e. Send a couple of messages to observe if and how messages are distributed across the consumers.
f. Shut down any of the running consumers and observe which consumer takes over the “abandoned” partition.