I. Goals
- Use the command line to interact with Kafka
- Understand Kafka's pub-sub architecture
- Understand how a Kafka consumer reads data by offset
- Understand how a Kafka producer writes data
II. Objectives
- Connect to the Kafka cluster through brokers.
- Perform basic operations on Kafka topics: create, delete, and update parameters.
- Push data from a Kafka producer to brokers.
- Query data from Kafka brokers with a consumer group.
- Update the offset of a consumer group to control the reading process.
- Program producers and consumers with Python.
- Manage Kafka with the Admin API.
- Define stream topologies and process data with Kafka Streams.
III. PREPARATION
1. Log in to the SuperNode-XP system with your username and password.
2. Check the Kafka CLI on the current node (for this lab, the broker port is 9091 and the ZooKeeper port is 2181).
⚠️ Note: When using “_” or “.” in a topic name, a warning will appear, because topic names that differ only in these two characters can collide in Kafka's metric names. We can ignore these warnings here, since such characters simply help us avoid conflicts with existing topic names.
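For reference, a topic can also be created programmatically with the kafka-python Admin API covered later in this lab. A minimal sketch, assuming kafka-python is installed and the placeholder broker address is replaced with your own:

from kafka.admin import KafkaAdminClient, NewTopic

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# One partition and one replica, matching the defaults discussed below
client.create_topics([NewTopic(name='<YOUR_NAME>', num_partitions=1, replication_factor=1)])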
$ kafka-topics.sh --bootstrap-server <brokerip>:<port> --describe --topic <YOUR_NAME>
Outputs:
- First line: information about the topic.
  o PartitionCount: the number of partitions of the topic. Currently the topic has only a single partition that stores all the data; this is the default number when creating a topic.
  o ReplicationFactor: the number of replicas of each partition on the brokers. Here each partition is kept as a single replica; this is the default value when initializing a topic.
  o Configs segment.bytes: the maximum segment size before a new segment is created.
⚠️ Further Reading: Data on each partition is stored in logs, and each log is divided into segments. Learn more about Kafka data storage and cleanup with the “Kafka Retention” keyword.
⚠️ Further Reading: Default values for the number of partitions and the replication factor can be set in Kafka's configuration file before starting the Kafka server.
- The following lines: parameters of the topic's partitions. Each line corresponds to 1 partition.
  o Topic: the topic name of the partition.
  o Partition: the Id of the current partition.
  o Leader: the Id of the broker that holds the leader partition.
  o Replicas: the Ids of the brokers that hold the partition's replicas. The first Id is the broker holding the leader partition.
  o Isr (In-sync replicas): the Ids of the brokers that hold in-sync replicas of the partition, including the leader.
⚠️ Further Reading: in-sync means that the replica has caught up with the leader within replica.lag.time.max.ms (default: 10s). Refer to the keyword Kafka In-Sync Replica (Kafka ISR).
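The same metadata can also be read programmatically. A minimal sketch with the kafka-python Admin API introduced later in this lab, assuming a recent 2.x release that provides describe_topics and the field names shown in the comments:

from kafka import KafkaAdminClient

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Each entry describes one topic; its 'partitions' list carries the
# per-partition leader, replicas and isr fields explained above
for topic in client.describe_topics(['<YOUR_NAME>']):
    for p in topic['partitions']:
        print(p['partition'], p['leader'], p['replicas'], p['isr'])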
⚠️ Note: The number of partitions can be set arbitrarily (>= 1). However, be aware that a larger number of partitions, although it increases system throughput, also increases replication latency, load-balancing time, and the number of files the Kafka system has to serve. Read more about this in the Further Readings section.
⚠️ Note: Kafka does not recommend increasing the number of partitions of a topic, because doing so can affect the system's operation and data allocation. Reducing the number of partitions is not possible at all.
⚠️ Further Reading: In some cases we can still change the number of partitions through the --alter command. Read more in the documentation section below.
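Programmatically, the partition count can also be increased with kafka-python; a minimal sketch, assuming the topic already exists and the new total is larger than the current count:

from kafka.admin import KafkaAdminClient, NewPartitions

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Raise the topic to 3 partitions in total; decreasing is not possible
client.create_partitions({'<YOUR_NAME>': NewPartitions(total_count=3)})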
f. Delete a topic
$ kafka-topics.sh --bootstrap-server <brokerip>:<port> --delete --topic <YOUR_NAME>_NEW
$ kafka-console-producer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>
> Write something here!
⚠️ Note: Press Ctrl+C to exit
b. Send data in key:value format
We can define key-value data to take advantage of key-based partition storage:
$ kafka-console-producer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>
--property parse.key=true --property key.separator=":"
> ID_1:A
> ID_2:B
⚠️ Note: Always input data in key:value format when using the parse.key=true option
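The same keyed messages can be produced from Python with kafka-python; a minimal sketch, with placeholder broker address and topic name:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Keys and values are raw bytes; the key determines the target partition
producer.send('<YOUR_NAME>', key=b'ID_1', value=b'A')
producer.send('<YOUR_NAME>', key=b'ID_2', value=b'B')
producer.flush()  # block until the buffered messages are delivered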
In the kafka-console-producer terminal, input messages and observe the data on the consumer side.
⚠️ Further Reading: By default, kafka-console-consumer only processes messages that arrive after its launch time.
First, input data in key:value format. Then observe the messages on all partitions:
$ kafka-console-consumer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME>_NEW --property print.key=true --property key.separator=":" --from-beginning
A:1
B:2
C:3
A:4
B:5
C:6
D:7
⚠️ Note: If the specified partition Id does not exist in the topic, an error will be raised.
⚠️ Further Reading: Messages with key = null are distributed across all partitions in turn.
⚠️ Note: The offset is a non-negative integer: 0 means read from the first message, 1 means skip the first message, and so on.
⚠️ Further Reading: The default offset value is ‘latest’, i.e. the consumer only reads messages produced after it is initialized. We can set offset='earliest' to read from the beginning, similar to using the --from-beginning option.
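In kafka-python the same behavior is controlled by the auto_offset_reset parameter; a minimal sketch, with placeholder broker address and topic name:

from kafka import KafkaConsumer

# auto_offset_reset='earliest' plays the role of --from-beginning for a
# consumer that has no committed offset yet; the default is 'latest'
consumer = KafkaConsumer('<YOUR_NAME>_NEW',
                         bootstrap_servers=['<BrokerIP>:9091'],
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.offset, message.key, message.value)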
4. Working with Consumer Group
a. Group setting for consumer
To assign a consumer to a group, add the --group option when launching it:
$ kafka-console-consumer.sh --bootstrap-server <brokerip>:<port> --topic <YOUR_NAME> --group GROUP_1
Then open 2 more terminals and run 2 more consumers within the same consumer group GROUP_1. Observe the data in the 3 terminals while a producer pushes data into Kafka.
⚠️ Further Reading: When a consumer declares a group that does not exist yet, Kafka creates the consumer group automatically. If the group already exists, the consumer automatically joins the declared group.
⚠️ Note: The offset of a consumer group is shared by all consumers of that group. Therefore, --from-beginning only works when creating a completely new consumer group. You can verify this by running the same option on a brand-new consumer group.
- Lag: the number of messages waiting to be processed (= Log-End-Offset - Current-Offset).
- Consumer-Id: the Id of the consumer that is processing data on the partition.
- Host: the IP address of the client running the consumer.
- Client-Id: the Id of the client running the consumer. Consumers running on the same client have the same Client-Id.
⚠️ Further Reading: When there is only one consumer, that consumer processes data on all partitions. Try increasing or decreasing the number of consumers in the group to see what happens.
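The same per-partition offsets can be fetched from Python; a minimal sketch with the kafka-python Admin API, using the GROUP_1 group from above:

from kafka import KafkaAdminClient

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])  # placeholder address
# Returns a {TopicPartition: OffsetAndMetadata} mapping of committed offsets
offsets = client.list_consumer_group_offsets('GROUP_1')
for tp, om in offsets.items():
    print(tp.topic, tp.partition, om.offset)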
In case we want the consumer group to reprocess data, we can reset its offset with the --reset-offsets option:
- Scope: which topics the reset applies to:
  o --all-topics: all topics read by the group.
  o --topic: certain topics; you can enter 1 or more topics.
  o --topic <name>:<partitions>: certain partitions of a specific topic (partitions given by their Ids).
- Execution option: how the reset command is performed:
  o --dry-run: the default method; check the new offsets before performing the actual update.
  o --execute: perform the offset reset.
- Reset specification: define the new offset value:
  o --to-datetime: reset the offset based on a time, in 'YYYY-MM-DDTHH:mm:SS.sss' format.
  o --to-earliest: reset the offset back to the very first position.
  o --to-latest: reset the offset to the last (latest) position.
  o --shift-by: reset the offset by shifting the current offset by the declared value. That value can be a positive number (shift towards the end) or a negative number (shift back towards the beginning).
The example below moves the group's offset back by 3 messages on all topics. We can see that the new offset equals the offset checked earlier with the --describe option minus 3. You can verify this by running --describe again with the --offsets option.
⚠️ Note: Turn off all consumers in the group before deleting it, otherwise the operation will be rejected.
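kafka-python has no dedicated reset-offsets helper, but a consumer of the group can achieve the same effect by committing an explicit offset; a minimal sketch, assuming all other consumers of the group are stopped and that names are placeholders:

from kafka import KafkaConsumer
from kafka.structs import TopicPartition, OffsetAndMetadata

consumer = KafkaConsumer(group_id='GROUP_1',
                         bootstrap_servers=['<BrokerIP>:9091'])
tp = TopicPartition('<YOUR_NAME>', 0)
consumer.assign([tp])
# Committing offset 0 for partition 0 resets the group to the beginning
consumer.commit({tp: OffsetAndMetadata(0, '')})
consumer.close()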
kafka-python is a package that helps us interact with Kafka from the Python language. You can inspect the installed package with the “pip show kafka-python” command:
Name: kafka-python
Version: 2.0.2
Summary: Pure Python client for Apache Kafka
Home-page: https://github.com/dpkp/kafka-python
Author: Dana Powers
Author-email: dana.powers@gmail.com
License: Apache License 2.0
Location: /usr/local/lib/python3.5/dist-packages
Requires:
Required-by:
from kafka import KafkaAdminClient
broker_uri = '<BrokerIP>:<port>'
client = KafkaAdminClient(bootstrap_servers=[broker_uri])
client.describe_cluster()
⚠️ Further Reading: To run the above program, create a Python file named client.py and run the “python client.py” command.
⚠️ Note: Remember to change the IP address or hostname of Kafka Broker.
To send a message to a Kafka topic, we use the KafkaProducer class, whose inputs include the Kafka cluster address. We then call the send function with the topic and message as parameters:
producer = KafkaProducer(bootstrap_servers=[broker_uri])
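Continuing from the producer created above, a complete send might look like the sketch below; send() is asynchronous and returns a future, to which we attach callbacks in the next step (the topic name is the usual placeholder):

future = producer.send('<YOUR_NAME>', b'Hello, Kafka!')
producer.flush()  # block until all buffered messages are delivered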
9
def on_send_success(record_metadata):
    print(record_metadata.topic)
    print(record_metadata.partition)
    print(record_metadata.offset)

def on_send_error(ex):
    print(ex)
After declaring the two functions above, attach them as callbacks when sending data. A callback is handled once the response to the send() call is returned.
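In kafka-python this is done on the future returned by send(); a minimal sketch, continuing with the producer and the two callbacks defined above:

producer.send('<YOUR_NAME>', b'Hello, Kafka!') \
    .add_callback(on_send_success) \
    .add_errback(on_send_error)
producer.flush()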
consumer = KafkaConsumer(topic,
                         group_id=consumer_group,
                         bootstrap_servers=[broker_uri])
for message in consumer:
    print("key=%s value=%s" % (message.key, message.value))
⚠️ Further Reading: Try extracting the partition, topic and offset of message yourself.
7. Kafka Streams
Kafka Streams is a client-side library, mainly used to build applications or microservices for real-time processing. Notably, the input and output data are stored on topics of the Kafka cluster, while the stream-processing application itself runs on the client side rather than on the brokers.
8. An example of Kafka Streams with the word count problem
Kafka Streams supports two programming languages, namely Java and Scala. We can try the WordCount example from the Kafka package:
kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo
- Start the console producer in a separate terminal to write some input data to this topic:
- Inspect the output of the WordCount demo application by reading from its output topic with the console consumer in a separate terminal:
Back in the producer terminal, type some text and check the result in the consumer terminal.
Next, in the following example, we will build a WordCount application with Java 8.
a. Build a Stream Topology
First, we define a stream topology for the WordCount problem with the following steps:
- Step 1: Initialize a record stream with KStream to read data from a topic
final KStream<String, String> textLines = builder.stream(inputTopic);
- Step 2: Split each sentence of the above KStream into tokens using a Java regex pattern:
final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
final KStream<String, String> words = textLines.flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())));
- Step 3: Group the tokens and count the occurrences of each, producing a KTable changelog:
final KTable<String, Long> wordCounts = words.groupBy((key, word) -> word).count();
c. Define Kafka Streams Application
final Properties streamsConfiguration = getStreamsConfiguration(bootstrapServers);
final KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
Based on the sections above, you should implement a complete Java program for the WordCount problem and compile it to generate a .jar file (name the jar file WordCountExample).
d. Deploy the WordCount application with Kafka Streams
Firstly, we need to initialize two topics to read and write data, namely
<YOUR_NAME>_STREAM and <YOUR_NAME>_STREAM_OUTPUT. You can use the
kafka-topics or KafkaAdminClient mentioned in the above sections.
$ kafka-topics.sh --bootstrap-server <broker-ip>:<port> --create --topic <YOUR_NAME>_STREAM --partitions 3 --replication-factor 3
Created topic <YOUR_NAME>_STREAM
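Alternatively, both topics can be created in one call with the KafkaAdminClient mentioned above; a minimal sketch with a placeholder broker address:

from kafka.admin import KafkaAdminClient, NewTopic

client = KafkaAdminClient(bootstrap_servers=['<BrokerIP>:9091'])
# Input and output topics for the WordCount application
client.create_topics([
    NewTopic(name='<YOUR_NAME>_STREAM', num_partitions=3, replication_factor=3),
    NewTopic(name='<YOUR_NAME>_STREAM_OUTPUT', num_partitions=3, replication_factor=3),
])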
9. Homework
1. Developing the WordCount problem:
- A. In case a new consumer joins the group, is there any way for that new consumer to continue the word counts that the existing consumers have accumulated? Specifically:
  o Consumer A is running and has just finished counting words with the result “abc”: 1, “ab”: 2, “d”: 3.
  o Assume a consumer named B joins A's group; when it receives the message “abc d”, consumer B should give the result “abc”: 2, “ab”: 2, “d”: 4.
- B. Analyze the pros and cons of the proposed methods.
2. In this exercise you will create a consumer group and observe what happens when the number of Kafka consumers changes. Follow these steps:
a. Create a topic t1 with 3 partitions
kafka-topics.sh --bootstrap-server :<port> --create --topic t1 --partitions 3
b. Start a new consumer c1 in a consumer group CG1
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
c. Start a Kafka producer that targets partition 0. Send a couple of messages with the key 0 so you can easily identify which producer sends which messages.
kafka-console-producer.sh --bootstrap-server :<port> --topic t1 --property parse.key=true --property key.separator=":"
d. Start another Kafka producer to send messages to partition 2 (with key 2).
At this point you should have 3 partitions, 2 producers, and 1 consumer. Observe what messages are consumed and how. Simply send messages in a way that lets you identify which message used which partition.
a. Start a new consumer c2 in the CG1 consumer group.
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
b. Observe the logs in the Kafka broker.
c. Send a couple of messages to observe if and how messages are distributed across the two consumers.
d. Start a new consumer c3 in the CG1 consumer group.
kafka-console-consumer.sh --bootstrap-server :<port> --topic t1 --consumer-property group.id=CG1
e. Send a couple of messages to observe if and how messages are distributed across the consumers.
f. Shut down any of the running consumers and observe which consumer takes over the “abandoned” partition.