
Apache Kafka

Setup on Mac with Brew


• Install Java 8 and verify
• java -version
• Install Kafka
• brew install kafka
• Configure Kafka
• vi /usr/local/etc/kafka/server.properties
• listeners=PLAINTEXT://localhost:9092
• Start Zookeeper
• zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
• Start Kafka Server
• kafka-server-start /usr/local/etc/kafka/server.properties
Setup on Mac with Brew
• Create Topic
• kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
• List Topics
• kafka-topics --list --zookeeper localhost:2181
• Run Producer
• kafka-console-producer --broker-list localhost:9092 --topic test
• Run Consumer
• kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning
Setup on Ubuntu
• Connect to an Ubuntu AWS EC2 instance
• ssh -i .ssh/chef.pem ubuntu@35.175.179.37
• Install java
• sudo apt-get update
• sudo apt-get install openjdk-8-jdk
• Install Kafka
• sudo su
• wget https://archive.apache.org/dist/kafka/2.0.0/kafka_2.12-2.0.0.tgz
• tar -xvf kafka_2.12-2.0.0.tgz
• cd kafka_2.12-2.0.0
• Start Zookeeper
• ./bin/zookeeper-server-start.sh ./config/zookeeper.properties
• Configure Kafka
• vi ./config/server.properties
• listeners=PLAINTEXT://localhost:9092
• Start Kafka Server
• ./bin/kafka-server-start.sh ./config/server.properties
Setup on Ubuntu
• Create Topic
• ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
• List Topics
• ./bin/kafka-topics.sh --list --zookeeper localhost:2181
• Run Producer
• ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
• Run Consumer
• ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Confluent Cloud
• Sign up with Confluent
• https://www.confluent.io/confluent-cloud/#sign-up
• Login to Confluent Cloud Console
• Create Cluster in default environment
• name: glarimy
• Install Confluent Cloud Client
• curl -L https://cnfl.io/ccloud-cli | sh -s -- -b /usr/local/bin
• Login to Confluent Cloud from terminal
• ccloud login
• Use the cluster
• ccloud kafka cluster list
• ccloud kafka cluster use <cluster-id>
• Create an API key, make a note of it, and use it
• ccloud api-key create
• ccloud api-key use <API-KEY>
• Create and use test topic
• ccloud kafka topic create test
• ccloud kafka topic produce test
• ccloud kafka topic consume -b test
Single Node - Multiple Brokers
• Move to Kafka Configuration
• cd /usr/local/etc/kafka (Brew) or cd kafka_2.12-2.0.0/config (Ubuntu tarball)
• Create configurations for two more brokers
• touch server-1.properties
• broker.id=1
• listeners=PLAINTEXT://localhost:9093
• log.dirs=/usr/local/var/lib/kafka-logs-1
• touch server-2.properties
• broker.id=2
• listeners=PLAINTEXT://localhost:9094
• log.dirs=/usr/local/var/lib/kafka-logs-2
• Start Zookeeper and the servers
• zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
• kafka-server-start /usr/local/etc/kafka/server.properties
• kafka-server-start /usr/local/etc/kafka/server-1.properties
• kafka-server-start /usr/local/etc/kafka/server-2.properties
Single Node - Multiple Brokers
• Create Topic
• kafka-topics --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic multi-broker-test
• List Topics
• kafka-topics --list --zookeeper localhost:2181
• Run Producer
• kafka-console-producer --broker-list localhost:9092,localhost:9093,localhost:9094 --topic multi-broker-test
• Run Consumer
• kafka-console-consumer --bootstrap-server localhost:9092 --topic multi-broker-test --from-beginning
• kafka-console-consumer --bootstrap-server localhost:9093 --topic multi-broker-test --from-beginning
• kafka-console-consumer --bootstrap-server localhost:9094 --topic multi-broker-test --from-beginning
Topic Operations
• kafka-topics.sh --zookeeper zk --create --topic my-topic --replication-factor 1 --partitions 1
• kafka-topics.sh --zookeeper zk --alter --topic my-topic --partitions 16
• kafka-topics.sh --zookeeper zk --delete --topic my-topic
• kafka-topics.sh --zookeeper zk --list
• kafka-topics.sh --zookeeper zk --describe
• kafka-topics.sh --zookeeper zk --describe --under-replicated-partitions
Consumer Operations
• kafka-consumer-groups.sh --new-consumer --bootstrap-server br --list
• kafka-consumer-groups.sh --zookeeper zk --describe --group testgroup
• kafka-consumer-groups.sh --zookeeper zk --delete --group testgroup
Config Operations
• kafka-configs.sh --zookeeper zk --alter --entity-type topics --entity-name my-topic --add-config <key>=<value>[,<key>=<value>…]
• kafka-configs.sh --zookeeper zk --describe --entity-type topics --entity-name my-topic
• kafka-configs.sh --zookeeper zk --alter --entity-type topics --entity-name my-topic --delete-config retention.ms
Other Operations
• kafka-run-class.sh kafka.tools.DumpLogSegments --files abc.log
• kafka-replica-verification.sh --broker-list br1,br2 --topic-white-list 'my-.*'
Topics, Partitions and offsets
● Topics: a particular stream of data
− Similar to a table in a database (without all the constraints)
− You can have as many topics as you want
− A topic is identified by its name
● Topics are split into partitions
− Each partition is ordered
− Each message within a partition gets an incremental ID, called an offset
Topic example
● Say you have a fleet of trucks; each truck reports its GPS position to Kafka.
● You can have a topic "trucks_gps" that contains the positions of all trucks.
● Each truck will send a message to Kafka every 20 seconds; each message will contain the truck ID and the truck position (latitude and longitude).
● We choose to create that topic with 10 partitions (an arbitrary number).
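A topic like this could be created with the same tooling used in the setup sections; a minimal sketch, assuming the local single-broker setup (hence replication factor 1):
• kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic trucks_gps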
Topics, Partitions and offsets
● Offsets only have a meaning for a specific partition.
− E.g. offset 3 in partition 0 doesn't represent the same data as offset 3 in partition 1
● Order is guaranteed only within a partition (not across partitions)
● Data is kept only for a limited time (default is one week)
● Once the data is written to a partition, it can't be changed (immutability)
● Data is assigned randomly to a partition unless a key is provided (more on this later)
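To see that offsets are per-partition, the console consumer can be pointed at an explicit partition and offset; a sketch against the local test topic from the setup:
• kafka-console-consumer --bootstrap-server localhost:9092 --topic test --partition 0 --offset 3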
Brokers
● A Kafka cluster is composed of multiple brokers (servers)
● Each broker is identified by its ID (an integer)
● Each broker contains certain topic partitions
● After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster
● A good number to get started is 3 brokers, but big clusters have over 100 brokers
● In these examples we choose to number brokers starting at 100 (arbitrary)
Topic replication factor
● Topics should have a replication factor > 1 (usually between 2 and 3)
● This way, if a broker is down, another broker can serve the data
● Example: Topic-A with 2 partitions and a replication factor of 2
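A topic like Topic-A can be created against the single-node, multi-broker setup from earlier (a sketch; it needs at least two running brokers):
• kafka-topics --create --zookeeper localhost:2181 --replication-factor 2 --partitions 2 --topic Topic-A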
Topic replication factor
● Example: we lost Broker 102
● Result: Brokers 101 & 103 can still serve the data
Concept of Leader for a Partition
● At any time, only one broker can be the leader for a given partition
● Only that leader can receive and serve data for the partition
● The other brokers will synchronize the data
● Therefore each partition has one leader and multiple ISRs (in-sync replicas)
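The leader and in-sync replicas of each partition can be inspected with --describe (the output includes Leader, Replicas and Isr columns); a sketch against the multi-broker topic created earlier:
• kafka-topics --describe --zookeeper localhost:2181 --topic multi-broker-test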
Producers
● Producers write data to topics (which are made of partitions)
● Producers automatically know which broker and partition to write to
● In case of broker failures, producers will automatically recover
Producers
● Producers can choose to receive acknowledgment of data writes:
− acks=0: producer won't wait for acknowledgment (possible data loss)
− acks=1: producer will wait for the leader's acknowledgment (limited data loss)
− acks=all: leader + replicas acknowledgment (no data loss)
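The acks setting can be tried from the console producer by passing it as a producer property; a sketch against the local test topic from the setup:
• kafka-console-producer --broker-list localhost:9092 --topic test --producer-property acks=all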
Producers: Message Keys
● Producers can choose to send a key with the message (string, number, etc.)
● If key=null, data is sent round robin (broker 101, then 102, then 103...)
● If a key is sent, then all messages for that key will always go to the same partition
● A key is basically sent if you need message ordering for a specific field (e.g. truck_id)
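Keyed messages can be sent from the console producer via the parse.key and key.separator properties; a sketch using the trucks_gps example (after it starts, type key:value pairs such as truck_123:40.7,-74.0 — the key and value names are illustrative):
• kafka-console-producer --broker-list localhost:9092 --topic trucks_gps --property parse.key=true --property key.separator=: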
Consumers
● Consumers read data from a topic (identified by name)
● Consumers know which broker to read from
● In case of broker failures, consumers know how to recover
● Data is read in order within each partition
Consumer Groups
● Consumers can read data in consumer groups
● Each consumer within a group reads from
exclusive partitions
● If you have more consumers than partitions, some consumers will be inactive
Consumer Groups
What if too many consumers?
● If you have more consumers than partitions,
some consumers will be inactive
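Consumer groups can be tried with the console consumer's --group flag; starting the command below in two terminals places both consumers in one group (the group name my-group is arbitrary). With the single-partition test topic from the setup, the second consumer stays idle, illustrating the point above:
• kafka-console-consumer --bootstrap-server localhost:9092 --topic test --group my-group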
Consumer Offsets
● Kafka stores the offsets at which a consumer group has been reading
● The committed offsets live in a Kafka topic named __consumer_offsets
● When a consumer in a group has processed data received from Kafka, it should be committing the offsets
● If a consumer dies, it will be able to read back from where it left off, thanks to the committed consumer offsets!
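Committed offsets and lag can be inspected per partition (the output includes CURRENT-OFFSET, LOG-END-OFFSET and LAG columns); a sketch, reusing the my-group group from above:
• kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group my-group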
Delivery Semantics for consumers
● Consumers choose when to commit offsets
● There are 3 delivery semantics:
● At most once:
− Offsets are committed as soon as the message is received
− If the processing goes wrong, the message will be lost (it won't be read again)
● At least once (usually preferred):
− Offsets are committed after the message is processed
− If the processing goes wrong, the message will be read again
− This can result in duplicate processing of messages. Make sure your processing is idempotent (i.e. processing the messages again won't impact your systems)
● Exactly once:
− Can be achieved for Kafka => Kafka workflows using the Kafka Streams API
− For Kafka => external system workflows, use an idempotent consumer
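A rough sketch of the at-most-once end of the spectrum with the console consumer: offsets auto-commit on a timer, independent of processing (real clients control this with enable.auto.commit and explicit commit calls; my-group is reused from above):
• kafka-console-consumer --bootstrap-server localhost:9092 --topic test --group my-group --consumer-property enable.auto.commit=true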
Kafka Broker Discovery
● Every Kafka broker is also called a "bootstrap
server"
● That means that you only need to connect to one
broker, and you will be connected to the entire
cluster.
● Each broker knows about all brokers, topics and
partitions (metadata)
Zookeeper
● Zookeeper manages brokers (keeps a list of them)
● Zookeeper helps in performing leader election for partitions
● Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc.)
● Kafka can't work without Zookeeper
● Zookeeper by design operates with an odd number of servers (3, 5, 7)
● Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads)
● (Zookeeper does NOT store consumer offsets with Kafka > v0.10)
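The broker list that Zookeeper maintains can be inspected with the zookeeper-shell tool that ships with Kafka; a sketch against the local setup (it prints the registered broker IDs, e.g. [0, 1, 2]):
• zookeeper-shell localhost:2181 ls /brokers/ids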
Kafka Guarantees
● Messages are appended to a topic-partition in the order they are
sent
● Consumers read messages in the order stored in a topic-partition
● With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down
● This is why a replication factor of 3 is a good idea:
− Allows for one broker to be taken down for maintenance
− Allows for another broker to be taken down unexpectedly
● As long as the number of partitions remains constant for a topic
(no new partitions), the same key will always go to the same
partition
Security
• Man-in-the-middle Attacks
• Encrypt
• Authentication
• Identity
• Authorization
• Role
• JAAS
• Principal and Role
• SSL
• Certificates and Keys
• SASL
• Simple Authentication and Security Layer
• SASL PLAIN: username and password on Kafka brokers
• SASL SCRAM: hashes stored in Zookeeper (no dependency on the broker)
• SASL GSSAPI: AD-based Kerberos
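A minimal JAAS file sketch for SASL PLAIN on a broker (the admin username/password and the file name kafka_server_jaas.conf are assumptions; KafkaServer is the section name the broker looks for):
KafkaServer {
org.apache.kafka.common.security.plain.PlainLoginModule required
username="admin"
password="admin-secret"
user_admin="admin-secret";
};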
ACL
• ACL to allow
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --add
• --allow-principal User:Bob
• --allow-principal User:Alice
• --allow-host Host1,Host2
• --operation Read
• --topic Test-topic
ACL
• ACL to deny
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --add
• --allow-principal User:*
• --allow-host *
• --deny-principal User:BadBob
• --deny-host bad-host
• --operation Read
• --topic Test-topic
ACL
• Removing ACL
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --remove
• --allow-principal User:Bob
• --allow-principal User:Alice
• --allow-host Host1,Host2
• --operation Read
• --topic Test-topic
Security
• Listing the ACL
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --list
• --topic Test-topic
Security
• Convenient Methods
• bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --producer --topic Test-topic
• bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --consumer --topic Test-topic --group Group-1
JMX Metrics
• Download JMXTERM
• https://docs.cyclopsgroup.org/jmxterm
• Start the interactive shell
• java -jar jmxterm-1.0.0-uber.jar

• List the domains
• domains
• Use a domain
• domain <domain-name>
• List the beans
• beans
• Use a bean
• bean <bean-name>
• List the attributes
• info
• Collect the metric
• get <attribute>
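A typical session against a broker started with JMX enabled (a sketch; the JMX port 9999 and the MessagesInPerSec bean are illustrative choices):
• JMX_PORT=9999 kafka-server-start /usr/local/etc/kafka/server.properties
• java -jar jmxterm-1.0.0-uber.jar -l localhost:9999
• domains
• domain kafka.server
• beans
• bean kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
• info
• get Count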
Kafka Metrics
• kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
• kafka.controller:type=KafkaController,name=ActiveControllerCount
• kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
• kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
• kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
• kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
• kafka.server:type=ReplicaManager,name=PartitionCount
• kafka.server:type=ReplicaManager,name=LeaderCount
• kafka.controller:type=KafkaController,name=OfflinePartitionsCount
• kafka.network:type=RequestMetrics
• kafka.log:type=Log
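Any of these can be read through jmxterm; for example, a gauge such as UnderReplicatedPartitions exposes its reading through a Value attribute (an assumption worth confirming with info first):
• get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value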
