Apache Kafka
Introduction to Kafka

- Apache Kafka is a distributed streaming platform.
- Apache Kafka is a publish/subscribe messaging system. It is a horizontally scalable, fault-tolerant system.
- Kafka is used for these purposes:
1. To build real-time streaming pipelines that move data between systems or applications
2. To build real-time streaming applications that transform or react to streams of data
- Kafka Core Concepts
1. Kafka runs as a cluster on one or more servers
2. The Kafka cluster stores streams of records in categories called topics
3. Each record consists of a key, a value, and a timestamp
- Kafka APIs (minimal sketches of the first three APIs follow at the end of this section)
1. Producer API: the Producer API enables an application to publish a stream of records to one or more Kafka topics
2. Consumer API: the Consumer API enables an application to subscribe to one or more topics and process the stream of records produced to them
3. Streams API: the Streams API allows an application to act as a stream processor; that is, it consumes input streams and produces output streams
4. Connector API: the Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table

Kafka Fundamental Concepts

- Producer (1) – the producer is an application that publishes a stream of records to one or more Kafka topics
- Consumer (2) – the consumer is an application that consumes a stream of records from one or more topics and processes the published streams of records
- Consumer group (3) – consumers label themselves with a consumer group name. When a message is published to a topic, it is delivered to one consumer instance within each subscribing group
- Broker (4) – the broker is a server where the published stream of records is stored. A Kafka cluster can contain one or more servers
- Topics (5) – a topic is the name given to a feed of messages
- Zookeeper (6) – Kafka uses ZooKeeper to maintain and coordinate Kafka brokers. Kafka is bundled with a version of Apache ZooKeeper
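The sketches below are not from the course notes; they are minimal illustrations of the APIs above, written in Scala against Kafka's Java client. The broker address (localhost:9092) and the topic name (my-topic) are assumed placeholders. First, the Producer API: publish a single record to a topic.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Assumed broker address; replace with your cluster's bootstrap servers
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish one (key, value) record to the assumed topic "my-topic"
    producer.send(new ProducerRecord[String, String]("my-topic", "key-1", "hello kafka"))
    producer.close() // flushes pending records and releases resources
  }
}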
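Next, a matching Consumer API sketch. The group.id property is the consumer group name from the fundamental concepts above, and each record exposes the partition and offset it was read from. The same placeholder broker, topic, and group names are assumed, and Scala 2.13 is assumed for scala.jdk.CollectionConverters.

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "my-group")                // consumer group name (concept 3)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("my-topic"))

    // Repeatedly poll the broker for new records and process them
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala)
        println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
    }
  }
}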
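Finally, a hedged Streams API sketch: a stream processor that consumes input-topic, upper-cases every value, and produces the result to output-topic. The topic names and application id are illustrative assumptions.

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object StreamsSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")     // assumed application id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    // Consume the input stream, transform each value, and produce the output stream
    builder.stream[String, String]("input-topic")
      .mapValues(new ValueMapper[String, String] { def apply(v: String): String = v.toUpperCase })
      .to("output-topic")

    new KafkaStreams(builder.build(), props).start()
  }
}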
Kafka architecture

- The producer application publishes messages to one or more topics. The messages are stored in the Kafka broker.
- The consumer application consumes messages from the broker and processes them.

- Kafka Topics
a. We now discuss the core abstraction of Kafka. In Kafka, topics are always multi-subscriber entities.
b. A topic can have zero, one, or more consumers.
c. For each topic, a Kafka cluster maintains a partitioned log.
d. The topics are split into multiple partitions. Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log.
e. The records in the partitions are uniquely identified by sequential numbers called offsets.
f. The Kafka cluster persists all published records for a configurable retention period, whether they have been consumed or not.
g. For example, if the retention period is set to two days, the records will be available for two days. After that, they will be discarded to free up space.
h. The partitions of the log are distributed across the servers in the Kafka cluster, and each partition is replicated across a configurable number of servers to achieve fault tolerance (see the sketch after this list).
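As an illustration of items d through h, here is a hedged sketch that creates a topic with three partitions, a replication factor of two, and a two-day retention period via Kafka's AdminClient. The broker address, topic name, and counts are assumptions; a replication factor of two requires at least two brokers.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

object CreateTopicSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address

    val admin = AdminClient.create(props)

    // Topic split into 3 partitions, each replicated on 2 brokers;
    // records retained for 2 days (172800000 ms), consumed or not
    val topic = new NewTopic("my-topic", 3, 2.toShort)
      .configs(Collections.singletonMap("retention.ms", "172800000"))

    admin.createTopics(Collections.singletonList(topic)).all().get() // wait for completion
    admin.close()
  }
}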
Setting up the Kafka cluster

Spark streaming and Kafka integration

Spark structured streaming and Kafka integration
