
Experiment 07

Big Data Analysis

Harsh Suryanath Nag


6th April, 2024
Aim
Set up and install Apache Kafka and stream real-time data from a social media website such as
Twitter, Facebook, or Instagram.

Introduction
Kafka is an open-source event stream processing platform based on an abstraction of a
distributed commit log. The aim of Kafka is to provide a unified, high-throughput, low-latency
platform for managing these logs, which Kafka calls “topics”. Kafka combines three main
capabilities to support the entire process of implementing event streaming use cases:

1. Publish (write) and subscribe to (read) streams of events.
2. Store streams of events durably and reliably for as long as you want.
3. Process streams of events as they occur or retrospectively.
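These three capabilities map directly onto the command-line tools shipped with the standard Apache Kafka distribution. A minimal sketch, assuming a broker on `localhost:9092`; the topic name `demo-events` is a placeholder:

```shell
# Create a topic -- the durable, replicated log that stores events.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic demo-events --partitions 3 --replication-factor 1

# Publish (write) an event to the topic.
echo "hello kafka" | bin/kafka-console-producer.sh \
  --bootstrap-server localhost:9092 --topic demo-events

# Subscribe to (read) the stream, replaying it from the beginning.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic demo-events --from-beginning
```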

The founders of Confluent are the original developers of Apache Kafka, and they offer an
alternative, significantly more complete distribution of Kafka in the Confluent Platform. Many of
the additional features in Confluent’s distribution of Kafka are free to use, and are what Confluent
refers to as “community components”. For our example application, we will use one of these
community components, but rely on the standard Apache Kafka distribution for our underlying
Kafka cluster, as it is more than capable of supporting our system.

Advantages

Kafka has numerous advantages. Today, Kafka is used by over 80% of the Fortune 100 across
virtually every industry, for countless use cases big and small. It is the de facto technology
developers and architects use to build the newest generation of scalable, real-time data
streaming applications. While these capabilities can be achieved with a range of technologies on
the market, below are the main reasons Kafka is so popular.

1. High Throughput

Capable of handling high-velocity and high-volume data, Kafka can process millions of
messages per second.

2. High Scalability

Scale Kafka clusters up to a thousand brokers, trillions of messages per day, petabytes of
data, and hundreds of thousands of partitions. Elastically expand and contract storage and
processing.

3. Low Latency

Kafka can deliver these high volumes of messages using a cluster of machines with latencies
as low as 2 ms.

4. Permanent Storage

Safely and securely store streams of data in a distributed, durable, reliable, fault-tolerant
cluster.

5. High Availability

Extend clusters efficiently over availability zones or connect clusters across geographic
regions, making Kafka highly available and fault tolerant with no risk of data loss.

How Kafka Works

Apache Kafka consists of a storage layer and a compute layer that combines efficient, real-time
data ingestion, streaming data pipelines, and storage across distributed systems. In short, this
enables simplified data streaming between Kafka and external systems, so you can easily
manage real-time data and scale within any type of infrastructure.

Procedure
Start Cassandra network
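The screenshots for this step are not reproduced here; the step can be sketched with Docker. The network and container names (`kafka-net`, `cassandra`) are placeholders, not necessarily those used in the original setup:

```shell
# Create a shared Docker network so the Kafka, Kafka Connect, and
# Cassandra containers can resolve each other by name.
docker network create kafka-net

# Start a single Cassandra node attached to that network.
docker run -d --name cassandra --network kafka-net \
  -p 9042:9042 cassandra:4.1

# Check that the node reports itself as Up/Normal (UN).
docker exec cassandra nodetool status
```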

Start Kafka cluster
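A single-broker sketch of this step, again using Docker on the same network; the image tags and listener settings are assumptions for a local setup, not the exact configuration used in the experiment:

```shell
# Start ZooKeeper, which the broker uses for cluster coordination.
docker run -d --name zookeeper --network kafka-net \
  -e ZOOKEEPER_CLIENT_PORT=2181 \
  confluentinc/cp-zookeeper:7.4.0

# Start a single Kafka broker, advertised on localhost:9092 so that
# clients on the host machine can reach it.
docker run -d --name kafka --network kafka-net -p 9092:9092 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
  -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
  confluentinc/cp-kafka:7.4.0
```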

Kafka-Connect REST
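Kafka Connect exposes a REST API, by default on port 8083, which is how the connector in the later steps is registered and inspected. A quick sanity check against a local worker:

```shell
# Confirm the Connect worker is up and see its version info.
curl -s http://localhost:8083/

# List the connector plugins installed on the worker.
curl -s http://localhost:8083/connector-plugins

# List currently configured connectors (empty on a fresh worker).
curl -s http://localhost:8083/connectors
```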

Set up authentication using credentials from Docker (Kafka front end)

Adding a cluster

Confluent’s community Kafka-Cassandra connector is installed in a Kafka Connect instance
running in a container, which polls both the Twitter and OpenWeatherMap topics and saves the
data into Cassandra
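Registering such a sink connector is done through the Connect REST API. The sketch below is illustrative only: the connector class and its `connect.cassandra.*` properties follow the community Stream Reactor Cassandra sink, and the topic names, keyspace, and table mapping are placeholders rather than the exact values from the experiment:

```shell
# Register a Cassandra sink that drains the two topics into tables.
curl -s -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cassandra-sink",
    "config": {
      "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
      "topics": "twitter,weather",
      "connect.cassandra.contact.points": "cassandra",
      "connect.cassandra.port": "9042",
      "connect.cassandra.key.space": "streams",
      "connect.cassandra.kcql": "INSERT INTO tweets SELECT * FROM twitter; INSERT INTO weather_data SELECT * FROM weather"
    }
  }'
```

The KCQL string maps each Kafka topic to a Cassandra table; the keyspace and tables must already exist before the connector starts.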

Activating producers
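The report’s own producer code is not shown, so the following is only a hypothetical sketch of a weather producer: it polls the OpenWeatherMap current-weather endpoint and pipes each response into the `weather` topic. `OWM_API_KEY`, the city, the polling interval, and the topic name are all placeholders:

```shell
# Poll OpenWeatherMap once a minute and publish each JSON response
# to the "weather" topic via the console producer.
while true; do
  curl -s "https://api.openweathermap.org/data/2.5/weather?q=Mumbai&appid=${OWM_API_KEY}" |
    bin/kafka-console-producer.sh \
      --bootstrap-server localhost:9092 --topic weather
  sleep 60
done
```

A Twitter producer would follow the same pattern, writing each incoming tweet to the `twitter` topic.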

Conclusion
After successfully installing Apache Kafka, we leveraged its robust framework to stream real-time
data not only from Twitter but also from weather sources. This setup enabled efficient processing
and analysis of diverse data streams, enhancing our ability to derive insights and make informed
decisions.
