
Streamsets - Kafka

February 11, 2020


Streamsets

• https://streamsets.com/
– Data collector. Open source. https://streamsets.com/products/dataops-platform/open-source/
– Control Hub. https://streamsets.com/try-dataops/control-hub-trial/
▪ Jobs
▪ Repository
▪ Scheduler
▪ Security
▪ Topologies

• Similar tools
– Apache NiFi
– Apache Storm
– Apache Flink
– Apache Spark
– Amazon Kinesis, Google Dataflow …

Internal Use - Confidential © Copyright 2020 Dell Inc.


Streamsets

• StreamSets pipelines have 4 stage types:
– Origins: they read data from external sources. A pipeline can have only one origin.
– Processors: they transform data.
– Destinations: they write data to external systems or files.
– Executors: they trigger tasks when they receive events generated by other stages.

• Concept: Offsets
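
To make the stage roles and the offset concept above concrete, here is a minimal sketch in plain Python. It does not use the StreamSets API: the `Pipeline` class, its fields, and the `no-more-data` event name are hypothetical names chosen for illustration. The idea it shows is that the origin remembers an offset (the next position to read), so a restarted pipeline can continue where it stopped.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Pipeline:
    """Toy pipeline: one origin, several processors, one destination."""
    source: List[dict]                                   # stand-in for an external system
    processors: List[Callable[[dict], Optional[dict]]]   # each returns a record or None
    destination: List[dict] = field(default_factory=list)
    events: List[str] = field(default_factory=list)      # what an executor would react to
    offset: int = 0                                      # next source position to read

    def run_batch(self, batch_size: int) -> int:
        """Process one batch starting at the stored offset; return records written."""
        batch = self.source[self.offset:self.offset + batch_size]
        written = 0
        for record in batch:
            current: Optional[dict] = record
            for process in self.processors:
                current = process(current)
                if current is None:          # a processor dropped the record
                    break
            if current is not None:
                self.destination.append(current)
                written += 1
        # Advance the offset only after the whole batch has been handled.
        self.offset += len(batch)
        if batch and self.offset >= len(self.source):
            self.events.append("no-more-data")
        return written


def drop_empty(record: dict) -> Optional[dict]:
    return record if record.get("name") else None


def upper_name(record: dict) -> dict:
    return {**record, "name": record["name"].upper()}


pipeline = Pipeline(
    source=[{"name": "a"}, {"name": ""}, {"name": "c"}],
    processors=[drop_empty, upper_name],
)
pipeline.run_batch(2)   # reads positions 0 and 1
pipeline.run_batch(2)   # resumes at position 2
print(pipeline.offset, pipeline.destination, pipeline.events)
```

In this sketch the offset lives in memory. A real origin would persist it so that it survives a restart.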



Kafka

• Apache Kafka is an open-source stream-processing software platform developed
by LinkedIn and donated to the Apache Software Foundation, written in Scala and
Java. The project aims to provide a unified, high-throughput, low-latency platform
for handling real-time data feeds. (Source: Wikipedia)
• It is a messaging system plus additional components (Kafka Connect, Kafka Streams, KSQL)
• https://docs.cloudera.com/documentation/enterprise/latest/topics/kafka_tour.html
• Similar tools (Messaging, Middleware and ESB)
– Messaging: ActiveMQ, RabbitMQ, MQ Series
– Amazon Kinesis, Apache Pulsar
– ESB: MuleSoft, Dell Boomi



Kafka
• Developed/hosted by: LinkedIn
• Software: Open source
• SDK support: Kafka SDK supports Java
• Data stored in: Kafka partitions
• Reliability: Replication factor can be configured
• Performance: The fastest
• Configuration store: Apache ZooKeeper
• Setup: Weeks



Kafka
• Data retention: Configurable
• Log compaction: Supported
• Event processing: Thousands of events per second or more
• Checkpointing: Offsets
• Ordering: Partition level
• Operational costs: Requires human support for installing and managing clusters,
and for meeting requirements such as high availability, durability, and recovery
• Each record in a partition has a unique sequential number called its offset
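
The points about offsets, partition-level ordering, and checkpointing can be illustrated with a small in-memory model. This is a sketch in plain Python, not the Kafka client API: `Topic` and `ConsumerGroup` are hypothetical classes, and CRC32 is only a deterministic stand-in for the hash used by Kafka's default partitioner (murmur2 in the Java client).

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


class Topic:
    """A topic is a set of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record; return (partition, offset)."""
        # Same key -> same partition, so records for one key stay in order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1   # the offset is the position in the log

    def read(self, partition: int, offset: int, max_records: int = 10) -> List[Record]:
        return self.partitions[partition][offset:offset + max_records]


class ConsumerGroup:
    """Tracks one committed offset per partition (the checkpoint)."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.committed: Dict[int, int] = defaultdict(int)

    def poll(self, partition: int, max_records: int = 10) -> List[Record]:
        start = self.committed[partition]
        records = self.topic.read(partition, start, max_records)
        self.committed[partition] = start + len(records)   # commit after reading
        return records


topic = Topic("orders", num_partitions=3)
placements = [topic.produce("customer-42", f"order-{i}") for i in range(3)]
partition = placements[0][0]

group = ConsumerGroup(topic)
first = group.poll(partition)    # all three records, in produce order
second = group.poll(partition)   # nothing new since the committed offset
print(placements, first, second)
```

Because each group keeps its own committed offsets, a second `ConsumerGroup` on the same topic would read the same records again from offset 0.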



Kafka

• Concepts:
– Offsets
– Consumer groups
– Partitions
– Replication factor
– Retention policy
– See commands: show topics and partitions, reset offsets, etc.
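
The commands mentioned above are not listed in these slides. As a hedged sketch, the Python helpers below build argument lists for the standard `kafka-topics.sh` and `kafka-consumer-groups.sh` tools without running them. The broker address, topic name, and group name are assumptions. The `--bootstrap-server` option on `kafka-topics.sh` applies to recent Kafka versions; older releases took a `--zookeeper` option instead.

```python
import shlex
from typing import List

BOOTSTRAP = "localhost:9092"  # assumption: a broker reachable at this address


def describe_topic(topic: str, bootstrap: str = BOOTSTRAP) -> List[str]:
    """Show partitions, leaders, and replicas for one topic."""
    return ["kafka-topics.sh", "--bootstrap-server", bootstrap,
            "--describe", "--topic", topic]


def describe_group(group: str, bootstrap: str = BOOTSTRAP) -> List[str]:
    """Show committed offsets and lag for one consumer group."""
    return ["kafka-consumer-groups.sh", "--bootstrap-server", bootstrap,
            "--describe", "--group", group]


def reset_offsets(group: str, topic: str, execute: bool = False,
                  bootstrap: str = BOOTSTRAP) -> List[str]:
    """Reset a group's offsets to the earliest record; dry run unless execute=True."""
    mode = "--execute" if execute else "--dry-run"
    return ["kafka-consumer-groups.sh", "--bootstrap-server", bootstrap,
            "--group", group, "--topic", topic,
            "--reset-offsets", "--to-earliest", mode]


# Hypothetical topic and group names, printed as shell-ready command lines.
for argv in (describe_topic("orders"),
             describe_group("billing"),
             reset_offsets("billing", "orders")):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Defaulting to `--dry-run` is a deliberate choice in this sketch: the reset is only previewed unless the caller asks for `--execute`.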



North Star architecture

• See https://www.confluent.io/blog/apache-kafka-vs-enterprise-service-bus-esb-friends-enemies-or-frenemies/



