In the “big data” world, the terms Spark, Hadoop, and Kafka should sound familiar.
However, with numerous big data solutions available, it may be unclear exactly what they
are, how they differ, and when each is the better fit. Below is an overview of the kinds of
applications, such as machine learning, distributed streaming, and data storage, that you
can expect to make effective and efficient by using Hadoop, Spark, and Kafka.
What is Hadoop?
Hadoop is open-source software that stores massive amounts of data across large
numbers of commodity-grade computers, tackling tasks that are too large for a single
computer to process on its own. Hadoop can be used to write software that stores data or
runs computations across hundreds or thousands of machines without needing to know
the details of each machine’s capabilities or how the machines communicate. Failures are
considered a fact of life in this environment, and Hadoop is designed to handle them
within the framework itself, which significantly reduces the amount of error handling
necessary within your solution. At its most basic level of functionality, Hadoop includes
Hadoop Common, the shared libraries used by its modules; the file system HDFS (Hadoop
Distributed File System); YARN (Yet Another Resource Negotiator); and its implementation
of MapReduce. You’ll hear people refer to Hadoop in different ways: the name also covers
an entire set of solutions called the Hadoop Ecosystem, which includes the basic Hadoop
functionality as well as any additional modules, generally referred to as submodules, that
are plugged into the system. Two prominent Hadoop Ecosystem projects are Spark and Kafka.
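The MapReduce model at Hadoop’s core can be sketched in a few lines. The toy word count below runs both phases in a single process; on a real cluster the map and reduce phases would run as separate distributed tasks (for example via Hadoop Streaming), but the shape of the computation is the same:

```python
# Toy word count in the MapReduce style Hadoop implements.
# On a real cluster the two phases run as separate mapper and reducer
# tasks across many machines; here they run in-process for illustration.
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each distinct key (word)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(lines))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because the mapper emits independent key-value pairs and the reducer only needs all pairs for a given key, both phases parallelize naturally, which is exactly what lets Hadoop spread the work across commodity machines.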
What is Spark?
Apache Spark is a solution that aims to improve on Hadoop’s MapReduce implementation
by making it both easier to use and faster. One of the applications that initially motivated
the development of Spark is the training of machine learning systems. As Spark became
more popular, it added specialized modules like MLlib, Spark SQL, Spark Streaming, and
GraphX. Spark SQL supports structured data, Spark Streaming performs streaming
analytics, and MLlib is the machine learning framework. GraphX supports graph
processing for workloads that don’t need to be updated or maintained in a graph database.
Spark, like Hadoop, is also fault tolerant; its failure tolerance takes the form of the RDD
(Resilient Distributed Dataset), which records the lineage of transformations that produced
it so lost data can be recomputed rather than restored from replicas.
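The lineage idea behind the RDD can be shown with a toy class. This is plain Python, not the Spark API: each dataset remembers its parent and the function that derived it, so a lost partition can be rebuilt by re-running that function instead of being restored from a replica:

```python
# Toy sketch of RDD-style lineage: each dataset records how it was
# derived, so a lost partition can be recomputed from its parent.
# Plain Python for illustration only, not the actual Spark API.
class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions      # list of lists of records
        self.lineage = lineage            # (parent, function) or None

    def map(self, fn):
        # Compute the child, and also record the parent plus the
        # function needed to rebuild any of its partitions later.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, lineage=(self, fn))

    def recompute_partition(self, i):
        # Simulate recovery: rebuild partition i from the parent's data.
        parent, fn = self.lineage
        return [fn(x) for x in parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
squared.partitions[1] = None                          # simulate a lost partition
squared.partitions[1] = squared.recompute_partition(1)
print(squared.partitions)  # [[1, 4], [9, 16]]
```

Real RDDs add laziness, partition placement, and long lineage chains, but the recovery principle is the one sketched here.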
What is Kafka?
Another major project in the Hadoop Ecosystem is Apache Kafka. It takes information
from many different sources, called producers, and organizes it all into a format that’s
easier for a stream processing system like Spark to manage. The information is then made
available to the receiving process, called a consumer, in a way that allows the process to
browse messages by topic within the Kafka cluster. When combined, Hadoop, Spark, and
Kafka become a solid foundation for a machine learning system. It can take in a significant
amount of data from many different producers quickly, process it efficiently — even when
the operations are iterative — and then send it back out directly to the consumers.
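Kafka’s model, where producers append records to a named topic and each consumer reads from that topic at its own pace, can be sketched with an in-memory stand-in. The classes below are illustrative only, not a real Kafka client; a real deployment would use a client library against a running broker cluster:

```python
# In-memory stand-in for Kafka's topic model: producers append records
# to a named topic's log, and each consumer tracks its own read offset,
# so multiple consumers can read the same topic independently.
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log

    def produce(self, topic, message):
        self.topics[topic].append(message)

class ToyConsumer:
    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        """Return messages past this consumer's offset, then advance it."""
        log = self.broker.topics[self.topic]
        batch = log[self.offset:]
        self.offset = len(log)
        return batch

broker = ToyBroker()
broker.produce("clicks", "page=/home")
broker.produce("clicks", "page=/docs")

consumer = ToyConsumer(broker, "clicks")
print(consumer.poll())   # ['page=/home', 'page=/docs']
broker.produce("clicks", "page=/pricing")
print(consumer.poll())   # ['page=/pricing']
```

The key design point the sketch preserves is that the broker only appends; consumption state lives with the consumer, which is what lets a stream processor like Spark read at its own rate without slowing the producers down.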
1 of 4 9/30/2019, 6:22 AM
Hadoop vs Spark vs Kafka - Comparing Big Data & Distributed Streami... https://logdna.com/hadoop-spark-kafka/
Overbuilding
There are many options when it comes to transitioning to Hadoop, and adding one thing
at a time will keep you from implementing something that isn’t an ideal choice for your
application. It’s tempting when you’re working with open source to keep adding new
things. Try to hold back that impulse and initially implement only what you need to
satisfy your base requirements; it’s substantially easier to add components to Hadoop
than it is to take them out. This philosophy is especially relevant if you’re transitioning
over an existing business architecture that’s still in active use by customers.
Choosing the parts of the system that benefit most from the transition and testing them
one at a time will give you a significantly higher chance of success.
Virtualizing Hadoop
Virtualization can add value for your master nodes, but virtualizing your data nodes
doesn’t make much sense. The implementation will still work if you do, but it’s better to
put a single large data node on the server than to virtualize the server and break it up
into several smaller ones. Hadoop can handle the large data volume, and it’s best to let
the software decide how to distribute your data efficiently across the hardware.
Adding Spark
If you’ve decided to skip Hadoop and run Spark as its own cluster, then you’ll start here.
If you did stand up Hadoop, you might be wondering whether to add Spark to your
Hadoop setup. Spark uses more resources for processing, but the overall performance is
better. If you’re working on a machine learning application, give Spark a try with MLlib,
its library built specifically for iterative machine learning applications.
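The reason iterative workloads favor Spark is that each training pass re-reads the same dataset, an access pattern Spark can serve from memory instead of re-reading from disk the way a chained MapReduce pipeline would. The loop below is a plain-Python sketch of that pattern, a one-parameter gradient descent, and is not actual MLlib code:

```python
# One-parameter linear regression by gradient descent: the same dataset
# is scanned once per iteration, which is the access pattern that makes
# in-memory caching (as Spark provides) pay off. Plain Python, not MLlib.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs with y = 2x

w = 0.0                                        # model parameter
lr = 0.05                                      # learning rate
for _ in range(200):                           # each pass re-reads `data`
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Two hundred full scans of a small list are free here; two hundred full scans of a multi-terabyte dataset are only practical if the data stays resident between passes, which is what Spark’s caching gives an iterative trainer.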
Observability
The way Kafka utilizes a commit log has another excellent benefit: observability. You can
implement a monitoring feature that subscribes to the data coming in from your
producers and going out to your consumers. Real-time access to these two parts of the
system gives you three main benefits over a similar, more traditional approach: you can
see everything passing through the system, filter the information down as needed, and
even catch up on messages if you fall behind.
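Because every message is retained in the commit log, a monitor needs nothing more than offset arithmetic to do all three of these things. A plain-Python sketch of the idea, again an in-memory stand-in rather than a Kafka client:

```python
# Sketch of commit-log observability: because messages are retained in
# an append-only log, a monitor can attach late, filter what it sees,
# and replay from an earlier offset to catch up. Not a real Kafka client.
log = []                       # append-only commit log for one topic

def produce(message):
    log.append(message)

def read_from(offset, predicate=lambda m: True):
    """Read every retained message at or after `offset` that matches the filter."""
    return [m for m in log[offset:] if predicate(m)]

produce({"level": "info",  "msg": "user login"})
produce({"level": "error", "msg": "timeout talking to db"})
produce({"level": "info",  "msg": "user logout"})

# A monitor that attaches late can still catch up from offset 0 ...
print(len(read_from(0)))                                  # 3
# ... or filter the stream down to just what it cares about.
errors = read_from(0, predicate=lambda m: m["level"] == "error")
print(errors[0]["msg"])                                   # timeout talking to db
```

In a real Kafka deployment the same effect comes from the broker’s retention window and per-consumer offsets: the monitoring consumer is just one more consumer group, so adding it never disturbs the producers or the primary consumers.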
The combination of Hadoop, Spark, and Kafka is an excellent solution for heavy-duty
applications with large amounts of data where reliability is critical. Large-scale big data
and machine learning applications are prime examples of how this setup delivers
excellent overall performance. For smaller applications, or if you’re not sure whether this
combination is right for you, start small with just Hadoop or Spark. You can add Kafka
later, and you can still plug Spark into Hadoop if you decide you need the additional
performance.