
Hadoop vs Spark vs Kafka - Comparing Big Data & Distributed Streaming Tools

In the “big data” world, the terms Spark, Hadoop, and Kafka should sound familiar.
With so many big data solutions available, however, it can be unclear exactly what
each one is, how they differ, and which is the better fit. Below is a comprehensive
look at the kinds of applications, such as machine learning, distributed streaming,
and data storage, that you can expect to make effective and efficient with Hadoop,
Spark, and Kafka.

What is Hadoop?
Hadoop is open-source software that stores massive amounts of data across large
numbers of commodity-grade computers to tackle tasks that are too large for a
single computer to process on its own. With Hadoop you can write software that
stores data or runs computations across hundreds or thousands of machines without
needing to know the details of each machine's capabilities or how the machines
communicate. Failures are considered a fact of life in this environment, and Hadoop
is designed to handle them within the framework itself, which significantly reduces
the amount of error handling necessary within your solution. At its most basic level
of functionality, Hadoop includes the standard libraries shared between its modules,
the HDFS (Hadoop Distributed File System) file system, YARN (Yet Another Resource
Negotiator), and its implementation of MapReduce.

You'll hear people refer to Hadoop in different ways. Hadoop is also the name of an
entire set of solutions called the Hadoop Ecosystem. This ecosystem includes the
basic Hadoop functionality as well as any additional modules, generally referred to
as submodules, that are plugged into the system. Two prominent Hadoop Ecosystem
projects are Spark and Kafka.
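To make the MapReduce model concrete, here is a minimal sketch of a word-count job
written for Hadoop Streaming, which lets Hadoop run ordinary scripts as the map and
reduce steps. The word-count task and the Python implementation are illustrative
assumptions, not part of Hadoop itself.

```python
# wordcount.py - a minimal Hadoop Streaming sketch (illustrative only).
# Hadoop runs the mapper over input splits across the cluster, then
# shuffles and sorts the emitted pairs so each reducer sees one word's
# pairs grouped together.
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Sum counts for each word in the sorted stream on stdin.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Hadoop would invoke the script once as the map step and once as the reduce step,
typically via the hadoop-streaming jar, without the script needing to know anything
about the machines involved.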

What is Spark?
Apache Spark is a solution that aims to improve on Hadoop's MapReduce implementation
by making it both easier to use and faster. One of the applications that initially
motivated the development of Spark was the training loop of machine learning systems.
As Spark grew more popular, it added specialized modules: Spark SQL supports
structured data, Spark Streaming performs streaming analytics, MLlib is the machine
learning framework, and GraphX supports graph applications whose graphs don't need
to be updated or maintained in a database. Like Hadoop, Spark is fault tolerant; its
failure tolerance comes from its RDD (Resilient Distributed Dataset) abstraction.
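As a rough illustration of the RDD abstraction, here is a minimal pyspark sketch;
the log path and the filtering logic are assumptions for the example. The chain of
transformations is recorded as lineage, which is what lets Spark rebuild a lost
partition after a failure rather than relying on replication alone.

```python
# A minimal RDD sketch, assuming pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path; transformations are lazy and only the
# lineage (textFile -> filter) is recorded until an action runs.
lines = sc.textFile("hdfs:///data/logs")
errors = lines.filter(lambda line: "error" in line.lower())

# count() is an action: it triggers distributed execution.
print(errors.count())
spark.stop()
```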

What is Kafka?
Another major project in the Hadoop Ecosystem is Apache Kafka. It takes information
from many different sources, called producers, and organizes it all into a format
that's easier for a stream processing system like Spark to manage. The information
is then made available to the receiving processes, called consumers, in a way that
allows them to browse messages by topic within the Kafka cluster. Combined, Hadoop,
Spark, and Kafka form a solid foundation for a machine learning system: it can take
in a significant amount of data from many different producers quickly, process it
efficiently, even when the operations are iterative, and then send the results back
out directly to the consumers.

See also: Challenges Logging Kafka at Scale
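A minimal sketch of the producer/consumer model using the kafka-python client; the
topic name, broker address, and message contents are assumptions for illustration.

```python
# Producer/consumer sketch with kafka-python (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# A producer publishes messages to a named topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"reading=42")  # hypothetical topic and payload
producer.flush()

# A consumer subscribes to the topic and browses its messages;
# "earliest" starts from the beginning of the retained log.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.offset, message.value)
```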


Is Hadoop Required for Spark?


The short answer is no. Spark has a standalone mode that doesn't require any other
software, although it is often easier to deploy and manage Spark using Hadoop's YARN
framework, with which it coordinates well. There is even an option to run a "local
mode," typically used only for testing and development, which gives you the
flexibility to operate Spark on a single machine. In local mode, Spark runs in a
single process with one worker thread per CPU core, which lets it scale up to as
many cores as your machine can spare for Spark's use.
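A minimal sketch of local mode, assuming pyspark is installed; "local[*]" asks for
one worker thread per available core, while "local[4]" would cap it at four.

```python
# Running Spark in local mode on a single machine, no cluster needed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")           # one thread per available core
         .appName("local-mode-sketch")
         .getOrCreate())

# A quick sanity check: count a small in-memory dataset.
print(spark.range(1000).count())
spark.stop()
```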

Is Spark Faster than Hadoop?


As mentioned previously, Spark was initially created to resolve limitations in the
MapReduce approach to big data problems, and you may see your latency significantly
reduced when you use Spark over Hadoop's existing MapReduce approach. In a paper out
of the University of California, Berkeley, researchers showed that Spark can
outperform Hadoop by 10x in iterative machine learning workloads and can be used
interactively to scan a 39 GB dataset with sub-second latency. If you're targeting
machine learning applications, it's worth adding a Spark cluster to your Hadoop
solution; you're likely to see significant benefits.

Getting Started with Hadoop


When you initially stand up Hadoop, you'll need to review some best practices and
make some decisions about how you'll want to use it. Starting small and building
multiple environments are good basic advice whenever you're working with something
new; in this case, it's essential. Large Hadoop clusters are notorious for being
difficult to use and administer, and there are a few options on the market that can
make deploying Hadoop much easier. Be sure to avoid these three traps that are easy
to fall into when you're first getting your feet wet:

1. Poor Data Organization
2. Overbuilding
3. Virtualizing Data Nodes

Poor Data Organization


It's easy to dump your data into Hadoop as one massive pile and decide you'll deal
with the structure of all that data later. The reality is that once the data is in
there and you keep adding more, imposing organization after the fact is impractical.
Take some time in advance to consider how you want your data organized; it'll save
you a lot of time later.
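One concrete way to impose that organization up front is to partition data as you
write it. Here is a minimal pyspark sketch, assuming a hypothetical events dataset
with an event_date column; the paths are illustrative.

```python
# Writing partitioned data instead of one undifferentiated pile.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-sketch").getOrCreate()

# Hypothetical landing area of raw JSON events with an "event_date" field.
events = spark.read.json("hdfs:///landing/events")

# Produces directories like /warehouse/events/event_date=2019-09-30/,
# which later jobs can prune by date instead of scanning everything.
(events.write
    .partitionBy("event_date")
    .parquet("hdfs:///warehouse/events"))
```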

Overbuilding
There are many options when it comes to transitioning to Hadoop, and when you're
working with open source it's tempting to keep adding new things. Try to hold back
that impulse and initially implement only what you need to satisfy your base
requirements: adding one thing at a time keeps you from committing to something that
isn't an ideal choice for your application, and it's substantially easier to add
components to Hadoop than it is to take them out. This philosophy is especially
relevant if you're transitioning over an existing business architecture that's still
in active use by customers. Choosing the parts of the system that benefit most from
the transition and testing them one at a time will give you a significantly higher
chance of success.

Virtualizing Data Nodes
Virtualization can add value for your master nodes, but virtualizing your data nodes
doesn't make much sense. You can do it if you want, and the implementation will still
work, but it's better to put a single large data node on the server than to
virtualize the server and break it up into several smaller ones. Hadoop can handle
the large data volume, and it's best to let the software decide how to distribute
your data efficiently across the hardware.

Adding Spark
If you've decided to skip Hadoop and implement Spark as its own cluster, you'll start
here. If you did stand up Hadoop, you might be wondering whether to add Spark to your
Hadoop setup. Spark uses more resources for processing, but the overall performance
is better. If you're working on a machine learning application, give Spark a try with
its MLlib library, which is built specifically for iterative machine learning
applications.
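A minimal sketch of an iterative MLlib training job, assuming pyspark and a
hypothetical Parquet dataset with "label" and "features" columns.

```python
# Training an iterative model with Spark's MLlib (pyspark.ml).
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training set: a "label" column (double) and a
# "features" column (Vector), as pyspark.ml expects.
training = spark.read.parquet("hdfs:///data/training")

# Each of maxIter's passes revisits the cached dataset; keeping it in
# memory across iterations is where Spark beats disk-based MapReduce.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)
spark.stop()
```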

Organizing Data Input with Kafka


Now that you know what Spark and Hadoop are, it's time to look at Kafka. Machine
learning applications frequently have many producers of data whose output needs to
be organized so it can be processed efficiently. One option is to use Kafka to
capture your data streams, format them, and record them in HDFS (Hadoop Distributed
File System). Once everything is stored, it can be processed in batches, either in
Hadoop with MapReduce or with Spark, in scenarios like machine learning applications.
The commit log approach allows the rest of your system to subscribe to and receive
data as broadly as possible on a continuous and timely basis. Once you have
everything connected and working correctly, your producers will rapidly write into
the Kafka cluster, Spark will operate on the data, and the results will be written
back to Kafka, where consumers can subscribe to receive them in real time.
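A minimal sketch of that loop using Spark Structured Streaming's built-in Kafka
source and sink; it assumes the spark-sql-kafka package is on the classpath, and the
broker address and topic names are hypothetical.

```python
# Reading from and writing back to Kafka with Structured Streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()

# Subscribe to the producers' topic; Kafka delivers each record as
# key/value byte arrays plus topic/partition/offset metadata.
readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "sensor-readings")
    .load())

# Any processing would go here; this sketch just casts the payload.
results = readings.selectExpr("CAST(value AS STRING) AS value")

# Write results back to Kafka for downstream consumers.
query = (results.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "processed-readings")
    .option("checkpointLocation", "/tmp/kafka-sketch-checkpoint")
    .start())
query.awaitTermination()
```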

Observability
The way Kafka uses a commit log has another excellent benefit: observability. You
can implement a monitoring feature that subscribes to the data coming in from your
producers and going out to your consumers. Real-time access to these two parts of
the system lets you see everything and filter the information down as needed. You'll
even be able to catch up on messages if you fall behind. The advantages over a
similar, more traditional approach to this problem come down to three main benefits:

1. No message broker is needed
2. Message order is maintained, even with parallel consumers operating at the
   same time
3. You'll still be able to read your old messages, as the sketch below shows
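A minimal sketch of such a monitoring consumer with kafka-python; the topic and
broker address are assumptions. Because the commit log is retained, the consumer can
rewind and replay messages it missed.

```python
# A monitoring consumer that replays the retained commit log.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
consumer.subscribe(["processed-readings"])  # hypothetical topic

# Poll once to get partitions assigned, then rewind to the start of
# the retained log so old messages can be read too.
consumer.poll(timeout_ms=1000)
consumer.seek_to_beginning()

for message in consumer:
    print(message.offset, message.value)
```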

See Also: The Future of DevOps Observability

Keep Scale in Mind


The combination of Hadoop, Spark, and Kafka is an excellent solution for heavy-duty
applications where there is a large amount of data and reliability is critical.
Large-scale big data and machine learning applications are prime examples of how
this setup delivers excellent overall performance. For smaller applications, or if
you're not sure whether this combination is right for you, start small with just
Hadoop or Spark. You can add Kafka later, and you can still plug Spark into Hadoop
if you decide you need the additional performance.

Choosing the Best Cluster Hosting Services


Once you have a thorough understanding of Hadoop, Spark, and Kafka, it's a good idea
to look around and see whether other parts of the Hadoop Ecosystem might benefit
your application. As an open source project, Hadoop provides the flexibility to
experiment with different submodules and to interface with other solutions at
minimal software cost. Establishing that your selection of software is the best set
for your application will save you a significant amount of pain and rework in the
long run. There are a few options for cluster hosting services, so it's best to shop
around, and it's worth checking with the rest of your organization to see if there
are any special rate options or credits you can use to experiment. Spark itself can
initially be run on a standalone machine; as you scale up, you'll want to give it
its own cluster and let it stretch its legs to show you what it can really do in a
closer-to-live environment. In addition, if you're considering standing up a live
system soon, you may be able to get some free development time, assuming you're
willing to commit to using the same service when you go live.

Hadoop vs Spark vs Kafka – Things to Consider


Consider how you'll implement security features before you make a final push to
build a fully functional system. Designing security-friendly applications begins
early in the design and development process. For example, if you wait until the end
to think about how you want to provide security, you may discover design choices
that significantly limit what the system can accomplish securely. User
authentication can be achieved in multiple ways, and the security profile differs
across the various options. The sensitivity of the data may also push you to store
it differently; local versus cloud data storage, for instance, is a critical design
choice that will drive how you handle the security of your data. The Hadoop
Ecosystem is a continuously evolving space with new ideas coming in all the time, so
keep an eye out for updates and information related to Hadoop and its related
projects. If you can't find a feature you need right now, someone might already be
working on it; if not, maybe you can start an open source project of your own. One
of the best ways to give back for your use of open source projects like Hadoop is
to contribute ideas and implement useful features in the code base for others to
use.

Log Data Becomes Big Data as You Scale


If you're looking into Hadoop, Spark, or Kafka because your logs are becoming
unmanageable and you need a solution to manage, scale, and search them, LogDNA can
help you with that. The largest data set in any organization easily becomes the log
data from all of your IT and application infrastructure, used by developers,
operations, security, business, and product teams. However, the infrastructure to
collect, store, parse, search, and alert on logs is not trivial to create. LogDNA is
a modern, multi-cloud, centralized logging solution with a per-GB pricing plan that
grows as you grow.
