
Here are some pros and cons of Apache Spark:

Pros:
1. Provides a functional-programming-style view that gives the programmer finer
control. Even a single mapper/reducer stage can easily be split into multiple
stages, with intermediate results cached.
2. Better suited to iterative computations: steps can be chained, and data can
be kept in memory between stages as required.
3. Shared variables (broadcast variables and accumulators) across workers
appear to be a better fit than Hadoop's distributed cache.

Cons:
1. The core is developed in Scala. Although Java APIs are provided, Spark has
not yet been used extensively with Java in industry. It appears stable with
Java, but at times it is hard to interpret Scala-related exceptions.

Here are some useful links, which also cover the differences between Hadoop and
Spark.

http://stackoverflow.com/questions/25267204/hadoop-vs-spark
http://datascience.stackexchange.com/questions/441/what-are-the-use-cases-for-apache-spark-vs-hadoop
http://www.researchgate.net/post/What_is_the_differences_between_SPARK_and_Hadoop_MapReduce2
http://www.devx.com/opensource/getting-started-with-apache-spark.html
https://databricks.com/spark
http://java.dzone.com/articles/apache-spark-next-big-data
http://stackoverflow.com/questions/24119897/apache-spark-vs-apache-storm
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
http://spark.apache.org/docs/latest/programming-guide.html
https://spark.apache.org/examples.html

I don't have a comparison document, as the technology selection was done by
DLA, but here are the main differences per my understanding:

1. In Hadoop, everything (intermediate files, etc.) gets written to disk,
whereas Spark is an in-memory computing framework.
2. Spark supports both batch and stream processing; stream processing would
be difficult in Hadoop.
3. Spark is easier to install and configure than Hadoop.
4. Datastax provides drivers for Cassandra that can also do server-side
filtering before passing data to Spark for processing.
5. Spark has language support for Scala, Java, and Python.

Some good links -

http://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83#.VG78S_mUfl8
http://www.qubole.com/spark-vs-mapreduce/
http://planetcassandra.org/blog/the-new-analytics-toolbox-with-apache-spark-going-beyond-hadoop/
