
groupByKey VS reduceByKey

scala> var data = List("spark","scala","spark","spark","spark","scala","java","scala")
data: List[String] = List(spark, scala, spark, spark, spark, scala, java, scala)

scala> val mapData = sc.parallelize(data).map(x => (x,1))
mapData: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:26

reduceByKey
scala> mapData.reduceByKey(_+_).collect.foreach(println)
(spark,4)
(scala,3)
(java,1)

groupByKey
scala> mapData.groupByKey().map(x => (x._1, x._2.sum)).collect.foreach(println)
(spark,4)
(scala,3)
(java,1)

In the above two transformations (reduceByKey, groupByKey) we are getting the same
output... however

Avoid "groupByKey" wherever possible... the reason being:

• reduceByKey works faster on larger datasets because Spark knows it can combine
output with a common key on each partition before shuffling the data.
• On the other hand, when calling groupByKey all the key-value pairs are shuffled
around. This is a lot of unnecessary data being transferred over the network.
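
As a rough sketch of the same point (assuming the mapData RDD from the session above), the word count can also be written with aggregateByKey, which, like reduceByKey, combines values for a common key on each partition before the shuffle:

scala> // 0 is the zero value; the first _+_ adds counts within a partition,
scala> // the second _+_ merges the per-partition sums across partitions
scala> mapData.aggregateByKey(0)(_ + _, _ + _).collect.foreach(println)

This prints the same (spark,4), (scala,3), (java,1) pairs as the reduceByKey example, while still avoiding a full shuffle of every individual (word, 1) pair.
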
Partitions on RDD
scala> var data = sc.parallelize(List(1,2,3,4,5,6,7,7,8.9,12,34,5,4,76,90,87,87,65,36),4)
data: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> data.partitions.length
res3: Int = 4

scala> data.glom().collect
res4: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0, 4.0), Array(5.0,
6.0, 7.0, 7.0, 8.9), Array(12.0, 34.0, 5.0, 4.0, 76.0), Array(90.0, 87.0, 87.0,
65.0, 36.0))

scala> var data = sc.parallelize(List(1,2,3,4,5,6,7,7,8.9,12,34,5,4,76,90,87,87,65,36),3)
data: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> data.partitions.length
res5: Int = 3

scala> data.glom().collect
res6: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
Array(7.0, 7.0, 8.9, 12.0, 34.0, 5.0), Array(4.0, 76.0, 90.0, 87.0, 87.0, 65.0,
36.0))
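
As a minimal sketch (assuming the 3-partition data RDD above), mapPartitionsWithIndex shows the same split as glom but tags each element with the index of the partition that holds it:

scala> // pair every element with its partition index: (partitionIndex, value)
scala> data.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x))).collect.foreach(println)

If a different layout is needed, data.repartition(n) reshuffles the data into n partitions, while coalesce(n) reduces the partition count without a full shuffle.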

