
groupByKey VS reduceByKey

scala> var data = List("spark","scala","spark","spark","spark","scala","java","scala")
data: List[String] = List(spark, scala, spark, spark, spark, scala, java, scala)

scala> val mapData = sc.parallelize(data).map(x => (x,1))
mapData: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:26

reduceByKey
scala> mapData.reduceByKey(_+_).collect.foreach(println)
(spark,4)
(scala,3)
(java,1)

groupByKey
scala> mapData.groupByKey().map(x => (x._1, x._2.sum)).collect.foreach(println)
(spark,4)
(scala,3)
(java,1)

In the above two transformations (reduceByKey, groupByKey) we are getting the same
output... however

Avoid "groupByKey" wherever possible... the reason being:

• reduceByKey works faster on larger datasets because Spark knows it can combine
output with a common key on each partition before shuffling the data.
• On the other hand, when calling groupByKey all the key-value pairs are shuffled
around. This is a lot of unnecessary data being transferred over the network.
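
As a rough sketch of the same point (assuming the mapData RDD from the session above), the word count can also be written with aggregateByKey, which, like reduceByKey, combines values for a common key on each partition before the shuffle:

scala> // 0 is the zero value; the first _+_ adds counts within a partition,
scala> // the second _+_ merges the per-partition sums across partitions
scala> mapData.aggregateByKey(0)(_ + _, _ + _).collect.foreach(println)

This prints the same (spark,4), (scala,3), (java,1) pairs as the reduceByKey example, while still avoiding a full shuffle of every individual (word, 1) pair.
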
Partitions on RDD
scala> var data = sc.parallelize(List(1,2,3,4,5,6,7,7,8.9,12,34,5,4,76,90,87,87,65,36),4)
data: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> data.partitions.length
res3: Int = 4

scala> data.glom().collect
res4: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0, 4.0), Array(5.0,
6.0, 7.0, 7.0, 8.9), Array(12.0, 34.0, 5.0, 4.0, 76.0), Array(90.0, 87.0, 87.0,
65.0, 36.0))

scala> var data = sc.parallelize(List(1,2,3,4,5,6,7,7,8.9,12,34,5,4,76,90,87,87,65,36),3)
data: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> data.partitions.length
res5: Int = 3

scala> data.glom().collect
res6: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
Array(7.0, 7.0, 8.9, 12.0, 34.0, 5.0), Array(4.0, 76.0, 90.0, 87.0, 87.0, 65.0,
36.0))
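
As a minimal sketch (assuming the 3-partition data RDD above), mapPartitionsWithIndex shows the same split as glom but tags each element with the index of the partition that holds it:

scala> // pair every element with its partition index: (partitionIndex, value)
scala> data.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x))).collect.foreach(println)

If a different layout is needed, data.repartition(n) reshuffles the data into n partitions, while coalesce(n) reduces the partition count without a full shuffle.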

