
SPARK TRANSFORMATIONS AND ACTIONS

Reading Data
Reading a CSV file:
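
A minimal sketch, assuming a Spark 2.x+ shell (where spark is the SparkSession) and the baby_names.csv file used in later examples:

scala> val df = spark.read.option("header", "true").csv("baby_names.csv")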

Displaying Data

a. Displaying a CSV file

b. Displaying a JSON file
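
A minimal sketch, reusing df from above; the JSON file name is hypothetical:

scala> df.show(5)                              // first 5 rows of the CSV DataFrame
scala> spark.read.json("people.json").show(5)  // same idea for a JSON file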

map

Returns a new distributed dataset formed by passing each element of the source through a function
func.

scala> sc.parallelize(List(1, 2, 3)).map(x => List(x, x, x)).collect

flatMap

Similar to map, but each input item can be mapped to 0 or more output items (so func should return
a Seq rather than a single item).

scala> sc.parallelize(List(1, 2, 3)).flatMap(x => List(x, x, x)).collect
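
Running the two snippets side by side shows the difference: map yields Array(List(1, 1, 1), List(2, 2, 2), List(3, 3, 3)), while flatMap flattens the same lists into Array(1, 1, 1, 2, 2, 2, 3, 3, 3).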

filter
Returns a new dataset formed by selecting those elements of the source on which func returns true.

scala> val file = sc.textFile("catalina.txt")
scala> val errors = file.filter(line => line.contains("ERROR")).collect

mapPartitions

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type
Iterator[T] => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitions(x => List(x.next).iterator).collect  // first element of each of the 3 partitions: Array(1, 4, 7)

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitions(x => List(x.next).iterator).collect  // result depends on the default partition count
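
Because func runs once per partition rather than once per element, mapPartitions is a natural place for expensive setup work, such as opening a database connection once and reusing it for every element in that partition.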

mapPartitionsWithIndex

Similar to mapPartitions, but also provides func with an integer value representing the index of the
partition, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
         it.toList.map(x => index + ", " + x).iterator).collect

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
         it.toList.map(x => index + ", " + x).iterator).collect  // Array("0, 1", "0, 2", "0, 3", "1, 4", ...)

sample

Samples a fraction of the data, with or without replacement, using a given random number generator
seed.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.sample(true, 0.2).count
scala> parallel.sample(true, 0.1).collect
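
The optional third argument is the seed itself, which makes a sample reproducible across runs; for example:

scala> parallel.sample(false, 0.3, 13L).collect  // same sample on every run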

union

Returns a new dataset that contains the union of the elements in the source dataset and the argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).collect

intersection

Returns a new RDD that contains the intersection of elements in the source dataset and the
argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.intersection(par2).collect

distinct

Returns a new dataset that contains the distinct elements of the source dataset.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).distinct.collect

join

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs
of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.

scala> val names1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
scala> names1.join(names2).collect
scala> names1.leftOuterJoin(names2).collect
scala> names1.rightOuterJoin(names2).collect
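
fullOuterJoin, mentioned above, follows the same pattern and keeps keys present on either side:

scala> names1.fullOuterJoin(names2).collect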

groupByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. Note: if you are
grouping in order to perform an aggregation (such as a sum or average) over each key, using
reduceByKey or aggregateByKey will yield much better performance.

scala> val babyNames = sc.textFile("baby_names.csv")
scala> val rows = babyNames.map(line => line.split(","))
scala> val namesToCounties = rows.map(name => (name(1), name(2)))
scala> namesToCounties.groupByKey.collect

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val groupfunc = data.groupByKey()
scala> groupfunc.collect

reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each
key are aggregated using the given reduce function func, which must be of type (V, V) => V.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4).toInt)).reduceByKey((v1, v2) => v1 + v2).collect

scala> val data = sc.parallelize(Array(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val reducefunc = data.reduceByKey((value, x) => value + x)
scala> reducefunc.collect

aggregateByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each
key are aggregated using the given combine functions and a neutral "zero" value. Allows an
aggregated value type that is different from the input value type, while avoiding unnecessary
allocations.

scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
scala> babyNamesCSV.aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b).collect
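
To see the different-output-type point in action, here is a sketch that collects each key's values into a Set[Int] (output type Set[Int], input value type Int):

scala> babyNamesCSV.aggregateByKey(Set.empty[Int])((acc, v) => acc + v, (a, b) => a ++ b).collect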


sortByKey

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V)
pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending
argument.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4))).sortByKey().foreach(println)
scala> filteredRows.map(n => (n(1), n(4))).sortByKey(false).foreach(println)  // descending order

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("D", 4), ("B", 2), ("E", 5)))
scala> data.collect
scala> val sortfunc = data.sortByKey()
scala> sortfunc.collect
