
SPARK TRANSFORMATIONS AND ACTIONS

Reading Data
Reading a CSV file:
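
A minimal sketch, assuming a Spark 2.x+ shell (where spark is the SparkSession) and the baby_names.csv file used in later examples:

scala> val df = spark.read.option("header", "true").csv("baby_names.csv")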

Displaying Data

a. Displaying a CSV file

b. Displaying a JSON file
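
A minimal sketch, reusing df from above; the JSON file name is hypothetical:

scala> df.show(5)                              // first 5 rows of the CSV DataFrame
scala> spark.read.json("people.json").show(5)  // same idea for a JSON file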

map

Returns a new distributed dataset formed by passing each element of the source through a function
func.

scala> sc.parallelize(List(1, 2, 3)).map(x => List(x, x, x)).collect

flatMap

Similar to map, but each input item can be mapped to 0 or more output items (so func should return
a Seq rather than a single item).

scala> sc.parallelize(List(1, 2, 3)).flatMap(x => List(x, x, x)).collect
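
Running the two snippets side by side shows the difference: map yields Array(List(1, 1, 1), List(2, 2, 2), List(3, 3, 3)), while flatMap flattens the same lists into Array(1, 1, 1, 2, 2, 2, 3, 3, 3).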

filter
Returns a new dataset formed by selecting those elements of the source on which func returns true.

scala> val file = sc.textFile("catalina.txt")
scala> val errors = file.filter(line => line.contains("ERROR")).collect

mapPartitions

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type
Iterator[T] => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitions(x => List(x.next).iterator).collect  // first element of each of the 3 partitions: Array(1, 4, 7)

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitions(x => List(x.next).iterator).collect  // result depends on the default partition count
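
Because func runs once per partition rather than once per element, mapPartitions is a natural place for expensive setup work, such as opening a database connection once and reusing it for every element in that partition.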

mapPartitionsWithIndex

Similar to mapPartitions, but also provides func with an integer value representing the index of the
partition, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
         it.toList.map(x => index + ", " + x).iterator).collect

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
         it.toList.map(x => index + ", " + x).iterator).collect  // Array("0, 1", "0, 2", "0, 3", "1, 4", ...)

sample

Samples a fraction of the data, with or without replacement, using a given random number generator
seed.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.sample(true, 0.2).count
scala> parallel.sample(true, 0.1).collect
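
The optional third argument is the seed itself, which makes a sample reproducible across runs; for example:

scala> parallel.sample(false, 0.3, 13L).collect  // same sample on every run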

union

Returns a new dataset that contains the union of the elements in the source dataset and the argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).collect

intersection

Returns a new RDD that contains the intersection of elements in the source dataset and the
argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.intersection(par2).collect

distinct

Returns a new dataset that contains the distinct elements of the source dataset.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).distinct.collect

join

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs
of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.

scala> val names1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
scala> names1.join(names2).collect
scala> names1.leftOuterJoin(names2).collect
scala> names1.rightOuterJoin(names2).collect
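
fullOuterJoin, mentioned above, follows the same pattern and keeps keys present on either side:

scala> names1.fullOuterJoin(names2).collect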

groupByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. Note: if you are
grouping in order to perform an aggregation (such as a sum or average) over each key, using
reduceByKey or aggregateByKey will yield much better performance.

scala> val babyNames = sc.textFile("baby_names.csv")
scala> val rows = babyNames.map(line => line.split(","))
scala> val namesToCounties = rows.map(name => (name(1), name(2)))
scala> namesToCounties.groupByKey.collect

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val groupfunc = data.groupByKey()
scala> groupfunc.collect

reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each
key are aggregated using the given reduce function func, which must be of type (V, V) => V.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4).toInt)).reduceByKey((v1, v2) => v1 + v2).collect

scala> val data = sc.parallelize(Array(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val reducefunc = data.reduceByKey((value, x) => value + x)
scala> reducefunc.collect

aggregateByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each
key are aggregated using the given combine functions and a neutral "zero" value. Allows an
aggregated value type that is different from the input value type, while avoiding unnecessary
allocations.

scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
scala> babyNamesCSV.aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b).collect
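
To see the different-output-type point in action, here is a sketch that collects each key's values into a Set[Int] (output type Set[Int], input value type Int):

scala> babyNamesCSV.aggregateByKey(Set.empty[Int])((acc, v) => acc + v, (a, b) => a ++ b).collect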


sortByKey

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V)
pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending
argument.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4))).sortByKey().foreach(println)
scala> filteredRows.map(n => (n(1), n(4))).sortByKey(false).foreach(println)  // descending order

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("D", 4), ("B", 2), ("E", 5)))
scala> data.collect
scala> val sortfunc = data.sortByKey()
scala> sortfunc.collect
