
Spark Core (Spark RDD)

Process Data in Spark :


There are 2 ways : (i) Spark Core (ii) Spark SQL

1.a.Spark RDD – Read Text file/ CSV file to RDD :


println("Text file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp.txt")
data.foreach(println)

println("CSV file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords1.csv")
data.foreach(println)

b.Spark RDD – Read Multiple text files to RDD :


println("Multiple files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords1.csv,
file:///C://Mamtha//BigData//Data//emp2.txt ")
data.foreach(println)

c.Spark RDD – Read Json File to RDD


Spark Core does not have an option to parse a JSON file. It imports the JSON file as plain
text, as in the code below, and does not interpret the data.
To import the data itself, we have to use Spark SQL.
Spark Core :
println("json file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp.json")
data.foreach(println)
This imports the entire file as raw lines, not just the data values.
Spark SQL :
import spark.implicits._
println("Importing Json File")
val data1= spark.read.json("file:///C://Mamtha//BigData//Data//employee.json")
data1.show()
This imports the data in the file as a DataFrame (DF).
d.Read XML file
To import an XML file we have to add a dependency (spark-xml_2.10, version 0.4.1).
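A minimal sketch of adding it, assuming the com.databricks group id for the spark-xml package:

//either pull the package in when launching spark-shell:
// spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
//or declare it as an sbt dependency:
libraryDependencies += "com.databricks" % "spark-xml_2.10" % "0.4.1"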

println("xml file")
val xmldata=spark.read.format("com.databricks.spark.xml")
.option("rowTag","Transaction")
.load("file:///C://Mamtha//BigData//Data//transactions.xml")
xmldata.printSchema()
xmldata.show()

2.Spark RDD – Filter


println("filtered data - filter specific column")
val region=data.filter(x=>x.contains("Asia"))
region.foreach(println)

println("filtered data – filter the entire file")


val data1=data.filter(x=>x.contains("U"))
data1.foreach(println)

println("Filtered Data Using Row Length")


val data1=data.filter(x=>x.length()>100)
data1.foreach(println)
To store the filtered records in a file :

data1.saveAsTextFile("file:///C://Mamtha//BigData//filtered")

3.Spark RDD :
3 types – File RDD, Sub RDD, Memory RDD
println("Actual Data - File RDD")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
data.foreach(println)

println("Filtered Data Using Row - Sub RDD")


val Salary=data.filter(x=>x.length()>100) //Sub RDD
Salary.foreach(println)

println("Memory RDD")
val memory=data.cache() //Memory RDD
memory.foreach(println)
4.Map
import spark.implicits._
println("Map – convert data in the file into DataFrames")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
val header=data.first
val datawh=data.filter(x=>x!=header)
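//the case class used below is not defined in these notes; a hypothetical definition is assumed,
//e.g. case class schema(id: Int, name: String, dept: String, sal: Int)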
val sdata=datawh.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1),x(2),x(3).toInt)).toDF()
sdata.show()

This converts the file contents into a DataFrame.

5.a.FlatMap (Delimiter as comma (,))


println("Flattened Data using FlatMap")
val data1=data.flatMap(x=>x.split(","))
data1.foreach(println)

b.FlatMap (Delimiter as pipe (|))


val data1=data.flatMap(x=>x.split("//|"))

c.Difference between Map and FlatMap


sc.parallelize([3,4,5]).map(lambda x: [x,  x*x]).collect() 
Output:
[[3, 9], [4, 16], [5, 25]]
sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() 
Output: notice flattened list
[3, 9, 4, 16, 5, 25]
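The same comparison written in Scala for spark-shell (a minimal sketch):

//map gives exactly one output element per input element (here an Array per number)
sc.parallelize(Seq(3, 4, 5)).map(x => Array(x, x * x)).collect()
//Array(Array(3, 9), Array(4, 16), Array(5, 25))

//flatMap flattens the per-element collections into a single RDD of values
sc.parallelize(Seq(3, 4, 5)).flatMap(x => Array(x, x * x)).collect()
//Array(3, 9, 4, 16, 5, 25)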
6.Spark RDD – Union(Merging of two files)
println("union Data")
val uniondata=data.union(empdata) //data - firstfile empdata - secondfile
uniondata.foreach(println)

7.a.Spark RDD – Distinct


println("Actual Data without distinct")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
data.foreach(println)
println("after distinct applied")
val distinctdata=data.distinct
distinctdata.foreach(println)

7.b.Union and FlatMap chained in a single statement


println("union Data with FlatMap")
val uniondata=sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
.union(sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt"))
.flatMap(x=>x.split(","))
uniondata.foreach(println)

8.Spark RDD – Partitions


println("Actual Data with partition size is not given")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
println(data.partitions.size)
The default number of partitions is 2.

println("Actual Data with partition size is given as 8")


val data1 = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",8)
println(data1.partitions.size)
Here the number of partitions is given as 8.

8.b. Partitioning the Sub RDD data and storing the result in files :
Here I read a single file (with only 20 records) into 6 partitions and then apply a filter
operation. After the filter operation the records are still split into 6 partitions by
default, because the file was read into 6 partitions at the beginning.

println("Actual Data with partition size given")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",6)
println(data.partitions.size)
println("Partition size after the filter operation takes place")
val filterdata=data.filter(x=>x.length()>100)
println(filterdata.partitions.size)
The difficulty here is that my file has only 20 records, and after filtering only 10 records
remain. Keeping 6 partitions for these 10 records is a hindrance. So, to reduce the number of
partitions in the RDD, " COALESCE " comes into the picture.
Coalesce is used to reduce the number of partitions in the RDD.

9.Spark RDD – COALESCE


println("Actual Data with partition size given")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",6)
println("size of the partition")
println(data.partitions.size)

println("coalesce")
val filterdata=data.filter(x=>x.length()>100).coalesce(2)
println(filterdata.partitions.size)

println("Actual Data with default Partitioning")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
println(data.partitions.size)
println("after Coalesce applied")
val coalesce1=data.coalesce(1) //coalesce
println(coalesce1.partitions.size)

10.Spark RDD -RePartition


Coalesce, however, comes with a drawback: it can only reduce the number of partitions. If we
want to increase the number of partitions, coalesce is not useful. To increase the number of
partitions, the "Repartition" concept is used.
Repartition is used to increase or decrease the number of partitions in the RDD.

println("coalesce")
val filterdata=data.filter(x=>x.length()>100).repartition(8) //Filtering with Repartition
println(filterdata.partitions.size)

println("Actual Data with default Partitioning")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
println(data.partitions.size)
println("after Repartition applied")
val repartition1=data.repartition(3) //Repartition
println(repartition1.partitions.size)

11.Actions
Collect, Count, CountByValue, Reduce, Take
println("single files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
data.foreach(println)
println("flattened data")
val data1=data.flatMap(x=>x.split(","))
data1.foreach(println)

println("filtered data")
val data2=data1.filter(x=>x.contains("c"))
data1.foreach(println)

println("Collect")
val collect=data2.collect
collect.foreach(println)

println("Count")
println(data1.count()) //returns only the number of elements

println("Countbyvalue")
val countbyvalue=data1.countByValue //shows each value with its number of occurrences
countbyvalue.foreach(println)

println("Take")
val take=data.take(5) //returns the first 5 rows of the RDD
take.foreach(println)
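Reduce is listed among the actions above but not demonstrated. A minimal sketch, assuming a small numeric RDD built with sc.parallelize:

println("Reduce")
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val total = nums.reduce((a, b) => a + b) //combines all the elements pairwise into one value
println(total) //15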

12.Joins
(i)join
object SparkFirstProject1 {
case class schema (id: Int, name: String)
case class schema1 (id: Int, person_id: Int,city:String)
def main(args:Array[String]):Unit={

println(" files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//pers.txt")
val emp=sc.textFile("file:///C://Mamtha//BigData//Data//addr.txt")

val data1=data.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1)))
val emp1=emp.map(x=>x.split(",")).map(x=>schema1(x(0).toInt,x(1).toInt,x(2)))

val tab1pair=data1.keyBy(_.id)          //pair RDD keyed by id
val tab2pair=emp1.keyBy(_.person_id)    //pair RDD keyed by the matching person_id
println("joins")
val joined=tab1pair.join(tab2pair)
joined.foreach(println)

(ii)LeftOuterJoin :
println("LeftOuterJoin")
val leftouterjoin=tab1pair.leftOuterJoin(tab2pair)
leftouterjoin.foreach(println)

(iii)RightOuterJoin :
println("RightOuterJoin")
val rightouterjoin=tab1pair.rightOuterJoin(tab2pair)
rightouterjoin.foreach(println)

12.Spark – Cache() & Persist()


println("cache")
val data1=data.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1),x(2),x(3).toInt)).cache()
data1.foreach(println)
println("persist")
import org.apache.spark.storage.StorageLevel //needed for the storage level constants
data1.persist(StorageLevel.MEMORY_ONLY)
5 storage levels are available in persist. We can use any of these storage levels.
If we no longer want the data persisted, we can simply unpersist it.
data1.unpersist()

Storage Levels :
The 5 Storage levels in persist are
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
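
A minimal sketch of switching to another level (a hypothetical follow-up to the snippet above; the current level must be cleared with unpersist before a different one can be set):

data1.unpersist()                            //clear the current storage level first
data1.persist(StorageLevel.MEMORY_AND_DISK)  //keep in memory, spill to disk when it does not fit
data1.count()                                //an action materialises the cache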
