
Spark Core (Spark RDD)

Process Data in Spark :


There are 2 ways : (i) Spark Core (ii) Spark SQL

1.a.Spark RDD – Read Text file/ CSV file to RDD :


println("Text file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp.txt")
data.foreach(println)

println("CSV file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords1.csv")
data.foreach(println)

b.Spark RDD – Read Multiple text files to RDD :


println("Multiple files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords1.csv,
file:///C://Mamtha//BigData//Data//emp2.txt ")
data.foreach(println)

c.Spark RDD – Read Json File to RDD


Spark Core does not have an option to parse a JSON file. It imports the JSON file as plain
text, as in the code below, and does not interpret the data.
To import the data itself, we have to use Spark SQL.
Spark Core :
println("json file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp.json")
data.foreach(println)
This imports the entire file as raw lines, not just the data values.
Spark SQL :
import spark.implicits._
println("Importing Json File")
val data1= spark.read.json("file:///C://Mamtha//BigData//Data//employee.json")
data1.show()
This imports the data in the file as a DataFrame (DF).
d.Read XML file
To import an XML file we have to add a dependency (spark-xml_2.10, version 0.4.1).
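A minimal sketch of adding it, assuming the com.databricks group id for the spark-xml package:

//either pull the package in when launching spark-shell:
// spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
//or declare it as an sbt dependency:
libraryDependencies += "com.databricks" % "spark-xml_2.10" % "0.4.1"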

println("xml file")
val xmldata=spark.read.format("com.databricks.spark.xml")
.option("rowTag","Transaction")
.load("file:///C://Mamtha//BigData//Data//transactions.xml")
xmldata.printSchema()
xmldata.show()

2.Spark RDD – Filter


println("filtered data - filter specific column")
val region=data.filter(x=>x.contains("Asia"))
region.foreach(println)

println("filtered data – filter the entire file")


val data1=data.filter(x=>x.contains("U"))
data1.foreach(println)

println("Filtered Data Using Row Length")


val data1=data.filter(x=>x.length()>100)
data1.foreach(println)
To store the filtered records in a file :

data1.saveAsTextFile("file:///C://Mamtha//BigData//filtered")

3.Spark RDD :
3 types – File RDD, Sub RDD, Memory RDD
println("Actual Data - File RDD")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
data.foreach(println)

println("Filtered Data Using Row - Sub RDD")


val Salary=data.filter(x=>x.length()>100) //Sub RDD
Salary.foreach(println)

println("Memory RDD")
val memory=data.cache() //Memory RDD
memory.foreach(println)
4.Map
import spark.implicits._
println("Map – convert data in the file into DataFrames")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
val header=data.first
val datawh=data.filter(x=>x!=header)
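//the case class used below is not defined in these notes; a hypothetical definition is assumed,
//e.g. case class schema(id: Int, name: String, dept: String, sal: Int)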
val sdata=datawh.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1),x(2),x(3).toInt)).toDF()
sdata.show()

This converts the file contents into a DataFrame.

5.a.FlatMap (Delimiter as comma (,))


println("Flattened Data using FlatMap")
val data1=data.flatMap(x=>x.split(","))
data1.foreach(println)

b.FlatMap (Delimiter as pipe (|))


val data1=data.flatMap(x=>x.split("//|"))

c.Difference between Map and FlatMap


sc.parallelize([3,4,5]).map(lambda x: [x,  x*x]).collect() 
Output:
[[3, 9], [4, 16], [5, 25]]
sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() 
Output: notice flattened list
[3, 9, 4, 16, 5, 25]
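The same comparison written in Scala for spark-shell (a minimal sketch):

//map gives exactly one output element per input element (here an Array per number)
sc.parallelize(Seq(3, 4, 5)).map(x => Array(x, x * x)).collect()
//Array(Array(3, 9), Array(4, 16), Array(5, 25))

//flatMap flattens the per-element collections into a single RDD of values
sc.parallelize(Seq(3, 4, 5)).flatMap(x => Array(x, x * x)).collect()
//Array(3, 9, 4, 16, 5, 25)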
6.Spark RDD – Union(Merging of two files)
println("union Data")
val uniondata=data.union(empdata) //data - firstfile empdata - secondfile
uniondata.foreach(println)

7.a.Spark RDD – Distinct


println("Actual Data without distinct")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
data.foreach(println)
println("after distinct applied")
val distinctdata=data.distinct
distinctdata.foreach(println)

7.b.Union and FlatMap chained in a single statement


println("union Data with FlatMap")
val uniondata=sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
.union(sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt"))
.flatMap(x=>x.split(","))
uniondata.foreach(println)

8.Spark RDD – Partitions


println("Actual Data with partition size is not given")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
println(data.partitions.size)
The default number of partitions is 2.

println("Actual Data with partition size is given as 8")


val data1 = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",8)
println(data1.partitions.size)
Here the number of partitions is given as 8.

8.b. Partitioning the Sub RDD data and storing the result in files :
Here I read a single file (with only 20 records) into 6 partitions and then apply a filter
operation. After the filter operation the records are still split into 6 partitions by
default, because the file was read into 6 partitions at the beginning.

println("Actual Data with partition size given")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",6)
println(data.partitions.size)
println("Partition size after the filter operation takes place")
val filterdata=data.filter(x=>x.length()>100)
println(filterdata.partitions.size)
The difficulty here is that my file has only 20 records, and after filtering only 10 records
remain. Keeping 6 partitions for these 10 records is a hindrance. So, to reduce the number of
partitions in the RDD, " COALESCE " comes into the picture.
Coalesce is used to reduce the number of partitions in the RDD.

9.Spark RDD – COALESCE


println("Actual Data with partition size given")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv",6)
println("size of the partition")
println(data.partitions.size)

println("coalesce")
val filterdata=data.filter(x=>x.length()>100).coalesce(2)
println(filterdata.partitions.size)

println("Actual Data with default Partitioning")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
println(data.partitions.size)
println("after Coalesce applied")
val coalesce1=data.coalesce(1) //coalesce
println(coalesce1.partitions.size)

10.Spark RDD -RePartition


Coalesce, however, comes with a drawback: it can only reduce the number of partitions. If we
want to increase the number of partitions, coalesce is not useful. To increase the number of
partitions, the "Repartition" concept is used.
Repartition is used to increase or decrease the number of partitions in the RDD.

println("coalesce")
val filterdata=data.filter(x=>x.length()>100).repartition(8) //Filtering with Repartition
println(filterdata.partitions.size)

println("Actual Data with default Partitioning")


val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords.csv")
println(data.partitions.size)
println("after Repartition applied")
val repartition1=data.repartition(3) //Repartition
println(repartition1.partitions.size)

11.Actions
Collect, Count, CountByValue, Reduce, Take
println("single files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
data.foreach(println)
println("flattened data")
val data1=data.flatMap(x=>x.split(","))
data1.foreach(println)

println("filtered data")
val data2=data1.filter(x=>x.contains("c"))
data1.foreach(println)

println("Collect")
val collect=data2.collect
collect.foreach(println)

println("Count")
println(data1.count()) //returns only the number of elements

println("Countbyvalue")
val countbyvalue=data1.countByValue //shows each value with its number of occurrences
countbyvalue.foreach(println)

println("Take")
val take=data.take(5) //returns the first 5 rows of the RDD
take.foreach(println)
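Reduce is listed among the actions above but not demonstrated. A minimal sketch, assuming a small numeric RDD built with sc.parallelize:

println("Reduce")
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val total = nums.reduce((a, b) => a + b) //combines all the elements pairwise into one value
println(total) //15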

12.Joins
(i)join
object SparkFirstProject1 {
case class schema (id: Int, name: String)
case class schema1 (id: Int, person_id: Int,city:String)
def main(args:Array[String]):Unit={

println(" files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//pers.txt")
val emp=sc.textFile("file:///C://Mamtha//BigData//Data//addr.txt")

val data1=data.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1)))
val emp1=emp.map(x=>x.split(",")).map(x=>schema1(x(0).toInt,x(1).toInt,x(2)))

val tab1pair=data1.keyBy(_.id)          //pair RDD keyed by id
val tab2pair=emp1.keyBy(_.person_id)    //pair RDD keyed by the matching person_id
println("joins")
val joined=tab1pair.join(tab2pair)
joined.foreach(println)

(ii)LeftOuterJoin :
println("LeftOuterJoin")
val leftouterjoin=tab1pair.leftOuterJoin(tab2pair)
leftouterjoin.foreach(println)

(iii)RightOuterJoin :
println("RightOuterJoin")
val rightouterjoin=tab1pair.rightOuterJoin(tab2pair)
rightouterjoin.foreach(println)

12.Spark – Cache() & Persist()


println("cache")
val data1=data.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1),x(2),x(3).toInt)).cache()
data1.foreach(println)
println("persist")
import org.apache.spark.storage.StorageLevel //needed for the storage level constants
data1.persist(StorageLevel.MEMORY_ONLY)
5 storage levels are available in persist. We can use any of these storage levels.
If we no longer want the data persisted, we can simply unpersist it.
data1.unpersist()

Storage Levels :
The 5 Storage levels in persist are
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
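
A minimal sketch of switching to another level (a hypothetical follow-up to the snippet above; the current level must be cleared with unpersist before a different one can be set):

data1.unpersist()                            //clear the current storage level first
data1.persist(StorageLevel.MEMORY_AND_DISK)  //keep in memory, spill to disk when it does not fit
data1.count()                                //an action materialises the cache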
