println("CSV file")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//SalesRecords1.csv")
data.foreach(println)
println("xml file")
val xmldata=spark.read.format("com.databricks.spark.xml")
.option("rowTag","Transaction")
.load("file:///C://Mamtha//BigData//Data//transactions.xml")
xmldata.printSchema()
xmldata.show()
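Reading XML with the com.databricks.spark.xml format requires the spark-xml package on the classpath; one common way to provide it (an assumption about the setup, with the version left as a placeholder) is to start the shell with:
spark-shell --packages com.databricks:spark-xml_2.12:<version>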
data1.saveAsTextFile("file:///C://Mamtha//BigData//filtered") //data1 is assumed to be a previously filtered RDD; saveAsTextFile writes it out as part files in the target directory
3.Spark RDD :
3 types – File RDD, Sub RDD, Memory RDD (a sketch of a Sub RDD follows the example below)
println("Actual Data - File RDD")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//EmployeeRecords.csv")
data.foreach(println)
println("Memory RDD")
val memory=data.cache() //Memory RDD
memory.foreach(println)
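The Sub RDD of the three types is not demonstrated above; a minimal sketch, assuming a Sub RDD simply means an RDD derived from an existing RDD through a transformation (the filter predicate below is a placeholder):
println("Sub RDD")
val sub=data.filter(x=>x.contains("Manager")) //Sub RDD – derived from the File RDD by a transformation
sub.foreach(println)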
4.Map
import spark.implicits._
println("Map – convert data in the file into DataFrames")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
val header=data.first
val datawh=data.filter(x=>x!=header)
val sdata=datawh.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1),x(2),x(3).toInt)).toDF()
sdata.show()
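The schema case class used in the map above is not defined in this snippet; a minimal sketch of what it might look like, given the four fields being parsed (the field names are assumptions):
case class schema(id: Int, name: String, dept: String, salary: Int)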
8.b. Partitioning the Sub RDD data and storing the result in files:
Here the single file (with only 20 records) is loaded into 6 partitions and then a filter
operation is applied. After the filter completes, the records are still spread across 6
partitions by default, because the file was split into 6 partitions at the start.
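Before the coalesce and repartition calls below, the initial six-partition load described above might look like the following (a minimal sketch; the file path and the length-based filter are reused from the surrounding examples, and the minPartitions argument is an assumption about how the 6 partitions were created):
val data=sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt",6) //minPartitions = 6
println(data.partitions.size) //typically 6 for a small file
val filtered=data.filter(x=>x.length()>100)
println(filtered.partitions.size) //filter keeps the parent's partitioning, so still 6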
println("coalesce")
val filterdata=data.filter(x=>x.length()>100).coalesce(2) //Filtering with coalesce; merges the existing partitions down to 2 without a full shuffle
println(filterdata.partitions.size)
println("coalesce")
val filterdata=data.filter(x=>x.length()>100).repartition(8) //Filtering with Repartition
println(filterdata.partitions.size)
11.Actions
Collect, Count, CountByValue, Reduce, Take
println("single files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
data.foreach(println)
println("flattened data")
val data1=data.flatMap(x=>x.split(","))
data1.foreach(println)
println("filtered data")
val data2=data1.filter(x=>x.contains("c"))
data2.foreach(println)
println("Collect")
val collect=data2.collect
collect.foreach(println)
println("Count")
println(data1.count()) //returns only the number of elements in the RDD
println("Countbyvalue")
val countbyvalue=data1.countByValue //returns each distinct value along with its number of occurrences
countbyvalue.foreach(println)
println("Take")
val take=data.take(5) //returns the first 5 records of the RDD
take.foreach(println)
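Reduce is listed among the actions above but not demonstrated; a minimal sketch on a small numeric RDD (the parallelized range is a placeholder, not data from the files used earlier):
println("Reduce")
val nums=sc.parallelize(1 to 10)
println(nums.reduce(_+_)) //aggregates all elements into a single value, here 55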
12.Joins
(i)join
object SparkFirstProject1 {
case class schema (id: Int, name: String)
case class schema1 (id: Int, person_id: Int,city:String)
def main(args:Array[String]):Unit={
//Assumes a standalone application, so a SparkContext is created here (in spark-shell, sc is already available)
val conf=new org.apache.spark.SparkConf().setAppName("SparkFirstProject1").setMaster("local[*]")
val sc=new org.apache.spark.SparkContext(conf)
println(" files")
val data = sc.textFile("file:///C://Mamtha//BigData//Data//pers.txt")
val emp=sc.textFile("file:///C://Mamtha//BigData//Data//addr.txt")
val data1=data.map(x=>x.split(",")).map(x=>schema(x(0).toInt,x(1)))
val emp1=emp.map(x=>x.split(",")).map(x=>schema1(x(0).toInt,x(1).toInt,x(2)))
val tab1pair=data1.keyBy(_.id)
val tab2pair=emp1.keyBy(_.person_id)
println("joins")
val joined=tab1pair.join(tab2pair)
joined.foreach(println)
(ii)LeftOuterJoin :
println("LeftOuterJoin")
val leftouterjoin=tab1pair.leftOuterJoin(tab2pair)
leftouterjoin.foreach(println)
(iii)RightOuterJoin :
println("RightOuterJoin")
val rightouterjoin=tab1pair.rightOuterJoin(tab2pair)
rightouterjoin.foreach(println)
Storage Levels :
The 5 commonly used storage levels for persist() are:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
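A minimal sketch of applying one of these storage levels (the RDD is reused from the earlier examples; the chosen level is just for illustration):
import org.apache.spark.storage.StorageLevel
val data=sc.textFile("file:///C://Mamtha//BigData//Data//emp2.txt")
data.persist(StorageLevel.MEMORY_AND_DISK) //keep partitions in memory, spill to disk if they do not fit
println(data.count()) //the first action materialises and persists the RDD
data.unpersist() //release the storage when the data is no longer needed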