
----------------------------------------------

Scala (Scalable Language)

Scala is a "Functional + Object-Oriented Programming Language"

Java is an "Object-Oriented Programming Language"

----------------------------------------------

Immutable (we can't change)

Mutable (we can change)

val => value (immutable)

var => variable (mutable)

----------------------------------------------
Java Syntax:
---------------
<data_type> <variable_name> = <value / exp> ;

Scala Syntax:
---------------
val <variable_name> : <data_type> = <value / exp>

var <variable_name> : <data_type> = <value / exp>

----------------------------------------------
How to start Scala from the `command prompt`:
----------------------------------------------
Command: scala

Java-1.7 => Scala-2.10 => Spark-1.x


Java-1.8 => Scala-2.11 => Spark-2.x

----------------------------------------------

orienit@kalyan:~$ scala
Welcome to Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.7.0_101).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

`Scala` provides `REPL` functionality.

Read Evaluate Print Loop (REPL)

Note: REPL functionality is already available in `R, Python, Groovy, ..`

----------------------------------------------
scala> val name : String = "kalyan"
name: String = kalyan

scala> name : String = "xyz"


<console>:1: error: ';' expected but '=' found.
name : String = "xyz"
^

scala> name = "xyz"


<console>:11: error: reassignment to val
name = "xyz"
^

scala> var name : String = "kalyan"


name: String = kalyan

scala> name = "xyz"


name: String = xyz

----------------------------------------------
Scala:
----------
In `Scala` everything is an `Object`;
we don't have any primitive datatypes like in Java

Java:
----------
we have Objects
we have primitive datatypes

Examples:
-------------
int a = 1;
Integer a = 1;

Note:
1. In Java, `Objects` are serializable, but primitive datatypes are not.

2. Java does not support `operator overloading`

3. Scala & C++ allow `operator overloading`
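In Scala, operator overloading is just method definition: an operator like `+` is an ordinary method name. A minimal sketch with a hypothetical `Point` class:

```scala
// In Scala, '+' is just a method name, so any class can define it.
// Hypothetical Point class for illustration:
case class Point(x: Int, y: Int) {
  def +(other: Point): Point = Point(x + other.x, y + other.y)
}

val p = Point(1, 2) + Point(3, 4)   // same as Point(1, 2).+(Point(3, 4))
println(p)                          // Point(4,6)
```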

----------------------------------------------
`Scala` provides `Type Inference`.

`Type inference` means the datatype is derived from the `value / expression`

scala> val name : String = "kalyan"


name: String = kalyan

scala> val name = "kalyan"


name: String = kalyan

----------------------------------------------
Examples for `Type Inference`:

scala> val id = 1
id: Int = 1

scala> val id = 1.5


id: Double = 1.5

scala> val id = 1l
id: Long = 1

scala> val id = 1d
id: Double = 1.0

scala> val id = 1f
id: Float = 1.0

scala> val id = true


id: Boolean = true

scala> val id = "aaa"


id: String = aaa

scala> val id = 'a'


id: Char = a

----------------------------------------------
Examples for `Operator Overloading`:

scala> val a = 10
a: Int = 10

scala> val b = 20
b: Int = 20

scala> val c = a + b
c: Int = 30

scala> val c = a.+(b)


c: Int = 30

a + b ====> a.+(b)

scala> a.-(b)
res0: Int = -10

scala> a.*(b)
res1: Int = 200

scala> a./(b)
res2: Int = 0

scala> a.%(b)
res3: Int = 10

scala> a min b
res4: Int = 10

scala> a max b
res5: Int = 20
----------------------------------------------
Data Type Conversions:
---------------------------

scala> val id = 10
id: Int = 10

scala> id.to
toByte toDouble toInt toShort
toChar toFloat toLong toString

scala> id.toDouble
res6: Double = 10.0

scala> id.toLong
res7: Long = 10

scala> id.toString
res8: String = 10

scala> id.toChar
res9: Char =

(Note: 10.toChar is the newline character, so it prints as blank)

scala> id.toByte
res10: Byte = 10

----------------------------------------------
If, If-Else expressions in Scala
----------------------------------------------
if(exp1) {
body1
}

if(exp2) {
body1
} else {
body2
}

if(exp1) {
body1
} else if(exp2) {
body2
} else {
body3
}

Note:
1. Java, C, C++ support the Ternary Operator

(expression) ? <body1> : <body2>

2. Scala does not support the Ternary Operator; use `if-else` as an expression instead

if(expression) <body1> else <body2>
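Because `if` is an expression in Scala (it returns a value), it covers the same use case as the ternary operator:

```scala
val a = 10
val b = 20

// 'if' returns a value, so it can sit on the right side of a val
val max = if (a > b) a else b
println(max)   // 20
```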


----------------------------------------------
Arrays in Scala & Java
----------------------------------------------
Java Syntax:
-------------------
<data_type>[] <variable_name> = {list of values};

<data_type>[] <variable_name> = new <data_type>[<size>];

Scala Syntax:
-------------------
val <variable_name> : Array[<data_type>] = Array[<data_type>](list of values)

val <variable_name> : Array[<data_type>] = new Array[<data_type>](<size>)

----------------------------------------------
Java Examples:
----------------------
String[] names = {"kalyan", "venkat", "ravi"};

(or)

String[] names = new String[3];


names[0] = "kalyan";
names[1] = "venkat";
names[2] = "ravi";

Scala Examples:
----------------------
val names : Array[String] = Array[String]("kalyan", "venkat", "ravi")

(or)

val names : Array[String] = new Array[String](3)


names(0) = "kalyan"
names(1) = "venkat"
names(2) = "ravi"

----------------------------------------------

scala> val names : Array[String] = Array[String]("kalyan", "venkat", "ravi")


names: Array[String] = Array(kalyan, venkat, ravi)

scala> names(0)
res11: String = kalyan

scala> names(1)
res12: String = venkat

scala> names(2)
res13: String = ravi

----------------------------------------------
scala> val names : Array[String] = new Array[String](3)
names: Array[String] = Array(null, null, null)

scala> names(0) = "kalyan"

scala> names(1) = "venkat"

scala> names(2) = "ravi"

scala> names
res17: Array[String] = Array(kalyan, venkat, ravi)

----------------------------------------------
scala> val names = Array[String]("kalyan", "venkat", "ravi")
names: Array[String] = Array(kalyan, venkat, ravi)

scala> val names = Array("kalyan", "venkat", "ravi")


names: Array[String] = Array(kalyan, venkat, ravi)

----------------------------------------------

scala> val ids = Array(1,2,3,4,5,6)


ids: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> val ids = Array[Int](1,2,3,4,5,6)


ids: Array[Int] = Array(1, 2, 3, 4, 5, 6)

----------------------------------------------
scala> for( id <- ids) println(id)
1
2
3
4
5
6

scala> for( id <- ids) if(id % 2 == 0) println(id)


2
4
6

scala> for( id <- ids) if(id % 2 == 1) println(id)


1
3
5

scala> for( id <- ids if(id % 2 == 0) ) println(id)


2
4
6

scala> for( id <- ids if(id % 2 == 1) ) println(id)


1
3
5
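A for loop can also build a new collection with `yield` instead of printing; a small sketch continuing the `ids` example:

```scala
val ids = Array(1, 2, 3, 4, 5, 6)

// 'for ... yield' collects the results into a new Array
val evens = for (id <- ids if id % 2 == 0) yield id
println(evens.mkString(", "))   // 2, 4, 6
```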

----------------------------------------------
Scala supports 2 types of collections:
1. Immutable (scala.collection.immutable)

2. Mutable (scala.collection.mutable)

----------------------------------------------

scala> scala.collection.immutable.
:: LongMap SortedMap
AbstractMap LongMapEntryIterator SortedSet
BitSet LongMapIterator Stack
DefaultMap LongMapKeyIterator Stream
HashMap LongMapUtils StreamIterator
HashSet LongMapValueIterator StreamView
IndexedSeq Map StreamViewLike
IntMap MapLike StringLike
IntMapEntryIterator MapProxy StringOps
IntMapIterator Nil Traversable
IntMapKeyIterator NumericRange TreeMap
IntMapUtils Page TreeSet
IntMapValueIterator PagedSeq TrieIterator
Iterable Queue Vector
LinearSeq Range VectorBuilder
List RedBlackTree VectorIterator
ListMap Seq VectorPointer
ListSerializeEnd Set WrappedString
ListSet SetProxy

----------------------------------------------

scala> scala.collection.mutable.
AVLIterator ListBuffer
AVLTree ListMap
AbstractBuffer LongMap
AbstractIterable Map
AbstractMap MapBuilder
AbstractSeq MapLike
AbstractSet MapProxy
AnyRefMap MultiMap
ArrayBuffer MutableList
ArrayBuilder Node
ArrayLike ObservableBuffer
ArrayOps ObservableMap
ArraySeq ObservableSet
ArrayStack OpenHashMap
BitSet PriorityQueue
Buffer PriorityQueueProxy
BufferLike Publisher
BufferProxy Queue
Builder QueueProxy
Cloneable ResizableArray
DefaultEntry RevertibleHistory
DefaultMapModel Seq
DoubleLinkedList SeqLike
DoubleLinkedListLike Set
FlatHashTable SetBuilder
GrowingBuilder SetLike
HashEntry SetProxy
HashMap SortedSet
HashSet Stack
HashTable StackProxy
History StringBuilder
ImmutableMapAdaptor Subscriber
ImmutableSetAdaptor SynchronizedBuffer
IndexedSeq SynchronizedMap
IndexedSeqLike SynchronizedPriorityQueue
IndexedSeqOptimized SynchronizedQueue
IndexedSeqView SynchronizedSet
Iterable SynchronizedStack
LazyBuilder Traversable
Leaf TreeSet
LinearSeq Undoable
LinkedEntry UnrolledBuffer
LinkedHashMap WeakHashMap
LinkedHashSet WrappedArray
LinkedList WrappedArrayBuilder
LinkedListLike

----------------------------------------------
1. Convert an `Immutable Collection` to a `Mutable Collection` using `toBuffer`

2. Convert a `Mutable Collection` to an `Immutable Collection` using `toList`
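A quick round-trip sketch of these two conversions:

```scala
// immutable List -> mutable Buffer -> immutable List
val immutableIds = List(1, 2, 3)

val buffer = immutableIds.toBuffer   // mutable copy
buffer += 4                          // allowed on a Buffer

val backToList = buffer.toList       // immutable again
println(backToList)                  // List(1, 2, 3, 4)
```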

----------------------------------------------
Examples on Collections:
----------------------------
val ids = Array[Int](1,2,3,4,5,6)

val ids = List[Int](1,2,3,4,5,6)

val ids = Seq[Int](1,2,3,4,5,6)

val ids = Set[Int](1,2,3,4,5,6)

val ids = Vector[Int](1,2,3,4,5,6)

val ids = Stack[Int](1,2,3,4,5,6)

val ids = Queue[Int](1,2,3,4,5,6)

----------------------------------------------

scala> val ids = Array[Int](1,2,3,4,5,6)


ids: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> val ids = List[Int](1,2,3,4,5,6)


ids: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> val ids = Seq[Int](1,2,3,4,5,6)


ids: Seq[Int] = List(1, 2, 3, 4, 5, 6)
scala> val ids = Set[Int](1,2,3,4,5,6)
ids: scala.collection.immutable.Set[Int] = Set(5, 1, 6, 2, 3, 4)

scala> val ids = Vector[Int](1,2,3,4,5,6)


ids: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4, 5, 6)

----------------------------------------------

scala> val ids = Stack[Int](1,2,3,4,5,6)


<console>:10: error: not found: value Stack
val ids = Stack[Int](1,2,3,4,5,6)
^

scala> val ids = scala.collection.immutable.Stack[Int](1,2,3,4,5,6)


ids: scala.collection.immutable.Stack[Int] = Stack(1, 2, 3, 4, 5, 6)

scala> val ids = scala.collection.mutable.Stack[Int](1,2,3,4,5,6)


ids: scala.collection.mutable.Stack[Int] = Stack(1, 2, 3, 4, 5, 6)

----------------------------------------------

scala> val ids = Queue[Int](1,2,3,4,5,6)


<console>:10: error: not found: value Queue
val ids = Queue[Int](1,2,3,4,5,6)
^

scala> val ids = scala.collection.immutable.Queue[Int](1,2,3,4,5,6)


ids: scala.collection.immutable.Queue[Int] = Queue(1, 2, 3, 4, 5, 6)

scala> val ids = scala.collection.mutable.Queue[Int](1,2,3,4,5,6)


ids: scala.collection.mutable.Queue[Int] = Queue(1, 2, 3, 4, 5, 6)

----------------------------------------------
Scala supports 3 types of functions:
----------------------------------------------
1. Anonymous functions
2. Named functions
3. Curried functions

1. Anonymous functions
----------------------------------------------
(a: Int, b: Int) => { a + b }

val add = (a: Int, b: Int) => { a + b }

scala> (a: Int, b: Int) => { a + b }


res27: (Int, Int) => Int = <function2>

scala> val add = (a: Int, b: Int) => { a + b }


add: (Int, Int) => Int = <function2>
scala> add(1,2)
res28: Int = 3

scala> add(10,20)
res29: Int = 30
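Anonymous functions are usually passed directly to higher-order functions like `map`, without being stored in a `val` first:

```scala
val ids = List(1, 2, 3)

// the anonymous function (x: Int) => x * 2 is passed inline
val doubled = ids.map((x: Int) => x * 2)
println(doubled)   // List(2, 4, 6)
```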

2. Named functions
----------------------------------------------
def add(a: Int, b: Int) = { a + b }

scala> def add(a: Int, b: Int) = { a + b }


add: (a: Int, b: Int)Int

scala> add(1,3)
res30: Int = 4

scala> add(20,10)
res31: Int = 30

3. Curried functions
----------------------------------------------
def add1(a: Int, b: Int) = { a + b }

def add2(a: Int)(b: Int) = { a + b }

scala> def add1(a: Int, b: Int) = { a + b }


add1: (a: Int, b: Int)Int

scala> def add2(a: Int)(b: Int) = { a + b }


add2: (a: Int)(b: Int)Int

----------------------------------------------
scala> add1(10,20)
res32: Int = 30

scala> add2(10,20)
<console>:14: error: too many arguments for method add2: (a: Int)(b: Int)Int
add2(10,20)
^

scala> add2(10)(20)
res34: Int = 30
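The benefit of the curried form is partial application: supplying only the first parameter list yields a new function. A minimal sketch:

```scala
def add2(a: Int)(b: Int) = a + b

// supply only the first parameter list; the trailing '_' is
// needed in Scala 2 to turn the rest into a function value
val add10 = add2(10) _    // add10: Int => Int

println(add10(20))   // 30
println(add10(5))    // 15
```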

----------------------------------------------

scala> def mul(a: Int, b: Int, c : Int) = { a * b * c }


mul: (a: Int, b: Int, c: Int)Int

scala> def mul(a: Int, b: Int, c : Int) : Int = { a * b * c }


mul: (a: Int, b: Int, c: Int)Int

scala> def mul(a: Int, b: Int, c : Int) = { val m = a * b * c ; print(m)}


mul: (a: Int, b: Int, c: Int)Unit
scala> def mul(a: Int, b: Int, c : Int) = { val m = a * b * c ; print(m); m}
mul: (a: Int, b: Int, c: Int)Int
----------------------------------------------
Spark
----------------------------------------------

Spark provides 4 libraries:


-> Spark SQL
-> Spark Streaming
-> Spark MLLib
-> Spark GraphX

Spark supports 4 programming languages:


-> Java
-> Scala
-> Python
-> R

SparkContext -> entry point for any Spark operation

RDD -> Resilient Distributed DataSets

RDD features:
---------------------
1. Immutability
2. Lazy Evaluation
3. Cacheable
4. Type Infer

RDD Operations:
---------------------
1. Transformations ( convert old_rdd into new_rdd )

2. Actions ( convert rdd into result )

1.Transformations:
---------------------------

list <- {1,2,3,4}

f1(x) = { x + 1}

f2(x) = { x * x}

f1(list) <- {2,3,4,5}

f2(list) <- {1,4,9,16}


2.Actions:
---------------------------

list <- {1,2,3,4}

min(list) <- 1

max(list) <- 4

sum(list) <- 10

count(list) <- 4
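The same transformation/action idea can be sketched with plain Scala collections (no Spark needed):

```scala
val list = List(1, 2, 3, 4)

// transformations: old collection -> new collection
val plusOne = list.map(x => x + 1)   // List(2, 3, 4, 5)
val squared = list.map(x => x * x)   // List(1, 4, 9, 16)

// actions: collection -> single result
println(list.min)    // 1
println(list.max)    // 4
println(list.sum)    // 10
println(list.size)   // 4
```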

----------------------------------------------
Spark Shell Start commands:
----------------------------------------------
scala => spark-shell
python => pyspark
R => SparkR

----------------------------------------------

Spark context Web UI available at http://192.168.0.176:4040

Spark context available as 'sc' (master = local[*], app id = local-1504943501105).

Spark session available as 'spark'. (only spark-2.x)

----------------------------------------------
How to create RDD?
----------------------------------------------
We can create an RDD in Spark in 2 ways:
1. from collections (list, seq, set, ....)
2. from datasets (txt, csv, tsv, json, hbase, ...)

----------------------------------------------
How to create RDD from collections?
----------------------------------------------

val rdd = sc.parallelize(<collection> , <no.of partitions>)

----------------------------------------------
How to create RDD from datasets?
----------------------------------------------

val rdd = sc.textFile(<path> , <no.of partitions>)

----------------------------------------------
Examples on RDD:
----------------------------------------------

val list = List(1,2,3,4,5,6)

val rdd1 = sc.parallelize(list)


----------------------------------------------

scala> val list = List(1,2,3,4,5,6)


list: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> val rdd1 = sc.parallelize(list)


rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
<console>:26

scala> rdd1.getNumPartitions
res0: Int = 4

scala> val rdd2 = sc.parallelize(list, 2)


rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at
<console>:26

scala> rdd2.getNumPartitions
res1: Int = 2

----------------------------------------------
NOTE:
1. `rdd.collect()` will display the RDD data in the console, similar to the PIG `dump` command.

2. Don't use `collect()` in a Production environment.

----------------------------------------------

scala> rdd1.collect()
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> rdd2.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> rdd1.glom().collect()
res5: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4), Array(5, 6))

scala> rdd2.glom().collect()
res6: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))

----------------------------------------------

scala> val rdd3 = rdd1.repartition(3)


rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at repartition at
<console>:28

scala> rdd3.getNumPartitions
res7: Int = 3

scala> rdd3.glom().collect()
res8: Array[Array[Int]] = Array(Array(5), Array(1, 2, 6), Array(3, 4))

scala> val rdd4 = rdd3.coalesce(2)


rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[9] at coalesce at <console>:30
scala> rdd4.getNumPartitions
res9: Int = 2

scala> rdd4.glom().collect()
res10: Array[Array[Int]] = Array(Array(5, 3, 4), Array(1, 2, 6))

----------------------------------------------

scala> rdd1.collect()
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> rdd1.map((x : Int) => { x + 1 })


res12: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at map at <console>:29

scala> val rdd11 = rdd1.map((x : Int) => { x + 1 })


rdd11: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:28

scala> rdd11.collect()
res13: Array[Int] = Array(2, 3, 4, 5, 6, 7)

scala> def addOne(x : Int) = { x + 1 }


addOne: (x: Int)Int

scala> val rdd12 = rdd1.map(addOne)


rdd12: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:30

scala> rdd12.collect()
res14: Array[Int] = Array(2, 3, 4, 5, 6, 7)

----------------------------------------------
scala> val rdd11 = rdd1.map((x : Int) => { x + 1 })
rdd11: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at map at <console>:28

scala> val rdd13 = rdd1.map(x => x + 1)


rdd13: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at map at <console>:28

scala> val rdd14 = rdd1.map(_ + 1)


rdd14: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at map at <console>:28

scala> rdd11.collect
res15: Array[Int] = Array(2, 3, 4, 5, 6, 7)

scala> rdd13.collect
res16: Array[Int] = Array(2, 3, 4, 5, 6, 7)

scala> rdd14.collect
res17: Array[Int] = Array(2, 3, 4, 5, 6, 7)

----------------------------------------------
val path = "file:///home/orienit/work/input/demoinput"
val rdd = sc.textFile(path)

----------------------------------------------

scala> val path = "file:///home/orienit/work/input/demoinput"


path: String = file:///home/orienit/work/input/demoinput

scala> val rdd = sc.textFile(path)


rdd: org.apache.spark.rdd.RDD[String] = file:///home/orienit/work/input/demoinput
MapPartitionsRDD[18] at textFile at <console>:26

scala> rdd.getNumPartitions
res18: Int = 2

scala> val rdd = sc.textFile(path, 1)


rdd: org.apache.spark.rdd.RDD[String] = file:///home/orienit/work/input/demoinput
MapPartitionsRDD[20] at textFile at <console>:26

scala> rdd.getNumPartitions
res19: Int = 1

scala> rdd.collect()
res20: Array[String] = Array(I am going, to hyd, I am learning, hadoop course)

scala> rdd.collect().foreach(println)
I am going
to hyd
I am learning
hadoop course

----------------------------------------------
Word Count in Spark:
----------------------------------------------

val input = "file:///home/orienit/work/input/demoinput"


val output = "file:///home/orienit/work/output/wordcount"

val file = sc.textFile(input, 1)

val words = file.flatMap(line => line.split(" "))

val tuples = words.map(word => (word, 1))

val wordcount = tuples.reduceByKey((a,b) => a + b)

val sorted = wordcount.sortByKey()

sorted.saveAsTextFile(output)
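The same pipeline can be sketched with plain Scala collections to see what each step produces; the sample lines below match the `demoinput` file shown earlier, and `groupBy` + sum plays the role of `reduceByKey` on a local collection:

```scala
// sample lines matching the demoinput file shown earlier
val lines = List("I am going", "to hyd", "I am learning", "hadoop course")

val words = lines.flatMap(line => line.split(" "))
val tuples = words.map(word => (word, 1))

// groupBy + sum stands in for reduceByKey on a local collection
val wordcount = tuples
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

val sorted = wordcount.toList.sortBy { case (word, _) => word }
sorted.foreach(println)   // (I,2), (am,2), (course,1), ...
```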

----------------------------------------------

val input = "file:///home/orienit/work/input/demoinput"


val output = "file:///home/orienit/work/output/wordcount1"

val file = sc.textFile(input, 1)

val sorted = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a,b) => a + b).sortByKey()

sorted.saveAsTextFile(output)

----------------------------------------------