
Table of Contents

1. Prerequisite
2. Exercise 1: Spark Installation: Windows
3. Install Spark in CentOS Linux – 60 Minutes (D)
4. Analysis Using Spark - Overview
5. Exercise: Spark Installation - Red Hat Linux
6. Exploring DataFrames using pyspark – 35 Minutes
7. Working with DataFrames and Schemas – 30 Minutes
8. Analyzing Data with DataFrame Queries – 40 Minutes
9. Interactive Analysis with pySpark – 30 Minutes
10. Transformation and Action – RDD – 45 Minutes
11. Work with Pair RDD – 30 Minutes
12. Create & Explore PairRDD – 60 Minutes
13. Spark SQL – 30 Minutes
14. Applications (Transformation) – Python – 30 Minutes
15. Zeppelin Installation
16. Using the Python Interpreter
18. PySpark Example – Zeppelin
19. Installing Jupyter for Spark – 35 Minutes (D)
20. Spark Standalone Cluster (VM) – 60 Minutes
21. Spark Two Nodes Cluster Using Docker – 90 Minutes
22. Launching on a Cluster: Hadoop YARN – 150 Minutes
23. Jobs Monitoring: Using Web UI – 45 Minutes
24. Spark Streaming with Socket – 45 Minutes
25. Spark Streaming with Kafka – 60 Minutes
26. Streaming with State Operations - Python – 60 Minutes
27. Spark – Hive Integration – 45 Minutes
28. Spark Integration with Hive Cluster (Local Mode)
29. Annexure & Errata:
    - Unable to start spark shell with the following error
    - Issue: Cannot assign requested address
    - Unable to start spark shell: Caused by: java.io.IOException: Error accessing /opt/jdk/jre/lib/ext/._cld*.jar (resolved by removing all the hidden ._* files; list them with #ls -alt and remove them)
    - Yum repo config

Last Updated: 12 Feb 2022


1. Prerequisite

Using Docker:
Create the first container:

Let us create a common network for our nodes.

#docker network create --driver bridge spark-net

#docker run -it --name spark0 --hostname spark0 --privileged --network spark-net \
  -v /Volumes/Samsung_T5/software/:/Software \
  -v /Volumes/Samsung_T5/software/install/:/opt \
  -v /Volumes/Samsung_T5/software/data/:/data \
  -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 -p 8888:8888 \
  centos:7 /usr/sbin/init

Download the required software

Apache Spark – Version 3.0.1 – Prebuilt for Apache Hadoop 3.2 and later.
File : spark-3.0.1-bin-hadoop3.2.tgz / spark-3.2.1-bin-hadoop3.2.tgz
Url : https://spark.apache.org/downloads.html

Extract all the software into the /opt folder.

JDK installation (jdk-8u45-linux-x64.tar.gz)


https://www.oracle.com/in/java/technologies/javase/javase8-archive-downloads.html
2. Exercise 1: Spark Installation: Windows:

You need Java installed and available on your system PATH, or the JAVA_HOME environment
variable pointing to a Java installation.

Spark runs on Java 6+ and Python 2.6+.


Unzip spark-1.3.0-bin-hadoop2.4.tar to D:\MyExperiment.
Install sbt by running sbt-0.13.8.msi.
Untar scala-2.11.6.

Set the SCALA_HOME environment variable and add Scala's bin directory to the PATH variable.

Change to the installation directory (D:\MyExperiment\spark-1.3), cd into bin, and execute:

spark-shell

Create a data set of the integers 1…10000:
val data = 1 to 10000
Open the Spark web UI at http://ht:4040/jobs/ (ht is the hostname of the machine running the shell).
3. Install Spark in CentOS Linux – 60 Minutes (D)

Extract the jdk and rename the folder to jdk.

#tar -xvf jdk-8u45-linux-x64.tar.gz -C /opt


#mv jdk* jdk

Extract spark-X.X.X-bin-hadoopX.X.tgz to /opt.

# tar -xvf spark-X.X.X-bin-hadoopX.X.tgz -C /opt/

It should create a folder named spark-X.X.X-bin-hadoopX.X; rename it to spark.

#mv spark-X.X.X-bin-hadoopX.X spark

Set the JAVA_HOME and initialize the PATH variable.

Open the profile and update with the following statements.

#vi ~/.bashrc
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:/opt/spark/bin

Install Python version 3.6.

Python 3.6 or above is required to run PySpark programs.

#yum install -y python3

Type bash to reload the profile.

Go to the installation directory and start the shell:

# pyspark
Enter the following code in the pyspark console to count the occurrences of each character.

# Code Begins. **********************************************

# In the first line we import the add operator from the Python standard library.

from operator import add

# Next we create an RDD from the "Hello World" string:

data = sc.parallelize(list("Hello World"))

# Here we have used the object sc; sc is the SparkContext object created by pyspark
# before showing the console.
# The parallelize() function is used to create an RDD from the string. RDD stands for Resilient
# Distributed Dataset, a data set that is distributed across the Spark cluster, where it is processed.
# Now, with the following example, we count each character and print the result on the console.

counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()

for (word, count) in counts:
    print("{}: {}".format(word, count))

# Code Ends. *********************************************

You should get the result as shown below:


The web UI of the Spark shell can be accessed using the following URL. You can click on each tab
and verify the screens.

http://ht:4040/jobs/

Further details on this web UI will be discussed later.

You have successfully installed spark for local development.


Note: You can start pyspark using all available cores with the following command.
pyspark --master local[*]
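As a quick sanity check inside the pyspark console, a minimal sketch using attributes of the predefined SparkContext:

sc.version              # Spark version in use
sc.master               # e.g. local[*] when started with --master local[*]
sc.defaultParallelism   # number of cores Spark uses by default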
----------------------------------------Lab Ends Here -----------------------------------
4. Analysis Using Spark - Overview
Demonstrate this after the installation lab.
This tutorial demonstrates how Spark can be used for data analytics - an overview.
Scenario - Upload a web access log and find the following:
Which IP requests URLs the most?
What is the maximum number of bytes returned by the server?
How many requests were received from each client?

Let us initialize the log entries that will be used for the analysis.


%pyspark
data = [('64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846'),
        ('64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523'),
        ('94.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851')]

Convert the log entries into an RDD.

%pyspark
rdd = spark.sparkContext.parallelize(data)

Let us have a look at what the RDD looks like.
%pyspark
rdd.collect()
['64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846', '64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523', '94.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851']

I would like to access the columns by name, so let us map the RDD records to Rows with named fields.

%pyspark
from pyspark.sql import Row
parts = rdd.map(lambda l: l.split(" "))
log = parts.map(lambda p: Row(ip=p[0], ts=p[3], atype=p[5], url=p[6], code=p[8], byte=p[9]))

Have I split the record fields correctly? Let us review the transformation
%pyspark
parts.collect()
[['64.242.88.10', '-', '-', '[07/Mar/2004:16:05:49', '-0800]', '"GET', '/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables', 'HTTP/1.1"', '401', '12846'], ['64.242.88.10', '-', '-', '[07/Mar/2004:16:06:51', '-0800]', '"GET', '/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2', 'HTTP/1.1"', '200', '4523'], ['94.242.88.10', '-', '-', '[07/Mar/2004:16:30:29', '-0800]', '"GET', '/twiki/bin/attach/Main/OfficeLocations', 'HTTP/1.1"', '401', '12851']]

I would like to use the DataFrame API, so let us convert the RDD to a DataFrame.
%pyspark
# Infer the schema, and register the DataFrame as a table.
mylog = spark.createDataFrame(log)

Have a view of our DF!


%pyspark
mylog.take(2)
[Row(atype=u'"GET', byte=u'12846', code=u'401', ip=u'64.242.88.10', ts=u'[07/Mar/2004:16:05:49', url=u'/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables'), Row(atype=u'"GET', byte=u'4523', code=u'200', ip=u'64.242.88.10', ts=u'[07/Mar/2004:16:06:51', url=u'/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2')]

Now that we have the data in Spark, let us answer the queries defined in the first paragraph.

Which IP requests URLs the most?
%pyspark
from pyspark.sql.functions import col
mylog.groupBy("ip").count().orderBy(col("count").desc()).first()
Row(ip=u'64.242.88.10', count=2)

What is the maximum number of bytes returned by the server?
%pyspark
mylog.select("ip","byte").orderBy("byte").show()
mylog.select("ip","byte").orderBy("byte").first()
+------------+-----+
|          ip| byte|
+------------+-----+
|64.242.88.10|12846|
|94.242.88.10|12851|
|64.242.88.10| 4523|
+------------+-----+
Row(ip=u'64.242.88.10', byte=u'12846')
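Note that byte was inferred as a string, so the ordering above is lexicographic rather than numeric. A minimal sketch of a numeric version, assuming the same mylog DataFrame (byte_int is just an illustrative column name):

%pyspark
from pyspark.sql.functions import col
# Cast the string column to an integer before ordering descending, then take the first row.
mylog.withColumn("byte_int", col("byte").cast("int")).orderBy(col("byte_int").desc()).select("ip", "byte_int").first()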

How many requests were received from each client?
%pyspark
mylog.select("ip","url").groupBy("ip").count().show()
+------------+-----+
|          ip|count|
+------------+-----+
|94.242.88.10|    1|
|64.242.88.10|    2|
+------------+-----+

--------------------- Lab Ends Here ------------------------------------


5. Exercise: Spark Installation - Red Hat Linux.

Goals: To install Spark on redhat linux OS.

Ensure to use the latest software: spark-2.1.0-bin-hadoop2.7.tgz


You need to replace with the specific version of software wherever relevant.

Create a folder /spark and untar all the related software inside this folder only, so that the entire
installation lives in a single folder for better manageability.

#mkdir /spark

Untar spark-2.1.0-bin-hadoop2.7.tgz to /spark.

It should create one folder inside /spark.

Untar scala-2.11.6

tar -xvf scala-2.11.6.tgz -C /spark

Set the path and variable as follows: We are including Scala home and bin folder in the
path variable.

vi ~/.bashrc
export SCALA_HOME=/spark/scala-2.11.6
export PATH=$PATH:$SCALA_HOME/bin
Type bash

Install JDK as follows and set the PATH with environment as follows: (chose 32 or 64
bit depends on your server)

tar -xvf jdk-8u45-linux-x64.tar.gz -C /spark

export JAVA_HOME=/spark/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
Rename the installation folder as shown below. You need to change to the /spark directory
before issuing this command.

#mv spark-2.1.0-bin-hadoop2.7/ spark-2.1/

Change to the installation directory (/spark/spark-2.1), cd into bin, and execute:


./spark-shell

Enter the following command in the Scala console to create a data set of the integers 1…10000:
val data = 1 to 10000

http://ht:4040/jobs/
6. Exploring DataFrames using pyspark – 35 Minutes
Following features of Spark will be demonstrated here:

• Loading json file.


• Understand its schema
• Select required fields.
• Apply filter.
Start the pyspark shell.

#pyspark
Create a text file users.json, in the data folder, containing the sample data listed below:

{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Etienne", "pcode":"94104"}

Scala:

Initiate the spark-shell from the folder in which you created the above file.

// Read the users json file as a dataframe.


val usersDF = spark.read.json("users.json")

// Find out the schema of the uploaded file


usersDF.printSchema()

As shown above, three fields will be displayed according to the json fields specified in
the text file.

//Let us find out the first 3 records to have a sample data.


val users = usersDF.take(3)
print(users)
usersDF.show()

Out of the three fields, we are interested only in the name and age fields. So let us create a
dataframe with only these two fields and apply a filter expression so that only people older
than 20 years remain in the dataframe.

nameAgeDF = usersDF.select("name","age")
nameAgeOver20DF = nameAgeDF.where("age > 20")
nameAgeOver20DF.show()
usersDF.select("name","age").where("age > 20").show()

You can also combine the functions as shown above. You will get the same result.

Zeppelin Output.
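The same filter can also be written with a column expression instead of a SQL string; a minimal sketch, assuming the same usersDF loaded in the pyspark shell:

usersDF.select("name", "age").where(usersDF.age > 20).show()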
----------------------------------- Lab Ends Here---------------------------------------------------
7. Working with DataFrames and Schemas – 30 Minutes
You will understand the following:

• Defining schema and mapping to dataframe while loading json file.


• Saving the dataframe to json file.

Create an input text file /software/people.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

Execute the following in the Spark-shell.

Import and define the Structure.


from pyspark.sql.types import *
columnsList = [
StructField("pcode", StringType()),
StructField("lastName", StringType()),
StructField("firstName", StringType()),
StructField("age", IntegerType())]
peopleSchema = StructType(columnsList)

# Specify the schema along with the loading instruction.


usersDF =
spark.read.option("header","true").schema(peopleSchema).csv("people.csv")

usersDF.printSchema()

# As shown above, age is an integer; it would have been mapped to a string by default if it
# were not defined in the schema.
nameAgeDF = usersDF.select("firstName","age")
nameAgeDF.show()

You can save the dataframe consisting of First Name and age to a file.

# nameAgeDF.write.json("age.json")

Open a terminal and verify the output. A folder named age.json will be created, and the output
files will be inside it.
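One way to confirm the save from the shell itself (a minimal sketch, assuming age.json was written to the current working directory) is to read it back:

spark.read.json("age.json").show()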

------------ Lab Ends Here --------------


8. Analyzing Data with DataFrame Queries – 40 Minutes
We will understand the following in this lab:
• Join
• Accessing Columns in DF
Use the following data for this lab:

/software/people.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48
Load the people data and fetch column, age in the
following way:

peopleDF = spark.read.option("header","true").csv("people.csv")
peopleDF["age"]
peopleDF.age
peopleDF.select(peopleDF["age"]).show()
Manipulate the age column, i.e., multiply age by 10.

peopleDF.select("lastName", peopleDF.age * 10).show()

You can chain the Queries.

peopleDF.select("lastName",(peopleDF.age * 10).alias("age_10")).show()

Perform aggregation:

peopleDF.groupBy("pcode").count().show()
Next let us join two dataframes:

/software/people-no-pcode.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

/software/pcodes.csv

pcode,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL
Load the people and code files in DF.

peopleDF = spark.read.option("header","true").csv("people-no-pcode.csv")
pcodesDF = spark.read.option("header","true").csv("pcodes.csv")

Perform an inner join on pcode.

peopleDF.join(pcodesDF, "pcode").show()
Perform the Left outer join

peopleDF.join(pcodesDF,peopleDF["pcode"] == pcodesDF.pcode, "left_outer").show()

You can see a null value in the pcode column of the second row.
Joining on Columns with Different Names

/software/zcodes.csv

zip,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL

Join on the pcode of the first file and the zip of the second file.

zcodesDF = spark.read.option("header","true").csv("zcodes.csv")
peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip).show()
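Both the pcode and zip columns appear in the joined result; a minimal sketch for keeping only one of them, assuming the same DataFrames:

peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip).drop(zcodesDF.zip).show()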

---------------------------- Lab Ends Here ------------------------


9. Interactive Analysis with pySpark – 30 Minutes

Optional: export PYSPARK_PYTHON=python3.6 (in case you want to change the Python version;
you can also set it in the bin/pyspark script or the load-spark-env.sh file).

Start the shell by running the following in the Spark directory:
./pyspark
textFile = sc.textFile("README.md")
textFile.count()
textFile.first()
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()

Create RDD from Collection.

myData = ["Alice","Carlos","Frank","Barbara"]
myRDD = sc.parallelize(myData)

for make in myRDD.collect(): print(make)

for make in myRDD.take(2): print(make)
myRDD.saveAsTextFile("mydata/")

You can verify the data in the folder mydata/
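Alternatively, a minimal sketch for reading the saved files back into an RDD from within the shell (assuming the mydata/ folder created above):

for line in sc.textFile("mydata/").collect(): print(line)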


Add one more RDD
myData1 = ["Ram","Shyam","Krishna","Vishnu"]
myRDD1 = sc.parallelize(myData1)

create a new RDD by combining the second RDD to the previous RDD.

myNames = myRDD.union(myRDD1)

Collect and display the contents of the new myNames RDD.

for name in myNames.collect(): print(name)

Lab Ends Here --------------------------------------------------------------------------------


10. Transformation and Action – RDD – 45 Minutes
Let’s make a new RDD from the text of the README file in the Spark
source directory: (SPARK_HOME/README.md)

textFile = sc.textFile("README.md")

Let’s start with a few actions:


textFile.count()  # Number of items in this RDD

textFile.first()  # First item in this RDD


Now let’s use a transformation. We will use the filter transformation
to return a new RDD with a subset of the items in the file.
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.collect()

We can chain together transformations and actions:


textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?

Let's say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement
MapReduce flows easily:
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

Here, we combined
the flatMap, map and reduceByKey transformations to compute the
per-word counts in the file as an RDD of (String, Int) pairs. To
collect the word counts in our shell, we can use the collect action:
wordCounts.collect()
let’s mark our linesWithSpark dataset to be cached:

linesWithSpark.cache()
linesWithSpark.count()

Next, we will convert RDD to Dataframe.

Create a file /opt/data/people.txt with the following content:

02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

from pyspark.sql.types import *

from pyspark.sql import Row

rowRDD = sc.textFile("/opt/data/people.txt").map(lambda line: line.split(",")).map(lambda values: Row(values[0], values[1], values[2], values[3]))

# Display two records of the RDD.

rowRDD.take(2)

# Convert the RDD to a DataFrame.

myDF = rowRDD.toDF()

# Display two records of the DataFrame.

myDF.show(2)
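The conversion above produces default column names (_1, _2, ...); a minimal sketch for attaching meaningful names while converting, assuming the same rowRDD (the column names here are illustrative):

namedDF = rowRDD.toDF(["pcode", "lastName", "firstName", "age"])
namedDF.printSchema()
namedDF.show(2)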
-------------------------- Lab Ends Here ----------------------------------------------------
11. Work with Pair RDD – 30 Minutes
Lab Overview

In this activity, you will load SFPD data from a CSV file. You will create pair RDD and apply pair RDD
operations to explore the data.

Scenario

Our dataset is a .csv file that consists of SFPD incident data from SF OpenData
(https://data.sfgov.org/). For each incident, we have the following information:

Field         Description                           Example Value
IncidentNum   Incident number                       150561637
Category      Category of incident                  NON-CRIMINAL
Descript      Description of incident               FOUND_PROPERTY
DayOfWeek     Day of week that incident occurred    Sunday
Date          Date of incident                      6/28/15
Time          Time of incident                      23:50
PdDistrict    Police Department District            TARAVAL
Resolution    Resolution                            NONE
Address       Address                               1300_Block_of_LA_PLAYA_ST
X             X-coordinate of location              -122.5091348
Y             Y-coordinate of location              37.76119777
PdID          Department ID                         15056163704013

The dataset has been modified to decrease the size of the files and also to make it easier to use. We use
this same dataset for all the labs in this course.

Load Data into Apache Spark

Objectives

• Launch the Spark interactive shell

• Load data into Spark

• Explore data in Apache Spark

4.1.1 Launch the Spark Interactive Shell

The Spark interactive shell is available in Scala or Python.

1. To launch the Interactive Shell, at the command line, run the following command:

pyspark --master local[2]

4.1.2. Load Data into Spark


(copy all data files in sparkcluster)

To load the data we are going to use the SparkContext method textFile. The SparkContext is
available in the interactive shell as the variable sc. We also want to split the file by the separator “,”.

1. We define the mapping for our input variables. While this isn’t a necessary step, it makes it easier to refer to
the different fields by names.

IncidntNum = 0
Category = 1
Descript = 2
DayOfWeek = 3
Date = 4
Time = 5
PdDistrict = 6
Resolution = 7
Address = 8
X = 9
Y = 10
PdId = 11

2. To load data into Spark, at the pyspark prompt:

sfpdRDD = sc.textFile("sfpd.csv").map(lambda line: line.split(","))

Lab 4.1.3 Explore data using RDD operations


What transformations and actions would you use in each case? Complete the command with the appropriate
transformations and actions.

1. How do you see the first element of the inputRDD (sfpdRDD)?

sfpdRDD.first()

2. What do you use to see the first 5 elements of the RDD?

sfpdRDD.take(5)

3. What is the total number of incidents?

totincs = sfpdRDD.count()
print(totincs)

___________________________________________________________________

4. What is the total number of distinct resolutions?

totres = sfpdRDD.map(lambda inc : inc[7]).distinct().count()


totres

__
_________________________________________________________________

5. List the distinct PdDistricts.

totresdistricts = sfpdRDD.map(lambda inc: inc[6]).distinct().collect()

totresdistricts

______________________________


12. Create & Explore PairRDD – 60 Minutes

In the previous activity we explored the data in the sfpdRDD. We used RDD operations. In
this activity, we will create pairRDD to find answers to questions about the data.

Objectives

• Create pairRDD & apply pairRDD operations


• Join pairRDD

Create pair RDD & apply pair RDD operations

Start the spark shell if not done else skip the next command.

#pyspark

# sfpdRDD = sc.textFile("sfpd.csv").map(lambda rec: rec.split(","))

1. Which five districts have the highest incidents?


Step to be follow (Hints)


a. Use a map transformation to create pair RDD from sfpdRDD of the form: [(PdDistrict, 1)]

b. Use reduceByKey(lambda a, b: a + b) to get the count of incidents in each district. Your
result will be a pairRDD of the form [(PdDistrict, count)]

c. Use map again to get a pairRDD of the form [(count, PdDistrict)]

d. Use sortByKey(False) on [(count, PdDistrict)]

e. Use take(5) to get the top 5.


Answer:

top5Dists = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[6], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(5)

The filter validates proper records, i.e. each record should have exactly 14 fields.
2. Which five addresses have the highest incidents?
a. Create pairRDD (map)
b. Get the count for each key (reduceByKey)
c. Pair RDD with key and count switched (map)
d. Sort in descending order (sortByKey)

e. Use take(5) to get the top 5

Answer:

top5Adds = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[8], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(5)

top5Adds

3. What are the top three categories of incidents?

Answer:

top3Cat = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[1], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(3)

top3Cat

4. What is the count of incidents by district?


Answer:

num_inc_dist = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[6], 1)) \
    .countByKey()

num_inc_dist

Join Pair RDDs - TBD

This activity illustrates how joins work in Spark (python). There are two small datasets
provided for this activity - J_AddCat.csv and J_AddDist.csv.

J_AddCat.csv - Category; Address                      J_AddDist.csv - PdDistrict; Address

1.  EMBEZZLEMENT  100_Block_of_JEFFERSON_ST           1. INGLESIDE  100_Block_of_ROME_ST
2.  EMBEZZLEMENT  SUTTER_ST/MASON_ST                  2. SOUTHERN   0_Block_of_SOUTHPARK_AV
3.  BRIBERY       0_Block_of_SOUTHPARK_AV             3. MISSION    1900_Block_of_MISSION_ST
4.  EMBEZZLEMENT  1500_Block_of_15TH_ST               4. RICHMOND   1400_Block_of_CLEMENT_ST
5.  EMBEZZLEMENT  200_Block_of_BUSH_ST                5. SOUTHERN   100_Block_of_BLUXOME_ST
6.  BRIBERY       1900_Block_of_MISSION_ST            6. SOUTHERN   300_Block_of_BERRY_ST
7.  EMBEZZLEMENT  800_Block_of_MARKET_ST              7. BAYVIEW    1400_Block_of_VANDYKE_AV
8.  EMBEZZLEMENT  2000_Block_of_MARKET_ST             8. SOUTHERN   1100_Block_of_MISSION_ST
9.  BRIBERY       1400_Block_of_CLEMENT_ST            9. TARAVAL    0_Block_of_CHUMASERO_DR
10. EMBEZZLEMENT  1600_Block_of_FILLMORE_ST
11. EMBEZZLEMENT  1100_Block_of_SELBY_ST
12. BAD CHECKS    100_Block_of_BLUXOME_ST
Based on the data above, answer the following questions:

1. Given these two datasets, you want to find the type of incident and district for each address. What
is one way of doing this? (HINT: An operation on pairs or pairRDDs)

What should the keys be for the two pairRDDs?


[You can use joins on pairs or pairRDD to get the information. The key for the pairRDD
is the address.]

2. What is the size of the resulting dataset from a join? Why?

[A join is the same as an inner join and only keys that are present in both RDDs are output. If you
compare the addresses in both the datasets, you find that there are five addresses in common and
they are unique. Thus the resulting dataset will contain five elements. If there are multiple values for
the same key, the resulting RDD will have an entry for every possible pair of values with that key from
the two RDDs.]

3. If you did a right outer join on the two datasets with Address/Category being the source RDD,

what would be the size of the resulting dataset? Why?

[A right outer join results in a pair RDD that has entries for each key in the other pairRDD. If the
source RDD contains data from J_AddCat.csv and the “other” RDD is represented by J_AddDist.csv,
then since “other” RDD has 9 distinct addresses, the size of the result of a right outer join is 9.]

4. If you did a left outer join on the two datasets with Address/Category being the source RDD,

what would be the size of the resulting dataset? Why?


[A left outer join results in a pair RDD that has entries for each key in the source pairRDD. If
the source RDD contains data from J_AddCat.csv and the "other" RDD is represented by
J_AddDist.csv, then since the "source" RDD has 13 distinct addresses, the size of the result of a left
outer join is 13.]
5. Load each dataset into separate pairRDDs with “address” being the key.

# catAdd = sc.textFile("/opt/data/J_AddCat.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))
# distAdd = sc.textFile("/opt/data/J_AddDist.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))


6. List the incident category and district for those addresses that have both category and district
information. Verify that the size estimated earlier is correct.
catJdist = catAdd.join(distAdd)
catJdist.collect()
catJdist.count()
catJdist.take(5)


7. List the incident category and district for all addresses irrespective of whether each address has
category and district information.
catJdist1 = catAdd.leftOuterJoin(distAdd)
catJdist1.collect()
catJdist1.count()


8. List the incident district and category for all addresses irrespective of whether each address has
category and district information. Verify that the size estimated earlier is correct.
catJdist2 = catAdd.rightOuterJoin(distAdd)
catJdist2.collect()
catJdist2.count()


Explore Partitioning

In this activity we see how to determine the number of partitions, the type of partitioner and how
to specify partitions in a transformation.

Objective

• Explore partitioning in RDDs

Explore partitioning in RDDs

Note: Ensure that you have started the spark shell with --master local[*] ( Refer the First Lab)

1. How many partitions are there in the sfpdRDD?


sfpdRDD.getNumPartitions()

2. How do you find the type of partitioner for sfpdRDD?


sfpdRDD.partitioner


The number of partitions will depend on the number of cores; in my case it is 2 cores, so 2 partitions.
If there is no partitioner, the partitioning is not based on any characteristic of the data; the distribution is random and roughly uniform
across nodes.

3. Create a pair RDD, keeping only records that have exactly 14 fields.

incByDists = sfpdRDD.filter(lambda rec: len(rec) == 14).map(lambda incident: (incident[6], 1)).reduceByKey(lambda x, y: x + y)


How many partitions does incByDists have?

incByDists.getNumPartitions()

What type of partitioner does incByDists have?


print(incByDists.partitioner)

By default PySpark implementation uses hash partitioning as the partitioning function.
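A minimal sketch of controlling the partition count explicitly, assuming the same sfpdRDD; the second argument to reduceByKey sets the number of partitions (incByDists4 is just an illustrative name):

# Build the same (district, count) pairs, but force the result into 4 partitions.
incByDists4 = sfpdRDD.filter(lambda rec: len(rec) == 14).map(lambda incident: (incident[6], 1)).reduceByKey(lambda x, y: x + y, 4)
incByDists4.getNumPartitions()
print(incByDists4.partitioner)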


4. Add a map
inc_map = incByDists.map(lambda t: (t[1], t[0]))
How many partitions does inc_map have? inc_map.getNumPartitions()

What type of partitioner does inc_map have? print(inc_map.partitioner)


5. Add groupByKey
inc_group = sfpdRDD.map(lambda incident: (incident[6],1)).groupByKey()
What type of partitioner does inc_group have? inc_group.partitioner


6. Create two pairRDD


catAdd = sc.textFile("/opt/data/J_AddCat.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))
distAdd = sc.textFile("/opt/data/J_AddDist.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))

7. You can specify the number of partitions when you use the join operation.
catJdist = catAdd.join(distAdd,8)
How many partitions does the joined RDD have? catJdist.getNumPartitions()

What type of partitioner does catJdist have? catJdist.partitioner


Word Count Using Pair.

Finally, let us perform the word count application using the Pair data model.

counts = sc.textFile("/opt/data/J_AddCat.csv").flatMap(lambda line : line.split(',')).map(lambda word : (word,1))


result = counts.reduceByKey(lambda v1,v2 : v1+v2)
result.collect()


Errata: Lambda parameter errors occur because Python 3 does not support tuple parameters in lambdas (e.g. lambda (k, v): ...). Rewrite such lambdas to take a single argument and index into it, as shown in the answers above.

----------------------------- Lab Ends Here ----------------------------------------------------------------


13. Spark SQL – 30 Minutes

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as
kwargs to the Row class.

Copy the person.txt file to /software folder.

Henry,42
Rajnita,40
Henderson,14
Tiraj,5

Open a terminal from /software folder and enter the following command.

Using Spark SQL – RDD data structure. (Inferring the Schema Using Reflection)

#pyspark


# Create an RDD of Person objects and register it as a table.

from pyspark.sql import Row

peopleL = sc.textFile("person.txt")
peopleL.take(4)

parts = peopleL.map(lambda l: l.split(","))

people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.


schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL statements can be run by using the sql method provided by the SparkSession.
# SQL can be run over DataFrames that have been registered as a table.
kids = spark.sql("SELECT name FROM people WHERE age >= 1 AND age <= 9")

# The results of SQL queries are DataFrames and support all the normal RDD operations.
# rdd returns the content as a pyspark.RDD of Row.
kidNames = kids.rdd.map(lambda p: "Name: " + p.name).collect()
for name in kidNames:
    print(name)


Create a temporary view:

# Register the DataFrame as a SQL temporary view


schemaPeople.createTempView("kids")

Display the output.

spark.sql("select * from kids").show()


Show the Tables or View List.

spark.catalog.listTables('default')

or

spark.sql('show tables from default').show()


Using DataFrame and SQL:

/software/people.json

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

peopleDF = spark.read.format("json").load("people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
peopleDF.select("name","age").show()


Query file directly with SQL.

sqlDF = spark.sql("SELECT * FROM parquet.`namesAndAges.parquet`")

sqlDF.show()

Perform some operations on the Dataframe.

In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']).


#convert the people RDD to Dataframe


df = people.toDF()

# Select everybody, but increment the age by 1


df.select(df['name'], df['age'] + 1).show()

# Select people older than 21


df.filter(df['age'] > 21).show()

# Count people by age


df.groupBy("age").count().show()


Using aggregate functions:

Determine the min and max age.

spark.sql("select min(age), max(age) from kids").show()


-------------------------------- Lab Ends Here ---------------------------------------------------------------------------


14. Applications ( Transformation ) – Python – 30 Minutes


In this lab, let us develop a Python application and submit it to Spark for execution. Ensure that you have created a folder
/spark/workspace for storing all the Python code.

Let us create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "/opt/spark/README.md" # Should be some file on your system


sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()


numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll
need to replace the logFile path with the location where Spark is installed on your machine.

We can run this application using the bin/spark-submit script:

# Use spark-submit to run your application, change directory to /spark before executing the following command

$ spark-submit --master local[2] workspace/SimpleApp.py


You can view the job status at the following URL while the application is running:

http://192.168.150.128:4040/jobs/

In this way, you can deploy any python application in the spark cluster.

----------------------- Lab Ends Here ------------------------------------------


15. Zeppelin Installation

You don't require a separate Spark installation in case you are using zeppelin-0.6.2-bin-all.tgz.
Create a zeppelin user and switch to it, or if the zeppelin user is already created, then log in as zeppelin.
The password should be life213.
Use root credentials
groupadd hadoop
useradd -g hadoop zeppelin
passwd zeppelin

tar -xvf zeppelin-0.7.3-bin-all.tgz -C /apps

Change the folder owner/group to zeppelin


Note : chgrp -R hadoop zeppelin*
chown zeppelin:hadoop /apps/zeppelin* -R


su - zeppelin
whoami

set the JAVA_HOME [vi ~/.bashrc]


export JAVA_HOME=/spark/jdk1.8.0_45
export PATH=:$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin


#cp zeppelin-site.xml.template zeppelin-site.xml


Start Zeppelin:
cd /apps/zeppelin-0.7.3-bin-all
bin/zeppelin-daemon.sh start
jps
Note: You are not required to start the Spark cluster before starting Zeppelin.


http://master:8081/#/


Stop Zeppelin
bin/zeppelin-daemon.sh stop

Let us connect Zeppelin with the existing Spark Cluster. You can start Spark in a single standalone cluster or Use
any existing Spark Cluster.

In this lab let us use the single standalone spark cluster,


Let us set the Spark metastore to be stored in a Derby network server so that more than one user can connect using the Spark
shell.

Unzip the derby database and start as below: You can use zeppelin user account to start the derby.

#tar -xzf db-derby-10.14.1.0-bin.tar.gz -C /apps


#mkdir /apps/db-derby-10.14.1.0-bin/data
#nohup /apps/db-derby-10.14.1.0-bin/bin/startNetworkServer -h 0.0.0.0 &

After the Derby server is started, you need to modify Spark to refer to the above database server. For this, create a file
hive-site.xml in the SPARK_HOME/conf folder and paste the following content:

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://10.10.20.27:1527/metastore_db;create=true</value>
</property>


<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
</property>

</configuration>

Remove the derby jar from the Spark installation jar folder and replace with the derby client jar.

After that start the spark standalone cluster:

sbin/start-master.sh --webui-port 8081

At the end of the above step you should be able to access the Spark UI
http://10.10.20.27:8081/


Now we have configured the Spark standalone cluster. We need to have at least one worker node, as shown below.
Add hp.tos.com in the slaves file and start the slave with the following command:
# sh start-slave.sh spark://hp.tos.com:7077
http://10.10.20.27:8080/


Let us configure zeppelin to connect to it.

Stop Zeppelin if this was not done before. Use the zeppelin user ID to execute the following command:

#bin/zeppelin-daemon.sh stop


Open the zeppelin-env.sh file present in the $ZEPPELIN_HOME/conf directory and provide the configurations specified below.

export MASTER=spark://hp.tos.com:7077
export SPARK_HOME=/apps/spark-2.2.0

Save the above file.

Let us activate admin user for logon to zeppelin web UI.

#cd /apps/zeppelin-0.7.3/conf

#cp shiro.ini.template shiro.ini


update the password of admin as shown below in the above ini file.
admin = life213

start the zeppelin using zeppelin user ID.


#bin/zeppelin-daemon.sh start

Now open your Zeppelin dashboard, go to the list of interpreters, and search for the Spark interpreter.

Ensure that you modify the below parameters: You need to logon using admin/life213 credentials.

Interpreter → Spark → Edit


Save and then accept the prompt by clicking Ok button.

Create a New Notebook [Notebook → Create New Note → Enter Hello Spark as the name]

Enter sc and then click Run, which is in the right corner. If everything is OK, you should see output as in the above
screenshot. Otherwise, review the log file in the Zeppelin installation folder.

If you want to configure python3, update the following parameters.


------------------------------------------Lab Ends Here -----------------------------


16. Using the Python Interpreter


To ensure Zeppelin uses that version of Python when you use the Python interpreter, set the zeppelin.python setting
to the path to Anaconda.

Using the Python Interpreter


In a paragraph, use %python to select the Python interpreter and then input all commands.
The interpreter can only work if you already have Python installed (the interpreter doesn't bring its
own Python binaries).

To access the help, type help()


%spark.pyspark PySparkInterpreter Provides a Python environment

%spark.r SparkRInterpreter Provides an R environment with SparkR support

%spark.sql SparkSQLInterpreter Provides a SQL environment

%spark.dep DepInterpreter Dependency loader

You should set your PYSPARK_DRIVER_PYTHON environment variable so that Spark uses Anaconda. You can
get more information here:

https://spark.apache.org/docs/1.6.2/programming-guide.html

Export SPARK_HOME


In conf/zeppelin-env.sh, export SPARK_HOME environment variable with your Spark installation path.

For example,

export SPARK_HOME=/spark/spark-2.1.0

Install Anaconda2 and set in the Path Variable of root login. (vi ~/.bashrc)

export PATH="/spark/anaconda2/bin:$PATH"


18. PySpark-Example – Zeppelin.


%spark.pyspark
sc
words = sc.textFile('file:///spark/spark-2.1.0/README.md')

%pyspark
words.flatMap(lambda x: x.lower().split(' ')) \
.filter(lambda x: x.isalpha()).map(lambda x: (x, 1)) \
.reduceByKey(lambda a,b: a+b)

%pyspark
words.first()
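Note that the flatMap/reduceByKey chain above only defines transformations; nothing runs until an action is called. A minimal sketch that assigns the result and triggers execution, assuming the same words RDD (wordCounts is an illustrative name):

%pyspark
wordCounts = words.flatMap(lambda x: x.lower().split(' ')).filter(lambda x: x.isalpha()).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.take(10)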


19. Installing Jupyter for Spark – 35 Minutes.(D)

You can install Jupyter using pip or Anaconda.

Using pip – We will use this for our lab.

#yum install epel-release


#yum -y install python-pip

Using pip
# yum install python3-pip -y
#pip3 install --upgrade setuptools
#pip3 install --upgrade pip
#pip3 install jupyterlab

To configure the notebook, first generate the default configuration file:


#jupyter notebook --generate-config


To run Jupyter on a different host or port, edit the generated configuration file. The generate-config command reports where it was written:

[Writing default config to: /root/.jupyter/jupyter_notebook_config.py]

Update the following properties.


#vi ~/.jupyter/jupyter_notebook_config.py

c.NotebookApp.ip = 'spark0'
c.NotebookApp.open_browser = False
c.NotebookApp.allow_remote_access = True
c.NotebookApp.allow_root = True

If you want to change the port number, set c.NotebookApp.port in the same file. Then start the notebook:

#jupyter notebook

Copy the URL in the web browser.


Let us configure Notebook for Pyspark.


Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Restart your terminal and launch PySpark again:

$ pyspark

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on
‘New’ > ‘Notebooks Python [default]’.

Copy and paste Pi calculation script and run it by pressing Shift + Enter.


# Code Begin

import random
num_samples = 100000000

def inside(p):
x, y = random.random(), random.random()
return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = 4 * count / num_samples
print(pi)

sc.stop()

# Code Ends.
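The script above assumes sc is already defined, which is the case when the notebook is launched through pyspark as described. If you ever run it in a notebook started directly with jupyter, a minimal sketch for creating the context yourself (local[*] and the application name are assumptions):

from pyspark.sql import SparkSession

# Build a local SparkSession and take its SparkContext.
spark = SparkSession.builder.master("local[*]").appName("pi-notebook").getOrCreate()
sc = spark.sparkContext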


You have successfully configured pyspark on Jupyter.


Configure scala on Jupyter.

Option I: Using spylon – Execute Scala spark program.

#pip3 install spylon-kernel


#python3 -m spylon_kernel install
# export SPARK_HOME=/opt/spark
#jupyter notebook


Option 2: using Scala in Jupyter


https://almond.sh/docs/quick-start-install

Or using Anaconda (optional):

Anaconda 4.2.0 for Linux. Anaconda is BSD licensed, which gives you permission to use Anaconda commercially and for redistribution.
Download the installer and, optionally, verify its integrity with MD5 or SHA-256.
In your terminal window, type one of the commands below and follow the instructions:
Python 3.5 version: bash Anaconda3-4.2.0-Linux-x86_64.sh
Python 2.7 version: bash Anaconda2-4.2.0-Linux-x86_64.sh
NOTE: Include the "bash" command even if you are not using the bash shell.


#bash


Run the following command to install Jupyter.


#conda install jupyter


Anaconda 4.2.0 for Windows. Anaconda is BSD licensed, which gives you permission to use Anaconda commercially and for redistribution.
Download the installer and, optionally, verify its integrity with MD5 or SHA-256.
Double-click the .exe file to install Anaconda and follow the instructions on the screen.

Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific
computing and data science.
Congratulations, you have installed Jupyter Notebook.


Reference:
https://www.sicara.ai/blog/2017-05-02-get-started-pyspark-jupyter-notebook-3-minutes
https://community.hortonworks.com/articles/75551/installing-and-exploring-spark-20-with-jupyter-not.html
https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b

---------------------------------------------Lab Ends Here --------------------------------------------------


20. Spark Standalone Cluster(VM) – 60 Minutes.


You will learn to configure a two-node Spark standalone cluster.

Create another instance of the VM by copying the VM folder or making a clone. You need to shut down the VM for that.
If you use a clone, then specify as shown below. Hint: Select the parent virtual machine and select VM > Manage > Clone.

It will take a few minutes depending on your VM size.

At the end, you will get an additional VM as shown below (example).

You need to power on both VMs at this step.

Add another node, i.e. a slave node, to the cluster. Change the hostname of the slave node to slave.

Log on to the slave VM. Your slave node should be as shown below.

Ensure you provide entries in /etc/hosts of both VMs, so that they can communicate with each other using hostnames.
192.168.188.178 master
192.168.188.174 slave

Note: Replace the IP of your machine accordingly.


Log on to the first VM, i.e. master, to configure a password-less connection between the nodes.

SSH access
The root user on the master must be able to connect
• to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and
• to the root user account on the slave via a password-less SSH login.

You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).

The following steps will ensure that root user can ssh to its own account without password in all nodes.
Let us generate the keys for user root on the master node, to make sure that root user on master can ssh to slave
nodes without password.

$ su - root
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys


$ ssh master
You need to accept the key for the first time.
You can do this manually or use the following SSH command: to copy public key to all the slave nodes.

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave

Accept it, yes and supply the root password for the last time.


This command will prompt you for the login password for user root on slave, then copy the public SSH key for
you, creating the correct directory and fixing the permissions as necessary.

The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. This step is also needed to save slave‘s host key fingerprint to the root@master‘s known_hosts file.

So, connecting from master to master…


$ ssh master

And from master to slave.


$ ssh slave


Go to SPARK_HOME/conf/ and create a new file named spark-env.sh on the master node.
There will be a spark-env.sh.template in the same folder, and this file gives you details on how to declare various
environment variables.

Now we will give the IP address of Master.


SPARK_LOCAL_IP=192.168.188.173
SPARK_MASTER_HOST={IP Address}

// Note: You need to specify IP only.

On the master node, create SPARK_HOME/conf/slaves and enter the hostnames of both nodes (master and slave), one per line.

A worker process will be started on both nodes.


Then go to the Spark home directory and run the following command from the terminal:

#sh sbin/start-all.sh

This will start the standalone master and the worker processes.

http://master:8080/

You can access the Spark Master with the above URL.


Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers
to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which
is http://localhost:8080 by default.

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the
new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
In our case its having 2 nodes.

Then you can verify the Node using the master UI:


Let us execute Spark shell using Cluster mode:

You can perform from any client/server node i.e windows desktop client.

Connecting an Application to the Cluster

To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master as to
the SparkContext constructor.

To run an interactive Spark shell against the cluster (specify the master URL), run the following command from the
bin folder.
We are executing it from the slave node.

#pyspark --conf spark.executor.memory=512mb --conf spark.executor.cores=1 --master spark://master:7077

You can verify the application execution from the web UI


Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")


Let’s start with a few actions:


textFile.count()  # Number of items in this RDD


textFile.first()  # First item in this RDD

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?


Let's say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)

Stop the spark process. Hints [sbin/stop-all.sh]

----------------------------------------------- Lab Ends Here --------------------------------------------------------


21. Spark Two Nodes Cluster Using Docker – 90 Minutes


Prerequisites:

- Install Docker
- Pull Centos Image.

Steps for deploying two node spark cluster in docker:

- Create two containers using the above image


o Spark0
o Spark1
- Create a network to connect between these two nodes.

You should pull the following image from Docker Hub: centos:7.

Let us create a common network for our two nodes cluster.

#docker network create --driver bridge spark-net


Create first node, spark0 container

#docker run -it --name spark0 --hostname spark0 --privileged -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init

Docker command with Host folder mounted. (Skip)

# docker run -it --name spark0 --hostname spark0 --privileged --network spark-net -v /Volumes/Samsung_T5/software/:/Software -v /Volumes/Samsung_T5/software/install/:/opt -v /Volumes/Samsung_T5/software/data/:/data -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init

To connect a running container to an existing user-defined bridge, use the docker network connect command.
If spark0 was already started without the network attached, execute the following; otherwise skip this step.

$ docker network connect spark-net spark0

Start the first container:

#docker start spark0

Connect to you first node:

# docker exec -it spark0 /usr/bin/bash


Start the second node and perform the installation as specified in the first lab.

#docker run -dit --name spark1 -p 8082:8081 -p 4041:4040 --network spark-net --entrypoint
/bin/bash centos:7

or

Docker command with Host folder mounted. (Skip)

# docker run -it --name spark1 --hostname spark1 --privileged --network spark-net -v
/Users/henrypotsangbam/Documents/Docker:/opt -p 4041:4040 -p 8082:8081 centos:7 /usr/sbin/init

Confirm the Network status:

Inspect the network and find the IP addresses of the two containers

#docker network inspect spark-net


Ensure to substitute IPs accordingly.

Attach to the spark1 (worker) container and test its communication to the spark0 (master) container, using both
its IP address and its container name.
# ping spark0


Architecture

master – spark0 and Slave – spark1


Let us start the Spark master.

Install Spark; you can refer to the first lab.

Execute the following in Spark0

Setup a Spark master node

Change the master port to 8090; there is an issue when it runs on 7077 in the Docker environment.

Execute on the spark0 – Terminal.

#export PATH=$PATH:/opt/spark/bin
#export SPARK_MASTER_PORT=8090
#spark-class org.apache.spark.deploy.master.Master


Start a worker process on node 0, in a separate terminal.

To attach a worker to the cluster, execute the following in the spark0 container.

#export PATH=$PATH:/opt/spark/bin


#spark-class org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://172.17.0.2:8090

If it is unable to connect to localhost, replace it with the container IP or the container alias, i.e. spark0.

Refresh the web UI and ensure that you can see a worker as shown below.


http://127.0.0.1:8080


Start the worker on spark1.

Ensure that you have installed Spark on spark1 too.

#docker exec -it spark1 bash

# spark-class org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://172.17.0.2:8090

At the end of this step, you should have 2 worker nodes as shown below:


Open a pyspark shell and connect to the Spark cluster

Let us execute Spark shell using Cluster mode:

You can perform this step from any client or server node, e.g. a Windows desktop client.

Connecting an Application to the Cluster

To run an application on the Spark cluster, pass the spark://IP:PORT URL of the master to the SparkContext constructor.

To run an interactive Spark shell against the cluster (specify the master IP), run the following command from the
bin folder. We are executing it from the slave node.

#pyspark --conf spark.executor.memory=512mb --conf spark.executor.cores=1 --master spark://spark0:8090

You can verify the application execution from the web UI


Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")


Let’s start with a few actions:


textFile.count() # Number of items in this RDD


textFile.first() # First item in this RDD

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line : "Spark" in line).count() # How many lines contain "Spark"?


Let’s say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)

Refer the web UI for the Job details as shown below.


The console should display the result as shown below:


Hints (for submitting a job on the cluster):

You need to complete the Lab - Running a Spark Application before going ahead. Submit from the /software folder, since the input
file exists in that folder.

You need to copy the data files to both nodes. However, for our lab they are already present in the spark folder.
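
In case you have not completed that lab yet, a minimal SimpleApp.py might look like the following sketch; the script used in the referenced lab may differ, and the path assumes README.md is present under /opt/spark on every node:

"""SimpleApp.py - minimal sketch"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text("/opt/spark/README.md").cache()   # file must exist on every worker
print("Lines with a:", logData.filter(logData.value.contains("a")).count())
spark.stop()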

#spark-submit --master spark://spark0:8090 --conf spark.executor.memory=256mb workspace/SimpleApp.py

Verify the output.

---------------------------------------------------------------- Lab Ends Here ------------------------------------

https://towardsdatascience.com/diy-apache-spark-docker-bb4f11c10d24


22. Launching on a Cluster: Hadoop YARN – 150 Minutes


In this lab, you will deploy Spark Application on Hadoop Cluster.

Configure and install YARN.


One node will run YARN exclusively. Options for deploying Hadoop:
• Copy a VM and rename it to HadoopVM-YARN.
• Create another container.

Install YARN – On the New Node.

By now, you should have two VMs on your workstation, or two containers:

For me, the first VM hostname is hp.com and the second one is ht.com. Ensure that you follow the same nomenclature
to avoid confusion; it is very important.

Host/VM    Container    Remarks
hp.com     hadoop0      YARN services
ht.com     spark0       Spark client / Spark

Using Docker:
#docker run -it --name hadoop0 --privileged -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 8032:8032 -p
8188:8188 -p 8020:8020 --hostname hadoop0 centos:7 /usr/sbin/init

Verify the hostname and the /etc/hosts. You need to enter each IP and host name as shown above. Update details
accordingly in your local machine.


You should see the following in your window for ht.com. Change the IPs to match your system.

All the commands below should be executed on hp.com (the Hadoop node) unless specified otherwise. We are now
configuring the YARN cluster.

Change directory to the location where you have the software.

Download Location: (hadoop-3.2.2.tar.gz)


https://hadoop.apache.org/releases.html

Untar as follows:
tar -xvf hadoop-X.tar.gz -C /opt


tar -xvf jdk-8u40-linux-i586.tar.gz -C /opt

Rename the folder:

#mv hadoo* hadoop


#mv jdk* jdk
Install the JDK and set JAVA_HOME (copy the bin file into the YARN folder). To make JAVA_HOME available to all bash
users, make an entry in /etc/profile.d as follows:

echo "export JAVA_HOME=/opt/jdk/" > /etc/profile.d/java.sh

Unpack the downloaded Hadoop distribution. In the distribution, edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh to define some
parameters as follows:

# set to the root of your Java installation

export JAVA_HOME=/opt/jdk

# cd /opt/hadoop

Try the following command:

$ bin/hadoop


You need to modify some settings as follows; replace the hostnames accordingly.

Use the following:

etc/hadoop/core-site.xml:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop0:8020</value>
</property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Set up passphraseless SSH

Now check that you can ssh to localhost without a passphrase:


yum -y install openssh-server openssh-clients


systemctl start sshd

Change the password using the passwd command.

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 0600 ~/.ssh/authorized_keys

When running the Hadoop daemons as root, also export the following variables (for example in your shell profile or in hadoop-env.sh):

export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

Format the filesystem:

$ bin/hdfs namenode -format


Start NameNode daemon and DataNode daemon:

$ sbin/start-dfs.sh

1. Browse the web interface for the NameNode; by default it is available at:
   o NameNode - http://localhost:9870/
2. Make the HDFS directories required to execute MapReduce jobs:
   $ bin/hdfs dfs -mkdir /user
   $ bin/hdfs dfs -mkdir /user/root


You can run a MapReduce job on YARN in a pseudo-distributed mode

Start HDFS and YARN on the YARN Node: hp.com

1. Configure parameters as follows:

etc/hadoop/mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>

<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
</configuration>


etc/hadoop/yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HAD
OOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

2. Start ResourceManager daemon and NodeManager daemon:

$ sbin/start-yarn.sh


3. Browse the web interface for the ResourceManager; by default it is available at:

o ResourceManager - http://localhost:8088/

4. When you’re done, you can stop the daemons with: Optional.

$ sbin/stop-yarn.sh

#verify with jps , all the above services should be there.


jps


Hadoop Node - Let us set some environment and path variable as follows: ( vi ~/.bashrc)

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Let us create a temporary folder, that will be used for working space for the YARN cluster.
hadoop fs -mkdir /tmp
hadoop fs -chmod -R 1777 /tmp
hadoop fs -mkdir /tmp/in
hadoop fs -ls /tmp

You can get the README.md file from the Software folder, which is provided along with the training.
hadoop fs -copyFromLocal /opt/hadoop/README.txt /tmp/in
hadoop fs -ls /tmp/in

hadoop fs -mkdir /tmp/spark-events


Congrats! You have successfully configured the YARN cluster.

If you are using docker, join the hadoop0 and spark0 containers in a single network:

#docker network connect spark-net hadoop0

Verify it:
#docker network inspect spark-net


Start the Spark VM / spark container (i.e. ht.com) if it is not yet started, and issue the following commands.
Log on to the machine using telnet. You can use the VM shared-folder option to copy the file to the VM from your
workstation, or else copy it with the mouse from your workstation to the specified folder.

Let us configure hadoop client setting on the Spark VM --> ht.com or spark0 node.

Compress the Hadoop folder on the Hadoop0 instances or Hadoop node.


#cd /opt
#tar -cvzf hadoop.tar hadoop/

Copy the compressed hadoop folder to the spark0 or spark node and uncompressed in /opt folder.
[Docker:
copy from hadoop to host : docker cp hadoop0:/opt/hadoop.tar .
copy from host to spark : docker cp hadoop.tar spark0:/opt/]

#tar -xvf hadoop.tar

Update yarn-site.xml on the Spark Node.

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
  <description>The hostname of the RM.</description>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hadoop0:8032</value>
  <description>The address of the RM.</description>
</property>

Create the following folder on HDFS cluster and copy the file in the folder.
# hdfs dfs -mkdir -p /opt/spark
# hdfs dfs -copyFromLocal /opt/spark/README.md /opt/spark/

You can verify the file with the following command. This command will display the content of the file in the
HDFS cluster.

# hdfs dfs -cat /opt/spark/README.md

Open a command console to execute the Spark jobs to submit on YARN.

Ensure that you install python on Hadoop node:


# yum install -y python3

Export the Hadoop and YARN environment variables on the Spark node, pointing to the Hadoop configuration directory.


Create a python app, SimpleApp.py with the following code:

// ----------------------------------- Code Begin ------------------------------------------------//

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "/opt/spark/README.md" # Should be some file on your system

spark = SparkSession.builder.appName("Python App").getOrCreate()


logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

// ----------------------------------- Code Ends ------------------------------------------------//


The above code reads the README.md file from the HDFS cluster and counts the lines containing ‘a’ and ‘b’
respectively.

#export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
#export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
#./spark-submit --master yarn --deploy-mode cluster --executor-memory 512m --num-executors 2 SimpleApp.py

or, in client mode:


#./spark-submit --master yarn --deploy-mode client /opt/data/SimpleAppP.py

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the
Hadoop cluster.

Note: the application .py/.jar should be on the local file system; the input folder is in HDFS; and the output folder should not
already exist in HDFS - it will be created automatically.
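
If you want to be explicit about which filesystem a path refers to, you can use a fully qualified URI; this assumes the fs.defaultFS value hdfs://hadoop0:8020 configured earlier:

logData = spark.read.text("hdfs://hadoop0:8020/opt/spark/README.md")
logData.show(5, truncate=False)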


You can verify the progress of the application at http://hp.com:8088/cluster.

Click on Cluster --> Applications --> Accepted.

Wait a few moments and then check the Running tab.


Click on the Application ID --> Logs


Click Logs --> stderr to get details as follows.

After some time, once the console exits the program, click on the Finished tab.


On the job submission console:

Since we printed the output, you can verify it only from the log file.


Go to the user-log folders of the Hadoop installation. (The job ID and the container ID will be different on your
machine; update them accordingly.)

#cd /opt/hadoop/logs/userlogs
#more application_1625129460519_0004/container_1625129460519_0004_01_000001/stdout

Task: let us write the output to a file instead of printing it to the console.

Create a file SimpleAppP.py with the following code. It writes only the dataframe rows whose lines contain ‘a’.
//-------------------------- Code Begin ---------------------

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "/opt/spark/README.md" # Should be some file on your system

spark = SparkSession.builder.appName("Python App").getOrCreate()


logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()


#print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

myDF = logData.filter(logData.value.contains('a'))
myDF.write.save("mydatap")
spark.stop()

//-------------------------- Code Ends ----------------------

Execute the program again.

#spark-submit --master yarn --deploy-mode cluster --executor-memory 512m --num-executors 2 SimpleAppP.py

At the end of the execution, you should have the following output in the console.


You can verify the output file in the hdfs as shown below:
#hdfs dfs -ls -R /user/root/mydatap
# hdfs dfs -cat /user/root/mydatap/part-00000-554ff738-ee36-449c-b654-9a7fa92f17b0-c000.snappy.parquet

It’s a binary parquet file.
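
To inspect the content in a readable form, you can read the Parquet output back with Spark instead of cat; a quick check from a pyspark shell, assuming the default /user/root output path shown above:

df = spark.read.parquet("/user/root/mydatap")
df.show(5, truncate=False)   # displays the filtered lines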

Great! You have successfully executed a Spark job on the YARN cluster.


Let us verify the output now using HDFS browser.

http://localhost:9870/explorer.html#/


Click Utilities --> Browse the file system --> /user/root/mydatap (enter it in the Directory field) --> Go.

Click on the file marked above.


Let us view one of the output files: click on part-0001 --> Download.

Congrats!
---------------------------------------------- End of Lab-----------------------------------


Errata:
1) 15/06/14 21:50:46 INFO storage.BlockManagerMasterActor: Registering block manager ht.com:50475 with
267.3 MB RAM, BlockManagerId(<driver>, ht.com, 50475)
15/06/14 21:50:46 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-
events/application_1434297324587_0001.inprogress, expected: hdfs://hp.com at
org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1262)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:1262)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:128)


Solution: Verify the default configuration of the Spark installation: /spark/spark-1.3.0-bin-hadoop2.4/conf/spark-defaults.conf


Check that spark.eventLog.dir begins with hdfs:// and that /tmp/spark-events has already been created in HDFS.
You can verify this on the YARN cluster only.

Verify using: hadoop fs -ls -R /tmp/

2) 15/06/14 00:41:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

15/06/14 21:58:27 INFO yarn.Client: Application report for application_1434297324587_0002 (state:
ACCEPTED)
The above message loops indefinitely.
Solution: Ensure that every configuration file maps to the appropriate hostname and that hostname-to-IP entries are
present in the hosts file. Increase the resources and wait a few minutes for the status to change to RUNNING.

Yarn-site.xml
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value> <!-- 4 GB -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value> <!-- 512 MB -->
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>256</value> <!-- 256 MB -->
</property>

yarn-site.xml – Spark Node. (Specify the hostname of the Hadoop RM Node as shown below)

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>


</property>

3) If unhealthy nodes are displayed in the cluster URL with "1/1 local-dirs are bad: /tmp/hadoop-yarn/nm-local-dir":

Solution: delete the folder using the following command and ensure that the nodes are healthy before proceeding;
you can restart if required.

hadoop fs -rm -fr /tmp/hadoop-yarn/*

Try the following command:


yarn application -list
yarn application -kill [application name]

You can run jps on the YARN cluster while a Spark job is executing; it will show Coarse* (executor) processes.

Cluster deploy mode works as well.


Issue:


Verify the IP of the ResourceManager and set the home directory appropriately.
Try updating the following settings in yarn-site.xml on both the Hadoop and Spark nodes:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>
</property>

-------------------------------- ----------------------------------------------------------------------------


23. Jobs Monitoring : Using Web UI. – 45 Minutes


Execute the job as specified in the Spark standalone cluster lab (or using Docker) before proceeding.
- Start the two-node Spark cluster.
- Copy the input file to both nodes if you are executing on the standalone cluster.

Execute the following job:

File names.json contains:

{"firstName":"Grace","lastName":"Hopper"}
{"firstName":"Alan","lastName":"Turing"}
{"firstName":"Ada","lastName":"Lovelace"}
{"firstName":"Charles","lastName":"Babbage"}

Enter the following commands one at a time.

#pyspark --master spark://spark0:8090


ip="/software/names.json"
op="/software/nameop"
spark.sparkContext.setLogLevel("WARN")
peopleDF = spark.read.json(ip) # First job
namesDF = peopleDF.select("firstName","lastName")
namesDF.write.option("header","true").csv(op) # Second job

Access the Spark Cluster web console.


http://127.0.0.1:8080

You should have two worker nodes as shown below.

Click on the Application ID --> Application Detail UI


You can verify the Jobs details Using Spark UI: http://master:4040/jobs/

web UI comes with the following tabs (which may not all be visible at once as they are lazily created on demand,
e.g. Streaming tab):
§ Jobs
§ Stages
§ Storage with RDD size and memory use
§ Environment
§ Executors
§ SQL

The Jobs Tab shows status of all Spark jobs in a Spark application


The timeline displays when executors were added to the job, when the job completed, and so on.
In the above, you can verify that 2 executors have been added.

Job 1 succeeded, i.e. reading the JSON file.

Job 2 succeeded, i.e. the write action.

When you hover over a job in the Event Timeline, not only do you see the job legend, but the job is also highlighted in the
Summary section.

The Event Timeline section shows not only jobs but also executors.
You can verify the completed and failed jobs too:


You can answer some of the following questions from this console:

How long did the job take to execute? (1: Here it is 2 s.)

How many stages and tasks were created per job? (2: 1 stage and only one task.)


Click on job 1 (the write action) --> Description --> stages to view the DAG. When you click a job in the
All Jobs page, you see the Details for Job page.


As seen above, there is 1 stage in the job. It scans the file and creates a MapPartitionsRDD.

It shows the DAG graph and the stage details.


Scroll down to view the Event timeline.

Questions:

How long does the scheduler take to schedule the task? (Check the Light Blue Block)
How long does the task deserialization take? (Check the Orange Block)
What is the actual computation time? (Verify the Green Color block)

In ideal scenario, Green should take most of the computing resources.


You can also verify how much time each task took to execute. This provides some information about the health of the computation.


Scroll down to view the task metrics: which tasks take the majority of the time, how many bytes each task consumed or output,
and related statistics.

From the Locality_Level column above, you can verify that the data is processed locally, i.e. data locality.


Executors Tab

The Executors tab in the web UI shows statistics for each executor: how much time it spent on GC, how much data was shuffled, and so on.


Click on the SQL tab.

You can view the statistics of the SQL query: how many rows and bytes were read, etc.


Verify the plan (Scroll down the SQL tab.)

--------------------------------- Lab Ends Here -----------------------------------------------------------


24. Spark Streaming With Socket – 45 Minutes

You can execute using pyspark cli console or zeppelin


Open a terminal and perform the following activities.
Install netcat utility
yum install nc.x86_64
or yum install nc

You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by
using
$ nc -lk 9999
Then, any lines typed in the terminal running the netcat server will be counted and printed on screen
every batch interval (5 seconds in our setup). It will look something like the following.


Type the following commands in pyspark or Zeppelin.

First, we import StreamingContext, which is the main entry point for all streaming functionality. We
create a StreamingContext from the existing SparkContext (sc) with a batch interval of 5 seconds.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a StreamingContext with a batch interval of 5 seconds
ssc = StreamingContext(sc, 5)


# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

This lines DStream represents the stream of data that will be received from the data server. Each record
in this DStream is a line of text. Next, we want to split the lines by space into words.

# Split each line into words


words = lines.flatMap(lambda line: line.split(" "))

flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new
records from each record in the source DStream. In this case, each line will be split into multiple words
and the stream of words is represented as the words DStream. Next, we want to count these words.


# Count each word in each batch


pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda a, b: a + b)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs,
which is then reduced to get the frequency of words in each batch of data.
Finally, wordCounts.pprint() will print a few of the counts generated in each batch.

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform
when it is started, and no real processing has started yet. To start the processing after all the
transformations have been setup, we finally call

ssc.start() # Start the computation


ssc.awaitTermination() # Wait for the computation to terminate

You will be able to see the following in the zeppelin console


Each word’s occurrences are counted per 5-second batch.
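
For reference, here is the whole example collected into a single standalone script; this is a sketch that assumes you launch it with spark-submit outside pyspark/Zeppelin, hence the explicit SparkContext:

"""network_wordcount.py - sketch of the complete socket streaming example"""
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one receives the socket data, one processes it.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)    # connect to the netcat server
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()                                # print a few counts per batch

ssc.start()
ssc.awaitTermination()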


---------------------------------------- Lab Ends Here -----------------------------------------------


25. Spark Streaming With Kafka – 60 Minutes

Goals: To integrate Apache Kafka with Spark Using Streaming API.

Steps to be performed:
§ Install Kafka and start it along with producer.
§ Write python program to connect to Kafka and get data from Kafka using streaming
§ Execute it to fetch the data.
Use the Spark VM

Install kafka:

tar -xvf kafka_2.12-2.8.0.tar -C /spark

cd /spark/kafka_2.12-2.8.0

To start the server, start ZooKeeper first in one terminal:

bin/zookeeper-server-start.sh config/zookeeper.properties


Now start the Kafka server on other terminal

bin/kafka-server-start.sh config/server.properties


Keep both the console open.


Create a topic
Let's create a topic named "test" with a single partition and only one replica: Open one more terminal
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181


Send some messages


// Open a new console, change directory to the kafka installation folder

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

// type some message

Start a consumer

// Open a new console, change directory to the kafka installation folder


bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning


Try sending some messages from the producer to the consumer.

You have successfully configured Kafka. Now let us fetch the data using Spark.

Do not close any of the above windows.


Save the following code in a file named kafka.py.

--------------- Code Begin ----------------


from __future__ import print_function

import sys

from pyspark.sql import SparkSession


from pyspark.sql.functions import *
from pyspark.sql.types import *

# Initializing Spark session


spark = SparkSession.builder.appName("MySparkSession").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
# Setting parameters for the Spark session to read from Kafka
bootstrapServers = "spark0:9092"
subscribeType ="subscribe"
topics = "test"

# Streaming data from Kafka topic as a dataframe


lines = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
bootstrapServers).option(subscribeType,
topics).load()

# Read the raw data from the dataframe as a string and name the column "words"
lines = lines.selectExpr("CAST(value AS STRING) as words")


query = lines.writeStream.outputMode("append").format("console").start()
# Terminates the stream on abort
query.awaitTermination()

--------------- Code Ends ------------------



Submit the job with the following parameters:

#spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --master local[2] --conf "spark.dynamicAllocation.enabled=false" kafka.py


Type some words on the Producer console.

Whatever you push to Kafka should be written on the console as shown below.
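
Optionally, and not part of the original lab, you can extend kafka.py to count words instead of echoing raw lines; a sketch of the changed portion, reusing the lines dataframe defined above:

from pyspark.sql.functions import explode, split

# Split each Kafka message into words and keep a running count per word.
words = lines.select(explode(split(lines.words, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()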


------------------------------------------ Lab Ends Here ---------------------------------------------------------


26. Streaming with State Operations- Python – 60 Minutes


In this lab, we will maintain a running word count across batches using stateful operations.

To run this on your local machine, you need to first run a Netcat server.
$ nc -lk 9999

We will be sending message to the above netcat server and then apply transformation to it using Spark Streaming.

Open a pyspark terminal.

#pyspark

Type the following code one by one in the console:

from __future__ import print_function


import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

The above commands import the packages required to run a Spark Streaming application.

# Initialize the StreamingContext using the pre-created SparkContext, sc. The batch interval is 5 seconds, i.e. a micro-batch
every 5 seconds.

ssc = StreamingContext(sc, 5)

# Specify the checkpoint directory, since we are going to maintain state across the batch.
ssc.checkpoint("checkpoint")

# RDD with initial state (key, value) pairs


initialStateRDD = sc.parallelize([(u'hello', 1), (u'world', 1)])


# Define the function that will update the state of the count.
def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)
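
To see what this function does, here are two plain-Python evaluations (no streaming involved):

updateFunc([1, 1, 1], 2)   # -> 5 : three new occurrences added to a previous count of 2
updateFunc([], None)       # -> 0 : no new values and no previous state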

# Open a connection to the netcat server to retrieve the incoming messages.


lines = ssc.socketTextStream("localhost",9999)

# Apply the transformation logic to count the words. Here we use flatMap, since it can return a list of objects.

running_counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(updateFunc, initialRDD=initialStateRDD)

# Print out the calculation on the terminal.

running_counts.pprint()

# start the streaming context and wait for termination from external.
ssc.start()
ssc.awaitTermination()

Snippets:


On the netcat terminal, type sentences like the following, with some interval between them.

Observe the output on the pyspark console.


As you can observe, the word “Henry” occurred twice across the data records; thus, state has been maintained.

--------------------------- Lab Ends Here --------------------------------------------------


27. Spark – Hive Integration – 45 Minutes


When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent
Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive
deployment can still enable Hive support.

Use spark.sql.warehouse.dir to specify the default location of databases in the warehouse.

Open a pyspark terminal (Using CLI only)

#pyspark --conf spark.sql.catalogImplementation=hive

Execute the following code.

from os.path import abspath

from pyspark.sql import SparkSession


from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('/opt/spark/spark-warehouse')

spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config("spark.sql.catalogImplementation", "hive") \
.enableHiveSupport() \
.getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")


spark.sql("LOAD DATA LOCAL INPATH '/opt/spark/examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL


spark.sql("SELECT * FROM src").show()
# +---+-------+
# |key| value|
# +---+-------+
# |238|val_238|
# | 86| val_86|
# |311|val_311|
# ...

# Aggregation queries are also supported.


spark.sql("SELECT COUNT(*) FROM src").show()
# +--------+
# |count(1)|
# +--------+
# | 500 |
# +--------+

# The results of SQL queries are themselves DataFrames and support all normal functions.
sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

# The items in DataFrames are of type Row, which allows you to access each column by ordinal.
stringsDS = sqlDF.rdd.map(lambda row: "Key: %d, Value: %s" % (row.key, row.value))
for record in stringsDS.collect():
    print(record)
# Key: 0, Value: val_0
# Key: 0, Value: val_0


# Key: 0, Value: val_0


# ...

# You can also use DataFrames to create temporary views within a SparkSession.
Record = Row("key", "value")
recordsDF = spark.createDataFrame([Record(i, "val_" + str(i)) for i in range(1, 101)])
recordsDF.createOrReplaceTempView("records")

# Queries can then join DataFrame data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()

Execution Output:


----------------------------------- Lab Ends Here -------------------------------------


28. Spark integration with Hive Cluster (Local Mode)


Prerequisites: Spark installation and Hive local-mode installation.

You can refer Hive Lab for Hive Local Installation.

In this lab, we will configure to use Apache Spark with Hive.

export SPARK_HOME=/opt/spark
export HIVE_HOME=/opt/hive_local

Set HIVE_HOME and SPARK_HOME accordingly.
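
With both variables set, and assuming Spark can see your Hive configuration (for example hive-site.xml from $HIVE_HOME/conf copied or linked into $SPARK_HOME/conf, which depends on your setup), a quick check from pyspark looks like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("HiveLocalModeCheck")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()   # Hive databases should be listed here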


29. Annexure & Errata:

Quit Scala CLI :- :q


Error: No suitable device found: no device found for connection 'System eth0'.
You could try removing/renaming /etc/udev/rules.d/70-persistent-net.rules and reboot. A new file will be created,
sometimes that fixes this kind of problems.

You could also add a file "/etc/sysconfig/network-scripts/ifcfg-eth1", can you bring up eth1 then?

Changing hostname:
§ /etc/hosts
§ vi /etc/sysconfig/network
§ sysctl kernel.hostname=slave
§ bash

Unable to start spark shell with the following error

at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 85 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)


at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)

Solution: unset HADOOP_CONF_DIR and unset HADOOP_HOME

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/

issue: Cannot assign requested address


17/03/15 21:44:15 ERROR spark.SparkContext: Error initializing SparkContext.
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' f
ailed after 16 retries (starting from 0)! Consider explicitly setting the approp
riate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI)
to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:
223)

Solution:
§ include host to ip entry in /etc/hosts --> 192.168.188.178 master
§ update in conf/spark-env.sh


o SPARK_LOCAL_IP=127.0.0.1
o SPARK_MASTER_HOST=192.168.188.178

§ comment the following in conf/spark-defaults.conf


o # spark.master spark://master:7077

Unable to start spark shell:

Caused by: java.io.IOException: Error accessing /opt/jdk/jre/lib/ext/._cld*.jar
This kind of error is resolved by removing all the hidden ._* files (you can identify them by running #ls -alt). List all the
hidden dot files and remove them.

Yum repo config


# mkdir -p /redhatimg
# mount -o loop,ro /mnt/hgfs/MyExperiment/rhel-server-6.4-x86_64-dvd.iso /redhatimg

Create an entry in /etc/fstab so that the system always mounts the DVD image after a reboot.
/mnt/hgfs/MyExperiment/rhel-server-6.4-x86_64-dvd.iso /redhatimg iso9660 loop,ro 0 0

/etc/yum.repos.d/rheldiso.repo

[rhel6dvdiso]
name=RedHatOS DVD ISO
mediaid=1359576196.686790
baseurl=file:///redhatimg
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
