
Table of Contents

1. Prerequisite
2. Exercise 1: Spark Installation: Windows
3. Install Spark in CentOS Linux – 60 Minutes (D)
4. Analysis Using Spark - Overview
5. Exercise: Spark Installation - Red Hat Linux
6. Exploring DataFrames using pyspark – 35 Minutes
7. Working with DataFrames and Schemas – 30 Minutes
8. Analyzing Data with DataFrame Queries – 40 Minutes
9. Interactive Analysis with pySpark – 30 Minutes
10. Transformation and Action – RDD – 45 Minutes
11. Work with Pair RDD – 30 Minutes
12. Create & Explore PairRDD – 60 Minutes
13. Spark SQL – 30 Minutes
14. Applications (Transformation) – Python – 30 Minutes
15. Zeppelin Installation
16. Using the Python Interpreter
18. PySpark Example – Zeppelin
19. Installing Jupyter for Spark – 35 Minutes (D)
20. Spark Standalone Cluster (VM) – 60 Minutes
21. Spark Two Nodes Cluster Using Docker – 90 Minutes
22. Launching on a Cluster: Hadoop YARN – 150 Minutes
23. Jobs Monitoring: Using Web UI – 45 Minutes
24. Spark Streaming with Socket – 45 Minutes
25. Spark Streaming with Kafka – 60 Minutes
26. Streaming with State Operations - Python – 60 Minutes
27. Spark – Hive Integration – 45 Minutes
28. Spark Integration with Hive Cluster (Local Mode)
29. Annexure & Errata:
    - Unable to start spark shell with the following error
    - Issue: Cannot assign requested address
    - Unable to start spark shell: Caused by: java.io.IOException: Error accessing /opt/jdk/jre/lib/ext/._cld*.jar (resolved by removing all the hidden ._* files; list them with #ls -alt and remove them)
    - Yum repo config

Last Updated: 12 Feb 2022


1. Prerequisite

Using Docker:
Create the first container:

Let us create a common network for our nodes.

#docker network create --driver bridge spark-net

#docker run -it --name spark0 --hostname spark0 --privileged --network spark-net \
  -v /Volumes/Samsung_T5/software/:/Software \
  -v /Volumes/Samsung_T5/software/install/:/opt \
  -v /Volumes/Samsung_T5/software/data/:/data \
  -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 -p 8888:8888 \
  centos:7 /usr/sbin/init

Download the required software

Apache Spark – Version 3.0.1 – Prebuilt for Apache Hadoop 3.2 and later.
File : spark-3.0.1-bin-hadoop3.2.tgz / spark-3.2.1-bin-hadoop3.2.tgz
Url : https://spark.apache.org/downloads.html

Extract all the software into the /opt folder.

JDK installation (jdk-8u45-linux-x64.tar.gz)


https://www.oracle.com/in/java/technologies/javase/javase8-archive-downloads.html
2. Exercise 1: Spark Installation: Windows:

You need Java installed and available on your system PATH, or the JAVA_HOME environment
variable pointing to a Java installation.

Spark runs on Java 6+ and Python 2.6+.


Unzip spark-1.3.0-bin-hadoop2.4.tar to D:\MyExperiment.
Install sbt by running sbt-0.13.8.msi.
Untar scala-2.11.6.

Set the SCALA_HOME environment variable and add Scala's bin directory to the PATH variable.

Change to the installation directory (D:\MyExperiment\spark-1.3), cd into bin, and execute:

spark-shell

Create a data set of the integers 1…10000:
val data = 1 to 10000
Open the Spark web UI at http://ht:4040/jobs/ (ht is the hostname of the machine running the shell).
3. Install Spark in CentOS Linux – 60 Minutes (D)

Extract the jdk and rename the folder to jdk.

#tar -xvf jdk-8u45-linux-x64.tar.gz -C /opt


#mv jdk* jdk

Extract spark-X.X.X-bin-hadoopX.X.tgz to /opt.

# tar -xvf spark-X.X.X-bin-hadoopX.X.tgz -C /opt/

It should create a folder named spark-X.X.X-bin-hadoopX.X; rename it to spark.

#mv spark-X.X.X-bin-hadoopX.X spark

Set the JAVA_HOME and initialize the PATH variable.

Open the profile and update with the following statements.

#vi ~/.bashrc
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:/opt/spark/bin

Install Python version 3.6.

Python 3.6 or above is required to run PySpark programs.

#yum install -y python3

Type bash to reload the profile.

Go to the installation directory and start the shell:

# pyspark
Enter the following code in the pyspark console to count the occurrences of each character.

# Code Begins. **********************************************

# In the first line we import the add operator from the Python standard library.

from operator import add

# Next we create an RDD from the "Hello World" string:

data = sc.parallelize(list("Hello World"))

# Here we have used the object sc; sc is the SparkContext object created by pyspark
# before showing the console.
# The parallelize() function is used to create an RDD from the string. RDD stands for Resilient
# Distributed Dataset, a data set that is distributed across the Spark cluster, where it is processed.
# Now, with the following example, we count each character and print the result on the console.

counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()

for (word, count) in counts:
    print("{}: {}".format(word, count))

# Code Ends. *********************************************

You should get the result as shown below:


The web UI of the Spark shell can be accessed using the following URL. You can click on each tab
and verify the screens.

http://ht:4040/jobs/

Further details on this web UI will be discussed later.

You have successfully installed spark for local development.


Note: You can start pyspark using all available cores with the following command.
pyspark --master local[*]
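As a quick sanity check inside the pyspark console, a minimal sketch using attributes of the predefined SparkContext:

sc.version              # Spark version in use
sc.master               # e.g. local[*] when started with --master local[*]
sc.defaultParallelism   # number of cores Spark uses by default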
----------------------------------------Lab Ends Here -----------------------------------
4. Analysis Using Spark - Overview
Demonstrate this after the installation lab.
This tutorial demonstrates how Spark can be used for data analytics - an overview.
Scenario - Upload a web access log and find the following:
Which IP requests URLs the most?
What is the maximum number of bytes returned by the server?
How many requests were received from each client?

Let us initialize the log entries that will be used for the analysis.


%pyspark
data = [('64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846'),
        ('64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523'),
        ('94.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851')]

Convert the log entries into an RDD.

%pyspark
rdd = spark.sparkContext.parallelize(data)

Let us have a look at what the RDD looks like.
%pyspark
rdd.collect()
['64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846', '64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523', '94.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851']

I would like to access the columns by name, so let us map the RDD records to Rows with named fields.

%pyspark
from pyspark.sql import Row
parts = rdd.map(lambda l: l.split(" "))
log = parts.map(lambda p: Row(ip=p[0], ts=p[3], atype=p[5], url=p[6], code=p[8], byte=p[9]))

Have I split the record fields correctly? Let us review the transformation
%pyspark
parts.collect()
[['64.242.88.10', '-', '-', '[07/Mar/2004:16:05:49', '-0800]', '"GET', '/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables', 'HTTP/1.1"', '401', '12846'], ['64.242.88.10', '-', '-', '[07/Mar/2004:16:06:51', '-0800]', '"GET', '/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2', 'HTTP/1.1"', '200', '4523'], ['94.242.88.10', '-', '-', '[07/Mar/2004:16:30:29', '-0800]', '"GET', '/twiki/bin/attach/Main/OfficeLocations', 'HTTP/1.1"', '401', '12851']]

I would like to use the DataFrame API, so let us convert the RDD to a DataFrame.
%pyspark
# Infer the schema, and register the DataFrame as a table.
mylog = spark.createDataFrame(log)

Have a view of our DF!


%pyspark
mylog.take(2)
[Row(atype=u'"GET', byte=u'12846', code=u'401', ip=u'64.242.88.10', ts=u'[07/Mar/2004:16:05:49', url=u'/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables'), Row(atype=u'"GET', byte=u'4523', code=u'200', ip=u'64.242.88.10', ts=u'[07/Mar/2004:16:06:51', url=u'/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2')]

Now that we have the data in Spark, let us answer the queries defined in the first paragraph.

Which IP requests URLs the most?
%pyspark
from pyspark.sql.functions import col
mylog.groupBy("ip").count().orderBy(col("count").desc()).first()
Row(ip=u'64.242.88.10', count=2)

What is the maximum number of bytes returned by the server?
%pyspark
mylog.select("ip","byte").orderBy("byte").show()
mylog.select("ip","byte").orderBy("byte").first()
+------------+-----+
|          ip| byte|
+------------+-----+
|64.242.88.10|12846|
|94.242.88.10|12851|
|64.242.88.10| 4523|
+------------+-----+
Row(ip=u'64.242.88.10', byte=u'12846')
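Note that byte was inferred as a string, so the ordering above is lexicographic rather than numeric. A minimal sketch of a numeric version, assuming the same mylog DataFrame (byte_int is just an illustrative column name):

%pyspark
from pyspark.sql.functions import col
# Cast the string column to an integer before ordering descending, then take the first row.
mylog.withColumn("byte_int", col("byte").cast("int")).orderBy(col("byte_int").desc()).select("ip", "byte_int").first()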

How many requests were received from each client?
%pyspark
mylog.select("ip","url").groupBy("ip").count().show()
+------------+-----+
|          ip|count|
+------------+-----+
|94.242.88.10|    1|
|64.242.88.10|    2|
+------------+-----+

--------------------- Lab Ends Here ------------------------------------


5. Exercise: Spark Installation - Red Hat Linux.

Goals: To install Spark on redhat linux OS.

Ensure to use the latest software: spark-2.1.0-bin-hadoop2.7.tgz


You need to replace with the specific version of software wherever relevant.

Create a folder /spark and untar all the related software inside this folder only, so that the entire
installation lives in a single folder for better manageability.

#mkdir /spark

Untar spark-2.1.0-bin-hadoop2.7.tgz to /spark.

It should create one folder inside /spark.

Untar scala-2.11.6

tar -xvf scala-2.11.6.tgz -C /spark

Set the path and variable as follows: We are including Scala home and bin folder in the
path variable.

vi ~/.bashrc
export SCALA_HOME=/spark/scala-2.11.6
export PATH=$PATH:$SCALA_HOME/bin
Type bash

Install JDK as follows and set the PATH with environment as follows: (chose 32 or 64
bit depends on your server)

tar -xvf jdk-8u45-linux-x64.tar.gz -C /spark

export JAVA_HOME=/spark/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
Rename the installation folder as shown below. You need to change to the /spark directory
before issuing this command.

#mv spark-2.1.0-bin-hadoop2.7/ spark-2.1/

Change to the installation directory (/spark/spark-2.1), cd into bin, and execute:


./spark-shell

Enter the following command in the Scala console to create a data set of the integers 1…10000:
val data = 1 to 10000

http://ht:4040/jobs/
6. Exploring DataFrames using pyspark – 35 Minutes
Following features of Spark will be demonstrated here:

• Loading json file.


• Understand its schema
• Select required fields.
• Apply filter.
Start the pyspark shell.

#pyspark
Create a text file users.json, in the data folder, containing the sample data listed below:

{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Etienne", "pcode":"94104"}

Scala:

Initiate the spark-shell from the folder in which you created the above file.

// Read the users json file as a dataframe.


val usersDF = spark.read.json("users.json")

// Find out the schema of the uploaded file


usersDF.printSchema()

As shown above, three fields will be displayed according to the json fields specified in
the text file.

//Let us find out the first 3 records to have a sample data.


val users = usersDF.take(3)
print(users)
usersDF.show()

Out of the three fields, we are interested only in the name and age fields. So let us create a
dataframe with only these two fields and apply a filter expression so that only people older
than 20 years remain in the dataframe.

nameAgeDF = usersDF.select("name","age")
nameAgeOver20DF = nameAgeDF.where("age > 20")
nameAgeOver20DF.show()
usersDF.select("name","age").where("age > 20").show()

You can also combine the functions as shown above. You will get the same result.

Zeppelin Output.
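The same filter can also be written with a column expression instead of a SQL string; a minimal sketch, assuming the same usersDF loaded in the pyspark shell:

usersDF.select("name", "age").where(usersDF.age > 20).show()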
----------------------------------- Lab Ends Here---------------------------------------------------
7. Working with DataFrames and Schemas – 30 Minutes
You will understand the following:

• Defining schema and mapping to dataframe while loading json file.


• Saving the dataframe to json file.

Create an input text file /software/people.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

Execute the following in the Spark-shell.

Import and define the Structure.


from pyspark.sql.types import *
columnsList = [
StructField("pcode", StringType()),
StructField("lastName", StringType()),
StructField("firstName", StringType()),
StructField("age", IntegerType())]
peopleSchema = StructType(columnsList)

# Specify the schema along with the loading instruction.


usersDF =
spark.read.option("header","true").schema(peopleSchema).csv("people.csv")

usersDF.printSchema()

# As shown above, age is an integer; it would have been mapped to a string by default if it
# were not defined in the schema.
nameAgeDF = usersDF.select("firstName","age")
nameAgeDF.show()

You can save the dataframe consisting of First Name and age to a file.

# nameAgeDF.write.json("age.json")

Open a terminal and verify the output. A folder named age.json will be created, and the output
files will be inside it.
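One way to confirm the save from the shell itself (a minimal sketch, assuming age.json was written to the current working directory) is to read it back:

spark.read.json("age.json").show()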

------------ Lab Ends Here --------------


8. Analyzing Data with DataFrame Queries – 40 Minutes
We will understand the following in this lab:
• Join
• Accessing Columns in DF
Use the following data for this lab:

/software/people.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48
Load the people data and fetch column, age in the
following way:

peopleDF = spark.read.option("header","true").csv("people.csv")
peopleDF["age"]
peopleDF.age
peopleDF.select(peopleDF["age"]).show()
Manipulate the age column, i.e., multiply age by 10.

peopleDF.select("lastName", peopleDF.age * 10).show()

You can chain the Queries.

peopleDF.select("lastName",(peopleDF.age * 10).alias("age_10")).show()

Perform aggregation:

peopleDF.groupBy("pcode").count().show()
Next let us join two dataframes:

/software/people-no-pcode.csv

pcode,lastName,firstName,age
02134,Hopper,Grace,52
,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

/software/pcodes.csv

pcode,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL
Load the people and code files in DF.

peopleDF = spark.read.option("header","true").csv("people-no-pcode.csv")
pcodesDF = spark.read.option("header","true").csv("pcodes.csv")

Perform an inner join on pcode.

peopleDF.join(pcodesDF, "pcode").show()
Perform the Left outer join

peopleDF.join(pcodesDF,peopleDF["pcode"] == pcodesDF.pcode, "left_outer").show()

You can see a null value in the pcode column of the second row.
Joining on Columns with Different Names

/software/zcodes.csv

zip,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL

Join on the pcode of the first file and the zip of the second file.

zcodesDF = spark.read.option("header","true").csv("zcodes.csv")
peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip).show()
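Both the pcode and zip columns appear in the joined result; a minimal sketch for keeping only one of them, assuming the same DataFrames:

peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip).drop(zcodesDF.zip).show()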

---------------------------- Lab Ends Here ------------------------


9. Interactive Analysis with pySpark – 30 Minutes

Optional: export PYSPARK_PYTHON=python3.6 (in case you want to change the Python version;
you can also set it in the bin/pyspark script or the load-spark-env.sh file).

Start the shell by running the following in the Spark directory:
./pyspark
textFile = sc.textFile("README.md")
textFile.count()
textFile.first()
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()

Create RDD from Collection.

myData = ["Alice","Carlos","Frank","Barbara"]
myRDD = sc.parallelize(myData)

for make in myRDD.collect(): print(make)

for make in myRDD.take(2): print(make)
myRDD.saveAsTextFile("mydata/")

You can verify the data in the folder mydata/
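Alternatively, a minimal sketch for reading the saved files back into an RDD from within the shell (assuming the mydata/ folder created above):

for line in sc.textFile("mydata/").collect(): print(line)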


Add one more RDD
myData1 = ["Ram","Shyam","Krishna","Vishnu"]
myRDD1 = sc.parallelize(myData1)

create a new RDD by combining the second RDD to the previous RDD.

myNames = myRDD.union(myRDD1)

Collect and display the contents of the new myNames RDD.

for name in myNames.collect(): print(name)

Lab Ends Here --------------------------------------------------------------------------------


10. Transformation and Action – RDD – 45 Minutes
Let’s make a new RDD from the text of the README file in the Spark
source directory: (SPARK_HOME/README.md)

textFile = sc.textFile("README.md")

Let’s start with a few actions:


textFile.count()  # Number of items in this RDD

textFile.first()  # First item in this RDD


Now let’s use a transformation. We will use the filter transformation
to return a new RDD with a subset of the items in the file.
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.collect()

We can chain together transformations and actions:


textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?

Let's say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement
MapReduce flows easily:
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

Here, we combined
the flatMap, map and reduceByKey transformations to compute the
per-word counts in the file as an RDD of (String, Int) pairs. To
collect the word counts in our shell, we can use the collect action:
wordCounts.collect()
let’s mark our linesWithSpark dataset to be cached:

linesWithSpark.cache()
linesWithSpark.count()

Next, we will convert RDD to Dataframe.

Create a file /opt/data/people.txt with the following content:

02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48

from pyspark.sql.types import *

from pyspark.sql import Row

rowRDD = sc.textFile("/opt/data/people.txt").map(lambda line: line.split(",")).map(lambda values: Row(values[0], values[1], values[2], values[3]))

# Display two records of the RDD.

rowRDD.take(2)

# Convert the RDD to a DataFrame.

myDF = rowRDD.toDF()

# Display two records of the DataFrame.

myDF.show(2)
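The conversion above produces default column names (_1, _2, ...); a minimal sketch for attaching meaningful names while converting, assuming the same rowRDD (the column names here are illustrative):

namedDF = rowRDD.toDF(["pcode", "lastName", "firstName", "age"])
namedDF.printSchema()
namedDF.show(2)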
-------------------------- Lab Ends Here ----------------------------------------------------
11. Work with Pair RDD – 30 Minutes
Lab Overview

In this activity, you will load SFPD data from a CSV file. You will create pair RDD and apply pair RDD
operations to explore the data.

Scenario

Our dataset is a .csv file that consists of SFPD incident data from SF OpenData
(https://data.sfgov.org/). For each incident, we have the following information:

Field         Description                           Example Value
IncidentNum   Incident number                       150561637
Category      Category of incident                  NON-CRIMINAL
Descript      Description of incident               FOUND_PROPERTY
DayOfWeek     Day of week that incident occurred    Sunday
Date          Date of incident                      6/28/15
Time          Time of incident                      23:50
PdDistrict    Police Department District            TARAVAL
Resolution    Resolution                            NONE
Address       Address                               1300_Block_of_LA_PLAYA_ST
X             X-coordinate of location              -122.5091348
Y             Y-coordinate of location              37.76119777
PdID          Department ID                         15056163704013

The dataset has been modified to decrease the size of the files and also to make it easier to use. We use
this same dataset for all the labs in this course.

Load Data into Apache Spark

Objectives

• Launch the Spark interactive shell

• Load data into Spark

• Explore data in Apache Spark

4.1.1 Launch the Spark Interactive Shell

The Spark interactive shell is available in Scala or Python.

1. To launch the Interactive Shell, at the command line, run the following command:

pyspark --master local[2]

4.1.2. Load Data into Spark


(copy all data files in sparkcluster)

To load the data we are going to use the SparkContext method textFile. The SparkContext is
available in the interactive shell as the variable sc. We also want to split the file by the separator “,”.

1. We define the mapping for our input variables. While this isn’t a necessary step, it makes it easier to refer to
the different fields by names.

IncidntNum = 0
Category = 1
Descript = 2
DayOfWeek = 3
Date = 4
Time = 5
PdDistrict = 6
Resolution = 7
Address = 8
X = 9
Y = 10
PdId = 11

2. To load data into Spark, at the pyspark prompt:

sfpdRDD = sc.textFile("sfpd.csv").map(lambda line: line.split(","))

Lab 4.1.3 Explore data using RDD operations


What transformations and actions would you use in each case? Complete the command with the appropriate
transformations and actions.

1. How do you see the first element of the inputRDD (sfpdRDD)?

sfpdRDD.first()

2. What do you use to see the first 5 elements of the RDD?

sfpdRDD.take(5)

3. What is the total number of incidents?

totincs = sfpdRDD.count()
print(totincs)

___________________________________________________________________

4. What is the total number of distinct resolutions?

totres = sfpdRDD.map(lambda inc : inc[7]).distinct().count()


totres

__
_________________________________________________________________

5. List the distinct PdDistricts.

totresdistricts = sfpdRDD.map(lambda inc: inc[6]).distinct().collect()

totresdistricts

______________________________


12. Create & Explore PairRDD – 60 Minutes

In the previous activity we explored the data in the sfpdRDD. We used RDD operations. In
this activity, we will create pairRDD to find answers to questions about the data.

Objectives

• Create pairRDD & apply pairRDD operations


• Join pairRDD

Create pair RDD & apply pair RDD operations

Start the spark shell if not done else skip the next command.

#pyspark

# sfpdRDD = sc.textFile("sfpd.csv").map(lambda rec: rec.split(","))

1. Which five districts have the highest incidents?


Step to be follow (Hints)


a. Use a map transformation to create pair RDD from sfpdRDD of the form: [(PdDistrict, 1)]

b. Use reduceByKey(lambda a, b: a + b) to get the count of incidents in each district. Your
result will be a pairRDD of the form [(PdDistrict, count)]

c. Use map again to get a pairRDD of the form [(count, PdDistrict)]

d. Use sortByKey(False) on [(count, PdDistrict)]

e. Use take(5) to get the top 5.


Answer:

top5Dists = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[6], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(5)

The filter validates proper records, i.e. each record should have exactly 14 fields.
2. Which five addresses have the highest incidents?
a. Create pairRDD (map)
b. Get the count for each key (reduceByKey)
c. Pair RDD with key and count switched (map)
d. Sort in descending order (sortByKey)

e. Use take(5) to get the top 5

Answer:

top5Adds = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[8], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(5)

top5Adds

3. What are the top three categories of incidents?

Answer:

top3Cat = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[1], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda t: (t[1], t[0])) \
    .sortByKey(False).take(3)

top3Cat

4. What is the count of incidents by district?


Answer:

num_inc_dist = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[6], 1)) \
    .countByKey()

num_inc_dist

Join Pair RDDs - TBD

This activity illustrates how joins work in Spark (python). There are two small datasets
provided for this activity - J_AddCat.csv and J_AddDist.csv.

J_AddCat.csv - Category; Address                      J_AddDist.csv - PdDistrict; Address

1.  EMBEZZLEMENT  100_Block_of_JEFFERSON_ST           1. INGLESIDE  100_Block_of_ROME_ST
2.  EMBEZZLEMENT  SUTTER_ST/MASON_ST                  2. SOUTHERN   0_Block_of_SOUTHPARK_AV
3.  BRIBERY       0_Block_of_SOUTHPARK_AV             3. MISSION    1900_Block_of_MISSION_ST
4.  EMBEZZLEMENT  1500_Block_of_15TH_ST               4. RICHMOND   1400_Block_of_CLEMENT_ST
5.  EMBEZZLEMENT  200_Block_of_BUSH_ST                5. SOUTHERN   100_Block_of_BLUXOME_ST
6.  BRIBERY       1900_Block_of_MISSION_ST            6. SOUTHERN   300_Block_of_BERRY_ST
7.  EMBEZZLEMENT  800_Block_of_MARKET_ST              7. BAYVIEW    1400_Block_of_VANDYKE_AV
8.  EMBEZZLEMENT  2000_Block_of_MARKET_ST             8. SOUTHERN   1100_Block_of_MISSION_ST
9.  BRIBERY       1400_Block_of_CLEMENT_ST            9. TARAVAL    0_Block_of_CHUMASERO_DR
10. EMBEZZLEMENT  1600_Block_of_FILLMORE_ST
11. EMBEZZLEMENT  1100_Block_of_SELBY_ST
12. BAD CHECKS    100_Block_of_BLUXOME_ST
Based on the data above, answer the following questions:

1. Given these two datasets, you want to find the type of incident and district for each address. What
is one way of doing this? (HINT: An operation on pairs or pairRDDs)

What should the keys be for the two pairRDDs?


[You can use joins on pairs or pairRDD to get the information. The key for the pairRDD
is the address.]

2. What is the size of the resulting dataset from a join? Why?

[A join is the same as an inner join and only keys that are present in both RDDs are output. If you
compare the addresses in both the datasets, you find that there are five addresses in common and
they are unique. Thus the resulting dataset will contain five elements. If there are multiple values for
the same key, the resulting RDD will have an entry for every possible pair of values with that key from
the two RDDs.]

3. If you did a right outer join on the two datasets with Address/Category being the source RDD,

what would be the size of the resulting dataset? Why?

[A right outer join results in a pair RDD that has entries for each key in the other pairRDD. If the
source RDD contains data from J_AddCat.csv and the “other” RDD is represented by J_AddDist.csv,
then since “other” RDD has 9 distinct addresses, the size of the result of a right outer join is 9.]

4. If you did a left outer join on the two datasets with Address/Category being the source RDD,

what would be the size of the resulting dataset? Why?


[A left outer join results in a pair RDD that has entries for each key in the source pairRDD. If
the source RDD contains data from J_AddCat.csv and the "other" RDD is represented by
J_AddDist.csv, then since the "source" RDD has 13 distinct addresses, the size of the result of a left
outer join is 13.]
5. Load each dataset into separate pairRDDs with “address” being the key.

# catAdd = sc.textFile("/opt/data/J_AddCat.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))
# distAdd = sc.textFile("/opt/data/J_AddDist.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))


6. List the incident category and district for those addresses that have both category and district
information. Verify that the size estimated earlier is correct.
catJdist = catAdd.join(distAdd)
catJdist.collect()
catJdist.count()
catJdist.take(5)


7. List the incident category and district for all addresses irrespective of whether each address has
category and district information.
catJdist1 = catAdd.leftOuterJoin(distAdd)
catJdist1.collect()
catJdist1.count()


8. List the incident district and category for all addresses irrespective of whether each address has
category and district information. Verify that the size estimated earlier is correct.
catJdist2 = catAdd.rightOuterJoin(distAdd)
catJdist2.collect()
catJdist2.count()


Explore Partitioning

In this activity we see how to determine the number of partitions, the type of partitioner and how
to specify partitions in a transformation.

Objective

• Explore partitioning in RDDs

Explore partitioning in RDDs

Note: Ensure that you have started the spark shell with --master local[*] ( Refer the First Lab)

1. How many partitions are there in the sfpdRDD?


sfpdRDD.getNumPartitions()

2. How do you find the type of partitioner for sfpdRDD?


sfpdRDD.partitioner


The number of partitions will depend on the number of cores; in my case it is 2 cores, so 2 partitions.
If there is no partitioner, the partitioning is not based on any characteristic of the data; the distribution is random and roughly uniform
across nodes.

3. Create a pair RDD, keeping only records that have exactly 14 fields.

incByDists = sfpdRDD.filter(lambda rec: len(rec) == 14).map(lambda incident: (incident[6], 1)).reduceByKey(lambda x, y: x + y)


How many partitions does incByDists have?

incByDists.getNumPartitions()

What type of partitioner does incByDists have?


print(incByDists.partitioner)

By default PySpark implementation uses hash partitioning as the partitioning function.
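A minimal sketch of controlling the partition count explicitly, assuming the same sfpdRDD; the second argument to reduceByKey sets the number of partitions (incByDists4 is just an illustrative name):

# Build the same (district, count) pairs, but force the result into 4 partitions.
incByDists4 = sfpdRDD.filter(lambda rec: len(rec) == 14).map(lambda incident: (incident[6], 1)).reduceByKey(lambda x, y: x + y, 4)
incByDists4.getNumPartitions()
print(incByDists4.partitioner)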


4. Add a map
inc_map = incByDists.map(lambda t: (t[1], t[0]))
How many partitions does inc_map have? inc_map.getNumPartitions()

What type of partitioner does inc_map have? print(inc_map.partitioner)


5. Add groupByKey
inc_group = sfpdRDD.map(lambda incident: (incident[6],1)).groupByKey()
What type of partitioner does inc_group have? inc_group.partitioner


6. Create two pairRDD


catAdd = sc.textFile("/opt/data/J_AddCat.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))
distAdd = sc.textFile("/opt/data/J_AddDist.csv").map(lambda x: x.split(",")).map(lambda x: (x[1], x[0]))

7. You can specify the number of partitions when you use the join operation.
catJdist = catAdd.join(distAdd,8)
How many partitions does the joined RDD have? catJdist.getNumPartitions()

What type of partitioner does catJdist have? catJdist.partitioner


Word Count Using Pair.

Finally, let us perform the word count application using the Pair data model.

counts = sc.textFile("/opt/data/J_AddCat.csv").flatMap(lambda line : line.split(',')).map(lambda word : (word,1))


result = counts.reduceByKey(lambda v1,v2 : v1+v2)
result.collect()


Errata: Lambda parameter errors occur because Python 3 does not support tuple parameters in lambdas (e.g. lambda (k, v): ...). Rewrite such lambdas to take a single argument and index into it, as shown in the answers above.

----------------------------- Lab Ends Here ----------------------------------------------------------------


13. Spark SQL – 30 Minutes

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as
kwargs to the Row class.

Copy the person.txt file to /software folder.

Henry,42
Rajnita,40
Henderson,14
Tiraj,5

Open a terminal from /software folder and enter the following command.

Using Spark SQL – RDD data structure. (Inferring the Schema Using Reflection)

#pyspark


# Create an RDD of Person objects and register it as a table.

from pyspark.sql import Row

peopleL = sc.textFile("person.txt")
peopleL.take(4)

parts = peopleL.map(lambda l: l.split(","))

people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.


schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL statements can be run by using the sql method provided by the SparkSession.
# SQL can be run over DataFrames that have been registered as a table.
kids = spark.sql("SELECT name FROM people WHERE age >= 1 AND age <= 9")

# The results of SQL queries are DataFrames and support all the normal RDD operations.
# rdd returns the content as a pyspark.RDD of Row.
kidNames = kids.rdd.map(lambda p: "Name: " + p.name).collect()
for name in kidNames:
    print(name)


Create a temporary view:

# Register the DataFrame as a SQL temporary view


schemaPeople.createTempView("kids")

Display the output.

spark.sql("select * from kids").show()


Show the Tables or View List.

spark.catalog.listTables('default')

or

spark.sql('show tables from default').show()


Using DataFrame and SQL:

/software/people.json

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

peopleDF = spark.read.format("json").load("people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
peopleDF.select("name","age").show()


Query file directly with SQL.

sqlDF = spark.sql("SELECT * FROM parquet.`namesAndAges.parquet`")

sqlDF.show()

Perform some operations on the Dataframe.

In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']).


#convert the people RDD to Dataframe


df = people.toDF()

# Select everybody, but increment the age by 1


df.select(df['name'], df['age'] + 1).show()

# Select people older than 21


df.filter(df['age'] > 21).show()

# Count people by age


df.groupBy("age").count().show()


Using aggregate functions:

Determine the min and max age.

spark.sql("select min(age), max(age) from kids").show()


-------------------------------- Lab Ends Here ---------------------------------------------------------------------------


14. Applications ( Transformation ) – Python – 30 Minutes


In this lab, let us develop a Python application and submit it to Spark for execution. Ensure that you have created a folder
/spark/workspace for storing all the Python code.

Let us create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "/opt/spark/README.md" # Should be some file on your system


sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()


numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll
need to replace the logFile path with the location where Spark is installed on your machine.

We can run this application using the bin/spark-submit script:

# Use spark-submit to run your application, change directory to /spark before executing the following command

$ spark-submit --master local[2] workspace/SimpleApp.py


You can view the job status at the following URL while the application is running:

http://192.168.150.128:4040/jobs/

In this way, you can deploy any python application in the spark cluster.

----------------------- Lab Ends Here ------------------------------------------


15. Zeppelin Installation

You don't require a separate Spark installation in case you are using zeppelin-0.6.2-bin-all.tgz.
Create a zeppelin user and switch to it, or if the zeppelin user is already created, then log in as zeppelin.
The password should be life213.
Use root credentials
groupadd hadoop
useradd -g hadoop zeppelin
passwd zeppelin

tar -xvf zeppelin-0.7.3-bin-all.tgz -C /apps

Change the folder owner/group to zeppelin


Note : chgrp -R hadoop zeppelin*
chown zeppelin:hadoop /apps/zeppelin* -R


su - zeppelin
whoami

set the JAVA_HOME [vi ~/.bashrc]


export JAVA_HOME=/spark/jdk1.8.0_45
export PATH=:$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin


#cp zeppelin-site.xml.template zeppelin-site.xml


Start Zeppelin:
cd /apps/zeppelin-0.7.3-bin-all
bin/zeppelin-daemon.sh start
jps
Note: You are not required to start the Spark cluster before starting Zeppelin.


http://master:8081/#/


Stop Zeppelin
bin/zeppelin-daemon.sh stop

Let us connect Zeppelin with the existing Spark Cluster. You can start Spark in a single standalone cluster or Use
any existing Spark Cluster.

In this lab let us use the single standalone spark cluster,


Let us set the Spark metastore to be stored in a Derby network server so that more than one user can connect using the Spark
shell.

Unzip the derby database and start as below: You can use zeppelin user account to start the derby.

#tar -xzf db-derby-10.14.1.0-bin.tar.gz -C /apps


#mkdir /apps/db-derby-10.14.1.0-bin/data
#nohup /apps/db-derby-10.14.1.0-bin/bin/startNetworkServer -h 0.0.0.0 &

After the Derby server is started, you need to modify Spark to refer to the above database server. For this, create a file
hive-site.xml in the SPARK_HOME/conf folder and paste the following content:

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://10.10.20.27:1527/metastore_db;create=true</value>
</property>


<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
</property>

</configuration>

Remove the derby jar from the Spark installation jar folder and replace with the derby client jar.

After that start the spark standalone cluster:

sbin/start-master.sh --webui-port 8081

At the end of the above step you should be able to access the Spark UI
http://10.10.20.27:8081/


Now we have configured the Spark standalone cluster. We need to have at least one worker node, as shown below.
Add hp.tos.com in the slaves file and start the slave with the following command:
# sh start-slave.sh spark://hp.tos.com:7077
http://10.10.20.27:8080/


Let us configure zeppelin to connect to it.

Stop Zeppelin if this was not done before. Use the zeppelin user ID to execute the following command:

#bin/zeppelin-daemon.sh stop


Open the zeppelin-env.sh file present in the $ZEPPELIN_HOME/conf directory and provide the configurations specified below.

export MASTER=spark://hp.tos.com:7077
export SPARK_HOME=/apps/spark-2.2.0

Save the above file.

Let us activate admin user for logon to zeppelin web UI.

#cd /apps/zeppelin-0.7.3/conf

#cp shiro.ini.template shiro.ini


update the password of admin as shown below in the above ini file.
admin = life213

start the zeppelin using zeppelin user ID.


#bin/zeppelin-daemon.sh start

Now open your Zeppelin dashboard, go to the list of interpreters, and search for the Spark interpreter.

Ensure that you modify the below parameters: You need to logon using admin/life213 credentials.

Interpreter → Spark → Edit


Save and then accept the prompt by clicking Ok button.

Create a New Notebook [Notebook → Create New Note → Enter Hello Spark as the name]

Enter sc and then click Run, which is in the right corner. If everything is OK, you should see output as in the above
screenshot. Otherwise, review the log file in the Zeppelin installation folder.

If you want to configure python3, update the following parameters.


------------------------------------------Lab Ends Here -----------------------------


16. Using the Python Interpreter


To ensure Zeppelin uses that version of Python when you use the Python interpreter, set the zeppelin.python setting
to the path to Anaconda.

Using the Python Interpreter


In a paragraph, use %python to select the Python interpreter and then input all commands.
The interpreter can only work if you already have Python installed (the interpreter doesn't bring its
own Python binaries).

To access the help, type help()


%spark.pyspark PySparkInterpreter Provides a Python environment

%spark.r SparkRInterpreter Provides an R environment with SparkR support

%spark.sql SparkSQLInterpreter Provides a SQL environment

%spark.dep DepInterpreter Dependency loader

You should set your PYSPARK_DRIVER_PYTHON environment variable so that Spark uses Anaconda. You can
get more information here:

https://spark.apache.org/docs/1.6.2/programming-guide.html

Export SPARK_HOME


In conf/zeppelin-env.sh, export SPARK_HOME environment variable with your Spark installation path.

For example,

export SPARK_HOME=/spark/spark-2.1.0

Install Anaconda2 and set in the Path Variable of root login. (vi ~/.bashrc)

export PATH="/spark/anaconda2/bin:$PATH"


18. PySpark-Example – Zeppelin.


%spark.pyspark
sc
words = sc.textFile('file:///spark/spark-2.1.0/README.md')

%pyspark
words.flatMap(lambda x: x.lower().split(' ')) \
.filter(lambda x: x.isalpha()).map(lambda x: (x, 1)) \
.reduceByKey(lambda a,b: a+b)

%pyspark
words.first()
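Note that the flatMap/reduceByKey chain above only defines transformations; nothing runs until an action is called. A minimal sketch that assigns the result and triggers execution, assuming the same words RDD (wordCounts is an illustrative name):

%pyspark
wordCounts = words.flatMap(lambda x: x.lower().split(' ')).filter(lambda x: x.isalpha()).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.take(10)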


19. Installing Jupyter for Spark – 35 Minutes.(D)

You can install Jupyter using pip or Anaconda.

Using pip – We will use this for our lab.

#yum install epel-release


#yum -y install python-pip

Using pip
# yum install python3-pip -y
#pip3 install --upgrade setuptools
#pip3 install --upgrade pip
#pip3 install jupyterlab

To configure the notebook, first generate the default configuration file:


#jupyter notebook --generate-config


To run Jupyter on a different host or port, edit the generated configuration file. The generate-config command reports where it was written:

[Writing default config to: /root/.jupyter/jupyter_notebook_config.py]

Update the following properties.


#vi ~/.jupyter/jupyter_notebook_config.py

c.NotebookApp.ip = 'spark0'
c.NotebookApp.open_browser = False
c.NotebookApp.allow_remote_access = True
c.NotebookApp.allow_root = True

If you want to change the port number, set c.NotebookApp.port in the same file. Then start the notebook:

#jupyter notebook

Copy the URL in the web browser.


Let us configure Notebook for Pyspark.


Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Restart your terminal and launch PySpark again:

$ pyspark

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on
‘New’ > ‘Notebooks Python [default]’.

Copy and paste Pi calculation script and run it by pressing Shift + Enter.


# Code Begin

import random
num_samples = 100000000

def inside(p):
x, y = random.random(), random.random()
return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = 4 * count / num_samples
print(pi)

sc.stop()

# Code Ends.
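The script above assumes sc is already defined, which is the case when the notebook is launched through pyspark as described. If you ever run it in a notebook started directly with jupyter, a minimal sketch for creating the context yourself (local[*] and the application name are assumptions):

from pyspark.sql import SparkSession

# Build a local SparkSession and take its SparkContext.
spark = SparkSession.builder.master("local[*]").appName("pi-notebook").getOrCreate()
sc = spark.sparkContext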


You have successfully configured pyspark on Jupyter.


Configure scala on Jupyter.

Option I: Using spylon – Execute Scala spark program.

#pip3 install spylon-kernel


#python3 -m spylon_kernel install
# export SPARK_HOME=/opt/spark
#jupyter notebook


Option 2: using Scala in Jupyter


https://almond.sh/docs/quick-start-install

Or using Anaconda (optional):

Anaconda 4.2.0 for Linux. Anaconda is BSD licensed, which gives you permission to use Anaconda commercially and for redistribution.
Download the installer and, optionally, verify its integrity with MD5 or SHA-256.
In your terminal window, type one of the commands below and follow the instructions:
Python 3.5 version: bash Anaconda3-4.2.0-Linux-x86_64.sh
Python 2.7 version: bash Anaconda2-4.2.0-Linux-x86_64.sh
NOTE: Include the "bash" command even if you are not using the bash shell.


#bash


Run the following command to install Jupyter.


#conda install jupyter


Anaconda 4.2.0 for Windows. Anaconda is BSD licensed, which gives you permission to use Anaconda commercially and for redistribution.
Download the installer and, optionally, verify its integrity with MD5 or SHA-256.
Double-click the .exe file to install Anaconda and follow the instructions on the screen.

Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific
computing and data science.
Congratulations, you have installed Jupyter Notebook.


Reference:
https://www.sicara.ai/blog/2017-05-02-get-started-pyspark-jupyter-notebook-3-minutes
https://community.hortonworks.com/articles/75551/installing-and-exploring-spark-20-with-jupyter-not.html
https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b

---------------------------------------------Lab Ends Here --------------------------------------------------


20. Spark Standalone Cluster(VM) – 60 Minutes.


You will learn to configure a two-node Spark standalone cluster.

Create another instance of the VM by copying the VM folder or making a clone. You need to shut down the VM for that.
If you use a clone, then specify as shown below. Hint: Select the parent virtual machine and select VM > Manage > Clone.

It will take a few minutes depending on your VM size.

At the end, you will get an additional VM as shown below (example).

You need to power on both VMs at this step.

Add another node, i.e. a slave node, to the cluster. Change the hostname of the slave node to slave.

Log on to the slave VM. Your slave node should be as shown below.

Ensure you provide entries in /etc/hosts of both VMs, so that they can communicate with each other using hostnames.
192.168.188.178 master
192.168.188.174 slave

Note: Replace the IP of your machine accordingly.


Log on to the first VM, i.e. master, to configure a password-less connection between the nodes.

SSH access
The root user on the master must be able to connect
• to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and
• to the root user account on the slave via a password-less SSH login.

You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).

The following steps will ensure that root user can ssh to its own account without password in all nodes.
Let us generate the keys for user root on the master node, to make sure that root user on master can ssh to slave
nodes without password.

$ su - root
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys


$ ssh master
You need to accept the key for the first time.
You can do this manually or use the following SSH command: to copy public key to all the slave nodes.

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave

Accept it, yes and supply the root password for the last time.


This command will prompt you for the login password for user root on slave, then copy the public SSH key for
you, creating the correct directory and fixing the permissions as necessary.

The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. This step is also needed to save slave‘s host key fingerprint to the root@master‘s known_hosts file.

So, connecting from master to master…


$ ssh master

And from master to slave.


$ ssh slave


Go to SPARK_HOME/conf/ and create a new file named spark-env.sh on the master node.
There will be a spark-env.sh.template in the same folder, and this file gives you details on how to declare various
environment variables.

Now we will give the IP address of Master.


SPARK_LOCAL_IP=192.168.188.173
SPARK_MASTER_HOST={IP Address}

// Note: You need to specify IP only.

On the master node, create SPARK_HOME/conf/slaves and enter the hostnames of both nodes (master and slave), one per line.

A worker process will be started on both nodes.


Then go to the Spark home directory and run the following command from the terminal:

#sh sbin/start-all.sh

This will start the standalone master and the worker processes.

http://master:8080/

You can access the Spark Master with the above URL.


Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers
to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which
is http://localhost:8080 by default.

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the
new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
In our case its having 2 nodes.

Then you can verify the Node using the master UI:


Let us execute Spark shell using Cluster mode:

You can perform from any client/server node i.e windows desktop client.

Connecting an Application to the Cluster

To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master as to
the SparkContext constructor.

To run an interactive Spark shell against the cluster (specify the master URL), run the following command from the
bin folder.
We are executing it from the slave node.

#pyspark --conf spark.executor.memory=512mb --conf spark.executor.cores=1 --master spark://master:7077

You can verify the application execution from the web UI


Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")


Let’s start with a few actions:


textFile.count()  # Number of items in this RDD


textFile.first()  # First item in this RDD

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?


Let's say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)

Stop the spark process. Hints [sbin/stop-all.sh]

----------------------------------------------- Lab Ends Here --------------------------------------------------------


21. Spark Two Nodes Cluster Using Docker – 90 Minutes


Prerequisites:

- Install Docker
- Pull Centos Image.

Steps for deploying two node spark cluster in docker:

- Create two containers using the above image


o Spark0
o Spark1
- Create a network to connect between these two nodes.

You should pull the following image from Docker Hub: centos:7.

Let us create a common network for our two nodes cluster.

#docker network create --driver bridge spark-net


Create first node, spark0 container

#docker run -it --name spark0 --hostname spark0 --privileged -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init

Docker command with Host folder mounted. (Skip)

# docker run -it --name spark0 --hostname spark0 --privileged --network spark-net -v /Volumes/Samsung_T5/software/:/Software -v /Volumes/Samsung_T5/software/install/:/opt -v /Volumes/Samsung_T5/software/data/:/data -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init

To connect a running container to an existing user-defined bridge, use the docker network connect command.
If spark0 was already started without the network attached, execute the following; otherwise skip this step.

$ docker network connect spark-net spark0

Start the first container:

#docker start spark0

Connect to you first node:

# docker exec -it spark0 /usr/bin/bash


Start the second node and perform the installation as specified in the first lab.

#docker run -dit --name spark1 -p 8082:8081 -p 4041:4040 --network spark-net --entrypoint
/bin/bash centos:7

or

Docker command with Host folder mounted. (Skip)

# docker run -it --name spark1 --hostname spark1 --privileged --network spark-net -v
/Users/henrypotsangbam/Documents/Docker:/opt -p 4041:4040 -p 8082:8081 centos:7 /usr/sbin/init

Confirm the Network status:

Inspect the network and find the IP addresses of the two containers

#docker network inspect spark-net


Ensure to substitute IPs accordingly.

Attach to the spark1 (worker) container and test its communication to the spark0 (master) container, using both
its IP address and its container name.
# ping spark0


Architecture

master – spark0 and Slave – spark1


Let us start the Spark master.

Install Spark; you can refer to the first lab.

Execute the following in Spark0

Setup a Spark master node

Change the master port to 8090; there is an issue when it runs on 7077 in the Docker environment.

Execute on the spark0 – Terminal.

#export PATH=$PATH:/opt/spark/bin
#export SPARK_MASTER_PORT=8090
#spark-class org.apache.spark.deploy.master.Master


Start a worker process on node 0, in a separate terminal.

To attach a worker to the cluster, execute the following in the spark0 container.

#export PATH=$PATH:/opt/spark/bin


#spark-class org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://172.17.0.2:8090

If it is unable to connect to localhost, replace it with the container IP or the container alias, i.e. spark0.

Refresh the web UI and ensure that you can see a worker as shown below.


http://127.0.0.1:8080


Start the worker on spark1.

Ensure that you have installed Spark on spark1 too.

#docker exec -it spark1 bash

# spark-class org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://172.17.0.2:8090

At the end of this step, you should have 2 worker nodes as shown below:


Open a pyspark shell and connect to the Spark cluster

Let us execute Spark shell using Cluster mode:

You can perform this step from any client or server node, e.g. a Windows desktop client.

Connecting an Application to the Cluster

To run an application on the Spark cluster, pass the spark://IP:PORT URL of the master to the SparkContext constructor.

To run an interactive Spark shell against the cluster (specify the master IP), run the following command from the
bin folder. We are executing it from the slave node.

#pyspark --conf spark.executor.memory=512mb --conf spark.executor.cores=1 --master spark://spark0:8090

You can verify the application execution from the web UI


Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")


Let’s start with a few actions:


textFile.count() # Number of items in this RDD


textFile.first() # First item in this RDD

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line : "Spark" in line).count() # How many lines contain "Spark"?


Let’s say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)

Refer the web UI for the Job details as shown below.


The console should display the result as shown below:


Hints (for submitting a job on the cluster):

You need to complete the Lab - Running a Spark Application before going ahead. Submit from the /software folder, since the input
file exists in that folder.

You need to copy the data files to both nodes. However, for our lab they are already present in the spark folder.
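
In case you have not completed that lab yet, a minimal SimpleApp.py might look like the following sketch; the script used in the referenced lab may differ, and the path assumes README.md is present under /opt/spark on every node:

"""SimpleApp.py - minimal sketch"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text("/opt/spark/README.md").cache()   # file must exist on every worker
print("Lines with a:", logData.filter(logData.value.contains("a")).count())
spark.stop()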

#spark-submit --master spark://spark0:8090 --conf spark.executor.memory=256mb workspace/SimpleApp.py

Verify the output.

---------------------------------------------------------------- Lab Ends Here ------------------------------------

https://towardsdatascience.com/diy-apache-spark-docker-bb4f11c10d24


22. Launching on a Cluster: Hadoop YARN – 150 Minutes


In this lab, you will deploy Spark Application on Hadoop Cluster.

Configure and install YARN.


One node will run YARN exclusively. Options for deploying Hadoop:
• Copy a VM and rename it to HadoopVM-YARN.
• Create another container.

Install YARN – On the New Node.

By now, you should have two VMs on your workstation, or two containers:

For me, the first VM hostname is hp.com and the second one is ht.com. Ensure that you follow the same nomenclature
to avoid confusion; it is very important.

Host/VM    Container    Remarks
hp.com     hadoop0      YARN services
ht.com     spark0       Spark client / Spark

Using Docker:
#docker run -it --name hadoop0 --privileged -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 8032:8032 -p
8188:8188 -p 8020:8020 --hostname hadoop0 centos:7 /usr/sbin/init

Verify the hostname and the /etc/hosts. You need to enter each IP and host name as shown above. Update details
accordingly in your local machine.


You should see the following in your window for ht.com. Change the IPs to match your system.

All the commands below should be executed on hp.com (the Hadoop node) unless specified otherwise. We are now
configuring the YARN cluster.

Change directory to the location where you have the software.

Download Location: (hadoop-3.2.2.tar.gz)


https://hadoop.apache.org/releases.html

Untar as follows:
tar -xvf hadoop-X.tar.gz -C /opt


tar -xvf jdk-8u40-linux-i586.tar.gz -C /opt

Rename the folder:

#mv hadoo* hadoop


#mv jdk* jdk
Install the JDK and set JAVA_HOME (copy the bin file into the YARN folder). To make JAVA_HOME available to all bash
users, make an entry in /etc/profile.d as follows:

echo "export JAVA_HOME=/opt/jdk/" > /etc/profile.d/java.sh

Unpack the downloaded Hadoop distribution. In the distribution, edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh to define some
parameters as follows:

# set to the root of your Java installation

export JAVA_HOME=/opt/jdk

# cd /opt/hadoop

Try the following command:

$ bin/hadoop


You need to modify some settings as follows; replace the hostnames accordingly.

Use the following:

etc/hadoop/core-site.xml:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop0:8020</value>
</property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Set up passphraseless SSH

Now check that you can ssh to localhost without a passphrase:


yum -y install openssh-server openssh-clients


systemctl start sshd

Change the password using the passwd command.

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 0600 ~/.ssh/authorized_keys

When running the Hadoop daemons as root, also export the following variables (for example in your shell profile or in hadoop-env.sh):

export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

Format the filesystem:

$ bin/hdfs namenode -format


Start NameNode daemon and DataNode daemon:

$ sbin/start-dfs.sh

1. Browse the web interface for the NameNode; by default it is available at:
   o NameNode - http://localhost:9870/
2. Make the HDFS directories required to execute MapReduce jobs:
   $ bin/hdfs dfs -mkdir /user
   $ bin/hdfs dfs -mkdir /user/root


You can run a MapReduce job on YARN in a pseudo-distributed mode

Start HDFS and YARN on the YARN Node: hp.com

1. Configure parameters as follows:

etc/hadoop/mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>

<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
</configuration>


etc/hadoop/yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HAD
OOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

2. Start ResourceManager daemon and NodeManager daemon:

$ sbin/start-yarn.sh


3. Browse the web interface for the ResourceManager; by default it is available at:

o ResourceManager - http://localhost:8088/

4. When you’re done, you can stop the daemons with: Optional.

$ sbin/stop-yarn.sh

#verify with jps , all the above services should be there.


jps


Hadoop Node - Let us set some environment and path variable as follows: ( vi ~/.bashrc)

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Let us create a temporary folder, that will be used for working space for the YARN cluster.
hadoop fs -mkdir /tmp
hadoop fs -chmod -R 1777 /tmp
hadoop fs -mkdir /tmp/in
hadoop fs -ls /tmp

You can get the README.md file from the Software folder, which is provided along with the training.
hadoop fs -copyFromLocal /opt/hadoop/README.txt /tmp/in
hadoop fs -ls /tmp/in

hadoop fs -mkdir /tmp/spark-events


Congrats! You have successfully configured the YARN cluster.

If you are using docker, join the hadoop0 and spark0 containers in a single network:

#docker network connect spark-net hadoop0

Verify it:
#docker network inspect spark-net


Start the Spark VM / spark container (i.e. ht.com) if it is not yet started, and issue the following commands.
Log on to the machine using telnet. You can use the VM shared-folder option to copy the file to the VM from your
workstation, or else copy it with the mouse from your workstation to the specified folder.

Let us configure hadoop client setting on the Spark VM --> ht.com or spark0 node.

Compress the Hadoop folder on the Hadoop0 instances or Hadoop node.


#cd /opt
#tar -cvzf hadoop.tar hadoop/

Copy the compressed hadoop folder to the spark0 or spark node and uncompressed in /opt folder.
[Docker:
copy from hadoop to host : docker cp hadoop0:/opt/hadoop.tar .
copy from host to spark : docker cp hadoop.tar spark0:/opt/]

#tar -xvf hadoop.tar

Update yarn-site.xml on the Spark Node.

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
  <description>The hostname of the RM.</description>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hadoop0:8032</value>
  <description>The address of the RM.</description>
</property>

Create the following folder on HDFS cluster and copy the file in the folder.
# hdfs dfs -mkdir -p /opt/spark
# hdfs dfs -copyFromLocal /opt/spark/README.md /opt/spark/

You can verify the file with the following command. This command will display the content of the file in the
HDFS cluster.

# hdfs dfs -cat /opt/spark/README.md

Open a command console to execute the Spark jobs to submit on YARN.

Ensure that you install python on Hadoop node:


# yum install -y python3

Export the Hadoop and YARN environment variables on the Spark node, pointing to the Hadoop configuration directory.


Create a python app, SimpleApp.py with the following code:

// ----------------------------------- Code Begin ------------------------------------------------//

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "/opt/spark/README.md" # Should be some file on your system

spark = SparkSession.builder.appName("Python App").getOrCreate()


logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

// ----------------------------------- Code Ends ------------------------------------------------//


The above code reads the README.md file from the HDFS cluster and counts the lines containing ‘a’ and ‘b’
respectively.

#export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
#export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
#./spark-submit --master yarn --deploy-mode cluster --executor-memory 512m --num-executors 2 SimpleApp.py

or, in client mode:


#./spark-submit --master yarn --deploy-mode client /opt/data/SimpleAppP.py

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the
Hadoop cluster.

Note: the application .py/.jar should be on the local file system; the input folder is in HDFS; and the output folder should not
already exist in HDFS - it will be created automatically.
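
If you want to be explicit about which filesystem a path refers to, you can use a fully qualified URI; this assumes the fs.defaultFS value hdfs://hadoop0:8020 configured earlier:

logData = spark.read.text("hdfs://hadoop0:8020/opt/spark/README.md")
logData.show(5, truncate=False)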


You can verify the progress of the application at http://hp.com:8088/cluster.

Click on Cluster --> Applications --> Accepted.

Wait a few moments and then check the Running tab.


Click on the Application ID --> Logs


Click Logs --> stderr to get details as follows.

After some time, once the console exits the program, click on the Finished tab.


On the job submission console:

Since we printed the output, you can verify it only from the log file.


Go to the user-log folders of the Hadoop installation. (The job ID and the container ID will be different on your
machine; update them accordingly.)

#cd /opt/hadoop/logs/userlogs
#more application_1625129460519_0004/container_1625129460519_0004_01_000001/stdout

Task: let us write the output to a file instead of printing it to the console.

Create a file SimpleAppP.py with the following code. It writes only the dataframe rows whose lines contain ‘a’.
//-------------------------- Code Begin ---------------------

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "/opt/spark/README.md" # Should be some file on your system

spark = SparkSession.builder.appName("Python App").getOrCreate()


logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()


#print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

myDF = logData.filter(logData.value.contains('a'))
myDF.write.save("mydatap")
spark.stop()

//-------------------------- Code Ends ----------------------

Execute the program again.

#spark-submit --master yarn --deploy-mode cluster --executor-memory 512m --num-executors 2 SimpleAppP.py

At the end of the execution, you should have the following output in the console.


You can verify the output file in the hdfs as shown below:
#hdfs dfs -ls -R /user/root/mydatap
# hdfs dfs -cat /user/root/mydatap/part-00000-554ff738-ee36-449c-b654-9a7fa92f17b0-c000.snappy.parquet

It’s a binary parquet file.
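
To inspect the content in a readable form, you can read the Parquet output back with Spark instead of cat; a quick check from a pyspark shell, assuming the default /user/root output path shown above:

df = spark.read.parquet("/user/root/mydatap")
df.show(5, truncate=False)   # displays the filtered lines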

Great! You have successfully executed a Spark job on the YARN cluster.


Let us verify the output now using HDFS browser.

http://localhost:9870/explorer.html#/


Click Utilities --> Browse the file system --> /user/root/mydatap (enter it in the Directory field) --> Go.

Click on the file marked above.


Let us view one of the output files: click on part-0001 --> Download.

Congrats!
---------------------------------------------- End of Lab-----------------------------------


Errata:
1) 15/06/14 21:50:46 INFO storage.BlockManagerMasterActor: Registering block manager ht.com:50475 with
267.3 MB RAM, BlockManagerId(<driver>, ht.com, 50475)
15/06/14 21:50:46 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-
events/application_1434297324587_0001.inprogress, expected: hdfs://hp.com at
org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1262)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:1262)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:128)


Solution: Verify the default configuration of the Spark installation: /spark/spark-1.3.0-bin-hadoop2.4/conf/spark-defaults.conf


Check that spark.eventLog.dir begins with hdfs:// and that /tmp/spark-events has already been created in HDFS.
You can verify this on the YARN cluster only.

Verify using: hadoop fs -ls -R /tmp/

2) 15/06/14 00:41:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

15/06/14 21:58:27 INFO yarn.Client: Application report for application_1434297324587_0002 (state:
ACCEPTED)
The above message loops indefinitely.
Solution: Ensure that every configuration file maps to the appropriate hostname and that hostname-to-IP entries are
present in the hosts file. Increase the resources and wait a few minutes for the status to change to RUNNING.

Yarn-site.xml
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value> <!-- 4 GB -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value> <!-- 512 MB -->
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>256</value> <!-- 256 MB -->
</property>

yarn-site.xml – Spark Node. (Specify the hostname of the Hadoop RM Node as shown below)

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>


</property>

3) If unhealthy nodes are displayed in the cluster URL with "1/1 local-dirs are bad: /tmp/hadoop-yarn/nm-local-dir":

Solution: delete the folder using the following command and ensure that the nodes are healthy before proceeding;
you can restart if required.

hadoop fs -rm -fr /tmp/hadoop-yarn/*

Try the following command:


yarn application -list
yarn application -kill [application name]

You can run jps on the YARN cluster while a Spark job is executing; it will show Coarse* (executor) processes.

Cluster deploy mode works as well.


Issue:


Verify the IP of the ResourceManager and set the home directory appropriately.
Try updating the following settings in yarn-site.xml on both the Hadoop and Spark nodes:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>
</property>

-------------------------------- ----------------------------------------------------------------------------


23. Jobs Monitoring : Using Web UI. – 45 Minutes


Execute the job as specified in the Spark standalone cluster lab (or using Docker) before proceeding.
- Start the two-node Spark cluster.
- Copy the input file to both nodes if you are executing on the standalone cluster.

Execute the following job:

File names.json contains:

{"firstName":"Grace","lastName":"Hopper"}
{"firstName":"Alan","lastName":"Turing"}
{"firstName":"Ada","lastName":"Lovelace"}
{"firstName":"Charles","lastName":"Babbage"}

Enter the following commands one at a time.

#pyspark --master spark://spark0:8090


ip="/software/names.json"
op="/software/nameop"
spark.sparkContext.setLogLevel("WARN")
peopleDF = spark.read.json(ip) # First job
namesDF = peopleDF.select("firstName","lastName")
namesDF.write.option("header","true").csv(op) # Second job

Access the Spark Cluster web console.


http://127.0.0.1:8080

You should have two worker nodes as shown below.

Click on the Application ID --> Application Detail UI


You can verify the Jobs details Using Spark UI: http://master:4040/jobs/

web UI comes with the following tabs (which may not all be visible at once as they are lazily created on demand,
e.g. Streaming tab):
§ Jobs
§ Stages
§ Storage with RDD size and memory use
§ Environment
§ Executors
§ SQL

The Jobs Tab shows status of all Spark jobs in a Spark application


The timeline displays when executors were added to the job, when the job completed, and so on.
In the above, you can verify that 2 executors have been added.

Job 1 succeeded, i.e. reading the JSON file.

Job 2 succeeded, i.e. the write action.

When you hover over a job in the Event Timeline, not only do you see the job legend, but the job is also highlighted in the
Summary section.

The Event Timeline section shows not only jobs but also executors.
You can verify the completed and failed jobs too:


You can answer some of the following questions from this console:

How long did the job take to execute? (1: Here it is 2 s.)

How many stages and tasks were created per job? (2: 1 stage and only one task.)


Click on job 1 (the write action) --> Description --> stages to view the DAG. When you click a job in the
All Jobs page, you see the Details for Job page.


As seen above, there is 1 stage in the job. It scans the file and creates a MapPartitionsRDD.

It shows the DAG graph and the stage details.


Scroll down to view the Event timeline.

Questions:

How long does the scheduler take to schedule the task? (Check the Light Blue Block)
How long does the task deserialization take? (Check the Orange Block)
What is the actual computation time? (Verify the Green Color block)

In ideal scenario, Green should take most of the computing resources.


You can also verify how much time each task took to execute. This provides some information about the health of the computation.


Scroll down to view the task metrics: which tasks take the majority of the time, how many bytes each task consumed or output,
and related statistics.

From the Locality_Level column above, you can verify that the data is processed locally, i.e. data locality.


Executors Tab

The Executors tab in the web UI shows statistics for each executor: how much time it spent on GC, how much data was shuffled, and so on.


Click on the SQL tab.

You can view the statistics of the SQL query: how many rows and bytes were read, etc.


Verify the plan (Scroll down the SQL tab.)

--------------------------------- Lab Ends Here -----------------------------------------------------------


24. Spark Streaming With Socket – 45 Minutes

You can execute using pyspark cli console or zeppelin


Open a terminal and perform the following activities.
Install netcat utility
yum install nc.x86_64
or yum install nc

You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by
using
$ nc -lk 9999
Then, any lines typed in the terminal running the netcat server will be counted and printed on screen
every batch interval (5 seconds in our setup). It will look something like the following.


Type the following commands in pyspark or Zeppelin.

First, we import StreamingContext, which is the main entry point for all streaming functionality. We
create a StreamingContext from the existing SparkContext (sc) with a batch interval of 5 seconds.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a StreamingContext with a batch interval of 5 seconds
ssc = StreamingContext(sc, 5)


# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

This lines DStream represents the stream of data that will be received from the data server. Each record
in this DStream is a line of text. Next, we want to split the lines by space into words.

# Split each line into words


words = lines.flatMap(lambda line: line.split(" "))

flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new
records from each record in the source DStream. In this case, each line will be split into multiple words
and the stream of words is represented as the words DStream. Next, we want to count these words.


# Count each word in each batch


pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda a, b: a + b)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs,
which is then reduced to get the frequency of words in each batch of data.
Finally, wordCounts.pprint() will print a few of the counts generated in each batch.

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform
when it is started, and no real processing has started yet. To start the processing after all the
transformations have been setup, we finally call

ssc.start() # Start the computation


ssc.awaitTermination() # Wait for the computation to terminate

You will be able to see the following in the zeppelin console


Each word’s occurrences are counted per 5-second batch.
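
For reference, here is the whole example collected into a single standalone script; this is a sketch that assumes you launch it with spark-submit outside pyspark/Zeppelin, hence the explicit SparkContext:

"""network_wordcount.py - sketch of the complete socket streaming example"""
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one receives the socket data, one processes it.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)    # connect to the netcat server
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()                                # print a few counts per batch

ssc.start()
ssc.awaitTermination()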


---------------------------------------- Lab Ends Here -----------------------------------------------


25. Spark Streaming With Kafka – 60 Minutes

Goals: To integrate Apache Kafka with Spark Using Streaming API.

Steps to be performed:
§ Install Kafka and start it along with producer.
§ Write python program to connect to Kafka and get data from Kafka using streaming
§ Execute it to fetch the data.
Use the Spark VM

Install kafka:

tar -xvf kafka_2.12-2.8.0.tar -C /spark

cd /spark/kafka_2.12-2.8.0

To start the server, start ZooKeeper first in one terminal:

bin/zookeeper-server-start.sh config/zookeeper.properties


Now start the Kafka server on other terminal

bin/kafka-server-start.sh config/server.properties


Keep both the console open.


Create a topic
Let's create a topic named "test" with a single partition and only one replica: Open one more terminal
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181


Send some messages


// Open a new console, change directory to the kafka installation folder

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

// type some message

Start a consumer

// Open a new console, change directory to the kafka installation folder


bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning


Try sending some messages from the producer to the consumer.

You have successfully configured Kafka. Now let us fetch the data using Spark.

Do not close any of the above windows.


Save the following code in a file named kafka.py.

--------------- Code Begin ----------------


from __future__ import print_function

import sys

from pyspark.sql import SparkSession


from pyspark.sql.functions import *
from pyspark.sql.types import *

# Initializing Spark session


spark = SparkSession.builder.appName("MySparkSession").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
# Setting parameters for the Spark session to read from Kafka
bootstrapServers = "spark0:9092"
subscribeType ="subscribe"
topics = "test"

# Streaming data from Kafka topic as a dataframe


lines = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
bootstrapServers).option(subscribeType,
topics).load()

# Read the raw data from the dataframe as a string and name the column "words"
lines = lines.selectExpr("CAST(value AS STRING) as words")


query = lines.writeStream.outputMode("append").format("console").start()
# Terminates the stream on abort
query.awaitTermination()

--------------- Code Ends ------------------



Submit the job with the following parameters:

#spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --master local[2] --conf "spark.dynamicAllocation.enabled=false" kafka.py


Type some words on the Producer console.

Whatever you push to Kafka should be written on the console as shown below.
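
Optionally, and not part of the original lab, you can extend kafka.py to count words instead of echoing raw lines; a sketch of the changed portion, reusing the lines dataframe defined above:

from pyspark.sql.functions import explode, split

# Split each Kafka message into words and keep a running count per word.
words = lines.select(explode(split(lines.words, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()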


------------------------------------------ Lab Ends Here ---------------------------------------------------------


26. Streaming with State Operations- Python – 60 Minutes


In this lab, we will maintain a running word count across batches using stateful operations.

To run this on your local machine, you need to first run a Netcat server.
$ nc -lk 9999

We will be sending message to the above netcat server and then apply transformation to it using Spark Streaming.

Open a pyspark terminal.

#pyspark

Type the following code one by one in the console:

from __future__ import print_function


import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

The above commands import the packages required to run a Spark Streaming application.

# Initialize the StreamingContext using the pre-created SparkContext, sc. The batch interval is 5 seconds, i.e. a micro-batch
every 5 seconds.

ssc = StreamingContext(sc, 5)

# Specify the checkpoint directory, since we are going to maintain state across the batch.
ssc.checkpoint("checkpoint")

# RDD with initial state (key, value) pairs


initialStateRDD = sc.parallelize([(u'hello', 1), (u'world', 1)])


# Define the function that will update the state of the count.
def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)
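
To see what this function does, here are two plain-Python evaluations (no streaming involved):

updateFunc([1, 1, 1], 2)   # -> 5 : three new occurrences added to a previous count of 2
updateFunc([], None)       # -> 0 : no new values and no previous state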

# Open a connection to the netcat server to retrieve the incoming messages.


lines = ssc.socketTextStream("localhost",9999)

# Apply the transformation logic to count the words. Here we use flatMap, since it can return a list of objects.

running_counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(updateFunc, initialRDD=initialStateRDD)

# Print out the calculation on the terminal.

running_counts.pprint()

# start the streaming context and wait for termination from external.
ssc.start()
ssc.awaitTermination()

Snippets:


On the netcat terminal, type sentences like the following, with some interval between them.

Observe the output on the pyspark console.


As you can observe, the word “Henry” occurred twice across the data records; thus, state has been maintained.

--------------------------- Lab Ends Here --------------------------------------------------


27. Spark – Hive Integration – 45 Minutes


When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent
Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive
deployment can still enable Hive support.

Use spark.sql.warehouse.dir to specify the default location of databases in the warehouse.

Open a pyspark terminal (Using CLI only)

#pyspark --conf spark.sql.catalogImplementation=hive

Execute the following code.

from os.path import abspath

from pyspark.sql import SparkSession


from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('/opt/spark/spark-warehouse')

spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config("spark.sql.catalogImplementation", "hive") \
.enableHiveSupport() \
.getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")


spark.sql("LOAD DATA LOCAL INPATH '/opt/spark/examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL


spark.sql("SELECT * FROM src").show()
# +---+-------+
# |key| value|
# +---+-------+
# |238|val_238|
# | 86| val_86|
# |311|val_311|
# ...

# Aggregation queries are also supported.


spark.sql("SELECT COUNT(*) FROM src").show()
# +--------+
# |count(1)|
# +--------+
# | 500 |
# +--------+

# The results of SQL queries are themselves DataFrames and support all normal functions.
sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

# The items in DataFrames are of type Row, which allows you to access each column by ordinal.
stringsDS = sqlDF.rdd.map(lambda row: "Key: %d, Value: %s" % (row.key, row.value))
for record in stringsDS.collect():
    print(record)
# Key: 0, Value: val_0
# Key: 0, Value: val_0


# Key: 0, Value: val_0


# ...

# You can also use DataFrames to create temporary views within a SparkSession.
Record = Row("key", "value")
recordsDF = spark.createDataFrame([Record(i, "val_" + str(i)) for i in range(1, 101)])
recordsDF.createOrReplaceTempView("records")

# Queries can then join DataFrame data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()

Execution Output:


----------------------------------- Lab Ends Here -------------------------------------


28. Spark integration with Hive Cluster (Local Mode)


Prerequisites: Spark installation and Hive local-mode installation.

You can refer Hive Lab for Hive Local Installation.

In this lab, we will configure to use Apache Spark with Hive.

export SPARK_HOME=/opt/spark
export HIVE_HOME=/opt/hive_local

Set HIVE_HOME and SPARK_HOME accordingly.
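
With both variables set, and assuming Spark can see your Hive configuration (for example hive-site.xml from $HIVE_HOME/conf copied or linked into $SPARK_HOME/conf, which depends on your setup), a quick check from pyspark looks like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("HiveLocalModeCheck")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()   # Hive databases should be listed here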


29. Annexure & Errata:

Quit Scala CLI :- :q


Error: No suitable device found: no device found for connection 'System eth0'.
You could try removing/renaming /etc/udev/rules.d/70-persistent-net.rules and reboot. A new file will be created,
sometimes that fixes this kind of problems.

You could also add a file "/etc/sysconfig/network-scripts/ifcfg-eth1", can you bring up eth1 then?

Changing hostname:
§ /etc/hosts
§ vi /etc/sysconfig/network
§ sysctl kernel.hostname=slave
§ bash

Unable to start spark shell with the following error

at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 85 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)


at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)

Solution: unset HADOOP_CONF_DIR and unset HADOOP_HOME

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/

issue: Cannot assign requested address


17/03/15 21:44:15 ERROR spark.SparkContext: Error initializing SparkContext.
java.net.BindException: Cannot assign requested address: Service 'sparkDriver' f
ailed after 16 retries (starting from 0)! Consider explicitly setting the approp
riate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI)
to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:
223)

Solution:
§ include host to ip entry in /etc/hosts --> 192.168.188.178 master
§ update in conf/spark-env.sh


o SPARK_LOCAL_IP=127.0.0.1
o SPARK_MASTER_HOST=192.168.188.178

§ comment the following in conf/spark-defaults.conf


o # spark.master spark://master:7077

Unable to start spark shell:

Caused by: java.io.IOException: Error accessing /opt/jdk/jre/lib/ext/._cld*.jar
This kind of error is resolved by removing all the hidden ._* files (you can identify them by running #ls -alt). List all the
hidden dot files and remove them.

Yum repo config


# mkdir -p /redhatimg
# mount -o loop,ro /mnt/hgfs/MyExperiment/rhel-server-6.4-x86_64-dvd.iso /redhatimg

Create an entry in /etc/fstab so that the system always mounts the DVD image after a reboot.
/mnt/hgfs/MyExperiment/rhel-server-6.4-x86_64-dvd.iso /redhatimg iso9660 loop,ro 0 0

/etc/yum.repos.d/rheldiso.repo

[rhel6dvdiso]
name=RedHatOS DVD ISO
mediaid=1359576196.686790
baseurl=file:///redhatimg
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
