
Spark Applications
Jesús Montes
jesus.montes@upm.es

Oct. 2022
Spark Applications
● So far, we have seen how to run Spark programs interactively, using the
Spark Shell.
● Typically, the shell is used for experimentation and initial data
exploration.
● Complex procedures are usually developed as independent applications.
○ Batch processes.
○ Production software.
○ Etc.
● Spark applications run inside the JVM and are distributed as JAR files.
● When a Spark application is launched, its JAR is deployed to the cluster.

Spark Application Deployment
[Diagram: the Spark Driver hosts the Spark Context for the Spark Application; each Worker Node runs an Executor, and each Executor runs several Tasks.]
Spark Application Deployment
The actual way the application is deployed depends on the environment.

● In local mode, no file transfer is required.
● In cluster mode (Mesos, YARN, Kubernetes), the JAR file and its dependencies have to be copied to the driver and worker nodes.
  ○ In a Hadoop cluster, HDFS is used as a data repository where all these files are stored.
● Application deployment is usually performed with the help of the spark-submit script.

spark-submit
● spark-submit is a deployment tool included in the Spark distribution.
● With it, you can launch a Spark application JAR file, indicating the main class, configuration parameters, additional dependencies, etc.
● It is a script that wraps the SparkSubmit class.

# spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL        spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE  Whether to launch the driver program locally ("client") or
                             on one of the worker machines inside the cluster ("cluster")
                             (Default: client).
  --class CLASS_NAME         Your application's main class (for Java / Scala apps).
  --name NAME                A name of your application.
  --jars JARS                Comma-separated list of local jars to include on the driver
                             and executor classpaths.
  --packages                 Comma-separated list of maven coordinates of jars to include
                             on the driver and executor classpaths. Will search the local
                             maven repo, then maven central and any additional remote
                             repositories given by --repositories. The format for the
                             coordinates should be groupId:artifactId:version.
...
Programming Spark Applications
● Spark Applications are like regular Scala/Java/... programs.
● They have a main method in a main object/class:
  ○ Part of an object in Scala
  ○ Static method in Java
  ○ Main function in Python
● The main difference with using the shell is that the Spark Context must be explicitly configured and created.
● To do this, we use the Spark API.

Basic Spark App structure (in Scala):

package x.y

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object App {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    // ... Do the actual work ...
  }
}
SparkConf
● Objects of the SparkConf class serve as configuration for a Spark
application.
● They are used to set various Spark parameters as key-value pairs.
● Most of the time, a SparkConf object can be created simply with
new SparkConf()
which will load values from any spark.* Java system properties set in the
application as well.
● Parameters can also be defined when invoking spark-submit.
● Parameters set directly on the SparkConf object take priority over system
properties.

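As an illustration (not from the original slides), a minimal sketch of this precedence rule; the property values and the use of a Java system property to stand in for an externally supplied setting are assumptions:

import org.apache.spark.SparkConf

// Simulate a spark.* Java system property (as spark-submit or the environment would set)
System.setProperty("spark.app.name", "NameFromSystemProperty")

val conf = new SparkConf()                    // loads spark.* system properties
  .set("spark.app.name", "NameSetInCode")     // direct setting takes priority

println(conf.get("spark.app.name"))           // prints "NameSetInCode"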
SparkConf methods and configuration variables
Methods:

● setAppName(name: String): SparkConf
  Set a name for your application.
● set(key: String, value: String): SparkConf
  Set a configuration variable.
● setJars(jars: Array[String]): SparkConf
  Set JAR files to distribute to the cluster.
● setMaster(master: String): SparkConf
  The master URL to connect to, such as "local" to run locally with one thread, or "spark://master:7077" to run on a Spark standalone cluster.

Variables:

● spark.app.name
● spark.driver.cores
● spark.driver.memory
● spark.driver.extraClassPath
● spark.driver.extraJavaOptions
● spark.executor.memory
● spark.executor.extraClassPath
● spark.executor.extraJavaOptions
● spark.executor.cores
● …

Take a look at http://spark.apache.org/docs/latest/configuration.html
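A short sketch (not from the slides) chaining the methods listed above; the app name, memory value, and master URL are made-up examples, and the jar path reuses the one built later in this deck:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")                    // spark.app.name
  .setMaster("local[2]")                  // run locally with two threads
  .set("spark.executor.memory", "1g")     // generic key-value setting
  .setJars(Array("target/scala-2.12/sparkapp_2.12-1.0.0.jar"))  // jars to ship to the cluster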
Programming Spark Applications
● To create a Spark application, you just need to write the application code,
compile it and package it into a JAR file.
● Spark libraries will be dependencies of your app, so you will need them
for compiling.
○ For Spark Core programs, you just need the spark-core jar.
○ All Spark jars (and their dependencies) are located in the jars directory of your Spark installation.
● It is not necessary to use a build manager (like Maven or SBT), but it is
usually recommended.
○ For this course, we are going to use SBT.

Creating a Scala project with SBT
1. Download sbt (https://www.scala-sbt.org/)
2. Install sbt (just unzip the file in some directory, like /opt/sbt)
3. To create a basic Scala project, create a directory with the appropriate
structure and a build.sbt file:
$ mkdir sparkapp; cd sparkapp
$ mkdir -p src/{main,test}/{java,resources,scala}
$ touch build.sbt
4. Fill your build.sbt with some initial configuration:
organization := "upm.bd"
name := "sparkapp"
version := "1.0.0"

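The slides do not show it here, but since the packaged jar later appears under target/scala-2.12/, the build presumably also pins the Scala version; a sketch of the full build.sbt under that assumption:

organization := "upm.bd"
name := "sparkapp"
version := "1.0.0"

// Assumed: match the Scala version of the Spark distribution (2.12 for Spark 3.3.0)
scalaVersion := "2.12.15"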
Creating a Scala project with SBT
The previous steps lead you to the creation of an empty Scala app with SBT. To test your app, follow these steps:

1. Compile and package your app by running SBT inside your project directory
   # /opt/sbt/bin/sbt package
2. Now you can run your app with the JVM
   # java -cp {scala-library.jar}:{app jar} {app class}
   ○ {scala-library.jar}: you can find this in Spark's jars dir
   ○ {app jar}: this is your app's jar
   ○ {app class}: this is your app's main object

Creating a Spark Application
1. Create a simple Scala project (organization: upm.bd, name: sparkapp).
2. Add the following dependency to the build.sbt file:

   libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"

   The first time you compile after adding this dependency, SBT will download all Spark dependencies (there are quite a few).
3. Now create the src/main/scala/MyApp.scala file with the code on the
next slide.

Creating a Spark Application
package upm.bd

// Spark classes
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.{Level, Logger}

object MyApp {
  def main(args: Array[String]) {
    // Make the output less verbose
    Logger.getLogger("org").setLevel(Level.WARN)

    // App configuration and context creation
    val conf = new SparkConf().setAppName("My first Spark application")
    val sc = new SparkContext(conf)

    // Doing the work
    //val data = sc.textFile("file:///tmp/book/98.txt")
    val data = sc.textFile("hdfs:///tmp/book/98.txt")
    val numAs = data.filter(line => line.contains("a")).count()
    val numBs = data.filter(line => line.contains("b")).count()

    // Console output
    println(s"Lines with a: ${numAs}, Lines with b: ${numBs}")
  }
}

Testing your Spark Application
1. Compile and package the application:
# /opt/sbt/bin/sbt package
2. Launch the application in the cluster with spark-submit:
   # spark-submit --master yarn --class upm.bd.MyApp target/scala-2.12/sparkapp_2.12-1.0.0.jar

To run locally (no Hadoop), use --master local instead.

Shared variables in Spark
● When a function passed to a Spark operation is executed on a remote
cluster node, it works on separate copies of all the variables used in the
function.
● These variables are copied to each machine.
● No updates to the variables on the remote machine are propagated back
to the driver program.
● Supporting general, read-write shared variables across tasks would be
inefficient.
● However, Spark does provide two limited types of shared variables for
two common usage patterns: broadcast variables and accumulators.

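To make this concrete, a small illustrative sketch (not from the slides); the variable name counter is made up:

var counter = 0
val rdd = sc.parallelize(1 to 100)

// Each task gets its own copy of counter; updates happen on the executors only
rdd.foreach(x => counter += x)

println(counter)  // in cluster mode, still 0 on the driver: remote updates are not propagated back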
Broadcast Variables
● Read-only variables cached on each machine.
● They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
● Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
● Explicitly creating broadcast variables is only useful when:
  ○ Tasks across multiple stages need the same data.
  ○ Caching the data in deserialized form is important.

Creating a broadcast variable:

val v = Array(1, 2, 3)
val broadcastVar = sc.broadcast(v)

Accessing a broadcast variable:

broadcastVar.value

After the broadcast variable is created:

● It should be used instead of the value v.
● v should not be modified, to ensure that all nodes get the same value.
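A small usage sketch (not from the slides) of referencing the broadcast inside a task instead of the original value; the lookup table and RDD contents are made up for illustration:

val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bLookup = sc.broadcast(lookup)

// Tasks read bLookup.value instead of closing over the Map directly,
// so the Map is shipped once per executor rather than once per task
val ids = sc.parallelize(Seq("a", "b", "c", "a"))
  .map(key => bLookup.value.getOrElse(key, -1))

println(ids.collect().mkString(", "))  // 1, 2, 3, 1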

Accumulators
● Accumulators are variables that are only “added” to through an associative and commutative operation.
● They can be used to implement counters (as in MapReduce) or sums.
● Spark natively supports accumulators of numeric types, and programmers can add support for new types.
● If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages.

Using an accumulator:

val accum = sc.longAccumulator("My Accumulator")

sc.parallelize(Array(1, 2, 3, 4))
  .foreach(x => accum.add(x))

accum.value

What will be the value of accum?

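Regarding support for new types: a sketch (not from the slides) of a custom accumulator using the AccumulatorV2 API; the class name, accumulator name, and sample data are made up for illustration:

import org.apache.spark.util.AccumulatorV2

// Illustrative accumulator that collects the distinct strings seen across tasks
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set: Set[String] = Set.empty
  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = { val c = new StringSetAccumulator; c.set = set; c }
  override def reset(): Unit = { set = Set.empty }
  override def add(v: String): Unit = { set += v }
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { set ++= other.value }
  override def value: Set[String] = set
}

val wordsAcc = new StringSetAccumulator
sc.register(wordsAcc, "Distinct words")  // named, so it shows up in the Spark UI

sc.parallelize(Seq("a", "b", "a")).foreach(w => wordsAcc.add(w))
println(wordsAcc.value)  // Set(a, b) on the driver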
Repartition
repartition(numPartitions: Int): RDD[T]

● An RDD transformation that returns a new RDD that has exactly numPartitions partitions.
● This operation reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
● It can be used to optimize cluster utilization.

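A brief usage sketch (not from the slides); the input path reuses the earlier example, and the partition count of 8 is arbitrary:

val lines = sc.textFile("hdfs:///tmp/book/98.txt")  // partitioning follows the input splits
println(lines.getNumPartitions)

val balanced = lines.repartition(8)                 // full shuffle into exactly 8 partitions
println(balanced.getNumPartitions)                  // 8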
Cache/Persist and Checkpoint
cache()
persist()
persist(newLevel: StorageLevel)

● RDD method that marks the RDD to be persisted.
● The first time the RDD is computed in an action, it will be kept in memory on the nodes.
● Each persisted RDD can be stored using a different storage level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, ...).

checkpoint()

● RDD method that marks the RDD to be saved to a file inside the checkpoint directory set with the SparkContext method setCheckpointDir.
● All references to its parent RDDs will be removed (lineage is lost).
● It is strongly recommended that the RDD is persisted in memory, otherwise saving it to a file will require recomputation.
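A brief illustrative sketch (not from the slides) combining both; the checkpoint directory path is an assumption, and the input reuses the earlier example file:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // assumed path

val words = sc.textFile("hdfs:///tmp/book/98.txt")
  .flatMap(_.split(" "))

words.cache()        // keep in memory after the first computation (MEMORY_ONLY)
words.checkpoint()   // save to the checkpoint dir and truncate the lineage

println(words.count())  // first action: computes and caches the RDD; the checkpoint is written right after
println(words.count())  // served from memory / checkpoint, no recomputation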

