
Spark Applications
Jesús Montes
jesus.montes@upm.es

Oct. 2022
Spark Applications
● So far, we have seen how to run Spark programs interactively, using the
Spark Shell.
● Typically, the shell is used for experimentation and initial data
exploration.
● Complex procedures are usually developed as independent applications.
○ Batch processes.
○ Production software.
○ Etc.
● Spark applications run inside the JVM and are distributed as JAR files.
● When a Spark application is launched, its JAR is deployed to the cluster.

Spark Application Deployment
[Diagram: the Spark Driver hosts the Spark Context for the Spark Application; each Worker Node runs an Executor, and each Executor runs several Tasks.]
Spark Application Deployment
The actual way the application is deployed depends on the environment.

● In local mode, no file transfer is required.
● In cluster mode (Mesos, YARN, Kubernetes), the JAR file and its dependencies have to be copied to the driver and worker nodes.
  ○ In a Hadoop cluster, HDFS is used as a data repository where all these files are stored.
● Application deployment is usually performed with the help of the spark-submit script.

spark-submit
● spark-submit is a deployment tool included in the Spark distribution.
● With it, you can launch a Spark application JAR file, indicating the main class, configuration parameters, additional dependencies, etc.
● It is a script that wraps the SparkSubmit class.

# spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL        spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE  Whether to launch the driver program locally ("client") or
                             on one of the worker machines inside the cluster ("cluster")
                             (Default: client).
  --class CLASS_NAME         Your application's main class (for Java / Scala apps).
  --name NAME                A name of your application.
  --jars JARS                Comma-separated list of local jars to include on the driver
                             and executor classpaths.
  --packages                 Comma-separated list of maven coordinates of jars to include
                             on the driver and executor classpaths. Will search the local
                             maven repo, then maven central and any additional remote
                             repositories given by --repositories. The format for the
                             coordinates should be groupId:artifactId:version.
...
Programming Spark Applications
● Spark Applications are like regular Scala/Java/... programs.
● They have a main method in a main object/class:
  ○ Part of an object in Scala
  ○ Static method in Java
  ○ Main function in Python
● The main difference with using the shell is that the Spark Context must be explicitly configured and created.
● To do this, we use the Spark API.

Basic Spark App structure (in Scala):

package x.y

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object App {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    // ... Do the actual work ...
  }
}
SparkConf
● Objects of the SparkConf class serve as configuration for a Spark
application.
● They are used to set various Spark parameters as key-value pairs.
● Most of the time, a SparkConf object can be created simply with
new SparkConf()
which will load values from any spark.* Java system properties set in the
application as well.
● Parameters can also be defined when invoking spark-submit.
● Parameters set directly on the SparkConf object take priority over system
properties.

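As an illustration (not from the original slides), a minimal sketch of this precedence rule; the property values and the use of a Java system property to stand in for an externally supplied setting are assumptions:

import org.apache.spark.SparkConf

// Simulate a spark.* Java system property (as spark-submit or the environment would set)
System.setProperty("spark.app.name", "NameFromSystemProperty")

val conf = new SparkConf()                    // loads spark.* system properties
  .set("spark.app.name", "NameSetInCode")     // direct setting takes priority

println(conf.get("spark.app.name"))           // prints "NameSetInCode"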
SparkConf methods and configuration variables
Methods:

● setAppName(name: String): SparkConf
  Set a name for your application.
● set(key: String, value: String): SparkConf
  Set a configuration variable.
● setJars(jars: Array[String]): SparkConf
  Set JAR files to distribute to the cluster.
● setMaster(master: String): SparkConf
  The master URL to connect to, such as "local" to run locally with one thread, or "spark://master:7077" to run on a Spark standalone cluster.

Variables:

● spark.app.name
● spark.driver.cores
● spark.driver.memory
● spark.driver.extraClassPath
● spark.driver.extraJavaOptions
● spark.executor.memory
● spark.executor.extraClassPath
● spark.executor.extraJavaOptions
● spark.executor.cores
● …

Take a look at http://spark.apache.org/docs/latest/configuration.html
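A short sketch (not from the slides) chaining the methods listed above; the app name, memory value, and master URL are made-up examples, and the jar path reuses the one built later in this deck:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")                    // spark.app.name
  .setMaster("local[2]")                  // run locally with two threads
  .set("spark.executor.memory", "1g")     // generic key-value setting
  .setJars(Array("target/scala-2.12/sparkapp_2.12-1.0.0.jar"))  // jars to ship to the cluster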
Programming Spark Applications
● To create a Spark application, you just need to write the application code,
compile it and package it into a JAR file.
● Spark libraries will be dependencies of your app, so you will need them
for compiling.
○ For Spark Core programs, you just need the spark-core jar.
○ All Spark jars (and their dependencies) are located in the jars directory of your Spark installation.
● It is not necessary to use a build manager (like Maven or SBT), but it is
usually recommended.
○ For this course, we are going to use SBT.

Creating a Scala project with SBT
1. Download sbt (https://www.scala-sbt.org/)
2. Install sbt (just unzip the file in some directory, like /opt/sbt)
3. To create a basic Scala project, create a directory with the appropriate
structure and a build.sbt file:
$ mkdir sparkapp; cd sparkapp
$ mkdir -p src/{main,test}/{java,resources,scala}
$ touch build.sbt
4. Fill your build.sbt with some initial configuration:
organization := "upm.bd"
name := "sparkapp"
version := "1.0.0"

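The slides do not show it here, but since the packaged jar later appears under target/scala-2.12/, the build presumably also pins the Scala version; a sketch of the full build.sbt under that assumption:

organization := "upm.bd"
name := "sparkapp"
version := "1.0.0"

// Assumed: match the Scala version of the Spark distribution (2.12 for Spark 3.3.0)
scalaVersion := "2.12.15"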
Creating a Scala project with SBT
The previous steps lead you to the creation of an empty Scala app with SBT. To test your app, follow these steps:

1. Compile and package your app by running SBT inside your project directory
   # /opt/sbt/bin/sbt package
2. Now you can run your app with the JVM
   # java -cp {scala-library.jar}:{app jar} {app class}
   ○ {scala-library.jar}: you can find this in Spark's jars dir
   ○ {app jar}: this is your app's jar
   ○ {app class}: this is your app's main object

Creating a Spark Application
1. Create a simple Scala project (organization: upm.bd, name: sparkapp).
2. Add the following dependency to the build.sbt file:

   libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"

   The first time you compile after adding this dependency, SBT will download all Spark dependencies (there are quite a few).
3. Now create the src/main/scala/MyApp.scala file with the code on the
next slide.

Creating a Spark Application
package upm.bd

// Spark classes
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.{Level, Logger}

object MyApp {
  def main(args: Array[String]) {
    // Make the output less verbose
    Logger.getLogger("org").setLevel(Level.WARN)

    // App configuration and context creation
    val conf = new SparkConf().setAppName("My first Spark application")
    val sc = new SparkContext(conf)

    // Doing the work
    //val data = sc.textFile("file:///tmp/book/98.txt")
    val data = sc.textFile("hdfs:///tmp/book/98.txt")
    val numAs = data.filter(line => line.contains("a")).count()
    val numBs = data.filter(line => line.contains("b")).count()

    // Console output
    println(s"Lines with a: ${numAs}, Lines with b: ${numBs}")
  }
}

Testing your Spark Application
1. Compile and package the application:
# /opt/sbt/bin/sbt package
2. Launch the application in the cluster with spark-submit:
   # spark-submit --master yarn --class upm.bd.MyApp target/scala-2.12/sparkapp_2.12-1.0.0.jar

To run locally (no Hadoop), use --master local instead.

Shared variables in Spark
● When a function passed to a Spark operation is executed on a remote
cluster node, it works on separate copies of all the variables used in the
function.
● These variables are copied to each machine.
● No updates to the variables on the remote machine are propagated back
to the driver program.
● Supporting general, read-write shared variables across tasks would be
inefficient.
● However, Spark does provide two limited types of shared variables for
two common usage patterns: broadcast variables and accumulators.

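To make this concrete, a small illustrative sketch (not from the slides); the variable name counter is made up:

var counter = 0
val rdd = sc.parallelize(1 to 100)

// Each task gets its own copy of counter; updates happen on the executors only
rdd.foreach(x => counter += x)

println(counter)  // in cluster mode, still 0 on the driver: remote updates are not propagated back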
Broadcast Variables
● Read-only variables cached on each machine.
● They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
● Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
● Explicitly creating broadcast variables is only useful when:
  ○ Tasks across multiple stages need the same data.
  ○ Caching the data in deserialized form is important.

Creating a broadcast variable:

val v = Array(1, 2, 3)
val broadcastVar = sc.broadcast(v)

Accessing a broadcast variable:

broadcastVar.value

After the broadcast variable is created:

● It should be used instead of the value v.
● v should not be modified, to ensure that all nodes get the same value.
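A small usage sketch (not from the slides) of referencing the broadcast inside a task instead of the original value; the lookup table and RDD contents are made up for illustration:

val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bLookup = sc.broadcast(lookup)

// Tasks read bLookup.value instead of closing over the Map directly,
// so the Map is shipped once per executor rather than once per task
val ids = sc.parallelize(Seq("a", "b", "c", "a"))
  .map(key => bLookup.value.getOrElse(key, -1))

println(ids.collect().mkString(", "))  // 1, 2, 3, 1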

Accumulators
● Accumulators are variables that are only “added” to through an associative and commutative operation.
● They can be used to implement counters (as in MapReduce) or sums.
● Spark natively supports accumulators of numeric types, and programmers can add support for new types.
● If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages.

Using an accumulator:

val accum = sc.longAccumulator("My Accumulator")

sc.parallelize(Array(1, 2, 3, 4))
  .foreach(x => accum.add(x))

accum.value

What will be the value of accum?

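Regarding support for new types: a sketch (not from the slides) of a custom accumulator using the AccumulatorV2 API; the class name, accumulator name, and sample data are made up for illustration:

import org.apache.spark.util.AccumulatorV2

// Illustrative accumulator that collects the distinct strings seen across tasks
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set: Set[String] = Set.empty
  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = { val c = new StringSetAccumulator; c.set = set; c }
  override def reset(): Unit = { set = Set.empty }
  override def add(v: String): Unit = { set += v }
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { set ++= other.value }
  override def value: Set[String] = set
}

val wordsAcc = new StringSetAccumulator
sc.register(wordsAcc, "Distinct words")  // named, so it shows up in the Spark UI

sc.parallelize(Seq("a", "b", "a")).foreach(w => wordsAcc.add(w))
println(wordsAcc.value)  // Set(a, b) on the driver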
Repartition
repartition(numPartitions: Int): RDD[T]

● An RDD transformation that returns a new RDD that has exactly numPartitions partitions.
● This operation reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
● It can be used to optimize cluster utilization.

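A brief usage sketch (not from the slides); the input path reuses the earlier example, and the partition count of 8 is arbitrary:

val lines = sc.textFile("hdfs:///tmp/book/98.txt")  // partitioning follows the input splits
println(lines.getNumPartitions)

val balanced = lines.repartition(8)                 // full shuffle into exactly 8 partitions
println(balanced.getNumPartitions)                  // 8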
Cache/Persist and Checkpoint
cache()
persist()
persist(newLevel: StorageLevel)

● RDD method that marks the RDD to be persisted.
● The first time the RDD is computed in an action, it will be kept in memory on the nodes.
● Each persisted RDD can be stored using a different storage level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, ...).

checkpoint()

● RDD method that marks the RDD to be saved to a file inside the checkpoint directory set with the SparkContext method setCheckpointDir.
● All references to its parent RDDs will be removed (lineage is lost).
● It is strongly recommended that the RDD is persisted in memory, otherwise saving it to a file will require recomputation.
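A brief illustrative sketch (not from the slides) combining both; the checkpoint directory path is an assumption, and the input reuses the earlier example file:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // assumed path

val words = sc.textFile("hdfs:///tmp/book/98.txt")
  .flatMap(_.split(" "))

words.cache()        // keep in memory after the first computation (MEMORY_ONLY)
words.checkpoint()   // save to the checkpoint dir and truncate the lineage

println(words.count())  // first action: computes and caches the RDD; the checkpoint is written right after
println(words.count())  // served from memory / checkpoint, no recomputation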

