
Numeric RDD Operations:

1. Transformation Operations: These operations create a new RDD from an existing one.
Examples include map(), filter(), flatMap(), groupByKey(), reduceByKey(),
sortByKey(), etc. These operations are lazy, meaning they don't execute immediately but build
up a lineage of transformations.
2. Action Operations: These operations trigger the execution of transformations and return results
to the driver program or write data to external storage. Examples include reduce(),
collect(), count(), take(), saveAsTextFile(), foreach(), etc.
3. Numeric Operations: These are operations available on RDDs of numeric values. They include
statistical functions like mean(), sum(), max(), min(), etc. Additionally, you can perform custom
mathematical operations using map() or reduce(); a short sketch combining all three categories
follows this list.
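
To make these categories concrete, here is a minimal PySpark sketch that chains transformations,
actions, and a few numeric operations. The local SparkContext setup and the sample numbers are
assumptions made purely for illustration.

from pyspark import SparkContext

# Assumed local setup for illustration; in a real job the master and
# application name usually come from the cluster configuration.
sc = SparkContext("local[*]", "numeric-rdd-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])    # create an RDD

# Transformations: lazy, they only record lineage.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger execution and return results to the driver.
print(evens.collect())                       # [4, 16]
print(squares.count())                       # 5

# Numeric operations on an RDD of numbers.
print(squares.sum())                         # 55
print(squares.mean())                        # 11.0
print(squares.max(), squares.min())          # 25 1

sc.stop()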

Spark Runtime Architecture:


1. Driver Program: The entry point of a Spark application, responsible for orchestrating the
execution of tasks on the cluster. It maintains information about the Spark application, such as
the DAG (Directed Acyclic Graph) of operations, and communicates with the cluster manager to
allocate resources.
2. Cluster Manager: Manages resources across the cluster and allocates them to Spark
applications. Examples include Spark's built-in standalone cluster manager, Apache YARN, or
Apache Mesos.
3. Executors: Processes launched on the worker nodes of the cluster that execute tasks for a Spark
application. Each executor runs multiple tasks in parallel (typically one per core) and can cache
data in memory or spill it to disk.
4. RDDs (Resilient Distributed Datasets): Immutable distributed collections of data partitioned
across the cluster. RDDs are Spark's fundamental data abstraction and are built by loading external
data or by applying transformations to existing RDDs in parallel.
5. Stages and Tasks: Spark divides each job into stages, and each stage into tasks. Stage boundaries
are determined by shuffles (e.g., a reduceByKey() operation), where data must be redistributed
across the network. Tasks are the units of work executed by the executors; the lineage sketch after
this list shows where such a boundary appears.
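
As a rough illustration of how the driver builds a DAG and splits it at shuffle boundaries, the
sketch below inspects an RDD's lineage with toDebugString(). The local master setting and the
sample key-value pairs are assumptions chosen for demonstration.

from pyspark import SparkContext

# Assumed local setup purely for demonstration.
sc = SparkContext("local[*]", "lineage-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# map() stays within the same stage; reduceByKey() introduces a shuffle,
# so Spark starts a new stage at that point.
totals = pairs.map(lambda kv: (kv[0], kv[1] * 10)).reduceByKey(lambda x, y: x + y)

# toDebugString() prints the lineage; the indented ShuffledRDD entry
# marks the stage boundary.
print(totals.toDebugString().decode("utf-8"))

sc.stop()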

Deploying Applications with spark-submit:


1. Package Your Application: Ensure your application is packaged with all necessary
dependencies, libraries, and configuration files.
2. Submit Command: Use the spark-submit script provided by Spark to submit your
application. Specify the main class containing the entry point of your application, along with any
additional configuration options.
3. Cluster Mode: Choose the appropriate cluster mode (--deploy-mode) for your deployment:
client mode, where the driver runs on the machine submitting the job, or cluster mode,
where the driver runs on one of the cluster nodes.
4. Resource Allocation: Specify the resources required by your application, such as memory per
executor (--executor-memory), cores per executor (--executor-cores), and the number of
executors (--num-executors).
5. Submit: Execute the spark-submit command with the necessary arguments and parameters.
Once submitted, Spark launches the application on the cluster according to the specified
configuration; an example invocation follows this list.
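
Putting these steps together, a typical submission might look like the command below. The
application file (my_app.py), the YARN master, and the resource sizes are placeholders chosen for
illustration, not values from these notes.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  my_app.py

For a JVM application you would also pass --class with the fully qualified name of the main class
and point spark-submit at the packaged JAR instead of a Python file.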

By understanding these components and processes, you can effectively develop, deploy, and
manage Spark applications for various use cases.
