
Apache Spark

Introduction

 Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing.
 Spark is ideal for iterative and interactive processing tasks on large data sets and streams.
 Spark achieves 10–100x performance over Hadoop by operating with an in-memory data construct called
Resilient Distributed Datasets (RDDs).
 It helps avoid the latencies involved in disk reads and writes. Spark also offers built-in libraries for
Machine Learning, Graph Processing, Stream processing and SQL to deliver seamless superfast data
processing along with high programmer productivity.
 Spark is compatible with Hadoop file systems and tools.
 Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster
even when running on disk.
History
 Apache Spark was developed in 2009 at UC Berkeley’s AMPLab, open sourced in 2010, and later became an
Apache project.
 It can process data from a variety of data repositories, including the Hadoop Distributed File System
(HDFS), and NoSQL databases such as HBase and Cassandra.

Apache Spark Architecture


 The Spark core engine functions partly as an application programming interface (API) layer and underpins a
set of related tools for managing and analyzing data.

 These include a SQL query engine, a library of machine learning algorithms, a graph processing system,
and streaming data processing software.

 The core engine also supports in-memory data sharing across DAGs, so that different jobs can work with the
same data.

Resilient Distributed Datasets (RDDs)
 RDDs (Resilient Distributed Datasets) are a distributed memory abstraction.
 RDDs were motivated by two types of applications that existing computing frameworks handled inefficiently:
o iterative algorithms, and interactive data mining tools.
o In both cases, keeping data in memory can improve performance by an order of magnitude.
 RDDs are immutable, partitioned collections of records.
 RDDs can only be created by:
(a) reading data from stable storage such as HDFS, or
(b) transformations on existing RDDs.
 The two major types of operations on RDDs (see the sketch after this list):
o Transformations: return a new, modified RDD based on the original.
Examples of transformations in the Spark API include map(), filter(), sample(), and union().
o Actions: return a value based on a computation performed on the RDD.
Examples of actions in the Spark API include reduce(), count(), first(), and foreach().
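A minimal Scala sketch of the distinction, assuming a running Spark shell or application with an existing
SparkContext named sc; the data and values are purely illustrative:

    // Transformations are lazy: they only describe a new RDD.
    val numbers = sc.parallelize(1 to 10)      // RDD from an in-memory collection
    val evens   = numbers.filter(_ % 2 == 0)   // transformation: new RDD, nothing runs yet
    val doubled = evens.map(_ * 2)             // transformation: still nothing runs

    // Actions trigger the actual computation and return a value to the driver.
    val total   = doubled.reduce(_ + _)        // 4 + 8 + 12 + 16 + 20 = 60
    val howMany = doubled.count()              // 5
    doubled.foreach(println)                   // runs on the executors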

Directed Acyclic Graph (DAG)


 DAG stands for Directed Acyclic Graph.
 A DAG represents the logical execution plan for distributed data processing tasks.
 DAGs support building highly interactive, real-time computing systems that power real-time BI, predictive
analytics, and real-time marketing.
 The DAG Scheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling,
i.e. after an RDD action has been called, the computation becomes a job that is then transformed into a set of
stages submitted as task sets for execution.

 DAG Scheduler operations in Spark (see the sketch after this list):


o Computes an execution DAG, i.e. DAG of stages, for a job;
o Determines the preferred locations to run each task on;
o Handles failures due to shuffle output files being lost.
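A small sketch, assuming a SparkContext sc, of how an action turns a lineage of transformations into a job
whose stages are split at shuffle boundaries (toDebugString prints the lineage the DAG Scheduler works from):

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1))     // narrow transformation: stays in the same stage
                      .reduceByKey(_ + _)   // wide transformation: shuffle, new stage

    println(counts.toDebugString)           // shows the RDD lineage and the shuffle dependency
    counts.collect()                        // action: the DAG Scheduler submits the stages as task sets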

Apache Spark master-slave architecture

 Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are
called “Workers”.
 When you run a Spark application, the Spark Driver creates a context that is the entry point to your
application; all operations (transformations and actions) are executed on worker nodes, and the
resources are managed by the Cluster Manager.
 Cluster Manager Types (a configuration sketch follows this list):
o Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
o Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark
applications.
o Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used cluster manager.
o Kubernetes – an open-source system for automating deployment, scaling, and management of
containerized applications.
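A configuration sketch in Scala: the cluster manager is selected through the master URL when the SparkSession
is created (the URLs below are placeholders, not a real cluster):

    import org.apache.spark.sql.SparkSession

    // Example master URLs (placeholders):
    //   local[*]               -> run locally on all cores (no cluster manager)
    //   spark://host:7077      -> Spark Standalone
    //   yarn                   -> Hadoop YARN
    //   mesos://host:5050      -> Apache Mesos
    //   k8s://https://host:443 -> Kubernetes
    val spark = SparkSession.builder()
      .appName("ClusterManagerExample")
      .master("local[*]")            // swap for one of the cluster URLs above
      .getOrCreate()

    val sc = spark.sparkContext      // driver-side entry point to the cluster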

 Working of Cluster
o The working of Spark architecture involves several steps that enable distributed data processing
and computation on large-scale datasets.
o Below are the key components and the working of the Spark architecture:

 Driver Program:
o The Spark application starts with a Driver Program, which runs on the master node and is the main
entry point of the application.
o The Driver Program is responsible for creating the SparkContext, which serves as the entry point to
any Spark functionality.
o Defines the application's logic, transformations, and actions.
o Submits the Spark application to the cluster manager for execution and coordinates task execution
through the SparkContext.
o Handles task aggregation and final result processing.

 SparkContext:
o SparkContext is the heart of Apache Spark, acting as the connection to the Spark cluster and the
primary interface to it.
o It coordinates the execution of tasks and manages distributed data processing.
o It splits data into partitions and schedules tasks to be executed on worker nodes.
o Responsible for data partitioning, resource allocation, and fault tolerance.
o Provides essential methods for creating RDDs, the fundamental data structure in Spark (see the
sketch after this list).
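A short sketch of the SparkContext as the entry point for creating RDDs and controlling partitioning, assuming
the SparkSession spark from the earlier cluster-manager sketch (the HDFS path is hypothetical):

    val sc = spark.sparkContext

    val fromMemory = sc.parallelize(1 to 1000, numSlices = 8)   // explicit number of partitions
    val fromFile   = sc.textFile("hdfs:///data/input.txt")      // hypothetical HDFS path

    println(fromMemory.getNumPartitions)   // 8
    println(sc.defaultParallelism)         // default partition count for this cluster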

 Cluster Manager:
o The Cluster Manager is responsible for resource management and job scheduling: it allocates and
manages resources (CPU cores, memory) across the cluster for the Spark applications running on it.
o The Spark application requests resources from the cluster manager, which then allocates the
necessary resources for task execution.
o Manages the distribution of tasks to worker nodes based on available resources and data locality.
o Monitors and ensures fault tolerance by restarting failed tasks on other nodes.
o Common cluster managers used with Spark are Apache Mesos, Hadoop YARN, Kubernetes, and
Spark Standalone mode.

 Worker Nodes:
o Worker Nodes are the machines (physical or virtual) in the cluster where tasks are executed.
o Each worker node runs a Spark executor process, which is responsible for executing tasks and
storing data in memory or disk.
o The number of worker nodes and the resources allocated to each node depend on the cluster
configuration.

 RDD (Resilient Distributed Dataset):


o RDD is the fundamental data structure in Spark.
o It represents an immutable, distributed collection of objects that can be processed in parallel across
the cluster.
o RDDs can be created by loading external data, transforming other RDDs, or parallelizing an existing
collection in memory, as shown in the sketch below.
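A sketch of the three creation paths, assuming a SparkContext sc (the file path is hypothetical):

    val fromStorage   = sc.textFile("hdfs:///logs/2024-01-01.log")   // (a) from stable storage
    val fromTransform = fromStorage.filter(_.contains("ERROR"))      // (b) by transforming another RDD
    val fromMemory    = sc.parallelize(Seq(1, 2, 3, 4))              // (c) by parallelizing a collection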

 DAG Scheduler:
o The Directed Acyclic Graph (DAG) Scheduler breaks down the logical execution plan (the sequence
of transformations) into stages of tasks based on data dependencies: narrow transformations stay
within a stage, while wide (shuffle) dependencies mark stage boundaries.
o It optimizes the execution plan to minimize data shuffling and improve performance.

 Task Scheduler:
o The Task Scheduler is responsible for launching tasks on worker nodes in a distributed manner.
o It ensures that tasks are scheduled efficiently across the cluster based on data locality, resource
availability, and task dependencies.

 Executors:
o Executors are worker processes running on worker nodes.
o Each executor runs tasks in separate threads and holds data in memory or disk storage.
o Executors communicate with the Driver Program and the Cluster Manager to execute tasks and
manage resources.

 Cache and Persistence:


o Allows caching intermediate data in memory to avoid redundant computations, as shown in the sketch below.
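A sketch, assuming a SparkContext sc and a hypothetical log file, of caching an RDD that is reused by several
actions:

    import org.apache.spark.storage.StorageLevel

    val logs   = sc.textFile("hdfs:///logs/app.log")                 // hypothetical path
    val errors = logs.filter(_.contains("ERROR"))

    errors.cache()                                                   // MEMORY_ONLY by default
    val totalErrors = errors.count()                                 // first action: computes and caches
    val host1Errors = errors.filter(_.contains("host1")).count()     // reuses the cached partitions

    // persist() takes an explicit storage level, e.g. spill to disk when memory is tight:
    val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)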

SPARK ECOSYSTEM
 Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines, or a computing
cluster.
 Spark is written primarily in Scala and provides APIs for Python (PySpark), Java, R, and other languages.
 Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity.
 The Spark ecosystem includes resource managers such as Mesos, along with the libraries described below.

SPARK FOR BIG DATA PROCESSING


 Spark supports big data processing and mining through built-in libraries such as MLlib, GraphX, SparkR,
Spark SQL, and Spark Streaming.

 MLlib
o MLlib is Spark's machine learning library.
o It offers basic ML algorithms: classification, regression, clustering, collaborative filtering, and
dimensionality reduction.
o Provides lower-level optimization primitives and higher-level pipeline APIs for streamlined ML
workflows.
o RDDs help Spark excel at iterative computation, enabling MLlib to run fast.
o In addition, Spark MLlib is easy to use and supports Scala, Java, Python, and R (via SparkR); a minimal
training sketch follows.
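A minimal training sketch using MLlib's DataFrame-based API, assuming a SparkSession spark; the tiny
in-memory dataset is purely illustrative:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors

    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    val lr    = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)               // iterative optimization runs on the cluster
    println(model.coefficients)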

 Spark GraphX
o GraphX is Spark’s component for graphs and graph-parallel computation.
o It extends Spark RDD by introducing a new Graph abstraction, representing a directed multi-graph
with properties on vertices and edges.
o GraphX offers fundamental graph operators such as subgraph, joinVertices, and aggregateMessages,
built on an optimized variant of Google's Pregel graph-processing API.
o It includes a collection of graph algorithms and builders to simplify graph analytics tasks.
o GraphX enables efficient and scalable graph processing within the Spark ecosystem.
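A small sketch, assuming a SparkContext sc, that builds a property graph from vertex and edge RDDs and runs
the built-in PageRank algorithm; the vertex and edge data are illustrative:

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))

    val graph = Graph(vertices, edges)          // directed multigraph with vertex/edge properties
    println(graph.numVertices)                  // 3

    val ranks = graph.pageRank(tol = 0.001).vertices
    ranks.collect().foreach(println)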

 SparkR
o Facilitates distributed data processing in R.
o It uses Spark’s distributed computation engine to run large-scale data analysis from the R shell.

 Spark SQL
o Spark SQL is the Spark module for working with structured data.
o It allows running SQL queries on data to obtain meaningful results.
o Supports SQL and HiveQL for querying and processing data
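A short sketch, assuming a SparkSession spark and a hypothetical JSON file, of registering a DataFrame as a
view and querying it with SQL:

    val people = spark.read.json("examples/people.json")   // hypothetical input file
    people.createOrReplaceTempView("people")

    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()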

 Spark Streaming
o Spark Streaming processes real-time data streams from input sources in a cluster.
o Data streams are chopped into small batches, each representing a few seconds of data.
o Spark treats each batch as an RDD and applies RDD operations to process and analyze the data, then
pushes the results out in batches to databases or dashboards.
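A classic streaming word-count sketch, assuming a SparkContext sc and text lines arriving on a socket (host and
port are illustrative); each 5-second batch is processed with RDD-style operations:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(5))          // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                     // per-batch RDD-style aggregation

    counts.print()                                            // push each batch's result out
    ssc.start()
    ssc.awaitTermination()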

SPARK APPLICATIONS
 Some real-world use cases that are well served by a tool like Apache Spark include:
1. Real-time Log Data monitoring.
2. Massive Natural Language Processing.
3. Large Scale Online Recommendation Systems.
