
Unit-5

Spark

Apache® Spark™ is a fast, general-purpose compute engine that can run analytics applications up to 100 times faster. It supports HDFS-compatible data. Spark has a simple and expressive programming model: an expressive programming model implements a large number of mathematical and logical operations with shorter, easier-to-write code that both the compiler and the programmer can understand easily.

The model therefore eases programming for a wide range of applications. Expressive code is applied in analytics, Extract Transform Load (ETL), Machine Learning (ML), stream processing and graph computations.

Main components of the Spark architecture are:

1. Spark HDFS file system for data storage: storage is at an HDFS or Hadoop-compatible data source (such as HDFS, HBase, Cassandra, Ceph), or at an Object Store.

2. Spark standard API, which enables the creation of applications using Scala, Java, Python and R.

3. Spark resource management, which can run on a stand-alone server or on a distributed computing framework such as YARN or Mesos.

The features of Spark:

1. Spark provisions for creating applications that use complex data. The in-memory Apache Spark computing engine gives up to 100 times the performance of Hadoop.

2. The execution engine uses both in-memory and on-disk computing. Intermediate results are saved in memory and spill over to disk.

3. Data can be uploaded from an Object Store for immediate use as a Spark object instance. The Spark service interface sets up the Object Store.

4. Provides high performance when an application accesses data from the in-memory cache rather than from disk.

5. Contains an API to define Resilient Distributed Datasets (RDDs). The RDD is a programming abstraction and the core concept of the Spark framework. An RDD represents a collection of objects distributed across many compute nodes for parallel processing. Spark stores the data of an RDD on different partitions.
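
As a minimal sketch of the standard API and of RDD partitioning (in Python, one of the supported languages; the application name and data below are illustrative assumptions, not taken from the text above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unit5-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection as an RDD split into 4 partitions;
# each partition can be processed in parallel on a different node.
numbers = sc.parallelize(range(1, 101), numSlices=4)
print(numbers.getNumPartitions())   # 4
print(numbers.sum())                # 5050

spark.stop()
```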

Spark SQL

Spark SQL is a component of the Spark Big Data stack. Spark SQL components are DataFrames (SchemaRDDs), SQLContext and the JDBC server. Spark SQL does the following:
1. Runs SQL-like scripts for query processing, using the Catalyst optimizer and the Tungsten execution engine
2. Processes structured data
3. Provides flexible APIs that support many types of data sources
4. Performs ETL operations by creating an ETL pipeline on data from different file formats, such as JSON, Parquet, Hive and Cassandra, and then running ad-hoc queries.
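
A minimal Spark SQL sketch in Python is shown below; the file name and column names are illustrative assumptions. It reads structured JSON data into a DataFrame, registers a temporary view, and runs an SQL query that the Catalyst optimizer plans and the Tungsten engine executes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Read structured data (here JSON) into a DataFrame.
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```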

Using Python Features with Spark SQL

Spark SQL binds with Python easily. Python has expressive program statements. Spark SQL features, together with Python, help a programmer build demanding Big Data applications.

Python Libraries for Analysis: NumPy and SciPy are open-source downloadable libraries for numerical (Num) analysis and scientific (Sci) computations in Python (Py). Python has open-source library packages NumPy, SciPy, Scikit-learn, Pandas and StatsModels, which are widely used for data analysis. The Python library matplotlib provides functions to plot mathematical functions.
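
A small illustrative sketch of these libraries (the data values are made up for the example):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)                    # NumPy: numerical arrays
df = pd.DataFrame({"x": x, "y": np.sin(x)})   # Pandas: tabular data
print(df.describe())                          # summary statistics

plt.plot(df["x"], df["y"])                    # matplotlib: plot the function
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()
```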

User-Defined Functions (UDFs): These functions take one row at a time, which incurs serialization/deserialization (SerDe) overhead as data is exchanged between Python and the JVM. Earlier, the data pipeline (between data and application) defined the UDFs in Java or Scala and then invoked them from Python while using Python libraries for analysis or other applications. Spark SQL UDFs can be registered directly in Python, Java and Scala.
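
A minimal sketch of registering a Spark SQL UDF directly in Python (the function and view names are illustrative assumptions). Each row is serialized, sent to the Python worker, processed, and returned to the JVM, which is the SerDe overhead mentioned above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

def squared(n):
    return n * n

# Register the Python function so it can be called from SQL statements.
spark.udf.register("squared", squared, IntegerType())

spark.range(1, 6).createOrReplaceTempView("nums")
spark.sql("SELECT id, squared(id) AS id_squared FROM nums").show()
```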
Programming with RDDs

A Spark Resilient Distributed Dataset (RDD) is a collection of objects distributed over many computing nodes. Each RDD can be split into multiple partitions, which may be computed in parallel on different nodes of a cluster.
Characteristics of RDDs:
1. A fault-tolerant abstraction that enables in-memory cluster computations.
2. An immutable (thus read-only), partitioned, distributed collection of objects.
3. An interface that enables transformations which apply the same operation to many data objects.
4. Created only through deterministic operations on either (i) data in a stable data store, such as a file, or (ii) other RDDs.
5. Parallel data structures.
6. Enable efficient execution of iterative algorithms.

Two operations, transformation and action, can be performed on an RDD. Each dataset represents an object. The transform command invokes methods on the objects to create new RDD(s). An action is an operation that (i) returns a value to the program or (ii) exports data to a data store. Transformations and actions differ because of the way Spark computes RDDs: transformations create RDDs from one another, while the computation happens only when an action is applied to an RDD for the first time, returning a value or sending data to the data store.
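
A minimal PySpark sketch of transformations and actions on an RDD (the input lines are illustrative). Transformations are lazy; Spark computes the RDDs only when an action such as collect() runs.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

lines = sc.parallelize(["spark builds rdds", "rdds are immutable"])

# Transformations: each one creates a new RDD from an existing RDD.
words  = lines.flatMap(lambda line: line.split())
pairs  = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: triggers the computation and returns a value to the program.
print(counts.collect())   # e.g. [('rdds', 2), ('spark', 1), ...]

sc.stop()
```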

Machine Learning with MLlib

Apache MLlib is a component of Spark; it is open source and downloadable from the Apache Spark website.

Spark supports ML pipelines. An ML pipeline means that data taken from data sources passes through the machine learning programs in between, and the output becomes input to the application.
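
A minimal sketch of a Spark ML pipeline in Python (the columns, values and choice of a logistic regression stage are illustrative assumptions): training data flows from a DataFrame through a feature-assembly stage and a learning stage, and the fitted model's predictions can feed an application.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (2.0, 1.5, 1.0), (0.5, 0.2, 0.0)],
    ["f1", "f2", "label"])

# Stage 1: combine the raw columns into a single feature vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Stage 2: fit a classifier on the assembled features.
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()
```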
Program steps for the ETL (Extract, Transform and Load) process

The ETL process combines the following three functions into one:
1. Extract, which acquires data from a data store by querying, or from another program.

2. Transform, which changes the data into the desired form. Transformation converts the extracted data from its previous form into a new form, using rules or lookup tables. Transformation uses functions such as join(), groupBy(), cogroup(), filter(), map(), mapValues(), flatMap(), sort(), partitionBy(), groupByKey(), reduceByKey(), aggregateByKey(), pipe(), coalesce(), sample(), union() and crossProduct(). Spark 2.3 includes transformation functions on complex objects such as arrays, maps and sets of columns. Pandas provides powerful transformation UDFs, vectorized UDFs (VUDFs) and grouped vectorized UDFs (GVUDFs).

3. Load, which places the transformed data into another data store or data warehouse for use by an application or for analysis. Python, Spark SQL and HiveQL support ETL programming, with extraction by query processing and text processing.

Comparative Analysis of Big Data Tools

In this comparative analysis, we'll examine the features, strengths, and limitations
of four popular Big Data tools: Apache Hadoop, Apache Spark, Apache Flink, and
Google Cloud Dataflow.

1. Apache Hadoop

Apache Hadoop is an open-source framework for distributed storage and processing of large data sets. Its primary components are the Hadoop Distributed File System (HDFS) and MapReduce.

* HDFS: A distributed file system that stores data across multiple nodes in a cluster, offering high availability, fault tolerance, and scalability.
* MapReduce: A programming model that processes and generates large datasets by breaking them into smaller chunks, which are then mapped and reduced.

Strengths:
* Scalable and cost-effective: Hadoop can handle petabytes of data and is often more affordable than traditional data storage solutions.
* Fault-tolerant: Data replication across multiple nodes ensures the system's resilience against hardware failures.
* Flexible: Hadoop supports various file formats and can be integrated with numerous other Big Data tools.

Limitations:
* High latency: Hadoop's batch processing nature makes it less suitable for real-time data processing.
* Steep learning curve: Hadoop requires knowledge of complex concepts like MapReduce, making it difficult for beginners.

2. Apache Spark

Apache Spark is an open-source, distributed computing system designed for fast data processing and analytics. It builds upon the Hadoop ecosystem and offers improvements like in-memory processing and support for iterative algorithms.

* In-memory processing: Spark can store data in memory, significantly reducing read and write times compared to Hadoop's disk-based storage.
* Resilient Distributed Datasets (RDDs): A fundamental data structure in Spark, RDDs are immutable distributed collections of objects, enabling fault tolerance and parallel processing.

Strengths:
* Faster than Hadoop: Spark's in-memory processing capability enables much faster data processing, especially for iterative algorithms.
* Versatility: Spark supports multiple programming languages (Scala, Python, Java) and offers libraries for machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming).
* Easy to use: Spark's APIs are more straightforward than Hadoop's, making it easier to learn and develop applications.

Limitations:
* Memory-intensive: Spark's in-memory processing requires substantial amounts of RAM, which can be expensive.
* Less mature: Although Spark has a growing community, it is less mature than Hadoop, with fewer resources and third-party integrations.

3. Apache Flink

Apache Flink is an open-source, distributed stream processing framework that emphasizes low-latency and high-throughput data processing. It offers robust stateful computations and event-time processing capabilities.

* Stateful computations: Flink can manage application state, allowing for more advanced processing logic and fault tolerance.
* Event-time processing: Flink processes data based on when events occurred, rather than when they were ingested, allowing for more accurate analytics.

Strengths:
* Real-time processing: Flink's stream processing capabilities make it well-suited for real-time analytics and event-driven applications.
* High throughput: Flink can efficiently process large volumes of data with minimal latency.
* Strong community: Flink has an active community that contributes to its development and provides resources for learning and support.

Limitations:
* Complexity: Flink requires a deeper understanding of stream processing concepts, making it harder for beginners to learn.
* Limited ecosystem: While growing, Flink's ecosystem is not as extensive as Hadoop's or Spark's.

4. Google Cloud Dataflow

Google Cloud Dataflow is a managed, serverless service for batch and stream data processing, built on top of Apache Beam. It offers a unified programming model for both batch and streaming use cases.

* Apache Beam: An open-source, unified model for defining and executing data processing pipelines, supporting multiple languages and runtime environments.
* Auto-scaling: Dataflow automatically adjusts resource allocation based on workload, ensuring efficient and cost-effective processing.

Strengths:
* Unified programming model: Dataflow's Apache Beam-based model simplifies the development process by allowing users to build pipelines for both batch and streaming use cases with a single API.
* Fully managed: Dataflow takes care of provisioning, scaling, and managing resources, reducing operational overhead.
* Integration with Google Cloud Platform (GCP): Dataflow seamlessly integrates with other GCP services like BigQuery, Cloud Storage, and Pub/Sub, enabling a comprehensive data analytics ecosystem.

Limitations:
* Vendor lock-in: Dataflow is a proprietary service within the GCP ecosystem, which may limit flexibility and portability for some users.
* Cost: As a fully managed service, Dataflow can be more expensive than open-source alternatives, especially for large-scale data processing tasks.

Conclusion

When selecting a Big Data tool, organizations should consider factors such as
data processing speed, scalability, ease of use, and integration with other tools or
platforms. Apache Hadoop is a reliable and scalable option for batch processing,
while Apache Spark offers improved performance and versatility. Apache Flink
excels at real-time, stream processing, and Google Cloud Dataflow provides a
fully managed, unified solution for both batch and stream processing within the
GCP ecosystem. Ultimately, the best tool will depend on an organization's
specific use case and requirements.

Batch Processing and Stream Processing

1. Batch Processing: Batch processing refers to processing a high volume of data in a batch within a specific time span. It processes large volumes of data all at once. Batch processing is used when the data size is known and finite. It takes a little longer to process data, and it requires dedicated staff to handle issues. A batch processor processes data in multiple passes. When data is collected over time and similar data is batched/grouped together, batch processing is used.

Challenges with batch processing:
* Debugging of these systems is difficult, as it requires dedicated professionals to fix errors.
* Software and training require a high initial expense just to understand batch scheduling, triggering, notification, etc.

2. Stream Processing: Stream processing refers to processing a continuous stream of data immediately as it is produced. It analyzes streaming data in real time. Stream processing is used when the data size is unknown, infinite and continuous. It takes a few seconds or milliseconds to process data. In stream processing, the data output rate is as fast as the data input rate. A stream processor processes data in a few passes. When a data stream is continuous and requires an immediate response, stream processing is used.

Challenges with stream processing:
* The data input rate and output rate sometimes create a problem.
* The system must cope with huge amounts of data while responding immediately.
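
A minimal sketch contrasting the two models with Spark (the paths, schema and socket source are illustrative assumptions; Spark Structured Streaming is used here only to illustrate continuous processing):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: the whole finite, known-size dataset is read and processed at once.
batch_df = spark.read.csv("events/", header=True)
batch_df.groupBy("event_type").count().show()

# Stream: records are processed continuously as they arrive on a socket;
# the input is unbounded and the results are updated with low latency.
stream_df = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())
query = (stream_df.groupBy("value").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```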

Differences

S.No. | Batch Processing | Stream Processing
01. | Batch processing refers to processing of a high volume of data in a batch within a specific time span. | Stream processing refers to processing of a continuous stream of data immediately as it is produced.
02. | Batch processing processes a large volume of data all at once. | Stream processing analyzes streaming data in real time.
04. | In batch processing, the data size is known and finite. | In stream processing, the data size is unknown and infinite in advance.
05. | In batch processing, the data is processed in multiple passes. | In stream processing, data is generally processed in a few passes.
06. | A batch processor takes a longer time to process data. | A stream processor takes a few seconds or milliseconds to process data.