Spark
Apache® Spark™ is a fast, general-purpose compute engine. It runs analytics applications up to 100 times faster than Hadoop and supports HDFS-compatible data. Spark has a simple and expressive programming model. An expressive programming model implements a number of mathematical and logic operations with shorter, easier-to-write code that both a compiler and a programmer can understand easily.
The model therefore eases programming for a wide range of applications. Expressive code is applied in analytics, Extract, Transform and Load (ETL), Machine Learning (ML), stream processing and graph computations.
1. Spark uses the HDFS file system for data storage: storage is at an HDFS- or Hadoop-compatible data source (such as HDFS, HBase, Cassandra or Ceph), or at an Object Store.
2. Spark provides for creating applications that use complex data. The in-memory Apache Spark computing engine enables up to 100 times the performance of Hadoop.
3. Data uploads from an Object Store for immediate use as a Spark object instance. The Spark service interface sets up the Object Store.
4. Provides high performance by letting an application access an in-memory cache instead of reading repeatedly from disk, as the sketch below illustrates.
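A minimal PySpark sketch of point 4; the file path and the use of CSV are illustrative assumptions. The first action reads from disk and populates the in-memory cache, and later actions are served from memory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
df.cache()            # keep the DataFrame in memory after first computation
print(df.count())     # first action: reads from disk and fills the cache
print(df.count())     # second action: served from the in-memory cache
spark.stop()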
Spark SQL
Spark SQL is a component of the Spark Big Data stack. Spark SQL components are DataFrames (SchemaRDDs), SQLContext and the JDBC server. Spark SQL does the following:
1. Runs SQL-like scripts for query processing, using the Catalyst optimizer and the Tungsten execution engine
2. Processes structured data
3. Provides flexible APIs that support many types of data sources
4. Performs ETL operations by creating an ETL pipeline on data from different file formats, such as JSON, Parquet, Hive and Cassandra, and then running ad-hoc queries
Spark SQL binds with Python easily. Python has expressive program statements. Spark SQL features, together with Python, help a programmer build challenging Big Data applications, as the sketch below shows.
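A minimal sketch of running Spark SQL from Python; the JSON path and the product and amount fields are illustrative assumptions. Catalyst optimizes the query plan and Tungsten executes it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Load a JSON data source into a DataFrame (path is an assumption)
sales = spark.read.json("hdfs:///data/sales.json")
sales.createOrReplaceTempView("sales")

# Run an SQL-like script for query processing
top = spark.sql(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC LIMIT 10")
top.show()
spark.stop()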
Python Libraries for Analysis
NumPy and SciPy are open-source, downloadable libraries for numerical (Num) analysis and scientific (Sci) computations in Python (Py). Python has open-source library packages, NumPy, SciPy, Scikit-learn, Pandas and StatsModels, which are widely used for data analysis. The Python library matplotlib provides functions that plot mathematical functions.
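For instance, a short sketch of NumPy and matplotlib working together; the sine function plotted here is just an illustration.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)   # 100 sample points
y = np.sin(x)                        # NumPy evaluates the function vectorially
plt.plot(x, y)                       # matplotlib plots the mathematical function
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()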
User-Defined Functions (UDFs)
UDFs take one row at a time, which incurs serialization/deserialization (SerDe) overhead because data exchanges take place between Python and the JVM. Earlier, the data pipeline (between data and application) defined the UDFs in Java or Scala and then invoked them from Python while using Python libraries for analysis or other applications. Spark SQL UDFs can be registered directly in Python, Java and Scala.
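A minimal sketch of registering a Spark SQL UDF directly in Python; the str_len name and the string-length logic are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFDemo").getOrCreate()

# Register a row-at-a-time Python UDF; each call crosses the
# Python-to-JVM boundary, which is the SerDe overhead noted above
spark.udf.register("str_len", lambda s: len(s) if s else 0, IntegerType())

spark.sql("SELECT str_len('Spark') AS n").show()
spark.stop()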
Programming with RDDs
Two operations, transformation and action, can be performed on an RDD. Each dataset represents an object. A transformation invokes methods on these objects to create new RDD(s). An action is an operation that (i) returns a value to the program or (ii) exports data to a Data Store. Transformations and actions differ because of the way Spark computes RDDs: transformations lazily create RDDs from each other, while an action triggers the actual computation the first time it is called on an RDD and then returns a value or sends data to a Data Store.
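A minimal RDD sketch of the distinction; the numbers and lambdas are illustrative. The map() and filter() transformations build new RDDs lazily, and only the collect() action triggers computation and returns a value to the program.

from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

nums = sc.parallelize([1, 2, 3, 4, 5])        # base RDD
squares = nums.map(lambda x: x * x)           # transformation: new RDD, no work yet
evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation

# Action: triggers the actual computation and returns a value to the program
print(evens.collect())                        # [4, 16]
sc.stop()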
Spark supports ML pipelines. In an ML pipeline, data taken from data sources passes through the machine learning programs in between, and the output becomes input to the application.
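A minimal sketch of an ML pipeline in pyspark.ml; the toy data and column names are illustrative assumptions. A feature-assembly stage feeds a decision-tree stage, and the fitted model's predictions become input to the application.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLPipelineDemo").getOrCreate()

# Toy training data; column names are illustrative assumptions
train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0)],
    ["f1", "f2", "label"])

# Stage 1 assembles features, stage 2 fits a decision tree
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, tree]).fit(train)

model.transform(train).select("label", "prediction").show()
spark.stop()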
Program steps for ETL (Extract, Transform and Load) process
The ETL process combines the following three functions into one:
1. Extract, which acquires data by querying a Data Store or from another program.
2. Transform, which changes the data into a desired form. Transformation converts the extracted data from its previous form into a new form, using rules or lookup tables. Transformation uses functions such as join(), groupBy(), cogroup(), filter(), map(), mapValues(), flatMap(), sort(), partitionBy(), groupByKey(), reduceByKey(), aggregateByKey(), pipe(), coalesce(), sample(), union() and crossProduct(). Spark 2.3 includes transformation functions on complex objects such as arrays, maps and sets of columns. Pandas provides powerful transformation UDFs, VUDFs and GVUDFs.
3. Load, which places the transformed data into another Data Store or data warehouse for use by an application or for analysis. Python, Spark SQL and HiveQL support ETL programming, extracting by query processing and text processing; a minimal end-to-end sketch follows this list.
In this comparative analysis, we'll examine the features, strengths, and limitations
of four popular Big Data tools: Apache Hadoop, Apache Spark, Apache Flink, and
Google Cloud Dataflow.
1. Apache Hadoop
2. Apache Spark
3. Apache Flink
4. Google Cloud Dataflow
Google Cloud Dataflow is a managed, serverless service for batch and stream data processing, built on top of Apache Beam. It offers a unified programming model for both batch and streaming use cases.
* Apache Beam: An open-source, unified model for defining and executing data processing pipelines, supporting multiple languages and runtime environments.
* Auto-scaling: Dataflow automatically adjusts resource allocation based on workload, ensuring efficient and cost-effective processing.
Strengths:
* Unified programming model: Dataflow's Apache Beam-based model simplifies the development process by allowing users to build pipelines for both batch and streaming use cases with a single API.
* Fully managed: Dataflow takes care of provisioning, scaling, and managing resources, reducing operational overhead.
* Integration with Google Cloud Platform (GCP): Dataflow seamlessly integrates with other GCP services like BigQuery, Cloud Storage, and Pub/Sub, enabling a comprehensive data analytics ecosystem.
Limitations:
* Vendor lock-in: Dataflow is a proprietary service within the GCP ecosystem, which may limit flexibility and portability for some users.
* Cost: As a fully managed service, Dataflow can be more expensive than open-source alternatives, especially for large-scale data processing tasks.
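A minimal Apache Beam word-count sketch of the unified model; the Cloud Storage paths are illustrative assumptions. Switching the runner option from DirectRunner to DataflowRunner would submit the same pipeline to Dataflow.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code runs locally on the DirectRunner or on
# Dataflow via the DataflowRunner; only the options change
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda w, c: f"{w}: {c}")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output"))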
Conclusion
When selecting a Big Data tool, organizations should consider factors such as
data processing speed, scalability, ease of use, and integration with other tools or
platforms. Apache Hadoop is a reliable and scalable option for batch processing,
while Apache Spark offers improved performance and versatility. Apache Flink excels at real-time stream processing, and Google Cloud Dataflow provides a fully managed, unified solution for both batch and stream processing within the GCP ecosystem. Ultimately, the best tool will depend on an organization's specific use case and requirements.
Batch Processing and Stream Processing
Differences