Spark SQL Features
a. Integrated
To integrate simply means to combine or merge. Here, Spark SQL queries are integrated with Spark programs: Spark SQL allows us to query structured data inside Spark programs, using either SQL or the DataFrame API, from languages such as Java and Scala.
b. Structured Streaming
We can also run streaming computation through Spark SQL. Developers write a batch computation against the DataFrame/Dataset API, and Spark itself incrementalizes the computation so that it runs in a streaming fashion. The advantage for developers is that they do not have to manage state or failures on their own, nor keep the application in sync with separate batch jobs. Instead, the streaming job always gives the same answer as a batch job on the same data.
c. High compatibility
Spark SQL lets us run unmodified Hive queries on existing warehouses. It offers full compatibility with existing Hive data, queries, and UDFs, and it reuses the Hive frontend and MetaStore.
d. Standard Connectivity
We can easily connect to Spark SQL through JDBC or ODBC, both of which have become industry norms for connecting business intelligence tools. Spark SQL's server mode provides this industry-standard JDBC and ODBC connectivity.
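As a sketch of that server mode (assuming a typical installation where `SPARK_HOME` points at the Spark distribution and the Thrift server listens on its default port 10000), Spark's bundled Beeline JDBC client can connect as follows:

```shell
# Start Spark SQL's Thrift JDBC/ODBC server (default port: 10000).
$SPARK_HOME/sbin/start-thriftserver.sh

# Connect with the Beeline client shipped with Spark and run a query.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES;"
```

BI tools connect the same way, using the `jdbc:hive2://host:port` URL through their JDBC or ODBC drivers.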
e. Scalability
Spark SQL takes advantage of the RDD model to support large jobs and mid-query fault tolerance, and it uses the same engine for interactive as well as long-running queries.
f. Performance Optimization
In Spark SQL, the query optimization engine converts each SQL query into a logical plan and then generates many candidate physical execution plans from it. At execution time, it selects the most optimal physical plan among all the candidates. This ensures fast execution even of Hive queries.
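As an illustrative toy sketch (this is not Spark's actual optimizer code; the plan shapes and costs below are invented for illustration), the cost-based choice among physical plans can be pictured as picking the cheapest candidate:

```python
# Toy model: one logical plan expands into several candidate physical plans,
# each assigned a cost; the optimizer executes the cheapest one.
logical_plan = {"op": "join", "left": "orders", "right": "customers"}

# Hypothetical candidate strategies with illustrative costs.
candidates = [
    {"strategy": "sort_merge_join", "cost": 40},
    {"strategy": "shuffle_hash_join", "cost": 25},
    {"strategy": "broadcast_hash_join", "cost": 10},
]

# Select the physical plan with the lowest estimated cost.
best = min(candidates, key=lambda p: p["cost"])
print(best["strategy"])  # broadcast_hash_join
```

In real Spark, statistics such as table sizes feed these cost estimates; the structure of the decision, however, is exactly this "cheapest candidate wins" selection.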
Conclusion
Hence, we have seen all of the Spark SQL features in detail. As a result, we have learned that Spark SQL is a module of Spark that analyses structured data. It offers scalability, ensures high compatibility with existing systems, and allows standard connectivity through JDBC or ODBC. It therefore provides the most natural way to express operations on structured data, and the above-mentioned features enhance its working efficiency.
Spark SQL Architecture:
This architecture contains three layers, namely Language API, Schema RDD, and Data Sources. The usual data sources for Spark Core are text files, Avro files, etc.; however, the data sources for Spark SQL are different and also include structured formats such as Parquet files, JSON documents, and Hive tables.
Why DataFrame?
DataFrame is one step ahead of RDD, since it provides custom memory management and an optimized execution plan:
a. Custom Memory Management: This is also known as Project Tungsten. A lot of memory is saved because the data is stored off-heap in a binary format, and there is no garbage-collection overhead. Expensive Java serialization is also avoided, since the data is stored in binary form and the schema of the data in memory is known.
b. Optimized Execution Plan: This is also known as the query optimizer. Using it, an optimized execution plan is created for the execution of a query, and once the optimized plan is ready, the final execution takes place on Spark's RDDs.
You can refer to this guide to learn the Spark SQL optimization phases in detail.
iv. The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.
v. It provides Hive compatibility; we can run unmodified Hive queries on an existing Hive warehouse.
vi. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.
vii. DataFrame provides easy integration with Big Data tools and frameworks via Spark Core.
Conclusion
Hence, the DataFrame API in Spark SQL improves the performance and scalability of Spark. It avoids the garbage-collection cost of constructing individual objects for each row in the dataset.
The Spark DataFrame API differs from the RDD API in that it is an API for building a relational query plan that Spark's Catalyst optimizer can then execute. The DataFrame API suits developers who are familiar with building query plans, but it is not a natural fit for the majority of developers.
DATASET:
Today, in this blog on the Apache Spark Dataset, you can read all about what a Dataset is in Spark: why the Spark Dataset is needed, what an encoder is, and what its significance is for the Dataset. You will get the answers to all of these questions in this blog. Moreover, we will also cover the features of the Dataset in Apache Spark and how to create a Dataset in this Spark tutorial.
e. Faster Computation
Conclusion
Hence, in conclusion to Dataset, we can say it is a strongly typed data structure in Apache Spark that represents structured queries. It fuses together the functionality of RDD and DataFrame, and we can generate optimized queries using the Dataset API. Thus, Dataset lessens memory consumption and provides a single API for both Java and Scala.