
Spark SQL

Spark SQL integrates relational processing with Spark’s functional programming. It provides support for various data
sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.

With Spark SQL, Apache Spark becomes accessible to more users and offers improved optimization for existing ones. Spark SQL
provides the DataFrame API, which performs relational operations on both external data sources and Spark’s built-in
distributed collections. It introduces an extensible optimizer called Catalyst, which helps in supporting a wide range of data
sources and algorithms in Big Data.

•Spark SQL Libraries


Spark SQL has the following four libraries, which are used for relational and procedural processing:
•Data Source API (Application Programming Interface):

•This is a universal API for loading and storing structured data.


•It has built-in support for Hive, Avro, JSON, JDBC, Parquet, etc.
•Supports third-party integration through Spark packages.
•Support for smart sources.
•DataFrame API:
•A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in
a relational database.
•It is a data abstraction and Domain Specific Language (DSL) applicable to structured and semi-structured data.
•The DataFrame API represents a distributed collection of data in the form of named columns and rows.
•It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext.
•It can process data ranging in size from kilobytes to petabytes, on a single-node cluster or on multi-node clusters.
•Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables,
MySQL, etc.).
•Can be easily integrated with all Big Data tools and frameworks via Spark Core.
•Provides APIs for Python, Java, Scala, and R.
SQL Interpreter And Optimizer:
The SQL Interpreter and Optimizer is based on functional programming and is written in Scala.

•It is the newest and most technically evolved component of Spark SQL.


•It provides a general framework for transforming trees, which is used to perform analysis/evaluation, optimization,
planning, and runtime code generation.
•This supports cost-based optimization (run time and resource utilization are termed as cost) and rule-based optimization,
making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.
For example, Catalyst is a modular library built as a rule-based system; each rule in the framework focuses on a distinct
optimization. A minimal sketch of a custom rule is shown below.
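
The sketch below is illustrative only and is written against the Spark 2.x internal API; the rule name, the simplification it performs, and its registration through spark.experimental.extraOptimizations are assumptions made for illustration, not part of Spark’s built-in rule set.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: rewrite "expression * 1" into just the expression.
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(child, Literal(1, _)) => child
  }
}

val spark = SparkSession.builder().appName("catalyst rule demo").getOrCreate()
// Register the rule through the experimental extra-optimizations hook.
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)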

SQL Service:
SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as
well as the execution of SQL queries.
Features Of Spark SQL
The following are the features of Spark SQL:

Integration With Spark

Spark SQL queries are integrated with Spark programs. Spark SQL allows us to query structured data inside Spark programs,
using either SQL or a DataFrame API, which can be used in Java, Scala, Python, and R. To run a streaming computation, developers
simply write a batch computation against the DataFrame / Dataset API, and Spark automatically runs the computation
incrementally, in a streaming fashion. This powerful design means that developers don’t have to manually manage
state, failures, or keeping the application in sync with batch jobs. Instead, the streaming job always gives the same answer
as a batch job on the same data.
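
A minimal sketch of this batch-like streaming style is shown below; the input directory, the schema, and the column name are assumptions used only for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("streaming word count").getOrCreate()

// Assumed schema for the incoming JSON records.
val schema = new StructType().add("word", StringType)

// The same DataFrame operations used in a batch job...
val counts = spark.readStream
  .schema(schema)
  .json("/tmp/streaming-input")   // hypothetical input directory
  .groupBy("word")
  .count()

// ...run incrementally once started as a streaming query.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
// query.awaitTermination() would block until the stream stops.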

Uniform Data Access


DataFrames and SQL support a common way to access a variety of data sources, like Hive, Avro, Parquet, ORC, JSON, and
JDBC. This makes it possible to join data across these sources, which is very helpful for accommodating existing users into Spark SQL.
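
As a sketch, the same DataFrameReader API is used regardless of the source; this assumes a SparkSession named spark (as in the spark-shell), and the Parquet path, JDBC connection details, and join columns below are hypothetical.

// JSON file (path from the Spark examples folder).
val users = spark.read.json("examples/src/main/resources/people.json")

// Parquet file (hypothetical path).
val events = spark.read.parquet("/data/events.parquet")

// JDBC source (hypothetical connection details; requires the JDBC driver on the classpath).
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/shop")
  .option("dbtable", "orders")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// Once loaded, the sources can be joined as ordinary DataFrames.
users.join(orders, users("name") === orders("customer_name")).show()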

Hive Compatibility
Spark SQL runs unmodified Hive queries on current data. It reuses the Hive frontend and metastore, giving full
compatibility with existing Hive data, queries, and UDFs.
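
A minimal sketch, assuming a Hive installation and an existing Hive table named employees (the table and its columns are hypothetical):

import org.apache.spark.sql.SparkSession

// Enabling Hive support lets Spark SQL read Hive tables and use the Hive metastore.
val spark = SparkSession.builder()
  .appName("hive example")
  .enableHiveSupport()
  .getOrCreate()

// An unmodified HiveQL query against the assumed Hive table.
spark.sql("SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department").show()
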
•Standard Connectivity
Connections are made through JDBC or ODBC, which are the industry norms for connectivity for business intelligence
tools.
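
For example, a BI tool (or any JDBC client) can connect to Spark SQL through the Spark Thrift server. The sketch below assumes the Thrift server is already running on its default port 10000, that the Hive JDBC driver is on the classpath, and that a table named people exists; all of these are assumptions.

import java.sql.DriverManager

// Connect to the (assumed) running Spark Thrift server over the HiveServer2 protocol.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "spark", "")
val stmt = conn.createStatement()

// Run a SQL query exactly as a BI tool would (the people table is hypothetical).
val rs = stmt.executeQuery("SELECT name, age FROM people")
while (rs.next()) {
  println(rs.getString("name") + ", " + rs.getInt("age"))
}
conn.close()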

•Performance And Scalability
Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to make queries fast, while scaling to
thousands of nodes using the Spark engine, which provides full mid-query fault tolerance. The interfaces
provided by Spark SQL give Spark more information about the structure of both the data and the computation
being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. Spark SQL can directly
read from multiple sources (files, HDFS, JSON/Parquet files, existing RDDs, Hive, etc.). It ensures fast execution of existing
Hive queries.
User Defined Functions
Spark SQL has language-integrated User-Defined Functions (UDFs). A UDF is a feature of Spark SQL for defining new column-
based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets. UDFs are black boxes in their
execution, so the optimizer cannot look inside them.
The example below defines a UDF to convert a given text to upper case.

Code explanation:
1. Creating a dataset “hello world”
2. Defining a function ‘upper’ which converts a string into upper case.
3. We now import the ‘udf’ function from Spark SQL functions.
4. Defining our UDF, ‘upperUDF’, from our function ‘upper’.
5. Displaying the results of our User Defined Function in a new column ‘upper’.

Example:

val dataset = Seq((0, "hello"),(1, "world")).toDF("id","text")


val upper: String => String = _.toUpperCase
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)
dataset.withColumn("upper", upperUDF('text)).show
1. We now register our function as ‘myUpper’.
2. Listing the catalog to verify that our UDF appears among the other functions.

Example:
spark.udf.register("myUpper", (input:String) => input.toUpperCase)
spark.catalog.listFunctions.filter('name like "%upper%").show(false)
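
Once registered, the UDF can also be called from SQL. As a small illustration (the view name ‘strings’ is an assumption), the dataset from the earlier example can be exposed as a temporary view and queried:

// Expose the earlier dataset as a temporary view so SQL can reference it.
dataset.createOrReplaceTempView("strings")

// Call the registered UDF from a SQL query.
spark.sql("SELECT id, myUpper(text) AS upper_text FROM strings").show()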

Querying Using Spark SQL


Code explanation:
1. We first import a Spark Session into Apache Spark.
2. Creating a Spark Session ‘spark’ using the ‘builder()’ function.
3. Importing the Implicits class into our ‘spark’ Session.
4. We now create a DataFrame ‘df’ and import data from the ’employee.json’ file.
5. Displaying the DataFrame ‘df’. The result is a table of 5 rows of ages and names from our ’employee.json’ file.

Example:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value").getOrCreate()
import spark.implicits._
val df = spark.read.json("examples/src/main/resources/employee.json")
df.show()
Querying using Spark SQL:

•Reading and Displaying JSON File in Spark:

scala> val df = spark.read.json("/home/mamoon/spark-2.3.1-bin-hadoop2.7/examples/src/main/resources/people.json")


scala> df.show()

•Import for using $ notation:


import spark.implicits._

•Print the Schema in Tree Format:


df.printSchema()

•Select only the name Column:


df.select("name").show()

•Select all columns and increment age column by 1:


df.select($"name", $"age" + 1).show()

•Select people older than 21:


df.filter($"age" > 21).show()
• Count people by age:
df.groupBy("age").count().show()

Dataframe in PySpark: Overview

In Apache Spark, a DataFrame is a distributed collection of rows under named columns. In simple terms, it is the same as a
table in a relational database or an Excel sheet with column headers. It also shares some common characteristics with
RDDs:
Immutable in nature: we can create a DataFrame / RDD once but cannot change it; applying transformations produces a new
DataFrame / RDD.
Lazy evaluation: a task is not executed until an action is performed.
Distributed: RDDs and DataFrames are both distributed in nature.

How to create a DataFrame ?


A DataFrame in Apache Spark can be created in multiple ways:
It can be created using different data formats. For example, loading the data from JSON, CSV.
Loading data from Existing RDD.
Programmatically specifying schema
Creating DataFrame from RDD

I am following these steps for creating a DataFrame from a list of tuples:

Create a list of tuples. Each tuple contains the name of a person and their age.
Create an RDD from the list above.
Convert each tuple to a row.
Create a DataFrame by applying createDataFrame on the RDD with the help of sqlContext.

from pyspark.sql import Row


l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)

type(schemaPeople)
Output:
pyspark.sql.dataframe.DataFrame
train = spark.read.csv("Downloads/train.csv", header=True, inferSchema=True)
test = spark.read.csv("Downloads/test.csv", header=True, inferSchema=True)

 DataFrame Manipulations:

How to see datatype of columns?

To see the types of the columns in a DataFrame, we can use printSchema or dtypes. Let’s apply printSchema() on train, which will
print the schema in a tree format.

train.printSchema()

How to show the first n observations?

We can use the head operation to see the first n observations (say, 5 observations). The head operation in PySpark is similar to the head
operation in Pandas.

train.head(5)
The results above are returned in a row-like format. To see the result in a more interactive manner (rows under the columns), we
can use the show operation. Let’s apply the show operation on train and take the first 2 rows of it. We can pass the argument
truncate = True to truncate the result.

train.show(2)

How to count the number of rows in a DataFrame?


We can use the count operation to count the number of rows in a DataFrame. Let’s apply the count operation on the train and test
DataFrames to count the number of rows.

train.count()

test.count()
How to get the summary statistics (mean, standard deviation, min, max, count) of numerical columns in a DataFrame?

The describe operation is used to calculate the summary statistics of numerical column(s) in a DataFrame. If we don’t specify the
names of columns, it will calculate summary statistics for all numerical columns present in the DataFrame.

train.describe().show()

Let’s check what happens when we specify the name of a categorical / string column in the describe operation.

train.describe("age").show()

How to select column(s) from the DataFrame?

To subset the columns, we need to use the select operation on the DataFrame, passing the column names separated
by commas inside the select operation.

train.select("age", "name").show()
How to find the number of distinct rows:

The distinct operation can be used here to calculate the number of distinct rows in a DataFrame (below, the number of distinct values in the age column).

train.select(“age”).distinct().count()

What if I want to calculate the pairwise frequency of categorical columns?

We can use the crosstab operation on a DataFrame to calculate the pairwise frequency of columns.

train.crosstab('Age', 'Gender').show()
Output:
+----------+-----+------+
|Age_Gender|    F|     M|
+----------+-----+------+
|      0-17| 5083| 10019|
|     46-50|13199| 32502|
|     18-25|24628| 75032|
|     36-45|27170| 82843|
|       55+| 5083| 16421|
|     51-55| 9894| 28607|
|     26-35|50752|168835|
+----------+-----+------+
What if I want to get a DataFrame without the duplicate rows of a given DataFrame?

We can use the dropDuplicates operation to drop the duplicate rows of a DataFrame and get a DataFrame without
duplicate rows.

train.select('Age','Gender').dropDuplicates().show()

What if I want to drop all rows with null values?

The dropna operation can be used here. To drop rows from the DataFrame, it considers three parameters:

how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how
parameter.
subset – optional list of column names to consider.

train.na.drop().show()
What if I want to fill the null values in a DataFrame with a constant?

The fillna operation (also available as na.fill) can be used here. It takes the value to fill in and, optionally, a subset of
column names to consider.

Let’s fill -1 in place of the null values in the train DataFrame.

train.na.fill(-1).show(2)

What if I want to filter the rows in train which have Purchase more than 15000?

We can apply the filter operation on the Purchase column in the train DataFrame to filter out the rows with values more than
15000. We need to pass a condition. Let’s apply a filter on the Purchase column of the train DataFrame and print the number of
rows which have a Purchase greater than 15000.

train.filter(train.Purchase > 15000).count()


How to find the mean Purchase of each age group in train?

The groupby operation can be used here to find the mean of Purchase for each age group in train. Let’s see how we can
get the mean Purchase for each value of the ‘Age’ column in train.

train.groupBy('Age').agg({'Purchase': 'mean'}).show()

We can also apply sum, min, max, and count with groupby when we want to get different summary insights for each group.

train.groupBy('Age').count().show()
Interoperating with RDDs

To convert existing RDDs into DataFrames, Spark SQL supports two methods:

Reflection-based method: infers an RDD schema containing specific types of objects. This works well when the schema is
already known when writing the Spark application.

Programmatic method: enables you to build a schema and apply it to an already existing RDD. It allows building DataFrames
when you do not know the columns and their types until runtime.

•Reflection-Based Method

•In the reflection-based approach, the Scala interface allows converting an RDD containing case classes to a DataFrame
automatically for Spark SQL.
•The case class:
•defines the table schema, where the argument names of the case class are read using reflection and become the column names.
•can be nested and used to contain complex types, like a sequence of arrays.
The Scala interface implicitly converts the resultant RDD to a DataFrame and registers it as a table, which can then be used in
subsequent SQL statements.
Example Using the Reflection-Based Approach

In the next example, you will create an RDD of Person objects and register it as a table.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._

case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table:

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext:

val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// By field name:

teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]:

teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)

The SQL statements are run using the sql method provided by SQLContext; columns can then be accessed by position or by field
name, and getValuesMap retrieves multiple columns at once into a Map.
Using the Programmatic Approach
This method is used when you cannot define case classes ahead of time; for example, when the structure of records is
encoded in a string or a text dataset.

To create a DataFrame using the programmatic approach, the following steps can be used:

Use the existing RDD to create an RDD of rows.

Create the schema, represented by a StructType, matching the structure of the rows.

Apply the schema to the RDD of rows using the createDataFrame method.

In the next example, sc is an existing SparkContext. You will create an RDD, encode the schema in a string, and then
generate the schema based on that string.

// sc is an existing SparkContext:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD:

val people = sc.textFile("examples/src/main/resources/people.txt")


// The schema is encoded in a string:

val schemaString = "name age"

import org.apache.spark.sql.Row;

import org.apache.spark.sql.types.{StructType,StructField,StringType};

// Generate the schema based on the string of schema, convert records of the RDD (people) to Rows, and apply the
// schema to the RDD:

val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

peopleDataFrame.registerTempTable("people")

The map converts records of the people RDD to Rows, and createDataFrame applies the schema to the RDD to produce the DataFrame.
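
With the table registered, it can be queried with ordinary SQL, for example:

// Query the temporary table built from the programmatically specified schema.
val results = sqlContext.sql("SELECT name FROM people")
results.show()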
