RefCardz - SQL Syntax For Apache Drill

BY RICK F. VAN DER LANS

CONTENTS
Introduction
Installing and Starting Up
First SQL Queries
Querying Nested Data Structures
Querying Arrays... and more!
INTRODUCTION

This Refcard describes version 1.2 of Apache Drill, released in October
2015. Apache Drill is an open-source SQL-on-Everything engine. It allows
SQL queries to be executed on any kind of data source, ranging from
simple CSV files to advanced SQL and NoSQL database servers.

Drill has three distinguishing features:

- Drill can access flat, relational data structures as well as data
  sources with non-relational structures, such as arrays, hierarchies,
  maps, nested tables, and complex data types. Besides being able to
  access these data sources, Drill can run queries that join data across
  multiple data sources, including non-relational and relational ones.

- Drill allows for schema-less access of data. In other words, it
  doesn't need access to the schema definition of the data source. It
  doesn't need to know the structure of the tables in advance, nor does
  it need statistical data. It goes straight for the data. The schema of
  the query result is therefore not known in advance. It's built up and
  derived when data comes back from the data source. During the
  processing of all the data, the schema of the query result is
  continuously adapted. For example, the schema of a Hadoop file (or any
  data source) doesn't have to be documented in Apache Hive to make it
  accessible for Drill.

- Unlike most SQL database servers, Drill doesn't have its own database.
  It has been designed and optimized to access other data sources. There
  is rarely ever a need to copy data from a data source to Hadoop to make
  it accessible by Drill.

APACHE DRILL'S SUPPORTED PLUGINS FOR DATA ACCESS

- CSV and TSV files
- Hadoop files with Parquet and AVRO file formats, including Amazon S3
- NoSQL databases, such as MongoDB and Apache HBase
- SQL database servers, such as MySQL and Oracle, through an ODBC/JDBC
  interface
- Files with JSON or BSON data structures
- SQL-on-Hadoop engines, such as Apache Hive and Impala

INSTALLING & STARTING UP APACHE DRILL

The following webpage contains detailed descriptions of how to install
Drill on various platforms:
https://drill.apache.org/docs/install-drill-introduction.

After installation of the software, Drill can be started by using the
command-line tool SQLLine:

sqlline.bat -u jdbc:drill:zk=local

Depending on the platform, the result looks something like this:

DRILL_ARGS - " -u jdbc:drill:zk=local"
Calculating Drill classpath...
oct 26, 2015 9:33:46 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.2.0
"json ain't no thang"
0: jdbc:drill:zk=local>

Drill is now ready to query data. To check if it really works, use the
following query:

VALUES(CURRENT_DATE);

The resulting output should have today's date:

+---------------+
| CURRENT_DATE  |
+---------------+
| 2015-10-28    |
+---------------+

To stop Drill, enter !quit.

FIRST SQL QUERIES

Create a simple text file with the following text and call it
HelloWorld.csv:

HelloWorld!

Query the file:

SELECT *
FROM dfs.`\MyDirectory\HelloWorld.csv`;

Result:

+------------------+
| columns          |
+------------------+
| ["HelloWorld!"]  |
+------------------+

The unique thing about this query is its use of the FROM clause. Instead
of a table name, it contains a reference to the file to be accessed.
The term dfs represents one of the supported storage plugins.
This particular plugin indicates that a file in the local file system
is accessed. The storage plugin is followed by a file specification
containing the correct directory and the file name. Note that the file
specification must be specified in between backticks and not single
quotes. The file contains only one line with data, so only one row is
returned. Because no column names are specified, the column name
columns is used.
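
The way a text file without column names ends up in a single array
column can be modelled outside Drill. The following Python sketch is an
illustration only: the helper name read_as_columns is hypothetical,
Python's csv module stands in for Drill's text reader, and no Drill API
is involved.

```python
import csv
import io


def read_as_columns(text):
    """Model a headerless CSV read: each non-empty line becomes one row
    whose only field, named 'columns', holds the values on that line."""
    reader = csv.reader(io.StringIO(text))
    return [{"columns": fields} for fields in reader if fields]


rows = read_as_columns("HelloWorld!\n")
print(rows)  # [{'columns': ['HelloWorld!']}]
```

A file with several comma-separated values per line would, in this
model, yield a longer list in columns for each row rather than
additional named columns.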

Create a file with the following content and call it Employees.json:

{ "number"  : 6,
  "name"    : "Manzarek",
  "initials": "R",
  "street"  : "Haseltine Lane",
  "town"    : "Phoenix" }
{ "number"  : 8,
  "name"    : "Young",
  "initials": "N",
  "street"  : "Brownstreet",
  "mobile"  : "1234567" }
{ "number"  : 15,
  "name"    : "Metheny",
  "initials": "M",
  "province": "South"
}

Query the file:

SELECT * FROM dfs.`\MyDirectory\Employees.json`;

The JSON data structure is turned into a flat table, and for the missing
values in specific columns, the null value is presented. Result:

+---------+-----------+-----------+-----------------+----------+----------+-----------+
| number  | name      | initials  | street          | town     | mobile   | province  |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| 6       | Manzarek  | R         | Haseltine Lane  | Phoenix  | null     | null      |
| 8       | Young     | N         | Brownstreet     | null     | 1234567  | null      |
| 15      | Metheny   | M         | null            | null     | null     | South     |
+---------+-----------+-----------+-----------------+----------+----------+-----------+

QUERYING NESTED DATA STRUCTURES

Many data sources, such as MongoDB, Hadoop with AVRO, and JSON files,
contain nested data structures. In relational terminology they would be
called columns within columns. To address these nested columns, a
specific syntax is introduced. Several examples are used to illustrate
this syntax.

Create a file called EmployeesNested.json containing the following:

{ "employee" : {
    "number"  : 6,
    "name"    : { "lastname": "Manzarek",
                  "initials": "R" },
    "address" : { "street"  : "Haseltine Lane",
                  "houseno" : 80,
                  "postcode": "1234KK",
                  "town"    : "Stratford" } }
}
{ "employee" : {
    "number"  : 8,
    "name"    : { "lastname": "Young",
                  "initials": "N" },
    "address" : { "street"  : "Brownstreet",
                  "houseno" : 80,
                  "province": "ZH",
                  "town"    : "Boston" } }
}
{ "employee" : {
    "number"  : 15,
    "name"    : { "lastname": "Metheny",
                  "initials": "M",
                  "code"    : 45 } }
}

Query the file:

SELECT emp.employee.number AS enr, emp.employee.name.lastname AS lastname
FROM dfs.`\MyDirectory\EmployeesNested.json` AS emp;

Result:

+------+-----------+
| enr  | lastname  |
+------+-----------+
| 6    | Manzarek  |
| 8    | Young     |
| 15   | Metheny   |
+------+-----------+

To refer to the employee numbers, the specification emp.employee.number
is used. emp refers to the table alias specified in the FROM clause.
Specifying this alias is required when dealing with nested tables;
otherwise Drill may confuse columns with tables and therefore won't run
the query. Next, employee refers to the first level in the JSON document
and number to the next level. To get the last name of an employee,
employee must be specified, followed by name and lastname. Assigning
column names to these two result columns is not required, but does make
the result easier to read.

QUERYING ARRAYS

Many data sources contain arrays (sometimes called repeating groups).
Drill supports the flatten function to transform these arrays into flat,
relational data structures.

Create a file with the following content (note that each employee can
work for several projects) and call it EmployeesArrays.json:

{ "number"  : 8,
  "projects": [ "ACP3", "FGTR" ]
}
{ "number"  : 15,
  "projects": [ "ACP3", "HHGT", "X456" ]
}

Query the file:

SELECT *
FROM dfs.`\MyDirectory\EmployeesArrays.json`;

Result:

+---------+-------------------------+
| number  | projects                |
+---------+-------------------------+
| 8       | ["ACP3","FGTR"]         |
| 15      | ["ACP3","HHGT","X456"]  |
+---------+-------------------------+

In each row (so for each employee), the projects column contains a set
of project values. To see them as separate values, use the flatten
function:

SELECT FLATTEN(projects) AS project, number AS enr
FROM dfs.`\MyDirectory\EmployeesArrays.json`;

In the result each row contains a separate project value and the number
of the employee to which the project value belongs:

+----------+------+
| project  | enr  |
+----------+------+
| ACP3     | 8    |
| FGTR     | 8    |
| ACP3     | 15   |
| HHGT     | 15   |
| X456     | 15   |
+----------+------+
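
What flattening the projects array does to the rows can be modelled
outside Drill. The following Python sketch is an illustration under
stated assumptions: the helper name flatten and its parameters are
hypothetical, the input mirrors EmployeesArrays.json, and rows with an
empty or missing array simply produce no output in this model.

```python
def flatten(rows, array_field, alias):
    """Emit one output row per array element, repeating the other fields."""
    out = []
    for row in rows:
        # Every field except the array is copied onto each output row.
        rest = {k: v for k, v in row.items() if k != array_field}
        for element in row.get(array_field) or []:
            out.append({alias: element, **rest})
    return out


employees = [
    {"number": 8, "projects": ["ACP3", "FGTR"]},
    {"number": 15, "projects": ["ACP3", "HHGT", "X456"]},
]

for r in flatten(employees, "projects", "project"):
    print(r["project"], r["number"])
```

The two input rows become five output rows, one per project value, which
matches the shape of the flattened query result shown earlier.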

Besides flatten, Drill supports two additional functions to work with
the contents of arrays: repeated_count (counts the number of values in
an array) and repeated_contains (searches for a specific value in an
array):

SELECT number AS enr, REPEATED_COUNT(projects) AS NumberOfProjects,
REPEATED_CONTAINS(projects, 'FGTR') AS ContainsFGTR
FROM dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+------+-------------------+---------------+
| enr  | NumberOfProjects  | ContainsFGTR  |
+------+-------------------+---------------+
| 8    | 2                 | true          |
| 15   | 3                 | false         |
+------+-------------------+---------------+

Individual values can be retrieved from an array:

SELECT number, projects[0], projects[1], projects[2]
FROM dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+---------+---------+---------+---------+
| number  | EXPR$1  | EXPR$2  | EXPR$3  |
+---------+---------+---------+---------+
| 8       | ACP3    | FGTR    | null    |
| 15      | ACP3    | HHGT    | X456    |
+---------+---------+---------+---------+

QUERYING MAPS WITH DATA

The kvgen function (Key-Value Generation) transforms maps that contain
arbitrary keys instead of a fixed schema into arrays of key-value pairs.
Create a file with the following content and call it CarOwners.json.
This file contains the number of cars that each car owner owns by each
manufacturer. The element cars doesn't have a schema, but instead
contains a set of keys (car manufacturer) and each key has a value
(number of cars of that particular manufacturer):

{ "owner" : 1,
  "cars"  : { "Ford"    : 3,
              "BMW"     : 2,
              "Ferrari" : 1 }
}
{ "owner" : 2,
  "cars"  : { "BMW" : 4,
              "GM"  : 5 }
}

Query the file:

SELECT KVGEN(cars) AS cars
FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+--------------------------------------------------------------------------------+
| cars                                                                           |
+--------------------------------------------------------------------------------+
| [{"key":"Ford","value":3},{"key":"BMW","value":2},{"key":"Ferrari","value":1}] |
| [{"key":"BMW","value":4},{"key":"GM","value":5}]                               |
+--------------------------------------------------------------------------------+

The effect of the kvgen function is that the car data inside the cars
element is transformed into an array of maps, each with two elements:
key and value. The result of this function can subsequently be processed
by the flatten function to turn each key-value combination into a
separate row:

SELECT owner, FLATTEN(KVGEN(cars)) AS cars
FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+--------+------------------------------+
| owner  | cars                         |
+--------+------------------------------+
| 1      | {"key":"Ford","value":3}     |
| 1      | {"key":"BMW","value":2}      |
| 1      | {"key":"Ferrari","value":1}  |
| 2      | {"key":"BMW","value":4}      |
| 2      | {"key":"GM","value":5}       |
+--------+------------------------------+

COMBINING MULTIPLE FUNCTIONS IN QUERIES

Apache Drill supports all the query features to be expected from a SQL
product. The next example shows how the special functions can be
combined with more traditional joins and window functions. Create a file
with the following content and call it EmployeesProjects.json. Each line
in this file indicates how many hours an employee has worked on a
project on a specific day.

{"enr": 8, "project":"ACP3", "date":"2015-10-01", "hours":4}
{"enr": 8, "project":"ACP3", "date":"2015-10-04", "hours":5}
{"enr": 8, "project":"FGTR", "date":"2015-10-02", "hours":2}
{"enr":15, "project":"ACP3", "date":"2015-10-01", "hours":7}
{"enr":15, "project":"ACP3", "date":"2015-10-03", "hours":5}
{"enr":15, "project":"HHGT", "date":"2015-10-01", "hours":4}
{"enr":15, "project":"HHGT", "date":"2015-10-05", "hours":2}
{"enr":15, "project":"HHGT", "date":"2015-10-07", "hours":8}
{"enr":15, "project":"X456", "date":"2015-10-01", "hours":6}

Join this file with the EmployeesNested.json file:

SELECT CAST(proj.`date` AS DATE) AS pdate, emp.employee.number AS enr,
emp.employee.name.lastname AS ename, proj.project, proj.hours,
SUM(CAST(proj.hours AS INTEGER))
OVER(PARTITION BY proj.`date`, emp.employee.number) AS sum_hours
FROM dfs.`\MyDirectory\EmployeesNested.json` AS emp
LEFT OUTER JOIN
dfs.`\MyDirectory\EmployeesProjects.json` AS proj
ON CAST(emp.employee.number AS INTEGER) = CAST(proj.enr AS INTEGER)
ORDER BY 1, 2;

Result:

+-------------+------+-----------+----------+--------+------------+
| pdate       | enr  | ename     | project  | hours  | sum_hours  |
+-------------+------+-----------+----------+--------+------------+
| 2015-10-01  | 15   | Metheny   | ACP3     | 7      | 17         |
| 2015-10-01  | 15   | Metheny   | HHGT     | 4      | 17         |
| 2015-10-01  | 15   | Metheny   | X456     | 6      | 17         |
| 2015-10-01  | 8    | Young     | ACP3     | 4      | 4          |
| 2015-10-02  | 8    | Young     | FGTR     | 2      | 2          |
| 2015-10-03  | 15   | Metheny   | ACP3     | 5      | 5          |
| 2015-10-04  | 8    | Young     | ACP3     | 5      | 5          |
| 2015-10-05  | 15   | Metheny   | HHGT     | 2      | 2          |
| 2015-10-07  | 15   | Metheny   | HHGT     | 8      | 8          |
| null        | 6    | Manzarek  | null     | null   | null       |
+-------------+------+-----------+----------+--------+------------+

DEFINITIONS OF THE SQL STATEMENTS FOR QUERYING DATA SOURCES

This section contains the definitions of the SQL statements supported by
Drill related to querying.

QUERY STATEMENT | DEFINITION

SELECT

<select statement> ::=
  [ WITH <table name> [ ( <column name> [ , <column name> ]... ) ]
    AS ( <table expression> ) ]
  <table expression>

QUERY STATEMENT | DEFINITION (continued)

SELECT (continued)

<table expression> ::=
  SELECT <select clause>
  FROM <from clause>
  [ WHERE <boolean expression> ]
  [ GROUP BY <expression> [ , <expression> ]... ]
  [ HAVING <boolean expression> ]
  [ ORDER BY <clause> ]
  [ LIMIT { <count> | ALL } ]
  [ OFFSET <number> { ROW | ROWS } ]

<from clause> ::=
  <table reference> [ , <table reference> ]...

<table reference> ::=
  { <table specification> |
    <join specification> |
    ( <table expression> ) |
    VALUES ( <expression list> ) [ , ( <expression list> ) ]... }
  [ [ AS ] <alias name> [ ( <column name> [ , <column name> ]... ) ] ]

<table specification> ::=
  <table name> | <storage plugin> . `<workspace>`

<join specification> ::=
  <table reference> <join type> <table reference> [ <join condition> ]

<join condition> ::= ON <condition>

<join type> ::=
  [ INNER ] JOIN | LEFT [ OUTER ] JOIN |
  RIGHT [ OUTER ] JOIN | FULL [ OUTER ] JOIN

<boolean expression> ::=
  <boolean expression> { AND | OR } <boolean expression> |
  NOT <boolean expression> |
  <case expression> |
  <expression> { < | > | <= | >= | = | <> } <expression> |
  <expression> [ NOT ] BETWEEN <expression> AND <expression> |
  <expression> IN ( <expression> [ , <expression> ]... ) |
  <expression> { LIKE | ILIKE | NOT LIKE |
                 SIMILAR TO | NOT SIMILAR TO } <expression> |
  <expression> { IS [ NOT ] NULL | IS [ NOT ] FALSE |
                 IS [ NOT ] TRUE } |
  <expression> { IN | ANY | ALL } ( <table expression> ) |
  EXISTS ( <table expression> ) |
  <expression> || <expression>

<case expression> ::=
  CASE
    WHEN <expression> [ , <expression> [ ... ] ] THEN <statements>
    [ WHEN <expression> [ , <expression> [ ... ] ] THEN <statements> ]...
    [ ELSE <statements> ]
  END

EXPLAIN

EXPLAIN PLAN
  [ INCLUDING ALL ATTRIBUTES ]
  [ WITH IMPLEMENTATION | WITHOUT IMPLEMENTATION ]
  FOR <select statement>

VALUES

<values statement> ::=
  VALUES ( <expression list> ) [ , ( <expression list> ) ]...

<expression list> ::=
  <expression> [ , <expression> ]...

DEFINITIONS OF DATA DEFINITION STATEMENTS

This section contains the definitions of the data definition statements
supported by Drill.

SQL STATEMENT | DEFINITION
CREATE TABLE | CREATE TABLE name [ ( <column list> ) ] [ PARTITION BY ( <column_name> [ , ... ] ) ] AS <select statement>
CREATE VIEW | CREATE [ OR REPLACE ] VIEW [ <workspace> . ] <view name> [ ( <column name> [ , ... ] ) ] AS <select statement>
DROP TABLE | DROP TABLE [ <workspace> . ] <table name>
DROP VIEW | DROP VIEW [ <workspace> . ] <view name>
DESCRIBE | DESCRIBE [ <workspace> . ] { <table name> | <view name> }
SHOW DATABASES | SHOW DATABASES

DEFINITIONS OF SQL STATEMENTS FOR WORKING WITH SCHEMAS AND SESSIONS

SQL STATEMENT | DEFINITION
SHOW SCHEMAS | SHOW SCHEMAS (Shows all the schemas that can be used.)
SHOW FILES | SHOW FILES [ FROM <filesystem> . <directory name> | IN <filesystem> . <directory name> ]
SHOW TABLES | SHOW TABLES
USE | USE <schema name>
ALTER SESSION | ALTER SESSION SET `<option name>` = <value>
ALTER SYSTEM | ALTER SYSTEM SET `<option name>` = <value>

DATA TYPES

The following data types are supported by Drill and can be used when
converting the data types of values.

DATA TYPE | DESCRIPTION
BIGINT | 8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Example: 9223372036854775807
BINARY | Variable-length byte string. Example: B@e6d9eb7
BOOLEAN | True or false. Example: true
DATE | Years, months, and days in YYYY-MM-DD format since 4713 BC. Example: 2015-12-30
DECIMAL(p,s), DEC(p,s), or NUMERIC(p,s)* | 38-digit precision number (precision is p, and scale is s). For example, DECIMAL(6,2) is 1234.56 (4 digits before and 2 digits after the decimal point).
FLOAT | 4-byte floating point number. Example: 0.456
DOUBLE or DOUBLE PRECISION | 8-byte floating point number, precision-scalable. Example: 0.456
INTEGER or INT | 4-byte signed integer in the range -2,147,483,648 to 2,147,483,647. Example: 2147483646
INTERVAL | A day-time or year-month interval. Examples: '1 10:20:30.123' (day-time) or '1-2' year to month (year-month). Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.
SMALLINT | 2-byte signed integer in the range -32,768 to 32,767. Example: 32000. This data type is not supported in version 1.2.
TIME | 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss. Example: 22:55:55.23
DATA TYPE | DESCRIPTION (continued)
TIMESTAMP | JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS. Example: 2015-12-30 22:55:55.23
CHARACTER VARYING, CHARACTER, CHAR, or VARCHAR | UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647. CHAR(30) casts data to a 30-character string maximum.

SQL SCALAR FUNCTIONS

This section contains descriptions of all the scalar functions supported
by Drill that can be used in any expression.

The processing logic of all the scalar functions can easily be tested by
using the VALUES statement. For example, with the following statement
the CBRT function can be tested:

VALUES(CBRT(64));

Result:

+---------+
| EXPR$0  |
+---------+
| 4.0     |
+---------+

NUMERIC FUNCTION | DATA TYPE OF RESULT | DEFINITION
ABS(x) | Data type of x | Returns the absolute value of x.
CBRT(x) | FLOAT8 | Returns the cube root of x.
CEIL(x) or CEILING(x) | Data type of x | Returns the smallest integer not less than x.
DEGREES(x) | FLOAT8 | Converts x radians to degrees.
E() | FLOAT8 | Returns 2.718281828459045.
EXP(x) | FLOAT8 | Returns e (Euler's number) to the power of x.
FLOOR(x) | Data type of x | Returns the largest integer not greater than x.
LOG(x) | FLOAT8 | Returns the natural log (base e) of x.
LOG(x, y) | FLOAT8 | Returns the logarithm of y to base x.
LOG10(x) | FLOAT8 | Returns the common log of x.
LSHIFT(x, y) | Data type of x | Shifts the binary x by y times to the left.
MOD(x, y) | FLOAT8 | Returns the remainder of x divided by y.
NEGATIVE(x) | Data type of x | Returns x as a negative number.
PI | FLOAT8 | Returns pi.
POW(x, y) | FLOAT8 | Returns the value of x to the y power.
RADIANS(x) | FLOAT8 | Converts x degrees to radians.
RAND | FLOAT8 | Returns a random number between 0 and 1.
ROUND(x) | Data type of x | Rounds to the nearest integer.
ROUND(x, y) | DECIMAL | Rounds x to y decimal places.
RSHIFT(x, y) | Data type of x | Shifts the binary x by y times to the right.
SIGN(x) | INT | Returns the sign of x.
SQRT(x) | Data type of x | Returns the square root of x.
TRUNC(x [ , y ] ) | Data type of x | Truncates x to y decimal places. y is optional. Default is 1.
TRUNC(x, y) | DECIMAL | Truncates x to y decimal places.

DATE/TIME FUNCTION | DATA TYPE OF RESULT | DEFINITION
AGE(x [ , y ] ) | INTERVALDAY or INTERVALYEAR | Returns the interval between two timestamps or subtracts a timestamp from midnight of the current date.
CURRENT_DATE | DATE | Returns the current date.
CURRENT_TIME | TIME | Returns the current time.
CURRENT_TIMESTAMP | TIMESTAMP | Returns the current timestamp.
DATE_ADD(x,y) | DATE, TIMESTAMP | Returns the sum of a date/time and a number of days/hours, or of a date/time and a date/time interval. x must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be added. In other situations an interval must be used. Use the CAST function to create an interval.
DATE_PART(x,y) | DOUBLE | Extracts a time unit from the value of a time, date, or timestamp expression (y). Allowed time units for x are second, minute, hour, day, month, and year. A time unit must be specified between single quotes. y represents a date, time, or timestamp.
DATE_SUB(x,y) | DATE, TIMESTAMP | Subtracts an interval (y) from a date or timestamp expression (x). x must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be subtracted. In other situations an interval must be used. Use the CAST function to create an interval.
EXTRACT(x FROM y) | DOUBLE | Extracts a time unit from a date or timestamp expression (y). x indicates which time unit to extract. This must be one of the following values: SECOND, MINUTE, HOUR, DAY, MONTH, or YEAR. The time unit must not be specified between quotes.
LOCALTIME | TIME | Returns the local current time.
LOCALTIMESTAMP | TIMESTAMP | Returns the local current timestamp.
NOW() | TIMESTAMP | Returns the current timestamp.
TIMEOFDAY() | VARCHAR | Returns the current timestamp for the UTC time zone.
UNIX_TIMESTAMP( [x] ) | BIGINT | If no x is specified, it returns the number of seconds since the UNIX epoch (January 1, 1970 at 00:00:00). If x is specified, it must be a timestamp; then the number of seconds between the UNIX epoch and the timestamp x is returned.

STRING FUNCTION | DATA TYPE OF RESULT | DEFINITION
BYTE_SUBSTR(x,y [ , z ] ) | BINARY or VARCHAR | Returns in binary format a substring y of the string x.
CHAR_LENGTH(x) | INTEGER | Returns the length of the alphanumeric argument x.
CONCAT(x,y) | VARCHAR | Combines the two alphanumeric values x and y. Has the same effect as the || operator.
INITCAP(x) | VARCHAR | Returns x in which the first character is capitalized.
LENGTH(x) | INTEGER | Returns the length in bytes of the alphanumeric value x.
LOWER(x) | VARCHAR | Converts all upper-case letters of x to lower-case letters.
STRING FUNCTION | DATA TYPE OF RESULT | DEFINITION (continued)
LPAD(x,y [ , z ] ) | VARCHAR | The value of x is padded at the front (the left-hand side) with the value of z until the total length of the value is equal to y. If y is smaller than the length of x, x is shortened on the left side. If no z is specified, blanks are used.
LTRIM(x) | VARCHAR | Removes all blanks that appear at the beginning of x.
POSITION(x IN y) | INTEGER | Returns the start position of the string x in the string y.
REGEXP_REPLACE(x,y,z) | VARCHAR | Substitutes new text for substrings that match Java regular expression patterns. In the string x, y is replaced by z. y is the regular expression.
RPAD(x,y,z) | VARCHAR | The value of x is padded at the end (the right-hand side) with the value of z until the total length of the value is equal to y. If y is smaller than the length of x, x is shortened on the right side.
RTRIM(x) | VARCHAR | Removes all blanks from the end of the value of x.
STRPOS(x,y) | INTEGER | Returns the start position of the string y in the string x.
SUBSTR(x,y [ , z ] ) | VARCHAR | Returns the substring of x that starts at position y; the optional z limits its length in characters.
TRIM(x) | VARCHAR | Removes all blanks from the start and from the end of x. Blanks in the middle are not removed.
UPPER(x) | VARCHAR | Converts all lower-case letters of x to upper-case letters.

DATA TYPE CONVERSION FUNCTION | DATA TYPE OF RESULT | DEFINITION
CAST(x AS y) | Data type of y | Converts the data type of x to y. y must be one of the supported data types; see the Data Types section.
CONVERT_TO(x,y) | Data type of y | Converts binary data (x) to Drill internal types (y) based on the little or big endian encoding of the data.
CONVERT_FROM(x,y) | Data type of y | Converts binary data (x) from Drill internal types (y) based on the little or big endian encoding of the data.

NULL HANDLING FUNCTION | DATA TYPE OF RESULT | DEFINITION
COALESCE(x,y [ , y ]... ) | Data type of y | Returns the first non-null argument in the argument list.
NULLIF(x,y) | Data type of x | Returns the value of x if x and y are not equal, and returns a null value if x and y are equal.

SQL AGGREGATE AND WINDOW FUNCTIONS

This section contains descriptions of all the aggregate and window
functions supported by Drill that can be used in any expression.

AGGREGATE FUNCTION | DATA TYPE OF RESULT | DEFINITION
AVG(x) | DECIMAL for any integer-type argument; DOUBLE for a floating-point argument; otherwise the data type of x | Calculates the average of all values in x.
COUNT(*) | BIGINT | Counts the number of rows.
COUNT([DISTINCT] x) | BIGINT | Counts the number of non-null values in x, and if DISTINCT is specified the number of different non-null values in x.
MAX(x) | Data type of x | Determines the maximum value of all non-null values in x.
MIN(x) | Data type of x | Determines the minimum value of all non-null values in x.
SUM(x) | BIGINT for SMALLINT or INTEGER arguments; DECIMAL for BIGINT arguments; DOUBLE for floating-point arguments; otherwise the same as the data type of x | Calculates the sum of all values in x.

The syntax definition for all the window functions in the table below is
as follows:

<window_function_name> ( [ ALL ] <expression> )
  OVER ( [ PARTITION BY <expression list> ] [ ORDER BY <order list> ] )

WINDOW FUNCTION | DATA TYPE OF RESULT | DEFINITION
AVG(x) OVER (y) | DECIMAL for any integer-type argument; DOUBLE for a floating-point argument; otherwise the data type of x | Returns the average value for the input expression values. It works with numeric values and ignores null values.
COUNT( { * | x } ) OVER (y) | BIGINT | Counts the number of input rows. COUNT(*) counts all of the rows in the target table whether or not they include nulls. COUNT(x) computes the number of rows with non-NULL values in a specific column or expression.
MAX(x) OVER (y) | Data type of x | Determines the maximum value of all non-null values in x.
MIN(x) OVER (y) | Data type of x | Determines the minimum value of all non-null values in x.
SUM(x) OVER (y) | BIGINT for SMALLINT or INTEGER arguments; DECIMAL for BIGINT arguments; DOUBLE for floating-point arguments; otherwise the data type of x | Calculates the sum of all values in x.
CUME_DIST() OVER (y) | DOUBLE PRECISION | Calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition).
DENSE_RANK() OVER (y) | BIGINT | Determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.
WINDOW FUNCTION | DATA TYPE OF RESULT | DEFINITION (continued)
NTILE(x) OVER (y) | INTEGER | Divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.
PERCENT_RANK() OVER (y) | DOUBLE PRECISION | Calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.
RANK() OVER (y) | BIGINT | Determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers (e.g., if two rows are ranked 1, the next rank is 3). The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.
ROW_NUMBER() OVER (y) | BIGINT | Determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers non-deterministically.
LAG(x) OVER (y) | Data type of x | Returns the value for the row before the current row in a partition. If no row exists, NULL is returned.
LEAD(x) OVER (y) | Data type of x | Returns the value for the row after the current row in a partition. If no row exists, NULL is returned.
FIRST_VALUE(x) OVER (y) | Data type of x | Returns the value of x with respect to the first row in the window frame.
LAST_VALUE(x) OVER (y) | Data type of x | Returns the value of x with respect to the last row in the window frame.

SQL NESTED FUNCTIONS

This section contains descriptions of the functions supported by Drill
for manipulating arrays and nested data.

NESTED DATA FUNCTION | DATA TYPE OF RESULT | DEFINITION
FLATTEN(x) | - | Separates the elements in a repeated field x into individual rows; see the Querying Arrays section.
KVGEN(x) | VARCHAR | Returns a list of key-value pairs for the keys that exist in the map x; see the Querying Maps with Data section.
REPEATED_COUNT(x) | INTEGER | Counts the number of values in x. x must be an array. See also the Querying Arrays section.
REPEATED_CONTAINS(x,y) | BOOLEAN | Determines if the value y appears in the array x. y may contain the following regular expression wildcards: asterisk (*), period (.), question mark (?), square bracketed ranges [a-z], square bracketed characters [ch], and negated square bracketed ranges or characters [!ch]. See also the Querying Arrays section.
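
The behaviour of the nested-data functions can be modelled in a few
lines of Python. This is a sketch under stated assumptions, not Drill
code: the three helper functions are hypothetical stand-ins named after
the Drill functions, the sample data mirrors the car-owner and project
examples in this Refcard, and Python's fnmatch module only approximates
the wildcard matching, so Drill's exact matching rules may differ.

```python
from fnmatch import fnmatchcase


def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of key/value pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def repeated_count(array):
    """Count the values in an array."""
    return len(array)


def repeated_contains(array, pattern):
    """Report whether any array value matches the shell-style pattern
    (an approximation of the wildcard matching, see the note above)."""
    return any(fnmatchcase(str(value), pattern) for value in array)


cars = {"Ford": 3, "BMW": 2, "Ferrari": 1}
projects = ["ACP3", "HHGT", "X456"]

print(kvgen(cars))
print(repeated_count(projects))             # 3
print(repeated_contains(projects, "FGTR"))  # False
print(repeated_contains(projects, "X4*"))   # True
```

Feeding the list returned by kvgen into a row-per-element loop gives the
same shape as combining FLATTEN with KVGEN: one row per key-value pair.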

ABOUT THE AUTHOR

Rick van der Lans is an independent analyst, author, and internationally
acclaimed lecturer who works for R20/Consultancy. He specializes in
database technology, data warehousing, and big data. He has written
several books on SQL. His popular book Introduction to SQL has been
translated into numerous languages and has sold more than 100,000
copies.

RESOURCES

- Apache Drill Website
- Apache Drill Documentation
- Apache Drill Download
- SQLLine

Copyright 2015 DZone, Inc. All rights reserved. No part of this
publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by means electronic, mechanical,
photocopying, or otherwise, without prior written permission of the
publisher.
