
SQL Syntax for Apache Drill
BY RICK F. VAN DER LANS

CONTENTS
Introduction
Installing and Starting Up
First SQL Queries
Querying Nested Data Structures
Querying Arrays... and more!

INTRODUCTION

DRILL_ARGS - -u jdbc:drill:zk=local
Calculating Drill classpath...
oct 26, 2015 9:33:46 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.2.0
"json ain't no thang"
0: jdbc:drill:zk=local>

This Refcard describes version 1.2 of Apache Drill, released in October 2015. Apache Drill is an open-source SQL-on-Everything engine. It allows SQL queries to be executed on any kind of data source, ranging from simple CSV files to advanced SQL and NoSQL database servers.
Drill has three distinguishing features:

Drill is now ready to query data. To check if it really works, use the
following query:

Drill can access flat, relational data structures as well as data sources with non-relational structures, such as arrays, hierarchies, maps, nested tables, and complex data types. Besides being able to access these data sources, Drill can run queries that join data across multiple data sources, including non-relational and relational ones.

VALUES(CURRENT_DATE);

The output should show today's date:

+---------------+
| CURRENT_DATE  |
+---------------+
| 2015-10-28    |
+---------------+

Drill allows for schema-less access to data. In other words, it doesn't need access to the schema definition of the data source. It doesn't need to know the structure of the tables in advance, nor does it need statistical data. It goes straight for the data. The schema of the query result is therefore not known in advance. It's built up and derived when data comes back from the data source. During the processing of all the data, the schema of the query result is continuously adapted. For example, the schema of a Hadoop file (or any data source) doesn't have to be documented in Apache Hive to make it accessible for Drill.

To stop Drill, enter !quit.

FIRST SQL QUERIES


Create a simple text file with the following text and call it
HelloWorld.csv:
HelloWorld!

Unlike most SQL database servers, Drill doesn't have its own database. It has been designed and optimized to access other data sources. There is rarely a need to copy data from a data source to Hadoop to make it accessible to Drill.

Query the file:


SELECT *
FROM dfs.`\MyDirectory\HelloWorld.csv`;

Result:

APACHE DRILL'S SUPPORTED PLUGINS FOR DATA ACCESS

CSV and TSV files

+------------------+
|     columns      |
+------------------+
| ["HelloWorld!"]  |
+------------------+


Hadoop files with Parquet and AVRO file formats, including Amazon S3
NoSQL databases, such as MongoDB and Apache HBase
SQL database servers, such as MySQL and Oracle, through an ODBC/JDBC interface
Files with JSON or BSON data structures
SQL-on-Hadoop engines, such as Apache Hive and Impala


INSTALLING & STARTING UP APACHE DRILL


The following webpage contains detailed descriptions of how to install Drill on various platforms: https://drill.apache.org/docs/install-drill-introduction.
After installation of the software, Drill can be started by using the command-line tool SQLLine:
sqlline.bat -u jdbc:drill:zk=local


Depending on the platform, the result looks something like this:

DZONE, INC.

DZONE.COM

The distinctive part of this query is its FROM clause. Instead of a table name, it contains a reference to the file to be accessed. The term dfs represents one of the supported storage plugins. This particular plugin indicates that a file in the local file system is accessed. The storage plugin is followed by a file specification containing the directory and the file name. Note that the file specification must be enclosed in backticks, not single quotes. The file contains only one line of data, so only one row is returned. Because the file has no column names, the values are returned in a single array column named columns.
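The shape of this result can be modelled outside Drill. The following Python sketch is only an illustration of the idea (it is not how Drill is implemented, and the helper name read_headerless_csv is made up for this example): every line of a header-less CSV file becomes one row with a single field, columns, that holds the values of that line as an array.

```python
import csv
import io


def read_headerless_csv(text):
    """Turn each CSV line into one row whose only field, 'columns',
    holds the list of values found on that line (illustrative model)."""
    reader = csv.reader(io.StringIO(text))
    return [{"columns": fields} for fields in reader if fields]


rows = read_headerless_csv("HelloWorld!\n")
print(rows)
```

A file with several comma-separated values per line would produce a longer array per row, one element for each value.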

Query the file:


SELECT emp.employee.number AS enr, emp.employee.name.lastname AS lastname
FROM dfs.`\MyDirectory\EmployeesNested.json` AS emp;

Result:
+------+-----------+
| enr  | lastname  |
+------+-----------+
| 6    | Manzarek  |
| 8    | Young     |
| 15   | Metheny   |
+------+-----------+

Create a file with the following content and call it Employees.json:


{ "number"  : 6,
  "name"    : "Manzarek",
  "initials": "R",
  "street"  : "Haseltine Lane",
  "town"    : "Phoenix" }
{ "number"  : 8,
  "name"    : "Young",
  "initials": "N",
  "street"  : "Brownstreet",
  "mobile"  : 1234567 }
{ "number"  : 15,
  "name"    : "Metheny",
  "initials": "M",
  "province": "South"
}

To refer to the employee numbers, the specification emp.employee.number is used. emp refers to the table alias specified in the FROM clause. Specifying this alias is required when dealing with nested tables; otherwise Drill may confuse columns with tables and therefore won't run the query. Next, employee refers to the first level in the JSON document and number to the next level. To get the last name of an employee, employee must be specified, followed by name and lastname. Assigning column names to these two result columns is not required, but it does make the result easier to read.
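The dotted notation can be read as a walk down the nested document, one level per name. The Python sketch below models that walk on trimmed-down copies of the EmployeesNested.json records. The helper get_path is a hypothetical name for this illustration, and returning None for a missing level is an assumption of the sketch that mirrors the null values Drill shows for absent fields.

```python
def get_path(record, path):
    """Follow a dotted path such as 'employee.name.lastname' through
    nested dictionaries; return None when a level is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Trimmed-down versions of the three records in EmployeesNested.json.
docs = [
    {"employee": {"number": 6, "name": {"lastname": "Manzarek", "initials": "R"},
                  "address": {"street": "Haseltine Lane", "town": "Stratford"}}},
    {"employee": {"number": 8, "name": {"lastname": "Young", "initials": "N"},
                  "address": {"street": "Brownstreet", "town": "Boston"}}},
    {"employee": {"number": 15, "name": {"lastname": "Metheny", "initials": "M"}}},
]

rows = [
    {"enr": get_path(d, "employee.number"),
     "lastname": get_path(d, "employee.name.lastname")}
    for d in docs
]
print(rows)
```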

QUERYING ARRAYS

Query the file:

Many data sources contain arrays (sometimes called repeating groups). Drill supports the flatten function to transform these arrays into flat, relational data structures.

SELECT * FROM dfs.`\MyDirectory\Employees.json`;

The JSON data structure is turned into a flat table, and for the missing
values in specific columns, the null value is presented. Result:

Create a file with the following content (note that each employee can
work for several projects) and call it EmployeesArrays.json:

+---------+-----------+-----------+-----------------+----------+----------+-----------+
| number  |   name    | initials  |     street      |   town   |  mobile  | province  |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| 6       | Manzarek  | R         | Haseltine Lane  | Phoenix  | null     | null      |
| 8       | Young     | N         | Brownstreet     | null     | 1234567  | null      |
| 15      | Metheny   | M         | null            | null     | null     | South     |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
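The flat table for Employees.json illustrates schema-less access: the set of columns is not declared anywhere but is discovered while the records are read, and missing values become null. The Python sketch below models that behaviour under simplifying assumptions (all records are held in memory, and the function name to_flat_table is invented for this example). It is not a description of Drill's internals.

```python
def to_flat_table(records):
    """Derive the column list from the records themselves (in order of
    first appearance) and fill missing values with None."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows


employees = [
    {"number": 6, "name": "Manzarek", "initials": "R",
     "street": "Haseltine Lane", "town": "Phoenix"},
    {"number": 8, "name": "Young", "initials": "N",
     "street": "Brownstreet", "mobile": 1234567},
    {"number": 15, "name": "Metheny", "initials": "M", "province": "South"},
]

columns, rows = to_flat_table(employees)
print(columns)
for row in rows:
    print(row)
```

Each new key that shows up in a later record simply adds a column, which is the "continuously adapted" schema described in the introduction.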

{ "number"  : 8,
  "projects": [ "ACP3", "FGTR" ]
}
{ "number"  : 15,
  "projects": [ "ACP3", "HHGT", "X456" ]
}

QUERYING NESTED DATA STRUCTURES

Query the file:


SELECT *
FROM dfs.`\MyDirectory\EmployeesArrays.json`;

Many data sources, such as MongoDB, Hadoop with AVRO, and JSON
files, contain nested data structures. In relational terminology they
would be called columns within columns. To address these nested
columns, a specific syntax is introduced. Several examples are used to
illustrate this syntax.

Result:
+---------+-------------------------+
| number  |        projects         |
+---------+-------------------------+
| 8       | ["ACP3","FGTR"]         |
| 15      | ["ACP3","HHGT","X456"]  |
+---------+-------------------------+

Create a file called EmployeesNested.json containing the following:


{ "employee" : {
    "number" : 6,
    "name"   : { "lastname": "Manzarek",
                 "initials": "R" },
    "address": { "street"  : "Haseltine Lane",
                 "houseno" : 80,
                 "postcode": "1234KK",
                 "town"    : "Stratford" } }
}
{ "employee" : {
    "number" : 8,
    "name"   : { "lastname": "Young",
                 "initials": "N" },
    "address": { "street"  : "Brownstreet",
                 "houseno" : 80,
                 "province": "ZH",
                 "town"    : "Boston" } }
}
{ "employee" : {
    "number" : 15,
    "name"   : { "lastname": "Metheny",
                 "initials": "M",
                 "code"    : 45 } }
}

In each row of EmployeesArrays.json (so for each employee), the projects column contains a set of project values. To see them as separate values, use the flatten function:

SELECT FLATTEN(projects) AS project, number AS enr
FROM dfs.`\MyDirectory\EmployeesArrays.json`;

In the result, each row contains a separate project value and the number of the employee to which the project value belongs:

+----------+------+
| project  | enr  |
+----------+------+
| ACP3     | 8    |
| FGTR     | 8    |
| ACP3     | 15   |
| HHGT     | 15   |
| X456     | 15   |
+----------+------+
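What flatten does to the EmployeesArrays.json records can be sketched in a few lines of Python. This is only a model of the semantics shown in the card, not Drill code. The function name flatten and the choice to emit no row for an empty or missing array are decisions of this sketch.

```python
def flatten(records, array_field):
    """Yield one output row per array element, repeating the other
    fields of the record for every element."""
    for record in records:
        for value in record.get(array_field) or []:
            row = {k: v for k, v in record.items() if k != array_field}
            row[array_field] = value
            yield row


employees = [
    {"number": 8, "projects": ["ACP3", "FGTR"]},
    {"number": 15, "projects": ["ACP3", "HHGT", "X456"]},
]

flat = [(row["projects"], row["number"]) for row in flatten(employees, "projects")]
print(flat)
```

The two input rows become five output rows, one per project value, which matches the project/enr result for the FLATTEN query.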


Result:

+--------+------------------------------+
| owner  |             cars             |
+--------+------------------------------+
| 1      | {"key":"Ford","value":3}     |
| 1      | {"key":"BMW","value":2}      |
| 1      | {"key":"Ferrari","value":1}  |
| 2      | {"key":"BMW","value":4}      |
| 2      | {"key":"GM","value":5}       |
+--------+------------------------------+

Besides flatten, Drill supports two additional functions to work with the contents of arrays: repeated_count (counts the number of values in an array) and repeated_contains (searches for a specific value in an array):

SELECT number AS enr, REPEATED_COUNT(projects) AS NumberOfProjects,
       REPEATED_CONTAINS(projects, 'FGTR') AS ContainsFGTR
FROM dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

COMBINING MULTIPLE FUNCTIONS IN QUERIES

+------+-------------------+---------------+
| enr  | NumberOfProjects  | ContainsFGTR  |
+------+-------------------+---------------+
| 8    | 2                 | true          |
| 15   | 3                 | false         |
+------+-------------------+---------------+
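The array functions can be mimicked with plain list operations. The sketch below is a simplified Python model: repeated_contains here only does exact matching (the card notes that Drill's version also accepts wildcards), and element_at is a hypothetical helper that shows why indexing past the end of an array yields null rather than an error.

```python
def repeated_count(array):
    """Number of values in the array."""
    return len(array)


def repeated_contains(array, value):
    """True if the value occurs in the array (exact match only)."""
    return value in array


def element_at(array, index):
    """Zero-based element access; None when the index is out of range."""
    return array[index] if 0 <= index < len(array) else None


employees = [
    {"number": 8, "projects": ["ACP3", "FGTR"]},
    {"number": 15, "projects": ["ACP3", "HHGT", "X456"]},
]

summary = [
    (e["number"], repeated_count(e["projects"]),
     repeated_contains(e["projects"], "FGTR"), element_at(e["projects"], 2))
    for e in employees
]
print(summary)
```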

Apache Drill supports all the query features to be expected from a SQL
product. The next example shows how the special functions can be
combined with more traditional joins and window functions. Create a
file with the following content and call it EmployeesProjects.json.
Each line in this file indicates how many hours an employee has
worked on a project on a specific day.

Individual values can be retrieved from an array:


SELECT number, projects[0], projects[1], projects[2]
FROM dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+---------+---------+---------+---------+
| number  | EXPR$1  | EXPR$2  | EXPR$3  |
+---------+---------+---------+---------+
| 8       | ACP3    | FGTR    | null    |
| 15      | ACP3    | HHGT    | X456    |
+---------+---------+---------+---------+

Content of EmployeesProjects.json:

{"enr": 8,  "project":"ACP3", "date":"2015-10-01", "hours":4}
{"enr": 8,  "project":"ACP3", "date":"2015-10-04", "hours":5}
{"enr": 8,  "project":"FGTR", "date":"2015-10-02", "hours":2}
{"enr":15,  "project":"ACP3", "date":"2015-10-01", "hours":7}
{"enr":15,  "project":"ACP3", "date":"2015-10-03", "hours":5}
{"enr":15,  "project":"HHGT", "date":"2015-10-01", "hours":4}
{"enr":15,  "project":"HHGT", "date":"2015-10-05", "hours":2}
{"enr":15,  "project":"HHGT", "date":"2015-10-07", "hours":8}
{"enr":15,  "project":"X456", "date":"2015-10-01", "hours":6}

Join this file with the EmployeesNested.json file:

QUERYING MAPS WITH DATA

SELECT CAST(proj.`date` AS DATE) AS pdate, emp.employee.number AS enr,
       emp.employee.name.lastname AS ename, proj.project, proj.hours,
       SUM(CAST(proj.hours AS INTEGER))
         OVER (PARTITION BY proj.`date`, emp.employee.number) AS sum_hours
FROM dfs.`\Prive\ARTIKEL\DZone\Examples\EmployeesNested.json` AS emp
     LEFT OUTER JOIN
     dfs.`\Prive\ARTIKEL\DZone\Examples\EmployeesProjects.json` AS proj
     ON CAST(emp.employee.number AS INTEGER) = CAST(proj.enr AS INTEGER)
ORDER BY 1, 2;

The kvgen function (Key-Value Generation) transforms maps that contain keys instead of a schema into arrays of key-value pairs. Create a file with the following content and call it CarOwners.json. This file contains the number of cars that each car owner owns by each manufacturer. The element cars doesn't have a schema, but instead contains a set of keys (car manufacturers), and each key has a value (the number of cars of that particular manufacturer):

Result:

{ "owner" : 1,
  "cars"  : { "Ford"    : 3,
              "BMW"     : 2,
              "Ferrari" : 1 }
}
{ "owner" : 2,
  "cars"  : { "BMW" : 4,
              "GM"  : 5 }
}

+-------------+------+-----------+----------+--------+------------+
|    pdate    | enr  |   ename   | project  | hours  | sum_hours  |
+-------------+------+-----------+----------+--------+------------+
| 2015-10-01  | 15   | Metheny   | ACP3     | 7      | 17         |
| 2015-10-01  | 15   | Metheny   | HHGT     | 4      | 17         |
| 2015-10-01  | 15   | Metheny   | X456     | 6      | 17         |
| 2015-10-01  | 8    | Young     | ACP3     | 4      | 4          |
| 2015-10-02  | 8    | Young     | FGTR     | 2      | 2          |
| 2015-10-03  | 15   | Metheny   | ACP3     | 5      | 5          |
| 2015-10-04  | 8    | Young     | ACP3     | 5      | 5          |
| 2015-10-05  | 15   | Metheny   | HHGT     | 2      | 2          |
| 2015-10-07  | 15   | Metheny   | HHGT     | 8      | 8          |
| null        | 6    | Manzarek  | null     | null   | null       |
+-------------+------+-----------+----------+--------+------------+
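The sum_hours column in this result combines two steps: a left outer join (which keeps employee 6 even though no project rows match) and a sum over the partition formed by date and employee number. The Python sketch below reproduces both steps on the same data. It is a model of the logic only. The function name join_with_sum_hours is made up for this illustration, and the sketch does not attempt to reproduce the ORDER BY of the SQL query.

```python
from collections import defaultdict

employees = [(6, "Manzarek"), (8, "Young"), (15, "Metheny")]
projects = [
    {"enr": 8, "project": "ACP3", "date": "2015-10-01", "hours": 4},
    {"enr": 8, "project": "ACP3", "date": "2015-10-04", "hours": 5},
    {"enr": 8, "project": "FGTR", "date": "2015-10-02", "hours": 2},
    {"enr": 15, "project": "ACP3", "date": "2015-10-01", "hours": 7},
    {"enr": 15, "project": "ACP3", "date": "2015-10-03", "hours": 5},
    {"enr": 15, "project": "HHGT", "date": "2015-10-01", "hours": 4},
    {"enr": 15, "project": "HHGT", "date": "2015-10-05", "hours": 2},
    {"enr": 15, "project": "HHGT", "date": "2015-10-07", "hours": 8},
    {"enr": 15, "project": "X456", "date": "2015-10-01", "hours": 6},
]


def join_with_sum_hours(employees, projects):
    """Left outer join employees to projects, then add a per-(date, enr)
    total of hours to every joined row (a windowed SUM)."""
    by_enr = defaultdict(list)
    for p in projects:
        by_enr[p["enr"]].append(p)

    joined = []
    for enr, ename in employees:
        # An employee without projects still yields one row, filled with None.
        for p in by_enr.get(enr) or [None]:
            joined.append({
                "pdate": p["date"] if p else None,
                "enr": enr,
                "ename": ename,
                "project": p["project"] if p else None,
                "hours": p["hours"] if p else None,
            })

    totals = defaultdict(int)
    for row in joined:
        if row["hours"] is not None:
            totals[(row["pdate"], row["enr"])] += row["hours"]
    for row in joined:
        row["sum_hours"] = totals.get((row["pdate"], row["enr"]))
    return joined


result = join_with_sum_hours(employees, projects)
for row in result:
    print(row)
```

Unlike GROUP BY, the windowed sum keeps every detail row and repeats the partition total on each of them, which is why 17 appears three times for employee 15 on 2015-10-01.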

Query the file:


SELECT KVGEN(cars) AS cars
FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

DEFINITIONS OF THE SQL STATEMENTS FOR QUERYING DATA SOURCES

+---------------------------------------------------------------------------------+
|                                      cars                                       |
+---------------------------------------------------------------------------------+
| [{"key":"Ford","value":3},{"key":"BMW","value":2},{"key":"Ferrari","value":1}]  |
| [{"key":"BMW","value":4},{"key":"GM","value":5}]                                |
+---------------------------------------------------------------------------------+
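The KVGEN result for CarOwners.json can be modelled in Python as turning each map into a list of key/value pairs. The sketch below is an illustration of the semantics only, with function and variable names chosen for this example. It also shows how the generated pairs can be spread over separate rows per owner, the same reshaping that flatten performs on arrays.

```python
def kvgen(mapping):
    """Turn a map with arbitrary keys into a list of
    {'key': ..., 'value': ...} pairs, preserving key order."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


owners = [
    {"owner": 1, "cars": {"Ford": 3, "BMW": 2, "Ferrari": 1}},
    {"owner": 2, "cars": {"BMW": 4, "GM": 5}},
]

# One array of pairs per owner (the KVGEN step).
generated = [kvgen(o["cars"]) for o in owners]

# One row per pair (KVGEN followed by a flatten step).
flattened = [
    {"owner": o["owner"], "cars": pair}
    for o in owners
    for pair in kvgen(o["cars"])
]

print(generated)
print(flattened)
```

Once the keys are ordinary values in a key field, they can be filtered, grouped, or joined like any other column.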

The effect of the kvgen function is that the car data inside the cars element is transformed into an array of maps, each with two elements: key and value. The result of this function can subsequently be processed by the flatten function to turn each key-value pair into a separate row:

SELECT owner, FLATTEN(KVGEN(cars)) AS cars
FROM dfs.`\MyDirectory\CarOwners.json`;

This section contains the definitions of the SQL statements supported by Drill related to querying.

QUERY STATEMENT | DEFINITION

SELECT

<select statement> ::=
   [ WITH <table name> [ ( <column name> [ , <column name> ]... ) ]
     AS ( <table expression> ) ]
   <table expression>

<table expression> ::=
   SELECT <select clause>
   FROM <from clause>
   [ WHERE <boolean expression> ]
   [ GROUP BY <expression> [ , <expression> ]... ]
   [ HAVING <boolean expression> ]
   [ ORDER BY <clause> ]
   [ LIMIT { <count> | ALL } ]
   [ OFFSET <number> { ROW | ROWS } ]

<from clause> ::=
   <table reference> [ , <table reference> ]...

<table reference> ::=
   { <table specification> |
     <join specification> |
     ( <table expression> ) |
     VALUES ( <expression list> ) [ , ( <expression list> ) ]... }
   [ [ AS ] <alias name> [ ( <column name> [ , <column name> ]... ) ] ]

<table specification> ::=
   <table name> | <storage plugin> . `<workspace>`

<join specification> ::=
   <table reference> <join type> <table reference> [ <join condition> ]

<join condition> ::= ON <condition>

<join type> ::=
   [ INNER ] JOIN | LEFT [ OUTER ] JOIN |
   RIGHT [ OUTER ] JOIN | FULL [ OUTER ] JOIN

<boolean expression> ::=
   <boolean expression> { AND | OR } <boolean expression> |
   NOT <boolean expression> |
   <case expression> |
   <expression> { < | > | <= | >= | = | <> } <expression> |
   <expression> [ NOT ] BETWEEN <expression> AND <expression> |
   <expression> IN ( <expression> [ , <expression> ]... ) |
   <expression> { LIKE | ILIKE | NOT LIKE |
                  SIMILAR TO | NOT SIMILAR TO } <expression> |
   <expression> { IS [ NOT ] NULL | IS [ NOT ] FALSE |
                  IS [ NOT ] TRUE } |
   <expression> { IN | ANY | ALL } ( <table expression> ) |
   EXISTS ( <table expression> ) |
   <expression> || <expression>

<case expression> ::=
   CASE
      WHEN <expression> [ , <expression> [ ... ] ] THEN <statements>
      [ WHEN <expression> [ , <expression> [ ... ] ] THEN <statements> ]...
      [ ELSE <statements> ]
   END

VALUES

<values statement> ::=
   VALUES ( <expression list> ) [ , ( <expression list> ) ]...

<expression list> ::=
   <expression> [ , <expression> ]...

EXPLAIN

EXPLAIN PLAN
   [ INCLUDING ALL ATTRIBUTES ]
   [ WITH IMPLEMENTATION | WITHOUT IMPLEMENTATION ]
   FOR <select statement>

DEFINITIONS OF DATA DEFINITION STATEMENTS

This section contains the definitions of the data definition statements supported by Drill.

SQL STATEMENT | DEFINITION

CREATE TABLE
   CREATE TABLE name [ ( <column list> ) ]
   [ PARTITION BY ( <column_name> [ , ... ] ) ]
   AS <select statement>

CREATE VIEW
   CREATE [OR REPLACE] VIEW [ <workspace> . ] <view name>
   [ ( <column name> [ , ... ] ) ]
   AS <select statement>

DROP TABLE
   DROP TABLE [ <workspace> . ] <table name>

DROP VIEW
   DROP VIEW [ <workspace> . ] <view name>

DESCRIBE
   DESCRIBE [ <workspace> . ] { <table name> | <view name> }

SHOW DATABASES
   SHOW DATABASES

DEFINITIONS OF SQL STATEMENTS FOR WORKING WITH SCHEMAS AND SESSIONS

SQL STATEMENT | DEFINITION

SHOW SCHEMAS
   SHOW SCHEMAS (Shows all the schemas that can be used.)

SHOW FILES
   SHOW FILES
   [ FROM <filesystem> . <directory name> |
     IN <filesystem> . <directory name> ]

SHOW TABLES
   SHOW TABLES

USE
   USE <schema name>

ALTER SESSION
   ALTER SESSION SET `<option name>` = <value>

ALTER SYSTEM
   ALTER SYSTEM SET `<option name>` = <value>

DATA TYPES

The following data types are supported by Drill and can be used when converting the data types of values.

DATA TYPE | DESCRIPTION | EXAMPLE
BIGINT | 8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. | 9223372036854775807
BINARY | Variable-length byte string. | B@e6d9eb7
BOOLEAN | True or false. | true
DATE | Years, months, and days in YYYY-MM-DD format since 4713 BC. | 2015-12-30
DECIMAL(p,s), DEC(p,s), or NUMERIC(p,s)* | 38-digit precision number (precision is p, and scale is s). For example, DECIMAL(6,2) is 1234.56 (4 digits before and 2 digits after the decimal point). |
FLOAT | 4-byte floating point number. | 0.456
DOUBLE or DOUBLE PRECISION | 8-byte floating point number, precision-scalable. | 0.456
INTEGER or INT | 4-byte signed integer in the range -2,147,483,648 to 2,147,483,647. | 2147483646
INTERVAL | A day-time or year-month interval. Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR. | '1 10:20:30.123' (day-time) or '1-2' year to month (year-month)
SMALLINT | 2-byte signed integer in the range -32,768 to 32,767. This data type is not supported in version 1.2. | 32000
TIME | 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss. | 22:55:55.23

DATA TYPE | DESCRIPTION | EXAMPLE
TIMESTAMP | JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS. | 2015-12-30 22:55:55.23
CHARACTER VARYING, CHARACTER, CHAR, or VARCHAR | UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647. CHAR(30) casts data to a 30-character string maximum. |

SQL SCALAR FUNCTIONS

This section contains descriptions of all the scalar functions supported by Drill that can be used in any expression.

The processing logic of all the scalar functions can easily be tested by using the VALUES statement. For example, with the following statement the CBRT function can be tested:
VALUES(CBRT(64));

Result:
+---------+
| EXPR$0  |
+---------+
| 4.0     |
+---------+

NUMERIC FUNCTION | DATA TYPE OF RESULT | DEFINITION
ABS(x) | Data type of x | Returns the absolute value of x.
CBRT(x) | FLOAT8 | Returns the cubic root of x.
CEIL(x) or CEILING(x) | Data type of x | Returns the smallest integer not less than x.
DEGREES(x) | FLOAT8 | Converts x radians to degrees.
E() | FLOAT8 | Returns 2.718281828459045.
EXP(x) | FLOAT8 | Returns e (Euler's number) to the power of x.
FLOOR(x) | Data type of x | Returns the largest integer not greater than x.
LOG(x) | FLOAT8 | Returns the natural log (base e) of x.
LOG(x, y) | FLOAT8 | Returns the logarithm of y to the base x.
LOG10(x) | FLOAT8 | Returns the common log of x.
LSHIFT(x, y) | Data type of x | Shifts the binary x by y times to the left.
MOD(x, y) | FLOAT8 | Returns the remainder of x divided by y.
NEGATIVE(x) | Data type of x | Returns x as a negative number.
PI | FLOAT8 | Returns pi.
POW(x, y) | FLOAT8 | Returns the value of x to the y power.
RADIANS | FLOAT8 | Converts x degrees to radians.
RAND | FLOAT8 | Returns a random number from 0-1.
ROUND(x) | Data type of x | Rounds to the nearest integer.
ROUND(x, y) | DECIMAL | Rounds x to y decimal places.
RSHIFT(x, y) | Data type of x | Shifts the binary x by y times to the right.
SIGN(x) | INT | Returns the sign of x.
SQRT(x) | Data type of x | Returns the square root of x.
TRUNC(x [ , y ] ) | Data type of x | Truncates x to y decimal places. y is optional. Default is 1.
TRUNC(x, y) | DECIMAL | Truncates x to y decimal places.

DATE/TIME FUNCTION | DATA TYPE OF RESULT | DEFINITION
AGE(x [, y ] ) | INTERVALDAY or INTERVALYEAR | Returns interval between two timestamps or subtracts a timestamp from midnight of the current date.
CURRENT_DATE | DATE | Returns the current date.
CURRENT_TIME | TIME | Returns the current time.
CURRENT_TIMESTAMP | TIMESTAMP | Returns the current timestamp.
DATE_ADD(x,y) | DATE, TIMESTAMP | Returns the sum of a date/time and a number of days/hours, or of a date/time and date/time interval. x must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be added. In other situations an interval must be used. Use the CAST function to create an interval.
DATE_PART(x,y) | DOUBLE | Extracts a time unit from the value of a time, date, or timestamp expression (y). Allowed time units for x are second, minute, hour, day, month, and year. A time unit must be specified between single quotes. y represents a date, time, or timestamp.
DATE_SUB(x,y) | DATE, TIMESTAMP | Subtracts an interval (y) from a date or timestamp expression (x). x must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be subtracted. In other situations an interval must be used. Use the CAST function to create an interval.
EXTRACT(x FROM y) | DOUBLE | Extracts a time unit from a date or timestamp expression (y). x indicates which time unit to extract. This must be one of the following values: SECOND, MINUTE, HOUR, DAY, MONTH, or YEAR. The time unit must not be specified between quotes.
LOCALTIME | TIME | Returns the local current time.
LOCALTIMESTAMP | TIMESTAMP | Returns the local current timestamp.
NOW() | TIMESTAMP | Returns the current timestamp.
TIMEOFDAY() | VARCHAR | Returns the current timestamp for the UTC time zone.
UNIX_TIMESTAMP( [x] ) | BIGINT | If no x is specified, it returns the number of seconds since the UNIX epoch (January 1, 1970 at 00:00:00). If x is specified, it must be a timestamp; then the number of seconds between the UNIX epoch and the timestamp x is returned.

STRING FUNCTION | DATA TYPE OF RESULT | DEFINITION
BYTE_SUBSTR(x,y [, z ] ) | BINARY or VARCHAR | Returns in binary format a substring y of the string x.
CHAR_LENGTH(x) | INTEGER | Returns the length of the alphanumeric argument x.
CONCAT(x,y) | VARCHAR | Combines the two alphanumeric values x and y. Has the same effect as the || operator.
INITCAP(x) | VARCHAR | Returns x in which the first character is capitalized.
LENGTH(x) | INTEGER | Returns the length in bytes of the alphanumeric value x.
LOWER(x) | VARCHAR | Converts all upper-case letters of x to lower-case letters.

STRING FUNCTION | DATA TYPE OF RESULT | DEFINITION
LPAD(x,y [ , z ] ) | VARCHAR | The value of x is filled in the front (the left-hand side) with the value of z just until the total length of the value is equal to y. If the maximum length is smaller than that of x, x is shortened on the left side. If no z is specified, blanks are used.
LTRIM(x) | VARCHAR | Removes all blanks that appear at the beginning of x.
POSITION(x IN y) | INTEGER | Returns the start position of the string x in the string y.
REGEXP_REPLACE(x,y,z) | VARCHAR | Substitutes new text for substrings that match Java regular expression patterns. In the string x, y is replaced by z. y is the regular expression.
RPAD(x,y,z) | VARCHAR | The value of x is filled at the end (the right-hand side) with the value of z just until the total length of the value is equal to y. If the maximum length is smaller than that of x, x is shortened on the right side.
RTRIM(x) | VARCHAR | Removes all blanks from the end of the value of x.
STRPOS(x,y) | INTEGER | Returns the start position of the string y in the string x.
SUBSTR(x,y,z) | VARCHAR | Extracts a substring of x that starts at position y and has an optional length of z characters.
TRIM(x) | VARCHAR | Removes all blanks from the start and from the end of x. Blanks in the middle are not removed.
UPPER(x) | VARCHAR | Converts all lower-case letters of x to upper-case letters.

DATA TYPE CONVERSION FUNCTION | DATA TYPE OF RESULT | DEFINITION
CAST(x AS y) | Data type of y | Converts the data type of x to y. y must be one of the supported data types; see Section 12.
CONVERT_TO(x,y) | Data type of y | Converts binary data (x) to Drill internal types (y) based on the little or big endian encoding of the data.
CONVERT_FROM(x,y) | Data type of y | Converts binary data (x) from Drill internal types (y) based on the little or big endian encoding of the data.

NULL HANDLING FUNCTION | DATA TYPE OF RESULT | DEFINITION
COALESCE(x,y [ , y ]... ) | Data type of y | Returns the first non-null argument in the list of y's.
NULLIF(x,y) | Data type of x | Returns the value of x if x and y are not equal, and returns a null value if x and y are equal.

SQL AGGREGATE AND WINDOW FUNCTIONS

This section contains descriptions of all the aggregate and window functions supported by Drill that can be used in any expression.

AGGREGATE FUNCTION | DATA TYPE OF RESULT | DEFINITION
AVG(x) | DECIMAL for any integer-type argument; DOUBLE for a floating-point argument; otherwise the data type of x. | Calculates the weighted average of all values in x.
COUNT(*) | BIGINT | Counts the number of rows.
COUNT([DISTINCT] x) | BIGINT | Counts the number of non-null values in x, and if DISTINCT is specified the number of different non-null values in x.
MAX(x) | Data type of x | Determines the maximum value of all non-null values in x.
MIN(x) | Data type of x | Determines the minimum value of all non-null values in x.
SUM(x) | BIGINT for SMALLINT or INTEGER arguments; DECIMAL for BIGINT arguments; DOUBLE for floating-point arguments; otherwise the same as data type of x. | Calculates the sum of all values in x.

The syntax definition for all the window functions in the table below is as follows:

<window_function_name> ( [ ALL ] <expression> )
   OVER ( [ PARTITION BY <expression list> ] [ ORDER BY <order list> ] )

WINDOW FUNCTION | DATA TYPE OF RESULT | DEFINITION
AVG(x) OVER (y) | DECIMAL for any integer-type argument; DOUBLE for a floating-point argument; otherwise the data type of x. | Returns the average value for the input expression values. It works with numeric values and ignores null values.
COUNT( { * | x } ) OVER (y) | BIGINT | Counts the number of input rows. COUNT(*) counts all of the rows in the target table whether or not they include nulls. COUNT(x) computes the number of rows with non-NULL values in a specific column or expression.
COUNT(x) OVER (y) | BIGINT | Counts the number of input rows. COUNT(*) counts all of the rows in the target table whether or not they include nulls. COUNT(expression) computes the number of rows with non-NULL values in a specific column or expression.
MAX(x) OVER (y) | Data type of x | Determines the maximum value of all non-null values in x.
MIN(x) OVER (y) | Data type of x | Determines the minimum value of all non-null values in x.
SUM(x) OVER (y) | BIGINT for SMALLINT or INTEGER arguments; DECIMAL for BIGINT arguments; DOUBLE for floating-point arguments; otherwise the data type of x. | Calculates the sum of all values in x.
CUME_DIST() OVER (y) | DOUBLE PRECISION | Calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition).
DENSE_RANK() OVER (y) | BIGINT | Determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.

WINDOW FUNCTION | DATA TYPE OF RESULT | DEFINITION
NTILE(x) OVER (y) | INTEGER | Divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.
PERCENT_RANK() OVER (y) | DOUBLE PRECISION | Calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.
RANK() OVER (y) | BIGINT | Determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers (e.g., if two rows are ranked 1, the next rank is 3). The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.
ROW_NUMBER() OVER (y) | BIGINT | Determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers non-deterministically.
LAG(x) OVER (y) | Data type of x | Returns the value for the row before the current row in a partition. If no row exists, NULL is returned.
LEAD(x) OVER (y) | Data type of x | Returns the value for the row after the current row in a partition. If no row exists, null is returned.
FIRST_VALUE(x) OVER (y) | Data type of x | Returns the value of x with respect to the first row in the window frame.
LAST_VALUE(x) OVER (y) | Data type of x | Returns the value of x with respect to the last row in the window frame.

SQL NESTED FUNCTIONS

This section contains descriptions of the functions supported by Drill for manipulating arrays and nested data.

NESTED DATA FUNCTION | DATA TYPE OF RESULT | DEFINITION
FLATTEN(x) | - | Separates the elements in a repeated field x into individual rows; see Section 6.
KVGEN(x) | VARCHAR | Returns a list of the keys that exist in x; see Section 7.
REPEATED_COUNT(x) | INTEGER | Counts the number of values in x. x must be an array. See also Section 6.
REPEATED_CONTAINS(x,y) | BOOLEAN | Determines if the value y appears in the array x. y may contain the following regular expression wildcards: asterisk (*), period (.), question mark (?), square bracketed ranges [a-z], square bracketed characters [ch], and negated square bracketed ranges or characters [!ch]. See also Section 6.
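The ranking window functions described in the window function tables (RANK, DENSE_RANK, PERCENT_RANK, and CUME_DIST) differ mainly in how they treat ties. The Python sketch below computes all four for a single partition ordered by ascending value, using the formulas quoted in the definitions. It is an illustration of the arithmetic under that assumption. The function name ranking and the convention of returning 0.0 as the percent rank of a one-row partition are choices of this sketch.

```python
def ranking(values):
    """Return (value, rank, dense_rank, percent_rank, cume_dist) for each
    value of one partition, ordered ascending."""
    ordered = sorted(values)
    distinct = sorted(set(ordered))
    n = len(ordered)
    rows = []
    for value in ordered:
        rank = ordered.index(value) + 1          # ties share the lowest position
        dense_rank = distinct.index(value) + 1   # no gaps after ties
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        # rows preceding or peer with the current row / total rows
        cume_dist = sum(1 for other in ordered if other <= value) / n
        rows.append((value, rank, dense_rank, percent_rank, cume_dist))
    return rows


for row in ranking([30, 10, 20, 20]):
    print(row)
```

With the two tied values of 20, RANK jumps from 2 to 4 for the next row, while DENSE_RANK continues with 3.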

ABOUT THE AUTHOR

Rick van der Lans is an independent analyst, author, and internationally acclaimed lecturer and works for R20/Consultancy. He specializes in database technology, data warehousing, and big data. He has written several books on SQL. His popular book Introduction to SQL has been translated into numerous languages and sold more than 100,000 copies.

RESOURCES

Apache Drill Website
Apache Drill Documentation
Apache Drill Download
SQLLine

DZONE, INC.
150 PRESTON EXECUTIVE DR.
CARY, NC 27513


Copyright 2015 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

VERSION 1.0

$7.95