BigQuery Query Optimization With Troposphere PDF

Query Optimization in BigQuery
Advanced BigQuery
© 2018 Google LLC. All rights reserved.

Prerequisites
Some knowledge of BigQuery and/or SQL
(anyone who falls asleep will be taunted)

1 Query execution deep-dive
Agenda
2 Query optimization

Query execution deep-dive
BigQuery architecture

BigQuery architecture
Decoupled storage and compute
SQL:2011
Compliant
Replicated, Distributed BigQuery High-Available Cluster
Storage Compute
Streaming (99.9999999999% durability) (Dremel) REST API
Ingest
Distributed Web UI, CLI

Memory Shuffle
Tier
Client
Libraries
Free Bulk In 7
Loading Petabit Network languages

Dremel architecture
Master
Shard Shard Shard Shard

● Dynamic serving tree
Distributed storage ● Columnar storage

Query execution by example

Simple query execution
SELECT COUNT(*) FROM

Master
wikipedia_benchmark.Wiki1B
WHERE title LIKE "G%o%o"
Shard
Stage 2: Sum (1 slot)
Shard Shard Shard Shard Stage 1: Filter, Count (220 slots)
Distributed storage

Simple query execution - explain plan


Simple query execution - more data


wikipedia_benchmark.Wiki100
B

Aggregation

Aggregation with high cardinality
SELECT foo, COUNT(*) as

foo cnt
Z 16 cnt
A 10 FROM `...`
B 9
C 1
GROUP BY 1
ORDER BY 2 DESC
LIMIT 2
foo cnt foo cnt
A 3 A 7 ● Can’t discard “B” or “C” until after all previous stages
C 1 B 9 are complete.
Z 11 Z 5
● High cardinality “foo” will overwhelm master node.
Distributed storage

Aggregation with Shuffle
foo cnt
Z 16 SELECT foo, COUNT(*) as cnt
A 10 FROM `...`
GROUP BY 1
[A-M]
[N-Z]
ORDER BY 2 DESC
foo cnt LIMIT 2
foo cnt
A 10
Z 16
B 9
● Shuffle puts like values in the same node

foo cnt foo cnt
A 3 A 7 ● Scalable, since you never have to return more than
C 1 B 9 N from each node in middle tier
Z 11 Z 5
Distributed storage

Aggregation with Shuffle query plan
SELECT language,
MAX(views) as views
FROM
WHERE title LIKE "G%o%"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

Shuffle aggregation execution
SELECT language, MAX(views) as views
Master
FROM `wikipedia_benchmark.Wiki1B`
GROUP BY 1 ORDER BY 2 DESC LIMIT 100
Shard
Stage 3: SORT, LIMIT (1 slot)

Stage 2: GROUP BY, SORT, LIMIT (289
slots)
Shuffle
Shard Shard Shard Shard Stage 1: Partial GROUP BY (40,859 sinks)
Distributed storage

Multi-Shuffle query plan
SELECT language, SUM(views) as views

FROM (
SELECT title, language,
MAX(views) as views
FROM wikipedia_benchmark.Wiki100B
GROUP BY title, language
)
GROUP BY language
ORDER BY views desc
LIMIT 100

Shuffle overview
● Shuffle gets invoked whenever mapping from stage N to stage N+1 isn’t statically determined
● Super fast and extra magic
● Shuffle has quotas: shuffle too much and you spill to disk
● Shuffle quotas increase with reserved slots / fixed price reservations

JOINs

Large JOIN (Shuffle)
Master
Shard Shard Hash join
Independent shuffles
Shard Shard Shard Shard SELECT c.author.name a, c2.a m

FROM github_repos.commits c
JOIN (SELECT committer.name a, commit
FROM github_repos.commits) c2
Distributed storage
ON c.commit = c2.commit
LIMIT 1000

Shuffle JOIN query plan
SELECT
c.author.name a, c2.a m
JOIN (
SELECT
committer.name a, commit
LIMIT 1000

Small JOIN (Broadcast)
Master SELECT
c.author.name a, c2.a m
JOIN (
SELECT
committer.name a,
commit
Shard Shard Shard Left table
FROM github_repos.commits)
c2
Shard Right table ON
c.commit = c2.commit
WHERE c2.a = 'tom'
Distributed storage
LIMIT 1000

Broadcast JOIN query plan
SELECT c.author.name a,
c2.a m
JOIN (
SELECT
committer.name a, commit
WHERE c2.a = 'tom'
LIMIT 1000

Auto-awesome

Repartitioning
Master SELECT title FROM Wiki10B
GROUP BY title ORDER BY title
LIMIT 1000
Shard
Stage 2 (X shards)
Shard Shard Shard Shard Stage 2.1 (Y shards)
Stage 2.2 (Z shards)
Shuffle
Stage 1: Read
Distributed storage

Repartitioning query plan
SELECT title
FROM ....Wiki10B
GROUP BY title
ORDER BY title
LIMIT 1000

Skewed JOINs

Skewed JOIN query
SELECT l.language, r.language, COUNT(l.title), COUNT(l.ot)

FROM (
SELECT
IF(title like "%c%", "c", title) as title,
title as ot, language
FROM wikipedia_benchmark.Wiki1B)
JOIN (
SELECT title, language
FROM wikipedia_benchmark.Wiki1B
GROUP BY title, language) r
ON l.title = r.title
WHERE l.language <> r.language
GROUP BY 1, 2

Skewed JOIN query plan

Skewed JOIN
Master
Shard Shard One shard gets too much data
Independent shuffles

Can’t just apply the redispatch trick
because you’d need to reshuffle both sides
Distributed storage

Ordering

Large order-bys
foo {A, B, C, SELECT foo

D, F, G, H, FROM table
K, L, M, Q,
Z} ORDER BY foo
foo {A, D, F, foo {A, C, H, foo {B, K, L,

Master node needs to sort and store
G, Q} Z} M} all values
Distributed storage

Overloaded order-by query plan
SELECT title
FROM Wiki1B
ORDER BY title

Order-by and limit
foo {A, B, C} SELECT foo

drop {D, F, FROM table
H, K, L}
ORDER BY foo
LIMIT 3
foo {A, D, F}
foo {A, C, H} foo {B, K, L}
drop {G, Q}
drop {Z} drop {M} Can drop values over the limit at
each node
Distributed storage

Order-by and limit query plan
SELECT title
FROM Wiki1B
ORDER BY title
LIMIT 1000

Query optimization
Legacy SQL Standard SQL
#legacySQL #standardSQL
● Non-standard SQL dialect ● SQL 2011-compliant

○ Includes extensions for nested
● Typically slower execution because there and repeated data
is no query plan optimizer
● More “standard” syntax than BigQuery
● Supports table decorators LegacySQL
○ Planned but not implemented yet
for Standard SQL ● Has a query plan optimizer
○ Will push down predicates when it
can do so safely
● Force standard SQL with the

#standardSQL query prefix

Best practices for query structure

Optimization: necessary columns only
Original code Optimized Reasoning

select select Only select the columns necessary, especially in inner
* dim1,
queries. SELECT * is cost inefficient and may also hurt
from metric1
`dataset.table` from
performance.
`dataset.table`
If the number of columns to return is large, consider using
SELECT * EXCEPT to exclude unneeded columns.

Optimization: ORDER BY with LIMIT

select select Writing results for a query with an ORDER BY clause can
t.dim1, t.dim1,
result in Resources Exceeded errors. Because the final
t.dim2, t.dim2,
t.metric1 t.metric1
sorting must be done on a single slot, if you are
from from attempting to order a very large result set, the final
`dataset.table` t `dataset.table` t sorting can overwhelm the slot that is processing the
order by metric1 desc order by metric1 desc data.
limit 1000
If you are sorting a very large number of values, use a

LIMIT clause.

Best practices for JOINs

Optimization: WHERE, ASAP

select select WHERE clauses should be executed as soon as
t1.dim1, t1.dim1,
possible, especially within joins, so the tables to be
sum(t1.metric1) sum(t1.metric1)
from from
joined are as small as possible.
`dataset.table1` t1 `dataset.table1` t1
left join left join WHERE clauses may not always be necessary, as
`dataset.table2` t2 `dataset.table2` t2 standard SQL will do its best to push down filters.
on on
Review the explanation plan to see if filtering is
t1.dim1 = t2.dim1 t1.dim1 = t2.dim1
where ifnull(t2.dim2, 'def') = 'abc' where t2.dim2 = 'abc' happening as early as possible, and either fix the
group by 1; group by 1; condition or use a subquery to filter in advance.

Optimization: late aggregation

select select Aggregate as late and as seldom as possible,
t1.dim1, t1.dim1,
because aggregation is very costly.
sum(t1.m1) sum(t1.m1)
sum(t1.m2) sum(t2.m2)
from (select from (select BUT: If a table can be reduced drastically by
dim1, dim1, aggregation in preparation for being joined, then
sum(metric1) m1 metric1 m1 aggregate it early.
from `dataset.table1` group by 1) t1 from `dataset.table1`) t1
join (select join (select
dim1, dim1, Caution: With JOINS, this only works if the two
sum(metric2) m2 metric2 m2 tables are already aggregated to the same level
from `dataset.table2` group by 1) t2 from `dataset.table2`) t2 (i.e., if there is only one row for every join key
on t1.dim1 = t2.dim1 on t1.dim1 = t2.dim1
value).
group by 1; group by 1;

Optimization: JOIN pattern

select select When you create a query by using a JOIN, consider the
t1.dim1, t1.dim1,
order in which you are merging the data. The standard
sum(t1.metric1), sum(t1.metric1),
sum(t2.metric2) sum(t2.metric2)
SQL query optimizer can determine which table should
from from be on which side of the join, but it is still recommended
small_table t1 large_table t2 to order your joined tables appropriately.
join join
large_table t2 small_table t1
The best practice is to place the largest table first,
on on
t1.dim1 = t2.dim1 t1.dim1 = t2.dim1 followed by the smallest, and then by decreasing size.
where t1.dim1 = ‘abc’ where t1.dim1 = ‘abc’

Best practices for functions

Optimization: SQL UDFs

create temporary function create temporary function JavaScript UDFs are a performance
multiply(x INT64, y INT64) multiply(x INT64, y INT64)
killer because they have to spin up a
returns INT64 as
language js (x * y);
V8 subprocess evaluate.
as “””
return x * y; select multiply(2, 2) as result; Prefer SQL UDFs where possible.
“””;
select multiply(2, 2) as result;

Optimization: case-insensitive search

select select LOWER() and UPPER() operations have a
count(*) c count(*) c
hard time when dealing with Unicode text:
from from
`dataset.table` `dataset.table`
each character needs to be mapped
where where individually and they can also be multibytes.
lower(text) LIKE ‘%bigquery%’ OR regexp_contains(
lower(text) LIKE ‘%big query%’ text, '(?i)(bigquery|big query)') Use REGEX_MATCH() and add the case
insensitive (?i) modifier to your regular
expression to do case-insensitive searches.
Multiple search values over the same field
can be combined into a single regex.

Optimization: approximate functions

select select If the SQL aggregation function you're using
dim1, dim1,
has an equivalent approximation function, the
count(distinct dim2) approx_count_distinct(dim2)
from from
approximation function will yield faster query
`dataset.table` `dataset.table` performance.
Approximate functions produce a result which
is generally within 1% of the exact number.

Optimization: approximate functions
Size Error
1M 0.59%
10M 0.94%
100M 0.39%
1B 0.51%
10B 0.32%
100B 0.32%

Optimization: latest record

select select Using the ROW_NUMBER() function can
* except(rn) event.*
fail with Resources Exceeded errors as
from ( from (
select *, select array_agg(
data volume grows if there are too many
row_number() over( t order by t.created_at desc limit 1 elements to ORDER BY in a single
partition by id )[offset(0)] event partition.
order by created_at desc) rn from
from `dataset.table` t
Using ARRAY_AGG() in standard SQL
`dataset.table` t group by
) id allows the query to run more efficiently
where rn = 1 ) because the ORDER BY is allowed to
order by created_at order by created_at drop everything except the top record on
each GROUP BY.

Optimization: NTILE functions
select with QuantInfo AS ( Using NTILE() function can fail with
individual_id, select
Resources Exceeded errors as data
ntile(3) over o, qval
(order by sales desc) AS sales_third from unnest((
volume grows if there are too many
from select APPROX_QUANTILES(sales, 2) elements to ORDER BY in a single
`dataset.table` from `dataset.table` partition.
)) AS qval
with offset 0
Using APPROX_QUANTILES() in
),
select standard SQL allows the query to run
individual_id, more efficiently because it doesn’t
(select require a global ORDER BY for all rows in
MIN(o)
the table.
from QuantInfo
where sales <= QuantInfo.qval
) as sales_third
from `dataset.table`

BigQuery Query Optimization With Troposphere PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BigQuery Query Optimization With Troposphere PDF

Uploaded by

Copyright:

Available Formats

Query Optimization in BigQuery

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

Distributed Web UI, CLI

© 2018 Google LLC. All rights reserved.

Shard Shard Shard Shard

Shard Shard Shard Shard

Distributed storage ● Columnar storage

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

SELECT COUNT(*) FROM

Shard Shard Shard Shard Stage 1: Filter, Count (220 slots)

© 2018 Google LLC. All rights reserved.

SELECT COUNT(*) FROM

© 2018 Google LLC. All rights reserved.

SELECT COUNT(*) FROM

SELECT COUNT(*) FROM

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

SELECT foo, COUNT(*) as

● High cardinality “foo” will overwhelm master node.

© 2018 Google LLC. All rights reserved.

● Shuffle puts like values in the same node

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

Shard Shard Shard Shard

Shard Shard Shard Shard Stage 1: Partial GROUP BY (40,859 sinks)

© 2018 Google LLC. All rights reserved.

SELECT language, SUM(views) as views

© 2018 Google LLC. All rights reserved.

● Super fast and extra magic

● Shuffle quotas increase with reserved slots / fixed price reservations

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

Shard Shard Hash join

Shard Shard Shard Shard SELECT c.author.name a, c2.a m

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

SELECT l.language, r.language, COUNT(l.title), COUNT(l.ot)

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

Shard Shard One shard gets too much data

Shard Shard Shard Shard

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

foo {A, B, C, SELECT foo

foo {A, D, F, foo {A, C, H, foo {B, K, L,

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

foo {A, B, C} SELECT foo

© 2018 Google LLC. All rights reserved.

© 2018 Google LLC. All rights reserved.

● Non-standard SQL dialect ● SQL 2011-compliant

● Force standard SQL with the

© 2018 Google LLC. All rights reserved.